Statistical Physics Methods in Optimization and Machine Learning Notes
F. Krzakala and L. Zdeborová
March 9, 2024
Introduction
This set of lectures discusses probabilistic models, focusing on problems coming from
Statistics, Machine Learning and Constrained Optimization, while using tools and techniques
from Statistical Physics. The focus will be more theoretical than practical, so you have been
warned! Our goal is to show how some methods from Statistical Physics allow us to derive
precise answers to many mathematical questions. As pointed out by Archimedes, once these
answers are given, even if they are obtained through heuristic methods, it is a simpler task
(but still non-trivial) to prove them rigorously. Over the last few decades, there has been an
increasing convergence of interest and methods between Theoretical Physics and Applied
Mathematics, and many theoretical and applied works in Statistical Physics and Computer
Science have relied on a connection with the Statistical Physics of Spin Glasses. The aim of
this course is to present the background necessary for entering this fast-developing field.
At first glance, it may seem surprising that Physics has any connection with minimization and
probabilistic inference problems. The connection lies in the Gibbs (or Boltzmann) distribution,
the fundamental object in Statistical Mechanics. From the point of view of Statistics and
Optimization we will be interested in two types of problems: a) minimizing a cost function
and b) sampling from a distribution. In both cases, the Statistical Physics approach, or more
precisely the Boltzmann measure, turns out to be convenient.
Say that you have a "cost" function E(x) of x ∈ Rd . In statistical mechanics, one associates
a temperature-dependent "Boltzmann" probability measure to each possible value of x as
follows:
$$P_{\rm Boltzmann}(x) = \frac{1}{Z_N(\beta)}\, e^{-\beta E(x)},$$
where β = 1/T is called the inverse temperature, and
$$Z_N(\beta) = \int_{\mathbb{R}^d} dx\; e^{-\beta E(x)}$$
is called the partition function, or the partition sum. The introduction of the temperature is
very practical. For instance one can check that the limit β → ∞ allows us to study minimization
problems, since
$$\lim_{\beta\to\infty} \left(-\partial_\beta \log Z(\beta)\right) = \min_x E(x).$$
One can also obtain the number of minima by computing $\lim_{\beta\to\infty} Z(\beta)$.
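To see the first claim, note that, under mild assumptions on E, the derivative of the log-partition function is the thermal average of the energy, which concentrates on the minimum as β grows (a quick check):
$$-\partial_\beta \log Z(\beta) = \frac{\int_{\mathbb{R}^d} dx\; E(x)\, e^{-\beta E(x)}}{\int_{\mathbb{R}^d} dx\; e^{-\beta E(x)}} = \langle E(x)\rangle_\beta \;\xrightarrow[\beta\to\infty]{}\; \min_x E(x).$$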
The formalism is equally interesting for sampling problems. A typical problem that arises
in Statistical Inference, Information Theory, Signal Processing or Machine Learning is the
following: let X be an unknown quantity that we would like to infer, but which we don’t have
access to. Instead, we are given access to a quantity Y , related to X (usually a noisy version of
X). For concreteness, assume that X is simply a scalar Gaussian variable with zero mean and
unit variance, and that Y = X + Z for another standard Gaussian variable Z. Given Y , what
can we say about X? In other words: what is the probability of X given our measurement
Y? The quantity PX|Y (x, y) is called the posterior distribution of X given Y. Bayes' formula
famously states that
$$P_{X|Y}(x, y) = \frac{P_{Y|X}(x, y)\, P_X(x)}{P_Y(y)},$$
so that the posterior PX|Y (x, y) is given by the product of the prior probability on X, PX (x),
times the likelihood PY|X (x, y), divided by the evidence PY (y), which is just a normalization
constant (in the sense that it is not a function of X, the random variable of interest).
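For the scalar Gaussian example above, the posterior can be worked out explicitly by completing the square (a quick illustrative check):
$$P_{X|Y}(x, y) \propto P_{Y|X}(x, y)\, P_X(x) \propto e^{-\frac{(y-x)^2}{2}}\, e^{-\frac{x^2}{2}} \propto e^{-\left(x - \frac{y}{2}\right)^2},$$
so the posterior is a Gaussian with mean y/2 and variance 1/2, and the evidence PY (y) is the corresponding normalization constant.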
Thus, the evidence PY is simply the partition sum. This simple rewriting is behind the
popularity of Statistical Mechanics language in Bayesian inference. Indeed, many words in the
vocabulary of the Machine Learning community are borrowed directly from Physics (such as
"energy-based model", "free energy", "mean-field", "Boltzmann machine"...).
In what follows, we shall be interested in the accuracy of the resulting estimates, for instance
the mean squared error over a given set of models. We will thus need to apply the methods
from Statistical Physics, not only to derive the posterior distribution and the partition sum, but
also to take averages over many models or many realizations, in order to determine the typical
behavior. For example, we would like to access information-theoretical quantities such as the
entropy. Computing the partition sum Z is already difficult, but computing such averages is
notoriously even harder.
Conveniently, there is a part of Statistical Physics that focuses exactly on this task: the field of
disordered systems and spin glasses. Spin glasses are magnets in which the interaction strength
between each pair of particles is random. Starting in the late 70s with the groundbreaking
work of Sir Sam Edwards and Nobel prize winner Philip W. Anderson, the Statistical Physics of
disordered systems and spin glasses has grown into a versatile theory, with powerful heuristic
tools such as the replica and the cavity methods. In itself, the idea of using Statistical Physics
methods to study some problems in computer science is not new. It was the inspiration, for
instance, behind the creation of Simulated Annealing. Anderson, on the one hand, and Parisi
and Mézard on the other, used this connection back in 1986 to study optimisation problems,
and since then many have applied it with great success to study a large variety of problems in
optimization (random satisfiability & coloring), error correcting codes, inference and machine
learning.
In the lectures, we wish to address these questions with an interdisciplinary approach, lever-
aging tools from mathematical physics and statistical mechanics, but also from information
theory and optimization. Our modelling strategy and the analysis originate in studies of
phase transitions for models of Condensed Matter Physics. Yet, most of our objectives and
applications belong to the fields of Machine Learning, Computer Science, and Statistical Data
Processing.
Notations
We shall use probabilistic notations: random variables are uppercase, and a particular value is
lowercase. For instance, we will speak of the probability P(X = x) that the random variable
X takes the value x.
=: definition of a new quantity
≍ asymptotically equal, used for large deviations
The world was so recent that many things lacked names, and in
order to indicate them it was necessary to point. Every year
during the month of March a family of ragged gypsies would set
up their tents near the village, and with a great uproar of pipes
and kettledrums they would display new inventions. First they
brought the magnet.
The Curie-Weiss model is a simple model for ferromagnetism. The essential phenomenon
associated with a ferromagnet is that below a certain critical temperature, a magnetization will
spontaneously appear in the absence of an external magnetic field. The Curie-Weiss model is
simple enough that all the thermodynamic functions characterising its macroscopic properties
can be computed exactly. And yet, it is rich enough to capture the basic phenomenology
of phase transitions — namely the transition between a disordered paramagnetic phase (not
magnetized) and an ordered ferromagnetic phase (magnetized). Because of its simplicity and
because of the correctness of at least some of its predictions, the Curie-Weiss model occupies
an important place in the Statistical Mechanics literature.
In this model, the magnetic moments are encoded by N microscopic spin variables Si ∈
{−1, +1} for i = 1, . . . , N . Every magnetic moment Si interacts with every other magnetic
moment Sj for j ̸= i. Ferromagnetism can be modelled by a collective alignment of the
magnetic moments in the same direction. Therefore, to encourage a ferromagnetic phase, we
add an energy cost for spins which are not aligned. In its simplest flavour, this cost takes
the form of a two-body interaction −Si Sj . The total cost function associated with a given
configuration of spins S ∈ {−1, +1}N , also known in Physics as the Hamiltonian, is given by:
$$H_N^0(S) = -\frac{1}{2N}\sum_{ij} S_i S_j.$$
It is also convenient to add an external magnetic field h, which favours Si = +1 for h > 0 and Si = −1 for h < 0, so we shall work with the Hamiltonian:
$$H_N(S) = H_N^0(S) - h\sum_i S_i = -\frac{1}{2N}\sum_{ij} S_i S_j - h\sum_i S_i. \qquad (1.1)$$
The probability of finding the system at the configuration s ∈ {−1, +1}N is given by the
Boltzmann measure1 :
$$P_{N,\beta,h}(S = s) = \frac{e^{-\beta H_N(s)}}{Z_N(\beta, h)},$$
where β = T −1 ≥ 0 is the inverse temperature. Note that for β > 0 the Boltzmann measure
gives more weight to configurations with lower cost or energy. In particular, when β → ∞ (or
equivalently T = 0) it concentrates around configurations that minimise H. In the opposite
limit, when β = 0 (or equivalently T → ∞) it assigns equal weight to every configuration,
yielding the uniform measure. The normalization of the Boltzmann measure plays a very
important role in Statistical Physics, and is known as the partition sum or partition function:
$$Z_N(\beta, h) = \sum_{S\in\{-1,+1\}^N} e^{\frac{\beta}{2N}\sum_{ij} S_i S_j + \beta h \sum_i S_i}. \qquad (1.2)$$
As we will show next, the partition sum is closely related to the thermodynamic functions that
characterize the macroscopic properties of the model.
Let us introduce the magnetization per spin, S̄ = (1/N) Σᵢ Sᵢ. Note that S̄ is the empirical average of the spins, and therefore is itself a random variable. It is
interesting to note that the Hamiltonian of the Curie-Weiss model is actually only a function
of the magnetization:
$$H(\bar S) = -N\left(\frac{1}{2}\bar S^2 + h\bar S\right),$$
making it explicit that it is an extensive quantity H(S̄) ∝ N . And each time we flip a single
spin from −1 to +1, the magnetization per spin increases by 2/N . Therefore, what is the
probability that the magnetization per spin takes a particular value in the set SN = {−1, −1 +
2/N, −1 + 4/N, . . . , 1}? According to the Boltzmann measure, this is given by:
$$P(\bar S = m) = \frac{\Omega(m, N)}{Z_N(\beta, h)}\, e^{\beta N\left(\frac12 m^2 + hm\right)}$$
¹ Also known as Gibbs measure or Gibbs-Boltzmann measure.
Figure 1.1: Binary entropy H(m) defined in equation 1.3 as a function of the magnetization m.
Here Ω(m, N) is the number of spin configurations with magnetization per spin equal to m:
$$\Omega(m, N) = \frac{N!}{\left(\frac{N + N m}{2}\right)!\,\left(\frac{N - N m}{2}\right)!}$$
The expression above is different from the usual binomial distribution because the Si take
their values in {−1, 1} instead of {0, 1}. At first glance it is not very friendly, but with some
work refining Stirling’s approximation (see exercise 1.1), one can show that
$$\frac{e^{N H(m)}}{N+1} \;\le\; \Omega(m, N) \;\le\; e^{N H(m)},$$
with H(m), often called the binary entropy, given by
$$H(m) = -\frac{1+m}{2}\log\frac{1+m}{2} - \frac{1-m}{2}\log\frac{1-m}{2}. \qquad (1.3)$$
Note that the binary entropy is usually defined with $\log_2$ in Information Theory; we shall stick to the natural logarithm here.
We thus reach our first result: defining the potential
$$\phi(m) =: \beta\left(\frac{m^2}{2} + hm\right) + H(m),$$
we have
$$\frac{1}{N+1}\,\frac{e^{N\phi(m)}}{Z_N(\beta, h)} \;\le\; P(\bar S = m) \;\le\; \frac{e^{N\phi(m)}}{Z_N(\beta, h)}. \qquad (1.4)$$
One can also compute, and bound, the value of ZN (β, h). Indeed, summing over m on both
sides of the right part of equation 1.4 one reaches
$$1 \;\le\; \sum_m \frac{e^{N\phi(m)}}{Z_N(\beta, h)} \;\le\; (N+1)\,\frac{e^{N\phi(m^*)}}{Z_N(\beta, h)}, \qquad (1.5)$$
where we have defined the value m∗ ∈ [−1, 1] that maximizes ϕ(m) (note that m∗ depends on
β and h). Similarly, the left part of equation 1.4, combined with P(S̄ = m) ≤ 1, gives
$$\frac{1}{N+1}\,\frac{e^{N\phi(m)}}{Z_N(\beta, h)} \;\le\; P(\bar S = m) \;\le\; 1.$$
Therefore, taking the logarithm on both sides:
$$-\frac{\log(N+1)}{N} + \phi(m) \;\le\; \frac{\log Z_N(\beta, h)}{N} \;\le\; \frac{\log(N+1)}{N} + \phi(m^*).$$
This is true for all the discrete values m ∈ SN, and in particular for the value m_max that
maximizes ϕ(m) over this set. It is easy to see that maximizing over [−1, 1] instead of SN does
not change the result substantially, since ϕ(m_max) > ϕ(m∗) − log N/N.² Therefore, for N large
enough we finally obtain
$$\Phi(\beta, h) =: \lim_{N\to\infty}\frac{\log Z_N(\beta, h)}{N} = \phi(m^*) = \max_{m\in[-1,1]}\phi(m).$$
Additionally,
$$\lim_{N\to\infty}\frac{\log P(\bar S = m)}{N} = \phi(m) - \phi(m^*). \qquad (1.8)$$
Note that this result is quite remarkable: we have turned the seemingly impossible sum over
$2^N$ states of equation 1.2 into a simple maximization of a one-dimensional potential function
ϕ(m). In particular, equation 1.8 shows that in the thermodynamic limit N → ∞ the potential
ϕ(m) fully characterizes the probability of finding the system at a given macroscopic state
S̄ = m. More importantly, we have done this without any approximation. All the steps are
rigorous! This is, of course, thanks to the simplicity of the Curie-Weiss model.
From this exact solution, we can now analyze the phenomenology of the Curie-Weiss model.
The extremization ϕ′(m) = 0 leads to the condition
$$\frac12\log\frac{1+m}{1-m} = \beta(m + h),$$
which, using $\tanh^{-1}(x) = \frac12\log\frac{1+x}{1-x}$, gives the so-called Curie-Weiss mean-field or saddle point
equation:
$$m = \tanh\left(\beta(h + m)\right). \qquad (1.9)$$
² The mean value theorem applied between m∗ and m_max yields a c ∈ [m∗, m_max] which satisfies ϕ(m∗) = ϕ(m_max) + ϕ′(c)(m∗ − m_max), and such that |m∗ − m_max| ≤ 2/N. Since the difference is K/N for some constant K, it will eventually be smaller than log N/N.
Figure 1.2: (Left) Right-hand side of equation 1.9 as a function of m, for fixed h = 0.5 and
different values of the inverse temperature (β solid lines). Solutions of equation 1.9 (dots)
are given by the intersection of f (m) = tanh(β(h + m)) with the line f (m) = m (red dashed).
(Right) Same picture in terms of the potential ϕ(m), where the solutions of equation 1.9
correspond to the global maximum of ϕ(m). Note that for β ≫ 1 an unstable solution
corresponding to a minimum of ϕ appears.
Figure 1.3: (Left) Fixed point m⋆ of the mean-field equation m = tanh(βm) above the critical
temperature (h = 0, β = 1.5) as a function of the inverse temperature β. Depending on the
sign of the initialisation m0 , we reach one of the two global maxima of ϕ(m) (right).
Figure 1.2 shows the right-hand side of the mean-field equation equation 1.9 for a fixed
external field h and different values of the inverse temperature β. The solution m∗ of these
self-consistent equations is given by the intersection of f (m) = tanh(β(m + h)) with the line
f (m) = m. Depending on the value of the parameters, there can be up to three solutions.
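In practice, the stable solutions can be found by simply iterating the self-consistent equation from different starting points; the unstable solution, being a repeller of this iteration, is not reached this way. A minimal sketch (the parameter values and starting points below are arbitrary choices):

```python
# Solve m = tanh(beta * (m + h)) by fixed-point iteration from several starts.
import numpy as np

def solve_mean_field(beta, h, m0, n_iter=1000):
    m = m0
    for _ in range(n_iter):
        m = np.tanh(beta * (m + h))
    return m

beta, h = 2.0, 0.5
for m0 in (-1.0, 0.0, 1.0):
    print(f"m0 = {m0:+.1f}  ->  m* = {solve_mean_field(beta, h, m0):+.4f}")
```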
The property $\frac{1}{N}\log P(\bar S = m) \to \phi(m) - \phi(m^*)$ is usually called a Large Deviation Principle in
mathematics. It basically tells us that the probability that S̄ takes any other value than m∗ is
exponentially small, i.e. a very rare event. In a nutshell, we can write that the probability to
find the system at a given value of m is approximately:
$$P(\bar S = m) \underset{N\to\infty}{\asymp} e^{N\left(\phi(m) - \phi(m^*)\right)},$$
where we have used the symbol ≍ to denote an equality valid asymptotically in N. If the
maximum is unique, then the magnetization is found to be equal to m∗ with probability one.
This convergence in probability is called a concentration phenomenon in probability theory. For
Figure 1.4: Same setting as Fig. 1.2, but for zero external field h = 0. For β < 1, the potential
ϕ(m) has only one maximum corresponding to a disordered phase m⋆ = 0. For β > 1,
the system has two ordered ferromagnetic phases corresponding to the emergence of two
symmetric global maxima ±m⋆ .
the physicist, it means that "macroscopic" quantities such as the magnetization are entirely
deterministic, as their random fluctuations around the mean are negligible: this concentration
of the measure is at the root of the success of Statistical Mechanics.
In the Curie-Weiss model, however, if h = 0 and β > 1, then the global maximum is not unique:
there are two degenerate maxima at ±m∗ . This means that if one samples a configuration
from the Boltzmann measure, then with a probability 1/2 the configuration will have a mag-
netization ±m∗ , see Figs. 1.3 and 1.4. This situation is called phase coexistence in Physics, and is
a fundamental property of liquids, solids and gases in nature. Phase coexistence arises only
for h = 0 and β > 1 in the Curie-Weiss model. Indeed, as shown in Fig. 1.2, for h ̸= 0 there
might be more than one solution to the mean-field equations, but there is only one global
maximum corresponding to a single phase, also known as a single Gibbs state.
In more involved models one cannot hope to directly access the statistical distribution of
relevant quantities with a direct computation, as we did in Theorem 1. However, we should
expect to find similar phenomenology: concentration of macroscopic quantities in the thermo-
dynamic limit, a large deviation principle for the Boltzmann measure, single phase vs phase
coexistence, and of course, phase transitions. It turns out that almost all these phenomena can
be understood from the computation of the asymptotic free entropy density Φ(β, h). This is
why computing this quantity is the single most important analytical task in Statistical Physics.
First, a piece of trivia: most physicists do not use the free entropy or the free entropy density
— for reasons rooted in the history of thermodynamics, dating back to Carnot and Clausius —
but rather the free energy and the corresponding free energy density, which are simply the free
entropy and the free entropy density multiplied by −1/β.
The fact that physicists (since Clausius) use the free energy with a factor −β −1 in front of the
log seems to be a notational problem for many mathematicians, who just cannot understand
why they should bother with a trivial minus sign. Thus, many of them simply refer to Φ(β, h) as
the free energy density (or worse, sometimes using the terminology "pressure" from the theory
of gases), which should make Clausius turn in his grave. It is also common for mathematicians
to define the Hamiltonian with a global minus sign with respect to the one used by Physicists.
Since this monograph is not concerned with actual applications in Physics, we might forgive
these bad habits. Nevertheless, we will attempt to use the correct terminology, so that we shall
not, for instance, "maximize" the energy and "minimize" the entropy!
We now discuss how knowing the free entropy actually allows one to rediscover all the
phenomena we have discussed. First, we notice that for any finite value of N , the free entropy
is a generating functional for the (connected) moments of the magnetization S̄. Denoting by
⟨·⟩N the average with respect to the Boltzmann measure, and recalling that
$$Z_N(\beta, h) = \sum_{S\in\{-1,1\}^N} e^{-\beta H_N^0 + \beta N h \bar S},$$
we have:
$$\frac{1}{\beta}\frac{\partial}{\partial h}\Phi_N(\beta, h) = \frac{1}{\beta N}\frac{\partial}{\partial h}\log Z_N(\beta, h) = \sum_{S\in\{-1,1\}^N}\frac{\bar S\, e^{-\beta H_N}}{Z_N(\beta, h)} = \langle \bar S\rangle_N = m_N, \qquad (1.10)$$
This also shows why it is useful to introduce an external magnetic field h at the beginning of
our derivations: we can obtain the moments of the magnetization by taking derivatives with
respect to h (the second derivative yields the variance, etc.). However, it is far from trivial that
this relation holds when the limit N → ∞ is taken: the mathematical conditions for switching
the limit and the derivative are non-trivial. Indeed, this can only be done away from the phase
transition, in which case it follows from the convexity of the free entropy. Considering the
second derivative with respect to h, we find:
$$\frac{1}{\beta}\frac{\partial^2}{\partial h^2}\Phi_N(\beta, h) = \frac{\partial}{\partial h}\sum_{S\in\{-1,1\}^N}\frac{\bar S\, e^{-\beta H_N}}{Z_N(\beta, h)} = N\beta\left(\langle\bar S^2\rangle_N - \langle\bar S\rangle_N^2\right) \ge 0. \qquad (1.11)$$
Therefore ΦN is convex. This result is also known in the statistical physics context as the
fluctuation-dissipation theorem. A fundamental theorem on the limit of a sequence of convex
functions fn as n → ∞ tells us that if fn (x) → f (x) for all x, and if fn (x) is convex, then
fn′ (x) → f ′ (x) for all x where f (x) is differentiable. Out of the phase transition points, where
the free entropy is singular, the derivative of the asymptotic free entropy thus yields the
asymptotic magnetization. Let us check that this is true. We know that Φ(β, h) = ϕ(m∗ (h, β)),
therefore we write:
$$\frac{1}{\beta}\frac{\partial}{\partial h}\Phi(\beta, h) = m^*(h, \beta) + \frac{1}{\beta}\left.\frac{\partial\phi}{\partial m}\right|_{m^*}\frac{\partial m^*}{\partial h} = m^*(h, \beta), \qquad (1.12)$$
given that the derivative of ϕ(m) is zero when evaluated at m∗ . The derivative of the free
entropy has thus given us the equilibrium magnetization m∗ , as it should.
Another instructive way to look at the problem arises using two important mathematical facts
coming from prominent French mathematicians: Laplace and Legendre. First, let’s state the
very useful Laplace method for computing integrals.
Theorem 2 (Laplace method). Suppose f(x) is a twice continuously differentiable function on [a, b]
and there exists a unique point x₀ ∈ (a, b) such that f(x₀) = max_{x∈[a,b]} f(x), with f″(x₀) < 0. Then:
$$\lim_{n\to\infty}\frac{\int_a^b e^{n f(x)}\,dx}{e^{n f(x_0)}\sqrt{\frac{2\pi}{n(-f''(x_0))}}} = 1,$$
and in particular
$$\lim_{n\to\infty}\frac{1}{n}\log\int_a^b e^{n f(x)}\,dx = f(x_0).$$
Moreover, for a continuous function g(x),
$$\lim_{n\to\infty}\frac{\int_a^b g(x)\, e^{n f(x)}\,dx}{\int_a^b e^{n f(x)}\,dx} = g(x_0). \qquad (1.13)$$
These formulas, proven by Laplace in his fundamental text "Mémoire sur la probabilité des
causes par les évènements" in 1774, have profound implications when combined with the
large deviation principle.
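As a quick numerical sanity check of Theorem 2, one can compare the integral with its Laplace prediction on a toy example, here f(x) = −x² on [−1, 2] (an illustrative sketch; the example, grid and sizes are arbitrary choices):

```python
import numpy as np

def laplace_ratio(n, a=-1.0, b=2.0):
    # f(x) = -x^2 has its unique maximum at x0 = 0 on [a, b], with f''(x0) = -2
    x = np.linspace(a, b, 200001)
    dx = x[1] - x[0]
    integral = np.sum(np.exp(-n * x**2)) * dx          # Riemann sum of e^{n f(x)}
    asymptotic = np.sqrt(2 * np.pi / (n * 2.0))        # e^{n f(x0)} sqrt(2 pi / (n |f''|))
    return integral / asymptotic

for n in (10, 100, 1000):
    print(f"n = {n:5d}   integral / Laplace prediction = {laplace_ratio(n):.4f}")
```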
Consider the Curie-Weiss model with zero field (h = 0). By definition, we have
$$P(\bar S = m; h = 0) = \frac{\sum_{S\in\{-1,1\}^N} e^{-\beta H_N^0}\,\mathbb{1}(\bar S = m)}{Z_N(\beta, h = 0)} = \frac{\sum_{S\in\{-1,1\}^N} e^{-\beta H_N^0}\,\mathbb{1}(\bar S = m)}{e^{N\Phi_N(\beta, 0)}} \underset{N\to\infty}{\asymp} e^{-N I_0^*(m)},$$
where we have denoted I0∗ (m) the true large deviation rate at zero external field. Simply
by assuming this large deviation principle, we can reach deep conclusions even if we do not
know the actual expression of I0∗ (m). Indeed, we can write the total partition sum of the
system in the presence of an external field as a Laplace integral over the possible values of m. Using
H_N = H_N^0 − N h S̄ we write:
$$Z_N(\beta, h) = \sum_{m\in S_N}\ \sum_{S\in\{-1,1\}^N} e^{-\beta H_N^0}\,\mathbb{1}(\bar S = m)\; e^{N\beta h m} \underset{N\to\infty}{\asymp} \int_{-1}^{1} dm\; e^{N\left(\Phi(\beta, 0) - I_0^*(m, \beta) + \beta h m\right)}. \qquad (1.14)$$
At this point the Laplace method applied to the limit of Z_N(β, h) gives us automatically that
$$\Phi(\beta, h) = \Phi(\beta, 0) + \max_{m\in[-1,1]}\left[\beta h m - I_0^*(m)\right].$$
We thus obtain a very generic relation between the free entropy of the system in a field and the
large deviation rate (without field) I_0^*(m): they are related through a Legendre transform.
In fact, the theory of Legendre transforms tells us slightly more: if we further take the Legendre
transform of Φ(β, h) we can (almost) recover the true rate I_0^*(m). Let us define
$$I_0(m) =: \sup_{h\in\mathbb{R}}\left[\beta h m - \Phi(\beta, h) + \Phi(\beta, 0)\right];$$
then a fundamental property of the Legendre transform reveals that I0 (m) is the convex
envelope of I0∗ (m). We thus can recover the large deviation rate even simply by "Legendre-
transforming" the free entropy. Again, we see that everything can be computed through the
knowledge of the free entropy Φ(β, h). Truly, the free entropy is all you need.
Note however that there is a fundamental limitation to the ability to compute large deviation
rates with this technique. Given that we can only recover the convex envelope of the true rate,
if the true rate is not convex there is a part of the curve that we cannot obtain! This is
illustrated in Figure 1.5: we only get the part of I_0^*(m) that coincides with the convex envelope
I_0(m) (this set is called the "exposed points" of I_0^*(m)), while for the other points we only obtain
an upper bound. These considerations are classical in statistical mechanics, and are at the basis
of the "equivalence of ensembles" as well as of the derivation of thermodynamics (which is
really nothing but Legendre transforms).
What we have said above can also be written rigorously using the language of modern large
deviation theory. In particular, the Gartner-Ellis Theorem connects the large deviation rates
with the Legendre transform of the partition sum in a very generic way:
Theorem 3 (Gartner-Ellis, informal). If
$$\lambda(k) = \lim_{N\to\infty}\frac{1}{N}\log\mathbb{E}\left(\exp(N k A_N)\right)$$
exists and is differentiable for all k ∈ R, then, defining the Legendre transform I(a) = sup_{k∈R} [k a − λ(k)],
the sequence A_N satisfies a large deviation principle with rate I(a), i.e. P(A_N ≈ a) ≍ e^{−N I(a)}.
Figure 1.5: The true large deviation rate I(m), and the function I ∗ (m) obtained after taking two
consecutive Legendre transforms. I ∗ (m) is the convex envelope of I(m). The two functions
coincide for all the "exposed points", where the tangent is different from the curve, but they
differ in the dashed region, which is not "exposed". Legendre transforms only allow us to
compute sharp large deviation rates for the exposed points, and upper bounds otherwise.
In the Curie-Weiss model, the connection with our former derivation is immediate: consider
the model without a field (i.e. h = 0) and let AN = S̄ be the averaged magnetization in the
system. The theorem (which in this particular case is called Cramer’s theorem) thus tells us
that we need to compute:
$$\lambda(k) = \lim_{N\to\infty}\frac{1}{N}\log\frac{1}{Z_N}\sum_S e^{-\beta H_N^0 + N k \bar S} = \lim_{N\to\infty}\left(\Phi_N(\beta, h = k/\beta) - \Phi_N(\beta, h = 0)\right),$$
while the rate function for the magnetization (at zero field) is given by
$$I(m) = \max_k\left[k m - \lambda(k)\right].$$
Given the property of the Legendre transform, I(m) is indeed given by the convex envelope
of ϕ(m, 0), as expected.
One cannot hope that the computation of the partition sum will always be so easy, so it is
worth learning a few tricks. A fundamental result in statistical physics, which is at the root
of many analytical and practical approaches, and which has been used extensively in machine
learning as well, is the following one:
Theorem 4 (Gibbs variational approach). Consider a Hamiltonian HN (x) with x ∈ RN , and the
associated Boltzmann-Gibbs measure PGibbs (x) = e−βHN (x) /ZN (β). Given an arbitrary probability
distribution Q(x) over R^N and its entropy S[Q] = −⟨log Q⟩_Q, let the Gibbs functional be
$$N\phi_{\rm Gibbs}(Q) =: S[Q] - \beta\langle H_N\rangle_Q,$$
where ⟨·⟩_Q denotes the expectation with respect to the distribution Q. Then
$$\forall Q, \qquad \Phi_N(\beta) = \frac{1}{N}\log Z_N(\beta) \ge \phi_{\rm Gibbs}(Q),$$
with equality when Q = P_Gibbs.
Let us first prove a simpler result. We shall introduce the following quantity, known as
the Kullback-Leibler divergence (or "relative entropy"), which measures how much two
distributions P(x) and Q(x) differ:
$$D_{\rm KL}(P\|Q) =: \int dx\; P(x)\log\frac{P(x)}{Q(x)}.$$
We can prove the following lemma:
Lemma 3 (Gibbs inequality).
DKL (P||Q) ≥ 0
with equality if and only if P = Q almost everywhere.
We then write the difference between the Gibbs free entropy and the actual entropy using the
Kullback-Leibler divergence:
Lemma 4. Denoting the Boltzmann-Gibbs probability distribution as PGibbs (x) = e−βH(x) /ZN , and
an arbitrary distribution as Q, we have
$$N\Phi_N = N\phi_{\rm Gibbs}(Q) + D_{\rm KL}(Q\|P_{\rm Gibbs}).$$
Together, lemma 3 and lemma 4 imply theorem 4. Why is this interesting? Basically, it gives
us a way to approximate the partition sum (and the true distribution) by using many "trial"
distributions, and picking the one with the largest free entropy. This has been used in
countless ways since the birth of statistical and quantum physics, under many names
(for instance "Gibbs-Bogoliubov-Feynman"), and it is at the root of many applications of
Bayesian Statistics and machine learning as well, in which case the Gibbs free entropy is called
Evidence Lower BOund, or ELBO in short.
Let us see how it can be used for the Curie-Weiss model. The simplest thing we could try is a
factorized distribution, identical for all spins:
$$Q(S) = \prod_i Q_i(S_i) = \prod_i\left[\frac{1+m}{2}\,\delta(S_i - 1) + \frac{1-m}{2}\,\delta(S_i + 1)\right]. \qquad (1.15)$$
A direct computation of the Gibbs functional for this ansatz gives, up to corrections vanishing as N → ∞,
$$\phi_{\rm Gibbs}(Q) = \frac{\beta}{2}m^2 + \beta h m + H(m) = \phi(m),$$
with H(m) the binary entropy, and we find back the correct free entropy — at any fixed m —
through this simple variational ansatz.
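As a small numerical illustration of Theorem 4, one can compare the best factorized bound with the exact free entropy of a small system computed by brute force (a sketch; the system size, the grid over m and the function names are arbitrary choices):

```python
# Minimal check of the Gibbs variational bound on the Curie-Weiss model with the
# factorized ansatz of equation 1.15. Sizes and parameters are illustrative.
import itertools
import numpy as np

def exact_free_entropy(N, beta, h):
    """Exact Phi_N = (1/N) log Z_N by enumerating the 2^N configurations."""
    logZ_terms = []
    for spins in itertools.product([-1, 1], repeat=N):
        S = np.array(spins)
        # Hamiltonian of eq. (1.1): H_N = -(1/2N) sum_{ij} S_i S_j - h sum_i S_i
        H = -(S.sum() ** 2) / (2 * N) - h * S.sum()
        logZ_terms.append(-beta * H)
    return np.logaddexp.reduce(logZ_terms) / N

def gibbs_functional(m, N, beta, h):
    """phi_Gibbs(Q_m) for the product measure of eq. (1.15), kept exact in N."""
    Hm = -(1 + m) / 2 * np.log((1 + m) / 2) - (1 - m) / 2 * np.log((1 - m) / 2)
    # <H_N>_Q = -N m^2/2 - (1 - m^2)/2 - h N m  (the (1-m^2)/2 piece comes from i = j)
    mean_H = -N * m**2 / 2 - (1 - m**2) / 2 - h * N * m
    return Hm - beta * mean_H / N

beta, h, N = 1.5, 0.1, 12
bound = max(gibbs_functional(m, N, beta, h) for m in np.linspace(-0.99, 0.99, 199))
exact = exact_free_entropy(N, beta, h)
print(f"best variational bound: {bound:.4f} <= exact Phi_N: {exact:.4f}")
```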
Here, we are going to see another important method: the cavity trick. It will be at the root of
many of our computations in the course. The whole idea is based on the following question:
what happens when you add one variable to the system? Physicists like imagery: one could
visualize a system of N variables, making a little "hole" or "cavity" in it, and delicately adding
one variable, hence the name "cavity method".
Let us see how it works by comparing the two Hamiltonians with N and N + 1 variables.
Denoting the new spin as the number "0" we have, for a system at inverse temperature β ′ and
field h′ :
$$-\beta' H_{N+1} = \beta'\,\frac{N+1}{2}\left(\frac{S_0 + \sum_i S_i}{N+1}\right)^{\!2} + \beta' h'\left(S_0 + \sum_i S_i\right)$$
$$= \beta'\,\frac{N^2}{2(N+1)}\left(\frac{\sum_i S_i}{N}\right)^{\!2} + \frac{\beta'}{2(N+1)} + \beta'\,\frac{N}{N+1}\,S_0\,\frac{\sum_i S_i}{N} + \beta' h'\sum_i S_i + \beta' h' S_0 \qquad (1.18)$$
If we define β ′ = β(N + 1)/N , and our new field h′ as h′ = hN/(N + 1), we get
$$-\beta' H_{N+1}(h') = \text{cst} + \frac{\beta N}{2}\left(\frac{\sum_i S_i}{N}\right)^{\!2} + \beta S_0\,\frac{\sum_i S_i}{N} + \beta h\sum_i S_i + \beta h S_0$$
$$= \text{cst} - \beta H_N + \beta S_0\,\frac{\sum_i S_i}{N} + \beta h S_0.$$
We thus have two systems: one with N +1 spins at temperature and fields (β ′ , h′ ) and one with
N spins at temperature and fields (β, h). The relation we just derived makes the expectation
over the N + 1 variable easy to compute in the new system, as a function of the sum in the old
system. In fact, one can directly write the expectation of the new variable as follows:
$$\langle S_0\rangle_{N+1,\beta'} = \frac{\sum_{S_0,S} S_0\, e^{-\beta' H'_{N+1}}}{\sum_{S_0,S} e^{-\beta' H'_{N+1}}} = \frac{\sum_S\sum_{S_0} S_0\, e^{-\beta H_N + \beta S_0 \bar S + \beta h S_0}}{\sum_S\sum_{S_0} e^{-\beta H_N + \beta S_0 \bar S + \beta h S_0}} = \frac{\left\langle\sinh\left(\beta(\bar S + h)\right)\right\rangle_{N,\beta}}{\left\langle\cosh\left(\beta(\bar S + h)\right)\right\rangle_{N,\beta}}.$$
Assuming, as a physicist would do, that S̄ concentrates on a deterministic value m∗ (at least
away from the phase co-existence line), and assuming that m∗ should be the same for the N and N + 1
systems when N is large enough (as it should), we have immediately:
$$\langle S_0\rangle_{N+1,\beta'} = \frac{\sinh\left(\beta(m^* + h)\right)}{\cosh\left(\beta(m^* + h)\right)} = \tanh\left(\beta(m^* + h)\right).$$
As N → ∞, the difference between β and β′ vanishes, and we recover the mean-field equation:
$$m^* = \tanh\left(\beta(m^* + h)\right).$$
We can also recover the free energy from a similar cavity argument. First we note that
$$\frac{1}{N}\log Z_N = \frac{1}{N}\log\left[\frac{Z_N}{Z_{N-1}}\,\frac{Z_{N-1}}{Z_{N-2}}\cdots\frac{Z_1}{1}\right] = \frac{1}{N}\sum_{n=0}^{N-1}\log\frac{Z_{n+1}}{Z_n}, \qquad (1.19)$$
so that (a rigorous argument can be made using Cesaro sums, see appendix 1.C):
$$\Phi(\beta, h) = \lim_{N\to\infty}\log\frac{Z_{N+1}(\beta, h)}{Z_N(\beta, h)}.$$
Note, however, that the β should be the same at N and N + 1 systems in this computation.
Starting from equation 1.18, we thus write, making sure we keep all terms that are not o(1):
$$-\beta H_{N+1} = o(1) + \frac{\beta}{2}\left(N - 1 + o(1)\right)\left(\frac{\sum_i S_i}{N}\right)^{\!2} + \beta S_0\,\frac{\sum_i S_i}{N}\left(1 + o(1)\right) + \beta h\sum_i S_i + \beta h S_0$$
$$= o(1) - \beta H_N - \frac{\beta}{2}\left(\frac{\sum_i S_i}{N}\right)^{\!2} + \beta S_0\,\frac{\sum_i S_i}{N} + \beta h S_0. \qquad (1.20)$$
Notice the presence of the term −β S̄²/2, which we could have overlooked had we been less
cautious. We did not keep this term in the previous computation because we could include
it in β′: it made no difference in the large N limit and could be absorbed
in the normalisation at the price of a minimal o(1) change in β. Here, however, we need to
compute Z_{N+1}/Z_N, which is O(1), so we need to pay attention to any constant correction. We
can now finally compute³
$$\frac{Z_{N+1}}{Z_N} = \left\langle e^{-\frac{\beta}{2}\bar S^2}\; 2\cosh\left(\beta(\bar S + h)\right)\right\rangle_{N,\beta}. \qquad (1.21)$$
³ Let us make a remark that shall be useful later on, when we shall discuss the cavity method on sparse graphs: the new Hamiltonian has two terms in addition to the old one: a "site" term that is a function of S0, and a "link" term, that appears because, on top of adding one spin, we added N links to the system. This will turn out to be a generic property.
Assuming again that S̄ concentrates on m∗ and applying the Laplace method to this ratio, one obtains an alternative potential $\tilde\phi(m) = -\frac{\beta}{2}m^2 + \log\left(2\cosh(\beta(m+h))\right)$, whose extremization over m again yields Φ(β, h) and the mean-field equation 1.9.
One should not, however, assume that this last expression ϕ̃(m) is the correct large deviation
quantity: it is not. The reader is invited to check that the Legendre transform of Φ(β, h)
gives back the correct large deviation function ϕ, as it should.
Figure 1.6: The functions ϕ(m) (1.4) and ϕ̃(m) (1.23) are different but coincide at all their
extrema.
All these considerations can be made entirely rigorous, as we show in Appendix 1.C. This
is one of the very important tools that mathematicians may use to prove some of the results
discussed in this monograph.
1.5 Toolbox: The "field-theoretic" computation 17
To conclude this lecture, it will be also worth learning a trick that physicists use a lot, and that
will be central to all replica computations in the next chapters. It is probably a good idea to
learn it in the context of the simple Curie-Weiss model. We write again the Hamiltonian
$$H_N = -\frac{1}{N}\sum_{i<j} S_i S_j - h\sum_i S_i \approx -\frac{1}{2N}\sum_{i,j} S_i S_j - h\sum_i S_i,$$
where the diagonal terms i = j only contribute a constant.
Now, let us pretend we do not know how to compute the binomial coefficient Ω(m). Instead,
we are going to use the so-called "Dirac-Fourier" method, which starts with the following
identity for the delta "function":
Z
du f (u)δ(u − x) = f (x)
In this case, physicists would actually use the following version of the identity:
1
Z Z Z
x x
dm f (m)δ(N m − x) = dm δ N m − f (m) = dm f (m)δ m −
N N N
1 x
= f
N N
but notice that the additional 1/N term does not matter since we take the log and divide by
N : it thus makes no difference asymptotically. We then write:
$$Z_N = \sum_S e^{-\beta H_N} = N\sum_S\int dm\; \delta\!\left(N m - \sum_j S_j\right) e^{\frac{N\beta}{2}m^2 + N\beta h m} = N\int dm\; e^{\frac{N\beta}{2}m^2 + N\beta h m}\sum_S \delta\!\left(N m - \sum_j S_j\right).$$
We recognize that one would need to compute the entropy at fixed m. Again, let us pretend we
cannot compute it (the idea is to do so without any combinatorics). Instead, using a Fourier
transform of the delta "function" (which is really a distribution), one finds that
$$Z_N = N\sum_S\int dm\int d\lambda\; e^{\frac{N\beta}{2}m^2 + \beta h m N}\; e^{i2\pi\lambda\left(N m - \sum_j S_j\right)}.$$
We do not like to keep the i explicitly, so instead, we write m̂ = i2πλ and integrate in the
complex plane
$$Z_N = \frac{N}{2i\pi}\int_{-1}^{1} dm\int_{-i\infty}^{i\infty} d\hat m\; e^{\frac{N\beta}{2}m^2 + \beta h N m + N\hat m m}\sum_S e^{-\hat m\sum_j S_j} = \frac{N}{2i\pi}\int_{-1}^{1} dm\int_{-i\infty}^{i\infty} d\hat m\; e^{N\left(\frac{\beta}{2}m^2 + \beta h m + \hat m m\right)}\left(2\cosh\hat m\right)^N.$$
We are interested only in the density of the log, and thus we write:
$$\frac{\log Z_N}{N} = \frac{1}{N}\log\left\{\int_{-1}^{1} dm\int_{-i\infty}^{i\infty} d\hat m\; e^{N\left(\frac{\beta}{2}m^2 + \beta h m + \hat m m + \log 2 + \log\cosh\hat m\right)}\right\} + o(1).$$
We have gotten rid of the combinatoric sums, at the price of introducing integrals. It turns
out, however, that these are very easy integrals, at least when N is large. The trick is to use
Cauchy's theorem to deform the contour integral in m̂ and put the path right onto a saddle
in the complex plane. This is called the saddle point method, and it is a generalization of
Laplace's method to the complex plane. More information is provided in Exercise 1.3, but
what the saddle point method tells us is that for integrals of this type, we have
$$\Phi(\beta, h) = \lim_{N\to\infty}\frac{\log Z_N}{N} = \text{extr}_{m,\hat m}\; g(m, \hat m),$$
with
$$g(m, \hat m) = \frac{\beta}{2}m^2 + \beta h m + \hat m m + \log 2 + \log\cosh\hat m.$$
In other words, the complex two-dimensional integral has been replaced by a much simpler
differentiation where we just need to find the extremum of g(m, m̂)!
How consistent is this expression with the previous one we obtained? Let us do the differenti-
ation over m̂: the extrema condition imposes m = − tanh m̂, or m̂ = − tanh−1 m. If we plug
this into the expression, we finally find
$$g(m) = \frac{\beta}{2}m^2 + \beta h m - m\tanh^{-1}(m) + \log\left(2\cosh\tanh^{-1}(m)\right).$$
This does not look directly like our good old ϕ(m), but by using the identity
$$-m\tanh^{-1}(m) + \log\left(2\cosh\tanh^{-1}(m)\right) = H(m),$$
we get back
$$g(m) = \frac{\beta}{2}m^2 + \beta h m + H(m) = \phi(m),$$
and Φ(β, h) = extr_m ϕ(m), as it should.
It is worth noting, however, that most physicists choose to write things in a slightly different,
but equivalent, way. Indeed, starting again from
$$g(m, \hat m) = \frac{\beta}{2}m^2 + \beta h m + \hat m m + \log 2 + \log\cosh\hat m,$$
the typical physicist imposes m̂ = −β(m + h) instead of m = −tanh m̂. They would then get
rid of m̂ instead of m, since it is more convenient, and write:
$$\Phi(\beta, h) = \text{extr}_m\left[-\frac{\beta}{2}m^2 + \log\left(2\cosh\beta(m + h)\right)\right] = \max_m \tilde\phi(m, h, \beta).$$
As we have seen in the previous section in the cavity computation, this is not a problem, since
this formula is correct as well. In fact, it is reassuring that both formulations can be found
using this method.
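A short numerical check of this equivalence (the values of β and h below are arbitrary choices): the maxima of ϕ(m) and ϕ̃(m) over m coincide, even though the two functions differ away from their extrema (cf. Figure 1.6).

```python
# Check that max_m phi(m) = max_m phi_tilde(m) for the Curie-Weiss free entropy.
import numpy as np

def phi(m, beta, h):
    H = -(1 + m) / 2 * np.log((1 + m) / 2) - (1 - m) / 2 * np.log((1 - m) / 2)
    return beta / 2 * m**2 + beta * h * m + H

def phi_tilde(m, beta, h):
    return -beta / 2 * m**2 + np.log(2 * np.cosh(beta * (m + h)))

beta, h = 1.5, 0.3
m = np.linspace(-0.999, 0.999, 100001)
print(phi(m, beta, h).max(), phi_tilde(m, beta, h).max())  # the two maxima agree
```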
Bibliography
The Curie-Weiss model is inspired from the pioneering works of the Frenchmen Curie (1895)
and Weiss (1907). The history of mean-field models and variational approaches in physics is
well described in Kadanoff (2009). The presentation of the rigorous solution of the Curie-Weiss
model follows Dembo et al. (2010a). The Laplace method was introduced by Pierre Simon
de Laplace in his revolutionary work (Laplace, 1774), where he founded the field of statistics.
The saddle point method was first published by Debye (1909), who in turn credited it to an
unpublished note by Riemann (1863). A classical reference on large deviations is Dembo et al.
(1996). The nice review by Touchette (2009) covers the large deviation approach to statistical
mechanics, and is a recommended read for physicists. Variational approaches in statistics and
machine learning are discussed in great detail in Wainwright and Jordan (2008).
1.6 Exercises
The goal of this first exercise is to prove the bound on the binomial coefficient used above: for integers 0 ≤ k ≤ n,
$$\frac{e^{n H(k/n)}}{n+1} < \binom{n}{k} = \frac{n!}{k!\,(n-k)!} < e^{n H(k/n)}.$$
(a) Use the binomial theorem to prove that, for any 0 < p < 1,
$$1 = \sum_{i=0}^{n}\binom{n}{i}(1-p)^i\, p^{n-i}.$$
(b) Using a particular value of p, and keeping only one term of the sum, show that
$$\binom{n}{k} < e^{n H(k/n)}.$$
(c) If one makes n draws from a Bernoulli variable with probability of positive outcome
p = k/n, what is the most probable value for the number of positive outcomes?
Deduce that
$$(n+1)\binom{n}{k} > e^{n H(k/n)}.$$
Laplace's method concerns the asymptotic behaviour of integrals of the form
$$I(\lambda) = \int_a^b dt\; h(t)\, e^{\lambda f(t)},$$
where λ > 0 is a large parameter.
(a) Intuitively, what are the regions in the interval [a, b] which will contribute most to
the value of I(λ)?
(b) Suppose the function f (t) has a single global maximum at a point c ∈ [a, b] such
that f ′′ (c) < 0, and assume h(c) ̸= 0. Using a Taylor expansion for f , show that for
λ ≫ 1 we expect the integral to behave as follows:
$$I(\lambda) \underset{\lambda\gg1}{\approx} \int_{c-\epsilon}^{c+\epsilon} dt\; h(c)\, e^{\lambda\left[f(c) + \frac12 f''(c)(t-c)^2\right]},$$
and deduce, with a change of variables, that
$$I(\lambda) \asymp h(c)\, e^{\lambda f(c)}\sqrt{\frac{2}{-\lambda f''(c)}}\int_{\mathbb{R}} e^{-t^2}\, dt.$$
The saddle point method is a generalisation of Laplace’s method to the complex plane.
As before, we search for an asymptotic formula for integrals of the type:
$$I(\lambda) = \int_\gamma dz\; h(z)\, e^{\lambda f(z)},$$
where γ : [a, b] → C is a curve in the complex plane C and λ > 0 is a real positive
number which we will take to be large. If the complex function f is holomorphic on a
connected open set Ω ⊂ C, the integral I(λ) is independent of the curve γ. The goal is
therefore to choose γ wisely.
(a) Show that at a critical point, the gradients of u and v are zero.
(b) Using the Cauchy integral formula, show that for all z0 = x0 + iy0 in an open
(d) Let γ(t) = x(t) + iy(t) for t ∈ [a, b] be a parametrisation of the curve passing through
z0 = γ(t0) along the steepest-descent direction of Re[f]. Letting f(t) = f(γ(t)),
h(t) = h(γ(t)), u(t) = Re[f(t)] and v(t) = Im[f(t)], show that the problem boils
down to the evaluation of the following integral:
$$\int_a^b dt\; \gamma'(t)\, h(t)\, e^{\lambda u(t)}.$$
(b) Write the second derivative of f (t) with respect to t and show that at the critical
point z0 we have:
$$\frac{d^2 f(t_0)}{dt^2} = \gamma'(t_0)^2\,\frac{d^2 f(z_0)}{dz^2}.$$
(c) Show that the second derivative f ′′ (t0 ) is necessarily real and negative. Conclude
that:
(d) Let θ be the angle between the curve γ and the real axis at the critical point z0 , see
figure below. Show that:
Consider again the Hamiltonian of the Curie-Weiss model. A very practical way to
sample configurations of N spins from the Gibbs probability distribution
$$P(S = s;\, \beta, h) = \frac{\exp\left(-\beta H_N(s; h)\right)}{Z_N(\beta, h)} \qquad (1.24)$$
is the Monte-Carlo-Markov-Chain (MCMC) method, and in particular the Metropolis-
Hastings algorithm. It works as follows:
1. Choose a starting configuration for the N spins values si = ±1 for i = 1, . . . , N .
2. Choose a spin i at random. Compute the current value of the energy E_now and
the value of the energy E_flip if the spin i is flipped (that is, if S_i^new = −S_i^old).
3. Sample a number r uniformly in [0, 1] and, if r < e^{β(E_now − E_flip)}, perform the flip
(i.e. S_i^new = −S_i^old); otherwise leave it as it is.
4. Go back to step 2.
If one runs this program long enough, it is guaranteed that the final configuration S will
have been chosen with the correct probability.
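A minimal sketch of this Metropolis-Hastings dynamics for the Curie-Weiss Hamiltonian (1.1) could look as follows (the function name, the random seed and the parameter values are illustrative choices, not part of the exercise):

```python
import numpy as np

def metropolis_curie_weiss(N, beta, h, t_max, rng=np.random.default_rng(0)):
    S = np.ones(N)                      # start from the all-up configuration
    magnetizations = []
    for _ in range(t_max):
        i = rng.integers(N)
        M = S.sum()
        # Energy change when flipping spin i, with H = -M^2/(2N) - h M
        M_new = M - 2 * S[i]
        dE = (-(M_new**2) / (2 * N) - h * M_new) - (-(M**2) / (2 * N) - h * M)
        if rng.random() < np.exp(-beta * dE):   # accept with prob min(1, e^{-beta dE})
            S[i] = -S[i]
        magnetizations.append(S.mean())
    return np.array(magnetizations)

m_traj = metropolis_curie_weiss(N=200, beta=1.2, h=0.0, t_max=100 * 200)
print("final magnetization per spin:", m_traj[-1])
```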
(a) Write a code to perform the MCMC dynamics, and start from a configuration where
all spins are equal to Si = 1. Take h = 0, β = 1.2 and try your dynamics for a long
enough time (say, with t_max = 100N attempts to flip spins) and monitor the value
of the magnetization per spin m = Σᵢ Sᵢ/N as a function of time. Make a plot
for N = 10, 50, 100, 200, 1000 spins. Compare with the exact solution at N = ∞.
Remarks? Conclusions?
(b) Start from a configuration where all spins are equal to 1 and take h = −0.1, β = 1.2.
Monitor again the value of the magnetization per spin m = Σᵢ sᵢ/N as a function
of time. Make a plot for N = 10, 50, 100, 200, 1000 spins. Compare with the exact
solution at N = ∞. Remarks? Conclusions?
An alternative local algorithm to sample from the measure eq. 1.24 is known as the
Glauber or heat bath algorithm. Instead of flipping a spin at random, the idea is to
thermalise this spin with its local environment.
Part I: The algorithm
(a) Let S̄ = (1/N) Σᵢ₌₁ᴺ sᵢ be the magnetisation per spin of a system of N spins. Show that,
for all i = 1, ..., N, the probability of having spin Si = ±1 given that all other
spins are fixed is given by:
$$P(S_i = \pm 1\,|\,\{S_j\}_{j\ne i}) \equiv P_\pm = \frac{1 \pm \tanh\left(\beta(\bar S + h)\right)}{2}.$$
(b) The Glauber algorithm is defined as follows:
1. Choose a starting configuration for the N spins. Compute the magnetisation mt
and the energy Et corresponding to the configuration.
2. Choose a spin Si at random. Sample a random number uniformly r ∈ [0, 1]. If
r < P+ , set Si = +1, otherwise set Si = −1. Update the energy and magnetisa-
tion.
3. Repeat step 2 until convergence.
Write a code implementing the Glauber dynamics. Repeat items (a) and (b) of
exercise 1.4 using the same parameters. Compare the dynamics. Comment on the
observed differences.
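For concreteness, a minimal sketch of the heat-bath update rule above (function names and parameter values are illustrative, mirroring the Metropolis sketch given earlier):

```python
import numpy as np

def glauber_curie_weiss(N, beta, h, t_max, rng=np.random.default_rng(0)):
    S = np.ones(N)
    magnetizations = []
    for _ in range(t_max):
        i = rng.integers(N)
        p_plus = (1 + np.tanh(beta * (S.mean() + h))) / 2   # P(S_i = +1 | rest)
        S[i] = 1 if rng.random() < p_plus else -1
        magnetizations.append(S.mean())
    return np.array(magnetizations)
```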
Part II: Mean-field equations from Glauber
Let’s now derive the mean-field equations for the Curie-Weiss model from the Glauber
algorithm.
(a) Let m_t denote the magnetisation per spin at time t, and define P_{t,m} = P(m_t = m). For
simplicity, consider β = 1 and h = 0. Show that for δt ≪ 1 we can write:
$$P_{t+\delta t,\, m} = P_{t,\, m+\frac{2}{N}}\times\frac{1}{2}\left(1 + m + \frac{2}{N}\right)\times\frac{1-\tanh\left(m + 2/N\right)}{2} + P_{t,\, m-\frac{2}{N}}\times\frac{1}{2}\left(1 - m + \frac{2}{N}\right)\times\frac{1+\tanh\left(m - 2/N\right)}{2} + P_{t,\, m}\left[\frac{1+m}{2}\cdot\frac{1+\tanh(m)}{2} + \frac{1-m}{2}\cdot\frac{1-\tanh(m)}{2}\right].$$
(b) Defining the expected magnetisation ⟨m(t)⟩ = ∫ P_{t,m} m dm and using the master equation above, show that we can get an equation for the expected
magnetisation:
$$\langle m(t+\delta t)\rangle = \int P_{t,\,m+2/N}\times\frac{1}{2}\left(1 + m + 2/N\right)\times\frac{1-\tanh(m+2/N)}{2}\; m\, dm + \int P_{t,\,m-2/N}\times\frac{1}{2}\left(1 - m + 2/N\right)\times\frac{1+\tanh(m-2/N)}{2}\; m\, dm + \int P_{t,\,m}\left[\frac{1+m}{2}\cdot\frac{1+\tanh(m)}{2} + \frac{1-m}{2}\cdot\frac{1-\tanh(m)}{2}\right] m\, dm.$$
(c) Making the change of variables m → m + 2/N in the first integral and m → m − 2/N
in the second, and choosing δt = 1/N, conclude that for N → ∞ we can write the
following continuous dynamics for the mean magnetisation:
$$\frac{d}{dt}\langle m(t)\rangle = -\langle m(t)\rangle + \tanh\langle m(t)\rangle.$$
(d) Conclude that the stationary expected magnetisation satisfies the Curie-Weiss
mean-field equation. Generalise to arbitrary β and h.
(e) We can now repeat the experiment of the previous exercise, but using the theoretical
ordinary differential equation: start by a configuration where all spins are equal to
1 and take different values of h and β. For which values will the Monte-Carlo chain
reach the equilibrium value? When will it be trapped in a spurious maximum
of the free entropy ϕ(m)? Compare your theoretical prediction with numerical
simulations.
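One possible numerical route for item (e) (a sketch; the Euler step size and parameters are arbitrary choices) is to integrate the generalised equation d⟨m⟩/dt = −⟨m⟩ + tanh(β(⟨m⟩ + h)) from item (d) and compare its late-time value with the Monte-Carlo runs:

```python
import numpy as np

def mean_field_dynamics(beta, h, m0=1.0, dt=0.01, t_max=50.0):
    """Euler integration of d<m>/dt = -<m> + tanh(beta*(<m> + h))."""
    steps = int(t_max / dt)
    m = np.empty(steps)
    m[0] = m0
    for t in range(1, steps):
        m[t] = m[t - 1] + dt * (-m[t - 1] + np.tanh(beta * (m[t - 1] + h)))
    return m

traj = mean_field_dynamics(beta=1.2, h=-0.1)
print("late-time magnetization:", traj[-1])
```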
(a) Using the variational approach of Section 1.3, write the Gibbs free-energy of the
Potts model as a function of the fractions {ρ_τ}_{τ=1,...,q} of spins in state τ.
(b) Write the mean-field self-consistent equation governing the {ρτ } by extremizing
the Gibbs free-energy and solve it numerically.
(c) Using the cavity approach of section 1.4, show that one can recover the mean-field
self-consistent equation using this method.
Appendix
In these notes we will often adopt a terminology from Statistical Physics. While this is standard
for someone who already took a course on the subject, it can often be confusing for newcomers
from other fields.
In Classical Mechanics, the goal is to study the microscopic properties of a system, for instance
the trajectory of each molecule of a gas or the dynamics of each neuron of a neural network
during training. Instead, in Statistical Mechanics the goal is to study macroscopic properties of
the system, which are collective properties of the d.o.f. In our previous examples, a macroscopic
property of the gas is its mean energy, its temperature or its pressure, while a macroscopic
property of the neural network is the generalization error. To make these notions more precise,
in Statistical Physics we define an ensemble over the configurations, which is simply a probability
measure over the space of all possible configurations. Different ensembles can be defined for
the same system, but in these notes we will mostly focus on the canonical ensemble, which is
defined by the Boltzmann-Gibbs distribution
$$P(X = x) = \frac{1}{Z_N(\beta)}\, e^{-\beta H(x)}, \qquad (1.25)$$
where the normalization constant ZN (β) is known as the partition function. Note that the
partition function is closely related to the moment generating function for the energy. From
this probabilistic perspective, a configuration x ∈ X N is simply a random sample from the
Boltzmann-Gibbs distribution, and a macroscopic quantity can be seen as a statistic from the
ensemble. Physicists often denote the average with respect to the Boltzmann-Gibbs distribution
with brackets ⟨·⟩β and refer to it as a thermal average. We now define the most important quantity
in these notes, the free energy density:
$$-\beta f_N(\beta) = \frac{1}{N}\log Z_N(\beta). \qquad (1.26)$$
Note that since the Hamiltonian is typically extensive H = O(N ), the free entropy (the
logarithm of the partition sum) is also extensive, and therefore its density is an intensive
quantity. It is closely related to the cumulant generating function for the energy. For this
reason, and as discussed in the introduction, from the free entropy we can access many of the
important macroscopic properties of our system. Two macroscopic quantities physicists are
often interested in are the energy and entropy densities:
$$e_N(\beta) = \frac{1}{N}\left\langle H(x)\right\rangle_\beta = \partial_\beta\left(\beta f_N(\beta)\right), \qquad s_N(\beta) = \beta^2\,\partial_\beta f_N(\beta), \qquad (1.27)$$
which are related by the decomposition
$$f_N(\beta) = e_N(\beta) - \frac{1}{\beta}\, s_N(\beta). \qquad (1.28)$$
One of the main goals of the Statistical Physicist is to characterize the different phases of a
system. A phase can be loosely defined as a region of parameters defining the model that share
common macroscopic properties. For example, the Curie-Weiss model studied in Chapter
1 is defined by the parameters (β, h) ∈ R+ × R, and we have identified two phases in the
thermodynamic limit: a paramagnetic phase characterised by no net system magnetization and
a ferromagnetic phase characterized by net system magnetization. The macroscopic quantities
characterising the phase of the system (in this case the net magnetization) are known as the
order parameters for the system. In Chapter 16 we will study a model for Compressive Sensing,
which is the problem of reconstructing a compressed sparse signal corrupted by noise. The
parameters of this system are the sparsity level (density of non-zero elements) ρ ∈ [0, 1] and
the noise level ∆ ∈ R+ , and we will identify the existence of three phases: one in which
reconstruction is easy, one in which it is hard and one in which it is impossible. The order
parameter in this case will be the correlation between the estimator and the signal. Although
all examples studied here have clear order parameters, it is not always easy to identify one,
and it might not always be unique. For instance, in the Compressive sensing example we
could also have chosen the mean squared error as an order parameter.
When a system changes phase by varying a parameter (say the temperature), we say the system
undergoes a phase transition, and we refer to the boundary (in parameter space) separating the
two phases as the phase boundary. In Physics, we typically summarize the information about
the phases of a system with a phase diagram, which is just a plot of the phase boundaries in
parameter space. See Figure 1.B.1 for two examples of well-known phase diagrams in Physics.
Figure 1.B.1: Phase diagrams of water (left) and of the cuprate (right), from Taillefer (2010)
and Schwarz et al. (2020) respectively.
Phase transitions manifest physically by a macroscopic change in the behavior of the system
(think about what happens to the water when it starts boiling). Therefore, the reader who
followed our discussion from Chapter 1 and Appendix 1.A should not be surprised by the fact
that phase transitions can be characterised and classified from the free energy. Indeed, the
classification of phase transitions in terms of the analytical properties of the free energy dates
back to the work of Paul Ehrenfest in 1933 (see Jaeger (1998) for a historical account). At this
point, the mathematically inclined reader might object: the free energy from equation 1.26 is
an analytic function of the model parameters, so how can it change behavior across phases?
Indeed, for finite N the free energy is an analytic function of the model parameters. However,
the limit of a sequence of analytic functions need not be analytic, and in the thermodynamic
limit the free energy can develop singularities. Studying the singular behavior of the limiting
free energy is the key to Ehrenfest’s classification of phase transitions. The two most common
types of phase transitions are:
First order phase transition: A first order phase transition is characterised by the disconti-
nuity in the first derivative of the limiting free energy with respect to a model parameter.
The most common example is the transition of water from a liquid to a gas as we change
the temperature at fixed pressure (what you do when you cook pasta), see Fig. 1.B.1
(left). In this example, the derivative of the free energy with respect to the temperature,
also known as the entropy, discontinuously jumps across the phase boundary.
Second order phase transition: A second order phase transition is characterised by a dis-
continuity in the second derivative of the limiting free energy with respect to a model
parameter. Therefore, in a second order transition the free energy itself and its first
derivative are continuous. Note that second order derivatives of the free energy are
typically associated with response functions such as the susceptibility. Perhaps the most
famous example is the paramagnetic-to-ferromagnetic transition at zero external field, which we discuss next.
Although less commonly used, we can define an n-th order phase transition in terms of the
discontinuity of the n-th derivative of the limiting free energy. The order of a phase transition
is associated to a rich phenomenology, which we now discuss, for the sake of concreteness, in
our favourite model: the Curie-Weiss model for ferromagnetism.
Phase transitions in the Curie-Weiss model Recall that in Chapter 1 we have computed the
thermodynamic limit of the free energy density for the Curie-Weiss model:
$$-\beta f_\beta = \lim_{N\to\infty}\frac{1}{N}\log Z_N = \max_{m\in[-1,1]}\phi(m),$$
where:
$$\phi(m) = \frac{\beta}{2}m^2 + \beta h m + H(m), \qquad H(m) = -\frac{1+m}{2}\log\frac{1+m}{2} - \frac{1-m}{2}\log\frac{1-m}{2}.$$
As we have shown, the parameter m⋆ solving the maximization problem above gives the order
parameter of the system, the net magnetization at equilibrium:
$$m^\star(\beta, h) = \underset{m\in[-1,1]}{\rm argmax}\ \phi(m).$$
In particular, the limiting average energy and entropy densities are given by:
$$e(\beta, h) = \partial_\beta\left(\beta f(\beta, h)\right) = -m^\star\left(\frac{m^\star}{2} + h\right), \qquad s(\beta, h) = \beta^2\,\partial_\beta f(\beta, h) = H(m^\star). \qquad (1.30)$$
In particular, note that the entropy density depends on the model parameters (β, h) only
indirectly through the magnetization m⋆ = m⋆ (β, h). The potential ϕ(m) is an analytic function
of m and the parameters (β, h). However, due to the optimization over m, the free energy
density can develop a non-analytic behavior as a function of (β, h), signaling the presence of
phase transitions, which we now recap.
Zero external field and the second order transition: Note that the decomposition f = e − s/β
(see equation 1.28) makes it explicit that the free energy is a competition between two
parabolas: the energy (convex) and the entropy (concave). At zero external field h = 0,
we note that the potential is a symmetric function of the magnetization ϕ(−m) = ϕ(m).
At high temperatures β → 0+ , the dominant term is given by the entropy, which has
a single global minimum at m⋆ = 0 (see Fig. 1.1): this is the paramagnetic phase in
which the system has no net magnetization. At the critical temperature βc = 1, m = 0
becomes a maximum of the system, with two global minima (having the same free
energy) continuously appearing, see Fig. 1.4. This signals a phase transition towards a
ferromagnetic phase defined by a net system magnetization |m⋆ | > 0. Note that the first
derivative of the free energy with respect to β (proportional to the entropy) remains a
continuous function across the transition. However, we notice that the second derivative
Figure 1.B.2: (Left) Entropy as a function of the inverse temperature β at zero external field
h = 0. Note that the entropy is a continuous function of the temperature, with a cusp at the
critical point βc = 1, indicating that its derivative (proportional to the second derivative of the
free energy) has a discontinuity. (Right) Convergence time of the saddle-point equation as a
function of the inverse temperature β at zero external field h = 0. Note the critical slowing
down close to the second order critical point βc = 1.
of the free energy is discontinuous, indicating this is a second order phase transition. This
transition corresponds to a significant change in the statistical behavior of the system at
macroscopic scales: while for β < 1 a typical configuration from the Boltzmann-Gibbs
distribution has no net magnetization m⋆ = ⟨S̄⟩β ≈ 0 (disordered phase), for β > 1 a
typical configuration has a net magnetization |m⋆ | = |⟨S̄⟩β | > 0 (ordered phase). This is
an example of an important concept in Physics known as spontaneous symmetry breaking:
while the Hamiltonian of the system is invariant under the Z2 symmetry S̄ → −S̄, for
β > 1 a typical draw of the Gibbs-Boltzmann distribution S ∼ PN,β breaks this symmetry
at the macroscopic level. Second order transitions carry a rich phenomenology. Since
the transition is second order (i.e. continuous first derivative), the critical temperature
can be obtained by studying the expansion of the free energy potential around m = 0:
m2
ϕ(m) = log 2 + (β − 1) + O(m3 )
m→0 2
which gives us the critical βc = 1 as the point at which the second derivative changes
sign (m = 0 goes from a minimum to a maximum). It is also useful to have the picture
in terms of the saddle-point equation:
m = tanh(βm).
The fact that m = 0 is always a fixed point of this equation signals it is always an
extremizer of the free energy potential. From this perspective, the critical temperature
βc = 1 corresponds to a change of stability of this fixed point. Seeing the saddle-point
equations as a discrete dynamical system mt+1 = f (mt ), the stability of a fixed point
can be determined by looking at the Jacobian of the update function f : [−1, 1] → [−1, 1]
around the fixed point m = 0:
$$\left.\frac{df}{dm}\right|_{m=0} = \beta\left(1 - \tanh^2(\beta m)\right)\Big|_{m=0} = \beta.$$
For β < 1, the fixed point is stable (an attractor/sink of the dynamics), while for β > 1
it becomes unstable (a repeller/source of the dynamics). Note that this implies that
close to the transition β ≈ 1⁺, iterating the saddle point equations starting close to zero,
m_{t=0} = ϵ ≪ 1 (but not exactly at zero), takes a long time to converge to a non-zero magnetization
m > 0, with the time diverging as we get closer to the transition. This phenomenon is
known in Physics as critical slowing down, and together with the expansion of the free
energy and the stability analysis of the equations it gives yet another way to characterise a
second order critical point. See Figure 1.B.2 (right) for an illustration.
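A minimal way to see this numerically (a sketch; tolerances, starting values and the list of β are arbitrary choices) is to count the iterations needed for the fixed-point iteration to converge at various β:

```python
# Critical slowing down: iterate m_{t+1} = tanh(beta * m_t) from a small initial
# value and count the iterations needed to converge.
import numpy as np

def convergence_time(beta, m0=1e-3, tol=1e-8, t_max=10**6):
    m = m0
    for t in range(t_max):
        m_new = np.tanh(beta * m)
        if abs(m_new - m) < tol:
            return t
        m = m_new
    return t_max

for beta in [0.8, 0.95, 1.05, 1.2, 1.5]:
    print(f"beta = {beta:4.2f}  ->  {convergence_time(beta)} iterations")
```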
Figure 1.B.3: (Left) Free energy potential ϕ(m) as a function of m for fixed inverse temperature
β = 1.5 and varying external field h < 0. Note that the free energy potential has a local
maximum for |h| > hsp that disappears at the spinodal transition h = hsp . (Right) Free energy
as a function of the external field h at different temperatures. Note the non-analytical cusp at
h = 0.
Finite external field and the first order transition: Turning on the external magnetic field
h ̸= 0 can dramatically change the discussion above. First, note that the Hamiltonian
loses the Z2 symmetry: this is known in Physics as explicit symmetry breaking. At high
temperatures β → 0+ , the free energy potential is convex, with a single minimum at
m = h aligned with the field. As temperature is lowered and we enter what previously
was the ferromagnetic phase (β > 1), two behaviors are possible. For small h, the
field simply has the effect of breaking the symmetry between the previous two global
minima and making the with opposite sign a local minimum, see Fig. 1.B.3 (left). In this
situation, even though the equilibrium free energy is given by the now unique global
minimum of the potential, the presence of a local minimum has an important effect in the
dynamics. Indeed, if we initialize the saddle-point equations close to the magnetization
corresponding to the local minimum, it will converge to this local minimum, since it
is also a stable fixed point of the corresponding dynamical system, see Fig. 1.B.4 (left).
This phenomenon is known as metastability in Physics. Note that metastability can be a
misleading name, since in the thermodynamic limit N → ∞ metastable states are stable
fixed points of the free energy potential. However, at finite system size N , the system
will dynamically reach equilibrium in a time of order t = O(eN ). Metastability will play
a major role in the Statistical Physics analysis of inference problems, since it is closely
related to algorithmic hardness.
As the external field h is increased, the difference in the free energy potential between the
two minima increases, and eventually at a critical field hsp , known as the spinodal point,
the local minimum disappears, making the potential convex again, see Fig. 1.B.3 (left).
The spinodal points can be derived from the expression of the free energy potential, and
1.C A rigorous version of the cavity method 33
magnetization (m)
0.25 0.25
magnetization
0.00 0.00
0.25 0.25
0.50 0.50
0.75 0.75
1.00 1.00
1.00 0.75 0.50 0.25 0.00 0.25 0.50 0.75 1.00 1.00 0.75 0.50 0.25 0.00 0.25 0.50 0.75 1.00
external field (h) external field (h)
Figure 1.B.4: (Left) Stable, metastable and unstable branches of the magnetization as a function
of the external field at fixed inverse temperature β = 1.5. (Right) Magnetization obtained
by iterating the saddle-point equations from different initial conditions mt=0 as a function of
the external field h and fixed inverse temperature β = 1.5. Note the hysteresis loop: point at
which the magnetization discontinuously jumps from negative to positive depends on the
initial state of the system.
is given by:
s r
1 1 1 1
hsp (β) = ± 1− ∓ tanh−1 1− , β>1
β β β β
From this discussion, it is clear that for β > 1 the magnetization (which is the derivative
of the free entropy with respect to h) has a discontinuity at h = 0, since for h ̸= 0 we
have a non-zero magnetization and for h = 0 we are in the paramagnetic phase h = 0.
This is a first order phase transition of the system with respect to the external field h, see
Fig. 1.B.3 (right). Note that as a consequence of metastability, in the region |h| < |hsp |
the system magnetization will depend of the state in which it was initially prepared.
This memory of the initial state is known as hysteresis in Physics, see Fig. 1.B.4 (right).
The cavity method presented in the main chapter can also be done entirely rigorously, as we
now show. This appendix can be skipped for non mathematically-minded readers, although
it is always interesting to see how things can be done precisely. In fact, this section is a good
training for the later Chapters where the rigorous proofs are more involved, despite using the
very same techniques.
First, we want to show that the magnetization, as well as many other observables, does indeed
converge to a fixed value as N increases. This is done through the following lemma that tells
us that, if indeed S̄ concentrates, then any observable will concentrate as well.
Lemma 5. For any bounded observable O({S}N ) there exists a constant C = ∥O(.)∥∞ such that:
q
|⟨O(S)⟩N +1,β ′ − ⟨O(S)⟩N,β | ≤ Cβ sinh(βh + β) VarN,β (S̄) (1.32)
34 F. Krzakala and L. Zdeborová
We can now prove the main thesis and obtain the mean field equation:
Theorem 5. There exists a constant C(β, h) such that
q
|⟨Si ⟩N,β − tanh β(h + ⟨S̄⟩N,β )| ≤ C(β, h) Var(S̄)
All is left to do to get the mean field equation is showing that the variance of the magnetization
is going to zero outside of the phase transition/coexistence line. This is not entirely easy to do,
but we can easily show that this true almost everywhere in the plane (β, h) using the so-called
fluctuation-dissipation approach:
Lemma 6 (Bound on the variance). For any β, and values h1 , h2 of the magnetic field, one has
h2
2
Z
Var(S̄)β,N,h dh ≤
h1 βN
2
so that Var(S̄)β,N,h ≤ N for almost every h.
∂2 ∂
N Var(S̄)β,N,h = ΦN (β, h) = ⟨S̄⟩β,N
∂(βh)2 ∂(βh)
Therefore Z βh2
N Var(S̄)β,N,h dβh = ⟨S̄⟩β,N,h2 − ⟨S̄⟩β,N,h1 ≤ 2
βh1
We can also prove the free entropy, using a technique that shall be very useful for more complex
problems:
1 1 ZN 1 ZN ZN −1 Z1
ϕN (β, h) = log ZN = log ZN −1 = log ...
N N ZN −1 N ZN −1 ZN −2 1
N −1
1 X ZN +1
= An (β, h); AN (β, h) =: log
N ZN
n=0
we have, by the Stolz–Cesàro theorem, that if An converges to A, then ΦN (β, h) also converges
to A. Using equation 1.21 we thus write
β 2
AN (β, h) =: log ZN +1 /ZN = log⟨e− 2 S̄ 2 cosh (β(S̄ + h))⟩N,β + o(1) .
By lemma 5 and 6 we thus obtain for any β > 0 and almost everywhere in h, that
m∗ 2
lim AN (β, h) = −β + log 2 cosh(β(m∗ + h))
N →∞ 2
with m∗ = ⟨S̄⟩. Therefore almost everywhere in field we have, following eq.(1.21) that the
free entropy is given by one of the extremum m∗ of eq.(1.21):
Φ(β, h) = ϕ̃(⟨S̄⟩) .
36 F. Krzakala and L. Zdeborová
It just remains to show that the correct extremum, if they are many of them, is the maximum
one. This can be done by noting that it is necessary the maximum since ϕ̃(m∗ ) = ϕ(m∗ ), and
that by the Gibbs variational approach, theorem 4, we already proved that Φ ≥ ϕ(m)∀m. Given
the entropy is Lipschitz continuous in h for all N (its derivative is the magnetization, which is
bounded), its limit is continuous, so that the if the free entropy is given by the maximum of
ϕ(m) almost everywhere, it is true everywhere.
Chapter 2
i.i.d.
but now, the additional fields h are fixed, once and for all. We choose hi ∼ N (0, ∆).
This is a simple variation on the Ising model, but now we have these new random fields h. This
makes the problem a bit more complicated. For this reason, it is called the Random Field Ising
Model, or RFIM. We may ask many questions. For instance: what is the assignment of the spins
that minimizes the energy? This is a non-trivial question since there is a competition between
aligning all variables together in the same direction, and aligning them to the direction of
the local random fields. What will be the energy of this assignment? Will the energy be very
different when one picks up another random value for h? How do we find such an assignment
in practice? With which algorithm?
We shall be interested in the behavior of this system in the large N limit. How are we going to
deal with the random variables? The entire field of statistical mechanics of disordered systems,
and its application to optimization and statistics, is based on the idea of self-averaging: that is,
the idea that the particular realization of h does not matter in the large size limit N → ∞ (the
38 F. Krzakala and L. Zdeborová
asymptotic limit for mathematicians, or the "thermodynamic" limit for physicists). This is a
powerful idea which allows us to average over all h.
Let us see how this can be proven rigorously. First, a word of caution, the partition sum ZN is
exponentially large in N , so we expect its fluctuation around its mean to be very large as well.
Fortunately, as we have seen, our focus will be on log ZN (h)/N , which we hope converges to
a constant O(1) value as N → ∞. The quantity log ZN (h)/N is random, it depends on the
value of h, but we may expect that the typical value of log ZN (h)/N is close to its mean. This
notion, that a large enough system is close to its mean, is called "self-averaging" in statistical
physics. In probability theory, this is a concentration of measure phenomenon.
Indeed for the Random Field Ising model, we can show that the free entropy concentrates
around its mean value for large N :
Theorem 7 (Self-averaging). Let ΦN (h, β) =: log ZN (β, h)/N be the free entropy density of the
RFIM, then:
∆β 2
Var[ΦN (h, β)] ≤
N
Proof. The proof is a consequence of the very useful Gaussian Poincaré inequality (see the
theorem in exercise 9): Suppose f : Rn 7→ R is a smooth function and X has multivariate
Gaussian distribution X ∼ N (0, Γ), where Γ ∈ Rn×n . Then
and the Gaussian Poincaré inequality with Γ = ∆I (covariance matrix of h) yields the final
result.
Hence, instead of computing the free entropy density for each realization of h, we turn to
computing the expectation over all possible h, i.e. we define
1
Φ(β, ∆) ≜ lim Eh ΦN (β, δ, h) = lim Eh [log ZN (β, h)]
N →∞ N →∞ N
and the self-averaging property guarantees that Eh [log ZN (β, h)] /N is close to [log ZN (β, h)] /N
when N is large. In probability theory, when the variance goes to zero one says that the random
variable converges in probability. Here, we have thus showed that ΦN (β, h, ∆) converges in
probability to Φ(β, ∆). In fact, we could work a bit more and show that the probability to
have a deviation larger than the variance is exponentially low, so we are definitely safe, even
at moderate values of N . This leaves us with the question of how to compute Φ(β, ∆), as it
involves the expectation of a logarithm.
2.2 Replica Method 39
In order to compute the average of the logarithm, a powerful heuristic method has been used
widely in statistical physics: the replica method, proposed in the 70’s by Sir Sam Edwards
(who credits it to Marc Kac). Here is the argument: Suppose n is close to zero, then
Zn − 1
Z n = en log Z = 1 + n log(Z) + o(n) =⇒ log Z = lim
n→0 n
If Z is a random variable and we suppose that swapping limit and expectation is valid (which
is by no means evident), then we find the following identity, often referred to as "the replica
trick:"
Zn − 1 E[Z n ] − 1
E [log Z] = E lim = lim
n→0 n n→0 n
This is at the root of the replica method: we replace the average of the logarithm of Z, which
is hard to compute, by the average of powers of Z. If n is integer, we may hope that we shall
indeed be able to compute averages of Z n . We could then "pretend" that our computation
with n finite is valid for n ∈ R (which is again not a trivial step), and perform an analytic
continuation from N to ∈ R, and send n → 0. While this sounds acrobatic and certainly not like
rigorous mathematics, it seems, amazingly, to work quite well when we stick to the guidelines
that physicists —following the trail of Giorgio Parisi and Marc Mézard— have proposed over
the last few decades. Indeed, when it can be applied, this method appears to seemingly always
lead to the correct result, at least when we can compare it to rigorous computation. Nowadays,
there is a deep level of trust in the results given by the replica method.
We shall come back on what we can say with rigorous mathematics later on, but first, let us
see how it works in detail for the computation of Φ(β, ∆) in the Random Field Ising Model,
using our "field-theoretic" toolbox of the previous chapter.
Let n be the number of replicas and α = 1, . . . , n be the index of replicas (these are traditional
notation in the replica literature) we have
!2
n X s(α)
XX X X N X (α)
Zn = ··· exp β i
+ hi s i .
2 N
s(1) s (2) s (n) α=1 i i
40 F. Krzakala and L. Zdeborová
We now write the average over the random fields, and proceed to fix the magnetization, as we
did for the Curie-Weiss model, this time for each of the n "replicas" indexed from α = 1, ..., n :
!2
N Pn P s(α)
i
Pn P (α)
β 2 α=1 i N +β α=1 h s
i i i
X
Eh [Z n ] = Eh e
n
{s }α=1
(α)
Z !
(a) X Y X (α) N P 2
P P (α)
= N n Eh dmα δ si − N mα eβ 2 α mα +β α i hi si
{s(a) }α α i
Z hP i
(α) (α)
(b)
P
= (2iπN )n Eh
m̂ s −N mα β N 2
P P P
X Y
dmα dm̂α e α α i i e 2 α mα +β α i hi si
{s(a) }α α
Z Y
X Pα m̂α Pi s(α) (α)
dmα dm̂α e(β 2 α mα −N α m̂α mα ) Eh
N P 2
P P P
∝ e i +β α i hi si
α { }α
s (a)
where (a) is obtained by splitting the sum by magnetization and (b) is obtained by taking the
Fourier transform of the Dirac delta function, followed by a change of variables m̂α = 2π iλα .
We then chose to ignore the irrelevant prefactors in front of the integral. Then:
Z Y
X Y Y m̂α s(α) (α)
dmα dm̂α e(β 2 α mα −N α m̂α mα ) Eh
N P 2
P
Eh [Z n ] ∝ e i +βhi si
α {s(a) }α α i
Z Y
(c) (α) (α)
dmα dm̂α e(β 2 α mα −N α m̂α mα ) Eh
N P 2
P Y Y X
∝ em̂α si +βhi si
α i α (α)
si =±1
Z Y ( " #)N
(d)
(β N2 2
α m̂α mα )
P P Y
∝ dmα dm̂α e α mα −N Eh 2 cosh (βh + m̂α )
α α
Z Y
N [ β2 2
α m̂α mα +log(Eh [ α 2 cosh(βh+m̂α )])]
P P Q
= dmα dm̂α e α mα −
where (c) is obtained by writing the sum of products into a product of sums and (d) comes
from the fact that all hi ’s are i.i.d. so that the integral over the vectors h has been replaced by
a single intergal over a scalar h. We have again got read entirely of the combinatoric sums but
at the price of introducing integrals.
At this point, we seem to have reached a quite complicated expression: we now have to
somehow manage to integrate over all the mα , m̂α , and magically take the n → 0 limit. These
integrals can be see seen as an integral over the two n-dimensional vectors matrices m and m̂,
i.e: Z Y Z
dmα dm̂α =: dm dm̂
α
2.2 Replica Method 41
From the structure of the integral, we should expect that a saddle point method will hold, so
that we should extremize the expression in the exponential over these two vectors. While this
looks like a formidable challenge (maximization over all possible vectors!) we may guess how
these vectors will look like at the extremum. A very reasonable assumption, called the replica
symmetry (RS) ansatz, is that at the extremum all the replicas are equivalent, so that
mα ≡ m, m̂α ≡ m̂, ∀α
Physicists, who have been trained to follow the steps of the giant german scientists of the late
XIXth and early XXth century, call such a guess an ansatz. Following then the replica symmetric
ansatz, the seemingly huge monster Eh [Z n ] is now reduced to the more gentle:
Z
n β 2 n n
Eh [Z ] =∝ dm dm̂ exp N nm − nm̂m + log (Eh [2 cosh (βh + m̂)])
2
We have thus quite simplified the problem and the averaged free entropy we are looking for
reads
1
Φ(β, ∆) = lim Eh [log (Z(β, h))]
N →∞ N
(a) 1 Eh [(Z(β, h))n ] − 1
= lim lim
N →∞ N n→0 n
(b) 1 Eh [(Z(β, h))n ] − 1
= lim lim
n→0 n N →∞ N
(c) 1 β 2 n n
= lim Extrm,m̂ nm − nm̂m + log (Eh [2 cosh (βh + m̂)])
n→0 n 2
where (a) is simply applying the replica trick; (b) is a non-rigorous swap of two limits which
is assumed to be correct in the replica method; and (c) is the saddle point method. We have
almost finished the replica computation, the last step is to get rid of the remaining n. This can
be done by using the replica trick once more:
(d) 1 β 2
Φ(β, ∆) = lim Extrm,m̂ nm − nm̂m + nEh [log (2 cosh (βh + m̂))]
n→0 n 2
β 2
= Extrm,m̂ m − m̂m + Eh [log (2 cosh (βh + m̂))]
2
which implies
log (E[X n ]) ≈ log (1 + nE [log(X)]) ≈ nE [log(X)]
This kind of voodoo replica magic should be astonishing: essentially, we see that we can push
the expectation within a function when n is going to 0! This already hints at the fact that, for
this to be really valid, we shall require some concentration property for the random variables!
In any case, we have finished the replica part of our computation, and have managed to bring
back the computation to an extremization of a two-dimensional function, just like we did for
the Curie-Weiss model:
β 2
Φ(β, ∆) = extrm,m̂ m − m̂m + Eh [log (2 cosh (βh + m̂))]
2
42 F. Krzakala and L. Zdeborová
Plugging this back, we reach a formula very similar to the one obtained for Curie-Weiss.
Defining
β
ΦRS (m, β, ∆) ≜ − m2 + Eh [log (2 cosh (β(h + m)))]
2
we find:
Φ(β, ∆) = extrm ΦRS (m) = ΦRS (m∗ ) (2.1)
where m∗ will satisfy the self-consistent mean-field equation
h2
e− 2∆
Z
m = Eh [tanh (β(h + m))] = dh √ tanh(β(h + m))
2π∆
As we did in the Curie-Weiss model, we can also compute the large deviation function that
gives us the free entropy for a fixed value of m. In fact, this is self-averaging as well, since we
could repeat all the steps of theorem 7 with an indicator function. The free entropy is obtained
by doing the saddle point in the correct order and first differentiating wrt m̂, leading to an
implicit equation on m̂∗ :
m = Eh [tanh (βh + m̂∗ )]
so that the replica symmetric approach predicts
LD (β,∆,m)
P(S̄ = m) ≍ eN Φ
where
LD β 2
Φ (β, ∆, m) = extrm̂ m − m̂m + Eh [log (2 cosh (βh + m̂))]
2
where the extremization over m̂ imposes that it solves eq. (2.2.3). Indeed it is easy to check
that this recover the large deviation function of the previous chapter.
We illustrate the result of the replica predictions are shown in Figure.2.2.1 where we plot the
minimal cost of the assignement versus the variance of the random field. This is obtained by
solving the self consistant equation 2.1, and computing the equilibrium energy by deriving
the free entropy with respect to the (inverse) temperature :
(m∗ )2
E⟨e⟩ = −∂β Φ(β, ∆) = − Eh [(h + m∗ ) tanh (β(h + m∗ ))] (2.2)
2
and taking the zero temperature limit so that
(m∗ )2
E[emin ] = − Eh [(h + m∗ ) sign (h + m∗ )] (2.3)
2
2.3 A rigorous computation with the interpolation technique 43
Figure 2.2.1: Minimum energy in the Random Field Ising Model depending on the variance ∆.
Given this computation was somehow acrobatic, it would be only natural to seek rigorous
re-assurance that the result we reached is exact. In order to do so, we shall use the interpolation
method introduced by Francesco Guerra to prove results from the replica method.
First we start by a different, but simpler, problem. Consider a system with the Hamiltonian:
X
H0 (s, h; m) = − si (hi + m)
i
The corresponding partition function and free entropy per spin read
X P Y X Y
Z0 (β, h; m) = eβ i si (hi +m) = eβsi (hi +m) = 2 cosh (β(hi + m))
s i si =±1 i
log (Z0 (β, h; m))
Φ0 (β, ∆; m) = Eh = Eh [log (2 cosh (β(h + m)))]
N
In fact, we can even define this partition sum at fixed value of S̄ = s, to access large deviations:
X P
Z0 (β, h; m, s) = 1(S̄ = s)eβ i si (hi +m)
We do not know how to do this computation directly but we can, however, do it using the
Gartner-Ellis theorem, or Legendre transform, since as N → ∞ this is equivalent to computing
the rate. We thus write
X P P
Z̃0 (β, h; m, k) = eβ i si (hi +m)+k i si
s
1
Z̃0 (β, h; m, k) → Eh [log (2 cosh (β(h + m) + k))]
N
44 F. Krzakala and L. Zdeborová
1
Φ0 (β, m, ∆) = lim log Z0 (β, h; m, S̄ = m) = Eh [log (2 cosh (β(h + m) + k ∗ ))] − k ∗ m
N →∞N
m = Eh [tanh (β(h + m) + k ∗ )]
and we now have something that looks already very close to the replica prediction!
!2
X N X si
Ht (s, h; m) = − si [hi + m(1 − t)] − t ,
2 N
i i
X
Zt (β, h; m) = 1(S̄ = m)e−βHt (s,h;m) , ∀ t ∈ (0, 1]
s
It is easy to verify that that, for t = 0, we recover the model of the previous paragraph, while
for t = 1 we have:
We are now going to interpolate the model at time 1 from the one at time 0, and write, using
the fundamental theorem of calculus:
log (ZRFIM (β, h))
Φ(β, m, ∆) = lim Eh
N →∞ N
log (Z1 (β, h; m))
= lim Eh
N →∞ N
Z 1
log (Z0 (β, h; m)) ∂ log (Zt (β, h; m))
= lim Eh + dτ
N →∞ N 0 ∂t N t=τ
Z 1
∂ log (Zt (β, h; m))
= Φ0 (m, ∆, β) + lim Eh dτ
N →∞ 0 ∂t N t=τ
2.3 A rigorous computation with the interpolation technique 45
Bibliography
A nice review on the random field Ising model in physics is Nattermann (1998). It played
a fundamental role in the development of disordered systems. The replica method was
introduced by Sam Edwards, who credited it to Marc Kac (Edwards et al. (2005)). It has been
turned into a powerful and versatile tool by the work of a generation of physicists led by Parisi,
Mézard and Virasoro (Mézard et al. (1987b)). The interpolation trick we discussed to prove the
replica formula was famously introduced by Guerra (2003). The peculiar technique we used
here fixing the magnetization is inspired from El Alaoui and Krzakala (2018). Probabilistic
inequalities such as Gaussian Poincaré are fundamental to modern probability and statistics
theories. A good reference is Boucheron et al. (2013). These concentration inequalities are
the cornerstone of all approaches to rigorous mathematical treatments of statistical physics
models.
46 F. Krzakala and L. Zdeborová
2.4 Exercises
In order to prove the Gaussian Poincaré inequality, we first need to prove the very generic
Efron-Stein inequality, which is at the root of many important result in probability
theory:
Theorem 8 (Efron-Stein). Suppose that X1 , . . . , Xn and X1′ , . . . , Xn′ are independant random
variable, with Xi and Xi′ having the same law for all i. Let X = (X1 , . . . , Xi , . . . Xn ) and
X (i) = (X1 , . . . , Xi−1 , Xi′ , Xi+1 , . . . Xn ). Then for any function f : Rn → R we have:
n
1X
var(f (X)) ≤ E[(f (X) − f (X (i) ))2 ].
2
i=1
We are going to prove Efron-Stein using the so-called Lindeberg trick, by consider-
ing averages over mixed ensembles of the Xi and Xi′ . First we define the set X(i)
as the random vector equal to X ′ to i, and equal to X for all larger indices, i.e.
X(i) = (X1′ , . . . , Xi−1 i+1 , . . . Xn ). In particular X(0) = X and X(n) = X .
′ , X ′, X ′
i
E[f (X)(f (X(i−1) ) − f (X(i) ))] = E[f (X (i) )(f (X(i) ) − f (X(i−1) ))]
1 h i
= E (f (X) − f (X (i) )(f (X(i−1) ) − f (X(i) )
2
1 h i
|E[f (X)(f (X(i−1) ) − f (X(i) ))]| ≤ E (f (X) − f (X (i) )2
2
and proove the Efron-Stein theorem.
Now that we have Efron-Stein, we can prove Poincare’s inequality for Gaussian random
variables. We shall do it for a single variable, and let the reader generalize the proof to
the multi-valued case.
With Xi a ±1 random variable that takes each value with probability 1/2 (this is called
a Rademacher variable), define:
Sn = X1 + X2 + . . . + Xn .
5. Using the central limit theorem, show that this leads, as n → ∞, to the following
theorem:
ZN +1 β 2
AN (β, ∆) =: Eh,h log = Eh log⟨e− 2 S̄ 2 cosh (β(S̄ + h))⟩N,β,h + o(1)
ZN
2. Show
P that, by adding an external magnetic field B to the Hamiltonian (i.e. a term
B i Si , one can get a concentration of the magnetization for almost all B so that,
for any h, we have:
Z B2
⟨S̄ 2 ⟩N,β,h − ⟨S̄⟩2N,β,h dB ≤ 2/βN
B1
Note that this gives the concentration over the Boltzmann averages, but not over
the disorder (the fields h). This means we showed that the magnetization con-
verges to a value m(h) that could — a priori — depend on the given realization
of the disorder h.
4. Given that we proved that ⟨S⟩N,β,h concentrates as N grows to a value m(h), show
that this implies, as N → ∞, the bound:
β
Φ(β, ∆, B) ≤ supm − m2 + Eh log 2 cosh (β(m + h + B))
2
5. Use the variational approach of lecture 1 to obtain the converse bound and finally
show:
β
Φ(β, ∆) = supm − m2 + Eh log 2 cosh (β(m + h))
2
48 F. Krzakala and L. Zdeborová
Exercise 2.3: Mean-field algorithm and state evolution for the RFIM
Our aim in this exercise is to provide an algorithm for finding the lowest energy config-
uration, and to analyse its property.
1. Using the variational approach of section 1.3 or the cavity method of section 1.4,
explain why the following iterative algorithm might be a good one for finding
the lowest energy in practice (tip: this is the zero temperature limit of a finite
temperature iteration):
!
X
Sit+1 = sign hi + Sit /N
i
2. Implement the code of this algorithm, check that it indeed finds configurations
with minimum values that match the replica predictions for the minimum energy
when the system is large enough.
3. Show that in the large N limit, the dynamics of this algorithm obey a "state
evolution" equation, that is, that at each time the average magnetization mt =
i Si /N is given by the deterministic equation:
t
P
mt+1 = Eh sign(h + mt )
and conclude that the algorithm is performing a fixed point iteration of the replica
symmetric free energy equation 2.2.3.
Chapter 3
Unfortunately, no one can be told what the Matrix is. You have to
see it for yourself
We shall move now to a first non-trivial application of the replica method to compute the
spectrum of random matrices. Random matrices were introduced by Eugene Wigner to model
the nuclei of heavy atoms. He postulated that the spacings between the lines in the spectrum
of a heavy atom nucleus should resemble the spacings between the eigenvalues of a random
matrix, and should depend only on the symmetry class of the underlying evolution. Since
then, the study of random matrices has become a field in itself, with numerous applications
ranging from solid-state physics and quantum chaos to machine learning and number theory.
The simplest of all random matrix is the Wigner one:
1
AN = √ G + G⊤
2N
where G is a random matrix where each element Gij ∼ N (0, 1) is i.i.d. chosen from N (0, 1).
The central question is: what does the distribution of eigenvalues νAN (λ) of such random
matrices look like as N → ∞? It turns out that νAN (λ) converges towards a well-defined
deterministic function ν(λ). We shall see how one can use the replica and the cavity method
to compute it, as a first real application of our tools.
We will use a technique called the Stieltjes transform. It starts with a very useful identity,
defined in the sense of distributions, called Sokhotsky’s Formula. It is often used in field
theory in physics, where people refer to it as the "Feynman trick". It can also be useful to
compute the Kramers-Kronig relations in optics, or to define the Hilbert transform of causal
50 F. Krzakala and L. Zdeborová
functions1
1 1
δ(x − x0 ) = − lim ℑ
ϵ→0 π x − x0 + iϵ
This formula is at the roots of the theoretical approach to random matrix theory. Indeed, given
a probability distribution ν(x) that can take N values with uniform probability, we have:
1 X 1 X 1
ν(x) = δ(x − xi ) = − lim ℑ
N ϵ→0 N π x − x0 + iϵ
i i
Now we use the fact that 1/x is the derivative of the logarithm to write
1 X 1 1 X 1 Y
= ∂x log (x − xi ) = ∂x log (x − xi )
N x − xi N N
i i i
So that, given an N × N matrix A with eigenvalues λ1 , . . . , λN , and using the fact that the
determinant is the product of eigenvalues, we have:
1 X 1
ν(λ) = δ(λ − λi ) = − ℑ lim ∂λ log det(A − (λ + iϵ)1)
N N π ϵ→0
i
This is the basis of the computation techniques used in random matrix theory. For a given
matrix A, we introduce the Stieltjes transform, as
1 X 1 1
SA (λ) = − = − ∂λ log det(A − λ1) ,
N λ − λi N
i
and once we have the Stieltjes transform, we can access the probably density via its imaginary
part:
1
νA (λ) = lim ℑSA (λ + iϵ) .
π ϵ→0
The entire field of random matrix theory is thus reduced to the computation of the Stieltjes
transform associated with the probability distribution of eigenvalues.
Our aim is now to compute the Stieltjes transform of the Wigner matrix using the replica
method. We shall not aim for mathematical rigor here, as our goal is merely the demonstration
of the power of our tools.
1
This formula is easily proven using Cauchy integration in the complex plane. Alternatively, one can simply
use x2 − ε2 = (x + iε)(x − iε). Indeed,
x2
Z Z Z
f (x) ε f (x)
lim dx = ∓iπ lim f (x) dx + lim dx.
R x ± iε
2 2 2 2
R π(x + ε ) R x +ε x
ε→0 + ε→0 + ε→0 +
The first integral approaches a Dirac delta function as ε → 0+ (it is a nascent delta function) and therefore, the
first term equals ∓iπf (0). The second term converges to a (real) Cauchy principal-value integral, so that
Z Z
f (x)
ℑ lim dx = ∓iπf (0) = ∓iπ δ(x)f (x)dx . (3.1)
ε→0+ R x ± iε R
3.2 The replica method 51
Assuming that both the Stieltjes transform and the density of eigenvalues are self-averaging,
we need to compute their expectation in the large size limit.
1
lim ESAN (λ) = −∂λ lim E log det (AN − λIN ) . (3.2)
N →∞ N →∞ N
Instead of directly using the replica trick on the log-det, we shall instead use the following
approach, which will turn out to be more practical:
h i
E log det (AN − λIN ) = −2E log det (AN − λIN )−1/2
so that, following the replica strategy of computing log X by computing instead (X n − 1)/n,
we shall need to compute average of the n-th power of det (AN − λIN )−1/2 . We now use the
Gaussian integral 2 to express the square root of determinant as an integral and write:
" n Z #
−n/2
Y dx − 12 x⊤ (AN −λIN )x
E det (AN − λIN ) =E e
RN (2π)N/2
a=1
n n
n
" #
dxa λ P
∥xa ∥22 − 12 xa⊤ AN xa
Z Y P
2
= N/2
e a=1 E e a=1 (3.3)
RN a=1 (2π)
For our Wigner matrix (the so-called GOE(N) ensemble) we can write A = √1 G + G⊤
2N
for G ∼ N (0, 1) i.i.d. We can thus write the average in eq. equation 3.3 as:
" n # " n # " G n #
− 12 xa⊤ AN xa − √1 xa Gxa − √ ij xa a
P P P
i xj
2N a=1
Y 2N a=1
EA e a=1 = EG e = EGij e .
ij
At this point, it is a good idea to refresh our knowledge of Gaussian integrals, as we shall use
them often. In particular, we have
Z r
−ax2 +bx π b2 h i b2
e dx = e 4a , or Ex ebx = e 2a
a
so that
!
" n # Pn xa a b b
i xj xi xj
N 2
− 21 xa⊤ AN xa xa ·xb
P
Pn 2 a,b=1 N2 N Pn
1
( a=1 xi xj )
a a 4
Y Y
EA e a=1 = e 4N = e =e 4 a,b=1 N
.
ij ij
2
Remember that Gaussian distributions are normalized so that
Z
1 dx 1 ⊤
p N/2
e− 2 x (AN −λIN )x = 1 .
det (AN − λIN ) RN (2π)
A more generic formula, that turns out to be used in 90% of replica computations, is that if A is a symmetric
positive-definite matrix, then
Z r
1 T T (2π)n 12 B T A−1 B
e− 2 x Ax+B x dn x = e .
det A
52 F. Krzakala and L. Zdeborová
This is a very typical step of a replica computation! We have now performed the integration
over the disorder (randomness) and reached:
n n
n λ P 1 a ·xa )+ N
P 1 a b 2
dxa N 2 a=1( N x (N x ·x )
Z
−n/2
Y 4
E det (AN − λIN ) = N/2
e a,b=1
.
RN a=1
(2π)
A very important phenomenon has occurred: we see that the integration over the disorder has
"coupled" the previously independent replicas. This is indeed what always happens when
integrating over the disorder. This new term, that coupled the replicas, is of fundamental
importance, and has a name: it is called the overlap between replicas. We shall thus define:
1 a b
q ab =: x ·x .
N
We now introduce a "delta function" to free the overlap order parameter, just like we did
previously for the magnetization in the random field Ising model. For any function f , we
have:
1 a b
Z Y
f x · x = Nn dq ab δ N q ab − xa · xb f (xa · xb )
N
1≤a≤b≤n
As before, we shall drop the N n prefactor, that does not count as we shall eventually take the
normalized logarithm and send N → ∞, and write
n n
Nλ qaa + N 2
P P
xa · xb qab
Z
−n/2
Y 2 4
E det (AN − λIN ) ≈ ab
δ q − ab
dq e a=1 a,b=1
. (3.4)
N
1≤a≤b≤n
Performing the exact same steps we used in the previous chapters, we now take the Fourier
representation of the delta functions (and change variables so that they appear "real" instead
of "complex"):
dq̂ ab − 1≤a≤b≤n q̂ (N q −x ·x )
ab ab a b
Z P
Y Y
ab a b
δ Nq −x ·x = e ,
2π
1≤a≤b≤n 1≤a≤b≤n
where:
n n
X λ X aa 1 X ab 2
Φ(q ab , q̂ ab ) = − q̂ ab q ab + q + q + Ψx (q̂ ab )
2 4
1≤a≤b≤n a=1 a,b=1
At N → ∞, the integral in equation 3.5 can be evaluated with the saddle point method, and
therefore:
−n/2 ab ab
E det (AN − λIN ) ≈ exp N extr Φ(q , q̂ )
q ab ,q̂ ab
Again, we are trapped with the extremization of a function, but this time it should be over a
n × n matrix, which seems like a complicated space. To make progress in the extremization
problem, we need can restrict our search for particular solutions, and we are going, again, to
assume replica symmetry:
1
q ab = δ ab q, q̂ ab = − δ ab q̂
2
With this ansatz, we have:
n n
X n X X 2
q̂ ab q ab = − q̂q, q aa = nq, q ab = nq 2
2
1≤a≤b≤n a=1 a,b=1
Finally,
n
n
dxa q̂ a=1(xa )2
P
dx
Z Y Z
1 2 n
Ψx (q̂) = log √ e = n log √ e− 2 q̂x = − log(q̂)
2π 2π 2
a=1
Putting together and applying the saddle point method, we thus reach
− nN
2
extr{log q̂−q q̂−λq− 21 q 2 }
2 e q,q̂
−1
− E log det (An − λIn )−1/2 ≈ −2 lim
N n→0
n
1
≈ extr log q̂ − q q̂ − λq − q 2
2
To solve this problem, we look at the saddle point equations obtained by taking the derivatives
with respect to the parameters (q, q̂):
1 1
q̂ = , q = −q̂ − λ ⇔ +q+λ=0
q q
This has two solutions:
√
⋆ −λ ± λ2 − 4
q± =
2
Finally, to get the Stieltjes transform, we use the relation in eq. equation 3.2:
√
1 ⋆ −λ ± λ2 − 4
lim ESAN (λ) = −∂λ lim E log det (A − λIN ) = q± =
N →∞ N →∞ N 2
we have thus found the Stieltjes transform, and we have two (!) solutions for λ > 0 and λ < 0.
Only one of them will be the correct one when we shall inverse the Stieltjes transform, but it
will be easy to check which one, since probabilities needs to be positive.
54 F. Krzakala and L. Zdeborová
Figure 3.2.1: Simulation of the semicircle law using a 1000 by 1000 Wigner matrix.
It is a useful exercise to use the cavity method instead of the replica one to compute the Stieltjes
transform. We start from the very definition
1 X 1
SAN (λ) = −
N λ − λi
i
3.3 The Cavity method 55
The idea behind the cavity computation consists in finding a recursion equation between the
transform for an N × N matrix and the one for an (N + 1) × (N + 1) matrix. Let us define:
MN = λ1 − AN RN = [λ1 − AN ]−1 = MN
−1
where RN is called the resolvant matrix. Using the formula for computing inverse of matrices
via their matrix of cofactors, we find
det MN
(RN +1 )N +1,N +1 =
det MN +1
Additionally, we can compute the determinant of the N + 1 matrix with the Laplace expansion
along the last row, and then on the last column, so that:
N
X
N
det MN +1 = (MN +1 )N +1,N +1 det MN − (MN +1 )N +1,k (MN +1 )l,N +1 Cl,k
k,l=1
with Cl,k
N the matrix of co-factors of M . Dividing the previous expression by det M , we
N N
thus find
det MN +1 1
=
det Mn (RN +1 )N +1,N +1
N
1 X
N
= (MN +1 )N +1,N +1 − (MN +1 )N +1,k (MN +1 )l,N +1 Cl,k
det MN
k,l=1
N
X
= λ − (AN +1 )N +1,N +1 − (AN +1 )N +1,k (AN +1 )l,N +1 (RN )k,l
k,l=1
At this point, we make the additional assumption that the off-diagonal elements of the resolvent
RN are of order O(N −1/2 ). This can be checked, for instance by expanding in powers of λ
with pertubation theory, and observing that, at each order in this expansion, the off-diagonal
elements are indeed of order N −1/2 . In which case the equation further simplifies to
1
(RN +1 )N +1,N +1 = N
−1
P
λ−N l=1 (RN )l,l
where we have used that the AN have i.i.d. elements. Given the matrix elements of the
diagonal of RN +1 are identically distributed, we find
1 1
TrRN +1 = 1
N λ − N TrRN
1
SAN +1 (λ) = −
λ + SAN (λ)
and we thus expect, as N increases, that the Stieltjes transform will converge to the fixed point.
This is indeed the same (correct) equation that we found with the replica method. We have
thus checked that both methods give the correct solution in this case.
56 F. Krzakala and L. Zdeborová
Bibliography
The Wigner random matrix was famously introduced in Wigner (1958). The use of the replica
method for random matrices iniated with the seminal work of Edwards and Jones Edwards
and Jones (1976). It has now grown into a field in iteself, with hundred of deep non-trivial
results. A good set of lecture notes on the subject can be found in Livan et al. (2018); Potters
and Bouchaud (2020). A classical mathematical reference is Bai and Silverstein (2010).
3.4 Exercises
The goal of this exercise is to repeat the replica computation for Wishart-Matrices,
and to derive their distribution of Eigenvalues, also called the Marcenko-Pastur law.
Wishart matrices are defined as follows: Consider a M × N random matrix X with i.i.d.
coefficients distributed from a standard normalized Gaussian N (0, 1). The Wishart
matrix Σ̂ is:
1
Σ̂ =: XX T .
N
In other words, these are the correlation matrices between random data points. As
such they are used in many concrete situations in data science and machine learning, in
particular to filter out the signal from the noise in data.
3. Perform simulation of such random matrices, for large values of M and N and
check your predictions for the distribution of eigenvalues.
4. Repeat the simulation for a value α > 1, and compare again with the distribution
νΣ̂ (λ). Are we missing something? Hint: the Stieltjes transform of a delta function
δ(λ) is −1/λ.
Chapter 4
e−βH(s)
PN,β (S = s) = .
ZN (β)
for the random field Ising model, but we shall now look at a more complex, and more interest-
ing, topology. We assume the existence of a graph Gij that connects some of the nodes, such
that Gij = 1 if i and j are connected, and 0 otherwise. So far, the last chapters dealt only with
the case of fully connected graphs Gij = 1 for all i, j. Now our Hamiltonian can be written as:
X N
X
HN,J,{h},G (s) = −J Si Sj − hi Si
(i,j)∈G i=1
Let us assume we have N spins i = 1, . . . , N , that are all isolated. In this case, their probability
distribution is simple enough, we have mi = tanh(βhi ). Imagine now we are connecting
these N spins to a new spin S0 , in the spirit of the cavity method. Of course, the mi are now
changed! Let us thus refer to the old values of mi as the "cavity ones" and write:
With this definition, note that (1 + Si mci )/2 = eβhi Si / cosh(βhi ). Our main question of interest
is: can we write the magnetization of the new spin m0 as a function of the {mci }? Let us try!
Clearly
P P
βS0 h0 + i βJSi S0 +βSi hi P βS0 h0
Q P βJSi S0 1+Si mci
S0 ,{s} S0 e S0 S0 e i Si e 2
m0 = P βS h +
P
βJS S +βS h
= 1+Si mci
S0 ,{s} e
βS h βJS S
0 0 i 0 i i
P Q P
S0 e Si e
i 0 0 i 0
i 2
+
X −X − Y X 1 + Si mi c
= with X s = eβsh0 eβJSi s
X +X − 2
i Si
X + −X −
1 1+ X + +X − 1 X+
atanh(m0 ) = log X + −X −
= log −
2 1− 2 X
X + +X −
X1 eβJ (1 + mci ) + e−βJ (1 − mci )
= βh0 + log
2 e−βJ (1 + mci ) + eβJ (1 − mci )
i
X1 cosh(βJ) + mci sinh(βJ)
= βh0 + log
2 cosh(βJ) − mci sinh(βJ)
i
X1 1 + mci tanh(βJ)
= βh0 + log
2 1 − mci tanh(βJ)
i
X
= βh0 + atanh (mci tanh(βJ)) .
i
We can thus finally write the magnetization of the new spins as the function of the spins in
the "old" system in a relatively simple form:
!
X
m0 = tanh βh0 + atanh (mci tanh(βJ)) (4.1)
i
It is quite simple to repeat the same argument iteratively on a tree. If we start with initial
conditions then we can write, at any layer of the tree,
X
mi→j = tanh βhi + atanh (mk→i tanh(βJ)) (4.2)
k∈∂i ̸=j
What if we want to know the true marginals? This is easy, we just write
X
mi = tanh βhi + atanh (mk→i tanh(βJ)) (4.3)
k∈∂i
This is the root of the so-called Belief propagation approach. We solve the problem for the
cavity marginals mi→j , which has a convenient interpretation as a message passing problem.
Once we know them all, we can compute the true marginal!
4.2 Exact recursion on a tree 59
This is a method that first appeared in statistical physics, when Bethe and Peierls used it as an
approximation of the regular lattice. Indeed, we could iterate this method on a large infinite
tree of connectivity, say c = 2d to approximate a hypercubic lattice of dimension d (where
each node has 2d neighbors). Consider for instance the situation with zero field. In this case,
the magnetization at distance ℓ from the leaves follows
mℓ+1 = tanh (c − 1)atanh mℓ tanh(βJ)
We can look for a fixed point of this equation, and check for which values of the (inverse)
temperature a non-zero value for the magnetization is possible. This is the same phenomenon
as in the Ising model, just with a slightly more complicated fixed point. By ploting this
equation, one realizes that, assuming d = 2c, this happens at β BP = ∞ for d = 1, β BP = 0.346
for d = 2, β BP = 0.203 for d = 3, β BP = 0.144 for d = 4, and β BP = 0.112 for d = 5. If we
compare these numbers to the actual transition on a real hypercubic lattice, we find β lattice = ∞
for d = 1, β lattice = 0.44 for d = 2, β lattice = 0.221 for d = 3, β lattice = 0.149 for d = 4 and
β lattice = 0.114 for d = 5. Not so bad, and in fact we see that the predictions become exact as d
grows! This approach is quite a good one to estimate a critical temperature. In fact, one can
show that it gives a rigorous upper bound on the ferromagnetic transition, in any topology!
The iterative approach we just discussed can be made completely generic on a tree graph! The
example that we have been considering so far reads, in full generality
X X
H=− Jij Si Sj − hi Si .
(ij)∈G i
This is an instance of a very generic type of model: those with pairwise interactions, where
the probability of each configuration is given by
1 Y Y
P ((S)) = ψi (Si ) ψij (Si , Sj )
Z
i (ij)∈G
X Y Y
Z = ψi (Si ) ψij (Si , Sj )
{Si=1,...,N } i (ij)∈G
and the connection is clear once we define ψij (Si , Sj ) = exp(βJij Si Sj ) and ψi (Si ) = exp(βhi Si ).
We want to compute Z on a tree, like we did for the RFIM. For two adjacent sites i and j,
the trick is to consider the variable Zi→j (Si ), defined as the partial partition function for the
sub-tree rooted at i, when excluding the branch directed towards j, with a fixed value Si of the
i spin variable. We also need to introduce Zi (Si ), the partition function of the entire complete
tree when, again, the variable i is fixed to a value Si . On a tree, these intermediate variables
can be computed exactly according to the following recursions
Y X
Zi→j (Si ) = ψi (Si ) Zk→i (Sk )ψik (Si , Sk ) (4.4)
k∈∂i\j Sk
Y X
Zi (Si ) = ψi (Si ) Zj→i (Sj )ψij (Si , Sj ) (4.5)
j∈∂i Sj
60 F. Krzakala and L. Zdeborová
where ∂i denotes the set of all neighbors of i. In order to write these equations, the only
assumption that has been made was that, for all k ̸= k ′ ∈ ∂i \ j, the messages Zk→i (Sk ) and
Zk′ →i (Sk′ ) are independent. On a tree, this is obviously true: since there are no loops, the sites
k and k ′ are connected only through i and we have "cut" this interaction when considering
the partial quantities. This recursion is very similar, in spirit, to the standard transfer matrix
method for a one-dimensional chain.
In practice, however, it turns out that working with partition functions (that is, numbers that
can be exponentially large in the system size) is somehow impractical. We can thus normalize
equation 4.4 and rewrite these recursions in terms of probabilities. Denoting ηi→j (Si ) as the
marginal probability distribution of the variable Si when the edge (ij) has been removed, we
have
Zi→j (Si ) Zi (Si )
ηi→j (Si ) = P ′ , ηi (Si ) = P ′ .
S ′ Zi→j (Si )
i S ′ Zi (Si ) i
So that the recursions equation 4.4 and equation 4.5 now read
ψi (Si ) Y X
ηi→j (Si ) = ηk→i (Sk )ψik (Si , Sk ) , (4.6)
zi→j
k∈∂i\j Sk
ψi (Si ) Y X
ηi (Si ) = ηj→i (Sj )ψij (Si , Sj ) , (4.7)
zi
j∈∂i Sj
The iterative equations equation 4.6 and equation 4.7), along with their normalization equa-
tion 4.8 and equation 4.9, are called the belief propagation equations. Indeed, since ηi→j (Si )
is the distribution of the variable Si when the edge to variable j is absent, it is convenient to
interpret it as the "belief" of the probability of Si in absence of j. It is also called a "cavity"
probability since it is derived by removing one node from the graph. The belief propagation
equations are used to define the belief propagation algorithm
1. Initialize the cavity messages (or beliefs) ηi→j (Si ) randomly or following a prior infor-
mation ψi (Si ) if we have one.
2. Update the messages in a random order following the belief propagation recursion
equation 4.6 and equation 4.7 until their convergence to their fixed point.
3. After convergence, use the beliefs to compute the complete marginal probability distri-
bution ηi (Si ) for each variable. This is the belief propagation estimate on the marginal
probability distribution for variable i.
4.2 Exact recursion on a tree 61
Using the resulting marginal distributions, one can compute, for instance, the equilibrium
local magnetization via mi = ⟨Si ⟩ = Si ηi (Si )Si , or basically any other local quantity of
P
interest.
At this point, since we have switched from partial partition sums to partial marginals, the
astute reader could complain that we have lost sight of our prime objective: the computation
of the partition function. Fortunately, one can compute it from the knowledge of the marginal
distributions. To do so, it is first useful to define the following quantity for every edge (ij):
X zj zi
zij = ηj→i (Sj )ηi→j (Si )ψij (Si , Sj ) = = ,
zj→i zi→j
Si ,Sj
where the last two equalities are obtained by plugging equation 4.6 into the first equality and
realizing that it almost gives equation 4.9. Using again equation 4.6 and equation 4.9, we
obtain
X Y X
zi = ψi (Si ) ηj→i (Sj )ψij (Si , Sj )
Si j∈∂i Sj
P
Zj→i (Sj ) Si Zi (Si )
X Y X
= ψi (Si ) P ′
ψij (Si , Sj ) = Q P ,
Si j∈∂i Sj S ′ Zj→i (S ) j∈∂i Sj Zj→i (Sj )
For any spin Si , the total partition function can be obtained using Z = Si Zi (Si ). We can
P
thus start from an arbitrary spin i
X Y X Y Y X
Z= Zi (Si ) = zi Zj→i (Sj ) = zi zj→i Zk→j (Sk ) ,
Si j∈∂i Sj j∈∂i k∈∂j\i Sk
and we continue to iterate this relation until we reach the leaves of the tree. Using eq4.2.1, we
obtain
Q
Y Y Y zj Y zk zi
Z = zi zj→i zk→j · · · = zi
··· = Q i
.
zij zjk (ij) zij
j∈∂i k∈∂j\i j∈∂i k∈∂j\i
We thus obtain the expression of the free energy in a convenient form, that can be computed
directly from the knowledge of the cavity messages, often called the Bethe free energy on a tree:
X X
fTree N = −T log Z = fi − fij ,
i (ij)
where fi is a "site term" coming from the normalization of the marginal distribution of site i,
and is related to the change in Z when the site i (and the corresponding edges) is added to
the system. Meanwhile, fij is an "edge" term that can be interpreted as the change in Z when
62 F. Krzakala and L. Zdeborová
the edge (ij) is added. This provides a convenient interpretation of the Bethe free energy
equation 4.2.1: it is the sum of the free energy fi for all sites but, since we have counted each
edge twice we correct this by subtracting fij .
We have now entirely solved the problem on a tree. There is, however, nothing that prevents
us from applying the same strategy on any graph. Indeed the algorithm we have described is
well defined on any graph, but we are not assured that it gives exact results nor that it will
converge. Using these equations on graphs with loops is sometimes referred to as loopy belief
propagation in Bayesian inference literature.
One may wonder if there is a connection between the BP approach and the variational one.
We may even wonder if this could be simply the same as using our variational bound with a
better approach than the naive mean field one! Sadly, the answer is no! We cannot prove in
general that the BP free entropy is a lower bound on any graph: indeed there are examples
where it is larger than log Z, and some where it is lower.
• There is however a connection between the variational approach and the BP one. If one
writes the variational approach and uses the following parametrization for the guess
(where ci = |∂i|): Q
ij bij (Si , Sj )
Q(S) = Q ci −1
i bi (S)
then it is possible to show that optimizing on the function bij and bi one finds the BP
free entropy. The sad news, however, is that Q(S) does not really correspond to a true
probability density, it is not always normalizable, so one cannot apply the variational
bound.
• In the case of ferromagnetic models, or more exactly, on attractive potential Ψij , and
only in this case, it can be shown rigorously that BP does give a lower bound on the free
entropy, on any graph. For such models (thus including the RFIM!) it is thus effectively
equivalent to a variational approach. In fact, it can be further shown that in the limit
of zero temperature, BP finds the ground state of the RFIM on any graph (through a
mapping to linear programming).
We shall now discuss the basic properties of sparse Erdős-Rényi (ER) random graphs.
An ER random graph is taken uniformly at random from the ensemble, denoted G(N, M ),
of graphs that have N vertices and M edges. To create such a graph, one has simply to add
M random edges to an empty graph. Alternatively, one can also define the so called G(N, p)
ensemble, where an edge exists independently for each pair of nodes with a given probability
0 < c/N < 1. The two ensembles are asymptotically equivalent in the large N limit, when
M = c(N − 1)/2. The constant c is called the average degree. We denote by ci the degree
4.3 Cavity on random graphs 63
Figure 4.3.1: Iterative construction of a random graph for the Cavity method. The average
degree of the graph is c.
of a node i, i.e. the number of nodes to which i is connected. The degrees are distributed
according to Poisson distribution, with average c.
Alternatively, one can also construct the so-called regular random graphs from the ensemble
R(N, c) with N vertices but where the degree of each vertex is fixed to be exactly c. This means
that the number of edges is also fixed to M = cN/2.
At the core of the cavity method is the fact that such random graphs locally look like trees, i.e.
there are no short cycles going trough a typical node. The key point is thus that, in this limit,
such random graphs can be considered locally as trees. The intuitive argument for this result
is the following one: starting from a random site, and moving following the edges, in ℓ steps
cℓ sites will be reached. In order to have a loop, we thus need cℓ ∼ N to be able to come back
on the initial site, and this gives ℓ ∼ log(N ).
Let us now apply our beloved cavity method on an ER random graph with N links and on
average M = cN/2 links. We will see how this method leads to free entropies using the same
telescopic sum trick as before. However, this should be done with caution. If we apply our
usual Cesaro trick naively and write
then we encounter a problem when choosing the value of m at each step. We want the new
spin to have on average c neighbors. That means we must add a Poissonian variable m with
mean c at each step, so that the new spin has the correct numbers of neighbors. But this also
adds a link to c spins in the previous graph, so that, on average, we went from M = cN/2 to
M ′ = cN/2 + c while we actually wanted to get M ′ = c(N + 1)/2. The difference is ∆M = c/2,
so we need to construct our sequence of graphs such that we remove m/2 links on average
64 F. Krzakala and L. Zdeborová
every time we add one spin connected to m previous spins. Therefore, we write instead
Concretely, this means that we must apply the Cesaro theorem to the following term to
compute the free entropy
ZN,M =N 2c ZN −1,M −c ZN,M =N 2c ZN −1,M −c+ 2c
AN = log = log − log .
ZN −1,M −c ZN −1,M −c+ 2c ZN −1,M −c ZN −1,M −c
We thus have two terms to compute to compute the free entropy, that corresponds to the 2-step
iteration on the graph depicted in Fig.4.3.1
1. The first one corresponds to the change in free entropy when one adds one spin to a graph,
connecting it to c spins with c cavities.
2. The second one corresponds to the change in free entropy when one adds c/2 links to a
graph, connecting c spins with cavities pairwise.
This corresponds exactly to what we found on the tree! Indeed, the in 4.2.1, we see that we
have N sites terms, minus M = cN/2 links terms! It is reassuring to find that this construction
gives us the same answer as on the tree.
Interestingly, both these equations depend on the distribution of the joint cavity magnetizations
in the graph so we can write
Z d
Z Y X
(site)
Φ = E dPe (d) dmi Qc ({mi }) log⟨2 cosh (βh0 + βJ Si )⟩{mi }
i=1 i∈∂0
Z
(link) c
Φ = E dm1 Q(m1 , m2 ) log⟨eβJSi Sj ⟩{m1 ,m2 }
2
where Pe (d) is the excess degree probability. For a regular graph with fixed connectivity c,
Pe (d) = δ(d − (c − 1)) while for a Erdos-renyi random graph, interestingly Pe (d) = P (d) again!
See exercises section.
4.3 Cavity on random graphs 65
At the level of rigor used in physics, these formulas can be further simplified! First, we
make the assumption that the distribution of cavity fields converges to a limit distribution,
independently of the disorder. Secondly, the crucial point is now to assume the distribution of
these cavity marginals factorizes! This makes sense: the different cavities {mi } are all far from
each other in the graph. If we are not exactly at a critical point (a phase transition) then the
correlations are not infinite range. This is case, our formula depends only on the single point
distribution Qc (m) and we thus write:
Z d Z
Z Y X
Φ(site) = dPe (d) dmi Qc (mi ) log⟨2 cosh (βh0 + βJ Si )⟩{mi }
i=1 i∈∂0
Z
c
Φ(link) = dm1 Q(m1 ) dm2 Q(m2 ) log⟨eβJSi Sj ⟩{m1 ,m2 }
2
Our task is thus to find the asymptotic distribution Q(m). At the level of rigor used in physics,
this is easily done. We obviously assume that Q(m) is unique, and does not depend on the
realization of the disorder (this is not so trivial). Then we realize that the distribution of cavity
fields must satisfy a recursion such as
Z Z d Z
Y
QcN +1 (m) = dh0 P (h0 ) dPe (d) dmi QcN +1 (mi )δ (m − fBP ({mi }, h0 ))
i=1
with !
X
fBP ({mi }, h0 ) = tanh βh0 + atanh (mi tanh(βJ))
i
Obviously, once we find the fixed point, we can compute the distribution to total magnetization,
which reads almost exactly the same, except now we have to use the actual distribution of
neighbors:
Z Z d Z
Y
Q(m) = dh0 P (h0 ) dPe (d) dmi Qc (mi )δ (m − fBP ({mi }, h0 ))
i=1
2. Note that the free energy is the same as the one on graphs!!! Dictionary GRAPH to
POPULATION
4.3.3 The relation between Loopy Belief Propagation and the Cavity method
• tree-like etc....
Well, we can try !!! First Qc (m) is not clearly self-averaging, but for sure
Z d Z
Y X
Φ ≤ maxQc (m) dPe (d) dmi Qc ({mi }) log⟨2 cosh (βh0 + βJ Si )⟩{m}
i=1 i∈∂0
Z
c
− dm1 Qc ({m1 }) dm2 Qc ({m2 }) log⟨eβJJSi Sj ⟩m1 ,m2
2
It is possible to show that the extremization leads to Q(m) being a solution of the cavity
recursion. This means that we obtain a bound! If we found a distribution Q(m) (or actually,
all of them) that satisfies eq., then we have a bound.
Can we get the converse bound easily? Sadly, no. The point is that BP is not a variational
method on a given instance, so we cannot use the mean-field technics! Fortunately, it can be
shown rigorously, that, for any ferromagnetic model, BP does give a lower bound on the free
entropy.
It is also instructive to compare to what we had in the fully connected model. Indeed, if Q(m)
become a delta (which we expect as c grows) we obtain, using J = 1/N
N 2 m2
Φ ≤ maxm Eh log 2 cosh (βh + βm) − log eβm /N = Eh log 2 cosh (βh + βm) − β
2 2
which is indeed the result we had in the fully connected limit!
Clearly, this must be a good approach for describing a system in a paramagnetic phase, or
even a system with a ferromagnetic transition (where we should expect to have two different
fixed points of the iterations). It could be, however, that there exists a huge number of fixed
points for these equations: how to deal with this situation? Should they all correspond to a
given pure state? Fortunately, we do not have such worries, as the situation we just described
is the one arising when there is a glass transition. In this case, one needs to use the cavity
method in conjunction with the so-called “replica symmetry breaking” approach as was done
by Mézard, Parisi, and Virasoro.
Bibliography
The cavity method is motivated by the original ideas from Bethe (1935) and Peierls (1936) and
used in detail to study ferromagnetism (Weiss, 1948). Belief propagation first appeared in
4.4 Exercises 67
computer science in the context of Shanon error correction (Gallager, 1962) and was rediscov-
ered in many different contexts. The name "Belief Propagation" comes in particular from Pearl
(1982). The deep relation between loopy belief propagation and the Bethe "cavity" approach
was discussed in the early 2000s, for instance in Opper and Saad (2001) and Wainwright and
Jordan (2008). The work by Yedidia et al. (2003) was particularly influential. The construction
of the cavity method on random graphs presented in this chapter follows the classical papers
Mézard and Parisi (2001, 2003). That loopy belief propagation gives a lower bound on the
true partition on any graph in the case of ferromagnetic (and in general attractive models) is a
deep non-trivial result proven (partially) by Willsky et al. (2007) using the loop calculus of
Chertkov and Chernyak (2006), and (fully) by Ruozzi (2012). Chertkov (2008) showed how
beleif propagation finds the ground state of the RFIM at zero temperature. Finally, that the
cavity method gives rigorous upper bounds on the critical temperature was shown in Saade
et al. (2017).
4.4 Exercises
Consider an Erdős–Rényi random graph with $N$ nodes and $M$ links in the asymptotic regime where $N \to \infty$, with $c = 2M/N$ the average degree.

1. Consider one given node: what is the probability $p$ that it is connected with another given node, say $j$? Since it has $N-1$ potential neighbors, show that the probability distribution of the number of neighbors of each node follows
$$\mathbb{P}(d = k) = \frac{c^k e^{-c}}{k!} \tag{4.10}$$
In the cavity method, we are often interested as well in the excess degree distribution: given one site $i$ that has a neighbor $j$, what is the distribution of its additional neighbors $d$?

2. Argue that finding first a link $(ij)$ and then looking at $i$ is equivalent to sampling nodes with a probability proportional to their number of neighbors:
$$P_i = \frac{d_i}{c} \tag{4.11}$$

3. Finally, show that the probability of having $k+1$ neighbors when each site is chosen with probability $d_i/c$ is
$$\mathbb{P}(d = k+1) = \frac{c^{k+1} e^{-c}}{(k+1)!}\, \frac{k+1}{c} \tag{4.12}$$
so that the probability distribution of the excess degree reads
$$P_e(d = k) = \frac{c^k e^{-c}}{k!} \tag{4.13}$$
Exercise 4.2: The random field Ising model on a regular random graph
We have seen that the BP update equation for the RFIM is given by
$$f_{\rm BP}(\{m_i\}, h_0) = \tanh\Big(\beta h_0 + \sum_i \mathrm{atanh}\big(m_i \tanh(\beta J)\big)\Big) \tag{4.14}$$
and that the distribution of cavity fields follows (for a random regular graph with connectivity $c$, so that $c-1$ messages enter each cavity recursion):
$$Q^{\rm cav.}(m) = \int dh_0\, \mathcal{N}(h_0; 0, \Delta) \prod_{i=1}^{c-1} \int dm_i\, Q^{\rm cav.}(m_i)\, \delta\big(m - f_{\rm BP}(\{m_i\}, h_0)\big) \tag{4.15}$$
This can be solved in practice using the population dynamics approach, where we represent $Q^{\rm cav.}(m)$ by a population of $N_{\rm pop}$ elements. Formally, we iterate a collection, or pool, of elements. Starting from $Q^{t=0}_{\rm cav.}(m) = \{m_1, m_2, m_3, \ldots, m_{N_{\rm pop}}\}$:

• For $T$ steps: draw $c-1$ elements $m_{i_1},\ldots,m_{i_{c-1}}$ at random from the pool and a field $h_0 \sim \mathcal{N}(0,\Delta)$, compute $m_{\rm new} = f_{\rm BP}(\{m_{i_k}\}, h_0)$, and replace a randomly chosen element of the pool by $m_{\rm new}$.

If $N_{\rm pop}$ is large enough (say $10^5$) then this is a good approximation of the true population density, and if $T$ is large enough, then we should have converged to the fixed point. Once this is done, we can compute the average magnetization by computing the true marginal as follows:

• Set $m = 0$.
• For $N_{\rm pop}$ steps: draw $c$ elements from the pool and a field $h_0 \sim \mathcal{N}(0,\Delta)$, compute $m_{\rm new}$ with the full BP update (now with $c$ incoming messages), and set $m = m + m_{\rm new}/N_{\rm pop}$.
2. Implement the population dynamics and find numerically the phase transition point where $m(\beta, \Delta)$ becomes non-zero. Draw the phase diagram in the $(\beta, \Delta)$ plane separating the phase where $m = 0$ from the one where $m \neq 0$. (A minimal implementation sketch is given below.)
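The following is a minimal population-dynamics sketch for the cavity recursion (4.14)-(4.15), assuming $\Delta$ denotes the variance of the random fields. All names, pool sizes and parameter values are illustrative choices, not fixed by the text.

import numpy as np

def f_BP(ms, h0, beta, J=1.0):
    """BP update (4.14): cavity magnetization from incoming messages."""
    return np.tanh(beta * h0 + np.sum(np.arctanh(ms * np.tanh(beta * J))))

def population_dynamics(beta, Delta, c, N_pop=10_000, T=50, J=1.0, rng=None):
    rng = np.random.default_rng(rng)
    pop = rng.uniform(-1, 1, size=N_pop)           # pool representing Q_cav(m)
    for _ in range(T * N_pop):                     # T sweeps over the pool
        idx = rng.integers(0, N_pop, size=c - 1)   # draw c-1 cavity messages
        h0 = rng.normal(0.0, np.sqrt(Delta))       # field h0 ~ N(0, Delta)
        pop[rng.integers(N_pop)] = f_BP(pop[idx], h0, beta, J)
    return pop

def average_magnetization(pop, beta, Delta, c, n_samples=10_000, J=1.0, rng=None):
    """True marginal: c (not c-1) incoming messages plus the field."""
    rng = np.random.default_rng(rng)
    m = 0.0
    for _ in range(n_samples):
        idx = rng.integers(0, len(pop), size=c)
        h0 = rng.normal(0.0, np.sqrt(Delta))
        m += f_BP(pop[idx], h0, beta, J) / n_samples
    return m

pop = population_dynamics(beta=1.0, Delta=0.5, c=3)
print("average magnetization:", average_magnetization(pop, 1.0, 0.5, 3))

Scanning $\beta$ and $\Delta$ with this routine and recording where the returned magnetization departs from zero produces the phase diagram asked for in the exercise.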
3. Now, let us specialize to the low (and eventually zero) temperature limit. Using the change of variables $m_i = \tanh(\beta \tilde h_i)$, show that the iteration has the following limit when $\beta \to \infty$:
$$\tilde h_{\rm new} = \lim_{\beta\to\infty} \frac{1}{\beta}\,\mathrm{atanh}\, f_{\rm BP}(\{m_i\}, h_0) = h_0 + \sum_i \phi(\tilde h_i) \tag{4.16}$$
with
$$\phi(x) = \begin{cases} x, & \text{if } |x| < 1 \\ \mathrm{sign}(x), & \text{if } |x| > 1 \end{cases} \tag{4.17}$$
4. Using this equation, perform the population dynamics at zero temperature and compute the critical value of $\Delta$ at which a non-zero magnetization appears.
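A short sketch of the zero-temperature variant, which iterates cavity fields instead of magnetizations using eqs. (4.16)-(4.17); as above, the parameters are illustrative assumptions.

import numpy as np

def phi(x):
    return np.clip(x, -1.0, 1.0)   # phi(x) = x for |x| < 1, sign(x) otherwise

def population_dynamics_T0(Delta, c, N_pop=10_000, T=50, rng=None):
    rng = np.random.default_rng(rng)
    pop = rng.normal(0.0, np.sqrt(Delta), size=N_pop)   # pool of cavity fields
    for _ in range(T * N_pop):
        idx = rng.integers(0, N_pop, size=c - 1)
        h0 = rng.normal(0.0, np.sqrt(Delta))
        pop[rng.integers(N_pop)] = h0 + np.sum(phi(pop[idx]))
    return pop

# The magnetization can then be monitored through, e.g., np.mean(np.sign(pop)).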
Chapter 5
A common theme in the previous chapters has been the study of the Boltzmann-Gibbs distri-
bution:
$$P_{N,\beta}(S = s) = \frac{e^{-\beta H(s)}}{Z_N(\beta)}\,.$$
This is a joint probability distribution over the random variables S1 , . . . , SN defined through
the energy function H. In the two examples we have seen so far, the energy function is
composed of two pieces: an interaction term that couples different random variables and a
potential term which acts on each random variable separately. For instance, for the Curie-Weiss
model:
$$H_{N,h}(s) = \underbrace{-\frac{1}{2N}\sum_{i,j=1}^{N} s_i s_j}_{\text{interaction}} \;\underbrace{-\,h \sum_{i=1}^{N} s_i}_{\text{potential}}$$
Notice that it is the interaction term that correlates the random variables: if it was zero, the
Gibbs-Boltzmann distribution would factorize and we would be able to fully characterize the
system by studying each variable independently. It is the interaction term that makes the
problem truly multi-dimensional.
In the Curie-Weiss model and RFIM, the interaction term is quadratic: it couples the ran-
dom variables pairwise. In the Chapters that follow, we will study many other examples of
Gibbs-Boltzmann distributions, each defined by different flavours of variables, potentials and
interaction terms. Therefore, it will be useful to introduce a very general way to think about
and represent multi-dimensional probability distributions. This is the subject of this Chapter.
5.1 Graphical Models

To proceed with the study of probabilistic models, we introduce a tool called Graphical Models that will give us a neat and very generic way to think about a broad range of probability distributions. A Graphical Model is a way to represent relations or correlations between variables.
In this section we give basic definitions and introduce a couple of examples that will be studied
in more detail later in the class.
5.1.1 Graphs
Let $|S|$ denote the size of a set $S$. For the purpose of this section we denote the total number of nodes by $|V| = N$ and the total number of edges by $|E| = M$. The adjacency matrix of the graph $G(V, E)$ is a symmetric $N \times N$ binary matrix $A \in \{0,1\}^{N\times N}$ with entries
$$A_{ij} =: \begin{cases} 1 & \text{if } (ij) \in E \\ 0 & \text{if } (ij) \notin E \end{cases}$$
The degree $d_i$ of node $i$ is its number of neighbors,
$$d_i = \sum_{j=1}^{N} A_{ij} = \sum_{j=1}^{N} \mathbb{1}_{(ij)\in E}\,.$$
• A factor graph is a bipartite graph with two types of nodes: variable nodes (drawn as circles) and factor nodes (drawn as squares), with edges connecting each factor node to the variable nodes it involves.
A graphical model represents a joint probability distribution over the variables $\{s_i\}_{i=1}^N$:
$$P\big(\{s_i\}_{i=1}^N\big) =: \frac{1}{Z_N} \prod_{a=1}^{M} f_a\big(\{s_j\}_{j\in\partial a}\big)\,,$$
where $\partial a$ denotes the set of variable nodes connected to factor node $a$.
In this lecture we will use graphical models extensively as a language to represent probability
distributions arising in optimization, inference and learning problems. We will study a variety
of fa , Λ and graphical models. Let us start by giving several examples.
Consider a graph $G(V, E)$ as a graph of interactions, e.g. a 3D cubic lattice with $N = 10^{23}$ nodes. The nodes may represent $N$ Ising spins $s_i \in \Lambda = \{-1, +1\}$. In statistical physics, systems are often defined by their energy function, which we call the Hamiltonian. One can think of the Hamiltonian as a simple cost function where lower values are better. The Hamiltonian of a spin glass then reads
$$H\big(\{s_i\}_{i=1}^N\big) = -\sum_{(ij)\in E} J_{ij}\, s_i s_j - \sum_i h_i\, s_i$$
where Jij ∈ R are the interactions and hi ∈ R are magnetic fields. Associated with the
Hamiltonian, we consider the Boltzmann probability distribution defined as:
$$P\big(\{s_i\}_{i=1}^N\big) = \frac{1}{Z_N}\, e^{-\beta H(\{s_i\}_{i=1}^N)} = \frac{1}{Z_N} \prod_{i=1}^{N} e^{\beta h_i s_i} \prod_{(ij)\in E} e^{\beta J_{ij} s_i s_j}\,,$$
where the normalization constant $Z_N = \sum_{\{s_i\}_{i=1}^N} e^{-\beta H(\{s_i\}_{i=1}^N)}$ is called the partition function.
The graphical model associated with a spin glass defined on a graph $G(V, E)$ contains two types of factor nodes:
• For each node $i$, one factor node $f_i(s_i) = e^{\beta h_i s_i}$ corresponding to the magnetic field $h_i$.
• For each edge $(ij) \in E$, one factor node $f_{(ij)}(s_i, s_j) = e^{\beta J_{ij} s_i s_j}$ corresponding to the interaction $J_{ij}$.
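A minimal sketch of this factor-graph representation for a spin glass on a tiny graph: one single-variable factor per node and one pairwise factor per edge. The data structure (a list of (scope, function) pairs) is an illustrative choice, not a prescribed API, and the brute-force sum is only feasible on very small instances.

import numpy as np
from itertools import product

def spin_glass_factors(edges, J, h, beta):
    factors = [((i,), lambda s, i=i: np.exp(beta * h[i] * s[0])) for i in h]
    factors += [((i, j), lambda s, i=i, j=j: np.exp(beta * J[(i, j)] * s[0] * s[1]))
                for (i, j) in edges]
    return factors

def brute_force_Z(factors, variables):
    """Exact partition function by exhaustive enumeration (tiny graphs only)."""
    Z = 0.0
    for config in product([-1, +1], repeat=len(variables)):
        x = dict(zip(variables, config))
        Z += np.prod([f(tuple(x[v] for v in scope)) for scope, f in factors])
    return Z

edges = [(0, 1), (1, 2), (0, 2)]                     # a triangle
J = {e: 1.0 for e in edges}
h = {0: 0.1, 1: -0.2, 2: 0.0}
print("Z =", brute_force_Z(spin_glass_factors(edges, J, h, beta=1.0), [0, 1, 2]))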
Consider a graph G(V, E), and a set of q colors si ∈ {red, blue, green, yellow, . . . , black} =
{1, 2, . . . , q}. In graph coloring we seek to assign a color to each node so that neighbors do not
have the same color.
In the figure we show a proper 4-coloring of the corresponding (planar) graph. Note that the
same graph can be also colored using only 3 colors, but not 2 colors.
To set up graph coloring in the language of graphical models, we write the number of proper colorings of the graph $G(V, E)$ as
$$Z_N = \sum_{\{s_i\}_{i=1}^N} \prod_{(ij)\in E} \big(1 - \delta_{s_i,s_j}\big)$$
The graph is colorable if and only if $Z_N \ge 1$; in that case we can also define a probability measure uniform over all proper colorings as
$$P\big(\{s_i\}_{i=1}^N\big) = \frac{1}{Z_N} \prod_{(ij)\in E} \big(1 - \delta_{s_i,s_j}\big)$$
We can also soften the constraint on colors and introduce a more general probability distribution:
$$P\big(\{s_i\}_{i=1}^N, \beta\big) = \frac{1}{Z_N(\beta)} \prod_{(ij)\in E} e^{-\beta \delta_{s_i,s_j}}$$
As $\beta \nearrow \infty$ we recover the case with strict constraints. The graphical model for graph coloring then corresponds to factor nodes on each edge $(ij)$ of the form $f_{(ij)}(s_i, s_j) = e^{-\beta \delta_{s_i,s_j}}$.
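A brute-force sketch of these two definitions of $Z_N$ for a tiny instance, both with hard constraints (counting proper colorings) and with the softened factors $e^{-\beta\delta}$; the example graph is an illustrative choice.

from itertools import product
import math

def coloring_Z(edges, n_nodes, q, beta=None):
    Z = 0.0
    for colors in product(range(q), repeat=n_nodes):
        if beta is None:                      # hard constraints: product of (1 - delta)
            Z += all(colors[i] != colors[j] for i, j in edges)
        else:                                 # soft constraints: exp(-beta * #violations)
            Z += math.exp(-beta * sum(colors[i] == colors[j] for i, j in edges))
    return Z

# A 5-cycle with q = 3 has 30 proper colorings; the soft version approaches
# this count as beta grows.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
print(coloring_Z(edges, 5, q=3), coloring_Z(edges, 5, q=3, beta=10.0))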
Probability distributions that are readily represented via graphical models also naturally arise in statistical inference. An example is the Stochastic Block Model (SBM), a commonly considered model for community detection in networks. In the SBM, $N$ nodes are divided into $q$ groups, the group of node $i$ being written $s^*_i \in \{1, 2, \ldots, q\}$ for $i = 1, \ldots, N$. Node $i$ is assigned to group $s^*_i = a$ with probability (the expected fraction of group size) $n_a \ge 0$, where $\sum_{a=1}^{q} n_a = 1$. Pairs of nodes are then connected with a probability that depends on their group memberships. Specifically:
$$P\big((ij) \in E \,\big|\, s^*_i, s^*_j\big) = p_{s^*_i s^*_j}\,, \qquad P\big((ij) \notin E \,\big|\, s^*_i, s^*_j\big) = 1 - p_{s^*_i s^*_j}\,,$$
which generates the graph $G(V, E)$ and its adjacency matrix $A_{ij}$.
The question of community detection is whether, given the edges Aij and the parameters
θ = (na , pab , q), one can retrieve s∗i for all i = 1, . . . , N . A less demanding challenge is to find
an estimator ŝi such that ŝi = s∗i for as many nodes i as possible.
We adopt the framework of Bayesian inference. All the information we have about $s^*_i$ is included in the posterior probability distribution¹:
$$P\big(\{s_i\}_{i=1}^N \,\big|\, A, \theta\big) = \frac{1}{Z_N(A,\theta)}\, P\big(A \,\big|\, \{s_i\}_{i=1}^N, \theta\big)\, P\big(\{s_i\}_{i=1}^N \,\big|\, \theta\big) = \frac{1}{Z_N(A,\theta)} \prod_{i<j} \Big[\big(1 - p_{s_i,s_j}\big)^{1-A_{ij}}\, p_{s_i,s_j}^{A_{ij}}\Big] \prod_{i=1}^{N} n_{s_i}$$
¹ Note that $s_i$ in the posterior distribution is just a dummy variable, the argument of a function.
The corresponding graphical model has one factor node per variable (field), and one factor
node per pair of nodes (interaction). Indeed, we notice that even pairs without edge (where
Aij = 0) have a factor node with fij (si , sj ) = 1 − psi ,sj . In the class we will study this posterior
quite extensively.
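A minimal sketch of sampling an SBM instance (planted groups $s^*$ and adjacency matrix $A$) as described above; the group proportions and connection probabilities below are illustrative assumptions.

import numpy as np

def sample_sbm(N, n, p, rng=None):
    """n: group proportions (length q); p: q x q matrix of connection probabilities."""
    rng = np.random.default_rng(rng)
    s_star = rng.choice(len(n), size=N, p=n)              # group assignments
    A = np.zeros((N, N), dtype=int)
    for i in range(N):
        for j in range(i + 1, N):
            if rng.random() < p[s_star[i], s_star[j]]:
                A[i, j] = A[j, i] = 1
    return s_star, A

# Two symmetric groups, denser inside than across (assortative case).
n = [0.5, 0.5]
p = np.array([[0.05, 0.01],
              [0.01, 0.05]])
s_star, A = sample_sbm(N=200, n=n, p=p, rng=0)
print("average degree:", A.sum(axis=1).mean())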
Let n be the number of data samples and d be the dimension of data. Samples are denoted
Xµ ∈ Rd and labels yµ ∈ {−1, +1} (cats/dogs), for µ = 1, . . . , n. Generalized linear regression
is then formulated as the minimization of a loss function of the form
$$\mathcal{L}(w) = \sum_{\mu=1}^{n} \ell\big(y_\mu, X_\mu \cdot w\big) + \sum_{i=1}^{d} r(w_i)\,.$$
The loss minimization problem can then be regarded as the $\beta \to \infty$ limit of the following probability measure:
$$P(w) = \frac{1}{Z_N(X, y, \beta)}\, e^{-\beta \mathcal{L}(w)} = \frac{1}{Z_N(X, y, \beta)} \prod_{i=1}^{d} e^{-\beta r(w_i)} \prod_{\mu=1}^{n} e^{-\beta \ell(y_\mu, X_\mu\cdot w)}$$
In the corresponding factor graph, each weight $w_i$ carries a single-variable factor node $f_i(w_i) = e^{-\beta r(w_i)}$ coming from the regularization, while each data sample $\mu$ contributes a factor node $e^{-\beta \ell(y_\mu, X_\mu\cdot w)}$ connected to all the weights.
As we saw in the above examples the factors are often of two types: (i) factor nodes that are
related to only one variable — such as the magnetic field in the spin glass, the regularization
in the generalized linear regression, or the prior in the stochastic block model — and (ii)
factor nodes that are related to interactions between the variables. For convenience, we will
thus treat those two types separately and denote the type (i) as gi (si ), and the type (ii) as
fa ({si }i∈∂a ). Factor nodes of type (i) will be denoted with indices i, j, k, l, . . . , factors of type
(ii) will be denoted with indices a, b, c, d, . . . . In this notation the probability distribution of
interest becomes:
$$P\big(\{s_i\}_{i=1}^N\big) = \frac{1}{Z} \prod_{i=1}^{N} g_i(s_i) \prod_{a=1}^{M} f_a\big(\{s_i\}_{i\in\partial a}\big)$$
As we saw already and will see throughout the course, quantities of interest can be extracted
from the value of the normalization constant, or partition function in physics jargon, that
reads
$$Z = \sum_{\{s_i\}_{i=1}^N} \prod_{i=1}^{N} g_i(s_i) \prod_{a=1}^{M} f_a\big(\{s_i\}_{i\in\partial a}\big)\,.$$
We will in particular be interested in the value of the free entropy N Φ = log Z. Another quan-
tity of interest is the marginal distribution for each variable, related to the local magnetization
in physics, defined as
$$\mu_i(s_i) =: \sum_{\{s_j\}_{j\neq i}} P\big(\{s_j\}_{j=1}^N\big)$$
The hurdle with computing the partition function and the marginals for large system sizes
is that it entails evaluating sums over a number of terms that is exponential in N . From the
computational complexity point of view, we do not know of exact polynomial algorithms able
to compute these sums for a general graphical model. In the rest of the lecture, we will cover
cases where the marginals and the free entropy can be computed exactly up to the leading
order in the system size N .
A special case of graphical models where the marginals and the free entropy can be computed exactly is that of tree graphical models, i.e. graphs that do not contain loops. How to approach this case is explained in the next section. Then, we will show that problems on graphs that locally look like trees, i.e. where the shortest loop going through a typical node is long, can be solved by carefully employing a tree-like approximation. This type of approximation holds for graphical models corresponding to sparse random graphs, i.e. those where the average degree is constant as the size $N$ grows. We will see that a range of problems defined on densely connected factor graphs can also be solved exactly in the large size limit.
5.2 Belief Propagation and the Bethe free energy

We will start by computing the partition function $Z$ and the marginals $\mu_i(s_i)$ for tree graphical models. We recall that the probability distribution under consideration is
$$P\big(\{s_i\}_{i=1}^N\big) = \frac{1}{Z} \prod_{i=1}^{N} g_i(s_i) \prod_{a=1}^{M} f_a\big(\{s_i\}_{i\in\partial a}\big)$$
In order to express the marginals and the partition function, we define auxiliary partition functions for every $(ia) \in E$:
$$R^{j\to a}_{s_j} = g_j(s_j) \sum_{\{s_k\}_{\text{all } k \text{ above } j}} \prod_{\text{all } k \text{ above } j} g_k(s_k) \prod_{\text{all } b \text{ above } j} f_b\big(\{s_l\}_{l\in\partial b}\big)$$
$$V^{a\to i}_{s_i} = \sum_{\{s_j\}_{\text{all } j \text{ above } a}} f_a\big(\{s_k\}_{k\in\partial a}\big) \prod_{\text{all } j \text{ above } a} g_j(s_j) \prod_{\text{all } b \text{ above } a} f_b\big(\{s_k\}_{k\in\partial b}\big)$$
The meaning of these quantities can be understood from the figure above: $R^{j\to a}_{s_j}$ represents the partition function of the part of the system above the red dotted line, with variable node $j$ restricted to taking value $s_j$. Analogously, $V^{a\to i}_{s_i}$ is the partition function of the subsystem above the blue dotted line, with variable node $i$ restricted to taking value $s_i$.²
Since the graphical model is a tree, i.e. it has no loops, the restriction of the variable j to
sj makes the branches above j independent. We can thus split the sum over the variables,
according to the branch to which they belong, to obtain
$$R^{j\to a}_{s_j} = g_j(s_j) \prod_{b\in\partial j\setminus a} \underbrace{\sum_{\{s_k\}_{\text{all } k \text{ above } b}} f_b\big(\{s_k\}_{k\in\partial b}\big) \prod_{\text{all } k \text{ above } b} g_k(s_k) \prod_{\text{all } c \text{ above } b} f_c\big(\{s_l\}_{l\in\partial c}\big)}_{=\,V^{b\to j}_{s_j}} = g_j(s_j) \prod_{b\in\partial j\setminus a} V^{b\to j}_{s_j}$$
² Note that $s_i$ appears in the argument of $f_a$ since $i \in \partial a$.
Analogously, we can split the sum over the different branches of the tree above factor node $a$, to get
$$V^{a\to i}_{s_i} = \sum_{\{s_j\}_{j\in\partial a\setminus i}} f_a\big(\{s_j\}_{j\in\partial a}\big) \prod_{j\in\partial a\setminus i} \underbrace{g_j(s_j) \sum_{\{s_k\}_{\text{all } k \text{ above } j}} \prod_{\text{all } k \text{ above } j} g_k(s_k) \prod_{\text{all } b \text{ above } j} f_b\big(\{s_l\}_{l\in\partial b}\big)}_{=\,R^{j\to a}_{s_j}} = \sum_{\{s_j\}_{j\in\partial a\setminus i}} f_a\big(\{s_j\}_{j\in\partial a}\big) \prod_{j\in\partial a\setminus i} R^{j\to a}_{s_j} \qquad (*)$$
If we start on the leaves of the tree, i.e. variable nodes that only belong to one factor $a$, we have from the definition $R^{j\to a}_{s_j} = g_j(s_j)$ for $j$ being a leaf. This, together with the relations above, would allow us to collect recursively the contribution from all branches and compute the total partition function of a tree graphical model rooted in node $j$ as
$$Z = \sum_{s_j} g_j(s_j) \prod_{b\in\partial j} V^{b\to j}_{s_j}\,.$$
This value will not depend on the node in which we rooted the tree, as all the contributions to
the partition function are accounted for regardless of the root.
The partition function usually scales like $\exp(cN)$, exponentially in the system size (simply because it is a sum over exponentially many terms), which is a huge number. A more convenient way to deal with the above restricted partition functions $R$ and $V$ is to define messages $\chi^{j\to a}_{s_j}$ and $\psi^{a\to i}_{s_i}$ as follows:
$$\chi^{j\to a}_{s_j} =: \frac{R^{j\to a}_{s_j}}{\sum_{s} R^{j\to a}_{s}} \quad \text{so that} \quad \sum_{s} \chi^{j\to a}_{s} = 1, \;\; \forall\, (ja)\in E$$
$$\psi^{a\to i}_{s_i} =: \frac{V^{a\to i}_{s_i}}{\sum_{s} V^{a\to i}_{s}} \quad \text{so that} \quad \sum_{s} \psi^{a\to i}_{s} = 1, \;\; \forall\, (ia)\in E\,. \qquad (**)$$
Dividing the recursion $(*)$ by the normalization, we obtain
$$\psi^{a\to i}_{s_i} \overset{\text{by }(*)}{=} \frac{\sum_{\{s_j\}_{j\in\partial a\setminus i}} f_a\big(\{s_j\}_{j\in\partial a}\big) \prod_{j\in\partial a\setminus i} R^{j\to a}_{s_j}}{\sum_{s_i}\sum_{\{s_j\}_{j\in\partial a\setminus i}} f_a\big(\{s_j\}_{j\in\partial a}\big) \prod_{j\in\partial a\setminus i} R^{j\to a}_{s_j}} \times \underbrace{\frac{\prod_{j\in\partial a\setminus i} \sum_{s'_j} R^{j\to a}_{s'_j}}{\prod_{j\in\partial a\setminus i} \sum_{s''_j} R^{j\to a}_{s''_j}}}_{=1}
= \frac{\sum_{\{s_j\}_{j\in\partial a\setminus i}} f_a\big(\{s_j\}_{j\in\partial a}\big) \prod_{j\in\partial a\setminus i} \frac{R^{j\to a}_{s_j}}{\sum_{s''_j} R^{j\to a}_{s''_j}}}{\sum_{s_i}\sum_{\{s_j\}_{j\in\partial a\setminus i}} f_a\big(\{s_j\}_{j\in\partial a}\big) \prod_{j\in\partial a\setminus i} \frac{R^{j\to a}_{s_j}}{\sum_{s'_j} R^{j\to a}_{s'_j}}}
\overset{\text{by }(**)}{=} \frac{\sum_{\{s_j\}_{j\in\partial a\setminus i}} f_a\big(\{s_j\}_{j\in\partial a}\big) \prod_{j\in\partial a\setminus i} \chi^{j\to a}_{s_j}}{\sum_{s_i}\sum_{\{s_j\}_{j\in\partial a\setminus i}} f_a\big(\{s_j\}_{j\in\partial a}\big) \prod_{j\in\partial a\setminus i} \chi^{j\to a}_{s_j}}
= \frac{1}{Z^{a\to i}} \sum_{\{s_j\}_{j\in\partial a\setminus i}} f_a\big(\{s_j\}_{j\in\partial a}\big) \prod_{j\in\partial a\setminus i} \chi^{j\to a}_{s_j}$$
with
$$Z^{a\to i} =: \sum_{s_i}\sum_{\{s_j\}_{j\in\partial a\setminus i}} f_a\big(\{s_j\}_{j\in\partial a}\big) \prod_{j\in\partial a\setminus i} \chi^{j\to a}_{s_j} = \sum_{\{s_j\}_{j\in\partial a}} f_a\big(\{s_j\}_{j\in\partial a}\big) \prod_{j\in\partial a\setminus i} \chi^{j\to a}_{s_j}\,.$$
Similarly, dividing the relation $R^{j\to a}_{s_j} = g_j(s_j)\prod_{b\in\partial j\setminus a} V^{b\to j}_{s_j}$ by its normalization gives
$$\chi^{j\to a}_{s_j} = \frac{1}{Z^{j\to a}}\, g_j(s_j) \prod_{b\in\partial j\setminus a} \psi^{b\to j}_{s_j}\,, \qquad Z^{j\to a} =: \sum_{s_j} g_j(s_j)\prod_{b\in\partial j\setminus a}\psi^{b\to j}_{s_j}\,.$$
Above, we just obtained the self-consistent equations for the messages χ’s and ψ’s that are called
the Belief Propagation equations.
Now, we want the marginals $\mu_i(s_i)$ and the partition function $Z$ expressed in a way that can be generalized to factor graphs that are not trees. With this in mind, we first write the marginals in terms of the messages,
$$\mu_i(s_i) = \frac{g_i(s_i) \prod_{a\in\partial i} \psi^{a\to i}_{s_i}}{\sum_{s} g_i(s) \prod_{a\in\partial i} \psi^{a\to i}_{s}}\,.$$
Here we see that each marginal $\mu_i(s_i)$ is a very simple function of the incoming messages $\psi^{a\to i}_{s_i}$, $a\in\partial i$.
The partition function $Z$, instead, can be computed quite directly by rooting the tree in node $i$ and noticing (independently of which node $i$ we choose as the root) that
$$Z = \sum_{s_i} g_i(s_i) \prod_{a\in\partial i} V^{a\to i}_{s_i}\,. \qquad (\dagger)$$
We want an expression for $Z$ that only involves the messages $\chi$ and $\psi$, in a way that the result does not depend explicitly on the rooting of the tree. To do this, we first define
$$Z^{i} =: \sum_{s} g_i(s) \prod_{a\in\partial i} \psi^{a\to i}_{s} = \frac{\sum_{s_i} g_i(s_i) \prod_{a\in\partial i} V^{a\to i}_{s_i}}{\prod_{a\in\partial i} \sum_{s'} V^{a\to i}_{s'}}$$
$$Z^{a} =: \sum_{\{s_i\}_{i\in\partial a}} f_a\big(\{s_i\}_{i\in\partial a}\big) \prod_{i\in\partial a} \chi^{i\to a}_{s_i} = \frac{\sum_{\{s_i\}_{i\in\partial a}} f_a\big(\{s_i\}_{i\in\partial a}\big) \prod_{i\in\partial a} R^{i\to a}_{s_i}}{\prod_{i\in\partial a} \sum_{s'} R^{i\to a}_{s'}}$$
$$Z^{ia} =: \sum_{s} \chi^{i\to a}_{s}\, \psi^{a\to i}_{s} = \frac{\sum_{s} V^{a\to i}_{s}\, R^{i\to a}_{s}}{\sum_{s'} V^{a\to i}_{s'} \sum_{s''} R^{i\to a}_{s''}}$$
Combining these definitions, one can check that on a tree the partition function can be written as $Z = \prod_i Z^{i} \prod_a Z^{a} / \prod_{(ia)} Z^{ia}$. The last step comes from the fact that, in a tree, if we take for every variable node the edge towards the root, and for every factor node the edge towards the root, we account for exactly all the edges.
It is quite instructive to keep in mind the interpretation of free energy terms (deduced from
their definitions)
• Z i : change of the partition function Z when variable node i is added to the factor graph
• Z a : change of the partition function Z when factor node a is added to the factor graph
The formula for the partition function is hence quite intuitive: first, we are adding up all the
contributions from nodes and factors. Since with every new addition we account for all the
connected edges, we end up counting each edge exactly twice (since it is connected both to a
node and a factor). Thus, we have to subtract the edge contributions once, in order to correct
for the double-counting.
After a bulky derivation, let us summarize the Belief Propagation equations, and the formulas
for the marginals and for the free entropy.
$$\chi^{j\to a}_{s_j} = \frac{1}{Z^{j\to a}}\, g_j(s_j) \prod_{b\in\partial j\setminus a} \psi^{b\to j}_{s_j}\,, \qquad \psi^{a\to i}_{s_i} = \frac{1}{Z^{a\to i}} \sum_{\{s_j\}_{j\in\partial a\setminus i}} f_a\big(\{s_j\}_{j\in\partial a}\big) \prod_{j\in\partial a\setminus i} \chi^{j\to a}_{s_j}\,,$$
where $Z^{j\to a}$ and $Z^{a\to i}$ are normalization factors set so that $\sum_{s} \chi^{j\to a}_{s} = 1$ and $\sum_{s} \psi^{a\to i}_{s} = 1$.
The free entropy density $\Phi$, which is exact on trees and is called the Bethe free entropy on more general graphs, reads
$$N\Phi = \log Z = \sum_{i=1}^{N} \log Z^{i} + \sum_{a=1}^{M} \log Z^{a} - \sum_{(ia)} \log Z^{ia} \tag{5.1}$$
with
$$Z^{i} =: \sum_{s} g_i(s) \prod_{a\in\partial i} \psi^{a\to i}_{s}\,, \qquad Z^{a} =: \sum_{\{s_i\}_{i\in\partial a}} f_a\big(\{s_i\}_{i\in\partial a}\big) \prod_{i\in\partial a} \chi^{i\to a}_{s_i}\,, \qquad Z^{ia} =: \sum_{s} \chi^{i\to a}_{s}\, \psi^{a\to i}_{s}\,.$$
A landmark property of Belief Propagation and the Bethe entropy is that the BP equations
can be obtained from the stationary point condition of the Bethe free entropy. Show this for
homework. This is a crucial property in the task of finding the correct fixed point of BP when
several of them exist.
As it often happens for successful methods/algorithms, Belief Propagation has been indepen-
dently discovered in several fields. Notable works introducing BP in various forms are:
• Hans Bethe & Rudolf Peierls 1935, in magnetism to approximate the regular lattice by a
tree of the same degree.
• Robert G. Gallager 1962 Gallager (1962), in information theory for decoding sparse error
correcting codes.
Above we derived the BP equations and the free entropy on tree factor graphs. To use BP
as an algorithm on trees, we initialize the messages on the leaves according to definition
χj→a
sj = gj (sj ), and then spread towards the root. A single iteration is sufficient. The trouble
is that basically no problems of interest are defined on tree graphical models. Thus its usage
in this context is very limited.
On graphical models with loops, BP can always be used as a heuristic iterative algorithm:
$$\chi^{j\to a}_{s_j}(t+1) = \frac{1}{Z^{j\to a}(t)}\, g_j(s_j) \prod_{b\in\partial j\setminus a} \psi^{b\to j}_{s_j}(t)$$
$$\psi^{a\to i}_{s_i}(t) = \frac{1}{Z^{a\to i}(t)} \sum_{\{s_j\}_{j\in\partial a\setminus i}} f_a\big(\{s_j\}_{j\in\partial a}\big) \prod_{j\in\partial a\setminus i} \chi^{j\to a}_{s_j}(t)$$
The messages need to be initialized at $t=0$; common choices are:
• $\chi^{j\to a}_{s_j}(t=0) \propto g_j(s_j) + \varepsilon^{j\to a}_{s_j}$ (then normalized), where the $\varepsilon^{j\to a}_{s_j}$ are small perturbations of the "prior".
• $\chi^{j\to a}_{s_j}(t=0) \propto \varepsilon^{j\to a}_{s_j}$ (then normalized): random initialization.
• $\chi^{j\to a}_{s_j}(t=0) = \delta_{s_j,s^*_j}$: planted initialization.
We will discuss in the follow-up lectures the usage of these different initializations and their
properties.
Then, we iterate the Belief Propagation equations until convergence or for a given number of
steps. The order of iterations can be parallel, or random sequential. Again, we will discuss in
what follows the advantages and disadvantages of the different update schemes.
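The following is a minimal sketch of loopy BP on a pairwise binary model (spins $s_i=\pm 1$) with node factors $g_i(s)=e^{\beta h_i s}$ and edge factors $f_{ij}(s_i,s_j)=e^{\beta J_{ij}s_i s_j}$, using parallel updates, a randomly perturbed "prior" initialization, and the Bethe free entropy (5.1) evaluated at the fixed point. All concrete choices (example graph, tolerance, absence of damping) are illustrative assumptions, not prescribed by the notes.

import numpy as np

S = np.array([-1, +1])

def loopy_bp(edges, J, h, beta, T=500, tol=1e-10, seed=0):
    rng = np.random.default_rng(seed)
    nodes = sorted(h)
    nbrs = {i: [] for i in nodes}
    for i, j in edges:
        nbrs[i].append(j); nbrs[j].append(i)
    g = {i: np.exp(beta * h[i] * S) for i in nodes}                    # node factors
    f = {(i, j): np.exp(beta * J[(i, j)] * np.outer(S, S)) for (i, j) in edges}
    f.update({(j, i): f[(i, j)].T for (i, j) in edges})
    # chi[(i, j)]: message from variable i to the factor on edge (ij)
    chi = {(i, j): (lambda m: m / m.sum())(g[i] + 0.01 * rng.random(2))
           for i in nodes for j in nbrs[i]}
    for _ in range(T):
        # psi[(k, i)]: message from the factor on edge (ki) towards variable i
        psi = {(k, i): (lambda m: m / m.sum())(f[(k, i)].T @ chi[(k, i)])
               for k in nodes for i in nbrs[k]}
        new_chi = {}
        for i in nodes:
            for j in nbrs[i]:
                m = g[i] * np.prod([psi[(k, i)] for k in nbrs[i] if k != j], axis=0)
                new_chi[(i, j)] = m / m.sum()
        diff = max(np.abs(new_chi[e] - chi[e]).max() for e in chi)
        chi = new_chi
        if diff < tol:
            break
    psi = {(k, i): (lambda m: m / m.sum())(f[(k, i)].T @ chi[(k, i)])
           for k in nodes for i in nbrs[k]}
    # Marginals and Bethe free entropy, eq. (5.1)
    logZ, marg = 0.0, {}
    for i in nodes:
        m = g[i] * np.prod([psi[(k, i)] for k in nbrs[i]], axis=0)
        marg[i] = m / m.sum()
        logZ += np.log(m.sum())                                   # log Z^i
    for (i, j) in edges:
        logZ += np.log(chi[(i, j)] @ f[(i, j)] @ chi[(j, i)])     # log Z^a
        logZ -= np.log(chi[(i, j)] @ psi[(j, i)])                 # log Z^{ia}
        logZ -= np.log(chi[(j, i)] @ psi[(i, j)])                 # log Z^{ja}
    return marg, logZ

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]          # a 4-cycle (one loop)
J = {e: 1.0 for e in edges}
h = {i: 0.1 for i in range(4)}
marg, logZ = loopy_bp(edges, J, h, beta=0.5)
print(marg[0], "Bethe log Z =", logZ)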
It is useful to remark that, if the graph is not a tree, then the independence of branches when
conditioning on a value of a node is in general invalid. In general, the products in the BP equa-
tions should involve joint probability distributions. In this case, the simple recursive structure
of the algorithm would get replaced by joint probability distributions over neighborhoods
(up to a large distance), which would be just as intractable as the original problem. We will
see in this lecture that there are many circumstances in which the independence between the incoming messages $\psi^{b\to j}_{s_j}$ for $b\in\partial j\setminus a$, and between the $\chi^{j\to a}_{s_j}$ for $j\in\partial a\setminus i$, is approximately true.
We will focus on cases when it leads to results that are exact up to the leading order in N . One
important class of graphical models on which BP and its variants lead to asymptotically exact
results is that of sparse random factor graphs. This is the case because such graphs look like
trees up to a distance that grows (logarithmically) with the system size.
Informally, a graph is locally tree-like if, for almost all nodes, the neighborhood up to distance d
is a tree, and d → ∞ as N → ∞.
Importantly, sparse random factor graphs are locally tree-like. Sparse here refers to the fact
that the average degree of both the variable node and the factor nodes are constants while the
size of the graph N, M → ∞.
We will illustrate this claim in the case of sparse random graphs (for sparse random factor graphs the argument is analogous). In a random graph where each edge is present with probability $c/N$, with $c = O(1)$ and $N \to \infty$, the average degree of the nodes is $c$.
In order to compute the length of the shortest loop that goes through a typical node $i$, consider a non-backtracking spreading process starting in node $i$. The probability of the spreading process returning to $i$ in $d$ steps (through a loop) is
$$1 - \Pr\big(\text{does not return to } i\big) \approx 1 - \left(1 - \frac{1}{N}\right)^{c^d}$$
where $c^d$ is the expected number of explored nodes after $d$ steps. In the limit $N\to\infty$, $c = O(1)$, this probability is exponentially small for small distances $d$, and exponentially close to one for large distances $d$. The distance $d$ at which this probability becomes $O(1)$ marks the order of the length of the shortest loop going through node $i$:
$$c^d \log\left(1 - \frac{1}{N}\right) = c^d \left(-\frac{1}{N} - \frac{1}{2N^2} + \cdots\right) \approx O(1)$$
This happens when $c^d \approx O(N)$, i.e. $d \approx \log(N)/\log(c)$. We conclude that a random graph with average degree $c = O(1)$ is such that the length of the shortest loop going through a random node $i$ is $O(\log(N))$ with high probability. We thus see that up to distance $O(\log(N))$
the neighborhood of a typical node is a tree.
We will hence attempt to use belief propagation and the Bethe free entropy on locally tree-like
graphs. The key assumption BP makes is the independence of the various branches of the tree.
If the branches are connected through loops of length O(log(N )) → ∞ and the correlation
between the root and the leaves of the tree decays fast enough (we will make this condition
much more precise), the independence of branches gets asymptotically restored and BP and
the Bethe free entropy lead to asymptotically exact results. We will investigate cases when
the correlation decay is fast enough, and also those when it is not, and show how to still use
BP-based approach to obtain adjusted asymptotically exact results (this will lead us to the
notion of replica symmetry breaking).
5.3 Exercises
Write the following problems (i) in terms of a probability distribution and (ii) in terms
of a graphical model by drawing a (small) example of the corresponding factor graph.
Finally (iii) write the Belief Propagation equations for these problems (without coding
or solving them) and the expression for the Bethe free energy that would be computed
from the BP fixed points.
A key connection between Belief Propagation and the Bethe free entropy:
5.A BP for pair-wise models, node by node

In Section 5.2, we derived the Belief Propagation (BP) equations for a general tree-like graphical model. At first sight, the BP equations might look daunting. The goal of this Appendix is to provide additional intuition behind the BP equations by constructing them from scratch, node by node, in a concrete setting.
Let G = (V, E) be a graph with N = |V | nodes, and let’s consider for concreteness the case of
a pair-wise interacting spin model on G:
$$P(S = s) = \frac{1}{Z} \prod_{i=1}^{N} g_i(s_i) \prod_{(ij)\in E} f_{(ij)}(s_i, s_j)$$
This encompasses many cases of interest, for instance the RFIM we studied in Chapter 2, for which $g_i(s_i) = e^{\beta h_i s_i}$ and $f_{(ij)}(s_i, s_j) = e^{\beta J s_i s_j}$.
The reader can keep such a concrete example in mind in what follows. To lighten notation, we will write $\mu(s) =: P(S = s)$ for the probability distribution, with $\mu$ understood as a function $\mu : \{-1, 1\}^N \to [0, 1]$. We will also use $\propto$ to denote "equal up to a multiplicative factor", which will always be the constant that normalizes the probability distributions or messages. Recall that the BP equations are self-consistent equations for the messages, or beliefs, from variables to factors, $\chi^{i\to a}_{s_i}$, and from factors to variables, $\psi^{a\to i}_{s_i}$. For pair-wise models, we have one factor per edge, and the BP equations read:
$$\chi^{j\to(ij)}_{s_j} = \frac{1}{Z^{j\to(ij)}}\, g_j(s_j) \prod_{(kj)\in\partial j\setminus (ij)} \psi^{(kj)\to j}_{s_j}$$
$$\psi^{(ij)\to i}_{s_i} = \frac{1}{Z^{(ij)\to i}} \sum_{s_j} f_{(ij)}(s_i, s_j)\, \chi^{j\to(ij)}_{s_j}$$
Since in pair-wise models there is a one-to-one correspondence between edges and factor
nodes, we can simply rewrite the BP equations directly on the graph G. For each node i ∈ V
of the graph, define the outgoing messages $\chi^{i\to j}_{s_i} =: \chi^{i\to(ij)}_{s_i}$ and the incoming messages $\psi^{j\to i}_{s_i} =: \psi^{(ij)\to i}_{s_i}$. The BP equations in terms of these "new" messages read:
$$\chi^{i\to j}_{s_i} = \frac{1}{Z^{i\to j}}\, g_i(s_i) \prod_{k\in\partial i\setminus j} \psi^{k\to i}_{s_i}\,, \qquad \psi^{k\to i}_{s_i} = \frac{1}{Z^{k\to i}} \sum_{s_k\in\{-1,1\}} f_{(ik)}(s_i, s_k)\, \chi^{k\to i}_{s_k}$$
Recall that the marginal for variable $s_i$ is given in terms of the messages as
$$\mu_i(s_i) =: P(S_i = s_i) \propto g_i(s_i) \prod_{j\in\partial i} \psi^{j\to i}_{s_i} \tag{5.3}$$
This basically tells us that the probability that $S_i = s_i$ is simply given by the "local belief" (or prior) $g_i$ times the incoming beliefs $\psi^{j\to i}_{s_i}$ from all neighbors of $i$. In other words, BP factorizes the marginal distribution at every node in terms of independent beliefs.
Note that in this case it is also easy to solve for one of the messages to obtain a self-consistent equation for only one of them. For instance, solving for the incoming messages $\psi^{k\to i}_{s_i}$ gives:
$$\chi^{i\to j}_{s_i} = \frac{g_i(s_i)}{Z^{i\to j}} \prod_{k\in\partial i\setminus j}\; \sum_{s_k\in\{-1,+1\}} f_{(ik)}(s_i, s_k)\, \chi^{k\to i}_{s_k}\,.$$
When the graph has a single node, $N = 1$, we have only a single spin $S_1 \in \{-1, 1\}$ and therefore $E = \emptyset$. The corresponding factor graph has a single variable node and a single factor node for the "local field" $g_1$. In this case, the marginal distribution of the spin $S_1$ is simply given by the "local belief":
$$\mu_1(\pm 1) =: P(S_1 = \pm 1) = \frac{1}{Z}\, g_1(\pm 1) = \frac{g_1(\pm 1)}{g_1(+1) + g_1(-1)}$$
Note that in the absence of a prior, $g_1(s_1) = 1$, the marginal is simply the uniform distribution $\mu_1(s_1) = \frac{1}{2}$.
For two nodes, $N = 2$, we have two spin variables $S_1, S_2 \in \{-1, +1\}$ and two possible graphs: either the spins are decoupled and $E = \emptyset$, or they interact through an edge $E = \{(12)\}$.
No edge: When $E = \emptyset$, the joint distribution is
$$\mu(s_1, s_2) = \frac{1}{Z}\, g_1(s_1) g_2(s_2) = \frac{g_1(s_1) g_2(s_2)}{g_1(+1)g_2(+1) + g_1(-1)g_2(+1) + g_1(+1)g_2(-1) + g_1(-1)g_2(-1)}$$
In particular, the joint distribution factorizes over the marginals, $\mu(s_1, s_2) = \mu_1(s_1)\mu_2(s_2)$.
This is a direct consequence of the absence of an interaction term coupling the two spins. Note
that we cannot write BP equations for this case since there are no factors.
One edge: The case in which E = {(12)} is more interesting. Now the joint distribution is
given by:
$$\mu(s_1, s_2) = \frac{1}{Z}\, g_1(s_1) g_2(s_2) f_{(12)}(s_1, s_2)$$
Notice that very quickly it becomes cumbersome to write the exact expression for the normalization $Z$, which we keep implicit from now on. The marginals are now given by:
$$\mu_1(s_1) = \frac{1}{Z}\Big[g_2(+1) f_{(12)}(s_1, +1) + g_2(-1) f_{(12)}(s_1, -1)\Big]\, g_1(s_1)$$
$$\mu_2(s_2) = \frac{1}{Z}\Big[g_1(+1) f_{(12)}(+1, s_2) + g_1(-1) f_{(12)}(-1, s_2)\Big]\, g_2(s_2)$$
Crucially, it is easy to check that the joint distribution does not factorise anymore: in general $\mu(s_1, s_2) \neq \mu_1(s_1)\mu_2(s_2)$.
Let us now look at what the BP equations are telling us. For instance, the outgoing messages are given by:
$$\chi^{1\to 2}_{s_1} \propto g_1(s_1)\,, \qquad \chi^{2\to 1}_{s_2} \propto g_2(s_2)$$
From this, it is pretty clear that the marginals factorize in terms of the local beliefs times the incoming beliefs, eq. (5.3).
Finally, let’s consider the more involved case of three nodes N = 3. The case in which there
is no edge E = ∅ or only one edge |E| = 1 reduces to one of the cases we have seen before.
Therefore, the interesting cases are when we have either two or three edges.
Two edges: For two edges, there are two nodes of degree one and one node of degree two. Without loss of generality, we can choose node 2 to have degree 2, and the joint distribution reads:
$$\mu(s_1, s_2, s_3) = \frac{1}{Z}\, f_{(12)}(s_1, s_2) f_{(23)}(s_2, s_3)\, g_1(s_1) g_2(s_2) g_3(s_3)$$
The marginal probability of $S_1 = s_1$ is then given by:
$$\mu_1(s_1) = \sum_{s_2\in\{-1,1\}}\sum_{s_3\in\{-1,1\}} \mu(s_1, s_2, s_3) \propto \Big\{\big[f_{(23)}(-1,-1) g_3(-1) + f_{(23)}(-1,+1) g_3(+1)\big] f_{(12)}(s_1,-1)\, g_2(-1) + \big[f_{(23)}(+1,-1) g_3(-1) + f_{(23)}(+1,+1) g_3(+1)\big] f_{(12)}(s_1,+1)\, g_2(+1)\Big\}\, g_1(s_1)$$
Note that the four terms pair up, each pair sharing a common factor. The outgoing BP messages now read:
$$\chi^{1\to 2}_{s_1} \propto g_1(s_1)\,, \qquad \chi^{2\to 1}_{s_2} \propto g_2(s_2) \sum_{s_3\in\{-1,+1\}} f_{(23)}(s_2, s_3)\, \chi^{3\to 2}_{s_3}\,, \qquad \chi^{3\to 2}_{s_3} \propto g_3(s_3)$$
Three edges: For three edges, all nodes are connected and have degree two. The joint distribution reads:
$$\mu(s_1, s_2, s_3) = \frac{1}{Z}\, f_{(12)}(s_1, s_2) f_{(13)}(s_1, s_3) f_{(23)}(s_2, s_3)\, g_1(s_1) g_2(s_2) g_3(s_3)$$
The marginal probability of $S_1 = s_1$ now reads:
$$\mu_1(s_1) = \sum_{s_2\in\{-1,1\}}\sum_{s_3\in\{-1,1\}} \mu(s_1, s_2, s_3) \propto \Big[ f_{(12)}(s_1,-1) f_{(13)}(s_1,-1) f_{(23)}(-1,-1)\, g_2(-1) g_3(-1) + f_{(12)}(s_1,+1) f_{(13)}(s_1,-1) f_{(23)}(+1,-1)\, g_2(+1) g_3(-1) + f_{(12)}(s_1,-1) f_{(13)}(s_1,+1) f_{(23)}(-1,+1)\, g_2(-1) g_3(+1) + f_{(12)}(s_1,+1) f_{(13)}(s_1,+1) f_{(23)}(+1,+1)\, g_2(+1) g_3(+1) \Big]\, g_1(s_1)$$
Note that, differently from the two-edge case, the four terms above do not share any common factor apart from $g_1(s_1)$. The outgoing BP messages now read:
$$\chi^{1\to 2}_{s_1} \propto g_1(s_1) \sum_{s_3\in\{-1,1\}} f_{(13)}(s_1, s_3)\, \chi^{3\to 1}_{s_3}\,, \qquad \chi^{1\to 3}_{s_1} \propto g_1(s_1) \sum_{s_2\in\{-1,1\}} f_{(12)}(s_1, s_2)\, \chi^{2\to 1}_{s_2}$$
$$\chi^{2\to 1}_{s_2} \propto g_2(s_2) \sum_{s_3\in\{-1,1\}} f_{(23)}(s_2, s_3)\, \chi^{3\to 2}_{s_3}\,, \qquad \chi^{2\to 3}_{s_2} \propto g_2(s_2) \sum_{s_1\in\{-1,1\}} f_{(21)}(s_2, s_1)\, \chi^{1\to 2}_{s_1}$$
$$\chi^{3\to 1}_{s_3} \propto g_3(s_3) \sum_{s_2\in\{-1,1\}} f_{(32)}(s_3, s_2)\, \chi^{2\to 3}_{s_2}\,, \qquad \chi^{3\to 2}_{s_3} \propto g_3(s_3) \sum_{s_1\in\{-1,1\}} f_{(31)}(s_3, s_1)\, \chi^{1\to 3}_{s_1}$$
Note that in this case it is not simple to solve for the messages, which are now coupled through the loop. For instance, the marginal of $S_1$ is given in terms of the messages by:
$$\mu_1(s_1) \propto g_1(s_1) \sum_{s_2\in\{-1,+1\}} f_{(12)}(s_1, s_2)\, \chi^{2\to 1}_{s_2} \sum_{s_3\in\{-1,+1\}} f_{(13)}(s_1, s_3)\, \chi^{3\to 1}_{s_3}$$
Chapter 6

The picture will have charm when each colour is very unlike the one next to it.

In physics, variables $s_i$ taking one of $q$ discrete values (the "colors") are called Potts spins, and the corresponding model is the Potts model. In what follows we will consider both the repulsive (anti-ferromagnetic) case $\beta > 0$ and the attractive (ferromagnetic) case $\beta < 0$. In our notation, the node factors will simply be $g_i(s_i) = 1$ for all $i$, and the interaction factors $f_{(ij)}(s_i, s_j) = e^{-\beta \delta_{s_i,s_j}}$ for all $(ij) \in E$.
where in the third step we split the sum into a sum over all configurations at a given energy $e$ and a sum over the energies, and in the last step we replaced the sum over the discrete values of $e$ by an integral over $e$, using the definition introduced in equation 6.2. This is well justified since we consider the limit $N \to \infty$ and we are interested in the leading order (in $N$) of $\Phi$, $e$ and $s$. The saddle-point method then gives us
$$\frac{\partial s(e)}{\partial e}\bigg|_{e=e^*} = \beta\,, \qquad \Phi(\beta) = s(e^*) - \beta e^* \tag{6.3}$$
$$\frac{d\Phi(\beta)}{d\beta} = -\langle e\rangle_{\rm Boltz}\,. \tag{6.4}$$
Thus, if we compute the free entropy density Φ as a function of β, we can compute also its
derivative and consequently access the number of configurations of a given energy s(e). Doing
these calculations exactly is in general a computationally intractable task. We will use BP and
the Bethe free entropy to obtain approximate — and in some cases asymptotically exact —
results.
How can we use Belief Propagation to evaluate the above quantities? The BP fixed point gives us a set of messages $\chi^{i\to j}$ and the Bethe free entropy $\Phi_{\rm Bethe}$ as a function of the messages. In general, the messages give us an approximation of the true marginals of the variables, and the Bethe free entropy $\Phi_{\rm Bethe}$ gives us an approximation of the true free entropy $\Phi$. Remembering that a BP fixed point is a stationary point of the Bethe free entropy, we get
$$e^* \approx -\frac{d\Phi_{\rm Bethe}(\beta)}{d\beta}\,.$$
Thus, given a BP fixed point we can approximate both the free entropy and the average energy. Once we have evaluated $\Phi_{\rm Bethe}$ and $e^*$ we can readily obtain the entropy $s(e)$, which we defined as the logarithm of the number of colorings at energy $e$:
$$s(e^*) \approx \Phi_{\rm Bethe}(\beta) + \beta e^*\,.$$
Remember, however, that the correctness of the resulting $s(e)$ relies on the correctness of the Bethe free entropy for the given fixed point of the BP equations.
We now apply the BP equations, as derived in the previous section, even though in general the
graph G(V, E) is not a tree. In what follows, we will see under what circumstances this can lead
to asymptotically exact results for coloring of random graphs. Note that, on generic graphs this
approach can always be considered as a heuristic approximation of the quantities of interest
(in most cases the approximation is hard to control). In this section we will manipulate the
BP equations to see what can be derived from them and, when needed, we will restrict our
considerations to random sparse graphs.
We see that, since every factor node has two neighbors, the second BP equation has a simple form (the product runs over a single term). We can thus eliminate the messages $\psi$ from the equations while keeping a simple form. We get
$$\chi^{j\to(ij)}_{s_j} = \frac{1}{Z^{j\to(ij)} \prod_{(kj)\in\partial j\setminus(ij)} Z^{(kj)\to j}} \prod_{(kj)\in\partial j\setminus (ij)} \Big[1 - \big(1 - e^{-\beta}\big)\, \chi^{k\to(kj)}_{s_j}\Big]$$
$$\Rightarrow\quad \chi^{j\to i}_{s_j} = \frac{1}{Z^{j\to i}} \prod_{k\in\partial j\setminus i} \Big[1 - \big(1 - e^{-\beta}\big)\, \chi^{k\to j}_{s_j}\Big]$$
In the last equation we defined an overall normalization term $Z^{j\to i}$ and went back to a graph notation, where $\partial i$ denotes the set of neighbouring nodes of $i$ (whereas in factor-graph notation $\partial i$ would denote the set of neighbouring factors of $i$). At first sight, this change of notation may be confusing, since we use the same letter $\chi$ to denote the messages on the original graph and on the factor graph. However, note that it presents no ambiguity: messages between two variable nodes $i \to j$ can only refer to the original graph, since in a factor graph two variable nodes are never connected directly.
The resulting belief propagation equations have quite an intuitive meaning. Recall that $\chi^{j\to i}_{s_j}$ represents the probability that node $j$ takes color $s_j$ if the edge between $j$ and $i$ were temporarily removed. Keeping this in mind, the terms in the above equations have the following meaning:

• $1 - \big(1 - e^{-\beta}\big)\chi^{k\to j}_{s_j} = \sum_{s_k \neq s_j} \chi^{k\to j}_{s_k} + e^{-\beta}\chi^{k\to j}_{s_j}$ is the probability that neighbor $k$ allows node $j$ to take color $s_j$.

• $\prod_{k\in\partial j\setminus i}[\cdots]$ is the probability that all the neighbors let node $j$ take color $s_j$ ($i$ was excluded, as the edge $(ij)$ is removed). The product is used because of the implicit BP assumption of independence of the neighbors when conditioning on the value $s_j$.
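A minimal sketch of the simplified coloring recursion above, with chi[(i, j)] the $q$-vector $\chi^{i\to j}$, parallel updates and random initialization; the example graph, tolerance and number of iterations are illustrative assumptions.

import numpy as np

def coloring_bp(edges, N, q, beta, T=1000, tol=1e-10, seed=0):
    rng = np.random.default_rng(seed)
    nbrs = {i: set() for i in range(N)}
    for i, j in edges:
        nbrs[i].add(j); nbrs[j].add(i)
    theta = 1.0 - np.exp(-beta)
    chi = {(i, j): (lambda m: m / m.sum())(rng.random(q))
           for i in range(N) for j in nbrs[i]}
    for _ in range(T):
        new = {}
        for (i, j) in chi:
            m = np.ones(q)
            for k in nbrs[i]:
                if k != j:
                    m = m * (1.0 - theta * chi[(k, i)])   # prob. that k allows each color
            new[(i, j)] = m / m.sum()
        diff = max(np.abs(new[e] - chi[e]).max() for e in chi)
        chi = new
        if diff < tol:
            break
    return chi

# On a small cycle with q = 3 and beta > 0 the messages converge to the
# paramagnetic fixed point chi = 1/q.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
print(coloring_bp(edges, N=5, q=3, beta=2.0)[(0, 1)])

The Bethe free entropy (6.7) derived just below can then be evaluated on the resulting fixed point.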
In a homework problem you will show that, using similar simplifications as above, we can rewrite the generic Bethe free entropy for graph coloring as
$$N\Phi_{\rm Bethe}(\beta) = \sum_{i=1}^{N} \log Z^{(i)} - \sum_{(ij)\in E} \log Z^{(ij)} \tag{6.7}$$
where
$$Z^{(i)} = \sum_{s} \prod_{k\in\partial i} \Big[1 - \big(1-e^{-\beta}\big)\chi^{k\to i}_{s}\Big]$$
$$Z^{(ij)} = \sum_{s_i, s_j} e^{-\beta\delta_{s_i,s_j}}\, \chi^{i\to j}_{s_i} \chi^{j\to i}_{s_j} = 1 - \big(1-e^{-\beta}\big) \sum_{s} \chi^{i\to j}_{s} \chi^{j\to i}_{s}$$
We keep in mind that the Bethe free entropy is evaluated at a fixed point of the BP equations.
Sometimes we will think of the Bethe free entropy as a function of all the messages χi→j or of
a parametrization of the messages.
We will always denote the free entropy with the index "Bethe" when we are evaluating it
using the BP approximation. While on a tree we showed that Φ = ΦBethe , in the case of a
generic (even locally tree-like) graph we will discuss the relation between Φ and ΦBethe (more
precisely its global maximizers) in more detail in the next lectures.
We note that, so far, all we wrote depends explicitly on the graph G(V, E) through the list of
nodes V and edges E. No average over the graph (i.e. the disorder) was taken! This is rather
different from the replica method, where the first step in the computation is in fact taking the
average over the disorder.
We notice that, for graph coloring, $\chi^{j\to i}_{s_j} = 1/q$ for all $(ij)$ and all $s_j$ is a fixed point of the BP equations on any graph $G(V,E)$. We will call this the paramagnetic fixed point. In general there might be, and often are, other fixed points, as we will discuss in the next sections. To check that $\chi\equiv 1/q$ is indeed a fixed point of BP, call $d_i$ the degree of node $i$ (i.e. its number of neighbors) and plug it into the BP recursion:
$$\frac{\Big[1 - \big(1-e^{-\beta}\big)\frac{1}{q}\Big]^{d_i-1}}{q\Big[1 - \big(1-e^{-\beta}\big)\frac{1}{q}\Big]^{d_i-1}} = \frac{1}{q} \tag{6.8}$$
which is indeed always true. For the Bethe free entropy this paramagnetic fixed point gives us:
$$N\Phi_{\rm Bethe}(\beta) = \sum_{i=1}^{N} \log\bigg\{ q\Big[1 - \big(1-e^{-\beta}\big)\tfrac{1}{q}\Big]^{d_i} \bigg\} - \sum_{(ij)\in E} \log\bigg\{ \frac{1}{q^2}\Big(q\, e^{-\beta} + q^2 - q\Big) \bigg\}$$
$$\Phi_{\rm Bethe}(\beta) = c\,\log\Big[1 - \big(1-e^{-\beta}\big)\tfrac{1}{q}\Big] + \log(q) - \frac{c}{2}\log\Big[1 - \big(1-e^{-\beta}\big)\tfrac{1}{q}\Big] = \log(q) + \frac{c}{2}\log\Big[1 - \big(1-e^{-\beta}\big)\tfrac{1}{q}\Big]$$
where we called $c = \sum_i d_i/N = 2M/N$ the average degree of the graph. For the average energy and entropy at inverse temperature $\beta$ we get
$$e^* = -\frac{\partial\Phi_{\rm Bethe}(\beta)}{\partial\beta} = \frac{c}{2}\,\underbrace{\frac{e^{-\beta}\frac{1}{q}}{1 - (1-e^{-\beta})\frac{1}{q}}}_{\text{prob. an edge is monochromatic}} = \frac{c}{2}\,\frac{e^{-\beta}}{(q-1) + e^{-\beta}}$$
$$s(e^*) = \Phi_{\rm Bethe}(\beta) + \beta e^* = \log(q) + \frac{c}{2}\log\Big[1 - \big(1-e^{-\beta}\big)\tfrac{1}{q}\Big] + \frac{\beta c}{2}\,\frac{e^{-\beta}}{(q-1)+e^{-\beta}}$$
Notice that these expressions give us a parametric form for $s(e)$. Sometimes we can even eliminate $\beta$ and write $s(e)$ in a closed form, but generically we simply plot the parametric curve $\big(e(\beta), s(\beta)\big)$. In Figure 6.4.1 we take $q = 4$ colors and several values of the average degree $c$.
Figure 6.4.1: Entropy $s$ as a function of the energy cost $e$ for graph coloring with $q = 4$ colors, corresponding to the paramagnetic fixed point of belief propagation, for average degrees $c = 2, 6, 12$.
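A numerical sketch that reproduces the parametric curve $\big(e(\beta), s(\beta)\big)$ of the paramagnetic fixed point using the closed-form expressions above; plotting is left out, and the range of $\beta$ is an illustrative choice.

import numpy as np

def paramagnetic_e_s(beta, c, q):
    x = np.exp(-beta)
    phi = np.log(q) + (c / 2.0) * np.log(1.0 - (1.0 - x) / q)   # Phi_Bethe(beta)
    e = (c / 2.0) * x / ((q - 1.0) + x)                          # average energy e*
    s = phi + beta * e                                           # entropy s(e*)
    return e, s

q = 4
betas = np.linspace(-3.0, 6.0, 361)      # grid includes beta = 0
for c in (2, 6, 12):
    e, s = paramagnetic_e_s(betas, c, q)
    print(f"c={c:2d}: max s = {s.max():.3f} at e = {e[s.argmax()]:.3f} "
          f"(expected log q = {np.log(q):.3f} at e = c/2q = {c/(2*q):.3f})")

The printed maximum checks the statement below: the curve reaches $\log q$ at $e = c/(2q)$, i.e. at $\beta = 0$.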
For a given pair of parameters $c, q$, the curve $s(e)$ extends over energies from $0$ to $c/2$, because at most all the edges can be violated. The curve achieves a maximum equal to $\log(q)$ at $e = c/2q$, because that is the typical cost of a uniformly random coloring (each edge is violated with probability $1/q$). The slope of the curve $s(e)$ corresponds to the inverse temperature $\beta$, since $\frac{\partial s(e)}{\partial e} = \beta$.
Recalling that the entropy $s$ is the logarithm of an integer (the number of colorings with a given energy), it is clear that it should not take negative values. The fact that we see negative values for large energies, and also for energies close to zero at $c = 12$, indicates that either colorings with such energy do not exist with high probability, or that there was a flaw in what we did. Note also that so far we did not specify anything about the graph except its average degree. Clearly we could have graphs with average degree $c = 6$ that contain one or several (even linearly many in $N$) 5-cliques and thus do not admit any proper 4-coloring. The result we obtained (indicating exponentially many proper 4-colorings, $e = 0$, at $c = 6$) cannot be valid for those graphs. At this point, we thus restrict ourselves to sparse random graphs and investigate whether the results we obtained are plausible at least in that case.
We notice, from the results we obtained for the paramagnetic fixed point $\chi_i = \frac{1}{q}$ for all $i$, that energies larger than a certain value strictly smaller than $c/2$ are not accessible, as they have negative entropy. At the same time, if all nodes had the same color, the energy $e = c/2$ would automatically be achieved. To reconcile this paradox we must realize that the paramagnetic fixed point $\chi_i = \frac{1}{q}$ for all $i$ assumes that every color is represented the same number of times, which is clearly not the case if all nodes have the very same color. With BP, it is often the case that there are several fixed points and we need to select the correct one. In general we need to find the fixed point with the largest free entropy (from the saddle point method we know this is the one that dominates the probability measure). For the graph coloring problem at large energy, i.e. $\beta < 0$, this motivates the investigation of a different fixed point that is able to break the equal representation of every color. We will call it the ferromagnetic fixed point.
6.5 Ferromagnetic Fixed Point

We now investigate whether the BP equations for graph coloring have fixed points of the following form for all $(ij) \in E$:
$$\chi^{i\to j}_{1} = a\,, \qquad \chi^{i\to j}_{s} = b = \frac{1-a}{q-1}\,, \quad \forall\, s\neq 1 \tag{6.9}$$
Then, the graph coloring BP equations would read
$$\chi^{j\to i}_{s_j} = \frac{1}{Z^{j\to i}} \prod_{k\in\partial j\setminus i} \Big[1 - \big(1-e^{-\beta}\big)\chi^{k\to j}_{s_j}\Big]$$
$$a = \frac{1}{Z^{j\to i}} \Big[1 - \big(1-e^{-\beta}\big)\,a\Big]^{d_j-1} =: \frac{1}{Z^{j\to i}}\, A^{d_j-1}\,, \qquad b = \frac{1}{Z^{j\to i}} \Big[1 - \big(1-e^{-\beta}\big)\,b\Big]^{d_j-1} =: \frac{1}{Z^{j\to i}}\, B^{d_j-1}$$
with the normalization
$$Z^{j\to i} = (q-1)B^{d_j-1} + A^{d_j-1}\,, \qquad a = \frac{A^{d_j-1}}{(q-1)B^{d_j-1} + A^{d_j-1}}\,, \qquad b = \frac{B^{d_j-1}}{(q-1)B^{d_j-1} + A^{d_j-1}} \tag{6.10}$$
For such an ansatz to be a fixed point for every $(ij) \in E$ we need $d_j \equiv d$ for all $j$. This is the case for random $d$-regular graphs, where every variable node has degree $d$. We could, of course, look for a ferromagnetic fixed point where $a$ depends on $(ij)$ and solve the corresponding distributional equations, but in this section we look for a simpler solution to illustrate the basic concepts and we will thus restrict ourselves to random $d$-regular graphs with $d \ge 3$. For $d$-regular random graphs we obtain the following self-consistent equation for the parameter $a$, given the degree $d$, inverse temperature $\beta$ and number of colors $q$:
$$a = \frac{\Big[1 - \big(1-e^{-\beta}\big)\,a\Big]^{d-1}}{\Big[1 - \big(1-e^{-\beta}\big)\,a\Big]^{d-1} + (q-1)\Big[1 - \big(1-e^{-\beta}\big)\frac{1-a}{q-1}\Big]^{d-1}} =: \mathrm{RHS}(a; \beta, d, q) \tag{6.11}$$
To express the Bethe free entropy corresponding to this ferromagnetic fixed point we use eq. (6.7) and plug (6.9) into it, to get:
$$\Phi_{\rm Bethe}(\beta) = \log\bigg\{ (q-1)\Big[1 - \big(1-e^{-\beta}\big)\frac{1-a}{q-1}\Big]^{d} + \Big[1 - \big(1-e^{-\beta}\big)\,a\Big]^{d} \bigg\} - \frac{d}{2}\log\bigg\{ 1 - \big(1-e^{-\beta}\big)\bigg[\frac{(1-a)^2}{q-1} + a^2\bigg] \bigg\} \tag{6.12}$$
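A short sketch that solves the fixed-point equation (6.11) by simple iteration and evaluates the Bethe free entropy (6.12) for $d$-regular graphs; the starting point and parameter values are illustrative assumptions.

import numpy as np

def rhs(a, beta, d, q):
    x = 1.0 - np.exp(-beta)
    A = (1.0 - x * a) ** (d - 1)
    B = (1.0 - x * (1.0 - a) / (q - 1.0)) ** (d - 1)
    return A / (A + (q - 1.0) * B)

def phi_bethe(a, beta, d, q):
    x = 1.0 - np.exp(-beta)
    return (np.log((q - 1.0) * (1.0 - x * (1.0 - a) / (q - 1.0)) ** d
                   + (1.0 - x * a) ** d)
            - d / 2.0 * np.log(1.0 - x * ((1.0 - a) ** 2 / (q - 1.0) + a ** 2)))

def fixed_point(beta, d, q, a0=0.9, T=10_000, tol=1e-12):
    a = a0
    for _ in range(T):
        a_new = rhs(a, beta, d, q)
        if abs(a_new - a) < tol:
            break
        a = a_new
    return a

# 2-coloring on 5-regular graphs: for beta below beta_stab ~ -0.51 (e.g. -0.7)
# a ferromagnetic fixed point a* > 1/q exists, while for beta = -0.4 the
# iteration falls back to the paramagnetic value 1/q (compare Fig. 6.5.1).
for beta in (-0.4, -0.7):
    a_star = fixed_point(beta, d=5, q=2)
    print(f"beta={beta}: a* = {a_star:.4f}, Phi_Bethe = {phi_bethe(a_star, beta, 5, 2):.4f}")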
Figure 6.5.1: Illustration of a second order (continuous) phase transition for 2-coloring on a 5-regular graph, for inverse temperatures $\beta = -0.7, -0.5108, -0.4$. In the left panel, the right hand side of eq. (6.11) is plotted against the parameter $a$; in the right panel, the Bethe free entropy is plotted for the same inverse temperatures. The stable fixed points are marked and correspond to the local maxima of the Bethe free entropy.
In the last expressions we can think of the Bethe free entropy as a function of the parameter a,
keeping in mind that we are seeking to evaluate it at the global maximizer.
In Fig. 6.5.1 we plot the left hand side and the right hand side of eq. (6.11) as a function of $a$, for a given value of the degree $d$ and number of colors $q$ and several values of the inverse temperature $\beta$. We also plot the Bethe free entropy as a function of $a$.
We observe that for β = −0.4 (green curve) the only fixed point of (6.11) and maximum of
(6.12) is reached at a = 1/q. This is the paramagnetic fixed point we investigated previously.
For β = −0.7 (blue curve) we see, however, a different picture. The fixed point a = 1/q is
unstable under iterations of (6.11) and corresponds to a local minimum of the Bethe free
entropy. There are two new stable fixed points that appear and correspond to the maxima of
the Bethe entropy.
At what value of the inverse temperature $\beta$ do the additional fixed points appear? For this we need to evaluate the stability of the paramagnetic fixed point, i.e. the point where the derivative at the paramagnetic fixed point satisfies $\frac{\partial\,\mathrm{RHS}}{\partial a}\big|_{a=1/q} = 1$:
$$1 = \frac{\partial\,\mathrm{RHS}}{\partial a}\bigg|_{a=\frac{1}{q}} = -\frac{(d-1)\big(1-e^{-\beta}\big)}{e^{-\beta} + (q-1)} \quad\Rightarrow\quad \beta_{\rm stab} = -\log\Big(1 + \frac{q}{d-2}\Big) \tag{6.13}$$
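As a quick sanity check of eq. (6.13), consider the 2-coloring on 5-regular graphs of Fig. 6.5.1:
$$\beta_{\rm stab} = -\log\Big(1 + \frac{2}{3}\Big) = -\log\frac{5}{3} \approx -0.5108\,,$$
which is exactly the intermediate value of $\beta$ shown in the figure, where the paramagnetic fixed point changes stability.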
We can investigate other values of the degree $d$, still with two colors $q = 2$, and we observe the same picture for all $d$. We recognize the second order phase transition from a paramagnet to a ferromagnet that we already saw in the Curie-Weiss model. Indeed, the only difference between the present model and the Curie-Weiss model is that the graph of interactions was fully connected in the Curie-Weiss model, while it is a $d$-regular random graph in the present case.
Figure 6.5.2: Magnetization $a^*$ for 2-coloring on a 5-regular graph as a function of the inverse temperature $\beta$, compared with the paramagnetic value $1/q$.
This is the Bethe approximation to the solution of the Ising model on regular cubic lattices. Of course lattices are not trees, but if we match the degree of the random graph to the coordination number of a cubic lattice in $D$ dimensions we obtain $d = 2D$, and we can observe that for $D = 1$ ($d = 2$) we get $\beta_{\rm stab} = -\infty$. This is actually an exact solution, because the 1-dimensional cubic lattice is just a chain, which is a tree; thus the BP solution is exact. For the 2-dimensional Ising model, famously solved by Onsager, we have $\beta_{\rm Onsager} = -\log(1+\sqrt{2}) = -0.881$, which is relatively close to the Bethe approximation $\beta_{\rm stab}(d=4) = -0.693$. Note that the sign is opposite to what can be found in the literature because here positive $\beta$ was defined for the coloring (the anti-ferromagnet), and that there is a multiplicative factor of two because here the energy cost of a variable change is 1 (whereas in the usual Ising model it is 2). For the 3-dimensional Ising model no closed form solution exists yet; the critical temperature has been evaluated numerically to very high precision and reads $\beta_{\rm 3D} = -0.4433$, again to be compared with its Bethe approximation $\beta_{\rm stab}(d=6) = -0.4055$, which is remarkably close. As the degree grows, the Bethe approximation actually gets closer and closer to the finite dimensional values. Eventually, as $d\to\infty$, we recover exactly the Curie-Weiss solution that we studied in the first lecture, with a proper rescaling of the interaction strength. You will show this in a homework.
For more than two colors, $q \ge 3$, we find a somewhat different behaviour, leading to a first order phase transition. Let us again start by plotting the fixed point equation and the Bethe free entropy, in Fig. 6.5.3. We see that, as before, $\beta_{\rm stab}$ marks the inverse temperature at which the paramagnetic fixed point $1/q$ becomes unstable and the corresponding maximum of the Bethe entropy becomes a minimum. But there is another stable fixed point, corresponding to a local maximum of the Bethe entropy, appearing at $\beta_s > \beta_{\rm stab}$. The inverse temperature at which a new stable fixed point appears discontinuously is called the spinodal temperature in physics. When there is more than one stable fixed point, i.e. more than one local maximum of the Bethe free entropy, eq. (6.12), we must compare their free entropies: the larger one dominates the corresponding saddle point and hence is the correct solution.
Figure 6.5.3: Illustration of a first order (discontinuous) phase transition for 6-coloring on a 5-regular graph, for inverse temperatures $\beta = -1.1, -1.0, -0.88, -0.7$. In the left panel, the right hand side of eq. (6.11) is plotted against the parameter $a$; in the right panel, the Bethe free entropy is plotted for the same inverse temperatures. The stable fixed points are marked and correspond to the local maxima of the Bethe free entropy.
Figure 6.5.4: The magnetization $a^*$ (left panel) and the Bethe free entropy (right panel) of the paramagnetic, ferromagnetic and equilibrium branches, as functions of the inverse temperature $\beta$, for $q = 10$ and $d = 5$. (Note the different value of $q$ from the previous figure.)
The inverse temperature at which the ferromagnetic fixed point $a > 1/q$ becomes the global maximum of the Bethe entropy, instead of the paramagnetic fixed point, will be denoted $\beta_c$ and corresponds to a first order phase transition. We have $\beta_{\rm stab} < \beta_c < \beta_s$. The order parameter $a$ changes discontinuously at $\beta_c$ in a first order phase transition. In the case of $q = 2$, instead, we saw a continuous, second order phase transition, where the ferromagnetic fixed points appear at the same temperature at which the corresponding free entropy becomes a global maximum, together with the instability of the paramagnetic fixed point. In other words, in second order phase transitions $\beta_s = \beta_c = \beta_{\rm stab}$.
With the knowledge of the ferromagnetic fixed point we can hence correct the result for the number of colorings at a given cost, see Fig. 6.5.5.
Figure 6.5.5: Entropy as a function of the energy for graph coloring with $q = 4$ colors and average degree $c = d = 6$ (same as Fig. 6.4.1), showing both the paramagnetic and the ferromagnetic branches.
• The results obtained with BP on random graphs are exact for $\beta < 0$, in the sense that
$$\forall\, \varepsilon > 0\,, \qquad \Pr\big(|\Phi(\beta) - \Phi_{\rm Bethe}(\beta)| < \varepsilon\big) \xrightarrow{N\to\infty} 1 \tag{6.14}$$
  – For $q = 2$ this has been proven by Dembo & Montanari, Dembo et al. (2010b) (on sparse random graphs even beyond $d$-regular).
  – For $q \ge 3$ this has been proven by Dembo, Montanari, Sly and Sun, Dembo et al. (2014), on $d$-regular graphs.
The situation for β > 0 is much more involved and will be treated in the following.
• We note that the paramagnetic and ferromagnetic fixed points are BP fixed points even for finite size regular graphs, not only in the limit $N \to \infty$. But of course not every regular graph has the same number of colorings of a given energy, especially not the non-random ones. Also, the discussed phase transitions exist only in the limit $N \to \infty$, but the BP fixed points behave the way we discussed even at finite size $N$. This means that BP ignores part of the finite size effects and gives us an interesting proxy of a phase transition even at finite size, where formally no phase transition can exist since the free entropy is an analytic function for every finite $N$.
Previously, we evaluated the entropy $s(e)$ at the paramagnetic fixed point $\chi^{i\to j}_{s} \equiv \frac{1}{q}$. We can observe that the corresponding Bethe free entropy is equal to the so-called annealed free entropy.
To define this term, let us recall the general probability distribution we are considering:
$$P\big(\{s_i\}_{i=1}^N\big) = \frac{1}{Z_G} \prod_{i=1}^{N} g_i(s_i) \prod_{a=1}^{M} f_a\big(\{s_i\}_{i\in\partial a}\big)\,. \tag{6.15}$$
The free entropy $\Phi_G = \frac{1}{N}\log(Z_G)$ then depends explicitly on the graph $G$. We expect the free entropy to be self-averaging, i.e. to concentrate around its mean:
$$\forall\,\varepsilon > 0\,, \qquad \Pr\big(|\Phi_G - \mathbb{E}_G[\Phi_G]| > \varepsilon\big) \xrightarrow{N\to\infty} 0 \tag{6.16}$$
When $N\to\infty$, computing $\Phi_G$ and $\mathbb{E}_G[\Phi_G]$ should thus lead to the same result.
We define:
$$\text{quenched free entropy:} \quad \Phi_{\rm quench} \equiv \frac{1}{N}\, \mathbb{E}_G\big[\log(Z_G)\big]\,, \qquad \text{annealed free entropy:} \quad \Phi_{\rm anneal} \equiv \frac{1}{N} \log\big(\mathbb{E}_G[Z_G]\big)$$
Naively, we could expect to have (E[log(Z)] − log(E[Z])) /N → 0, but this is often not the case.
The partition function is of order Z = O (exp(N )) as N → ∞, and concentration holds only
for N1 log(Z). The annealed free entropy can get dominated by rare instances of the graph G.
Let us give a simple example.
$$Z_G = \begin{cases} e^{N}, & \text{w.p. } 1 - e^{-N} \\ e^{3N}, & \text{w.p. } e^{-N} \end{cases}$$
The quenched and annealed averages are then given by:
$$\frac{1}{N}\,\mathbb{E}_G\big[\log(Z_G)\big] = \big(1 - e^{-N}\big) + 3\,e^{-N} = 1 + 2\,e^{-N} \to 1$$
$$\frac{1}{N}\log\big(\mathbb{E}_G[Z_G]\big) = \frac{1}{N}\log\big(e^{N} - 1 + e^{2N}\big) = 2 + \frac{1}{N}\log\big(1 + e^{-N} - e^{-2N}\big) \to 2$$
We see that while the quenched entropy represents the typical values, the annealed entropy got
influenced by exponentially rare values and could completely mislead us about the properties
of the typical instance. In general, since log(·) is a concave function, by Jensen’s inequality we
have Φanneal ≥ Φquench , therefore the annealed free entropy will at least provide us with an
upper bound. Of course, Φanneal is usually much easier to compute than Φquench .
For the coloring problem, let $G(N, M)$ be a random graph with $N$ nodes and $M$ edges chosen at random among all possible edges. The annealed free entropy then follows from
$$\mathbb{E}_{G(N,M)}\big[Z_G(\beta)\big] = q^N\, \mathbb{E}_{\{s_i\}_{i=1}^N}\, \mathbb{E}_{G(N,M)}\Big[e^{-\beta\sum_{(ij)\in E}\delta_{s_i,s_j}}\Big] = q^N \Big[\underbrace{\tfrac{1}{q}\, e^{-\beta} + \big(1 - \tfrac{1}{q}\big)}_{\text{free entropy of one edge}}\Big]^{M}$$
Here we use the fact that edges in the random graph are independent and that, for $\beta > 0$, the contributions to the average are dominated by colorings where each color is represented roughly equally. The annealed entropy is thus positive and vanishes at the average degree
$$c_{\rm ann}(\beta) = -\frac{2\log q}{\log\Big[1 - \big(1-e^{-\beta}\big)\frac{1}{q}\Big]} \tag{6.17}$$
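As a quick worked example, for $q = 3$ colors in the zero-temperature limit $\beta\to\infty$ this gives
$$c_{\rm ann}(\beta\to\infty) = \frac{2\log 3}{-\log\big(1-\tfrac{1}{3}\big)} = \frac{2\log 3}{\log(3/2)} \approx 5.42\,,$$
the value quoted later in this section when comparing with the Kesten-Stigum threshold.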
Notice that the annealed free entropy and the Bethe free entropy corresponding to the paramagnetic fixed point are the same: $\Phi_{\rm anneal} = \Phi_{\rm Bethe}\big|_{\chi\equiv\frac{1}{q}}$.
The question is whether the annealed/paramagnetic free entropy is correct for all average degrees $c$ and inverse temperatures $\beta > 0$. Unfortunately, the answer is negative. For instance, for $\beta\to\infty$, Coja-Oghlan (2013) shows that all proper colorings disappear with high probability at values of the average degree strictly smaller than the average degree $c_{\rm ann}(\beta\to\infty)$ at which $\Phi_{\rm anneal}(\beta\to\infty)$ becomes negative. This means that $\Phi_{\rm anneal}$ cannot be equal to the quenched free entropy all the way up to $c_{\rm ann}$.
What could possibly go wrong with the paramagnetic fixed point $\chi^{i\to j}_{s} \equiv \frac{1}{q}$? One immediate thing we should ask is whether belief propagation converges to this fixed point on large random graphs. Is the paramagnetic fixed point even a stable one, i.e. if we initialize as $\chi^{i\to j}_{s} = \frac{1}{q} + \varepsilon^{i\to j}_{s}$ (with $\sum_s \varepsilon^{i\to j}_{s} = 0$ due to normalization), does BP converge back to $\chi^{i\to j}_{s} \equiv \frac{1}{q}$? To investigate this question one could implement the iterations on a large single graph and simply try it out.
A computationally more precise way to find the answer to these questions is to perform a linear stability analysis. Let $\theta = 1 - e^{-\beta}$, and consider the first-order Taylor expansion around the fixed point $\chi \equiv \frac{1}{q}$ (or equivalently $\varepsilon^{k\to j}_{s_k} \equiv 0$):
$$\chi^{j\to i}_{s_j}(t+1) = \frac{1}{q} + \sum_{k\in\partial j\setminus i}\sum_{s_k} \frac{\partial \chi^{j\to i}_{s_j}(t+1)}{\partial \chi^{k\to j}_{s_k}(t)}\bigg|_{\chi\equiv\frac{1}{q}}\; \varepsilon^{k\to j}_{s_k}$$
Note that
$$\frac{1}{q} = \frac{1}{Z^{j\to i}\big|_{\chi\equiv\frac{1}{q}}} \prod_{k\in\partial j\setminus i}\Big(1 - \frac{\theta}{q}\Big) = \frac{1}{Z^{j\to i}\big|_{\chi\equiv\frac{1}{q}}}\Big(1 - \frac{\theta}{q}\Big)^{d_j-1}\,, \qquad Z^{j\to i}\big|_{\chi\equiv\frac{1}{q}} = q\Big(1 - \frac{\theta}{q}\Big)^{d_j-1} \tag{6.18}$$
and
$$\frac{\partial \chi^{j\to i}_{s_j}(t+1)}{\partial \chi^{k\to j}_{s_k}(t)}\bigg|_{\chi\equiv\frac{1}{q}} = \frac{1}{Z^{j\to i}}\, \frac{\partial}{\partial \chi^{k\to j}_{s_k}(t)}\prod_{\ell\in\partial j\setminus i}\Big(1 - \theta\,\chi^{\ell\to j}_{s_j}(t)\Big)\bigg|_{\chi\equiv\frac{1}{q}} - \frac{1}{\big(Z^{j\to i}\big)^2}\prod_{\ell\in\partial j\setminus i}\Big(1 - \theta\,\chi^{\ell\to j}_{s_j}(t)\Big)\, \frac{\partial Z^{j\to i}}{\partial \chi^{k\to j}_{s_k}(t)}\bigg|_{\chi\equiv\frac{1}{q}}$$
$$= \delta_{s_j,s_k}\, \frac{-\theta\big(1-\frac{\theta}{q}\big)^{d_j-2}}{Z^{j\to i}\big|_{\chi\equiv\frac{1}{q}}} - \frac{\big(1-\frac{\theta}{q}\big)^{d_j-1}}{\Big(Z^{j\to i}\big|_{\chi\equiv\frac{1}{q}}\Big)^2}\,\bigg[-\theta\Big(1-\frac{\theta}{q}\Big)^{d_j-2}\bigg] = -\delta_{s_j,s_k}\,\frac{\theta}{q-\theta} + \frac{\theta}{q(q-\theta)}$$
These derivatives define a $q\times q$ matrix $T$ with elements $T_{s_j,s_k} = -\delta_{s_j,s_k}\frac{\theta}{q-\theta} + \frac{\theta}{q(q-\theta)}$, i.e. with diagonal entries $a = -\frac{\theta}{q-\theta} + \frac{\theta}{q(q-\theta)}$ and off-diagonal entries $b = \frac{\theta}{q(q-\theta)}$. Notice that the matrix $T$ has $q-1$ degenerate eigenvalues
$$\lambda_{\max} = a - b = \frac{-\theta}{q-\theta}\; \begin{cases} < 0\,, & \text{for } \beta > 0 \\ > 0\,, & \text{for } \beta < 0 \end{cases}$$
with eigenvectors of the form $(+1, -1, 0, \cdots)^{T}$; the remaining eigenvalue, associated with the uniform vector, is $0$ and plays no role because of the normalization $\sum_s \varepsilon_s = 0$.
Keeping this in mind, we can see that the linear expansion of the BP equations for the $\varepsilon$'s reads:
$$\varepsilon^{j\to i}_{s_j}(t+1) = \sum_{k\in\partial j\setminus i}\sum_{s_k} T_{s_j,s_k}\, \varepsilon^{k\to j}_{s_k}(t)\,. \tag{6.20}$$
We now define the excess degree distribution, $\tilde p_k = (k+1)p_{k+1}/c$, for a random graph ensemble with degree distribution $p_k$ and average degree $c$. Here $\tilde p_k$ is the probability that a randomly chosen edge is incident to a node that has $k$ other edges besides the chosen one, i.e. has degree $k+1$. Similarly, let $\tilde c = \sum_k k\, \tilde p_k$ be the average excess degree. With it we obtain, averaging (6.20) over the messages,
$$\langle\varepsilon(t+1)\rangle \approx \tilde c\, \lambda_{\max}\, \langle\varepsilon(t)\rangle\,,$$
where $\langle\cdot\rangle$ is the average over edges. A phase transition occurs at $\tilde c\,\lambda_{\max} = 1$, which determines whether $\lim_{t\to\infty}\langle\varepsilon(t)\rangle$ blows up to infinity or converges to zero:
$$\tilde c\,\lambda_{\max} = \frac{-\tilde c\,\big(1-e^{-\beta_{\rm stab}}\big)}{q - \big(1-e^{-\beta_{\rm stab}}\big)} = 1 \quad\Rightarrow\quad \beta_{\rm stab} = -\log\Big(1 + \frac{q}{\tilde c - 1}\Big) \tag{6.22}$$
This is the stability transition that we have already computed for the ferromagnetic solution for
β < 0. For β > 0 we notice that λmax < 0 and thus the corresponding instability corresponds to
an oscillation from one color to another at each parallel iteration. Such an oscillatory behaviour
would be possible on a bipartite graph, where indeed one side of the graph could have one
color and the other side another color. But this two-color solution is not compatible with the
existence of many loops of even and odd length in random graphs. The temperature βstab will
thus not have any significant bearing on the anti-ferromagnetic case β > 0.
Is it possible that the mean of the perturbation $\langle\varepsilon\rangle \to 0$ but the variance $\langle\varepsilon^2\rangle \nearrow \infty$? Let us investigate:
$$\Big\langle \big[\varepsilon^{j\to i}_{s_j}(t+1)\big]^2 \Big\rangle = \Bigg\langle \sum_{k\in\partial j\setminus i}\bigg[\sum_{s_k} T_{s_j,s_k}\varepsilon^{k\to j}_{s_k}(t)\bigg]^2 + \sum_{\substack{k,\ell\in\partial j\setminus i\\ k\neq \ell}}\bigg[\sum_{s_k}T_{s_j,s_k}\varepsilon^{k\to j}_{s_k}(t)\bigg]\bigg[\sum_{s_\ell}T_{s_j,s_\ell}\varepsilon^{\ell\to j}_{s_\ell}(t)\bigg] \Bigg\rangle = \tilde c\, \Bigg\langle\bigg[\sum_{s_k} T_{s_j,s_k}\,\varepsilon^{k\to j}_{s_k}(t)\bigg]^2\Bigg\rangle\,,$$
where the cross terms vanish since the neighbors are independent, $\big\langle\varepsilon^{k\to j}_{s_k}\varepsilon^{\ell\to j}_{s_\ell}\big\rangle = 0$.
Therefore, the variance is governed by the maximum eigenvalue $\lambda_{\max}$ of $T$, through
$$\big\langle \varepsilon(t+1)^2 \big\rangle \approx \tilde c\, \lambda_{\max}^2\, \big\langle \varepsilon(t)^2 \big\rangle\,.$$
We can thus distinguish two cases for the graph coloring problem:
• For $\tilde c < \frac{(q-\theta)^2}{\theta^2}$, BP converges back to $\chi\equiv\frac{1}{q}$. Specifically, for $\beta\to\infty$ we get
$$\tilde c_{\rm KS} = \frac{(q-\theta)^2}{\theta^2} \to (q-1)^2 \tag{6.24}$$
• For $\tilde c > \frac{(q-\theta)^2}{\theta^2}$, BP moves away from $\frac{1}{q}$ and actually does not converge.
Note that the abbreviation KS comes from the works on Kesten and Stigum (1967), and in
physics this transition is related to the works of de Almeida and Thouless (1978).
For Erdös-Rényi random graphs, where the average degree and the excess degree are equal
c̃ = c, in the setting q = 3 and β → ∞ we have that cKS (q = 3) = 4 < cann (q = 3) = 5.4. cKS
in this case is also smaller that the upper bound on colorability threshold from Coja-Oghlan
(2013). At average degree c > 4 the BP equations do not converge anymore and another
approach, based on replica symmetry breaking, will be needed to understand what is going
on. It is interesting to note that algorithms that are provably able to find proper 3-coloring
exist for all c < 4.03 Achlioptas and Moore (2003), so even slightly above the threshold
cKS (q = 3) = 4.
For q ≥ 4 and β → ∞ we have that cKS > cann and the investigated instability cannot be used
to explain what goes wrong in the colorable regime. An important motivation to understand
what is happening comes from the algorithmic picture that appears for large values of the
number of colors q. In that case the annealed upper bound on colorability scales as 2q log q,
and probabilistic lower bounds (see e.g. Coja-Oghlan and Vilenchik (2013)) imply that this
is indeed the right scaling for the colorability threshold. Yet, for what concerns polynomial
algorithms that provably find proper colorings, we only know of algorithms that work up
to degree q log q Achlioptas and Coja-Oghlan (2008), i.e. up to half of the colorable region.
Design of tractable algorithms able to find proper coloring for average degrees (1 + ϵ)q log q is
a long-standing open problem. What is happening in the second half of the colorable regime
will be clarified in the follow up lectures.
6.7 Exercises
6.7 Exercises 109
(a) Show from the generic formula for Bethe free entropy density that we derived in
the lecture
N M
1 X 1 X 1 X
Φgeneral
Bethe = log Z i + log Z a − log Z ia (†)
N N N
i=1 a=1 ia
where
X Y
Zi = gi (s) ψsa→i
s a∈∂i
X Y
a
Z = fa {si }i∈∂a χi→a
si
{si }i∈∂a i∈∂a
X
Z ia = ψsa→i χi→a
s
s
that the Bethe free entropy density for graph coloring can be written as
N
1 X 1 X
Φcoloring
Bethe = log Z (i) − log Z (ij)
N N
i=1 (ij)∈E
where
XYh i
Z (i) = 1 − 1 − e−β χk→i
s
s k∈∂i
X
Z (ij)
= 1 − 1 − e−β χi→(ij)
s χj→(ij)
s
s
(b) We now move the large d limit, and use β = − 2βdMF . Why is this necessary in order
to recover the fully connected model? Show that it leads indeed to the mean field
equation
m = tanh (βMF m). (6.27)
Another approach is to look directly to the free entropy, and to recover directly the free
entropy expression of the fully connected model by taking the large connectivity limit
of the Bethe free entropy:
(a) Show that the Bethe free entropy reads, according to Belief Propagation:
d d !
−β 1 − m
−β 1 + m
Φ(β, m) = log 1 − (1 − e ) + 1 − (1 − e )
2 2
d 1
− log 1 − (1 − e−β ) 1 + m2
2 2
(b) We use again β = 2βdMF . Show that, to leading order in d, we recover, as d → ∞ the
expression (1.22). Hint: First rewrite the expressions inside the log of first line as
d 2
1±m d log 1−(1−e− d βMF ) 1±m
1 − (1 − e−β ) (6.28)
2
=e .
2
Notice however that there is an additional trivial term βMF /2 and explain why this
additional term is here.
Exercise 6.3: Belief propagation for the matching problem on random graphs
Consider now the matching problem on sparse random graphs. Use the graphical
model representation from the previous homework.
(a) Write belief propagation equations able to estimate the marginals of the probability
distribution
N
1
eβS(ij) I S(ij) ≤ 1
Y Y X
P S(ij) (ij)∈E =
Z(β)
(ij)∈E i=1 j∈∂i
Be careful that in the matching problem the nodes of the graph play the role of
factor nodes in the graphical model and edges in the graph carry the variable nodes
in the graphical model.
(b) Write the corresponding Bethe free entropy in order to estimate log(Z(β)). Use re-
sults of the previous homework to suggest how to estimate the number of matchings
of a given size on a given randomly generated large graph G.
(c) Consider now d-regular random graphs and draw the number of matchings as a
function of their size for several values of d. Comment on what you obtained, does
it correspond to your expectation? If not, explain the differences.
6.7 Exercises 111
Part II
Probabilistic Inference
Chapter 7
In this chapter, we shall discuss the estimation, or the learning, of a quantity that we do not
know directly, but only through some indirect, noisy measurements. They are actually many
different ways to think of the problem depending on whether we are in the context of signal
processing, Bayesian statistics, or information theory, but it boils down to separating the signal
from the noise in some data.
This is the setting of Bayesian estimation. PX is called the "prior" distribution, as it tells us
what we know on the variable X a priori, before any measurement is done. PY |X is telling us
the probability to obtain a given result y, given the value of x. Seen as a function of x for a
116 F. Krzakala and L. Zdeborová
given value of y, L = (x; y) = PY|X (y, x) is called the Likelihood of x. What we are really
interested in, however, is the value of x (the signal) if we measure y (the data). This is called
the posterior probability of X given Y : PX|Y (x, y). To obtain the latter with the former, we
follow the direction given by Laplace and Bayes in the late XVIII century, and we just write
the celebrated "Bayes" formula:
so that the posterior PX|Y (x, y) is given by the product of the prior probability on X, PX (x),
times the likelihood PY |X (x, y), divided by the "evidence" PY (y) (which is just a normalization
constant). Of course, if we deal with continuous variable, we can write the same formula with
probability density instead:
Note that this Bayesian setting is not entirely general! Unfortunately, we often do not know
what PX is in many estimation problems (and sometime, we do not know PY |X either) which
makes the use of this formalism tricky (and has generated a long standing dispute between
so-called frequentist and Bayesian statisticians), but in this chapter, we shall forget about these
problems, and restricted ourselves to the situation where we do know these distributions, so
that one can safely use Bayesian statistics. There are many concrete problems where this is
the case (central to fields such as information theory and error correction, signal processing,
denoising, . . . ) and so this will be enough to keep us busy for some time. We will move to
more complicated situations later.
Let us look at three concrete problems where the "true value", a scalar that we shall denote x∗ ,
is generated by:
3. A Gauss-Bernoulli random variable that is 0 with probability 1/2, and a Gaussian with
mean 0 and variance 1 otherwise: X ∼ N (x; 0, 1)/2 + δ(x)/2.
We shall concentrate on noisy measurements with Gaussian noise. In this case, we are given n
measurements
√
yi = x ∗ + ∆zi , i = 1, · · · , N (7.4)
7.2 Scalar estimation 117
with zi ∼ N (0, 1) a standard Gaussian noise, with zero mean and unit variance. Following
Bayes formula, we can now compute the posterior probability for our estimate of x∗ as:
1 1 P (y −x)2
− i i2∆
PX|Y (x, y) = e PX (x) (7.5)
PY (y) (2π∆)N/2
The posterior tells us all there is to know about the inference of the unknown x∗ .
For instance, in the Rademacher example, if we are given the five measurement numbers:
1.04431591, 2.55352006, 1.43665582, 1.37069702, 0.77697312 . (7.6)
It is very likely that the x∗ = 1 rather than x∗ = −1. How likely? We can compute explicitly
the posterior and find
N
!
1 X yi
Rademacher
PX|Y (x|y) = = σ 2x (7.7)
−2x
N
P yi ∆
∆ i=1
1+e i=1
where σ(x) is the sigmoid function. We thus find that we can estimate the probability that
x∗ = 1 to be larger that 0.999. So we are indeed pretty sure of our estimation. What this
number really mean is "if we repeat many time such experiments: when we measure such
outcomes for the y, then less than 1 in 1000 times it would have been with x∗ = −1".
Let us move to the more difficult Gaussian example. In this case, x∗ is chosen randomly from
a Gaussian distribution, and we are given 10 measurements:
0.04724576, 1.26855971, −0.19887457, 1.09534511, −1.46442807
0.44767123, 2.6244575, 1.94488421, 0.58953688, 0.572018 . (7.8)
We compute explicitly the posterior and find that it is itself also a Gaussian, given by:
PN
yi
∆
Gaussian
PX|Y (x|y) = N x; i=1 , . (7.9)
N + ∆ N + ∆
The posterior distribution for this particular data set is shown in figure 7.2.1: it is Gaussian
with mean 0.630. and variance 0.091.. Actually, the true value of x∗ is this case was 0.67.
Let us consider finally the third example. In this case the posterior is slightly more complicated
and reads
P N
yi
i=1 ∆
N x; N +∆ , N +∆
δ(x)
Gauss−Bernoulli
PX|Y (x|y) = N
!2 + N
!2 (7.10)
P P
yi − yi
i=1 i=1
q q
∆ N +∆
1+ N +∆ e
2∆(N +∆) 1+ ∆ e
2∆(N +∆)
The posterior is shown in Figure 7.2.2 (here x∗ = 0.8, and p non zero is 0.8233787142909471).
118 F. Krzakala and L. Zdeborová
Figure 7.2.1: Posterior distribution for the Gaussian prior example and the data (7.8). The
true value x∗ is marked by a red vertical line.
Figure 7.2.2: Posterior distribution for the Gaussian prior example and the data (7.11). Note
the point mass component represented by the blue arrow and the Gaussian component. The
true value x∗ is marked by the red line.
Often, we are not really interested by the posterior distribution, but rather by a given estimate
of the unknown x∗ . We would really like to give a number and make our best guess! Such
estimates are denoted as x̂(y). x̂(y) should be a function of the data y that gives us a number
that is, ideally, as close as possible to the true value x∗ . The first idea that comes to mind is to
use the most probable value, that is the "mode" of the posterior distribution. This is called
maximum a posteriori estimation:
The MAP estimate is the default-estimator. This is the one of choice in many situations, in
particular because it is very often simple to compute.
However, it is not (at least for finite amount of data) always the best estimator. For instance
what if the posterior has bad looking as below? Is x̂MAP still reasonable for this case? We need
to think of a way to define a "best" estimator. The particular choice of the estimator depends
7.2 Scalar estimation 119
P (x|y)
on our definition of "error". For instance one could decide to minimize the squared error
(x̂(y) − x∗ )2 , or the absolute error |x̂(y) − x∗ |. If x∗ is discrete, however, we might instead
be interested by minimizing the probability of having a wrong value and use 1 − δx̂(y),x∗ .
Depending on our objective, we shall see that we should use a different estimator.
Let us consider the expected error one can get using a given estimator. Formally, we define a
"risk" as the average of the loss function L(x̂(y), x∗ ) over the jointed distribution of signal and
measurements. This is called the averaged posterior risk, as it is —indeed— the average of the
posterior risk:
Z
R averaged
(x̂) = Ex ,y [L(x̂, x )] = PX,Y (x∗ , y)dx∗ dyL(x̂(y), x∗ )
∗
∗
(7.13)
Z Z
= dyPY (y) PX|Y (x∗ , y)dx∗ L(x̂(y), x∗ ) (7.14)
(7.15)
posterior
= EY R (x̂, y)
Our goal, of course, is to find a way to minimize this risk. This minimal value, that is the "best"
possible error one can possibly get (on average) is called the Bayes risk:
Our goal is two-fold: we want to know what is the best possible error, the Bayes error, as well
as how to get it: we want to know what is the Bayes-optimal estimator that gives us the Bayes
risk.
This line of reasoning leads, for the square loss, to the following theorem:
Theorem 10 (MMSE Estimator). The optimal estimator for the square error, called the Minimal
Mean Square Error (MMSE) estimator, is given by the posterior mean:
and the minimal mean square error is given by the variance of the estimator with respect to the posterior
distribution:
Z
MMSE =: min RBayes (y, x̂(.)) = dx∗ dyPX|Y (x∗ , y)(EX|Y (x, y) − x∗ )2 = VarPX|Y [X]
x̂
(7.18)
120 F. Krzakala and L. Zdeborová
In what follows we shall study many such problems with Gaussian noise, so it is rather
convenient to define the optimal denoising function as the MMSE estimator for a given
problem:
(x−R)2
dxxe− 2Σ2 P (x)
R
η(R, Σ) =: EP (X|X+R+ΣZ) = R (x−R)2
(7.22)
dxe− 2Σ2 P (x)
It is interesting to see, however, that for other errors, the optimal function can be different.
If one choose the absolute value as the cost, then we find instead that one should use the
Median:
Theorem 11 (MMAE Estimator). The optimal estimator for the absolute error, called the Minimal
Mean absolute error (MMAE) estimator, is given by the posterior median:
Finally, if we are interested to choose between a finite number of hypothesis, like in the case
±1, or if we want to know if the number was exactly zero in the Gauss-Bernoulli case, a good
measure of error is to look to the optimal decision version and to minimize the number of
mistakes:
7.2 Scalar estimation 121
Theorem 12 (Optimal Decision). The Optimal Bayesian decision estimator is the one that maximizes
the (marginal) probability for each class:
Let us now discuss how to think about these problems with a statistical physics formalism.
We can write down the posterior distribution as
A way to define our Boltzmann measure would be to use β = 1, H(x; y) = − log (P (y | x)) −
log (P (x)), and Z(y) = P (y). In practice, for such problems with a Gaussian noise, we shall
employ a slightly different convention that is more practical, and use instead
P (yi −x)2
exp (log (P (y | x) P (x))) 1 e− i 2∆ P (x)
P (x | y) = = (7.30)
P (y) (2π∆)N/2 P (y)
P yi2
e− i 2∆ x2
P yi x
− 2∆
= e i ∆
P (x) (7.31)
(2π∆)N/2 P (y)
P x2 yi x
i − 2∆ + ∆
e P (x)
=: (7.32)
Z(y)
with −1
yi2
P
Z P x2 yi x −
i − 2∆ + ∆
e i 2∆
Z(y) = dx e P (x) =
(2π∆)N/2 P (y)
Interestingly, with this definition, the partition sum is also equal to the ratio between the
probability that y is a pure random noise (a Gaussian with variance ∆), and that y has been
actually generated by a noisy process from x:
PYmodel (y)
Z(y) = .
PYrandom (y)
This is called the likelihood ratio in hypothesis testing. Obviously, if the two distributions
are the same, then Z = 1 for all values of y. With this definition, we define the expected free
entropy, as before, as
FN = EY log Z(y) . (7.33)
In fact, the free entropy turns out to be nothing more than the Kullback-Liebler divergence
between PYmodel (y) and PYrandom (y):
PYmodel (y)
FN = Emodel
Y log = DKL (PYmodel (y)|PYrandom (y)) (7.34)
PYrandom (y)
122 F. Krzakala and L. Zdeborová
Many other information quantities would have been equally interesting, but they are all
equivalent. We could have used for instance the entropy of the variable y, which is related
trivially to our free entropy.
Ey [y 2 ] N
H(Y ) = −EY log P (y) = N + log 2π∆ − FN (7.35)
2∆ 2
Information theory practitioners would, typically, use the mutual information between X and
Y , that is the Kullback-Leibler distance between the jointed and factorized distribution of Y
and X.
Again, this can be expressed directly as a function of the free entropy, using (see exercise
section for basic properties of the mutual information and conditional entropies):
N
I(X, Y ) = H(Y ) − H(Y |X) = H(Y ) − log(2πe∆) (7.37)
2
N Ey [y 2 ] N Ex [x2 ] + ∆
= −FN − +N =F − +N (7.38)
2 2∆ 2 2∆
2
Ex [x ]
= −FN + N (7.39)
2∆
Given these equivalences, we shall thus focus on the free entropy.
Before going further, we need to note some important mathematical identities that we shall
use all the time, especially in the context of Bayesian inference.
The first one is a generic property of the Gaussian integrals, a simple consequence of integration
by part, called Stein’s lemma:
Lemma 7 (Stein’s Lemma). Let X ∼ N (µ, σ 2 ). Let g be a differentiable function such that the
expectation E [(X − µ)g(X)] and E [g ′ (X)] exists, then we have
E [Xg(X)] = E g ′ (X)
Additionally, there is a set of identities that are extremly useful, that are usually called "
Nishimori symmetry" in the context of physics and error correcting codes. In its more general
form, it reads
7.2 Scalar estimation 123
Theorem 13 (Nishimori Identity). Let X (1) , . . . , X (k) be k i.i.d. samples (given Y ) from the
distribution P (X = · | Y ). Denoting ⟨·⟩ the "Boltzmann" expectation, that is the average with respect
to the P (X = · | Y ), and E [·] the "Disorder" expectation, that is with respect to (X ∗ , Y ). Then for all
continuous bounded function f we can switch one of the copies for X ∗ :
hD E i D E
∗
(1)
E f Y, X , . . . , X (k−1)
,X (k) (1)
= E f Y, X , . . . , X (k−1)
,X (7.40)
k k−1
Proof. The proof is a consequence of Bayes theorem and of the fact that both x∗ and any of
the copy X (k) are distributed from the posterior distribution. Denoting more explicitly the
Boltzmann average over k copies for any function g as
D E k
Z Y
g(X (1)
,...,X (k)
) =: dxi P (xi |Y )g(X (1) , . . . , X (k) ) (7.41)
k
i=1
We shall drop the subset "k" from Boltzmann averages from now on. The Nishimori property
has many useful consequences that we can now discuss. First let us look at the expression of
the MMSE. It has a nice expression in terms of overlaps:
2 h i
MMSE(λ) = Ey,x∗ ⟨x⟩y − x ∗
= Ey,x∗ ⟨x⟩2y + (x∗ )2 − 2x∗ ⟨x⟩y = q + q0 − 2m
where
h i h i
• q ≜ Ey ⟨x⟩2y = Ey x(1) x(2) y is overlap between two copies
h i
• q0 ≜ Ex∗ (x∗ )2 is the self overlap
h i
• m ≜ Ey,x∗ x∗ ⟨x⟩y is the overlap with the truth.
Using Stein’s lemma on the variable z, the third term can be written as
"√ # "√ #
∆ ∆ 1
∂z ⟨x⟩ = Ez,x∗ ⟨x2 ⟩ − ⟨x⟩2 (7.47)
Ez,x∗ ⟨x⟩z = Ez,x∗
2 2 2
so that
1 1
∂∆−1 F = − Ez,x∗ ⟨x2 ⟩ + Ez,x∗ [⟨x⟩x∗ ] + Ez,x∗ ⟨x2 ⟩ − ⟨x⟩2 (7.48)
2 2
∗ 1
(7.49)
2
= Ez,x∗ [⟨x⟩x ] − Ez,x∗ ⟨x⟩
2
q m
=m− = (7.50)
2 2
Where the last step follows from Nishimori.
It is obvious to check that all the theorems that we have discussed applied equally to d-
dimensional vectors. We can thus apply our newfound knowledge to a more interesting
problem: denoising a sparse vector.
the d vectors x = [10000 . . .], x = [01000 . . .], etc. Instead of this vector, you are given a noisy
d-dimensional vector y which has been polluted by a very small Gaussian noise
r
∗ ∆
y=x + z (7.51)
N
Can we recover x∗ ? We proceed in the Bayesian way, and write that
(yi −xi )2
− Q x2
1 i N (yi , 0, ∆/N )
Y e 2∆/N Y − i −2xi yi
P (x|y) = P (x) p = e 2∆/N P (x) (7.52)
P (y) 2π∆/N P (y)
i i
We recognize that i N (yi , 0, ∆/N ) is just the probability of the null model (if the vector y
Q
was simply a random one). Additionally, using the prior on x tell us that it has to be on the
corner of the hypercube (one of the vector xi which are zero everywhere but xi = 1), we write
P random (y) 1 N yi − N 1 1 N yi − N
P (xi |y) = e∆ 2∆ =: e∆ 2∆ . (7.53)
P model (y) 2N Z 2N
As before, the partition sum stands for the ratio of probability between the model and the null,
and reads
d d N δi,i∗
1 X − N + N yi 1 X − 2∆
q
N
+ ∆ + N z
Z= e 2∆ ∆ = e ∆ i (7.54)
2N 2N
i=1 i=1
1
Φ(∆) = lim E log Z (7.55)
N →∞ N
This can be done rigorously, as we shall now see. In fact we can prove the following expression
for the free energy:
1
f (∆) = − log 2 , if , ∆ ≤ 1/2 log 2 (7.57)
2∆
f (∆) = 0 , if ∆ ≥ 1/2 log 2 (7.58)
We shall prove this theorem by proving an upper and lower bound. Let us start by
Proof. The bound comes from using only one term in the sum, the one corresponding to the
correct position i∗ :
q
1 1 1 − 2∆
N
+N +z N
ΦN (∆) = E [log Z] ≥ E log e ∆ ∆ (7.60)
N N 2N
1
≥ − log 2 (7.61)
2∆
Additionally, since the two distributions become indistinguishable for infinite noise and that Z
is just the likelihood ratio, we have ΦN (∆ = ∞) = 0. Since ∂∆ ΦN (∆) = (∂∆−1 ΦN (∆))(∂∆ (1∆)) =
q
− 2∆ 2 ≤ 0, we have ΦN (∆) ≥ 0.
Proof. The bound comes from the Jensen inequality (the annealed bound):
q
1 1 1
q
N ∗ N N N
+z −N log 2 − +z
X
ΦN (∆) = E [log Z] ≤ Ezi∗ log e 2∆ i ∆ + Ezi N e 2∆ i ∆
N N 2
i̸=i∗
1 1
q
N
+z ∗ N −N log 2 N N
≤ Ezi∗ log e 2∆ i ∆ + 1 − N e 2∆ − 2∆ (7.63)
N 2
1
q
N ( 2∆
1
−log 2)+zi∗ N
≤ Ezi∗ log e ∆ +1
N
It is intuitively clear that, depending on where or not the term 1/2∆ − log 2 in the exponential
is positive or negative, then we should expect to either completely dominate the expression,
or to disappear exponentially.
We can show this with rigor, for instance by defining the monotonic growing function
q
N ( 2∆
1
−log 2)+zi∗ N
g(zi∗ ) =: e ∆ + 1. (7.64)
We have g(zi∗ ) ≤ g(|zi∗ |), and
r
g′ N
log g(|zi∗ |) ≤ log g(0) + |zi∗ |max = log g(0) + |zi∗ | (7.65)
g ∆
so that
1 1 1 1
log 1 + eN ( 2∆ −log 2) + E|z| √
1
E log g(zi∗ ) ≤ E log g(|zi∗ |) ≤ (7.66)
N N N N∆
1
N ( 2∆ −log 2)
1
≤ log 1 + e + o(1) (7.67)
N
We conclude by noting that the bounds tends to f (∆) as N → ∞ (taking the exponential of
1/2∆ − log 2 out of the log) and using log(1 + x) ≤ x.
Now that we know the free entropy, we can apply the I-MMSE theorem. A phase transition
occurs depending on ∆ being lower or larger than ∆c
1
∆c =
2 log 2
7.3 Application: Denoising a sparse vector 127
If ∆ > ∆c , the MMSE is 1, and we cannot find the signal. Even the best guess is not better than
a random one. If, on the other end, ∆ < ∆C , then we should be able to solve the problem, and
finds a perfect MMSE (that is, a zero error).
EASY IMPOSSIBLE
∆
∆c
For signal of dimension d, and a noise σ 2 this yields σc2 = 1/(2 log(d)).
Actually, one can prove an even stronger result in the regime ∆ ≥ ∆c . As we show in appendix
7.A, not only the free entropy divided by N goes to zero, but the total free entropy as well. We
recall that it is nothing but the KL divergence between the distribution of the model and the
random one:
PYmodel (y)
FN = Emodel
Y log = DKL (PYmodel (y)|PYrandom (y)) →N →∞ 0 (7.68)
PYrandom (y)
This actually means that the two distributions are eventually just becoming just the same one,
and are thus indistinguishable: not only we cannot find the signal but there is just no way
to know that a signal has been hidden for ∆ > ∆c , as the data looks perfectly like Gaussian
noise.
Bibliography
The legacy of the Bayes theorem, and the fundamental role of Laplace in the invention of
"inverse probabilities" is well discussed in McGrayne (2011). Bayesian estimation is a fun-
damental field at the frontier between information theory and statistics, and is discussed in
many references such as Cover and Thomas (1991). The I-MMSE theorem was introduced by
Guo et al. (2005). Nishimori symmetries were introduced in physics is Nishimori (1980) and
soon realized to have deep connection to information theory Nishimori (1993) and Bayesian
inference Iba (1999). The model of denoising a sparse vector was discussed in Donoho et al.
(1998). This problem has deep relation to Shannon’s Random codes Shannon (1948) and the
Random energy model in statistical physics Derrida (1981).
128 F. Krzakala and L. Zdeborová
7.4 Exercises
In what follow, we shall denote the entropy of a random variable X with a distribution
pX (x) as Z
H(X) = − dx p(x) log p(x)
1
H(X) = log 2πe∆
2
• Mutual information:
The mutual information between two (potentially) correlated variable X and Y
is defined as the Kullback-Leibler divergence between their joint distribution and
the factorized one. In other words, it reads
pX,Y (x, y)
Z
I(X; Y ) = DKL (PX,Y ||PX PY ) = dxdy pX,Y (x, y) log
pX (x) pY (y)
Show that the mutual information satisfies the following chain rules:
1
H(X|Y ) = log 2πe∆
2
Perform simulation of the 3 models discussed in section 6.2.1 with the MMSE, MAP,
and MMAE estimators discussed in section 6.2.2
Using different values for the number of observation n (from 10 to 1000, or even more)
and averaging your finding on many instances, plots how, for each problems the error
7.4 Exercises 129
We saw in section 6.2.5 that the first derivative of the free entropy (with Gaussian noise)
with respect to ∆−1 is (one-half) the overlap m.
Compute the second derivative (use again Stein and Nishimori) and relate it to a
variance of a quantity. Show that it implies the convexity of the free entropy with
respect to ∆−1 .
Appendix
We shall here prove that the average partition sum is not only going to zero when ∆ > ∆c ,
but that the corrections to the free entropy are actually exponentially small. This require a
tighter analysis of the likelihood ratio and of the upper bound in 9.
Our goal is to shat that for ∆ > ∆c , this is exponentially small. Let us simplify notation a bit
and denote f = − 2∆ 1
+ log 2 > 0, and write, spitting the integral in two:
2
Z +∞ − x2
1
q q
−f N +z N e −f N +z N
ΦN (∆) ≤ Ez log e ∆ + 1 = dz √ log e ∆ + 1 (7.69)
N −∞ 2π
Z √N ∆f x2 Z ∞ − x2
2
e− 2
q q
−f N +z N e −f N +z N
≤ dz √ log e ∆ + 1 + √ √ log e ∆ + 1
−∞ 2π N ∆f 2π
(7.70)
≤ I1 + I2 (7.71)
We first deal with I1 . Since the exponential term is positive, we can write, using the "worst"
possible value of z:
√ x2 2
Z ∞ − x2
N ∆f (1−ϵ)
e− 2
Z q
−f N +z N e
I1 = dz √ log e ∆ + 1 ≤ dz √ log e−f N +N f (1−ϵ) + 1
−∞ 2π −∞ 2π
Z ∞ x2
e− 2
≤ dz √ log e−ϵf N + 1 ≤ e−ϵf N
−∞ 2π
2 /2
For a Gaussian random variable, we have that P (Z > b) ≤ e−b thus
2
√
q
N − 12 N ∆f (1−ϵ)− N
−f N + 2∆
(7.75)
∆
I2 ≤ e e
N
−f N + 2∆ − 12 N ∆f (1−ϵ)2 − 21 N
= e e ∆
N f (1−ϵ)
e e (7.76)
−ϵN − 12 N ∆f (1−ϵ)2
= e e (7.77)
This is again decaying exponentially fast. We thus obtain the following result:
Note that for the divergence to go to zero, one needs the total free entropy to go to zero, not
just the one divided by N (that is, the density).
It is instructive, and a good exercise, to redo the computation of the free entropy in theorem:15
using the replica method. The computation is very reminiscent of the one for the random
energy model in chap 15, which we encourage the reader to consult. In this, the present is
nothing but a "Bayesian" version of the random energy model.
Let us see how the replica computation goes. We first remind that the partition sum reads
d N δi,i∗ d N δi,i∗
1 X − 2∆
q q
N
+ ∆ + N∆ i = e−N (log 2+ 2∆ )
z 1 + N z
X
Z= e e ∆ ∆ i (7.79)
2N
i=1 i=1
We now move to the computation of the averaged free entropy by the replica method, starting
zith the replicated partition sum:
n d q !
N N
Z n = e−nN ( )
1 Y X δ ∗+ z
log 2+ 2∆
e ∆ i,i ∆ i . (7.80)
a=1 i=1
7.B A replica computation for vector denoising 133
First, let us start using seemingly trivial rewriting, using i∗ = 1 without loss of generality:
q
d Pn N
z +N δ
n nN (log 2+ 2∆
1
) ∆ ia ∆ ia ,1
X a=1
Z e = e (7.81)
i1 ,...,in =1
d Pn q
Pn N N
z
X
= e a=1 ∆ δia ,1 e a=1 ∆ ia (7.82)
i1 ,...,in =1
d Pn Pd q
Pn N N
z δ
X
= e a=1 ∆ δia ,1 e a=1 j=1 ∆ j j,ia (7.83)
i1 ,...,in =1
d d Pn q
Pn N N
zj δ
X Y
= e a=1 ∆ δia ,1 e a=1 ∆ j,ia (7.84)
i1 ,...,in =1 j=1
Now we perform the expectation over disorder, using the fact that we have now a product of
independent Gaussians:
d Pn N d Pn q N
E [Z n ] = e−nN (log 2+ 2∆ ) E
1 X Y z δ
e a=1 ∆ δia ,1 e j a=1 ∆ j,ia (7.85)
i1 ,...,in =1 j=1
d d P q
Pn n N
= e−nN (log 2+ 2∆ )
1 X N Y z δ
e a=1 ∆ δia ,1 E e j a=1 ∆ j,ia (7.86)
i1 ,...,in =1 j=1
2
Using E ebz = eb /2 for Gaussian variables, we thus find
d Pn d Pn
−nN (log 2+ 2∆
1
) N N 2
X Y
n
E [Z ] = e e a=1 ∆ δia ,1 e 2∆ ( a=1 δj,ia ) (7.87)
i1 ,...,in =1 j=1
d Pd Pn
Pn N
−nN (log 2+ 2∆
1
) N
X
= e e a=1 ∆ δia ,1 e 2∆ j=1 a,b=1 δj,ia δj,ib (7.88)
i1 ,...,in =1
d Pn Pn
e∆( )
N 1
−nN (log 2+ 2∆
1
)
X
= e a=1 δia ,1 + 2 a,b=1 δia ,ib (7.89)
i1 ,...,in =1
Given the replicas configurations (i1 ...in ), that can take values in 1, . . . , d, we now denote the
so-called n × n overlap matrix Qab = δia ,ib , that takes elements in 0, 1, respectively if the two
replicas (row and column) have different or equal configuration. We also write the n × n
magnetization vector Ma = δia ,1 . With this notation, we can write the replicated sum as
d Pn Pn
e∆( Qa,b )
N
−nN (log 2+ 2∆
1
) Ma + 12
X
n
E[Z ] = e a=1 a,b=1 (7.90)
i1 ,...,in =1
Pn Pn
#(Q, M )e ∆ ( Qa,b )
N
Ma + 12
= e−nN (log 2+ 2∆ )
1 X
a=1 a,b=1 (7.91)
{Q},{M }
where {Q},{M } is the sum over all possible such matrices and vectors, while #(Q, M ) is the
P
numbers of configurations that leads to the overlap matrix Q and magnetization vector M .
In this form, it is not yet possible to perform the analytic continuation when n → 0. Keeping
for a moment n integer, it is however natural to expect that the number of such configurations
134 F. Krzakala and L. Zdeborová
(for a given overlap matrix and magnetization vector), to be exponentially large. Denoting
#(Q, M ) = eN s(Q,M ) we thus write
Z Pn
Z
1 Pn
e nN (log 2+ 2∆
1
) EZ n
≈ dQ dM e N s(Q,M )+ N
∆ ( a=1 Ma + 2 a,b=1 Q a,b ) =: dQ, M eN g(∆,Q,M )
In this case, we have only three natural choices for the entries of Q and M :
1. All the replicas are in the same, identical configuration, that is ia = i∀a. Let us further
assume that i ̸= 1. In this case Qab = 1 for all a, b, and Ma = 0 for all a. There are
d − 1 ≈ 2N possibility for this so s(Q, M ) = log 2 and we find g(β, Q) = log 2 + n2 /∆.
This does not look right: this expression does not have a limit with a linear part in n,
so we cannot use this solution in the replica method. Clearly, this is a wrong analytical
continuation.
2. All the replica are in the same, identical configuration, which is the "correct" one ia =
i = 1. Then Qab = 1 for all a, b, and Ma = 1 for all a. There is only one possibility, so
that s(Q, M ) = 0 and g(β, Q) = n/∆ + n2 /2∆.
3. If instead all replicas are in a different, random, configurations then Qaa = 1, Qab = 0
for all a ̸= b and Ma = 0. In this case #(Q) = 2N (2N − 1) . . . (2N − n + 1), so that
s(Q) ≈ n log 2 if n ≪ N . Therefore g(β, Q) = n/2∆ + n log 2.
At the replica symmetric level, we thus find that the free entropy is given by two possible
solutions as n → 0. In the first one all replicas are in the correct, hidden solution:
while in the second case, all replicas are distributed randomly over all states and
We thus have recovered exactly the rigorous solution from the replica method. Indeed,
choosing the right solution is easy: the free entropy is continuous and convex in ∆, non-
negative, and goes from ∞ to 0 as ∆ grows, so that we the free energy must be log 2 − 1/2∆
for ∆ < ∆c = 1/2 log 2 and 0 for ∆ > ∆c .
Chapter 8
The signal is the truth. The noise is what distracts us from the
truth [. . . ] Distinguishing the signal from the noise requires both
scientific knowledge and self-knowledge: the serenity to accept the
things we cannot predict, the courage to predict the things we can,
and the wisdom to know the difference.
Now that we presented Bayesian estimation problems, we can apply our techniques to a
non-trivial problem. This is a perfect example for testing our newfound knowledge.
i.i.d. i.i.d.
where x∗ ∈ RN with x∗i ∼ PX (x), ξij = ξji ∼ N (0, 1) for i ≤ j.
This is called the Wigner spike model in statistics. The name "Wigner" refer to the fact that Y
is a Wigner matrix (a symmetric random matrix with components sampled randomly from a
Gaussian distribution) plus a "spike", that is a rank one matrix x∗ x∗⊺ .
Our task shall be to recover the vector x from the knowledge of Y . As we just learned, this
136 F. Krzakala and L. Zdeborová
q 2
"N # − 12 yij − Nλ ∗ ∗
xi xj
1 Y Y e
P (x | Y) = PX (xi ) √
Z(Y) 2π
i=1 i≤j
For completeness, we present here an alternative model, which is also extremely interesting,
called the Wishart-spike model. In this case
r
λ ∗ ∗⊺
Y= u v } + ξ
N | {z |{z}
M ×N rank-one matrix iid noise
Strictly speaking, the name "Wishart" might sounds strange here. This is coming from the
fact that this model, for Gaussian vectors u, is exactly the same as another model involving a
Wishart matrix, a model also called the Spiked Covariance Model. Indeed, when the factors
are independent, the model can be viewed as a linear model with additive noise and scalar
random design:
r
λ
yi = v j u + ξj , (8.1)
N
Assuming the vj have zero mean and unit variance, this indeed is a model of spiked covariance:
YY
the mean of the empirical covariance matrix Σ = Y N is a rank one perturbation of the identity
1 + uu Random covariance matrices are called Wishart matrices, so this is a model with a
T
Regardless of its name, given the matrix Y , the posterior distribution over X reads
q 2
# − 21 yij − Nλ ∗ ∗
"M N ui vj
1 Y Y Y e
P (u, v | Y) = PU (ui ) PV (vi ) √
Z(Y) 2π
i=1 j=1 i,j
8.2 From the Posterior Distribution to the partition sum 137
We shall now make a mapping to a Statistical Physics formulation. Consider the spike-Wigner
model, using Bayes rule we write:
" # q 2
1 λ
P (Y | x) P (x) Y Y 1 − y ij − x x
N i j
P (x | Y) = ∝ PX (xi ) √ e 2
P (Y) 2π
i i≤j
" # " r #
Y X λ λ
∝ PX (xi ) exp − x2 x2 + yij xi xj
2N i j N
i i≤j
" # " r #
1 Y X λ 2 2 λ
⇒ P (x | Y) = PX (xi ) exp − x x + yij xi xj
Z(Y) 2N i j N
i i≤j
x̂MSE,1 (Y)
.
Z
⇒ x̂MSE (Y) =
.. ,
x̂MSE,i (Y) = ⟨xi ⟩Y = dx P (x | Y) xi
x̂MSE,N (Y)
" # " r #
1 Y X λ 2 2 λ
P (x | Y) = PX (xi ) exp − x x + yij xi xj
Z(Y) 2N i j N
i i≤j
" # " r #
1 Y X λ 2 2 λ λ
= PX (xi ) exp − xi xj + xi xj x∗i x∗j + ξij xi xj
Z(Y) 2N N N
i i≤j
We are interested in
1 1
lim EY log (Z(Y)) = lim Ex∗ ,ξ log (Z(Y))
N →∞ N N →∞ N
138 F. Krzakala and L. Zdeborová
"Z !2 !2 #
(α) X x(α) x(β)
Y
(α)
(α) λN X X x∗i xi λN X i i
= Ex∗ PX xi dxi exp +
α,i
2 α i
N 2 i
N
α<β
"Z Z Y ! Z Y !
(c) Y
(α)
(α) 1 X (α) ∗ 1 X (α) (β)
= Ex∗ PX xi dxi δ mα − x xi dmα δ qαβ − x xi dqαβ
α,i α
N i i N i i
α<β
#
λN X 2 X 2
exp mα + qαβ
2 α α<β
"Z Z Y h Z Y
P (α) i h P (α) (β) i
m̂α N mα − i xi x∗
(d)
q̂αβ N qαβ − i xi xi
Y (α) (α) i
= Ex∗ PX xi dxi e dm̂α dmα e dq̂αβ dqαβ
α,i α α<β
#
λN X 2 X 2
exp mα + qαβ
2 α α<β
Z Y Z Y
(e) λN X 2 X 2 X X
= dm̂α dmα dq̂αβ dqαβ exp mα + qαβ + N mα m̂α + qαβ q̂αβ
α
2 α α
α<β α<β α<β
N
Z Y X X
Ex ∗ PX (xα ) dxα exp − m̂α x∗ xα − q̂αβ xα xβ
α α α<β
where
2 /2
(a) uses the fact that Dz eaz = ea
R
(c) partitions the huge integral according to overlap with true signal mα and overlap between
two distinct replicas qαβ with definitions
Under replica symmetry Ansatz, we have mα ≡ m, m̂α ≡ m̂, qαβ ≡ q, q̂αβ ≡ q̂.
Z Z h 2 −n i h 2 i
λN
nm2 + n q 2 +N nmm̂+ n −n q q̂
E[Z n ] = dm̂ dm dq̂ dq e 2 2 2
×
( "Z #)N
Y P P
× Ex∗ PX (xα ) dxα e−m̂ α x∗ xα −q̂ α<β xα xβ
α
Z
(a) n−1
dm̂ dm dq̂ dq enN [ 2 m + 4 (n−1)q +mm̂+ 2 qq̂] ×
λ 2 λ 2
=
( "Z #)N
√ P
Z
PX (xα ) dxα e 2 α xα −m̂ α x∗ xα Dz e−iz q̂ α xα
Y q̂ P 2
P
× Ex∗
α
Z
(b) n−1
dm̂ dm dq̂ dq enN [ 2 m + 4 (n−1)q +mm̂+ 2 qq̂] ×
λ 2 λ 2
=
( "Z
Y Z √
#)N
× Ex∗ Dz
q̂ 2
x
PX (xα ) dxα e 2 α − m̂x x
∗ α −i z q̂xα
α
Z
(c) nN [ λ m2 + λ (n−1)q 2 +mm̂+ n−1 q q̂ ]
= dm̂ dm dq̂ dq e 2 4 × 2
Z √ ion
N
i
n h q̂ 2
x − m̂x x− z q̂x
× Ex∗ Dz Ex e 2 ∗
Z
(d)
dm̂ dm dq̂ dq enN [ 2 m ]×
λ 2 − λ q 2 +mm̂− 1 q q̂
= 4 2
√
q̂ 2
−m̂x∗ x−iz q̂x
PX (x) dx e 2 x
R R
nN Ex∗ Dz log
Z ×e
= dm̂ dm dq̂ dq enN Φ(m,q,m̂,q̂)
where
√
Z Z
λ 1 q̂ 2
Φ(m, q, m̂, q̂) = (2m2 − q 2 ) + mm̂ − q q̂ + Ex∗ Dz log PX (x) dx e 2
x +( q̂z−m̂x∗ )x
4 2
and
Then we have
!2
X q̂ X q̂ X q̂ X
exp −q̂ xα xβ = exp − xα xβ = exp x2 − xα
2 2 α α 2 α
α<β α̸=β
!Z !
q̂ X 2 1 z 2
√ dz exp − − iz q̂
p X
= exp x xa
2 α α 2π 2 α
!Z !
q̂ X 2
Dz exp −iz q̂
p X
= exp x xa
2 α α α
(d) take limit n → 0 and use the fact that when n is small we have
h i
E[X n ] = E en log(X) ≃ E [1 + n log(X)] = 1 + nE [log(X)]
= exp (log (1 + nE [log(X)])) ≃ exp (nE [log(X)])
Recall that according to the Nishimori identity we have q = m and q̂ = m̂, it simplifies to
λ 2 1
ΦNishi (m, m̂) ≜ Φ(m, q, m̂, q̂)|q=m,q̂=m̂ = m + mm̂
4 Z 2
√
Z
+ Ex∗ Dz log
m̂ 2
PX (x) dx e 2 x −( i m̂z+m̂x ∗ )x
We can further reduce the problem by taking partial derivative w.r.t. m and set it to zero
∂ λ 1
ΦNishi (m, m̂) = m + m̂ = 0 ⇒ m̂ = −λm
∂m 2 2
Plug this back to ΦNishi (m, m̂) we will obtain the final free entropy function under replica
symmetry Ansatz
Preliminaries
Lemma 11 (Stein’s Lemma). Let X ∼ N (µ, σ 2 ). Let g be a differentiable function such that the
expectation E [(X − µ)g(X)] and E [g ′ (X)] exists, then we have
E [Xg(X)] = E g ′ (X)
Proposition 1 (Nishimori Identity). Let (X, Y ) be a couple of random variables on a polish space.
Let k ≥ 1 and let X (1) , . . . , X (k) be k i.i.d. samples (given Y ) from the distribution P (X = · | Y ),
independently of every other random variables. Let us denote ⟨·⟩ the expectation w.r.t. P (X = · | Y )
and E [·] the expectation w.r.t. (X, Y ). Then for all continuous bounded function f
hD Ei hD Ei
E f Y, X (1) , . . . , X (k−1) , X (k) = E f Y, X (1) , . . . , X (k−1) , X
Corollary 1 (Nishimori Identity for Two Replicas). Consider model y = g(x∗ ) + w, where g is a
continuous bounded function and w is the additive noise. Let us denote ⟨·⟩x∗ ,w the expectation w.r.t.
P (X = · | Y = g(x∗ ) + w). Then we have
h i D E
EX ∗ ,W ⟨f (X, X ∗ )⟩X ∗ ,W = EX ∗ ,W f X (1) , X (2)
X ∗ ,W
Two Problems
√
Problem A: From previous lecture we studied the scalar denoising problem that y = λx∗ + ω
√
λ
P (x | y) ∝ exp log (PX (x)) − x2 + λx∗ x + λωx (8.3)
2
√
Z
λ ∗ ∗
Φdenoising (λ) = Ex∗ ,ω log PX (x) dx exp − x + λx x + λxz (8.4)
2
√
Suppose we solve N such problem parallelly such that y = λmx∗ + ω as problem A and
define the Hamiltonian HA (x, λ, x∗ , ω; m)
!
X X λm
∗
√
P (x | y) ∝ exp log (PX (xi )) + − x + λmxi xi + λmωi xi (8.5)
2
2 i
i i
X λm √
X
∗ ∗
HA (x, λ, x , ω; m) ≜ − log (PX (xi )) − − 2
x + λmxi xi + λmωi xi (8.6)
2 i
i i
142 F. Krzakala and L. Zdeborová
q
Problem B: Our target rank-one matrix factorization problem Y = Nλ x∗ x∗⊺ + ξ, define the
Hamiltonian HB (x, x∗ , λ)
" r #
X X λ 2 2 λ λ
P (x | Y) ∝ exp log (PX (xi )) + − x x + x∗ x∗ xi xj + ξij xi xj (8.9)
2N i j N i j
N
i i≤j
" r #
∗
X X λ 2 2 λ ∗ ∗ λ
HB (x, λ, x , ξ) ≜ − log (PX (xi )) − − x x + x x xi xj + ξij xi xj (8.10)
2N i j N i j N
i i≤j
with partition function ZB (λ, x∗ , ξ) is the quantity that we are interested in.
Define
(t)
H̃A (xλ, x∗ , ω; m) ≜ HA (x, tλ, x∗ , ω; m)
X tλm √
X
2 ∗
=− log (PX (xi )) − − x + tλmxi xi + tλmωi xi
2 i
i i
(t)
H̃B (x, λ, x∗ , ξ) ∗
≜ HB (x, tλ, x , ξ)
" r #
X X tλ 2 2 tλ ∗ ∗ tλ
=− log (PX (xi )) − − x x + x x xi xj + ξij xi xj
2N i j N i j N
i i≤j
( q
Yij = tλ x∗ x∗ + ξij , ∀1 ≤ i ≤ j ≤ N
p N i j
yi = (1 − t)λmx∗i + ωi , ∀1 ≤ i ≤ N
Notice that
∂
Ht (x, λ, x∗ , ω, ξ; m)
∂t " √ # " p #
X λm X
∗ λm X λ 2 2 λ ∗ ∗ λ/N
=− 2
xi − λmxi xi − √ ωi xi − − x x + x x xi xj + √ ξij(8.14)
xi xj
2 2 1 − t 2N i j N i j 2 t
i i i≤j
Therefore this derivative can be into several Boltzmann average terms associated to Ht (x, λ, x∗ , ω, ξ; m).
For short we denote θ = {λ, x∗ , ω, ξ}
Let ΦMF (λ) be the free entropy density of the rank-one matrix factorization problem under
144 F. Krzakala and L. Zdeborová
where
(a) Uses Eqn 8.12, and the short hand notation θ = {λ, x∗ , ω, ξ}.
(b) Plug in Eqn 8.15 and uses the Stein’s Lemma to deal with terms containing ξij and ωi
r
∂ tλ
Ht (x, θ; m) = − xi xj
∂ξij N
Z
∂ ′ ∂
Zt (θ; m) = − dx′ e−Ht (x ,θ,m) Ht x′ , θ; m
∂ξij ∂ξij
−Ht (x′ ,θ;m)
r Z r
tλ ′e ′ ′ tλ ′ ′
= Zt (θ; m) · dx xi xj = Zt (θ; m) · xx
N Zt (θ; m) N i j t,θ,m
h i ∂
Ex ,ω,ξ ξij ⟨xi xj ⟩t,θ,m = Ex ,ω,ξ
∗ ∗ ⟨xi xj ⟩t,θ,m
∂ξij
"Z ( ∂ ∂ )#
− ∂ξij Ht (x, θ; m) ∂ξ Z t (θ; m)
= Ex∗ ,ω,ξ dx xi xj e−Ht (θ;m) − ij
Zt (θ; m) [Zt (θ; m)]2
"Z r #
e−Ht (θ;m) tλ n o
= Ex∗ ,ω,ξ dx xi xj xi xj + x′i x′j t,θ,m
Zt (θ; m) N
r
tλ h i
= Ex∗ ,ω,ξ x2i x2j t,θ,m − ⟨xi xj ⟩2t,θ,m
N
∂ p
Ht (x, θ; m) = − (1 − t)λmxi
∂ωi Z
∂ ′ ∂
Zt (θ; m) = − dx′ e−Ht (x ,θ,m) Ht x′ , θ; m
∂ωi ∂ωi
′
e−Ht (x ,θ;m) ′ ′
Z
= Zt (θ; m) · (1 − t)λm dx′
p
xx
Zt (θ; m) i j
= Zt (θ; m) · (1 − t)λm x′i t,θ,m
p
h i ∂
Ex∗ ,ω,ξ ωi ⟨xi ⟩t,θ,m = Ex∗ ,ω,ξ ⟨xi ⟩t,θ,m
∂ωi
"Z ( ∂ ∂
)#
−Ht (θ;m)
− ∂ωi Ht (x, θ; m) ∂ωi Zt (θ; m)
= Ex∗ ,ω,ξ dx xi e −
Zt (θ; m) [Zt (θ; m)]2
"Z #
e−Ht (θ;m) p n
′
o
= Ex∗ ,ω,ξ dx xi (1 − t)λm xi + xi t,θ,m
Zt (θ; m)
h i
= (1 − t)λm Ex∗ ,ω,ξ x2i t,θ,m − ⟨xi ⟩2t,θ,m
p
146 F. Krzakala and L. Zdeborová
Analogously, we have
Z 1
λ X 2
X
dτ Ex∗ ,ω,ξ x∗i x∗j xi xj − 2N m ⟨x∗i xi ⟩2τ,θ,m
2N 2 0
τ,θ,m
i,j i
* +
λ X X
= Ex∗ ,ω,ξ x∗i x∗j xi xj − 2N m x∗i xi
2N 2
i,j i τ,θ,m
* !2 ! +
λ X X
= Ex∗ ,ω,ξ x∗i xi − 2N m x∗i xi + N 2 m2 − N 2 m2
2N 2
i i τ,θ,ξ
* !2 +
λ 2
− λm
X
= Ex∗ ,ω,ξ x∗i xi − m
2 2
i τ,θ,m
To get start, we first redefine the Hamiltonian of problem A by replacing some m into θ:
X X λq
∗ ∗
(8.16)
2
p
ĤA (x, λ, x , ω; m, q) ≜ − log (PX (xi )) − − xi + λmx x + λqωi xi
2
i i
Consider models at fixed value of magnetization M = N1 i x∗i xi , i.e. we use the same
P
family of Hamiltonians as above, but the system only contains configurations with the given
magnetization M
!
1
Z
∗
X
fixed
ZA (λ, x∗ , ω; m, q, M ) ≜ dx e−ĤA (x,λ,x ,ω;m,q) δ M − x∗i xi (8.18)
N
i
!
1
Z
∗
X
fixed
ZB (λ, x∗ , ξ; M ) ≜ dx e−HB (x,λ,x ,ξ) δ M − x∗i xi (8.19)
N
i
!
1
Z X
Ztfixed (θ; m, q, M ) ≜ dx e−Ĥt (x,θ;m,q) δ M − x∗i xi (8.20)
N
i
Z1fixed (θ; m, q, M ) ≡ ZB
fixed
(λ, x∗ , ξ; M ) , ∀ m, q
Notice that
" √ #
∂ X λq X λq
Ĥt (x, λ, x∗ , ω, ξ; m, q) = − x2i − λmx∗i xi − √ ωi xi
∂t 2 2 1−t
i i
" p #
X λ 2 2 λ ∗ ∗ λ/N
− − x x + x x xi xj + √ ξij xi xj (8.21)
2N i j N i j 2 t
i≤j
148 F. Krzakala and L. Zdeborová
Therefore this derivative can be into several Boltzmann average terms associated to
!
∂ log Ztfixed (θ; m, q, M ) 1 1 1
Z
∂ X
= dx e−Ĥt (x,θ;m,q) δ M − x∗i xi
∂t N N Ztfixed (θ; m, q, M ) ∂t N
i
−Ĥ (x,θ;m,q) 1 P ∗
1 e δ M − N i xi xi ∂
Z t
=− dx Ĥt (x, θ; m, q)
N Zt (θ; m) ∂t
| {z }
=Pt,θ,m,q,M (x)
1 ∂
=− Ĥt (x, θ; m, q)
N ∂t t,θ,m,q,M
D E
( x2 x2
1 λ X i j t,θ,m,q,M
=− − x∗i x∗j xi xj
N N
2 t,θ,m,q,M
i≤j
X hq i
−λ x2i t,θ,m,q,M
− m ⟨x∗i xi ⟩t,θ,m,q,M
2
i
p √ )
λ/N X λq X
− √ ξij ⟨xi xj ⟩t,θ,m,q,M + √ ωi ⟨xi ⟩t,θ,m,q,M
2 t i≤j 2 1−t i
(8.22)
8.4 A rigorous proof via Interpolation 149
λq 2 λ λm2
ΦFixed
MF (λ; M ) ≤ Φdenoising (λm) + + (M − m)2 − , ∀ m, q
4 2 2
λq 2 λ λm2
⇒ Φfixed
MF (λ; M ) ≤ min Φdenoising (λm) + + (M − m)2 −
m,q 4 2 2
In the large N limit, the Boltzmann distribution will be dominated by the configurations with
specific magnetization, so by Laplace method (we are a bit sloppy here) we have
Finally, combine the bound from both sides and note that ΦMF (λ) does not depend on m and
M
h i
ΦMF (λ) ≥ maxm Φdenoising (λm) − λm2
λm2
4
h i ⇒ ΦMF (λ) = max Φdenoising (λm) −
ΦMF (λ) ≤ maxM Φdenoising (λM ) − λM 2 m 4
4
Bibliography
8.5 Exercises
a) Using the replica expression for the free entropy, show that the overlap m between
the posterior estimate ⟨X⟩ and the real value x∗ obeys a self consistent equation.
b) Solve this equation numerically, and show that m is non zero only for SNR λ > 1.
and compare the MMSE obtained with this approach with the one of any algorithm
you may invent so solve the problem. A classical algorithm for instance, is to use as
an estimator the eigenvector of Y corresponding to its largest eigenvalue.
model 2: sampled uniformly from ±1 (with probability ρ), otherwise 0 (with proba-
bility 1 − ρ)
a) Using the replica expression for the free entropy, show that the overlap m between
the posterior means estimate ⟨X⟩ and the real value X ∗ obeys a self consistent
equation.
b) Solve this equation numerically, and show that m is non zero only for SNR λ > 1,
for models 1 and for a non-trivial critical value for model 2. Check also that, for ρ
small enough, the transition is a first order one for model 2.
Chapter 9
N variables
x2i x2j λ xi xj x∗i x∗j λ
r
X λ
−HN = − + + xi xj ξij (9.1)
2N N N
1≤i≤j
N + 1 variables
x2i x2j λ xi xj x∗i x∗j λ
r
X λ
− HN +1 = − + + xi xj ξij (9.2)
2(N + 1) N +1 N +1
0≤i≤j
!
x2i x2j λ xi xj x∗i x∗j λ
r
X λ x2 λ X x2i
= − + + xi xj ξij − 0
2(N + 1) N +1 N +1 2 N +1
1≤i≤j i
r
X xi x∗ X λ
∗
+ x0 x0 λ i
+ x0 xi ξ0i (9.3)
N +1 N +1
i i
r
N X x2 X xi x∗
2λ λ
X
∗
= −HN (λ ) − x0 i
+ x0 x0 i
+ x0 xi ξ0i + o(1) (9.4)
N +1 2 N N N
i i i
Let us now look at the average magnetization of the new spin. It must satisfies:
x2 x0 x∗
q
2λ i ∗ 0 +x λ
P P P
−x i N +x0 x0 xi ξ0i
dx0 PX (x0 ) x∗0 x0 ⟨e 0 2
R 0
i N i N ⟩N
m = Eξ,x∗ ,x∗0 ,ξ0 x2 xi x∗
q (9.5)
−x2 λ i +x x∗ i λ
P P P
i N +x0 xi ξ0i
R 0 0
dx0 PX (x0 )⟨e 0 2 i N i N ⟩N
Let us therefore evaluate this term in brackets. Using concentration of measure for the overlaps,
we find
x2 xi x∗
q q
−x20 λ i ∗ i +x λ
−x20 λ ρ+x0 x∗0 m+x0 λ
P P P P
i N +x0 x0 xi ξ0i xi ξ0i
⟨e 2 i N 0 i N ⟩N ≈ ⟨e 2 i N ⟩N (9.6)
q
2λ ∗
P λ
x0 xi ξ0i
= e−x0 2 ρ+x0 x0 m ⟨e i N ⟩N (9.7)
154 F. Krzakala and L. Zdeborová
How to deal with the last term? We could expand in power of m! Indeed, concentration
suggest that the xi are xj are only weekly correlated so that we could hope that:
r
(x0 xi )2 2 λ
q
P
x0 i xi ξ0i Nλ Y x x ξ qλ Y λ
0 i 0i
⟨e ⟩N = ⟨ e N ⟩N ≈ ⟨ (1 + x0 xi ξ0i + ξ0i )⟩N
N 2 N
i i
r r
Y λ 1 2 λ
Y λ 1 λ
≈ (1 + x0 ⟨xi ⟩ξ0i + x20 ⟨x2i ⟩ξ0i )≈ (1 + x0 ⟨xi ⟩ξ0i + x20 ⟨x2i ⟩ )
N 2 N N 2 N
i i
x2 x2 x2 2
q q
λ
− 20 2 2 λ 0 2 2 λ λ 0 qλ+ x0 ρλ
P P P P
x0 i ⟨xi ⟩ξ0i i ⟨xi ⟩ ξ0i N + 2 i ⟨xi ⟩ξ0i N x0 i ⟨xi ⟩ξ0i −
≈e N ≈e N 2 2 (9.8)
This can be actually proved rigorously. Indeed consider the two following expressions:
r
x20 X λ
A = − λρ + x0 ξi0 xi (9.9)
2 N
i
r
x20 X λ
B = − λq + x0 ξi0 ⟨xi ⟩ (9.10)
2 N
i
E⟨(eA − eB )2 ⟩ → 0 (9.11)
We thus obtain
x2 x0 x∗
q q
−x20 λ i ∗ 0 +x λ
−x20 λ q+x0 x∗0 m+x0 λ
P P P P
i N +x0 x0 xi ξ0i i ⟨xi ⟩ξ0i
⟨e 2 i N 0 i N ⟩N ≈ e 2 N (9.12)
Recognizing that the last term is actually a random Gaussian variable thanks to the CLT, we
have
2 qλ ∗
√
dx0 PX (x0 ) x∗0 x0 e−x0 2 +x0 x0 m+x0 λqz
R
m = Ex∗0 ,z R 2 qλ ∗
√ (9.13)
dx0 PX (x0 )e−x0 2 +x0 x0 m+x0 λqz
and we have recovered our self-consistent equation from a rigorous computation. This is the
power of the cavity method!
We can derive the free energy as well. To do this, we need to be keep track of all order 1
constant, so we write:
We thus write
!2 !2 r
λ X x2 λ X xi x∗ 1 λ X
i i
−HN +1 = − HN + − − xi xi ξij + o(1)
4 N 2 N 4N N
i i i,j
r
λ X x2i X xi x∗ X λ
− x20 + x0 x∗0 i
+ x0 xi ξ0i
2 N N N
i i i
We can now compute the free energy using the Cavity method and write
2 2
x2 xi x∗
q
λ i −λ i 1 λ
P P P
ZN +1 4 i N 2 i N
+ 4N N i,j xi xi ξij
F ≈ E log =Eξ,x∗ log⟨e ⟩N
ZN
x2 xi x∗
Z q
−x20 λ i ∗ i +x λ
P P P
i N +x0 x0 0 xi ξ0i
+ Eξ,x∗ ,ξ0 log dx0 P (x0 )⟨e 2 i N i N ⟩N
(9.19)
All these terms are somehow simple, and more importantly, look like they depends on our
order parameters, excepts for the Gaussian random variables ξij and ξ0i . In fact, we already
took care of the second line, so we can further simply the expression as
2 2
x2 xi x∗
q
λ i −λ i 1 λ
P P P
ZN +1 4 i N 2 i N + 4N N i,j xi xi ξij
F ≈ E log =Eξ,x∗ log⟨e ⟩N
ZN
Z q
−x20 λ q+x0 x∗0 m+x0 λ
P
i ⟨xi ⟩ξ0i
+ Eξ,x∗ ,ξ0 log dx0 PX (x0 )e 2 N (9.20)
Let us this deal with the first line. First notice that, because of concentration, we have:
X xi x∗ X x2
i
→ m, i
→ ρ. (9.21)
N N
i i
This last terms look complicated. However, it also follows a concentration property. First, let
us notice that, using use Stein lemma, we have
r
1 λ X X λ
E⟨ xi xj ξij ⟩ = E⟨ 2 2 2
⟨x x ⟩ − ⟨xi xj ⟩ ⟩ (9.24)
4N N 4N 2 i j
i,j
We can also compute the variance of this quantity, and check that it concentrates. This is,
actually, nothing but the Matrix-MMSE. We can thus write
2 2
x2 xi x∗
q
λ i −λ i 1 λ
P P P
i N i + 4N i,j xi xj ξij
(9.25)
4 2 N N
⟨e ⟩
λρ2
− λm + λ2 2 2 2
P
≈e 4 2 4N ij ⟨xi xj ⟩−⟨xi xj ⟩ (9.26)
λq
− λm
≈e 4 2 (9.27)
λm
Z
2 λm ∗ ∗
√
F ≈− + Ex∗ ,z log dx0 PX (x0 )e−x0 2 +x0 x0 m+x0 x0 λmz (9.28)
4
9.3 AMP
We come back to the cavity equation. We have the new spins x0 that "sees" an effective model
as
P ⟨xi ⟩2 ∗
1 P ⟨xi ⟩xi q
−x20 λ ∗ λ λ
P
i N +x0 x0 N +x0 i ⟨xi ⟩ξ0i N
x0 ∼ PX (x0 )e 2 i N (9.29)
Z
Can we turn this into an algorithm? If we remember that yij = x∗i x∗j λ/n + ξij , this can be
p
written as
P ⟨xi ⟩2
1
q
−x2 λ λ P
+ x0 y0i ⟨xi ⟩
x0 ∼ PX (x0 )e 0 2 i N N i
(9.30)
Z
Defining the denoising function η(A, B) as
2
dx0 PX (x0 )x0 e−x0 A/2+x0 B
R
η(A, B) =: R 2 (9.31)
dxPX (x0 )e−x0 A/2 +x0 B
It is VERY tempting to turn this into an iterative algorithm, and write, denoting x̂ the estimator
of the marginal of x at time t
r
λ t
ht = x̂ Y (9.34)
N
t t+1
x̂ · x̂
x̂t+1 = η λ , ht (9.35)
N
However, there is a problem! In these, we have indeed the mean of X0 , but it is not expressed
as a function of the mean of xi , but the mean of xi in a system WITHOUT X0 (the cavity
system, where x0 has been removed). What we need would be to express the mean of X0 as a
function of the mean of xi instead, not its cavity mean!
This problem was solved by Thouless, Anderson and Palmer, using Onsager’s retraction term.
To first order, we can express the cavity mean of xi (in absence of x0 ) as a function of the
actual mean xi (in presence of x0 ) as
r r
X λ X X λ X
⟨xi ⟩c ≈ η λ ⟨xj ⟩2c /N, ⟨xj ⟩c yij ≈ η λ ⟨xj ⟩2 /N, ⟨xj ⟩yij (9.36)
N N
j̸=0 j̸=0 j̸=0 j̸=0
r r
X λ X λ
≈ η λ ⟨xj ⟩2 /N − λ⟨x0 ⟩2 /N, ⟨xj ⟩yij − ⟨x0 ⟩yi0 (9.37)
N N
j j
r
X λ X
(9.38)
p
≈ η λ ⟨xj ⟩2 /N, ⟨xj ⟩yij − (∂B η) λ/N ⟨x0 ⟩yi0 + O(1/N )
N
j j
′
(9.39)
p
≈ ⟨xi ⟩ − η λ/N ⟨x0 ⟩yi0 + O(1/N )
So the algorithms need to be slightly modified! The local field acting on on the spin 0 is now
r r r !
λ X λ X λ
y0i ⟨xi ⟩c → y0i ⟨xi ⟩ − ηi′ ⟨x0 ⟩yi0
N N N
i i
r !
λ X 1 X ′ 2
= y0i ⟨xi ⟩ − λ ηi y01 ⟨x0 ⟩ (9.40)
N N
i i
r !
λ X 1 X ′
≈ y0i ⟨xi ⟩ − λ ηi ⟨x0 ⟩ (9.41)
N N
i i
The really power-full things about this algorithm is that z is really the cavity field, at each,
time, and we know the distribution of cavity fields! In fact, from eqs.(9.30) and (9.13), we
expect that h is distributed as Gaussian so that
√
ht = λmt x∗ + λmt z (9.44)
with
√
2 mt λ ∗ t tz
dx0 PX (x0 ) x∗0 x0 e−x0 2 +λx0 x0 m −x0 λm
R
mt+1 = Ex∗0 ,z 2 mt λ ∗ t
√
t
(9.45)
dx0 PX (x0 )e−x0 2 +λx0 x0 m −x0 λm z
R
In other words, we can track the performance of the algorithm step by step. This last equation
is often called the "state evolution" of the algorithm.
9.4 Exercises
Implement AMP for the problems discussed in the Exercises in chap 7, and compare its
performance with the optimal ones
Bibliography
The cavity method as described in this section was developed by Parisi, Mézard, & Virasoro
Mézard et al. (1987a,b). It is often used in mathematical physics as well, as initiated in
Aizenman et al. (2003), and has been applied to the problem discussed in this chapter in
Lelarge and Miolane (2019). The vision of this cavity method as an algorithm is initially due
to Thouless-Anderson & Palmer Thouless et al. (1977), an article that extraordinary influential
in the statistical physics community. Its study as an iterative algorithm with a rigorous state
evolution is due to Bolthausen (2014); Bayati and Montanari (2011). For the present low-rank
problem, it was initially studied by Rangan and Fletcher (2012) and later discussed in great
details in Lesieur et al. (2017), where it was derived starting from Belief-Propagation.
Chapter 10
In this lecture we will discuss clustering of sparse networks also known as community detection
problem. To have a concrete example in mind you can picture part of the Facebook network
corresponding to students in a high-school where edges are between user who are friends.
Knowing the graph of connections the aim is to recover from such a network a division of
students corresponding to the classes in the high-school. The signal comes from the notion
that students in the same class will more likely be friends and hence connected than students
from two different classes. A widely studied simple model for such a situation is called the the
Stochastic Block Model that we will now introduce a study. Each of N nodes i = 1, . . . , N belong
of one among q classes/groups. The variable denoting to which class the node i belongs will
be denoted as s∗i ∈ {1, 2, . . P
. , q}. A node is in group a with probability (fraction of expected
group size) na ≥ 0, where qa=1 na = 1.
The edges in the graph are generated as follows: For each pair of node i, j we decide indepen-
dently whether the edge (ij) is present or not with probability
P (ij) ∈ E, Aij = 1 s∗ , s∗ = ps∗ s∗
i j i j
⇒ G(V, E) & Aij
/ E, Aij = 0 s∗i , s∗j = 1 − ps∗i s∗j
P (ij) ∈
The goal of community detection is given the adjacency matrix Aij to find ŝi so that ŝi is as
close as s∗i as possible according to some natural measure of distance, such as the number of
misclassification of node into wrong classes. In what follows we will discuss the case where
θ
the parameters of the model na , pab , q are known to the community detection algorithms, but
z }| {
also when they are not and need to be learned. Following previous lecture on inference a
natural approach is to consider Bayesian inference where for known values of parameters θ,
all information about s∗i we can extract from the knowledge of the graph is included in the
160 F. Krzakala and L. Zdeborová
posterior distribution:
1
P {si }N
i=1 G, θ = P G {s }N
i i=1 , θ P {s } N
i i=1 θ
ZG
N
1 Yh 1−Aij Aij i Y
= 1 − psi ,sj psi ,sj nsi
ZG
i<j i=1
Leading to the fully connected graphical model with one factor node per every variable and
per every (non-ordered) pair of variables.
We will study this posterior and investigate what is the algorithm that gives the highest
performance. Whether this algorithm achieves information-theoretically optimal performance
and whether there are interesting phase transition in this problem. In order to obtain self-
averaging results, we study the thermodynamic limit where N → ∞ and pab = cab /N with
cab , na , q = O(1). The intuition behind this limit is that in this way every node has on average
O(1) neighbors while the side of the graph goes large, just as in real world where one has one
average the same number of friends independently of the size of the world. This limit also
ends up challenging and presents intriguing behaviour.
q q q
X X cab X
ca = pab N nb = N
nb
cab nb (10.1)
N
b=1 b=1 b=1
Thus we can see that the average degree of a node in group a does not depends on N, i.e. even
if the number of nodes is increasing we have always the same average degree.
The overall average degree is then
X
c= cab na nb . (10.2)
a,b
We start by defining a natural measure of performance the agreement between the original
assignment {s∗i } and its estimate {ti } as
1 X
A({s∗i }, {ti }) = max δs∗i ,π(ti ) , (10.3)
π N
i
10.2 Bayesian Inference and Parameter Learning 161
where π ranges over the permutations on q elements. We also define a normalized agreement
that we call the overlap,
1 P
∗ N i δs∗i ,π(ti ) − maxa na
Q({si }, {ti }) = max . (10.4)
π 1 − maxa na
The overlap is defined so that if s∗i = ti for all i, i.e., if we find the original labeling without
error, then Q = 1. If on the other hand the only information we have are the group sizes
na , and we assign each node to the largest group to maximize the probability of the correct
assignment of each node, then Q = 0. We will say that a labeling {ti } is correlated with the
original one {s∗i } if in the thermodynamic limit N → ∞ the overlap is strictly positive, with
Q > 0 bounded above some constant.
The probability that the stochastic block model generates a graph G, with adjacency matrix A,
along with a given group assignment {si }, conditioned on the parameters θ = {q, {na }, {cab }}
is
Y csi ,sj Aij csi ,sj 1−Aij Y
P (G, {si } | θ) = 1− nsi . (10.5)
N N
i<j i
Note that the above probability is normalized, i.e. G,{si } P (G, {si } | θ) = 1. Assume now
P
that we know the graph G and the parameters θ, and we are interested in the probability
distribution over the group assignments. Using Bayes’ rule we have
Theorem 12 from previous lectures tells us that in order to maximize the overlap Q({s∗i }, {ŝi })
between the ground truth assignment and an estimator ŝi we need to compute the
where µi (ti ) is
Pthe marginal probability of the posterior probability distribution. We remind
that µi (si ) = {sj }j̸=i P ({si } | G, θ).
Note the key difference between this optimal decision estimator and the maximum likelihood
estimator that is evaluating the configuration at which the posterior distribution has the
largest value. In high-dimensional, N → ∞, noisy setting as we consider in the SBM the
maximum likelihood estimator is sub-optimal with respect to the overlap with the ground
truth configuration. Note also that the posterior distribution is symmetric with respect to
permutations of the group labels. Thus the marginals over the entire distribution are uniform.
However, we will see that when communities are detectable this permutation symmetry is
broken, and we obtain that marginalization is the optimal estimator for the overlap defined
in equation 10.4, where we maximize over all permutations.
When the graph G is generated from the SBM using indeed parameters of value θ then
Nishimori identities derived in Section 7.2.4 hold and their consequence in the SBM is that in
162 F. Krzakala and L. Zdeborová
the thermodynamic limit we can evaluate the overlap Q({si }, {s∗i }) even without the explicit
knowledge of the original assignment {s∗i }. Due to the Nishimori identity it holds
1 P
i µi (ŝi ) − maxa na
lim N
= lim Q({ŝi }, {s∗i }) . (10.8)
N →∞ 1 − maxa na N →∞
The marginals µi (qi ) can also be used to distinguish nodes that have a very strong group
preference from those that are uncertain about their membership.
Another consequence of the Nishimori identities is that two configurations taken at ran-
dom from the posterior distribution have the same agreement with each other as one such
configuration with the original assignment s∗i , i.e.
1 X 1 XX
lim max µi (π(s∗i )) = lim µi (a)2 , (10.9)
N →∞ N π N →∞ N
i a i
where π again ranges over the permutations on q elements. Overall we are seeing that in
the spirit of the Nishimori identities the ground truth configuration s∗ has exactly the same
properties as any other configuration drawn uniformly from the posterior measure. This
property lets us use the ground truth configuration to probe the equilibrium properties of the
posterior, a fact that we will use heavily when coming back to the graph coloring problem.
Now assume that the only knowledge we have about the system is the graph G, and not the
parameters θ. The general goal in Bayesian inference is to learn the most probable values of
the parameters θ of an underlying model based on the data known to us. In this case, the
parameters are θ = {q, {na }, {cab }} and the data is the graph G, or rather the adjacency matrix
Aij . According to Bayes’ rule, the probability P (θ | G) that the parameters take a certain value,
conditioned on G, is proportional to the probability P (G | θ) that the model with parameters θ
would generate G. This in turn is the sum of P (G, {si } | θ) over all group assignments {si }:
P (θ) P (θ) X
P (θ | G) = P (G | θ) = P (G, {si } | θ) . (10.10)
P (G) P (G)
{si }
The prior distribution P (θ) includes any graph-independent information we might have about
the values of the parameters. In our setting, we wish to remain perfectly agnostic about these
parameters; for instance, we do not want to bias our inference process towards assortative
structures. Thus we assume a uniform prior, i.e., P (θ) = 1 up to normalization. Note, however,
that since the sum in (10.10) typically grows exponentially with N , we could take any smooth
prior P (θ) as long as it is independent of N ; for large N , the data would cause the prior to
“wash out,” leaving us with the same distribution we would have if the prior were uniform.
Thus maximizing P (θ | G) over θ is equivalent to maximizing the partition function over θ, or
equivalently the free energy entropy over θ.
We now write Belief Propagation as we derived it in previous lectures for the probability
distribution (10.6). We note that this distribution corresponds to a fully-connected factor
10.3 Belief propagation for SBM 163
graph, while we argued Belief Propagation is designed for trees and works well on tree-like
graphs. At the same time the main reason we needed tree-like graphs was the assumption of
independence of incoming messages, if the incoming messages are changing only very weekly
the outcoming one then the needed independence can be weaker. Indeed results for other
models that we have studied so far, such as the Curie-Weiss model or the low-rank matrix
estimation can be derived from belief propagation. In the case of the Curie-Weiss model the
interactions were very weak, every neighbor was influencing the magnetization by interaction
of strength inversely proportional to N. We note that the probability distribution (10.6) is
a mixture of the terms corresponding to edges that are O(1) and organized on a tree-like
graph, and non-edged that are dense but each of them contributing by a interaction of strength
inversely proportional to N . It is thus an interesting case where a sparse and dense graphical
model combine.
The canonical Belief Propagation equations for the graphical model (10.6) read
" #
1 cti tk
1−Aik
χi→j
Y X
ti = i→j nti cA
ti tk 1 −
ik
χk→i
tk , (10.11)
Z t
N
k̸=i,j k
Then the marginal probability is then estimated from a fixed point of BP to be µi (ti ) ≈ χiti ,
where " #
1 Y X A cti tk 1−Aik k→i
i
χti = i nti cti tk 1 −
ik
χtk . (10.12)
Z t
N
k̸=i k
Since we have nonzero interactions between every pair of nodes, we have potentially N (N − 1)
messages, and indeed (10.11) tells us how to update all of these for finite N . However, this
gives an algorithm where even a single update takes O(N 2 ) time, making it suitable only for
networks of up to a few thousand nodes. Happily, for large sparse networks, i.e., when N
is large and cab = O(1), we can neglect terms of sub-leading order in N . In that case we can
assume that i sends the same message to all its non-neighbors j, and treat these messages as
an external field, so that we only need to keep track of 2M messages where M is the number
of edges. In that case, each update step takes just O(M ) = O(N ) time. To see this, suppose
that (i, j) ∈
/ E. We have
" # " #
i→j 1 Y 1 X Y X 1
χti = i→j nti 1− k→i
ctk ti χtk k→i
ctk ti χtk i
= χti + O . (10.13)
Z N t t
N
k∈∂i\j
/ k k∈∂i k
Hence the messages on non-edges do not depend to leading order on the target node j. On
the other hand, if (i, j) ∈ E we have
" # " #
1 1
χi→j
Y X Y X
ti = i→j nti 1− ctk ti χk→i
tk ctk ti χk→i
tk . (10.14)
Z N t t
k∈∂i
/ k k∈∂i\j k
164 F. Krzakala and L. Zdeborová
where we neglected terms that contribute O(1/N ) to χi→j , and defined an auxiliary external
field that summarizes the contribution and overall influence of the non-edges
1 XX
hti = ctk ti χktk . (10.16)
N t k k
In order to find a fixed point of Eq. (10.15) in linear time we update the messages χi→j ,
recompute χj , update the field hti by adding the new contribution and subtracting the old
one, and repeat. The estimate of the marginal probability µi (ti ) is then
1
ctj ti χj→i
Y X
χiti = i nti e−hti tj
. (10.17)
Z t
j∈∂i j
The role of the magnetic field hti in the belief propagation update is similar as the one of the
prior on group sizes nti . Contrary to the fixed nti the field hti adapts to the current estimation
of the groups sizes. For the assortative communities this term is crucial as if one group a
was becoming more represented then the corresponding field ha would be larger and would
weaken all the messages χa . The field is hence adaptively keeping the group sizes of the
correct size preventing all node to fall into the same group. For the disassortative case the field
plays a similar role of adjusting the sizes of the groups. A particular case if the disassortative
structure with groups of the same size and the same average degree of every group in which
case the term e−hti does not change the behaviour of the BP equations. This will stand on the
basis of the mapping to planted coloring problem.
When the Belief Propagation is asymptotically exact then the true marginal probabilities are
given as µi (ti ) = χiti . The estimator maximizing the overlap Q is then
The overlap with the original group assignment is then computed from (10.8) under the
assumption that the assumed parameters θ were indeed the ones used to generate the graph,
leading to
1 P i
N i χŝi − maxa na
Q = lim (10.19)
N →∞ 1 − maxa na
In order to write the Bethe free entropy we use similar simplification and neglect subleading
terms. To write the resulting formula we introduce
j→i
+ χbi→j χj→i
X X
Z ij = cab (χi→j
a χb a )+ caa χi→j
a χa
j→i
for (i, j) ∈ E (10.20)
a<b a
X cab i j
Z̃ ij
= 1− χa χb for (i, j) ∈
/ E, , (10.21)
N
a,b
ctj ti χj→i
X YX
Zi = nti e−hti tj (10.22)
ti j∈∂i tj
10.3 Belief propagation for SBM 165
we can then write the Bethe free entropy, in the thermodynamic limit as
1 X 1 X c
ΦBP (q, {na }, {cab }) = log Z i − log Z ij + , (10.23)
N N 2
i (i,j)∈E
homework.
Under the assumption that the model parameters θ are known, hence Nishomori conditions
hold, we can now state a conjecture about exactness of the belief propagation in the asymptotic
limit of N → ∞. The fixed point of belief propagation corresponding to the largest Bethe free
entropy provides the Bayes-optimal estimator for the SBM in the following sense:
In case the parameters θ are not known and need to be learned the Bethe free energy is greatly
useful. In previous sections we concluded that the most likely parameters are those maximizing
the Bethe free entropy. One can thus simply run gradient descent on the parameters to
maximize the Bethe entropy. Even more conveniently, we can write the stationarity conditions
of the Bethe free entropy with respect to the parameters na and cab and use them as an iterative
procedure to estimate the parameters. Keeping in mind that a BP fixed point is a P stationary
point of the Bethe free entropy, and that for na we need to impose the normalization a na = 1
and thus we are looking for a constraint optimizer we obtain
1 X i
na = χa , (10.27)
N
i
1 1 X cab (χi→j
a χb
j→i
+ χi→j χj→i
a )
cab = ij
b
, (10.28)
N nb na Z
(i,j)∈E
where Z ij is defined in (10.20). Derivation of this expression will be your homework. The
interpretation of these expressions are very intuitive and again stems from Nishmori identities
for Bayes-optimal estimation. The first equations states that the fraction of nodes in group a
should be the expected number of nodes in group a according to the BP prediction. Similarly
for cab the meaning of the second equations is that the expected number of edges between
group a and group b is the same as when computed from the BP marginals. Therefore, BP can
also be used readily to learn the optimal parameters.
166 F. Krzakala and L. Zdeborová
We will now consider the case when the parameters q, {na }, {cab } used to generate the network
are known. We will further limit ourselves to a particularly algorithmically difficult case of
the block model, where every group a has the same average degree c and hence there is no
information about the group assignment simply in the degree distribution. The condition
reads:
Xq Xq
cad nd = cbd nd = c , for all a, b . (10.29)
d=1 d=1
If this is not the case, we can achieve a positive overlap with the original group assignment
simply by labeling nodes based on their degrees. The first observation to make about the
belief propagation equations (10.15) in this case is that
χi→j
ti = nti (10.30)
is always a fixed point, as can be verified by plugging (10.30) into (10.15). The free entropy at
this fixed point is
c
Φpara = − (1 − log c) . (10.31)
2
For the marginals we have χiti = nti , in which case the overlap (10.8) is Q = 0. This fixed
point does not provide any information about the original assignment—it is no better than a
random guess. If this fixed point gives the correct marginal probabilities and the correct free
entropy, we have no hope of recovering the original group assignment. For which values of q,
na and cab is this the case?
Fig. 10.4.1 represents two examples where the overlap Q is computed on a randomly generated
graph with q groups of the same size and an average degree c. We set caa = cin and cab = cout
for all a ̸= b and vary the ratio ϵ = cout /cin . The continuous line is the overlap resulting from the
BP fixed point obtained by converging from a random initial condition (i.e., where for each i, j
the initial messages χi→jti are random normalized distributions on ti ). The convergence time is
plotted in Fig. 10.4.2. The points in Fig. 10.4.1 are results obtained from Gibbs sampling, using
the Metropolis rule and obeying detailed balance with respect to the posterior distribution,
starting with a random initial group assignment {qi }. We see that Q = 0 for cout /cin > ϵc . In
other words, in this region both BP and MCMC converge to the paramagnetic state, where the
marginals contain no information about the original assignment. For cout /cin < ϵc , however,
the overlap is positive and the paramagnetic fixed point is not the one to which BP or MCMC
converge.
Fig. 10.4.1(b) shows the case of q = 4 groups with average degree c = 16. We show the large N
results and also the overlap computed with MCMC for a rather small size N = 128. Again, up
to symmetry breaking, marginalization achieves the best possible overlap that can be inferred
from the graph by any algorithm.
10.4 The phase diagram of community detection 167
(a) (b)
1
1 N=100k, BP
N=500k, BP N=70k, MC
0.9 N=70k, MC N=128, MC
0.8 0.8 N=128, full BP
0.7 q=4, c=16
0.6 q*=2, c=3 0.6
overlap
overlap
0.5
0.4 0.4
0.3
0.2 detectable undetectable 0.2 undetectable
0.1
0
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0
0 0.2 0.4 0.6 0.8 1
ε*= c*out/c*in ε= cout/cin
Figure 10.4.1: (color online): The overlap (10.4) between the original assignment and its best
estimate given the structure of the graph, computed by the marginalization (10.7). Graphs
were generated using N nodes, q groups of the same size, average degree c, and different
ratios ϵ = cout /cin . Thus ϵ = 1 gives an Erdős-Rényi random graph, and ϵ = 0 gives completely
separated groups. Results from belief propagation (10.15) for large graphs (red line) are
compared to Gibbs sampling, i.e., Monte Carlo Markov chain (MCMC) simulations (data
points). The agreement is good, with differences in the low-overlap regime that we attribute to
finite size fluctuations. In the part (b) we also compare to results from the full BP (10.11) and
MCMC for smaller graphs with N = 128, averaged over 400 samples. The finite size effects are
not very strong in this case, and BP is reasonably close to the exact (MCMC) result even on
√ √
small graphs that contain many short loops. For N → ∞ and ϵ > ϵc = (c − c)/[c + c(q − 1)]
it is impossible to find an assignment correlated with the original one based purely on the
structure of the graph. For two groups and average degree c = 3 this means that the density
of connections must be ϵ−1 c (q = 2, c = 3) = 3.73 greater within groups than between groups
to obtain a positive overlap.
700
N=10k
N=100k
600
500
convergence time
300
200
100
εc
0
0 0.2 0.4 0.6 0.8 1
ε= cout/cin
Figure 10.4.2: (color online): The number of iterations needed for convergence of the BP
algorithm for two different sizes. The convergence time diverges at the critical point ϵc . The
equilibration time of Gibbs sampling (MCMC) has qualitatively the same behavior, but BP
obtains the marginals much more quickly.
168 F. Krzakala and L. Zdeborová
Let us now investigate the stability of the paramagnetic fixed point under random perturbations
to the messages when we iterate the BP equations. In the sparse case where cab = O(1), graphs
generated by the block model are locally treelike in the sense that almost all nodes have a
neighborhood which is a tree up to distance O(log N ), where the constant hidden in the O
depends on the matrix cab . Equivalently, for almost all nodes i, the shortest loop that i belongs
to has length O(log N ). Consider such a tree with d levels, in the limit d → ∞. Assume that
on the leaves the paramagnetic fixed point is perturbed as
and let us investigate the influence of this perturbation on the message on the root of the
tree, which we denote k0 . There are, on average, cd leaves in the tree where c is the average
degree. The influence of each leaf is independent, so let us first investigate the influence of
the perturbation of a single leaf kd , which is connected to k0 by a path kd , kd−1 , . . . , k1 , k0 . We
define a kind of transfer matrix
" #
∂χ ki χ ki c χ ki c c
ab sb ab
X
i
Tab ≡ k
a
= a
k
− χ ki
a
s
k
= n a − 1 . (10.33)
∂χ i+1 χt =nt car χr i+1 csr χr i+1 χt =nt c
P P
b r s r
where this expression was derived from (10.15) to leading order in N . The perturbation ϵkt00
on the root due to the perturbation ϵktdd on the leaf kd can then be written as
"d−1 #
X Y
ϵkt00 = Ttii ,ti+1 ϵktdd (10.34)
{ti }i=1,...,d i=0
Now let us consider the influence from all cd of the leaves. The mean value of the perturbation
on the leaves is zero, so the mean value of the influence on the root is zero. For the variance,
however, we have
2 +
* X cd
2 2
k0
ϵt0 ≈ d k
λ ϵt ≈ c λ d 2d
ϵkt . (10.35)
k=1
cλ2 = 1 . (10.36)
For cλ2 < 1 the perturbation on leaves vanishes as we move up the tree and the paramagnetic
fixed point is stable. On the other hand, if cλ2 > 1 the perturbation is amplified exponentially,
the paramagnetic fixed point is unstable, and the communities are easily detectable.
Consider the case with q groups of equal size, where caa = cin for all a and cab = cout for
all a ̸= b. If there are q groups, then cin + (q − 1)cout = qc. The transfer matrix Tab has only
two distinct eigenvalues, λ1 = 0 with eigenvector (1, 1, . . . , 1), and λ2 = (cin − cout )/(qc) with
10.4 The phase diagram of community detection 169
The stability condition (10.36) is known in the literature on spin glasses as the de Almeida-
Thouless local stability condition de Almeida and Thouless (1978), in information science as
the Kesten-Stigum bound on reconstruction on trees Kesten and Stigum (1967).
We observed empirically that for random initial conditions both the belief propagation con-
verges to the paramagnetic fixed point when cλ2 < 1. On the other hand when cλ2 > 1 then
BP converges to a fixed point with a positive overlap, so that it is possible to find a group
assignment that is correlated (often strongly) to the original assignment. We thus conclude
that if the parameters q, {na }, {cab } are known and if cλ2 > 1, it is possible to reconstruct the
original group assignment.
For the cases presented in Fig. 10.4.1 we can thus distinguish two phases:
√
• If |cin − cout | < q c, the graph does not contain any significant information about
the original group assignment, and community detection is impossible. Moreover, the
network generated with the block model is indistinguishable from an Erdős-Rényi random
graph of the same average degree.
√
• If |cin − cout | > q c, the graph contains significant information about the original group
assignment, and using BP or MCMC yields an assignment that is strongly correlated
with the original one. There is some intrinsic uncertainty about the group assignment
due to the entropy, but if the graph was generated from the block model there is no
better method for inference than the marginalization introduced by Eq. (10.7).
Fig. 10.4.1 hence illustrates a phase transition in the detectability of communities. Unless
the ratio cout /cin is far enough from 1, the groups that truly existed when the network was
generated are undetectable from the topology of the network. Moreover, unless the condition
(10.37) is satisfied the graph generated by the block model is indistinguishable from a random
graph, in the sense that typical thermodynamic properties of the two ensembles are the same.
The situation of a continuous (2nd order) phase transition, illustrated in Fig. 10.4.1 is, however,
not the most general one. Fig. 10.4.3 illustrates the case of a discontinuous (1st order) phase
transition that occurs e.g. for q = 5, cin = 0, and cout = qc/(q − 1). In this case the condition for
stability (10.37) leads to a threshold value cℓ = (q − 1)2 . We plot again the overlap obtained
with BP, using two different initializations: the random one, and the planted/informed one
corresponding to the original assignment. In the latter case, the initial messages are
χi→j
qi = δqi s∗i , (10.38)
where s∗i is the original assignment. We also plot the corresponding BP free energies. As the
average degree c increases, we see four different phases in Fig. 10.4.3:
170 F. Krzakala and L. Zdeborová
(a) (b)
1 0.5 1
0.6
BP planted init.
0.8 0.4 0.8 BP random init.
overlap, planted init 0.4
0.6 overlap, random init 0.3
ffactorized-fBP
ffactorized-fBP 0.6
overlap
entropy
0.2 εl
0.4 0.2 0.4
cd cl
0.2 0.1 0
0.2
0.17 0.18 0.19 0.2
q=10, c=10, N=500k
0 0
cc 0
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
12 13 14 15 16 17 18 ε= cout/cin
c
Figure 10.4.3: (color online): (a) Graphs generated with q = 5, cin = 0, and N = 105 . We
compute the overlap (10.4) and the free energy with BP for different values of the average
degree c. The green crosses show the overlap of the BP fixed point resulting from using the
original group assignment as the initial condition, and the blue crosses show the overlap
resulting from random initial messages. The red stars show the difference between the
paramagnetic free energy (10.31) and the free energy resulting from the informed initialization.
We observe three important points where the behavior changes qualitatively: cd = 12.84,
cc = 13.23, and cℓ = 16. We discuss the corresponding phase transitions in the text. (b) The
case q = 10 and c = 10. We plot the overlap as a function of ϵ; it drops down abruptly from
about Q = 0.35. The inset zooms in on the critical region. We mark the stability transition
ϵℓ , and data points for N = 5 · 105 for both the random and informed initialization of BP. In
this case the data are not so clear. The overlap from random initialization becomes positive a
little before the asymptotic transition. We think this is due to strong finite size effects. From
our data for the free energy it also seems that the transitions ϵc and ϵd are very close to each
other (or maybe even equal, even though this would be surprising). These subtle effects are,
however, relevant only in a very narrow region of ϵ and are, in our opinion, not likely to appear
for real-world networks.
I. For c < cd , both initializations converge to the paramagnetic fixed point, so the graph
does not contain any significant information about the original group assignment.
II. For cd < c < cc , the planted initialization converges to a fixed point with positive overlap,
and its free entropy is smaller than the paramagnetic free entropy. In this phase there are
exponentially many basins of attraction (states) in the space of assignments that have
the proper number of edges between each pair of groups. These basins of attraction have
zero overlap with each other, so none of them yield any information about any of the
others, and there is no way to tell which one of them contains the original assignment.
The paramagnetic free entropy is still the correct total free entropy, the graphs generated
by the block model are thermodynamically indistinguishable from Erdős-Rényi random
graphs, and there is no way to find a group assignment correlated with the original one.
III. For cc < c < cℓ , the planted initialization converges to a fixed point with positive overlap,
and its free entropy is larger than the paramagnetic free entropy. There might still be
exponentially many basins of attraction in the state space with the proper number of
edges between groups, but the one corresponding to the original assignment is the
one with the largest free entropy. Therefore, if we can perform an exhaustive search of
the state space, we can infer the original group assignment. However, this would take
10.4 The phase diagram of community detection 171
exponential time, and initializing BP randomly almost always leads to the paramagnetic
fixed point. In this phase, inference is possible, but conjectured to be exponentially hard
computationally.
IV. For c > cℓ , both initializations converge to a fixed point with positive overlap, strongly
correlated with the original assignment. Thus inference is both possible and tractable,
and BP achieves it in linear time.
We saw in our experiments that for assortative communities where cin > cout , phases (II)
and (III) are extremely narrow or nonexistent. For q ≤ 4, these phases do not exist at all,
and the overlap grows continuously from zero in phase (IV), giving a continuous phase
transition as illustrated in Fig. 10.4.1. We can think of the continuous (2nd order) phase
transition as a degenerate case of the 1st order phase transition where the three discussed
thresholds cd = cc = cℓ are the same. For q ≥ 5, phases (II) and (III) exist but occur in an
extremely narrow region, as shown in Fig. 10.4.3(b). The overlap jumps discontinuously from
zero to a relatively large value, giving a discontinuous phase transition. In the disassortative
(antiferromagnetic) case where cin < cout , phases (II) and (III) are more important. For
instance, when cin = 0 and the number of groups is large, the thresholds scale as cd ≈ q log q,
cc ≈ 2q log q and cℓ = (q − 1)2 . However, in phase (III) the problem of inferring the original
assignment is hard and conjectured to be insurmountable to all polynomial algorithms in this
region.
Having seen that BP and also MCMC are able to attain the algorithmic threshold cl a natural
question is what other classes of algorithms are able to do so. Very natural spectral methods
such as principal component analysis, based on leading eigenvalues of the associated matrices,
are commonly used for clustering. For community detection in sparse graphs, however, most
of the commonly used spectral methods do not attain the threshold because in sparse graphs
their leading eigenvectors become localized on subgraphs that have nothing to do with the
communities (e.g. the adjacency matrix spectrum localizes on high degree nodes in the limit
of large sparse graphs).
A very generic and powerful idea to design spectral methods achieving optimal thresholds
is to realize that the way we computed the threshold cl in the first place was by linearizing
the belief propagation around it paramagnetic fixed point by introducing its perturbations
χi→j
t = nt + εi→j
t we obtained
X X ∂χi→j
εi→j
X X
t = t
k→i
εk→i
q = Ttq εk→i
q (10.39)
q
∂χq χk→i
q =nq
q
k∈∂i\j k∈∂i\j
where the matrix T was computed in (10.33). Introducing the so-called non-backtracking
matrix as a 2M but 2M matrix (M being the number of edges) with coordinates corresponding
to oriented edges
Bi→j,k→l = δi,l (1 − δj,k ) (10.40)
we can write the linearized BP as
ε = (T ⊗ B)ε . (10.41)
172 F. Krzakala and L. Zdeborová
We thus see that the linearized belief propagation corresponds to a power-iteration of a tensor
product of a small q×q matrix T and the non-backtracking matrix B. The spectrum of B indeed
provides information about the communities that is more accurate than other commonly used
spectral methods and it attains the detectability threshold in sparse SBM.
10.5 Exercises
ctj ti χj→i
X YX
−hti
Z i
= nti e tj (10.43)
ti j∈∂i tj
as
1 X 1 X c
ΦBP (q, {na }, {cab }) = log Z i − log Z ij + , (10.44)
N N 2
i (i,j)∈E
1 X i
na = χa , (10.45)
N
i
1 1 X cab (χi→j
a χb
j→i
+ χi→j χj→i
a )
cab = b
. (10.46)
N nb na Z ij
(i,j)∈E
Chapter 11
Consider the following problem: N people are given a red or black card, or, equivalently, a
value Si∗ = ±1. You are allowed to ask M pairs of people, randomly, to tell you if they had the
same card (without telling you which one). Can we figure out the two groups?
This is simple enough to answer formally with Bayesian statistics. Given the M answers
Jij = ±1 (1 for the same card, −1 for different ones) the posterior probability assignment is
given by
1 1 Y
Ppost (S|J) = Ppost (J|S)Pprior (S) = P (Jij = Si Sj ) (11.1)
N N 2N
ij∈G
where G is the graph where two sites are connected if you asked the question to the pairs.
Given some people are not trustworthy, let us denote the probability of lies as p. Then
1
(11.2)
Q
Ppost (S|J) = N 2N ij∈G [(1 − p)δ(Jij = Si Sj ) + pδ(Jij = −Si Sj )]
and using the change of variable p = e−β /(e−β + eβ ) = 1/(1 + e2β ) we find
174 F. Krzakala and L. Zdeborová
1 β Pij∈G Jij Si Sj
Ppost (S|J) = e (11.3)
Z
This is the spin glass problem with Hamiltonian H = − Jij Si Sj , thus the name the "Spin
P
Glass Game" for this particular inference problem.
First, we discuss the problem on sparse graphs. In this case, as N, M → ∞, we have random
tree-like regular graph, and we can thus write the BP equations. We can use the result from
appendix 5.A, For such pair-wise models, we have one factor per edge, and the BP equations
read:
1 Y
χj→(ij)
sj = ψs(kj)→j (11.4)
Z j→(ij) j
(kj)∈∂j\(ij)
1 X
ψs(ij)→i = eβJij si sj χj→(ij)
sj (11.5)
i
Z (ij)→i sj
1 + sj mj→(ij)
χj→(ij)
sj = (11.6)
2
βs h(ij)→i
e i
ψs(ij)→i = (11.7)
i
2 cosh(βh(ij)→i )
so that
X
mj→(ij) = tanh β h(kj)→j (11.8)
(kj)∈∂j\(ij)
(ij)→i j→(ij)
eβsi h 1 X
βJij si sj 1 + sj m
= e (11.9)
2 cosh(βh(ij)→i ) Z (ij)→i s 2
j
X+ − X−
tanh βh(ij)→i = (11.10)
X+ + X−
X 1 + sj mj→(ij)
Xs = eβJij ssj (11.11)
s
2
j
1 1
= eβsJij (1 + mj→(ij) ) + e−βsJij (1 − mj→(ij) ) (11.12)
2 2
11.2 Sparse graph 175
One can prove this algorithm perfectly solve the spin glass game. We can also analyze the
Bayes performances, using population dynamics. Defining
X
FBP {mk→j } = tanh atanh mk→j tanh(βJkj ) (11.21)
k∈∂j\i
It is easy to locate the phase transition, as it turns out to be a second-order one, using the local
perturbation approach. Writing mj→i = ϵj→i S∗j we find to linear order than
X X
ϵj→i S∗j ≈ tanh ϵk→j S∗k tanh(βJkj ) ≈ ϵk→j S∗k tanh(βJkj ) (11.23)
k∈∂j\i k∈∂j\i
176 F. Krzakala and L. Zdeborová
Now we look to the problem with DENSE graph, in fact we may assume we observe ALL
pairs, but to make the problem interesting, we take the probability of lies close to 1/2. We
write √ √ √ √
pDense = e−β/ N /(e−β/ N + eβ/ N ) = 1/(1 + e2β/ N ) (11.27)
This is a very nice limit to study the problem, however there is a little annoying fact: we have
to update N 2 messages! This is way too much!!!! The trick is now to Taylor expand BP. We
start by the BP iteration, which reads at first order:
√ β
mj→i atanh mk→j mtk→j Jkj
X X
t+1 = tanh
t tanh(β/ N Jkj ≈ tanh √
k∈∂j\i
N k∈∂j\i
(11.31)
11.4 Approximate Message Passing 177
At this point, we realize that we can close the equation on the full marginal defined as
!
β X k→j
mjt+1 = tanh √ mt Jkj (11.32)
N k
Indeed
!
β X k→j β i→j
mj→i
t+1 = tanh √ mt Jkj − √ mt Jij (11.33)
N k N
! !
β X k→j β i→j β X k→j
≈ tanh √ mt Jkj −√ mt Jij tanh′ √ mt Jkj
N k N N k
(11.34)
β 2
≈ mjt+1 − √ mi→j
t Jij (1 − mjt+1 ) (11.35)
N
β 2
≈ mjt+1 − √ mit Jij (1 − mjt+1 ) (11.36)
N
(11.37)
where we keep the first correction in N . Finally, combining the two following equations:
β j 2
mtk→j = mkt − √ mt−1 Jjk (1 − mkt ) (11.38)
N
!
j β X k→j
mt+1 = tanh √ mt Jkj (11.39)
N k
we reach
!
β X β j
k2
mjt+1 = tanh √ k
Jkj mt − √ mt−1 Jjk (1 − mt ) (11.40)
N k N
!
β X β 2
mjt+1 = tanh √ Jkj mkt − √ mjt−1 (1 − mkt ) (11.41)
N k N
X mk 2
!!
β X
mjt+1 = tanh √ k 2 j
Jjk mt − β mt−1 1 − t
(11.42)
N N
k k
1
ht = √ Jmt − βmt−1 (1 − mt 2 ) (11.43)
N
mt+1 = tanh βht (11.44)
This is the TAP, or AMP algorithm. The second term in the first equation is called the Onsager
term, and it makes a subtle difference with the naive mean field approx!
178 F. Krzakala and L. Zdeborová
1
ht = √ Jmt − mt−1 ∂h η(βht−1 ) (11.45)
N
mt+1 = η(βht ) (11.46)
Note how convenient and easy is it to write this algorithm! In fact in this form, this is known
as the AMP algorithm!
Instead of focusing on the population dynamics on m, it seems like a good idea to decompose
the iteration as
mj→i
t+1 = tanh hj→i
t (11.48)
√
hj→i atanh mk→j
X
t = t tanh(β/ N Jkj ) (11.49)
k∈∂j\i
With this notation, the distribution of hj→i t is a sum of uncorrelated terms on a tree, so we
expect that for large tree, in the limit of large connectivities, we could simply things a bit. Let
us first rewrite in the large connectivity limits:
mj→i
t+1 = tanh hj→i
t (11.50)
β X k→j
hj→i
t = √ mt Jkj ) (11.51)
N k
we have
t
hj→i
t ∼ N (h , ∆ht ) (11.52)
With this, we can close the equations as follows: the mean and variance of the distribution of
h is m and q, then
β X
hj→i
t Si∗ = Si∗ √ δ(J = OK)S∗i Sk∗ mk→j
t − δ(J = lies)S∗i Sk∗ mk→j
t ) (11.53)
N k
β X
= √ δ(J = OK)Sk∗ mk→j
t − δ(J = lies)Sk∗ mk→j
t ) (11.54)
N k
11.5 From the spin glass game to rank-one factorization 179
Let us denote the mean and second moments of the distribution of the m in the direction of
the hidden states as
then we have
t β β √
h = N √ (1 − 2p)mt = N √ tanh(β/ N )mt = β 2 mt (11.57)
N N
and
∆ t = β 2 qt (11.58)
so that we can write
√ 2
q t = Ez tanh β 2 mt + zβ qt (11.59)
√
mt = Ez tanh β 2 mt + zβ qt (11.60)
Which is the same as what we could have had with the replica method.
It turns out what we wrote here is entirely general in the large connectivity limit. Assuming
the following problem:
1 ∗ ∗ T √
r
Y = x (x ) + ∆W (11.61)
N
where W is a random matrix, and x∗ is sampled from a distribution P0 , it turns out that if we
wanted to solve the problem, we could simply write the AMP algorithm as
1
ht = √ Jmt − mt−1 βη ′ (βht−1 ) (11.62)
N
mt+1 = η(βht ) (11.63)
Using this algorithm, we, again, can get the state evolution as
Let us follow Bolthausen Bolthausen (2009) and Bolthausen (2014) and consider the iteration
using symmetric matrices.
Consider the symmetric AMP iteration with the same prescriptions as in i.e. assumptions on
the matrices, functions and inputs.
where Zt ∼ N(0, κt,t In ). We claim that in the iterations, all xs are behaving as Gaussian
variables, with xt ∼ N (0, κt ), and X ∼ N (0, κt,s ). Additionally, we also have
1 s−1 t−1
κs,t = m m =: qs−1,t−1 (12.4)
N
This is precisely what would happen WITHOUT the Onsager term IF the matrix A would be a
new one at each iteration time. The Onsager correction makes it work in such a way that it
remains true when A remains the same over iteration.
182 F. Krzakala and L. Zdeborová
AMP, we get :
xt+1 |St = A − P⊥ ⊥ ⊥ ⊥ t
Mt−1 APMt−1 + PMt−1 ÃPMt−1 m − bt m
t−1
(12.8)
= A − Id − PMt−1 A Id − PMt−1 m + P⊥ ⊥
(12.9)
t t t−1
Mt−1 ÃPMt−1 m − bt m
= APMt−1 + PMt−1 APTMt−1 mt + P⊥ ⊥ t
Mt−1 ÃPMt−1 m − bt m
t−1
(12.10)
= APMt−1 mt + P⊥ ⊥ t t
Mt−1 ÃPMt−1 m + PMt−1 Am⊥ − bt m
t−1
(12.11)
assuming Mt−1 has full rank, defining (unique way) αt as the coefficients of the projection of
mt onto the columns of Mt−1 , we have :
It is clear that Part 1 in the above expression is a combination of previous terms with an
additional Gaussian one (product of independent GOE with frozen Ms). Part 2 cancels out
in high-dimensional limit, isometry+Stein’s lemma, AKA Onsager magic. Then recursion
intuitively gives Gaussian equivalent model at each and across iterations.
so that
Here we need to be a bit carefful with the scaling of these terms. Remember than this should
be a vector with order one compotent. To underline this, we can write:
−1
1 T
A= T
Mt−1 (Mt−1 Mt−1 ) N t t t−1 t
X (f (x ) − M α ) (12.22)
N t−1
Now we see that we have a matrix with O(1) element that multiply vectors with O(1) elements,
and when we multiply them we have a again an O(1) quantities (since we sum over t values,
and t is finite). Let us focus on the two terms in the second parenthesis:
1 T
B= X (f t (xt ) − Mt−1 αt ) = C + D (12.23)
N t−1
These terms can be simplifies using Stein lemma. Indeed
1 T t t
C= X f (x ) (12.24)
N t−1
1 P (1) (t)
i xi f (xi )
N1 P (2) (t)
=
N
i xi f (xi ) (12.25)
...
1 P (t) (t)
N i xi f (xi )
1
E[z f (z (t) )]
E[z 2 f (z (t) )]
=N →∞ (12.26)
...
E[z t f (z (t) )]
(12.27)
where we have used the recurence hypothesis. Now we can use Stein lemma1 and we find
and therefore
(m0 )T mt−1
1 (m1 )T mt−1 = 1 bt MTt−1 mt−1
C= bt (12.30)
N ... N
t−1
(m ) m T t−1
1 T
Mt−1 (MTt−1 Mt−1 )−1 N t t t−1 t
(12.32)
A= X (f (x ) − M α )
N t−1
Mt−1 (MTt−1 Mt−1 )−1 MTt−1 bt mt−1 − [0|Mt−2 ] Bt αt (12.33)
=
PMt−1 bt mt−1 − [0|Mt−2 ] Bt αt (12.34)
=
This is precisely the part needed to cancel part 2! At this point, we have proven the claim.
Additionally, we can now write the evolution of the κt , or equivalently, of the qt . This is called
state evolution!
1 h √ i h √ i
qt = mt mt = E (ft ( κt Z))2 = E (ft ( qt−1 Z))2 (12.35)
N
What is interesting here is the generality of the statement. In fact, f not even need to be
separable or even well defined for it to be true Berthier et al. (2020). This allows to use such
iteration with neural networks ?Gerbelot and Berthier (2021) and black-box functions in signal
processing ?.
A crucial question at this point can be asked about whether or not xt (or mt ) does converge.
This can be studied by looking at
1 2
mt+1 − mt 2
= q t+1 + q t − 2q t,t+1 (12.36)
N
If we assume that we have a convergence in q (but not necessary in x!!) such that we are on
the orbit of AMP are q t+1 = q t = q ∗ , then it amonts to wther or not q t,t+1 is convergeing to q ∗ .
We can write
1
C t+1 = qt+1,t = mt+1 mt = E ft (X t+1 ) ft (X t ) (12.37)
N
Using our Gaussian equivalance, we can now replcae the two Xs by correlated gaussians! We
have:
√
Xt = κt − κt,s Z ′ + κt,s Z (12.38)
p
′′ √
s
(12.39)
p
X = κs − κt,s Z + κt,s Z
12.1 AMP And SE 185
The question is thus is this fixed-point iteration of g, defined for 0 to q ∗ is converging. Accorid-
ing to the fixed point theorem, this will be the case if the map is contrative, that is if for any x
and y, we have
∥g(x) − g(y)∥ ≤ |x − y| (12.43)
In other words, we need g to be 1-Lipschitz. One can show that the first two derivative of g
are positive (takes the derivative and use Stein lemma!), so that the derivative is maximum at
q ∗ ! We thus obtain, after a quick computation, the criterion:
2
′
(12.44)
p
∗
E ft ( q Z) ≤1
and
so that
mt+1 = tanh(β Amt − β(1 − q t )mt−1 + βh) (12.49)
There are the TAP equations Thouless et al. (1977) for the SK model, and q obbey the parisi
replica symmetric equation:
h √ i
qt = E (tanh(β qt−1 Z + βh))2 (12.50)
which is precisely the Almeida-Thouless criterion. This was the main result of the original
paper by Bolthausen (2014).
186 F. Krzakala and L. Zdeborová
Since the Parisi RS’s equation of the SK model seems to appear, we may wonder, are the RSB
equation a state evolution of something? The answer is yes, as shown in ?. Define in full
generality the function
h p m p i
EV cosh βx + β q1t − q0t V + βh tanh βx + β q1t − q0t V + βh
ft (x) = h p m i (12.52)
EV cosh βx + β q1t − q0t V + βh
h m i
EV cosh βx + β q1t − q0t V + βh tanh2 βx + β q1t − q0t V + βh
p p
q1t+1 = EU h p m i
EV cosh βx + β q1t − q0t V + βh
(12.53)
h p m p i 2
t t t t
EV cosh βx + β q1 − q0 V + βh tanh βx + β q1 − q0 V + βh
q0t+1 = EU h p m i
EV cosh βx + β q1t − q0t V + βh
(12.54)
then the AMP iteration follows has a state evolution that follows Parisi’s 1RSB equations, in
particular, the norm of ft is q0 !
We thus see that the Parisi equation can be interpreted as a way to follow some iterative
algorithm in time.
12.2 Applications
How do we deal with this? Let us make a change of variable! We write simply
√
xt+1 = Amt − bt mt−1 + λx∗ q0t (12.57)
with
with this change of variable, we can write, defining f˜t (x) = ft (x + x0 q0t ) and
x̃t+1 = Amt − bt mt−1 (12.60)
mt = f˜t (x̃t ) (12.61)
so that we can use state evolution for the tilde variable! This leads again to
2
1 t t ˜ √
qt = m m = E ft ( qt−1 Z) (12.62)
N
√
2
√
= E ft ( qt−1 Z + λq0t−1 X ∗ ) (12.63)
and from the definition of q0t , we have the jointed state evolution:
√ t−1 ∗ 2
√
t
q = E ft ( qt−1 Z + λq0 X ) (12.64)
h √ √ i
q0t = E ft ( qt−1 Z + λq0t−1 X ∗ ) X∗ (12.65)
This is very nice! Let’s look a concrete example: the Rademacher spike model!
q
Say we are given Y = √1N W + Nλ x∗ xT∗ , with xi∗ = ±1. Can we recover the vector ? We
can use the AMP algorithm for this,
xt+1 = Y mt − bt mt−1 (12.66)
t t
m = ft (x ) (12.67)
√
using the binary denoiser, that is using ft (x) = tanh βx. This is a particular case of the TAP
equation (this is often called the planted SK model). In this case we have the state evolution:
√
2
√
t
q =E tanh(β qt−1 Z + β λq0t−1 X ∗ ) (12.68)
h √ √ i
q0t = E tanh(β qt−1 Z + β λq0t−1 X ∗ ) X∗ (12.69)
These are the one of the SK model with a ferromagnetic bias (λ is often called J0 in this
context).
Why did we use the hyperbolic tangent? This has to do with √ Bayesian
p methods in statistics.
After all, we know that we are given X, and that X = q0t−1 λX∗ + q t−1 Z. Given we know
X is just X0 up to a Gaussian noise, what is the best we can do to estimates it?
It is a general rule that the "best" estimates in Bayesian method is the mean of the posterior, in
terms of MMSE. This is easily proven: assume that we have an observable Y which is given by
a polluted version of X∗ . Then for all estimators we have:
h i h h ii
MMSE = EX∗ ,Y (X̂(Y ) − X∗ )2 = EY EX ∗ |Y (X̂(Y ) − X∗ )2 (12.70)
188 F. Krzakala and L. Zdeborová
Let us minimize it! COnditioned on Y, X̂ is just a number so we can derive with respect to it
and find that for each Y , we should choose
Z
ˆ(X) = E ∗ X ∗ = P (X|Y )X (12.71)
X |Y
This is the Bayesian √optimal!pIn our problem, this is easyly done: Given we are given an
observable X = q0 t−1
λX∗ + q t−1 Z, we see that
√ t+1 ∗ 2
(X− λq0 X )
∗ X|X∗ −
P (X |X) ∝ P prior
(X)P (X) ∝ P prior
(X)e 2q t−1 (12.72)
√ t+1 2
(X− λq0 x)
−
dxxP prior (x)e
R
2q t−1
∗
E[X |X] = √ t+1 (12.73)
(X− λq0 x)2
R −
dxP prior (x)e 2q t−1
Theorem 17 (Nishimori Identity). Let X (1) , . . . , X (k) be k i.i.d. samples (given Y ) from the
distribution P (X = · | Y ). Denoting ⟨·⟩ the "Boltzmann" expectation, that is the average with respect
to the P (X = · | Y ), and E [·] the "Disorder" expectation, that is with respect to (X ∗ , Y ). Then for all
continuous bounded function f we can switch one of the copies for X ∗ :
hD E i D E
E f Y, X (1) , . . . , X (k−1) , X (k) = E f Y, X (1) , . . . , X (k−1) , X ∗ (12.74)
k k−1
Proof. The proof is a consequence of Bayes theorem and of the fact that both x∗ and any of
the copy X (k) are distributed from the posterior distribution. Denoting more explicitly the
Boltzmann average over k copies for any function g as
D E k
Z Y
(1) (k)
g(X , . . . , X ) =: dxi P (xi |Y )g(X (1) , . . . , X (k) ) (12.75)
k
i=1
We shall drop the subset "k" from Boltzmann averages from now on.
12.2 Applications 189
√ √
Using a binary prior leads indeed to f ( t) = tanh λX (indeed we have only a factor ex λq/qx
Given two vectors u∗ ∈ Rn and v∗ ∈ Rm we are given the n × m matrix (with α = m/n):
r
λ
W = u∗ v∗T + ξ (12.76)
n
The AMP algorithm ? reads (conveniently removing the fact that variances can be pre-
computed so that the functions f is defined at each time t):
1 ′
ut+1 = √ W gvt (vt ) − α < gvt (v) > gut−1 (ut−1 ) (12.77)
n
1 ′
vt+1 = √ W gut (ut )− < gut (ut ) > gvt−1 (vt−1 ) (12.78)
n
Conveniently, one can recast these equations into a new form, that is amenable to a rigorous
analysis. Define the n + m × 2 matrix X
u 0
x= (12.79)
0 v
Now we define f t : Rm+n×2 → Rm+n×2 defined as each time t by the following rule at each
lines (with matrix-like notations):
Y 1 ′
xt+1 = √ f t (xt ) − < f t (xt ) > f t−1 (X t−1 ) (12.84)
n n
Y
= √ f t (xt ) − bt f t−1 (xt−1 ) (12.85)
n
where we have put the Onsager terms into the moniker bt . We thus obtain the state evolution
equation for eq.(12.77,12.78) directly from the Wigner ones:
190 F. Krzakala and L. Zdeborová
" #
q √ 0 t−1 ∗ 2
qut =E gut ( t−1
αqv Z + α λqv V ) (12.86)
√ 0 t−1 ∗
q
t
qu0 t t−1
= E gu ( αqv Z + α λqv )V U∗ (12.87)
" q #
√ 0 t−1 ∗ 2
t t t−1
qv = E gv ( qu Z + λqu U ) (12.88)
t
h p √ t−1
i
qv0 = E gvt ( qu t−1 Z + λqu0 U ∗ ) V∗ (12.89)
Again, they are many application (inclyding denoising spike models, analysing Mixture of
Gaussians, Hopfield models, etc....)
√
if r > 4 + 2 α, there is a gap between the threshold for informationtheoretically optimal
performance and the threshold at which known algorithms succeed
Let us start from the Bayes case! Suppose we are given as data a n × n symmetric matrix Y
created as follows: r
λ ∗ ∗⊺
Y= x x} + ξ
N | {z |{z}
N ×N rank-one matrix symmetric iid noise
i.i.d. i.i.d.
where x∗ ∈ RN with x∗i ∼ PX (x), ξij = ξji ∼ N (0, 1) for i ≤ j.
This is called the Wigner spike model in statistics. The name "Wigner" refer to the fact that Y
is a Wigner matrix (a symmetric random matrix with component sampled randomly from a
Gaussian distribution) plus a "spike", that is a rank one matrix x∗ x∗⊺ .
We shall now make a mapping to a Statistical Physics formulation. Consider the spike-Wigner
model, using Bayes rule we write:
" # q 2
1 λ
P (Y | x) P (x) Y Y 1 − y ij − x x
N i j
P (x | Y) = ∝ PX (xi ) √ e 2
P (Y) 2π
i i≤j
" # " r #
Y X λ λ
∝ PX (xi ) exp − x2 x2 + yij xi xj
2N i j N
i i≤j
" # " r #
1 Y X λ 2 2 λ
⇒ P (x | Y) = PX (xi ) exp − xi xj + yij xi xj
Z(Y) 2N N
i i≤j
x̂MSE,1 (Y)
..
Z
⇒ x̂MSE (Y) = . , x̂ (Y) = ⟨x ⟩ = dx P (x | Y) xi
MSE,i i Y
x̂MSE,N (Y)
12.3 From AMP to proofs of replica prediction: The Bayes Case 191
Interestingly, we see that the partition function Z(Y ) is given by the ratio between
P (Y )
Z(Y ) = P 2 √ N2 (12.90)
e− ij yij /2 / 2π
This is the so-called likelihood ratio betweem the probablity that our model has been generated
randomly, and the one that it has been generated by accident from a pure random problem.
It is convenient to use the notation of information theory, and to compute instead the mutual
information. It is defined as
P (X, Y )
I(X, Y ) = EX,Y log = H(X) − H(X|Y ) = H(Y ) − H(Y |X) (12.91)
P (X), P (Y )
In our model, the relation between the free energy and the mutual information is trivial to
compute, one simply finds:
I(X, Y ) λ(E[X 2 ])2
=f+ (12.92)
N 4
Why work with mutual information since it is just the free energy ? It has actually nice
properties that we now list:
• The mutual information is monotonic in λ from 0 to H(X) (the first is trivial from the
definition, since then the joint distribution factorize, while the second comes from the
fact that if we recover X perfectly from Y , then H(Y |X) = 0).
• Call the matrix M = X ∗ X, it is easy to show using Bayesian technics that the best
possible error in reconstructing M from Y is given by the derivative of the mutual
information. This is the I-MMSE theorem:
1 ∂I(λ)
M − MMSE = (12.93)
4 ∂λ
Additionally, we see that√ the model looks like a SK problem, where the role of the inverse
temperature is played by λ This is often called the Nishimori condition in the litterature. It
is a very peculiar condition, in fact we can show that q = m if we impose this!
Using the replica method, one can work hard and derive the free energy! It reads (here q = m)
λm2
Z
2 λm ∗ ∗
√
f (m) ≈ − Ex∗ ,z log dx0 P (x0 )e−x0 2 +λx0 x0 m+x0 x0 λmz (12.94)
4
Notice that we also find an interesting identity : dfrs /dλ = 4m2 /4, so that the replica prediction
for the mutual information is just
Let us notice that AMP cannot be better than the MMSE, so let us use the following estimator:
AMP − M = mi mj (12.96)
192 F. Krzakala and L. Zdeborová
1 X
AMP − MSE = (ft (Xi )ft (Xj ) − Xi∗ Xj∗ ) (12.97)
N2
i
Now, it turns out that the derivative of the replica symmetric free energy is just AMP-MSE!
Indeed the derivative of the replica mutual information is just
1 1
− m2 + ρ2 + ∂λ T (12.101)
4 4
but the derivative of the free energy gives the correct term and we reach indeed the MSE at
the end! Imagine we can integrate without discontinuity (that is, m is continuous) then:
1 1
Z Z
Ireplica (∞) − Ireplica (0) = dλ AMP − MSE(λ) > MMSE(λ) = I(∞) − I(0) = H(X)
4 4
(12.102)
λ(ρ2 + m2 )
Z
2 λm +λx x∗ m+x x∗
√ λE 2
Ireplica (λ) = − Ex∗ ,z log dx0 P (x0 )e−x0 2 0 0 0 0 λmz
= I(X; X + ΣZ) +
4 4
(12.103)
1 1
Z Z
H(X) = dλ AMP − MSE(λ) > MMSE(λ) = H(X) (12.104)
4 4
So we must have AMP reaching the MMSE almost everywhere and the free energy to be
correct!
that it is well normalized (on the sphere). A typical thing to do would be to use maximum
12.4 From AMP to proofs of replica prediction: The MAP case 193
√
likelihood. We want to minimize ∥Y − λxxT ∥22 constainted to X being on the sphere. This
can be done using the following cost fucntion
X X
L=− xi Yij xj + µ( x2i − N )2 (12.105)
i,j i
where we use µ as a Lagrange parameter to fix the norm. This problem is solved when the
following condition is satisfied: X
Yij xj = µxi (12.106)
j
or equivalently when
Y x̂ = µx̂ (12.107)
that is, of course, when x is an eigenvector. Them, the loss is minimized using the largest
possible eigevnvalue.
Let us see how we can use AMP for this! We simply use the function that will preserve the
norm and write
√ !
λ
xt+1 = A + x∗ xT∗ mt − bt mt−1 = Y mt − bt mt−1 (12.108)
N
mt = ft (xt ) (12.109)
Imagine that we iterate AMP and find a fixed point. Then what is this fixed point? Well, we
have
√ √ √
N N N
x=Y x t − x (12.112)
∥x ∥2 ∥x∥2 ∥x∥2
√
N N
x 1+ 2 =Y x (12.113)
∥x∥2 ∥x∥2
√ !
∥x∥2 N
m √ + =Ym (12.114)
N ∥x∥ 2
Clearly, this is the same fixed point as the one we are looking for! In fact, we even are given
the value of the Lagrange parameter! So it seems that, if it converges, AMP can solved the
problem for us! √ Now we can analyse the solution that AMP gives using state evolution! We
have x = Z + q0 λx∗ , so clearly |x|22 = N (1 + q02 λ). Additionally we have
√ ! √
1 Z + q0 λx0 λ
q0 = E p
2
x 0 = q0 p (12.115)
N 1 + q0 λ 1 + q02 λ
194 F. Krzakala and L. Zdeborová
This is the BBP transition! Additionally, we can even compute the value of the eigenvalue!
INdeed |x|22 = N (1 + q02 λ) become |x|22 = N λ so that
√
1
λmax = λ+ √ (12.118)
λ
This is given a generic strategy for prooving replica prediction: find an AMP that matches
the fixed point of the minimum, prove that it converges, then simply study its state evolution
prediction!
This is from Bayati and Montanari (2011); Rangan (2011), but the present discussion uses
Berthier et al. (2020); Gerbelot and Berthier (2021).
There are many problem defined for rectangular non symmetric matrices A ∈ Rm×n , and for
such, it will be interesting to look at the following iteration:
α 1
dt = divgt (vt ), bt = divgt (ut ) (12.121)
m n
It would be great to have a state evolution for those equations, so that u and v can be treated as
Gaussian at all times! This can be done by mapping (reducing) this recursion to the original
symmetric one! Define:
r
1 B A 0
As = and x = 0
(12.122)
1+α AT C u0
and
r
√
gt (x1 , . . . , xm ) 1+α 0
f2t+1 = 1+α , f2t = . (12.123)
0 α et (xm+1 , . . . , xm+n )
and then we see, again, that xt follows the regular AMP, equation. Making the change of
variable, we see that both u and v are Gaussian variables, and we can compute their variances
etc using state evolution,
12.6 Learning problems: the LASSO 195
A very important application of this rectangular AMP is the Approximate Message Passing
for generalized linear problem, called Generalized AMP (GAMP in short), written slightly
differently, as
In practice, however, since we are interested in cases where there is a hidden x0 and a set of
measurement given by a function of z = Ax0 , we need to be slightly more generic (in the
same as in the wigner spike model). In fact, we want to use function g tht depends potentially
on z (through y and function f tgat correlates with x0 . In full generality GAMP reads as
and this add a new direction for u (that can get correlated with x0 ) and for ω (that can get
correlated with z0 ). In this case, the state evolution reads
mt+1
u = αE[z0 (gtout (ω, ϕ(z0 ))] = αE[∂z0 (gtout (ω, ϕ(z0 ))] (12.134)
ρ mt
and where z, ω have a joint Gaussian distribution with covariance
mt q t
We start with a initial guess x0 and we iterate to tend to the minimum of the function. At each
iteration, the next approximation of x is given by:
xt+1 = xt − ηf ′ (xt )
To find the next value xt+1 , the function is approximated around the current point using a
quadratic approximation and xt+1 is given by the minimum of the quadratic approximation.
196 F. Krzakala and L. Zdeborová
(x − xt )2 ′′ t
f (x) ≈ f (xt ) + (x − xt )f ′ (xt ) + f (x )
2
(x − xt )2
f (x) ≈ f (xt ) + (x − xt )f ′ (xt ) +
2η
x∗ − xt
f ′ (xt ) + =0
η
Def
⇒ xt+1 = x∗ = xt − ηf ′ (xt )
We see that we can write GD as the minimum, at each steps, of the approximate f (x) cost
function
Let f (x) be a function we want to optimize, which is not derivable somewhere but can be
decomposed as a sum of two functions, one derivable and the other not.
is derivable
g(x)
Where
h(x) is not derivable, e.g. |x|
Therefore, the gradient approximation can be done, up to the second derivative, only for g(x).
Since multiplying terms won’t change the optimum value, the expression for the minimum x
can be derived like so:
(x − xt )2
f (x) ≃ g(xt ) + (x − xt )g ′ (xt ) + + h(x)
2η
(x − xt )2
t ′ t
x t+1
= argmin (x − x )g (x ) + + h(x)
2η
1 ′ t
2
= argmin h(x) + t
x − x + g (x )η
2η
12.6 Learning problems: the LASSO 197
1 2
= argmin h(x)η + x − (xt − g ′ (xt )η)
2
If h(x) is not very complicated, the former expression can be solved analytically (this is a
fundamental object of study in Convex Optimization).
z − a if z ≥ a
P rox(z) = 0 if z ∈ [−a; a]
l(.)
z + a if z ≤ −a
1
L(w) = ∥y − Aw∥22 + h(λ) (12.135)
2
This can be solve, for convex h, using PGD, with the iteration:
Clearly, this is an instance of GAMP! Use gt (v t ) = y − v t = rt and e(u) = P rox, then GAMP
reads
ut+1 = AT rt + x (12.140)
rt = y − vt = y − Aprox(ut ) + bt gt−1 (vt−1 ) = y − Ax + bt r (12.141)
So state evolution can be written for the AMP considered! At the fixed point, if this AMP
converges, we find
x = ft (x + AT (y − Ax)/(1 − b) (12.142)
So this recursion solved the LASSO (or any convex optimization problem for that matter).
Just takes its AMP recursion, and voila, you just solved LASSO!
Let us write the state evolution in the simplest case where y = Ax0 . Then we have u and v
Gaussian and we need to track their variance and projection to x0 and y.
qut+1 = αE[(gtout (ω, ϕ(z0 ))2 ] = αE[(y − w)2 ] = α(ρ + qω − 2mω ) = αE t (12.143)
(12.144)
p
qωt = E[ftin ( qut Z + mtu x0 )2 ]
= αE[∂z0 (gtout ( qωt Z ′ + mtω z0 , ϕ(z0 ))] = α (12.145)
p
mt+1
u
(12.146)
p
mtω = E[ftin ( qut Z + mtu x0 )x0 ]
(12.147)
p
q t+1 = E[ftin ( α(ρ + q t − 2mt )Z + αx0 )2 ]
(12.148)
p
mt+1 = E[ftin ( α(ρ + q t − 2mt )Z + αx0 )x0 ]
√ √
E t+1 = ρ + E[ftin ( αE t Z + αx0 )2 ] − 2E[ftin ( αE t Z + αx0 )x0 ]
The inference problem addressed in this section is the following. We observe a M dimensional
vector yµ , µ = 1, . . . , M that was created component-wise depending on a linear projection of
an unknown N dimensional vector xi , i = 1, . . . , N by a known matrix Fµi
N
!
X
yµ = fout Fµi xi , (12.149)
i=1
where fout is an output function that includes noise. Examples are the additive white Gaussian
noise (AWGN) were the output is given by fout (x) = x + ξ with ξ being random Gaussian
variables, or the linear threshold output where fout (x) = sign(x − κ), with κ being a threshold
value. The goal in linear estimation is to infer x from the knowledge of y, F and fout . An
alternative representation of the output function fout is to denote zµ = N i=1 Fµi xi and talk
P
about the likelihood of observing a given vector y given z
M
Y
P (y|z) = Pout (yµ |zµ ) . (12.150)
µ=1
12.7 Inference with GAMP 199
One uses
(z−ω)2
dzPout (y|z) (z − ω) e− 2V
R
gout (ω, y, V ) ≡ (z−ω)2
. (12.151)
V dzPout (y|z)e− 2V
R
and (using R = u ∗ Σ)
(x−R)2
dx x PX (x) e− 2Σ
R
fin (Σ, R) = R (x−R)2
, fv (Σ, R) = Σ∂R fa (Σ, R) . (12.152)
dx PX (x) e− 2Σ
Again, these are not new equations! This was written as early as in 1989 by Mézard as the
TAP equation for the so-called perceptron problem ?. The same equation were written also by
Kabashima ? and Krzakala et al ?. The form presented here is due to Rangan Rangan (2011).
Part III
Consider here a special case of the stochastic block model with na = 1/q and caa = cin , and
cab = cout for a ̸= b. We call assortative/ferromagnetic the case with cin > cout , i.e. connections
withing groups being more likely than between different groups. We call disassortative/anti-
ferromagnetic the case with cin < cout , i.e. connections withing groups being less likely than
between different groups.
In the last lecture we derived the belief propagation equations for the stochastic block model
that read
" #
i→j 1 Y X 1 Y h i
χti = i→j nti e−hti ctk ti χk→i
tk = i→j nti e−hti cout − (cin − cout )χk→i
ti ,
Z t
Z
k∈∂i\j k k∈∂i\j
(13.1)
with an auxiliary external field that summarizes the contribution and overall influence of the
non-edges
1 XX
hti = ctk ti χktk . (13.2)
N t k k
1 Y h i
χi→j
ti = nti e−hti cout − (cout − cin )χk→i
ti (13.4)
Z i→j
k∈∂i\j
1 cin k→i
χi→j
Y
−hti
= nti e cout 1 − (1 − )χ (13.5)
ti
Z i→j cout ti
k∈∂i\j
204 F. Krzakala and L. Zdeborová
1 Y h i
χi→j
ti = nti e−hti 1 − (1 − e−β )χk→i
ti , (13.6)
Z i→j
k∈∂i\j
where we abused the notation as we denoted the new normalization in the same way. We
notice that this is very close to the BP that we wrote for graph coloring in Section 6
1 Y h i
χti→j = 1 − (1 − e−β )χk→i
ti , (13.7)
i
Z i→j
k∈∂i\j
the only difference is the missing term nti e−hti that corresponds to the prior fixing the proper
sizes of groups. In the disassortative/anti-ferromagnetic case with cin < cout or equivalently
β > 0, and equal group sizes na = 1/q for all a = 1, . . . , q, this term is not needed and does
not asymptotically influence the behaviour of the BP equations upon iterations. This claim
can be checked empirically by running the BP algorithm with and without the term, showing
it formally mathematically is non-trivial.
The stochastic block model under the setting of this section (na = 1/q, caa = cin , and cab = cout
for a ̸= b) can be seen as a planted coloring problem at finite inverse temperature β > 0.
Planted coloring is defined as follows:
• Start with N nodes, q colors, each node gets randomly one color s∗i ∈ [q] ≜ {1, . . . , q} as
the true configuration.
• Put at random M = cN/2 edges between nodes so that fraction cout (q − 1)/(cq) (chosen
so that the expected number of edges between groups agrees with the SBM) of them is
between nodes with different colors and the rest between nodes with the same colors.
This produces the adjacency matrix A = [Aij ]N
i,j=1 .
The goal is to find the true configuration s^* (up to a permutation of the colors) from the knowledge of the adjacency matrix A and the parameters θ. We call this way of generating coloring instances planted because we picture the ground-truth group assignment as the planted configuration around which the graph to be colored is constructed. The inverse temperature β is then related to the fraction of monochromatic edges.
The thresholds c_d, c_c and c_ℓ that we described for the SBM with q = 5 colors at β → ∞ in Fig. 10.4.3 thus also exist in planted coloring. More interestingly, they have consequences for the original non-planted graph coloring problem, as we describe in what follows.
Above a threshold c_ℓ the BP algorithm converges to a configuration with positive overlap with the ground-truth assignment into groups. In terms of the inverse temperature β and the average degree c this condition reads

c > c_ℓ = ( (q − 1 + e^{−β}) / (1 − e^{−β}) )² .    (13.9)

This is thus the algorithmic transition that in the SBM marks the onset of a phase where belief propagation is able to find the optimal overlap with the ground-truth group assignment.
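As a quick numerical aid (ours, not from the notes), the threshold (13.9) can be evaluated directly:

import numpy as np

def c_ell(q, beta):
    # Kesten-Stigum / linear-stability threshold of eq. (13.9)
    return ((q - 1 + np.exp(-beta)) / (1 - np.exp(-beta))) ** 2

# for proper colorings (beta -> infinity) this reduces to (q - 1)^2:
print(c_ell(5, 50.0))   # ~ 16 for q = 5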
We derived this phase transition via the stability of the paramagnetic BP fixed point. Since the BP equations for planted coloring are the same as those for random graph coloring, the stability of the BP fixed point also applies to random graph coloring. Indeed, at the end of Section 6 we derived a condition (6.24) for when the BP for random graph coloring converges to the paramagnetic fixed point and when it moves away from it. This condition is exactly eq. (13.9). In fact, in the normal (non-planted) coloring the BP algorithm does not converge for c > c_ℓ.
Let us now review in the following table how this BP convergence threshold is related to the estimate of the colorability threshold for c_in = 0, or equivalently β → ∞, and to the average degree beyond which we do not know of polynomial algorithms that provably find proper colorings for a large number of colors. We see from the table that the convergence of BP has little to do with colorability, nor with the hardness of finding a proper coloring. In the planted coloring c_ℓ is the algorithmic threshold beyond which the inference of the planted coloring becomes algorithmically easy, but in the non-planted coloring the BP convergence threshold c_ℓ does not have algorithmic consequences.
We recall that for the SBM we found a threshold c_c (for a continuous phase transition c_c = c_ℓ) such that
• For c < c_c the BP fixed point with the largest free entropy is the paramagnetic fixed point with χ^{i→j}_a = n_a and Q = 0.
• For c > c_c the BP fixed point with the largest free entropy is the ferromagnetic fixed point with Q > 0.
Recall Fig. 10.4.3 for illustration of this threshold and also recall the fact that for c < cc the
paramagnetic fixed point thus provides the exact marginals for most variables and the exact
free entropy in the leading order.
A consequence of the paramagnetic fixed point being the one describing the correct marginals,
overlap and free energy is that there is no information in the graph that allows us to find a
configuration correlated with the planted configuration. This can only be true if all properties
that are true with high-probability in the limit of large graphs are the same in the planted
graph as they would be in a non-planted graph. Properties that hold with high-probability
in the large size limit are called thermodynamic properties in physics. Mathematically, this
kind of indistinguishability of two graph ensembles (the random and the planted one) by
high-probability properties is termed contiguity. For c < c_c the random and planted ensembles are hence contiguous, i.e. not distinguishable by thermodynamic properties. This among other things implies that for c < c_c the paramagnetic BP fixed point and the corresponding Bethe free entropy are exact in the leading order not only for the planted graphs but also for the random ones. We will see in the next lecture that for c > c_c this is not the case, thus answering one of the main questions we have about the correctness of BP for graph coloring.
The Nishimori identities for Bayes-optimal inference imply that the planted configuration has all the properties of a configuration randomly sampled from the posterior distribution. Putting this property together with the contiguity of the random and planted ensembles for c < c_c allows us to deduce properties of typical equilibrium configurations of the random ensemble from the planted configuration.
In the section on the SBM we observed empirically that the fixed point of BP and the marginals reached by a correspondingly initialized Markov chain Monte Carlo in time linear in the size of the system are equivalent in the large size limit (MCMC is just slower to converge). We thus conclude that if BP converges to the paramagnetic fixed point from the planted initialization, i.e. for c < c_d, then MCMC is able to equilibrate, i.e. to estimate in linear time, correctly in the leading order, all quantities that concentrate. Thus for c < c_d dynamics such as MCMC equilibrates in a number of steps that is O(1) per node; from the physics point of view this happens when the phase is liquid and not glassy.
In the cases where we have a 1st order phase transition in the planted model, i.e. when c_d < c_c, the phase occurring for c_d < c < c_c has interesting properties. Recall that for all c < c_c the planted and random ensembles are contiguous. Yet in the planted ensemble the BP initialized in the planted configuration converges to a fixed point strongly correlated with the planted configuration, Q > 0. Because of the contiguity this means that also in the random ensemble BP initialized in a configuration drawn uniformly at random from the Boltzmann distribution would converge to an analogous fixed point. Considering that such a BP fixed point describes the subspace that would be sampled by MCMC initialized in the same way, we conclude that dynamics initialized at equilibrium would remain stuck close to its initialisation and would not explore the whole phase space in linear time. The domain of attraction where the dynamics is stuck will be called a cluster. Randomly initialized MCMC will not be able to equilibrate, i.e. to sample the space of configurations almost uniformly, for c_d < c < c_c. In this phase it is thus conjectured that sampling configurations uniformly from the Boltzmann measure is algorithmically hard.
Having introduced the notion of clustering, let us state one possible definition of how to split configurations into clusters via equivalence classes of BP fixed points: consider the random graph coloring problem at inverse temperature β, consider the BP equations (at the same temperature) initialized in all possible configurations, and iterate each of them to a fixed point. Then define a cluster of configurations as the set of all configurations from which BP converges to the same fixed point. Clusters are then equivalence classes of configurations with respect to BP fixed points. In case BP does not reach convergence from some configurations, we can think of one additional cluster collecting all such configurations. While it may be harder to relate this definition to more physical notions such as Gibbs states, it is the one that is most readily translated into a method of analysis able to describe the organization of clusters, e.g. how many there are of a given free entropy.
The size of a cluster is described by the free entropy of the corresponding BP fixed point, Φ_ferro. We recall that for c < c_c the total equilibrium free entropy is the paramagnetic one, Φ_para, which in the phase c_d < c < c_c is strictly larger than the ferromagnetic one, Φ_para > Φ_ferro. If equilibrium solutions are in clusters (basins of attraction) of free entropy Φ_ferro and the total free entropy is Φ_para, it must mean that there are e^{N(Φ_para − Φ_ferro)} such clusters. We define the complexity of clusters as the logarithm of the number of such clusters per node, i.e.

e^{NΣ} = number of clusters    (13.10)

where

Σ = Φ_para − Φ_ferro    (13.11)
for c ∈ (c_d, c_c). Since at c_d the ferromagnetic fixed point appears discontinuously, exponentially many clusters appear discontinuously there. As c → c_c the complexity Σ then goes to zero.
To summarize
• For c < cd most configurations belong to the same cluster, MCMC is able to equilibrate
in linear time.
• For cd < c < cc configurations belong to one of exponentially many clusters, MCMC is
not able to equilibrate in linear time.
The fact that dynamics is able to equilibrate in linear time below the threshold and unable to do so above it gives c_d its name, the dynamical threshold.
So far the method we described to locate the clustering threshold c_d is based on running the BP equations from the random and planted initializations and comparing the free entropies of the corresponding fixed points. This is a somewhat demanding procedure and does not lead to a simple closed-form expression for the threshold such as the one we obtained e.g. for the linear stability threshold c_ℓ. In the next section we give a closed-form upper bound on the clustering threshold c_d for the case of proper colorings, i.e. when c_in = 0 or equivalently β → ∞.
Figure 13.2.1: Cartoon of the space of proper colorings in random graph coloring problem.
The grey circles correspond to unfrozen clusters, the black circles to frozen clusters. The size
of the circle illustrates the size of the clusters.
For graph coloring at zero temperature, β → ∞, the BP fixed points initialized in a solution (proper coloring) may have so-called frozen variables, i.e. variables i that stay at their initial value all the way to the fixed point. This means that the frozen variables take the same value in all the solutions belonging to the cluster. Frozen variable: variable i is frozen if at the fixed point of BP we still have χ^{j→i}_s = δ_{s,s^*_j}. Frozen cluster: a cluster is frozen if a finite fraction of its variables are frozen.
In order to monitor whether the clusters in which a randomly sampled solution lives are frozen or not, we recall the contiguity between the random and planted ensembles and the fact that the planted configuration has all the properties of equilibrium configurations. This allows
us to monitor the threshold where typical clusters get frozen by tracking the updates of BP
initialized in the planted configuration and monitoring how many variables are frozen.
Let us consider a simple case, a random d-regular graph, which is locally tree-like in the large size limit N → ∞. Consider a rooted tree around one of the nodes. The planted solution can be seen as a broadcasting process on that tree where the root takes a given color and then recursively the children (i.e. the neighbors in the direction of the leaves) are given at random one of the other colors. The question of the existence of frozen variables then translates into the question of whether the leaves of such a tree of large depth imply the value of the root, or whether the root can take another color while staying compatible with the coloring rule and the colors on the leaves.
Denote η_ℓ the probability that a node in the ℓ-th generation is implied by the leaves above it, i.e. can take only one color (the planted one). A node is implied to have its planted color if each of the other colors is implied in at least one of its children.
Assume now that a given set of r of the remaining q − 1 colors is not implied on any of the d − 1 children. This probability can be written as

Pr(the given r colors are not implied by any of the children)
= {Pr(the given r colors are not implied by a given child)}^{d−1}
= {1 − Pr(a given child is implied to take one of the r given colors)}^{d−1}
= ( 1 − r η_{ℓ+1}/(q − 1) )^{d−1} .
We then proceed by the inclusion–exclusion principle: for r = 1 we over-counted the cases where in fact two colors were not implied, etc., obtaining

η_ℓ = 1 − ∑_{r=1}^{q−1} (−1)^{r−1} \binom{q−1}{r} ( 1 − r η_{ℓ+1}/(q − 1) )^{d−1}
    = 1 + ∑_{r=1}^{q−1} (−1)^{r} \binom{q−1}{r} ( 1 − r η_{ℓ+1}/(q − 1) )^{d−1} .
For a tree of depth L we start from η_L = 1, since the nodes in the L-th generation are leaves themselves. The probability η_ℓ is then updated as we proceed down towards the root, until a fixed point is reached.
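A small Python sketch (ours) of this recursion: it iterates η on the d-regular tree starting from η = 1 at the leaves and reports the smallest degree at which a positive fixed point survives, which is how the rigidity threshold is located numerically. Function names and numerical tolerances are illustrative choices.

import numpy as np
from math import comb

def eta_fixed_point(q, d, n_iter=2000, tol=1e-12):
    # iterate the hard-field recursion starting from eta = 1 (all leaves frozen)
    eta = 1.0
    for _ in range(n_iter):
        new = 1.0 + sum((-1) ** r * comb(q - 1, r)
                        * (1.0 - r * eta / (q - 1)) ** (d - 1)
                        for r in range(1, q))
        if abs(new - eta) < tol:
            return new
        eta = new
    return eta

def rigidity_degree(q, d_max=200):
    # smallest degree d for which a strictly positive fixed point survives,
    # i.e. typical clusters contain frozen variables
    for d in range(q, d_max):
        if eta_fixed_point(q, d) > 1e-6:
            return d
    return None

print(rigidity_degree(5))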
The rigidity threshold c_r then provides an upper bound on the dynamical threshold c_d. For a small number of colors the two thresholds are not particularly close to each other, and the rigidity threshold is even larger than c_c for q < 9. But for a large number of colors q ≫ 1 we get by expansion c_r = q log(q) + o(q log(q)). This tells us that for a large number of colors the clusters containing typical solutions are frozen with high probability from average degree c_r onward.
We note that while we presented this argument for d-regular trees, for random graphs with a fluctuating degree distribution we can obtain an analogous closed-form expression with the same large-q behaviour in the leading order.
Contiguity between the planted and random ensembles only holds when the paramagnetic (or an analogous) fixed point exists. In more general problems, such as random K-SAT, the planted ensemble is always different from the random one. However, the notion of clustering, of clusters corresponding to basins of attraction of Monte Carlo sampling, as well as their definition via BP fixed points, is more general and holds beyond the models where the planted and random ensembles are related.
We also note that the scaling of the rigidity threshold at a large number of colors, c_r ≈ q log(q), coincides with the scaling beyond which no analyzable algorithm is known to work. Moreover, numerical tests in the literature show that solutions found by polynomial algorithms do not lead to a frozen BP fixed point. This leads to the conjecture that frozen solutions are hard to find. At the same time, the rigidity threshold concerns the freezing of typical clusters, and there might be atypical ones that remain unfrozen up to a much larger average degree. This is indeed expected to be the case in analogy with related constraint satisfaction problems that have been analyzed Braunstein et al. (2016). Overall, the precise threshold at which solutions become hard to find remains open, even at the level of a conjecture.
Chapter 14
In the last section we deduced that in a region of parameters the space of solutions in the
random graph coloring problem is split into so-called clusters. We associated clusters to BP
fixed points, and their free entropy to the Bethe free entropy ΦBethe corresponding to the fixed
point. This allows us to analyse how many clusters there are of a given free entropy in the same
way as we analyzed how many configurations there are of a given energy/cost. We define the
complexity function Σ(ΦBethe ) as the logarithm of the number of BP fixed points with a given
Bethe free entropy ΦBethe per node. With this definition we can define the so-called replicated
free entropy Ψ(m) as
e^{NΨ(m)} = ∑_{BP fixed points} e^{N m Φ_Bethe} = ∫ dΦ e^{N[Σ(Φ) + mΦ]}    (14.1)
Just as before for the relation between energy, entropy and free entropy, we have from the
properties of the saddle point and the Legendre transform that
∂Σ(Φ)/∂Φ = −m ,    ∂Ψ(m)/∂m = Φ .    (14.2)
Thus if we are able to compute the replicated free entropy Ψ(m) we can compute from it the
number of clusters of a given free entropy Σ(Φ) just as we did when computing the number
of configurations of a given energy.
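As an illustration (ours, with a toy Ψ(m) that is not taken from the notes), the Legendre relations (14.2) can be applied numerically: given Ψ(m) on a grid, one recovers Φ = ∂Ψ/∂m and Σ = Ψ − mΦ.

import numpy as np

def complexity_from_replicated_potential(psi, m_grid):
    # numerical Legendre transform of eq. (14.2):
    # Phi(m) = dPsi/dm and Sigma(Phi) = Psi(m) - m Phi(m)
    psi_vals = psi(m_grid)
    phi = np.gradient(psi_vals, m_grid)
    sigma = psi_vals - m_grid * phi
    return phi, sigma

# toy quadratic Psi(m), for which the transform can be checked by hand
psi = lambda m: 0.3 * m + 0.05 * m ** 2
phi, sigma = complexity_from_replicated_potential(psi, np.linspace(0.1, 1.0, 10))
print(np.round(phi, 3))
print(np.round(sigma, 3))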
We would now like to determine the free entropy Φ of the clusters that contain the equilibrium configurations, i.e. configurations sampled at random from the Boltzmann measure. For this we need to consider the free entropy Φ that maximizes the total free entropy Φ + Σ(Φ). We also need to take into account that BP fixed points are structures that can be counted, and if they exist there is at least one of them; hence Σ(Φ) cannot be negative, since the logarithm of a positive integer is non-negative. Thus the equilibrium of the system is described by

Φ_total = max_{Φ: Σ(Φ) ≥ 0} [ Φ + Σ(Φ) ] .    (14.3)
This expression can be maximized in two possible ways. Assume that in a neighborhood of m = 1 the complexity Σ(Φ) is non-zero (except possibly at one point where it crosses zero). Define Φ_1 as the value of the free entropy at which the function Σ(Φ) has slope −1, i.e.

∂Σ(Φ)/∂Φ |_{Φ_1} = −1 ,    (14.4)
then
• Either Σ(Φ_1) > 0; in this case eq. (14.3) is maximized at Φ_1 and the number of clusters of that free entropy is exponentially large, corresponding to Σ(Φ_1). We call such a clustered phase the dynamical one-step replica symmetry breaking phase.
• Or Σ(Φ_1) < 0; in this case the value of the free entropy that maximizes expression (14.3) is the largest Φ such that Σ(Φ) > 0. We denote it Φ_0. When this happens the dominating part of the free entropy is carried by the largest clusters, which are not exponentially numerous; in fact an arbitrarily large fraction of the weight is carried by a finite number of the largest clusters. The equilibrium of the system is thus condensed in a few of the largest clusters. We call this phase the static one-step replica symmetry breaking phase.
In the last lecture we computed the complexity of clusters in random graph coloring for c_d < c < c_c as the difference between the free entropies of the paramagnetic and the ferromagnetic fixed points. In problems where we have contiguity between the planted and the random ensembles, this complexity from eq. (13.11) is exactly equal to Σ(Φ_1) and thus goes to zero at the condensation threshold c_c. Above the condensation threshold c > c_c the equilibrium of the system is no longer described by clusters of free entropy Φ_1 (which is equal to the free entropy of the planted cluster); instead it is given by Φ_0 as defined above, the free entropy at which the complexity becomes zero.
As the average degree grows further in the random graph coloring problem, the maximum of the complexity curve Σ(Φ) reaches zero, at which point the last existing clusters disappear; this marks the colorability threshold. To compute the colorability threshold we thus need to count the total number of clusters, corresponding to the maximum of the curve Σ(Φ).
From the previous section we see that we can learn many details about clusters if we are able
to compute the replicated free entropy Ψ(m) from eq. (14.1) where we sum the exponential of
the Bethe free entropy to the power m over all the BP fixed points. More explicitly this means
to compute
e^{NΨ(m)} = ∑_{BP fixed points} e^{m [ ∑_i log Z^i + ∑_a log Z^a − ∑_{(ia)} log Z^{ia} ]}    (14.5)
Figure 14.1.1: Illustration of the complexity curve Σ(Φ), the logarithm of the number of clusters per node versus the free entropy Φ. The point where the curve has slope −1 is marked in purple. Recall the properties of the Legendre transform that imply that the slope of this curve is equal to −m.
What we just wrote is in fact very naturally a partition function of an auxiliary problem living on the original graph, where the variables (χ^{i→a}, ψ^{a→i}) and the fields (Z^{ia})^m live on the original edges, and the factors live on the original variable nodes i and factor nodes a.
We thus put the problem of computing the replicated free energy Ψ(m) into a form in which
we can readily apply belief propagation as we derived it in Section 5.2. The only difference
now is that the variables are continuous real numbers and that the messages in the auxiliary
problem are probability distributions. Moreover we realize that it is consistent to assume that
the new BP message in the auxiliary problem in the direction from i → a does not depend on
the original message in the opposite direction, because of the assumed independence between
BP messages. This allows us to write the BP for the auxiliary problem that is a generic form of
a survey propagation (instead of messages we are now passing surveys). These equations
are also called the one step replica symmetry breaking cavity equations (but deriving the
Figure 14.2.1: Illustration of the graphical model constructed from the original factor graph
used to analyze BP fixed points.
same result with the replica approach is rather demanding and much less transparent). The
parameter m is called the Parisi parameter. They read
P^{a→i}(ψ^{a→i}) = (1/Z^{a→i}) ∫ ∏_{j∈∂a\i} dP^{j→a}(χ^{j→a}) (Z^{a→i})^m δ[ ψ^{a→i} − F_ψ({χ^{j→a}}_{j∈∂a\i}) ]    (14.10)

P^{i→a}(χ^{i→a}) = (1/Z^{i→a}) ∫ ∏_{b∈∂i\a} dP^{b→i}(ψ^{b→i}) (Z^{i→a})^m δ[ χ^{i→a} − F_χ({ψ^{b→i}}_{b∈∂i\a}) ]    (14.11)
where the messages are now probability distributions over the original BP messages. As before we need to find a fixed point of these equations, which on a given single graph is numerically demanding because we need to update a full probability distribution (in fact two) on every edge.
In problems on random regular graphs without any additional disorder, e.g. graph coloring on random regular graphs, the correct solution often has a form that is independent of the edge (ia). In that case we have the same probability distribution on every edge, and the fixed point can be found naturally by so-called population dynamics, where the probability distribution over messages is represented by a large set of samples, and the reweighting factor (Z^{i→a})^m or (Z^{a→i})^m is proportional to the probability with which each element appears in the population.
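Below is a minimal sketch (ours) of such a population dynamics, specialized for illustration to zero-temperature coloring on random d-regular graphs, where the BP update is the β → ∞ limit of (13.7) and the reweighting factor is the local normalization Z raised to the power m. Population size, number of sweeps and the resampling scheme are illustrative choices, not prescriptions from the notes.

import numpy as np

def population_dynamics_coloring(q, d, m, pop_size=5000, n_sweeps=50, seed=0):
    # the population represents the distribution P(chi) of BP messages over
    # clusters; each sweep redraws every member from d-1 random parents and
    # resamples the population proportionally to the reweighting factor Z^m
    rng = np.random.default_rng(seed)
    pop = rng.dirichlet(np.ones(q), size=pop_size)     # random soft messages
    for _ in range(n_sweeps):
        new_pop = np.empty_like(pop)
        weights = np.empty(pop_size)
        for t in range(pop_size):
            idx = rng.integers(pop_size, size=d - 1)   # d-1 incoming messages
            prod = np.prod(1.0 - pop[idx], axis=0)     # zero-T coloring BP update
            Z = prod.sum() + 1e-300                    # local normalization
            new_pop[t] = prod / Z
            weights[t] = Z ** m                        # 1RSB reweighting factor
        keep = rng.choice(pop_size, size=pop_size, p=weights / weights.sum())
        pop = new_pop[keep]
    return pop

pop = population_dynamics_coloring(q=4, d=8, m=1.0)
print("average message:", pop.mean(axis=0))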
The replicated free entropy can then be computed from the fixed point using the usual recipe for the Bethe free entropy of the auxiliary model,
where
e^{mΦ^i} = ∫ ∏_{a∈∂i} dP^{a→i}(ψ^{a→i}) (Z^i)^m    (14.13)

e^{mΦ^a} = ∫ ∏_{i∈∂a} dP^{i→a}(χ^{i→a}) (Z^a)^m    (14.14)

e^{mΦ^{ia}} = ∫ dP^{a→i}(ψ^{a→i}) dP^{i→a}(χ^{i→a}) (Z^{ia})^m .    (14.15)
The curve Σ(Φ) is then computed from the Legendre transform of Ψ(m).
It might happen that the clusters themselves cluster into super-clusters, or split into mini-clusters. This would then lead to so-called two-step replica symmetry breaking. One could continue to speak about K-step RSB, or full-RSB in case an infinite hierarchy of steps is needed. Currently we do not know how to solve the full-RSB equations on random sparse graphs; in dense optimization problems this can be done, but it is beyond the scope of the present lecture.
The concept of frozen variables that we introduced to upper bound the clustering threshold enables a key simplification in computing the colorability threshold c_col that we have aimed at since the very first lecture on graph coloring. We will again restrict to random d-regular graphs, as the computation simplifies in this case. We learned that the space of solutions is divided into clusters and that some clusters have frozen variables.
Note that frozen variables are needed in order for a cluster of solutions to vanish when the average degree is infinitesimally increased (the addition of a new edge then creates a contradiction with high probability). Further, it is not possible to have a very small fraction of frozen variables, because they must sustain each other in a sort of backbone structure. We will thus assume that the colorability threshold is given by the average degree at which all clusters with frozen variables disappear. Since clusters were identified with BP fixed points, counting frozen clusters to see when they all disappear becomes counting BP fixed points with frozen variables.
The key idea in describing clusters is again that being a BP fixed point is just another type of constraint on another type of variables. It is an auxiliary problem that can be formulated using a tree-like graphical model and solved again using belief propagation, and the number of frozen clusters follows from the value of the corresponding free entropy. This "BP on frozen BP fixed points" is called survey propagation.
Let us now describe the structure of a frozen BP fixed point. The frozen BP messages are χ^{j→i}_{s_j} = δ_{s_j, s}, denoted "s", for the q possible values of s; the unfrozen ones are χ^{j→i}_{s_j} = anything else, denoted "∗" and called the "joker". For each edge (ij), in each direction, the message is thus either the joker or frozen to one of the actual colors. The variables of the auxiliary problem now live on the edges; let ν_{ij} ∈ {∗, 1, . . . , q} denote the value of the variable node (ij), so each of these variables can be in one of q + 1 possible states. In the new graphical model the BP equations lead to the constraints, the terms in the Bethe entropy are the new factor weights, and the BP messages are the new variables. The new messages are probability distributions over the original BP messages.
Let us now discuss the constraints (factor nodes) that will be placed at the original nodes (as in graph matching problems):
• If an incoming value ν_{ki} = s, the outgoing value ν_{ij} cannot be s, since the neighbor k is frozen to s; if an incoming value ν_{ki} = ∗, then it does not impose any constraint on the outgoing value ν_{ij}.
• If ν_{ij} = ∗, at least two colors are not forbidden by the incoming edges; if ν_{ij} = s, only s is not forbidden by the incoming edges.
We can now write the survey propagation equation, which is just BP on this new graphical
model
n^{i→j}_{ν_{ij}} = (1/Z^{i→j}) ∑_{{ν_{ki}}_{k∈∂i\j}} C( ν_{ij}, {ν_{ki}}_{k∈∂i\j} ) ∏_{k∈∂i\j} n^{k→i}_{ν_{ki}} .
Survey propagation (SP) equations can be made more explicit by using the specific form of the
constraints imposed above. We proceed again according to the inclusion-exclusion principle
• When ν_{ij} ≠ ∗:

n^{i→j}_{ν_{ij}} = (1/Z^{i→j}) { ∏_{k∈∂i\j} (1 − n^{k→i}_{ν_{ij}})  [neighbors do not forbid ν_{ij}]  − corrections for the cases that also allow some p ≠ ν_{ij} }

= (1/Z^{i→j}) { ∏_{k∈∂i\j} (1 − n^{k→i}_{ν_{ij}}) − ∑_{p≠ν_{ij}} ∏_{k∈∂i\j} (1 − n^{k→i}_{ν_{ij}} − n^{k→i}_p) + · · · − (−1)^q ∏_{k∈∂i\j} (1 − ∑_{p=1}^q n^{k→i}_p)  [nobody is forbidden] }

= (1/Z^{i→j}) ∑_{l=0}^{q−1} (−1)^l ∑_{V ⊆ {1,…,q}\{ν_{ij}\}, |V|=l} ∏_{k∈∂i\j} ( 1 − n^{k→i}_{ν_{ij}} − ∑_{v∈V} n^{k→i}_v )
• When ν_{ij} = ∗:

n^{i→j}_∗ = (1/Z^{i→j}) Pr(at least two colors are not forbidden)

= (1/Z^{i→j}) ∑_{l=2}^{q} (−1)^l (l − 1) ∑_{V ⊆ {1,…,q}, |V|=l} ∏_{k∈∂i\j} ( 1 − ∑_{v∈V} n^{k→i}_v ) .
The Bethe free entropy of this problem is the complexity Σ, counting all the frozen clusters. We remind the reader that the complexity we expressed in the previous section was that of the clusters containing typical solutions; in the next lecture we will see the relation between the two. The complexity can be computed from the fixed point of SP using the usual recipe for the Bethe free entropy

Σ = (1/N) ∑_{i=1}^N log Σ^{(i)} − (1/N) ∑_{(ij)∈E} log Σ^{(ij)}
Σ^{(i)} = ∑_{l=1}^{q} (−1)^{l−1} ∑_{V ⊆ {1,…,q}, |V|=l} ∏_{k∈∂i} ( 1 − ∑_{v∈V} n^{k→i}_v )

Σ^{(ij)} = 1 − ∑_{p=1}^{q} n^{i→j}_p n^{j→i}_p
Here the term Σ(i) is related to the normalization of the messages that does not exclude the
node j, the term Σ(ij) is then counting the probability that the messages in the two directions
of one edge are compatible, i.e. not imposing the same color on the two ends.
On a d-regular graph and assuming symmetry among colors (np = η, ∀ p = 1, . . . , q), this
leads to a very concrete conjecture for the colorability threshold of random d-regular graphs:
η = [ ∑_{ℓ=1}^{q} (−1)^{ℓ−1} \binom{q−1}{ℓ−1} (1 − ℓη)^{d−1} ] / [ ∑_{ℓ=1}^{q} (−1)^{ℓ−1} \binom{q}{ℓ} (1 − ℓη)^{d−1} ]

Σ = log { ∑_{ℓ=1}^{q} (−1)^{ℓ−1} \binom{q}{ℓ} (1 − ℓη)^d } − (d/2) log( 1 − qη² )
We first compute η as a fixed point of the first equation, and then plug this fixed point into Σ to compute the complexity. If Σ > 0, a random d-regular graph is colorable w.h.p. for large N, and uncolorable if Σ < 0. For random graphs with fluctuating degrees the expressions are only slightly more involved, but are derived in the very same spirit.
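The following sketch (ours) implements this recipe numerically: it iterates the fixed-point equation for η and evaluates Σ for a range of degrees; the conjectured colorability threshold for random d-regular graphs is where Σ changes sign. The iteration scheme and the initial condition are illustrative choices.

import numpy as np
from math import comb, log

def sp_fixed_point(q, d, n_iter=5000, eta0=None, tol=1e-14):
    # iterate the symmetric SP fixed-point equation for eta on d-regular graphs
    eta = (1.0 / q) if eta0 is None else eta0
    for _ in range(n_iter):
        num = sum((-1) ** (l - 1) * comb(q - 1, l - 1) * (1 - l * eta) ** (d - 1)
                  for l in range(1, q + 1))
        den = sum((-1) ** (l - 1) * comb(q, l) * (1 - l * eta) ** (d - 1)
                  for l in range(1, q + 1))
        new = num / den
        if abs(new - eta) < tol:
            break
        eta = new
    return eta

def complexity(q, d):
    # complexity Sigma of frozen clusters at the SP fixed point
    eta = sp_fixed_point(q, d)
    if eta < 1e-8:          # only the trivial SP solution: no frozen clusters found
        return float("nan")
    s_site = sum((-1) ** (l - 1) * comb(q, l) * (1 - l * eta) ** d
                 for l in range(1, q + 1))
    return log(s_site) - 0.5 * d * log(1 - q * eta ** 2)

for d in range(10, 20):
    print(d, complexity(5, d))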
Going back to the overall picture we now described the whole regime of average degrees,
except cc < c < ccol . This is the condensed phase which has rather peculiar properties and
will be discussed in the next lecture.
14.4 Exercises
The random-subcube model is defined by its solution space S ⊂ {0, 1}^N (not by a graphical model). We define S as the union of ⌊2^{(1−α)N}⌋ random clusters (where ⌊x⌋ denotes the integer part of x). A random cluster A is defined as a random subcube A = π_1^A × π_2^A × · · · × π_N^A such that, for each variable i, π_i^A = {0} with probability p/2, {1} with probability p/2, and {0, 1} with probability 1 − p. A cluster is thus a random subcube of {0, 1}^N. If π_i^A = {0} or {1}, variable i is said to be "frozen" in A; otherwise it is said to be "free" in A. One given configuration σ might belong to zero, one or several clusters. A "solution" belongs to at least one cluster.
We will analyze the properties of this model in the limit N → ∞, the two parameters α and p being fixed and independent of N. The internal entropy s of a cluster A is defined as (1/N) log_2(|A|), i.e. the fraction of free variables in A. We also define the complexity Σ(s) as the base-2 logarithm of the number of clusters of internal entropy s, divided by N.
(b) Compute the αd threshold below which most configurations belong to at least one
cluster.
(c) For α > αd write the expression for the complexity Σ(s) as a function of the parame-
ters p and α. Compute the total entropy defined as stot = maxs [Σ(s) + s | Σ(s) ≥ 0].
Observe that there are two regimes in the interval α ∈ (αd , 1), discuss their proper-
ties and write the value of the “condensation” threshold αc .
Chapter 15
In this chapter, we come back to the replica method. We will in particular discuss replica symmetry breaking (in its simplest form, one-step replica symmetry breaking, 1RSB for short) and make contact with the form of 1RSB discussed within the cavity method.

To do so, we shall discuss a very simple model that was introduced by Bernard Derrida in 1980 to understand spin glasses and the replica method. It turns out to be indeed a very good toy model for understanding the key concepts, and in fact it is also a very important model in its own right, connecting with important concepts in denoising and information theory.

The random energy model is a trivial spin model with N variables, in the sense that the energies of the possible M = 2^N configurations are chosen, once and for all, randomly from a Gaussian distribution.
Formally, in the Random Energy Model (REM), we have 2^N configurations, with random fixed energies E_i sampled from

P(E) = N(0, N/2) = (1/√(πN)) e^{−E²/N} ,    (15.1)

and the partition function is

Z_N = ∑_{i=1}^{2^N} e^{−βE_i} .    (15.2)
We start by computing the exact asymptotic solution of the model, without using the replica method. To do so, we shall show that the entropy (in the sense of Boltzmann) density as a function of the energy has a deterministic asymptotic limit that we can compute.
We first ask: if we sample 2^N energies randomly, what is the number of them that fall in [Ne, N(e + de)]? Let us call this number #(e). This is a random variable, it depends on the specific draw of the 2^N energies, but we can compute its mean and variance. First we compute the average; when de is small enough, we have

E[#(e)] = 2^N E[ 1(E ∈ [Ne, N(e + de)]) ] ≈ (2^N/√(πN)) e^{−Ne²} de = (1/√(πN)) e^{−N(e² − log 2)} de .    (15.3)
Also, since the probability that an energy drawn from P(E) falls in [Ne, N(e + de)] is small, #(e) follows a Poisson law, so that its variance is equal to its mean. Defining the function s_ann(e) = log 2 − e² (physicists call this quantity the annealed entropy density), we thus have two regimes. Indeed, by Chebyshev's inequality,
P( |#(e)/E[#(e)] − 1| ≥ k ) = P( (#(e)/E[#(e)] − 1)² ≥ k² ) ≤ E[ (#(e) − E[#(e)])² ] / ( k² E[#(e)]² )
= Var[#(e)] / ( k² E[#(e)]² ) = E[#(e)] / ( k² E[#(e)]² ) ∝ e^{−N s_ann(e)} / k² .    (15.5)
where we have used again the fact that #(e) is Poisson distributed, so that its variance equals its mean. As N grows, the probability of a deviation goes exponentially to zero whenever s_ann(e) > 0, so that with high probability #(e)/E[#(e)] is arbitrarily close to 1.
Thanks to this very tight concentration, we can now state that, with high probability, the number of configurations at energy density e is either 0 —if s_ann(e) is negative— or close to e^{N s_ann(e)} otherwise. More precisely, we can write the entropy density as a function of e:

s(e) := s_ann(e) if s_ann(e) ≥ 0 ,   and   s(e) = −∞ otherwise .    (15.6)
With this knowledge, we can now solve the random energy model easily, without any disorder
averaging. We write that, with high probability, we have
Z_N = ∑_E #(E) e^{−βE} ≈ ∫_{−√log 2}^{√log 2} de  e^{−N(βe − s(e))}    (15.7)
and using the Laplace method, where the integral is dominated by its maximum, we obtain the free entropy as the Legendre transform of the entropy:

lim_{N→∞} (1/N) log Z = s(e^*) − βe^*    (15.8)

with

e^* = argmin_{e ∈ [−√log 2, √log 2]} [ βe − s(e) ] .    (15.9)
Again, there are two situations: the minimum can be reached where the derivative vanishes, that is when

β = ∂_e s(e) = −2e ,    (15.10)

but this can only be the case when s(e) > 0. However, at e = −√log 2, s(e) reaches zero, and at this point s′(e) = 2√log 2. So this minimum is only valid when β < β_c = 2√log 2, after which e^* = −√log 2. In a nutshell, we have:
lim_{N→∞} (1/N) log Z = −(e^*)² + log 2 − βe^* = log 2 + β²/4 ,   if β < β_c = 2√log 2    (15.11)

lim_{N→∞} (1/N) log Z = β√log 2 ,   if β ≥ β_c = 2√log 2    (15.12)
The phase transition arising at β_c is called a condensation transition. This is because, at this point, all the probability measure condenses onto the lowest energy configurations, of energy close to −N√log 2. This phenomenon is of crucial importance in understanding the 1RSB phase transition.
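As a small numerical illustration (ours, not from the notes), the exact limit (15.11)-(15.12) can be compared with a direct finite-N evaluation of (1/N) log Z_N obtained by sampling the 2^N energies:

import numpy as np

def rem_free_entropy(beta):
    # exact asymptotic free entropy density, eqs. (15.11)-(15.12)
    beta_c = 2.0 * np.sqrt(np.log(2.0))
    if beta < beta_c:
        return np.log(2.0) + beta ** 2 / 4.0
    return beta * np.sqrt(np.log(2.0))

# finite-size check: sample the 2^N energies directly for a modest N
N = 20
E = np.random.default_rng(0).normal(0.0, np.sqrt(N / 2.0), size=2 ** N)
for b in (0.5, 1.0, 2.0, 3.0):
    logZ = np.logaddexp.reduce(-b * E)          # numerically stable log-sum-exp
    print(f"beta = {b}:  finite N: {logZ / N:.3f}   asymptotic: {rem_free_entropy(b):.3f}")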
We now move on to the computation by the replica method. We start with the replicated partition sum:

Z^n = ∏_{a=1}^{n} ∑_{i=1}^{2^N} e^{−βE_i}    (15.13)

Z^n = ∑_{i_1,…,i_n=1}^{2^N} e^{−β(E_{i_1} + … + E_{i_n})}    (15.14)

= ∑_{i_1,…,i_n=1}^{2^N} e^{−β ∑_{a=1}^n E_{i_a}} = ∑_{i_1,…,i_n=1}^{2^N} e^{−β ∑_{j=1}^{2^N} E_j ∑_{a=1}^n 1(j=i_a)}    (15.15)

= ∑_{i_1,…,i_n=1}^{2^N} ∏_{j=1}^{2^N} e^{−βE_j ∑_{a=1}^n 1(j=i_a)}    (15.16)
We now compute the average over the disorder. By linearity of expectation, we can push the average inside the sum. Using the independence of the 2^N energies, we can also push the average over each E_j inside the product. Using the Gaussian integral (remember that if X is Gaussian with zero mean and variance ∆, then E[e^{bX}] = e^{b²∆/2}), we thus find that:

E_{E_j} e^{−βE_j ∑_{a=1}^n 1(j=i_a)} = e^{(Nβ²/4) ( ∑_{a=1}^n 1(j=i_a) )²} = e^{(Nβ²/4) ∑_{a,b=1}^n 1(j=i_a) 1(j=i_b)}    (15.17)

E[Z^n] = ∑_{i_1,…,i_n=1}^{2^N} e^{(Nβ²/4) ∑_{a,b=1}^n 1(i_a=i_b)}    (15.18)
Following our tradition of using overlaps, we see that, given the replica configurations (i_1, …, i_n), it is convenient to introduce the n × n overlap matrix Q_{ab} = 1(i_a = i_b), which takes values in {0, 1}, depending on whether the two replicas (row and column) are in different or equal configurations. We can finally write the replicated sum over configurations as

E[Z^n] = ∑_{i_1,…,i_n=1}^{2^N} e^{(Nβ²/4) ∑_{a,b} Q_{a,b}} = ∑_{{Q}} #(Q) e^{N(β²/4) ∑_{a,b} Q_{a,b}}    (15.19)

where ∑_{{Q}} is the sum over all possible such matrices, and #(Q) is the number of configurations that lead to the overlap matrix Q.
Considering, for a moment, n to be an integer, we can derive an instructive bound on the moments of Z. Indeed, a possible value of the matrix Q is the one where all n replicas are in the same configuration, in which case Q_{a,b} = 1, ∀a, b. There are 2^N such replica configurations, so

E[Z^n] > 2^N e^{N(β²/4) n²} > e^{N(β²/4) n²} ,    (15.20)

so that we now know rigorously that the moments of Z grow at least as fast as e^{cn²}. This is bad news. Indeed, if the moments did not grow faster than exponentially, their knowledge would completely determine the distribution of Z, and thus the expectation of its logarithm, according to Carleman’s condition. Since, in our case, they do grow faster, the moments do not necessarily determine the distribution of Z, and in particular the analytic continuation to n < 1 may not be unique. We thus need to choose the "right" one, which is the essence of the replica method. This is precisely what Parisi’s ansatz does for us: it provides a well defined class of analytic continuations, which turns out to be the correct one.
Let us try to perform the analytic continuation n → 0. Keeping for a moment n integer, it is natural to expect the number of configurations corresponding to a given overlap matrix to be exponential, so that, denoting #(Q) = e^{N s_q(Q)}, we write

E[Z^n] ≈ ∫ dQ e^{N [ (β²/4) ∑_{a,b} Q_{a,b} + s_q(Q) ]} =: ∫ dQ e^{N g(β,Q)} .    (15.21)
As N is large, we thus expect to be able to perform a Laplace (or saddle point) approximation by choosing the "right" structure of the matrix Q, the one that dominates the sum. A quite natural guess (physicists like to call this an ansatz) is to assume that all replicas are identical, and therefore that the system should be invariant under relabelling of the replicas (permutation symmetry). In this case, we only have two choices for the entries of Q: 1) Q_{ab} = 1 for all a, b, or 2) Q_{aa} = 1 for a = 1, …, n and Q_{ab} = 0 for a ≠ b. In both cases, we can easily evaluate s_q(Q).
1. If Q_{ab} = 1 for all a, b, then all the replicas are in a single configuration, as we have already seen. Then #(Q) = 2^N, and we find g(β, Q) = n²β²/4 + log 2. This is actually very frustrating, as this does not have a limit with a linear part in n, so we cannot use this solution in the replica method. Clearly, this is a wrong analytic continuation.
2. If instead Q_{aa} = 1 and Q_{ab} = 0 for all a ≠ b, then all replicas are in different configurations; #(Q) = 2^N(2^N − 1)⋯(2^N − n + 1), so that s_q(Q) ≈ n log 2 if n ≪ N. Therefore g(β, Q) = nβ²/4 + n log 2.
At the replica symmetric level, we thus find that the free entropy is given, at all temperatures, by

f_RS(β) = β²/4 + log 2 .    (15.22)

This is not bad, and indeed we know it is the correct solution for β < β_c. However, this solution is obviously wrong for β > β_c.
We now follow Parisi’s prescription and use a different ansatz for the saddle point. We assume the n replicas are divided into groups of m replicas, and that the overlap is 1 within a group and 0 between different groups. The number of groups is of course n/m. In order to count the number of such matrices, we can choose one configuration per group, so that #(Q) = 2^N(2^N − 1)⋯(2^N − n/m + 1), s_q(Q) = (1/N) log #(Q) ≈ (n/m) log 2, and ∑_{a,b} Q_{a,b} = (n/m) m² = nm. Thus, we have

E[Z^n] ≃ e^{N [ (n/m) log 2 + (β²/4) nm ]} ,    (15.23)

with, for integer matrices, 1 ≤ m ≤ n.
We now need to choose m. The historical reasoning to choose m proposed by Parisi sounds almost like a joke, as it seems crazy. First, we know that for n integer we obviously have n ≥ m ≥ 1. Parisi tells us that, when n → 0, we must instead choose n ≤ m ≤ 1. Secondly, as if this was not crazy enough, we will see that it corresponds to a minimum rather than to a maximum. We shall see later on how to rationalize these prescriptions thanks to the cavity method, but for now let us follow them. We extremize the replica free entropy (1/(nN)) log E[Z^n] = (1/m) log 2 + (β²/4) m with respect to m and find m^* = 2√log 2 / β = β_c/β. This, amazingly, turns out to be the correct solution! Indeed, we now have

lim_{N→∞} (1/N) log Z = (1/m^*) log 2 + (β²/4) m^* = β√log 2 ,

which is indeed the correct free entropy for β ≥ β_c. There is definitely something magical here, but it works.
The replica method can also be used to shed some light on the type of phase transition happening at β_c. Indeed, we know that at low temperature the Boltzmann measure condenses onto the lowest energy levels, close to e = −√log 2, but we may ask: how many configurations really matter? This can be answered by computing the participation ratio defined as

Y = ∑_{i=1}^{2^N} ( e^{−βE_i} / Z )² .    (15.30)
Handwavingly speaking, the participation ratio is the inverse of the number of configurations that matter in the sum. Let us compute its average by the replica method! We find

E[Y] = E ∑_{i=1}^{2^N} ( e^{−βE_i} / Z )² = E [ (1/Z²) ∑_{i=1}^{2^N} e^{−2βE_i} ] = lim_{n→0} E [ Z^{n−2} ∑_{i=1}^{2^N} e^{−2βE_i} ]    (15.31)

= lim_{n→0} E ∑_{i_1,…,i_{n−2}} e^{−β(E_{i_1} + E_{i_2} + … + E_{i_{n−2}})} ∑_{i=1}^{2^N} e^{−2βE_i}    (15.32)

Now, using the invariance under permutations of the replicas, we symmetrize the last expression so that

E[Y] = lim_{n→0} (1/(n(n−1))) ∑_{a≠b} E ∑_{i_1,…,i_n} e^{−β(E_{i_1} + E_{i_2} + … + E_{i_n})} 1(i_a = i_b) .    (15.34)
Carrying out this computation on the 1RSB saddle point, one finds E[Y] = 1 − m = 1 − T/T_c for T < T_c = 1/β_c: the participation ratio goes from 1 at zero temperature to 0 at T_c, where the number of configurations that matter (its inverse) diverges. A similar computation of the second moment shows that Y is NOT self-averaging: the number of configurations participating in the measure fluctuates from instance to instance.
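This can be checked numerically; the sketch below (ours, not from the notes) samples the REM energies at modest N and measures Y across independent instances. The sizeable instance-to-instance spread at low temperature is a finite-size illustration of the lack of self-averaging; the sample sizes are arbitrary.

import numpy as np

def participation_ratio(N, beta, n_samples=200, seed=1):
    # sample Y = sum_i (e^{-beta E_i}/Z)^2 over independent draws of the 2^N energies
    rng = np.random.default_rng(seed)
    ys = []
    for _ in range(n_samples):
        E = rng.normal(0.0, np.sqrt(N / 2.0), size=2 ** N)
        w = np.exp(-beta * E - np.max(-beta * E))   # stabilized Boltzmann weights
        w /= w.sum()
        ys.append(np.sum(w ** 2))
    return np.mean(ys), np.std(ys)

beta_c = 2.0 * np.sqrt(np.log(2.0))
for beta in (0.5 * beta_c, 1.5 * beta_c, 3.0 * beta_c):
    mean, std = participation_ratio(16, beta)
    print(f"beta/beta_c = {beta/beta_c:3.1f}:  E[Y] ~ {mean:.3f}  (std {std:.3f})")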
Another way to think about the replica solution is through the distribution of possible overlaps. Consider two actual, real replicas of the system: if we take two independent copies of the same system, what is the probability that they are in the same configuration? Or, equivalently, what is the probability that their overlap is one? This is precisely what we computed with the participation ratio! Indeed, the (average) probability that two copies are in the same configuration reads
E[P(q = 1)] = E ∑_{i,j=1}^{2^N} ( e^{−βE_i} e^{−βE_j} / Z² ) 1(i = j) = E[Y] = lim_{n→0} ∑_{a≠b} 1(Q_{a,b} = 1) / (n(n−1)) .    (15.40)
In the high-temperature solution, where the RS approach is correct, we see that the only possible overlap is 0: if we sample two configurations in the high-temperature region, then with high probability they are different ones.
In the 1RSB low-temperature solution, we see instead that with probability 1 − m we have q = 1 (and therefore, with probability m, we have q = 0). If we sample two configurations, they will be the same one with probability 1 − m. This is another way to think about the condensation phenomenon. In spin glass parlance, one often says that the distribution of overlaps becomes non-trivial: we went from a single delta peak at high temperature

E P_RS(q) = δ(q)    (15.41)

to a distribution with two peaks at low temperature

E P_1RSB(q) = (1 − m) δ(q − 1) + m δ(q) .    (15.42)
15.3 The connection between the replica potential, and the complexity
We have seen that the replica method is indeed very powerful and allows us to compute pretty much everything we want. Can we understand how this version of replica symmetry breaking is connected with the one introduced in the cavity method, while at the same time shedding some light on the strange replica ansatz and minimization? The answer is yes, using the construction derived in the cavity method.
The 1RSB potential — Let us discuss again the idea introduced in the previous chapter. The whole point is that there are many "Gibbs states", and that we would like to count them. The entire partition sum can be divided into "many" partition sums, each of them defining a Gibbs state (so that quantities of interest are well defined and concentrate within each state). We thus have

Z = ∑_α Z_α .    (15.43)
The idea is then that we can count those states. To do so, we introduce this modified partition
sum, that weights all partition sum by a power m:
Z(m) = ∑_α Z_α^m = ∑_α e^{N m f_α}    (15.44)
where we have denoted the free entropy of each state α as f_α = (log Z_α)/N. We assume that there are exponentially many of them, so we replace this discrete sum by an integral, and hope that the version we compute using our approximation (BP, RS, etc.) will be correct:

Z(m) = ∫ df #(f) e^{N m f} = ∫ df e^{N(mf + Σ(f))}    (15.45)

where we have introduced the "complexity" Σ(f), that is the logarithm of the number of states of free entropy f (divided by N). At this point, we can perform the Laplace integral and write
Ψ(m) ≡ (1/N) log Z(m) → m f^* + Σ(f^*) ,    (15.47)

m = −∂_f Σ(f) |_{f^*} .    (15.48)
This construction was at the basis of the 1RSB approach in the cavity method, and it allowed us to compute the complexity function. Additionally, the Legendre transform ensures that

∂_m Ψ(m) = f^* .    (15.49)

As discussed in the previous chapter, we have to pay attention to the fact that Σ is actually an averaged version of the true logarithm of the number of states, and can be negative, so that we need to evaluate the integral on values of m such that Σ ≥ 0.
This is reminiscent of the solution of the REM, except that for the REM a "configuration" is the same as a "Gibbs state" (just because the REM is a very simple model). Still, the same thing happens, with condensation on the lowest energy configurations, so that m can be different from one when Σ is negative.
With replicas: the "real replica" method — Can we compute the same potential Ψ(m) with the replica method? The answer is yes, and very instructively so.
First, we notice that if we force m copies (or real replicas) to be in the same "state", then the replicated sum

Z^m = ( ∑_α Z_α )^m = ∑_{i_1,i_2,…,i_m} Z_{i_1} Z_{i_2} ⋯ Z_{i_m}    (15.50)

becomes

Z^m_constrained = ∑_α Z_α^m ,    (15.51)

which is what we need in order to compute the potential.
Now, we need to compute the average of the log of this quantity. Using the replica method, we write

Ψ(m) = (1/N) E log Z^m_constrained = lim_{n′→0} ( E[(Z^m_constrained)^{n′}] − 1 ) / (N n′) .    (15.52)

This would be the way to compute the potential that gives the Legendre transform of the complexity. However, we also see that

(1/m) (1/N) E log Z^m_constrained = lim_{n′→0} ( E[Z^{m n′}_constrained] − 1 ) / (N m n′)    (15.53)

= lim_{n→0} ( E[Z^{n}_constrained] − 1 ) / (N n) ,    (15.54)

where in the last step we set n = m n′.
We thus see that when we perform the replica computation constraining m replicas to be in the same Gibbs state, and thus perform the 1RSB ansatz, what we are doing is nothing but computing the Legendre transform of the complexity (up to a trivial 1/m factor)! The 1RSB ansatz is nothing but a fancy way (in replica space) to construct the function Ψ(m).
This is, we believe, a much clearer motivation for the 1RSB ansatz! Indeed, with this motivation in mind, it is clearer why m should be 1 unless the complexity is negative, so that indeed m ∈ [0, 1]. Additionally, it also explains why we must extremize the replica 1RSB formula with respect to m. Indeed,

∂_m ( Ψ(m)/m ) = ( Ψ′(m) m − Ψ(m) ) / m² ,    (15.56)

which is zero when Ψ′(m) m − Ψ(m) = −Σ(f^*) = 0, i.e. when the complexity vanishes. So indeed, we see that extremizing over m in the replica formalism is the same as looking for zero complexity. This is exactly what we had to do for the REM at low temperature!
Bibliography
Replica symmetry breaking was invented by Parisi in a series of deep, thought-provoking papers (Parisi, 1979, 1980) that ultimately led him to receive the Nobel prize in Physics in 2021. In particular, the distribution of overlaps was introduced in Parisi (1983) and its fundamental role in the replica theory described in Mézard et al. (1984). The Random Energy Model was introduced by Derrida (1980, 1981) as a toy model to understand replicas and spin glasses, and it played an important role in the clarification of the replica method, especially as it led to the study of the celebrated p-spin model by Gross and Mézard (1984). The construction in terms of counting Gibbs states is discussed in Monasson (1995); Mézard and Parisi (1999). It was instrumental in the creation of the modern version of the cavity method (Mézard and Parisi, 2001; Mézard et al., 2002; Mézard and Parisi, 2003) discussed in the previous chapters.
15.4 Exercises
The p-spin model is one of the cornerstones of spin glass theory. It is defined as follows: there are 2^N possible configurations of the N spin variables S_i = ±1, and the Hamiltonian is given by all possible p-body (p-uplet) interactions:

H = − ∑_{i_1,i_2,…,i_p} J_{i_1,…,i_p} S_{i_1} ⋯ S_{i_p}    (15.60)

with

P(J) = √( N^{p−1}/(π p!) ) e^{−N^{p−1} J²/p!} .    (15.61)
1. Computing the moments of the partition function using Gaussian integrals, show that

E[Z^n] = ∑_{{S_i^1},{S_i^2},…,{S_i^n}} exp[ (β²/(4N^{p−1})) ∑_{a,b} ( ∑_i S_i^a S_i^b )^p ]    (15.62)

2. Introducing delta functions to fix the overlaps and taking their Fourier transforms, show that

E[Z^n] ≈ ∫ ∏_{a<b} dq_{a,b} dq̂_{a,b} e^{−N G(Q,Q̂)}    (15.63)

with

G(Q, Q̂) = −n β²/4 − (β²/2) ∑_{a<b} q_{a,b}^p + i ∑_{a<b} q̂_{a,b} q_{a,b} − log ∑_{{S_i^a},…,{S_i^n}} e^{i ∑_{a<b} q̂_{a,b} (1/N) ∑_i S_i^a S_i^b}    (15.64)
(15.64)
3. Using the replica method, with the replica symmetric solution, show that the free
entropy is given by
β2
fRS = + log 2 (15.65)
4
as in the REM. It is possible, of course, to break the symmetry and to obtain the 1RSB solution, which turns out to be correct at low temperature (at least for p large enough, and for a range of temperatures).
4. It is interesting that the free entropy is the same as that of the REM. Show that, when p → ∞, the energies of the p-spin model become uncorrelated and that the p-spin model becomes the REM in this limit.
We consider again the REM. Using the replica method, show that for β ≥ β_c the second moment of the participation ratio is given by

E[Y²] = (3 − 5m + 2m²)/3 .    (15.66)

Deduce that Y is not self-averaging. Actually, these results do not depend on the details of the REM, but only on the 1RSB structure, and are universal to all 1RSB models.
The inference problem addressed in this section is the following. We observe an M-dimensional vector y_µ, µ = 1, …, M, that was created component-wise depending on a linear projection of an unknown N-dimensional vector x_i, i = 1, …, N, by a known matrix F_{µi}:

y_µ = f_out( ∑_{i=1}^N F_{µi} x_i ) ,    (16.1)
where f_out is an output function that includes the noise. Examples are the additive white Gaussian noise (AWGN), where the output is given by f_out(x) = x + ξ with ξ a random Gaussian variable, or the linear threshold output, where f_out(x) = sign(x − κ), with κ a threshold value. The goal in linear estimation is to infer x from the knowledge of y, F and f_out. An alternative representation of the output function f_out is to denote z_µ = ∑_{i=1}^N F_{µi} x_i and talk about the likelihood of observing a given vector y given z

P(y|z) = ∏_{µ=1}^M P_out(y_µ | z_µ) .    (16.2)
16.2 relaxed-BP
Figure 16.2.1: Factor graph of the linear estimation problem corresponding to the poste-
rior probability for generalized linear estimation . Circles represent the unknown variables,
whereas squares represent the interactions between variables.
We shall describe the main elements necessary for the reader to understand where the resulting
algorithm comes from. We are given the posterior distribution
P(x | y, F) = (1/Z) ∏_{µ=1}^M P_out(y_µ | z_µ) ∏_{i=1}^N P_X(x_i) ,   where   z_µ = ∑_{i=1}^N F_{µi} x_i ,    (16.3)
where the matrix Fµi has independent random entries (not necessarily identically distributed)
with mean and variance O(1/N ). This posterior probability distribution corresponds to a
graph of interactions yµ between variables (spins) xi called the graphical model as depicted in
Fig. 16.2.1.
A starting point for the derivation of AMP is to write the belief propagation algorithm corresponding to this graphical model. The matrix F_{µi} plays the role of randomly-quenched disorder, and the measurements y_µ are the planted disorder. As long as the elements of F_{µi} are independent and their mean and variance are of order O(1/N), the corresponding system is a mean-field spin glass. In the Bayes-optimal case (i.e. when the prior matches the true empirical distribution of the signal) the fixed point of belief propagation with lowest free energy then provides the asymptotically exact marginals of the above posterior probability distribution.
For a model such as (16.3), BP implements a message-passing scheme between the nodes of the graphical model of Fig. 16.2.1, ultimately allowing one to compute approximations of the posterior marginals. Messages m_{i→µ} are sent from the variable-nodes (circles) to the factor-nodes (squares), and subsequent messages m_{µ→i} are sent from the factor-nodes back to the variable-nodes; they correspond to the algorithm’s current "beliefs" about the probability distribution of each variable x_i:

m_{i→µ}(x_i) ∝ P_X(x_i) ∏_{ν≠µ} m_{ν→i}(x_i) ,    (16.4)

m_{µ→i}(x_i) ∝ ∫ ∏_{j≠i} dx_j m_{j→µ}(x_j) P_out( y_µ | ∑_j F_{µj} x_j ) .    (16.5)
While easily written, this BP is not computationally tractable, because every interaction involves N variables and the resulting belief propagation equations involve (N − 1)-dimensional integrals. Furthermore, here we are considering continuous variables, so the messages would be densities and, as we have seen in Chapter 8, it would be quite hard to deal with them numerically.
Two facts enable the derivation of a tractable BP algorithm: the central limit theorem on the one hand, and a projection of the messages onto only two of their moments on the other (as also used in algorithms such as Gaussian BP or non-parametric BP).
This results in the so-called relaxed belief propagation (r-BP): a form of the equations that is tractable and involves a pair of means and variances for every variable–interaction pair,

a_{i→µ} ≡ ∫ dx_i x_i m_{i→µ}(x_i) ,    v_{i→µ} ≡ ∫ dx_i x_i² m_{i→µ}(x_i) − a²_{i→µ} .    (16.6)
We thus can replace the multi-dimensional integral in eq. (16.5) by a scalar Gaussian one over
the random variable z:
m_{µ→i}(x_i) ∝ ∫ dz_µ P_out(y_µ | z_µ) e^{−(z_µ − ω_{µ→i} − F_{µi} x_i)²/(2V_{µ→i})} .    (16.7)
One can simplify further: the next step is to rewrite (z − ω_{µ→i} − F_{µi}x_i)² = (z − ω_{µ→i})² + F²_{µi}x_i² − 2(z − ω_{µ→i})F_{µi}x_i and to use the fact that F_{µi} is O(1/√N) to expand the exponential,

m_{µ→i}(x_i) ∝ ∫ dz_µ P_out(y_µ | z_µ) e^{−(z − ω_{µ→i})²/(2V_{µ→i})}    (16.8)
× [ 1 + (z − ω_{µ→i}) F_{µi} x_i / V_{µ→i} − F²_{µi} x_i² / (2V_{µ→i}) + (z − ω_{µ→i})² F²_{µi} x_i² / (2V²_{µ→i}) + o(1/N) ] .
2
At this point, it is convenient to introduce the output function gout , defined via the output
probability Pout as
(z−ω)2
dzPout (y|z) (z − ω) e− 2V
R
gout (ω, y, V ) ≡ (z−ω)2
. (16.9)
V dzPout (y|z)e− 2V
R
The following useful identity holds for the average of (z − ω)²/V² in the above measure:

[ ∫ dz P_out(y|z) (z − ω)² e^{−(z−ω)²/(2V)} ] / [ V² ∫ dz P_out(y|z) e^{−(z−ω)²/(2V)} ] = 1/V + ∂_ω g_out(ω, y, V) + g²_out(ω, y, V) .    (16.10)
Using definition (16.9), and re-exponentiating the xi -dependent terms while keeping all
the leading order terms, we obtain (after normalization) that the iterative form of equation
eq. (16.5) reads:
m_{µ→i}(t, x_i) = √( A^t_{µ→i}/(2π) ) e^{ −(A^t_{µ→i}/2) x_i² + B^t_{µ→i} x_i − (B^t_{µ→i})²/(2A^t_{µ→i}) }    (16.11)

with

B^t_{µ→i} = g_out(ω^t_{µ→i}, y_µ, V^t_{µ→i}) F_{µi}   and   A^t_{µ→i} = −∂_ω g_out(ω^t_{µ→i}, y_µ, V^t_{µ→i}) F²_{µi} .    (16.12)
Given that mµ→i (t, xi ) can be written with a quadratic form in the exponential, we just write
it as a Gaussian distribution. We can now finally close the equations by writing (16.4) as a
product of these Gaussians with the prior
m_{i→µ}(x_i) ∝ P_X(x_i) e^{−(x_i − R_{i→µ})²/(2Σ_{i→µ})} ,    (16.13)
so that we can finally give the iterative form of the r-BP algorithm Algorithm 1 (below) where
we denote by ∂ω (resp. ∂R ) the partial derivative with respect to variable ω (resp. R) and we
define the “input" functions as
f_a(Σ, R) = [ ∫ dx x P_X(x) e^{−(x−R)²/(2Σ)} ] / [ ∫ dx P_X(x) e^{−(x−R)²/(2Σ)} ] ,    f_v(Σ, R) = Σ ∂_R f_a(Σ, R) .    (16.14)
The r-BP algorithm is written for a generic prior P_X on the signal (as long as it factorizes over the elements) and a generic element-wise output channel P_out. The algorithm depends on their specific form only through the functions f_a and g_out defined by (16.14) and (16.9). It is useful to give a couple of explicit examples.
The sparse prior most commonly considered in probabilistic compressed sensing is the Gauss-Bernoulli prior, that is when in (??) we have ϕ(x) = N(x̄, σ), a Gaussian with mean x̄ and variance σ.
The most commonly considered output channel is simply additive white Gaussian noise
(AWGN) (??).
g^{AWGN}_out(ω, y, V) = (y − ω)/(∆ + V) .    (16.26)
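As an illustration (ours, not part of the notes), both g_out for the AWGN channel and the input functions for a Gauss-Bernoulli prior admit simple closed forms; in the sketch below rho, xbar and sigma are illustrative parameter names for the sparsity, the mean and the variance of the Gaussian part of the prior.

import numpy as np

def gout_awgn(omega, y, V, Delta):
    # AWGN output function, eq. (16.26)
    return (y - omega) / (Delta + V)

def fa_fv_gauss_bernoulli(Sigma, R, rho=0.1, xbar=0.0, sigma=1.0):
    # input functions (16.14) for P_X(x) = (1-rho) delta(x) + rho N(xbar, sigma);
    # rho, xbar, sigma are illustrative names, not notation fixed by the notes
    w_zero = (1 - rho) * np.exp(-R ** 2 / (2 * Sigma)) / np.sqrt(2 * np.pi * Sigma)
    w_gauss = (rho * np.exp(-(R - xbar) ** 2 / (2 * (sigma + Sigma)))
               / np.sqrt(2 * np.pi * (sigma + Sigma)))
    m = (R * sigma + xbar * Sigma) / (sigma + Sigma)   # Gaussian-slab posterior mean
    v = sigma * Sigma / (sigma + Sigma)                # Gaussian-slab posterior variance
    p = w_gauss / (w_zero + w_gauss)                   # posterior weight of the slab
    fa = p * m                                         # posterior mean
    fv = p * (v + m ** 2) - fa ** 2                    # posterior variance = Sigma d fa/dR
    return fa, fv

print(gout_awgn(omega=0.2, y=1.0, V=0.5, Delta=0.1))
print(fa_fv_gauss_bernoulli(Sigma=0.5, R=1.0))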
As we anticipated above, the example of linear estimation that was most broadly studied in
statistical physics is the case of the perceptron problem discussed in detail e.g. in Watkin et al.
(1993). In the perceptron problem each of the M N -dimensional patterns Fµ is multiplied by
a vector of synaptic weights xi in order to produce an output yµ according to
N
X
yµ = 1 if Fµi xi > κ , (16.27)
i=1
yµ = −1 otherwise , (16.28)
where κ is a threshold value independent of the pattern. The perceptron is designed to classify
patterns, i.e. one starts with a training set of patterns and their corresponding outputs yµ
and aims to learn the weights xi in such a way that the above relation between patterns and
outputs is satisfied. To relate this to the linear estimation problem above, let us consider the
perceptron problem in the teacher-student scenario where the teacher perceptron generated
the output yµ using some ground-truth set of synaptic weights x∗i . The student perceptron
knows only the patterns and the outputs and aims to learn the weights. How many patterns
are needed for the student to be able to learn the synaptic weights reliably? What are efficient
learning algorithms?
In the simplest case where the threshold is zero, κ = 0 one can redefine the patterns Fµi ←
Fµi yµ in which case the corresponding redefined output is yµ = 1. The output function in that
case reads
g^{perceptron}_out(ω, V) = (1/√(2πV)) e^{−ω²/(2V)} / H(−ω/√V) ,    (16.29)

where

H(x) = ∫_x^∞ (dt/√(2π)) e^{−t²/2} .    (16.30)
In physics a case of a perceptron that was studied in detail is that of binary synaptic weights
xi ∈ {±1}. To take that into account in the G-AMP we consider the binary prior PX (x) =
[δ(x − 1) + δ(x + 1)]/2 which leads to the input function
f^{binary}_a(Σ, T) = tanh( T/Σ ) .    (16.31)
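These two channel/prior functions are straightforward to code; the sketch below (ours) uses scipy only for the complementary error function, which is an extra dependency not assumed anywhere in the notes.

import numpy as np
from scipy.special import erfc

def H(x):
    # Gaussian tail H(x) = int_x^infty dt exp(-t^2/2)/sqrt(2 pi), eq. (16.30)
    return 0.5 * erfc(x / np.sqrt(2.0))

def gout_perceptron(omega, V):
    # output function (16.29) for the zero-threshold perceptron (y = 1)
    return np.exp(-omega ** 2 / (2.0 * V)) / (np.sqrt(2.0 * np.pi * V) * H(-omega / np.sqrt(V)))

def fa_binary(Sigma, T):
    # input function (16.31) for the binary prior x in {-1, +1}
    return np.tanh(T / Sigma)

print(gout_perceptron(omega=0.3, V=1.0), fa_binary(Sigma=0.5, T=0.2))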
Let us try to write the cavity equation here, using our simplified formulation
First, let us now see how Ri behaves (note the difference between the letters ω and w):
Ri X X
= Bµ→i = Fµi gout (ωµ→i , yµ , Vµ→i ) (16.32)
Σi µ µ
X X
= Fµi gout (ωµ→i , fout ( Fµj sj + Fµi si , ξµ ), V ) (16.33)
µ j̸=i
X X
= Fµi gout (ωµ→i , fout ( Fµj sj , ξµ ), V ) + si αm̂ , (16.34)
µ j̸=i
where we define
We further write
Ri
(16.36)
p
∼ N (0, 1) αq̂ + si αm̂
Σi
with N (0, 1) being a random Gaussian variables of zero mean and unit variance, and where
(16.37)
2
q̂ = Eω,z,ξ gout (ω, fout (z, ξ) V )
Finally, we observe that Σ should not fluctuate between sites, and we thus expect they are
close to the value
Σ−1 (t) = αχ̂ (16.38)
where we have defined
Given these three variables, one can express the order parameters as simple integrals. Indeed,
if we define:
We can now close the equation by writing the hat variables as a function of the order parameters.
First, we realize that from the definition, we have z and ω jointly Gaussian with covariance
q0 mt
z
∼ N 0, (16.46)
ω mt q t
So now, we can simply collect the definition, which become a 3-d integral:
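In the same spirit, the order parameters close on simple Gaussian integrals. As a sketch (assuming the signal entries $s_i$ are drawn i.i.d. from $P_X$, and using (16.36) together with $\Sigma^{-1} = \alpha\hat\chi$ and $a_i^{t+1} = f_a(\Sigma, R_i)$):
$$m^{t+1} = \mathbb{E}_{s\sim P_X,\, Z\sim\mathcal N(0,1)}\Big[ s\, f_a\big(\Sigma,\; \Sigma(\sqrt{\alpha\hat q}\, Z + \alpha\hat m\, s)\big) \Big], \qquad q^{t+1} = \mathbb{E}_{s\sim P_X,\, Z\sim\mathcal N(0,1)}\Big[ f_a^2\big(\Sigma,\; \Sigma(\sqrt{\alpha\hat q}\, Z + \alpha\hat m\, s)\big) \Big],$$
while $q_0 = \mathbb{E}_{P_X}[s^2]$ stays fixed.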
While one can implement and run the r-BP, it can be simplified further without changing the
leading order behavior of the marginals by realizing that the dependence of the messages
on the target node is weak. This is exactly what we have done to go from the standard belief
propagation to the TAP equations in sec. 9.
After the corresponding Taylor expansion, the corrections add up into the so-called Onsager reaction terms Thouless et al. (1977). The final G-AMP iterative equations are written in terms of the means a_i and the variances v_i of each of the variables x_i. The whole derivation is done in such a way that the leading order of the marginals a_i is conserved: given that the BP was asymptotically exact for the computation of the marginals, so is G-AMP for the computation of the means and variances. Let us define
$$\omega^{t+1}_\mu = \sum_i F_{\mu i}\, a^t_{i\to\mu}\,, \qquad (16.50)$$
$$V^{t+1}_\mu = \sum_i F^2_{\mu i}\, v^t_{i\to\mu}\,, \qquad (16.51)$$
$$\Sigma^{t+1}_i = \frac{1}{\sum_\mu A^{t+1}_{\mu\to i}}\,, \qquad (16.52)$$
$$R^{t+1}_i = \frac{\sum_\mu B^{t+1}_{\mu\to i}}{\sum_\mu A^{t}_{\mu\to i}}\,. \qquad (16.53)$$
From now on we will use the notation ≈ for equalities that hold only to leading order.
$$a^t_{i\to\mu} = f_a\big(R^t_{i\to\mu}, \Sigma^t_{i\to\mu}\big) \approx f_a\big(R^t_{i\to\mu}, \Sigma_i\big) \qquad (16.60)$$
$$\approx f_a\big(R^t_i, \Sigma_i\big) - B^t_{\mu\to i}\, f_v\big(R^t_i, \Sigma_i\big) \qquad (16.61)$$
$$\approx a^t_i - g_{\rm out}\big(\omega^t_\mu, y_\mu, V^t_\mu\big)\, F_{\mu i}\, v^t_i \qquad (16.62)$$
So that
$$\omega^{t+1}_\mu = \sum_i F_{\mu i}\, a^t_i - g_{\rm out}\big(\omega^t_\mu, y_\mu, V^t_\mu\big) \sum_i F^2_{\mu i}\, v^t_i = \sum_i F_{\mu i}\, a^t_i - V^t_\mu\, g_{\rm out}\big(\omega^t_\mu, y_\mu, V^t_\mu\big) \qquad (16.63)$$
An important aspect of these equations is the index t − 1 in the Onsager reaction term in eq. (16.65), which is crucial for convergence and appears for the same reason as in the TAP equations in sec. 9. Note that the whole algorithm is comfortably written in terms of matrix multiplications only; this is very useful for implementations, where fast linear algebra can be used.
16.5.1 Bayes-Optimal
In fact, one can prove rigorously the following theorem: in the asymptotic limit where the number of variables grows, the entropy density of the variable Y (the "averaged free energy" in the statistical physics literature) reads:
$$\lim_{n\to\infty} \frac{H(Y|\Phi)}{n} = -\sup_{r\geq 0}\ \inf_{q\in[0,\rho]}\ \left[\psi_{P_0}(r) + \alpha\,\Psi_{P_{\rm out}}(q;\rho) - \frac{rq}{2}\right]. \qquad (16.70)$$
where q_0 = ρ = E[x²] and, denoting Gaussian integrals as D,
$$\psi_{P_0}(r) := \int DZ\, dP_0(X_0)\, \ln \int dP_0(x)\, e^{\,r x X_0 + \sqrt{r}\, x Z - r x^2/2}\,, \qquad (16.71)$$
$$\Psi_{P_{\rm out}}(q;\rho) := \int DV\, DW\, d\widetilde Y_0\; P_{\rm out}\big(\widetilde Y_0 \,\big|\, \sqrt{q}\, V + \sqrt{\rho - q}\, W\big)\, \ln \int Dw\; P_{\rm out}\big(\widetilde Y_0 \,\big|\, \sqrt{q}\, V + \sqrt{\rho - q}\, w\big)\,. \qquad (16.72)$$
It is a simple check to see that q follows the state evolution at the Nishimori point.
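Explicitly, the extremality conditions of (16.70), obtained by differentiating the bracket with respect to r and q, read
$$q = 2\,\partial_r \psi_{P_0}(r)\,, \qquad r = 2\alpha\,\partial_q \Psi_{P_{\rm out}}(q;\rho)\,,$$
and the check amounts to identifying these two relations with the fixed point of the state evolution (a small worked step; the identification requires computing the derivatives of (16.71) and (16.72)).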
16.5.2 MAP Estimation and LASSO
Another set of problems where the results are rigorous is when the optimization corresponds to a convex problem. Consider again the AWGN output channel
$$g^{\rm AWGN}_{\rm out}(\omega, y, V) = \frac{y - \omega}{\Delta + V}\,. \qquad (16.75)$$
Let us change variables: we now denote V = ∆Ṽ and Σ = ∆Σ̃. When ∆ is large, f_a and f_v can be evaluated as Laplace integrals, and finally all equations close on the original algorithm where ∆ is replaced by one and where the update functions are given by
$$f^{\rm MAP}_a(\Sigma, R) = \frac{\int dx\; x\, P_X(x)\, e^{-\frac{(x-R)^2}{2\Delta\tilde\Sigma}}}{\int dx\; P_X(x)\, e^{-\frac{(x-R)^2}{2\Delta\tilde\Sigma}}} = \underset{x}{\operatorname{argmin}}\left[\frac{(x-R)^2}{2\tilde\Sigma} + \lambda f_{\rm reg}(x)\right]. \qquad (16.82)$$
For instance, for an ℓ1 regularization we find $f^{\rm MAP}_a(\Sigma, R) = \operatorname{argmin}_x\big[(x-R)^2/(2\tilde\Sigma) + \lambda|x|\big]$, so that $f^{\rm MAP}_a$ reduces to the soft-thresholding function $\operatorname{sign}(R)\max(|R| - \lambda\tilde\Sigma,\, 0)$.
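In code, the corresponding soft-thresholding input functions (used as f_a^ST and f_v^ST in Algorithm 4 below; λ denotes the regularization strength, and f_v follows from f_v = Σ ∂_R f_a as in (16.14)):

import numpy as np

def f_a_ST(Sigma, R, lam=1.0):
    # Soft thresholding: argmin_x (x - R)^2 / (2 Sigma) + lam |x|.
    return np.sign(R) * np.maximum(np.abs(R) - lam * Sigma, 0.0)

def f_v_ST(Sigma, R, lam=1.0):
    # f_v = Sigma * d f_a / dR: equal to Sigma where the threshold is exceeded, 0 otherwise.
    return Sigma * (np.abs(R) > lam * Sigma).astype(float)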
[Figure 16.6.1: Phase diagram of noiseless compressed sensing for the Gauss-Bernoulli prior. Horizontal axis: fraction of nonzeros ρ; vertical axis: rows-to-columns ratio α. The diagram is divided into an Easy, a Hard and an Impossible phase.]
Bibliography
See Donoho et al. (2009), Krzakala et al. (2012) and Zdeborová and Krzakala (2016).
16.7 Exercises
Input: y
Initialize: a_{i→µ}(t = 0), v_{i→µ}(t = 0), t = 1
repeat
    r-BP update of {ω_{µ→i}, V_{µ→i}}:
    $$V_{\mu\to i}(t) \leftarrow \sum_{j\neq i} F^2_{\mu j}\, v_{j\to\mu}(t-1) \qquad (16.15)$$
    $$\omega_{\mu\to i}(t) \leftarrow \sum_{j\neq i} F_{\mu j}\, a_{j\to\mu}(t-1) \qquad (16.16)$$
    Compute A_{ν→i}(t) and B_{ν→i}(t) from ω_{ν→i}(t), V_{ν→i}(t) via (16.12), then
    $$\Sigma_{i\to\mu}(t) \leftarrow \frac{1}{\sum_{\nu\neq\mu} A_{\nu\to i}(t)} \qquad (16.19)$$
    $$R_{i\to\mu}(t) \leftarrow \Sigma_{i\to\mu}(t) \sum_{\nu\neq\mu} B_{\nu\to i}(t) \qquad (16.20)$$
    $$a_{i\to\mu}(t) \leftarrow f_a\big(\Sigma_{i\to\mu}(t), R_{i\to\mu}(t)\big)\,, \qquad v_{i\to\mu}(t) \leftarrow f_v\big(\Sigma_{i\to\mu}(t), R_{i\to\mu}(t)\big)$$
    t ← t + 1
until convergence of a_{i→µ}(t), v_{i→µ}(t)
Output: estimated marginal means and variances:
$$a_i \leftarrow f_a\left(\frac{1}{\sum_\nu A_{\nu\to i}},\; \frac{\sum_\nu B_{\nu\to i}}{\sum_\nu A_{\nu\to i}}\right), \qquad (16.23)$$
$$v_i \leftarrow f_v\left(\frac{1}{\sum_\nu A_{\nu\to i}},\; \frac{\sum_\nu B_{\nu\to i}}{\sum_\nu A_{\nu\to i}}\right). \qquad (16.24)$$
Algorithm 1: relaxed Belief Propagation (r-BP)
Input: y
Initialize: a^0, v^0, g^0_{out,µ}, t = 1
repeat
    AMP update of ω_µ, V_µ:
    $$V^t_\mu \leftarrow \sum_i F^2_{\mu i}\, v^{t-1}_i \qquad (16.64)$$
    $$\omega^t_\mu \leftarrow \sum_i F_{\mu i}\, a^{t-1}_i - V^t_\mu\, g^{t-1}_{\rm out} \qquad (16.65)$$
    Compute Σ_i and R_i^{t+1} from (16.52)-(16.53), then
    $$a^{t+1}_i \leftarrow f_a\big(\Sigma_i, R^{t+1}_i\big) \qquad (16.68)$$
    $$v^{t+1}_i \leftarrow f_v\big(\Sigma_i, R^{t+1}_i\big) \qquad (16.69)$$
    t ← t + 1
until convergence of a, v
Output: a, v.
Algorithm 2: Generalized Approximate Message Passing (G-AMP)
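As an illustration, here is a compact Python sketch of the G-AMP loop of Algorithm 2. The Σ and R updates are written in their standard leading-order form (cf. eqs. (16.52)-(16.53) and Rangan (2011)); the channel and prior enter only through the arguments g_out, dg_out = ∂_ω g_out, f_a and f_v, and damping and stopping criteria are omitted:

import numpy as np

def gamp(F, y, g_out, dg_out, f_a, f_v, n_iter=100):
    # Generalized Approximate Message Passing, following the structure of Algorithm 2.
    # g_out(omega, y, V) and dg_out(omega, y, V) = d g_out / d omega are the output functions;
    # f_a(Sigma, R) and f_v(Sigma, R) are the input functions of eq. (16.14).
    M, N = F.shape
    F2 = F ** 2
    a = np.zeros(N)      # means a_i
    v = np.ones(N)       # variances v_i
    g = np.zeros(M)      # g_out of the previous iteration (enters the Onsager term)
    for _ in range(n_iter):
        V = F2 @ v                      # eq. (16.64)
        omega = F @ a - V * g           # eq. (16.65), note the t-1 index on g_out
        g = g_out(omega, y, V)
        A = -dg_out(omega, y, V)
        Sigma = 1.0 / (F2.T @ A)        # leading-order form of eq. (16.52)
        R = a + Sigma * (F.T @ g)       # leading-order form of eq. (16.53)
        a = f_a(Sigma, R)               # eq. (16.68)
        v = f_v(Sigma, R)               # eq. (16.69)
    return a, v

# Example channel (AWGN with noise variance Delta, eq. (16.26)):
#   g_out  = lambda w, y, V: (y - w) / (Delta + V)
#   dg_out = lambda w, y, V: -1.0 / (Delta + V)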
Input: y
Initialize: a^0, v^0, g^0_{out,µ}, t = 1
repeat
    AMP update of ω_µ, V_µ:
    $$V^t \leftarrow \frac{1}{N}\sum_i v^{t-1}_i \qquad (16.76)$$
    $$\omega^t_\mu \leftarrow \sum_i F_{\mu i}\, a^{t-1}_i - V^t\, g^{t-1}_{\rm out} \qquad (16.77)$$
    $$\Sigma^t \leftarrow \frac{\Delta + V^t}{\alpha} \qquad (16.78)$$
    $$R^t_i \leftarrow a^{t-1}_i + \frac{1}{\alpha}\sum_\mu F_{\mu i}\,(y_\mu - \omega_\mu) \qquad (16.79)$$
    $$a^{t+1}_i \leftarrow f_a\big(\Sigma^t, R^{t+1}_i\big) \qquad (16.80)$$
    $$v^{t+1}_i \leftarrow f_v\big(\Sigma^t, R^{t+1}_i\big) \qquad (16.81)$$
    t ← t + 1
until convergence of a, v
Output: a, v.
Algorithm 3: Approximate Message Passing (AMP) for Linear Estimation
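As a quick self-contained illustration of Algorithm 3, here is a toy run on synthetic data (a sketch: the sensing matrix has i.i.d. N(0, 1/N) entries, the prior is a simple unit-variance Gaussian so that f_a(Σ, R) = R/(1+Σ) and f_v(Σ, R) = Σ/(1+Σ), and all sizes and parameter values are illustrative):

import numpy as np

rng = np.random.default_rng(0)
N, alpha, Delta = 2000, 1.5, 0.01
M = int(alpha * N)
F = rng.normal(0.0, 1.0 / np.sqrt(N), size=(M, N))    # sensing matrix
x_star = rng.normal(size=N)                           # signal drawn from the N(0,1) prior
y = F @ x_star + np.sqrt(Delta) * rng.normal(size=M)  # AWGN measurements

f_a = lambda Sigma, R: R / (1.0 + Sigma)              # posterior mean for a N(0,1) prior
f_v = lambda Sigma, R: Sigma / (1.0 + Sigma)          # posterior variance

a, v, g = np.zeros(N), np.ones(N), np.zeros(M)
for t in range(50):
    V = np.mean(v)                                    # eq. (16.76)
    omega = F @ a - V * g                             # eq. (16.77)
    g = (y - omega) / (Delta + V)                     # AWGN g_out, eq. (16.26)
    Sigma = (Delta + V) / alpha                       # eq. (16.78)
    R = a + (F.T @ (y - omega)) / alpha               # eq. (16.79)
    a, v = f_a(Sigma, R), f_v(Sigma, R)               # eqs. (16.80)-(16.81)

print("mean squared error:", np.mean((a - x_star) ** 2))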
Input: y
Initialize: a^0, v^0, g^0_{out,µ}, t = 1
repeat
    AMP update of ω_µ, V_µ:
    $$V^t \leftarrow \frac{1}{N}\sum_i v^{t-1}_i \qquad (16.90)$$
    $$\omega^t_\mu \leftarrow \sum_i F_{\mu i}\, a^{t-1}_i - V^t\, g^{t-1}_{\rm out} \qquad (16.91)$$
    $$\Sigma^t \leftarrow \frac{1 + V^t}{\alpha} \qquad (16.92)$$
    $$R^t_i \leftarrow a^{t-1}_i + \frac{1}{\alpha}\sum_\mu F_{\mu i}\,(y_\mu - \omega_\mu) \qquad (16.93)$$
    $$a^{t+1}_i \leftarrow f^{\rm ST}_a\big(\Sigma^t, R^{t+1}_i\big) \qquad (16.94)$$
    $$v^{t+1}_i \leftarrow f^{\rm ST}_v\big(\Sigma^t, R^{t+1}_i\big) \qquad (16.95)$$
    t ← t + 1
until convergence of a, v
Output: a, v.
Algorithm 4: Approximate Message Passing (AMP) for LASSO
Chapter 17
Perceptron
17.1 Perceptron
1943, McCulloch-Pitts model: y = Heaviside(ξ · w − κ), where ξ is the signal coming through the dendrites and w are the synaptic weights.
1958, Rosenblatt built the McCulloch-Pitts model mechanically: given a set of patterns {ξ_1, . . . , ξ_n} and labels y_i ∈ {0, 1}, the goal is, given a new pattern ξ_new, to determine its label.
1969, Minsky and Papert, in their book Perceptrons, pointed out that the McCulloch-Pitts model is learning a separating hyperplane, and hence cannot learn cases that are not linearly separable, for example the XOR problem.
But by embedding the points in a higher-dimensional space it is possible to find a separating hyperplane; this is called representation learning.
• Idea 1, linear embedding: the points will still live in a low-dimensional space, BAD.
Now we change the Heaviside function to the sign function, since the bijection from {0, 1} to {−1, 1} does not change the analysis, but in physics it is more common to use ±1 spins.
Capacity
Goal: learn w ∈ {−1, +1}^N such that we can make the prediction ŷ(ξ) = sign(ξ · w) and minimize the cost function
$$H\big(w; \{\xi_\mu, y_\mu\}_{\mu=1}^M\big) = M - \sum_\mu \delta\big(y_\mu,\, \hat y(\xi_\mu)\big).$$
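A direct transcription of this cost function in Python (a small sketch; the random instance below is purely illustrative):

import numpy as np

def perceptron_cost(w, xi, y):
    # H(w; {xi_mu, y_mu}) = M - sum_mu delta(y_mu, y_hat(xi_mu)),
    # i.e. the number of misclassified patterns.
    y_hat = np.sign(xi @ w)
    return int(np.sum(y_hat != y))

rng = np.random.default_rng(1)
N, M = 100, 50
xi = rng.normal(size=(M, N))          # patterns xi_mu as rows
y = rng.choice([-1, 1], size=M)       # labels
w = rng.choice([-1, 1], size=N)       # a candidate binary weight vector
print(perceptron_cost(w, xi, y))      # around M/2 for a random w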
Let α = M/N; it is observed that there exists a threshold α_c such that, in the large-N limit with α fixed:
• When α < α_c, w.h.p. there exists some w so that the cost function $H\big(w; \{\xi_\mu, y_\mu\}_{\mu=1}^M\big) = 0$.
• When α > α_c, w.h.p. for all w the cost function $H\big(w; \{\xi_\mu, y_\mu\}_{\mu=1}^M\big) > 0$.
Replica Method
1989, Krauth and Mézard, Storage capacity of memory networks with binary couplings.
Let us consider the partition function of the Boltzmann distribution at inverse temperature β,
$$Z\big(\beta; \{\xi_\mu, y_\mu\}_{\mu=1}^M\big) = \sum_{w} \exp\Big(-\beta\, H\big(w; \{\xi_\mu, y_\mu\}_{\mu=1}^M\big)\Big).$$
We are interested in the average free entropy over all possible 'training sets' {ξ_µ, y_µ}_{µ=1}^M,
$$\Phi(\beta, \alpha) = \lim_{N\to\infty} \mathbb{E}_{\{\xi_\mu, y_\mu\}_{\mu=1}^M}\, \frac{\log Z\big(\beta; \{\xi_\mu, y_\mu\}_{\mu=1}^M\big)}{N}.$$
It is still an open problem to rigorously prove that Φ_RS(β, α) is correct below α_c ≈ 0.83; we believe it is correct based on the replica formula. When α > α_c, Φ_RS is incorrect and one needs to turn to the 1RSB analysis.
Teacher-Student Scenario
• Teacher: there is a y ∗ = sign(ξ · w∗ ), where w∗ is also binary.
{ξ1 , . . . , ξM } is still random but their corresponding {y1 , . . . , yM } are generated accord-
ing to the above rule.
• Student: Need to learn the rule
– There is no capacity here because there is a true w∗ .
– How many patterns does the student need to see to learn the rules?
Treat ξ_µ as the µ-th row of the matrix F and w^* as the true signal x^* we want to recover; then
$$y = \mathrm{sign}(F x^*).$$
Let’s put this problem in a probabilistic framework. We introduce a Gaussian noise inside the
sgn, so as to have a probit likelihood; moreover we will assume that the weights are binary,
x ∈ {+1, −1}N . The full model reads
$$x_i \sim \frac{1}{2}\,\delta(x_i - 1) + \frac{1}{2}\,\delta(x_i + 1)\,, \qquad y_\mu \sim \frac{1}{2}\,\mathrm{erfc}\left(\sqrt{\frac{1}{2\sigma^2}}\; F_\mu\cdot x\right).$$
Hence, this problem is the same as the generalized linear model, and we can use AMP to solve it.
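For instance, data from this teacher-student model can be generated as follows (a sketch; we take i.i.d. Gaussian pattern entries of variance 1/N and a noise level σ, both illustrative choices):

import numpy as np

rng = np.random.default_rng(2)
N, alpha, sigma = 1000, 2.0, 0.1
M = int(alpha * N)

x_star = rng.choice([-1.0, 1.0], size=N)              # binary teacher weights
F = rng.normal(0.0, 1.0 / np.sqrt(N), size=(M, N))    # patterns xi_mu as rows of F
y = np.sign(F @ x_star + sigma * rng.normal(size=M))  # noisy sign outputs (probit channel)

# (F, y) is what the student observes; x_star is what AMP should recover.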
Let P_X(x) be the prior on the x_i and ρ = E[x²]. Then the replica formula reads
$$\lim_{N\to\infty} \mathbb{E}_{F,y}\, \frac{\log Z(F, y)}{N} = \mathrm{Ext}_{q,\hat q}\; \Phi_{\rm RS}(q, \hat q)$$
where
$$\Phi_{\rm RS}(q, \hat q) = -\frac{\alpha}{2}\, q \hat q + \alpha \int Dx^*\, dy\, Dz\; P_{\rm out}\big(y \mid x^*\sqrt{\rho - q} + z\sqrt{q}\big)\, \log \int Dx\; P_{\rm out}\big(y \mid x\sqrt{\rho - q} + z\sqrt{q}\big)$$
$$\qquad\qquad + \int dx^*\, Dz\; P_X(x^*)\, e^{-\frac{\alpha\hat q}{2}(x^*)^2 + z x^*\sqrt{\alpha\hat q}}\, \log \int dx\; P_X(x)\, e^{-\frac{\alpha\hat q}{2}x^2 + z x\sqrt{\alpha\hat q}}\,.$$
The corresponding state evolution equations read
$$\hat q^{(t)} = -\int dp\, dx\; \frac{\exp\left(-\frac{p^2}{2 q^{(t)}} - \frac{(x-p)^2}{2(\rho - q^{(t)})}\right)}{2\pi\sqrt{q^{(t)}\,\big(\rho - q^{(t)}\big)}} \int dy\; P_{\rm out}(y \mid x)\; \partial_p\, g_{\rm out}\big(p, y, \rho - q^{(t)}\big)\,,$$
$$q^{(t+1)} = \int dx\; P_X(x) \int Dz\; f_a^2\!\left(\frac{1}{\alpha \hat q^{(t)}},\; x + \frac{z}{\sqrt{\alpha \hat q^{(t)}}}\right).$$
• Impossible Phase: When α < 1.245, q = 1 is not the global maximum of Φ_RS, so it is impossible to find the solution.
• Hard Phase: When α ∈ (1.245, 1.493), q = 1 is the global maximum, but there is another local maximum at which AMP gets stuck.
• Easy Phase: When α > 1.493, q = 1 is the only local maximum, and AMP will always converge to it.
[Figure: the replica-symmetric free entropy Φ_RS(q) as a function of q (log scale from 10^{-5} to 10^{-1}), for α = 1.2, 1.3, 1.4 and 1.5.]
Bibliography
Abbe, E. and Montanari, A. (2013). Conditional random fields, planted constraint satisfaction
and entropy concentration. In Approximation, Randomization, and Combinatorial Optimization.
Algorithms and Techniques, pages 332–346. Springer.
Achlioptas, D. and Moore, C. (2003). Almost all graphs with average degree 4 are 3-colorable.
Journal of Computer and System Sciences, 67(2):441–471.
Aizenman, M., Sims, R., and Starr, S. L. (2003). Extended variational principle for the
Sherrington-Kirkpatrick spin-glass model. Physical Review B, 68(21):214403.
Aubin, B., Loureiro, B., Maillard, A., Krzakala, F., and Zdeborová, L. (2020). The spiked matrix
model with generative priors. IEEE Transactions on Information Theory.
Bai, Z. and Silverstein, J. W. (2010). Spectral analysis of large dimensional random matrices,
volume 20. Springer.
Baik, J., Arous, G. B., Péché, S., et al. (2005). Phase transition of the largest eigenvalue for
nonnull complex sample covariance matrices. Annals of Probability, 33(5):1643–1697.
Barbier, J., Dia, M., Macris, N., Krzakala, F., Lesieur, T., and Zdeborová, L. (2016). Mutual
information for symmetric rank-one matrix estimation: A proof of the replica formula. arXiv
preprint arXiv:1606.04142.
Barbier, J. and Macris, N. (2019). The adaptive interpolation method: a simple scheme to prove
replica formulas in Bayesian inference. Probability Theory and Related Fields, 174(3):1133–1185.
Bayati, M. and Montanari, A. (2011). The dynamics of message passing on dense graphs, with
applications to compressed sensing. IEEE Transactions on Information Theory, 57(2):764–785.
Berthier, R., Montanari, A., and Nguyen, P.-M. (2020). State evolution for approximate
message passing with non-separable functions. Information and Inference: A Journal of the
IMA, 9(1):33–79.
Bethe, H. A. (1935). Statistical theory of superlattices. Proceedings of the Royal Society of London.
Series A-Mathematical and Physical Sciences, 150(871):552–575.
Bolthausen, E. (2014). An iterative construction of solutions of the TAP equations for the Sherrington-Kirkpatrick model. Communications in Mathematical Physics, 325(1):333–366.
Boucheron, S., Lugosi, G., and Massart, P. (2013). Concentration inequalities: A nonasymptotic
theory of independence. Oxford university press.
Braunstein, A., Dall’Asta, L., Semerjian, G., and Zdeborová, L. (2016). The large deviations
of the whitening process in random constraint satisfaction problems. Journal of Statistical
Mechanics: Theory and Experiment, 2016(5):053401.
Chertkov, M. (2008). Exactness of belief propagation for some graphical models with loops.
Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10016.
Chertkov, M. and Chernyak, V. Y. (2006). Loop series for discrete statistical models on graphs.
Journal of Statistical Mechanics: Theory and Experiment, 2006(06):P06009.
Coja-Oghlan, A., Krzakala, F., Perkins, W., and Zdeborová, L. (2018). Information-theoretic
thresholds from the cavity method. Advances in Mathematics, 333:694–795.
Coja-Oghlan, A. and Vilenchik, D. (2013). Chasing the k-colorability threshold. In 2013 IEEE
54th Annual Symposium on Foundations of Computer Science, pages 380–389. IEEE.
Cover, T. M. and Thomas, J. A. (1991). Information theory and statistics. Elements of Information
Theory, 1(1):279–335.
Curie, P. (1895). Propriétés magnétiques des corps a diverses températures. Number 4. Gauthier-
Villars et fils.
Debye, P. (1909). Näherungsformeln für die Zylinderfunktionen für große Werte des Arguments und unbeschränkt veränderliche Werte des Index. Math. Ann., 67:535–558.
Dembo, A., Montanari, A., et al. (2010a). Gibbs measures and phase transitions on sparse
random graphs. Brazilian Journal of Probability and Statistics, 24(2):137–211.
Dembo, A., Montanari, A., et al. (2010b). Ising models on locally tree-like graphs. The Annals
of Applied Probability, 20(2):565–592.
Dembo, A., Montanari, A., Sly, A., and Sun, N. (2014). The replica symmetric solution for
Potts models on d-regular graphs. Communications in Mathematical Physics, 327(2):551–575.
Dembo, A., Zeitouni, O., and Fleischmann, K. (1996). Large deviations techniques and
applications. Jahresbericht der Deutschen Mathematiker Vereinigung, 98(3):18–18.
Donoho, D. L., Johnstone, I. M., et al. (1998). Minimax estimation via wavelet shrinkage. The
annals of Statistics, 26(3):879–921.
Donoho, D. L., Maleki, A., and Montanari, A. (2009). Message-passing algorithms for com-
pressed sensing. Proceedings of the National Academy of Sciences, 106(45):18914–18919.
Edwards, S. F., Goldbart, P. M., Goldenfeld, N., Sherrington, D. C., and Edwards, S. F., editors
(2005). Stealing the gold: a celebration of the pioneering physics of Sam Edwards. Number 126 in
International series of monographs on physics. Clarendon Press ; Oxford University Press,
Oxford : New York.
Edwards, S. F. and Jones, R. C. (1976). The eigenvalue spectrum of a large symmetric random
matrix. Journal of Physics A: Mathematical and General, 9(10):1595.
El Alaoui, A. and Krzakala, F. (2018). Estimation in the spiked wigner model: A short proof
of the replica formula. In 2018 IEEE International Symposium on Information Theory (ISIT),
pages 1874–1878. IEEE.
Gross, D. J. and Mézard, M. (1984). The simplest spin glass. Nuclear Physics B, 240(4):431–452.
Guerra, F. (2003). Broken replica symmetry bounds in the mean field spin glass model.
Communications in mathematical physics, 233(1):1–12.
Guo, D., Shamai, S., and Verdú, S. (2005). Mutual information and minimum mean-square
error in Gaussian channels. IEEE Transactions on Information Theory, 51(4):1261–1282.
Iba, Y. (1999). The Nishimori line and Bayesian statistics. Journal of Physics A: Mathematical and
General, 32(21):3875.
Jaeger, G. (1998). The ehrenfest classification of phase transitions: Introduction and evolution.
Arch Hist Exact Sc., 53:51–81.
Kadanoff, L. P. (2009). More is the same; phase transitions and mean field theories. Journal of
Statistical Physics, 137(5):777–797.
Korada, S. B. and Macris, N. (2009). Exact solution of the gauge symmetric p-spin glass model
on a complete graph. Journal of Statistical Physics, 136(2):205–230.
Krzakala, F., Mézard, M., Sausset, F., Sun, Y., and Zdeborová, L. (2012). Probabilistic recon-
struction in compressed sensing: algorithms, phase diagrams, and threshold achieving
matrices. Journal of Statistical Mechanics: Theory and Experiment, 2012(08):P08009.
Krzakala, F., Xu, J., and Zdeborová, L. (2016). Mutual information in rank-one matrix estima-
tion. In 2016 IEEE Information Theory Workshop (ITW), pages 71–75. IEEE.
Laplace, P. S. d. (1774). Mémoire sur la probabilité des causes par les événements.
Lelarge, M. and Miolane, L. (2019). Fundamental limits of symmetric low-rank matrix estima-
tion. Probability Theory and Related Fields, 173(3):859–929.
Lesieur, T., Krzakala, F., and Zdeborová, L. (2017). Constrained low-rank matrix estimation:
Phase transitions, approximate message passing and applications. Journal of Statistical
Mechanics: Theory and Experiment, 2017(7):073403.
Livan, G., Novaes, M., and Vivo, P. (2018). Introduction to random matrices: theory and practice,
volume 26. Springer.
McGrayne, S. B. (2011). The theory that would not die. Yale University Press.
Mézard, M. and Parisi, G. (2001). The Bethe lattice spin glass revisited. The European Physical
Journal B-Condensed Matter and Complex Systems, 20(2):217–233.
Mézard, M. and Parisi, G. (2003). The cavity method at zero temperature. Journal of Statistical
Physics, 111(1):1–34.
Mézard, M., Parisi, G., Sourlas, N., Toulouse, G., and Virasoro, M. (1984). Nature of the
spin-glass phase. Physical review letters, 52(13):1156.
Mézard, M., Parisi, G., and Virasoro, M. (1987a). SK model: The replica solution without replicas. In Spin Glass Theory and Beyond: An Introduction to the Replica Method and Its Applications, pages 232–237. World Scientific.
Mézard, M., Parisi, G., and Virasoro, M. A. (1987b). Spin glass theory and beyond: An Introduction
to the Replica Method and Its Applications, volume 9. World Scientific Publishing Company.
Mézard, M., Parisi, G., and Zecchina, R. (2002). Analytic and algorithmic solution of random
satisfiability problems. Science, 297(5582):812–815.
Monasson, R. (1995). Structural glass transition and the entropy of the metastable states.
Physical review letters, 75(15):2847.
Nattermann, T. (1998). Theory of the random field Ising model. In Spin glasses and random
fields, pages 277–298. World Scientific.
Nishimori, H. (1980). Exact results and critical properties of the Ising model with competing
interactions. Journal of Physics C: Solid State Physics, 13(21):4071.
Opper, M. and Saad, D. (2001). Advanced mean field methods: Theory and practice. MIT press.
Parisi, G. (1979). Infinite number of order parameters for spin-glasses. Physical Review Letters,
43(23):1754.
Parisi, G. (1980). A sequence of approximated solutions to the SK model for spin glasses.
Journal of Physics A: Mathematical and General, 13(4):L115.
Parisi, G. (1983). Order parameter for spin-glasses. Physical Review Letters, 50(24):1946.
Pearl, J. (1982). Reverend Bayes on inference engines: A distributed hierarchical approach. Cognitive
Systems Laboratory, School of Engineering and Applied Science . . . .
Peierls, R. (1936). Statistical theory of adsorption with interaction between the adsorbed
atoms. In Mathematical Proceedings of the Cambridge Philosophical Society, volume 32, pages
471–476. Cambridge University Press.
Potters, M. and Bouchaud, J.-P. (2020). A First Course in Random Matrix Theory: For Physicists,
Engineers and Data Scientists. Cambridge University Press.
Rangan, S. (2011). Generalized approximate message passing for estimation with random
linear mixing. In 2011 IEEE International Symposium on Information Theory Proceedings, pages
2168–2172. IEEE.
Saade, A., Krzakala, F., and Zdeborová, L. (2017). Spectral bounds for the Ising ferromagnet on an arbitrary given graph. Journal of Statistical Mechanics: Theory and Experiment,
2017(5):053403.
Schwarz, A., Bluhm, J., and Schröder, J. (2020). Modeling of freezing processes of ice floes
within the framework of the TPM. Acta Mechanica, 231.
Thouless, D. J., Anderson, P. W., and Palmer, R. G. (1977). Solution of 'Solvable model of a spin glass'. Philosophical Magazine, 35(3):593–601.
Touchette, H. (2009). The large deviation approach to statistical mechanics. Physics Reports,
478(1):1–69.
Wainwright, M. J. and Jordan, M. I. (2008). Graphical models, exponential families, and variational
inference. Now Publishers Inc.
Watkin, T. L. H., Rau, A., and Biehl, M. (1993). The statistical mechanics of learning a rule.
Reviews of Modern Physics, 65(2):499–556.
Wigner, E. P. (1958). On the distribution of the roots of certain symmetric matrices. Annals of
Mathematics, pages 325–327.
Willsky, A., Sudderth, E., and Wainwright, M. J. (2007). Loop series and bethe variational
bounds in attractive graphical models. Advances in neural information processing systems, 20.
Yedidia, J. S., Freeman, W. T., Weiss, Y., et al. (2003). Understanding belief propagation and its
generalizations. Exploring artificial intelligence in the new millennium, 8(236-239):0018–9448.
Zdeborová, L. and Krzakala, F. (2016). Statistical physics of inference: thresholds and algo-
rithms. Advances in Physics, 65(5):453–552.