
Statistical Physics Methods

in Optimization and Machine Learning


An Introduction to Replica, Cavity & Message-Passing techniques

Florent Krzakala and Lenka Zdeborová

March 9, 2024

Introduction

I thought fit to [. . . ] explain in detail in the same book the


peculiarity of a certain method, by which it will be possible [. . . ]
to investigate some of the problems in mathematics by means of
mechanics. This procedure is [. . . ] no less useful even for the
proof of the theorems themselves; for certain things first became
clear to me by a mechanical method, although they had to be
demonstrated by geometry afterwards [. . . ] But it is of course
easier, when we have previously acquired, by the method, some
knowledge of the questions, to supply the proof than it is to find it
without any previous knowledge. I am persuaded that it [. . . ] will
be of no little service to mathematics; for I apprehend that some,
either of my contemporaries or of my successors, will, by means of
the method when once established, be able to discover other
theorems in addition, which have not yet occurred to me.

The method of mechanical theorems


Letter from Archimedes to Eratosthenes – circa 220BC

This set of lectures will discuss probabilistic models, focusing on problems coming from
Statistics, Machine Learning and Constrained Optimization, while using tools and techniques
from Statistical Physics. The focus will be more theoretical than practical, so you have been
warned! Our goal is to show how some methods from Statistical Physics allow us to derive
precise answers to many mathematical questions. As pointed out by Archimedes, once these
answers are given, even if they are obtained through heuristic methods, it is a simpler task
(but still non-trivial) to prove them rigorously. Over the last few decades, there has been an
increasing convergence of interest and methods between Theoretical Physics and Applied
Mathematics, and many theoretical and applied works in Statistical Physics and Computer
Science have relied on a connection with the Statistical Physics of Spin Glasses. The aim of
this course is to present the background necessary for entering this fast-developing field.

At first glance, it may seem surprising that Physics has any connection with minimization and
probabilistic inference problems. The connection lies in the Gibbs (or Boltzmann) distribution,
the fundamental object in Statistical Mechanics. From the point of view of Statistics and
Optimization we will be interested in two types of problems: a) minimizing a cost function
and b) sampling from a distribution. In both cases, the Statistical Physics approach, or more
precisely the Boltzmann measure, turns out to be convenient.

Say that you have a "cost" function E(x) of x ∈ Rd . In statistical mechanics, one associates
a temperature-dependent "Boltzmann" probability measure to each possible value of x as
follows:
$$P_{\rm Boltzmann}(x) = \frac{1}{Z_N(\beta)}\, e^{-\beta E(x)} ,$$
where β = 1/T is called the inverse temperature, and
$$Z_N(\beta) = \int_{\mathbb{R}^d} dx\; e^{-\beta E(x)}$$

is called the partition function, or the partition sum. The introduction of the temperature is
very practical. For instance one can check that the limit β → ∞ allows us to study minimization
problems, since
$$\lim_{\beta\to\infty}\, -\partial_\beta \log Z(\beta) = \min_x E(x) .$$
One can also obtain the number of minima by computing $\lim_{\beta\to\infty} Z(\beta)$.
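To illustrate numerically that −∂β log Z(β) = ⟨E⟩β indeed approaches min_x E(x), here is a minimal sketch (not part of the original text): the toy cost E(x) = (x²−1)² + 0.3x, the integration box and the grid size are arbitrary choices, and the Boltzmann average is evaluated by simple quadrature.

```python
import numpy as np

# Toy cost with two local minima; the global minimum is near x = -1 (arbitrary choice).
E = lambda x: (x**2 - 1.0)**2 + 0.3 * x

x = np.linspace(-3, 3, 20001)                         # quadrature grid on a finite box
for beta in [1, 10, 100, 1000]:
    w = np.exp(-beta * (E(x) - E(x).min()))           # shifted Boltzmann weight, avoids overflow
    mean_E = np.trapz(E(x) * w, x) / np.trapz(w, x)   # <E>_beta = -d log Z / d beta
    print(f"beta = {beta:5d}   <E>_beta = {mean_E:.4f}")

print(f"min_x E(x) approx = {E(x).min():.4f}")        # the value <E>_beta converges to
```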

The formalism is equally interesting for sampling problems. A typical problem that arises
in Statistical Inference, Information Theory, Signal Processing or Machine Learning is the
following: let X be an unknown quantity that we would like to infer, but which we don’t have
access to. Instead, we are given access to a quantity Y , related to X (usually a noisy version of
X). For concreteness, assume that X is simply a scalar Gaussian variable with zero mean and
unit variance, and that Y = X + Z for another standard Gaussian variable Z. Given Y , what
can we say about X? In other words: what is the probability of X given our measurement
Y ? The quantity PX|Y (x, y) is called the posterior distribution of X given Y . Bayes’ formula
famously states that
$$P_{X|Y}(x, y) = \frac{P_{Y|X}(x, y)\, P_X(x)}{P_Y(y)} ,$$
so that the posterior PX|Y (x, y) is given by the product of the prior probability on X, PX (x),
times the likelihood PY |X (x, y), divided by the evidence PY (y), which is just a normalization
constant (in the sense that it is not a function of X, the random variable of interest).

This, of course, can be trivially rewritten as
$$P_{X|Y}(x, y) = \frac{e^{\log\left[P_{Y|X}(x, y)\, P_X(x)\right]}}{P_Y(y)} ,$$
and thus can be represented as a Boltzmann measure provided we identify
$$\beta = 1, \qquad Z = P_Y(y), \qquad E = -\log\left[P_{Y|X}(x, y)\, P_X(x)\right] .$$

Thus, the evidence PY is simply the partition sum. This simple rewriting is behind the
popularity of Statistical Mechanics language in Bayesian inference. Indeed, many words in the
vocabulary of the Machine Learning community are borrowed directly from Physics (such as
"energy-based model", "free energy", "mean-field", "Boltzmann machine"...).

In what follows, we shall be interested in the accuracy of the resulting estimates, for instance
the mean squared error over a given set of models. We will thus need to apply the methods
from Statistical Physics, not only to derive the posterior distribution and the partition sum, but
also to take averages over many models or many realizations, in order to determine the typical
behavior. For example, we would like to access information-theoretical quantities such as the
entropy. Computing the partition sum Z is already difficult, but computing such averages is
notoriously even harder.

Conveniently, there is a part of Statistical Physics that focuses exactly on this task: the field of
disordered systems and spin glasses. Spin glasses are magnets in which the interaction strength
between each pair of particles is random. Starting in the late 70s with the groundbreaking
work of Sir Sam Edwards and Nobel prize winner Philip W. Anderson, the Statistical Physics of
disordered systems and spin glasses has grown into a versatile theory, with powerful heuristic
tools such as the replica and the cavity methods. In itself, the idea of using Statistical Physics
methods to study some problems in computer science is not new. It was the inspiration, for
instance, behind the creation of Simulated Annealing. Anderson, on the one hand, and Parisi
and Mézard, on the other, used this connection back in 1986 to study optimization problems,
and since then many have applied it with great success to study a large variety of problems in
optimization (random satisfiability & coloring), error correcting codes, inference and machine
learning.

In the lectures, we wish to address these questions with an interdisciplinary approach,
leveraging tools from mathematical physics and statistical mechanics, but also from information
theory and optimization. Our modelling strategy and the analysis originate in studies of
phase transitions for models of Condensed Matter Physics. Yet, most of our objectives and
applications belong to the fields of Machine Learning, Computer Science, and Statistical Data
Processing.

Notations

We shall use probabilistic notations: random variables are uppercase, and a particular value is
lowercase. For instance, we will speak of the probability P(X = x) that the random variable
X takes the value x.

=:  definition of a new quantity
≍  asymptotically equal, used for large deviations
P(X = x)  probability that the random variable X takes the value x
Contents

Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii

Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv

I Techniques and methods 1

1 The mean-field ferromagnet of Curie & Weiss 3

1.1 Rigorous solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.1.1 Phase transition in the Curie-Weiss model . . . . . . . . . . . . . . . . . 6

1.2 The free energy/entropy is all you need . . . . . . . . . . . . . . . . . . . . . . 8

1.2.1 Derivatives of the free entropy . . . . . . . . . . . . . . . . . . . . . . . 9

1.2.2 Legendre transforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

1.2.3 Gartner-Ellis Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

1.3 Toolbox: Gibbs free entropy and the variational approach . . . . . . . . . . . . 12

1.4 Toolbox: The cavity method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

1.5 Toolbox: The "field-theoretic" computation . . . . . . . . . . . . . . . . . . . . 17

1.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

Appendices 27

Appendix 1.A The jargon of Statistical Physics . . . . . . . . . . . . . . . . . . . . . 27

Appendix 1.B The ABC of phase transitions . . . . . . . . . . . . . . . . . . . . . . 28

Appendix 1.C A rigorous version of the cavity method . . . . . . . . . . . . . . . . 33

2 A simple example: The Random Field Ising Model 37



2.1 Self-averaging and Concentration . . . . . . . . . . . . . . . . . . . . . . . . . . 37

2.2 Replica Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

2.2.1 Computing the replicated partition sum . . . . . . . . . . . . . . . . . . 39

2.2.2 Replica Symmetry Ansatz . . . . . . . . . . . . . . . . . . . . . . . . . . 40

2.2.3 Computing the Saddle points: mean-field equation . . . . . . . . . . . 42

2.3 A rigorous computation with the interpolation technique . . . . . . . . . . . . 43

2.3.1 A Simple Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

2.3.2 Guerra’s Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

2.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

3 A first application: The spectrum of random matrices 49

3.1 The Stieltjes transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

3.2 The replica method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

3.2.1 Averaging replicas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

3.2.2 Replica symmetric ansatz . . . . . . . . . . . . . . . . . . . . . . . . . . 53

3.2.3 From Stieltjes to the spectrum . . . . . . . . . . . . . . . . . . . . . . . . 54

3.3 The Cavity method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

3.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

4 The Random Field Ising Model on Random Graphs 57

4.1 The root of all cavity arguments . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

4.2 Exact recursion on a tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

4.2.1 Belief propagation on trees for pairwise models . . . . . . . . . . . . . 59

4.3 Cavity on random graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

4.3.1 Random graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

4.3.2 Cavity method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

4.3.3 The relation between Loopy Belief Propagation and the Cavity method 65

4.3.4 Can we prove it? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

4.3.5 When do we expect this to work? . . . . . . . . . . . . . . . . . . . . . . 66



4.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

5 Sparse Graphs & Locally Tree-like Graphical Models 71

5.1 Graphical Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

5.1.1 Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

5.1.2 Factor Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

5.1.3 Some properties and usage of factor graphs . . . . . . . . . . . . . . . . 76

5.2 Belief Propagation and the Bethe free energy . . . . . . . . . . . . . . . . . . . 77

5.2.1 Derivation of Belief Propagation on a tree graphical model . . . . . . . 77

5.2.2 Belief propagation equations summary . . . . . . . . . . . . . . . . . . 83

5.2.3 How do we use Belief Propagation? . . . . . . . . . . . . . . . . . . . . 84

5.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

Appendices 89

Appendix 5.A BP for pair-wise models, node by node . . . . . . . . . . . . . . . . 89

5.A.1 One node . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

5.A.2 Two nodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

5.A.3 Three nodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

6 Belief propagation for graph coloring 95

6.1 Deriving quantities of interest . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

6.2 Specifying generic BP to coloring . . . . . . . . . . . . . . . . . . . . . . . . . . 96

6.3 Bethe free energy for coloring . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

6.4 Paramagnetic fixed point for graph colorings . . . . . . . . . . . . . . . . . . . 98

6.5 Ferromagnetic Fixed Point . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

6.5.1 Ising ferromagnet, q = 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

6.5.2 Potts ferromagnet q ≥ 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

6.6 Back to the anti-ferromagnetic (β > 0) case . . . . . . . . . . . . . . . . . . . . 104

6.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108



II Probabilistic Inference 113

7 Denoising, Estimation and Bayes Optimal Inference 115

7.1 Bayes-Laplace inverse problem . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

7.2 Scalar estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116

7.2.1 Posterior distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116

7.2.2 Point-estimate inference: MAP, MMSE and all that. . . . . . . . . . . . 118

7.2.3 Back to free entropies . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121

7.2.4 Some useful identities: Nishimori and Stein . . . . . . . . . . . . . . . . 122

7.2.5 I-MMSE theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124

7.3 Application: Denoising a sparse vector . . . . . . . . . . . . . . . . . . . . . . . 124

7.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128

Appendices 131

Appendix 7.A A tighter computation of the likelihood ratio . . . . . . . . . . . . . 131

Appendix 7.B A replica computation for vector denoising . . . . . . . . . . . . . . 132

8 Low-Rank Matrix Factorization: the Spike model 135

8.1 Problem Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135

8.2 From the Posterior Distribution to the partition sum . . . . . . . . . . . . . . . 137

8.3 Replica Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137

8.4 A rigorous proof via Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . 141

8.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151

9 Cavity method and Approximate Message Passing 153

9.1 Self-consistent equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153

9.2 Rank-one by the cavity method . . . . . . . . . . . . . . . . . . . . . . . . . . . 154

9.3 AMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156

9.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158

10 Stochastic Block Model & Community Detection 159



10.1 Definition of the model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159

10.2 Bayesian Inference and Parameter Learning . . . . . . . . . . . . . . . . . . . . 160

10.2.1 Community detection with known parameters . . . . . . . . . . . . . . 161

10.2.2 Learning the parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . 162

10.3 Belief propagation for SBM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162

10.4 The phase diagram of community detection . . . . . . . . . . . . . . . . . . . . 166

10.4.1 Second order phase transition . . . . . . . . . . . . . . . . . . . . . . . . 166

10.4.2 Stability of the paramagnetic fixed point . . . . . . . . . . . . . . . . . . 168

10.4.3 First order phase transition . . . . . . . . . . . . . . . . . . . . . . . . . 169

10.4.4 The non-backtracking matrix and spectral method . . . . . . . . . . . . 171

10.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172

11 The spin glass game from sparse to dense graphs 173

11.1 The spin glass game . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173

11.2 Sparse graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174

11.2.1 Belief Propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174

11.2.2 Population dynamics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175

11.2.3 Phase transition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175

11.3 Dense graph limit: TAP/AMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176

11.4 Approximate Message Passing . . . . . . . . . . . . . . . . . . . . . . . . . . . 176

11.4.1 State Evolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178

11.5 From the spin glass game to rank-one factorization . . . . . . . . . . . . . . . . 179

12 Approximate Message Passing and State Evolution 181

12.1 AMP And SE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181

12.1.1 Symmetric AMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181

12.1.2 Conditioning on linear observations . . . . . . . . . . . . . . . . . . . . 182

12.1.3 Convergence of AMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184

12.1.4 A first application: the TAP equation! . . . . . . . . . . . . . . . . . . . 185



12.1.5 Parisi equation reloaded: Approximate Survey propagation . . . . . . 186

12.2 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186

12.2.1 Wigner Spike Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186

12.2.2 A primer on Bayesian denoising . . . . . . . . . . . . . . . . . . . . . . 187

12.2.3 Wishart Spike Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189

12.3 From AMP to proofs of replica prediction: The Bayes Case . . . . . . . . . . . 190

12.4 From AMP to proofs of replica prediction: The MAP case . . . . . . . . . . . . 192

12.5 Rectangular AMP and GAMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194

12.6 Learning problems: the LASSO . . . . . . . . . . . . . . . . . . . . . . . . . . . 195

12.6.1 Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195

12.6.2 Proximal Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . 196

12.6.3 The LASSO and other linear problems . . . . . . . . . . . . . . . . . . . 197

12.6.4 AMP reloaded . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197

12.7 Inference with GAMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198

III Random Constraint Satisfaction Problems 201

13 Graph Coloring II: Insights from planting 203

13.1 SBM and planted coloring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203

13.2 Relation between planted and random graph coloring . . . . . . . . . . . . . . 204

13.2.1 BP convergence and algorithmic threshold . . . . . . . . . . . . . . . . 204

13.2.2 Contiguity of random and planted ensemble and clustering . . . . . . 206

13.2.3 Upper bound on clustering threshold . . . . . . . . . . . . . . . . . . . 208

13.2.4 Comments and bibliography . . . . . . . . . . . . . . . . . . . . . . . . 210

14 Graph coloring III: Colorability threshold and properties of clusters 211

14.1 Analyzing clustering: Generalities . . . . . . . . . . . . . . . . . . . . . . . . . 211

14.2 Analyzing BP fixed points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212

14.3 Computing the colorability threshold . . . . . . . . . . . . . . . . . . . . . . . 215



14.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218

15 Replica Symmetry Breaking: The Random Energy Model 221

15.1 Rigorous Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222

15.2 Solution via the replica method . . . . . . . . . . . . . . . . . . . . . . . . . . . 224

15.2.1 An instructive bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225

15.2.2 The replica symmetric solution . . . . . . . . . . . . . . . . . . . . . . . 225

15.2.3 The replica symmetry breaking solution . . . . . . . . . . . . . . . . . . 226

15.2.4 The condensation transition and the participation ratio . . . . . . . . . 227

15.2.5 Distribution of overlaps . . . . . . . . . . . . . . . . . . . . . . . . . . . 228

15.3 The connection between the replica potential, and the complexity . . . . . . . 229

15.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231

IV Statistics and machine learning 233

16 Linear Estimation and Compressed Sensing 235

16.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235

16.2 relaxed-BP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236

16.3 State Evolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239

16.3.1 The hat variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240

16.3.2 Order parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240

16.3.3 Back to the hats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241

16.4 From r-BP to G-AMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241

16.5 Rigorous results for Bayes and Convex Optimization . . . . . . . . . . . . . . . 242

16.5.1 Bayes-Optimal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242

16.5.2 Bayes-Optimal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243

16.6 Application: LASSO, Sparse estimation, and Compressed sensing . . . . . . . 243

16.6.1 AMP with finite ∆ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243

16.6.2 LASSO: infinite ∆ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243



16.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245

17 Perceptron 251

17.1 Perceptron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251

V Appendix 257

18 A bit of probability theory 259


Part I

Techniques and methods


Chapter 1

The mean-field ferromagnet of Curie & Weiss

The world was so recent that many things lacked names, and in
order to indicate them it was necessary to point. Every year
during the month of March a family of ragged gypsies would set
up their tents near the village, and with a great uproar of pipes
and kettledrums they would display new inventions. First they
brought the magnet.

100 años de soledad


Gabriel Garcia Márquez – 1967

The Curie-Weiss model is a simple model for ferromagnetism. The essential phenomenon
associated with a ferromagnet is that below a certain critical temperature, a magnetization will
spontaneously appear in the absence of an external magnetic field. The Curie-Weiss model is
simple enough that all the thermodynamic functions characterising its macroscopic properties
can be computed exactly. And yet, it is rich enough to capture the basic phenomenology
of phase transitions — namely the transition between a disordered paramagnetic phase (not
magnetized) and an ordered ferromagnetic phase (magnetized). Because of its simplicity and
because of the correctness of at least some of its predictions, the Curie-Weiss model occupies
an important place in the Statistical Mechanics literature.

In this model, the magnetic moments are encoded by N microscopic spin variables Si ∈
{−1, +1} for i = 1, . . . , N . Every magnetic moment Si interacts with every other magnetic
moment Sj for j ̸= i. Ferromagnetism can be modelled by a collective alignment of the
magnetic moments in the same direction. Therefore, to encourage a ferromagnetic phase, we
add an energy cost for spins which are not aligned. In its simplest flavour, this cost takes
the form of a two-body interaction −Si Sj . The total cost function associated with a given
configuration of spins S ∈ {−1, +1}N , also known in Physics as the Hamiltonian, is given by:

$$H_N^0(S) = -\frac{1}{2N}\sum_{ij} S_i S_j .$$

It is actually convenient to add a constant external field h ∈ R enforcing alignment in S_i = +1
for h > 0 and S_i = −1 for h < 0, so we shall work with the Hamiltonian:
$$H_N(S) = H_N^0(S) - h\sum_i S_i = -\frac{1}{2N}\sum_{ij} S_i S_j - h\sum_i S_i . \qquad (1.1)$$

The probability of finding the system at the configuration s ∈ {−1, +1}^N is given by the
Boltzmann measure¹:
$$P_{N,\beta,h}(S = s) = \frac{e^{-\beta H_N(s)}}{Z_N(\beta, h)} ,$$
where β = T⁻¹ ≥ 0 is the inverse temperature. Note that for β > 0 the Boltzmann measure
gives more weight to configurations with lower cost or energy. In particular, when β → ∞ (or
equivalently T = 0) it concentrates around configurations that minimise H. In the opposite
limit, when β = 0 (or equivalently T → ∞) it assigns equal weight to every configuration,
yielding the uniform measure. The normalization of the Boltzmann measure plays a very
important role in Statistical Physics, and is known as the partition sum or partition function:
$$Z_N(\beta, h) = \sum_{S\in\{-1,+1\}^N} e^{\frac{\beta}{2N}\sum_{ij} S_i S_j + \beta h\sum_i S_i} . \qquad (1.2)$$

As we will show next, the partition sum is closely related to the thermodynamic functions that
characterize the macroscopic properties of the model.

1.1 Rigorous solution

As mentioned in the introduction, ferromagnetism is modelled in this context by the alignment


of the magnetic moments or spins. Therefore, a good probe for a ferromagnetic state is given
by the magnetization per spin:
$$\bar S =: \frac{1}{N}\sum_{i=1}^N S_i .$$

Note that S̄ is the empirical average of the spins, and therefore is itself a random variable. It is
interesting to note that the Hamiltonian of the Curie-Weiss model is actually only a function
of the magnetization:
$$H(\bar S) = -N\left(\frac{1}{2}\bar S^2 + h\bar S\right),$$

making it explicit that it is an extensive quantity H(S̄) ∝ N . And each time we flip a single
spin from −1 to +1, the magnetization per spin increases by 2/N . Therefore, what is the
probability that the magnetization per spin takes a particular value in the set SN = {−1, −1 +
2/N, −1 + 4/N, . . . , 1}? According to the Boltzmann measure, this is given by:

$$P(\bar S = m) = \frac{\Omega(m, N)}{Z_N(\beta, h)}\, e^{\beta N\left(\frac{1}{2}m^2 + h m\right)} ,$$
¹ Also known as Gibbs measure or Gibbs-Boltzmann measure.


Figure 1.1: Binary entropy H(m) defined in equation 1.3 as a function of the magnetization m.

where Ω(m, N ) is the number of configurations with magnetization S̄ = m. This can be


computed explicitly, and is simply given by a standard binomial:

$$\Omega(m, N) = \frac{N!}{\left(\frac{N - N m}{2}\right)!\,\left(\frac{N + N m}{2}\right)!}$$

The expression above is different from the usual binomial distribution because the Si take
their values in {−1, 1} instead of {0, 1}. At first glance it is not very friendly, but with some
work refining Stirling’s approximation (see exercise 1.1), one can show that

$$\frac{e^{N H(m)}}{N+1} \le \Omega(m, N) \le e^{N H(m)} ,$$
with H(m), often called the binary entropy, given by
$$H(m) = -\frac{1+m}{2}\log\frac{1+m}{2} - \frac{1-m}{2}\log\frac{1-m}{2} . \qquad (1.3)$$
(Note that the binary entropy is usually defined with log₂ in Information Theory. We shall stick to the natural logarithm here.)
We thus reach our first result:

Lemma 1. Let $\phi(m) = H(m) + \frac{1}{2}\beta m^2 + \beta h m$, then
$$\frac{1}{N+1}\,\frac{e^{N\phi(m)}}{Z_N(\beta, h)} \le P(\bar S = m) \le \frac{e^{N\phi(m)}}{Z_N(\beta, h)} . \qquad (1.4)$$

One can also compute, and bound, the value of ZN (β, h). Indeed, summing over m on both
sides of the right part of equation 1.4 one reaches

$$1 \le \sum_m \frac{e^{N\phi(m)}}{Z_N(\beta, h)} \le (N+1)\,\frac{e^{N\phi(m^*)}}{Z_N(\beta, h)} , \qquad (1.5)$$
where we have defined the value m^* ∈ [−1, 1] that maximizes ϕ(m) (note that m^* depends on
β and h). Therefore, taking the logarithm on both sides:
$$\frac{\log Z_N(\beta, h)}{N} \le \phi(m^*) + \frac{\log(N+1)}{N} .$$

Additionally, the left part of equation 1.4 teaches us that
$$\frac{1}{N+1}\,\frac{e^{N\phi(m)}}{Z_N(\beta, h)} \le P(\bar S = m) \le 1 ,$$
so that
$$-\frac{\log(N+1)}{N} + \phi(m) \le \frac{\log Z_N(\beta, h)}{N} .$$
This is true for all the discrete values m ∈ S_N, and in particular for the value m_max that
maximizes ϕ(m) over this set. It is easy to see that maximizing over [−1, 1] instead of S_N does
not change the result substantially, since ϕ(m_max) > ϕ(m^*) − log N/N.² Therefore, for N large
enough we finally obtain

Lemma 2. Let $\Phi_N(\beta, h) = \log Z_N(\beta, h)/N$, then
$$\phi(m^*) - \frac{\log(N(N+1))}{N} \le \Phi_N(\beta, h) \le \phi(m^*) + \frac{\log(N+1)}{N} . \qquad (1.6)$$
We shall call log Z_N the free entropy, while $\Phi_N =: \frac{1}{N}\log Z_N$ is the free entropy density. Asymptotically, it reads:

Theorem 1. Let $\Phi_N(\beta, h) = \log Z_N(\beta, h)/N$, then
$$\Phi(\beta, h) =: \lim_{N\to\infty} \Phi_N(\beta, h) = \max_{m\in[-1,1]} \phi(m) = \phi(m^*) . \qquad (1.7)$$
Additionally,
$$\lim_{N\to\infty}\frac{\log P(\bar S = m)}{N} = \phi(m) - \phi(m^*) . \qquad (1.8)$$

Note that this result is quite remarkable: we have turned the seemingly impossible sum over
2^N states of equation 1.2 into a simple maximization of a one-dimensional potential function
ϕ(m). In particular, equation 1.8 shows that in the thermodynamic limit N → ∞ the potential
ϕ(m) fully characterizes the probability of finding the system at a given macroscopic state
S̄ = m. More importantly, we have done this without any approximation. All the steps are
rigorous! This is, of course, thanks to the simplicity of the Curie-Weiss model.
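As a small numerical sanity check of Theorem 1, here is a minimal sketch (added for illustration, with arbitrary parameter values β = 1.5, h = 0.1): it compares the exact finite-N free entropy density, obtained by summing the Boltzmann weight over magnetization sectors, with the variational value max_m ϕ(m).

```python
import numpy as np
from scipy.special import gammaln

def phi(m, beta, h):
    """Potential phi(m) = H(m) + beta*m^2/2 + beta*h*m (natural log)."""
    p, q = (1 + m) / 2, (1 - m) / 2
    H = -p * np.log(np.clip(p, 1e-15, 1)) - q * np.log(np.clip(q, 1e-15, 1))
    return H + 0.5 * beta * m**2 + beta * h * m

def exact_free_entropy_density(N, beta, h):
    """(1/N) log Z_N, summing the Boltzmann weight over magnetization sectors."""
    k = np.arange(N + 1)                      # number of +1 spins
    m = 2 * k / N - 1                         # magnetization per spin
    log_omega = gammaln(N + 1) - gammaln(k + 1) - gammaln(N - k + 1)
    log_terms = log_omega + beta * N * (0.5 * m**2 + h * m)
    return np.logaddexp.reduce(log_terms) / N

beta, h = 1.5, 0.1
m_grid = np.linspace(-1, 1, 100001)
print("max_m phi(m)       :", phi(m_grid, beta, h).max())
for N in [100, 1000, 10000]:
    print(f"Phi_N for N={N:6d} :", exact_free_entropy_density(N, beta, h))
```

The finite-N values approach the variational result with corrections of order log N/N, as predicted by Lemma 2.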

1.1.1 Phase transition in the Curie-Weiss model

From this exact solution, we can now analyze the phenomenology of the Curie-Weiss model.
The extremization ϕ′(m) = 0 leads to the condition
$$\frac{1}{2}\log\left(\frac{1+m}{1-m}\right) = \beta(m + h),$$
which, using $\tanh^{-1}(x) = \frac{1}{2}\log\left(\frac{1+x}{1-x}\right)$, gives the so-called Curie-Weiss mean-field or saddle point
equation:
$$m = \tanh\left(\beta(h + m)\right) . \qquad (1.9)$$
² The mean value theorem applied between m^* and m_max yields a c ∈ [m^*, m_max] which satisfies ϕ(m^*) =
ϕ(m_max) + ϕ′(c)(m^* − m_max), and such that |m^* − m_max| ≤ 2/N. Since the difference is K/N for some constant
K, it will eventually be smaller than log N/N.


Figure 1.2: (Left) Right-hand side of equation 1.9 as a function of m, for fixed h = 0.5 and
different values of the inverse temperature (β solid lines). Solutions of equation 1.9 (dots)
are given by the intersection of f (m) = tanh(β(h + m)) with the line f (m) = m (red dashed).
(Right) Same picture in terms of the potential ϕ(m), where the solutions of equation 1.9
correspond to the global maximum of ϕ(m). Note that for β ≫ 1 an unstable solution
corresponding to a minimum of ϕ appears.


Figure 1.3: (Left) Fixed point m⋆ of the mean-field equation m = tanh(βm) at zero field (h = 0)
as a function of the inverse temperature β. Depending on the sign of the initialisation m0, we
reach one of the two global maxima of ϕ(m) (right, shown for β = 1.5).

Figure 1.2 shows the right-hand side of the mean-field equation 1.9 for a fixed
external field h and different values of the inverse temperature β. The solution m∗ of these
self-consistent equations is given by the intersection of f (m) = tanh(β(m + h)) with the line
f (m) = m. Depending on the value of the parameters, there can be up to three solutions.
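In practice, equation 1.9 is easily solved by fixed-point iteration. The sketch below is added for illustration (the initial conditions, tolerance and parameter values are arbitrary); it also shows how, at zero field and β > 1, the sign of the initial condition selects one of the two symmetric solutions.

```python
import numpy as np

def solve_mean_field(beta, h, m0=0.5, tol=1e-12, max_iter=10000):
    """Fixed-point iteration of the Curie-Weiss equation m = tanh(beta*(m + h))."""
    m = m0
    for _ in range(max_iter):
        m_new = np.tanh(beta * (m + h))
        if abs(m_new - m) < tol:
            break
        m = m_new
    return m

# Zero field: below beta_c = 1 the only solution is m = 0; above it, +/- m* appear,
# and the sign of the initialisation m0 selects which maximum is reached.
for beta in [0.5, 1.5, 3.0]:
    print(beta, solve_mean_field(beta, h=0.0, m0=+0.5),
                solve_mean_field(beta, h=0.0, m0=-0.5))
```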

The property $\frac{1}{N}\log P(\bar S = m) \to \phi(m) - \phi(m^*)$ is usually called a Large Deviation Principle in
mathematics. It basically tells us that the probability that S̄ takes any other value than m^* is
exponentially small, i.e. a very rare event. In a nutshell, we can write that the probability to
find the system at a given value of m is approximately:
$$P(\bar S = m) \underset{N\to\infty}{\asymp} e^{N\left(\phi(m) - \phi(m^*)\right)} ,$$
where we have used the symbol ≍ to denote an equality valid asymptotically in N. If the
maximum is unique, then the magnetization is found to be equal to m^* with probability one.
This convergence in probability is called a concentration phenomenon in probability theory. For


Figure 1.4: Same setting as Fig. 1.2, but for zero external field h = 0. For β < 1, the potential
ϕ(m) has only one maximum corresponding to a disordered phase m⋆ = 0. For β > 1,
the system has two ordered ferromagnetic phases corresponding to the emergence of two
symmetric global maxima ±m⋆ .

the physicist, it means that "macroscopic" quantities such as the magnetization are entirely
deterministic, as their random fluctuations around the mean are negligible: this concentration
of the measure is at the root of the success of Statistical Mechanics.

In the Curie-Weiss model, however, if h = 0 and β > 1, then the global maximum is not unique:
there are two degenerate maxima at ±m∗ . This means that if one samples a configuration
from the Boltzmann measure, then with a probability 1/2 the configuration will have a mag-
netization ±m∗ , see Figs. 1.3 and 1.4. This situation is called phase coexistence in Physics, and is
a fundamental property of liquids, solids and gases in nature. Phase coexistence arises only
for h = 0 and β > 1 in the Curie-Weiss model. Indeed, as shown in Fig. 1.2, for h ̸= 0 there
might be more than one solution to the mean-field equations, but there is only one global
maximum corresponding to a single phase, also known as a single Gibbs state.

1.2 The free energy/entropy is all you need

In more involved models one cannot hope to access the statistical distribution of
relevant quantities with a direct computation, as we did in Theorem 1. However, we should
expect to find similar phenomenology: concentration of macroscopic quantities in the thermo-
dynamic limit, a large deviation principle for the Boltzmann measure, single phase vs phase
coexistence, and of course, phase transitions. It turns out that almost all these phenomena can
be understood from the computation of the asymptotic free entropy density Φ(β, h). This is
why computing this quantity is the single most important analytical task in Statistical Physics.

First, a piece of trivia: most physicists do not use the free entropy or the free entropy density
— for reasons rooted in the history of thermodynamics, dating back to Carnot and Clausius —
but rather the free energy and the corresponding free energy density, which are the same as the
free entropy up to a factor −1/β. In the Curie-Weiss model, this gives:
$$\begin{aligned}
f_N(\beta, h) &=: -\frac{1}{\beta N}\log Z_N(\beta, h) = -\frac{1}{\beta}\,\Phi_N(\beta, h) , \\
f(\beta, h) &=: \lim_{N\to\infty} f_N(\beta, h) = \min_m f(m, \beta, h) , \\
f(m, \beta, h) &=: -\frac{1}{2}m^2 - h m - \frac{1}{\beta}H(m) = -\frac{1}{\beta}\,\phi(m) .
\end{aligned}$$

The fact that physicists (since Clausius) use the free energy with a factor −β⁻¹ in front of the
log seems to be a notational problem for many mathematicians, who just cannot understand
why they should bother with a trivial minus sign. Thus, many of them simply refer to Φ(β, h) as
the free energy density (or worse, sometimes using the terminology "pressure" from the theory
of gases), which should make Clausius turn in his grave. It is also common for mathematicians
to define the Hamiltonian with a global minus sign with respect to the one used by Physicists.
Since this monograph is not concerned with actual applications in Physics, we might forgive
these bad habits. Nevertheless, we will attempt to use the correct terminology, so that we shall
not, for instance, "maximize" the energy and "minimize" the entropy!

1.2.1 Derivatives of the free entropy

We now discuss how knowing the free entropy actually allows one to rediscover all the
phenomena we have discussed. First, we notice that for any finite value of N , the free entropy
is a generating functional for the (connected) moments of the magnetization S̄. Denoting by
⟨·⟩_N the average with respect to the Boltzmann measure, and recalling that
$$Z_N(\beta, h) = \sum_{S\in\{-1,1\}^N} e^{-\beta H_N^0 + \beta N h \bar S} ,$$
we have:
$$\frac{1}{\beta}\frac{\partial}{\partial h}\Phi_N(\beta, h) = \frac{1}{\beta N}\frac{\partial}{\partial h}\log Z_N(\beta, h) = \sum_{S\in\{-1,1\}^N}\frac{\bar S\, e^{-\beta H_N}}{Z_N(\beta, h)} = \langle \bar S\rangle_N = m_N . \qquad (1.10)$$

This also shows why it is useful to introduce an external magnetic field h at the beginning of
our derivations: we can obtain the moments of the magnetization by taking derivatives with
respect to h (the second derivative yields the variance, and so on). However, it
is far from trivial that this relation holds when the limit N → ∞ is taken: the mathematical
conditions for switching the limit and the derivative are non-trivial. Indeed, this can only be
done away from the phase transition, in which case it follows from the convexity of the free
entropy. Considering the second derivative with respect to h, we find:
$$\frac{1}{\beta}\frac{\partial^2}{\partial h^2}\Phi_N(\beta, h) = \frac{\partial}{\partial h}\sum_{S\in\{-1,1\}^N}\frac{\bar S\, e^{-\beta H_N}}{Z_N(\beta, h)} = N\beta\left(\langle \bar S^2\rangle_N - \langle \bar S\rangle_N^2\right) \ge 0 . \qquad (1.11)$$
Therefore Φ_N is convex in h. This result is also known in the statistical physics context as the
fluctuation-dissipation theorem. A fundamental theorem on the limit of a sequence of convex
functions f_n as n → ∞ tells us that if f_n(x) → f(x) for all x, and if f_n(x) is convex, then
f_n′(x) → f′(x) for all x where f(x) is differentiable. Outside of the phase transition points, where

the free entropy is singular, the derivative of the asymptotic free entropy thus yields the
asymptotic magnetization. Let us check that this is true. We know that Φ(β, h) = ϕ(m∗ (h, β)),
therefore we write:
$$\frac{1}{\beta}\frac{\partial}{\partial h}\Phi(\beta, h) = m^*(h, \beta) + \frac{1}{\beta}\left.\frac{\partial\phi}{\partial m}\right|_{m^*}\frac{\partial m^*}{\partial h} = m^*(h, \beta), \qquad (1.12)$$

given that the derivative of ϕ(m) is zero when evaluated at m∗ . The derivative of the free
entropy has thus given us the equilibrium magnetization m∗ , as it should.
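At finite N, relation 1.10 can also be verified by brute-force enumeration. The sketch below is added for illustration (N = 10, the parameter values and the finite-difference step are arbitrary): it compares a numerical derivative of log Z_N with the enumerated average ⟨S̄⟩_N.

```python
import numpy as np
from itertools import product

def log_Z(N, beta, h):
    """Brute-force log Z_N by enumerating all 2^N spin configurations."""
    terms = []
    for S in product([-1, 1], repeat=N):
        m = sum(S) / N
        terms.append(beta * N * (0.5 * m**2 + h * m))
    return np.logaddexp.reduce(terms)

def avg_magnetization(N, beta, h):
    """<S_bar>_N under the Boltzmann measure, by direct enumeration."""
    num, den = 0.0, 0.0
    for S in product([-1, 1], repeat=N):
        m = sum(S) / N
        w = np.exp(beta * N * (0.5 * m**2 + h * m))
        num += m * w
        den += w
    return num / den

N, beta, h, eps = 10, 1.2, 0.3, 1e-5
lhs = (log_Z(N, beta, h + eps) - log_Z(N, beta, h - eps)) / (2 * eps) / (beta * N)
print("numerical (1/(beta N)) d log Z / dh :", lhs)
print("<S_bar>_N by enumeration            :", avg_magnetization(N, beta, h))
```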

1.2.2 Legendre transforms

Another instructive way to look at the problem arises using two important mathematical facts
coming from prominent French mathematicians: Laplace and Legendre. First, let’s state the
very useful Laplace method for computing integrals.

Theorem 2 (Laplace method). Suppose f(x) is a twice continuously differentiable function on [a, b]
and there exists a unique point x₀ ∈ (a, b) such that:
$$f(x_0) = \max_{x\in[a,b]} f(x) \quad \text{and} \quad f''(x_0) < 0 ,$$
then:
$$\lim_{n\to\infty}\frac{\int_a^b e^{n f(x)}\, dx}{e^{n f(x_0)}\sqrt{\frac{2\pi}{n(-f''(x_0))}}} = 1 ,$$
and in particular
$$\lim_{n\to\infty}\frac{1}{n}\log\int_a^b e^{n f(x)}\, dx = f(x_0) ,$$
$$\lim_{n\to\infty}\frac{\int_a^b g(x)\, e^{n f(x)}\, dx}{\int_a^b e^{n f(x)}\, dx} = g(x_0) . \qquad (1.13)$$

These formulas, proven by Laplace in his foundational text "Mémoire sur la probabilité des
causes par les évènements" in 1774, have profound implications when combined with the
large deviation principle.
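Here is a quick numerical check of Theorem 2, added as a sketch (the function f and the interval are arbitrary choices): the ratio between the integral, computed by quadrature, and the Laplace estimate approaches 1 as n grows.

```python
import numpy as np
from scipy.integrate import quad

# Toy example: f has a single interior maximum on [a, b] (arbitrary choice).
f  = lambda x: -(x - 0.3)**2 + 0.1 * np.sin(3 * x)
f2 = lambda x: -2.0 - 0.9 * np.sin(3 * x)          # analytic f''(x), negative near the maximum
a, b = -2.0, 2.0

x_grid = np.linspace(a, b, 200001)
x0 = x_grid[np.argmax(f(x_grid))]                  # location of the maximum

for n in [10, 100, 1000]:
    # Factor out e^{n f(x0)} so the integrand stays O(1).
    integral, _ = quad(lambda x: np.exp(n * (f(x) - f(x0))), a, b, points=[x0])
    laplace = np.sqrt(2 * np.pi / (n * (-f2(x0))))
    print(n, integral / laplace)                   # ratio approaches 1 as n grows
```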

Consider the Curie-Weiss model with zero field (h = 0). By definition, we have
$$P(\bar S = m; h = 0) = \frac{\sum_{S\in\{-1,1\}^N} e^{-\beta H_N^0}\,\mathbb{1}(\bar S = m)}{Z_N(\beta, h = 0)} = \frac{\sum_{S\in\{-1,1\}^N} e^{-\beta H_N^0}\,\mathbb{1}(\bar S = m)}{e^{N\Phi_N(\beta, 0)}} \underset{N\to\infty}{\asymp} e^{-N I_0^*(m)} ,$$

where we have denoted I0∗ (m) the true large deviation rate at zero external field. Simply
by assuming this large deviation principle, we can reach deep conclusions even if we do not
know the actual expression of I0∗ (m). Indeed, we can write the total partition sum of the

system in the presence of an external field as a Laplace integral over the possible values of m. Using
$H_N = H_N^0 - N h \bar S$ we write:
$$Z_N(\beta, h) = \sum_{m\in S_N}\left(\sum_{S\in\{-1,+1\}^N} e^{-\beta H_N^0}\,\mathbb{1}(\bar S = m)\right) e^{N\beta h m} \underset{N\to\infty}{\asymp} \int_{-1}^{1} dm\; e^{N\left(\Phi(\beta, 0) - I_0^*(m) + \beta h m\right)} . \qquad (1.14)$$
At this point the Laplace method applied to the limit of Z_N(β, h) gives us automatically that
$$\Phi(\beta, h) - \Phi(\beta, 0) = \max_m\left[\beta h m - I_0^*(m)\right] .$$

We thus obtain a very generic relation between the free entropy of the system in a field and the
large deviation rate (without field) I0∗ (m): they are related through a Legendre transform:

$$\Phi(\beta, h) - \Phi(\beta, 0) = \max_m\left[\beta h m - I_0^*(m)\right] .$$

In fact, the theory of Legendre transforms tells us slightly more: if we further take the Legendre
transform of Φ(β, h) we can (almost) recover the true rate I0∗ (m). Let us define

$$I_0(m) = \max_h\left(\beta h m - \Phi(\beta, h)\right) + \Phi(\beta, 0) ;$$
then a fundamental property of the Legendre transform reveals that I_0(m) is the convex
envelope of I_0^*(m). We can thus recover the large deviation rate simply by "Legendre-
transforming" the free entropy. Again, we see that everything can be computed through the
knowledge of the free entropy Φ(β, h). Truly, the free entropy is all you need.

Note however that there is a fundamental limitation to the ability to compute large deviation
rates with this technique. Since we can only recover the convex envelope of the true rate,
if the true rate is not convex there is a part of the curve that we cannot obtain! This is
illustrated in Figure 1.5: we only get the part of I_0^*(m) that coincides with the convex envelope
I_0(m) (this set is called the "exposed points" of I_0^*(m)), while the other points are just given
an upper bound. These considerations are classical in statistical mechanics, and are at the basis
of the "equivalence of ensembles" as well as of the derivation of thermodynamics (which is
really nothing but Legendre transforms).
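This convexification can also be visualised numerically. The sketch below is added for illustration (β = 1.5 and the grids are arbitrary choices): it computes the true zero-field rate I_0^*(m) = Φ(β, 0) − ϕ(m, 0) and its double Legendre transform, and checks that the two only differ on the non-convex, non-exposed region.

```python
import numpy as np

beta = 1.5
m_grid = np.linspace(-0.999, 0.999, 4001)
h_grid = np.linspace(-2.0, 2.0, 4001)

def phi(m, h):
    H = -(1 + m)/2 * np.log((1 + m)/2) - (1 - m)/2 * np.log((1 - m)/2)
    return H + 0.5 * beta * m**2 + beta * h * m

Phi = lambda h: phi(m_grid, h).max()          # Phi(beta, h) = max_m phi(m, h)
Phi0 = Phi(0.0)

# True rate at zero field and its double Legendre transform (the convex envelope).
I_star = Phi0 - phi(m_grid, 0.0)              # I_0^*(m)
Phi_h  = np.array([Phi(h) for h in h_grid])
I_conv = np.array([np.max(beta * h_grid * m - Phi_h) + Phi0 for m in m_grid])

# The two coincide on the exposed points and differ where I_0^* is non-convex.
print("max |I_conv - I_star|                    :", np.abs(I_conv - I_star).max())
print("fraction of m where they differ by >1e-3 :",
      np.mean(np.abs(I_conv - I_star) > 1e-3))
```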

1.2.3 Gartner-Ellis Theorem

What we have said above can also be written rigorously using the language of modern large
deviation theory. In particular, the Gartner-Ellis Theorem connects the large deviation rates
with the Legendre transform of the partition sum in a very generic way:
Theorem 3 (Gartner-Ellis, informal). If
$$\lambda(k) = \lim_{N\to\infty}\frac{1}{N}\log\mathbb{E}\left(\exp(N k A_N)\right)$$
exists and is differentiable for all k ∈ R, then, defining
$$I(a) = \sup_k\left(k a - \lambda(k)\right),$$


Figure 1.5: The true large deviation rate I(m), and the function I*(m) obtained after taking two
consecutive Legendre transforms. I*(m) is the convex envelope of I(m). The two functions
coincide at all the "exposed points" (where the convex envelope touches the curve), but they
differ in the dashed region, which is not "exposed". Legendre transforms thus only allow one
to compute sharp large deviation rates at the exposed points, and upper bounds otherwise.

we have the large deviation principle
$$\lim_{N\to\infty}\frac{1}{N}\log P(A_N = a) \le -I(a) ,$$
with equality for the exposed points:
$$\lim_{N\to\infty}\frac{1}{N}\log P(A_N = a) = -I(a) \qquad \forall a \in \{\text{exposed points}\} .$$

In the Curie-Weiss model, the connection with our former derivation is immediate: consider
the model without a field (i.e. h = 0) and let A_N = S̄ be the average magnetization of the
system. The theorem (which in this particular case is called Cramér's theorem) thus tells us
that we need to compute:
$$\lambda(k) = \lim_{N\to\infty}\frac{1}{N}\log\frac{1}{Z_N}\sum_S e^{-\beta H_N^0 + N k\bar S} = \lim_{N\to\infty}\left(\Phi_N(\beta, h = k/\beta) - \Phi_N(\beta, h = 0)\right),$$
and we thus recognise
$$\lambda(k) = \max_m \phi(m, h = k/\beta) - \Phi(\beta, 0) = \max_m\left[k m - I_0^*(m)\right],$$
while the rate function for the magnetization (at zero field) is given by
$$I(m) = \max_k\left[k m - \lambda(k)\right] .$$
Given the properties of the Legendre transform, I(m) is indeed given by the convex envelope
of the true rate I_0^*(m) = Φ(β, 0) − ϕ(m, 0), as expected.

1.3 Toolbox: Gibbs free entropy and the variational approach

One cannot hope that the computation of the partition sum will always be so easy, so it is
worth learning a few tricks. A fundamental result in statistical physics, which is at the root
of many analytical and practical approaches, and which has been used extensively in machine
learning as well, is the following one:
Theorem 4 (Gibbs variational approach). Consider a Hamiltonian H_N(x) with x ∈ R^N, and the
associated Boltzmann-Gibbs measure P_Gibbs(x) = e^{−βH_N(x)}/Z_N(β). Given an arbitrary probability
distribution Q(x) over R^N and its entropy S[Q] = −⟨log Q⟩_Q, let the Gibbs functional be
$$N\phi_{\rm Gibbs}(Q) =: S[Q] - \beta\langle H_N\rangle_Q ,$$
where ⟨·⟩_Q denotes the expectation with respect to the distribution Q. Then
$$\forall Q, \qquad \Phi_N(\beta) = \frac{1}{N}\log Z_N(\beta) \ge \phi_{\rm Gibbs}(Q) ,$$
with equality when Q = P_Gibbs.

Let us first prove a simpler result. We shall introduce the following quantity, known as
the Kullback-Leibler divergence (or "relative entropy"), which measures how much two
distributions P(x) and Q(x) differ:
$$D_{\rm KL}(P\|Q) =: \int dx\, P(x)\,\log\frac{P(x)}{Q(x)} .$$
We can prove the following lemma:
Lemma 3 (Gibbs inequality).
$$D_{\rm KL}(P\|Q) \ge 0 ,$$
with equality if and only if P = Q almost everywhere.

Proof. For any scalar u ∈ R₊ we have log(u) ≤ u − 1, with equality at u = 1. Therefore
$$-D_{\rm KL}(P\|Q) = \int dx\, P(x)\,\log\frac{Q(x)}{P(x)} \le \int dx\, P(x)\left(\frac{Q(x)}{P(x)} - 1\right) = \int dx\, Q(x) - \int dx\, P(x) = 1 - 1 = 0 .$$
Since log(u) < u − 1 unless u = 1, the inequality is strict unless P = Q almost everywhere.

We then write the difference between the Gibbs free entropy and the actual free entropy using the
Kullback-Leibler divergence:
Lemma 4. Denoting the Boltzmann-Gibbs probability distribution as P_Gibbs(x) = e^{−βH(x)}/Z_N, and
an arbitrary distribution as Q, we have
$$N\Phi_N = N\phi_{\rm Gibbs}(Q) + D_{\rm KL}(Q\|P_{\rm Gibbs}) .$$
Proof. The proof is a trivial application of the definition of the divergence:
$$\begin{aligned}
\langle\log P_{\rm Gibbs}\rangle_Q &= -\beta\langle H_N\rangle_Q - \log Z_N \\
\langle\log P_{\rm Gibbs}\rangle_Q - \langle\log Q\rangle_Q &= -\beta\langle H_N\rangle_Q - N\Phi_N - \langle\log Q\rangle_Q \\
-D_{\rm KL}(Q\|P_{\rm Gibbs}) &= N\phi_{\rm Gibbs}(Q) - N\Phi_N \\
N\Phi_N &= N\phi_{\rm Gibbs}(Q) + D_{\rm KL}(Q\|P_{\rm Gibbs})
\end{aligned}$$

Together, lemma 3 and lemma 4 imply theorem 4. Why is this interesting? Basically, it gives
us a way to approximate the partition sum (and the true distribution) by using many "trial"
distributions, and picking the one with the largest free entropy. This has been used in
countless ways since the birth of statistical and quantum physics, under many names
(for instance "Gibbs-Bogoliubov-Feynman"), and it is at the root of many applications in
Bayesian Statistics and machine learning as well, in which case the Gibbs free entropy is called
the Evidence Lower BOund, or ELBO for short.

Let us see how it can be used for the Curie-Weiss model. The simplest thing we could try is a
factorized distribution, identical for all spins:
$$Q(S) = \prod_i Q_i(S_i) = \prod_i\left(\frac{1+m}{2}\,\delta(S_i - 1) + \frac{1-m}{2}\,\delta(S_i + 1)\right). \qquad (1.15)$$
Then, we can write the Gibbs free entropy density as
$$\begin{aligned}
\phi_{\rm Gibbs}(Q) &= \beta\,\langle \bar S^2/2 + h\bar S\rangle_Q + \frac{1}{N}\sum_i H(Q_i) \qquad &(1.16)\\
&= \beta m^2/2 + \beta h m + H(m) , &(1.17)
\end{aligned}$$
with H(m) the binary entropy, and we find back the correct free entropy, at any fixed m,
through this simple variational ansatz.
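As a sanity check of Theorem 4 with this ansatz, the following sketch (added for illustration; N = 50 and the parameter values are arbitrary) verifies numerically that the Gibbs functional of the factorized distribution never exceeds the exact finite-N free entropy density.

```python
import numpy as np
from scipy.special import gammaln

beta, h, N = 1.2, 0.2, 50

def exact_Phi_N():
    """Exact free entropy density (1/N) log Z_N via the magnetization sectors."""
    k = np.arange(N + 1)
    m = 2 * k / N - 1
    log_omega = gammaln(N + 1) - gammaln(k + 1) - gammaln(N - k + 1)
    return np.logaddexp.reduce(log_omega + beta * N * (0.5 * m**2 + h * m)) / N

def gibbs_functional(m):
    """phi_Gibbs(Q) per spin for the factorized ansatz of eq. (1.15).
    Under Q: <S_bar> = m and <S_bar^2> = m^2 + (1 - m^2)/N."""
    p, q = (1 + m) / 2, (1 - m) / 2
    entropy = -p * np.log(p) - q * np.log(q)
    energy = 0.5 * (m**2 + (1 - m**2) / N) + h * m     # <S_bar^2/2 + h S_bar>_Q
    return beta * energy + entropy

m_grid = np.linspace(-0.99, 0.99, 2001)
elbo = gibbs_functional(m_grid)
print("max_m phi_Gibbs :", elbo.max())        # best variational lower bound
print("Phi_N (exact)   :", exact_Phi_N())     # always >= phi_Gibbs(Q), Theorem 4
assert elbo.max() <= exact_Phi_N() + 1e-12
```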

1.4 Toolbox: The cavity method

Here, we are going to see another important method: the cavity trick. It will be at the root of
many of our computations in the course. The whole idea is based on the following question:
what happens when you add one variable to the system? Physicists like imagery: one could
visualize a system of N variables, making a little "hole" or "cavity" in it, and delicately adding
one variable, hence the name "cavity method".

Let us see how it works by comparing the two Hamiltonians with N and N + 1 variables.
Denoting the new spin as the number "0" we have, for a system at inverse temperature β ′ and
field h′ :
$$\begin{aligned}
-\beta' H_{N+1} &= \frac{\beta'(N+1)}{2}\left(\frac{S_0 + \sum_i S_i}{N+1}\right)^2 + \beta' h'\Big(S_0 + \sum_i S_i\Big) \\
&= \frac{\beta'}{2(N+1)} + \frac{\beta'}{2}\,\frac{N^2}{N+1}\left(\frac{\sum_i S_i}{N}\right)^2 + \beta' S_0\,\frac{N}{N+1}\,\frac{\sum_i S_i}{N} + \beta' h'\sum_i S_i + \beta' h' S_0 . \qquad (1.18)
\end{aligned}$$
If we define $\beta' = \beta(N+1)/N$, and our new field $h'$ as $h' = hN/(N+1)$, we get
$$\begin{aligned}
-\beta' H_{N+1}(h') &= \mathrm{cst} + \frac{\beta N}{2}\left(\frac{\sum_i S_i}{N}\right)^2 + \beta S_0\,\frac{\sum_i S_i}{N} + \beta h\sum_i S_i + \beta h S_0 \\
&= \mathrm{cst} - \beta H_N + \beta S_0\,\frac{\sum_i S_i}{N} + \beta h S_0 .
\end{aligned}$$

We thus have two systems: one with N+1 spins at temperature and field (β′, h′) and one with
N spins at temperature and field (β, h). The relation we just derived makes the expectation
of the new (N+1)-th variable easy to compute in the new system, as a function of the sum in the old
system. In fact, one can directly write the expectation of the new variable as follows:
$$\langle S_0\rangle_{N+1,\beta'} = \frac{\sum_{S_0,S} S_0\, e^{-\beta' H_{N+1}}}{\sum_{S_0,S} e^{-\beta' H_{N+1}}} = \frac{\sum_S\sum_{S_0} S_0\, e^{-\beta H_N + \beta S_0\bar S + \beta h S_0}}{\sum_S\sum_{S_0} e^{-\beta H_N + \beta S_0\bar S + \beta h S_0}} = \frac{\langle\sinh(\beta(\bar S + h))\rangle_{N,\beta}}{\langle\cosh(\beta(\bar S + h))\rangle_{N,\beta}} .$$

Assuming, as a physicist would do, that S̄ concentrates on a deterministic value m^* (at least
away from the phase coexistence line), and assuming that m^* should be the same for the N and N+1
systems when N is large enough (as it should), we have immediately:
$$\langle S_0\rangle_{N+1,\beta'} = \frac{\langle\sinh(\beta(\bar S + h))\rangle_{N,\beta}}{\langle\cosh(\beta(\bar S + h))\rangle_{N,\beta}} \approx \frac{\sinh(\beta(m^* + h))}{\cosh(\beta(m^* + h))} = \tanh\left(\beta(m^* + h)\right) .$$
As N → ∞, the difference between β and β′ vanishes, and we recover the mean-field equation:
$$m^* = \tanh(\beta(m^* + h)) .$$

We can also recover the free energy from a similar cavity argument. First we note that
$$\frac{1}{N}\log Z_N = \frac{1}{N}\log\frac{Z_N}{Z_{N-1}}Z_{N-1} = \frac{1}{N}\log\frac{Z_N}{Z_{N-1}}\,\frac{Z_{N-1}}{Z_{N-2}}\cdots\frac{Z_1}{1} = \frac{1}{N}\sum_{n=0}^{N-1}\log\frac{Z_{n+1}}{Z_n} , \qquad (1.19)$$
so that (a rigorous argument can be made using Cesàro sums, see appendix 1.C):
$$\Phi(\beta, h) = \lim_{N\to\infty}\log\frac{Z_{N+1}(\beta, h)}{Z_N(\beta, h)} .$$
Note, however, that β should be the same for the N and N+1 systems in this computation.
Starting from equation 1.18, we thus write, making sure we keep all terms that are not o(1):
$$\begin{aligned}
-\beta H_{N+1} &= o(1) + \frac{\beta}{2}\left(N - 1 + o(1)\right)\left(\frac{\sum_i S_i}{N}\right)^2 + \beta S_0\,\frac{\sum_i S_i}{N}\left(1 + o(1)\right) + \beta h\sum_i S_i + \beta h S_0 \\
&= o(1) - \beta H_N - \frac{\beta}{2}\left(\frac{\sum_i S_i}{N}\right)^2 + \beta S_0\,\frac{\sum_i S_i}{N} + \beta h S_0 . \qquad (1.20)
\end{aligned}$$

Notice the presence of the term −β S̄²/2, which we could have overlooked had we been less
cautious. We did not keep this term in the previous computation because we could include
it in β′: it was not making any difference in the large N limit and could be absorbed
in the normalisation at the price of a minimal o(1) change in β. Here, however, we need to
compute Z_{N+1}/Z_N, which is O(1), so we need to pay attention to any constant correction. We
can now finally compute³
$$\frac{Z_{N+1}}{Z_N} = \left\langle e^{-\frac{\beta}{2}\bar S^2}\, 2\cosh\big(\beta(\bar S + h)\big)\right\rangle_{N,\beta} . \qquad (1.21)$$
³ Let us make a remark that shall be useful later on, when we discuss the cavity method on sparse graphs:
the new Hamiltonian has two terms in addition to the old one: a "site" term that is a function of S_0, and a "link"
term that appears because, on top of adding one spin, we added N links to the system. This will turn out to be a
generic property.

Using now the concentration of S̄ onto m^*, we get, taking the log:
$$\Phi(\beta, h) = -\beta\frac{m^{*2}}{2} + \log\left[2\cosh(\beta(m^* + h))\right] . \qquad (1.22)$$
This might seem a little bit odd, since it does not look the same as the former expression!
Indeed we now find:
$$\Phi(\beta, h) = \max_m \tilde\phi(m) \quad \text{with} \quad \tilde\phi(m) =: -\beta\frac{m^2}{2} + \log\left[2\cosh(\beta(m + h))\right] . \qquad (1.23)$$
And indeed, we are taking the maximum, since the derivative of ϕ̃(m) with respect to m yields
m^* = tanh(β(m^* + h)). Did we make a mistake? No, we did not! While ϕ̃(m) and ϕ(m) are
different, both have the same value at any of their extrema, as can be checked by a simple
plot. Indeed, we can show analytically that their fixed points are identical: remember that
m^* = tanh(β(m^* + h)), so that β(m^* + h) = atanh(m^*), and therefore, using the identity
$$\log\left[2\cosh(\mathrm{atanh}(x))\right] - x\,\mathrm{atanh}(x) = H(x) = -\frac{1+x}{2}\log\frac{1+x}{2} - \frac{1-x}{2}\log\frac{1-x}{2} ,$$
we obtain:
$$\begin{aligned}
\tilde\phi(m^*) &= -\beta\frac{m^{*2}}{2} + \log\left[2\cosh(\beta(m^* + h))\right] \\
&= -\beta\frac{m^{*2}}{2} + \log\left[2\cosh(\mathrm{atanh}(m^*))\right] \\
&= -\beta\frac{m^{*2}}{2} + m^*\,\mathrm{atanh}(m^*) + H(m^*) \\
&= -\beta\frac{m^{*2}}{2} + \beta m^{*2} + \beta m^* h + H(m^*) \\
&= \phi(m^*) .
\end{aligned}$$

One should not, however, assume that the last expression ϕ̃(m) is the correct large deviation
quantity: it is not. The reader is invited to check that the Legendre transform of Φ(β, h) gives
back the correct large deviation function ϕ, as it should.

Figure 1.6: The functions ϕ(m) (1.4) and ϕ̃(m) (1.23) are different but coincide at all their
extrema.
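As a quick numerical confirmation (a sketch added here, with arbitrary values β = 1.5, h = 0.2), solving the mean-field equation by iteration and evaluating both potentials at the fixed point gives the same value:

```python
import numpy as np

beta, h = 1.5, 0.2

def phi(m):
    H = -(1 + m)/2 * np.log((1 + m)/2) - (1 - m)/2 * np.log((1 - m)/2)
    return H + 0.5 * beta * m**2 + beta * h * m

phi_tilde = lambda m: -0.5 * beta * m**2 + np.log(2 * np.cosh(beta * (m + h)))

# Solve the mean-field equation by fixed-point iteration, then compare the potentials.
m = 0.5
for _ in range(1000):
    m = np.tanh(beta * (m + h))

print("m*            :", m)
print("phi(m*)       :", phi(m))
print("phi_tilde(m*) :", phi_tilde(m))   # identical at the extremum, as shown above
```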

All these considerations can be made entirely rigorous, as we show in Appendix 1.C. This
is one of the very important tools that mathematicians may use to prove some of the results
discussed in this monograph.
1.5 Toolbox: The "field-theoretic" computation 17

1.5 Toolbox: The "field-theoretic" computation

To conclude this lecture, it will be also worth learning a trick that physicists use a lot, and that
will be central to all replica computations in the next chapters. It is probably a good idea to
learn it in the context of the simple Curie-Weiss model. We write again the Hamiltonian

$$H_N^0 = -\frac{1}{N}\sum_{i<j} S_i S_j \approx -\frac{1}{2N}\sum_{i,j} S_i S_j , \qquad H_N = H_N^0 - h\sum_i S_i ,$$
as well as the partition sum
$$Z_N = \sum_S e^{\frac{N\beta}{2}\left(\sum_j \frac{S_j}{N}\right)^2 + N\beta h\sum_i \frac{S_i}{N}} .$$

Now, let us pretend we do not know how to compute the binomial coefficient Ω(m). Instead,
we are going to use the so-called "Dirac-Fourier" method, which starts with the following
identity for the delta "function":
$$\int du\, f(u)\,\delta(u - x) = f(x) .$$
In this case, physicists would actually use the following version of the identity:
$$\int dm\, f(m)\,\delta(N m - x) = \int dm\, f(m)\,\frac{1}{N}\,\delta\!\left(m - \frac{x}{N}\right) = \frac{1}{N}\, f\!\left(\frac{x}{N}\right),$$
but notice that the additional 1/N term does not matter since we take the log and divide by
N: it thus makes no difference asymptotically. We then write:
$$Z_N = \sum_S e^{-\beta H_N} = N\sum_S\int dm\;\delta\Big(N m - \sum_j S_j\Big)\, e^{\frac{N\beta}{2}m^2 + N\beta h m} = N\int dm\; e^{\frac{N\beta}{2}m^2 + N\beta h m}\,\sum_S\delta\Big(N m - \sum_j S_j\Big) .$$

We recognize that one would need to compute the entropy at fixed m. Again, let us pretend we
cannot compute it (the idea is to do so without any combinatorics). Instead, using a Fourier
transform of the delta "function" (which is really a distribution), one finds that
$$Z_N = N\int dm\int d\lambda\;\; e^{\frac{N\beta}{2}m^2 + \beta h m N}\,\sum_S e^{i2\pi\lambda\left(N m - \sum_j S_j\right)} .$$
We do not like to keep the i explicitly, so instead, we write m̂ = i2πλ and integrate in the
complex plane:
$$Z_N = 2i\pi N\int_{-1}^{1} dm\int_{-i2\pi\infty}^{i2\pi\infty} d\hat m\;\; e^{\frac{N\beta}{2}m^2 + \beta h N m + N\hat m m}\,\sum_S e^{-\hat m\sum_j S_j} = 2i\pi N\int_{-1}^{1} dm\int_{-i2\pi\infty}^{i2\pi\infty} d\hat m\;\; e^{N\left(\frac{\beta}{2}m^2 + \beta h m + \hat m m\right)}\left(2\cosh\hat m\right)^N .$$

We are interested only in the density of the log, and thus we write:
$$\frac{\log Z_N}{N} = \frac{1}{N}\log\left\{\int_{-1}^{1} dm\int_{-i2\pi\infty}^{i2\pi\infty} d\hat m\;\; e^{N\left(\frac{\beta}{2}m^2 + \beta h m + \hat m m + \log 2 + \log\cosh\hat m\right)}\right\} + o(1) .$$

We have gotten rid of the combinatoric sums, at the price of introducing integrals. It turns
out, however, that these are very easy integrals, at least when N is large. The trick is to use
Cauchy's theorem to deform the contour integral in m̂ and put the path right onto a saddle
in the complex plane. This is called the saddle point method, and it is a generalization of
Laplace's method to the complex plane. More information is provided in Exercise 1.3, but
what the saddle point method tells us is that for integrals of this type, we have
$$\Phi(\beta, h) = \lim_{N\to\infty}\frac{\log Z_N}{N} = \mathrm{extr}_{m,\hat m}\; g(m, \hat m)$$
with
$$g(m, \hat m) = \frac{\beta}{2}m^2 + \beta h m + \hat m m + \log 2 + \log\cosh\hat m .$$
In other words, the complex two-dimensional integral has been replaced by a much simpler
differentiation, where we just need to find the extremum of g(m, m̂)!

How consistent is this expression with the previous one we obtained? Let us do the differentiation
over m̂: the extremum condition imposes m = −tanh m̂, or m̂ = −tanh⁻¹ m. If we plug
this into the expression, we finally find
$$g(m) = \frac{\beta}{2}m^2 + \beta h m - m\tanh^{-1}(m) + \log\left[2\cosh\left(\tanh^{-1} m\right)\right] .$$
This does not look directly like our good old ϕ(m), but by using the following identity:
$$\log\left[2\cosh(\mathrm{atanh}(m))\right] - m\,\mathrm{atanh}(m) = H(m),$$
we get back
$$g(m) = \frac{\beta}{2}m^2 + \beta h m + H(m) = \phi(m),$$
and Φ(β, h) = extr_m ϕ(m), as it should.

It is worth noting, however, that most physicists choose to write things in a slightly different,
but equivalent, way. Indeed, starting again from
$$g(m,\hat m) = \frac{\beta}{2}m^2 + \beta h m + \hat m m + \log 2 + \log\cosh\hat m\,,$$
the typical physicist imposes m̂ = −β(m + h) instead of m = −tanh m̂. They would then get
rid of m̂ instead of m, since it is more convenient, and write:
$$\Phi(\beta, h) = \mathrm{extr}_m\left\{-\frac{\beta}{2}m^2 + \log\big(2\cosh\beta(m+h)\big)\right\} = \max_m \tilde\phi(m, h, \beta)$$

As we have seen in the previous section in the cavity computation, this is not a problem, since
this formula is correct as well. In fact, it is reassuring that both formulations can be found
using this method.
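For readers who want to see the extremization at work, here is a minimal numerical sketch (in Python, with our own function names) that iterates the saddle-point condition m = tanh(β(m + h)) and evaluates ϕ̃ at the resulting fixed point:

```python
import numpy as np

def phi_tilde(m, beta, h):
    # free entropy potential: -beta*m^2/2 + log(2*cosh(beta*(m+h)))
    return -0.5 * beta * m**2 + np.log(2 * np.cosh(beta * (m + h)))

def solve_saddle(beta, h, m0=0.5, tol=1e-12, max_iter=100000):
    # fixed-point iteration of the saddle-point condition m = tanh(beta*(m+h))
    m = m0
    for _ in range(max_iter):
        m_new = np.tanh(beta * (m + h))
        if abs(m_new - m) < tol:
            return m_new
        m = m_new
    return m

beta, h = 1.5, 0.1
m_star = solve_saddle(beta, h)
print("m* =", m_star, "  Phi(beta,h) =", phi_tilde(m_star, beta, h))
```

Scanning β at h = 0 with this helper reproduces the behaviour discussed in Appendix 1.B.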

Bibliography

The Curie-Weiss model is inspired from the pioneering works of the Frenchmen Curie (1895)
and Weiss (1907). The history of mean-field models and variational approaches in physics is
well described in Kadanoff (2009). The presentation of the rigorous solution of the Curie-Weiss
model follows Dembo et al. (2010a). The Laplace method was introduced by Pierre Simon
de Laplace in his revolutionary work (Laplace, 1774), where he founded the field of statistics.
The saddle point method was first published by Debye (1909), who in turn credited it to an
unpublished note by Riemann (1863). A classical reference on large deviations is Dembo et al.
(1996). The nice review by Touchette (2009) covers the large deviation approach to statistical
mechanics, and is a recommended read for physicists. Variational approaches in statistics and
machine learning are discussed in great detail in Wainwright and Jordan (2008).

1.6 Exercises

Exercise 1.1: Bounds on the binomial


We wish to prove the bounds on the binomial coefficient used in the derivation at the
beginning of the chapter:
$$\frac{e^{nH(k/n)}}{n+1} < \binom{n}{k} = \frac{n!}{k!(n-k)!} < e^{nH(k/n)}$$
where $H(p) = -p\log p - (1-p)\log(1-p)$ is the binomial entropy (we used $p = (1+m)/2$ in the chapter).

(a) Use the binomial theorem to prove that, for any 0 < p < 1,
$$1 = \sum_{i=0}^{n}\binom{n}{i}(1-p)^i p^{n-i}$$

(b) Using a particular value of p, and keeping only one term of the sum, show that
$$\binom{n}{k} < e^{nH(k/n)}$$

(c) If one makes n draws from a Bernoulli variable with probability of positive outcome
p = k/n, what is the most probable value for the number of positive outcomes?
Deduce that
$$(n+1)\binom{n}{k} > e^{nH(k/n)}$$

Exercise 1.2: Laplace method


In this exercise, we will be interested in the asymptotic behaviour (λ → ∞) of the
following class of real integrals:
$$I(\lambda) = \int_a^b dt\, h(t)\, e^{\lambda f(t)}$$

(a) Intuitively, what are the regions in the interval [a, b] which will contribute more to
the value of I(λ)?

(b) Suppose the function f(t) has a single global maximum at a point c ∈ [a, b] such
that f′′(c) < 0, and assume h(c) ≠ 0. Using a Taylor expansion for f, show that for
λ ≫ 1 we expect the integral to behave as follows:
$$I(\lambda) \underset{\lambda\gg 1}{=} \int_{c-\epsilon}^{c+\epsilon} dt\; h(c)\, e^{\lambda\left[f(c) + \frac{1}{2}f''(c)(t-c)^2\right]}$$
where ϵ > 0 is a positive but small real number.



(c) Using your result above, conclude that:
$$I(\lambda) \asymp \frac{h(c)\, e^{\lambda f(c)}}{\sqrt{-\lambda f''(c)/2}}\int_{\mathbb{R}} e^{-t^2}\, dt.$$

(d) (Gaussian integral) Show that:
$$\int_{\mathbb{R}} e^{-t^2}\, dt = \sqrt{\pi}$$

(e) Deduce Laplace's formula:
$$\int_a^b dt\, h(t)\, e^{\lambda f(t)} \asymp e^{\lambda f(c)}\, h(c)\,\sqrt{\frac{2\pi}{-\lambda f''(c)}}$$
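As a quick numerical sanity check of the formula, one can compare a numerical quadrature of I(λ) with Laplace's prediction on a toy example of our choosing (f(t) = −t², h(t) = cos t, so that c = 0, f″(c) = −2 and h(c) = 1); this minimal sketch assumes scipy is available:

```python
import numpy as np
from scipy.integrate import quad

# Toy example (our choice): f(t) = -t^2, h(t) = cos(t) on [-2, 2].
f = lambda t: -t**2
h = lambda t: np.cos(t)
c, fpp_c = 0.0, -2.0

for lam in [10, 100, 1000]:
    exact, _ = quad(lambda t: h(t) * np.exp(lam * f(t)), -2, 2, points=[c])
    laplace = h(c) * np.exp(lam * f(c)) * np.sqrt(2 * np.pi / (-lam * fpp_c))
    print(f"lambda = {lam:5d}   ratio numeric/Laplace = {exact / laplace:.6f}")
```

The ratio approaches 1 as λ grows, as expected.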

Exercise 1.3: Saddle point method

The saddle point method is a generalisation of Laplace’s method to the complex plane.
As before, we search for an asymptotic formula for integrals of the type:
$$I(\lambda) = \int_\gamma dz\, h(z)\, e^{\lambda f(z)}$$

where γ : [a, b] → C is a curve in the complex plane C and λ > 0 is a real positive
number which we will take to be large. If the complex function f is holomorphic on a
connected open set Ω ⊂ C, the integral I(λ) is independent of the curve γ. The goal is
therefore to choose γ wisely.

Part I: Geometrical properties of holomorphic functions


Let f : C → C be a holomorphic function, and let z = x + iy ∈ C for real x, y ∈ R.
Without loss of generality, we can write f (z) = u(x, y) + iv(x, y) for u, v : R2 → R
real-valued functions. The goal of this exercise is to study the properties of f around a
critical point f ′ (z0 ) = 0 for z0 ∈ C.

(a) Show that at a critical point, the gradients of u and v are zero.

(b) Using the Cauchy integral formula, show that for all z0 = x0 + iy0 in an open

convex set Ω where f is holomorphic we have:


$$u(x_0, y_0) = \frac{1}{2\pi}\int_0^{2\pi} d\theta\; u(x_0 + r\cos\theta,\, y_0 + r\sin\theta)$$
$$v(x_0, y_0) = \frac{1}{2\pi}\int_0^{2\pi} d\theta\; v(x_0 + r\cos\theta,\, y_0 + r\sin\theta)$$
for all circles of radius r > 0 centred at z0 contained inside Ω. This result is known
as the Mean Value Theorem in complex analysis.
(c) Conclude that neither u nor v can have a local extremum (maximum or minimum)
inside Ω. Therefore, all critical points z0 ∈ Ω are necessarily saddle points of u
and v.
(d) Let z0 be a critical point of f such that f ′′ (z0 ) ̸= 0. Using the Taylor series of f
around z0 and using the polar decompositions f ′′ (z0 ) = ρeiα , z − z0 = reiθ , find
the values of θ ∈ [0, 2π) corresponding to the two directions of steepest-descent of
u as a function of α in the complex plane.
Part II: Choosing the good curve γ
(a) What are the regions of γ which dominate the integral I(λ)?
(b) Let z0 be a critical point, f′(z0) = 0. Explain why we should choose γ to pass
through z0 following the steepest-descent directions of the real part Re[f].
(c) Show that for such a γ, we can rewrite the integral as:
$$I(\lambda) = e^{i\lambda\,\mathrm{Im}[f(z_0)]}\int_\gamma dz\, h(z)\, e^{\lambda\,\mathrm{Re}[f(z)]}$$

(d) Let γ(t) = x(t) + iy(t) for t ∈ [a, b] be a parametrisation of the curve passing through
z0 = γ(t0) through the steepest-descent direction of Re[f]. Letting f(t) = f(γ(t)),
h(t) = h(γ(t)), u(t) = Re[f(t)] and v(t) = Im[f(t)], show that the problem boils
down to the evaluation of the following integral:
$$\int_a^b dt\; \gamma'(t)\, h(t)\, e^{\lambda u(t)}$$

Part III: Back to Laplace’s method


(a) Suppose h(t0) ≠ 0, and note we can choose a parametrisation of γ such that
γ′(t0) ≠ 0. Use Laplace's method to show that I(λ) admits the following asymptotic
expansion for λ ≫ 1:
$$I(\lambda) \asymp h(t_0)\,\gamma'(t_0)\,\sqrt{\frac{2\pi}{-\lambda u''(t_0)}}\; e^{\lambda f(t_0)}$$

(b) Write the second derivative of f(t) with respect to t and show that at the critical
point z0 we have:
$$\frac{d^2 f(t_0)}{dt^2} = \gamma'(t_0)^2\,\frac{d^2 f(z_0)}{dz^2}$$

(c) Show that the second derivative f ′′ (t0 ) is necessarily real and negative. Conclude
that:

u′′ (t0 ) = −|f ′′ (z0 )||γ ′ (t0 )|2

(d) Let θ be the angle between the curve γ and the real axis at the critical point z0 , see
figure below. Show that:

γ ′ (t0 ) = |γ ′ (t0 )|eiθ

(e) Letting f′′(z0) = |f′′(z0)|e^{iα}, show that θ = (π − α)/2 or θ = (π − α)/2 + π depending
on the orientation of the curve γ.

(f) Conclude that:
$$I(\lambda) \asymp \pm h(z_0)\, e^{\lambda f(z_0)}\, e^{i\frac{\pi-\alpha}{2}}\sqrt{\frac{2\pi}{\lambda|f''(z_0)|}} = h(z_0)\, e^{\lambda f(z_0)}\sqrt{\frac{2\pi}{-\lambda f''(z_0)}}$$
where the ± is given by the orientation of the steepest-descent curve.

Exercise 1.4: Metropolis-Hastings algorithm

Consider again the Hamiltonian of the Curie-Weiss model. A very practical way to
sample configurations of N spins from the Gibbs probability distribution
$$P(S = s; \beta, h) = \frac{\exp(-\beta H_N(s; h))}{Z_N(\beta, h)} \qquad (1.24)$$
is the Monte-Carlo-Markov-Chain (MCMC) method, and in particular the Metropolis-
Hastings algorithm. It works as follows:
1. Choose a starting configuration for the N spins values si = ±1 for i = 1, . . . , N .
2. Choose a spin i at random. Compute the current value of the energy E_now and
the value of the energy E_flip if the spin i is flipped (that is, if S_i^new = −S_i^old).
3. Sample a number r uniformly in [0, 1] and, if r < e^{β(E_now − E_flip)}, perform the flip
(i.e. S_i^new = −S_i^old); otherwise leave it as it is.
4. Go back to step 2.
If one runs this program long enough, it is guaranteed that the final configuration S
will have been sampled with the correct probability.

(a) Write a code to perform the MCMC dynamics, and start from a configuration where
all spins are equal to S_i = 1. Take h = 0, β = 1.2, run your dynamics for a long
enough time (say, with t_max = 100N attempts to flip spins) and monitor the value
of the magnetization per spin m = Σ_i S_i/N as a function of time. Make a plot
for N = 10, 50, 100, 200, 1000 spins. Compare with the exact solution at N = ∞.
Remarks? Conclusions?

(b) Start from a configuration where all spins are equal to 1 and take h = −0.1, β = 1.2.
Monitor again the value of the magnetization per spin m = Σ_i S_i/N as a function
of time. Make a plot for N = 10, 50, 100, 200, 1000 spins. Compare with the exact
solution at N = ∞. Remarks? Conclusions?
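A minimal implementation sketch of the Metropolis-Hastings dynamics described above (our own code and variable names; the energy change of a single flip is computed from H = −M²/(2N) − hM with M = Σ_i S_i):

```python
import numpy as np

def metropolis_curie_weiss(N, beta, h, t_max, rng):
    # Metropolis-Hastings dynamics for the Curie-Weiss model (sketch).
    S = np.ones(N)            # start from the all-up configuration
    M = float(S.sum())        # total magnetization
    traj = []
    for t in range(t_max):
        i = rng.integers(N)
        s = S[i]
        # energy change if spin i is flipped, for H = -M^2/(2N) - h*M
        dE = (2 * s * M - 2) / N + 2 * h * s
        if rng.random() < np.exp(-beta * dE):
            S[i] = -s
            M -= 2 * s
        traj.append(M / N)
    return np.array(traj)

rng = np.random.default_rng(0)
m = metropolis_curie_weiss(N=1000, beta=1.2, h=0.0, t_max=100 * 1000, rng=rng)
print("final magnetization per spin:", m[-1])
```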

Exercise 1.5: Glauber Algorithm

An alternative local algorithm to sample from the measure eq. 1.24 is known as the
Glauber or heat bath algorithm. Instead of flipping a spin at random, the idea is to
thermalise this spin with its local environment.
Part I: The algorithm
(a) Let $\bar S = \frac{1}{N}\sum_{i=1}^N s_i$ be the magnetisation per spin of a system of N spins. Show that
for all i = 1, · · · , N, the probability of having a spin at S_i = ±1 given that all other
spins are fixed is given by:
$$P(S_i = \pm 1 \,|\, \{S_j\}_{j\neq i}) \equiv P_\pm = \frac{1 \pm \tanh(\beta(\bar S + h))}{2}$$
(b) The Glauber algorithm is defined as follows:
1. Choose a starting configuration for the N spins. Compute the magnetisation mt
and the energy Et corresponding to the configuration.
2. Choose a spin Si at random. Sample a random number uniformly r ∈ [0, 1]. If
r < P+ , set Si = +1, otherwise set Si = −1. Update the energy and magnetisa-
tion.
3. Repeat step 2 until convergence.
Write a code implementing the Glauber dynamics. Repeat items (a) and (b) of
exercise 1.4 using the same parameters. Compare the dynamics. Comment on the
observed differences.
Part II: Mean-field equations from Glauber
Let’s now derive the mean-field equations for the Curie-Weiss model from the Glauber
algorithm.
(a) Let m_t denote the magnetisation at time t, and define P_{t,m} = P(m_t = m). For
simplicity, consider β = 1 and h = 0. Show that for δt ≪ 1 we can write:
$$\begin{aligned} P_{t+\delta t,\, m} = \;& P_{t,\, m+\frac{2}{N}}\times\frac{1}{2}\left(1 + m + \frac{2}{N}\right)\times\frac{1 - \tanh(m + 2/N)}{2}\\ &+ P_{t,\, m-\frac{2}{N}}\times\frac{1}{2}\left(1 - m + \frac{2}{N}\right)\times\frac{1 + \tanh(m - 2/N)}{2}\\ &+ P_{t,m}\left[\frac{1+m}{2}\,\frac{1 + \tanh(m)}{2} + \frac{1-m}{2}\,\frac{1 - \tanh(m)}{2}\right]. \end{aligned}$$

This is known as the master equation.

(b) Defining the mean magnetisation with respect to P_{t,m}
$$\langle m(t)\rangle = \int dm\; m\, P_{t,m}$$
and using the master equation above, show that we can get an equation for the expected
magnetisation:
$$\begin{aligned} \langle m(t+\delta t)\rangle = \;& \int P_{t,\, m+2/N}\times\frac{1}{2}\left(1 + m + 2/N\right)\times\frac{1 - \tanh(m + 2/N)}{2}\times m\, dm\\ &+ \int P_{t,\, m-2/N}\times\frac{1}{2}\left(1 - m + 2/N\right)\times\frac{1 + \tanh(m - 2/N)}{2}\times m\, dm\\ &+ \int P_{t,m}\left[\frac{1+m}{2}\times\frac{1 + \tanh(m)}{2} + \frac{1-m}{2}\times\frac{1 - \tanh(m)}{2}\right]\times m\, dm \end{aligned}$$

(c) Making the change of variables m → m + 2/N in the first integral and m → m − 2/N
in the second, and choosing δt = 1/N, conclude that for N → ∞ we can write the
following continuous dynamics for the mean magnetisation:
$$\frac{d}{dt}\langle m(t)\rangle = -\langle m(t)\rangle + \tanh\langle m(t)\rangle$$

(d) Conclude that the stationary expected magnetisation satisfies the Curie-Weiss
mean-field equation. Generalise to arbitrary β and h.

(e) We can now repeat the experiment of the previous exercise, but using the theoretical
ordinary differential equation: start from a configuration where all spins are equal to
1 and take different values of h and β. For which values will the Monte-Carlo chain
reach the equilibrium value? When will it be trapped in a spurious maximum
of the free entropy ϕ(m)? Compare your theoretical prediction with numerical
simulations.
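A minimal sketch (our own code) that runs the heat-bath dynamics of Part I and compares its magnetization with an Euler discretization of the mean-field ODE derived in Part II, generalized to arbitrary β and h:

```python
import numpy as np

def glauber_curie_weiss(N, beta, h, sweeps, rng):
    # Heat-bath (Glauber) dynamics: re-thermalize one randomly chosen spin at a time.
    S = np.ones(N)
    M = float(N)                      # running sum of the spins
    m_traj = []
    for t in range(sweeps * N):
        i = rng.integers(N)
        p_plus = 0.5 * (1 + np.tanh(beta * (M / N + h)))
        s_new = 1.0 if rng.random() < p_plus else -1.0
        M += s_new - S[i]
        S[i] = s_new
        if t % N == 0:
            m_traj.append(M / N)
    return np.array(m_traj)

def mean_field_magnetization(beta, h, m0, t_max, dt=0.01):
    # Euler integration of d<m>/dt = -<m> + tanh(beta*(<m> + h))
    m = m0
    for _ in range(int(t_max / dt)):
        m += dt * (-m + np.tanh(beta * (m + h)))
    return m

rng = np.random.default_rng(1)
sim = glauber_curie_weiss(N=2000, beta=1.2, h=0.0, sweeps=50, rng=rng)
print("Glauber, final m:", sim[-1])
print("ODE,     final m:", mean_field_magnetization(beta=1.2, h=0.0, m0=1.0, t_max=50.0))
```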

Exercise 1.6: Potts model


The Potts model is a variant of the Ising model where there can be more than two
states: here the Potts spins can take up to q values. It is one of the most fundamental
models of statistical physics. In its fully connected version, the Hamiltonian reads:
$$H = \frac{1}{N}\sum_{i,j}\left(1 - \delta_{\sigma_i,\sigma_j}\right)$$

with σi = 1 . . . q, for all i = 1 . . . N .

(a) Using the variational approach of Section 1.3, write the Gibbs free-energy of the
Potts model, as a function of the fractions {ρ_τ}_{τ=1,...,q} of spins in state τ.

(b) Write the mean-field self-consistent equation governing the {ρ_τ} by extremizing
the Gibbs free-energy and solve it numerically.

(c) Using the cavity approach of section 1.4, show that one can recover the mean-field
self-consistent equation using this method.
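A possible numerical solution of item (b) is sketched below. We assume the convention that the variational free entropy per spin is ϕ({ρ}) = −β(1 − Σ_τ ρ_τ²) − Σ_τ ρ_τ log ρ_τ, whose stationarity under Σ_τ ρ_τ = 1 gives the fixed-point equation ρ_τ ∝ e^{2βρ_τ}; the prefactor in the exponent depends on the chosen normalization of the Hamiltonian.

```python
import numpy as np

def potts_mean_field(q, beta, rho0, tol=1e-12, max_iter=10000):
    # Fixed-point iteration of rho_tau ∝ exp(2*beta*rho_tau), the stationarity
    # condition of the variational free entropy written in the text above.
    rho = np.array(rho0, dtype=float)
    rho /= rho.sum()
    for _ in range(max_iter):
        new = np.exp(2 * beta * rho)
        new /= new.sum()
        if np.max(np.abs(new - rho)) < tol:
            break
        rho = new
    return rho

q, beta = 3, 1.6
# a slightly biased initialization selects the ferromagnetic solution when it exists
print(potts_mean_field(q, beta, rho0=[0.5] + [0.25] * (q - 1)))
print(potts_mean_field(q, beta, rho0=[1.0 / q] * q))   # uniform (paramagnetic) fixed point
```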
Appendix

1.A The jargon of Statistical Physics

In these notes we will often adopt a terminology from Statistical Physics. While this is standard
for someone who already took a course on the subject, it can often be confusing for newcomers
from other fields.

As we have discussed in the introduction, a system is defined by a set of degrees of freedom


(d.o.f.) {xi }i∈I and a Hamiltonian function H ({xi }i∈I ) assigning an energy or cost to each
configuration of d.o.f. For example, in the Curie-Weiss model the d.o.f. are spins xi ∈ {−1, 1}
indexed by i = 1, · · · , N , and the Hamiltonian is given by equation 1.1. Note however that the
index set I and the state space X can be more general. For instance, in Chapter 6 we will discuss
the Graph Colouring problem, an example of a system where the d.o.f. are defined in the
nodes of a graph G, and they can take q different values, such that X = {1, · · · , q}. Instead, in
Chapter 16 we will study inference problems in which the state space can be uncountable, such
as the weights of a single-layer neural network. Note that we can even study systems indexed
by an uncountable set — e.g. I ⊂ R — in which case it is common to refer to the d.o.f. as fields.
However, in these notes we will be only dealing with discrete and finite (therefore countable)
indices. Therefore, it will be convenient to adopt a vector notation for a configuration x ∈ X N ,
where N ≡ |I| is referred to as the system size. Quantities that scale with the system size N
are often referred to as extensive (in mathematical notation we write O_N(N)), while quantities
that do not scale with the system size are often called intensive (and we write O_N(1)). Typical
extensive quantities are the energy, entropy and volume, while typical intensive quantities
are the temperature and pressure. Note that we can always produce an intensive quantity
by considering the density of an extensive quantity (dividing it by N ). The limit of infinite
system size N → ∞ is called the thermodynamic limit, and a major part of Statistical Mechanics
is devoted to studying the behavior of intensive quantities in the thermodynamic limit.

In Classical Mechanics, the goal is to study the microscopic properties of a system, for instance
the trajectory of each molecule of a gas or the dynamics of each neuron of a neural network
during training. Instead, in Statistical Mechanics the goal is to study macroscopic properties of
the system, which are collective properties of the d.o.f. In our previous examples, a macroscopic
property of the gas is its mean energy, its temperature or its pressure, while a macroscopic
property of the neural network is the generalization error. To make these notions more precise,
in Statistical Physics we define an ensemble over the configurations, which is simply a probability
measure over the space of all possible configurations. Different ensembles can be defined for
the same system, but in these notes we will mostly focus on the canonical ensemble, which is

defined by the Boltzmann-Gibbs distribution:

$$\mathbb{P}(X = x) = \frac{1}{Z_N(\beta)}\, e^{-\beta H(x)} \qquad (1.25)$$

where the normalization constant ZN (β) is known as the partition function. Note that the
partition function is closely related to the moment generating function for the energy. From
this probabilistic perspective, a configuration x ∈ X N is simply a random sample from the
Boltzmann-Gibbs distribution, and a macroscopic quantity can be seen as a statistic from the
ensemble. Physicists often denote the average with respect to the Boltzmann-Gibbs distribution
with brackets ⟨·⟩β and refer to it as a thermal average. We now define the most important quantity
in these notes, the free energy density:

$$-\beta f_N(\beta) = \frac{1}{N}\log Z_N(\beta). \qquad (1.26)$$
Note that since the Hamiltonian is typically extensive H = O(N ), the free entropy (the
logarithm of the partition sum) is also extensive, and therefore its density is an intensive
quantity. It is closely related to the cumulant generating function for the energy. For this
reason, and as discussed in the introduction, from the free entropy we can access many of the
important macroscopic properties of our system. Two macroscopic quantities physicists are
often interested in are the energy and entropy densities:
 
$$e_N(\beta) = \left\langle\frac{1}{N}H(x)\right\rangle_\beta = \partial_\beta\big(\beta f_N(\beta)\big), \qquad s_N(\beta) = \beta^2\partial_\beta f_N(\beta) \qquad (1.27)$$

which satisfy the useful relation:

$$f_N(\beta) = e_N(\beta) - \frac{1}{\beta}s_N(\beta) \qquad (1.28)$$

1.B The ABC of phase transitions

One of the main goals of the Statistical Physicist is to characterize the different phases of a
system. A phase can be loosely defined as a region of parameters defining the model that share
common macroscopic properties. For example, the Curie-Weiss model studied in Chapter
1 is defined by the parameters (β, h) ∈ R+ × R, and we have identified two phases in the
thermodynamic limit: a paramagnetic phase characterised by no net system magnetization and
a ferromagnetic phase characterized by net system magnetization. The macroscopic quantities
characterising the phase of the system (in this case the net magnetization) are known as the
order parameters for the system. In Chapter 16 we will study a model for Compressive Sensing,
which is the problem of reconstructing a compressed sparse signal corrupted by noise. The
parameters of this system are the sparsity level (density of non-zero elements) ρ ∈ [0, 1] and
the noise level ∆ ∈ R+ , and we will identify the existence of three phases: one in which
reconstruction is easy, one in which it is hard and one in which it is impossible. The order
parameter in this case will be the correlation between the estimator and the signal. Although
all examples studied here have clear order parameters, it is not always easy to identify one,
and it might not always be unique. For instance, in the Compressive sensing example we
could also have chosen the mean squared error as an order parameter.

When a system changes phase by varying a parameter (say the temperature), we say the system
undergoes a phase transition, and we refer to the boundary (in parameter space) separating the
two phases as the phase boundary. In Physics, we typically summarize the information about
the phases of a system with a phase diagram, which is just a plot of the phase boundaries in
parameter space. See Figure 1.B.1 for two examples of well-known phase diagrams in Physics.

Figure 1.B.1: Phase diagrams of water (left) and of the cuprate (right), from Taillefer (2010)
and Schwarz et al. (2020) respectively.

Phase transitions manifest physically by a macroscopic change in the behavior of the system
(think about what happens to the water when it starts boiling). Therefore, the reader who
followed our discussion from Chapter 1 and Appendix 1.A should not be surprised by the fact
that phase transitions can be characterised and classified from the free energy. Indeed, the
classification of phase transitions in terms of the analytical properties of the free energy dates
back to the work of Paul Ehrenfest in 1933 (see Jaeger (1998) for a historical account). At this
point, the mathematically inclined reader might object: the free energy from equation 1.26 is
an analytic function of the model parameters, so how can it change behavior across phases?
Indeed, for finite N the free energy is an analytic function of the model parameters. However,
the limit of a sequence of analytic functions need not be analytic, and in the thermodynamic
limit the free energy can develop singularities. Studying the singular behavior of the limiting
free energy is the key to Ehrenfest’s classification of phase transitions. The two most common
types of phase transitions are:

First order phase transition: A first order phase transition is characterised by the disconti-
nuity in the first derivative of the limiting free energy with respect to a model parameter.
The most common example is the transition of water from a liquid to a gas as we change
the temperature at fixed pressure (what you do when you cook pasta), see Fig. 1.B.1
(left). In this example, the derivative of the free energy with respect to the temperature,
also known as the entropy, discontinuously jumps across the phase boundary.

Second order phase transition: A second order phase transition is characterised by a dis-
continuity in the second derivative of the limiting free energy with respect to a model
parameter. Therefore, in a second order transition the free energy itself and its first
derivative are continuous. Note that second order derivatives of the free energy are
typically associated with response functions such as the susceptibility. Perhaps the most

famous example of a second order transition is the spontaneous magnetization of certain


metals as a function of temperature, known as ferromagnetic transition.

Although less commonly used, we can define an n-th order phase transition in terms of the
discontinuity of the n-th derivative of the limiting free energy. The order of a phase transition
is associated to a rich phenomenology, which we now discuss, for the sake of concreteness, in
our favourite model: the Curie-Weiss model for ferromagnetism.

Phase transitions in the Curie-Weiss model Recall that in Chapter 1 we have computed the
thermodynamic limit of the free energy density for the Curie-Weiss model:
$$-\beta f_\beta = \lim_{N\to\infty}\frac{1}{N}\log Z_N = \max_{m\in[-1,1]}\phi(m)$$
where:
$$\phi(m) = \frac{\beta}{2}m^2 + \beta h m + H(m)$$
$$H(m) = -\frac{1+m}{2}\log\frac{1+m}{2} - \frac{1-m}{2}\log\frac{1-m}{2}.$$
As we have shown, the parameter m⋆ solving the maximization problem above gives the order
parameter of the system, the net magnetization at equilibrium:
$$m^\star = \underset{m\in[-1,1]}{\mathrm{argmax}}\ \phi(m) = -\partial_h f(\beta, h) = \langle\bar S\rangle_\beta \qquad (1.29)$$

In particular, the limiting average energy and entropy densities are given by:
$$e(\beta,h) = \partial_\beta\big(\beta f(\beta,h)\big) = -m^\star\left(\frac{m^\star}{2} + h\right), \qquad s(\beta,h) = \beta^2\partial_\beta f(\beta,h) = H(m^\star) \qquad (1.30)$$

In particular, note that the entropy density depends on the model parameters (β, h) only
indirectly through the magnetization m⋆ = m⋆ (β, h). The potential ϕ(m) is an analytic function
of m and the parameters (β, h). However, due to the optimization over m, the free energy
density can develop a non-analytic behavior as a function of (β, h), signaling the presence of
phase transitions, which we now recap.

Zero external field and the second order transition: Note that the decomposition f = e − s/β
(see equation 1.28) makes it explicit that the free energy is a competition between two
parabolas: the energy (convex) and the entropy (concave). At zero external field h = 0,
we note that the potential is a symmetric function of the magnetization, ϕ(−m) = ϕ(m).
At high temperatures β → 0+, the dominant term is given by the entropy, and the free energy has
a single global minimum at m⋆ = 0 (see Fig. 1.1): this is the paramagnetic phase in
which the system has no net magnetization. At the critical temperature βc = 1, m = 0
becomes a maximum of the free energy, with two global minima (having the same free
energy) continuously appearing, see Fig. 1.4. This signals a phase transition towards a
ferromagnetic phase defined by a net system magnetization |m⋆| > 0. Note that the first
derivative of the free energy with respect to β (proportional to the entropy) remains a
continuous function across the transition. However, we notice that the second derivative

Figure 1.B.2: (Left) Entropy as a function of the inverse temperature β at zero external field
h = 0. Note that the entropy is a continuous function of the temperature, with a cusp at the
critical point βc = 1, indicating that its derivative (proportional to the second derivative of the
free energy) has a discontinuity. (Right) Convergence time of the saddle-point equation as a
function of the inverse temperature β at zero external field h = 0. Note the critical slowing
down close to the second order critical point βc = 1.

of the free energy is discontinuous, indicating this is a second order phase transition. This
transition corresponds to a significant change in the statistical behavior of the system at
macroscopic scales: while for β < 1 a typical configuration from the Boltzmann-Gibbs
distribution has no net magnetization m⋆ = ⟨S̄⟩β ≈ 0 (disordered phase), for β > 1 a
typical configuration has a net magnetization |m⋆ | = |⟨S̄⟩β | > 0 (ordered phase). This is
an example of an important concept in Physics known as spontaneous symmetry breaking:
while the Hamiltonian of the system is invariant under the Z2 symmetry S̄ → −S̄, for
β > 1 a typical draw of the Gibbs-Boltzmann distribution S ∼ PN,β breaks this symmetry
at the macroscopic level. Second order transitions carry a rich phenomenology. Since
the transition is second order (i.e. continuous first derivative), the critical temperature
can be obtained by studying the expansion of the free energy potential around m = 0:

$$\phi(m) \underset{m\to 0}{=} \log 2 + (\beta - 1)\frac{m^2}{2} + O(m^3)$$
which gives us the critical βc = 1 as the point at which the second derivative changes
sign (m = 0 goes from a maximum to a minimum of the potential ϕ). It is also useful to have the picture
in terms of the saddle-point equation:

m = tanh(βm).

The fact that m = 0 is always a fixed point of this equation signals it is always an
extremizer of the free energy potential. From this perspective, the critical temperature
βc = 1 corresponds to a change of stability of this fixed point. Seeing the saddle-point
equations as a discrete dynamical system mt+1 = f (mt ), the stability of a fixed point
can be determined by looking at the Jacobian of the update function f : [−1, 1] → [−1, 1]
around the fixed point m = 0:

$$f(x) = \tanh(\beta x) \underset{x\to 0}{=} \beta x + O(x^3) \qquad (1.31)$$

For β < 1, the fixed point is stable (an attractor/sink of the dynamics), while for β > 1
it becomes unstable (a repeller/source of the dynamics). Note that this implies that
close to the transition β ≈ 1+, iterating the saddle point equations starting close to zero,
m_{t=0} = ϵ ≪ 1 (but not exactly at zero), takes a long time to converge to a non-zero magnetization
m > 0, with the time diverging as we get closer to the transition. This phenomenon is
known in Physics as critical slowing down, and together with the expansion of the free
energy and the stability analysis of the equations it gives yet another way to characterise a
second order critical point. See Figure 1.B.2 (right) for an illustration.
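The critical slowing down is easy to observe numerically: the following minimal sketch (our own helper) counts the number of iterations of m ← tanh(βm) needed to reach a fixed point, starting from a small positive magnetization; the count blows up as β approaches βc = 1 from either side.

```python
import numpy as np

def convergence_time(beta, m0=1e-3, tol=1e-8, max_iter=10**7):
    # Number of iterations of m <- tanh(beta*m) needed to reach a fixed point.
    m = m0
    for t in range(max_iter):
        m_new = np.tanh(beta * m)
        if abs(m_new - m) < tol:
            return t
        m = m_new
    return max_iter

for beta in [0.8, 0.95, 1.05, 1.2, 1.5]:
    print(beta, convergence_time(beta))
```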


Figure 1.B.3: (Left) Free energy potential ϕ(m) as a function of m for fixed inverse temperature
β = 1.5 and varying external field h < 0. Note that the free energy potential has a local
extremum corresponding to the metastable state for |h| < h_sp, which disappears at the spinodal transition |h| = h_sp. (Right) Free energy
as a function of the external field h at different temperatures. Note the non-analytic cusp at
h = 0.

Finite external field and the first order transition: Turning on the external magnetic field
h ̸= 0 can dramatically change the discussion above. First, note that the Hamiltonian
loses the Z2 symmetry: this is known in Physics as explicit symmetry breaking. At high
temperatures β → 0+ , the free energy potential is convex, with a single minimum at
m = h aligned with the field. As temperature is lowered and we enter what previously
was the ferromagnetic phase (β > 1), two behaviors are possible. For small h, the
field simply has the effect of breaking the symmetry between the previous two global
minima and making the one with opposite sign a local minimum, see Fig. 1.B.3 (left). In this
situation, even though the equilibrium free energy is given by the now unique global
minimum of the potential, the presence of a local minimum has an important effect in the
dynamics. Indeed, if we initialize the saddle-point equations close to the magnetization
corresponding to the local minimum, it will converge to this local minimum, since it
is also a stable fixed point of the corresponding dynamical system, see Fig. 1.B.4 (left).
This phenomenon is known as metastability in Physics. Note that metastability can be a
misleading name, since in the thermodynamic limit N → ∞ metastable states are stable
fixed points of the free energy potential. However, at finite system size N , the system
will dynamically reach equilibrium in a time of order t = O(eN ). Metastability will play
a major role in the Statistical Physics analysis of inference problems, since it is closely
related to algorithmic hardness.

As the external field h is increased, the difference in the free energy potential between the
two minima increases, and eventually at a critical field hsp , known as the spinodal point,
the local minimum disappears, making the potential convex again, see Fig. 1.B.3 (left).
The spinodal points can be derived from the expression of the free energy potential, and


Figure 1.B.4: (Left) Stable, metastable and unstable branches of the magnetization as a function
of the external field at fixed inverse temperature β = 1.5. (Right) Magnetization obtained
by iterating the saddle-point equations from different initial conditions m_{t=0}, as a function of
the external field h at fixed inverse temperature β = 1.5. Note the hysteresis loop: the point at
which the magnetization discontinuously jumps from negative to positive depends on the
initial state of the system.

is given by:
$$h_{sp}(\beta) = \pm\sqrt{1-\frac{1}{\beta}}\;\mp\;\frac{1}{\beta}\tanh^{-1}\!\left(\sqrt{1-\frac{1}{\beta}}\right), \qquad \beta > 1$$

From this discussion, it is clear that for β > 1 the magnetization (which is proportional to the derivative
of the free entropy with respect to h) has a discontinuity at h = 0, since the magnetization
jumps from −|m⋆| for h → 0⁻ to +|m⋆| for h → 0⁺.
This is a first order phase transition of the system with respect to the external field h, see
Fig. 1.B.3 (right). Note that as a consequence of metastability, in the region |h| < |h_sp|
the system magnetization will depend on the state in which it was initially prepared.
This memory of the initial state is known as hysteresis in Physics, see Fig. 1.B.4 (right).
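The hysteresis loop can be reproduced with a few lines of code; here is a minimal sketch (our own helper) that iterates the saddle-point equation m ← tanh(β(m + h)) from positively and negatively magnetized initial states while scanning the field:

```python
import numpy as np

def iterate(beta, h, m0, n_iter=2000):
    # Iterate the saddle-point equation m <- tanh(beta*(m+h)) starting from m0.
    m = m0
    for _ in range(n_iter):
        m = np.tanh(beta * (m + h))
    return m

beta = 1.5
fields = np.linspace(-0.3, 0.3, 13)
up   = [iterate(beta, h, m0=+0.5) for h in fields]   # prepared with positive magnetization
down = [iterate(beta, h, m0=-0.5) for h in fields]   # prepared with negative magnetization
for h, mu, md in zip(fields, up, down):
    print(f"h = {h:+.3f}   m(from +) = {mu:+.3f}   m(from -) = {md:+.3f}")
```

For |h| smaller than the spinodal field the two preparations disagree, which is the hysteresis of Fig. 1.B.4 (right).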

1.C A rigorous version of the cavity method

The cavity method presented in the main chapter can also be done entirely rigorously, as we
now show. This appendix can be skipped for non mathematically-minded readers, although
it is always interesting to see how things can be done precisely. In fact, this section is a good
training for the later Chapters where the rigorous proofs are more involved, despite using the
very same techniques.

First, we want to show that the magnetization, as well as many other observables, does indeed
converge to a fixed value as N increases. This is done through the following lemma that tells
us that, if indeed S̄ concentrates, then any observable will concentrate as well.

Lemma 5. For any bounded observable O({S}_N) there exists a constant C = ∥O(·)∥_∞ such that:
$$\big|\langle O(S)\rangle_{N+1,\beta'} - \langle O(S)\rangle_{N,\beta}\big| \le C\beta\sinh(\beta h + \beta)\sqrt{\mathrm{Var}_{N,\beta}(\bar S)} \qquad (1.32)$$

Proof. The proof proceeds as follows. First we compute directly
$$\langle O(S)\rangle_{N+1,\beta'} = \frac{\langle O(S)\cosh(\beta(\bar S + h))\rangle_{N,\beta}}{\langle\cosh(\beta(\bar S + h))\rangle_{N,\beta}}$$
From this, it follows that
$$\begin{aligned}
\langle O(S)\rangle_{N+1,\beta'} - \langle O(S)\rangle_{N,\beta} &= \frac{\langle O(S)\cosh(\beta(\bar S + h))\rangle_{N,\beta} - \langle O(S)\rangle_{N,\beta}\langle\cosh(\beta(\bar S + h))\rangle_{N,\beta}}{\langle\cosh(\beta(\bar S + h))\rangle_{N,\beta}}\\
&\le \langle O(S)\cosh(\beta(\bar S + h))\rangle_{N,\beta} - \langle O(S)\rangle_{N,\beta}\langle\cosh(\beta(\bar S + h))\rangle_{N,\beta}\\
&= \mathrm{COV}_{N,\beta}\big(O(S),\, \cosh(\beta(\bar S + h))\big)
\end{aligned}$$
where we have used the fact that cosh(x) ≥ 1. When one sees a covariance, one should always
try to use the Cauchy-Schwarz inequality, which states that Cov²(X, Y) ≤ Var(X)Var(Y). In our
case, this leads to
$$\big(\langle O(S)\rangle_{N+1,\beta'} - \langle O(S)\rangle_{N,\beta}\big)^2 \le \mathrm{Var}(O(S))\,\mathrm{Var}\big(\cosh(\beta(\bar S + h))\big) \qquad (1.33)$$
Finally we note that cosh(β(S̄ + h)) is a convex Lipschitz function of the variable S̄ with
constant β sinh(β(1 + h)) (since S̄ is bounded by one), and therefore by applying Jensen's
inequality, and then the Lipschitz property, we have
$$\cosh(\beta(\bar S + h)) - \langle\cosh(\beta(\bar S + h))\rangle \le \cosh(\beta(\bar S + h)) - \cosh(\beta(\langle\bar S\rangle + h)) \le \beta\sinh(\beta(1+h))\,(\bar S - \langle\bar S\rangle)$$
so that
$$\Big\langle\big(\cosh(\beta(\bar S + h)) - \langle\cosh(\beta(\bar S + h))\rangle\big)^2\Big\rangle \le \beta^2\sinh(\beta(1+h))^2\,\mathrm{Var}(\bar S)$$
Plugging this relation into (1.33) finishes the proof.

We can now prove the main thesis and obtain the mean field equation:
Theorem 5. There exists a constant C(β, h) such that
$$\big|\langle S_i\rangle_{N,\beta} - \tanh\beta\big(h + \langle\bar S\rangle_{N,\beta}\big)\big| \le C(\beta,h)\sqrt{\mathrm{Var}(\bar S)}$$

Proof. By direct computation, we get that
$$\langle S_i\rangle_{N+1,\beta'} = \langle S_0\rangle_{N+1,\beta'} = \frac{\langle\sinh(\beta(\bar S + h))\rangle_{N,\beta}}{\langle\cosh(\beta(\bar S + h))\rangle_{N,\beta}}$$
Using again Jensen and the Lipschitz property of the cosh and sinh, we get concentration of
both terms in the fraction as
$$\big|\langle\sinh(\beta(\bar S + h))\rangle - \sinh(\beta(\langle\bar S\rangle + h))\big| \le \beta\cosh(\beta(1+h))\sqrt{\mathrm{Var}(\bar S)}$$
$$\big|\langle\cosh(\beta(\bar S + h))\rangle - \cosh(\beta(\langle\bar S\rangle + h))\big| \le \beta\sinh(\beta(1+h))\sqrt{\mathrm{Var}(\bar S)}$$
Together with the inequality |a_1/b_1 − a_2/b_2| ≤ |a_1 − a_2|/b_1 + a_2|b_1 − b_2|/(b_1 b_2), we get, using
a_i ≥ 0 and b_i ≥ max(1, a_i), that
$$\big|\langle S_i\rangle_{N+1,\beta'} - \tanh\big(\beta(h + \langle\bar S\rangle_{N,\beta})\big)\big| \le C(\beta,h)\sqrt{\mathrm{Var}(\bar S)}$$
which, using Lemma 5, finishes the proof.

All that is left to do to get the mean field equation is to show that the variance of the magnetization
goes to zero outside of the phase transition/coexistence line. This is not entirely easy to do,
but we can easily show that this is true almost everywhere in the plane (β, h) using the so-called
fluctuation-dissipation approach:

Lemma 6 (Bound on the variance). For any β, and values h_1, h_2 of the magnetic field, one has
$$\int_{h_1}^{h_2}\mathrm{Var}(\bar S)_{\beta,N,h}\, dh \le \frac{2}{\beta N}$$
so that Var(S̄)_{β,N,h} ≤ 2/N for almost every h.

Proof. The proof starts by noticing that $\langle\bar S\rangle_{\beta,N} = \frac{\partial}{\partial(\beta h)}\Phi_N(\beta,h)$ and, by direct computation,
$$N\,\mathrm{Var}(\bar S)_{\beta,N,h} = \frac{\partial^2}{\partial(\beta h)^2}\Phi_N(\beta,h) = \frac{\partial}{\partial(\beta h)}\langle\bar S\rangle_{\beta,N}$$
Therefore
$$\int_{\beta h_1}^{\beta h_2} N\,\mathrm{Var}(\bar S)_{\beta,N,h}\; d(\beta h) = \langle\bar S\rangle_{\beta,N,h_2} - \langle\bar S\rangle_{\beta,N,h_1} \le 2$$

We can also prove the formula for the free entropy, using a technique that shall be very useful for more complex
problems:

Theorem 6 (Free entropy by the cavity method).
$$\Phi(\beta, h) = \max_m \tilde\phi(m; \beta, h)$$

Proof. Writing the seemingly trivial equality:
$$\phi_N(\beta,h) = \frac{1}{N}\log Z_N = \frac{1}{N}\log\left(\frac{Z_N}{Z_{N-1}}\,\frac{Z_{N-1}}{Z_{N-2}}\cdots\frac{Z_1}{1}\right) = \frac{1}{N}\sum_{n=0}^{N-1} A_n(\beta,h)\,, \qquad A_N(\beta,h) := \log\frac{Z_{N+1}}{Z_N}$$
we have, by the Stolz–Cesàro theorem, that if A_n converges to A, then Φ_N(β,h) also converges
to A. Using equation 1.21 we thus write
$$A_N(\beta,h) := \log\frac{Z_{N+1}}{Z_N} = \log\big\langle e^{-\frac{\beta}{2}\bar S^2}\, 2\cosh(\beta(\bar S + h))\big\rangle_{N,\beta} + o(1)\,.$$

By Lemmas 5 and 6 we thus obtain, for any β > 0 and almost everywhere in h, that
$$\lim_{N\to\infty} A_N(\beta,h) = -\beta\frac{(m^*)^2}{2} + \log 2\cosh(\beta(m^* + h))$$
with m* = ⟨S̄⟩. Therefore, almost everywhere in the field, we have, following eq. (1.21), that the
free entropy is given by one of the extrema m* of eq. (1.21):
$$\Phi(\beta,h) = \tilde\phi(\langle\bar S\rangle)\,.$$

It just remains to show that the correct extremum, if there are many of them, is the maximal
one. This can be done by noting that it is necessarily the maximum, since ϕ̃(m*) = ϕ(m*), and
that, by the Gibbs variational approach (theorem 4), we already proved that Φ ≥ ϕ(m) ∀m. Given
that the free entropy is Lipschitz continuous in h for all N (its derivative is the magnetization, which is
bounded), its limit is continuous, so that if the free entropy is given by the maximum of
ϕ(m) almost everywhere, it is true everywhere.
Chapter 2

A simple example: The Random Field


Ising Model

Let’s start at the very beginning. A very good place to start.

The sound of music


1965

Let us move to a more challenging example. We continue to consider a system of N spins


si ∈ {±1} with Hamiltonian
$$H_N(s, h) = -\frac{N}{2}\left(\sum_i\frac{s_i}{N}\right)^2 - \sum_i h_i s_i\,,$$

but now, the additional fields h are fixed, once and for all. We choose $h_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \Delta)$.

This is a simple variation on the Ising model, but now we have these new random fields h. This
makes the problem a bit more complicated. For this reason, it is called the Random Field Ising
Model, or RFIM. We may ask many questions. For instance: what is the assignment of the spins
that minimizes the energy? This is a non-trivial question since there is a competition between
aligning all variables together in the same direction, and aligning them to the direction of
the local random fields. What will be the energy of this assignment? Will the energy be very
different when one picks up another random value for h? How do we find such an assignment
in practice? With which algorithm?

2.1 Self-averaging and Concentration

We shall be interested in the behavior of this system in the large N limit. How are we going to
deal with the random variables? The entire field of statistical mechanics of disordered systems,
and its application to optimization and statistics, is based on the idea of self-averaging: that is,
the idea that the particular realization of h does not matter in the large size limit N → ∞ (the

asymptotic limit for mathematicians, or the "thermodynamic" limit for physicists). This is a
powerful idea which allows us to average over all h.

Let us see how this can be proven rigorously. First, a word of caution, the partition sum ZN is
exponentially large in N , so we expect its fluctuation around its mean to be very large as well.
Fortunately, as we have seen, our focus will be on log ZN (h)/N , which we hope converges to
a constant O(1) value as N → ∞. The quantity log ZN (h)/N is random, it depends on the
value of h, but we may expect that the typical value of log ZN (h)/N is close to its mean. This
notion, that a large enough system is close to its mean, is called "self-averaging" in statistical
physics. In probability theory, this is a concentration of measure phenomenon.

Indeed for the Random Field Ising model, we can show that the free entropy concentrates
around its mean value for large N :

Theorem 7 (Self-averaging). Let Φ_N(h, β) := log Z_N(β, h)/N be the free entropy density of the
RFIM, then:
$$\mathrm{Var}[\Phi_N(h,\beta)] \le \frac{\Delta\beta^2}{N}$$

Proof. The proof is a consequence of the very useful Gaussian Poincaré inequality (see the
theorem in exercise 9): Suppose f : Rn 7→ R is a smooth function and X has multivariate
Gaussian distribution X ∼ N (0, Γ), where Γ ∈ Rn×n . Then

Var[f (X)] ≤ E [Γ∇f (X) · ∇f (X)] .

For the RFIM, given that
$$\partial_{h_i}\Phi_N(h,\beta) = \frac{\beta}{N}\langle S_i\rangle$$
we find that
$$\nabla\Phi_N(h,\beta)\cdot\nabla\Phi_N(h,\beta) = \frac{\beta^2}{N}\sum_i\frac{\langle S_i\rangle^2}{N} \le \frac{\beta^2}{N}\,,$$

and the Gaussian Poincaré inequality with Γ = ∆I (covariance matrix of h) yields the final
result.

Hence, instead of computing the free entropy density for each realization of h, we turn to
computing the expectation over all possible h, i.e. we define
$$\Phi(\beta,\Delta) \triangleq \lim_{N\to\infty}\mathbb{E}_h\,\Phi_N(\beta,\Delta,h) = \lim_{N\to\infty}\frac{1}{N}\mathbb{E}_h[\log Z_N(\beta,h)]$$
and the self-averaging property guarantees that $\mathbb{E}_h[\log Z_N(\beta,h)]/N$ is close to $\log Z_N(\beta,h)/N$
when N is large. In probability theory, when the variance goes to zero one says that the random
variable converges in probability. Here, we have thus shown that Φ_N(β,∆,h) converges in
probability to Φ(β,∆). In fact, we could work a bit more and show that the probability to
have a deviation larger than the variance is exponentially low, so we are definitely safe, even
at moderate values of N. This leaves us with the question of how to compute Φ(β,∆), as it
involves the expectation of a logarithm.
involves the expectation of a logarithm.

2.2 Replica Method

In order to compute the average of the logarithm, a powerful heuristic method has been used
widely in statistical physics: the replica method, proposed in the 70’s by Sir Sam Edwards
(who credits it to Marc Kac). Here is the argument: Suppose n is close to zero, then

$$Z^n = e^{n\log Z} = 1 + n\log(Z) + o(n) \quad\Longrightarrow\quad \log Z = \lim_{n\to 0}\frac{Z^n - 1}{n}$$

If Z is a random variable and we suppose that swapping limit and expectation is valid (which
is by no means evident), then we find the following identity, often referred to as "the replica
trick":
$$\mathbb{E}[\log Z] = \mathbb{E}\left[\lim_{n\to 0}\frac{Z^n - 1}{n}\right] = \lim_{n\to 0}\frac{\mathbb{E}[Z^n] - 1}{n}$$

This is at the root of the replica method: we replace the average of the logarithm of Z, which
is hard to compute, by the average of powers of Z. If n is an integer, we may hope that we shall
indeed be able to compute averages of Z^n. We could then "pretend" that our computation
with integer n is valid for n ∈ R (which is again not a trivial step), perform an analytic
continuation from n ∈ N to n ∈ R, and send n → 0. While this sounds acrobatic and certainly not like
rigorous mathematics, it seems, amazingly, to work quite well when we stick to the guidelines
that physicists —following the trail of Giorgio Parisi and Marc Mézard— have proposed over
the last few decades. Indeed, when it can be applied, this method appears to seemingly always
lead to the correct result, at least when we can compare it to rigorous computation. Nowadays,
there is a deep level of trust in the results given by the replica method.

We shall come back on what we can say with rigorous mathematics later on, but first, let us
see how it works in detail for the computation of Φ(β, ∆) in the Random Field Ising Model,
using our "field-theoretic" toolbox of the previous chapter.

2.2.1 Computing the replicated partition sum

Let n be the number of replicas and α = 1, . . . , n be the index of the replicas (these are the
traditional notations in the replica literature); we have
$$Z^n = \sum_{s^{(1)}}\sum_{s^{(2)}}\cdots\sum_{s^{(n)}}\exp\left\{\beta\sum_{\alpha=1}^{n}\left[\frac{N}{2}\left(\sum_i\frac{s_i^{(\alpha)}}{N}\right)^2 + \sum_i h_i s_i^{(\alpha)}\right]\right\}.$$

We now write the average over the random fields, and proceed to fix the magnetization, as we
did for the Curie-Weiss model, this time for each of the n "replicas" indexed from α = 1, ..., n :
 !2 
N Pn P s(α)
i
Pn P (α)
β 2 α=1 i N +β α=1 h s
i i i 
 X
Eh [Z n ] = Eh  e 
n
{s }α=1
(α)
 
Z !
(a)  X Y X (α) N P 2
P P (α) 
= N n Eh  dmα δ si − N mα eβ 2 α mα +β α i hi si 
{s(a) }α α i
 
Z hP i
(α) (α) 
(b)
P
= (2iπN )n Eh 
m̂ s −N mα β N 2
P P P
 X Y
dmα dm̂α e α α i i e 2 α mα +β α i hi si 
{s(a) }α α
 
Z Y
 X Pα m̂α Pi s(α) (α) 
dmα dm̂α e(β 2 α mα −N α m̂α mα ) Eh 
N P 2
P P P
∝ e i +β α i hi si

α { }α
s (a)

where (a) is obtained by splitting the sum by magnetization and (b) is obtained by taking the
Fourier transform of the Dirac delta function, followed by a change of variables m̂α = 2π iλα .
We then chose to ignore the irrelevant prefactors in front of the integral. Then:
 
$$\begin{aligned}
\mathbb{E}_h[Z^n] &\propto \int\prod_\alpha dm_\alpha\, d\hat m_\alpha\; e^{\beta\frac{N}{2}\sum_\alpha m_\alpha^2 - N\sum_\alpha\hat m_\alpha m_\alpha}\;\mathbb{E}_h\Bigg[\sum_{\{s^{(\alpha)}\}_\alpha}\prod_\alpha\prod_i e^{\hat m_\alpha s_i^{(\alpha)} + \beta h_i s_i^{(\alpha)}}\Bigg]\\
&\overset{(c)}{\propto} \int\prod_\alpha dm_\alpha\, d\hat m_\alpha\; e^{\beta\frac{N}{2}\sum_\alpha m_\alpha^2 - N\sum_\alpha\hat m_\alpha m_\alpha}\;\mathbb{E}_h\Bigg[\prod_i\prod_\alpha\sum_{s_i^{(\alpha)}=\pm 1} e^{\hat m_\alpha s_i^{(\alpha)} + \beta h_i s_i^{(\alpha)}}\Bigg]\\
&\overset{(d)}{\propto} \int\prod_\alpha dm_\alpha\, d\hat m_\alpha\; e^{\beta\frac{N}{2}\sum_\alpha m_\alpha^2 - N\sum_\alpha\hat m_\alpha m_\alpha}\;\Bigg\{\mathbb{E}_h\Bigg[\prod_\alpha 2\cosh(\beta h + \hat m_\alpha)\Bigg]\Bigg\}^{N}\\
&= \int\prod_\alpha dm_\alpha\, d\hat m_\alpha\; e^{N\big[\frac{\beta}{2}\sum_\alpha m_\alpha^2 - \sum_\alpha\hat m_\alpha m_\alpha + \log\big(\mathbb{E}_h\big[\prod_\alpha 2\cosh(\beta h + \hat m_\alpha)\big]\big)\big]}
\end{aligned}$$

where (c) is obtained by writing the sum of products as a product of sums and (d) comes
from the fact that all the h_i's are i.i.d., so that the integral over the vector h has been replaced by
a single integral over a scalar h. We have again gotten rid entirely of the combinatoric sums, but
at the price of introducing integrals.

2.2.2 Replica Symmetry Ansatz

At this point, we seem to have reached a quite complicated expression: we now have to
somehow manage to integrate over all the m_α, m̂_α, and magically take the n → 0 limit. These
integrals can be seen as integrals over the two n-dimensional vectors m and m̂,
i.e.:
$$\int\prod_\alpha dm_\alpha\, d\hat m_\alpha =: \int dm\, d\hat m$$

From the structure of the integral, we should expect that a saddle point method will hold, so
that we should extremize the expression in the exponential over these two vectors. While this
looks like a formidable challenge (maximization over all possible vectors!) we may guess what
these vectors will look like at the extremum. A very reasonable assumption, called the replica
symmetry (RS) ansatz, is that at the extremum all the replicas are equivalent, so that
$$m_\alpha \equiv m\,, \qquad \hat m_\alpha \equiv \hat m\,, \qquad \forall\alpha$$

Physicists, who have been trained to follow the steps of the giant German scientists of the late
XIXth and early XXth century, call such a guess an ansatz. Following the replica symmetric
ansatz, the seemingly huge monster E_h[Z^n] is now reduced to the more gentle:
$$\mathbb{E}_h[Z^n] \propto \int dm\, d\hat m\;\exp\left\{N\left[\frac{\beta}{2}nm^2 - n\hat m m + \log\big(\mathbb{E}_h\big[2^n\cosh^n(\beta h + \hat m)\big]\big)\right]\right\}$$
We have thus quite simplified the problem, and the averaged free entropy we are looking for
reads
$$\begin{aligned}
\Phi(\beta,\Delta) &= \lim_{N\to\infty}\frac{1}{N}\mathbb{E}_h[\log(Z(\beta,h))]\\
&\overset{(a)}{=} \lim_{N\to\infty}\frac{1}{N}\lim_{n\to 0}\frac{\mathbb{E}_h[(Z(\beta,h))^n]-1}{n}\\
&\overset{(b)}{=} \lim_{n\to 0}\frac{1}{n}\lim_{N\to\infty}\frac{\mathbb{E}_h[(Z(\beta,h))^n]-1}{N}\\
&\overset{(c)}{=} \lim_{n\to 0}\frac{1}{n}\,\mathrm{Extr}_{m,\hat m}\left\{\frac{\beta}{2}nm^2 - n\hat m m + \log\big(\mathbb{E}_h\big[2^n\cosh^n(\beta h + \hat m)\big]\big)\right\}
\end{aligned}$$
where (a) is simply applying the replica trick; (b) is a non-rigorous swap of two limits which
is assumed to be correct in the replica method; and (c) is the saddle point method. We have
almost finished the replica computation, the last step is to get rid of the remaining n. This can
be done by using the replica trick once more:
 
$$\begin{aligned}
\Phi(\beta,\Delta) &\overset{(d)}{=} \lim_{n\to 0}\frac{1}{n}\,\mathrm{Extr}_{m,\hat m}\left\{\frac{\beta}{2}nm^2 - n\hat m m + n\,\mathbb{E}_h[\log(2\cosh(\beta h + \hat m))]\right\}\\
&= \mathrm{Extr}_{m,\hat m}\left\{\frac{\beta}{2}m^2 - \hat m m + \mathbb{E}_h[\log(2\cosh(\beta h + \hat m))]\right\}
\end{aligned}$$

where (d) comes from the trick that, as n → 0⁺:
$$\mathbb{E}[X^n] = \mathbb{E}\big[e^{n\log(X)}\big] \approx \mathbb{E}[1 + n\log(X)] = 1 + n\,\mathbb{E}[\log(X)] \approx e^{n\,\mathbb{E}[\log(X)]}$$
which implies
$$\log(\mathbb{E}[X^n]) \approx \log(1 + n\,\mathbb{E}[\log(X)]) \approx n\,\mathbb{E}[\log(X)]$$
This kind of voodoo replica magic should be astonishing: essentially, we see that we can push
the expectation within a function when n is going to 0! This already hints at the fact that, for
this to be really valid, we shall require some concentration property for the random variables!
In any case, we have finished the replica part of our computation, and have managed to bring
back the computation to an extremization of a two-dimensional function, just like we did for
the Curie-Weiss model:
 
$$\Phi(\beta,\Delta) = \mathrm{extr}_{m,\hat m}\left\{\frac{\beta}{2}m^2 - \hat m m + \mathbb{E}_h[\log(2\cosh(\beta h + \hat m))]\right\}$$

2.2.3 Computing the Saddle points: mean-field equation

We now need to compute the saddle points. We have:
$$\frac{\partial}{\partial m}\left[\frac{\beta}{2}m^2 - \hat m m + \mathbb{E}_h[\log(2\cosh(\beta h + \hat m))]\right] = \beta m - \hat m \quad\Rightarrow\quad \hat m = \beta m$$

Plugging this back, we reach a formula very similar to the one obtained for Curie-Weiss.
Defining
$$\Phi_{\mathrm{RS}}(m,\beta,\Delta) \triangleq -\frac{\beta}{2}m^2 + \mathbb{E}_h[\log(2\cosh(\beta(h+m)))]$$
we find:
$$\Phi(\beta,\Delta) = \mathrm{extr}_m\,\Phi_{\mathrm{RS}}(m) = \Phi_{\mathrm{RS}}(m^*) \qquad (2.1)$$
where m* will satisfy the self-consistent mean-field equation
$$m = \mathbb{E}_h[\tanh(\beta(h+m))] = \int dh\,\frac{e^{-\frac{h^2}{2\Delta}}}{\sqrt{2\pi\Delta}}\tanh(\beta(h+m))$$

As we did in the Curie-Weiss model, we can also compute the large deviation function that
gives us the free entropy for a fixed value of m. In fact, this is self-averaging as well, since we
could repeat all the steps of theorem 7 with an indicator function. The free entropy is obtained
by doing the saddle point in the correct order and first differentiating with respect to m̂, leading to an
implicit equation on m̂*:
$$m = \mathbb{E}_h[\tanh(\beta h + \hat m^*)]$$
so that the replica symmetric approach predicts
$$\mathbb{P}(\bar S = m) \asymp e^{N\Phi^{\mathrm{LD}}(\beta,\Delta,m)}$$
where
$$\Phi^{\mathrm{LD}}(\beta,\Delta,m) = \mathrm{extr}_{\hat m}\left\{\frac{\beta}{2}m^2 - \hat m m + \mathbb{E}_h[\log(2\cosh(\beta h + \hat m))]\right\}$$
where the extremization over m̂ imposes that it solves eq. (2.2.3). Indeed it is easy to check
that this recovers the large deviation function of the previous chapter.

The results of the replica prediction are illustrated in Figure 2.2.1, where we plot the
minimal cost of the assignment versus the variance of the random field. This is obtained by
solving the self-consistent equation (2.1), and computing the equilibrium energy by differentiating
the free entropy with respect to the (inverse) temperature:
$$\mathbb{E}\langle e\rangle = -\partial_\beta\Phi(\beta,\Delta) = \frac{(m^*)^2}{2} - \mathbb{E}_h[(h+m^*)\tanh(\beta(h+m^*))] \qquad (2.2)$$
and taking the zero temperature limit, so that
$$\mathbb{E}[e_{\min}] = \frac{(m^*)^2}{2} - \mathbb{E}_h[(h+m^*)\,\mathrm{sign}(h+m^*)] \qquad (2.3)$$

Figure 2.2.1: Minimum energy in the Random Field Ising Model depending on the variance ∆.
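The curve of Figure 2.2.1 can be reproduced numerically; here is a minimal sketch (our own helper functions) that solves the zero-temperature limit of the mean-field equation, m = E_h[sign(h + m)], by fixed-point iteration with a Monte Carlo estimate of the Gaussian average, and then evaluates eq. (2.3):

```python
import numpy as np

rng = np.random.default_rng(0)

def solve_rs_zero_T(var, n_samples=500000, n_iter=100, m0=0.5):
    # Zero-temperature RS equation m = E_h[sign(h + m)], h ~ N(0, var).
    h = rng.normal(0.0, np.sqrt(var), n_samples)
    m = m0
    for _ in range(n_iter):
        m = np.mean(np.sign(h + m))
    return m

def min_energy(var):
    # eq. (2.3): E[e_min] = (m*)^2/2 - E_h[(h + m*) sign(h + m*)] = (m*)^2/2 - E_h|h + m*|
    m = solve_rs_zero_T(var)
    h = rng.normal(0.0, np.sqrt(var), 500000)
    return m, 0.5 * m**2 - np.mean(np.abs(h + m))

for var in [0.1, 0.5, 1.0, 2.0]:
    m, e = min_energy(var)
    print(f"Delta = {var:4.1f}   m* = {m:+.3f}   E[e_min] = {e:.4f}")
```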

2.3 A rigorous computation with the interpolation technique

Given that this computation was somewhat acrobatic, it is only natural to seek rigorous
reassurance that the result we reached is exact. In order to do so, we shall use the interpolation
method introduced by Francesco Guerra to prove results from the replica method.

2.3.1 A Simple Problem

First we start with a different, but simpler, problem. Consider a system with the Hamiltonian:
$$H_0(s, h; m) = -\sum_i s_i(h_i + m)$$
The corresponding partition function and free entropy per spin read
$$Z_0(\beta, h; m) = \sum_s e^{\beta\sum_i s_i(h_i+m)} = \prod_i\sum_{s_i=\pm 1} e^{\beta s_i(h_i+m)} = \prod_i 2\cosh(\beta(h_i+m))$$
$$\Phi_0(\beta,\Delta; m) = \mathbb{E}_h\left[\frac{\log(Z_0(\beta,h;m))}{N}\right] = \mathbb{E}_h[\log(2\cosh(\beta(h+m)))]$$

In fact, we can even define this partition sum at a fixed value of S̄ = s, to access large deviations:
$$Z_0(\beta, h; m, s) = \sum_{s}\mathbb{1}(\bar S = s)\, e^{\beta\sum_i s_i(h_i+m)}$$
We do not know how to do this computation directly, but we can do it using the
Gärtner-Ellis theorem, or Legendre transform, since as N → ∞ this is equivalent to computing
the rate. We thus write
$$\tilde Z_0(\beta, h; m, k) = \sum_s e^{\beta\sum_i s_i(h_i+m) + k\sum_i s_i}$$
$$\frac{1}{N}\log\tilde Z_0(\beta, h; m, k) \to \mathbb{E}_h[\log(2\cosh(\beta(h+m)+k))]$$
N

and we get, from Gärtner-Ellis, imposing s = m:
$$\Phi_0(\beta, m, \Delta) = \lim_{N\to\infty}\frac{1}{N}\log Z_0(\beta, h; m, \bar S = m) = \mathbb{E}_h[\log(2\cosh(\beta(h+m)+k^*))] - k^* m$$
$$m = \mathbb{E}_h[\tanh(\beta(h+m)+k^*)]$$
We can make the trivial change of variable k* = m̂ − βm and we reach
$$\Phi_0(\beta, m, \Delta) = \mathrm{extr}_{\hat m}\left\{\mathbb{E}_h[\log(2\cosh(\beta h + \hat m))] - m\hat m + \beta m^2\right\}$$

and we now have something that looks already very close to the replica prediction!

2.3.2 Guerra’s Interpolation

Guerra’s method consists in modifying the Hamiltonian H0 to transform it progressively into


the actual problem. This is done as follows: we define a family of Hamiltonians and their
associated partition functions at a different "times" t as:

!2
X N X si
Ht (s, h; m) = − si [hi + m(1 − t)] − t ,
2 N
i i
X
Zt (β, h; m) = 1(S̄ = m)e−βHt (s,h;m) , ∀ t ∈ (0, 1]
s

It is easy to verify that that, for t = 0, we recover the model of the previous paragraph, while
for t = 1 we have:

H1 (s, h; m) ≡ HRFIM (s, h) , Z1 (β, h; m) ≡ ZRFIM (β, h, m) , ∀ m ∈ [−1 : 1]

We are now going to interpolate the model at time 1 from the one at time 0, and write, using
the fundamental theorem of calculus:
$$\begin{aligned}
\Phi(\beta, m, \Delta) &= \lim_{N\to\infty}\mathbb{E}_h\left[\frac{\log(Z_{\mathrm{RFIM}}(\beta,h))}{N}\right]\\
&= \lim_{N\to\infty}\mathbb{E}_h\left[\frac{\log(Z_1(\beta,h;m))}{N}\right]\\
&= \lim_{N\to\infty}\mathbb{E}_h\left[\frac{\log(Z_0(\beta,h;m))}{N} + \int_0^1 d\tau\left.\frac{\partial}{\partial t}\frac{\log(Z_t(\beta,h;m))}{N}\right|_{t=\tau}\right]\\
&= \Phi_0(m,\Delta,\beta) + \lim_{N\to\infty}\mathbb{E}_h\left[\int_0^1 d\tau\left.\frac{\partial}{\partial t}\frac{\log(Z_t(\beta,h;m))}{N}\right|_{t=\tau}\right]
\end{aligned}$$

What is left to do is to compute the additional integral term. We find:


$$\begin{aligned}
\frac{\partial}{\partial t}\frac{\log(Z_t(\beta,h;m))}{N} &= \frac{1}{N}\frac{1}{Z_t(\beta,h;m)}\frac{\partial}{\partial t}\sum_s \mathbb{1}(\bar S = m)\, e^{-\beta H_t(s,h;m)}\\
&= \frac{1}{N}\frac{1}{Z_t(\beta,h;m)}\sum_s \mathbb{1}(\bar S = m)\, e^{-\beta H_t(s,h;m)}\,(-\beta)\frac{\partial}{\partial t}H_t(s,h;m)\\
&= \frac{\beta}{N\,Z_t(\beta,h;m)}\sum_s \mathbb{1}(\bar S = m)\, e^{-\beta H_t(s,h;m)}\left[-m\sum_i s_i + \frac{N}{2}\left(\sum_i\frac{s_i}{N}\right)^2\right]\\
&= \beta\sum_s \underbrace{\frac{\mathbb{1}(\bar S = m)\, e^{-\beta H_t(s,h;m)}}{Z_t(\beta,h;m)}}_{=P(s)}\left[-m\sum_i\frac{s_i}{N} + \frac{1}{2}\left(\sum_i\frac{s_i}{N}\right)^2\right]\\
&= \beta\left[-m\left\langle\sum_i\frac{s_i}{N}\right\rangle_{\beta,t,h,m} + \frac{1}{2}\left\langle\left(\sum_i\frac{s_i}{N}\right)^2\right\rangle_{\beta,t,h,m}\right]\\
&= \beta\left[\frac{1}{2}\left\langle\Big(m - \sum_i\frac{s_i}{N}\Big)^2\right\rangle_{\beta,t,h,m} - \frac{m^2}{2}\right]
\end{aligned}$$

Substituting this back we obtain


  * !2 + 
Z 1 1 X si 2
m 
Φ(β, m, ∆) = Φ0 (β, m, ∆) + lim Eh  dτ β m− −
N →∞ 0 2 N 2 
i β,τ,h,m
 * !2 + 
Z 1
βm2 β X si
= Φ0 (β, m, ∆) − + lim Eh  dτ m− 
2 } 2 N →∞ 0 N
| {z i β,τ,h,m
=ΦRS (m) | {z }
=0
= extrm̂ ΦRS (m, m̂), ∀ m ∈ [−1 : 1]
where the last equality to 0 arises because we have restricted the magnetization to be precisely
equal to m. We have thus succeeded in proving the replica symmetric equation for the free
entropy at all values of m. More importantly, we saw that the replica method was trustworthy!

Bibliography

A nice review on the random field Ising model in physics is Nattermann (1998). It played
a fundamental role in the development of disordered systems. The replica method was
introduced by Sam Edwards, who credited it to Marc Kac (Edwards et al. (2005)). It has been
turned into a powerful and versatile tool by the work of a generation of physicists led by Parisi,
Mézard and Virasoro (Mézard et al. (1987b)). The interpolation trick we discussed to prove the
replica formula was famously introduced by Guerra (2003). The peculiar technique we used
here fixing the magnetization is inspired from El Alaoui and Krzakala (2018). Probabilistic
inequalities such as Gaussian Poincaré are fundamental to modern probability and statistics
theories. A good reference is Boucheron et al. (2013). These concentration inequalities are
the cornerstone of all approaches to rigorous mathematical treatments of statistical physics
models.

2.4 Exercises

Exercise 2.1: Gaussian Poincaré inequality and Efron-Stein

In order to prove the Gaussian Poincaré inequality, we first need to prove the very generic
Efron-Stein inequality, which is at the root of many important results in probability
theory:

Theorem 8 (Efron-Stein). Suppose that X_1, . . . , X_n and X′_1, . . . , X′_n are independent random
variables, with X_i and X′_i having the same law for all i. Let X = (X_1, . . . , X_i, . . . , X_n) and
$X^{(i)} = (X_1, \dots, X_{i-1}, X'_i, X_{i+1}, \dots, X_n)$. Then for any function f : R^n → R we have:
$$\mathrm{Var}(f(X)) \le \frac{1}{2}\sum_{i=1}^{n}\mathbb{E}\big[\big(f(X) - f(X^{(i)})\big)^2\big].$$

We are going to prove Efron-Stein using the so-called Lindeberg trick, by considering
averages over mixed ensembles of the X_i and X′_i. First we define $X_{(i)}$
as the random vector equal to X′ up to index i, and equal to X for all larger indices, i.e.
$X_{(i)} = (X'_1, \dots, X'_i, X_{i+1}, \dots, X_n)$. In particular $X_{(0)} = X$ and $X_{(n)} = X'$.

1. Show that (this is called the Lindeberg replacement trick):
$$\mathrm{Var}[f(X)] = \mathbb{E}[f(X)(f(X) - f(X'))] = \sum_{i=1}^{n}\mathbb{E}\big[f(X)\big(f(X_{(i-1)}) - f(X_{(i)})\big)\big]$$

2. Show that for all i:
$$\mathbb{E}\big[f(X)\big(f(X_{(i-1)}) - f(X_{(i)})\big)\big] = \mathbb{E}\big[f(X^{(i)})\big(f(X_{(i)}) - f(X_{(i-1)})\big)\big] = \frac{1}{2}\mathbb{E}\big[\big(f(X) - f(X^{(i)})\big)\big(f(X_{(i-1)}) - f(X_{(i)})\big)\big]$$

3. Show that, by Cauchy-Schwarz:
$$\big|\mathbb{E}\big[f(X)\big(f(X_{(i-1)}) - f(X_{(i)})\big)\big]\big| \le \frac{1}{2}\mathbb{E}\big[\big(f(X) - f(X^{(i)})\big)^2\big]$$
and prove the Efron-Stein theorem.

Now that we have Efron-Stein, we can prove Poincare’s inequality for Gaussian random
variables. We shall do it for a single variable, and let the reader generalize the proof to
the multi-valued case.
With Xi a ±1 random variable that takes each value with probability 1/2 (this is called
a Rademacher variable), define:

Sn = X1 + X2 + . . . + Xn .

4. Using Efron-Stein, show that
$$\mathrm{Var}\left[f\left(\frac{S_n}{\sqrt n}\right)\right] \le \frac{n}{4}\,\mathbb{E}\left[\left(f\left(\frac{S_{n-1}}{\sqrt n} + \frac{1}{\sqrt n}\right) - f\left(\frac{S_{n-1}}{\sqrt n} - \frac{1}{\sqrt n}\right)\right)^2\right]$$

5. Using the central limit theorem, show that this leads, as n → ∞, to the following
theorem:

Theorem 9 (Gaussian-Poincaré). Suppose f : R → R is a smooth function and X is
Gaussian, X ∼ N(0, 1), then
$$\mathrm{Var}[f(X)] \le \mathbb{E}\big[(f'(X))^2\big].$$
Exercise 2.2: Random field Ising model by the cavity method


The goal of this exercise is to provide an alternative derivation for the free entropy of
the random field Ising model, using a technique close to the cavity method.

1. Show that, by adding one spin to a system of N spins, one has:
$$A_N(\beta,\Delta) := \mathbb{E}_{h,h}\log\frac{Z_{N+1}}{Z_N} = \mathbb{E}_h\log\big\langle e^{-\frac{\beta}{2}\bar S^2}\, 2\cosh(\beta(\bar S + h))\big\rangle_{N,\beta,h} + o(1)$$

2. Show that, by adding an external magnetic field B to the Hamiltonian (i.e. a term
B Σ_i S_i), one can get a concentration of the magnetization for almost all B so that,
for any h, we have:
$$\int_{B_1}^{B_2}\big(\langle\bar S^2\rangle_{N,\beta,h} - \langle\bar S\rangle^2_{N,\beta,h}\big)\, dB \le 2/\beta N$$

Note that this gives the concentration over the Boltzmann averages, but not over
the disorder (the fields h). This means we showed that the magnetization con-
verges to a value m(h) that could — a priori — depend on the given realization
of the disorder h.

3. Explain why this implies, almost everywhere in B, that at large N:
$$A_N(\beta,\Delta,B) = \mathbb{E}_{h,h}\left[-\frac{\beta}{2}\langle\bar S\rangle^2_{N,\beta,h} + \log 2\cosh\big(\beta(\langle\bar S\rangle_{N,\beta,h} + h + B)\big)\right] + o(1)$$

4. Given that we proved that ⟨S̄⟩_{N,β,h} concentrates as N grows to a value m(h), show
that this implies, as N → ∞, the bound:
$$\Phi(\beta,\Delta,B) \le \sup_m\left\{-\frac{\beta}{2}m^2 + \mathbb{E}_h\log 2\cosh(\beta(m + h + B))\right\}$$

5. Use the variational approach of lecture 1 to obtain the converse bound and finally
show:
$$\Phi(\beta,\Delta) = \sup_m\left\{-\frac{\beta}{2}m^2 + \mathbb{E}_h\log 2\cosh(\beta(m + h))\right\}$$

Exercise 2.3: Mean-field algorithm and state evolution for the RFIM

Our aim in this exercise is to provide an algorithm for finding the lowest energy configuration,
and to analyse its properties.

1. Using the variational approach of section 1.3 or the cavity method of section 1.4,
explain why the following iterative algorithm might be a good one for finding
the lowest energy in practice (tip: this is the zero temperature limit of a finite
temperature iteration):
$$S_i^{t+1} = \mathrm{sign}\left(h_i + \sum_j S_j^t/N\right)$$

2. Implement the code of this algorithm, check that it indeed finds configurations
with minimum values that match the replica predictions for the minimum energy
when the system is large enough.

3. Show that in the large N limit, the dynamics of this algorithm obey a "state
evolution" equation, that is, that at each time the average magnetization $m^t = \sum_i S_i^t/N$
is given by the deterministic equation:
$$m^{t+1} = \mathbb{E}_h\,\mathrm{sign}(h + m^t)$$
and conclude that the algorithm is performing a fixed point iteration of the replica
symmetric free energy equation 2.2.3.
Chapter 3

A first application: The spectrum of


random matrices

Unfortunately, no one can be told what the Matrix is. You have to
see it for yourself

The Matrix (Morpheus)


1999

We shall move now to a first non-trivial application of the replica method to compute the
spectrum of random matrices. Random matrices were introduced by Eugene Wigner to model
the nuclei of heavy atoms. He postulated that the spacings between the lines in the spectrum
of a heavy atom nucleus should resemble the spacings between the eigenvalues of a random
matrix, and should depend only on the symmetry class of the underlying evolution. Since
then, the study of random matrices has become a field in itself, with numerous applications
ranging from solid-state physics and quantum chaos to machine learning and number theory.
The simplest of all random matrix is the Wigner one:

1  
AN = √ G + G⊤
2N
where G is a random matrix where each element Gij ∼ N (0, 1) is i.i.d. chosen from N (0, 1).
The central question is: what does the distribution of eigenvalues νAN (λ) of such random
matrices look like as N → ∞? It turns out that νAN (λ) converges towards a well-defined
deterministic function ν(λ). We shall see how one can use the replica and the cavity method
to compute it, as a first real application of our tools.

3.1 The Stieltjes transform

We will use a technique called the Stieltjes transform. It starts with a very useful identity,
defined in the sense of distributions, called Sokhotsky’s Formula. It is often used in field
theory in physics, where people refer to it as the "Feynman trick". It can also be useful to
compute the Kramers-Kronig relations in optics, or to define the Hilbert transform of causal
50 F. Krzakala and L. Zdeborová

functions1
1 1
δ(x − x0 ) = − lim ℑ
ϵ→0 π x − x0 + iϵ

This formula is at the roots of the theoretical approach to random matrix theory. Indeed, given
a probability distribution ν(x) that can take N values with uniform probability, we have:

1 X 1 X 1
ν(x) = δ(x − xi ) = − lim ℑ
N ϵ→0 N π x − x0 + iϵ
i i

Now we use the fact that 1/x is the derivative of the logarithm to write

1 X 1 1 X 1 Y
= ∂x log (x − xi ) = ∂x log (x − xi )
N x − xi N N
i i i

So that, given an N × N matrix A with eigenvalues λ1 , . . . , λN , and using the fact that the
determinant is the product of eigenvalues, we have:

1 X 1
ν(λ) = δ(λ − λi ) = − ℑ lim ∂λ log det(A − (λ + iϵ)1)
N N π ϵ→0
i

This is the basis of the computation techniques used in random matrix theory. For a given
matrix A, we introduce the Stieltjes transform, as

1 X 1 1
SA (λ) = − = − ∂λ log det(A − λ1) ,
N λ − λi N
i

and once we have the Stieltjes transform, we can access the probably density via its imaginary
part:
1
νA (λ) = lim ℑSA (λ + iϵ) .
π ϵ→0
The entire field of random matrix theory is thus reduced to the computation of the Stieltjes
transform associated with the probability distribution of eigenvalues.

3.2 The replica method

Our aim is now to compute the Stieltjes transform of the Wigner matrix using the replica
method. We shall not aim for mathematical rigor here, as our goal is merely the demonstration
of the power of our tools.
1
This formula is easily proven using Cauchy integration in the complex plane. Alternatively, one can simply
use x2 − ε2 = (x + iε)(x − iε). Indeed,

x2
Z Z Z
f (x) ε f (x)
lim dx = ∓iπ lim f (x) dx + lim dx.
R x ± iε
2 2 2 2
R π(x + ε ) R x +ε x
ε→0 + ε→0 + ε→0 +

The first integral approaches a Dirac delta function as ε → 0+ (it is a nascent delta function) and therefore, the
first term equals ∓iπf (0). The second term converges to a (real) Cauchy principal-value integral, so that
Z Z
f (x)
ℑ lim dx = ∓iπf (0) = ∓iπ δ(x)f (x)dx . (3.1)
ε→0+ R x ± iε R
3.2 The replica method 51

3.2.1 Averaging replicas

Assuming that both the Stieltjes transform and the density of eigenvalues are self-averaging,
we need to compute their expectation in the large size limit.

1
lim ESAN (λ) = −∂λ lim E log det (AN − λIN ) . (3.2)
N →∞ N →∞ N

Instead of directly using the replica trick on the log-det, we shall instead use the following
approach, which will turn out to be more practical:
h i
E log det (AN − λIN ) = −2E log det (AN − λIN )−1/2

so that, following the replica strategy of computing log X by computing instead (X n − 1)/n,
we shall need to compute average of the n-th power of det (AN − λIN )−1/2 . We now use the
Gaussian integral 2 to express the square root of determinant as an integral and write:
" n Z #
−n/2
Y dx − 12 x⊤ (AN −λIN )x
E det (AN − λIN ) =E e
RN (2π)N/2
a=1
n n
n
" #
dxa λ P
∥xa ∥22 − 12 xa⊤ AN xa
Z Y P
2
= N/2
e a=1 E e a=1 (3.3)
RN a=1 (2π)

For our Wigner matrix (the so-called GOE(N) ensemble) we can write A = √1 G + G⊤

2N
for G ∼ N (0, 1) i.i.d. We can thus write the average in eq. equation 3.3 as:
" n # " n # " G n #
− 12 xa⊤ AN xa − √1 xa Gxa − √ ij xa a
P P P
i xj
2N a=1
Y 2N a=1
EA e a=1 = EG e = EGij e .
ij

At this point, it is a good idea to refresh our knowledge of Gaussian integrals, as we shall use
them often. In particular, we have
Z r
−ax2 +bx π b2 h i b2
e dx = e 4a , or Ex ebx = e 2a
a

so that
!
" n # Pn xa a b b
i xj xi xj
N 2
− 21 xa⊤ AN xa xa ·xb
P 
Pn 2 a,b=1 N2 N Pn
1
( a=1 xi xj )
a a 4
Y Y
EA e a=1 = e 4N = e =e 4 a,b=1 N
.
ij ij

2
Remember that Gaussian distributions are normalized so that
Z
1 dx 1 ⊤
p N/2
e− 2 x (AN −λIN )x = 1 .
det (AN − λIN ) RN (2π)

A more generic formula, that turns out to be used in 90% of replica computations, is that if A is a symmetric
positive-definite matrix, then
Z r
1 T T (2π)n 12 B T A−1 B
e− 2 x Ax+B x dn x = e .
det A
52 F. Krzakala and L. Zdeborová

This is a very typical step of a replica computation! We have now performed the integration
over the disorder (randomness) and reached:
n n
n λ P 1 a ·xa )+ N
P 1 a b 2
dxa N 2 a=1( N x (N x ·x )
Z
−n/2
Y 4
E det (AN − λIN ) = N/2
e a,b=1
.
RN a=1
(2π)

A very important phenomenon has occurred: we see that the integration over the disorder has
"coupled" the previously independent replicas. This is indeed what always happens when
integrating over the disorder. This new term, that coupled the replicas, is of fundamental
importance, and has a name: it is called the overlap between replicas. We shall thus define:
1 a b
q ab =: x ·x .
N
We now introduce a "delta function" to free the overlap order parameter, just like we did
previously for the magnetization in the random field Ising model. For any function f , we
have:
 
1 a b
Z Y  
f x · x = Nn dq ab δ N q ab − xa · xb f (xa · xb )
N
1≤a≤b≤n

As before, we shall drop the N n prefactor, that does not count as we shall eventually take the
normalized logarithm and send N → ∞, and write
n n
Nλ qaa + N 2
P P
xa · xb qab
Z  
−n/2
Y 2 4
E det (AN − λIN ) ≈ ab
δ q − ab
dq e a=1 a,b=1
. (3.4)
N
1≤a≤b≤n

Performing the exact same steps we used in the previous chapters, we now take the Fourier
representation of the delta functions (and change variables so that they appear "real" instead
of "complex"):

dq̂ ab − 1≤a≤b≤n q̂ (N q −x ·x )
ab ab a b
Z P
Y   Y
ab a b
δ Nq −x ·x = e ,

1≤a≤b≤n 1≤a≤b≤n

and inserting it in the above thus allow us to write:

dq ab dq̂ ab N Φ(qab ,q̂ab )


Z
E det (AN − λIN )−n/2 ≈
Y
e (3.5)

1≤a≤b≤n

where:
n n
X λ X aa 1 X  ab 2
Φ(q ab , q̂ ab ) = − q̂ ab q ab + q + q + Ψx (q̂ ab )
2 4
1≤a≤b≤n a=1 a,b=1

with (given now the sites are decoupled):


n P ab a b n P a ab b
!N
1 dxa 1≤a≤b≤n q̂ x ·x 1 dxa 1≤a≤b≤n x q̂ x
Z Y Z Y
ab
Ψx (q̂ ) = log N/2
e = log √ e
N
a=1
(2π) N
a=1

n P a ab b
dxa x q̂ x
Z Y
= log √ e1≤a≤b≤n
a=1

3.2 The replica method 53

At N → ∞, the integral in equation 3.5 can be evaluated with the saddle point method, and
therefore:
 
−n/2 ab ab
E det (AN − λIN ) ≈ exp N extr Φ(q , q̂ )
q ab ,q̂ ab

This concludes the replica computation.

3.2.2 Replica symmetric ansatz

Again, we are trapped with the extremization of a function, but this time it should be over a
n × n matrix, which seems like a complicated space. To make progress in the extremization
problem, we need can restrict our search for particular solutions, and we are going, again, to
assume replica symmetry:
1
q ab = δ ab q, q̂ ab = − δ ab q̂
2
With this ansatz, we have:
n n 
X n X X 2
q̂ ab q ab = − q̂q, q aa = nq, q ab = nq 2
2
1≤a≤b≤n a=1 a,b=1

Finally,
n
n
dxa q̂ a=1(xa )2
P
dx
Z Y Z
1 2 n
Ψx (q̂) = log √ e = n log √ e− 2 q̂x = − log(q̂)
2π 2π 2
a=1

Putting together and applying the saddle point method, we thus reach
− nN
2
extr{log q̂−q q̂−λq− 21 q 2 }
2 e q,q̂
−1
− E log det (An − λIn )−1/2 ≈ −2 lim
N n→0
 n 
1
≈ extr log q̂ − q q̂ − λq − q 2
2
To solve this problem, we look at the saddle point equations obtained by taking the derivatives
with respect to the parameters (q, q̂):
1 1
q̂ = , q = −q̂ − λ ⇔ +q+λ=0
q q
This has two solutions:

⋆ −λ ± λ2 − 4
q± =
2
Finally, to get the Stieltjes transform, we use the relation in eq. equation 3.2:

1 ⋆ −λ ± λ2 − 4
lim ESAN (λ) = −∂λ lim E log det (A − λIN ) = q± =
N →∞ N →∞ N 2
we have thus found the Stieltjes transform, and we have two (!) solutions for λ > 0 and λ < 0.
Only one of them will be the correct one when we shall inverse the Stieltjes transform, but it
will be easy to check which one, since probabilities needs to be positive.
54 F. Krzakala and L. Zdeborová

3.2.3 From Stieltjes to the spectrum

We can now compute the spectrum using the relation


1
νA (λ) = lim ℑSA (λ + iϵ) .
π ϵ→0
The first term −λ will not give us any non trivial computation, we concentrate on the second
one. When |λ| > 2, we have
 
p
2
p
2 2
p
2 2
iϵλ
(λ + iϵ) − 4 = λ − ϵ − 4 + 2iϵλ ≈ λ − ϵ − 4 1 − 2
λ − ϵ2 − 4
which, again, has no imaginary part as ϵ → 0. We are thus forced to conclude that ν(λ) = 0
when |λ| > 2. If |λ| < 2, on the other hand, we get that
r
p
2
p
2 2
iϵλ p
(λ + iϵ) − 4 ≈ λ − ϵ − 4 1 − 2 = i 4 − λ2 + O(ϵ)
λ − ϵ2 − 4
so that we find (finally choosing the correct replica solution to be the "+" to have a positive
distribution):
1 p
ν(λ) = 4 − λ2 if 2 > λ > −2

ν(λ) = 0 otherwise
which is indeed the correct Wigner solution, aka the famous "semi-circle" law.

Figure 3.2.1: Simulation of the semicircle law using a 1000 by 1000 Wigner matrix.

3.3 The Cavity method

It is a useful exercise to use the cavity method instead of the replica one to compute the Stieltjes
transform. We start from the very definition
1 X 1
SAN (λ) = −
N λ − λi
i
3.3 The Cavity method 55

The idea behind the cavity computation consists in finding a recursion equation between the
transform for an N × N matrix and the one for an (N + 1) × (N + 1) matrix. Let us define:

MN = λ1 − AN RN = [λ1 − AN ]−1 = MN
−1

where RN is called the resolvant matrix. Using the formula for computing inverse of matrices
via their matrix of cofactors, we find
det MN
(RN +1 )N +1,N +1 =
det MN +1

Additionally, we can compute the determinant of the N + 1 matrix with the Laplace expansion
along the last row, and then on the last column, so that:

N
X
N
det MN +1 = (MN +1 )N +1,N +1 det MN − (MN +1 )N +1,k (MN +1 )l,N +1 Cl,k
k,l=1

with Cl,k
N the matrix of co-factors of M . Dividing the previous expression by det M , we
N N
thus find
det MN +1 1
=
det Mn (RN +1 )N +1,N +1
N
1 X
N
= (MN +1 )N +1,N +1 − (MN +1 )N +1,k (MN +1 )l,N +1 Cl,k
det MN
k,l=1
N
X
= λ − (AN +1 )N +1,N +1 − (AN +1 )N +1,k (AN +1 )l,N +1 (RN )k,l
k,l=1

At this point, we make the additional assumption that the off-diagonal elements of the resolvent
RN are of order O(N −1/2 ). This can be checked, for instance by expanding in powers of λ
with pertubation theory, and observing that, at each order in this expansion, the off-diagonal
elements are indeed of order N −1/2 . In which case the equation further simplifies to

1
(RN +1 )N +1,N +1 = N
−1
P
λ−N l=1 (RN )l,l

where we have used that the AN have i.i.d. elements. Given the matrix elements of the
diagonal of RN +1 are identically distributed, we find

1 1
TrRN +1 = 1
N λ − N TrRN

so that the Stieltjes tranform verifies

1
SAN +1 (λ) = −
λ + SAN (λ)

and we thus expect, as N increases, that the Stieltjes transform will converge to the fixed point.
This is indeed the same (correct) equation that we found with the replica method. We have
thus checked that both methods give the correct solution in this case.
56 F. Krzakala and L. Zdeborová

Bibliography

The Wigner random matrix was famously introduced in Wigner (1958). The use of the replica
method for random matrices iniated with the seminal work of Edwards and Jones Edwards
and Jones (1976). It has now grown into a field in iteself, with hundred of deep non-trivial
results. A good set of lecture notes on the subject can be found in Livan et al. (2018); Potters
and Bouchaud (2020). A classical mathematical reference is Bai and Silverstein (2010).

3.4 Exercises

Exercise 3.1: Wishart, or Marcenko-Pastur law

The goal of this exercise is to repeat the replica computation for Wishart-Matrices,
and to derive their distribution of Eigenvalues, also called the Marcenko-Pastur law.
Wishart matrices are defined as follows: Consider a M × N random matrix X with i.i.d.
coefficients distributed from a standard normalized Gaussian N (0, 1). The Wishart
matrix Σ̂ is:
1
Σ̂ =: XX T .
N
In other words, these are the correlation matrices between random data points. As
such they are used in many concrete situations in data science and machine learning, in
particular to filter out the signal from the noise in data.

1. Denoting α = M/N , repeat the replica computation of the Stieltjes Transform,


but now for the Wishart matrix in the limit where M, N → ∞, with α fixed. Show
that it leads to p
1 − α − λ ± (λ − α − 1)2 − 4α
SΣ̂ (λ) =
2αλ
2. For α < 1, show that this implies the Marcenko-Pastur law for the distribution of
eigenvalues:
p
1 (λ+ − λ)(λ − λ− )
νΣ̂ (λ) = with λ ∈ [λ− , λ+ ]
2π √ αλ
λ± = (1 ± α)2

3. Perform simulation of such random matrices, for large values of M and N and
check your predictions for the distribution of eigenvalues.

4. Repeat the simulation for a value α > 1, and compare again with the distribution
νΣ̂ (λ). Are we missing something? Hint: the Stieltjes transform of a delta function
δ(λ) is −1/λ.
Chapter 4

The Random Field Ising Model on


Random Graphs

I think that I shall never see


A poem lovely as a tree.

Joyce Kilmer – 1913

We shall continue our exploration of the Boltzmann-Gibbs distribution

e−βH(s)
PN,β (S = s) = .
ZN (β)

for the random field Ising model, but we shall now look at a more complex, and more interest-
ing, topology. We assume the existence of a graph Gij that connects some of the nodes, such
that Gij = 1 if i and j are connected, and 0 otherwise. So far, the last chapters dealt only with
the case of fully connected graphs Gij = 1 for all i, j. Now our Hamiltonian can be written as:

X N
X
HN,J,{h},G (s) = −J Si Sj − hi Si
(i,j)∈G i=1

where J is a scalar coupling constant.

4.1 The root of all cavity arguments

Let us assume we have N spins i = 1, . . . , N , that are all isolated. In this case, their probability
distribution is simple enough, we have mi = tanh(βhi ). Imagine now we are connecting
these N spins to a new spin S0 , in the spirit of the cavity method. Of course, the mi are now
changed! Let us thus refer to the old values of mi as the "cavity ones" and write:

mci ≡ tanh βhi .


58 F. Krzakala and L. Zdeborová

With this definition, note that (1 + Si mci )/2 = eβhi Si / cosh(βhi ). Our main question of interest
is: can we write the magnetization of the new spin m0 as a function of the {mci }? Let us try!
Clearly
P P
βS0 h0 + i βJSi S0 +βSi hi P βS0 h0
Q P βJSi S0 1+Si mci
S0 ,{s} S0 e S0 S0 e i Si e 2
m0 = P βS h +
P
βJS S +βS h
= 1+Si mci
S0 ,{s} e
βS h βJS S
0 0 i 0 i i
P Q P
S0 e Si e
i 0 0 i 0
i 2
+
X −X − Y X 1 + Si mi c
= with X s = eβsh0 eβJSi s
X +X − 2
i Si

Using the identity atanhx = 1


2 1−x , and applying the atanh on both side, we reach,
log 1+x

X + −X −
1 1+ X + +X − 1 X+
atanh(m0 ) = log X + −X −
= log −
2 1− 2 X
X + +X −
X1 eβJ (1 + mci ) + e−βJ (1 − mci )
= βh0 + log
2 e−βJ (1 + mci ) + eβJ (1 − mci )
i
X1 cosh(βJ) + mci sinh(βJ)
= βh0 + log
2 cosh(βJ) − mci sinh(βJ)
i
X1 1 + mci tanh(βJ)
= βh0 + log
2 1 − mci tanh(βJ)
i
X
= βh0 + atanh (mci tanh(βJ)) .
i

We can thus finally write the magnetization of the new spins as the function of the spins in
the "old" system in a relatively simple form:
!
X
m0 = tanh βh0 + atanh (mci tanh(βJ)) (4.1)
i

4.2 Exact recursion on a tree

It is quite simple to repeat the same argument iteratively on a tree. If we start with initial
conditions then we can write, at any layer of the tree,
 
X
mi→j = tanh βhi + atanh (mk→i tanh(βJ)) (4.2)
k∈∂i ̸=j

What if we want to know the true marginals? This is easy, we just write
 
X
mi = tanh βhi + atanh (mk→i tanh(βJ)) (4.3)
k∈∂i

This is the root of the so-called Belief propagation approach. We solve the problem for the
cavity marginals mi→j , which has a convenient interpretation as a message passing problem.
Once we know them all, we can compute the true marginal!
4.2 Exact recursion on a tree 59

This is a method that first appeared in statistical physics, when Bethe and Peierls used it as an
approximation of the regular lattice. Indeed, we could iterate this method on a large infinite
tree of connectivity, say c = 2d to approximate a hypercubic lattice of dimension d (where
each node has 2d neighbors). Consider for instance the situation with zero field. In this case,
the magnetization at distance ℓ from the leaves follows
  
mℓ+1 = tanh (c − 1)atanh mℓ tanh(βJ)
We can look for a fixed point of this equation, and check for which values of the (inverse)
temperature a non-zero value for the magnetization is possible. This is the same phenomenon
as in the Ising model, just with a slightly more complicated fixed point. By ploting this
equation, one realizes that, assuming d = 2c, this happens at β BP = ∞ for d = 1, β BP = 0.346
for d = 2, β BP = 0.203 for d = 3, β BP = 0.144 for d = 4, and β BP = 0.112 for d = 5. If we
compare these numbers to the actual transition on a real hypercubic lattice, we find β lattice = ∞
for d = 1, β lattice = 0.44 for d = 2, β lattice = 0.221 for d = 3, β lattice = 0.149 for d = 4 and
β lattice = 0.114 for d = 5. Not so bad, and in fact we see that the predictions become exact as d
grows! This approach is quite a good one to estimate a critical temperature. In fact, one can
show that it gives a rigorous upper bound on the ferromagnetic transition, in any topology!

4.2.1 Belief propagation on trees for pairwise models

The iterative approach we just discussed can be made completely generic on a tree graph! The
example that we have been considering so far reads, in full generality
X X
H=− Jij Si Sj − hi Si .
(ij)∈G i

This is an instance of a very generic type of model: those with pairwise interactions, where
the probability of each configuration is given by
 
1  Y Y
P ((S)) = ψi (Si ) ψij (Si , Sj )
Z
i (ij)∈G
 
X Y Y
Z =  ψi (Si ) ψij (Si , Sj )
{Si=1,...,N } i (ij)∈G

and the connection is clear once we define ψij (Si , Sj ) = exp(βJij Si Sj ) and ψi (Si ) = exp(βhi Si ).

We want to compute Z on a tree, like we did for the RFIM. For two adjacent sites i and j,
the trick is to consider the variable Zi→j (Si ), defined as the partial partition function for the
sub-tree rooted at i, when excluding the branch directed towards j, with a fixed value Si of the
i spin variable. We also need to introduce Zi (Si ), the partition function of the entire complete
tree when, again, the variable i is fixed to a value Si . On a tree, these intermediate variables
can be computed exactly according to the following recursions
 
Y X
Zi→j (Si ) = ψi (Si )  Zk→i (Sk )ψik (Si , Sk ) (4.4)
k∈∂i\j Sk
 
Y X
Zi (Si ) = ψi (Si )  Zj→i (Sj )ψij (Si , Sj ) (4.5)
j∈∂i Sj
60 F. Krzakala and L. Zdeborová

where ∂i denotes the set of all neighbors of i. In order to write these equations, the only
assumption that has been made was that, for all k ̸= k ′ ∈ ∂i \ j, the messages Zk→i (Sk ) and
Zk′ →i (Sk′ ) are independent. On a tree, this is obviously true: since there are no loops, the sites
k and k ′ are connected only through i and we have "cut" this interaction when considering
the partial quantities. This recursion is very similar, in spirit, to the standard transfer matrix
method for a one-dimensional chain.

In practice, however, it turns out that working with partition functions (that is, numbers that
can be exponentially large in the system size) is somehow impractical. We can thus normalize
equation 4.4 and rewrite these recursions in terms of probabilities. Denoting ηi→j (Si ) as the
marginal probability distribution of the variable Si when the edge (ij) has been removed, we
have
Zi→j (Si ) Zi (Si )
ηi→j (Si ) = P ′ , ηi (Si ) = P ′ .
S ′ Zi→j (Si )
i S ′ Zi (Si ) i

So that the recursions equation 4.4 and equation 4.5 now read
 
ψi (Si ) Y X
ηi→j (Si ) =  ηk→i (Sk )ψik (Si , Sk ) , (4.6)
zi→j
k∈∂i\j Sk
 
ψi (Si ) Y X
ηi (Si ) =  ηj→i (Sj )ψij (Si , Sj ) , (4.7)
zi
j∈∂i Sj

where the zi→j and zi are normalization constants defined by:


 
X Y X
zi→j = ψi (Si )  ηk→i (Sk )ψik (Si , Sk ) , (4.8)
Si k∈∂i\j Sk
 
X Y X
zi = ψi (Si )  ηj→i (Sj )ψij (Si , Sj ) . (4.9)
Si j∈∂i Sj

The iterative equations equation 4.6 and equation 4.7), along with their normalization equa-
tion 4.8 and equation 4.9, are called the belief propagation equations. Indeed, since ηi→j (Si )
is the distribution of the variable Si when the edge to variable j is absent, it is convenient to
interpret it as the "belief" of the probability of Si in absence of j. It is also called a "cavity"
probability since it is derived by removing one node from the graph. The belief propagation
equations are used to define the belief propagation algorithm

1. Initialize the cavity messages (or beliefs) ηi→j (Si ) randomly or following a prior infor-
mation ψi (Si ) if we have one.

2. Update the messages in a random order following the belief propagation recursion
equation 4.6 and equation 4.7 until their convergence to their fixed point.

3. After convergence, use the beliefs to compute the complete marginal probability distri-
bution ηi (Si ) for each variable. This is the belief propagation estimate on the marginal
probability distribution for variable i.
4.2 Exact recursion on a tree 61

Using the resulting marginal distributions, one can compute, for instance, the equilibrium
local magnetization via mi = ⟨Si ⟩ = Si ηi (Si )Si , or basically any other local quantity of
P
interest.

At this point, since we have switched from partial partition sums to partial marginals, the
astute reader could complain that we have lost sight of our prime objective: the computation
of the partition function. Fortunately, one can compute it from the knowledge of the marginal
distributions. To do so, it is first useful to define the following quantity for every edge (ij):
X zj zi
zij = ηj→i (Sj )ηi→j (Si )ψij (Si , Sj ) = = ,
zj→i zi→j
Si ,Sj

where the last two equalities are obtained by plugging equation 4.6 into the first equality and
realizing that it almost gives equation 4.9. Using again equation 4.6 and equation 4.9, we
obtain
 
X Y X
zi = ψi (Si )  ηj→i (Sj )ψij (Si , Sj )
Si j∈∂i Sj
  P
Zj→i (Sj ) Si Zi (Si )
X Y X
= ψi (Si )  P ′
ψij (Si , Sj ) = Q P ,
Si j∈∂i Sj S ′ Zj→i (S ) j∈∂i Sj Zj→i (Sj )

and along the same steps P


SjZj→i (Sj )
zj→i = Q P .
k∈∂j\i Sk Zk→j (Sk )

For any spin Si , the total partition function can be obtained using Z = Si Zi (Si ). We can
P
thus start from an arbitrary spin i
   
X Y X Y Y X
Z= Zi (Si ) = zi  Zj→i (Sj ) = zi zj→i Zk→j (Sk ) ,
Si j∈∂i Sj j∈∂i k∈∂j\i Sk

and we continue to iterate this relation until we reach the leaves of the tree. Using eq4.2.1, we
obtain
   
Q
Y Y Y zj Y zk zi
Z = zi  zj→i zk→j · · · = zi
  ··· = Q i
 .
zij zjk (ij) zij
j∈∂i k∈∂j\i j∈∂i k∈∂j\i

We thus obtain the expression of the free energy in a convenient form, that can be computed
directly from the knowledge of the cavity messages, often called the Bethe free energy on a tree:
X X
fTree N = −T log Z = fi − fij ,
i (ij)

fi = −T log zi , fij = −T log zij ,

where fi is a "site term" coming from the normalization of the marginal distribution of site i,
and is related to the change in Z when the site i (and the corresponding edges) is added to
the system. Meanwhile, fij is an "edge" term that can be interpreted as the change in Z when
62 F. Krzakala and L. Zdeborová

the edge (ij) is added. This provides a convenient interpretation of the Bethe free energy
equation 4.2.1: it is the sum of the free energy fi for all sites but, since we have counted each
edge twice we correct this by subtracting fij .

We have now entirely solved the problem on a tree. There is, however, nothing that prevents
us from applying the same strategy on any graph. Indeed the algorithm we have described is
well defined on any graph, but we are not assured that it gives exact results nor that it will
converge. Using these equations on graphs with loops is sometimes referred to as loopy belief
propagation in Bayesian inference literature.

One may wonder if there is a connection between the BP approach and the variational one.
We may even wonder if this could be simply the same as using our variational bound with a
better approach than the naive mean field one! Sadly, the answer is no! We cannot prove in
general that the BP free entropy is a lower bound on any graph: indeed there are examples
where it is larger than log Z, and some where it is lower.

Two important remarks:

• There is however a connection between the variational approach and the BP one. If one
writes the variational approach and uses the following parametrization for the guess
(where ci = |∂i|): Q
ij bij (Si , Sj )
Q(S) = Q ci −1
i bi (S)

then it is possible to show that optimizing on the function bij and bi one finds the BP
free entropy. The sad news, however, is that Q(S) does not really correspond to a true
probability density, it is not always normalizable, so one cannot apply the variational
bound.

• In the case of ferromagnetic models, or more exactly, on attractive potential Ψij , and
only in this case, it can be shown rigorously that BP does give a lower bound on the free
entropy, on any graph. For such models (thus including the RFIM!) it is thus effectively
equivalent to a variational approach. In fact, it can be further shown that in the limit
of zero temperature, BP finds the ground state of the RFIM on any graph (through a
mapping to linear programming).

4.3 Cavity on random graphs

4.3.1 Random graphs

We shall now discuss the basic properties of sparse Erdős-Rényi (ER) random graphs.

An ER random graph is taken uniformly at random from the ensemble, denoted G(N, M ),
of graphs that have N vertices and M edges. To create such a graph, one has simply to add
M random edges to an empty graph. Alternatively, one can also define the so called G(N, p)
ensemble, where an edge exists independently for each pair of nodes with a given probability
0 < c/N < 1. The two ensembles are asymptotically equivalent in the large N limit, when
M = c(N − 1)/2. The constant c is called the average degree. We denote by ci the degree
4.3 Cavity on random graphs 63

𝒢N,M= 2c N 𝒢N+1,M= 2c (N+1)


𝒢N,M= 2c N− 2c 𝒢N+1,M= 2c (N+1)− 2c
Random graphs with cavities

Random graphs, average degree c

Figure 4.3.1: Iterative construction of a random graph for the Cavity method. The average
degree of the graph is c.

of a node i, i.e. the number of nodes to which i is connected. The degrees are distributed
according to Poisson distribution, with average c.

Alternatively, one can also construct the so-called regular random graphs from the ensemble
R(N, c) with N vertices but where the degree of each vertex is fixed to be exactly c. This means
that the number of edges is also fixed to M = cN/2.

At the core of the cavity method is the fact that such random graphs locally look like trees, i.e.
there are no short cycles going trough a typical node. The key point is thus that, in this limit,
such random graphs can be considered locally as trees. The intuitive argument for this result
is the following one: starting from a random site, and moving following the edges, in ℓ steps
cℓ sites will be reached. In order to have a loop, we thus need cℓ ∼ N to be able to come back
on the initial site, and this gives ℓ ∼ log(N ).

4.3.2 Cavity method

Let us now apply our beloved cavity method on an ER random graph with N links and on
average M = cN/2 links. We will see how this method leads to free entropies using the same
telescopic sum trick as before. However, this should be done with caution. If we apply our
usual Cesaro trick naively and write

ZN,M ZN −1,M −m1


ZN,M = . . . Z1,0
ZN −1,M −m1 ZN −2,M −m1 −m2

then we encounter a problem when choosing the value of m at each step. We want the new
spin to have on average c neighbors. That means we must add a Poissonian variable m with
mean c at each step, so that the new spin has the correct numbers of neighbors. But this also
adds a link to c spins in the previous graph, so that, on average, we went from M = cN/2 to
M ′ = cN/2 + c while we actually wanted to get M ′ = c(N + 1)/2. The difference is ∆M = c/2,
so we need to construct our sequence of graphs such that we remove m/2 links on average
64 F. Krzakala and L. Zdeborová

every time we add one spin connected to m previous spins. Therefore, we write instead

ZN,M ZN −1,M −m1 ZN −1,M −m1 +m2 /2


ZN,M = . . . . . . Z1,0
ZN −1,M −m1 ZN −1,M −m1 +m2 /2 ZN −2,M −m1 +m′1 /2−m2

Concretely, this means that we must apply the Cesaro theorem to the following term to
compute the free entropy
ZN,M =N 2c ZN −1,M −c ZN,M =N 2c ZN −1,M −c+ 2c
AN = log = log − log .
ZN −1,M −c ZN −1,M −c+ 2c ZN −1,M −c ZN −1,M −c

And we thus obtain


1 (site) (link)
lim E log ZN = lim EAN = lim EΦN − lim EΦN
N,M →∞ N N →∞ N →∞ N →∞

We thus have two terms to compute to compute the free entropy, that corresponds to the 2-step
iteration on the graph depicted in Fig.4.3.1

1. The first one corresponds to the change in free entropy when one adds one spin to a graph,
connecting it to c spins with c cavities.

2. The second one corresponds to the change in free entropy when one adds c/2 links to a
graph, connecting c spins with cavities pairwise.

This corresponds exactly to what we found on the tree! Indeed, the in 4.2.1, we see that we
have N sites terms, minus M = cN/2 links terms! It is reassuring to find that this construction
gives us the same answer as on the tree.

Mathematically, these two free entropy shift can be expressed as


(site)
X P
ΦN = E log⟨ eβh0 S0 +βJ i∈∂0 S0 Si ⟩N −1,M −c
S0
X
= E log⟨2 cosh (βh0 + βJ Si )⟩N −1,M −c
i∈∂0
(link)
Y c
ΦN = E log⟨ eβJSi Sj ⟩N −1,M −c = E log⟨eβJSi Sj ⟩N −1,M −1 .
2
(ij)

Interestingly, both these equations depend on the distribution of the joint cavity magnetizations
in the graph so we can write
Z d
Z Y X
(site)
Φ = E dPe (d) dmi Qc ({mi }) log⟨2 cosh (βh0 + βJ Si )⟩{mi }
i=1 i∈∂0
Z
(link) c
Φ = E dm1 Q(m1 , m2 ) log⟨eβJSi Sj ⟩{m1 ,m2 }
2
where Pe (d) is the excess degree probability. For a regular graph with fixed connectivity c,
Pe (d) = δ(d − (c − 1)) while for a Erdos-renyi random graph, interestingly Pe (d) = P (d) again!
See exercises section.
4.3 Cavity on random graphs 65

At the level of rigor used in physics, these formulas can be further simplified! First, we
make the assumption that the distribution of cavity fields converges to a limit distribution,
independently of the disorder. Secondly, the crucial point is now to assume the distribution of
these cavity marginals factorizes! This makes sense: the different cavities {mi } are all far from
each other in the graph. If we are not exactly at a critical point (a phase transition) then the
correlations are not infinite range. This is case, our formula depends only on the single point
distribution Qc (m) and we thus write:

Z d Z
Z Y X
Φ(site) = dPe (d) dmi Qc (mi ) log⟨2 cosh (βh0 + βJ Si )⟩{mi }
i=1 i∈∂0
Z
c
Φ(link) = dm1 Q(m1 ) dm2 Q(m2 ) log⟨eβJSi Sj ⟩{m1 ,m2 }
2

Our task is thus to find the asymptotic distribution Q(m). At the level of rigor used in physics,
this is easily done. We obviously assume that Q(m) is unique, and does not depend on the
realization of the disorder (this is not so trivial). Then we realize that the distribution of cavity
fields must satisfy a recursion such as

Z Z d Z
Y
QcN +1 (m) = dh0 P (h0 ) dPe (d) dmi QcN +1 (mi )δ (m − fBP ({mi }, h0 ))
i=1

with !
X
fBP ({mi }, h0 ) = tanh βh0 + atanh (mi tanh(βJ))
i

Obviously, once we find the fixed point, we can compute the distribution to total magnetization,
which reads almost exactly the same, except now we have to use the actual distribution of
neighbors:

Z Z d Z
Y
Q(m) = dh0 P (h0 ) dPe (d) dmi Qc (mi )δ (m − fBP ({mi }, h0 ))
i=1

1. Explain the population dynamic

2. Note that the free energy is the same as the one on graphs!!! Dictionary GRAPH to
POPULATION

3. Add discussion regular vs random, and excess degree

4.3.3 The relation between Loopy Belief Propagation and the Cavity method

• tree-like etc....

• Single graph vs POPULATION !


66 F. Krzakala and L. Zdeborová

4.3.4 Can we prove it?

Well, we can try !!! First Qc (m) is not clearly self-averaging, but for sure
Z d Z
Y X
Φ ≤ maxQc (m) dPe (d) dmi Qc ({mi }) log⟨2 cosh (βh0 + βJ Si )⟩{m}
i=1 i∈∂0
Z
c
− dm1 Qc ({m1 }) dm2 Qc ({m2 }) log⟨eβJJSi Sj ⟩m1 ,m2
2
It is possible to show that the extremization leads to Q(m) being a solution of the cavity
recursion. This means that we obtain a bound! If we found a distribution Q(m) (or actually,
all of them) that satisfies eq., then we have a bound.

Can we get the converse bound easily? Sadly, no. The point is that BP is not a variational
method on a given instance, so we cannot use the mean-field technics! Fortunately, it can be
shown rigorously, that, for any ferromagnetic model, BP does give a lower bound on the free
entropy.

It is also instructive to compare to what we had in the fully connected model. Indeed, if Q(m)
become a delta (which we expect as c grows) we obtain, using J = 1/N

N 2 m2
Φ ≤ maxm Eh log 2 cosh (βh + βm) − log eβm /N = Eh log 2 cosh (βh + βm) − β
2 2
which is indeed the result we had in the fully connected limit!

4.3.5 When do we expect this to work?

When do we expect belief propagation to be correct? As we have have discussed, random


graphs are locally tree-like: they are trees up to any finite distance. Further assuming that
we are in a pure thermodynamic state, we expect that we have short range correlations, so
that the large O(log N ) loops should not matter in a large enough system, and these equations
should provide a correct description of the model.

Clearly, this must be a good approach for describing a system in a paramagnetic phase, or
even a system with a ferromagnetic transition (where we should expect to have two different
fixed points of the iterations). It could be, however, that there exists a huge number of fixed
points for these equations: how to deal with this situation? Should they all correspond to a
given pure state? Fortunately, we do not have such worries, as the situation we just described
is the one arising when there is a glass transition. In this case, one needs to use the cavity
method in conjunction with the so-called “replica symmetry breaking” approach as was done
by Mézard, Parisi, and Virasoro.

Bibliography

The cavity method is motivated by the original ideas from Bethe (1935) and Peierls (1936) and
used in detail to study ferromagnetism (Weiss, 1948). Belief propagation first appeared in
4.4 Exercises 67

computer science in the context of Shanon error correction (Gallager, 1962) and was rediscov-
ered in many different contexts. The name "Belief Propagation" comes in particular from Pearl
(1982). The deep relation between loopy belief propagation and the Bethe "cavity" approach
was discussed in the early 2000s, for instance in Opper and Saad (2001) and Wainwright and
Jordan (2008). The work by Yedidia et al. (2003) was particularly influential. The construction
of the cavity method on random graphs presented in this chapter follows the classical papers
Mézard and Parisi (2001, 2003). That loopy belief propagation gives a lower bound on the
true partition on any graph in the case of ferromagnetic (and in general attractive models) is a
deep non-trivial result proven (partially) by Willsky et al. (2007) using the loop calculus of
Chertkov and Chernyak (2006), and (fully) by Ruozzi (2012). Chertkov (2008) showed how
beleif propagation finds the ground state of the RFIM at zero temperature. Finally, that the
cavity method gives rigorous upper bounds on the critical temperature was shown in Saade
et al. (2017).

4.4 Exercises

Exercise 4.1: Excess degree in Erdos-Renyi graphs

Consider a Erdos-Renyi random graph with N nodes and M links in the asymptotic
regime where N → ∞, with c = 2M/N the average degree.

1. Consider one given node, what is the probability p that it is connected with
another given node, say j? Since it has N − 1 potential neighbors, show that the
probability distribution of the number of neighbors for each node follows

ck e−c
P(d = k) = (4.10)
k!

In the cavity method, we are often interested as well in the excess degree distribution,
that is, given one site i that has a neighbor j, what is its distribution of additional
neighbors d?

2 Argue that finding first a link (ij) and then looking to i is equivalent to sampling all
nodes with a probability that is proportional to their number of neighbors:

di
Pi = (4.11)
c

2 Finally show that the probability distribution of having k + 1 neighbors when one
chose each sites with probability dci is

ck+1 e−c k + 1
P(d = k + 1) = (4.12)
k + 1! c
so that the probability distribution of excess degree reads

ck e−c
Pe (d = k) = (4.13)
k!
68 F. Krzakala and L. Zdeborová

Exercise 4.2: The random field ising model on a regular random graph

We have seen that the BP update equation for the RFIM is given by
!
X
fBP ({mi }, h0 ) = tanh βh0 + atanh (mi tanh(βJ)) (4.14)
i

and that the distribution of cavity fields follows (for a random graph with fixed connec-
tivity c − 1):
Z c−1
YZ
Q cav.
(m) = dh0 N (h0 ; 0, ∆) dmi Qcav (mi )δ (m − fBP ({mi }, h0 )) (4.15)
i=1

This can be solved in practice using the population dynamics approach where we
represent Qcav. (m) by a population of Npop elements. In this case, formally we iterate
a collection, or a pool, of elements. Starting from Qt=0 (m) = m1 , m2 , m3 , . . . , mNpop
cav.


with, for instance all m = 1 or random initial conditions, we iterate as follows:

• For T steps:

• For all i = 1 → Npop :

• Draw a random h0 ∼ N (0, ∆), and c − 1 random {mi } from Qcav.


t .

• Compute mnew = fBP ({mi }, h0 ).

• Assign m to the i elements of the new population Qt+1 i = mnew .

If Npop is large enough (say 105 ) then this is a good approximation of the true population
density, and if T is large enough, then we should have converged to the fixed point.
Once this is done, we can compute the average magnetization by computing the true
marginal as follows:

• Set m = 0

• For N steps:

• Draw c random {mi } from Qcav.


eq .

• Compute mnew = fBP ({mi }, h0 ) using this time the c values.

• m = m + mnew /Npop

1. Consider the RFIM on regular random graphs with connectivity c = 4. Compute


analytically the phase transition point in beta, denoted βc at zero disordered fields
(∆ = 0).

2. Implement the population dynamics and find numerically the phase transition
point when m(β, ∆) become non zero. Draw the phase diagram in (β, ∆) separat-
ing the phase where m = 0 with the one where m ̸= 0.
4.4 Exercises 69

3. Now, let us specialize to the low (and eventually zero) temperature limit. Using
the change of variable mi = tanh(β h̃i ), show that the iteration has the following
limit when β → ∞
1 X
h̃new = lim atanhfBP ({mi }, h0 ) = h0 + ϕ(h̃i ) (4.16)
β→∞ β
i

with (
x, if |x| < 1
ϕ(x) = (4.17)
sign(x) if |x| > 1

4. Using this equation, perform the population dynamics at zero temperature and
compute the critical value of ∆ where a non zero magnetization appears.
Chapter 5

Sparse Graphs & Locally Tree-like


Graphical Models

Auprès de mon arbre je vivais heureux


J’aurais jamais dû m’éloigner de mon arbre

Auprès de mon arbre


Georges Brassens – 1955

A common theme in the previous chapters has been the study of the Boltzmann-Gibbs distri-
bution:
e−βH(s)
PN,β (S = s) = .
ZN (β)
This is a joint probability distribution over the random variables S1 , . . . , SN defined through
the energy function H. In the two examples we have seen so far, the energy function is
composed of two pieces: an interaction term that couples different random variables and a
potential term which acts on each random variable separately. For instance, for the Curie-Weiss
model:
N N
1 X X
HN,h (s) = − si sj −h si
2N
i,j=1 i=1
| {z } | {z }
interaction potential

Notice that it is the interaction term that correlates the random variables: if it was zero, the
Gibbs-Boltzmann distribution would factorize and we would be able to fully characterize the
system by studying each variable independently. It is the interaction term that makes the
problem truly multi-dimensional.

In the Curie-Weiss model and RFIM, the interaction term is quadratic: it couples the ran-
dom variables pairwise. In the Chapters that follow, we will study many other examples of
Gibbs-Boltzmann distributions, each defined by different flavours of variables, potentials and
interactions terms. Therefore, it will be useful to introduce a very general way to think about
and represent multi-dimensional probability distributions. This is the subject of this Chapter.
72 F. Krzakala and L. Zdeborová

5.1 Graphical Models

To proceed with the study of probabilistic models, we introduce a tool called Graphical Models
that will give us a neat and very generic way to think about a broad range of probability
distributions. A Graphical Model is a way to represent relations or correlations between
variables.

In this section we give basic definitions and introduce a couple of examples that will be studied
in more detail later in the class.

5.1.1 Graphs

An undirected graph G(V, E) is defined by:

• A set of nodes V , which we will index by i ∈ V .

• A set of edges E, which we will index by a pair of nodes (ij) ∈ E.

Let |S| denote the size of set S. For the purpose of this section we denote the total number
of nodes by |V | = N and the total number of edges by |E| = M . The adjacency matrix of the
graph G(V, E) is a symmetric N × N binary matrix A ∈ {0, 1}N ×N with entries:
(
1 if (ij) ∈ E
Aij =:
0 if (ij) ∈/E

We will denote as ∂ the neighborhood operator, i.e. ∂i := {j ∈ V | (ij) ∈ E} is the set of


neighbors of node i. For a fixed node i ∈ V , define the degree of node i as di = |∂i|, i.e. the total
number of neighbors of i. Note that di can be easily obtained from the adjacency matrix A,

N N
1(ij)∈E .
X X
di = Aij =
j=1 j=1

node i edge (ik)


j

i k

5.1.2 Factor Graphs

• A factor graph is a graph with nodes of type ‘circle’ and of type ‘square’
5.1 Graphical Models 73

– a ‘circle’ node is a variable node, indexed by i, j, k, . . .


– a ‘square’ node is factor node, indexed by a, b, c, . . .

• A factor graph is bipartite: only edges of type ‘circle’—‘square’ exist, no ‘circle’—‘circle’


edges and no ‘square’—‘square’ edges.

∂i := {a | (ia) ∈ E} , |∂i| = di , ∀ i ∈ {1, . . . , N }


∂a := {i | (ia) ∈ E} , |∂a| = da , ∀ a ∈ {1, . . . , M }

• Every variable node i represents a random variable si taking values in Λ.

• Every factor node a represents a non-negative function fa {si }i∈∂a .




A graphical model represents a joint probability distribution over the variables {si }N
i=1 :

M
  1 Y  
P {si }N
i=1 =: fa {sj }j∈∂a ,
ZN
a=1

where ZN is the normalization constant


M
X Y  
ZN =: fa {sj }j∈∂a .
{si }N a=1
i=1

In this lecture we will use graphical models extensively as a language to represent probability
distributions arising in optimization, inference and learning problems. We will study a variety
of fa , Λ and graphical models. Let us start by giving several examples.

Example 1 (Physics, spin glass)

Consider a graph G(V, E) as a graph of interactions, e.g. 3D cubic lattice with N = 1023 nodes.
The nodes may represent N Ising spins si ∈ Λ = {−1, +1}. In statistical physics, systems
are often defined by their energy function, which we call Hamiltonian. One can think of the
Hamiltonian as a simple cost function where lower values are better. The Hamiltonian of a
spin glass then reads   X X
H {si }N
i=1 = − Jij si sj − hi si
(ij)∈E i
74 F. Krzakala and L. Zdeborová

where Jij ∈ R are the interactions and hi ∈ R are magnetic fields. Associated with the
Hamiltonian, we consider the Boltzmann probability distribution defined as:
N
  1 −βH({si }N
i=1 ) =
1 Y βhi si Y βJij si sj
P {si }N
i=1 = e e e ,
ZN ZN
i=1 (ij)∈E

N
where the normalization constant ZN = e−βH({si }i=1 ) is called the partition function.
P
{si }N
i=1

The graphical model associated with a spin glass defined on a graph G(V, E) (left) is drawn
on the right.

The graphical model has:

• For each node i one factor node fi (si ) = eβhi si corresponding to the magnetic field hi .

• For each edge (ij) ∈ E one factor node f(ij) (si , sj ) = eβJij si sj corresponding to the
interaction Jij .

Example 2 (Combinatorial optimization, graph coloring)

Consider a graph G(V, E), and a set of q colors si ∈ {red, blue, green, yellow, . . . , black} =
{1, 2, . . . , q}. In graph coloring we seek to assign a color to each node so that neighbors do not
have the same color.

In the figure we show a proper 4-coloring of the corresponding (planar) graph. Note that the
same graph can be also colored using only 3 colors, but not 2 colors.

Figure 5.1.1: Illustration on how coloring of maps corresponds to coloring of graphs.


5.1 Graphical Models 75

To set up graph coloring in the language of graphical models, we write the number of proper
colorings of the graph G(V, E) as
X Y 
ZN = 1 − δsi ,sj
{si }N
i=1
(ij)∈E

The graph is colorable if and only if ZN ≥ 1, in that case we can also define a probability
measure uniform over all proper colorings as
  1 Y
P {si }N

i=1 = 1 − δsi ,sj
ZN
(ij)∈E

We can also soften the constraint on colors and introduce a more general probability distribu-
tion: 
N
 1 Y −βδ
P {si }i=1 , β = e si ,sj
ZN (β)
(ij)∈E

As β ↗ ∞ we recover the case with strict constraints. The graphical model for graph coloring
then corresponds to factor nodes on each edge (ij) of the form f(ij) (si , sk ) = e−βδsi ,sj .

Example 3 (Statistical inference, unsupervised learning)

Probability distributions that are readily represented via graphical models also naturally
arise in statistical inference. We can give the example of the Stochastic Block Model, which is
a commonly considered model for community detection in networks. In the SBM, N nodes
are divided in q groups, the group of node i being written s∗i ∈ {1, 2, . . . , q} for i = 1, . . . , N .
Node iP is assigned in group s∗i = a with probability (fraction of expected group size) na ≥ 0,
where qa=1 na = 1. Pairs of nodes are then connected with probability that corresponds to
their group memberships. Specifically:
  
P (ij) ∈ E s∗ , s∗ = ps∗ s∗
i j i j
  ⇒ G(V, E) & Aij
P (ij) ∈ ∗ ∗
/ E si , sj = 1 − ps∗i s∗j

where pab forms a symmetric q × q matrix.

The question of community detection is whether, given the edges Aij and the parameters
θ = (na , pab , q), one can retrieve s∗i for all i = 1, . . . , N . A less demanding challenge is to find
an estimator ŝi such that ŝi = s∗i for as many nodes i as possible.

We adopt the framework of Bayesian inference. All the information we have about s∗i is
included in the posterior probability distribution1 :
  1    
P {si }N
i=1 A, θ = P A {si }N
i=1 , θ P {s } N
i i=1 θ
ZN (A, θ)
N
1 Y h 1−Aij

Aij
iY
= 1 − psi ,sj psi ,sj nsi
ZN (A, θ)
i<j i=1
1
Note that si in the posterior distribution is just a dummy variable, the argument of a function.
76 F. Krzakala and L. Zdeborová

The corresponding graphical model has one factor node per variable (field), and one factor
node per pair of nodes (interaction). Indeed, we notice that even pairs without edge (where
Aij = 0) have a factor node with fij (si , sj ) = 1 − psi ,sj . In the class we will study this posterior
quite extensively.

Example 4 (Generalized linear model)

Let n be the number of data samples and d be the dimension of data. Samples are denoted
Xµ ∈ Rd and labels yµ ∈ {−1, +1} (cats/dogs), for µ = 1, . . . , n. Generalized linear regression
is then formulated as the minimization of a loss function of the form
n
X d
X
L(w) = ℓ(yµ , Xµ · w) + r(wi ) .
µ=1 i=1

The loss minimization problem can then be regarded as the β → ∞ limit of the following
probability measure
d n
1 1 Y Y
P (w) = e−βL(w) = e−βr(wi ) e−βℓ(yµ ,Xµ ·w)
ZN (X, y, β) ZN (X, y, β)
i=1 µ=1

The corresponding graphical model then looks as follows

fi (wi ) = e−βr(wi )

wi

fµ (w) = e−βℓ(yµ ,Xµ ·w)

5.1.3 Some properties and usage of factor graphs

As we saw in the above examples the factors are often of two types: (i) factor nodes that are
related to only one variable — such as the magnetic field in the spin glass, the regularization
5.2 Belief Propagation and the Bethe free energy 77

in the generalized linear regression, or the prior in the stochastic block model — and (ii)
factor nodes that are related to interactions between the variables. For convenience, we will
thus treat those two types separately and denote the type (i) as gi (si ), and the type (ii) as
fa ({si }i∈∂a ). Factor nodes of type (i) will be denoted with indices i, j, k, l, . . . , factors of type
(ii) will be denoted with indices a, b, c, d, . . . . In this notation the probability distribution of
interest becomes:
N M

N
 1 Y Y 
P {si }i=1 = gi (si ) fa {si }i∈∂a
Z
i=1 a=1

As we saw already and will see throughout the course, quantities of interest can be extracted
from the value of the normalization constant, or partition function in physics jargon, that
reads
X Y N M
Y 
Z= gi (si ) fa {si }i∈∂a .
{si }N i=1 a=1
i=1

We will in particular be interested in the value of the free entropy N Φ = log Z. Another quan-
tity of interest is the marginal distribution for each variable, related to the local magnetization
in physics, defined as  
X
µi (si ) =: P {sj }Nj=1
{sj }N
j=1
j̸=i

The hurdle with computing the partition function and the marginals for large system sizes
is that it entails evaluating sums over a number of terms that is exponential in N . From the
computational complexity point of view, we do not know of exact polynomial algorithms able
to compute these sums for a general graphical model. In the rest of the lecture, we will cover
cases where the marginals and the free entropy can be computed exactly up to the leading
order in the system size N .

A special case of graphical models where the marginals and the free entropy can be computed
exactly is that of tree graphical models, i.e. graphs that do not contain loops. How to approach
this case is explained in the next section. Then, we will show that problems on graphs that
locally look like trees, i.e. the shortest loop going trough a typical node is long, can be solved
by carefully employing a tree-like approximation. This type of approximation holds for
graphical models corresponding to random sparse graphs, i.e. those where the average degree
is constant as the size N grows. We will see that also a range of problems defined on densely
connected factor graphs can be solved exactly in the large size limit.

5.2 Belief Propagation and the Bethe free energy

5.2.1 Derivation of Belief Propagation on a tree graphical model

We will start by computing the partition function Z and marginals µi (si ) for tree graphical
models. We recall that the probability distribution under consideration is
N M
  1 Y Y
{si }N

P i=1 = gi (si ) fa {si }i∈∂a
Z
i=1 a=1
78 F. Krzakala and L. Zdeborová

b
j

In order to express the marginals and the partition function we define auxiliary partition
functions for every (ia) ∈ E.

X Y Y
Rsj→a

j
= gj (sj ) gk (sk ) fb {sl }l∈∂b
{sk }all k above j all k above j all b above j
X Y Y
Vsa→i
 
i
= fa {sk }k∈∂a gj (sj ) fb {sk }k∈∂b
{sj }all j above a all j above a all b above a

The meaning of these quantities can be understood from the figure above. Rsj→a j represents
the partition function of the part of the system above the red dotted line, with variable node
j restricted to taking value sj . Analogously Vsa→i
i
is the partition function of the subsystem
above the blue dotted line, with variable node i restricted to taking value si 2 .

Since the graphical model is a tree, i.e. it has no loops, the restriction of the variable j to
sj makes the branches above j independent. We can thus split the sum over the variables,
according to the branch to which they belong, to obtain
 
Y X Y Y
Rsj→a
 
j
= gj (sj )  fb {sk }k∈∂b gk (sk ) fc {sl }l∈∂c 
b∈∂j\a {sk }all k above b all k above b all c above b
| {z }
=Vsb→j
j
Y
= gj (sj ) Vsb→j
j
b∈∂j\a

where we recognized the definition of Vsb→j


j and used it. Analogously, for the Vsa→i
i
we can

2
Note si appears in the argument of fa since i ∈ ∂a
5.2 Belief Propagation and the Bethe free energy 79

split the sum over the different branches of the tree above factor node a, to get
 
X   Y X Y Y
Vsa→i

i
= fa {s j } j∈∂a
 gj (s j ) g k (sk ) fb {sl }l∈∂b 
{sj }j∈∂a\i j∈∂a\i {sk }all k above j all k above j all b above j
| {z }
=Rsj→a
j
X   Y
= fa {sj }j∈∂a Rsj→a
j
(∗)
{sj }j∈∂a\i j∈∂a\i

If we start on the leaves of the tree, i.e. variable nodes that only belong to one factor a, we
have from the definition Rsj→aj = gj (sj ), for j being a leaf. This, together with the relations
above, would allow us to collect recursively the contribution from all branches and compute
the total partition function of a tree graphical model rooted in node j as
X Y
Z= gj (sj ) Vsb→j
j
.
sj b∈∂j

This value will not depend on the node in which we rooted the tree, as all the contributions to
the partition function are accounted for regardless of the root.

The partition function usually scales like exp(cN ), exponentially in the system size (simply
because it is a sum over exponentially many terms), which is a huge number. A more conve-
nient way to deal with the above restricted partition functions R and V is to define messages
χj→a
sj and ψsa→i
i
as follows:

Rsj→a X
χj→a
sj =: P j j→a so that χj→a
s = 1, ∀ (ja) ∈ E
s Rs s
Vsa→i X
ψsa→i
i
=: P i
a→i
so that ψsa→i = 1, ∀ (ia) ∈ E . (∗∗)
V
s s s

We will call χj→a


sj a message from variable j to factor a and interpret it as the probability that
variable j takes value sj in the restricted system where only the parts above the red dotted
line are considered. Analogously, ψsa→i i
will be called the message from factor a to variable i
and is interpreted as the probability that variable i takes value si in the system where only the
parts above the blue dotted line are considered. Using these definitions we rewrite the above
recursive relations as:
gj (sj ) b∈∂j\a Vsb→j b→j
Q Q P
j→a by (∗) j b∈∂j\a s′ V s′
χs j = P Q b→j
×Q P b→j
s gj (s) b∈∂j\a Vs b∈∂j\a s′′ Vs′′
| {z }
=1
Q Vsb→j
j
gj (sj ) b∈∂j\a P
s′′ Vsb→j
′′
= P Q Vsb→j
s gj (s) b∈∂j\a P ′ V b→j
s s′
Q b→j
gj (sj )
by (∗∗) b∈∂j\a ψsj
1 Y
= b→j
= j→a gj (sj ) ψsb→j
j
Z
P Q
s gj (s) b∈∂j\a ψs b∈∂j\a
X Y
Z j→a =: gj (s) ψsb→j
s b∈∂j\a
80 F. Krzakala and L. Zdeborová

and similarly

 Q
P j→a Q P j→a
by (∗)
f
{sj }j∈∂a\i a {s }
j j∈∂a j∈∂a\i Rsj j∈∂a\i s′j Rs′j
ψsa→i = ×Q
Rsj→a
i
 Q
j→a
P P P
si {sj }j∈∂a\i fa {s j }j∈∂a j∈∂a\i Rs j j∈∂a\i s′′
j
′′
j
| {z }
=1
Rsj→a
 Q
j
P
{sj }j∈∂a\i fa {sj }j∈∂a j∈∂a\i P
s′′ Rj→a
′′
j s
j
=
Rsj→a
 Q
j
P P
si {sj }j∈∂a\i fa {sj }j∈∂a j∈∂a\i P
s′ Rj→a

j s
j
 Q
P j→a
by (∗∗) {sj }j∈∂a\i fa {sj }j∈∂a j∈∂a\i χsj
=  Q
P P j→a
si f
{sj }j∈∂a\i a {s }
j j∈∂a j∈∂a\i χsj
1 X   Y
= fa {sj }j∈∂a χj→a
sj
Z a→i
{sj }j∈∂a\i j∈∂a\i
X X   Y X   Y
Z a→i =: fa {sj }j∈∂a χj→a
sj = fa {sj }j∈∂a χj→a
sj
si {sj } j∈∂a\i {sj }j∈∂a j∈∂a\i
j∈∂a\i

Above, we just obtained the self-consistent equations for the messages χ’s and ψ’s that are called
the Belief Propagation equations.

Now, we want the marginals µi (si ) and the partition function Z expressed in a way that can be
generalized to factor graphs that are not trees. With this in mind, we first write the marginals
5.2 Belief Propagation and the Bethe free energy 81

(still in a tree factor graph assumption) as

gi (si ) a∈∂i Vsa→i a→i


Q Q P
1 Y
a→i a∈∂i s′ Vs′
µi (si ) = gi (si ) V si = P Q i
a→i
× Q P a→i
Z s gi (s) a∈∂i Vs ′′ V ′′
a∈∂i | a∈∂i {zs s }
=1
V a→i
P si a→i
Q
gi (si ) a∈∂i gi (si ) a∈∂i ψsa→i
Q
1
s′′ Vs′′
Y
= P V a→i
=P i
a→i
= gi (si ) ψsa→i
Zi
Q
s gi (s)
i
a∈∂i ψs
Q
s gi (s)
P s a→i
a∈∂i a∈∂i
s′ Vs′ | {z }
=:χisi
X Y
Z i =: gi (s) ψsa→i
s a∈∂i

Here we see that each marginal µi (si ) is a very simple function of the incoming messages
ψsa→i
i
, a ∈ ∂i.

The partition function Z, instead, can be compute quite directly by rooting the tree in node i
and noticing (independently of which node i we chose as the root)
X Y
Z= gi (si ) Vsa→i
i
. (†)
si a∈∂i

We want an expression for Z that only involves the messages χ and ψ, in a way that the result
does not depend explicitly on the rooting of the tree. To do this, we first define

Vsa→i
P Q
s gi (s)
X Y
i
Z =: gi (s) ψsa→i
= Q Pa∈∂i a→i
s a∈∂i a∈∂i s′ Vs′
i→a
P Q
X Y {s i } fa {si }i∈∂a i∈∂a Rsi
a i→a
 i∈∂a
Z =: fa {si }i∈∂a χsi = Q P i→a
{si }i∈∂a i∈∂a i∈∂a s′ Rs′
P a→i i→a
ia
X
i→a a→i Vs Rs
Z =: χs ψs = P sa→i P i→a
s s′ V s′ s′′ Rs′′

We will now prove that


QN
Zi M a
Q
i=1 a=1 Z
Z= Q ia
(ia)∈E Z

Let’s go:

{si }i∈∂a fa ({si }i∈∂a ) i∈∂a Rsi


i→a
P Q
a→i
P Q
s gi (s) Pa∈∂i Vs
QN QM
QN i
QM
Z a=1 Z a i=1
Q a→i ·
a=1
Q P i→a
i=1 a∈∂i s′ Vs′ i∈∂a s′ R s′
Q ia
= P
Vsa→i Rsi→a
(ia)∈E Z
Q
P sa→i
(ia)∈E i→a
P
V
s′ s′ s′′ Rs′′
QM P
{si }i∈∂a fa ({si }i∈∂a ) i∈∂a Rsi
i→a
QN P Q
gi (s) a∈∂i Vsa→i
Q
a=1
i=1
Q s P (a→i (( · Q P (i→a((
s′ Vs′ ((( s′ Rs′
(
((((
((ia)∈E ((ia)∈E
= Q P
Vsa→i Rsi→a
(ia)∈E
Q P (a→i (sQ
( P (i→a ((
(( s′ s′
(ia)∈E( ( V · ((((
(ia)∈E s′′ Rs′′
( (
82 F. Krzakala and L. Zdeborová

We root the tree at node j. By (*), we have


 
  =Rsi→b
QN N X
Zi M a
Q z }| {
a=1 Z
X Y Y Y
i=1 a→j  b→i a→i 

Q ia
=  g j (s) Vs ×  V s g i (s) V s
(ia)∈E Z
 
s a∈∂j s i=1 a∈∂i\b
i̸=j
 
=Vsa→i
M X z X }| {
Y  Y
Rsi→a Rsk→a

×
 i
fa {sk }k∈∂a k


a=1 si {sk }k∈∂a\i k∈∂a\i
 −1
Y X
× Vsa→i Rsi→a 
(ia)∈E s

And by (†) we conclude


b towards root from i i towards root from a
zX }| { zX }| {
QN QM
QN i=1 Rsi→b Vsb→i · a=1 Vsa→i Rsi→a
Zi M a i̸=root j
Q
i=1 a=1 Z s s
Q ia
=Z· Q P a→i Ri→a
=Z
(ia)∈E Z (ia)∈E s Vs s

The last step comes from the fact that, in a tree, if we take for all variables node the edges
towards the root, and for all factor nodes the edge towards the root we accounted for exactly
all the edges.

It is quite instructive to keep in mind the interpretation of free energy terms (deduced from
their definitions)

• Z i : change of the partition function Z when variable node i is added to the factor graph

s s′ s′′ si gi (si )

• Z a : change of the partition function Z when factor node a is added to the factor graph

( ) ( ) ( ) fa

• Z ia : change of Z when variable node i and factor node a are connected

s′ ( ) s′′
5.2 Belief Propagation and the Bethe free energy 83

The formula for the partition function is hence quite intuitive: first, we are adding up all the
contributions from nodes and factors. Since with every new addition we account for all the
connected edges, we end up counting each edge exactly twice (since it is connected both to a
node and a factor). Thus, we have to subtract the edge contributions once, in order to correct
for the double-counting.

5.2.2 Belief propagation equations summary

After a bulky derivation, let us summarize the Belief Propagation equations, and the formulas
for the marginals and for the free entropy.

We consider a graphical model for the following probability distribution


N M
  1 Y Y
P {si }N

i=1 = g (s
i i ) fa {si }i∈∂a
Z
i=1 a=1

The Belief Propagation equations read


1 Y
χsj→a
j
= g (s )
j→a j j
ψsb→j
j
Z
b∈∂j\a
1 X   Y
ψsa→i = fa {sj }j∈∂a χj→a
sj
i
Z a→i
{sj }j∈∂a\i j∈∂a\i

j→a
where Z j→a and Z a→i are normalization factors set so that = 1 and a→i
P P
s χs s ψs = 1.

The free entropy density Φ, which is exact on trees and is called the Bethe free entropy on
more general graphs, reads
N
X M
X X
N Φ = log Z = log Z i + log Z a − log Z ia (5.1)
i=1 a=1 (ia)
X Y
i
Z =: gi (s) ψsa→i
s a∈∂i
X Y
a
Z =: fa {si }i∈∂a χi→a
si
{si }i∈∂a i∈∂a
X
Z ia =: χi→a
s ψsa→i
s

The marginals of the variable i are given as


1 Y
µi (si ) = gi (si ) ψsa→i
Zi i
a∈∂i

A landmark property of Belief Propagation and the Bethe entropy is that the BP equations
can be obtained from the stationary point condition of the Bethe free entropy. Show this for
homework. This is a crucial property in the task of finding the correct fixed point of BP when
several of them exist.
84 F. Krzakala and L. Zdeborová

As it often happens for successful methods/algorithms, Belief Propagation has been indepen-
dently discovered in several fields. Notable works introducing BP in various forms are:

• Hans Bethe & Rudolf Peierls 1935, in magnetism to approximate the regular lattice by a
tree of the same degree.

• Robert G. Gallager 1962 Gallager (1962), in information theory for decoding sparse error
correcting codes.

• Judea Pearl 1982 Pearl (1982), in Bayesian inference.

5.2.3 How do we use Belief Propagation?

Above we derived the BP equations and the free entropy on tree factor graphs. To use BP
as an algorithm on trees, we initialize the messages on the leaves according to definition
χj→a
sj = gj (sj ), and then spread towards the root. A single iteration is sufficient. The trouble
is that basically no problems of interest are defined on tree graphical models. Thus its usage
in this context is very limited.

On graphical models with loops BP can always be used as a heuristic iterative algorithm
1 Y
χj→a
sj (t + 1) = gj (sj ) ψsb→j (t)
Z j→a (t) j
b∈∂j\a
1 X   Y
ψsa→i (t) = fa {sj }j∈∂a χj→a
sj (t)
i
Z a→i (t)
{sj }j∈∂a\i j∈∂a\i

Initialization has various options

• χsj→a in the “prior”.


P
j (t = 0) = gj (sj )/ s gj (s)

• χsj→a
j (t = 0) = gj (sj ) + εj→a
sj + normalize, εj→a
sj are small perturbation of the “prior".

• χsj→a
j (t = 0) = εsj→a
j + normalize, random initialization.

• χsj→a
j (t = 0) = δsj ,s∗j , planted initialization

We will discuss in the follow-up lectures the usage of these different initializations and their
properties.

Then, we iterate the Belief Propagation equations until convergence or for a given number of
steps. The order of iterations can be parallel, or random sequential. Again, we will discuss in
what follows the advantages and disadvantages of the different update schemes.

It is useful to remark that, if the graph is not a tree, then the independence of branches when
conditioning on a value of a node is in general invalid. In general, the products in the BP equa-
tions should involve joint probability distributions. In this case, the simple recursive structure
of the algorithm would get replaced by joint probability distributions over neighborhoods
5.2 Belief Propagation and the Bethe free energy 85

(up to a large distance), which would be just as intractable as the original problem. We will
see in this lecture that there are many circumstances in which the independence between the
incoming messages ψsb→i j
for b ∈ ∂j \ a and between χj→a
sj for j ∈ ∂a \ i is approximately true.
We will focus on cases when it leads to results that are exact up to the leading order in N . One
important class of graphical models on which BP and its variants lead to asymptotically exact
results is that of sparse random factor graphs. This is the case because such graphs look like
trees up to a distance that grows (logarithmically) with the system size.

Sparse random graphs are locally tree-like

Informally, a graph is locally tree-like if, for almost all nodes, the neighborhood up to distance d
is a tree, and d → ∞ as N → ∞.

Importantly, sparse random factor graphs are locally tree-like. Sparse here refers to the fact
that the average degree of both the variable node and the factor nodes are constants while the
size of the graph N, M → ∞.

We will illustrate this claim in the case of sparse random graphs (for sparse random factor
graphs the argument is analogous). In a random graph, each edge is present with probability
N −1 , where c = O(1) and N → ∞, the average degree of nodes is c.
c

In order to compute the length of the shortest loop that goes trough a typical node i, consider
a non-backtracking spreading process starting in node i. The probability of the spreading
process returning to i in d steps (through a loop) is
d
1 c

1 − Pr (does not return to i) ≈ 1 − 1 −
N

where cd is the expected number of explored nodes after d steps. In the limit N → ∞, c = O(1)
so this probability is exponentially small for small distances d, and exponentially close to one
for large distances d. The distance d at which this probability becomes O(1) marks the order
of the length of the shorted loop going trough node i:
   
1 1 1
cd log 1 − = cd − − + · · · ≈ O(1)
N N 2N 2

This happens when cd ≈ O(N ) ⇒ d ≈ log(N )/ log(c). We conclude that a random graph
with average degree c = O(1) is such that the length of the shortest loop going through a
random node i is O(log(N )) with high probability. We thus see that up to distance O(log(N ))
the neighborhood of a typical node is a tree.

We will hence attempt to use belief propagation and the Bethe free entropy on locally tree-like
graphs. The key assumption BP makes is the independence of the various branches of the tree.
If the branches are connected through loops of length O(log(N )) → ∞ and the correlation
between the root and the leaves of the tree decays fast enough (we will make this condition
much more precise), the independence of branches gets asymptotically restored and BP and
the Bethe free entropy lead to asymptotically exact results. We will investigate cases when
the correlation decay is fast enough, and also those when it is not, and show how to still use
BP-based approach to obtain adjusted asymptotically exact results (this will lead us to the
notion of replica symmetry breaking).
86 F. Krzakala and L. Zdeborová

5.3 Exercises

Exercise 5.1: Representing problems by graphical models and Belief Propagation

Write the following problems (i) in terms of a probability distribution and (ii) in terms
of a graphical model by drawing a (small) example of the corresponding factor graph.
Finally (iii) write the Belief Propagation equations for these problems (without coding
or solving them) and the expression for the Bethe free energy that would be computed
from the BP fixed points.

(1) p-spin model


One model that is commonly studied in physics is the so-called Ising 3-spin model.
The Hamiltonian of this model is written as
X N
X
H({Si }N
i=1 ) =− Jijk Si Sj Sk − hi Si (5.2)
(ijk)∈E i=1

where E is a given set of (unordered) triplets i ̸= j ̸= k, Jijk is the interaction


strength for the triplet (ijk) ∈ E, and hi is a magnetic field on spin i. The spins
are Ising, which in physics means Si ∈ {+1, −1}.

(2) Independent set problem


Independent set is a problem defined and studied in combinatorics and graph
theory. Given a (unweighted, undirected) graph G(V, E), an independent set
S ⊆ V is defined as a subset of nodes such that if i ∈ S then for all j ∈ ∂i we have
j∈/ S. In other words in for all (ij) ∈ E only i or j can belong to the independent
set.
(a) Write a probability distribution that is uniform over all independent sets on a
given graph.
(b) Write a probability distribution that gives a larger weight to larger independent
sets, where the size of an independent set is simply |S|.

(3) Matching problem Matching is another classical problem of graph theory. It is


related to a dimer problem in statistical physics. Given a (unweighted, undirected)
graph G(V, E) a matching M ⊆ E is defined as a subset of edges such that if
(ij) ∈ M then no other edge that contains node i or j can be in M . In other words
a matching is a subset of edges such that no two edges of the set share a node.
(a) Write a probability distribution that is uniform over all matchings on a given
graph.
(b) Write a probability distribution that gives a larger weight to larger matchings,
where the size of a matching is simply |M |.

Exercise 5.2: Bethe free entropy

A key connection between Belief Propagation and the Bethe free entropy:
5.3 Exercises 87

Show that the BP equations we derived in the lecture


1 Y
χsj→a = gj (sj ) ψsb→j
j
Z j→a j
b∈∂j\a
1 X   Y
ψsa→i = fa {sj }j∈∂a χj→a
sj
i
Z a→i
{sj }j∈∂a\i j∈∂a\i

are stationarity conditions of


Pthe Bethe free entropy equation 5.1 under the constraint
that both s ψsa→i = 1 and s χi→a = 1 for all (ia) ∈ E.
P
s
Appendix

5.A BP for pair-wise models, node by node

In Section 5.2, we have derived the Belief propagation (BP) equations for a general tree-like
graphical model. To first sight, the BP equations might look daunting. The goal of this
Appendix is to provide additional intuition behind the BP equations by constructing them
from scratch, node by node, in a concrete setting.

Let G = (V, E) be a graph with N = |V | nodes, and let’s consider for concreteness the case of
a pair-wise interacting spin model on G:
N
1 Y Y
P(S = s) = gi (si ) f(ij) (si , sj )
Z
i=1 (ij)∈E

This encompasses many cases of interest, for instance the RFIM we studied in Chapter 2, for
which:

gi (si ) = eβhi si , f(ij) (si , sj ) = eβsi sj (RFIM)

and the q = 2 graph coloring problem studied in this Chapter:

gi (si ) = 1, f(ij) (si , sj ) = eβδsi sj . (Graph coloring)

The reader can keep any of these two problems in mind in what follows. To lighten notation, we
will write µ(s) =: P(S = s) for the probability distribution, with µ understood as a function
µ : {−1, 1}N → [0, 1]. We will also use ∝ to denote "equal up to a multiplicative factor", which
here it will always denote the constant which normalizes the probability distributions or
messages. Recall that the BP equations are self-consistent equations for the messages or beliefs
from variables to factors χi→a
si and from factors to variables ψsa→i
i
. For pair-wise models, we
have one factor per edge, and the BP equations read:

1 Y
χj→(ij)
sj = g (s )
j→(ij) j j
ψs(kj)→j
j
Z
(kj)∈∂j\(ij)
1 X Y
ψs(ij)→i = f(ij) (si , sj ) χj→(ij)
sj
i
Z (ij)→i
{sj }j∈∂(ij)\i j∈∂(ij)\i

Since in pair-wise models there is a one-to-one correspondence between edges and factor
nodes, we can simply rewrite the BP equations directly on the graph G. For each node i ∈ V
90 F. Krzakala and L. Zdeborová

i→(ij)
of the graph, define the outgoing messages χi→j si =: χsi and the incoming messages
j→i (ij)→i
ψsi = ψsi . The BP equations in terms of these "new" messages read:

1 Y 1 X
χi→j
si = gi (si ) ψsk→i , ψsk→i = f(ik) (si , sk ) χsk→i
Z i→j i i
Z k→i k
k∈∂i\j sk ∈{−1,1}

Recall that the marginal for variable si is given in terms of the messages as:
Y
µi (si ) =: P(Si = si ) ∝ gi (si ) ψsj→i
i
(5.3)
j∈∂i

Which basically tell us that the probability that Si = si is simply given by the "local belief" (or
prior) gi times the incoming beliefs ψsj→i
i from all neighbors of i. In other words, BP factorizes
the marginal distribution at every node in terms of independent beliefs.

Note that in this case it is also easy to solve for one of the messages to obtain a self-consistent
equation for only one of them. For instance, solving for the incoming messages ψsk→i k
gives:

gi (si ) Y X
χi→j
si = f(ik) (si , sk )χk→i
sk .
Z i→j
k∈∂i\j sk ∈{−1,+1}

5.A.1 One node

When the graph has a single node N = 1, we have only a single spin S1 ∈ {−1, 1} and therefore
E = ∅. The corresponding factor graph has a single variable node and a single factor node
for the "local field" gi . In this case, the marginal distribution over the spin S1 is simply given
by the "local belief":

1 g1 (±1)
µ(±1) =: P(S1 = ±1) = gi (±1) =
Z g1 (+1) + g1 (−1)

Note that in the absence of a prior g1 (s1 ) = 1, the marginal is simply the uniform distribution
µ1 (s1 ) = 12 .

5.A.2 Two nodes

For two nodes N = 2, we have two spin variables S1 , S2 ∈ {−1, +1} and two possible graphs:
either the spins are decoupled and E = ∅, or they interact through an edge E = {(12)}.

No edges: In the first case, the joint distribution is given by:

1
µ(s1 , s2 ) = g1 (s1 )g2 (s2 )
Z
g1 (s1 )g2 (s2 )
=
g1 (+1)g2 (+1) + g1 (−1)g2 (+1) + g1 (+1)g2 (−1) + g1 (−1)g2 (−1)
5.A BP for pair-wise models, node by node 91

and the marginal distribution of S1 is given by:


X
µ1 (s1 ) = µ(s1 , s2 ) = µ(s1 , −1) + µ(s1 , +1)
s2 ∈{−1,+1}
g2 (−1) + g2 (+1)
= g1 (s1 )
g1 (+1)g2 (+1) + g1 (−1)g2 (+1) + g1 (+1)g2 (−1) + g1 (−1)g2 (−1)

The marginal distribution of S2 is obtained by interchanging 1 ↔ 2. Notice that in this case,


the joint distribution factorizes into the product of the marginals:

µ(s1 , s2 ) = µ1 (s1 )µ2 (s2 )

This is a direct consequence of the absence of an interaction term coupling the two spins. Note
that we cannot write BP equations for this case since there are no factors.

One edge: The case in which E = {(12)} is more interesting. Now the joint distribution is
given by:

1
µ(s1 , s2 ) = g1 (s1 )g2 (s2 )f(12) (s1 , s2 )
Z
Notice that very quickly it becomes cumbersome to write the exact expression for the normal-
ization Z, which we keep implicit from now on. The marginals are now given by:

1  
µ1 (s1 ) = g2 (+1)f(12) (s1 , +1) + g2 (−1)f(12) (s1 , −1) g1 (s1 )
Z
1  
µ2 (s2 ) = g1 (+1)f(12) (+1, s2 ) + g1 (−1)f(12) (−1, s2 ) g2 (s2 )
Z
Crucially, it is easy to check that the joint distribution doesn’t factorise anymore:

µ(s1 , s2 ) ̸= µ1 (s1 )µ2 (s2 )

Let’s now look at what the BP equations are telling us. For instance, the outgoing messages
are given by:

χ1→2
s1 ∝ g1 (s1 ), χ2→1
s2 ∝ g2 (s2 )

while the incoming messages are given by:


X X
ψs2→1
1
∝ f(12) (s1 , s2 )χ2→1
s2 ∝ f(12) (s1 , s2 )f(12) (s1 , s2 )g2 (s2 )
s2 ∈{−1,+1} s2 ∈{−1,+1}

∝ f(12) (s1 , −1)g2 (−1) + f(12) (s1 , +1)g2 (+1)


X X
ψs1→2
2
∝ f(12) (s1 , s2 )χ1→2
s1 ∝ f(12) (s1 , s2 )f(12) (s1 , s2 )g1 (s1 )
s1 ∈{−1,+1} s1 ∈{−1,+1}

∝ f(12) (−1, s2 )g1 (−1) + f(12) (+1, s1 )g1 (+1)

From that, it is pretty clear that the marginals factorize in terms of the local beliefs times the
incoming beliefs eq. equation 5.3.
92 F. Krzakala and L. Zdeborová

5.A.3 Three nodes

Finally, let’s consider the more involved case of three nodes N = 3. The case in which there
is no edge E = ∅ or only one edge |E| = 1 reduces to one of the cases we have seen before.
Therefore, the interesting cases are when we have either two or three edges.

Two edges: For two edges, there are two nodes which have degree 1 and one node with
degree two. Without loss of generality, we can choose node 2 to have degree 2, and the joint
distribution read:
1
µ(s1 , s2 , s2 ) = f(12) (s1 , s2 )f(23) (s2 , s3 )g1 (s1 )g2 (s2 )g3 (s3 )
Z
The marginal probability of S1 = s1 is then given by:
X X
µ1 (s1 ) = µ(s1 , s2 , s2 )
s2 ∈{−1,1} s3 ∈{−1,1}
 
∝ f(23) (−1, −1)g3 (−1) + f(23) (−1, +1)g2 (−1)g3 (+1) f(12) (s1 , −1)g2 (−1)+
 
f(23) (+1, −1)g3 (−1) + f(23) (+1, +1)g3 (+1) f(12) (s1 , +1)g2 (+1) g1 (s1 )
Note that each two of the four terms share a common term. The outgoing BP messages now
read:
X
χ1→2
s1 ∝ g1 (s1 ), χ2→1
s2 ∝ g2 (s2 ) f(23) (s2 , s3 )χ3→2
s3 , χ3→2
s3 ∝ g3 (s3 )
s3 ∈{−1,+1}

Inserting the first and third into the second:


X
χ2→1
s2 ∝ g2 (s2 ) f(23) (s2 , s3 )g3 (s3 )
s3 ∈{−1,+1}
 
∝ f(23) (s2 , −1)g3 (−1) + f(23) (s2 , +1)g3 (+1) g2 (s2 )
It is easy to check that this allow us to reconstruct the marginal of S1 = s1 from eq. equation 5.3:
X
µ1 (s1 ) ∝ g1 (s1 ) f(12) (s1 , s2 )χ2→1
s2
s2 ∈{−1,+1}
X  
∝ g1 (s1 ) f(12) (s1 , s2 ) f(23) (s2 , −1)g3 (−1) + f(23) (s2 , +1)g3 (+1) g2 (s2 )
s2 ∈{−1,+1}

Three edges: For three edges, all nodes are connected and have degree two. The joint
distribution read:
1
µ(s1 , s2 , s2 ) = f(12) (s1 , s2 )f(13) (s1 , s3 )f(23) (s2 , s3 )g1 (s1 )g2 (s2 )g3 (s3 )
Z
The marginal probability of S1 = s1 now reads:
X X
µ1 (s1 ) = µ(s1 , s2 , s2 )
s2 ∈{−1,1} s3 ∈{−1,1}

∝ f(12) (s1 , −1)f(13) (s1 , −1)g2 (−1)g3 (−1) + f(12) (s1 , +1)f(13) (s1 , −1)g2 (+1)g3 (−1)

+f(12) (s1 , −1)f(13) (s1 , +1)g2 (−1)g3 (+1) + f(12) (s1 , +1)f(13) (s1 , +1)g2 (+1)g3 (+1) g1 (s1 )
5.A BP for pair-wise models, node by node 93

Note that, different from the 2 nodes case the four terms above don’t share any common factor
apart from g1 (s1 ). The outgoing BP messages now read:
X X
χ1→2
s1 ∝ g1 (s1 ) f(13) (s1 , s3 )χ3→1
s3 , χ1→3
s1 ∝ g1 (s1 ) f(12) (s1 , s2 )χ2→1
s2
s3 ∈{−1,1} s2 ∈{−1,1}
X X
χ2→1
s2 ∝ g2 (s2 ) f(23) (s2 , s3 )χ3→2
s3 , χ2→3
s2 ∝ g2 (s2 ) f(21) (s2 , s1 )χ1→2
s1
s3 ∈{−1,1} s1 ∈{−1,1}
X X
χ3→1
s3 ∝ g3 (s3 ) f(32) (s3 , s2 )χ2→3
s2 , χ3→2
s3 ∝ g3 (s3 ) f(31) (s3 , s1 )χ1→3
s1
s2 ∈{−1,1} s1 ∈{−1,1}

Note that in this case it is not simple to solve for the messages. For instance, the marginal of
S1 = s1 is given by:
X
µ1 (s1 ) ∝ g1 (s1 ) f(12) (s1 , s2 )χ2→1
s2
s2 ∈{−1,+1}

and therefore it depends on χ2→1


s2 . However, χs2
2→1 depends on χ3→2 , which itself depends on
s3
χ1→3
s1 which depends on χ2→1s2 . In other words: we cannot factorize the marginals in terms
of independent messages because removing an edge, say (12), doesn’t make the variable S1
independent from variable S2 ; they are correlated through their common link to S3 . This is a
direct consequence of the fact that the graph we consider has a loop.
Chapter 6

Belief propagation for graph coloring

The picture will have charm when each colour is very unlike the
one next to it.

Leon Battista Alberti – 1404-1472

As discussed previously, the probability distribution that we want to investigate in graph


coloring, given graph G(V, E) and colors si ∈ {1, 2, . . . , q} reads
  1
e si ,sj .
Y −βδ
P {si }N
i=1 = (6.1)
ZG (β)
(ij)∈E

In physics that variables of the type are called Potts spins, and the model is called correspond-
ingly the Potts model. In what follows we will consider both the repulsive (anti-ferromagnetic)
case of β > 0 and the attractive (ferromagnetic) case of β < 0. In our notation, the node-factors
will simply be gi (si ) = 1 for all i, and the interaction factors f(ij) (si , sj ) = e−βδsi ,sj for all
(ij) ∈ E.

6.1 Deriving quantities of interest

Let us now illustrate how to compute quantities of interest


P such as the number of configu-
rations having a given energy. We define the energy e = (ij)∈E δsi ,sj /N as the number of
monochromatic edges per node. We denote the number of colorings of a given energy as N (e)
and define the entropy s(e) via
N (e) = eN s(e) (6.2)
Letting Φ be the free entropy density, we can then rewrite the partition function as
Z
e e e = de eN s(e)−N βe
X −β P δsi ,sj
X X
N Φ(β) −N βe
= ZG = (ij)∈E =
{si }N e all coloring of energy e
i=1

where in the third step we split the sum into the sum of all the configurations that will be at
energy e and the sum over all the energies and in the last step we replaced the sum over the
96 F. Krzakala and L. Zdeborová

discrete values of e by an integral over e and we used the definition introduced in equation 6.2.
This is well justified since we consider the limit N → ∞ and we are interested in the leading
order (in N ) of Φ, e and s. The saddle-point method then gives us

∂s(e)
= β, Φ(β) = s(e∗ ) − βe∗ (6.3)
∂e e=e∗

A random configuration (sampled from the Boltzmann distribution at temperature β) will


have energy concentrating at e∗ w.h.p. ⇒ ⟨e⟩Boltz = e∗ . We also remind that the Boltzmann
distribution is convenient because

dΦ(β)
= − ⟨e⟩Boltz . (6.4)

Thus, if we compute the free entropy density Φ as a function of β, we can compute also its
derivative and consequently access the number of configurations of a given energy s(e). Doing
these calculations exactly is in general a computationally intractable task. We will use BP and
the Bethe free entropy to obtain approximate — and in some cases asymptotically exact —
results.

How can we use Belief Propagation to evaluate the above quantities? The BP fixed point gives
us a set of messages χi→j and the Bethe free entropy ΦBethe as a function of the messages. In
general, the messages give us an approximation of the true marginals of the variables, and the
Bethe free entropy ΦBethe give us an approximation for the true free entropy Φ. Remembering
that a BP fixed point is a stationary point of the Bethe free entropy we get:

dΦBethe (β) ∂ΦBethe ∂χ ∂ΦBethe (β)


= · + (6.5)
dβ ∂χ ∂β ∂β
| {z }
=0 at BP fixed point

Thus, given a BP fixed point we can approximate both the free entropy and the average energy.
Once we evaluated ΦBethe and e∗ we can readily obtain the entropy s(e), which we defined as
the logarithm of the number of colorings at energy e:

s(e∗ ) = Φ(β) + βe∗ . (6.6)

Remember, however, that the correctness of the resulting s(e) relies on the correctness of the
Bethe free entropy for a given fixed point of the BP equations.

6.2 Specifying generic BP to coloring

We now apply the BP equations, as derived in the previous section, even though in general the
graph G(V, E) is not a tree. In what follows, we will see under what circumstances this can lead
to asymptotically exact results for coloring of random graphs. Note that, on generic graphs this
approach can always be considered as a heuristic approximation of the quantities of interest
(in most cases the approximation is hard to control). In this section we will manipulate the
BP equations to see what can be derived from them and, when needed, we will restrict our
considerations to random sparse graphs.
6.3 Bethe free energy for coloring 97

Applying the recipe from the previous section we obtain:


1 Y
χj→(ij)
sj = ψs(kj)→j
Z j→(ij) j
(kj)∈∂j\(ij)
1 1 h   i
e−βδsi ,sj χsj→(ij) 1 − 1 − e−β χj→(ij)
X
ψs(ij)→i = = si
i
Z (ij)→i sj
j
Z (ij)→i

We see that since every factor node has two neighbors, the second BP equations has a simple
form (as the product is only over one term). We can thus eliminate the messages ψ from the
equations while still keeping a simple form. We get
1 h   i
e
Y
−β
χj→(ij)
sj = 1 − 1 − χ k→(kj)
sj
Z j→(ij) (kj)∈∂j\(ij) Z (kj)→j
Q
(kj)∈∂j\(ij)
1 Y h   i
⇒ χsj→i = 1 − 1 − e −β
χ k→j
s
j
Z j→i j
k∈∂j\i

In the last equation we defined an overall normalization term Z j→i and went back to a graph
notation, where ∂i denotes the set of neighbouring nodes of i (instead, in a factor graph
notation ∂i would denote the set of neighbouring factors of i). On a first sight, this change
of notation can be confusing, since we use the same letter χ to denote the messages on the
original graph and on the factor graph. However, note it presents no ambiguity: messages
between two variable nodes i → j can only refer to the original graph, since in a factor graph
we cannot connect two variable nodes directly.

The resulting belief propagation equations have quite an intuitive meaning. Recall that χj→i
sj
represents the probability that node j takes color sj if the connection between j and i was
temporarily removed. Keeping this in mind, the terms in the above equations have the
following meaning

→ 1 − 1 − e−β χk→i sk + e
= sk ̸=sj χk→i −β χk→i , is the probability that neighbor k allows
 P
sj sj
node j to take color sj .

→ k∈∂j\i [· · ·] is the the probability that all the neighbors let node j take color sj (i was
Q
excluded, as the edge (ij) is removed). The product is used because of the implicit
assumption of BP, about the independence of the neighbors when conditioning on the
value sj .

In computer science belief propagation is most commonly discussed as an algorithm. We note


that in the context of graph coloring the messages provide the marginal probabilities, not
directly a proper coloring. In some regimes a proper graph coloring can be deduced from
BP with decimation of variables, i.e. variables are set one by one to values deduced from the
values of the message at convergence or after a given number of iterations. We will discuss
how well this algorithm performs in follow up chapters. In this chapter we will be using BP
rather as an analysis tool.

6.3 Bethe free energy for coloring


98 F. Krzakala and L. Zdeborová

In a homework problem you will show that using similar simplifications as above we can
rewrite the generic Bethe free entropy for graph coloring as
N
X X
N ΦBethe (β) = log Z (i) − log Z (ij) (6.7)
i=1 (ij)∈E

where
XYh   i
Z (i) = 1 − 1 − e−β χk→i
s
s k∈∂i
 X
e
−βδsi ,sj
e
X
(ij) −β
Z = χi→j
si χj→i
sj = 1 − 1 − χi→j
s χs
j→i

si ,sj s

We keep in mind that the Bethe free entropy is evaluated at a fixed point of the BP equations.
Sometimes we will think of the Bethe free entropy as a function of all the messages χi→j or of
a parametrization of the messages.

We will always denote the free entropy with the index "Bethe" when we are evaluating it
using the BP approximation. While on a tree we showed that Φ = ΦBethe , in the case of a
generic (even locally tree-like) graph we will discuss the relation between Φ and ΦBethe (more
precisely its global maximizers) in more detail in the next lectures.

We note that, so far, all we wrote depends explicitly on the graph G(V, E) through the list of
nodes V and edges E. No average over the graph (i.e. the disorder) was taken! This is rather
different from the replica method, where the first step in the computation is in fact taking the
average over the disorder.

6.4 Paramagnetic fixed point for graph colorings

We notice that for graph coloring χj→i sj = q −1 for all (ij) and sj is a fixed point of the BP
equations on any graph G(V, E). We will call this the paramagnetic fixed point. In general
there might be, and often are, other fixed points, as we will discuss in next sections. To check
that q −1 is indeed a fixed point of BP, call di the degree of node i (i.e. the number of neighbors
of node i) and obtain the BP recursion
h  idi −1
1 − 1 − e−β 1q 1
idi −1 = (6.8)
q
h
q 1 − (1 − e ) q
−β 1

which is indeed always true. For the Bethe free entropy this paramagnetic fixed point gives us:
N
(  )
  1 di 
1 −β 1

log q 1 − 1 − e log q · 2 e + (q − q) ·
X X
−β 2
N ΦBethe (β) = −
q q q
i=1 (ij)∈E
 1   1 
 c 
ΦBethe (β) = c log 1 − 1 − e−β + log(q) − log 1 − 1 − e−β
q 2 q
 
c   1
= log(q) + log 1 − 1 − e−β
2 q
6.4 Paramagnetic fixed point for graph colorings 99

where we called c = i di /N = 2M/N the average degree of the graph. For the average
P
energy and entropy at inverse temperature β, we get

∂ΦBethe (β)
∗ c e q −β 1
c e−β
energy e = − = =
∂β 2 1 − (1 − e−β ) 1q 2 (q − 1) + e−β
| {z }
prob. an edge is monochromatic
e−β

  βc
c −β 1

entropy s(e ) = ΦBethe (β) + βe = log(q) + log 1 − 1 − e
∗ ∗
+
2 q 2 (q − 1) + e−β
Notice that these expressions give us a parametric form for s(e). Sometimes we can even
exclude β and write s(e) in a closed form, but generically we simply plot parametrically
e(β), s(β). In the figure we take q = 4 colors and several values of the average degrees c.

0
1
1
2
s( )

c=2
6 c=6
c = 12
0 1 2 3 4 5 6
e( )

Figure 6.4.1: Entropy as a function on the energy cost for graph coloring with q = 4 colors
corresponding to the paramagnetic fixed point of belief propagation. Average degree of the
graph is c.

For a given pair of parameters c, q, the curve for s(e) ranges from 0 to c/2 because at most all
the edges can be violated. The curve achieves a maximum equal to log(q) at e = c/2q, because
that is the typical cost of a random coloring (each edge is violated with probability 1/q). The
slope of the curve s(e) corresponds to the inverse temperature β since ∂s(e)
∂e = β.

Recalling that the entropy s is a logarithm of an integer (number of colorings with given
energy), it is clear it should not take negative values. The fact that we see negative values for
large energies and also for energy close to zero for c = 12 indicates that either coloring with
such energy do not exist with high probability or that there was a flaw in what we did. Note
also that so far we did not specify anything about the graph, except its average degree. Clearly
we could have graphs with average degree c = 6 that contain one or several (even linearly
many in N ) 5-cliques and thus will not have any valid coloring. Thus the result we obtained
cannot be valid for those graphs (indicating exponentially many valid 4-colorings e = 0 for
c = 6). At this point, we thus restrict to random sparse graphs and we investigate whether the
results we obtained are plausible at least in that case.

We notice, from the results we obtained for the paramagnetic fixed point, that χi = 1q for all
i implies that values of energies larger than certain values strictly smaller than c/2 are not
accessible, as they have negative entropy. At the same time, if all nodes had the same color this
100 F. Krzakala and L. Zdeborová

would automatically achieve the energy e = c/2. To reconcile this paradox we must realize
that the paramagnetic fixed point χi = 1q for all i assumes that every color is represented the
same number of times, which is clearly not the case if all nodes have the very same color. With
BP, it is often the case that there are several fixed points and we need to select the correct one.
In general we need to find the fixed point with larger free entropy (from the saddle point
method we know this is the one that dominates the probability measure). For the graphs
coloring problems at large energy, i.e. β < 0 this motivates the investigation of a different
fixed point that is able to break the equal representation of every color. We will call it the
ferromagnetic fixed point.

6.5 Ferromagnetic Fixed Point

We now investigate whether the BP equations for graph coloring have fixed points of the
following form for all (ij) ∈ E:
1−a
χi→j
1 = a, χi→j
s =b= , ∀ s ̸= 1 (6.9)
q−1
Then, the graph coloring BP equations would read
1 Y h   i
χsj→i = 1 − 1 − e −β
χ k→j
sj
j
Z j→i
k∈∂j\i
1 h   idj −1 1
a = j→i 1 − 1 − e−β a =: j→i Adj −1
Z Z
1 h   idj −1 1
b = j→i 1 − 1 − e −β
b =: j→i B dj −1
Z Z
with the normalization
Adj −1 B dj −1
Z j→i = (q − 1)B dj −1 + Adj −1 , a= , b=
(q − 1)B dj −1 + Adj −1 (q − 1)B dj −1 + Adj −1
(6.10)
For such a ansatz to be a fixed point for every (ij) ∈ E we need dj ≡ d for all j. This is the
case with random d-regular graphs where every variable node has degree d and satisfies this
condition. We could, of course, have a ferromagnetic fixed point where the a depends on (ij)
and solve the corresponding distributional equations, but in this section we look for a simpler
solution to illustrate the basic concepts and we will thus restrict to random d-regular graphs,
d ≥ 3. For d-regular random graphs we obtain the following self-consistent equation for the
parameter a, given the degree d, inverse temperature β and number of colors q

1 − 1 − e−β a
  d−1
a= h id−1 =: RHS(a; β, d, q) (6.11)
[1 − (1 − e−β ) a] + (q − 1) 1 − (1 − e−β ) 1−a
d−1
q−1

To express the Bethe free entropy corresponding to this ferromagnetic fixed point we use
eq. (6.7) and plug (6.9) in it to get:
( d h )
−β 1 − a
 id
ΦBethe (β) = log (q − 1) 1 − (1 − e ) + 1 − (1 − e )a
−β
(6.12)
q−1
(1 − a)2
  
d
− log 1 − (1 − e )
−β
+a 2
2 q−1
6.5 Ferromagnetic Fixed Point 101

1.0 a 1.8
= 0.7
= 0.5108 1.7
0.8
= 0.4
1/q 1.6
0.6
1.5

(a)
rhs

0.4 1.4

1.3
0.2
1.2
0.0
1.1
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
a a

Figure 6.5.1: Illustration of a second order (continuous) phase transition for 2-coloring on
5-regular graph and several values of the inverse temperature β. In the left panel, the right
hand side of eq. (6.11) is plotted against the parameter a. In the right panel, the Bethe free
entropy is plotted for the same inverse temperatures. The stable fixed points are marked and
correspond to the local maxima of the Bethe free entropy.

In the last expressions we can think of the Bethe free entropy as a function of the parameter a,
keeping in mind that we are seeking to evaluate it at the global maximizer.

6.5.1 Ising ferromagnet, q = 2

In Fig. 6.5.1 we plot the left hand side and the right hand size of eq. (6.11) as a function of a, for
a given value of degree d and number of colors q and several values of the inverse temperature
β. We also plot the Bethe free entropy as a function of a.

We observe that for β = −0.4 (green curve) the only fixed point of (6.11) and maximum of
(6.12) is reached at a = 1/q. This is the paramagnetic fixed point we investigated previously.
For β = −0.7 (blue curve) we see, however, a different picture. The fixed point a = 1/q is
unstable under iterations of (6.11) and corresponds to a local minimum of the Bethe free
entropy. There are two new stable fixed points that appear and correspond to the maxima of
the Bethe entropy.

At what value of the inverse temperature β do the additional fixed points appear? For this we
need to evaluate the stability of the paramagnetic fixed point, i.e. when the derivative at the
paramagnetic fixed point ∂RHS∂a a=1/q = 1:

(d − 1)(1 − e−β )
 
∂RHS q
1= =− ⇒ βstab = − log 1 + (6.13)
∂a a= 1q e−β + (q − 1) d−2

We can investigate other values of the degree d and still two colors q = 2 and we will observe
the same picture for all d. We recognize the second order phase transition from a paramagnet
to a ferromagnet that we saw already in the Curie-Weiss model. Indeed the only difference
between the present model and the Curie-Weiss model is that the graph of interactions was
fully connected in the Curie-Weiss model while it is a d-regular random graph in the present
case.
102 F. Krzakala and L. Zdeborová

1.0 a
a
1/q
0.8

0.6

a*
0.4

0.2

0.0
1.50 1.25 1.00 0.75 0.50 0.25 0.00 0.25 0.50

Figure 6.5.2: Magnetization for 2-coloring on 5-regular graph as a function of the inverse
temperature β.

This is the Bethe approximation to the solution of the Ising model on regular cubic lattices. Of
course lattices are not trees, but if we match the degree of the random graph to the coordination
number of a cubic lattice in D dimension we obtain d = 2D and we can observe that for D = 1
(d = 2) the βstab = −∞. This is actually an exact solution because the 1-dimensional cubic
lattice is just a chain which is a tree-graph. Thus the BP solution is exact. For a 2-dimensional

Ising model that has been famously solved by Onsager we have βOnsager = − log(1 + 2) =
−0.881 which is relatively close to the Bethe approximation βstab (d = 4) = −0.693. Note that
the sign is opposite from what can be found in the literature because here we defined positive
temperature for the coloring (the anti-ferromagnet), and also that there is a multiplicative
factor two as here the energy cost for a variable change is 1 (whereas in the usual Ising
model it is 2). For the 3-dimensional Ising model no closed form solution exists yet, the
critical temperature has been evaluated numerically to very high precision and reads β3D =
−0.4433, again to be compared with its Bethe approximation βstab (d = 6) = −0.4055 which is
remarkably close. As the degree grows the Bethe approximation actually gets closer and closer
to the finite dimensional values. And eventually as d → ∞ we recover exactly the Curie-Weiss
solution that we studied in the first lecture with a proper rescaling on the interaction strength.
You will show this for a homework.

6.5.2 Potts ferromagnet q ≥ 3

For more than two colors, q ≥ 3, we find a somewhat different behaviour leading to a 1st
order phase transition. Let us again start by plotting the fixed point equations and the Bethe
free entropy in Fig. 6.5.3. We see that, as before, βstab marks the inverse temperature at which
the paramagnetic fixed point 1/q becomes unstable and the corresponding Bethe entropy
maximum becomes a minimum. But there is another stable fixed point, corresponding to a
local maximum of the Bethe entropy appearing at βs > βstab . This inverse temperature where
a new stable fixed point appears discontinuously is called the spinodal temperature in physics.

When there are more than one stable fixed points, more than one local maximas in Bethe
free entropy, eq. (6.12), we must compare their free entropies, the larger one dominates the
corresponding saddle point and hence is the correct solution.
6.5 Ferromagnetic Fixed Point 103

1.0 2.8

0.8 2.6

0.6 2.4

(a)
rhs

0.4 a
= 1.1 2.2
= 1.0
0.2 = 0.88
= 0.7 2.0
0.0 1/q
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
a a

Figure 6.5.3: Illustration of a first order (discontinuous) phase transition for 6-coloring on
5-regular graph and several values of the inverse temperature β. In the left panel, the right
hand side of eq. (6.11) is plotted against the parameter a. In the right panel, the Bethe free
entropy is plotted for the same inverse temperatures. The stable fixed points are marked and
correspond to the local maxima of the Bethe free entropy.

1.0
Ferromagnetic
4.00 Equilibrium
Paramagnetic
0.8 3.75

3.50
0.6
3.25
(a*)
a*

3.00
0.4
2.75

0.2 stab c s 2.50

2.25
1.5 1.4 1.3 1.2 1.1 1.0 0.9 0.8 1.5 1.4 1.3 1.2 1.1 1.0 0.9 0.8

Figure 6.5.4: The magnetization (left panel) and Bethe free entropy as a function of the inverse
temperature β for q = 10 and d = 5. (Note a different values of q from previous figure.)

The inverse temperature at which the ferromagnetic fixed point a > 1/q becomes the global
maximum of the Bethe entropy, instead of the paramegnetic fixed point, will be denoted βc and
corresponds to a first order phase transition. We have βstab < βc < βs . The order parameter a
changes discontinuously at βc in a first order phase transition. In the case of q = 2, instead,
we saw a continuous 2nd order phase transition, where the ferromagnetic fixed points appear
at the same temperature at which the corresponding free entropy becomes a global maximum,
together with the instability of the paramagnetic fixed point. In other words, in 2nd order
phase transitions βs = βc = βstab .

With the knowledge of the ferromagnetic point we can hence correct the result for the number
of colorings at a given cost, in Fig. 6.5.5.
104 F. Krzakala and L. Zdeborová

1 c

s( )
1

2
Paramagnetic
Ferromagnetic
0.0 0.5 1.0 1.5 2.0 2.5 3.0
e( )

Figure 6.5.5: Entropy as a function of the energy for graph coloring with q = 4 colors and
average degree c = d = 6 (same as Fig. 6.4.1).

Two important comments are in place:

• The results obtained with BP on random graphs are exact for β < 0 in the sense that

N →∞
∀ ε > 0, P r (|Φ(β) − ΦBethe (β)| < ε) −→ 1 w.h.p (6.14)

– For q = 2 this has been proven by Dembo & Montanari 2010 Dembo et al. (2010b)
(on sparse random graphs even beyond d-regular).
– For q ≥ 3 this has been proven by Dembo, Montanari, Sly, Sun, 2014 Dembo et al.
(2014) on d-regular graphs.

The situation for β > 0 is much more involved and will be treated in the following.

• We note that the paramagnetic and ferromagnetic fixed points are BP fixed points even
for finite size regular graphs, not only in the limit N → ∞. But of course not every
regular graph has the same number of colorings of a given energy, especially not the
non-random ones. Also, the discussed phase transitions exist only in the limit N → ∞,
but BP fixed point behave the same way we discussed even at finite size N . This means
that BP ignores part of the finite size effects and gives us an interesting proxy of a phase
transition even in finite size, where formally no phase transition can exists since the free
entropy is an analytic function for every finite N .

6.6 Back to the anti-ferromagnetic (β > 0) case

Let us now go back to the anti-ferromagnetic case, β > 0, with β → ∞ corresponding to


the original coloring problem on random sparse graph. We aim to obtain a condition on the
average degree for which the graphs are with high probability colorable, or not. We also want
to know, in the phase where colorings exist with high probability, how many of them are there,
i.e. the value of s(e = 0).
6.6 Back to the anti-ferromagnetic (β > 0) case 105

Previously, we evaluated the entropy s(e) of the paramagnetic fixed point χi→j s ≡ 1q . We
can observe that the corresponding Bethe free entropy is equal to the so-called annealed free
entropy.

To define this term, let us remind the general probability distribution we are considering:
N M
  1 Y Y
P {si }N (6.15)

i=1 = gi (s i ) fa {si }i∈∂a .
ZG
i=1 a=1

The free entropy ΦG = N1 log(ZG ) then depends explicitly on the graph G. We expect the free
entropy to be self-averaging, i.e. concentrating around its mean as
N →∞
∀ ε > 0, Pr (|ΦG − EG [ΦG ]| > ε) −→ 0 (6.16)
When N → ∞, computing ΦG and EG [ΦG ] should thus lead to the same result.

We define:
 
1
quenched free entropy Φquench ≡ EG log(ZG )
N
1
annealed free entropy Φanneal ≡ log (EG [ZG ])
N

Naively, we could expect to have (E[log(Z)] − log(E[Z])) /N → 0, but this is often not the case.
The partition function is of order Z = O (exp(N )) as N → ∞, and concentration holds only
for N1 log(Z). The annealed free entropy can get dominated by rare instances of the graph G.
Let us give a simple example.

Consider an artificial problems where we have

eN , w.p. 1 − e−N
(
ZG =
e3N , w.p. e−N
The quenched and annealed averages are then given by:
 
1
log(ZG ) = 1 − e−N + e−N 3 = 1 + e−N (2) → 1

EG
N
1 1 1
log eN − 1 + e2N = 2 + log 1 + e−N − e−2N → 2
 
log (EG [ZG ]) =
N N N
We see that while the quenched entropy represents the typical values, the annealed entropy got
influenced by exponentially rare values and could completely mislead us about the properties
of the typical instance. In general, since log(·) is a concave function, by Jensen’s inequality we
have Φanneal ≥ Φquench , therefore the annealed free entropy will at least provide us with an
upper bound. Of course, Φanneal is usually much easier to compute than Φquench .

For the coloring problem, let G(N, M ) represent a random graph with N nodes, and M edges
chosen at random among all possible edges. The annealed free entropy then follows from
 M
 
N  −β 1 1 
h i 
EG(N,M ) [ZG (β)] = q E{si }N EG(N,M ) e = q e
P
N −β (ij)∈E δsi ,sj
+ 1−

q q 

i=1 
| {z }
free entropy of one edge
106 F. Krzakala and L. Zdeborová

Here we use the fact that edges in the random graph are independent and that, for β > 0,
the contributions to the average are dominated by colorings where each color is represented
roughly equally. The annealed entropy is thus positive and vanishes at average degree
2 log q
cann (β) = − h i (6.17)
log 1 − (1 − e−β ) 1q

Notice that the annealed free entropy and the Bethe one corresponding to the paramagnetic
fixed point are the same Φanneal = ΦBethe |χ≡ 1 .
q

The question is whether the annealed/paramagnetic free entropy is correct for all average
degrees c and inverse temperatures β > 0. Unfortunately, the answer is negative. For instance
for β → 0 in Coja-Oghlan (2013) the authors show that all proper colorings disappear with high
probability for values of average degree strictly smaller than the average degree cann (β → ∞)
at which Φanneal (β → ∞) becomes negative. This means that Φanneal cannot be equal to the
quenched free entropy all the way to cann .

What could possibly go wrong with the paramagnetic fixed point χi→j s ≡ 1q ? One immediate
thing we should ask is whether belief propagation converges to this fixed point on large random
graphs. Is the paramagnetic fixed point even a stable one, i.e. if we initialize as χi→j
s = 1q + εi→j
s
P i→j i→j
(with s εs = 0 due to normalization), does BP converge back to χs ≡ q ? To investigate
1

this question one could implement the iterations on a large single graph and simply try out.

A computationally more precise way to find the answer to these questions is to perform the
linear stability analysis. Let θ = 1 − e−β , and consider the first-order Taylor expansion around
the fixed point χ ≡ 1q (or equivalently εk→j
sk ≡ 0)

1 X X ∂χj→i
sj (t + 1)
χsj→i (t + 1) = + εk→j
sk
j
q ∂χsk→i (t)
s
k∈∂j\i k k χ≡ 1q

Note that
Y  θ dj −1 θ dj −1
    
1 1 1 1 j→i
= j→i 1−θ = j→i 1− , Z χ≡ 1q
=q 1−
q Z |χ≡ 1 q Z |χ≡ 1 q q
q k∈∂j\i q
(6.18)
 
∂χsj→i
j (t + 1) 1 ∂ Y  
=  1 − θχℓ→j
sj (t) 
∂χk→j
sk (t) χ≡ 1q
Z j→i ∂χk→j
sk (t) ℓ∈∂j\i χ≡ 1q

1 Y   ∂Z j→i
− 1 − θχℓ→j
sj (t)
(Z j→i )2 ℓ∈∂j\i
∂χk→j
sk (t)
χ≡ 1q
 dj −2  dj −1 "
−θ 1 − θq 1 − θq
#
θ dj −2
 
= δsj ,sk −  2 −θ 1 −
Z j→i |χ≡ 1 j→i q
q Zχ≡ 1
q

θ θ
= −δsj ,sk +
q − θ q(q − θ)
6.6 Back to the anti-ferromagnetic (β > 0) case 107

In order to proceed, we define a q × q matrix T such that:


 
a b ··· b
( θ(1−q) .. .. 
a := if j = k . .

q(q−θ) ,
b a
Tjk = ⇒ T=
 .. . . (6.19)
..

b := θ
q(q−θ) , if j ̸= k . . . b

b ··· b a
Notice that the matrix T has q − 1 degenerate eigenvalues
(
−θ < 0, for β > 0 T
, with eigenvector +1 −1 0 · · ·

λmax = a − b = ⇒ 0
q−θ > 0, for β < 0

and a single zero eigenvalue.

Keeping this in mind, we can see that the linear expansion of the BP equations for ε’s reads:
X X
εsj→i
j
(t + 1) = Tsj ,sk εk→j
sk (t) . (6.20)
k∈∂j\i sk

We now define the excess degree distribution, p̃k = (k + 1)pk+1 /c, for a random graph ensemble
with degree distribution pk and average degree c. p̃k represents the probability that a randomly
chosen edge is incident to a Pnode has k other edges except the chosen edge, i.e. has degree
k + 1. And similarly, let c̃ = k k p̃k be the average excess degree. With it we obtain:

⟨ε(t + 1)⟩ = c̃ λmax ⟨ε(t)⟩ , (6.21)

where the ⟨·⟩ is the average over edges. A phase transition occurs at c̃ λmax = 1 that determines
whether lim ⟨ε(t + 1)⟩ blows up to infinity or converges to zero.
t→∞

−c̃ 1 − e−βstab
  
q
c̃ λmax = =1 ⇒ βstab = − log 1 + (6.22)
q − (1 − e−βstab ) c̃ − 1
This is the stability transition that we have already computed for the ferromagnetic solution for
β < 0. For β > 0 we notice that λmax < 0 and thus the corresponding instability corresponds to
an oscillation from one color to another at each parallel iteration. Such an oscillatory behaviour
would be possible on a bipartite graph, where indeed one side of the graph could have one
color and the other side another color. But this two-color solution is not compatible with the
existence of many loops of even and odd length in random graphs. The temperature βstab will
thus not have any significant bearing on the anti-ferromagnetic case β > 0.

Is it possible that the mean of the perturbation ⟨ε⟩ → 0 but the variance ε2 ↗ ∞? Let us
investigate:
 * " #2 " #" #+
2  X X X X X
j→i k→j k→j ℓ→j
εsj (t + 1) = Tsj ,sk εsk (t) + Tsj ,sk εsk (t) Tsj ,sℓ εsℓ (t)
k∈∂j\i sk k,ℓ∈∂j\i sk sℓ
k̸=ℓ
| {z D E
}
=0 since neighbors are independent εk→j ℓ→j
sk εsℓ =0
*" #2 +
X
= c̃ Tsj ,sk εsk→j
k
(t) .
sk
108 F. Krzakala and L. Zdeborová

Therefore, the variance will be determined by the maximum eigenvalue λmax of T, such that

Var(t + 1) = c̃λ2max Var(t) (6.23)

We can thus distinguish two cases for the graph coloring problem:

(q−θ)2
• For c̃ < θ2
BP converge back to χ ≡ 1
q . Specifically for β → ∞ we get

(q − θ)2
c̃KS = → (q − 1)2 (6.24)
θ2

(q−θ)2
• For c̃ > θ2
BP goes away from 1
q and actually does not converege.

Note that the abbreviation KS comes from the works on Kesten and Stigum (1967), and in
physics this transition is related to the works of de Almeida and Thouless (1978).

For Erdös-Rényi random graphs, where the average degree and the excess degree are equal
c̃ = c, in the setting q = 3 and β → ∞ we have that cKS (q = 3) = 4 < cann (q = 3) = 5.4. cKS
in this case is also smaller that the upper bound on colorability threshold from Coja-Oghlan
(2013). At average degree c > 4 the BP equations do not converge anymore and another
approach, based on replica symmetry breaking, will be needed to understand what is going
on. It is interesting to note that algorithms that are provably able to find proper 3-coloring
exist for all c < 4.03 Achlioptas and Moore (2003), so even slightly above the threshold
cKS (q = 3) = 4.

For q ≥ 4 and β → ∞ we have that cKS > cann and the investigated instability cannot be used
to explain what goes wrong in the colorable regime. An important motivation to understand
what is happening comes from the algorithmic picture that appears for large values of the
number of colors q. In that case the annealed upper bound on colorability scales as 2q log q,
and probabilistic lower bounds (see e.g. Coja-Oghlan and Vilenchik (2013)) imply that this
is indeed the right scaling for the colorability threshold. Yet, for what concerns polynomial
algorithms that provably find proper colorings, we only know of algorithms that work up
to degree q log q Achlioptas and Coja-Oghlan (2008), i.e. up to half of the colorable region.
Design of tractable algorithms able to find proper coloring for average degrees (1 + ϵ)q log q is
a long-standing open problem. What is happening in the second half of the colorable regime
will be clarified in the follow up lectures.

EASY HARD NON-COLORABLE


c
2q log(q)
q log(q)
colorabil-
ity

6.7 Exercises
6.7 Exercises 109

Exercise 6.1: Bethe free entropy

(a) Show from the generic formula for Bethe free entropy density that we derived in
the lecture
N M
1 X 1 X 1 X
Φgeneral
Bethe = log Z i + log Z a − log Z ia (†)
N N N
i=1 a=1 ia

where
X Y
Zi = gi (s) ψsa→i
s a∈∂i
X Y
a
Z = fa {si }i∈∂a χi→a
si
{si }i∈∂a i∈∂a
X
Z ia = ψsa→i χi→a
s
s

that the Bethe free entropy density for graph coloring can be written as
N
1 X 1 X
Φcoloring
Bethe = log Z (i) − log Z (ij)
N N
i=1 (ij)∈E

where
XYh   i
Z (i) = 1 − 1 − e−β χk→i
s
s k∈∂i
 X
Z (ij)
= 1 − 1 − e−β χi→(ij)
s χj→(ij)
s
s

Exercise 6.2: Fully connected limit from Belief Propagation


The goal of this exercise is to show that, when the average degree is large, the Belief
Propagation solution gives back the fully connected limit, and in particular the one
obtained with the "fully-connected" cavity method shown in eq. (1.22). We shall do
this starting from the Potts ferromagnetic model, using q = 2 to make the connection
with the ferromagnetic solution of section 1.4.
(a) First, we check that we can find back the mean field self-consistent equation. To
do so, first rewrite the recursion relation derived in section 6.5 using q = 2, and
denoting χ± = (1 ± m)/2, show that the belief propagation recursion fixed point
can be written as
 d−1  d−1
cosh β2 + m sinh β2 − cosh β − m sinh β2
m = − d−1  d−1 . (6.25)
cosh β2 + m sinh β2 + cosh β2 − m sinh β2

Using the relation atanhx = 1−x , show that this yields


1
log 1+x
2
   
β
m = − tanh (d − 1)atanh m tanh . (6.26)
2
110 F. Krzakala and L. Zdeborová

(b) We now move the large d limit, and use β = − 2βdMF . Why is this necessary in order
to recover the fully connected model? Show that it leads indeed to the mean field
equation
m = tanh (βMF m). (6.27)

Another approach is to look directly to the free entropy, and to recover directly the free
entropy expression of the fully connected model by taking the large connectivity limit
of the Bethe free entropy:

(a) Show that the Bethe free entropy reads, according to Belief Propagation:
d  d !
−β 1 − m

−β 1 + m
Φ(β, m) = log 1 − (1 − e ) + 1 − (1 − e )
2 2
 
d 1
− log 1 − (1 − e−β ) 1 + m2

2 2

(b) We use again β = 2βdMF . Show that, to leading order in d, we recover, as d → ∞ the
expression (1.22). Hint: First rewrite the expressions inside the log of first line as
 
d 2
1±m d log 1−(1−e− d βMF ) 1±m

1 − (1 − e−β ) (6.28)
2
=e .
2

Notice however that there is an additional trivial term βMF /2 and explain why this
additional term is here.

Exercise 6.3: Belief propagation for the matching problem on random graphs

Consider now the matching problem on sparse random graphs. Use the graphical
model representation from the previous homework.

(a) Write belief propagation equations able to estimate the marginals of the probability
distribution
 
N
  1
eβS(ij) I  S(ij) ≤ 1
Y Y X
P S(ij) (ij)∈E =
Z(β)
(ij)∈E i=1 j∈∂i

Be careful that in the matching problem the nodes of the graph play the role of
factor nodes in the graphical model and edges in the graph carry the variable nodes
in the graphical model.

(b) Write the corresponding Bethe free entropy in order to estimate log(Z(β)). Use re-
sults of the previous homework to suggest how to estimate the number of matchings
of a given size on a given randomly generated large graph G.

(c) Consider now d-regular random graphs and draw the number of matchings as a
function of their size for several values of d. Comment on what you obtained, does
it correspond to your expectation? If not, explain the differences.
6.7 Exercises 111
Part II

Probabilistic Inference
Chapter 7

Denoising, Estimation and Bayes


Optimal Inference

Si un événement peut être produit par un nombre n de causes


différentes, les probabilités de l’existence de ces causes prises de
l’événement sont entre elles comme les probabilités de
l’événement prises de ces causes, et la probabilité de l’existence de
chacune d’elles est égale à la probabilité de l’événement prise de
cette cause, divisée par la somme de toutes des probabilités de
l’événement prises de chacune de ces causes.

Pierre Simon de Laplace – 1774

7.1 Bayes-Laplace inverse problem

In this chapter, we shall discuss the estimation, or the learning, of a quantity that we do not
know directly, but only through some indirect, noisy measurements. They are actually many
different ways to think of the problem depending on whether we are in the context of signal
processing, Bayesian statistics, or information theory, but it boils down to separating the signal
from the noise in some data.

To be concrete, consider the following situation: assume an unknown signal x (a vector, or a


scalar, or a matrix. . . ) is generated from a known distribution PX (x). We would like to know
this signal, to "estimate" its value. We are not given x, however, but instead a measurement y
obtained through some noisy process, whose characteristic are also known. In other words
we have: X → Y , with
X ∼ PX (x), Y ∼ PY |X (y|x) , (7.1)
and we aim at finding back, in the best way we could, x.

This is the setting of Bayesian estimation. PX is called the "prior" distribution, as it tells us
what we know on the variable X a priori, before any measurement is done. PY |X is telling us
the probability to obtain a given result y, given the value of x. Seen as a function of x for a
116 F. Krzakala and L. Zdeborová

given value of y, L = (x; y) = PY|X (y, x) is called the Likelihood of x. What we are really
interested in, however, is the value of x (the signal) if we measure y (the data). This is called
the posterior probability of X given Y : PX|Y (x, y). To obtain the latter with the former, we
follow the direction given by Laplace and Bayes in the late XVIII century, and we just write
the celebrated "Bayes" formula:

PY |X (x, y)PX (x)


PX|Y (x, y) = (7.2)
PY (y)

so that the posterior PX|Y (x, y) is given by the product of the prior probability on X, PX (x),
times the likelihood PY |X (x, y), divided by the "evidence" PY (y) (which is just a normalization
constant). Of course, if we deal with continuous variable, we can write the same formula with
probability density instead:

PY |X (x, y)PX (x)


PX|Y (x, y) = (7.3)
PY (y)

Note that this Bayesian setting is not entirely general! Unfortunately, we often do not know
what PX is in many estimation problems (and sometime, we do not know PY |X either) which
makes the use of this formalism tricky (and has generated a long standing dispute between
so-called frequentist and Bayesian statisticians), but in this chapter, we shall forget about these
problems, and restricted ourselves to the situation where we do know these distributions, so
that one can safely use Bayesian statistics. There are many concrete problems where this is
the case (central to fields such as information theory and error correction, signal processing,
denoising, . . . ) and so this will be enough to keep us busy for some time. We will move to
more complicated situations later.

7.2 Scalar estimation

7.2.1 Posterior distribution

Let us look at three concrete problems where the "true value", a scalar that we shall denote x∗ ,
is generated by:

1. A Rademacher random variable : X = ±1 with probability 1/2.

2. A Gaussian random variable with mean 0 and variance 1: X ∼ N (x; 0, 1).

3. A Gauss-Bernoulli random variable that is 0 with probability 1/2, and a Gaussian with
mean 0 and variance 1 otherwise: X ∼ N (x; 0, 1)/2 + δ(x)/2.

We shall concentrate on noisy measurements with Gaussian noise. In this case, we are given n
measurements

yi = x ∗ + ∆zi , i = 1, · · · , N (7.4)
7.2 Scalar estimation 117

with zi ∼ N (0, 1) a standard Gaussian noise, with zero mean and unit variance. Following
Bayes formula, we can now compute the posterior probability for our estimate of x∗ as:
1 1 P (y −x)2
− i i2∆
PX|Y (x, y) = e PX (x) (7.5)
PY (y) (2π∆)N/2
The posterior tells us all there is to know about the inference of the unknown x∗ .

For instance, in the Rademacher example, if we are given the five measurement numbers:
1.04431591, 2.55352006, 1.43665582, 1.37069702, 0.77697312 . (7.6)
It is very likely that the x∗ = 1 rather than x∗ = −1. How likely? We can compute explicitly
the posterior and find
N
!
1 X yi
Rademacher
PX|Y (x|y) = = σ 2x (7.7)
−2x
N
P yi ∆
∆ i=1
1+e i=1

where σ(x) is the sigmoid function. We thus find that we can estimate the probability that
x∗ = 1 to be larger that 0.999. So we are indeed pretty sure of our estimation. What this
number really mean is "if we repeat many time such experiments: when we measure such
outcomes for the y, then less than 1 in 1000 times it would have been with x∗ = −1".

Let us move to the more difficult Gaussian example. In this case, x∗ is chosen randomly from
a Gaussian distribution, and we are given 10 measurements:
0.04724576, 1.26855971, −0.19887457, 1.09534511, −1.46442807
0.44767123, 2.6244575, 1.94488421, 0.58953688, 0.572018 . (7.8)
We compute explicitly the posterior and find that it is itself also a Gaussian, given by:
 
PN
yi
 ∆ 
Gaussian
PX|Y (x|y) = N x; i=1 , . (7.9)
 
 N + ∆ N + ∆

The posterior distribution for this particular data set is shown in figure 7.2.1: it is Gaussian
with mean 0.630. and variance 0.091.. Actually, the true value of x∗ is this case was 0.67.

Let us consider finally the third example. In this case the posterior is slightly more complicated
and reads
 P N 
yi
i=1 ∆ 
N x; N +∆ , N +∆
δ(x)
Gauss−Bernoulli
PX|Y (x|y) = N
!2 + N
!2 (7.10)
P P
yi − yi
i=1 i=1
q q
∆ N +∆
1+ N +∆ e
2∆(N +∆) 1+ ∆ e
2∆(N +∆)

We performed the experiment,where x∗ is chosen randomly from a Gauss-Bernoulli, distribu-


tion, and we are given 10 measurements:
0.99978688, 1.7956116, −0.43158072, 3.07234211, 1.11920946]
0.53248943, 0.80011329, −0.52783428, 0.40378413, −0.00223177 . (7.11)

The posterior is shown in Figure 7.2.2 (here x∗ = 0.8, and p non zero is 0.8233787142909471).
118 F. Krzakala and L. Zdeborová

Figure 7.2.1: Posterior distribution for the Gaussian prior example and the data (7.8). The
true value x∗ is marked by a red vertical line.

Figure 7.2.2: Posterior distribution for the Gaussian prior example and the data (7.11). Note
the point mass component represented by the blue arrow and the Gaussian component. The
true value x∗ is marked by the red line.

7.2.2 Point-estimate inference: MAP, MMSE and all that.

Often, we are not really interested by the posterior distribution, but rather by a given estimate
of the unknown x∗ . We would really like to give a number and make our best guess! Such
estimates are denoted as x̂(y). x̂(y) should be a function of the data y that gives us a number
that is, ideally, as close as possible to the true value x∗ . The first idea that comes to mind is to
use the most probable value, that is the "mode" of the posterior distribution. This is called
maximum a posteriori estimation:

x̂MAP (y) =: argmax PX|Y (x, y) (7.12)


x

The MAP estimate is the default-estimator. This is the one of choice in many situations, in
particular because it is very often simple to compute.

However, it is not (at least for finite amount of data) always the best estimator. For instance
what if the posterior has bad looking as below? Is x̂MAP still reasonable for this case? We need
to think of a way to define a "best" estimator. The particular choice of the estimator depends
7.2 Scalar estimation 119

P (x|y)

Figure 7.2.3: A nasty looking posterior

on our definition of "error". For instance one could decide to minimize the squared error
(x̂(y) − x∗ )2 , or the absolute error |x̂(y) − x∗ |. If x∗ is discrete, however, we might instead
be interested by minimizing the probability of having a wrong value and use 1 − δx̂(y),x∗ .
Depending on our objective, we shall see that we should use a different estimator.

Let us consider the expected error one can get using a given estimator. Formally, we define a
"risk" as the average of the loss function L(x̂(y), x∗ ) over the jointed distribution of signal and
measurements. This is called the averaged posterior risk, as it is —indeed— the average of the
posterior risk:
Z
R averaged
(x̂) = Ex ,y [L(x̂, x )] = PX,Y (x∗ , y)dx∗ dyL(x̂(y), x∗ )


(7.13)
Z Z
= dyPY (y) PX|Y (x∗ , y)dx∗ L(x̂(y), x∗ ) (7.14)

(7.15)
 posterior 
= EY R (x̂, y)

Our goal, of course, is to find a way to minimize this risk. This minimal value, that is the "best"
possible error one can possibly get (on average) is called the Bayes risk:

RBayes = min Raveraged (x̂) (7.16)


Our goal is two-fold: we want to know what is the best possible error, the Bayes error, as well
as how to get it: we want to know what is the Bayes-optimal estimator that gives us the Bayes
risk.

This line of reasoning leads, for the square loss, to the following theorem:

Theorem 10 (MMSE Estimator). The optimal estimator for the square error, called the Minimal
Mean Square Error (MMSE) estimator, is given by the posterior mean:

x̂MMSE (y) = EX|Y (x, y) (7.17)

and the minimal mean square error is given by the variance of the estimator with respect to the posterior
distribution:
Z
MMSE =: min RBayes (y, x̂(.)) = dx∗ dyPX|Y (x∗ , y)(EX|Y (x, y) − x∗ )2 = VarPX|Y [X]

(7.18)
120 F. Krzakala and L. Zdeborová

Proof. Consider the posterior risk for the square loss:


Z
RPosterior (y, x̂(.)) = PX|Y (x∗ , y)(x̂(y) − x∗ )2 dx∗ . (7.19)

Conditioned on y x̂ is simple a scalar variable, it is a just number, so we can differentiate this


expression with respect to x̂ to obtain its minimum:
Z
∂ Posterior
R (y, x̂(.)) = 2 PX|Y (x∗ , y)(x̂(y) − x∗ ) dx⋆ = 0 (7.20)
∂ x̂
Z Z
x̂(y) = PX|Y (x∗ , y)x∗ dx∗ = PX|Y (x, y)x dx = EX|Y (x, y) (7.21)

Conditioned on y, the minimum is thus obtained using the posterior mean.

In what follows we shall study many such problems with Gaussian noise, so it is rather
convenient to define the optimal denoising function as the MMSE estimator for a given
problem:
(x−R)2
dxxe− 2Σ2 P (x)
R
η(R, Σ) =: EP (X|X+R+ΣZ) = R (x−R)2
(7.22)
dxe− 2Σ2 P (x)
It is interesting to see, however, that for other errors, the optimal function can be different.
If one choose the absolute value as the cost, then we find instead that one should use the
Median:
Theorem 11 (MMAE Estimator). The optimal estimator for the absolute error, called the Minimal
Mean absolute error (MMAE) estimator, is given by the posterior median:

x̂MMAE (y) = MedianX|Y (x|y) . (7.23)

Proof. Here we have


Z
R Posterior
(y, x̂(.)) = dx∗ PX|Y (x∗ , y)|x̂(y) − x∗ | (7.24)
Z x̂ Z ∞
= dx∗ PX|Y (x∗ , y)(−(x̂(y) − x∗ )) + dx∗ PX|Y (x∗ , y)(x̂(y) − x∗ )
−∞ x̂
(7.25)
Performing the derivative with Leibniz integral rule, we find that we require:
Z x̂ Z ∞
∂ Posterior ∗ ∗
R (y, x̂(.)) = − dx PX|Y (x , y) + dx∗ PX|Y (x∗ , y) = 0 (7.26)
∂ x̂ −∞ x̂

and this is achieved for


x̂MMAE (y) = MedianX|Y (x|y) (7.27)

Finally, if we are interested to choose between a finite number of hypothesis, like in the case
±1, or if we want to know if the number was exactly zero in the Gauss-Bernoulli case, a good
measure of error is to look to the optimal decision version and to minimize the number of
mistakes:
7.2 Scalar estimation 121

Theorem 12 (Optimal Decision). The Optimal Bayesian decision estimator is the one that maximizes
the (marginal) probability for each class:

x̂OBD (y) = argmax PX|Y (x, y) (7.28)


x

7.2.3 Back to free entropies

Let us now discuss how to think about these problems with a statistical physics formalism.
We can write down the posterior distribution as

exp (log (P (y | x) P (x))) e−H(x;y)


P (x | y) = ≡ PGibbs,y (x) = (7.29)
P (y) Z(y)

A way to define our Boltzmann measure would be to use β = 1, H(x; y) = − log (P (y | x)) −
log (P (x)), and Z(y) = P (y). In practice, for such problems with a Gaussian noise, we shall
employ a slightly different convention that is more practical, and use instead
P (yi −x)2
exp (log (P (y | x) P (x))) 1 e− i 2∆ P (x)
P (x | y) = = (7.30)
P (y) (2π∆)N/2 P (y)
P yi2
e− i 2∆ x2
 
P yi x
− 2∆
= e i ∆
P (x) (7.31)
(2π∆)N/2 P (y)
P  x2 yi x 
i − 2∆ + ∆
e P (x)
=: (7.32)
Z(y)

with −1
yi2
 P
Z P  x2 yi x  −
i − 2∆ + ∆
e i 2∆
Z(y) = dx e P (x) =  
(2π∆)N/2 P (y)

Interestingly, with this definition, the partition sum is also equal to the ratio between the
probability that y is a pure random noise (a Gaussian with variance ∆), and that y has been
actually generated by a noisy process from x:

PYmodel (y)
Z(y) = .
PYrandom (y)

This is called the likelihood ratio in hypothesis testing. Obviously, if the two distributions
are the same, then Z = 1 for all values of y. With this definition, we define the expected free
entropy, as before, as
FN = EY log Z(y) . (7.33)

In fact, the free entropy turns out to be nothing more than the Kullback-Liebler divergence
between PYmodel (y) and PYrandom (y):

PYmodel (y)
FN = Emodel
Y log = DKL (PYmodel (y)|PYrandom (y)) (7.34)
PYrandom (y)
122 F. Krzakala and L. Zdeborová

Many other information quantities would have been equally interesting, but they are all
equivalent. We could have used for instance the entropy of the variable y, which is related
trivially to our free entropy.

Ey [y 2 ] N
H(Y ) = −EY log P (y) = N + log 2π∆ − FN (7.35)
2∆ 2
Information theory practitioners would, typically, use the mutual information between X and
Y , that is the Kullback-Leibler distance between the jointed and factorized distribution of Y
and X.

I(X, Y ) = DKL (PX,Y ||PX PY ) . (7.36)

Again, this can be expressed directly as a function of the free entropy, using (see exercise
section for basic properties of the mutual information and conditional entropies):

N
I(X, Y ) = H(Y ) − H(Y |X) = H(Y ) − log(2πe∆) (7.37)
2
N Ey [y 2 ] N Ex [x2 ] + ∆
= −FN − +N =F − +N (7.38)
2 2∆ 2 2∆
2
Ex [x ]
= −FN + N (7.39)
2∆
Given these equivalences, we shall thus focus on the free entropy.

7.2.4 Some useful identities: Nishimori and Stein

Before going further, we need to note some important mathematical identities that we shall
use all the time, especially in the context of Bayesian inference.

The first one is a generic property of the Gaussian integrals, a simple consequence of integration
by part, called Stein’s lemma:

Lemma 7 (Stein’s Lemma). Let X ∼ N (µ, σ 2 ). Let g be a differentiable function such that the
expectation E [(X − µ)g(X)] and E [g ′ (X)] exists, then we have

E [g(X)(X − µ)] = σ 2 E g ′ (X)


 

Particularly, when X ∼ N (0, 1), we have

E [Xg(X)] = E g ′ (X)
 

Proof. The proof is a trivial application of the integration by part formula f ′ g = [f g] − f g′


R R
2 √
applied on g f = −e−x /2 / 2π.

Additionally, there is a set of identities that are extremly useful, that are usually called "
Nishimori symmetry" in the context of physics and error correcting codes. In its more general
form, it reads
7.2 Scalar estimation 123

Theorem 13 (Nishimori Identity). Let X (1) , . . . , X (k) be k i.i.d. samples (given Y ) from the
distribution P (X = · | Y ). Denoting ⟨·⟩ the "Boltzmann" expectation, that is the average with respect
to the P (X = · | Y ), and E [·] the "Disorder" expectation, that is with respect to (X ∗ , Y ). Then for all
continuous bounded function f we can switch one of the copies for X ∗ :
hD  E i D  E 

(1)
E f Y, X , . . . , X (k−1)
,X (k) (1)
= E f Y, X , . . . , X (k−1)
,X (7.40)
k k−1

Proof. The proof is a consequence of Bayes theorem and of the fact that both x∗ and any of
the copy X (k) are distributed from the posterior distribution. Denoting more explicitly the
Boltzmann average over k copies for any function g as

D E k
Z Y
g(X (1)
,...,X (k)
) =: dxi P (xi |Y )g(X (1) , . . . , X (k) ) (7.41)
k
i=1

we have, starting from the right hand side


D  E 
(1) (k−1) ∗
EY,X ∗ f Y, X , . . . , X ,X
k−1
Z D  E
∗ ∗ (1) (k−1) ∗
= dx dyP (x |Y )P (Y ) f Y, X , . . . , X ,X
k−1
Z D  E
= EY dxk P (xk |y) f Y, X (1) , . . . , X (k−1) , X k
k−1
hD  E i
= EY f Y, X (1) , . . . , X (k−1) , X (k)
k

We shall drop the subset "k" from Boltzmann averages from now on. The Nishimori property
has many useful consequences that we can now discuss. First let us look at the expression of
the MMSE. It has a nice expression in terms of overlaps:
 2  h i
MMSE(λ) = Ey,x∗ ⟨x⟩y − x ∗
= Ey,x∗ ⟨x⟩2y + (x∗ )2 − 2x∗ ⟨x⟩y = q + q0 − 2m

where
h i h i
• q ≜ Ey ⟨x⟩2y = Ey x(1) x(2) y is overlap between two copies
h i
• q0 ≜ Ex∗ (x∗ )2 is the self overlap
h i
• m ≜ Ey,x∗ x∗ ⟨x⟩y is the overlap with the truth.

Using now the Nishimori Identity this can be simplified as


h i D E 
Ey,x∗ ⟨f (x, x∗ )⟩y = Ey f (x(1) , x(2) )
y

Using this result we have q ≡ m and thus MMSE(λ) = q0 − m


124 F. Krzakala and L. Zdeborová

7.2.5 I-MMSE theorem



Theorem 14 (I-MMSE Theorem). For a single measurement and Gaussian noise, if Y = ∆Z +X ∗ ,
The derivative of the mutual information, or the free entropy, with respect to the inverse noise gives the
MMSE
∂ 1 1
−1
I(∆) = MMSE(λ) = (q0 − m) (7.42)
∂∆ 2 2
∂ 1
F (∆) = m (7.43)
∂∆−1 2

Proof. Writing F explicitly as a function of the Gaussian noise, we have


2 ∗
Z
x
− 2∆ + xx + √zx
F = Ex∗ ,z log dx P (x)e ∆ ∆ (7.44)

Performing the derivative, we find


 2 √  − x2 + xx∗ + √zx
dx P (x) − x2 + xx∗ + zx
R
2 ∆ e 2∆ ∆ ∆

∂∆−1 F = Ex∗ ,z 2 ∗ (7.45)


R − x + xx + √
zx
dx P (x)e 2∆ ∆ ∆
"√ #
1 ∗ ∆
(7.46)
 2 
= − Ez,x∗ ⟨x ⟩ + Ez,x∗ [⟨x⟩x ] + Ez,x∗ ⟨x⟩z
2 2

Using Stein’s lemma on the variable z, the third term can be written as
"√ # "√ #
∆ ∆ 1
∂z ⟨x⟩ = Ez,x∗ ⟨x2 ⟩ − ⟨x⟩2 (7.47)
 
Ez,x∗ ⟨x⟩z = Ez,x∗
2 2 2

so that
1 1
∂∆−1 F = − Ez,x∗ ⟨x2 ⟩ + Ez,x∗ [⟨x⟩x∗ ] + Ez,x∗ ⟨x2 ⟩ − ⟨x⟩2 (7.48)
   
2 2
∗ 1
(7.49)
 2
= Ez,x∗ [⟨x⟩x ] − Ez,x∗ ⟨x⟩
2
q m
=m− = (7.50)
2 2
Where the last step follows from Nishimori.

7.3 Application: Denoising a sparse vector

It is obvious to check that all the theorems that we have discussed applied equally to d-
dimensional vectors. We can thus apply our newfound knowledge to a more interesting
problem: denoising a sparse vector.

Consider a vector x∗ of dimension d. Often, in computers, d is a power of 2, so we take d = 2N ,


with only ONE single non zero component, exactly equal to 1. In other words, x is one of
7.3 Application: Denoising a sparse vector 125

the d vectors x = [10000 . . .], x = [01000 . . .], etc. Instead of this vector, you are given a noisy
d-dimensional vector y which has been polluted by a very small Gaussian noise
r
∗ ∆
y=x + z (7.51)
N
Can we recover x∗ ? We proceed in the Bayesian way, and write that

(yi −xi )2
− Q x2
1 i N (yi , 0, ∆/N )
Y e 2∆/N Y − i −2xi yi
P (x|y) = P (x) p = e 2∆/N P (x) (7.52)
P (y) 2π∆/N P (y)
i i

We recognize that i N (yi , 0, ∆/N ) is just the probability of the null model (if the vector y
Q
was simply a random one). Additionally, using the prior on x tell us that it has to be on the
corner of the hypercube (one of the vector xi which are zero everywhere but xi = 1), we write

P random (y) 1 N yi − N 1 1 N yi − N
P (xi |y) = e∆ 2∆ =: e∆ 2∆ . (7.53)
P model (y) 2N Z 2N

As before, the partition sum stands for the ratio of probability between the model and the null,
and reads
d d N δi,i∗
1 X − N + N yi 1 X − 2∆
q
N
+ ∆ + N z
Z= e 2∆ ∆ = e ∆ i (7.54)
2N 2N
i=1 i=1

Our goal is to compute the free entropy as a function of ∆

1
Φ(∆) = lim E log Z (7.55)
N →∞ N

The application of the I-MMSE theorem tells us in particular that


" #
X ⟨x⟩x∗
q =: lim E = 2∂∆−1 Φ(∆) (7.56)
N →∞ N
i

This can be done rigorously, as we shall now see. In fact we can prove the following expression
for the free energy:

Theorem 15. Let f (∆) : R → R be

1
f (∆) = − log 2 , if , ∆ ≤ 1/2 log 2 (7.57)
2∆
f (∆) = 0 , if ∆ ≥ 1/2 log 2 (7.58)

Then the limit of the free entropy is f (∆)

We shall prove this theorem by proving an upper and lower bound. Let us start by

Lemma 8 (Upper bound).


ΦN (∆) ≥ f (∆) (7.59)
126 F. Krzakala and L. Zdeborová

Proof. The bound comes from using only one term in the sum, the one corresponding to the
correct position i∗ :
  q 
1 1 1 − 2∆
N
+N +z N
ΦN (∆) = E [log Z] ≥ E log e ∆ ∆ (7.60)
N N 2N
1
≥ − log 2 (7.61)
2∆
Additionally, since the two distributions become indistinguishable for infinite noise and that Z
is just the likelihood ratio, we have ΦN (∆ = ∞) = 0. Since ∂∆ ΦN (∆) = (∂∆−1 ΦN (∆))(∂∆ (1∆)) =
q
− 2∆ 2 ≤ 0, we have ΦN (∆) ≥ 0.

Lemma 9 (Lower bound).


ΦN (∆) ≤ f (∆) + o(1) (7.62)

Proof. The bound comes from the Jensen inequality (the annealed bound):
  
 q 
1 1 1
q
N ∗ N N N
+z −N log 2 − +z
X
ΦN (∆) = E [log Z] ≤ Ezi∗ log e 2∆ i ∆ + Ezi N e 2∆ i ∆ 
N N 2
i̸=i∗
    
1 1
q
N
+z ∗ N −N log 2 N N
≤ Ezi∗ log e 2∆ i ∆ + 1 − N e 2∆ − 2∆ (7.63)
N 2
  
1
q
N ( 2∆
1
−log 2)+zi∗ N
≤ Ezi∗ log e ∆ +1
N

It is intuitively clear that, depending on where or not the term 1/2∆ − log 2 in the exponential
is positive or negative, then we should expect to either completely dominate the expression,
or to disappear exponentially.

We can show this with rigor, for instance by defining the monotonic growing function
q
N ( 2∆
1
−log 2)+zi∗ N
g(zi∗ ) =: e ∆ + 1. (7.64)
We have g(zi∗ ) ≤ g(|zi∗ |), and
r
g′ N
log g(|zi∗ |) ≤ log g(0) + |zi∗ |max = log g(0) + |zi∗ | (7.65)
g ∆
so that
1 1 1   1
log 1 + eN ( 2∆ −log 2) + E|z| √
1
E log g(zi∗ ) ≤ E log g(|zi∗ |) ≤ (7.66)
N N N N∆
1 
N ( 2∆ −log 2)
1

≤ log 1 + e + o(1) (7.67)
N
We conclude by noting that the bounds tends to f (∆) as N → ∞ (taking the exponential of
1/2∆ − log 2 out of the log) and using log(1 + x) ≤ x.

Now that we know the free entropy, we can apply the I-MMSE theorem. A phase transition
occurs depending on ∆ being lower or larger than ∆c
1
∆c =
2 log 2
7.3 Application: Denoising a sparse vector 127

If ∆ > ∆c , the MMSE is 1, and we cannot find the signal. Even the best guess is not better than
a random one. If, on the other end, ∆ < ∆C , then we should be able to solve the problem, and
finds a perfect MMSE (that is, a zero error).

Can we do it in practice? What should be our algorithm? It is a well-known


√ result that the
maximum of √ d i.i.d. Gaussian random variables is asymptotically 2 log d with fluctuations
of order 1/ log d. In our case, this means that in absence
√ of a signal and with a variance
σ 2 = ∆/N = ∆ log 2/ log d the largest number will be ∆2 log 2, which is indeed smaller than
one when ∆ < ∆c . This means that the simplest algorithm (keep only the largest component),
will be optimal with high probab Therefore, we have the interesting following phase diagram:

EASY IMPOSSIBLE

∆c

For signal of dimension d, and a noise σ 2 this yields σc2 = 1/(2 log(d)).

Actually, one can prove an even stronger result in the regime ∆ ≥ ∆c . As we show in appendix
7.A, not only the free entropy divided by N goes to zero, but the total free entropy as well. We
recall that it is nothing but the KL divergence between the distribution of the model and the
random one:

PYmodel (y)
FN = Emodel
Y log = DKL (PYmodel (y)|PYrandom (y)) →N →∞ 0 (7.68)
PYrandom (y)

This actually means that the two distributions are eventually just becoming just the same one,
and are thus indistinguishable: not only we cannot find the signal but there is just no way
to know that a signal has been hidden for ∆ > ∆c , as the data looks perfectly like Gaussian
noise.

Bibliography

The legacy of the Bayes theorem, and the fundamental role of Laplace in the invention of
"inverse probabilities" is well discussed in McGrayne (2011). Bayesian estimation is a fun-
damental field at the frontier between information theory and statistics, and is discussed in
many references such as Cover and Thomas (1991). The I-MMSE theorem was introduced by
Guo et al. (2005). Nishimori symmetries were introduced in physics is Nishimori (1980) and
soon realized to have deep connection to information theory Nishimori (1993) and Bayesian
inference Iba (1999). The model of denoising a sparse vector was discussed in Donoho et al.
(1998). This problem has deep relation to Shannon’s Random codes Shannon (1948) and the
Random energy model in statistical physics Derrida (1981).
128 F. Krzakala and L. Zdeborová

7.4 Exercises

Exercise 7.1: Few useful equalities on Entropies

In what follow, we shall denote the entropy of a random variable X with a distribution
pX (x) as Z
H(X) = − dx p(x) log p(x)

• Entropy of a Gaussian variable:


Show that the entropy of a Gaussian variable sampled from N (m, ∆) is given by

1
H(X) = log 2πe∆
2

• Mutual information:
The mutual information between two (potentially) correlated variable X and Y
is defined as the Kullback-Leibler divergence between their joint distribution and
the factorized one. In other words, it reads
pX,Y (x, y)
Z
I(X; Y ) = DKL (PX,Y ||PX PY ) = dxdy pX,Y (x, y) log
pX (x) pY (y)

Show that the mutual information satisfies the following chain rules:

I(X; Y ) = H(X) − H(X|Y ) = H(Y ) − H(Y |X)

where the so-called conditional entropy H(X|Y ) is defined as


Z Z
H(X|Y ) = − dy pY (y) dx pX|Y (x|y) log pX|Y (x|y)

• Conditional entropy of a Gaussian:


Given a random variable X, whose distribution is pX (x) and a random variable Z
whose distribution is Gaussian with variance ∆, we define a new random variable
Y given by
Y =X +Z
Shows that the conditional entropy H(X|Y ) is then given by

1
H(X|Y ) = log 2πe∆
2

Exercise 7.2: Numerical tests

Perform simulation of the 3 models discussed in section 6.2.1 with the MMSE, MAP,
and MMAE estimators discussed in section 6.2.2
Using different values for the number of observation n (from 10 to 1000, or even more)
and averaging your finding on many instances, plots how, for each problems the error
7.4 Exercises 129

evolves with N for different Risk and different estimators.

Exercise 7.3: Second derivative

We saw in section 6.2.5 that the first derivative of the free entropy (with Gaussian noise)
with respect to ∆−1 is (one-half) the overlap m.
Compute the second derivative (use again Stein and Nishimori) and relate it to a
variance of a quantity. Show that it implies the convexity of the free entropy with
respect to ∆−1 .
Appendix

7.A A tighter computation of the likelihood ratio

We shall here prove that the average partition sum is not only going to zero when ∆ > ∆c ,
but that the corrections to the free entropy are actually exponentially small. This require a
tighter analysis of the likelihood ratio and of the upper bound in 9.

Our starting point is eq.7.63 that states


  
1
q
N ( 2∆
1
−log 2)+zi∗ N
ΦN (∆) ≤ Ezi∗ log e ∆ +1
N

Our goal is to shat that for ∆ > ∆c , this is exponentially small. Let us simplify notation a bit
and denote f = − 2∆ 1
+ log 2 > 0, and write, spitting the integral in two:
2
   Z +∞ − x2  
1
q q
−f N +z N e −f N +z N
ΦN (∆) ≤ Ez log e ∆ + 1 = dz √ log e ∆ + 1 (7.69)
N −∞ 2π
Z √N ∆f x2  Z ∞ − x2
2
e− 2
 q  q 
−f N +z N e −f N +z N
≤ dz √ log e ∆ + 1 + √ √ log e ∆ + 1

−∞ 2π N ∆f 2π
(7.70)
≤ I1 + I2 (7.71)

We first deal with I1 . Since the exponential term is positive, we can write, using the "worst"
possible value of z:
√ x2 2
Z ∞ − x2
N ∆f (1−ϵ)
e− 2
Z  q 
−f N +z N e  
I1 = dz √ log e ∆ + 1 ≤ dz √ log e−f N +N f (1−ϵ) + 1
−∞ 2π −∞ 2π
Z ∞ x2
e− 2  
≤ dz √ log e−ϵf N + 1 ≤ e−ϵf N
−∞ 2π

where we have used log(1 + x) ≤ x. So indeed I1 is exponentially small.

What about I2 ? Again uwe use log(1 + x) ≤ x:


z2 2
∞ Z ∞ − z2
e− 2
Z  q  q
−f N +z N −f e z N
I2 = √ √ log e ∆ + 1 ≤e N
√ √ e ∆ (7.72)
N ∆f (1−ϵ) 2π N ∆f (1−ϵ) 2π
132 F. Krzakala and L. Zdeborová

With a bit or rewriting we can write


√N
(z− )2 z2
∞ − ∆ ∞
e− 2
Z Z
N
−f N + 2∆ e 2 N
−f N + 2∆
I2 ≤ e √ dz √ =e √ q dz √ (7.73)
N ∆f (1−ϵ) 2π N ∆f (1−ϵ)− N∆

!

r
N
−f N + 2∆ N
≤ e P Z> N ∆f (1 − ϵ) − (7.74)

2 /2
For a Gaussian random variable, we have that P (Z > b) ≤ e−b thus
2

 q
N − 12 N ∆f (1−ϵ)− N
−f N + 2∆
(7.75)

I2 ≤ e e
N
−f N + 2∆ − 12 N ∆f (1−ϵ)2 − 21 N
= e e ∆
N f (1−ϵ)
e e (7.76)
−ϵN − 12 N ∆f (1−ϵ)2
= e e (7.77)

This is again decaying exponentially fast. We thus obtain the following result:

Lemma 10 (Exponential decay of the Kullback-Leibler divergence).

N Fn (∆) = DKL (PYmodel (y)|PYrandom (y)) =≤ N e−KN for ∆ > ∆c (7.78)

Note that for the divergence to go to zero, one needs the total free entropy to go to zero, not
just the one divided by N (that is, the density).

7.B A replica computation for vector denoising

It is instructive, and a good exercise, to redo the computation of the free entropy in theorem:15
using the replica method. The computation is very reminiscent of the one for the random
energy model in chap 15, which we encourage the reader to consult. In this, the present is
nothing but a "Bayesian" version of the random energy model.

Let us see how the replica computation goes. We first remind that the partition sum reads

d N δi,i∗ d N δi,i∗
1 X − 2∆
q q
N
+ ∆ + N∆ i = e−N (log 2+ 2∆ )
z 1 + N z
X
Z= e e ∆ ∆ i (7.79)
2N
i=1 i=1

We now move to the computation of the averaged free entropy by the replica method, starting
zith the replicated partition sum:

n d q !
N N
Z n = e−nN ( )
1 Y X δ ∗+ z
log 2+ 2∆
e ∆ i,i ∆ i . (7.80)
a=1 i=1
7.B A replica computation for vector denoising 133

First, let us start using seemingly trivial rewriting, using i∗ = 1 without loss of generality:
q 
d Pn N
z +N δ
n nN (log 2+ 2∆
1
) ∆ ia ∆ ia ,1
X a=1
Z e = e (7.81)
i1 ,...,in =1
d Pn q
Pn N N
z
X
= e a=1 ∆ δia ,1 e a=1 ∆ ia (7.82)
i1 ,...,in =1
d Pn Pd q
Pn N N
z δ
X
= e a=1 ∆ δia ,1 e a=1 j=1 ∆ j j,ia (7.83)
i1 ,...,in =1
d d Pn q
Pn N N
zj δ
X Y
= e a=1 ∆ δia ,1 e a=1 ∆ j,ia (7.84)
i1 ,...,in =1 j=1

Now we perform the expectation over disorder, using the fact that we have now a product of
independent Gaussians:
 
d Pn N d Pn q N
E [Z n ] = e−nN (log 2+ 2∆ ) E 
1 X Y z δ
e a=1 ∆ δia ,1 e j a=1 ∆ j,ia  (7.85)
i1 ,...,in =1 j=1
d d  P q 
Pn n N
= e−nN (log 2+ 2∆ )
1 X N Y z δ
e a=1 ∆ δia ,1 E e j a=1 ∆ j,ia (7.86)
i1 ,...,in =1 j=1

2
Using E ebz = eb /2 for Gaussian variables, we thus find
 

d Pn d Pn
−nN (log 2+ 2∆
1
) N N 2
X Y
n
E [Z ] = e e a=1 ∆ δia ,1 e 2∆ ( a=1 δj,ia ) (7.87)
i1 ,...,in =1 j=1
d Pd Pn
Pn N
−nN (log 2+ 2∆
1
) N
X
= e e a=1 ∆ δia ,1 e 2∆ j=1 a,b=1 δj,ia δj,ib (7.88)
i1 ,...,in =1
d Pn Pn
e∆( )
N 1
−nN (log 2+ 2∆
1
)
X
= e a=1 δia ,1 + 2 a,b=1 δia ,ib (7.89)
i1 ,...,in =1

Given the replicas configurations (i1 ...in ), that can take values in 1, . . . , d, we now denote the
so-called n × n overlap matrix Qab = δia ,ib , that takes elements in 0, 1, respectively if the two
replicas (row and column) have different or equal configuration. We also write the n × n
magnetization vector Ma = δia ,1 . With this notation, we can write the replicated sum as
d Pn Pn
e∆( Qa,b )
N
−nN (log 2+ 2∆
1
) Ma + 12
X
n
E[Z ] = e a=1 a,b=1 (7.90)
i1 ,...,in =1
Pn Pn
#(Q, M )e ∆ ( Qa,b )
N
Ma + 12
= e−nN (log 2+ 2∆ )
1 X
a=1 a,b=1 (7.91)
{Q},{M }

where {Q},{M } is the sum over all possible such matrices and vectors, while #(Q, M ) is the
P
numbers of configurations that leads to the overlap matrix Q and magnetization vector M .

In this form, it is not yet possible to perform the analytic continuation when n → 0. Keeping
for a moment n integer, it is however natural to expect that the number of such configurations
134 F. Krzakala and L. Zdeborová

(for a given overlap matrix and magnetization vector), to be exponentially large. Denoting
#(Q, M ) = eN s(Q,M ) we thus write
Z Pn
Z
1 Pn
e nN (log 2+ 2∆
1
) EZ n
≈ dQ dM e N s(Q,M )+ N
∆ ( a=1 Ma + 2 a,b=1 Q a,b ) =: dQ, M eN g(∆,Q,M )

As N is large, we thus expect to be able to perform a Laplace (or Saddle) approximation by


choosing the "right" structure of the matrix Q and vector, that will "dominate" the sum. A
quite natural guess is the replica symmetric ansatz, where we assume that all replicas are
identical, and therefore the system should be invariant under the relabelling of the replicas
(permutation symmetry).

In this case, we have only three natural choices for the entries of Q and M :

1. All the replicas are in the same, identical configuration, that is ia = i∀a. Let us further
assume that i ̸= 1. In this case Qab = 1 for all a, b, and Ma = 0 for all a. There are
d − 1 ≈ 2N possibility for this so s(Q, M ) = log 2 and we find g(β, Q) = log 2 + n2 /∆.
This does not look right: this expression does not have a limit with a linear part in n,
so we cannot use this solution in the replica method. Clearly, this is a wrong analytical
continuation.

2. All the replica are in the same, identical configuration, which is the "correct" one ia =
i = 1. Then Qab = 1 for all a, b, and Ma = 1 for all a. There is only one possibility, so
that s(Q, M ) = 0 and g(β, Q) = n/∆ + n2 /2∆.

3. If instead all replicas are in a different, random, configurations then Qaa = 1, Qab = 0
for all a ̸= b and Ma = 0. In this case #(Q) = 2N (2N − 1) . . . (2N − n + 1), so that
s(Q) ≈ n log 2 if n ≪ N . Therefore g(β, Q) = n/2∆ + n log 2.

At the replica symmetric level, we thus find that the free entropy is given by two possible
solutions as n → 0. In the first one all replicas are in the correct, hidden solution:

EZ n = e−nN (log 2+ 2∆ )+nN ∆ = e−nN (log 2− 2∆ )


1 1 1
(7.92)

while in the second case, all replicas are distributed randomly over all states and

EZ n = e−nN (log 2+ 2∆ )+nN 2∆ +n log 2 = 0 .


1 1
(7.93)

We thus have recovered exactly the rigorous solution from the replica method. Indeed,
choosing the right solution is easy: the free entropy is continuous and convex in ∆, non-
negative, and goes from ∞ to 0 as ∆ grows, so that we the free energy must be log 2 − 1/2∆
for ∆ < ∆c = 1/2 log 2 and 0 for ∆ > ∆c .
Chapter 8

Low-Rank Matrix Factorization: the


Spike model

The signal is the truth. The noise is what distracts us from the
truth [. . . ] Distinguishing the signal from the noise requires both
scientific knowledge and self-knowledge: the serenity to accept the
things we cannot predict, the courage to predict the things we can,
and the wisdom to know the difference.

Nate Silver, The Signal and the Noise —- 2012

Now that we presented Bayesian estimation problems, we can apply our techniques to a
non-trivial problem. This is a perfect example for testing our newfound knowledge.

8.1 Problem Setting

The Spike-Wigner Model

Suppose we are given as data a n × n symmetric matrix Y created as follows:


r
λ ∗ ∗⊺
Y= x x} + ξ
N | {z |{z}
N ×N rank-one matrix symmetric iid noise

i.i.d. i.i.d.
where x∗ ∈ RN with x∗i ∼ PX (x), ξij = ξji ∼ N (0, 1) for i ≤ j.

This is called the Wigner spike model in statistics. The name "Wigner" refer to the fact that Y
is a Wigner matrix (a symmetric random matrix with components sampled randomly from a
Gaussian distribution) plus a "spike", that is a rank one matrix x∗ x∗⊺ .

Our task shall be to recover the vector x from the knowledge of Y . As we just learned, this
136 F. Krzakala and L. Zdeborová

can be achieved using the posterior estimation:

  q 2 
"N # − 12 yij − Nλ ∗ ∗
xi xj
1 Y Y e 
P (x | Y) = PX (xi )  √ 
Z(Y)  2π 
i=1 i≤j

Spike-Wishart Model (Bipartite Vector-Spin Glass Model)

For completeness, we present here an alternative model, which is also extremely interesting,
called the Wishart-spike model. In this case

r
λ ∗ ∗⊺
Y= u v } + ξ
N | {z |{z}
M ×N rank-one matrix iid noise

i.i.d. i.i.d. i.i.d.


where u∗ ∈ RM with u∗i ∼ PU (u), v∗ ∈ RN with vi∗ ∼ PV (v), ξij ∼ N (0, 1).

Strictly speaking, the name "Wishart" might sounds strange here. This is coming from the
fact that this model, for Gaussian vectors u, is exactly the same as another model involving a
Wishart matrix, a model also called the Spiked Covariance Model. Indeed, when the factors
are independent, the model can be viewed as a linear model with additive noise and scalar
random design:

r
λ
yi = v j u + ξj , (8.1)
N

Assuming the vj have zero mean and unit variance, this indeed is a model of spiked covariance:
YY
the mean of the empirical covariance matrix Σ = Y N is a rank one perturbation of the identity
1 + uu Random covariance matrices are called Wishart matrices, so this is a model with a
T

rank one perturbation of a Wishart matrix.

Regardless of its name, given the matrix Y , the posterior distribution over X reads

  q 2 
# − 21 yij − Nλ ∗ ∗

"M N ui vj
1 Y Y Y e 
P (u, v | Y) = PU (ui )  PV (vi )  √ 
Z(Y)  2π 
i=1 j=1 i,j
8.2 From the Posterior Distribution to the partition sum 137

8.2 From the Posterior Distribution to the partition sum

We shall now make a mapping to a Statistical Physics formulation. Consider the spike-Wigner
model, using Bayes rule we write:

" #  q 2 
1 λ
P (Y | x) P (x) Y Y 1 − y ij − x x
N i j
P (x | Y) = ∝ PX (xi )  √ e 2 
P (Y) 2π
i i≤j
" #  " r #
Y X λ λ
∝ PX (xi ) exp  − x2 x2 + yij xi xj 
2N i j N
i i≤j
" #  " r #
1 Y X λ 2 2 λ
⇒ P (x | Y) = PX (xi ) exp  − x x + yij xi xj 
Z(Y) 2N i j N
i i≤j
 
x̂MSE,1 (Y)
.
Z
⇒ x̂MSE (Y) = 
 .. ,

x̂MSE,i (Y) = ⟨xi ⟩Y = dx P (x | Y) xi
x̂MSE,N (Y)

8.3 Replica Method

" #  " r #
1 Y X λ 2 2 λ
P (x | Y) = PX (xi ) exp  − x x + yij xi xj 
Z(Y) 2N i j N
i i≤j
" #  " r #
1 Y X λ 2 2 λ λ
= PX (xi ) exp  − xi xj + xi xj x∗i x∗j + ξij xi xj 
Z(Y) 2N N N
i i≤j

We are interested in

   
1 1
lim EY log (Z(Y)) = lim Ex∗ ,ξ log (Z(Y))
N →∞ N N →∞ N
138 F. Krzakala and L. Zdeborová

Using replica method, we have



Z " #  " r #n 
Y X λ 2 2 λ λ
Ex∗ ,ξ [Z n ] = Ex∗ ,ξ  dx PX (xi ) exp  − xi xj + xi xj x∗i x∗j + ξij xi xj  
i
2N N N
i≤j
 "  #
n Z
r
Y 
(α)
(α)

Y X λ 
(α)
2 
(α)
2 λ (α) (α) ∗ ∗ λ (α) (α)
= Ex∗ ,ξ  dx P X xi exp  − xi xj + xi xj xi xj + ξij xi xj 
α=1 i
2N N N
i≤j
"Z  " #
Y 
(α)

(α)
X λ X  (α) 2  (α) 2 λ X (α) (α) ∗ ∗
= Ex∗ PX xi dxi exp  − xi xj + x xj xi xj 
α,i
2N α N α i
i≤j
" q # #
Y ξij λ P x(α) x(α)
N α i j
Eξij e
i≤j
| {z }
(a) 
λ P P (α) (α) (β) (β)

= exp 2N i≤j α,β xi xj xi xj
   2 2 
"Z (α) !2 !2
(b) Y  
 λN X X xi λN X X ∗ (α)
xi xi λN X X (α) (β)
xi xi
(α) (α)
= Ex∗ PX xi dxi exp −  + +
  
4 α N 2 N 4 N
 
α,i i α i i α,β

"Z  !2 !2 #
(α) X x(α) x(β)
Y 
(α)

(α) λN X X x∗i xi λN X i i
= Ex∗ PX xi dxi exp  + 
α,i
2 α i
N 2 i
N
α<β
"Z Z Y ! Z Y !
(c) Y 
(α)

(α) 1 X (α) ∗ 1 X (α) (β)
= Ex∗ PX xi dxi δ mα − x xi dmα δ qαβ − x xi dqαβ
α,i α
N i i N i i
α<β
   #
λN X 2 X 2
exp   mα + qαβ 
2 α α<β
"Z Z Y h Z Y
P (α) i h P (α) (β) i
m̂α N mα − i xi x∗
(d)
 
q̂αβ N qαβ − i xi xi
Y (α) (α) i
= Ex∗ PX xi dxi e dm̂α dmα e dq̂αβ dqαβ
α,i α α<β
   #
λN X 2 X 2
exp   mα + qαβ 
2 α α<β
    
Z Y Z Y
(e) λN X 2 X 2 X X
= dm̂α dmα dq̂αβ dqαβ exp   mα + qαβ  + N  mα m̂α + qαβ q̂αβ 
α
2 α α
α<β α<β α<β
   N
 Z Y X X 
Ex ∗  PX (xα ) dxα exp − m̂α x∗ xα − q̂αβ xα xβ 
 
α α α<β

where

2 /2
(a) uses the fact that Dz eaz = ea
R

(b) uses the fact that


!2
X ai aj 1 X ai 1 X a2i
= +
NN 2 N 2 N2
i≤j i i

and neglect the second term since it scales like O(N −1 ).

(c) partitions the huge integral according to overlap with true signal mα and overlap between
two distinct replicas qαβ with definitions

1 X (α) ∗ 1 X (α) (β)


mα = xi xi , ∀ α, qαβ = xi xi , ∀α < β
N N
i i
8.3 Replica Method 139

(d) applies Fourier transformation and change of variable


 
(1) (n)
(e) change the order of integral and expectation and x∗i are iid, for each i, the tuple x∗i , xi , · · · , xi
are identical distributed, so we switch to subscript notation x∗ , x(1) , · · · , x(n) to get rid


of component index i but keep the replica index α.

Replica Symmetry Ansatz

Under replica symmetry Ansatz, we have mα ≡ m, m̂α ≡ m̂, qαβ ≡ q, q̂αβ ≡ q̂.
Z Z h 2 −n i h 2 i
λN
nm2 + n q 2 +N nmm̂+ n −n q q̂
E[Z n ] = dm̂ dm dq̂ dq e 2 2 2
×
( "Z #)N
Y P P
× Ex∗ PX (xα ) dxα e−m̂ α x∗ xα −q̂ α<β xα xβ
α
Z
(a) n−1
dm̂ dm dq̂ dq enN [ 2 m + 4 (n−1)q +mm̂+ 2 qq̂] ×
λ 2 λ 2
=
( "Z #)N
√ P
Z
PX (xα ) dxα e 2 α xα −m̂ α x∗ xα Dz e−iz q̂ α xα
Y q̂ P 2
P
× Ex∗
α
Z
(b) n−1
dm̂ dm dq̂ dq enN [ 2 m + 4 (n−1)q +mm̂+ 2 qq̂] ×
λ 2 λ 2
=
( "Z
Y Z √
#)N
× Ex∗ Dz
q̂ 2
x
PX (xα ) dxα e 2 α − m̂x x
∗ α −i z q̂xα

α
Z
(c) nN [ λ m2 + λ (n−1)q 2 +mm̂+ n−1 q q̂ ]
= dm̂ dm dq̂ dq e 2 4 × 2

 Z √ ion
N
i
n h q̂ 2
x − m̂x x− z q̂x
× Ex∗ Dz Ex e 2 ∗

Z
(d)
dm̂ dm dq̂ dq enN [ 2 m ]×
λ 2 − λ q 2 +mm̂− 1 q q̂
= 4 2


  
q̂ 2
−m̂x∗ x−iz q̂x
PX (x) dx e 2 x
R R
nN Ex∗ Dz log
Z ×e
= dm̂ dm dq̂ dq enN Φ(m,q,m̂,q̂)

where

Z Z 
λ 1 q̂ 2
Φ(m, q, m̂, q̂) = (2m2 − q 2 ) + mm̂ − q q̂ + Ex∗ Dz log PX (x) dx e 2
x +( q̂z−m̂x∗ )x
4 2

and

(a) uses the Hubbard–Stratonovich transformation


Z ∞  2 
 a  1 z
2
exp − x = √ exp − − ixz dz, ∀a > 0
2 2πa −∞ 2a
140 F. Krzakala and L. Zdeborová

Then we have
     !2 
X q̂ X q̂ X q̂ X
exp −q̂ xα xβ  = exp − xα xβ  = exp  x2 − xα 
2 2 α α 2 α
α<β α̸=β
!Z !
q̂ X 2 1 z 2
√ dz exp − − iz q̂
p X
= exp x xa
2 α α 2π 2 α
!Z !
q̂ X 2
Dz exp −iz q̂
p X
= exp x xa
2 α α α

(b) exchange order of integrals, and split xα ’s

(c) use the fact that xα ’s are i.i.d.

(d) take limit n → 0 and use the fact that when n is small we have
h i
E[X n ] = E en log(X) ≃ E [1 + n log(X)] = 1 + nE [log(X)]
= exp (log (1 + nE [log(X)])) ≃ exp (nE [log(X)])

Recall that according to the Nishimori identity we have q = m and q̂ = m̂, it simplifies to
λ 2 1
ΦNishi (m, m̂) ≜ Φ(m, q, m̂, q̂)|q=m,q̂=m̂ = m + mm̂
4 Z 2

Z 
+ Ex∗ Dz log
m̂ 2
PX (x) dx e 2 x −( i m̂z+m̂x ∗ )x

We can further reduce the problem by taking partial derivative w.r.t. m and set it to zero
∂ λ 1
ΦNishi (m, m̂) = m + m̂ = 0 ⇒ m̂ = −λm
∂m 2 2
Plug this back to ΦNishi (m, m̂) we will obtain the final free entropy function under replica
symmetry Ansatz

ΦRS (m) ≜ ΦNishi (m, m̂)|m̂=−λm



Z Z 
λ 2 x −(i −λmz−λmx∗ )x
−λm 2
= − m + Ex∗ Dz log PX (x) dx e 2
4

Z Z 
λ 2 −λm 2
x +(λmx∗ − λmz)x
= − m + Ex∗ Dz log PX (x) dx e 2
4

 Z 
λ 2 −λm 2
x +(λmx + λmz)x
= − m + Ex∗ ,z log PX (x) dx e 2 ∗
4
where the last step is because the standard normal distribution is symmetric around zero and
a change of variable z ← −z.

The self-consistent equation on m thus reads


−λm 2

dx PX (x) xx∗ e 2 x +(λmx∗ − λmz)x
R
m = Ex∗ ,z −λm 2
√ (8.2)
dx PX (x)e 2 x +(λmx∗ − λmz)x
R

which involves, at worst, 3 integrals, and is therefore tractable numerically.


8.4 A rigorous proof via Interpolation 141

8.4 A rigorous proof via Interpolation

Preliminaries

Lemma 11 (Stein’s Lemma). Let X ∼ N (µ, σ 2 ). Let g be a differentiable function such that the
expectation E [(X − µ)g(X)] and E [g ′ (X)] exists, then we have

E [g(X)(X − µ)] = σ 2 E g ′ (X)


 

Particularly, when X ∼ N (0, 1), we have

E [Xg(X)] = E g ′ (X)
 

Proposition 1 (Nishimori Identity). Let (X, Y ) be a couple of random variables on a polish space.
Let k ≥ 1 and let X (1) , . . . , X (k) be k i.i.d. samples (given Y ) from the distribution P (X = · | Y ),
independently of every other random variables. Let us denote ⟨·⟩ the expectation w.r.t. P (X = · | Y )
and E [·] the expectation w.r.t. (X, Y ). Then for all continuous bounded function f
hD  Ei hD  Ei
E f Y, X (1) , . . . , X (k−1) , X (k) = E f Y, X (1) , . . . , X (k−1) , X

Corollary 1 (Nishimori Identity for Two Replicas). Consider model y = g(x∗ ) + w, where g is a
continuous bounded function and w is the additive noise. Let us denote ⟨·⟩x∗ ,w the expectation w.r.t.
P (X = · | Y = g(x∗ ) + w). Then we have
h i D  E 
EX ∗ ,W ⟨f (X, X ∗ )⟩X ∗ ,W = EX ∗ ,W f X (1) , X (2)
X ∗ ,W

where X (1) , X (2) are two independent replicas distributed as P (X = · | Y = g(x∗ ) + w)

Two Problems

Problem A: From previous lecture we studied the scalar denoising problem that y = λx∗ + ω


 
λ
P (x | y) ∝ exp log (PX (x)) − x2 + λx∗ x + λωx (8.3)
2

 Z  
λ ∗ ∗
Φdenoising (λ) = Ex∗ ,ω log PX (x) dx exp − x + λx x + λxz (8.4)
2

Suppose we solve N such problem parallelly such that y = λmx∗ + ω as problem A and
define the Hamiltonian HA (x, λ, x∗ , ω; m)
!
X X  λm


P (x | y) ∝ exp log (PX (xi )) + − x + λmxi xi + λmωi xi (8.5)
2
2 i
i i
X  λm √
X 
∗ ∗
HA (x, λ, x , ω; m) ≜ − log (PX (xi )) − − 2
x + λmxi xi + λmωi xi (8.6)
2 i
i i
142 F. Krzakala and L. Zdeborová

with partition function


!
Z Y X λm √
ZA (λ, x∗ , ω; m) = dxi exp log (PX (xi )) − x2 + λmx∗i xi + λmωi xi
2 i
i i

YZ  
λm 2 ∗
= PX (xi ) dxi exp − x + λmxi xi + λmωi xi
2 i
i
log (ZA (λ, x∗ , ω; m)) √
   Z 
1 X λm 2
Ex∗ ,ω = Ex∗i ,ωi log PX (xi ) dxi exp − xi + λmx∗i xi + λmωi xi
N N 2
i

 Z  
λm 2 ∗
= Ex1 ,ω1 log
∗ PX (x1 ) dx1 exp − x + λmx1 x1 + λmω1 x1
2 1
= Φdenoising (λm)

q
Problem B: Our target rank-one matrix factorization problem Y = Nλ x∗ x∗⊺ + ξ, define the
Hamiltonian HB (x, x∗ , λ)
 " r #
X X λ 2 2 λ λ
P (x | Y) ∝ exp  log (PX (xi )) + − x x + x∗ x∗ xi xj + ξij xi xj (8.9)
2N i j N i j

N
i i≤j
" r #

X X λ 2 2 λ ∗ ∗ λ
HB (x, λ, x , ξ) ≜ − log (PX (xi )) − − x x + x x xi xj + ξij xi xj (8.10)
2N i j N i j N
i i≤j

with partition function ZB (λ, x∗ , ξ) is the quantity that we are interested in.

Lower Bound by Guerra’s Interpolation

Define
(t)
H̃A (xλ, x∗ , ω; m) ≜ HA (x, tλ, x∗ , ω; m)
X  tλm √
X 
2 ∗
=− log (PX (xi )) − − x + tλmxi xi + tλmωi xi
2 i
i i
(t)
H̃B (x, λ, x∗ , ξ) ∗
≜ HB (x, tλ, x , ξ)
" r #
X X tλ 2 2 tλ ∗ ∗ tλ
=− log (PX (xi )) − − x x + x x xi xj + ξij xi xj
2N i j N i j N
i i≤j

And the interpolated Hamiltonian as


(1−t) (t)
Ht (x, λ, x∗ , ω, ξ; m) ≜ H̃A (x, λ, x∗ , ω; m) + H̃B (x, λ, x∗ , ξ)
(1 − t)λm 2
X  

p
=− − xi + (1 − t)λmxi xi + (1 − t)λmωi xi
2
i
" r #
X tλ 2 2 tλ ∗ ∗ tλ
− − x x + x x xi xj + ξij xi xj (8.11)
2N i j N i j N
i≤j

with partition function Zt (λ, x∗ , ω, ξ; m).


8.4 A rigorous proof via Interpolation 143

This is a problem where we have ground truth x∗ and observations

( q
Yij = tλ x∗ x∗ + ξij , ∀1 ≤ i ≤ j ≤ N
p N i j
yi = (1 − t)λmx∗i + ωi , ∀1 ≤ i ≤ N

Besides, it is easy to verify that

H0 (x, λ, x∗ , ω, ξ; m) ≡ HA (x, λ, x∗ , ω; m) , Z0 (λ, x∗ , ω, ξ; m) ≡ ZA (λ, x∗ , ω; m) , (8.12)


∀m
∗ ∗ ∗ ∗
H1 (x, λ, x , ω, ξ; m) ≡ HB (x, λ, x , ξ) , Z1 (λ, x , ω, ξ; m) ≡ ZB (λ, x , ξ) , ∀ m (8.13)

Notice that


Ht (x, λ, x∗ , ω, ξ; m)
∂t " √ # " p #
X λm X
∗ λm X λ 2 2 λ ∗ ∗ λ/N
=− 2
xi − λmxi xi − √ ωi xi − − x x + x x xi xj + √ ξij(8.14)
xi xj
2 2 1 − t 2N i j N i j 2 t
i i i≤j

Therefore this derivative can be into several Boltzmann average terms associated to Ht (x, λ, x∗ , ω, ξ; m).
For short we denote θ = {λ, x∗ , ω, ξ}

∂ log (Zt (θ; m)) 1 1


Z

= dx e−Ht (x,θ;m)
∂t N N Zt (θ; m) ∂t
e−Ht (x,θ;m) ∂
 
1 1
Z

=− dx Ht (x, θ; m) = − Ht (x, θ; m)
N Zt (θ; m) ∂t N ∂t t,θ,m
| {z }
=Pt,θ;m (x)
D E 
( x2i x2j "
x2i
#
1 λ X t,θ,m
X t,θ,m
=− − x∗i x∗j xi xj − λm − ⟨x∗i xi ⟩t,θ,m
 
N N

2 t,θ,m  2
i≤j i
p √ )
λ/N X λm X
− √ ξij ⟨xi xj ⟩t,θ,m + √ ωi ⟨xi ⟩t,θ,m (8.15)
2 t i≤j 2 1−t i

Let ΦMF (λ) be the free entropy density of the rank-one matrix factorization problem under
144 F. Krzakala and L. Zdeborová

SNR λ (problem B), from fundamental theorem of calculus we have


log (ZB (λ, x∗ , ξ))
 
ΦMF (λ) = lim Ex∗ ,ξ
N →∞ N
log (Z1 (λ, x∗ , ω, ξ; m))
 
= lim Ex∗ ,ω,ξ
N →∞ N
log (Z0 (λ, x∗ , ω, ξ; m)) ∂ log (Zt (λ, x∗ , ω, ξ; m))
 Z 1 
= lim Ex∗ ,ω,ξ + dτ
N →∞ N 0 ∂t N t=τ

  Z 1  
(a) log (ZA (λ, x , ω; m)) ∂ log (Zt (θ; m))
= lim Ex∗ ,ω + dτ Ex∗ ,ω,ξ
N →∞ N 0 ∂t N t=τ
(

 
(b) log (ZA (λ, x , ω; m))
= lim Ex∗ ,ω
N →∞ N
 D E 
Z 1  x 2 x2 
λ  i j 
X τ,θ,m ∗ ∗
− 2 dτ Ex∗ ,ω,ξ  − xi xj xi xj τ,θ,m 

N 0  2 
i≤j  
" ( 2 )#
λm 1
Z X xi τ,θ,m

+ dτ Ex∗ ,ω,ξ − ⟨xi xi ⟩τ,θ,m
N 0 2
i 
Z 1
λ Xn o
+ dτ Ex∗ ,ω,ξ  x2i x2j τ,θ,m − ⟨xi xj ⟩2τ,θ,m 
2N 2 0
i≤j
Z 1 " #)
λm X n o
− dτ Ex∗ ,ω,ξ x2i τ,θ,m − ⟨xi ⟩2τ,θ,m
2N 0
i
(

 
log (ZA (λ, x , ω; m))
= lim Ex∗ ,ω
N →∞ N
 
Z 1
λ X n o
− dτ Ex∗ ,ω,ξ  ⟨xi xj ⟩2τ,θ,m − 2 x∗i x∗j xi xj τ,θ,m 
2N 2 0
i≤j
Z 1 " #)
λm Xn 2 ∗
o
+ dτ Ex∗ ,ω,ξ ⟨xi ⟩τ,θ,m − 2 ⟨xi xi ⟩τ,θ,m
2N 0
i
(
log (ZA (λ, x∗ , ω; m))
 
(c)
= lim Ex ,ω ∗
N →∞ N
" #
λm 1
Z Xn o
2 ∗
+ dτ Ex∗ ,ω,ξ ⟨xi ⟩τ,θ,m − 2 ⟨xi xi ⟩τ,θ,m
2N 0
i
 )
Z 1
λ X n o
− dτ Ex∗ ,ω,ξ  ⟨xi xj ⟩2τ,θ,m − 2 x∗i x∗j xi xj τ,θ,m 
4N 2 0
i,j
(

 
log (ZA (λ, x , ω; m))
= lim Ex∗ ,ω
N →∞ N
 
Z 1
λ X X
− dτ Ex∗ ,ω,ξ  ⟨xi xj ⟩2τ,θ,m − 2N m ⟨xi ⟩2τ,θ,m 
4N 2 0
i,j i
 )
Z 1
λ X 2
X
+ dτ Ex∗ ,ω,ξ  x∗i x∗j xi xj τ,θ,m − 2N m ⟨x∗i xi ⟩2τ,θ,m 
2N 2 0
i,j i
(

 
(d) log (ZA (λ, x , ω; m))
= lim Ex∗ ,ω
N →∞ N
8.4 A rigorous proof via Interpolation 145

where

(a) Uses Eqn 8.12, and the short hand notation θ = {λ, x∗ , ω, ξ}.

(b) Plug in Eqn 8.15 and uses the Stein’s Lemma to deal with terms containing ξij and ωi

r
∂ tλ
Ht (x, θ; m) = − xi xj
∂ξij N
Z
∂ ′ ∂
Zt (θ; m) = − dx′ e−Ht (x ,θ,m) Ht x′ , θ; m

∂ξij ∂ξij
−Ht (x′ ,θ;m)
r Z r
tλ ′e ′ ′ tλ ′ ′
= Zt (θ; m) · dx xi xj = Zt (θ; m) · xx
N Zt (θ; m) N i j t,θ,m
 
h i ∂
Ex ,ω,ξ ξij ⟨xi xj ⟩t,θ,m = Ex ,ω,ξ
∗ ∗ ⟨xi xj ⟩t,θ,m
∂ξij
"Z ( ∂ ∂ )#
− ∂ξij Ht (x, θ; m) ∂ξ Z t (θ; m)
= Ex∗ ,ω,ξ dx xi xj e−Ht (θ;m) − ij
Zt (θ; m) [Zt (θ; m)]2
"Z r #
e−Ht (θ;m) tλ n o
= Ex∗ ,ω,ξ dx xi xj xi xj + x′i x′j t,θ,m
Zt (θ; m) N
r
tλ h i
= Ex∗ ,ω,ξ x2i x2j t,θ,m − ⟨xi xj ⟩2t,θ,m
N

Similarly, for the term containing ωi , we have

∂ p
Ht (x, θ; m) = − (1 − t)λmxi
∂ωi Z
∂ ′ ∂
Zt (θ; m) = − dx′ e−Ht (x ,θ,m) Ht x′ , θ; m

∂ωi ∂ωi

e−Ht (x ,θ;m) ′ ′
Z
= Zt (θ; m) · (1 − t)λm dx′
p
xx
Zt (θ; m) i j
= Zt (θ; m) · (1 − t)λm x′i t,θ,m
p
 
h i ∂
Ex∗ ,ω,ξ ωi ⟨xi ⟩t,θ,m = Ex∗ ,ω,ξ ⟨xi ⟩t,θ,m
∂ωi
"Z ( ∂ ∂
)#
−Ht (θ;m)
− ∂ωi Ht (x, θ; m) ∂ωi Zt (θ; m)
= Ex∗ ,ω,ξ dx xi e −
Zt (θ; m) [Zt (θ; m)]2
"Z #
e−Ht (θ;m) p n

o
= Ex∗ ,ω,ξ dx xi (1 − t)λm xi + xi t,θ,m
Zt (θ; m)
h i
= (1 − t)λm Ex∗ ,ω,ξ x2i t,θ,m − ⟨xi ⟩2t,θ,m
p
146 F. Krzakala and L. Zdeborová

(c) Uses the fact that when ai ’s are O(1)


 
1 X 1 X X
lim ai aj = lim ai aj + a2i 
N →∞ N 2 N →∞ 2N 2
i≤j i,j i
1 X
ai aj + O N −1

= lim 2
N →∞ 2N
i,j
1 X
= lim ai aj
N →∞ 2N 2
i,j

(d) Write as different replicas for the second term:


 
λ X X
Ex∗ ,ω,ξ  ⟨xi xj ⟩2τ,θ,m − 2N m ⟨xi ⟩2τ,θ,m 
4N 2
i,j i
* + 
λ X (1) (1) (2) (2) X (1) (2)
= Ex∗ ,ω,ξ  xi xj xi xj − 2N m xi xi 
4N 2
i,j i τ,θ,m
* !2 ! + 
λ X (1) (2)
X (1) (2)
= 2
Ex∗ ,ω,ξ  xi xi − 2N m xi xi + N 2 m2 − N 2 m2 
4N
i i τ,θ,m
* !2 + 
λ 1 X (1) (2) 2
= Ex∗ ,ω,ξ  xi xi − m  − λm
4 N 4
i τ,θ,m

Analogously, we have
 
Z 1
λ X 2
X
dτ Ex∗ ,ω,ξ  x∗i x∗j xi xj − 2N m ⟨x∗i xi ⟩2τ,θ,m 
2N 2 0
τ,θ,m
i,j i
* + 
λ X X
= Ex∗ ,ω,ξ  x∗i x∗j xi xj − 2N m x∗i xi 
2N 2
i,j i τ,θ,m
* !2 ! + 
λ X X
= Ex∗ ,ω,ξ  x∗i xi − 2N m x∗i xi + N 2 m2 − N 2 m2 
2N 2
i i τ,θ,ξ
* !2 + 
λ 2
 − λm
X
= Ex∗ ,ω,ξ  x∗i xi − m
2 2
i τ,θ,m

(e) Utilize Eqn 8.8 and the Nishimori identity so that


* !2 +  * !2 + 
X X (1) (2)

Ex∗ ,ω,ξ  xi xi − m  = Ex∗ ,ω,ξ  xi xi − m 
i τ,θ,m i τ,θ,m

Hence, we found a lower bound that


λm2 λm2
 
ΦMF (λ) ≥ Φdenoising (λm) − , ∀m ⇒ ΦMF (λ) = max Φdenoising (λm) −
4 m 4
8.4 A rigorous proof via Interpolation 147

Upper Bound by Fixed-Magnetization Models

To get start, we first redefine the Hamiltonian of problem A by replacing some m into θ:
X X  λq 
∗ ∗
(8.16)
2
p
ĤA (x, λ, x , ω; m, q) ≜ − log (PX (xi )) − − xi + λmx x + λqωi xi
2
i i

And as before, we define the interpolated Hamiltonian


˜ (t)
ĤA ≜ ĤA (x, tλ, x∗ , ω; m, q)
X X  tλq 

2
p
=− log (PX (xi )) − − x + tλmx x + tλqωi xi
2 i
i i
∗ ˜ (1−t) (t)
Ĥt (x, λ, x , ω, ξ; m, q) ≜ ĤA (x, λ, x∗ , ω; m) + H̃B (x, λ, x∗ , ξ)
(1 − t)λq 2
X  

p
=− − xi + (1 − t)λmxi xi + (1 − t)λqωi xi
2
i
" r #
X tλ 2 2 tλ ∗ ∗ tλ
− − x x + x x xi xj + ξij xi xj (8.17)
2N i j N i j N
i≤j

Consider models at fixed value of magnetization M = N1 i x∗i xi , i.e. we use the same
P
family of Hamiltonians as above, but the system only contains configurations with the given
magnetization M
!
1
Z

X
fixed
ZA (λ, x∗ , ω; m, q, M ) ≜ dx e−ĤA (x,λ,x ,ω;m,q) δ M − x∗i xi (8.18)
N
i
!
1
Z

X
fixed
ZB (λ, x∗ , ξ; M ) ≜ dx e−HB (x,λ,x ,ξ) δ M − x∗i xi (8.19)
N
i
!
1
Z X
Ztfixed (θ; m, q, M ) ≜ dx e−Ĥt (x,θ;m,q) δ M − x∗i xi (8.20)
N
i

It is easy to verify that

Z1fixed (θ; m, q, M ) ≡ ZB
fixed
(λ, x∗ , ξ; M ) , ∀ m, q

Again, we are interested in the quantity


" #
fixed (λ, x∗ , ξ; M )
log ZB
Φfixed
MF (λ; M ) ≜ lim Ex∗ ,ω,ξ
N →∞ N

Notice that
" √ #
∂ X λq X λq
Ĥt (x, λ, x∗ , ω, ξ; m, q) = − x2i − λmx∗i xi − √ ωi xi
∂t 2 2 1−t
i i
" p #
X λ 2 2 λ ∗ ∗ λ/N
− − x x + x x xi xj + √ ξij xi xj (8.21)
2N i j N i j 2 t
i≤j
148 F. Krzakala and L. Zdeborová

Therefore this derivative can be into several Boltzmann average terms associated to

Ĥt (x, λ, x∗ , ω, ξ; m, q) with fixed magnetization M . For short we denote θ = {λ, x∗ , ω, ξ}

!
∂ log Ztfixed (θ; m, q, M ) 1 1 1
Z
∂ X
= dx e−Ĥt (x,θ;m,q) δ M − x∗i xi
∂t N N Ztfixed (θ; m, q, M ) ∂t N
i
−Ĥ (x,θ;m,q) 1 P ∗

1 e δ M − N i xi xi ∂
Z t
=− dx Ĥt (x, θ; m, q)
N Zt (θ; m) ∂t
| {z }
=Pt,θ,m,q,M (x)
 
1 ∂
=− Ĥt (x, θ; m, q)
N ∂t t,θ,m,q,M
D E 
( x2 x2
1 λ X  i j t,θ,m,q,M
=− − x∗i x∗j xi xj

N N

2 t,θ,m,q,M 
i≤j
X hq i
−λ x2i t,θ,m,q,M
− m ⟨x∗i xi ⟩t,θ,m,q,M
2
i
p √ )
λ/N X λq X
− √ ξij ⟨xi xj ⟩t,θ,m,q,M + √ ωi ⟨xi ⟩t,θ,m,q,M
2 t i≤j 2 1−t i
(8.22)
8.4 A rigorous proof via Interpolation 149

Redo everything using same trick above gives


" #
log ZB fixed (λ, x∗ , ξ; M )
fixed
ΦMF (λ; M ) = lim Ex∗ ,ξ
N →∞ N
" #
log Z1fixed (λ, x∗ , ω, ξ; m, q, M )
= lim Ex∗ ,ω,ξ
N →∞ N
" #
log Z0fixed (λ, x∗ , ω, ξ; m, q, M ) ∂ log Ztfixed (λ, x∗ , ω, ξ; m, q, M )
 Z 1 
= lim Ex∗ ,ω,ξ + dτ
N →∞ N 0 ∂t N
" # Z " t=τ
#
fixed (λ, x∗ , ω; m, q, M )
∂ log Ztfixed (θ; m, q, M )
 1

log ZA
= lim Ex∗ ,ω + dτ Ex∗ ,ω,ξ
N →∞ N 0 ∂t N
( " # t=τ
fixed (λ, x∗ , ω; m, q, M )

log ZA
= lim Ex∗ ,ω
N →∞ N
 D E 
Z 1  x 2 x2 
λ  i j 
X τ,θ,m,q,M ∗ ∗
− 2 dτ Ex∗ ,ω,ξ  − xi xj xi xj τ,θ,m,q,M 

N 0  2 
i≤j  
" #
λ 1
Z X nq o
2 ∗
+ dτ Ex∗ ,ω,ξ x − m ⟨xi xi ⟩τ,θ,m,q,M
N 0 2 i τ,θ,m,q,M
i 
Z 1
λ Xn o
+ dτ Ex∗ ,ω,ξ  x2i x2j τ,θ,m,q,M − ⟨xi xj ⟩2τ,θ,m,q,M 
2N 2 0
i≤j
Z 1 " #)
λq X n o
− dτ Ex∗ ,ω,ξ x2i τ,θ,m,q,M − ⟨xi ⟩2τ,θ,m,q,M
2N 0
i
( " #
fixed ∗
log ZA (λ, x , ω; m, q, M )
= lim Ex∗ ,ω
N →∞ N
 
Z 1
λ X n o
− dτ Ex∗ ,ω,ξ  ⟨xi xj ⟩2τ,θ,m,q,M − 2 x∗i x∗j xi xj τ,θ,m,q,M 
2N 2 0
i≤j
Z 1 " #)
λ Xn 2 ∗
o
+ dτ Ex∗ ,ω,ξ q ⟨xi ⟩τ,θ,m,q,M − 2m ⟨xi xi ⟩τ,θ,m,q,M
2N 0
i
( " #
log ZA fixed (λ, x∗ , ω; m, q, M )
= lim Ex∗ ,ω
N →∞ N
 
Z 1
λ Xn o
− dτ Ex∗ ,ω,ξ  ⟨xi xj ⟩2τ,θ,m,q,M − 2 x∗i x∗j xi xj τ,θ,m,q,M 
4N 2 0
i,j
Z 1 " #)
λ X n o
+ dτ Ex∗ ,ω,ξ q ⟨xi ⟩2τ,θ,m,q,M − 2m ⟨x∗i xi ⟩τ,θ,m,q,M
2N 0
i
( " #
fixed ∗
log ZA (λ, x , ω; m, q, M )
= lim Ex∗ ,ω
N →∞ N
 
Z 1
λ X X
− dτ Ex∗ ,ω,ξ  ⟨xi xj ⟩2τ,θ,m,q,M − 2N q ⟨xi ⟩2τ,θ,m,q,M 
4N 2 0
i,j i
 )
Z 1
λ X 2
X
+ dτ Ex∗ ,ω,ξ  x∗i x∗j xi xj τ,θ,m,q,M − 2N m ⟨x∗i xi ⟩2τ,θ,m,q,M 
2N 2 0
i,j i
(

 
150 F. Krzakala and L. Zdeborová

Hence, we found a upper bound that

λq 2 λ λm2
ΦFixed
MF (λ; M ) ≤ Φdenoising (λm) + + (M − m)2 − , ∀ m, q
4 2 2
λq 2 λ λm2
 
⇒ Φfixed
MF (λ; M ) ≤ min Φdenoising (λm) + + (M − m)2 −
m,q 4 2 2

Sandwich the ΦMF (λ)

In the large N limit, the Boltzmann distribution will be dominated by the configurations with
specific magnetization, so by Laplace method (we are a bit sloppy here) we have

ΦMF (λ) = max Φfixed


MF (λ, M )
M
λq 2 λ λm2
 
2
≤ max min Φdenoising (λm) + + (M − m) −
M m,q 4 2 2
2 2
 
λq λ λm
≤ max Φdenoising (λm) + + (M − m)2 −
M 4 2 2 m=M,q=M
2
 
λM
= max Φdenoising (λM ) −
M 4

Finally, combine the bound from both sides and note that ΦMF (λ) does not depend on m and
M

 h i
ΦMF (λ) ≥ maxm Φdenoising (λm) − λm2 
λm2

4
h i ⇒ ΦMF (λ) = max Φdenoising (λm) −
ΦMF (λ) ≤ maxM Φdenoising (λM ) − λM 2 m 4
4

Bibliography

Nishimori demonstrated that his symmetry implied replica-symmetry in Nishimori (1980).


The modern approach in terms of perturbations is discussed in Korada and Macris (2009);
Abbe and Montanari (2013); Coja-Oghlan et al. (2018). The spiked Wigner model has been
studied in great detail over the last decades, and has been the topics of many fundamental
papers Johnstone (2001); Baik et al. (2005). The replica approach to this problem is reviewed
in details in Lesieur et al. (2017). The results was first proved in Barbier et al. (2016). We
presented here the alternative later proof of El Alaoui and Krzakala (2018). Thanks to an
universality theorem Krzakala et al. (2016), many problems can be reduced to variants of such
low-rank factorization problems and generic formula have been proven for these Lelarge and
Miolane (2019); Miolane (2017); Barbier and Macris (2019). Extensions can also be made for
almost arbitrary priors, including neural networks generating models Aubin et al. (2020).
8.5 Exercises 151

8.5 Exercises

Exercise 8.1: THe BBP transition


Consider the rank-one factorization model with vectors x where each component is
sampled uniformly from N (0, 1).

a) Using the replica expression for the free entropy, show that the overlap m between
the posterior estimate ⟨X⟩ and the real value x∗ obeys a self consistent equation.

b) Solve this equation numerically, and show that m is non zero only for SNR λ > 1.

c) Once this is done, perform simulations of the model by creating matrices


r
λ ∗ ∗⊺
Y= |x {z
x} + ξ
N |{z}
N ×N rank-one matrix symmetric iid noise

and compare the MMSE obtained with this approach with the one of any algorithm
you may invent so solve the problem. A classical algorithm for instance, is to use as
an estimator the eigenvector of Y corresponding to its largest eigenvalue.

Exercise 8.2: More phase transitions


Consider the rank-one factorization model with vectors X where each components is

model 1: sampled uniformly from ±1

model 2: sampled uniformly from ±1 (with probability ρ), otherwise 0 (with proba-
bility 1 − ρ)

a) Using the replica expression for the free entropy, show that the overlap m between
the posterior means estimate ⟨X⟩ and the real value X ∗ obeys a self consistent
equation.

b) Solve this equation numerically, and show that m is non zero only for SNR λ > 1,
for models 1 and for a non-trivial critical value for model 2. Check also that, for ρ
small enough, the transition is a first order one for model 2.
Chapter 9

Cavity method and Approximate


Message Passing

9.1 Self-consistent equation

N variables
x2i x2j λ xi xj x∗i x∗j λ
r
X λ
−HN = − + + xi xj ξij (9.1)
2N N N
1≤i≤j

N + 1 variables
x2i x2j λ xi xj x∗i x∗j λ
r
X λ
− HN +1 = − + + xi xj ξij (9.2)
2(N + 1) N +1 N +1
0≤i≤j
!
x2i x2j λ xi xj x∗i x∗j λ
r
X λ x2 λ X x2i
= − + + xi xj ξij − 0
2(N + 1) N +1 N +1 2 N +1
1≤i≤j i
r
X xi x∗ X λ

+ x0 x0 λ i
+ x0 xi ξ0i (9.3)
N +1 N +1
i i
r
N X x2 X xi x∗
2λ λ
X

= −HN (λ ) − x0 i
+ x0 x0 i
+ x0 xi ξ0i + o(1) (9.4)
N +1 2 N N N
i i i

Let us now look at the average magnetization of the new spin. It must satisfies:
x2 x0 x∗
q
2λ i ∗ 0 +x λ
P P P
−x i N +x0 x0 xi ξ0i
dx0 PX (x0 ) x∗0 x0 ⟨e 0 2
R 0
i N i N ⟩N
m = Eξ,x∗ ,x∗0 ,ξ0 x2 xi x∗
q (9.5)
−x2 λ i +x x∗ i λ
P P P
i N +x0 xi ξ0i
R 0 0
dx0 PX (x0 )⟨e 0 2 i N i N ⟩N
Let us therefore evaluate this term in brackets. Using concentration of measure for the overlaps,
we find
x2 xi x∗
q q
−x20 λ i ∗ i +x λ
−x20 λ ρ+x0 x∗0 m+x0 λ
P P P P
i N +x0 x0 xi ξ0i xi ξ0i
⟨e 2 i N 0 i N ⟩N ≈ ⟨e 2 i N ⟩N (9.6)
q
2λ ∗
P λ
x0 xi ξ0i
= e−x0 2 ρ+x0 x0 m ⟨e i N ⟩N (9.7)
154 F. Krzakala and L. Zdeborová

How to deal with the last term? We could expand in power of m! Indeed, concentration
suggest that the xi are xj are only weekly correlated so that we could hope that:
r
(x0 xi )2 2 λ
q
P
x0 i xi ξ0i Nλ Y x x ξ qλ Y λ
0 i 0i
⟨e ⟩N = ⟨ e N ⟩N ≈ ⟨ (1 + x0 xi ξ0i + ξ0i )⟩N
N 2 N
i i
r r
Y λ 1 2 λ
Y λ 1 λ
≈ (1 + x0 ⟨xi ⟩ξ0i + x20 ⟨x2i ⟩ξ0i )≈ (1 + x0 ⟨xi ⟩ξ0i + x20 ⟨x2i ⟩ )
N 2 N N 2 N
i i
x2 x2 x2 2
q q
λ
− 20 2 2 λ 0 2 2 λ λ 0 qλ+ x0 ρλ
P P P P
x0 i ⟨xi ⟩ξ0i i ⟨xi ⟩ ξ0i N + 2 i ⟨xi ⟩ξ0i N x0 i ⟨xi ⟩ξ0i −
≈e N ≈e N 2 2 (9.8)

This can be actually proved rigorously. Indeed consider the two following expressions:
r
x20 X λ
A = − λρ + x0 ξi0 xi (9.9)
2 N
i
r
x20 X λ
B = − λq + x0 ξi0 ⟨xi ⟩ (9.10)
2 N
i

It is easy to check via Gaussian integration, and using concentration, that

E⟨(eA − eB )2 ⟩ → 0 (9.11)

We thus obtain
x2 x0 x∗
q q
−x20 λ i ∗ 0 +x λ
−x20 λ q+x0 x∗0 m+x0 λ
P P P P
i N +x0 x0 xi ξ0i i ⟨xi ⟩ξ0i
⟨e 2 i N 0 i N ⟩N ≈ e 2 N (9.12)

Recognizing that the last term is actually a random Gaussian variable thanks to the CLT, we
have
2 qλ ∗

dx0 PX (x0 ) x∗0 x0 e−x0 2 +x0 x0 m+x0 λqz
R
m = Ex∗0 ,z R 2 qλ ∗
√ (9.13)
dx0 PX (x0 )e−x0 2 +x0 x0 m+x0 λqz

and we have recovered our self-consistent equation from a rigorous computation. This is the
power of the cavity method!

9.2 Rank-one by the cavity method

We can derive the free energy as well. To do this, we need to be keep track of all order 1
constant, so we write:

x2i x2j λ xi xj x∗i x∗j λ


r
X λ
− HN +1 = − + + xi xj ξij (9.14)
2(N + 1) N +1 N +1
0≤i≤j
!
x2i x2j λ xi xj x∗i x∗j λ
r
X λ x2 λ X x2i
= − + + xi xj ξij − 0
2(N + 1) N +1 N +1 2 N +1
1≤i≤j i
r
X xi x∗ X λ
+ x0 x∗0 λ i
+ x0 xi ξ0i (9.15)
N +1 N +1
i i
9.2 Rank-one by the cavity method 155

In the parenthesis, we recognise, expanding in N and keeping only O(1) variables

λ X x2i x2j X xi x∗i xj x∗j


r
1 λ X
(.) = −HN + − λ − xi xi ξij + o(1) (9.16)
2 N2 N2 2N N
i≤j i≤j i≤j

λ X x2i x2j λ X xi x∗i xj x∗j


r
1 λ X
= −HN + 2
− 2
− xi xi ξij + o(1) (9.17)
4 N 2 N 4N N
i,j i,j i,j
!2 !2 r
λ X x2i λ X xi x∗i 1 λ X
= −HN + − − xi xi ξij + o(1) (9.18)
4 N 2 N 4N N
i i i,j

We thus write
!2 !2 r
λ X x2 λ X xi x∗ 1 λ X
i i
−HN +1 = − HN + − − xi xi ξij + o(1)
4 N 2 N 4N N
i i i,j
r
λ X x2i X xi x∗ X λ
− x20 + x0 x∗0 i
+ x0 xi ξ0i
2 N N N
i i i

We can now compute the free energy using the Cavity method and write
2 2
x2 xi x∗
  q
λ i −λ i 1 λ
P P P
ZN +1 4 i N 2 i N
+ 4N N i,j xi xi ξij
F ≈ E log =Eξ,x∗ log⟨e ⟩N
ZN
x2 xi x∗
Z q
−x20 λ i ∗ i +x λ
P P P
i N +x0 x0 0 xi ξ0i
+ Eξ,x∗ ,ξ0 log dx0 P (x0 )⟨e 2 i N i N ⟩N
(9.19)

All these terms are somehow simple, and more importantly, look like they depends on our
order parameters, excepts for the Gaussian random variables ξij and ξ0i . In fact, we already
took care of the second line, so we can further simply the expression as
2 2
x2 xi x∗
  q
λ i −λ i 1 λ
P P P
ZN +1 4 i N 2 i N + 4N N i,j xi xi ξij
F ≈ E log =Eξ,x∗ log⟨e ⟩N
ZN
Z q
−x20 λ q+x0 x∗0 m+x0 λ
P
i ⟨xi ⟩ξ0i
+ Eξ,x∗ ,ξ0 log dx0 PX (x0 )e 2 N (9.20)

Let us this deal with the first line. First notice that, because of concentration, we have:
X xi x∗ X x2
i
→ m, i
→ ρ. (9.21)
N N
i i

We thus can write, via this concentration property:


2 2
x2 xi x∗
  q
λ
−λ 1 λ ρ2 λ
P i
P i
P q
i N i + 4N i,j xi xj ξij − mλ 1
+ 4N λ
xi xi ξij
(9.22)
4 2 N N
⟨e ⟩ = o(1) + ⟨e 4 2 N ⟩
q
ρ2 λ 1 λ
− mλ xi xj ξij
Y
= o(1) + e 4 2 ⟨ e 4N N ⟩ (9.23)
ij
156 F. Krzakala and L. Zdeborová

This last terms look complicated. However, it also follows a concentration property. First, let
us notice that, using use Stein lemma, we have
r
1 λ X X λ 
E⟨ xi xj ξij ⟩ = E⟨ 2 2 2
⟨x x ⟩ − ⟨xi xj ⟩ ⟩ (9.24)
4N N 4N 2 i j
i,j

We can also compute the variance of this quantity, and check that it concentrates. This is,
actually, nothing but the Matrix-MMSE. We can thus write
2 2
x2 xi x∗
  q
λ i −λ i 1 λ
P P P
i N i + 4N i,j xi xj ξij
(9.25)
4 2 N N
⟨e ⟩
λρ2
− λm + λ2 2 2 2
P
≈e 4 2 4N ij ⟨xi xj ⟩−⟨xi xj ⟩ (9.26)
λq
− λm
≈e 4 2 (9.27)

We thus find (using also Nishimori so that m = q):

λm
Z
2 λm ∗ ∗

F ≈− + Ex∗ ,z log dx0 PX (x0 )e−x0 2 +x0 x0 m+x0 x0 λmz (9.28)
4

9.3 AMP

We come back to the cavity equation. We have the new spins x0 that "sees" an effective model
as
P ⟨xi ⟩2 ∗
1 P ⟨xi ⟩xi q
−x20 λ ∗ λ λ
P
i N +x0 x0 N +x0 i ⟨xi ⟩ξ0i N
x0 ∼ PX (x0 )e 2 i N (9.29)
Z

Can we turn this into an algorithm? If we remember that yij = x∗i x∗j λ/n + ξij , this can be
p

written as
P ⟨xi ⟩2
1
q
−x2 λ λ P
+ x0 y0i ⟨xi ⟩
x0 ∼ PX (x0 )e 0 2 i N N i
(9.30)
Z
Defining the denoising function η(A, B) as
2
dx0 PX (x0 )x0 e−x0 A/2+x0 B
R
η(A, B) =: R 2 (9.31)
dxPX (x0 )e−x0 A/2 +x0 B

We have the mean and second moments of x0 expressed as


! r
λ X
X
⟨x0 ⟩ = η λ ⟨xi ⟩2c /N, ⟨xi ⟩c y0i (9.32)
N
i i
r !
X λ X
⟨x20 ⟩ − ⟨x0 ⟩2 = η ′ ⟨xi ⟩2c /N, ⟨xi ⟩c y0i (9.33)
N
i i

where η ′ refer to the derivative with respect to B.


9.3 AMP 157

It is VERY tempting to turn this into an iterative algorithm, and write, denoting x̂ the estimator
of the marginal of x at time t
r
λ t
ht = x̂ Y (9.34)
N
 t t+1
x̂ · x̂

x̂t+1 = η λ , ht (9.35)
N
However, there is a problem! In these, we have indeed the mean of X0 , but it is not expressed
as a function of the mean of xi , but the mean of xi in a system WITHOUT X0 (the cavity
system, where x0 has been removed). What we need would be to express the mean of X0 as a
function of the mean of xi instead, not its cavity mean!

This problem was solved by Thouless, Anderson and Palmer, using Onsager’s retraction term.
To first order, we can express the cavity mean of xi (in absence of x0 ) as a function of the
actual mean xi (in presence of x0 ) as

 r   r 
X λ X X λ X
⟨xi ⟩c ≈ η λ ⟨xj ⟩2c /N, ⟨xj ⟩c yij  ≈ η λ ⟨xj ⟩2 /N, ⟨xj ⟩yij  (9.36)
N N
j̸=0 j̸=0 j̸=0 j̸=0
 r r 
X λ X λ
≈ η λ ⟨xj ⟩2 /N − λ⟨x0 ⟩2 /N, ⟨xj ⟩yij − ⟨x0 ⟩yi0  (9.37)
N N
j j
 r 
X λ X
(9.38)
p
≈ η λ ⟨xj ⟩2 /N, ⟨xj ⟩yij  − (∂B η) λ/N ⟨x0 ⟩yi0 + O(1/N )
N
j j

(9.39)
p
≈ ⟨xi ⟩ − η λ/N ⟨x0 ⟩yi0 + O(1/N )

So the algorithms need to be slightly modified! The local field acting on on the spin 0 is now
r r r !
λ X λ X λ
y0i ⟨xi ⟩c → y0i ⟨xi ⟩ − ηi′ ⟨x0 ⟩yi0
N N N
i i
r !
λ X 1 X ′ 2
= y0i ⟨xi ⟩ − λ ηi y01 ⟨x0 ⟩ (9.40)
N N
i i
r !
λ X 1 X ′
≈ y0i ⟨xi ⟩ − λ ηi ⟨x0 ⟩ (9.41)
N N
i i

so that our iterative algorithm can be written


r !
λ 1 X
ht = x̂t Y − λ ηi′ x̂t−1 (9.42)
N N
i
 t t
x̂ · x̂

x̂t+1 = η λ , ht (9.43)
N
This algorithm is often called Approximate Message Passing (AMP in short). The correction
in z is called the Onsager retro-action term.
158 F. Krzakala and L. Zdeborová

The really power-full things about this algorithm is that z is really the cavity field, at each,
time, and we know the distribution of cavity fields! In fact, from eqs.(9.30) and (9.13), we
expect that h is distributed as Gaussian so that

ht = λmt x∗ + λmt z (9.44)

with

2 mt λ ∗ t tz
dx0 PX (x0 ) x∗0 x0 e−x0 2 +λx0 x0 m −x0 λm
R
mt+1 = Ex∗0 ,z 2 mt λ ∗ t

t
(9.45)
dx0 PX (x0 )e−x0 2 +λx0 x0 m −x0 λm z
R

In other words, we can track the performance of the algorithm step by step. This last equation
is often called the "state evolution" of the algorithm.

This can actually be made rigorous (Bolthausen, Bayati-Montanari, Fletcher-Rangan):

Theorem 16 (AMP and State Evolution,


p informal). Given the AMP algorithm (eqs.9.42), the
t ∗
algorithm behaves as if h = x m + λq z, with mt given by eq.9.13 with high probability as n → ∞
t t

9.4 Exercises

Exercise 9.1: AMP

Implement AMP for the problems discussed in the Exercises in chap 7, and compare its
performance with the optimal ones

Bibliography

The cavity method as described in this section was developed by Parisi, Mézard, & Virasoro
Mézard et al. (1987a,b). It is often used in mathematical physics as well, as initiated in
Aizenman et al. (2003), and has been applied to the problem discussed in this chapter in
Lelarge and Miolane (2019). The vision of this cavity method as an algorithm is initially due
to Thouless-Anderson & Palmer Thouless et al. (1977), an article that extraordinary influential
in the statistical physics community. Its study as an iterative algorithm with a rigorous state
evolution is due to Bolthausen (2014); Bayati and Montanari (2011). For the present low-rank
problem, it was initially studied by Rangan and Fletcher (2012) and later discussed in great
details in Lesieur et al. (2017), where it was derived starting from Belief-Propagation.
Chapter 10

Stochastic Block Model & Community


Detection

10.1 Definition of the model

In this lecture we will discuss clustering of sparse networks also known as community detection
problem. To have a concrete example in mind you can picture part of the Facebook network
corresponding to students in a high-school where edges are between user who are friends.
Knowing the graph of connections the aim is to recover from such a network a division of
students corresponding to the classes in the high-school. The signal comes from the notion
that students in the same class will more likely be friends and hence connected than students
from two different classes. A widely studied simple model for such a situation is called the the
Stochastic Block Model that we will now introduce a study. Each of N nodes i = 1, . . . , N belong
of one among q classes/groups. The variable denoting to which class the node i belongs will
be denoted as s∗i ∈ {1, 2, . . P
. , q}. A node is in group a with probability (fraction of expected
group size) na ≥ 0, where qa=1 na = 1.

The edges in the graph are generated as follows: For each pair of node i, j we decide indepen-
dently whether the edge (ij) is present or not with probability
  
P (ij) ∈ E, Aij = 1 s∗ , s∗ = ps∗ s∗
i j i j
  ⇒ G(V, E) & Aij
/ E, Aij = 0 s∗i , s∗j = 1 − ps∗i s∗j
P (ij) ∈

Here pab is a symmetric q × q matrix.

The goal of community detection is given the adjacency matrix Aij to find ŝi so that ŝi is as
close as s∗i as possible according to some natural measure of distance, such as the number of
misclassification of node into wrong classes. In what follows we will discuss the case where
θ
the parameters of the model na , pab , q are known to the community detection algorithms, but
z }| {
also when they are not and need to be learned. Following previous lecture on inference a
natural approach is to consider Bayesian inference where for known values of parameters θ,
all information about s∗i we can extract from the knowledge of the graph is included in the
160 F. Krzakala and L. Zdeborová

posterior distribution:
  1    
P {si }N
i=1 G, θ = P G {s }N
i i=1 , θ P {s } N
i i=1 θ
ZG
N
1 Yh 1−Aij Aij i Y
= 1 − psi ,sj psi ,sj nsi
ZG
i<j i=1

Leading to the fully connected graphical model with one factor node per every variable and
per every (non-ordered) pair of variables.

We will study this posterior and investigate what is the algorithm that gives the highest
performance. Whether this algorithm achieves information-theoretically optimal performance
and whether there are interesting phase transition in this problem. In order to obtain self-
averaging results, we study the thermodynamic limit where N → ∞ and pab = cab /N with
cab , na , q = O(1). The intuition behind this limit is that in this way every node has on average
O(1) neighbors while the side of the graph goes large, just as in real world where one has one
average the same number of friends independently of the size of the world. This limit also
ends up challenging and presents intriguing behaviour.

The average degree of a node in group a, i.e. s∗i = a is

q q q
X X cab X
ca = pab N nb = N
nb
 cab nb (10.1)
N


b=1 b=1 b=1

Thus we can see that the average degree of a node in group a does not depends on N, i.e. even
if the number of nodes is increasing we have always the same average degree.
The overall average degree is then
X
c= cab na nb . (10.2)
a,b

10.2 Bayesian Inference and Parameter Learning

We start by defining a natural measure of performance the agreement between the original
assignment {s∗i } and its estimate {ti } as

1 X
A({s∗i }, {ti }) = max δs∗i ,π(ti ) , (10.3)
π N
i
10.2 Bayesian Inference and Parameter Learning 161

where π ranges over the permutations on q elements. We also define a normalized agreement
that we call the overlap,
1 P
∗ N i δs∗i ,π(ti ) − maxa na
Q({si }, {ti }) = max . (10.4)
π 1 − maxa na

The overlap is defined so that if s∗i = ti for all i, i.e., if we find the original labeling without
error, then Q = 1. If on the other hand the only information we have are the group sizes
na , and we assign each node to the largest group to maximize the probability of the correct
assignment of each node, then Q = 0. We will say that a labeling {ti } is correlated with the
original one {s∗i } if in the thermodynamic limit N → ∞ the overlap is strictly positive, with
Q > 0 bounded above some constant.

10.2.1 Community detection with known parameters

The probability that the stochastic block model generates a graph G, with adjacency matrix A,
along with a given group assignment {si }, conditioned on the parameters θ = {q, {na }, {cab }}
is
Y  csi ,sj Aij  csi ,sj 1−Aij Y

P (G, {si } | θ) = 1− nsi . (10.5)
N N
i<j i

Note that the above probability is normalized, i.e. G,{si } P (G, {si } | θ) = 1. Assume now
P
that we know the graph G and the parameters θ, and we are interested in the probability
distribution over the group assignments. Using Bayes’ rule we have

P (G, {si } | θ) csi ,sj 1−Aij Y


 
1 Y Aij 
P ({si } | G, θ) = P = csi ,sj 1− nsi . (10.6)
ti P (G, {ti } | θ) ZG
i<j
N
i

Theorem 12 from previous lectures tells us that in order to maximize the overlap Q({s∗i }, {ŝi })
between the ground truth assignment and an estimator ŝi we need to compute the

ŝi = argmaxsi µi (si ) , (10.7)

where µi (ti ) is
Pthe marginal probability of the posterior probability distribution. We remind
that µi (si ) = {sj }j̸=i P ({si } | G, θ).

Note the key difference between this optimal decision estimator and the maximum likelihood
estimator that is evaluating the configuration at which the posterior distribution has the
largest value. In high-dimensional, N → ∞, noisy setting as we consider in the SBM the
maximum likelihood estimator is sub-optimal with respect to the overlap with the ground
truth configuration. Note also that the posterior distribution is symmetric with respect to
permutations of the group labels. Thus the marginals over the entire distribution are uniform.
However, we will see that when communities are detectable this permutation symmetry is
broken, and we obtain that marginalization is the optimal estimator for the overlap defined
in equation 10.4, where we maximize over all permutations.

When the graph G is generated from the SBM using indeed parameters of value θ then
Nishimori identities derived in Section 7.2.4 hold and their consequence in the SBM is that in
162 F. Krzakala and L. Zdeborová

the thermodynamic limit we can evaluate the overlap Q({si }, {s∗i }) even without the explicit
knowledge of the original assignment {s∗i }. Due to the Nishimori identity it holds
1 P
i µi (ŝi ) − maxa na
lim N
= lim Q({ŝi }, {s∗i }) . (10.8)
N →∞ 1 − maxa na N →∞

The marginals µi (qi ) can also be used to distinguish nodes that have a very strong group
preference from those that are uncertain about their membership.

Another consequence of the Nishimori identities is that two configurations taken at ran-
dom from the posterior distribution have the same agreement with each other as one such
configuration with the original assignment s∗i , i.e.
1 X 1 XX
lim max µi (π(s∗i )) = lim µi (a)2 , (10.9)
N →∞ N π N →∞ N
i a i

where π again ranges over the permutations on q elements. Overall we are seeing that in
the spirit of the Nishimori identities the ground truth configuration s∗ has exactly the same
properties as any other configuration drawn uniformly from the posterior measure. This
property lets us use the ground truth configuration to probe the equilibrium properties of the
posterior, a fact that we will use heavily when coming back to the graph coloring problem.

10.2.2 Learning the parameters

Now assume that the only knowledge we have about the system is the graph G, and not the
parameters θ. The general goal in Bayesian inference is to learn the most probable values of
the parameters θ of an underlying model based on the data known to us. In this case, the
parameters are θ = {q, {na }, {cab }} and the data is the graph G, or rather the adjacency matrix
Aij . According to Bayes’ rule, the probability P (θ | G) that the parameters take a certain value,
conditioned on G, is proportional to the probability P (G | θ) that the model with parameters θ
would generate G. This in turn is the sum of P (G, {si } | θ) over all group assignments {si }:
P (θ) P (θ) X
P (θ | G) = P (G | θ) = P (G, {si } | θ) . (10.10)
P (G) P (G)
{si }

The prior distribution P (θ) includes any graph-independent information we might have about
the values of the parameters. In our setting, we wish to remain perfectly agnostic about these
parameters; for instance, we do not want to bias our inference process towards assortative
structures. Thus we assume a uniform prior, i.e., P (θ) = 1 up to normalization. Note, however,
that since the sum in (10.10) typically grows exponentially with N , we could take any smooth
prior P (θ) as long as it is independent of N ; for large N , the data would cause the prior to
“wash out,” leaving us with the same distribution we would have if the prior were uniform.
Thus maximizing P (θ | G) over θ is equivalent to maximizing the partition function over θ, or
equivalently the free energy entropy over θ.

10.3 Belief propagation for SBM

We now write Belief Propagation as we derived it in previous lectures for the probability
distribution (10.6). We note that this distribution corresponds to a fully-connected factor
10.3 Belief propagation for SBM 163

graph, while we argued Belief Propagation is designed for trees and works well on tree-like
graphs. At the same time the main reason we needed tree-like graphs was the assumption of
independence of incoming messages, if the incoming messages are changing only very weekly
the outcoming one then the needed independence can be weaker. Indeed results for other
models that we have studied so far, such as the Curie-Weiss model or the low-rank matrix
estimation can be derived from belief propagation. In the case of the Curie-Weiss model the
interactions were very weak, every neighbor was influencing the magnetization by interaction
of strength inversely proportional to N. We note that the probability distribution (10.6) is
a mixture of the terms corresponding to edges that are O(1) and organized on a tree-like
graph, and non-edged that are dense but each of them contributing by a interaction of strength
inversely proportional to N . It is thus an interesting case where a sparse and dense graphical
model combine.

The canonical Belief Propagation equations for the graphical model (10.6) read
" #
1  cti tk
1−Aik
χi→j
Y X
ti = i→j nti cA
ti tk 1 −
ik
χk→i
tk , (10.11)
Z t
N
k̸=i,j k

where Z i→j is a normalization constant ensuring ti χi→j = 1. We remind that we interpret


P
ti
i→j
the messages χqi as a marginal probability that node i is in group ti conditional to the absence
of factor ij. The BP assumes that the only correlations between i’s neighbors are mediated
through i, so that if i were missing—or if its label were fixed—the distribution of its neighbors’
states would be a product distribution. In that case, we can compute the message that i sends
j recursively in terms of the messages that i receives from its other neighbors k.

Then the marginal probability is then estimated from a fixed point of BP to be µi (ti ) ≈ χiti ,
where " #
1 Y X A  cti tk 1−Aik k→i
i
χti = i nti cti tk 1 −
ik
χtk . (10.12)
Z t
N
k̸=i k

Since we have nonzero interactions between every pair of nodes, we have potentially N (N − 1)
messages, and indeed (10.11) tells us how to update all of these for finite N . However, this
gives an algorithm where even a single update takes O(N 2 ) time, making it suitable only for
networks of up to a few thousand nodes. Happily, for large sparse networks, i.e., when N
is large and cab = O(1), we can neglect terms of sub-leading order in N . In that case we can
assume that i sends the same message to all its non-neighbors j, and treat these messages as
an external field, so that we only need to keep track of 2M messages where M is the number
of edges. In that case, each update step takes just O(M ) = O(N ) time. To see this, suppose
that (i, j) ∈
/ E. We have
" # " #  
i→j 1 Y 1 X Y X 1
χti = i→j nti 1− k→i
ctk ti χtk k→i
ctk ti χtk i
= χti + O . (10.13)
Z N t t
N
k∈∂i\j
/ k k∈∂i k

Hence the messages on non-edges do not depend to leading order on the target node j. On
the other hand, if (i, j) ∈ E we have
" # " #
1 1
χi→j
Y X Y X
ti = i→j nti 1− ctk ti χk→i
tk ctk ti χk→i
tk . (10.14)
Z N t t
k∈∂i
/ k k∈∂i\j k
164 F. Krzakala and L. Zdeborová

The belief propagation equations can hence be rewritten as


" #
i→j 1 Y X
χti = i→j nti e−hti ctk ti χk→i
tk , (10.15)
Z t k∈∂i\j k

where we neglected terms that contribute O(1/N ) to χi→j , and defined an auxiliary external
field that summarizes the contribution and overall influence of the non-edges
1 XX
hti = ctk ti χktk . (10.16)
N t k k

In order to find a fixed point of Eq. (10.15) in linear time we update the messages χi→j ,
recompute χj , update the field hti by adding the new contribution and subtracting the old
one, and repeat. The estimate of the marginal probability µi (ti ) is then
 
1
ctj ti χj→i
Y X
χiti = i nti e−hti  tj
. (10.17)
Z t
j∈∂i j

The role of the magnetic field hti in the belief propagation update is similar as the one of the
prior on group sizes nti . Contrary to the fixed nti the field hti adapts to the current estimation
of the groups sizes. For the assortative communities this term is crucial as if one group a
was becoming more represented then the corresponding field ha would be larger and would
weaken all the messages χa . The field is hence adaptively keeping the group sizes of the
correct size preventing all node to fall into the same group. For the disassortative case the field
plays a similar role of adjusting the sizes of the groups. A particular case if the disassortative
structure with groups of the same size and the same average degree of every group in which
case the term e−hti does not change the behaviour of the BP equations. This will stand on the
basis of the mapping to planted coloring problem.

When the Belief Propagation is asymptotically exact then the true marginal probabilities are
given as µi (ti ) = χiti . The estimator maximizing the overlap Q is then

ŝi = argmaxti χiti (10.18)

The overlap with the original group assignment is then computed from (10.8) under the
assumption that the assumed parameters θ were indeed the ones used to generate the graph,
leading to
1 P i
N i χŝi − maxa na
Q = lim (10.19)
N →∞ 1 − maxa na

In order to write the Bethe free entropy we use similar simplification and neglect subleading
terms. To write the resulting formula we introduce
j→i
+ χbi→j χj→i
X X
Z ij = cab (χi→j
a χb a )+ caa χi→j
a χa
j→i
for (i, j) ∈ E (10.20)
a<b a
X cab  i j
Z̃ ij
= 1− χa χb for (i, j) ∈
/ E, , (10.21)
N
a,b

ctj ti χj→i
X YX
Zi = nti e−hti tj (10.22)
ti j∈∂i tj
10.3 Belief propagation for SBM 165

we can then write the Bethe free entropy, in the thermodynamic limit as
1 X 1 X c
ΦBP (q, {na }, {cab }) = log Z i − log Z ij + , (10.23)
N N 2
i (i,j)∈E

where c is the average degree given


P by (10.2) and the third term comes from the edge-
contribution of non-edges (i.e. i,j log(Z̃ ij )). Derivation of this expression will be your

homework.

Under the assumption that the model parameters θ are known, hence Nishomori conditions
hold, we can now state a conjecture about exactness of the belief propagation in the asymptotic
limit of N → ∞. The fixed point of belief propagation corresponding to the largest Bethe free
entropy provides the Bayes-optimal estimator for the SBM in the following sense:

• The Bethe free entropy ΦBP is exact, i.e. ∀ε > 0


|ΦBP (θ) − Φexact (θ)| < ε (10.24)
with high probability over the randomness in the model and as N → ∞.
• The corresponding BP overlap is equal to the one obtained by the Bayes-optimal estimator
∀ε > 0
|QBP − QBO | < ε (10.25)
with high probability over the randomness in the model and as N → ∞.
• Finally for all up to a vanishing fraction of node the BP marginals are correct ∀ε > 0
|χisi − µi (si )| < ε (10.26)
for almost all i = 1, . . . , N with high probability as N → ∞

In case the parameters θ are not known and need to be learned the Bethe free energy is greatly
useful. In previous sections we concluded that the most likely parameters are those maximizing
the Bethe free entropy. One can thus simply run gradient descent on the parameters to
maximize the Bethe entropy. Even more conveniently, we can write the stationarity conditions
of the Bethe free entropy with respect to the parameters na and cab and use them as an iterative
procedure to estimate the parameters. Keeping in mind that a BP fixed point is a P stationary
point of the Bethe free entropy, and that for na we need to impose the normalization a na = 1
and thus we are looking for a constraint optimizer we obtain
1 X i
na = χa , (10.27)
N
i

1 1 X cab (χi→j
a χb
j→i
+ χi→j χj→i
a )
cab = ij
b
, (10.28)
N nb na Z
(i,j)∈E

where Z ij is defined in (10.20). Derivation of this expression will be your homework. The
interpretation of these expressions are very intuitive and again stems from Nishmori identities
for Bayes-optimal estimation. The first equations states that the fraction of nodes in group a
should be the expected number of nodes in group a according to the BP prediction. Similarly
for cab the meaning of the second equations is that the expected number of edges between
group a and group b is the same as when computed from the BP marginals. Therefore, BP can
also be used readily to learn the optimal parameters.
166 F. Krzakala and L. Zdeborová

10.4 The phase diagram of community detection

We will now consider the case when the parameters q, {na }, {cab } used to generate the network
are known. We will further limit ourselves to a particularly algorithmically difficult case of
the block model, where every group a has the same average degree c and hence there is no
information about the group assignment simply in the degree distribution. The condition
reads:
Xq Xq
cad nd = cbd nd = c , for all a, b . (10.29)
d=1 d=1

If this is not the case, we can achieve a positive overlap with the original group assignment
simply by labeling nodes based on their degrees. The first observation to make about the
belief propagation equations (10.15) in this case is that

χi→j
ti = nti (10.30)

is always a fixed point, as can be verified by plugging (10.30) into (10.15). The free entropy at
this fixed point is
c
Φpara = − (1 − log c) . (10.31)
2

For the marginals we have χiti = nti , in which case the overlap (10.8) is Q = 0. This fixed
point does not provide any information about the original assignment—it is no better than a
random guess. If this fixed point gives the correct marginal probabilities and the correct free
entropy, we have no hope of recovering the original group assignment. For which values of q,
na and cab is this the case?

10.4.1 Second order phase transition

Fig. 10.4.1 represents two examples where the overlap Q is computed on a randomly generated
graph with q groups of the same size and an average degree c. We set caa = cin and cab = cout
for all a ̸= b and vary the ratio ϵ = cout /cin . The continuous line is the overlap resulting from the
BP fixed point obtained by converging from a random initial condition (i.e., where for each i, j
the initial messages χi→jti are random normalized distributions on ti ). The convergence time is
plotted in Fig. 10.4.2. The points in Fig. 10.4.1 are results obtained from Gibbs sampling, using
the Metropolis rule and obeying detailed balance with respect to the posterior distribution,
starting with a random initial group assignment {qi }. We see that Q = 0 for cout /cin > ϵc . In
other words, in this region both BP and MCMC converge to the paramagnetic state, where the
marginals contain no information about the original assignment. For cout /cin < ϵc , however,
the overlap is positive and the paramagnetic fixed point is not the one to which BP or MCMC
converge.

Fig. 10.4.1(b) shows the case of q = 4 groups with average degree c = 16. We show the large N
results and also the overlap computed with MCMC for a rather small size N = 128. Again, up
to symmetry breaking, marginalization achieves the best possible overlap that can be inferred
from the graph by any algorithm.
10.4 The phase diagram of community detection 167

(a) (b)
1
1 N=100k, BP
N=500k, BP N=70k, MC
0.9 N=70k, MC N=128, MC
0.8 0.8 N=128, full BP
0.7 q=4, c=16
0.6 q*=2, c=3 0.6
overlap

overlap
0.5
0.4 0.4
0.3
0.2 detectable undetectable 0.2 undetectable
0.1
0
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0
0 0.2 0.4 0.6 0.8 1
ε*= c*out/c*in ε= cout/cin

Figure 10.4.1: (color online): The overlap (10.4) between the original assignment and its best
estimate given the structure of the graph, computed by the marginalization (10.7). Graphs
were generated using N nodes, q groups of the same size, average degree c, and different
ratios ϵ = cout /cin . Thus ϵ = 1 gives an Erdős-Rényi random graph, and ϵ = 0 gives completely
separated groups. Results from belief propagation (10.15) for large graphs (red line) are
compared to Gibbs sampling, i.e., Monte Carlo Markov chain (MCMC) simulations (data
points). The agreement is good, with differences in the low-overlap regime that we attribute to
finite size fluctuations. In the part (b) we also compare to results from the full BP (10.11) and
MCMC for smaller graphs with N = 128, averaged over 400 samples. The finite size effects are
not very strong in this case, and BP is reasonably close to the exact (MCMC) result even on
√ √
small graphs that contain many short loops. For N → ∞ and ϵ > ϵc = (c − c)/[c + c(q − 1)]
it is impossible to find an assignment correlated with the original one based purely on the
structure of the graph. For two groups and average degree c = 3 this means that the density
of connections must be ϵ−1 c (q = 2, c = 3) = 3.73 greater within groups than between groups
to obtain a positive overlap.

700
N=10k
N=100k
600

500
convergence time

400 q=4, c=16

300

200

100
εc
0
0 0.2 0.4 0.6 0.8 1
ε= cout/cin

Figure 10.4.2: (color online): The number of iterations needed for convergence of the BP
algorithm for two different sizes. The convergence time diverges at the critical point ϵc . The
equilibration time of Gibbs sampling (MCMC) has qualitatively the same behavior, but BP
obtains the marginals much more quickly.
168 F. Krzakala and L. Zdeborová

10.4.2 Stability of the paramagnetic fixed point

Let us now investigate the stability of the paramagnetic fixed point under random perturbations
to the messages when we iterate the BP equations. In the sparse case where cab = O(1), graphs
generated by the block model are locally treelike in the sense that almost all nodes have a
neighborhood which is a tree up to distance O(log N ), where the constant hidden in the O
depends on the matrix cab . Equivalently, for almost all nodes i, the shortest loop that i belongs
to has length O(log N ). Consider such a tree with d levels, in the limit d → ∞. Assume that
on the leaves the paramagnetic fixed point is perturbed as

χkt = nt + ϵkt , (10.32)

and let us investigate the influence of this perturbation on the message on the root of the
tree, which we denote k0 . There are, on average, cd leaves in the tree where c is the average
degree. The influence of each leaf is independent, so let us first investigate the influence of
the perturbation of a single leaf kd , which is connected to k0 by a path kd , kd−1 , . . . , k1 , k0 . We
define a kind of transfer matrix
" #
∂χ ki χ ki c χ ki c c 
ab sb ab
X
i
Tab ≡ k
a
= a
k
− χ ki
a
s
k
= n a − 1 . (10.33)
∂χ i+1 χt =nt car χr i+1 csr χr i+1 χt =nt c
P P
b r s r

where this expression was derived from (10.15) to leading order in N . The perturbation ϵkt00
on the root due to the perturbation ϵktdd on the leaf kd can then be written as
"d−1 #
X Y
ϵkt00 = Ttii ,ti+1 ϵktdd (10.34)
{ti }i=1,...,d i=0

We observe in (10.33) that the matrix Tab


i does not depend on the index i. Hence (10.34) can

be written as ϵk0 = T d ϵkd . When d → ∞, T d will be dominated by T ’s largest eigenvalue λ, so


ϵk0 ≈ λd ϵkd .

Now let us consider the influence from all cd of the leaves. The mean value of the perturbation
on the leaves is zero, so the mean value of the influence on the root is zero. For the variance,
however, we have
 2 +
   * X cd   
2 2
k0
ϵt0 ≈  d k
λ ϵt  ≈ c λ d 2d
ϵkt . (10.35)
k=1

This gives the following stability criterion,

cλ2 = 1 . (10.36)

For cλ2 < 1 the perturbation on leaves vanishes as we move up the tree and the paramagnetic
fixed point is stable. On the other hand, if cλ2 > 1 the perturbation is amplified exponentially,
the paramagnetic fixed point is unstable, and the communities are easily detectable.

Consider the case with q groups of equal size, where caa = cin for all a and cab = cout for
all a ̸= b. If there are q groups, then cin + (q − 1)cout = qc. The transfer matrix Tab has only
two distinct eigenvalues, λ1 = 0 with eigenvector (1, 1, . . . , 1), and λ2 = (cin − cout )/(qc) with
10.4 The phase diagram of community detection 169

eigenvectors of the form (0, . . . , 0, 1, −1, 0, . . . , 0) and degeneracy q − 1. The paramagnetic


fixed point is then unstable, and communities are easily detectable, if

|cin − cout | > q c . (10.37)

The stability condition (10.36) is known in the literature on spin glasses as the de Almeida-
Thouless local stability condition de Almeida and Thouless (1978), in information science as
the Kesten-Stigum bound on reconstruction on trees Kesten and Stigum (1967).

We observed empirically that for random initial conditions both the belief propagation con-
verges to the paramagnetic fixed point when cλ2 < 1. On the other hand when cλ2 > 1 then
BP converges to a fixed point with a positive overlap, so that it is possible to find a group
assignment that is correlated (often strongly) to the original assignment. We thus conclude
that if the parameters q, {na }, {cab } are known and if cλ2 > 1, it is possible to reconstruct the
original group assignment.

For the cases presented in Fig. 10.4.1 we can thus distinguish two phases:


• If |cin − cout | < q c, the graph does not contain any significant information about
the original group assignment, and community detection is impossible. Moreover, the
network generated with the block model is indistinguishable from an Erdős-Rényi random
graph of the same average degree.

• If |cin − cout | > q c, the graph contains significant information about the original group
assignment, and using BP or MCMC yields an assignment that is strongly correlated
with the original one. There is some intrinsic uncertainty about the group assignment
due to the entropy, but if the graph was generated from the block model there is no
better method for inference than the marginalization introduced by Eq. (10.7).

Fig. 10.4.1 hence illustrates a phase transition in the detectability of communities. Unless
the ratio cout /cin is far enough from 1, the groups that truly existed when the network was
generated are undetectable from the topology of the network. Moreover, unless the condition
(10.37) is satisfied the graph generated by the block model is indistinguishable from a random
graph, in the sense that typical thermodynamic properties of the two ensembles are the same.

10.4.3 First order phase transition

The situation of a continuous (2nd order) phase transition, illustrated in Fig. 10.4.1 is, however,
not the most general one. Fig. 10.4.3 illustrates the case of a discontinuous (1st order) phase
transition that occurs e.g. for q = 5, cin = 0, and cout = qc/(q − 1). In this case the condition for
stability (10.37) leads to a threshold value cℓ = (q − 1)2 . We plot again the overlap obtained
with BP, using two different initializations: the random one, and the planted/informed one
corresponding to the original assignment. In the latter case, the initial messages are

χi→j
qi = δqi s∗i , (10.38)

where s∗i is the original assignment. We also plot the corresponding BP free energies. As the
average degree c increases, we see four different phases in Fig. 10.4.3:
170 F. Krzakala and L. Zdeborová

(a) (b)
1 0.5 1
0.6
BP planted init.
0.8 0.4 0.8 BP random init.
overlap, planted init 0.4
0.6 overlap, random init 0.3

ffactorized-fBP
ffactorized-fBP 0.6

overlap
entropy

0.2 εl
0.4 0.2 0.4
cd cl
0.2 0.1 0
0.2
0.17 0.18 0.19 0.2
q=10, c=10, N=500k
0 0
cc 0
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
12 13 14 15 16 17 18 ε= cout/cin
c

Figure 10.4.3: (color online): (a) Graphs generated with q = 5, cin = 0, and N = 105 . We
compute the overlap (10.4) and the free energy with BP for different values of the average
degree c. The green crosses show the overlap of the BP fixed point resulting from using the
original group assignment as the initial condition, and the blue crosses show the overlap
resulting from random initial messages. The red stars show the difference between the
paramagnetic free energy (10.31) and the free energy resulting from the informed initialization.
We observe three important points where the behavior changes qualitatively: cd = 12.84,
cc = 13.23, and cℓ = 16. We discuss the corresponding phase transitions in the text. (b) The
case q = 10 and c = 10. We plot the overlap as a function of ϵ; it drops down abruptly from
about Q = 0.35. The inset zooms in on the critical region. We mark the stability transition
ϵℓ , and data points for N = 5 · 105 for both the random and informed initialization of BP. In
this case the data are not so clear. The overlap from random initialization becomes positive a
little before the asymptotic transition. We think this is due to strong finite size effects. From
our data for the free energy it also seems that the transitions ϵc and ϵd are very close to each
other (or maybe even equal, even though this would be surprising). These subtle effects are,
however, relevant only in a very narrow region of ϵ and are, in our opinion, not likely to appear
for real-world networks.

I. For c < cd , both initializations converge to the paramagnetic fixed point, so the graph
does not contain any significant information about the original group assignment.

II. For cd < c < cc , the planted initialization converges to a fixed point with positive overlap,
and its free entropy is smaller than the paramagnetic free entropy. In this phase there are
exponentially many basins of attraction (states) in the space of assignments that have
the proper number of edges between each pair of groups. These basins of attraction have
zero overlap with each other, so none of them yield any information about any of the
others, and there is no way to tell which one of them contains the original assignment.
The paramagnetic free entropy is still the correct total free entropy, the graphs generated
by the block model are thermodynamically indistinguishable from Erdős-Rényi random
graphs, and there is no way to find a group assignment correlated with the original one.

III. For cc < c < cℓ , the planted initialization converges to a fixed point with positive overlap,
and its free entropy is larger than the paramagnetic free entropy. There might still be
exponentially many basins of attraction in the state space with the proper number of
edges between groups, but the one corresponding to the original assignment is the
one with the largest free entropy. Therefore, if we can perform an exhaustive search of
the state space, we can infer the original group assignment. However, this would take
10.4 The phase diagram of community detection 171

exponential time, and initializing BP randomly almost always leads to the paramagnetic
fixed point. In this phase, inference is possible, but conjectured to be exponentially hard
computationally.

IV. For c > cℓ , both initializations converge to a fixed point with positive overlap, strongly
correlated with the original assignment. Thus inference is both possible and tractable,
and BP achieves it in linear time.

We saw in our experiments that for assortative communities where cin > cout , phases (II)
and (III) are extremely narrow or nonexistent. For q ≤ 4, these phases do not exist at all,
and the overlap grows continuously from zero in phase (IV), giving a continuous phase
transition as illustrated in Fig. 10.4.1. We can think of the continuous (2nd order) phase
transition as a degenerate case of the 1st order phase transition where the three discussed
thresholds cd = cc = cℓ are the same. For q ≥ 5, phases (II) and (III) exist but occur in an
extremely narrow region, as shown in Fig. 10.4.3(b). The overlap jumps discontinuously from
zero to a relatively large value, giving a discontinuous phase transition. In the disassortative
(antiferromagnetic) case where cin < cout , phases (II) and (III) are more important. For
instance, when cin = 0 and the number of groups is large, the thresholds scale as cd ≈ q log q,
cc ≈ 2q log q and cℓ = (q − 1)2 . However, in phase (III) the problem of inferring the original
assignment is hard and conjectured to be insurmountable to all polynomial algorithms in this
region.

10.4.4 The non-backtracking matrix and spectral method

Having seen that BP and also MCMC are able to attain the algorithmic threshold cl a natural
question is what other classes of algorithms are able to do so. Very natural spectral methods
such as principal component analysis, based on leading eigenvalues of the associated matrices,
are commonly used for clustering. For community detection in sparse graphs, however, most
of the commonly used spectral methods do not attain the threshold because in sparse graphs
their leading eigenvectors become localized on subgraphs that have nothing to do with the
communities (e.g. the adjacency matrix spectrum localizes on high degree nodes in the limit
of large sparse graphs).

A very generic and powerful idea to design spectral methods achieving optimal thresholds
is to realize that the way we computed the threshold cl in the first place was by linearizing
the belief propagation around it paramagnetic fixed point by introducing its perturbations
χi→j
t = nt + εi→j
t we obtained

X X ∂χi→j
εi→j
X X
t = t
k→i
εk→i
q = Ttq εk→i
q (10.39)
q
∂χq χk→i
q =nq
q
k∈∂i\j k∈∂i\j

where the matrix T was computed in (10.33). Introducing the so-called non-backtracking
matrix as a 2M but 2M matrix (M being the number of edges) with coordinates corresponding
to oriented edges
Bi→j,k→l = δi,l (1 − δj,k ) (10.40)
we can write the linearized BP as
ε = (T ⊗ B)ε . (10.41)
172 F. Krzakala and L. Zdeborová

We thus see that the linearized belief propagation corresponds to a power-iteration of a tensor
product of a small q×q matrix T and the non-backtracking matrix B. The spectrum of B indeed
provides information about the communities that is more accurate than other commonly used
spectral methods and it attains the detectability threshold in sparse SBM.

10.5 Exercises

Exercise 10.1: SBM free energy


Show that the Bethe free entropy for the stochastic block model can be written using
j→i
+ χi→j
X X
Z ij = cab (χi→j
a χb b χj→i
a )+ caa χi→j
a χa
j→i
for (i, j) ∈ E(10.42)
a<b a

ctj ti χj→i
X YX
−hti
Z i
= nti e tj (10.43)
ti j∈∂i tj

as
1 X 1 X c
ΦBP (q, {na }, {cab }) = log Z i − log Z ij + , (10.44)
N N 2
i (i,j)∈E

where c is the average degree given by (10.2).

Exercise 10.2: Parameter learning with BP


Show that in the stochastic block model maximization of the Bethe free entropy with
respect to the parameters na and cab at a BP fixed point leads to the following conditions
for stationarity that can be then used for iterative learning of the parameters na and cab .

1 X i
na = χa , (10.45)
N
i

1 1 X cab (χi→j
a χb
j→i
+ χi→j χj→i
a )
cab = b
. (10.46)
N nb na Z ij
(i,j)∈E
Chapter 11

The spin glass game from sparse to


dense graphs

He deals the cards as a meditation


And those he plays, never suspect
He doesn’t play for the money he wins
He don’t play for respect
He deals the cards to find the answer
The sacred geometry of chance
The hidden law of a probable outcome
The numbers lead a dance

Sting – Shape of my heart, 1993

11.1 The spin glass game

Consider the following problem: N people are given a red or black card, or, equivalently, a
value Si∗ = ±1. You are allowed to ask M pairs of people, randomly, to tell you if they had the
same card (without telling you which one). Can we figure out the two groups?

This is simple enough to answer formally with Bayesian statistics. Given the M answers
Jij = ±1 (1 for the same card, −1 for different ones) the posterior probability assignment is
given by
1 1 Y
Ppost (S|J) = Ppost (J|S)Pprior (S) = P (Jij = Si Sj ) (11.1)
N N 2N
ij∈G

where G is the graph where two sites are connected if you asked the question to the pairs.
Given some people are not trustworthy, let us denote the probability of lies as p. Then

1
(11.2)
Q
Ppost (S|J) = N 2N ij∈G [(1 − p)δ(Jij = Si Sj ) + pδ(Jij = −Si Sj )]

and using the change of variable p = e−β /(e−β + eβ ) = 1/(1 + e2β ) we find
174 F. Krzakala and L. Zdeborová

1 β Pij∈G Jij Si Sj
Ppost (S|J) = e (11.3)
Z
This is the spin glass problem with Hamiltonian H = − Jij Si Sj , thus the name the "Spin
P
Glass Game" for this particular inference problem.

11.2 Sparse graph

11.2.1 Belief Propagation

First, we discuss the problem on sparse graphs. In this case, as N, M → ∞, we have random
tree-like regular graph, and we can thus write the BP equations. We can use the result from
appendix 5.A, For such pair-wise models, we have one factor per edge, and the BP equations
read:
1 Y
χj→(ij)
sj = ψs(kj)→j (11.4)
Z j→(ij) j
(kj)∈∂j\(ij)
1 X
ψs(ij)→i = eβJij si sj χj→(ij)
sj (11.5)
i
Z (ij)→i sj

It is convenient to write the iteration using the parametrization

1 + sj mj→(ij)
χj→(ij)
sj = (11.6)
2
βs h(ij)→i
e i
ψs(ij)→i = (11.7)
i
2 cosh(βh(ij)→i )

so that
 
X
mj→(ij) = tanh β h(kj)→j  (11.8)
(kj)∈∂j\(ij)
(ij)→i j→(ij)
eβsi h 1 X
βJij si sj 1 + sj m
= e (11.9)
2 cosh(βh(ij)→i ) Z (ij)→i s 2
j

The second equation can equivalently be written as

X+ − X−
tanh βh(ij)→i = (11.10)
X+ + X−
X 1 + sj mj→(ij)
Xs = eβJij ssj (11.11)
s
2
j

1 1
= eβsJij (1 + mj→(ij) ) + e−βsJij (1 − mj→(ij) ) (11.12)
2 2
11.2 Sparse graph 175

Using now the relation atanhy = 1


2
1+x
log 1−x , and applying the atanh on both side, we reach
1 X+
βh(ij)→i = log (11.13)
2 X−
1 eβJij (1 + mj→(ij) ) + e−βJij (1 − mj→(ij) )
= log −βJ (11.14)
2 e ij (1 + mj→(ij) ) + eβJij (1 − mj→(ij) )

1 cosh(βJij ) + mj→(ij) ) sinh(βJij )


= log (11.15)
2 cosh(βJij ) − mj→(ij) ) sinh(βJij )
1 1 + mj→(ij) ) tanh(βJij )
= log (11.16)
2 1 − mj→(ij) ) tanh(βJij )
 
= atanh mj→(ij) ) tanh(βJij ) (11.17)

We thus finally write BP more conveniently as


 
X
mj→(ij) = tanh β h(kj)→j  (11.18)
(kj)∈∂j\(ij)
1  
h(ij)→i = atanh mj→(ij) ) tanh(βJij ) (11.19)
β
Or even better (realizing that the notation mj→(ij) can be written without ambiguity as mj→i )
 
X  
mj→i = tanh  atanh mk→j tanh(βJkj )  (11.20)
k∈∂j\i

With this choice of notation, the iteration is really practical.

11.2.2 Population dynamics

One can prove this algorithm perfectly solve the spin glass game. We can also analyze the
Bayes performances, using population dynamics. Defining
 
  X  
FBP {mk→j } = tanh  atanh mk→j tanh(βJkj )  (11.21)
k∈∂j\i

The probability density distribution of messages is given by the fixed point of


Z Yc
P(mcav ) = dcP excess (c) dmi P(mi )δ(m − FBP {m}) (11.22)
i=1

11.2.3 Phase transition

It is easy to locate the phase transition, as it turns out to be a second-order one, using the local
perturbation approach. Writing mj→i = ϵj→i S∗j we find to linear order than
   
X   X  
ϵj→i S∗j ≈ tanh  ϵk→j S∗k tanh(βJkj )  ≈  ϵk→j S∗k tanh(βJkj )  (11.23)
k∈∂j\i k∈∂j\i
176 F. Krzakala and L. Zdeborová

or equivalently for an homogeneous growth


 
X  
ϵ≈ ϵS∗k S∗j tanh(βJkj )  (11.24)
k∈∂j\i

so that on average, the magnetization will increase if

1 = cE S∗i S∗j tanh(βJij ) = c(1 − 2p) tanh(β) = c tanh(β)2 (11.25)


 

so that the transition arise at


s s
1 1
β= = (11.26)
atanh(2M/N ) atanh(c)

11.3 Dense graph limit: TAP/AMP

Now we look to the problem with DENSE graph, in fact we may assume we observe ALL
pairs, but to make the problem interesting, we take the probability of lies close to 1/2. We
write √ √ √ √
pDense = e−β/ N /(e−β/ N + eβ/ N ) = 1/(1 + e2β/ N ) (11.27)

In this limit, the problem becomes


1 √β Pi<j Jij Si Sj
Ppost (S|J) = e N (11.28)
Z

Of course, we can write the same BP algorithm


 
X  √ 
mj→i = tanh  atanh mk→j tanh(β/ N Jkj )  (11.29)
k∈∂j\i

and the transition arise at β = 1 since the criterion becomes


2N 2 /2
(11.30)
p
1= tanh(β/ (N ))2 ≈ β 2
N

11.4 Approximate Message Passing

This is a very nice limit to study the problem, however there is a little annoying fact: we have
to update N 2 messages! This is way too much!!!! The trick is now to Taylor expand BP. We
start by the BP iteration, which reads at first order:
   
 √  β  
mj→i atanh mk→j mtk→j Jkj 
X X
t+1 = tanh
 t tanh(β/ N Jkj  ≈ tanh  √
k∈∂j\i
N k∈∂j\i
(11.31)
11.4 Approximate Message Passing 177

At this point, we realize that we can close the equation on the full marginal defined as

!
β X  k→j 
mjt+1 = tanh √ mt Jkj (11.32)
N k

Indeed
!
β X  k→j  β  i→j 
mj→i
t+1 = tanh √ mt Jkj − √ mt Jij (11.33)
N k N
! !
β X  k→j  β  i→j  β X  k→j 
≈ tanh √ mt Jkj −√ mt Jij tanh′ √ mt Jkj
N k N N k
(11.34)
β   2
≈ mjt+1 − √ mi→j
t Jij (1 − mjt+1 ) (11.35)
N
β 2
≈ mjt+1 − √ mit Jij (1 − mjt+1 ) (11.36)

N
(11.37)

where we keep the first correction in N . Finally, combining the two following equations:

β  j  2
mtk→j = mkt − √ mt−1 Jjk (1 − mkt ) (11.38)
N
!
j β X  k→j 
mt+1 = tanh √ mt Jkj (11.39)
N k

we reach
 !
β X β  j 
k2
mjt+1 = tanh √ k
Jkj mt − √ mt−1 Jjk (1 − mt ) (11.40)
N k N
 !
β X β   2
mjt+1 = tanh √ Jkj mkt − √ mjt−1 (1 − mkt ) (11.41)
N k N
X mk 2
!!
β X
mjt+1 = tanh √ k 2 j
Jjk mt − β mt−1 1 − t
(11.42)
N N
k k

This can be conveniently written as

1
ht = √ Jmt − βmt−1 (1 − mt 2 ) (11.43)
N
mt+1 = tanh βht (11.44)

This is the TAP, or AMP algorithm. The second term in the first equation is called the Onsager
term, and it makes a subtle difference with the naive mean field approx!
178 F. Krzakala and L. Zdeborová

In full generality we can write

1
ht = √ Jmt − mt−1 ∂h η(βht−1 ) (11.45)
N
mt+1 = η(βht ) (11.46)

Note how convenient and easy is it to write this algorithm! In fact in this form, this is known
as the AMP algorithm!

11.4.1 State Evolution

Let us come back on the iteration


 
 √ 
mj→i atanh mtk→j tanh(β/ N Jkj ) 
X
t+1 = tanh
 (11.47)
k∈∂j\i

Instead of focusing on the population dynamics on m, it seems like a good idea to decompose
the iteration as

mj→i
t+1 = tanh hj→i
t (11.48)
 
 √ 
hj→i atanh mk→j
X
t =  t tanh(β/ N Jkj )  (11.49)
k∈∂j\i

With this notation, the distribution of hj→i t is a sum of uncorrelated terms on a tree, so we
expect that for large tree, in the limit of large connectivities, we could simply things a bit. Let
us first rewrite in the large connectivity limits:

mj→i
t+1 = tanh hj→i
t (11.50)
β X  k→j 
hj→i
t = √ mt Jkj ) (11.51)
N k

we have
t
hj→i
t ∼ N (h , ∆ht ) (11.52)
With this, we can close the equations as follows: the mean and variance of the distribution of
h is m and q, then

β X
hj→i
t Si∗ = Si∗ √ δ(J = OK)S∗i Sk∗ mk→j
t − δ(J = lies)S∗i Sk∗ mk→j
t ) (11.53)
N k
β X
= √ δ(J = OK)Sk∗ mk→j
t − δ(J = lies)Sk∗ mk→j
t ) (11.54)
N k
11.5 From the spin glass game to rank-one factorization 179

Let us denote the mean and second moments of the distribution of the m in the direction of
the hidden states as

q t = E[(mi→j Si∗ )2 ] (11.55)


m = t
E[mi→j Si∗ ] (11.56)

then we have
t β β √
h = N √ (1 − 2p)mt = N √ tanh(β/ N )mt = β 2 mt (11.57)
N N
and
∆ t = β 2 qt (11.58)
so that we can write

√ 2
q t = Ez tanh β 2 mt + zβ qt (11.59)
√ 
mt = Ez tanh β 2 mt + zβ qt (11.60)

Which is the same as what we could have had with the replica method.

11.5 From the spin glass game to rank-one factorization

It turns out what we wrote here is entirely general in the large connectivity limit. Assuming
the following problem:
1 ∗ ∗ T √
r
Y = x (x ) + ∆W (11.61)
N
where W is a random matrix, and x∗ is sampled from a distribution P0 , it turns out that if we
wanted to solve the problem, we could simply write the AMP algorithm as

1
ht = √ Jmt − mt−1 βη ′ (βht−1 ) (11.62)
N
mt+1 = η(βht ) (11.63)

where this time we should actually use a different denoiser:


2
dxPX (x)xe−x A/2+xB
R
ηt (h) = η(A = βm2t , B = βh) =: R (11.64)
dxPX (x)e−x2 A/2 +xB

Using this algorithm, we, again, can get the state evolution as

mt = Ez,x∗ [x∗ η βmt , βmt x∗ + z ] (11.65)



Chapter 12

Approximate Message Passing and


State Evolution

I hope that someone gets my message in a bottle.

Sting, the Police – 1979

12.1 AMP And SE

Let us follow Bolthausen Bolthausen (2009) and Bolthausen (2014) and consider the iteration
using symmetric matrices.

12.1.1 Symmetric AMP

Consider the symmetric AMP iteration with the same prescriptions as in i.e. assumptions on
the matrices, functions and inputs.

xt+1 = Amt − bt mt−1 (12.1)


t
m = ft (x )t
(12.2)

with initialization at x0 and


bt = E divft (Zt ) (12.3)
 

where Zt ∼ N(0, κt,t In ). We claim that in the iterations, all xs are behaving as Gaussian
variables, with xt ∼ N (0, κt ), and X ∼ N (0, κt,s ). Additionally, we also have

1 s−1 t−1
κs,t = m m =: qs−1,t−1 (12.4)
N
This is precisely what would happen WITHOUT the Onsager term IF the matrix A would be a
new one at each iteration time. The Onsager correction makes it work in such a way that it
remains true when A remains the same over iteration.
182 F. Krzakala and L. Zdeborová

12.1.2 Conditioning on linear observations

Define the σ-algebra St = σ(x1 , x2 , ..., xt ). We then have :

xt+1 |St = A|St mt − bt mt−1 (12.5)

because mt , mt−1 are St -measurable. Ok.


A simple recursion shows that conditioning on St is equivalent to conditioning on the gaussian
space generated by Am0 , Am1 , ..., Amt−1 . This is a subspace of the Gaussian space gener-
ated by the entries of A. Conditioning a Gaussian space on its subspace amounts to doing
orthogonal projections, which gives, for GOE(n) :

A|St = E [A|St ] + Pt (A) (12.6)


=A− P⊥ ⊥
Mt−1 APMt−1 + P⊥ ⊥
Mt−1 ÃPMt−1 (12.7)

where Mt−1 = m0 |...|mt−1 and à is an independent copy of A. Using this on symmetric


 

AMP, we get :
 
xt+1 |St = A − P⊥ ⊥ ⊥ ⊥ t
Mt−1 APMt−1 + PMt−1 ÃPMt−1 m − bt m
t−1
(12.8)
= A − Id − PMt−1 A Id − PMt−1 m + P⊥ ⊥
(12.9)
  t t t−1
Mt−1 ÃPMt−1 m − bt m
 
= APMt−1 + PMt−1 APTMt−1 mt + P⊥ ⊥ t
Mt−1 ÃPMt−1 m − bt m
t−1
(12.10)
= APMt−1 mt + P⊥ ⊥ t t
Mt−1 ÃPMt−1 m + PMt−1 Am⊥ − bt m
t−1
(12.11)

assuming Mt−1 has full rank, defining (unique way) αt as the coefficients of the projection of
mt onto the columns of Mt−1 , we have :

xt+1 |St = AMt−1 αt + P⊥ ⊥ t t


Mt−1 ÃPMt−1 m + PMt−1 Am⊥ − bt m
t−1
(12.12)

using the definition of the


 symmetric AMP iteration, we have AMt−1 = Xt−1 + [0|Mt−2 ] Bt
where Xt−1 = x1 |...|xt and Bt is the diagonal matrix of Onsager terms. Then:


xt+1 |St = (Xt−1 + [0|Mt−2 ] Bt ) αt + P⊥ ⊥ t t


Mt−1 ÃPMt−1 m + PMt−1 Am⊥ − bt m
t−1
(12.13)
= Xt−1 αt + P⊥ ⊥ t
Mt−1 ÃPMt−1 m + [0|Mt−2 ] Bt αt + PMt−1 Amt⊥ t−1
− bt m (12.14)
| {z } | {z }
P art 1 P art 2

It is clear that Part 1 in the above expression is a combination of previous terms with an
additional Gaussian one (product of independent GOE with frozen Ms). Part 2 cancels out
in high-dimensional limit, isometry+Stein’s lemma, AKA Onsager magic. Then recursion
intuitively gives Gaussian equivalent model at each and across iterations.

Let us show how to cancel part 2! We shall focus on the term

A= PMt−1 Amt⊥ (12.15)


= Mt−1 (MTt−1 Mt−1 )−1 MTt−1 Amt⊥ (12.16)
= Mt−1 (MTt−1 Mt−1 )−1 (AMt−1 )T mt⊥ (12.17)
12.1 AMP And SE 183

But (AMt−1 )T = (Xt−1 − [0|Mt−2 ] Bt )T so that

(AMt−1 )T mt⊥ = (XTt−1 − BTt [0|Mt−2 ]T )mt⊥ = XTt−1 mt⊥ (12.18)

so that

A= Mt−1 (MTt−1 Mt−1 )−1 XTt−1 mt⊥ (12.19)


= Mt−1 (MTt−1 Mt−1 )−1 XTt−1 (mt − mt∥ ) (12.20)
= Mt−1 (MTt−1 Mt−1 )−1 XTt−1 (f t (xt ) − Mt−1 αt ) (12.21)

Here we need to be a bit carefful with the scaling of these terms. Remember than this should
be a vector with order one compotent. To underline this, we can write:
 
−1
 1 T
A= T
Mt−1 (Mt−1 Mt−1 ) N t t t−1 t
X (f (x ) − M α ) (12.22)
N t−1

Now we see that we have a matrix with O(1) element that multiply vectors with O(1) elements,
and when we multiply them we have a again an O(1) quantities (since we sum over t values,
and t is finite). Let us focus on the two terms in the second parenthesis:

1 T
B= X (f t (xt ) − Mt−1 αt ) = C + D (12.23)
N t−1
These terms can be simplifies using Stein lemma. Indeed

1 T t t
C= X f (x ) (12.24)
N t−1

1 P (1) (t)

i xi f (xi )
 N1 P (2) (t) 
=
N
 i xi f (xi )  (12.25)
 ... 
1 P (t) (t)
N i xi f (xi )
 1
E[z f (z (t) )]

E[z 2 f (z (t) )]
=N →∞   (12.26)
 ... 
E[z t f (z (t) )]
(12.27)

where we have used the recurence hypothesis. Now we can use Stein lemma1 and we find

κ1,t E[f ′ (z (t) )]


   
κ1,t
κ2,t E[f ′ (z (t) )] κ2,t 
C=  = bt 
...
 (12.28)
 ... 
′ (t)
κt,t E[f (z )] κt,t

But we also have that


1 s−1 t−1
κs,t = m m (12.29)
N
1
Stein’s lemma: Suppose X is a normally distributed random variable with and variance κ. Then
E g(X)(X) = κE g ′ (X) .. In general, suppose X and Y are jointly normally distributed. Then
 

Cov(g(X), Y ) = Cov(X, Y )E(g ′ (X)).


184 F. Krzakala and L. Zdeborová

and therefore
(m0 )T mt−1
 
1  (m1 )T mt−1  = 1 bt MTt−1 mt−1
C= bt  (12.30)
N  ...  N
t−1
(m ) m T t−1

We can deal in a similar way with D and we find


1 T
D= M [0|Mt−2 ] Bt αt (12.31)
N t−1
so that finally

 
1 T
Mt−1 (MTt−1 Mt−1 )−1 N t t t−1 t
(12.32)

A= X (f (x ) − M α )
N t−1
Mt−1 (MTt−1 Mt−1 )−1 MTt−1 bt mt−1 − [0|Mt−2 ] Bt αt (12.33)

=
PMt−1 bt mt−1 − [0|Mt−2 ] Bt αt (12.34)

=
This is precisely the part needed to cancel part 2! At this point, we have proven the claim.

Additionally, we can now write the evolution of the κt , or equivalently, of the qt . This is called
state evolution!
1 h √ i h √ i
qt = mt mt = E (ft ( κt Z))2 = E (ft ( qt−1 Z))2 (12.35)
N

What is interesting here is the generality of the statement. In fact, f not even need to be
separable or even well defined for it to be true Berthier et al. (2020). This allows to use such
iteration with neural networks ?Gerbelot and Berthier (2021) and black-box functions in signal
processing ?.

12.1.3 Convergence of AMP

A crucial question at this point can be asked about whether or not xt (or mt ) does converge.
This can be studied by looking at
1 2
mt+1 − mt 2
= q t+1 + q t − 2q t,t+1 (12.36)
N

If we assume that we have a convergence in q (but not necessary in x!!) such that we are on
the orbit of AMP are q t+1 = q t = q ∗ , then it amonts to wther or not q t,t+1 is convergeing to q ∗ .
We can write
1
C t+1 = qt+1,t = mt+1 mt = E ft (X t+1 ) ft (X t ) (12.37)
  
N
Using our Gaussian equivalance, we can now replcae the two Xs by correlated gaussians! We
have:

Xt = κt − κt,s Z ′ + κt,s Z (12.38)
p
′′ √
s
(12.39)
p
X = κs − κt,s Z + κt,s Z
12.1 AMP And SE 185

so that at the fixed point when κs = κt = q ∗ we have


√ √
C t+1 = E ft ( κt+1 − κt+1,t Z ′ + κt+1,t Z) ft ( κt − κt+1,t Z ′′ + κt+1,t Z) (12.40)
 p  p 
h p √  p √ i
= E ft ( q ∗ − C t Z ′ + C t Z) ft ( q ∗ − C t Z ′′ + C t Z) (12.41)
= g(C t ) (12.42)

The question is thus is this fixed-point iteration of g, defined for 0 to q ∗ is converging. Accorid-
ing to the fixed point theorem, this will be the case if the map is contrative, that is if for any x
and y, we have
∥g(x) − g(y)∥ ≤ |x − y| (12.43)
In other words, we need g to be 1-Lipschitz. One can show that the first two derivative of g
are positive (takes the derivative and use Stein lemma!), so that the derivative is maximum at
q ∗ ! We thus obtain, after a quick computation, the criterion:
 2 

(12.44)
p

E ft ( q Z) ≤1

12.1.4 A first application: the TAP equation!

Consider how the function


ft (x) = tanh(βx + h) (12.45)
In this case, we see that AMP simplifies as follows:

bt = β(1 − E[tanh(βxt + h)2 ]) = β(1 − q t ) (12.46)

and

xt+1 = Amt − β(1 − q t )mt−1 (12.47)


m t+1
= tanh(βx t+1
+ βh) (12.48)

so that
mt+1 = tanh(β Amt − β(1 − q t )mt−1 + βh) (12.49)


There are the TAP equations Thouless et al. (1977) for the SK model, and q obbey the parisi
replica symmetric equation:
h √ i
qt = E (tanh(β qt−1 Z + βh))2 (12.50)

Additionally, we also can have convergence when


 2 

(12.51)
p

E tanh (β q Z + βh) ≤1

which is precisely the Almeida-Thouless criterion. This was the main result of the original
paper by Bolthausen (2014).
186 F. Krzakala and L. Zdeborová

12.1.5 Parisi equation reloaded: Approximate Survey propagation

Since the Parisi RS’s equation of the SK model seems to appear, we may wonder, are the RSB
equation a state evolution of something? The answer is yes, as shown in ?. Define in full
generality the function
h  p m  p i
EV cosh βx + β q1t − q0t V + βh tanh βx + β q1t − q0t V + βh
ft (x) = h  p m i (12.52)
EV cosh βx + β q1t − q0t V + βh
 h  m  i 
EV cosh βx + β q1t − q0t V + βh tanh2 βx + β q1t − q0t V + βh
p p
q1t+1 = EU  h  p m i 
EV cosh βx + β q1t − q0t V + βh
(12.53)
 h  p m  p i 2 
t t t t
 EV cosh βx + β q1 − q0 V + βh tanh βx + β q1 − q0 V + βh  
q0t+1 = EU  h  p m i 
EV cosh βx + β q1t − q0t V + βh
(12.54)

then the AMP iteration follows has a state evolution that follows Parisi’s 1RSB equations, in
particular, the norm of ft is q0 !

We thus see that the Parisi equation can be interpreted as a way to follow some iterative
algorithm in time.

12.2 Applications

12.2.1 Wigner Spike Model

Consider the following recursion:


√ !
λ
xt+1 = A+ x∗ xT∗ mt − bt mt−1 = Y mt − bt mt−1 (12.55)
N
mt = ft (xt ) (12.56)

How do we deal with this? Let us make a change of variable! We write simply

xt+1 = Amt − bt mt−1 + λx∗ q0t (12.57)

with

x̃t+1 = Amt − bt mt−1 (12.58)


1
q0t = xT∗ mt (12.59)
N
12.2 Applications 187

with this change of variable, we can write, defining f˜t (x) = ft (x + x0 q0t ) and
x̃t+1 = Amt − bt mt−1 (12.60)
mt = f˜t (x̃t ) (12.61)
so that we can use state evolution for the tilde variable! This leads again to
 2 
1 t t ˜ √
qt = m m = E ft ( qt−1 Z) (12.62)
N

 2 

= E ft ( qt−1 Z + λq0t−1 X ∗ ) (12.63)

and from the definition of q0t , we have the jointed state evolution:
√ t−1 ∗ 2
 

t
q = E ft ( qt−1 Z + λq0 X ) (12.64)
h √ √  i
q0t = E ft ( qt−1 Z + λq0t−1 X ∗ ) X∗ (12.65)

This is very nice! Let’s look a concrete example: the Rademacher spike model!
 q 
Say we are given Y = √1N W + Nλ x∗ xT∗ , with xi∗ = ±1. Can we recover the vector ? We
can use the AMP algorithm for this,
xt+1 = Y mt − bt mt−1 (12.66)
t t
m = ft (x ) (12.67)

using the binary denoiser, that is using ft (x) = tanh βx. This is a particular case of the TAP
equation (this is often called the planted SK model). In this case we have the state evolution:


 2 

t
q =E tanh(β qt−1 Z + β λq0t−1 X ∗ ) (12.68)
h √ √  i
q0t = E tanh(β qt−1 Z + β λq0t−1 X ∗ ) X∗ (12.69)

These are the one of the SK model with a ferromagnetic bias (λ is often called J0 in this
context).

12.2.2 A primer on Bayesian denoising

Why did we use the hyperbolic tangent? This has to do with √ Bayesian
p methods in statistics.
After all, we know that we are given X, and that X = q0t−1 λX∗ + q t−1 Z. Given we know
X is just X0 up to a Gaussian noise, what is the best we can do to estimates it?

It is a general rule that the "best" estimates in Bayesian method is the mean of the posterior, in
terms of MMSE. This is easily proven: assume that we have an observable Y which is given by
a polluted version of X∗ . Then for all estimators we have:
h i h h ii
MMSE = EX∗ ,Y (X̂(Y ) − X∗ )2 = EY EX ∗ |Y (X̂(Y ) − X∗ )2 (12.70)
188 F. Krzakala and L. Zdeborová

Let us minimize it! COnditioned on Y, X̂ is just a number so we can derive with respect to it
and find that for each Y , we should choose
Z
ˆ(X) = E ∗ X ∗ = P (X|Y )X (12.71)
X |Y

This is the Bayesian √optimal!pIn our problem, this is easyly done: Given we are given an
observable X = q0 t−1
λX∗ + q t−1 Z, we see that
√ t+1 ∗ 2
(X− λq0 X )
∗ X|X∗ −
P (X |X) ∝ P prior
(X)P (X) ∝ P prior
(X)e 2q t−1 (12.72)
√ t+1 2
(X− λq0 x)

dxxP prior (x)e
R
2q t−1

E[X |X] = √ t+1 (12.73)
(X− λq0 x)2
R −
dxP prior (x)e 2q t−1

This is called the Optimal Bayesian Gaussian Denoiser.

Another simplification can be done, using the so called Nishimori property:

Theorem 17 (Nishimori Identity). Let X (1) , . . . , X (k) be k i.i.d. samples (given Y ) from the
distribution P (X = · | Y ). Denoting ⟨·⟩ the "Boltzmann" expectation, that is the average with respect
to the P (X = · | Y ), and E [·] the "Disorder" expectation, that is with respect to (X ∗ , Y ). Then for all
continuous bounded function f we can switch one of the copies for X ∗ :
hD  E i D  E 
E f Y, X (1) , . . . , X (k−1) , X (k) = E f Y, X (1) , . . . , X (k−1) , X ∗ (12.74)
k k−1

Proof. The proof is a consequence of Bayes theorem and of the fact that both x∗ and any of
the copy X (k) are distributed from the posterior distribution. Denoting more explicitly the
Boltzmann average over k copies for any function g as

D E k
Z Y
(1) (k)
g(X , . . . , X ) =: dxi P (xi |Y )g(X (1) , . . . , X (k) ) (12.75)
k
i=1

we have, starting from the right hand side


D  E 
(1) (k−1) ∗
EY,X ∗ f Y, X , . . . , X ,X
k−1
Z D  E
= dx∗ dyP (x∗ |Y )P (Y ) f Y, X (1) , . . . , X (k−1) , X ∗
k−1
Z D  E
= EY dxk P (xk |y) f Y, X (1) , . . . , X (k−1) , X k
k−1
hD  E i
(1) (k−1) (k)
= EY f Y, X , . . . , X ,X
k

We shall drop the subset "k" from Boltzmann averages from now on.
12.2 Applications 189

√ √
Using a binary prior leads indeed to f ( t) = tanh λX (indeed we have only a factor ex λq/qx

As far as we know, this is the best algorithm there is in polynomial times!

Sparse? : HARD PHASE: No algo known!

12.2.3 Wishart Spike Model

Given two vectors u∗ ∈ Rn and v∗ ∈ Rm we are given the n × m matrix (with α = m/n):
r
λ
W = u∗ v∗T + ξ (12.76)
n

The AMP algorithm ? reads (conveniently removing the fact that variances can be pre-
computed so that the functions f is defined at each time t):

1 ′
ut+1 = √ W gvt (vt ) − α < gvt (v) > gut−1 (ut−1 ) (12.77)
n
1 ′
vt+1 = √ W gut (ut )− < gut (ut ) > gvt−1 (vt−1 ) (12.78)
n

Conveniently, one can recast these equations into a new form, that is amenable to a rigorous
analysis. Define the n + m × 2 matrix X
 
u 0
x= (12.79)
0 v

and the interaction matrix


r r 
u∗ uT∗ u∗ v∗T
  
λ λ ξup ξ
Y = T
x∗ x∗ + Z = + (12.80)
n n (u∗ v∗T )T v∗ v∗T ξ T ξlow

Now we define f t : Rm+n×2 → Rm+n×2 defined as each time t by the following rule at each
lines (with matrix-like notations):

for i = 1 . . . n fi = [0 gut (x[i, 1])] (12.81)


for i = n + 1 . . . m + n fi = [gvt (x[i, 2]) 0] (12.82)
(12.83)

Then the following AMP recursion is equivalent to eq.(12.77,12.78):

Y 1 ′
xt+1 = √ f t (xt ) − < f t (xt ) > f t−1 (X t−1 ) (12.84)
n n
Y
= √ f t (xt ) − bt f t−1 (xt−1 ) (12.85)
n

where we have put the Onsager terms into the moniker bt . We thus obtain the state evolution
equation for eq.(12.77,12.78) directly from the Wigner ones:
190 F. Krzakala and L. Zdeborová

"  #
q √ 0 t−1 ∗ 2
qut =E gut ( t−1
αqv Z + α λqv V ) (12.86)

√ 0 t−1 ∗
 q  
t
qu0 t t−1
= E gu ( αqv Z + α λqv )V U∗ (12.87)
" q  #
√ 0 t−1 ∗ 2
t t t−1
qv = E gv ( qu Z + λqu U ) (12.88)

t
h p √ t−1
 i
qv0 = E gvt ( qu t−1 Z + λqu0 U ∗ ) V∗ (12.89)

Again, they are many application (inclyding denoising spike models, analysing Mixture of
Gaussians, Hopfield models, etc....)

if r > 4 + 2 α, there is a gap between the threshold for informationtheoretically optimal
performance and the threshold at which known algorithms succeed

12.3 From AMP to proofs of replica prediction: The Bayes Case

Let us start from the Bayes case! Suppose we are given as data a n × n symmetric matrix Y
created as follows: r
λ ∗ ∗⊺
Y= x x} + ξ
N | {z |{z}
N ×N rank-one matrix symmetric iid noise

i.i.d. i.i.d.
where x∗ ∈ RN with x∗i ∼ PX (x), ξij = ξji ∼ N (0, 1) for i ≤ j.

This is called the Wigner spike model in statistics. The name "Wigner" refer to the fact that Y
is a Wigner matrix (a symmetric random matrix with component sampled randomly from a
Gaussian distribution) plus a "spike", that is a rank one matrix x∗ x∗⊺ .

We shall now make a mapping to a Statistical Physics formulation. Consider the spike-Wigner
model, using Bayes rule we write:
" #  q 2 
1 λ
P (Y | x) P (x) Y Y 1 − y ij − x x
N i j
P (x | Y) = ∝ PX (xi )  √ e 2 
P (Y) 2π
i i≤j
" #  " r #
Y X λ λ
∝ PX (xi ) exp  − x2 x2 + yij xi xj 
2N i j N
i i≤j
" #  " r #
1 Y X λ 2 2 λ
⇒ P (x | Y) = PX (xi ) exp  − xi xj + yij xi xj 
Z(Y) 2N N
i i≤j
 
x̂MSE,1 (Y)
..
Z
⇒ x̂MSE (Y) =  . , x̂ (Y) = ⟨x ⟩ = dx P (x | Y) xi
 
 MSE,i i Y
x̂MSE,N (Y)
12.3 From AMP to proofs of replica prediction: The Bayes Case 191

Interestingly, we see that the partition function Z(Y ) is given by the ratio between

P (Y )
Z(Y ) = P 2 √ N2 (12.90)
e− ij yij /2 / 2π

This is the so-called likelihood ratio betweem the probablity that our model has been generated
randomly, and the one that it has been generated by accident from a pure random problem.

It is convenient to use the notation of information theory, and to compute instead the mutual
information. It is defined as

P (X, Y )
I(X, Y ) = EX,Y log = H(X) − H(X|Y ) = H(Y ) − H(Y |X) (12.91)
P (X), P (Y )
In our model, the relation between the free energy and the mutual information is trivial to
compute, one simply finds:
I(X, Y ) λ(E[X 2 ])2
=f+ (12.92)
N 4
Why work with mutual information since it is just the free energy ? It has actually nice
properties that we now list:

• The mutual information is monotonic in λ from 0 to H(X) (the first is trivial from the
definition, since then the joint distribution factorize, while the second comes from the
fact that if we recover X perfectly from Y , then H(Y |X) = 0).

• Call the matrix M = X ∗ X, it is easy to show using Bayesian technics that the best
possible error in reconstructing M from Y is given by the derivative of the mutual
information. This is the I-MMSE theorem:
1 ∂I(λ)
M − MMSE = (12.93)
4 ∂λ

Additionally, we see that√ the model looks like a SK problem, where the role of the inverse
temperature is played by λ This is often called the Nishimori condition in the litterature. It
is a very peculiar condition, in fact we can show that q = m if we impose this!

Using the replica method, one can work hard and derive the free energy! It reads (here q = m)

λm2
Z
2 λm ∗ ∗

f (m) ≈ − Ex∗ ,z log dx0 P (x0 )e−x0 2 +λx0 x0 m+x0 x0 λmz (12.94)
4

Notice that we also find an interesting identity : dfrs /dλ = 4m2 /4, so that the replica prediction
for the mutual information is just

frs + λρ2 /4 (12.95)

This is interesting, can we proove it ?????

Let us notice that AMP cannot be better than the MMSE, so let us use the following estimator:

AMP − M = mi mj (12.96)
192 F. Krzakala and L. Zdeborová

What is the error of this estimator? This is easily computed

1 X
AMP − MSE = (ft (Xi )ft (Xj ) − Xi∗ Xj∗ ) (12.97)
N2
i

In expectation, this is simply


AMP − MSE = ρ2 − m2 (12.98)

For sure, this error is larger, or equal to the Bayes one, so

AMP − MSE(λ) > MMSE(λ) (12.99)


so
1 1
Z Z
dλ AMP − MSE(λ) > MMSE(λ) = I(∞) − I(0) = H(X) (12.100)
4 4

Now, it turns out that the derivative of the replica symmetric free energy is just AMP-MSE!
Indeed the derivative of the replica mutual information is just

1 1
− m2 + ρ2 + ∂λ T (12.101)
4 4
but the derivative of the free energy gives the correct term and we reach indeed the MSE at
the end! Imagine we can integrate without discontinuity (that is, m is continuous) then:

1 1
Z Z
Ireplica (∞) − Ireplica (0) = dλ AMP − MSE(λ) > MMSE(λ) = I(∞) − I(0) = H(X)
4 4
(12.102)

but (defining E = ρ − m) and Σ = 1/λm we find

λ(ρ2 + m2 )
Z
2 λm +λx x∗ m+x x∗
√ λE 2
Ireplica (λ) = − Ex∗ ,z log dx0 P (x0 )e−x0 2 0 0 0 0 λmz
= I(X; X + ΣZ) +
4 4
(12.103)

But clearly, we see tghat I(0) = 0 and I(∞) = H(x) so we reach

1 1
Z Z
H(X) = dλ AMP − MSE(λ) > MMSE(λ) = H(X) (12.104)
4 4

So we must have AMP reaching the MMSE almost everywhere and the free energy to be
correct!

12.4 From AMP to proofs of replica prediction: The MAP case

Consider again our favorite problem, aka the spike model!


 q 
Say we are given Y = √N W + N x∗ x∗ , and we do not know anything about X∗ , except
1 λ T

that it is well normalized (on the sphere). A typical thing to do would be to use maximum
12.4 From AMP to proofs of replica prediction: The MAP case 193


likelihood. We want to minimize ∥Y − λxxT ∥22 constainted to X being on the sphere. This
can be done using the following cost fucntion
X X
L=− xi Yij xj + µ( x2i − N )2 (12.105)
i,j i

where we use µ as a Lagrange parameter to fix the norm. This problem is solved when the
following condition is satisfied: X
Yij xj = µxi (12.106)
j

or equivalently when
Y x̂ = µx̂ (12.107)
that is, of course, when x is an eigenvector. Them, the loss is minimized using the largest
possible eigevnvalue.

Let us see how we can use AMP for this! We simply use the function that will preserve the
norm and write
√ !
λ
xt+1 = A + x∗ xT∗ mt − bt mt−1 = Y mt − bt mt−1 (12.108)
N
mt = ft (xt ) (12.109)

with a non separable function written as



N
t
ft (x ) = x t
(12.110)
∥xt ∥2
clearly, we can compute the Onsager term

N
bt = (12.111)
∥xt ∥ 2

Imagine that we iterate AMP and find a fixed point. Then what is this fixed point? Well, we
have
√ √ √
N N N
x=Y x t − x (12.112)
∥x ∥2 ∥x∥2 ∥x∥2
  √
N N
x 1+ 2 =Y x (12.113)
∥x∥2 ∥x∥2
√ !
∥x∥2 N
m √ + =Ym (12.114)
N ∥x∥ 2

Clearly, this is the same fixed point as the one we are looking for! In fact, we even are given
the value of the Lagrange parameter! So it seems that, if it converges, AMP can solved the
problem for us! √ Now we can analyse the solution that AMP gives using state evolution! We
have x = Z + q0 λx∗ , so clearly |x|22 = N (1 + q02 λ). Additionally we have
√ ! √
1 Z + q0 λx0 λ
q0 = E p
2
x 0 = q0 p (12.115)
N 1 + q0 λ 1 + q02 λ
194 F. Krzakala and L. Zdeborová

From this, we can deduce that

q0 = 0 when λ < 1 (12.116)


r
1
q0 = 1 − when λ > 1 (12.117)
λ

This is the BBP transition! Additionally, we can even compute the value of the eigenvalue!
INdeed |x|22 = N (1 + q02 λ) become |x|22 = N λ so that


 
1
λmax = λ+ √ (12.118)
λ

This is given a generic strategy for prooving replica prediction: find an AMP that matches
the fixed point of the minimum, prove that it converges, then simply study its state evolution
prediction!

12.5 Rectangular AMP and GAMP

This is from Bayati and Montanari (2011); Rangan (2011), but the present discussion uses
Berthier et al. (2020); Gerbelot and Berthier (2021).

There are many problem defined for rectangular non symmetric matrices A ∈ Rm×n , and for
such, it will be interesting to look at the following iteration:

ut+1 = AT gt (vt ) − dt ef (ut ) (12.119)


t t
v = Aet (u ) − bt gt−1 (v t−1
) (12.120)

with, assuming E[A2ij ] = 1/n and α = m/n,

α 1
dt = divgt (vt ), bt = divgt (ut ) (12.121)
m n
It would be great to have a state evolution for those equations, so that u and v can be treated as
Gaussian at all times! This can be done by mapping (reducing) this recursion to the original
symmetric one! Define:
r    
1 B A 0
As = and x = 0
(12.122)
1+α AT C u0

and

r

   
gt (x1 , . . . , xm ) 1+α 0
f2t+1 = 1+α , f2t = . (12.123)
0 α et (xm+1 , . . . , xm+n )

and then we see, again, that xt follows the regular AMP, equation. Making the change of
variable, we see that both u and v are Gaussian variables, and we can compute their variances
etc using state evolution,
12.6 Learning problems: the LASSO 195

A very important application of this rectangular AMP is the Approximate Message Passing
for generalized linear problem, called Generalized AMP (GAMP in short), written slightly
differently, as

ut+1 = AT gtout (ω t ) + Σ−1 in t


t ft (u ) (12.124)
ω = t
Aftin (ut ) − Vt g out
(ω t−1
) (12.125)

with Vt and −Σ−1


t playing the role of the Onsager term, and where the state evolution reads
so that

qut Z , ω t = qωt Z ′ (12.126)


p p
ut =
(12.127)
t+1 out
p 2 t in
p 2
qu t t
= αE[(gt ( qω Z)) ] , qω = E[ft ( qu Z)) ]

In practice, however, since we are interested in cases where there is a hidden x0 and a set of
measurement given by a function of z = Ax0 , we need to be slightly more generic (in the
same as in the wigner spike model). In fact, we want to use function g tht depends potentially
on z (through y and function f tgat correlates with x0 . In full generality GAMP reads as

ut+1 = AT gtout (ω t , Y) + Σ−1 in t


t ft (u ) (12.128)
t
ω = Aftin (ut ) − Vt g out
(ω t−1
, Y) (12.129)

and this add a new direction for u (that can get correlated with x0 ) and for ω (that can get
correlated with z0 ). In this case, the state evolution reads

qut+1 = αE[(gtout (ω, z0 , ϕ(z0 ))2 ] (12.130)


(12.131)
p
qωt = E[ftin ( qut Z + mtu x0 )2 ]
mt+1
u = αE[∂z0 (gtout (ω, ϕ(z0 ))] (12.132)
(12.133)
p
mtω = E[ftin ( qut Z + mtu x0 )x0 ]

where the equation on mt+1


u comes from applying the Stein lemma to

mt+1
u = αE[z0 (gtout (ω, ϕ(z0 ))] = αE[∂z0 (gtout (ω, ϕ(z0 ))] (12.134)

ρ mt
 
and where z, ω have a joint Gaussian distribution with covariance
mt q t

12.6 Learning problems: the LASSO

12.6.1 Gradient Descent

We start with a initial guess x0 and we iterate to tend to the minimum of the function. At each
iteration, the next approximation of x is given by:

xt+1 = xt − ηf ′ (xt )

To find the next value xt+1 , the function is approximated around the current point using a
quadratic approximation and xt+1 is given by the minimum of the quadratic approximation.
196 F. Krzakala and L. Zdeborová

By the Taylor’s theorem, we have:

(x − xt )2 ′′ t
f (x) ≈ f (xt ) + (x − xt )f ′ (xt ) + f (x )
2

By replacing f ′′ (xt ) by η −1 , that represents the learning rate, we have:

(x − xt )2
f (x) ≈ f (xt ) + (x − xt )f ′ (xt ) +

The minimum of this function is given by:

x∗ − xt
f ′ (xt ) + =0
η

Def
⇒ xt+1 = x∗ = xt − ηf ′ (xt )

We see that we can write GD as the minimum, at each steps, of the approximate f (x) cost
function

12.6.2 Proximal Gradient Descent

Let f (x) be a function we want to optimize, which is not derivable somewhere but can be
decomposed as a sum of two functions, one derivable and the other not.

f (x) = g(x) + h(x)

is derivable

g(x)
Where
h(x) is not derivable, e.g. |x|

Therefore, the gradient approximation can be done, up to the second derivative, only for g(x).
Since multiplying terms won’t change the optimum value, the expression for the minimum x
can be derived like so:

(x − xt )2
f (x) ≃ g(xt ) + (x − xt )g ′ (xt ) + + h(x)

(x − xt )2
 
t ′ t
x t+1
= argmin (x − x )g (x ) + + h(x)

 
1 ′ t
2
= argmin h(x) + t
x − x + g (x )η

12.6 Learning problems: the LASSO 197

 
1 2
= argmin h(x)η + x − (xt − g ′ (xt )η)
2

If h(x) is not very complicated, the former expression can be solved analytically (this is a
fundamental object of study in Convex Optimization).

Definition: The proximal operator of a function.


 
1
P rox(z) = argmin l(x) + (x − z)2
l(.) x 2

Example: Soft Thresholding l(x) = |x|a

 z − a if z ≥ a

P rox(z) = 0 if z ∈ [−a; a]
l(.)
z + a if z ≤ −a

The recursion is then:


xt+1 = P rox [xt − ηg ′ (xt )]
ηh(.)

12.6.3 The LASSO and other linear problems

A very typical problem is the following one: find the argmin of

1
L(w) = ∥y − Aw∥22 + h(λ) (12.135)
2

This can be solve, for convex h, using PGD, with the iteration:

The recursion is then:

wt+1 = P rox [wt − ηL′ (xt )] = P rox [wt + ηAT (y − Aw)]


ηh(.) ηh(.)

12.6.4 AMP reloaded

Consider the following AMP recursion:

xt+1 = ft (xt + AT rt ) (12.136)


rt = y − Ax + bt rt−1 (12.137)

Clearly, this is an instance of GAMP! Use gt (v t ) = y − v t = rt and e(u) = P rox, then GAMP
reads

ut+1 = AT (y − vt ) + Prox(ut ) (12.138)


t t
v = Aprox(u ) − bt gt−1 (v t−1
) (12.139)
198 F. Krzakala and L. Zdeborová

so using x = prox(u) and r = y − v we have

ut+1 = AT rt + x (12.140)
rt = y − vt = y − Aprox(ut ) + bt gt−1 (vt−1 ) = y − Ax + bt r (12.141)

So state evolution can be written for the AMP considered! At the fixed point, if this AMP
converges, we find
x = ft (x + AT (y − Ax)/(1 − b) (12.142)
So this recursion solved the LASSO (or any convex optimization problem for that matter).
Just takes its AMP recursion, and voila, you just solved LASSO!

Let us write the state evolution in the simplest case where y = Ax0 . Then we have u and v
Gaussian and we need to track their variance and projection to x0 and y.

qut+1 = αE[(gtout (ω, ϕ(z0 ))2 ] = αE[(y − w)2 ] = α(ρ + qω − 2mω ) = αE t (12.143)
(12.144)
p
qωt = E[ftin ( qut Z + mtu x0 )2 ]
= αE[∂z0 (gtout ( qωt Z ′ + mtω z0 , ϕ(z0 ))] = α (12.145)
p
mt+1
u
(12.146)
p
mtω = E[ftin ( qut Z + mtu x0 )x0 ]

so that we can write, keeping only q and m

(12.147)
p
q t+1 = E[ftin ( α(ρ + q t − 2mt )Z + αx0 )2 ]
(12.148)
p
mt+1 = E[ftin ( α(ρ + q t − 2mt )Z + αx0 )x0 ]

or even more directly

√ √
E t+1 = ρ + E[ftin ( αE t Z + αx0 )2 ] − 2E[ftin ( αE t Z + αx0 )x0 ]

12.7 Inference with GAMP

The inference problem addressed in this section is the following. We observe a M dimensional
vector yµ , µ = 1, . . . , M that was created component-wise depending on a linear projection of
an unknown N dimensional vector xi , i = 1, . . . , N by a known matrix Fµi
N
!
X
yµ = fout Fµi xi , (12.149)
i=1

where fout is an output function that includes noise. Examples are the additive white Gaussian
noise (AWGN) were the output is given by fout (x) = x + ξ with ξ being random Gaussian
variables, or the linear threshold output where fout (x) = sign(x − κ), with κ being a threshold
value. The goal in linear estimation is to infer x from the knowledge of y, F and fout . An
alternative representation of the output function fout is to denote zµ = N i=1 Fµi xi and talk
P
about the likelihood of observing a given vector y given z
M
Y
P (y|z) = Pout (yµ |zµ ) . (12.150)
µ=1
12.7 Inference with GAMP 199

One uses
(z−ω)2
dzPout (y|z) (z − ω) e− 2V
R
gout (ω, y, V ) ≡ (z−ω)2
. (12.151)
V dzPout (y|z)e− 2V
R

and (using R = u ∗ Σ)
(x−R)2
dx x PX (x) e− 2Σ
R
fin (Σ, R) = R (x−R)2
, fv (Σ, R) = Σ∂R fa (Σ, R) . (12.152)
dx PX (x) e− 2Σ

Again, these are not new equations! This was written as early as in 1989 by Mézard as the
TAP equation for the so-called perceptron problem ?. The same equation were written also by
Kabashima ? and Krzakala et al ?. The form presented here is due to Rangan Rangan (2011).
Part III

Random Constraint Satisfaction


Problems
Chapter 13

Graph Coloring II: Insights from


planting

13.1 SBM and planted coloring

Consider here a special case of the stochastic block model with na = 1/q and caa = cin , and
cab = cout for a ̸= b. We call assortative/ferromagnetic the case with cin > cout , i.e. connections
withing groups being more likely than between different groups. We call disassortative/anti-
ferromagnetic the case with cin < cout , i.e. connections withing groups being less likely than
between different groups.

In the last lecture we derived the belief propagation equations for the stochastic block model
that read
" #
i→j 1 Y X 1 Y h i
χti = i→j nti e−hti ctk ti χk→i
tk = i→j nti e−hti cout − (cin − cout )χk→i
ti ,
Z t
Z
k∈∂i\j k k∈∂i\j
(13.1)
with an auxiliary external field that summarizes the contribution and overall influence of the
non-edges
1 XX
hti = ctk ti χktk . (13.2)
N t k k

We notice that defining


cin
e−β = (13.3)
cout
and including terms independent of the value ti info the normalization Z i→j we obtain

1 Y h i
χi→j
ti = nti e−hti cout − (cout − cin )χk→i
ti (13.4)
Z i→j
k∈∂i\j

 
1 cin k→i
χi→j
Y
−hti
= nti e cout 1 − (1 − )χ (13.5)
ti
Z i→j cout ti
k∈∂i\j
204 F. Krzakala and L. Zdeborová

and including cout in 1


Z i→j
, we finally obtain:

1 Y h i
χi→j
ti = nti e−hti 1 − (1 − e−β )χk→i
ti , (13.6)
Z i→j
k∈∂i\j

where we abused the notation as we denoted the new normalization in the same way. We
notice that this is very close to the BP that we wrote for graph coloring in Section 6

1 Y h i
χti→j = 1 − (1 − e−β )χk→i
ti , (13.7)
i
Z i→j
k∈∂i\j

the only difference is the missing term nti e−hti that corresponds to the prior fixing the proper
sizes of groups. In the disassortative/anti-ferromagnetic case with cin < cout or equivalently
β > 0, and equal group sizes na = 1/q for all a = 1, . . . , q, this term is not needed and does
not asymptotically influence the behaviour of the BP equations upon iterations. This claim
can be checked empirically by running the BP algorithm with and without the term, showing
it formally mathematically is non-trivial.

The stochastic block model under the setting of this section (na = 1/q, caa = cin , and cab = cout
for a ̸= b) can be seen as a planted coloring problem at finite inverse temperature β > 0.
Planted coloring is defined as follows:

• Start with N nodes, q colors, each node gets randomly one color s∗i ∈ [q] ≜ {1, . . . , q} as
the true configuration.

• Put at random M = cN/2 edges between nodes so that fraction cout (q − 1)/(cq) (chosen
so that the expected number of edges between groups agrees with the SBM) of them is
between nodes with different colors and the rest between nodes with the same colors.
This produces the adjacency matrix A = [Aij ]N
i,j=1 .

The goal is to find the true configuration s∗ (up to permutation of colors) from the knowledge
of the adjacency matrix A, and the parameters θ. We call this way of generating coloring
instances planted because we picture the ground truth group assignment to be the planted
configuration around which is the graph to be colored constructed. The inverse temperature
β is then related to the fraction of monochromatic edges.

The threshold cd , cc and cℓ that we described to occur in the SBM for q = 5 colors at β → ∞ in
Fig. 10.4.3 thus exist in the planted coloring. More interestingly they bear consequences on
the original non-planted graph coloring problem as we describe in what follows.

13.2 Relation between planted and random graph coloring

13.2.1 BP convergence and algorithmic threshold

For the stochastic block model we derived that for



|cin − cout | > q c (13.8)
13.2 Relation between planted and random graph coloring 205

the BP algorithm converges to a configuration with positive overlap with the ground truth
assignment into groups. In terms of inverse temperature β and the average degree c this
condition is written as

q − 1 + e−β
 2
c > cℓ = . (13.9)
1 − e−β

This is thus the algorithmic transition that in the SBM marks the onset of a phase where belief
propagation is able to find the optimal overlap with the ground truth group assignment.

We derived this phase transition via a stability of the paramagnetic BP fixed point. Since the
BP equations for the planted coloring are the same as the one for the random graph coloring
the stability of the BP fixed point also applies to the random graph coloring. Indeed, at the
end of Section 6 we derived a condition for when the BP for random graph coloring converges
to the paramagnetic BP fixed point and when it goes away from it (6.24). This condition was
exactly eq. (13.9). In fact in the normal (non-planted) coloring the BP algorithm does not
converge for c > cℓ .

To summarize the consequence of the threshold cℓ , we have 4 cases:

• c > cℓ planted coloring: randomly initialized BP converges to a configuration with


positive overlap with the planted configuration.

• c > cℓ normal coloring: randomly initialized BP does not converge.

• c < cℓ planted coloring: randomly initialized BP converges to the paramagnetic fixed


point.

• c < cℓ normal coloring: randomly initialized BP converges to the paramagnetic fixed


point.

Let us now review in the following table how is this BP convergence threshold related to
the estimation of the colorability threshold for cin = 0 or equivalently β → ∞ and to the
average degree beyond which we do not know of polynomial algorithms that would provably
find proper colorings for large number of colors. We see from the table that the convergence
of BP had little to do with colorability nor with hardness of finding a proper coloring. In
the planted coloring this is the algorithmic threshold beyond which the inference of the
planted coloring starts to the algorithmically easy, but in the non-planted coloring the cℓ BP
convergence threshold does not have algorithmic consequences.
206 F. Krzakala and L. Zdeborová

Colorable and Not colorable and


Colorable and
BP does not BP does not con-
BP converges
q=3 converge verge c
cℓ = (q − 1)2 = 4 ccol ≈ 5.4
Colorable Not colorable Not colorable and
and BP con- and BP con- BP does not con-
q = 5
verges verges verge c
(q > 3)
ccol ≈ 14.4
cℓ = (q − 1)2 = 16
Polynomial Polynomial al- Not colorable Not colorable and
algorithms gorithm not and BP con- BP does not con-
exists known verges verge
q≫1 c
q log(q)
ccol ∼ 2q log(q) cℓ = (q − 1)2

13.2.2 Contiguity of random and planted ensemble and clustering

We recall that for the SBM we found a threshold cc (in continuous phase transition cc = cℓ )
such that

• For c < cc the BP fixed point with the largest free entropy is the paramagnetic fixed point
with χi→j
a = na and Q = 0.

• For c > cc the BP fixed point with the largest free entropy is the feromagnetic fixed point
with Q > 0.

Recall Fig. 10.4.3 for illustration of this threshold and also recall the fact that for c < cc the
paramagnetic fixed point thus provides the exact marginals for most variables and the exact
free entropy in the leading order.

A consequence of the paramagnetic fixed point being the one describing the correct marginals,
overlap and free energy is that there is no information in the graph that allows us to find a
configuration correlated with the planted configuration. This can only be true if all properties
that are true with high-probability in the limit of large graphs are the same in the planted
graph as they would be in a non-planted graph. Properties that hold with high-probability
in the large size limit are called thermodynamic properties in physics. Mathematically, this
kind of indistinguishability of two graph ensembles (the random and the planted one) by
high-probability properties is termed contiguity. For c < cc the random and planted ensemble
are hence contiguous, i.e. not distinguishable by thermodynamics properties. This among
other things implies that for c < cc the paramagnetic BP fixed point and the corresponding
Bethe free entropy are exact in the leading order not only for the planted graphs but also
for the random ones. We will see in the next lecture that for c > cc this is not the case, thus
answering one of the main questions we have about correctness of BP to graph coloring.

Nishimori identities for Bayes optimal inference are implying that the planted configuration
has all the properties of a configuration randomly sampled from the posterior distribution.
Putting this property together with the contiguity of the random and planted ensemble for
13.2 Relation between planted and random graph coloring 207

c < cc this allows us to investigate in a unique way properties of equilibrium configuration in


the random ensemble.

In the section on SBM we observed empirically that the fixed point of BP and marginals
reached by correspondingly initialized Monte Carlo Marlo chain in time linear in the size
of the system are equivalent in the large size limit (MCMC was just slower to convergence).
We thus conclude that if BP converges to the paramagnetic fixed point from the planted
initialization, i.e. for c < cd , then MCMC would be able to equilibrate, i.e. estimate in linear
time correctly in the leading order all quantities that concentrate. Thus for c < cd dynamics
such as MCMC is able to equilibrate in a number of steps that is O(1) per node, this happens
if from the physics point of view the phase is liquid and not glassy.

In the cases where we have a 1st order phase transition in the planted model, i.e. when cd < cc
the phase occurring for cd < c < cc has interesting properties. Recall that for all c < cc the
planted and random ensemble are contiguous. Yet in the planted ensemble the BP initialized
in the planted configuration converges to a fixed point strongly correlated with the planted
configuration Q > 0. Because of the contiguity this means that also in the random ensemble
BP initialized in a configuration drawn uniformly at random from the Boltzmann distribution
would converge to an analogous fixed point. Considering that such a BP fixed point describes
the subspace that would be sampled by MCMC if initialized in the same way we conclude
that dynamics initialized at equilibrium would remain stuck close to the initialisation and
would not explore in linear time the whole phase space. The domain of attraction where the
dynamics is stuck will be denoted as a cluster. Randomly initialized MCMC will not be able
to equilibrate, i.e. sample the space of configuration almost uniformly for cd < c < cc . In
this phase is thus conjectured that sampling configurations uniformly from the Boltzmann
measure is algorithmically hard.

Having introduced the notion of clustering, let us state one possible definition of how to
split configurations into clusters via equivalence classes of BP fixed points: Consider the
random graph coloring problem at inverse temperature β, consider BP equations (at the same
temperature) initialized in all possible configurations and iterate each of those BP to a fixed
point. Then define a cluster of configurations as all configurations from which BP converges
to the same fixed point. Cluster are then equivalence classes of configurations with respect
to BP fixed points. In case BP does not reach convergence from some configurations we can
think that one cluster corresponds to all such configurations. While it may be harder to relate
this definition to the more physical notion such as Gibbs states it is one that is most readily
translated into a method of analysis that is able to describe organization of clusters, e.g. how
many there are of a given free entropy.

The size of the cluster is described by the free entropy of the corresponding BP fixed point
Φferro . We remind that for c < cc the total equilibrium free entropy is the paramagnetic one
Φpara , that is in the phase cd < c < cc strictly larger than the ferromagnetic one Φpara > Φferro .
If equilibrium solutions are in cluster (basins of attraction) corresponding to free entropy
Φferro and the total free entropy is Φpara is must mean that there are eN (Φpara −Φferro ) of such
clusters. We will define the complexity of clusters as the logarithm of the number of such
clusters per node, defined as:
eN Σ = number of clusters (13.10)
where:
Σ = Φpara − Φferro (13.11)
208 F. Krzakala and L. Zdeborová

for c ∈ (cd , cc ). Since at cd the ferromagnetic fixed point appears discontinuously, it means
that exponentially many clusters appear discontinuously. As c → cc the complexity Σ is then
going to zero.

To summarize

• For c < cd most configurations belong to the same cluster, MCMC is able to equilibrate
in linear time.

• For cd < c < cc configurations belong to one of exponentially many clusters, MCMC is
not able to equilibrate in linear time.

This relation to dynamics being able to equilibrate before the threshold and not being able to
equilibrate in linear time after the threshold gives the name dynamical threshold.

So far the method we described to locate the clustering threshold cd is based on running BP
equations from the random and planted fixed point and comparing the free entropies of the
corresponding fixed points. This is a somewhat demanding procedure and does not lead to a
simple closed-form expression for the threshold as we obtained e.g. for the linear stability
thresholds cℓ . In the next section we will give a closed-form upper bound on the clustering
threshold cd for the case of proper colorings, i.e. when cin = 0 or equivalently when β → ∞.

Figure 13.2.1: Cartoon of the space of proper colorings in random graph coloring problem.
The grey circles correspond to unfrozen clusters, the black circles to frozen clusters. The size
of the circle illustrates the size of the clusters.

13.2.3 Upper bound on clustering threshold

For graph coloring at zero temperature, β → ∞, the BP fixed points initialized in a solution
(proper coloring) may have so-called frozen variables, i.e. variables i that stay at their initial
value up until the fixed point. This means that the frozen variables take the same value in
all the solutions belonging to the cluster. Frozen variables: Variable i is frozen if in the fixed
point of BP we still have χj→i
s = δs,s∗j . Frozen cluster: A cluster in frozen if a finite fraction of
variables are frozen in the cluster.

In order to monitor whether clusters in which a randomly samples solution lives are frozen of
not we recall the contiguity between the random and planted ensemble and the fact that the
planted configuration has all the properties of the equilibrium configurations. This allows
13.2 Relation between planted and random graph coloring 209

us to monitor the threshold where typical clusters get frozen by tracking the updates of BP
initialized in the planted configuration and monitoring how many variables are frozen.

Let us consider a simple case, for degree-d regular random graph, that is locally-treelike in
the large size limit N → ∞. Consider then a rooted tree around of on the nodes. The planted
solution can be seen as a broadcasting process on that tree where the root took a given color
and then recursively the children (i.e. the neighbors in the direction of the leaves) are given at
random one of the other colors. The question of existence of frozen variables is then translated
into the question whether the leaves of such a tree of large depth imply the value of the root
or whether the root can take another color while staying compatible with the coloring rule
and the colors on the leaves.

Denote ηℓ the probability that a node in ℓ-th generation is implied by the leaves above it, i.e.
can take only one color (the planted). A node will be implied to have its planted color if each
of the other colors is implied in at least one of its children.

ηℓ = Pr (a node in ℓ-th generation is implied by its leaves)


= Pr (for a node in ℓ-th generation, all other (q − 1) colors are implied on children)
= 1 − Pr (for a node in ℓ-th generation, at least one of the (q − 1) colors is not implied from children)

Assume now that r of the remaining q − 1 colors are not implied on any of its d − 1 children.
This probability can be written as:

Pr (for a node in ℓ-th generation, given r colors are not implied by any children)
= {Pr (for a node in ℓ-th generation, the given r colors are not implied by a given child)}d−1
= {1 − Pr (for a node in ℓ-th generation, a given child is implied to take one of the r given colors)}d−1
r ηℓ+1 d−1
 
= 1−
q−1
We then proceed by inclusion-exclusion principle. For r = 1 we over-counted the cases where
in fact two colors were not implied etc. obtaining
q−1
q−1
X  
r−1
ηℓ = 1 − (−1) Pr (for a node in ℓ-th generation, the given r colors are not implied by any children)
r
r=1
q−1
r q−1 r ηℓ+1 d−1
X   
=1+ (−1) 1−
r q−1
r=1

Therefore, we end up with a one-variable recursion


q−1
q−1 r ηℓ+1 d−1
X   
r
ηℓ = 1 + (−1) 1− ,
r q−1
r=1

For a tree of depth d we start from ηd = 1, since the nodes in d-th generation are a leaves
themselves. The probability ηℓ is then updates as we proceed down to the root up to a fixed
point.

We now define a rigidity threshold cr such that

• When d > cr , the fixed point η > 0.


210 F. Krzakala and L. Zdeborová

• When d < cr , the fixed point η = 0.

The rigidity threshold cr then provides an upper bound on the dynamical threshold cd . For
small number of colors the two threshold are not particularly close to each other and the
rigidity threshold in even larger than cc for q < 9. But for large number of colors q ≫ 1 we
get by expansion cr = q log(q) + o (q log(q)). This thus tells us that for large number of colors
clusters containing typical solution are frozen with high probability starting from average
degree at least cr .

We note that while we presented this argument for d-regular trees, for the random ones with
fluctuating degree distribution we can get an analogous closed-form expression that has the
same larger q behaviour in the leading order.

13.2.4 Comments and bibliography

Contiguity between planted and random ensembles only holds when paramagnetic (or anal-
ogous) fixed point exists. In general problems such as random K-SAT there the planted
ensemble is always different from the random one. However, the notion of clustering and clus-
ters corresponding to basins of attraction of Monte Carlo sampling as well as their definition
via BP fixed point is more general and holds beyond the models where planted and random
ensembles are related.

We also note that the scaling of the rigidity threshold at large number of colors is cr ≈ q log(q)
and this coincides with the scaling beyond which we do not have any known analyzable
algorithms working. Moreover numerical tests in the literature show that solutions found by
polynomial algorithms do not lead to a frozen BP fixed point. This leads to a conjecture that
frozen solutions are hard to find. At the same time the rigidity threshold concerns freezing
of the typical clusters and there might be atypical ones that remain unfrozen up to much
larger average degree. This is indeed expected to be the case in analogy with related constraint
satisfaction problems that have been analyzed Braunstein et al. (2016). Overall the precise
threshold at which solution become hard to find remain open, even on the level of a conjecture.
Chapter 14

Graph coloring III: Colorability


threshold and properties of clusters

14.1 Analyzing clustering: Generalities

In the last section we deduced that in a region of parameters the space of solutions in the
random graph coloring problem is split into so-called clusters. We associated clusters to BP
fixed points, and their free entropy to the Bethe free entropy ΦBethe corresponding to the fixed
point. This allows us to analyse how many clusters there are of a given free entropy in the same
way as we analyzed how many configurations there are of a given energy/cost. We define the
complexity function Σ(ΦBethe ) as the logarithm of the number of BP fixed points with a given
Bethe free entropy ΦBethe per node. With this definition we can define the so-called replicated
free entropy Ψ(m) as
Z
e e eN [Σ(Φ)+mΦ]
X
Ψ(m)N
= N mΦBethe
= (14.1)
BP fixed points Φ

Just as before for the relation between energy, entropy and free entropy, we have from the
properties of the saddle point and the Legendre transform that

∂Σ(Φ) ∂Ψ(m)
= −m , = Φ. (14.2)
∂Φ ∂m
Thus if we are able to compute the replicated free entropy Ψ(m) we can compute from it the
number of clusters of a given free entropy Σ(Φ) just as we did when computing the number
of configurations of a given energy.

We would now like to determine what is the free entropy Φ of clusters that contain the equilib-
rium configurations, i.e. configurations samples at random from the Boltzmann measure. For
this we need to consider the free entropy Φ that maximizes the total free entropy Φ + Σ(Φ).
We also need to take into account that the BP fixed point are structures that can be counted
and those that exist must be at least one, and hence Σ(Φ) cannot be negative, since logarithm
of positive integers are non-negative. Thus the equilibrium of the system is described by

max[Φ + Σ(Φ)|Σ(Φ) ≥ 0] (14.3)


Φ
212 F. Krzakala and L. Zdeborová

This expression is maximized in two possible ways. Assume that for neighborhood of m = 1
the complexity Σ(Φ) has non-zero value (except maybe one point where it crosses zero). Then
define Φ1 as the value of the free entropy at which the function Σ(Φ) has slope −1 i.e.
∂Σ(Φ)
= −1 (14.4)
∂Φ Φ1

then

• Either Σ(Φ1 ) > 0, in this case eq. (14.3) is maximized at Φ1 and the number of clusters of
that free entropy is exponentially large corresponding to Σ(Φ1 ). We call such a clustered
phase the dynamical one-step replica symmetry breaking phase.
• Or Σ(Φ1 ) < 0, in this case the value of free entropy that maximized the expression
(14.3) is the largest Φ such that Σ(Φ) > 0. We denote this the Φ0 . When this happens
the dominating part of the free entropy is included in the largest clusters that are not
exponentially numerous and in fact arbitrarily large fraction of the weight is carried by a
finite number of largest clusters. The equilibrium of the system thus condensed in a few
of the larger clusters. We call this phase the static one-step replica symmetry breaking
phase.

In the last lecture we computed the complexity of clusters in random graph coloring for
cd < c < cc as the difference between the free entropy of the paramagnetic and the feromagnetic
fixed point. In problems where we have contiguity between the planted and the random
ensembles, this complexity from eq. (13.11) is exactly equal to Σ(Φ1 ) and thus goes to zero
at the condensation threshold cc . Above the condensation threshold c > cc the equilibrium
of the system is no longer described by cluster of free entropy Φ1 (that is equal to the free
entropy of the planted cluster) instead it is given by Φ0 as defined above by the free entropy
at which the complexity becomes zero.

As the average degree grows further in the random graph coloring problem the maximum of
the complexity curve Σ(Φ) becomes zero at which point the last existing clusters disappear
and this thus marks the colorability threshold. To compute the colorability threshold we thus
need to count the total number of clusters, corresponding to the maximum of the curve Σ(Φ).

14.2 Analyzing BP fixed points

From the previous section we see that we can learn many details about clusters if we are able
to compute the replicated free entropy Ψ(m) from eq. (14.1) where we sum the exponential of
the Bethe free entropy to the power m over all the BP fixed points. More explicitly this means
to compute
eΨ(m)N = em[ i log Z + a log Z − (ia) log Z ]
i a ia
X P P P
(14.5)
BP fixed points

where the terms Z i , Z a,


and Z ia are the contributions to the Bethe free entropy that we derived
when we were deriving the BP equations. We will write BP equations schematically as
χi→a = Fχ ({ψb∈∂i\a
b→i
}) (14.6)
j→a
ψ a→i = Fψ ({χj∈∂a\i }) (14.7)
14.2 Analyzing BP fixed points 213

Figure 14.1.1: Illustration of the complexity curve Σ(Φ) that is the logarithm of the number
of clusters per nodes versus the free entropy Φ. The point where the curve has slope −1 is
marked in purple. Recall the properties of the Legendre transform that imply that the slope of
this curve is equal to −m.

The replicated free entropy thus reads


Z Y
e e i
 i→a a→i  m[P log Z i +P log Z a −P log Z ia ]
Ψ(m)N
= dχ dψ a (ia) (14.8)
(ia)

δ[ψ a→i − Fψ ({χj→a


YY YY
δ[χi→a − Fχ ({ψb∈∂i\a j∈∂a\i })] . (14.9)
b→i
})]
i a∈∂i a i∈∂a

What we just wrote is in fact very naturally a partition function of an auxiliary problem living
on the original graph where the variables (χi→a , ψ a→i ) and fields (Z ia )m live on the original
edges and the factors live on the original variable i and factor nodes a.

We thus put the problem of computing the replicated free energy Ψ(m) into a form in which
we can readily apply belief propagation as we derived it in Section 5.2. The only difference
now is that the variables are continuous real numbers and that the messages in the auxiliary
problem are probability distributions. Moreover we realize that it is consistent to assume that
the new BP message in the auxiliary problem in the direction from i → a does not depend on
the original message in the opposite direction, because of the assumed independence between
BP messages. This allows us to write the BP for the auxiliary problem that is a generic form of
a survey propagation (instead of messages we are now passing surveys). These equations
are also called the one step replica symmetry breaking cavity equations (but deriving the
214 F. Krzakala and L. Zdeborová

Figure 14.2.1: Illustration of the graphical model constructed from the original factor graph
used to analyze BP fixed points.

same result with the replica approach is rather demanding and much less transparent). The
parameter m is called the Parisi parameter. They read
1
Z Y
P a→i (ψ a→i ) = a→i dP j→a (χj→a )(Z a→i )m δ[ψ a→i − Fψ ({χj→a
j∈∂a\i })] (14.10)
Z
j∈∂a\i
1
Z Y
P i→a (χi→a ) = i→a dP b→j (ψ b→j )(Z i→a )m δ[χi→a − Fχ ({ψb∈∂i\a
b→i
})] (14.11)
Z
b∈∂i\a

where the messages are now probability distributions over the original BP messages. As
before we need to find a fixed point of these equations which in a single graph is numerically
demanding because we need to update a full probability distribution (in fact two) on every
edge.

In problems on random regular graphs without any additional disordered, e.g. in the graph
coloring on random regular graphs, the correct solution often has a form that is independent
on the edge (ia). In that case we have the same probability distribution on every edge and the
fixed point can be found naturally by so-called population dynamics where the probability
distribution of edges is represented by a large set of samples from the probability distributions
and the so-called reweighting factor (Z i→a )m or (Z a→i )m is proportional to the probability
with which each element appears in the population.

The replicated free entropy can then be computed from the fixed point using the recipe for
14.3 Computing the colorability threshold 215

the Bethe entropy on the auxiliary problem and reads:


X X X
N Ψ(m) = Φi + Φa − Φia (14.12)
i a ia

where
Z Y
e mΦi
= dP a→i (ψ a→i )(Z i )m (14.13)
a∈∂i
Z Y
emΦ
a
= dP i→a (χi→a )(Z a )m (14.14)
i∈∂a
Z
e mΦia
= dP a→i (ψ a→i )dP i→a (χi→a )(Z ia )m . (14.15)

The curve Σ(Φ) is then computed from the Legendre transform of Ψ(m).

It might happen that also the clusters cluster into super-clusters, or split into mini-clusters.
This would then lead to the so-called two-step replica symmetry breaking. And one could
continue to speak about K-step RSB, or full-RSB in case an infinite hierarchy of step is needed.
Currently we do not know how to solve the full-RSB equations on random sparse graphs, but
in dense optimization problems this can be done, but this is beyond the scope of the present
lecture.

14.3 Computing the colorability threshold

The concept of frozen variables that we introduced to upper bound the clustering threshold
enables a key simplification to compute the colorability threshold ccol that we aimed at from
the very first lecture on graph coloring. We will again restrict to random d-regular graphs as
the computation simplifies in this case. We learned that the space of solutions is divided in
clusters and some clusters have frozen variables.

Note that frozen variables are needed in order for a cluster of solution to vanish when the
average degree is infinitesimally increasing (addition of new edge will create contradiction
with high probability). Further it is not possible to have a very small fraction of frozen variables
because they must sustain each other in a sort of backbone structure. We will thus assume
that the colorability threshold is given by the average degree at which clusters with frozen
variables all disappear. Since clusters were identified with BP fixed points, counting frozen
clusters to see when all disappear becomes counting BP fixed points with frozen variables.

The key idea of describing clusters is again that being a BP fixed point is just another type of
constraints for another type of variables. It is an auxiliary problem that can be formulated
using a tree-like graphical model and solved using again belief propagation and counting
frozen clusters comes from the value of the corresponding free entropy. This "BP on frozen
BP fixed points" is called the survey propagation.

Let us now describe the structure of a frozen BP fixed point. The frozen BP messages are
χj→i j→i
sj = δsj ,s = “s” for q possible values of s; the not frozen ones are χsj = something else =
“∗” and will be called "joker". For each edge (ij) in each direction the message will be either
"joker" or frozen in one of the actual colors. The variables will now be living on the edges, let
216 F. Krzakala and L. Zdeborová

νij ∈ {∗, 1, . . . , q} denote the value for variable node (ij), in total this variables can be in one
of the (q + 1) possible states. In the new graphical model BP equations lead to the constraints,
terms in Bethe entropy are the new factor weights, BP messages are the new variables. The
new messages are probability distribution on the original BP messages.

Let us not discuss the constraints (factor nodes) that will be in place of the original nodes (as
in graph matching problems)

ik (q + 1) values

i constraint

ij

The constraint Ci on node i requires the following:

• if an incoming value νki = s, the outgoing value νij cannot be s since the neighbor k is
frozen to s; if an incoming value νki = ∗, then it will not impose any constraint on the
outgoing value νij .

• incoming values do not forbid all colors

• if νij = ∗ at least two colors are not forbidden by the incoming edges, if vij = s, only s is
not forbidden by the incoming edges.

We can now write the survey propagation equation, which is just BP on this new graphical
model
1 X   Y
ni→j
νij = C νij , {ν }
ki k∈∂i\j nk→j
νki
Z i→j
{νki }k∈∂i\j k∈∂i\j

which is generic form of BP on variables νij .

Survey propagation (SP) equations can be made more explicit by using the specific form of the
constraints imposed above. We proceed again according to the inclusion-exclusion principle
14.3 Computing the colorability threshold 217

• When νij ̸= ∗
( )
1 Y  
ni→j
νij = 1 − nνk→i + correct for the cases that also allow p ̸= νij
Z i→j ij
k∈∂i\j
| {z }
neighbors are not forbidding νij
 )
q
(
1 Y   X Y   Y X
= 1 − nνk→i − 1 − nk→i k→i
νij − np + · · · − (−1)q 1 − npk→i 
Z i→j ij
k∈∂i\j p̸=νij k∈∂i\j k∈∂i\j p=1
| {z }
no body is forbidden
q−1
!
1 X l
X Y X
= (−1) 1− nk→i
νij − nk→i
v
Z i→j
l=0 V ⊆{1,...,q}\νij k∈∂i\j v∈V
|V |=l

• When νij = ∗

1
ni→j
∗ = {not forbidding at least two colors}
Z i→j
q
!
1 X X Y X
= (−1)l 1− nk→i
v .
Z i→j
l=2 V ⊆{1,...,q}\νij k∈∂i\j v∈V
|V |=l

• Because of the normalization we need ni→j


Pq i→j
∗ + p=1 np = 1, thus

Z i→j = Pr (no contradiction (at least one color not forbidden))


 
Xq Y   X Y   Y q
X
= 1 − npk→i − 1 − nk→i
p − nk→i
r + · · · − (−1)q 1 − nk→i
p

p=1 k∈∂i\j p<r k∈∂i\j k∈∂i\j p=1
q
!
X l−1
X Y X
= (−1) 1− nk→i
v
l=1 V ⊆{1,...,q} k∈∂i\j v∈V
|V |=l

The Bethe free entropy of this problem is the complexity Σ, counting all the frozen clusters. We
remind that when we expressed complexity in the previous section of was the one of clusters
that contained typical solutions. In the next lecture we will see the relation between the two.
The complexity can be computed from the fixed point of SP using the usual recipe for Bethe
free entropy
N
1 X   1 X  
Σ= log Σ(i) − log Σ(ij)
N N
i=1 (ij)∈E
q
!
X l−1
X Y X
(i)
Σ = (−1) 1− nk→i
v
l=1 V ⊆{1,...,q} k∈∂i v∈V
|V |=l
q
X
Σ(ij) = 1 − ni→j
p np
j→i

p=1
218 F. Krzakala and L. Zdeborová

Here the term Σ(i) is related to the normalization of the messages that does not exclude the
node j, the term Σ(ij) is then counting the probability that the messages in the two directions
of one edge are compatible, i.e. not imposing the same color on the two ends.

On a d-regular graph and assuming symmetry among colors (np = η, ∀ p = 1, . . . , q), this
leads to a very concrete conjecture for the colorability threshold of random d-regular graphs:
Pq ℓ−1 q−1 d−1
ℓ=1 (−1) ℓ−1 (1 − ℓη)
η = Pq ℓ−1 q  d−1
ℓ=1 (−1) ℓ (1 − ℓη)
q
(   )
X q d
(−1)ℓ−1 (1 − ℓη)d − log 1 − qη 2

Σ = log
ℓ 2
ℓ=1
We can first compute η to be the fixed point, and then plug the fixed point in Σ to compute
the complexity. If Σ > 0, then it means a random d-regular graph is colorable w.h.p. for large
N , and uncolorable vise versa. For the random graph with fluctuating variables degree the
expressions are only slightly more complex, but derived in the very same spirit.

Going back to the overall picture we now described the whole regime of average degrees,
except cc < c < ccol . This is the condensed phase which has rather peculiar properties and
will be discussed in the next lecture.

14.4 Exercises

Exercise 14.1: Random subcube model

The random-subcube model is defined by its solution space S ⊂ {0, 1}N (not by a
graphical model). We define S as the union of ⌊2(1−α)N ⌋ random clusters (where ⌊x⌋
denotes the integer value of x). A random cluster A being defined as:

A = σ σi ∈ πiA , ∀ i ∈ {1, . . . , N } (14.16)




where π A is a random mapping:

π A : {1, . . . , N } → {{0} , {1} , {0, 1}}


i 7→ πiA

such that for each variable i, πiA = {0} with probability p/2, {1} with probability p/2,
and {0, 1} with probability 1 − p. A cluster is thus a random subcube of {0, 1}. If
πiA = {0} or {1}, variable i is said “frozen” in A; otherwise it is said “free” in A. One
given configuration σ might belong to zero, one or several clusters. A “solution” belongs
to at least one cluster.
We will analyze the properties of this model in the limit N → ∞, the two parameters α
and p being fixed and independent of N . The internal entropy s of a cluster A is defined
as N1 log2 (|A|), i.e. the fraction of free variables in A. We also define complexity Σ(s) as
the (base 2) logarithm of the number of clusters of internal entropy s per variable (i.e.
divide by N ).

(a) What is the analog of the satisfiability threshold αs in this model?


14.4 Exercises 219

(b) Compute the αd threshold below which most configurations belong to at least one
cluster.

(c) For α > αd write the expression for the complexity Σ(s) as a function of the parame-
ters p and α. Compute the total entropy defined as stot = maxs [Σ(s) + s | Σ(s) ≥ 0].
Observe that there are two regimes in the interval α ∈ (αd , 1), discuss their proper-
ties and write the value of the “condensation” threshold αc .
Chapter 15

Replica Symmetry Breaking: The


Random Energy Model

If you can’t solve a problem, then there is an easier problem you


can solve: find it.

George Pòlya , How to Solve It - 1945

In this chapter, we come back to the replica method, and will in particular discuss replica
symmetry breaking (in its simpler form, the one-step replica symmetry breaking, 1RSB in
short), and will make contact with the form of 1RSB discussed with the cavity method.

To do so, we shall discuss a very simple model, that was introduced by Bernard Derrida in
1980 to understand spin glasses and the replica method. It turns out to be indeed a very good
toy model to understand key concepts, and in fact, it is also a very important model in its own,
connecting with important concepts in denoising and information theory.

The random energy model is a trivial spin models with N variables, in the sense that the
energy of the possible M = 2N configurations are choosen, once and for all, randomly from a
Gaussian distribution.

Formally, in the Random Energy Model (REM), we have 2N configuration, with random fixed
energies Ei sampled from

 
N 1 E2
P (E) = N 0, =√ e− N (15.1)
2 πN

so that the partition sum reads

2N
X
ZN = e−βEi (15.2)
i=1
222 F. Krzakala and L. Zdeborová

15.1 Rigorous Solution

We start to computing the exact asymptotic solution of the model, without using the replica
method. To do so, we shall show that the entropy (in the sense of Boltzmann) density as a
function of the energy has a deterministic asymptotic limit, that we can compute.

To do so, we first ask: if we sample 2N energy randomly, what is the number of energies
that fall between [N e, N (e + de)]? Let us call this number #(E). This is a random variable,
it depends on the specific draw of the 2N configuration, but we can compute its mean and
variance. First we compute the average; when de is small enough, we have
2N −N e2 1
E(#N e) = 2N E[1(E ∈ [N e, N (e + de)])] ≈ √
2
e de = √ e−N (e −log 2) de (15.3)
πN πN
Also, we see that, the probability of the energy taken randomly from P (E) to be between e
and e + de being small, this follows a Poisson law, so that the variance is equal to the mean.
Defining the function sann (e) = log 2 − e2 (physicists call this quantity the annealed entropy
density) we thus have two regimes:

• If sann (e) < 0, then there the average number


√of configuration
√ with energy e is exponen-
tially close to 0. This happens when e < − log 2 and e > log 2. In this case Markov
Markov inequality inequality tells us that with high probability the number of configuration (and not only
states that if X is a its average) is actually 0.
non-negative random
variable, then Indeed, from Markov:
P(X ≥ k) ≤ E[X]/k. P(X ≥ 1) ≤ E(X) (15.4)
so we have, with high√probability
√ as N is going to infinity, all configurations have energy
densities between [− log 2, log 2].
• If sann (e) > 0, we have instead an exponential number of configurations, with a variance
also exponential. However the entropy density concentrates to a deterministic value, as
can be seen again from Markov inequality and the fact that the variance is equal to the
mean:

2 !
E (#(e) − E[#(e)])2
    
#(e) #(e) 2
P −1 ≥k = P −1 ≥k ≤
E#(e) E#(e) k 2 (E[#(e)])2
Var([#(e)] 
E[#(e)] e−N sann (e)
≤ ≃ 
∝ . (15.5)
k 2 (E[#(e)])2 k 2 (E[#(e)])2 k2
Where we have used the fact that in this case, since the probability of the energy taken
randomly from P (E) to be between e and e + de being small follows a Poisson law,
the variance is equal to the mean. As N grows, the probability of deviation is going
#(e)
exponentially to zero when sann (e) > 0, so that with high probability, E#(e) is arbitrary
close to 1.

Thanks to this very tight concentration, we now can state that, with high probability, the
number of configurations is either 0 —if s(E) is negative— or close to eN sann (e) otherwise.
More precisely, we can write the entropy density as a function of e:
s(e) =: sann (e) if, sann (e) ≥ 0, and s(e) = −∞ otherwise . (15.6)
15.1 Rigorous Solution 223

Figure 15.1.1: Entropy density in the REM

With this knowledge, we can now solve the random energy model easily, without any disorder
averaging. We write that, with high probability, we have

X Z log 2
−βE
ZN = #(E)e ≈ √ dee−N (βe−s(e)) (15.7)
E − log 2

and using Laplace integral, with the sum is dominated by its maximum, we have the free
entropy given by the Legendre transform of the entropy:

1
lim log Z = s(e∗ ) − βe∗ (15.8)
N →∞ N
with
e∗ = min[−√log 2,√log 2] [βe − s(e)] (15.9)

Again, there are two situations: the minimum can be reached when the derivative is 0, that is
when
β = ∂e s(e) = −2e (15.10)

but this can only be the√case when s(e) > 0. However, at e = − log 2, s(e) reaches
√ zero, and
at this point s√
′ (e) = 2 log 2. So this minimum is only valid when β < β = 2 log 2, after
c
which e∗ = − log 2. In a nutshell, we have:

1 β2
log Z = −(e∗ )2 + log 2 − βe∗ = log 2 + (15.11)
p
lim , ifβ < βc = 2 log 2
N →∞ N 4
1
(15.12)
p p
lim log Z = log 2β , ifβ ≥ βc = 2 log 2
N →∞ N

The phase transition arising a βc is called a condensation transition. This is because, at this
point,
√ all the probability measure condensate on the lowest energy configurations of energy
−N log 2. This phenomenon is of crucial importance in understanding the 1RSB phase
transition.
224 F. Krzakala and L. Zdeborová

Figure 15.1.2: Equilibrium energy in the REM

15.2 Solution via the replica method

We now move to the computation by the replica method. We start by the replicated partition
sum:  
Y X 2N
Zn =  e−βEi  (15.13)
n i=1

Let us do a bit of seemingly trivial rewriting

2 N
X
Z n
= e−β(Ei1 +...+Ein ) (15.14)
i1 ,...,in =1
N N
 
2 2 P2N
1
Pn
X Pn X −β Ej (j=i a )
e−β
a=1 j=1
= a=1 Eia
= e (15.15)
i1 ,...,in =1 i1 ,...,in =1
2 N 2N
1(j=ia ))
X Y Pn
= e−βEj ( a=1 (15.16)
i1 ,...,in =1 j=1

We now compute the average over the disorder. By linearity of expectation, we can push the
average in the sum. Using independence of the different 2N energies, we can also push the
Remember that if X average over each Ej in the product. Using the Gaussian integral , we thus find that:
is Gaussian with zero
mean and variance ∆, N β 2 Pn 2
1(j=ia ))2 = e N4β 1(j=ia )1(j=ib )
Pn Pn
then E[ebX
]=e
b2 ∆
2 . EEj e−βEj ( a=1 I(j=ia ))
=e 4
( a=1 a,b=1 (15.17)

and therefore the expectation of the replicated partition sum reads:

2 N
N β2
1(ia =ib )
X Pn
E[Z ] =n
e 4 a,b=1 (15.18)
i1 ,...,in =1
15.2 Solution via the replica method 225

Following our tradition of using overlaps, we see that, given the replicas configurations (i1 ...in ),
it is convenient to introduce the n × n overlap matrix Qab = 1(ia = ib ), that takes elements in
0, 1, respectively if the two replicas (row and column) have different or equal configuration.
We can finally write the replicated sum over configurations as
2 N
X N β2 Pn X β2 Pn
Qa,b
n
E[Z ] = e 4 a,b=1 = #(Q)eN 4 a,b=1 Qa,b
(15.19)
iα =1,...,in {Q}

where {Q} is the sum over all possible such matrices, and #(Q) is the numbers of configura-
P
tions that leads to the overlap matrix Q.

15.2.1 An instructive bound

Considering, for a moment, the n to be integers, we can derive an instructive bound on all the
moments of Z. Indeed, a possible value of the matrix Q is the one where all the n replica are
in the same configuration, in which case Qa,b = 1, ∀a, b. There are 2N such matrices, so
β2 2 β2 2
EZ n > 2N eN 4
n
> eN 4
n
(15.20)
2
so that we now know, rigorously that the moment of Z grow at least as en . This is a bad new.
Indeed, if the moments do not grow faster than exponentially, their knowledge completely
determines the distribution of Z, and thus the expectation of its logarithm, according to
Carleman’s condition. Since, in our case, they do grow faster, this means that the moments do
not necessarily determine the distribution of Z, and in particular the analytic continuation at
n < 1 may not be unique. We thus will need to choose the "right" one, which is the essence of
the replica method. This is precisely what Parisi’s ansatz is doing for us: it provides a well
defined class of analytic continuations, which turns out to be the correct one.

15.2.2 The replica symmetric solution

Let us try to perform the analytic continuation when n → 0. Keeping for a moment n integer,
it is natural to expect that the number of such configurations , for a given overlap matrix, to be
exponential so that, denoting #(Q) = eN sq (Q) we write
 
Z β2 Pn Z
N a,b=1 Q a,b +s q (Q)
EZ n ≈ dQeN g(β,Q) (15.21)
4
dQe =:

As N is large, we thus expect to be able to perform a Laplace (or Saddle point) approximation
by choosing the "right" structure of the matrix Q, which will "dominate" the sum. A quite
natural guess (Physicists like to call this an ansatz) is to assume that all replicas are identical,
and therefore the system should be invariant under the relabelling of the replicas (permutation
symmetry). In this case, we only have two choices for the entries of Q: 1) Qab = 1 for all a, b or
2),Qaa = 1 for a = 1, · · · , n and Qab = 0 for a ̸= b. In both case, we can easily evaluate sq (Q).

1. If Qab = 1 for all a, b, then all the replica are in a single configuration, as we have
already seen. Then N = 2N , and we find g(β, Q) = n2 β 2 /4 + log 2. This is actually very
frustrating, as this does not have a limit with a linear part in n, so we cannot use this
solution in the replica method. Clearly, this is a wrong analytical continuation.
226 F. Krzakala and L. Zdeborová

2. If instead Qaa = 1 and Qab = 0 for all a ̸= b then all replicas are in a different configu-
ration; #(Q) = 2N (2N − 1) . . . (2N − n + 1), so that s(Q) ≈ n log 2 if n ≪ N . Therefore
g(β, Q) = nβ 2 /4 + n log 2.

At the replica symmetric level, we thus find that the free entropy is given, at all temperature,
by
fRS (β) = β 2 /4 + log 2 (15.22)
This is not bad, and indeed we know it is the correct solution for β < βc . However, this solution
is obviously wrong for β > βc .

15.2.3 The replica symmetry breaking solution

We now follow Parisi’s prescription, and use a different ansatz for the saddle point. We assume
the n replica are divided into many groups of m replica, and that the overlap is 1 within the
group, and 0 between different groups. The number of group is of course n/m. In order
to count the number of such matrices, we can choose one configuration by group, so that
N mn
2
N = 2N (2N −1) . . . (2N −n/m+1), sq (Q) = log(2N ) = (n/m) log 2 and a,b Qa,b = nn = nm.
P
m
Thus, we have

n β2
E[Z n ] ≃ eN [ m log 2+ 4
nm]
(15.23)

and again 1 ≤ m ≤ n

g(q) = nmβ 2 /4 + n/m log 2 (15.24)

This has a nice limit as n going to zero, and we thus find:

f1RSB (β, m) = mβ 2 /4 + log2/m (15.25)

We now need to choose m. The historical reasoning to choose m proposed by Parisi is almost a
joke, as it sounds crazy. First, we know that for n integer, we have obviously n ≥ m ≥ 1. Parisi
tells us than, when n → 0, we must instead choose n ≤ m ≤ 1. Secondly, as it was not crazy
enough, we will see that it corresponds to a minimum rather than to a maximum. We shall see
later on how to rationalize these results, thanks to the cavity method, but so far, let us follow
the replica prescriptions. We extremize the replica free entropy with respect to m and find:

β 2 /4 = log 2/m2 (15.26)

so that m = βc /β. Since, however, we want m ≤ 1, we write instead

m = min[βc /β, 1] (15.27)

This, amazingly, turns out to be the correct solution! Indeed, we have now:

1. If β < βc (high-temperature), we have m = 1 so that

f1RSB (β, m) = β 2 /4 + log 2 = fRS (β, m) (15.28)


15.2 Solution via the replica method 227

2. while, if instead β > βc (low-temperature), we have m = βc /β so that


(15.29)
p
f1RSB (β, m) = βc β/4 + β log 2/βc = β log 2

which is indeed the correct free energy. There is definitely something magical, but it works.

15.2.4 The condensation transition and the participation ratio

The replica method can also be used to shed some light on the type of phase transition hap-
pening at βc . Indeed, we know that for low
√ temperature, the Boltmzann measure condensate
on the lowest energy level, close to e = − log 2, but we may ask, how many are they, really?
This can be answered by computing the participation ratio defined as
N
2  −βEj 2
X e
Y = (15.30)
Z
i=1

Handwavingly speaking, the participation ratio is the inverse of the number of configuration
that matters in the sum. Let’s compute its average by the replica method! We find
     
2N  −βEj 2 2N 2N
X e 1 X X
EY = E   = E e−2βj  = lim E Z n−2 e−2βEj  (15.31)
Z Z2 n→0
i=1 i=1 i=1
 
2N
−β(Ei1 +Ei2 +...+Ein−2 )
X X
= lim E  e e−2βEj  (15.32)
n→0
i1 ,...,in−2 i=1
 

e−β(Ei1 +Ei2 +...+Ein ) 1(in−1 = in )


X
= lim E  (15.33)
n→0
i1 ,...,in

Now, using the invariance of all the replica, we symmetrize the last expression so that
 
1
e−β(Ei1 +Ei2 +...+Ein ) 1(ia = ib )
X X
EY = lim E (15.34)
n→0 n(n − 1)
a̸=b i1 ,...,in

Finally, since the limit when n → 0 of Z n is one, we write


hP i
1 X E i1 ,...,in e −β(Ei1 +Ei2 +...+Ein )
1(ia = ib )
EY = lim hP i (15.35)
n→0 n(n − 1) −β(Ei1 +Ei2 +...+Ein )
a̸=b E i1 ,...,in e
1
= lim ⟨Qa,b ⟩ (15.36)
n→0 n(n − 1)
and assuming, as we saw in the 1RSB computation, that the replica computation is dominated
by its saddle points, we find
1 X
EY = lim 1RSB
Qa,b (15.37)
n→0 n(n − 1)
a̸=b
1 1
n/m m2 − m = lim (15.38)
 
= lim [m − 1]
n→0 n(n − 1) n→0 (n − 1)
= 1 − m = 1 − βc /β (15.39)
228 F. Krzakala and L. Zdeborová

We can now plot the participation ratio and see that it foes from 1 at zero temperature and
diverges at T = Tc = 1./βc . A similar computation shows that second moment is finite, so
that this number is NOT self-averaging, the number of configurations participating in the
states fluctuate from instance to instance.

Figure 15.2.1: Inverse participation ratio in the REM

15.2.5 Distribution of overlaps

Another way to think about the replica solution is through the distribution of possible overlaps.
Consider two actual real replica of the system. We may ask, what if we take two independent
copies of the same system, what is the probability that they are in the same configuration?
Or, equivalently, what is the probability that their overlap is one? This is precisely what we
computed with the participation ratio! Indeed, the (average) probability that two copies are
in the same configurations reads
 
2N  −βEi −βEj
1(Qa,b )
 P
e
1(i = j)  = E[Y ] = lim a̸=b
X
E(P(q = 1)) = E  2
. (15.40)
Z n→0 n(n − 1)
i,j=1

In the high-temperature solution, when the RS approach is correct, we see that the only
possible overlap is 0. If we sample two configurations in the high-temperature region, then
with high-probability, they are different ones.

In the 1RSB low temperature solution, we see instead that with probability 1 − m we have q = 1
(and therefore, with probability m, we have q = 0). If we sample two configurations, they will
be the same ones with probability 1 − m. This is another way to think of the condensation
phenomenon. In spin glass parlance, one often says that the distribution of overlap become
non trivial. We went from a simple delta-peak at high-temperature
EPRS (q) = δ(q) (15.41)
to a distribution with two pick at low temperature
EP1RSB (q) = (1 − m)δ(q − 1) + mδ(q) (15.42)
15.3 The connection between the replica potential, and the complexity 229

15.3 The connection between the replica potential, and the complex-
ity

We have seen that the replica method is indeed very powerful, and allows to compute ev-
erything we want, pretty much. Can we understand how this version of replica symmetry
breaking is connected with the one introduced in the cavity method, while at the same time
shed some light on the strange replica ansatz and minimization? The answer is yes, using the
construction derived in the cavity method.

The 1RSB potential — Let us discuss again the idea introduced at the previous chapter. The
whole concept is that there are many "Gibbs states", and that we would like to count them. So
the entire partition sum can be divided into "many" partition sums, each of them defining a
Gibbs states (so that quantities of interest are well defined and concentrate within each state).
We thus have X
Z= Zα (15.43)
α

The idea is then that we can count those states. To do so, we introduce this modified partition
sum, that weights all partition sum by a power m:
X X
Z(m) = Zαm = eN mfα (15.44)
α α

where we have denoted the free entropy of each state α as fα = (log Zα )/N . We assume that
there are exponential many of them, so we need to replace this discrete sum, and approximate
it by an integral and hope that the version we compute using the our approximation (BP, RS
etc...) will be correct:
Z Z
Z(m) = df #(f )e N mfα
= df eN (mfα +Σ(f ) (15.45)

where we have introduced the "complexity" Σ(f ), that is the logarithm of the number of state
of free entropy f (divided by N ). At this point, we can perform the Laplace integral and write
Z Z
Z(m) = df #(f )e N mfα
= df eN (mfα +Σ(f ) (15.46)

so that we compute the thermodynamic potential

1
Ψ(m) = log Z(m) → mfα∗ + Σ(f ∗ ) (15.47)
N
m = −∂f Σ(f )|f ∗ (15.48)

This construction was at the basis of the 1RSB approach in the cavity method, and it allowed
to compute the complexity function. Additionally, the Legendre transform ensure that

∂m Ψ(m) = f ∗ (15.49)

As discussed in the previous chapter, we have to pay attention to the fact that Σ is actually an
average version of the true logarithm of the number of state, and can be negative, so that we
need to evaluate the integral on values of m such that Σ ≥ 0.
230 F. Krzakala and L. Zdeborová

This is reminiscent of the solution for the REM, except that for the REM, a "configuration"
is the same as a "Gibbs" state (this is just because the REM is very simple model). Still, the
same thing happens, with condensation on the lowest energy configuration, so that m can be
different from one when Σ is negative.

With replica: the "Real replica" method — Can we compute the same potential Ψ(m) with
the replica method? The answer is yes, and very instructively so.

First, we notice that if we force m copies (or real replica) to be in the same "states", then the
replicated sum X X
Zm = ( Zα )m = Zi1 Zi2 . . . Zim , (15.50)
α i1 ,i2 ,...im

becomes X
m
Zconstrained = Zαm , (15.51)
α
which is what we want to compute the potential.

Now, we need to compute the average of the log of this quantity and write, using the replica
method ′
m
(Zconstrained )n − 1
1 m
Ψ(m) = E log Zconstrained = lim (15.52)
N n′ →∞ N n′
This would be the way to compute the potential that gives the Legendre transform of the
complexity. However, we also see that
mn ′
1 1 Zconstrained −1
m
E log Zconstrained = lim ′
(15.53)
m N ′
n →0 N mn
n
Zconstrained −1
= lim (15.54)
n→0 Nn
(15.55)

We thus see that when we perform the replica computation, constraining m replica to be in
the same Gibbs states, and thus perform the 1RSB ansatz, what we do is nothing but the
computation of the Legendre transform of the complexity (up to a trivial 1/m factor)! The
1RSB ansatz is nothing but a fancy way (in replica space) to construct the function Ψ(m).

This is, we believe, a much clearer motivation for the 1RSB ansatz! indeed, with this motivation
in mind, it is clearer to see why m should be 1, unless the complexity is negative, so that indeed
m ∈ [0 : 1]. Additionally, it also explain why we must extremize the replica 1RSB formula
with respect to m. Indeed
Ψ(m) Ψ′ (m)m − Ψ(m)
∂m = (15.56)
m m2
which is zero when

Ψ′ (m)m − Ψ(m) = 0 (15.57)


∗ ∗ ∗
f m − f m + Σ(f ) = 0 (15.58)

Σ(f ) = 0 (15.59)

so indeed, we see that extremizing over m in the replica formalism is the same as looking for
the zero complexity. This is exactly what we had to do for the REM at low temperature!
15.4 Exercises 231

Bibliography

Replica symmetry Breaking was invented by Parisi in a serie of deep thought provoking
papers (Parisi, 1979, 1980) that ultimately led him to receive the Physics Nobel prize in 2021.
In particular, the distribution of overlaps was introduced in (Parisi, 1983) and its fundamental
role in the replica theory described in Mézard et al. (1984). The Random Energy Model was
introduced by Derrida (1980, 1981) as a toy model to understand replicas and spin glasses,
and it played an important role in the clarification of the replica method, especially as it leads
to the study of the celebrated p-spin model by (Gross and Mézard, 1984). The construction in
terms of counting Gibbs states is discussed in Monasson (1995); Mézard and Parisi (1999). It
was instrumental in the creation of the modern version of the cavity method (Mézard and
Parisi, 2001; Mézard et al., 2002; Mézard and Parisi, 2003) discussed in the previous chapters.

15.4 Exercises

Exercise 15.1: REM as a p-spin model

The p-spin model is one of the cornerstones of spin glass theory. It is defined as
follows: there are 2N possible configurations for the N spins variables Si = ±1, and the
Hamiltonian is given by all possible p-body (or p-upplet) interactions:
X
H=− Ji1 ,...,ip Si1 . . . Sip (15.60)
i1 ,i2 ,...,ip

with r
πp! − N p−1 J2
P (J) = p−1
e p! (15.61)
N
1. Computing the moment of the partition function using Gaussian integrals, show
that  !p 
X β 2 X X
E[Z n ] = exp  p−1
Sia Sib  (15.62)
4N
a b n
{Si },{Si },...,{Si } a,b i

2. introducing delta functions to fix the overlap and taken its Fourier transform,
show that Z Y
n
E[Z ] ≈ dqa,b dq̂a,b e−N G(Q,Q̂) (15.63)
a<b
with
 
β2 β2 X p X X P 1
a<b iq̂a,b N
P a b
i Si Si
G(Q, Q̂) = −n − qa,b + i q̂a,b qa,b − log  e 
4 2
a<b a<b {Sia ,...,Sin }
(15.64)
3. Using the replica method, with the replica symmetric solution, show that the free
entropy is given by
β2
fRS = + log 2 (15.65)
4
232 F. Krzakala and L. Zdeborová

as in the REM. It is possible, of course, to break the symmetry and to obtain the
1RSB solution, which turns out to be correct at low temperature (at least for p
large enough, and for a range of temperature).

4. It is interesting that the free energy is the same as the one of the REM. Show that,
when p → ∞, the energies of the p − spin model becomes uncorrelated and that
the p − spin model become the REM in this limit.

Exercise 15.2: Second moment of the participation ratio in REM

We consider again the REM. Using the replica method, show that at for β ≥ βc , the
second moment of the participation ratio is given by

3 − 5m + 2m2
E[Y 2 ] = (15.66)
3
Deduce that Y is not self-averaging. Actually, these results do not depends on the REM,
but on the 1RSB structure, and are universal to all 1RSB models.

Exercise 15.3: Maximum of Ω Gaussians numbers and denoising

• We are given Ω Gaussian numbers Zi , i = 1, . . . , Ω, all sampled i.i.d. from a Gaus-


sian distribution N (0, ∆). Using the same reasoning as for the random energy
model, show that when log Ω is large, then the maximum value ωmax = max √log Zi


is, with high probability, 2∆. Check this prediction numerically, by computing
the average value of the maximum of Ω Gaussian numbers for many values of Ω.

• Imagine we are interested in a sparse Ω−dimensional vector x∗ . All of its compo-


nents are zero, expect a few (say one or two) that are O(1).
Unfortunately, we are not given x∗ , but instead y = x∗ + z, a vector where all the
component of x∗ have been polluted by independent random Gaussian noises
of variance ∆. Can you propose a simple algorithm to recover x∗ from y that
will work with high probability when Ω is large? For which value of ∆ will this
algorithm works? Test your algorithm in simulation.
Part IV

Statistics and machine learning


Chapter 16

Linear Estimation and Compressed


Sensing

He doesn’t play for the money he wins


He don’t play for respect
He deals the cards to find the answer
The sacred geometry of chance
The hidden law of a probable outcome

Sting – Shape of my heart, 1993

16.1 Problem Statement

The inference problem addressed in this section is the following. We observe a M dimensional
vector yµ , µ = 1, . . . , M that was created component-wise depending on a linear projection of
an unknown N dimensional vector xi , i = 1, . . . , N by a known matrix Fµi

N
!
X
yµ = fout Fµi xi , (16.1)
i=1

where fout is an output function that includes noise. Examples are the additive white Gaussian
noise (AWGN) were the output is given by fout (x) = x + ξ with ξ being random Gaussian
variables, or the linear threshold output where fout (x) = sign(x − κ), with κ being a threshold
value. The goal in linear estimation is to infer x from the knowledge of y, F and fout . An
alternative representation of the output function fout is to denote zµ = N i=1 Fµi xi and talk
P
about the likelihood of observing a given vector y given z

M
Y
P (y|z) = Pout (yµ |zµ ) . (16.2)
µ=1
236 F. Krzakala and L. Zdeborová

16.2 relaxed-BP

Figure 16.2.1: Factor graph of the linear estimation problem corresponding to the poste-
rior probability for generalized linear estimation . Circles represent the unknown variables,
whereas squares represent the interactions between variables.

We shall describe the main elements necessary for the reader to understand where the resulting
algorithm comes from. We are given the posterior distribution

M N N
1 Y Y X
P (x|y, F ) = Pout (yµ |zµ ) PX (xi ) , where zµ = Fµi xi , (16.3)
Z
µ=1 i=1 i=1

where the matrix Fµi has independent random entries (not necessarily identically distributed)
with mean and variance O(1/N ). This posterior probability distribution corresponds to a
graph of interactions yµ between variables (spins) xi called the graphical model as depicted in
Fig. 16.2.1.

A starting point in the derivation of AMP is to write the belief propagation algorithm cor-
responding to this graphical model. The matrix Fµi plays the role of randomly-quenched
disorder, the measurements yµ are planted disorder. As long as the elements of Fµi are inde-
pendent and their mean and variance of of order O(1/N ) the corresponding system is a mean
field spin glass. In the Bayes-optimal case (i.e. when the prior is matching the true empirical
distribution of the signal) the fixed point of belief propagation with lowest free energy then
provides the asymptotically exact marginals of the above posterior probability distribution.

For model such as (16.3) BP implements a message-passing scheme between nodes in the
graphical model of Fig. 16.2.1, ultimately allowing one to compute approximations of the
posterior marginals. Messages mi→µ are sent from the variable-nodes (circles) to the factor-
nodes (squared) and subsequent messages mµ→i are sent from factor-nodes back to variable-
nodes that corresponds to algorithm’s current “beliefs” about the probabilistic distribution of
16.2 relaxed-BP 237

the variables xi . It reads


1 Y
mi→µ (xi ) = PX (xi ) mγ→i (xi ) (16.4)
zi→µ
γ̸=µ
1
Z Y X
mµ→i (xi ) = [dxj mj→µ (xj )] Pout (yµ | Fµl xl ) (16.5)
zµ→i
j̸=i l

While being easily written, this BP is not computationally tractable, because every interaction
involves N variables, and the resulting belief propagation equations involve (N − 1)-uple
integrals. Furthermore, here we are considering continuous variable, thus the messages would
be densities and, as we have seen in Chapter 8, it would be quite hard dealing with them
numerically.

Two facts enable the derivation of a tractable BP algorithm: the central limit theorem, on
the one hand, and a projection of the messages to only two of their moments (also used in
algorithms such as Gaussian BP or non-parametric BP.
This results in the so-called relaxed-belief-propagation (r-BP): a form of equations that is
tractable and involves a pair of means and variances for every pair variable-interaction.

Let us go through the steps that allow to go from BP


P to r-BP and first concentrate our attention
to eq. (16.5). Consider the variable zµ = Fµi xi + j̸=i Fµj xj . In the summation inPeq. (16.5),
all xj s are independent, and thus we expect, according to the
P central limit theorem, j̸=i Fµj xj
to be a Gaussian random variable with mean ωµ→i = j̸=i Fµj aj→µ and variance Vµ→i =
j̸=i Fµj vj→µ , where we have denoted
2
P

Z Z
ai→µ ≡ dxi xi mi→µ (xi ) vi→µ ≡ dxi x2i mi→µ (xi ) − a2i→µ . (16.6)

We thus can replace the multi-dimensional integral in eq. (16.5) by a scalar Gaussian one over
the random variable z:
Z (z−ωµ→i −Fµi xi )2

mµ→i (xi ) ∝ dzµ Pout (yµ |zµ )e 2Vµ→i
. (16.7)

We have replaced a complicated multi-dimensional integral involving full probability distribu-


tions by a single scalar one that involves only first two moments.

One can further simplify the equations: The next step is to rewrite (z − ωµ→i 2
√ − Fµi xi ) =
(z − ωµ→i ) + Fµi xi − 2(z − ωµ→i )Fµi xi Fµi xi and to use the fact that Fµi is O(1/ N ) to expand
2 2 2

the exponential
Z (z−ωµ→i )2

mµ→i (xi ) ∝ dzµ Pout (yµ |zµ )e 2Vµ→i
(16.8)
 
2 2 1
× 1 + Fµi xi − 2(z − ωµ→i )Fµi xi + (z − ωµ→i )2 Fµi
2 2
xi + o(1/N ) .
2
At this point, it is convenient to introduce the output function gout , defined via the output
probability Pout as
(z−ω)2
dzPout (y|z) (z − ω) e− 2V
R
gout (ω, y, V ) ≡ (z−ω)2
. (16.9)
V dzPout (y|z)e− 2V
R
238 F. Krzakala and L. Zdeborová

The following useful identity holds for the average of (z − ω)2 /V 2 in the above measure:
(z−ω)2
dzPout (y|z) (z − ω)2 e− 2V
R
1
(z−ω)2
= 2
+ ∂ω gout (ω, y, V ) + gout (ω, y, V ) . (16.10)
V 2 dzPout (y|z)e− 2V
R V

Using definition (16.9), and re-exponentiating the xi -dependent terms while keeping all
the leading order terms, we obtain (after normalization) that the iterative form of equation
eq. (16.5) reads:
t )2
s
x2 (Bµ→i
Atµ→i − 2Ni Atµ→i +Bµ→i
t x
√i −
2At
(16.11)
N
mµ→i (t, xi ) = e µ→i
2πN
(16.12)

with
t
Bµ→i t
= gout (ωµ→i t
, yµ , Vµ→i ) Fµi and Atµ→i = −∂ω gout (ωµ→i
t t
, yµ , Vµ→i 2
) Fµi .

Given that mµ→i (t, xi ) can be written with a quadratic form in the exponential, we just write
it as a Gaussian distribution. We can now finally close the equations by writing (16.4) as a
product of these Gaussians with the prior
(xi −Ri→µ )2

mi→µ (xi ) ∝ PX (xi )e 2Σi→µ
(16.13)

so that we can finally give the iterative form of the r-BP algorithm Algorithm 1 (below) where
we denote by ∂ω (resp. ∂R ) the partial derivative with respect to variable ω (resp. R) and we
define the “input" functions as
(x−R)2
dx x PX (x) e− 2Σ
R
fa (Σ, R) = R (x−R)2
, fv (Σ, R) = Σ∂R fa (Σ, R) . (16.14)
dx PX (x) e− 2Σ

Examples of priors and outputs

The r-BP algorithm is written for a generic prior on the signal PX (as long as it is factorized
over the elements) and a generic element-wise output channel Pout . The algorithm depends
on their specific form only trough the function fa and gout defined by (16.14) and (16.9). It is
useful to give a couple of explicit examples.

The sparse prior that is most commonly considered in probabilistic compressed sensing is
the Gauss-Bernoulli prior, that is when in (??) we have ϕ(x) = N (x, σ) Gaussian with mean x
and variance σ.

For this prior the input function fa reads


(T −x)2 √
− 2(Σ+σ) Σ
ρe 3 (xΣ + T σ)
(Σ+σ) 2
faGauss−Bernoulli (Σ, T ) = 2 √ (T −x)2
, (16.25)
T −
(1 − ρ)e− 2Σ + ρ √Σ+σ Σ
e 2(Σ+σ)
16.3 State Evolution 239

The most commonly considered output channel is simply additive white Gaussian noise
(AWGN) (??).

The output function then reads

y−ω
AW GN
gout (ω, y, V ) = . (16.26)
∆+V

As we anticipated above, the example of linear estimation that was most broadly studied in
statistical physics is the case of the perceptron problem discussed in detail e.g. in Watkin et al.
(1993). In the perceptron problem each of the M N -dimensional patterns Fµ is multiplied by
a vector of synaptic weights xi in order to produce an output yµ according to

N
X
yµ = 1 if Fµi xi > κ , (16.27)
i=1
yµ = −1 otherwise , (16.28)

where κ is a threshold value independent of the pattern. The perceptron is designed to classify
patterns, i.e. one starts with a training set of patterns and their corresponding outputs yµ
and aims to learn the weights xi in such a way that the above relation between patterns and
outputs is satisfied. To relate this to the linear estimation problem above, let us consider the
perceptron problem in the teacher-student scenario where the teacher perceptron generated
the output yµ using some ground-truth set of synaptic weights x∗i . The student perceptron
knows only the patterns and the outputs and aims to learn the weights. How many patterns
are needed for the student to be able to learn the synaptic weights reliably? What are efficient
learning algorithms?

In the simplest case where the threshold is zero, κ = 0 one can redefine the patterns Fµi ←
Fµi yµ in which case the corresponding redefined output is yµ = 1. The output function in that
case reads
ω2
perceptron 1 e− 2V
gout (ω, V ) = √  , (16.29)
2πV H − √ω
V

where Z ∞
dt t2
H(x) = √ e− 2 . (16.30)
x 2π

In physics a case of a perceptron that was studied in detail is that of binary synaptic weights
xi ∈ {±1}. To take that into account in the G-AMP we consider the binary prior PX (x) =
[δ(x − 1) + δ(x + 1)]/2 which leads to the input function
 
T
fabinary (Σ, T ) = tanh . (16.31)
Σ

16.3 State Evolution

Let us try to write the cavity equation here, using our simplified formulation
240 F. Krzakala and L. Zdeborová

16.3.1 The hat variables

First, let us now see how Ri behaves (note the difference between the letters ω and w):

Ri X X
= Bµ→i = Fµi gout (ωµ→i , yµ , Vµ→i ) (16.32)
Σi µ µ
X X
= Fµi gout (ωµ→i , fout ( Fµj sj + Fµi si , ξµ ), V ) (16.33)
µ j̸=i
X X
= Fµi gout (ωµ→i , fout ( Fµj sj , ξµ ), V ) + si αm̂ , (16.34)
µ j̸=i

where we define

m̂ = Eω,z,ξ [∂z gout (ω, fout (z, ξ), V )] . (16.35)

We further write
Ri
(16.36)
p
∼ N (0, 1) αq̂ + si αm̂
Σi

with N (0, 1) being a random Gaussian variables of zero mean and unit variance, and where

(16.37)
 2 
q̂ = Eω,z,ξ gout (ω, fout (z, ξ) V )

Finally, we observe that Σ should not fluctuate between sites, and we thus expect they are
close to the value
Σ−1 (t) = αχ̂ (16.38)
where we have defined

χ̂ = −Eω,z,ξ [∂ω gout (ω, fout (z, ξ), V )] . (16.39)

16.3.2 Order parameters

Given these three variables, one can express the order parameters as simple integrals. Indeed,
if we define:

q = Es ER,Σ fa2 (Σ, R) (16.40)


 

m = Es [ER,Σ sfa (Σ, R)] (16.41)


V = Es [ER,Σ fc (Σ, R)] (16.42)

we have immediately, denoting ξ as a standard Gaussian variable:


h p i
q t = Es Eξ fa2 ((αχ̂t )−1 , (αχ̂t )−1 ( q̂ t + αm̂)sξ) (16.43)
h p i
mt = Es Eξ sfa ((αχ̂t )−1 , αχ̂t )−1 ( q̂ t + αm̂) (16.44)
h p i
V t = Es Eξ fc ((αχ̂t )−1 , αχ̂t )−1 ( q̂ t + αm̂) (16.45)
16.4 From r-BP to G-AMP 241

16.3.3 Back to the hats

We can now close the equation by writing the hat variables as a function of the order parameters.
First, we realize that from the definition, we have z and ω jointly Gaussian with covariance

q0 mt
    
z
∼ N 0, (16.46)
ω mt q t

So now, we can simply collect the definition, which become a 3-d integral:

m̂t+1 = Etω,z,ξ ∂z gout (ω, fout (z, ξ), V t ) . (16.47)


 

q̂ t+1 = Etω,z,ξ gout (16.48)


 2
(ω, fout (z, ξ), V t )


χ̂t+1 = −Etω,z,ξ ∂ω gout (ω, fout (z, ξ)), V t ) . (16.49)


 

16.4 From r-BP to G-AMP

While one can implement and run the r-BP, it can be simplified further without changing the
leading order behavior of the marginals by realizing that the dependence of the messages
on the target node is weak. This is exactly what we have done to go from the standard belief
propagation to the TAP equations in sec. 9.

After the corresponding Taylor expansion the corrections add up into the so-called Onsager
reaction terms Thouless et al. (1977). The final G-AMP iterative equations are written in terms
of means ai and their variances ci for each of the variables xi . The whole derivation is done in
a way that the leading order of the marginal ai are conserved. Given the BP was asymptotically
exact for the computations of the marginals so is the G-AMP in the computations of the means
and variances. Let us define
X
ωµt+1 = Fµi ati→µ , (16.50)
i
X
Vµt+1 = 2 t
Fµi vi→µ , (16.51)
i
1
Σt+1
i =P t+1 , (16.52)
µ Aµ→i
P t+1
µ Bµ→i
Rit+1 =P t . (16.53)
µ Aµ→i

First we notice that the correction to V are small so that


X X
Vµt+1 = 2 t
Fµi vi→µ ≈ 2 t
Fµi vi . (16.54)
i i

From now on we will use the notation ≈ for equalities that are true only in the leading order.
242 F. Krzakala and L. Zdeborová

From the definition of A and B we get


" #−1 " #−1
X X
2 t+1 t+1 2 t+1 t+1
Σi = − Fµi ∂ω gout (ωµ→i , yµ , Vµ→i ) ≈ − Fµi ∂ω gout (ωµ , yµ , Vµ )
µ µ
" #−1 " #
X X
2 t+1 t+1
Ri = − Fµi ∂ω gout (ωµt+1 , yµ , Vµt+1 ) × Fµi gout (ωµ→i , yµ , Vµ→i )
µ µ

Then we write that


t+1
gout (ωµ→i t+1
, yµ , Vµ→i ) ≈ gout (ωµt+1 , yµ , Vµt+1 ) − Fµi ati→µ ∂ω gout (ωµt+1 , yµ , Vµt+1 ) (16.55)
≈ gout (ωµt+1 , yµ , Vµt+1 ) − Fµi ati ∂ω gout (ωµt+1 , yµ , Vµt+1 ) (16.56)
So that finally
" #−1
X
(Σi )t+1 = − 2
Fµi ∂ω gout (ωµt+1 , yµ , Vµt+1 ) (16.57)
µ
" #
X
t+1 −1
Rit+1 = ((Σi ) ) × Fµi gout (ωµ , yµ , Vµ ) − 2
Fµi ai ∂ω gout (ωµ , yµ , Vµ ) (16.58)
µ
X
= ati + ((Σi )t+1 )−1 Fµi gout (ωµt+1 , yµ , Vµt+1 ) (16.59)
µ

Now let us consider ai→µ :

ati→µ = fa (Ri→µ
t t
, Σi→µ ) ≈ fa (Ri→µ , Σi ) (16.60)
≈ fa (Rit , Σi ) − Bµ→i
t
fv (Rit , Σi ) (16.61)
≈ ati − gout (ωµt , yµ , Vµt )Fµi vit (16.62)

So that
X X X
ωµt+1 = Fµi ati − gout (ωµt , yµ , Vµt )Fµi
2 t
vi = Fµi ai − Vµt gout (ωµt , yµ , Vµt ) (16.63)
i i i

An important aspect of these equations to note is the index t − 1 in the Onsager reaction term
in eq. (16.65) that is crucial for convergence and appears for the same reason as in the TAP
equations in sec. 9. Note that the whole algorithm is comfortably written in terms of matrix
multiplications only, this is very useful for implementations where fast linear algebra can be
used.

16.5 Rigorous results for Bayes and Convex Optimization

16.5.1 Bayes-Optimal

In the Bayes-optimal case, we have the Nishimori condition so that q = m and V = q0 − m,


together with q̂ = m̂ = χ̂, and therefore, we can reduce everything to simpler equations.
16.6 Application: LASSO, Sparse estimation, and Compressed sensing 243

In fact, one can prove rigorously the following theorem: the “the averaged free energy” in
the statistical physics literature, in the asymptotic limit –when the number of variable is
growing—of the entropy density of the variable Y reads:
H(Y |Φ) rq
lim = − sup inf ψP0 (r) + αΨPout (q; ρ) − . (16.70)
n→∞ n r≥0 q∈[0,ρ] 2
where q0 = ρ = E[x2 ] and, denoting Gaussian integrals as D,

Z Z
2
ψP0 (r) := DZdP0 (X0 ) ln dP0 (x)erxX0 + rxZ−rx /2 , (16.71)
√ √
Z
ΨPout (q; ρ) := DV DW dYe0 Pout (Ye0 | q V + ρ − q W )
√ √
Z
ln DwPout (Ye0 | q V + ρ − q w) (16.72)

It is a simple check to see that q follows state evolution with the Nishimori points:

mt+1 = ψP′ 0 (m̂t )/2 (16.73)


t
m̂ = αΨ′Pout (mt ; ρ)/2 (16.74)

16.5.2 Bayes-Optimal

Another set of problem where the eisults is rigorous is when the optimization corresponds a
convex problem

16.6 Application: LASSO, Sparse estimation, and Compressed sens-


ing

16.6.1 AMP with finite ∆

In this case fout


AWGN = x + ξ and the output function reads

y−ω
AW GN
gout (ω, y, V ) = (16.75)
∆+V

16.6.2 LASSO: infinite ∆

Let us change variable: we now denote V = ∆Ṽ and Σ = ∆Σ̃. When ∆ is large fa and fb
can now be evaluated as Laplace integrals, and finally, all equation closes on the orginal algo
where ∆ is replaced by one, and where the update function are given by:
(x−R)2
dx x PX (x) e− 2∆Σ̃ (x − R)2
R
faMAP (Σ, R) = R (x−R)2
, = argmin − + λfreg (x) (16.82)
dx PX (x) e −
2∆Σ̃
2Σ̃

fv (∆Σ̃, R) = ∆Σ̃∂R fa (Σ̃, R) . (16.83)


244 F. Krzakala and L. Zdeborová

For instance, for a ℓ1 regularization, we find faMAP (Σ, R) = argmin(x − R)2 /2Σ̃ + λ|x| so that

faST (Σ, R) = 0 if|R| < λΣ (16.84)


= R − λΣ ifR > λΣ (16.85)
= R + λΣ ifR < λΣ (16.86)
(16.87)

and

fcST (Σ, R) = 0 if|R| < λΣ (16.88)


= Σ otherwise (16.89)

Phase diagram for noiseless CS with Bernoulli-Gaussian


1.0
ρ=α
AMP

0.8
Easy Phase
rows to columns ratio α

0.6
e
as
Ph
rd
Ha

0.4

Impossible Phase

0.2

0.0
0.0 0.2 0.4 0.6 0.8 1.0
fraction of nonzeros ρ

Figure 16.6.1: Phase diagram of noiseless compressed sensing for Gauss-Bernulli prior

• In easy phase, AMP can find the optimal solution.


• In hard phase, it is possible to find the optimal solution, for example by exhaustive
search, but AMP gives the wrong answer because it get stuck at a local minima.
• In impossible phase, no algorithm can find the solution.

Conjecture: In the “hard” phase, no algorithm works on polynomial time.


16.7 Exercises 245

Bibliography

Donoho et al. (2009), Krzakala et al. (2012) and Zdeborová and Krzakala (2016).

16.7 Exercises

Exercise 16.1: AMP for sparse estimation


Consider a N -dimensional sparse bernoulli-signal x0 where each element is sampled
from
x0 ∼ (1 − ρ)δ(x) + ρN (0, 1) (16.96)
Say you measure y = F x0 where F is a M × N random matrix as in the main chapter.
Run both the Bayes optimal and the LASSO version of AMP for the problem of estimating
x0 from y for different values of ρ and α, using moderate values of N (few hundreds)
Show that both algorithm can be efficient.
Bonus point: compare with the state evolution of each algorithm.
246 F. Krzakala and L. Zdeborová

Input: y
Initialize: ai→µ (t = 0),vi→µ (t = 0), t = 1
repeat
r-BP Update of {ωµ→i , Vµ→i }
X
Vµ→i (t) ← 2
Fµj vj→µ (t − 1) (16.15)
j̸=i
X
ωµ→i (t) ← Fµj aj→µ (t − 1) (16.16)
j̸=i

r-BP Update of {Aµ→i , Bµ→i }

Bµ→i (t) ← gout (ωµ→i (t), yµ , Vµ→i (t)) Fµi , (16.17)


Aµ→i (t) ← 2
−∂ω gout (ωµ→i (t), yµ , Vµ→i (t))Fµi (16.18)

r-BP Update of {Rµ→i , Σµ→i }

1
Σi→µ (t) ← P (16.19)
ν̸=µ ν→i (t)
A
X
Ri→µ (t) ← Σi→µ (t) Bν→i (t) (16.20)
ν̸=µ

AMP Update of the estimated partial marginals ai→µ (t),vi→µ (t)

ai→µ ← fa (Σi→µ (t), Ri→µ (t)) , (16.21)


vi→µ ← fv (Σi→µ (t), Ri→µ (t)) . (16.22)

t←t+1
until Convergence on ai→µ (t),vi→µ (t)
output: Estimated marginals mean and variances:
 P 
ai ← fa P A 1
, Pν Bν→i , (16.23)
ν→i ν Aν→i
 ν P 
Bν→i
1
vi ← fv P Aν→i , Pν Aν→i . (16.24)
ν ν

Algorithm 1: relaxed Belief-Propagation (r-BP)


16.7 Exercises 247

Input: y
Initialize: a0 ,v0 , t = 1 gout,µ
0

repeat
AMP Update of ωµ , Vµ
X
Vµt ← 2 t−1
Fµi vi (16.64)
i
X
ωµt ← Fµi at−1
i
t−1
− Vµt gout (16.65)
i

AMP Update of Σi , Ri and gout,µ


" #−1
X
Σti ← − 2
Fµi ∂ω gout (ωµt , yµ , Vµt ) (16.66)
µ
X
Rit ← at−1
i + (Σti )−1 Fµi gout (ωµt , yµ , Vµt ) (16.67)
µ

AMP Update of the estimated marginals ai , vi

at+1
i ← fa (Σi , Rit+1 , ) (16.68)
vit+1 ← fv (Σi , Rit+1 ) (16.69)

t←t+1
until Convergence on a,v
output: a,v.
Algorithm 2: Generalized Approximate Message Passing (G-AMP)
248 F. Krzakala and L. Zdeborová

Input: y
Initialize: a0 ,v0 , t = 1 gout,µ
0

repeat
AMP Update of ωµ , Vµ

1 X t−1
Vt ← vi (16.76)
N
i
X
ωµt ← Fµi at−1
i
t−1
− Vµt gout (16.77)
i

AMP Update of Σi , Ri and gout,µ

∆+Vt
Σt ← (16.78)
α
X y µ − ωµ
Rit t−1
← ai + Fµi (16.79)
µ
α

AMP Update of the estimated marginals ai , vi

at+1
i ← fa (Σt , Rit+1 , ) (16.80)
vit+1 ← fv (Σ t
, Rit+1 ) (16.81)

t←t+1
until Convergence on a,v
output: a,v.
Algorithm 3: Approximate Message Passing (AMP) for Linear estimation
16.7 Exercises 249

Input: y
Initialize: a0 ,v0 , t = 1 gout,µ
0

repeat
AMP Update of ωµ , Vµ

1 X t−1
Vt ← vi (16.90)
N
i
X
ωµt ← Fµi at−1
i
t−1
− Vµt gout (16.91)
i

AMP Update of Σi , Ri and gout,µ

1+Vt
Σt ← (16.92)
α
X yµ − ωµ
Rit ← at−1
i + Fµi (16.93)
µ
α

AMP Update of the estimated marginals ai , vi

at+1
i ← faST (Σt , Rit+1 , ) (16.94)
vit+1 ← fvST (Σt , Rit+1 ) (16.95)

t←t+1
until Convergence on a,v
output: a,v.
Algorithm 4: Approximate Message Passing (AMP) for LASSO
Chapter 17

Perceptron

17.1 Perceptron

1943, McCulloch-Pitts model: y = Heavyside(ξ · w − κ), where ξ is the signal coming through
dendrites, w is the synaptic weights.

1958, Rosenblatt build McCulloch-Pitts model mechanically: Given set of ξ patterns: {ξ1 , . . . , ξn }
and objectives yi ∈ {0, 1}. Goal is given new pattern ξnew , determine its objective.

1969, Minsky and Papert introduced Perceptron: the McCulloch-Pitts model is leaning a
separating hyperplane, and hence cannot learn cases that are not linearly separable, for
example, the XOR problem.

But by embedding the points in a higher dimensional space, it is possible to find a separating
hyperplane, this is called representation learning.

• Idea 1 – linear embedding: the points will still live in a low-dimensional space, BAD.

• Idea 2 – Nonlinear embedding, e.g. kernels, Deep neuron networks.


252 F. Krzakala and L. Zdeborová

Statistical Physics to Perceptron

Now we change the Heavyside function to sign function, since the bijection from {0, 1} to
{−1, 1} does not change the analysis but in physics it is more common to use ±1 spins.

Capacity

1988, Derrida, Gartner.

Consider we have M length-N binary patterns ξµ , with µ = 1, . . . , M . The labels yµ ∈ {−1, 1}


are generated i.i.d.

Goal: learn w ∈ {−1, +1}N such that we can make prediction ŷ(x) = sign(ξ · w) and minimize
the cost function   X
H w; {ξµ , yµ }M
µ=1 = M − δ (yµ , ŷ(ξµ ))
µ

Let α = M/N , it is observed that there exists a threshold αc such that at in the large-N limit
with α fixed
 
• When α < αc , w.h.p. there exists some w so that the cost function H w; {ξµ , yµ }M
µ=1 =
0.
 
• When α > αc , w.h.p. for all w the cost function H w; {ξµ , yµ }M
µ=1 > 0.

And the storage capacity is defined as Mcapacity = αc N .

Replica Method

1989, Knuth, Mezard, Storage capacity of memory networks with binary couplings

Let’s consider the partition function of Boltzmann distribution under inverse temperature β
X   
Z(β; {ξµ , yµ }M
µ=1 ) = exp −βH w; {ξµ , yµ }M
µ=1
w

As β → ∞, the partition function will converge to the number of ‘good’ w, i.e.


n   o
lim Z(β; {ξµ , yµ }M
µ=1 ) = w H w; {ξµ , yµ }M
µ=1 = 0
β→∞

We are interested in the average free entropy for all possible ‘training set’ {ξµ , yµ }M
µ=1
 
log Z(β; {ξµ , yµ }M
µ=1 )
Φ(β, α) = lim E{ξµ ,yµ }M  
N →∞ µ=1 N
17.1 Perceptron 253

By replica method, we can compute


   r 
1 1
Z Z
  p  q
ΦRS (β, α) = Extq,q̂ q q̂ − q̂ + Dz log 2 cosh z q̂ + α Dz log e + (1 − e )H z
−β −β
2 2 1−q
with
e−u
∞ 2 /2  
1
Z
x
H(x) = du √ = erfc √
x 2π 2 2

It is still an open problem to rigorously prove that ΦRS (β, α) is correct under αc ≈ 0.83, we
just believe it is correct from the replica formula. When α > αc , the ΦRS is incorrect, one needs
to turn to 1RSB analysis.

Teacher-Student Scenario
• Teacher: there is a y ∗ = sign(ξ · w∗ ), where w∗ is also binary.
{ξ1 , . . . , ξM } is still random but their corresponding {y1 , . . . , yM } are generated accord-
ing to the above rule.
• Student: Need to learn the rule
– There is no capacity here because there is a true w∗ .
– How many patterns does the student need to see to learn the rules?

Treat ξµ are the µ-th row of matrix F, treat w∗ as the true signal x∗ we want to recover, then
y = sign (F x)

Let’s put this problem in a probabilistic framework. We introduce a Gaussian noise inside the
sgn, so as to have a probit likelihood; moreover we will assume that the weights are binary,
x ∈ {+1, −1}N . The full model reads

 1 1
xi ∼ δ(xi − 1) + δ(xi + 1),

2 2
1
y ∼ erfc √ 1 
 µ Fµ · x .
2 2σ 2

Hence, this problem is same as the generalized linear model, and we can use AMP to solve it.

Let PX (x) be the prior of xi and ρ = E[x2 ]. Then the replica formula reads
 
log (Z (F, y))
lim EF,y = Extq,q̂ ΦRS (q, q̂)
N →∞ N
where
∗√ √
Z 
√  √ 
Z
α ∗
ΦRS (q, q̂) = − q q̂ + α dx dy Dz Pout y x ρ − q + z q × log dx Pout y x ρ − q + z q
2
√ √
Z Z 
+ dx∗ Dz PX (x∗ )e− 2 (x ) +zx q̂ × log dx PX (x)e− 2 x +zx q̂
αq̂ ∗ 2 ∗ αq̂ ∗

q ∗ is the extremizer of ΦRS , and equals to ⟨x∗ , x̂⟩F,ψ .


254 F. Krzakala and L. Zdeborová

AMP State Evolution

 
p2 (x−p)2
exp − 2q(t) − 2 ρ−q(t)
( )
Z Z  
q̂ (t) = − dp dx q  dy P out (y | x) ∂ p gout p, y, ρ − q (t)

2π q (t) ρ − q (t)
!
1
Z Z
z
q (t+1) = dx PX (x) Dz fa2 ,x + p
αq̂ (t) αq̂ (t)

• Impossible Phase: When α < 1.245, q = 1 is not the global maxima of ΦRS , so it is
impossible to find a solution.

• Hard Phase: When α ∈ (1.245, 1.493), q = 1 is the global maxima, but there is another
local maxima at which AMP will get stuck.

• Easy Phase: When α > 1.493, q = 1 is the only local maxima, AMP will always converge
to it.
17.1 Perceptron 255

α = 1.2 α = 1.3
0.78

0.72 0.76
ΦRS(q)

0.74
0.70
0.72

0.68 0.70

10−5 10−3 10−1 10−5 10−3 10−1

α = 1.4 α = 1.5

0.85
0.80
ΦRS(q)

0.80
0.75
0.75

0.70 0.70
10−5 10−3 10−1 10−5 10−3 10−1
q q

Figure 17.1.1: Free entropy of a perceptron for different values of α


Part V

Appendix
Chapter 18

A bit of probabilty theory

But wait, if I could shake the crushing weight of expectations?


Would that free some room up for joy? Or relaxation? Or simple
pleasure? Instead, we measure!

Lin-Manuel Miranda, Surface Pressure, Encanto – 2021


Bibliography

Abbe, E. and Montanari, A. (2013). Conditional random fields, planted constraint satisfaction
and entropy concentration. In Approximation, Randomization, and Combinatorial Optimization.
Algorithms and Techniques, pages 332–346. Springer.

Achlioptas, D. and Coja-Oghlan, A. (2008). Algorithmic barriers from phase transitions. In


2008 49th Annual IEEE Symposium on Foundations of Computer Science, pages 793–802. IEEE.

Achlioptas, D. and Moore, C. (2003). Almost all graphs with average degree 4 are 3-colorable.
Journal of Computer and System Sciences, 67(2):441–471.

Aizenman, M., Sims, R., and Starr, S. L. (2003). Extended variational principle for the
sherrington-kirkpatrick spin-glass model. Physical Review B, 68(21):214403.

Aubin, B., Loureiro, B., Maillard, A., Krzakala, F., and Zdeborová, L. (2020). The spiked matrix
model with generative priors. IEEE Transactions on Information Theory.

Bai, Z. and Silverstein, J. W. (2010). Spectral analysis of large dimensional random matrices,
volume 20. Springer.

Baik, J., Arous, G. B., Péché, S., et al. (2005). Phase transition of the largest eigenvalue for
nonnull complex sample covariance matrices. Annals of Probability, 33(5):1643–1697.

Barbier, J., Dia, M., Macris, N., Krzakala, F., Lesieur, T., and Zdeborová, L. (2016). Mutual
information for symmetric rank-one matrix estimation: A proof of the replica formula. arXiv
preprint arXiv:1606.04142.

Barbier, J. and Macris, N. (2019). The adaptive interpolation method: a simple scheme to prove
replica formulas in bayesian inference. Probability theory and related fields, 174(3):1133–1185.

Bayati, M. and Montanari, A. (2011). The dynamics of message passing on dense graphs, with
applications to compressed sensing. IEEE Transactions on Information Theory, 57(2):764–785.

Berthier, R., Montanari, A., and Nguyen, P.-M. (2020). State evolution for approximate
message passing with non-separable functions. Information and Inference: A Journal of the
IMA, 9(1):33–79.

Bethe, H. A. (1935). Statistical theory of superlattices. Proceedings of the Royal Society of London.
Series A-Mathematical and Physical Sciences, 150(871):552–575.

Bolthausen, E. (2009). On the high-temperature phase of the sherrington-kirkpatrick model. In Seminar at EURANDOM, Eindhoven.

Bolthausen, E. (2014). An iterative construction of solutions of the tap equations for the
sherrington–kirkpatrick model. Communications in Mathematical Physics, 325(1):333–366.

Boucheron, S., Lugosi, G., and Massart, P. (2013). Concentration inequalities: A nonasymptotic
theory of independence. Oxford university press.

Braunstein, A., Dall’Asta, L., Semerjian, G., and Zdeborová, L. (2016). The large deviations
of the whitening process in random constraint satisfaction problems. Journal of Statistical
Mechanics: Theory and Experiment, 2016(5):053401.

Chertkov, M. (2008). Exactness of belief propagation for some graphical models with loops.
Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10016.

Chertkov, M. and Chernyak, V. Y. (2006). Loop series for discrete statistical models on graphs.
Journal of Statistical Mechanics: Theory and Experiment, 2006(06):P06009.

Coja-Oghlan, A. (2013). Upper-bounding the k-colorability threshold by counting covers. arXiv preprint arXiv:1305.0177.

Coja-Oghlan, A., Krzakala, F., Perkins, W., and Zdeborová, L. (2018). Information-theoretic
thresholds from the cavity method. Advances in Mathematics, 333:694–795.

Coja-Oghlan, A. and Vilenchik, D. (2013). Chasing the k-colorability threshold. In 2013 IEEE
54th Annual Symposium on Foundations of Computer Science, pages 380–389. IEEE.

Cover, T. M. and Thomas, J. A. (1991). Information theory and statistics. Elements of Information
Theory, 1(1):279–335.

Curie, P. (1895). Propriétés magnétiques des corps a diverses températures. Number 4. Gauthier-
Villars et fils.

de Almeida, J. R. and Thouless, D. J. (1978). Stability of the sherrington-kirkpatrick solution of a spin glass model. Journal of Physics A: Mathematical and General, 11(5):983.

Debye, P. (1909). Näherungsformeln für die Zylinderfunktionen für große Werte des Arguments und unbeschränkt veränderliche Werte des Index. Math. Ann., 67:535–558.

Dembo, A., Montanari, A., et al. (2010a). Gibbs measures and phase transitions on sparse
random graphs. Brazilian Journal of Probability and Statistics, 24(2):137–211.

Dembo, A., Montanari, A., et al. (2010b). Ising models on locally tree-like graphs. The Annals
of Applied Probability, 20(2):565–592.

Dembo, A., Montanari, A., Sly, A., and Sun, N. (2014). The replica symmetric solution for
potts models on d-regular graphs. Communications in Mathematical Physics, 327(2):551–575.

Dembo, A., Zeitouni, O., and Fleischmann, K. (1996). Large deviations techniques and applications. Jahresbericht der Deutschen Mathematiker Vereinigung, 98(3):18–18.

Derrida, B. (1980). Random-energy model: Limit of a family of disordered models. Physical Review Letters, 45(2):79.

Derrida, B. (1981). Random-energy model: An exactly solvable model of disordered systems. Physical Review B, 24(5):2613.

Donoho, D. L., Johnstone, I. M., et al. (1998). Minimax estimation via wavelet shrinkage. The
annals of Statistics, 26(3):879–921.

Donoho, D. L., Maleki, A., and Montanari, A. (2009). Message-passing algorithms for com-
pressed sensing. Proceedings of the National Academy of Sciences, 106(45):18914–18919.

Edwards, S. F., Goldbart, P. M., Goldenfeld, N., Sherrington, D. C., and Edwards, S. F., editors
(2005). Stealing the gold: a celebration of the pioneering physics of Sam Edwards. Number 126 in
International series of monographs on physics. Clarendon Press ; Oxford University Press,
Oxford : New York.

Edwards, S. F. and Jones, R. C. (1976). The eigenvalue spectrum of a large symmetric random
matrix. Journal of Physics A: Mathematical and General, 9(10):1595.

El Alaoui, A. and Krzakala, F. (2018). Estimation in the spiked wigner model: A short proof
of the replica formula. In 2018 IEEE International Symposium on Information Theory (ISIT),
pages 1874–1878. IEEE.

Gallager, R. (1962). Low-density parity-check codes. IRE Transactions on information theory, 8(1):21–28.

Gerbelot, C. and Berthier, R. (2021). Graph-based approximate message passing iterations. arXiv preprint arXiv:2109.11905.

Gross, D. J. and Mézard, M. (1984). The simplest spin glass. Nuclear Physics B, 240(4):431–452.

Guerra, F. (2003). Broken replica symmetry bounds in the mean field spin glass model.
Communications in mathematical physics, 233(1):1–12.

Guo, D., Shamai, S., and Verdú, S. (2005). Mutual information and minimum mean-square
error in gaussian channels. IEEE transactions on information theory, 51(4):1261–1282.

Iba, Y. (1999). The nishimori line and bayesian statistics. Journal of Physics A: Mathematical and
General, 32(21):3875.

Jaeger, G. (1998). The ehrenfest classification of phase transitions: Introduction and evolution.
Arch Hist Exact Sc., 53:51–81.

Johnstone, I. M. (2001). On the distribution of the largest eigenvalue in principal components analysis. Annals of statistics, pages 295–327.

Kadanoff, L. P. (2009). More is the same; phase transitions and mean field theories. Journal of
Statistical Physics, 137(5):777–797.

Kesten, H. and Stigum, B. P. (1967). Limit theorems for decomposable multi-dimensional galton-watson processes. Journal of Mathematical Analysis and Applications, 17(2):309–338.

Korada, S. B. and Macris, N. (2009). Exact solution of the gauge symmetric p-spin glass model
on a complete graph. Journal of Statistical Physics, 136(2):205–230.

Krzakala, F., Mézard, M., Sausset, F., Sun, Y., and Zdeborová, L. (2012). Probabilistic recon-
struction in compressed sensing: algorithms, phase diagrams, and threshold achieving
matrices. Journal of Statistical Mechanics: Theory and Experiment, 2012(08):P08009.

Krzakala, F., Xu, J., and Zdeborová, L. (2016). Mutual information in rank-one matrix estima-
tion. In 2016 IEEE Information Theory Workshop (ITW), pages 71–75. IEEE.

Laplace, P. S. d. (1774). Mémoire sur les probabilités des causes par les événements.

Lelarge, M. and Miolane, L. (2019). Fundamental limits of symmetric low-rank matrix estima-
tion. Probability Theory and Related Fields, 173(3):859–929.

Lesieur, T., Krzakala, F., and Zdeborová, L. (2017). Constrained low-rank matrix estimation:
Phase transitions, approximate message passing and applications. Journal of Statistical
Mechanics: Theory and Experiment, 2017(7):073403.

Livan, G., Novaes, M., and Vivo, P. (2018). Introduction to random matrices: theory and practice,
volume 26. Springer.

McGrayne, S. B. (2011). The theory that would not die. Yale University Press.

Mézard, M. and Parisi, G. (1999). Thermodynamics of glasses: A first principles computation. Journal of Physics: Condensed Matter, 11(10A):A157.

Mézard, M. and Parisi, G. (2001). The bethe lattice spin glass revisited. The European Physical
Journal B-Condensed Matter and Complex Systems, 20(2):217–233.

Mézard, M. and Parisi, G. (2003). The cavity method at zero temperature. Journal of Statistical
Physics, 111(1):1–34.

Mézard, M., Parisi, G., Sourlas, N., Toulouse, G., and Virasoro, M. (1984). Nature of the
spin-glass phase. Physical review letters, 52(13):1156.

Mézard, M., Parisi, G., and Virasoro, M. (1987a). SK model: The replica solution without replicas. In Spin Glass Theory and Beyond: An Introduction to the Replica Method and Its Applications, pages 232–237. World Scientific.

Mézard, M., Parisi, G., and Virasoro, M. A. (1987b). Spin glass theory and beyond: An Introduction
to the Replica Method and Its Applications, volume 9. World Scientific Publishing Company.

Mézard, M., Parisi, G., and Zecchina, R. (2002). Analytic and algorithmic solution of random
satisfiability problems. Science, 297(5582):812–815.

Miolane, L. (2017). Fundamental limits of low-rank matrix estimation: the non-symmetric case. arXiv preprint arXiv:1702.00473.

Monasson, R. (1995). Structural glass transition and the entropy of the metastable states.
Physical review letters, 75(15):2847.

Nattermann, T. (1998). Theory of the random field ising model. In Spin glasses and random
fields, pages 277–298. World Scientific.

Nishimori, H. (1980). Exact results and critical properties of the ising model with competing
interactions. Journal of Physics C: Solid State Physics, 13(21):4071.

Nishimori, H. (1993). Optimum decoding temperature for error-correcting codes. Journal of the Physical Society of Japan, 62(9):2973–2975.

Opper, M. and Saad, D. (2001). Advanced mean field methods: Theory and practice. MIT press.

Parisi, G. (1979). Infinite number of order parameters for spin-glasses. Physical Review Letters,
43(23):1754.

Parisi, G. (1980). A sequence of approximated solutions to the sk model for spin glasses.
Journal of Physics A: Mathematical and General, 13(4):L115.

Parisi, G. (1983). Order parameter for spin-glasses. Physical Review Letters, 50(24):1946.

Pearl, J. (1982). Reverend Bayes on inference engines: A distributed hierarchical approach. Cognitive
Systems Laboratory, School of Engineering and Applied Science . . . .

Peierls, R. (1936). Statistical theory of adsorption with interaction between the adsorbed
atoms. In Mathematical Proceedings of the Cambridge Philosophical Society, volume 32, pages
471–476. Cambridge University Press.

Potters, M. and Bouchaud, J.-P. (2020). A First Course in Random Matrix Theory: For Physicists,
Engineers and Data Scientists. Cambridge University Press.

Rangan, S. (2011). Generalized approximate message passing for estimation with random
linear mixing. In 2011 IEEE International Symposium on Information Theory Proceedings, pages
2168–2172. IEEE.

Rangan, S. and Fletcher, A. K. (2012). Iterative estimation of constrained rank-one matrices in noise. In 2012 IEEE International Symposium on Information Theory Proceedings, pages 1246–1250. IEEE.

Ruozzi, N. (2012). The bethe partition function of log-supermodular graphical models. Advances in Neural Information Processing Systems, 25.

Saade, A., Krzakala, F., and Zdeborová, L. (2017). Spectral bounds for the ising ferromag-
net on an arbitrary given graph. Journal of Statistical Mechanics: Theory and Experiment,
2017(5):053403.

Schwarz, A., Bluhm, J., and Schröder, J. (2020). Modeling of freezing processes of ice floes
within the framework of the tpm. Acta Mechanica, 231.

Shannon, C. E. (1948). A mathematical theory of communication. The Bell system technical journal, 27(3):379–423.

Taillefer, L. (2010). Scattering and pairing in cuprate superconductors. Annual Review of Condensed Matter Physics, 1.

Thouless, D. J., Anderson, P. W., and Palmer, R. G. (1977). Solution of ’solvable model of a
spin glass’. Philosophical Magazine, 35(3):593–601.

Touchette, H. (2009). The large deviation approach to statistical mechanics. Physics Reports,
478(1):1–69.

Wainwright, M. J. and Jordan, M. I. (2008). Graphical models, exponential families, and variational
inference. Now Publishers Inc.

Watkin, T. L. H., Rau, A., and Biehl, M. (1993). The statistical mechanics of learning a rule.
Reviews of Modern Physics, 65(2):499–556.

Weiss, P. (1907). L’hypothèse du champ moléculaire et la propriété ferromagnétique. J. Phys. Theor. Appl., 6(1):661–690.

Weiss, P. (1948). The application of the bethe-peierls method to ferromagnetism. Physical Review, 74(10):1493.

Wigner, E. P. (1958). On the distribution of the roots of certain symmetric matrices. Annals of
Mathematics, pages 325–327.

Willsky, A., Sudderth, E., and Wainwright, M. J. (2007). Loop series and bethe variational
bounds in attractive graphical models. Advances in neural information processing systems, 20.

Yedidia, J. S., Freeman, W. T., Weiss, Y., et al. (2003). Understanding belief propagation and its
generalizations. Exploring artificial intelligence in the new millennium, 8(236-239):0018–9448.

Zdeborová, L. and Krzakala, F. (2016). Statistical physics of inference: thresholds and algo-
rithms. Advances in Physics, 65(5):453–552.
