Supp 2
Kevin Murphy
1 Introduction 7
I Fundamentals 9
2 Probability 11
2.1 More fun with Gaussians . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1.1 Deriving the conditionals of an MVN . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1.2 Deriving Bayes rule for linear Gaussian systems . . . . . . . . . . . . . . . . . . . . . . 13
2.1.3 Sensor fusion with unknown measurement noise . . . . . . . . . . . . . . . . . . . . . . 14
2.2 Google’s PageRank algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.1 Retrieving relevant pages using inverted indices . . . . . . . . . . . . . . . . . . . . . . 16
2.2.2 The PageRank score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.3 Efficiently computing the PageRank vector . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.4 Web spam . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.5 Personalized PageRank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3 Statistics 21
3.1 Bayesian concept learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1.1 Learning a discrete concept: the number game . . . . . . . . . . . . . . . . . . . . . . 21
3.1.2 Learning a continuous concept: the healthy levels game . . . . . . . . . . . . . . . . . 26
3.2 Informative priors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2.1 Domain specific priors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2.2 Gaussian prior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2.3 Power-law prior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2.4 Erlang prior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4 Graphical models 33
4.1 More examples of DGMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.1.1 The QMR network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.1.2 Genetic linkage analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.2 More examples of UGMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.2.1 Hopfield networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.2.2 Restricted Boltzmann machines (RBMs) in more detail . . . . . . . . . . . . . . . . . 39
4.2.3 Feature induction for a maxent spelling model . . . . . . . . . . . . . . . . . . . . . . 41
4.2.4 Relational UGMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.2.5 Markov logic networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5 Information theory 45
6 Optimization 47
6.1 Proximal methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
6.1.1 Proximal operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
6.1.2 Computing proximal operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
6.1.3 Proximal point methods (PPM) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
6.1.4 Mirror descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
6.1.5 Proximal gradient method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
6.1.6 Alternating direction method of multipliers (ADMM) . . . . . . . . . . . . . . . . . . 55
6.2 Local search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
6.2.1 Stochastic local search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
6.2.2 Tabu search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
6.2.3 Random search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.3 Population-based optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.3.1 Evolutionary algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6.3.2 Metaheuristic algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
6.3.3 Estimation of distribution algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
6.3.4 Cross-entropy method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
6.3.5 Natural evolutionary strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
6.4 Dynamic programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
6.4.1 Example: computing Fibonacci numbers . . . . . . . . . . . . . . . . . . . . . . . . . . 66
6.4.2 ML examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
6.5 Conjugate duality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
6.5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6.5.2 Example: exponential function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6.5.3 Conjugate of a conjugate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
6.5.4 Bounds for the logistic (sigmoid) function . . . . . . . . . . . . . . . . . . . . . . . . . 69
II Inference 71
7 Inference algorithms: an overview 73
8 State-space inference 75
10 Variational inference 85
10.1 Exact and approximate inference for PGMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
10.1.1 Exact inference as VI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
10.1.2 Mean field VI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
10.1.3 Loopy belief propagation as VI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
10.1.4 Convex belief propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
10.1.5 Tree-reweighted belief propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
10.1.6 Other tractable versions of convex BP . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
13 Sequential Monte Carlo (SMC) inference 97
III Prediction 99
14 Discriminative models: an overview 101
IV Generation 123
20 Generative models: an overview 125
V Discovery 139
27 Discovery methods: an overview 141
28.1.6 Variational inference for LDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
34 Interpretability 193
Chapter 1
Introduction
Part I
Fundamentals
Chapter 2
Probability
2.1.1 Deriving the conditionals of an MVN

where the parameters of the conditional distribution can be read off from the above equations.
We can also use the fact that |M| = |M/H||H| to check the normalization constants are correct:
$$(2\pi)^{(d_1+d_2)/2}\,|\Sigma|^{\frac{1}{2}} = (2\pi)^{(d_1+d_2)/2}\left(|\Sigma/\Sigma_{22}|\,|\Sigma_{22}|\right)^{\frac{1}{2}} \tag{2.13}$$
$$= (2\pi)^{d_1/2}\,|\Sigma/\Sigma_{22}|^{\frac{1}{2}}\,(2\pi)^{d_2/2}\,|\Sigma_{22}|^{\frac{1}{2}} \tag{2.14}$$
Here $\Lambda$ is the precision matrix, and $\eta$ is the precision-weighted mean. In this form, one can show that the
marginals and conditionals are given by
Hence
$$\begin{pmatrix} \Lambda_{11} & \Lambda_{12} \\ \Lambda_{21} & \Lambda_{22} \end{pmatrix} = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}^{-1} \tag{2.21}$$
$$= \begin{pmatrix} (\Sigma/\Sigma_{22})^{-1} & -(\Sigma/\Sigma_{22})^{-1}\Sigma_{12}\Sigma_{22}^{-1} \\ -\Sigma_{22}^{-1}\Sigma_{21}(\Sigma/\Sigma_{22})^{-1} & \Sigma_{22}^{-1} + \Sigma_{22}^{-1}\Sigma_{21}(\Sigma/\Sigma_{22})^{-1}\Sigma_{12}\Sigma_{22}^{-1} \end{pmatrix} \tag{2.22}$$
Hence
$$\Lambda_{1|2} = \Sigma_{1|2}^{-1} = (\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21})^{-1} = \Lambda_{11} \tag{2.24}$$
$$\mu_{1|2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2) \tag{2.25}$$
so
$$\mu_{1|2} = \mu_1 - \Lambda_{11}^{-1}\Lambda_{12}(x_2 - \mu_2) \tag{2.27}$$
$$\eta_{1|2} = \Lambda_{1|2}\mu_{1|2} = \Lambda_{11}\mu_1 - \Lambda_{12}(x_2 - \mu_2) \tag{2.28}$$
$$= \Lambda_{11}\mu_1 + \Lambda_{12}\mu_2 - \Lambda_{12}x_2 = \eta_1 - \Lambda_{12}x_2 \tag{2.29}$$
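To sanity-check Equations (2.24), (2.25) and (2.27) numerically, here is a small sketch in NumPy (the joint covariance below is an arbitrary positive definite example):

import numpy as np

# Arbitrary positive definite joint covariance over (x1, x2), with d1 = d2 = 1.
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
mu = np.array([1.0, -1.0])
Lam = np.linalg.inv(Sigma)  # precision matrix

x2 = np.array([0.5])  # observed value of x2

# Conditional moments in covariance form (Equations 2.24-2.25).
mu_cond_cov = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (x2 - mu[1])
Lam_cond = 1.0 / (Sigma[0, 0] - Sigma[0, 1] / Sigma[1, 1] * Sigma[1, 0])

# Same quantities in information form (Equations 2.24 and 2.27).
assert np.allclose(Lam_cond, Lam[0, 0])
mu_cond_info = mu[0] - Lam[0, 1] / Lam[0, 0] * (x2 - mu[1])
assert np.allclose(mu_cond_cov, mu_cond_info)
print(mu_cond_cov, Lam_cond)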
We will now derive the results for marginalizing in information form. Let
$$\begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} = \begin{pmatrix} \Lambda_{11} & \Lambda_{12} \\ \Lambda_{21} & \Lambda_{22} \end{pmatrix}^{-1} \tag{2.30}$$
$$= \begin{pmatrix} \Lambda_{11}^{-1} + \Lambda_{11}^{-1}\Lambda_{12}(\Lambda/\Lambda_{11})^{-1}\Lambda_{21}\Lambda_{11}^{-1} & -\Lambda_{11}^{-1}\Lambda_{12}(\Lambda/\Lambda_{11})^{-1} \\ -(\Lambda/\Lambda_{11})^{-1}\Lambda_{21}\Lambda_{11}^{-1} & (\Lambda/\Lambda_{11})^{-1} \end{pmatrix} \tag{2.31}$$
Hence
$$\Lambda^m_{22} = \Sigma_{22}^{-1} = \Lambda/\Lambda_{11} = \Lambda_{22} - \Lambda_{21}\Lambda_{11}^{-1}\Lambda_{12} \tag{2.32}$$
$$\eta^m_2 = \Lambda^m_{22}\mu_2 = (\Lambda_{22} - \Lambda_{21}\Lambda_{11}^{-1}\Lambda_{12})\mu_2 \tag{2.33}$$
$$= \Lambda_{22}\mu_2 - \Lambda_{21}\Lambda_{11}^{-1}\Lambda_{12}\mu_2 \tag{2.34}$$
$$= (\Lambda_{21}\mu_1 + \Lambda_{22}\mu_2) - \Lambda_{21}\Lambda_{11}^{-1}(\Lambda_{11}\mu_1 + \Lambda_{12}\mu_2) \tag{2.35}$$
$$= \eta_2 - \Lambda_{21}\Lambda_{11}^{-1}\eta_1 \tag{2.36}$$
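The marginalization formulas (2.32) and (2.36) can be checked the same way; a minimal sketch, reusing an arbitrary example covariance:

import numpy as np

# Check the information-form marginal against direct marginalization.
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
mu = np.array([1.0, -1.0])
Lam = np.linalg.inv(Sigma)
eta = Lam @ mu  # precision-weighted mean

# Marginal of x2 in information form.
Lam_m22 = Lam[1, 1] - Lam[1, 0] / Lam[0, 0] * Lam[0, 1]    # Eq. 2.32
eta_m2 = eta[1] - Lam[1, 0] / Lam[0, 0] * eta[0]           # Eq. 2.36

# Direct marginalization of a Gaussian: just read off the moments.
assert np.allclose(Lam_m22, 1.0 / Sigma[1, 1])
assert np.allclose(eta_m2 / Lam_m22, mu[1])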
2.1.2 Deriving Bayes rule for linear Gaussian systems

The log of the joint distribution is given by
$$\log p(z, y) = -\frac{1}{2}(z - \mu_z)^T\Sigma_z^{-1}(z - \mu_z) - \frac{1}{2}(y - Wz - b)^T\Sigma_y^{-1}(y - Wz - b) \tag{2.37}$$
This is clearly a joint Gaussian distribution, since it is the exponential of a quadratic form.
Expanding out the quadratic terms involving z and y, and ignoring linear and constant terms, we have
$$Q = -\frac{1}{2}z^T\Sigma_z^{-1}z - \frac{1}{2}y^T\Sigma_y^{-1}y - \frac{1}{2}(Wz)^T\Sigma_y^{-1}(Wz) + y^T\Sigma_y^{-1}Wz \tag{2.38}$$
$$= -\frac{1}{2}\begin{pmatrix} z \\ y \end{pmatrix}^T \begin{pmatrix} \Sigma_z^{-1} + W^T\Sigma_y^{-1}W & -W^T\Sigma_y^{-1} \\ -\Sigma_y^{-1}W & \Sigma_y^{-1} \end{pmatrix} \begin{pmatrix} z \\ y \end{pmatrix} \tag{2.39}$$
$$= -\frac{1}{2}\begin{pmatrix} z \\ y \end{pmatrix}^T \Sigma^{-1} \begin{pmatrix} z \\ y \end{pmatrix} \tag{2.40}$$
From Equation ??, and using the fact that µy = Wµz + b, we have
2.1.3 Sensor fusion with unknown measurement noise
In this section, we extend the sensor fusion results from ?? to the case where the precision of each measurement
device is unknown. This turns out to yield a potentially multi-modal posterior, as we will see, which is quite
different from the Gaussian case. Our presentation is based on [Min01].
For simplicity, we assume the latent quantity is scalar, $z \in \mathbb{R}$, and that we just have two measurement
devices, x and y. However, we allow these to have different precisions, so the data generating mechanism
has the form $x_n|z \sim \mathcal{N}(z, \lambda_x^{-1})$ and $y_n|z \sim \mathcal{N}(z, \lambda_y^{-1})$. We will use a non-informative prior for z, $p(z) \propto 1$,
which we can emulate using an infinitely broad Gaussian, $p(z) = \mathcal{N}(z|m_0 = 0, \lambda_0^{-1} = \infty)$. So the unknown
parameters are the two measurement precisions, $\theta = (\lambda_x, \lambda_y)$.
Suppose we make 2 independent measurements with each device, which turn out to be x = (1.1, 1.9) and
y = (2.9, 4.1). If the precisions were known, the posterior for z would be Gaussian: the posterior precision is the sum of the prior precision and the
measurement precisions, and the posterior mean is a weighted sum of the prior mean (which is 0) and the
data means.
However, the measurement precisions are not known. A simple solution is to estimate them by maximum
likelihood. The log-likelihood is given by
$$\ell(z, \lambda_x, \lambda_y) = \frac{N_x}{2}\log\lambda_x - \frac{\lambda_x}{2}\sum_n (x_n - z)^2 + \frac{N_y}{2}\log\lambda_y - \frac{\lambda_y}{2}\sum_n (y_n - z)^2 \tag{2.52}$$
The MLE is obtained by setting the partial derivatives to zero:
$$\frac{\partial\ell}{\partial z} = \lambda_x N_x(\overline{x} - z) + \lambda_y N_y(\overline{y} - z) = 0 \tag{2.53}$$
$$\frac{\partial\ell}{\partial\lambda_x} \propto \frac{1}{\lambda_x} - \frac{1}{N_x}\sum_{n=1}^{N_x}(x_n - z)^2 = 0 \tag{2.54}$$
$$\frac{\partial\ell}{\partial\lambda_y} \propto \frac{1}{\lambda_y} - \frac{1}{N_y}\sum_{n=1}^{N_y}(y_n - z)^2 = 0 \tag{2.55}$$
This gives
$$\hat{z} = \frac{N_x\hat{\lambda}_x\overline{x} + N_y\hat{\lambda}_y\overline{y}}{N_x\hat{\lambda}_x + N_y\hat{\lambda}_y} \tag{2.56}$$
$$1/\hat{\lambda}_x = \frac{1}{N_x}\sum_n (x_n - \hat{z})^2 \tag{2.57}$$
$$1/\hat{\lambda}_y = \frac{1}{N_y}\sum_n (y_n - \hat{z})^2 \tag{2.58}$$
We notice that the MLE for z has the same form as the posterior mean, mN .
We can solve these equations by fixed point iteration. Let us initialize by estimating $\lambda_x = 1/s_x^2$ and
$\lambda_y = 1/s_y^2$, where $s_x^2 = \frac{1}{N_x}\sum_{n=1}^{N_x}(x_n - \overline{x})^2 = 0.16$ and $s_y^2 = \frac{1}{N_y}\sum_{n=1}^{N_y}(y_n - \overline{y})^2 = 0.36$. Using this, we
get $\hat{z} = 2.1154$, so $p(z|D, \hat{\lambda}_x, \hat{\lambda}_y) = \mathcal{N}(z|2.1154, 0.0554)$. If we now iterate, we converge to $\hat{\lambda}_x = 1/0.1662$,
$\hat{\lambda}_y = 1/4.0509$, $p(z|D, \hat{\lambda}_x, \hat{\lambda}_y) = \mathcal{N}(z|1.5788, 0.0798)$.
The plug-in approximation to the posterior is plotted in Figure 2.1(a). This weights each sensor according
to its estimated precision. Since sensor y was estimated to be much less reliable than sensor x, we have
$\mathbb{E}\left[z|D, \hat{\lambda}_x, \hat{\lambda}_y\right] \approx \overline{x}$, so we effectively ignore the y sensor.
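Here is a minimal NumPy sketch of this fixed point scheme, using the measurements assumed above:

import numpy as np

x = np.array([1.1, 1.9])  # measurements from device x
y = np.array([2.9, 4.1])  # measurements from device y

lam_x, lam_y = 1 / x.var(), 1 / y.var()  # initialize with 1/s^2
for _ in range(20):
    # Eq. 2.56: precision-weighted estimate of z
    z = (len(x) * lam_x * x.mean() + len(y) * lam_y * y.mean()) / \
        (len(x) * lam_x + len(y) * lam_y)
    # Eqs. 2.57-2.58: update the precisions given z
    lam_x = 1 / np.mean((x - z) ** 2)
    lam_y = 1 / np.mean((y - z) ** 2)

print(z, 1 / lam_x, 1 / lam_y)  # approx 1.5788, 0.1662, 4.0509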
Now we will adopt a Bayesian approach and integrate out the unknown precisions, following ??. That is,
we compute
$$p(z|D) \propto p(z)\left[\int p(D_x|z, \lambda_x)\,p(\lambda_x|z)\,d\lambda_x\right]\left[\int p(D_y|z, \lambda_y)\,p(\lambda_y|z)\,d\lambda_y\right] \tag{2.59}$$
We will use uninformative Jeffreys priors (??), $p(z) \propto 1$, $p(\lambda_x|z) \propto 1/\lambda_x$ and $p(\lambda_y|z) \propto 1/\lambda_y$. Since the x and
y terms are symmetric, we will just focus on one of them. The key integral is
$$I = \int p(D_x|z, \lambda_x)\,p(\lambda_x|z)\,d\lambda_x \tag{2.60}$$
$$\propto \int \lambda_x^{-1}\lambda_x^{N_x/2}\exp\left(-\frac{N_x}{2}\lambda_x(\overline{x} - z)^2 - \frac{N_x}{2}s_x^2\lambda_x\right)d\lambda_x \tag{2.61}$$
Exploiting the fact that $N_x = 2$, this simplifies to
$$I = \int \lambda_x^{-1}\lambda_x\exp\left(-\lambda_x\left[(\overline{x} - z)^2 + s_x^2\right]\right)d\lambda_x \tag{2.62}$$
This integral can be evaluated in closed form, yielding the exact posterior shown in Figure 2.1(b). Alternatively,
we can use a Monte Carlo approximation, $p(z|D) \approx \frac{1}{S}\sum_s p(z|D, \theta^s)$, where $\theta^s \sim p(\theta|D)$. Note that $p(z|D, \theta^s)$ is conditionally Gaussian, and is easy to compute. So we just need
a way to draw samples from the parameter posterior, $p(\theta|D)$. We discuss suitable methods for this in ??.
Figure 2.1: Posterior for z. (a) Plug-in approximation. (b) Exact posterior. Generated by sen-
sor_fusion_unknown_prec.py.
Figure 2.2: (a) A very small world wide web. Generated by pagerank_small_plot_graph.py (b) The corresponding
stationary distribution. Generated by pagerank_demo_small.py.
2.2.2 The PageRank score

The PageRank score $\pi_j$ of page j is defined as the solution of
$$\pi_j = \sum_i \pi_i A_{ij} \tag{2.69}$$
where $A_{ij}$ is the probability of following a link from i to j. (The term “PageRank” is named after Larry Page,
one of Google’s co-founders.)
We recognize Equation (2.69) as the stationary distribution of a Markov chain. But how do we define
the transition matrix? In the simplest setting, we define Ai,: as a uniform distribution over all states that
i is connected to. However, to ensure the distribution is unique, we need to make the chain into a regular
chain. This can be done by allowing each state i to jump to any other state (including itself) with some
small probability. This effectively makes the transition matrix aperiodic and fully connected (although the
adjacency matrix Gij of the web itself is highly sparse).
We discuss efficient methods for computing the leading eigenvector of this giant matrix below. Here we
ignore computational issues, and just give some examples.
First, consider the small web in Figure 2.2. We find that the stationary distribution is
π = (0.3209, 0.1706, 0.1065, 0.1368, 0.0643, 0.2008) (2.70)
So a random surfer will visit site 1 about 32% of the time. We see that node 1 has a higher PageRank than
nodes 4 or 6, even though they all have the same number of in-links. This is because being linked to from an
influential node helps increase your PageRank score more than being linked to by a less influential node.
As a slightly larger example, Figure 2.3(a) shows a web graph, derived from the root of harvard.edu.
Figure 2.3(b) shows the corresponding PageRank vector.
Figure 2.3: (a) Web graph of 500 sites rooted at www.harvard.edu. (b) Corresponding PageRank vector. Generated
by pagerank_demo_harvard.py.
To define the transition matrix, imagine a “random surfer”: with probability p you follow one of the
outlinks of the current page, chosen uniformly at random; with probability 1 − p you jump to a random node,
again chosen uniformly at random. If there are no outlinks, you just jump
to a random page. (These random jumps, including self-transitions, ensure the chain is irreducible (singly
connected) and regular. Hence we can solve for its unique stationary distribution using eigenvector methods.)
This defines the following transition matrix:
$$M_{ij} = \begin{cases} pG_{ij}/c_j + \delta & \text{if } c_j \neq 0 \\ 1/n & \text{if } c_j = 0 \end{cases} \tag{2.71}$$
where n is the number of nodes, $c_j = \sum_i G_{ij}$ is the out-degree of page j, and $\delta = (1-p)/n$ is the probability
of jumping from one page to another without following a link.
The matrix M is not sparse, but it is a rank one modification of a sparse matrix. Most of the elements of M
are equal to the small constant δ. Obviously these do not need to be stored explicitly.
Our goal is to solve $\mathbf{v} = \mathbf{M}\mathbf{v}$, where $\mathbf{v} = \boldsymbol{\pi}^T$. One efficient method to find the leading eigenvector of a
large matrix is known as the power method. This simply consists of repeated matrix-vector multiplication,
followed by normalization:
$$\mathbf{v} \propto \mathbf{M}\mathbf{v} = p\mathbf{G}\mathbf{D}\mathbf{v} + \mathbf{1}\mathbf{z}^T\mathbf{v} \tag{2.75}$$
where $\mathbf{D} = \mathrm{diag}(1/c_j)$ (with entries 0 when $c_j = 0$), and $\mathbf{z}$ is the vector with $z_j = \delta$ if $c_j \neq 0$ and $z_j = 1/n$ otherwise.
It is possible to implement the power method without using any matrix multiplications, by simply sampling
from the transition matrix and counting how often you visit each state. This is essentially a Monte Carlo
approximation to the sum implied by v = Mv. Applying this to the data in Figure 2.3(a) yields the stationary
distribution in Figure 2.3(b). This took 13 iterations to converge, starting from a uniform distribution. To
handle changing web structure, we can re-run this algorithm every day or every week, starting v off at the
old distribution; this is called warm starting [LM06].
For details on how to perform this Monte Carlo power method in a parallel distributed computing
environment, see e.g., [RU10].
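Below is a small, self-contained sketch of the power method for PageRank. The 4-node adjacency matrix is made up for illustration (it is not the graph of Figure 2.2), and for simplicity we form the dense matrix M of Equation (2.71) directly, rather than exploiting the sparse decomposition of Equation (2.75):

import numpy as np

def pagerank(G, p=0.85, n_iter=100):
    """Power method for the PageRank vector of adjacency matrix G,
    where G[i, j] = 1 if there is a link from j to i (column = source)."""
    n = G.shape[0]
    c = G.sum(axis=0)                      # out-degrees
    delta = (1 - p) / n
    # Build M column by column, per Equation (2.71).
    M = np.where(c > 0, p * G / np.maximum(c, 1) + delta, 1.0 / n)
    v = np.ones(n) / n
    for _ in range(n_iter):
        v = M @ v
        v /= v.sum()                       # normalize
    return v

# A small made-up web graph, just for illustration.
G = np.array([[0, 0, 1, 0],
              [1, 0, 0, 0],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
print(pagerank(G))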
2.2.4 Web spam
PageRank is not foolproof. For example, consider the strategy adopted by JC Penney, a department store in
the USA. During the Christmas season of 2010, it planted many links to its home page on 1000s of irrelevant
web pages, thus increasing its ranking on Google’s search engine [Seg11]. Even though each of these source
pages had low PageRank, there were so many of them that their effect added up. Businesses call this search
engine optimization; Google calls it web spam. When Google was notified of this scam (by the New
York Times), it manually downweighted JC Penney, since such behavior violates Google’s code of conduct.
The result was that JC Penney dropped from rank 1 to rank 65, essentially making it disappear from view.
Automatically detecting such scams relies on various techniques which are beyond the scope of this chapter.
Chapter 3
Statistics
3.1 Bayesian concept learning

3.1.1 Learning a discrete concept: the number game

Figure 3.1: Empirical membership distribution in the numbers game, derived from predictions from 8 humans. First two
rows: after seeing D = {16} and D′ = {60}. This illustrates diffuse similarity. Third row: after seeing D = {16, 8, 2, 64}.

predict that y ∈ {2, 4, 8, 16, 32, 64} may also be generated in the future by the teacher. This is an example of
generalization, since we are making predictions about future data that we have not seen.
Figure 3.1 gives an example of how humans perform at this task. Given a single example, such as D = {16}
or D = {60}, humans make fairly diffuse predictions over the other numbers that are similar in magnitude.
But when given several examples, such as D = {2, 8, 16, 64}, humans often find an underlying pattern, and
use this to make fairly precise predictions about which other numbers might be part of the same concept.
The classic approach to induction is to suppose we have a hypothesis space H of concepts (such as even numbers, all numbers
between 1 and 10, etc.), and then to identify the smallest subset of H that is consistent with the observed
data D; this is called the version space. As we see more examples, the version space shrinks and we become
increasingly certain about the underlying hypothesis [Mit97].
However, the version space theory cannot explain the human behavior we saw in Figure 3.1. For example,
after seeing D = {16, 8, 2, 64}, why do people choose the rule “powers of two” and not, say, “all even numbers”,
or “powers of two except for 32”, both of which are equally consistent with the evidence? We will now show
how Bayesian inference can explain this behavior. The resulting predictions are shown in Figure 3.2.
3.1.1.1 Likelihood
We must explain why people chose htwo and not, say, heven after seeing D = {16, 8, 2, 64}, given that
both hypotheses are consistent with the evidence. The key intuition is that we want to avoid suspicious
coincidences. For example, if the true concept was even numbers, it would be surprising if we just happened
to only see powers of two.
To formalize this, let us assume that the examples are sampled uniformly at random from the extension
of the concept. (Tenenbaum calls this the strong sampling assumption.) Given this assumption, the
probability of independently sampling N items (with replacement) from the unknown concept h is given by
$$p(D|h) = \prod_{n=1}^N p(y_n|h) = \prod_{n=1}^N \frac{1}{\mathrm{size}(h)}\,\mathbb{I}(y_n \in h) = \left[\frac{1}{\mathrm{size}(h)}\right]^N \mathbb{I}(D \in h) \tag{3.1}$$
where $\mathbb{I}(D \in h)$ is nonzero iff all the data points lie in the support of h. This crucial equation embodies
Figure 3.2: Posterior membership probabilities derived using the full hypothesis space. Compare to Figure 3.1. The
predictions of the Bayesian model are only plotted for those values for which human data is available; this is why the
top line looks sparser than Figure 3.4. From Figure 5.6 of [Ten99]. Used with kind permission of Josh Tenenbaum.
what Tenenbaum calls the size principle, which means the model favors the simplest (smallest) hypothesis
consistent with the data. This is more commonly known as Occam’s razor.
To see how it works, let D = {16}. Then $p(D|h_{two}) = 1/6$, since there are only 6 powers of two less than
100, but $p(D|h_{even}) = 1/50$, since there are 50 even numbers. So the likelihood is higher for $h = h_{two}$ than
for $h = h_{even}$. After 4 examples, the likelihood of $h_{two}$ is $(1/6)^4 = 7.7 \times 10^{-4}$, whereas the likelihood of $h_{even}$
is $(1/50)^4 = 1.6 \times 10^{-7}$. This is a likelihood ratio of almost 5000:1 in favor of $h_{two}$. This quantifies our
earlier intuition that D = {16, 8, 2, 64} would be a very suspicious coincidence if generated by $h_{even}$.
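These numbers are easy to reproduce; a minimal sketch:

# Size principle in the number game: likelihood ratio of "powers of two"
# vs "even numbers" after seeing D = {16, 8, 2, 64}.
h_two = [2, 4, 8, 16, 32, 64]            # powers of two below 100
h_even = list(range(2, 101, 2))          # the 50 even numbers up to 100
D = [16, 8, 2, 64]

def likelihood(D, h):
    # Equation (3.1): [1/size(h)]^N if all points lie in h, else 0.
    return (1 / len(h)) ** len(D) if all(y in h for y in D) else 0.0

print(likelihood(D, h_two))    # (1/6)^4  = 7.7e-4
print(likelihood(D, h_even))   # (1/50)^4 = 1.6e-7
print(likelihood(D, h_two) / likelihood(D, h_even))  # approx 4800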
3.1.1.2 Prior
In the Bayesian approach, we must specify a prior over unknowns, p(h), as well as the likelihood, p(D|h). To
see why this is useful, suppose D = {16, 8, 2, 64}. Given this data, the concept h0 =“powers of two except
32” is more likely than h =“powers of two”, since h0 does not need to explain the coincidence that 32 is
missing from the set of examples. However, the hypothesis h0 =“powers of two except 32” seems “conceptually
unnatural”. We can capture such intuition by assigning low prior probability to unnatural concepts. Of
course, your prior might be different than mine. This subjective aspect of Bayesian reasoning is a source of
much controversy, since it means, for example, that a child and a math professor will reach different answers.1
Although the subjectivity of the prior is controversial, it is actually quite useful. If you are told the
numbers are from some arithmetic rule, then given 1200, 1500, 900 and 1400, you may think 400 is likely but
1183 is unlikely. But if you are told that the numbers are examples of healthy cholesterol levels, you would
probably think 400 is unlikely and 1183 is likely, since you assume that healthy levels lie within some range.
Thus we see that the prior is the mechanism by which background knowledge can be brought to bear on
a problem. Without this, rapid learning (i.e., from small samples sizes) is impossible.
So, what prior should we use? We will initially consider 30 simple arithmetical concepts, such as “even
numbers”, “odd numbers”, “prime numbers”, or “numbers ending in 9”. We could use a uniform prior over
these concepts; however, for illustration purposes, we make the concepts even and odd more likely a priori,
and use a uniform prior over the others. We also include two “unnatural” concepts, namely “powers of 2, plus
37” and “powers of 2, except 32”, but give them low prior weight. See Figure 3.3a (bottom row) for a plot of
this prior.
1 A child and a math professor presumably not only have different priors, but also different hypothesis spaces. However, we
can finesse that by defining the hypothesis space of the child and the math professor to be the same, and then setting the child’s
prior weight to be zero on certain “advanced” concepts. Thus there is no sharp distinction between the prior and the hypothesis
space.
Figure 3.3: (a) Prior, likelihood and posterior for the model when the data is D = {16}. (b) Results when D =
{2, 8, 16, 64}. Adapted from [Ten99]. Generated by numbers_game.py.
In addition to “rule-like” hypotheses, we consider the set of intervals between n and m for 1 ≤ n, m ≤ 100.
This allows us to capture concepts based on being “close to” some number, rather than satisfying some more
abstract property. We put a uniform prior over the intervals.
We can combine these two priors by using a mixture distribution, as follows:
$$p(h) = \pi\,\mathrm{Unif}(h|\mathrm{rules}) + (1 - \pi)\,\mathrm{Unif}(h|\mathrm{intervals})$$
where 0 < π < 1 is the mixture weight assigned to the rules prior, and Unif(h|S) is the uniform distribution
over the set S.
3.1.1.3 Posterior
The posterior is simply the likelihood times the prior, normalized: p(h|D) ∝ p(D|h)p(h). Figure 3.3a plots
the prior, likelihood and posterior after seeing D = {16}. (In this figure, we only consider rule-like hypotheses,
not intervals, for simplicity.) We see that the posterior is a combination of prior and likelihood. In the case
of most of the concepts, the prior is uniform, so the posterior is proportional to the likelihood. However,
the “unnatural” concepts of “powers of 2, plus 37” and “powers of 2, except 32” have low posterior support,
despite having high likelihood, due to the low prior. Conversely, the concept of odd numbers has low posterior
support, despite having a high prior, due to the low likelihood.
Figure 3.3b plots the prior, likelihood and posterior after seeing D = {16, 8, 2, 64}. Now the likelihood is
much more peaked on the powers of two concept, so this dominates the posterior. Essentially the learner has
an “aha” moment, and figures out the true concept.2 This example also illustrates why we need the low prior
on the unnatural concepts, otherwise we would have overfit the data and picked “powers of 2, except for 32”.
Figure 3.4: Posterior over hypotheses, and the induced posterior over membership, after seeing one example, D = {16}.
A dot means this number is consistent with this hypothesis. The graph p(h|D) on the right is the weight given to
hypothesis h. By taking a weighted sum of dots, we get p(y ∈ h|D) (top). Adapted from Figure 2.9 of [Ten99]. Generated
by numbers_game.py.
The posterior predictive distribution is given by $p(y|D) = \sum_h p(y|h)\,p(h|D)$. This is called Bayes model averaging [Hoe+99]. Each term is just a weighted average of the predictions of
each individual hypothesis. This is illustrated in Figure 3.4. The dots at the bottom show the predictions
from each hypothesis; the vertical curve on the right shows the weight associated with each hypothesis. If we
multiply each row by its weight and add up, we get the distribution at the top.
Consider the MAP estimate $\hat{h}_{map} = \operatorname*{argmax}_h p(h|D) = \operatorname*{argmax}_h\left[\log p(D|h) + \log p(h)\right]$. The first term, log p(D|h), is the log of the likelihood, p(D|h). The second term, log p(h), is the log of
the prior. As the data set increases in size, the log likelihood grows in magnitude, but the log prior term
remains constant. We thus say that the likelihood overwhelms the prior. In this context, a reasonable
approximation to the MAP estimate is to ignore the prior term, and just pick the maximum likelihood
estimate or MLE, which is defined as
$$h_{mle} \triangleq \operatorname*{argmax}_h p(D|h) = \operatorname*{argmax}_h \log p(D|h) = \operatorname*{argmax}_h \sum_{n=1}^N \log p(y_n|h) \tag{3.6}$$
Suppose we approximate the posterior by a single point estimate $\hat{h}$, which might be the MAP estimate or
the MLE. We can represent this degenerate distribution as a single point mass
$$p(h|D) \approx \mathbb{I}\left(h = \hat{h}\right) \tag{3.7}$$
where $\mathbb{I}(\cdot)$ is the indicator function. The corresponding posterior predictive distribution becomes
$$p(y|D) \approx \sum_h p(y|h)\,\mathbb{I}\left(h = \hat{h}\right) = p(y|\hat{h}) \tag{3.8}$$
This is called a plug-in approximation, and is very widely used, due to its simplicity, as we discuss further
in ??.
Although the plug-in approximation is simple, it behaves in a qualitatively inferior way than the fully
Bayesian approach when the dataset is small. In the Bayesian approach, we start with broad predictions, and
then become more precise in our forecasts as we see more data, which makes intuitive sense. For example,
given D = {16}, there are many hypotheses with non-negligible posterior mass, so the predicted support over
the integers is broad. However, when we see D = {16, 8, 2, 64}, the posterior concentrates its mass on one
or two specific hypotheses, so the overall predicted support becomes more focused. By contrast, the MLE
picks the minimal consistent hypothesis, and predicts the future using that single model. For example, if
we see D = {16}, we compute hmle to be “all powers of 4” (or the interval hypothesis h = {16}), and the
resulting plug-in approximation only predicts {4, 16, 64} as having non-zero probability. This is an example
of overfitting, where we pay too much attention to the specific data that we saw in training, and fail to
generalize correctly to novel examples. When we observe more data, the MLE will be forced to pick a broader
hypothesis to explain all the data. For example, if we see D = {16, 8, 2, 64}, the MLE broadens to become “all
powers of two”, similar to the Bayesian approach. Thus in the limit of infinite data, both approaches converge
to the same predictions. However, in the small sample regime, the fully Bayesian approach, in which we
consider multiple hypotheses, will give better (less overconfident) predictions.
3.1.2.1 Likelihood
We assume points are sampled uniformly at random from the support of the rectangle. To simplify the
analysis, let us first consider the case of one-dimensional “rectangles”, i.e., lines. In the 1d case, the likelihood
Figure 3.5: Samples from the posterior in the “healthy levels” game. The axes represent “cholesterol level” and “insulin
level”. (a) Given a small number of positive examples (represented by 3 red crosses), there is a lot of uncertainty
about the true extent of the rectangle. (b) Given enough data, the smallest enclosing rectangle (which is the maximum
likelihood hypothesis) becomes the most probable, although there are many other similar hypotheses that are almost as
probable. Adapted from [Ten99]. Generated by healthy_levels_plots.py.
is $p(D|\ell, s) = (1/s)^N$ if all points are inside the interval, otherwise it is 0. Hence
$$p(D|\ell, s) = \begin{cases} s^{-N} & \text{if } \min(D) \geq \ell \text{ and } \max(D) \leq \ell + s \\ 0 & \text{otherwise} \end{cases} \tag{3.9}$$
To generalize this to 2d, we assume the observed features are conditionally independent given the hypothesis.
Hence the 2d likelihood becomes
p(D|h) = p(D1 |`1 , s1 )p(D2 |`2 , s2 ) (3.10)
where Dj = {ynj : n = 1 : N } are the observations for dimension (feature) j = 1, 2.
3.1.2.2 Prior
For simplicity, let us assume the prior factorizes, i.e., $p(h) = p(\ell_1)p(\ell_2)p(s_1)p(s_2)$. We will use uninformative
priors for each of these terms. As we explain in ??, this means we should use a prior of the form $p(h) \propto \frac{1}{s_1}\frac{1}{s_2}$.
3.1.2.3 Posterior
The posterior is given by
$$p(\ell_1, \ell_2, s_1, s_2|D) \propto p(D_1|\ell_1, s_1)\,p(D_2|\ell_2, s_2)\,\frac{1}{s_1}\,\frac{1}{s_2} \tag{3.11}$$
We can compute this numerically by discretizing R4 into a 4d grid, evaluating the numerator pointwise, and
normalizing.
Since visualizing a 4d distribution is difficult, we instead draw posterior samples from it, hs ∼ p(h|D),
and visualize them as rectangles. In Figure 3.5(a), we show some samples when the number N of observed
data points is small — we are uncertain about the right hypothesis. In Figure 3.5(b), we see that for larger
N , the samples concentrate on the observed data.
Figure 3.6: Posterior predictive distribution for the healthy levels game. Red crosses are observed data points. Left
column: N = 3. Right column: N = 12. First row: Bayesian prediction. Second row: Plug-in prediction using
MLE (smallest enclosing rectangle). We see that the Bayesian prediction goes from uncertain to certain as we learn
more about the concept given more data, whereas the plug-in prediction goes from narrow to broad, as it is forced to
generalize when it sees more data. However, both converge to the same answer. Adapted from [Ten99]. Generated by
healthy_levels_plot.py.
Let us define $y^{min}_j = \min_n y_{nj}$, $y^{max}_j = \max_n y_{nj}$, and $r_j = y^{max}_j - y^{min}_j$. Then one can show that the
posterior predictive distribution is given by
$$p(y|D) = \left[\frac{1}{(1 + d(y_1)/r_1)(1 + d(y_2)/r_2)}\right]^{N-1} \tag{3.12}$$
where $d(y_j) = 0$ if $y^{min}_j \leq y_j \leq y^{max}_j$, and otherwise $d(y_j)$ is the distance to the nearest data point along
dimension j. Thus p(y|D) = 1 if y is inside the support of the training data; if y is outside the support, the
probability density drops off, at a rate that depends on N.
Note that if N = 1, the predictive distribution is undefined. This is because we cannot infer the extent of
a 2d rectangle from just one data point (unless we use a stronger prior).
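A small sketch of Equation (3.12) in NumPy (the data points below are synthetic, for illustration only):

import numpy as np

def predictive(y, data):
    """Posterior predictive of Eq. (3.12) for the healthy levels game.
    y: query point, shape (2,); data: positive examples, shape (N, 2)."""
    lo, hi = data.min(axis=0), data.max(axis=0)
    r = hi - lo                                        # ranges r_j
    d = np.maximum(lo - y, 0) + np.maximum(y - hi, 0)  # distance outside support
    return float(np.prod(1.0 / (1.0 + d / r)) ** (len(data) - 1))

data = np.array([[0.3, 0.5], [0.4, 0.7], [0.5, 0.4]])  # N = 3 examples
print(predictive(np.array([0.4, 0.5]), data))  # inside the support: 1.0
print(predictive(np.array([0.9, 0.5]), data))  # outside: density < 1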
In Figure 3.6(a), we plot the posterior predictive distribution when we have just seen N = 3 examples; we
see that there is a broad generalization gradient, which extends further along the vertical dimension than the
horizontal direction. This is because the data has a broader vertical spread than horizontal. In other words,
if we have seen a large range in one dimension, we have evidence that the rectangle is quite large in that
dimension, but otherwise we prefer compact hypotheses, as follows from the size principle.
In Figure 3.6(b), we plot the distribution for N = 12. We see it is focused on the smallest consistent
hypothesis, since the size principle exponentially down-weights hypotheses which are larger than necessary.
3.1.2.5 Plugin approximation
Now suppose we use a plug-in approximation to the posterior predictive, p(y|D) ≈ p(y|θ̂), where θ̂ is the
MLE or MAP estimate, analogous to the discussion in Section 3.1.1.5. In Figure 3.6(c-d), we show the
behavior of this approximation. In both cases, it predicts the smallest enclosing rectangle, since that is the
one with maximum likelihood. However, this does not extrapolate beyond the range of the observed data.
We also see that initially the predictions are narrower, since very little data has been observed, but that the
predictions become broader with more data. By contrast, in the Bayesian approach, the initial predictions
are broad, since there is a lot of uncertainty, but become narrower with more data. In the limit of large data,
both methods converge to the same predictions. (See ?? for more discussion of the plug-in approximation.)
Note that the posterior median is often a better summary of the posterior than the posterior mode, for
reasons explained in ??.
Figure 3.7: Top: empirical distribution of various durational quantities (life spans, movie runtimes, movie grosses,
poems, representatives, pharaohs, cakes). Bottom: predicted total duration as a function
of observed duration, p(T|t). Dots are observed median responses of people. Solid line: Bayesian prediction using
informed prior. Dotted line: Bayesian prediction using uninformative prior. From Figure 2a of [GT06]. Used with
kind permission of Tom Griffiths.
Figure 3.8: Top: three different prior distributions, for three different parameter values (Gaussian: µ ∈ {30, 25, 15};
power-law: γ ∈ {1, 1.5, 2}; Erlang: β ∈ {30, 18, 10}). Bottom: corresponding
predictive distributions. From Figure 1 of [GT06]. Used with kind permission of Tom Griffiths.
3.2.2 Gaussian prior

Looking at Figure 3.7(a-b), it seems clear that life-spans and movie run-times can be well-modeled by a
Gaussian, $\mathcal{N}(T|\mu, \sigma^2)$. Unfortunately, we cannot compute the posterior median in closed form if we use a
Gaussian prior, but we can still evaluate it numerically, by solving a 1d integration problem. The resulting
plot of $\hat{T}(t)$ vs t is shown in Figure 3.8 (bottom left). For values of t much less than the prior mean, µ, the
predicted value of T is about equal to µ, so the left part of the curve is flat. For values of t much greater
than µ, the predicted value converges to a line slightly above the diagonal, i.e., $\hat{T}(t) = t + \epsilon$ for some small
(and decreasing) $\epsilon > 0$.
To see why this behavior makes intuitive sense, consider encountering a man at age 18, 39 or 51: in all
cases, a reasonable prediction is that he will live to about µ = 75 years. But now imagine meeting a man at
age 80: we probably would not expect him to live much longer, so we predict $\hat{T}(80) \approx 80 + \epsilon$.
3.2.3 Power-law prior
Looking at Figure 3.7(c-d), it seems clear that movie grosses and poem length can be modeled by a power
law distribution of the form $p(T) \propto T^{-\gamma}$, for $\gamma > 0$. (If $\gamma > 1$, this is called a Pareto distribution, see ??.)
Power-laws are characterized by having very long tails. This captures the fact that most movies make
very little money, but a few blockbusters make a lot. The number of lines in various poems also has this
shape, since there are a few epic poems, such as Homer’s Odyssey, but most are short, like haikus. Wealth
has a similarly skewed distribution in many countries, especially in plutocracies such as the USA (see e.g.,
inequality.org).
In the case of a power-law prior, $p(T) \propto T^{-\gamma}$, we can compute the posterior median analytically. We have
$$p(t) \propto \int_t^\infty T^{-(\gamma+1)}\,dT = -\frac{1}{\gamma}T^{-\gamma}\Big|_t^\infty = \frac{1}{\gamma}t^{-\gamma} \tag{3.16}$$
Hence the posterior becomes
$$p(T|t) = \frac{T^{-(\gamma+1)}}{\frac{1}{\gamma}t^{-\gamma}} = \frac{\gamma t^\gamma}{T^{\gamma+1}} \tag{3.17}$$
for values of T ≥ t. We can derive the posterior median as follows:
$$p(T > T_M|t) = \int_{T_M}^\infty \frac{\gamma t^\gamma}{T^{\gamma+1}}\,dT = -\left(\frac{t}{T}\right)^\gamma\Big|_{T_M}^\infty = \left(\frac{t}{T_M}\right)^\gamma \tag{3.18}$$
Solving for $T_M$ such that $p(T > T_M|t) = 0.5$ gives $T_M = 2^{1/\gamma}t$.
This is plotted in Figure 3.8 (bottom middle). We see that the predicted duration is some constant
multiple of the observed duration. For the particular value of γ that best fits the empirical distribution of
movie grosses, the optimal prediction is about 50% larger than the observed quantity. So if we observe that a
movie has made $40M to date, we predict that it will make $60M in total.
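This prediction is easy to verify (the value of γ below is the one implied by the 1.5 multiplier, not the fitted value from [GT06]):

import math

gamma = math.log(2) / math.log(1.5)   # gamma for which 2**(1/gamma) = 1.5
t = 40                                # observed gross so far, in $M
T_M = 2 ** (1 / gamma) * t            # posterior median prediction
print(round(gamma, 2), T_M)           # 1.71, 60.0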
As Griffiths and Tenenbaum point out, this rule is inappropriate for quantities that follow a Gaussian
prior, such as people’s ages. As they write, “Upon meeting a 10-year-old girl and her 75-year-old grandfather,
we would never predict that the girl will live a total of 15 years (1.5 × 10) and that the grandfather will
live to be 112 (1.5 × 75).” This shows that people implicitly know what kind of prior to use when solving
prediction problems of this kind.
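For the Erlang prior of Section 3.2.4, a similar calculation applies. A sketch, assuming the Erlang prior $p(T) \propto T e^{-T/\beta}$ used in [GT06] and the same uniform sampling likelihood $p(t|T) = 1/T$ for $t \leq T$ as above:
$$p(T|t) \propto \frac{1}{T}\,T e^{-T/\beta} = e^{-T/\beta} \quad \text{for } T \geq t, \qquad p(T > T_M|t) = \frac{\int_{T_M}^{\infty} e^{-T/\beta}\,dT}{\int_{t}^{\infty} e^{-T/\beta}\,dT} = e^{-(T_M - t)/\beta}$$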
Solving for TM such that p(T > TM |t) = 0.5 gives TM = t + β log 2.
This is plotted in Figure 3.8 (bottom right). We see that the best guess is simply the observed value plus
a constant, where the constant reflects the average term in office.
Chapter 4
Graphical models
4.1 More examples of DGMs

4.1.1 The QMR network

The QMR (Quick Medical Reference) network is a bipartite DGM, with the structure shown in Figure 4.1. The joint distribution has the form
$$p(\mathbf{z}, \mathbf{x}) = \prod_k p(z_k)\prod_d p(x_d|\mathbf{z}_{\mathrm{pa}(d)}) \tag{4.1}$$
where $z_k$ represents the k’th disease and $x_d$ represents the d’th symptom. This model can be used inside
an inference engine to compute the posterior probability of each disease given the observed symptoms, i.e.,
$p(z_k|\mathbf{x}_v)$, where $\mathbf{x}_v$ is the set of visible symptom nodes. (The symptoms which are not observed can be
removed from the model, assuming they are missing at random (??), because they contribute nothing to the
likelihood; this is called barren node removal.)
We now discuss the parameterization of the model. For simplicity, we assume all nodes are binary. The
CPDs for the root nodes are just Bernoulli distributions, representing the prior probability of that disease.
Representing the CPDs for the leaves (symptoms) using CPTs would require too many parameters, because
the fan-in (number of parents) of many leaf nodes is very high. A natural alternative is to use logistic
regression to model the CPD, $p(x_d|\mathbf{z}_{\mathrm{pa}(d)}) = \mathrm{Ber}(x_d|\sigma(\mathbf{w}_d^T\mathbf{z}_{\mathrm{pa}(d)}))$. However, we use an alternative known
as the noisy-OR model, which we explain below.
Figure 4.1: A small version of the QMR network. All nodes are binary. The hidden nodes zk represent diseases, and
the visible nodes xd represent symptoms. In the full network, there are 570 hidden (disease) nodes and 4075 visible
(symptom) nodes. The shaded (solid gray) leaf nodes are observed; in this example, symptom x2 is not observed (i.e.,
we don’t know if it is present or absent). Of course, the hidden diseases are never observed.

z0  z1  z2 | P(xd = 0|z0, z1, z2) | P(xd = 1|z0, z1, z2)
 1   0   0 | θ0                   | 1 − θ0
 1   1   0 | θ0 θ1                | 1 − θ0 θ1
 1   0   1 | θ0 θ2                | 1 − θ0 θ2
 1   1   1 | θ0 θ1 θ2             | 1 − θ0 θ1 θ2

Table 4.1: Noisy-OR CPD for p(xd|z0, z1, z2), where z0 = 1 is a leak node.
The noisy-OR model assumes that if a parent is on, then the child will usually also be on (since it is an
or-gate), but occasionally the “links” from parents to child may fail, independently at random. If a failure
occurs, the child will be off, even if the parent is on. To model this more precisely, let θkd = 1 − qkd be the
probability that the k → d link fails. The only way for the child to be off is if all the links from all parents
that are on fail independently at random. Thus
$$p(x_d = 0|\mathbf{z}) = \prod_{k\in\mathrm{pa}(d)} \theta_{kd}^{\,\mathbb{I}(z_k = 1)} \tag{4.2}$$
Obviously, p(xd = 1|z) = 1−p(xd = 0|z). In particular, let us define qkd = 1−θkd = p(xd = 1|zk = 1, z−k = 0);
this is the probability that k can activate d “on its own”; this is sometimes called its “causal power” (see
e.g., [KNH11]).
If we observe that xd = 1 but all its parents are off, then this contradicts the model. Such a data case
would get probability zero under the model, which is problematic, because it is possible that someone exhibits
a symptom but does not have any of the specified diseases. To handle this, we add a dummy leak node z0 ,
which is always on; this represents “all other causes”. The parameter q0d represents the probability that the
background leak can cause symptom d on its own. The modified CPD becomes
$$p(x_d = 0|\mathbf{z}) = \theta_{0d}\prod_{k\in\mathrm{pa}(d)} \theta_{kd}^{z_k} \tag{4.3}$$
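A minimal sketch of the noisy-OR CPD of Equation (4.3), which reproduces the rows of Table 4.1 (the failure probabilities below are made up):

import numpy as np

def noisy_or_p_off(z, theta, leak_theta):
    """P(x_d = 0 | z) for a noisy-OR CPD: the leak and every active
    parent must all fail independently (Equation 4.3)."""
    return leak_theta * np.prod(np.where(z == 1, theta, 1.0))

theta = np.array([0.1, 0.2])   # failure probabilities theta_1, theta_2
leak = 0.99                    # theta_0: background leak rarely fires

for z in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    p0 = noisy_or_p_off(np.array(z), theta, leak)
    print(z, p0, 1 - p0)       # rows of Table 4.1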
4.1.2 Genetic linkage analysis

Figure 4.2: Left: family tree, circles are females, squares are males. Individuals with the disease of interest are
highlighted. Right: DGM for locus j. Blue node $P_{ij}$ is the phenotype for individual i at locus j. Orange nodes
$G^{p/m}_{ij}$ are the paternal/maternal alleles. Small red nodes $S^{p/m}_{ij}$ are the paternal/maternal selection switching variables.
The founder (root) nodes do not have any parents, and hence do not need switching variables. All nodes are hidden
except the blue phenotypes. Adapted from Figure 3 from [FGL00].

Table 4.2: CPT which encodes a mapping from genotype to phenotype (bloodtype). This is a deterministic, but
many-to-one, mapping.
Figure 4.3: Extension of Figure 4.2 to two loci, showing how the switching variables are spatially correlated. This is
indicated by the $S^m_{ij} \rightarrow S^m_{i,j+1}$ and $S^p_{ij} \rightarrow S^p_{i,j+1}$ edges. Adapted from Figure 3 from [FGL00].
$G^m_{ij}$ and $G^p_{ij}$, one inherited from i’s mother (maternal allele) and the other from i’s father (paternal allele).
Together, the ordered pair $G_{ij} = (G^m_{ij}, G^p_{ij})$ constitutes i’s hidden genotype at locus j.
Obviously we must add $G^m_{ij} \rightarrow P_{ij}$ and $G^p_{ij} \rightarrow P_{ij}$ arcs representing the fact that genotypes cause
phenotypes. The CPD $p(P_{ij}|G^m_{ij}, G^p_{ij})$ is called the penetrance model. As a very simple example, suppose
$P_{ij} \in \{A, B, O, AB\}$ represents person i’s observed bloodtype, and $G^m_{ij}, G^p_{ij} \in \{A, B, O\}$ is their genotype.
We can represent the penetrance model using the deterministic CPD shown in Table 4.2. For example, A
dominates O, so if a person has genotype AO or OA, their phenotype will be A.
In addition, we add arcs from i’s mother and father into $G^{m/p}_{ij}$, reflecting the Mendelian inheritance of
genetic material from one’s parents. More precisely, let $\mu_i = k$ be i’s mother. For example, in Figure 4.2(b),
for individual i = 3, we have $\mu_i = 2$, since 2 is the mother of 3. The gene $G^m_{ij}$ could either be equal to $G^m_{kj}$ or
$G^p_{kj}$, that is, i’s maternal allele is a copy of one of its mother’s two alleles. Let $S^m_{ij}$ be a hidden switching
variable that specifies the choice. Then we can use the following CPD, known as the inheritance model:
$$p(G^m_{ij}|G^m_{kj}, G^p_{kj}, S^m_{ij}) = \begin{cases} \mathbb{I}\left(G^m_{ij} = G^m_{kj}\right) & \text{if } S^m_{ij} = m \\ \mathbb{I}\left(G^m_{ij} = G^p_{kj}\right) & \text{if } S^m_{ij} = p \end{cases} \tag{4.5}$$
We can now use this model to determine where along the genome a given disease-causing gene is assumed
to lie — this is the genetic linkage analysis task. The method works as follows. First, suppose all the
parameters of the model, including the distance between all the marker loci, are known. The only unknown
is the location of the disease-causing gene. If there are L marker loci, we construct L + 1 models: in model ℓ,
we postulate that the disease gene comes after marker ℓ, for ℓ = 0, 1, ..., L. We can estimate the Markov
switching parameter $\hat{\theta}_\ell$, and hence the distance $d_\ell$ between the disease gene and its nearest known locus.
We measure the quality of that model using its likelihood, $p(D|\hat{\theta}_\ell)$. We can then pick the model with
highest likelihood.
Note, however, that computing the likelihood requires marginalizing out all the hidden S and G variables.
See [FG02] and the references therein for some exact methods for this task; these are based on the variable
elimination algorithm, which we discuss in Section ??. Unfortunately, for reasons we explain in Section ??,
exact methods can be computationally intractable if the number of individuals and/or loci is large. See
[ALK06] for an approximate method for computing the likelihood based on the “cluster variation method”.
Note that it is possible to extend the above model in multiple ways. For example, we can model evolution
amongst phylogenies using a phylogenetic HMM [SH03].
4.2.1 Hopfield networks
Figure 4.4: Examples of how an associative memory can reconstruct images. These are binary images of size 150 × 150
pixels. Top: training images. Middle row: partially visible test images. Bottom row: final state estimate. Adapted
from Figure 2.1 of [HKP91]. Generated by hopfield_demo.py.
we want to memorize. (We discuss how to do this below.) Then, at test time, we present a partial pattern to
the network. We would like to estimate the missing variables; this is called pattern completion. That is,
we want to compute
$$\mathbf{x}^* = \operatorname*{argmin}_{\mathbf{x}} E(\mathbf{x}) \tag{4.7}$$
We can solve this optimization problem using iterative conditional modes (ICM), in which we set each
hidden variable to its most likely state given its neighbors. Picking the most probable state amounts to using
the rule
$$\mathbf{x}_{t+1} = \mathrm{sgn}(\mathbf{W}\mathbf{x}_t) \tag{4.8}$$
This can be seen as a deterministic version of Gibbs sampling (see ??).
We illustrate this process in Figure 4.4. In the top row, we show some training examples. In the middle
row, we show a corrupted input, corresponding to the initial state x0 . In the bottom row, we show the final
state after 30 iterations of ICM. The overall process can be thought of as retrieving a complete example from
memory based on a piece of the example.
To learn the weights W, we could use the maximum likelihood estimate method described in ??. (See
also [HSDK12].) However, a simpler heuristic method, proposed in [Hop82], is to use the following outer
product method:
$$\mathbf{W} = \frac{1}{N}\left(\sum_{n=1}^N \mathbf{x}_n\mathbf{x}_n^T\right) - \mathbf{I} \tag{4.9}$$
This normalizes the outer product matrix by N, and then sets the diagonal to 0. This ensures the energy
is low for patterns that match any of the examples in the training set. This is the technique we used in
Figure 4.4. Note, however, that this method not only stores the original patterns but also their inverses, and
other linear combinations. Consequently there is a limit to how many examples the model can store before
they start to “collide” in the memory. Hopfield proved that, for random patterns, the network capacity is
∼ 0.14N.
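A compact sketch of the outer product rule and ICM recall on random bipolar patterns (illustrative only; the demo in Figure 4.4 uses binary images):

import numpy as np

rng = np.random.default_rng(0)
D, N = 100, 5
X = rng.choice([-1, 1], size=(N, D))       # patterns to memorize

W = (X.T @ X) / N - np.eye(D)              # outer product rule, Eq. (4.9)

x = X[0].copy()
x[:50] = 1                                 # corrupt half the pattern
for _ in range(30):                        # ICM updates, Eq. (4.8)
    x = np.sign(W @ x).astype(int)
    x[x == 0] = 1                          # break ties deterministically

print(np.mean(x == X[0]))                  # fraction of bits recovered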
4.2.2 Restricted Boltzmann machines (RBMs) in more detail

The joint distribution of a binary RBM has the form $p(\mathbf{x}, \mathbf{z}|\boldsymbol{\theta}) = \frac{1}{Z(\boldsymbol{\theta})}\exp(-\mathcal{E}(\mathbf{x}, \mathbf{z}; \boldsymbol{\theta}))$,
where $\mathcal{E}$ is the energy function, W is a D × K weight matrix, b are the visible bias terms, c are the hidden
bias terms, and θ = (W, b, c) are all the parameters. For notational simplicity, we will absorb the bias terms
into the weight matrix by adding dummy units x0 = 1 and z0 = 1 and setting w0,: = c and w:,0 = b. Note
that naively computing Z(θ) takes $O(2^D 2^K)$ time but we can reduce this to $O(\min\{D2^K, K2^D\})$ time using
the structure of the graph.
When using a binary RBM, the posterior can be computed as follows:
$$p(\mathbf{z}|\mathbf{x}, \boldsymbol{\theta}) = \prod_{k=1}^K p(z_k|\mathbf{x}, \boldsymbol{\theta}) = \prod_k \mathrm{Ber}(z_k|\sigma(\mathbf{w}_{:,k}^T\mathbf{x})) \tag{4.14}$$
By symmetry, one can show that we can generate data given the hidden variables as follows:
$$p(\mathbf{x}|\mathbf{z}, \boldsymbol{\theta}) = \prod_d p(x_d|\mathbf{z}, \boldsymbol{\theta}) = \prod_d \mathrm{Ber}(x_d|\sigma(\mathbf{w}_{d,:}^T\mathbf{z})) \tag{4.15}$$
The weights in W are called the generative weights, since they are used to generate the observations, and
the weights in WT are called the recognition weights, since they are used to recognize the input.
From Equation 4.14, we see that we activate hidden node k in proportion to how much the input vector x
“looks like” the weight vector w:,k (up to scaling factors). Thus each hidden node captures certain features of
the input, as encoded in its weight vector, similar to a feedforward neural network.
For example, consider an RBM for text models, where x is a bag of words (i.e., a bit vector over the
vocabulary). Let zk = 1 if “topic” k is present in the document. Suppose a document has the topics “sports”
and “drugs”. If we “multiply” the predictions of each topic together, the model may give very high probability
to the word “doping”, which satisfies both constraints. By contrast, adding together experts can only make
the distribution broader (see Figure ??). In particular, if we mix together the predictions from “sports” and
“drugs”, we might generate words like “cricket” and “addiction”, which come from the union of the two topics,
not their intersection.
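Equations (4.14) and (4.15) make block Gibbs sampling in an RBM straightforward; a minimal sketch with randomly initialized parameters:

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1 / (1 + np.exp(-a))

D, K = 6, 3
W = rng.normal(scale=0.5, size=(D, K))    # weights (random, for illustration)
b, c = np.zeros(D), np.zeros(K)           # visible and hidden biases

x = rng.integers(0, 2, size=D)            # initial visible vector
for _ in range(100):
    # Eq. (4.14): sample hiddens given visibles
    z = (rng.random(K) < sigmoid(W.T @ x + c)).astype(int)
    # Eq. (4.15): sample visibles given hiddens
    x = (rng.random(D) < sigmoid(W @ z + b)).astype(int)

print(x, z)   # one (approximate) joint sample after burn-in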
The parameters of the model are $\theta = (w_{dk}, a_k, b_d)$. (We have assumed the data is standardized, so we fix the
variance to $\sigma^2 = 1$.) Compare this to a Gaussian in canonical or information form (see Section ??):
$$\mathcal{N}_c(\mathbf{x}|\boldsymbol{\eta}, \boldsymbol{\Lambda}) \propto \exp(\boldsymbol{\eta}^T\mathbf{x} - \frac{1}{2}\mathbf{x}^T\boldsymbol{\Lambda}\mathbf{x}) \tag{4.22}$$
where $\boldsymbol{\eta} = \boldsymbol{\Lambda}\boldsymbol{\mu}$. We see that we have set $\boldsymbol{\Lambda} = \mathbf{I}$, and $\boldsymbol{\eta} = \sum_k z_k\mathbf{w}_{:,k}$. Thus the mean is given by
$\boldsymbol{\mu} = \boldsymbol{\Lambda}^{-1}\boldsymbol{\eta} = \sum_k z_k\mathbf{w}_{:,k}$, which is a weighted combination of prototypes. The full conditionals, which are
needed for inference and learning, are given by
$$p(x_d|\mathbf{z}, \boldsymbol{\theta}) = \mathcal{N}(x_d|b_d + \sum_k w_{dk}z_k,\ 1) \tag{4.23}$$
$$p(z_k = 1|\mathbf{x}, \boldsymbol{\theta}) = \sigma\left(c_k + \sum_d w_{dk}x_d\right) \tag{4.24}$$
More powerful models, which make the (co)variance depend on the hidden states, can also be developed
[RH10].
4.2.3 Feature induction for a maxent spelling model
In some applications, we assume the features φ(x) are known. However, it is possible to learn the features in
a maxent model in an unsupervised way; this is known as feature induction.
A common approach to feature induction, first proposed in [DDL97; ZWM97], is to start with a base set
of features, and then to continually create new feature combinations out of old ones, greedily adding the best
ones to the model.
As an example of this approach, [DDL97] describe how to build models to represent English spelling. This
can be formalized as a probability distribution over variable length strings, p(x|θ), where xt is a letter in
the English alphabet. Initially the model has no features, which represents the uniform distribution. The
algorithm starts by choosing to add the feature
$$\phi_1(\mathbf{x}) = \sum_i \mathbb{I}(x_i \in \{a, \ldots, z\}) \tag{4.25}$$
which checks if any letter is lower case or not. After the feature is added, the parameters are (re)-fit by
maximum likelihood (a computationally difficult problem, which we discuss in ??). For this feature, it turns
out that $\hat{\theta}_1 = 1.944$, which means that a word with a lowercase letter in any position is about $e^{1.944} \approx 7$
times more likely than the same word without a lowercase letter in that position. Some samples from this
model, generated using (annealed) Gibbs sampling (described in ??), are shown below.2
m, r, xevo, ijjiir, b, to, jz, gsr, wq, vf, x, ga, msmGh, pcp, d, oziVlal, hzagh, yzop, io,
advzmxnv, ijv_bolft, x, emx, kayerf, mlj, rawzyb, jp, ag, ctdnnnbg, wgdw, t, kguv, cy, spxcq,
uzflbbf, dxtkkn, cxwx, jpd, ztzh, lv, zhpkvnu, l^, r, qee, nynrx, atze4n, ik, se, w, lrh, hp+,
yrqyka’h, zcngotcnx, igcump, zjcjs, lqpWiqu, cefmfhc, o, lb, fdcY, tzby, yopxmvk, by, fz„ t,
govyccm, ijyiduwfzo, 6xr, duh, ejv, pk, pjw, l, fl, w
The second feature added by the algorithm checks if two adjacent characters are lower case:
$$\phi_2(\mathbf{x}) = \sum_{i\sim j} \mathbb{I}(x_i \in \{a, \ldots, z\},\ x_j \in \{a, \ldots, z\}) \tag{4.26}$$
Some samples from the model with both features are shown below:
was, reaser, in, there, to, will, „ was, by, homes, thing, be, reloverated, ther, which, conists,
at, fores, anditing, with, Mr., proveral, the, „ ***, on’t, prolling, prothere, „ mento, at, yaou,
1, chestraing, for, have, to, intrally, of, qut, ., best, compers, ***, cluseliment, uster, of,
is, deveral, this, thise, of, offect, inatever, thifer, constranded, stater, vill, in, thase, in,
youse, menttering, and, ., of, in, verate, of, to
If we define a feature for every possible combination of letters, we can represent any probability distribution.
However, this will overfit. The power of the maxent approach is that we can choose which features matter for
the domain.
An alternative approach is to introduce latent variables, that implicitly model correlations amongst the
visible nodes, rather than explicitly having to learn feature functions. See ?? for an example of such a model.
2 We thank John Lafferty for sharing this example.
Fr(A,A)  Fr(B,B)  Fr(B,A)  Fr(A,B)  Sm(A)  Sm(B)  Ca(A)  Ca(B)
   1        1        0        1       1      1      1      1
   1        1        0        1       1      0      0      0
   1        1        0        1       1      1      0      1

Table 4.3: Some possible joint instantiations of the 8 variables in the smoking example.
values. Similarly, we have the “degenerate” nodes Fr(A) and Fr(B), since we did not enforce x ≠ y in Equation (4.29). (If we
add such constraints, then the model compiler, which generates the ground network, should avoid creating redundant nodes.)
Figure 4.5: An example of a ground Markov logic network represented as a pairwise MRF for 2 people. Adapted from
Figure 2.1 from [DL09]. Used with kind permission of Pedro Domingos.
where the value of w > 0 controls how strongly we want to enforce the corresponding rule.
The overall joint distribution has the form
$$p(\mathbf{x}) = \frac{1}{Z(\mathbf{w})}\exp\left(\sum_i w_i n_i(\mathbf{x})\right) \tag{4.34}$$
where ni (x) is the number of instances of clause i which evaluate to true in assignment x.
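As a tiny illustration of Equation (4.34), here is a brute-force enumeration over a hypothetical ground model with two binary variables and a single weighted clause (the clause and weight are made up):

import itertools, math

# Hypothetical ground model: two binary variables, Sm(A) and Ca(A),
# and one soft clause "Smokes(A) => Cancer(A)" with weight w.
w = 1.5
def n_true(sm, ca):
    return int((not sm) or ca)   # n_i(x): is the clause true in this state?

states = list(itertools.product([0, 1], repeat=2))
Z = sum(math.exp(w * n_true(sm, ca)) for sm, ca in states)
for sm, ca in states:
    p = math.exp(w * n_true(sm, ca)) / Z
    print(f"Sm={sm} Ca={ca}: p={p:.3f}")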
Given a grounded MLN model, we can then perform inference using standard methods. Of course, the
ground models are often extremely large, so more efficient inference methods, which avoid creating the full
ground model (known as lifted inference), must be used. See [DL09; KNP11] for details.
One way to gain tractability is to relax the discrete problem to a continuous one. This is the basic
idea behind hinge-loss MRFs [Bac+15], which support exact inference using scalable convex optimization.
There is a template language for this model family known as probabilistic soft logic, which has a similar
“flavor” to MLN, although it is not quite as expressive.
Recently MLNs have been combined with DL in various ways. For example, [Zha+20] uses graph neural
networks for inference. And [WP18] uses MLNs for evidence fusion, where the noisy predictions come from
DNNs trained using weak supervision.
Finally, it is worth noting one subtlety which arises with undirected models, namely that the size of the
unrolled model, which depends on the number of objects in the universe, can affect the results of inference,
even if we have no data about the new objects. For example, consider an undirected chain of length T , with T
hidden nodes zt and T observed nodes yt ; call this model M1 . Now suppose we double the length of the chain
to 2T, without adding more evidence; call this model M2. We find that $p(z_t | y_{1:T}, M_1) \neq p(z_t | y_{1:T}, M_2)$, for t = 1 : T, even though we have not added new information, due to the different partition functions. This does
not happen with a directed chain, because the newly added nodes can be marginalized out without affecting
the original nodes, since the model is locally normalized and therefore modular. See [JBB09; Poo+12] for
further discussion.
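The following minimal numeric check illustrates this subtlety (a sketch, using an arbitrary fixed pairwise potential and no observations): in an undirected binary chain, the marginal of the first node changes when we lengthen the chain, even though no evidence is added.

```python
import itertools
import numpy as np

# A fixed, non-uniform pairwise potential psi(z, z') for binary states.
psi = np.array([[2.0, 1.0],
                [1.0, 3.0]])

def node_marginal(T, t=0):
    # Brute-force the marginal p(z_t) in an undirected chain of length T.
    probs = np.zeros(2)
    for z in itertools.product([0, 1], repeat=T):
        weight = np.prod([psi[z[i], z[i + 1]] for i in range(T - 1)])
        probs[z[t]] += weight
    return probs / probs.sum()

print(node_marginal(T=2))   # approx [0.4286 0.5714]
print(node_marginal(T=4))   # approx [0.3889 0.6111]: a different marginal
```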
Chapter 5
Information theory
Chapter 6
Optimization
6.1 Proximal methods
6.1.1 Proximal operators
The proximal operator (also called a proximal mapping) of f, denoted $\mathrm{prox}_f : \mathbb{R}^n \to \mathbb{R}^n$, is defined by
$$\mathrm{prox}_f(x) = \operatorname*{argmin}_z\ f(z) + \frac{1}{2}\|z - x\|_2^2 \quad (6.2)$$
This is a strongly convex function and hence has a unique minimizer. This operator is sketched in Figure 6.1a.
We see that points inside the domain move towards the minimum of the function, whereas points outside the
domain move to the boundary and then towards the minimum.
For example, suppose f is the indicator function for the convex set C, i.e.,
$$f(x) = I_C(x) = \begin{cases} 0 & \text{if } x \in C \\ \infty & \text{if } x \notin C \end{cases} \quad (6.3)$$
In this case, the proximal operator is equivalent to projection onto the set C: $\mathrm{proj}_C(x) = \operatorname{argmin}_{z \in C} \|z - x\|_2$.
Figure 6.1: (a) Evaluating a proximal operator at various points. The thin lines represent level sets of a convex
function; the minimum is at the bottom left. The black line represents the boundary of its domain. Blue points get
mapped to red points by the prox operator, so points outside the feasible set get mapped to the boundary, and points inside the feasible set get mapped closer to the minimum. From Figure 1 of [PB+14]. Used with kind permission of
Stephen Boyd. (b) Illustration of the Moreau envelope with η = 1 (dotted line) of the absolute value function (solid
black line). See text for details. From Figure 1 of [PSW15]. Used with kind permission of Nicholas Polson.
We will often want to compute the prox operator for a scaled function ηf , for η > 0, which can be written
as
$$\mathrm{prox}_{\eta f}(x) = \operatorname*{argmin}_z\ f(z) + \frac{1}{2\eta}\|z - x\|_2^2 \quad (6.5)$$
The solution to the problem in Equation (6.5) is the same as the solution to the trust-region optimization problem
$$\operatorname*{argmin}_z\ f(z) \quad \text{s.t.} \quad \|z - x\|_2 \le \rho \quad (6.6)$$
for appropriate choices of η and ρ. Thus the proximal projection minimizes the function while staying close to the current iterate. We give other interpretations of the proximal operator below.
We can generalize the operator by replacing the Euclidean distance with Mahalanobis distance:
$$\mathrm{prox}_{\eta f, A}(x) = \operatorname*{argmin}_z\ f(z) + \frac{1}{2\eta}(z - x)^T A (z - x) \quad (6.7)$$
where A is a psd matrix.
For example, in Figure 6.1b, we see that $f^1_{x_0}(z^*) = f^1_{x_0}(0.5) = 1.0$, so $f^1(x_0) = 1.0$. This is shown by the blue circle. The dotted line is the locus of blue points as we vary $x_0$, i.e., the Moreau envelope of f.
We see that the Moreau envelope is a smooth lower bound on f , and has the same minimum location as
f . Furthermore, it has domain Rn , even when f does not, and it is continuously differentiable, even when f
is not. This makes it easier to optimize. For example, the Moreau envelope of f (r) = |r| is the Huber loss
function, which is used in robust regression.
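The following is a small numeric check of this claim (a sketch): we evaluate the Moreau envelope of f(r) = |r| by grid minimization over the inner variable, and compare it to the Huber loss with threshold η.

```python
import numpy as np

eta = 1.0
z = np.linspace(-10, 10, 200001)          # dense grid over the inner variable

def envelope(x):
    # Moreau envelope: min_z |z| + (z - x)^2 / (2 * eta), by brute force.
    return np.min(np.abs(z) + (z - x) ** 2 / (2 * eta))

def huber(x, delta=eta):
    return x**2 / (2 * delta) if abs(x) <= delta else abs(x) - delta / 2

for x in [-3.0, -0.5, 0.25, 2.0]:
    print(x, envelope(x), huber(x))       # the two columns agree
```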
6.1.2.1 Moreau decomposition
A useful technique for computing some kinds of proximal operators leverages a result known as Moreau
decomposition, which states that
x = proxf (x) + proxf ∗ (x) (6.18)
where f ∗ is the convex conjugate of f (see Section 6.5).
For example, suppose $f = \|\cdot\|$ is a general norm on $\mathbb{R}^D$. It can be shown that $f^* = I_B$, where
B = {x : ||x||∗ ≤ 1} (6.19)
Hence
proxλf (x) = x − λproxf ∗ /λ (x/λ) = x − λprojB (x/λ) (6.21)
Thus there is a close connection between proximal operators of norms and projections onto norm balls that
we will leverage below.
For example, if we want to ensure all elements are non-negative, we can use the indicator of the non-negative orthant, $f = I_C$ with $C = \mathbb{R}^D_+$; the corresponding prox operator is elementwise clipping, $\mathrm{proj}_C(x) = \max(x, 0)$.
6.1.2.3 $\ell_1$ norm
Consider the 1-norm $f(x) = \|x\|_1$. The proximal projection can be computed componentwise. We can solve each 1d problem as follows:
$$\mathrm{prox}_{\lambda f}(x) = \operatorname*{argmin}_z\ \lambda |z| + \frac{1}{2}(z - x)^2 \quad (6.24)$$
One can show that the solution to this is given by
$$\mathrm{prox}_{\lambda f}(x) = \begin{cases} x - \lambda & \text{if } x \ge \lambda \\ 0 & \text{if } |x| \le \lambda \\ x + \lambda & \text{if } x \le -\lambda \end{cases} \quad (6.25)$$
This is known as the soft thresholding operator, since values less than λ in absolute value are set to 0
(thresholded), but in a differentiable way. This is useful for enforcing sparsity. Note that soft thresholding
can be written more compactly as
$$\mathrm{SoftThreshold}_\lambda(x) = \mathrm{sign}(x)\,(|x| - \lambda)_+ \quad (6.26)$$
where $x_+ = \max(x, 0)$ is the positive part of x. In the vector case, we define $\mathrm{SoftThreshold}_\lambda(x)$ to be elementwise soft thresholding.
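A one-line implementation of Equation (6.26) (a sketch, using the vector convention just described):

```python
import numpy as np

# Elementwise soft thresholding: the prox operator of lam * ||.||_1.
def soft_threshold(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

print(soft_threshold(np.array([-2.0, -0.3, 0.1, 1.5]), lam=0.5))
# [-1.5 -0.  0.  1. ]
```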
6.1.2.4 $\ell_2$ norm
Now consider the $\ell_2$ norm $f(x) = \|x\|_2 = \sqrt{\sum_{d=1}^D x_d^2}$. The dual norm for this is also the $\ell_2$ norm. Projecting onto the corresponding unit ball B can be done by simply scaling vectors that lie outside the unit sphere:
$$\mathrm{proj}_B(x) = \begin{cases} \frac{x}{\|x\|_2} & \text{if } \|x\|_2 > 1 \\ x & \text{if } \|x\|_2 \le 1 \end{cases} \quad (6.27)$$
Using the Moreau decomposition, the prox operator is therefore
$$\mathrm{prox}_{\lambda f}(x) = \left(1 - \frac{\lambda}{\|x\|_2}\right)_+ x \quad (6.28)$$
This will set the whole vector to zero if its $\ell_2$ norm is less than λ. This is therefore called block soft thresholding.
6.1.2.5 Squared $\ell_2$ norm
If instead we use the squared $\ell_2$ norm, $f(x) = \frac{1}{2}\|x\|_2^2$, the prox operator is
$$\mathrm{prox}_{\lambda f}(x) = \frac{1}{1 + \lambda}\, x \quad (6.29)$$
This reduces the magnitude of the x vector, but does not enforce sparsity. It is therefore called the shrinkage operator.
More generally, if $f(x) = \frac{1}{2}x^T A x + b^T x + c$ is a quadratic, with A being positive definite, then
$$\mathrm{prox}_{\lambda f}(x) = (I + \lambda A)^{-1}(x - \lambda b) \quad (6.30)$$
A special case of this is if f is affine, $f(x) = b^T x + c$. Then we have $\mathrm{prox}_{\lambda f}(x) = x - \lambda b$. We saw an example of this in Equation (6.12).
6.1.2.6 Nuclear norm
The nuclear norm, also called the trace norm, of an m × n matrix A is the $\ell_1$ norm of its singular values: $f(A) = \|A\|_* = \|\sigma\|_1$. Using this as a regularizer can result in a low rank matrix. The proximal operator for this is defined by
$$\mathrm{prox}_{\lambda f}(A) = \sum_i (\sigma_i - \lambda)_+\, u_i v_i^T \quad (6.31)$$
where $A = \sum_i \sigma_i u_i v_i^T$ is the SVD of A. This operation is called singular value thresholding.
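A minimal implementation of Equation (6.31) (a sketch) using numpy's SVD:

```python
import numpy as np

# Singular value thresholding: the prox of lam * ||.||_* (nuclear norm).
def svt(A, lam):
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(np.maximum(s - lam, 0.0)) @ Vt

A = np.random.randn(5, 4)
print(np.linalg.matrix_rank(svt(A, lam=1.0)))  # typically lower rank than A
```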
Consider the cone of positive semidefinite matrices C, and let $f(A) = I_C(A)$ be the indicator function. The proximal operator corresponds to projecting A onto the cone. This can be computed using
$$\mathrm{proj}_C(A) = \sum_i (\lambda_i)_+\, u_i u_i^T \quad (6.32)$$
where $A = \sum_i \lambda_i u_i u_i^T$ is the eigenvalue decomposition of A. This is useful for optimizing psd matrices.
6.1.2.8 Projection onto probability simplex
Let $C = \{x : x \ge 0,\ \sum_{d=1}^D x_d = 1\} = S_D$ be the probability simplex in D dimensions. We can project onto this using
$$\mathrm{proj}_C(x) = (x - \nu \mathbf{1})_+ \quad (6.33)$$
The value ν ∈ R must be found using bisection search. See [PB+14, p.183] for details. This is useful for
optimizing over discrete probability distributions.
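The following sketch implements Equation (6.33), using simple interval bisection to find ν (the bracketing interval and tolerance are our own choices, not from [PB+14]):

```python
import numpy as np

def proj_simplex(x, tol=1e-10):
    # Find nu such that sum(max(x - nu, 0)) == 1, by bisection.
    lo, hi = x.min() - 1.0, x.max()          # nu must lie in this interval
    while hi - lo > tol:
        nu = (lo + hi) / 2
        if np.maximum(x - nu, 0.0).sum() > 1.0:
            lo = nu                          # too much mass: raise nu
        else:
            hi = nu
    return np.maximum(x - (lo + hi) / 2, 0.0)

p = proj_simplex(np.array([0.3, 1.2, -0.5]))
print(p, p.sum())                            # nonnegative, sums to 1
```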
6.1.3 Proximal point methods (PPM)
Consider the following update:
$$\theta_{t+1} = \mathrm{prox}_{\eta_t \ell_t}(\theta_t) = \operatorname*{argmin}_\theta\ \ell_t(\theta) + \frac{1}{2\eta_t}\|\theta - \theta_t\|_2^2 \quad (6.35)$$
where $\ell_t(\theta) = \ell(\theta, z_t)$ and $z_t \sim q$. The resulting method is known as stochastic PPM (see e.g., [PN18]). If q
is the empirical distribution associated with a finite-sum objective, this is called the incremental proximal
point method [Ber15]. It is often more stable than SGD.
In the case where the cost function is a linear least squares problem, one can show [AEM18] that the
IPPM is equivalent to the Kalman filter (??), where the posterior mean is equal to the current parameter
estimate, θt . The advantage of this probabilistic perspective is that it also gives us the posterior covariance,
which can be used to define a variable-metric distance function inside the prox operator, as in Equation (6.7).
We can extend this to nonlinear problems using the extended KF (??).
Suppose we replace $\ell_t$ with the linear approximation $\hat{\ell}_t(\theta) = \ell_t(\theta_t) + g_t^T(\theta - \theta_t)$, where $g_t = \nabla_\theta \ell_t(\theta_t)$. The update becomes
$$\theta_{t+1} = \mathrm{prox}_{\eta_t \hat{\ell}_t}(\theta_t) = \operatorname*{argmin}_\theta\ \hat{\ell}_t(\theta) + \frac{1}{2\eta_t}\|\theta - \theta_t\|_2^2 \quad (6.37)$$
We have
$$\nabla_\theta \left[ \ell_t(\theta_t) + g_t^T(\theta - \theta_t) + \frac{1}{2\eta_t}\|\theta - \theta_t\|_2^2 \right] = g_t + \frac{1}{\eta_t}(\theta - \theta_t) \quad (6.38)$$
Setting the gradient to zero yields the SGD step $\theta_{t+1} = \theta_t - \eta_t g_t$.
Figure 6.2: Illustration of the benefits of using a lower-bounded loss function when training a resnet-128 CNN on the CIFAR10 image classification dataset. The curves are as follows: SGM (stochastic gradient method, i.e., SGD), Adam, truncated SGD and truncated AdaGrad. (a) Time to reach an error that satisfies $L(\theta_t) - L(\theta^*) \le \epsilon$ vs initial learning rate $\eta_0$. (b) Top-1 accuracy after 50 epochs vs $\eta_0$. The lines represent median performance across 50 random restarts, and shading represents 90% confidence intervals. From Figure 4 of [AD19c]. Used with kind permission of Hilal Asi.
Sometimes we can do better than just using PPM with a linear approximation to the objective, at essentially no extra cost, as pointed out in [AD19b; AD19a; AD19c]. For example, suppose we know a lower bound on the loss, $\ell_t^{\min} = \min_\theta \ell_t(\theta)$. For example, when using squared error, or cross-entropy loss for discrete labels, we have $\ell_t(\theta) \ge 0$. Let us therefore define the truncated model
$$\hat{\ell}_t(\theta) = \max\left( \ell_t(\theta_t) + g_t^T(\theta - \theta_t),\ \ell_t^{\min} \right) \quad (6.39)$$
We can further improve things by replacing the Euclidean norm with a scaled Euclidean norm, where the diagonal scaling matrix is given by $A_t = \mathrm{diag}\left( \sum_{i=1}^t g_i g_i^T \right)^{\frac{1}{2}}$, as in AdaGrad [DHS11]. If $\ell_t^{\min} = 0$, the resulting proximal update becomes
$$\theta_{t+1} = \operatorname*{argmin}_\theta\ \max\left( \ell_t(\theta_t) + g_t^T(\theta - \theta_t),\ 0 \right) + \frac{1}{2\eta_t}(\theta - \theta_t)^T A_t (\theta - \theta_t) \quad (6.40)$$
$$= \theta_t - \min\left( \eta_t,\ \frac{\ell_t(\theta_t)}{g_t^T A_t^{-1} g_t} \right) A_t^{-1} g_t \quad (6.41)$$
Thus the update is like a standard SGD update, but we truncate the learning rate if it is too big.1
[AD19c] call this truncated AdaGrad. Furthermore, they prove that optimizing this truncated linear approximation (with or without AdaGrad weighting), instead of the standard linear approximation used by
gradient descent, can result in significant benefits. In particular, it is guaranteed to be stable (under certain
technical conditions) for any learning rate, whereas standard GD can “blow up”, even for convex problems.
Figure 6.2 shows the benefits of this approach when training a resnet-128 CNN (??) on the CIFAR10
image classification dataset. For SGD and the truncated proximal method, the learning rate is decayed using
ηt = η0 t−β with β = 0.6. For Adam and truncated AdaGrad, the learning rate is set to ηt = η0 , since
we use diagonal scaling. We see that both truncated methods (regular and AdaGrad version) have good
performance for a much broader range of initial learning rate η0 compared to SGD or Adam.
1 One way to derive this update (suggested by Hilal Asi) is to do case analysis on the value of $\hat{\ell}_t(\theta_{t+1})$, where $\hat{\ell}_t$ is the truncated linear model. If $\hat{\ell}_t(\theta_{t+1}) > 0$, then setting the gradient to zero yields the usual SGD update, $\theta_{t+1} = \theta_t - \eta_t g_t$. (We assume $A_t = I$ for simplicity.) Otherwise we must have $\hat{\ell}_t(\theta_{t+1}) = 0$. But we know that $\theta_{t+1} = \theta_t - \lambda g_t$ for some λ, so we solve $\hat{\ell}_t(\theta_t - \lambda g_t) = 0$ to get $\lambda = \hat{\ell}_t(\theta_t)/\|g_t\|_2^2$.
6.1.4 Mirror descent
We can extend the proximal point update in Equation (6.34) by replacing the Euclidean distance term $\|\theta - \theta_t\|_2^2$ by a more general Bregman divergence (??),
$$D_h(x, y) = h(x) - h(y) - \nabla h(y)^T(x - y) \quad (6.42)$$
where h(x) is a strongly convex function. This gives the following update:
$$\theta_{t+1} = \operatorname*{argmin}_\theta\ L(\theta) + \frac{1}{\eta_t} D_h(\theta, \theta_t) \quad (6.43)$$
Suppose we make a linear approximation to L(θ),
L̂t (θ) = L(θt ) + gTt (θ − θt ) (6.44)
where $g_t = \nabla_\theta L(\theta_t)$. If we perform a proximal update with this linear approximation using Euclidean distance, we get a standard gradient update, as we showed in Section 6.1.3.2. However, combining this with the Bregman divergence gives the following update:
$$\theta_{t+1} = \operatorname*{argmin}_\theta\ \eta_t g_t^T \theta + D_h(\theta, \theta_t) \quad (6.45)$$
This is known as mirror descent [NY83; BT03]. This can easily be extended to the stochastic setting in
the obvious way.
One can show that natural gradient descent (??) is a form of mirror descent [RM15]. More precisely, mirror
descent in the mean parameter space is equivalent to natural gradient descent in the canonical parameter
space.
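To make this concrete, here is a sketch of mirror descent on the probability simplex: with h equal to the negative entropy, $D_h$ is the KL divergence, and the update in Equation (6.45) has the closed form $\theta_{t+1} \propto \theta_t \odot \exp(-\eta_t g_t)$, i.e., the exponentiated gradient update. The toy objective and step size are our own choices.

```python
import numpy as np

target = np.array([0.2, 0.5, 0.3])
grad = lambda th: th - target        # gradient of 0.5 * ||theta - target||^2

theta = np.ones(3) / 3               # start at the uniform distribution
for t in range(200):
    theta = theta * np.exp(-0.5 * grad(theta))   # multiplicative update
    theta /= theta.sum()                         # stays on the simplex
print(theta)                                     # approaches `target`
```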
6.1.5.1 Example: Iterative soft-thresholding algorithm (ISTA) for sparse linear regression
Suppose we are interested in fitting a linear regression model with a sparsity-promoting prior on the weights, as
in the lasso model (??). One way to implement this is to add the $\ell_1$-norm of the parameters as a (non-smooth) penalty term, $L_r(\theta) = \|\theta\|_1 = \sum_{d=1}^D |\theta_d|$. Thus the objective is
$$L(\theta) = L_s(\theta) + L_r(\theta) = \frac{1}{2}\|X\theta - y\|_2^2 + \lambda \|\theta\|_1 \quad (6.51)$$
The proximal gradient descent update can be written as
$$\theta_{t+1} = \mathrm{SoftThreshold}_{\eta_t \lambda}\left( \theta_t - \eta_t \nabla L_s(\theta_t) \right)$$
where the soft thresholding operator (Equation (6.26)) is applied elementwise, and $\nabla L_s(\theta) = X^T(X\theta - y)$.
This is called the iterative soft thresholding algorithm or ISTA [DDDM04; Don95]. If we combine this
with Nesterov acceleration, we get the method known as “fast ISTA” or FISTA [BT09], which is widely used
to fit sparse linear models.
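A minimal ISTA sketch for Equation (6.51); the step size $1/\|X\|_2^2$ and the toy data are our own choices.

```python
import numpy as np

def ista(X, y, lam, n_iters=500):
    eta = 1.0 / np.linalg.norm(X, 2) ** 2        # 1 / Lipschitz constant
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        theta = theta - eta * X.T @ (X @ theta - y)          # gradient step
        theta = np.sign(theta) * np.maximum(np.abs(theta) - eta * lam, 0)
    return theta

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
y = X[:, 0] - 2 * X[:, 3] + 0.01 * rng.standard_normal(50)   # sparse truth
print(np.round(ista(X, y, lam=1.0), 2))                      # mostly zeros
```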
We see that the dual variable is the (scaled) running average of the consensus errors.
Inserting the definition of $L_\rho(x, z, y)$ gives us the following more explicit update equations:
$$x_{t+1} = \operatorname*{argmin}_x\ L_s(x) + y_t^T x + \frac{\rho}{2}\|x - z_t\|_2^2 \quad (6.58)$$
$$z_{t+1} = \operatorname*{argmin}_z\ L_r(z) - y_t^T z + \frac{\rho}{2}\|x_{t+1} - z\|_2^2 \quad (6.59)$$
Figure 6.3: Robust PCA applied to some frames from a surveillance video. First column is the input image. Second column is the low-rank background model. Third column is the sparse foreground model. Last column is the derived foreground mask. From Figure 1 of [Bou+17]. Used with kind permission of Thierry Bouwmans.
Finally, if we define ut = (1/ρ)yt and λ = 1/ρ, we can now write this in a more general way:
xt+1 = proxλLs (zt − ut ) (6.62)
zt+1 = proxλLr (xt+1 + ut ) (6.63)
ut+1 = ut + xt+1 − zt+1 (6.64)
This is called the alternating direction method of multipliers or ADMM algorithm. The advantage
of this method is that the different terms in the objective (along with any constraints they may have) are
handled completely independently, allowing different solvers to be used. Furthermore, the method can be
extended to the stochastic setting as shown in [ZK14].
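As an illustration, here is a sketch of Equations (6.62) to (6.64) applied to the lasso problem of Equation (6.51), splitting $L_s$ (least squares) from $L_r$ (the $\ell_1$ penalty); the prox of $L_s$ is a ridge-type solve, and the prox of $L_r$ is soft thresholding. The data and hyperparameters are illustrative.

```python
import numpy as np

def admm_lasso(X, y, lam, rho=1.0, n_iters=200):
    n, d = X.shape
    M = np.linalg.inv(X.T @ X + rho * np.eye(d))   # cached ridge system
    x = z = u = np.zeros(d)
    for _ in range(n_iters):
        x = M @ (X.T @ y + rho * (z - u))          # prox of Ls at z - u
        v = x + u
        z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0)  # prox of Lr
        u = u + x - z                              # dual (consensus) update
    return z

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 10))
y = 3 * X[:, 2] + 0.01 * rng.standard_normal(50)
print(np.round(admm_lasso(X, y, lam=1.0), 2))      # sparse estimate
```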
Consider the canonical matrix decomposition problem
$$\text{minimize} \sum_{j=1}^J \gamma_j \phi_j(X_j) \quad \text{s.t.} \quad \sum_{j=1}^J X_j = A$$
where $A \in \mathbb{R}^{m \times n}$ is a given data matrix, $X_j \in \mathbb{R}^{m \times n}$ are the optimization variables, and $\gamma_j > 0$ are trade-off parameters.
For example, suppose we want to find a good least squares approximation to A as a sum of a low rank
matrix plus a sparse matrix. This is called robust PCA [Can+11], since the sparse matrix can handle the
small number of outliers that might otherwise cause the rank of the approximation to be high. The method
is often used to decompose surveillance videos into a low rank model for the static background, and a sparse
model for the dynamic foreground objects, such as moving cars or people, as illustrated in Figure 6.3. (See
e.g., [Bou+17] for a review.) RPCA can also be used to remove small “outliers”, such as specularities and
shadows, from images of faces, to improve face recognition.
We can formulate robust PCA as the following optimization problem:
$$\text{minimize } \|A - (L + S)\|_F^2 + \gamma_L \|L\|_* + \gamma_S \|S\|_1 \quad (6.66)$$
which is a sparse plus low rank decomposition of the observed data matrix. We can reformulate this to match
the form of a canonical matrix decomposition problem by defining X1 = L, X2 = S and X3 = A − (X1 + X2 ),
and then using these loss functions:
$$\phi_1(X_1) = \|X_1\|_*, \quad \phi_2(X_2) = \|X_2\|_1, \quad \phi_3(X_3) = \|X_3\|_F^2 \quad (6.67)$$
We can tackle such matrix decomposition problems using ADMM, where we use the split $L_s(X) = \sum_j \gamma_j \phi_j(X_j)$ and $L_r(X) = I_C(X)$, where $X = (X_1, \dots, X_J)$ and $C = \{X_{1:J} : \sum_{j=1}^J X_j = A\}$. The overall algorithm becomes
$$X_{j,t+1} = \mathrm{prox}_{\eta_t \phi_j}\left( X_{j,t} - \overline{X}_t + \frac{1}{J} A - U_t \right) \quad (6.68)$$
$$U_{t+1} = U_t + \overline{X}_{t+1} - \frac{1}{J} A \quad (6.69)$$
where $\overline{X}$ is the elementwise average of $X_1, \dots, X_J$. Note that the $X_j$ can be updated in parallel.
Projection onto the $\ell_1$ norm is discussed in Section 6.1.2.3, projection onto the nuclear norm is discussed in Section 6.1.2.6, projection onto the squared Frobenius norm is the same as projection onto the squared Euclidean norm discussed in Section 6.1.2.5, and projection onto the constraint set $\sum_j X_j = A$ can be done using the averaging operator:
$$\mathrm{proj}_C(X_1, \dots, X_J) = (X_1, \dots, X_J) - \overline{X} + \frac{1}{J} A \quad (6.70)$$
An alternative to using `1 minimization in the inner loop is to use hard thresholding [CGJ17]. Although
not convex, this method can be shown to converge to the global optimum, and is much faster.
It is also possible to formulate a non-negative version of robust PCA. Even though NRPCA is not a
convex problem, it is possible to find the globally optimal solution [Fat18; AS19].
At each step, we move to the best point in the neighborhood of the current state:
$$x_{t+1} = \operatorname*{argmax}_{x \in \mathrm{nbr}(x_t)} f(x) \quad (6.71)$$
where $\mathrm{nbr}(x_t) \subseteq \mathcal{X}$ is the set of neighbors of $x_t$. This is called hill climbing, steepest ascent, or greedy search.
If the “neighborhood” of a point contains the entire space, Equation (6.71) will return the global optimum in one step, but usually such a global neighborhood is too large to search exhaustively. Consequently we
usually define local neighborhoods. For example, consider the 8-queens problem. Here the goal is to place queens on an 8 × 8 chessboard so that they don’t attack each other (see Figure 6.6). The state space has the form $\mathcal{X} = 64^8$, since we have to specify the location of each queen on the grid. However, due to the constraints, there are only $8^8 \approx 17$M feasible states. We define the neighbors of a state to be all possible states generated by moving a single queen to another square in the same column, so each node has 8 × 7 = 56 neighbors. According to [RN10, p.123], if we start at a randomly generated 8-queens state, steepest ascent gets stuck at a local maximum 86% of the time, so it only solves 14% of problem instances. However, it is fast, taking an average of 4 steps when it succeeds and 3 when it gets stuck.
In the sections below, we discuss slightly smarter algorithms that are less likely to get stuck in local
maxima.
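A minimal sketch of greedy hill climbing for 8-queens, using the column-move neighborhood described above (the representation details are our own choices):

```python
import random

# A state is a tuple of 8 rows, one queen per column.
def attacks(s):
    return sum(s[i] == s[j] or abs(s[i] - s[j]) == j - i
               for i in range(8) for j in range(i + 1, 8))

def hill_climb(s):
    while True:
        # All 56 neighbors: move one queen within its own column.
        nbrs = [s[:i] + (r,) + s[i + 1:] for i in range(8)
                for r in range(8) if r != s[i]]
        best = min(nbrs, key=attacks)
        if attacks(best) >= attacks(s):
            return s                  # local optimum (possibly not global)
        s = best

s = hill_climb(tuple(random.randrange(8) for _ in range(8)))
print(s, attacks(s))                  # attacks == 0 on roughly 14% of restarts
```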
One simple variant is to pick moves randomly, with probability related to how much they improve the objective; this is called stochastic hill climbing. If we gradually decrease the entropy of this probability distribution (so we become greedier over time), we get a method called simulated annealing, which we discuss in ??.
Another simple technique is to use greedy hill climbing, but then whenever we reach a local maximum,
we start again from a different random starting point. This is called random restart hill climbing. To
see the benefit of this, consider again the 8-queens problem. If each hill-climbing search has a probability of
p ≈ 0.14 of success, then we expect to need R = 1/p ≈ 7 restarts until we find a valid solution. The expected
number of total steps can be computed as follows. Let N1 = 4 be the average number of steps for successful
trials, and N0 = 3 be the average number of steps for failures. Then the total number of steps on average is
N1 + (R − 1)N0 = 4 + 6 × 3 = 22. Since each step is quick, the overall method is very fast. For example, it
can solve an n-queens problem with n = 1M in under a minute.
Of course, solving the n-queens problem is not the most useful task in practice. However, it is typical of
several real-world boolean satisfiability problems, which arise in problems ranging from AI planning to
model checking (see e.g., [SLM92]). In such problems, simple stochastic local search (SLS) algorithms of
the kind we have discussed work surprisingly well (see e.g., [HS05]).
Hill climbing will stop as soon as it reaches a local maximum or a plateau. Obviously one can perform a
random restart, but this would ignore all the information that had been gained up to this point. A more
intelligent alternative is called tabu search [GL97]. This is like hill climbing, except it allows moves that
decrease (or at least do not increase) the scoring function, provided the move is to a new state that has not
been seen before. We can enforce this by keeping a tabu list which tracks the τ most recently visited states.
This forces the algorithm to explore new states, and increases the chances of escaping from local maxima.
We continue to do this for up to cmax steps (known as the “tabu tenure”). The pseudocode can be found in
Algorithm 1. (If we set cmax = 1, we get greedy hill climbing.)
For example, consider what happens when tabu search reaches a hill top, xt . At the next step, it will move
to one of the neighbors of the peak, xt+1 ∈ nbr(xt ), which will have a lower score. At the next step, it will
move to the neighbor of the previous step, xt+2 ∈ nbr(xt+1 ); the tabu list prevents it cycling back to xt (the
peak), so it will be forced to pick a neighboring point at the same height or lower. It continues in this way,
“circling” the peak, possibly being forced downhill to a lower level-set (an inverse basin flooding operation),
until it finds a ridge that leads to a new peak, or until it exceeds a maximum number of non-improving moves.
According to [RN10, p.123], tabu search increases the percentage of 8-queens problems that can be solved
from 14% to 94%, although this variant takes an average of 21 steps for each successful instance and 64 steps
for each failed instance.
Figure 6.4: Illustration of grid search (left) vs random search (right). From Figure 1 of [BB12]. Used with kind
permission of James Bergstra.
(The term probabilistic optimization is often used to refer to Bayesian optimization (??), and stochastic optimization refers to any optimization problem in which the objective is stochastic.) Another term that is used for DBO is “model based optimization” (see e.g. [BBZ17]). However, this term is ambiguous, because it could either refer to a model of good candidates (i.e., a probability distribution over X) or a cheap (possibly differentiable) approximation to the objective (i.e., a regression function X → R), as discussed in ??.
Figure 6.5: Illustration of a genetic algorithm applied to the 8-queens problem. (a) Initial population of 4 strings. (b) We rank the members of the population by fitness, and then compute their probability of mating. Here the integer numbers represent the number of nonattacking pairs of queens, so the global maximum has a value of 28. We pick an individual θ with probability $p(\theta) = L(\theta)/Z$, where $Z = \sum_{\theta \in P} L(\theta)$ sums the total fitness of the population. For example, we pick the first individual with probability 24/78 = 0.31, the second with probability 23/78 = 0.29, etc. In this example, we pick the first individual once, the second twice, the third one once, and the last one does not get to breed. (c) A split point on the “chromosome” of each parent is chosen at random. (d) The two parents swap their chromosome halves. (e) We can optionally apply pointwise mutation. From Figure 4.6 of [RN10]. Used with kind permission of Peter Norvig.
can sometimes get better performance by using a parametric distribution, with suitable inductive bias. We
discuss some examples in Section 6.3.3.
Figure 6.6: The 8-queens states corresponding to the first two parents in Figure 6.5(c) and their first child in
Figure 6.5(d). We see that the encoding 32752411 means that the first queen is in row 3 (counting from the bottom
left), the second queen is in row 2, etc. The shaded columns are lost in the crossover, but the unshaded columns are
kept. From Figure 4.7 of [RN10]. Used with kind permission of Peter Norvig.
Figure 6.8: A taxonomy of various metaheuristic optimization algorithms. From https: // en. wikipedia. org/ wiki/
Metaheuristic . Used with kind permission of Wikipedia authors Johann Dreo and Caner Candan.
is similar to the use of response surface models in Bayesian optimization (??), except it does not deal
with the explore-exploit tradeoff.
• In a memetic algorithm [MC03], we combine mutation and recombination with standard local search.
Evolutionary algorithms have been applied to a large number of applications, including training neural
networks (this combination is known as neuroevolution [Sta+19]). An efficient JAX-based library for
(neuro)-evolution can be found at https://github.com/google/evojax.
Figure 6.9: Illustration of the BOA algorithm (EDA applied to a generative model structured as a Bayes net). Adapted
from Figure 3 of [PHL12].
It is straightforward to use more expressive probability models that capture dependencies between the
parameters (these are known as building blocks in the EA literature). For example, in the case of real-valued
parameters, we can use a multivariate Gaussian, p(x) = N (x|µ, Σ). The resulting method is called the
estimation of multivariate normal algorithm or EMNA [LL02]. (See also Section 6.3.4.)
For discrete random variables, it is natural to use probabilistic graphical models (??) to capture dependencies between the variables. [BD97] learns a tree-structured graphical model using the Chow-Liu algorithm (Section 31.1.2); [BJV97] is a special case of this where the graph is a chain. We can also learn more
general graphical model structures (see e.g., [LL02]). We typically use a Bayes net (??), since we can use
ancestral sampling (??) to easily generate samples; the resulting method is therefore called the Bayesian
Optimization Algorithm (BOA) [PGCP00].3 The hierarchical BOA (hBOA) algorithm [Pel05] extends
this by using decision trees and decision graphs to represent the local CPTs in the Bayes net (as in [CHM97]),
rather than using tables. In general, learning the structure of the probability model for use in EDA is called
linkage learning, by analogy to how genes can be linked together if they can be co-inherited as a building
block.
We can also use deep generative models to represent the distribution over good candidates. For example,
[CSF16] use denoising autoencoders and NADE models (??), [Bal17] uses a DNN regressor which is then
inverted using gradient descent on the inputs, [PRG17] uses RBMs (??), [GSM18] uses VAEs (??), etc.
Such models might take more data to fit (and therefore more function calls), but can potentially model the
probability landscape more faithfully. (Whether that translates to better optimization performance is not
clear, however.)
The differentiable CEM method of [AY19] replaces the top K operator with a soft, differentiable approximation, which allows the optimizer to be used as part of an end-to-end differentiable pipeline. For example, we can use this to create a differentiable model predictive control (MPC) algorithm (??), as described in ??.
The basic idea is as follows. Let $S_t = \{x_{t,i} \sim p(x|\theta_t) : i = 1:K'\}$ represent the current population, with fitness values $v_{t,i} = f(x_{t,i})$. Let $v_{t,K}^*$ be the K’th smallest value. In CEM, we compute the set of top K samples, $S_t^* = \{i : v_{t,i} \ge v_{t,K}^*\}$, and then update the model based on these: $\theta_{t+1} = \operatorname{argmax}_\theta \sum_{i \in S_t^*} p_t(i) \log p(x_{t,i}|\theta)$, where $p_t(i) = \mathbb{I}(i \in S_t^*)/|S_t^*|$. In the differentiable version, we replace the sparse distribution $p_t$ with the “soft” dense distribution $q_t = \Pi(p_t; \tau, K)$, where Π projects the distribution onto the polytope of distributions which sum to K. (Here $H(q) = -\sum_i [q_i \log q_i + (1 - q_i)\log(1 - q_i)]$ is the entropy used to regularize the projection, and τ > 0 is a temperature parameter.) This projection operator (and hence the whole DCEM algorithm) can be backpropagated through using implicit differentiation [AKZK19].
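For reference, here is a sketch of plain (non-differentiable) CEM with a diagonal Gaussian $p(x|\theta)$; the hyperparameters and the toy fitness function are illustrative.

```python
import numpy as np

def cem(f, dim, n_samples=100, n_elite=10, n_iters=50):
    mu, sigma = np.zeros(dim), np.ones(dim)
    for _ in range(n_iters):
        xs = mu + sigma * np.random.randn(n_samples, dim)    # K' samples
        elite = xs[np.argsort([f(x) for x in xs])[-n_elite:]]  # top K
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-6  # refit
    return mu

f = lambda x: -np.sum((x - 3.0) ** 2)     # maximum at x = (3, 3)
print(cem(f, dim=2))                      # approaches [3. 3.]
```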
3 This should not be confused with the Bayesian optimization methods we discuss in ??, which use response surface modeling.
Figure 6.10: Illustration of the CMA-ES method applied to a simple 2d function. The dots represent members of the
population, and the dashed orange ellipse represents the multivariate Gaussian. From https: // en. wikipedia. org/
wiki/ CMA-ES . Used with kind permission of Wikipedia author Sentewolf.
This can be approximated by drawing Monte Carlo samples. If the probability model is in the exponential
family, we can compute the natural gradient (??), rather than the “vanilla” gradient; such methods are called
natural evolution strategies [Wie+14].
6.3.5.1 CMA-ES
The CMA-ES method of [Han16], which stands for “covariance matrix adaptation evolution strategy” is
a kind of NES. It is very similar to CEM except it updates the parameters in a special way. In particular,
instead of computing the new mean and covariance using unweighted MLE on the elite set, we attach weights
to the elite samples based on their rank. We then set the new mean to the weighted MLE of the elite set.
The update equations for the covariance are more complex. In particular, “evolutionary paths” are also
used to accumulate the search directions across successive generations, and these are used to update the
covariance. It can be shown that the resulting updates approximate the natural gradient of L(θ) without
explicitly modeling the Fisher information matrix [Oll+17].
Figure 6.10 illustrates the method in action.
6.4.1 Example: computing Fibonacci numbers
Consider the problem of computing Fibonacci numbers, defined via the recursive equation $F_n = F_{n-1} + F_{n-2}$, with the base cases $F_0$ and $F_1$ given. Naively expanding the recursion for $F_5$ gives
$$F_5 = F_4 + F_3 \quad (6.76)$$
$$= (F_3 + F_2) + (F_2 + F_1) \quad (6.77)$$
$$= ((F_2 + F_1) + (F_1 + F_0)) + ((F_1 + F_0) + F_1) \quad (6.78)$$
$$= (((F_1 + F_0) + F_1) + (F_1 + F_0)) + ((F_1 + F_0) + F_1) \quad (6.79)$$
We see that there is a lot of repeated computation. For example, fib(2) is computed 3 times. One way to
improve the efficiency is to use memoization, which means memorizing each function value that is computed.
This will result in a linear time algorithm. However, the overhead involved can be high.
It is usually preferable to try to solve the problem bottom up, solving small subproblems first, and then
using their results to help solve larger problems later. A simple way to do this is shown in Algorithm 3.
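A minimal sketch of this bottom-up scheme (in the spirit of Algorithm 3, whose exact listing is not reproduced here), assuming the convention $F_0 = 0$, $F_1 = 1$:

```python
# Bottom-up dynamic programming for Fibonacci: solve the small subproblems
# first and reuse them, giving linear time and constant space.
def fib(n):
    prev, cur = 0, 1          # F(0), F(1)
    for _ in range(n):
        prev, cur = cur, prev + cur
    return prev

print([fib(n) for n in range(10)])   # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```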
6.4.2 ML examples
There are many applications of DP to ML problems, which we discuss elsewhere in this book. These include
the forwards-backwards algorithm for inference in HMMs (??), the Viterbi algorithm for MAP sequence
estimation in HMMs (??), inference in more general graphical models (??), reinforcement learning (??), etc.
Figure 6.11: Illustration of a conjugate function. Red line is original function f (x), and the blue line is a linear
bound λx. To make the bound tight, we find the x where ∇f (x) is parallel to λ, and slide the line up to touch there;
the amount we slide up is given by f ∗ (λ). Adapted from Figure 10.11 of [Bis06].
6.5.1 Introduction
Consider an arbitrary continuous function f(x), and suppose we create a linear lower bound on it of the form
$$L(x, \lambda) = \lambda^T x - f^*(\lambda)$$
where λ is the slope, which we choose, and $f^*(\lambda)$ is the intercept, which we solve for below. See Figure 6.11(a) for an illustration.
For a fixed λ, we can find the point xλ where the lower bound is tight by “sliding” the line upwards until
it touches the curve at xλ , as shown in Figure 6.11(b). At xλ , we minimize the distance between the function
and the lower bound:
$$x_\lambda \triangleq \operatorname*{argmin}_x\ f(x) - L(x, \lambda) = \operatorname*{argmin}_x\ f(x) - \lambda^T x \quad (6.81)$$
Since the bound is tight at this point, we have
$$f(x_\lambda) = L(x_\lambda, \lambda) = \lambda^T x_\lambda - f^*(\lambda) \quad (6.82)$$
and hence
$$f^*(\lambda) = \lambda^T x_\lambda - f(x_\lambda) = \max_x\ \lambda^T x - f(x) \quad (6.83)$$
The function f ∗ is called the conjugate of f , also known as the Fenchel transform of f . For the special
case of differentiable f , f ∗ is called the Legendre transform of f .
One reason conjugate functions are useful is that they can be used to create convex lower bounds to
non-convex functions. That is, we have L(x, λ) ≤ f (x), with equality at x = xλ , for any function f : RD → R.
For any given x, we can optimize over λ to make the bound as tight as possible, giving us a fixed function
L(x); this is called a variational approximation. We can then try to maximize this lower bound wrt x
instead of maximizing f (x). This method is used extensively in approximate Bayesian inference, as we discuss
in ??.
Figure 6.12: (a) The red curve is $f(x) = e^{-x}$ and the colored lines are linear lower bounds. Each lower bound of slope λ is tangent to the curve at the point $x_\lambda = -\log(-\lambda)$, where $f(x_\lambda) = e^{\log(-\lambda)} = -\lambda$. For the blue curve, this occurs at $x_\lambda = \xi$. Adapted from Figure 10.10 of [Bis06]. Generated by opt_lower_bound.py. (b) For a convex function f(x), its epigraph can be represented as the intersection of half-spaces defined by linear lower bounds of the form $f^\dagger(\lambda)$.
Adapted from Figure 13 of [JJ99].
Hence
$$f^\dagger(\lambda) = J(x_\lambda, \lambda) = \lambda(-\log(-\lambda)) - e^{\log(-\lambda)} = -\lambda \log(-\lambda) + \lambda \quad (6.89)$$
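A quick numeric check of Equation (6.89) (a sketch; the grid and the slopes are arbitrary):

```python
import numpy as np

# For f(x) = exp(-x), the conjugate max_x [lam * x - f(x)] (Equation (6.83))
# should equal -lam * log(-lam) + lam, for any slope lam < 0.
x = np.linspace(-10.0, 10.0, 200001)
f = np.exp(-x)
for lam in (-0.2, -1.0, -3.0):
    numeric = np.max(lam * x - f)
    closed = -lam * np.log(-lam) + lam
    print(lam, numeric, closed)       # the two columns agree
```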
If f is convex, then $f^{**} = f$, so f and $f^\dagger$ are called conjugate duals. To see why, note that $f(x) = \max_\lambda \lambda^T x - f^*(\lambda)$.
Since we are free to modify λ for each x, we can make the lower bound tight at each x. This perfectly
characterizes f , since the epigraph of a convex function is an intersection of half-planes defined by linear
lower bounds, as shown in Figure 6.12(b).
Let us demonstrate this using the example from Section 6.5.2. Define
$$J^*(x, \lambda) = \lambda x - f^\dagger(\lambda) = \lambda x + \lambda \log(-\lambda) - \lambda \quad (6.93)$$
We have
$$\frac{\partial}{\partial \lambda} J^*(x, \lambda) = x + \log(-\lambda) + \lambda \frac{-1}{-\lambda} - 1 = 0 \quad (6.94)$$
$$x = -\log(-\lambda) \quad (6.95)$$
$$\lambda_x = -e^{-x} \quad (6.96)$$
where
$$f^\dagger(\eta) = \min_x\ \eta x - f(x) \quad (6.99)$$
The function $f(x) = -\log(e^{x/2} + e^{-x/2})$ is a convex function of $y = x^2$, as can be verified by showing $\frac{d^2}{d(x^2)^2} f(x) > 0$. Hence we can create a linear lower bound on f, using the conjugate function
$$f^\dagger(\eta) = \max_{x^2}\ \eta x^2 - f(\sqrt{x^2}) \quad (6.105)$$
We have
$$0 = \eta - \frac{d}{dx^2} f(x) = \eta - \frac{dx}{dx^2}\frac{d}{dx} f(x) = \eta + \frac{1}{4x}\tanh\left(\frac{x}{2}\right) \quad (6.106)$$
Figure 6.13: Illustration of (a) exponential upper bound and (b) quadratic lower bound to the sigmoid function.
Generated by sigmoid_upper_bounds.py and sigmoid_lower_bounds.py.
Part II
Inference
Chapter 7
Chapter 8
State-space inference
Chapter 9
Here $V = \{x_1, \dots, x_V\}$ are the nodes, E are the edges, $\theta_s$ and $\theta_{st}$ are the node and edge potentials, and Z is the partition function:
$$Z = \sum_x \exp\left( \sum_{s \in V} \theta_s(x_s) + \sum_{(s,t) \in E} \theta_{st}(x_s, x_t) \right) \quad (9.2)$$
Since we just want the MAP configuration, we can ignore Z, and just compute
$$x^* = \operatorname*{argmax}_x\ \sum_{s \in V} \theta_s(x_s) + \sum_{(s,t) \in E} \theta_{st}(x_s, x_t) \quad (9.3)$$
We can compute this exactly using dynamic programming, as we explain in ??; however, this takes time
exponential in the treewidth of the graph, which is often too slow. In this section, we focus on approximate
methods that can scale to intractable models. We only give a brief description here; more details can be
found in [WJ08; KF09].
9.1.1 Notation
To simplify the presentation, we write the distribution in the following form:
$$p(x) = \frac{1}{Z(\theta)} \exp(-E(x)), \qquad E(x) \triangleq -\theta^T T(x) \quad (9.4)$$
where θ = ({θs;j }, {θs,t;j,k }) are all the node and edge parameters (the canonical parameters), and T (x) =
({I (xs = j)}, {I (xs = j, xt = k)}) are all the node and edge indicator functions (the sufficient statistics).
Note: we use s, t ∈ V to index nodes and j, k ∈ X to index states.
The mean of the sufficient statistics are known as the mean parameters of the model, and are given by
$$\mu = \mathbb{E}[T(x)] = (\{p(x_s = j)\},\ \{p(x_s = j, x_t = k)\}) \quad (9.5)$$
This is a vector of length $d = KV + K^2E$, where $K = |\mathcal{X}|$ is the number of states, $V = |\mathcal{V}|$ is the number of nodes, and $E = |\mathcal{E}|$ is the number of edges. Since µ completely characterizes the distribution p(x), we sometimes treat µ as a distribution itself.
Equation (9.5) is called the standard overcomplete representation. It is called “overcomplete” because
it ignores the sum-to-one constraints. In some cases, it is convenient to remove this redundancy. For example,
consider an Ising model where Xs ∈ {0, 1}. The model can be written as
$$p(x) = \frac{1}{Z(\theta)} \exp\left( \sum_{s \in V} \theta_s x_s + \sum_{(s,t) \in E} \theta_{st} x_s x_t \right) \quad (9.6)$$
where d = V + E. The corresponding mean parameters are µs = p(xs = 1) and µst = p(xs = 1, xt = 1).
For example, consider an Ising model. If we have just two nodes connected as X1 − X2 , one can
show that we have the following minimal set of constraints: 0 ≤ µ12 , 0 ≤ µ12 ≤ µ1 , 0 ≤ µ12 ≤ µ2 , and
1 + µ12 − µ1 − µ2 ≥ 0. We can write these in matrix-vector form as
$$\begin{bmatrix} 0 & 0 & 1 \\ 1 & 0 & -1 \\ 0 & 1 & -1 \\ -1 & -1 & 1 \end{bmatrix} \begin{bmatrix} \mu_1 \\ \mu_2 \\ \mu_{12} \end{bmatrix} \ge \begin{bmatrix} 0 \\ 0 \\ 0 \\ -1 \end{bmatrix} \quad (9.9)$$
These four constraints define a series of half-planes, whose intersection defines a polytope, as shown in
Figure 9.1(a).
Since M(G) is obtained by taking a convex combination of the T (x) vectors, it can also be written as the
convex hull of these vectors:
M(G) = conv{T1 (x), . . . , Td (x)} (9.10)
For example, for a 2 node MRF X1 − X2 with binary states, we have
M(G) = conv{(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 1)} (9.11)
These are the four black dots in Figure 9.1(a). We see that the convex hull defines the same volume as the
intersection of half-spaces.
The MAP problem can equivalently be written as a linear program over the marginal polytope:
$$\max_{x \in \mathcal{X}^V} \theta^T T(x) = \max_{\mu \in M(G)} \theta^T \mu \quad (9.12)$$
To see why this equation is true, note that we can just set µ to be a degenerate distribution with $\mu(x_s) = \mathbb{I}(x_s = x_s^*)$, where $x_s^*$ is the optimal assignment of node s. Thus we can “emulate” the task of optimizing over
Figure 9.1: (a) Illustration of the marginal polytope for an Ising model with two variables. (b) Cartoon illustration of
the set MF (G), which is a nonconvex inner bound on the marginal polytope M(G). MF (G) is used by mean field. (c)
Cartoon illustration of the relationship between M(G) and L(G), which is used by loopy BP. The set L(G) is always
an outer bound on M(G), and the inclusion M(G) ⊂ L(G) is strict whenever G has loops. Both sets are polytopes,
which can be defined as an intersection of half-planes (defined by facets), or as the convex hull of the vertices. L(G)
actually has fewer facets than M(G), despite the picture. In fact, L(G) has $O(|\mathcal{X}||V| + |\mathcal{X}|^2|E|)$ facets, where $|\mathcal{X}|$ is the number of states per variable, |V| is the number of variables, and |E| is the number of edges. By contrast, M(G) has $O(|\mathcal{X}|^{|V|})$ facets. On the other hand, L(G) has more vertices than M(G), despite the picture, since L(G) contains
all the binary vector extreme points µ ∈ M(G), plus additional fractional extreme points. From Figures 3.6, 5.4 and
4.2 of [WJ08]. Used with kind permission of Martin Wainwright.
discrete assignments by optimizing over probability distributions µ. Furthermore, the non-degenerate (“soft”)
distributions will not correspond to corners of the polytope, and hence will not maximize a linear function.
It seems like we have an easy problem to solve, since the objective in Equation (9.12) is linear in µ, and
the constraint set M(G) is convex. The trouble is, M(G) in general has a number of facets that is exponential
in the number of nodes.
A standard strategy in combinatorial optimization is to relax the constraints. In this case, instead of
requiring probability vector µ to live in the marginal polytope M(G), we allow it to live inside a simpler,
convex enclosing set L(G), which we define in Section 9.1.3.1. Thus we try to maximize the following upper
bound on the original objective:
$$\tau^* = \operatorname*{argmax}_{\tau \in L(G)} \theta^T \tau \quad (9.13)$$
This is called a linear programming relaxation of the problem. If the solution τ ∗ is integral, it corresponds
to the exact MAP estimate; this will be the case when the graph is a tree. In general, τ ∗ will be fractional;
we can derive an approximate MAP estimate by rounding (see [Wer07] for details).
Specifically, we require $\sum_{x_s} \tau_s(x_s) = 1$ for each node s, and $\sum_{x_t} \tau_{st}(x_s, x_t) = \tau_s(x_s)$ for each edge (s, t). The first constraint is called the normalization constraint, and the second is called the marginalization constraint. We then define the set L(G) of all nonnegative pseudo-marginals τ that satisfy these constraints.
The set L(G) is also a polytope, but it only has O(|V |+|E|) constraints. It is a convex outer approximation
on M(G), as shown in Figure 9.1(c). (By contrast, the mean field approximation, which we discuss in ??, is a
non-convex inner approximation, as we discuss in ??.)
Figure 9.2: (a) Illustration of pairwise UGM on binary nodes, together with a set of pseudo marginals that are not
globally consistent. (b) A slice of the marginal polytope illustrating the set of feasible edge marginals, assuming the
node marginals are clamped at µ1 = µ2 = µ3 = 0.5. From Figure 4.1 of [WJ08]. Used with kind permission of Martin
Wainwright.
We call the terms τs , τst ∈ L(G) pseudo marginals, since they may not correspond to marginals of any
valid probability distribution. As an example of this, consider Figure 9.2(a). The picture shows a set of
pseudo node and edge marginals, which satisfy the local consistency requirements. However, they are not
globally consistent. To see why, note that τ12 implies p(X1 = X2 ) = 0.8, τ23 implies p(X2 = X3 ) = 0.8, but
τ13 implies p(X1 = X3 ) = 0.2, which is not possible (see [WJ08, p81] for a formal proof). Indeed, Figure 9.2(b)
shows that L(G) contains points that are not in M(G).
We claim that M(G) ⊆ L(G), with equality iff G is a tree. To see this, first consider an element µ ∈ M(G).
Any such vector must satisfy the normalization and marginalization constraints, hence M(G) ⊆ L(G).
Now consider the converse. Suppose T is a tree, and let µ ∈ L(T). By definition, this satisfies the normalization and marginalization constraints. However, any tree can be represented in the form
$$p_\mu(x) = \prod_{s \in V} \mu_s(x_s) \prod_{(s,t) \in E} \frac{\mu_{st}(x_s, x_t)}{\mu_s(x_s)\mu_t(x_t)} \quad (9.17)$$
Hence satisfying normalization and local consistency is enough to define a valid distribution for any tree. Hence µ ∈ M(T) as well.
In contrast, if the graph has loops, we have that M(G) ≠ L(G). See Figure 9.2(b) for an example of this
fact. The importance of this observation will become clear in Section 10.1.3.
9.1.3.2 Algorithms
Our task is to solve Equation (9.13), which requires maximizing a linear function over a simple convex
polytope. For this, we could use a generic linear programming package. However, this is often very slow.
Fortunately, one can show that a simple algorithm, that sends messages between nodes in the graph, can
be used to compute τ ∗ . In particular, the tree reweighted belief propagation algorithm can be used;
see Section 10.1.5.3 for details.
Figure 9.3: Illustration of belief propagation for stereo depth estimation applied to the Venus image from the Middlebury
stereo benchmark dataset [SS02]. Left column: image and true disparities. Remaining columns: initial estimate,
estimate after 1 iteration, and estimate at convergence. Top row: Gaussian edge potentials using a continuous state
space. Bottom row: robust edge potentials using a quantized state space. From Figure 4 of [SF08]. Used with kind
permission of Erik Sudderth.
occur at object boundaries, as illustrated in Figure 9.3. (We can also use a hybrid discrete-continuous state
space, as discussed in [Yam+12], but we can no longer apply BP.)
Not surprisingly, people have recently applied deep learning to this problem. For example, [XAH19]
describes a differentiable version of message passing (??), which is fast and can be trained end-to-end.
However, it requires labeled data for training, i.e., pixel-wise ground truth depth values. For this particular
problem, such data can be collected from depth cameras, but for other problems, BP on “unsupervised” MRFs
may be needed.
9.1.4 Graphcuts
In this section, we show how to find MAP state estimates, or equivalently, minimum energy configurations,
by using the maxflow / mincut algorithm for graphs. This class of methods is known as graphcuts and is
very widely used, especially in computer vision applications (see e.g., [BK04]).
We will start by considering the case of MRFs with binary nodes and a restricted class of potentials; in
this case, graphcuts will find the exact global optimum. We then consider the case of multiple states per
node; we can approximately solve this case by solving a series of binary subproblems, as we will see.
Suppose the pairwise energies have the form $E_{uv}(x_u, x_v) = \lambda_{uv}\,\mathbb{I}(x_u \neq x_v)$, where $\lambda_{uv} \ge 0$ is the edge cost. This encourages neighboring nodes to have the same value (since we are trying to minimize energy). Since we are free to add any constant we like to the overall energy without affecting the MAP state estimate, let us rescale the local energy terms such that either $E_u(1) = 0$ or $E_u(0) = 0$.
Now let us construct a graph which has the same set of nodes as the MRF, plus two distinguished nodes:
the source s and the sink t. If Eu (1) = 0, we add the edge xu → t with cost Eu (0). Similarly, If Eu (0) = 0, we
add the edge s → xu with cost Eu (1). Finally, for every pair of variables that are connected in the MRF,
we add edges xu → xv and xv → xu , both with cost λu,v ≥ 0. Figure 9.4 illustrates this construction for an
MRF with 4 nodes and the following parameters:
Having constructed the graph, we compute a minimal s − t cut. This is a partition of the nodes into two sets,
Xs and Xt, such that s ∈ Xs and t ∈ Xt. We then find the partition which minimizes the sum of the costs of the edges that cross the cut from Xs to Xt.
Figure 9.4: Illustration of graphcuts applied to an MRF with 4 nodes. Dashed lines are ones which contribute to the
cost of the cut (for bidirected edges, we only count one of the costs). Here the min cut has cost 6. From Figure 13.5
from [KF09]. Used with kind permission of Daphne Koller.
In Figure 9.4, we see that the min-cut has cost 6. Minimizing the cost in this graph is equivalent to minimizing
the energy in the MRF. Hence nodes that are assigned to s have an optimal state of 0, and the nodes that are
assigned to t have an optimal state of 1. In Figure 9.4, we see that the optimal MAP estimate is (1, 1, 1, 0).
Thus we have converted the MAP estimation problem to a standard graph theory problem for which
efficient solvers exist (see e.g., [CLR90]).
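As an illustration, the following sketch builds a graph in the style of Figure 9.4 and solves it with networkx's min-cut routine; the capacities are made up, since the figure's exact energies are not reproduced in the text.

```python
import networkx as nx

G = nx.DiGraph()
G.add_edge("s", "z4", capacity=6.0)        # Eu(1) cost for a node with Eu(0)=0
G.add_edge("z1", "t", capacity=7.0)        # Eu(0) cost for a node with Eu(1)=0
for u, v, lam in [("z1", "z2", 6.0), ("z2", "z3", 6.0), ("z3", "z4", 2.0)]:
    G.add_edge(u, v, capacity=lam)         # pairwise costs, both directions
    G.add_edge(v, u, capacity=lam)

cut_value, (Xs, Xt) = nx.minimum_cut(G, "s", "t")
labels = {u: int(u in Xt) for u in G if u not in ("s", "t")}
print(cut_value, labels)                   # nodes on the t side get state 1
```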
Now suppose the pairwise energies satisfy
$$E_{uv}(1, 1) + E_{uv}(0, 0) \le E_{uv}(1, 0) + E_{uv}(0, 1)$$
In other words, the sum of the diagonal energies is less than the sum of the off-diagonal energies. In this case, we say the energies are submodular (??). An example of a submodular energy is an Ising model where $\lambda_{uv} > 0$. This is also known as an attractive MRF or associative MRF, since the model “wants” neighboring states to be the same.
It is possible to modify the graph construction process for this setting, and then apply graphcuts, such
that the resulting estimate is the global optimum [GPS89].
82
[Figure: (a) initial labeling; (b) standard move; (c) α-β-swap; (d) α-expansion.]
Graphcuts is often applied to low-level computer vision problems, such as stereo depth estimation, which we discussed in Section 9.1.3.3. Figure 9.6 compares graphcuts (both swap and expansion version) to two other algorithms (simulated annealing, and a patch matching method based on normalized cross correlation) on the famous Tsukuba test image. The graphcuts approach works the best on this example, as well as others [Sze+08; TF03]. It also tends to outperform belief propagation (results not shown) in terms of speed and accuracy on stereo problems [Sze+08; TF03], as well as other problems such as CRF labeling of LIDAR point cloud data [LMW17].
Chapter 10
Variational inference
We can write this as an exponential family model, $p(z|x) = \tilde{p}(z)/Z$, where $\tilde{p}(z) = \exp(\theta^T T(z))$, $Z = p(x)$, θ = ({θs;j}, {θs,t;j,k}) are all the node and edge parameters (the canonical parameters), and T(z) = ({I(zs = j)}, {I(zs = j, zt = k)}) are all the node and edge indicator functions (the sufficient statistics). Note: we use s, t ∈ V to index nodes and j, k ∈ X to index states.
Let $\mu = \mathbb{E}_q[T(z)]$ be the mean parameters of the variational distribution. Then we can rewrite the ELBO as
$$Ł(\mu) = \theta^T \mu + H(\mu)$$
The set of all valid (unrestricted) mean parameters µ is the marginal polytope corresponding to the graph, M(G), as explained in Section 9.1.2. Optimizing over this set recovers q = p, and hence
$$\max_{\mu \in M(G)} \theta^T \mu + H(\mu) = \log Z \quad (10.4)$$
Equation (10.4) seems easy to optimize: the objective is concave, since it is the sum of a linear function
and a concave function (see Figure ?? to see why entropy is concave); furthermore, we are maximizing
this over a convex set, M(G). Hence there is a unique global optimum. However, the entropy is typically
intractable to compute, since it requires summing over all states. We discuss approximations below. See
Table 10.1 for a high level summary of the methods we discuss.
Method | Definition | Objective | Opt. Domain | Section
Exact | $\max_{\mu \in M(G)} \theta^T\mu + H(\mu) = \log Z$ | Concave | Marginal polytope, convex | Section 10.1.1
Mean field | $\max_{\mu \in M_F(G)} \theta^T\mu + H_{MF}(\mu) \le \log Z$ | Concave | Nonconvex inner approx. | Section 10.1.2
Loopy BP | $\max_{\tau \in L(G)} \theta^T\tau + H_{Bethe}(\tau) \approx \log Z$ | Non-concave | Convex outer approx. | Section 10.1.3
TRBP | $\max_{\tau \in L(G)} \theta^T\tau + H_{TRBP}(\tau) \ge \log Z$ | Concave | Convex outer approx. | Section 10.1.5
Table 10.1: Summary of some variational inference methods for graphical models. TRBP is tree-reweighted belief propagation.
which follows from the factorization assumption. Thus the mean field objective is
$$Ł_{MF}(\mu) = \theta^T \mu + H_{MF}(\mu)$$
This is a concave lower bound on log Z. We will maximize this over a simpler, but non-convex, inner approximation to M(G), as we now show.
First, let F be an edge subgraph of the original graph G, and let $I(F) \subseteq I$ be the subset of sufficient statistics associated with the cliques of F. Let Ω be the set of canonical parameters for the full model, and define the canonical parameter space for the submodel as follows:
$$\Omega(F) \triangleq \{\theta \in \Omega : \theta_\alpha = 0\ \forall \alpha \in I \setminus I(F)\}$$
In other words, we require that the natural parameters associated with the sufficient statistics α outside of our chosen class be zero. For example, in the case of a fully factorized approximation, F0, we remove all edges from the graph, giving
$$\Omega(F_0) \triangleq \{\theta \in \Omega : \theta_{st} = 0\ \forall (s,t) \in E\} \quad (10.8)$$
In the case of structured mean field (Section ??), we set θst = 0 for edges which are not in our tractable
subgraph.
Next, we define the mean parameter space of the restricted model as follows:
$$M_F(G) \triangleq \{\mu \in \mathbb{R}^d : \mu = \mathbb{E}_\theta[T(x)] \text{ for some } \theta \in \Omega(F)\} \quad (10.9)$$
This is called an inner approximation to the marginal polytope, since $M_F(G) \subseteq M(G)$. See Figure 9.1(b) for a sketch. Note that $M_F(G)$ is a non-convex polytope, which results in multiple local optima.
Thus the mean field problem becomes
$$\max_{\mu \in M_F(G)} \theta^T \mu + H_{MF}(\mu) \quad (10.10)$$
This requires maximizing a concave objective over a non-convex set. It is typically optimized using coordinate ascent, since it is easy to optimize a scalar concave function over the marginal distribution for each node.
In this section, we will consider a convex outer approximation, L(G), based on pseudo marginals, as in
Section 9.1.3.1. We also need to approximate the entropy (which was not needed when performing MAP
estimation, discussed in Section 9.1.3). We discuss this entropy approximation in Section 10.1.3.1, and then
show how we can use this to approximate log Z. Finally we show that loopy belief propagation attempts to
optimize this approximation.
Any tree-structured distribution can be written in terms of its node and edge marginals:
$$\mu(x) = \prod_{s \in V} \mu_s(x_s) \prod_{(s,t) \in E} \frac{\mu_{st}(x_s, x_t)}{\mu_s(x_s)\mu_t(x_t)} \quad (10.11)$$
This satisfies the normalization and pairwise marginalization constraints of the outer approximation by construction.
From Equation (10.11), we can write the exact entropy of any tree structured distribution µ ∈ M(T) as follows:
$$H(\mu) = \sum_{s \in V} H_s(\mu_s) - \sum_{(s,t) \in E} I_{st}(\mu_{st}) \quad (10.12)$$
$$H_s(\mu_s) = -\sum_{x_s \in \mathcal{X}_s} \mu_s(x_s) \log \mu_s(x_s) \quad (10.13)$$
$$I_{st}(\mu_{st}) = \sum_{(x_s, x_t) \in \mathcal{X}_s \times \mathcal{X}_t} \mu_{st}(x_s, x_t) \log \frac{\mu_{st}(x_s, x_t)}{\mu_s(x_s)\mu_t(x_t)} \quad (10.14)$$
Note that we can rewrite the mutual information term in the form $I_{st}(\mu_{st}) = H_s(\mu_s) + H_t(\mu_t) - H_{st}(\mu_{st})$, and hence we get the following alternative but equivalent expression:
$$H(\mu) = -\sum_{s \in V} (d_s - 1) H_s(\mu_s) + \sum_{(s,t) \in E} H_{st}(\mu_{st}) \quad (10.15)$$
where $d_s$ is the degree of node s.
The Bethe approximation to the entropy applies Equation (10.12) to the pseudo-marginals:
$$H_{Bethe}(\tau) \triangleq \sum_{s \in V} H_s(\tau_s) - \sum_{(s,t) \in E} I_{st}(\tau_{st})$$
We define the Bethe free energy as the expected energy minus approximate entropy:
$$F_{Bethe}(\tau) \triangleq -\left[ \theta^T \tau + H_{Bethe}(\tau) \right] \approx -\log Z \quad (10.17)$$
Thus we can approximate the variational problem by $\max_{\tau \in L(G)} \theta^T \tau + H_{Bethe}(\tau)$. We call this the Bethe variational problem or BVP. The space we are optimizing over is a convex set, but the objective itself is not concave (since $H_{Bethe}$ is not concave). Thus there can be multiple local optima. Also, the entropy approximation is not a bound (either upper or lower) on the true entropy. Thus the value obtained by the BVP is just an approximation to log Z(θ). However, in the case of trees, the approximation is exact. Also, in the case of models with attractive potentials, the resulting value turns out to be an upper bound [SWW08]. In Section 10.1.5, we discuss how to modify the algorithm so it always minimizes an upper bound for any model.
1 Hans Bethe was a German-American physicist, 1906–2005.
10.1.3.2 LBP messages are Lagrange multipliers
In this subsection, we will show that any fixed point of the LBP algorithm defines a stationary point of the above constrained objective. Let us define the normalization constraint as $C_{ss}(\tau) \triangleq -1 + \sum_{x_s} \tau_s(x_s)$, and the marginalization constraint as $C_{ts}(x_s; \tau) \triangleq \tau_s(x_s) - \sum_{x_t} \tau_{st}(x_s, x_t)$ for each edge t → s. We can now write the Lagrangian as
$$L(\tau, \lambda; \theta) \triangleq \theta^T \tau + H_{Bethe}(\tau) + \sum_s \lambda_{ss} C_{ss}(\tau) + \sum_{s,t} \left[ \sum_{x_s} \lambda_{ts}(x_s) C_{ts}(x_s; \tau) + \sum_{x_t} \lambda_{st}(x_t) C_{st}(x_t; \tau) \right] \quad (10.19)$$
(The constraint that τ ≥ 0 is not explicitly enforced, but one can show that it will hold at the optimum since θ > 0.) Some simple algebra then shows that $\nabla_\tau L = 0$ yields
$$\log \tau_s(x_s) = \lambda_{ss} + \theta_s(x_s) + \sum_{t \in \mathrm{nbr}(s)} \lambda_{ts}(x_s) \quad (10.20)$$
$$\log \frac{\tau_{st}(x_s, x_t)}{\tilde{\tau}_s(x_s)\tilde{\tau}_t(x_t)} = \theta_{st}(x_s, x_t) - \lambda_{ts}(x_s) - \lambda_{st}(x_t) \quad (10.21)$$
where we have defined $\tilde{\tau}_s(x_s) \triangleq \sum_{x_t} \tau_{st}(x_s, x_t)$. Using the fact that the marginalization constraint implies $\tilde{\tau}_s(x_s) = \tau_s(x_s)$, we can rewrite Equation (10.21) in terms of the singleton pseudo-marginals $\tau_s$ and $\tau_t$.
To make the connection to message passing, define $m_{t \to s}(x_s) = \exp(\lambda_{ts}(x_s))$. With this notation, we can rewrite the above equations (after taking exponents of both sides) as follows:
$$\tau_s(x_s) \propto \exp(\theta_s(x_s)) \prod_{t \in \mathrm{nbr}(s)} m_{t \to s}(x_s) \quad (10.23)$$
where the λ terms and irrelevant constants are absorbed into the constant of proportionality. We see that this is equivalent to the usual expression for the node and edge marginals in LBP.
To derive an equation for the messages in terms of other messages (rather than in terms of $\lambda_{ts}$), we enforce the marginalization condition $\sum_{x_t} \tau_{st}(x_s, x_t) = \tau_s(x_s)$. Then one can show that
$$m_{t \to s}(x_s) \propto \sum_{x_t} \exp\left\{ \theta_{st}(x_s, x_t) + \theta_t(x_t) \right\} \prod_{u \in \mathrm{nbr}(t) \setminus s} m_{u \to t}(x_t) \quad (10.25)$$
We see that this is equivalent to the usual expression for the messages in LBP.
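To make the fixed-point updates concrete, here is a sketch of synchronous loopy BP on a 3-node loop, implementing Equations (10.23) and (10.25) with tabular potentials $\psi_s = \exp(\theta_s)$ and $\psi_{st} = \exp(\theta_{st})$; the random potentials are illustrative.

```python
import numpy as np

K = 2
edges = [(0, 1), (1, 2), (0, 2)]                    # a 3-node loop
rng = np.random.default_rng(0)
psi = {s: rng.uniform(0.5, 1.5, K) for s in range(3)}
psi2 = {e: rng.uniform(0.5, 1.5, (K, K)) for e in edges}
msgs = {(s, t): np.ones(K) for e in edges for (s, t) in (e, e[::-1])}

def nbr_prod(t, exclude):
    # Product of messages into node t, excluding the one from `exclude`.
    out = np.ones(K)
    for (u, v) in msgs:
        if v == t and u != exclude:
            out = out * msgs[(u, v)]
    return out

for _ in range(50):                                 # iterate to a fixed point
    new = {}
    for (s, t) in msgs:
        pair = psi2[(s, t)] if (s, t) in psi2 else psi2[(t, s)].T
        m = pair.T @ (psi[s] * nbr_prod(s, exclude=t))   # sum over x_s
        new[(s, t)] = m / m.sum()
    msgs = new

for s in range(3):
    b = psi[s] * nbr_prod(s, exclude=None)
    print(s, b / b.sum())                           # approximate marginals
```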
In more detail, define Lt (G) to be the set of all pseudo-marginals such that normalization and marginal-
ization constraints hold on a hyper-graph whose largest hyper-edge is of size t + 1. For example, in Figure ??,
we impose constraints of the form
$$\sum_{x_1, x_2} \tau_{1245}(x_1, x_2, x_4, x_5) = \tau_{45}(x_4, x_5), \qquad \sum_{x_6} \tau_{56}(x_5, x_6) = \tau_5(x_5), \qquad \ldots \quad (10.26)$$
We also approximate the entropy as a weighted sum of the entropies of the pseudo-marginals on the hyper-edges:
$$H_{Kikuchi}(\tau) \triangleq \sum_g c(g)\, H_g(\tau_g) \quad (10.27)$$
where $H_g(\tau_g)$ is the entropy of the joint (pseudo) distribution on the vertices in set g, and c(g) is called the overcounting number of set g. These are related to Mobius numbers in set theory. Rather than giving a precise definition, we just give a simple example. For the graph in Figure ??, we have
$$H_{Kikuchi}(\tau) = [H_{1245} + H_{2356} + H_{4578} + H_{5689}] - [H_{25} + H_{45} + H_{56} + H_{58}] + H_5 \quad (10.28)$$
Putting these two approximations together, we can define the Kikuchi free energy as follows:
$$F_{Kikuchi}(\tau) \triangleq -\left[ \theta^T \tau + H_{Kikuchi}(\tau) \right] \approx -\log Z \quad (10.29)$$
Just as with the Bethe free energy, this is not a concave objective. There are several possible algorithms
for finding a local optimum of this objective, including generalized belief propagation. For details, see e.g.,
[WJ08, Sec 4.2] or [KF09, Sec 11.3.2].
This is a convex set since each M(F) is a projection of a convex set. Hence we define our problem as maximizing $\theta^T \tau$ plus a convex combination of tree entropies over this set.
Figure 10.1: (a) A graph. (b-d) Some of its spanning trees. From Figure 7.1 of [WJ08]. Used with kind permission of
Martin Wainwright.
This is a concave objective being maximized over a convex set, and hence has a unique optimum. Furthermore,
the result is always an upper bound on log Z, because the entropy is an upper bound, and we are optimizing
over a larger set than the marginal polytope.
It remains to specify the set of tractable submodels, F, and the distribution ρ. We discuss some options
below.
This is called the tree reweighted BP approximation [WJW05b; Kol06]. This is similar to the Bethe
approximation to the entropy except for the crucial ρst weights. So long as ρst > 0 for all edges (s, t), this
gives a valid concave upper bound on the exact entropy.
The edge appearance probabilities live in a space called the spanning tree polytope. This is because
they are constrained to arise from a distribution over trees. Figure 10.1 gives an example of a graph and
three of its spanning trees. Suppose each tree has equal weight under ρ. The edge f occurs in 1 of the 3 trees,
so ρf = 1/3. The edge e occurs in 2 of the 3 trees, so ρe = 2/3. The edge b appears in all of the trees, so
ρb = 1. And so on. Ideally we can find a distribution ρ, or equivalently edge probabilities in the spanning tree
polytope, that make the above bound as tight as possible. An algorithm to do this is described in [WJW05a].
A simpler approach is to use all single edges with weight ρe = 1/E.
What about the set we are optimizing over? We require µ(T ) ∈ M(T ) for each tree T , which means
enforcing normalization and local consistency. Since we have to do this for every tree, we are enforcing
normalization and local consistency on every edge. Thus we are effectively optimizing in the pseudo-marginal
polytope L(G). So our final optimization problem is as follows:
10.1.5.2 Message passing implementation
The simplest way to minimize Equation (10.35) is a modification of belief propagation known as tree
reweighted belief propagation. The message from t to s is now a function of all messages sent from other
neighbors v to t, as before, but now it is also a function of the message sent from s to t. Specifically, we have
the following [WJ08, Sec 7.2.1]:
$$m_{t \to s}(x_s) \propto \sum_{x_t} \exp\left( \frac{1}{\rho_{st}} \theta_{st}(x_s, x_t) + \theta_t(x_t) \right) \frac{\prod_{v \in \text{nbr}(t) \setminus s} [m_{v \to t}(x_t)]^{\rho_{vt}}}{[m_{s \to t}(x_t)]^{1 - \rho_{ts}}} \tag{10.36}$$
If ρst = 1 for all edges (s, t) ∈ E, the algorithm reduces to the standard LBP algorithm. However, the
condition ρst = 1 implies every edge is present in every spanning tree with probability 1, which is only
possible if the original graph is a tree. Hence the method is only equivalent to standard LBP on trees, when
the method is of course exact.
In general, this message passing scheme is not guaranteed to converge to the unique global optimum.
One can devise double-loop methods that are guaranteed to converge [HS08], but in practice, using damped
updates as in Equation ?? is often sufficient to ensure convergence.
Chapter 11
Chapter 12
Chapter 13
Part III
Prediction
Chapter 14
Chapter 15
Figure 15.1: Quadratic lower bounds on the sigmoid (logistic) function. In solid red, we plot σ(x) vs x. In dotted
blue, we plot the lower bound L(x, ψ) vs x for ψ = 2.5. (a) JJ bound. This is tight at ψ = ±2.5. (b) Bohning bound
(Section 15.1.2.2). This is tight at ψ = 2.5. Generated by sigmoid_lower_bounds.py.
Since this is a quadratic function of w, we can derive a Gaussian posterior approximation, $q(w|\psi) = N(w|\mu_N, V_N)$, as follows:
$$V_N^{-1} = V_0^{-1} + 2 \sum_{n=1}^N \lambda(\psi_n) x_n x_n^T, \qquad \mu_N = V_N \left( V_0^{-1} \mu_0 + \sum_{n=1}^N (y_n - 1/2) x_n \right)$$
This is more flexible than a Laplace approximation, since the variational parameters ψ can be used to
optimize the curvature of the posterior covariance. To find the optimal ψ, we can maximize the ELBO, which
is given by
$$\log p(y|X) = \log \int p(y|X, w) p(w) \, dw \geq \log \int h(w, \psi) p(w) \, dw = Ł(\psi) \tag{15.12}$$
where
$$h(w, \psi) = \prod_{n=1}^N \sigma(\psi_n) \exp\left( \eta_n y_n - (\eta_n + \psi_n)/2 - \lambda(\psi_n)(\eta_n^2 - \psi_n^2) \right) \tag{15.13}$$
We can evaluate the lower bound analytically to get
$$Ł(\psi) = \frac{1}{2} \log \frac{|V_N|}{|V_0|} + \frac{1}{2} \mu_N^T V_N^{-1} \mu_N - \frac{1}{2} \mu_0^T V_0^{-1} \mu_0 + \sum_{n=1}^N \left[ \log \sigma(\psi_n) - \frac{\psi_n}{2} + \lambda(\psi_n) \psi_n^2 \right] \tag{15.14}$$
If we solve for $\nabla_\psi Ł(\psi) = 0$, we get the following iterative update equation for each variational parameter:
$$(\psi_n^{\text{new}})^2 = x_n^T \, \mathbb{E}[w w^T] \, x_n = x_n^T (V_N + \mu_N \mu_N^T) x_n \tag{15.15}$$
Once we have estimated $\psi_n$, we can plug it into the above Gaussian approximation $q(w|\psi)$.
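Below is a minimal NumPy sketch of this iterative scheme. It assumes the standard JJ definition $\lambda(\psi) = \tanh(\psi/2)/(4\psi)$ and the Gaussian posterior updates stated above; the function and variable names are illustrative.

```python
import numpy as np

def lam(psi):
    # JJ bound coefficient: lambda(psi) = tanh(psi/2) / (4 psi),
    # with the limiting value 1/8 at psi = 0.
    psi = np.asarray(psi, dtype=float)
    out = np.full_like(psi, 0.125)
    nz = np.abs(psi) > 1e-8
    out[nz] = np.tanh(psi[nz] / 2.0) / (4.0 * psi[nz])
    return out

def jj_vb_logreg(X, y, V0, mu0, n_iters=20):
    # X: (N, D) inputs; y: (N,) labels in {0, 1}.
    N, D = X.shape
    psi = np.ones(N)
    V0_inv = np.linalg.inv(V0)
    for _ in range(n_iters):
        L = lam(psi)
        VN = np.linalg.inv(V0_inv + 2.0 * (X.T * L) @ X)      # posterior covariance
        muN = VN @ (V0_inv @ mu0 + X.T @ (y - 0.5))           # posterior mean
        E_wwT = VN + np.outer(muN, muN)
        # Equation (15.15): psi_n^2 = x_n^T E[w w^T] x_n.
        psi = np.sqrt(np.einsum('nd,de,ne->n', X, E_wwT, X))
    return muN, VN, psi
```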
If we define $X_i = I \otimes x_i$, where $\otimes$ is the Kronecker product, and $I$ is the $C \times C$ identity matrix, then we can write the logits as $\eta_i = X_i w$. (For example, if $C = 2$ and $x_i = [1, 2, 3]$, we have $X_i = [1, 2, 3, 0, 0, 0; \; 0, 0, 0, 1, 2, 3]$.) Then the likelihood is given by
$$p(y|X, w) = \prod_{i=1}^N \exp[y_i^T \eta_i - \text{lse}(\eta_i)] \tag{15.17}$$
Bohning's bound is based on a quadratic expansion of lse around a point $\psi_i \in \mathbb{R}^M$, where $g$ and $H$ are the gradient and Hessian of lse, and $M = C - 1$ is the number of classes minus 1. An upper bound to lse can be found by replacing the Hessian matrix $H(\psi_i)$ with a matrix $A_i$ such that $A_i \succeq H(\psi_i)$ for all $\psi_i$. [Boh92] showed that this can be achieved if we use the matrix $A_i = \frac{1}{2}\left[ I_M - \frac{1}{M+1} 1_M 1_M^T \right]$. In the binary case, this becomes $A_i = \frac{1}{2}(1 - \frac{1}{2}) = \frac{1}{4}$.
Note that Ai is independent of ψ i ; however, we still write it as Ai (rather than dropping the i subscript),
since other bounds that we consider below will have a data-dependent curvature term. The upper bound on
lse therefore becomes
$$\text{lse}(\eta_i) \leq \frac{1}{2} \eta_i^T A_i \eta_i - b_i^T \eta_i + c_i \tag{15.23}$$
$$A_i = \frac{1}{2}\left[ I_M - \frac{1}{M+1} 1_M 1_M^T \right] \tag{15.24}$$
$$b_i = A_i \psi_i - g(\psi_i) \tag{15.25}$$
$$c_i = \frac{1}{2} \psi_i^T A_i \psi_i - g(\psi_i)^T \psi_i + \text{lse}(\psi_i) \tag{15.26}$$
where $\psi_i \in \mathbb{R}^M$ is a vector of variational parameters.
We can use the above result to get the following lower bound on the softmax likelihood:
$$\log p(y_i|x_i, w) \geq y_i^T X_i w - \frac{1}{2} w^T X_i^T A_i X_i w + b_i^T X_i w - c_i \tag{15.27}$$
If we define the pseudo-observation
$$\tilde{y}_i \triangleq A_i^{-1}(b_i + y_i) \tag{15.28}$$
then we can get a "Gaussianized" version of the observation model:
$$p(\tilde{y}_i | x_i, w, \psi_i) = N(\tilde{y}_i | X_i w, A_i^{-1}) \, f(x_i, \psi_i) \tag{15.29}$$
where $f(x_i, \psi_i)$ is some function that does not depend on $w$. Given this, it is easy to compute the posterior $q(w) = N(m_N, V_N)$, using Bayes' rule for Gaussians.
Given the posterior, we can write the ELBO as follows:
$$Ł(\psi) \triangleq -D_{KL}(q(w) \| p(w)) + \mathbb{E}_q\left[ \sum_{i=1}^N \log p(y_i|x_i, w) \right] \tag{15.30}$$
$$= -D_{KL}(q(w) \| p(w)) + \mathbb{E}_q\left[ \sum_{i=1}^N y_i^T \eta_i - \text{lse}(\eta_i) \right] \tag{15.31}$$
$$= -D_{KL}(q(w) \| p(w)) + \sum_{i=1}^N y_i^T \mathbb{E}_q[\eta_i] - \sum_{i=1}^N \mathbb{E}_q[\text{lse}(\eta_i)] \tag{15.32}$$
where $p(w) = N(w|m_0, V_0)$ is the prior and $q(w) = N(w|m_N, V_N)$ is the approximate posterior; typically $m_0 = 0_{D_M}$ and $V_0$ is block diagonal. The first term is just the (negative) KL divergence between two Gaussians, which is given by
$$-D_{KL}(N(m_N, V_N) \| N(m_0, V_0)) = -\frac{1}{2}\left[ \text{tr}(V_N V_0^{-1}) - \log |V_N V_0^{-1}| + (m_N - m_0)^T V_0^{-1} (m_N - m_0) - D_M \right] \tag{15.33}$$
where $D_M$ is the dimensionality of the Gaussian. The second term is simply
$$\sum_{i=1}^N y_i^T \mathbb{E}_q[\eta_i] = \sum_{i=1}^N y_i^T \tilde{m}_i \tag{15.34}$$
where $\tilde{m}_i \triangleq X_i m_N$. The final term can be lower bounded by taking expectations of our quadratic upper bound on lse as follows:
$$-\sum_{i=1}^N \mathbb{E}_q[\text{lse}(\eta_i)] \geq \sum_{i=1}^N \left[ -\frac{1}{2} \text{tr}(A_i \tilde{V}_i) - \frac{1}{2} \tilde{m}_i^T A_i \tilde{m}_i + b_i^T \tilde{m}_i - c_i \right] \tag{15.35}$$
where $\tilde{V}_i \triangleq X_i V_N X_i^T$ is the covariance of $\eta_i$ under $q$.
We will use coordinate ascent to optimize this lower bound. That is, we update the variational posterior
parameters VN and mN , and then the variational likelihood parameters ψ i . We leave the detailed derivation
as an exercise, and just state the results. We have
$$V_N = \left( V_0^{-1} + \sum_{i=1}^N X_i^T A_i X_i \right)^{-1} \tag{15.37}$$
$$m_N = V_N \left( V_0^{-1} m_0 + \sum_{i=1}^N X_i^T (y_i + b_i) \right) \tag{15.38}$$
$$\psi_i = \tilde{m}_i = X_i m_N \tag{15.39}$$
We can exploit the fact that $A_i$ is a constant matrix, plus the fact that $X_i$ has block structure, to simplify the first two terms as follows:
$$V_N = \left( V_0^{-1} + A \otimes \sum_{i=1}^N x_i x_i^T \right)^{-1} \tag{15.40}$$
$$m_N = V_N \left( V_0^{-1} m_0 + \sum_{i=1}^N (y_i + b_i) \otimes x_i \right) \tag{15.41}$$
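Here is a minimal NumPy sketch of the resulting coordinate ascent loop, using the simplified updates (15.40)-(15.41) together with (15.25) and (15.39). The setup (one-hot targets over the $M$ non-reference classes, class-major stacking of the weight vector) and all names are assumptions made for illustration.

```python
import numpy as np

def softmax_g(psi):
    # Gradient of lse(eta) = log(1 + sum_m exp(eta_m)) at psi, for the
    # M = C - 1 parameterization (the reference class has logit 0).
    e = np.exp(psi)
    return e / (1.0 + e.sum(axis=-1, keepdims=True))

def bohning_vb(X, Y, V0, m0, n_iters=20):
    # X: (N, D) inputs; Y: (N, M) one-hot over the M non-reference classes
    # (an all-zero row encodes the reference class). w is stacked class-major.
    N, D = X.shape
    M = Y.shape[1]
    A = 0.5 * (np.eye(M) - np.ones((M, M)) / (M + 1))   # Eq (15.24)
    V0_inv = np.linalg.inv(V0)
    Psi = np.zeros((N, M))                              # variational params psi_i
    for _ in range(n_iters):
        B = Psi @ A - softmax_g(Psi)                    # b_i, Eq (15.25)
        VN = np.linalg.inv(V0_inv + np.kron(A, X.T @ X))   # Eq (15.40)
        rhs = V0_inv @ m0
        for i in range(N):                              # sum_i (y_i + b_i) kron x_i
            rhs = rhs + np.kron(Y[i] + B[i], X[i])
        mN = VN @ rhs                                   # Eq (15.41)
        Psi = X @ mN.reshape(M, D).T                    # Eq (15.39): psi_i = X_i m_N
    return mN, VN, Psi
```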
15.2 Converting multinomial logistic regression to Poisson regression
It is possible to represent a multinomial logistic regression model with K outputs as K separate Poisson
regression models. (Although the Poisson models are fit separately, they are implicitly coupled, since the
counts must sum to Nn across all K outcomes.) This fact can enable more efficient training when the number
of categories is large [Tad15].
To see why this relationship is true, we follow the presentation of [McE20, Sec 11.3.3]. We assume K = 2
for notational brevity (i.e., binomial regression). Assume we have m trials, with counts y1 and y2 of each
outcome type. The multinomial likelihood has the form
$$p(y_1, y_2 | m, \mu_1, \mu_2) = \frac{m!}{y_1! \, y_2!} \mu_1^{y_1} \mu_2^{y_2} \tag{15.52}$$
Now consider a product of two Poisson likelihoods, for each set of counts:
$$p(y_1, y_2 | \lambda_1, \lambda_2) = p(y_1|\lambda_1) \, p(y_2|\lambda_2) = \frac{e^{-\lambda_1} \lambda_1^{y_1}}{y_1!} \frac{e^{-\lambda_2} \lambda_2^{y_2}}{y_2!} \tag{15.53}$$
We now show that these are equivalent, under a suitable setting of the parameters.
Let Λ = λ1 + λ2 be the expected total number of counts of any type, µ1 = λ1 /Λ and µ2 = λ2 /Λ.
Substituting into the binomial likelihood gives
$$p(y_1, y_2 | m, \mu_1, \mu_2) = \frac{m!}{y_1! \, y_2!} \left(\frac{\lambda_1}{\Lambda}\right)^{y_1} \left(\frac{\lambda_2}{\Lambda}\right)^{y_2} = \frac{m!}{\Lambda^m} \frac{\lambda_1^{y_1}}{y_1!} \frac{\lambda_2^{y_2}}{y_2!} \tag{15.54}$$
$$= \frac{m!}{\Lambda^m e^{-\lambda_1} e^{-\lambda_2}} \frac{e^{-\lambda_1} \lambda_1^{y_1}}{y_1!} \frac{e^{-\lambda_2} \lambda_2^{y_2}}{y_2!} \tag{15.55}$$
$$= \underbrace{\frac{m!}{e^{-\Lambda} \Lambda^m}}_{p(m)^{-1}} \; \underbrace{\frac{e^{-\lambda_1} \lambda_1^{y_1}}{y_1!}}_{p(y_1)} \; \underbrace{\frac{e^{-\lambda_2} \lambda_2^{y_2}}{y_2!}}_{p(y_2)} \tag{15.56}$$
The final expression says that p(y1 , y2 |m) = p(y1 )p(y2 )/p(m), which makes sense.
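We can verify this identity numerically; the following snippet checks Equation (15.52) against Equation (15.56) using scipy's pmfs (the particular counts and rates are arbitrary).

```python
from scipy.stats import binom, poisson

lam1, lam2 = 3.0, 5.0
y1, y2 = 2, 6
m, Lam = y1 + y2, lam1 + lam2   # total count and total rate

lhs = binom.pmf(y1, m, lam1 / Lam)                  # binomial likelihood, Eq (15.52)
rhs = (poisson.pmf(y1, lam1) * poisson.pmf(y2, lam2)
       / poisson.pmf(m, Lam))                       # p(y1) p(y2) / p(m), Eq (15.56)
print(lhs, rhs)   # the two agree: ~0.2347 in both cases
```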
dept gender admit reject applications
A male 512 313 825
A female 89 19 108
B male 353 207 560
B female 17 8 25
C male 120 205 325
C female 202 391 593
D male 138 279 417
D female 131 244 375
E male 53 138 191
E female 94 299 393
F male 22 351 373
F female 24 317 341
Here MALE[i] = 1 iff case i refers to male admissions data. So the log odds is α for female cases, and α + β
for male candidates. (The choice of prior for these parameters is discussed in ??.)
The above formulation is asymmetric in the genders. In particular, the log odds for males has two random variables associated with it, and hence is a priori more uncertain. It is often better to rewrite the model in the following symmetric way:
Ai ∼ Bin(Ni , µi ) (15.61)
logit(µi ) = αGENDER[i] (15.62)
αj ∼ N (0, 1.5), j ∈ {1, 2} (15.63)
Here GENDER[i] is the gender (1 for male, 2 for female), so the log odds is α1 for males and α2 for females.
We can perform posterior inference using a variety of methods (see ??). Here we use HMC (??). We find the 89% credible interval for α1 is [−0.29, −0.16] and for α2 is [−0.91, −0.75].1 The corresponding distribution for the difference in probability, σ(α1) − σ(α2), is [0.12, 0.16], with a mean of 0.14. So it seems that Berkeley is biased in favor of men.
However, before jumping to conclusions, we should check if the model is any good. In Figure 15.2a, we
plot the posterior predictive distribution, along with the original data. We see the model is a very bad fit to
the data (the blue data dots are often outside the black predictive intervals). In particular, we see that the
empirical admissions rate for women is actually higher in all the departments except for C and E, yet the
model says that women should have a 14% lower chance of admission.
The trouble is that men and women did not apply to the same departments in equal amounts. Women tended not to apply to departments with high admissions rates, like A and B, but instead applied more to departments with low admissions rates, like F. So even though fewer women were accepted overall, within each department women tended to be accepted at about the same rate.
We can get a better understanding if we consider the DAG in Figure 15.3a. This is intended to be a causal model of the relevant factors. We discuss causality in more detail in ??, but the basic idea should be clear from this picture. In particular, we see that there is an indirect causal path G → D → A from gender to acceptance, so to infer the direct effect G → A, we need to condition on D and close the indirect path.
1 McElreath uses 89% interval instead of 95% to emphasize the arbitrary nature of these values. The difference is insignificant.
Figure 15.2: Blue dots are admission rates for each of the 6 departments (A-F) for males (left half of each dyad)
and females (right half ). The circle is the posterior mean of µi , the small vertical black lines indicate 1 standard
deviation of µi . The + marks indicate 95% predictive interval for Ai . (a) Basic model, only taking gender into
account. (b) Augmented model, adding department specific offsets. Adapted from Figure 11.5 of [McE20]. Generated
by logreg_ucb_admissions_numpyro.ipynb.
Figure 15.3: Some possible causal models of admissions rates. G is gender, D is department, A is acceptance
rate. (a) No hidden confounders. (b) Hidden confounder (small dot) affects both D and A. Generated by lo-
greg_ucb_admissions_numpyro.ipynb.
Note that we have parameterized the model in terms of its mean rate,
$$\pi_i = \frac{\alpha_i}{\alpha_i + \beta_i} \tag{15.72}$$
and shape,
$$\kappa_i = \alpha_i + \beta_i \tag{15.73}$$
We choose to make the mean depend on the inputs (covariates), but to treat the shape (which is like a precision term) as a shared constant.
As we discussed in ??, the beta-binomial distribution is a continuous mixture distribution of the following form:
$$\text{BetaBinom}(y|m, \alpha, \beta) = \int \text{Bin}(y|m, \mu) \, \text{Beta}(\mu|\alpha, \beta) \, d\mu \tag{15.74}$$
In the regression context, we can interpret this as follows: rather than just predicting the mean directly, we
predict the mean and variance. This allows for each individual example to have more variability than we
might otherwise expect.
If the shape parameter κ is less than 2, then the distribution is an inverted U-shape which strongly favors
probabilities of 0 or 1 (see ??). We generally want to avoid this, which we can do by ensuring κ > 2.
Following [McE20, p371], let us use this model to reanalyze the Berkeley admissions data from Section 15.3.
We saw that there was a lot of variability in the outcomes, due to the different admissions rates of each
department. Suppose we just regress on the gender, i.e., xi = (I (GENDERi = 1) , I (GENDERi = 2)), and
w = (α1 , α2 ) are the corresponding logits. If we use a binomial regression model, we can be misled into
thinking there is gender bias. But if we use the more robust beta-binomial model, we avoid this false
conclusion, as we show below.
We fit the following model:
Ai ∼ BetaBinom(Ni , πi , κ) (15.75)
logit(πi ) = αGENDER[i] (15.76)
αj ∼ N (0, 1.5) (15.77)
κ=φ+2 (15.78)
φ ∼ Expon(1) (15.79)
(To ensure that κ > 2, we use a trick and define it as κ = φ + 2, where we put an exponential prior (which
has a lower bound of 0) on φ.)
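As a concrete illustration, here is a minimal numpyro sketch of Equations (15.75)-(15.79). This is not the book's notebook code; the site names and data arrays are illustrative assumptions.

```python
import jax
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS

def model(gender, N, A=None):
    # gender: 0-based index per case; N: number of applications per case.
    alpha = numpyro.sample("alpha", dist.Normal(0.0, 1.5).expand([2]))  # Eq (15.77)
    phi = numpyro.sample("phi", dist.Exponential(1.0))                  # Eq (15.79)
    kappa = phi + 2.0                                                   # Eq (15.78): kappa > 2
    pi = jax.nn.sigmoid(alpha[gender])                                  # Eq (15.76)
    # Eq (15.75), converting (mean, shape) to the (alpha, beta) parameters
    # via Equations (15.72)-(15.73): alpha = pi * kappa, beta = (1 - pi) * kappa.
    numpyro.sample("A", dist.BetaBinomial(pi * kappa, (1.0 - pi) * kappa,
                                          total_count=N), obs=A)

# Usage (data arrays omitted):
# mcmc = MCMC(NUTS(model), num_warmup=500, num_samples=1000)
# mcmc.run(jax.random.PRNGKey(0), gender, N, A)
```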
We fit this model (using HMC) and plot the results in Figure 15.4. In Figure 15.4a, we show the posterior predictive distribution; we see that it is quite broad, so the model is no longer overconfident. In Figure 15.4b,
Figure 15.4: Results of fitting beta-binomial regression model to Berkeley admissions data. (a) Posterior predictive distribution (black) superimposed on empirical data (blue). The hollow circle is the posterior predicted mean acceptance rate, E[Ai|D]; the vertical lines are 1 standard deviation around this mean, std[Ai|D]; the + signs indicate the 89% predictive interval. (b) Samples from the posterior distribution for the admissions rate for men (blue) and women (red). Thick curve is posterior mean. Adapted from Figure 12.1 of [McE20]. Generated by logreg_ucb_admissions_numpyro.ipynb.
we plot p(σ(αj)|D), which is the posterior over the rate of admissions for men and women. We see that there is considerable uncertainty in these values, so now we avoid the false conclusion that one is significantly higher than the other. However, the model is so vague in its predictions as to be useless. In Section 15.3.4, we fix this problem by using a multi-level logistic regression model.
for j = 1 : 2 and n = 1 : 12. Let λ̄j = E[λj|Dj], where D1 = y1,1:N is the vector of admission counts, and D2 = y2,1:N is the vector of rejection counts (so mn = y1,n + y2,n is the total number of applications for case n). The expected acceptance rate across the entire dataset is
$$\frac{\bar{\lambda}_1}{\bar{\lambda}_1 + \bar{\lambda}_2} = \frac{146.2}{146.2 + 230.9} = 0.38 \tag{15.83}$$
We can compare this to a binomial regression model with a single intercept:
$$y_n \sim \text{Bin}(m_n, \mu) \tag{15.84}$$
$$\mu = \sigma(\alpha) \tag{15.85}$$
$$\alpha \sim N(0, 1.5) \tag{15.86}$$
Let ᾱ = E[α|D], where D = (y1,1:N, m1:N). The expected acceptance rate across the entire dataset is σ(ᾱ) = 0.38, which matches Equation (15.83). (See logreg_ucb_admissions_numpyro.ipynb for the code.)
Recall that Ai is the number of students admitted in example i, Ni is the number of applicants, µi is the
expected rate of admissions (the variable of interest), and DEPT[i] is the department (6 possible values). For
pedagogical reasons, we replace the categorical variable GENDER[i] with the binary indicator MALE[i]. We
can create a model with varying intercept and varying slope as follows:
Ai ∼ Bin(Ni , µi ) (15.87)
logit(µi ) = αDEPT[i] + βDEPT[i] × MALE[i] (15.88)
This has 12 parameters, as does the original formulation in Equation (15.65). However, these are not
independent degrees of freedom. In particular, the intercept and slope are correlated, as we see in Figure 15.2
(higher admissions means steeper slope). We can capture this using the following prior:
$$(\alpha_j, \beta_j) \sim N\!\left( \begin{pmatrix} \alpha \\ \beta \end{pmatrix}, \Sigma \right) \tag{15.89}$$
$$\alpha \sim N(0, 4) \tag{15.90}$$
$$\beta \sim N(0, 1) \tag{15.91}$$
$$\Sigma = \text{diag}(\sigma) \, R \, \text{diag}(\sigma) \tag{15.92}$$
$$R \sim \text{LKJ}(2) \tag{15.93}$$
$$\sigma \sim \prod_{d=1}^2 N_+(\sigma_d | 0, 1) \tag{15.94}$$
We can write this more compactly in the following way.2 We define u = (α, β) and wj = (αj, βj), and then use this model:
$$w_j = u + \text{diag}(\sigma) L z_j \tag{15.98}$$
where L = chol(R) is the Cholesky factor of the correlation matrix R, and zj ∼ N(0, I2). Thus the model becomes the following:3
$$z_j \sim N(0, I_2) \tag{15.99}$$
$$v_j = \text{diag}(\sigma) L z_j \tag{15.100}$$
$$u \sim N(0, \text{diag}(4, 1)) \tag{15.101}$$
$$\text{logit}(\mu_i) = u[0] + v[\text{DEPT}[i], 0] + (u[1] + v[\text{DEPT}[i], 1]) \times \text{MALE}[i] \tag{15.102}$$
This is the version of the model that is implemented in the numpyro code.
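A minimal numpyro sketch of this non-centered parameterization is shown below. Again this is illustrative, not the actual notebook code; in particular, we treat the prior scales in Equation (15.101) as standard deviations, and the site names are assumptions.

```python
import jax.numpy as jnp
import numpyro
import numpyro.distributions as dist

def model(dept, male, N, A=None):
    # dept: 0-based department index (J = 6 departments); male: 0/1 indicator.
    u = numpyro.sample("u", dist.Normal(jnp.zeros(2), jnp.array([4.0, 1.0])))  # Eq (15.101)
    sigma = numpyro.sample("sigma", dist.HalfNormal(jnp.ones(2)))              # Eq (15.94)
    L = numpyro.sample("L", dist.LKJCholesky(2, concentration=2.0))            # chol(R), Eq (15.93)
    z = numpyro.sample("z", dist.Normal(jnp.zeros((6, 2)), 1.0))               # Eq (15.99)
    v = (jnp.diag(sigma) @ L @ z.T).T                                          # Eq (15.100)
    logits = u[0] + v[dept, 0] + (u[1] + v[dept, 1]) * male                    # Eq (15.102)
    numpyro.sample("A", dist.Binomial(total_count=N, logits=logits), obs=A)
```

The point of sampling the standardized offsets z (rather than w directly) is that the posterior geometry is much easier for HMC when the group-level scales σ are small.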
The results of fitting this model are shown in Figure 15.5(b). The fit is slightly better than in Figure 15.2b,
especially for the second column (females in department 2), where the observed value is now inside the
predictive interval.
2 In https://bit.ly/3mP1QWH, this is referred to as glmm4. Note that we use w instead of v, and we use u instead of vµ .
3 In https://bit.ly/3mP1QWH, this is referred to as glmm5.
Figure 15.5: (a) Generalized linear mixed model for inputs di (department) and mi (male), and output Ai (number of admissions), given Ni (number of applicants). (b) Results of fitting this model to the UCB dataset. Generated by logreg_ucb_admissions_numpyro.ipynb.
Chapter 16
Chapter 17
Gaussian processes
Chapter 18
Structured prediction
Chapter 19
Part IV
Generation
Chapter 20
Chapter 21
Variational autoencoders
Chapter 22
Auto-regressive models
Chapter 23
Normalizing flows
Chapter 24
Energy-based models
Chapter 25
Chapter 26
Part V
Discovery
Chapter 27
Chapter 28
where W[k, :] = wk is the distribution over words for the k’th topic. See Figure 28.1 for the corresponding
PGM-D.
We typically use a Dirichlet prior for the topic parameters, p(wk) = Dir(wk|β1V); by setting β small enough, we can encourage these topics to be sparse, so that each topic only predicts a subset of the words. In addition, we use a Dirichlet prior on the latent factors, p(zn) = Dir(zn|α1Nz). If we set α small enough, we can encourage the topic distribution for each document to be sparse, so that each document only contains a subset of the topics. See Figure 28.2 for an illustration.
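To make the generative process explicit, here is a minimal NumPy sketch that samples a synthetic corpus from the LDA model just described; the function and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_lda_corpus(N_docs, doc_len, K, V, alpha=0.1, beta=0.01):
    # Topic-word distributions w_k ~ Dir(beta 1_V), one per topic.
    W = rng.dirichlet(beta * np.ones(V), size=K)
    docs = []
    for _ in range(N_docs):
        z = rng.dirichlet(alpha * np.ones(K))       # doc-level topic distribution z_n
        c = rng.choice(K, size=doc_len, p=z)        # topic assignment c_nl per token
        x = np.array([rng.choice(V, p=W[k]) for k in c])   # word x_nl per token
        docs.append((z, c, x))
    return W, docs
```

With small alpha and beta, most documents use only a few topics and most topics put mass on only a few words, as described above.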
Figure 28.1: Latent Dirichlet Allocation (LDA) as a PGM-D. (a) Unrolled form. (b) Plate form.
Figure 28.2: Illustration of latent Dirichlet allocation (LDA). We have color coded certain words by the topic they have been assigned to: yellow represents the genetics cluster, pink represents the evolution cluster, blue represents the data analysis cluster, and green represents the neuroscience cluster. Each topic is in turn defined as a sparse distribution over words. This article is not related to neuroscience, so no words are assigned to the green topic. The overall distribution over topic assignments for this document is shown on the right as a sparse histogram. Adapted from Figure 1 of [Ble12]. Used with kind permission of David Blei.
Topic 77 Topic 82 Topic 166
word prob. word prob. word prob.
MUSIC .090 LITERATURE .031 PLAY .136
DANCE .034 POEM .028 BALL .129
SONG .033 POETRY .027 GAME .065
PLAY .030 POET .020 PLAYING .042
SING .026 PLAYS .019 HIT .032
SINGING .026 POEMS .019 PLAYED .031
BAND .026 PLAY .015 BASEBALL .027
PLAYED .023 LITERARY .013 GAMES .025
SANG .022 WRITERS .013 BAT .019
SONGS .021 DRAMA .012 RUN .019
DANCING .020 WROTE .012 THROW .016
PIANO .017 POETS .011 BALLS .015
PLAYING .016 WRITER .011 TENNIS .011
RHYTHM .015 SHAKESPEARE .010 HOME .010
ALBERT .013 WRITTEN .009 CATCH .010
MUSICAL .013 STAGE .009 FIELD .010
Figure 28.3: Three topics related to the word play. From Figure 9 of [SG07]. Used with kind permission of Tom
Griffiths.
Document #29795
Bix beiderbecke, at age060 fifteen207, sat174 on the slope071 of a bluff055 overlooking027 the mississippi137 river137. He
was listening077 to music077 coming009 from a passing043 riverboat. The music077 had already captured006 his heart157
as well as his ear119. It was jazz077. Bix beiderbecke had already had music077 lessons077. He showed002 promise134 on
the piano077, and his parents035 hoped268 he might consider118 becoming a concert077 pianist077. But bix was
interested268 in another kind050 of music077. He wanted268 to play077 the cornet. And he wanted268 to play077 jazz077...
Document #1883
There is a simple050 reason106 why there are so few periods078 of really great theater082 in our whole western046 world.
Too many things300 have to come right at the very same time. The dramatists must have the right actors082, the
actors082 must have the right playhouses, the playhouses must have the right audiences082. We must remember288 that
plays082 exist143 to be performed077, not merely050 to be read254. ( even when you read254 a play082 to yourself, try288 to
perform062 it, to put174 it on a stage078, as you go along.) as soon028 as a play082 has to be performed082, then some
kind126 of theatrical082...
Document #21359
Jim296 has a game166 book254. Jim296 reads254 the book254. Jim296 sees081 a game166 for one. Jim296 plays166 the game166.
Jim296 likes081 the game166 for one. The game166 book254 helps081 jim296. Don180 comes040 into the house038. Don180 and
jim296 read254 the game166 book254. The boys020 see a game166 for two. The two boys020 play166 the game166. The
boys020 play166 the game166 for two. The boys020 like the game166. Meg282 comes040 into the house282. Meg282 and
don180 and jim296 read254 the book254. They see a game166 for three. Meg282 and don180 and jim296 play166 the game166.
They play166 ...
Figure 28.4: Three documents from the TASA corpus containing different senses of the word play. Grayed out words
were ignored by the model, because they correspond to uninteresting stop words (such as “and”, “the”, etc.) or very low
frequency words. From Figure 10 of [SG07]. Used with kind permission of Tom Griffiths.
Note that an earlier version of LDA, known as probabilistic LSA, was proposed in [Hof99]. (LSA stands
for “latent semantic analysis”, and refers to the application of PCA to text data; see [Mur22, Sec 20.5.1.2]
for details.) The likelihood function, p(x|z), is the same as in LDA, but pLSA does not specify a prior for
z, since it is designed for posterior analysis of a fixed corpus (similar to LSA), rather than being a true
generative model.
28.1.1.2 Polysemy
Each topic is a distribution over words that co-occur together, and which are therefore semantically related.
For example, Figure 28.3 shows 3 topics which were learned from an LDA model fit to the TASA corpus1 .
These seem to correspond to 3 different senses of the word “play”: playing an instrument, a theatrical play,
and playing a sports game.
We can use the inferred document-level topic distribution to overcome polysemy, i.e., to disambiguate
1 The TASA corpus is an untagged collection of educational materials consisting of 37,651 documents and 12,190,931 word
tokens. Words appearing in fewer than 5 documents were replaced with an asterisk, but punctuation was included. The combined
vocabulary was of size 37,202 unique words.
the meaning of a particular word. This is illustrated in Figure 28.4, where a subset of the words are annotated with the topic to which they were assigned (i.e., we show argmaxk p(cnl = k|xn)). In the first document, the word "music" makes it clear that the musical topic (number 77) is present in the document, which in turn makes it more likely that cnl = 77 where l is the index corresponding to the word "play".
Figure 28.5: Output of the correlated topic model (with K = 50 topics) when applied to articles from Science. Nodes represent topics, with the 5 most probable phrases from each topic shown inside. Font size reflects overall prevalence of the topic. See http://www.cs.cmu.edu/~lemur/science/ for an interactive version of this model with 100 topics. From Figure 2 of [BL07]. Used with kind permission of David Blei.
Figure 28.6: PGM-D for the dynamic topic model, in which the topic parameters wkt evolve over time steps t − 1, t, t + 1, and each document at time t has its own topic distribution ztn and assignments stnl for its words xtnl.
2000s, it is more likely to use words like "calcium receptor" (this reflects the general trend of neuroscience towards molecular biology). One way to model this is to assume the topic distributions evolve according to a Gaussian random walk, as in a state space model (see ??); we can map these Gaussian vectors to probabilities via the softmax function. The resulting model is known as a dynamic topic model [BL06a]. See Figure 28.6 for the PGM-D.
One can perform approximate inference in this model using a structured mean field method (??), that
exploits the Kalman smoothing algorithm (??) to perform exact inference on the linear-Gaussian chain
between the wkt nodes (see [BL06a] for details). See the main text for an example of this model applied to
100 years of articles from Science.
Figure 28.7: PGM-D for the LDA-HMM model (per-document topic distribution zn, topic-word matrix W, HMM emission matrix B, and HMM transition matrix A governing the word sequence xn,t−1, xn,t, xn,t+1, . . .).
It is also possible to use amortized inference, and to learn embeddings for each word, which works much
better with rare words. This is called the dynamic embedded topic model [DRB19].
28.1.4 LDA-HMM
The Latent Dirichlet Allocation (LDA) model of Section 28.1.1 assumes words are exchangeable, and thus
ignores word order. A simple way to model sequential dependence between words is to use an HMM. The
trouble with HMMs is that they can only model short-range dependencies, so they cannot capture the overall
gist of a document. Hence they can generate syntactically correct sentences, but not semantically plausible
ones.
It is possible to combine LDA with HMM to create a model called LDA-HMM [Gri+04]. This model
uses the HMM states to model function or syntactic words, such as “and” or “however”, and uses the LDA to
model content or semantic words, which are harder to predict. There is a distinguished HMM state which
specifies when the LDA model should be used to generate the word; the rest of the time, the HMM generates
the word.
More formally, for each document n, the model defines an HMM with states hnl ∈ {0, . . . , H}. In addition,
each document has an LDA model associated with it. If hnl = 0, we generate word xnl from the semantic
LDA model, with topic specified by cnl ; otherwise we generate word xnl from the syntactic HMM model.
The PGM-D is shown in Figure 28.7. The CPDs are as follows:
$$p(c_{nl} = k | z_n) = z_{nk}, \qquad p(h_{nl} = h' | h_{n,l-1} = h) = A[h, h']$$
$$p(x_{nl} = v | c_{nl} = k, h_{nl} = 0) = W[k, v], \qquad p(x_{nl} = v | h_{nl} = h) = B[h, v] \text{ for } h > 0$$
where W is the usual topic-word matrix, B is the state-word HMM emission matrix, and A is the state-state HMM transition matrix.
Inference in this model can be done with collapsed Gibbs sampling, analytically integrating out all the
continuous quantities. See [Gri+04] for the details.
The results of applying this model (with Nz = 200 LDA topics and H = 20 HMM states) to the combined
Brown and TASA corpora2 are shown in Table 28.1. We see that the HMM generally is responsible for
2 The Brown corpus consists of 500 documents and 1,137,466 word tokens, with part-of-speech tags for each token. The TASA
corpus is an untagged collection of educational materials consisting of 37,651 documents and 12,190,931 word tokens. Words
appearing in fewer than 5 documents were replaced with an asterisk, but punctuation was included. The combined vocabulary
was of size 37,202 unique words.
Table 28.1: Upper row: topics extracted when trained by the LDA model on the combined Brown and TASA corpora. Middle row: topics extracted by the LDA part of the LDA-HMM model. Bottom row: topics extracted by the HMM part of the LDA-HMM model. Each column represents a single topic/class, and words appear in order of probability in that topic/class. Since some classes give almost all probability to only a few words, a list is terminated when the words account for 90% of the probability mass. From Figure 2 of [Gri+04]. Used with kind permission of Tom Griffiths.
1. In contrast to this approach, we study here how the overall network activity can control single cell parameters such as input resistance, as well as time and space constants, parameters that are crucial for excitability and spariotemporal (sic) integration.
The integrated architecture in this paper combines feed forward control and error feedback adaptive control using neural networks.
2. In other words, for our proof of convergence, we require the softassign algorithm to return a doubly stochastic matrix as *sinkhorn theorem guarantees that it will instead of a matrix which is merely close to being doubly stochastic based on some reasonable metric.
The aim is to construct a portfolio with a maximal expected return for a given risk level and time horizon while simultaneously obeying *institutional or *legally required constraints.
3. The left graph is the standard experiment the right from a training with # samples.
The graph G is called the *guest graph, and H is called the host graph.
Figure 28.8: Function and content words in the NIPS corpus, as distinguished by the LDA-HMM model. Graylevel indicates posterior probability of assignment to the LDA component, with black being highest. The boxed word appears as a function word in one element of each pair of sentences, and as a content word in the other. Asterisked words had low frequency, and were treated as a single word type by the model. From Figure 4 of [Gri+04]. Used with kind permission of Tom Griffiths.
Figure 28.9: (a) LDA unrolled for N documents. (b) Collapsed LDA, where we integrate out the continuous latents zn
and the continuous topic parameters W.
syntactic words, and the LDA for semantics words. If we did not have the HMM, the LDA topics would
get “polluted” by function words (see top of figure), which is why such words are normally removed during
preprocessing.
The model can also help disambiguate when the same word is being used syntactically or semantically. Figure 28.8 shows some examples when the model was applied to the NIPS corpus.3 We see that the roles of words are distinguished, e.g., "we require the algorithm to return a matrix" (verb) vs "the maximal expected return" (noun). In principle, a part of speech tagger could disambiguate these two uses, but note that (1) the LDA-HMM method is fully unsupervised (no POS tags were used), and (2) sometimes a word can have the same POS tag, but different senses, e.g., "the left graph" (a syntactic role) vs "the graph G" (a semantic role).
More recently, [Die+17] proposed topic-RNN, which is similar to LDA-HMM, but replaces the HMM
model with an RNN, which is a much more powerful model.
However, one can get better performance by analytically integrating out the π i ’s and the wk ’s, both of
which have a Dirichlet distribution, and just sampling the discrete cil ’s. This approach was first suggested in
[GS04], and is an example of collapsed Gibbs sampling. Figure 28.9(b) shows that now all the cil variables
are fully correlated. However, we can sample them one at a time, as we explain below.
First, we need some notation. Let $N_{ivk} = \sum_{l=1}^{L_i} I(c_{il} = k, x_{il} = v)$ be the number of times word v is assigned to topic k in document i. Let $N_{ik} = \sum_v N_{ivk}$ be the number of times any word from document i has been assigned to topic k. Let $N_{vk} = \sum_i N_{ivk}$ be the number of times word v has been assigned to topic k in any document. Let $N_k = \sum_v N_{vk}$ be the number of words assigned to topic k. Finally, let $L_i = \sum_k N_{ik}$ be the number of words in document i; this is observed.
3 NIPS stands for "Neural Information Processing Systems". It is one of the top machine learning conferences.
Figure 28.10: Illustration of (collapsed) Gibbs sampling applied to a small LDA example. There are N = 16 documents, each containing a variable number of words drawn from a vocabulary of V = 5 words (river, stream, bank, money, loan). There are two topics. A white dot means the word is assigned to topic 1, a black dot means the word is assigned to topic 2. (a) The initial random assignment of states. (b) A sample from the posterior after 64 steps of Gibbs sampling. From Figure 7 of [SG07]. Used with kind permission of Tom Griffiths.
We can now derive the marginal prior. By applying ??, one can show that
$$p(c|\alpha) = \prod_i \int \left[ \prod_{l=1}^{L_i} \text{Cat}(c_{il}|z_i) \right] \text{Dir}(z_i|\alpha 1_K) \, dz_i = \prod_i \frac{B(N_{i,:} + \alpha 1_K)}{B(\alpha 1_K)}$$
and, similarly, that the marginal likelihood is $p(x|c, \beta) = \prod_k B(N_{:,k} + \beta 1_V)/B(\beta 1_V)$, where $N_{i,:} = (N_{i1}, \ldots, N_{iK})$ and $N_{:,k} = (N_{1k}, \ldots, N_{Vk})$ are the count vectors defined above.
From the above equations, and using the fact that $\Gamma(x+1)/\Gamma(x) = x$, we can derive the full conditional for $p(c_{il}|c_{-i,l})$. Define $N_{ivk}^-$ to be the same as $N_{ivk}$ except it is computed by summing over all locations in document i except for $c_{il}$. Also, let $x_{il} = v$. Then
$$p(c_{il} = k | c_{-i,l}, x, \alpha, \beta) \propto \frac{N_{v,k}^- + \beta}{N_k^- + V\beta} \cdot \frac{N_{i,k}^- + \alpha}{L_i + K\alpha} \tag{28.19}$$
We see that a word in a document is assigned to a topic based both on how often that word is generated by
the topic (first term), and also on how often that topic is used in that document (second term).
Given Equation (28.19), we can implement the collapsed Gibbs sampler as follows. We randomly assign a
topic to each word, cil ∈ {1, . . . , K}. We can then sample a new topic as follows: for a given word in the
corpus, decrement the relevant counts, based on the topic assigned to the current word; draw a new topic
from Equation (28.19), update the count matrices; and repeat. This algorithm can be made efficient since
the count matrices are very sparse [Li+14].
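Here is a minimal NumPy sketch of this collapsed Gibbs sampler, directly implementing Equation (28.19). It does not exploit sparsity, and all names are illustrative.

```python
import numpy as np

def collapsed_gibbs_lda(docs, K, V, alpha, beta, n_iters=100, seed=0):
    # docs: list of 1D integer arrays of word ids in {0, ..., V-1}.
    rng = np.random.default_rng(seed)
    Nvk = np.zeros((V, K))          # word-topic counts
    Nik = np.zeros((len(docs), K))  # doc-topic counts
    Nk = np.zeros(K)                # topic totals
    assign = [rng.integers(K, size=len(d)) for d in docs]   # random init
    for i, (doc, z) in enumerate(zip(docs, assign)):
        for v, k in zip(doc, z):
            Nvk[v, k] += 1; Nik[i, k] += 1; Nk[k] += 1
    for _ in range(n_iters):
        for i, (doc, z) in enumerate(zip(docs, assign)):
            for l, v in enumerate(doc):
                k = z[l]            # decrement counts for the current assignment
                Nvk[v, k] -= 1; Nik[i, k] -= 1; Nk[k] -= 1
                # Equation (28.19); the L_i + K alpha denominator is constant in k.
                p = (Nvk[v] + beta) / (Nk + V * beta) * (Nik[i] + alpha)
                k = rng.choice(K, p=p / p.sum())
                z[l] = k            # increment counts for the new assignment
                Nvk[v, k] += 1; Nik[i, k] += 1; Nk[k] += 1
    return assign, Nvk, Nik
```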
This process is illustrated in Figure 28.10 on a small example with two topics, and five words. The left
part of the figure illustrates 16 documents that were sampled from the LDA model using p(money|k = 1) =
p(loan|k = 1) = p(bank|k = 1) = 1/3 and p(river|k = 2) = p(stream|k = 2) = p(bank|k = 2) = 1/3. For
example, we see that the first document contains the word “bank” 4 times (indicated by the four dots in
row 1 of the “bank” column), as well as various other financial terms. The right part of the figure shows the
state of the Gibbs sampler after 64 iterations. The “correct” topic has been assigned to each token in most
cases. For example, in document 1, we see that the word “bank” has been correctly assigned to the financial
topic, based on the presence of the words “money” and “loan”. The posterior mean estimate of the parameters
is given by p̂(money|k = 1) = 0.32, p̂(loan|k = 1) = 0.29, p̂(bank|k = 1) = 0.39, p̂(river|k = 2) = 0.25,
p̂(stream|k = 2) = 0.4, and p̂(bank|k = 2) = 0.35, which is impressively accurate, given that there are only 16
training examples.
where z̃n are the variational parameters for the approximate posterior over zn , and Ñnl are the variational
parameters for the approximate posterior over snl . We will follow the usual mean field recipe. For q(snl ), we
use Bayes’ rule, but where we need to take expectations over the prior:
$$\tilde{N}_{nlk} \propto w_{dk} \exp(\mathbb{E}[\log z_{nk}]) \tag{28.21}$$
where $d = x_{nl}$, and
$$\mathbb{E}[\log z_{nk}] = \psi_k(\tilde{z}_n) \triangleq \Psi(\tilde{z}_{nk}) - \Psi\left( \sum_{k'} \tilde{z}_{nk'} \right) \tag{28.22}$$
where $\Psi$ is the digamma function. The update for $q(z_n)$ is obtained by adding up the expected counts:
$$\tilde{z}_{nk} = \alpha_k + \sum_l \tilde{N}_{nlk} \tag{28.23}$$
The M step becomes
$$\hat{w}_{dk} \propto \beta_d + \sum_n x_{nd} \tilde{N}_{ndk} \tag{28.28}$$
We now modify the algorithm to use variational Bayes (VB) instead of EM, i.e., we infer the parameters as well as the latent variables. There are two advantages to this. First, by setting β ≪ 1, VB will encourage W to be sparse (as in ??). Second, we will be able to generalize this to the online learning setting, as we discuss below.
Our new posterior approximation becomes
$$q(z_n, N_n, W) = \text{Dir}(z_n|\tilde{z}_n) \prod_d \mathcal{M}(N_{nd}|x_{nd}, \tilde{N}_{nd}) \prod_k \text{Dir}(w_k|\tilde{w}_k) \tag{28.29}$$
No normalization is required, since we are just updating the pseudo counts. The overall algorithm is summarized in Algorithm 4.
Algorithm 5: Online VB for LDA
1  Input: {xnd}, Nz, α, β, LR schedule
2  Initialize w̃dk randomly
3  for t = 1 : ∞ do
4      Set step size ηt
5      Pick document n
6      (z̃n, Ñn) = VB-Estep(xn, W̃, α)
7      w̃dk^new = βd + N xnd Ñndk
8      w̃dk = (1 − ηt) w̃dk + ηt w̃dk^new
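Below is a minimal NumPy/SciPy sketch of this online VB procedure, combining the E step of Equations (28.21)-(28.23) (using expected log topic parameters, as appropriate for VB) with the stochastic update of Algorithm 5. The step-size schedule ηt = (τ0 + t)^(−κ) and the inner-loop count are common choices but are assumptions here.

```python
import numpy as np
from scipy.special import digamma

def vb_estep(x_n, W_tilde, alpha, n_inner=20):
    # x_n: (V,) word counts for one document; W_tilde: (V, K) Dirichlet pseudo counts.
    Elog_w = digamma(W_tilde) - digamma(W_tilde.sum(axis=0))    # E[log w_dk]
    K = W_tilde.shape[1]
    z_tilde = np.full(K, alpha) + x_n.sum() / K
    for _ in range(n_inner):
        Elog_z = digamma(z_tilde) - digamma(z_tilde.sum())      # Eq (28.22)
        N_tilde = np.exp(Elog_w + Elog_z)                       # Eq (28.21), unnormalized
        N_tilde /= N_tilde.sum(axis=1, keepdims=True)           # responsibilities over topics
        z_tilde = alpha + (x_n[:, None] * N_tilde).sum(axis=0)  # Eq (28.23)
    return z_tilde, N_tilde

def online_vb_lda(X, K, alpha=0.1, beta=0.01, n_steps=1000, kappa=0.7, tau0=10.0, seed=0):
    # X: (N, V) document-word count matrix.
    rng = np.random.default_rng(seed)
    N, V = X.shape
    W_tilde = rng.gamma(1.0, 1.0, size=(V, K))                  # random init (line 2)
    for t in range(n_steps):
        eta = (tau0 + t) ** (-kappa)                            # step size (line 4)
        n = rng.integers(N)                                     # pick a document (line 5)
        z_tilde, N_tilde = vb_estep(X[n], W_tilde, alpha)       # line 6
        W_new = beta + N * X[n][:, None] * N_tilde              # line 7 (rescaled by N docs)
        W_tilde = (1 - eta) * W_tilde + eta * W_new             # line 8
    return W_tilde
```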
Chapter 29
Chapter 30
State-space models
Chapter 31
Graph learning
where p(xs , xt ) is an edge marginal and p(xt ) is a node marginal. For example, in Figure 31.1(a) we have
$$p(x_1, x_2, x_3, x_4|T) = p(x_1) p(x_2) p(x_3) p(x_4) \, \frac{p(x_1, x_2) \, p(x_2, x_3) \, p(x_2, x_4)}{p(x_1)p(x_2) \; p(x_2)p(x_3) \; p(x_2)p(x_4)} \tag{31.5}$$
Figure 31.1: A tree on nodes 1–4, shown (a) as an undirected graph and (b,c) as directed trees with different roots.
To see the equivalence with the directed representation, let us cancel terms to get
$$p(x_1, x_2, x_3, x_4|T) = p(x_1, x_2) \, \frac{p(x_2, x_3)}{p(x_2)} \, \frac{p(x_2, x_4)}{p(x_2)} \tag{31.6}$$
$$= p(x_1) \, p(x_2|x_1) \, p(x_3|x_2) \, p(x_4|x_2) \tag{31.7}$$
$$= p(x_2) \, p(x_1|x_2) \, p(x_3|x_2) \, p(x_4|x_2) \tag{31.8}$$
where $N_{stjk}$ is the number of times node s is in state j and node t is in state k, and $N_{tk}$ is the number of times node t is in state k. We can rewrite these counts in terms of the empirical distribution: $N_{stjk} = N p_D(x_s = j, x_t = k)$ and $N_{tk} = N p_D(x_t = k)$. Setting θ to the MLEs, this becomes
$$\frac{\log p(D|\theta, T)}{N} = \sum_{t \in V} \sum_k p_D(x_t = k) \log p_D(x_t = k) \tag{31.10}$$
$$+ \sum_{(s,t) \in E(T)} I(x_s, x_t | \hat{\theta}_{st}) \tag{31.11}$$
where $I(x_s, x_t|\hat{\theta}_{st}) \geq 0$ is the mutual information between $x_s$ and $x_t$ given the empirical distribution:
$$I(x_s, x_t|\hat{\theta}_{st}) = \sum_j \sum_k p_D(x_s = j, x_t = k) \log \frac{p_D(x_s = j, x_t = k)}{p_D(x_s = j) \, p_D(x_t = k)} \tag{31.12}$$
Since the first term in Equation (31.11) is independent of the topology T, we can ignore it when learning structure. Thus the tree topology that maximizes the likelihood can be found by computing the maximum weight spanning tree, where the edge weights are the pairwise mutual informations, I(xs, xt|θ̂st). This is called the Chow-Liu algorithm [CL68].
There are several algorithms for finding a max spanning tree (MST). The two best known are Prim's algorithm and Kruskal's algorithm. Both can be implemented to run in O(E log V) time, where E = V² is the number of edges (since we work with a fully connected graph) and V is the number of nodes. See e.g., [SW11, Sec 4.3] for details. Thus the overall running time is O(N V² + V² log V), where the first term is the cost of computing the sufficient statistics.
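Here is a minimal SciPy sketch of the Chow-Liu algorithm: compute all pairwise empirical mutual informations, then find the maximum-weight spanning tree by negating the weights (scipy's minimum_spanning_tree drops zero entries, hence the small jitter). All names are illustrative.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def chow_liu_edges(X, K):
    # X: (N, V) discrete data with values in {0, ..., K-1}.
    # Returns the edge list of the maximum-weight spanning tree, where the
    # weight of edge (s, t) is the empirical mutual information I(x_s, x_t).
    N, V = X.shape
    W = np.zeros((V, V))
    for s in range(V):
        for t in range(s + 1, V):
            joint = np.zeros((K, K))
            np.add.at(joint, (X[:, s], X[:, t]), 1.0)   # empirical joint counts
            joint /= N
            ps, pt = joint.sum(axis=1), joint.sum(axis=0)
            nz = joint > 0
            mi = (joint[nz] * np.log(joint[nz] / np.outer(ps, pt)[nz])).sum()
            # scipy computes a *minimum* spanning tree, so negate (and jitter
            # so that zero-MI edges are still considered present).
            W[s, t] = -(mi + 1e-12)
    mst = minimum_spanning_tree(W)
    return list(zip(*mst.nonzero()))
```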
Figure 31.2 gives an example of the method in action, applied to the binary 20 newsgroups data shown in
??. The tree has been arbitrarily rooted at the node representing “email”. The connections that are learned
seem intuitively reasonable.
Figure 31.2: The MLE tree estimated from the 20-newsgroup data. Generated by chow_liu_tree_demo.py.
It can be useful to learn a forest rather than a single tree, since inference in a forest is much faster than in a tree (we can run belief propagation in each tree in the forest in parallel). The MLE criterion will never choose to omit an edge. However, if we use the marginal likelihood or a penalized likelihood (such as BIC), the optimal solution may be a forest. Below we give the details for the marginal likelihood case.
In Section 31.2.3.2, we explain how to compute the marginal likelihood of any DAG using a Dirichlet prior for the CPTs. The resulting expression can be written as follows:
$$\log p(D|T) = \sum_{t \in V} \log \int \prod_{i=1}^N p(x_{it}|x_{i,\text{pa}(t)}, \theta_t) \, p(\theta_t) \, d\theta_t = \sum_t \text{score}(N_{t,\text{pa}(t)}) \tag{31.13}$$
where $N_{t,\text{pa}(t)}$ are the counts (sufficient statistics) for node t and its parents, and score is defined in Equation (31.26).
Now suppose we only allow DAGs with at most one parent. Following [HGC95, p227], let us associate a weight with each s → t edge, ws,t ≜ score(t|s) − score(t|0), where score(t|0) is the score when t has no parents. Note that the weights might be negative (unlike the MLE case, where edge weights are always non-negative because they correspond to mutual information). Then we can rewrite the objective as follows:
$$\log p(D|T) = \sum_t \text{score}(t|\text{pa}(t)) = \sum_t w_{\text{pa}(t),t} + \sum_t \text{score}(t|0) \tag{31.14}$$
The last term is the same for all trees T , so we can ignore it. Thus finding the most probable tree amounts
to finding a maximal branching in the corresponding weighted directed graph. This can be found using
the algorithm in [GGS84].
If the scoring function is prior and likelihood equivalent (these terms are explained in Section 31.2.3.3),
we have
score(s|t) + score(t|0) = score(t|s) + score(s|0) (31.15)
and hence the weight matrix is symmetric. In this case, the maximal branching is the same as the maximal
weight forest. We can apply a slightly modified version of the MST algorithm to find this [EAL10]. To see
this, let G = (V, E) be a graph with both positive and negative edge weights. Now let G0 be a graph obtained
by omitting all the negative edges from G. This cannot reduce the total weight, so we can find the maximum
weight forest of G by finding the MST for each connected component of G0 . We can do this by running
Kruskal’s algorithm directly on G0 : there is no need to find the connected components explicitly.
Figure 31.3: A simple linear Gaussian model.
is to learn a mixture of trees [MJ00], where each mixture component may have a different tree topology.
This is like an unsupervised version of the TAN classifier discussed in ??. We can fit a mixture of trees by
using EM: in the E step, we compute the responsibilities of each cluster for each data point, and in the M
step, we use a weighted version of the Chow-Liu algorithm. See [MJ00] for details.
In fact, it is possible to create an "infinite mixture of trees", by integrating out over all possible trees. Remarkably, this can be done in V³ time using the matrix tree theorem. This allows us to perform exact Bayesian inference of posterior edge marginals etc. However, it is not tractable to use this infinite mixture for inference of hidden nodes. See [MJ06] for details.
31.2.1 Faithfulness
The Markov assumption allows us to infer CI properties of a distribution p from a graph G. To go in the opposite direction, we need to assume that the generating distribution p is faithful to the generating DAG G. This means that all the conditional independence (CI) properties of p are exactly captured by the graphical structure, so I(p) = I(G); this means there cannot be any CI properties in p that are due to particular settings of the parameters (such as zeros in a regression matrix) that are not graphically explicit. (For this reason, a faithful distribution is also called a stable distribution.)
Let us consider an example of a non-faithful distribution (from [PJS17, Sec 6.5.3]). Consider a linear
Gaussian model of the form
$$X = E_X, \quad E_X \sim N(0, \sigma_X^2) \tag{31.16}$$
$$Y = aX + E_Y, \quad E_Y \sim N(0, \sigma_Y^2) \tag{31.17}$$
$$Z = bY + cX + E_Z, \quad E_Z \sim N(0, \sigma_Z^2) \tag{31.18}$$
where the error terms are independent. If ab + c = 0, then X ⊥ Z, even though this is not implied by the
DAG in Figure 31.3. Fortunately, this kind of accidental cancellation happens with zero probability if the
coefficients are drawn randomly from positive densities [SGS00, Thm 3.2].
Figure 31.4: Three DAGs, G1, G2, and G3, on nodes X1–X5 (G1 and G3 are Markov equivalent, but G2 is not).
X → Y → Z  ≡  X ← Y ← Z  ≡  X ← Y → Z
We say these graphs are Markov equivalent, since they encode the same set of CI assumptions. That is, they all belong to the same Markov equivalence class. However, the DAG X → Y ← Z encodes X ⊥ Z and X ⊥̸ Z|Y, so corresponds to a different distribution.
Verma and Pearl [VP90] proved the following theorem.
Theorem 31.2.1. Two structures are Markov equivalent iff they have the same skeleton, i.e., they have the same edges (disregarding direction) and they have the same set of v-structures (colliders whose parents are not adjacent).
For example, referring to Figure 31.4, we see that G1 6≡ G2 , since reversing the 2 → 4 arc creates a new
v-structure. However, G1 ≡ G3 , since reversing the 1 → 5 arc does not create a new v-structure.
We can represent a Markov equivalence class using a single partially directed acyclic graph or PDAG (also called an essential graph or pattern), in which some edges are directed and some undirected (see ??). The undirected edges represent reversible edges; any combination is possible so long as no new v-structures are created. The directed edges are called compelled edges, since changing their orientation would change the v-structures and hence change the equivalence class. For example, the PDAG X − Y − Z represents {X → Y → Z, X ← Y ← Z, X ← Y → Z}, which encodes X ⊥̸ Z and X ⊥ Z|Y. See Figure 31.4 for another example.
The significance of the above theorem is that, when we learn the DAG structure from data, we will not be
able to uniquely identify all of the edge directions, even given an infinite amount of data. We say that we
can learn DAG structure “up to Markov equivalence”. This also cautions us not to read too much into the
meaning of particular edge orientations, since we can often change them without changing the model in any
observable way. (If we want to distinguish between edge orientations within a PDAG (e.g., if we want to
imbue a causal interpretation on the edges), we can use interventional data, as we discuss in Section 31.4.2.)
31.2.3 Bayesian model selection: statistical foundations
In this section, we discuss how to compute the exact posterior over graphs, p(G|D), ignoring for now the issue
of computational tractability. We assume there is no missing data, and that there are no hidden variables.
This is called the complete data assumption.
For simplicity, we will focus on the case where all the variables are categorical and all the CPDs are tables.
Our presentation is based in part on [HGC95], although we will follow the notation of ??. In particular, let $x_{it} \in \{1, \ldots, K_t\}$ be the value of node t in case i, where $K_t$ is the number of states for node t. Let $\theta_{tck} \triangleq p(x_t = k | x_{\text{pa}(t)} = c)$, for $k = 1 : K_t$ and $c = 1 : C_t$, where $C_t$ is the number of parent combinations (possible conditioning cases). For notational simplicity, we will often assume $K_t = K$, so all nodes have the same number of states. We will also let $d_t = \dim(\text{pa}(t))$ be the degree or fan-in of node t, so that $C_t = K^{d_t}$.
$$p(D|G, \theta) = \prod_{i=1}^N \prod_{t=1}^V \text{Cat}(x_{it}|x_{i,\text{pa}(t)}, \theta_t) \tag{31.20}$$
$$= \prod_{i=1}^N \prod_{t=1}^V \prod_{c=1}^{C_t} \prod_{k=1}^{K_t} \theta_{tck}^{I(x_{it}=k, \, x_{i,\text{pa}(t)}=c)} = \prod_{t=1}^V \prod_{c=1}^{C_t} \prod_{k=1}^{K_t} \theta_{tck}^{N_{tck}} \tag{31.21}$$
where Ntck is the number of times node t is in state k and its parents are in state c. (Technically these counts
depend on the graph structure G, but we drop this from the notation.)
parents, and score() is a local scoring function defined by
$$\text{score}(N_{t,\text{pa}(t)}) \triangleq \prod_{c=1}^{C_t} \frac{B(N_{tc} + \alpha_{tc})}{B(\alpha_{tc})} \tag{31.26}$$
We say that the marginal likelihood decomposes or factorizes according to the graph structure.
The BDe prior sets the Dirichlet pseudo counts to
$$\alpha_{tck} = \alpha \, p_0(x_t = k, x_{\text{pa}(t)} = c) \tag{31.27}$$
where α > 0 is called the equivalent sample size, and p0 is some prior joint probability distribution. This is called the BDe prior, which stands for Bayesian Dirichlet likelihood equivalent.
To derive the hyper-parameters for other graph structures, Geiger and Heckerman [GH97] invoked an additional assumption called parameter modularity, which says that if node Xt has the same parents in G1 and G2, then p(θt|G1) = p(θt|G2). With this assumption, we can always derive αt for a node t in any other graph by marginalizing the pseudo counts in Equation (31.27).
Typically the prior distribution p0 is assumed to be uniform over all possible joint configurations. In this case, we have αtck = α/(Kt Ct), since p0(xt = k, xpa(t) = c) = 1/(Kt Ct). Thus if we sum the pseudo counts over all Ct × Kt entries in the CPT, we get a total equivalent sample size of α. This is called the BDeu prior, where the "u" stands for uniform. This is the most widely used prior for learning Bayes net structures. For advice on setting the global tuning parameter α, see [SKM07].
Figure 31.6: The two most probable DAGs learned from the Sewell-Shah data. From [HMC97]. Used with kind permission of David Heckerman.
If all CPDs are linear Gaussian, we can replace the Dirichlet-multinomial model with the normal-gamma
model, and thus derive a different exact expression for the marginal likelihood. See [GH94] for the details.
In fact, we can easily combine discrete nodes and Gaussian nodes, as long as the discrete nodes always
have discrete parents; this is called a conditional Gaussian DAG. Again, we can compute the marginal
likelihood in closed form. See [BD03] for the details.
In the general case (i.e., everything except Gaussians and CPTs), we need to approximate the marginal
likelihood. The simplest approach is to use the BIC approximation, which has the form
$$\sum_t \left[ \log p(D_t|\hat{\theta}_t) - \frac{K_t C_t}{2} \log N \right] \tag{31.28}$$
Suppose we know a total ordering of the nodes. Then we can compute the distribution over parents for each node independently, without the risk of introducing any directed cycles: we simply enumerate over all possible subsets of ancestors and compute their marginal likelihoods. If we just return the best set of parents for each node, we get the K2 algorithm [CH92]. Alternatively, we can find the best set of parents for each node using ℓ1-regularization, as shown in [SNMM07].
In general, the ordering of the nodes is not known, so the posterior does not decompose. Nevertheless, we
can use dynamic programming to find the globally optimal MAP DAG (up to Markov equivalence), as shown
in [KS04; SM06].
If our goal is knowledge discovery, the MAP DAG can be misleading, for reasons we discussed in ??. A
better approach is to compute the marginal probability that each edge is present, p(Gst = 1|D). We can also
compute these quantities using dynamic programming, as shown in [Koi06; PK11].
Unfortunately, all of these methods take V 2^V time in the general case, making them intractable for graphs with more than about 16 nodes.
Figure 31.7: A locally optimal DAG learned from the 20-newsgroup data. From Figure 4.10 of [Sch10a]. Used with
kind permission of Mark Schmidt.
The number f(D) of DAGs on D nodes satisfies the recurrence
$$f(D) = \sum_{i=1}^D (-1)^{i+1} \binom{D}{i} 2^{i(D-i)} f(D-i)$$
with base case f(0) = f(1) = 1. Solving this recurrence yields the following sequence: 1, 3, 25, 543, 29281, 3781503, etc.1
Indeed, the general problem of finding the globally optimal MAP DAG is provably NP-complete [Chi96].
In view of the enormous size of the hypothesis space, we are generally forced to use approximate methods,
some of which we review below.
1 The number of DAGs is equal to the number of (0,1) matrices all of whose eigenvalues are positive real numbers [McK+04].
One way to prune the search space is to use the Markov blankets estimated from a dependency network [Sch10a]. Figure 31.7 gives an example of a DAG learned in this way from the 20-newsgroup data. For binary data, it is possible to use techniques from frequent itemset mining to find good Markov blanket candidates, as described in [GM04].
We can use techniques such as multiple random restarts to increase the chance of finding a good local
maximum. We can also use more sophisticated local search methods, such as genetic algorithms or simulated
annealing, for structure learning. (See also Section 31.2.6 for gradient based techniques based on continuous
relaxations.)
It is also possible to perform the greedy search in the space of PDAGs instead of in the space of DAGs;
this is known as the greedy equivalence search method [Chi02]. Although each step is somewhat more
complicated, the advantage is that the search space is smaller.
31.2.5.1 IC algorithm
The original algorithm, due to Verma and Pearl [VP90], was called the IC algorithm, which stands for
“inductive causation”. The method is as follows [Pea09, p50]:
1. For each pair of variables a and b, search for a set Sab such that a ⊥ b|Sab . Construct an undirected
graph such that a and b are connected iff no such set Sab can be found (i.e., they cannot be made
conditionally independent).
2. Orient the edges involved in v-structures as follows: for each pair of nonadjacent nodes a and b with a
common neighbor c, check if c ∈ Sab ; if it is, the corresponding DAG must be a → c → b, a ← c → b
or a ← c ← b, so we cannot determine the direction; if it is not, the DAG must be a → c ← b, so add
these arrows to the graph.
3. In the partially directed graph that results, orient as many of the undirected edges as possible, subject
to two conditions: (1) the orientation should not create a new v-structure (since that would have been
detected already if it existed), and (2) the orientation should not create a directed cycle. More precisely,
follow the rules shown in Figure 31.8. In the first case, if X → Y has a known orientation, but Y − Z is
unknown, then we must have Y → Z, otherwise we would have created a new v-structure X → Y ← Z,
which is not allowed. The other two cases follow similar reasoning.
Figure 31.8: The 3 rules for inferring compelled edges in PDAGs. Adapted from [Pe’05].
31.2.5.2 PC algorithm
A significant speedup of IC, known as the PC algorithm after its creators Peter Spirtes and Clark Glymour
[SG91], can be obtained by ordering the search for separating sets in step 1 in terms of sets of increasing
cardinality. We start with a fully connected graph, and then look for sets Sab of size 0, then of size 1, and so
on; as soon as we find a separating set, we remove the corresponding edge. See Figure 31.9 for an example.
Another variant on the PC algorithm is to learn the original undirected structure (i.e., the Markov blanket
of each node) using generic variable selection techniques instead of CI tests. This tends to be more robust,
since it avoids issues of statistical significance that can arise with independence tests. See [PE08] for details.
The running time of the PC algorithm is O(D^{K+1}) [SGS00, p85], where D is the number of nodes and K
is the maximal degree (number of neighbors) of any node in the corresponding undirected graph.
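The following sketch illustrates step 1 (skeleton discovery) and the v-structure orientation of step 2 for the Gaussian case, using a Fisher-z partial correlation test. The helper names and the max_k cutoff are ours; a production implementation (e.g., the pcalg package) handles many more details.

```python
import numpy as np
from itertools import combinations
from scipy.stats import norm

def indep(R, n, i, j, S, alpha=0.05):
    # Fisher-z test of X_i _||_ X_j | X_S, from the correlation matrix R
    idx = [i, j] + list(S)
    P = np.linalg.inv(R[np.ix_(idx, idx)])
    r = -P[0, 1] / np.sqrt(P[0, 0] * P[1, 1])        # partial correlation
    z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(n - len(S) - 3)
    return 2 * norm.sf(abs(z)) > alpha                # True => independent

def pc_skeleton(R, n, max_k=3):
    d = R.shape[0]
    adj = {frozenset((i, j)) for i in range(d) for j in range(i + 1, d)}
    sep = {}
    for k in range(max_k + 1):            # separating sets of increasing size
        for e in list(adj):
            i, j = tuple(e)
            nbrs = [v for v in range(d) if frozenset((i, v)) in adj and v != j]
            for S in combinations(nbrs, k):
                if indep(R, n, i, j, S):
                    adj.discard(e)        # remove edge as soon as separated
                    sep[e] = set(S)
                    break
    return adj, sep

def orient_v_structures(adj, sep, d):
    # Step 2 of IC/PC: a - c - b with a, b nonadjacent and c not in S_ab
    arrows = set()
    for c in range(d):
        nbrs = [v for v in range(d) if frozenset((v, c)) in adj]
        for a, b in combinations(nbrs, 2):
            e = frozenset((a, b))
            if e not in adj and c not in sep.get(e, {c}):
                arrows |= {(a, c), (b, c)}            # a -> c <- b
    return arrows
```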
Figure 31.9: Example of step 1 of the PC algorithm. From Figure 5.1 of [SGS00]. Used with kind permission of Peter
Spirtes.
use.) In particular, they show how to convert the combinatorial problem into a continuous problem:
\min_{W \in \mathbb{R}^{D \times D}} f(W) \;\;\text{s.t.}\;\; G(W) \in \text{DAGs} \iff \min_{W \in \mathbb{R}^{D \times D}} f(W) \;\;\text{s.t.}\;\; h(W) = 0 \qquad (31.30)
Here W is a weighted adjacency matrix on D nodes, G(W) is the corresponding graph (obtained by
thresholding W at 0), f (W) is a scoring function (e.g., penalized log likelihood), and h(W) is a constraint
function that measures how close W is to defining a DAG. The constraint is given by
h(W) = \mathrm{tr}\big((I + \alpha W)^d\big) - d \propto \mathrm{tr}\Big(\sum_{k=1}^{d} \alpha^k W^k\Big) \qquad (31.31)
where W^k = W \cdots W with k terms, and α > 0 is a regularizer. Element (i, j) of W^k will be non-zero iff
there is a path from j to i made of k edges. Hence the diagonal elements count the number of paths from
a node back to itself in k steps. Thus h(W) will be 0 if W defines a valid DAG.
The scoring function considered in [Zhe+18] has the form
f(W) = \frac{1}{2N} \|X - XW\|_F^2 + \lambda \|W\|_1 \qquad (31.32)
where X \in \mathbb{R}^{N \times D} is the data matrix. They show how to find a local optimum of the equality-constrained
objective using gradient-based methods. The cost per iteration is O(D^3).
Several extensions of this have been proposed. For example, [Yu+19] replace the Gaussian noise assumption
with a VAE (variational autoencoder, ??), and use a graph neural network as the encoder/decoder. And
[Lac+20] relax the linearity assumption, and allow for the use of neural network dependencies between
variables.
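As a concrete illustration, here is a minimal sketch of the linear version, with our own simplification: the augmented Lagrangian of [Zhe+18] is replaced by a single fixed quadratic penalty on h(W), and we follow [Zhe+18] in applying the acyclicity measure to the elementwise square W*W so that all path weights are nonnegative. The \ell_1 term is nonsmooth, so a proximal method would be preferable to the quasi-Newton call used here.

```python
import numpy as np
from scipy.optimize import minimize

def h_acyc(W, alpha=0.1):
    # Acyclicity measure in the spirit of Eq. (31.31), on W*W (elementwise)
    d = W.shape[0]
    M = np.eye(d) + alpha * W * W
    return np.trace(np.linalg.matrix_power(M, d)) - d

def fit_linear_dag(X, lam=0.1, rho=100.0, thresh=0.3):
    # Crude penalty version of the constrained problem (31.30)/(31.32)
    N, d = X.shape
    def obj(w):
        W = w.reshape(d, d)
        f = ((X - X @ W) ** 2).sum() / (2 * N) + lam * np.abs(W).sum()
        return f + rho * h_acyc(W) ** 2
    res = minimize(obj, np.zeros(d * d), method="L-BFGS-B")
    W = res.x.reshape(d, d)
    W[np.abs(W) < thresh] = 0.0   # threshold small weights to read off a graph
    return W
```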
The posterior mode (MAP) is known to converge to the MLE, which in turn will converge to the true
graph G (up to Markov equivalence), so any exact algorithm for Bayesian inference is a consistent estimator.
[Chi02] showed that his greedy equivalence search method (which is a form of hill climbing in the space of
PDAGs) is a consistent estimator. Similarly, [SGS00; KB07] showed that the PC algorithm is a consistent estimator.
However, the running time of these algorithms might be exponential in the number of nodes. Also, all of
these methods assume that all the variables are fully observed.
In general this is intractable to compute. For example, consider a mixture model, where we don’t observe the
cluster label. In this case, there are K^N possible completions of the data (assuming we have K clusters);
we can evaluate the inner integral for each one of these assignments to h, but we cannot afford to evaluate
all of the integrals. (Of course, most of these integrals will correspond to hypotheses with little posterior
support, such as assigning single data points to isolated clusters, but we don’t know ahead of time the relative
weight of these assignments.) Below we mention some faster deterministic approximations for the marginal
likelihood.
However, comparing this to Equation (31.33), we can see that the value will be exponentially smaller, since it
does not sum over all values of h. To correct for this, we first write
[CPT entries from Figure 31.10:
p(H=1) = 0.37;  p(SEX=male) = 0.48
p(SES=high|H): H=0: 0.088, H=1: 0.51
p(IQ=high|PE,H): (low,0): 0.098, (low,1): 0.22, (high,0): 0.21, (high,1): 0.49
p(PE=high|SES,SEX): (low,male): 0.32, (low,female): 0.166, (high,male): 0.86, (high,female): 0.81
p(CP=yes|SES,IQ,PE): (low,low,low): 0.011, (low,low,high): 0.170, (low,high,low): 0.124, (low,high,high): 0.53, (high,low,low): 0.093, (high,low,high): 0.39, (high,high,low): 0.24, (high,high,high): 0.84]
Figure 31.10: The most probable DAG with a single binary hidden variable learned from the Sewell-Shah data. MAP
estimates of the CPT entries are shown for some of the nodes. From [HMC97]. Used with kind permission of David
Heckerman.
The first term p(D̄|G), where D̄ denotes the filled-in data, can be computed by plugging the filled-in data into
the exact marginal likelihood. The second term p(D|θ̂, G), which involves an exponential sum (thus matching
the “dimensionality” of the left hand side), can be computed using an inference algorithm. The final term
p(D̄|θ̂, G) can be computed by plugging the filled-in data into the regular likelihood.
where zi are the hidden variables in case i. In the E step, we update the q(zi ), and in the M step, we update
q(θ). The corresponding variational free energy provides a lower bound on the log marginal likelihood. In
[BG06], it is shown that this bound is a much better approximation to the true log marginal likelihood (as
estimated by a slow annealed importance sampling procedure) than either BIC or CS. In fact, one can prove
that the variational bound will always be more accurate than CS (which in turn is always more accurate
than BIC).
Heckerman et al. decided to see what would happen if they introduced a hidden variable H, which they made a parent of
both SES and IQ, representing a hidden common cause. They also considered a variant in which H points
to SES, IQ and PE. For both such cases, they considered dropping none, one, or both of the SES-PE and
PE-IQ edges. They varied the number of states for the hidden node from 2 to 6. Thus they computed the
approximate posterior over 8 × 5 = 40 different models, using the CS approximation.
The most probable model which they found is shown in Figure 31.10. This is 2 · 10^10 times more likely
than the best model containing no hidden variable. It is also 5 · 10^9 times more likely than the second most
probable model with a hidden variable. So again the posterior is very peaked.
These results suggest that there is indeed a hidden common cause underlying both the socio-economic
status of the parents and the IQ of the children. By examining the CPT entries, we see that both SES and IQ
are more likely to be high when H takes on the value 1. They interpret this to mean that the hidden variable
represents “parent quality” (possibly a genetic factor). Note, however, that the arc between H and SES can
be reversed without changing the v-structures in the graph, and thus without affecting the likelihood; this
underscores the difficulty in interpreting hidden variables.
Interestingly, the hidden variable model has the same conditional independence assumptions amongst the
visible variables as the most probable visible variable model. So it is not possible to distinguish between these
hypotheses by merely looking at the empirical conditional independencies in the data (which is the basis
of the constraint-based approach to structure learning discussed in Section 31.2.5). Instead, by adopting a
Bayesian approach, which takes parsimony into account (and not just conditional independence), we can
discover the possible existence of hidden factors. This is the basis of much of scientific and everyday human
reasoning (see e.g. [GT09] for a discussion).
31.2.8.6 Structural EM
One way to perform structural inference in the presence of missing data is to use a standard search procedure
(deterministic or stochastic), and to use the methods from Section 31.2.8.1 to estimate the marginal likelihood.
However, this approach is not very efficient, because the marginal likelihood does not decompose when we
have missing data, and nor do its approximations. For example, if we use the CS approximation or the VBEM
approximation, we have to perform inference in every neighboring model, just to evaluate the quality of a
single move!
[Fri97; Thi+98] present a much more efficient approach called the structural EM algorithm. The basic
idea is this: instead of fitting each candidate neighboring graph and then filling in its data, fill in the data
once, and use this filled-in data to evaluate the score of all the neighbors. Although this might be a bad
approximation to the marginal likelihood, it can be a good enough approximation of the difference in marginal
likelihoods between different models, which is all we need in order to pick the best neighbor.
More precisely, define D(G0 , θ̂0 ) to be the data filled in using model G0 with MAP parameters θ̂0 . Now
define a modified BIC score as follows:
\mathrm{BIC}(G, D) \triangleq \log p(D \mid \hat{\theta}, G) - \frac{\log N}{2} \dim(G) + \log p(G) + \log p(\hat{\theta} \mid G) \qquad (31.41)
where we have included the log prior for the graph and parameters. One can show [Fri97] that if we pick a
graph G which increases the BIC score relative to G0 on the expected data, it will also increase the score on
the actual data, i.e.,
\mathrm{BIC}(G, D(G_0, \hat{\theta}_0)) - \mathrm{BIC}(G_0, D(G_0, \hat{\theta}_0)) \le \mathrm{BIC}(G, D) - \mathrm{BIC}(G_0, D) \qquad (31.42)
To convert this into an algorithm, we proceed as follows. First we initialize with some graph G0 and some
set of parameters θ0 . Then we fill-in the data using the current parameters — in practice, this means when
we ask for the expected counts for any particular family, we perform inference using our current model. (If
we know which counts we will need, we can precompute all of them, which is much faster.) We then evaluate
the BIC score of all of our neighbors using the filled-in data, and we pick the best neighbor. We then refit
the model parameters, fill-in the data again, and repeat. For increased speed, we may choose to only refit
the model every few steps, since small changes to the structure hopefully won’t invalidate the parameter
estimates and the filled-in data too much.
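Schematically, the loop looks as follows. The callables fill_in (E step inference producing filled-in data), neighbors (candidate graph moves), bic (Equation (31.41) evaluated on filled-in data), and fit_map (MAP refitting) are hypothetical placeholders for the model-specific routines; this is a sketch of the control flow, not an implementation of [Fri97].

```python
def structural_em(data, G, theta, fill_in, neighbors, bic, fit_map, n_iters=20):
    # Structural EM skeleton: fill in the data once per iteration, then use
    # the same filled-in data to score every neighboring graph.
    for _ in range(n_iters):
        D_bar = fill_in(data, G, theta)            # E step (inference)
        best = max(neighbors(G), key=lambda Gp: bic(Gp, D_bar))
        if bic(best, D_bar) <= bic(G, D_bar):      # no neighbor improves score
            break
        G = best
        theta = fit_map(G, D_bar)                  # M step: refit, then repeat
    return G, theta
```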
Figure 31.11: A DGM with and without hidden variables. For example, the leaves might represent medical symptoms,
the root nodes primary causes (such as smoking, diet and exercise), and the hidden variable can represent mediating
factors, such as heart disease. Marginalizing out the hidden variable induces a clique.
One interesting application is to learn a phylogenetic tree structure. Here the observed leaves are the
DNA or protein sequences of currently alive species, and the goal is to infer the topology of the tree and the
values of the missing internal nodes. There are many classical algorithms for this task (see e.g., [Dur+98]),
but one that uses structural EM is discussed in [Fri+02].
Another interesting application of this method is to learn sparse mixture models [BF02]. The idea is
that we have one hidden variable C specifying the cluster, and we have to choose whether to add edges
C → Xt for each possible feature Xt . Thus some features will be dependent on the cluster id, and some will
be independent. (See also [LFJ04] for a different way to perform this task, using regular EM and a set of bits,
one per feature, that are free to change across data cases.)
Figure 31.12: Part of a hierarchical latent tree (inferred by method BIN-A) modelling co-occurrences in the 20-newsgroup data. From Figure 2 of [HW11]. Used with kind permission of the authors.
Figure 31.13: A partially latent tree learned from the 20-newsgroup data. Note that some words can have multiple
meanings, and get connected to different latent variables, representing different “topics”. For example, the word “win”
can refer to a sports context (represented by h5) or the Microsoft Windows context (represented by h25). From Figure
12 of [Cho+11]. Used with kind permission of Jin Choi.
Figure 31.14: Google’s rephil model. Leaves represent presence or absence of words. Internal nodes represent clusters
of co-occurring words, or “concepts”. All nodes are binary, and all CPDs are noisy-OR. The model contains 12 million
word nodes, 1 million latent cluster nodes, and 350 million edges. Used with kind permission of Brian Milch.
An alternative approach is proposed in [Cho+11], in which the observed data is not constrained to be at
the leaves. This method starts with the Chow-Liu tree on the observed data, and then adds hidden variables
to capture higher-order dependencies between internal nodes. This results in much more compact models,
as shown in Figure 31.13. This model also has better predictive accuracy than other approaches, such as
mixture models, or trees where all the observed data is forced to be at the leaves. Interestingly, one can show
that this method can recover the exact latent tree structure, provided the data is generated from a tree. See
[Cho+11] for details. Note, however, that this approach, unlike [Zha04; HW11], requires that the cardinality
of all the variables, hidden and observed, be the same. Furthermore, if the observed variables are Gaussian,
the hidden variables must be Gaussian also.
Patent #8024372, “Method and apparatus for learning a probabilistic generative model for text”, filed in 2004. Rephil is a more
probabilistically sound version of the method, developed by Uri Lerner et al. The summary below is based on notes by Brian
Milch (who also works at Google).
3 AdSense is Google’s system for matching web pages with content-appropriate ads in an automatic way, by extracting
semantic keywords from web pages. These keywords play a role analogous to the words that users type in when searching; this
latter form of information is used by Google’s AdWords system. The details are secret, but [Lev11] gives an overview.
The model was trained on about 100 billion text snippets or search queries; this takes several weeks,
even on a parallel distributed computing architecture. The resulting model contains 12 million word nodes
and about 1 million latent cluster nodes. There are about 350 million links in the model, including many
cluster-cluster dependencies. The longest path in the graph has length 555, so the model is quite deep.
Exact inference in this model is obviously infeasible. However note that most leaves will be off, since
most words do not occur in a given query; such leaves can be analytically removed. We can also prune out
unlikely hidden nodes by following the strongest links from the words that are on up to their parents to
get a candidate set of concepts. We then perform iterated conditional modes (ICM) to do approximate
inference. (ICM is a deterministic version of Gibbs sampling that sets each node to its most probable state
given the values of its neighbors in its Markov blanket.) This continues until it reaches a local maximum. We
can repeat this process a few times from random starting configurations. At Google, this can be made to run
in 15 milliseconds!
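A generic sketch of ICM for a pairwise model (our own minimal version, not Google's implementation) is as follows.

```python
import numpy as np

def icm(unary, pair, nbrs, x, n_sweeps=20):
    """Iterated conditional modes: repeatedly set each node to its most
    probable state given its Markov blanket, until nothing changes.
    unary[i]: (K,) log-potentials; pair[(i,j)] for i<j: (K,K) log-potentials;
    nbrs[i]: list of neighbors of node i; x: initial state (list of ints)."""
    for _ in range(n_sweeps):
        changed = False
        for i in range(len(x)):
            score = unary[i].copy()
            for j in nbrs[i]:
                score += pair[(i, j)][:, x[j]] if i < j else pair[(j, i)][x[j], :]
            new = int(np.argmax(score))
            changed = changed or (new != x[i])
            x[i] = new
        if not changed:
            break            # reached a local maximum
    return x
```

Restarting from a few random initial configurations, as described above, gives several local maxima from which the best can be kept.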
• If x contains 3 or more independent views of z [Goo74; AMR09; AHK12; HKZ12], sometimes called the
triad constraint.
In terms of algorithms, most of these methods are not based on maximum likelihood, but instead use the
method of moments and spectral methods. For details, see [Ana+14].
Figure 31.15: A dependency network constructed from the 20 newsgroup data. We show all edges with regression weight
above 0.5 in the Markov blankets estimated by `1 penalized logistic regression. Undirected edges represent cases where a
directed edge was found in both directions. From Figure 4.9 of [Sch10a]. Used with kind permission of Mark Schmidt.
Learning undirected graph structure is harder than learning DAG structure since the likelihood does not decompose (see ??). This precludes the
kind of local search methods (both greedy search and MCMC sampling) we used to learn DAG structures,
because the cost of evaluating each neighboring graph is too high, since we have to refit each model from
scratch (there is no way to incrementally update the score of a model). In this section, we discuss several
solutions to this problem.
baseball:windows and bmw:christian. We can gain more insight if we look not only at the sparsity pattern,
but also the values of the regression weights. For example, here are the incoming weights for the first 5 words:
• aids: children (0.53), disease (0.84), fact (0.47), health (0.77), president (0.50), research (0.53)
• baseball: christian (-0.98), drive (-0.49), games (0.81), god (-0.46), government (-0.69), hit (0.62),
memory (-1.29), players (1.16), season (0.31), software (-0.68), windows (-1.45)
• bible: car (-0.72), card (-0.88), christian (0.49), fact (0.21), god (1.01), jesus (0.68), orbit (0.83),
program (-0.56), religion (0.24), version (0.49)
• bmw: car (0.60), christian (-11.54), engine (0.69), god (-0.74), government (-1.01), help (-0.50), windows
(-1.43)
• cancer: disease (0.62), medicine (0.58), patients (0.90), research (0.49), studies (0.70)
Negative weights (shown in italic red in the original figure) represent a dissociative relationship. For example, the
model reflects that baseball:windows is an unlikely combination. It turns out that most of the weights are
negative (1173 negative, 286 positive, 8541 zero) in this model.
[MB06] discuss theoretical conditions under which dependency networks using `1 -regularized linear
regression can recover the true graph structure, assuming the data was generated from a sparse Gaussian
graphical model. We discuss a more general solution in Section 31.3.2.
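A minimal sketch of the basic recipe, assuming binary data in which every column takes both values, might look as follows; it uses scikit-learn's l1-penalized logistic regression, and the function name and the choice C=0.5 are ours.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def markov_blankets(X, C=0.5):
    """Estimate each node's Markov blanket by l1-penalized logistic
    regression of that node on all the others (in the spirit of [MB06]
    and the model behind Figure 31.15). X: (N, V) binary matrix."""
    N, V = X.shape
    W = np.zeros((V, V))
    for t in range(V):
        others = [s for s in range(V) if s != t]
        clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
        clf.fit(X[:, others], X[:, t])
        W[t, others] = clf.coef_[0]     # nonzero weight => s in blanket of t
    return W
```

Thresholding |W| (e.g., at 0.5, as in Figure 31.15) and symmetrizing gives a graph over the variables.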
Let us consider a worked example from [HTF09, p652]. We will use the following adjacency matrix,
representing the cyclic structure, X1 − X2 − X3 − X4 − X1 , and the following empirical covariance matrix:
G = \begin{pmatrix} 0 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \end{pmatrix}, \qquad S = \begin{pmatrix} 10 & 1 & 5 & 4 \\ 1 & 10 & 2 & 6 \\ 5 & 2 & 10 & 3 \\ 4 & 6 & 3 & 10 \end{pmatrix} \qquad (31.46)
(See ggmFitDemo.py for the code to reproduce these numbers, using the coordinate descent algorithm from
[FHT08].) The constrained elements in Ω, and the free elements in Σ, are the ones corresponding to the
absent edges (1, 3) and (2, 4).
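Below is a sketch of the classical modified-regression algorithm for this constrained MLE (Algorithm 17.1 of [HTF09]); it is not the ggmFitDemo.py code, but on this example it should recover essentially the same fit.

```python
import numpy as np

def ggm_fit(S, G, n_iter=200, tol=1e-8):
    """MLE of a Gaussian graphical model with known structure, via the
    modified-regression algorithm of [HTF09, Alg. 17.1]. S: empirical
    covariance; G: binary adjacency matrix. Returns (Sigma_hat, Omega_hat)."""
    p = S.shape[0]
    W = S.copy()
    for _ in range(n_iter):
        W_old = W.copy()
        for j in range(p):
            rest = [i for i in range(p) if i != j]
            nbrs = [i for i in rest if G[i, j]]
            beta = np.zeros(p - 1)
            if nbrs:
                # regress node j on its neighbors only; absent edges => beta = 0
                b = np.linalg.solve(W[np.ix_(nbrs, nbrs)], S[nbrs, j])
                for val, i in zip(b, nbrs):
                    beta[rest.index(i)] = val
            w12 = W[np.ix_(rest, rest)] @ beta
            W[rest, j] = w12
            W[j, rest] = w12
        if np.max(np.abs(W - W_old)) < tol:
            break
    return W, np.linalg.inv(W)
```

Running ggm_fit on the G and S of (31.46) should drive the (1,3) and (2,4) entries of Ω to (numerical) zero, while Σ matches S on the present edges.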
where we assume x begins with a constant 1 term, to account for the offset. (If x only contains 1, the CRF
reduces to an MRF.) Note that we may choose to set some of the vtk and wstjk weights to 0, to ensure
identifiability, although this can also be taken care of by the prior.
To learn sparse structure, we can minimize the following objective:
J = -\sum_{i=1}^{N} \Big[ \sum_t \log \psi_t(y_{it}, x_i, v_t) + \sum_{s=1}^{V} \sum_{t=s+1}^{V} \log \psi_{st}(y_{is}, y_{it}, x_i, w_{st}) \Big] + \lambda_1 \sum_{s=1}^{V} \sum_{t=s+1}^{V} \|w_{st}\|_p + \lambda_2 \sum_{t=1}^{V} \|v_t\|_2^2 \qquad (31.50)
where ||wst ||p is the p-norm; common choices are p = 2 or p = ∞, as explained in ??. This method of CRF
structure learning was first suggested in [Sch+08]. (The use of `1 regularization for learning the structure of
binary MRFs was proposed in [LGK06].)
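For concreteness, the regularizer in Equation (31.50) can be computed as follows; this is a sketch with our own data layout (a dict mapping each edge (s, t) to its weight block, and a list of node weight vectors).

```python
import numpy as np

def sparse_crf_penalty(v, w, lam1, lam2, p=2):
    """Regularizer from Eq. (31.50): a group norm on each edge block w_st
    (a whole edge is pruned when its block shrinks to zero) plus a ridge
    penalty on the node weights v_t. Use p=2 or p=np.inf as in the text."""
    group = sum(np.linalg.norm(w_st.ravel(), ord=p) for w_st in w.values())
    ridge = sum((v_t ** 2).sum() for v_t in v)
    return lam1 * group + lam2 * ridge
```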
Although this objective is convex, it can be costly to evaluate, since we need to perform inference to
compute its gradient, as explained in ?? (this is true also for MRFs), due to the global partition function. We
should therefore use an optimizer that does not make too many calls to the objective function or its gradient,
such as the projected quasi-Newton method in [Sch+09]. In addition, we can use approximate inference, such
as loopy belief propagation (??), to compute an approximate objective and gradient more quickly, although
this is not necessarily theoretically sound.
Another approach is to apply the group lasso penalty to the pseudo-likelihood discussed in ??. This is
much faster, since inference is no longer required [HT09]. Figure 31.16 shows the result of applying this
procedure to the 20-newsgroup data, where yit indicates the presence of word t in document i, and xi = 1
(so the model is an MRF).
For a more recent approach to learning sparse discrete PGM-U structures, based on sparse full conditionals,
see the GRISE (Generalized Regularized Interaction Screening Estimator) method of [VML19], which takes
polynomial time, yet its sample complexity is close to the information-theoretic lower bounds [Lok+18].
617,675; 30,888,596. If we divide these numbers by the number of undirected graphs, which is 2^{V(V-1)/2}, we find the ratios are:
1, 1, 0.95, 0.8, 0.55, 0.29, 0.12. So we see that decomposable graphs form a vanishing fraction of the total hypothesis space.
Figure 31.16: An MRF estimated from the 20-newsgroup data using group `1 regularization with λ = 256. Isolated
nodes are not plotted. From Figure 5.9 of [Sch10a]. Used with kind permission of Mark Schmidt.
In [Mog+09], a much faster method is proposed. In particular, they modify the gradient-based methods
from Section 31.3.2.1 to find the MAP estimate; these algorithms do not need to know the cliques of the
graph. A further speedup is obtained by just using a diagonal Laplace approximation, which is more accurate
than BIC, but has essentially the same cost. This, plus the lack of restriction to decomposable graphs,
enables fairly fast stochastic search methods to be used to approximate p(G|D) and its mode. This approach
significantly outperformed graphical lasso, both in terms of predictive accuracy and structural recovery, for a
comparable computational cost.
31.4.1 Learning cause-effect pairs
If we only observe a pair of variables, we cannot use methods discussed in Section 31.2 to learn graph
structure, since such methods are based on conditional independence tests, which need at least 3 variables.
However, intuitively, we should still be able to learn causal relationships in this case. We know, for
example, that altitude X causes temperature Y and not vice versa. Suppose we measure X and
Y in two different countries, say the Netherlands (low altitude) and Switzerland (high altitude). If we
represent the joint distribution as p(X, Y ) = p(X)p(Y |X), we find that the p(Y |X) distribution is stable
across the two populations, while p(X) will change. However, if we represent the joint distribution as
p(X, Y ) = p(Y )p(X|Y ), we find that both p(Y ) and p(X|Y ) need to change across populations, so both of
the corresponding distributions will be more “complicated” to capture this non-stationarity in the data. In
this section, we discuss some approaches that exploit this idea. Our presentation is based on [PJS17]. (See
[Moo+16] for more details.)
We can equally well write this in the following form [Daw02, p165]:
where α = logit(θ) + µ_2^2 − µ_1^2 and β = µ_1 − µ_2. We can plausibly argue that the first model, which corresponds
to X → Y, is more likely to be correct, since it consists of two simple distributions that seem to be
rather generic. By contrast, in Equation (31.52), the distribution of p(Y ) is more complex, and seems to be
dependent on the specific form of p(X|Y ).
[JS10] show how to formalize this intuition using algorithmic information theory. In particular,
they say that X causes Y if the distributions PX and PY |X (not the random variables X and Y ) are
algorithmically independent. To define this, let PX (X) be the distribution induced by fx (X, UX ), where
UX is a bit string, and fX is represented by a Turing machine. Define PY |X analogously. Finally, let K(s)
be the Kolmogorov complexity of bit string s, i.e., the length of the shortest program that would generate
s using a universal Turing machine. We say that PX and PY |X are algorithmically independent if
Unfortunately, there is no algorithm to compute the Kolmogorov complexity, so this approach is purely
conceptual. In the sections below, we discuss some more practical metrics.
Figure 31.17: Signature of X causing Y . Left: If we try to predict Y from X, the residual error (noise term, shown by
vertical arrows) is independent of X. Right: If we try to predict X from Y , the residual error is not constant. From
Figure 8.8 of [Var21]. Used with kind permission of Kush Varshney.
In the case of two variables, we have Y = f (X) + U . If X and U are both Gaussian, and f is linear, the
system defines a jointly Gaussian distribution p(X, Y ), as we discussed in ??. This is symmetric, and prevents
us distinguishing X → Y from Y → X. However, if we let f be nonlinear, and/or let X or U be non-Gaussian,
we can distinguish X → Y from Y → X, as we discuss below.
Suppose pY |X is an additive noise model (possibly Gaussian noise) where f is a nonlinear function. In this
case, we will not, in general, be able to create an ANM for pX|Y . Thus we can determine whether X → Y
or vice versa as follows: we fit a (nonlinear) regression model for X → Y , and then check if the residual
error Y − fˆY (X) is independent of X; we then repeat the procedure swapping the roles of X and Y . The
theory [PJS17] says that the independence test will only pass for the causal direction. See Figure 31.17 for an
illustration.
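A minimal sketch of this procedure might look as follows. We use a gradient-boosted regressor as the nonlinear fit, and, since a proper HSIC independence test is beyond a few lines, we substitute a crude Spearman-correlation proxy; the proxy and the function names are assumptions of this sketch, not part of the method in [PJS17].

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from scipy.stats import spearmanr

def residual_dependence(x, y):
    # Fit y = f(x) + u with a flexible regressor, then measure how much the
    # residual depends on x (correlating both the residual and its magnitude)
    f = GradientBoostingRegressor().fit(x.reshape(-1, 1), y)
    r = y - f.predict(x.reshape(-1, 1))
    return abs(spearmanr(x, r)[0]) + abs(spearmanr(x, np.abs(r))[0])

def anm_direction(x, y):
    # The ANM theory says the residual is independent of the input only in
    # the causal direction, so pick the direction with less dependence.
    if residual_dependence(x, y) < residual_dependence(y, x):
        return "X->Y"
    return "Y->X"
```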
Figure 31.18: Illustration of information-geometric causal inference for Y = f (X). The density of the effect p(Y )
tends to be high in regions where f is flat (and hence f −1 is steep). From Figure 4 of [Jan+12].
large slope. Thus pY (Y ) and f −1 (Y ) will depend on each other, whereas pX (X) and f (X) do not (since we
assume the distribution of causes is independent of the causal mechanism).
More precisely, let the functions log f′ (the log of the derivative function) and p_X be viewed as random
variables on the probability space [0, 1] with a uniform distribution. We say p_{X,Y} satisfies an IGCI model if f
is a mapping as above, and the following independence criterion holds: Cov[log f′, p_X] = 0, where
\mathrm{Cov}[\log f', p_X] = \int_0^1 \log f'(x) \, p_X(x) \, dx - \int_0^1 \log f'(x) \, dx \int_0^1 p_X(x) \, dx \qquad (31.55)
and \int_0^1 p_X(x) \, dx = 1. One can show that the inverse function f^{-1} satisfies \mathrm{Cov}[\log (f^{-1})', p_Y] \ge 0, with
equality iff f is linear.
This can be turned into an empirical test as follows. Define
C_{X \to Y} = \int_0^1 \log f'(x) \, p(x) \, dx \approx \frac{1}{N-1} \sum_{j=1}^{N-1} \log \frac{|y_{j+1} - y_j|}{|x_{j+1} - x_j|} \qquad (31.56)
where x_1 < x_2 < \cdots < x_N are the observed x-values in increasing order. The quantity C_{Y \to X} is defined analogously.
We then choose X → Y as the model whenever \hat{C}_{X \to Y} < \hat{C}_{Y \to X}. This is called the slope-based approach
to IGCI.
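A sketch of the slope-based estimator (31.56) is below; the rescaling of both variables to [0, 1] reflects the uniform reference measure assumed above, and the function names are ours.

```python
import numpy as np

def igci_slope(x, y):
    # Empirical estimate of C_{X->Y} from Eq. (31.56)
    idx = np.argsort(x)
    dx = np.diff(x[idx])
    dy = np.diff(y[idx])
    ok = (dx != 0) & (dy != 0)                  # skip ties
    return np.mean(np.log(np.abs(dy[ok]) / np.abs(dx[ok])))

def igci_direction(x, y):
    # Rescale to [0, 1] to match the uniform reference measure of IGCI
    x = (x - x.min()) / (x.max() - x.min())
    y = (y - y.min()) / (y.max() - y.min())
    return "X->Y" if igci_slope(x, y) < igci_slope(y, x) else "Y->X"
```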
One can also show that an IGCI model satisfies the property that H(X) ≤ H(Y), where H() is the
differential entropy. Intuitively, the reason is that applying a nonlinear function f to p_X can introduce
additional irregularities, thus making p_Y less uniform than p_X. This is illustrated in Figure 31.18. We can
then choose between X → Y and X ← Y based on the difference in estimated entropies.
An empirical comparison of the slope-based and entropy-based approaches to IGCI can be found in
[Moo+16].
Figure 31.19: (a) A design matrix consisting of 5400 data points (rows) measuring the status (using flow cytometry) of
11 proteins (columns) under different experimental conditions. The data has been discretized into 3 states: low (black),
medium (grey) and high (white). Some proteins were explicitly controlled using activating or inhibiting chemicals. (b)
A directed graphical model representing dependencies between various proteins (blue circles) and various experimental
interventions (pink ovals), which was inferred from this data. We plot all edges for which p(Gst = 1|D) > 0.5. Dotted
edges are believed to exist in nature but were not discovered by the algorithm (1 false negative). Solid edges are true
positives. The light colored edges represent the effects of intervention. From Figure 6d of [EM07].
the data generating mechanism has been changed. For example, if \theta_{ijk} = p(X_i = j | X_{\mathrm{pa}(i)} = k) is a CPT
for node i, then when we compute the sufficient statistics N_{ijk} = \sum_n I(x_{ni} = j, x_{n,\mathrm{pa}(i)} = k), we exclude
cases n where X_i was set externally by intervention, rather than sampled from \theta_{ijk}. This technique was first
proposed in [CY99], and corresponds to Bayesian parameter inference from a set of mutilated models with
shared parameters.
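A sketch of these intervention-aware counts, with our own data layout (integer cases plus a per-case set of clamped nodes), is shown below.

```python
import numpy as np

def cpt_counts(data, i, parents, card, clamped):
    """Sufficient statistics N_ijk for node i, excluding cases where node i
    was set by intervention (following [CY99]). clamped[n] is the set of
    nodes clamped by intervention in case n."""
    q = int(np.prod([card[p] for p in parents])) if parents else 1
    N = np.zeros((q, card[i]))
    for n, x in enumerate(data):
        if i in clamped[n]:
            continue          # an intervened value says nothing about theta_i
        k = 0
        for p in parents:     # mixed-radix encoding of parent configuration
            k = k * card[p] + x[p]
        N[k, x[i]] += 1
    return N
```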
The preceding method assumes that we use perfect interventions, where we deterministically set a
variable to a chosen value. In reality, experimenters can rarely control the state of individual variables.
Instead, they can perform actions which may affect many variables at the same time. (This is sometimes
called a “fat hand intervention”, by analogy to an experiment where someone tries to change a single
component of some system (e.g., an electronic circuit), but accidentally touches multiple components, thereby
causing various side effects.) We can model this by adding the intervention nodes to the DAG (??), and
then learning a larger augmented DAG structure, with the constraint that there are no edges between the
intervention nodes, and no edges from the “regular” nodes back to the intervention nodes.
For example, suppose we perturb various proteins in a cellular signalling pathway, and measure the
resulting phosphorylation status using a technique such as flow cytometry, as in [Sac+05]. An example of
such a dataset is shown in Figure 31.19(a). Figure 31.19(b) shows the augmented DAG that was learned from
the interventional flow cytometry data depicted in Figure 31.19(a). In particular, we plot the median graph,
which includes all edges for which p(Gij = 1|D) > 0.5. These were computed using the exact algorithm of
[Koi06]. See [EM07] for details.
Since interventional data can help to uniquely identify the DAG, it is natural to try to choose the optimal
set of interventions so as to discover the graph structure with as little data as possible. This is a form of
active learning or experiment design, and is similar to what scientists do. See e.g., [Mur01; HG09; KB14;
HB14; Mue+17] for some approaches to this problem.
learning (??). For more details, see e.g., [CEP17; Sch+21].
Chapter 32
Chapter 33
Representation learning
Chapter 34
Interpretability
Part VI
Decision making
Chapter 35
Chapter 36
Reinforcement learning
Bibliography
[AD19a] H. Asi and J. C. Duchi. “Modeling simple structures and geometry for better stochastic
optimization algorithms”. In: AISTATS. 2019.
[AD19b] H. Asi and J. C. Duchi. “Stochastic (Approximate) Proximal Point Methods: Convergence,
Optimality, and Adaptivity”. In: SIAM J. Optim. (2019).
[AD19c] H. Asi and J. C. Duchi. “The importance of better models in stochastic optimization”. en. In:
PNAS 116.46 (Nov. 2019), pp. 22924–22930.
[AEM18] Ö. D. Akyildiz, V. Elvira, and J. Miguez. “The Incremental Proximal Method: A Probabilistic
Perspective”. In: ICASSP. 2018.
[AHK12] A. Anandkumar, D. Hsu, and S. Kakade. “A method of moments for mixture models and hidden
Markov models”. In: COLT. 2012.
[AKM05] A. Atay-Kayis and H. Massam. “A Monte Carlo method for computing the marginal likelihood
in nondecomposable Gaussian graphical models”. In: Biometrika 92 (2005), pp. 317–335.
[Aky+19] Ö. D. Akyildiz, É. Chouzenoux, V. Elvira, and J. Míguez. “A probabilistic incremental proximal
gradient method”. In: IEEE Signal Process. Lett. 26.8 (2019).
[AKZK19] B. Amos, V. Koltun, and J Zico Kolter. “The Limited Multi-Label Projection Layer”. In: (June
2019). arXiv: 1906.08707 [cs.LG].
[ALK06] C. Albers, M. Leisink, and H. Kappen. “The Cluster Variation Method for Efficient Linkage
Analysis on Extended Pedigrees”. In: BMC Bioinformatics 7 (2006).
[AMR09] E. S. Allman, C. Matias, and J. A. Rhodes. “Identifiability of parameters in latent structure
models with many observed variables”. en. In: Ann. Stat. 37.6A (Dec. 2009), pp. 3099–3132.
[Ana+14] A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. “Tensor Decompositions for
Learning Latent Variable Models”. In: JMLR 15 (2014), pp. 2773–2832.
[AQ+20] M. A. A. Al-Qaness, A. A. Ewees, H. Fan, and M. Abd El Aziz. “Optimization Method for
Forecasting Confirmed Cases of COVID-19 in China”. en. In: J. Clinical Medicine 9.3 (Mar.
2020).
[Arm05] H. Armstrong. “Bayesian estimation of decomposable Gaussian graphical models”. PhD thesis.
UNSW, 2005.
[Arm+08] H. Armstrong, C. Carter, K. Wong, and R. Kohn. “Bayesian Covariance Matrix Estimation
using a Mixture of Decomposable Graphical Models”. In: Statistics and Computing (2008).
[Aro+13] S. Arora et al. “A Practical Algorithm for Topic Modeling with Provable Guarantees”. In: ICML.
2013.
[Aro+16] S. Arora, R. Ge, T. Ma, and A. Risteski. “Provable learning of Noisy-or Networks”. In: (2016).
arXiv: 1612.08795 [cs.LG].
[AS19] B. G. Anderson and S. Sojoudi. “Global Optimality Guarantees for Nonconvex Unsupervised
Video Segmentation”. In: 57th Annual Allerton Conference on Communication, Control, and
Computing (2019).
[AY19] B. Amos and D. Yarats. “The Differentiable Cross-Entropy Method”. In: (Sept. 2019). arXiv:
1909.12830 [cs.LG].
[Bac+15] S. H. Bach, M. Broecheler, B. Huang, and L. Getoor. “Hinge-Loss Markov Random Fields and
Probabilistic Soft Logic”. In: (2015). arXiv: 1505.04406 [cs.LG].
[Bal17] S. Baluja. “Learning deep models of optimization landscapes”. In: IEEE Symposium Series on
Computational Intelligence (SSCI) (2017).
[BB12] J. Bergstra and Y. Bengio. “Random Search for Hyper-Parameter Optimization”. In: JMLR 13
(2012), pp. 281–305.
[BBZ17] T. Bartz-Beielstein and M. Zaefferer. “Model-based Methods for Continuous and Discrete Global
Optimization”. In: Appl. Soft Comput. 55.C (June 2017), pp. 154–167.
[BC95] S. Baluja and R. Caruana. “Removing the Genetics from the Standard Genetic Algorithm”. In:
ICML. 1995, pp. 38–46.
[BD03] S. G. Bottcher and C. Dethlefsen. “deal: A Package for Learning Bayesian Networks”. In: J. of
Statistical Software 8.20 (2003).
[BD97] S. Baluja and S. Davies. “Using Optimal Dependency-Trees for Combinatorial Optimization:
Learning the Structure of the Search Space”. In: ICML. 1997.
[Ber15] D. P. Bertsekas. “Incremental Gradient, Subgradient, and Proximal Methods for Convex Opti-
mization: A Survey”. In: (July 2015). arXiv: 1507.01030 [cs.SY].
[BF02] Y. Barash and N. Friedman. “Context-specific Bayesian clustering for gene expression data”. In:
J. Comp. Bio. 9 (2002), pp. 169–191.
[BG06] M. Beal and Z. Ghahramani. “Variational Bayesian Learning of Directed Graphical Models with
Hidden Variables”. In: Bayesian Analysis 1.4 (2006).
[BGd08] O. Banerjee, L. E. Ghaoui, and A. d’Aspremont. “Model selection through sparse maximum
likelihood estimation for multivariate Gaussian or binary data”. In: JMLR 9 (2008), pp. 485–516.
[BGHM17] J. Boyd-Graber, Y. Hu, and D. Mimno. “Applications of Topic Models”. In: Foundations and
Trends® in Information Retrieval 11.2-3 (2017), pp. 143–296.
[BHO75] P. J. Bickel, E. A. Hammel, and J. W. O’connell. “Sex bias in graduate admissions: data from
berkeley”. en. In: Science 187.4175 (Feb. 1975), pp. 398–404.
[Bis06] C. Bishop. Pattern recognition and machine learning. Springer, 2006.
[BJV97] J. S. D. Bonet, C. L. I. Jr., and P. A. Viola. “MIMIC: Finding Optima by Estimating Probability
Densities”. In: NIPS. MIT Press, 1997, pp. 424–430.
[BK04] Y. Boykov and V. Kolmogorov. “An experimental comparison of min-cut/max-flow algorithms
for energy minimization in vision”. en. In: IEEE PAMI 26.9 (Sept. 2004), pp. 1124–1137.
[BK10] R. Bardenet and B. Kegl. “Surrogating the surrogate: accelerating Gaussian-process-based global
optimization with a mixture cross-entropy algorithm”. In: ICML. 2010.
[BKR11] A. Blake, P. Kohli, and C. Rother, eds. Advances in Markov Random Fields for Vision and
Image Processing. MIT Press, 2011.
[BL06a] D. Blei and J. Lafferty. “Dynamic topic models”. In: ICML. 2006, pp. 113–120.
[BL06b] K. Bryan and T. Leise. “The $25,000,000,000 Eigenvector: The Linear Algebra behind Google”.
In: SIAM Review 48.3 (2006).
[BL07] D. Blei and J. Lafferty. “A Correlated Topic Model of "Science"”. In: Annals of Applied Stat.
1.1 (2007), pp. 17–35.
[Ble12] D. M. Blei. “Probabilistic topic models”. In: Commun. ACM 55.4 (2012), pp. 77–84.
[BNJ03] D. Blei, A. Ng, and M. Jordan. “Latent Dirichlet allocation”. In: JMLR 3 (2003), pp. 993–1022.
[Boe+05] P.-T. de Boer, D. P. Kroese, S. Mannor, and R. Y. Rubinstein. “A Tutorial on the Cross-Entropy
Method”. en. In: Ann. Oper. Res. 134.1 (Feb. 2005), pp. 19–67.
[Boh92] D. Bohning. “Multinomial logistic regression algorithm”. In: Annals of the Inst. of Statistical
Math. 44 (1992), pp. 197–200.
[Bou+17] T. Bouwmans, A. Sobral, S. Javed, S. K. Jung, and E.-H. Zahzah. “Decomposition into low-
rank plus additive matrices for background/foreground separation: A review for a comparative
evaluation with a large-scale dataset”. In: Computer Science Review 23 (Feb. 2017), pp. 1–71.
[Bro+20] D. Brookes, A. Busia, C. Fannjiang, K. Murphy, and J. Listgarten. “A view of estimation of
distribution algorithms through the lens of expectation-maximization”. In: GECCO. GECCO
’20. Cancún, Mexico: Association for Computing Machinery, July 2020, pp. 189–190.
[BT03] A. Beck and M. Teoulle. “Mirror descent and nonlinear projected subgradient methods for
convex optimization”. In: Operations Research Letters 31.3 (2003), pp. 167–175.
[BT09] A Beck and M Teboulle. “A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse
Problems”. In: SIAM J. Imaging Sci. 2.1 (Jan. 2009), pp. 183–202.
[BVZ01] Y. Boykov, O. Veksler, and R. Zabih. “Fast Approximate Energy Minimization via Graph Cuts”.
In: IEEE PAMI 23.11 (2001).
[BWL19] Y. Bai, Y.-X. Wang, and E. Liberty. “ProxQuant: Quantized Neural Networks via Proximal
Operators”. In: ICLR. 2019.
[Can+11] E. J. Candes, X. Li, Y. Ma, and J. Wright. “Robust Principal Component Analysis?” In: JACM
58.3 (June 2011), 11:1–11:37.
[CEP17] K. Chalupka, F. Eberhardt, and P. Perona. “Causal feature learning: an overview”. In: Behav-
iormetrika 44.1 (Jan. 2017), pp. 137–164.
[CGJ17] Y. Cherapanamjeri, K. Gupta, and P. Jain. “Nearly-optimal Robust Matrix Completion”. In:
ICML. 2017.
[CH92] G. Cooper and E. Herskovits. “A Bayesian method for the induction of probabilistic networks
from data”. In: Machine Learning 9 (1992), pp. 309–347.
[CH97] D. Chickering and D. Heckerman. “Efficient approximations for the marginal likelihood of
incomplete data given a Bayesian network”. In: Machine Learning 29 (1997), pp. 181–212.
[Chi02] D. M. Chickering. “Optimal structure identification with greedy search”. In: Journal of Machine
Learning Research 3 (2002), pp. 507–554.
[Chi96] D. Chickering. “Learning Bayesian networks is NP-Complete”. In: AI/Stats V. 1996.
[CHM97] D. M. Chickering, D. Heckerman, and C. Meek. “A Bayesian Approach to Learning Bayesian
Networks with Local Structure”. In: UAI. UAI’97. San Francisco, CA, USA, 1997, pp. 80–89.
[Cho+11] M. Choi, V. Tan, A. Anandkumar, and A. Willsky. “Learning Latent Tree Graphical Models”.
In: JMLR (2011).
[CL68] C. K. Chow and C. N. Liu. “Approximating discrete probability distributions with dependence
trees”. In: IEEE Trans. on Info. Theory 14 (1968), pp. 462–67.
[CLR90] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. An Introduction to Algorithms. MIT Press,
1990.
[CS96] P. Cheeseman and J. Stutz. “Bayesian Classification (AutoClass): Theory and Results”. In:
Advances in Knowledge Discovery and Data Mining. Ed. by Fayyad, Pratetsky-Shapiro, Smyth,
and Uthurasamy. MIT Press, 1996.
[CSF16] A. W. Churchill, S. Sigtia, and C. Fernando. “Learning to Generate Genotypes with Neural
Networks”. In: (Apr. 2016). arXiv: 1604.04153 [cs.NE].
[CY99] G. Cooper and C. Yoo. “Causal Discovery from a Mixture of Experimental and Observational
Data”. In: UAI. 1999.
[Dan+10] P. Daniusis et al. “Inferring deterministic causal relations”. In: UAI. 2010.
[Daw02] A. P. Dawid. “Influence diagrams for causal modelling and inference”. In: Intl. Stat. Review 70
(2002). Corrections p437, pp. 161–189.
[DDDM04] I Daubechies, M Defrise, and C De Mol. “An iterative thresholding algorithm for linear inverse
problems with a sparsity constraint”. In: Commun. Pure Appl. Math. Advances in E 57.11 (Nov.
2004), pp. 1413–1457.
[DDL97] S. DellaPietra, V. DellaPietra, and J. Lafferty. “Inducing features of random fields”. In: IEEE
PAMI 19.4 (1997).
[Dem72] A. Dempster. “Covariance selection”. In: Biometrics 28.1 (1972).
[DGK08] J. Duchi, S. Gould, and D. Koller. “Projected Subgradient Methods for Learning Sparse
Gaussians”. In: UAI. 2008.
[DGR03] P. Dellaportas, P. Giudici, and G. Roberts. “Bayesian inference for nondecomposable graphical
Gaussian models”. In: Sankhya, Ser. A 65 (2003), pp. 43–55.
[DHS11] J. Duchi, E. Hazan, and Y. Singer. “Adaptive Subgradient Methods for Online Learning and
Stochastic Optimization”. In: JMLR 12 (2011), pp. 2121–2159.
[Die+17] A. B. Dieng, C. Wang, J. Gao, and J. Paisley. “TopicRNN: A Recurrent Neural Network with
Long-Range Semantic Dependency”. In: ICLR. 2017.
[DL09] P. Domingos and D. Lowd. Markov Logic: An Interface Layer for AI. Morgan & Claypool, 2009.
[DL93] A. P. Dawid and S. L. Lauritzen. “Hyper-Markov laws in the statistical analysis of decomposable
graphical models”. In: The Annals of Statistics 3 (1993), pp. 1272–1317.
[Dob09] A. Dobra. Dependency networks for genome-wide data. Tech. rep. U. Washington, 2009.
[Dom+06] P. Domingos, S. Kok, H. Poon, M. Richardson, and P. Singla. “Unifying Logical and Statistical
AI”. In: IJCAI. 2006.
[Don95] D. L. Donoho. “De-noising by soft-thresholding”. In: IEEE Trans. Inf. Theory 41.3 (May 1995),
pp. 613–627.
[DRB19] A. B. Dieng, F. J. R. Ruiz, and D. M. Blei. “The Dynamic Embedded Topic Model”. In: (July
2019). arXiv: 1907.05545 [cs.CL].
[Dur+98] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic
Models of Proteins and Nucleic Acids. Cambridge University Press, 1998.
[DVR08] J. Dahl, L. Vandenberghe, and V. Roychowdhury. “Covariance selection for non-chordal graphs
via chordal embedding”. In: Optimization Methods and Software 23.4 (2008), pp. 501–502.
[EAL10] D. Edwards, G. de Abreu, and R. Labouriau. “Selecting high-dimensional mixed graphical
models using minimal AIC or BIC forests”. In: BMC Bioinformatics 11.18 (2010).
[Ebe17] F. Eberhardt. “Introduction to the Foundations of Causal Discovery”. In: International Journal
of Data Science and Analytics 3.2 (2017), pp. 81–91.
[Eke+13] M. Ekeberg, C. Lövkvist, Y. Lan, M. Weigt, and E. Aurell. “Improved contact prediction in
proteins: using pseudolikelihoods to infer Potts models”. en. In: Phys. Rev. E Stat. Nonlin. Soft
Matter Phys. 87.1 (Jan. 2013), p. 012707.
[Eks+18] C. Eksombatchai et al. “Pixie: A System for Recommending 3+ Billion Items to 200+ Million
Users in Real-Time”. In: WWW. 2018.
[Eli+00] G. Elidan, N. Lotner, N. Friedman, and D. Koller. “Discovering Hidden Variables: A Structure-
Based Approach”. In: NIPS. 2000.
[EM07] D. Eaton and K. Murphy. “Exact Bayesian structure learning from uncertain interventions”. In:
AI/Statistics. 2007.
[Eva+18] R. Evans et al. “De novo structure prediction with deep-learning based scoring”. In: (2018).
[EW08] B. Ellis and W. H. Wong. “Learning Causal Bayesian Network Structures From Experimental
Data”. In: JASA 103.482 (2008), pp. 778–789.
[Fat18] S. Fattahi. “Exact Guarantees on the Absence of Spurious Local Minima for Non-negative
Robust Principal Component Analysis”. In: JMLR (2018).
[FG02] M. Fishelson and D. Geiger. “Exact genetic linkage computations for general pedigrees”. In:
BMC Bioinformatics 18 (2002).
[FGL00] N. Friedman, D. Geiger, and N. Lotner. “Likelihood computation with value abstraction”. In:
UAI. 2000.
[FHT08] J. Friedman, T. Hastie, and R. Tibshirani. “Sparse inverse covariance estimation the graphical
lasso”. In: Biostatistics 9.3 (2008), pp. 432–441.
[FK03] N. Friedman and D. Koller. “Being Bayesian about Network Structure: A Bayesian Approach to
Structure Discovery in Bayesian Networks”. In: Machine Learning 50 (2003), pp. 95–126.
[Fri+02] N. Friedman, M. Ninion, I. Pe’er, and T. Pupko. “A Structural EM Algorithm for Phylogenetic
Inference”. In: J. Comp. Bio. 9 (2002), pp. 331–353.
[Fri97] N. Friedman. “Learning Bayesian Networks in the Presence of Missing Values and Hidden
Variables”. In: UAI. 1997.
[FS18] S. Fattahi and S. Sojoudi. “Graphical Lasso and Thresholding: Equivalence and Closed-form
Solutions”. In: JMLR (2018).
[FZS18] S. Fattahi, R. Y. Zhang, and S. Sojoudi. “Linear-Time Algorithm for Learning Large-Scale
Sparse Graphical Models”. In: IEEE Access (2018).
[GG99] P. Giudici and P. Green. “Decomposable graphical Gaussian model determination”. In: Biometrika
86.4 (1999), pp. 785–801.
[GGS84] H. Gabow, Z. Galil, and T. Spencer. “Efficient implementation of graph algorithms using
contraction”. In: FOCS. 1984.
[GH94] D. Geiger and D. Heckerman. “Learning Gaussian Networks”. In: UAI. Vol. 10. 1994, pp. 235–243.
[GH97] D. Geiger and D. Heckerman. “A characterization of Dirchlet distributions through local and
global independence”. In: Annals of Statistics 25 (1997), pp. 1344–1368.
[GJ07] A. Globerson and T. Jaakkola. “Approximate inference using planar graph decomposition”. In:
AISTATS. 2007.
[GL97] F. Glover and M. Laguna. Tabu Search. Kluwer Academic Publishers, 1997.
[GM04] A. Goldenberg and A. Moore. “Tractable Learning of Large Bayes Net Structures from Sparse
Data”. In: ICML. 2004.
[Gol89] D. E. Goldberg. Genetic Algorithms in Search, Optimization and Machine Learning. 1st. Boston,
MA, USA: Addison-Wesley Longman Publishing Co., Inc., 1989.
[Goo74] L. A. Goodman. “Exploratory latent structure analysis using both identifiable and unidentifiable
models”. In: Biometrika 61.2 (1974), pp. 215–231.
[Gop98] A. Gopnik. “Explanation as Orgasm”. In: Minds and Machines 8.1 (1998), pp. 101–118.
[GPS89] D. Greig, B. Porteous, and A. Seheult. “Exact maximum a posteriori estimation for binary
images”. In: J. of Royal Stat. Soc. Series B 51.2 (1989), pp. 271–279.
[GR01] A. Gelman and T. Raghunathan. “Using conditional distributions for missing-data imputation”.
In: Statistical Science (2001).
[Gri+04] T. Griffiths, M. Steyvers, D. Blei, and J. Tenenbaum. “Integrating Topics and Syntax”. In: NIPS.
2004.
[GS04] T. Griffiths and M. Steyvers. “Finding scientific topics”. In: PNAS 101 (2004), pp. 5228–5235.
[GS15] F. Glover and K. Sorensen. “Metaheuristics”. In: Scholarpedia J. 10.4 (2015), p. 6532.
[GSM18] U. Garciarena, R. Santana, and A. Mendiburu. “Expanding Variational Autoencoders for
Learning and Exploiting Latent Representations in Search Distributions”. In: Proc. of the Conf.
on Genetic and Evolutionary Computation. 2018, pp. 849–856.
[GT06] T. Griffiths and J. Tenenbaum. “Optimal predictions in everyday cognition”. In: Psychological
Science 17.9 (2006), pp. 767–773.
[GT09] T. Griffiths and J. Tenenbaum. “Theory-Based Causal Induction”. In: Psychological Review
116.4 (2009), pp. 661–716.
[Guo+21] R. Guo, L. Cheng, J. Li, P Richard Hahn, and H. Liu. “A Survey of Learning Causality with
Data: Problems and Methods”. In: ACM Computing Surveys 53.4 (2021).
[GZS19] C. Glymour, K. Zhang, and P. Spirtes. “Review of Causal Discovery Methods Based on Graphical
Models”. en. In: Front. Genet. 10 (June 2019), p. 524.
[Han16] N. Hansen. “The CMA Evolution Strategy: A Tutorial”. In: (Apr. 2016). arXiv: 1604.00772
[cs.LG].
[Hau+11] M. Hauschild, M. Pelikan, M. Hauschild, and M. Pelikan. “An introduction and survey of
estimation of distribution algorithms”. In: Swarm and Evolutionary Computation. 2011.
[HB14] A. Hauser and P. Bühlmann. “Two optimal strategies for active learning of causal models from
interventional data”. In: Int. J. Approx. Reason. 55.4 (June 2014), pp. 926–939.
[HBB10] M. Hoffman, D. Blei, and F. Bach. “Online learning for latent Dirichlet allocation”. In: NIPS.
2010.
[HDMM18] C. Heinze-Deml, M. H. Maathuis, and N. Meinshausen. “Causal Structure Learning”. In: Annu.
Rev. Stat. Appl. 5.1 (Mar. 2018), pp. 371–391.
[Hec+00] D. Heckerman, D. Chickering, C. Meek, R. Rounthwaite, and C. Kadie. “Dependency Networks
for Density Estimation, Collaborative Filtering, and Data Visualization”. In: JMLR 1 (2000),
pp. 49–75.
[HG09] Y.-B. He and Z. Geng. “Active learning of causal networks with intervention experiments and
optimal designs”. In: JMLR 10 (2009), pp. 2523–2547.
[HGC95] D. Heckerman, D. Geiger, and M. Chickering. “Learning Bayesian networks: the combination of
knowledge and statistical data”. In: Machine Learning 20.3 (1995), pp. 197–243.
[HKP91] J. Hertz, A. Krogh, and R. G. Palmer. An Introduction to the Theory of Neural Comptuation.
Addison-Wesley, 1991.
[HKZ12] D. Hsu, S. Kakade, and T. Zhang. “A spectral algorithm for learning hidden Markov models”.
In: J. of Computer and System Sciences 78.5 (2012), pp. 1460–1480.
[HMC97] D. Heckerman, C. Meek, and G. Cooper. A Bayesian approach to Causal Discovery. Tech. rep.
MSR-TR-97-05. Microsoft Research, 1997.
[Hoe+99] J. Hoeting, D. Madigan, A. Raftery, and C. Volinsky. “Bayesian Model Averaging: A Tutorial”.
In: Statistical Science 4.4 (1999).
[Hof99] T. Hofmann. “Probabilistic latent semantic indexing”. In: Research and Development in Infor-
mation Retrieval (1999), pp. 50–57.
[Hol92] J. H. Holland. Adaptation in Natural and Artificial Systems. https://mitpress.mit.edu/
books/adaptation-natural-and-artificial-systems. Accessed: 2017-11-26. Apr. 1992.
[Hop82] J. J. Hopfield. “Neural networks and physical systems with emergent collective computational
abilities”. In: PNAS 79.8 (1982), 2554–2558.
[Hoy+09] P. O. Hoyer, D. Janzing, J. M. Mooij, J. Peters, and P. B. Schölkopf. “Nonlinear causal discovery
with additive noise models”. In: NIPS. 2009, pp. 689–696.
[HS05] H. Hoos and T. Stutzle. Stochastic local search: Foundations and applications. Morgan Kauffman,
2005.
[HS08] T. Hazan and A. Shashua. “Convergent message-passing algorithms for inference over general
graphs with convex free energy”. In: UAI. 2008.
[HSDK12] C. Hillar, J. Sohl-Dickstein, and K. Koepsell. Efficient and Optimal Binary Hopfield Associative
Memory Storage Using Minimum Probability Flow. Tech. rep. Apr. 2012. arXiv: 1204.2916.
[HT08] H. Hara and A. Takimura. “A Localization Approach to Improve Iterative Proportional Scaling
in Gaussian Graphical Models”. In: Communications in Statistics - Theory and Method (2008).
to appear.
[HT09] H. Hoefling and R. Tibshirani. “Estimation of Sparse Binary Pairwise Markov Networks using
Pseudo-likelihoods”. In: JMLR 10 (2009).
[HTF09] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. 2nd edition.
Springer, 2009.
[Hu+12] J. Hu, Y. Wang, E. Zhou, M. C. Fu, and S. I. Marcus. “A Survey of Some Model-Based Methods
for Global Optimization”. en. In: Optimization, Control, and Applications of Stochastic Systems.
Systems & Control: Foundations & Applications. Birkhäuser, Boston, 2012, pp. 157–179.
[HW11] S. Harmeling and C. K. I. Williams. “Greedy Learning of Binary Latent Trees”. In: IEEE PAMI
33.6 (2011), pp. 1087–1097.
[IM17] J. Ingraham and D. Marks. “Bayesian Sparsity for Intractable Undirected Models”. In: ICML.
2017.
[Jan+12] D. Janzing et al. “Information-geometric approach to inferring causal directions”. In: AIJ 182
(2012), pp. 1–31.
[Jay03] E. T. Jaynes. Probability theory: the logic of science. Cambridge university press, 2003.
[JBB09] D. Jian, A. Barthels, and M. Beetz. “Adaptive Markov logic networks: Learning statistical
relational models with dynamic parameters”. In: 9th European Conf. on AI. 2009, 937–942.
[JHS13] Y. Jernite, Y. Halpern, and D. Sontag. “Discovering hidden variables in noisy-or networks using
quartet tests”. In: NIPS. 2013.
[Jia+13] Y. Jia, J. T. Abbott, J. L. Austerweil, T. Griffiths, and T. Darrell. “Visual Concept Learning:
Combining Machine Vision and Bayesian Generalization on Concept Hierarchies”. In: NIPS.
2013.
[Jin11] Y. Jin. “Surrogate-assisted evolutionary computation: Recent advances and future challenges”.
In: Swarm and Evolutionary Computation 1.2 (June 2011), pp. 61–70.
[JJ00] T. S. Jaakkola and M. I. Jordan. “Bayesian parameter estimation via variational methods”. In:
Statistics and Computing 10 (2000), pp. 25–37.
[JJ96] T. Jaakkola and M. Jordan. “A variational approach to Bayesian logistic regression problems
and their extensions”. In: AISTATS. 1996.
[JJ99] T. Jaakkola and M. Jordan. “Variational probabilistic inference and the QMR-DT network”. In:
JAIR 10 (1999), pp. 291–322.
[Jon+05] B. Jones, A. Dobra, C. Carvalho, C. Hans, C. Carter, and M. West. “Experiments in stochastic
computation for high-dimensional graphical models”. In: Statistical Science 20 (2005), pp. 388–
400.
[JS10] D. Janzing and B. Scholkopf. “Causal inference using the algorithmic Markov condition”. In:
IEEE Trans. on Information Theory 56.10 (2010), pp. 5168–5194.
[JZB19] A. Jaber, J. Zhang, and E. Bareinboim. “Identification of Conditional Causal Effects under
Markov Equivalence”. In: NIPS. 2019, pp. 11512–11520.
[KB07] M. Kalisch and P. Buhlmann. “Estimating high dimensional directed acyclic graphs with the
PC algorithm”. In: JMLR 8 (2007), pp. 613–636.
[KB14] M. Kalisch and P. Bühlmann. “Causal Structure Learning and Inference: A Selective Review”.
In: Qual. Technol. Quant. Manag. 11.1 (Jan. 2014), pp. 3–21.
[KF09] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT
Press, 2009.
[KNH11] K. B. Korb, E. P. Nyberg, and L. Hope. “A new causal power theory”. In: Causality in the
Sciences. Oxford University Press, 2011.
[KNP11] K. Kersting, S. Natarajan, and D. Poole. Statistical Relational AI: Logic, Probability and
Computation. Tech. rep. UBC, 2011.
[Koi06] M. Koivisto. “Advances in exact Bayesian structure discovery in Bayesian networks”. In: UAI.
2006.
[Kol06] V. Kolmogorov. “Convergent Tree-reweighted Message Passing for Energy Minimization”. In:
IEEE PAMI 28.10 (2006), pp. 1568–1583.
[Koz92] J. R. Koza. Genetic Programming. https://mitpress.mit.edu/books/genetic-programming.
Accessed: 2017-11-26. Dec. 1992.
[KS04] M. Koivisto and K. Sood. “Exact Bayesian structure discovery in Bayesian networks”. In: JMLR
5 (2004), pp. 549–573.
[Lac+20] S. Lachapelle, P. Brouillard, T. Deleu, and S. Lacoste-Julien. “Gradient-Based Neural DAG
Learning”. In: ICLR. 2020.
[LD08] A. Lenkoski and A. Dobra. Bayesian structural learning and estimation in Gaussian graphical
models. Tech. rep. 545. Department of Statistics, University of Washington, 2008.
[Lev11] S. Levy. In The Plex: How Google Thinks, Works, and Shapes Our Lives. Simon & Schuster,
2011.
[LFJ04] M. Law, M. Figueiredo, and A. Jain. “Simultaneous Feature Selection and Clustering Using
Mixture Models”. In: IEEE PAMI 26.4 (2004).
[LGK06] S.-I. Lee, V. Ganapathi, and D. Koller. “Efficient Structure Learning of Markov Networks using
L1-Regularization”. In: NIPS. 2006.
[LHF17] R. M. Levy, A. Haldane, and W. F. Flynn. “Potts Hamiltonian models of protein co-variation,
free energy landscapes, and evolutionary fitness”. en. In: Curr. Opin. Struct. Biol. 43 (Apr.
2017), pp. 55–62.
[Li+14] A. Q. Li, A. Ahmed, S. Ravi, and A. J. Smola. “Reducing the sampling complexity of topic
models”. In: KDD. ACM, 2014, pp. 891–900.
[Li+18] C. Li, H. Farkhoor, R. Liu, and J. Yosinski. “Measuring the Intrinsic Dimension of Objective
Landscapes”. In: ICLR. 2018.
[Li+19] X. Li, L. Vilnis, D. Zhang, M. Boratko, and A. McCallum. “Smoothing the Geometry of
Probabilistic Box Embeddings”. In: ICLR. 2019.
[LL02] P. Larrañaga and J. A. Lozano. Estimation of Distribution Algorithms: A New Tool for Evolu-
tionary Computation. Norwell, MA, USA: Kluwer Academic Publishers, 2002.
[LL15] H. Li and Z. Lin. “Accelerated Proximal Gradient Methods for Nonconvex Programming”. In:
NIPS. 2015, pp. 379–387.
[LL18] Z. Li and J. Li. “A Simple Proximal Stochastic Gradient Method for Nonsmooth Nonconvex
Optimization”. In: (Feb. 2018). arXiv: 1802.04477 [math.OC].
[LM06] A. Langville and C. Meyer. “Updating Markov chains with an eye on Google’s PageRank”. In:
SIAM J. on Matrix Analysis and Applications 27.4 (2006), pp. 968–987.
[LMW17] L. Landrieu, C. Mallet, and M. Weinmann. “Comparison of belief propagation and graph-cut
approaches for contextual classification of 3D lidar point cloud data”. In: IEEE International
Geoscience and Remote Sensing Symposium (IGARSS). July 2017, pp. 2768–2771.
[Lof15] P. Lofgren. “Efficient Algorithms for Personalized PageRank”. PhD thesis. Stanford, 2015.
[Lok+18] A. Y. Lokhov, M. Vuffray, S. Misra, and M. Chertkov. “Optimal structure and parameter
learning of Ising models”. en. In: Science Advances 4.3 (Mar. 2018), e1700791.
[Luk13] S. Luke. Essentials of Metaheuristics. 2013.
[MB06] N. Meinshausen and P. Bühlmann. “High dimensional graphs and variable selection with the
lasso”. In: The Annals of Statistics 34 (2006), pp. 1436–1462.
[MB16] Y. Miao and P. Blunsom. “Language as a Latent Variable: Discrete Generative Models for
Sentence Compression”. In: EMNLP. 2016.
[MC03] P. Moscato and C. Cotta. “A Gentle Introduction to Memetic Algorithms”. en. In: Handbook of
Metaheuristics. International Series in Operations Research & Management Science. Springer,
Boston, MA, 2003, pp. 105–144.
[McE20] R. McElreath. Statistical Rethinking: A Bayesian Course with Examples in R and Stan (2nd
edition). en. Chapman and Hall/CRC, 2020.
[McK+04] B. D. McKay, F. E. Oggier, G. F. Royle, N. J. A. Sloane, I. M. Wanless, and H. S. Wilf.
“Acyclic digraphs and eigenvalues of (0,1)-matrices”. In: J. Integer Sequences 7.04.3.3 (2004).
[Mei05] N. Meinshausen. A note on the Lasso for Gaussian graphical model selection. Tech. rep. ETH
Seminar für Statistik, 2005.
[MGR18] H. Mania, A. Guy, and B. Recht. “Simple random search of static linear policies is competitive
for reinforcement learning”. In: NIPS. Ed. by S. Bengio, H. Wallach, H. Larochelle, K. Grauman,
N. Cesa-Bianchi, and R. Garnett. Curran Associates, Inc., 2018, pp. 1800–1809.
[MH12] R. Mazumder and T. Hastie. The Graphical Lasso: New Insights and Alternatives. Tech. rep.
Stanford Dept. Statistics, 2012.
[MH97] C. Meek and D. Heckerman. “Structure and Parameter Learning for Causal Independence and
Causal Interaction Models”. In: UAI. 1997, pp. 366–375.
[Min01] T. Minka. Statistical Approaches to Learning and Discovery 10-602: Homework assignment 2,
question 5. Tech. rep. CMU, 2001.
[Mit97] T. Mitchell. Machine Learning. McGraw Hill, 1997.
[MJ00] M. Meila and M. I. Jordan. “Learning with mixtures of trees”. In: JMLR 1 (2000), pp. 1–48.
[MJ06] M. Meila and T. Jaakkola. “Tractable Bayesian learning of tree belief networks”. In: Statistics
and Computing 16 (2006), pp. 77–92.
[MKM11] B. Marlin, E. Khan, and K. Murphy. “Piecewise Bounds for Estimating Bernoulli-Logistic Latent
Gaussian Models”. In: ICML. 2011.
[MM01] T. K. Marks and J. R. Movellan. Diffusion networks, products of experts, and factor analysis.
Tech. rep. University of California San Diego, 2001.
[Mog+09] B. Moghaddam, B. Marlin, E. Khan, and K. Murphy. “Accelerating Bayesian Structural Inference
for Non-Decomposable Gaussian Graphical Models”. In: NIPS. 2009.
[Mol+18] D. Moldovan, V. Chifu, C. Pop, T. Cioara, I. Anghel, and I. Salomie. “Chicken Swarm Opti-
mization and Deep Learning for Manufacturing Processes”. In: Networking in Education and
Research (RoEduNet) conference. Cluj-Napoca: IEEE, Sept. 2018, pp. 1–6.
[Moo+16] J. M. Mooij, J. Peters, D. Janzing, J. Zscheischler, and B. Schölkopf. “Distinguishing Cause
from Effect Using Observational Data: Methods and Benchmarks”. In: JMLR 17.1 (Jan. 2016),
pp. 1103–1204.
[Mor+11] F. Morcos et al. “Direct-coupling analysis of residue coevolution captures native contacts across
many protein families”. en. In: Proc. Natl. Acad. Sci. U. S. A. 108.49 (Dec. 2011), E1293–301.
[MR94] D. Madigan and A. Raftery. “Model selection and accounting for model uncertainty in graphical
models using Occam’s window”. In: JASA 89 (1994), pp. 1535–1546.
[Mue+17] J. Mueller, D. N. Reshef, G. Du, and T. Jaakkola. “Learning Optimal Interventions”. In:
AISTATS. 2017.
[Mur01] K. Murphy. Active Learning of Causal Bayes Net Structure. Tech. rep. Comp. Sci. Div., UC
Berkeley, 2001.
[Mur22] K. P. Murphy. Probabilistic Machine Learning: An introduction. MIT Press, 2022.
[Nea92] R. Neal. “Connectionist learning of belief networks”. In: Artificial Intelligence 56 (1992), pp. 71–
113.
[Nit14] A. Nitanda. “Stochastic Proximal Gradient Descent with Acceleration Techniques”. In: NIPS.
2014, pp. 1574–1582.
[NY83] A. Nemirovski and D. Yudin. Problem Complexity and Method Efficiency in Optimization.
Wiley, 1983.
[Oll+17] Y. Ollivier, L. Arnold, A. Auger, and N. Hansen. “Information-Geometric Optimization Algo-
rithms: A Unifying Picture via Invariance Principles”. In: JMLR 18 (2017), pp. 1–65.
[PB+14] N. Parikh, S. Boyd, et al. “Proximal algorithms”. In: Foundations and Trends in Optimization
1.3 (2014), pp. 127–239.
[Pe’05] D. Pe’er. “Bayesian network analysis of signaling networks: a primer”. In: Science STKE 281
(2005), pl4.
[PE08] J.-P. Pellet and A. Elisseeff. “Using Markov blankets for causal structure learning”. In: JMLR 9
(2008), pp. 1295–1342.
[Pea09] J. Pearl. Causality: Models, Reasoning and Inference (Second Edition). Cambridge Univ. Press,
2009.
[Pel05] M. Pelikan. Hierarchical Bayesian Optimization Algorithm: Toward a New Generation of
Evolutionary Algorithms. en. Softcover reprint of hardcover 1st ed. 2005 edition. Springer, 2005.
[PGCP00] M. Pelikan, D. E. Goldberg, and E. Cantú-Paz. “Linkage problem, distribution estimation, and
Bayesian networks”. en. In: Evol. Comput. 8.3 (2000), pp. 311–340.
[PHL12] M. Pelikan, M. Hauschild, and F. Lobo. Introduction to estimation of distribution algorithms.
Tech. rep. U. Missouri, 2012.
[PJS17] J. Peters, D. Janzing, and B. Schölkopf. Elements of Causal Inference: Foundations and Learning
Algorithms (Adaptive Computation and Machine Learning series). The MIT Press, Nov. 2017.
[PK11] P. Parviainen and M. Koivisto. “Ancestor Relations in the Presence of Unobserved Variables”.
In: ECML. 2011.
[PN18] A. Patrascu and I. Necoara. “Nonasymptotic convergence of stochastic proximal point methods
for constrained convex optimization”. In: JMLR 18.198 (2018), pp. 1–42.
[Poo+12] D. Poole, D. Buchman, S. Natarajan, and K. Kersting. “Aggregation and Population Growth:
The Relational Logistic Regression and Markov Logic Cases”. In: Statistical Relational AI
workshop. 2012.
[PRG17] M. Probst, F. Rothlauf, and J. Grahl. “Scalability of using Restricted Boltzmann Machines for
combinatorial optimization”. In: Eur. J. Oper. Res. 256.2 (Jan. 2017), pp. 368–383.
[Pri12] S. Prince. Computer Vision: Models, Learning and Inference. Cambridge, 2012.
[PSCP06] M. Pelikan, K. Sastry, and E. Cantú-Paz. Scalable Optimization via Probabilistic Modeling:
From Algorithms to Applications (Studies in Computational Intelligence). Secaucus, NJ, USA:
Springer-Verlag New York, Inc., 2006.
[PSW15] N. G. Polson, J. G. Scott, and B. T. Willard. “Proximal Algorithms in Statistics and Machine
Learning”. en. In: Stat. Sci. 30.4 (Nov. 2015), pp. 559–581.
[RD06] M. Richardson and P. Domingos. “Markov logic networks”. In: Machine Learning 62 (2006),
pp. 107–136.
[Rea+19] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le. “Regularized Evolution for Image Classifier
Architecture Search”. In: AAAI. 2019.
[Red+16] S. J. Reddi, S. Sra, B. Póczos, and A. J. Smola. “Proximal Stochastic Methods for Nonsmooth
Nonconvex Finite-sum Optimization”. In: NIPS. 2016, pp. 1153–1161.
[RH10] M. Ranzato and G. Hinton. “Modeling pixel means and covariances using factored third-order
Boltzmann machines”. In: CVPR. 2010.
[RK04] R. Rubinstein and D. Kroese. The Cross-Entropy Method: A Unified Approach to Combinatorial
Optimization, Monte-Carlo Simulation, and Machine Learning. Springer-Verlag, 2004.
[RM15] G. Raskutti and S. Mukherjee. “The information geometry of mirror descent”. In: IEEE Trans.
Info. Theory 61.3 (2015), pp. 1451–1457.
[RN10] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. 3rd edition. Prentice Hall,
2010.
[Rob73] R. W. Robinson. “Counting labeled acyclic digraphs”. In: New Directions in the Theory of
Graphs. Ed. by F. Harary. Academic Press, 1973, pp. 239–273.
[Roc06] S. Roch. “A short proof that phylogenetic tree reconstruction by maximum likelihood is hard”.
In: IEEE/ACM Trans. Comp. Bio. Bioinformatics 3.1 (2006).
[Rov02] A. Roverato. “Hyper inverse Wishart distribution for non-decomposable graphs and its applica-
tion to Bayesian inference for Gaussian graphical models”. In: Scand. J. Statistics 29 (2002),
pp. 391–411.
[RU10] A. Rajaraman and J. Ullman. Mining of massive datasets. Self-published, 2010.
[Rub97] R. Y. Rubinstein. “Optimization of computer simulation models with rare events”. In: Eur. J.
Oper. Res. 99.1 (May 1997), pp. 89–112.
[Sac+05] K. Sachs, O. Perez, D. Pe’er, D. Lauffenburger, and G. Nolan. “Causal Protein-Signaling
Networks Derived from Multiparameter Single-Cell Data”. In: Science 308 (2005).
[Sal+17] T. Salimans, J. Ho, X. Chen, and I. Sutskever. “Evolution Strategies as a Scalable Alternative
to Reinforcement Learning”. In: (Mar. 2017). arXiv: 1703.03864 [stat.ML].
[San17] R. Santana. “Gray-box optimization and factorized distribution algorithms: where two worlds
collide”. In: (July 2017). arXiv: 1707.03093 [cs.NE].
[SC08] J. G. Scott and C. M. Carvalho. “Feature-inclusion Stochastic Search for Gaussian Graphical
Models”. In: J. of Computational and Graphical Statistics 17.4 (2008), pp. 790–808.
[Sch+08] M. Schmidt, K. Murphy, G. Fung, and R. Rosales. “Structure Learning in Random Fields for
Heart Motion Abnormality Detection”. In: CVPR. 2008.
[Sch+09] M. Schmidt, E. van den Berg, M. Friedlander, and K. Murphy. “Optimizing Costly Functions
with Simple Constraints: A Limited-Memory Projected Quasi-Newton Algorithm”. In: AI &
Statistics. 2009.
[Sch10a] M. Schmidt. “Graphical model structure learning with L1 regularization”. PhD thesis. UBC,
2010.
[Sch10b] N. Schraudolph. “Polynomial-Time Exact Inference in NP-Hard Binary MRFs via Reweighted
Perfect Matching”. In: AISTATS. 2010.
[Sch+17] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. “Proximal Policy Optimization
Algorithms”. In: (July 2017). arXiv: 1707.06347 [cs.LG].
[Sch+21] B. Schölkopf et al. “Toward Causal Representation Learning”. In: Proc. IEEE 109.5 (May 2021),
pp. 612–634.
[Seg11] D. Segal. “The dirty little secrets of search”. In: New York Times (2011).
[Sen+08] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Gallagher, and T. Eliassi-Rad. “Collective Classifi-
cation in Network Data”. en. In: AI Magazine 29.3 (Sept. 2008), pp. 93–106.
[SF08] E. Sudderth and W. Freeman. “Signal and Image Processing with Belief Propagation”. In: IEEE
Signal Processing Magazine (2008).
[SF19] A. Shekhovtsov and B. Flach. “Feed-forward Propagation in Probabilistic Neural Networks with
Categorical and Max Layers”. In: ICLR. 2019.
[SG07] M. Steyvers and T. Griffiths. “Probabilistic topic models”. In: Latent Semantic Analysis: A Road
to Meaning. Ed. by T. Landauer, D McNamara, S. Dennis, and W. Kintsch. Laurence Erlbaum,
2007.
[SG09] R. Silva and Z. Ghahramani. “The Hidden Life of Latent Variables: Bayesian Learning with
Mixed Graph Models”. In: JMLR 10 (2009), pp. 1187–1238.
[SG91] P. Spirtes and C. Glymour. “An algorithm for fast recovery of sparse causal graphs”. In: Social
Science Computer Review 9 (1991), pp. 62–72.
[SGS00] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. 2nd edition. MIT
Press, 2000.
[SH03] A. Siepel and D. Haussler. “Combining phylogenetic and hidden Markov models in biosequence
analysis”. In: Proc. 7th Intl. Conf. on Computational Molecular Biology (RECOMB). 2003.
[SH06] T. Singliar and M. Hauskrecht. “Noisy-OR Component Analysis and its Application to Link
Analysis”. In: JMLR 7 (2006).
[SH10] R. Salakhutdinov and G. Hinton. “Replicated Softmax: an Undirected Topic Model”. In: NIPS.
2010.
[Shi+06] S. Shimizu, P. O. Hoyer, A. Hyvärinen, and A. Kerminen. “A Linear Non-Gaussian Acyclic
Model for Causal Discovery”. In: JMLR 7.Oct (2006), pp. 2003–2030.
[Shw+91] M. Shwe et al. “Probabilistic Diagnosis Using a Reformulation of the INTERNIST-1/QMR
Knowledge Base”. In: Methods. Inf. Med 30.4 (1991), pp. 241–255.
[SK86] T. Speed and H. Kiiveri. “Gaussian Markov distributions over finite graphs”. In: Annals of
Statistics 14.1 (1986), pp. 138–150.
[SKM07] T. Silander, P. Kontkanen, and P. Myllymaki. “On Sensitivity of the MAP Bayesian Network
Structure to the Equivalent Sample Size Parameter”. In: UAI. 2007, pp. 360–367.
[SLM92] B. Selman, H. Levesque, and D. Mitchell. “A New Method for Solving Hard Satisfiability
Problems”. In: Proceedings of the Tenth National Conference on Artificial Intelligence. AAAI’92.
San Jose, California: AAAI Press, 1992, pp. 440–446.
[SM06] T. Silander and P. Myllymaki. “A simple approach for finding the globally optimal Bayesian
network structure”. In: UAI. 2006.
[SM09] M. Schmidt and K. Murphy. “Modeling Discrete Interventional Data using Directed Cyclic
Graphical Models”. In: UAI. 2009.
[SMH07] R. R. Salakhutdinov, A. Mnih, and G. E. Hinton. “Restricted Boltzmann machines for collabo-
rative filtering”. In: ICML. Vol. 24. 2007, pp. 791–798.
[SNMM07] M. Schmidt, A. Niculescu-Mizil, and K. Murphy. “Learning Graphical Model Structure using
L1-Regularization Paths”. In: AAAI. 2007.
[Sol19] L. Solus. “Interventional Markov Equivalence for Mixed Graph Models”. In: (Nov. 2019). arXiv:
1911.10114 [math.ST].
[Sör15] K. Sörensen. “Metaheuristics—the metaphor exposed”. In: Intl. Trans. in Op. Res. 22.1 (Jan.
2015), pp. 3–18.
[SP18] R. D. Shah and J. Peters. “The Hardness of Conditional Independence Testing and the Gener-
alised Covariance Measure”. In: Ann. Stat. (2018).
[SS02] D. Scharstein and R. Szeliski. “A taxonomy and evaluation of dense two-frame stereo correspon-
dence algorithms”. In: Intl. J. Computer Vision 47.1 (2002), pp. 7–42.
[SS17] A. Srivastava and C. Sutton. “Autoencoding Variational Inference For Topic Models”. In: ICLR.
2017.
[Sta+19] K. O. Stanley, J. Clune, J. Lehman, and R. Miikkulainen. “Designing neural networks through
neuroevolution”. In: Nature Machine Intelligence 1.1 (2019).
[SW11] R. Sedgewick and K. Wayne. Algorithms. Addison-Wesley, 2011.
[SWW08] E. Sudderth, M. Wainwright, and A. Willsky. “Loop series and Bethe variational bounds for
attractive graphical models”. In: NIPS. 2008.
[Sze+08] R. Szeliski et al. “A Comparative Study of Energy Minimization Methods for Markov Random
Fields with Smoothness-Based Priors”. In: IEEE PAMI 30.6 (2008), pp. 1068–1080.
[Sze10] R. Szeliski. Computer Vision: Algorithms and Applications. Springer, 2010.
[Tad15] M. Taddy. “Distributed multinomial regression”. en. In: Annals of Applied Statistics 9.3 (Sept.
2015), pp. 1394–1414.
[Teh+06] Y.-W. Teh, M. Jordan, M. Beal, and D. Blei. “Hierarchical Dirichlet processes”. In: JASA 101.476
(2006), pp. 1566–1581.
[Ten+11] J. Tenenbaum, C. Kemp, T. Griffiths, and N. Goodman. “How to Grow a Mind: Statistics,
Structure, and Abstraction”. In: Science 331.6022 (2011), pp. 1279–1285.
[Ten99] J. Tenenbaum. “A Bayesian framework for concept learning”. PhD thesis. MIT, 1999.
[TF03] M. Tappen and B. Freeman. “Comparison of graph cuts with belief propagation for stereo, using
identical MRF parameters”. In: ICCV. Oct. 2003, pp. 900–906, vol. 2.
[TG09] A. Thomas and P. Green. “Enumerating the Decomposable Neighbours of a Decomposable
Graph Under a Simple Perturbation Scheme”. In: Comp. Statistics and Data Analysis 53 (2009),
pp. 1232–1238.
[Thi+98] B. Thiesson, C. Meek, D. Chickering, and D. Heckerman. “Learning Mixtures of DAG models”.
In: UAI. 1998.
[Tse08] P. Tseng. On accelerated proximal gradient methods for convex-concave optimization. Unpublished
manuscript. 2008.
[Tu+19] R. Tu, K. Zhang, B. C. Bertilson, H. Kjellström, and C. Zhang. “Neuropathic Pain Diagnosis
Simulator for Causal Discovery Algorithm Evaluation”. In: NIPS. 2019.
[Var21] K. R. Varshney. Trustworthy Machine Learning. 2021.
[VML19] M. Vuffray, S. Misra, and A. Y. Lokhov. “Efficient Learning of Discrete Graphical Models”. In:
(Feb. 2019). arXiv: 1902.00600 [cs.LG].
[VP90] T. Verma and J. Pearl. “Equivalence and synthesis of causal models”. In: UAI. 1990.
[Wal+09] H. Wallach, I. Murray, R. Salakhutdinov, and D. Mimno. “Evaluation Methods for Topic Models”.
In: ICML. 2009.
[WCK03] F. Wong, C. Carter, and R. Kohn. “Efficient estimation of covariance selection models”. In:
Biometrika 90.4 (2003), pp. 809–830.
[Wer07] T. Werner. “A linear programming approach to the max-sum problem: A review”. In: IEEE
PAMI 29.7 (2007), pp. 1165–1179.
[WHT19] Y. Wang, H. He, and X. Tan. “Truly Proximal Policy Optimization”. In: UAI. 2019.
[Wie+14] D. Wierstra, T. Schaul, J. Peters, and J. Schmidhuber. “Natural Evolution Strategies”. In: JMLR
15.1 (2014), pp. 949–980.
[WJ08] M. J. Wainwright and M. I. Jordan. “Graphical models, exponential families, and variational
inference”. In: Foundations and Trends in Machine Learning 1–2 (2008), pp. 1–305.
[WJW05a] M. Wainwright, T. Jaakkola, and A. Willsky. “A new class of upper bounds on the log partition
function”. In: IEEE Trans. Info. Theory 51.7 (2005), pp. 2313–2335.
[WJW05b] M. Wainwright, T. Jaakkola, and A. Willsky. “MAP estimation via agreement on trees: message-
passing and linear programming”. In: IEEE Trans. Info. Theory 51.11 (2005), pp. 3697–3717.
[WP18] H. Wang and H. Poon. “Deep Probabilistic Logic: A Unifying Framework for Indirect Supervision”.
In: EMNLP. 2018.
[WRL06] M. Wainwright, P. Ravikumar, and J. Lafferty. “Inferring Graphical Model Structure using
ℓ1-Regularized Pseudo-Likelihood”. In: NIPS. 2006.
[WSD19] S. Wu, S. Sanghavi, and A. G. Dimakis. “Sparse Logistic Regression Learns All Discrete Pairwise
Graphical Models”. In: NIPS. 2019.
[XAH19] Z. Xu, T. Ajanthan, and R. Hartley. “Fast and Differentiable Message Passing for Stereo Vision”.
In: (Oct. 2019). arXiv: 1910.10892 [cs.CV].
[XT07] F. Xu and J. Tenenbaum. “Word learning as Bayesian inference”. In: Psychological Review 114.2
(2007).
[Xu18] J. Xu. “Distance-based Protein Folding Powered by Deep Learning”. In: (Nov. 2018). arXiv:
1811.03481 [q-bio.BM].
[Yam+12] K. Yamaguchi, T. Hazan, D. McAllester, and R. Urtasun. “Continuous Markov Random Fields
for Robust Stereo Estimation”. In: (Apr. 2012). arXiv: 1204.1393 [cs.CV].
[Yao+20] Q. Yao, J. Xu, W.-W. Tu, and Z. Zhu. “Efficient Neural Architecture Search via Proximal
Iterations”. In: AAAI. 2020.
[YL07] M. Yuan and Y. Lin. “Model Selection and Estimation in the Gaussian Graphical Model”. In:
Biometrika 94.1 (2007), pp. 19–35.
[Yu+19] Y. Yu, J. Chen, T. Gao, and M. Yu. “DAG-GNN: DAG Structure Learning with Graph Neural
Networks”. In: ICML. 2019.
[Zha04] N. Zhang. “Hierarchical latent class models for cluster analysis”. In: JMLR (2004), pp. 301–308.
[Zha+20] Y. Zhang et al. “Efficient Probabilistic Logic Reasoning with Graph Neural Networks”. In: ICLR.
2020.
[Zhe+18] X. Zheng, B. Aragam, P. Ravikumar, and E. P. Xing. “DAGs with NO TEARS: Smooth
Optimization for Structure Learning”. In: NIPS. 2018.
[ZK14] W. Zhong and J. Kwok. “Fast Stochastic Alternating Direction Method of Multipliers”. In:
ICML. Ed. by E. P. Xing and T. Jebara. Vol. 32. Proceedings of Machine Learning Research.
Beijing, China: PMLR, 2014, pp. 46–54.
[ZWM97] S. C. Zhu, Y. N. Wu, and D. Mumford. “Minimax Entropy Principle and Its Application to
Texture Modeling”. In: Neural Computation 9.8 (Nov. 1997).