
Supplementary Material for

“Probabilistic Machine Learning: Advanced Topics”

Kevin Murphy

February 27, 2022


Contents

1 Introduction

I Fundamentals

2 Probability
2.1 More fun with Gaussians
2.1.1 Deriving the conditionals of an MVN
2.1.2 Deriving Bayes rule for linear Gaussian systems
2.1.3 Sensor fusion with unknown measurement noise
2.2 Google's PageRank algorithm
2.2.1 Retrieving relevant pages using inverted indices
2.2.2 The PageRank score
2.2.3 Efficiently computing the PageRank vector
2.2.4 Web spam
2.2.5 Personalized PageRank

3 Statistics
3.1 Bayesian concept learning
3.1.1 Learning a discrete concept: the number game
3.1.2 Learning a continuous concept: the healthy levels game
3.2 Informative priors
3.2.1 Domain specific priors
3.2.2 Gaussian prior
3.2.3 Power-law prior
3.2.4 Erlang prior

4 Graphical models
4.1 More examples of DGMs
4.1.1 The QMR network
4.1.2 Genetic linkage analysis
4.2 More examples of UGMs
4.2.1 Hopfield networks
4.2.2 Restricted Boltzmann machines (RBMs) in more detail
4.2.3 Feature induction for a maxent spelling model
4.2.4 Relational UGMs
4.2.5 Markov logic networks

5 Information theory

6 Optimization
6.1 Proximal methods
6.1.1 Proximal operators
6.1.2 Computing proximal operators
6.1.3 Proximal point methods (PPM)
6.1.4 Mirror descent
6.1.5 Proximal gradient method
6.1.6 Alternating direction method of multipliers (ADMM)
6.2 Local search
6.2.1 Stochastic local search
6.2.2 Tabu search
6.2.3 Random search
6.3 Population-based optimization
6.3.1 Evolutionary algorithms
6.3.2 Metaheuristic algorithms
6.3.3 Estimation of distribution algorithms
6.3.4 Cross-entropy method
6.3.5 Natural evolutionary strategies
6.4 Dynamic programming
6.4.1 Example: computing Fibonacci numbers
6.4.2 ML examples
6.5 Conjugate duality
6.5.1 Introduction
6.5.2 Example: exponential function
6.5.3 Conjugate of a conjugate
6.5.4 Bounds for the logistic (sigmoid) function

II Inference

7 Inference algorithms: an overview

8 State-space inference

9 Message passing inference
9.1 MAP estimation for discrete PGMs
9.1.1 Notation
9.1.2 The marginal polytope
9.1.3 Linear programming relaxation
9.1.4 Graphcuts

10 Variational inference
10.1 Exact and approximate inference for PGMs
10.1.1 Exact inference as VI
10.1.2 Mean field VI
10.1.3 Loopy belief propagation as VI
10.1.4 Convex belief propagation
10.1.5 Tree-reweighted belief propagation
10.1.6 Other tractable versions of convex BP

11 Monte Carlo Inference

12 Markov Chain Monte Carlo (MCMC) inference

13 Sequential Monte Carlo (SMC) inference

III Prediction

14 Discriminative models: an overview

15 Generalized linear models
15.1 Variational inference for logistic regression
15.1.1 Binary logistic regression
15.1.2 Multinomial logistic regression
15.2 Converting multinomial logistic regression to Poisson regression
15.3 Case study: is Berkeley admissions biased against women?
15.3.1 Binomial logistic regression
15.3.2 Beta-binomial logistic regression
15.3.3 Poisson regression
15.3.4 GLMM (hierarchical Bayes) regression

16 Deep neural networks

17 Gaussian processes

18 Structured prediction

19 Beyond the iid assumption

IV Generation

20 Generative models: an overview

21 Variational autoencoders

22 Auto-regressive models

23 Normalizing flows

24 Energy-based models

25 Denoising diffusion models

26 Generative adversarial networks

V Discovery

27 Discovery methods: an overview

28 Latent variable models
28.1 Topic models
28.1.1 Latent Dirichlet Allocation (LDA)
28.1.2 Correlated topic model
28.1.3 Dynamic topic model
28.1.4 LDA-HMM
28.1.5 Collapsed Gibbs sampling for LDA
28.1.6 Variational inference for LDA

29 Hidden Markov models

30 State-space models

31 Graph learning
31.1 Learning tree structures
31.1.1 Directed or undirected tree?
31.1.2 Chow-Liu algorithm
31.1.3 Finding the MAP forest
31.1.4 Mixtures of trees
31.2 Learning DAG structures
31.2.1 Faithfulness
31.2.2 Markov equivalence
31.2.3 Bayesian model selection: statistical foundations
31.2.4 Bayesian model selection: algorithms
31.2.5 Constraint-based approach
31.2.6 Methods based on sparse optimization
31.2.7 Consistent estimators
31.2.8 Handling latent variables
31.3 Learning undirected graph structures
31.3.1 Dependency networks
31.3.2 Graphical lasso for GGMs
31.3.3 Graphical lasso for discrete MRFs/CRFs
31.3.4 Bayesian inference for undirected graph structures
31.4 Learning causal DAGs
31.4.1 Learning cause-effect pairs
31.4.2 Learning causal DAGs from interventional data
31.4.3 Learning from low-level inputs

32 Non-parametric Bayesian models

33 Representation learning

34 Interpretability

VI Decision making

35 Multi-step decision problems

36 Reinforcement learning

Chapter 1

Introduction

Part I

Fundamentals

Chapter 2

Probability

2.1 More fun with Gaussians


2.1.1 Deriving the conditionals of an MVN
Consider a joint Gaussian of the form $p(x_1, x_2) = \mathcal{N}(x|\mu, \Sigma)$, where

$$\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \quad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \tag{2.1}$$

In ??, we claimed that

$$p(x_1|x_2) = \mathcal{N}(x_1 \mid \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2),\; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}) \tag{2.2}$$

In this section, we derive this result using Schur complements.


Let us factor the joint $p(x_1, x_2)$ as $p(x_2)p(x_1|x_2)$ as follows:

$$p(x_1, x_2) \propto \exp\left\{ -\frac{1}{2}\begin{pmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{pmatrix}^T \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}^{-1} \begin{pmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{pmatrix} \right\} \tag{2.3}$$

Using ?? the above exponent becomes

\begin{align}
p(x_1, x_2) &\propto \exp\left\{ -\frac{1}{2}\begin{pmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{pmatrix}^T \begin{pmatrix} I & 0 \\ -\Sigma_{22}^{-1}\Sigma_{21} & I \end{pmatrix} \begin{pmatrix} (\Sigma/\Sigma_{22})^{-1} & 0 \\ 0 & \Sigma_{22}^{-1} \end{pmatrix}\right. \tag{2.4}\\
&\qquad\qquad \left.\times \begin{pmatrix} I & -\Sigma_{12}\Sigma_{22}^{-1} \\ 0 & I \end{pmatrix} \begin{pmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{pmatrix} \right\} \tag{2.5}\\
&= \exp\left\{ -\frac{1}{2}\left(x_1 - \mu_1 - \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2)\right)^T (\Sigma/\Sigma_{22})^{-1} \left(x_1 - \mu_1 - \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2)\right) \right\} \tag{2.6}\\
&\quad \times \exp\left\{ -\frac{1}{2}(x_2 - \mu_2)^T \Sigma_{22}^{-1}(x_2 - \mu_2) \right\} \tag{2.7}
\end{align}

This is of the form

$$\exp(\text{quadratic form in } x_1, x_2) \times \exp(\text{quadratic form in } x_2) \tag{2.8}$$

Hence we have successfully factorized the joint as

\begin{align}
p(x_1, x_2) &= p(x_1|x_2)\,p(x_2) \tag{2.9}\\
&= \mathcal{N}(x_1|\mu_{1|2}, \Sigma_{1|2})\,\mathcal{N}(x_2|\mu_2, \Sigma_{22}) \tag{2.10}
\end{align}

where the parameters of the conditional distribution can be read off from the above equations using

\begin{align}
\mu_{1|2} &= \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2) \tag{2.11}\\
\Sigma_{1|2} &= \Sigma/\Sigma_{22} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \tag{2.12}
\end{align}

We can also use the fact that $|M| = |M/H|\,|H|$ to check that the normalization constants are correct:

\begin{align}
(2\pi)^{(d_1+d_2)/2}|\Sigma|^{1/2} &= (2\pi)^{(d_1+d_2)/2}\left(|\Sigma/\Sigma_{22}|\,|\Sigma_{22}|\right)^{1/2} \tag{2.13}\\
&= (2\pi)^{d_1/2}|\Sigma/\Sigma_{22}|^{1/2}\,(2\pi)^{d_2/2}|\Sigma_{22}|^{1/2} \tag{2.14}
\end{align}

where $d_1 = \dim(x_1)$ and $d_2 = \dim(x_2)$.


We see that the equations for marginalization of an MVN are easy, but the equations for conditioning are complex. We can also write the MVN in information form, using the parameterization

$$\Lambda \triangleq \Sigma^{-1}, \quad \eta \triangleq \Sigma^{-1}\mu \tag{2.15}$$

Here $\Lambda$ is the precision matrix, and $\eta$ is the precision-weighted mean. In this form, one can show that the marginals and conditionals are given by

\begin{align}
p(x_2) &= \mathcal{N}_c(x_2 \mid \eta_2 - \Lambda_{21}\Lambda_{11}^{-1}\eta_1,\; \Lambda_{22} - \Lambda_{21}\Lambda_{11}^{-1}\Lambda_{12}) \tag{2.16}\\
p(x_1|x_2) &= \mathcal{N}_c(x_1 \mid \eta_1 - \Lambda_{12}x_2,\; \Lambda_{11}) \tag{2.17}
\end{align}

To show this, let us partition $\eta$ as follows:

$$\begin{pmatrix} \eta_1 \\ \eta_2 \end{pmatrix} = \begin{pmatrix} \Lambda_{11} & \Lambda_{12} \\ \Lambda_{21} & \Lambda_{22} \end{pmatrix}\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} \tag{2.18}$$

so

\begin{align}
\eta_1 &= \Lambda_{11}\mu_1 + \Lambda_{12}\mu_2 \tag{2.19}\\
\eta_2 &= \Lambda_{21}\mu_1 + \Lambda_{22}\mu_2 \tag{2.20}
\end{align}

Hence

\begin{align}
\begin{pmatrix} \Lambda_{11} & \Lambda_{12} \\ \Lambda_{21} & \Lambda_{22} \end{pmatrix} &= \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}^{-1} \tag{2.21}\\
&= \begin{pmatrix} (\Sigma/\Sigma_{22})^{-1} & -(\Sigma/\Sigma_{22})^{-1}\Sigma_{12}\Sigma_{22}^{-1} \\ -\Sigma_{22}^{-1}\Sigma_{21}(\Sigma/\Sigma_{22})^{-1} & \Sigma_{22}^{-1} + \Sigma_{22}^{-1}\Sigma_{21}(\Sigma/\Sigma_{22})^{-1}\Sigma_{12}\Sigma_{22}^{-1} \end{pmatrix} \tag{2.22}
\end{align}

where the top left block is

$$\Lambda_{11} = (\Sigma/\Sigma_{22})^{-1} = (\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21})^{-1} \tag{2.23}$$
Using the moment form for conditioning we have

\begin{align}
\Lambda_{1|2} &= \Sigma_{1|2}^{-1} = (\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21})^{-1} = \Lambda_{11} \tag{2.24}\\
\mu_{1|2} &= \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2) \tag{2.25}
\end{align}

but from Equations 2.22 and 2.23 we have

$$\Lambda_{12} = -\Lambda_{11}\Sigma_{12}\Sigma_{22}^{-1} \tag{2.26}$$

so

\begin{align}
\mu_{1|2} &= \mu_1 - \Lambda_{11}^{-1}\Lambda_{12}(x_2 - \mu_2) \tag{2.27}\\
\eta_{1|2} &= \Lambda_{1|2}\mu_{1|2} = \Lambda_{11}\mu_1 - \Lambda_{12}(x_2 - \mu_2) \tag{2.28}\\
&= \Lambda_{11}\mu_1 + \Lambda_{12}\mu_2 - \Lambda_{12}x_2 = \eta_1 - \Lambda_{12}x_2 \tag{2.29}
\end{align}

We will now derive the results for marginalizing in information form. Let

\begin{align}
\begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} &= \begin{pmatrix} \Lambda_{11} & \Lambda_{12} \\ \Lambda_{21} & \Lambda_{22} \end{pmatrix}^{-1} \tag{2.30}\\
&= \begin{pmatrix} \Lambda_{11}^{-1} + \Lambda_{11}^{-1}\Lambda_{12}(\Lambda/\Lambda_{11})^{-1}\Lambda_{21}\Lambda_{11}^{-1} & -\Lambda_{11}^{-1}\Lambda_{12}(\Lambda/\Lambda_{11})^{-1} \\ -(\Lambda/\Lambda_{11})^{-1}\Lambda_{21}\Lambda_{11}^{-1} & (\Lambda/\Lambda_{11})^{-1} \end{pmatrix} \tag{2.31}
\end{align}

Hence

$$\Lambda_{22}^m = \Sigma_{22}^{-1} = \Lambda/\Lambda_{11} = \Lambda_{22} - \Lambda_{21}\Lambda_{11}^{-1}\Lambda_{12} \tag{2.32}$$

and

\begin{align}
\eta_2^m &= \Lambda_{22}^m\mu_2 = (\Lambda_{22} - \Lambda_{21}\Lambda_{11}^{-1}\Lambda_{12})\mu_2 \tag{2.33}\\
&= \Lambda_{22}\mu_2 - \Lambda_{21}\Lambda_{11}^{-1}\Lambda_{12}\mu_2 \tag{2.34}\\
&= (\Lambda_{21}\mu_1 + \Lambda_{22}\mu_2) - \Lambda_{21}\Lambda_{11}^{-1}(\Lambda_{11}\mu_1 + \Lambda_{12}\mu_2) \tag{2.35}\\
&= \eta_2 - \Lambda_{21}\Lambda_{11}^{-1}\eta_1 \tag{2.36}
\end{align}
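To make the algebra concrete, here is a small numerical check (a sketch added for this cleanup, not part of the book's code): we build a random joint Gaussian, condition on $x_2$ in moment form (Equations 2.11–2.12), and confirm that the information-form conditional (Equation 2.17) gives the same answer.

```python
# Numerical sanity check of the MVN conditioning formulas above.
import numpy as np

rng = np.random.default_rng(0)
d1, d2 = 2, 3
A = rng.standard_normal((d1 + d2, d1 + d2))
Sigma = A @ A.T + (d1 + d2) * np.eye(d1 + d2)   # random SPD covariance
mu = rng.standard_normal(d1 + d2)
mu1, mu2 = mu[:d1], mu[d1:]
S11, S12 = Sigma[:d1, :d1], Sigma[:d1, d1:]
S21, S22 = Sigma[d1:, :d1], Sigma[d1:, d1:]
x2 = rng.standard_normal(d2)                    # value we condition on

# Moment form: Eqs. (2.11)-(2.12).
mu_cond = mu1 + S12 @ np.linalg.solve(S22, x2 - mu2)
Sigma_cond = S11 - S12 @ np.linalg.solve(S22, S21)

# Information form: Lambda_{1|2} = Lambda_11, eta_{1|2} = eta_1 - Lambda_12 x2 (Eq. 2.17).
Lam = np.linalg.inv(Sigma)
eta = Lam @ mu
mu_cond_info = np.linalg.solve(Lam[:d1, :d1], eta[:d1] - Lam[:d1, d1:] @ x2)
Sigma_cond_info = np.linalg.inv(Lam[:d1, :d1])

assert np.allclose(mu_cond, mu_cond_info)
assert np.allclose(Sigma_cond, Sigma_cond_info)
```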

2.1.2 Deriving Bayes rule for linear Gaussian systems


We now derive Bayes rule for Gaussians. The basic idea is to derive the joint distribution, $p(z, y) = p(z)p(y|z)$, and then to use the results for conditioning Gaussians to compute $p(z|y)$.
In more detail, we proceed as follows. The log of the joint distribution is as follows (dropping irrelevant constants):

$$\log p(z, y) = -\frac{1}{2}(z - \mu_z)^T\Sigma_z^{-1}(z - \mu_z) - \frac{1}{2}(y - Wz - b)^T\Sigma_y^{-1}(y - Wz - b) \tag{2.37}$$
This is clearly a joint Gaussian distribution, since it is the exponential of a quadratic form.
Expanding out the quadratic terms involving $z$ and $y$, and ignoring linear and constant terms, we have

\begin{align}
Q &= -\frac{1}{2}z^T\Sigma_z^{-1}z - \frac{1}{2}y^T\Sigma_y^{-1}y - \frac{1}{2}(Wz)^T\Sigma_y^{-1}(Wz) + y^T\Sigma_y^{-1}Wz \tag{2.38}\\
&= -\frac{1}{2}\begin{pmatrix} z \\ y \end{pmatrix}^T \begin{pmatrix} \Sigma_z^{-1} + W^T\Sigma_y^{-1}W & -W^T\Sigma_y^{-1} \\ -\Sigma_y^{-1}W & \Sigma_y^{-1} \end{pmatrix} \begin{pmatrix} z \\ y \end{pmatrix} \tag{2.39}\\
&= -\frac{1}{2}\begin{pmatrix} z \\ y \end{pmatrix}^T \Sigma^{-1} \begin{pmatrix} z \\ y \end{pmatrix} \tag{2.40}
\end{align}

where the precision matrix of the joint is defined as

$$\Sigma^{-1} = \begin{pmatrix} \Sigma_z^{-1} + W^T\Sigma_y^{-1}W & -W^T\Sigma_y^{-1} \\ -\Sigma_y^{-1}W & \Sigma_y^{-1} \end{pmatrix} \triangleq \Lambda = \begin{pmatrix} \Lambda_{xx} & \Lambda_{xy} \\ \Lambda_{yx} & \Lambda_{yy} \end{pmatrix} \tag{2.41}$$

From Equation ??, and using the fact that $\mu_y = W\mu_z + b$, we have

\begin{align}
p(z|y) &= \mathcal{N}(\mu_{z|y}, \Sigma_{z|y}) \tag{2.42}\\
\Sigma_{z|y} &= \Lambda_{xx}^{-1} = (\Sigma_z^{-1} + W^T\Sigma_y^{-1}W)^{-1} \tag{2.43}\\
\mu_{z|y} &= \Sigma_{z|y}\left(\Lambda_{xx}\mu_z - \Lambda_{xy}(y - \mu_y)\right) \tag{2.44}\\
&= \Sigma_{z|y}\left(\Sigma_z^{-1}\mu_z + W^T\Sigma_y^{-1}W\mu_z + W^T\Sigma_y^{-1}(y - \mu_y)\right) \tag{2.45}\\
&= \Sigma_{z|y}\left(\Sigma_z^{-1}\mu_z + W^T\Sigma_y^{-1}(W\mu_z + y - \mu_y)\right) \tag{2.46}\\
&= \Sigma_{z|y}\left(\Sigma_z^{-1}\mu_z + W^T\Sigma_y^{-1}(y - b)\right) \tag{2.47}
\end{align}
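The following sketch (our own illustrative code, with variable names chosen to match the equations) implements Equations (2.43) and (2.47) directly:

```python
# Bayes rule for a linear Gaussian system: prior z ~ N(mu_z, Sigma_z),
# likelihood y | z ~ N(Wz + b, Sigma_y). A minimal sketch, not the book's code.
import numpy as np

def gauss_posterior(mu_z, Sigma_z, W, b, Sigma_y, y):
    """Return the mean and covariance of p(z | y)."""
    Sy_inv = np.linalg.inv(Sigma_y)
    Sigma_post = np.linalg.inv(np.linalg.inv(Sigma_z) + W.T @ Sy_inv @ W)      # Eq. (2.43)
    mu_post = Sigma_post @ (np.linalg.solve(Sigma_z, mu_z) + W.T @ Sy_inv @ (y - b))  # Eq. (2.47)
    return mu_post, Sigma_post

# Example: a noisy 1d observation of a 1d latent shrinks the prior toward y.
mu, Sigma = gauss_posterior(
    mu_z=np.array([0.0]), Sigma_z=np.array([[1.0]]),
    W=np.array([[1.0]]), b=np.array([0.0]),
    Sigma_y=np.array([[0.5]]), y=np.array([1.0]))
print(mu, Sigma)   # [0.667], [[0.333]]
```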

2.1.3 Sensor fusion with unknown measurement noise
In this section, we extend the sensor fusion results from ?? to the case where the precision of each measurement
device is unknown. This turns out to yield a potentially multi-modal posterior, as we will see, which is quite
different from the Gaussian case. Our presentation is based on [Min01].
For simplicity, we assume the latent quantity is scalar, $z \in \mathbb{R}$, and that we just have two measurement devices, $x$ and $y$. However, we allow these to have different precisions, so the data generating mechanism has the form $x_n|z \sim \mathcal{N}(z, \lambda_x^{-1})$ and $y_n|z \sim \mathcal{N}(z, \lambda_y^{-1})$. We will use a non-informative prior for $z$, $p(z) \propto 1$, which we can emulate using an infinitely broad Gaussian, $p(z) = \mathcal{N}(z|m_0 = 0, \lambda_0^{-1} = \infty)$. So the unknown parameters are the two measurement precisions, $\theta = (\lambda_x, \lambda_y)$.
Suppose we make 2 independent measurements with each device, which turn out to be

$$x_1 = 1.1,\; x_2 = 1.9,\; y_1 = 2.9,\; y_2 = 4.1 \tag{2.48}$$

If the parameters $\theta$ were known, then the posterior would be Gaussian:

\begin{align}
p(z|\mathcal{D}, \lambda_x, \lambda_y) &= \mathcal{N}(z|m_N, \lambda_N^{-1}) \tag{2.49}\\
\lambda_N &= \lambda_0 + N_x\lambda_x + N_y\lambda_y \tag{2.50}\\
m_N &= \frac{\lambda_x N_x\overline{x} + \lambda_y N_y\overline{y}}{N_x\lambda_x + N_y\lambda_y} \tag{2.51}
\end{align}

where $N_x = 2$ is the number of $x$ measurements, $N_y = 2$ is the number of $y$ measurements, $\overline{x} = \frac{1}{N_x}\sum_{n=1}^{N_x} x_n = 1.5$, and $\overline{y} = \frac{1}{N_y}\sum_{n=1}^{N_y} y_n = 3.5$. This result follows because the posterior precision is the sum of the measurement precisions, and the posterior mean is a weighted sum of the prior mean (which is 0) and the data means.
However, the measurement precisions are not known. A simple solution is to estimate them by maximum likelihood. The log-likelihood is given by

$$\ell(z, \lambda_x, \lambda_y) = \frac{N_x}{2}\log\lambda_x - \frac{\lambda_x}{2}\sum_n (x_n - z)^2 + \frac{N_y}{2}\log\lambda_y - \frac{\lambda_y}{2}\sum_n (y_n - z)^2 \tag{2.52}$$

The MLE is obtained by solving the following simultaneous equations:

\begin{align}
\frac{\partial\ell}{\partial z} &= \lambda_x N_x(\overline{x} - z) + \lambda_y N_y(\overline{y} - z) = 0 \tag{2.53}\\
\frac{\partial\ell}{\partial\lambda_x} &= \frac{1}{\lambda_x} - \frac{1}{N_x}\sum_{n=1}^{N_x}(x_n - z)^2 = 0 \tag{2.54}\\
\frac{\partial\ell}{\partial\lambda_y} &= \frac{1}{\lambda_y} - \frac{1}{N_y}\sum_{n=1}^{N_y}(y_n - z)^2 = 0 \tag{2.55}
\end{align}

This gives

\begin{align}
\hat{z} &= \frac{N_x\hat{\lambda}_x\overline{x} + N_y\hat{\lambda}_y\overline{y}}{N_x\hat{\lambda}_x + N_y\hat{\lambda}_y} \tag{2.56}\\
1/\hat{\lambda}_x &= \frac{1}{N_x}\sum_n (x_n - \hat{z})^2 \tag{2.57}\\
1/\hat{\lambda}_y &= \frac{1}{N_y}\sum_n (y_n - \hat{z})^2 \tag{2.58}
\end{align}

We notice that the MLE for z has the same form as the posterior mean, mN .

We can solve these equations by fixed point iteration. Let us initialize by estimating $\lambda_x = 1/s_x^2$ and $\lambda_y = 1/s_y^2$, where $s_x^2 = \frac{1}{N_x}\sum_{n=1}^{N_x}(x_n - \overline{x})^2 = 0.16$ and $s_y^2 = \frac{1}{N_y}\sum_{n=1}^{N_y}(y_n - \overline{y})^2 = 0.36$. Using this, we get $\hat{z} = 2.1154$, so $p(z|\mathcal{D}, \hat{\lambda}_x, \hat{\lambda}_y) = \mathcal{N}(z|2.1154, 0.0554)$. If we now iterate, we converge to $\hat{\lambda}_x = 1/0.1662$, $\hat{\lambda}_y = 1/4.0509$, $p(z|\mathcal{D}, \hat{\lambda}_x, \hat{\lambda}_y) = \mathcal{N}(z|1.5788, 0.0798)$.
The plug-in approximation to the posterior is plotted in Figure 2.1(a). This weights each sensor according to its estimated precision. Since sensor $y$ was estimated to be much less reliable than sensor $x$, we have $\mathbb{E}[z|\mathcal{D}, \hat{\lambda}_x, \hat{\lambda}_y] \approx \overline{x}$, so we effectively ignore the $y$ sensor.
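The fixed point iteration is a direct transcription of Equations (2.56)–(2.58); the following minimal sketch (ours, not the book's script) reproduces the numbers quoted above:

```python
# Fixed-point iteration for the MLE, using the data from Eq. (2.48).
import numpy as np

x = np.array([1.1, 1.9])
y = np.array([2.9, 4.1])
lam_x, lam_y = 1 / x.var(), 1 / y.var()     # initialize with 1/s^2

for _ in range(100):
    Nx, Ny = len(x), len(y)
    z = (Nx * lam_x * x.mean() + Ny * lam_y * y.mean()) / (Nx * lam_x + Ny * lam_y)  # Eq. (2.56)
    lam_x = 1 / np.mean((x - z) ** 2)       # Eq. (2.57)
    lam_y = 1 / np.mean((y - z) ** 2)       # Eq. (2.58)

print(z, 1 / lam_x, 1 / lam_y)              # ~1.5788, ~0.1662, ~4.0509
```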
Now we will adopt a Bayesian approach and integrate out the unknown precisions, following ??. That is, we compute

$$p(z|\mathcal{D}) \propto p(z)\left[\int p(\mathcal{D}_x|z, \lambda_x)p(\lambda_x|z)d\lambda_x\right]\left[\int p(\mathcal{D}_y|z, \lambda_y)p(\lambda_y|z)d\lambda_y\right] \tag{2.59}$$

We will use uninformative Jeffreys priors (??): $p(z) \propto 1$, $p(\lambda_x|z) \propto 1/\lambda_x$ and $p(\lambda_y|z) \propto 1/\lambda_y$. Since the $x$ and $y$ terms are symmetric, we will just focus on one of them. The key integral is

\begin{align}
I &= \int p(\mathcal{D}_x|z, \lambda_x)p(\lambda_x|z)d\lambda_x \tag{2.60}\\
&\propto \int \lambda_x^{-1}\lambda_x^{N_x/2}\exp\left(-\frac{N_x}{2}\lambda_x(\overline{x} - z)^2 - \frac{N_x}{2}s_x^2\lambda_x\right)d\lambda_x \tag{2.61}
\end{align}

Exploiting the fact that $N_x = 2$, this simplifies to

$$I = \int \lambda_x^{-1}\lambda_x^{1}\exp\left(-\lambda_x[(\overline{x} - z)^2 + s_x^2]\right)d\lambda_x \tag{2.62}$$

We recognize this as proportional to the integral of an unnormalized Gamma density

$$\mathrm{Ga}(\lambda|a, b) \propto \lambda^{a-1}e^{-\lambda b} \tag{2.63}$$

where $a = 1$ and $b = (\overline{x} - z)^2 + s_x^2$. Hence the integral is proportional to the normalizing constant of the Gamma distribution, $\Gamma(a)b^{-a}$, so we get

$$I \propto \int p(\mathcal{D}_x|z, \lambda_x)p(\lambda_x|z)d\lambda_x \propto \left[(\overline{x} - z)^2 + s_x^2\right]^{-1} \tag{2.64}$$

and the posterior becomes

$$p(z|\mathcal{D}) \propto \frac{1}{(\overline{x} - z)^2 + s_x^2}\;\frac{1}{(\overline{y} - z)^2 + s_y^2} \tag{2.65}$$
The exact posterior is plotted in Figure 2.1(b). We see that it has two modes, one near $\overline{x} = 1.5$ and one near $\overline{y} = 3.5$. These correspond to the beliefs that the $x$ sensor is more reliable than the $y$ one, and vice versa. The weight of the first mode is larger, since the data from the $x$ sensor agree more with each other, so it seems slightly more likely that the $x$ sensor is the reliable one. (They obviously cannot both be reliable, since they disagree on the values that they are reporting.) However, the Bayesian solution keeps open the possibility that the $y$ sensor is the more reliable one; from two measurements, we cannot tell, and choosing just the $x$ sensor, as the plug-in approximation does, results in overconfidence (a posterior that is too narrow).
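A quick way to see the bimodal shape is to evaluate Equation (2.65) on a grid and normalize numerically; the sketch below is ours, not the book's sensor_fusion_unknown_prec.py:

```python
# Grid evaluation of the exact (unnormalized) posterior of Eq. (2.65).
import numpy as np

xbar, ybar, sx2, sy2 = 1.5, 3.5, 0.16, 0.36
z = np.linspace(-2, 6, 1000)
post = 1.0 / (((xbar - z) ** 2 + sx2) * ((ybar - z) ** 2 + sy2))
post /= np.trapz(post, z)            # normalize numerically
print(z[np.argmax(post)])            # dominant mode, near xbar = 1.5
```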
So far, we have assumed the prior is conjugate to the likelihood, so we have been able to compute the
posterior analytically. However, this is rarely the case. A common alternative is to approximate the integral
using Monte Carlo sampling, as follows:
\begin{align}
p(z|\mathcal{D}) &\propto \int p(z|\mathcal{D}, \theta)p(\theta|\mathcal{D})d\theta \tag{2.66}\\
&\approx \frac{1}{S}\sum_s p(z|\mathcal{D}, \theta^s) \tag{2.67}
\end{align}

where $\theta^s \sim p(\theta|\mathcal{D})$. Note that $p(z|\mathcal{D}, \theta^s)$ is conditionally Gaussian, and is easy to compute. So we just need a way to draw samples from the parameter posterior, $p(\theta|\mathcal{D})$. We discuss suitable methods for this in ??.

Figure 2.1: Posterior for z. (a) Plug-in approximation. (b) Exact posterior. Generated by sensor_fusion_unknown_prec.py.

2.2 Google’s PageRank algorithm


In this section, we discuss Google's PageRank algorithm, since it provides an interesting application of Markov chain theory. PageRank is one of the components used for ranking web page search results. We sketch the basic idea below; see [BL06b] for a more detailed explanation.

2.2.1 Retrieving relevant pages using inverted indices


We will treat the web as a giant directed graph, where nodes represent web pages (documents) and edges
represent hyper-links.1 We then perform a process called web crawling. We start at a few designated root
nodes, such as wikipedia.org, and then follow the links, storing all the pages that we encounter, until we
run out of time.
Next, all of the words in each web page are entered into a data structure called an inverted index. That
is, for each word, we store a list of the documents where this word occurs. At test time, when a user enters a
query, we can find potentially relevant pages as follows: for each word in the query, look up all the documents
containing each word, and intersect these lists. (We can get a more refined search by storing the location of
each word in each document, and then testing if the words in a document occur in the same order as in the
query.)
Let us give an example, from http://en.wikipedia.org/wiki/Inverted_index. Suppose we have 3
documents, D0 = “it is what it is”, D1 = “what is it” and D2 = “it is a banana”. Then we can create the
following inverted index, where each pair represents a document and word location:
"a": {(2, 2)}
"banana": {(2, 3)}
"is": {(0, 1), (0, 4), (1, 1), (2, 1)}
"it": {(0, 0), (0, 3), (1, 2), (2, 0)}
"what": {(0, 2), (1, 0)}
For example, we see that the word “what” occurs in document 0 at location 2 (counting from 0), and in
document 1 at location 0. Suppose we search for “what is it”. If we ignore word order, we retrieve the
following documents:
{D0 , D1 } ∩ {D0 , D1 , D2 } ∩ {D0 , D1 , D2 } = {D0 , D1 } (2.68)
If we require that the word order matches, only document D1 would be returned. More generally, we can
allow out-of-order matches, but can give “bonus points” to documents whose word order matches the query’s
word order, or to other features, such as if the words occur in the title of a document. We can then return
the matching documents in decreasing order of their score/relevance. This is called document ranking.
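The following toy sketch builds such an inverted index for the three documents above and retrieves the intersection of Equation (2.68). (This is illustrative only; a real system would also exploit the stored word positions for phrase matching.)

```python
# A toy inverted index: word -> {(doc_id, position)}.
from collections import defaultdict

docs = ["it is what it is", "what is it", "it is a banana"]

index = defaultdict(set)
for doc_id, doc in enumerate(docs):
    for pos, word in enumerate(doc.split()):
        index[word].add((doc_id, pos))

def retrieve(query):
    """Intersect the document sets of all query words (ignoring word order)."""
    doc_sets = [{doc_id for doc_id, _ in index[w]} for w in query.split()]
    return set.intersection(*doc_sets)

print(retrieve("what is it"))   # {0, 1}, matching Eq. (2.68)
```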
1 In 2008, Google said it had indexed 1 trillion (10^12) unique URLs. If we assume there are about 10 URLs per page (on average), this means there were about 100 billion unique web pages. Estimates for 2010 are about 121 billion unique web pages. Source: https://bit.ly/2keQeyi

Figure 2.2: (a) A very small world wide web. Generated by pagerank_small_plot_graph.py. (b) The corresponding stationary distribution. Generated by pagerank_demo_small.py.

2.2.2 The PageRank score


So far, we have described the standard process of information retrieval. But the link structure of the web provides an additional source of information. The basic idea is that some web pages are more authoritative than others, so these should be ranked higher (assuming they match the query). A web page is considered an authority if it is linked to by many other pages. But to protect against the effect of so-called link farms, which are dummy pages which just link to a given site to boost its apparent relevance, we will weight each incoming link by the source's authority. Thus we get the following recursive definition for the authoritativeness of page j, also called its PageRank:

$$\pi_j = \sum_i A_{ij}\pi_i \tag{2.69}$$

where $A_{ij}$ is the probability of following a link from $i$ to $j$. (The term "PageRank" is named after Larry Page, one of Google's co-founders.)
We recognize Equation (2.69) as the stationary distribution of a Markov chain. But how do we define
the transition matrix? In the simplest setting, we define Ai,: as a uniform distribution over all states that
i is connected to. However, to ensure the distribution is unique, we need to make the chain into a regular
chain. This can be done by allowing each state i to jump to any other state (including itself) with some
small probability. This effectively makes the transition matrix aperiodic and fully connected (although the
adjacency matrix Gij of the web itself is highly sparse).
We discuss efficient methods for computing the leading eigenvector of this giant matrix below. Here we
ignore computational issues, and just give some examples.
First, consider the small web in Figure 2.2. We find that the stationary distribution is
π = (0.3209, 0.1706, 0.1065, 0.1368, 0.0643, 0.2008) (2.70)
So a random surfer will visit site 1 about 32% of the time. We see that node 1 has a higher PageRank than
nodes 4 or 6, even though they all have the same number of in-links. This is because being linked to from an
influential node helps increase your PageRank score more than being linked to by a less influential node.
As a slightly larger example, Figure 2.3(a) shows a web graph, derived from the root of harvard.edu.
Figure 2.3(b) shows the corresponding PageRank vector.

2.2.3 Efficiently computing the PageRank vector


Let $G_{ij} = 1$ iff there is a link from $j$ to $i$. Now imagine performing a random walk on this graph, where at every time step, with probability $p$ you follow one of the outlinks uniformly at random, and with probability $1 - p$ you jump to a random node, again chosen uniformly at random. If there are no outlinks, you just jump to a random page. (These random jumps, including self-transitions, ensure the chain is irreducible (singly connected) and regular. Hence we can solve for its unique stationary distribution using eigenvector methods.) This defines the following transition matrix:

$$M_{ij} = \begin{cases} pG_{ij}/c_j + \delta & \text{if } c_j \neq 0 \\ 1/n & \text{if } c_j = 0 \end{cases} \tag{2.71}$$

where $n$ is the number of nodes, $\delta = (1-p)/n$ is the probability of jumping from one page to another without following a link, and $c_j = \sum_i G_{ij}$ represents the out-degree of page $j$. (If $n = 4 \times 10^9$ and $p = 0.85$, then $\delta = 3.75 \times 10^{-11}$.) Here $M$ is a stochastic matrix in which columns sum to one. Note that $M = A^T$ in our earlier notation.

Figure 2.3: (a) Web graph of 500 sites rooted at www.harvard.edu. (b) Corresponding PageRank vector. Generated by pagerank_demo_harvard.py.
We can represent the transition matrix compactly as follows. Define the diagonal matrix $D$ with entries

$$d_{jj} = \begin{cases} 1/c_j & \text{if } c_j \neq 0 \\ 0 & \text{if } c_j = 0 \end{cases} \tag{2.72}$$

and define the vector $z$ with components

$$z_j = \begin{cases} \delta & \text{if } c_j \neq 0 \\ 1/n & \text{if } c_j = 0 \end{cases} \tag{2.73}$$

Then we can rewrite Equation (2.71) as follows:

$$M = pGD + \mathbf{1}z^T \tag{2.74}$$

The matrix $M$ is not sparse, but it is a rank one modification of a sparse matrix. Most of the elements of $M$ are equal to the small constant $\delta$. Obviously these do not need to be stored explicitly.
Our goal is to solve $v = Mv$, where $v = \pi^T$. One efficient method to find the leading eigenvector of a large matrix is known as the power method. This simply consists of repeated matrix-vector multiplication, followed by normalization:

$$v \propto Mv = pGDv + \mathbf{1}z^Tv \tag{2.75}$$
It is possible to implement the power method without using any matrix multiplications, by simply sampling
from the transition matrix and counting how often you visit each state. This is essentially a Monte Carlo
approximation to the sum implied by v = Mv. Applying this to the data in Figure 2.3(a) yields the stationary
distribution in Figure 2.3(b). This took 13 iterations to converge, starting from a uniform distribution. To
handle changing web structure, we can re-run this algorithm every day or every week, starting v off at the
old distribution; this is called warm starting [LM06].
For details on how to perform this Monte Carlo power method in a parallel distributed computing
environment, see e.g., [RU10].
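The following sketch shows how the power method can exploit the rank-one structure of Equation (2.74), so that the dense matrix M is never formed. (This is a minimal illustration using a small dense G; a production implementation would store G sparsely. The function name and the toy graph are ours.)

```python
# Power method for Eq. (2.75): Mv = pGDv + 1 (z^T v).
import numpy as np

def pagerank(G, p=0.85, n_iter=100):
    """G[i, j] = 1 iff page j links to page i. Returns the stationary vector."""
    n = G.shape[0]
    c = G.sum(axis=0)                                    # out-degrees c_j
    d = np.where(c != 0, 1.0 / np.maximum(c, 1), 0.0)    # diagonal of D, Eq. (2.72)
    z = np.where(c != 0, (1 - p) / n, 1.0 / n)           # Eq. (2.73)
    v = np.ones(n) / n
    for _ in range(n_iter):
        v = p * (G @ (d * v)) + z @ v                    # rank-one update, Eq. (2.75)
        v /= v.sum()                                     # normalize
    return v

# Tiny example: a 3-node cycle 0 -> 1 -> 2 -> 0.
G = np.array([[0, 0, 1], [1, 0, 0], [0, 1, 0]], dtype=float)
print(pagerank(G))    # uniform distribution, by symmetry
```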

2.2.4 Web spam
PageRank is not foolproof. For example, consider the strategy adopted by JC Penney, a department store in
the USA. During the Christmas season of 2010, it planted many links to its home page on 1000s of irrelevant
web pages, thus increasing its ranking on Google’s search engine [Seg11]. Even though each of these source
pages had low PageRank, there were so many of them that their effect added up. Businesses call this search
engine optimization; Google calls it web spam. When Google was notified of this scam (by the New
York Times), it manually downweighted JC Penney, since such behavior violates Google’s code of conduct.
The result was that JC Penney dropped from rank 1 to rank 65, essentially making it disappear from view.
Automatically detecting such scams relies on various techniques which are beyond the scope of this chapter.

2.2.5 Personalized PageRank


The PageRank algorithm computes a single global notion of importance of each web page. In some cases,
it is useful for each user to define his own notion of importance. The Personalized PageRank algorithm
(aka random walks with restart) computes a stationary distribution relative to node k, by returning with
some probability to a specific starting node k rather than a random node. The corresponding stationary
distribution, π k , gives a measure of how important each node is relative to k. See [Lof15] for details. (A
similar system is used by Pinterest to infer the similarity of one “pin” (bookmarked webpage) to another, as
explained in [Eks+18]).

Chapter 3

Statistics

3.1 Bayesian concept learning


In this section, we introduce Bayesian statistics using some simple examples inspired by Bayesian models of human learning. This will let us get familiar with the key ideas without getting bogged down by mathematical technicalities.
Consider how a child learns the meaning of a word, such as “dog”. Typically the child’s parents will point
out positive examples of this concept, saying such things as, “look at the cute dog!”, or “mind the doggy”,
etc. The core challenge is to figure out what we mean by the concept “dog”, based on a finite (and possibly
quite small) number of such examples. Note that the parent is unlikely to provide negative examples; for
example, people do not usually say “look at that non-dog”. Negative examples may be obtained during an
active learning process (e.g., the child says “look at the dog” and the parent says “that’s a cat, dear, not a
dog”), but psychological research has shown that people can learn concepts from positive examples alone
[XT07]. This means that standard supervised learning methods cannot be used.
We formulate the problem by assuming the data that we see are generated by some hidden concept h ∈ H,
where H is called the hypothesis space. (We use the notation h rather than θ to be consistent with the
concept learning literature.) We then focus on computing the posterior p(h|D). In Section 3.1.1, we assume
the hypothesis space consists of a finite number of alternative hypotheses; this will significantly simplify
the computation of the posterior, allowing us to focus on the ideas and not get too distracted by the math.
In Section 3.1.2, we will extend this to continuous hypothesis spaces. This will form the foundation for
Bayesian inference of real-valued parameters for more familiar probability models, such as the Bernoulli
and the Gaussian, logistic regression, and deep neural networks, that we discuss in later chapters. (See also
[Jia+13] for an application of these ideas to the problem of concept learning from images.)

3.1.1 Learning a discrete concept: the number game


Suppose that we are trying to learn some mathematical concept from a teacher who provides examples of that
concept. We assume that a concept is defined as the set of positive integers that belong to its extension; for
example, the concept “even number” is defined by heven = {2, 4, 6, . . .}, and the concept “powers of two ” is
defined by htwo = {2, 4, 8, 16, . . .}. For simplicity, we assume the range of numbers is between 1 and 100.
For example, suppose we see one example, D = {16}. What other numbers do you think are examples of
this concept? 17? 6? 32? 99? It’s hard to tell with only one example, so your predictions will be quite vague.
Presumably numbers that are similar in some sense to 16 are more likely. But similar in what way? 17 is
similar, because it is “close by”, 6 is similar because it has a digit in common, 32 is similar because it is also
even and a power of 2, but 99 does not seem similar. Thus some numbers are more likely than others.
Now suppose I tell you that D = {2, 8, 16, 64} are positive examples. You may guess that the hidden
concept is “powers of two”. Given your beliefs about the true (but hidden) concept, you may confidently

predict that y ∈ {2, 4, 8, 16, 32, 64} may also be generated in the future by the teacher. This is an example of generalization, since we are making predictions about future data that we have not seen.

Figure 3.1: Empirical membership distribution in the numbers game, derived from predictions from 8 humans. First two rows: after seeing D = {16} and D′ = {60}. This illustrates diffuse similarity. Third row: after seeing D = {16, 8, 2, 64}. This illustrates rule-like behavior (powers of 2). Bottom row: after seeing D = {16, 23, 19, 20}. This illustrates focused similarity (numbers near 20). From Figure 5.5 of [Ten99]. Used with kind permission of Josh Tenenbaum.

Figure 3.1 gives an example of how humans perform at this task. Given a single example, such as D = {16} or D = {60}, humans make fairly diffuse predictions over the other numbers that are similar in magnitude. But when given several examples, such as D = {2, 8, 16, 64}, humans often find an underlying pattern, and use this to make fairly precise predictions about which other numbers might be part of the same concept, even if those other numbers are “far away”.
How can we explain this behavior and emulate it in a machine? The classic approach to the problem of induction is to suppose we have a hypothesis space H of concepts (such as even numbers, all numbers
between 1 and 10, etc.), and then to identify the smallest subset of H that is consistent with the observed
data D; this is called the version space. As we see more examples, the version space shrinks and we become
increasingly certain about the underlying hypothesis [Mit97].
However, the version space theory cannot explain the human behavior we saw in Figure 3.1. For example,
after seeing D = {16, 8, 2, 64}, why do people choose the rule “powers of two” and not, say, “all even numbers”,
or “powers of two except for 32”, both of which are equally consistent with the evidence? We will now show
how Bayesian inference can explain this behavior. The resulting predictions are shown in Figure 3.2.

Figure 3.2: Posterior membership probabilities derived using the full hypothesis space. Compare to Figure 3.1. The predictions of the Bayesian model are only plotted for those values for which human data is available; this is why the top line looks sparser than Figure 3.4. From Figure 5.6 of [Ten99]. Used with kind permission of Josh Tenenbaum.

3.1.1.1 Likelihood

We must explain why people chose $h_{\text{two}}$ and not, say, $h_{\text{even}}$ after seeing D = {16, 8, 2, 64}, given that both hypotheses are consistent with the evidence. The key intuition is that we want to avoid suspicious coincidences. For example, if the true concept were even numbers, it would be surprising if we just happened to see only powers of two.
To formalize this, let us assume that the examples are sampled uniformly at random from the extension of the concept. (Tenenbaum calls this the strong sampling assumption.) Given this assumption, the probability of independently sampling N items (with replacement) from the unknown concept h is given by

$$p(\mathcal{D}|h) = \prod_{n=1}^{N}p(y_n|h) = \prod_{n=1}^{N}\frac{1}{\mathrm{size}(h)}\,\mathbb{I}(y_n \in h) = \left[\frac{1}{\mathrm{size}(h)}\right]^N\mathbb{I}(\mathcal{D} \in h) \tag{3.1}$$

where $\mathbb{I}(\mathcal{D} \in h)$ is non-zero iff all the data points lie in the support of h. This crucial equation embodies what Tenenbaum calls the size principle, which means the model favors the simplest (smallest) hypothesis consistent with the data. This is more commonly known as Occam's razor.
To see how it works, let D = {16}. Then $p(\mathcal{D}|h_{\text{two}}) = 1/6$, since there are only 6 powers of two less than 100, but $p(\mathcal{D}|h_{\text{even}}) = 1/50$, since there are 50 even numbers. So the likelihood that $h = h_{\text{two}}$ is higher than if $h = h_{\text{even}}$. After 4 examples, the likelihood of $h_{\text{two}}$ is $(1/6)^4 = 7.7 \times 10^{-4}$, whereas the likelihood of $h_{\text{even}}$ is $(1/50)^4 = 1.6 \times 10^{-7}$. This is a likelihood ratio of almost 5000:1 in favor of $h_{\text{two}}$. This quantifies our earlier intuition that D = {16, 8, 2, 64} would be a very suspicious coincidence if generated by $h_{\text{even}}$.
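This computation is easy to verify numerically; the following sketch (ours, not the book's numbers_game.py) reproduces the likelihood ratio:

```python
# Size-principle check: likelihood of D under "powers of two" vs "even numbers".
h_two = {2 ** k for k in range(1, 7)}        # {2, 4, 8, 16, 32, 64}
h_even = set(range(2, 101, 2))               # 50 even numbers up to 100

def lik(D, h):
    """Strong sampling likelihood, Eq. (3.1)."""
    return (1 / len(h)) ** len(D) if all(y in h for y in D) else 0.0

D = [16, 8, 2, 64]
print(lik(D, h_two) / lik(D, h_even))        # ~4823, i.e. almost 5000:1
```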

3.1.1.2 Prior

In the Bayesian approach, we must specify a prior over unknowns, p(h), as well as the likelihood, p(D|h). To see why this is useful, suppose D = {16, 8, 2, 64}. Given this data, the concept h′ = “powers of two except 32” is more likely than h = “powers of two”, since h′ does not need to explain the coincidence that 32 is missing from the set of examples. However, the hypothesis h′ = “powers of two except 32” seems “conceptually unnatural”. We can capture such intuition by assigning low prior probability to unnatural concepts. Of course, your prior might be different than mine. This subjective aspect of Bayesian reasoning is a source of much controversy, since it means, for example, that a child and a math professor will reach different answers.1
Although the subjectivity of the prior is controversial, it is actually quite useful. If you are told the
numbers are from some arithmetic rule, then given 1200, 1500, 900 and 1400, you may think 400 is likely but
1183 is unlikely. But if you are told that the numbers are examples of healthy cholesterol levels, you would
probably think 400 is unlikely and 1183 is likely, since you assume that healthy levels lie within some range.
Thus we see that the prior is the mechanism by which background knowledge can be brought to bear on
a problem. Without this, rapid learning (i.e., from small sample sizes) is impossible.
So, what prior should we use? We will initially consider 30 simple arithmetical concepts, such as “even
numbers”, “odd numbers”, “prime numbers”, or “numbers ending in 9”. We could use a uniform prior over
these concepts; however, for illustration purposes, we make the concepts even and odd more likely a priori,
and use a uniform prior over the others. We also include two “unnatural” concepts, namely “powers of 2, plus
37” and “powers of 2, except 32”, but give them low prior weight. See Figure 3.3a (bottom row) for a plot of this prior.

1 A child and a math professor presumably not only have different priors, but also different hypothesis spaces. However, we can finesse that by defining the hypothesis space of the child and the math professor to be the same, and then setting the child's prior weight to be zero on certain “advanced” concepts. Thus there is no sharp distinction between the prior and the hypothesis space.

Figure 3.3: (a) Prior, likelihood and posterior for the model when the data is D = {16}. (b) Results when D = {2, 8, 16, 64}. Adapted from [Ten99]. Generated by numbers_game.py.
In addition to “rule-like” hypotheses, we consider the set of intervals between n and m for 1 ≤ n, m ≤ 100.
This allows us to capture concepts based on being “close to” some number, rather than satisfying some more
abstract property. We put a uniform prior over the intervals.
We can combine these two priors by using a mixture distribution, as follows:

$$p(h) = \pi\,\mathrm{Unif}(h|\text{rules}) + (1 - \pi)\,\mathrm{Unif}(h|\text{intervals}) \tag{3.2}$$

where $0 < \pi < 1$ is the mixture weight assigned to the rules prior, and $\mathrm{Unif}(h|S)$ is the uniform distribution over the set $S$.

3.1.1.3 Posterior
The posterior is simply the likelihood times the prior, normalized: p(h|D) ∝ p(D|h)p(h). Figure 3.3a plots
the prior, likelihood and posterior after seeing D = {16}. (In this figure, we only consider rule-like hypotheses,
not intervals, for simplicity.) We see that the posterior is a combination of prior and likelihood. In the case
of most of the concepts, the prior is uniform, so the posterior is proportional to the likelihood. However,
the “unnatural” concepts of “powers of 2, plus 37” and “powers of 2, except 32” have low posterior support,
despite having high likelihood, due to the low prior. Conversely, the concept of odd numbers has low posterior
support, despite having a high prior, due to the low likelihood.
Figure 3.3b plots the prior, likelihood and posterior after seeing D = {16, 8, 2, 64}. Now the likelihood is
much more peaked on the powers of two concept, so this dominates the posterior. Essentially the learner has
an “aha” moment, and figures out the true concept.2 This example also illustrates why we need the low prior
on the unnatural concepts, otherwise we would have overfit the data and picked “powers of 2, except for 32”.

3.1.1.4 Posterior predictive


The posterior over hypotheses is our internal belief state about the world. The way to test if our beliefs are
justified is to use them to predict objectively observable quantities (this is the basis of the scientific method).
To do this, we compute the posterior predictive distribution over possible future observations:

$$p(y|\mathcal{D}) = \sum_h p(y|h)\,p(h|\mathcal{D}) \tag{3.3}$$
2 Humans have a natural desire to figure things out; Alison Gopnik, in her paper “Explanation as orgasm” [Gop98], argued
that evolution has ensured that we enjoy reducing our posterior uncertainty.

Figure 3.4: Posterior over hypotheses, and the induced posterior over membership, after seeing one example, D = {16}. A dot means this number is consistent with this hypothesis. The graph p(h|D) on the right is the weight given to hypothesis h. By taking a weighted sum of dots, we get p(y ∈ h|D) (top). Adapted from Figure 2.9 of [Ten99]. Generated by numbers_game.py.

This is called Bayes model averaging [Hoe+99]. Each term is just a weighted average of the predictions of
each individual hypothesis. This is illustrated in Figure 3.4. The dots at the bottom show the predictions
from each hypothesis; the vertical curve on the right shows the weight associated with each hypothesis. If we
multiply each row by its weight and add up, we get the distribution at the top.
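The following sketch shows the whole pipeline — likelihood, posterior, and the Bayes model average of Equation (3.3) — on a tiny hypothesis space. (The hypotheses and the uniform prior here are illustrative stand-ins for the full model in numbers_game.py.)

```python
# Numbers game with a toy hypothesis space.
hyps = {
    "even": set(range(2, 101, 2)),
    "odd": set(range(1, 101, 2)),
    "powers of 2": {2, 4, 8, 16, 32, 64},
    "powers of 4": {4, 16, 64},
    "mult of 8": set(range(8, 101, 8)),
}
prior = {h: 1.0 / len(hyps) for h in hyps}          # uniform, for simplicity

def posterior(D):
    post = {h: prior[h] * (1 / len(ext)) ** len(D) * all(y in ext for y in D)
            for h, ext in hyps.items()}
    Z = sum(post.values())
    return {h: p / Z for h, p in post.items()}

def predictive(y, post):
    """Bayes model average, Eq. (3.3)."""
    return sum(p * (y in hyps[h]) for h, p in post.items())

post = posterior([16, 8, 2, 64])
print(max(post, key=post.get))      # "powers of 2" dominates, as in the text
print(predictive(32, post))         # near 1: 32 lies in the likely concepts
```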

3.1.1.5 MAP, MLE, and the plugin approximation

As the amount of data increases, the posterior will (usually) become concentrated around a single point, namely the posterior mode, as we saw in Figure 3.3 (top right plot). The posterior mode is defined as the hypothesis with maximum posterior probability:

$$h_{\text{map}} \triangleq \operatorname*{argmax}_h\; p(h|\mathcal{D}) \tag{3.4}$$

This is also called the maximum a posteriori or MAP estimate.
We can compute the MAP estimate by solving the following optimization problem:

$$h_{\text{map}} = \operatorname*{argmax}_h\; p(h|\mathcal{D}) = \operatorname*{argmax}_h\; \left[\log p(\mathcal{D}|h) + \log p(h)\right] \tag{3.5}$$

The first term, $\log p(\mathcal{D}|h)$, is the log of the likelihood, $p(\mathcal{D}|h)$. The second term, $\log p(h)$, is the log of the prior. As the data set increases in size, the log likelihood grows in magnitude, but the log prior term remains constant. We thus say that the likelihood overwhelms the prior. In this context, a reasonable approximation to the MAP estimate is to ignore the prior term, and just pick the maximum likelihood estimate or MLE, which is defined as

$$h_{\text{mle}} \triangleq \operatorname*{argmax}_h\; p(\mathcal{D}|h) = \operatorname*{argmax}_h\; \log p(\mathcal{D}|h) = \operatorname*{argmax}_h\; \sum_{n=1}^{N}\log p(y_n|h) \tag{3.6}$$

Suppose we approximate the posterior by a single point estimate $\hat{h}$, which might be the MAP estimate or the MLE. We can represent this degenerate distribution as a single point mass

$$p(h|\mathcal{D}) \approx \mathbb{I}\left(h = \hat{h}\right) \tag{3.7}$$

where $\mathbb{I}(\cdot)$ is the indicator function. The corresponding posterior predictive distribution becomes

$$p(y|\mathcal{D}) \approx \sum_h p(y|h)\,\mathbb{I}\left(h = \hat{h}\right) = p(y|\hat{h}) \tag{3.8}$$

This is called a plug-in approximation, and is very widely used, due to its simplicity, as we discuss further
in ??.
Although the plug-in approximation is simple, it behaves in a qualitatively inferior way to the fully Bayesian approach when the dataset is small. In the Bayesian approach, we start with broad predictions, and then become more precise in our forecasts as we see more data, which makes intuitive sense. For example, given D = {16}, there are many hypotheses with non-negligible posterior mass, so the predicted support over the integers is broad. However, when we see D = {16, 8, 2, 64}, the posterior concentrates its mass on one or two specific hypotheses, so the overall predicted support becomes more focused. By contrast, the MLE picks the minimal consistent hypothesis, and predicts the future using that single model. For example, if we see D = {16}, we compute $h_{\text{mle}}$ to be “all powers of 4” (or the interval hypothesis h = {16}), and the resulting plugin approximation only predicts {4, 16, 64} as having non-zero probability. This is an example of overfitting, where we pay too much attention to the specific data that we saw in training, and fail to generalize correctly to novel examples. When we observe more data, the MLE will be forced to pick a broader hypothesis to explain all the data. For example, if we see D = {16, 8, 2, 64}, the MLE broadens to become “all powers of two”, similar to the Bayesian approach. Thus in the limit of infinite data, both approaches converge to the same predictions. However, in the small sample regime, the fully Bayesian approach, in which we consider multiple hypotheses, will give better (less overconfident) predictions.

3.1.2 Learning a continuous concept: the healthy levels game


The number game involved observing a series of discrete variables, and inferring a distribution over another
discrete variable from a finite hypothesis space. This made the computations particularly simple: we just
needed to sum, multiply and divide. However, in many applications, the variables that we observe are
real-valued continuous quantities. More importantly, the unknown parameters are also usually continuous, so
the hypothesis space becomes (some subset of) $\mathbb{R}^K$, where K is the number of parameters. This complicates
the mathematics, since we have to replace sums with integrals. However, the basic ideas are the same.
We illustrate these ideas by considering another example of concept learning called the healthy levels
game, also due to Tenenbaum. The idea is this: we measure two continuous variables, representing the
cholesterol and insulin levels of some randomly chosen healthy patients. We would like to know what range
of values correspond to a healthy range. As in the numbers game, the challenge is to learn the concept from
positive data alone.
Let our hypothesis space be axis-parallel rectangles in the plane, as in Figure 3.5. This is a classic
example which has been widely studied in machine learning [Mit97]. It is also a reasonable assumption for
the healthy levels game, since we know (from prior domain knowledge) that healthy levels of both insulin and
cholesterol must fall between (unknown) lower and upper bounds. We can represent a rectangle hypothesis
as h = (`1 , `2 , s1 , s2 ), where `j ∈ (−∞, ∞) are the coordinates (locations) of the lower left corner, and
sj ∈ [0, ∞) are the lengths of the two sides. Hence the hypothesis space is H = R2 × R2+ , where R≥0 is the
set of non-negative reals.
More complex concepts might require discontinuous regions of space to represent them. Alternatively, we
might want to use latent rectangular regions to represent more complex, high dimensional concepts [Li+19].
The question of where the hypothesis space comes from is a very interesting one, but is beyond the scope of
this chapter. (One approach is to use hierarchical Bayesian models, as discussed in [Ten+11].)

3.1.2.1 Likelihood
We assume points are sampled uniformly at random from the support of the rectangle. To simplify the
analysis, let us first consider the case of one-dimensional “rectangles”, i.e., lines. In the 1d case, the likelihood

is $p(\mathcal{D}|\ell, s) = (1/s)^N$ if all points are inside the interval, otherwise it is 0. Hence

$$p(\mathcal{D}|\ell, s) = \begin{cases} s^{-N} & \text{if } \min(\mathcal{D}) \geq \ell \text{ and } \max(\mathcal{D}) \leq \ell + s \\ 0 & \text{otherwise} \end{cases} \tag{3.9}$$

To generalize this to 2d, we assume the observed features are conditionally independent given the hypothesis. Hence the 2d likelihood becomes

$$p(\mathcal{D}|h) = p(\mathcal{D}_1|\ell_1, s_1)\,p(\mathcal{D}_2|\ell_2, s_2) \tag{3.10}$$

where $\mathcal{D}_j = \{y_{nj} : n = 1{:}N\}$ are the observations for dimension (feature) $j = 1, 2$.

Figure 3.5: Samples from the posterior in the “healthy levels” game. The axes represent “cholesterol level” and “insulin level”. (a) Given a small number of positive examples (represented by 3 red crosses), there is a lot of uncertainty about the true extent of the rectangle. (b) Given enough data, the smallest enclosing rectangle (which is the maximum likelihood hypothesis) becomes the most probable, although there are many other similar hypotheses that are almost as probable. Adapted from [Ten99]. Generated by healthy_levels_plots.py.

3.1.2.2 Prior

For simplicity, let us assume the prior factorizes, i.e., $p(h) = p(\ell_1)p(\ell_2)p(s_1)p(s_2)$. We will use uninformative priors for each of these terms. As we explain in ??, this means we should use a prior of the form $p(h) \propto \frac{1}{s_1}\frac{1}{s_2}$.

3.1.2.3 Posterior

The posterior is given by

$$p(\ell_1, \ell_2, s_1, s_2|\mathcal{D}) \propto p(\mathcal{D}_1|\ell_1, s_1)\,p(\mathcal{D}_2|\ell_2, s_2)\,\frac{1}{s_1}\frac{1}{s_2} \tag{3.11}$$

We can compute this numerically by discretizing $\mathbb{R}^4$ into a 4d grid, evaluating the numerator pointwise, and normalizing.
Since visualizing a 4d distribution is difficult, we instead draw posterior samples from it, $h^s \sim p(h|\mathcal{D})$,
and visualize them as rectangles. In Figure 3.5(a), we show some samples when the number N of observed
data points is small — we are uncertain about the right hypothesis. In Figure 3.5(b), we see that for larger
N , the samples concentrate on the observed data.

3.1.2.4 Posterior predictive distribution


We now consider how to predict which data points we expect to see in the future, given the data we have
seen so far. In particular, we want to know how likely it is that we will see any point y ∈ R2 .

Figure 3.6: Posterior predictive distribution for the healthy levels game. Red crosses are observed data points. Left column: N = 3. Right column: N = 12. First row: Bayesian prediction. Second row: Plug-in prediction using MLE (smallest enclosing rectangle). We see that the Bayesian prediction goes from uncertain to certain as we learn more about the concept given more data, whereas the plug-in prediction goes from narrow to broad, as it is forced to generalize when it sees more data. However, both converge to the same answer. Adapted from [Ten99]. Generated by healthy_levels_plot.py.

Let us define $y_j^{\min} = \min_n y_{nj}$, $y_j^{\max} = \max_n y_{nj}$, and $r_j = y_j^{\max} - y_j^{\min}$. Then one can show that the posterior predictive distribution is given by

$$p(y|\mathcal{D}) = \left[\frac{1}{(1 + d(y_1)/r_1)(1 + d(y_2)/r_2)}\right]^{N-1} \tag{3.12}$$

where $d(y_j) = 0$ if $y_j^{\min} \leq y_j \leq y_j^{\max}$, and otherwise $d(y_j)$ is the distance to the nearest data point along dimension $j$. Thus $p(y|\mathcal{D}) = 1$ if $y$ is inside the support of the training data; if $y$ is outside the support, the probability density drops off, at a rate that depends on $N$.
Note that if N = 1, the predictive distribution is undefined. This is because we cannot infer the extent of
a 2d rectangle from just one data point (unless we use a stronger prior).
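Below is a minimal sketch (ours, not the book's healthy_levels_plots.py) of how one might evaluate Equation (3.12); the data matrix Y and the query points are hypothetical, and we assume N >= 2 with distinct values in each dimension so that r_j > 0.

```python
import numpy as np

def posterior_predictive(y, Y):
    """Evaluate p(y|D) from Equation (3.12) at a query point y = (y1, y2)."""
    N = Y.shape[0]
    p = 1.0
    for j in range(2):
        ymin, ymax = Y[:, j].min(), Y[:, j].max()
        r = ymax - ymin  # range of the data along dimension j
        # d = 0 inside the support; otherwise distance to the nearest extremum
        d = 0.0 if ymin <= y[j] <= ymax else min(abs(y[j] - ymin), abs(y[j] - ymax))
        p *= 1.0 / (1.0 + d / r)
    return p ** (N - 1)

Y = np.array([[0.3, 0.4], [0.5, 0.7], [0.4, 0.5]])  # N = 3 observed points
print(posterior_predictive([0.4, 0.5], Y))  # 1.0: inside the support
print(posterior_predictive([0.9, 0.9], Y))  # < 1: decays outside the support
```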
In Figure 3.6(a), we plot the posterior predictive distribution when we have just seen N = 3 examples; we
see that there is a broad generalization gradient, which extends further along the vertical dimension than the
horizontal direction. This is because the data has a broader vertical spread than horizontal. In other words,
if we have seen a large range in one dimension, we have evidence that the rectangle is quite large in that
dimension, but otherwise we prefer compact hypotheses, as follows from the size principle.
In Figure 3.6(b), we plot the distribution for N = 12. We see it is focused on the smallest consistent
hypothesis, since the size principle exponentially down-weights hypotheses that are larger than necessary.

3.1.2.5 Plugin approximation
Now suppose we use a plug-in approximation to the posterior predictive, p(y|D) ≈ p(y|θ̂), where θ̂ is the
MLE or MAP estimate, analogous to the discussion in Section 3.1.1.5. In Figure 3.6(c-d), we show the
behavior of this approximation. In both cases, it predicts the smallest enclosing rectangle, since that is the
one with maximum likelihood. However, this does not extrapolate beyond the range of the observed data.
We also see that initially the predictions are narrower, since very little data has been observed, but that the
predictions become broader with more data. By contrast, in the Bayesian approach, the initial predictions
are broad, since there is a lot of uncertainty, but become narrower with more data. In the limit of large data,
both methods converge to the same predictions. (See ?? for more discussion of the plug-in approximation.)

3.2 Informative priors


When we have very little data, it is important to choose an informative prior.
For example, consider the classic taxicab problem [Jay03, p190]: you arrive in a new city, and see a
taxi numbered t = 27, and you want to infer the total number T of taxis in the city. We will use a uniform
likelihood, p(t|T) = \frac{1}{T} I(t \le T), since we assume that we could have observed any taxi number up to the
maximum value T. The MLE estimate of T is \hat{T}_{mle} = t = 27. But this does not seem reasonable. Instead,
most people would guess that T \approx 2 \times 27 = 54, on the assumption that if the taxi you saw was a uniform
random sample between 0 and T, then it would probably be close to the middle of the distribution.
In general, the conclusions we draw about T will depend strongly on our prior assumptions about what
values of T are likely. In the sections below, we discuss different possible informative priors for different
problem domains; our presentation is based on [GT06].
Once we have chosen a prior, we can compute the posterior as follows:

p(T|t) = \frac{p(t|T) \, p(T)}{p(t)}   (3.13)

where

p(t) = \int_0^\infty p(t|T) \, p(T) \, dT = \int_t^\infty \frac{p(T)}{T} \, dT   (3.14)

We will use the posterior median as a point estimate for T. This is the value \hat{T} such that

p(T \ge \hat{T} | t) = \int_{\hat{T}}^\infty p(T|t) \, dT = 0.5   (3.15)

Note that the posterior median is often a better summary of the posterior than the posterior mode, for
reasons explained in ??.

3.2.1 Domain specific priors


At the top of Figure 3.7, we show some histograms representing the empirical distribution of various kinds
of scalar quantities, specifically: the number of years people live, the number of minutes a movie lasts, the
amount of money made (in 1000s of US dollars) by movies, the number of lines of a poem, and the number of
years someone serves in the US House of Representatives. (The sources for these data are listed in [GT06].)
At the bottom, we plot p(T |t) as a function of t for each of these domains. The solid dots are the median
responses of a group of people when asked to predict T from a single observation t. The solid line is the
posterior median computed by a Bayesian model using a domain-appropriate prior (details below). The
dotted line is the posterior median computed by a Bayesian model using an uninformative 1/T prior. We see
a remarkable correspondence between people and the informed Bayesian model. This suggests that people
can implicitly use an appropriate kind of prior for a wide range of problems, as argued in [GT06]. In the
sections below, we discuss some suitable parametric priors which capture this behavior. In [GT06], they also
consider some datasets that can only be well-modeled by a non-parametric prior. Bayesian inference works
well in that case, too, but we omit this for simplicity.

[Figure 3.7 panels, left to right: Life Spans, Movie Runtimes, Movie Grosses, Poems, Representatives, Pharaohs, Cakes. Top row: histograms of t_total. Bottom row: predicted t_total vs observed t.]

Figure 3.7: Top: empirical distribution of various durational quantities. Bottom: predicted total duration as a function
of observed duration, p(T |t). Dots are observed median responses of people. Solid line: Bayesian prediction using
informed prior. Dotted line: Bayesian prediction using uninformative prior. From Figure 2a of [GT06]. Used with
kind permission of Tom Griffiths.

[Figure 3.8 panels, left to right: Gaussian prior (\mu = 30, 25, 15), Power-law prior (\gamma = 1, 1.5, 2), Erlang prior (\beta = 30, 18, 10). Top row: prior densities over t_total. Bottom row: predicted t_total vs observed t.]

Figure 3.8: Top: three different prior distributions, for three different parameter values. Bottom: corresponding
predictive distributions. From Figure 1 of [GT06]. Used with kind permission of Tom Griffiths.

3.2.2 Gaussian prior

Looking at Figure 3.7(a-b), it seems clear that life-spans and movie run-times can be well-modeled by a
Gaussian, N (T |µ, σ 2 ). Unfortunately, we cannot compute the posterior median in closed form if we use a
Gaussian prior, but we can still evaluate it numerically, by solving a 1d integration problem. The resulting
plot of \hat{T}(t) vs t is shown in Figure 3.8 (bottom left). For values of t much less than the prior mean, \mu, the
predicted value of T is about equal to \mu, so the left part of the curve is flat. For values of t much greater
than \mu, the predicted value converges to a line slightly above the diagonal, i.e., \hat{T}(t) = t + \epsilon for some small
(and decreasing) \epsilon > 0.
To see why this behavior makes intuitive sense, consider encountering a man at age 18, 39 or 51: in all
cases, a reasonable prediction is that he will live to about \mu = 75 years. But now imagine meeting a man at
age 80: we probably would not expect him to live much longer, so we predict \hat{T}(80) \approx 80 + \epsilon.
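Since the posterior median has no closed form here, a small numeric sketch illustrates the 1d integration; the prior N(T | 75, 16^2) and the grid bounds are assumed values, not taken from [GT06].

```python
import numpy as np
from scipy.stats import norm

def posterior_median(t, mu=75.0, sigma=16.0, Tmax=200.0, n=20000):
    T = np.linspace(t, Tmax, n)
    post = norm.pdf(T, mu, sigma) / T      # p(T|t) propto p(t|T) p(T) = p(T)/T for T >= t
    cdf = np.cumsum(post) / post.sum()     # normalize on the grid
    return T[np.searchsorted(cdf, 0.5)]    # first grid point past the median

for t in [18, 39, 51, 80]:
    print(t, posterior_median(t))  # roughly 75 for small t; slightly above 80 for t = 80
```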

3.2.3 Power-law prior
Looking at Figure 3.7(c-d), it seems clear that movie grosses and poem length can be modeled by a power
law distribution of the form p(T) \propto T^{-\gamma} for \gamma > 0. (If \gamma > 1, this is called a Pareto distribution, see ??.)
Power-laws are characterized by having very long tails. This captures the fact that most movies make
very little money, but a few blockbusters make a lot. The number of lines in various poems also has this
shape, since there are a few epic poems, such as Homer’s Odyssey, but most are short, like haikus. Wealth
has a similarly skewed distribution in many countries, especially in plutocracies such as the USA (see e.g.,
inequality.org).
In the case of a power-law prior, p(T) \propto T^{-\gamma}, we can compute the posterior median analytically. We have

p(t) \propto \int_t^\infty T^{-(\gamma+1)} \, dT = -\frac{1}{\gamma} T^{-\gamma} \Big|_t^\infty = \frac{1}{\gamma} t^{-\gamma}   (3.16)

Hence the posterior becomes

p(T|t) = \frac{T^{-(\gamma+1)}}{\frac{1}{\gamma} t^{-\gamma}} = \frac{\gamma t^\gamma}{T^{\gamma+1}}   (3.17)

for values of T \ge t. We can derive the posterior median as follows:

p(T > T_M | t) = \int_{T_M}^\infty \frac{\gamma t^\gamma}{T^{\gamma+1}} \, dT = -\left(\frac{t}{T}\right)^\gamma \Big|_{T_M}^\infty = \left(\frac{t}{T_M}\right)^\gamma   (3.18)

Solving for T_M such that p(T > T_M | t) = 0.5 gives T_M = 2^{1/\gamma} t.
This is plotted in Figure 3.8 (bottom middle). We see that the predicted duration is some constant
multiple of the observed duration. For the particular value of γ that best fits the empirical distribution of
movie grosses, the optimal prediction is about 50% larger than the observed quantity. So if we observe that a
movie has made $40M to date, we predict that it will make $60M in total.
As Griffiths and Tenenbaum point out, this rule is inappropriate for quantities that follow a Gaussian
prior, such as people’s ages. As they write, “Upon meeting a 10-year-old girl and her 75-year-old grandfather,
we would never predict that the girl will live a total of 15 years (1.5 × 10) and that the grandfather will
live to be 112 (1.5 × 75).” This shows that people implicitly know what kind of prior to use when solving
prediction problems of this kind.

3.2.4 Erlang prior


Looking at Figure 3.7(e), it seems clear that the number of years a US Representative serves is approximately
modeled by a gamma distribution (??). Griffiths and Tenenbaum use a special case of the gamma distribution,
where the shape parameter is a = 2; this is known as the Erlang distribution:

p(T) = \text{Ga}(T | 2, 1/\beta) \propto T e^{-T/\beta}   (3.19)
For the Erlang prior, we can also compute the posterior median analytically. We have

p(t) \propto \int_t^\infty \exp(-T/\beta) \, dT = -\beta \exp(-T/\beta) \Big|_t^\infty = \beta \exp(-t/\beta)   (3.20)

so the posterior has the form

p(T|t) = \frac{\exp(-T/\beta)}{\beta \exp(-t/\beta)} = \frac{1}{\beta} \exp(-(T - t)/\beta)   (3.21)

for values of T \ge t. We can derive the posterior median as follows:

p(T > T_M | t) = \int_{T_M}^\infty \frac{1}{\beta} \exp(-(T - t)/\beta) \, dT = -\exp(-(T - t)/\beta) \Big|_{T_M}^\infty = \exp(-(T_M - t)/\beta)   (3.22)

Solving for T_M such that p(T > T_M | t) = 0.5 gives T_M = t + \beta \log 2.
This is plotted in Figure 3.8 (bottom right). We see that the best guess is simply the observed value plus
a constant, where the constant reflects the average term in office.
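The following sketch (with assumed values of t, \gamma and \beta) checks both analytic medians against brute-force numerical integration on a grid:

```python
import numpy as np

def numeric_median(prior, t, Tmax=1e4, n=400_000):
    T = np.linspace(t, Tmax, n)
    post = prior(T) / T                    # p(T|t) propto p(t|T) p(T) = p(T)/T for T >= t
    cdf = np.cumsum(post) / post.sum()
    return T[np.searchsorted(cdf, 0.5)]

t, gamma, beta = 30.0, 1.5, 18.0
# Power-law prior: numeric vs analytic T_M = 2^(1/gamma) * t, about 47.6
print(numeric_median(lambda T: T ** (-gamma), t), 2 ** (1 / gamma) * t)
# Erlang prior: numeric vs analytic T_M = t + beta * log(2), about 42.5
print(numeric_median(lambda T: T * np.exp(-T / beta), t), t + beta * np.log(2))
```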

Chapter 4

Graphical models

4.1 More examples of DGMs


4.1.1 The QMR network
In this section, we describe the PGM-D known as the quick medical reference or QMR network [Shw+91].
This is a model of infectious diseases and is shown (in simplified form) in Figure 4.1. (We omit the parameters
for clarity, so we don’t use plate notation.) The QMR model is a bipartite graph structure, with hidden
diseases (causes) at the top and visible symptoms or findings at the bottom. We can write the distribution as
follows:
p(z, x) = \prod_{k=1}^K p(z_k) \prod_{d=1}^D p(x_d | z_{pa(d)})   (4.1)

where zk represents the k’th disease and xd represents the d’th symptom. This model can be used inside
an inference engine to compute the posterior probability of each disease given the observed symptoms, i.e.,
p(zk |xv ), where xv is the set of visible symptom nodes. (The symptoms which are not observed can be
removed from the model, assuming they are missing at random (??), because they contribute nothing to the
likelihood; this is called barren node removal.)
We now discuss the parameterization of the model. For simplicity, we assume all nodes are binary. The
CPD for the root nodes are just Bernoulli distributions, representing the prior probability of that disease.
Representing the CPDs for the leaves (symptoms) using CPTs would require too many parameters, because
the fan-in (number of parents) of many leaf nodes is very high. A natural alternative is to use logistic

[Figure 4.1: bipartite DGM with diseases z_1, z_2, z_3 on top and symptoms x_1, ..., x_5 below.]
Figure 4.1: A small version of the QMR network. All nodes are binary. The hidden nodes zk represent diseases, and
the visible nodes xd represent symptoms. In the full network, there are 570 hidden (disease) nodes and 4075 visible
(symptom) nodes. The shaded (solid gray) leaf nodes are observed; in this example, symptom x2 is not observed (i.e.,
we don’t know if it is present or absent). Of course, the hidden diseases are never observed.

z0 z1 z2 P (xd = 0|z0 , z1 , z2 ) P (xd = 1|z0 , z1 , z2 )
1 0 0 θ0 1 − θ0
1 1 0 θ0 θ1 1 − θ0 θ1
1 0 1 θ0 θ2 1 − θ0 θ2
1 1 1 θ0 θ1 θ2 1 − θ0 θ1 θ2

Table 4.1: Noisy-OR CPD for p(xd |z0 , z1 , z2 ), where z0 = 1 is a leak node.

regression to model the CPD, p(x_d | z_{pa(d)}) = \text{Ber}(x_d | \sigma(w_d^T z_{pa(d)})). However, we use an alternative known as
the noisy-OR model, which we explain below.
The noisy-OR model assumes that if a parent is on, then the child will usually also be on (since it is an
or-gate), but occasionally the “links” from parents to child may fail, independently at random. If a failure
occurs, the child will be off, even if the parent is on. To model this more precisely, let θkd = 1 − qkd be the
probability that the k → d link fails. The only way for the child to be off is if all the links from all parents
that are on fail independently at random. Thus
p(x_d = 0 | z) = \prod_{k \in pa(d)} \theta_{kd}^{I(z_k = 1)}   (4.2)

Obviously, p(x_d = 1|z) = 1 - p(x_d = 0|z). In particular, let us define q_{kd} = 1 - \theta_{kd} = p(x_d = 1 | z_k = 1, z_{-k} = 0);
this is the probability that k can activate d "on its own"; this is sometimes called its "causal power" (see
e.g., [KNH11]).
If we observe that xd = 1 but all its parents are off, then this contradicts the model. Such a data case
would get probability zero under the model, which is problematic, because it is possible that someone exhibits
a symptom but does not have any of the specified diseases. To handle this, we add a dummy leak node z0 ,
which is always on; this represents “all other causes”. The parameter q0d represents the probability that the
background leak can cause symptom d on its own. The modified CPD becomes
p(x_d = 0 | z) = \theta_{0d} \prod_{k \in pa(d)} \theta_{kd}^{z_k}   (4.3)

See Table 4.1 for a numerical example.
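A minimal sketch of the noisy-OR CPD in Equation (4.3) is given below; the parameter values are hypothetical, chosen only to match the structure of Table 4.1:

```python
import numpy as np

def noisy_or_p_on(z, theta, theta0):
    """p(x_d = 1 | z), where theta[k] is the failure prob of link k -> d and theta0 is the leak term."""
    p_off = theta0 * np.prod(np.where(z == 1, theta, 1.0))  # Equation (4.3)
    return 1.0 - p_off

# Last row of Table 4.1 (z1 = z2 = 1), with hypothetical failure probabilities:
theta = np.array([0.1, 0.2])  # theta_1, theta_2
print(noisy_or_p_on(np.array([1, 1]), theta, theta0=0.9))  # 1 - 0.9*0.1*0.2 = 0.982
```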


If we define w_{kd} \triangleq \log(\theta_{kd}), we can rewrite the CPD as

p(x_d = 1 | z) = 1 - \exp\left( w_{0d} + \sum_k z_k w_{kd} \right)   (4.4)

We see that this is similar to a logistic regression model.


It is relatively easy to set the θkd parameters by hand, based on domain expertise, as was done with QMR.
Such a model is called a probabilistic expert system. In this book, we focus on learning parameters from
data; we discuss how to do this in ?? (see also [Nea92; MH97]).

4.1.2 Genetic linkage analysis


PGM-D’s are widely used in statistical genetics. In this section, we discuss the problem of genetic linkage
analysis, in which we try to infer which genes cause a given disease. We explain the method below.

4.1.2.1 Single locus


We start with a pedigree graph, which is a DAG representing the relationship between parents and
children, as shown in Figure 4.2(a). Next we construct the DGM. For each person (or animal) i and location
or locus j along the genome, we create three nodes: the observed phenotype P_{ij} (which can be a property
such as blood type, or just a fragment of DNA that can be measured), and two hidden alleles (genes),

(a) (b)

Figure 4.2: Left: family tree, circles are females, squares are males. Individuals with the disease of interest are
highlighted. Right: DGM for locus j = L. Blue node P_{ij} is the phenotype for individual i at locus j. Orange nodes
G_{ij}^{p/m} are the paternal/maternal alleles. Small red nodes S_{ij}^{p/m} are the paternal/maternal selection switching variables.
The founder (root) nodes do not have any parents, and hence do not need switching variables. All nodes are hidden
except the blue phenotypes. Adapted from Figure 3 from [FGL00].

G^p  G^m  p(P = a)  p(P = b)  p(P = o)  p(P = ab)


a a 1 0 0 0
a b 0 0 0 1
a o 1 0 0 0
b a 0 0 0 1
b b 0 1 0 0
b o 0 1 0 0
o a 1 0 0 0
o b 0 1 0 0
o o 0 0 1 0

Table 4.2: CPT which encodes a mapping from genotype to phenotype (bloodtype). This is a deterministic, but
many-to-one, mapping.

[Figure 4.3: two copies of the pedigree DGM side by side, for locus 1 and locus 2, each over individuals 1-6.]

Figure 4.3: Extension of Figure 4.2 to two loci, showing how the switching variables are spatially correlated. This is
indicated by the S_{ij}^m \to S_{i,j+1}^m and S_{ij}^p \to S_{i,j+1}^p edges. Adapted from Figure 3 from [FGL00].

G_{ij}^m and G_{ij}^p, one inherited from i's mother (maternal allele) and the other from i's father (paternal allele).
Together, the ordered pair G_{ij} = (G_{ij}^m, G_{ij}^p) constitutes i's hidden genotype at locus j.
Obviously we must add G_{ij}^m \to P_{ij} and G_{ij}^p \to P_{ij} arcs representing the fact that genotypes cause
phenotypes. The CPD p(P_{ij} | G_{ij}^m, G_{ij}^p) is called the penetrance model. As a very simple example, suppose
P_{ij} \in \{A, B, O, AB\} represents person i's observed bloodtype, and G_{ij}^m, G_{ij}^p \in \{A, B, O\} is their genotype.
We can represent the penetrance model using the deterministic CPD shown in Table 4.2. For example, A
dominates O, so if a person has genotype AO or OA, their phenotype will be A.
In addition, we add arcs from i's mother and father into G_{ij}^{m/p}, reflecting the Mendelian inheritance of
genetic material from one's parents. More precisely, let \mu_i = k be i's mother. For example, in Figure 4.2(b),
for individual i = 3, we have \mu_i = 2, since 2 is the mother of 3. The gene G_{ij}^m could either be equal to G_{kj}^m or
G_{kj}^p, that is, i's maternal allele is a copy of one of its mother's two alleles. Let S_{ij}^m be a hidden switching
variable that specifies the choice. Then we can use the following CPD, known as the inheritance model:

p(G_{ij}^m | G_{kj}^m, G_{kj}^p, S_{ij}^m) = \begin{cases} I(G_{ij}^m = G_{kj}^m) & \text{if } S_{ij}^m = m \\ I(G_{ij}^m = G_{kj}^p) & \text{if } S_{ij}^m = p \end{cases}   (4.5)

We can define p(G_{ij}^p | G_{k'j}^m, G_{k'j}^p, S_{ij}^p) similarly, where k' = \pi_i is i's father. The values of the S_{ij} are said to
specify the phase of the genotype. The values of G_{ij}^p, G_{ij}^m, S_{ij}^p and S_{ij}^m constitute the haplotype of person
i at locus j. (The genotype G_{ij}^m and G_{ij}^p without the switching variables S_{ij}^m and S_{ij}^p is called the "unphased"
genotype.)
Next, we need to specify the prior for the root nodes, p(G_{ij}^m) and p(G_{ij}^p). This is called the founder
model, and represents the overall prevalence of different kinds of alleles in the population. We usually
assume independence between the loci for these founder alleles, and give these root nodes uniform priors.
Finally, we need to specify priors for the switch variables that control the inheritance process. For now, we
will assume there is just a single locus, so we can assume uniform priors for the switches. The resulting DGM
is shown in Figure 4.2(b).

4.1.2.2 Multiple loci


We get more statistical power if we can measure multiple phenotypes and genotypes. In this case, we
must model spatial correlation amongst the genes, since genes that are close on the genome are likely to be
coinherited, since there is less likely to be a crossover event between them. We can model this by imposing
a two-state Markov chain on the switching variables S, where the probability of switching state at locus
j is given by \theta_j = \frac{1}{2}(1 - e^{-2 d_j}), where d_j is the distance between loci j and j + 1. This is called the
recombination model. The resulting DGM for two linked loci is shown in Figure 4.3.

We can now use this model to determine where along the genome a given disease-causing gene is assumed
to lie — this is the genetic linkage analysis task. The method works as follows. First, suppose all the
parameters of the model, including the distance between all the marker loci, are known. The only unknown
is the location of the disease-causing gene. If there are L marker loci, we construct L + 1 models: in model \ell,
we postulate that the disease gene comes after marker \ell, for \ell = 0, 1, \ldots, L. We can estimate the Markov
switching parameter \hat{\theta}_\ell, and hence the distance d_\ell between the disease gene and its nearest known locus.
We measure the quality of that model using its likelihood, p(D|\hat{\theta}_\ell). We can then pick the model with
highest likelihood.
Note, however, that computing the likelihood requires marginalizing out all the hidden S and G variables.
See [FG02] and the references therein for some exact methods for this task; these are based on the variable
elimination algorithm, which we discuss in Section ??. Unfortunately, for reasons we explain in Section ??,
exact methods can be computationally intractable if the number of individuals and/or loci is large. See
[ALK06] for an approximate method for computing the likelihood based on the “cluster variation method”.
Note that it is possible to extend the above model in multiple ways. For example, we can model evolution
amongst phylogenies using a phylogenetic HMM [SH03].

4.2 More examples of UGMs


4.2.0.1 Potts models for protein structure prediction
One interesting application of Potts models arises in the area of protein structure prediction. The goal
is to predict the 3d shape of a protein from its 1d sequence of amino acids. A common approach to this is
known as direct coupling analysis (DCA). We give a brief summary below; for details, see [Mor+11].
First we compute a multiple sequence alignment (MSA), from a set of related amino acid sequences
from the same protein family; this can be done using HMMs, as explained in ??. The MSA can be represented
by an N × T matrix X, where N is the number of sequences, T is the length of each sequence, and
Xni ∈ {1, . . . , V } is the identity of the letter at location i in sequence n. For protein sequences, V = 21,
representing the 20 amino acids plus the gap character.
Once we have the MSA matrix X, we fit the Potts model using maximum likelihood estimation, or some
approximation, such as pseudo likelihood [Eke+13]; see ?? for details.1 After fitting the model, we select the
edges with the highest Jij coefficients, where i, j ∈ {1, . . . , T } are locations or residues in the protein. Since
these locations are highly coupled, they are likely to be in physical contact, since interacting residues must
coevolve to avoid destroying the function of the protein (see e.g., [LHF17] for a review). This graph is called
a contact map.
Once the contact map is established, it can be used as input to a 3d structural prediction algorithm, such
as [Xu18] or the alphafold system [Eva+18], which won the 2018 CASP competition. Such methods use
neural networks to learn functions of the form p(d(i, j)|{c(i, j)}), where d(i, j) is the 3d distance between
residues i and j, and c(i, j) is the contact map.

4.2.1 Hopfield networks


A Hopfield network [Hop82] is a fully connected Ising model (??) with a symmetric weight matrix,
W = W^T. The corresponding energy function has the form

E(x) = -\frac{1}{2} x^T W x   (4.6)

where x_i \in \{-1, +1\}.
The main application of Hopfield networks is as an associative memory or content addressable
memory. The idea is this: suppose we train on a set of fully observed bit vectors, corresponding to patterns
1 To encourage the model to learn sparse connectivity, we can also compute a MAP estimate with a sparsity promoting prior,

as discussed in [IM17].

[Figure 4.4 rows, top to bottom: hopfield_training, hopfield_occluded, hopfield_recall; each row shows three 150 x 150 binary images.]

Figure 4.4: Examples of how an associative memory can reconstruct images. These are binary images of size 150 × 150
pixels. Top: training images. Middle row: partially visible test images. Bottom row: final state estimate. Adapted
from Figure 2.1 of [HKP91]. Generated by hopfield_demo.py.

we want to memorize. (We discuss how to do this below.) Then, at test time, we present a partial pattern to
the network. We would like to estimate the missing variables; this is called pattern completion. That is,
we want to compute

x^* = \operatorname{argmin}_x E(x)   (4.7)

We can solve this optimization problem using iterative conditional modes (ICM), in which we set each
hidden variable to its most likely state given its neighbors. Picking the most probable state amounts to using
the rule

x_{t+1} = \operatorname{sgn}(W x_t)   (4.8)
This can be seen as a deterministic version of Gibbs sampling (see ??).
We illustrate this process in Figure 4.4. In the top row, we show some training examples. In the middle
row, we show a corrupted input, corresponding to the initial state x0 . In the bottom row, we show the final
state after 30 iterations of ICM. The overall process can be thought of as retrieving a complete example from
memory based on a piece of the example.
To learn the weights W, we could use the maximum likelihood estimate method described in ??. (See
also [HSDK12].) However, a simpler heuristic method, proposed in [Hop82], is to use the following outer
product method:

W = \frac{1}{N} \left( \sum_{n=1}^N x_n x_n^T \right) - I   (4.9)

This normalizes the outer product matrix by N, and then sets the diagonal to 0. This ensures the energy
is low for patterns that match any of the examples in the training set. This is the technique we used in
Figure 4.4. Note, however, that this method not only stores the original patterns but also their inverses, and
other linear combinations. Consequently there is a limit to how many examples the model can store before
they start to "collide" in the memory. Hopfield proved that, for random patterns, the network capacity is
\sim 0.14 N.
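The following minimal sketch (using random patterns rather than the images of Figure 4.4, so it is not hopfield_demo.py) implements the outer product rule of Equation (4.9) and ICM recall via Equation (4.8):

```python
import numpy as np

def hopfield_fit(X):
    """X has shape (N, D) with entries in {-1, +1}; returns the D x D weight matrix."""
    N, D = X.shape
    W = (X.T @ X) / N
    np.fill_diagonal(W, 0)  # equivalent to subtracting I, since x_i^2 = 1
    return W

def hopfield_recall(W, x0, n_iters=30):
    x = x0.copy()
    for _ in range(n_iters):
        x = np.where(W @ x >= 0, 1.0, -1.0)  # synchronous ICM / sign update
    return x

rng = np.random.default_rng(0)
X = np.where(rng.random((3, 100)) < 0.5, 1.0, -1.0)   # 3 random patterns, D = 100
W = hopfield_fit(X)
x_corrupt = X[0] * np.where(rng.random(100) < 0.1, -1.0, 1.0)  # flip ~10% of bits
print(np.mean(hopfield_recall(W, x_corrupt) == X[0]))  # close to 1.0
```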

4.2.2 Restricted Boltzmann machines (RBMs) in more detail


In this section, we discuss RBMs in more detail.

4.2.2.1 Binary RBMs


The most common form of RBM has binary hidden nodes and binary visible nodes. The joint distribution
then has the following form:

p(x, z|\theta) = \frac{1}{Z(\theta)} \exp(-E(x, z; \theta))   (4.10)

E(x, z; \theta) \triangleq -\sum_{d=1}^D \sum_{k=1}^K x_d z_k W_{dk} - \sum_{d=1}^D x_d b_d - \sum_{k=1}^K z_k c_k   (4.11)

= -(x^T W z + x^T b + z^T c)   (4.12)

Z(\theta) = \sum_x \sum_z \exp(-E(x, z; \theta))   (4.13)

where E is the energy function, W is a D \times K weight matrix, b are the visible bias terms, c are the hidden
bias terms, and \theta = (W, b, c) are all the parameters. For notational simplicity, we will absorb the bias terms
into the weight matrix by adding dummy units x_0 = 1 and z_0 = 1 and setting w_{0,:} = c and w_{:,0} = b. Note
that naively computing Z(\theta) takes O(2^D 2^K) time but we can reduce this to O(\min\{D 2^K, K 2^D\}) time using
the structure of the graph.
When using a binary RBM, the posterior can be computed as follows:

p(z|x, \theta) = \prod_{k=1}^K p(z_k|x, \theta) = \prod_k \text{Ber}(z_k | \sigma(w_{:,k}^T x))   (4.14)

By symmetry, one can show that we can generate data given the hidden variables as follows:

p(x|z, \theta) = \prod_d p(x_d|z, \theta) = \prod_d \text{Ber}(x_d | \sigma(w_{d,:}^T z))   (4.15)

We can write this in matrix-vector notation as follows:

E[z|x, \theta] = \sigma(W^T x)   (4.16)

E[x|z, \theta] = \sigma(W z)   (4.17)

The weights in W are called the generative weights, since they are used to generate the observations, and
the weights in WT are called the recognition weights, since they are used to recognize the input.
From Equation 4.14, we see that we activate hidden node k in proportion to how much the input vector x
“looks like” the weight vector w:,k (up to scaling factors). Thus each hidden node captures certain features of
the input, as encoded in its weight vector, similar to a feedforward neural network.
For example, consider an RBM for text models, where x is a bag of words (i.e., a bit vector over the
vocabulary). Let zk = 1 if “topic” k is present in the document. Suppose a document has the topics “sports”
and “drugs”. If we “multiply” the predictions of each topic together, the model may give very high probability

to the word “doping”, which satisfies both constraints. By contrast, adding together experts can only make
the distribution broader (see Figure ??). In particular, if we mix together the predictions from “sports” and
“drugs”, we might generate words like “cricket” and “addiction”, which come from the union of the two topics,
not their intersection.
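The factorized conditionals in Equations (4.14) and (4.15) make block Gibbs sampling straightforward. Here is a minimal sketch with random (untrained) parameters; the dimensions and seed are arbitrary:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gibbs_step(x, W, b, c, rng):
    pz = sigmoid(W.T @ x + c)                       # p(z_k = 1 | x), Equation (4.14)
    z = (rng.random(pz.shape) < pz).astype(float)   # sample the hidden layer as a block
    px = sigmoid(W @ z + b)                         # p(x_d = 1 | z), Equation (4.15)
    x = (rng.random(px.shape) < px).astype(float)   # sample the visible layer as a block
    return x, z

rng = np.random.default_rng(0)
D, K = 6, 3
W, b, c = rng.normal(size=(D, K)), np.zeros(D), np.zeros(K)
x = rng.integers(0, 2, size=D).astype(float)
for _ in range(100):  # run the chain for a while
    x, z = gibbs_step(x, W, b, c, rng)
print(x, z)
```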

4.2.2.2 Categorical RBMs


We can extend the binary RBM to categorical visible variables by using a 1-of-C encoding, where C is the
number of states for each x_d. We define a new energy function as follows [SMH07; SH10]:

E(x, z; \theta) \triangleq -\sum_{d=1}^D \sum_{k=1}^K \sum_{c=1}^C x_d^c z_k w_{dk}^c - \sum_{d=1}^D \sum_{c=1}^C x_d^c b_d^c - \sum_{k=1}^K z_k c_k   (4.18)

The full conditionals are given by

p(x_d = c^* | z, \theta) = \text{softmax}\left( \{ b_d^c + \sum_k z_k w_{dk}^c \}_{c=1}^C \right)[c^*]   (4.19)

p(z_k = 1 | x, \theta) = \sigma\left( c_k + \sum_d \sum_c x_d^c w_{dk}^c \right)   (4.20)

4.2.2.3 Gaussian RBMs


We can generalize the model to handle real-valued data. In particular, a Gaussian RBM has the following
energy function:

E(x, z|\theta) = -\sum_{d=1}^D \sum_{k=1}^K w_{dk} z_k x_d + \frac{1}{2} \sum_{d=1}^D (x_d - b_d)^2 - \sum_{k=1}^K a_k z_k   (4.21)

The parameters of the model are \theta = (w_{dk}, a_k, b_d). (We have assumed the data is standardized, so we fix the
variance to \sigma^2 = 1.) Compare this to a Gaussian in canonical or information form (see Section ??):

N_c(x|\eta, \Lambda) \propto \exp(\eta^T x - \frac{1}{2} x^T \Lambda x)   (4.22)

where \eta = \Lambda \mu. We see that we have set \Lambda = I, and \eta = b + \sum_k z_k w_{:,k}. Thus the mean is given by
\mu = \Lambda^{-1} \eta = b + \sum_k z_k w_{:,k}, which is a weighted combination of prototypes. The full conditionals, which are
needed for inference and learning, are given by

p(x_d | z, \theta) = N(x_d | b_d + \sum_k w_{dk} z_k, 1)   (4.23)

p(z_k = 1 | x, \theta) = \sigma\left( a_k + \sum_d w_{dk} x_d \right)   (4.24)

More powerful models, which make the (co)variance depend on the hidden states, can also be developed
[RH10].

4.2.2.4 RBMs with Gaussian hidden units


If we use Gaussian latent variables and Gaussian visible variables, we get an undirected version of factor
analysis (??). Interestingly, this is mathematically equivalent to the standard directed version [MM01].
If we use Gaussian latent variables and categorical observed variables, we get an undirected version of
categorical PCA (Section ??). In [SMH07], this was applied to the Netflix collaborative filtering problem,
but was found to be significantly inferior to using binary latent variables, which have more expressive power.

4.2.3 Feature induction for a maxent spelling model
In some applications, we assume the features φ(x) are known. However, it is possible to learn the features in
a maxent model in an unsupervised way; this is known as feature induction.
A common approach to feature induction, first proposed in [DDL97; ZWM97], is to start with a base set
of features, and then to continually create new feature combinations out of old ones, greedily adding the best
ones to the model.
As an example of this approach, [DDL97] describe how to build models to represent English spelling. This
can be formalized as a probability distribution over variable length strings, p(x|θ), where xt is a letter in
the English alphabet. Initially the model has no features, which represents the uniform distribution. The
algorithm starts by choosing to add the feature
\phi_1(x) = \sum_i I(x_i \in \{a, \ldots, z\})   (4.25)

which checks if any letter is lower case or not. After the feature is added, the parameters are (re)-fit by
maximum likelihood (a computationally difficult problem, which we discuss in ??). For this feature, it turns
out that \hat{\theta}_1 = 1.944, which means that a word with a lowercase letter in any position is about e^{1.944} \approx 7
times more likely than the same word without a lowercase letter in that position. Some samples from this
model, generated using (annealed) Gibbs sampling (described in ??), are shown below.^2

m, r, xevo, ijjiir, b, to, jz, gsr, wq, vf, x, ga, msmGh, pcp, d, oziVlal, hzagh, yzop, io,
advzmxnv, ijv_bolft, x, emx, kayerf, mlj, rawzyb, jp, ag, ctdnnnbg, wgdw, t, kguv, cy, spxcq,
uzflbbf, dxtkkn, cxwx, jpd, ztzh, lv, zhpkvnu, l^, r, qee, nynrx, atze4n, ik, se, w, lrh, hp+,
yrqyka’h, zcngotcnx, igcump, zjcjs, lqpWiqu, cefmfhc, o, lb, fdcY, tzby, yopxmvk, by, fz„ t,
govyccm, ijyiduwfzo, 6xr, duh, ejv, pk, pjw, l, fl, w

The second feature added by the algorithm checks if two adjacent characters are lower case:

\phi_2(x) = \sum_{i \sim j} I(x_i \in \{a, \ldots, z\}, x_j \in \{a, \ldots, z\})   (4.26)

Now the model has the form

p(x) = \frac{1}{Z} \exp(\theta_1 \phi_1(x) + \theta_2 \phi_2(x))   (4.27)
Continuing in this way, the algorithm adds features for the strings s> and ing>, where > represents the end
of word, and for various regular expressions such as [0-9], etc. Some samples from the model with 1000
features, generated using (annealed) Gibbs sampling, are shown below.

was, reaser, in, there, to, will, „ was, by, homes, thing, be, reloverated, ther, which, conists,
at, fores, anditing, with, Mr., proveral, the, „ ***, on’t, prolling, prothere, „ mento, at, yaou,
1, chestraing, for, have, to, intrally, of, qut, ., best, compers, ***, cluseliment, uster, of,
is, deveral, this, thise, of, offect, inatever, thifer, constranded, stater, vill, in, thase, in,
youse, menttering, and, ., of, in, verate, of, to

If we define a feature for every possible combination of letters, we can represent any probability distribution.
However, this will overfit. The power of the maxent approach is that we can choose which features matter for the
domain.
An alternative approach is to introduce latent variables, that implicitly model correlations amongst the
visible nodes, rather than explicitly having to learn feature functions. See ?? for an example of such a model.
2 We thank John Lafferty for sharing this example.

Fr(A,A) Fr(B,B) Fr(B,A) Fr(A,B) Sm(A) Sm(B) Ca(A) Ca(B)
1 1 0 1 1 1 1 1
1 1 0 1 1 0 0 0
1 1 0 1 1 1 0 1

Table 4.3: Some possible joint instantiations of the 8 variables in the smoking example.

4.2.4 Relational UGMs


We can create relational UGMs in a manner which is analogous to relational DGMs (??). This is particularly
useful in the discriminative setting, for the same reasons that undirected CRFs are preferable to conditional
DGMs (see ??).
For example, suppose we are interested in the problem of classifying web pages of a university into types
(e.g., student, professor, admin, etc.). Obviously we can do this based on the contents of the page (e.g., words,
pictures, layout, etc.). However, we might also suppose there is information in the hyper-link structure itself.
For example, it might be likely for students to cite professors, and professors to cite other professors, but
there may be no links between admin pages and students/professors. When faced with a web page whose
label is ambiguous, we can bias our estimate based on the estimated labels of its neighbors, as in a CRF.
This process is known as collective classification (see e.g., [Sen+08]). To specify the CRF structure for a
web-graph of arbitrary size and shape, we just specify a template graph and potential functions, and then
unroll the template appropriately to match the topology of the web, making use of parameter tying.

4.2.5 Markov logic networks


One particularly popular way of specifying relational UGMs is to use first-order logic rather than a graphical
description of the template. The result is known as a Markov logic network [RD06; Dom+06; DL09].
For example, consider the sentences “Smoking causes cancer” and “If two people are friends, and one
smokes, then so does the other”. We can write these sentences in first-order logic as follows:
\forall x. \, Sm(x) \implies Ca(x)   (4.28)
\forall x. \forall y. \, Fr(x, y) \wedge Sm(x) \implies Sm(y)   (4.29)

where Sm and Ca are predicates, and Fr is a relation.
It is convenient to write all formulas in conjunctive normal form (CNF), also known as clausal form.
In this case, we get
\neg Sm(x) \vee Ca(x)   (4.30)
\neg Fr(x, y) \vee \neg Sm(x) \vee Sm(y)   (4.31)
The first clause can be read as “Either x does not smoke or he has cancer”, which is logically equivalent
to Equation (4.28). (Note that in a clause, any unbound variable, such as x, is assumed to be universally
quantified.)
Suppose there are just two objects (people) in the world, Anna and Bob, which we will denote by
constant symbols A and B. We can then create 8 binary random variables Sm(x), Ca(x), and Fr(x, y)
for x, y \in \{A, B\}. This defines 2^8 possible worlds, some of which are shown in Table 4.3.^3
Our goal is to define a probability distribution over these joint assignments. We can do this by creating a
UGM with these variables, and adding a potential function to capture each logical rule or constraint. For
example, we can encode the rule ¬Sm(x) ∨ Ca(x) by creating a potential function Ψ(Sm(x), Ca(x)), where
we define

\Psi(Sm(x), Ca(x)) = \begin{cases} 1 & \text{if } \neg Sm(x) \vee Ca(x) = T \\ 0 & \text{if } \neg Sm(x) \vee Ca(x) = F \end{cases}   (4.32)
^3 Note that we have not encoded the fact that Fr is a symmetric relation, so Fr(A, B) and Fr(B, A) might have different
values. Similarly, we have the "degenerate" nodes Fr(A, A) and Fr(B, B), since we did not enforce x \neq y in Equation (4.29). (If we
add such constraints, then the model compiler, which generates the ground network, should avoid creating redundant nodes.)

[Figure 4.5: pairwise MRF over the nodes Friends(A,A), Friends(A,B), Friends(B,A), Friends(B,B), Smokes(A), Smokes(B), Cancer(A), Cancer(B).]

Figure 4.5: An example of a ground Markov logic network represented as a pairwise MRF for 2 people. Adapted from
Figure 2.1 from [DL09]. Used with kind permission of Pedro Domingos.

The result is the UGM in Figure 4.5.


The above approach will assign non-zero probability to all logically valid worlds. However, logical rules
may not always be true. For example, smoking does not always cause cancer. We can relax the hard
constraints by using non-zero potential functions. In particular, we can associate a weight with each rule,
and thus get potentials such as
\Psi(Sm(x), Ca(x)) = \begin{cases} e^w & \text{if } \neg Sm(x) \vee Ca(x) = T \\ e^0 & \text{if } \neg Sm(x) \vee Ca(x) = F \end{cases}   (4.33)

where the value of w > 0 controls how strongly we want to enforce the corresponding rule.
The overall joint distribution has the form

p(x) = \frac{1}{Z(w)} \exp\left( \sum_i w_i n_i(x) \right)   (4.34)

where ni (x) is the number of instances of clause i which evaluate to true in assignment x.
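For the two-person example, the ground model is small enough to evaluate Equation (4.34) by brute-force enumeration of the 2^8 worlds; the following sketch uses hypothetical weights w_1, w_2:

```python
import itertools, math

people = ["A", "B"]
w1, w2 = 1.5, 1.1  # hypothetical clause weights

def n_true(world):
    # world maps ground atoms like ("Sm","A") or ("Fr","A","B") to {0,1}
    n1 = sum((not world[("Sm", x)]) or world[("Ca", x)] for x in people)
    n2 = sum((not world[("Fr", x, y)]) or (not world[("Sm", x)]) or world[("Sm", y)]
             for x in people for y in people)
    return n1, n2  # number of true groundings of each clause

atoms = [("Sm", x) for x in people] + [("Ca", x) for x in people] + \
        [("Fr", x, y) for x in people for y in people]   # 8 ground atoms
worlds = [dict(zip(atoms, vals)) for vals in itertools.product([0, 1], repeat=len(atoms))]
Z = sum(math.exp(w1 * n1 + w2 * n2) for n1, n2 in map(n_true, worlds))
n1, n2 = n_true(worlds[0])
print(math.exp(w1 * n1 + w2 * n2) / Z)  # probability of one particular world
```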
Given a grounded MLN model, we can then perform inference using standard methods. Of course, the
ground models are often extremely large, so more efficient inference methods, which avoid creating the full
ground model (known as lifted inference), must be used. See [DL09; KNP11] for details.
One way to gain tractability is to relax the discrete problem to a continuous one. This is the basic
idea behind hinge-loss MRFs [Bac+15], which support exact inference using scalable convex optimization.
There is a template language for this model family known as probabilistic soft logic, which has a similar
“flavor” to MLN, although it is not quite as expressive.
Recently MLNs have been combined with DL in various ways. For example, [Zha+20] uses graph neural
networks for inference. And [WP18] uses MLNs for evidence fusion, where the noisy predictions come from
DNNs trained using weak supervision.
Finally, it is worth noting one subtlety which arises with undirected models, namely that the size of the
unrolled model, which depends on the number of objects in the universe, can affect the results of inference,
even if we have no data about the new objects. For example, consider an undirected chain of length T , with T
hidden nodes zt and T observed nodes yt ; call this model M1 . Now suppose we double the length of the chain
to 2T, without adding more evidence; call this model M_2. We find that p(z_t|y_{1:T}, M_1) \neq p(z_t|y_{1:T}, M_2), for
t = 1:T, even though we have not added new information, due to the different partition functions. This does
not happen with a directed chain, because the newly added nodes can be marginalized out without affecting
the original nodes, since the model is locally normalized and therefore modular. See [JBB09; Poo+12] for
further discussion.

Chapter 5

Information theory

Chapter 6

Optimization

6.1 Proximal methods


In this section, we discuss a class of optimization algorithms called proximal methods that use as their
basic subroutine the proximal operator of a function, as opposed to its gradient or Hessian. We define this
operator below, but essentially it involves solving a convex subproblem.
Compared to gradient methods, proximal methods are easier to apply to nonsmooth problems (e.g., with
\ell_1 terms), as well as large scale problems that need to be decomposed and solved in parallel. These methods
are widely used in signal and image processing, and in some applications in deep learning (e.g., [BWL19]
uses proximal methods for training quantized DNNs, [Yao+20] uses proximal methods for efficient neural
architecture search, [Sch+17; WHT19] use proximal methods for policy gradient optimization, etc.).
Our presentation is based in part on the tutorial in [PB+14]. For another good review, see [PSW15].

6.1.1 Proximal operators


Let f : Rn → R ∪ {+∞} be a convex function, where f (x) = ∞ means the point is infeasible. Let the effective
domain of f be the set of feasible points:

dom(f ) = {x ∈ Rn : f (x) < ∞} (6.1)

The proximal operator (also called a proximal mapping) of f, denoted prox_f(x) : R^n \to R^n, is
defined by

prox_f(x) = \operatorname{argmin}_z \left( f(z) + \frac{1}{2} ||z - x||_2^2 \right)   (6.2)

The objective is strongly convex and hence has a unique minimizer. This operator is sketched in Figure 6.1a.
We see that points inside the domain move towards the minimum of the function, whereas points outside the
domain move to the boundary and then towards the minimum.
For example, suppose f is the indicator function for the convex set C, i.e.,

f(x) = I_C(x) = \begin{cases} 0 & \text{if } x \in C \\ \infty & \text{if } x \notin C \end{cases}   (6.3)

In this case, the proximal operator is equivalent to projection onto the set C:

proj_C(x) = \operatorname{argmin}_{z \in C} ||z - x||_2   (6.4)

We can therefore think of the prox operator as generalized projection.

[Figure 6.1 panels: (a) level sets of a convex function with prox mappings; (b) the absolute value function and its Moreau envelope, plotted on the range -3 to 3.]

Figure 6.1: (a) Evaluating a proximal operator at various points. The thin lines represent level sets of a convex
function; the minimum is at the bottom left. The black line represents the boundary of its domain. Blue points get
mapped to red points by the prox operator, so points outside the feasible set get mapped to the boundary, and points
inside the feasible set get mapped to closer to the minimum. From Figure 1 of [PB+14]. Used with kind permission of
Stephen Boyd. (b) Illustration of the Moreau envelope with η = 1 (dotted line) of the absolute value function (solid
black line). See text for details. From Figure 1 of [PSW15]. Used with kind permission of Nicholas Polson.

We will often want to compute the prox operator for a scaled function \eta f, for \eta > 0, which can be written
as

prox_{\eta f}(x) = \operatorname{argmin}_z \left( f(z) + \frac{1}{2\eta} ||z - x||_2^2 \right)   (6.5)

The solution to the problem in Equation (6.5) is the same as the solution to the trust region optimization
problem of the form

\operatorname{argmin}_z f(z) \quad \text{s.t.} \quad ||z - x||_2 \le \rho   (6.6)

for appropriate choices of \eta and \rho. Thus the proximal projection minimizes the function while staying close to
the current iterate. We give other interpretations of the proximal operator below.
We can generalize the operator by replacing the Euclidean distance with Mahalanobis distance:

prox_{\eta f, A}(x) = \operatorname{argmin}_z \left( f(z) + \frac{1}{2\eta} (z - x)^T A (z - x) \right)   (6.7)

where A is a psd matrix.

6.1.1.1 Moreau envelope


Let us define the following quadratic approximation to the function f as a function of z, requiring that it
touch f at x:

f_x^\eta(z) = f(z) + \frac{1}{2\eta} ||z - x||_2^2   (6.8)

By definition, the location of the minimum of this function is z^*(x) = \operatorname{argmin}_z f_x^\eta(z) = prox_{\eta f}(x).
For example, consider approximating the function f(x) = |x| at x_0 = 1.5 using f_{x_0}^1(z) = |z| + \frac{1}{2}(z - x_0)^2.
This is shown in Figure 6.1b: the solid black line is f(x), x_0 = 1.5 is the black square, and the light gray
line is f_{x_0}^1(z). The proximal projection of x_0 onto f is z^*(x_0) = \operatorname{argmin}_z f_{x_0}^1(z) = 0.5, which is the minimum
of the quadratic, shown by the red cross. This proximal point is closer to the minimum of f(x) than the
starting point, x_0.
Now let us evaluate the approximation at this proximal point:

f_x^\eta(z^*(x)) = f(z^*) + \frac{1}{2\eta} ||z^* - x||_2^2 = \min_z f(z) + \frac{1}{2\eta} ||z - x||_2^2 \triangleq f^\eta(x)   (6.9)

where f^\eta(x) is called the Moreau envelope of f.

For example, in Figure 6.1b, we see that f_{x_0}^1(z^*) = f_{x_0}^1(0.5) = 1.0, so f^1(x_0) = 1.0. This is shown by the
blue circle. The dotted line is the locus of blue points as we vary x_0, i.e., the Moreau envelope of f.
We see that the Moreau envelope is a smooth lower bound on f , and has the same minimum location as
f . Furthermore, it has domain Rn , even when f does not, and it is continuously differentiable, even when f
is not. This makes it easier to optimize. For example, the Moreau envelope of f (r) = |r| is the Huber loss
function, which is used in robust regression.

6.1.1.2 Prox operator on a linear approximation yields gradient update


Suppose we make a linear approximation of f at the current iterate x_t:

\hat{f}(x) = f(x_t) + g_t^T (x - x_t)   (6.10)

where g_t = \nabla f(x_t). To compute the prox operator, note that

\nabla_z \hat{f}_x^\eta(z) = \nabla_z \left( f(x_t) + g_t^T (z - x_t) + \frac{1}{2\eta} ||z - x_t||_2^2 \right) = g_t + \frac{1}{\eta} (z - x_t)   (6.11)

Solving \nabla_z \hat{f}_x^\eta(z) = 0 yields the standard gradient update:

prox_{\eta \hat{f}}(x) = x - \eta g_t   (6.12)

Thus a prox step is equivalent to a gradient step on a linearized objective.

6.1.1.3 Prox operator on a quadratic approximation yields regularized Newton update


Now suppose we use a second order approximation at x_t:

\hat{f}(x) = f(x_t) + g_t^T (x - x_t) + \frac{1}{2} (x - x_t)^T H_t (x - x_t)   (6.13)

The prox operator for this is

prox_{\eta \hat{f}}(x) = x - \left( H_t + \frac{1}{\eta} I \right)^{-1} g_t   (6.14)

6.1.1.4 Prox operator as gradient descent on a smoothed objective


Prox operators are arguably most useful for nonsmooth functions for which we cannot make a Taylor series
approximation. Instead, we will optimize the Moreau envelope, which is a smooth approximation.
In particular, from Equation (6.9), we have

f^\eta(x) = f(prox_{\eta f}(x)) + \frac{1}{2\eta} ||x - prox_{\eta f}(x)||_2^2   (6.15)

Hence the gradient of the Moreau envelope is given by

\nabla_x f^\eta(x) = \frac{1}{\eta} (x - prox_{\eta f}(x))   (6.16)

Thus we can rewrite the prox operator as

prox_{\eta f}(x) = x - \eta \nabla f^\eta(x)   (6.17)
Thus a prox step is equivalent to a gradient step on the smoothed objective.

6.1.2 Computing proximal operators


In this section, we briefly discuss how to compute proximal operators for various functions that are useful in
ML, either as regularizers or constraints. More examples can be found in [PB+14; PSW15].

6.1.2.1 Moreau decomposition
A useful technique for computing some kinds of proximal operators leverages a result known as Moreau
decomposition, which states that

x = prox_f(x) + prox_{f^*}(x)   (6.18)

where f^* is the convex conjugate of f (see Section 6.5).
For example, suppose f = || \cdot || is a general norm on R^D. It can be shown that f^* = I_B, where

B = \{x : ||x||_* \le 1\}   (6.19)

is the unit ball for the dual norm || \cdot ||_*, defined by

||z||_* = \sup \{z^T x : ||x|| \le 1\}   (6.20)

Hence

prox_{\lambda f}(x) = x - \lambda \, prox_{f^*/\lambda}(x/\lambda) = x - \lambda \, proj_B(x/\lambda)   (6.21)
Thus there is a close connection between proximal operators of norms and projections onto norm balls that
we will leverage below.

6.1.2.2 Projection onto box constraints


Let C = \{x : l \le x \le u\} be a box or hyper-rectangle, imposing lower and upper bounds on each element.
(These bounds can be infinite for certain elements if we don't want to constrain values along that dimension.)
The projection operator is easy to compute elementwise by simply thresholding at the boundaries:

proj_C(x)_d = \begin{cases} l_d & \text{if } x_d \le l_d \\ x_d & \text{if } l_d \le x_d \le u_d \\ u_d & \text{if } x_d \ge u_d \end{cases}   (6.22)

For example, if we want to ensure all elements are non-negative, we can use

proj_C(x) = x_+ = [\max(x_1, 0), \ldots, \max(x_D, 0)]   (6.23)

6.1.2.3 `1 norm
Consider the 1-norm f(x) = ||x||_1. The proximal projection can be computed componentwise. We can solve
each 1d problem as follows:

prox_{\lambda f}(x) = \operatorname{argmin}_z \left( \lambda |z| + \frac{1}{2} (z - x)^2 \right)   (6.24)

One can show that the solution to this is given by

prox_{\lambda f}(x) = \begin{cases} x - \lambda & \text{if } x \ge \lambda \\ 0 & \text{if } |x| \le \lambda \\ x + \lambda & \text{if } x \le -\lambda \end{cases}   (6.25)

This is known as the soft thresholding operator, since values less than \lambda in absolute value are set to 0
(thresholded), but in a differentiable way. This is useful for enforcing sparsity. Note that soft thresholding
can be written more compactly as

SoftThreshold_\lambda(x) = \operatorname{sign}(x) (|x| - \lambda)_+   (6.26)

where x_+ = \max(x, 0) is the positive part of x. In the vector case, we define SoftThreshold_\lambda(x) to be
elementwise soft thresholding.
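In code, the elementwise vector case is essentially a one-liner; the test values below are hypothetical:

```python
import numpy as np

def soft_threshold(x, lam):
    """Elementwise soft thresholding, Equation (6.26)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

print(soft_threshold(np.array([-2.0, -0.3, 0.5, 3.0]), 1.0))  # [-1.  0.  0.  2.]
```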

6.1.2.4 `2 norm
Now consider the \ell_2 norm f(x) = ||x||_2 = \sqrt{\sum_{d=1}^D x_d^2}. The dual norm for this is also the \ell_2 norm. Projecting
onto the corresponding unit ball B can be done by simply scaling vectors that lie outside the unit sphere:

proj_B(x) = \begin{cases} \frac{x}{||x||_2} & \text{if } ||x||_2 > 1 \\ x & \text{if } ||x||_2 \le 1 \end{cases}   (6.27)

Hence by the Moreau decomposition we have

prox_{\lambda f}(x) = (1 - \lambda/||x||_2)_+ \, x = \begin{cases} (1 - \frac{\lambda}{||x||_2}) x & \text{if } ||x||_2 \ge \lambda \\ 0 & \text{otherwise} \end{cases}   (6.28)

This will set the whole vector to zero if its \ell_2 norm is less than \lambda. This is therefore called block soft
thresholding.

6.1.2.5 Squared `2 norm


Now consider using the squared \ell_2 norm (scaled by 0.5), f(x) = \frac{1}{2} ||x||_2^2 = \frac{1}{2} \sum_{d=1}^D x_d^2. One can show that

prox_{\lambda f}(x) = \frac{1}{1 + \lambda} x   (6.29)

This reduces the magnitude of the x vector, but does not enforce sparsity. It is therefore called the shrinkage
operator.
More generally, if f(x) = \frac{1}{2} x^T A x + b^T x + c is a quadratic, with A being positive definite, then

prox_{\lambda f}(x) = (I + \lambda A)^{-1} (x - \lambda b)   (6.30)

A special case of this is if f is affine, f(x) = b^T x + c. Then we have prox_{\lambda f}(x) = x - \lambda b. We saw an example
of this in Equation (6.12).

6.1.2.6 Nuclear norm

The nuclear norm, also called the trace norm, of an m \times n matrix A is the \ell_1 norm of its singular
values: f(A) = ||A||_* = ||\sigma||_1. Using this as a regularizer can result in a low rank matrix. The proximal
operator for this is defined by

prox_{\lambda f}(A) = \sum_i (\sigma_i - \lambda)_+ u_i v_i^T   (6.31)

where A = \sum_i \sigma_i u_i v_i^T is the SVD of A. This operation is called singular value thresholding.

6.1.2.7 Projection onto positive definite cone

Consider the cone of positive semidefinite matrices C, and let f(A) = I_C(A) be the indicator function. The
proximal operator corresponds to projecting A onto the cone. This can be computed using

proj_C(A) = \sum_i (\lambda_i)_+ u_i u_i^T   (6.32)

where \sum_i \lambda_i u_i u_i^T is the eigenvalue decomposition of A. This is useful for optimizing psd matrices.

6.1.2.8 Projection onto probability simplex
Let C = \{x : x \ge 0, \sum_{d=1}^D x_d = 1\} = S_D be the probability simplex in D dimensions. We can project onto
this using

proj_C(x) = (x - \nu 1)_+   (6.33)

The value \nu \in R must be found using bisection search. See [PB+14, p.183] for details. This is useful for
optimizing over discrete probability distributions.
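Here is a minimal sketch of this projection, using a simple bisection on \nu (this is our own illustration, not the exact routine in [PB+14]):

```python
import numpy as np

def project_simplex(x, n_iters=50):
    """Bisect on nu until (x - nu)_+ sums to 1; the sum is monotone decreasing in nu."""
    lo, hi = x.min() - 1.0, x.max()   # sum >= 1 at lo, sum = 0 at hi
    for _ in range(n_iters):
        nu = 0.5 * (lo + hi)
        if np.maximum(x - nu, 0.0).sum() > 1.0:
            lo = nu   # sum too big: increase nu
        else:
            hi = nu   # sum too small: decrease nu
    return np.maximum(x - nu, 0.0)

p = project_simplex(np.array([0.3, 1.2, -0.5]))
print(p, p.sum())  # nonnegative entries summing to ~1, here [0.05, 0.95, 0.0]
```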

6.1.3 Proximal point methods (PPM)


A proximal point method (PPM), also called a proximal minimization algorithm, iteratively applies
the following update:

\theta_{t+1} = prox_{\eta_t L}(\theta_t) = \operatorname{argmin}_\theta \left( L(\theta) + \frac{1}{2\eta_t} ||\theta - \theta_t||_2^2 \right)   (6.34)
where we assume L : Rn → R ∪ {+∞} is a closed proper convex function. The advantage of this method over
minimizing L directly is that sometimes adding quadratic regularization can improve the conditioning of the
problem, and hence speed convergence.

6.1.3.1 Stochastic and incremental PPM


PPM can be extended to the stochastic setting, where the goal is to optimize L(\theta) = E_{q(z)}[\ell(\theta, z)], by using
the following stochastic update:

\theta_{t+1} = prox_{\eta_t \ell_t}(\theta_t) = \operatorname{argmin}_\theta \left( \ell_t(\theta) + \frac{1}{2\eta_t} ||\theta - \theta_t||_2^2 \right)   (6.35)

where \ell_t(\theta) = \ell(\theta, z_t) and z_t \sim q. The resulting method is known as stochastic PPM (see e.g., [PN18]). If q
is the empirical distribution associated with a finite-sum objective, this is called the incremental proximal
point method [Ber15]. It is often more stable than SGD.
In the case where the cost function is a linear least squares problem, one can show [AEM18] that the
IPPM is equivalent to the Kalman filter (??), where the posterior mean is equal to the current parameter
estimate, \theta_t. The advantage of this probabilistic perspective is that it also gives us the posterior covariance,
which can be used to define a variable-metric distance function inside the prox operator, as in Equation (6.7).
We can extend this to nonlinear problems using the extended KF (??).

6.1.3.2 SGD is PPM on a linearized objective


We now show that SGD is PPM on a linearized objective. To see this, let the approximation at the current
iterate be

\hat{\ell}_t(\theta) = \ell_t(\theta_t) + g_t^T (\theta - \theta_t)   (6.36)

where g_t = \nabla_\theta \ell_t(\theta_t). Now we compute a proximal update to this approximate objective:

\theta_{t+1} = prox_{\eta_t \hat{\ell}_t}(\theta_t) = \operatorname{argmin}_\theta \left( \hat{\ell}_t(\theta) + \frac{1}{2\eta_t} ||\theta - \theta_t||_2^2 \right)   (6.37)

We have

\nabla_\theta \left( \ell_t(\theta_t) + g_t^T (\theta - \theta_t) + \frac{1}{2\eta_t} ||\theta - \theta_t||_2^2 \right) = g_t + \frac{1}{\eta_t} (\theta - \theta_t)   (6.38)

Setting the gradient to zero yields the SGD step \theta_{t+1} = \theta_t - \eta_t g_t.

[Figure 6.2 panels (a) and (b); legend: SGM, Truncated, adam, trunc-adagrad; horizontal axis: initial learning rate, from 10^{-3} to 10^3.]

Figure 6.2: Illustration of the benefits of using a lower-bounded loss function when training a resnet-128 CNN on
the CIFAR10 image classification dataset. The curves are as follows: SGM (stochastic gradient method, i.e., SGD),
Adam, truncated SGD and truncated AdaGrad. (a) Time to reach an error that satisfies L(\theta_t) - L(\theta^*) \le \epsilon vs initial
learning rate \eta_0. (b) Top-1 accuracy after 50 epochs vs \eta_0. The lines represent median performance across 50 random
restarts, and shading represents 90% confidence intervals. From Figure 4 of [AD19c]. Used with kind permission of
Hilal Asi.

6.1.3.3 Beyond linear approximations (truncated AdaGrad)

Sometimes we can do better than just using PPM with a linear approximation to the objective, at essentially
no extra cost, as pointed out in [AD19b; AD19a; AD19c]. For example, suppose we know a lower bound on
the loss, \ell_t^{min} = \min_\theta \ell_t(\theta). For example, when using squared error, or cross-entropy loss for discrete labels,
we have \ell_t(\theta) \ge 0. Let us therefore define the truncated model

\hat{\ell}_t(\theta) = \max\left( \ell_t(\theta_t) + g_t^T (\theta - \theta_t), \, \ell_t^{min} \right)   (6.39)

We can further improve things by replacing the Euclidean norm with a scaled Euclidean norm, where
the diagonal scaling matrix is given by A_t = \operatorname{diag}(\sum_{i=1}^t g_i g_i^T)^{1/2}, as in AdaGrad [DHS11]. If \ell_t^{min} = 0, the
resulting proximal update becomes

\theta_{t+1} = \operatorname{argmin}_\theta \left( \max\left( \ell_t(\theta_t) + g_t^T (\theta - \theta_t), \, 0 \right) + \frac{1}{2\eta_t} (\theta - \theta_t)^T A_t (\theta - \theta_t) \right)   (6.40)

= \theta_t - \min\left( \eta_t, \frac{\ell_t(\theta_t)}{g_t^T A_t^{-1} g_t} \right) g_t   (6.41)

Thus the update is like a standard SGD update, but we truncate the learning rate if it is too big.1
[AD19c] call this truncated AdaGrad. Furthermore, they prove optimizing this truncated linear
approximation (with or without AdaGrad weighting), instead of the standard linear approximation used by
gradient descent, can result in significant benefits. In particular, it is guaranteed to be stable (under certain
technical conditions) for any learning rate, whereas standard GD can “blow up”, even for convex problems.
Figure 6.2 shows the benefits of this approach when training a resnet-128 CNN (??) on the CIFAR10
image classification dataset. For SGD and the truncated proximal method, the learning rate is decayed using
ηt = η0 t−β with β = 0.6. For Adam and truncated AdaGrad, the learning rate is set to ηt = η0 , since
we use diagonal scaling. We see that both truncated methods (regular and AdaGrad version) have good
performance for a much broader range of initial learning rate η0 compared to SGD or Adam.

^1 One way to derive this update (suggested by Hilal Asi) is to do case analysis on the value of \hat{\ell}_t(\theta_{t+1}), where \hat{\ell}_t is the
truncated linear model. If \hat{\ell}_t(\theta_{t+1}) > 0, then setting the gradient to zero yields the usual SGD update, \theta_{t+1} = \theta_t - \eta_t g_t. (We
assume A_t = I for simplicity.) Otherwise we must have \hat{\ell}_t(\theta_{t+1}) = 0. But we know that \theta_{t+1} = \theta_t - \lambda g_t for some \lambda, so we
solve \hat{\ell}_t(\theta_t - \lambda g_t) = 0 to get \lambda = \hat{\ell}_t(\theta_t)/||g_t||_2^2.
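The following sketch implements the truncated update for the special case A_t = I and \ell_t^{min} = 0, on a hypothetical 1d quadratic; note that it remains stable even with a huge learning rate, illustrating the stability claim above.

```python
import numpy as np

def truncated_sgd_step(theta, loss, grad, eta):
    g = grad(theta)
    step = min(eta, loss(theta) / (g @ g))  # truncate the learning rate, Equation (6.41) with A_t = I
    return theta - step * g

loss = lambda th: 0.5 * float(th @ th)  # hypothetical loss with l_min = 0
grad = lambda th: th
theta = np.array([10.0])
for _ in range(5):
    theta = truncated_sgd_step(theta, loss, grad, eta=100.0)  # huge eta, yet no blow-up
    print(theta)  # halves each step: 5.0, 2.5, 1.25, ...
```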

6.1.4 Mirror descent
We can extend the proximal point update in Equation (6.34) by replacing the Euclidean distance term
||\theta - \theta_t||_2^2 by a more general Bregman divergence (??),

D_h(x, y) = h(x) - \left( h(y) + \nabla h(y)^T (x - y) \right)   (6.42)

where h(x) is a strongly convex function. This gives the following update:

\theta_{t+1} = \operatorname{argmin}_\theta \left( L(\theta) + \frac{1}{\eta_t} D_h(\theta, \theta_t) \right)   (6.43)
Suppose we make a linear approximation to L(\theta):

\hat{L}_t(\theta) = L(\theta_t) + g_t^T (\theta - \theta_t)   (6.44)

where g_t = \nabla_\theta L(\theta_t). If we perform a proximal update with this linear approximation using Euclidean
distance, we get a standard gradient update, as we showed in Section 6.1.3.2. However, combining this with
the Bregman divergence gives the following update:

\theta_{t+1} = \operatorname{argmin}_\theta \left( \eta_t g_t^T \theta + D_h(\theta, \theta_t) \right)   (6.45)

This is known as mirror descent [NY83; BT03]. This can easily be extended to the stochastic setting in
the obvious way.
One can show that natural gradient descent (??) is a form of mirror descent [RM15]. More precisely, mirror
descent in the mean parameter space is equivalent to natural gradient descent in the canonical parameter
space.
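As a concrete instance, if h is the negative entropy on the probability simplex, then D_h is the KL divergence, and Equation (6.45) has a closed-form multiplicative solution, often called exponentiated gradient. Here is a minimal sketch with a hypothetical linear loss:

```python
import numpy as np

def mirror_descent_step(theta, g, eta):
    theta = theta * np.exp(-eta * g)  # solves argmin_th eta * g^T th + KL(th, theta_t)
    return theta / theta.sum()        # renormalize onto the simplex

c = np.array([0.3, 0.1, 0.6])         # minimize the linear loss c^T theta over the simplex
theta = np.ones(3) / 3
for _ in range(100):
    theta = mirror_descent_step(theta, c, eta=0.5)
print(theta)  # mass concentrates on argmin(c), the second coordinate
```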

6.1.5 Proximal gradient method


We are often interested in optimizing a composite objective of the form
L(\theta) = L_s(\theta) + L_r(\theta)   (6.46)

where L_s is convex and differentiable (smooth), and L_r is convex but not necessarily differentiable (i.e., it
may be non-smooth or "rough"). For example, L_r might be an \ell_1 norm regularization term, and L_s might be
the NLL for linear regression (see Section 6.1.5.1).
The proximal gradient method is the following update:
θt+1 = proxηt Lr (θt − ηt ∇Ls (θt )) (6.47)
If Lr = IC , this is equivalent to projected gradient descent. If Lr = 0, this is equivalent to gradient
descent. If Ls = 0, this is equivalent to a proximal point method.
We can create a version of the proximal gradient method with Nesterov acceleration as follows:
θ̃t+1 = θt + βt (θt − θt−1 ) (6.48)
θt+1 = proxηt Lr (θ̃t+1 − ηt ∇Ls (θ̃t+1 )) (6.49)
See e.g., [Tse08].
Now we consider the stochastic case, where Ls (θ) = E [Ls (θ, z)]. (We assume Lr is deterministic.) In
this setting, we can use the following stochastic update:
θt+1 = proxηt Lr (θt − ηt ∇Ls (θt , zt )) (6.50)
where zt ∼ q. This is called the stochastic proximal gradient method. If q is the empirical distribution,
this is called the incremental proximal gradient method [Ber15]. Both methods can also be accelerated
(see e.g., [Nit14]).
If Ls is not convex, we can compute a locally convex approximation, as in Section 6.1.3.3. (We assume Lr
remains convex.) The accelerated version of this is studied in [LL15]. In the stochastic case, we can similarly
make a locally convex approximation to Ls (θ, z). This is studied in [Red+16; LL18]. An EKF interpretation
in the incremental case (where q = pD ) is given in [Aky+19].

6.1.5.1 Example: Iterative soft-thresholding algorithm (ISTA) for sparse linear regression
Suppose we are interested in fitting a linear regression model with a sparsity-promoting prior on the weights, as
in the lasso model (??). One way to implement this is to add the ℓ_1-norm of the parameters as a (non-smooth)
penalty term, L_r(θ) = ||θ||_1 = Σ_{d=1}^D |θ_d|. Thus the objective is

  L(θ) = L_s(θ) + L_r(θ) = (1/2) ||Xθ − y||_2^2 + λ ||θ||_1    (6.51)
The proximal gradient descent update can be written as

θt+1 = SoftThresholdηt λ (θt − ηt ∇Ls (θt )) (6.52)

where the soft thresholding operator (Equation (6.26)) is applied elementwise, and ∇Ls (θ) = XT (Xθ − y).
This is called the iterative soft thresholding algorithm or ISTA [DDDM04; Don95]. If we combine this
with Nesterov acceleration, we get the method known as “fast ISTA” or FISTA [BT09], which is widely used
to fit sparse linear models.
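Here is a minimal NumPy sketch of ISTA (Equation (6.52)); the step size choice below, 1/||X||_2^2, is a standard assumption (the Lipschitz constant of the smooth part) rather than something specified in the text:

import numpy as np

def soft_threshold(v, t):
    # Elementwise soft thresholding operator.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(X, y, lam, n_steps=500):
    # Iterative soft thresholding for the lasso objective (6.51).
    theta = np.zeros(X.shape[1])
    eta = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / largest singular value^2
    for _ in range(n_steps):
        grad = X.T @ (X @ theta - y)        # gradient of the smooth part
        theta = soft_threshold(theta - eta * grad, eta * lam)
    return theta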

6.1.6 Alternating direction method of multipliers (ADMM)


Consider the problem of optimizing L(x) = Ls (x) + Lr (x) where now both Ls and Lr may be non-smooth
(but we assume both are convex). We may want to optimize these problems independently (e.g., so we can
do it in parallel), but need to ensure the solutions are consistent.
One way to do this is by using the variable splitting trick combined with constrained optimization:

minimize Ls (x) + Lr (z) s.t. x − z = 0 (6.53)

This is called consensus form.


The corresponding augmented Lagrangian is given by

  L_ρ(x, z, y) = L_s(x) + L_r(z) + y^T (x − z) + (ρ/2) ||x − z||_2^2    (6.54)
where ρ > 0 is the penalty strength, and y ∈ Rn are the dual variables associated with the consistency
constraint. We can now perform the following block coordinate descent updates:

  x_{t+1} = argmin_x L_ρ(x, z_t, y_t)    (6.55)
  z_{t+1} = argmin_z L_ρ(x_{t+1}, z, y_t)    (6.56)
  y_{t+1} = y_t + ρ (x_{t+1} − z_{t+1})    (6.57)

We see that the dual variable is the (scaled) running sum of the consensus errors.
Inserting the definition of L_ρ(x, z, y) gives us the following more explicit update equations:

  x_{t+1} = argmin_x ( L_s(x) + y_t^T x + (ρ/2) ||x − z_t||_2^2 )    (6.58)
  z_{t+1} = argmin_z ( L_r(z) − y_t^T z + (ρ/2) ||x_{t+1} − z||_2^2 )    (6.59)

If we combine the linear and quadratic terms, we get

  x_{t+1} = argmin_x ( L_s(x) + (ρ/2) ||x − z_t + (1/ρ) y_t||_2^2 )    (6.60)
  z_{t+1} = argmin_z ( L_r(z) + (ρ/2) ||x_{t+1} − z − (1/ρ) y_t||_2^2 )    (6.61)

Figure 6.3: Robust PCA applied to some frames from a surveillance video. First column is the input image. Second
column is the low-rank background model. Third column is the sparse foreground model. Last column is the derived
foreground mask. From Figure 1 of [Bou+17]. Used with kind permission of Thierry Bouwmans.

Finally, if we define u_t = (1/ρ) y_t and λ = 1/ρ, we can write this in a more general way:

  x_{t+1} = prox_{λ L_s}(z_t − u_t)    (6.62)
  z_{t+1} = prox_{λ L_r}(x_{t+1} + u_t)    (6.63)
  u_{t+1} = u_t + x_{t+1} − z_{t+1}    (6.64)
This is called the alternating direction method of multipliers or ADMM algorithm. The advantage
of this method is that the different terms in the objective (along with any constraints they may have) are
handled completely independently, allowing different solvers to be used. Furthermore, the method can be
extended to the stochastic setting as shown in [ZK14].
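To make the scaled updates in Equations (6.62) to (6.64) concrete, here is a minimal sketch of ADMM for the lasso objective of Section 6.1.5.1 (our own illustrative example): the prox of the smooth part is a regularized least squares solve, and the prox of the ℓ_1 part is soft thresholding.

import numpy as np

def admm_lasso(X, y, lam, rho=1.0, n_steps=100):
    # Scaled ADMM (Equations (6.62)-(6.64)) for the lasso, splitting
    # L_s(x) = 0.5 ||Xx - y||^2 and L_r(z) = lam ||z||_1.
    n, d = X.shape
    x = z = u = np.zeros(d)
    XtX_rhoI = X.T @ X + rho * np.eye(d)    # factor for the x-update
    Xty = X.T @ y
    for _ in range(n_steps):
        # prox of L_s: ridge-like linear solve
        x = np.linalg.solve(XtX_rhoI, Xty + rho * (z - u))
        # prox of L_r: elementwise soft thresholding
        v = x + u
        z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0.0)
        u = u + x - z                       # dual (running sum) update
    return z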

6.1.6.1 Example: robust PCA


In this section, we give an example of ADMM from [PB+14, Sec. 7.2].
Consider the following matrix decomposition problem:

  minimize_{X_{1:J}} Σ_{j=1}^J γ_j φ_j(X_j)  s.t.  Σ_{j=1}^J X_j = A    (6.65)

where A ∈ Rm×n is a given data matrix, Xj ∈ Rm×n are the optimization variables, and γj > 0 are trade-off
parameters.
For example, suppose we want to find a good least squares approximation to A as a sum of a low rank
matrix plus a sparse matrix. This is called robust PCA [Can+11], since the sparse matrix can handle the
small number of outliers that might otherwise cause the rank of the approximation to be high. The method
is often used to decompose surveillance videos into a low rank model for the static background, and a sparse
model for the dynamic foreground objects, such as moving cars or people, as illustrated in Figure 6.3. (See
e.g., [Bou+17] for a review.) RPCA can also be used to remove small “outliers”, such as specularities and
shadows, from images of faces, to improve face recognition.
We can formulate robust PCA as the following optimization problem:

  minimize_{L,S} ||A − (L + S)||_F^2 + γ_L ||L||_* + γ_S ||S||_1    (6.66)

which is a sparse plus low rank decomposition of the observed data matrix. We can reformulate this to match
the form of a canonical matrix decomposition problem by defining X_1 = L, X_2 = S, and X_3 = A − (X_1 + X_2),
and then using these loss functions:

  φ_1(X_1) = ||X_1||_*,  φ_2(X_2) = ||X_2||_1,  φ_3(X_3) = ||X_3||_F^2    (6.67)

We can tackle such matrix decomposition problems using ADMM, where we use the split L_s(X) =
Σ_j γ_j φ_j(X_j) and L_r(X) = I_C(X), where X = (X_1, . . . , X_J) and C = {X_{1:J} : Σ_{j=1}^J X_j = A}. The overall
algorithm becomes

  X_{j,t+1} = prox_{η_t φ_j}( X_{j,t} − X̄_t + (1/J) A − U_t )    (6.68)
  U_{t+1} = U_t + X̄_{t+1} − (1/J) A    (6.69)

where X̄ is the elementwise average of X_1, . . . , X_J. Note that the X_j can be updated in parallel.
Projection onto the ℓ_1 norm is discussed in Section 6.1.2.3, projection onto the nuclear norm is discussed
in Section 6.1.2.6, projection onto the squared Frobenius norm is the same as projection onto the squared
Euclidean norm discussed in Section 6.1.2.5, and projection onto the constraint set Σ_j X_j = A can be done
using the averaging operator:

  proj_C(X_1, . . . , X_J) = (X_1, . . . , X_J) − X̄ + (1/J) A    (6.70)
An alternative to using `1 minimization in the inner loop is to use hard thresholding [CGJ17]. Although
not convex, this method can be shown to converge to the global optimum, and is much faster.
It is also possible to formulate a non-negative version of robust PCA. Even though NRPCA is not a
convex problem, it is possible to find the globally optimal solution [Fat18; AS19].
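The key primitives in the RPCA updates above are the proximal operators of the individual φ_j terms. A minimal NumPy sketch of the two non-trivial ones (singular value thresholding for the nuclear norm, and elementwise soft thresholding for the ℓ_1 norm):

import numpy as np

def prox_nuclear(V, t):
    # Proximal operator of t * ||.||_* : soft threshold the singular values.
    U, s, Vt = np.linalg.svd(V, full_matrices=False)
    return U @ np.diag(np.maximum(s - t, 0.0)) @ Vt

def prox_l1(V, t):
    # Proximal operator of t * ||.||_1 : elementwise soft thresholding.
    return np.sign(V) * np.maximum(np.abs(V) - t, 0.0)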

6.2 Local search


In this section, we discuss heuristic optimization algorithms that try to find the global maximum in a discrete,
unstructured search space. These algorithms replace the local gradient based update, which has the form
θ_{t+1} = θ_t + η_t d_t, with the following discrete analog:

  x_{t+1} = argmax_{x ∈ nbr(x_t)} L(x)    (6.71)

where nbr(x_t) ⊆ X is the set of neighbors of x_t. This is called hill climbing, steepest ascent, or greedy
search.
If the “neighborhood” of a point contains the entire space, Equation (6.71) will return the global optimum
in one step, but usually such a global neighborhood is too large to search exhaustively. Consequently we
usually define local neighborhoods. For example, consider the 8-queens problem. Here the goal is to
place queens on an 8 × 8 chessboard so that they don’t attack each other (see Figure 6.6). The state space
has the form X = 648 , since we have to specify the location of each queen on the grid. However, due to the
constraints, there are only 88 ≈ 17M feasible states. We define the neighbors of a state to be all possible
states generated by moving a single queen to another square in the same column, so each node has 8 × 7 = 56
neighbors. According to [RN10, p.123], if we start at a randomly generated 8-queens state, steepest ascent
gets stuck at a local maximum 86% of the time, so it only solves 14% of problem instances. However, it is
fast, taking an average of 4 steps when it succeeds and 3 when it gets stuck.
In the sections below, we discuss slightly smarter algorithms that are less likely to get stuck in local
maxima.

6.2.1 Stochastic local search


Hill climbing is greedy, since it picks the best point in its local neighborhood, by solving Equation (6.71)
exactly. One way to reduce the chance of getting stuck in local maxima is to approximately maximize this
objective at each step. For example, we can define a probability distribution over the uphill neighbors,
proportional to how much they improve, and then sample one at random. This is called stochastic hill
climbing. If we gradually decrease the entropy of this probability distribution (so we become greedier over
time), we get a method called simulated annealing, which we discuss in ??.
Another simple technique is to use greedy hill climbing, but then whenever we reach a local maximum,
we start again from a different random starting point. This is called random restart hill climbing. To
see the benefit of this, consider again the 8-queens problem. If each hill-climbing search has a probability of
p ≈ 0.14 of success, then we expect to need R = 1/p ≈ 7 restarts until we find a valid solution. The expected
number of total steps can be computed as follows. Let N1 = 4 be the average number of steps for successful
trials, and N0 = 3 be the average number of steps for failures. Then the total number of steps on average is
N1 + (R − 1)N0 = 4 + 6 × 3 = 22. Since each step is quick, the overall method is very fast. For example, it
can solve an n-queens problem with n = 1M in under a minute.
Of course, solving the n-queens problem is not the most useful task in practice. However, it is typical of
several real-world boolean satisfiability problems, which arise in problems ranging from AI planning to
model checking (see e.g., [SLM92]). In such problems, simple stochastic local search (SLS) algorithms of
the kind we have discussed work surprisingly well (see e.g., [HS05]).
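A minimal Python sketch of random restart hill climbing for the n-queens problem. (Here we minimize the number of attacking pairs rather than maximizing non-attacking pairs; the two are equivalent. The function names are our own.)

import random

def conflicts(state):
    # Number of attacking queen pairs; state[c] is the row of the queen in column c.
    n = len(state)
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if state[i] == state[j] or abs(state[i] - state[j]) == j - i)

def hill_climb(n=8):
    # Greedy hill climbing: move one queen within its column at a time.
    state = [random.randrange(n) for _ in range(n)]
    while True:
        nbrs = [state[:c] + [r] + state[c + 1:]
                for c in range(n) for r in range(n) if r != state[c]]
        best = min(nbrs, key=conflicts)
        if conflicts(best) >= conflicts(state):
            return state                      # local optimum (may still have conflicts)
        state = best

def random_restart(n=8):
    # Restart until a conflict-free solution is found.
    while True:
        state = hill_climb(n)
        if conflicts(state) == 0:
            return state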

6.2.2 Tabu search

Algorithm 1: Tabu search.

1   t := 0 // counts iterations
2   c := 0 // counts number of steps with no progress
3   Initialize x_0
4   x* := x_0 // current best incumbent
5   while c < c_max do
6       x_{t+1} = argmax_{x ∈ nbr(x_t) \ {x_{t−τ}, . . . , x_{t−1}}} f(x)
7       if f(x_t) > f(x*) then
8           x* := x_t
9           c := 0
10      else
11          c := c + 1
12      t := t + 1
13  return x*

Hill climbing will stop as soon as it reaches a local maximum or a plateau. Obviously one can perform a
random restart, but this would ignore all the information that had been gained up to this point. A more
intelligent alternative is called tabu search [GL97]. This is like hill climbing, except it allows moves that
decrease (or at least do not increase) the scoring function, provided the move is to a new state that has not
been seen before. We can enforce this by keeping a tabu list which tracks the τ most recently visited states.
This forces the algorithm to explore new states, and increases the chances of escaping from local maxima.
We continue to do this for up to cmax steps (known as the “tabu tenure”). The pseudocode can be found in
Algorithm 1. (If we set cmax = 1, we get greedy hill climbing.)
For example, consider what happens when tabu search reaches a hill top, xt . At the next step, it will move
to one of the neighbors of the peak, xt+1 ∈ nbr(xt ), which will have a lower score. At the next step, it will
move to the neighbor of the previous step, xt+2 ∈ nbr(xt+1 ); the tabu list prevents it cycling back to xt (the
peak), so it will be forced to pick a neighboring point at the same height or lower. It continues in this way,
“circling” the peak, possibly being forced downhill to a lower level-set (an inverse basin flooding operation),
until it finds a ridge that leads to a new peak, or until it exceeds a maximum number of non-improving moves.
According to [RN10, p.123], tabu search increases the percentage of 8-queens problems that can be solved
from 14% to 94%, although this variant takes an average of 21 steps for each successful instance and 64 steps
for each failed instance.


Figure 6.4: Illustration of grid search (left) vs random search (right). From Figure 1 of [BB12]. Used with kind
permission of James Bergstra.

6.2.3 Random search


A surprisingly effective strategy in problems where we know nothing about the objective is to use random
search. In this approach, each iterate xt+1 is chosen uniformly at random from X . This should always be
tried as a baseline.
In [BB12], they applied this technique to the problem of hyper-parameter optimization for some ML
models, where the objective is performance on a validation set. In their examples, the search space is
continuous, Θ = [0, 1]D . It is easy to sample from this at random. The standard alternative approach is to
quantize the space into a fixed set of values, and then to evaluate them all; this is known as grid search.
(Of course, this is only feasible if the number of dimensions D is small.) They found that random search
outperformed grid search. The intuitive reason for this is that many hyper-parameters do not make much
difference to the objective function, as illustrated in Figure 6.4. Consequently it is a waste of time to place a
fine grid along such unimportant dimensions.
RS has also been used to optimize the parameters of MDP policies, where the objective f(x) = E_{τ∼π_x}[R(τ)]
is the expected reward of trajectories generated by using a policy with parameters x.
For policies with few free parameters, RS can outperform more sophisticated reinforcement learning methods
described in ??, as shown in [MGR18]. In cases where the policy has a large number of parameters, it is
sometimes possible to project them to a lower dimensional random subspace, and perform optimization
(either grid search or random search) in this subspace [Li+18].
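A minimal sketch of random search over the unit hypercube (an illustrative example of ours, not a specific library API):

import numpy as np

def random_search(objective, n_dims, n_trials=100, rng=None):
    # Sample candidates uniformly at random from [0, 1]^D and keep the best.
    rng = rng or np.random.default_rng(0)
    best_x, best_val = None, -np.inf
    for _ in range(n_trials):
        x = rng.uniform(size=n_dims)
        val = objective(x)
        if val > best_val:
            best_x, best_val = x, val
    return best_x, best_val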

6.3 Population-based optimization


Stochastic local search (SLS) maintains a single “best guess” at each step, xt . If we run this for T steps, and
restart K times, the total cost is T K. A natural alternative is to maintain a set or population of K good
candidates, St , which we try to improve at each step. This is called an evolutionary algorithm (EA). If
we run this for T steps, it also takes T K time; however, it can often get better results than multi-restart SLS,
since the search procedure explores more of the space in parallel, and information from different members of
the population can be shared. Many versions of EA are possible, depending on how we update the population
at each step, as we discuss in Section 6.3.1.
An alternative to maintaining an explicit set of promising candidates is to maintain a probability
distribution over promising candidates. We call this distribution-based optimization (DBO).² We can
think of EAs as using a nonparametric representation of this distribution in terms of a “bag of points”. We
² Note that the terms "probabilistic optimization" and "stochastic optimization" have already been "taken". (Probabilistic
optimization is often used to refer to Bayesian optimization (??), and stochastic optimization refers to any optimization
problem in which the objective is stochastic.) Another term that is used for DBO is "model based optimization" (see e.g.,
[BBZ17]). However, this term is ambiguous, because it could either refer to a model of good candidates (i.e., a probability
distribution over X) or a cheap (possibly differentiable) approximation to the objective (i.e., a regression function X → R), as
discussed in ??.

Figure 6.5: Illustration of a genetic algorithm applied to the 8-queens problem. (a) Initial population of 4 strings. (b)
We rank the members of the population by fitness, and then compute their probability of mating. Here the integer
numbers represent the number of nonattacking pairs of queens, so the global maximum has a value of 28. We pick
an individual θ with probability p(θ) = L(θ)/Z, where Z = Σ_{θ∈P} L(θ) sums the total fitness of the population. For
example, we pick the first individual with probability 24/78 = 0.31, the second with probability 23/78 = 0.29, etc. In
this example, we pick the first individual once, the second twice, the third one once, and the last one does not get to
breed. (c) A split point on the "chromosome" of each parent is chosen at random. (d) The two parents swap their
chromosome halves. (e) We can optionally apply pointwise mutation. From Figure 4.6 of [RN10]. Used with kind
permission of Peter Norvig.

can sometimes get better performance by using a parametric distribution, with suitable inductive bias. We
discuss some examples in Section 6.3.3.

6.3.1 Evolutionary algorithms


Since EA algorithms draw inspiration from the biological process of evolution, they also borrow a lot of its
terminology. The fitness of a member of the population is the value of the objective function (possibly
normalized across population members). The members of the population at step t + 1 are called the offspring.
These can be created by randomly choosing a parent from St and applying a random mutation to it. This
is like asexual reproduction. Alternatively we can create an offspring by choosing two parents from St , and
then combining them in some way to make a child, as in sexual reproduction; combining the parents is called
recombination. (It is often followed by mutation.)
The procedure by which parents are chosen is called the selection function. In truncation selection,
each parent is chosen from the fittest K members of the population (known as the elite set). In tournament
selection, each parent is the fittest out of K randomly chosen members. In fitness proportionate selection,
also called roulette wheel selection, each parent is chosen with probability proportional to its fitness
relative to the others. We can also “kill off” the oldest members of the population, and then select parents
based on their fitness; this is called regularized evolution [Rea+19]).
In addition to the selection rule for parents, we need to specify the recombination and mutation rules.
There are many possible choices for these heuristics. We briefly mention a few of them below.
• In a genetic algorithm (GA) [Gol89; Hol92], we use mutation and a particular recombination method
based on crossover. To implement crossover, we assume each individual is represented as a vector of
integers or binary numbers, by analogy to chromosomes. We pick a split point along the chromosome
for each of the two chosen parents, and then swap the strings, as illustrated in Figure 6.5.
• In genetic programming [Koz92], we use a tree-structured representation of individuals, instead
of a bit string. This representation ensures that all crossovers result in valid children, as illustrated in
Figure 6.7. Genetic programming can be useful for finding good programs as well as other structured
objects, such as neural networks. In evolutionary programming, the structure of the tree is fixed
and only the numerical parameters are evolved.
• In surrogate assisted EA, a surrogate function f̂(s) is used instead of the true objective function
f(s) in order to speed up the evaluation of members of the population (see [Jin11] for a survey). This
is similar to the use of response surface models in Bayesian optimization (??), except it does not deal
with the explore-exploit tradeoff.
Figure 6.6: The 8-queens states corresponding to the first two parents in Figure 6.5(c) and their first child in
Figure 6.5(d). We see that the encoding 32752411 means that the first queen is in row 3 (counting from the bottom
left), the second queen is in row 2, etc. The shaded columns are lost in the crossover, but the unshaded columns are
kept. From Figure 4.7 of [RN10]. Used with kind permission of Peter Norvig.

Figure 6.7: Illustration of the crossover operator in a genetic program. (a-b) The two parents, representing sin(x) + (x + y)²
and sin(x) + √(x² + y). The red circles denote the two crossover points. (c-d) The two children, representing sin(x) + (x²)²
and sin(x) + √((x + y) + y). Adapted from Figure 9.2 of [Mit97].

Figure 6.8: A taxonomy of various metaheuristic optimization algorithms. From https://en.wikipedia.org/wiki/Metaheuristic.
Used with kind permission of Wikipedia authors Johann Dreo and Caner Candan.

• In a memetic algorithm [MC03], we combine mutation and recombination with standard local search.

Evolutionary algorithms have been applied to a large number of applications, including training neural
networks (this combination is known as neuroevolution [Sta+19]). An efficient JAX-based library for
(neuro)-evolution can be found at https://github.com/google/evojax.

6.3.2 Metaheuristic algorithms


The term “metaheuristics” is often used to describe different kinds of local or population-based search
algorithms (see e.g., [Luk13]). There are many types of metaheuristic algorithms, and several ways to
categorize them, as shown in Figure 6.8. Many of these algorithms are “inspired” by natural phenomena. For
example, [AQ+20] proposes to tackle COVID-19 forecasting using "an improved adaptive neuro-fuzzy inference
system (ANFIS) using an enhanced flower pollination algorithm (FPA) by using the salp swarm algorithm
(SSA)”, and [Mol+18] uses “Chicken Swarm Optimization and Deep Learning for Manufacturing Processes”.
However, such “inspiration” usually results in ad-hoc algorithms, with no theoretical support (and therefore
little reason to use them). As Glover and Sorensen [GS15] memorably put it
A large (and increasing) number of publications focuses on the development of (supposedly) new
metaheuristic frameworks based on metaphors. The list of natural or man-made processes that
has been used as the basis for a metaheuristic framework now includes such diverse processes as
bacterial foraging, river formation, biogeography, musicians playing together, electromagnetism,
gravity, colonization by an empire, mine blasts, league championships, clouds, and so forth. An
important subcategory is found in metaheuristics based on animal behavior. Ants, bees, bats,
wolves, cats, fireflies, eagles, dolphins, frogs, salmon, vultures, termites, flies, and many others,
have all been used to inspire a "novel" metaheuristic. [...] As a general rule, publication of papers
on metaphor-based metaheuristics has been limited to second-tier journals and conferences, but
some recent exceptions to this rule can be found. Sörensen [Sör15] states that research in this
direction is fundamentally flawed.

Figure 6.9: Illustration of the BOA algorithm (EDA applied to a generative model structured as a Bayes net). Adapted
from Figure 3 of [PHL12].

In view of this assessment, we do not discuss such methods any further.

6.3.3 Estimation of distribution algorithms


EA methods maintain a population of good candidate solutions, which can be thought of as an implicit
(nonparametric) density model over states with high fitness. [BC95] proposed to “remove the genetics from
GAs”, by explicitly learning a probabilistic model over the configuration space that puts its mass on high
scoring solutions. That is, the population becomes the set of parameters of a generative model, θ_t.
One way to learn such a model is as follows. We start by creating a sample of K′ > K candidate solutions
from the current model, St = {xk ∼ p(x|θt )}. We then rank the samples using the fitness function, and
then pick the most promising subset St∗ of size K using a selection operator (this is known as truncation
selection). Finally, we fit a new probabilistic model p(x|θt+1 ) to St∗ using maximum likelihood estimation.
This is called the estimation of distribution or EDA algorithm (see e.g., [LL02; PSCP06; Hau+11; PHL12;
Hu+12; San17; Bal17]).
Note that EDA is equivalent to minimizing the cross-entropy between the empirical distribution defined by
St∗ and the model distribution p(x|θt+1 ). Thus EDA is related to the cross entropy method, as described
in Section 6.3.4, although CEM usually assumes the special case where p(x|w) = N (x|µ, Σ). EDA is also
closely related to the EM algorithm, as discussed in [Bro+20].
As a simple example, suppose the configuration space is bit strings of length D, and the fitness function
is f(x) = Σ_{d=1}^D x_d, where x_d ∈ {0, 1} (this is called the one-max function in the EA literature). A simple
probabilistic model for this is a fully factored model of the form p(x|θ) = Π_{d=1}^D Ber(x_d|θ_d). Using this model
inside of DBO results in a method called the univariate marginal distribution algorithm or UMDA.
We can estimate the parameters of the Bernoulli model by setting θd to the fraction of samples in St∗
that have bit d turned on. Alternatively, we can incrementally adjust the parameters. The population-based
incremental learning (PBIL) algorithm [BC95] applies this idea to the factored Bernoulli model, resulting in
the following update:
  θ̂_{d,t+1} = (1 − η_t) θ̂_{d,t} + η_t θ_{d,t}    (6.72)

where θ_{d,t} = (1/K) Σ_{k=1}^K I(x_{k,d} = 1) is the MLE estimated from the K = |S*_t| samples generated in the current
iteration, and η_t is a learning rate.
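A minimal sketch of this loop on the one-max problem (our own illustrative code): with lr = 1 we refit the factored Bernoulli model by MLE at each generation (UMDA); with lr < 1 we get the incremental PBIL update of Equation (6.72).

import numpy as np

def umda_onemax(D=50, K_prime=100, K=20, n_gens=50, lr=1.0, rng=None):
    rng = rng or np.random.default_rng(0)
    theta = np.full(D, 0.5)                    # factored Bernoulli parameters
    for _ in range(n_gens):
        pop = rng.random((K_prime, D)) < theta # sample K' bit strings
        fitness = pop.sum(axis=1)              # one-max fitness
        elite = pop[np.argsort(fitness)[-K:]]  # truncation selection (top K)
        theta = (1 - lr) * theta + lr * elite.mean(axis=0)
    return theta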

It is straightforward to use more expressive probability models that capture dependencies between the
parameters (these are known as building blocks in the EA literature). For example, in the case of real-valued
parameters, we can use a multivariate Gaussian, p(x) = N (x|µ, Σ). The resulting method is called the
estimation of multivariate normal algorithm or EMNA, [LL02]. (See also Section 6.3.4.)
For discrete random variables, it is natural to use probabilistic graphical models (??) to capture de-
pendencies between the variables. [BD97] learn a tree-structured graphical model using the Chow-Liu
algorithm (Section 31.1.2); [BJV97] is a special case of this where the graph is a tree. We can also learn more
general graphical model structures (see e.g., [LL02]). We typically use a Bayes net (??), since we can use
ancestral sampling (??) to easily generate samples; the resulting method is therefore called the Bayesian
Optimization Algorithm (BOA) [PGCP00].³ The hierarchical BOA (hBOA) algorithm [Pel05] extends
this by using decision trees and decision graphs to represent the local CPTs in the Bayes net (as in [CHM97]),
rather than using tables. In general, learning the structure of the probability model for use in EDA is called
linkage learning, by analogy to how genes can be linked together if they can be co-inherited as a building
block.
We can also use deep generative models to represent the distribution over good candidates. For example,
[CSF16] use denoising autoencoders and NADE models (??), [Bal17] uses a DNN regressor which is then
inverted using gradient descent on the inputs, [PRG17] uses RBMs (??), [GSM18] uses VAEs (??), etc.
Such models might take more data to fit (and therefore more function calls), but can potentially model the
probability landscape more faithfully. (Whether that translates to better optimization performance is not
clear, however.)

6.3.4 Cross-entropy method


The cross-entropy method [Rub97; RK04; Boe+05] is a special case of EDA (Section 6.3.3) in which the
population is represented by a multivariate Gaussian. In particular, we set µ_{t+1} and Σ_{t+1} to the empirical
mean and covariance of S*_{t+1}, which contains the top K samples. This is closely related to the SMC algorithm for
sampling rare events discussed in ??.
The CEM is sometimes used for model-based RL (??), since it is simple and can find reasonably good
optima of multi-modal objectives. It is also sometimes used inside of Bayesian optimization (??), to optimize
the multi-modal acquisition function (see [BK10]).
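A minimal sketch of the CEM loop (our own illustrative code; the small jitter added to the covariance is an assumption made to keep it positive definite):

import numpy as np

def cem(f, mu, Sigma, K_prime=100, K=10, n_iters=50, rng=None):
    # Sample from a Gaussian, keep the top K by fitness, refit by MLE.
    rng = rng or np.random.default_rng(0)
    for _ in range(n_iters):
        xs = rng.multivariate_normal(mu, Sigma, size=K_prime)
        elite = xs[np.argsort([f(x) for x in xs])[-K:]]
        mu = elite.mean(axis=0)
        Sigma = np.cov(elite, rowvar=False) + 1e-6 * np.eye(len(mu))
    return mu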

6.3.4.1 Differentiable CEM

The differentiable CEM method of [AY19] replaces the top K operator with a soft, differentiable approxi-
mation, which allows the optimizer to be used as part of an end-to-end differentiable pipeline. For example,
we can use this to create a differentiable model predictive control (MPC) algorithm (??), as described in ??.
The basic idea is as follows. Let S_t = {x_{t,i} ∼ p(x|θ_t) : i = 1 : K′} represent the current population, with
fitness values v_{t,i} = f(x_{t,i}). Let v_{t,K} be the K'th largest value. In CEM, we compute the set of top K samples,
S*_t = {i : v_{t,i} ≥ v_{t,K}}, and then update the model based on these: θ_{t+1} = argmax_θ Σ_{i∈S*_t} p_t(i) log p(x_{t,i}|θ),
where p_t(i) = I(i ∈ S*_t)/|S*_t|. In the differentiable version, we replace the sparse distribution p_t with the
"soft" dense distribution q_t = Π(p_t; τ, K), where

  Π(p; τ, K) = argmin_{0 ≤ q ≤ 1} −p^T q − τ H(q)  s.t.  1^T q = K    (6.73)

projects the distribution p onto the polytope of distributions which sum to K. (Here H(q) = −Σ_i [q_i log(q_i) +
(1 − q_i) log(1 − q_i)] is the entropy, and τ > 0 is a temperature parameter.) This projection operator (and
hence the whole DCEM algorithm) can be backpropagated through using implicit differentiation [AKZK19].

³ This should not be confused with the Bayesian optimization methods we discuss in ??, which use response surface modeling
to model p(f(x)) rather than p(x*).

Figure 6.10: Illustration of the CMA-ES method applied to a simple 2d function. The dots represent members of the
population, and the dashed orange ellipse represents the multivariate Gaussian. From https://en.wikipedia.org/wiki/CMA-ES.
Used with kind permission of Wikipedia author Sentewolf.

6.3.5 Natural evolutionary strategies


Evolution strategies [Wie+14] are a form of distribution-based optimization in which the distribution over
the population is represented by a Gaussian, p(x|θt ) (see e.g., [Sal+17]). Unlike CEM, the parameters are
updated using gradient ascent applied to the expected value of the objective, rather than using MLE on a
set of elite samples. More precisely, consider the smoothed objective L(θ) = Ep(x|θ) [f (x)]. We can use the
REINFORCE estimator (??) to compute the gradient of this objective as follows:

∇θ L(θ) = Ep(x|θ) [f (x)∇θ log p(x|θ)] (6.74)

This can be approximated by drawing Monte Carlo samples. If the probability model is in the exponential
family, we can compute the natural gradient (??), rather than the “vanilla” gradient; such methods are called
natural evolution strategies [Wie+14].
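A minimal sketch of the Monte Carlo estimator in Equation (6.74) for a Gaussian with fixed covariance σ²I, where ∇_µ log p(x|µ) = (x − µ)/σ² (the baseline subtraction is a standard variance-reduction trick we add, not part of the equation):

import numpy as np

def es_gradient_step(f, mu, sigma=0.1, n_samples=100, lr=0.01, rng=None):
    # One ES step: estimate the REINFORCE gradient of
    # E_{x ~ N(mu, sigma^2 I)}[f(x)] and take a gradient ascent step on mu.
    rng = rng or np.random.default_rng(0)
    eps = rng.standard_normal((n_samples, len(mu)))
    fs = np.array([f(mu + sigma * e) for e in eps])
    fs = fs - fs.mean()                        # baseline reduces variance
    grad = (fs[:, None] * eps).mean(axis=0) / sigma
    return mu + lr * grad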

6.3.5.1 CMA-ES

The CMA-ES method of [Han16], which stands for "covariance matrix adaptation evolution strategy", is
a kind of NES. It is very similar to CEM except it updates the parameters in a special way. In particular,
instead of computing the new mean and covariance using unweighted MLE on the elite set, we attach weights
to the elite samples based on their rank. We then set the new mean to the weighted MLE of the elite set.
The update equations for the covariance are more complex. In particular, “evolutionary paths” are also
used to accumulate the search directions across successive generations, and these are used to update the
covariance. It can be show that the resulting updates approximate the natural gradient of L(θ) without
explicitly modeling the Fisher information matrix [Oll+17].
Figure 6.10 illustrates the method in action.

6.4 Dynamic programming


Dynamic programming is a way to efficiently find the globally optimal solution to certain kinds of
optimization problems. The key requirement is that the optimal solution be expressed in terms of the optimal
solution to smaller subproblems, which can be reused many times. Note that DP is more of an algorithm
“family” rather than a specific algorithm. We give some examples below.

6.4.1 Example: computing Fibonacci numbers
Consider the problem of computing Fibonacci numbers, defined via the recursive equation

Fi = Fi−1 + Fi−2 (6.75)

with base cases F0 = F1 = 1. Thus we have that F2 = 2, F3 = 3, F4 = 5, F5 = 8, etc. A simple recursive


algorithm to compute the first n Fibonacci numbers is shown in Algorithm 2. Unfortunately, this takes
exponential time. For example, evaluating fib(5) proceeds as follows:

F5 = F4 + F3 (6.76)
= (F3 + F2 ) + (F2 + F1 ) (6.77)
= ((F2 + F1 ) + (F1 + F0 )) + ((F1 + F0 ) + F1 ) (6.78)
= (((F1 + F0 ) + F1 ) + (F1 + F0 )) + ((F1 + F0 ) + F1 ) (6.79)

We see that there is a lot of repeated computation. For example, fib(2) is computed 3 times. One way to
improve the efficiency is to use memoization, which means memorizing each function value that is computed.
This will result in a linear time algorithm. However, the overhead involved can be high.
It is usually preferable to try to solve the problem bottom up, solving small subproblems first, and then
using their results to help solve larger problems later. A simple way to do this is shown in Algorithm 3.

Algorithm 2: Fibonacci numbers, top down

1  function fib(n):
2    if n = 0 or n = 1 then
3      return 1
4    else
5      return fib(n − 1) + fib(n − 2)

Algorithm 3: Fibonacci numbers, bottom up

1  function fib(n):
2    F_0 := 1, F_1 := 1
3    for i = 2, . . . , n do
4      F_i := F_{i−1} + F_{i−2}
5    return F_n
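In Python, the memoized top-down version mentioned above is a one-liner with a cache decorator (a sketch using the shifted base cases F_0 = F_1 = 1 from the text):

from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    # Memoized Fibonacci: each value is computed once, giving linear time.
    return 1 if n <= 1 else fib(n - 1) + fib(n - 2)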

6.4.2 ML examples
There are many applications of DP to ML problems, which we discuss elsewhere in this book. These include
the forwards-backwards algorithm for inference in HMMs (??), the Viterbi algorithm for MAP sequence
estimation in HMMs (??), inference in more general graphical models (??), reinforcement learning (??), etc.

6.5 Conjugate duality


In this section, we briefly discuss conjugate duality, which is a useful way to construct linear lower bounds
on non-convex functions. We follow the presentation of [Bis06, Sec. 10.5].


Figure 6.11: Illustration of a conjugate function. Red line is original function f (x), and the blue line is a linear lower
bound λx. To make the bound tight, we find the x where ∇f (x) is parallel to λ, and slide the line up to touch there;
the amount we slide up is given by f ∗ (λ). Adapted from Figure 10.11 of [Bis06].

6.5.1 Introduction
Consider an arbitrary continuous function f (x), and suppose we create a linear lower bound on it of the form

  L(x, λ) ≜ λ^T x − f*(λ) ≤ f(x)    (6.80)

where λ is the slope, which we choose, and f ∗ (λ) is the intercept, which we solve for below. See Figure 6.11(a)
for an illustration.
For a fixed λ, we can find the point xλ where the lower bound is tight by “sliding” the line upwards until
it touches the curve at xλ , as shown in Figure 6.11(b). At xλ , we minimize the distance between the function
and the lower bound:
  x_λ ≜ argmin_x f(x) − L(x, λ) = argmin_x f(x) − λ^T x    (6.81)
Since the bound is tight at this point, we have

f (xλ ) = L(xλ , λ) = λT xλ − f ∗ (λ) (6.82)

and hence
  f*(λ) = λ^T x_λ − f(x_λ) = max_x λ^T x − f(x)    (6.83)
The function f ∗ is called the conjugate of f , also known as the Fenchel transform of f . For the special
case of differentiable f , f ∗ is called the Legendre transform of f .
One reason conjugate functions are useful is that they can be used to create convex lower bounds to
non-convex functions. That is, we have L(x, λ) ≤ f (x), with equality at x = xλ , for any function f : RD → R.
For any given x, we can optimize over λ to make the bound as tight as possible, giving us a fixed function
L(x); this is called a variational approximation. We can then try to maximize this lower bound wrt x
instead of maximizing f (x). This method is used extensively in approximate Bayesian inference, as we discuss
in ??.

6.5.2 Example: exponential function


Let us consider an example. Suppose f (x) = e−x , which is convex. Consider a linear lower bound of the form

L(x, λ) = λx − f † (λ) (6.84)

where the conjugate function is given by

  f†(λ) = max_x λx − f(x) = −λ log(−λ) + λ    (6.85)


Figure 6.12: (a) The red curve is f (x) = e−x and the colored lines are linear lower bounds. Each lower bound of slope
λ is tangent to the curve at the point xλ = − log(−λ), where f (xλ ) = elog(−λ) = −λ. For the blue curve, this occurs
at xλ = ξ. Adapted from Figure 10.10 of [Bis06]. Generated by opt_lower_bound.py. (b) For a convex function f (x),
its epigraph can be represented as the intersection of half-spaces defined by linear lower bounds of the form f † (λ).
Adapted from Figure 13 of [JJ99].

as illustrated in Figure 6.12(a).


To see this, define
J(x, λ) = λx − f (x) (6.86)
We have
  ∂J/∂x = λ − f′(x) = λ + e^{−x}    (6.87)
Setting the derivative to zero gives

  x_λ = argmax_x J(x, λ) = −log(−λ)    (6.88)

Hence

  f†(λ) = J(x_λ, λ) = λ(−log(−λ)) − e^{log(−λ)} = −λ log(−λ) + λ    (6.89)

6.5.3 Conjugate of a conjugate


It is interesting to see what happens if we take the conjugate of the conjugate:

  f**(x) = max_λ λ^T x − f†(λ)    (6.90)

If f is convex, then f ∗∗ = f , so f and f † are called conjugate duals. To see why, note that

  f**(x) = max_λ L(x, λ) ≤ f(x)    (6.91)

Since we are free to modify λ for each x, we can make the lower bound tight at each x. This perfectly
characterizes f , since the epigraph of a convex function is an intersection of half-planes defined by linear
lower bounds, as shown in Figure 6.12(b).
Let us demonstrate this using the example from Section 6.5.2. We have

  f**(x) = max_λ λx − f†(λ) = max_λ λx + λ log(−λ) − λ    (6.92)

Define

  J*(x, λ) ≜ λx − f†(λ) = λx + λ log(−λ) − λ    (6.93)

We have

  (∂/∂λ) J*(x, λ) = x + log(−λ) + λ (−1/(−λ)) − 1 = 0    (6.94)
  x = −log(−λ)    (6.95)
  λ_x = −e^{−x}    (6.96)

Substituting back we find

  f**(x) = J*(x, λ_x) = (−e^{−x})x + (−e^{−x})(−x) − (−e^{−x}) = e^{−x} = f(x)    (6.97)

6.5.4 Bounds for the logistic (sigmoid) function


In this section, we use the results on conjugate duality to derive upper and lower bounds to the logistic
function, σ(x) = 1/(1 + e^{−x}).

6.5.4.1 Exponential upper bound


The sigmoid function is neither convex nor concave. However, it is easy to show that f(x) = log σ(x) =
−log(1 + e^{−x}) is concave, by showing that its second derivative is negative. Now, any concave function f(x)
can be represented by

  f(x) = min_η ηx − f†(η)    (6.98)

where

  f†(η) = min_x ηx − f(x)    (6.99)

One can show that if f (x) = log σ(x), then

f † (η) = −η ln η − (1 − η) ln(1 − η) (6.100)

which is the binary entropy function. Hence

  log σ(x) ≤ ηx − f†(η)    (6.101)
  σ(x) ≤ exp(ηx − f†(η))    (6.102)

This exponential upper bound on σ(x) is illustrated in Figure 6.13(a).

6.5.4.2 Quadratic lower bound


It is also useful to compute a lower bound on σ(x). If we make this a quadratic lower bound, it will “play
nicely” with Gaussian priors, which simplifies the analysis of several models. This approach was first suggested
in [JJ96].
First we write

  log σ(x) = −log(1 + e^{−x}) = −log( e^{−x/2} (e^{x/2} + e^{−x/2}) )    (6.103)
           = x/2 − log(e^{x/2} + e^{−x/2})    (6.104)

The function f(x) = −log(e^{x/2} + e^{−x/2}) is a convex function of y = x², as can be verified by showing
d²f/dy² > 0. Hence we can create a linear lower bound on f, using the conjugate function

  f†(η) = max_{x²} ηx² − f(√(x²))    (6.105)

We have

  0 = η − (d/dx²) f(x) = η + (1/(4x)) tanh(x/2)    (6.106)


Figure 6.13: Illustration of (a) the exponential upper bound (shown for η = 0.2 and η = 0.7) and (b) the quadratic
lower bound (shown for ξ = 2.5) to the sigmoid function. Generated by sigmoid_upper_bounds.py and sigmoid_lower_bounds.py.

The lower bound is tangent at the point x_η = ξ, where

  η = −(1/(4ξ)) tanh(ξ/2) = −(1/(2ξ)) [ σ(ξ) − 1/2 ] = −λ(ξ)    (6.107)

The conjugate function can be rewritten as

  f†(λ(ξ)) = −λ(ξ)ξ² − f(ξ) = −λ(ξ)ξ² + log(e^{ξ/2} + e^{−ξ/2})    (6.108)

So the lower bound on f becomes

  f(x) ≥ −λ(ξ)x² − g(λ(ξ)) = −λ(ξ)x² + λ(ξ)ξ² − log(e^{ξ/2} + e^{−ξ/2})    (6.109)

and the lower bound on the sigmoid function becomes

  σ(x) ≥ σ(ξ) exp( (x − ξ)/2 − λ(ξ)(x² − ξ²) )    (6.110)

This is illustrated in Figure 6.13(b).


Although a quadratic is not a good representation for the overall shape of a sigmoid, it turns out that
when we use the sigmoid as a likelihood function and combine it with a Gaussian prior, we get a Gaussian-like
posterior; in this context, the quadratic lower bound works quite well (since a quadratic likelihood times a
Gaussian prior will yield an exact Gaussian posterior). See Section 15.1.1 for an example, where we use this
bound for Bayesian logistic regression.
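A minimal numerical sketch of the bound in Equation (6.110) (our own check, with illustrative function names): for any ξ, the right-hand side matches σ(x) at x = ±ξ and lies below it everywhere else.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lam(xi):
    # The lambda(xi) coefficient from Equation (6.107).
    return (sigmoid(xi) - 0.5) / (2.0 * xi)

def jj_lower_bound(x, xi):
    # Quadratic (in x, inside the exp) lower bound on sigmoid(x), Equation (6.110).
    return sigmoid(xi) * np.exp((x - xi) / 2.0 - lam(xi) * (x ** 2 - xi ** 2))

x = np.linspace(-6, 6, 100)
assert np.all(jj_lower_bound(x, xi=2.5) <= sigmoid(x) + 1e-12)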

Part II

Inference

Chapter 7

Inference algorithms: an overview

Chapter 8

State-space inference

Chapter 9

Message passing inference

9.1 MAP estimation for discrete PGMs


In this section, we consider the problem of finding the most probable configuration of variables in a probabilistic
graphical model, i.e., our goal is to find a MAP assignment x* = argmax_{x∈X^V} p(x), where X = {1, . . . , K}
is the discrete state space of each node, V is the number of nodes, and the distribution is defined according
to a Markov random field (??) with pairwise cliques, one per edge:

  p(x) = (1/Z) exp( Σ_{s∈V} θ_s(x_s) + Σ_{(s,t)∈E} θ_{st}(x_s, x_t) )    (9.1)

Here V = {x_1, . . . , x_V} are the nodes, E are the edges, θ_s and θ_{st} are the node and edge potentials, and Z is
the partition function:

  Z = Σ_x exp( Σ_{s∈V} θ_s(x_s) + Σ_{(s,t)∈E} θ_{st}(x_s, x_t) )    (9.2)

Since we just want the MAP configuration, we can ignore Z, and just compute

  x* = argmax_x Σ_{s∈V} θ_s(x_s) + Σ_{(s,t)∈E} θ_{st}(x_s, x_t)    (9.3)

We can compute this exactly using dynamic programming, as we explain in ??; however, this takes time
exponential in the treewidth of the graph, which is often too slow. In this section, we focus on approximate
methods that can scale to intractable models. We only give a brief description here; more details can be
found in [WJ08; KF09].

9.1.1 Notation
To simplify the presentation, we write the distribution in the following form:

  p(x) = (1/Z(θ)) exp(−E(x)),  E(x) ≜ −θ^T T(x)    (9.4)

where θ = ({θ_{s;j}}, {θ_{s,t;j,k}}) are all the node and edge parameters (the canonical parameters), and T(x) =
({I(x_s = j)}, {I(x_s = j, x_t = k)}) are all the node and edge indicator functions (the sufficient statistics).
Note: we use s, t ∈ V to index nodes and j, k ∈ X to index states.
The mean of the sufficient statistics are known as the mean parameters of the model, and are given by

  µ = E[T(x)] = ({p(x_s = j)}_s, {p(x_s = j, x_t = k)}_{s≠t}) = ({µ_{s;j}}_s, {µ_{st;jk}}_{s≠t})    (9.5)

This is a vector of length d = KV + K²E, where K = |X| is the number of states, V = |V| is the number of
nodes, and E = |E| is the number of edges. Since µ completely characterizes the distribution p(x), we
sometimes treat µ as a distribution itself.
Equation (9.5) is called the standard overcomplete representation. It is called “overcomplete” because
it ignores the sum-to-one constraints. In some cases, it is convenient to remove this redundancy. For example,
consider an Ising model where X_s ∈ {0, 1}. The model can be written as

  p(x) = (1/Z(θ)) exp( Σ_{s∈V} θ_s x_s + Σ_{(s,t)∈E} θ_{st} x_s x_t )    (9.6)

Hence we can use the following minimal parameterization

T (x) = (xs , s ∈ V ; xs xt , (s, t) ∈ E) ∈ Rd (9.7)

where d = V + E. The corresponding mean parameters are µs = p(xs = 1) and µst = p(xs = 1, xt = 1).

9.1.2 The marginal polytope


The space of allowable µ vectors is called the marginal polytope, and is denoted M(G), where G is the
structure of the graph. This is defined to be the set of all mean parameters for the given model that can be
generated from a valid probability distribution:
  M(G) ≜ {µ ∈ R^d : ∃p s.t. µ = Σ_x T(x) p(x), where p(x) ≥ 0 and Σ_x p(x) = 1}    (9.8)

For example, consider an Ising model. If we have just two nodes connected as X_1 − X_2, one can
show that we have the following minimal set of constraints: 0 ≤ µ_12, µ_12 ≤ µ_1, µ_12 ≤ µ_2, and
1 + µ_12 − µ_1 − µ_2 ≥ 0. We can write these in matrix-vector form as

  [  0   0   1 ]                [  0 ]
  [  1   0  −1 ]   [ µ_1  ]     [  0 ]
  [  0   1  −1 ]   [ µ_2  ]  ≥  [  0 ]    (9.9)
  [ −1  −1   1 ]   [ µ_12 ]     [ −1 ]

These four constraints define a series of half-planes, whose intersection defines a polytope, as shown in
Figure 9.1(a).
Since M(G) is obtained by taking a convex combination of the T (x) vectors, it can also be written as the
convex hull of these vectors:
M(G) = conv{T1 (x), . . . , Td (x)} (9.10)
For example, for a 2 node MRF X1 − X2 with binary states, we have

M(G) = conv{(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 1)} (9.11)

These are the four black dots in Figure 9.1(a). We see that the convex hull defines the same volume as the
intersection of half-spaces.

9.1.3 Linear programming relaxation


We can write the MAP estimation problem as follows:

  max_{x∈X^V} θ^T T(x) = max_{µ∈M(G)} θ^T µ    (9.12)

To see why this equation is true, note that we can just set µ to be a degenerate distribution with µ(xs ) =
I(x_s = x*_s), where x*_s is the optimal assignment of node s. Thus we can "emulate" the task of optimizing over


Figure 9.1: (a) Illustration of the marginal polytope for an Ising model with two variables. (b) Cartoon illustration of
the set MF (G), which is a nonconvex inner bound on the marginal polytope M(G). MF (G) is used by mean field. (c)
Cartoon illustration of the relationship between M(G) and L(G), which is used by loopy BP. The set L(G) is always
an outer bound on M(G), and the inclusion M(G) ⊂ L(G) is strict whenever G has loops. Both sets are polytopes,
which can be defined as an intersection of half-planes (defined by facets), or as the convex hull of the vertices. L(G)
actually has fewer facets than M(G), despite the picture. In fact, L(G) has O(|X||V| + |X|²|E|) facets, where |X| is
the number of states per variable, |V| is the number of variables, and |E| is the number of edges. By contrast, M(G)
has O(|X|^|V|) facets. On the other hand, L(G) has more vertices than M(G), despite the picture, since L(G) contains
all the binary vector extreme points µ ∈ M(G), plus additional fractional extreme points. From Figures 3.6, 5.4 and
4.2 of [WJ08]. Used with kind permission of Martin Wainwright.

discrete assignments by optimizing over probability distributions µ. Furthermore, the non-degenerate (“soft”)
distributions will not correspond to corners of the polytope, and hence will not maximize a linear function.
It seems like we have an easy problem to solve, since the objective in Equation (9.12) is linear in µ, and
the constraint set M(G) is convex. The trouble is, M(G) in general has a number of facets that is exponential
in the number of nodes.
A standard strategy in combinatorial optimization is to relax the constraints. In this case, instead of
requiring probability vector µ to live in the marginal polytope M(G), we allow it to live inside a simpler,
convex enclosing set L(G), which we define in Section 9.1.3.1. Thus we try to maximize the following upper
bound on the original objective:
  τ* = argmax_{τ∈L(G)} θ^T τ    (9.13)

This is called a linear programming relaxation of the problem. If the solution τ ∗ is integral, it corresponds
to the exact MAP estimate; this will be the case when the graph is a tree. In general, τ ∗ will be fractional;
we can derive an approximate MAP estimate by rounding (see [Wer07] for details).

9.1.3.1 A convex outer approximation to the marginal polytope


Consider a set of probability vectors τ that satisfy the following local consistency constraints:

  Σ_{x_s} τ_s(x_s) = 1    (9.14)
  Σ_{x_t} τ_{st}(x_s, x_t) = τ_s(x_s)    (9.15)

The first constraint is called the normalization constraint, and the second is called the marginalization
constraint. We then define the set

  L(G) ≜ {τ ≥ 0 : Equation (9.14) holds ∀s ∈ V, Equation (9.15) holds ∀(s, t) ∈ E}    (9.16)

The set L(G) is also a polytope, but it only has O(|V |+|E|) constraints. It is a convex outer approximation
on M(G), as shown in Figure 9.1(c). (By contrast, the mean field approximation is a non-convex inner
approximation, as we discuss in ??.)

Figure 9.2: (a) Illustration of pairwise UGM on binary nodes, together with a set of pseudo marginals that are not
globally consistent. (b) A slice of the marginal polytope illustrating the set of feasible edge marginals, assuming the
node marginals are clamped at µ1 = µ2 = µ3 = 0.5. From Figure 4.1 of [WJ08]. Used with kind permission of Martin
Wainwright.

We call the terms τs , τst ∈ L(G) pseudo marginals, since they may not correspond to marginals of any
valid probability distribution. As an example of this, consider Figure 9.2(a). The picture shows a set of
pseudo node and edge marginals, which satisfy the local consistency requirements. However, they are not
globally consistent. To see why, note that τ12 implies p(X1 = X2 ) = 0.8, τ23 implies p(X2 = X3 ) = 0.8, but
τ13 implies p(X1 = X3 ) = 0.2, which is not possible (see [WJ08, p81] for a formal proof). Indeed, Figure 9.2(b)
shows that L(G) contains points that are not in M(G).
We claim that M(G) ⊆ L(G), with equality iff G is a tree. To see this, first consider an element µ ∈ M(G).
Any such vector must satisfy the normalization and marginalization constraints, hence M(G) ⊆ L(G).
Now consider the converse. Suppose T is a tree, and let µ ∈ L(T ). By definition, this satisfies the
normalization and marginalization constraints. However, any tree can be represented in the form

  p_µ(x) = Π_{s∈V} µ_s(x_s) Π_{(s,t)∈E} µ_{st}(x_s, x_t) / (µ_s(x_s) µ_t(x_t))    (9.17)

Hence satisfying normalization and local consistency is enough to define a valid distribution for any tree.
Hence µ ∈ M(T ) as well.
In contrast, if the graph has loops, we have that M(G) ≠ L(G). See Figure 9.2(b) for an example of this
fact. The importance of this observation will become clear in Section 10.1.3.
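For small graphs, we can solve Equation (9.13) directly with an off-the-shelf LP solver. A minimal sketch for a two-node binary MRF (a tree, so the relaxation is tight and the solution is integral); the variable ordering and example potentials below are our own illustrative assumptions:

import numpy as np
from scipy.optimize import linprog

# Pseudomarginal vector: [tau1(0), tau1(1), tau2(0), tau2(1),
#                         tau12(00), tau12(01), tau12(10), tau12(11)]
theta = np.array([0.0, 1.0, 0.0, 0.5, 2.0, 0.0, 0.0, 2.0])  # example potentials

A_eq = np.array([
    [1, 1, 0, 0, 0, 0, 0, 0],    # tau1 sums to 1
    [0, 0, 1, 1, 0, 0, 0, 0],    # tau2 sums to 1
    [-1, 0, 0, 0, 1, 1, 0, 0],   # sum_{x2} tau12(0, x2) = tau1(0)
    [0, -1, 0, 0, 0, 0, 1, 1],   # sum_{x2} tau12(1, x2) = tau1(1)
    [0, 0, -1, 0, 1, 0, 1, 0],   # sum_{x1} tau12(x1, 0) = tau2(0)
    [0, 0, 0, -1, 0, 1, 0, 1],   # sum_{x1} tau12(x1, 1) = tau2(1)
])
b_eq = np.array([1, 1, 0, 0, 0, 0])

# linprog minimizes, so negate theta to maximize theta^T tau
res = linprog(-theta, A_eq=A_eq, b_eq=b_eq, bounds=[(0, 1)] * 8)
print(res.x)  # integral solution, since a two-node graph is a tree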

9.1.3.2 Algorithms
Our task is to solve Equation (9.13), which requires maximizing a linear function over a simple convex
polytope. For this, we could use a generic linear programming package. However, this is often very slow.
Fortunately, one can show that a simple algorithm, that sends messages between nodes in the graph, can
be used to compute τ ∗ . In particular, the tree reweighted belief propagation algorithm can be used;
see Section 10.1.5.3 for details.

9.1.3.3 Application to stereo depth estimation


Belief propagation is often applied to low-level computer vision problems (see e.g., [Sze10; BKR11; Pri12]).
For example, Figure 9.3 illustrates its application to the problem of stereo depth estimation given a pair
of monocular images (only one is shown). The value xi is the distance of pixel i from the camera (quantized
to a certain number of values). The goal is to infer these values from noisy measurements. We quantize the
state space, rather than using a Gaussian model, in order to avoid oversmoothing at discontinuities, which


Figure 9.3: Illustration of belief propagation for stereo depth estimation applied to the Venus image from the Middlebury
stereo benchmark dataset [SS02]. Left column: image and true disparities. Remaining columns: initial estimate,
estimate after 1 iteration, and estimate at convergence. Top row: Gaussian edge potentials using a continuous state
space. Bottom row: robust edge potentials using a quantized state space. From Figure 4 of [SF08]. Used with kind
permission of Erik Sudderth.

occur at object boundaries, as illustrated in Figure 9.3. (We can also use a hybrid discrete-continuous state
space, as discussed in [Yam+12], but we can no longer apply BP.)
Not surprisingly, people have recently applied deep learning to this problem. For example, [XAH19]
describes a differentiable version of message passing (??), which is fast and can be trained end-to-end.
However, it requires labeled data for training, i.e., pixel-wise ground truth depth values. For this particular
problem, such data can be collected from depth cameras, but for other problems, BP on “unsupervised” MRFs
may be needed.

9.1.4 Graphcuts
In this section, we show how to find MAP state estimates, or equivalently, minimum energy configurations,
by using the maxflow / mincut algorithm for graphs. This class of methods is known as graphcuts and is
very widely used, especially in computer vision applications (see e.g., [BK04]).
We will start by considering the case of MRFs with binary nodes and a restricted class of potentials; in
this case, graphcuts will find the exact global optimum. We then consider the case of multiple states per
node; we can approximately solve this case by solving a series of binary subproblems, as we will see.

9.1.4.1 Graphcuts for the Ising model


Let us start by considering a binary MRF where the edge energies have the following form:

  E_uv(x_u, x_v) = { 0 if x_u = x_v;  λ_uv if x_u ≠ x_v }    (9.18)

where λ_uv ≥ 0 is the edge cost. This encourages neighboring nodes to have the same value (since we are trying
to minimize energy). Since we are free to add any constant we like to the overall energy without affecting the
MAP state estimate, let us rescale the local energy terms such that either Eu (1) = 0 or Eu (0) = 0.
Now let us construct a graph which has the same set of nodes as the MRF, plus two distinguished nodes:
the source s and the sink t. If Eu (1) = 0, we add the edge xu → t with cost Eu (0). Similarly, If Eu (0) = 0, we
add the edge s → xu with cost Eu (1). Finally, for every pair of variables that are connected in the MRF,
we add edges xu → xv and xv → xu , both with cost λu,v ≥ 0. Figure 9.4 illustrates this construction for an
MRF with 4 nodes and the following parameters:

  E_1(0) = 7, E_2(1) = 2, E_3(1) = 1, E_4(1) = 6, λ_{1,2} = 6, λ_{2,3} = 6, λ_{3,4} = 2, λ_{1,4} = 1    (9.19)

Having constructed the graph, we compute a minimal s − t cut. This is a partition of the nodes into two sets,
Xs and Xt , such that s ∈ Xs and t ∈ Xt . We then find the partition which minimizes the sum of the cost of


Figure 9.4: Illustration of graphcuts applied to an MRF with 4 nodes. Dashed lines are ones which contribute to the
cost of the cut (for bidirected edges, we only count one of the costs). Here the min cut has cost 6. From Figure 13.5
from [KF09]. Used with kind permission of Daphne Koller.

the edges between nodes on different sides of the partition:

  cost(X_s, X_t) = Σ_{x_u∈X_s, x_v∈X_t} cost(x_u, x_v)    (9.20)

In Figure 9.4, we see that the min-cut has cost 6. Minimizing the cost in this graph is equivalent to minimizing
the energy in the MRF. Hence nodes that are assigned to s have an optimal state of 0, and the nodes that are
assigned to t have an optimal state of 1. In Figure 9.4, we see that the optimal MAP estimate is (1, 1, 1, 0).
Thus we have converted the MAP estimation problem to a standard graph theory problem for which
efficient solvers exist (see e.g., [CLR90]).
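We can reproduce this example with an off-the-shelf max-flow solver. A minimal sketch using the networkx library (our own code, not from the text); nodes on the t side of the cut get state 1, matching the discussion above:

import networkx as nx

G = nx.DiGraph()
# Terminal edges: since E1(1)=0 we add x1 -> t with cost E1(0); since
# E_u(0)=0 for the others we add s -> x_u with cost E_u(1).
G.add_edge("x1", "t", capacity=7)
for u, w in [("x2", 2), ("x3", 1), ("x4", 6)]:
    G.add_edge("s", u, capacity=w)
# Pairwise edges in both directions with cost lambda_uv (Equation (9.19))
for u, v, lam in [("x1", "x2", 6), ("x2", "x3", 6), ("x3", "x4", 2), ("x1", "x4", 1)]:
    G.add_edge(u, v, capacity=lam)
    G.add_edge(v, u, capacity=lam)

cut_value, (S, T) = nx.minimum_cut(G, "s", "t")
print(cut_value)                                           # 6
print({u: int(u in T) for u in ["x1", "x2", "x3", "x4"]})  # MAP = (1, 1, 1, 0)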

9.1.4.2 Graphcuts for binary MRFs with submodular potentials


We now discuss how to extend the graphcuts construction to binary MRFs with more general kinds of
potential functions. In particular, suppose each pairwise energy satisfies the following condition:

Euv (1, 1) + Euv (0, 0) ≤ Euv (1, 0) + Euv (0, 1) (9.21)

In other words, the sum of the diagonal energies is less than the sum of the off-diagonal energies. In this
case, we say the energies are submodular (??). An example of a submodular energy is an Ising model
where λuv > 0. This is also known as an attractive MRF or associative MRF, since the model “wants”
neighboring states to be the same.
It is possible to modify the graph construction process for this setting, and then apply graphcuts, such
that the resulting estimate is the global optimum [GPS89].

9.1.4.3 Graphcuts for nonbinary metric MRFs


We now discuss how to use graphcuts for approximate MAP estimation in MRFs where each node can have
multiple states [BVZ01].
One approach is to use alpha expansion. At each step, it picks one of the available labels or states and
calls it α; then it solves a binary subproblem where each variable can choose to remain in its current state, or
to become state α (see Figure 9.5(d) for an illustration).
Another approach is to use alpha-beta swap. At each step, two labels are chosen, call them α and β.
All the nodes currently labeled α can change to β (and vice versa) if this reduces the energy (see Figure 9.5(c)
for an illustration).
In order to solve these binary subproblems optimally, we need to ensure the potentials for these subproblems
are submodular. This will be the case if the pairwise energies form a metric. We call such a model a metric


Figure 9.5: (a) An image with 3 labels. (b) A standard move (e.g., given by iterative conditional modes) just flips the label of one pixel (in the circled area). (c) An α − β swap allows all nodes that are currently labeled as α to be relabeled as β if this decreases the energy. (d) An α expansion allows all nodes that are not currently labeled as α to be relabeled as α if this decreases the energy. From Figure 2 of [BVZ01]. Used with kind permission of Ramin Zabih.

MRF. For example, suppose the states have a natural ordering, as commonly arises if they are a discretization of an underlying continuous space. In this case, we can define a metric of the form E(xs, xt) = min(δ, ||xs − xt||), or a semi-metric of the form E(xs, xt) = min(δ, (xs − xt)²), for some constant δ > 0. This energy encourages neighbors to have similar labels, but never “punishes” them by more than δ. (This use of the δ term prevents over-smoothing, which we illustrate in Figure 9.3.)
9.1.4.4 Application to stereo depth estimation

Graphcuts is often applied to low-level computer vision problems, such as stereo depth estimation, which we discussed in Section 9.1.3.3. Figure 9.6 compares graphcuts (both the swap and expansion versions) to two other algorithms (simulated annealing, and a patch matching method based on normalized cross correlation) on the famous Tsukuba test image. The graphcuts approach works the best on this example, as well as on others [Sze+08; TF03]. It also tends to outperform belief propagation (results not shown) in terms of speed and accuracy on stereo problems [Sze+08; TF03], as well as on other problems such as CRF labeling of LIDAR point cloud data [LMW17].


Figure 9.6: An example of stereo depth estimation using MAP estimation in a pairwise discrete MRF. (a) Left image,
of size 384 × 288 pixels, from the University of Tsukuba. (The corresponding right image is similar, but not shown.)
(b) Ground truth depth map, quantized to 15 levels. (c-f ): MAP estimates using different methods: (c) α − β swap,
(d) α expansion, (e) normalized cross correlation, (f ) simulated annealing. From Figure 10 of [BVZ01]. Used with
kind permission of Ramin Zabih.

Chapter 10

Variational inference

10.1 Exact and approximate inference for PGMs


In this section, we discuss exact and approximate inference for discrete PGMs from a variational perspective,
following [WJ08].
Similar to Section 9.1, we will assume a pairwise MRF of the form
 
pθ(z|x) = (1/Z) exp{ Σ_{s∈V} θs(zs) + Σ_{(s,t)∈E} θst(zs, zt) }   (10.1)

We can write this as an exponential family model, p(z|x) = p̃(z)/Z, where Z = p(x) is the normalization constant, p̃(z) = exp(T(z)ᵀθ), θ = ({θs;j}, {θs,t;j,k}) are all the node and edge parameters (the canonical parameters), and T(z) = ({I (zs = j)}, {I (zs = j, zt = k)}) are all the node and edge indicator functions (the sufficient statistics). Note:
we use s, t ∈ V to index nodes and j, k ∈ X to index states.

10.1.1 Exact inference as VI


We know that the ELBO is a lower bound on the log marginal likelihood:

L(q) = Eq(z) [log p̃(z)] + H (q) ≤ log Z (10.2)

Let µ = Eq [T (z)] be the mean parameters of the variational distribution. Then we can rewrite this as

L(µ) = θT µ + H (µ) ≤ log Z (10.3)

The set of all valid (unrestricted) mean parameters µ is the marginal polytope corresponding to the graph,
M(G), as explained in Section 9.1.2. Optimizing over this set recovers q = p, and hence

max_{µ∈M(G)} θᵀµ + H(µ) = log Z   (10.4)

Equation (10.4) seems easy to optimize: the objective is concave, since it is the sum of a linear function
and a concave function (see Figure ?? to see why entropy is concave); furthermore, we are maximizing
this over a convex set, M(G). Hence there is a unique global optimum. However, the entropy is typically
intractable to compute, since it requires summing over all states. We discuss approximations below. See
Table 10.1 for a high level summary of the methods we discuss.

Method      Definition                                     Objective     Opt. Domain                   Section
Exact       max_{µ∈M(G)} θᵀµ + H(µ) = log Z                Concave       Marginal polytope, convex     Section 10.1.1
Mean field  max_{µ∈M_F(G)} θᵀµ + H_MF(µ) ≤ log Z           Concave       Nonconvex inner approx.       Section 10.1.2
Loopy BP    max_{τ∈L(G)} θᵀτ + H_Bethe(τ) ≈ log Z          Non-concave   Convex outer approx.          Section 10.1.3
TRBP        max_{τ∈L(G)} θᵀτ + H_TRBP(τ) ≥ log Z           Concave       Convex outer approx.          Section 10.1.5

Table 10.1: Summary of some variational inference methods for graphical models. TRBP is tree-reweighted belief
propagation.
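As a reference point, for tiny models the quantities in Equation (10.4) can be computed exactly by brute force. The hypothetical snippet below enumerates all states of a 3-node loopy pairwise MRF to get log Z and the exact mean parameters (node marginals); the approximations in Table 10.1 exist precisely because this enumeration is infeasible for realistic models:

import itertools
import numpy as np

K = 2                                  # states per node
edges = [(0, 1), (1, 2), (0, 2)]       # a small loopy graph (triangle)
rng = np.random.default_rng(0)
theta_s = rng.normal(size=(3, K))                        # node potentials theta_s(z_s)
theta_st = {e: rng.normal(size=(K, K)) for e in edges}   # edge potentials theta_st(z_s, z_t)

def score(z):
    # unnormalized log probability, the exponent in Equation (10.1)
    return (sum(theta_s[s, z[s]] for s in range(3))
            + sum(theta_st[(s, t)][z[s], z[t]] for (s, t) in edges))

states = list(itertools.product(range(K), repeat=3))
logZ = np.logaddexp.reduce([score(z) for z in states])
probs = {z: np.exp(score(z) - logZ) for z in states}
mu = np.zeros((3, K))                  # exact mean parameters mu_s(j) = E[I(z_s = j)]
for z, p in probs.items():
    for s in range(3):
        mu[s, z[s]] += p
print(logZ, mu)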

10.1.2 Mean field VI


The mean field approximation to the entropy is simply

H_MF(µ) = Σ_s H(µs)   (10.5)

which follows from the factorization assumption. Thus the mean field objective is

LMF (µ) = θT µ + HMF (µ) ≤ log Z (10.6)

This is a concave lower bound on log Z. We will maximize this over a simpler, but non-convex, inner approximation to M(G), as we now show.
First, let F be an edge subgraph of the original graph G, and let I(F ) ⊆ I be the subset of sufficient
statistics associated with the cliques of F . Let Ω be the set of canonical parameters for the full model, and
define the canonical parameter space for the submodel as follows:

Ω(F ) , {θ ∈ Ω : θα = 0 ∀α ∈ I \ I(F )} (10.7)

In other words, we require that the natural parameters associated with the sufficient statistics α outside of
our chosen class to be zero. For example, in the case of a fully factorized approximation, F0 , we remove all
edges from the graph, giving
Ω(F0 ) , {θ ∈ Ω : θst = 0 ∀(s, t) ∈ E} (10.8)
In the case of structured mean field (Section ??), we set θst = 0 for edges which are not in our tractable
subgraph.
Next, we define the mean parameter space of the restricted model as follows:

MF (G) , {µ ∈ Rd : µ = Eθ [T (z)] for some θ ∈ Ω(F )} (10.9)

This is called an inner approximation to the marginal polytope, since MF (G) ⊆ M(G). See Figure 9.1(b)
for a sketch. Note that MF (G) is a non-convex polytope, which results in multiple local optima.
Thus the mean field problem becomes

max_{µ∈M_F(G)} θᵀµ + H_MF(µ)   (10.10)

This requires maximizing a concave objective over a non-convex set. It is typically optimized using coordinate
ascent, since it is easy to optimize a scalar concave function over the marginal distribution for each node.
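A minimal sketch of this coordinate ascent, reusing the (theta_s, theta_st, edges) representation from the brute-force snippet above: each pass updates q_s(x_s) ∝ exp(θs(x_s) + Σ_{t∈nbr(s)} Σ_{x_t} q_t(x_t) θst(x_s, x_t)), which is the exact maximizer over node s's marginal with the others held fixed.

import numpy as np

def mean_field(theta_s, theta_st, edges, n_iters=50):
    n, K = theta_s.shape
    nbrs = {s: [] for s in range(n)}
    for (s, t) in edges:
        nbrs[s].append((t, theta_st[(s, t)]))     # matrix indexed [x_s, x_t]
        nbrs[t].append((s, theta_st[(s, t)].T))   # transposed view for the other end
    q = np.full((n, K), 1.0 / K)                  # fully factorized q
    for _ in range(n_iters):
        for s in range(n):
            logits = theta_s[s].copy()
            for t, th in nbrs[s]:
                logits += th @ q[t]               # E_q[theta_st(x_s, z_t)]
            q[s] = np.exp(logits - np.logaddexp.reduce(logits))
    return q

# usage: q = mean_field(theta_s, theta_st, edges)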

10.1.3 Loopy belief propagation as VI


Recall from Section 10.1.1 that exact inference can be posed as solving the following optimization problem:
maxµ∈M(G) θ T µ + H (µ), where M(G) is the marginal polytope corresponding to the graph (see Section 9.1.2
for details). Since this set has exponentially many facets, it is intractable to optimize over.
In Section 10.1.2, we discussed the mean field approximation, which uses a nonconvex inner approximation,
MF (G), obtained by dropping some edges from the graphical model, thus enforcing a factorization of the
posterior. We also approximated the entropy by using the entropy of each marginal.

In this section, we will consider a convex outer approximation, L(G), based on pseudo marginals, as in
Section 9.1.3.1. We also need to approximate the entropy (which was not needed when performing MAP
estimation, discussed in Section 9.1.3). We discuss this entropy approximation in Section 10.1.3.1, and then
show how we can use this to approximate log Z. Finally we show that loopy belief propagation attempts to
optimize this approximation.

10.1.3.1 Bethe free energy


From Equation (9.17), we know that a joint distribution over a tree-structured graphical model can be
represented exactly by the following:
pµ(x) = Π_{s∈V} µs(xs) Π_{(s,t)∈E} [ µst(xs, xt) / (µs(xs) µt(xt)) ]   (10.11)

This satisfies the normalization and pairwise marginalization constraints of the outer approximation by
construction.
From Equation 10.11, we can write the exact entropy of any tree structured distribution µ ∈ M(T ) as
follows:
H(µ) = Σ_{s∈V} Hs(µs) − Σ_{(s,t)∈E} Ist(µst)   (10.12)
Hs(µs) = − Σ_{xs∈Xs} µs(xs) log µs(xs)   (10.13)
Ist(µst) = Σ_{(xs,xt)∈Xs×Xt} µst(xs, xt) log [ µst(xs, xt) / (µs(xs) µt(xt)) ]   (10.14)

Note that we can rewrite the mutual information term in the form Ist (µst ) = Hs (µs ) + Ht (µt ) − Hst (µst ),
and hence we get the following alternative but equivalent expression:
H(µ) = − Σ_{s∈V} (ds − 1) Hs(µs) + Σ_{(s,t)∈E} Hst(µst)   (10.15)

where ds is the degree (number of neighbors) for node s.


The Bethe¹ approximation to the entropy is simply the use of Equation 10.12 even when we don’t have a tree:

H_Bethe(τ) = Σ_{s∈V} Hs(τs) − Σ_{(s,t)∈E} Ist(τst)   (10.16)

We define the Bethe free energy as the expected energy minus approximate entropy:
 
F_Bethe(τ) ≜ −[ θᵀτ + H_Bethe(τ) ] ≈ −log Z   (10.17)

Thus our final objective becomes


max_{τ∈L(G)} θᵀτ + H_Bethe(τ)   (10.18)

We call this the Bethe variational problem or BVP. The space we are optimizing over is a convex set,
but the objective itself is not concave (since HBethe is not concave). Thus there can be multiple local optima.
Also, the entropy approximation is not a bound (either upper or lower) on the true entropy. Thus the value
obtained by the BVP is just an approximation to log Z(θ). However, in the case of trees, the approximation
is exact. Also, in the case of models with attractive potentials, the resulting value turns out to be an upper
bound [SWW08]. In Section 10.1.5, we discuss how to modify the algorithm so it always minimizes an upper
bound for any model.
1 Hans Bethe was a German-American physicist, 1906–2005.

10.1.3.2 LBP messages are Lagrange multipliers
In this subsection, we will show that any fixed point of the LBP algorithm defines a stationary point of the above constrained objective. Let us define the normalization constraint as Css(τ) ≜ −1 + Σ_{xs} τs(xs), and the marginalization constraint as Cts(xs; τ) ≜ τs(xs) − Σ_{xt} τst(xs, xt) for each edge t → s. We can now write the Lagrangian as

L(τ, λ; θ) ≜ θᵀτ + H_Bethe(τ) + Σ_s λss Css(τ) + Σ_{(s,t)} [ Σ_{xs} λts(xs) Cts(xs; τ) + Σ_{xt} λst(xt) Cst(xt; τ) ]   (10.19)

(The constraint that τ ≥ 0 is not explicitly enforced, but one can show that it will hold at the optimum since
θ > 0.) Some simple algebra then shows that ∇τ L = 0 yields
log τs(xs) = λss + θs(xs) + Σ_{t∈nbr(s)} λts(xs)   (10.20)

log [ τst(xs, xt) / (τ̃s(xs) τ̃t(xt)) ] = θst(xs, xt) − λts(xs) − λst(xt)   (10.21)
where we have defined τ̃s(xs) ≜ Σ_{xt} τst(xs, xt). Using the fact that the marginalization constraint implies τ̃s(xs) = τs(xs), we get

log τst(xs, xt) = λss + λtt + θst(xs, xt) + θs(xs) + θt(xt) + Σ_{u∈nbr(s)\t} λus(xs) + Σ_{u∈nbr(t)\s} λut(xt)   (10.22)

To make the connection to message passing, define mt→s (xs ) = exp(λts (xs )). With this notation, we can
rewrite the above equations (after taking exponents of both sides) as follows:
τs(xs) ∝ exp(θs(xs)) Π_{t∈nbr(s)} m_{t→s}(xs)   (10.23)

τst(xs, xt) ∝ exp(θst(xs, xt) + θs(xs) + θt(xt)) × Π_{u∈nbr(s)\t} m_{u→s}(xs) Π_{u∈nbr(t)\s} m_{u→t}(xt)   (10.24)

where the λ terms and irrelevant constants are absorbed into the constant of proportionality. We see that
this is equivalent to the usual expression for the node and edge marginals in LBP.
To derive an equation for the messages in terms of other messages (rather than in terms of λts), we enforce the marginalization condition Σ_{xt} τst(xs, xt) = τs(xs). Then one can show that

m_{t→s}(xs) ∝ Σ_{xt} exp{ θst(xs, xt) + θt(xt) } Π_{u∈nbr(t)\s} m_{u→t}(xt)   (10.25)

We see that this is equivalent to the usual expression for the messages in LBP.
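A sketch of damped loopy BP using the message update (10.25), followed by evaluation of the Bethe objective θᵀτ + H_Bethe(τ) at the resulting pseudo-marginals, is given below; it assumes the same (theta_s, theta_st, edges) representation as the earlier snippets.

import numpy as np

def loopy_bp(theta_s, theta_st, edges, n_iters=200, damp=0.5):
    n, K = theta_s.shape
    pot, nbrs = {}, {u: [] for u in range(n)}
    for (s, t) in edges:
        pot[(s, t)] = np.exp(theta_st[(s, t)])       # indexed [x_s, x_t]
        pot[(t, s)] = pot[(s, t)].T
        nbrs[s].append(t); nbrs[t].append(s)
    msg = {key: np.full(K, 1.0 / K) for key in pot}  # msg[(t, s)] = m_{t->s}
    for _ in range(n_iters):
        for (t, s) in msg:
            h = np.exp(theta_s[t])                   # exp(theta_t) times incoming msgs
            for u in nbrs[t]:
                if u != s:
                    h = h * msg[(u, t)]
            new = pot[(t, s)].T @ h                  # marginalize over x_t, (10.25)
            msg[(t, s)] = damp * msg[(t, s)] + (1 - damp) * new / new.sum()
    tau = np.exp(theta_s).copy()
    for s in range(n):
        for t in nbrs[s]:
            tau[s] = tau[s] * msg[(t, s)]
        tau[s] = tau[s] / tau[s].sum()               # node beliefs, (10.23)
    # Bethe objective theta^T tau + H_Bethe(tau), Equation (10.18)
    obj = sum(tau[s] @ theta_s[s] - tau[s] @ np.log(tau[s]) for s in range(n))
    for (s, t) in edges:
        ps, pt = np.exp(theta_s[s]), np.exp(theta_s[t])
        for u in nbrs[s]:
            if u != t:
                ps = ps * msg[(u, s)]
        for u in nbrs[t]:
            if u != s:
                pt = pt * msg[(u, t)]
        tau_st = pot[(s, t)] * np.outer(ps, pt)      # edge beliefs, (10.24)
        tau_st = tau_st / tau_st.sum()
        mi = (tau_st * np.log(tau_st / np.outer(tau[s], tau[t]))).sum()
        obj += (tau_st * theta_st[(s, t)]).sum() - mi
    return tau, obj   # obj approximates log Z (exact if the graph is a tree)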

10.1.3.3 Kikuchi free energy


We have shown that LBP minimizes the Bethe free energy. In this section, we show that generalized BP
(??) minimizes the Kikuchi free energy; we define this below, but the key idea is that it is a tighter
approximation to log Z.

In more detail, define Lt (G) to be the set of all pseudo-marginals such that normalization and marginal-
ization constraints hold on a hyper-graph whose largest hyper-edge is of size t + 1. For example, in Figure ??,
we impose constraints of the form
Σ_{x1,x2} τ1245(x1, x2, x4, x5) = τ45(x4, x5),   Σ_{x6} τ56(x5, x6) = τ5(x5),   . . .   (10.26)

Furthermore, we approximate the entropy as follows:


H_Kikuchi(τ) ≜ Σ_{g∈E} c(g) Hg(τg)   (10.27)

where Hg (τg ) is the entropy of the joint (pseudo) distribution on the vertices in set g, and c(g) is called the
overcounting number of set g. These are related to Möbius numbers in set theory. Rather than giving
a precise definition, we just give a simple example. For the graph in Figure ??, we have

HKikuchi (τ ) = −[H1245 + H2356 + H4578 + H5689 ] − [H25 + H45 + H56 + H58 ] + H5 (10.28)

Putting these two approximations together, we can define the Kikuchi free energy2 as follows:
 
F_Kikuchi(τ) ≜ −[ θᵀτ + H_Kikuchi(τ) ] ≈ −log Z   (10.29)

Our variational problem becomes


max_{τ∈L(G)} θᵀτ + H_Kikuchi(τ)   (10.30)

Just as with the Bethe free energy, this is not a concave objective. There are several possible algorithms
for finding a local optimum of this objective, including generalized belief propagation. For details, see e.g.,
[WJ08, Sec 4.2] or [KF09, Sec 11.3.2].

10.1.4 Convex belief propagation


The mean field energy functional is concave, but it is maximized over a non-convex inner approximation to
the marginal polytope. The Bethe and Kikuchi energy functionals are not concave, but they are maximized
over a convex outer approximation to the marginal polytope. Consequently, for both MF and LBP, the
optimization problem has multiple optima, so the methods are sensitive to the initial conditions. Given that
the exact formulation Equation (10.4) is a concave objective maximized over a convex set, it is natural to try
to come up with an approximation of a similar form, without local optima.
Convex belief propagation involves working with a set of tractable submodels, F, such as trees or
planar graphs. For each model F ⊂ G, the entropy is higher, H (µ(F )) ≥ H (µ(G)), since F has fewer
constraints. Consequently, any convex combination of such subgraphs will have higher entropy, too:
H(µ(G)) ≤ Σ_{F∈F} ρ(F) H(µ(F)) ≜ H(µ, ρ)   (10.31)

where ρ(F) ≥ 0 and Σ_F ρ(F) = 1. Furthermore, H(µ, ρ) is a concave function of µ.
Having defined an upper bound on the entropy, we now consider a convex outer bound on the marginal polytope of mean parameters. We want to ensure we can evaluate the entropy of any vector τ in this set, so we restrict it so that the projection of τ onto each subgraph F lives in the projection of M onto F:

L(G; F) , {τ ∈ Rd : τ (F ) ∈ M(F ) ∀F ∈ F} (10.32)

This is a convex set since each M(F ) is a projection of a convex set. Hence we define our problem as

max_{τ∈L(G;F)} τᵀθ + H(τ, ρ)   (10.33)

2 Ryoichi Kikuchi is a Japanese physicist.


Figure 10.1: (a) A graph. (b-d) Some of its spanning trees. From Figure 7.1 of [WJ08]. Used with kind permission of
Martin Wainwright.

This is a concave objective being maximized over a convex set, and hence has a unique optimum. Furthermore,
the result is always an upper bound on log Z, because the entropy is an upper bound, and we are optimizing
over a larger set than the marginal polytope.
It remains to specify the set of tractable submodels, F, and the distribution ρ. We discuss some options
below.

10.1.5 Tree-reweighted belief propagation


In this section, we discuss tree reweighted BP [WJW05b; Kol06], which is a form of convex BP which
uses spanning trees as the set of tractable models F, as we describe below.

10.1.5.1 Spanning tree polytope


It remains to specify the set of tractable submodels, F, and the distribution ρ. We will consider the case where F is all spanning trees of a graph. For any given tree, the entropy is given by Equation 10.12. To compute the upper bound, obtained by averaging over all trees, note that the terms Σ_F ρ(F) H(µ(F)s) for single nodes will just be Hs, since node s appears in every tree, and Σ_F ρ(F) = 1. But the mutual information term Ist receives weight ρst = Eρ[I ((s, t) ∈ E(T))], known as the edge appearance probability. Hence we have the following upper bound on the entropy:

H(µ) ≤ Σ_{s∈V} Hs(µs) − Σ_{(s,t)∈E} ρst Ist(µst) ≜ H_TRBP(µ)   (10.34)

This is called the tree reweighted BP approximation [WJW05b; Kol06]. This is similar to the Bethe
approximation to the entropy except for the crucial ρst weights. So long as ρst > 0 for all edges (s, t), this
gives a valid concave upper bound on the exact entropy.
The edge appearance probabilities live in a space called the spanning tree polytope. This is because
they are constrained to arise from a distribution over trees. Figure 10.1 gives an example of a graph and
three of its spanning trees. Suppose each tree has equal weight under ρ. The edge f occurs in 1 of the 3 trees,
so ρf = 1/3. The edge e occurs in 2 of the 3 trees, so ρe = 2/3. The edge b appears in all of the trees, so
ρb = 1. And so on. Ideally we can find a distribution ρ, or equivalently edge probabilities in the spanning tree
polytope, that make the above bound as tight as possible. An algorithm to do this is described in [WJW05a].
A simpler approach is to use all single edges with weight ρe = 1/E.
What about the set we are optimizing over? We require µ(T ) ∈ M(T ) for each tree T , which means
enforcing normalization and local consistency. Since we have to do this for every tree, we are enforcing
normalization and local consistency on every edge. Thus we are effectively optimizing in the pseudo-marginal
polytope L(G). So our final optimization problem is as follows:

max_{τ∈L(G)} τᵀθ + H_TRBP(τ) ≥ log Z   (10.35)

10.1.5.2 Message passing implementation
The simplest way to minimize Equation (10.35) is a modification of belief propagation known as tree
reweighted belief propagation. The message from t to s is now a function of all messages sent from other
neighbors v to t, as before, but now it is also a function of the message sent from s to t. Specifically, we have
the following [WJ08, Sec 7.2.1]:

m_{t→s}(xs) ∝ Σ_{xt} exp( (1/ρst) θst(xs, xt) + θt(xt) ) [ Π_{v∈nbr(t)\s} [m_{v→t}(xt)]^{ρvt} ] / [m_{s→t}(xt)]^{1−ρts}   (10.36)

At convergence, the node and edge pseudo marginals are given by

τs(xs) ∝ exp(θs(xs)) Π_{v∈nbr(s)} [m_{v→s}(xs)]^{ρvs}   (10.37)

τst(xs, xt) ∝ ϕst(xs, xt) [ Π_{v∈nbr(s)\t} [m_{v→s}(xs)]^{ρvs} Π_{v∈nbr(t)\s} [m_{v→t}(xt)]^{ρvt} ] / ( [m_{t→s}(xs)]^{1−ρst} [m_{s→t}(xt)]^{1−ρts} )   (10.38)

ϕst(xs, xt) ≜ exp( (1/ρst) θst(xs, xt) + θs(xs) + θt(xt) )   (10.39)

If ρst = 1 for all edges (s, t) ∈ E, the algorithm reduces to the standard LBP algorithm. However, the
condition ρst = 1 implies every edge is present in every spanning tree with probability 1, which is only
possible if the original graph is a tree. Hence the method is only equivalent to standard LBP on trees, when
the method is of course exact.
In general, this message passing scheme is not guaranteed to converge to the unique global optimum.
One can devise double-loop methods that are guaranteed to converge [HS08], but in practice, using damped
updates as in Equation ?? is often sufficient to ensure convergence.

10.1.5.3 Max-product version


We can modify TRBP to solve the MAP estimation problem (as opposed to estimating posterior marginals)
by replacing sums with products in Equation (10.36) (see [WJ08, Sec 8.4.3] for details). This is guaranteed
to converge to the LP relaxation discussed in Section 9.1.3 under a suitable scheduling known as sequential
tree-reweighted message passing [Kol06].

10.1.6 Other tractable versions of convex BP


It is possible to upper bound the entropy using convex combinations of other kinds of tractable models
besides trees. One example is a planar MRF (one where the graph has no edges that cross), with binary nodes and no external field, i.e., the model has the form p(x) ∝ exp( Σ_{(s,t)∈E} θst xs xt ). It turns out that it is
possible to perform exact inference in this model. Hence one can use convex combinations of such graphs
which can sometimes yield more accurate results than TRBP, albeit at higher computational cost. See [GJ07]
for details, and [Sch10b] for a related exact method for planar Ising models.

Chapter 11

Monte Carlo Inference

Chapter 12

Markov Chain Monte Carlo (MCMC) inference

Chapter 13

Sequential Monte Carlo (SMC) inference

Part III

Prediction

Chapter 14

Discriminative models: an overview
Chapter 15

Generalized linear models

15.1 Variational inference for logistic regression


In this section we discuss a variational approach to Bayesian inference for logistic regression models based
on local bounds to the likelihood. We will use a Gaussian prior, p(w) = N (w|µ0 , V0 ). We will create a
“Gaussian-like” lower bound to the likelihood, which becomes conjugate to this prior. We then iteratively
improve this lower bound.

15.1.1 Binary logistic regression


In this section, we discuss VI for binary logistic regression. Our presentation follows [Bis06, Sec 10.6].
Let us first rewrite the likelihood for a single observation as follows:
p(yn|xn, w) = σ(ηn)^{yn} (1 − σ(ηn))^{1−yn}   (15.1)
= (1/(1 + e^{−ηn}))^{yn} (1 − 1/(1 + e^{−ηn}))^{1−yn}   (15.2)
= e^{ηn yn} e^{−ηn}/(1 + e^{−ηn}) = e^{ηn yn} σ(−ηn)   (15.3)
where ηn = wT xn are the logits. This is not conjugate to the Gaussian prior. So we will use the following
“Gaussian-like” variational lower bound to the sigmoid function, proposed in [JJ96; JJ00]:
 
σ(ηn) ≥ σ(ψn) exp[ (ηn − ψn)/2 − λ(ψn)(ηn² − ψn²) ]   (15.4)
where ψ n is the variational parameter for datapoint n, and
 
λ(ψ) ≜ (1/(4ψ)) tanh(ψ/2) = (1/(2ψ)) [ σ(ψ) − 1/2 ]   (15.5)
We shall refer to this as the JJ bound, after its inventors, Jaakkola and Jordan. See Figure 15.1(a) for a
plot, and see Section 6.5.4.2 for a derivation.
Using this bound, we can write
 
p(yn|xn, w) = e^{ηn yn} σ(−ηn) ≥ e^{ηn yn} σ(ψn) exp[ (−ηn − ψn)/2 − λ(ψn)(ηn² − ψn²) ]   (15.6)
We can now lower bound the log joint as follows:
log p(y|X, w) + log p(w) ≥ −(1/2)(w − µ0)ᵀ V0^{−1} (w − µ0)   (15.7)
+ Σ_{n=1}^N [ ηn(yn − 1/2) − λ(ψn) wᵀ(xn xnᵀ)w ]   (15.8)


Figure 15.1: Quadratic lower bounds on the sigmoid (logistic) function. In solid red, we plot σ(x) vs x. In dotted
blue, we plot the lower bound L(x, ψ) vs x for ψ = 2.5. (a) JJ bound. This is tight at ψ = ±2.5. (b) Bohning bound
(Section 15.1.2.2). This is tight at ψ = 2.5. Generated by sigmoid_lower_bounds.py.

Since this is a quadratic function of w, we can derive a Gaussian posterior approximation as follows:

q(w|ψ) = N(w|µN, VN)   (15.9)
µN = VN ( V0^{−1} µ0 + Σ_{n=1}^N (yn − 1/2) xn )   (15.10)
VN^{−1} = V0^{−1} + 2 Σ_{n=1}^N λ(ψn) xn xnᵀ   (15.11)

This is more flexible than a Laplace approximation, since the variational parameters ψ can be used to
optimize the curvature of the posterior covariance. To find the optimal ψ, we can maximize the ELBO, which
is given by

log p(y|X) = log ∫ p(y|X, w) p(w) dw ≥ log ∫ h(w, ψ) p(w) dw = Ł(ψ)   (15.12)

where
h(w, ψ) = Π_{n=1}^N σ(ψn) exp[ ηn yn − (ηn + ψn)/2 − λ(ψn)(ηn² − ψn²) ]   (15.13)
We can evaluate the lower bound analytically to get
Ł(ψ) = (1/2) log(|VN|/|V0|) + (1/2) µNᵀ VN^{−1} µN − (1/2) µ0ᵀ V0^{−1} µ0 + Σ_{n=1}^N [ log σ(ψn) − ψn/2 + λ(ψn) ψn² ]   (15.14)

If we solve for ∇ψ Ł(ψ) = 0, we get the following iterative update equation for each variational parameter:
  
(ψn^{new})² = xnᵀ E[wwᵀ] xn = xnᵀ (VN + µN µNᵀ) xn   (15.15)

Once we have estimated ψn, we can plug it into the above Gaussian approximation q(w|ψ).
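The whole procedure is only a few lines; here is a sketch on synthetic data, iterating the updates (15.10), (15.11), and (15.15). (The λ function needs its ψ → 0 limit of 1/8 handled explicitly.)

import numpy as np

def lam(psi):
    # lambda(psi) = tanh(psi/2) / (4 psi), with the psi -> 0 limit of 1/8
    psi = np.asarray(psi, dtype=float)
    out = np.full_like(psi, 0.125)
    nz = np.abs(psi) > 1e-8
    out[nz] = np.tanh(psi[nz] / 2) / (4 * psi[nz])
    return out

def jj_vi(X, y, mu0, V0, n_iters=20):
    psi = np.ones(X.shape[0])
    V0inv = np.linalg.inv(V0)
    for _ in range(n_iters):
        VN = np.linalg.inv(V0inv + 2 * (X.T * lam(psi)) @ X)    # (15.11)
        muN = VN @ (V0inv @ mu0 + X.T @ (y - 0.5))              # (15.10)
        E_wwT = VN + np.outer(muN, muN)
        psi = np.sqrt(np.einsum("nd,de,ne->n", X, E_wwT, X))    # (15.15)
    return muN, VN, psi

# demo on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
w_true = np.array([2.0, -1.0])
y = (rng.uniform(size=200) < 1 / (1 + np.exp(-X @ w_true))).astype(float)
muN, VN, psi = jj_vi(X, y, np.zeros(2), 10.0 * np.eye(2))
print(muN)   # posterior mean, roughly aligned with w_true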

15.1.2 Multinomial logistic regression


In this section we discuss how to approximate the posterior p(w|D) for multinomial logistic regression using
variational inference, extending the approach of Section 15.1 to the multi-class case. The key idea is to create
a “Gaussian-like” lower bound on the multi-class logistic regression likelihood due to [Boh92]. We can then
compute the variational posterior in closed form. This will let us deterministically optimize the ELBO.
Let yi ∈ {0, 1}C be a one-hot label vector, and define the logits for example i to be

η i = [xTi w1 , . . . , xTi wC ] (15.16)

If we define Xi = I ⊗ xi , where ⊗ is the kronecker product, and I is C × C identity matrix, then we can write
the logits as η i = Xi w. (For example, if C = 2 and xi = [1, 2, 3], we have Xi = [1, 2, 3, 0, 0, 0; 0, 0, 0, 1, 2, 3].)
Then the likelihood is given by
p(y|X, w) = Π_{i=1}^N exp[ yiᵀ ηi − lse(ηi) ]   (15.17)

where lse() is the log-sum-exp function


lse(ηi) ≜ log( Σ_{c=1}^C exp(ηic) )   (15.18)

For identifiability, we can set wC = 0, so


lse(ηi) = log( 1 + Σ_{m=1}^M exp(ηim) )   (15.19)

where M = C − 1. (We subtract 1 so that in the binary case, M = 1.)

15.1.2.1 Bohning’s quadratic bound to the log-sum-exp function


The above likelihood is not conjugate to the Gaussian prior. However, we can convert it to a quadratic form. Consider a Taylor series expansion of the log-sum-exp function around ψi ∈ R^M:
lse(ηi) = lse(ψi) + (ηi − ψi)ᵀ g(ψi) + (1/2)(ηi − ψi)ᵀ H(ψi)(ηi − ψi)   (15.20)
g(ψi) = exp[ψi − lse(ψi)] = S(ψi)   (15.21)
H(ψi) = diag(g(ψi)) − g(ψi) g(ψi)ᵀ   (15.22)

where g and H are the gradient and Hessian of lse, and ψi ∈ R^M, where M = C − 1 is the number of classes minus 1. An upper bound to lse can be found by replacing the Hessian matrix H(ψi) with a matrix Ai such that Ai ⪰ H(ψi) for all ψi. [Boh92] showed that this can be achieved if we use the matrix Ai = (1/2)[ I_M − (1/(M+1)) 1_M 1_Mᵀ ]. In the binary case, this becomes Ai = (1/2)(1 − 1/2) = 1/4.
Note that Ai is independent of ψ i ; however, we still write it as Ai (rather than dropping the i subscript),
since other bounds that we consider below will have a data-dependent curvature term. The upper bound on
lse therefore becomes
lse(ηi) ≤ (1/2) ηiᵀ Ai ηi − biᵀ ηi + ci   (15.23)
Ai = (1/2) [ I_M − (1/(M+1)) 1_M 1_Mᵀ ]   (15.24)
bi = Ai ψi − g(ψi)   (15.25)
ci = (1/2) ψiᵀ Ai ψi − g(ψi)ᵀ ψi + lse(ψi)   (15.26)
where ψ i ∈ RM is a vector of variational parameters.
We can use the above result to get the following lower bound on the softmax likelihood:
 
log p(yi|xi, w) ≥ yiᵀ Xi w − (1/2) wᵀ Xiᵀ Ai Xi w + biᵀ Xi w − ci   (15.27)

To simplify notation, define the pseudo-measurement

ỹi ≜ Ai^{−1} (bi + yi)   (15.28)

105
Then we can get a “Gaussianized” version of the observation model:

p(yi|xi, w) ≥ f(xi, ψi) N(ỹi | Xi w, Ai^{−1})   (15.29)

where f (xi , ψ i ) is some function that does not depend on w. Given this, it is easy to compute the posterior
q(w) = N (mN , VN ), using Bayes rule for Gaussians.
Given the posterior, we can write the ELBO as follows:
"N #
X
Ł(ψ) , −DKL (q(w)kp(w)) + Eq log p(yi |xi , w) (15.30)
i=1
"N #
X
= −DKL (q(w)kp(w)) + Eq yTi η i − lse(η i ) (15.31)
i=1
N
X N
X
= −DKL (q(w)kp(w)) + yTi Eq [η i ] − Eq [lse(η i )] (15.32)
i=1 i=1

where p(w) = N (w|m0 , V0 ) is the prior and q(w) = N (w|mN , VN ) is the approximate posterior. The first
term is just the KL divergence between two Gaussians, which is given by
−DKL(N(mN, VN) ‖ N(m0, V0)) = −(1/2)[ tr(VN V0^{−1}) − log |VN V0^{−1}| + (mN − m0)ᵀ V0^{−1} (mN − m0) − DM ]   (15.33)

where DM is the dimensionality of the Gaussian, and we assume a prior of the form p(w) = N(m0, V0), where typically m0 = 0_{DM}, and V0 is block diagonal. The second term is simply
Σ_{i=1}^N yiᵀ Eq[ηi] = Σ_{i=1}^N yiᵀ m̃i   (15.34)

where m̃i ≜ Xi mN. The final term can be lower bounded by taking expectations of our quadratic upper bound on lse as follows:

− Σ_{i=1}^N Eq[lse(ηi)] ≥ Σ_{i=1}^N [ −(1/2) tr(Ai Ṽi) − (1/2) m̃iᵀ Ai m̃i + biᵀ m̃i − ci ]   (15.35)

where Ṽi ≜ Xi VN Xiᵀ. Hence we have

Ł(ψ) ≥ −(1/2)[ tr(VN V0^{−1}) − log |VN V0^{−1}| + (mN − m0)ᵀ V0^{−1} (mN − m0) − DM ]
+ Σ_{i=1}^N [ yiᵀ m̃i − (1/2) tr(Ai Ṽi) − (1/2) m̃iᵀ Ai m̃i + biᵀ m̃i − ci ]   (15.36)

We will use coordinate ascent to optimize this lower bound. That is, we update the variational posterior
parameters VN and mN , and then the variational likelihood parameters ψ i . We leave the detailed derivation
as an exercise, and just state the results. We have
VN = ( V0^{−1} + Σ_{i=1}^N Xiᵀ Ai Xi )^{−1}   (15.37)
mN = VN ( V0^{−1} m0 + Σ_{i=1}^N Xiᵀ (yi + bi) )   (15.38)
ψi = m̃i = Xi mN   (15.39)

We can exploit the fact that Ai is a constant matrix, plus the fact that Xi has block structure, to simplify
the first two terms as follows:
VN = ( V0^{−1} + A ⊗ Σ_{i=1}^N xi xiᵀ )^{−1}   (15.40)
mN = VN ( V0^{−1} m0 + Σ_{i=1}^N (yi + bi) ⊗ xi )   (15.41)

where ⊗ denotes the kronecker product.
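A sketch of these updates is below, assuming one-hot labels stored as an (N, M) matrix Y (with an all-zero row encoding the reference class C) and class-major ordering of w, so that Σi Xiᵀ A Xi = A ⊗ Σi xi xiᵀ. Since A is constant, VN is computed once, outside the loop, as noted above.

import numpy as np
from scipy.special import logsumexp

def bohning_vi(X, Y, m0, V0, n_iters=20):
    # X: (N, D) inputs; Y: (N, M) one-hot labels, M = C - 1
    N, D = X.shape
    M = Y.shape[1]
    A = 0.5 * (np.eye(M) - np.ones((M, M)) / (M + 1))   # (15.24)
    V0inv = np.linalg.inv(V0)
    VN = np.linalg.inv(V0inv + np.kron(A, X.T @ X))     # (15.40), constant
    Psi = np.zeros((N, M))                              # variational parameters
    for _ in range(n_iters):
        # g(psi_i) = exp(psi_i - lse(psi_i)), with the reference class's 0 logit
        lse = logsumexp(np.hstack([Psi, np.zeros((N, 1))]), axis=1)
        G = np.exp(Psi - lse[:, None])                  # (15.21)
        B = Psi @ A - G                                 # b_i = A psi_i - g(psi_i), (15.25)
        # sum_i (y_i + b_i) kron x_i, in class-major order
        rhs = V0inv @ m0 + ((Y + B)[:, :, None] * X[:, None, :]).reshape(N, -1).sum(0)
        mN = VN @ rhs                                   # (15.41)
        Psi = X @ mN.reshape(M, D).T                    # psi_i = X_i m_N, (15.39)
    return mN, VN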

15.1.2.2 Bohning’s bound in the binary case


If we have binary data, then yi ∈ {0, 1}, M = 1 and ηi = wT xi where w ∈ RD is a weight vector (not matrix).
In this case, the Bohning bound becomes
log(1 + e^η) ≤ (1/2) a η² − b η + c   (15.42)
a = 1/4   (15.43)
b = aψ − (1 + e^{−ψ})^{−1}   (15.44)
c = (1/2) aψ² − (1 + e^{−ψ})^{−1} ψ + log(1 + e^ψ)   (15.45)
It is possible to derive an alternative quadratic bound for this case, as shown in Section 6.5.4.2. This has the following form:

log(1 + e^η) ≤ λ(ψ)(η² − ψ²) + (η − ψ)/2 + log(1 + e^ψ)   (15.46)
λ(ψ) ≜ (1/(4ψ)) tanh(ψ/2) = (1/(2ψ)) [ σ(ψ) − 1/2 ]   (15.47)
To facilitate comparison with Bohning’s bound, let us rewrite the JJ bound as a quadratic form as follows:

log(1 + e^η) ≤ (1/2) a(ψ) η² − b(ψ) η + c(ψ)   (15.48)
a(ψ) = 2λ(ψ)   (15.49)
b(ψ) = −1/2   (15.50)
c(ψ) = −λ(ψ)ψ² − (1/2)ψ + log(1 + e^ψ)   (15.51)
The JJ bound has an adaptive curvature term, since a depends on ψ. In addition, it is tight at two points,
as is evident from Figure 15.1(a). By contrast, the Bohning bound is a constant curvature bound, and is
only tight at one point, as is evident from Figure 15.1(b). Nevertheless, the Bohning bound is simpler, and
somewhat faster to compute, since VN is a constant, independent of the variational parameters Ψ.
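A quick numerical sanity check of the two bounds in the binary case (both should upper bound log(1 + e^η) for all η; the JJ bound is tight at η = ±ψ, the Bohning bound only at η = ψ):

import numpy as np

def lam(psi):
    return np.tanh(psi / 2) / (4 * psi)

psi = 2.5
eta = np.linspace(-6, 6, 25)
target = np.log1p(np.exp(eta))
jj = lam(psi) * (eta**2 - psi**2) + (eta - psi) / 2 + np.log1p(np.exp(psi))
a, g = 0.25, 1 / (1 + np.exp(-psi))                       # a and sigma(psi)
boh = (0.5 * a * eta**2 - (a * psi - g) * eta
       + 0.5 * a * psi**2 - g * psi + np.log1p(np.exp(psi)))
assert np.all(jj >= target - 1e-9) and np.all(boh >= target - 1e-9)
print(jj - target)   # zero at eta = +/- psi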

15.1.2.3 Other bounds


It is possible to devise bounds that are even more accurate than the JJ bound, and which work for the
multiclass case, by using a piecewise quadratic upper bound to lse, as described in [MKM11]. By increasing
the number of pieces, the bound can be made arbitrarily tight.
It is also possible to come up with approximations that are not bounds. For example, [SF19] gives a
simple approximation for the output of a softmax layer when applied to a stochastic input (characterized in
terms of its first two moments).

15.2 Converting multinomial logistic regression to Poisson regression
It is possible to represent a multinomial logistic regression model with K outputs as K separate Poisson
regression models. (Although the Poisson models are fit separately, they are implicitly coupled, since the
counts must sum to Nn across all K outcomes.) This fact can enable more efficient training when the number
of categories is large [Tad15].
To see why this relationship is true, we follow the presentation of [McE20, Sec 11.3.3]. We assume K = 2
for notational brevity (i.e., binomial regression). Assume we have m trials, with counts y1 and y2 of each
outcome type. The multinomial likelihood has the form
p(y1, y2|m, µ1, µ2) = (m! / (y1! y2!)) µ1^{y1} µ2^{y2}   (15.52)
Now consider a product of two Poisson likelihoods, one for each set of counts:

p(y1, y2|λ1, λ2) = p(y1|λ1) p(y2|λ2) = (e^{−λ1} λ1^{y1} / y1!) (e^{−λ2} λ2^{y2} / y2!)   (15.53)
We now show that these are equivalent, under a suitable setting of the parameters.
Let Λ = λ1 + λ2 be the expected total number of counts of any type, µ1 = λ1 /Λ and µ2 = λ2 /Λ.
Substituting into the binomial likelihood gives
p(y1, y2|m, µ1, µ2) = (m! / (y1! y2!)) (λ1/Λ)^{y1} (λ2/Λ)^{y2} = (m! / (Λ^{y1} Λ^{y2})) (λ1^{y1}/y1!) (λ2^{y2}/y2!)   (15.54)
= (m! / (Λ^m e^{−λ1} e^{−λ2})) (e^{−λ1} λ1^{y1}/y1!) (e^{−λ2} λ2^{y2}/y2!)   (15.55)
= [ m! / (e^{−Λ} Λ^m) ] · [ e^{−λ1} λ1^{y1}/y1! ] · [ e^{−λ2} λ2^{y2}/y2! ]   (15.56)

where the three factors in Equation (15.56) are p(m)^{−1} (with m ∼ Poi(Λ)), p(y1), and p(y2), respectively.

The final expression says that p(y1 , y2 |m) = p(y1 )p(y2 )/p(m), which makes sense.
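This identity is easy to verify numerically with scipy, for instance:

from scipy.stats import binom, poisson

lam1, lam2 = 3.0, 5.0
Lam = lam1 + lam2
m, y1 = 10, 4
lhs = binom.pmf(y1, m, lam1 / Lam)   # the K = 2 multinomial is a binomial
rhs = poisson.pmf(y1, lam1) * poisson.pmf(m - y1, lam2) / poisson.pmf(m, Lam)
assert abs(lhs - rhs) < 1e-12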

15.3 Case study: is Berkeley admissions biased against women?


In this section, we consider a simple but interesting example of logistic regression from [McE20, Sec 11.1.4].
The question of interest is whether admission to graduate school at UC Berkeley is biased against women.
The dataset comes from a famous paper [BHO75], which collected statistics for 6 departments for men and
women. The data table only has 12 rows, shown in Table 15.1, although the total sample size (number of
observations) is 4526.
We conduct a regression analysis to try to determine if gender causes imbalanced admissions rates. (This
is a simple example of a fairness analysis.)

15.3.1 Binomial logistic regression


An obvious way to attempt to answer the question of interest is to fit a binomial logistic regression model, in
which the outcome is the admissions rate for each row, and the input is the gender id of the corresponding
group. One way to write this model is as follows:
Ai ∼ Bin(Ni , µi ) (15.57)
logit(µi ) = α + β MALE[i] (15.58)
α ∼ N (0, 10) (15.59)
β ∼ N (0, 1.5) (15.60)

dept gender admit reject applications
A male 512 313 825
A female 89 19 108
B male 353 207 560
B female 17 8 25
C male 120 205 325
C female 202 391 593
D male 138 279 417
D female 131 244 375
E male 53 138 191
E female 94 299 393
F male 22 351 373
F female 24 317 341

Table 15.1: Admissions data for UC Berkeley from [BHO75].

Here MALE[i] = 1 iff case i refers to male admissions data. So the log odds is α for female cases, and α + β
for male candidates. (The choice of prior for these parameters is discussed in ??.)
The above formulation is asymmetric in the genders. In particular, the log odds for males has two random variables associated with it, and hence is a priori more uncertain. It is often better to rewrite the model in the following symmetric way:

Ai ∼ Bin(Ni , µi ) (15.61)
logit(µi ) = αGENDER[i] (15.62)
αj ∼ N (0, 1.5), j ∈ {1, 2} (15.63)

Here GENDER[i] is the gender (1 for male, 2 for female), so the log odds is α1 for males and α2 for females.
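A minimal numpyro sketch of this symmetric model (variable names are ours; gender is coded here as a 0/1 index array, and apps and admit are the per-row counts from Table 15.1):

import jax.random as jr
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS

def admissions_model(gender, apps, admit=None):
    # alpha_j ~ N(0, 1.5) for each gender, Equation (15.63)
    alpha = numpyro.sample("alpha", dist.Normal(0.0, 1.5).expand([2]))
    numpyro.sample("A",
                   dist.Binomial(total_count=apps, logits=alpha[gender]),
                   obs=admit)

# usage sketch:
# mcmc = MCMC(NUTS(admissions_model), num_warmup=500, num_samples=1000)
# mcmc.run(jr.PRNGKey(0), gender, apps, admit)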
We can perform posterior inference using a variety of methods (see ??). Here we use HMC (??). We find
the 89% credible interval for α1 is [−0.29, 0.16] and for α2 is [−0.91, 0.75].1 The corresponding distribution
for the difference in probability, σ(α1 ) − σ(α2 ), is [0.12, 0.16], with a mean of 0.14. So it seems that Berkeley
is biased in favor of men.
However, before jumping to conclusions, we should check if the model is any good. In Figure 15.2a, we
plot the posterior predictive distribution, along with the original data. We see the model is a very bad fit to
the data (the blue data dots are often outside the black predictive intervals). In particular, we see that the
empirical admissions rate for women is actually higher in all the departments except for C and E, yet the
model says that women should have a 14% lower chance of admission.
The trouble is that men and women did not apply to the same departments in equal amounts. Women
tended not to apply to departments, like A and B, with high admissions rates, but instead applied more to
departments, like F, with low admissions rates. So even though fewer women were accepted overall, within each department, women tended to be accepted at about the same rate.
We can get a better understanding if we consider the DAG in Figure 15.3a. This is intended to be a
causal model of the relevant factors. We discuss causality in more detail in ??, but the basic idea should be
clear from this picture. In particular, we see that there is an indirect causal path G → D → A from gender
to acceptance, so to infer the direct effect G → A, we need to condition on D and close the indirect path.

1 McElreath uses 89% interval instead of 95% to emphasize the arbitrary nature of these values. The difference is insignificant.


Figure 15.2: Blue dots are admission rates for each of the 6 departments (A-F) for males (left half of each dyad)
and females (right half ). The circle is the posterior mean of µi , the small vertical black lines indicate 1 standard
deviation of µi . The + marks indicate 95% predictive interval for Ai . (a) Basic model, only taking gender into
account. (b) Augmented model, adding department specific offsets. Adapted from Figure 11.5 of [McE20]. Generated
by logreg_ucb_admissions_numpyro.ipynb.

We can do this by adding department id as another feature:


Ai ∼ Bin(Ni , µi ) (15.64)
logit(µi ) = αGENDER[i] + γDEPT[i] (15.65)
αj ∼ N (0, 1.5), j ∈ {1, 2} (15.66)
γk ∼ N (0, 1.5), k ∈ {1, . . . , 6} (15.67)
Here j ∈ {1, 2} (for gender) and k ∈ {1, . . . , 6} (for department). Note that there are 12 parameters in this
model, but each combination (slice of the data) has a fairly large sample size of data associated with it, as we
see in Table 15.1.
In Figure 15.2b, we plot the posterior predictive distribution for this new model; we see the fit is now much
better. We find the 89% credible interval for α1 is [−1.38, 0.35] and for α2 is [−1.31, 0.42]. The corresponding
distribution for the difference in probability, σ(α1 ) − σ(α2 ), is [−0.05, 0.01]. So it seems that there is no bias
after all.
However, the above conclusion is based on the correctness of the model in Figure 15.3a. What if there
are unobserved confounders U , such as academic ability, influencing both admission rate and department
choice? This hypothesis is shown in Figure 15.3b. In this case, conditioning on the collider D opens up a
non-causal path between gender and admissions, G → D ← U → A. This invalidates any causal conclusions
we may want to draw.
The point of this example is to serve as a cautionary tale to those trying to draw causal conclusions from
predictive models. See ?? for more details.

15.3.2 Beta-binomial logistic regression


In some cases, there is more variability in the observed counts than we might expect from just a binomial
model, even after taking into account the observed predictors. This is called over-dispersion, and is usually
due to unobserved factors that are omitted from the model. In such cases, we can use a beta-binomial
model instead of a binomial model:
yi ∼ BetaBinom(mi , αi , βi ) (15.68)
αi = πi κ (15.69)
βi = (1 − πi )κ (15.70)
T
πi = σ(w xi ) (15.71)


Figure 15.3: Some possible causal models of admissions rates. G is gender, D is department, A is acceptance
rate. (a) No hidden confounders. (b) Hidden confounder (small dot) affects both D and A. Generated by lo-
greg_ucb_admissions_numpyro.ipynb.

Note that we have parameterized the model in terms of its mean rate,
πi = αi / (αi + βi)   (15.72)

and shape,
κi = αi + βi (15.73)
We choose to make the mean depend on the inputs (covariates), but to treat the shape (which is like a
precision term) as a shared constant.
As we discussed in ??, the beta-binomial distribution is a continuous mixture distribution of the following form:

BetaBinom(y|m, α, β) = ∫ Bin(y|m, µ) Beta(µ|α, β) dµ   (15.74)

In the regression context, we can interpret this as follows: rather than just predicting the mean directly, we
predict the mean and variance. This allows for each individual example to have more variability than we
might otherwise expect.
If the shape parameter κ is less than 2, then the distribution is U-shaped, which strongly favors probabilities near 0 or 1 (see ??). We generally want to avoid this, which we can do by ensuring κ > 2.
Following [McE20, p371], let us use this model to reanalyze the Berkeley admissions data from Section 15.3.
We saw that there was a lot of variability in the outcomes, due to the different admissions rates of each
department. Suppose we just regress on the gender, i.e., xi = (I (GENDERi = 1) , I (GENDERi = 2)), and
w = (α1 , α2 ) are the corresponding logits. If we use a binomial regression model, we can be misled into
thinking there is gender bias. But if we use the more robust beta-binomial model, we avoid this false
conclusion, as we show below.
We fit the following model:

Ai ∼ BetaBinom(Ni , πi , κ) (15.75)
logit(πi ) = αGENDER[i] (15.76)
αj ∼ N (0, 1.5) (15.77)
κ=φ+2 (15.78)
φ ∼ Expon(1) (15.79)

(To ensure that κ > 2, we use a trick and define it as κ = φ + 2, where we put an exponential prior (which
has a lower bound of 0) on φ.)
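A hypothetical numpyro version of (15.75)-(15.79); numpyro's BetaBinomial is parameterized by concentration1 = πκ and concentration0 = (1 − π)κ:

import jax
import numpyro
import numpyro.distributions as dist

def betabinom_model(gender, apps, admit=None):
    alpha = numpyro.sample("alpha", dist.Normal(0.0, 1.5).expand([2]))
    phi = numpyro.sample("phi", dist.Exponential(1.0))
    kappa = phi + 2.0                     # kappa = phi + 2 > 2, (15.78)
    pi = jax.nn.sigmoid(alpha[gender])    # logit(pi_i) = alpha_GENDER[i], (15.76)
    numpyro.sample("A",
                   dist.BetaBinomial(pi * kappa, (1.0 - pi) * kappa,
                                     total_count=apps),
                   obs=admit)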
We fit this model (using HMC) and plot the results in Figure 15.4. In Figure 15.4a, we show the posterior
predictive distribution; we see that is quite broad, so the model is no longer overconfident. In Figure 15.4b,


Figure 15.4: Results of fitting beta-binomial regression model to Berkeley admissions data. (a) Posterior predictive distribution (black) superimposed on empirical data (blue). The hollow circle is the posterior predicted mean acceptance rate, E[Ai|D]; the vertical lines are 1 standard deviation around this mean, std[Ai|D]; the + signs indicate the 89% predictive interval. (b) Samples from the posterior distribution for the admissions rate for men (blue) and women (red). Thick curve is posterior mean. Adapted from Figure 12.1 of [McE20]. Generated by logreg_ucb_admissions_numpyro.ipynb.

we plot p(σ(αj )|D), which is the posterior over the rate of admissions for men and women. We see that there
is considerable uncertainty in these values, so now we avoid the false conclusion that one is significantly higher
than the other. However, the model is so vague in its predictions as to be useless. In Section 15.3.4, we fix
this problem by using a multi-level logistic regression model.

15.3.3 Poisson regression


Let us revisit the Berkeley admissions example from Section 15.3 using Poisson regression. We use a simplified
form of the model, in which we just model the outcome counts without using any features, such as gender or
department. That is, the model has the form

yj,n ∼ Poi(λj)   (15.80)
λj = e^{αj}   (15.81)
αj ∼ N(0, 1.5)   (15.82)

for j = 1 : 2 and n = 1 : 12. Let λ̄j = E[λj|Dj], where D1 = y1,1:N is the vector of admission counts, and D2 = y2,1:N is the vector of rejection counts (so mn = y1,n + y2,n is the total number of applications for case n). The expected acceptance rate across the entire dataset is

λ̄1 / (λ̄1 + λ̄2) = 146.2 / (146.2 + 230.9) = 0.38   (15.83)

Let us compare this to a binomial regression model of the form

yn ∼ Bin(mn , µ) (15.84)
µ = σ(α) (15.85)
α ∼ N (0, 1.5) (15.86)

Let ᾱ = E[α|D], where D = (y1,1:N, m1:N). The expected acceptance rate across the entire dataset is σ(ᾱ) = 0.38, which matches Equation (15.83). (See logreg_ucb_admissions_numpyro.ipynb for the code.)

15.3.4 GLMM (hierarchical Bayes) regression


Let us revisit the Berkeley admissions dataset from Section 15.3, where there are 12 examples, corresponding
to male and female admissions to 6 departments. Thus the data is grouped both by gender and department.

Recall that Ai is the number of students admitted in example i, Ni is the number of applicants, µi is the
expected rate of admissions (the variable of interest), and DEPT[i] is the department (6 possible values). For
pedagogical reasons, we replace the categorical variable GENDER[i] with the binary indicator MALE[i]. We
can create a model with varying intercept and varying slope as follows:

Ai ∼ Bin(Ni , µi ) (15.87)
logit(µi ) = αDEPT[i] + βDEPT[i] × MALE[i] (15.88)

This has 12 parameters, as does the original formulation in Equation (15.65). However, these are not
independent degrees of freedom. In particular, the intercept and slope are correlated, as we see in Figure 15.2
(higher admissions means steeper slope). We can capture this using the following prior:
 
(αj, βj) ∼ N((α, β), Σ)   (15.89)
α ∼ N(0, 4)   (15.90)
β ∼ N(0, 1)   (15.91)
Σ = diag(σ) R diag(σ)   (15.92)
R ∼ LKJ(2)   (15.93)
σ ∼ Π_{d=1}^2 N+(σd|0, 1)   (15.94)

We can write this more compactly in the following way.² We define u = (α, β) and wj = (αj, βj), and then use this model:

logit(µi) = wDEPT[i][0] + wDEPT[i][1] × MALE[i]   (15.95)
wj ∼ N(u, Σ)   (15.96)
u ∼ N(0, diag(4, 1))   (15.97)

See Figure 15.5(a) for the graphical model.


Following the discussion in ??, it is advisable to rewrite the model in a non-centered form. Thus we write

wj = u + diag(σ) L zj   (15.98)

where L = chol(R) is the Cholesky factor of the correlation matrix R, and zj ∼ N(0, I2). Thus the model becomes the following:³

zj ∼ N(0, I2)   (15.99)
vj = diag(σ) L zj   (15.100)
u ∼ N(0, diag(4, 1))   (15.101)
logit(µi) = u[0] + v[DEPT[i], 0] + (u[1] + v[DEPT[i], 1]) × MALE[i]   (15.102)

This is the version of the model that is implemented in the numpyro code.
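A sketch of the non-centered model (15.99)-(15.102) in numpyro (our variable names; dept is a 0-5 index array and male is 0/1), using LKJCholesky to sample L directly:

import jax.numpy as jnp
import numpyro
import numpyro.distributions as dist

def glmm_model(dept, male, apps, admit=None):
    # u ~ N(0, diag(4, 1)), i.e., standard deviations (2, 1)
    u = numpyro.sample("u", dist.Normal(jnp.zeros(2), jnp.array([2.0, 1.0])).to_event(1))
    sigma = numpyro.sample("sigma", dist.HalfNormal(jnp.ones(2)).to_event(1))
    L = numpyro.sample("L", dist.LKJCholesky(2, concentration=2.0))
    z = numpyro.sample("z", dist.Normal(jnp.zeros((6, 2)), 1.0).to_event(2))
    v = z @ (sigma[:, None] * L).T          # rows are v_j = diag(sigma) L z_j
    logits = u[0] + v[dept, 0] + (u[1] + v[dept, 1]) * male   # (15.102)
    numpyro.sample("A", dist.Binomial(total_count=apps, logits=logits),
                   obs=admit)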
The results of fitting this model are shown in Figure 15.5(b). The fit is slightly better than in Figure 15.2b,
especially for the second column (females in department 2), where the observed value is now inside the
predictive interval.

² In https://bit.ly/3mP1QWH, this is referred to as glmm4. Note that we use w instead of v, and we use u instead of vµ.
³ In https://bit.ly/3mP1QWH, this is referred to as glmm5.

(a) (b)

Figure 15.5: (a) Generalized linear mixed model for inputs di (department) and mi (male), and output Ai (number of admissions), given Ni (number of applicants). (b) Results of fitting this model to the UCB dataset. Generated by logreg_ucb_admissions_numpyro.ipynb.

Chapter 16

Deep neural networks

Chapter 17

Gaussian processes

Chapter 18

Structured prediction

Chapter 19

Beyond the iid assumption

Part IV

Generation

Chapter 20

Generative models: an overview

Chapter 21

Variational autoencoders

Chapter 22

Auto-regressive models

Chapter 23

Normalizing flows

Chapter 24

Energy-based models

Chapter 25

Denoising diffusion models

Chapter 26

Generative adversarial networks

Part V

Discovery

Chapter 27

Discovery methods: an overview

Chapter 28

Latent variable models

28.1 Topic models


In this section, we show how to modify the multinomial PCA model of ?? to handle variable-sized observations,
such as text documents. The basic idea is to assume each observation xl is generated independently of all the
others, conditional on the shared latent factors z; the likelihood is categorical, with shared parameters, and the prior is a Dirichlet distribution. This is called a topic model, since the z variables can be interpreted as a
mixture of different topics that are present in document x.

28.1.1 Latent Dirichlet Allocation (LDA)


The most common kind of topic model is called latent Dirichlet allocation (LDA) [BNJ03; Ble12;
BGHM17]. (This usage of the term “LDA” is not to be confused with linear discriminant analysis.)

28.1.1.1 Model definition


We can define the LDA model as follows. Let xnl ∈ {1, . . . , V } be the identity of the l’th word in document
n, where l can now range from 1 to Ln , the length of the document, and V is the size of the vocabulary. The
probability of word v at location l is given by
p(xnl = v|zn) = Σ_k znk wkv   (28.1)

where 0 ≤ znk ≤ 1 is the proportion of “topic” k in document n, and zn ∼ Dir(α).


We can rewrite this model by associating a discrete latent variable cnl ∈ {1, . . . , Nz } with each word in
each document, with distribution p(cnl |zn ) = Cat(cnl |zn ). Thus cnl specifies the topic to use for word l in
document n. The full joint model becomes
p(xn, zn, cn) = Dir(zn|α) Π_{l=1}^{Ln} Cat(cnl|zn) Cat(xnl|W[cnl, :])   (28.2)

where W[k, :] = wk is the distribution over words for the k’th topic. See Figure 28.1 for the corresponding
PGM-D.
We typically use a Dirichlet prior for the topic parameters, p(wk) = Dir(wk|β1V); by setting β small enough,
we can encourage these topics to be sparse, so that each topic only predicts a subset of the words. In addition,
we use a Dirichlet prior on the latent factors, p(zn ) = Dir(zn |α1Nz ). If we set α small enough, we can
encourage the topic distribution for each document to be sparse, so that each document only contains a
subset of the topics. See Figure 28.2 for an illustration.
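A small sketch of LDA's generative process (28.2), assuming Nz topics, vocabulary size V, and a fixed document length L for simplicity:

import numpy as np

def sample_lda_corpus(N, L, Nz, V, alpha=0.1, beta=0.01, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.dirichlet(beta * np.ones(V), size=Nz)   # topic-word dists w_k
    docs = []
    for _ in range(N):
        z = rng.dirichlet(alpha * np.ones(Nz))      # topic proportions z_n
        c = rng.choice(Nz, size=L, p=z)             # topic assignments c_nl
        x = np.array([rng.choice(V, p=W[k]) for k in c])  # words x_nl
        docs.append(x)
    return docs, W

Small alpha and beta yield sparse topic proportions and sparse topics, as described above.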


Figure 28.1: Latent Dirichlet Allocation (LDA) as a PGM-D. (a) Unrolled form. (b) Plate form.

Figure 28.2: Illustration of latent Dirichlet allocation (LDA). We have color coded certain words by the topic they have been assigned to: yellow represents the genetics cluster, pink represents the evolution cluster, blue represents the data analysis cluster, and green represents the neuroscience cluster. Each topic is in turn defined as a sparse distribution over words. This article is not related to neuroscience, so no words are assigned to the green topic. The overall distribution over topic assignments for this document is shown on the right as a sparse histogram. Adapted from Figure 1 of [Ble12]. Used with kind permission of David Blei.

Topic 77 Topic 82 Topic 166
word prob. word prob. word prob.
MUSIC .090 LITERATURE .031 PLAY .136
DANCE .034 POEM .028 BALL .129
SONG .033 POETRY .027 GAME .065
PLAY .030 POET .020 PLAYING .042
SING .026 PLAYS .019 HIT .032
SINGING .026 POEMS .019 PLAYED .031
BAND .026 PLAY .015 BASEBALL .027
PLAYED .023 LITERARY .013 GAMES .025
SANG .022 WRITERS .013 BAT .019
SONGS .021 DRAMA .012 RUN .019
DANCING .020 WROTE .012 THROW .016
PIANO .017 POETS .011 BALLS .015
PLAYING .016 WRITER .011 TENNIS .011
RHYTHM .015 SHAKESPEARE .010 HOME .010
ALBERT .013 WRITTEN .009 CATCH .010
MUSICAL .013 STAGE .009 FIELD .010

Figure 28.3: Three topics related to the word play. From Figure 9 of [SG07]. Used with kind permission of Tom
Griffiths.

Document #29795
Bix beiderbecke, at age060 fifteen207, sat174 on the slope071 of a bluff055 overlooking027 the mississippi137 river137. He
was listening077 to music077 coming009 from a passing043 riverboat. The music077 had already captured006 his heart157
as well as his ear119. It was jazz077. Bix beiderbecke had already had music077 lessons077. He showed002 promise134 on
the piano077, and his parents035 hoped268 he might consider118 becoming a concert077 pianist077. But bix was
interested268 in another kind050 of music077. He wanted268 to play077 the cornet. And he wanted268 to play077 jazz077...

Document #1883
There is a simple050 reason106 why there are so few periods078 of really great theater082 in our whole western046 world.
Too many things300 have to come right at the very same time. The dramatists must have the right actors082, the
actors082 must have the right playhouses, the playhouses must have the right audiences082. We must remember288 that
plays082 exist143 to be performed077, not merely050 to be read254. ( even when you read254 a play082 to yourself, try288 to
perform062 it, to put174 it on a stage078, as you go along.) as soon028 as a play082 has to be performed082, then some
kind126 of theatrical082...

Document #21359
Jim296 has a game166 book254. Jim296 reads254 the book254. Jim296 sees081 a game166 for one. Jim296 plays166 the game166.
Jim296 likes081 the game166 for one. The game166 book254 helps081 jim296. Don180 comes040 into the house038. Don180 and
jim296 read254 the game166 book254. The boys020 see a game166 for two. The two boys020 play166 the game166. The
boys020 play166 the game166 for two. The boys020 like the game166. Meg282 comes040 into the house282. Meg282 and
don180 and jim296 read254 the book254. They see a game166 for three. Meg282 and don180 and jim296 play166 the game166.
They play166 ...

Figure 28.4: Three documents from the TASA corpus containing different senses of the word play. Grayed out words
were ignored by the model, because they correspond to uninteresting stop words (such as “and”, “the”, etc.) or very low
frequency words. From Figure 10 of [SG07]. Used with kind permission of Tom Griffiths.

Note that an earlier version of LDA, known as probabilistic LSA, was proposed in [Hof99]. (LSA stands
for “latent semantic analysis”, and refers to the application of PCA to text data; see [Mur22, Sec 20.5.1.2]
for details.) The likelihood function, p(x|z), is the same as in LDA, but pLSA does not specify a prior for
z, since it is designed for posterior analysis of a fixed corpus (similar to LSA), rather than being a true
generative model.

28.1.1.2 Polysemy
Each topic is a distribution over words that co-occur together, and which are therefore semantically related.
For example, Figure 28.3 shows 3 topics which were learned from an LDA model fit to the TASA corpus1 .
These seem to correspond to 3 different senses of the word “play”: playing an instrument, a theatrical play,
and playing a sports game.
We can use the inferred document-level topic distribution to overcome polysemy, i.e., to disambiguate
1 The TASA corpus is an untagged collection of educational materials consisting of 37,651 documents and 12,190,931 word

tokens. Words appearing in fewer than 5 documents were replaced with an asterisk, but punctuation was included. The combined
vocabulary was of size 37,202 unique words.

the meaning of a particular word. This is illustrated in Figure 28.4, where a subset of the words are annotated
with the topic to which they were assigned (i.e., we show argmax_k p(cnl = k|xn)). In the first document, the word “music” makes it clear that the musical topic (number 77) is present in the document, which in turn makes it more likely that cnl = 77, where l is the index corresponding to the word “play”.

28.1.1.3 Posterior inference


Many algorithms have been proposed to perform approximate posterior inference in the LDA model. In
the original LDA paper, [BNJ03], they use variational mean field inference (see ??), and in [HBB10], they
use stochastic VI (see Section 28.1.6). In Section 28.1.5 we discuss the collapsed Gibbs sampler, which
marginalizes out the discrete latents. In [MB16; SS17] they discuss how to learned amortized inference
networks to perform VI for the collapsed model.
Recently, there has been considerable interest in spectral methods for fitting LDA-like models which are
fast and which come with provable guarantees about the quality of the solution they obtain (unlike MCMC
and variational methods, where the solution is just an approximation of unknown quality). These methods
make certain (reasonable) assumptions beyond the basic model, such as the existence of anchor words,
which uniquely identify a topic. See [Aro+13] for details.

28.1.1.4 Determining the number of topics


Choosing Nz , the number of topics, is a standard model selection problem. Here are some approaches that
have been taken:
• Use annealed importance sampling (??) to approximate the evidence [Wal+09].
• Cross validation, using the log likelihood on a test set.
• Use the variational lower bound as a proxy for log p(D|Nz ).
• Use non-parametric Bayesian methods [Teh+06].
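As a concrete illustration of the cross-validation approach above, here is a minimal sketch using sklearn's LatentDirichletAllocation, scored by held-out perplexity; the synthetic document-term matrix and the candidate grid of topic counts are illustrative, not from the text.

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split

X = np.random.default_rng(0).poisson(0.1, size=(500, 1000))  # stand-in doc-term counts
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

best = None
for n_topics in [5, 10, 20, 50]:
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(X_train)
    perp = lda.perplexity(X_test)     # held-out perplexity; lower is better
    print(n_topics, perp)
    if best is None or perp < best[1]:
        best = (n_topics, perp)
print("chosen number of topics:", best[0])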

28.1.2 Correlated topic model


One weakness of LDA is that it cannot capture correlation between topics. For example, if a document has
the "business" topic, it is reasonable to expect the "finance" topic to co-occur. The source of the problem is
the use of a Dirichlet prior for zn. The problem with the Dirichlet is that it is characterized by just a mean
vector α; its covariance is then fixed (Σij ∝ −αi αj for i ≠ j), rather than being a free parameter.
One way around this is to replace the Dirichlet prior with the logistic normal distribution, which is
defined as follows:

p(z) = ∫ Cat(z|S(ε)) N(ε|µ, Σ) dε    (28.3)

This is known as the correlated topic model [BL07].


The difference from categorical PCA discussed in ?? is that CTM uses a logistic normal to model the
mean parameters, so zn is sparse and non-negative, whereas CatPCA uses a normal to model the natural
parameters, so zn is dense and can be negative. More precisely, the CTM defines xnl ∼ Cat(W S(εn)), but
CatPCA defines xnd ∼ Cat(S(Wd zn)).
Fitting the CTM model is tricky, since the prior for εn is no longer conjugate to the multinomial likelihood
for cnl. However, we can derive a variational mean field approximation, as described in [BL07].
Having fit the model, one can then convert Σ̂ to a sparse precision matrix Σ̂−1 by pruning low-strength
edges, to get a sparse Gaussian graphical model. This allows you to visualize the correlation between topics.
Figure 28.5 shows the result of applying this procedure to articles from Science magazine, from 1990-1999.
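To get some intuition for Equation (28.3), here is a minimal sketch of sampling topic proportions from a logistic normal prior; the values of µ and Σ are illustrative. A positive off-diagonal entry in Σ makes two topics tend to be large together, which a Dirichlet prior cannot express.

import numpy as np

def softmax(e):
    e = np.exp(e - e.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
mu = np.zeros(3)
Sigma = np.array([[1.0, 0.9, 0.0],   # topics 0 and 1 are positively correlated
                  [0.9, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
eps = rng.multivariate_normal(mu, Sigma, size=5)  # one ε per document
z = softmax(eps)                     # topic proportions S(ε), one row per document
print(np.round(z, 2))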

28.1.3 Dynamic topic model


In LDA, the topics (distributions over words) are assumed to be static. In some cases, it makes sense to allow
these distributions to evolve smoothly over time. For example, an article might use the topic “neuroscience”,
but if it was written in the 1900s, it is more likely to use words like “nerve”, whereas if it was written in the

[Figure 28.5 appears here: a graph over the learned topics from Science articles, each shown as a list of its most probable words (e.g., "neurons, brain, memory", "laser, optical, light", "climate, ocean, carbon dioxide"), with edges connecting correlated topics.]

Figure 28.5: Output of the correlated topic model (with K = 50 topics) when applied to articles from Science. Nodes
represent topics, with the 5 most probable phrases from each topic shown inside. Font size reflects overall prevalence of
the topic. See http://www.cs.cmu.edu/~lemur/science/ for an interactive version of this model with 100 topics.
Adapted from Figure 2 of [BL07]. Used with kind permission of David Blei.

Figure 28.6: The dynamic topic model as a PGM-D.

2000s, it is more likely to use words like “calcium receptor” (this reflects the general trend of neuroscience
towards molecular biology).
One way to model this is to assume the topic distributions evolve according to a Gaussian random walk,
as in a state space model (see ??). We can map these Gaussian vectors to probabilities via the softmax
function, resulting in the following model:

w_k^t | w_k^{t−1} ∼ N(w_k^{t−1}, σ² I)    (28.4)
z_n^t ∼ Dir(α 1_{Nz})    (28.5)
c_nl^t | z_n^t ∼ Cat(z_n^t)    (28.6)
x_nl^t | c_nl^t = k, W^t ∼ Cat(S(w_k^t))    (28.7)

This is known as a dynamic topic model [BL06a]. See Figure 28.6 for the PGM-D.
One can perform approximate inference in this model using a structured mean field method (??), that
exploits the Kalman smoothing algorithm (??) to perform exact inference on the linear-Gaussian chain
between the wkt nodes (see [BL06a] for details). See the main text for an example of this model applied to
100 years of articles from Science.
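Here is a minimal sketch of the topic-evolution part of this model (Equation (28.4)); the vocabulary size, number of time steps, and noise scale are illustrative.

import numpy as np

rng = np.random.default_rng(0)
Nw, T, sigma = 8, 5, 0.5             # vocab size, time steps, drift scale

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

w = rng.normal(0, 1, Nw)             # natural parameters of topic k at t = 0
for t in range(T):
    print(f"t={t}:", np.round(softmax(w), 2))   # p(word | topic k, time t)
    w = w + rng.normal(0, sigma, Nw)            # w_k^{t+1} ~ N(w_k^t, sigma^2 I)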


Figure 28.7: LDA-HMM model as a PGM-D.

It is also possible to use amortized inference, and to learn embeddings for each word, which works much
better with rare words. This is called the dynamic embedded topic model [DRB19].

28.1.4 LDA-HMM
The Latent Dirichlet Allocation (LDA) model of Section 28.1.1 assumes words are exchangeable, and thus
ignores word order. A simple way to model sequential dependence between words is to use an HMM. The
trouble with HMMs is that they can only model short-range dependencies, so they cannot capture the overall
gist of a document. Hence they can generate syntactically correct sentences, but not semantically plausible
ones.
It is possible to combine LDA with HMM to create a model called LDA-HMM [Gri+04]. This model
uses the HMM states to model function or syntactic words, such as “and” or “however”, and uses the LDA to
model content or semantic words, which are harder to predict. There is a distinguished HMM state which
specifies when the LDA model should be used to generate the word; the rest of the time, the HMM generates
the word.
More formally, for each document n, the model defines an HMM with states hnl ∈ {0, . . . , H}. In addition,
each document has an LDA model associated with it. If hnl = 0, we generate word xnl from the semantic
LDA model, with topic specified by cnl ; otherwise we generate word xnl from the syntactic HMM model.
The PGM-D is shown in Figure 28.7. The CPDs are as follows:

p(zn) = Dir(zn | α 1_{Nz})    (28.8)
p(cnl = k | zn) = znk    (28.9)
p(hn,l = j | hn,l−1 = i) = Aij    (28.10)
p(xnl = d | cnl = k, hnl = j) = Wkd if j = 0, Bjd if j > 0    (28.11)

where W is the usual topic-word matrix, B is the state-word HMM emission matrix and A is the state-state
HMM transition matrix.
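Here is a minimal sketch of this generative process (Equations (28.8) to (28.11)); all sizes and parameter values are illustrative.

import numpy as np

rng = np.random.default_rng(0)
Nz, H, V, L = 3, 4, 10, 20            # topics, HMM states 1..H, vocab size, doc length
alpha = 1.0
W = rng.dirichlet(np.ones(V), size=Nz)        # topic-word matrix: W[k] = p(x | c=k)
B = rng.dirichlet(np.ones(V), size=H + 1)     # HMM emissions: B[j] = p(x | h=j); row 0 unused
A = rng.dirichlet(np.ones(H + 1), size=H + 1) # HMM transitions over states {0, ..., H}

z = rng.dirichlet(alpha * np.ones(Nz))        # document-level topic distribution z_n
h, words = 0, []
for l in range(L):
    h = rng.choice(H + 1, p=A[h])             # move to the next HMM state
    if h == 0:                                # semantic state: emit from the LDA model
        c = rng.choice(Nz, p=z)               # draw per-word topic c_nl
        words.append(rng.choice(V, p=W[c]))
    else:                                     # syntactic state: emit from the HMM
        words.append(rng.choice(V, p=B[h]))
print(words)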
Inference in this model can be done with collapsed Gibbs sampling, analytically integrating out all the
continuous quantities. See [Gri+04] for the details.
The results of applying this model (with Nz = 200 LDA topics and H = 20 HMM states) to the combined
Brown and TASA corpora2 are shown in Table 28.1. We see that the HMM generally is responsible for
syntactic words, and the LDA for semantic words. If we did not have the HMM, the LDA topics would
get "polluted" by function words (see the top row of Table 28.1), which is why such words are normally
removed during preprocessing.

2 The Brown corpus consists of 500 documents and 1,137,466 word tokens, with part-of-speech tags for each token. The TASA
corpus is an untagged collection of educational materials consisting of 37,651 documents and 12,190,931 word tokens. Words
appearing in fewer than 5 documents were replaced with an asterisk, but punctuation was included. The combined vocabulary
was of size 37,202 unique words.

[Table 28.1 appears here: three rows of topic/class word lists, e.g., an LDA topic ("blood, heart, pressure, body,
lungs, ...") and an HMM class ("the, a, his, this, their, ...").]

Table 28.1: Upper row: topics extracted by the LDA model trained on the combined Brown and TASA corpora.
Middle row: topics extracted by the LDA part of the LDA-HMM model. Bottom row: classes extracted by the HMM
part of the LDA-HMM model. Each column represents a single topic/class, and words appear in order of probability in
that topic/class. Since some classes give almost all probability to only a few words, a list is terminated when the words
account for 90% of the probability mass. From Figure 2 of [Gri+04]. Used with kind permission of Tom Griffiths.

[Figure 28.8 appears here: three pairs of NIPS sentences, with each word shaded by its posterior probability of
assignment to the LDA component, e.g., "we require the softassign algorithm to return a doubly stochastic matrix"
vs. "a portfolio with a maximal expected return".]

Figure 28.8: Function and content words in the NIPS corpus, as distinguished by the LDA-HMM model. Graylevel
indicates posterior probability of assignment to the LDA component, with black being highest. The boxed word appears
as a function word in one sentence, and as a content word in another sentence. Asterisked words had low frequency,
and were treated as a single word type by the model. From Figure 4 of [Gri+04]. Used with kind permission of Tom
Griffiths.

[Figure 28.9 appears here.]

Figure 28.9: (a) LDA unrolled for N documents. (b) Collapsed LDA, where we integrate out the continuous latents zn
and the continuous topic parameters W.
The model can also help disambiguate when the same word is being used syntactically or semantically.
Figure 28.8 shows some examples when the model was applied to the NIPS corpus.3 We see that the roles of
words are distinguished, e.g., “we require the algorithm to return a matrix” (verb) vs “the maximal expected
return” (noun). In principle, a part of speech tagger could disambiguate these two uses, but note that (1) the
LDA-HMM method is fully unsupervised (no POS tags were used), and (2) sometimes a word can have the
same POS tag, but different senses, e.g., "the left graph" (a syntactic role) vs "the graph G" (a semantic role).
More recently, [Die+17] proposed topic-RNN, which is similar to LDA-HMM, but replaces the HMM
model with an RNN, which is a much more powerful model.

28.1.5 Collapsed Gibbs sampling for LDA


In this section, we discuss how to perform inference using MCMC. Vanilla Gibbs sampling samples from the
following full conditionals:

p(cil = k|·) ∝ exp[log πik + log wk,xil]    (28.12)
p(πi|·) = Dir({αk + Σ_l I(cil = k)})    (28.13)
p(wk|·) = Dir({γv + Σ_i Σ_l I(xil = v, cil = k)})    (28.14)

However, one can get better performance by analytically integrating out the π i ’s and the wk ’s, both of
which have a Dirichlet distribution, and just sampling the discrete cil ’s. This approach was first suggested in
[GS04], and is an example of collapsed Gibbs sampling. Figure 28.9(b) shows that now all the cil variables
are fully correlated. However, we can sample them one at a time, as we explain below.
First, we need some notation. Let Nivk = Σ_{l=1}^{Li} I(cil = k, xil = v) be the number of times word v is
assigned to topic k in document i. Let Nik = Σ_v Nivk be the number of times any word from document i
has been assigned to topic k. Let Nvk = Σ_i Nivk be the number of times word v has been assigned to topic
k in any document. Let Nk = Σ_v Nvk be the number of words assigned to topic k. Finally, let Li = Σ_k Nik
be the number of words in document i; this is observed.
3 NIPS stands for "Neural Information Processing Systems". It is one of the top machine learning conferences. The NIPS
corpus volumes 1–12 contain 1713 documents.

[Figure 28.10 appears here: two 16 × 5 grids of dots, with rows indexing documents 1–16 and columns indexing the
words "River", "Stream", "Bank", "Money", "Loan".]

Figure 28.10: Illustration of (collapsed) Gibbs sampling applied to a small LDA example. There are N = 16 documents,
each containing a variable number of words drawn from a vocabulary of V = 5 words. There are two topics. A white
dot means the word is assigned to topic 1, a black dot means the word is assigned to topic 2. (a) The initial
random assignment of states. (b) A sample from the posterior after 64 steps of Gibbs sampling. From Figure 7 of
[SG07]. Used with kind permission of Tom Griffiths.

We can now derive the marginal prior. By applying ??, one can show that

p(c|α) = Π_i ∫ [ Π_{l=1}^{Li} Cat(cil|πi) ] Dir(πi|α 1_K) dπi    (28.15)

       = ( Γ(Kα) / Γ(α)^K )^N Π_{i=1}^{N} [ Π_{k=1}^{K} Γ(Nik + α) ] / Γ(Li + Kα)    (28.16)

By similar reasoning, one can show

p(x|c, β) = Π_k ∫ [ Π_{il: cil = k} Cat(xil|wk) ] Dir(wk|β 1_V) dwk    (28.17)

          = ( Γ(V β) / Γ(β)^V )^K Π_{k=1}^{K} [ Π_{v=1}^{V} Γ(Nvk + β) ] / Γ(Nk + V β)    (28.18)

From the above equations, and using the fact that Γ(x + 1)/Γ(x) = x, we can derive the full conditional
for p(cil|c−i,l). Define N⁻ivk to be the same as Nivk except it is computed by summing over all locations in
document i except for cil. Also, let xil = v. Then

p(cil = k|c−i,l, x, α, β) ∝ [ (N⁻vk + β) / (N⁻k + V β) ] · [ (N⁻ik + α) / (Li + Kα) ]    (28.19)

We see that a word in a document is assigned to a topic based both on how often that word is generated by
the topic (first term), and also on how often that topic is used in that document (second term).
Given Equation (28.19), we can implement the collapsed Gibbs sampler as follows. We randomly assign a
topic to each word, cil ∈ {1, . . . , K}. We can then sample a new topic as follows: for a given word in the
corpus, decrement the relevant counts, based on the topic assigned to the current word; draw a new topic
from Equation (28.19), update the count matrices; and repeat. This algorithm can be made efficient since
the count matrices are very sparse [Li+14].
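Here is a minimal numpy sketch of this sampler. Note that the factor Li + Kα in Equation (28.19) is constant in k, so it drops out after normalization; all names and hyper-parameter values are illustrative.

import numpy as np

def collapsed_gibbs_lda(docs, K, V, alpha=0.1, beta=0.01, n_iters=100, seed=0):
    # docs: list of documents, each a list of word indices in {0, ..., V-1}
    rng = np.random.default_rng(seed)
    Nik = np.zeros((len(docs), K))          # doc-topic counts
    Nvk = np.zeros((V, K))                  # word-topic counts
    Nk = np.zeros(K)                        # words per topic
    z = [rng.integers(K, size=len(doc)) for doc in docs]   # random initial topics
    for i, doc in enumerate(docs):          # initialize the count matrices
        for l, v in enumerate(doc):
            k = z[i][l]
            Nik[i, k] += 1; Nvk[v, k] += 1; Nk[k] += 1
    for _ in range(n_iters):
        for i, doc in enumerate(docs):
            for l, v in enumerate(doc):
                k = z[i][l]                 # decrement counts for the current assignment
                Nik[i, k] -= 1; Nvk[v, k] -= 1; Nk[k] -= 1
                # full conditional, Eq. (28.19); the (Li + K*alpha) term cancels
                p = (Nvk[v] + beta) / (Nk + V * beta) * (Nik[i] + alpha)
                k = rng.choice(K, p=p / p.sum())
                z[i][l] = k                 # increment counts for the new assignment
                Nik[i, k] += 1; Nvk[v, k] += 1; Nk[k] += 1
    return z, Nvk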
This process is illustrated in Figure 28.10 on a small example with two topics, and five words. The left
part of the figure illustrates 16 documents that were sampled from the LDA model using p(money|k = 1) =
p(loan|k = 1) = p(bank|k = 1) = 1/3 and p(river|k = 2) = p(stream|k = 2) = p(bank|k = 2) = 1/3. For
example, we see that the first document contains the word “bank” 4 times (indicated by the four dots in
row 1 of the “bank” column), as well as various other financial terms. The right part of the figure shows the
state of the Gibbs sampler after 64 iterations. The “correct” topic has been assigned to each token in most
cases. For example, in document 1, we see that the word “bank” has been correctly assigned to the financial

topic, based on the presence of the words “money” and “loan”. The posterior mean estimate of the parameters
is given by p̂(money|k = 1) = 0.32, p̂(loan|k = 1) = 0.29, p̂(bank|k = 1) = 0.39, p̂(river|k = 2) = 0.25,
p̂(stream|k = 2) = 0.4, and p̂(bank|k = 2) = 0.35, which is impressively accurate, given that there are only 16
training examples.

28.1.6 Variational inference for LDA


A faster alternative to MCMC is to use variational EM, which we discuss in general terms in ??. There are
several ways to apply this to LDA, which we discuss in the following sections.

28.1.6.1 Sequence version


In this section, we focus on a version in which we unroll the model, and work with a latent variable for each
word. Following [BNJ03], we will use a fully factorized (mean field) approximation of the form
q(zn, sn) = Dir(zn|z̃n) Π_l Cat(snl|Ñnl)    (28.20)

where z̃n are the variational parameters for the approximate posterior over zn , and Ñnl are the variational
parameters for the approximate posterior over snl . We will follow the usual mean field recipe. For q(snl ), we
use Bayes’ rule, but where we need to take expectations over the prior:
Ñnlk ∝ wd,k exp(E[log znk])    (28.21)

where d = xnl, and

E[log znk] = ψk(z̃n) ≜ ψ(z̃nk) − ψ(Σ_{k′} z̃nk′)    (28.22)

where ψ is the digamma function. The update for q(zn) is obtained by adding up the expected counts:

z̃nk = αk + Σ_l Ñnlk    (28.23)

The M step is obtained by adding up the expected counts and normalizing:


ŵdk ∝ βd + Σ_{n=1}^{N} Σ_{l=1}^{Ln} Ñnlk I(xnl = d)    (28.24)

28.1.6.2 Count version


Note that the E step takes O((Σ_n Ln) Nw Nz) space to store the Ñnlk. It is much more space efficient to
perform inference in the mPCA version of the model, which works with counts; these only take O(N Nw Nz)
space, which is a big savings if documents are long. (By contrast, the collapsed Gibbs sampler must work
explicitly with the snl variables.)
Following the discussion in ??, we will work with the variables zn and Nn, where Nn = [Nndk] is the
matrix of counts, which can be derived from sn,1:Ln. We will again use a fully factorized (mean field)
approximation of the form

q(zn, Nn) = Dir(zn|z̃n) Π_d M(Nnd|xnd, Ñnd)    (28.25)

where xnd = Σ_{l=1}^{Ln} I(xnl = d) is the total number of times token d occurs in document n.
The E step becomes

z̃nk = αk + Σ_d xnd Ñndk    (28.26)

Ñndk ∝ wdk exp(E[log znk])    (28.27)

The M step becomes

ŵdk ∝ βd + Σ_n xnd Ñndk    (28.28)

28.1.6.3 Bayesian version

Algorithm 4: Batch VB for LDA

1  Input: {xnd}, Nz, α, β;
2  Estimate w̃dk using EM for multinomial mixtures;
3  while not converged do
4      // E step
5      adk = 0 // expected sufficient statistics
6      for each document n = 1 : N do
7          (z̃n, Ñn) = VB-Estep(xn, W̃, α);
8          adk += xnd Ñndk;
9      // M step
10     for each topic k = 1 : Nz do
11         w̃dk = βd + adk;

12 function (z̃n, Ñn) = VB-Estep(xn, W̃, α);
13 Initialize z̃nk = αk;
14 repeat
15     z̃nold = z̃n, z̃nk = αk;
16     for each word d = 1 : Nw do
17         for each topic k = 1 : Nz do
18             Ñndk = exp(ψk(w̃d) + ψk(z̃nold));
19         Ñnd = normalize(Ñnd);
20         z̃n += xnd Ñnd;
21 until converged;

We now modify the algorithm to use variational Bayes (VB) instead of EM, i.e., we infer the parameters
as well as the latent variables. There are two advantages to this. First, by setting β ≪ 1, VB will encourage
W to be sparse (as in ??). Second, we will be able to generalize this to the online learning setting, as we
discuss below.
Our new posterior approximation becomes
q(zn, Nn, W) = Dir(zn|z̃n) Π_d M(Nnd|xnd, Ñnd) Π_k Dir(wk|w̃k)    (28.29)

The update for Ñndk changes to the following:

Ñndk ∝ exp(E[log wdk] + E[log znk])    (28.30)

The M step is the same as before:


ŵdk ∝ βd + Σ_n xnd Ñndk    (28.31)

No normalization is required, since we are just updating the pseudocounts. The overall algorithm is summarized
in Algorithm 4.

Algorithm 5: Online VB for LDA

1  Input: {xnd}, Nz, α, β, LR schedule;
2  Initialize w̃dk randomly;
3  for t = 1 : ∞ do
4      Set step size ηt;
5      Pick document n;
6      (z̃n, Ñn) = VB-Estep(xn, W̃, α);
7      w̃newdk = βd + N xnd Ñndk;
8      w̃dk = (1 − ηt)w̃dk + ηt w̃newdk;

[Figure 28.11 appears here: test perplexity (roughly 600–900) vs number of documents seen (log scale), comparing
batch VB on 98K Wikipedia articles with online VB on 98K and on 3.3M articles; the online runs converge much
faster than the batch run. A second panel shows the top words of a topic about business evolving as online LDA sees
more documents.]

Figure 28.11: Test perplexity vs number of training documents for batch and online VB-LDA. From Figure 1 of
[HBB10]. Used with kind permission of David Blei.

28.1.6.4 Online (SVI) version

In the batch version, the E step takes O(N Nz Nw) time per mean field update. This can be slow if we have many
documents. This cost can be reduced by using stochastic variational inference, as discussed in ??. We perform an
E step in the usual way. We then compute the variational parameters for W, treating the expected sufficient
statistics from the single data case as if the whole data set had those statistics. Finally, we make a partial
update for the variational parameters for W, putting weight ηt on the new estimate and weight 1 − ηt on the
old estimate. The step size ηt decays over time, according to some schedule, as in SGD. The overall algorithm
is summarized in Algorithm 5. In practice, we should use mini-batches, as explained in ??. In [HBB10], they
used a batch size of 256–4096.

Figure 28.11 plots the perplexity on a test set of size 1000 vs the number of analyzed documents (E steps),
where the data is drawn from (English) Wikipedia. The figure shows that online variational inference is much
faster than offline inference, yet produces similar results.
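To make Algorithms 4 and 5 concrete, here is a minimal numpy sketch of the local VB E step and the online update. The step-size schedule ηt = (τ0 + t)^(−κ) follows [HBB10]; all other constants, and the data layout, are illustrative assumptions.

import numpy as np
from scipy.special import digamma

def vb_e_step(x, W_tilde, alpha, n_inner=20):
    # Local mean field updates for one document (lines 12-21 of Algorithm 4).
    # x: (V,) word counts; W_tilde: (V, K) variational Dirichlet parameters.
    Elog_w = digamma(W_tilde) - digamma(W_tilde.sum(axis=0))
    z_tilde = np.full(W_tilde.shape[1], alpha)
    for _ in range(n_inner):
        Elog_z = digamma(z_tilde) - digamma(z_tilde.sum())
        N_tilde = np.exp(Elog_w + Elog_z)             # (V, K), unnormalized
        N_tilde /= N_tilde.sum(axis=1, keepdims=True)
        z_tilde = alpha + x @ N_tilde
    return z_tilde, N_tilde

def online_vb_lda(X, K, alpha=0.1, beta=0.01, tau0=1.0, kappa=0.7, n_steps=10000):
    # Online VB (Algorithm 5): one randomly chosen document per step.
    N, V = X.shape
    rng = np.random.default_rng(0)
    W_tilde = rng.gamma(1.0, 1.0, size=(V, K))
    for t in range(n_steps):
        eta = (tau0 + t) ** (-kappa)                  # decaying step size
        n = rng.integers(N)
        _, N_tilde = vb_e_step(X[n], W_tilde, alpha)
        W_new = beta + N * X[n][:, None] * N_tilde    # as if corpus were N copies
        W_tilde = (1 - eta) * W_tilde + eta * W_new
    return W_tilde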
Chapter 29

Hidden Markov models

Chapter 30

State-space models

Chapter 31

Graph learning

31.1 Learning tree structures


Since the problem of structure learning for general graphs is NP-hard [Chi96], we start by considering the
special case of trees. Trees are special because we can learn their structure efficiently, as we discuss below,
and because, once we have learned the tree, we can use them for efficient exact inference, as discussed in ??.

31.1.1 Directed or undirected tree?


Before continuing, we need to discuss the issue of whether we should use directed or undirected trees. A
directed tree, with a single root node r, defines a joint distribution as follows:
p(x|T) = Π_{t∈V} p(xt | x_{pa(t)})    (31.1)

where we define pa(r) = ∅. For example, in Figure 31.1(b-c), we have


p(x1 , x2 , x3 , x4 |T ) = p(x1 )p(x2 |x1 )p(x3 |x2 )p(x4 |x2 ) (31.2)
= p(x2 )p(x1 |x2 )p(x3 |x2 )p(x4 |x2 ) (31.3)
We see that the choice of root does not matter: both of these models are equivalent.
To make the model more symmetric, it is preferable to use an undirected tree. This can be represented as
follows:
p(x|T) = Π_{t∈V} p(xt) Π_{(s,t)∈E} p(xs, xt) / (p(xs) p(xt))    (31.4)

where p(xs , xt ) is an edge marginal and p(xt ) is a node marginal. For example, in Figure 31.1(a) we have
p(x1, x2, x3, x4|T) = p(x1)p(x2)p(x3)p(x4) · [p(x1, x2) p(x2, x3) p(x2, x4)] / [p(x1)p(x2) · p(x2)p(x3) · p(x2)p(x4)]    (31.5)


Figure 31.1: An undirected tree and two equivalent directed trees.

To see the equivalence with the directed representation, let us cancel terms to get
p(x1, x2, x3, x4|T) = p(x1, x2) · [p(x2, x3)/p(x2)] · [p(x2, x4)/p(x2)]    (31.6)
                    = p(x1)p(x2|x1)p(x3|x2)p(x4|x2)    (31.7)
                    = p(x2)p(x1|x2)p(x3|x2)p(x4|x2)    (31.8)

where p(xt |xs ) = p(xs , xt )/p(xs ).


Thus a tree can be represented as either an undirected or directed graph: the number of parameters is
the same, and hence the complexity of learning is the same. And of course, inference is the same in both
representations, too. The undirected representation, which is symmetric, is useful for structure learning, but
the directed representation is more convenient for parameter learning.

31.1.2 Chow-Liu algorithm


Using Equation (31.4), we can write the log-likelihood for a tree as follows:
log p(D|θ, T) = Σ_t Σ_k Ntk log p(xt = k|θ) + Σ_{s,t} Σ_{j,k} Nstjk log [ p(xs = j, xt = k|θ) / (p(xs = j|θ) p(xt = k|θ)) ]    (31.9)

where Nstjk is the number of times node s is in state j and node t is in state k, and Ntk is the number
of times node t is in state k. We can rewrite these counts in terms of the empirical distribution: Nstjk =
N pD (xs = j, xt = k) and Ntk = N pD (xt = k). Setting θ to the MLEs, this becomes

log p(D|θ, T)/N = Σ_{t∈V} Σ_k pD(xt = k) log pD(xt = k)    (31.10)
                + Σ_{(s,t)∈E(T)} I(xs, xt|θ̂st)    (31.11)

where I(xs, xt|θ̂st) ≥ 0 is the mutual information between xs and xt given the empirical distribution:

I(xs, xt|θ̂st) = Σ_j Σ_k pD(xs = j, xt = k) log [ pD(xs = j, xt = k) / (pD(xs = j) pD(xt = k)) ]    (31.12)

Since the first term in Equation (31.11) is independent of the topology T, we can ignore it when learning
structure. Thus the tree topology that maximizes the likelihood can be found by computing the maximum
weight spanning tree, where the edge weights are the pairwise mutual informations, I(xs, xt|θ̂st). This is
called the Chow-Liu algorithm [CL68].
There are several algorithms for finding a max spanning tree (MST). The two best known are Prim's
algorithm and Kruskal's algorithm. Both can be implemented to run in O(E log V) time, where E = V² is
the number of edges and V is the number of nodes. See e.g., [SW11, Sec 4.3] for details. Thus the overall
running time is O(N V² + V² log V), where the first term is the cost of computing the sufficient statistics.
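Here is a minimal sketch of the procedure for discrete data, using empirical pairwise mutual information and an off-the-shelf spanning tree routine (negating the weights turns a minimum spanning tree solver into a maximum one); the synthetic data is a stand-in.

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from sklearn.metrics import mutual_info_score

def chow_liu_edges(X):
    # X: (N, D) matrix of discrete data; returns the edges of the MLE tree.
    D = X.shape[1]
    MI = np.zeros((D, D))
    for s in range(D):
        for t in range(s + 1, D):
            MI[s, t] = mutual_info_score(X[:, s], X[:, t])  # empirical MI, Eq. (31.12)
    # scipy only provides a *minimum* spanning tree, so negate the weights.
    # (Pairs with exactly zero MI are treated as absent edges, giving a forest.)
    T = minimum_spanning_tree(-MI)
    return list(zip(*T.nonzero()))

X = np.random.default_rng(0).integers(0, 2, size=(1000, 5))
print(chow_liu_edges(X))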
Figure 31.2 gives an example of the method in action, applied to the binary 20 newsgroups data shown in
??. The tree has been arbitrarily rooted at the node representing “email”. The connections that are learned
seem intuitively reasonable.

31.1.3 Finding the MAP forest


Since all trees have the same number of parameters, we can safely use the maximum likelihood score as
a model selection criterion without worrying about overfitting. However, sometimes we may want to fit a

Figure 31.2: The MLE tree estimated from the 20-newsgroup data. Generated by chow_liu_tree_demo.py.

forest rather than a single tree, since inference in a forest is much faster than in a tree (we can run belief
propagation in each tree in the forest in parallel). The MLE criterion will never choose to omit an edge.
However, if we use the marginal likelihood or a penalized likelihood (such as BIC), the optimal solution may
be a forest. Below we give the details for the marginal likelihood case.
In Section 31.2.3.2, we explain how to compute the marginal likelihood of any DAG using a Dirichlet
prior for the CPTs. The resulting expression can be written as follows:

log p(D|T) = Σ_{t∈V} log ∫ [ Π_{i=1}^{N} p(xit | xi,pa(t), θt) ] p(θt) dθt = Σ_t score(Nt,pa(t))    (31.13)

where Nt,pa(t) are the counts (sufficient statistics) for node t and its parents, and score is defined in
Equation (31.26).
Now suppose we only allow DAGs with at most one parent. Following [HGC95, p227], let us associate a
weight with each s → t edge, ws,t ≜ score(t|s) − score(t|0), where score(t|0) is the score when t has no parents.
Note that the weights might be negative (unlike the MLE case, where edge weights are always non-negative
because they correspond to mutual information). Then we can rewrite the objective as follows:

log p(D|T) = Σ_t score(t|pa(t)) = Σ_t w_{pa(t),t} + Σ_t score(t|0)    (31.14)

The last term is the same for all trees T , so we can ignore it. Thus finding the most probable tree amounts
to finding a maximal branching in the corresponding weighted directed graph. This can be found using
the algorithm in [GGS84].
If the scoring function is prior and likelihood equivalent (these terms are explained in Section 31.2.3.3),
we have
score(s|t) + score(t|0) = score(t|s) + score(s|0) (31.15)
and hence the weight matrix is symmetric. In this case, the maximal branching is the same as the maximal
weight forest. We can apply a slightly modified version of the MST algorithm to find this [EAL10]. To see
this, let G = (V, E) be a graph with both positive and negative edge weights. Now let G′ be the graph obtained
by omitting all the negative edges from G. This cannot reduce the total weight, so we can find the maximum
weight forest of G by finding the MST for each connected component of G′. We can do this by running
Kruskal's algorithm directly on G′: there is no need to find the connected components explicitly.

31.1.4 Mixtures of trees


A single tree is rather limited in its expressive power. Later in this chapter we discuss ways to learn more
general graphs. However, the resulting graphs can be expensive to do inference in. An interesting alternative

Figure 31.3: A simple linear Gaussian model.

is to learn a mixture of trees [MJ00], where each mixture component may have a different tree topology.
This is like an unsupervised version of the TAN classifier discussed in ??. We can fit a mixture of trees by
using EM: in the E step, we compute the responsibilities of each cluster for each data point, and in the M
step, we use a weighted version of the Chow-Liu algorithm. See [MJ00] for details.
In fact, it is possible to create an "infinite mixture of trees", by integrating out over all possible trees.
Remarkably, this can be done in O(V³) time using the matrix tree theorem. This allows us to perform exact
Bayesian inference of posterior edge marginals etc. However, it is not tractable to use this infinite mixture for
inference of hidden nodes. See [MJ06] for details.

31.2 Learning DAG structures


In this section, we discuss how to estimate the structure of directed graphical models from observational
data. This is often called Bayes net structure learning. We can only do this if we make the faithfulness
assumption, which we explain in Section 31.2.1. Furthermore our output will be a set of equivalent DAGs,
rather than a single unique DAG, as we explain in Section 31.2.2. After introducing these restrictions, we
discuss some statistical and algorithmic techniques. If the DAG is interpreted causally, these techniques can be
used for causal discovery, although this relies on additional assumptions about non-confounding. For more
details, see e.g., [GZS19].

31.2.1 Faithfulness
The Markov assumption allows us to infer CI properties of a distribution p from a graph G. To go in the
opposite direction, we need to assume that the generating distribution p is faithful to the generating DAG
G. This means that all the conditional independence (CI) properties of p are exactly captured by the graphical
structure, so I(p) = I(G); this means there cannot be any CI properties in p that are due to particular
settings of the parameters (such as zeros in a regression matrix) that are not graphically explicit. (For this
reason, a faithful distribution is also called a stable distribution.)
Let us consider an example of a non-faithful distribution (from [PJS17, Sec 6.5.3]). Consider a linear
Gaussian model of the form
X = EX,           EX ∼ N(0, σ²_X)    (31.16)
Y = aX + EY,      EY ∼ N(0, σ²_Y)    (31.17)
Z = bY + cX + EZ, EZ ∼ N(0, σ²_Z)    (31.18)

where the error terms are independent. If ab + c = 0, then X ⊥ Z, even though this is not implied by the
DAG in Figure 31.3. Fortunately, this kind of accidental cancellation happens with zero probability if the
coefficients are drawn randomly from positive densities [SGS00, Thm 3.2].
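Here is a quick numerical check of this example, with illustrative coefficients satisfying ab + c = 0; since (X, Z) are jointly Gaussian, zero correlation implies independence.

import numpy as np

rng = np.random.default_rng(0)
a, b, c = 2.0, 1.5, -3.0            # ab + c = 0, the unfaithful case
n = 1_000_000
X = rng.normal(0, 1, n)
Y = a * X + rng.normal(0, 1, n)
Z = b * Y + c * X + rng.normal(0, 1, n)
print(np.corrcoef(X, Z)[0, 1])      # approximately 0, despite the edges X -> Y -> Z, X -> Z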

31.2.2 Markov equivalence


Even with the faithfulness assumption, we cannot always uniquely identify a DAG from a joint distribution.
To see this, consider the following 3 DGMs: X → Y → Z, X ← Y ← Z and X ← Y → Z. These all represent


Figure 31.4: Three DAGs. G1 and G3 are Markov equivalent, G2 is not.


Figure 31.5: PDAG representation of Markov equivalent DAGs.

the same set of CI statements, namely

X ⊥ Z | Y,   X ⊥̸ Z    (31.19)

We say these graphs are Markov equivalent, since they encode the same set of CI assumptions. That is,
they all belong to the same Markov equivalence class. However, the DAG X → Y ← Z encodes X ⊥ Z
and X ⊥̸ Z | Y, so corresponds to a different distribution.
In [VP90], they prove the following theorem.

Theorem 31.2.1. Two structures are Markov equivalent iff they have the same skeleton (i.e., they have the
same edges, disregarding direction) and they have the same set of v-structures (colliders whose parents are
not adjacent).

For example, referring to Figure 31.4, we see that G1 6≡ G2 , since reversing the 2 → 4 arc creates a new
v-structure. However, G1 ≡ G3 , since reversing the 1 → 5 arc does not create a new v-structure.
We can represent a Markov equivalence class using a single partially directed acyclic graph or PDAG
(also called an essential graph or pattern), in which some edges are directed and some undirected (see ??).
The undirected edges represent reversible edges; any combination is possible so long as no new v-structures
are created. The directed edges are called compelled edges, since changing their orientation would change
the v-structures and hence change the equivalence class. For example, the PDAG X − Y − Z represents
{X → Y → Z, X ← Y ← Z, X ← Y → Z}, which encodes X ⊥̸ Z and X ⊥ Z | Y. See Figure 31.4 for another
example.
The significance of the above theorem is that, when we learn the DAG structure from data, we will not be
able to uniquely identify all of the edge directions, even given an infinite amount of data. We say that we
can learn DAG structure “up to Markov equivalence”. This also cautions us not to read too much into the
meaning of particular edge orientations, since we can often change them without changing the model in any
observable way. (If we want to distinguish between edge orientations within a PDAG (e.g., if we want to
imbue a causal interpretation on the edges), we can use interventional data, as we discuss in Section 31.4.2.)

31.2.3 Bayesian model selection: statistical foundations
In this section, we discuss how to compute the exact posterior over graphs, p(G|D), ignoring for now the issue
of computational tractability. We assume there is no missing data, and that there are no hidden variables.
This is called the complete data assumption.
For simplicity, we will focus on the case where all the variables are categorical and all the CPDs are tables.
Our presentation is based in part on [HGC95], although we will follow the notation of ??. In particular,
let xit ∈ {1, . . . , Kt} be the value of node t in case i, where Kt is the number of states for node t. Let
θtck ≜ p(xt = k | xpa(t) = c), for k = 1 : Kt and c = 1 : Ct, where Ct is the number of parent combinations
(possible conditioning cases). For notational simplicity, we will often assume Kt = K, so all nodes have the
same number of states. We will also let dt = dim(pa(t)) be the degree or fan-in of node t, so that Ct = K^dt.

31.2.3.1 Deriving the likelihood


Assuming there is no missing data, and that all CPDs are tabular, the likelihood can be written as follows:

p(D|G, θ) = Π_{i=1}^{N} Π_{t=1}^{V} Cat(xit | xi,pa(t), θt)    (31.20)

          = Π_{i=1}^{N} Π_{t=1}^{V} Π_{c=1}^{Ct} Π_{k=1}^{Kt} θtck^{I(xit=k, xi,pa(t)=c)} = Π_{t=1}^{V} Π_{c=1}^{Ct} Π_{k=1}^{Kt} θtck^{Ntck}    (31.21)

where Ntck is the number of times node t is in state k and its parents are in state c. (Technically these counts
depend on the graph structure G, but we drop this from the notation.)

31.2.3.2 Deriving the marginal likelihood


Choosing the graph with the maximum likelihood will always pick a fully connected graph (subject to the
acyclicity constraint), since this maximizes the number of parameters. To avoid such overfitting, we will
choose the graph with the maximum marginal likelihood, p(D|G), where we integrate out the parameters; the
magic of the Bayesian Occam’s razor (??) will then penalize overly complex graphs.
To compute the marginal likelihood, we need to specify priors on the parameters. We will make two standard
assumptions. First, we assume global prior parameter independence, which means p(θ) = Π_{t=1}^{V} p(θt).
Second, we assume local prior parameter independence, which means p(θt) = Π_{c=1}^{Ct} p(θtc) for each t. It
turns out that these assumptions imply that the prior for each row of each CPT must be a Dirichlet [GH97],
that is, p(θtc) = Dir(θtc|αtc). Given these assumptions, and using the results of ??, we can write down the
marginal likelihood of any DAG as follows:
 
p(D|G) = Π_{t=1}^{V} Π_{c=1}^{Ct} ∫ [ Π_{i: xi,pa(t)=c} Cat(xit|θtc) ] Dir(θtc) dθtc    (31.22)

       = Π_{t=1}^{V} Π_{c=1}^{Ct} B(Ntc + αtc) / B(αtc)    (31.23)

       = Π_{t=1}^{V} Π_{c=1}^{Ct} [ Γ(αtc) / Γ(Ntc + αtc) ] Π_{k=1}^{Kt} Γ(Ntck + αtck) / Γ(αtck)    (31.24)

       = Π_{t=1}^{V} score(Nt,pa(t))    (31.25)

where Ntc = Σ_k Ntck, αtc = Σ_k αtck, Nt,pa(t) is the vector of counts (sufficient statistics) for node t and its
parents, and score() is a local scoring function defined by

score(Nt,pa(t)) ≜ Π_{c=1}^{Ct} B(Ntc + αtc) / B(αtc)    (31.26)

We say that the marginal likelihood decomposes or factorizes according to the graph structure.
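Here is a minimal sketch of the local score in Equation (31.26), computed in log space with gammaln for numerical stability; the count array layout and the BDeu pseudo-counts (see Section 31.2.3.3 below) are illustrative choices.

import numpy as np
from scipy.special import gammaln

def log_family_score(N_tck, alpha_tck):
    # log score(N_t,pa(t)) for one node; N_tck, alpha_tck are (Ct, Kt) arrays.
    N_tc = N_tck.sum(axis=1)          # counts per parent configuration
    a_tc = alpha_tck.sum(axis=1)
    # log B(N + a) - log B(a), summed over parent configurations c (Eq. 31.24)
    return np.sum(
        gammaln(a_tc) - gammaln(N_tc + a_tc)
        + np.sum(gammaln(N_tck + alpha_tck) - gammaln(alpha_tck), axis=1)
    )

# BDeu prior: alpha_tck = alpha / (Ct * Kt) for equivalent sample size alpha.
Ct, Kt, alpha = 4, 3, 5.0
alpha_tck = np.full((Ct, Kt), alpha / (Ct * Kt))
N_tck = np.random.default_rng(0).integers(0, 20, size=(Ct, Kt)).astype(float)
print(log_family_score(N_tck, alpha_tck))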

31.2.3.3 Setting the prior


How should we set the hyper-parameters αtck? It is tempting to use a Jeffreys prior of the form αtck = 1/2
(??). However, it turns out that this violates a property called likelihood equivalence, which is sometimes
considered desirable. This property says that if G1 and G2 are Markov equivalent (Section 31.2.2), they
should have the same marginal likelihood, since they are essentially equivalent models. Geiger and Heckerman
[GH97] proved that, for complete graphs, the only prior that satisfies likelihood equivalence and parameter
independence is the Dirichlet prior, where the pseudo counts have the form

αtck = α p0 (xt = k, xpa(t) = c) (31.27)

where α > 0 is called the equivalent sample size, and p0 is some prior joint probability distribution. This
is called the BDe prior, which stands for Bayesian Dirichlet likelihood equivalent.
To derive the hyper-parameters for other graph structures, Geiger and Heckerman [GH97] invoked an
additional assumption called parameter modularity, which says that if node Xt has the same parents in
G1 and G2 , then p(θt |G1 ) = p(θt |G2 ). With this assumption, we can always derive αt for a node t in any
other graph by marginalizing the pseudo counts in Equation (31.27).
Typically the prior distribution p0 is assumed to be uniform over all possible joint configurations. In this
case, we have αtck = α/(Kt Ct), since p0(xt = k, xpa(t) = c) = 1/(Kt Ct). Thus if we sum the pseudo counts over all
Ct × Kt entries in the CPT, we get a total equivalent sample size of α. This is called the BDeu prior, where
the "u" stands for uniform. This is the most widely used prior for learning Bayes net structures. For advice
on setting the global tuning parameter α, see [SKM07].

31.2.3.4 Example: analysis of the college plans dataset


We now consider a larger example from [HMC97], who analyzed a dataset of 5 variables, related to the
decision of high school students about whether to attend college. Specifically, the variables are as follows:
• Sex: Male or female
• SES: Socio economic status: low, lower middle, upper middle or high.
• IQ: Intelligence quotient: discretized into low, lower middle, upper middle or high.
• PE: Parental encouragement: low or high.
• CP: College plans: yes or no.
These variables were measured for 10,318 Wisconsin high school seniors. There are 2 × 4 × 4 × 2 × 2 = 128
possible joint configurations.
Heckerman et al. computed the exact posterior over all 29,281 possible 5 node DAGs, except for ones in
which SEX and/or SES have parents, and/or CP have children. (The prior probability of these graphs was
set to 0, based on domain knowledge.) They used the BDeu score with α = 5, although they said that the
results were robust to any α in the range 3 to 40. The top two graphs are shown in Figure 31.6. We see that
the most probable one has approximately all of the probability mass, so the posterior is extremely peaked.
It is tempting to interpret this graph in terms of causality (see ?? for a detailed discussion of this topic).
In particular, it seems that socio-economic status, IQ and parental encouragement all causally influence the
decision about whether to go to college, which makes sense. Also, sex influences college plans only indirectly
through parental encouragement, which also makes sense. However, the direct link from socio-economic status
to IQ seems surprising; this may be due to a hidden common cause. In Section 31.2.8.5 we will re-examine
this dataset allowing for the presence of hidden variables.

Figure 31.6: The two most probable DAGs learned from the Sewell-Shah data. From [HMC97]. Used with kind
permission of David Heckerman.

31.2.3.5 Marginal likelihood for non-tabular CPDs

If all CPDs are linear Gaussian, we can replace the Dirichlet-multinomial model with the normal-gamma
model, and thus derive a different exact expression for the marginal likelihood. See [GH94] for the details.
In fact, we can easily combine discrete nodes and Gaussian nodes, as long as the discrete nodes always
have discrete parents; this is called a conditional Gaussian DAG. Again, we can compute the marginal
likelihood in closed form. See [BD03] for the details.
In the general case (i.e., everything except Gaussians and CPTs), we need to approximate the marginal
likelihood. The simplest approach is to use the BIC approximation, which has the form

Σ_t [ log p(Dt|θ̂t) − (Kt Ct / 2) log N ]    (31.28)

31.2.4 Bayesian model selection: algorithms


In this section, we discuss some algorithms for approximately computing the mode of (or samples from) the
posterior p(G|D).

31.2.4.1 The K2 algorithm for known node orderings

Suppose we know a total ordering of the nodes. Then we can compute the distribution over parents for
each node independently, without the risk of introducing any directed cycles: we simply enumerate over all
possible subsets of ancestors and compute their marginal likelihoods. If we just return the best set of parents
for each node, we get the K2 algorithm [CH92]. In this case, we can find the best set of parents for
each node using ℓ1-regularization, as shown in [SNMM07].

31.2.4.2 Dynamic programming algorithms

In general, the ordering of the nodes is not known, so the posterior does not decompose. Nevertheless, we
can use dynamic programming to find the globally optimal MAP DAG (up to Markov equivalence), as shown
in [KS04; SM06].
If our goal is knowledge discovery, the MAP DAG can be misleading, for reasons we discussed in ??. A
better approach is to compute the marginal probability that each edge is present, p(Gst = 1|D). We can also
compute these quantities using dynamic programming, as shown in [Koi06; PK11].
Unfortunately, all of these methods take V 2^V time in the general case, making them intractable for graphs
with more than about 16 nodes.

Figure 31.7: A locally optimal DAG learned from the 20-newsgroup data. From Figure 4.10 of [Sch10a]. Used with
kind permission of Mark Schmidt.

31.2.4.3 Scaling up to larger graphs


The main challenge in computing the posterior over DAGs is that there are so many possible graphs. More
precisely, [Rob73] showed that the number of DAGs on D nodes satisfies the following recurrence:
f(D) = Σ_{i=1}^{D} (−1)^{i+1} C(D, i) 2^{i(D−i)} f(D − i)    (31.29)

for D > 2. The base case is f(1) = 1. Solving this recurrence yields the following sequence: 1, 3, 25, 543,
29281, 3781503, etc.1
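As a quick sanity check of this recurrence (taking f(0) = 1, which yields the stated base case f(1) = 1):

from math import comb

def num_dags(D_max):
    f = [1]  # f(0) = 1
    for D in range(1, D_max + 1):
        # Robinson's recurrence, Eq. (31.29)
        f.append(sum((-1) ** (i + 1) * comb(D, i) * 2 ** (i * (D - i)) * f[D - i]
                     for i in range(1, D + 1)))
    return f[1:]

print(num_dags(6))  # [1, 3, 25, 543, 29281, 3781503]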
Indeed, the general problem of finding the globally optimal MAP DAG is provably NP-complete [Chi96].
In view of the enormous size of the hypothesis space, we are generally forced to use approximate methods,
some of which we review below.

31.2.4.4 Hill climbing methods for approximating the mode


A common way to find an approximate MAP graph structure is to use a greedy hill climbing method. At
each step, the algorithm proposes small changes to the current graph, such as adding, deleting or reversing
a single edge; it then moves to the neighboring graph which most increases the posterior. The method
stops when it reaches a local maximum. It is important that the method only proposes local changes to
the graph, since this enables the change in marginal likelihood (and hence the posterior) to be computed in
constant time (assuming we cache the sufficient statistics). This is because all but one or two of the terms in
Equation (31.23) will cancel out when computing the log Bayes factor δ(G → G′) = log p(G′|D) − log p(G|D).
We can initialize the search from the best tree, which can be found using exact methods discussed in
Section 31.1.2. For speed, we can restrict the search so it only adds edges which are part of the Markov
1 A longer list of values can be found at http://www.research.att.com/~njas/sequences/A003024. Interestingly, the number

of DAGs is equal to the number of (0,1) matrices all of whose eigenvalues are positive real numbers [McK+04].

blankets estimated from a dependency network [Sch10a]. Figure 31.7 gives an example of a DAG learned in
this way from the 20-newsgroup data. For binary data, it is possible to use techniques from frequent itemset
mining to find good Markov blanket candidates, as described in [GM04].
We can use techniques such as multiple random restarts to increase the chance of finding a good local
maximum. We can also use more sophisticated local search methods, such as genetic algorithms or simulated
annealing, for structure learning. (See also Section 31.2.6 for gradient based techniques based on continuous
relaxations.)
It is also possible to perform the greedy search in the space of PDAGs instead of in the space of DAGs;
this is known as the greedy equivalence search method [Chi02]. Although each step is somewhat more
complicated, the advantage is that the search space is smaller.

31.2.4.5 Sampling methods


If our goal is knowledge discovery, the MAP DAG can be misleading, for reasons we discussed in ??. A better
approach is to compute the probability that each edge is present, p(Gst = 1|D). We can do this exactly using
dynamic programming [Koi06; PK11], although this can be expensive. An approximate method is to sample
DAGs from the posterior, and then to compute the fraction of times there is an s → t edge or path for each
(s, t) pair. The standard way to draw samples is to use the Metropolis Hastings algorithm (??), where we use
the same local proposal as we did in greedy search [MR94].
A faster-mixing method is to use a collapsed MH sampler, as suggested in [FK03]. This exploits the fact
that, if a total ordering of the nodes is known, we can select the parents for each node independently, without
worrying about cycles, as discussed in Section 31.2.4.1. By summing over all possible choice of parents, we
can marginalize out this part of the problem, and just sample total orders. [EW08] also use order-space
(collapsed) MCMC, but this time with a parallel tempering MCMC algorithm.

31.2.5 Constraint-based approach


We now present an approach to learning a DAG structure — up to Markov equivalence (the output of the
method is a PDAG) — that uses local conditional independence tests, rather than scoring models globally
with a likelihood. The CI tests are combined together to infer the global graph structure, so this approach is
called constraint-based. The advantage of CI testing is that it is more local and does not require specifying
a complete model. (However, the form of the CI test implicitly relies on assumptions, see e.g., [SP18].)

31.2.5.1 IC algorithm
The original algorithm, due to Verma and Pearl [VP90], was called the IC algorithm, which stands for
“inductive causation”. The method is as follows [Pea09, p50]:

1. For each pair of variables a and b, search for a set Sab such that a ⊥ b|Sab . Construct an undirected
graph such that a and b are connected iff no such set Sab can be found (i.e., they cannot be made
conditionally independent).
2. Orient the edges involved in v-structures as follows: for each pair of nonadjacent nodes a and b with a
common neighbor c, check if c ∈ Sab ; if it is, the corresponding DAG must be a → c → b, a ← c → b
or a ← c ← b, so we cannot determine the direction; if it is not, the DAG must be a → c ← b, so add
these arrows to the graph.
3. In the partially directed graph that results, orient as many of the undirected edges as possible, subject
to two conditions: (1) the orientation should not create a new v-structure (since that would have been
detected already if it existed), and (2) the orientation should not create a directed cycle. More precisely,
follow the rules shown in Figure 31.8. In the first case, if X → Y has a known orientation, but Y − Z is
unknown, then we must have Y → Z, otherwise we would have created a new v-structure X → Y ← Z,
which is not allowed. The other two cases follow similar reasoning.

Figure 31.8: The 3 rules for inferring compelled edges in PDAGs. Adapted from [Pe’05].

31.2.5.2 PC algorithm
A significant speedup of IC, known as the PC algorithm after its creators Peter Spirtes and Clark Glymour
[SG91], can be obtained by ordering the search for separating sets in step 1 in terms of sets of increasing
cardinality. We start with a fully connected graph, and then look for sets Sab of size 0, then of size 1, and so
on; as soon as we find a separating set, we remove the corresponding edge. See Figure 31.9 for an example.
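Here is a minimal sketch of this skeleton phase, using a Fisher-z partial correlation test as the CI oracle for Gaussian data; the edge-orientation steps are omitted, and all names and the significance level are illustrative assumptions.

from itertools import combinations
import numpy as np
from scipy.stats import norm

def ci_test(C, n, a, b, S, alpha=0.05):
    # Test a _||_ b | S from correlation matrix C with n samples (Fisher z).
    idx = [a, b] + list(S)
    P = np.linalg.inv(C[np.ix_(idx, idx)])        # precision of the submatrix
    r = -P[0, 1] / np.sqrt(P[0, 0] * P[1, 1])     # partial correlation
    z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(n - len(S) - 3)
    return 2 * (1 - norm.cdf(abs(z))) > alpha     # True if independence not rejected

def pc_skeleton(X, alpha=0.05):
    n, D = X.shape
    C = np.corrcoef(X, rowvar=False)
    adj = {a: set(range(D)) - {a} for a in range(D)}  # start fully connected
    size = 0                                          # cardinality of S_ab
    while any(len(adj[a]) - 1 >= size for a in adj):
        for a in range(D):
            for b in list(adj[a]):
                for S in combinations(adj[a] - {b}, size):
                    if ci_test(C, n, a, b, S, alpha): # found a separating set
                        adj[a].discard(b); adj[b].discard(a)
                        break
        size += 1
    return adj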
Another variant on the PC algorithm is to learn the original undirected structure (i.e., the Markov blanket
of each node) using generic variable selection techniques instead of CI tests. This tends to be more robust,
since it avoids issues of statistical significance that can arise with independence tests. See [PE08] for details.
The running time of the PC algorithm is O(D^{K+1}) [SGS00, p85], where D is the number of nodes and K
is the maximal degree (number of neighbors) of any node in the corresponding undirected graph.

31.2.5.3 Frequentist vs Bayesian methods


The IC/PC algorithm relies on an oracle that can test for conditional independence between any set of
variables, A ⊥ B|C. This can be approximated using hypothesis testing methods applied to a finite data set,
such as chi-squared tests for discrete data. However, such methods work poorly with small sample sizes, and
can run into problems with multiple testing (since so many hypotheses are being compared). In addition,
errors made at any given step can lead to an incorrect final result, as erroneous constraints get propagated.
In practice it is common to use a hybrid approach, where we use IC/PC to create an initial structure, and
then use this to speed up Bayesian model selection, which tends to be more robust, since it avoids any hard
decisions about conditional independence or lack thereof.

31.2.6 Methods based on sparse optimization


There is a 1:1 connection between sparse graphs and sparse adjacency matrices. This suggests that we can
perform structure learning by using continuous optimization methods that enforce sparsity, similar to lasso
and other ℓ1 penalty methods (??). In the case of undirected graphs, this is relatively straightforward, and
results in a convex objective, as we discuss in Section 31.3.2. However, in the case of DAGs, the problem is
harder, because of the acyclicity constraint. Fortunately, [Zhe+18] showed how to encode this constraint as a
smooth penalty term. (They call their method “DAGs with no tears”, since it is supposed to be painless to

Figure 31.9: Example of step 1 of the PC algorithm. From Figure 5.1 of [SGS00]. Used with kind permission of Peter
Spirtes.

use.) In particular, they show how to convert the combinatorial problem into a continuous problem:

min_{W∈R^{D×D}} f(W)  s.t.  G(W) ∈ DAGs   ⇐⇒   min_{W∈R^{D×D}} f(W)  s.t.  h(W) = 0    (31.30)

Here W is a weighted adjacency matrix on D nodes, G(W) is the corresponding graph (obtained by
thresholding W at 0), f (W) is a scoring function (e.g., penalized log likelihood), and h(W) is a constraint
function that measures how close W is to defining a DAG. The constraint is given by
h(W) = tr((I + αW)^D) − D ∝ tr(Σ_{k=1}^{D} α^k W^k)    (31.31)

where W^k = W · · · W with k factors, and α > 0 is a regularizer. Element (i, j) of W^k will be non-zero iff
there is a path from j to i made of k edges. Hence the diagonal elements of W^k count the number of paths
from a node back to itself in k steps. Thus h(W) will be 0 if W defines a valid DAG.
The scoring function considered in [Zhe+18] has the form
f(W) = (1/2N) ||X − XW||²_F + λ||W||₁    (31.32)

where X ∈ R^{N×D} is the data matrix. They show how to find a local optimum of the equality-constrained
objective using gradient-based methods. The cost per iteration is O(D³).
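To make this concrete, here is a minimal NumPy sketch of the two ingredients. This is a simplified illustration, not the implementation of [Zhe+18]: following later work in this line, we square W elementwise inside the penalty so that it is nonnegative and smooth, and α is a free hyperparameter.

import numpy as np

def h(W, alpha=0.1):
    # Acyclicity penalty: tr((I + alpha * W∘W)^D) - D, which is zero iff
    # the weighted graph defined by W has no directed cycles.
    D = W.shape[0]
    M = np.eye(D) + alpha * (W * W)   # elementwise square keeps entries >= 0
    return np.trace(np.linalg.matrix_power(M, D)) - D

def f(W, X, lam=0.1):
    # Score: (1/2N) ||X - XW||_F^2 + lam * ||W||_1, as in Equation (31.32).
    N = X.shape[0]
    return np.sum((X - X @ W) ** 2) / (2 * N) + lam * np.abs(W).sum()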
Several extensions of this have been proposed. For example, [Yu+19] replace the Gaussian noise assumption
with a VAE (variational autoencoder, ??), and use a graph neural network as the encoder/decoder. And
[Lac+20] relax the linearity assumption, and allow for the use of neural network dependencies between
variables.

31.2.7 Consistent estimators


A natural question is whether any of the above algorithms can recover the “true” DAG structure G (up to
Markov equivalence), in the limit of infinite data. We assume that the data was generated by a distribution p
that is faithful to G (see Section 31.2.1).

The posterior mode (MAP) is known to converge to the MLE, which in turn will converge to the true
graph G (up to Markov equivalence), so any exact algorithm for Bayesian inference is a consistent estimator.
[Chi02] showed that his greedy equivalence search method (which is a form of hill climbing in the space of
PDAGs) is a consistent estimator. Similarly, [SGS00; KB07] showed that the PC algorithm is a consistent estimator.
However, the running time of these algorithms might be exponential in the number of nodes. Also, all of
these methods assume that all the variables are fully observed.

31.2.8 Handling latent variables


In general, we will not get to observe the values of all the nodes (i.e., the complete data assumption does
not hold), either because we have missing data, and/or because we have hidden variables. This makes it
intractable to compute the marginal likelihood of any given graph structure, as we discuss in Section 31.2.8.1.
It also opens up new problems, such as knowing how many hidden variables to add to the model, and how to
connect them, as we discuss in Section 31.2.8.7.

31.2.8.1 Approximating the marginal likelihood


If we have hidden or missing variables h, the marginal likelihood is given by
p(D|G) = ∫ Σ_h p(D, h|θ, G) p(θ|G) dθ = Σ_h ∫ p(D, h|θ, G) p(θ|G) dθ    (31.33)

In general this is intractable to compute. For example, consider a mixture model, where we don’t observe the
cluster label. In this case, there are K N possible completions of the data (assuming we have K clusters);
we can evaluate the inner integral for each one of these assignments to h, but we cannot afford to evaluate
all of the integrals. (Of course, most of these integrals will correspond to hypotheses with little posterior
support, such as assigning single data points to isolated clusters, but we don’t know ahead of time the relative
weight of these assignments.) Below we mention some faster deterministic approximations for the marginal
likelihood.

31.2.8.2 BIC approximation


A simple approximation to the marginal likelihood is to use the BIC score (??), which is given by
BIC(G) ≜ log p(D|θ̂, G) − (log N / 2) dim(G)    (31.34)
where dim(G) is the number of degrees of freedom in the model and θ̂ is the MAP or ML estimate. However,
the BIC score often severely underestimates the true marginal likelihood [CH97], resulting in it selecting
overly simple models. We discuss some better approximations below.

31.2.8.3 Cheeseman-Stutz approximation


We now discuss the Cheeseman-Stutz approximation (CS) to the marginal likelihood [CS96]. We first
compute a MAP estimate of the parameters θ̂ (e.g., using EM). Denote the expected sufficient statistics
of the filled-in data by D̄ = D̄(θ̂); in the case of discrete variables, we just “fill in” the hidden variables with their
expectations. We then use the exact marginal likelihood equation on this filled-in data:

p(D|G) ≈ p(D̄|G) = ∫ p(D̄|θ, G) p(θ|G) dθ    (31.35)

However, comparing this to Equation (31.33), we can see that the value will be exponentially smaller, since it
does not sum over all values of h. To correct for this, we first write

log p(D|G) = log p(D̄|G) + log p(D|G) − log p(D̄|G)    (31.36)

Figure 31.10: The most probable DAG with a single binary hidden variable learned from the Sewell-Shah data. MAP estimates of the CPT entries are shown for some of the nodes. From [HMC97]. Used with kind permission of David Heckerman.

and then we apply a BIC approximation to the last two terms:

log p(D|G) − log p(D̄|G) ≈ [log p(D|θ̂, G) − (log N / 2) dim(G)] − [log p(D̄|θ̂, G) − (log N / 2) dim(G)]    (31.37)
                        = log p(D|θ̂, G) − log p(D̄|θ̂, G)    (31.38)

Putting it all together, we get

log p(D|G) ≈ log p(D̄|G) + log p(D|θ̂, G) − log p(D̄|θ̂, G)    (31.39)

The first term p(D̄|G) can be computed by plugging the filled-in data into the exact marginal likelihood.
The second term p(D|θ̂, G), which involves an exponential sum (thus matching the “dimensionality” of the
left hand side), can be computed using an inference algorithm. The final term p(D̄|θ̂, G) can be computed by
plugging the filled-in data into the regular likelihood.

31.2.8.4 Variational Bayes EM


An even more accurate approach is to use the variational Bayes EM algorithm. Recall from ?? that the key
idea is to make the following factorization assumption:
p(θ, z_{1:N}|D) ≈ q(θ)q(z) = q(θ) ∏_i q(z_i)    (31.40)

where zi are the hidden variables in case i. In the E step, we update the q(zi ), and in the M step, we update
q(θ). The corresponding variational free energy provides a lower bound on the log marginal likelihood. In
[BG06], it is shown that this bound is a much better approximation to the true log marginal likelihood (as
estimated by a slow annealed importance sampling procedure) than either BIC or CS. In fact, one can prove
that the variational bound will always be more accurate than CS (which in turn is always more accurate
than BIC).

31.2.8.5 Example: college plans revisited


Let us revisit the college plans dataset from Section 31.2.3.4. Recall that if we ignore the possibility of
hidden variables, there was a direct link from socioeconomic status to IQ in the MAP DAG. Heckerman et

al. decided to see what would happen if they introduced a hidden variable H, which they made a parent of
both SES and IQ, representing a hidden common cause. They also considered a variant in which H points
to SES, IQ and PE. For both such cases, they considered dropping none, one, or both of the SES-PE and
PE-IQ edges. They varied the number of states for the hidden node from 2 to 6. Thus they computed the
approximate posterior over 8 × 5 = 40 different models, using the CS approximation.
The most probable model which they found is shown in Figure 31.10. This is 2 × 10^10 times more likely
than the best model containing no hidden variable. It is also 5 × 10^9 times more likely than the second most
probable model with a hidden variable. So again the posterior is very peaked.
These results suggest that there is indeed a hidden common cause underlying both the socioeconomic
status of the parents and the IQ of the children. By examining the CPT entries, we see that both SES and IQ
are more likely to be high when H takes on the value 1. They interpret this to mean that the hidden variable
represents “parent quality” (possibly a genetic factor). Note, however, that the arc between H and SES can
be reversed without changing the v-structures in the graph, and thus without affecting the likelihood; this
underscores the difficulty in interpreting hidden variables.
Interestingly, the hidden variable model has the same conditional independence assumptions amongst the
visible variables as the most probable visible variable model. So it is not possible to distinguish between these
hypotheses by merely looking at the empirical conditional independencies in the data (which is the basis
of the constraint-based approach to structure learning discussed in Section 31.2.5). Instead, by adopting a
Bayesian approach, which takes parsimony into account (and not just conditional independence), we can
discover the possible existence of hidden factors. This is the basis of much of scientific and everyday human
reasoning (see e.g. [GT09] for a discussion).

31.2.8.6 Structural EM
One way to perform structural inference in the presence of missing data is to use a standard search procedure
(deterministic or stochastic), and to use the methods from Section 31.2.8.1 to estimate the marginal likelihood.
However, this approach is not very efficient, because the marginal likelihood does not decompose when we
have missing data, and nor do its approximations. For example, if we use the CS approximation or the VBEM
approximation, we have to perform inference in every neighboring model, just to evaluate the quality of a
single move!
[Fri97; Thi+98] presents a much more efficient approach called the structural EM algorithm. The basic
idea is this: instead of fitting each candidate neighboring graph and then filling in its data, fill in the data
once, and use this filled-in data to evaluate the score of all the neighbors. Although this might be a bad
approximation to the marginal likelihood, it can be a good enough approximation of the difference in marginal
likelihoods between different models, which is all we need in order to pick the best neighbor.
More precisely, define D̄(G₀, θ̂₀) to be the data filled in using model G₀ with MAP parameters θ̂₀. Now
define a modified BIC score as follows:

BIC(G, D) ≜ log p(D|θ̂, G) − (log N / 2) dim(G) + log p(G) + log p(θ̂|G)    (31.41)

where we have included the log prior for the graph and parameters. One can show [Fri97] that if we pick a
graph G which increases the BIC score relative to G₀ on the expected data, it will also increase the score on
the actual data, i.e.,

BIC(G, D̄) − BIC(G₀, D̄) ≤ BIC(G, D) − BIC(G₀, D)    (31.42)
To convert this into an algorithm, we proceed as follows. First we initialize with some graph G0 and some
set of parameters θ₀. Then we fill in the data using the current parameters — in practice, this means when
we ask for the expected counts for any particular family, we perform inference using our current model. (If
we know which counts we will need, we can precompute all of them, which is much faster.) We then evaluate
the BIC score of all of our neighbors using the filled-in data, and we pick the best neighbor. We then refit
the model parameters, fill in the data again, and repeat. For increased speed, we may choose to only refit
the model every few steps, since small changes to the structure hopefully won’t invalidate the parameter
estimates and the filled-in data too much.
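In outline, the loop looks as follows (a schematic sketch; fill_in, neighbors, bic, and fit_params are placeholders for the model-specific routines described above, not a particular library's API).

def structural_em(G, theta, data, n_iters=10):
    # fill_in, neighbors, bic, fit_params are model-specific placeholders.
    for _ in range(n_iters):
        # Fill in the data (expected counts) using the current model.
        filled = fill_in(G, theta, data)
        # Score all neighboring graphs on the *same* filled-in data
        # (Equation 31.41) and greedily move to the best one.
        G = max(neighbors(G), key=lambda Gp: bic(Gp, filled))
        # Refit parameters (optionally only every few steps, for speed).
        theta = fit_params(G, filled)
    return G, theta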

Figure 31.11: A DGM with and without hidden variables. For example, the leaves might represent medical symptoms,
the root nodes primary causes (such as smoking, diet and exercise), and the hidden variable can represent mediating
factors, such as heart disease. Marginalizing out the hidden variable induces a clique.

One interesting application is to learn a phylogenetic tree structure. Here the observed leaves are the
DNA or protein sequences of currently alive species, and the goal is to infer the topology of the tree and the
values of the missing internal nodes. There are many classical algorithms for this task (see e.g., [Dur+98]),
but one that uses structural EM is discussed in [Fri+02].
Another interesting application of this method is to learn sparse mixture models [BF02]. The idea is
that we have one hidden variable C specifying the cluster, and we have to choose whether to add edges
C → Xt for each possible feature Xt . Thus some features will be dependent on the cluster id, and some will
be independent. (See also [LFJ04] for a different way to perform this task, using regular EM and a set of bits,
one per feature, that are free to change across data cases.)

31.2.8.7 Discovering hidden variables


In Section 31.2.8.5, we introduced a hidden variable “by hand”, and then figured out the local topology by
fitting a series of different models and computing the one with the best marginal likelihood. How can we
automate this process?
Figure 31.11 provides one useful intuition: if there is a hidden variable in the “true model”, then its
children are likely to be densely connected. This suggests the following heuristic [Eli+00]: perform structure
learning in the visible domain, and then look for structural signatures, such as sets of densely connected
nodes (near-cliques); introduce a hidden variable and connect it to all nodes in this near-clique; and then let
structural EM sort out the details. Unfortunately, this technique does not work too well, since structure
learning algorithms are biased against fitting models with densely connected cliques.
Another useful intuition comes from clustering. In a flat mixture model, also called a latent class model,
the discrete latent variable provides a compressed representation of its children. Thus we want to create
hidden variables with high mutual information with their children.
One way to do this is to create a tree-structured hierarchy of latent variables, each of which only has
to explain a small set of children. [Zha04] calls this a hierarchical latent class model. They propose a
greedy local search algorithm to learn such structures, based on adding or deleting hidden nodes, adding or
deleting edges, etc. (Note that learning the optimal latent tree is NP-hard [Roc06].)
Recently [HW11] proposed a faster greedy algorithm for learning such models based on agglomerative
hierarchical clustering. Rather than go into details, we just give an example of what this system can learn.
Figure 31.12 shows part of a latent forest learned from the 20-newsgroup data. The algorithm imposes the
constraint that each latent node has exactly two children, for speed reasons. Nevertheless, we see interpretable
clusters arising. For example, Figure 31.12 shows separate clusters concerning medicine, sports and religion.
This provides an alternative to LDA and other topic models (Section 28.1.1), with the added advantage that
inference in latent trees is exact and takes time linear in the number of nodes.

Figure 31.12: Part of a hierarchical latent tree learned from the 20-newsgroup data. From Figure 2 of [HW11]. Used with kind permission of Stefan Harmeling.

Figure 31.13: A partially latent tree learned from the 20-newsgroup data. Note that some words can have multiple meanings, and get connected to different latent variables, representing different “topics”. For example, the word “win” can refer to a sports context (represented by h5) or the Microsoft Windows context (represented by h25). From Figure 12 of [Cho+11]. Used with kind permission of Jin Choi.

Figure 31.14: Google’s rephil model. Leaves represent presence or absence of words. Internal nodes represent clusters
of co-occurring words, or “concepts”. All nodes are binary, and all CPDs are noisy-OR. The model contains 12 million
word nodes, 1 million latent cluster nodes, and 350 million edges. Used with kind permission of Brian Milch.

An alternative approach is proposed in [Cho+11], in which the observed data is not constrained to be at
the leaves. This method starts with the Chow-Liu tree on the observed data, and then adds hidden variables
to capture higher-order dependencies between internal nodes. This results in much more compact models,
as shown in Figure 31.13. This model also has better predictive accuracy than other approaches, such as
mixture models, or trees where all the observed data is forced to be at the leaves. Interestingly, one can show
that this method can recover the exact latent tree structure, provided the data is generated from a tree. See
[Cho+11] for details. Note, however, that this approach, unlike [Zha04; HW11], requires that the cardinality
of all the variables, hidden and observed, be the same. Furthermore, if the observed variables are Gaussian,
the hidden variables must be Gaussian also.

31.2.8.8 Example: Google’s Rephil


In this section, we describe a huge DGM called Rephil, which was automatically learned from data.² The
model is widely used inside Google for various purposes, including their famous AdSense system.³
The model structure is shown in Figure 31.14. The leaves are binary nodes, and represent the presence or
absence of words or compounds (such as “New York City”) in a text document or query. The latent variables
are also binary, and represent clusters of co-occurring words. All CPDs are noisy-OR, since some leaf nodes
(representing words) can have many parents. This means each edge can be augmented with a hidden variable
specifying if the link was activated or not; if the link is not active, then the parent cannot turn the child on.
(A very similar model was proposed independently in [SH06].)
Parameter learning is based on EM, where the hidden activation status of each edge needs to be inferred
[MH97]. Structure learning is based on the old neuroscience idea that “nodes that fire together should
wire together”. To implement this, we run inference and check for cluster-word and cluster-cluster pairs
that frequently turn on together. We then add an edge from parent to child if the link can significantly
increase the probability of the child. Links that are not activated very often are pruned out. We initialize
with one cluster per “document” (corresponding to a set of semantically related phrases). We then merge
clusters A and B if A explains B’s top words and vice versa. We can also discard clusters that are used too
rarely.
² The original system, called “Phil”, was developed by Georges Harik and Noam Shazeer. It has been published as US Patent #8024372, “Method and apparatus for learning a probabilistic generative model for text”, filed in 2004. Rephil is a more probabilistically sound version of the method, developed by Uri Lerner et al. The summary below is based on notes by Brian Milch (who also works at Google).
³ AdSense is Google’s system for matching web pages with content-appropriate ads in an automatic way, by extracting semantic keywords from web pages. These keywords play a role analogous to the words that users type in when searching; this latter form of information is used by Google’s AdWords system. The details are secret, but [Lev11] gives an overview.

The model was trained on about 100 billion text snippets or search queries; this takes several weeks,
even on a parallel distributed computing architecture. The resulting model contains 12 million word nodes
and about 1 million latent cluster nodes. There are about 350 million links in the model, including many
cluster-cluster dependencies. The longest path in the graph has length 555, so the model is quite deep.
Exact inference in this model is obviously infeasible. However note that most leaves will be off, since
most words do not occur in a given query; such leaves can be analytically removed. We can also prune out
unlikely hidden nodes by following the strongest links from the words that are on up to their parents to
get a candidate set of concepts. We then run iterative conditional modes (ICM) to perform approximate
inference. (ICM is a deterministic version of Gibbs sampling that sets each node to its most probable state
given the values of its neighbors in its Markov blanket.) This continues until it reaches a local maximum. We
can repeat this process a few times from random starting configurations. At Google, this can be made to run
in 15 milliseconds!
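For concreteness, here is a minimal sketch of ICM for a generic discrete model. The function cond_prob(node, value, state), giving the conditional probability of a node given its Markov blanket, is an assumed user-supplied routine, and the pseudo-likelihood score used to compare restarts is a simplification, not part of the Rephil system.

import math, random

def icm(nodes, values, cond_prob, n_restarts=3, max_sweeps=50):
    best, best_score = None, -math.inf
    for _ in range(n_restarts):                  # a few random restarts
        state = {n: random.choice(values) for n in nodes}
        for _ in range(max_sweeps):
            changed = False
            for n in nodes:
                # Set each node to its most probable value given the
                # current values of its Markov blanket.
                v = max(values, key=lambda v: cond_prob(n, v, state))
                if state[n] != v:
                    state[n], changed = v, True
            if not changed:                      # local maximum reached
                break
        # Pseudo-likelihood surrogate for comparing restarts (a choice).
        score = sum(math.log(cond_prob(n, state[n], state)) for n in nodes)
        if score > best_score:
            best, best_score = state, score
    return best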

31.2.8.9 Spectral methods


Recently, various methods have been developed that can recover the exact structure of the DAG, even in the
presence of (a known number of) latent variables, under certain assumptions. In particular, identifiability
results have been obtained for the following cases:

• If x contains 3 or more independent views of z [Goo74; AMR09; AHK12; HKZ12], sometimes called the
triad constraint.

• If z is categorical, and x is a GMM with mixture components which depend on z [Ana+14].

• If z is composed of binary variables, and x is a set of noisy-OR CPDs [JHS13; Aro+16].

In terms of algorithms, most of these methods are not based on maximum likelihood, but instead use the
method of moments and spectral methods. For details, see [Ana+14].

31.2.8.10 Constraint-based methods for learning ADMGs


An alternative to explicitly modeling latent variables is to marginalize them out, and work with acyclic
directed mixed graphs (??). It is possible to perform Bayesian model selection for ADMGs, although the
method is somewhat slow and complicated [SG09]. Alternatively, one can modify the PC/IC algorithm to
learn an ADMG. This method is known as the IC* algorithm [Pea09, p52]; one can speed it up to get the
FCI algorithm (FCI stands for “fast causal inference”) [SGS00, p144].
Since there will inevitably be some uncertainty about edge orientations, due to Markov equivalence, the
output of IC*/FCI is not actually an ADMG, but a closely related structure called a partially oriented
inducing path graph [SGS00, p135] or a marked pattern [Pea09, p52]. Such a graph has 4 kinds of
edges:

• A marked arrow a → b signifying a directed path from a to b.
• An unmarked arrow a → b signifying a directed path from a to b or a latent common cause a ← L → b.
• A bidirected arrow a ↔ b signifying a latent common cause a ← L → b.
• An undirected edge a − b signifying a ← b or a → b or a latent common cause a ← L → b.
IC*/FCI is faster than Bayesian inference, but suffers from the same problems as the original IC/PC
algorithm (namely, the need for a CI testing oracle, problems due to multiple testing, no probabilistic
representation of uncertainty, etc.) Furthermore, by not explicitly representing the latent variables, the
resulting model cannot be used for inference and prediction.

31.3 Learning undirected graph structures


In this section, we discuss how to learn the structure of undirected graphical models. On the one hand, this
is easier than learning DAG structure because we don’t need to worry about acyclicity. On the other hand, it


Figure 31.15: A dependency network constructed from the 20 newsgroup data. We show all edges with regression weight
above 0.5 in the Markov blankets estimated by ℓ1 penalized logistic regression. Undirected edges represent cases where a
directed edge was found in both directions. From Figure 4.9 of [Sch10a]. Used with kind permission of Mark Schmidt.

is harder than learning DAG structure since the likelihood does not decompose (see ??). This precludes the
kind of local search methods (both greedy search and MCMC sampling) we used to learn DAG structures,
because the cost of evaluating each neighboring graph is too high, since we have to refit each model from
scratch (there is no way to incrementally update the score of a model). In this section, we discuss several
solutions to this problem.

31.3.1 Dependency networks


A simple way to learn the structure of a UGM is to represent it as a product of full conditionals:

p(x) = (1/Z) ∏_{d=1}^{D} p(x_d |x_{−d})    (31.43)

This expression is called the pseudolikelihood.


Such a collection of local distributions defines a model called a dependency network [Hec+00]. Unfor-
tunately, a product of full conditionals which are independently estimated is not guaranteed to be consistent
with any valid joint distribution. However, we can still use the model inside of a Gibbs sampler to approximate
a joint distribution. This approach is sometimes used for data imputation [GR01].
However, the main advantage of dependency networks is that we can use sparse regression techniques
for each distribution p(x_d|x_{−d}) to induce a sparse graph structure. For example, [Hec+00] use classification/regression trees, [MB06] use ℓ1-regularized linear regression, [WRL06; WSD19] use ℓ1-regularized logistic
regression, [Dob09] uses Bayesian variable selection, etc.
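As an illustrative sketch of this idea (assuming binary data in an N×D array X where each column takes both values; the AND symmetrization rule and penalty strength C are choices, not prescribed by [Hec+00]):

import numpy as np
from sklearn.linear_model import LogisticRegression

def dependency_network(X, C=0.1):
    # Estimate each node's Markov blanket by L1-logistic regression of
    # that variable on all the others, then symmetrize (AND rule).
    N, D = X.shape
    W = np.zeros((D, D))
    for d in range(D):
        others = np.delete(np.arange(D), d)
        clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
        clf.fit(X[:, others], X[:, d])
        W[d, others] = clf.coef_.ravel()
    # Keep an edge only if both directed weights are nonzero.
    A = (W != 0) & (W.T != 0)
    return W, A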
Figure 31.15 shows a dependency network that was learned from the 20-newsgroup data using ℓ1-regularized
logistic regression, where the penalty parameter λ was chosen by BIC. Many of the words present in these
estimated Markov blankets represent fairly natural associations (aids:disease, baseball:fans, bible:god, bmw:car,
cancer:patients, etc.). However, some of the estimated statistical dependencies seem less intuitive, such as

baseball:windows and bmw:christian. We can gain more insight if we look not only at the sparsity pattern,
but also the values of the regression weights. For example, here are the incoming weights for the first 5 words:

• aids: children (0.53), disease (0.84), fact (0.47), health (0.77), president (0.50), research (0.53)

• baseball: christian (-0.98), drive (-0.49), games (0.81), god (-0.46), government (-0.69), hit (0.62),
memory (-1.29), players (1.16), season (0.31), software (-0.68), windows (-1.45)

• bible: car (-0.72), card (-0.88), christian (0.49), fact (0.21), god (1.01), jesus (0.68), orbit (0.83),
program (-0.56), religion (0.24), version (0.49)

• bmw: car (0.60), christian (-11.54), engine (0.69), god (-0.74), government (-1.01), help (-0.50), windows
(-1.43)

• cancer: disease (0.62), medicine (0.58), patients (0.90), research (0.49), studies (0.70)

Negative weights represent dissociative relationships. For example, the
model reflects that baseball:windows is an unlikely combination. It turns out that most of the weights are
negative (1173 negative, 286 positive, 8541 zero) in this model.
[MB06] discuss theoretical conditions under which dependency networks using ℓ1-regularized linear
regression can recover the true graph structure, assuming the data was generated from a sparse Gaussian
graphical model. We discuss a more general solution in Section 31.3.2.

31.3.2 Graphical lasso for GGMs


In this section, we consider the problem of learning the structure of undirected Gaussian graphical models
(GGM)s. These models are useful, since there is a 1:1 mapping between sparse parameters and sparse graph
structures. This allows us to extend the efficient techniques of ℓ1-regularized estimation in ?? to the graph
case; the resulting method is called the graphical lasso or Glasso [FHT08; MH12].

31.3.2.1 MLE for a GGM


Before discussing structure learning, we need to discuss parameter estimation. The task of computing the
MLE for a (non-decomposable) GGM is called covariance selection [Dem72].
The log likelihood can be written as

ℓ(Ω) = log det Ω − tr(SΩ)    (31.44)

where Ω = Σ⁻¹ is the precision matrix, and S = (1/N) Σ_{i=1}^{N} (x_i − x̄)(x_i − x̄)ᵀ is the empirical covariance matrix.
(For notational simplicity, we assume we have already estimated µ̂ = x̄.) One can show that the gradient of
this is given by

∇ℓ(Ω) = Ω⁻¹ − S    (31.45)
However, we have to enforce the constraints that Ωst = 0 if Gst = 0 (structural zeros), and that Ω is positive
definite. The former constraint is easy to enforce, but the latter is somewhat challenging (albeit still a convex
constraint). One approach is to add a penalty term to the objective if Ω leaves the positive definite cone;
this is the approach used in [DVR08]. Another approach is to use a coordinate descent method, described in
[HTF09, p633].
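As a quick sanity check of this gradient, here is a standalone sketch with random data; the (0,1) entry of the analytic gradient is compared against a central finite difference.

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
Omega = A @ A.T + 4 * np.eye(4)      # random positive definite precision matrix
S = np.cov(rng.standard_normal((100, 4)), rowvar=False)

def loglik(Om):
    return np.linalg.slogdet(Om)[1] - np.trace(S @ Om)

G = np.linalg.inv(Omega) - S         # analytic gradient, Equation (31.45)
E = np.zeros((4, 4)); E[0, 1] = 1e-6 # perturb a single entry
print(G[0, 1], (loglik(Omega + E) - loglik(Omega - E)) / 2e-6)  # should match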
Interestingly, one can show that the MLE must satisfy the following property: Σst = Sst if Gst = 1 or
s = t, i.e., the covariance of a pair that are connected by an edge must match the empirical covariance.
In addition, we have Ωst = 0 if Gst = 0, by definition of a GGM, i.e., the precision of a pair that are not
connected must be 0. We say that Σ is a positive definite matrix completion of S, since it retains as many
of the entries in S as possible, corresponding to the edges in the graph, subject to the required sparsity
pattern on Σ−1 , corresponding to the absent edges; the remaining entries in Σ are filled in so as to maximize
the likelihood.

Let us consider a worked example from [HTF09, p652]. We will use the following adjacency matrix,
representing the cyclic structure X1 − X2 − X3 − X4 − X1, and the following empirical covariance matrix:

        ⎡0 1 0 1⎤          ⎡10  1  5  4⎤
    G = ⎢1 0 1 0⎥,     S = ⎢ 1 10  2  6⎥    (31.46)
        ⎢0 1 0 1⎥          ⎢ 5  2 10  3⎥
        ⎣1 0 1 0⎦          ⎣ 4  6  3 10⎦

The MLE is given by

        ⎡10.00  1.00  1.31  4.00⎤          ⎡ 0.12 −0.01  0.00 −0.05⎤
    Σ = ⎢ 1.00 10.00  2.00  0.87⎥,     Ω = ⎢−0.01  0.11 −0.02  0.00⎥    (31.47)
        ⎢ 1.31  2.00 10.00  3.00⎥          ⎢ 0.00 −0.02  0.11 −0.03⎥
        ⎣ 4.00  0.87  3.00 10.00⎦          ⎣−0.05  0.00 −0.03  0.13⎦

(See ggmFitDemo.py for the code to reproduce these numbers, using the coordinate descent algorithm from
[FHT08].) The constrained elements in Ω and the free elements in Σ both correspond to the absent edges
(1,3) and (2,4): Ω₁₃ = Ω₂₄ = 0 by definition, while Σ₁₃ = 1.31 and Σ₂₄ = 0.87 are filled in so as to maximize
the likelihood.

31.3.2.2 Promoting sparsity


We now discuss one way to learn a sparse Gaussian MRF structure, which exploits the fact that there is a 1:1
correspondence between zeros in the precision matrix and absent edges in the graph. This suggests that we
can learn a sparse graph structure by using an objective that encourages zeros in the precision matrix. By
analogy to lasso (see ??), one can define the following ℓ1 penalized NLL:

J(Ω) = − log det Ω + tr(SΩ) + λ||Ω||₁    (31.48)

where ||Ω||₁ = Σ_{j,k} |ω_{jk}| is the 1-norm of the matrix. This is called the graphical lasso or Glasso.
Although the objective is convex, it is non-smooth (because of the non-differentiable ℓ1 penalty) and
is constrained (because Ω must be a positive definite matrix). Several algorithms have been proposed for
optimizing this objective [YL07; BGd08; DGK08], although arguably the simplest is the one in [FHT08],
which uses a coordinate descent algorithm similar to the shooting algorithm for lasso. An even faster method,
based on soft thresholding, is described in [FS18; FZS18].
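For readers who want to experiment, scikit-learn ships an implementation of this objective; the following sketch (with placeholder data and penalty strength) recovers a sparse precision matrix and reads off the graph:

import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))          # placeholder data, N=100, D=5

model = GraphicalLasso(alpha=0.1).fit(X)   # alpha plays the role of lambda
Omega = model.precision_                   # estimated sparse precision matrix
adjacency = (np.abs(Omega) > 1e-6) & ~np.eye(5, dtype=bool)  # graph structure
print(adjacency)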
As an example, let us apply the method to the flow cytometry dataset from [Sac+05]. A discretized
version of the data is shown in Figure 31.19(a). Here we use the original continuous data. However, we are
ignoring the fact that the data was sampled under intervention. In ??, we illustrate the graph structures that
are learned as we sweep λ from 0 to a large value. These represent a range of plausible hypotheses about the
connectivity of these proteins.
It is worth comparing this with the DAG that was learned in Figure 31.19(b). The DAG has the advantage
that it can easily model the interventional nature of the data, but the disadvantage that it cannot model the
feedback loops that are known to exist in this biological pathway (see the discussion in [SM09]). Note that
the fact that we show many UGMs and only one DAG is incidental: we could easily use BIC to pick the
“best” UGM, and conversely, we could easily display several DAG structures, sampled from the posterior.

31.3.3 Graphical lasso for discrete MRFs/CRFs


It is possible to extend the graphical lasso idea to the discrete MRF and CRF case. However, now there is a
set of parameters associated with each edge in the graph, so we have to use the graph analog of group lasso
(see ??). For example, consider a pairwise CRF with ternary nodes, and node and edge potentials given by
ψ_t(y_t, x) = [v_{t1}ᵀx, v_{t2}ᵀx, v_{t3}ᵀx]ᵀ,    ψ_st(y_s, y_t, x) = [w_{st,jk}ᵀx]_{j,k=1,2,3}    (31.49)

where we assume x begins with a constant 1 term, to account for the offset. (If x only contains 1, the CRF
reduces to an MRF.) Note that we may choose to set some of the v_{tk} and w_{st,jk} weights to 0, to ensure
identifiability, although this can also be taken care of by the prior.
To learn sparse structure, we can minimize the following objective:
J = − Σ_{i=1}^{N} [ Σ_t log ψ_t(y_{it}, x_i, v_t) + Σ_{s=1}^{V} Σ_{t=s+1}^{V} log ψ_st(y_{is}, y_{it}, x_i, w_st) ]
      + λ₁ Σ_{s=1}^{V} Σ_{t=s+1}^{V} ||w_st||_p + λ₂ Σ_{t=1}^{V} ||v_t||₂²    (31.50)

where ||w_st||_p is the p-norm; common choices are p = 2 or p = ∞, as explained in ??. This method of CRF
structure learning was first suggested in [Sch+08]. (The use of ℓ1 regularization for learning the structure of
binary MRFs was proposed in [LGK06].)
Although this objective is convex, it can be costly to evaluate, since we need to perform inference to
compute its gradient, as explained in ?? (this is true also for MRFs), due to the global partition function. We
should therefore use an optimizer that does not make too many calls to the objective function or its gradient,
such as the projected quasi-Newton method in [Sch+09]. In addition, we can use approximate inference, such
as loopy belief propagation (??), to compute an approximate objective and gradient more quickly, although
this is not necessarily theoretically sound.
Another approach is to apply the group lasso penalty to the pseudo-likelihood discussed in ??. This is
much faster, since inference is no longer required [HT09]. Figure 31.16 shows the result of applying this
procedure to the 20-newsgroup data, where yit indicates the presence of word t in document i, and xi = 1
(so the model is an MRF).
For a more recent approach to learning sparse discrete PGM-U structures, based on sparse full conditionals,
see the GRISE (Generalized Regularized Interaction Screening Estimator) method of [VML19], which takes
polynomial time, yet its sample complexity is close to the information-theoretic lower bounds [Lok+18].

31.3.4 Bayesian inference for undirected graph structures


Although the graphical lasso is reasonably fast, it only gives a point estimate of the structure. Furthermore,
it is not model-selection consistent [Mei05], meaning it cannot recover the true graph even as N → ∞. It
would be preferable to integrate out the parameters, and perform posterior inference in the space of graphs,
i.e., to compute p(G|D). We can then extract summaries of the posterior, such as posterior edge marginals,
p(Gij = 1|D), just as we did for DAGs. In this section, we discuss how to do this.
If the graph is decomposable, and if we use conjugate priors, we can compute the marginal likelihood in
closed form [DL93]. Furthermore, we can efficiently identify the decomposable neighbors of a graph [TG09],
i.e., the set of legal edge additions and removals. This means that we can perform relatively efficient stochastic
local search to approximate the posterior (see e.g. [GG99; Arm+08; SC08]).
However, the restriction to decomposable graphs is rather limiting if one’s goal is knowledge discovery,
since the number of decomposable graphs is much less than the number of general undirected graphs.⁴
A few authors have looked at Bayesian inference for GGM structure in the non-decomposable case
(e.g., [DGR03; WCK03; Jon+05]), but such methods cannot scale to large models because they use an
expensive Monte Carlo approximation to the marginal likelihood [AKM05]. [LD08] suggested using a Laplace
approximation. This requires computing the MAP estimate of the parameters for Ω under a G-Wishart prior
[Rov02]. In [LD08], they used the iterative proportional scaling algorithm [SK86; HT08] to find the mode.
However, this is very slow, since it requires knowing the maximal cliques of the graph, which is NP-hard in
general.
⁴ The number of decomposable graphs on V nodes, for V = 2, . . . , 8, is as follows ([Arm05, p158]): 2; 8; 61; 822; 18,154; 617,675; 30,888,596. If we divide these numbers by the number of undirected graphs, which is 2^{V(V−1)/2}, we find the ratios are: 1, 1, 0.95, 0.8, 0.55, 0.29, 0.12. So we see that decomposable graphs form a vanishing fraction of the total hypothesis space.

Figure 31.16: An MRF estimated from the 20-newsgroup data using group ℓ1 regularization with λ = 256. Isolated
nodes are not plotted. From Figure 5.9 of [Sch10a]. Used with kind permission of Mark Schmidt.

In [Mog+09], a much faster method is proposed. In particular, they modify the gradient-based methods
from Section 31.3.2.1 to find the MAP estimate; these algorithms do not need to know the cliques of the
graph. A further speedup is obtained by just using a diagonal Laplace approximation, which is more accurate
than BIC, but has essentially the same cost. This, plus the lack of restriction to decomposable graphs,
enables fairly fast stochastic search methods to be used to approximate p(G|D) and its mode. This approach
significantly outperformed graphical lasso, both in terms of predictive accuracy and structural recovery, for a
comparable computational cost.

31.4 Learning causal DAGs


Causal reasoning (which we discuss in more detail in ??) relies on knowing the underlying structure of
the DAG (although [JZB19] shows how to answer some queries if we just know the graph up to Markov
equivalence). Learning this structure is called causal discovery (see e.g., [GZS19]).
If we just have two variables, we need to know if the causal model should be written as X → Y or X ← Y.
Both of these models are Markov equivalent (Section 31.2.2), meaning they cannot be distinguished from
observational data, yet they make very different causal predictions. We discuss how to learn cause-effect pairs
in Section 31.4.1.
When we have more than 2 variables, we need to consider more general techniques. In Section 31.2, we
discuss how to learn a DAG structure from observational data using likelihood based methods, and hypothesis
testing methods. However, these approaches cannot distinguish between models that are Markov equivalent,
so we need to perform interventions to reduce the size of the equivalence class [Sol19]. We discuss some
suitable methods in Section 31.4.2.
The above techniques assume that the causal variables of interest (e.g., cancer rates, smoking rates) can be
measured directly. However, in many ML problems, the data is much more “low level”. For example, consider
trying to learn a causal model of the world from raw pixels. We briefly discuss this topic in Section 31.4.3.
For more details on causal discovery methods, see e.g., [Ebe17; PJS17; HDMM18; Guo+21].

31.4.1 Learning cause-effect pairs
If we only observe a pair of variables, we cannot use methods discussed in Section 31.2 to learn graph
structure, since such methods are based on conditional independence tests, which need at least 3 variables.
However, intuitively, we should still be able to learn causal relationships in this case. For example, we
know that altitude X causes temperature Y and not vice versa. Suppose we measure X and
Y in two different countries, say the Netherlands (low altitude) and Switzerland (high altitude). If we
represent the joint distribution as p(X, Y ) = p(X)p(Y |X), we find that the p(Y |X) distribution is stable
across the two populations, while p(X) will change. However, if we represent the joint distribution as
p(X, Y ) = p(Y )p(X|Y ), we find that both p(Y ) and p(X|Y ) need to change across populations, so both of
the corresponding distributions will be more “complicated” to capture this non-stationarity in the data. In
this section, we discuss some approaches that exploit this idea. Our presentation is based on [PJS17]. (See
[Moo+16] for more details.)

31.4.1.1 Algorithmic information theory


Suppose X ∈ {0, 1} and Y ∈ R and we represent the joint p(x, y) using

p(x, y) = p(x)p(y|x) = Ber(x|θ)N (y|µx , 1) (31.51)

We can equally well write this in the following form [Daw02, p165]:

p(x, y) = p(y)p(x|y) = [θN (y|µ1 , 1) + (1 − θ)N (y|µ2 , 1)]Ber(x|σ(α + βy)) (31.52)

where α = logit(θ) + (µ₂² − µ₁²)/2 and β = µ₁ − µ₂. We can plausibly argue that the first model, which corresponds
to X → Y, is more likely to be correct, since it consists of two simple distributions that seem to be
rather generic. By contrast, in Equation (31.52), the distribution of p(Y) is more complex, and seems to be
dependent on the specific form of p(X|Y).
[JS10] show how to formalize this intuition using algorithmic information theory. In particular,
they say that X causes Y if the distributions PX and PY |X (not the random variables X and Y ) are
algorithmically independent. To define this, let P_X be the distribution induced by f_X(U_X), where
UX is a bit string, and fX is represented by a Turing machine. Define PY |X analogously. Finally, let K(s)
be the Kolmogorov complexity of bit string s, i.e., the length of the shortest program that would generate
s using a universal Turing machine. We say that PX and PY |X are algorithmically independent if

K(PX,Y ) = K(PX ) + K(PY |X ) (31.53)

Unfortunately, there is no algorithm to compute the Kolmogorov complexity, so this approach is purely
conceptual. In the sections below, we discuss some more practical metrics.

31.4.1.2 Additive noise models


A generic two-variable SCM of the form X → Y requires specifying the function X = fX (UX ), the distribution
of UX , the function Y = fY (X, UY ), and the distribution of UY . We can simplify our notation by letting
X = U_X and defining p(X) directly, and defining Y = f_Y(X, U_Y) = f(X, U), where U is a noise term.
In general, such a model is not identifiable from a finite dataset. For example, we can imagine that the
value of U can be used to select between different functional mappings, Y = f (X, U = u) = fu (X). Since U
is not observed, the induced distribution will be a mixture of different mappings, and it will generally be
impossible to disentangle. For example, consider the case where X is Bernoulli and U is a fair coin,
and U selects between the identity function Y = f_id(X) = X and the negation function Y = f_neg(X) = 1 − X. In this case,
the induced distribution p(Y) is uniform, independent of X, even though we have the structure X → Y.
The above concerns motivate the desire to restrict the flexibility of the functions at each node. One natural
family is additive noise models (ANM), where we assume each variable has the following dependence on
its parents [Hoy+09]:
X_i = f_i(X_{pa_i}) + U_i    (31.54)


Figure 31.17: Signature of X causing Y . Left: If we try to predict Y from X, the residual error (noise term, shown by
vertical arrows) is independent of X. Right: If we try to predict X from Y , the residual error is not constant. From
Figure 8.8 of [Var21]. Used with kind permission of Kush Varshney.

In the case of two variables, we have Y = f (X) + U . If X and U are both Gaussian, and f is linear, the
system defines a jointly Gaussian distribution p(X, Y ), as we discussed in ??. This is symmetric, and prevents
us distinguishing X → Y from Y → X. However, if we let f be nonlinear, and/or let X or U be non-Gaussian,
we can distinguish X → Y from Y → X, as we discuss below.

31.4.1.3 Nonlinear additive noise models

Suppose pY |X is an additive noise model (possibly Gaussian noise) where f is a nonlinear function. In this
case, we will not, in general, be able to create an ANM for pX|Y . Thus we can determine whether X → Y
or vice versa as follows: we fit a (nonlinear) regression model for X → Y , and then check if the residual
error Y − fˆY (X) is independent of X; we then repeat the procedure swapping the roles of X and Y . The
theory [PJS17] says that the independence test will only pass for the causal direction. See Figure 31.17 for an
illustration.
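Here is a hedged sketch of this bivariate test for 1-D arrays x and y; the random-forest regressor and the Spearman-correlation surrogate for the residual-independence test are simplifications (the theory calls for a proper independence test such as HSIC):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from scipy.stats import spearmanr

def anm_direction(x, y):
    # Return 'X->Y' or 'Y->X' by comparing residual dependence both ways.
    def residual_dependence(a, b):
        # Regress b on a, then measure dependence of the residuals on a.
        f = RandomForestRegressor(n_estimators=200, random_state=0)
        f.fit(a.reshape(-1, 1), b)
        resid = b - f.predict(a.reshape(-1, 1))
        return abs(spearmanr(a, resid)[0])
    # The causal direction should have the *less* dependent residuals.
    if residual_dependence(x, y) < residual_dependence(y, x):
        return "X->Y"
    return "Y->X"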

31.4.1.4 Linear models with non-Gaussian noise

If the function mapping from X to Y is linear, we cannot tell if X → Y or Y → X if we assume Gaussian
noise. This is apparent from the symmetry of Figure 31.17 in the linear case. However, by combining linear
models with non-Gaussian noise, we can recover identifiability.
For example, consider the ICA model from ??. This is a simple linear model of the form y = Ax,
where p(x) has a non-Gaussian distribution, and p(y|x) is a degenerate distribution, since we assume the
observation model is deterministic (noise-free). In the ICA case, we can uniquely identify the parameters A
and the corresponding latent source x. This lets us distinguish the X → Y model from the Y → X model.
(The intuition behind this method is that linear combinations of random variables tend towards a Gaussian
distribution (by the central limit theorem), so if X → Y , then p(Y ) will “look more Gaussian” than p(X).)
Another setting in which we can distinguish the direction of the arrow is when we have non-Gaussian
observation noise, i.e., y = Ax + UY , where UY is non-Gaussian. This is an example of a “linear non-Gaussian
acyclic model” (LiNGAM) [Shi+06]. The non-Gaussian additive noise results in the induced distributions
p(X, Y ) being different depending on whether X → Y or Y → X.

31.4.1.5 Information-geometric causal inference

An alternative approach, known as information-geometric causal inference, or IGCI, was proposed in
[Dan+10; Jan+12]. In this method, we assume f is a deterministic strictly monotonic function on [0, 1], with
f (0) = 0 and f (1) = 1, and there is no observation noise, so Y = f (X). If X has the distribution p(X),
then the shape of the induced distribution p(Y ) will depend on the form of the function f , as illustrated in
Figure 31.18. Intuitively, the peaks of p(Y ) will occur in regions where f has small slope, and thus f −1 has

Figure 31.18: Illustration of information-geometric causal inference for Y = f (X). The density of the effect p(Y )
tends to be high in regions where f is flat (and hence f −1 is steep). From Figure 4 of [Jan+12].

large slope. Thus pY (Y ) and f −1 (Y ) will depend on each other, whereas pX (X) and f (X) do not (since we
assume the distribution of causes is independent of the causal mechanism).
More precisely, let the functions log f′ (the log of the derivative of f) and p_X be viewed as random
variables on the probability space [0, 1] with a uniform distribution. We say p_{X,Y} satisfies an IGCI model if f
is a mapping as above, and the following independence criterion holds: Cov[log f′, p_X] = 0, where

Cov[log f′, p_X] = ∫₀¹ log f′(x) p_X(x) dx − ∫₀¹ log f′(x) dx ∫₀¹ p_X(x) dx    (31.55)

and ∫₀¹ p_X(x) dx = 1. One can show that the inverse function f⁻¹ satisfies Cov[log (f⁻¹)′, p_Y] ≥ 0, with
equality iff f is linear.
This can be turned into an empirical test as follows. Define

C_{X→Y} = ∫₀¹ log f′(x) p(x) dx ≈ (1/(N−1)) Σ_{j=1}^{N−1} log( |y_{j+1} − y_j| / |x_{j+1} − x_j| )    (31.56)

where x_1 < x_2 < · · · < x_N are the observed x-values in increasing order. The quantity C_{Y→X} is defined analogously.
We then choose X → Y as the model whenever Ĉ_{X→Y} < Ĉ_{Y→X}. This is called the slope-based approach
to IGCI.
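A minimal NumPy sketch of the slope-based estimator (assuming x and y have already been rescaled to [0, 1], as the method requires; the eps guard against zero gaps is an added safeguard):

import numpy as np

def igci_slope(x, y, eps=1e-12):
    # Estimate C_{X->Y} by the empirical slope formula (31.56).
    idx = np.argsort(x)                 # sort pairs by x
    xs, ys = x[idx], y[idx]
    dx = np.abs(np.diff(xs)) + eps      # |x_{j+1} - x_j|
    dy = np.abs(np.diff(ys)) + eps      # |y_{j+1} - y_j|
    return np.mean(np.log(dy / dx))

def igci_direction(x, y):
    # Prefer the direction with the smaller slope score.
    return "X->Y" if igci_slope(x, y) < igci_slope(y, x) else "Y->X"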
One can also show that an IGCI model satisfies the property that H(Y) ≤ H(X), where H() is the
differential entropy. Intuitively, the reason is that applying a nonlinear function f to p_X can introduce
additional irregularities, thus making p_Y less uniform than p_X. This is illustrated in Figure 31.18. We can
then choose between X → Y and X ← Y based on the difference in estimated entropies.
An empirical comparison of the slope-based and entropy-based approaches to IGCI can be found in
[Moo+16].

31.4.2 Learning causal DAGs from interventional data


In Section 31.2, we discuss how to learn a DAG structure from observational data, using either likelihood-based
(Bayesian) methods of model selection, or constraint-based (frequentist) methods. (See [Tu+19] for a recent
empirical comparison of such methods applied to a medical simulator.) However, such approaches cannot
distinguish between models that are Markov equivalent, and thus the output may not be sufficient to answer
all causal queries of interest.
To distinguish DAGs within the same Markov equivalence class, we can use interventional data, where
certain variables have been set, and the consequences have been measured. In particular, we can modify the
standard likelihood-based DAG learning method discussed in Section 31.2 to take into account the fact that

Figure 31.19: (a) A design matrix consisting of 5400 data points (rows) measuring the status (using flow cytometry) of
11 proteins (columns) under different experimental conditions. The data has been discretized into 3 states: low (black),
medium (grey) and high (white). Some proteins were explicitly controlled using activating or inhibiting chemicals. (b)
A directed graphical model representing dependencies between various proteins (blue circles) and various experimental
interventions (pink ovals), which was inferred from this data. We plot all edges for which p(Gst = 1|D) > 0.5. Dotted
edges are believed to exist in nature but were not discovered by the algorithm (1 false negative). Solid edges are true
positives. The light colored edges represent the effects of intervention. From Figure 6d of [EM07].

the data generating mechanism has been changed. For example, if θ_{ijk} = p(X_i = j|X_{pa(i)} = k) is a CPT
for node i, then when we compute the sufficient statistics N_{ijk} = Σ_n I(X_{ni} = j, X_{n,pa(i)} = k), we exclude
cases n where X_i was set externally by intervention, rather than sampled from θ_{ijk}. This technique was first
proposed in [CY99], and corresponds to Bayesian parameter inference from a set of mutilated models with
shared parameters.
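A sketch of this modified counting step is below; data is an N×D integer array and interventions a same-shaped boolean mask marking externally set entries (both names are placeholders).

import numpy as np
from collections import Counter

def sufficient_stats(data, interventions, i, parents):
    # Count N_ijk over cases where node i was *not* intervened on.
    counts = Counter()
    for n in range(data.shape[0]):
        if interventions[n, i]:
            continue  # X_i was set by intervention; exclude this case
        j = data[n, i]
        k = tuple(data[n, p] for p in parents)  # parent configuration
        counts[(j, k)] += 1
    return counts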
The preceding method assumes that we use perfect interventions, where we deterministically set a
variable to a chosen value. In reality, experimenters can rarely control the state of individual variables.
Instead, they can perform actions which may affect many variables at the same time. (This is sometimes
called a “fat hand intervention”, by analogy to an experiment where someone tries to change a single
component of some system (e.g., an electronic circuit), but accidentally touches multiple components, thereby
causing various side effects.) We can model this by adding the intervention nodes to the DAG (??), and
then learning a larger augmented DAG structure, with the constraint that there are no edges between the
intervention nodes, and no edges from the “regular” nodes back to the intervention nodes.
For example, suppose we perturb various proteins in a cellular signalling pathway, and measure the
resulting phosphorylation status using a technique such as flow cytometry, as in [Sac+05]. An example of
such a dataset is shown in Figure 31.19(a). Figure 31.19(b) shows the augmented DAG that was learned from
the interventional flow cytometry data depicted in Figure 31.19(a). In particular, we plot the median graph,
which includes all edges for which p(Gij = 1|D) > 0.5. These were computed using the exact algorithm of
[Koi06]. See [EM07] for details.
Since interventional data can help to uniquely identify the DAG, it is natural to try to choose the optimal
set of interventions so as to discover the graph structure with as little data as possible. This is a form of
active learning or experiment design, and is similar to what scientists do. See e.g., [Mur01; HG09; KB14;
HB14; Mue+17] for some approaches to this problem.

31.4.3 Learning from low-level inputs


In many problems, the available data is quite “low level”, such as pixels in an image, and is believed to be
generated by some “higher level” latent causal factors, such as objects interacting in a scene. Learning causal
models of this type is known as causal representation learning, and combines the causal discovery methods
discussed in this section with techniques from latent variable modeling (e.g., VAEs, ??) and representation

learning (??). For more details, see e.g., [CEP17; Sch+21].

Chapter 32

Non-parametric Bayesian models

Chapter 33

Representation learning

Chapter 34

Interpretability

Part VI

Decision making

Chapter 35

Multi-step decision problems

Chapter 36

Reinforcement learning

Bibliography

[AD19a] H. Asi and J. C. Duchi. “Modeling simple structures and geometry for better stochastic
optimization algorithms”. In: AISTATS. 2019.
[AD19b] H. Asi and J. C. Duchi. “Stochastic (Approximate) Proximal Point Methods: Convergence,
Optimality, and Adaptivity”. In: SIAM J. Optim. (2019).
[AD19c] H. Asi and J. C. Duchi. “The importance of better models in stochastic optimization”. en. In:
PNAS 116.46 (Nov. 2019), pp. 22924–22930.
[AEM18] Ö. D. Akyildiz, V. Elvira, and J. Miguez. “The Incremental Proximal Method: A Probabilistic
Perspective”. In: ICASSP. 2018.
[AHK12] A. Anandkumar, D. Hsu, and S. Kakade. “A method of moments for mixture models and hidden
Markov models”. In: COLT. 2012.
[AKM05] A. Atay-Kayis and H. Massam. “A Monte Carlo method for computing the marginal likelihood
in nondecomposable Gaussian graphical models”. In: Biometrika 92 (2005), pp. 317–335.
[Aky+19] Ö. D. Akyildiz, É. Chouzenoux, V. Elvira, and J. Míguez. “A probabilistic incremental proximal
gradient method”. In: IEEE Signal Process. Lett. 26.8 (2019).
[AKZK19] B. Amos, V. Koltun, and J Zico Kolter. “The Limited Multi-Label Projection Layer”. In: (June
2019). arXiv: 1906.08707 [cs.LG].
[ALK06] C. Albers, M. Leisink, and H. Kappen. “The Cluster Variation Method for Efficient Linkage
Analysis on Extended Pedigrees”. In: BMC Bioinformatics 7 (2006).
[AMR09] E. S. Allman, C. Matias, and J. A. Rhodes. “Identifiability of parameters in latent structure
models with many observed variables”. en. In: Ann. Stat. 37.6A (Dec. 2009), pp. 3099–3132.
[Ana+14] A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. “Tensor Decompositions for
Learning Latent Variable Models”. In: JMLR 15 (2014), pp. 2773–2832.
[AQ+20] M. A. A. Al-Qaness, A. A. Ewees, H. Fan, and M. Abd El Aziz. “Optimization Method for
Forecasting Confirmed Cases of COVID-19 in China”. en. In: J. Clinical Medicine 9.3 (Mar.
2020).
[Arm05] H. Armstrong. “Bayesian estimation of decomposable Gaussian graphical models”. PhD thesis.
UNSW, 2005.
[Arm+08] H. Armstrong, C. Carter, K. Wong, and R. Kohn. “Bayesian Covariance Matrix Estimation using a Mixture of Decomposable Graphical Models”. In: Statistics and Computing (2008).
[Aro+13] S. Arora et al. “A Practical Algorithm for Topic Modeling with Provable Guarantees”. In: ICML.
2013.
[Aro+16] S. Arora, R. Ge, T. Ma, and A. Risteski. “Provable learning of Noisy-or Networks”. In: (2016).
arXiv: 1612.08795 [cs.LG].

[AS19] B. G. Anderson and S. Sojoudi. “Global Optimality Guarantees for Nonconvex Unsupervised
Video Segmentation”. In: 57th Annual Allerton Conference on Communication, Control, and
Computing (2019).
[AY19] B. Amos and D. Yarats. “The Differentiable Cross-Entropy Method”. In: (Sept. 2019). arXiv:
1909.12830 [cs.LG].
[Bac+15] S. H. Bach, M. Broecheler, B. Huang, and L. Getoor. “Hinge-Loss Markov Random Fields and
Probabilistic Soft Logic”. In: (2015). arXiv: 1505.04406 [cs.LG].
[Bal17] S. Baluja. “Learning deep models of optimization landscapes”. In: IEEE Symposium Series on
Computational Intelligence (SSCI) (2017).
[BB12] J. Bergstra and Y. Bengio. “Random Search for Hyper-Parameter Optimization”. In: JMLR 13
(2012), pp. 281–305.
[BBZ17] T. Bartz-Beielstein and M. Zaefferer. “Model-based Methods for Continuous and Discrete Global
Optimization”. In: Appl. Soft Comput. 55.C (June 2017), pp. 154–167.
[BC95] S. Baluja and R. Caruana. “Removing the Genetics from the Standard Genetic Algorithm”. In:
ICML. 1995, pp. 38–46.
[BD03] S. G. Bottcher and C. Dethlefsen. “deal: A Package for Learning Bayesian Networks”. In: J. of
Statistical Software 8.20 (2003).
[BD97] S. Baluja and S. Davies. “Using Optimal Dependency-Trees for Combinatorial Optimization:
Learning the Structure of the Search Space”. In: ICML. 1997.
[Ber15] D. P. Bertsekas. “Incremental Gradient, Subgradient, and Proximal Methods for Convex Opti-
mization: A Survey”. In: (July 2015). arXiv: 1507.01030 [cs.SY].
[BF02] Y. Barash and N. Friedman. “Context-specific Bayesian clustering for gene expression data”. In:
J. Comp. Bio. 9 (2002), pp. 169–191.
[BG06] M. Beal and Z. Ghahramani. “Variational Bayesian Learning of Directed Graphical Models with
Hidden Variables”. In: Bayesian Analysis 1.4 (2006).
[BGd08] O. Banerjee, L. E. Ghaoui, and A. d’Aspremont. “Model selection through sparse maximum
likelihood estimation for multivariate Gaussian or binary data”. In: JMLR 9 (2008), pp. 485–516.
[BGHM17] J. Boyd-Graber, Y. Hu, and D. Mimno. “Applications of Topic Models”. In: Foundations and
Trends® in Information Retrieval 11.2-3 (2017), pp. 143–296.
[BHO75] P. J. Bickel, E. A. Hammel, and J. W. O’Connell. “Sex bias in graduate admissions: data from Berkeley”. en. In: Science 187.4175 (Feb. 1975), pp. 398–404.
[Bis06] C. Bishop. Pattern recognition and machine learning. Springer, 2006.
[BJV97] J. S. De Bonet, C. L. Isbell Jr., and P. A. Viola. “MIMIC: Finding Optima by Estimating Probability Densities”. In: NIPS. MIT Press, 1997, pp. 424–430.
[BK04] Y. Boykov and V. Kolmogorov. “An experimental comparison of min-cut/max-flow algorithms
for energy minimization in vision”. en. In: IEEE PAMI 26.9 (Sept. 2004), pp. 1124–1137.
[BK10] R. Bardenet and B. Kegl. “Surrogating the surrogate: accelerating Gaussian-process-based global
optimization with a mixture cross-entropy algorithm”. In: ICML. 2010.
[BKR11] A. Blake, P. Kohli, and C. Rother, eds. Advances in Markov Random Fields for Vision and
Image Processing. MIT Press, 2011.
[BL06a] D. Blei and J. Lafferty. “Dynamic topic models”. In: ICML. 2006, pp. 113–120.
[BL06b] K. Bryan and T. Leise. “The $25,000,000,000 Eigenvector: The Linear Algebra behind Google”.
In: SIAM Review 48.3 (2006).
[BL07] D. Blei and J. Lafferty. “A Correlated Topic Model of "Science"”. In: Annals of Applied Stat.
1.1 (2007), pp. 17–35.

[Ble12] D. M. Blei. “Probabilistic topic models”. In: Commun. ACM 55.4 (2012), pp. 77–84.
[BNJ03] D. Blei, A. Ng, and M. Jordan. “Latent Dirichlet allocation”. In: JMLR 3 (2003), pp. 993–1022.
[Boe+05] P.-T. de Boer, D. P. Kroese, S. Mannor, and R. Y. Rubinstein. “A Tutorial on the Cross-Entropy
Method”. en. In: Ann. Oper. Res. 134.1 (Feb. 2005), pp. 19–67.
[Boh92] D. Böhning. “Multinomial logistic regression algorithm”. In: Annals of the Inst. of Statistical Math. 44 (1992), pp. 197–200.
[Bou+17] T. Bouwmans, A. Sobral, S. Javed, S. K. Jung, and E.-H. Zahzah. “Decomposition into low-
rank plus additive matrices for background/foreground separation: A review for a comparative
evaluation with a large-scale dataset”. In: Computer Science Review 23 (Feb. 2017), pp. 1–71.
[Bro+20] D. Brookes, A. Busia, C. Fannjiang, K. Murphy, and J. Listgarten. “A view of estimation of
distribution algorithms through the lens of expectation-maximization”. In: GECCO. GECCO
’20. Cancún, Mexico: Association for Computing Machinery, July 2020, pp. 189–190.
[BT03] A. Beck and M. Teboulle. “Mirror descent and nonlinear projected subgradient methods for convex optimization”. In: Operations Research Letters 31.3 (2003), pp. 167–175.
[BT09] A Beck and M Teboulle. “A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse
Problems”. In: SIAM J. Imaging Sci. 2.1 (Jan. 2009), pp. 183–202.
[BVZ01] Y. Boykov, O. Veksler, and R. Zabih. “Fast Approximate Energy Minimization via Graph Cuts”.
In: IEEE PAMI 23.11 (2001).
[BWL19] Y. Bai, Y.-X. Wang, and E. Liberty. “ProxQuant: Quantized Neural Networks via Proximal
Operators”. In: ICLR. 2019.
[Can+11] E. J. Candes, X. Li, Y. Ma, and J. Wright. “Robust Principal Component Analysis?” In: JACM
58.3 (June 2011), 11:1–11:37.
[CEP17] K. Chalupka, F. Eberhardt, and P. Perona. “Causal feature learning: an overview”. In: Behav-
iormetrika 44.1 (Jan. 2017), pp. 137–164.
[CGJ17] Y. Cherapanamjeri, K. Gupta, and P. Jain. “Nearly-optimal Robust Matrix Completion”. In:
ICML. 2017.
[CH92] G. Cooper and E. Herskovits. “A Bayesian method for the induction of probabilistic networks
from data”. In: Machine Learning 9 (1992), pp. 309–347.
[CH97] D. Chickering and D. Heckerman. “Efficient approximations for the marginal likelihood of
incomplete data given a Bayesian network”. In: Machine Learning 29 (1997), pp. 181–212.
[Chi02] D. M. Chickering. “Optimal structure identification with greedy search”. In: Journal of Machine
Learning Research 3 (2002), pp. 507–554.
[Chi96] D. Chickering. “Learning Bayesian networks is NP-Complete”. In: AI/Stats V. 1996.
[CHM97] D. M. Chickering, D. Heckerman, and C. Meek. “A Bayesian Approach to Learning Bayesian
Networks with Local Structure”. In: UAI. UAI’97. San Francisco, CA, USA, 1997, pp. 80–89.
[Cho+11] M. Choi, V. Tan, A. Anandkumar, and A. Willsky. “Learning Latent Tree Graphical Models”.
In: JMLR (2011).
[CL68] C. K. Chow and C. N. Liu. “Approximating discrete probability distributions with dependence
trees”. In: IEEE Trans. on Info. Theory 14 (1968), pp. 462–67.
[CLR90] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. An Introduction to Algorithms. MIT Press,
1990.
[CS96] P. Cheeseman and J. Stutz. “Bayesian Classification (AutoClass): Theory and Results”. In: Advances in Knowledge Discovery and Data Mining. Ed. by Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy. MIT Press, 1996.

[CSF16] A. W. Churchill, S. Sigtia, and C. Fernando. “Learning to Generate Genotypes with Neural
Networks”. In: (Apr. 2016). arXiv: 1604.04153 [cs.NE].
[CY99] G. Cooper and C. Yoo. “Causal Discovery from a Mixture of Experimental and Observational
Data”. In: UAI. 1999.
[Dan+10] P. Daniusis et al. “Inferring deterministic causal relations”. In: UAI. 2010.
[Daw02] A. P. Dawid. “Influence diagrams for causal modelling and inference”. In: Intl. Stat. Review 70
(2002). Corrections p437, pp. 161–189.
[DDDM04] I. Daubechies, M. Defrise, and C. De Mol. “An iterative thresholding algorithm for linear inverse problems with a sparsity constraint”. In: Commun. Pure Appl. Math. 57.11 (Nov. 2004), pp. 1413–1457.
[DDL97] S. DellaPietra, V. DellaPietra, and J. Lafferty. “Inducing features of random fields”. In: IEEE
PAMI 19.4 (1997).
[Dem72] A. Dempster. “Covariance selection”. In: Biometrics 28.1 (1972).
[DGK08] J. Duchi, S. Gould, and D. Koller. “Projected Subgradient Methods for Learning Sparse
Gaussians”. In: UAI. 2008.
[DGR03] P. Dellaportas, P. Giudici, and G. Roberts. “Bayesian inference for nondecomposable graphical
Gaussian models”. In: Sankhya, Ser. A 65 (2003), pp. 43–55.
[DHS11] J. Duchi, E. Hazan, and Y. Singer. “Adaptive Subgradient Methods for Online Learning and
Stochastic Optimization”. In: JMLR 12 (2011), pp. 2121–2159.
[Die+17] A. B. Dieng, C. Wang, J. Gao, and J. Paisley. “TopicRNN: A Recurrent Neural Network with
Long-Range Semantic Dependency”. In: ICLR. 2017.
[DL09] P. Domingos and D. Lowd. Markov Logic: An Interface Layer for AI. Morgan & Claypool, 2009.
[DL93] A. P. Dawid and S. L. Lauritzen. “Hyper-Markov laws in the statistical analysis of decomposable
graphical models”. In: The Annals of Statistics 3 (1993), pp. 1272–1317.
[Dob09] A. Dobra. Dependency networks for genome-wide data. Tech. rep. U. Washington, 2009.
[Dom+06] P. Domingos, S. Kok, H. Poon, M. Richardson, and P. Singla. “Unifying Logical and Statistical
AI”. In: IJCAI. 2006.
[Don95] D. L. Donoho. “De-noising by soft-thresholding”. In: IEEE Trans. Inf. Theory 41.3 (May 1995),
pp. 613–627.
[DRB19] A. B. Dieng, F. J. R. Ruiz, and D. M. Blei. “The Dynamic Embedded Topic Model”. In: (July
2019). arXiv: 1907.05545 [cs.CL].
[Dur+98] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic
Models of Proteins and Nucleic Acids. Cambridge University Press, 1998.
[DVR08] J. Dahl, L. Vandenberghe, and V. Roychowdhury. “Covariance selection for non-chordal graphs
via chordal embedding”. In: Optimization Methods and Software 23.4 (2008), pp. 501–502.
[EAL10] D. Edwards, G. de Abreu, and R. Labouriau. “Selecting high-dimensional mixed graphical
models using minimal AIC or BIC forests”. In: BMC Bioinformatics 11.18 (2010).
[Ebe17] F. Eberhardt. “Introduction to the Foundations of Causal Discovery”. In: International Journal
of Data Science and Analytics 3.2 (2017), pp. 81–91.
[Eke+13] M. Ekeberg, C. Lövkvist, Y. Lan, M. Weigt, and E. Aurell. “Improved contact prediction in
proteins: using pseudolikelihoods to infer Potts models”. en. In: Phys. Rev. E Stat. Nonlin. Soft
Matter Phys. 87.1 (Jan. 2013), p. 012707.
[Eks+18] C. Eksombatchai et al. “Pixie: A System for Recommending 3+ Billion Items to 200+ Million
Users in Real-Time”. In: WWW. 2018.

[Eli+00] G. Elidan, N. Lotner, N. Friedman, and D. Koller. “Discovering Hidden Variables: A Structure-
Based Approach”. In: NIPS. 2000.
[EM07] D. Eaton and K. Murphy. “Exact Bayesian structure learning from uncertain interventions”. In:
AI/Statistics. 2007.
[Eva+18] R. Evans et al. “De novo structure prediction with deep-learning based scoring”. In: (2018).
[EW08] B. Ellis and W. H. Wong. “Learning Causal Bayesian Network Structures From Experimental
Data”. In: JASA 103.482 (2008), pp. 778–789.
[Fat18] S. Fattahi. “Exact Guarantees on the Absence of Spurious Local Minima for Non-negative
Robust Principal Component Analysis”. In: JMLR (2018).
[FG02] M. Fishelson and D. Geiger. “Exact genetic linkage computations for general pedigrees”. In:
BMC Bioinformatics 18 (2002).
[FGL00] N. Friedman, D. Geiger, and N. Lotner. “Likelihood computation with value abstraction”. In:
UAI. 2000.
[FHT08] J. Friedman, T. Hastie, and R. Tibshirani. “Sparse inverse covariance estimation the graphical
lasso”. In: Biostatistics 9.3 (2008), pp. 432–441.
[FK03] N. Friedman and D. Koller. “Being Bayesian about Network Structure: A Bayesian Approach to
Structure Discovery in Bayesian Networks”. In: Machine Learning 50 (2003), pp. 95–126.
[Fri+02] N. Friedman, M. Ninio, I. Pe’er, and T. Pupko. “A Structural EM Algorithm for Phylogenetic Inference”. In: J. Comp. Bio. 9 (2002), pp. 331–353.
[Fri97] N. Friedman. “Learning Bayesian Networks in the Presence of Missing Values and Hidden
Variables”. In: UAI. 1997.
[FS18] S. Fattahi and S. Sojoudi. “Graphical Lasso and Thresholding: Equivalence and Closed-form
Solutions”. In: JMLR (2018).
[FZS18] S. Fattahi, R. Y. Zhang, and S. Sojoudi. “Linear-Time Algorithm for Learning Large-Scale
Sparse Graphical Models”. In: IEEE Access (2018).
[GG99] P. Giudici and P. Green. “Decomposable graphical Gaussian model determination”. In: Biometrika
86.4 (1999), pp. 785–801.
[GGS84] H. Gabow, Z. Galil, and T. Spencer. “Efficient implementation of graph algorithms using
contraction”. In: FOCS. 1984.
[GH94] D. Geiger and D. Heckerman. “Learning Gaussian Networks”. In: UAI. Vol. 10. 1994, pp. 235–243.
[GH97] D. Geiger and D. Heckerman. “A characterization of Dirchlet distributions through local and
global independence”. In: Annals of Statistics 25 (1997), pp. 1344–1368.
[GJ07] A. Globerson and T. Jaakkola. “Approximate inference using planar graph decomposition”. In:
AISTATS. 2007.
[GL97] F. Glover and M. Laguna. Tabu Search. Kluwer Academic Publishers, 1997.
[GM04] A. Goldenberg and A. Moore. “Tractable Learning of Large Bayes Net Structures from Sparse
Data”. In: ICML. 2004.
[Gol89] D. E. Goldberg. Genetic Algorithms in Search, Optimization and Machine Learning. 1st. Boston,
MA, USA: Addison-Wesley Longman Publishing Co., Inc., 1989.
[Goo74] L. A. Goodman. “Exploratory latent structure analysis using both identifiable and unidentifiable
models”. In: Biometrika 61.2 (1974), pp. 215–231.
[Gop98] A. Gopnik. “Explanation as Orgasm”. In: Minds and Machines 8.1 (1998), pp. 101–118.
[GPS89] D. Greig, B. Porteous, and A. Seheult. “Exact maximum a posteriori estimation for binary
images”. In: J. of Royal Stat. Soc. Series B 51.2 (1989), pp. 271–279.

[GR01] A. Gelman and T. Raghunathan. “Using conditional distributions for missing-data imputation”.
In: Statistical Science (2001).
[Gri+04] T. Griffiths, M. Steyvers, D. Blei, and J. Tenenbaum. “Integrating Topics and Syntax”. In: NIPS.
2004.
[GS04] T. Griffiths and M. Steyvers. “Finding scientific topics”. In: PNAS 101 (2004), pp. 5228–5235.
[GS15] F. Glover and K. Sorensen. “Metaheuristics”. In: Scholarpedia J. 10.4 (2015), p. 6532.
[GSM18] U. Garciarena, R. Santana, and A. Mendiburu. “Expanding Variational Autoencoders for
Learning and Exploiting Latent Representations in Search Distributions”. In: Proc. of the Conf.
on Genetic and Evolutionary Computation. 2018, pp. 849–856.
[GT06] T. Griffiths and J. Tenenbaum. “Optimal predictions in everyday cognition”. In: Psychological
Science 17.9 (2006), pp. 767–773.
[GT09] T. Griffiths and J. Tenenbaum. “Theory-Based Causal Induction”. In: Psychological Review
116.4 (2009), pp. 661–716.
[Guo+21] R. Guo, L. Cheng, J. Li, P Richard Hahn, and H. Liu. “A Survey of Learning Causality with
Data: Problems and Methods”. In: ACM Computing Surveys 53.4 (2021).
[GZS19] C. Glymour, K. Zhang, and P. Spirtes. “Review of Causal Discovery Methods Based on Graphical
Models”. en. In: Front. Genet. 10 (June 2019), p. 524.
[Han16] N. Hansen. “The CMA Evolution Strategy: A Tutorial”. In: (Apr. 2016). arXiv: 1604.00772
[cs.LG].
[Hau+11] M. Hauschild and M. Pelikan. “An introduction and survey of estimation of distribution algorithms”. In: Swarm and Evolutionary Computation (2011).
[HB14] A. Hauser and P. Bühlmann. “Two optimal strategies for active learning of causal models from
interventional data”. In: Int. J. Approx. Reason. 55.4 (June 2014), pp. 926–939.
[HBB10] M. Hoffman, D. Blei, and F. Bach. “Online learning for latent Dirichlet allocation”. In: NIPS.
2010.
[HDMM18] C. Heinze-Deml, M. H. Maathuis, and N. Meinshausen. “Causal Structure Learning”. In: Annu.
Rev. Stat. Appl. 5.1 (Mar. 2018), pp. 371–391.
[Hec+00] D. Heckerman, D. Chickering, C. Meek, R. Rounthwaite, and C. Kadie. “Dependency Networks
for Density Estimation, Collaborative Filtering, and Data Visualization”. In: JMLR 1 (2000),
pp. 49–75.
[HG09] Y.-B. He and Z. Geng. “Active learning of causal networks with intervention experiments and
optimal designs”. In: JMLR 10 (2009), pp. 2523–2547.
[HGC95] D. Heckerman, D. Geiger, and M. Chickering. “Learning Bayesian networks: the combination of
knowledge and statistical data”. In: Machine Learning 20.3 (1995), pp. 197–243.
[HKP91] J. Hertz, A. Krogh, and R. G. Palmer. An Introduction to the Theory of Neural Computation. Addison-Wesley, 1991.
[HKZ12] D. Hsu, S. Kakade, and T. Zhang. “A spectral algorithm for learning hidden Markov models”.
In: J. of Computer and System Sciences 78.5 (2012), pp. 1460–1480.
[HMC97] D. Heckerman, C. Meek, and G. Cooper. A Bayesian approach to Causal Discovery. Tech. rep.
MSR-TR-97-05. Microsoft Research, 1997.
[Hoe+99] J. Hoeting, D. Madigan, A. Raftery, and C. Volinsky. “Bayesian Model Averaging: A Tutorial”.
In: Statistical Science 4.4 (1999).
[Hof99] T. Hofmann. “Probabilistic latent semantic indexing”. In: Research and Development in Infor-
mation Retrieval (1999), pp. 50–57.

[Hol92] J. H. Holland. Adaptation in Natural and Artificial Systems. https://mitpress.mit.edu/
books/adaptation-natural-and-artificial-systems. Accessed: 2017-11-26. Apr. 1992.
[Hop82] J. J. Hopfield. “Neural networks and physical systems with emergent collective computational
abilities”. In: PNAS 79.8 (1982), 2554–2558.
[Hoy+09] P. O. Hoyer, D. Janzing, J. M. Mooij, J. Peters, and P. B. Schölkopf. “Nonlinear causal discovery
with additive noise models”. In: NIPS. 2009, pp. 689–696.
[HS05] H. Hoos and T. Stutzle. Stochastic local search: Foundations and applications. Morgan Kauffman,
2005.
[HS08] T. Hazan and A. Shashua. “Convergent message-passing algorithms for inference over general
graphs with convex free energy”. In: UAI. 2008.
[HSDK12] C. Hillar, J. Sohl-Dickstein, and K. Koepsell. Efficient and Optimal Binary Hopfield Associative
Memory Storage Using Minimum Probability Flow. Tech. rep. Apr. 2012. arXiv: 1204.2916.
[HT08] H. Hara and A. Takemura. “A Localization Approach to Improve Iterative Proportional Scaling in Gaussian Graphical Models”. In: Communications in Statistics - Theory and Methods (2008). To appear.
[HT09] H. Hoefling and R. Tibshirani. “Estimation of Sparse Binary Pairwise Markov Networks using
Pseudo-likelihoods”. In: JMLR 10 (2009).
[HTF09] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. 2nd edition.
Springer, 2009.
[Hu+12] J. Hu, Y. Wang, E. Zhou, M. C. Fu, and S. I. Marcus. “A Survey of Some Model-Based Methods
for Global Optimization”. en. In: Optimization, Control, and Applications of Stochastic Systems.
Systems & Control: Foundations & Applications. Birkhäuser, Boston, 2012, pp. 157–179.
[HW11] S. Harmeling and C. K. I. Williams. “Greedy Learning of Binary Latent Trees”. In: IEEE PAMI
33.6 (2011), pp. 1087–1097.
[IM17] J. Ingraham and D. Marks. “Bayesian Sparsity for Intractable Undirected Models”. In: ICML.
2017.
[Jan+12] D. Janzing et al. “Information-geometric approach to inferring causal directions”. In: AIJ 182
(2012), pp. 1–31.
[Jay03] E. T. Jaynes. Probability theory: the logic of science. Cambridge university press, 2003.
[JBB09] D. Jain, A. Barthels, and M. Beetz. “Adaptive Markov logic networks: Learning statistical relational models with dynamic parameters”. In: 9th European Conf. on AI. 2009, pp. 937–942.
[JHS13] Y. Jernite, Y. Halpern, and D. Sontag. “Discovering hidden variables in noisy-or networks using
quartet tests”. In: NIPS. 2013.
[Jia+13] Y. Jia, J. T. Abbott, J. L. Austerweil, T. Griffiths, and T. Darrell. “Visual Concept Learning:
Combining Machine Vision and Bayesian Generalization on Concept Hierarchies”. In: NIPS.
2013.
[Jin11] Y. Jin. “Surrogate-assisted evolutionary computation: Recent advances and future challenges”.
In: Swarm and Evolutionary Computation 1.2 (June 2011), pp. 61–70.
[JJ00] T. S. Jaakkola and M. I. Jordan. “Bayesian parameter estimation via variational methods”. In:
Statistics and Computing 10 (2000), pp. 25–37.
[JJ96] T. Jaakkola and M. Jordan. “A variational approach to Bayesian logistic regression problems
and their extensions”. In: AISTATS. 1996.
[JJ99] T. Jaakkola and M. Jordan. “Variational probabilistic inference and the QMR-DT network”. In:
JAIR 10 (1999), pp. 291–322.

[Jon+05] B. Jones, A. Dobra, C. Carvalho, C. Hans, C. Carter, and M. West. “Experiments in stochastic
computation for high-dimensional graphical models”. In: Statistical Science 20 (2005), pp. 388–
400.
[JS10] D. Janzing and B. Scholkopf. “Causal inference using the algorithmic Markov condition”. In:
IEEE Trans. on Information Theory 56.10 (2010), pp. 5168–5194.
[JZB19] A. Jaber, J. Zhang, and E. Bareinboim. “Identification of Conditional Causal Effects under
Markov Equivalence”. In: NIPS. 2019, pp. 11512–11520.
[KB07] M. Kalisch and P. Buhlmann. “Estimating high dimensional directed acyclic graphs with the
PC algorithm”. In: JMLR 8 (2007), pp. 613–636.
[KB14] M. Kalisch and P. Bühlmann. “Causal Structure Learning and Inference: A Selective Review”.
In: Qual. Technol. Quant. Manag. 11.1 (Jan. 2014), pp. 3–21.
[KF09] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT
Press, 2009.
[KNH11] K. B. Korb, E. P. Nyberg, and L. Hope. “A new causal power theory”. In: Causality in the
Sciences. Oxford University Press, 2011.
[KNP11] K. Kersting, S. Natarajan, and D. Poole. Statistical Relational AI: Logic, Probability and
Computation. Tech. rep. UBC, 2011.
[Koi06] M. Koivisto. “Advances in exact Bayesian structure discovery in Bayesian networks”. In: UAI.
2006.
[Kol06] V. Kolmogorov. “Convergent Tree-reweighted Message Passing for Energy Minimization”. In:
IEEE PAMI 28.10 (2006), pp. 1568–1583.
[Koz92] J. R. Koza. Genetic Programming. https://mitpress.mit.edu/books/genetic-programming.
Accessed: 2017-11-26. Dec. 1992.
[KS04] M. Koivisto and K. Sood. “Exact Bayesian structure discovery in Bayesian networks”. In: JMLR
5 (2004), pp. 549–573.
[Lac+20] S. Lachapelle, P. Brouillard, T. Deleu, and S. Lacoste-Julien. “Gradient-Based Neural DAG
Learning”. In: ICLR. 2020.
[LD08] A. Lenkoski and A. Dobra. Bayesian structural learning and estimation in Gaussian graphical
models. Tech. rep. 545. Department of Statistics, University of Washington, 2008.
[Lev11] S. Levy. In The Plex: How Google Thinks, Works, and Shapes Our Lives. Simon & Schuster,
2011.
[LFJ04] M. Law, M. Figueiredo, and A. Jain. “Simultaneous Feature Selection and Clustering Using
Mixture Models”. In: IEEE PAMI 26.4 (2004).
[LGK06] S.-I. Lee, V. Ganapathi, and D. Koller. “Efficient Structure Learning of Markov Networks using
L1-Regularization”. In: NIPS. 2006.
[LHF17] R. M. Levy, A. Haldane, and W. F. Flynn. “Potts Hamiltonian models of protein co-variation,
free energy landscapes, and evolutionary fitness”. en. In: Curr. Opin. Struct. Biol. 43 (Apr.
2017), pp. 55–62.
[Li+14] A. Q. Li, A. Ahmed, S. Ravi, and A. J. Smola. “Reducing the sampling complexity of topic
models”. In: KDD. ACM, 2014, pp. 891–900.
[Li+18] C. Li, H. Farkhoor, R. Liu, and J. Yosinski. “Measuring the Intrinsic Dimension of Objective
Landscapes”. In: ICLR. 2018.
[Li+19] X. Li, L. Vilnis, D. Zhang, M. Boratko, and A. McCallum. “Smoothing the Geometry of
Probabilistic Box Embeddings”. In: ICLR. 2019.
[LL02] P. Larranaga and J. A. Lozano. Estimation of Distribution Algorithms: A New Tool for Evolu-
tionary Computation. Norwell, MA, USA: Kluwer Academic Publishers, 2002.

[LL15] H. Li and Z. Lin. “Accelerated Proximal Gradient Methods for Nonconvex Programming”. In:
NIPS. 2015, pp. 379–387.
[LL18] Z. Li and J. Li. “A Simple Proximal Stochastic Gradient Method for Nonsmooth Nonconvex
Optimization”. In: (Feb. 2018). arXiv: 1802.04477 [math.OC].
[LM06] A. Langville and C. Meyer. “Updating Markov chains with an eye on Google’s PageRank”. In:
SIAM J. on Matrix Analysis and Applications 27.4 (2006), pp. 968–987.
[LMW17] L Landrieu, C Mallet, and M Weinmann. “Comparison of belief propagation and graph-cut
approaches for contextual classification of 3D lidar point cloud data”. In: IEEE International
Geoscience and Remote Sensing Symposium (IGARSS). July 2017, pp. 2768–2771.
[Lof15] P. Lofgren. “Efficient Algorithms for Personalized PageRank”. PhD thesis. Stanford, 2015.
[Lok+18] A. Y. Lokhov, M. Vuffray, S. Misra, and M. Chertkov. “Optimal structure and parameter
learning of Ising models”. en. In: Science Advances 4.3 (Mar. 2018), e1700791.
[Luk13] S. Luke. Essentials of Metaheuristics. 2013.
[MB06] N. Meinshausen and P. Buhlmann. “High dimensional graphs and variable selection with the
lasso”. In: The Annals of Statistics 34 (2006), pp. 1436–1462.
[MB16] Y. Miao and P. Blunsom. “Language as a Latent Variable: Discrete Generative Models for
Sentence Compression”. In: EMNLP. 2016.
[MC03] P. Moscato and C. Cotta. “A Gentle Introduction to Memetic Algorithms”. en. In: Handbook of
Metaheuristics. International Series in Operations Research & Management Science. Springer,
Boston, MA, 2003, pp. 105–144.
[McE20] R. McElreath. Statistical Rethinking: A Bayesian Course with Examples in R and Stan (2nd
edition). en. Chapman and Hall/CRC, 2020.
[McK+04] B. D. McKay, F. E. Oggier, G. F. Royle, N. J. A. Sloane, I. M. Wanless, and H. S. Wilf. “Acyclic digraphs and eigenvalues of (0,1)-matrices”. In: J. Integer Sequences 7.04.3.3 (2004).
[Mei05] N. Meinshausen. A note on the Lasso for Gaussian graphical model selection. Tech. rep. ETH
Seminar fur Statistik, 2005.
[MGR18] H. Mania, A. Guy, and B. Recht. “Simple random search of static linear policies is competitive
for reinforcement learning”. In: NIPS. Ed. by S Bengio, H Wallach, H Larochelle, K Grauman,
N Cesa-Bianchi, and R Garnett. Curran Associates, Inc., 2018, pp. 1800–1809.
[MH12] R. Mazumder and T. Hastie. The Graphical Lasso: New Insights and Alternatives. Tech. rep.
Stanford Dept. Statistics, 2012.
[MH97] C. Meek and D. Heckerman. “Structure and Parameter Learning for Causal Independence and
Causal Interaction Models”. In: UAI. 1997, pp. 366–375.
[Min01] T. Minka. Statistical Approaches to Learning and Discovery 10-602: Homework assignment 2,
question 5. Tech. rep. CMU, 2001.
[Mit97] T. Mitchell. Machine Learning. McGraw Hill, 1997.
[MJ00] M. Meila and M. I. Jordan. “Learning with mixtures of trees”. In: JMLR 1 (2000), pp. 1–48.
[MJ06] M. Meila and T. Jaakkola. “Tractable Bayesian learning of tree belief networks”. In: Statistics
and Computing 16 (2006), pp. 77–92.
[MKM11] B. Marlin, E. Khan, and K. Murphy. “Piecewise Bounds for Estimating Bernoulli-Logistic Latent
Gaussian Models”. In: ICML. 2011.
[MM01] T. K. Marks and J. R. Movellan. Diffusion networks, products of experts, and factor analysis.
Tech. rep. University of California San Diego, 2001.
[Mog+09] B. Moghaddam, B. Marlin, E. Khan, and K. Murphy. “Accelerating Bayesian Structural Inference
for Non-Decomposable Gaussian Graphical Models”. In: NIPS. 2009.

[Mol+18] D. Moldovan, V. Chifu, C. Pop, T. Cioara, I. Anghel, and I. Salomie. “Chicken Swarm Opti-
mization and Deep Learning for Manufacturing Processes”. In: Networking in Education and
Research (RoEduNet) conference. Cluj-Napoca: IEEE, Sept. 2018, pp. 1–6.
[Moo+16] J. M. Mooij, J. Peters, D. Janzing, J. Zscheischler, and B. Schölkopf. “Distinguishing Cause
from Effect Using Observational Data: Methods and Benchmarks”. In: JMLR 17.1 (Jan. 2016),
pp. 1103–1204.
[Mor+11] F. Morcos et al. “Direct-coupling analysis of residue coevolution captures native contacts across
many protein families”. en. In: Proc. Natl. Acad. Sci. U. S. A. 108.49 (Dec. 2011), E1293–301.
[MR94] D. Madigan and A. Raftery. “Model selection and accounting for model uncertainty in graphical
models using Occam’s window”. In: JASA 89 (1994), pp. 1535–1546.
[Mue+17] J. Mueller, D. N. Reshef, G. Du, and T. Jaakkola. “Learning Optimal Interventions”. In:
AISTATS. 2017.
[Mur01] K. Murphy. Active Learning of Causal Bayes Net Structure. Tech. rep. Comp. Sci. Div., UC
Berkeley, 2001.
[Mur22] K. P. Murphy. Probabilistic Machine Learning: An introduction. MIT Press, 2022.
[Nea92] R. Neal. “Connectionist learning of belief networks”. In: Artificial Intelligence 56 (1992), pp. 71–
113.
[Nit14] A. Nitanda. “Stochastic Proximal Gradient Descent with Acceleration Techniques”. In: NIPS.
2014, pp. 1574–1582.
[NY83] A. Nemirovski and D. Yudin. Problem Complexity and Method Efficiency in Optimization.
Wiley, 1983.
[Oll+17] Y. Ollivier, L. Arnold, A. Auger, and N. Hansen. “Information-Geometric Optimization Algo-
rithms: A Unifying Picture via Invariance Principles”. In: JMLR 18 (2017), pp. 1–65.
[PB+14] N. Parikh, S. Boyd, et al. “Proximal algorithms”. In: Foundations and Trends in Optimization
1.3 (2014), pp. 127–239.
[Pe’05] D. Pe’er. “Bayesian network analysis of signaling networks: a primer”. In: Science STKE 281
(2005), p. 14.
[PE08] J.-P. Pellet and A. Elisseeff. “Using Markov blankets for causal structure learning”. In: JMLR 9
(2008), pp. 1295–1342.
[Pea09] J. Pearl. Causality: Models, Reasoning and Inference (Second Edition). Cambridge Univ. Press,
2009.
[Pel05] M. Pelikan. Hierarchical Bayesian Optimization Algorithm: Toward a New Generation of
Evolutionary Algorithms. en. Softcover reprint of hardcover 1st ed. 2005 edition. Springer, 2005.
[PGCP00] M Pelikan, D. E. Goldberg, and E Cantú-Paz. “Linkage problem, distribution estimation, and
Bayesian networks”. en. In: Evol. Comput. 8.3 (2000), pp. 311–340.
[PHL12] M. Pelikan, M. Hausschild, and F. Lobo. Introduction to estimation of distribution algorithms.
Tech. rep. U. Missouri, 2012.
[PJS17] J. Peters, D. Janzing, and B. Schölkopf. Elements of Causal Inference: Foundations and Learning
Algorithms (Adaptive Computation and Machine Learning series). The MIT Press, Nov. 2017.
[PK11] P. Parviainen and M. Koivisto. “Ancestor Relations in the Presence of Unobserved Variables”.
In: ECML. 2011.
[PN18] A. Patrascu and I. Necoara. “Nonasymptotic convergence of stochastic proximal point methods
for constrained convex optimization”. In: JMLR 18.198 (2018), pp. 1–42.
[Poo+12] D. Poole, D. Buchman, S. Natarajan, and K. Kersting. “Aggregation and Population Growth:
The Relational Logistic Regression and Markov Logic Cases”. In: Statistical Relational AI
workshop. 2012.

[PRG17] M. Probst, F. Rothlauf, and J. Grahl. “Scalability of using Restricted Boltzmann Machines for
combinatorial optimization”. In: Eur. J. Oper. Res. 256.2 (Jan. 2017), pp. 368–383.
[Pri12] S. Prince. Computer Vision: Models, Learning and Inference. Cambridge, 2012.
[PSCP06] M. Pelikan, K. Sastry, and E. Cantú-Paz. Scalable Optimization via Probabilistic Modeling:
From Algorithms to Applications (Studies in Computational Intelligence). Secaucus, NJ, USA:
Springer-Verlag New York, Inc., 2006.
[PSW15] N. G. Polson, J. G. Scott, and B. T. Willard. “Proximal Algorithms in Statistics and Machine
Learning”. en. In: Stat. Sci. 30.4 (Nov. 2015), pp. 559–581.
[RD06] M. Richardson and P. Domingos. “Markov logic networks”. In: Machine Learning 62 (2006),
pp. 107–136.
[Rea+19] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le. “Regularized Evolution for Image Classifier
Architecture Search”. In: AAAI. 2019.
[Red+16] S. J. Reddi, S. Sra, B. Póczos, and A. J. Smola. “Proximal Stochastic Methods for Nonsmooth
Nonconvex Finite-sum Optimization”. In: NIPS. NIPS’16. USA, 2016, pp. 1153–1161.
[RH10] M. Ranzato and G. Hinton. “Modeling pixel means and covariances using factored third-order
Boltzmann machines”. In: CVPR. 2010.
[RK04] R. Rubinstein and D. Kroese. The Cross-Entropy Method: A Unified Approach to Combinatorial
Optimization, Monte-Carlo Simulation, and Machine Learning. Springer-Verlag, 2004.
[RM15] G. Raskutti and S. Mukherjee. “The information geometry of mirror descent”. In: IEEE Trans.
Info. Theory 61.3 (2015), pp. 1451–1457.
[RN10] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. 3rd edition. Prentice Hall,
2010.
[Rob73] R. W. Robinson. “Counting labeled acyclic digraphs”. In: New Directions in the Theory of
Graphs. Ed. by F. Harary. Academic Press, 1973, pp. 239–273.
[Roc06] S. Roch. “A short proof that phylogenetic tree reconstruction by maximum likelihood is hard”. In: IEEE/ACM Trans. Comp. Bio. Bioinformatics 31.1 (2006).
[Rov02] A. Roverato. “Hyper inverse Wishart distribution for non-decomposable graphs and its applica-
tion to Bayesian inference for Gaussian graphical models”. In: Scand. J. Statistics 29 (2002),
pp. 391–411.
[RU10] A. Rajaraman and J. Ullman. Mining of massive datasets. Self-published, 2010.
[Rub97] R. Y. Rubinstein. “Optimization of computer simulation models with rare events”. In: Eur. J.
Oper. Res. 99.1 (May 1997), pp. 89–112.
[Sac+05] K. Sachs, O. Perez, D. Pe’er, D. Lauffenburger, and G. Nolan. “Causal Protein-Signaling
Networks Derived from Multiparameter Single-Cell Data”. In: Science 308 (2005).
[Sal+17] T. Salimans, J. Ho, X. Chen, and I. Sutskever. “Evolution Strategies as a Scalable Alternative
to Reinforcement Learning”. In: (Mar. 2017). arXiv: 1703.03864 [stat.ML].
[San17] R. Santana. “Gray-box optimization and factorized distribution algorithms: where two worlds
collide”. In: (July 2017). arXiv: 1707.03093 [cs.NE].
[SC08] J. G. Scott and C. M. Carvalho. “Feature-inclusion Stochastic Search for Gaussian Graphical
Models”. In: J. of Computational and Graphical Statistics 17.4 (2008), pp. 790–808.
[Sch+08] M. Schmidt, K. Murphy, G. Fung, and R. Rosales. “Structure Learning in Random Fields for
Heart Motion Abnormality Detection”. In: CVPR. 2008.
[Sch+09] M. Schmidt, E. van den Berg, M. Friedlander, and K. Murphy. “Optimizing Costly Functions
with Simple Constraints: A Limited-Memory Projected Quasi-Newton Algorithm”. In: AI &
Statistics. 2009.

[Sch10a] M. Schmidt. “Graphical model structure learning with L1 regularization”. PhD thesis. UBC,
2010.
[Sch10b] N. Schraudolph. “Polynomial-Time Exact Inference in NP-Hard Binary MRFs via Reweighted
Perfect Matching”. In: AISTATS. 2010.
[Sch+17] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. “Proximal Policy Optimization
Algorithms”. In: (July 2017). arXiv: 1707.06347 [cs.LG].
[Sch+21] B. Schölkopf et al. “Toward Causal Representation Learning”. In: Proc. IEEE 109.5 (May 2021),
pp. 612–634.
[Seg11] D. Segal. “The dirty little secrets of search”. In: New York Times (2011).
[Sen+08] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Gallagher, and T. Eliassi-Rad. “Collective Classification in Network Data”. en. In: AI Magazine 29.3 (Sept. 2008), pp. 93–106.
[SF08] E. Sudderth and W. Freeman. “Signal and Image Processing with Belief Propagation”. In: IEEE
Signal Processing Magazine (2008).
[SF19] A. Shekhovtsov and B. Flach. “Feed-forward Propagation in Probabilistic Neural Networks with
Categorical and Max Layers”. In: ICLR. 2019.
[SG07] M. Steyvers and T. Griffiths. “Probabilistic topic models”. In: Latent Semantic Analysis: A Road
to Meaning. Ed. by T. Landauer, D McNamara, S. Dennis, and W. Kintsch. Laurence Erlbaum,
2007.
[SG09] R. Silva and Z. Ghahramani. “The Hidden Life of Latent Variables: Bayesian Learning with
Mixed Graph Models”. In: JMLR 10 (2009), pp. 1187–1238.
[SG91] P. Spirtes and C. Glymour. “An algorithm for fast recovery of sparse causal graphs”. In: Social
Science Computer Review 9 (1991), pp. 62–72.
[SGS00] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. 2nd edition. MIT
Press, 2000.
[SH03] A. Siepel and D. Haussler. “Combining phylogenetic and hidden Markov models in biosequence analysis”. In: Proc. 7th Intl. Conf. on Computational Molecular Biology (RECOMB). 2003.
[SH06] T. Singliar and M. Hauskrecht. “Noisy-OR Component Analysis and its Application to Link
Analysis”. In: JMLR 7 (2006).
[SH10] R. Salakhutdinov and G. Hinton. “Replicated Softmax: an Undirected Topic Model”. In: NIPS.
2010.
[Shi+06] S. Shimizu, P. O. Hoyer, A. Hyvärinen, and A. Kerminen. “A Linear Non-Gaussian Acyclic
Model for Causal Discovery”. In: JMLR 7.Oct (2006), pp. 2003–2030.
[Shw+91] M. Shwe et al. “Probabilistic Diagnosis Using a Reformulation of the INTERNIST-1/QMR
Knowledge Base”. In: Methods. Inf. Med 30.4 (1991), pp. 241–255.
[SK86] T. Speed and H. Kiiveri. “Gaussian Markov distributions over finite graphs”. In: Annals of
Statistics 14.1 (1986), pp. 138–150.
[SKM07] T. Silander, P. Kontkanen, and P. Myllymaki. “On Sensitivity of the MAP Bayesian Network
Structure to the Equivalent Sample Size Parameter”. In: UAI. 2007, pp. 360–367.
[SLM92] B. Selman, H. Levesque, and D. Mitchell. “A New Method for Solving Hard Satisfiability
Problems”. In: Proceedings of the Tenth National Conference on Artificial Intelligence. AAAI’92.
San Jose, California: AAAI Press, 1992, pp. 440–446.
[SM06] T. Silander and P. Myllymaki. “A simple approach for finding the globally optimal Bayesian
network structure”. In: UAI. 2006.
[SM09] M. Schmidt and K. Murphy. “Modeling Discrete Interventional Data using Directed Cyclic
Graphical Models”. In: UAI. 2009.

[SMH07] R. R. Salakhutdinov, A. Mnih, and G. E. Hinton. “Restricted Boltzmann machines for collabo-
rative filtering”. In: ICML. Vol. 24. 2007, pp. 791–798.
[SNMM07] M. Schmidt, A. Niculescu-Mizil, and K. Murphy. “Learning Graphical Model Structure using
L1-Regularization Paths”. In: AAAI. 2007.
[Sol19] L. Solus. “Interventional Markov Equivalence for Mixed Graph Models”. In: (Nov. 2019). arXiv:
1911.10114 [math.ST].
[Sör15] K. Sörensen. “Metaheuristics—the metaphor exposed”. In: Intl. Trans. in Op. Res. 22.1 (Jan.
2015), pp. 3–18.
[SP18] R. D. Shah and J. Peters. “The Hardness of Conditional Independence Testing and the Gener-
alised Covariance Measure”. In: Ann. Stat. (2018).
[SS02] D. Scharstein and R. Szeliski. “A taxonomy and evaluation of dense two-frame stereo correspon-
dence algorithms”. In: Intl. J. Computer Vision 47.1 (2002), pp. 7–42.
[SS17] A. Srivastava and C. Sutton. “Autoencoding Variational Inference For Topic Models”. In: ICLR.
2017.
[Sta+19] K. O. Stanley, J. Clune, J. Lehman, and R. Miikkulainen. “Designing neural networks through
neuroevolution”. In: Nature Machine Intelligence 1.1 (2019).
[SW11] R. Sedgewick and K. Wayne. Algorithms. Addison Wesley, 2011.
[SWW08] E. Sudderth, M. Wainwright, and A. Willsky. “Loop series and Bethe variational bounds for
attractive graphical models”. In: NIPS. 2008.
[Sze+08] R. Szeliski et al. “A Comparative Study of Energy Minimization Methods for Markov Random
Fields with Smoothness-Based Priors”. In: IEEE PAMI 30.6 (2008), pp. 1068–1080.
[Sze10] R. Szeliski. Computer Vision: Algorithms and Applications. Springer, 2010.
[Tad15] M. Taddy. “Distributed multinomial regression”. en. In: Annals of Applied Statistics 9.3 (Sept.
2015), pp. 1394–1414.
[Teh+06] Y.-W. Teh, M. Jordan, M. Beal, and D. Blei. “Hierarchical Dirichlet processes”. In: JASA 101.476
(2006), pp. 1566–1581.
[Ten+11] J. Tenenbaum, C. Kemp, T. Griffiths, and N. Goodman. “How to Grow a Mind: Statistics, Structure, and Abstraction”. In: Science 331.6022 (2011), pp. 1279–1285.
[Ten99] J. Tenenbaum. “A Bayesian framework for concept learning”. PhD thesis. MIT, 1999.
[TF03] M. Tappen and B. Freeman. “Comparison of graph cuts with belief propagation for stereo, using
identical MRF parameters”. In: ICCV. Oct. 2003, 900–906 vol.2.
[TG09] A. Thomas and P. Green. “Enumerating the Decomposable Neighbours of a Decomposable
Graph Under a Simple Perturbation Scheme”. In: Comp. Statistics and Data Analysis 53 (2009),
pp. 1232–1238.
[Thi+98] B. Thiesson, C. Meek, D. Chickering, and D. Heckerman. “Learning Mixtures of DAG models”.
In: UAI. 1998.
[Tse08] P. Tseng. On accelerated proximal gradient methods for convex-concave optimization. Unpublished
manuscript. 2008.
[Tu+19] R. Tu, K. Zhang, B. C. Bertilson, H. Kjellström, and C. Zhang. “Neuropathic Pain Diagnosis
Simulator for Causal Discovery Algorithm Evaluation”. In: NIPS. 2019.
[Var21] K. R. Varshney. Trustworthy Machine Learning. 2021.
[VML19] M. Vuffray, S. Misra, and A. Y. Lokhov. “Efficient Learning of Discrete Graphical Models”. In:
(Feb. 2019). arXiv: 1902.00600 [cs.LG].
[VP90] T. Verma and J. Pearl. “Equivalence and synthesis of causal models”. In: UAI. 1990.

[Wal+09] H. Wallach, I. Murray, R. Salakhutdinov, and D. Mimno. “Evaluation Methods for Topic Models”.
In: ICML. 2009.
[WCK03] F. Wong, C. Carter, and R. Kohn. “Efficient estimation of covariance selection models”. In:
Biometrika 90.4 (2003), pp. 809–830.
[Wer07] T. Werner. “A linear programming approach to the max-sum problem: A review”. In: IEEE
PAMI 29.7 (2007), pp. 1165–1179.
[WHT19] Y. Wang, H. He, and X. Tan. “Truly Proximal Policy Optimization”. In: UAI. 2019.
[Wie+14] D Wierstra, T Schaul, J Peters, and J Schmidhuber. “Natural Evolution Strategies”. In: JMLR
15.1 (2014), pp. 949–980.
[WJ08] M. J. Wainwright and M. I. Jordan. “Graphical models, exponential families, and variational
inference”. In: Foundations and Trends in Machine Learning 1–2 (2008), pp. 1–305.
[WJW05a] M. Wainwright, T. Jaakkola, and A. Willsky. “A new class of upper bounds on the log partition
function”. In: IEEE Trans. Info. Theory 51.7 (2005), pp. 2313–2335.
[WJW05b] M. Wainwright, T. Jaakkola, and A. Willsky. “MAP estimation via agreement on trees: message-
passing and linear programming”. In: IEEE Trans. Info. Theory 51.11 (2005), pp. 3697–3717.
[WP18] H. Wang and H. Poon. “Deep Probabilistic Logic: A Unifying Framework for Indirect Supervision”.
In: EMNLP. 2018.
[WRL06] M. Wainwright, P. Ravikumar, and J. Lafferty. “Inferring Graphical Model Structure using ℓ1-Regularized Pseudo-Likelihood”. In: NIPS. 2006.
[WSD19] S. Wu, S. Sanghavi, and A. G. Dimakis. “Sparse Logistic Regression Learns All Discrete Pairwise
Graphical Models”. In: NIPS. 2019.
[XAH19] Z. Xu, T. Ajanthan, and R. Hartley. “Fast and Differentiable Message Passing for Stereo Vision”.
In: (Oct. 2019). arXiv: 1910.10892 [cs.CV].
[XT07] F. Xu and J. Tenenbaum. “Word learning as Bayesian inference”. In: Psychological Review 114.2
(2007).
[Xu18] J. Xu. “Distance-based Protein Folding Powered by Deep Learning”. In: (Nov. 2018). arXiv:
1811.03481 [q-bio.BM].
[Yam+12] K. Yamaguchi, T. Hazan, D. McAllester, and R. Urtasun. “Continuous Markov Random Fields
for Robust Stereo Estimation”. In: (Apr. 2012). arXiv: 1204.1393 [cs.CV].
[Yao+20] Q. Yao, J. Xu, W.-W. Tu, and Z. Zhu. “Efficient Neural Architecture Search via Proximal
Iterations”. In: AAAI. 2020.
[YL07] M. Yuan and Y. Lin. “Model Selection and Estimation in the Gaussian Graphical Model”. In:
Biometrika 94.1 (2007), pp. 19–35.
[Yu+19] Y. Yu, J. Chen, T. Gao, and M. Yu. “DAG-GNN: DAG Structure Learning with Graph Neural
Networks”. In: ICML. 2019.
[Zha04] N. Zhang. “Hierarchical latent class models for cluster analysis”. In: JMLR (2004), pp. 301–308.
[Zha+20] Y. Zhang et al. “Efficient Probabilistic Logic Reasoning with Graph Neural Networks”. In: ICLR.
2020.
[Zhe+18] X. Zheng, B. Aragam, P. Ravikumar, and E. P. Xing. “DAGs with NO TEARS: Smooth
Optimization for Structure Learning”. In: NIPS. 2018.
[ZK14] W. Zhong and J. Kwok. “Fast Stochastic Alternating Direction Method of Multipliers”. In:
ICML. Ed. by E. P. Xing and T. Jebara. Vol. 32. Proceedings of Machine Learning Research.
Bejing, China: PMLR, 2014, pp. 46–54.
[ZWM97] S. C. Zhu, Y. N. Wu, and D. Mumford. “Minimax Entropy Principle and Its Application to Texture Modeling”. In: Neural Computation 9.8 (Nov. 1997).

