
Graphical models, exponential families, and variational methods

Martin Wainwright
Departments of Statistics, and
Electrical Engineering and Computer Science,
UC Berkeley

Email: [email protected]

Tutorial slides based on joint paper with Michael Jordan


Paper at: www.eecs.berkeley.edu/~wainwrig/WaiJorVariational03.ps

1
Introduction

• graphical models are used and studied in various applied statistical and computational fields:
– machine learning and artificial intelligence
– computational biology
– statistical signal/image processing
– communication and information theory
– statistical physics
– .....

• based on correspondences between graph theory and probability theory

• important but difficult problems:


– computing likelihoods, marginal distributions, modes
– estimating model parameters and structure from (noisy) data

2
Outline
1. Introduction and motivation
(a) Background on graphical models
(b) Some applications and challenging problems
(c) Illustrations of some variational methods

2. Exponential families and variational methods


(a) What is a variational method (and why should I care)?
(b) Graphical models as exponential families
(c) The power of conjugate duality

3. Exact techniques as variational methods


(a) Gaussian inference on arbitrary graphs
(b) Belief-propagation/sum-product on trees (e.g., Kalman filter; α-β alg.)
(c) Max-product on trees (e.g., Viterbi)

4. Approximate techniques as variational methods


(a) Mean field and variants
(b) Belief propagation and extensions on graphs with cycles
(c) Semidefinite constraints and convex relaxations

3
Undirected graphical models
Based on correspondences between graphs and random variables.
• given an undirected graph G = (V, E), associate to each node s a random variable X_s
• for each subset A ⊆ V, define X_A := {x_s, s ∈ A}.

[Figure: a 7-node graph with maximal cliques (123), (345), (456), (47), and a vertex cutset S whose removal separates subsets A and B.]
• a clique C ⊆ V is a subset of vertices all joined by edges
• a vertex cutset is a subset S ⊂ V whose removal breaks the graph
into two or more pieces

4
Factorization and Markov properties

The graph G can be used to impose constraints on the random vector X = X_V (or on the distribution p) in different ways.

Markov property: X is Markov w.r.t. G if X_A and X_B are conditionally independent given X_S whenever S separates A and B.

Factorization: The distribution p factorizes according to G if it can be expressed as a product over cliques:

p(x) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C)

where \psi_C is the compatibility function on clique C.

Theorem (Hammersley-Clifford): For strictly positive p(·), the Markov property and the factorization property are equivalent.

5
Example: Hidden Markov models

[Figure: (a) a hidden Markov model with hidden chain X_1 → X_2 → · · · → X_T and observations Y_1, . . . , Y_T; (b) a coupled HMM.]

• HMMs are widely used in various applications:
– discrete X_t: computational biology, speech processing, etc.
– Gaussian X_t: control theory, signal processing, etc.

• frequently wish to solve the smoothing problem of computing p(x_t | y_1, . . . , y_T)

• exact computation in HMMs is tractable, but coupled HMMs require algorithms for approximate computation (e.g., structured mean field)

6
Example: Graphical codes for communication
Goal: Achieve reliable communication over a noisy channel.

[Figure: communication pipeline source → Encoder → noisy channel → Decoder, e.g., 0 → 00000 → 10010 → 00000; X is the transmitted codeword, Y the channel output, X̂ the decoded estimate.]

• wide variety of applications: satellite communication, sensor networks, computer memory, neural communication

• error-control codes based on careful addition of redundancy, with their fundamental limits determined by Shannon theory

• key implementational issues: efficient construction, encoding and decoding

• very active area of current research: graphical codes (e.g., turbo codes, low-density parity check codes) and iterative message-passing algorithms (belief propagation; max-product)

7
Graphical codes and decoding
Parity check matrix:

H = \begin{bmatrix} 1 & 0 & 1 & 0 & 1 & 0 & 1 \\ 0 & 1 & 1 & 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 1 & 1 & 1 & 1 \end{bmatrix}

Codeword: [0 1 0 1 0 1 0]
Non-codeword: [0 0 0 0 0 1 1]

[Figure: factor graph with variable nodes x_1, . . . , x_7 connected to parity-check factors ψ_1357, ψ_2367, ψ_4567.]

• Decoding: requires finding the maximum likelihood codeword:

\hat{x}_{ML} = \arg\max_{x} p(y \mid x) \quad \text{s.t. } Hx = 0 \ (\mathrm{mod}\ 2).
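Codeword membership can be checked directly from H. A minimal Python sketch (assuming NumPy; not part of the original slides) that verifies the codeword and non-codeword shown above:

```python
# Check H x = 0 (mod 2) for the parity check matrix and vectors above.
import numpy as np

H = np.array([[1, 0, 1, 0, 1, 0, 1],
              [0, 1, 1, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 1, 1]])

codeword     = np.array([0, 1, 0, 1, 0, 1, 0])
non_codeword = np.array([0, 0, 0, 0, 0, 1, 1])

print((H @ codeword) % 2)        # [0 0 0]  -> all parity checks satisfied
print((H @ non_codeword) % 2)    # [1 0 0]  -> first check violated, not a codeword
```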

• use of belief propagation as an approximate decoder has revolutionized the field of error-control coding

8
Challenging computational problems

Frequently, it is of interest to compute various quantities associated with an undirected graphical model:

(a) the log normalization constant log Z

(b) local marginal distributions or other local statistics

(c) modes or most probable configurations

Relevant dimensions often grow rapidly in graph size =⇒ major computational challenges.

Example: Consider a naive approach to computing the normalization constant for binary random variables:

Z = \sum_{x \in \{0,1\}^n} \prod_{C \in \mathcal{C}} \psi_C(x_C)

Complexity scales exponentially as 2^n.
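To make the scaling concrete, here is a minimal Python sketch (assuming NumPy; the small pairwise model and its parameters are purely illustrative, not from the slides) that computes Z by brute-force enumeration; the 2^n-term sum below is exactly what becomes infeasible as n grows:

```python
# Brute-force normalization constant for a small binary pairwise model.
import itertools
import numpy as np

n = 10
rng = np.random.default_rng(0)
theta_s = rng.normal(size=n)                        # node parameters
theta_st = {(s, s + 1): 0.5 for s in range(n - 1)}  # chain of pairwise couplings

def unnorm(x):
    """Unnormalized probability exp(sum_s theta_s x_s + sum_{st} theta_st x_s x_t)."""
    val = np.dot(theta_s, x)
    val += sum(th * x[s] * x[t] for (s, t), th in theta_st.items())
    return np.exp(val)

Z = sum(unnorm(np.array(x)) for x in itertools.product([0, 1], repeat=n))
print(Z)   # feasible for n = 10 (1024 terms), hopeless for n in the hundreds
```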

9
Gibbs sampling in the Ising model

• binary variables on a graph G = (V, E) with pairwise interactions:

p(x; \theta) \propto \exp\Big( \sum_{s \in V} \theta_s x_s + \sum_{(s,t) \in E} \theta_{st} x_s x_t \Big)

Update x_s^{(m+1)} stochastically based on the values x_{N(s)}^{(m)} at the neighbors of s:

1. Choose s ∈ V at random.
2. Sample u ∼ U(0, 1) and update

x_s^{(m+1)} = \begin{cases} 1 & \text{if } u \le \big\{ 1 + \exp\big[ -\big( \theta_s + \sum_{t \in N(s)} \theta_{st} x_t^{(m)} \big) \big] \big\}^{-1} \\ 0 & \text{otherwise} \end{cases}
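A minimal Python sketch of this Gibbs sampler (assuming NumPy; the small chain graph, parameter values, and function names are illustrative, not from the slides):

```python
# Gibbs sampling for an Ising model with pairwise couplings.
import numpy as np

def gibbs_ising(theta_s, theta_st, neighbors, num_sweeps=5000, rng=None):
    """theta_s: node parameters; theta_st: dict {(s, t): coupling} with s < t;
    neighbors: dict {s: list of neighbors of s}."""
    rng = rng or np.random.default_rng(0)
    n = len(theta_s)
    x = rng.integers(0, 2, size=n)                      # random binary initialization
    for _ in range(num_sweeps * n):
        s = rng.integers(n)                             # 1. choose s at random
        field = theta_s[s] + sum(
            theta_st[(min(s, t), max(s, t))] * x[t] for t in neighbors[s])
        p_one = 1.0 / (1.0 + np.exp(-field))            # conditional P(x_s = 1 | x_N(s))
        x[s] = 1 if rng.uniform() <= p_one else 0       # 2. sample u and update
    return x

# Example: a 3-node chain 0 - 1 - 2
theta_s = np.array([0.1, -0.2, 0.3])
theta_st = {(0, 1): 0.5, (1, 2): 0.5}
neighbors = {0: [1], 1: [0, 2], 2: [1]}
print(gibbs_ising(theta_s, theta_st, neighbors))        # one sample after many sweeps
```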

10
Mean field updates in the Ising model

• binary variables on a graph G = (V, E) with pairwise interactions:

p(x; \theta) \propto \exp\Big( \sum_{s \in V} \theta_s x_s + \sum_{(s,t) \in E} \theta_{st} x_s x_t \Big)

• simple (deterministic) message-passing algorithm involving variational parameters ν_s ∈ (0, 1) at each node

[Figure: a node s with neighbors t ∈ N(s) carrying variational parameters ν_t.]

1. Choose s ∈ V at random.
2. Update ν_s based on the neighbor values {ν_t, t ∈ N(s)}:

\nu_s \longleftarrow \Big\{ 1 + \exp\Big[ -\Big( \theta_s + \sum_{t \in N(s)} \theta_{st} \nu_t \Big) \Big] \Big\}^{-1}

Questions: • principled derivation? • convergence and accuracy?

11
Sum-product (belief-propagation) in the Ising model

• alternative set of message-passing updates (motivated by exactness for trees)

[Figure: node s and a neighbor t, with subtrees T_t, T_u, T_v, T_w hanging off the neighbors and messages ν_ts, ν_ut, ν_vt, ν_wt flowing toward s.]

1. For each (direction of each) edge, update the message:

\nu_{ts}(x_s) \leftarrow \sum_{x_t = 0}^{1} \exp(\theta_t x_t + \theta_{st} x_s x_t) \prod_{u \in N(t) \setminus s} \nu_{ut}(x_t)

2. Upon convergence, compute the approximation to the marginal:

p(x_s) \propto \exp(\theta_s x_s) \prod_{t \in N(s)} \nu_{ts}(x_s).

• for any tree (i.e., no cycles), updates will converge (after a finite
number of steps), and yield exact marginals (cf. Pearl, 1988)

• behavior for graphs with cycles?
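A minimal Python sketch of these sum-product updates on a small tree-structured Ising model (assuming NumPy; the 3-node chain, its parameters, and the helper names are illustrative), checked against brute-force marginals:

```python
# Sum-product on a 3-node chain 0 - 1 - 2 (a tree), compared to exact marginals.
import itertools
import numpy as np

theta_s = {0: 0.1, 1: -0.2, 2: 0.3}
theta_st = {(0, 1): 0.5, (1, 2): -0.4}
nbrs = {0: [1], 1: [0, 2], 2: [1]}

def coupling(s, t):
    return theta_st.get((s, t), theta_st.get((t, s)))

# messages nu_{t->s}(x_s), initialized to 1 and iterated to convergence
msg = {(t, s): np.ones(2) for s in nbrs for t in nbrs[s]}
for _ in range(20):
    new = {}
    for (t, s) in msg:
        m = np.zeros(2)
        for xs in (0, 1):
            for xt in (0, 1):
                incoming = np.prod([msg[(u, t)][xt] for u in nbrs[t] if u != s])
                m[xs] += np.exp(theta_s[t] * xt + coupling(s, t) * xs * xt) * incoming
        new[(t, s)] = m / m.sum()        # normalize for numerical stability
    msg = new

def bp_marginal(s):
    b = np.array([np.exp(theta_s[s] * xs) *
                  np.prod([msg[(t, s)][xs] for t in nbrs[s]]) for xs in (0, 1)])
    return b / b.sum()

def brute_marginal(s):
    p = np.zeros(2)
    for x in itertools.product([0, 1], repeat=3):
        w = np.exp(sum(theta_s[v] * x[v] for v in range(3)) +
                   sum(th * x[a] * x[b] for (a, b), th in theta_st.items()))
        p[x[s]] += w
    return p / p.sum()

print(bp_marginal(1), brute_marginal(1))   # agree, since the graph is a tree
```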

12
Outline
1. Introduction and motivation
(a) Background on graphical models
(b) Some applications and challenging problems
(c) Illustrations of some variational methods

2. Exponential families and variational methods


(a) What is a variational method (and why should I care)?
(b) Graphical models as exponential families
(c) The power of conjugate duality

3. Exact techniques as variational methods


(a) Gaussian inference on arbitrary graphs
(b) Belief-propagation/sum-product on trees (e.g., Kalman filter; α-β alg.)
(c) Max-product on trees (e.g., Viterbi)

4. Approximate techniques as variational methods


(a) Mean field and variants
(b) Belief propagation and extensions
(c) Semidefinite constraints and convex relaxations

13
Variational methods

• “variational”: umbrella term for optimization-based formulation of problems, and methods for their solution
• historical roots in the calculus of variations
• modern variational methods encompass a wider class of methods (e.g., dynamic programming; finite-element methods)

Variational principle: Representation of a quantity of interest û as the solution of an optimization problem.
1. allows the quantity û to be studied through the lens of the optimization problem
2. approximations to û can be obtained by approximating or relaxing the variational principle

14
Illustration: A simple variational principle

Goal: Given a vector y ∈ R^n and a symmetric matrix Q ≻ 0, solve the linear system Qu = y.

Unique solution û(y) = Q^{-1} y can be obtained by matrix inversion.

Variational formulation: Consider the function J_y : R^n → R defined by

J_y(u) := \frac{1}{2} u^T Q u - y^T u.

It is strictly convex, and the minimum is uniquely attained:

\hat{u}(y) = \arg\min_{u \in \mathbb{R}^n} J_y(u) = Q^{-1} y.

Various methods for solving linear systems (e.g., conjugate gradient) exploit this variational representation.
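A minimal numeric illustration in Python (assuming NumPy/SciPy; the random matrix and vector are purely for the example): minimizing J_y with a generic optimizer recovers the solution of Qu = y.

```python
# Minimizing J_y(u) = 0.5 u^T Q u - y^T u recovers the solution of Q u = y.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
Q = B @ B.T + 4 * np.eye(4)            # symmetric positive definite Q
y = rng.standard_normal(4)

J = lambda u: 0.5 * u @ Q @ u - y @ u  # the quadratic objective J_y
u_var = minimize(J, np.zeros(4)).x     # generic minimizer of J_y
u_exact = np.linalg.solve(Q, y)        # direct solution of Q u = y

print(np.allclose(u_var, u_exact, atol=1e-4))   # True: both approaches agree
```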

15
Useful variational principles for graphical models?
Consider an undirected graphical model:
p(x) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C)

Core problems that arise in many applications:

(a) computing the log normalization constant log Z

(b) computing local marginal distributions (e.g., p(x_s) = \sum_{x_t, t \neq s} p(x))

(c) computing modes or most likely configurations \hat{x} \in \arg\max_x p(x)

Approach: Develop variational representations of all of these problems by exploiting ideas and results from:

(a) exponential families (e.g., Brown, 1986)

(b) convex analysis (e.g., Rockafellar, 1973)

16
Exponential families

φα : X n → R ≡ sufficient statistic
φ = {φα , α ∈ I} ≡ vector of sufficient statistics
θ = {θα , α ∈ I} ≡ parameter vector
ν ≡ base measure (e.g., Lebesgue, counting)

• parameterized family of densities (w.r.t. ν):

p(x; \theta) = \exp\Big( \sum_{\alpha} \theta_\alpha \phi_\alpha(x) - A(\theta) \Big)

• cumulant generating function (log normalization constant):

A(\theta) = \log \int \exp\{ \langle \theta, \phi(x) \rangle \} \, \nu(dx)

• set of valid parameters Θ := {θ ∈ Rd | A(θ) < +∞}.


• will focus on regular families for which Θ is open.

17
Examples: Scalar exponential families

Family        X                 ν           log p(x; θ)              A(θ)
Bernoulli     {0, 1}            counting    θx − A(θ)                log[1 + exp(θ)]
Gaussian      R                 Lebesgue    θ_1 x + θ_2 x² − A(θ)    θ_1²/(−4θ_2) + (1/2) log[2π/(−2θ_2)]
Exponential   (0, +∞)           Lebesgue    θ(−x) − A(θ)             − log θ
Poisson       {0, 1, 2, . . .}  counting*   θx − A(θ)                exp(θ)

* counting measure weighted by h(x) = 1/x!

18
Graphical models as exponential families

• choose random variables X_s at each vertex s ∈ V from an arbitrary exponential family (e.g., Bernoulli, Gaussian, Dirichlet etc.)
• the exponential family can be the same at each node (e.g., multivariate Gaussian), or different (e.g., latent Dirichlet allocation model)

[Figure: the example graph on 7 nodes.]

Key requirement: The collection φ of sufficient statistics must respect the structure of G.

19
Example: Ising model

\phi = \{ x_s \mid s \in V \} \cup \{ x_s x_t \mid (s, t) \in E \}, \qquad I = V \cup E, \qquad \mathcal{X}^n = \{0, 1\}^n

Density (w.r.t. counting measure) of the form:

p(x; \theta) \propto \exp\Big\{ \sum_{s=1}^{n} \theta_s x_s + \sum_{(s,t) \in E} \theta_{st} x_s x_t \Big\}

Cumulant generating function (log normalization constant):

A(\theta) = \log \sum_{x \in \{0,1\}^n} \exp\Big\{ \sum_{s=1}^{n} \theta_s x_s + \sum_{(s,t) \in E} \theta_{st} x_s x_t \Big\}

20
Example: Multivariate Gaussian
U(θ): matrix of natural parameters; φ(x): matrix of sufficient statistics:

U(\theta) = \begin{bmatrix} 0 & \theta_1 & \theta_2 & \cdots & \theta_n \\ \theta_1 & \theta_{11} & \theta_{12} & \cdots & \theta_{1n} \\ \theta_2 & \theta_{21} & \theta_{22} & \cdots & \theta_{2n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \theta_n & \theta_{n1} & \theta_{n2} & \cdots & \theta_{nn} \end{bmatrix}, \qquad \phi(x) = \begin{bmatrix} 1 & x_1 & x_2 & \cdots & x_n \\ x_1 & x_1^2 & x_1 x_2 & \cdots & x_1 x_n \\ x_2 & x_2 x_1 & x_2^2 & \cdots & x_2 x_n \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ x_n & x_n x_1 & x_n x_2 & \cdots & x_n^2 \end{bmatrix}

Edgewise natural parameters θ_st = θ_ts must respect the graph structure:

[Figure: (a) a graph structure on 5 nodes; (b) the sparsity pattern of the matrix with entries [Z(θ)]_st = θ_st.]

21
Example: Latent Dirichlet Allocation model

[Figure: directed graphical model α → u → z → w, with parameter γ attached to w.]

Model components:
Dirichlet u ∼ Dir(α)
Multinomial “topic” z ∼ Mult(u)
“Word” w ∼ multinomial conditioned on z (with parameter γ)

With variables x := (u, z, w) and parameter θ := (α, γ), the density p(u; α) p(z; u) p(w | z, γ) is proportional to:

\exp\Big\{ \sum_{i=1}^{k} \alpha_i \log u_i + \sum_{i=1}^{k} \mathbb{I}_i[z] \log u_i + \sum_{i=1}^{k} \sum_{j=1}^{l} \gamma_{ij}\, \mathbb{I}_i[z]\, \mathbb{I}_j[w] \Big\}.

22
The power of conjugate duality

Conjugate duality is a fertile source of variational principles.


(Rockafellar, 1973)

• any function f can be used to define another function f* as follows:

f^*(y) := \sup_{x \in \mathbb{R}^n} \big\{ \langle y, x \rangle - f(x) \big\}.

• easy to show that f* is always a convex function

• how about taking the “dual of the dual”? I.e., what is (f*)*?

• when f is well-behaved (convex and lower semi-continuous), we have (f*)* = f, or alternatively stated:

f(x) = \sup_{y \in \mathbb{R}^n} \big\{ \langle x, y \rangle - f^*(y) \big\}

23
Geometric view: Supporting hyperplanes
Question: Given all hyperplanes in Rn × R with normal (y, −1), what
is the intercept of the one that supports epi(f )?
[Figure: the graph of f together with hyperplanes ⟨y, x⟩ − c_a and ⟨y, x⟩ − c_b, with intercepts −c_a and −c_b; the first one supports epi(f).]

Epigraph of f:  epi(f) := \{ (x, u) \in \mathbb{R}^{n+1} \mid f(x) \le u \}.

Analytically, we require the smallest c ∈ R such that:

\langle y, x \rangle - c \le f(x) \quad \text{for all } x \in \mathbb{R}^n

By re-arranging, we find that this optimal c* is the dual value:

c^* = \sup_{x \in \mathbb{R}^n} \big\{ \langle y, x \rangle - f(x) \big\}.

24
Example: Single Bernoulli
Random variable X ∈ {0, 1} yields an exponential family of the form:

p(x; \theta) \propto \exp\{\theta x\} \quad \text{with} \quad A(\theta) = \log[1 + \exp(\theta)].

Let’s compute the dual A^*(\mu) := \sup_{\theta \in \mathbb{R}} \big\{ \mu\theta - \log[1 + \exp(\theta)] \big\}.

(Possible) stationary point: µ = exp(θ)/[1 + exp(θ)].

[Figure: the function A(θ) with a supporting line ⟨µ, θ⟩ − A*(µ): (a) epigraph supported; (b) epigraph cannot be supported.]

We find that:

A^*(\mu) = \begin{cases} \mu \log \mu + (1 - \mu)\log(1 - \mu) & \text{if } \mu \in [0, 1] \\ +\infty & \text{otherwise.} \end{cases}

Leads to the variational representation: A(\theta) = \max_{\mu \in [0,1]} \big\{ \mu \cdot \theta - A^*(\mu) \big\}.
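A quick numeric check of this Bernoulli variational representation (a Python sketch assuming NumPy; the grid search over µ is purely illustrative):

```python
# Verify A(theta) = max_{mu in [0,1]} { mu*theta - A*(mu) } for the Bernoulli family.
import numpy as np

def A(theta):
    return np.log1p(np.exp(theta))                       # log[1 + exp(theta)]

def A_star(mu):
    return mu * np.log(mu) + (1 - mu) * np.log(1 - mu)   # negative Bernoulli entropy

theta = 1.3
mu_grid = np.linspace(1e-6, 1 - 1e-6, 100_000)
lhs = A(theta)
rhs = np.max(mu_grid * theta - A_star(mu_grid))

print(lhs, rhs)                            # both approximately 1.541
print(np.isclose(lhs, rhs, atol=1e-6))     # True
```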

25
More general computation of the dual A∗
• consider the definition of the dual function:

A^*(\mu) = \sup_{\theta \in \mathbb{R}^d} \big\{ \langle \mu, \theta \rangle - A(\theta) \big\}.

• taking derivatives w.r.t. θ to find a stationary point yields:

\mu - \nabla A(\theta) = 0.

• Useful fact: Derivatives of A yield mean parameters:

\frac{\partial A}{\partial \theta_\alpha}(\theta) = \mathbb{E}_\theta[\phi_\alpha(x)] := \int \phi_\alpha(x)\, p(x; \theta)\, \nu(dx).

Thus, stationary points satisfy the equation:

µ = Eθ [φ(x)] (1)

26
Computation of dual (continued)
• assume solution θ(µ) to equation (1) exists

• strict concavity of the objective guarantees that θ(µ) attains the global maximum, with value

A^*(\mu) = \langle \mu, \theta(\mu) \rangle - A(\theta(\mu)) = \mathbb{E}_{\theta(\mu)}\big[ \langle \theta(\mu), \phi(x) \rangle - A(\theta(\mu)) \big] = \mathbb{E}_{\theta(\mu)}\big[ \log p(x; \theta(\mu)) \big]

• recall the definition of entropy:

H(p(x)) := -\int \big[ \log p(x) \big]\, p(x)\, \nu(dx)

• thus, we recognize that A*(µ) = −H(p(x; θ(µ))) when equation (1) has a solution

Question: For which µ ∈ R^d does equation (1) have a solution θ(µ)?

27
Sets of realizable mean parameters

• for any distribution p(·), define a vector µ ∈ R^d of mean parameters:

\mu_\alpha := \int \phi_\alpha(x)\, p(x)\, \nu(dx)

• now consider the set M(G; φ) of all realizable mean parameters:

\mathcal{M}(G; \phi) = \Big\{ \mu \in \mathbb{R}^d \;\Big|\; \mu_\alpha = \int \phi_\alpha(x)\, p(x)\, \nu(dx) \ \text{for some } p(\cdot) \Big\}

• for discrete families, we refer to this set as a marginal polytope, denoted by MARG(G; φ)

28
Examples of M:
1. Gaussian MRF: Matrices of sufficient statistics and mean parameters:

\phi(x) = \begin{bmatrix} 1 \\ x \end{bmatrix} \begin{bmatrix} 1 & x \end{bmatrix}, \qquad U(\mu) := \mathbb{E}\left\{ \begin{bmatrix} 1 \\ x \end{bmatrix} \begin{bmatrix} 1 & x \end{bmatrix} \right\}

Semidefinite set \mathcal{M}_{Gauss} = \{ \mu \mid U(\mu) \succeq 0 \}.

[Figure: the set M_Gauss sketched in the (µ_1, µ_11) plane.]

2. Ising model: Binary vector X ∈ {0, 1}n

Sufficient statistics: φ(x) = {xs , s ∈ V } ∪ {xs xt , (s, t) ∈ E}


M(G) is the binary quadric polytope of realizable singleton and pairwise
marginal probabilities:

µs = p(Xs = 1), µst = p(Xs = 1, Xt = 1)

29
Geometry and moment mapping

[Figure: the gradient map ∇A carries θ in the parameter space Θ to µ in the set M.]

Theorem: In a regular, minimal exponential family, the gradient map ∇A is one-to-one and onto the interior of the set M.
(e.g., Barndorff-Nielsen, 1978; Brown, 1986; Efron, 1978)

30
Variational principles in terms of mean parameters
Theorem:
(a) The conjugate dual of A takes the form:

A^*(\mu) = \begin{cases} -H(p(x; \theta(\mu))) & \text{if } \mu \in \operatorname{int} \mathcal{M}(G; \phi) \\ +\infty & \text{if } \mu \notin \operatorname{cl} \mathcal{M}(G; \phi). \end{cases}

Note: Boundary behavior by lower semi-continuity.

(b) The cumulant generating function A has the representation:

\underbrace{A(\theta)}_{\text{cumulant generating func.}} = \underbrace{\sup_{\mu \in \mathcal{M}(G; \phi)} \{ \langle \theta, \mu \rangle - A^*(\mu) \}}_{\text{max. ent. problem over } \mathcal{M}}

with the max. attained at the mean parameters \hat{\mu}_\alpha = \mathbb{E}_\theta[\phi_\alpha(x)] (for all θ ∈ Θ).

(c) The problem of mode computation has the representation:

\sup_{x \in \mathcal{X}^n} \log p(x; \theta) + C = \sup_{x \in \mathcal{X}^n} \langle \theta, \phi(x) \rangle = \sup_{\mu \in \mathcal{M}(G; \phi)} \langle \theta, \mu \rangle.

31
Alternative view: Kullback-Leibler divergence
• Kullback-Leibler divergence defines “distance” between probability distributions:

D(p \,\|\, q) := \int \log\Big[ \frac{p(x)}{q(x)} \Big]\, p(x)\, \nu(dx)

• for two exponential family members p(x; θ^1) and p(x; θ^2), we have

D(p(x; \theta^1) \,\|\, p(x; \theta^2)) = A(\theta^2) - A(\theta^1) - \langle \mu^1, \theta^2 - \theta^1 \rangle

• substituting A(θ^1) = ⟨θ^1, µ^1⟩ − A*(µ^1) yields a mixed form:

D(p(x; \theta^1) \,\|\, p(x; \theta^2)) \equiv D(\mu^1 \,\|\, \theta^2) = A(\theta^2) + A^*(\mu^1) - \langle \mu^1, \theta^2 \rangle

Hence, the following two assertions are equivalent:

A(\theta^2) = \sup_{\mu^1 \in \mathcal{M}(G;\phi)} \{ \langle \theta^2, \mu^1 \rangle - A^*(\mu^1) \}

0 = \inf_{\mu^1 \in \mathcal{M}(G;\phi)} D(\mu^1 \,\|\, \theta^2)
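A small numeric check of the exponential-family form of the KL divergence for two scalar Bernoulli members (a Python sketch assuming NumPy; the parameter values are arbitrary):

```python
# Check D(p1 || p2) = A(theta2) - A(theta1) - <mu1, theta2 - theta1> for Bernoulli.
import numpy as np

A = lambda th: np.log1p(np.exp(th))           # cumulant generating function
theta1, theta2 = 0.8, -0.5
mu1 = np.exp(theta1) / (1 + np.exp(theta1))   # mean parameter under theta1
mu2 = np.exp(theta2) / (1 + np.exp(theta2))

# direct computation from the definition of the KL divergence
p1 = np.array([1 - mu1, mu1])
p2 = np.array([1 - mu2, mu2])
kl_direct = np.sum(p1 * np.log(p1 / p2))

# exponential-family / mixed form in terms of (mu1, theta1, theta2)
kl_family = A(theta2) - A(theta1) - mu1 * (theta2 - theta1)

print(np.isclose(kl_direct, kl_family))       # True
```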

32
Challenges

1. In general, mean parameter spaces M can be very difficult to characterize (e.g., multidimensional moment problems).

2. The (negative) entropy A*(µ), viewed as a function of only the mean parameters µ, typically lacks an explicit form.

Remarks:
1. Variational representation clarifies why certain models are
tractable.
2. For intractable cases, one strategy is to solve an approximate form
of the optimization problem.

33
Outline
1. Introduction and motivation
(a) Background on graphical models
(b) Some applications and challenging problems
(c) Illustrations of some variational methods

2. Exponential families and variational methods


(a) What is a variational method (and why should I care)?
(b) Graphical models as exponential families
(c) The power of conjugate duality

3. Exact techniques as variational methods


(a) Gaussian inference on arbitrary graphs
(b) Belief-propagation/sum-product on trees (e.g., Kalman filter; α-β alg.)
(c) Max-product on trees (e.g., Viterbi)

4. Approximate techniques as variational methods


(a) Mean field and variants
(b) Belief propagation and extensions
(c) Semidefinite constraints and convex relaxations

34
A(i): Multivariate Gaussian (fixed covariance)
Consider the set of all Gaussians with fixed inverse covariance Q ≻ 0.
• potentials φ(x) = {x_1, . . . , x_n} and natural parameter θ ∈ Θ = R^n.
• cumulant generating function:

A(\theta) = \log \int_{\mathbb{R}^n} \underbrace{\exp\Big\{ \sum_{s=1}^{n} \theta_s x_s \Big\}}_{\text{density}} \; \underbrace{\exp\Big\{ -\tfrac{1}{2} x^T Q x \Big\}}_{\text{base measure}} \, dx

• completing the square yields A(\theta) = \tfrac{1}{2} \theta^T Q^{-1} \theta + \text{constant}

• straightforward computation leads to the dual A^*(\mu) = \tfrac{1}{2} \mu^T Q \mu - \text{constant}
• putting the pieces back together yields the variational principle

A(\theta) = \sup_{\mu \in \mathbb{R}^n} \Big\{ \theta^T \mu - \tfrac{1}{2} \mu^T Q \mu \Big\} + \text{constant}

• optimum is uniquely obtained at the familiar Gaussian mean \hat{\mu} = Q^{-1} \theta.

35
A(ii): Multivariate Gaussian (arbitrary covariance)
• matrices of sufficient statistics, natural parameters, and mean parameters:

\phi(x) = \begin{bmatrix} 1 \\ x \end{bmatrix}\begin{bmatrix} 1 & x \end{bmatrix}, \qquad U(\theta) := \begin{bmatrix} 0 & [\theta_s] \\ [\theta_s] & [\theta_{st}] \end{bmatrix}, \qquad U(\mu) := \mathbb{E}\left\{ \begin{bmatrix} 1 \\ x \end{bmatrix}\begin{bmatrix} 1 & x \end{bmatrix} \right\}

• cumulant generating function:

A(\theta) = \log \int \exp\big\{ \langle\langle U(\theta), \phi(x) \rangle\rangle \big\}\, dx

• computing the dual function:

A^*(\mu) = -\frac{1}{2} \log \det U(\mu) - \frac{n}{2} \log 2\pi e

• exact variational principle is a log-determinant problem:

A(\theta) = \sup_{U(\mu) \succeq 0,\ [U(\mu)]_{11} = 1} \Big\{ \langle\langle U(\theta), U(\mu) \rangle\rangle + \frac{1}{2} \log \det U(\mu) \Big\} + \frac{n}{2} \log 2\pi e

• solution yields the normal equations for the Gaussian mean and covariance.

36
B: Belief propagation/sum-product on trees
• multinomial variables Xs ∈ {0, 1, . . . , ms − 1} on a tree T = (V, E)
• sufficient statistics: indicator functions for each node and edge

\mathbb{I}_j(x_s) \ \text{for } s = 1, \ldots, n,\ j \in \mathcal{X}_s; \qquad \mathbb{I}_{jk}(x_s, x_t) \ \text{for } (s, t) \in E,\ (j, k) \in \mathcal{X}_s \times \mathcal{X}_t.

• exponential representation of the distribution:

p(x; \theta) \propto \exp\Big\{ \sum_{s \in V} \theta_s(x_s) + \sum_{(s,t) \in E} \theta_{st}(x_s, x_t) \Big\}

where \theta_s(x_s) := \sum_{j \in \mathcal{X}_s} \theta_{s;j} \mathbb{I}_j(x_s) (and similarly for θ_st(x_s, x_t))

• mean parameters are simply marginal probabilities, represented as:

\mu_s(x_s) := \sum_{j \in \mathcal{X}_s} \mu_{s;j} \mathbb{I}_j(x_s), \qquad \mu_{st}(x_s, x_t) := \sum_{(j,k) \in \mathcal{X}_s \times \mathcal{X}_t} \mu_{st;jk} \mathbb{I}_{jk}(x_s, x_t)

• the marginals must belong to the following marginal polytope:

\operatorname{MARG}(T) := \Big\{ \mu \ge 0 \ \Big|\ \sum_{x_s} \mu_s(x_s) = 1, \ \sum_{x_t} \mu_{st}(x_s, x_t) = \mu_s(x_s) \Big\}

37
Decomposition of entropy for trees
• by the junction tree theorem, any tree-structured distribution can be factorized in terms of its marginals µ ≡ µ(θ) as follows:

p(x; \theta) = \prod_{s \in V} \mu_s(x_s) \prod_{(s,t) \in E} \frac{\mu_{st}(x_s, x_t)}{\mu_s(x_s)\, \mu_t(x_t)}

• taking logs and expectations leads to the following entropy decomposition:

H(p(x; \theta)) = -A^*(\mu(\theta)) = \sum_{s \in V} H_s(\mu_s) - \sum_{(s,t) \in E} I_{st}(\mu_{st})

where

H_s(\mu_s) := -\sum_{x_s} \mu_s(x_s) \log \mu_s(x_s), \qquad I_{st}(\mu_{st}) := \sum_{x_s, x_t} \mu_{st}(x_s, x_t) \log \frac{\mu_{st}(x_s, x_t)}{\mu_s(x_s)\, \mu_t(x_t)}.
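A numeric check of this decomposition on a small tree (a Python sketch assuming NumPy; the 3-node chain and its parameters are illustrative): the joint entropy equals the sum of node entropies minus the edge mutual informations.

```python
# Verify H(p) = sum_s H_s - sum_{st} I_st for a tree-structured distribution.
import itertools
import numpy as np

theta_s = {0: 0.1, 1: -0.2, 2: 0.3}
theta_st = {(0, 1): 0.5, (1, 2): -0.4}          # a 3-node chain (a tree)

# exact joint distribution over {0,1}^3
configs = list(itertools.product([0, 1], repeat=3))
w = np.array([np.exp(sum(theta_s[v] * x[v] for v in range(3)) +
                     sum(th * x[a] * x[b] for (a, b), th in theta_st.items()))
              for x in configs])
p = w / w.sum()
H_joint = -np.sum(p * np.log(p))

def marg(nodes):
    """Marginal distribution over the given subset of nodes."""
    m = {}
    for x, px in zip(configs, p):
        key = tuple(x[v] for v in nodes)
        m[key] = m.get(key, 0.0) + px
    return m

H_nodes = sum(-sum(q * np.log(q) for q in marg([s]).values()) for s in range(3))
I_edges = 0.0
for (s, t) in theta_st:
    ms, mt, mst = marg([s]), marg([t]), marg([s, t])
    I_edges += sum(q * np.log(q / (ms[(a,)] * mt[(b,)])) for (a, b), q in mst.items())

print(np.isclose(H_joint, H_nodes - I_edges))    # True, since the graph is a tree
```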

38
Exact variational principle on trees
• putting the pieces back together yields:

A(\theta) = \max_{\mu \in \operatorname{MARG}(T)} \Big\{ \langle \theta, \mu \rangle + \sum_{s \in V} H_s(\mu_s) - \sum_{(s,t) \in E(T)} I_{st}(\mu_{st}) \Big\}.

• let’s try to solve this problem by a (partial) Lagrangian formulation

• assign a Lagrange multiplier λ_ts(x_s) for each constraint C_{ts}(x_s) := \mu_s(x_s) - \sum_{x_t} \mu_{st}(x_s, x_t) = 0

• will enforce the normalization (\sum_{x_s} \mu_s(x_s) = 1) and non-negativity constraints explicitly

• the Lagrangian takes the form:

L(\mu; \lambda) = \langle \theta, \mu \rangle + \sum_{s \in V} H_s(\mu_s) - \sum_{(s,t) \in E(T)} I_{st}(\mu_{st}) + \sum_{(s,t) \in E} \Big[ \sum_{x_t} \lambda_{st}(x_t) C_{st}(x_t) + \sum_{x_s} \lambda_{ts}(x_s) C_{ts}(x_s) \Big]

39
Lagrangian derivation (continued)
• taking derivatives of the Lagrangian w.r.t. µ_s and µ_st yields

\frac{\partial L}{\partial \mu_s(x_s)} = \theta_s(x_s) - \log \mu_s(x_s) + \sum_{t \in \Gamma(s)} \lambda_{ts}(x_s) + C

\frac{\partial L}{\partial \mu_{st}(x_s, x_t)} = \theta_{st}(x_s, x_t) - \log \frac{\mu_{st}(x_s, x_t)}{\mu_s(x_s)\, \mu_t(x_t)} - \lambda_{ts}(x_s) - \lambda_{st}(x_t) + C'

• setting these partial derivatives to zero and simplifying:

\mu_s(x_s) \propto \exp\{ \theta_s(x_s) \} \prod_{t \in \Gamma(s)} \exp\{ \lambda_{ts}(x_s) \}

\mu_{st}(x_s, x_t) \propto \exp\{ \theta_s(x_s) + \theta_t(x_t) + \theta_{st}(x_s, x_t) \} \prod_{u \in \Gamma(s) \setminus t} \exp\{ \lambda_{us}(x_s) \} \prod_{v \in \Gamma(t) \setminus s} \exp\{ \lambda_{vt}(x_t) \}

• enforcing the constraint C_ts(x_s) = 0 on these representations yields the familiar update rule for the messages M_ts(x_s) = exp(λ_ts(x_s)):

M_{ts}(x_s) \leftarrow \sum_{x_t} \exp\{ \theta_t(x_t) + \theta_{st}(x_s, x_t) \} \prod_{u \in \Gamma(t) \setminus s} M_{ut}(x_t)

40
C: Max-product (belief revision) on trees

Question: What should be the form of a variational principle for computing modes?

Intuition: Consider behavior of the family {p(x; βθ) | β > 0}.

[Figure: the family p(x; βθ) on [0, 1] at (a) low β and (b) high β.]

Conclusion: Problem of computing modes should be related to the limiting form (β → +∞) of computing marginals.

41
Limiting form of variational principle (on trees)
• consider the tree-structured variational principle for p(x; βθ):

\frac{1}{\beta} A(\beta\theta) = \frac{1}{\beta} \max_{\mu \in \operatorname{MARG}(T)} \Big\{ \langle \beta\theta, \mu \rangle + \sum_{s \in V} H_s(\mu_s) - \sum_{(s,t) \in E(T)} I_{st}(\mu_{st}) \Big\}.

• taking limits as β → +∞ yields:

\underbrace{\max_{x \in \mathcal{X}^N} \Big\{ \sum_{s \in V} \theta_s(x_s) + \sum_{(s,t) \in E} \theta_{st}(x_s, x_t) \Big\}}_{\text{computation of modes}} \;=\; \underbrace{\max_{\mu \in \operatorname{MARG}(T)} \langle \theta, \mu \rangle}_{\text{linear program}}. \qquad (2)

• recall the max-product (belief revision) updates:

M_{ts}(x_s) \leftarrow \max_{x_t} \exp\{ \theta_t(x_t) + \theta_{st}(x_s, x_t) \} \prod_{u \in \Gamma(t) \setminus s} M_{ut}(x_t)

• the RHS of equation (2) is a linear program: a similar Lagrangian formulation shows that max-product is an iterative method for solving it (details in Wainwright & Jordan, 2003)
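A minimal Python sketch of these max-product updates on a small binary chain (assuming NumPy; the parameters are illustrative and chosen so that the mode is unique), compared against the brute-force mode:

```python
# Max-product on a 3-node chain 0 - 1 - 2, compared to the exact mode.
import itertools
import numpy as np

theta_s = {0: 0.2, 1: -0.1, 2: 0.3}
theta_st = {(0, 1): 0.6, (1, 2): -0.4}
nbrs = {0: [1], 1: [0, 2], 2: [1]}

def coupling(s, t):
    return theta_st.get((s, t), theta_st.get((t, s)))

# messages M_{t->s}(x_s): same form as sum-product, with "max" replacing the sum
msg = {(t, s): np.ones(2) for s in nbrs for t in nbrs[s]}
for _ in range(10):
    msg = {(t, s): np.array([
        max(np.exp(theta_s[t] * xt + coupling(s, t) * xs * xt) *
            np.prod([msg[(u, t)][xt] for u in nbrs[t] if u != s])
            for xt in (0, 1))
        for xs in (0, 1)]) for (t, s) in msg}

# argmax of each max-marginal gives the mode (when the mode is unique)
mode_mp = [int(np.argmax([np.exp(theta_s[s] * xs) *
                          np.prod([msg[(t, s)][xs] for t in nbrs[s]])
                          for xs in (0, 1)]))
           for s in sorted(nbrs)]

# brute-force mode for comparison
def score(x):
    return (sum(theta_s[v] * x[v] for v in range(3)) +
            sum(th * x[a] * x[b] for (a, b), th in theta_st.items()))

mode_exact = max(itertools.product([0, 1], repeat=3), key=score)
print(mode_mp, list(mode_exact))    # both [1, 1, 0]
```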

42
Outline
1. Introduction and motivation
(a) Background on graphical models
(b) Some applications and challenging problems
(c) Illustrations of some variational methods

2. Exponential families and variational methods


(a) What is a variational method (and why should I care)?
(b) Graphical models as exponential families
(c) The power of conjugate duality

3. Exact techniques as variational methods


(a) Gaussian inference on arbitrary graphs
(b) Belief-propagation/sum-product on trees (e.g., Kalman filter; α-β alg.)
(c) Max-product on trees (e.g., Viterbi)

4. Approximate techniques as variational methods


(a) Mean field and variants
(b) Belief propagation and extensions
(c) Semidefinite constraints and convex relaxations

43
A: Mean field theory

Difficulty: (typically) no explicit form for −A*(µ) (i.e., entropy as a function of mean parameters) =⇒ exact variational principle is intractable.

Idea: Restrict µ to a subset of distributions for which −A*(µ) has a tractable form.

Examples:
(a) For product distributions p(x) = \prod_{s \in V} \mu_s(x_s), the entropy decomposes as -A^*(\mu) = \sum_{s \in V} H_s(\mu_s).
(b) Similarly, for trees (more generally, decomposable graphs), the junction tree theorem yields an explicit form for −A*(µ).

Definition: A subgraph H of G is tractable if the entropy has an explicit form for any distribution that respects H.

44
Geometry of mean field
• let H represent a tractable subgraph (i.e., for which A* has an explicit form)

• let M_tr(G; H) represent tractable mean parameters:

\mathcal{M}_{tr}(G; H) := \{ \mu \mid \mu = \mathbb{E}_\theta[\phi(x)] \ \text{s.t. θ respects } H \}.

[Figure: the 3 × 3 grid G and a tractable subgraph H obtained by removing edges.]

• under mild conditions, M_tr is a non-convex inner approximation to M
• optimizing over M_tr (as opposed to M) yields a lower bound:

A(\theta) \ge \sup_{\tilde{\mu} \in \mathcal{M}_{tr}} \big\{ \langle \theta, \tilde{\mu} \rangle - A^*(\tilde{\mu}) \big\}.

[Figure: the set M with the non-convex inner approximation M_tr and a point \tilde{\mu} ∈ M_tr.]

45
Alternative view: Minimizing KL divergence

• recall the mixed form of the KL divergence between p(x; \tilde{\theta}) and p(x; θ):

D(\tilde{\mu} \,\|\, \theta) = A(\theta) + A^*(\tilde{\mu}) - \langle \tilde{\mu}, \theta \rangle

• try to find the “best” approximation to p(x; θ) in the sense of KL divergence

• in analytical terms, the problem of interest is

\inf_{\tilde{\mu} \in \mathcal{M}_{tr}} D(\tilde{\mu} \,\|\, \theta) = A(\theta) + \inf_{\tilde{\mu} \in \mathcal{M}_{tr}} \big\{ A^*(\tilde{\mu}) - \langle \tilde{\mu}, \theta \rangle \big\}

• hence, finding the tightest lower bound on A(θ) is equivalent to finding the best approximation to p(x; θ) from distributions with \tilde{\mu} \in \mathcal{M}_{tr}

46
Example: Naive mean field algorithm for Ising model
• consider the completely disconnected subgraph H = (V, ∅)
• permissible exponential parameters belong to the subspace E(H) = \{ \theta \in \mathbb{R}^d \mid \theta_{st} = 0 \ \forall (s, t) \in E \}
• allowed distributions take the product form p(x; \theta) = \prod_{s \in V} p(x_s; \theta_s), and generate

\mathcal{M}_{tr}(G; H) = \{ \mu \mid \mu_{st} = \mu_s \mu_t,\ \mu_s \in [0, 1] \}.

• approximate variational principle:

\max_{\mu_s \in [0,1]} \Big\{ \sum_{s \in V} \theta_s \mu_s + \sum_{(s,t) \in E} \theta_{st} \mu_s \mu_t - \sum_{s \in V} \big[ \mu_s \log \mu_s + (1 - \mu_s) \log(1 - \mu_s) \big] \Big\}.

• Coordinate ascent: with all {µ_t, t ≠ s} fixed, the problem is strictly concave in µ_s and the optimum is attained at

\mu_s \longleftarrow \Big\{ 1 + \exp\Big[ -\Big( \theta_s + \sum_{t \in N(s)} \theta_{st} \mu_t \Big) \Big] \Big\}^{-1}
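A minimal Python sketch of these coordinate-ascent updates on a small Ising model (assuming NumPy; the 3-cycle and its parameter values are illustrative, not from the slides):

```python
# Naive mean-field coordinate ascent for a small Ising model.
import numpy as np

theta_s = np.array([0.1, -0.2, 0.3])
theta_st = {(0, 1): 0.5, (1, 2): -0.4, (0, 2): 0.3}   # a 3-cycle, for illustration
nbrs = {0: [1, 2], 1: [0, 2], 2: [0, 1]}

def coupling(s, t):
    return theta_st.get((s, t), theta_st.get((t, s)))

mu = np.full(3, 0.5)                  # variational parameters mu_s in (0, 1)
for _ in range(200):                  # sweep the coordinate updates until convergence
    for s in range(3):
        field = theta_s[s] + sum(coupling(s, t) * mu[t] for t in nbrs[s])
        mu[s] = 1.0 / (1.0 + np.exp(-field))

print(mu)   # fixed point of the naive mean-field updates (approximate marginals)
```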

47
Example: Structured mean field for coupled HMM

[Figure: (a) a coupled HMM; (b) a tractable subgraph H consisting of decoupled chains.]

• entropy of a distribution that respects H decouples into a sum: one term for each chain.
• structured mean field updates are an iterative method for finding
the tightest approximation (either in terms of KL or lower bound)

48
B: Belief propagation on arbitrary graphs

Two main ingredients:

1. The exact entropy −A*(µ) is intractable, so let’s approximate it. The Bethe approximation A*_Bethe(µ) ≈ A*(µ) is based on the exact expression for trees:

-A^*_{\mathrm{Bethe}}(\mu) = \sum_{s \in V} H_s(\mu_s) - \sum_{(s,t) \in E} I_{st}(\mu_{st}).

2. The marginal polytope MARG(G) is also difficult to characterize, so let’s use the following (tree-based) outer bound:

\operatorname{LOCAL}(G) := \Big\{ \tau \ge 0 \ \Big|\ \sum_{x_s} \tau_s(x_s) = 1, \ \sum_{x_t} \tau_{st}(x_s, x_t) = \tau_s(x_s) \Big\}

Note: We use τ to distinguish these locally consistent pseudomarginals from globally consistent marginals.

49
Geometry of belief propagation

• combining these ingredients leads to the Bethe variational principle:

\max_{\tau \in \operatorname{LOCAL}(G)} \Big\{ \langle \theta, \tau \rangle + \sum_{s \in V} H_s(\tau_s) - \sum_{(s,t) \in E} I_{st}(\tau_{st}) \Big\}

• belief propagation can be derived as an iterative method for solving a Lagrangian formulation of the BVP (Yedidia et al., 2002)

• belief propagation uses a polyhedral outer approximation to M
• for any graph, LOCAL(G) ⊇ MARG(G).
• equality holds ⇐⇒ G is a tree.

[Figure: the marginal polytope MARG(G) contained in the outer polytope LOCAL(G), with an interior point µ_int and a fractional vertex µ_frac.]

50
Illustration: Globally inconsistent BP fixed points
Consider the following assignment of pseudomarginals τ_s, τ_st:

[Figure: a 3-cycle on nodes 1, 2, 3, with locally consistent (pseudo)marginals τ_s and τ_st attached to its nodes and edges.]

• can verify that τ ∈ LOCAL(G), and that τ is a fixed point of belief propagation (with all constant messages)
• however, τ is globally inconsistent

Note: More generally, for any τ in the interior of LOCAL(G), one can construct a distribution with τ as a BP fixed point.

51
High-level perspective

• message-passing algorithms (e.g., mean field, belief propagation) are solving approximate versions of the exact variational principle in exponential families
• there are two distinct components to approximations:
(a) can use either inner or outer bounds to M
(b) various approximations to entropy function −A∗ (µ)

• mean field: non-convex inner bound and exact form of entropy

• BP: polyhedral outer bound and non-convex Bethe approximation

• Kikuchi and variants: tighter polyhedral outer bounds and better entropy approximations (e.g., Yedidia et al., 2002)

52
Generalized belief propagation on hypergraphs

• a hypergraph is a natural generalization of a graph


• it consists of a set of vertices V and a set E of hyperedges, where
each hyperedge is a subset of V
• convenient graphical representation in terms of poset diagrams

[Figure: poset diagrams of (a) an ordinary graph, (b) a hypertree of width 2, and (c) a hypergraph.]


• descendants and ancestors of a hyperedge h:

D+ (h) := {g ∈ E | g ⊆ h }, A+ (h) := {g ∈ E | g ⊇ h }.

53
Hypertree factorization and entropy
• hypertrees are an alternative way to describe junction trees

• associated with any poset is a Möbius function ω : E × E → Z

\omega(g, g) = 1, \qquad \omega(g, h) = -\sum_{\{ f \mid g \subseteq f \subset h \}} \omega(g, f)

Example: For the Boolean poset, \omega(g, h) = (-1)^{|h| - |g|}.

• use the Möbius function to define a correspondence between the collection of marginals µ := {µ_h} and a new set of functions ϕ := {ϕ_h}:

\log \varphi_h(x_h) = \sum_{g \in D^+(h)} \omega(g, h) \log \mu_g(x_g), \qquad \log \mu_h(x_h) = \sum_{g \in D^+(h)} \log \varphi_g(x_g).

• any hypertree-structured distribution is guaranteed to factor as:

p(x) = \prod_{h \in E} \varphi_h(x_h).

54
Examples: Hypertree factorization

1. Ordinary tree:

\varphi_s(x_s) = \mu_s(x_s) \ \text{for any vertex } s, \qquad \varphi_{st}(x_s, x_t) = \frac{\mu_{st}(x_s, x_t)}{\mu_s(x_s)\, \mu_t(x_t)} \ \text{for any edge } (s, t)

2. Hypertree (hyperedges 1245, 2356, 4578 with separators 25, 45, 56, 58, 5):

\varphi_{1245} = \frac{\mu_{1245}}{\frac{\mu_{25}}{\mu_5}\,\frac{\mu_{45}}{\mu_5}\,\mu_5}, \qquad \varphi_{45} = \frac{\mu_{45}}{\mu_5}, \qquad \varphi_5 = \mu_5

Combining the pieces:

p = \frac{\mu_{1245}}{\frac{\mu_{25}}{\mu_5}\frac{\mu_{45}}{\mu_5}\mu_5} \cdot \frac{\mu_{2356}}{\frac{\mu_{25}}{\mu_5}\frac{\mu_{56}}{\mu_5}\mu_5} \cdot \frac{\mu_{4578}}{\frac{\mu_{45}}{\mu_5}\frac{\mu_{58}}{\mu_5}\mu_5} \cdot \frac{\mu_{25}}{\mu_5}\,\frac{\mu_{45}}{\mu_5}\,\frac{\mu_{56}}{\mu_5}\,\frac{\mu_{58}}{\mu_5}\,\mu_5 = \frac{\mu_{1245}\,\mu_{2356}\,\mu_{4578}}{\mu_{25}\,\mu_{45}}

55
Building augmented hypergraphs
Better entropy approximations via augmented hypergraphs.

[Figure: hypergraphs built by augmenting the 3 × 3 grid: (a) original graph; (b) clustering into groups 1245, 2356, 4578, 5689; (c) full covering; (d) Kikuchi construction; (e) a construction that fails single counting.]

56
C. Convex relaxations
Possible concerns with the Bethe/Kikuchi problems and variations?

(a) lack of convexity ⇒ multiple local optima, and substantial algorithmic complications
(b) failure to bound the log partition function

Goal: Techniques for approximate computation of marginals and parameter estimation based on:
(a) convex variational problems ⇒ unique global optimum
(b) relaxations of exact problem ⇒ upper bounds on A(θ)
Usefulness of bounds:
(a) interval estimates for marginals
(b) approximate parameter estimation
(c) large deviations (prob. of rare events)

57
Bounds from “convexified” Bethe/Kikuchi problems
Idea: Upper bound −A∗ (µ) by convex combination of tree-structured
entropies.

[Figure: a graph with cycles and three spanning trees T^1, T^2, T^3.]

-A^*(\mu) \le -\rho(T^1)\, A^*(\mu(T^1)) - \rho(T^2)\, A^*(\mu(T^2)) - \rho(T^3)\, A^*(\mu(T^3))

• given any spanning tree T, define the moment-matched tree distribution:

p(x; \mu(T)) := \prod_{s \in V} \mu_s(x_s) \prod_{(s,t) \in E(T)} \frac{\mu_{st}(x_s, x_t)}{\mu_s(x_s)\, \mu_t(x_t)}

• use −A∗ (µ(T )) to denote the associated tree entropy


• let ρ = {ρ(T )} be a probability distribution over spanning trees

58
Edge appearance probabilities

Experiment: What is the probability ρ_e that a given edge e ∈ E belongs to a tree T drawn randomly under ρ?

[Figure: (a) the original graph with edges labelled b, e, f; (b)–(d) three spanning trees T^1, T^2, T^3, each with weight ρ(T^i) = 1/3.]

In this example: ρ_b = 1; ρ_e = 2/3; ρ_f = 1/3.

The vector ρ_e = { ρ_e | e ∈ E } must belong to the spanning tree polytope, denoted T(G). (Edmonds, 1971)

59
Optimal bounds by tree-reweighted message-passing
Recall the constraint set of locally consistent marginal distributions:

\operatorname{LOCAL}(G) = \Big\{ \tau \ge 0 \ \Big|\ \underbrace{\sum_{x_s} \tau_s(x_s) = 1}_{\text{normalization}}, \ \underbrace{\sum_{x_s} \tau_{st}(x_s, x_t) = \tau_t(x_t)}_{\text{marginalization}} \Big\}.

Theorem: (Wainwright, Jaakkola, & Willsky, UAI 2002)

(a) For any given edge weights ρ_e = {ρ_e} in the spanning tree polytope, the optimal upper bound over all tree parameters is given by:

A(\theta) \le \max_{\tau \in \operatorname{LOCAL}(G)} \Big\{ \langle \theta, \tau \rangle + \sum_{s \in V} H_s(\tau_s) - \sum_{(s,t) \in E} \rho_{st} I_{st}(\tau_{st}) \Big\}.

(b) This optimization problem is strictly convex, and its unique optimum is specified by the fixed point of ρ_e-reweighted message passing:

M_{ts}(x_s) = \kappa \sum_{x'_t \in \mathcal{X}_t} \exp\Big[ \frac{\theta_{st}(x_s, x'_t)}{\rho_{st}} + \theta_t(x'_t) \Big] \, \frac{\prod_{v \in \Gamma(t) \setminus s} \big[ M^*_{vt}(x'_t) \big]^{\rho_{vt}}}{\big[ M^*_{st}(x'_t) \big]^{(1 - \rho_{ts})}}.

60
Upper bounds on lattice model

[Figure: relative error vs. coupling strength (0 to 1) on a lattice model, comparing the optimized upper bound (“Opt. Upper”), the doubly optimized upper bound (“Doub. Opt. Upper”), and the mean-field lower bound (“MF Lower”); relative errors lie within roughly ±0.04.]

61
Upper bounds on fully connected models

[Figure: relative error vs. coupling strength (0 to 1) on fully connected models, comparing “Opt. Upper”, “Doub. Opt. Upper”, and “MF Lower”; relative errors lie within roughly ±1.5.]

62
Semidefinite constraints in convex relaxations

Fact: Belief propagation and its hypergraph-based generalizations all involve polyhedral (i.e., linear) outer bounds on the marginal polytope.

Idea: Use semidefinite constraints to generate more global outer bounds.

Example: For the Ising model, relevant mean parameters are µ_s = p(X_s = 1) and µ_st = p(X_s = 1, X_t = 1).

Define y = [1 x]^T, and consider the second-order moment matrix:

\mathbb{E}[y y^T] = \begin{bmatrix} 1 & \mu_1 & \mu_2 & \cdots & \mu_n \\ \mu_1 & \mu_1 & \mu_{12} & \cdots & \mu_{1n} \\ \mu_2 & \mu_{12} & \mu_2 & \cdots & \mu_{2n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \mu_n & \mu_{n1} & \mu_{n2} & \cdots & \mu_n \end{bmatrix}

It must be positive semidefinite, which imposes (an infinite number of) linear constraints on µ_s, µ_st.

63
Illustrative example

[Figure: a 3-cycle on nodes 1, 2, 3 with locally consistent (pseudo)marginals.]

Second-order moment matrix:

\begin{bmatrix} \mu_1 & \mu_{12} & \mu_{13} \\ \mu_{21} & \mu_2 & \mu_{23} \\ \mu_{31} & \mu_{32} & \mu_3 \end{bmatrix} = \begin{bmatrix} 0.5 & 0.4 & 0.1 \\ 0.4 & 0.5 & 0.4 \\ 0.1 & 0.4 & 0.5 \end{bmatrix}

Not positive-semidefinite!
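A one-line check in Python (assuming NumPy) that the matrix above indeed fails positive semidefiniteness:

```python
# The smallest eigenvalue of the pseudomarginal moment matrix is negative.
import numpy as np

M = np.array([[0.5, 0.4, 0.1],
              [0.4, 0.5, 0.4],
              [0.1, 0.4, 0.5]])
print(np.linalg.eigvalsh(M))    # smallest eigenvalue < 0, so M is not PSD
```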

64
Log-determinant relaxation
• based on optimizing over covariance matrices M1 (µ) ∈ SDEF1 (Kn )

Theorem: Consider an outer bound OUT(Kn ) that satisfies:

MARG(Kn ) ⊆ OUT(Kn ) ⊆ SDEF1 (Kn )

For any such outer bound, A(θ) is upper bounded by:


 
\max_{\mu \in \operatorname{OUT}(K_n)} \Big\{ \langle \theta, \mu \rangle + \frac{1}{2} \log \det\Big[ M_1(\mu) + \frac{1}{3} \operatorname{blkdiag}[0, I_n] \Big] \Big\} + \frac{n}{2} \log\Big( \frac{\pi e}{2} \Big)

Remarks:
1. Log-det. problem can be solved efficiently by interior point methods.
2. Relevance for applications:
(a) Upper bound on A(θ).
(b) Method for computing approximate marginals.

(Wainwright & Jordan, 2003)

65
Results for approximating marginals

[Figure: average error in approximate marginals for the log-determinant relaxation (LD) and belief propagation (BP), under weak and strong couplings of repulsive (−), mixed (+/−), and attractive (+) type: (a) nearest-neighbor grid; (b) fully connected graph.]

• average ℓ1 error in approximate marginals over 100 trials

• coupling types: repulsive (−), mixed (+/−), attractive (+)

66
Summary and future directions

• variational methods are based on converting computational tasks to optimization problems:
(a) complementary to sampling-based methods (e.g., MCMC)
(b) a variety of new “relaxations” remain to be explored

• many open questions:


(a) prior error bounds available only in special cases
(b) extension to non-parametric settings?
(c) hybrid techniques (variational and MCMC)
(d) variational methods in parameter estimation
(e) fast techniques for solving large-scale relaxations (e.g., SDPs,
other convex programs)

67
