
Graphical models, exponential families, and variational methods

Martin Wainwright
Departments of Statistics, and
Electrical Engineering and Computer Science,
UC Berkeley

Email: [email protected]

Tutorial slides based on joint paper with Michael Jordan


Paper at: www.eecs.berkeley.edu/~wainwrig/WaiJorVariational03.ps

1
Introduction

• graphical models are used and studied in various applied statistical and computational fields:
– machine learning and artificial intelligence
– computational biology
– statistical signal/image processing
– communication and information theory
– statistical physics
– .....

• based on correspondences between graph theory and probability theory

• important but difficult problems:


– computing likelihoods, marginal distributions, modes
– estimating model parameters and structure from (noisy) data

2
Outline
1. Introduction and motivation
(a) Background on graphical models
(b) Some applications and challenging problems
(c) Illustrations of some variational methods

2. Exponential families and variational methods


(a) What is a variational method (and why should I care)?
(b) Graphical models as exponential families
(c) The power of conjugate duality

3. Exact techniques as variational methods


(a) Gaussian inference on arbitrary graphs
(b) Belief-propagation/sum-product on trees (e.g., Kalman filter; α-β alg.)
(c) Max-product on trees (e.g., Viterbi)

4. Approximate techniques as variational methods


(a) Mean field and variants
(b) Belief propagation and extensions on graphs with cycles
(c) Semidefinite constraints and convex relaxations

3
Undirected graphical models
Based on correspondences between graphs and random variables.
• given an undirected graph G = (V, E), associate to each node s a random variable X_s
• for each subset A ⊆ V, define X_A := {x_s, s ∈ A}.

[Figure: a 7-node graph with maximal cliques (123), (345), (456), (47), and a vertex cutset S whose removal separates subsets A and B.]
• a clique C ⊆ V is a subset of vertices all joined by edges
• a vertex cutset is a subset S ⊂ V whose removal breaks the graph
into two or more pieces

4
Factorization and Markov properties

The graph G can be used to impose constraints on the random vector X = X_V (or on the distribution p) in different ways.

Markov property: X is Markov w.r.t. G if X_A and X_B are conditionally independent given X_S whenever S separates A and B.

Factorization: The distribution p factorizes according to G if it can be expressed as a product over cliques:

p(x) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C)

where \psi_C is the compatibility function on clique C.

Theorem (Hammersley-Clifford): For strictly positive p(·), the Markov property and the factorization property are equivalent.

5
Example: Hidden Markov models

[Figure: (a) a hidden Markov model with hidden chain X_1 → X_2 → · · · → X_T and observations Y_1, . . . , Y_T; (b) a coupled HMM.]

• HMMs are widely used in various applications:
– discrete X_t: computational biology, speech processing, etc.
– Gaussian X_t: control theory, signal processing, etc.

• frequently wish to solve the smoothing problem of computing p(x_t | y_1, . . . , y_T)

• exact computation in HMMs is tractable, but coupled HMMs require algorithms for approximate computation (e.g., structured mean field)

6
Example: Graphical codes for communication
Goal: Achieve reliable communication over a noisy channel.

[Figure: communication pipeline source → Encoder → noisy channel → Decoder, e.g., 0 → 00000 → 10010 → 00000; X is the transmitted codeword, Y the channel output, X̂ the decoded estimate.]

• wide variety of applications: satellite communication, sensor networks, computer memory, neural communication

• error-control codes based on careful addition of redundancy, with their fundamental limits determined by Shannon theory

• key implementational issues: efficient construction, encoding and decoding

• very active area of current research: graphical codes (e.g., turbo codes, low-density parity check codes) and iterative message-passing algorithms (belief propagation; max-product)

7
Graphical codes and decoding
Parity check matrix:

H = \begin{bmatrix} 1 & 0 & 1 & 0 & 1 & 0 & 1 \\ 0 & 1 & 1 & 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 1 & 1 & 1 & 1 \end{bmatrix}

Codeword: [0 1 0 1 0 1 0]
Non-codeword: [0 0 0 0 0 1 1]

[Figure: factor graph with variable nodes x_1, . . . , x_7 connected to parity-check factors ψ_1357, ψ_2367, ψ_4567.]

• Decoding: requires finding the maximum likelihood codeword:

\hat{x}_{ML} = \arg\max_{x} p(y \mid x) \quad \text{s.t. } Hx = 0 \ (\mathrm{mod}\ 2).
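Codeword membership can be checked directly from H. A minimal Python sketch (assuming NumPy; not part of the original slides) that verifies the codeword and non-codeword shown above:

```python
# Check H x = 0 (mod 2) for the parity check matrix and vectors above.
import numpy as np

H = np.array([[1, 0, 1, 0, 1, 0, 1],
              [0, 1, 1, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 1, 1]])

codeword     = np.array([0, 1, 0, 1, 0, 1, 0])
non_codeword = np.array([0, 0, 0, 0, 0, 1, 1])

print((H @ codeword) % 2)        # [0 0 0]  -> all parity checks satisfied
print((H @ non_codeword) % 2)    # [1 0 0]  -> first check violated, not a codeword
```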

• use of belief propagation as an approximate decoder has revolutionized the field of error-control coding

8
Challenging computational problems

Frequently, it is of interest to compute various quantities associated with an undirected graphical model:

(a) the log normalization constant log Z

(b) local marginal distributions or other local statistics

(c) modes or most probable configurations

Relevant dimensions often grow rapidly in graph size =⇒ major computational challenges.

Example: Consider a naive approach to computing the normalization constant for binary random variables:

Z = \sum_{x \in \{0,1\}^n} \prod_{C \in \mathcal{C}} \psi_C(x_C)

Complexity scales exponentially as 2^n.
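To make the scaling concrete, here is a minimal Python sketch (assuming NumPy; the small pairwise model and its parameters are purely illustrative, not from the slides) that computes Z by brute-force enumeration; the 2^n-term sum below is exactly what becomes infeasible as n grows:

```python
# Brute-force normalization constant for a small binary pairwise model.
import itertools
import numpy as np

n = 10
rng = np.random.default_rng(0)
theta_s = rng.normal(size=n)                        # node parameters
theta_st = {(s, s + 1): 0.5 for s in range(n - 1)}  # chain of pairwise couplings

def unnorm(x):
    """Unnormalized probability exp(sum_s theta_s x_s + sum_{st} theta_st x_s x_t)."""
    val = np.dot(theta_s, x)
    val += sum(th * x[s] * x[t] for (s, t), th in theta_st.items())
    return np.exp(val)

Z = sum(unnorm(np.array(x)) for x in itertools.product([0, 1], repeat=n))
print(Z)   # feasible for n = 10 (1024 terms), hopeless for n in the hundreds
```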

9
Gibbs sampling in the Ising model

• binary variables on a graph G = (V, E) with pairwise interactions:

p(x; \theta) \propto \exp\Big( \sum_{s \in V} \theta_s x_s + \sum_{(s,t) \in E} \theta_{st} x_s x_t \Big)

Update x_s^{(m+1)} stochastically based on the values x_{N(s)}^{(m)} at the neighbors of s:

1. Choose s ∈ V at random.
2. Sample u ∼ U(0, 1) and update

x_s^{(m+1)} = \begin{cases} 1 & \text{if } u \le \big\{ 1 + \exp\big[ -\big( \theta_s + \sum_{t \in N(s)} \theta_{st} x_t^{(m)} \big) \big] \big\}^{-1} \\ 0 & \text{otherwise} \end{cases}
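A minimal Python sketch of this Gibbs sampler (assuming NumPy; the small chain graph, parameter values, and function names are illustrative, not from the slides):

```python
# Gibbs sampling for an Ising model with pairwise couplings.
import numpy as np

def gibbs_ising(theta_s, theta_st, neighbors, num_sweeps=5000, rng=None):
    """theta_s: node parameters; theta_st: dict {(s, t): coupling} with s < t;
    neighbors: dict {s: list of neighbors of s}."""
    rng = rng or np.random.default_rng(0)
    n = len(theta_s)
    x = rng.integers(0, 2, size=n)                      # random binary initialization
    for _ in range(num_sweeps * n):
        s = rng.integers(n)                             # 1. choose s at random
        field = theta_s[s] + sum(
            theta_st[(min(s, t), max(s, t))] * x[t] for t in neighbors[s])
        p_one = 1.0 / (1.0 + np.exp(-field))            # conditional P(x_s = 1 | x_N(s))
        x[s] = 1 if rng.uniform() <= p_one else 0       # 2. sample u and update
    return x

# Example: a 3-node chain 0 - 1 - 2
theta_s = np.array([0.1, -0.2, 0.3])
theta_st = {(0, 1): 0.5, (1, 2): 0.5}
neighbors = {0: [1], 1: [0, 2], 2: [1]}
print(gibbs_ising(theta_s, theta_st, neighbors))        # one sample after many sweeps
```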

10
Mean field updates in the Ising model

• binary variables on a graph G = (V, E) with pairwise interactions:

p(x; \theta) \propto \exp\Big( \sum_{s \in V} \theta_s x_s + \sum_{(s,t) \in E} \theta_{st} x_s x_t \Big)

• simple (deterministic) message-passing algorithm involving variational parameters ν_s ∈ (0, 1) at each node

[Figure: a node s with neighbors t ∈ N(s) carrying variational parameters ν_t.]

1. Choose s ∈ V at random.
2. Update ν_s based on the neighbor values {ν_t, t ∈ N(s)}:

\nu_s \longleftarrow \Big\{ 1 + \exp\Big[ -\Big( \theta_s + \sum_{t \in N(s)} \theta_{st} \nu_t \Big) \Big] \Big\}^{-1}

Questions: • principled derivation? • convergence and accuracy?

11
Sum-product (belief-propagation) in the Ising model

• alternative set of message-passing updates (motivated by exactness for trees)

[Figure: node s and a neighbor t, with subtrees T_t, T_u, T_v, T_w hanging off the neighbors and messages ν_ts, ν_ut, ν_vt, ν_wt flowing toward s.]

1. For each (direction of each) edge, update the message:

\nu_{ts}(x_s) \leftarrow \sum_{x_t = 0}^{1} \exp(\theta_t x_t + \theta_{st} x_s x_t) \prod_{u \in N(t) \setminus s} \nu_{ut}(x_t)

2. Upon convergence, compute the approximation to the marginal:

p(x_s) \propto \exp(\theta_s x_s) \prod_{t \in N(s)} \nu_{ts}(x_s).

• for any tree (i.e., no cycles), updates will converge (after a finite
number of steps), and yield exact marginals (cf. Pearl, 1988)

• behavior for graphs with cycles?
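A minimal Python sketch of these sum-product updates on a small tree-structured Ising model (assuming NumPy; the 3-node chain, its parameters, and the helper names are illustrative), checked against brute-force marginals:

```python
# Sum-product on a 3-node chain 0 - 1 - 2 (a tree), compared to exact marginals.
import itertools
import numpy as np

theta_s = {0: 0.1, 1: -0.2, 2: 0.3}
theta_st = {(0, 1): 0.5, (1, 2): -0.4}
nbrs = {0: [1], 1: [0, 2], 2: [1]}

def coupling(s, t):
    return theta_st.get((s, t), theta_st.get((t, s)))

# messages nu_{t->s}(x_s), initialized to 1 and iterated to convergence
msg = {(t, s): np.ones(2) for s in nbrs for t in nbrs[s]}
for _ in range(20):
    new = {}
    for (t, s) in msg:
        m = np.zeros(2)
        for xs in (0, 1):
            for xt in (0, 1):
                incoming = np.prod([msg[(u, t)][xt] for u in nbrs[t] if u != s])
                m[xs] += np.exp(theta_s[t] * xt + coupling(s, t) * xs * xt) * incoming
        new[(t, s)] = m / m.sum()        # normalize for numerical stability
    msg = new

def bp_marginal(s):
    b = np.array([np.exp(theta_s[s] * xs) *
                  np.prod([msg[(t, s)][xs] for t in nbrs[s]]) for xs in (0, 1)])
    return b / b.sum()

def brute_marginal(s):
    p = np.zeros(2)
    for x in itertools.product([0, 1], repeat=3):
        w = np.exp(sum(theta_s[v] * x[v] for v in range(3)) +
                   sum(th * x[a] * x[b] for (a, b), th in theta_st.items()))
        p[x[s]] += w
    return p / p.sum()

print(bp_marginal(1), brute_marginal(1))   # agree, since the graph is a tree
```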

12
Outline
1. Introduction and motivation
(a) Background on graphical models
(b) Some applications and challenging problems
(c) Illustrations of some variational methods

2. Exponential families and variational methods


(a) What is a variational method (and why should I care)?
(b) Graphical models as exponential families
(c) The power of conjugate duality

3. Exact techniques as variational methods


(a) Gaussian inference on arbitrary graphs
(b) Belief-propagation/sum-product on trees (e.g., Kalman filter; α-β alg.)
(c) Max-product on trees (e.g., Viterbi)

4. Approximate techniques as variational methods


(a) Mean field and variants
(b) Belief propagation and extensions
(c) Semidefinite constraints and convex relaxations

13
Variational methods

• “variational”: umbrella term for optimization-based formulation of problems, and methods for their solution
• historical roots in the calculus of variations
• modern variational methods encompass a wider class of methods (e.g., dynamic programming; finite-element methods)

Variational principle: Representation of a quantity of interest û as the solution of an optimization problem.
1. allows the quantity û to be studied through the lens of the optimization problem
2. approximations to û can be obtained by approximating or relaxing the variational principle

14
Illustration: A simple variational principle

Goal: Given a vector y ∈ R^n and a symmetric matrix Q ≻ 0, solve the linear system Qu = y.

Unique solution û(y) = Q^{-1} y can be obtained by matrix inversion.

Variational formulation: Consider the function J_y : R^n → R defined by

J_y(u) := \frac{1}{2} u^T Q u - y^T u.

It is strictly convex, and the minimum is uniquely attained:

\hat{u}(y) = \arg\min_{u \in \mathbb{R}^n} J_y(u) = Q^{-1} y.

Various methods for solving linear systems (e.g., conjugate gradient) exploit this variational representation.
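A minimal numeric illustration in Python (assuming NumPy/SciPy; the random matrix and vector are purely for the example): minimizing J_y with a generic optimizer recovers the solution of Qu = y.

```python
# Minimizing J_y(u) = 0.5 u^T Q u - y^T u recovers the solution of Q u = y.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
Q = B @ B.T + 4 * np.eye(4)            # symmetric positive definite Q
y = rng.standard_normal(4)

J = lambda u: 0.5 * u @ Q @ u - y @ u  # the quadratic objective J_y
u_var = minimize(J, np.zeros(4)).x     # generic minimizer of J_y
u_exact = np.linalg.solve(Q, y)        # direct solution of Q u = y

print(np.allclose(u_var, u_exact, atol=1e-4))   # True: both approaches agree
```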

15
Useful variational principles for graphical models?
Consider an undirected graphical model:
p(x) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C)

Core problems that arise in many applications:

(a) computing the log normalization constant log Z

(b) computing local marginal distributions (e.g., p(x_s) = \sum_{x_t, t \neq s} p(x))

(c) computing modes or most likely configurations \hat{x} \in \arg\max_x p(x)

Approach: Develop variational representations of all of these problems by exploiting ideas and results from:

(a) exponential families (e.g., Brown, 1986)

(b) convex analysis (e.g., Rockafellar, 1973)

16
Exponential families

φα : X n → R ≡ sufficient statistic
φ = {φα , α ∈ I} ≡ vector of sufficient statistics
θ = {θα , α ∈ I} ≡ parameter vector
ν ≡ base measure (e.g., Lebesgue, counting)

• parameterized family of densities (w.r.t. ν):

p(x; \theta) = \exp\Big( \sum_{\alpha} \theta_\alpha \phi_\alpha(x) - A(\theta) \Big)

• cumulant generating function (log normalization constant):

A(\theta) = \log \int \exp\{ \langle \theta, \phi(x) \rangle \} \, \nu(dx)

• set of valid parameters Θ := {θ ∈ Rd | A(θ) < +∞}.


• will focus on regular families for which Θ is open.

17
Examples: Scalar exponential families

Family        X                 ν           log p(x; θ)              A(θ)
Bernoulli     {0, 1}            counting    θx − A(θ)                log[1 + exp(θ)]
Gaussian      R                 Lebesgue    θ_1 x + θ_2 x² − A(θ)    θ_1²/(−4θ_2) + (1/2) log[2π/(−2θ_2)]
Exponential   (0, +∞)           Lebesgue    θ(−x) − A(θ)             − log θ
Poisson       {0, 1, 2, . . .}  counting*   θx − A(θ)                exp(θ)

* counting measure weighted by h(x) = 1/x!

18
Graphical models as exponential families

• choose random variables X_s at each vertex s ∈ V from an arbitrary exponential family (e.g., Bernoulli, Gaussian, Dirichlet etc.)
• the exponential family can be the same at each node (e.g., multivariate Gaussian), or different (e.g., latent Dirichlet allocation model)

[Figure: the example graph on 7 nodes.]

Key requirement: The collection φ of sufficient statistics must respect the structure of G.

19
Example: Ising model

\phi = \{ x_s \mid s \in V \} \cup \{ x_s x_t \mid (s, t) \in E \}, \qquad I = V \cup E, \qquad \mathcal{X}^n = \{0, 1\}^n

Density (w.r.t. counting measure) of the form:

p(x; \theta) \propto \exp\Big\{ \sum_{s=1}^{n} \theta_s x_s + \sum_{(s,t) \in E} \theta_{st} x_s x_t \Big\}

Cumulant generating function (log normalization constant):

A(\theta) = \log \sum_{x \in \{0,1\}^n} \exp\Big\{ \sum_{s=1}^{n} \theta_s x_s + \sum_{(s,t) \in E} \theta_{st} x_s x_t \Big\}

20
Example: Multivariate Gaussian
U(θ): matrix of natural parameters; φ(x): matrix of sufficient statistics:

U(\theta) = \begin{bmatrix} 0 & \theta_1 & \theta_2 & \cdots & \theta_n \\ \theta_1 & \theta_{11} & \theta_{12} & \cdots & \theta_{1n} \\ \theta_2 & \theta_{21} & \theta_{22} & \cdots & \theta_{2n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \theta_n & \theta_{n1} & \theta_{n2} & \cdots & \theta_{nn} \end{bmatrix}, \qquad \phi(x) = \begin{bmatrix} 1 & x_1 & x_2 & \cdots & x_n \\ x_1 & x_1^2 & x_1 x_2 & \cdots & x_1 x_n \\ x_2 & x_2 x_1 & x_2^2 & \cdots & x_2 x_n \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ x_n & x_n x_1 & x_n x_2 & \cdots & x_n^2 \end{bmatrix}

Edgewise natural parameters θ_st = θ_ts must respect the graph structure:

[Figure: (a) a graph structure on 5 nodes; (b) the sparsity pattern of the matrix with entries [Z(θ)]_st = θ_st.]

21
Example: Latent Dirichlet Allocation model

[Figure: directed graphical model α → u → z → w, with parameter γ attached to w.]

Model components:
Dirichlet u ∼ Dir(α)
Multinomial “topic” z ∼ Mult(u)
“Word” w ∼ multinomial conditioned on z (with parameter γ)

With variables x := (u, z, w) and parameter θ := (α, γ), the density p(u; α) p(z; u) p(w | z, γ) is proportional to:

\exp\Big\{ \sum_{i=1}^{k} \alpha_i \log u_i + \sum_{i=1}^{k} \mathbb{I}_i[z] \log u_i + \sum_{i=1}^{k} \sum_{j=1}^{l} \gamma_{ij}\, \mathbb{I}_i[z]\, \mathbb{I}_j[w] \Big\}.

22
The power of conjugate duality

Conjugate duality is a fertile source of variational principles.


(Rockafellar, 1973)

• any function f can be used to define another function f* as follows:

f^*(y) := \sup_{x \in \mathbb{R}^n} \big\{ \langle y, x \rangle - f(x) \big\}.

• easy to show that f* is always a convex function

• how about taking the “dual of the dual”? I.e., what is (f*)*?

• when f is well-behaved (convex and lower semi-continuous), we have (f*)* = f, or alternatively stated:

f(x) = \sup_{y \in \mathbb{R}^n} \big\{ \langle x, y \rangle - f^*(y) \big\}

23
Geometric view: Supporting hyperplanes
Question: Given all hyperplanes in Rn × R with normal (y, −1), what
is the intercept of the one that supports epi(f )?
[Figure: the graph of f together with hyperplanes ⟨y, x⟩ − c_a and ⟨y, x⟩ − c_b, with intercepts −c_a and −c_b; the first one supports epi(f).]

Epigraph of f:  epi(f) := \{ (x, u) \in \mathbb{R}^{n+1} \mid f(x) \le u \}.

Analytically, we require the smallest c ∈ R such that:

\langle y, x \rangle - c \le f(x) \quad \text{for all } x \in \mathbb{R}^n

By re-arranging, we find that this optimal c* is the dual value:

c^* = \sup_{x \in \mathbb{R}^n} \big\{ \langle y, x \rangle - f(x) \big\}.

24
Example: Single Bernoulli
Random variable X ∈ {0, 1} yields an exponential family of the form:

p(x; \theta) \propto \exp\{\theta x\} \quad \text{with} \quad A(\theta) = \log[1 + \exp(\theta)].

Let’s compute the dual A^*(\mu) := \sup_{\theta \in \mathbb{R}} \big\{ \mu\theta - \log[1 + \exp(\theta)] \big\}.

(Possible) stationary point: µ = exp(θ)/[1 + exp(θ)].

[Figure: the function A(θ) with a supporting line ⟨µ, θ⟩ − A*(µ): (a) epigraph supported; (b) epigraph cannot be supported.]

We find that:

A^*(\mu) = \begin{cases} \mu \log \mu + (1 - \mu)\log(1 - \mu) & \text{if } \mu \in [0, 1] \\ +\infty & \text{otherwise.} \end{cases}

Leads to the variational representation: A(\theta) = \max_{\mu \in [0,1]} \big\{ \mu \cdot \theta - A^*(\mu) \big\}.
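A quick numeric check of this Bernoulli variational representation (a Python sketch assuming NumPy; the grid search over µ is purely illustrative):

```python
# Verify A(theta) = max_{mu in [0,1]} { mu*theta - A*(mu) } for the Bernoulli family.
import numpy as np

def A(theta):
    return np.log1p(np.exp(theta))                       # log[1 + exp(theta)]

def A_star(mu):
    return mu * np.log(mu) + (1 - mu) * np.log(1 - mu)   # negative Bernoulli entropy

theta = 1.3
mu_grid = np.linspace(1e-6, 1 - 1e-6, 100_000)
lhs = A(theta)
rhs = np.max(mu_grid * theta - A_star(mu_grid))

print(lhs, rhs)                            # both approximately 1.541
print(np.isclose(lhs, rhs, atol=1e-6))     # True
```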

25
More general computation of the dual A∗
• consider the definition of the dual function:

A^*(\mu) = \sup_{\theta \in \mathbb{R}^d} \big\{ \langle \mu, \theta \rangle - A(\theta) \big\}.

• taking derivatives w.r.t. θ to find a stationary point yields:

\mu - \nabla A(\theta) = 0.

• Useful fact: Derivatives of A yield mean parameters:

\frac{\partial A}{\partial \theta_\alpha}(\theta) = \mathbb{E}_\theta[\phi_\alpha(x)] := \int \phi_\alpha(x)\, p(x; \theta)\, \nu(dx).

Thus, stationary points satisfy the equation:

µ = Eθ [φ(x)] (1)

26
Computation of dual (continued)
• assume solution θ(µ) to equation (1) exists

• strict concavity of the objective guarantees that θ(µ) attains the global maximum, with value

A^*(\mu) = \langle \mu, \theta(\mu) \rangle - A(\theta(\mu)) = \mathbb{E}_{\theta(\mu)}\big[ \langle \theta(\mu), \phi(x) \rangle - A(\theta(\mu)) \big] = \mathbb{E}_{\theta(\mu)}\big[ \log p(x; \theta(\mu)) \big]

• recall the definition of entropy:

H(p(x)) := -\int \big[ \log p(x) \big]\, p(x)\, \nu(dx)

• thus, we recognize that A*(µ) = −H(p(x; θ(µ))) when equation (1) has a solution

Question: For which µ ∈ R^d does equation (1) have a solution θ(µ)?

27
Sets of realizable mean parameters

• for any distribution p(·), define a vector µ ∈ R^d of mean parameters:

\mu_\alpha := \int \phi_\alpha(x)\, p(x)\, \nu(dx)

• now consider the set M(G; φ) of all realizable mean parameters:

\mathcal{M}(G; \phi) = \Big\{ \mu \in \mathbb{R}^d \;\Big|\; \mu_\alpha = \int \phi_\alpha(x)\, p(x)\, \nu(dx) \ \text{for some } p(\cdot) \Big\}

• for discrete families, we refer to this set as a marginal polytope, denoted by MARG(G; φ)

28
Examples of M:
1. Gaussian MRF: Matrices of sufficient statistics and mean parameters:

\phi(x) = \begin{bmatrix} 1 \\ x \end{bmatrix} \begin{bmatrix} 1 & x \end{bmatrix}, \qquad U(\mu) := \mathbb{E}\left\{ \begin{bmatrix} 1 \\ x \end{bmatrix} \begin{bmatrix} 1 & x \end{bmatrix} \right\}

Semidefinite set \mathcal{M}_{Gauss} = \{ \mu \mid U(\mu) \succeq 0 \}.

[Figure: the set M_Gauss sketched in the (µ_1, µ_11) plane.]

2. Ising model: Binary vector X ∈ {0, 1}n

Sufficient statistics: φ(x) = {xs , s ∈ V } ∪ {xs xt , (s, t) ∈ E}


M(G) is the binary quadric polytope of realizable singleton and pairwise
marginal probabilities:

µs = p(Xs = 1), µst = p(Xs = 1, Xt = 1)

29
Geometry and moment mapping

[Figure: the gradient map ∇A carries θ in the parameter space Θ to µ in the set M.]

Theorem: In a regular, minimal exponential family, the gradient map ∇A is one-to-one and onto the interior of the set M.
(e.g., Barndorff-Nielsen, 1978; Brown, 1986; Efron, 1978)

30
Variational principles in terms of mean parameters
Theorem:
(a) The conjugate dual of A takes the form:

A^*(\mu) = \begin{cases} -H(p(x; \theta(\mu))) & \text{if } \mu \in \operatorname{int} \mathcal{M}(G; \phi) \\ +\infty & \text{if } \mu \notin \operatorname{cl} \mathcal{M}(G; \phi). \end{cases}

Note: Boundary behavior by lower semi-continuity.

(b) The cumulant generating function A has the representation:

\underbrace{A(\theta)}_{\text{cumulant generating func.}} = \underbrace{\sup_{\mu \in \mathcal{M}(G; \phi)} \{ \langle \theta, \mu \rangle - A^*(\mu) \}}_{\text{max. ent. problem over } \mathcal{M}}

with the max. attained at the mean parameters \hat{\mu}_\alpha = \mathbb{E}_\theta[\phi_\alpha(x)] (for all θ ∈ Θ).

(c) The problem of mode computation has the representation:

\sup_{x \in \mathcal{X}^n} \log p(x; \theta) + C = \sup_{x \in \mathcal{X}^n} \langle \theta, \phi(x) \rangle = \sup_{\mu \in \mathcal{M}(G; \phi)} \langle \theta, \mu \rangle.

31
Alternative view: Kullback-Leibler divergence
• Kullback-Leibler divergence defines “distance” between probability distributions:

D(p \,\|\, q) := \int \log\Big[ \frac{p(x)}{q(x)} \Big]\, p(x)\, \nu(dx)

• for two exponential family members p(x; θ^1) and p(x; θ^2), we have

D(p(x; \theta^1) \,\|\, p(x; \theta^2)) = A(\theta^2) - A(\theta^1) - \langle \mu^1, \theta^2 - \theta^1 \rangle

• substituting A(θ^1) = ⟨θ^1, µ^1⟩ − A*(µ^1) yields a mixed form:

D(p(x; \theta^1) \,\|\, p(x; \theta^2)) \equiv D(\mu^1 \,\|\, \theta^2) = A(\theta^2) + A^*(\mu^1) - \langle \mu^1, \theta^2 \rangle

Hence, the following two assertions are equivalent:

A(\theta^2) = \sup_{\mu^1 \in \mathcal{M}(G;\phi)} \{ \langle \theta^2, \mu^1 \rangle - A^*(\mu^1) \}

0 = \inf_{\mu^1 \in \mathcal{M}(G;\phi)} D(\mu^1 \,\|\, \theta^2)
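A small numeric check of the exponential-family form of the KL divergence for two scalar Bernoulli members (a Python sketch assuming NumPy; the parameter values are arbitrary):

```python
# Check D(p1 || p2) = A(theta2) - A(theta1) - <mu1, theta2 - theta1> for Bernoulli.
import numpy as np

A = lambda th: np.log1p(np.exp(th))           # cumulant generating function
theta1, theta2 = 0.8, -0.5
mu1 = np.exp(theta1) / (1 + np.exp(theta1))   # mean parameter under theta1
mu2 = np.exp(theta2) / (1 + np.exp(theta2))

# direct computation from the definition of the KL divergence
p1 = np.array([1 - mu1, mu1])
p2 = np.array([1 - mu2, mu2])
kl_direct = np.sum(p1 * np.log(p1 / p2))

# exponential-family / mixed form in terms of (mu1, theta1, theta2)
kl_family = A(theta2) - A(theta1) - mu1 * (theta2 - theta1)

print(np.isclose(kl_direct, kl_family))       # True
```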

32
Challenges

1. In general, mean parameter spaces M can be very difficult to characterize (e.g., multidimensional moment problems).

2. The (negative) entropy A*(µ), viewed as a function of only the mean parameters µ, typically lacks an explicit form.

Remarks:
1. Variational representation clarifies why certain models are
tractable.
2. For intractable cases, one strategy is to solve an approximate form
of the optimization problem.

33
Outline
1. Introduction and motivation
(a) Background on graphical models
(b) Some applications and challenging problems
(c) Illustrations of some variational methods

2. Exponential families and variational methods


(a) What is a variational method (and why should I care)?
(b) Graphical models as exponential families
(c) The power of conjugate duality

3. Exact techniques as variational methods


(a) Gaussian inference on arbitrary graphs
(b) Belief-propagation/sum-product on trees (e.g., Kalman filter; α-β alg.)
(c) Max-product on trees (e.g., Viterbi)

4. Approximate techniques as variational methods


(a) Mean field and variants
(b) Belief propagation and extensions
(c) Semidefinite constraints and convex relaxations

34
A(i): Multivariate Gaussian (fixed covariance)
Consider the set of all Gaussians with fixed inverse covariance Q ≻ 0.
• potentials φ(x) = {x_1, . . . , x_n} and natural parameter θ ∈ Θ = R^n.
• cumulant generating function:

A(\theta) = \log \int_{\mathbb{R}^n} \underbrace{\exp\Big\{ \sum_{s=1}^{n} \theta_s x_s \Big\}}_{\text{density}} \; \underbrace{\exp\Big\{ -\tfrac{1}{2} x^T Q x \Big\}}_{\text{base measure}} \, dx

• completing the square yields A(\theta) = \tfrac{1}{2} \theta^T Q^{-1} \theta + \text{constant}

• straightforward computation leads to the dual A^*(\mu) = \tfrac{1}{2} \mu^T Q \mu - \text{constant}
• putting the pieces back together yields the variational principle

A(\theta) = \sup_{\mu \in \mathbb{R}^n} \Big\{ \theta^T \mu - \tfrac{1}{2} \mu^T Q \mu \Big\} + \text{constant}

• optimum is uniquely obtained at the familiar Gaussian mean \hat{\mu} = Q^{-1} \theta.

35
A(ii): Multivariate Gaussian (arbitrary covariance)
• matrices of sufficient statistics, natural parameters, and mean parameters:

\phi(x) = \begin{bmatrix} 1 \\ x \end{bmatrix}\begin{bmatrix} 1 & x \end{bmatrix}, \qquad U(\theta) := \begin{bmatrix} 0 & [\theta_s] \\ [\theta_s] & [\theta_{st}] \end{bmatrix}, \qquad U(\mu) := \mathbb{E}\left\{ \begin{bmatrix} 1 \\ x \end{bmatrix}\begin{bmatrix} 1 & x \end{bmatrix} \right\}

• cumulant generating function:

A(\theta) = \log \int \exp\big\{ \langle\langle U(\theta), \phi(x) \rangle\rangle \big\}\, dx

• computing the dual function:

A^*(\mu) = -\frac{1}{2} \log \det U(\mu) - \frac{n}{2} \log 2\pi e

• exact variational principle is a log-determinant problem:

A(\theta) = \sup_{U(\mu) \succeq 0,\ [U(\mu)]_{11} = 1} \Big\{ \langle\langle U(\theta), U(\mu) \rangle\rangle + \frac{1}{2} \log \det U(\mu) \Big\} + \frac{n}{2} \log 2\pi e

• solution yields the normal equations for the Gaussian mean and covariance.

36
B: Belief propagation/sum-product on trees
• multinomial variables Xs ∈ {0, 1, . . . , ms − 1} on a tree T = (V, E)
• sufficient statistics: indicator functions for each node and edge

\mathbb{I}_j(x_s) \ \text{for } s = 1, \ldots, n,\ j \in \mathcal{X}_s; \qquad \mathbb{I}_{jk}(x_s, x_t) \ \text{for } (s, t) \in E,\ (j, k) \in \mathcal{X}_s \times \mathcal{X}_t.

• exponential representation of the distribution:

p(x; \theta) \propto \exp\Big\{ \sum_{s \in V} \theta_s(x_s) + \sum_{(s,t) \in E} \theta_{st}(x_s, x_t) \Big\}

where \theta_s(x_s) := \sum_{j \in \mathcal{X}_s} \theta_{s;j} \mathbb{I}_j(x_s) (and similarly for θ_st(x_s, x_t))

• mean parameters are simply marginal probabilities, represented as:

\mu_s(x_s) := \sum_{j \in \mathcal{X}_s} \mu_{s;j} \mathbb{I}_j(x_s), \qquad \mu_{st}(x_s, x_t) := \sum_{(j,k) \in \mathcal{X}_s \times \mathcal{X}_t} \mu_{st;jk} \mathbb{I}_{jk}(x_s, x_t)

• the marginals must belong to the following marginal polytope:

\operatorname{MARG}(T) := \Big\{ \mu \ge 0 \ \Big|\ \sum_{x_s} \mu_s(x_s) = 1, \ \sum_{x_t} \mu_{st}(x_s, x_t) = \mu_s(x_s) \Big\}

37
Decomposition of entropy for trees
• by the junction tree theorem, any tree-structured distribution can be factorized in terms of its marginals µ ≡ µ(θ) as follows:

p(x; \theta) = \prod_{s \in V} \mu_s(x_s) \prod_{(s,t) \in E} \frac{\mu_{st}(x_s, x_t)}{\mu_s(x_s)\, \mu_t(x_t)}

• taking logs and expectations leads to the following entropy decomposition:

H(p(x; \theta)) = -A^*(\mu(\theta)) = \sum_{s \in V} H_s(\mu_s) - \sum_{(s,t) \in E} I_{st}(\mu_{st})

where

H_s(\mu_s) := -\sum_{x_s} \mu_s(x_s) \log \mu_s(x_s), \qquad I_{st}(\mu_{st}) := \sum_{x_s, x_t} \mu_{st}(x_s, x_t) \log \frac{\mu_{st}(x_s, x_t)}{\mu_s(x_s)\, \mu_t(x_t)}.
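A numeric check of this decomposition on a small tree (a Python sketch assuming NumPy; the 3-node chain and its parameters are illustrative): the joint entropy equals the sum of node entropies minus the edge mutual informations.

```python
# Verify H(p) = sum_s H_s - sum_{st} I_st for a tree-structured distribution.
import itertools
import numpy as np

theta_s = {0: 0.1, 1: -0.2, 2: 0.3}
theta_st = {(0, 1): 0.5, (1, 2): -0.4}          # a 3-node chain (a tree)

# exact joint distribution over {0,1}^3
configs = list(itertools.product([0, 1], repeat=3))
w = np.array([np.exp(sum(theta_s[v] * x[v] for v in range(3)) +
                     sum(th * x[a] * x[b] for (a, b), th in theta_st.items()))
              for x in configs])
p = w / w.sum()
H_joint = -np.sum(p * np.log(p))

def marg(nodes):
    """Marginal distribution over the given subset of nodes."""
    m = {}
    for x, px in zip(configs, p):
        key = tuple(x[v] for v in nodes)
        m[key] = m.get(key, 0.0) + px
    return m

H_nodes = sum(-sum(q * np.log(q) for q in marg([s]).values()) for s in range(3))
I_edges = 0.0
for (s, t) in theta_st:
    ms, mt, mst = marg([s]), marg([t]), marg([s, t])
    I_edges += sum(q * np.log(q / (ms[(a,)] * mt[(b,)])) for (a, b), q in mst.items())

print(np.isclose(H_joint, H_nodes - I_edges))    # True, since the graph is a tree
```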

38
Exact variational principle on trees
• putting the pieces back together yields:

A(\theta) = \max_{\mu \in \operatorname{MARG}(T)} \Big\{ \langle \theta, \mu \rangle + \sum_{s \in V} H_s(\mu_s) - \sum_{(s,t) \in E(T)} I_{st}(\mu_{st}) \Big\}.

• let’s try to solve this problem by a (partial) Lagrangian formulation

• assign a Lagrange multiplier λ_ts(x_s) for each constraint C_{ts}(x_s) := \mu_s(x_s) - \sum_{x_t} \mu_{st}(x_s, x_t) = 0

• will enforce the normalization (\sum_{x_s} \mu_s(x_s) = 1) and non-negativity constraints explicitly

• the Lagrangian takes the form:

L(\mu; \lambda) = \langle \theta, \mu \rangle + \sum_{s \in V} H_s(\mu_s) - \sum_{(s,t) \in E(T)} I_{st}(\mu_{st}) + \sum_{(s,t) \in E} \Big[ \sum_{x_t} \lambda_{st}(x_t) C_{st}(x_t) + \sum_{x_s} \lambda_{ts}(x_s) C_{ts}(x_s) \Big]

39
Lagrangian derivation (continued)
• taking derivatives of the Lagrangian w.r.t. µ_s and µ_st yields

\frac{\partial L}{\partial \mu_s(x_s)} = \theta_s(x_s) - \log \mu_s(x_s) + \sum_{t \in \Gamma(s)} \lambda_{ts}(x_s) + C

\frac{\partial L}{\partial \mu_{st}(x_s, x_t)} = \theta_{st}(x_s, x_t) - \log \frac{\mu_{st}(x_s, x_t)}{\mu_s(x_s)\, \mu_t(x_t)} - \lambda_{ts}(x_s) - \lambda_{st}(x_t) + C'

• setting these partial derivatives to zero and simplifying:

\mu_s(x_s) \propto \exp\{ \theta_s(x_s) \} \prod_{t \in \Gamma(s)} \exp\{ \lambda_{ts}(x_s) \}

\mu_{st}(x_s, x_t) \propto \exp\{ \theta_s(x_s) + \theta_t(x_t) + \theta_{st}(x_s, x_t) \} \prod_{u \in \Gamma(s) \setminus t} \exp\{ \lambda_{us}(x_s) \} \prod_{v \in \Gamma(t) \setminus s} \exp\{ \lambda_{vt}(x_t) \}

• enforcing the constraint C_ts(x_s) = 0 on these representations yields the familiar update rule for the messages M_ts(x_s) = exp(λ_ts(x_s)):

M_{ts}(x_s) \leftarrow \sum_{x_t} \exp\{ \theta_t(x_t) + \theta_{st}(x_s, x_t) \} \prod_{u \in \Gamma(t) \setminus s} M_{ut}(x_t)

40
C: Max-product (belief revision) on trees

Question: What should be the form of a variational principle for computing modes?

Intuition: Consider behavior of the family {p(x; βθ) | β > 0}.

[Figure: the family p(x; βθ) on [0, 1] at (a) low β and (b) high β.]

Conclusion: Problem of computing modes should be related to the limiting form (β → +∞) of computing marginals.

41
Limiting form of variational principle (on trees)
• consider the tree-structured variational principle for p(x; βθ):

\frac{1}{\beta} A(\beta\theta) = \frac{1}{\beta} \max_{\mu \in \operatorname{MARG}(T)} \Big\{ \langle \beta\theta, \mu \rangle + \sum_{s \in V} H_s(\mu_s) - \sum_{(s,t) \in E(T)} I_{st}(\mu_{st}) \Big\}.

• taking limits as β → +∞ yields:

\underbrace{\max_{x \in \mathcal{X}^N} \Big\{ \sum_{s \in V} \theta_s(x_s) + \sum_{(s,t) \in E} \theta_{st}(x_s, x_t) \Big\}}_{\text{computation of modes}} \;=\; \underbrace{\max_{\mu \in \operatorname{MARG}(T)} \langle \theta, \mu \rangle}_{\text{linear program}}. \qquad (2)

• recall the max-product (belief revision) updates:

M_{ts}(x_s) \leftarrow \max_{x_t} \exp\{ \theta_t(x_t) + \theta_{st}(x_s, x_t) \} \prod_{u \in \Gamma(t) \setminus s} M_{ut}(x_t)

• the RHS of equation (2) is a linear program: a similar Lagrangian formulation shows that max-product is an iterative method for solving it (details in Wainwright & Jordan, 2003)
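A minimal Python sketch of these max-product updates on a small binary chain (assuming NumPy; the parameters are illustrative and chosen so that the mode is unique), compared against the brute-force mode:

```python
# Max-product on a 3-node chain 0 - 1 - 2, compared to the exact mode.
import itertools
import numpy as np

theta_s = {0: 0.2, 1: -0.1, 2: 0.3}
theta_st = {(0, 1): 0.6, (1, 2): -0.4}
nbrs = {0: [1], 1: [0, 2], 2: [1]}

def coupling(s, t):
    return theta_st.get((s, t), theta_st.get((t, s)))

# messages M_{t->s}(x_s): same form as sum-product, with "max" replacing the sum
msg = {(t, s): np.ones(2) for s in nbrs for t in nbrs[s]}
for _ in range(10):
    msg = {(t, s): np.array([
        max(np.exp(theta_s[t] * xt + coupling(s, t) * xs * xt) *
            np.prod([msg[(u, t)][xt] for u in nbrs[t] if u != s])
            for xt in (0, 1))
        for xs in (0, 1)]) for (t, s) in msg}

# argmax of each max-marginal gives the mode (when the mode is unique)
mode_mp = [int(np.argmax([np.exp(theta_s[s] * xs) *
                          np.prod([msg[(t, s)][xs] for t in nbrs[s]])
                          for xs in (0, 1)]))
           for s in sorted(nbrs)]

# brute-force mode for comparison
def score(x):
    return (sum(theta_s[v] * x[v] for v in range(3)) +
            sum(th * x[a] * x[b] for (a, b), th in theta_st.items()))

mode_exact = max(itertools.product([0, 1], repeat=3), key=score)
print(mode_mp, list(mode_exact))    # both [1, 1, 0]
```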

42
Outline
1. Introduction and motivation
(a) Background on graphical models
(b) Some applications and challenging problems
(c) Illustrations of some variational methods

2. Exponential families and variational methods


(a) What is a variational method (and why should I care)?
(b) Graphical models as exponential families
(c) The power of conjugate duality

3. Exact techniques as variational methods


(a) Gaussian inference on arbitrary graphs
(b) Belief-propagation/sum-product on trees (e.g., Kalman filter; α-β alg.)
(c) Max-product on trees (e.g., Viterbi)

4. Approximate techniques as variational methods


(a) Mean field and variants
(b) Belief propagation and extensions
(c) Semidefinite constraints and convex relaxations

43
A: Mean field theory

Difficulty: (typically) no explicit form for −A*(µ) (i.e., entropy as a function of mean parameters) =⇒ exact variational principle is intractable.

Idea: Restrict µ to a subset of distributions for which −A*(µ) has a tractable form.

Examples:
(a) For product distributions p(x) = \prod_{s \in V} \mu_s(x_s), the entropy decomposes as -A^*(\mu) = \sum_{s \in V} H_s(\mu_s).
(b) Similarly, for trees (more generally, decomposable graphs), the junction tree theorem yields an explicit form for −A*(µ).

Definition: A subgraph H of G is tractable if the entropy has an explicit form for any distribution that respects H.

44
Geometry of mean field
• let H represent a tractable subgraph (i.e., for which A* has an explicit form)

• let M_tr(G; H) represent tractable mean parameters:

\mathcal{M}_{tr}(G; H) := \{ \mu \mid \mu = \mathbb{E}_\theta[\phi(x)] \ \text{s.t. θ respects } H \}.

[Figure: the 3 × 3 grid G and a tractable subgraph H obtained by removing edges.]

• under mild conditions, M_tr is a non-convex inner approximation to M
• optimizing over M_tr (as opposed to M) yields a lower bound:

A(\theta) \ge \sup_{\tilde{\mu} \in \mathcal{M}_{tr}} \big\{ \langle \theta, \tilde{\mu} \rangle - A^*(\tilde{\mu}) \big\}.

[Figure: the set M with the non-convex inner approximation M_tr and a point \tilde{\mu} ∈ M_tr.]

45
Alternative view: Minimizing KL divergence

• recall the mixed form of the KL divergence between p(x; \tilde{\theta}) and p(x; θ):

D(\tilde{\mu} \,\|\, \theta) = A(\theta) + A^*(\tilde{\mu}) - \langle \tilde{\mu}, \theta \rangle

• try to find the “best” approximation to p(x; θ) in the sense of KL divergence

• in analytical terms, the problem of interest is

\inf_{\tilde{\mu} \in \mathcal{M}_{tr}} D(\tilde{\mu} \,\|\, \theta) = A(\theta) + \inf_{\tilde{\mu} \in \mathcal{M}_{tr}} \big\{ A^*(\tilde{\mu}) - \langle \tilde{\mu}, \theta \rangle \big\}

• hence, finding the tightest lower bound on A(θ) is equivalent to finding the best approximation to p(x; θ) from distributions with \tilde{\mu} \in \mathcal{M}_{tr}

46
Example: Naive mean field algorithm for Ising model
• consider the completely disconnected subgraph H = (V, ∅)
• permissible exponential parameters belong to the subspace E(H) = \{ \theta \in \mathbb{R}^d \mid \theta_{st} = 0 \ \forall (s, t) \in E \}
• allowed distributions take the product form p(x; \theta) = \prod_{s \in V} p(x_s; \theta_s), and generate

\mathcal{M}_{tr}(G; H) = \{ \mu \mid \mu_{st} = \mu_s \mu_t,\ \mu_s \in [0, 1] \}.

• approximate variational principle:

\max_{\mu_s \in [0,1]} \Big\{ \sum_{s \in V} \theta_s \mu_s + \sum_{(s,t) \in E} \theta_{st} \mu_s \mu_t - \sum_{s \in V} \big[ \mu_s \log \mu_s + (1 - \mu_s) \log(1 - \mu_s) \big] \Big\}.

• Coordinate ascent: with all {µ_t, t ≠ s} fixed, the problem is strictly concave in µ_s and the optimum is attained at

\mu_s \longleftarrow \Big\{ 1 + \exp\Big[ -\Big( \theta_s + \sum_{t \in N(s)} \theta_{st} \mu_t \Big) \Big] \Big\}^{-1}
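A minimal Python sketch of these coordinate-ascent updates on a small Ising model (assuming NumPy; the 3-cycle and its parameter values are illustrative, not from the slides):

```python
# Naive mean-field coordinate ascent for a small Ising model.
import numpy as np

theta_s = np.array([0.1, -0.2, 0.3])
theta_st = {(0, 1): 0.5, (1, 2): -0.4, (0, 2): 0.3}   # a 3-cycle, for illustration
nbrs = {0: [1, 2], 1: [0, 2], 2: [0, 1]}

def coupling(s, t):
    return theta_st.get((s, t), theta_st.get((t, s)))

mu = np.full(3, 0.5)                  # variational parameters mu_s in (0, 1)
for _ in range(200):                  # sweep the coordinate updates until convergence
    for s in range(3):
        field = theta_s[s] + sum(coupling(s, t) * mu[t] for t in nbrs[s])
        mu[s] = 1.0 / (1.0 + np.exp(-field))

print(mu)   # fixed point of the naive mean-field updates (approximate marginals)
```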

47
Example: Structured mean field for coupled HMM

[Figure: (a) a coupled HMM; (b) a tractable subgraph H consisting of decoupled chains.]

• entropy of a distribution that respects H decouples into a sum: one term for each chain.
• structured mean field updates are an iterative method for finding
the tightest approximation (either in terms of KL or lower bound)

48
B: Belief propagation on arbitrary graphs

Two main ingredients:

1. The exact entropy −A*(µ) is intractable, so let’s approximate it. The Bethe approximation A*_Bethe(µ) ≈ A*(µ) is based on the exact expression for trees:

-A^*_{\mathrm{Bethe}}(\mu) = \sum_{s \in V} H_s(\mu_s) - \sum_{(s,t) \in E} I_{st}(\mu_{st}).

2. The marginal polytope MARG(G) is also difficult to characterize, so let’s use the following (tree-based) outer bound:

\operatorname{LOCAL}(G) := \Big\{ \tau \ge 0 \ \Big|\ \sum_{x_s} \tau_s(x_s) = 1, \ \sum_{x_t} \tau_{st}(x_s, x_t) = \tau_s(x_s) \Big\}

Note: We use τ to distinguish these locally consistent pseudomarginals from globally consistent marginals.

49
Geometry of belief propagation

• combining these ingredients leads to the Bethe variational principle:

\max_{\tau \in \operatorname{LOCAL}(G)} \Big\{ \langle \theta, \tau \rangle + \sum_{s \in V} H_s(\tau_s) - \sum_{(s,t) \in E} I_{st}(\tau_{st}) \Big\}

• belief propagation can be derived as an iterative method for solving a Lagrangian formulation of the BVP (Yedidia et al., 2002)

• belief propagation uses a polyhedral outer approximation to M
• for any graph, LOCAL(G) ⊇ MARG(G).
• equality holds ⇐⇒ G is a tree.

[Figure: the marginal polytope MARG(G) contained in the outer polytope LOCAL(G), with an interior point µ_int and a fractional vertex µ_frac.]

50
Illustration: Globally inconsistent BP fixed points
Consider the following assignment of pseudomarginals τ_s, τ_st:

[Figure: a 3-cycle on nodes 1, 2, 3, with locally consistent (pseudo)marginals τ_s and τ_st attached to its nodes and edges.]

• can verify that τ ∈ LOCAL(G), and that τ is a fixed point of belief propagation (with all constant messages)
• however, τ is globally inconsistent

Note: More generally, for any τ in the interior of LOCAL(G), one can construct a distribution with τ as a BP fixed point.

51
High-level perspective

• message-passing algorithms (e.g., mean field, belief propagation) are solving approximate versions of the exact variational principle in exponential families
• there are two distinct components to approximations:
(a) can use either inner or outer bounds to M
(b) various approximations to entropy function −A∗ (µ)

• mean field: non-convex inner bound and exact form of entropy

• BP: polyhedral outer bound and non-convex Bethe approximation

• Kikuchi and variants: tighter polyhedral outer bounds and better entropy approximations (e.g., Yedidia et al., 2002)

52
Generalized belief propagation on hypergraphs

• a hypergraph is a natural generalization of a graph


• it consists of a set of vertices V and a set E of hyperedges, where
each hyperedge is a subset of V
• convenient graphical representation in terms of poset diagrams

[Figure: poset diagrams of (a) an ordinary graph, (b) a hypertree of width 2, and (c) a hypergraph.]


• descendants and ancestors of a hyperedge h:

D+ (h) := {g ∈ E | g ⊆ h }, A+ (h) := {g ∈ E | g ⊇ h }.

53
Hypertree factorization and entropy
• hypertrees are an alternative way to describe junction trees

• associated with any poset is a Möbius function ω : E × E → Z

\omega(g, g) = 1, \qquad \omega(g, h) = -\sum_{\{ f \mid g \subseteq f \subset h \}} \omega(g, f)

Example: For the Boolean poset, \omega(g, h) = (-1)^{|h| - |g|}.

• use the Möbius function to define a correspondence between the collection of marginals µ := {µ_h} and a new set of functions ϕ := {ϕ_h}:

\log \varphi_h(x_h) = \sum_{g \in D^+(h)} \omega(g, h) \log \mu_g(x_g), \qquad \log \mu_h(x_h) = \sum_{g \in D^+(h)} \log \varphi_g(x_g).

• any hypertree-structured distribution is guaranteed to factor as:

p(x) = \prod_{h \in E} \varphi_h(x_h).

54
Examples: Hypertree factorization

1. Ordinary tree:

\varphi_s(x_s) = \mu_s(x_s) \ \text{for any vertex } s, \qquad \varphi_{st}(x_s, x_t) = \frac{\mu_{st}(x_s, x_t)}{\mu_s(x_s)\, \mu_t(x_t)} \ \text{for any edge } (s, t)

2. Hypertree (hyperedges 1245, 2356, 4578 with separators 25, 45, 56, 58, 5):

\varphi_{1245} = \frac{\mu_{1245}}{\frac{\mu_{25}}{\mu_5}\,\frac{\mu_{45}}{\mu_5}\,\mu_5}, \qquad \varphi_{45} = \frac{\mu_{45}}{\mu_5}, \qquad \varphi_5 = \mu_5

Combining the pieces:

p = \frac{\mu_{1245}}{\frac{\mu_{25}}{\mu_5}\frac{\mu_{45}}{\mu_5}\mu_5} \cdot \frac{\mu_{2356}}{\frac{\mu_{25}}{\mu_5}\frac{\mu_{56}}{\mu_5}\mu_5} \cdot \frac{\mu_{4578}}{\frac{\mu_{45}}{\mu_5}\frac{\mu_{58}}{\mu_5}\mu_5} \cdot \frac{\mu_{25}}{\mu_5}\,\frac{\mu_{45}}{\mu_5}\,\frac{\mu_{56}}{\mu_5}\,\frac{\mu_{58}}{\mu_5}\,\mu_5 = \frac{\mu_{1245}\,\mu_{2356}\,\mu_{4578}}{\mu_{25}\,\mu_{45}}

55
Building augmented hypergraphs
Better entropy approximations via augmented hypergraphs.

[Figure: hypergraphs built by augmenting the 3 × 3 grid: (a) original graph; (b) clustering into groups 1245, 2356, 4578, 5689; (c) full covering; (d) Kikuchi construction; (e) a construction that fails single counting.]

56
C. Convex relaxations
Possible concerns with the Bethe/Kikuchi problems and variations?

(a) lack of convexity ⇒ multiple local optima, and substantial algorithmic complications
(b) failure to bound the log partition function

Goal: Techniques for approximate computation of marginals and parameter estimation based on:
(a) convex variational problems ⇒ unique global optimum
(b) relaxations of exact problem ⇒ upper bounds on A(θ)
Usefulness of bounds:
(a) interval estimates for marginals
(b) approximate parameter estimation
(c) large deviations (prob. of rare events)

57
Bounds from “convexified” Bethe/Kikuchi problems
Idea: Upper bound −A∗ (µ) by convex combination of tree-structured
entropies.

[Figure: a graph with cycles and three spanning trees T^1, T^2, T^3.]

-A^*(\mu) \le -\rho(T^1)\, A^*(\mu(T^1)) - \rho(T^2)\, A^*(\mu(T^2)) - \rho(T^3)\, A^*(\mu(T^3))

• given any spanning tree T, define the moment-matched tree distribution:

p(x; \mu(T)) := \prod_{s \in V} \mu_s(x_s) \prod_{(s,t) \in E(T)} \frac{\mu_{st}(x_s, x_t)}{\mu_s(x_s)\, \mu_t(x_t)}

• use −A∗ (µ(T )) to denote the associated tree entropy


• let ρ = {ρ(T )} be a probability distribution over spanning trees

58
Edge appearance probabilities

Experiment: What is the probability ρ_e that a given edge e ∈ E belongs to a tree T drawn randomly under ρ?

[Figure: (a) the original graph with edges labelled b, e, f; (b)–(d) three spanning trees T^1, T^2, T^3, each with weight ρ(T^i) = 1/3.]

In this example: ρ_b = 1; ρ_e = 2/3; ρ_f = 1/3.

The vector ρ_e = { ρ_e | e ∈ E } must belong to the spanning tree polytope, denoted T(G). (Edmonds, 1971)

59
Optimal bounds by tree-reweighted message-passing
Recall the constraint set of locally consistent marginal distributions:

\operatorname{LOCAL}(G) = \Big\{ \tau \ge 0 \ \Big|\ \underbrace{\sum_{x_s} \tau_s(x_s) = 1}_{\text{normalization}}, \ \underbrace{\sum_{x_s} \tau_{st}(x_s, x_t) = \tau_t(x_t)}_{\text{marginalization}} \Big\}.

Theorem: (Wainwright, Jaakkola, & Willsky, UAI 2002)

(a) For any given edge weights ρ_e = {ρ_e} in the spanning tree polytope, the optimal upper bound over all tree parameters is given by:

A(\theta) \le \max_{\tau \in \operatorname{LOCAL}(G)} \Big\{ \langle \theta, \tau \rangle + \sum_{s \in V} H_s(\tau_s) - \sum_{(s,t) \in E} \rho_{st} I_{st}(\tau_{st}) \Big\}.

(b) This optimization problem is strictly convex, and its unique optimum is specified by the fixed point of ρ_e-reweighted message passing:

M_{ts}(x_s) = \kappa \sum_{x'_t \in \mathcal{X}_t} \exp\Big[ \frac{\theta_{st}(x_s, x'_t)}{\rho_{st}} + \theta_t(x'_t) \Big] \, \frac{\prod_{v \in \Gamma(t) \setminus s} \big[ M^*_{vt}(x'_t) \big]^{\rho_{vt}}}{\big[ M^*_{st}(x'_t) \big]^{(1 - \rho_{ts})}}.

60
Upper bounds on lattice model

[Figure: relative error vs. coupling strength (0 to 1) on a lattice model, comparing the optimized upper bound (“Opt. Upper”), the doubly optimized upper bound (“Doub. Opt. Upper”), and the mean-field lower bound (“MF Lower”); relative errors lie within roughly ±0.04.]

61
Upper bounds on fully connected models

[Figure: relative error vs. coupling strength (0 to 1) on fully connected models, comparing “Opt. Upper”, “Doub. Opt. Upper”, and “MF Lower”; relative errors lie within roughly ±1.5.]

62
Semidefinite constraints in convex relaxations

Fact: Belief propagation and its hypergraph-based generalizations all involve polyhedral (i.e., linear) outer bounds on the marginal polytope.

Idea: Use semidefinite constraints to generate more global outer bounds.

Example: For the Ising model, relevant mean parameters are µ_s = p(X_s = 1) and µ_st = p(X_s = 1, X_t = 1).

Define y = [1 x]^T, and consider the second-order moment matrix:

\mathbb{E}[y y^T] = \begin{bmatrix} 1 & \mu_1 & \mu_2 & \cdots & \mu_n \\ \mu_1 & \mu_1 & \mu_{12} & \cdots & \mu_{1n} \\ \mu_2 & \mu_{12} & \mu_2 & \cdots & \mu_{2n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \mu_n & \mu_{n1} & \mu_{n2} & \cdots & \mu_n \end{bmatrix}

It must be positive semidefinite, which imposes (an infinite number of) linear constraints on µ_s, µ_st.

63
Illustrative example

[Figure: a 3-cycle on nodes 1, 2, 3 with locally consistent (pseudo)marginals.]

Second-order moment matrix:

\begin{bmatrix} \mu_1 & \mu_{12} & \mu_{13} \\ \mu_{21} & \mu_2 & \mu_{23} \\ \mu_{31} & \mu_{32} & \mu_3 \end{bmatrix} = \begin{bmatrix} 0.5 & 0.4 & 0.1 \\ 0.4 & 0.5 & 0.4 \\ 0.1 & 0.4 & 0.5 \end{bmatrix}

Not positive-semidefinite!
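A one-line check in Python (assuming NumPy) that the matrix above indeed fails positive semidefiniteness:

```python
# The smallest eigenvalue of the pseudomarginal moment matrix is negative.
import numpy as np

M = np.array([[0.5, 0.4, 0.1],
              [0.4, 0.5, 0.4],
              [0.1, 0.4, 0.5]])
print(np.linalg.eigvalsh(M))    # smallest eigenvalue < 0, so M is not PSD
```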

64
Log-determinant relaxation
• based on optimizing over covariance matrices M1 (µ) ∈ SDEF1 (Kn )

Theorem: Consider an outer bound OUT(Kn ) that satisfies:

MARG(Kn ) ⊆ OUT(Kn ) ⊆ SDEF1 (Kn )

For any such outer bound, A(θ) is upper bounded by:


 
\max_{\mu \in \operatorname{OUT}(K_n)} \Big\{ \langle \theta, \mu \rangle + \frac{1}{2} \log \det\Big[ M_1(\mu) + \frac{1}{3} \operatorname{blkdiag}[0, I_n] \Big] \Big\} + \frac{n}{2} \log\Big( \frac{\pi e}{2} \Big)

Remarks:
1. Log-det. problem can be solved efficiently by interior point methods.
2. Relevance for applications:
(a) Upper bound on A(θ).
(b) Method for computing approximate marginals.

(Wainwright & Jordan, 2003)

65
Results for approximating marginals

[Figure: average error in approximate marginals for the log-determinant relaxation (LD) and belief propagation (BP), under weak and strong couplings of repulsive (−), mixed (+/−), and attractive (+) type: (a) nearest-neighbor grid; (b) fully connected graph.]

• average ℓ1 error in approximate marginals over 100 trials

• coupling types: repulsive (−), mixed (+/−), attractive (+)

66
Summary and future directions

• variational methods are based on converting computational tasks to optimization problems:
(a) complementary to sampling-based methods (e.g., MCMC)
(b) a variety of new “relaxations” remain to be explored

• many open questions:


(a) prior error bounds available only in special cases
(b) extension to non-parametric settings?
(c) hybrid techniques (variational and MCMC)
(d) variational methods in parameter estimation
(e) fast techniques for solving large-scale relaxations (e.g., SDPs,
other convex programs)

67
