Variational methods
Martin Wainwright
Departments of Statistics, and
Electrical Engineering and Computer Science,
UC Berkeley
Email: [email protected]
1
Introduction
2
Outline
1. Introduction and motivation
(a) Background on graphical models
(b) Some applications and challenging problems
(c) Illustrations of some variational methods
3
Undirected graphical models
Based on correspondences between graphs and random variables.
• given an undirected graph G = (V, E), associate to each node s a
random variable Xs
• for each subset A ⊆ V, define X_A := {X_s, s ∈ A}.
[Figure: an undirected graph on vertices 1–7, and a second graph partitioned into subsets A, S, B.]
Maximal cliques: (123), (345), (456), (47).  Vertex cutset: S.
• a clique C ⊆ V is a subset of vertices all joined by edges
• a vertex cutset is a subset S ⊂ V whose removal breaks the graph
into two or more pieces
4
Factorization and Markov properties
5
Example: Hidden Markov models
[Figure: hidden chain X_1 → X_2 → X_3 → · · · → X_T with observations Y_1, Y_2, Y_3, . . . , Y_T.]
6
Example: Graphical codes for communication
Goal: Achieve reliable communication over a noisy channel.
[Figure: source bit 0 → Encoder → codeword X = 00000 → Noisy Channel → received Y = 10010 → Decoder → estimate X̂ = 00000.]
7
Graphical codes and decoding
Parity check matrix:

   H = [ 1 0 1 0 1 0 1 ]
       [ 0 1 1 0 0 1 1 ]
       [ 0 0 0 1 1 1 1 ]

Factor graph: variable nodes x_1, . . . , x_7, each connected to the parity-check factors ψ_1357, ψ_2367, ψ_4567 indexed by the rows of H.

Codeword: [0 1 0 1 0 1 0]
Non-codeword: [0 0 0 0 0 1 1]
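As a quick sanity check of this factorization, the two example vectors can be verified against H directly. A minimal sketch (assuming plain 0/1 integer arithmetic mod 2; not part of the original slides):

```python
import numpy as np

# Parity check matrix from the slide (rows index the factors psi_1357, psi_2367, psi_4567).
H = np.array([[1, 0, 1, 0, 1, 0, 1],
              [0, 1, 1, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 1, 1]])

def is_codeword(x):
    """A bit vector x is a codeword iff every parity check is satisfied: H x = 0 (mod 2)."""
    return bool(np.all(H @ np.asarray(x) % 2 == 0))

print(is_codeword([0, 1, 0, 1, 0, 1, 0]))  # True  (codeword)
print(is_codeword([0, 0, 0, 0, 0, 1, 1]))  # False (non-codeword)
```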
8
Challenging computational problems
9
Gibbs sampling in the Ising model
Update x_s^(m+1) stochastically based on the values x_{N(s)}^(m) at the neighbors:
1. Choose s ∈ V at random.
2. Sample u ∼ U(0, 1) and update

   x_s^(m+1) = 1   if u ≤ {1 + exp[−(θ_s + Σ_{t∈N(s)} θ_st x_t^(m))]}^(−1)
             = 0   otherwise.
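A minimal sketch of this sampler in code (the data structures theta, theta_pair and neighbors are hypothetical inputs, assumed to hold the node parameters, the edge parameters under both key orderings, and the adjacency lists):

```python
import numpy as np

def gibbs_sweep(x, theta, theta_pair, neighbors, rng):
    """One sweep of Gibbs sampling for an Ising model with {0,1} variables.

    x           : current configuration, array of 0/1 values
    theta       : node parameters theta_s
    theta_pair  : dict mapping (s, t) and (t, s) to the edge parameter theta_st
    neighbors   : dict mapping s to its neighbor list N(s)
    """
    for s in rng.permutation(len(x)):
        field = theta[s] + sum(theta_pair[(s, t)] * x[t] for t in neighbors[s])
        p = 1.0 / (1.0 + np.exp(-field))       # conditional P(x_s = 1 | x_{N(s)})
        x[s] = 1 if rng.uniform() <= p else 0  # stochastic update of the slide
    return x
```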
10
Mean field updates in the Ising model
[Figure: a node s and its neighbors, with mean parameters ν_t attached to the neighbors.]
1. Choose s ∈ V at random.
2. Update ν_s based on the neighbors {ν_t, t ∈ N(s)}:

   ν_s ←− {1 + exp[−(θ_s + Σ_{t∈N(s)} θ_st ν_t)]}^(−1)
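The corresponding coordinate update in code, under the same hypothetical data structures as in the Gibbs sketch:

```python
import numpy as np

def mean_field_sweep(nu, theta, theta_pair, neighbors):
    """One sweep of naive mean-field updates for the Ising model.

    nu[s] approximates the marginal probability P(x_s = 1); the update is the
    deterministic analogue of the Gibbs update, with neighbor values replaced
    by their mean parameters.
    """
    for s in range(len(nu)):
        field = theta[s] + sum(theta_pair[(s, t)] * nu[t] for t in neighbors[s])
        nu[s] = 1.0 / (1.0 + np.exp(-field))
    return nu
```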
11
Sum-product (belief-propagation) in the Ising model
[Figure: edge (t, s); the message ν_ts summarizes the subtrees T_u, T_v, T_w hanging off node t.]
1. For each (direction of each) edge, update the message:

   ν_ts(x_s) ← Σ_{x_t=0}^{1} exp(θ_t x_t + θ_st x_s x_t) Π_{u∈N(t)\s} ν_ut(x_t)

2. Upon convergence, compute the approximation to the marginal:

   p(x_s) ∝ exp(θ_s x_s) Π_{t∈N(s)} ν_ts(x_s).
• for any tree (i.e., no cycles), updates will converge (after a finite
number of steps), and yield exact marginals (cf. Pearl, 1988)
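A sketch of these updates for a binary Ising model (assumed conventions: nodes are 0, . . . , n−1 and theta_pair stores θ_st under both key orderings); on a tree the result is exact, on a graph with cycles it is the loopy approximation discussed later:

```python
import numpy as np

def sum_product_ising(theta, theta_pair, edges, n_iters=50):
    """Sum-product for a {0,1} Ising model; exact on trees.

    theta      : length-n array of node parameters theta_s
    theta_pair : dict (s, t) -> theta_st, stored under both key orderings
    edges      : list of undirected edges (s, t)
    Returns the (approximate; exact on trees) marginals P(x_s = 1).
    """
    n = len(theta)
    neighbors = {s: [] for s in range(n)}
    nu = {}
    for s, t in edges:
        neighbors[s].append(t)
        neighbors[t].append(s)
        nu[(s, t)] = np.ones(2)   # message from s to t, as a function of x_t
        nu[(t, s)] = np.ones(2)
    for _ in range(n_iters):
        for (t, s) in list(nu):   # update message from t to s, a function of x_s
            new = np.zeros(2)
            for x_s in (0, 1):
                for x_t in (0, 1):
                    incoming = np.prod([nu[(u, t)][x_t] for u in neighbors[t] if u != s])
                    new[x_s] += np.exp(theta[t] * x_t + theta_pair[(s, t)] * x_s * x_t) * incoming
            nu[(t, s)] = new / new.sum()        # normalize for numerical stability
    marginals = np.zeros(n)
    for s in range(n):
        belief = np.array([np.exp(theta[s] * x) * np.prod([nu[(t, s)][x] for t in neighbors[s]])
                           for x in (0, 1)])
        marginals[s] = belief[1] / belief.sum()
    return marginals

# Example: a 3-node chain 0 - 1 - 2.
theta = np.array([0.1, -0.2, 0.3])
theta_pair = {(0, 1): 0.5, (1, 0): 0.5, (1, 2): 0.5, (2, 1): 0.5}
print(sum_product_ising(theta, theta_pair, edges=[(0, 1), (1, 2)]))
```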
12
Outline
1. Introduction and motivation
(a) Background on graphical models
(b) Some applications and challenging problems
(c) Illustrations of some variational methods
13
Variational methods
Variational principle: representation of a quantity of interest û as the solution of an optimization problem.
1. Allows the quantity û to be studied through the lens of the optimization problem.
2. Approximations to û can be obtained by approximating or relaxing the variational principle.
14
Illustration: A simple variational principle
15
Useful variational principles for graphical models?
Consider an undirected graphical model:
   p(x) = (1/Z) Π_{C∈C} ψ_C(x_C)
16
Exponential families
φ_α : X^n → R       ≡ sufficient statistic
φ = {φ_α, α ∈ I}    ≡ vector of sufficient statistics
θ = {θ_α, α ∈ I}    ≡ parameter vector
ν                   ≡ base measure (e.g., Lebesgue, counting)

These ingredients define densities of the form p(x; θ) = exp{⟨θ, φ(x)⟩ − A(θ)} with respect to ν.
17
Examples: Scalar exponential families
Family   | X | ν        | exponential form         | A(θ)
Gaussian | R | Lebesgue | θ_1 x + θ_2 x² − A(θ)    | θ_1²/(−4θ_2) + (1/2) log[2π/(−2θ_2)]
18
Graphical models as exponential families
[Figure: the undirected graph on vertices 1–7 from earlier.]
19
Example: Ising model
φ = { x_s | s ∈ V } ∪ { x_s x_t | (s, t) ∈ E }
I = V ∪ E
X^n = {0, 1}^n

so that p(x; θ) ∝ exp{ Σ_{s∈V} θ_s x_s + Σ_{(s,t)∈E} θ_st x_s x_t }.
20
Example: Multivariate Gaussian
U(θ), the matrix of natural parameters:

   [ 0     θ_1    θ_2    ...   θ_n  ]
   [ θ_1   θ_11   θ_12   ...   θ_1n ]
   [ θ_2   θ_21   θ_22   ...   θ_2n ]
   [ ...   ...    ...    ...   ...  ]
   [ θ_n   θ_n1   θ_n2   ...   θ_nn ]

φ(x), the matrix of sufficient statistics:

   [ 1     x_1      x_2      ...   x_n     ]
   [ x_1   (x_1)²   x_1 x_2  ...   x_1 x_n ]
   [ x_2   x_2 x_1  (x_2)²   ...   x_2 x_n ]
   [ ...   ...      ...      ...   ...     ]
   [ x_n   x_n x_1  x_n x_2  ...   (x_n)²  ]
21
Example: Latent Dirichlet Allocation model
[Figure: directed graphical model α → u → z → w, with word parameter γ.]
Model components:
Dirichlet u ∼ Dir(α)
Multinomial “topic” z ∼ Mult(u)
“Word” w ∼ multinomial conditioned on z
(with parameter γ)
22
The power of conjugate duality
23
Geometric view: Supporting hyperplanes
Question: Given all hyperplanes in Rn × R with normal (y, −1), what
is the intercept of the one that supports epi(f )?
Epigraph of f:

   epi(f) := {(x, u) ∈ R^(n+1) | f(x) ≤ u}.

[Figure: the graph of f(x) with two hyperplanes ⟨y, x⟩ − c_a and ⟨y, x⟩ − c_b of common normal (y, −1); their intercepts −c_a and −c_b appear on the vertical axis, and the supporting one touches epi(f).]
24
Example: Single Bernoulli
Random variable X ∈ {0, 1} yields exponential family of the form:
   p(x; θ) ∝ exp{θ x}   with   A(θ) = log[1 + exp(θ)].

Let’s compute the dual A*(µ) := sup_{θ∈R} { µθ − log[1 + exp(θ)] }.

[Figure: two panels showing θ ↦ ⟨µ, θ⟩ − c and ⟨µ, θ⟩ − A*(µ) as functions of θ; the supremum over θ defines A*(µ).]
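For the record, this supremum can be evaluated in closed form (a standard calculation, not spelled out on the slide): setting the derivative to zero and substituting back gives

```latex
\mu - \frac{e^{\theta}}{1+e^{\theta}} = 0
\;\Longrightarrow\; \theta(\mu) = \log\frac{\mu}{1-\mu},
\qquad
A^*(\mu) = \mu\log\mu + (1-\mu)\log(1-\mu), \quad \mu \in (0,1),
```

i.e. the negative Bernoulli entropy, consistent with the general result A*(µ) = −H(p(x; θ(µ))) derived on the following slides; for µ outside [0, 1] the supremum is +∞.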
25
More general computation of the dual A∗
• consider the definition of the dual function:

   A*(µ) = sup_{θ∈R^d} { ⟨µ, θ⟩ − A(θ) }.

• the stationary condition is µ − ∇A(θ) = 0; since ∇A(θ) = E_θ[φ(x)], any optimum must satisfy the moment-matching condition

   µ = E_θ[φ(x)].     (1)
26
Computation of dual (continued)
• assume that a solution θ(µ) to equation (1) exists
• substituting it back, we recognize that A*(µ) = −H(p(x; θ(µ))), the negative entropy of the exponential family member with mean parameters µ
27
Sets of realizable mean parameters
For given sufficient statistics φ on a graph G, define the set of realizable mean parameters
   M(G; φ) := { µ ∈ R^d | µ = E_p[φ(x)] for some distribution p }.
28
Examples of M:
1. Gaussian MRF: Matrices of suff. statistics and mean parameters:
   φ(x) = [ 1 ] [ 1  x ]   (outer product),        U(µ) := E[ φ(x) ]
          [ x ]

[Figure: the set M_gauss sketched in the (µ_1, µ_11) plane.]
29
Geometry and moment mapping
[Figure: the forward mapping θ ↦ µ = E_θ[φ(x)] from the parameter space Θ to the set M of mean parameters.]
30
Variational principles in terms of mean parameters
Theorem:
(a) The conjugate dual of A takes the form:
   A*(µ) = −H(p(x; θ(µ)))   if µ ∈ int M(G; φ)
         = +∞               if µ ∉ cl M(G; φ).

Note: the boundary behavior is determined by lower semi-continuity.

(b) The cumulant generating function has the variational representation

   A(θ) = sup_{µ ∈ M(G;φ)} { ⟨θ, µ⟩ − A*(µ) }.
31
Alternative view: Kullback-Leibler divergence
• Kullback-Leibler divergence defines “distance” between probability
distributions:
   D(p ∥ q) := ∫ log[ p(x)/q(x) ] p(x) ν(dx)

• in the mixed parameterization (mean parameter µ¹ for the first argument, canonical parameter θ² for the second), the variational principle is equivalent to the statement

   0 = inf_{µ¹ ∈ M(G;φ)} D(µ¹ ∥ θ²),

with the infimum attained at µ¹ = µ(θ²).
32
Challenges
Remarks:
1. Variational representation clarifies why certain models are
tractable.
2. For intractable cases, one strategy is to solve an approximate form
of the optimization problem.
33
Outline
1. Introduction and motivation
(a) Background on graphical models
(b) Some applications and challenging problems
(c) Illustrations of some variational methods
34
A(i): Multivariate Gaussian (fixed covariance)
Consider the set of all Gaussians with fixed inverse covariance Q ≻ 0.
• potentials φ(x) = {x1 , . . . , xn } and natural parameter θ ∈ Θ = Rn .
• cumulant generating function:

   A(θ) = log ∫_{R^n} exp{ Σ_{s=1}^{n} θ_s x_s } exp{ −(1/2) xᵀ Q x } dx,

  where exp{−(1/2) xᵀ Q x} plays the role of the base measure.

• the optimum is uniquely obtained at the familiar Gaussian mean µ̂ = Q⁻¹θ.
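For reference, this Gaussian integral can be evaluated in closed form (a standard computation, not spelled out on the slide):

```latex
A(\theta) = \tfrac{1}{2}\,\theta^{\top} Q^{-1} \theta
          + \tfrac{n}{2}\log(2\pi) - \tfrac{1}{2}\log\det Q,
\qquad
\nabla A(\theta) = Q^{-1}\theta = \widehat{\mu},
```

so the mean parameters are indeed µ̂ = Q⁻¹θ, as stated above.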
35
A(ii): Multivariate Gaussian (arbitrary covariance)
• matrices of sufficient statistics, natural parameters, and mean
parameters:
   φ(x) = [ 1 ] [ 1  x ],      U(θ) := [ 0      [θ_s]  ],      U(µ) := E[ φ(x) ]
          [ x ]                        [ [θ_s]  [θ_st] ]
• solution yields the normal equations for Gaussian mean and covariance.
36
B: Belief propagation/sum-product on trees
• multinomial variables Xs ∈ {0, 1, . . . , ms − 1} on a tree T = (V, E)
• sufficient statistics: indicator functions for each node and edge
   I_j(x_s)         for s = 1, . . . , n,  j ∈ X_s
   I_jk(x_s, x_t)   for (s, t) ∈ E,  (j, k) ∈ X_s × X_t.
37
Decomposition of entropy for trees
• by the junction tree theorem, any tree-structured distribution can be factorized in terms of its marginals µ ≡ µ(θ) as follows:

   p(x; θ) = Π_{s∈V} µ_s(x_s) · Π_{(s,t)∈E} µ_st(x_s, x_t) / [µ_s(x_s) µ_t(x_t)]

• consequently, the entropy decomposes as H(p(x; θ)) = Σ_{s∈V} H_s(µ_s) − Σ_{(s,t)∈E} I_st(µ_st), where

   H_s(µ_s) := − Σ_{x_s} µ_s(x_s) log µ_s(x_s)                                            (node entropy)

   I_st(µ_st) := Σ_{x_s,x_t} µ_st(x_s, x_t) log { µ_st(x_s, x_t) / [µ_s(x_s) µ_t(x_t)] }   (mutual information)
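A small sketch of these two quantities computed from given marginals (inputs are assumed to be strictly positive, consistent marginal arrays):

```python
import numpy as np

def node_entropy(mu_s):
    """H_s(mu_s) = -sum_{x_s} mu_s(x_s) log mu_s(x_s)."""
    mu_s = np.asarray(mu_s)
    return -np.sum(mu_s * np.log(mu_s))

def edge_mutual_information(mu_st):
    """I_st(mu_st), with the node marginals obtained by summing out the joint."""
    mu_st = np.asarray(mu_st)
    mu_s, mu_t = mu_st.sum(axis=1), mu_st.sum(axis=0)
    return np.sum(mu_st * np.log(mu_st / np.outer(mu_s, mu_t)))

# Example: a weakly coupled pair of binary variables.
mu_st = np.array([[0.3, 0.2],
                  [0.2, 0.3]])
print(node_entropy(mu_st.sum(axis=1)))    # H_s
print(edge_mutual_information(mu_st))     # I_st >= 0
```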
38
Exact variational principle on trees
• putting the pieces back together yields:
   A(θ) = max_{µ ∈ MARG(T)} { ⟨θ, µ⟩ + Σ_{s∈V} H_s(µ_s) − Σ_{(s,t)∈E(T)} I_st(µ_st) }.
39
Lagrangian derivation (continued)
• taking derivatives of the Lagrangian w.r.t. µ_s and µ_st yields

   ∂L/∂µ_s(x_s) = θ_s(x_s) − log µ_s(x_s) + Σ_{t∈Γ(s)} λ_ts(x_s) + C

   ∂L/∂µ_st(x_s, x_t) = θ_st(x_s, x_t) − log { µ_st(x_s, x_t) / [µ_s(x_s) µ_t(x_t)] } − λ_ts(x_s) − λ_st(x_t) + C′
40
C: Max-product (belief revision) on trees
41
Limiting form of variational principle (on trees)
• consider the tree-structured variational principle for p(x; βθ):

   (1/β) A(βθ) = (1/β) max_{µ ∈ MARG(T)} { ⟨βθ, µ⟩ + Σ_{s∈V} H_s(µ_s) − Σ_{(s,t)∈E(T)} I_st(µ_st) }
               = max_{µ ∈ MARG(T)} { ⟨θ, µ⟩ + (1/β) [ Σ_{s∈V} H_s(µ_s) − Σ_{(s,t)∈E(T)} I_st(µ_st) ] }.
42
Outline
1. Introduction and motivation
(a) Background on graphical models
(b) Some applications and challenging problems
(c) Illustrations of some variational methods
43
A: Mean field theory
Examples:
(a) For product distributions p(x) = Π_{s∈V} µ_s(x_s), the entropy decomposes as −A*(µ) = Σ_{s∈V} H_s(µ_s).
(b) Similarly, for trees (more generally, decomposable graphs), the
junction tree theorem yields an explicit form for −A∗ (µ).
44
Geometry of mean field
[Figure: a 3×3 grid graph; the tractable set M_tr sits inside M, with a point µ̃ ∈ M_tr.]

• under mild conditions, M_tr is a non-convex inner approximation to M
• optimizing over M_tr (as opposed to M) yields a lower bound:

   A(θ) ≥ sup_{µ̃ ∈ M_tr} { ⟨θ, µ̃⟩ − A*(µ̃) }.
45
Alternative view: Minimizing KL divergence
   D(µ̃ ∥ θ) = A(θ) + A*(µ̃) − ⟨µ̃, θ⟩
46
Example: Naive mean field algorithm for Ising model
• consider completely disconnected subgraph H = (V, ∅)
• permissible exponential parameters belong to the subspace
   E(H) = {θ ∈ R^d | θ_st = 0 ∀ (s, t) ∈ E}
• allowed distributions take the product form p(x; θ) = Π_{s∈V} p(x_s; θ_s), and generate
   M_tr(G; H) = {µ | µ_st = µ_s µ_t, µ_s ∈ [0, 1]}.
• approximate variational principle:

   max_{µ_s ∈ [0,1]} { Σ_{s∈V} θ_s µ_s + Σ_{(s,t)∈E} θ_st µ_s µ_t − Σ_{s∈V} [ µ_s log µ_s + (1−µ_s) log(1−µ_s) ] }.
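A sketch of this objective as a function of µ (hypothetical inputs as in the earlier Ising sketches); maximizing it, e.g. by the coordinate updates shown earlier, gives the naive mean-field lower bound on A(θ):

```python
import numpy as np

def mean_field_objective(mu, theta, theta_pair, edges):
    """Naive mean-field objective: a lower bound on A(theta) for any mu in (0,1)^n.

    mu         : array of site marginals mu_s = q(x_s = 1)
    theta      : array of node parameters theta_s
    theta_pair : dict (s, t) -> theta_st for each edge in `edges`
    """
    mu = np.asarray(mu)
    energy = theta @ mu + sum(theta_pair[(s, t)] * mu[s] * mu[t] for s, t in edges)
    entropy = -np.sum(mu * np.log(mu) + (1.0 - mu) * np.log(1.0 - mu))
    return energy + entropy   # <= A(theta)
```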
47
Example: Structured mean field for coupled HMM
[Figure: (a) a coupled HMM; (b) the tractable sub-model used for the structured mean-field approximation.]
48
B: Belief propagation on arbitrary graphs
49
Geometry of belief propagation
[Figure: the polytope LOCAL(G) containing MARG(G), with an integral vertex µ_int and a fractional vertex µ_frac.]

• belief propagation uses a polyhedral outer approximation to M
• for any graph, LOCAL(G) ⊇ MARG(G)
• equality holds ⇐⇒ G is a tree
50
Illustration: Globally inconsistent BP fixed points
Consider the following assignment of pseudomarginals τs , τst :
[Figure: a 3-node cycle (nodes 1, 2, 3) with locally consistent (pseudo)marginals τ_s, τ_st attached to its nodes and edges.]
• can verify that τ ∈ LOCAL(G), and that τ is a fixed point of belief
propagation (with all constant messages)
• however, τ is globally inconsistent
Note: more generally, for any τ in the interior of LOCAL(G), one can construct a distribution with τ as a BP fixed point.
51
High-level perspective
52
Generalized belief propagation on hypergraphs
[Figure: example hypergraphs with hyperedges drawn as boxes, e.g. pairwise hyperedges (12), (23), (34), (14); triplets (123), (234); and blocks (1245), (2356), (4578), (5689) with separators (25), (45), (56), (58), (5).]
Descendant and ancestor sets of a hyperedge h:  D⁺(h) := {g ∈ E | g ⊆ h },   A⁺(h) := {g ∈ E | g ⊇ h }.
53
Hypertree factorization and entropy
• hypertrees are an alternative way to describe junction trees
54
Examples: Hypertree factorization
1. Ordinary tree: vertices and edges are the hyperedges, with ϕ_s = µ_s and ϕ_st = µ_st / (µ_s µ_t), recovering the usual tree factorization.

2. Hypertree with hyperedges (1245), (2356), (4578) and separators (25), (45), (56), (58), (5):

   ϕ_1245 = µ_1245 / [ (µ_25/µ_5) (µ_45/µ_5) µ_5 ]

   ϕ_45 = µ_45 / µ_5

   ϕ_5 = µ_5
55
Building augmented hypergraphs
Better entropy approximations via augmented hypergraphs.
[Figure: a 3×3 grid (vertices 1–9), its pairwise hyperedges, and an augmented hypergraph with the 2×2 blocks (1245), (2356), (4578), (5689) and separators (45), (56), (58), (25), (5).]
56
C. Convex relaxations
Possible concerns with the Bethe/Kikuchi problems and variations?
57
Bounds from “convexified” Bethe/Kikuchi problems
Idea: Upper bound −A∗ (µ) by convex combination of tree-structured
entropies.
   −A*(µ) ≤ −ρ(T¹) A*(µ(T¹)) − ρ(T²) A*(µ(T²)) − ρ(T³) A*(µ(T³))
58
Edge appearance probabilities
[Figure: (a) the original graph with edges labelled b, e, f; (b)–(d) three spanning trees T¹, T², T³, each with weight ρ(T¹) = ρ(T²) = ρ(T³) = 1/3.]
In this example: ρ_b = 1;  ρ_e = 2/3;  ρ_f = 1/3.
59
Optimal bounds by tree-reweighted message-passing
Recall the constraint set of locally consistent marginal distributions:
   LOCAL(G) = { τ ≥ 0 | Σ_{x_s} τ_s(x_s) = 1  (normalization),   Σ_{x_s} τ_st(x_s, x_t) = τ_t(x_t)  (marginalization) }.
(a) For any given edge weights {ρ_st} in the spanning tree polytope, the optimal upper bound over all tree parameters is given by:

   A(θ) ≤ max_{τ ∈ LOCAL(G)} { ⟨θ, τ⟩ + Σ_{s∈V} H_s(τ_s) − Σ_{(s,t)∈E} ρ_st I_st(τ_st) }.
(b) This optimization problem is strictly convex, and its unique optimum is specified by the fixed point of ρ_e-reweighted message passing:

   M*_ts(x_s) = κ Σ_{x′_t ∈ X_t} exp[ θ_st(x_s, x′_t)/ρ_st + θ_t(x′_t) ] · { Π_{v∈Γ(t)\s} [M*_vt(x′_t)]^{ρ_vt} } / [M*_st(x′_t)]^{(1−ρ_ts)}.
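A sketch of this message update in code (hedged: the data structures are hypothetical, ρ is taken symmetric, and all messages are assumed initialized to positive values):

```python
import numpy as np

def trw_update(M, s, t, theta, theta_pair, rho, neighbors, n_states):
    """One tree-reweighted update of the message from t to s (sketch).

    M          : dict (a, b) -> current message array over x_b
    theta      : dict s -> array theta_s(x_s)
    theta_pair : dict (s, t) -> array theta_st indexed as [x_s, x_t] (both orderings stored)
    rho        : dict (s, t) -> edge appearance probability rho_st (symmetric)
    neighbors  : dict t -> list of neighbors Gamma(t)
    n_states   : dict s -> number of states of x_s
    """
    new = np.zeros(n_states[s])
    for x_s in range(n_states[s]):
        for x_t in range(n_states[t]):
            weight = np.exp(theta_pair[(s, t)][x_s, x_t] / rho[(s, t)] + theta[t][x_t])
            num = np.prod([M[(v, t)][x_t] ** rho[(v, t)] for v in neighbors[t] if v != s])
            den = M[(s, t)][x_t] ** (1.0 - rho[(t, s)])
            new[x_s] += weight * num / den
    return new / new.sum()   # the constant kappa: normalize the message
```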
60
Upper bounds on lattice model
[Plot: relative error of the optimized upper bound (“Opt. Upper”) versus coupling strength (0 to 1) on a lattice model.]
61
Upper bounds on fully connected models
[Plot: relative error versus coupling strength (0 to 1) on fully connected models, comparing the optimized upper bound (“Opt. Upper”), the doubly optimized upper bound (“Doub. Opt. Upper”), and the mean-field lower bound (“MF Lower”).]
62
Semidefinite constraints in convex relaxations
The second-order moment matrix (with entries µ_s on the diagonal and µ_st off the diagonal) must be positive semidefinite, which imposes (an infinite number of) linear constraints on µ_s, µ_st.
63
Illustrative example
[Figure: a 3-node cycle (nodes 1, 2, 3) with locally consistent (pseudo)marginals.]

Second-order moment matrix:

   [ µ_1   µ_12  µ_13 ]     [ 0.5  0.4  0.1 ]
   [ µ_21  µ_2   µ_23 ]  =  [ 0.4  0.5  0.4 ]
   [ µ_31  µ_32  µ_3  ]     [ 0.1  0.4  0.5 ]

Not positive semidefinite!
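A quick numerical check of this claim (a minimal sketch, not part of the original slides):

```python
import numpy as np

# Second-order moment matrix built from the pseudomarginals on the slide.
M = np.array([[0.5, 0.4, 0.1],
              [0.4, 0.5, 0.4],
              [0.1, 0.4, 0.5]])

eigvals = np.linalg.eigvalsh(M)   # eigenvalues of the symmetric matrix
print(eigvals)                    # the smallest eigenvalue is negative
print(np.all(eigvals >= 0))       # False: the matrix is not PSD
```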
64
Log-determinant relaxation
• based on optimizing over covariance matrices M1 (µ) ∈ SDEF1 (Kn )
Remarks:
1. Log-det. problem can be solved efficiently by interior point methods.
2. Relevance for applications:
(a) Upper bound on A(θ).
(b) Method for computing approximate marginals.
65
Results for approximating marginals
[Plots: average error in marginal for the log-determinant relaxation (LD) versus belief propagation (BP), for weak/strong couplings of each sign (Weak −, Strong −, Weak +, Strong +).]
66
Summary and future directions
67