Probabilistic Machine Learning: Exponential Families
Lecture 05
Exponential Families
Philipp Hennig
04 May 2023
Faculty of Science
Department of Computer Science
Chair for the Methods of Machine Learning
Bonus Points
it’s ok not to get everything right
$$
p(w \mid x) = \frac{p(x \mid w)\, p(w)}{\int p(x \mid w)\, p(w)\, \mathrm{d}w} \;\overset{\text{``often''}}{=}\; \frac{\prod_{n=1}^{N} p(x_n \mid w)\, p(w)}{\int \prod_{n=1}^{N} p(x_n \mid w)\, p(w)\, \mathrm{d}w}
$$
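To make the factorized update concrete, here is a minimal numerical sketch (my own toy setup, not from the lecture): a grid approximation of the posterior over a scalar parameter w, where the i.i.d. assumption turns the likelihood into a product over the data points x_n.

```python
import numpy as np
from scipy import stats

# Toy sketch (assumed setup): Gaussian likelihood with known variance, Gaussian prior,
# posterior over the mean w computed on a grid. The i.i.d. assumption lets the
# likelihood factorize into a product over the observations x_n.
rng = np.random.default_rng(0)
x = rng.normal(loc=1.5, scale=1.0, size=20)            # observed data x_1, ..., x_N
w_grid = np.linspace(-3.0, 5.0, 2001)                  # grid over the parameter w

log_prior = stats.norm.logpdf(w_grid, loc=0.0, scale=2.0)
# sum_n log p(x_n | w), evaluated for every grid value of w (log-space for stability)
log_lik = stats.norm.logpdf(x[:, None], loc=w_grid[None, :], scale=1.0).sum(axis=0)

log_unnorm = log_prior + log_lik
unnorm = np.exp(log_unnorm - log_unnorm.max())
posterior = unnorm / (unnorm.sum() * (w_grid[1] - w_grid[0]))   # divide by the evidence
```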
E. Pitman. Sufficient statistics and intrinsic accuracy. Math. Proc. Cambr. Phil. Soc. 32(4), 1936.
P. Diaconis and D. Ylvisaker, Conjugate priors for exponential families. Annals of Statistics 7(2), 1979.
A family of probability measures $p_w$ over $\mathbb{X}$ with densities of the form

$$
p_w(x) = h(x)\, \exp\big(\phi(x)^\top w - \log Z(w)\big)
$$

is called an exponential family of probability measures. The function $\phi : \mathbb{X} \to \mathbb{R}^d$ is called the sufficient statistics. The parameters $w \in \mathbb{R}^d$ are the natural parameters of $p_w$. The normalization constant $Z : \mathbb{R}^d \to \mathbb{R}$ is the partition function. The function $h : \mathbb{X} \to \mathbb{R}_+$ is the base measure. For notational convenience, it can be useful to re-parametrize the natural parameters $w$ as $w := \eta(\theta)$ in terms of canonical parameters $\theta$.
Note: Each row of this table is one exponential family. Some authors (e.g. Murphy) call the entire table
“the exponential family”. We will not do this.
Exponential Families have Conjugate Priors
but their normalization constant may not be known
$$
p(x \mid \alpha, \nu) = \int p(x \mid w)\, p(w \mid \alpha, \nu)\, \mathrm{d}w = h(x)\, \frac{F(\phi(x) + \alpha,\; \nu + 1)}{F(\alpha, \nu)}
$$
Computing F(α, ν) can be tricky. But if we have it, it completely automates Bayesian inference!
Finding F is thus the challenge when constructing an EF.
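As a concrete case where F is available in closed form, here is a small sketch (my own choice of example; the function and variable names are not from the lecture): the Bernoulli likelihood written as an exponential family with $\phi(x) = x$, $h(x) = 1$, natural parameter $w = \log(\theta/(1-\theta))$ and $\log Z(w) = \log(1 + e^w)$. Its conjugate-prior normalizer is $F(\alpha, \nu) = B(\alpha, \nu - \alpha)$ (a Beta function, for $0 < \alpha < \nu$), which the code checks by quadrature together with the evidence formula above.

```python
import numpy as np
from scipy.special import betaln
from scipy.integrate import quad

def log_F(alpha, nu):
    """Closed-form log F(alpha, nu) = log B(alpha, nu - alpha); requires 0 < alpha < nu."""
    return betaln(alpha, nu - alpha)

def F_by_quadrature(alpha, nu):
    """Numerical check: F(alpha, nu) = int exp(alpha*w - nu*log(1 + e^w)) dw."""
    integrand = lambda w: np.exp(alpha * w - nu * np.log1p(np.exp(w)))
    val, _ = quad(integrand, -50.0, 50.0)
    return val

alpha, nu = 2.0, 5.0
print(np.exp(log_F(alpha, nu)), F_by_quadrature(alpha, nu))        # both approximately 1/12

# evidence of a single observation x = 1 under the prior:
#   p(x=1 | alpha, nu) = F(1 + alpha, nu + 1) / F(alpha, nu) = alpha / nu
print(np.exp(log_F(alpha + 1.0, nu + 1.0) - log_F(alpha, nu)), alpha / nu)
```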
Atomic inference machines
Exponential families with tractable conjugate priors make Bayesian ML tractable
▶ Once we decide to use a particular exponential family (h(x), ϕ(x)) as the model for some data x,
we automatically get a conjugate prior in the form of another exponential family
▶ in principle, this provides the means to do analytic Bayesian inference, if we can compute the
partition function Z(w) of the likelihood and F(α, ν), that of the conjugate prior.
▶ This solves the learning problem, i.e. how to extract information from the data. It does not
guarantee that we will be able to compute moments, or sample from the posterior, but these can
be encapsulated in subroutines, now that the data has been dealt with (see the sketch below).
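A minimal sketch of this "atomic inference machine" view (the class and names below are mine, not the lecture's): once the sufficient statistics $\phi$ are fixed, conditioning on data amounts to adding statistics to the conjugate-prior parameters $(\alpha, \nu)$; everything downstream can be delegated to routines that never see the raw data again.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ConjugateEF:
    """Conjugate prior of an exponential family, parametrized by (alpha, nu)."""
    alpha: np.ndarray        # same shape as the sufficient statistics phi(x)
    nu: float                # pseudo-count of prior observations

    def update(self, phi_of_data: np.ndarray) -> "ConjugateEF":
        """Condition on data given as an (N, d) array of sufficient statistics phi(x_n)."""
        return ConjugateEF(self.alpha + phi_of_data.sum(axis=0), self.nu + len(phi_of_data))
```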
$$
\begin{aligned}
p_w(x \mid w) &= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) = \mathcal{N}(x; \mu, \sigma^2) \\
&= \exp\left(-\frac{x^2 - 2\mu x + \mu^2}{2\sigma^2} - \log\sqrt{2\pi\sigma^2}\right) \\
&= \exp\left(\begin{pmatrix} x & -\tfrac{1}{2}x^2 \end{pmatrix} \begin{pmatrix} \mu/\sigma^2 \\ 1/\sigma^2 \end{pmatrix} - \left(\frac{\mu^2}{2\sigma^2} + \log\sqrt{2\pi\sigma^2}\right)\right) \\
&= \exp\left(\begin{pmatrix} \phi_1(x) & \phi_2(x) \end{pmatrix} \begin{pmatrix} w_1 \\ w_2 \end{pmatrix} - \underbrace{\tfrac{1}{2}\left(\frac{w_1^2}{w_2} - \log w_2 + \log(2\pi)\right)}_{\log Z(w)}\right)
\end{aligned}
$$
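A quick numerical check of this parametrization (my own sketch): convert $(\mu, \sigma^2)$ into the natural parameters $w = [\mu/\sigma^2, 1/\sigma^2]$, evaluate the exponential-family form with $h(x) = 1$, and compare against the standard Gaussian log-density.

```python
import numpy as np
from scipy import stats

mu, sigma2 = 1.3, 0.7
w = np.array([mu / sigma2, 1.0 / sigma2])          # natural parameters w1, w2

def ef_logpdf(x):
    phi = np.array([x, -0.5 * x**2])               # sufficient statistics phi(x)
    log_Z = 0.5 * (w[0]**2 / w[1] - np.log(w[1]) + np.log(2.0 * np.pi))
    return phi @ w - log_Z                         # h(x) = 1 for the Gaussian

x = 0.4
print(ef_logpdf(x))
print(stats.norm.logpdf(x, loc=mu, scale=np.sqrt(sigma2)))   # should agree
```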
$$
p(x \mid \mu, \sigma) = \prod_{i=1}^{n} \mathcal{N}(x_i; \mu, \sigma^2)
$$

$$
p(\mu, \sigma \mid \mu_0, \nu, \alpha, \beta) = \mathcal{N}\!\left(\mu;\; \mu_0,\; \frac{\sigma^2}{\nu}\right) \mathcal{G}(\sigma^{-2}; \alpha, \beta)
$$

$$
p(\mu, \sigma \mid x, \mu_0, \nu, \alpha, \beta) = \mathcal{N}\!\left(\mu;\; \frac{\nu\mu_0 + n\bar{x}}{\nu + n},\; \frac{\sigma^2}{\nu + n}\right) \cdot \mathcal{G}\!\left(\sigma^{-2};\; \alpha + \frac{n}{2},\; \beta + \frac{1}{2}\sum_{i=1}^{n}(x_i - \bar{x})^2 + \frac{n\nu}{2(n + \nu)}(\bar{x} - \mu_0)^2\right)
$$

where $\bar{x} := \frac{1}{n}\sum_{i=1}^{n} x_i$.

[Portrait: William S. Gosset (1876–1937)]
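The posterior update above translates directly into code; here is a short sketch (function and variable names are my own):

```python
import numpy as np

def normal_gamma_posterior(x, mu0, nu, alpha, beta):
    """Posterior parameters of the Normal-Gamma prior after observing i.i.d. Gaussian data x."""
    n, xbar = len(x), np.mean(x)
    mu_n    = (nu * mu0 + n * xbar) / (nu + n)
    nu_n    = nu + n
    alpha_n = alpha + n / 2.0
    beta_n  = beta + 0.5 * np.sum((x - xbar)**2) + n * nu / (2.0 * (n + nu)) * (xbar - mu0)**2
    return mu_n, nu_n, alpha_n, beta_n

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=0.5, size=50)
print(normal_gamma_posterior(x, mu0=0.0, nu=1.0, alpha=2.0, beta=2.0))
```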
▶ at the maximum-likelihood estimate, the gradient of the log-likelihood vanishes, which gives the condition $\nabla_w \log Z(w) = \frac{1}{n}\sum_{i=1}^{n} \phi(x_i)$; hence, collect statistics of $\phi$, compute $\nabla_w \log Z(w)$ and solve this equation for $w$ (see the sketch below)
▶ this may or may not be possible analytically, but in any case the sufficient statistics encapsulate the data!
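For the Gaussian parametrization above this condition can be solved by hand; the sketch below (my own check) recovers the usual estimates $\hat\mu = \bar{x}$ and $\hat\sigma^2 = \overline{x^2} - \bar{x}^2$, since $\nabla_w \log Z(w) = [\mu,\; -(\mu^2 + \sigma^2)/2]$.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=-0.7, scale=1.4, size=1000)

# empirical sufficient statistics (1/n) sum_i phi(x_i) with phi(x) = [x, -x^2/2]
s1, s2 = np.mean(x), np.mean(-0.5 * x**2)

# solve grad_w log Z(w) = [mu, -(mu^2 + sigma^2)/2] = [s1, s2] for (mu, sigma^2)
mu_hat = s1
sigma2_hat = -2.0 * s2 - mu_hat**2
w_hat = np.array([mu_hat / sigma2_hat, 1.0 / sigma2_hat])    # natural parameters

print(mu_hat, sigma2_hat)        # close to (-0.7, 1.4**2)
```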
Laplace approximations show that Bayesian inference is not "about the prior" at all, but rather about capturing the (local) geometry of the likelihood (around the mode). Uncertainty is about tracking all possible solutions at once (or at least as many as possible), not one single estimate.