
Probabilistic Machine Learning

Lecture 05
Exponential Families

Philipp Hennig
04 May 2023

Faculty of Science
Department of Computer Science
Chair for the Methods of Machine Learning
Bonus Points
it’s ok not to get everything right

▶ “bonus points” means not everyone should get them


▶ otherwise, the exam just gets a steeper grading curve, and nobody gains anything
▶ It’s ok to not be able to finish the exercise. You do not need all the bonus points to finish the exam!
▶ please do not post solutions on the forum. Sometimes they even add confusion!
▶ keep in mind that some things can actually be computed by hand. Auto-diff and optimization are
powerful tools, but they do have computational and coding cost.

Probabilistic ML — P. Hennig, SS 2023 — © Philipp Hennig, 2023 CC BY-NC-SA 4.0 1


The skeleton of ML
probabilistic inference

$$ p(w \mid x) \;=\; \frac{p(x \mid w)\,p(w)}{\int p(x \mid w)\,p(w)\,\mathrm{d}w} \;\overset{\text{``often''}}{=}\; \frac{\prod_{n=1}^{N} p(x_n \mid w)\,p(w)}{\int \prod_{n=1}^{N} p(x_n \mid w)\,p(w)\,\mathrm{d}w} $$

This is very abstract. It would be nice to reduce this to:


▶ get some data x ∈ R^{N×D} (N: “batch dim”, D: “input dim”)
▶ compute some statistics ϕ(x) ∈ R^F (note this can consume either or both axes of x) that capture the algebraic structure of p(x | w)
▶ somehow compute p(w | x), using only ϕ(x) and nothing else about the data.

Probabilistic ML — P. Hennig, SS 2023 — © Philipp Hennig, 2023 CC BY-NC-SA 4.0 2


Conjugate Priors
Mapping Bayesian inference to tractable code

Definition (Conjugate Prior)


Let x and w be a data-set and a variable to be inferred, respectively, connected by the likelihood
p(x | w) = ℓ(x; w). A conjugate prior to ℓ for w is a probability measure with pdf p(w) = g(w; θ), such
that
$$ p(w \mid x) = \frac{\ell(x; w)\,g(w; \theta)}{\int \ell(x; w)\,g(w; \theta)\,\mathrm{d}w} = g(w;\ \theta + \phi(x)). $$
That is, such that the posterior arising from ℓ is of the same functional form as the prior, with updated
parameters arising by adding some sufficient statistics of the observation x to the prior’s parameters.

E. Pitman. Sufficient statistics and intrinsic accuracy. Math. Proc. Cambridge Phil. Soc. 32(4), 1936.
P. Diaconis and D. Ylvisaker. Conjugate priors for exponential families. Annals of Statistics 7(2), 1979.

Probabilistic ML — P. Hennig, SS 2023 — © Philipp Hennig, 2023 CC BY-NC-SA 4.0 3


▶ Conjugate priors allow analytic Bayesian inference, by mapping it to
  ▶ “data processing” ϕ(x): R^{N×D} → R^F
  ▶ “inference” g(w; θ + ϕ(x)): simple (floating-point) addition yields the full, normalized posterior! (see the sketch below)
▶ (the implicit assumption in this formulation is that g can be computed in decent time)
▶ How can we construct them in general?
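A minimal sketch of this “data processing + addition” pattern for coin-toss data with a Beta prior (conjugate to the Bernoulli likelihood), written in the familiar Beta(a, b) parametrization rather than in natural parameters; all variable names are illustrative:

import numpy as np
from scipy import stats

# Coin-toss data: x_n in {0, 1}.
x = np.array([1, 0, 1, 1, 0, 1, 1, 1])

# "Data processing": sufficient statistics phi(x) = [number of heads, number of tails].
phi = np.array([x.sum(), (1 - x).sum()])

# "Inference": with a Beta(a, b) prior on the coin bias, the posterior parameters
# are obtained by simply adding phi(x) to the prior parameters.
theta_prior = np.array([1.0, 1.0])        # Beta(1, 1), i.e. uniform on [0, 1]
theta_post = theta_prior + phi            # the full, normalized posterior

posterior = stats.beta(*theta_post)
print("posterior mean of the coin bias:", posterior.mean())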

Probabilistic ML — P. Hennig, SS 2023 — © Philipp Hennig, 2023 CC BY-NC-SA 4.0 4


Exponential Families
Exponentials of a Linear Form

Definition (Exponential Family, simplified form)


Consider a random variable X taking values x ∈ X ⊂ R^n. A probability distribution for X with pdf of the functional form

$$ p_w(x) = h(x)\,\exp\left[\phi(x)^\intercal w - \log Z(w)\right] = \frac{h(x)}{Z(w)}\, e^{\phi(x)^\intercal w} = p(x \mid w) $$

is called an exponential family of probability measures. The function ϕ: X → R^d is called the sufficient statistics. The parameters w ∈ R^d are the natural parameters of p_w. The normalization constant Z(w): R^d → R is the partition function. The function h(x): X → R+ is the base measure. For notational convenience, it can be useful to re-parametrize the natural parameters w as w := η(θ) in terms of canonical parameters θ.
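To make the definition concrete, here is a minimal sketch of the Bernoulli distribution written in exactly this (h, ϕ, Z) form; the function names are chosen purely for illustration:

import numpy as np

# Bernoulli as an exponential family:
#   h(x) = 1,  phi(x) = [x],  w = log(p / (1 - p)),  log Z(w) = log(1 + e^w).
def log_Z(w):
    return np.logaddexp(0.0, w)              # log(1 + e^w), numerically stable

def pdf(x, w):
    h = 1.0                                   # base measure
    return h * np.exp(x * w - log_Z(w))       # h(x) exp[phi(x)^T w - log Z(w)]

w = np.log(0.3 / 0.7)                         # natural parameter for p = 0.3
print(pdf(1, w), pdf(0, w))                   # 0.3 and 0.7, up to rounding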

Probabilistic ML — P. Hennig, SS 2023 — © Philipp Hennig, 2023 CC BY-NC-SA 4.0 5


A Family Meeting
Exponential families provide the probabilistic analogue to data types

Name           sufficient stats           domain                                      use case

Bernoulli      ϕ(x) = [x]                 X = {0; 1}                                  coin toss
Poisson        ϕ(x) = [x]                 X = {0, 1, 2, . . .}                        emails per day
Laplace        ϕ(x) = [1, x]⊺             X = R                                       floods
Helmert (χ²)   ϕ(x) = [x, − log x]        X = R+                                      variances
Dirichlet      ϕ(x) = [log x]             X = probability simplex                     class probabilities
Euler (Γ)      ϕ(x) = [x, log x]          X = R+                                      variances
Wishart        ϕ(X) = [X, log |X|]        X = {X ∈ R^{N×N} | v⊺Xv ≥ 0 ∀v ∈ R^N}       covariances
Gauss          ϕ(X) = [X, XX⊺]            X = R^N                                     functions
Boltzmann      ϕ(X) = [X, triag(XX⊺)]     X = {0; 1}^N                                thermodynamics

Note: Each row of this table is one exponential family. Some authors (e.g. Murphy) call the entire table
“the exponential family”. We will not do this.
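In code, the “data type” view of this table amounts to little more than a lookup of the sufficient-statistic map; a small illustrative sketch for some of the scalar cases:

import numpy as np

# Sufficient statistics of a few rows of the table, for scalar x (illustrative):
sufficient_stats = {
    "Bernoulli": lambda x: np.array([x]),
    "Poisson":   lambda x: np.array([x]),
    "Euler":     lambda x: np.array([x, np.log(x)]),     # Gamma distribution
    "Gauss":     lambda x: np.array([x, x * x]),          # scalar case of [x, x x^T]
}

print(sufficient_stats["Euler"](2.5))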

Probabilistic ML — P. Hennig, SS 2023 — © Philipp Hennig, 2023 CC BY-NC-SA 4.0 6


CODE
– Thanks to Marvin Pförtner for pair coding –

Probabilistic ML — P. Hennig, SS 2023 — © Philipp Hennig, 2023 CC BY-NC-SA 4.0 7


Exponential Families have Conjugate Priors
but their normalization constant may not be known

▶ Consider the exponential family $p_w(x \mid w) = h(x)\,\exp\left[\phi(x)^\intercal w - \log Z(w)\right]$
▶ its conjugate prior is the exponential family

$$ p_\alpha(w \mid \alpha, \nu) = \exp\left[ \begin{bmatrix} w \\ -\log Z(w) \end{bmatrix}^{\!\intercal} \begin{bmatrix} \alpha \\ \nu \end{bmatrix} - \log F(\alpha, \nu) \right], \qquad F(\alpha, \nu) := \int \exp\left(\alpha^\intercal w - \nu \log Z(w)\right) \mathrm{d}w, $$

because

$$ p_\alpha(w \mid \alpha, \nu) \prod_{i=1}^{n} p_w(x_i \mid w) \;\propto\; p_\alpha\!\left(w \,\middle|\, \alpha + \sum_i \phi(x_i),\; \nu + n\right) $$

▶ and the predictive is

$$ p(x) = \int p_w(x \mid w)\, p_\alpha(w \mid \alpha, \nu)\, \mathrm{d}w = h(x) \int e^{(\phi(x)+\alpha)^\intercal w \,-\, (\nu+1)\log Z(w) \,-\, \log F(\alpha,\nu)}\, \mathrm{d}w = h(x)\,\frac{F(\phi(x)+\alpha,\ \nu+1)}{F(\alpha, \nu)} $$

Computing F(α, ν) can be tricky. But if we have it, it completely automates Bayesian inference!
Finding F is thus the challenge when constructing an EF.
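For the Bernoulli likelihood, for example, w = log(p/(1 − p)), log Z(w) = log(1 + e^w), and a change of variables gives F(α, ν) = B(α, ν − α) (the Beta function, valid for 0 < α < ν). A minimal sketch that uses this closed form for both the posterior update and the predictive; parametrization and names are illustrative:

import numpy as np
from scipy.special import betaln              # log of the Beta function B(., .)

def log_F(alpha, nu):
    # log F(alpha, nu) = log B(alpha, nu - alpha) for the Bernoulli family
    return betaln(alpha, nu - alpha)

# Posterior update: alpha <- alpha + sum_i phi(x_i), nu <- nu + n.
x = np.array([1, 1, 0, 1, 0, 1, 1])
alpha, nu = 1.0, 2.0                          # prior hyperparameters
alpha_post, nu_post = alpha + x.sum(), nu + len(x)

# Predictive via the ratio of partition functions (here h(x) = 1):
#   p(x_new) = F(phi(x_new) + alpha, nu + 1) / F(alpha, nu)
def predictive(x_new, alpha, nu):
    return np.exp(log_F(x_new + alpha, nu + 1) - log_F(alpha, nu))

print(predictive(1, alpha_post, nu_post))     # = alpha_post / nu_post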
Probabilistic ML — P. Hennig, SS 2023 — © Philipp Hennig, 2023 CC BY-NC-SA 4.0 8
Atomic inference machines
Exponential families with tractable conjugate priors make Bayesian ML tractable

▶ Once we decide to use a particular exponential family (h(x), ϕ(x)) as the model for some data x,
we automatically get a conjugate prior in the form of another exponential family
▶ in principle, this provides the means to do analytic Bayesian inference, if we can compute the
partition function Z(w) of the likelihood and F(α, ν), that of the conjugate prior.
▶ This solves the learning problem, i.e. how to extract information from the data. It does not
guarantee that we will be able to compute moments, or sample from the posterior, but these tasks can
be encapsulated in subroutines now that the data has been dealt with.

Probabilistic ML — P. Hennig, SS 2023 — © Philipp Hennig, 2023 CC BY-NC-SA 4.0 9


CODE

Probabilistic ML — P. Hennig, SS 2023 — © Philipp Hennig, 2023 CC BY-NC-SA 4.0 10


Example: The Gaussian distribution
There it is, again!

 
$$
\begin{aligned}
p_w(x \mid w) &= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right] = \mathcal{N}(x;\, \mu, \sigma^2) \\
&= \exp\left[-\frac{x^2 - 2\mu x + \mu^2}{2\sigma^2} - \log\sqrt{2\pi\sigma^2}\right] \\
&= \exp\left[\begin{bmatrix} x & -\tfrac{1}{2}x^2 \end{bmatrix} \begin{bmatrix} \mu/\sigma^2 \\ 1/\sigma^2 \end{bmatrix} - \left(\frac{\mu^2}{2\sigma^2} + \log\sqrt{2\pi\sigma^2}\right)\right] \\
&= \exp\Biggl[\begin{bmatrix} \phi_1(x) & \phi_2(x) \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} - \underbrace{\tfrac{1}{2}\left(\frac{w_1^2}{w_2} - \log w_2 + \log(2\pi)\right)}_{\log Z(w)}\Biggr]
\end{aligned}
$$

▶ The natural parameters are the precision σ⁻² and the precision-adjusted mean µσ⁻²


▶ the sufficient statistics are the first two (non-central) sample moments of the data
▶ The conjugate prior is the Normal-Gamma, the predictive marginal is the Student-t distribution
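A small numerical sanity check of this reparametrization (variable names illustrative): convert (µ, σ²) to natural parameters and confirm that the exponential-family expression reproduces the usual normal pdf.

import numpy as np
from scipy import stats

mu, sigma2 = 1.5, 0.7

# Natural parameters w = [mu / sigma^2, 1 / sigma^2] and sufficient statistics:
w = np.array([mu / sigma2, 1.0 / sigma2])

def phi(x):
    return np.array([x, -0.5 * x**2])

def log_Z(w):
    return 0.5 * (w[0]**2 / w[1] - np.log(w[1]) + np.log(2 * np.pi))

x = 0.3
p_ef  = np.exp(phi(x) @ w - log_Z(w))
p_ref = stats.norm(mu, np.sqrt(sigma2)).pdf(x)
print(p_ef, p_ref)                            # agree up to floating-point error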
Probabilistic ML — P. Hennig, SS 2023 — © Philipp Hennig, 2023 CC BY-NC-SA 4.0 11
Analytic Hierarchical Bayesian Inference
Inferring Mean and Co-Variance of a Gaussian

$$ p(x \mid \mu, \sigma) = \prod_{i=1}^{n} \mathcal{N}(x_i;\, \mu, \sigma^2) $$

$$ p(\mu, \sigma \mid \mu_0, \nu, \alpha, \beta) = \mathcal{N}\!\left(\mu;\ \mu_0,\ \frac{\sigma^2}{\nu}\right) \mathcal{G}(\sigma^{-2};\, \alpha, \beta) $$

$$ p(\mu, \sigma \mid x, \mu_0, \nu, \alpha, \beta) = \mathcal{N}\!\left(\mu;\ \frac{\nu\mu_0 + n\bar{x}}{\nu + n},\ \frac{\sigma^2}{\nu + n}\right) \cdot \mathcal{G}\!\left(\sigma^{-2};\ \alpha + \frac{n}{2},\ \beta + \frac{1}{2}\sum_{i=1}^{n}(x_i - \bar{x})^2 + \frac{n\nu}{2(n+\nu)}(\bar{x} - \mu_0)^2\right) $$

where $\bar{x} := \frac{1}{n}\sum_{i=1}^{n} x_i$

[Portrait: William S. Gosset (1876–1937)]
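These update equations translate directly into code; a minimal sketch for scalar data, with illustrative names:

import numpy as np

def normal_gamma_posterior(x, mu0, nu, alpha, beta):
    """Posterior hyperparameters of the Normal-Gamma prior after observing x_1, ..., x_n."""
    n, xbar = len(x), np.mean(x)
    mu_n    = (nu * mu0 + n * xbar) / (nu + n)
    nu_n    = nu + n
    alpha_n = alpha + n / 2
    beta_n  = (beta + 0.5 * np.sum((x - xbar) ** 2)
               + n * nu / (2 * (n + nu)) * (xbar - mu0) ** 2)
    return mu_n, nu_n, alpha_n, beta_n

x = np.random.default_rng(0).normal(loc=2.0, scale=0.5, size=20)
print(normal_gamma_posterior(x, mu0=0.0, nu=1.0, alpha=1.0, beta=1.0))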

Probabilistic ML — P. Hennig, SS 2023 — © Philipp Hennig, 2023 CC BY-NC-SA 4.0 12


What if we don’t know the log partition function?

Probabilistic ML — P. Hennig, SS 2023 — © Philipp Hennig, 2023 CC BY-NC-SA 4.0 13


Exponential Families Allow Efficient Maximum Likelihood Estimation
why they’re called sufficient statistics

▶ Consider the exponential family

pw (x | w) = exp [ϕ(x)⊺ w − log Z(w)]

▶ for iid data:


$$ p_w(x_1, x_2, \ldots, x_n \mid w) = \prod_{i=1}^{n} p_w(x_i \mid w) = \exp\left( \sum_{i} \phi(x_i)^\intercal w - n \log Z(w) \right) $$

▶ to find the maximum likelihood estimate for w, set


$$ \nabla_w \log p(x \mid w) = 0 \quad\Longrightarrow\quad \nabla_w \log Z(w) = \frac{1}{n} \sum_{i} \phi(x_i) $$

▶ hence, collect statistics of ϕ, compute ∇w log Z(w) and solve the above for w
▶ this may or may not be possible analytically, but in any case it encapsulates the data!
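As a small illustration of this recipe, the Bernoulli case solved numerically (even though a closed form exists here): collect the averaged statistic and solve ∇_w log Z(w) = ϕ̄, where ∇_w log Z(w) is the logistic sigmoid. Names are illustrative.

import numpy as np
from scipy.optimize import root
from scipy.special import expit               # logistic sigmoid

# Bernoulli family: phi(x) = x, log Z(w) = log(1 + e^w), so grad_w log Z(w) = sigmoid(w).
x = np.array([1, 0, 1, 1, 1, 0, 1])
phi_bar = x.mean()                            # (1/n) sum_i phi(x_i)

# Moment matching: solve sigmoid(w) = phi_bar for the natural parameter w.
sol = root(lambda w: expit(w) - phi_bar, x0=0.0)
w_hat = sol.x[0]
print(w_hat, np.log(phi_bar / (1 - phi_bar))) # numerical solution vs closed-form logit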

Probabilistic ML — P. Hennig, SS 2023 — © Philipp Hennig, 2023 CC BY-NC-SA 4.0 14


Example for Maximum Likelihood Estimation
the Gaussian case

Example: Assume we observe samples x_i drawn i.i.d. from

$$ p(x \mid \mu, \sigma^2) = \prod_{i=1}^{n} \mathcal{N}(x_i;\, \mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(x_i - \mu)^2}{2\sigma^2}\right] $$

Remember that the Gaussian is an EF with

$$ w := \begin{bmatrix} \mu/\sigma^2 \\ 1/\sigma^2 \end{bmatrix}, \qquad \phi(x) := \begin{bmatrix} x \\ -\tfrac{1}{2}x^2 \end{bmatrix}, \qquad \log Z(w) := \tfrac{1}{2}\left( \frac{w_1^2}{w_2} - \log w_2 + \log(2\pi) \right), $$

so we find the maximum likelihood estimate by computing

$$ \nabla_w \log Z(w) = \begin{bmatrix} w_1 / w_2 \\ -\tfrac{1}{2}\left(\frac{w_1^2}{w_2^2} + \frac{1}{w_2}\right) \end{bmatrix} = \begin{bmatrix} \mu \\ -\tfrac{1}{2}(\mu^2 + \sigma^2) \end{bmatrix} \overset{!}{=} \frac{1}{n}\sum_{i=1}^{n} \begin{bmatrix} x_i \\ -\tfrac{1}{2}x_i^2 \end{bmatrix}, $$

hence setting

$$ \hat{\mu} = \bar{x} := \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \text{and} \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \hat{\mu}^2. $$
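The same computation in code, using only the averaged sufficient statistics (a sketch with illustrative names):

import numpy as np

x = np.random.default_rng(1).normal(loc=-1.0, scale=2.0, size=1000)

# Averaged sufficient statistics (1/n) sum_i phi(x_i) with phi(x) = [x, -x^2/2]:
phi_bar = np.array([x.mean(), -0.5 * np.mean(x**2)])

# Solve grad_w log Z(w) = phi_bar:  mu = phi_bar[0],  mu^2 + sigma^2 = -2 phi_bar[1].
mu_hat     = phi_bar[0]
sigma2_hat = -2.0 * phi_bar[1] - mu_hat**2
print(mu_hat, sigma2_hat)                     # close to -1.0 and 4.0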

Probabilistic ML — P. Hennig, SS 2023 — © Philipp Hennig, 2023 CC BY-NC-SA 4.0 15


Why stop at maximum likelihood?
▶ thanks to auto-diff, we can do approximate full Bayesian inference

Probabilistic ML — P. Hennig, SS 2023 — © Philipp Hennig, 2023 CC BY-NC-SA 4.0 16


Summary: Exponential Families
▶ reduce Bayesian inference to
  ▶ modelling: designing / computing the sufficient statistics ϕ(x) and the partition function Z(w)
  ▶ computing the posterior: essentially evaluating the log partition function F of the conjugate prior
▶ if F is not tractable, we can still do Laplace approximations:
  ▶ find the mode ŵ of the posterior, by solving the root-finding problem
    $$ \nabla_w \log p(w \mid x) = 0, \quad\text{i.e.}\quad \nabla_w \log Z(\hat{w}) = \frac{\alpha + \sum_{i=1}^{n} \phi(x_i)}{\nu + n} $$
  ▶ evaluate the Hessian Ψ = ∇_w ∇_w^⊺ log p(w | x) at ŵ
  ▶ approximate the posterior as N(w; ŵ, −Ψ⁻¹)

Please cite this course as
@techreport{Tuebingen_ProbML23,
  title = {Probabilistic Machine Learning},
  author = {Hennig, Philipp},
  series = {Lecture Notes in Machine Learning},
  year = {2023},
  institution = {Tübingen AI Center}
}

Laplace approximations show that Bayesian inference is not “about the prior” at all, but rather about capturing the (local) geometry of the likelihood (around the mode). Uncertainty is about tracking all possible solutions at once (or at least as many as possible), not one single estimate.
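A minimal sketch of this Laplace recipe with automatic differentiation, for a one-dimensional Bernoulli model with conjugate hyperparameters (jax is assumed here purely for illustration; any autodiff framework works, and all names are illustrative):

import jax
import jax.numpy as jnp

# Bernoulli likelihood in natural parameters with conjugate hyperparameters (alpha, nu).
x = jnp.array([1.0, 1.0, 0.0, 1.0, 1.0])
alpha, nu = 1.0, 2.0

def log_posterior(w):                         # unnormalized log p(w | x)
    log_Z = jnp.logaddexp(0.0, w)             # log(1 + e^w)
    return (alpha + x.sum()) * w - (nu + x.size) * log_Z

grad = jax.grad(log_posterior)
hess = jax.grad(grad)                         # second derivative (1-d Hessian)

# 1) find the mode w_hat by a few Newton steps on grad log p(w | x) = 0
w_hat = jnp.array(0.0)
for _ in range(20):
    w_hat = w_hat - grad(w_hat) / hess(w_hat)

# 2) Laplace approximation N(w; w_hat, -Psi^{-1}) with Psi the Hessian at the mode
psi = hess(w_hat)
print("mode:", w_hat, " approximate variance:", -1.0 / psi)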

Probabilistic ML — P. Hennig, SS 2023 — © Philipp Hennig, 2023 CC BY-NC-SA 4.0 17
