Probabilistic Machine Learning: Exponential Families
Lecture 05
Exponential Families
Philipp Hennig
04 May 2023
Faculty of Science
Department of Computer Science
Chair for the Methods of Machine Learning
Bonus Points
it’s ok not to get everything right
$$
p(w \mid x) = \frac{p(x \mid w)\, p(w)}{\int p(x \mid w)\, p(w)\, \mathrm{d}w} \;\overset{\text{``often''}}{=}\; \frac{\prod_{n=1}^{N} p(x_n \mid w)\, p(w)}{\int \prod_{n=1}^{N} p(x_n \mid w)\, p(w)\, \mathrm{d}w}
$$
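To make the factorized update concrete, here is a minimal numerical sketch (my own toy setup, not from the lecture): a grid approximation of the posterior over a scalar parameter w, where the i.i.d. assumption turns the likelihood into a product over the data points x_n.

```python
import numpy as np
from scipy import stats

# Toy sketch (assumed setup): Gaussian likelihood with known variance, Gaussian prior,
# posterior over the mean w computed on a grid. The i.i.d. assumption lets the
# likelihood factorize into a product over the observations x_n.
rng = np.random.default_rng(0)
x = rng.normal(loc=1.5, scale=1.0, size=20)            # observed data x_1, ..., x_N
w_grid = np.linspace(-3.0, 5.0, 2001)                  # grid over the parameter w

log_prior = stats.norm.logpdf(w_grid, loc=0.0, scale=2.0)
# sum_n log p(x_n | w), evaluated for every grid value of w (log-space for stability)
log_lik = stats.norm.logpdf(x[:, None], loc=w_grid[None, :], scale=1.0).sum(axis=0)

log_unnorm = log_prior + log_lik
unnorm = np.exp(log_unnorm - log_unnorm.max())
posterior = unnorm / (unnorm.sum() * (w_grid[1] - w_grid[0]))   # divide by the evidence
```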
E. Pitman. Sufficient statistics and intrinsic accuracy. Math. Proc. Cambr. Phil. Soc. 32(4), 1936.
P. Diaconis and D. Ylvisaker, Conjugate priors for exponential families. Annals of Statistics 7(2), 1979.
A family of probability measures $p_w$ over $\mathbb{X}$ with densities of the form

$$
p_w(x) = h(x)\, \exp\big(\phi(x)^\top w - \log Z(w)\big)
$$

is called an exponential family of probability measures. The function $\phi : \mathbb{X} \to \mathbb{R}^d$ is called the sufficient statistics. The parameters $w \in \mathbb{R}^d$ are the natural parameters of $p_w$. The normalization constant $Z : \mathbb{R}^d \to \mathbb{R}$ is the partition function. The function $h : \mathbb{X} \to \mathbb{R}_+$ is the base measure. For notational convenience, it can be useful to re-parametrize the natural parameters $w$ as $w := \eta(\theta)$ in terms of canonical parameters $\theta$.
Note: Each row of this table is one exponential family. Some authors (e.g. Murphy) call the entire table
“the exponential family”. We will not do this.
Exponential Families have Conjugate Priors
but their normalization constant may not be known
$$
p(x \mid \alpha, \nu) = \int p(x \mid w)\, p(w \mid \alpha, \nu)\, \mathrm{d}w = h(x)\, \frac{F(\phi(x) + \alpha,\; \nu + 1)}{F(\alpha, \nu)}
$$
Computing F(α, ν) can be tricky. But if we have it, it completely automates Bayesian inference!
Finding F is thus the challenge when constructing an EF.
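As a concrete case where F is available in closed form, here is a small sketch (my own choice of example; the function and variable names are not from the lecture): the Bernoulli likelihood written as an exponential family with $\phi(x) = x$, $h(x) = 1$, natural parameter $w = \log(\theta/(1-\theta))$ and $\log Z(w) = \log(1 + e^w)$. Its conjugate-prior normalizer is $F(\alpha, \nu) = B(\alpha, \nu - \alpha)$ (a Beta function, for $0 < \alpha < \nu$), which the code checks by quadrature together with the evidence formula above.

```python
import numpy as np
from scipy.special import betaln
from scipy.integrate import quad

def log_F(alpha, nu):
    """Closed-form log F(alpha, nu) = log B(alpha, nu - alpha); requires 0 < alpha < nu."""
    return betaln(alpha, nu - alpha)

def F_by_quadrature(alpha, nu):
    """Numerical check: F(alpha, nu) = int exp(alpha*w - nu*log(1 + e^w)) dw."""
    integrand = lambda w: np.exp(alpha * w - nu * np.log1p(np.exp(w)))
    val, _ = quad(integrand, -50.0, 50.0)
    return val

alpha, nu = 2.0, 5.0
print(np.exp(log_F(alpha, nu)), F_by_quadrature(alpha, nu))        # both approximately 1/12

# evidence of a single observation x = 1 under the prior:
#   p(x=1 | alpha, nu) = F(1 + alpha, nu + 1) / F(alpha, nu) = alpha / nu
print(np.exp(log_F(alpha + 1.0, nu + 1.0) - log_F(alpha, nu)), alpha / nu)
```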
Atomic inference machines
Exponential families with tractable conjugate priors make Bayesian ML tractable
▶ Once we decide to use a particular exponential family (h(x), ϕ(x)) as the model for some data x,
we automatically get a conjugate prior in the form of another exponential family
▶ in principle, this provides the means to do analytic Bayesian inference, if we can compute the
partition function Z(w) of the likelihood and F(α, ν), that of the conjugate prior.
▶ This solves the learning problem, i.e. how to extract information from the data. It does not
guarantee that we will be able to compute moments, or sample from the posterior, but these can
be encapsulated in subroutines, now that the data has been dealt with (see the sketch below).
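A minimal sketch of this "atomic inference machine" view (the class and names below are mine, not the lecture's): once the sufficient statistics $\phi$ are fixed, conditioning on data amounts to adding statistics to the conjugate-prior parameters $(\alpha, \nu)$; everything downstream can be delegated to routines that never see the raw data again.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ConjugateEF:
    """Conjugate prior of an exponential family, parametrized by (alpha, nu)."""
    alpha: np.ndarray        # same shape as the sufficient statistics phi(x)
    nu: float                # pseudo-count of prior observations

    def update(self, phi_of_data: np.ndarray) -> "ConjugateEF":
        """Condition on data given as an (N, d) array of sufficient statistics phi(x_n)."""
        return ConjugateEF(self.alpha + phi_of_data.sum(axis=0), self.nu + len(phi_of_data))
```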
$$
\begin{aligned}
p_w(x \mid w) &= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) = \mathcal{N}(x; \mu, \sigma^2) \\
&= \exp\left(-\frac{x^2 - 2\mu x + \mu^2}{2\sigma^2} - \log\sqrt{2\pi\sigma^2}\right) \\
&= \exp\left(\begin{pmatrix} x & -\tfrac{1}{2}x^2 \end{pmatrix} \begin{pmatrix} \mu/\sigma^2 \\ 1/\sigma^2 \end{pmatrix} - \left(\frac{\mu^2}{2\sigma^2} + \log\sqrt{2\pi\sigma^2}\right)\right) \\
&= \exp\left(\begin{pmatrix} \phi_1(x) & \phi_2(x) \end{pmatrix} \begin{pmatrix} w_1 \\ w_2 \end{pmatrix} - \underbrace{\tfrac{1}{2}\left(\frac{w_1^2}{w_2} - \log w_2 + \log(2\pi)\right)}_{\log Z(w)}\right)
\end{aligned}
$$
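A quick numerical check of this parametrization (my own sketch): convert $(\mu, \sigma^2)$ into the natural parameters $w = [\mu/\sigma^2, 1/\sigma^2]$, evaluate the exponential-family form with $h(x) = 1$, and compare against the standard Gaussian log-density.

```python
import numpy as np
from scipy import stats

mu, sigma2 = 1.3, 0.7
w = np.array([mu / sigma2, 1.0 / sigma2])          # natural parameters w1, w2

def ef_logpdf(x):
    phi = np.array([x, -0.5 * x**2])               # sufficient statistics phi(x)
    log_Z = 0.5 * (w[0]**2 / w[1] - np.log(w[1]) + np.log(2.0 * np.pi))
    return phi @ w - log_Z                         # h(x) = 1 for the Gaussian

x = 0.4
print(ef_logpdf(x))
print(stats.norm.logpdf(x, loc=mu, scale=np.sqrt(sigma2)))   # should agree
```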
$$
p(x \mid \mu, \sigma) = \prod_{i=1}^{n} \mathcal{N}(x_i; \mu, \sigma^2)
$$

$$
p(\mu, \sigma \mid \mu_0, \nu, \alpha, \beta) = \mathcal{N}\!\left(\mu;\; \mu_0,\; \frac{\sigma^2}{\nu}\right) \mathcal{G}(\sigma^{-2}; \alpha, \beta)
$$

$$
p(\mu, \sigma \mid x, \mu_0, \nu, \alpha, \beta) = \mathcal{N}\!\left(\mu;\; \frac{\nu\mu_0 + n\bar{x}}{\nu + n},\; \frac{\sigma^2}{\nu + n}\right) \cdot \mathcal{G}\!\left(\sigma^{-2};\; \alpha + \frac{n}{2},\; \beta + \frac{1}{2}\sum_{i=1}^{n}(x_i - \bar{x})^2 + \frac{n\nu}{2(n + \nu)}(\bar{x} - \mu_0)^2\right)
$$

where $\bar{x} := \frac{1}{n}\sum_{i=1}^{n} x_i$.

[Portrait: William S. Gosset (1876–1937)]
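The posterior update above translates directly into code; here is a short sketch (function and variable names are my own):

```python
import numpy as np

def normal_gamma_posterior(x, mu0, nu, alpha, beta):
    """Posterior parameters of the Normal-Gamma prior after observing i.i.d. Gaussian data x."""
    n, xbar = len(x), np.mean(x)
    mu_n    = (nu * mu0 + n * xbar) / (nu + n)
    nu_n    = nu + n
    alpha_n = alpha + n / 2.0
    beta_n  = beta + 0.5 * np.sum((x - xbar)**2) + n * nu / (2.0 * (n + nu)) * (xbar - mu0)**2
    return mu_n, nu_n, alpha_n, beta_n

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=0.5, size=50)
print(normal_gamma_posterior(x, mu0=0.0, nu=1.0, alpha=2.0, beta=2.0))
```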
▶ at the maximum-likelihood estimate, the gradient of the log-likelihood vanishes, which gives the condition $\nabla_w \log Z(w) = \frac{1}{n}\sum_{i=1}^{n} \phi(x_i)$; hence, collect statistics of $\phi$, compute $\nabla_w \log Z(w)$ and solve this equation for $w$ (see the sketch below)
▶ this may or may not be possible analytically, but in any case the sufficient statistics encapsulate the data!
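For the Gaussian parametrization above this condition can be solved by hand; the sketch below (my own check) recovers the usual estimates $\hat\mu = \bar{x}$ and $\hat\sigma^2 = \overline{x^2} - \bar{x}^2$, since $\nabla_w \log Z(w) = [\mu,\; -(\mu^2 + \sigma^2)/2]$.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=-0.7, scale=1.4, size=1000)

# empirical sufficient statistics (1/n) sum_i phi(x_i) with phi(x) = [x, -x^2/2]
s1, s2 = np.mean(x), np.mean(-0.5 * x**2)

# solve grad_w log Z(w) = [mu, -(mu^2 + sigma^2)/2] = [s1, s2] for (mu, sigma^2)
mu_hat = s1
sigma2_hat = -2.0 * s2 - mu_hat**2
w_hat = np.array([mu_hat / sigma2_hat, 1.0 / sigma2_hat])    # natural parameters

print(mu_hat, sigma2_hat)        # close to (-0.7, 1.4**2)
```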
Laplace approximations show that Bayesian inference is not "about the prior" at all, but rather about capturing the (local) geometry of the likelihood (around the mode). Uncertainty is about tracking all possible solutions at once (or at least as many as possible), not one single estimate.