Lecture 25
Customizing Probabilistic Models
Philipp Hennig
19 July 2021
Faculty of Science
Department of Computer Science
Chair for the Methods of Machine Learning
The Toolbox
Framework:
∫ p(x_1, x_2) dx_2 = p(x_1)        p(x_1, x_2) = p(x_1 | x_2) p(x_2)        p(x | y) = p(y | x) p(x) / p(y)
Modelling:
▶ graphical models
▶ Gaussian distributions
▶ (deep) learnt representations
▶ Kernels
▶ Markov Chains
▶ Exponential Families / Conjugate Priors
▶ Factor Graphs & Message Passing

Computation:
▶ Monte Carlo
▶ Linear algebra / Gaussian inference
▶ maximum likelihood / MAP
▶ Laplace approximations
▶ EM (iterative maximum likelihood)
▶ variational inference / mean field
Variational Inference
▶ is a general framework to construct approximating probability distributions q(z) to non-analytic posterior distributions p(z | x) by minimizing the functional D_KL[q(z) ∥ p(z | x)] (equivalently, by maximizing the evidence lower bound).
▶ the beauty is that we get to choose q, so one can nearly always find a tractable approximation.
▶ If we impose the mean-field approximation q(z) = ∏_i q(z_i), we get the update log q*_j(z_j) = E_{q(z_{i≠j})}[log p(x, z)] + const.
▶ for an exponential-family p, things are particularly simple: we only need the expectations under q of the sufficient statistics.
Variational Inference is an extremely flexible and powerful approximation method. Its downside is that
constructing the bound and update equations can be tedious. For a quick test, variational inference is
often not a good idea. But for a deployed product, it can be the most powerful tool in the box.
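For reference, the functional in question is the KL divergence from q to the posterior; minimizing it is the same as maximizing the evidence lower bound (ELBO), which is the quantity tracked as BOUND in the LDA algorithm below:

```latex
\log p(x)
  = \underbrace{\mathbb{E}_{q(z)}\!\left[\log\frac{p(x,z)}{q(z)}\right]}_{\mathcal{L}(q)\;\text{(ELBO)}}
  + \underbrace{D_{\mathrm{KL}}\big(q(z)\,\|\,p(z\mid x)\big)}_{\ge 0},
\qquad
\arg\min_q D_{\mathrm{KL}}\big(q\,\|\,p(\cdot\mid x)\big) = \arg\max_q \mathcal{L}(q).
```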
Latent Dirichlet Allocation
Topic Models [Blei, D. M., Ng, A. Y. & Jordan, M. I. (2003) JMLR 3, 993–1022]
[Graphical model: α_d → π_d → c_di → w_di ← θ_k ← β_k, with plates over words i = 1, …, I_d, documents d = 1, …, D, and topics k = 1, …, K.]
The Variational Approximation
closing the loop

q(π_d) = D(π_d; [α̃_dk]_{k=1,…,K})   with   α̃_dk := α_dk + Σ_{i=1}^{I_d} γ̃_dik,   ∀ d = 1, …, D

q(θ_k) = D(θ_k; [β̃_kv]_{v=1,…,V})   with   β̃_kv := β_kv + Σ_{d=1}^{D} Σ_{i=1}^{I_d} γ̃_dik I(w_di = v),   ∀ k = 1, …, K

q(c_di) = ∏_k γ̃_dik^{c_dik},   ∀ d, i = 1, …, I_d

where γ̃_dik = γ_dik / Σ_k γ_dik and (note that Σ_k α̃_dk = const.)

γ_dik = exp( E_{q(π_d)}[log π_dk] + E_{q(θ_k)}[log θ_{k,w_di}] )
      = exp( ψ(α̃_dk) + ψ(β̃_{k,w_di}) − ψ(Σ_v β̃_kv) )
Building the Algorithm
updating and evaluating the bound
procedure LDA(W, α, β)
    γ̃_dik ← DIRICHLET_RAND(α)                              ▷ initialize
    L ← −∞
    while L not converged do
        for d = 1, …, D; k = 1, …, K do
            α̃_dk ← α_dk + Σ_i γ̃_dik                        ▷ update document-topic distributions
        end for
        for k = 1, …, K; v = 1, …, V do
            β̃_kv ← β_kv + Σ_{d,i} γ̃_dik I(w_di = v)        ▷ update topic-word distributions
        end for
        for d = 1, …, D; k = 1, …, K; i = 1, …, I_d do
            γ̃_dik ← exp( ψ(α̃_dk) + ψ(β̃_{k,w_di}) − ψ(Σ_v β̃_kv) )   ▷ update word-topic assignments
            γ̃_dik ← γ̃_dik / γ̃_di·                          ▷ normalize over k
        end for
        L ← BOUND(γ̃, w, α̃, β̃)                              ▷ update bound
    end while
end procedure
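A minimal NumPy sketch of this update loop — not the lecture's reference implementation; the dense document-by-vocabulary count matrix `ndv`, symmetric hyperparameters, and a fixed iteration count (instead of monitoring the bound) are simplifying assumptions of this illustration:

```python
import numpy as np
from scipy.special import digamma   # the digamma function ψ

def lda_mean_field(ndv, K, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Mean-field variational updates for LDA.

    ndv : (D, V) array of word counts per document (bag of words).
    Returns the variational Dirichlet parameters and the responsibilities.
    """
    rng = np.random.default_rng(seed)
    D, V = ndv.shape
    # gamma[d, v, k]: responsibility of topic k for word v in document d;
    # it is shared by all tokens of word v in d (the update depends on i only via w_di).
    gamma = rng.dirichlet(np.ones(K), size=(D, V))                  # (D, V, K)
    for _ in range(iters):
        # α~_dk = α_dk + Σ_i γ~_dik  (document-topic parameters)
        alpha_t = alpha + np.einsum('dv,dvk->dk', ndv, gamma)       # (D, K)
        # β~_kv = β_kv + Σ_{d,i} γ~_dik I(w_di = v)  (topic-word parameters)
        beta_t = beta + np.einsum('dv,dvk->kv', ndv, gamma)         # (K, V)
        # γ_dik ∝ exp(ψ(α~_dk) + ψ(β~_{k,w_di}) − ψ(Σ_v β~_kv))
        log_gamma = (digamma(alpha_t)[:, None, :]
                     + digamma(beta_t).T[None, :, :]
                     - digamma(beta_t.sum(axis=1))[None, None, :])
        log_gamma -= log_gamma.max(axis=-1, keepdims=True)          # numerical stability
        gamma = np.exp(log_gamma)
        gamma /= gamma.sum(axis=-1, keepdims=True)                  # normalize over k
    return alpha_t, beta_t, gamma
```

As a smoke test, something like `lda_mean_field(np.random.poisson(1.0, (20, 100)), K=5)` runs in well under a second.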
Exponential Family Approximations
The connection to EM is not accidental
p(x, z | η) = ∏_{n=1}^{N} exp( η^⊺ φ(x_n, z_n) − log Z(η) )

with conjugate prior p(η | ν, v) = exp( η^⊺ v − ν log Z(η) − log F(ν, v) )

▶ and assume q(z, η) = q(z) · q(η). Then q is in the same exponential family, with

log q*(z) = E_{q(η)}[log p(x, z | η)] + const. = Σ_{n=1}^{N} E_{q(η)}[η]^⊺ φ(x_n, z_n)

q*(z) = ∏_{n=1}^{N} exp( E[η]^⊺ φ(x_n, z_n) − log Z(E[η]) )    (note induced factorization)
Even, and especially, if you consider variational approximations,
using conjugate exponential-family priors can make life much easier.
Reminder: Collapsed Gibbs Sampling
It pays off to look closely at the math! T. L. Griffiths & M. Steyvers, Finding scientific topics, PNAS 101/1 (4/2004), 5228–5235
A Collapsed Gibbs Sampler for LDA
It pays off to look closely at the math! T. L. Griffiths & M. Steyvers, Finding scientific topics, PNAS 101/1 (4/2004), 5228–5235
p(C, W) = ∏_d ( Γ(Σ_k α_dk) / Γ(Σ_k (α_dk + n_dk·)) ) ∏_k ( Γ(α_dk + n_dk·) / Γ(α_dk) ) · ∏_k ( Γ(Σ_v β_kv) / Γ(Σ_v (β_kv + n_·kv)) ) ∏_v ( Γ(β_kv + n_·kv) / Γ(β_kv) )
A collapsed sampling method can converge much faster by eliminating the latent variables that mediate
between individual data.
procedure LDA(W, α, β)
    n_dkv ← 0  ∀ d, k, v                                   ▷ initialize counts
    while true do
        for d = 1, …, D; i = 1, …, I_d do                   ▷ can be parallelized
            sample c_di from p(c_dik = 1 | C^{\di}, W) ∝ (α_dk + n_{dk·}^{\di}) (β_{k,w_di} + n_{·k,w_di}^{\di}) (Σ_v (β_kv + n_{·kv}^{\di}))^{−1}   ▷ sample assignment
            n ← UPDATECOUNTS(c_di)                          ▷ update counts (check whether first pass or repeat)
        end for
    end while
end procedure
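A compact, single-threaded Python sketch of one such collapsed Gibbs sweep — symmetric hyperparameters and the `docs` representation (a list of word-index arrays) are assumptions of this illustration, not the lecture's code:

```python
import numpy as np

def collapsed_gibbs_lda(docs, K, V, alpha=0.1, beta=0.01, sweeps=200, seed=0):
    """Collapsed Gibbs sampling for LDA (in the spirit of Griffiths & Steyvers, 2004).

    docs : list of 1-D integer arrays; docs[d][i] is the vocabulary index of word i in doc d.
    """
    rng = np.random.default_rng(seed)
    D = len(docs)
    z = [rng.integers(K, size=len(doc)) for doc in docs]   # initial topic assignments
    n_dk = np.zeros((D, K))                                # words in doc d assigned to topic k
    n_kv = np.zeros((K, V))                                # occurrences of word v assigned to topic k
    n_k = np.zeros(K)                                      # total words assigned to topic k
    for d, doc in enumerate(docs):
        np.add.at(n_dk[d], z[d], 1)
        np.add.at(n_kv, (z[d], doc), 1)
        np.add.at(n_k, z[d], 1)
    for _ in range(sweeps):
        for d, doc in enumerate(docs):
            for i, v in enumerate(doc):
                k_old = z[d][i]
                # remove word i from all counts (the "\di" statistics)
                n_dk[d, k_old] -= 1; n_kv[k_old, v] -= 1; n_k[k_old] -= 1
                # p(c_dik = 1 | rest) ∝ (α + n_dk)(β + n_kv)/(Vβ + n_k)   [symmetric α, β]
                p = (alpha + n_dk[d]) * (beta + n_kv[:, v]) / (V * beta + n_k)
                k_new = rng.choice(K, p=p / p.sum())
                z[d][i] = k_new
                n_dk[d, k_new] += 1; n_kv[k_new, v] += 1; n_k[k_new] += 1
    return z, n_dk, n_kv
```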
Can we do the same for variational inference?
Why don't we use the mean field in our variational bound? Yee Whye Teh, David Newman & Max Welling, NeurIPS 2007
If we do not force q(Θ, Π) to be independent of q(C), but instead use q(C, Θ, Π) = q(C) q(Θ, Π | C), the optimal choice is simply q(Θ, Π | C) = p(Θ, Π | C, W), and the bound will be tight in Π, Θ — so Π and Θ can be marginalized (collapsed) analytically, leaving a bound over q(C) alone.
A Collapsed Variational Bound
Why don’t we use the mean field in our variational bound? Yee Whye Teh, David Newman & Max Welling, NeurIPS 2007
p(C, W) = ∏_d ( Γ(Σ_k α_dk) / Γ(Σ_k (α_dk + n_dk·)) ) ∏_k ( Γ(α_dk + n_dk·) / Γ(α_dk) ) · ∏_k ( Γ(Σ_v β_kv) / Γ(Σ_v (β_kv + n_·kv)) ) ∏_v ( Γ(β_kv + n_·kv) / Γ(β_kv) )
▶ because we make strictly fewer assumptions about q than before, we get a strictly better approximation to the true posterior!
▶ The bound is maximized over q(c_di) by the update derived on the next slide.
Constructing the Algorithm
Why didn’t we do this earlier?
▶ Note that c_di ∈ {0, 1}^K and Σ_k c_dik = 1. So q(c_di) = ∏_k γ_dik^{c_dik} with Σ_k γ_dik = 1.
▶ Also: Γ(α + n) = Γ(α) ∏_{ℓ=0}^{n−1} (α + ℓ), thus log Γ(α + n) = log Γ(α) + Σ_{ℓ=0}^{n−1} log(α + ℓ).

p(C, W) = ∏_d ( Γ(Σ_k α_dk) / Γ(Σ_k (α_dk + n_dk·)) ) ∏_k ( Γ(α_dk + n_dk·) / Γ(α_dk) ) · ∏_k ( Γ(Σ_v β_kv) / Γ(Σ_v (β_kv + n_·kv)) ) ∏_v ( Γ(β_kv + n_·kv) / Γ(β_kv) )

log q(c_di) = E_{q(C^{\di})}[log p(C, W)] + const.

log γ_dik = log q(c_dik = 1)
          = E_{q(C^{\di})}[ log Γ(α_dk + n_dk·) + log Γ(β_{k,w_di} + n_{·k,w_di}) − log Γ(Σ_v (β_kv + n_·kv)) ] + const.
          = E_{q(C^{\di})}[ log(α_dk + n_{dk·}^{\di}) + log(β_{k,w_di} + n_{·k,w_di}^{\di}) − log(Σ_v (β_kv + n_{·kv}^{\di})) ] + const.

(note that all terms in p(C, W) that don't involve c_dik can be moved into the constant, as can all sums over k. We can also add terms to const., such as Σ_{ℓ=0}^{n^{\di}−1} log(α + ℓ), effectively cancelling terms in log Γ.)
A Complication
Ah, that's why!

γ_dik ∝ exp( E_{q(C^{\di})}[ log(α_dk + n_{dk·}^{\di}) + log(β_{k,w_di} + n_{·k,w_di}^{\di}) − log(Σ_v (β_kv + n_{·kv}^{\di})) ] )
▶ Under q(C) = ∏_{d,i} q(c_di), the counts n_dk· are sums of independent Bernoulli variables (i.e. they have a multinomial distribution). Computing their expected logarithm is tricky (O(n²_{d··})):

H(q(n_{d1·}, …, n_{dK·})) = − log(I_d!) − I_d Σ_{k=1}^{K} γ_dk· log(γ_dk·) + Σ_{k=1}^{K} Σ_{n_dk·=1}^{I_d} \binom{I_d}{n_dk·} γ_dk·^{n_dk·} (1 − γ_dk·)^{I_d − n_dk·} log(n_dk·!)
▶ That’s likely why the original paper (and scikit-learn) don’t do this.
If arithmetic doesn’t work, try creativity!
Yee Whye Teh, David Newman & Max Welling, NeurIPS 2007

γ_dik ∝ exp( E_{q(C^{\di})}[ log(α_dk + n_{dk·}^{\di}) + log(β_{k,w_di} + n_{·k,w_di}^{\di}) − log(Σ_v (β_kv + n_{·kv}^{\di})) ] )
Statistics to the rescue
recall Lecture 3 Yee Whye Teh, David Newman & Max Welling, NeurIPS 2007
[Plot: the binomial pmf p(r) over r = 0, …, 10 for f = 1/3, N = 10.]

The distribution of R = Σ_{i=1}^{N} x_i, with binary x_i each of probability f, is

P(R = r | f, N) = N! / ((N − r)! · r!) · f^r · (1 − f)^{N−r} = \binom{N}{r} · f^r · (1 − f)^{N−r} ≈ N(r; Nf, Nf(1 − f))
If arithmetic doesn’t work, try creativity!
Yee Whye Teh, David Newman & Max Welling, NeurIPS 2007
log(α + n) ≈ log(α + E[n]) + (n − E[n]) · 1/(α + E[n]) − ½ (n − E[n])² · 1/(α + E[n])²

E_q[log(α_dk + n_{dk·}^{\di})] ≈ log(α_dk + E_q[n_{dk·}^{\di}]) − var_q[n_{dk·}^{\di}] / (2 (α_dk + E_q[n_{dk·}^{\di}])²)
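A quick NumPy sanity check of this second-order approximation against Monte Carlo, for a count n that is a sum of independent Bernoulli variables (the specific numbers below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.1
gammas = rng.uniform(0.05, 0.6, size=40)                    # one Bernoulli probability per token
samples = (rng.random((100_000, gammas.size)) < gammas).sum(axis=1)   # draws of n = Σ_i c_i

mc = np.log(alpha + samples).mean()                          # Monte Carlo estimate of E[log(α + n)]
mean = gammas.sum()                                          # E[n]
var = (gammas * (1 - gammas)).sum()                          # var[n] (independent Bernoullis)
taylor = np.log(alpha + mean) - var / (2 * (alpha + mean)**2)
print(f"Monte Carlo: {mc:.4f}   second-order Taylor: {taylor:.4f}")
```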
Approximate Approximate Inference
probabilistic machine learning often involves creative tweaks

γ_dik ∝ exp( E_{q(C^{\di})}[ log(α_dk + n_{dk·}^{\di}) + log(β_{k,w_di} + n_{·k,w_di}^{\di}) − log(Σ_v (β_kv + n_{·kv}^{\di})) ] )

E_q[log(α_dk + n_{dk·}^{\di})] ≈ log(α_dk + E_q[n_{dk·}^{\di}]) − var_q[n_{dk·}^{\di}] / (2 (α_dk + E_q[n_{dk·}^{\di}])²)

γ_dik ∝ (α_dk + E[n_{dk·}^{\di}]) (β_{k,w_di} + E[n_{·k,w_di}^{\di}]) (Σ_v (β_kv + E[n_{·kv}^{\di}]))^{−1}
        · exp( − var_q[n_{dk·}^{\di}] / (2 (α_dk + E_q[n_{dk·}^{\di}])²) − var_q[n_{·k,w_di}^{\di}] / (2 (β_{k,w_di} + E_q[n_{·k,w_di}^{\di}])²) + var_q[n_{·k·}^{\di}] / (2 (Σ_v (β_kv + E_q[n_{·kv}^{\di}]))²) )
Complexity
The algorithm
γ_dik ∝ (α_dk + E[n_{dk·}^{\di}]) (β_{k,w_di} + E[n_{·k,w_di}^{\di}]) (Σ_v (β_kv + E[n_{·kv}^{\di}]))^{−1}
        · exp( − var_q[n_{dk·}^{\di}] / (2 (α_dk + E_q[n_{dk·}^{\di}])²) − var_q[n_{·k,w_di}^{\di}] / (2 (β_{k,w_di} + E_q[n_{·k,w_di}^{\di}])²) + var_q[n_{·k·}^{\di}] / (2 (Σ_v (β_kv + E_q[n_{·kv}^{\di}]))²) )
Note that γ_dik does not depend on the position i ∈ {1, …, I_d}: it is the same for all words w_di in document d with the same value w_di = v.
▶ memory requirement: O(DKV), since we have to store one γ_dkv per word value v ∈ {1, …, V} (per document and topic), and
▶ E[ndk· ], var[ndk· ] ∈ RD×K
▶ E[n·kv ], var[n·kv ] ∈ RK×V
▶ E[n·k· ], var[n·k· ] ∈ RK
▶ computational complexity: O(DKV). We can loop over V rather than I_d (good for long documents!).
Often, a document will be sparse in V, so the iteration cost can be much lower.
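A vectorized NumPy sketch of the resulting collapsed variational update, exploiting exactly this structure: one responsibility per (d, v, k) triple plus the expected counts and variances listed above. The dense count matrix `ndv`, symmetric hyperparameters, and the clipping of leave-one-out statistics are assumptions of this illustration, not the lecture's code:

```python
import numpy as np

def cvb_lda(ndv, K, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Collapsed variational updates for LDA with the second-order (variance) correction.

    ndv : (D, V) word-count matrix; gamma[d, v, k] is shared by all tokens of word v in doc d.
    """
    rng = np.random.default_rng(seed)
    D, V = ndv.shape
    gamma = rng.dirichlet(np.ones(K), size=(D, V))               # (D, V, K)
    n = ndv[:, :, None].astype(float)
    for _ in range(iters):
        m = n * gamma                                            # expected per-(d, v) counts
        s = n * gamma * (1 - gamma)                              # their variances
        E_dk, V_dk = m.sum(axis=1), s.sum(axis=1)                # E/var of n_dk.   (D, K)
        E_kv, V_kv = m.sum(axis=0).T, s.sum(axis=0).T            # E/var of n_.kv   (K, V)
        E_k, V_k = E_kv.sum(axis=1), V_kv.sum(axis=1)            # E/var of n_.k.   (K,)
        # leave-one-out ("\di") statistics for a single token of word v in document d;
        # clipping keeps the logs defined for (d, v) pairs with n_dv = 0, which are irrelevant anyway
        e_dk = np.maximum(E_dk[:, None, :] - gamma, 0.0)
        v_dk = np.maximum(V_dk[:, None, :] - gamma * (1 - gamma), 0.0)
        e_kv = np.maximum(E_kv.T[None, :, :] - gamma, 0.0)
        v_kv = np.maximum(V_kv.T[None, :, :] - gamma * (1 - gamma), 0.0)
        e_k = np.maximum(E_k[None, None, :] - gamma, 0.0)
        v_k = np.maximum(V_k[None, None, :] - gamma * (1 - gamma), 0.0)
        log_g = (np.log(alpha + e_dk) + np.log(beta + e_kv) - np.log(V * beta + e_k)
                 - v_dk / (2 * (alpha + e_dk)**2)
                 - v_kv / (2 * (beta + e_kv)**2)
                 + v_k / (2 * (V * beta + e_k)**2))
        log_g -= log_g.max(axis=-1, keepdims=True)
        gamma = np.exp(log_g)
        gamma /= gamma.sum(axis=-1, keepdims=True)
    return gamma
```

In this form, both the work per iteration and the memory are O(DKV), as discussed above.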
Because machine learning involves real-world data, every problem is unique.
Thinking hard about both your model and your algorithm
can make a big difference in predictive and numerical performance.
Some Output
can we be happy with this?
[Plot: inferred topic proportions π_d per document, plotted against year from 1780 to 2020.]
Can we be happy with this model?
Have we described all we know about the data?
[Graphical model: α_d → π_d → c_di → w_di ← θ_k ← β_k, with plates over words i = 1, …, I_d, documents d = 1, …, D, and topics k = 1, …, K.]
p(C, Π, Θ, W) = ∏_{d=1}^{D} D(π_d; α_d) · ∏_{d=1}^{D} ∏_{i=1}^{I_d} ∏_{k=1}^{K} (π_dk θ_{k,w_di})^{c_dik} · ∏_{k=1}^{K} D(θ_k; β_k)
              = p(Π | α) · p(W, C | Θ, Π) · p(Θ | β)
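To make the generative process concrete, here is a small ancestral-sampling sketch of this joint distribution — document lengths, vocabulary size and the symmetric hyperparameters below are illustrative choices, not values from the lecture:

```python
import numpy as np

def generate_lda_corpus(D=100, K=5, V=1000, I_d=200, alpha=0.5, beta=0.05, seed=0):
    """Ancestral sampling from the LDA joint p(C, Π, Θ, W)."""
    rng = np.random.default_rng(seed)
    Theta = rng.dirichlet(np.full(V, beta), size=K)    # θ_k ~ D(β_k): topic-word distributions
    Pi = rng.dirichlet(np.full(K, alpha), size=D)      # π_d ~ D(α_d): document-topic distributions
    docs = []
    for d in range(D):
        c = rng.choice(K, size=I_d, p=Pi[d])           # c_di: word-topic assignments
        w = np.array([rng.choice(V, p=Theta[k]) for k in c])   # w_di ~ Categorical(θ_{c_di})
        docs.append(w)
    return docs, Pi, Theta
```

Running the mean-field or collapsed samplers above on such a synthetic corpus is a useful sanity check before touching real data.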
Meta-Data
It’s right there!
[Graphical model: α_d → π_d → c_di → w_di ← θ_k ← β_k, with plates over words i = 1, …, I_d, documents d = 1, …, D, and topics k = 1, …, K.]
The ELBO
useful for monitoring / bug-fixing
p(C, Π, Θ, W) = ∏_{d=1}^{D} ( Γ(Σ_k α_dk) / ∏_k Γ(α_dk) ) ∏_{k=1}^{K} π_dk^{α_dk − 1 + n_dk·} · ∏_{k=1}^{K} ( Γ(Σ_v β_kv) / ∏_v Γ(β_kv) ) ∏_{v=1}^{V} θ_kv^{β_kv − 1 + n_·kv}
We need its expected logarithm under q (together with the entropy of q) to evaluate the bound.
Adding more Information
a model for document metadata
[Graphical model fragment: a metadata node ϕ_d added inside the document plate d = 1, …, D.]
The Price of Packaged Solutions
Toolboxes speed up development, but also make it brittle
https://github.com/scikit-learn/scikit-learn/blob/fd237278e/sklearn/decomposition/_lda.py#L134
▶ toolboxes are extremely valuable for quick early development. Use them to your advantage!
▶ but their interface often enforces and restricts model design decisions
▶ to really solve a probabilistic modelling task, build your own craftware
A Generalized Linear Model
Latent Topic Dynamics
[Graphical model: document metadata ϕ_d and a kernel h feed a latent function f, which produces the transformed document-topic parameters α_d; then α_d → π_d (document topic distribution) → c_di (word topic) → w_di (word) ← θ_k (topic word distribution) ← β_k (topic prior), with plates i = 1, …, I_d, d = 1, …, D, k = 1, …, K.]
A change in prior
EM-style point estimates from the ELBO
[Graphical model as on the previous slide.]
[Plot: p(topic) over Year, 1800–2000.]
k(x_a, x_b) = θ² ( 1 + (x_a − x_b)² / (2αℓ²) )^{−α} · { 1.00 if president(x_a) = president(x_b); γ otherwise }

with θ = 5,  ℓ = 10 years,  α = 0.5,  γ = 0.9
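A direct transcription of this kernel as a Python function; the `president` labels would come from the document metadata ϕ_d, and the function and argument names are this sketch's own:

```python
def kernel(xa, xb, president_a, president_b,
           theta=5.0, ell=10.0, alpha=0.5, gamma=0.9):
    """Rational-quadratic kernel over years, down-weighted across presidencies."""
    rq = theta**2 * (1.0 + (xa - xb)**2 / (2 * alpha * ell**2))**(-alpha)
    return rq * (1.0 if president_a == president_b else gamma)

# e.g. two addresses ten years apart, by different speakers:
print(kernel(1901.0, 1911.0, "president A", "president B"))
```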
Results
Kernel Topic Models [Hennig, Stern, Herbrich, Graepel. Kernel Topic Models. AISTATS 2012]
[Plot: expected topic proportions ⟨π_k | ϕ⟩ over year (1800–2000), with topics labelled by their most probable words, e.g. "war", "spain", "national", "economic", "labor", "energy", "cut", "oil", "law", "America", "American", "people", "good", "work", "made", "business".]
The Topics of American History
[Figure: topics inferred for 1897.]
The Topics of American History
[Figure: topics inferred for 1980.]
Can we do even better?
Intra-Document Structure! Bags of Bags of Words
We are 15 years into this new century. Fifteen years that dawned with terror touching our shores, that unfolded with a new generation
fighting two long and costly wars, that saw a vicious recession spread across our Nation and the world. It has been and still is a hard
time for many.
But tonight we turn the page. Tonight, after a breakthrough year for America, our economy is growing and creating jobs at the fastest
pace since 1999. Our unemployment rate is now lower than it was before the financial crisis. More of our kids are graduating than ever
before. More of our people are insured than ever before. And we are as free from the grip of foreign oil as we’ve been in almost 30
years.
Tonight, for the first time since 9/11, our combat mission in Afghanistan is over. Six years ago, nearly 180,000 American troops served
in Iraq and Afghanistan. Today, fewer than 15,000 remain. And we salute the courage and sacrifice of every man and woman in this 9/11
generation who has served to keep us safe. We are humbled and grateful for your service.
America, for all that we have endured, for all the grit and hard work required to come back, for all the tasks that lie ahead, know
this: The shadow of crisis has passed, and the State of the Union is strong.
Each document is actually pre-structured into sequential sub-documents, typically of one topic each.
Designing a probabilistic machine learning method:
1. get the data
1.1 try to collect as much meta-data as possible
2. build the model
2.1 identify quantities and data structures; assign names
2.2 design a generative process (graphical model)
2.3 assign (conditional) distributions to factors/arrows (use exponential families!)
3. design the algorithm
3.1 consider conditional independence
3.2 try standard methods for early experiments
3.3 run unit-tests and sanity-checks
3.4 identify bottlenecks, find customized approximations and refinements
Probabilistic ML — P. Hennig, SS 2021 — Lecture 25: Customizing Probabilistic Models — © Philipp Hennig, 2021 CC BY-NC-SA 3.0