Lecture 25
Customizing Probabilistic Models
Philipp Hennig
19 July 2021
Faculty of Science
Department of Computer Science
Chair for the Methods of Machine Learning
The Toolbox
Framework:
∫ p(x_1, x_2) dx_2 = p(x_1)        p(x_1, x_2) = p(x_1 | x_2) p(x_2)        p(x | y) = p(y | x) p(x) / p(y)
Modelling:
▶ graphical models
▶ Gaussian distributions
▶ (deep) learnt representations
▶ Kernels
▶ Markov Chains
▶ Exponential Families / Conjugate Priors
▶ Factor Graphs & Message Passing

Computation:
▶ Monte Carlo
▶ Linear algebra / Gaussian inference
▶ maximum likelihood / MAP
▶ Laplace approximations
▶ EM (iterative maximum likelihood)
▶ variational inference / mean field
Variational Inference
▶ is a general framework to construct approximating probability distributions q(z) to non-analytic posterior distributions p(z | x) by minimizing the functional D_KL[q(z) ∥ p(z | x)] (equivalently, by maximizing the evidence lower bound).
▶ the beauty is that we get to choose q, so one can nearly always find a tractable approximation.
▶ If we impose the mean-field approximation q(z) = ∏_i q(z_i), we get the update log q*_j(z_j) = E_{q(z_{i≠j})}[log p(x, z)] + const.
▶ for an exponential-family p, things are particularly simple: we only need the expectations under q of the sufficient statistics.
Variational Inference is an extremely flexible and powerful approximation method. Its downside is that
constructing the bound and update equations can be tedious. For a quick test, variational inference is
often not a good idea. But for a deployed product, it can be the most powerful tool in the box.
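For reference, the functional in question is the KL divergence from q to the posterior; minimizing it is the same as maximizing the evidence lower bound (ELBO), which is the quantity tracked as BOUND in the LDA algorithm below:

```latex
\log p(x)
  = \underbrace{\mathbb{E}_{q(z)}\!\left[\log\frac{p(x,z)}{q(z)}\right]}_{\mathcal{L}(q)\;\text{(ELBO)}}
  + \underbrace{D_{\mathrm{KL}}\big(q(z)\,\|\,p(z\mid x)\big)}_{\ge 0},
\qquad
\arg\min_q D_{\mathrm{KL}}\big(q\,\|\,p(\cdot\mid x)\big) = \arg\max_q \mathcal{L}(q).
```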
Latent Dirichlet Allocation
Topic Models [Blei, D. M., Ng, A. Y. & Jordan, M. I. (2003) JMLR 3, 993–1022]
[Graphical model: α_d → π_d → c_di → w_di ← θ_k ← β_k, with plates over words i = 1, …, I_d, documents d = 1, …, D, and topics k = 1, …, K.]
The Variational Approximation
closing the loop

q(π_d) = D(π_d; [α̃_dk]_{k=1,…,K})   with   α̃_dk := α_dk + Σ_{i=1}^{I_d} γ̃_dik,   ∀ d = 1, …, D

q(θ_k) = D(θ_k; [β̃_kv]_{v=1,…,V})   with   β̃_kv := β_kv + Σ_{d=1}^{D} Σ_{i=1}^{I_d} γ̃_dik I(w_di = v),   ∀ k = 1, …, K

q(c_di) = ∏_k γ̃_dik^{c_dik},   ∀ d, i = 1, …, I_d

where γ̃_dik = γ_dik / Σ_k γ_dik and (note that Σ_k α̃_dk = const.)

γ_dik = exp( E_{q(π_d)}[log π_dk] + E_{q(θ_k)}[log θ_{k,w_di}] )
      = exp( ψ(α̃_dk) + ψ(β̃_{k,w_di}) − ψ(Σ_v β̃_kv) )
Building the Algorithm
updating and evaluating the bound
procedure LDA(W, α, β)
    γ̃_dik ← DIRICHLET_RAND(α)                              ▷ initialize
    L ← −∞
    while L not converged do
        for d = 1, …, D; k = 1, …, K do
            α̃_dk ← α_dk + Σ_i γ̃_dik                        ▷ update document-topic distributions
        end for
        for k = 1, …, K; v = 1, …, V do
            β̃_kv ← β_kv + Σ_{d,i} γ̃_dik I(w_di = v)        ▷ update topic-word distributions
        end for
        for d = 1, …, D; k = 1, …, K; i = 1, …, I_d do
            γ̃_dik ← exp( ψ(α̃_dk) + ψ(β̃_{k,w_di}) − ψ(Σ_v β̃_kv) )   ▷ update word-topic assignments
            γ̃_dik ← γ̃_dik / γ̃_di·                          ▷ normalize over k
        end for
        L ← BOUND(γ̃, w, α̃, β̃)                              ▷ update bound
    end while
end procedure
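A minimal NumPy sketch of this update loop — not the lecture's reference implementation; the dense document-by-vocabulary count matrix `ndv`, symmetric hyperparameters, and a fixed iteration count (instead of monitoring the bound) are simplifying assumptions of this illustration:

```python
import numpy as np
from scipy.special import digamma   # the digamma function ψ

def lda_mean_field(ndv, K, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Mean-field variational updates for LDA.

    ndv : (D, V) array of word counts per document (bag of words).
    Returns the variational Dirichlet parameters and the responsibilities.
    """
    rng = np.random.default_rng(seed)
    D, V = ndv.shape
    # gamma[d, v, k]: responsibility of topic k for word v in document d;
    # it is shared by all tokens of word v in d (the update depends on i only via w_di).
    gamma = rng.dirichlet(np.ones(K), size=(D, V))                  # (D, V, K)
    for _ in range(iters):
        # α~_dk = α_dk + Σ_i γ~_dik  (document-topic parameters)
        alpha_t = alpha + np.einsum('dv,dvk->dk', ndv, gamma)       # (D, K)
        # β~_kv = β_kv + Σ_{d,i} γ~_dik I(w_di = v)  (topic-word parameters)
        beta_t = beta + np.einsum('dv,dvk->kv', ndv, gamma)         # (K, V)
        # γ_dik ∝ exp(ψ(α~_dk) + ψ(β~_{k,w_di}) − ψ(Σ_v β~_kv))
        log_gamma = (digamma(alpha_t)[:, None, :]
                     + digamma(beta_t).T[None, :, :]
                     - digamma(beta_t.sum(axis=1))[None, None, :])
        log_gamma -= log_gamma.max(axis=-1, keepdims=True)          # numerical stability
        gamma = np.exp(log_gamma)
        gamma /= gamma.sum(axis=-1, keepdims=True)                  # normalize over k
    return alpha_t, beta_t, gamma
```

As a smoke test, something like `lda_mean_field(np.random.poisson(1.0, (20, 100)), K=5)` runs in well under a second.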
Exponential Family Approximations
The connection to EM is not accidental
p(x, z | η) = ∏_{n=1}^{N} exp( η^⊺ φ(x_n, z_n) − log Z(η) )

with conjugate prior p(η | ν, v) = exp( η^⊺ v − ν log Z(η) − log F(ν, v) )

▶ and assume q(z, η) = q(z) · q(η). Then q is in the same exponential family, with

log q*(z) = E_{q(η)}[log p(x, z | η)] + const. = Σ_{n=1}^{N} E_{q(η)}[η]^⊺ φ(x_n, z_n)

q*(z) = ∏_{n=1}^{N} exp( E[η]^⊺ φ(x_n, z_n) − log Z(E[η]) )    (note induced factorization)
Even, and especially, if you consider variational approximations,
using conjugate exponential-family priors can make life much easier.
Reminder: Collapsed Gibbs Sampling
It pays off to look closely at the math! T. L. Griffiths & M. Steyvers, Finding scientific topics, PNAS 101/1 (4/2004), 5228–5235
A Collapsed Gibbs Sampler for LDA
It pays off to look closely at the math! T. L. Griffiths & M. Steyvers, Finding scientific topics, PNAS 101/1 (4/2004), 5228–5235
p(C, W) = ∏_d ( Γ(Σ_k α_dk) / Γ(Σ_k (α_dk + n_dk·)) ) ∏_k ( Γ(α_dk + n_dk·) / Γ(α_dk) ) · ∏_k ( Γ(Σ_v β_kv) / Γ(Σ_v (β_kv + n_·kv)) ) ∏_v ( Γ(β_kv + n_·kv) / Γ(β_kv) )
A collapsed sampling method can converge much faster by eliminating the latent variables that mediate
between individual data.
procedure LDA(W, α, β)
    n_dkv ← 0  ∀ d, k, v                                   ▷ initialize counts
    while true do
        for d = 1, …, D; i = 1, …, I_d do                   ▷ can be parallelized
            sample c_di from p(c_dik = 1 | C^{\di}, W) ∝ (α_dk + n_{dk·}^{\di}) (β_{k,w_di} + n_{·k,w_di}^{\di}) (Σ_v (β_kv + n_{·kv}^{\di}))^{−1}   ▷ sample assignment
            n ← UPDATECOUNTS(c_di)                          ▷ update counts (check whether first pass or repeat)
        end for
    end while
end procedure
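A compact, single-threaded Python sketch of one such collapsed Gibbs sweep — symmetric hyperparameters and the `docs` representation (a list of word-index arrays) are assumptions of this illustration, not the lecture's code:

```python
import numpy as np

def collapsed_gibbs_lda(docs, K, V, alpha=0.1, beta=0.01, sweeps=200, seed=0):
    """Collapsed Gibbs sampling for LDA (in the spirit of Griffiths & Steyvers, 2004).

    docs : list of 1-D integer arrays; docs[d][i] is the vocabulary index of word i in doc d.
    """
    rng = np.random.default_rng(seed)
    D = len(docs)
    z = [rng.integers(K, size=len(doc)) for doc in docs]   # initial topic assignments
    n_dk = np.zeros((D, K))                                # words in doc d assigned to topic k
    n_kv = np.zeros((K, V))                                # occurrences of word v assigned to topic k
    n_k = np.zeros(K)                                      # total words assigned to topic k
    for d, doc in enumerate(docs):
        np.add.at(n_dk[d], z[d], 1)
        np.add.at(n_kv, (z[d], doc), 1)
        np.add.at(n_k, z[d], 1)
    for _ in range(sweeps):
        for d, doc in enumerate(docs):
            for i, v in enumerate(doc):
                k_old = z[d][i]
                # remove word i from all counts (the "\di" statistics)
                n_dk[d, k_old] -= 1; n_kv[k_old, v] -= 1; n_k[k_old] -= 1
                # p(c_dik = 1 | rest) ∝ (α + n_dk)(β + n_kv)/(Vβ + n_k)   [symmetric α, β]
                p = (alpha + n_dk[d]) * (beta + n_kv[:, v]) / (V * beta + n_k)
                k_new = rng.choice(K, p=p / p.sum())
                z[d][i] = k_new
                n_dk[d, k_new] += 1; n_kv[k_new, v] += 1; n_k[k_new] += 1
    return z, n_dk, n_kv
```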
Can we do the same for variational inference?
Why don't we use the mean field in our variational bound? Yee Whye Teh, David Newman & Max Welling, NeurIPS 2007
If we do not force q(Θ, Π) to be independent of q(C), but instead use q(C, Θ, Π) = q(C) q(Θ, Π | C), the optimal choice is simply q(Θ, Π | C) = p(Θ, Π | C, W), and the bound will be tight in Π, Θ — so Π and Θ can be marginalized (collapsed) analytically, leaving a bound over q(C) alone.
A Collapsed Variational Bound
Why don’t we use the mean field in our variational bound? Yee Whye Teh, David Newman & Max Welling, NeurIPS 2007
p(C, W) = ∏_d ( Γ(Σ_k α_dk) / Γ(Σ_k (α_dk + n_dk·)) ) ∏_k ( Γ(α_dk + n_dk·) / Γ(α_dk) ) · ∏_k ( Γ(Σ_v β_kv) / Γ(Σ_v (β_kv + n_·kv)) ) ∏_v ( Γ(β_kv + n_·kv) / Γ(β_kv) )
▶ because we make strictly fewer assumptions about q than before, we get a strictly better approximation to the true posterior!
▶ The bound is maximized over q(c_di) by the update derived on the next slide.
Constructing the Algorithm
Why didn’t we do this earlier?
▶ Note that c_di ∈ {0, 1}^K and Σ_k c_dik = 1. So q(c_di) = ∏_k γ_dik^{c_dik} with Σ_k γ_dik = 1.
▶ Also: Γ(α + n) = Γ(α) ∏_{ℓ=0}^{n−1} (α + ℓ), thus log Γ(α + n) = log Γ(α) + Σ_{ℓ=0}^{n−1} log(α + ℓ).

p(C, W) = ∏_d ( Γ(Σ_k α_dk) / Γ(Σ_k (α_dk + n_dk·)) ) ∏_k ( Γ(α_dk + n_dk·) / Γ(α_dk) ) · ∏_k ( Γ(Σ_v β_kv) / Γ(Σ_v (β_kv + n_·kv)) ) ∏_v ( Γ(β_kv + n_·kv) / Γ(β_kv) )

log q(c_di) = E_{q(C^{\di})}[log p(C, W)] + const.

log γ_dik = log q(c_dik = 1)
          = E_{q(C^{\di})}[ log Γ(α_dk + n_dk·) + log Γ(β_{k,w_di} + n_{·k,w_di}) − log Γ(Σ_v (β_kv + n_·kv)) ] + const.
          = E_{q(C^{\di})}[ log(α_dk + n_{dk·}^{\di}) + log(β_{k,w_di} + n_{·k,w_di}^{\di}) − log(Σ_v (β_kv + n_{·kv}^{\di})) ] + const.

(note that all terms in p(C, W) that don't involve c_dik can be moved into the constant, as can all sums over k. We can also add terms to const., such as Σ_{ℓ=0}^{n^{\di}−1} log(α + ℓ), effectively cancelling terms in log Γ.)
A Complication
Ah, that's why!

γ_dik ∝ exp( E_{q(C^{\di})}[ log(α_dk + n_{dk·}^{\di}) + log(β_{k,w_di} + n_{·k,w_di}^{\di}) − log(Σ_v (β_kv + n_{·kv}^{\di})) ] )
▶ Under q(C) = ∏_{d,i} q(c_di), the counts n_dk· are sums of independent Bernoulli variables (i.e. they have a multinomial distribution). Computing their expected logarithm is tricky (O(n²_{d··})):

H(q(n_{d1·}, …, n_{dK·})) = − log(I_d!) − I_d Σ_{k=1}^{K} γ_dk· log(γ_dk·) + Σ_{k=1}^{K} Σ_{n_dk·=1}^{I_d} \binom{I_d}{n_dk·} γ_dk·^{n_dk·} (1 − γ_dk·)^{I_d − n_dk·} log(n_dk·!)
▶ That’s likely why the original paper (and scikit-learn) don’t do this.
If arithmetic doesn’t work, try creativity!
Yee Whye Teh, David Newman & Max Welling, NeurIPS 2007

γ_dik ∝ exp( E_{q(C^{\di})}[ log(α_dk + n_{dk·}^{\di}) + log(β_{k,w_di} + n_{·k,w_di}^{\di}) − log(Σ_v (β_kv + n_{·kv}^{\di})) ] )
Statistics to the rescue
recall Lecture 3 Yee Whye Teh, David Newman & Max Welling, NeurIPS 2007
[Plot: the binomial pmf p(r) over r = 0, …, 10 for f = 1/3, N = 10.]

The distribution of R = Σ_{i=1}^{N} x_i, with binary x_i each of probability f, is

P(R = r | f, N) = N! / ((N − r)! · r!) · f^r · (1 − f)^{N−r} = \binom{N}{r} · f^r · (1 − f)^{N−r} ≈ N(r; Nf, Nf(1 − f))
If arithmetic doesn’t work, try creativity!
Yee Whye Teh, David Newman & Max Welling, NeurIPS 2007
log(α + n) ≈ log(α + E[n]) + (n − E[n]) · 1/(α + E[n]) − ½ (n − E[n])² · 1/(α + E[n])²

E_q[log(α_dk + n_{dk·}^{\di})] ≈ log(α_dk + E_q[n_{dk·}^{\di}]) − var_q[n_{dk·}^{\di}] / (2 (α_dk + E_q[n_{dk·}^{\di}])²)
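A quick NumPy sanity check of this second-order approximation against Monte Carlo, for a count n that is a sum of independent Bernoulli variables (the specific numbers below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.1
gammas = rng.uniform(0.05, 0.6, size=40)                    # one Bernoulli probability per token
samples = (rng.random((100_000, gammas.size)) < gammas).sum(axis=1)   # draws of n = Σ_i c_i

mc = np.log(alpha + samples).mean()                          # Monte Carlo estimate of E[log(α + n)]
mean = gammas.sum()                                          # E[n]
var = (gammas * (1 - gammas)).sum()                          # var[n] (independent Bernoullis)
taylor = np.log(alpha + mean) - var / (2 * (alpha + mean)**2)
print(f"Monte Carlo: {mc:.4f}   second-order Taylor: {taylor:.4f}")
```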
Approximate Approximate Inference
probabilistic machine learning often involves creative tweaks

γ_dik ∝ exp( E_{q(C^{\di})}[ log(α_dk + n_{dk·}^{\di}) + log(β_{k,w_di} + n_{·k,w_di}^{\di}) − log(Σ_v (β_kv + n_{·kv}^{\di})) ] )

E_q[log(α_dk + n_{dk·}^{\di})] ≈ log(α_dk + E_q[n_{dk·}^{\di}]) − var_q[n_{dk·}^{\di}] / (2 (α_dk + E_q[n_{dk·}^{\di}])²)

γ_dik ∝ (α_dk + E[n_{dk·}^{\di}]) (β_{k,w_di} + E[n_{·k,w_di}^{\di}]) (Σ_v (β_kv + E[n_{·kv}^{\di}]))^{−1}
        · exp( − var_q[n_{dk·}^{\di}] / (2 (α_dk + E_q[n_{dk·}^{\di}])²) − var_q[n_{·k,w_di}^{\di}] / (2 (β_{k,w_di} + E_q[n_{·k,w_di}^{\di}])²) + var_q[n_{·k·}^{\di}] / (2 (Σ_v (β_kv + E_q[n_{·kv}^{\di}]))²) )
Complexity
The algorithm
γ_dik ∝ (α_dk + E[n_{dk·}^{\di}]) (β_{k,w_di} + E[n_{·k,w_di}^{\di}]) (Σ_v (β_kv + E[n_{·kv}^{\di}]))^{−1}
        · exp( − var_q[n_{dk·}^{\di}] / (2 (α_dk + E_q[n_{dk·}^{\di}])²) − var_q[n_{·k,w_di}^{\di}] / (2 (β_{k,w_di} + E_q[n_{·k,w_di}^{\di}])²) + var_q[n_{·k·}^{\di}] / (2 (Σ_v (β_kv + E_q[n_{·kv}^{\di}]))²) )
Note that γ_dik does not depend on the position i ∈ {1, …, I_d}: it is the same for all words w_di in document d with the same value w_di = v.
▶ memory requirement: O(DKV), since we have to store one γ_dkv per word value v ∈ {1, …, V} (per document and topic), and
▶ E[ndk· ], var[ndk· ] ∈ RD×K
▶ E[n·kv ], var[n·kv ] ∈ RK×V
▶ E[n·k· ], var[n·k· ] ∈ RK
▶ computational complexity: O(DKV). We can loop over V rather than I_d (good for long documents!).
Often, a document will be sparse in V, so the iteration cost can be much lower.
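A vectorized NumPy sketch of the resulting collapsed variational update, exploiting exactly this structure: one responsibility per (d, v, k) triple plus the expected counts and variances listed above. The dense count matrix `ndv`, symmetric hyperparameters, and the clipping of leave-one-out statistics are assumptions of this illustration, not the lecture's code:

```python
import numpy as np

def cvb_lda(ndv, K, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Collapsed variational updates for LDA with the second-order (variance) correction.

    ndv : (D, V) word-count matrix; gamma[d, v, k] is shared by all tokens of word v in doc d.
    """
    rng = np.random.default_rng(seed)
    D, V = ndv.shape
    gamma = rng.dirichlet(np.ones(K), size=(D, V))               # (D, V, K)
    n = ndv[:, :, None].astype(float)
    for _ in range(iters):
        m = n * gamma                                            # expected per-(d, v) counts
        s = n * gamma * (1 - gamma)                              # their variances
        E_dk, V_dk = m.sum(axis=1), s.sum(axis=1)                # E/var of n_dk.   (D, K)
        E_kv, V_kv = m.sum(axis=0).T, s.sum(axis=0).T            # E/var of n_.kv   (K, V)
        E_k, V_k = E_kv.sum(axis=1), V_kv.sum(axis=1)            # E/var of n_.k.   (K,)
        # leave-one-out ("\di") statistics for a single token of word v in document d;
        # clipping keeps the logs defined for (d, v) pairs with n_dv = 0, which are irrelevant anyway
        e_dk = np.maximum(E_dk[:, None, :] - gamma, 0.0)
        v_dk = np.maximum(V_dk[:, None, :] - gamma * (1 - gamma), 0.0)
        e_kv = np.maximum(E_kv.T[None, :, :] - gamma, 0.0)
        v_kv = np.maximum(V_kv.T[None, :, :] - gamma * (1 - gamma), 0.0)
        e_k = np.maximum(E_k[None, None, :] - gamma, 0.0)
        v_k = np.maximum(V_k[None, None, :] - gamma * (1 - gamma), 0.0)
        log_g = (np.log(alpha + e_dk) + np.log(beta + e_kv) - np.log(V * beta + e_k)
                 - v_dk / (2 * (alpha + e_dk)**2)
                 - v_kv / (2 * (beta + e_kv)**2)
                 + v_k / (2 * (V * beta + e_k)**2))
        log_g -= log_g.max(axis=-1, keepdims=True)
        gamma = np.exp(log_g)
        gamma /= gamma.sum(axis=-1, keepdims=True)
    return gamma
```

In this form, both the work per iteration and the memory are O(DKV), as discussed above.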
Because machine learning involves real-world data, every problem is unique.
Thinking hard about both your model and your algorithm
can make a big difference in predictive and numerical performance.
Some Output
can we be happy with this?
[Plot: inferred topic proportions π_d per document, plotted against year from 1780 to 2020.]
Can we be happy with this model?
Have we described all we know about the data?
[Graphical model: α_d → π_d → c_di → w_di ← θ_k ← β_k, with plates over words i = 1, …, I_d, documents d = 1, …, D, and topics k = 1, …, K.]
p(C, Π, Θ, W) = ∏_{d=1}^{D} D(π_d; α_d) · ∏_{d=1}^{D} ∏_{i=1}^{I_d} ∏_{k=1}^{K} (π_dk θ_{k,w_di})^{c_dik} · ∏_{k=1}^{K} D(θ_k; β_k)
              = p(Π | α) · p(W, C | Θ, Π) · p(Θ | β)
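To make the generative process concrete, here is a small ancestral-sampling sketch of this joint distribution — document lengths, vocabulary size and the symmetric hyperparameters below are illustrative choices, not values from the lecture:

```python
import numpy as np

def generate_lda_corpus(D=100, K=5, V=1000, I_d=200, alpha=0.5, beta=0.05, seed=0):
    """Ancestral sampling from the LDA joint p(C, Π, Θ, W)."""
    rng = np.random.default_rng(seed)
    Theta = rng.dirichlet(np.full(V, beta), size=K)    # θ_k ~ D(β_k): topic-word distributions
    Pi = rng.dirichlet(np.full(K, alpha), size=D)      # π_d ~ D(α_d): document-topic distributions
    docs = []
    for d in range(D):
        c = rng.choice(K, size=I_d, p=Pi[d])           # c_di: word-topic assignments
        w = np.array([rng.choice(V, p=Theta[k]) for k in c])   # w_di ~ Categorical(θ_{c_di})
        docs.append(w)
    return docs, Pi, Theta
```

Running the mean-field or collapsed samplers above on such a synthetic corpus is a useful sanity check before touching real data.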
Meta-Data
It’s right there!
[Graphical model: α_d → π_d → c_di → w_di ← θ_k ← β_k, with plates over words i = 1, …, I_d, documents d = 1, …, D, and topics k = 1, …, K.]
The ELBO
useful for monitoring / bug-fixing
p(C, Π, Θ, W) = ∏_{d=1}^{D} ( Γ(Σ_k α_dk) / ∏_k Γ(α_dk) ) ∏_{k=1}^{K} π_dk^{α_dk − 1 + n_dk·} · ∏_{k=1}^{K} ( Γ(Σ_v β_kv) / ∏_v Γ(β_kv) ) ∏_{v=1}^{V} θ_kv^{β_kv − 1 + n_·kv}
We need its expected logarithm under q (together with the entropy of q) to evaluate the bound.
Adding more Information
a model for document metadata
[Graphical model fragment: a metadata node ϕ_d added inside the document plate d = 1, …, D.]
The Price of Packaged Solutions
Toolboxes speed up development, but also make it brittle
https://github.com/scikit-learn/scikit-learn/blob/fd237278e/sklearn/decomposition/_lda.py#L134
▶ toolboxes are extremely valuable for quick early development. Use them to your advantage!
▶ but their interface often enforces and restricts model design decisions
▶ to really solve a probabilistic modelling task, build your own craftware
A Generalized Linear Model
Latent Topic Dynamics
[Graphical model: document metadata ϕ_d and a kernel h feed a latent function f, which produces the transformed document-topic parameters α_d; then α_d → π_d (document topic distribution) → c_di (word topic) → w_di (word) ← θ_k (topic word distribution) ← β_k (topic prior), with plates i = 1, …, I_d, d = 1, …, D, k = 1, …, K.]
A change in prior
EM-style point estimates from the ELBO
[Graphical model as on the previous slide.]
[Plot: p(topic) over Year, 1800–2000.]
k(x_a, x_b) = θ² ( 1 + (x_a − x_b)² / (2αℓ²) )^{−α} · { 1.00 if president(x_a) = president(x_b); γ otherwise }

with θ = 5,  ℓ = 10 years,  α = 0.5,  γ = 0.9
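A direct transcription of this kernel as a Python function; the `president` labels would come from the document metadata ϕ_d, and the function and argument names are this sketch's own:

```python
def kernel(xa, xb, president_a, president_b,
           theta=5.0, ell=10.0, alpha=0.5, gamma=0.9):
    """Rational-quadratic kernel over years, down-weighted across presidencies."""
    rq = theta**2 * (1.0 + (xa - xb)**2 / (2 * alpha * ell**2))**(-alpha)
    return rq * (1.0 if president_a == president_b else gamma)

# e.g. two addresses ten years apart, by different speakers:
print(kernel(1901.0, 1911.0, "president A", "president B"))
```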
Results
Kernel Topic Models [Hennig, Stern, Herbrich, Graepel. Kernel Topic Models. AISTATS 2012]
[Plot: expected topic proportions ⟨π_k | ϕ⟩ over year (1800–2000), with topics labelled by their most probable words, e.g. "war", "spain", "national", "economic", "labor", "energy", "cut", "oil", "law", "America", "American", "people", "good", "work", "made", "business".]
The Topics of American History
[Figure: topics inferred for 1897.]
The Topics of American History
[Figure: topics inferred for 1980.]
Can we do even better?
Intra-Document Structure! Bags of Bags of Words
We are 15 years into this new century. Fifteen years that dawned with terror touching our shores, that unfolded with a new generation
fighting two long and costly wars, that saw a vicious recession spread across our Nation and the world. It has been and still is a hard
time for many.
But tonight we turn the page. Tonight, after a breakthrough year for America, our economy is growing and creating jobs at the fastest
pace since 1999. Our unemployment rate is now lower than it was before the financial crisis. More of our kids are graduating than ever
before. More of our people are insured than ever before. And we are as free from the grip of foreign oil as we’ve been in almost 30
years.
Tonight, for the first time since 9/11, our combat mission in Afghanistan is over. Six years ago, nearly 180,000 American troops served
in Iraq and Afghanistan. Today, fewer than 15,000 remain. And we salute the courage and sacrifice of every man and woman in this 9/11
generation who has served to keep us safe. We are humbled and grateful for your service.
America, for all that we have endured, for all the grit and hard work required to come back, for all the tasks that lie ahead, know
this: The shadow of crisis has passed, and the State of the Union is strong.
Each document is actually pre-structured into sequential sub-documents, typically of one topic each.
Designing a probabilistic machine learning method:
1. get the data
1.1 try to collect as much meta-data as possible
2. build the model
2.1 identify quantities and data structures; assign names
2.2 design a generative process (graphical model)
2.3 assign (conditional) distributions to factors/arrows (use exponential families!)
3. design the algorithm
3.1 consider conditional independence
3.2 try standard methods for early experiments
3.3 run unit-tests and sanity-checks
3.4 identify bottlenecks, find customized approximations and refinements
Probabilistic ML — P. Hennig, SS 2021 — Lecture 25: Customizing Probabilistic Models — © Philipp Hennig, 2021 CC BY-NC-SA 3.0