
Bayesian machine learning Spring 2020

Lecture 1 — January 17th, 2020


Lecturer: Rémi Bardenet Scribe: Pierre Delanoue, Van Nguyen Nguyen

The web page of the course: https://github.com/rbardenet/bml-course


Contact : [email protected]
Objective of the class:

• Decision theory

• Formalizing a problem in a Bayesian way

• MCMC and variational Bayes

• Bayesian nonparametrics

1.1 Regression and Decision Theory

Definition 1.1 (Linear Regression)

$$y_i = f(x_i) + \varepsilon_i \in \mathbb{R}, \qquad x_i \in \mathbb{R}^d, \quad i \in [[1, n]], \quad \varepsilon_i \sim \mathcal{N}(0, \sigma^2) \text{ i.i.d.}$$

In matrix form,
$$y = X\theta^* + \varepsilon.$$

Remark 1.1.1 Your job is to (i) come up with an estimator $\hat\theta = \hat\theta(X, y)$. Often, you also need (ii) to report some region $A = A(X, y) \subset \Theta$ with confidence level $\alpha$.

1.1.1 Fisher’s answer


(i) $\hat\theta_{MLE} = (X^TX)^{-1}X^Ty$ (assuming $X$ has full column rank).

Proposition 1.2
• (i) $\hat\theta_{MLE}$ is unbiased, i.e., $\mathbb{E}(\hat\theta_{MLE}) = \theta^*$.
• (ii) $\hat\theta_{MLE} \sim \mathcal{N}(\theta^*, \sigma^2(X^TX)^{-1})$.
• (iii) $\hat\theta_{MLE}$ has minimum variance among linear unbiased estimators ($A \succeq B$ := $A - B$ positive semidefinite).


Proof (i) $\mathbb{E}[\hat\theta_{MLE}] = (X^TX)^{-1}X^T\mathbb{E}[y] = \theta^*$.

(ii) $\hat\theta_{MLE} - \theta^* \sim \mathcal{N}(0, \sigma^2(X^TX)^{-1})$:
$$\begin{aligned}
\mathrm{Var}(\hat\theta_{MLE}) &= \mathbb{E}\big[(\hat\theta - \theta^*)(\hat\theta - \theta^*)^T\big] \\
&= \mathbb{E}\big[((X^TX)^{-1}X^Ty - \theta^*)((X^TX)^{-1}X^Ty - \theta^*)^T\big] \\
&= (X^TX)^{-1}X^T\,\mathbb{E}[yy^T]\,X(X^TX)^{-1} - 2\theta^*\theta^{*T} + \theta^*\theta^{*T} \\
&= (X^TX)^{-1}X^T\,\mathbb{E}\big[(X\theta^* + \varepsilon)(X\theta^* + \varepsilon)^T\big]\,X(X^TX)^{-1} - 2\theta^*\theta^{*T} + \theta^*\theta^{*T} \\
&= (X^TX)^{-1}X^T\big(X\theta^*\theta^{*T}X^T + \sigma^2 I\big)X(X^TX)^{-1} - 2\theta^*\theta^{*T} + \theta^*\theta^{*T} \\
&= \theta^*\theta^{*T} + \sigma^2(X^TX)^{-1} - 2\theta^*\theta^{*T} + \theta^*\theta^{*T} = \sigma^2(X^TX)^{-1}.
\end{aligned}$$

Definition 1.3 (Confidence region)
$$A_\alpha := \Big\{\theta \in \mathbb{R}^d : (\theta - \hat\theta_{MLE})^T\,\frac{X^TX}{\sigma^2}\,(\theta - \hat\theta_{MLE}) \le \alpha\Big\}.$$
We can choose $\alpha$ to guarantee coverage: $\mathbb{P}\big(\theta^* \in A_\alpha(X, y)\big) \ge 95\%$.
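These facts are easy to check numerically. Below is a minimal NumPy sketch (simulated data with assumed sizes $n, d$ and noise level $\sigma$; none of these numbers come from the notes) that computes $\hat\theta_{MLE}$ and verifies empirically that its sampling covariance matches $\sigma^2(X^TX)^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 200, 3, 0.5          # assumed sizes and noise level
X = rng.normal(size=(n, d))        # fixed design matrix
theta_star = np.array([1.0, -2.0, 0.5])

def theta_mle(X, y):
    """Ordinary least squares / Gaussian MLE: (X^T X)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Sampling distribution of the MLE over many replications of the noise
estimates = []
for _ in range(5000):
    y = X @ theta_star + sigma * rng.normal(size=n)
    estimates.append(theta_mle(X, y))
estimates = np.array(estimates)

empirical_cov = np.cov(estimates.T)
theoretical_cov = sigma**2 * np.linalg.inv(X.T @ X)
print(np.round(empirical_cov, 4))
print(np.round(theoretical_cov, 4))   # should be close to the empirical covariance
```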

1.1.2 Wald’s answer


Principle: an estimator or a confidence region is a data-driven decision of the form
$$a : \text{data} \longmapsto \hat\theta \ \text{ or } \ A_\alpha.$$
Let us pick a loss function $L(\theta, a(X, y))$. We would like to choose $a \in \arg\min_a \mathbb{E}_y\, L(\theta^*, a(X, y))$.

Example 1.1.1 In regression with the squared loss, this boils down to
$$\hat\theta \in \arg\min_{\hat\theta} \mathbb{E}_y\big[\|\theta^* - \hat\theta\|^2\big] \;(=: MSE(\hat\theta)).$$

Definition 1.4 (admissible estimator) $\hat\theta$ is said to be admissible if there is no $\tilde\theta = \tilde\theta(X, y)$ such that
$$\forall\theta,\ \mathbb{E}[L(\theta, \tilde\theta)] \le \mathbb{E}[L(\theta, \hat\theta)] \quad\text{and}\quad \exists\,\theta_0 \text{ s.t. } \mathbb{E}[L(\theta_0, \tilde\theta)] < \mathbb{E}[L(\theta_0, \hat\theta)].$$

Theorem 1.5 (Corollary of a result by James & Stein) $\hat\theta_{MLE}$ is not admissible for linear regression.


Exercise: Prove that there exists $\hat\theta$ such that $\forall\,\theta^* \in B(0, \rho)$, $\mathbb{E}_y[\|\hat\theta - \theta^*\|^2] < \mathbb{E}_y[\|\hat\theta_{MLE} - \theta^*\|^2]$.
Let us define the ridge estimator
$$\hat\theta_R(\lambda) = \arg\min_\theta \|y - X\theta\|^2 + \lambda\|\theta\|_2^2, \quad \lambda > 0, \qquad\text{i.e.}\qquad \hat\theta_R(\lambda) = (X^TX + \lambda I)^{-1}X^Ty.$$

Compute $MSE(\hat\theta_R(\lambda))$ for $X^TX = I$.

Lemma 1.6 (bias–variance decomposition)
$$\mathbb{E}[\|\hat\theta - \theta^*\|^2] = \mathrm{Tr}\big(\mathrm{Var}(\hat\theta)\big) + \|\mathbb{E}[\hat\theta] - \theta^*\|^2.$$

For $X^TX = I$,
$$\hat\theta_R(\lambda) = \frac{1}{1+\lambda}\,\hat\theta_{MLE} \quad\text{(shrinkage)},$$
$$\mathbb{E}[\hat\theta_R(\lambda)] = \frac{1}{1+\lambda}\,\theta^*, \qquad \mathrm{Var}[\hat\theta_R(\lambda)] = \frac{1}{(1+\lambda)^2}\,\mathrm{Var}[\hat\theta_{MLE}] = \frac{\sigma^2}{(1+\lambda)^2}\,I,$$
so that
$$MSE(\hat\theta_R(\lambda)) = \frac{d\sigma^2}{(1+\lambda)^2} + \Big(1 - \frac{1}{1+\lambda}\Big)^2\|\theta^*\|^2
\le \frac{d\sigma^2}{(1+\lambda)^2} + \Big(1 - \frac{1}{1+\lambda}\Big)^2\rho^2
= \frac{1}{(1+\lambda)^2}\big[d\sigma^2 + \lambda^2\rho^2\big] =: f(\lambda).$$

Differentiating,
$$\partial_\lambda f\big|_{\lambda=0} \propto \Big[2\lambda\rho^2(1+\lambda)^2 - (\lambda^2\rho^2 + d\sigma^2)\,2(1+\lambda)\Big]_{\lambda=0} = -2d\sigma^2 < 0,$$
hence
$$\exists\,\lambda_0 \text{ s.t. } \forall\,\lambda \in (0, \lambda_0),\ f(\lambda) < f(0) = MSE(\hat\theta_{MLE}).$$

• For the general case, check Wieringen's lecture notes on linear regression. A quick numerical check of the $X^TX = I$ computation above is sketched below.
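The following sketch (assumed dimension, noise level and penalty, chosen only for illustration) uses an orthonormal design so that $X^TX = I$ and compares the Monte Carlo MSE of $\hat\theta_{MLE}$ and $\hat\theta_R(\lambda)$; for small $\lambda$ the ridge MSE drops below $d\sigma^2$, as the derivation predicts.

```python
import numpy as np

rng = np.random.default_rng(1)
d, sigma, lam = 5, 1.0, 0.2                      # assumed dimension, noise level, penalty
theta_star = rng.normal(size=d)
theta_star *= 2.0 / np.linalg.norm(theta_star)   # place theta* in a ball B(0, rho=2)

# Orthonormal design: X^T X = I, so theta_MLE = X^T y and ridge is pure shrinkage
X, _ = np.linalg.qr(rng.normal(size=(50, d)))

mse_mle, mse_ridge, n_rep = 0.0, 0.0, 20000
for _ in range(n_rep):
    y = X @ theta_star + sigma * rng.normal(size=X.shape[0])
    theta_mle = X.T @ y                          # (X^T X)^{-1} X^T y with X^T X = I
    theta_ridge = theta_mle / (1.0 + lam)        # shrinkage form of the ridge estimator
    mse_mle += np.sum((theta_mle - theta_star) ** 2) / n_rep
    mse_ridge += np.sum((theta_ridge - theta_star) ** 2) / n_rep

print(mse_mle)      # ~ d * sigma^2
print(mse_ridge)    # smaller for small enough lambda
```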
Definition 1.7 (minimax estimator) $\hat\theta_{minimax} \in \arg\min_{\hat\theta}\, \sup_{\theta}\, \mathbb{E}_y\big[L(\theta, \hat\theta)\big]$.

Definition 1.8 (Bayes estimator) $\hat\theta_B \in \arg\min_{\hat\theta}\, \mathbb{E}_{\theta, y}\big[L(\theta, \hat\theta)\big]$.


b

Theorem 1.9 (Berger '85) Under topological assumptions (assume $\theta, \hat\theta$ range over closed bounded subsets of $\Theta$) and continuity assumptions on $L$:
• any estimator is dominated by a Bayesian estimator;
• in linear regression, Bayes ⇒ admissible.


1.1.3 Bayesian decisions


Definition 1.10 When confronted with picking an action $a$ depending on a state of the world $s \in \mathcal{S}$, a Bayesian picks 1) a distribution $p \in M_1(\mathcal{S}, \Sigma)$ and 2) a loss function $L$, and chooses
$$a_B \in \arg\min_a \int L(a, s)\,dp(s).$$

Example 1.1.2 (estimation)
• $\mathcal{S} = \mathcal{Y}^n \times \Theta$, where $\mathcal{Y} \subset \mathbb{R}^{d_y}$ and $\Theta \subset \mathbb{R}^{d_\Theta}$;
• $L(\hat\theta, s) = L(\hat\theta, \theta) \overset{\text{choice}}{=} \|\theta - \hat\theta\|^\alpha$;
• $p(s) = p(y_{1,\dots,n}, \theta) = p(y_{1,\dots,n}\mid\theta)\,p(\theta)$, where $p(y_{1,\dots,n}\mid\theta)$ is the likelihood (choice) and $p(\theta)$ is the prior (choice).
R
Exercise: We take $L(\theta, \hat\theta) = \|\theta - \hat\theta\|^2$ and we will show that $\hat\theta_B = \int \theta\, p(\theta\mid y_{1,\dots,n})\,d\theta$.
$$\hat\theta_B \in \arg\min_{\hat\theta} \int \|\theta - \hat\theta(y_{1,\dots,n})\|^2\, p(y_{1,\dots,n}, \theta)\,dy_{1,\dots,n}\,d\theta
= \arg\min_{\hat\theta} \int \|\theta - \hat\theta(y_{1,\dots,n})\|^2\, p(y_{1,\dots,n}\mid\theta)\,p(\theta)\,d\theta\,dy_{1,\dots,n}.$$

Remark 1.1.2 Writing $p(y_{1,\dots,n}\mid\theta)\,p(\theta) = p(\theta\mid y_{1,\dots,n})\,p(y_{1,\dots,n})$, the minimization over the function $\hat\theta(\cdot)$ can be carried out pointwise in $y_{1,\dots,n}$: it suffices to minimize
$$\int \|\theta - \hat\theta\|^2\, p(\theta\mid y_{1,\dots,n})\,d\theta$$
for each fixed $y_{1,\dots,n}$. But
$$\int \|\theta - \hat\theta\|^2\, p(\theta\mid y_{1,\dots,n})\,d\theta
= \int \|\theta - \hat\theta_{MEP} + \hat\theta_{MEP} - \hat\theta\|^2\, p(\theta\mid y_{1,\dots,n})\,d\theta
= \int \|\theta - \hat\theta_{MEP}\|^2\, p(\theta\mid y_{1,\dots,n})\,d\theta + \|\hat\theta_{MEP} - \hat\theta\|^2,$$
where the posterior mean estimator is $\hat\theta_{MEP} = \int \theta\, p(\theta\mid y_{1,\dots,n})\,d\theta$ (the cross term vanishes). The minimum is therefore attained at $\hat\theta = \hat\theta_{MEP}$.


Exercise: Take the likelihood $p(y_{1,\dots,n}\mid\theta, X) = \mathcal{N}(y_{1,\dots,n}\mid X\theta, \sigma^2 I)$ and the Gaussian prior $p(\theta) = \mathcal{N}(\theta\mid 0, \sigma^2 I)$, and compute the posterior:
$$\log p(\theta\mid y_{1,\dots,n}) = \log p(y_{1,\dots,n}\mid\theta) + \log p(\theta) - \log p(y_{1,\dots,n})$$
$$\begin{aligned}
&= -\frac{\|X\theta - y\|^2}{2\sigma^2} - \frac{\|\theta\|^2}{2\sigma^2} + \dots \\
&= -\frac{\langle X\theta - y, X\theta - y\rangle}{2\sigma^2} - \frac{\langle\theta, \theta\rangle}{2\sigma^2} + \dots \\
&= -\frac{\theta^TX^TX\theta - 2\theta^TX^Ty}{2\sigma^2} - \frac{\theta^T\theta}{2\sigma^2} + \dots \\
&= \frac{\theta^TX^Ty}{\sigma^2} - \theta^T\Big(\frac{X^TX}{2\sigma^2} + \frac{I}{2\sigma^2}\Big)\theta + \dots
\qquad\Big(\text{set } \frac{X^TX}{2\sigma^2} + \frac{I}{2\sigma^2} = \frac{\Sigma^{-1}}{2}\Big) \\
&= -\frac{1}{2}\big(\theta - \sigma^{-2}\Sigma X^Ty\big)^T\Sigma^{-1}\big(\theta - \sigma^{-2}\Sigma X^Ty\big) + \dots
\end{aligned}$$
We have $p(\theta\mid y_{1,\dots,n}) = \mathcal{N}(\theta\mid \sigma^{-2}\Sigma X^Ty, \Sigma)$ with $\Sigma = \sigma^2(X^TX + I)^{-1}$.

In particular, $\hat\theta_B = \sigma^{-2}\Sigma X^Ty = (X^TX + I)^{-1}X^Ty$, i.e. a ridge estimator (here with $\lambda = 1$, since the prior variance equals the noise variance).
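This conjugate computation is two lines of linear algebra in code. The sketch below (simulated data with assumed sizes and noise level, and the same prior variance $\sigma^2$ as above) computes the posterior mean and covariance and checks that the posterior mean coincides with ridge regression with $\lambda = 1$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, sigma = 100, 4, 0.3
X = rng.normal(size=(n, d))
theta_star = rng.normal(size=d)
y = X @ theta_star + sigma * rng.normal(size=n)

# Posterior under likelihood N(y | X theta, sigma^2 I) and prior N(0, sigma^2 I):
#   Sigma = sigma^2 (X^T X + I)^{-1},  mean = sigma^{-2} Sigma X^T y
Sigma = sigma**2 * np.linalg.inv(X.T @ X + np.eye(d))
theta_B = Sigma @ (X.T @ y) / sigma**2

# The posterior mean is exactly the ridge estimator with lambda = 1
theta_ridge = np.linalg.solve(X.T @ X + np.eye(d), X.T @ y)
assert np.allclose(theta_B, theta_ridge)
print(theta_B, np.diag(Sigma))   # point estimate and posterior marginal variances
```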

Remark 1.1.3 Let $p(\theta) \propto e^{-\lambda\|\theta\|_1}$ (Laplace prior). Then
$$\log p(\theta\mid y_{1,\dots,n}) = -\frac{\|y - X\theta\|^2}{2\sigma^2} - \lambda\|\theta\|_1 + \dots$$
and $\hat\theta_{LASSO} \in \arg\max_\theta \log p(\theta\mid y_{1,\dots,n})$.

Warning: $\hat\theta_B = \int \theta\, p(\theta\mid y_{1,\dots,n})\,d\theta$ is not sparse (Park and Casella).

Example 1.1.3 (credible regions) $\mathcal{S} = \mathcal{Y}^n \times \Theta$.

For $A \subset \Theta$, take $L(A, s) = \mathbf{1}_{\theta\notin A} + \gamma\,\mathrm{diam}(A)$ where $\gamma$ is a choice, and $p(s) = p(y_{1,\dots,n}, \theta)$. Then
$$\hat A_B \in \arg\min_A \int \mathbf{1}_{\theta\notin A}\, p(\theta\mid y_{1,\dots,n})\,d\theta + \gamma\,\mathrm{diam}(A) \quad\text{(a credible region)}.$$

For example, in ridge regression,
$$p(\theta\mid y_{1,\dots,n}) = \mathcal{N}(\theta\mid \hat\theta_R(\lambda), \Sigma), \qquad
A_\alpha = \big\{\theta : (\theta - \hat\theta_R(\lambda))^T\Sigma^{-1}(\theta - \hat\theta_R(\lambda)) \le \alpha\big\}.$$

Lasso: see Hastie, Tibshirani and Wainwright.

Example 1.1.4 (Horseshoe prior)
$$\mathcal{S} = \mathcal{Y}^n \times \Theta \times \mathbb{R}_+^{d_\Theta} \times \mathbb{R}_+,$$
$$\theta_j \sim \mathcal{N}(0, \tau^2\lambda_j^2), \quad j = 1, \dots, d, \qquad
\lambda_j \sim \mathcal{C}^+(1) \propto \frac{1}{1+\lambda^2}\mathbf{1}_{\lambda>0}, \qquad
\tau \sim \mathcal{C}^+(1).$$
Classification. Let $\mathcal{S} = \mathcal{X}^n \times \mathcal{Y}^n \times \Theta \times \mathcal{X} \times \mathcal{Y}$ and $L(\hat y, s) = \mathbf{1}_{y\neq\hat y}\big(\alpha\mathbf{1}_{y=1} + \beta\mathbf{1}_{y=0}\big)$. Then
$$\hat y_B \in \arg\min_{\hat y} \int \mathbf{1}_{y\neq\hat y}\big(\alpha\mathbf{1}_{y=1} + \beta\mathbf{1}_{y=0}\big)\,dp(x_{1,\dots,n}, y_{1,\dots,n}, \theta, x, y),$$
with
$$p(s) = p(y\mid x, \theta, x_{1,\dots,n}, y_{1,\dots,n})\,p(x, \theta, x_{1,\dots,n}, y_{1,\dots,n}), \qquad
p(x, \theta, x_{1,\dots,n}, y_{1,\dots,n}) = p(\theta\mid x, x_{1,\dots,n}, y_{1,\dots,n})\,p(x, x_{1,\dots,n}, y_{1,\dots,n}).$$

Example 1.1.5 (logistic regression) Take $p(y = +1\mid x, \theta) = \sigma(x^T\theta)$. Then
$$\hat y_B \in \arg\min_{\hat y} \int L(y, \hat y)\,p(y\mid x, \theta)\,p(y_{1,\dots,n}\mid\theta, x_{1,\dots,n})\,p(\theta)\,d\theta\,dx\,dy_{1,\dots,n}.$$
The quantity $f(y) \propto \int p(y\mid x, \theta)\,p(y_{1,\dots,n}\mid x_{1,\dots,n}, \theta)\,p(\theta)\,d\theta$ is called the posterior predictive. Therefore
$$\hat y_B \in \arg\min_{\hat y} \int \mathbf{1}_{y\neq\hat y}\big[\alpha\mathbf{1}_{y=1} + \beta\mathbf{1}_{y=0}\big]f(y)\,dx_{1,\dots,n}\,dy_{1,\dots,n}\,dx\,dy
= \arg\min_{\hat y}\ \alpha f(1)\mathbf{1}_{\hat y=0} + \beta f(0)\mathbf{1}_{\hat y=1},$$
hence
$$\hat y_B = 1 \iff \beta f(0) \le \alpha f(1).$$
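A minimal sketch of this asymmetric decision rule (hypothetical predictive values and costs, chosen for illustration), showing that the Bayes action only needs the posterior predictive $f$ and the two costs $\alpha, \beta$:

```python
def bayes_decision(f1: float, f0: float, alpha: float, beta: float) -> int:
    """Bayes-optimal label under loss 1_{y != yhat} (alpha 1_{y=1} + beta 1_{y=0}).

    f1, f0: posterior predictive probabilities of y=1 and y=0.
    Predicting 0 costs alpha*f1 in expectation, predicting 1 costs beta*f0.
    """
    return 1 if beta * f0 <= alpha * f1 else 0

# Example: missing a y=1 case is 5 times more costly than a false alarm
print(bayes_decision(f1=0.3, f0=0.7, alpha=5.0, beta=1.0))  # -> 1 despite f1 < f0
```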

1.1.4 The likelihood principle


• Reference: Berger and Wolpert '88.

• Bayesian decisions are robust to optional stopping. Let $\mathcal{S} = (\cup_{n\ge 0}\mathcal{Y}^n) \times \Theta$ and let $N$ denote the stopping time. Then
$$\mathbb{E}\,L(\hat\theta, \theta) = \sum_{n\ge 0}\mathbb{E}\big[L(\theta, \hat\theta)\,\mathbf{1}_{N=n}\big]
= \sum_{n\ge 0}\int L(\theta, \hat\theta)\Big[\mathbf{1}_{y_{1,\dots,n}\in\{N=n\}}\prod_{i=1}^{n-1}\mathbf{1}_{y_{1,\dots,i}\notin\{N=i\}}\Big]\,p(y_{1,\dots,n}\mid\theta)\,p(\theta)\,d\theta\,dy_{1,\dots,n}.$$

Bayesian Machine Learning Spring 2020
Lecture 2 — January 24th, 2020
Lecturer: Rémi Bardenet Scribe: W. Jallet, S. Jerad

2.1 A bit of objective Bayes


We suppose we are still in the Bayesian linear regression setting.
Recall the decision function

$$\hat a_B = \operatorname*{argmin}_a\, \mathbb{E}_{Y,\theta}\, L(a, \theta) \tag{2.1}$$
For instance,
$$\hat a_B(X, y, x) = \mathbf{1}\Big\{\tfrac{\beta\, p(y=1\mid X,y,x)}{\alpha\, p(y=0\mid X,y,x)} \ge 1\Big\},
\qquad
\hat a_B(y) = \operatorname*{argmin}_{\text{interval } I}\, \mathbb{E}_{Y,\theta}\big[\mathbf{1}_{\theta\notin I} + \gamma\mu(I)\big]. \tag{2.2}$$

Theorem 2.11 (Bernstein–von Mises (van der Vaart 2000)) We assume that the prior $p(\theta)$ puts "enough mass" around $\theta^* \in \mathring\Theta \subseteq \mathbb{R}^d$. Then for all $\varepsilon > 0$,
$$\mathbb{P}_{p(\cdot\mid\theta^*)}\Big(\sup_{B\subset\Theta}\big|P_{\theta\mid Y,X}(B) - P_{\mathcal{N}(\theta^*,\,\sigma^2(X^TX)^{-1}/N)}(B)\big| \ge \varepsilon\Big) \to 0. \tag{2.3}$$

This result is also called the "Bayesian central limit theorem".

Picking a prior

• find a prior that encodes physical constraints of your problem

• find a prior that comes from symmetries of your problem, e.g. the Jeffreys prior

• try several priors and make sure that âB does not change too much

2.2 More decision problems from ML


Exercise. Frame PCA as a Bayesian decision problem.
Regular PCA: given data $x = (x_1, \dots, x_N) \in \mathbb{R}^{d\times N}$, define
$$\hat\Sigma = \frac{1}{N}\sum_{i=1}^N (x_i - \bar x)(x_i - \bar x)^T = U\Lambda U^T.$$


Then we obtain the normalized PCA vectors as $\hat x_i = \Lambda_{:q}^{-1/2}U_{:q}^T x_i$ (whitened PCA), where the subscript $:q$ indicates we only take the first $q$ components.
For the Bayesian formulation, take a latent variable $x \sim \mathcal{N}(0, I)$ and data $y \sim \mathcal{N}(\mu, WW^T + \sigma^2 I)$. The joint distribution is
$$p(y, x, \mu, \sigma, W) \propto p(y\mid x, \mu, \sigma, W, q)\,p(x)\,p(\mu)\,p(\sigma)\,p(W).$$
Now we choose a prior for the weights $W$. Some suggestions:

1. $p(W) \propto p(W\mid q)\,p(q)$, for instance $p(W\mid q) = \prod_{j=1}^q e^{-\lambda\|w_j\|}$ and a conjugate prior $q \sim \mathcal{P}(\lambda)$ for some hyperparameter $\lambda$.

2. an alternative is $p(W) \propto p(W\mid v)\,p(v) = \prod_{j=1}^{d-1} e^{-\|w_j\|^2/(2v_j^2)}\,p(v)$ with a prior $p(v)$, which can be for instance a Laplace distribution to enforce sparsity of the weights, or a horseshoe distribution.

Now a question is: how do you recover the MLE?
Theorem 2.12 (Bishop, Tipping, 1997) It holds that
$$\hat W_{MLE} = U_{:q}(\Lambda_{:q} - \sigma^2 I)^{1/2}. \tag{2.4}$$
Then the PCA vectors are given (via the pseudo-inverse) by
$$\hat x = \hat W_{MLE}^\dagger(y - \bar y) = (\Lambda_{:q} - \sigma^2 I)^{-1/2}U_{:q}^T(y - \bar y) \xrightarrow[\sigma\to 0]{} \Lambda_{:q}^{-1/2}U_{:q}^T(y - \bar y). \tag{2.5}$$
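A small NumPy sketch of this recipe (synthetic data; the noise level $\sigma^2$ is estimated as the mean of the discarded eigenvalues, a standard choice in probabilistic PCA but an assumption here): it builds $\hat W_{MLE}$ from the eigendecomposition of the sample covariance and recovers the projections of equation (2.5).

```python
import numpy as np

rng = np.random.default_rng(3)
d, N, q = 5, 500, 2
W_true = rng.normal(size=(d, q))
Y = W_true @ rng.normal(size=(q, N)) + 0.1 * rng.normal(size=(d, N))   # data, d x N

Ybar = Y.mean(axis=1, keepdims=True)
S = (Y - Ybar) @ (Y - Ybar).T / N                  # sample covariance
lam, U = np.linalg.eigh(S)                         # ascending eigenvalues
lam, U = lam[::-1], U[:, ::-1]                     # sort descending

sigma2 = lam[q:].mean()                            # assumed noise-variance estimate
W_mle = U[:, :q] @ np.diag(np.sqrt(lam[:q] - sigma2))   # equation (2.4)

# Projections of equation (2.5), via the pseudo-inverse of W_mle
X_hat = np.linalg.pinv(W_mle) @ (Y - Ybar)
print(X_hat.shape)                    # (q, N)
print(np.round(np.cov(X_hat), 2))     # approximately the identity when sigma^2 is small
```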

Exercise. How would you formalize clustering as a Bayesian decision problem?

Example 2.2.1 (Latent Dirichlet Allocation (Blei et al. [3])) Let $q_{d\ell} \in \{1, \dots, T\}$ be the topic of a word $\ell \in \{1, \dots, L_d\}$ inside document $d \in \{1, \dots, D\}$. See Figure 2.1.

Figure 2.1. Graphical model for LDA (nodes: $\alpha$, $\Pi_d \in \Delta_T$, $q_{d\ell}$, $y_{d\ell}$, $\beta$, $B$).

We want to prove that $\Pi_d \sim \mathcal{D}(\alpha)$ (Dirichlet distribution) is conjugate to $q_{d\ell} \sim \mathrm{Cat}(\Pi_d)$. We have
$$p(\Pi_d\mid q_{d\cdot}, \alpha) \propto \Bigg(\prod_{\ell=1}^{L_d}\prod_{t=1}^T \Pi_{dt}^{\mathbf{1}_{q_{d\ell}=t}}\Bigg)\prod_{t=1}^T \Pi_{dt}^{\alpha-1}\,\mathbf{1}_{\Pi_d\in\Delta_T}, \tag{2.6}$$
which is again a Dirichlet density.

With the misclassification error
$$L(\hat q, q) = \mathbf{1}_{q\neq\hat q},$$
we then have (exercise)
$$\hat q_{d\ell} = \operatorname*{argmax}_t \int p(q_{d\ell} = t\mid\Pi_d, y_{d\ell}, B, \beta)\,p(\Pi_d, y, B\mid\alpha, \beta)\,d\Pi_d\,dB. \tag{2.7}$$

2.3 Subjective Bayes


We denote by $\mathcal{S}$ the states of the world, $\mathcal{Z}$ the space of consequences, and $\mathcal{A} = \mathcal{F}(\mathcal{S}, \mathcal{Z})$ the set of functions from $\mathcal{S}$ to $\mathcal{Z}$.

Theorem 2.13 (Savage) Let $\prec$ be a preference relation over $\mathcal{A}$ that is complete and transitive. Then the following statements are equivalent:

• $\prec$ satisfies a few more intuitive postulates ("internal coherence");

• there exist a unique function $L$ on $\mathcal{A}\times\mathcal{S}$ and a probability distribution $\pi$ on $\mathcal{S}$ such that
$$a \prec a' \iff \int L(a, s)\,d\pi(s) \le \int L(a', s)\,d\pi(s).$$

This is the idea of rationality, e.g. from neoclassical economics. The loss L is bounded, and
π is finitely additive. The prior is coupled to the loss. We may act before having any data,
but as data comes in our actions will become more appropriate.

2.3.1 Computational aspects


Exercise (Metropolis–Hastings) We denote by $\alpha(x, y)$ the acceptance probability of the MH algorithm, $\alpha(x, y) = \min\Big(1, \frac{\pi(y)q(x\mid y)}{\pi(x)q(y\mid x)}\Big)$.

1. Show that the transition kernel is $p(x, y) = \alpha(x, y)q(y\mid x) + \delta_x(y)\Big(1 - \int\alpha(x, z)q(z\mid x)\,dz\Big)$.

2. Show that $\int \pi(x)\,p(x, y)\,dx = \pi(y)$.
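A minimal Metropolis–Hastings sketch (the target here is a standard Gaussian and the proposal a Gaussian random walk, both chosen only for illustration) that implements exactly the acceptance probability above; the invariance property of item 2 is what makes the empirical moments converge.

```python
import numpy as np

def metropolis_hastings(log_pi, x0, n_iter=10000, step=1.0, rng=None):
    """Random-walk MH targeting pi; q(y|x) = N(y | x, step^2) is symmetric,
    so the acceptance probability reduces to min(1, pi(y)/pi(x))."""
    rng = rng or np.random.default_rng(0)
    x, chain = x0, []
    log_px = log_pi(x)
    for _ in range(n_iter):
        y = x + step * rng.normal()
        log_py = log_pi(y)
        if np.log(rng.uniform()) < log_py - log_px:   # accept w.p. min(1, pi(y)/pi(x))
            x, log_px = y, log_py
        chain.append(x)
    return np.array(chain)

# Target: standard Gaussian (known only up to a constant)
chain = metropolis_hastings(lambda x: -0.5 * x**2, x0=5.0)
print(chain[1000:].mean(), chain[1000:].var())   # ~0 and ~1 after burn-in
```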

Exercise (Gibbs sampler) The Gibbs sampler is useful when the conditional distributions of the variables (given one another) are known.

1. Given $x = (x_1, x_2)$, take the proposal
$$q(y\mid x) = \frac{1}{2}\pi(y_1\mid x_2)\pi(y_2\mid y_1) + \frac{1}{2}\pi(y_2\mid x_1)\pi(y_1\mid y_2).$$
Show that $\alpha(x, y) = 1$.

2. Derive all of the conditionals in Latent Dirichlet Allocation (LDA).

Check out this website for interactive visualisations of MCMC algorithms.

Bayesian Machine Learning Spring 2020
Lecture 3 — January 31st, 2020
Lecturer: Rémi Bardenet Scribe: Antoine Barrier

Remark 3.0.1 Computation: exact / MCMC / VB.

3.1 Variational Bayes


Remember that we often have to compute
$$\int L\big(a, (\theta, z_{1:N}, s)\big)\,p(\theta, z_{1:N}\mid y_{1:N})\,d\theta\,dz_{1:N}.$$
The key quantity we want to determine is $p(\theta, z_{1:N}\mid y_{1:N})$.

Example 3.1.1 (LDA)
• the number of latent variables is $\Omega(\sum_i L_i)$;
• they are discrete.

VB objective Find
$$q \in \operatorname*{argmin}_{q\in\mathcal{Q}}\, KL\big(q, p((\theta, z)\mid y)\big) \tag{VB}$$
where $\mathcal{Q}$ is a set of probability distributions over $(\theta, z_{1:N})$ and $KL(p, q) = \int p\log(p/q)$.

1. We choose $\mathcal{Q}$ so that (VB) is easy.

Example 3.1.2 (Mean-field approximation) We assume all variables are independent under every probability in $\mathcal{Q}$; in other words, if $q \in \mathcal{Q}$:
$$q(\theta, z_{1:N}) = \prod_{d=1}^{d_\theta} q_d^{\eta_d}(\theta_d)\prod_{i=1}^{N}\prod_{j=1}^{d_z} q_{ij}^{\eta_{ij}}(z_{ij}).$$

Remark 3.1.1
• In mean field, coordinatewise optimization is tractable and cheap.
• Check out [7] for LDA (exercise).

2. Variational autoencoding Bayes: see [5].


Figure 3.2. Optimization process: pick $x_1 \in \mathcal{X}$, observe $f_1$, then pick $x_2$, observe $f_2$.

Figure 3.3. Graphical model: $f_1 \sim \mathcal{N}(f(x_1), \sigma^2)$, $f_2 \sim \mathcal{N}(f(x_2), \sigma^2)$.

3.2 Bayesian optimization


We only consider a two-stage optimization problem here (see Figure 3.2).
Let $(x_1, x_2) \in \mathcal{A} = \mathcal{X}\times\mathcal{X}$, $\mathcal{S} = \mathbb{R}\times\mathbb{R}\times\mathbb{R}^{\mathcal{X}}$, and (see Figure 3.3)
$$p(s) = p(f_1, f_2, f) \propto p(f_2\mid f)\,p(f_1\mid f)\,p(f).$$
We need a prior distribution for $f$. We consider the loss function
$$L(a_{x_1,x_2}, s) = f_2 - f^* \quad\text{where } f^* = \min_{\mathcal{X}} f.$$

Remark 3.2.1 Other common loss functions are
$$L(a_{x_1,x_2}, s) = \min(f_1 - f^*, f_2 - f^*), \qquad L(a_{x_1,x_2}, s) = \sum_{i=1}^2 f_i - f^*.$$

Our Bayesian action is
$$\hat a_B \in \operatorname*{argmin}_{x_1,x_2} \int [f_2 - f^*]\,p(f, f_1, f_2)\,df\,df_1\,df_2
= \operatorname*{argmin}_{x_1} \int p(f_1)\,df_1\,\Big[\operatorname*{argmin}_{x_2=S(x_1,f_1)} \int [f_2 - f^*]\,p(f_2, f\mid f_1)\,df\,df_2\Big].$$


We have
$$p(f_1) \propto \int \underbrace{p(f_1\mid f)}_{\mathcal{N}(f(x_1),\,\sigma^2)}\,\underbrace{p(f)}_{???}\,df
\qquad\text{and}\qquad
p(f_2, f\mid f_1) = \underbrace{p(f_2\mid f)}_{\mathcal{N}(f(x_2),\,\sigma^2)}\,\underbrace{p(f\mid f_1)}_{???}.$$

Remark 3.2.2
1. We need to specify a prior over functions $p(f)$ such that $p(f\mid f_1)$ is tractable.
2. Dynamic programming is usually intractable → approximate DP. See [1].

Greedy solution: sequential Bayesian optimization Consider the following algorithm:

Algorithm 1:
Input: $(x_1, f_1), \dots, (x_N, f_N)$
for $t \in [[N+1, T]]$ do
$$x_t = \operatorname*{argmax}_{x}\ \underbrace{\int \Big(\min_{1\le j\le t-1} f_j - f_t\Big)_+\,p(f_t\mid f_{1:t-1})\,df_t}_{\text{expected improvement}}, \qquad p(f_t\mid f_{1:t-1}) \propto \int p(f_t\mid f)\,p(f\mid f_{1:t-1})\,df$$
end
Gaussian processes

Definition 3.14 $f$ is said to follow a Gaussian process $GP(\mu: \mathcal{X}\to\mathbb{R},\ k: \mathcal{X}\times\mathcal{X}\to\mathbb{R})$ if
$$\forall p \ge 1,\ \forall x_{1:p} \in \mathcal{X},\quad (f(x_1), \dots, f(x_p)) \sim \mathcal{N}(\mu_{1:p}, K_{1:p,1:p})$$
where $\mu_i = \mu(x_i)$ and $K_{ij} = k(x_i, x_j)$.

Exercise 3.2.1 If $k$ is a Mercer kernel, $k(x, y) = \sum_{i=1}^\infty \lambda_i e_i(x)e_i(y)$ pointwise with $\int k(x, x)\,dx < +\infty$, and $(z_i)_i \overset{iid}{\sim} \mathcal{N}(0, 1)$, show that $f(x) = \sum_{i\ge 1}\sqrt{\lambda_i}\,z_i e_i(x)$ satisfies the definition.
Hint: start with two variables:
$$\mathrm{Cov}(f_1, f_2) = \mathbb{E}[f_1 f_2] = \sum_{i\ge 1}\sqrt{\lambda_i}\sqrt{\lambda_i}\,\underbrace{\mathbb{E}[z_i^2]}_{=1}\,e_i(x_1)e_i(x_2) = k(x_1, x_2).$$

For $k(x, y) = e^{-\|x-y\|^2/(2\lambda^2)}$, samples are in $C^\infty$; see the lecture notes on Bayesian Nonparametrics by P. Orbanz (http://www.gatsby.ucl.ac.uk/~porbanz/papers/porbanz_BNP_draft.pdf).
Proposition 3.15 If $f \sim GP(0, k)$ and $f_i = f(x_i) + \varepsilon_i$ where $(\varepsilon_i)_i \overset{iid}{\sim} \mathcal{N}(0, \sigma^2)$, then
$$f \mid \sigma\big((x_1, f_1), \dots, (x_p, f_p)\big) \sim GP(\tilde\mu, \tilde k)$$
with
$$\tilde\mu(x) = \big(k(x, x_1), \dots, k(x, x_p)\big)(K_{1:p,1:p} + \sigma^2 I_p)^{-1}\begin{pmatrix}f_1\\ \vdots\\ f_p\end{pmatrix}$$
$$\tilde k(x, y) = k(x, y) - \big(k(x, x_1), \dots, k(x, x_p)\big)(K_{1:p,1:p} + \sigma^2 I_p)^{-1}\begin{pmatrix}k(y, x_1)\\ \vdots\\ k(y, x_p)\end{pmatrix}.$$

Exercise 3.2.2 Writing the joint distribution
$$\begin{pmatrix}f_1\\ \vdots\\ f_p\\ f(x_{p+1})\\ \vdots\\ f(x_q)\end{pmatrix} \sim \mathcal{N}\left(0,\ \begin{pmatrix}K_{1:p,1:p} + \sigma^2 I_p & K_{1:p,\,p+1:q}\\ K_{p+1:q,\,1:p} & K_{p+1:q,\,p+1:q}\end{pmatrix}\right),$$
we get by Gaussian conditioning
$$\begin{pmatrix}f(x_{p+1})\\ \vdots\\ f(x_q)\end{pmatrix}\Bigg|\, f_{1:p} \sim \mathcal{N}\Big(K_{p+1:q,\,1:p}(K_{1:p,1:p} + \sigma^2 I_p)^{-1}f_{1:p},\ \ K_{p+1:q,\,p+1:q} - K_{p+1:q,\,1:p}(K_{1:p,1:p} + \sigma^2 I_p)^{-1}K_{1:p,\,p+1:q}\Big).$$

Finally, the marginal likelihood of the noisy observations is
$$p(f_{1:N}\mid x_{1:N}, \theta) = \int \underbrace{p\big(f_{1:N}\mid f(x_1), \dots, f(x_N)\big)}_{\mathcal{N}(f(x_{1:N}),\,\sigma^2 I)}\,\underbrace{p\big(f(x_1), \dots, f(x_N)\mid x_{1:N}, \theta\big)}_{\mathcal{N}(0,\,K)}\,df(x_{1:N}) = \mathcal{N}(f_{1:N}\mid 0, \sigma^2 I + K).$$
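A compact NumPy sketch of Proposition 3.15 (hypothetical 1-D inputs, a squared-exponential kernel and noise level chosen only for illustration): it computes the posterior mean $\tilde\mu$ and covariance $\tilde k$ on a grid of test points from the formulas above.

```python
import numpy as np

def sq_exp_kernel(A, B, ell=0.5):
    """Squared-exponential kernel k(x, y) = exp(-|x - y|^2 / (2 ell^2))."""
    d2 = (A[:, None] - B[None, :]) ** 2
    return np.exp(-d2 / (2 * ell**2))

rng = np.random.default_rng(4)
sigma = 0.1
x_train = rng.uniform(-3, 3, size=8)
f_train = np.sin(x_train) + sigma * rng.normal(size=8)   # noisy observations
x_test = np.linspace(-3, 3, 100)

K = sq_exp_kernel(x_train, x_train)
K_star = sq_exp_kernel(x_test, x_train)                  # k(x, x_j) for test points
A = np.linalg.inv(K + sigma**2 * np.eye(len(x_train)))

mu_tilde = K_star @ A @ f_train                                   # posterior mean
k_tilde = sq_exp_kernel(x_test, x_test) - K_star @ A @ K_star.T   # posterior covariance
std = np.sqrt(np.clip(np.diag(k_tilde), 0, None))
print(mu_tilde[:5], std[:5])     # predictions and pointwise uncertainties
```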

Bayesian machine learning Spring 2020
Lecture 4 — February 7, 2020
Lecturer: Julyan Arbel Scribe: Nicolas Pinon, Aitor Artola

4.1 Introduction
Bayesian nonparametrics: Bayesian statistics that is not parametric. Not parametric: the parameters are not finite-dimensional, i.e. there is an unbounded/growing/infinite number of parameters.
GitHub of the course: https://github.com/jarbel/bml-course

4.2 Dirichlet process


Definition 4.16 (Dirichlet process, Ferguson 1973) $P$ is a Dirichlet process on a space $\Theta$ if there exist $\alpha > 0$ and a probability measure $P_0$ such that, for all $k \in \mathbb{N}^*$ and every partition $(A_1, \dots, A_k)$ of $\Theta$:
$$(P(A_1), \dots, P(A_k)) \sim \mathrm{Dir}\big(\alpha P_0(A_1), \dots, \alpha P_0(A_k)\big).$$

Definition 4.17 (Beta distribution) $X \sim \mathrm{Beta}(a, b)$:
$$f(x) \propto x^{a-1}(1-x)^{b-1}.$$

Definition 4.18 (Dirichlet distribution) $X \sim \mathrm{Dir}(a_1, \dots, a_k)$:
$$f(x) \propto x_1^{a_1-1}\cdots x_k^{a_k-1} \quad\text{with } \textstyle\sum_i x_i = 1.$$

Take $A, B \subset \Theta$ and consider the partition $\{A, A^c\}$ of $\Theta$:
$$(P(A), P(A^c)) \sim \mathrm{Dir}\big(\alpha P_0(A), \alpha P_0(A^c)\big), \qquad P(A) \sim \mathrm{Beta}\big(\alpha P_0(A), \alpha(1 - P_0(A))\big).$$
The expectation and variance of a Beta law are
$$\mathbb{E}[\mathrm{Beta}(a, b)] = \frac{a}{a+b}, \qquad \mathrm{Var}[\mathrm{Beta}(a, b)] = \frac{ab}{(a+b+1)(a+b)^2}.$$
We deduce the expectation and variance of our Dirichlet process:
$$\mathbb{E}[P(A)] = P_0(A), \qquad \mathrm{Var}[P(A)] = \frac{P_0(A)(1 - P_0(A))}{1+\alpha}, \qquad \mathrm{Cov}[P(A), P(B)] = \frac{P_0(A\cap B) - P_0(A)P_0(B)}{1+\alpha}.$$

Theorem 4.19 (De Finetti)
$$\text{Exchangeability} \iff \text{conditionally i.i.d. given a latent random measure}.$$

Note: independence implies exchangeability, but not conversely.

Theorem 4.20 (conjugacy) Consider $X_1, \dots, X_n \mid P \overset{iid}{\sim} P$ with the Dirichlet prior $P \sim DP(\alpha P_0)$. The posterior of $P$ in this model is
$$P \mid X_1, \dots, X_n \sim DP\Big(\alpha P_0 + \sum_{i=1}^n \delta_{X_i}\Big)$$
and the predictive distribution is
$$P(X_{n+1} \in \cdot \mid X_n, \dots, X_1) = \frac{\alpha}{\alpha+n}P_0 + \frac{1}{\alpha+n}\sum_{i=1}^n \delta_{X_i}.$$

Definition 4.21 (conjugacy update)
$$\alpha \leftarrow \alpha + n, \qquad P_0 \leftarrow \frac{\alpha}{\alpha+n}P_0 + \frac{1}{\alpha+n}\sum_{i=1}^n \delta_{X_i}.$$

So if we have a $DP(G_0)$ parametrized by an unnormalized base measure $G_0$, we can read off its parameters as $\alpha = G_0(\Theta)$ and $P_0 = \frac{G_0}{G_0(\Theta)}$. Here $\sum_{i=1}^n \delta_{X_i}$ is the (unnormalized) empirical measure.

Definition 4.22 (Pólya urn) Start with an urn containing a mass $\alpha$ of black balls. If we pick a black ball, we add to the urn a ball of a new color $X_i$ drawn from $P_0$; if we pick a non-black ball, we add a ball of the same color. This scheme reproduces the predictive distributions of a DP:
1. $X_1 \mid P \sim P$:
$$\mathbb{P}(X_1 \in A) = \mathbb{E}_P\big[\mathbb{P}(X_1 \in A \mid P)\big] = \mathbb{E}_P[P(A)] = P_0(A) \;\Rightarrow\; X_1 \sim P_0.$$
2. $X_2 \mid X_1 \sim \frac{\alpha}{\alpha+1}P_0 + \frac{1}{\alpha+1}\delta_{X_1}$.

Definition 4.23 (Chinese Restaurant Process) Customers enter a Chinese restaurant one by one and choose a table according to the $DP(\alpha P_0)$ predictive. We denote by $K$ the number of occupied tables, by $X_j^*$ the value (dish) attached to table $j$, and by $n_j$ the number of customers at table $j$. The DP predictive can then be rewritten as
$$P(X_{n+1} \mid X_n, \dots, X_1) = \frac{\alpha}{\alpha+n}P_0 + \frac{1}{\alpha+n}\sum_{i=1}^n \delta_{X_i}
= \frac{\alpha}{\alpha+n}P_0 + \frac{n}{\alpha+n}\cdot\frac{1}{n}\sum_{j=1}^K n_j\,\delta_{X_j^*}.$$
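A short simulation of the Chinese restaurant process (hypothetical $\alpha$ and base measure $P_0 = \mathcal{N}(0, 1)$, both assumptions made for illustration), following the predictive above: with probability $\alpha/(\alpha+n)$ open a new table with a fresh draw from $P_0$, otherwise join an existing table with probability proportional to its size.

```python
import numpy as np

def crp_sample(n, alpha, rng=None):
    """Sample table assignments and table values from a CRP(alpha) with base P0 = N(0, 1)."""
    rng = rng or np.random.default_rng(5)
    assignments, table_values, counts = [], [], []
    for i in range(n):
        probs = np.array(counts + [alpha], dtype=float) / (i + alpha)
        j = rng.choice(len(probs), p=probs)
        if j == len(counts):                 # open a new table, draw its dish from P0
            counts.append(1)
            table_values.append(rng.normal())
        else:                                # join existing table j
            counts[j] += 1
        assignments.append(j)
    return assignments, table_values, counts

assignments, values, counts = crp_sample(n=1000, alpha=2.0)
print(len(counts))                          # number of tables K_n, roughly alpha * log(n)
print(sorted(counts, reverse=True)[:5])     # a few large tables dominate
```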

We can also derive the law of the table counts:
$$P(n_1, \dots, n_K) = \alpha^K\,\frac{\Gamma(\alpha)}{\Gamma(\alpha+n)}\prod_{j=1}^K (n_j - 1)!$$
Using $\frac{\Gamma(\alpha)}{\Gamma(\alpha+n)} = \frac{1}{\alpha(\alpha+1)\cdots(\alpha+n-1)} = \frac{1}{(\alpha)_n}$, we deduce the probability that all customers sit at the same table:
$$P(n_1 = n) = \frac{\alpha}{\alpha}\cdot\frac{1}{\alpha+1}\cdots\frac{n-1}{\alpha+n-1} = \frac{\alpha}{(\alpha)_n}(n-1)!$$


and the probability of having one customer per table:
$$P(n_1 = 1, \dots, n_n = 1) = \prod_{i=1}^n \frac{\alpha}{\alpha+i-1} = \frac{\alpha^n}{(\alpha)_n}.$$

We now study the combinatorics of the number of tables $K_n$. Introduce
$$D_i = \begin{cases}1 & \text{if } X_i \text{ is seated at a new table}\\ 0 & \text{otherwise}\end{cases}, \qquad D_i \sim \mathrm{Ber}\Big(\frac{\alpha}{\alpha+i-1}\Big) \text{ independently}, \qquad K_n = \sum_{i=1}^n D_i.$$

Proposition 4.24
$$\mathbb{E}[K_n] = \sum_{i=1}^n \mathbb{E}[D_i] = \sum_{i=1}^n \frac{\alpha}{\alpha+i-1} \xrightarrow[n\to+\infty]{} \infty, \qquad \mathbb{E}[K_n] \underset{n\to+\infty}{\sim} \alpha\log n.$$

Proposition 4.25
$$\frac{K_n}{\log n} \xrightarrow[n\to+\infty]{a.s.} \alpha.$$

Proposition 4.26 (CLT for $K_n$)
$$\frac{K_n - \mathbb{E}[K_n]}{\mathrm{Std}(K_n)} \to \mathcal{N}(0, 1).$$
Proof idea: Lindeberg CLT for independent (non-identically distributed) random variables.

If $P_0$ is non-atomic, then $P(K_n = K) = \dots$

Let $m_l = \#(\text{tables with } l \text{ customers})$ for $l = 1, \dots, n$. Then
$$\sum_{j=1}^K n_j = n, \qquad \sum_{l=1}^n m_l = K, \qquad \sum_{l=1}^n l\,m_l = n.$$

Definition 4.27 (Population genetics: Ewens sampling formula)
$$P(m_1, \dots, m_n) = \frac{n!}{(\alpha)_n}\,\frac{\alpha^K}{\prod_{l=1}^n l^{m_l}\,m_l!}.$$
Indeed, each particular seating arrangement with these table sizes has probability
$$\frac{\alpha^{m_1}(\alpha\cdot 1)^{m_2}(\alpha\cdot 1\cdot 2)^{m_3}\cdots\big(\alpha\cdot 1\cdots(n-1)\big)^{m_n}}{\alpha(\alpha+1)\cdots(\alpha+n-1)} = \frac{\alpha^K}{(\alpha)_n}\prod_{l=1}^n \big((l-1)!\big)^{m_l},$$
and the number of such arrangements is the multinomial count
$$\frac{1}{\prod_l m_l!}\binom{n}{1,\dots,1,2,\dots,2,\dots,n} = \frac{n!}{\prod_{l=1}^n (l!)^{m_l}\,m_l!}.$$


Definition 4.28 (Stick-breaking for the DP) Let $V_i \overset{iid}{\sim} \mathrm{Beta}(1, \alpha)$ with $\alpha > 0$, $p_1 = V_1$ and $p_i = V_i\prod_{l=1}^{i-1}(1 - V_l)$, and let $\theta_i \overset{iid}{\sim} P_0$. Then $\sum_{i=1}^\infty p_i = 1$ almost surely and
$$P = \sum_{i=1}^\infty p_i\,\delta_{\theta_i} \sim DP(\alpha P_0), \qquad X_1, \dots, X_n \mid P \overset{iid}{\sim} P.$$
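A truncated stick-breaking sketch (the truncation level and the base measure $P_0 = \mathcal{N}(0, 1)$ are assumptions for illustration): it draws the weights $p_i$ and atoms $\theta_i$, then samples $X_1, \dots, X_n$ from the resulting discrete $P$.

```python
import numpy as np

def stick_breaking_dp(alpha, n_atoms=500, rng=None):
    """Truncated stick-breaking representation of P ~ DP(alpha * P0), with P0 = N(0, 1)."""
    rng = rng or np.random.default_rng(6)
    V = rng.beta(1.0, alpha, size=n_atoms)
    # p_i = V_i * prod_{l<i} (1 - V_l)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - V)[:-1]))
    p = V * remaining
    p /= p.sum()                      # renormalize the truncation error away
    theta = rng.normal(size=n_atoms)  # atoms drawn iid from P0
    return p, theta

p, theta = stick_breaking_dp(alpha=2.0)
rng = np.random.default_rng(7)
X = rng.choice(theta, size=1000, p=p)      # X_1, ..., X_n | P iid ~ P
print(len(np.unique(X)))                   # few distinct values: P is discrete
```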

Definition 4.29 (Mixture model)
$$\begin{cases}
Y_i \mid X_i, P \sim f(\cdot\mid X_i) & \text{(often Gaussian)}\\
X_1, \dots, X_n \mid P \overset{iid}{\sim} P\\
P \sim DP(\alpha P_0)
\end{cases}$$
A clustering of $(X_1, \dots, X_n)$ induces a clustering of $(Y_1, \dots, Y_n)$; this is useful for density estimation.
Definition 4.30 (Pitman–Yor process)
$$P(X_{n+1} \mid X_1, \dots, X_n) = \frac{\alpha + K\sigma}{\alpha+n}P_0 + \frac{1}{\alpha+n}\sum_{j=1}^K (n_j - \sigma)\,\delta_{X_j^*}$$
with $\sigma \in [0, 1)$. With $\sigma = 0$ we recover the DP. One can show $\mathbb{E}[K_n] \sim S\,n^\sigma$ with $S$ some random variable.

Definition 4.31 (Stick-breaking interpretation)
$$V_i \sim \mathrm{Beta}(1 - \sigma, \alpha + i\sigma), \qquad p_i = V_i\prod_{l<i}(1 - V_l).$$
Definition 4.32 (Feature allocation model / Indian Buffet Process)
• Customer 1 tries $N_1 \sim \mathrm{Pois}(\gamma)$ dishes (features).
• Customer 2:
  – tries every dish of Customer 1 with probability $1/2$;
  – tries $\mathrm{Pois}(\gamma/2)$ new dishes.
  The total number of dishes tried by Customer 2 is $N_2 \sim \mathrm{Pois}(\gamma/2) + \mathrm{Pois}(\gamma/2) = \mathrm{Pois}(\gamma)$ (thinning of $N_1$ with probability $1/2$, plus the new dishes).
• ...
• Customer $i$:
  – tries every existing dish $j \in \{1, \dots, K\}$ with probability $n_j/i$;
  – tries $\mathrm{Pois}(\gamma/i)$ new dishes.
In particular, $\sum_{j=1}^K n_j = \sum_{l=1}^{i-1} N_l \sim \mathrm{Pois}((i-1)\gamma)$, and the number of distinct dishes satisfies $K_i = K_{i-1} + K_i^+ \sim \mathrm{Pois}\big(\gamma(1 + \tfrac12 + \dots + \tfrac1i)\big)$.

Definition 4.33 Hierarchical DP (Teh et al.):

Bayesian machine learning Spring 2020
Lecture 5 — February 14, 2020
Lecturer: Julyan Arbel Scribe: W. Jallet, A. Floyrac, C. Guillo

5.1 The use for Bayesian Deep Learning


5.1.1 Bayesian model averaging (BMA)
We want to obtain a predictive distribution for our variable $x$ given our dataset $\mathcal{D}$:
$$p(x\mid\mathcal{D}) = \int_\Theta \underbrace{p(x\mid\theta)}_{\text{model}}\,\underbrace{p(\theta\mid\mathcal{D})}_{\text{posterior}}\,d\theta.$$
This can also be a conditional predictive if we are in a regression or classification problem:
$$p(Y\mid X, \mathcal{D}) = \int_{\mathcal{W}} p(Y\mid X, W)\,p(W\mid\mathcal{D})\,dW. \tag{5.8}$$

5.1.2 Uncertainty
Epistemic uncertainty, also known as model uncertainty, represents uncertainty over which hypothesis (or parameter value) is correct given the amount of available data.

Aleatoric uncertainty is, essentially, noise in the data measurements (e.g. measurement errors in sensor data).

Thus, a Bayesian approach to deep learning considers epistemic uncertainty in a principled way, where this uncertainty is carried over to the posterior distribution on our parameter space.

5.1.3 Link between Bayesian DL and regularized Maximum Likelihood
When using regularized maximum likelihood to learn parameters, we compute a quantity
$$\hat\theta \in \operatorname*{argmax}_{\theta\in\Theta}\, \underbrace{\log p(\mathcal{D}\mid\theta)}_{\text{likelihood}} + \underbrace{\log p(\theta)}_{\text{penalty}}. \tag{5.9}$$
If the penalty term is indeed a log prior, $-R(\theta) = \log p(\theta)$, the previous regularized MLE is known as the maximum a posteriori (MAP) estimator, which can be written
$$\hat\theta_{MAP} \in \operatorname*{argmax}_{\theta\in\Theta}\, \underbrace{p(\theta\mid\mathcal{D})}_{\text{actual posterior}}. \tag{5.10}$$


This is still an optimization problem, and not really Bayesian inference. Indeed, MAP
is taking the maximizing mode(s) in the posterior (and not computing a full predictive
distribution), dropping all of the uncertainty it contains and thus all of the information on
the predictive uncertainty.

• A Gaussian prior p(θ) ∝ exp(−kθk22 /2) on parameter space leads to `2 regularization,


and the corresponding MAP estimator is known as the Ridge estimator.

• A Laplace prior p(θ) ∝ exp(−kθk1 ) yields `1 penalization and the so-called LASSO
estimator.

5.1.4 Bayesian Model Averaging (BMA) vs. Model Combination


Methods
Reference: see [2, ch. 14].
N.B. for instance, mixture models are model combination methods.

Gaussian mixtures They are generative models on the data likelihood:
$$p(X) = \sum_{k=1}^K \pi_k\,\mathcal{N}(X\mid\mu_k, \Sigma_k). \tag{5.11}$$
We introduce latent variables $Z \in \{0, 1\}^K$ s.t. $\sum_k z_k = 1$, which represent which mixture component a data point belongs to (i.e. it belongs to the $k$-th component iff $z_k = 1$). Then the joint likelihood of our variable $X$ and (unobserved) latent variable $Z$ is
$$p(X, Z) = p(Z)\,p(X\mid Z) \tag{5.12}$$
where:

• $p(Z) = \prod_k \pi_k^{z_k}$, i.e. $p(z_k = 1) = \pi_k$;

• $p(X\mid Z)$ factorizes with $p(X\mid z_k = 1) = \mathcal{N}(X\mid\mu_k, \Sigma_k)$, i.e.
$$p(X\mid Z) = \prod_{k=1}^K \mathcal{N}(X\mid\mu_k, \Sigma_k)^{z_k}. \tag{5.13}$$

The likelihood is obtained as usual by marginalizing with respect to the latent variable $Z$:
$$p(X) = \sum_Z p(Z)\,p(X\mid Z) \tag{5.14}$$
where we sum over all possible (one-hot) $Z \in \{0, 1\}^K$; there are $K$ of them due to the constraint above.


The full observed-data likelihood is written
$$p(\mathcal{D}) = \prod_{i=1}^n p(X_i) = \prod_{i=1}^n\Bigg(\sum_{Z_i} p(Z_i)\,p(X_i\mid Z_i)\Bigg) \tag{5.15}$$
where $\mathcal{D} = \{X_1, \dots, X_n\}$.
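A small sketch of equations (5.11)–(5.15) (hypothetical 1-D data and two components with fixed, assumed parameters): it evaluates the per-point marginal $p(X_i)$ by summing over the $K$ one-hot values of $Z_i$, then the full log-likelihood of the dataset.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Assumed 1-D mixture parameters (K = 2 components)
pi = np.array([0.3, 0.7])
mu = np.array([-2.0, 1.0])
sd = np.array([0.5, 1.0])

rng = np.random.default_rng(8)
z = rng.choice(2, size=500, p=pi)              # latent components
X = rng.normal(mu[z], sd[z])                   # observed data

# Equation (5.14): p(X_i) = sum_k pi_k N(X_i | mu_k, sigma_k^2)
per_component = pi[None, :] * normal_pdf(X[:, None], mu[None, :], sd[None, :])
p_Xi = per_component.sum(axis=1)

# Equation (5.15): full observed-data log-likelihood
log_lik = np.sum(np.log(p_Xi))
print(log_lik)

# Posterior responsibilities p(z_k = 1 | X_i), the E-step quantity of EM
resp = per_component / p_Xi[:, None]
print(resp[:3].round(3))
```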
This is in contrast to BMA, where the whole dataset is generated by a single model (see Minka 2002), yielding a conditional predictive distribution
$$p(y\mid x, \mathcal{D}) = \int_{\mathcal{W}} p(y, W\mid x, \mathcal{D})\,dW = \mathbb{E}_W\big[p(y\mid x, W)\mid\mathcal{D}\big]. \tag{5.16}$$

BMA Consider $H$ different models indexed by $h = 1, \dots, H$ (in the discrete case) with a prior probability $p(h)$. The marginal distribution of the data $X$ is
$$p(X) = \sum_{h=1}^H p(X\mid h)\,p(h) \qquad\text{or}\qquad p(X) = \int_{\mathcal{H}} p(X\mid h)\,p(h)\,dh. \tag{5.17}$$

Example 5.1.1 We are given observations $X = \{x_1, \dots, x_n\}$.
$$p(x\mid X) = \int_\Theta p(x, \theta\mid X)\,d\theta = \mathbb{E}_\theta\big[p(x\mid\theta)\mid X\big] = \int_\Theta p(x\mid\theta, X)\,p(\theta\mid X)\,d\theta. \tag{5.18}$$

5.2 Bayesian Neural Networks (BNNs)


Reference: see Neal () and MacKay [6].
We put a common (isotropic) prior $\mathcal{N}(0, \sigma^2)$ on the (independent) weights of the NN. A neural network defines a parametric mapping
$$f_w : \begin{cases} \mathcal{X} \longrightarrow \mathcal{Y}\\ x \longmapsto f_w(x) \end{cases} \tag{5.19}$$
For regression, we want a conditional predictive distribution $y\mid x$. We take a Gaussian likelihood
$$p(y\mid x, w) = \mathcal{N}(y\mid f_w(x), \tau^2). \tag{5.20}$$
For data $\mathcal{D} = \{(X_i, Y_i)\}_i$, we get a full data likelihood under weights $w$
$$p(\mathcal{D}\mid w) = \prod_{i=1}^n \mathcal{N}(Y_i\mid f_w(X_i), \tau^2) \tag{5.21}$$
and the posterior distribution on the weights is given by Bayes' rule:
$$p(w\mid\mathcal{D}) \propto \underbrace{p(w)}_{\text{prior}}\,p(\mathcal{D}\mid w) \propto \mathcal{N}(w\mid 0, \sigma^2 I)\prod_{i=1}^n \mathcal{N}(Y_i\mid f_w(X_i), \tau^2). \tag{5.22}$$


The BNN wide limit Some notation:

• inputs $X \in \mathbb{R}^{H^{(0)}}$, $H^{(0)} = d$
• depth $L \in \mathbb{N}^*$
• output $Y \in \mathbb{R}^{H^{(L+1)}}$
• width $H^{(\ell)} \in \mathbb{N}^*$ for the layer at depth $\ell \in \{0, \dots, L+1\}$
• non-linearity $\phi$
• pre-nonlinearity $g^{(\ell)}(X) = W^{(\ell)}h^{(\ell-1)}(X)$
• post-nonlinearity $h^{(\ell)}(X) = \phi(g^{(\ell)}(X))$ (applied elementwise) for $\ell \ge 1$

We also impose $h^{(0)}(X) = X$.

Example 5.2.1 (Single hidden layer, Neal (1996)) In this setup, $L = 1$ and $H = H^{(1)}$. The equations of the NN boil down to
$$g^{(1)}(X) = g(X) = W^{(1)}X \in \mathbb{R}^H, \qquad h^{(1)}(X) = h(X) = \phi(W^{(1)}X), \qquad Y(X) = W^{(2)}h^{(1)}(X) = W^{(2)}\phi(W^{(1)}X). \tag{5.23}$$
How do uncertainties propagate? For all $1 \le i \le H$, $g_i(X)$ is a random variable and
$$g_i(X) = \sum_j W_{ij}^{(1)}X_j \overset{iid}{\sim} \mathcal{N}(0, \|X\|_2^2\,\sigma_H^2).$$
Thus, the hidden variables $h_i(X) = \phi(g_i(X))$ are iid and are functions of Gaussians.
The output is $Y(X) = W^{(2)}h(X)$. Because the weights are iid, $W^{(2)}$ and $h_i(X)$ are independent, thus the statistics of each neuron output $Y_i$ are
$$\mathbb{E}[Y_i(X)] = \sum_{j=1}^H \mathbb{E}_{W^{(2)}}\big[W_{ij}^{(2)}\big]\,\mathbb{E}[h_j(X)] = 0$$
and
$$\mathrm{Var}(Y_{ij}) = \mathbb{E}\big[(W_{ij}^{(2)})^2\big]\,\underbrace{\mathbb{E}\big[(h_j(X))^2\big]}_{=c\ \text{(constant)}},$$
where we denote $Y_{ij} = W_{ij}^{(2)}h_j(X)$ so that $Y_i = \sum_j Y_{ij}$, and recall that the $h_j(X)$ are iid. In conclusion, we have a predictive distribution for $Y_i(X)$ which is not Gaussian but has mean $0$ and variance $cH\sigma_H^2$.
We have a version of the Central Limit Theorem (CLT): with the scaling $\sigma_H^2 = \sigma^2/H$,
$$Y_i(X) \xrightarrow[H\to+\infty]{d} \mathcal{N}(0, cH\sigma_H^2). \tag{5.24}$$
This choice keeps the asymptotic variance $H\sigma_H^2 = \sigma^2$ constant and nondegenerate.
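The wide limit is easy to visualize numerically. The sketch below is an assumption-laden illustration (hypothetical input, widths, a tanh nonlinearity, and the $1/H$ variance scaling applied to the second-layer weights, i.e. the layer whose width grows): it samples single-hidden-layer networks from the prior at a fixed input and watches the output distribution approach a Gaussian as $H$ grows.

```python
import numpy as np

rng = np.random.default_rng(9)
d, sigma = 3, 1.0
X = rng.normal(size=d)           # fixed input

def sample_outputs(H, n_samples=5000, phi=np.tanh):
    """Draw Y(X) = W2 phi(W1 X) from the prior; Var(W2_ij) = sigma^2 / H."""
    W1 = rng.normal(0.0, sigma, size=(n_samples, H, d))
    W2 = rng.normal(0.0, sigma / np.sqrt(H), size=(n_samples, H))
    hidden = phi(W1 @ X)                    # shape (n_samples, H)
    return np.sum(W2 * hidden, axis=1)      # one scalar output per sampled network

for H in (1, 10, 100, 500):
    Y = sample_outputs(H)
    # Mean ~ 0; variance stabilizes; excess kurtosis -> 0 (Gaussian limit)
    kurt = np.mean((Y - Y.mean())**4) / Y.var()**2 - 3.0
    print(H, round(Y.var(), 3), round(kurt, 3))
```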

This result does extend to deeper networks where $L > 1$ (see a 2018 result). We see that, asymptotically, the prior predictive distribution of the $i$-th output $Y_i(x)$ is a white-noise Gaussian process. This is intuitive: we have learned nothing (the input $X$ is fixed, has no prior, and we have not conditioned on any observations of the $Y_i$), the weights are distributed randomly, so the predictor should contain no information.
5.2.1 Understanding the prior at the level of the units [9]
What can we say about the priors of $h^{(\ell)}(x)$, $g^{(\ell)}(x)$ at a given number of units $H^{(\ell)}$? We suppose as before the weights' prior $W_{ij}^{(\ell)} \overset{iid}{\sim} \mathcal{N}(0, \sigma^2)$.
We need a condition on the nonlinearity $\phi$, called the extended envelope condition:
$$\phi(x) \ge c_1 + d_1|x| \text{ on } \mathbb{R}_+ \text{ or } \mathbb{R}_-, \qquad \phi(x) \le c_2 + d_2|x| \text{ on } \mathbb{R},$$
where $d_1, d_2 > 0$. This imposes a kind of ReLU-like nonlinearity.


Now, we can precisely characterize the distribution of pre- and post-nonlinearities.

Theorem 5.34 (Vladimirova et al. [9] (2018)) We assume the conditions above on the priors and nonlinearity. Then, conditional on $X$, the prior of $g_i^{(\ell)}(X)$ or $h_i^{(\ell)}(X)$ at layer $\ell$ is sub-Weibull with tail parameter $\theta = \ell/2$.

Definition 5.35 (Sub-Weibull distribution) A random variable $X$ is sub-Weibull with tail parameter $\theta$ if its c.d.f. $F$ satisfies the following conditions:
$$1 - F(t) \le e^{-\lambda t^{1/\theta}} \tag{5.26}$$
for some $\lambda > 0$ (right tail), and
$$F(t) \underset{t\to-\infty}{\le} e^{-\lambda|t|^{1/\theta}} \tag{5.27}$$
for the left tail.


Figure 5.4. Impact of the number of layers on the prior distribution. Taken from [8].

Remark 5.2.1 In the above definition, the quantity 1 − F (t) is also called the survival
function.

We can define the following specific sub-Weibull distributions:

• Sub-Gaussian: a sub-Weibull with parameter $\theta = 1/2$, i.e.
$$1 - F(t) \le e^{-\lambda t^2} \tag{5.28}$$

• Sub-Exponential: a sub-Weibull with $\theta = 1$, i.e. the survival function satisfies
$$1 - F(t) \le e^{-\lambda t} \tag{5.29}$$

We can also interpret these priors from a regularization point of view; the mode of the weights' posterior distribution given data $\mathcal{D} = \{(X_i, Y_i)\}_i$ is, as usual, the MAP estimator
$$\hat w_{MAP} \in \operatorname*{argmax}_{w\in\mathcal{W}}\, p(w\mid\mathcal{D}) = \operatorname*{argmax}_{w}\, \log p(\mathcal{D}\mid w) + \log p(w). \tag{5.30}$$
Weight decay regularization for NNs is nothing more than applying $\ell_2$ regularization on the weights (which is the same as using a Gaussian prior $p(w) \propto \exp(-\|w\|_2^2)$).
5.2.2 Subspace inference for Bayesian DL
Reference: Izmailov et al. (2019) [4]

Posterior inference is not really scalable in general, especially when the parameter space is large, which is the case in deep learning where the space $\mathcal{W}$ is often high-dimensional. The idea is to construct lower-dimensional subspaces, e.g. spanned by the first few principal components of the SGD trajectory, and then perform variational inference there: we can perform prediction using the approximate posterior predictive distribution, and the resulting uncertainty is well-calibrated. The idea is analogous to PCA, where we reduce the feature space to a given number of principal components.

Bibliography

[1] D. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, 2017.
[2] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Sci-
ence and Statistics). Berlin, Heidelberg: Springer-Verlag, 2006. isbn: 0387310738.
[3] David M. Blei et al. “Latent Dirichlet Allocation”. In: J. Mach. Learn. Res. 3.null (Mar.
2003), pp. 993–1022. issn: 1532-4435.
[4] Pavel Izmailov et al. Subspace Inference for Bayesian Deep Learning. 2019. arXiv: 1907.
07504 [cs.LG].
[5] Diederik P. Kingma and Max. Welling. Auto-Encoding Variational Bayes. 2013. arXiv:
1312.6114 [stat.ML].
[6] David J.C. MacKay. "Bayesian neural networks and density networks". In: Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment 354.1 (1995). Proceedings of the Third Workshop on Neutron Scattering Data Analysis, pp. 73–80. issn: 0168-9002. doi: 10.1016/0168-9002(94)00931-7. url: http://www.sciencedirect.com/science/article/pii/0168900294009317.
[7] Kevin P. Murphy. “A Variational Approximation for Bayesian Networks with Discrete
and Continuous Latent Variables”. In: CoRR abs/1301.6724 (2013). arXiv: 1301.6724.
url: http://arxiv.org/abs/1301.6724.
[8] Mariia Vladimirova and Julyan Arbel. Sub-Weibull distributions: generalizing sub-Gaussian
and sub-Exponential properties to heavier-tailed distributions. 2019. arXiv: 1905.04955
[math.ST].
[9] Mariia Vladimirova et al. Understanding Priors in Bayesian Neural Networks at the
Unit Level. 2018. arXiv: 1810.05193 [stat.ML].
