Scribe Notes BML
• Decision theory
• Bayesian nonparametrics
In matrix form,
$$y = X\theta^* + \varepsilon.$$
Remark 1.1.1 Your job is (i) to come up with an estimator $\hat\theta = \hat\theta(X, y)$. Often, you also need (ii) to report some region $A = A(X, y) \subset \Theta$ with level of confidence $\alpha$.
• (iii) $\hat\theta_{MLE}$ has the minimum variance among linear unbiased estimators (here $A \succeq B$ means that $A - B$ is positive semidefinite).
$$a : \text{data} \longmapsto \hat\theta \quad\text{or}\quad \hat A_\alpha.$$
Let us pick a loss function $L(\theta, a(X, y))$. We would like to choose $a \in \arg\min_a \mathbb{E}_y\, L(\theta^*, a(X, y))$.
Example 1.1.1 In regression with the squared loss, this boils down to minimizing the mean squared error $\mathbb{E}_y[\|\hat\theta - \theta^*\|^2]$.
Theorem 1.5 (Corollary of a result by James & Stein) $\hat\theta_{MLE}$ is not admissible for linear regression (in dimension $d \geq 3$).
Exercise: Prove that there exists $\hat\theta$ such that for all $\theta^* \in B(0, \rho)$, $\mathbb{E}_y[\|\hat\theta - \theta^*\|^2] < \mathbb{E}_y[\|\hat\theta_{MLE} - \theta^*\|^2]$.
Let's define
$$\hat\theta_R(\lambda) = \arg\min_\theta\, \|y - X\theta\|^2 + \lambda\|\theta\|_2^2 = (X^\top X + \lambda I)^{-1} X^\top y, \qquad \lambda > 0.$$
Lemma 1.6 (bias-variance decomposition)
$$\mathbb{E}[\|\hat\theta - \theta\|^2] = \mathrm{Tr}\big(\mathrm{Var}(\hat\theta)\big) + \|\mathbb{E}[\hat\theta] - \theta\|^2.$$
In the orthonormal-design case ($X^\top X = I$), ridge regression is a shrinkage of the MLE:
$$\hat\theta_R(\lambda) = \frac{1}{1+\lambda}\,\hat\theta_{MLE} \quad\text{(shrinkage)},$$
$$\mathbb{E}[\hat\theta_R(\lambda)] = \frac{1}{1+\lambda}\,\theta^*, \qquad \mathrm{Var}[\hat\theta_R(\lambda)] = \frac{1}{1+\lambda}\,\mathrm{Var}[\hat\theta_{MLE}]\,\frac{1}{1+\lambda} = \frac{\sigma^2}{(1+\lambda)^2}\, I.$$
$$\mathrm{MSE}(\hat\theta_R(\lambda)) = \frac{d\sigma^2}{(1+\lambda)^2} + \Big(1 - \frac{1}{1+\lambda}\Big)^2 \|\theta^*\|^2
\leq \frac{d\sigma^2}{(1+\lambda)^2} + \Big(1 - \frac{1}{1+\lambda}\Big)^2 \rho^2
= \frac{1}{(1+\lambda)^2}\big[d\sigma^2 + \lambda^2\rho^2\big] =: f(\lambda),$$
$$\frac{\partial f}{\partial\lambda}\Big|_{\lambda=0} \propto \Big[2\lambda\rho^2(1+\lambda)^2 - (\lambda^2\rho^2 + d\sigma^2)\,2(1+\lambda)\Big]\Big|_{\lambda=0} = -2d\sigma^2 < 0,$$
so a small $\lambda > 0$ strictly decreases the bound below $\mathrm{MSE}(\hat\theta_{MLE}) = d\sigma^2$, which solves the exercise.
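To see this numerically, here is a small Monte Carlo check (a sketch added to these notes, not from the lecture): it uses an orthonormal design $X = I_d$ so that the shrinkage formula above applies; the values of $d$, $\sigma$, $\rho = \|\theta^*\|$ and $\lambda$ are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma, lam = 10, 1.0, 0.1                     # dimension, noise std, small ridge parameter (arbitrary)
theta_star = rng.normal(size=d)
theta_star *= 2.0 / np.linalg.norm(theta_star)   # set ||theta*|| = rho = 2
X = np.eye(d)                                    # orthonormal design, so X^T X = I and theta_MLE = X^T y

mse_mle, mse_ridge, n_rep = 0.0, 0.0, 20000
for _ in range(n_rep):
    y = X @ theta_star + sigma * rng.normal(size=d)
    theta_mle = X.T @ y                          # = (X^T X)^{-1} X^T y since X^T X = I
    theta_ridge = theta_mle / (1.0 + lam)        # shrinkage form of the ridge estimator
    mse_mle += np.sum((theta_mle - theta_star) ** 2)
    mse_ridge += np.sum((theta_ridge - theta_star) ** 2)

print("MSE(MLE)   ~", mse_mle / n_rep)           # close to d * sigma^2 = 10
print("MSE(ridge) ~", mse_ridge / n_rep)         # close to f(lam) < d * sigma^2 for small lam > 0
```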
• For the general case, check Wieringen’s lecture notes on linear regression
Definition 1.7 $\hat\theta_{\mathrm{minimax}} := \hat\theta_m \in \arg\min_{\hat\theta} \sup_{\theta} \mathbb{E}_y\big[L(\theta, \hat\theta)\big]$.
Theorem 1.9 (Berger '85) Under topological assumptions ($\theta$, $\hat\theta$ range in closed bounded subsets of $\Theta$) and continuity assumptions on $L$:
• Any estimator is dominated by a Bayesian estimator.
• In linear regression, Bayes $\Rightarrow$ admissible.
• $L(\hat\theta, \theta) \stackrel{\text{choice}}{=} \|\theta - \hat\theta\|^\alpha$ and $L(\hat\theta, s) \stackrel{\text{choice}}{=} L(\hat\theta, \theta)$.
• $p(s) = p(y_{1,\dots,n}, \theta) = p(y_{1,\dots,n} \mid \theta)\, p(\theta)$, where the likelihood $p(y_{1,\dots,n} \mid \theta)$ and the prior $p(\theta)$ are both modelling choices.
Exercise: We define $L(\theta, \hat\theta) = \|\theta - \hat\theta\|^2$ and we will show that $\hat\theta_B = \int \theta\, p(\theta \mid y_{1,\dots,n})\, d\theta$, where
$$\hat\theta_B \in \arg\min_{\hat\theta} \int \|\theta - \hat\theta(y_{1,\dots,n})\|^2\, p(y_{1,\dots,n}, \theta)\, dy_{1,\dots,n}\, d\theta.$$
Remark 1.1.2
$$\min_{\hat\theta} \int \|\theta - \hat\theta\|^2\, p(y_{1,\dots,n} \mid \theta)\, p(\theta)\, d\theta\, dy_{1,\dots,n}
= \min_{\hat\theta} \int \Big[\int \|\theta - \hat\theta\|^2\, p(\theta \mid y_{1,\dots,n})\, d\theta\Big]\, p(y_{1,\dots,n})\, dy_{1,\dots,n}.$$
But
$$\int \|\theta - \hat\theta\|^2\, p(\theta \mid y_{1,\dots,n})\, d\theta
= \int \|\theta - \hat\theta_{MEP} + \hat\theta_{MEP} - \hat\theta\|^2\, p(\theta \mid y_{1,\dots,n})\, d\theta
= \int \|\theta - \hat\theta_{MEP}\|^2\, p(\theta \mid y_{1,\dots,n})\, d\theta + \|\hat\theta_{MEP} - \hat\theta\|^2,$$
where the posterior-mean estimator is $\hat\theta_{MEP} = \int \theta\, p(\theta \mid y_{1,\dots,n})\, d\theta$ (the cross term vanishes because $\hat\theta_{MEP}$ is the posterior mean). The minimum is thus attained at $\hat\theta = \hat\theta_{MEP} = \hat\theta_B$.
In particular, with a Gaussian prior $\theta \sim \mathcal{N}(0, \sigma_0^2 I)$,
$$\hat\theta_B = \sigma^{-2}\, \Sigma X^\top y = \Big(X^\top X + \frac{\sigma^2}{\sigma_0^2}\, I\Big)^{-1} X^\top y,$$
where $\Sigma$ is the posterior covariance of $\theta$.
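A minimal numpy sketch of this conjugate computation (not from the lecture; it assumes a Gaussian prior $\mathcal{N}(0, \sigma_0^2 I)$ and Gaussian noise of variance $\sigma^2$, and the data are synthetic):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 3
sigma, sigma0 = 0.5, 2.0                    # noise std and prior std (illustrative values)
X = rng.normal(size=(n, d))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + sigma * rng.normal(size=n)

# Posterior covariance and mean of theta | y under the prior N(0, sigma0^2 I)
Sigma = np.linalg.inv(X.T @ X / sigma**2 + np.eye(d) / sigma0**2)
theta_B = Sigma @ X.T @ y / sigma**2        # posterior mean = Bayes estimator under squared loss

# Equivalent ridge-like form: (X^T X + (sigma^2/sigma0^2) I)^{-1} X^T y
theta_B_alt = np.linalg.solve(X.T @ X + (sigma**2 / sigma0**2) * np.eye(d), X.T @ y)
print(theta_B, theta_B_alt)                 # the two expressions agree
```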
With a Laplace prior $p(\theta) \propto e^{-\lambda\|\theta\|_1}$, the log-posterior is
$$\log p(\theta \mid y_{1,\dots,n}) = -\frac{\|y - X\theta\|^2}{2\sigma^2} - \lambda\|\theta\|_1 + \text{const},$$
so the MAP estimator is the LASSO.
$$S = \mathcal{Y}^n \times \Theta \times \mathbb{R}_+^{d_\Theta} \times \mathbb{R}_+.$$
$$\lambda_j \sim \mathcal{C}^+(1) \quad\text{with density} \propto \frac{1}{1+\lambda^2}\,\mathbb{1}_{\lambda > 0}, \qquad \tau \sim \mathcal{C}^+(1)$$
(half-Cauchy priors on the local scales $\lambda_j$ and the global scale $\tau$).
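As an illustration (added to these notes), here is a sketch that samples from this kind of hierarchy; the final line, which draws $\theta_j \sim \mathcal{N}(0, \tau^2\lambda_j^2)$ as in the horseshoe prior, is an assumption about how the scales enter the model.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 5000

# Half-Cauchy C+(1) samples: absolute value of a standard Cauchy
lam = np.abs(rng.standard_cauchy(size=d))   # local scales lambda_j ~ C+(1)
tau = np.abs(rng.standard_cauchy())         # global scale tau ~ C+(1)

# Assumed horseshoe-style conditional: theta_j | lambda_j, tau ~ N(0, tau^2 * lambda_j^2)
theta = rng.normal(size=d) * tau * lam

# Heavy tails plus strong shrinkage: many tiny coordinates, a few huge ones
print(np.quantile(np.abs(theta), [0.5, 0.9, 0.999]))
```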
Let $S = \mathcal{X}^n \times \mathcal{Y}^n \times \Theta \times \mathcal{X} \times \mathcal{Y}$ and $L(\hat y, s) = \mathbb{1}_{y \neq \hat y}\,(\alpha\,\mathbb{1}_{y=1} + \beta\,\mathbb{1}_{y=0})$. Then
$$\hat y_B \in \arg\min_{\hat y} \int \mathbb{1}_{y \neq \hat y}\,(\alpha\,\mathbb{1}_{y=1} + \beta\,\mathbb{1}_{y=0})\, dp(x_{1,\dots,n}, y_{1,\dots,n}, \theta, x, y).$$
Example 1.1.5 (logistic regression)
$$p(y = +1 \mid x, \theta) = \sigma(x^\top\theta),$$
$$\hat y_B \in \arg\min_{\hat y} \int L(y, \hat y)\, p(y \mid x, \theta)\, p(y_{1,\dots,n} \mid \theta, x_{1,\dots,n})\, p(\theta)\, d\theta\, dy_{1,\dots,n}\, dy.$$
The density $f(y) \propto \int p(y \mid x, \theta)\, p(y_{1,\dots,n} \mid x_{1,\dots,n}, \theta)\, p(\theta)\, d\theta$ is called the posterior predictive. Hence
$$\hat y_B \in \arg\min_{\hat y} \int \mathbb{1}_{y \neq \hat y}\,[\alpha\,\mathbb{1}_{y=1} + \beta\,\mathbb{1}_{y=0}]\, f(y)\, dy
= \arg\min_{\hat y} \big[\alpha f(1)\,\mathbb{1}_{\hat y = 0} + \beta f(0)\,\mathbb{1}_{\hat y = 1}\big],$$
so $\hat y_B = 1$ if and only if $\alpha f(1) \geq \beta f(0)$.
$$= \sum_{n \geq 0} \int L(\theta, \hat\theta)\,\Big[\mathbb{1}_{y_{1,\dots,n} \in \{N = n\}} \prod_{i=1}^{n-1} \mathbb{1}_{y_{1,\dots,i} \notin \{N = n\}}\Big]\, p(y_{1,\dots,n} \mid \theta)\, p(\theta)\, d\theta\, dy_{1,\dots,n}.$$
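For the classification example above, the Bayes decision only needs the posterior predictive probabilities $f(1)$ and $f(0)$. Here is a minimal sketch (not from the lecture): the Monte Carlo approximation of the posterior predictive from samples of $\theta$, the sigmoid link, and the values $\alpha = 1$, $\beta = 5$ are all illustrative assumptions; in practice the samples of $\theta$ would come from, e.g., MCMC.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def bayes_label(x, theta_samples, alpha=1.0, beta=5.0):
    """Bayes decision under the loss 1{y != yhat} (alpha 1{y=1} + beta 1{y=0})."""
    # posterior predictive f(1) = E_theta[ p(y=1 | x, theta) ], approximated by posterior samples
    f1 = np.mean(sigmoid(theta_samples @ x))
    f0 = 1.0 - f1
    # expected loss of predicting 0 is alpha * f1, of predicting 1 is beta * f0
    return 1 if alpha * f1 >= beta * f0 else 0

# toy usage with fake "posterior samples" of theta (stand-ins for MCMC output)
rng = np.random.default_rng(3)
theta_samples = rng.normal(loc=[1.0, -0.5], scale=0.3, size=(1000, 2))
print(bayes_label(np.array([0.2, 0.1]), theta_samples))
```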
Bayesian Machine Learning Spring 2020
Cours 2 — January 24th, 2020
Lecturer: Rémi Bardenet Scribe: W. Jallet, S. Jerad
Theorem 2.11 (Bernstein-von Mises, van der Vaart 2000) Assume that the prior $p(\theta)$ puts "enough mass" around $\theta^* \in \mathring{\Theta} \subseteq \mathbb{R}^d$. Then for all $\varepsilon > 0$,
$$\mathbb{P}_{p(\cdot \mid \theta^*)}\Big(\sup_{B \subset \Theta}\big|\, \mathbb{P}_{\theta \mid Y, X}(B) - \mathbb{P}_{\mathcal{N}(\theta^*,\, \sigma^2 (X^\top X)^{-1}/N)}(B)\,\big| \geq \varepsilon\Big) \to 0. \tag{2.3}$$
Picking a prior
• find a prior that comes from symmetries of your problem, e.g. the Jeffreys prior
• try several priors and make sure that $\hat a_B$ does not change too much
Then we obtain the normalized PCA vectors as $\Lambda_{:q}^{-1/2}\hat x_i = \Lambda_{:q}^{-1/2} U_{:q}^\top x_i$ (whitened PCA), where the subscript $:q$ indicates that we only take the first $q$ components.
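A minimal numpy sketch of this whitening step (added to the notes; the synthetic data matrix and the choice $q = 2$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, q = 500, 6, 2
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))   # correlated synthetic data
Xc = X - X.mean(axis=0)                                  # center

# Eigendecomposition of the sample covariance: C = U Lambda U^T
C = Xc.T @ Xc / n
eigval, U = np.linalg.eigh(C)
order = np.argsort(eigval)[::-1]                         # sort eigenvalues in decreasing order
Lam_q, U_q = eigval[order][:q], U[:, order][:, :q]

X_pca = Xc @ U_q                                         # projections U_{:q}^T x_i
X_white = X_pca / np.sqrt(Lam_q)                         # whitened: Lambda_{:q}^{-1/2} U_{:q}^T x_i

print(np.cov(X_white.T))                                 # approximately the identity matrix
```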
For the Bayesian formulation, take $x \sim \mathcal{N}(0, I)$ and $y \sim \mathcal{N}(\mu,\, WW^\top + \sigma^2 I)$. The joint distribution is
$$p(y, x, \mu, \sigma, W) \propto p(y \mid x, \mu, \sigma, W, q)\, p(x)\, p(\mu)\, p(\sigma)\, p(W).$$
Now we choose a prior for the weights W . Some suggestions:
1. $p(W) \propto p(W \mid q)\, p(q)$, for instance $p(W \mid q) = \prod_{j=1}^{q} e^{-\lambda\|w_j\|}$ and a prior $q \sim \mathcal{P}(\lambda)$ (Poisson) for some hyperparameter $\lambda$.
2. An alternative is $p(W) \propto p(W \mid v)\, p(v) = \prod_{j=1}^{d-1} e^{-\|w_j\|^2/(2v_j^2)}\, p(v)$ with a prior $p(v)$ on the scales, which lets unnecessary columns of $W$ be switched off automatically (in the spirit of automatic relevance determination).
Example 2.2.1 (Latent Dirichlet Allocation, Blei et al. [3]) Let $q_{d\ell} \in \{1, \dots, T\}$ be the topic of a word $\ell \in \{1, \dots, L_d\}$ inside of document $d \in \{1, \dots, D\}$. See Figure 2.1.
[Figure 2.1. Graphical model for LDA: hyperparameter $\alpha$, topic proportions $\Pi_d \in \Delta_T$, topic assignments $q_{d\ell}$, observed words $y_{d\ell}$, and topic-word parameters $\beta$.]
Theorem 2.13 (Savage) Let ≺ be a preference relation over A that is complete and tran-
sitive.
Then, the following statements are equivalent:
This is the idea of rationality, e.g. from neoclassical economics. The loss L is bounded, and
π is finitely additive. The prior is coupled to the loss. We may act before having any data,
but as data comes in our actions will become more appropriate.
Exercise (Gibbs sampler) The Gibbs sampler is useful in the case where the conditional
distributions of the variables (conditionally on each other) are known.
1. Given $x = (x_1, x_2)$, consider the proposal
$$q(y \mid x) = \frac{1}{2}\,\pi(y_1 \mid x_2)\,\pi(y_2 \mid y_1) + \frac{1}{2}\,\pi(y_2 \mid x_1)\,\pi(y_1 \mid y_2).$$
Show that the Metropolis-Hastings acceptance probability satisfies $\alpha(x, y) = 1$.
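As a concrete illustration (not part of the exercise's solution), here is a minimal Gibbs sampler for a bivariate Gaussian target with correlation $\rho$, whose full conditionals $\pi(y_1 \mid y_2) = \mathcal{N}(\rho y_2, 1 - \rho^2)$ are available in closed form; it uses a slightly simpler single-coordinate random-scan variant rather than the exact two-step proposal written above, and every move is accepted, consistent with $\alpha(x, y) = 1$.

```python
import numpy as np

rng = np.random.default_rng(5)
rho = 0.8
n_iter = 20000

x = np.zeros(2)
samples = np.empty((n_iter, 2))
for t in range(n_iter):
    i = rng.integers(2)                 # random-scan: pick a coordinate uniformly
    j = 1 - i
    # full conditional of the bivariate Gaussian: x_i | x_j ~ N(rho * x_j, 1 - rho^2)
    x[i] = rho * x[j] + np.sqrt(1.0 - rho**2) * rng.normal()
    samples[t] = x

print(np.corrcoef(samples[5000:].T))    # empirical correlation close to rho
```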
Bayesian Machine Learning Spring 2020
Cours 3 — January 31st, 2020
Lecturer: Rémi Bardenet Scribe: Antoine Barrier
• discrete
VB objective Find
$$q \in \operatorname*{argmin}_{q \in \mathcal{Q}}\ \mathrm{KL}\big(q,\ p((\theta, z) \mid y)\big), \tag{VB}$$
where $\mathcal{Q}$ is a family of probability distributions over $(\theta, z_{1:N})$ and $\mathrm{KL}(p, q) = \int p \log(p/q)$.
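As a toy illustration of the (VB) objective (entirely a sketch added to these notes; the Poisson-likelihood model and the Gaussian family $\mathcal{Q}$ are assumptions), we can fit $q = \mathcal{N}(m, s^2)$ to a non-conjugate 1-D posterior by minimizing a Monte Carlo estimate of $\mathrm{KL}(q, p(\theta \mid y))$ up to the unknown normalizing constant, which is equivalent to maximizing the ELBO.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)
y = rng.poisson(lam=3.0, size=20)            # observed counts; model: y_i ~ Poisson(exp(theta))

def log_joint(theta):
    # log p(y, theta) up to a constant, vectorized over an array of theta values
    loglik = np.sum(y) * theta - len(y) * np.exp(theta)   # Poisson(exp(theta)) likelihood
    logprior = -0.5 * theta**2                            # N(0, 1) prior on theta
    return loglik + logprior

eps = rng.normal(size=2000)                  # fixed base samples (common random numbers)

def kl_up_to_const(params):
    m, log_s = params
    s = np.exp(log_s)
    theta = m + s * eps                      # reparametrized samples theta ~ q = N(m, s^2)
    log_q = -0.5 * np.log(2 * np.pi) - log_s - 0.5 * eps**2
    return np.mean(log_q - log_joint(theta)) # = KL(q, p(.|y)) - log p(y)

res = minimize(kl_up_to_const, x0=np.array([0.0, 0.0]), method="Nelder-Mead")
m_hat, s_hat = res.x[0], np.exp(res.x[1])
print("variational posterior: N(%.3f, %.3f^2)" % (m_hat, s_hat))
```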
[Graphical model: inputs $x_1, x_2 \in \mathcal{X}$ with latent function values $f(x_1), f(x_2)$ and noisy observations $f_1 \sim \mathcal{N}(f(x_1), \sigma^2)$, $f_2 \sim \mathcal{N}(f(x_2), \sigma^2)$.]
We have:
$$p(f_1) = \int \underbrace{p(f_1 \mid f)}_{\mathcal{N}(f(x_1),\, \sigma^2)}\ \underbrace{p(f)}_{???}\, df
\qquad\text{and}\qquad
p(f_2, f \mid f_1) = \underbrace{p(f_2 \mid f)}_{\mathcal{N}(f(x_2),\, \sigma^2)}\ \underbrace{p(f \mid f_1)}_{???}.$$
Remark 3.2.2 1. We need to specify a prior over functions $p(f)$ such that conditionals like $p(f \mid f_1)$ are tractable.
For $k(x, y) = e^{-\frac{\|x - y\|^2}{2\lambda^2}}$, the samples are in $C^\infty$; see the lecture notes on Bayesian Nonparametrics by P. Orbanz (http://www.gatsby.ucl.ac.uk/~porbanz/papers/porbanz_BNP_draft.pdf).
Proposition 3.15 If $f \sim GP(0, k)$ and $f_i = f(x_i) + \varepsilon_i$ where $(\varepsilon_i)_i \overset{iid}{\sim} \mathcal{N}(0, \sigma^2)$, then $f \mid f_1, \dots, f_p \sim GP(\tilde\mu, \tilde k)$
with
$$\tilde\mu(x) = \big(k(x, x_1), \dots, k(x, x_p)\big)\, (K_{1:p,1:p} + \sigma^2 I_p)^{-1} \begin{pmatrix} f_1 \\ \vdots \\ f_p \end{pmatrix},$$
$$\tilde k(x, y) = k(x, y) - \big(k(x, x_1), \dots, k(x, x_p)\big)\, (K_{1:p,1:p} + \sigma^2 I_p)^{-1} \begin{pmatrix} k(y, x_1) \\ \vdots \\ k(y, x_p) \end{pmatrix}.$$
Exercise 3.2.2
$$\begin{pmatrix} f_1 \\ \vdots \\ f_p \\ f(x_{p+1}) \\ \vdots \\ f(x_q) \end{pmatrix} \sim \mathcal{N}\left(0,\ \begin{pmatrix} K_{1:p,1:p} + \sigma^2 I_p & K_{1:p,\,p+1:q} \\ K_{p+1:q,\,1:p} & K_{p+1:q,\,p+1:q} \end{pmatrix}\right).$$
Then:
$$\begin{pmatrix} f(x_{p+1}) \\ \vdots \\ f(x_q) \end{pmatrix} \Bigg|\ f_1, \dots, f_p \sim \mathcal{N}\left(K_{p+1:q,\,1:p}\,(K_{1:p,1:p} + \sigma^2 I_p)^{-1}\begin{pmatrix} f_1 \\ \vdots \\ f_p \end{pmatrix},\ K_{p+1:q,\,p+1:q} - K_{p+1:q,\,1:p}\,(K_{1:p,1:p} + \sigma^2 I_p)^{-1} K_{1:p,\,p+1:q}\right).$$
$$p(f \mid x, \theta) = \int \underbrace{p(f \mid f(x_1), \dots, f(x_N))}_{\mathcal{N}((f(x_i))_i,\, \sigma^2 I)}\ \underbrace{p(f(x_1), \dots, f(x_N) \mid x, \theta)}_{\mathcal{N}(0, K)}\ df(x_1)\cdots df(x_N) = \mathcal{N}(f \mid 0, \sigma^2 I + K).$$
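A minimal numpy sketch of the posterior mean and covariance formulas above (added to these notes; the squared-exponential kernel, the 1-D inputs, the noise level and the length-scale are illustrative choices):

```python
import numpy as np

def k(a, b, ell=0.5):
    """Squared-exponential kernel k(x, y) = exp(-(x - y)^2 / (2 ell^2)) for 1-D inputs."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * ell**2))

rng = np.random.default_rng(7)
sigma = 0.1
x_train = rng.uniform(0, 5, size=8)
f_train = np.sin(x_train) + sigma * rng.normal(size=8)   # noisy observations f_i
x_test = np.linspace(0, 5, 100)

K = k(x_train, x_train)                                  # K_{1:p,1:p}
K_star = k(x_test, x_train)                              # (k(x, x_1), ..., k(x, x_p)) for each test x
A = np.linalg.inv(K + sigma**2 * np.eye(len(x_train)))   # (K_{1:p,1:p} + sigma^2 I_p)^{-1}

mu_tilde = K_star @ A @ f_train                          # posterior mean mu~(x)
k_tilde = k(x_test, x_test) - K_star @ A @ K_star.T      # posterior covariance k~(x, y)

print(mu_tilde[:5])
print(np.diag(k_tilde)[:5])                              # pointwise posterior variances
```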
Bayesian machine learning Spring 2020
Cours 4 — February 7, 2020
Lecturer: Julyan Arbel Scribe: Nicolas Pinon, Aitor Artola
4.1 Introduction
Bayesian nonparametrics: Bayesian statistics that is not parametric. Not parametric: the parameters are not finite in number; there is an unbounded/growing/infinite number of parameters.
GitHub of the course: https://github.com/jarbel/bml-course
Theorem 4.20 (conjugacy) We consider $X_1, \dots, X_n \mid P \sim P$ with the Dirichlet process prior $P \sim DP(\alpha P_0)$. The posterior of $P$ in this model is again a Dirichlet process, with updated parameters
$$\alpha \leftarrow \alpha + n, \qquad P_0 \leftarrow \frac{\alpha}{\alpha + n}\, P_0 + \frac{1}{\alpha + n} \sum_{i=1}^n \delta_{X_i}.$$
So if we have a $DP(G_0)$ parametrized by a finite measure $G_0$, we can recover its parameters as $\alpha = G_0(\Theta)$ and $P_0 = \frac{G_0}{G_0(\Theta)}$. Here $\sum_{i=1}^n \delta_{X_i}$ is the empirical measure.
Definition 4.22 (Polya urn) We consider a Polya urn scheme: we start with an urn containing $\alpha$ black balls. If we pick a black ball, we add to the urn a ball of a new colour $X_i$ drawn from $P_0$; if we pick a non-black ball, we add a ball of the same colour. This scheme corresponds to sampling from a DP:
1. $X_1 \mid P \sim P$, so
$$\mathbb{P}(X_1 \in A) = \mathbb{E}_P[\mathbb{P}(X_1 \in A \mid P)] = \mathbb{E}_P[P(A)] = P_0(A) \ \Rightarrow\ X_1 \sim P_0.$$
2. $X_2 \mid X_1 \sim \frac{\alpha}{\alpha + 1}\, P_0 + \frac{1}{\alpha + 1}\, \delta_{X_1}$.
Proposition 4.24
$$\mathbb{E}[K_n] = \sum_{i=1}^n \mathbb{E}[D_i] = \sum_{i=1}^n \frac{\alpha}{\alpha + i - 1}\ \underset{n \to +\infty}{\sim}\ \alpha \log n\ \longrightarrow\ +\infty.$$
Proposition 4.25
$$\frac{K_n}{\log n}\ \xrightarrow[n \to +\infty]{a.s.}\ \alpha.$$
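As a quick numerical illustration (added to the notes), the Polya urn predictive can be simulated directly; the values $\alpha = 5$ and $P_0 = \mathcal{N}(0,1)$ are arbitrary choices, and the simulation checks that $K_n$ grows like $\alpha \log n$ (the convergence is slow).

```python
import numpy as np

rng = np.random.default_rng(8)
alpha, n = 5.0, 100_000

X = np.empty(n)
K = 0                                     # number of distinct values ("colours") seen so far
for i in range(n):
    if rng.random() < alpha / (alpha + i):
        K += 1                            # new value drawn from P0 (here P0 = N(0, 1))
        X[i] = rng.normal()
    else:
        X[i] = X[rng.integers(i)]         # repeat a previously seen value, chosen uniformly
print(K / np.log(n), alpha)               # K_n / log(n) approaches alpha (slowly)
```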
Definition 4.28 (Stick-breaking for the DP) Let $V_i \overset{iid}{\sim} \mathrm{Beta}(1, \alpha)$ with $\alpha > 0$, set $p_1 = V_1$ and $p_i = V_i \prod_{l=1}^{i-1}(1 - V_l)$, and let $\theta_i \overset{iid}{\sim} P_0$. Then $P = \sum_{i=1}^{\infty} p_i\, \delta_{\theta_i} \sim DP(\alpha P_0)$, and $\sum_{i=1}^{\infty} p_i = 1$ almost surely.
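A truncated stick-breaking sketch (added to the notes; the truncation level and the base measure $P_0 = \mathcal{N}(0,1)$ are illustrative, and the truncation itself is an approximation of the infinite sum):

```python
import numpy as np

rng = np.random.default_rng(9)
alpha, trunc = 2.0, 1000                       # concentration and truncation level

V = rng.beta(1.0, alpha, size=trunc)           # V_i ~ Beta(1, alpha)
p = V * np.cumprod(np.concatenate(([1.0], 1 - V[:-1])))   # p_i = V_i * prod_{l < i} (1 - V_l)
theta = rng.normal(size=trunc)                 # atoms theta_i ~ P0 = N(0, 1)

print(p.sum())                                 # close to 1 for a large truncation level
draws = rng.choice(theta, size=10, p=p / p.sum())   # approximate draws from P = sum_i p_i delta_theta_i
print(draws)                                   # note the repeated values: P is discrete
```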
A clustering of $(X_1, \dots, X_n)$ induces a clustering of $(Y_1, \dots, Y_n)$; this is useful for density estimation.
Definition 4.30 (Pitman-Yor process)
$$P(X_{n+1} \in \cdot \mid X_1, \dots, X_n) = \frac{\alpha + k\sigma}{\alpha + n}\, P_0 + \frac{1}{\alpha + n} \sum_{j=1}^{k} (n_j - \sigma)\, \delta_{x_j^*},$$
where $x_1^*, \dots, x_k^*$ are the distinct values among $X_1, \dots, X_n$ and $n_j$ their multiplicities.
Bayesian machine learning Spring 2020
Cours 5 — February 14, 2020
Lecturer: Julyan Arbel Scribe: W. Jallet, A. Floyrac, C. Guillo
5.1.2 Uncertainty
Epistemic uncertainty, also known as model uncertainty, represents uncertainty over which base hypothesis (or parameter) is correct given the amount of available data.
Aleatoric uncertainty essentially, noise from the data measurements (e.g. measuring
errors in sensor data).
Thus, a Bayesian approach to deep learning considers epistemic uncertainties in a prin-
cipled way, where these uncertainties are carried over to the posterior distribution on our
parameter space.
If the penalty term is indeed a log-prior, $-R(\theta) = \log p(\theta)$, the previous regularized MLE is known as the maximum a posteriori (MAP) estimator, which can be written
$$\hat\theta_{\mathrm{MAP}} \in \operatorname*{argmax}_{\theta \in \Theta}\ \underbrace{p(\theta \mid \mathcal{D})}_{\text{actual posterior}}. \tag{5.10}$$
This is still an optimization problem, not really Bayesian inference. Indeed, MAP keeps only the maximizing mode(s) of the posterior (rather than computing a full predictive distribution), dropping all of the uncertainty it contains and thus all of the information on the predictive uncertainty.
• A Laplace prior p(θ) ∝ exp(−kθk1 ) yields `1 penalization and the so-called LASSO
estimator.
We introduce latent variables $Z \in \{0,1\}^K$ s.t. $\sum_k z_k = 1$, which represent to which mixture component a data point belongs (i.e. it belongs to the $k$-th component iff $z_k = 1$). Then, the joint likelihood of our variable $x$ and (unobserved) latent variable $z$ is $p(X, Z) = p(Z)\, p(X \mid Z)$, where:
• $p(Z) = \prod_k \pi_k^{z_k}$, i.e. $p(z_k = 1) = \pi_k$.
The likelihood is obtained as usual by marginalizing with respect to the latent variable Z:
$$p(X) = \sum_Z p(Z)\, p(X \mid Z), \tag{5.14}$$
where we sum over all possible (one-hot) Z ∈ {0, 1}K ; there are K of them due to the
constraint from above.
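For a concrete instance (assuming Gaussian components, the usual choice for $p(X \mid Z)$; the weights, means and standard deviations below are illustrative), the sum over the $K$ one-hot values of $Z$ is just a weighted sum of component densities:

```python
import numpy as np

def gmm_marginal_density(x, pi, mu, sigma):
    """p(x) = sum_k pi_k N(x | mu_k, sigma_k^2): marginalize the one-hot latent Z."""
    comp = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return np.sum(pi * comp)

pi = np.array([0.3, 0.5, 0.2])        # mixture weights p(z_k = 1) = pi_k
mu = np.array([-2.0, 0.0, 3.0])       # component means
sigma = np.array([0.5, 1.0, 0.8])     # component standard deviations

print(gmm_marginal_density(0.7, pi, mu, sigma))
```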
where $\mathcal{D} = \{X_1, \dots, X_n\}$.
This is in contrast to BMA, where the whole dataset is generated by a single model (see Minka 2002). We also have a conditional predictive distribution
$$p(y \mid x, \mathcal{D}) = \int_{\mathcal{W}} p(y, w \mid x, \mathcal{D})\, dw = \mathbb{E}_w\big[p(y \mid x, w) \mid \mathcal{D}\big]. \tag{5.16}$$
BMA: $H$ different models indexed by $h = 1, \dots, H$ (in the discrete case) with a prior probability $p(h)$. The marginal distribution of data $X$ is
$$p(X) = \sum_{h=1}^H p(X \mid h)\, p(h) \qquad\text{or}\qquad p(X) = \int_{\mathcal{H}} p(X \mid h)\, p(h)\, dh. \tag{5.17}$$
For regression, we want a conditional predictive distribution $y \mid x$. We use a Gaussian likelihood
$$p(y \mid x, w) = \mathcal{N}(y \mid f_w(x), \tau^2). \tag{5.20}$$
For data $\mathcal{D} = \{(X_i, Y_i)\}_i$, we get a full data likelihood under weights $w$
$$p(\mathcal{D} \mid w) = \prod_{i=1}^n \mathcal{N}(Y_i \mid f_w(X_i), \tau^2). \tag{5.21}$$
[Diagram: a feed-forward network with inputs $X$, a hidden layer, and output $Y$.]
Example 5.2.1 (Single hidden layer, Neal (1996)) In this setup, $L = 1$ and $H = H^{(1)}$. The equations of the NN boil down to
$$g^{(1)}(X) = g(X) = W^{(1)} X \in \mathbb{R}^H,$$
$$h^{(1)}(X) = h(X) = \phi(W^{(1)} X), \tag{5.23}$$
$$Y(X) = W^{(2)} h^{(1)}(X) = W^{(2)} \phi(W^{(1)} X).$$
How do uncertainties propagate? For all $1 \leq i \leq H$, $g_i(X)$ is a random variable and
$$g_i(X) = \sum_j W_{ij}^{(1)} X_j\ \overset{iid}{\sim}\ \mathcal{N}(0,\ \|X\|_2^2\, \sigma_H^2).$$
Thus, the hidden variables $h_i(X) = \phi(g_i(X))$ are iid and are functions of Gaussians. The output is $Y(X) = W^{(2)} h(X)$. Because the weights are iid, $W^{(2)}$ and $h_i(X)$ are independent, thus the statistics of each neuron output $Y_i$ are
$$\mathbb{E}[Y_i(X)] = \sum_{j=1}^H \mathbb{E}_{W^{(2)}}\big[W_{ij}^{(2)}\big]\, \mathbb{E}[h_j(X)] = 0$$
and
$$\mathrm{Var}(Y_{ij}) = \mathbb{E}\big[(W_{ij}^{(2)})^2\big]\ \underbrace{\mathbb{E}\big[(h_j(X))^2\big]}_{= c\ \text{(constant)}},$$
where we denote $Y_{ij} = W_{ij}^{(2)} h_j(X)$ so that $Y_i = \sum_j Y_{ij}$, and recall that the $h_j(X)$ are iid. In conclusion, we have a predictive distribution $Y_i(X)$ which is not Gaussian but has mean $0$ and variance $c H \sigma_H^2$.
We have a version of the Central Limit Theorem (CLT):
$$Y_i(X) = \sqrt{H}\,\frac{Y_i(X)}{\sqrt{H}}\ \xrightarrow[H \to +\infty]{}\ \mathcal{N}(0,\ c H \sigma_H^2). \tag{5.24}$$
We get a nondegenerate asymptotic variance provided $H \sigma_H^2$ is constant, i.e. $\sigma_H^2 = \frac{\sigma^2}{H}$.
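A small numerical check of this infinite-width behaviour (a sketch added to these notes): the tanh nonlinearity, the fixed unit variance of the first-layer weights (so that $\mathbb{E}[h_j(X)^2] = c$ does not depend on $H$), and the scaling $\sigma_H^2 = 1/H$ for the second layer are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(10)
X = np.array([0.5, -1.0, 2.0])                # a fixed input
d, n_draws = X.shape[0], 5000

for H in [10, 100, 1000]:
    sigma_H2 = 1.0 / H                        # second-layer prior variance sigma_H^2 = sigma^2 / H, sigma^2 = 1
    # first-layer weights W^(1) ~ N(0, 1): draw the pre-activations g_j(X) directly,
    # since g_j(X) = sum_i W^(1)_{ji} X_i ~ N(0, ||X||^2) iid over j
    g = rng.normal(scale=np.linalg.norm(X), size=(n_draws, H))
    h = np.tanh(g)                            # hidden units h_j(X)
    W2 = rng.normal(scale=np.sqrt(sigma_H2), size=(n_draws, H))
    Y = np.sum(W2 * h, axis=1)                # output Y(X) = sum_j W^(2)_j h_j(X), one value per prior draw
    print(H, round(Y.mean(), 3), round(Y.var(), 3))   # mean close to 0, variance stabilizes at c = E[h_j(X)^2]
```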
This result does extend to deeper networks where $L > 1$ (see a 2018 result).
We see that asymptotically, the predictive prior distribution $Y_i(x)$ of the $i$-th output is a white-noise Gaussian process. This is intuitive: we have learned nothing (the input $X$ is fixed, has no prior, and we have not constrained any observations of the $Y_i$), the weights are distributed randomly, so the predictor should contain no information.
5.2.1 Understanding the prior at the level of the units [9]
What can we say about the priors of $h^{(\ell)}(x)$, $g^{(\ell)}(x)$ at a given number of units $H^{(\ell)}$? We suppose as before that the weights' prior is $W_{ij}^{(\ell)} \overset{iid}{\sim} \mathcal{N}(0, \sigma^2)$.
We need a condition on the nonlinearity $\phi$, called the extended envelope condition:
$$\phi(x) \begin{cases} \geq c_1 + d_1 |x| & \text{on } \mathbb{R}_+ \text{ or } \mathbb{R}_- \\ \leq c_2 + d_2 |x| & \text{on } \mathbb{R}. \end{cases} \tag{5.25}$$
Theorem 5.34 (Vladimirova et al. [9] (2018)) We assume the conditions above on the priors and nonlinearity. Then, conditional on $X$, the prior distribution of $g_i^{(\ell)}(X)$ or $h_i^{(\ell)}(X)$ at layer $\ell$ is sub-Weibull with tail parameter $\ell/2$.
Figure 5.4. Impact of the number of layers on the prior distribution. Taken from [8].
Remark 5.2.1 In the above definition, the quantity 1 − F (t) is also called the survival
function.
We can also interpret these priors from a regularization point of view; the mode of the weights' posterior distribution given data $\mathcal{D} = \{(X_i, Y_i)\}_i$ is as usual the MAP estimator.
Weight decay regularization for NNs is nothing more than applying $\ell_2$ regularization on the weights (which is the same as using a Gaussian prior $p(w) \propto \exp(-\|w\|_2^2)$).
5.2.2 Subspace inference for Bayesian DL
Reference: Izmailov et al. (2019) [4]
Posterior inference is not really scalable in general, especially when the parameter space
is large, which is the case in deep learning where the space W is often high-dimensional.
The idea is to construct lower-dimensional subspaces, e.g. spanned by the first few principal components of the SGD trajectory, and then perform variational inference in that subspace: we can perform prediction using the approximate posterior predictive distribution, and the uncertainty is well-calibrated. The idea is analogous to PCA, where we reduce the feature space to a given number of principal components.
Bibliography
[1] D. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, 2017.
[2] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Sci-
ence and Statistics). Berlin, Heidelberg: Springer-Verlag, 2006. isbn: 0387310738.
[3] David M. Blei et al. "Latent Dirichlet Allocation". In: J. Mach. Learn. Res. 3 (Mar. 2003), pp. 993–1022. issn: 1532-4435.
[4] Pavel Izmailov et al. Subspace Inference for Bayesian Deep Learning. 2019. arXiv: 1907.07504 [cs.LG].
[5] Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. 2013. arXiv: 1312.6114 [stat.ML].
[6] David J.C. MacKay. "Bayesian neural networks and density networks". In: Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment 354.1 (1995). Proceedings of the Third Workshop on Neutron Scattering Data Analysis, pp. 73–80. issn: 0168-9002. doi: 10.1016/0168-9002(94)00931-7. url: http://www.sciencedirect.com/science/article/pii/0168900294009317.
[7] Kevin P. Murphy. “A Variational Approximation for Bayesian Networks with Discrete
and Continuous Latent Variables”. In: CoRR abs/1301.6724 (2013). arXiv: 1301.6724.
url: http://arxiv.org/abs/1301.6724.
[8] Mariia Vladimirova and Julyan Arbel. Sub-Weibull distributions: generalizing sub-Gaussian
and sub-Exponential properties to heavier-tailed distributions. 2019. arXiv: 1905.04955
[math.ST].
[9] Mariia Vladimirova et al. Understanding Priors in Bayesian Neural Networks at the
Unit Level. 2018. arXiv: 1810.05193 [stat.ML].