
Bayesian machine learning Spring 2020

Lecture 1 — January 17th, 2020


Lecturer: Rémi Bardenet Scribe: Pierre Delanoue, Van Nguyen Nguyen

The web page of the course: https://github.com/rbardenet/bml-course


Contact : [email protected]
Objective of the class:

• Decision theory

• Formalizing a problem in a Bayesian way

• MCMC and variational Bayes

• Bayesian nonparametrics

1.1 Regression and Decision Theory

Definition 1.1 (Linear Regression)

$$y_i = f(x_i) + \varepsilon_i \in \mathbb{R}, \qquad x_i \in \mathbb{R}^d, \quad i \in [[1, n]], \quad \varepsilon_i \sim \mathcal{N}(0, \sigma^2) \text{ i.i.d.}$$

In matrix form,
$$y = X\theta^* + \varepsilon.$$

Remark 1.1.1 Your job is to (i) come up with an estimator $\hat\theta = \hat\theta(X, y)$. Often, you also need (ii) to report some region $A = A(X, y) \subset \Theta$ with confidence level $\alpha$.

1.1.1 Fisher’s answer


(i) $\hat\theta_{MLE} = (X^TX)^{-1}X^Ty$ (assuming $X$ has full column rank).

Proposition 1.2
• (i) $\hat\theta_{MLE}$ is unbiased, i.e., $\mathbb{E}(\hat\theta_{MLE}) = \theta^*$.
• (ii) $\hat\theta_{MLE} \sim \mathcal{N}(\theta^*, \sigma^2(X^TX)^{-1})$.
• (iii) $\hat\theta_{MLE}$ has minimum variance among linear unbiased estimators ($A \succeq B$ := $A - B$ positive semidefinite).


Proof (i) $\mathbb{E}[\hat\theta_{MLE}] = (X^TX)^{-1}X^T\mathbb{E}[y] = \theta^*$.

(ii) $\hat\theta_{MLE} - \theta^* \sim \mathcal{N}(0, \sigma^2(X^TX)^{-1})$:
$$\begin{aligned}
\mathrm{Var}(\hat\theta_{MLE}) &= \mathbb{E}\big[(\hat\theta - \theta^*)(\hat\theta - \theta^*)^T\big] \\
&= \mathbb{E}\big[((X^TX)^{-1}X^Ty - \theta^*)((X^TX)^{-1}X^Ty - \theta^*)^T\big] \\
&= (X^TX)^{-1}X^T\,\mathbb{E}[yy^T]\,X(X^TX)^{-1} - 2\theta^*\theta^{*T} + \theta^*\theta^{*T} \\
&= (X^TX)^{-1}X^T\,\mathbb{E}\big[(X\theta^* + \varepsilon)(X\theta^* + \varepsilon)^T\big]\,X(X^TX)^{-1} - 2\theta^*\theta^{*T} + \theta^*\theta^{*T} \\
&= (X^TX)^{-1}X^T\big(X\theta^*\theta^{*T}X^T + \sigma^2 I\big)X(X^TX)^{-1} - 2\theta^*\theta^{*T} + \theta^*\theta^{*T} \\
&= \theta^*\theta^{*T} + \sigma^2(X^TX)^{-1} - 2\theta^*\theta^{*T} + \theta^*\theta^{*T} = \sigma^2(X^TX)^{-1}.
\end{aligned}$$

Definition 1.3 (Confidence region)
$$A_\alpha := \Big\{\theta \in \mathbb{R}^d : (\theta - \hat\theta_{MLE})^T\,\frac{X^TX}{\sigma^2}\,(\theta - \hat\theta_{MLE}) \le \alpha\Big\}.$$
We can choose $\alpha$ to guarantee coverage: $\mathbb{P}\big(\theta^* \in A_\alpha(X, y)\big) \ge 95\%$.
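These facts are easy to check numerically. Below is a minimal NumPy sketch (simulated data with assumed sizes $n, d$ and noise level $\sigma$; none of these numbers come from the notes) that computes $\hat\theta_{MLE}$ and verifies empirically that its sampling covariance matches $\sigma^2(X^TX)^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 200, 3, 0.5          # assumed sizes and noise level
X = rng.normal(size=(n, d))        # fixed design matrix
theta_star = np.array([1.0, -2.0, 0.5])

def theta_mle(X, y):
    """Ordinary least squares / Gaussian MLE: (X^T X)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Sampling distribution of the MLE over many replications of the noise
estimates = []
for _ in range(5000):
    y = X @ theta_star + sigma * rng.normal(size=n)
    estimates.append(theta_mle(X, y))
estimates = np.array(estimates)

empirical_cov = np.cov(estimates.T)
theoretical_cov = sigma**2 * np.linalg.inv(X.T @ X)
print(np.round(empirical_cov, 4))
print(np.round(theoretical_cov, 4))   # should be close to the empirical covariance
```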

1.1.2 Wald’s answer


Principle: an estimator or a confidence region is a data-driven decision of the form
$$a : \text{data} \longmapsto \hat\theta \ \text{ or } \ A_\alpha.$$
Let us pick a loss function $L(\theta, a(X, y))$. We would like to choose $a \in \arg\min_a \mathbb{E}_y\, L(\theta^*, a(X, y))$.

Example 1.1.1 In regression with the squared loss, this boils down to
$$\hat\theta \in \arg\min_{\hat\theta} \mathbb{E}_y\big[\|\theta^* - \hat\theta\|^2\big] \;(=: MSE(\hat\theta)).$$

Definition 1.4 (admissible estimator) $\hat\theta$ is said to be admissible if there is no $\tilde\theta = \tilde\theta(X, y)$ such that
$$\forall\theta,\ \mathbb{E}[L(\theta, \tilde\theta)] \le \mathbb{E}[L(\theta, \hat\theta)] \quad\text{and}\quad \exists\,\theta_0 \text{ s.t. } \mathbb{E}[L(\theta_0, \tilde\theta)] < \mathbb{E}[L(\theta_0, \hat\theta)].$$

Theorem 1.5 (Corollary of a result by James & Stein) $\hat\theta_{MLE}$ is not admissible for linear regression.


Exercise: Prove that there exists $\hat\theta$ such that $\forall\,\theta^* \in B(0, \rho)$, $\mathbb{E}_y[\|\hat\theta - \theta^*\|^2] < \mathbb{E}_y[\|\hat\theta_{MLE} - \theta^*\|^2]$.
Let us define the ridge estimator
$$\hat\theta_R(\lambda) = \arg\min_\theta \|y - X\theta\|^2 + \lambda\|\theta\|_2^2, \quad \lambda > 0, \qquad\text{i.e.}\qquad \hat\theta_R(\lambda) = (X^TX + \lambda I)^{-1}X^Ty.$$

Compute $MSE(\hat\theta_R(\lambda))$ for $X^TX = I$.

Lemma 1.6 (bias–variance decomposition)
$$\mathbb{E}[\|\hat\theta - \theta^*\|^2] = \mathrm{Tr}\big(\mathrm{Var}(\hat\theta)\big) + \|\mathbb{E}[\hat\theta] - \theta^*\|^2.$$

For $X^TX = I$,
$$\hat\theta_R(\lambda) = \frac{1}{1+\lambda}\,\hat\theta_{MLE} \quad\text{(shrinkage)},$$
$$\mathbb{E}[\hat\theta_R(\lambda)] = \frac{1}{1+\lambda}\,\theta^*, \qquad \mathrm{Var}[\hat\theta_R(\lambda)] = \frac{1}{(1+\lambda)^2}\,\mathrm{Var}[\hat\theta_{MLE}] = \frac{\sigma^2}{(1+\lambda)^2}\,I,$$
so that
$$MSE(\hat\theta_R(\lambda)) = \frac{d\sigma^2}{(1+\lambda)^2} + \Big(1 - \frac{1}{1+\lambda}\Big)^2\|\theta^*\|^2
\le \frac{d\sigma^2}{(1+\lambda)^2} + \Big(1 - \frac{1}{1+\lambda}\Big)^2\rho^2
= \frac{1}{(1+\lambda)^2}\big[d\sigma^2 + \lambda^2\rho^2\big] =: f(\lambda).$$

Differentiating,
$$\partial_\lambda f\big|_{\lambda=0} \propto \Big[2\lambda\rho^2(1+\lambda)^2 - (\lambda^2\rho^2 + d\sigma^2)\,2(1+\lambda)\Big]_{\lambda=0} = -2d\sigma^2 < 0,$$
hence
$$\exists\,\lambda_0 \text{ s.t. } \forall\,\lambda \in (0, \lambda_0),\ f(\lambda) < f(0) = MSE(\hat\theta_{MLE}).$$

• For the general case, check Wieringen's lecture notes on linear regression. A quick numerical check of the $X^TX = I$ computation above is sketched below.
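The following sketch (assumed dimension, noise level and penalty, chosen only for illustration) uses an orthonormal design so that $X^TX = I$ and compares the Monte Carlo MSE of $\hat\theta_{MLE}$ and $\hat\theta_R(\lambda)$; for small $\lambda$ the ridge MSE drops below $d\sigma^2$, as the derivation predicts.

```python
import numpy as np

rng = np.random.default_rng(1)
d, sigma, lam = 5, 1.0, 0.2                      # assumed dimension, noise level, penalty
theta_star = rng.normal(size=d)
theta_star *= 2.0 / np.linalg.norm(theta_star)   # place theta* in a ball B(0, rho=2)

# Orthonormal design: X^T X = I, so theta_MLE = X^T y and ridge is pure shrinkage
X, _ = np.linalg.qr(rng.normal(size=(50, d)))

mse_mle, mse_ridge, n_rep = 0.0, 0.0, 20000
for _ in range(n_rep):
    y = X @ theta_star + sigma * rng.normal(size=X.shape[0])
    theta_mle = X.T @ y                          # (X^T X)^{-1} X^T y with X^T X = I
    theta_ridge = theta_mle / (1.0 + lam)        # shrinkage form of the ridge estimator
    mse_mle += np.sum((theta_mle - theta_star) ** 2) / n_rep
    mse_ridge += np.sum((theta_ridge - theta_star) ** 2) / n_rep

print(mse_mle)      # ~ d * sigma^2
print(mse_ridge)    # smaller for small enough lambda
```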
Definition 1.7 (minimax estimator) $\hat\theta_{minimax} \in \arg\min_{\hat\theta}\, \sup_{\theta}\, \mathbb{E}_y\big[L(\theta, \hat\theta)\big]$.

Definition 1.8 (Bayes estimator) $\hat\theta_B \in \arg\min_{\hat\theta}\, \mathbb{E}_{\theta, y}\big[L(\theta, \hat\theta)\big]$.


b

Theorem 1.9 (Berger '85) Under topological assumptions (assume $\theta, \hat\theta$ range over closed bounded subsets of $\Theta$) and continuity assumptions on $L$:
• any estimator is dominated by a Bayesian estimator;
• in linear regression, Bayes ⇒ admissible.


1.1.3 Bayesian decisions


Definition 1.10 When confronted with picking an action $a$ depending on a state of the world $s \in \mathcal{S}$, a Bayesian picks 1) a distribution $p \in M_1(\mathcal{S}, \Sigma)$ and 2) a loss function $L$, and chooses
$$a_B \in \arg\min_a \int L(a, s)\,dp(s).$$

Example 1.1.2 (estimation)
• $\mathcal{S} = \mathcal{Y}^n \times \Theta$, where $\mathcal{Y} \subset \mathbb{R}^{d_y}$ and $\Theta \subset \mathbb{R}^{d_\Theta}$;
• $L(\hat\theta, s) = L(\hat\theta, \theta) \overset{\text{choice}}{=} \|\theta - \hat\theta\|^\alpha$;
• $p(s) = p(y_{1,\dots,n}, \theta) = p(y_{1,\dots,n}\mid\theta)\,p(\theta)$, where $p(y_{1,\dots,n}\mid\theta)$ is the likelihood (choice) and $p(\theta)$ is the prior (choice).
R
Exercise: We take $L(\theta, \hat\theta) = \|\theta - \hat\theta\|^2$ and we will show that $\hat\theta_B = \int \theta\, p(\theta\mid y_{1,\dots,n})\,d\theta$.
$$\hat\theta_B \in \arg\min_{\hat\theta} \int \|\theta - \hat\theta(y_{1,\dots,n})\|^2\, p(y_{1,\dots,n}, \theta)\,dy_{1,\dots,n}\,d\theta
= \arg\min_{\hat\theta} \int \|\theta - \hat\theta(y_{1,\dots,n})\|^2\, p(y_{1,\dots,n}\mid\theta)\,p(\theta)\,d\theta\,dy_{1,\dots,n}.$$

Remark 1.1.2 Writing $p(y_{1,\dots,n}\mid\theta)\,p(\theta) = p(\theta\mid y_{1,\dots,n})\,p(y_{1,\dots,n})$, the minimization over the function $\hat\theta(\cdot)$ can be carried out pointwise in $y_{1,\dots,n}$: it suffices to minimize
$$\int \|\theta - \hat\theta\|^2\, p(\theta\mid y_{1,\dots,n})\,d\theta$$
for each fixed $y_{1,\dots,n}$. But
$$\int \|\theta - \hat\theta\|^2\, p(\theta\mid y_{1,\dots,n})\,d\theta
= \int \|\theta - \hat\theta_{MEP} + \hat\theta_{MEP} - \hat\theta\|^2\, p(\theta\mid y_{1,\dots,n})\,d\theta
= \int \|\theta - \hat\theta_{MEP}\|^2\, p(\theta\mid y_{1,\dots,n})\,d\theta + \|\hat\theta_{MEP} - \hat\theta\|^2,$$
where the posterior mean estimator is $\hat\theta_{MEP} = \int \theta\, p(\theta\mid y_{1,\dots,n})\,d\theta$ (the cross term vanishes). The minimum is therefore attained at $\hat\theta = \hat\theta_{MEP}$.


Exercise: Take the likelihood $p(y_{1,\dots,n}\mid\theta, X) = \mathcal{N}(y_{1,\dots,n}\mid X\theta, \sigma^2 I)$ and the Gaussian prior $p(\theta) = \mathcal{N}(\theta\mid 0, \sigma^2 I)$, and compute the posterior:
$$\log p(\theta\mid y_{1,\dots,n}) = \log p(y_{1,\dots,n}\mid\theta) + \log p(\theta) - \log p(y_{1,\dots,n})$$
$$\begin{aligned}
&= -\frac{\|X\theta - y\|^2}{2\sigma^2} - \frac{\|\theta\|^2}{2\sigma^2} + \dots \\
&= -\frac{\langle X\theta - y, X\theta - y\rangle}{2\sigma^2} - \frac{\langle\theta, \theta\rangle}{2\sigma^2} + \dots \\
&= -\frac{\theta^TX^TX\theta - 2\theta^TX^Ty}{2\sigma^2} - \frac{\theta^T\theta}{2\sigma^2} + \dots \\
&= \frac{\theta^TX^Ty}{\sigma^2} - \theta^T\Big(\frac{X^TX}{2\sigma^2} + \frac{I}{2\sigma^2}\Big)\theta + \dots
\qquad\Big(\text{set } \frac{X^TX}{2\sigma^2} + \frac{I}{2\sigma^2} = \frac{\Sigma^{-1}}{2}\Big) \\
&= -\frac{1}{2}\big(\theta - \sigma^{-2}\Sigma X^Ty\big)^T\Sigma^{-1}\big(\theta - \sigma^{-2}\Sigma X^Ty\big) + \dots
\end{aligned}$$
We have $p(\theta\mid y_{1,\dots,n}) = \mathcal{N}(\theta\mid \sigma^{-2}\Sigma X^Ty, \Sigma)$ with $\Sigma = \sigma^2(X^TX + I)^{-1}$.

In particular, $\hat\theta_B = \sigma^{-2}\Sigma X^Ty = (X^TX + I)^{-1}X^Ty$, i.e. a ridge estimator (here with $\lambda = 1$, since the prior variance equals the noise variance).
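This conjugate computation is two lines of linear algebra in code. The sketch below (simulated data with assumed sizes and noise level, and the same prior variance $\sigma^2$ as above) computes the posterior mean and covariance and checks that the posterior mean coincides with ridge regression with $\lambda = 1$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, sigma = 100, 4, 0.3
X = rng.normal(size=(n, d))
theta_star = rng.normal(size=d)
y = X @ theta_star + sigma * rng.normal(size=n)

# Posterior under likelihood N(y | X theta, sigma^2 I) and prior N(0, sigma^2 I):
#   Sigma = sigma^2 (X^T X + I)^{-1},  mean = sigma^{-2} Sigma X^T y
Sigma = sigma**2 * np.linalg.inv(X.T @ X + np.eye(d))
theta_B = Sigma @ (X.T @ y) / sigma**2

# The posterior mean is exactly the ridge estimator with lambda = 1
theta_ridge = np.linalg.solve(X.T @ X + np.eye(d), X.T @ y)
assert np.allclose(theta_B, theta_ridge)
print(theta_B, np.diag(Sigma))   # point estimate and posterior marginal variances
```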

Remark 1.1.3 Let $p(\theta) \propto e^{-\lambda\|\theta\|_1}$ (Laplace prior). Then
$$\log p(\theta\mid y_{1,\dots,n}) = -\frac{\|y - X\theta\|^2}{2\sigma^2} - \lambda\|\theta\|_1 + \dots$$
and $\hat\theta_{LASSO} \in \arg\max_\theta \log p(\theta\mid y_{1,\dots,n})$.

Warning: $\hat\theta_B = \int \theta\, p(\theta\mid y_{1,\dots,n})\,d\theta$ is not sparse (Park and Casella).

Example 1.1.3 (credible regions) $\mathcal{S} = \mathcal{Y}^n \times \Theta$.

For $A \subset \Theta$, take $L(A, s) = \mathbf{1}_{\theta\notin A} + \gamma\,\mathrm{diam}(A)$ where $\gamma$ is a choice, and $p(s) = p(y_{1,\dots,n}, \theta)$. Then
$$\hat A_B \in \arg\min_A \int \mathbf{1}_{\theta\notin A}\, p(\theta\mid y_{1,\dots,n})\,d\theta + \gamma\,\mathrm{diam}(A) \quad\text{(a credible region)}.$$

For example, in ridge regression,
$$p(\theta\mid y_{1,\dots,n}) = \mathcal{N}(\theta\mid \hat\theta_R(\lambda), \Sigma), \qquad
A_\alpha = \big\{\theta : (\theta - \hat\theta_R(\lambda))^T\Sigma^{-1}(\theta - \hat\theta_R(\lambda)) \le \alpha\big\}.$$

Lasso: see Hastie, Tibshirani and Wainwright.

Example 1.1.4 (Horseshoe prior)
$$\mathcal{S} = \mathcal{Y}^n \times \Theta \times \mathbb{R}_+^{d_\Theta} \times \mathbb{R}_+,$$
$$\theta_j \sim \mathcal{N}(0, \tau^2\lambda_j^2), \quad j = 1, \dots, d, \qquad
\lambda_j \sim \mathcal{C}^+(1) \propto \frac{1}{1+\lambda^2}\mathbf{1}_{\lambda>0}, \qquad
\tau \sim \mathcal{C}^+(1).$$
Classification. Let $\mathcal{S} = \mathcal{X}^n \times \mathcal{Y}^n \times \Theta \times \mathcal{X} \times \mathcal{Y}$ and $L(\hat y, s) = \mathbf{1}_{y\neq\hat y}\big(\alpha\mathbf{1}_{y=1} + \beta\mathbf{1}_{y=0}\big)$. Then
$$\hat y_B \in \arg\min_{\hat y} \int \mathbf{1}_{y\neq\hat y}\big(\alpha\mathbf{1}_{y=1} + \beta\mathbf{1}_{y=0}\big)\,dp(x_{1,\dots,n}, y_{1,\dots,n}, \theta, x, y),$$
with
$$p(s) = p(y\mid x, \theta, x_{1,\dots,n}, y_{1,\dots,n})\,p(x, \theta, x_{1,\dots,n}, y_{1,\dots,n}), \qquad
p(x, \theta, x_{1,\dots,n}, y_{1,\dots,n}) = p(\theta\mid x, x_{1,\dots,n}, y_{1,\dots,n})\,p(x, x_{1,\dots,n}, y_{1,\dots,n}).$$

Example 1.1.5 (logistic regression) Take $p(y = +1\mid x, \theta) = \sigma(x^T\theta)$. Then
$$\hat y_B \in \arg\min_{\hat y} \int L(y, \hat y)\,p(y\mid x, \theta)\,p(y_{1,\dots,n}\mid\theta, x_{1,\dots,n})\,p(\theta)\,d\theta\,dx\,dy_{1,\dots,n}.$$
The quantity $f(y) \propto \int p(y\mid x, \theta)\,p(y_{1,\dots,n}\mid x_{1,\dots,n}, \theta)\,p(\theta)\,d\theta$ is called the posterior predictive. Therefore
$$\hat y_B \in \arg\min_{\hat y} \int \mathbf{1}_{y\neq\hat y}\big[\alpha\mathbf{1}_{y=1} + \beta\mathbf{1}_{y=0}\big]f(y)\,dx_{1,\dots,n}\,dy_{1,\dots,n}\,dx\,dy
= \arg\min_{\hat y}\ \alpha f(1)\mathbf{1}_{\hat y=0} + \beta f(0)\mathbf{1}_{\hat y=1},$$
hence
$$\hat y_B = 1 \iff \beta f(0) \le \alpha f(1).$$
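A minimal sketch of this asymmetric decision rule (hypothetical predictive values and costs, chosen for illustration), showing that the Bayes action only needs the posterior predictive $f$ and the two costs $\alpha, \beta$:

```python
def bayes_decision(f1: float, f0: float, alpha: float, beta: float) -> int:
    """Bayes-optimal label under loss 1_{y != yhat} (alpha 1_{y=1} + beta 1_{y=0}).

    f1, f0: posterior predictive probabilities of y=1 and y=0.
    Predicting 0 costs alpha*f1 in expectation, predicting 1 costs beta*f0.
    """
    return 1 if beta * f0 <= alpha * f1 else 0

# Example: missing a y=1 case is 5 times more costly than a false alarm
print(bayes_decision(f1=0.3, f0=0.7, alpha=5.0, beta=1.0))  # -> 1 despite f1 < f0
```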

1.1.4 The likelihood principle


• Reference: Berger and Wolpert '88.

• Bayesian decisions are robust to optional stopping. Let $\mathcal{S} = (\cup_{n\ge 0}\mathcal{Y}^n) \times \Theta$ and let $N$ denote the stopping time. Then
$$\mathbb{E}\,L(\hat\theta, \theta) = \sum_{n\ge 0}\mathbb{E}\big[L(\theta, \hat\theta)\,\mathbf{1}_{N=n}\big]
= \sum_{n\ge 0}\int L(\theta, \hat\theta)\Big[\mathbf{1}_{y_{1,\dots,n}\in\{N=n\}}\prod_{i=1}^{n-1}\mathbf{1}_{y_{1,\dots,i}\notin\{N=i\}}\Big]\,p(y_{1,\dots,n}\mid\theta)\,p(\theta)\,d\theta\,dy_{1,\dots,n}.$$

Bayesian Machine Learning Spring 2020
Lecture 2 — January 24th, 2020
Lecturer: Rémi Bardenet Scribe: W. Jallet, S. Jerad

2.1 A bit of objective Bayes


We suppose we are still in the Bayesian linear regression setting.
Recall the decision function

$$\hat a_B = \operatorname*{argmin}_a\, \mathbb{E}_{Y,\theta}\, L(a, \theta) \tag{2.1}$$
For instance,
$$\hat a_B(X, y, x) = \mathbf{1}\Big\{\tfrac{\beta\, p(y=1\mid X,y,x)}{\alpha\, p(y=0\mid X,y,x)} \ge 1\Big\},
\qquad
\hat a_B(y) = \operatorname*{argmin}_{\text{interval } I}\, \mathbb{E}_{Y,\theta}\big[\mathbf{1}_{\theta\notin I} + \gamma\mu(I)\big]. \tag{2.2}$$

Theorem 2.11 (Bernstein–von Mises (van der Vaart 2000)) We assume that the prior $p(\theta)$ puts "enough mass" around $\theta^* \in \mathring\Theta \subseteq \mathbb{R}^d$. Then for all $\varepsilon > 0$,
$$\mathbb{P}_{p(\cdot\mid\theta^*)}\Big(\sup_{B\subset\Theta}\big|P_{\theta\mid Y,X}(B) - P_{\mathcal{N}(\theta^*,\,\sigma^2(X^TX)^{-1}/N)}(B)\big| \ge \varepsilon\Big) \to 0. \tag{2.3}$$

This result is also called the "Bayesian central limit theorem".

Picking a prior

• find a prior that encodes physical constraints of your problem

• find a prior that comes from symmetries of your problem, e.g. the Jeffreys prior

• try several priors and make sure that âB does not change too much

2.2 More decision problems from ML


Exercise. Frame PCA as a Bayesian decision problem.
Regular PCA: given data $x = (x_1, \dots, x_N) \in \mathbb{R}^{d\times N}$, define
$$\hat\Sigma = \frac{1}{N}\sum_{i=1}^N (x_i - \bar x)(x_i - \bar x)^T = U\Lambda U^T.$$


Then we obtain the normalized PCA vectors as $\hat x_i = \Lambda_{:q}^{-1/2}U_{:q}^T x_i$ (whitened PCA), where the subscript $:q$ indicates we only take the first $q$ components.
For the Bayesian formulation, take a latent variable $x \sim \mathcal{N}(0, I)$ and data $y \sim \mathcal{N}(\mu, WW^T + \sigma^2 I)$. The joint distribution is
$$p(y, x, \mu, \sigma, W) \propto p(y\mid x, \mu, \sigma, W, q)\,p(x)\,p(\mu)\,p(\sigma)\,p(W).$$
Now we choose a prior for the weights $W$. Some suggestions:

1. $p(W) \propto p(W\mid q)\,p(q)$, for instance $p(W\mid q) = \prod_{j=1}^q e^{-\lambda\|w_j\|}$ and a conjugate prior $q \sim \mathcal{P}(\lambda)$ for some hyperparameter $\lambda$.

2. an alternative is $p(W) \propto p(W\mid v)\,p(v) = \prod_{j=1}^{d-1} e^{-\|w_j\|^2/(2v_j^2)}\,p(v)$ with a prior $p(v)$, which can be for instance a Laplace distribution to enforce sparsity of the weights, or a horseshoe distribution.

Now a question is: how do you recover the MLE?
Theorem 2.12 (Bishop, Tipping, 1997) It holds that
$$\hat W_{MLE} = U_{:q}(\Lambda_{:q} - \sigma^2 I)^{1/2}. \tag{2.4}$$
Then the PCA vectors are given (via the pseudo-inverse) by
$$\hat x = \hat W_{MLE}^\dagger(y - \bar y) = (\Lambda_{:q} - \sigma^2 I)^{-1/2}U_{:q}^T(y - \bar y) \xrightarrow[\sigma\to 0]{} \Lambda_{:q}^{-1/2}U_{:q}^T(y - \bar y). \tag{2.5}$$
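A small NumPy sketch of this recipe (synthetic data; the noise level $\sigma^2$ is estimated as the mean of the discarded eigenvalues, a standard choice in probabilistic PCA but an assumption here): it builds $\hat W_{MLE}$ from the eigendecomposition of the sample covariance and recovers the projections of equation (2.5).

```python
import numpy as np

rng = np.random.default_rng(3)
d, N, q = 5, 500, 2
W_true = rng.normal(size=(d, q))
Y = W_true @ rng.normal(size=(q, N)) + 0.1 * rng.normal(size=(d, N))   # data, d x N

Ybar = Y.mean(axis=1, keepdims=True)
S = (Y - Ybar) @ (Y - Ybar).T / N                  # sample covariance
lam, U = np.linalg.eigh(S)                         # ascending eigenvalues
lam, U = lam[::-1], U[:, ::-1]                     # sort descending

sigma2 = lam[q:].mean()                            # assumed noise-variance estimate
W_mle = U[:, :q] @ np.diag(np.sqrt(lam[:q] - sigma2))   # equation (2.4)

# Projections of equation (2.5), via the pseudo-inverse of W_mle
X_hat = np.linalg.pinv(W_mle) @ (Y - Ybar)
print(X_hat.shape)                    # (q, N)
print(np.round(np.cov(X_hat), 2))     # approximately the identity when sigma^2 is small
```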

Exercise. How would you formalize clustering as a Bayesian decision problem?

Example 2.2.1 (Latent Dirichlet Allocation (Blei et al. [3])) Let $q_{d\ell} \in \{1, \dots, T\}$ be the topic of a word $\ell \in \{1, \dots, L_d\}$ inside document $d \in \{1, \dots, D\}$. See Figure 2.1.

Figure 2.1. Graphical model for LDA (nodes: $\alpha$, $\Pi_d \in \Delta_T$, $q_{d\ell}$, $y_{d\ell}$, $\beta$, $B$).

We want to prove that $\Pi_d \sim \mathcal{D}(\alpha)$ (Dirichlet distribution) is conjugate to $q_{d\ell} \sim \mathrm{Cat}(\Pi_d)$. We have
$$p(\Pi_d\mid q_{d\cdot}, \alpha) \propto \Bigg(\prod_{\ell=1}^{L_d}\prod_{t=1}^T \Pi_{dt}^{\mathbf{1}_{q_{d\ell}=t}}\Bigg)\prod_{t=1}^T \Pi_{dt}^{\alpha-1}\,\mathbf{1}_{\Pi_d\in\Delta_T}, \tag{2.6}$$
which is again a Dirichlet density.

With the misclassification error
$$L(\hat q, q) = \mathbf{1}_{q\neq\hat q},$$
we then have (exercise)
$$\hat q_{d\ell} = \operatorname*{argmax}_t \int p(q_{d\ell} = t\mid\Pi_d, y_{d\ell}, B, \beta)\,p(\Pi_d, y, B\mid\alpha, \beta)\,d\Pi_d\,dB. \tag{2.7}$$

2.3 Subjective Bayes


We denote by $\mathcal{S}$ the states of the world, $\mathcal{Z}$ the space of consequences, and $\mathcal{A} = \mathcal{F}(\mathcal{S}, \mathcal{Z})$ the set of functions from $\mathcal{S}$ to $\mathcal{Z}$.

Theorem 2.13 (Savage) Let $\prec$ be a preference relation over $\mathcal{A}$ that is complete and transitive. Then the following statements are equivalent:

• $\prec$ satisfies a few more intuitive postulates ("internal coherence");

• there exist a unique function $L$ on $\mathcal{A}\times\mathcal{S}$ and a probability distribution $\pi$ on $\mathcal{S}$ such that
$$a \prec a' \iff \int L(a, s)\,d\pi(s) \le \int L(a', s)\,d\pi(s).$$

This is the idea of rationality, e.g. from neoclassical economics. The loss L is bounded, and
π is finitely additive. The prior is coupled to the loss. We may act before having any data,
but as data comes in our actions will become more appropriate.

2.3.1 Computational aspects


Exercise (Metropolis–Hastings) We denote by $\alpha(x, y)$ the acceptance probability of the MH algorithm, $\alpha(x, y) = \min\Big(1, \frac{\pi(y)q(x\mid y)}{\pi(x)q(y\mid x)}\Big)$.

1. Show that the transition kernel is $p(x, y) = \alpha(x, y)q(y\mid x) + \delta_x(y)\Big(1 - \int\alpha(x, z)q(z\mid x)\,dz\Big)$.

2. Show that $\int \pi(x)\,p(x, y)\,dx = \pi(y)$.
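A minimal Metropolis–Hastings sketch (the target here is a standard Gaussian and the proposal a Gaussian random walk, both chosen only for illustration) that implements exactly the acceptance probability above; the invariance property of item 2 is what makes the empirical moments converge.

```python
import numpy as np

def metropolis_hastings(log_pi, x0, n_iter=10000, step=1.0, rng=None):
    """Random-walk MH targeting pi; q(y|x) = N(y | x, step^2) is symmetric,
    so the acceptance probability reduces to min(1, pi(y)/pi(x))."""
    rng = rng or np.random.default_rng(0)
    x, chain = x0, []
    log_px = log_pi(x)
    for _ in range(n_iter):
        y = x + step * rng.normal()
        log_py = log_pi(y)
        if np.log(rng.uniform()) < log_py - log_px:   # accept w.p. min(1, pi(y)/pi(x))
            x, log_px = y, log_py
        chain.append(x)
    return np.array(chain)

# Target: standard Gaussian (known only up to a constant)
chain = metropolis_hastings(lambda x: -0.5 * x**2, x0=5.0)
print(chain[1000:].mean(), chain[1000:].var())   # ~0 and ~1 after burn-in
```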

Exercise (Gibbs sampler) The Gibbs sampler is useful when the conditional distributions of the variables (given one another) are known.

1. Given $x = (x_1, x_2)$, take the proposal
$$q(y\mid x) = \frac{1}{2}\pi(y_1\mid x_2)\pi(y_2\mid y_1) + \frac{1}{2}\pi(y_2\mid x_1)\pi(y_1\mid y_2).$$
Show that $\alpha(x, y) = 1$.

2. Derive all of the conditionals in Latent Dirichlet Allocation (LDA).

Check out this website for interactive visualisations of MCMC algorithms.

Bayesian Machine Learning Spring 2020
Lecture 3 — January 31st, 2020
Lecturer: Rémi Bardenet Scribe: Antoine Barrier

Remark 3.0.1 Computation: exact / MCMC / VB.

3.1 Variational Bayes


Remember that we often have to compute
$$\int L\big(a, (\theta, z_{1:N}, s)\big)\,p(\theta, z_{1:N}\mid y_{1:N})\,d\theta\,dz_{1:N}.$$
The key quantity we want to determine is $p(\theta, z_{1:N}\mid y_{1:N})$.

Example 3.1.1 (LDA)
• the number of latent variables is $\Omega(\sum_i L_i)$;
• they are discrete.

VB objective Find
$$q \in \operatorname*{argmin}_{q\in\mathcal{Q}}\, KL\big(q, p((\theta, z)\mid y)\big) \tag{VB}$$
where $\mathcal{Q}$ is a set of probability distributions over $(\theta, z_{1:N})$ and $KL(p, q) = \int p\log(p/q)$.

1. We choose $\mathcal{Q}$ so that (VB) is easy.

Example 3.1.2 (Mean-field approximation) We assume all variables are independent under every probability in $\mathcal{Q}$; in other words, if $q \in \mathcal{Q}$:
$$q(\theta, z_{1:N}) = \prod_{d=1}^{d_\theta} q_d^{\eta_d}(\theta_d)\prod_{i=1}^{N}\prod_{j=1}^{d_z} q_{ij}^{\eta_{ij}}(z_{ij}).$$

Remark 3.1.1
• In mean field, coordinatewise optimization is tractable and cheap.
• Check out [7] for LDA (exercise).

2. Variational autoencoding Bayes: see [5].


Figure 3.2. Optimization process: pick $x_1 \in \mathcal{X}$, observe $f_1$, then pick $x_2$, observe $f_2$.

Figure 3.3. Graphical model: $f_1 \sim \mathcal{N}(f(x_1), \sigma^2)$, $f_2 \sim \mathcal{N}(f(x_2), \sigma^2)$.

3.2 Bayesian optimization


We only consider a two-stage optimization problem here (see Figure 3.2).
Let $(x_1, x_2) \in \mathcal{A} = \mathcal{X}\times\mathcal{X}$, $\mathcal{S} = \mathbb{R}\times\mathbb{R}\times\mathbb{R}^{\mathcal{X}}$, and (see Figure 3.3)
$$p(s) = p(f_1, f_2, f) \propto p(f_2\mid f)\,p(f_1\mid f)\,p(f).$$
We need a prior distribution for $f$. We consider the loss function
$$L(a_{x_1,x_2}, s) = f_2 - f^* \quad\text{where } f^* = \min_{\mathcal{X}} f.$$

Remark 3.2.1 Other common loss functions are
$$L(a_{x_1,x_2}, s) = \min(f_1 - f^*, f_2 - f^*), \qquad L(a_{x_1,x_2}, s) = \sum_{i=1}^2 f_i - f^*.$$

Our Bayesian action is
$$\hat a_B \in \operatorname*{argmin}_{x_1,x_2} \int [f_2 - f^*]\,p(f, f_1, f_2)\,df\,df_1\,df_2
= \operatorname*{argmin}_{x_1} \int p(f_1)\,df_1\,\Big[\operatorname*{argmin}_{x_2=S(x_1,f_1)} \int [f_2 - f^*]\,p(f_2, f\mid f_1)\,df\,df_2\Big].$$


We have
$$p(f_1) \propto \int \underbrace{p(f_1\mid f)}_{\mathcal{N}(f(x_1),\,\sigma^2)}\,\underbrace{p(f)}_{???}\,df
\qquad\text{and}\qquad
p(f_2, f\mid f_1) = \underbrace{p(f_2\mid f)}_{\mathcal{N}(f(x_2),\,\sigma^2)}\,\underbrace{p(f\mid f_1)}_{???}.$$

Remark 3.2.2
1. We need to specify a prior over functions $p(f)$ such that $p(f\mid f_1)$ is tractable.
2. Dynamic programming is usually intractable → approximate DP. See [1].

Greedy solution: sequential Bayesian optimization Consider the following algorithm:

Algorithm 1:
Input: $(x_1, f_1), \dots, (x_N, f_N)$
for $t \in [[N+1, T]]$ do
$$x_t = \operatorname*{argmax}_{x}\ \underbrace{\int \Big(\min_{1\le j\le t-1} f_j - f_t\Big)_+\,p(f_t\mid f_{1:t-1})\,df_t}_{\text{expected improvement}}, \qquad p(f_t\mid f_{1:t-1}) \propto \int p(f_t\mid f)\,p(f\mid f_{1:t-1})\,df$$
end
Gaussian processes

Definition 3.14 $f$ is said to follow a Gaussian process $GP(\mu: \mathcal{X}\to\mathbb{R},\ k: \mathcal{X}\times\mathcal{X}\to\mathbb{R})$ if
$$\forall p \ge 1,\ \forall x_{1:p} \in \mathcal{X},\quad (f(x_1), \dots, f(x_p)) \sim \mathcal{N}(\mu_{1:p}, K_{1:p,1:p})$$
where $\mu_i = \mu(x_i)$ and $K_{ij} = k(x_i, x_j)$.

Exercise 3.2.1 If $k$ is a Mercer kernel, $k(x, y) = \sum_{i=1}^\infty \lambda_i e_i(x)e_i(y)$ pointwise with $\int k(x, x)\,dx < +\infty$, and $(z_i)_i \overset{iid}{\sim} \mathcal{N}(0, 1)$, show that $f(x) = \sum_{i\ge 1}\sqrt{\lambda_i}\,z_i e_i(x)$ satisfies the definition.
Hint: start with two variables:
$$\mathrm{Cov}(f_1, f_2) = \mathbb{E}[f_1 f_2] = \sum_{i\ge 1}\sqrt{\lambda_i}\sqrt{\lambda_i}\,\underbrace{\mathbb{E}[z_i^2]}_{=1}\,e_i(x_1)e_i(x_2) = k(x_1, x_2).$$

For $k(x, y) = e^{-\|x-y\|^2/(2\lambda^2)}$, samples are in $C^\infty$; see the lecture notes on Bayesian Nonparametrics by P. Orbanz (http://www.gatsby.ucl.ac.uk/~porbanz/papers/porbanz_BNP_draft.pdf).
Proposition 3.15 If $f \sim GP(0, k)$ and $f_i = f(x_i) + \varepsilon_i$ where $(\varepsilon_i)_i \overset{iid}{\sim} \mathcal{N}(0, \sigma^2)$, then
$$f \mid \sigma\big((x_1, f_1), \dots, (x_p, f_p)\big) \sim GP(\tilde\mu, \tilde k)$$
with
$$\tilde\mu(x) = \big(k(x, x_1), \dots, k(x, x_p)\big)(K_{1:p,1:p} + \sigma^2 I_p)^{-1}\begin{pmatrix}f_1\\ \vdots\\ f_p\end{pmatrix}$$
$$\tilde k(x, y) = k(x, y) - \big(k(x, x_1), \dots, k(x, x_p)\big)(K_{1:p,1:p} + \sigma^2 I_p)^{-1}\begin{pmatrix}k(y, x_1)\\ \vdots\\ k(y, x_p)\end{pmatrix}.$$

Exercise 3.2.2 Writing the joint distribution
$$\begin{pmatrix}f_1\\ \vdots\\ f_p\\ f(x_{p+1})\\ \vdots\\ f(x_q)\end{pmatrix} \sim \mathcal{N}\left(0,\ \begin{pmatrix}K_{1:p,1:p} + \sigma^2 I_p & K_{1:p,\,p+1:q}\\ K_{p+1:q,\,1:p} & K_{p+1:q,\,p+1:q}\end{pmatrix}\right),$$
we get by Gaussian conditioning
$$\begin{pmatrix}f(x_{p+1})\\ \vdots\\ f(x_q)\end{pmatrix}\Bigg|\, f_{1:p} \sim \mathcal{N}\Big(K_{p+1:q,\,1:p}(K_{1:p,1:p} + \sigma^2 I_p)^{-1}f_{1:p},\ \ K_{p+1:q,\,p+1:q} - K_{p+1:q,\,1:p}(K_{1:p,1:p} + \sigma^2 I_p)^{-1}K_{1:p,\,p+1:q}\Big).$$

Finally, the marginal likelihood of the noisy observations is
$$p(f_{1:N}\mid x_{1:N}, \theta) = \int \underbrace{p\big(f_{1:N}\mid f(x_1), \dots, f(x_N)\big)}_{\mathcal{N}(f(x_{1:N}),\,\sigma^2 I)}\,\underbrace{p\big(f(x_1), \dots, f(x_N)\mid x_{1:N}, \theta\big)}_{\mathcal{N}(0,\,K)}\,df(x_{1:N}) = \mathcal{N}(f_{1:N}\mid 0, \sigma^2 I + K).$$
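A compact NumPy sketch of Proposition 3.15 (hypothetical 1-D inputs, a squared-exponential kernel and noise level chosen only for illustration): it computes the posterior mean $\tilde\mu$ and covariance $\tilde k$ on a grid of test points from the formulas above.

```python
import numpy as np

def sq_exp_kernel(A, B, ell=0.5):
    """Squared-exponential kernel k(x, y) = exp(-|x - y|^2 / (2 ell^2))."""
    d2 = (A[:, None] - B[None, :]) ** 2
    return np.exp(-d2 / (2 * ell**2))

rng = np.random.default_rng(4)
sigma = 0.1
x_train = rng.uniform(-3, 3, size=8)
f_train = np.sin(x_train) + sigma * rng.normal(size=8)   # noisy observations
x_test = np.linspace(-3, 3, 100)

K = sq_exp_kernel(x_train, x_train)
K_star = sq_exp_kernel(x_test, x_train)                  # k(x, x_j) for test points
A = np.linalg.inv(K + sigma**2 * np.eye(len(x_train)))

mu_tilde = K_star @ A @ f_train                                   # posterior mean
k_tilde = sq_exp_kernel(x_test, x_test) - K_star @ A @ K_star.T   # posterior covariance
std = np.sqrt(np.clip(np.diag(k_tilde), 0, None))
print(mu_tilde[:5], std[:5])     # predictions and pointwise uncertainties
```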

Bayesian machine learning Spring 2020
Lecture 4 — February 7, 2020
Lecturer: Julyan Arbel Scribe: Nicolas Pinon, Aitor Artola

4.1 Introduction
Bayesian nonparametrics: Bayesian statistics that is not parametric. Not parametric: the parameters are not finite-dimensional, i.e. there is an unbounded/growing/infinite number of parameters.
GitHub of the course: https://github.com/jarbel/bml-course

4.2 Dirichlet process


Definition 4.16 (Dirichlet process, Ferguson 1973) $P$ is a Dirichlet process on a space $\Theta$ if there exist $\alpha > 0$ and a probability measure $P_0$ such that, for all $k \in \mathbb{N}^*$ and every partition $(A_1, \dots, A_k)$ of $\Theta$:
$$(P(A_1), \dots, P(A_k)) \sim \mathrm{Dir}\big(\alpha P_0(A_1), \dots, \alpha P_0(A_k)\big).$$

Definition 4.17 (Beta distribution) $X \sim \mathrm{Beta}(a, b)$:
$$f(x) \propto x^{a-1}(1-x)^{b-1}.$$

Definition 4.18 (Dirichlet distribution) $X \sim \mathrm{Dir}(a_1, \dots, a_k)$:
$$f(x) \propto x_1^{a_1-1}\cdots x_k^{a_k-1} \quad\text{with } \textstyle\sum_i x_i = 1.$$

Take $A, B \subset \Theta$ and consider the partition $\{A, A^c\}$ of $\Theta$:
$$(P(A), P(A^c)) \sim \mathrm{Dir}\big(\alpha P_0(A), \alpha P_0(A^c)\big), \qquad P(A) \sim \mathrm{Beta}\big(\alpha P_0(A), \alpha(1 - P_0(A))\big).$$
The expectation and variance of a Beta law are
$$\mathbb{E}[\mathrm{Beta}(a, b)] = \frac{a}{a+b}, \qquad \mathrm{Var}[\mathrm{Beta}(a, b)] = \frac{ab}{(a+b+1)(a+b)^2}.$$
We deduce the expectation and variance of our Dirichlet process:
$$\mathbb{E}[P(A)] = P_0(A), \qquad \mathrm{Var}[P(A)] = \frac{P_0(A)(1 - P_0(A))}{1+\alpha}, \qquad \mathrm{Cov}[P(A), P(B)] = \frac{P_0(A\cap B) - P_0(A)P_0(B)}{1+\alpha}.$$

Theorem 4.19 (De Finetti)
$$\text{Exchangeability} \iff \text{conditionally i.i.d. given a latent random measure}.$$

Note: independence implies exchangeability, but not conversely.

Theorem 4.20 (conjugacy) Consider $X_1, \dots, X_n \mid P \overset{iid}{\sim} P$ with the Dirichlet prior $P \sim DP(\alpha P_0)$. The posterior of $P$ in this model is
$$P \mid X_1, \dots, X_n \sim DP\Big(\alpha P_0 + \sum_{i=1}^n \delta_{X_i}\Big)$$
and the predictive distribution is
$$P(X_{n+1} \in \cdot \mid X_n, \dots, X_1) = \frac{\alpha}{\alpha+n}P_0 + \frac{1}{\alpha+n}\sum_{i=1}^n \delta_{X_i}.$$

Definition 4.21 (conjugacy update)
$$\alpha \leftarrow \alpha + n, \qquad P_0 \leftarrow \frac{\alpha}{\alpha+n}P_0 + \frac{1}{\alpha+n}\sum_{i=1}^n \delta_{X_i}.$$

So if we have a $DP(G_0)$ parametrized by an unnormalized base measure $G_0$, we can read off its parameters as $\alpha = G_0(\Theta)$ and $P_0 = \frac{G_0}{G_0(\Theta)}$. Here $\sum_{i=1}^n \delta_{X_i}$ is the (unnormalized) empirical measure.

Definition 4.22 (Pólya urn) Start with an urn containing a mass $\alpha$ of black balls. If we pick a black ball, we add to the urn a ball of a new color $X_i$ drawn from $P_0$; if we pick a non-black ball, we add a ball of the same color. This scheme reproduces the predictive distributions of a DP:
1. $X_1 \mid P \sim P$:
$$\mathbb{P}(X_1 \in A) = \mathbb{E}_P\big[\mathbb{P}(X_1 \in A \mid P)\big] = \mathbb{E}_P[P(A)] = P_0(A) \;\Rightarrow\; X_1 \sim P_0.$$
2. $X_2 \mid X_1 \sim \frac{\alpha}{\alpha+1}P_0 + \frac{1}{\alpha+1}\delta_{X_1}$.

Definition 4.23 (Chinese Restaurant Process) Customers enter a Chinese restaurant one by one and choose a table according to the $DP(\alpha P_0)$ predictive. We denote by $K$ the number of occupied tables, by $X_j^*$ the value (dish) attached to table $j$, and by $n_j$ the number of customers at table $j$. The DP predictive can then be rewritten as
$$P(X_{n+1} \mid X_n, \dots, X_1) = \frac{\alpha}{\alpha+n}P_0 + \frac{1}{\alpha+n}\sum_{i=1}^n \delta_{X_i}
= \frac{\alpha}{\alpha+n}P_0 + \frac{n}{\alpha+n}\cdot\frac{1}{n}\sum_{j=1}^K n_j\,\delta_{X_j^*}.$$
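A short simulation of the Chinese restaurant process (hypothetical $\alpha$ and base measure $P_0 = \mathcal{N}(0, 1)$, both assumptions made for illustration), following the predictive above: with probability $\alpha/(\alpha+n)$ open a new table with a fresh draw from $P_0$, otherwise join an existing table with probability proportional to its size.

```python
import numpy as np

def crp_sample(n, alpha, rng=None):
    """Sample table assignments and table values from a CRP(alpha) with base P0 = N(0, 1)."""
    rng = rng or np.random.default_rng(5)
    assignments, table_values, counts = [], [], []
    for i in range(n):
        probs = np.array(counts + [alpha], dtype=float) / (i + alpha)
        j = rng.choice(len(probs), p=probs)
        if j == len(counts):                 # open a new table, draw its dish from P0
            counts.append(1)
            table_values.append(rng.normal())
        else:                                # join existing table j
            counts[j] += 1
        assignments.append(j)
    return assignments, table_values, counts

assignments, values, counts = crp_sample(n=1000, alpha=2.0)
print(len(counts))                          # number of tables K_n, roughly alpha * log(n)
print(sorted(counts, reverse=True)[:5])     # a few large tables dominate
```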

We can also derive the law of the table counts:
$$P(n_1, \dots, n_K) = \alpha^K\,\frac{\Gamma(\alpha)}{\Gamma(\alpha+n)}\prod_{j=1}^K (n_j - 1)!$$
Using $\frac{\Gamma(\alpha)}{\Gamma(\alpha+n)} = \frac{1}{\alpha(\alpha+1)\cdots(\alpha+n-1)} = \frac{1}{(\alpha)_n}$, we deduce the probability that all customers sit at the same table:
$$P(n_1 = n) = \frac{\alpha}{\alpha}\cdot\frac{1}{\alpha+1}\cdots\frac{n-1}{\alpha+n-1} = \frac{\alpha}{(\alpha)_n}(n-1)!$$


and the probability of having one customer per table:
$$P(n_1 = 1, \dots, n_n = 1) = \prod_{i=1}^n \frac{\alpha}{\alpha+i-1} = \frac{\alpha^n}{(\alpha)_n}.$$

We now study the combinatorics of the number of tables $K_n$. Introduce
$$D_i = \begin{cases}1 & \text{if } X_i \text{ is seated at a new table}\\ 0 & \text{otherwise}\end{cases}, \qquad D_i \sim \mathrm{Ber}\Big(\frac{\alpha}{\alpha+i-1}\Big) \text{ independently}, \qquad K_n = \sum_{i=1}^n D_i.$$

Proposition 4.24
$$\mathbb{E}[K_n] = \sum_{i=1}^n \mathbb{E}[D_i] = \sum_{i=1}^n \frac{\alpha}{\alpha+i-1} \xrightarrow[n\to+\infty]{} \infty, \qquad \mathbb{E}[K_n] \underset{n\to+\infty}{\sim} \alpha\log n.$$

Proposition 4.25
$$\frac{K_n}{\log n} \xrightarrow[n\to+\infty]{a.s.} \alpha.$$

Proposition 4.26 (CLT for $K_n$)
$$\frac{K_n - \mathbb{E}[K_n]}{\mathrm{Std}(K_n)} \to \mathcal{N}(0, 1).$$
Proof idea: Lindeberg CLT for independent (non-identically distributed) random variables.

If $P_0$ is non-atomic, then $P(K_n = K) = \dots$

Let $m_l = \#(\text{tables with } l \text{ customers})$ for $l = 1, \dots, n$. Then
$$\sum_{j=1}^K n_j = n, \qquad \sum_{l=1}^n m_l = K, \qquad \sum_{l=1}^n l\,m_l = n.$$

Definition 4.27 (Population genetics: Ewens sampling formula)
$$P(m_1, \dots, m_n) = \frac{n!}{(\alpha)_n}\,\frac{\alpha^K}{\prod_{l=1}^n l^{m_l}\,m_l!}.$$
Indeed, each particular seating arrangement with these table sizes has probability
$$\frac{\alpha^{m_1}(\alpha\cdot 1)^{m_2}(\alpha\cdot 1\cdot 2)^{m_3}\cdots\big(\alpha\cdot 1\cdots(n-1)\big)^{m_n}}{\alpha(\alpha+1)\cdots(\alpha+n-1)} = \frac{\alpha^K}{(\alpha)_n}\prod_{l=1}^n \big((l-1)!\big)^{m_l},$$
and the number of such arrangements is the multinomial count
$$\frac{1}{\prod_l m_l!}\binom{n}{1,\dots,1,2,\dots,2,\dots,n} = \frac{n!}{\prod_{l=1}^n (l!)^{m_l}\,m_l!}.$$


Definition 4.28 (Stick-breaking for the DP) Let $V_i \overset{iid}{\sim} \mathrm{Beta}(1, \alpha)$ with $\alpha > 0$, $p_1 = V_1$ and $p_i = V_i\prod_{l=1}^{i-1}(1 - V_l)$, and let $\theta_i \overset{iid}{\sim} P_0$. Then $\sum_{i=1}^\infty p_i = 1$ almost surely and
$$P = \sum_{i=1}^\infty p_i\,\delta_{\theta_i} \sim DP(\alpha P_0), \qquad X_1, \dots, X_n \mid P \overset{iid}{\sim} P.$$
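A truncated stick-breaking sketch (the truncation level and the base measure $P_0 = \mathcal{N}(0, 1)$ are assumptions for illustration): it draws the weights $p_i$ and atoms $\theta_i$, then samples $X_1, \dots, X_n$ from the resulting discrete $P$.

```python
import numpy as np

def stick_breaking_dp(alpha, n_atoms=500, rng=None):
    """Truncated stick-breaking representation of P ~ DP(alpha * P0), with P0 = N(0, 1)."""
    rng = rng or np.random.default_rng(6)
    V = rng.beta(1.0, alpha, size=n_atoms)
    # p_i = V_i * prod_{l<i} (1 - V_l)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - V)[:-1]))
    p = V * remaining
    p /= p.sum()                      # renormalize the truncation error away
    theta = rng.normal(size=n_atoms)  # atoms drawn iid from P0
    return p, theta

p, theta = stick_breaking_dp(alpha=2.0)
rng = np.random.default_rng(7)
X = rng.choice(theta, size=1000, p=p)      # X_1, ..., X_n | P iid ~ P
print(len(np.unique(X)))                   # few distinct values: P is discrete
```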

Definition 4.29 (Mixture model)
$$\begin{cases}
Y_i \mid X_i, P \sim f(\cdot\mid X_i) & \text{(often Gaussian)}\\
X_1, \dots, X_n \mid P \overset{iid}{\sim} P\\
P \sim DP(\alpha P_0)
\end{cases}$$
A clustering of $(X_1, \dots, X_n)$ induces a clustering of $(Y_1, \dots, Y_n)$; this is useful for density estimation.
Definition 4.30 (Pitman–Yor process)
$$P(X_{n+1} \mid X_1, \dots, X_n) = \frac{\alpha + K\sigma}{\alpha+n}P_0 + \frac{1}{\alpha+n}\sum_{j=1}^K (n_j - \sigma)\,\delta_{X_j^*}$$
with $\sigma \in [0, 1)$. With $\sigma = 0$ we recover the DP. One can show $\mathbb{E}[K_n] \sim S\,n^\sigma$ with $S$ some random variable.

Definition 4.31 (Stick-breaking interpretation)
$$V_i \sim \mathrm{Beta}(1 - \sigma, \alpha + i\sigma), \qquad p_i = V_i\prod_{l<i}(1 - V_l).$$
Definition 4.32 (Feature allocation model / Indian Buffet Process)
• Customer 1 tries $N_1 \sim \mathrm{Pois}(\gamma)$ dishes (features).
• Customer 2:
  – tries every dish of Customer 1 with probability $1/2$;
  – tries $\mathrm{Pois}(\gamma/2)$ new dishes.
  The total number of dishes tried by Customer 2 is $N_2 \sim \mathrm{Pois}(\gamma/2) + \mathrm{Pois}(\gamma/2) = \mathrm{Pois}(\gamma)$ (thinning of $N_1$ with probability $1/2$, plus the new dishes).
• ...
• Customer $i$:
  – tries every existing dish $j \in \{1, \dots, K\}$ with probability $n_j/i$;
  – tries $\mathrm{Pois}(\gamma/i)$ new dishes.
In particular, $\sum_{j=1}^K n_j = \sum_{l=1}^{i-1} N_l \sim \mathrm{Pois}((i-1)\gamma)$, and the number of distinct dishes satisfies $K_i = K_{i-1} + K_i^+ \sim \mathrm{Pois}\big(\gamma(1 + \tfrac12 + \dots + \tfrac1i)\big)$.

Definition 4.33 Hierarchical DP (Teh et al.):

Bayesian machine learning Spring 2020
Lecture 5 — February 14, 2020
Lecturer: Julyan Arbel Scribe: W. Jallet, A. Floyrac, C. Guillo

5.1 The use for Bayesian Deep Learning


5.1.1 Bayesian model averaging (BMA)
We want to obtain a predictive distribution for our variable $x$ given our dataset $\mathcal{D}$:
$$p(x\mid\mathcal{D}) = \int_\Theta \underbrace{p(x\mid\theta)}_{\text{model}}\,\underbrace{p(\theta\mid\mathcal{D})}_{\text{posterior}}\,d\theta.$$
This can also be a conditional predictive if we are in a regression or classification problem:
$$p(Y\mid X, \mathcal{D}) = \int_{\mathcal{W}} p(Y\mid X, W)\,p(W\mid\mathcal{D})\,dW. \tag{5.8}$$

5.1.2 Uncertainty
Epistemic uncertainty, also known as model uncertainty, represents uncertainty over which hypothesis (or parameter value) is correct given the amount of available data.

Aleatoric uncertainty is, essentially, noise in the data measurements (e.g. measurement errors in sensor data).

Thus, a Bayesian approach to deep learning considers epistemic uncertainty in a principled way, where this uncertainty is carried over to the posterior distribution on our parameter space.

5.1.3 Link between Bayesian DL and regularized Maximum Likelihood
When using regularized maximum likelihood to learn parameters, we compute a quantity
$$\hat\theta \in \operatorname*{argmax}_{\theta\in\Theta}\, \underbrace{\log p(\mathcal{D}\mid\theta)}_{\text{likelihood}} + \underbrace{\log p(\theta)}_{\text{penalty}}. \tag{5.9}$$
If the penalty term is indeed a log prior, $-R(\theta) = \log p(\theta)$, the previous regularized MLE is known as the maximum a posteriori (MAP) estimator, which can be written
$$\hat\theta_{MAP} \in \operatorname*{argmax}_{\theta\in\Theta}\, \underbrace{p(\theta\mid\mathcal{D})}_{\text{actual posterior}}. \tag{5.10}$$


This is still an optimization problem, and not really Bayesian inference. Indeed, MAP
is taking the maximizing mode(s) in the posterior (and not computing a full predictive
distribution), dropping all of the uncertainty it contains and thus all of the information on
the predictive uncertainty.

• A Gaussian prior p(θ) ∝ exp(−kθk22 /2) on parameter space leads to `2 regularization,


and the corresponding MAP estimator is known as the Ridge estimator.

• A Laplace prior p(θ) ∝ exp(−kθk1 ) yields `1 penalization and the so-called LASSO
estimator.

5.1.4 Bayesian Model Averaging (BMA) vs. Model Combination


Methods
Reference: see [2, ch. 14].
N.B. for instance, mixture models are model combination methods.

Gaussian mixtures They are generative models on the data likelihood:
$$p(X) = \sum_{k=1}^K \pi_k\,\mathcal{N}(X\mid\mu_k, \Sigma_k). \tag{5.11}$$
We introduce latent variables $Z \in \{0, 1\}^K$ s.t. $\sum_k z_k = 1$, which represent which mixture component a data point belongs to (i.e. it belongs to the $k$-th component iff $z_k = 1$). Then the joint likelihood of our variable $X$ and (unobserved) latent variable $Z$ is
$$p(X, Z) = p(Z)\,p(X\mid Z) \tag{5.12}$$
where:

• $p(Z) = \prod_k \pi_k^{z_k}$, i.e. $p(z_k = 1) = \pi_k$;

• $p(X\mid Z)$ factorizes with $p(X\mid z_k = 1) = \mathcal{N}(X\mid\mu_k, \Sigma_k)$, i.e.
$$p(X\mid Z) = \prod_{k=1}^K \mathcal{N}(X\mid\mu_k, \Sigma_k)^{z_k}. \tag{5.13}$$

The likelihood is obtained as usual by marginalizing with respect to the latent variable $Z$:
$$p(X) = \sum_Z p(Z)\,p(X\mid Z) \tag{5.14}$$
where we sum over all possible (one-hot) $Z \in \{0, 1\}^K$; there are $K$ of them due to the constraint above.


The full observed-data likelihood is written
$$p(\mathcal{D}) = \prod_{i=1}^n p(X_i) = \prod_{i=1}^n\Bigg(\sum_{Z_i} p(Z_i)\,p(X_i\mid Z_i)\Bigg) \tag{5.15}$$
where $\mathcal{D} = \{X_1, \dots, X_n\}$.
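A small sketch of equations (5.11)–(5.15) (hypothetical 1-D data and two components with fixed, assumed parameters): it evaluates the per-point marginal $p(X_i)$ by summing over the $K$ one-hot values of $Z_i$, then the full log-likelihood of the dataset.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Assumed 1-D mixture parameters (K = 2 components)
pi = np.array([0.3, 0.7])
mu = np.array([-2.0, 1.0])
sd = np.array([0.5, 1.0])

rng = np.random.default_rng(8)
z = rng.choice(2, size=500, p=pi)              # latent components
X = rng.normal(mu[z], sd[z])                   # observed data

# Equation (5.14): p(X_i) = sum_k pi_k N(X_i | mu_k, sigma_k^2)
per_component = pi[None, :] * normal_pdf(X[:, None], mu[None, :], sd[None, :])
p_Xi = per_component.sum(axis=1)

# Equation (5.15): full observed-data log-likelihood
log_lik = np.sum(np.log(p_Xi))
print(log_lik)

# Posterior responsibilities p(z_k = 1 | X_i), the E-step quantity of EM
resp = per_component / p_Xi[:, None]
print(resp[:3].round(3))
```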
This is in contrast to BMA, where the whole dataset is generated by a single model (see Minka 2002), yielding a conditional predictive distribution
$$p(y\mid x, \mathcal{D}) = \int_{\mathcal{W}} p(y, W\mid x, \mathcal{D})\,dW = \mathbb{E}_W\big[p(y\mid x, W)\mid\mathcal{D}\big]. \tag{5.16}$$

BMA Consider $H$ different models indexed by $h = 1, \dots, H$ (in the discrete case) with a prior probability $p(h)$. The marginal distribution of the data $X$ is
$$p(X) = \sum_{h=1}^H p(X\mid h)\,p(h) \qquad\text{or}\qquad p(X) = \int_{\mathcal{H}} p(X\mid h)\,p(h)\,dh. \tag{5.17}$$

Example 5.1.1 We are given observations $X = \{x_1, \dots, x_n\}$.
$$p(x\mid X) = \int_\Theta p(x, \theta\mid X)\,d\theta = \mathbb{E}_\theta\big[p(x\mid\theta)\mid X\big] = \int_\Theta p(x\mid\theta, X)\,p(\theta\mid X)\,d\theta. \tag{5.18}$$

5.2 Bayesian Neural Networks (BNNs)


Reference: see Neal () and MacKay [6].
We put a common (isotropic) prior $\mathcal{N}(0, \sigma^2)$ on the (independent) weights of the NN. A neural network defines a parametric mapping
$$f_w : \begin{cases} \mathcal{X} \longrightarrow \mathcal{Y}\\ x \longmapsto f_w(x) \end{cases} \tag{5.19}$$
For regression, we want a conditional predictive distribution $y\mid x$. We take a Gaussian likelihood
$$p(y\mid x, w) = \mathcal{N}(y\mid f_w(x), \tau^2). \tag{5.20}$$
For data $\mathcal{D} = \{(X_i, Y_i)\}_i$, we get a full data likelihood under weights $w$
$$p(\mathcal{D}\mid w) = \prod_{i=1}^n \mathcal{N}(Y_i\mid f_w(X_i), \tau^2) \tag{5.21}$$
and the posterior distribution on the weights is given by Bayes' rule:
$$p(w\mid\mathcal{D}) \propto \underbrace{p(w)}_{\text{prior}}\,p(\mathcal{D}\mid w) \propto \mathcal{N}(w\mid 0, \sigma^2 I)\prod_{i=1}^n \mathcal{N}(Y_i\mid f_w(X_i), \tau^2). \tag{5.22}$$


The BNN wide limit Some notation:

• inputs $X \in \mathbb{R}^{H^{(0)}}$, $H^{(0)} = d$
• depth $L \in \mathbb{N}^*$
• output $Y \in \mathbb{R}^{H^{(L+1)}}$
• width $H^{(\ell)} \in \mathbb{N}^*$ for the layer at depth $\ell \in \{0, \dots, L+1\}$
• non-linearity $\phi$
• pre-nonlinearity $g^{(\ell)}(X) = W^{(\ell)}h^{(\ell-1)}(X)$
• post-nonlinearity $h^{(\ell)}(X) = \phi(g^{(\ell)}(X))$ (applied elementwise) for $\ell \ge 1$

We also impose $h^{(0)}(X) = X$.

Example 5.2.1 (Single hidden layer, Neal (1996)) In this setup, $L = 1$ and $H = H^{(1)}$. The equations of the NN boil down to
$$g^{(1)}(X) = g(X) = W^{(1)}X \in \mathbb{R}^H, \qquad h^{(1)}(X) = h(X) = \phi(W^{(1)}X), \qquad Y(X) = W^{(2)}h^{(1)}(X) = W^{(2)}\phi(W^{(1)}X). \tag{5.23}$$
How do uncertainties propagate? For all $1 \le i \le H$, $g_i(X)$ is a random variable and
$$g_i(X) = \sum_j W_{ij}^{(1)}X_j \overset{iid}{\sim} \mathcal{N}(0, \|X\|_2^2\,\sigma_H^2).$$
Thus, the hidden variables $h_i(X) = \phi(g_i(X))$ are iid and are functions of Gaussians.
The output is $Y(X) = W^{(2)}h(X)$. Because the weights are iid, $W^{(2)}$ and $h_i(X)$ are independent, thus the statistics of each neuron output $Y_i$ are
$$\mathbb{E}[Y_i(X)] = \sum_{j=1}^H \mathbb{E}_{W^{(2)}}\big[W_{ij}^{(2)}\big]\,\mathbb{E}[h_j(X)] = 0$$
and
$$\mathrm{Var}(Y_{ij}) = \mathbb{E}\big[(W_{ij}^{(2)})^2\big]\,\underbrace{\mathbb{E}\big[(h_j(X))^2\big]}_{=c\ \text{(constant)}},$$
where we denote $Y_{ij} = W_{ij}^{(2)}h_j(X)$ so that $Y_i = \sum_j Y_{ij}$, and recall that the $h_j(X)$ are iid. In conclusion, we have a predictive distribution for $Y_i(X)$ which is not Gaussian but has mean $0$ and variance $cH\sigma_H^2$.
We have a version of the Central Limit Theorem (CLT): with the scaling $\sigma_H^2 = \sigma^2/H$,
$$Y_i(X) \xrightarrow[H\to+\infty]{d} \mathcal{N}(0, cH\sigma_H^2). \tag{5.24}$$
This choice keeps the asymptotic variance $H\sigma_H^2 = \sigma^2$ constant and nondegenerate.
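The wide limit is easy to visualize numerically. The sketch below is an assumption-laden illustration (hypothetical input, widths, a tanh nonlinearity, and the $1/H$ variance scaling applied to the second-layer weights, i.e. the layer whose width grows): it samples single-hidden-layer networks from the prior at a fixed input and watches the output distribution approach a Gaussian as $H$ grows.

```python
import numpy as np

rng = np.random.default_rng(9)
d, sigma = 3, 1.0
X = rng.normal(size=d)           # fixed input

def sample_outputs(H, n_samples=5000, phi=np.tanh):
    """Draw Y(X) = W2 phi(W1 X) from the prior; Var(W2_ij) = sigma^2 / H."""
    W1 = rng.normal(0.0, sigma, size=(n_samples, H, d))
    W2 = rng.normal(0.0, sigma / np.sqrt(H), size=(n_samples, H))
    hidden = phi(W1 @ X)                    # shape (n_samples, H)
    return np.sum(W2 * hidden, axis=1)      # one scalar output per sampled network

for H in (1, 10, 100, 500):
    Y = sample_outputs(H)
    # Mean ~ 0; variance stabilizes; excess kurtosis -> 0 (Gaussian limit)
    kurt = np.mean((Y - Y.mean())**4) / Y.var()**2 - 3.0
    print(H, round(Y.var(), 3), round(kurt, 3))
```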

This result does extend to deeper networks where $L > 1$ (see a 2018 result). We see that, asymptotically, the prior predictive distribution of the $i$-th output $Y_i(x)$ is a white-noise Gaussian process. This is intuitive: we have learned nothing (the input $X$ is fixed, has no prior, and we have not conditioned on any observations of the $Y_i$), the weights are distributed randomly, so the predictor should contain no information.
5.2.1 Understanding the prior at the level of the units [9]
What can we say about the priors of $h^{(\ell)}(x)$, $g^{(\ell)}(x)$ at a given number of units $H^{(\ell)}$? We suppose as before the weights' prior $W_{ij}^{(\ell)} \overset{iid}{\sim} \mathcal{N}(0, \sigma^2)$.
We need a condition on the nonlinearity $\phi$, called the extended envelope condition:
$$\phi(x) \ge c_1 + d_1|x| \text{ on } \mathbb{R}_+ \text{ or } \mathbb{R}_-, \qquad \phi(x) \le c_2 + d_2|x| \text{ on } \mathbb{R},$$
where $d_1, d_2 > 0$. This imposes a kind of ReLU-like nonlinearity.


Now, we can precisely characterize the distribution of pre- and post-nonlinearities.

Theorem 5.34 (Vladimirova et al. [9] (2018)) We assume the conditions above on the priors and nonlinearity. Then, conditional on $X$, the prior of $g_i^{(\ell)}(X)$ or $h_i^{(\ell)}(X)$ at layer $\ell$ is sub-Weibull with tail parameter $\theta = \ell/2$.

Definition 5.35 (Sub-Weibull distribution) A random variable $X$ is sub-Weibull with tail parameter $\theta$ if its c.d.f. $F$ satisfies the following conditions:
$$1 - F(t) \le e^{-\lambda t^{1/\theta}} \tag{5.26}$$
for some $\lambda > 0$ (right tail), and
$$F(t) \underset{t\to-\infty}{\le} e^{-\lambda|t|^{1/\theta}} \tag{5.27}$$
for the left tail.


Figure 5.4. Impact of the number of layers on the prior distribution. Taken from [8].

Remark 5.2.1 In the above definition, the quantity 1 − F (t) is also called the survival
function.

We can define the following specific sub-Weibull distributions:

• Sub-Gaussian: a sub-Weibull with parameter $\theta = 1/2$, i.e.
$$1 - F(t) \le e^{-\lambda t^2} \tag{5.28}$$

• Sub-Exponential: a sub-Weibull with $\theta = 1$, i.e. the survival function satisfies
$$1 - F(t) \le e^{-\lambda t} \tag{5.29}$$

We can also interpret these priors from a regularization point of view; the mode of the weights' posterior distribution given data $\mathcal{D} = \{(X_i, Y_i)\}_i$ is, as usual, the MAP estimator
$$\hat w_{MAP} \in \operatorname*{argmax}_{w\in\mathcal{W}}\, p(w\mid\mathcal{D}) = \operatorname*{argmax}_{w}\, \log p(\mathcal{D}\mid w) + \log p(w). \tag{5.30}$$
Weight decay regularization for NNs is nothing more than applying $\ell_2$ regularization on the weights (which is the same as using a Gaussian prior $p(w) \propto \exp(-\|w\|_2^2)$).
5.2.2 Subspace inference for Bayesian DL
Reference: Izmailov et al. (2019) [4]

Posterior inference is not really scalable in general, especially when the parameter space is large, which is the case in deep learning where the space $\mathcal{W}$ is often high-dimensional. The idea is to construct lower-dimensional subspaces, e.g. spanned by the first few principal components of the SGD trajectory, and then perform variational inference there: we can perform prediction using the approximate posterior predictive distribution, and the resulting uncertainty is well-calibrated. The idea is analogous to PCA, where we reduce the feature space to a given number of principal components.

Bibliography

[1] D. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, 2017.
[2] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Sci-
ence and Statistics). Berlin, Heidelberg: Springer-Verlag, 2006. isbn: 0387310738.
[3] David M. Blei et al. “Latent Dirichlet Allocation”. In: J. Mach. Learn. Res. 3.null (Mar.
2003), pp. 993–1022. issn: 1532-4435.
[4] Pavel Izmailov et al. Subspace Inference for Bayesian Deep Learning. 2019. arXiv: 1907.
07504 [cs.LG].
[5] Diederik P. Kingma and Max. Welling. Auto-Encoding Variational Bayes. 2013. arXiv:
1312.6114 [stat.ML].
[6] David J.C. MacKay. "Bayesian neural networks and density networks". In: Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment 354.1 (1995). Proceedings of the Third Workshop on Neutron Scattering Data Analysis, pp. 73–80. issn: 0168-9002. doi: 10.1016/0168-9002(94)00931-7. url: http://www.sciencedirect.com/science/article/pii/0168900294009317.
[7] Kevin P. Murphy. “A Variational Approximation for Bayesian Networks with Discrete
and Continuous Latent Variables”. In: CoRR abs/1301.6724 (2013). arXiv: 1301.6724.
url: http://arxiv.org/abs/1301.6724.
[8] Mariia Vladimirova and Julyan Arbel. Sub-Weibull distributions: generalizing sub-Gaussian
and sub-Exponential properties to heavier-tailed distributions. 2019. arXiv: 1905.04955
[math.ST].
[9] Mariia Vladimirova et al. Understanding Priors in Bayesian Neural Networks at the
Unit Level. 2018. arXiv: 1810.05193 [stat.ML].
