
Mathematical Statistics 2 (MS2)

Lecture 1: Introduction to Bayesian inference

Amine Hadji <[email protected]>


Leiden University February 9, 2022
1 Introduction to Bayesian inference
1.1 Introduction

About this course

• Introduction to the theory of Bayesian inference

• Implementation of common algorithms

• 3 Lectures + 3 Computer labs

• One practical assignment (+ Presentation possible)

1 / 25
1 Introduction to Bayesian inference
1.1 Introduction

In this chapter, we will discuss:


• Bayes' theorem & the Bayesian approach

• Prior & posterior distribution

• Bayesian inference (estimation, testing)

• Choice of the prior

2 / 25
1 Introduction to Bayesian inference
1.1 Introduction

Bayesian and frequentist probabilities


Frequentism: The probability of an event A is the limit of the frequency of the occurrence of A in a repeated experiment.

Bayesianism: The probability of an event A is a quantification of the belief one has that A occurs in a specific experiment.

Both philosophies have pros and cons. Those will not be discussed in this
course.

3 / 25
1 Introduction to Bayesian inference
1.1 Introduction

Bayes’ rule

Proposition 1.1 (Bayes’ theorem)


Let (Ω, F, P) be a probability space with sample space Ω and probability
measure P. Let A, B ∈ F be two events such that P(A) ≠ 0. Then

P(B|A) = P(A|B)P(B) / P(A)
       = P(A|B)P(B) / (P(A|B)P(B) + P(A|B^c)P(B^c)).

4 / 25
1 Introduction to Bayesian inference
1.1 Introduction

Example - Sensitivity and specificity


A medical test for disease X has outcomes positive and negative. The
disease has a prevalence of 1% (i.e. P(sick)=1%). We are given the
sensitivity and the specificity of the test:

• sensitivity (i.e. P(+ | sick)): 90%

• specificity (i.e. P(– | healthy)): 99%


You get tested positive for disease X. What is the probability you are
actually sick?
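
Working this out with Proposition 1.1 (a direct application of Bayes' theorem; the numbers are those given above):

P(sick | +) = P(+ | sick)P(sick) / (P(+ | sick)P(sick) + P(+ | healthy)P(healthy))
            = (0.90 × 0.01) / (0.90 × 0.01 + 0.01 × 0.99)
            = 0.009 / 0.0189 ≈ 0.476.

So despite the positive test, the probability of actually being sick is below 50%, because the disease is rare.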

5 / 25
1 Introduction to Bayesian inference
1.2 Bayesian approach

Posterior probability - Binary case


Let θ and Y be two binary random variables (both taking values in {0, 1}). Then

P(θ = 1 | Y = y) = P(Y = y | θ = 1)P(θ = 1) / (P(Y = y | θ = 1)P(θ = 1) + P(Y = y | θ = 0)P(θ = 0))

Terminology:
• P(θ = 1), P(θ = 0) are prior probabilities

• P(Y = y |θ = 1) is the likelihood

• P(θ = 1|Y = y ) is the posterior probability

6 / 25
1 Introduction to Bayesian inference
1.2 Bayesian approach

Posterior probability - Categorical case


Let now θ ∈ {θ1, ..., θK} be a categorical random variable, and Y an
arbitrary discrete random variable. Then

P(θ = θk | Y = y) = P(Y = y | θ = θk)P(θ = θk) / Σ_{i=1}^K P(Y = y | θ = θi)P(θ = θi)

Short-hand notation:

P(θk | y) = P(y | θk)P(θk) / Σ_{i=1}^K P(y | θi)P(θi)
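
A minimal sketch of this discrete update in Python (the prior and likelihood values are made-up illustration numbers, not from the lecture):

    import numpy as np

    def discrete_posterior(prior, likelihood):
        # posterior over a finite set of parameter values: prior * likelihood, normalized
        unnormalized = np.asarray(prior) * np.asarray(likelihood)
        return unnormalized / unnormalized.sum()

    prior = [0.2, 0.5, 0.3]          # hypothetical P(θ = θk)
    likelihood = [0.10, 0.40, 0.05]  # hypothetical P(Y = y | θ = θk)
    print(discrete_posterior(prior, likelihood))  # posterior probabilities summing to 1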

7 / 25
1 Introduction to Bayesian inference
1.2 Bayesian approach

Posterior probability - Continuous version


Let now θ be a continuous random variable, and Y an arbitrary random
variable with L(θ | Y = y) the likelihood of θ. Then

P(θ | y) = P(θ)L(θ | Y = y) / ∫ P(θ)L(θ | Y = y) dθ

P(θ | y) ∝ P(θ)L(θ | Y = y)
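
As an illustration of the proportionality form (this particular computation is not on the slide; it is the standard binomial case listed on the next slide): take Y | θ ~ Bin(n, θ) and a Beta(a, b) prior. Then

P(θ | y) ∝ P(θ)L(θ | Y = y) ∝ θ^(a−1)(1 − θ)^(b−1) · θ^y(1 − θ)^(n−y) = θ^(a+y−1)(1 − θ)^(b+n−y−1),

which is the kernel of a Beta(a + y, b + n − y) density, so the normalizing integral never needs to be computed explicitly.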

8 / 25
1 Introduction to Bayesian inference
1.3 Prior & posterior distribution

Examples

• Binomial case

• Poisson case

• Gaussian case (σ² known)

9 / 25
1 Introduction to Bayesian inference
1.4 Bayesian inference

Posterior summary measures

• Posterior mode, mean, median

• Posterior variance

• Credible intervals

• Bayesian hypothesis testing

10 / 25
1 Introduction to Bayesian inference
1.4 Bayesian inference

Posterior mode

θ̂M = arg max_θ P(θ | y)

Proposition 1.2
Let Y = (Y1, ..., Yn) | θ be a conditionally iid sample, and θ a random variable
on Θ with prior distribution P(θ) and posterior distribution
P(θ | y) := P(θ | Y = y). Let θ̂M be the posterior mode. The following
properties hold:
• if P(θ) is constant for all possible values of θ, then θ̂M = θ̂MLE
• there exist bijective functions h : Θ → ∆ such that the posterior mode of h(θ) is not equal to h(θ̂M), i.e. the posterior mode is not invariant under reparametrization.
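
The first property follows directly from the proportionality form of the posterior: with a constant prior,

P(θ | y) ∝ P(θ)L(θ | Y = y) ∝ L(θ | Y = y),

so maximizing the posterior is the same as maximizing the likelihood.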

11 / 25
1 Introduction to Bayesian inference
1.4 Bayesian inference

Posterior mean
θ̂ = E[θ | y] = ∫ θ P(θ | y) dθ

Proposition 1.3
Let Y = (Y1, ..., Yn) | θ be a conditionally iid sample, and θ a random variable
on Θ with prior distribution P(θ) and posterior distribution
P(θ | y) := P(θ | Y = y). Let θ̂ be the posterior mean. The following
properties hold:
• θ̂ = arg min_{θ* ∈ Θ} ∫ (θ − θ*)² P(θ | y) dθ
• there exist bijective functions h : Θ → ∆ such that the posterior mean of h(θ) is not equal to h(θ̂).

12 / 25
1 Introduction to Bayesian inference
1.4 Bayesian inference

Posterior median

θ̂Med = ½ ( inf{θ* : P(θ ≤ θ* | y) ≥ 1/2} + sup{θ* : P(θ ≥ θ* | y) ≥ 1/2} )

Proposition 1.4
Let Y = (Y1, ..., Yn) | θ be a conditionally iid sample, and θ a random variable
on Θ with prior distribution P(θ) and posterior distribution
P(θ | y) := P(θ | Y = y). Let θ̂Med be the posterior median. The following
properties hold:
• θ̂Med = arg min_{θ* ∈ Θ} ∫ |θ − θ*| P(θ | y) dθ
• all bijective functions h : Θ → ∆ verify that the posterior median of h(θ) equals h(θ̂Med), i.e. the posterior median is invariant under reparametrization.

13 / 25
1 Introduction to Bayesian inference
1.4 Bayesian inference

Posterior variance

Var(θ | y) = E[(θ − θ̂)² | y] = ∫ (θ − θ̂)² P(θ | y) dθ

Proposition 1.5
Let Y = (Y1, ..., Yn) | θ be a conditionally iid sample, and θ a random variable
on Θ with prior distribution P(θ) and posterior distribution
P(θ | y) := P(θ | Y = y). Let θ̂ be the posterior mean and Var(θ | y) the
posterior variance. The following properties hold:

Var(θ | y) = E[θ² | y] − θ̂²

Var(θ) = E[Var(θ | y)] + Var(θ̂)

(The posterior variance is on average smaller than the prior variance.)
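
A minimal numerical sketch (not from the lecture) of how these summaries can be computed from a posterior density evaluated on a grid; the Beta(3, 7)-shaped posterior below is just an illustrative choice:

    import numpy as np

    theta = np.linspace(1e-6, 1 - 1e-6, 10_000)      # parameter grid on (0, 1)
    post = theta**2 * (1 - theta)**6                 # unnormalized posterior (Beta(3, 7) kernel)
    post /= np.trapz(post, theta)                    # normalize so it integrates to 1

    mode = theta[np.argmax(post)]                    # posterior mode
    mean = np.trapz(theta * post, theta)             # posterior mean
    cdf = np.cumsum(post) * (theta[1] - theta[0])    # approximate posterior CDF
    median = theta[np.searchsorted(cdf, 0.5)]        # posterior median
    var = np.trapz((theta - mean)**2 * post, theta)  # posterior variance
    print(mode, mean, median, var)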

14 / 25
1 Introduction to Bayesian inference
1.4 Bayesian inference

Credible sets

Definition 1.6
Let Y = (Y1, ..., Yn) | θ be a conditionally iid sample, and θ a random variable
on Θ with prior distribution P(θ) and posterior distribution
P(θ|y ) := P(θ|Y = y ). The set Ĉ ⊂ Θ is a (1 − α)-credible set if it
verifies the following:

P(θ ∈ Ĉ |y ) ≥ (1 − α).

A (1 − α)-credible set for a specific posterior distribution is not unique.

15 / 25
1 Introduction to Bayesian inference
1.4 Bayesian inference

Credible sets
Two special types of credible set
• Highest posterior density set ĈHPD:
  for all θ ∈ ĈHPD and θ0 ∉ ĈHPD, we have P(θ | y) ≥ P(θ0 | y)

• Equal tail interval C̃ := [a(y), b(y)]:

  a(y) := sup{a : P(θ ≤ a | y) < α/2}
  b(y) := inf{b : P(θ ≥ b | y) < α/2}
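
For a posterior of known form, the equal-tail interval is just a pair of posterior quantiles. A sketch in Python assuming a Beta(3, 7) posterior (an arbitrary choice for illustration):

    from scipy.stats import beta

    alpha = 0.05
    a, b = 3, 7                            # hypothetical Beta posterior parameters
    lower = beta.ppf(alpha / 2, a, b)      # posterior alpha/2 quantile
    upper = beta.ppf(1 - alpha / 2, a, b)  # posterior 1 - alpha/2 quantile
    print(lower, upper)                    # 95% equal-tail credible interval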

16 / 25
1 Introduction to Bayesian inference
1.4 Bayesian inference

Bayesian hypothesis testing


Bayesian tools for testing the hypothesis H0 : θ = θ0:

• Using credible sets (reject H0 if θ0 ∉ Ĉ)

• Putting a prior belief on H0 and H1 :

P(H0 | y) = P(y | H0)P(H0) / ( P(y | H0)P(H0) + P(H1) ∫_{H1} P(θ)L(θ | Y = y) dθ )

• Using a Bayes factor for model selection

17 / 25
1 Introduction to Bayesian inference
1.4 Bayesian inference

Bayes factor

Definition 1.7
Let Y = (Y1, ..., Yn) | θi be a conditionally iid sample, and θ1, θ2 two random
variables on Θ1, Θ2 respectively with prior distributions P1(θ1) and
P2(θ2). The Bayes factor B12 is the marginal likelihood ratio
B12 = ∫ L(θ1 | Y = y) P1(θ1) dθ1 / ∫ L(θ2 | Y = y) P2(θ2) dθ2.
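
A sketch (not from the lecture) of computing B12 by one-dimensional numerical integration, for two hypothetical models for 10 coin flips that differ only in the prior on the success probability (Beta(2, 2) for Model 1, Beta(1, 1) for Model 2):

    from scipy.integrate import quad
    from scipy.stats import binom, beta

    y, n = 7, 10                                       # hypothetical data: 7 successes in 10 trials

    def marginal_likelihood(prior_a, prior_b):
        # marginal likelihood: integrate L(theta | Y = y) * P(theta) over theta in (0, 1)
        integrand = lambda t: binom.pmf(y, n, t) * beta.pdf(t, prior_a, prior_b)
        value, _ = quad(integrand, 0.0, 1.0)
        return value

    B12 = marginal_likelihood(2, 2) / marginal_likelihood(1, 1)
    print(B12)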

18 / 25
1 Introduction to Bayesian inference
1.4 Bayesian inference

Bayes factor - Interpretation


Jeffreys' classification for favoring Model 1 (i.e. Θ1) according to the
Bayes factor is as follows:

• BF12 ∈ [1, 3.2): ’not worth mentioning’

• BF12 ∈ [3.2, 10): ’substantial’ evidence for Model 1

• BF12 ∈ [10, 32): ’strong’ evidence for Model 1

• BF12 ∈ [32, 100): ’very strong’ evidence for Model 1

• BF12 ≥ 100: 'decisive' evidence for Model 1


If B12 < 1, then the evidence favors Model 2, and we use B21 = 1/B12 to
make interpretations.

19 / 25
1 Introduction to Bayesian inference
1.5 Choice of prior

What is a good prior?


A prior distribution is supposed to represent our prior belief on a
parameter...

... but choosing a prior without thinking of the posterior might lead to
computational intractability of the posterior

20 / 25
1 Introduction to Bayesian inference
1.5 Choice of prior

Conjugate priors

Definition 1.8
Let P be a family of probability distributions. Let Y = (Y1, ..., Yn) | θ be a
conditionally iid sample, and θ a random variable on Θ with prior
distribution P(θ) and posterior distribution P(θ | y) := P(θ | Y = y). We
say that the prior and posterior distributions are conjugate distributions
of P for the likelihood L(θ | Y = y) if P(θ), P(θ | y) ∈ P. Moreover, we
call the prior P(θ) a conjugate prior.

21 / 25
1 Introduction to Bayesian inference
1.5 Choice of prior

Exponential family

Definition 1.9
A family of probability distributions defined by its likelihood L(θ | Y = y)
depending on a parameter θ is called a k-dimensional exponential family
if there exist functions c, h, Qj and Vj such that

L(θ | Y = y) = c(θ) h(y) exp( Σ_{j=1}^k Qj(θ) Vj(y) ).

The statistic V(y) := {V1(y), ..., Vk(y)} is sufficient for θ.
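
For example (not spelled out on the slide), a single Bernoulli(θ) observation fits this form with k = 1:

L(θ | Y1 = y1) = θ^{y1}(1 − θ)^{1−y1} = (1 − θ) exp( y1 log(θ / (1 − θ)) ),

so c(θ) = 1 − θ, h(y1) = 1, Q1(θ) = log(θ / (1 − θ)) and V1(y1) = y1.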

22 / 25
1 Introduction to Bayesian inference
1.5 Choice of prior

Exponential family - Result

Proposition 1.10
Let a k-dimensional exponential family be defined by its likelihood
L(θ | Y = y). All distributions of the family

Pα,β = { P(θ) ∝ c(θ)^β exp( Σ_{j=1}^k Qj(θ) αj ) }

are conjugate priors for L(θ | Y = y).
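
Continuing the Bernoulli illustration above, Proposition 1.10 gives priors of the form

P(θ) ∝ (1 − θ)^β exp( α log(θ / (1 − θ)) ) = θ^α (1 − θ)^{β−α},

which, for suitable α and β, is the kernel of a Beta(α + 1, β − α + 1) distribution: the Beta family is conjugate for Bernoulli (and binomial) likelihoods.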

23 / 25
1 Introduction to Bayesian inference
1.5 Choice of prior

Jeffreys non-informative prior

Definition 1.11
Let Y = (Y1, ..., Yn) | θ be a conditionally iid sample with likelihood
L(θ | Y = y). The non-informative Jeffreys prior is the prior verifying

PJ(θ) ∝ √|I(θ)|,

where I(θ) is the Fisher information in one observation, i.e.

I(θ) = Varθ( ∂ log L(θ | Y1 = y1) / ∂θ ) = −Eθ[ ∂² log L(θ | Y1 = y1) / ∂θ² ].

Careful:
The Fisher information uses the log-likelihood of one observation
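
As a standard illustration (not worked out on the slide): for a single Bernoulli(θ) observation, I(θ) = 1 / (θ(1 − θ)), so

PJ(θ) ∝ θ^{−1/2}(1 − θ)^{−1/2},

i.e. the Jeffreys prior is the Beta(1/2, 1/2) distribution.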

24 / 25
1 Introduction to Bayesian inference
1.5 Choice of prior

Jeffreys non-informative prior

• If θ and φ are two parametrizations of a statistical model, the Jeffreys
  prior is invariant under reparametrization: the prior computed directly in φ
  agrees with the change-of-variables transformation of PJ(θ),

  PJ(φ) = PJ(θ) |dθ/dφ|

• Most Jeffreys priors are improper (i.e. they are not probability
  distributions because ∫ PJ(θ) dθ is infinite)
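
For instance (a standard example, not taken from the slide): for Y1 ~ N(μ, σ²) with σ² known, I(μ) = 1/σ² does not depend on μ, so PJ(μ) ∝ 1 on the whole real line, which integrates to infinity and is therefore improper.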

25 / 25
