Single Parametric Models

Department of Biostatistics and School of Data Science
City University of Hong Kong

Outline:
Bayesian Inference Based on Posterior
Prediction
Prior Elicitation
Bayes' Theorem

Let A denote an event, and A^c denote its complement. Thus A ∪ A^c = S and A ∩ A^c = ∅, where S is the sample space. We have

    P(A) + P(A^c) = P(S) ≡ 1.

Let A and B be two non-empty events, and P(A|B) denote the probability of A given that B has occurred. From basic probabilities, we have

    P(A|B) = P(A ∩ B) / P(B),

and thus P(A ∩ B) = P(A|B)P(B).

Likewise, P(A ∩ B) = P(B|A)P(A) and P(A^c ∩ B) = P(B|A^c)P(A^c).

Observe that

    P(A|B) = P(A ∩ B) / P(B)
           = P(A ∩ B) / [P(A ∩ B) + P(A^c ∩ B)]
           = P(B|A)P(A) / [P(B|A)P(A) + P(B|A^c)P(A^c)].
Example: Suppose that 5% of a population has HIV, and that a test returns a positive result 98% of the time among patients who have HIV and 4% of the time among patients who do not have HIV. If a given person has tested positive, what is the probability that he/she actually has HIV?

A = event: a person has HIV
B = event: tested positive

    P(A|B) = P(B|A)P(A) / [P(B|A)P(A) + P(B|A^c)P(A^c)]
           = (0.98 × 0.05) / (0.98 × 0.05 + 0.04 × 0.95)
           = 0.563.

General Bayes' theorem: Let A_1, ..., A_m be mutually exclusive and exhaustive events. (Exhaustive means A_1 ∪ ··· ∪ A_m = S.) For any event B with P(B) > 0,

    P(A_j|B) = P(B|A_j)P(A_j) / Σ_{i=1}^{m} P(B|A_i)P(A_i),   j = 1, ..., m.

Bayes' theorem applied to statistical models

Suppose the observed data y arise from a distribution f(y|θ) that depends upon an unknown vector of parameters θ, and π(θ) is the prior distribution of θ that represents the experimenter's opinion about θ.

The statistical version of Bayes' theorem gives the posterior:

    Posterior → p(θ|y) = p(y, θ) / m(y) = f(y|θ)π(θ) / ∫_Θ f(y|θ)π(θ) dθ,

i.e., likelihood × prior divided by the marginal distribution of y.

Θ is the parameter space, i.e., the set of all possible values for θ.

The marginal distribution m(y) is a function of y alone (nothing to do with θ), and is often called the "normalizing constant". In proportional form,

    p(θ|y) ∝ f(y|θ)π(θ).
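To make the proportional form concrete, here is a minimal sketch (an addition, not from the slides) that evaluates the unnormalized posterior f(y|θ)π(θ) on a grid and normalizes it numerically; the Beta(1, 1) prior and the binomial likelihood with 437 successes in 980 trials (the female-birth data used later in these notes) are illustrative choices.

```python
import numpy as np
from scipy import stats

# Grid approximation of p(theta | y), which is proportional to f(y | theta) * pi(theta).
theta = np.linspace(0.001, 0.999, 999)           # grid over the parameter space
prior = stats.beta.pdf(theta, 1, 1)              # pi(theta): uniform Beta(1, 1) prior
likelihood = stats.binom.pmf(437, 980, theta)    # f(y | theta): 437 successes in 980 trials

unnormalized = likelihood * prior                # likelihood x prior
weights = unnormalized / unnormalized.sum()      # normalize over the grid (the "m(y)" step)

print("grid-approximate posterior mean:", (theta * weights).sum())   # ~0.446
```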
Consider a single observation y from N(θ, σ²) with prior π(θ) = N(θ|µ, τ²). Write B = σ²/(σ² + τ²), and note that 0 < B < 1. Then:

E(θ|y) = Bµ + (1 − B)y, a weighted average of the prior mean and the observed data value, with weights determined sensibly by the variances.

Var(θ|y) = Bτ² ≡ (1 − B)σ², smaller than both τ² and σ².

Precision (which is like "information") is additive:

    Var⁻¹(θ|y) = Var⁻¹(θ) + Var⁻¹(y|θ).
Example: µ = 2, ȳ = 6, τ = σ = 1, varying n

[Figure: prior density and posterior densities for n = 1 and n = 10, plotted against θ.]

When n = 1 the prior and likelihood receive equal weight, so the posterior mean is 4 = (2 + 6)/2.

When n = 10 the data dominate the prior, resulting in a posterior mean much closer to ȳ.

The posterior variance also shrinks as n gets larger; the posterior collapses to a point mass on ȳ as n → ∞.

Example: µ = 2, ȳ = 6, n = 1, σ = 1, varying τ

[Figure: posterior densities for τ = 1, τ = 2, and τ = 5, plotted against θ.]

When τ = 1 the prior is as informative as the likelihood, so the posterior mean is 4 = (2 + 6)/2.

When τ = 5 the prior is almost flat over the likelihood region, and thus is dominated by the likelihood.

As τ increases, the prior becomes "flat" relative to the likelihood function. Such prior distributions are called "noninformative" priors.
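A minimal sketch (an addition, not from the slides) of this normal/normal update; the function name is arbitrary, and n observations are handled through the sufficient statistic ȳ by replacing σ² with σ²/n. It reproduces the posterior means discussed in the two examples above.

```python
import numpy as np

def normal_posterior(mu, tau2, ybar, sigma2, n):
    """Posterior of theta for n normal observations with known variance sigma2,
    prior theta ~ N(mu, tau2); uses the sufficient statistic ybar."""
    data_var = sigma2 / n                 # variance of ybar given theta
    B = data_var / (data_var + tau2)      # weight on the prior mean
    post_mean = B * mu + (1 - B) * ybar
    post_var = B * tau2                   # equivalently (1 - B) * data_var
    return post_mean, post_var

# Values from the examples: mu = 2, ybar = 6, sigma = tau = 1.
print(normal_posterior(2, 1, 6, 1, n=1))    # -> (4.0, 0.5): prior and likelihood weighted equally
print(normal_posterior(2, 1, 6, 1, n=10))   # -> (~5.64, ~0.09): the data dominate as n grows
```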
Bayes and Sufficiency

Recall that T(y) is sufficient for θ if the likelihood can be factored as

    f(y|θ) = h(y) g(T(y)|θ).

Implication in Bayes:

    p(θ|y) ∝ f(y|θ)π(θ) ∝ g(T(y)|θ)π(θ).

Then p(θ|y) = p(θ|T(y)) ⇒ we may work with T(y) instead of the entire dataset y.

Again, consider n independent observations y = (y_1, ..., y_n) from the normal distribution f(y_i|θ) = N(y_i|θ, σ²), and prior π(θ) = N(θ|µ, τ²). Since T(y) = ȳ is sufficient for θ, we have that p(θ|y) = p(θ|ȳ).

Single Parameter Model: Binomial Data

Example: Estimating the probability of a female birth. The currently accepted value of the proportion of female births in large European populations is 0.485. Recent interest has focused on factors that may influence the sex ratio.

We consider a potential factor, the maternal condition placenta previa, an unusual condition of pregnancy in which the placenta is implanted low in the uterus, obstructing the fetus from a normal vaginal delivery.

Observation: An early study concerning the sex of placenta previa births in Germany found that of a total of 980 births, 437 were female.

[Figure: two candidate Beta priors for θ, Beta(1.485, 1.515) and Beta(5.85, 6.15), plotted against θ.]

Now that we know what the posterior is, we can use it to make inference about θ (a numerical sketch of the point-estimation step follows the list):
1 Point estimation
2 Interval estimation
3 Hypothesis testing
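As a quick numerical companion (an addition, not from the slides) to the point-estimation step: with the Beta(1, 1) prior and 437 female births out of 980, the posterior is Beta(438, 544), and common Bayesian point estimates are its mean, median, and mode.

```python
from scipy import stats

a, b = 1 + 437, 1 + (980 - 437)        # Beta(438, 544) posterior
post = stats.beta(a, b)

print("posterior mean  :", post.mean())            # ~0.446
print("posterior median:", post.median())          # ~0.446
print("posterior mode  :", (a - 1) / (a + b - 2))  # ~0.446
print("posterior sd    :", post.std())             # ~0.016
```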
Bayesian Inference: Interval Estimation

The Bayesian analogue of a frequentist CI is referred to as a credible interval: a 100 × (1 − α)% credible interval for θ is a subset C of Θ such that

    P(C|y) = ∫_C p(θ|y) dθ ≥ 1 − α.

Unlike the classical confidence interval, it has a proper probability interpretation.

Two principles used in constructing a credible interval C:
The volume of C should be as small as possible.
The posterior density should be greater for every θ ∈ C than it is for any θ ∉ C.
The two criteria turn out to be equivalent.

HPD Credible Interval

Definition: The 100(1 − α)% highest posterior density (HPD) credible interval for θ is a subset C of Θ such that

    C = {θ ∈ Θ : p(θ|y) ≥ k(α)},

where k(α) is the largest constant for which P(C|y) ≥ 1 − α.

[Figure: two posterior densities, each with its 95% HPD interval marked.]

An HPD credible interval has the smallest volume of all intervals of the same α level.
Equal-Tail Credible Interval

Specifically, consider q_L and q_U, the α/2- and (1 − α/2)-quantiles of p(θ|y):

    ∫_{−∞}^{q_L} p(θ|y) dθ = α/2   and   ∫_{q_U}^{∞} p(θ|y) dθ = α/2.

[Figure: a posterior density with the equal-tail interval (q_L, q_U) marked.]

This interval is usually slightly wider than the HPD interval, but easier to compute (just two quantiles), and also transformation invariant.

Equal-tail intervals do not work well for multimodal posteriors.
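A sketch (an addition, not from the slides) that computes both interval types for the Beta(438, 544) posterior from the female-birth example; the shortest-interval search used for the HPD interval is one simple approach that works for unimodal posteriors.

```python
import numpy as np
from scipy import stats

post = stats.beta(438, 544)      # posterior from the female-birth example
alpha = 0.05

# Equal-tail interval: just two quantiles.
equal_tail = post.ppf([alpha / 2, 1 - alpha / 2])

# HPD interval: among all intervals with posterior probability 1 - alpha,
# pick the shortest one (valid for a unimodal posterior).
lower_probs = np.linspace(0, alpha, 2001)
lowers = post.ppf(lower_probs)
uppers = post.ppf(lower_probs + 1 - alpha)
i = np.argmin(uppers - lowers)
hpd = (lowers[i], uppers[i])

print("equal-tail:", equal_tail)   # roughly (0.415, 0.477)
print("HPD       :", hpd)          # very slightly narrower here
```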
Example: probability of a female birth

f(X|θ) = Bin(980, θ), π(θ) = Beta(1, 1), x_obs = 437.

[Figure: posterior density of θ for these data.]

Bayesian Hypothesis Testing

To test the hypothesis H_0 versus H_1:

The classical approach bases the accept/reject decision on the p-value. The Bayesian approach instead compares the hypotheses through the Bayes factor, the ratio of posterior odds to prior odds:

    BF = [P(H_1|x)/P(H_0|x)] / [P(H_1)/P(H_0)].
Interpretation of Bayes Factor

Possible interpretations:

    BF           Strength of evidence
    1 to 3       barely worth mentioning
    3 to 20      positive
    20 to 150    strong
    > 150        very strong

These are subjective interpretations and not uniformly agreed upon.

Example: Probability of a female birth

Data: x = 437 out of n = 980 placenta previa births were female.

We test the hypothesis H_0: θ ≥ 0.485 vs. H_1: θ < 0.485.

Choose the uniform prior π(θ) = Beta(1, 1); the prior probability of H_1 is

    P(θ < 0.485) = 0.485.

The posterior is p(θ|x) = Beta(438, 544), and the posterior probability of H_1 is

    P(θ < 0.485 | x = 437) = 0.993.

The Bayes factor is

    BF = [0.993/(1 − 0.993)] / [0.485/(1 − 0.485)] = 150.6,

strong evidence in favor of H_1: a substantially lower proportion of female births in the population of placenta previa births than in the general population.
Prediction

Given a Beta(1, 1) prior, the posterior of θ is Beta(438, 544). The posterior predictive distribution for the sex of a future birth (y* ∈ {0, 1}) is thus

    p(y*|y) = ∫_0^1 θ^{y*}(1 − θ)^{1−y*} · [Γ(982)/(Γ(438)Γ(544))] θ^{437}(1 − θ)^{543} dθ.

This is known as the beta-binomial distribution. The mean and variance of the posterior predictive distribution can be obtained by

    E(y*|y) = E(E(y*|θ, y)|y) = E(θ|y) = 0.446,

    var(y*|y) = E(var(y*|θ, y)|y) + var(E(y*|θ, y)|y) = E(θ(1 − θ)|y) + var(θ|y).

Prior Elicitation

A Bayesian analysis can be subjective in that two different people may observe the same data y and yet arrive at different conclusions about θ when they have different prior opinions on θ.
– This is the main criticism from frequentists.

How should one specify a prior (countering the charge of subjectivity)?

Objective and informative: e.g., historical data, data from pilot experiments. "Today's posterior is tomorrow's prior."

Noninformative: priors meant to express ignorance about the unknown parameters.

Conjugate: posterior and prior belong to the same distribution family.
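A short numerical check (an addition, not from the slides) of the iterated-expectation formulas above for the Beta(438, 544) posterior:

```python
from scipy import stats

post = stats.beta(438, 544)
theta_mean, theta_var = post.mean(), post.var()

pred_mean = theta_mean                                      # E(y*|y) = E(theta|y)
# E(theta(1 - theta)|y) = E(theta|y) - E(theta^2|y) = mean - (var + mean^2)
pred_var = (theta_mean - (theta_var + theta_mean**2)) + theta_var

print(round(pred_mean, 3), round(pred_var, 3))   # 0.446, ~0.247
```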
Noninformative Priors

Meant to express ignorance about the unknown parameter or to have minimal impact on the posterior distribution of θ. Also referred to as a vague prior or a flat prior.

Example 1: θ = true probability of success for a new surgical procedure, 0 ≤ θ ≤ 1. A noninformative prior is π(θ) = Unif(0, 1).

Example 2: y_1, ..., y_n ~ N(y_i|θ, σ²), σ² is known, θ ∈ R. A noninformative prior is π(θ) = 1, −∞ < θ < ∞.

This is an improper prior: ∫_{−∞}^{∞} π(θ) dθ = ∞. An improper prior may or may not lead to a proper posterior. The posterior of θ in Example 2 is p(θ|y) = N(ȳ, σ²/n), which is proper and is equivalent to the likelihood.

Jeffreys Prior

Another noninformative prior is the Jeffreys prior, given in the univariate case by

    p(θ) = [I(θ)]^{1/2},

where I(θ) is the expected Fisher information in the model, namely

    I(θ) = −E_{x|θ}[ ∂²/∂θ² log f(x|θ) ].

The Jeffreys prior is improper for many models. It may be proper, however, for certain models.

Unlike the uniform prior, the Jeffreys prior is invariant to 1-1 transformations.
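As a worked illustration (an addition, not on the original slides), the Jeffreys prior can be derived directly for the binomial model used earlier. For x|θ ~ Bin(n, θ),

    log f(x|θ) = x log θ + (n − x) log(1 − θ) + const,
    ∂²/∂θ² log f(x|θ) = −x/θ² − (n − x)/(1 − θ)²,

so, taking expectations with E(x) = nθ,

    I(θ) = −E_{x|θ}[∂²/∂θ² log f(x|θ)] = n/θ + n/(1 − θ) = n / [θ(1 − θ)],

and the Jeffreys prior is

    p(θ) ∝ [I(θ)]^{1/2} ∝ θ^{−1/2}(1 − θ)^{−1/2},

i.e., a Beta(1/2, 1/2) distribution, which is proper.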
Conjugate Priors

A conjugate prior is defined as one that leads to a posterior distribution belonging to the same distributional family as the prior:
– a normal prior is conjugate for a normal likelihood;
– a beta prior is conjugate for a binomial likelihood.

Conjugate priors are computationally convenient, but are often not possible in complex settings.
– In high-dimensional parameter spaces, priors that are conditionally conjugate are often available (and helpful).

We may guess the conjugate prior by looking at the likelihood as a function of θ.

Another Example of Conjugate Prior

Suppose that X is distributed as Poisson(θ), so that

    f(x|θ) = e^{−θ} θ^x / x!,   x ∈ {0, 1, 2, ...},   θ > 0.

A reasonably flexible prior for θ is the Gamma(α, β) distribution,

    p(θ) = θ^{α−1} e^{−θ/β} / [Γ(α) β^α],   θ > 0, α > 0, β > 0.

The posterior is then

    p(θ|x) ∝ f(x|θ) p(θ) ∝ θ^{x+α−1} e^{−θ(1 + 1/β)}.

There is one and only one density proportional to this last function: the Gamma(x + α, (1 + 1/β)^{−1}) density. Hence the Gamma family is the conjugate family for the Poisson likelihood.
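A minimal numerical sketch (an addition, not from the slides) of this Poisson–Gamma update; the observed count and the hyperparameters below are arbitrary illustrative values.

```python
from scipy import stats

# Illustrative hyperparameters and a single Poisson observation.
alpha, beta = 2.0, 3.0     # Gamma prior with shape alpha and scale beta
x = 5                      # observed count

# Conjugate update: posterior is Gamma(x + alpha, scale = (1 + 1/beta)**-1).
post_shape = x + alpha
post_scale = 1.0 / (1.0 + 1.0 / beta)

posterior = stats.gamma(a=post_shape, scale=post_scale)
print("posterior mean:", posterior.mean())   # (x + alpha) * beta / (beta + 1) = 5.25
```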
Commonly used conjugate pairs:

    Likelihood                    Conjugate prior
    Binomial(N, θ)                θ ~ Beta(α, β)
    Poisson(θ)                    θ ~ Gamma(α_0, β_0)
    N(θ, σ²), σ² is known         θ ~ N(µ, τ²)
    N(θ, σ²), θ is known          precision τ² = 1/σ² ~ Gamma(α_0, β_0)
    Exp(λ)                        λ ~ Gamma(α_0, β_0)