Single Parametric Models

The document discusses Bayes' Theorem and its application in Bayesian inference, particularly focusing on single-parameter models. It explains how to derive posterior distributions using prior distributions and likelihoods, with examples including normal and binomial distributions. Additionally, it covers concepts of point estimation, interval estimation, and the use of credible intervals in Bayesian analysis.

Single Parameter Models

Yi Yang
Department of Biostatistics and School of Data Science
City University of Hong Kong

Outline

Bayes' Theorem
Single Parameter Models
Bayesian Inference Based on Posterior
Prediction
Prior Elicitation

Bayes' Theorem

Let A denote an event, and A^c denote its complement. Thus, A ∪ A^c = S
and A ∩ A^c = ∅, where S is the sample space. We have

    P(A) + P(A^c) = P(S) ≡ 1

Let A and B be two non-empty events, and P(A|B) denote the probability
of A given that B has occurred. From basic probabilities, we have

    P(A|B) = P(A ∩ B) / P(B),

and thus, P(A ∩ B) = P(A|B)P(B).

Likewise, P(A ∩ B) = P(B|A)P(A) and P(A^c ∩ B) = P(B|A^c)P(A^c).

Observe that

    P(A|B) = P(A ∩ B) / P(B)
           = P(A ∩ B) / [P(A ∩ B) + P(A^c ∩ B)]
           = P(B|A)P(A) / [P(B|A)P(A) + P(B|A^c)P(A^c)]

– This is Bayes' Theorem.
Bayes' Theorem (cont'd)

Example: Suppose 5% of a given population is infected with the HIV virus,
and that a certain HIV test gives a positive result 98% of the time among
patients who have HIV and 4% of the time among patients who do not
have HIV. If a given person has tested positive, what is the probability
that he/she actually has the HIV virus?

A = event: a person has HIV
B = event: tested positive

    P(A|B) = P(B|A)P(A) / [P(B|A)P(A) + P(B|A^c)P(A^c)]
           = (0.98 × 0.05) / (0.98 × 0.05 + 0.04 × 0.95)
           = 0.563

General Bayes' theorem: Let A_1, ..., A_m be mutually exclusive and
exhaustive events. (Exhaustive means A_1 ∪ ··· ∪ A_m = S.) For any event
B with P(B) > 0,

    P(A_j|B) = P(B|A_j)P(A_j) / Σ_{i=1}^{m} P(B|A_i)P(A_i),   j = 1, ..., m.

Bayes' Theorem Applied to Statistical Models

Suppose we have observed data y, which has a probability
distribution f(y|θ) that depends upon an unknown vector of
parameters θ, and π(θ) is the prior distribution of θ that represents
the experimenter's opinion about θ.

Bayes' theorem applied to statistical models (the statistical version of Bayes' Theorem):

    Posterior → p(θ|y) = p(y, θ) / m(y)
                       = f(y|θ)π(θ) / ∫_Θ f(y|θ)π(θ) dθ   (likelihood × prior over the marginal distribution of y)

Θ is the parameter space, i.e., the set of all possible values for θ.

The marginal distribution of y is a function of y alone (nothing to
do with θ), and is often called the 'normalizing constant'.

    p(θ|y) ∝ f(y|θ)π(θ)
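As a quick numerical check of the HIV screening example above (not part of the original slides), the calculation can be reproduced in R:

# Bayes' theorem for the HIV screening example
p_A <- 0.05            # P(A): prevalence of HIV
p_B_given_A <- 0.98    # P(B|A): positive test given HIV
p_B_given_Ac <- 0.04   # P(B|A^c): positive test given no HIV
p_B_given_A * p_A / (p_B_given_A * p_A + p_B_given_Ac * (1 - p_A))  # about 0.563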

Single Parameter Model: Normal with Known Variance

Consider a single observation y from a normal distribution with
known variance.

Likelihood: y ∼ N(y|θ, σ²), σ > 0 is known.

Prior on θ: θ ∼ N(θ|µ, τ²), µ ∈ R and τ > 0 are known
hyperparameters.

Posterior distribution of θ:

    p(θ|y) = N( θ | σ²/(σ² + τ²) · µ + τ²/(σ² + τ²) · y,  σ²τ²/(σ² + τ²) ).

Write B = σ²/(σ² + τ²), and note that 0 < B < 1. Then:

E(θ|y) = Bµ + (1 − B)y, a weighted average of the prior mean and
the observed data value, with weights determined sensibly by the
variances.

Var(θ|y) = Bτ² ≡ (1 − B)σ², smaller than both τ² and σ².

Precision (which is like "information") is additive:
Var⁻¹(θ|y) = Var⁻¹(θ) + Var⁻¹(y|θ).
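A small R sketch of the updating formulas above (not from the slides); the values mu = 2, tau = sigma = 1, y = 6 match the example on the next slide:

# Normal likelihood, normal prior: closed-form posterior for a single y
mu <- 2; tau <- 1      # prior mean and standard deviation
sigma <- 1             # known sampling standard deviation
y <- 6                 # the single observation
B <- sigma^2 / (sigma^2 + tau^2)
c(post_mean = B * mu + (1 - B) * y,   # 4
  post_var  = B * tau^2)              # 0.5, equal to (1 - B) * sigma^2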
Example: µ = 2, ȳ = 6, τ = σ = 1, varying n

[Figure: prior density together with the posteriors for n = 1 and n = 10.]

When n = 1 the prior and likelihood receive equal weight, so the
posterior mean is 4 = (2 + 6)/2.

When n = 10 the data dominate the prior, resulting in a posterior
mean much closer to ȳ.

The posterior variance also shrinks as n gets larger; the posterior
collapses to a point mass on ȳ as n → ∞.

Example: µ = 2, ȳ = 6, n = 1, σ = 1, varying τ

[Figure: prior density together with the posteriors for τ = 1, 2, and 5.]

When τ = 1 the prior is as informative as the likelihood, so the
posterior mean is 4 = (2 + 6)/2.

When τ = 5 the prior is almost flat over the likelihood region, and
thus is dominated by the likelihood.

As τ increases, the prior becomes "flat" relative to the likelihood
function. Such prior distributions are called "noninformative" priors.
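A rough R sketch of how the varying-n comparison above might be redrawn (not from the slides); it uses the n-observation posterior formula given a few slides later, with the sampling variance of ȳ taken as σ²/n:

# Prior N(2, 1), sigma = 1, ybar = 6; posterior for n = 1 and n = 10
mu <- 2; tau <- 1; sigma <- 1; ybar <- 6
theta <- seq(-2, 8, by = 0.01)
post_dens <- function(n) {
  B <- (sigma^2 / n) / (sigma^2 / n + tau^2)   # shrinkage weight on the prior
  dnorm(theta, mean = B * mu + (1 - B) * ybar, sd = sqrt(B * tau^2))
}
plot(theta, dnorm(theta, mu, tau), type = "l", ylim = c(0, 1.4),
     xlab = expression(theta), ylab = "density")   # prior
lines(theta, post_dens(1), lty = 2)                # posterior, n = 1
lines(theta, post_dens(10), lty = 3)               # posterior, n = 10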

Deriving the Posterior

We can find the posterior distribution of the normal mean θ via
Bayes' Theorem:

    p(θ|y) = f(y|θ)π(θ) / m(y) = f(y|θ)π(θ) / ∫_Θ f(y|θ)π(θ) dθ.

Note that m(y) does NOT depend on θ, and thus is just a constant.
That is,

    p(θ|y) ∝ f(y|θ)π(θ).

The final posterior is A · f(y|θ)π(θ), such that

    ∫ A · f(y|θ)π(θ) dθ = 1.

Deriving the Posterior

Consider the previous example: a single observation y ∼ N(y|θ, σ²)
with known σ and prior θ ∼ N(θ|µ, τ²). Can you derive the posterior?

    p(θ|y) = N( θ | σ²/(σ² + τ²) · µ + τ²/(σ² + τ²) · y,  σ²τ²/(σ² + τ²) )

Question: Now, consider n independent observations
y = (y1, ..., yn) from the normal distribution f(yi|θ) = N(yi|θ, σ²),
and the same prior π(θ) = N(θ|µ, τ²). What is the posterior of θ
now?
Bayes and Sufficiency

Recall that T(y) is sufficient for θ if the likelihood can be factored as

    f(y|θ) = h(y) g(T(y)|θ).

Implication in Bayes:

    p(θ|y) ∝ f(y|θ)π(θ) ∝ g(T(y)|θ)π(θ)

Then p(θ|y) = p(θ|T(y)) ⇒ we may work with T(y) instead of the
entire dataset y.

Again, consider n ind. observations y = (y1, ..., yn) from the normal
distribution f(yi|θ) = N(yi|θ, σ²), and prior π(θ) = N(θ|µ, τ²).

Since T(y) = ȳ is sufficient for θ, we have that p(θ|y) = p(θ|ȳ).

We know that f(ȳ|θ) = N(θ, σ²/n), and this implies that

    p(θ|ȳ) = N( θ | σ²/(σ² + nτ²) · µ + nτ²/(σ² + nτ²) · ȳ,  σ²τ²/(σ² + nτ²) ).

Single Parameter Model: Binomial Data

Example: Estimating the probability of a female birth. The currently
accepted value of the proportion of female births in large European
populations is 0.485. Recent interest has focused on factors that
may influence the sex ratio.

We consider a potential factor, the maternal condition placenta
previa, an unusual condition of pregnancy in which the placenta is
implanted low in the uterus, obstructing the fetus from a normal
vaginal delivery.

Observation: An early study concerning the sex of placenta previa
births in Germany found that of a total of 980 births, 437 were
female.

Question: How much evidence does this provide for the claim that
the proportion of female births in the population of placenta previa
births is less than the proportion of female births in the general
population?

Example: Probability of a female birth given placenta previa

Likelihood: Let

    θ = prob. of a female birth given placenta previa
    Yi = 1 if a female birth, 0 otherwise

Let X = Σ_{i=1}^{980} Yi. Assuming independent births and constant θ, we
have X|θ ∼ Binomial(980, θ),

    f(x|θ) = (980 choose x) θ^x (1 − θ)^{980 − x}.

Consider a beta prior distribution for θ:

    π(θ) = [Γ(α + β) / (Γ(α)Γ(β))] θ^{α − 1} (1 − θ)^{β − 1}.

Example: Probability of a female birth given placenta previa

The posterior distribution can be obtained via

    p(θ|x) ∝ f(x|θ) π(θ)
           = (980 choose x) [Γ(α + β) / (Γ(α)Γ(β))] θ^{x + α − 1} (1 − θ)^{980 − x + β − 1}
           ∝ θ^{x + α − 1} (1 − θ)^{980 − x + β − 1}.

The only distribution function that is proportional to the above is
Beta(x + α, 980 − x + β)!

    θ|X ∼ Beta(x + α, 980 − x + β)

Beta distributions are conjugate priors for the Binomial likelihood.
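A minimal R sketch of this conjugate update for the placenta previa data (not from the slides); the uniform Beta(1, 1) prior is just one of the choices considered on later slides:

# Beta-Binomial conjugate update
x <- 437; n <- 980
a <- 1; b <- 1                          # Beta(a, b) prior hyperparameters
c(shape1 = x + a, shape2 = n - x + b)   # posterior is Beta(438, 544)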
Three different beta priors

[Figure: densities of the Beta(1, 1), Beta(1.485, 1.515), and Beta(5.85, 6.15)
priors over θ ∈ [0, 1], with θ = 0.485 marked on the axis.]

Bayesian Inference

Now that we know what the posterior is, we can use it to make
inference about θ.

The three classes of classical, or frequentist, inference are

1. Point estimation
2. Confidence interval (CI)
3. Hypothesis testing

Each of them has its analog in the Bayesian world.

Bayesian Inference: Point Estimation

Easy! Simply choose an appropriate distributional summary:
posterior mean, median, or mode.

Mode is often easiest to compute (no integration), but is often least
representative of the "middle", especially for one-tailed distributions.

Mean has the opposite property, tending to "chase" heavy tails (just
like the sample mean X̄).

Median is probably the best compromise overall, though it can be
awkward to compute, since it is the solution θ_median to

    ∫_{−∞}^{θ_median} p(θ|x) dθ = 1/2.

Posterior estimates

Prior distribution     Posterior mode   Posterior mean   Posterior median
Beta(1, 1)             0.44592          0.44603          0.44599
Beta(1.485, 1.515)     0.44596          0.44607          0.44603
Beta(5.85, 6.15)       0.44631          0.44642          0.44639

The classical point estimate is θ̂_MLE = 437/980 = 0.44592.

Remarks:

1. A Bayes point estimate is a weighted average of a common
frequentist estimate and a parameter estimate obtained only from
the prior distribution.

2. The Bayes point estimate "shrinks" the frequentist estimate toward
the prior estimate.

3. The weight on the frequentist estimate tends to 1 as n goes to
infinity.
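The entries in the table above can be checked with a short R sketch (not from the slides), using the closed-form mode and mean of a Beta(a, b) distribution and qbeta for the median:

# Posterior mode, mean, and median for the Beta(x + a, n - x + b) posteriors
post_summary <- function(a, b) {
  A <- 437 + a; B <- 980 - 437 + b
  c(mode = (A - 1) / (A + B - 2), mean = A / (A + B), median = qbeta(0.5, A, B))
}
round(post_summary(1, 1), 5)            # Beta(1, 1) prior
round(post_summary(1.485, 1.515), 5)    # Beta(1.485, 1.515) prior
round(post_summary(5.85, 6.15), 5)      # Beta(5.85, 6.15) prior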
Bayesian Inference: Interval Estimation

The Bayesian analogue of a frequentist CI is referred to as a credible
interval: a 100 × (1 − α)% credible interval for θ is a subset C of Θ
such that

    P(C|y) = ∫_C p(θ|y) dθ ≥ 1 − α.

Unlike the classical confidence interval, it has a proper probability
interpretation: "The probability that θ lies in C is (1 − α)."

Two principles used in constructing a credible interval C:

The volume of C should be as small as possible.

The posterior density should be greater for every θ ∈ C than it is for
any θ ∉ C.

The two criteria turn out to be equivalent.

HPD Credible Interval

Definition: The 100(1 − α)% highest posterior density (HPD) credible
interval for θ is a subset C of Θ such that

    C = {θ ∈ Θ : p(θ|y) ≥ k(α)},

where k(α) is the largest constant for which

    P(C|y) ≥ 1 − α.

[Figure: two posterior densities, each shown with its 95% HPD interval.]

An HPD credible interval has the smallest volume of all intervals of the
same α level.

Equal-tail Credible Interval

Simpler alternative: the equal-tail interval, or central posterior
interval, which takes the α/2- and (1 − α/2)-quantiles of p(θ|y).

Specifically, consider qL and qU, the α/2- and (1 − α/2)-quantiles of
p(θ|y):

    ∫_{−∞}^{qL} p(θ|y) dθ = α/2   and   ∫_{qU}^{∞} p(θ|y) dθ = α/2.

Clearly, P(qL < θ < qU |y) = 1 − α; our confidence that θ lies in
(qL, qU) is 100 × (1 − α)%. Thus, this interval is a 100 × (1 − α)%
credible interval for θ.

This interval is usually slightly wider than the HPD interval, but easier
to compute (just two quantiles), and also transformation invariant.

Equal-tail intervals do not work well for multimodal posteriors.

Interval Estimation: Example

Using a Gamma(2, 1) posterior distribution and k(α) = 0.1:

[Figure: the Gamma(2, 1) posterior density with the 87% HPD interval,
(0.12, 3.59), and the 87% equal-tail interval, (0.42, 4.39).]
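A sketch of how both 87% intervals in this example can be computed in R (not from the slides), taking the posterior as Gamma with shape 2 and rate 1:

# HPD endpoints are where the posterior density equals k(alpha) = 0.1
k <- 0.1
f <- function(x) dgamma(x, shape = 2, rate = 1) - k
hpd <- c(uniroot(f, c(0, 1))$root, uniroot(f, c(1, 10))$root)
hpd                                               # about (0.12, 3.59)
cover <- pgamma(hpd[2], 2, 1) - pgamma(hpd[1], 2, 1)
cover                                             # about 0.87
# Equal-tail interval with the same coverage
alpha <- 1 - cover
qgamma(c(alpha/2, 1 - alpha/2), shape = 2, rate = 1)   # about (0.42, 4.39)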
Example: probability of a female birth

f(X|θ) = Bin(980, θ), π(θ) = Beta(1, 1), x_obs = 437

[Figure: posterior density of θ with the posterior median and the 95%
equal-tail interval marked as vertical lines.]

Plot the posterior Beta(x_obs + 1, n − x_obs + 1) = Beta(438, 544) in R:

theta <- seq(from=0, to=1, by=0.01)
xobs <- 437; n <- 980
plot(theta, dbeta(theta, xobs+1, n-xobs+1), type="l", xlim=c(0.35,0.55))

Add the posterior median (solid line) and 95% equal-tail Bayesian CI (dotted vertical lines):

abline(v=qbeta(.5, xobs+1, n-xobs+1))
abline(v=qbeta(c(.025,.975), xobs+1, n-xobs+1), lty=2)

Bayesian Hypothesis Testing

To test a hypothesis of H0 versus H1:

The classical approach bases the accept/reject decision on

    p-value = P{ T(Y) more "extreme" than T(y_obs) | θ, H0 },

where "extremeness" is in the direction of HA.

Several problems with this approach:

hypotheses must be nested

the p-value can only offer evidence against the null

the p-value is not the "probability that H0 is true" (but is often
erroneously interpreted this way)

As a result of the dependence on "more extreme" T(Y) values, two
experiments with identical likelihoods could result in different
p-values, violating the Likelihood Principle.

Bayes Factor

Hypothesis testing in the Bayesian framework is often translated into a model
selection problem: Model M1 under H1 versus Model M0 under H0.

The quantity commonly used for Bayesian hypothesis testing and model
selection is the Bayes factor (BF):

    BF = [P(M1|y)/P(M0|y)] / [P(M1)/P(M0)]     (posterior odds ratio / prior odds ratio)
       = { [P(M1, y)/m(y)] / P(M1) } / { [P(M0, y)/m(y)] / P(M0) }
       = p(y|M1) / p(y|M0)
       = ∫_{Θ_M1} p(y|θ, M1) π(θ|M1) dθ / ∫_{Θ_M0} p(y|θ, M0) π(θ|M0) dθ

Bayes Factor vs Likelihood Ratio Test

Bayes factors can also be written in a form similar to the likelihood
ratio test:

    BF = p(y|M1) / p(y|M0)

We integrate over the parameter space instead of maximizing over it.

The Bayes factor reduces to a likelihood ratio test in the case of a
simple vs. simple hypothesis test, i.e., H0: θ = θ0 vs H1: θ = θ1.

Other advantages of the Bayes factor:

– The BF does NOT require nested models.
– The BF has a nice interpretation: large values of BF favor M1 (H1).
Interpretation of Bayes Factor

Possible interpretations:

BF           Strength of evidence
1 to 3       barely worth mentioning
3 to 20      positive
20 to 150    strong
> 150        very strong

These are subjective interpretations and not uniformly agreed upon.

Example: Probability of a female birth

Data: x = 437 out of n = 980 placenta previa births were female.

We test the hypothesis H0: θ ≥ 0.485 vs. H1: θ < 0.485.

Choose the uniform prior π(θ) = Beta(1, 1); the prior probability
of H1 is

    P(θ < 0.485) = 0.485.

The posterior is p(θ|x) = Beta(438, 544), and the posterior
probability of H1 is

    P(θ < 0.485 | x = 437) = 0.993.

The Bayes factor is

    BF = [0.993/(1 − 0.993)] / [0.485/(1 − 0.485)] = 150.6,

strong evidence in favor of H1, i.e., a substantially lower proportion of
female births in the population of placenta previa births than in the
general population.
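This Bayes factor can be reproduced with two calls to pbeta (a quick check, not from the slides):

# Posterior and prior probabilities of H1: theta < 0.485
prior_H1 <- pbeta(0.485, 1, 1)        # = 0.485 under the Beta(1, 1) prior
post_H1  <- pbeta(0.485, 438, 544)    # about 0.993
(post_H1 / (1 - post_H1)) / (prior_H1 / (1 - prior_H1))   # BF, roughly 150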

Limitations and Alternatives

Limitations:

NOT well-defined when the prior π(θ|H) is improper

may be sensitive to the choice of prior

Alternatives for model checking:

Conditional predictive distribution

    f(yi | y_(i)) = f(y) / f(y_(i)) = ∫ f(yi | θ, y_(i)) p(θ | y_(i)) dθ,

which will be proper if p(θ | y_(i)) is.

Penalized likelihood criteria: the Akaike information criterion (AIC),
Bayesian information criterion (BIC), or Deviance information
criterion (DIC).

Bayesian Prediction

We are often interested in predicting a future observation, y_{n+1},
given the observed data y = (y1, ..., yn). A necessary assumption is
exchangeability.

Exchangeability: Given a parametric model f(y|θ), observations
y1, ..., yn, y_{n+1} are conditionally independent, i.e., the joint
density f(y1, ..., y_{n+1}) is invariant to permutation of the indexes.

Under this assumption, we can predict a future observation, y_{n+1},
conditional on the observed data:

    p(y_{n+1} | y) = ∫ f(y_{n+1} | θ) p(θ | y) dθ,

where p(y_{n+1} | y) is known as the posterior predictive distribution.

The frequentist would use f(y_{n+1} | θ̂) here, which is asymptotically
equivalent to p(y_{n+1} | y) above (i.e., when p(θ|y) is a point mass at θ̂).
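One common way to evaluate this integral in practice is by Monte Carlo: draw θ from the posterior, then draw y_{n+1} from the likelihood. The R sketch below is not from the slides and uses illustrative posterior values from the earlier normal example (posterior mean 4, posterior variance 0.5, σ = 1):

# Monte Carlo approximation of the posterior predictive distribution
set.seed(1)
post_mean <- 4; post_sd <- sqrt(0.5); sigma <- 1
theta_draws <- rnorm(10000, post_mean, post_sd)   # theta ~ p(theta | y)
ynew_draws  <- rnorm(10000, theta_draws, sigma)   # y_{n+1} ~ f(y | theta)
c(mean(ynew_draws), var(ynew_draws))   # close to 4 and 0.5 + sigma^2 = 1.5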
Example: Predicting the sex of a future birth

Given a Beta(1, 1) prior, the posterior of θ is Beta(438, 544). The
posterior predictive distribution for the sex of a future birth y* is thus

    p(y*|y) = ∫_0^1 θ^{y*} (1 − θ)^{1 − y*} · [Γ(982) / (Γ(438)Γ(544))] θ^{437} (1 − θ)^{543} dθ

This is known as the beta-binomial distribution. The mean and variance
of the posterior predictive distribution can be obtained via

    E(y*|y) = E( E(y*|θ, y) | y ) = E(θ|y) = 0.446

    var(y*|y) = E( var(y*|θ, y) | y ) + var( E(y*|θ, y) | y )
              = E( θ(1 − θ) | y ) + var(θ|y)

Prior Elicitation

A Bayesian analysis can be subjective in that two different people
may observe the same data y and yet arrive at different conclusions
about θ when they have different prior opinions on θ.

– This is the main criticism from frequentists.

How should one specify a prior (to counter this subjectivity)?

Objective and informative: e.g., historical data, data from pilot
experiments. "Today's posterior is tomorrow's prior."

Noninformative: priors meant to express ignorance about the
unknown parameters.

Conjugate: posterior and prior belong to the same distribution family.
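The predictive mean above can be checked numerically in R (not from the slides), since P(y* = 1 | y) is just the posterior mean of θ:

# Predictive probability of a female birth under the Beta(438, 544) posterior
a <- 438; b <- 544
integrate(function(th) th * dbeta(th, a, b), 0, 1)$value   # about 0.446
a / (a + b)                                                # same value, in closed form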

Noninformative Prior

Meant to express ignorance about the unknown parameter or to have
minimal impact on the posterior distribution of θ.

Also referred to as a vague prior or flat prior.

Example 1: θ = true probability of success for a new surgical
procedure, 0 ≤ θ ≤ 1. A noninformative prior is π(θ) = Unif(0, 1).

Example 2: y1, ..., yn ∼ N(yi|θ, σ²), σ is known, θ ∈ R. A
noninformative prior is π(θ) = 1, −∞ ≤ θ ≤ ∞.

This is an improper prior: ∫_{−∞}^{∞} π(θ) dθ = ∞.

An improper prior may or may not lead to a proper posterior.

The posterior of θ in Example 2 is p(θ|y) = N(ȳ, σ²/n), which is
proper and is equivalent to the likelihood.

Jeffreys Prior

Another noninformative prior is the Jeffreys prior, given in the
univariate case by

    p(θ) = [I(θ)]^{1/2},

where I(θ) is the expected Fisher information in the model, namely

    I(θ) = −E_{x|θ}[ ∂²/∂θ² log f(x|θ) ].

The Jeffreys prior is improper for many models. It may be proper,
however, for certain models.

Unlike the uniform, the Jeffreys prior is invariant to 1-1
transformations.
Conjugate Priors

Defined as one that leads to a posterior distribution belonging to the
same distributional family as the prior:

– normal prior is conjugate for normal likelihood
– beta prior is conjugate for binomial likelihood

Conjugate priors are computationally convenient, but are often not
possible in complex settings.

– In high-dimensional parameter spaces, priors that are conditionally
conjugate are often available (and helpful).

We may guess the conjugate prior by looking at the likelihood as a
function of θ.

Another Example of Conjugate Prior

Suppose that X is distributed as Poisson(θ), so that

    f(x|θ) = e^{−θ} θ^x / x!,   x ∈ {0, 1, 2, ...}, θ > 0.

A reasonably flexible prior for θ is the Gamma(α, β) distribution,

    p(θ) = θ^{α − 1} e^{−θ/β} / (Γ(α) β^α),   θ > 0, α > 0, β > 0.

The posterior is then

    p(θ|x) ∝ f(x|θ) p(θ) ∝ θ^{x + α − 1} e^{−θ(1 + 1/β)}.

There is one and only one density proportional to this last function:
the Gamma(x + α, (1 + 1/β)^{−1}) density. The Gamma family is the
conjugate family for the Poisson likelihood.
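A minimal R sketch of this Poisson-Gamma update (not from the slides); the data value and hyperparameters below are illustrative, and β is treated as a scale parameter, matching the density above:

# Poisson-Gamma conjugate update
xobs <- 7                 # observed Poisson count (illustrative)
a0 <- 2; b0 <- 1          # Gamma(a0, b0) prior, b0 = scale (illustrative)
post_shape <- xobs + a0
post_scale <- 1 / (1 + 1 / b0)          # (1 + 1/b0)^(-1)
curve(dgamma(x, shape = post_shape, scale = post_scale),
      from = 0, to = 20, xlab = expression(theta), ylab = "posterior density")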

Common Conjugate Families

Likelihood                       Conjugate Prior
Binomial(N, θ)                   θ ∼ Beta(α, β)
Poisson(θ)                       θ ∼ Gamma(α0, β0)
N(θ, σ²), σ² is known            θ ∼ N(µ, τ²)
N(θ, σ²), θ is known             τ² = 1/σ² ∼ Gamma(α0, β0)
Exp(λ)                           λ ∼ Gamma(α0, β0)
MVN(θ, Σ), Σ is known            θ ∼ MVN(µ, V)
MVN(θ, Σ), θ is known            Σ ∼ Inv-Wishart(ν, V)
