
Single Parameter Models

Yi Yang

Department of Biostatistics and School of Data Science


City University of Hong Kong

1/37
Outline

Bayes’ Theorem

Single Parameter Models

Bayesian Inference Based on Posterior

Prediction

Prior Elicitation

2/37
Bayes’ Theorem
Let A denote an event, and A^c denote its complement. Thus, A ∪ A^c = S
and A ∩ A^c = ∅, where S is the sample space. We have
P(A) + P(A^c) = P(S) ≡ 1.

Let A and B be two non-empty events, and P(A|B) denote the probability
of A given that B has occurred. From basic probabilities, we have

P(A|B) = P(A ∩ B) / P(B),

and thus P(A ∩ B) = P(A|B)P(B).
Likewise, P(A ∩ B) = P(B|A)P(A) and P(A^c ∩ B) = P(B|A^c)P(A^c).
Observe that

P(A|B) = P(A ∩ B) / P(B) = P(A ∩ B) / [P(A ∩ B) + P(A^c ∩ B)]
       = P(B|A)P(A) / [P(B|A)P(A) + P(B|A^c)P(A^c)]

– This is Bayes’ Theorem.


3/37
Bayes’ Theorem (cont’d)
Example: Suppose 5% of a given population is infected with HIV,
and that a certain HIV test gives a positive result 98% of the time among
patients who have HIV and 4% of the time among patients who do not
have HIV. If a given person has tested positive, what is the probability
that he/she actually has HIV?

A = event: the person has HIV
B = event: the person tested positive

P(A|B) = P(B|A)P(A) / [P(B|A)P(A) + P(B|A^c)P(A^c)]
       = (0.98 × 0.05) / (0.98 × 0.05 + 0.04 × 0.95) = 0.563
General Bayes’ theorem: Let A1, . . . , Am be mutually exclusive and
exhaustive events. (Exhaustive means A1 ∪ · · · ∪ Am = S.) For any event
B with P(B) > 0,

P(Aj|B) = P(B|Aj)P(Aj) / Σ_{i=1}^{m} P(B|Ai)P(Ai),   j = 1, . . . , m.
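
As a sanity check, the HIV example above can be reproduced with a few lines of R (the variable names are my own):

# Bayes' theorem for the HIV example
p_A      <- 0.05      # prevalence P(A)
p_B_A    <- 0.98      # P(positive | HIV), sensitivity
p_B_notA <- 0.04      # P(positive | no HIV), false-positive rate
p_A_B <- p_B_A * p_A / (p_B_A * p_A + p_B_notA * (1 - p_A))
round(p_A_B, 3)       # 0.563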
4/37
Bayes’ Theorem Applied to Statistical
Models
Suppose we have observed data y, which has a probability
distribution f (y|θ) that depends upon an unknown vector of
parameters θ, and π(θ) is the prior distribution of θ that represents
the experimenter’s opinion about θ.

Bayes’ theorem applied to statistical models:

Posterior:  p(θ|y) = p(y, θ) / m(y) = f(y|θ)π(θ) / ∫_Θ f(y|θ)π(θ) dθ   ←   (likelihood × prior) / (marginal distribution of y)

Θ is the parameter space, i.e., the set of all possible values for θ.

The marginal distribution m(y) is a function of y alone (it does not involve θ), and is often called the “normalizing constant”.

p(θ|y) ∝ f(y|θ)π(θ)
5/37
Single Parameter Model: Normal with
Known Variance
Consider a single observation y from a normal distribution with
known variance.
Likelihood: y ∼ N(y | θ, σ²), σ > 0 is known.
Prior on θ: θ ∼ N(θ | µ, τ²), µ ∈ R and τ > 0 are known hyperparameters.
Posterior distribution of θ:

p(θ|y) = N( θ | [σ²/(σ² + τ²)]µ + [τ²/(σ² + τ²)]y ,  σ²τ²/(σ² + τ²) ).

Write B = σ²/(σ² + τ²), and note that 0 < B < 1. Then:
E(θ|y) = Bµ + (1 − B)y, a weighted average of the prior mean and the observed data value, with weights determined sensibly by the variances.
Var(θ|y) = Bτ² ≡ (1 − B)σ², smaller than both τ² and σ².
Precision (which is like “information”) is additive:
Var⁻¹(θ|y) = Var⁻¹(θ) + Var⁻¹(y|θ).
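
A minimal R sketch of this single-observation update (the helper function is my own), checked against the example on the next slide (µ = 2, y = 6, σ = τ = 1):

# Posterior of theta for one observation y ~ N(theta, sigma^2) with prior theta ~ N(mu, tau^2)
normal_posterior_1obs <- function(y, sigma, mu, tau) {
  B <- sigma^2 / (sigma^2 + tau^2)      # weight on the prior mean
  list(mean = B * mu + (1 - B) * y,     # E(theta | y)
       var  = B * tau^2)                # Var(theta | y) = (1 - B) * sigma^2
}
normal_posterior_1obs(y = 6, sigma = 1, mu = 2, tau = 1)   # mean 4, variance 0.5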
6/37
Example: µ = 2, ȳ = 6, τ = σ = 1, varying n
[Figure: prior density N(2, 1) together with the posterior densities for n = 1 and n = 10, plotted against θ.]

When n = 1 the prior and likelihood receive equal weight, so the posterior mean is (2 + 6)/2 = 4.
When n = 10 the data dominate the prior, resulting in a posterior mean much closer to ȳ.
The posterior variance also shrinks as n gets larger; the posterior collapses to a point mass on ȳ as n → ∞.
7/37
µ = 2, ȳ = 6, n = 1, σ = 1, varying τ

0.7
posterior with τ=1
posterior with τ=2

0.6
posterior with τ=5

0.5
0.4
density

0.3
0.2
0.1
0.0

−2 0 2 4 6 8 10

When τ = 1 the prior is as informative as likelihood, so the posterior


mean is 4 = 2+6
2 .
When τ = 5 the prior is almost flat over the likelihood region, and
thus is dominated by the likelihood.
As τ increases, the prior becomes “flat” relative to the likelihood
function. Such prior distributions are called “noninformative” priors.
8/37
Deriving the Posterior

We can find the posterior distribution of the normal mean θ via Bayes’ Theorem:

p(θ|y) = f(y|θ)π(θ) / m(y) = f(y|θ)π(θ) / ∫_Θ f(y|θ)π(θ) dθ.

Note that m(y) does NOT depend on θ, and thus is just a constant. That is,

p(θ|y) ∝ f(y|θ)π(θ).

The final posterior is A · f(y|θ)π(θ), where the constant A is such that

∫ A · f(y|θ)π(θ) dθ = 1.

9/37
Deriving the Posterior

Consider the previous example: a single observation y ∼ N(y | θ, σ²) with known σ and prior θ ∼ N(θ | µ, τ²). Can you derive the posterior?

p(θ|y) = N( θ | [σ²/(σ² + τ²)]µ + [τ²/(σ² + τ²)]y ,  σ²τ²/(σ² + τ²) )

Question: Now, consider n independent observations y = (y1, . . . , yn) from the normal distribution f(yi|θ) = N(yi | θ, σ²), and the same prior π(θ) = N(θ | µ, τ²). What is the posterior of θ now?

10/37
Bayes and Sufficiency
Recall that T (y) is sufficient for θ if the likelihood can be factored as
f (y|θ) = h(y)g (T (y)|θ).

Implication in Bayes:
p(θ|y) ∝ f (y|θ)π(θ) ∝ g (T (y)|θ)π(θ)
Then p(θ|y) = p(θ|T (y)) ⇒ we may work with T (y) instead of the
entire dataset y.

Again, consider n independent observations y = (y1, . . . , yn) from the normal distribution f(yi|θ) = N(yi | θ, σ²), and prior π(θ) = N(θ | µ, τ²). Since T(y) = ȳ is sufficient for θ, we have that p(θ|y) = p(θ|ȳ).

We know that f(ȳ|θ) = N(θ, σ²/n), which implies that

p(θ|ȳ) = N( θ | [(σ²/n)/(σ²/n + τ²)]µ + [τ²/(σ²/n + τ²)]ȳ ,  σ²τ²/(σ² + nτ²) ).
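
A short R sketch of this n-observation update based on ȳ (my own helper, not from the slides); with µ = 2, ȳ = 6, σ = τ = 1 it reproduces the posterior means described in the earlier figures:

# Posterior of theta given ybar from n observations y_i ~ N(theta, sigma^2), prior N(mu, tau^2)
normal_posterior <- function(ybar, n, sigma, mu, tau) {
  s2n <- sigma^2 / n                                   # variance of ybar given theta
  list(mean = s2n / (s2n + tau^2) * mu + tau^2 / (s2n + tau^2) * ybar,
       var  = sigma^2 * tau^2 / (sigma^2 + n * tau^2))
}
normal_posterior(ybar = 6, n = 1,  sigma = 1, mu = 2, tau = 1)  # mean 4,     variance 0.5
normal_posterior(ybar = 6, n = 10, sigma = 1, mu = 2, tau = 1)  # mean ~5.64, variance ~0.091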
11/37
Single Parameter Model: Binomial Data
Example: Estimating the probability of a female birth. The currently
accepted value of the proportion of female births in large European
populations is 0.485. Recent interest has focused on factors that
may influence the sex ratio.

We consider a potential factor, the maternal condition placenta previa, an unusual condition of pregnancy in which the placenta is implanted low in the uterus, obstructing the fetus from a normal vaginal delivery.

Observation: An early study concerning the sex of placenta previa births in Germany found that of a total of 980 births, 437 were female.

Question: How much evidence does this provide for the claim that
the proportion of female births in the population of placenta previa
births is less than the proportion of female births in the general
population?
12/37
Example: Probability of a female birth
given placenta previa
Likelihood: Let

θ = prob. of a female birth given placenta previa,
Yi = 1 if the i-th birth is female, and Yi = 0 otherwise.

Let X = Σ_{i=1}^{980} Yi. Assuming independent births and constant θ, we have X|θ ∼ Binomial(980, θ),

f(x|θ) = (980 choose x) θ^x (1 − θ)^{980−x}.

Consider a beta prior distribution for θ:

π(θ) = Γ(α + β)/(Γ(α)Γ(β)) · θ^{α−1} (1 − θ)^{β−1}.
13/37
Example: Probability of a female birth
given placenta previa

The posterior distribution can be obtained via

p(θ|x) ∝ f(x|θ) π(θ)
       = (980 choose x) · Γ(α + β)/(Γ(α)Γ(β)) · θ^{x+α−1} (1 − θ)^{980−x+β−1}
       ∝ θ^{x+α−1} (1 − θ)^{980−x+β−1}.

The only distribution function that is proportional to the above is Beta(x + α, 980 − x + β)!

θ | X ∼ Beta(x + α, 980 − x + β)

Beta distributions are conjugate priors for the Binomial likelihood.

14/37
Three different beta priors

[Figure: densities of the three beta priors Beta(1, 1), Beta(1.485, 1.515), and Beta(5.85, 6.15), plotted against θ, with θ = 0.485 marked on the axis.]
15/37
Bayesian Inference

Now that we know what the posterior is, we can use it to make
inference about θ.

The three classes of classical, or frequentist, inference are

1 Point estimation

2 Confidence interval (CI)

3 Hypothesis testing

Each of them has its analog in the Bayesian world.

16/37
Bayesian Inference: Point Estimation

Easy! Simply choose an appropriate distributional summary:


posterior mean, median, or mode.
Mode is often easiest to compute (no integration), but is often least
representative of “middle”, especially for one-tailed distributions.

Mean has the opposite property, tending to “chase” heavy tails (just like the sample mean X̄).
Median is probably the best compromise overall, though it can be awkward to compute, since it is the solution θ_median to

∫_{−∞}^{θ_median} p(θ|x) dθ = 1/2.

17/37
Posterior estimates

Prior distribution       Posterior mode   Posterior mean   Posterior median
Beta(1, 1)               0.44592          0.44603          0.44599
Beta(1.485, 1.515)       0.44596          0.44607          0.44603
Beta(5.85, 6.15)         0.44631          0.44642          0.44639

The classical point estimate is θ̂_MLE = 437/980 = 0.44592.
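
These summaries can be reproduced in R from the Beta posteriors (a sketch using the closed-form mode and mean, and qbeta for the median):

# Posterior mode, mean, and median for each of the three Beta priors (x = 437, n = 980)
x <- 437; n <- 980
for (prior in list(c(1, 1), c(1.485, 1.515), c(5.85, 6.15))) {
  a <- prior[1] + x                      # posterior shape parameters:
  b <- prior[2] + n - x                  # theta | x ~ Beta(a, b)
  cat(sprintf("Beta(%g, %g): mode %.5f, mean %.5f, median %.5f\n",
              prior[1], prior[2],
              (a - 1) / (a + b - 2),     # posterior mode
              a / (a + b),               # posterior mean
              qbeta(0.5, a, b)))         # posterior median
}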

Remarks:
1 A Bayes point estimate is a weighted average of a common
frequentist estimate and a parameter estimate obtained only from
the prior distribution.
2 The Bayes point estimate “shrinks” the frequentist estimate toward
the prior estimate.
3 The weight on the frequentist estimate tends to 1 as n goes to infinity.

18/37
Bayesian Inference: Interval Estimation

The Bayesian analogue of a frequentist CI is referred to as a credible interval: a 100 × (1 − α)% credible interval for θ is a subset C of Θ such that

P(C|y) = ∫_C p(θ|y) dθ ≥ 1 − α.

Unlike the classical confidence interval, it has a proper probability interpretation: “The probability that θ lies in C is (1 − α)”.

Two principles used in constructing a credible interval C:
The volume of C should be as small as possible.
The posterior density should be greater for every θ ∈ C than it is for any θ ∉ C.
The two criteria turn out to be equivalent.

19/37
HPD Credible Interval
Definition: The 100(1 − α)% highest posterior density (HPD) credible
interval for θ is a subset C of Θ such that

C = {θ ∈ Θ : p(θ|y) ≥ k(α)} ,

where k(α) is the largest constant for which

P(C |y) ≥ 1 − α .

[Figure: two examples of 95% HPD intervals, shown as the region where the posterior density exceeds the cutoff k(α); posterior density plotted against θ.]

An HPD credible interval has the smallest volume of all intervals of the
same α level.
20/37
Equal-tail Credible Interval

Simpler alternative: the equal-tail interval, or central posterior interval, which takes the α/2- and (1 − α/2)-quantiles of p(θ|y).

Specifically, consider q_L and q_U, the α/2- and (1 − α/2)-quantiles of p(θ|y):

∫_{−∞}^{q_L} p(θ|y) dθ = α/2   and   ∫_{q_U}^{∞} p(θ|y) dθ = α/2.

Clearly, P(q_L < θ < q_U | y) = 1 − α; our confidence that θ lies in (q_L, q_U) is 100 × (1 − α)%. Thus, this interval is a 100 × (1 − α)% credible interval for θ.

This interval is usually slightly wider than the HPD interval, but easier to compute (just two quantiles), and also transformation invariant.

21/37
Interval Estimation: Example

Using a Gamma(2, 1) posterior distribution and k(α) = 0.1:

87% HPD interval: (0.12, 3.59)
87% equal-tail interval: (0.42, 4.39)

[Figure: the Gamma(2, 1) posterior density plotted against θ, with the two intervals marked.]

Equal-tail intervals do not work well for multimodal posteriors.
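
A rough R sketch (my own grid search, not from the slides) that approximately reproduces both intervals for the Gamma(2, 1) posterior:

# HPD region {theta : p(theta | y) >= k} for k = 0.1, and the matching equal-tail interval
k <- 0.1
grid <- seq(0, 12, by = 0.001)
hpd <- range(grid[dgamma(grid, shape = 2, rate = 1) >= k])   # close to the slide's (0.12, 3.59)
cover <- diff(pgamma(hpd, shape = 2, rate = 1))              # posterior mass of the HPD, ~0.87
qgamma(c((1 - cover) / 2, 1 - (1 - cover) / 2),              # equal-tail interval with the
       shape = 2, rate = 1)                                  # same coverage, ~(0.42, 4.39)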

22/37
Example: probability of a female birth
f(X|θ) = Bin(980, θ), π(θ) = Beta(1, 1), x_obs = 437

[Figure: the posterior density Beta(438, 544), plotted against θ over (0.35, 0.55).]

Plot the posterior Beta(x_obs + 1, n − x_obs + 1) = Beta(438, 544) in R:

theta <- seq(from=0, to=1, by=0.01)      # grid of theta values
xobs <- 437; n <- 980
plot(theta, dbeta(theta, xobs+1, n-xobs+1), type="l", xlim=c(0.35,0.55))  # posterior density

Add the posterior median (solid line) and the 95% equal-tail Bayesian credible interval (dashed vertical lines):

abline(v=qbeta(.5, xobs+1, n-xobs+1))                    # posterior median
abline(v=qbeta(c(.025,.975), xobs+1, n-xobs+1), lty=2)   # 2.5% and 97.5% quantiles
23/37
Bayesian Hypothesis Testing
To test a hypothesis H0 versus H1:
The classical approach bases the accept/reject decision on

p-value = P{ T(Y) more “extreme” than T(y_obs) | θ, H0 },

where “extremeness” is in the direction of HA.

Several problems with this approach:


hypotheses must be nested
p-value can only offer evidence against the null
p-value is not the “probability that H0 is true” (but is often
erroneously interpreted this way)
As a result of the dependence on “more extreme” T (Y) values, two
experiments with identical likelihoods could result in different
p-values, violating the Likelihood Principle

24/37
Bayes Factor

Hypothesis testing in the Bayesian framework is often translated into a model selection problem: Model M1 under H1 versus Model M0 under H0.
The quantity commonly used for Bayesian hypothesis testing and model selection is the Bayes factor (BF):

BF = [P(M1|y)/P(M0|y)] / [P(M1)/P(M0)]   ←   posterior odds ratio / prior odds ratio
   = [P(M1, y)/m(y)/P(M1)] / [P(M0, y)/m(y)/P(M0)]
   = p(y|M1) / p(y|M0)
   = ∫_{Θ_{M1}} p(y|θ, M1)π(θ|M1) dθ / ∫_{Θ_{M0}} p(y|θ, M0)π(θ|M0) dθ

25/37
Bayes Factor vs Likelihood Ratio Test

Bayes factors can also be written in a form similar to the likelihood ratio test statistic:

BF = p(y|M1) / p(y|M0)

We integrate over the parameter space instead of maximizing over it.

The Bayes factor reduces to a likelihood ratio test in the case of a simple vs. simple hypothesis test, i.e., H0: θ = θ0 vs. H1: θ = θ1.

Other advantages of the Bayes factor:
– The BF does NOT require nested models.
– The BF has a nice interpretation: large values of BF favor M1 (H1).

26/37
Interpretation of Bayes Factor

Possible interpretations

BF Strength of evidence
1 to 3 barely worth mentioning
3 to 20 positive
20 to 150 strong
> 150 very strong

These are subjective interpretations and not uniformly agreed upon.

27/37
Example: Probability of a female birth
Data: x = 437 out of n = 980 placenta previa births were female.
We test the hypothesis that H0 : θ ≥ 0.485 vs. H1 : θ < 0.485.

Choose the uniform prior π(θ) = Beta(1, 1), and the prior probability
of H1 is
P(θ < 0.485) = 0.485.

The posterior is p(θ|x) = Beta(438, 544), and the posterior probability of H1 is

P(θ < 0.485 | x = 437) = 0.993.

The Bayes factor is

BF = [0.993/(1 − 0.993)] / [0.485/(1 − 0.485)] = 150.6,

strong evidence in favor of H1: a substantially lower proportion of female births in the population of placenta previa births than in the general population.
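
The calculation can be checked in R with the Beta(438, 544) posterior:

# Bayes factor for H1: theta < 0.485 against H0: theta >= 0.485 under a Beta(1, 1) prior
x <- 437; n <- 980
prior_H1 <- 0.485                               # P(theta < 0.485) under the uniform prior
post_H1  <- pbeta(0.485, x + 1, n - x + 1)      # posterior probability of H1, ~0.993
(post_H1 / (1 - post_H1)) / (prior_H1 / (1 - prior_H1))   # BF, roughly 150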
28/37
Limitations and Alternatives

Limitations:
NOT well-defined when the prior π(θ|H) is improper
may be sensitive to the choice of prior

Alternatives for model checking:


Conditional predictive distribution:

f(yi | y_(i)) = f(y) / f(y_(i)) = ∫ f(yi | θ, y_(i)) p(θ | y_(i)) dθ,

which will be proper if p(θ | y_(i)) is.


Penalized likelihood criteria: the Akaike information criterion (AIC),
Bayesian information criterion (BIC), or Deviance information
criterion (DIC).

29/37
Bayesian Prediction
We are often interested in predicting a future observation, yn+1 ,
given the observed data y = (y1 , . . . , yn ). A necessary assumption is
exchangeability.

Exchangeability: Given a parametric model f(Y|θ), the observations y1, . . . , yn, y_{n+1} are conditionally independent; in particular, the joint density f(y1, . . . , y_{n+1}) is invariant to permutations of the indices.

Under this assumption, we can predict a future observation, y_{n+1}, conditional on the observed data:

p(y_{n+1}|y) = ∫ f(y_{n+1}|θ) p(θ|y) dθ;

p(y_{n+1}|y) is known as the posterior predictive distribution.

The frequentist would use f(y_{n+1}|θ̂) here, which is asymptotically equivalent to p(y_{n+1}|y) above (i.e., when p(θ|y) is a point mass at θ̂).
30/37
Example: Predicting the sex of a future
birth

Given a Beta(1, 1) prior, the posterior of θ is Beta(438, 544). The posterior predictive distribution for the sex of a future birth is thus

p(y*|y) = ∫_0^1 θ^{y*} (1 − θ)^{1−y*} · Γ(982)/(Γ(438)Γ(544)) · θ^{437} (1 − θ)^{543} dθ.

This is known as the beta-binomial distribution. The mean and variance of the posterior predictive distribution can be obtained by

E(y*|y) = E( E(y*|θ, y) | y ) = E(θ|y) = 0.446

var(y*|y) = E( var(y*|θ, y) | y ) + var( E(y*|θ, y) | y )
          = E( θ(1 − θ) | y ) + var(θ|y)
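
A quick R check of the predictive mean, both in closed form and by simulating from the posterior (the simulation is my own illustration):

# Posterior predictive probability that a future placenta previa birth is female
x <- 437; n <- 980
a <- x + 1; b <- n - x + 1                # posterior Beta(438, 544) under a Beta(1, 1) prior
a / (a + b)                               # P(y* = 1 | y) = E(theta | y) = 0.446

# Monte Carlo version: draw theta from the posterior, then y* | theta ~ Bernoulli(theta)
theta_draws <- rbeta(1e5, a, b)
mean(rbinom(1e5, size = 1, prob = theta_draws))   # ~0.446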

31/37
Prior Elicitation

A Bayesian analysis can be subjective in that two different people


may observe the same data y and yet arrive at different conclusions
about θ when they have different prior opinions on θ.
– Main criticism from frequentists.

How should one specify a prior (to counter the subjectivity criticism)?

Objective and informative: e.g., historical data, data from pilot experiments.
“Today’s posterior is tomorrow’s prior”
Noninformative: priors meant to express ignorance about the
unknown parameters.
Conjugate: posterior and prior belong to the same distribution family.

32/37
Noninformative Prior

Meant to express ignorance about the unknown parameter or have


minimal impact on the posterior distribution of θ.

Also referred to as a vague prior or flat prior.

Example 1: θ = true probability of success for a new surgical


procedure, 0 ≤ θ ≤ 1. A noninformative prior is π(θ) = Unif (0, 1).

Example 2: y1, . . . , yn ∼ N(yi | θ, σ²), σ is known, θ ∈ R. A noninformative prior is π(θ) = 1, −∞ < θ < ∞.

This is an improper prior: ∫_{−∞}^{∞} π(θ) dθ = ∞.
An improper prior may or may not lead to a proper posterior.
The posterior of θ in Example 2 is p(θ|y) = N(ȳ, σ²/n), which is proper and is equivalent to the likelihood.

33/37
Jeffreys Prior

Another noninformative prior is the Jeffreys prior, given in the univariate case by

p(θ) = [I(θ)]^{1/2},

where I(θ) is the expected Fisher information in the model, namely

I(θ) = −E_{x|θ}[ ∂²/∂θ² log f(x|θ) ].

The Jeffreys prior is improper for many models. It may be proper, however, for certain models.
Unlike the uniform prior, the Jeffreys prior is invariant to one-to-one transformations.
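
As a concrete illustration (not from the slide): for x|θ ∼ Binomial(n, θ), log f(x|θ) = const + x log θ + (n − x) log(1 − θ), so

I(θ) = −E_{x|θ}[ −x/θ² − (n − x)/(1 − θ)² ] = n/θ + n/(1 − θ) = n / (θ(1 − θ)),

and the Jeffreys prior is p(θ) ∝ θ^{−1/2}(1 − θ)^{−1/2}, i.e., a Beta(1/2, 1/2) distribution, which is proper.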

34/37
Conjugate Priors

A conjugate prior is defined as one that leads to a posterior distribution belonging to the same distributional family as the prior:
– normal prior is conjugate for normal likelihood
– beta prior is conjugate for binomial likelihood

Conjugate priors are computationally convenient, but are often not


possible in complex settings
– In high dimensional parameter space, priors that are conditionally
conjugate are often available (and helpful).

We may guess the conjugate prior by looking at the likelihood as a


function of θ.

35/37
Another Example of Conjugate Prior
Suppose that X is distributed as Poisson(θ), so that

f(x|θ) = e^{−θ} θ^x / x!,   x ∈ {0, 1, 2, . . .}, θ > 0.

A reasonably flexible prior for θ is the Gamma(α, β) distribution,

p(θ) = θ^{α−1} e^{−θ/β} / (Γ(α) β^α),   θ > 0, α > 0, β > 0.

The posterior is then

p(θ|x) ∝ f(x|θ) p(θ) ∝ θ^{x+α−1} e^{−θ(1+1/β)}.

There is one and only one density proportional to this last expression: the Gamma(x + α, (1 + 1/β)^{−1}) density. Hence the Gamma family is the conjugate family for the Poisson likelihood.
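
A numerical sanity check in R (the illustration values x = 7, α = 2, β = 1 are arbitrary choices of mine):

# Verify Poisson-Gamma conjugacy: normalized likelihood x prior matches the Gamma posterior
x <- 7; alpha <- 2; beta <- 1
theta <- seq(0.01, 20, by = 0.01)
kernel <- dpois(x, theta) * dgamma(theta, shape = alpha, scale = beta)
kernel <- kernel / (sum(kernel) * 0.01)                       # normalize on the grid
exact  <- dgamma(theta, shape = x + alpha, scale = 1 / (1 + 1 / beta))
max(abs(kernel - exact))                                      # ~0 (grid-approximation error only)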
36/37
Common Conjugate Families

Likelihood                      Conjugate prior
Binomial(N, θ)                  θ ∼ Beta(α, λ)
Poisson(θ)                      θ ∼ Gamma(δ0, γ0)
N(θ, σ²), σ² is known           θ ∼ N(µ, τ²)
N(θ, σ²), θ is known            τ² = 1/σ² ∼ Gamma(δ0, γ0)
Exp(λ)                          λ ∼ Gamma(δ0, γ0)
MVN(θ, Σ), Σ is known           θ ∼ MVN(µ, V)
MVN(θ, Σ), θ is known           Σ ∼ Inv-Wishart(ν, V)

37/37
