
Decision Theory and Bayesian Analysis
Dr. Vilda Purutcuoglu

Edited by Anil A. Aksu based on lecture notes of the STAT 565 course by Dr. Vilda Purutcuoglu
Contents

Decision Theory and Bayesian Analysis

Lecture 1. Bayesian Paradigm
1.1. Bayes' theorem for distributions
1.2. How Bayesian Statistics Uses Bayes' Theorem
1.3. Prior to Posterior
1.4. Triplot

Lecture 2. Some Common Probability Distributions
2.1. Posterior
2.2. Weak Prior
2.3. Sequential Updating
2.4. Normal Sample
2.5. NIC distributions
2.6. Posterior
2.7. Weak prior

Lecture 3. Inference
3.1. Shape
3.2. Visualizing multivariate densities
3.3. Informal Inferences
3.4. Multivariate inference

Lecture 4. Formal Inference
4.1. Utility and decisions
4.2. Formal Hypothesis Testing
4.3. Nuisance Parameter
4.4. Transformation
4.5. The prior distribution
4.6. Subjectivity
4.7. Noninformative Priors
4.8. Informative Priors
4.9. Prior Choices

Lecture 5. Structuring Prior Information
5.1. Binary Exchangeability
5.2. Exchangeable Parameters
5.3. Hierarchical Models

Lecture 6. Sufficiency and Ancillary
6.1. The Likelihood Principle
6.2. Identifiability
6.3. Asymptotic Theory
6.4. Preposterior Properties
6.5. Conjugate Prior Forms

Lecture 7. Tackling Real Problems
7.1. What is a Markov Chain?
7.2. The Chapman-Kolmogorov Equations
7.3. Marginal Distributions
7.4. General Properties of Markov Chains
7.5. Noniterative Monte Carlo Methods

Lecture 8.
8.1. The Gibbs Sampler

Lecture 9. Summary of the properties of the Gibbs sampler
9.1. Metropolis Algorithm
9.2. Metropolis Algorithm
9.3. Data Augmentation

GIBBS SAMPLING
DATA AUGMENTATION
R Codes
Bibliography
LECTURE 1
Bayesian Paradigm

1.1. Bayes' theorem for distributions

If A and B are two events,

(1.1)    P(A | B) = P(A) P(B | A) / P(B).

This is a direct consequence of the multiplication law of probabilities, which says that we can express the joint probability P(A ∩ B) either as P(A) P(B | A) or as P(B) P(A | B). For discrete distributions, if Z and Y are discrete random variables,

(1.2)    P(Z = z | Y = y) = P(Z = z) P(Y = y | Z = z) / P(Y = y).
• How many distributions do we deal with here?

We can express the denominator in terms of the distributions in the numerator [1]:

(1.3)    P(Y = y) = Σ_z P(Y = y, Z = z) = Σ_z P(Z = z) P(Y = y | Z = z).
• This is sometimes called the law of total probability.

In this context, it is just an expression of the fact that, as z ranges over the possible values of Z, the probabilities on the left-hand side of equation (1.2) make up the distribution of Z given Y = y, and so they must add up to one. The extension to continuous distributions is straightforward. If Z and Y are continuous random variables,

(1.4)    f(Z | Y) = f(Z) f(Y | Z) / f(Y),

where the denominator is now expressed as an integral:

(1.5)    f(Y) = ∫ f(Z) f(Y | Z) dZ.

• What is f called in each case?

(1.6)    f = a probability density function in the continuous case, and a probability mass function in the discrete case.

1.2. How Bayesian Statistics Uses Bayes' Theorem

Theorem 1.7 (Bayes' theorem).

    P(A | B) = P(A) P(B | A) / P(B).

If we are interested in the event B, P(B) is the initial or prior probability of the occurrence of B. Then we observe the event A. How likely B is once A is known to have occurred is the posterior probability P(B | A).

Bayes' theorem can be understood as a formula for updating from prior to posterior probability: the updating consists of multiplying by the ratio P(A | B)/P(A). It describes how a probability changes as we learn new information. Observing the occurrence of A will increase the probability of B if and only if P(A | B) > P(A). From the law of total probability,

(1.8)    P(A) = P(A | B) P(B) + P(A | B^c) P(B^c),

where P(B^c) = 1 − P(B).
Lemma 1.9.

    P(A | B) − P(A) = [P(A) − P(A | B^c) P(B^c)] / [1 − P(B^c)] − P(A).

Proof. Using P(B) = 1 − P(B^c) and putting the right-hand side over the common denominator P(B),

    P(A | B) − P(A) = [P(A) − P(A | B^c) P(B^c) − P(A) + P(A) P(B^c)] / P(B)
                    = P(B^c) [P(A) − P(A | B^c)] / P(B).

Substituting P(A) = P(B) P(A | B) + P(B^c) P(A | B^c) from (1.8),

    P(A | B) − P(A) = P(B^c) [P(A | B) − P(A | B^c)(1 − P(B^c))/P(B)]
                    = P(B^c) [P(A | B) − P(A | B^c)],

since (1 − P(B^c))/P(B) = 1.


1.2.1. Generalization of Bayes' Theorem

Let B_1, ..., B_n be a set of mutually exclusive and exhaustive events. Then

(1.10)    P(B_r | A) = P(B_r) P(A | B_r) / P(A) = P(B_r) P(A | B_r) / Σ_{i=1}^n P(B_i) P(A | B_i).

• Assuming that P(B^c) > 0, Lemma 1.9 shows that P(A | B) > P(A) if and only if P(A | B) > P(A | B^c).

• In Bayesian inference we use Bayes' theorem in a particular way:
• Z is the parameter (vector) θ.
• Y is the data (vector) X.

So we have

(1.11)    f(θ | X) = f(θ) f(X | θ) / f(X),

(1.12)    f(X) = ∫ f(θ) f(X | θ) dθ,

(1.13)    f(θ) = prior,

(1.14)    f(θ | X) = posterior,

(1.15)    f(X | θ) = likelihood.
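Whenever the prior and likelihood can be evaluated pointwise, (1.11) and (1.12) can be approximated on a grid. The notes' own codes are in R (see the appendix R Codes); the following is an illustrative Python sketch, in which the N(0, 1) prior and the single observation x = 1.5 from N(θ, 1) are made-up choices, not from the notes:

```python
import math

# Grid approximation of (1.11)-(1.12). Assumed setup: prior theta ~ N(0, 1),
# one observation x = 1.5 with x | theta ~ N(theta, 1).
x, h = 1.5, 0.005
grid = [-5 + i * h for i in range(2001)]                 # theta values
prior = [math.exp(-0.5 * t * t) for t in grid]           # f(theta), unnormalised
like = [math.exp(-0.5 * (x - t) ** 2) for t in grid]     # f(x | theta)

unnorm = [p * l for p, l in zip(prior, like)]            # numerator of (1.11)
f_x = sum(unnorm) * h                                    # (1.12) by a Riemann sum
posterior = [u / f_x for u in unnorm]                    # (1.11)

# For this conjugate normal setup the exact posterior is N(x/2, 1/2),
# so the grid estimate of the posterior mean should be close to 0.75.
post_mean = sum(t * p for t, p in zip(grid, posterior)) * h
print(round(post_mean, 2))   # prints 0.75
```

The same grid trick works for any one-dimensional parameter; the conjugate normal case of Section 1.4.1 is used here only because the answer can be checked exactly.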

1.2.2. Interpreting our senses

How do we interpret the things we see, hear, feel, taste or smell?

Example 1.2.1. I hear a song on the radio and identify the singer as Robbie Williams. Why do I think it's Robbie Williams? Because he sounds like that. Formally, P(what I hear | Robbie Williams) >> P(what I hear | someone else).

Example 1.2.2. I look out of the window and see what appears to be a tree. It has a big, dark-coloured part sticking up out of the ground that branches into thinner sticks, and on the ends of these are small green things. Clearly, P(view | tree) is high, while P(view | car) or P(view | Robbie Williams) are very small. But P(view | cardboard cutout cunningly painted to look like a tree) is also very high, maybe even higher than P(view | tree), in the sense that what I see looks exactly as such a cutout is designed to look. Does this mean I should now believe that I am seeing a cardboard cutout cunningly painted to look like a tree? No, because to begin with it is much less likely than a real tree.
In statistical terms, consider some data X and some unknown parameter θ. The first step in any statistical analysis is to build a model that links the data to the unknown parameters; the main function of this model is to allow us to state the probability of observing any data given any specified values of the parameters. That is, the model defines f(x | θ).

When we think of f(x | θ) as a function of θ for fixed observed data X, we call it the likelihood function and denote it by L(θ; X).
• So how can we combine this with our example?

This perspective underlies the differences between the two main theories of statistical inference.
• Frequentist inference essentially uses only the likelihood; it does not recognize f(θ).
• Bayesian inference uses both the likelihood and f(θ).
The principal distinguishing feature of Bayesian inference, as opposed to frequentist inference, is its use of f(θ).

1.3. Prior to Posterior

We refer to f(θ) as the prior distribution of θ. It represents knowledge about θ prior to observing the data X. We refer to f(θ | X) as the posterior distribution of θ; it represents knowledge about θ after observing X.
• So we have two sources of information about θ.
• Here f(x) does not depend on θ, and the posterior must satisfy ∫ f(θ | x) dθ = 1. Since f(x) is a constant within this integral, we can take it outside to get 1 = f(x)^{-1} ∫ f(θ) f(x | θ) dθ.
• f(θ | x) ∝ f(θ) f(x | θ) ∝ f(θ) L(θ; x): the posterior is proportional to the prior times the likelihood.
• The constant that we require to scale the right-hand side so that it integrates to 1 is usually called the normalizing constant. If we haven't dropped any constants from f(θ) or f(x | θ), then the normalizing constant is just f(x)^{-1}; otherwise it also restores any dropped constants.

1.4. Triplot

If for any value of θ we have either f(θ) = 0 or f(x | θ) = 0, then we will also have f(θ | x) = 0. This is called the property of zero preservation. So if either:
• the prior information says that this θ value is impossible, or
• the data say that this value of θ is impossible, because if it were the true value then the observed data would have been impossible,
then the posterior distribution confirms that this value of θ is impossible.

Definition 1.16. Cromwell's Rule: If either information source completely rules out a specific θ, then the posterior must rule it out too.
Figure 1. Triplot of prior, likelihood and posterior.

This means that we should be very careful about giving zero probability to something unless it is genuinely impossible. Once something has zero probability, no amount of further evidence can cause it to have a non-zero posterior probability.
• More generally, f(θ | x) will be low if either f(θ) or f(x | θ) is very small. We will tend to find that f(θ | x) is large when both f(θ) and f(x | θ) are relatively large, so that this θ value is given support by both information sources.
When θ is a scalar parameter, a useful diagram is the triplot, which shows the prior, likelihood and posterior on the same graph. An example is in Figure 1.

A strong information source in the triplot is indicated by a curve that is narrow (and therefore, because it integrates to one, also has a high peak). A narrow curve concentrates on a small range of θ values, and thereby "rules out" all values of θ outside that range.

(All plots are generated in R; the relevant codes are provided in the appendix R Codes.)

• Over the range θ < −1, the likelihood is low.
• Over the range θ > 3, the likelihood is low.
• For values of θ between −1 and 3, the likelihood is high.
• The maximum value of the posterior is at θ ≈ 1.
• The maximum likelihood estimate (MLE) of θ is ≈ 2.

1.4.1. Normal Mean

For example, suppose that X_1, X_2, ..., X_n are iid N(µ, σ²) and σ² is known. Then the likelihood is

(1.17)    f(x | µ) = ∏_{i=1}^n f(x_i | µ) = ∏_{i=1}^n (2πσ²)^{−1/2} exp{−(x_i − µ)²/(2σ²)} ∝ exp{−Σ_i (x_i − µ)²/(2σ²)}.

Since

(1.18)    Σ_i (x_i − x̄ + x̄ − µ)² = Σ_i (x_i − x̄)² + n(x̄ − µ)² + 2(x̄ − µ) Σ_i (x_i − x̄) = Σ_i (x_i − x̄)² + n(x̄ − µ)²,

and 2(x̄ − µ) Σ_i (x_i − x̄) = 0 because Σ_i (x_i − x̄) = 0, the likelihood reduces to f(x | µ) ∝ exp{−n(x̄ − µ)²/(2σ²)}. Suppose the prior distribution for µ is normal:

(1.19)    µ ∼ N(m, v).
Then, applying Bayes' theorem, we have

(1.20)    f(µ | x) ∝ exp{−n(x̄ − µ)²/(2σ²)} exp{−(µ − m)²/(2v)} = exp(−Q/2),

where the first factor is f(x | µ) and the second is f(µ). Note that

(1.21)    Q = nσ^{−2}(x̄ − µ)² + v^{−1}(µ − m)² = (v*)^{−1}(µ − m*)² + R,

where

(1.22)    v* = (nσ^{−2} + v^{−1})^{−1},

(1.23)    m* = (nσ^{−2} + v^{−1})^{−1}(nσ^{−2} x̄ + v^{−1} m) = a x̄ + (1 − a) m,

with a = nσ^{−2}/(nσ^{−2} + v^{−1}), and

(1.24)    R = (x̄ − m)²/(n^{−1}σ² + v).

Since R does not involve µ, it is absorbed into the constant of proportionality. Therefore

(1.25)    f(µ | x) ∝ exp{−(µ − m*)²/(2v*)},

and we have shown that the posterior distribution is normal too: µ | x ∼ N(m*, v*).
• m* is a weighted average of the prior mean m and the usual frequentist data-only estimate x̄, with weights proportional to the precisions v^{−1} and nσ^{−2} of the two information sources.
• Bayes' theorem typically works in this way. We usually find that posterior estimates are compromises between prior estimates and data-based estimates, and tend to be closer to whichever information source is stronger. And we usually find that the posterior variance is smaller than the prior variance.
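The formulas (1.22)-(1.23) are short enough to transcribe directly. In the sketch below the prior N(0, 4), the known variance σ² = 1 and the data summary (n = 10, x̄ = 2) are illustrative numbers, not from the notes:

```python
# Normal-mean update (1.22)-(1.23). Prior mu ~ N(m, v); data x1..xn iid
# N(mu, sigma2) with sigma2 known. All numbers below are made up.
m, v = 0.0, 4.0
sigma2, n, xbar = 1.0, 10, 2.0

prec_data = n / sigma2            # n * sigma^{-2}
prec_prior = 1.0 / v              # v^{-1}
v_star = 1.0 / (prec_data + prec_prior)                 # (1.22)
m_star = v_star * (prec_data * xbar + prec_prior * m)   # (1.23)
a = prec_data / (prec_data + prec_prior)                # weight on xbar

# m* is the weighted average a*xbar + (1-a)*m, and v* < v: the posterior
# is pulled towards the data and is more precise than the prior.
print(round(m_star, 4), round(v_star, 4), round(a, 4))  # prints 1.9512 0.0976 0.9756
```

With these numbers the data precision (10) dominates the prior precision (0.25), so the posterior mean sits very close to x̄, exactly the compromise described in the bullet above.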

1.4.2. Weak Prior Information

This is the case where the prior information is much weaker than the data. This will occur, for instance, if we do not have strong information about θ before seeing the data, and if there are lots of data. Then, in the triplot, the prior distribution will be much broader and flatter than the likelihood, so the posterior is approximately proportional to the likelihood.

Example 1.4.1. In the normal mean analysis, we get weak prior information by letting the prior precision v^{−1} become small. Then m* → x̄ and v* → σ²/n, so that the posterior distribution of µ corresponds very closely with standard frequentist theory.
LECTURE 2
Some Common Probability Distributions
Binomial on Y ∈ {0, 1, ..., n} with parameters n ∈ {1, 2, 3, ...} and p ∈ (0, 1) is denoted by Bi(n, p), and

(2.1)    f(y | n, p) = (n choose y) p^y (1 − p)^{n−y}

for y = 0, 1, ..., n. The mean is

(2.2)    E(y) = np,

and the variance is

(2.3)    v(y) = np(1 − p).

Beta on Y ∈ (0, 1) with parameters p, q > 0 is denoted by Beta(p, q), and the density function is

(2.4)    f(y | p, q) = [Γ(p + q)/{Γ(p)Γ(q)}] y^{p−1} (1 − y)^{q−1}

for y ∈ (0, 1). The mean is

(2.5)    E(y) = p/(p + q),

and the variance is

(2.6)    v(y) = pq/{(p + q)²(p + q + 1)}.

B(p, q) = ∫₀¹ y^{p−1}(1 − y)^{q−1} dy is the beta function, defined to be the normalizing constant for this density, so that B(p, q) = Γ(p)Γ(q)/Γ(p + q).
• In the beta distribution, p and q change the shape of the distribution. Discuss!

Uniform distribution on Y ∈ (l, r), where −∞ < l < r < ∞, is denoted by Uniform(l, r), and its pdf is

(2.7)    f(y | l, r) = 1/(r − l)

for y ∈ (l, r). The mean is

(2.8)    E(y) = (l + r)/2,

and the variance is

(2.9)    v(y) = (r − l)²/12.
Poisson distribution on Y ∈ {0, 1, 2, ...} with parameter θ > 0 is denoted by Poisson(θ), and its pmf is

(2.10)    f(y | θ) = exp(−θ) θ^y / y!

for y = 0, 1, 2, .... The mean and the variance are equal [4]:

(2.11)    E(y) = v(y) = θ.

Gamma distribution on Y > 0 with shape parameter α > 0 and rate parameter λ > 0 is denoted by Gamma(α, λ), and the corresponding density is

(2.12)    f(y | α, λ) = [λ^α/Γ(α)] y^{α−1} exp(−λy)

for y > 0. The mean is

(2.13)    E(y) = α/λ,

and the variance is

(2.14)    v(y) = α/λ².

Note that

(2.15)    Exp(λ) = Gamma(1, λ).
Univariate normal distribution on Y ∈ R with mean µ ∈ R and variance σ² > 0 is denoted by N(µ, σ²), and its pdf is

(2.16)    f(y | µ, σ²) = (2πσ²)^{−1/2} exp{−(y − µ)²/(2σ²)}.

The mean is

(2.17)    E(y) = µ,

and the variance is

(2.18)    v(y) = σ².

K-variate normal distribution on Y ∈ R^k with mean vector b ∈ R^k and positive definite symmetric (PDS) covariance matrix C is denoted by N_k(b, C), and the corresponding density function is

(2.19)    f(y | b, C) = |C|^{−1/2} (2π)^{−k/2} exp{−(y − b)ᵀ C^{−1} (y − b)/2},

where |C| is the determinant of C. The mean is

(2.20)    E(y) = b,

and the covariance matrix is

(2.21)    Cov(y) = C.

2.1. Posterior

Not only are the beta distributions the simplest and most convenient distributions for a random variable confined to [0, 1], they also work very nicely as prior distributions for a binomial observation. Suppose θ ∼ Be(p, q) and x | θ ∼ Bi(n, θ), so that

(2.22)    f(x | θ) = (n choose x) θ^x (1 − θ)^{n−x}

for x = 0, 1, ..., n, and

(2.23)    f(θ) = [1/B(p, q)] θ^{p−1} (1 − θ)^{q−1},

where 0 ≤ θ ≤ 1 and p, q > 0. Then

(2.24)    f(x) = ∫ f(θ) f(x | θ) dθ = (n choose x) [1/B(p, q)] ∫₀¹ θ^{p+x−1} (1 − θ)^{q+n−x−1} dθ = (n choose x) B(p + x, q + n − x)/B(p, q).

From

(2.25)    f(θ | x) = f(θ) f(x | θ) / f(x),

we get

(2.26)    f(θ | x) = θ^{p+x−1} (1 − θ)^{q+n−x−1} / B(p + x, q + n − x) ∝ θ^{p−1}(1 − θ)^{q−1} · θ^x (1 − θ)^{n−x},

where the first factor is the beta part and the second is the binomial part. So θ | x ∼ Be(p + x, q + n − x). The posterior mean is

(2.27)    E(θ | x) = (p + x)/(p + q + n) = [(p + q)/(p + q + n)] E(θ) + [n/(p + q + n)] θ̂,

where θ̂ = x/n. The posterior variance is

(2.28)    v(θ | x) = (p + x)(q + n − x)/{(p + q + n)²(p + q + n + 1)} = E(θ | x)(1 − E(θ | x))/(p + q + n + 1).

But

(2.29)    v(θ) = E(θ)(1 − E(θ))/(p + q + 1),

so the posterior has higher relative precision than the prior.
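The conjugate update and the weighted-average form (2.27) are easy to check numerically. The prior Be(2, 3) and the data (12 successes in 20 trials) below are illustrative choices:

```python
# Beta prior + binomial data -> Beta posterior, as in (2.26)-(2.27).
p, q = 2.0, 3.0        # illustrative prior Be(p, q)
n, x = 20, 12          # illustrative data: 12 successes out of 20

p_post, q_post = p + x, q + n - x        # theta | x ~ Be(p + x, q + n - x)
post_mean = p_post / (p_post + q_post)

# (2.27): posterior mean = weighted average of the prior mean and the
# MLE x/n, with weights (p + q) and n respectively.
prior_mean, mle = p / (p + q), x / n
w = (p + q) / (p + q + n)
print(round(post_mean, 2), round(w * prior_mean + (1 - w) * mle, 2))  # prints 0.56 0.56
```

Both routes give 0.56 here: the prior, worth p + q = 5 pseudo-observations, pulls the MLE of 0.6 slightly towards the prior mean of 0.4.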
SPECIAL NOTE:
The classical theory of estimation regards an estimator as good if it is unbiased and has small variance, or more generally if its mean-squared-error (MSE) is small. The MSE is an average squared error, where the error is the difference between θ (i.e. y in the previous notation) and the estimate t. In accordance with classical theory, the average is taken with respect to the sampling distribution of the estimator.

In Bayesian inference, θ is a random variable, and it is therefore appropriate to average the squared error with respect to the posterior distribution of θ. Consider

(2.30)    E{(t − θ)² | x} = t² − 2t E(θ | x) + E(θ² | x) = {t − E(θ | x)}² + v(θ | x).

Therefore the estimate t which minimizes posterior expected squared error is t = E(θ | x), the posterior mean. The posterior mean can therefore be seen as an estimate of θ which is best in the sense of minimizing expected squared error. This is distinct from, but clearly related to, its more natural role as a useful summary of the location of the posterior distribution.

2.2. Weak Prior

If we reduce the prior relative precision to zero by setting p = q = 0, we get θ | x ∼ Be(x, n − x). Then E(θ | x) = θ̂ and v(θ | x) = θ̂(1 − θ̂)/(n + 1), results which nicely parallel standard frequentist theory.
• Notice that we are not really allowed to let either parameter of the beta distribution be zero. However, by making p and q extremely small, we get as close to these results as we like. We can think of p = q = 0 as a limiting (if strictly improper) case representing weak prior information.

Example 2.2.1. A doctor proposes a new treatment protocol for a certain kind of cancer. With current methods, about 40% of patients with this cancer survive six months after diagnosis. After one year of using the new protocol, 15 patients had been followed to the six-month mark, of whom 6 survived. After two years, a further 55 patients had been followed to the six-month mark, of whom 28 survived. So in total we have 34 patients surviving out of 70.

Let θ be the true success rate of the new treatment protocol, i.e. the true proportion of patients who survive 6 months; we wish to compare θ with the current survival rate of 40%.

Suppose that the doctor in charge has prior information leading her to assign a prior distribution with expectation E(θ) = 0.45, i.e. she expects a slight improvement over the existing protocol, from 40% to 45%; however, her prior standard deviation is 0.07, so v(θ) = 0.07² = 0.0049.
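The stated prior mean and standard deviation determine p and q, since E(θ) = p/(p + q) and v(θ) = E(θ){1 − E(θ)}/(p + q + 1). A short sketch, using only the numbers given in the example, recovers the doctor's prior and the posterior after the first year:

```python
# Recovering the Beta prior from E(theta) = 0.45 and sd = 0.07, then
# updating with the first year's data (6 of 15 survived).
E, sd = 0.45, 0.07
s = E * (1 - E) / sd**2 - 1        # p + q, from v = E(1-E)/(p+q+1)
p, q = E * s, (1 - E) * s
print(round(p, 2), round(q, 2))    # prints 22.28 27.23

n, x = 15, 6
p1, q1 = p + x, q + n - x
print(round(p1, 2), round(q1, 2))  # prints 28.28 36.23
```

The second line is the Be(28.28, 36.23) posterior that Section 2.3 takes as the prior for the second year's data.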

Figure 1. The triplot from the first year's data.

Figure 2. The triplot from two years' data.



2.3. Sequential Updating

In the last example we pooled the data from the two years and went back to the original prior distribution to use Bayes' theorem. We did not need to do this. A nice feature of Bayes' theorem is the possibility of updating sequentially, incorporating data as they arrive. In this case, consider the data to be just the new patients observed to a six-month follow-up during the second year. These comprise 55 patients, of whom 28 had survived. The doctor could consider these as the data x = 28 with n = 55. What would the prior information be? Clearly, the prior distribution should express her information prior to obtaining these new data, i.e. after the first year's data, so her prior for this second analysis is her posterior distribution from the first. This was Be(28.28, 36.23). Combining this prior with the new data gives the same posterior Be(28.28 + 28, 36.23 + 27) = Be(56.28, 63.23) as before. This simply confirms that we can get to the posterior distribution in either of two ways:
• In a single step, combining all the data with a prior distribution representing information available before any of the data were obtained.
• Sequentially, combining each item or block of new data with a prior distribution representing information available just before the new data were obtained (but after getting the data previously received).
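For the Beta-Binomial model the equivalence of the two routes is immediate, because each update just adds counts to the parameters. A minimal check, using the prior parameters 22.28 and 27.23 implied by the doctor's prior mean and standard deviation:

```python
# Sequential vs single-step updating with the example's numbers:
# prior Be(22.28, 27.23), year one 6/15, year two 28/55.
p0, q0 = 22.28, 27.23

batch = (p0 + 34, q0 + 70 - 34)        # pool both years: 34 of 70

p1, q1 = p0 + 6, q0 + 15 - 6           # year one -> Be(28.28, 36.23)
seq = (p1 + 28, q1 + 55 - 28)          # then year two -> Be(56.28, 63.23)

same = all(abs(b - s) < 1e-9 for b, s in zip(batch, seq))
print(same, round(seq[0], 2), round(seq[1], 2))   # prints True 56.28 63.23
```

The order of the blocks does not matter either; any partition of the data into blocks leads to the same Be(56.28, 63.23) posterior.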

2.4. Normal Sample

Let X_1, X_2, ..., X_n be from N(µ, σ²), with θ = (µ, σ²) the unknown parameters. The likelihood is

(2.31)    f(x | µ, σ²) = ∏_{i=1}^n (2πσ²)^{−1/2} exp{−(x_i − µ)²/(2σ²)} ∝ σ^{−n} exp[−{n(x̄ − µ)² + S²}/(2σ²)],

where S² = Σ_{i=1}^n (x_i − x̄)².

2.5. NIC distributions

For the prior distribution, we now need a joint distribution for µ and σ².

Definition 2.32. The normal-inverse-chi-squared (NIC) distribution has density

(2.33)    f(µ, σ²) ∝ (σ²)^{−(d+3)/2} exp[−{v^{−1}(µ − m)² + a}/(2σ²)],

where a > 0, d > 0 and v > 0.

The following facts are easy to derive about the NIC(m, v, a, d) distribution.
(a) The conditional distribution of µ given σ² is N(m, vσ²), so E(µ | σ²) = m and v(µ | σ²) = vσ².
(b) The marginal distribution of σ² is such that aσ^{−2} ∼ χ²_d. We say that σ² has the inverse-chi-squared distribution IC(a, d). We have E(σ²) = a/(d − 2) if d > 2, and v(σ²) = 2a²/{(d − 2)²(d − 4)} if d > 4.
(c) The conditional distribution of σ² given µ is IC(v^{−1}(µ − m)² + a, d + 1), and in particular E(σ² | µ) = {v^{−1}(µ − m)² + a}/(d − 1), provided d > 1.
(d) The marginal distribution of µ is such that (µ − m)√{d/(av)} ∼ t_d. We say that µ has the t-distribution t_d(m, av/d). We have E(µ) = m if d > 1, and v(µ) = av/(d − 2) if d > 2.

2.6. Posterior

Supposing then that the prior distribution is NIC(m, v, a, d), we find

(2.34)    f(µ, σ² | x) ∝ (σ²)^{−(d+n+3)/2} exp{−Q/(2σ²)},

where Q = v^{−1}(µ − m)² + a + n(x̄ − µ)² + S² is a quadratic expression in µ. After completing the square, we see that µ, σ² | x ∼ NIC(m*, v*, a*, d*), where m* = (v^{−1}m + nx̄)/(v^{−1} + n), v* = (v^{−1} + n)^{−1}, a* = a + S² + (x̄ − m)²/(n^{−1} + v), and d* = d + n. To interpret these results, note first that the posterior mean of µ is m*, which is a weighted average of the prior mean m and the usual data-only estimate x̄, with weights v^{−1} and n.

The posterior mean of σ² is a*/(d* − 2), which is a weighted average of three terms: the prior mean a/(d − 2) with weight (d − 2), the usual data-only estimate S²/(n − 1) with weight (n − 1), and (x̄ − m)²/(n^{−1} + v) with weight 1.
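The update is again short enough to transcribe directly. The hyperparameters and data summaries below are invented for illustration:

```python
# NIC(m, v, a, d) prior + normal sample -> NIC(m*, v*, a*, d*) posterior,
# transcribing the formulas of Section 2.6. All numbers are illustrative.
m, v, a, d = 0.0, 1.0, 2.0, 3.0
n, xbar, S2 = 10, 1.5, 8.0             # n, sample mean, S^2 = sum (xi - xbar)^2

v_star = 1.0 / (1.0 / v + n)
m_star = (m / v + n * xbar) / (1.0 / v + n)
a_star = a + S2 + (xbar - m) ** 2 / (1.0 / n + v)
d_star = d + n

# Posterior means: E(mu | x) = m*, and E(sigma^2 | x) = a*/(d* - 2),
# valid here because d* = 13 > 2.
print(round(m_star, 4), round(a_star / (d_star - 2), 4))   # prints 1.3636 1.095
```

Here the posterior mean of µ (1.3636) sits between the prior mean 0 and x̄ = 1.5, closer to x̄ because the data weight n = 10 exceeds the prior weight v^{−1} = 1.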

2.7. Weak prior

We clearly obtain weak prior information about µ by letting v go to infinity, i.e. v^{−1} → 0. Then m* = x̄, v* = 1/n, and a* = a + S², because the third term in a* disappears.

To obtain weak prior information also about σ², it is usual to set a = 0 and d = −1. Then a* = S² and d* = n − 1. The resulting inferences match the standard frequentist results very closely with these parameters, since we have

(2.35)    (µ − x̄)√n / {S/√(n − 1)} ∼ t_{n−1},

(2.36)    S²/σ² ∼ χ²_{n−1}.

Exactly the same distributional statements underlie standard frequentist inference in this problem.
LECTURE 3
Inference

Summarisation: All inferences are derived from the posterior distribution. In frequentist statistics, there are three kinds of inference:
• Point estimation: unbiasedness, minimum variance estimation.
• Interval estimation: credible intervals in Bayesian inference, confidence intervals in the frequentist approach.
• Hypothesis testing: significance tests.
In Bayesian inference, the posterior distribution expresses all that is known about θ, so we use appropriate summaries of the posterior distribution to describe the main features of what we now know about θ.

Plots: We can draw the posterior density directly.

Figure 1. A posterior density plot.

For a bivariate parameter, we can still usefully draw the density as a perspective plot or a contour plot.

Figure 2. (a) Perspective plot; (b) Contour plot.

Figure 3. Marginal densities.

3.1. Shape

In general, plots illustrate the shape of the posterior distribution. Important features of shape are modes (and antimodes), skewness and kurtosis (peakedness or heavy tails). Quantitative summaries of shape are needed to supplement the visual impression of modes (and antimodes). The first task is to identify turning points of the density, i.e. solutions of f′(θ) = 0. Such points include local maxima and minima of f(θ), which we call modes and antimodes, respectively.

A point θ₀ is characterized as a mode if f′(θ₀) = 0 and f″(θ₀) < 0, whereas it is an antimode if f′(θ₀) = 0 and f″(θ₀) > 0. Any point θ₀ for which f″(θ₀) = 0 is a point of inflection of the density (whether or not f′(θ₀) = 0).
Example 3.1.1. Consider the gamma density

(3.1)    f(θ) = [a^b/Γ(b)] θ^{b−1} e^{−aθ},    θ > 0,

where a, b are positive constants. Then

(3.2)    f′(θ) = [a^b/Γ(b)] {(b − 1) − aθ} θ^{b−2} e^{−aθ},

(3.3)    f″(θ) = [a^b/Γ(b)] {a²θ² − 2a(b − 1)θ + (b − 1)(b − 2)} θ^{b−3} e^{−aθ}.

So from f′(θ), there is a turning point at θ = (b − 1)/a. For b ≤ 1, f′(θ) < 0 for all θ ≥ 0, so f(θ) is monotonic decreasing and the mode is at θ = 0. For b > 1, f(θ) → 0 as θ → 0, so the point θ = 0 is not a mode. In this case, f′(θ) > 0 for θ < (b − 1)/a and f′(θ) < 0 for θ > (b − 1)/a; therefore θ = (b − 1)/a is the mode. Looking at f″(θ), the quadratic expression has roots at θ = {(b − 1) ∓ (b − 1)^{1/2}}/a. Therefore, for b > 1, these are the points of inflection.
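The mode formula is easy to confirm numerically; the values a = 2 and b = 5 below are arbitrary illustrative choices:

```python
import math

# Numerical check of Example 3.1.1: for b > 1 the mode of the Gamma
# density with shape b and rate a is (b - 1)/a. Here a = 2, b = 5,
# so the mode should be at theta = 2.
a, b = 2.0, 5.0
f = lambda t: t ** (b - 1) * math.exp(-a * t)   # unnormalised density

grid = [i * 0.001 for i in range(1, 10001)]     # theta in (0, 10]
mode = max(grid, key=f)
print(round(mode, 3), (b - 1) / a)              # prints 2.0 2.0
```

The unnormalised density suffices here, since the constant a^b/Γ(b) does not move the location of the maximum.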
Example 3.1.2. Consider the mixture of two normal distributions

(3.4)    f(θ) = (0.8/√(2π)) exp(−θ²/2) + (0.2/√(2π)) exp{−(θ − 4)²/2},

i.e. N(0, 1) with weight 0.8 and N(4, 1) with weight 0.2. Then

(3.5)    f′(θ) = −(0.8θ/√(2π)) exp(−θ²/2) − (0.2(θ − 4)/√(2π)) exp{−(θ − 4)²/2}.

For θ ≤ 0 we have f′(θ) > 0, and for θ ≥ 4 we have f′(θ) < 0, so all turning points lie in (0, 4); numerically, they are at θ = 0.00034, 2.46498 and 3.9945. Also,

(3.6)    f″(θ) = (0.8(θ² − 1)/√(2π)) exp(−θ²/2) + (0.2(θ² − 8θ + 15)/√(2π)) exp{−(θ − 4)²/2}.

This is positive for θ ≤ −1, for (approximately) 1 ≤ θ ≤ 3, and for θ ≥ 5, confirming that the middle turning point is an antimode. Calculating f″(θ) at the other turning points confirms them to be modes. Finally, the points of inflection are at θ = −0.99998, θ = 0.98254, θ = 3.17903 and θ = 4.99971.
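The quoted turning points can be reproduced by scanning f′(θ) for sign changes on a fine grid:

```python
import math

# Locating the turning points of the mixture in Example 3.1.2 by scanning
# f'(theta) of (3.5) for sign changes over a fine grid.
def fprime(t):
    return (-0.8 * t * math.exp(-0.5 * t * t)
            - 0.2 * (t - 4) * math.exp(-0.5 * (t - 4) ** 2)) / math.sqrt(2 * math.pi)

ts = [-2 + i * 1e-4 for i in range(80001)]       # theta in [-2, 6]
vals = [fprime(t) for t in ts]
roots = [round((t0 + t1) / 2, 5)
         for t0, t1, v0, v1 in zip(ts, ts[1:], vals, vals[1:])
         if v0 == 0.0 or v0 * v1 < 0]
print(roots)   # three roots, close to 0.00034, 2.46498 and 3.9945
```

A bisection or Newton step from each bracketing pair would refine these to the quoted precision; the crude grid already agrees to about four decimal places.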
Figure 4. Plot of the mixture of two normal distributions.

3.2. Visualizing multivariate densities

Turning points:
At a mode, or at a turning point generally, the gradients of the function in all directions are zero. Therefore, the turning points are solutions of the simultaneous equations

(3.7)    0 = ∂f(θ, φ)/∂θ = ∂f(θ, φ)/∂φ.

The turning points may be classified by examining the symmetric matrix F″(θ, φ) of second-order partial derivatives. For instance, for a bivariate density f(θ, φ),

(3.8)    F″(θ, φ) = [ ∂²f/∂θ²    ∂²f/∂θ∂φ ;  ∂²f/∂φ∂θ    ∂²f/∂φ² ].

F″(θ, φ) is known as the Hessian matrix. The second derivative of f in any direction t can be obtained from F″(θ, φ); at a mode, this must be negative in all directions, so the Hessian matrix is negative definite. Similarly, at an antimode, it is positive definite. In the intermediate case, where F″(θ, φ) is indefinite, i.e. has both positive and negative eigenvalues, we have a saddle point.

More generally, the plane divides into regions where F″(θ, φ) is positive definite, regions where it is negative definite, and regions of indefinite curvature. On the boundaries between these regions, one eigenvalue of F″(θ, φ) is zero, so all points on such boundaries are inflection points. A point of inflection corresponds to the second derivative being zero in some direction t; therefore, inflection points are characterized by F″(θ, φ) being singular. In more than two dimensions, we can further subdivide the regions of indefinite curvature according to how many positive eigenvalues F″(θ, φ) has, and all these subregions are also separated by inflection boundaries.
Location:
A plot gives a good idea of location, but the conventional location measures for distributions are also useful. These include the mean, mode and median.
Dispersion:
The usual dispersion measure is the variance, or for a multivariate distribution the variance-covariance matrix.
Dependence:
It is important with multivariate distributions to summarize the dependence between individual parameters. This can be done with correlation coefficients, but plots of regression functions (conditional mean functions) can be more informative.

3.3. Informal Inferences

(a) Point estimation: The obvious point estimate of θ is its posterior mean θ̂ = E(θ | x). Modes and medians are also natural point estimates, and they all have intuitively different interpretations: the mean is the expected value, the median is the central value, and the mode is the most probable value.
(b) Interval estimation: If asked to provide an interval in which θ probably lies, we can readily derive such a thing from its posterior distribution. For instance, in the density shown in Figure 1, there is a probability 0.05 to the left of θ = 3.28 and also 0.05 to the right of θ = 11.84, so the interval (3.28, 11.84) has 90% posterior probability for θ. We call such an interval a credible interval.
• If a frequentist had found this interval, it would mean that if we repeatedly drew samples of data from the same population, and applied the rule that was used to derive this particular interval to each of those datasets, then 90% of those intervals would contain θ.
• In the Bayesian approach, there is simply a posterior probability of 0.9 that θ lies between 3.28 and 11.84.

Definition 3.9. A 100(1 − α)% credible set for θ is a subset C such that

(3.10)    1 − α ≤ P(C | y) = ∫_C p(θ | y) dθ,

where integration is replaced by summation for discrete components.

Definition 3.11. Exact coverage of (1 − α) can be achieved by the highest posterior density (HPD) credible set, defined as the set

(3.12)    C = {θ ∈ Θ : p(θ | y) ≥ k(α)},

where k(α) is the largest constant satisfying P(C | y) ≥ 1 − α.

For a two-sided credible set, we can generally take the α/2 and (1 − α/2) quantiles of p(θ | y) as our 100(1 − α)% credible set for θ. This equal-tail credible set will be equal to the HPD credible set if the posterior is symmetric and unimodal, but will be a bit wider otherwise.

(c) Evaluating hypotheses: Suppose we wish to test a hypothesis H which asserts that θ lies in some region A. The Bayesian way to test the hypothesis is simply to calculate the (posterior) probability that it is true: P(θ ∈ A | x).
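An equal-tail credible interval only needs posterior quantiles, which can be read off a numerically accumulated CDF. The Gamma density below (shape 4, rate 0.5) is an arbitrary stand-in for p(θ | y), not the density plotted in Figure 1:

```python
import math

# Equal-tail 90% credible interval for an illustrative Gamma(4, 0.5)
# "posterior": accumulate a crude CDF, then take the 5% and 95% quantiles.
a, b = 4.0, 0.5
f = lambda t: b ** a * t ** (a - 1) * math.exp(-b * t) / math.gamma(a)

h = 0.001
grid = [i * h for i in range(40001)]     # theta in [0, 40]
cdf, c = [], 0.0
for t in grid:
    c += f(t) * h                        # running Riemann-sum integral
    cdf.append(c)

lo = next(t for t, pr in zip(grid, cdf) if pr >= 0.05)
hi = next(t for t, pr in zip(grid, cdf) if pr >= 0.95)
print(round(lo, 2), round(hi, 2))        # prints 2.73 15.51
```

Because this Gamma density is right-skewed, the equal-tail interval found here is a little wider than the HPD set would be, as noted above.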

3.4. Multivariate inference


All the above treated θ as a scalar parameter. If we have a vector θ,
then in general we can consider inference of the above forms about any
scalar function φ = g(θ). The inferences are then derived simply from
the marginal posterior distribution of φ.
LECTURE 4
Formal Inference
Suppose that we want to answer a question that falls neatly into the
frequentist point estimation framework: "What is the best estimate of
θ?". In frequentist theory, we need to be explicit about what we
would regard as good properties for an estimator in order to identify a
best one.
The Bayesian approach also needs to know what makes a good estimate
before an answer can be given. This falls into the framework of formal
inference.
Formally, we seek the estimate that minimises expected squared
error, where the expectation is taken over the posterior distribution.
We want to minimise:
(4.1)    E((θ̂ − θ)² | x) = θ̂² − 2θ̂E(θ | x) + E(θ² | x)
                         = (θ̂ − E(θ | x))² + v(θ | x).
So the estimate θ̂ that minimises this expected squared error is θ̂ =
E(θ | x).
4.1. Utility and decisions
Formal inference aims to obtain optimal answers to inference questions.
This is done with reference to a measure of how good or bad the various
possible inferences would be deemed to be if we knew the true value
of θ. This measure is a utility function. Formally, u(d, θ) defines the
value of inference d if the true value of the parameter is θ. Formal
inference casts an inference problem as a decision problem. A decision
problem is characterized by:
• a set Ω of possible parameter values
• a probability distribution for θ ∈ Ω
• a set D of possible decisions
• a utility function u(d, θ) for d ∈ D and θ ∈ Ω.
The solution is
(4.2)    d_opt = arg max_d E_θ(u(d, θ)).
Here the distribution of θ is its prior distribution.
• In inference problems, we generally define a measure of badness
of an inference, which we call a loss function L(d, θ). We can
simply define utility to be negative loss, and then the optimal
inference is the one which minimises posterior expected loss.
4.1.1. Formal Point Estimation
The set D is the set of all possible values of θ. We have seen that if we
use squared error loss (which is implicitly the measure used by frequentists
in considering the mean squared error or the variance of an unbiased
estimator), formally defining L(θ̂, θ) = (θ̂ − θ)², then the posterior mean
is the optimal estimator. If we use absolute error loss L(θ̂, θ) = |θ̂ − θ|,
then the optimal estimator is the posterior median.
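As a quick numerical check of these two results, the sketch below draws a skewed stand-in posterior sample (a Gamma sample, purely illustrative) and compares expected losses at the posterior mean and median:

```python
import random
import statistics

# Hypothetical skewed "posterior sample" for theta; a Gamma draw stands in
# for samples from f(theta | x). All values here are illustrative.
random.seed(0)
sample = [random.gammavariate(2.0, 3.0) for _ in range(20000)]

post_mean = statistics.fmean(sample)
post_median = statistics.median(sample)

def exp_sq_loss(est):
    # Monte Carlo estimate of E((est - theta)^2 | x)
    return statistics.fmean((est - t) ** 2 for t in sample)

def exp_abs_loss(est):
    # Monte Carlo estimate of E(|est - theta| | x)
    return statistics.fmean(abs(est - t) for t in sample)

# The posterior mean beats the median under squared error loss ...
assert exp_sq_loss(post_mean) < exp_sq_loss(post_median)
# ... while the posterior median wins under absolute error loss.
assert exp_abs_loss(post_median) < exp_abs_loss(post_mean)
```

On a symmetric posterior the two estimates coincide; the skewed sample is chosen so the difference is visible.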
4.1.2. Formal Interval Estimation
The possible inferences now are intervals or, more generally, subsets of
possible values of θ. A loss function will penalize an interval if it fails
to contain the true value of θ. So the optimal interval has the form:
(4.3)    d_opt = {θ : f (θ | x) ≥ t},
where t is chosen to obtain the desired P (θ ∈ d_opt | x). Such a credible
interval is called a highest (posterior) density interval (HPD interval
or HDI).
Example 4.1.1. The 90% credible interval shown in Figure 1 is not an
HPD interval. The upper and lower limits of an HPD interval should
have the same posterior density, but the density at θ = 3.28
is higher than at θ = 11.84. The 90% highest density interval is
actually (2.78, 11.06).
4.2. Formal Hypothesis Testing
If we really need to decide whether (to act as if) the hypothesis that
θ ∈ A is true, there are just two inferences: d0 is to say it is true, while d1
Figure 1. A posterior density plot

says it is false. The loss function will take the form:
(4.4)    L(d0, θ) = 0 if θ ∈ A,
                  = 1 if θ ∉ A;
(4.5)    L(d1, θ) = k if θ ∈ A,
                  = 0 if θ ∉ A,
where k defines the seriousness of the first kind of error relative
to the second. Then E_θ(L(d0, θ)) = P (θ ∉ A | x), while E_θ(L(d1, θ)) =
kP (θ ∈ A | x). The optimal decision is to select d0 (say that H is true)
if its probability P (θ ∈ A | x) exceeds 1/(k + 1). The greater the relative
seriousness k of the first kind of error, the more willing we are to
accept H.
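The 1/(k + 1) threshold is simple to encode; the function name and the example probabilities below are hypothetical, chosen just to exercise the rule:

```python
def bayes_decision(post_prob_A, k):
    """Return 'd0' (accept H: theta in A) if P(theta in A | x) > 1/(k+1),
    else 'd1'. k is the seriousness of wrongly rejecting H relative to
    wrongly accepting it."""
    return "d0" if post_prob_A > 1.0 / (k + 1.0) else "d1"

# With k = 1 (errors equally serious) we accept H only when its
# posterior probability exceeds 1/2.
assert bayes_decision(0.6, 1) == "d0"
assert bayes_decision(0.4, 1) == "d1"
# With k = 9, even a modest posterior probability of 0.2 > 1/10 favours H.
assert bayes_decision(0.2, 9) == "d0"
```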

4.3. Nuisance Parameters
In any inference problem, the parameter(s) that we wish to make
inference about is (are) called the parameter(s) of interest, and the
remaining components of θ are called nuisance parameters.
Example 4.3.1. If we have a sample from N (µ, σ²) and we wish to
make inference about µ, then the nuisance parameter is σ².
If θ = (φ, ψ) and ψ is the vector of nuisance parameters, then
inference about φ is made from the marginal posterior distribution of φ.
4.4. Transformation
Let φ = g(θ). If θ̂ is an estimate of θ, is g(θ̂) the appropriate estimate of φ?
This depends on the kind of inference being made. In the particular
case of point estimation, the posterior mean is not invariant in
this way.
Example 4.4.1. If φ = θ², then
(4.6)    E(φ | x) = E(θ² | x) = v(θ | x) + E(θ | x)² ≥ E(θ | x)².
The mode is not invariant to transformations, but the median is
invariant, at least to 1–1 transformations.
Interval estimates are also invariant to 1–1 transformations in the
sense that if [a, b] is a 90% interval for θ, then [g(a), g(b)] is a 90%
interval for φ if g is a monotone increasing function. But if [a, b] is a 90%
HPD interval for θ, is [g(a), g(b)] an HPD interval for φ?
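A small simulation makes the non-invariance of the posterior mean, and the invariance of the median, concrete; the Beta sample standing in for a posterior is illustrative only:

```python
import random
import statistics

random.seed(1)
# Stand-in posterior sample for theta (illustrative Beta draws on [0, 1]).
theta = [random.betavariate(2, 5) for _ in range(50000)]
phi = [t ** 2 for t in theta]  # phi = g(theta) = theta^2

mean_theta = statistics.fmean(theta)
mean_phi = statistics.fmean(phi)

# E(phi | x) = v(theta | x) + E(theta | x)^2 >= E(theta | x)^2, so the
# posterior mean of phi is NOT the square of the posterior mean of theta.
assert mean_phi > mean_theta ** 2

# The median, by contrast, commutes with this monotone transformation.
med_theta = statistics.median(theta)
med_phi = statistics.median(phi)
assert abs(med_phi - med_theta ** 2) < 1e-3
```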
4.5. The prior distribution
The nature of probability:
(a) Frequency probability: Frequentist statistics uses the familiar
idea that the probability of an event is the limiting relative frequency
with which that event would occur in an infinite sequence of
repetitions. For this definition of probability to apply, it is
necessary for the event to be, at least in principle, repeatable.
(b) Personal probability: Bayesian statistics is based on defining
the probability of a proposition to be a measure of a person's degree of belief
in the truth of that proposition.
In the Bayesian framework, wherever there is uncertainty there is
probability.
In particular, parameters have probability distributions.
4.6. Subjectivity
The main criticism of Bayesian methods is the subjectivity introduced
by the prior density.
If the data are sufficiently strong, the remaining element of personal
judgement will not matter, because all priors based on reasonable in-
terpretation of the prior information will lead to effectively the same
posterior inferences. Then we can claim robust conclusion on the basis
of the synthesis of prior information and data.
If the data are not that strong, then we do not yet have enough
scientific evidence to reach an objective conclusion. Any method which
claims to produce a definitive answer in such a situation is misleading,
so this is actually a strength of the Bayesian approach.
4.7. Noninformative Priors
The basis of this is that if we have a completely flat prior distribution
such that f (θ) is a constant, then the posterior density is proportional
to the likelihood, and inferences will be based only on the data. If we
can do this, we can get the other benefits of Bayesian analysis, such as
having more meaningful inferences that actually answer the question,
but without the supposed disadvantage of subjectivity.
The main problem with this neat solution is that it cannot be
applied consistently.
Example 4.7.1. The density f (θ) = 1 for all θ ∈ [0, 1], the uniform
distribution, is supposed to represent complete ignorance about θ.
If we are completely ignorant about θ, then we are equally ignorant
about φ = θ², which also takes values in [0, 1].
But the implied distribution f (φ) = 1 is not consistent with the
previous specification of f (θ) = 1. The uniform prior distribution for
θ implies that φ = θ² should have the density f (φ) = 1/(2√φ). Conversely,
if φ has a uniform prior distribution, then the implied prior for θ has
density f (θ) = 2θ.
In general, a uniform prior for θ translates into a non-uniform prior
for any function of θ.
Another complication is that if the range of possible values of θ
is unbounded, then we cannot properly give it a uniform distribution.
For instance, if θ ∈ [0, ∞) and we try to define a prior distribution
f (θ) = c for some constant c, then there is no value of c that will make
this density integrate to 1. For c = 0, it integrates to 0, and for any
positive c it integrates to infinity. In these situations, we appeal to
proportionality and simply write f (θ) ∝ 1.
A distribution expressed as f (θ) ∝ h(θ), when there cannot be any
proportionality constant that would make this into a proper density
function, is called an improper distribution. This arises whenever the
integral of h(θ) over the range of possible values of θ does not converge.
Example 4.7.2. For a parameter θ ∈ [0, 1], three favourite recommendations
are f (θ) = 1, f (θ) = π^{−1} θ^{−1/2} (1 − θ)^{−1/2} and f (θ) ∝
θ^{−1} (1 − θ)^{−1}, the last of these being improper. We can identify these
as the Be(1, 1), Be(1/2, 1/2) and Be(0, 0) distributions.
Improper distributions are not in fact usually much of a problem,
since we can appeal to proportionality. That is, the absence of a well-defined
proportionality constant is ignored and assumed to cancel in
the proportionality constant of Bayes' theorem [2]. In effect, we are
obtaining the limit of the posterior distribution as we go through a
range of increasingly flat priors towards the uniform limit.
(a) The posterior distribution may also be improper. In this case,
technically, the limit of the above process is not well-defined.
Improper prior distributions should never be used when the
resulting posterior distribution is improper, so it is important
to verify propriety of the posterior.
(b) When comparing different models for the data, improper dis-
tributions always lead to undefined model comparisons. This
is an area outside the scope of this course, but very important
in practice.
4.8. Informative Priors
So in specifying an informative prior distribution, (a) we specify values
for whatever summaries best express the features of the prior informa-
tion, then (b) we simply choose any conventional f (θ) that has those
summaries.
Example 4.8.1. I wish to formulate my prior beliefs about number
N of students who will turn up to one of my lectures. I first ask
myself what my best estimate would be, and I decide on 38, so I set
E(N ) = 38. I next ask myself how far wrong this estimate might be.
I decide that the actual number could be as high as 48 or as low as
30, but I think the probability of the actual number being outside that
range is small, maybe only 10%. Now a convenient prior distribution
that matches these summaries is the Poisson distribution with mean 38:
it gives P (30 ≤ N ≤ 48) = 0.87, which seems a good
enough fit to my specified summaries.
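The quoted probability can be checked by summing the Poisson pmf directly; the sketch below uses only the standard library, computing the pmf in log space for stability:

```python
import math

lam = 38.0  # elicited prior mean E(N) = 38

def poisson_pmf(n, lam):
    # exp(n*log(lam) - lam - log(n!)), evaluated in log space
    return math.exp(n * math.log(lam) - lam - math.lgamma(n + 1))

# P(30 <= N <= 48) under the Poisson(38) prior
prob = sum(poisson_pmf(n, lam) for n in range(30, 49))
assert 0.85 < prob < 0.89  # close to the 0.87 quoted in the notes
```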
4.9. Prior Choices
There are several alternative recommendations for overcoming the
problems of improper priors.
(a) Jeffreys Prior:
Let I(θ) be the Fisher information:
(4.7)    I(θ) = −E[∂² log f (x | θ) / ∂θ²].
In the case of a vector parameter, I(θ) is the matrix formed
as minus the expectation of the matrix of second-order partial
derivatives of log f (x | θ). The Jeffreys prior distribution is
then:
(4.8)    f0(θ) ∝ |I(θ)|^{1/2}.

Example 4.9.1. If x1, x2, ..., xn are normally distributed with
mean θ and known variance v, then
(4.9)    f (x | θ) ∝ exp{−n(x̄ − θ)²/(2v)}.
What is the Jeffreys prior for this distribution?
Solution:
(4.10)    log f (x | θ) = −n(x̄ − θ)²/(2v) + const.
(4.11)    (d²/dθ²) log f (x | θ) = −n/v.
Therefore,
(4.12)    I(θ) = −E[(d²/dθ²) log f (x | θ)] = n/v.
As a result,
(4.13)    f0(θ) ∝ √(I(θ)) = √(n/v) ∝ 1,
so the Jeffreys prior for a normal mean is flat.
Example 4.9.2. If x1, x2, ..., xn are distributed as N (µ, σ²)
with θ = (µ, σ²), then
(4.14)    f (x | θ) ∝ σ^{−n} exp{−n(s + (x̄ − µ)²)/(2σ²)},
where s = Σ(xi − x̄)²/n. Then what is the Jeffreys prior f0(µ, σ²)?
Solution:
Example 4.9.3. If x1, x2, ..., xn are normally distributed with
known mean m and variance θ, then
(4.15)    f (x | θ) ∝ θ^{−n/2} exp{−s/(2θ)},
where s = Σ(xi − m)². Then what is the Jeffreys prior for θ?
Solution:
(4.16)    log f (x | θ) = −(n/2) log θ − s/(2θ) + const.
A number of objections can be made to the Jeffreys prior, the
most important of which is that it depends on the form of the
data. The prior distribution should only represent the prior
information, and not be influenced by what data are to be
collected.

(b) Maximum Entropy:
The entropy H(f ) = − ∫ f (θ) log f (θ) dθ of the density f (θ)
can be thought of as a measure of how uninformative f (θ) is about
θ. For if we try to convey our information about θ as a general
form of inference in the scoring rule framework, H(f ) is the
lowest obtainable expected loss. If H(f ) is high, then the best
decision is still poor. So to represent prior ignorance we could
use the prior density f (θ) which maximizes the entropy.
Example 4.9.4. Suppose that θ is discrete with possible values
θ1, θ2, ..., θk. The prior distribution with maximum entropy
will then maximize F = −Σᵢ pᵢ log pᵢ + λ(Σᵢ pᵢ − 1), where λ is a
Lagrange multiplier, so ∂F/∂pᵢ = −log pᵢ − 1 + λ. Equating this to
zero yields the solution pᵢ = k^{−1}, i = 1, 2, ..., k. That is, the
maximum entropy prior is the uniform distribution.
The primary criticism of this approach is that it is not invariant
under change of parametrization, the problem which
the Jeffreys prior was designed to avoid. In general, unrestricted
maximization of entropy leads to a uniform prior distribution,
which was shown to be sensitive to parametrization.

(c) Reference Prior: The expected amount of information provided
by observing x is given by
(4.17) H {f (θ)} − E [H {f (θ | x)}] ,
where the expectation is over the preposterior distribution of
f (x) of x. If the experiment yielding x were to be repeated,
giving a new observation independent of x given θ and with
the same distribution, the posterior distribution would be ex-
pected to show a further reduction in entropy, representing the
expected information in the second observation. If this were
repeated indefinitely, we would eventually learn θ exactly and
so remove all the entropy in the original prior distribution.
In the case of discrete θ taking a finite number of possible values,
this process reduces to maximizing the prior entropy, and so gives
the uniform distribution. This is not the case for continuous θ.
It can be shown that, under appropriate regularity conditions, the
reference prior distribution is the Jeffreys prior.
LECTURE 5
Structuring Prior Information
Independence: Suppose that x1, x2, ..., xn are a sample from the N (µ, 1)
distribution, which we write formally as:
(5.1)    xi | µ ∼ N (µ, 1), independent.
From a Bayesian perspective, what is meant here is conditional
independence. That is, the xi's are independent given µ.
Exchangeability: This is the Bayesian counterpart of independence in
the frequentist approach.

Definition 5.2. Random variables x1, x2, ..., xm are said to be exchangeable
if their joint distribution is unaffected by permuting the
order of the xi's.

So first consider the (marginal) distribution of x1. The definition
says that every one of the xi's must have the same marginal distribution,
because we can permute them so that any desired xi comes into
the first position in the sequence. So one implication of exchangeability
is that the random variables in question are identically distributed.
Next consider the joint distribution of x1 and x2. Exchangeability
means that every pair (xi, xj) (for i ≠ j) has the same bivariate
distribution as (x1, x2). In particular, the correlation between any pair of
random variables is the same.
And so it goes for higher-order joint distributions. The joint distribution
of (x1, x2, ..., xk) is the same as that of any other collection of k
distinct xi's. This is the meaning of exchangeability.

• In general, suppose that the xi's have a common distribution g(x | θ),
and are independent given θ. Then the joint density is:
(5.3)    f (x1, x2, ..., xm) = ∫ f (x1, x2, ..., xm, θ) dθ
                            = ∫ f (x1, x2, ..., xm | θ) f (θ) dθ
                            = ∫ ∏_{i=1}^{m} g(xi | θ) f (θ) dθ,

which is unaffected by permuting the xi's. Their common marginal
density is:
(5.4)    f (x) = ∫ g(x | θ) f (θ) dθ,
and the common distribution of any pair of the xi's is
(5.5)    f (x, y) = ∫ g(x | θ) g(y | θ) f (θ) dθ.

This is generally what a frequentist means by the xi's being iid,
or being a random sample from the distribution g(x | θ). From the
Bayesian perspective, all frequentist statements are conditional on the
parameters. So exchangeability is the Bayesian concept that corresponds
very precisely with the frequentist idea of a random sample.
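A minimal sketch of this mixture construction, with an illustrative Be(2, 2) mixing distribution: the resulting binary variables are exchangeable (identical marginals, a common pairwise distribution) but not independent, since P(x1 = 1, x2 = 1) = E(θ²) = 0.3 > 0.25 = P(x1 = 1)P(x2 = 1):

```python
import random
import statistics

random.seed(2)

def draw_sequence(m=3):
    # Draw theta from the mixing distribution f(theta), then
    # x_1, ..., x_m | theta iid Bernoulli(theta).
    theta = random.betavariate(2, 2)
    return [1 if random.random() < theta else 0 for _ in range(m)]

seqs = [draw_sequence() for _ in range(100000)]

# Identical marginals: each x_i has mean E(theta) = 1/2.
for i in range(3):
    assert abs(statistics.fmean(s[i] for s in seqs) - 0.5) < 0.01

# Exchangeable but NOT independent:
# P(x1 = 1, x2 = 1) = E(theta^2) = 0.3 > 0.25.
p11 = statistics.fmean(1 if s[0] == 1 and s[1] == 1 else 0 for s in seqs)
assert p11 > 0.27
```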

5.1. Binary Exchangeability
Theorem 5.6 (De Finetti, 1937). Let x1, x2, x3, ... be an infinite
sequence of exchangeable binary random variables. Then their joint
distribution is characterised by a distribution f (θ) for a parameter θ ∈
[0, 1] such that the xi's are independent given θ with P (xi = 1 | θ) = θ.
De Finetti's theorem says that if we have a sequence of binary
variables, so that each xi takes the value 1 (success) or 0 (failure), then
we can represent them as independent Bernoulli trials with probability
θ of success in each trial.

5.2. Exchangeable Parameters
Consider the simple one-way analysis of variance model:
(5.7)    yij ∼ N (µi, σ²),  i = 1, 2, ..., k;  j = 1, 2, ..., ni,  independent,
where θ = (µ1 , µ2 , ..., µk , σ 2 ). This says that we have k independent


normal samples, where the i-th sample has mean µi and size ni , and
where all the observations have a common variance σ 2 .
For this model, we need to specify a joint prior distribution for
µ1 , µ2 , ..., µk and σ 2 . Now when it comes to formulating a joint prior
distribution for many parameters, it is much easier if we can regard
them as independent. Then we could write:
(5.8)    f (µ1, µ2, ..., µk, σ²) = f (σ²) ∏_{i=1}^{k} f (µi),
and we would only need to specify the prior distribution of each pa-
rameter separately. Unfortunately, this is unlikely to be the case in
practice with this model.
The model is generally used when the samples are from a related
or similar populations. For instance, they might be weight gains of
pigs given k different diets, and µi is the mean weight gain with diet i.
In this sort of situation, the µi's would not be independent. Then we
could add to the "prior model":
(5.9)    µi | ξ, τ² ∼ N (ξ, τ²), independent,
which says that the µi's are drawn from a normal (assumed)
population with unknown mean ξ and unknown variance τ².
The prior distribution could then be completed by specifying a joint
prior distribution f (ξ, σ², τ²) for the remaining parameters.

5.3. Hierarchical Models
The kind of modelling seen in the previous section is called ”hierarchi-
cal”. In general, we can consider a model of the form:
• Data model: f (x | θ)
• First level of prior: f (θ | φ)
• Second level of prior: f (φ)
We often refer to φ as the hyperparameter(s).
• If we are only interested in the parameter θ of the original data
model,
Z
(5.10) f (θ) = f (θ | φ)f (φ)dφ.

• If we are instead interested in the hyperparameters φ,
(5.11)    f (x | φ) = ∫ f (x | θ) f (θ | φ) dθ,
(5.12)    f (θ, φ) = f (θ | φ) f (φ).
(5.13)    f (θ, φ | x) ∝ f (x | θ) f (θ | φ) f (φ),
(5.14)    f (θ | x) = ∫ f (θ, φ | x) dφ.
Shrinkage: The posterior distribution and posterior estimates of these
parameters will generally be closer together than their corresponding
data estimates. This phenomenon is known as "shrinkage".
Let w = (ξ, σ², τ²) and µ = (µ1, µ2, ..., µk). Then
(5.15)    f (µ | x, w) ∝ f (x | µ, σ²) f (µ | w)
                      ∝ ∏_{i=1}^{k} [ ∏_{j=1}^{n_i} exp{−(yij − µi)²/(2σ²)} ] exp{−(µi − ξ)²/(2τ²)}
                      = ∏_{i=1}^{k} f (µi | x, w),

where each of the f (µi | x, w) comes from the analysis of a single
normal sample given in Lecture 1. That is, conditional on w, the µi's
are independent N (m*i, v*i), where:
(5.16)    v*i = (ni σ⁻² + τ⁻²)⁻¹,
(5.17)    m*i = (ni σ⁻² + τ⁻²)⁻¹ (ni σ⁻² ȳi + τ⁻² ξ).
We can already see the shrinkage in the model, because the posterior
mean of each µi is a weighted average of its own data estimate ȳi and
the common value ξ. So the estimates are shrunk towards this common value.
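The formulas (5.16)–(5.17) can be exercised directly; the hyperparameter values and group data below are made up for illustration:

```python
# Posterior means for the group means mu_i, conditional on w = (xi, sigma2, tau2).
# All numbers here are illustrative.
sigma2, tau2, xi = 4.0, 1.0, 10.0
groups = [(8.0, 5), (12.5, 5), (10.2, 20)]  # (ybar_i, n_i)

post = []
for ybar, n in groups:
    v_star = 1.0 / (n / sigma2 + 1.0 / tau2)          # (5.16)
    m_star = v_star * (n / sigma2 * ybar + xi / tau2)  # (5.17)
    post.append(m_star)

# Shrinkage: every posterior mean lies strictly between ybar_i and xi ...
for (ybar, n), m in zip(groups, post):
    assert min(ybar, xi) < m < max(ybar, xi)

# ... and the group with the largest sample (n = 20) is shrunk by the
# smallest fraction of its distance to the common value xi.
frac = [abs(m - ybar) / abs(ybar - xi) for (ybar, n), m in zip(groups, post)]
assert frac[2] < frac[0] and frac[2] < frac[1]
```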
Example 5.3.1. Let's consider a simple regression situation in which
we have observations y1, y2, ..., yn at values x1 < x2 < ... < xn of the
explanatory variable. The usual linear regression model specifies
(5.18)    yi | α, β, σ² ∼ N (α + βxi, σ²).
• Data model:
• First level of prior:
• Second level of prior:
LECTURE 6
Sufficiency and Ancillarity
What happens if f (x | θ) does not depend on θ? Then the data are
completely uninformative.
Definition 6.1. t(x) is sufficient if, for any given θ, f (x | θ) is a
function only of t(x), apart from a multiplicative factor that can be any
function of x.
It means that we only need to know t(x) in order to obtain the
posterior distribution. It is sufficient to know t(x). Therefore the
posterior distribution is the same as if we had observed only t(x) rather
than the whole of x.
Suppose that s(x) represents all other information in x, so that x = (t(x), s(x)).
Then
(6.2)    f (x | θ) = f (t(x) | θ) f (s(x) | t(x), θ).
If t(x) is sufficient, f (s(x) | t(x), θ) must not depend on θ. So once we
know t(x), there is no information in s(x).
Definition 6.3. s(x) is ancillary if f (s(x) | θ) does not depend on θ.
Example 6.0.1. Let xi ∼ N (µ, σ²) with σ² known. Then xi − xj ∼
N (0, 2σ²) is ancillary for any i ≠ j.
• Let t(x) be the rest of the information in x, so that again we
have x = (t(x), s(x)). Then
(6.4)    f (x | θ) = f (s(x) | θ) f (t(x) | s(x), θ),
but now this implies that f (x | θ) ∝ f (t(x) | s(x), θ).

6.1. The Likelihood Principle
The likelihood principle asserts that inference should be based only on
the likelihood.

Example 6.1.1. Let x be the results of a fixed number n of
independent Bernoulli trials, each with success probability θ. Then
(6.5)    f_x(x | θ) = θ^r (1 − θ)^{n−r},
where r is the number of successes. If instead we observe only r ∼ Bi(n, θ), then
f_R(r | θ) = (n choose r) θ^r (1 − θ)^{n−r} ∝ θ^r (1 − θ)^{n−r}.
Alternatively, suppose we keep observing independent Bernoulli trials until we
get a fixed number r of successes, obtaining the r-th success on the n-th trial.
Then again
(6.6)    f_y(y | θ) = θ^r (1 − θ)^{n−r},
and n has the negative binomial distribution
(6.7)    f_N(n | θ) = (n − 1 choose r − 1) θ^r (1 − θ)^{n−r} ∝ θ^r (1 − θ)^{n−r}.
In every case the likelihood is proportional to θ^r (1 − θ)^{n−r}, so the
posterior distribution for θ is the same.

6.2. Identifiability
Let θ = (g(θ), h(θ)) and suppose that f (x | θ) depends only on g(θ). Then
f (θ | x) ∝ f (x | θ) f (θ) = f (x | g(θ)) f (θ)
(6.8)            = f (x | g(θ)) f (g(θ)) f (h(θ) | g(θ))
                 ∝ f (g(θ) | x) f (h(θ) | g(θ)).
This says that the posterior distribution of θ is made up of the posterior
distribution of g(θ) and the prior distribution of h(θ) given g(θ). So the
conditional posterior distribution of h(θ) given g(θ) is the same
as the prior. That is,
(6.9)    f (h(θ) | x, g(θ)) = f (h(θ) | g(θ)).
We say that h(θ) is not identifiable from these data. No matter how
much data we get, we cannot learn exactly what h(θ) is. With sufficient
data we can learn g(θ), but not h(θ).

6.3. Asymptotic Theory
Suppose that we have a sequence of iid observations x1 , x2 , x3 , ... and
suppose that xn = (x1 , x2 , ..., xn ) comprises the first n observations.
We can now consider a sequence of posterior distributions f (θ | x1 ),
f (θ | x2 ), f (θ | x3 ),...,f (θ | xn ),f (θ | xn+1 ). We wish to know how the
posterior distribution f (θ | xn ) behaves as n → ∞.
So as we get more data, we expect that the posterior will in some
sense converge to the true value. Also, the weight the posterior gives
to the data increases, and therefore we can expect that in the limit the
posterior will be insensitive to the prior.
Subject to some regularity conditions, the posterior distribution
concentrates around the true value of θ as n → ∞. The regularity
conditions are:
(a) Identifiability: the whole of θ needs to be identifiable.
(b) Prior positivity condition: the prior does not give
zero probability to the true value of θ.
(c) Continuity condition: we need a continuity condition for θ.

6.4. Preposterior Properties
Let X and Y be any two random variables. Then
(6.10) E(Y ) = E {E(Y | X)} ,

(6.11) v(Y ) = E {v(Y | X)} + v {E(Y | X)}


Let’s replace Y by the parameter vector θ and X by the data vector
X.
Remark. If we use the posterior mean E(θ | x) to estimate θ, then its
expected bias is 0.

6.5. Conjugate Prior Forms
Conjugacy is a joint property of the prior and the likelihood function
that provides a posterior from the same distributional family as
the prior.
Example 6.5.1. Conjugacy in exponential specifications. Let
(6.12)    f (x | θ) = θ exp{−θx},
where 0 ≤ x, 0 < θ. If θ ∼ Gamma(α, β), then
(6.13)    f (θ | α, β) = (β^α / Γ(α)) θ^{α−1} exp{−βθ},
where θ, α, β > 0.
Suppose we now observe x1, x2, ..., xn iid from this distribution. The
likelihood is
(6.14)    L(θ | x) = ∏_{i=1}^{n} θ e^{−θxi} = θ^n exp{−θ Σ xi}.
Thus,
(6.15)    π(θ | x) ∝ L(θ | x) f (θ | α, β)
                   = θ^n exp{−θ Σ xi} (β^α / Γ(α)) θ^{α−1} exp{−βθ}
                   ∝ θ^{α+n−1} exp{−θ(Σ xi + β)}.
This is the kernel of a Gamma(α + n, Σ xi + β) and therefore the
gamma distribution is shown to be conjugate to the exponential
likelihood function.
Table 1. Conjugate Prior Distributions

Likelihood Form      Conjugate Prior    Hyperparameters
Bernoulli            Beta               α > 0, β > 0
Binomial             Beta               α > 0, β > 0
Multinomial          Dirichlet          θj > 0, Σ θj = θ0
Negative Binomial    Beta               α > 0, β > 0
Poisson              Gamma              α > 0, β > 0
Exponential          Gamma              α > 0, β > 0
Gamma (incl. χ²)     Gamma              α > 0, β > 0
Normal for µ         Normal             µ ∈ R, σ² > 0
Normal for σ²        Inverse Gamma      α > 0, β > 0
Pareto for α         Gamma              α > 0, β > 0
Pareto for β         Pareto             α > 0, β > 0
Uniform              Pareto             α > 0, β > 0
LECTURE 7
Tackling Real Problems
There are various computational tools that are widely used in practical
Bayesian statistics; the most well-known one is Markov Chain Monte
Carlo, or MCMC. The basic idea is that we randomly draw a very
large sample Q(1), Q(2), ... from the posterior distribution f (θ | x). Given
such a sample, we can compute any inference we wish. If we want to make
inference about some derived parameter φ = g(θ), then φ(1) = g(Q(1)),
φ(2) = g(Q(2)), ..., is a sample from its posterior distribution.
The sample mean φ̄ is an estimate of E(φ | x). In principle, we could
draw such a sample using simple Monte Carlo sampling; that is, each
Q(j) is independently drawn from f (θ | x). There are algorithms for
efficiently drawing random samples from a wide variety of standard
distributions.

7.1. What is a Markov Chain?
A stochastic process is a consecutive set of random quantities defined
on some known state space Q, indexed so that the order is known:
{Q^[t], t ∈ T}. Here the state space (which is the parameter space for us)
is just the allowable range of values for the random vector of interest.
The state space Q is either discrete or continuous depending on how
the variable of interest is measured.
A Markov chain is a stochastic process with the property that at
time t in the series, the probability of making a transition to any new
state depends only on the current state of the process:
(7.1)    p(Q^[t] ∈ A | Q^[0], Q^[1], ..., Q^[t−2], Q^[t−1]) = p(Q^[t] ∈ A | Q^[t−1]),
where A is the identified set on the complete state space.

A fundamental concern is the transition process that defines the
probabilities of moving to other points in the state space, given the
current location of the chain. This structure is defined via the transition
kernel K, a general mechanism for describing the probabilities of moving to
some other specified state based on the current chain status. When the
state space is discrete, K is a k × k matrix for k discrete elements
in A, where each cell defines the probability of a state transition from
the first term in the parentheses to the second:
(7.2)    P_A = [ p(θ1, θ1) ... p(θ1, θk)
                 ...
                 p(θk, θ1) ... p(θk, θk) ]
The rows index the current state and the columns the destination state.
An important feature of the transition kernel is that the transition
probabilities between two selected states for an arbitrary number of steps
m can be calculated multiplicatively, summing the transition products
over all possible paths:
(7.3)    p^m(θ^[m] = y | θ^[0] = x) = Σ_{θ1} Σ_{θ2} ... Σ_{θ_{m−1}} p(x, θ1) p(θ1, θ2) ... p(θ_{m−1}, y).
So p^m is also a stochastic transition matrix.
Example 7.1.1. Consider the two-state transition matrix (rows index
the current state):
(7.4)    P = [ 0.8  0.2
               0.6  0.4 ]
Let the starting distribution be S0 = [0.5, 0.5]. To get the first state,
(7.5)    S1 = S0 P = [0.5(0.8) + 0.5(0.6), 0.5(0.2) + 0.5(0.4)] = [0.70, 0.30].
To get the second state,
(7.6)    S2 = S1 P = [0.74, 0.26],
and similarly
(7.7)    S3 = S2 P = [0.748, 0.252],
(7.8)    S4 = S3 P = [0.7496, 0.2504].
So the state proportions are converging to [0.75, 0.25], since the
transition matrix is pushing toward a steady state, or stationary
distribution, of the proportions. When we reach this distribution, all
future states are constant, that is, stationary.
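The iteration above can be reproduced directly (pure stdlib); the loop applies s_{t+1} = s_t P until the distribution settles:

```python
# Two-state chain from the example: rows are the current state,
# columns the destination state.
P = [[0.8, 0.2],
     [0.6, 0.4]]

def step(s, P):
    # One step of s_{t+1} = s_t P for a row vector s.
    return [sum(s[i] * P[i][j] for i in range(len(s))) for j in range(len(s))]

s = [0.5, 0.5]
for _ in range(50):
    s = step(s, P)

# The chain converges to the stationary distribution [0.75, 0.25] ...
assert abs(s[0] - 0.75) < 1e-9 and abs(s[1] - 0.25) < 1e-9
# ... which satisfies pi = pi P.
pi = [0.75, 0.25]
assert all(abs(step(pi, P)[j] - pi[j]) < 1e-12 for j in range(2))
```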
7.2. The Chapman-Kolmogorov Equations
These equations specify how successive events are bound together
probabilistically. If we abbreviate the left-hand side of expression (7.3):
(7.9)     p^{m1+m2}(x, y) = Σ_{all z} p^{m1}(x, z) p^{m2}(z, y)    (discrete case),
(7.10)    p^{m1+m2}(x, y) = ∫_{range z} p^{m1}(x, z) p^{m2}(z, y) dz    (continuous case).
In the discrete case, this is also equal to:
(7.11)    p^{m1+m2} = p^{m1} p^{m2} = p^{m1} p^{m2−1} p = p^{m1} p^{m2−2} p² = ...
Thus iterated probabilities can be decomposed into
segmented products in any way we like, depending on the interim step.

7.3. Marginal Distributions
The marginal distribution at some step m is found from the transition
kernel via
(7.12)    π^m(θ) = [p^m(θ1), p^m(θ2), ..., p^m(θk)],
the row vector of state probabilities at the m-th step. So the marginal
distribution at the first step of the Markov chain is given by
(7.13)    π¹(θ) = π⁰(θ) p,
where π⁰ is the initial starting distribution assigned to the chain and
p = P is the transition matrix. Thus
(7.14)    π^n = π^{n−1} p = π⁰ p^n,
(7.15)    π^m(θj) = Σ_{θi} π^{m−1}(θi) p(θi, θj).

7.4. General Properties of Markov Chains
(a) Homogeneity: A Markov chain is homogeneous if its transition
probabilities at step n do not depend on n, so the decision to
move at a given step is independent of the current point in time.
(b) Irreducibility: A Markov chain is irreducible if every point or
collection of points can be reached from every other point or
collection of points. So irreducibility implies the existence of a
path between any two points in the subspace.
(c) Recurrence: An irreducible Markov chain is called recurrent with
regard to a given state A, which is a single point or a defined
collection of points, if the probability that the chain occupies A
infinitely often over unbounded time is non-zero. If a subspace is
closed, finite, and irreducible, then all states within the subspace
are recurrent.
(d) Stationarity: Let π(θ) be the stationary distribution of the
Markov chain for θ on the state space, let p(θi, θj) be the
probability that the chain will move from θi to θj at some
arbitrary step t from the transition kernel, and let π^t(θ) be
the marginal distribution at step t. The stationary distribution
is then defined by:
(7.16)    Σ_{θi} π^t(θi) p(θi, θj) = π^{t+1}(θj)    (discrete case),
(7.17)    ∫ π^t(θi) p(θi, θj) dθi = π^{t+1}(θj)    (continuous case).
That is, π = πp. The marginal distribution remains fixed when
the chain reaches the stationary distribution.
Once the chain reaches its stationary distribution (also called its
invariant distribution, equilibrium distribution or limiting
distribution), it stays in this distribution and moves about or
"mixes" throughout the subspace according to the marginal
distribution π(θ) forever.
(e) Ergodicity: If the chain is irreducible, positive Harris recurrent
(i.e. recurrent under an unbounded continuous state space), and
aperiodic, then we call it ergodic. Ergodic Markov chains have
the property:
(7.18)    lim_{n→∞} p^n(θi, θj) = π(θj)
for all θi and θj in the subspace. Therefore in the limit, the
marginal distribution at one step is identical to the marginal
distributions at all other steps. The ergodic theorem is analogous
to the strong law of large numbers, but for Markov chains.
Thus suppose that θ_{i+1}, ..., θ_{i+n} are n values from a Markov
chain that has reached its ergodic distribution. A statistic of
interest, h(θ), can be calculated empirically:
(7.19)    ĥ(θ) = (1/n) Σ_{j=i+1}^{i+n} h(θj) ≈ E[h(θ)].
LECTURE 7. TACKLING REAL PROBLEMS 51

For a given empirical estimator \hat{h}(\theta_i) with bounded limiting
variance, we get the central limit theorem result

(7.20) \sqrt{n}\, \frac{\hat{h}(\theta_i) - E[h(\theta)]}{\sqrt{v(\hat{h}(\theta_i))}} \rightarrow N(0, 1) \quad \text{as } n \to \infty.

7.5. Noniterative Monte Carlo Methods


(a) Direct methods:
The most basic form of Monte Carlo integration is the following:
θ ∼ h(θ) and we seek γ = E[f(θ)] = \int f(\theta) h(\theta)\, d\theta. Then if
θ_1, ..., θ_N ∼iid h(θ), we have

(7.21) \hat{\gamma} = \frac{1}{N} \sum_{j=1}^{N} f(\theta_j),

which converges to E[f(θ)] with probability 1 as N → ∞ by the
strong law of large numbers. In our case, h(θ) is a posterior
distribution and γ is the posterior mean of f(θ). Hence the
computation of a posterior expectation requires only a sample of
size N from the posterior distribution.
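As a small illustration (not from the notes themselves), a minimal Monte Carlo estimate of γ = E[f(θ)] in Python: here the sampling density h is taken to be a standard normal and f(θ) = θ², so the true value of γ is 1.

```python
import random

def mc_expectation(f, draw, n=100_000, seed=0):
    """Estimate gamma = E[f(theta)] by averaging f over n i.i.d. draws from h."""
    rng = random.Random(seed)
    return sum(f(draw(rng)) for _ in range(n)) / n

# h = N(0, 1) and f(theta) = theta^2, so gamma = E[theta^2] = 1;
# the Monte Carlo error shrinks at rate O(1/sqrt(N)).
gamma_hat = mc_expectation(lambda t: t * t, lambda rng: rng.gauss(0.0, 1.0))
print(gamma_hat)  # close to 1
```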
(b) Indirect methods:
What if we cannot sample directly from the distribution? In this case, we
can use one of the following methods.
• Importance sampling: Suppose we wish to approximate a
posterior expectation, say

(7.22) E(f(\theta) \mid y) = \frac{\int f(\theta) L(\theta) \pi(\theta)\, d\theta}{\int L(\theta) \pi(\theta)\, d\theta}

where f is the function of interest and L is the likelihood
of the data y. By defining the weight function
w(\theta) = L(\theta)\pi(\theta) / g(\theta), we have

(7.23) E(f(\theta) \mid y) = \frac{\int f(\theta) w(\theta) g(\theta)\, d\theta}{\int w(\theta) g(\theta)\, d\theta} \approx \frac{\frac{1}{N} \sum_{j=1}^{N} f(\theta_j) w(\theta_j)}{\frac{1}{N} \sum_{j=1}^{N} w(\theta_j)}

where θ_j ∼iid g(θ). Here g(θ) is called the importance
function; how closely it resembles L(θ)π(θ) controls how
good the approximation in the equation is. If g(θ) is a
good approximation, the weights will all be roughly equal,
which in turn will minimize the variance of the numerator
and denominator.
If g(θ) is a poor approximation, many of the weights will
be close to zero, and thus a few θ_j's will dominate the
sums, producing an inaccurate approximation.
Thus in importance sampling, one chooses a known density
function g(θ) that is easy to sample from. The procedure
works best if g(θ) is similar in shape to the known kernel
of the posterior, L(θ)π(θ), with tails that do not decay
more rapidly than the tails of the posterior.
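The ratio estimator in (7.23) can be sketched as follows (a hypothetical example, not from the notes): the unnormalized kernel L(θ)π(θ) is taken to be a N(2, 1) kernel and the importance function g a heavier-tailed N(0, 3), so the estimated posterior mean should come out close to 2.

```python
import math
import random

rng = random.Random(1)

def kernel(t):
    # unnormalized posterior L(t)*pi(t): here a N(2, 1) kernel (assumed target)
    return math.exp(-0.5 * (t - 2.0) ** 2)

def g_pdf(t):
    # importance function: N(0, sd=3), heavier-tailed than the target
    return math.exp(-0.5 * (t / 3.0) ** 2) / (3.0 * math.sqrt(2.0 * math.pi))

# theta_j ~iid g, weights w(theta_j) = kernel(theta_j) / g(theta_j)
draws = [rng.gauss(0.0, 3.0) for _ in range(200_000)]
w = [kernel(t) / g_pdf(t) for t in draws]

# ratio estimator (7.23) for f(theta) = theta, i.e. the posterior mean
post_mean = sum(t * wi for t, wi in zip(draws, w)) / sum(w)
print(post_mean)  # close to 2
```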
• Rejection sampling: Here, instead of trying to approximate
the normalized posterior

(7.24) h(\theta) = \frac{L(\theta)\pi(\theta)}{\int L(\theta)\pi(\theta)\, d\theta},

we try to "blanket" it. That is, suppose there exists an
identifiable constant µ > 0 and a smooth density g(θ),
called the envelope function, such that L(θ)π(θ) < µg(θ)
for all θ.

Figure 1. Rejection sampling plot of L(θ)π(θ) and µg(θ): the envelope µg(θ) lies above the unnormalized posterior everywhere.

Example 7.5.1.
The rejection method proceeds as follows:

(a) Generate θ_j ∼ g(θ).
(b) Generate U ∼ Uniform(0, 1).
(c) If µ U g(θ_j) < L(θ_j)π(θ_j), accept θ_j; otherwise reject θ_j.
(d) Return to step (a) and repeat until the desired sample
{θ_j, j = 1, ..., N} is obtained. The members of
this sample will then be random samples from h(θ).

Like an importance sampling density, the envelope
density g should be similar to the posterior in general
appearance, but with heavier tails and sharper
infinite peaks, in order to assure that there are
sufficiently many rejection candidates available across
its entire domain. Also, µg is actually an "envelope"
for the unnormalized posterior Lπ.
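The four steps can be sketched in code (an illustrative target, not the one in the figure): here the unnormalized target L(θ)π(θ) is a standard normal kernel exp(−θ²/2), the envelope g is a Laplace density 0.5·exp(−|θ|), and µ = 2e^{1/2}, which dominates the kernel since (|θ| − 1)² ≥ 0 implies exp(−θ²/2) ≤ e^{1/2 − |θ|}.

```python
import math
import random

rng = random.Random(2)

def kernel(t):
    # unnormalized target L(t)*pi(t): a N(0, 1) kernel (assumed for illustration)
    return math.exp(-0.5 * t * t)

def laplace_draw():
    # inverse-CDF draw from the envelope density g(t) = 0.5 * exp(-|t|)
    u = rng.random()
    return math.log(2.0 * u) if u < 0.5 else -math.log(2.0 * (1.0 - u))

MU = 2.0 * math.exp(0.5)  # kernel(t) <= MU * g(t) for all t

accepted = []
while len(accepted) < 50_000:
    t = laplace_draw()                         # step (a)
    u = rng.random()                           # step (b)
    g_t = 0.5 * math.exp(-abs(t))
    if MU * u * g_t < kernel(t):               # step (c): accept/reject
        accepted.append(t)                     # step (d): repeat until N accepted

mean = sum(accepted) / len(accepted)
var = sum(t * t for t in accepted) / len(accepted) - mean ** 2
print(mean, var)  # close to the N(0,1) moments (0, 1)
```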
• Weighted bootstrap: Suppose a µ appropriate for the
rejection method is not readily available, but that we do
have a sample θ_1, ..., θ_N from some approximating density
g(θ). Define:

(7.25) w_i = \frac{L(\theta_i)\pi(\theta_i)}{g(\theta_i)}

and

(7.26) q_i = \frac{w_i}{\sum_{j=1}^{N} w_j}.

Now draw θ* from the discrete distribution over {θ_1, ..., θ_N}
which places mass q_i at θ_i. Then

(7.27) \theta^* \,\dot\sim\, h(\theta) = \frac{L(\theta)\pi(\theta)}{\int L(\theta)\pi(\theta)\, d\theta}

with the approximation improving as N → ∞. This is a
weighted bootstrap, since instead of resampling from the
set {θ_1, ..., θ_N} with equally likely probabilities of
selection, we are resampling some points more often than
others due to the unequal weighting.
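A sketch of the weighted bootstrap (sampling-importance resampling) under the same hypothetical N(2, 1) target kernel and N(0, 3) approximating density used above; the resampled points should behave like draws from the target, so their mean should be close to 2.

```python
import math
import random

rng = random.Random(3)

def kernel(t):
    # unnormalized posterior L(t)*pi(t): a N(2, 1) kernel (assumed target)
    return math.exp(-0.5 * (t - 2.0) ** 2)

def g_pdf(t):
    # approximating density g: N(0, sd=3)
    return math.exp(-0.5 * (t / 3.0) ** 2) / (3.0 * math.sqrt(2.0 * math.pi))

# initial sample theta_1, ..., theta_N from g
theta = [rng.gauss(0.0, 3.0) for _ in range(100_000)]

# weights (7.25) and normalized masses (7.26)
w = [kernel(t) / g_pdf(t) for t in theta]
total = sum(w)
q = [wi / total for wi in w]

# resample with unequal probabilities q_i: draws are approximately from h
resample = rng.choices(theta, weights=q, k=20_000)
post_mean = sum(resample) / len(resample)
print(post_mean)  # close to 2
```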
NOTE: Importance sampling, rejection sampling and the weighted
bootstrap are all "one-off", or non-iterative, methods: they draw a
sample of size N, and stop. Hence there is no notion of the algorithm
"converging"; we simply require N sufficiently large. But for many
problems, especially high-dimensional ones, it may be difficult or
impossible to find an importance sampling density (or envelope function)
which is an acceptably accurate approximation to the posterior,
but still easy to sample from.
Solution: MCMC-based methods
• Gibbs sampling
• Metropolis algorithm
• Metropolis-Hastings algorithm
LECTURE 8
• There is a superficial resemblance between MCMC and the
frequentist technique of bootstrapping.
An MCMC sampler is a Markov chain in θ space. We start
the chain with an arbitrary θ^{(1)}, which is therefore not random.
Each subsequent θ^{(j)} is drawn from a distribution q(θ^{(j)} | θ^{(j−1)}),
so that it depends only on the previous point θ^{(j−1)} and not on
the history of the chain up to that point. This is the Markov
property. The distribution q(· | ·) defines the transition
probabilities of the chain.
• Let θ^{(2)} have the distribution given the initial θ^{(1)}, so just
q(θ^{(2)} | θ^{(1)}). The distribution of θ^{(3)} is:

(8.1) f(\theta^{(3)} \mid \theta^{(1)}) = \int q(\theta^{(3)} \mid \theta^{(2)})\, q(\theta^{(2)} \mid \theta^{(1)})\, d\theta^{(2)}.

The distribution of θ^{(j)} can then be obtained in principle by
iterating this convolution using

(8.2) f(\theta^{(j)} \mid \theta^{(1)}) = \int q(\theta^{(j)} \mid \theta^{(j-1)})\, f(\theta^{(j-1)} \mid \theta^{(1)})\, d\theta^{(j-1)}.

The Markov chain theory then says that, subject to some


conditions, there is a unique limiting distribution p(θ) such
that for all sufficiently large j we have f (θ(j) | θ(1) ) ≈ p(θ(j) ),
and this limiting distribution is independent of the arbitrary
starting value θ(1) .

8.1. The Gibbs Sampler


Procedure:
(a) Choose starting values: θ^{[0]} = [θ_1^{[0]}, θ_2^{[0]}, ..., θ_k^{[0]}]

(b) At the j-th iteration, starting at j = 1, complete a single cycle by
drawing values from the k full conditional distributions:

θ_1^{[j]} ∼ π(θ_1 | θ_2^{[j−1]}, θ_3^{[j−1]}, θ_4^{[j−1]}, ..., θ_{k−1}^{[j−1]}, θ_k^{[j−1]})
θ_2^{[j]} ∼ π(θ_2 | θ_1^{[j]}, θ_3^{[j−1]}, θ_4^{[j−1]}, ..., θ_{k−1}^{[j−1]}, θ_k^{[j−1]})
θ_3^{[j]} ∼ π(θ_3 | θ_1^{[j]}, θ_2^{[j]}, θ_4^{[j−1]}, ..., θ_{k−1}^{[j−1]}, θ_k^{[j−1]})
⋮
θ_{k−1}^{[j]} ∼ π(θ_{k−1} | θ_1^{[j]}, θ_2^{[j]}, θ_3^{[j]}, ..., θ_{k−2}^{[j]}, θ_k^{[j−1]})
θ_k^{[j]} ∼ π(θ_k | θ_1^{[j]}, θ_2^{[j]}, θ_3^{[j]}, ..., θ_{k−2}^{[j]}, θ_{k−1}^{[j]})

(c) Increment j and repeat until convergence.
   
Example 8.1.1. Let (θ_1, θ_2) ∼ N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right). The full conditionals are:

(8.3) θ_1^{[j]} \mid θ_2^{[j-1]} ∼ N(\rho\, θ_2^{[j-1]},\; 1 - \rho^2)

(8.4) θ_2^{[j]} \mid θ_1^{[j]} ∼ N(\rho\, θ_1^{[j]},\; 1 - \rho^2)
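The two conditionals (8.3)–(8.4) translate directly into code; a minimal Python sketch (the choice ρ = 0.8 and the chain length are illustrative):

```python
import random

def gibbs_bvn(rho, n_iter=60_000, burn_in=5_000, seed=4):
    """Gibbs sampler for (theta1, theta2) ~ N(0, [[1, rho], [rho, 1]])."""
    rng = random.Random(seed)
    sd = (1.0 - rho * rho) ** 0.5
    t1, t2 = 0.0, 0.0                  # arbitrary starting values
    out = []
    for j in range(n_iter):
        t1 = rng.gauss(rho * t2, sd)   # draw from (8.3)
        t2 = rng.gauss(rho * t1, sd)   # draw from (8.4)
        if j >= burn_in:
            out.append((t1, t2))
    return out

draws = gibbs_bvn(0.8)
n = len(draws)
m1 = sum(t1 for t1, _ in draws) / n            # should be near 0
cross = sum(t1 * t2 for t1, t2 in draws) / n   # E[theta1*theta2] = rho = 0.8
print(m1, cross)
```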
Example 8.1.2. Let x1 , x2 , ..., xn be a series of count data where there
exists the possibility of a changepoint at some period k, along the series.
Therefore there are two Poisson data-generating processes:
(8.5) xi | λ ∼ P (λ) i = 1, 2, ..., k

(8.6) xi | φ ∼ P (φ) i = k + 1, ..., n


The parameters to be estimated are λ, φ and k. Also the three inde-
pendent priors applied to this model are:
λ ∼ Gamma(α, β),
φ ∼ Gamma(γ, δ),
k ∼ Discrete uniform on [1, 2, ..., n] .
So the joint posterior is

π(λ, φ, k | y) ∝ L(λ, φ, k | y)\, π(λ | α, β)\, π(φ | γ, δ)\, π(k)

(8.7) = \left( \prod_{i=1}^{k} \frac{e^{-\lambda} \lambda^{y_i}}{y_i!} \right) \left( \prod_{i=k+1}^{n} \frac{e^{-\phi} \phi^{y_i}}{y_i!} \right) \left( \frac{\beta^{\alpha}}{\Gamma(\alpha)} \lambda^{\alpha-1} e^{-\beta\lambda} \right) \left( \frac{\delta^{\gamma}}{\Gamma(\gamma)} \phi^{\gamma-1} e^{-\delta\phi} \right) \frac{1}{n}

∝ \lambda^{\alpha - 1 + \sum_{i=1}^{k} y_i}\; \phi^{\gamma - 1 + \sum_{i=k+1}^{n} y_i}\; e^{-(k+\beta)\lambda - (n-k+\delta)\phi}.
So

λ | φ, k, y ∼ Gamma\left( \alpha + \sum_{i=1}^{k} y_i,\; \beta + k \right),

φ | λ, k, y ∼ Gamma\left( \gamma + \sum_{i=k+1}^{n} y_i,\; \delta + n - k \right).

Let λ and φ be fixed. Then

p(y | k, λ, φ) = \left( \prod_{i=1}^{k} \frac{e^{-\lambda} \lambda^{y_i}}{y_i!} \right) \left( \prod_{i=k+1}^{n} \frac{e^{-\phi} \phi^{y_i}}{y_i!} \right)

(8.8) = \left( \prod_{i=1}^{n} \frac{1}{y_i!} \right) e^{k(\phi-\lambda)}\, e^{-n\phi}\, \lambda^{\sum_{i=1}^{k} y_i} \prod_{i=k+1}^{n} \phi^{y_i}

= \left( \prod_{i=1}^{n} \frac{e^{-\phi} \phi^{y_i}}{y_i!} \right) e^{k(\phi-\lambda)} \left( \frac{\lambda}{\phi} \right)^{\sum_{i=1}^{k} y_i}

= f(y, \phi)\, e^{k(\phi-\lambda)} (\lambda/\phi)^{\sum_{i=1}^{k} y_i},

where f(y, φ) = \prod_{i=1}^{n} e^{-\phi}\phi^{y_i}/y_i! does not depend on k, so
the conditional distribution of k is proportional to e^{k(\phi-\lambda)} (\lambda/\phi)^{\sum_{i=1}^{k} y_i}.

Listing 8.1. MCMC Function in R


bcp <- function(theta.matrix, y, a, b, g, d) {

  n <- length(y)
  k.prob <- rep(0, length = n)

  # row 1 of theta.matrix holds the starting values (lambda, phi, k);
  # update rows 2..N in turn
  for (i in 2:nrow(theta.matrix)) {

    k.old <- theta.matrix[(i - 1), 3]

    # lambda | phi, k ~ Gamma(a + sum(y[1:k]), b + k)
    lambda <- rgamma(1, a + sum(y[1:k.old]), b + k.old)

    # phi | lambda, k ~ Gamma(g + sum(y[(k+1):n]), d + n - k)
    phi <- rgamma(1, g + sum(y) - sum(y[1:k.old]), d + n - k.old)

    # discrete full conditional for k
    for (j in 1:n) k.prob[j] <- exp(j * (phi - lambda)) * (lambda / phi)^sum(y[1:j])
    k.prob <- k.prob / sum(k.prob)
    k <- sample(1:n, size = 1, prob = k.prob)

    theta.matrix[i, ] <- c(lambda, phi, k)
  }
  return(theta.matrix)
}
Example 8.1.3. The time series stored in the file gives the number of British
coal mining disasters per year over the period 1851 − 1962.

Figure 1. Plot of the number of disasters per year over 1851 − 1962.

There has been a reduction in the rate of disasters over the period.
Let y_i denote the number of disasters in year i = 1, ..., n (relabelling the
years by the numbers 1 to n = 112). A model that has been proposed
in the literature has the form:
yi ∼ P oisson(θ), i = 1, ..., k
yi ∼ P oisson(λ), i = k + 1, ..., n
Let
θ ∼ Gamma(a1 , b1 )
λ ∼ Gamma(a2 , b2 )
k ∼ discrete uniform over {1, ..., n}
b1 ∼ Gamma(c1 , d1 )
b2 ∼ Gamma(c2 , d2 )
So

π(θ, λ, k, b_1, b_2 | y) ∝ e^{-\theta k} \theta^{\sum_{i=1}^{k} y_i}\, e^{-\lambda(n-k)} \lambda^{\sum_{i=k+1}^{n} y_i}\, b_1^{a_1} \theta^{a_1-1} e^{-b_1 \theta}\, b_1^{c_1-1} e^{-d_1 b_1}\, b_2^{a_2} \lambda^{a_2-1} e^{-b_2 \lambda}\, b_2^{c_2-1} e^{-d_2 b_2}\, I[k \in \{1, 2, ..., n\}].
LECTURE 8. 59

θ | y, λ, b_1, b_2, k ∼ Gamma\left( a_1 + \sum_{i=1}^{k} y_i,\; b_1 + k \right)

λ | y, θ, b_1, b_2, k ∼ Gamma\left( a_2 + \sum_{i=k+1}^{n} y_i,\; b_2 + n - k \right)

b_1 | y, θ, λ, b_2, k ∼ Gamma(c_1 + a_1,\; d_1 + θ)

b_2 | y, θ, λ, b_1, k ∼ Gamma(c_2 + a_2,\; d_2 + λ)

and

p(k | y, θ, λ, b_1, b_2) = \frac{e^{(\lambda-\theta)k} (\theta/\lambda)^{\sum_{i=1}^{k} y_i}\, I[k \in \{1, 2, ..., n\}]}{\sum_{j=1}^{n} e^{(\lambda-\theta)j} (\theta/\lambda)^{\sum_{i=1}^{j} y_i}}

Burn-in period: The observations obtained after the chain has settled
down to the posterior will be more useful in estimating probabilities
and expectations for p(θ). If we throw out the early observations,
taken while the process was settling down, the remainder of the process
should be a very close approximation to one in which every observation
is sampled from the posterior. Dropping the early observations is
referred to as using a burn-in period.
Thinning is a process used to make the observations more nearly
independent, hence more nearly a random sample from the posterior
distribution. Frankly, after a burn-in, there is not much point in thinning
unless the correlations are extremely large. If there is a lot of correlation
between adjacent observations, a larger overall MC sample size is
needed to achieve reasonable numerical accuracy, in addition to needing
a much larger burn-in.
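The effect of burn-in and thinning can be seen on a toy chain (an AR(1) process standing in for MCMC output; the coefficient 0.9, start value and lengths are all illustrative): dropping the early draws removes the transient caused by a bad starting value, and keeping every 10th draw cuts the lag-1 correlation from about 0.9 to about 0.9^10.

```python
import random

rng = random.Random(5)
rho = 0.9

chain = [10.0]  # deliberately bad starting value, far from stationarity
for _ in range(40_000):
    chain.append(rho * chain[-1] + rng.gauss(0.0, 1.0))

kept = chain[2_000:]      # burn-in: drop the early, transient draws
thinned = kept[::10]      # thinning: keep every 10th draw

def lag1_corr(x):
    """Lag-1 sample autocorrelation."""
    mx = sum(x) / len(x)
    num = sum((x[i] - mx) * (x[i + 1] - mx) for i in range(len(x) - 1))
    den = sum((xi - mx) ** 2 for xi in x)
    return num / den

r_full, r_thin = lag1_corr(kept), lag1_corr(thinned)
print(r_full, r_thin)  # roughly 0.9 vs. roughly 0.9**10
```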
LECTURE 9
Summary of the properties of Gibbs
sampler
(a) Since the Gibbs sampler conditions on the values from the last
iteration of the chain, it clearly constitutes a Markov
chain.
(b) The Gibbs sampler has the true posterior distribution of the
parameter vector as its limiting distribution.
(c) The Gibbs sampler is a homogeneous Markov chain: the
transition probabilities are independent of n, the current length
of the chain.
(d) The Gibbs sampler converges at a geometric rate: the total
variation distance between the distribution at an arbitrary time and
the point of convergence decreases at a geometric rate in time t.
(e) The Gibbs sampler is an ergodic Markov chain.

9.1. Metropolis–Hastings Algorithm


The Metropolis–Hastings algorithm [3] is another type of accept-reject
algorithm. It requires a candidate-generating distribution, sometimes
referred to as the proposal distribution. The algorithm begins with an
initial value θ^1. At the k-th iteration we have (θ^1, θ^2, ..., θ^k). The
(k + 1)-st iteration first generates θ* from a proposal density h(θ* | θ^k).
This density should mimic the actual posterior distribution in some
sense, but in theory it can be any distribution with the same support
as the posterior. Define:

(9.1) \alpha(\theta^*, \theta^k) = \min\left\{ 1, \frac{p(\theta^*)\, h(\theta^k \mid \theta^*)}{p(\theta^k)\, h(\theta^* \mid \theta^k)} \right\} = \alpha.
We then simulate u ∼ Uniform[0, 1] and select θ^{k+1} = θ* if u ≤ α;
otherwise we take θ^{k+1} = θ^k. Thus

(9.2) \theta^{k+1} = \begin{cases} \theta^* & \text{with probability } \alpha(\theta^*, \theta^k) \\ \theta^k & \text{with probability } 1 - \alpha(\theta^*, \theta^k) \end{cases}

Here α only uses the ratio of two values of p(·), so it is enough to know
the kernel of the posterior density. The acceptance ratio can also be
written as:

(9.3) \alpha = \frac{p(\theta^* \mid y) / h_t(\theta^* \mid \theta^{t-1})}{p(\theta^{t-1} \mid y) / h_t(\theta^{t-1} \mid \theta^*)}.

The ratio α is always defined, because a jump from θ^{t−1} to θ* can only
occur if both p(θ^{t−1} | y) and h_t(θ* | θ^{t−1}) are nonzero.
So here the proposal density may be asymmetric.

9.2. Metropolis Algorithm

The original Metropolis algorithm assumes that h(θ^k | θ*) = h(θ* | θ^k),
so that α(θ*, θ^k) = min{1, p(θ*)/p(θ^k)}. This is called random-walk
Metropolis.

Various suggestions have been made about how to choose h(θ* | θ^k). Often
it is taken to be a N(θ^k, Σ_k) distribution, with various suggestions for Σ_k.

The Metropolis algorithm is an adaptation of a random walk that
uses an acceptance/rejection rule to converge to the specified target
distribution. The algorithm proceeds as follows:
(a) Draw a starting point θ^0, for which p(θ^0 | y) > 0, from a starting
distribution p_0(θ). The starting distribution might, for example,
be based on an approximation, or we may simply choose
starting values dispersed around a crude approximate estimate.
(b) For t = 1, 2, ...
• Sample a proposal θ* from a jumping distribution (or proposal
distribution) at time t, h_t(θ* | θ^{t−1}). For the Metropolis
algorithm, the jumping distribution must be symmetric,
satisfying the condition h_t(θ_a | θ_b) = h_t(θ_b | θ_a) for all θ_a,
θ_b and t.
• Calculate the ratio of the densities,

(9.4) \alpha = \frac{p(\theta^* \mid y)}{p(\theta^{t-1} \mid y)}

• Set

(9.5) \theta^t = \begin{cases} \theta^* & \text{with probability } \min(\alpha, 1) \\ \theta^{t-1} & \text{otherwise} \end{cases}
Given the current value θ^{t−1}, the transition distribution
h_t(θ^t | θ^{t−1}) of the Markov chain is thus a mixture of a
point mass at θ^t = θ^{t−1} and a weighted version of the
jumping distribution that adjusts for the
acceptance rate. The algorithm requires the ability to
calculate the ratio α for all (θ, θ*), and to draw θ from
the jumping distribution h_t(θ* | θ) for all θ and t. In
addition, each step requires the generation of a uniform random
number.
If θ^t = θ^{t−1}, that is, the jump is not accepted, this still counts
as an iteration of the algorithm.
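The steps above can be sketched as a minimal random-walk Metropolis sampler (a hypothetical target, the N(0, 1) kernel, with a symmetric Gaussian jumping distribution; computing α on the log scale avoids numerical underflow):

```python
import math
import random

rng = random.Random(6)

def log_post(t):
    # log unnormalized posterior: a N(0, 1) kernel (assumed for illustration)
    return -0.5 * t * t

t, draws = 0.0, []
for _ in range(80_000):
    prop = t + rng.gauss(0.0, 2.0)             # symmetric jumping distribution
    log_alpha = log_post(prop) - log_post(t)   # (9.4) on the log scale
    if rng.random() < math.exp(min(0.0, log_alpha)):  # accept w.p. min(alpha, 1)
        t = prop
    draws.append(t)  # a rejected jump still counts as an iteration

kept = draws[10_000:]                          # burn-in
mean = sum(kept) / len(kept)
var = sum(x * x for x in kept) / len(kept) - mean ** 2
print(mean, var)  # close to the N(0,1) moments (0, 1)
```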
Interpretation of the Gibbs sampler as a special case of
the Metropolis–Hastings algorithm:
Gibbs sampling can be viewed as a special case of the
Metropolis–Hastings algorithm in the following way. We
first define iteration t to consist of a series of d steps, with
step j of iteration t corresponding to an update of the
subvector θ_j conditional on all other elements of θ. Then
the jumping distribution h_{j,t}(· | ·) at step j of iteration t
only jumps along the j-th subvector, and does so with the
conditional posterior density of θ_j given θ_{−j}^{t−1}:

(9.6) h^{Gibbs}_{j,t}(\theta^* \mid \theta^{t-1}) = \begin{cases} p(\theta_j^* \mid \theta_{-j}^{t-1}, y) & \text{if } \theta_{-j}^* = \theta_{-j}^{t-1} \\ 0 & \text{otherwise} \end{cases}

The only possible jumps are to parameter vectors θ* that
match θ^{t−1} on all components other than the j-th. Under
this jumping distribution, the ratio at the j-th step of
iteration t is:

(9.7) \alpha = \frac{p(\theta^* \mid y) / h^{Gibbs}_{j,t}(\theta^* \mid \theta^{t-1})}{p(\theta^{t-1} \mid y) / h^{Gibbs}_{j,t}(\theta^{t-1} \mid \theta^*)} = \frac{p(\theta^* \mid y) / p(\theta_j^* \mid \theta_{-j}^{t-1}, y)}{p(\theta^{t-1} \mid y) / p(\theta_j^{t-1} \mid \theta_{-j}^{t-1}, y)} = \frac{p(\theta_{-j}^{t-1} \mid y)}{p(\theta_{-j}^{t-1} \mid y)} = 1,

and thus every jump is accepted.

9.3. Data Augmentation

Data augmentation is a technique that can be helpful in making problems
amenable to Gibbs sampling. It is useful when:
(a) In the real world, some of our data can be missing.
(b) The likelihood function is not tractable for one reason or another,
but conditional on a collection of unobserved random
variables, the likelihood becomes easy to handle.
Example 9.3.1. Suppose Y_0, Y_1, ..., Y_n is a time series of random
variables defined by Y_0 = 0 and, for each i = 1, ..., n, Y_i = Y_{i−1} + s_i
where s_i ∼ Beta(θ, θ), θ > 0. Therefore:

(9.8) Y_i \mid Y_0, Y_1, ..., Y_{i-1} \sim Y_{i-1} + s_i

The likelihood of θ is

(9.9) f(y_0, ..., y_n \mid \theta) = f(y_0 \mid \theta) \prod_{i=1}^{n} f(y_i \mid y_0, ..., y_{i-1}, \theta) = \prod_{i=1}^{n} f(y_i \mid y_{i-1}, \theta) = \prod_{i=1}^{n} \frac{\Gamma(2\theta)}{\{\Gamma(\theta)\}^2} (y_i - y_{i-1})^{\theta-1} \{1 - (y_i - y_{i-1})\}^{\theta-1}\, I[0 < y_i - y_{i-1} < 1].

However, suppose that an observation y_i^* is missing; then the likelihood
is no longer available in closed form.
Solution:
Let z denote additional variables to be included in the model, chosen so
that f(y | θ) is not tractable but f(y, z | θ) is. The posterior distribution
of (θ, z) is proportional to:

(9.10) \pi(\theta, z \mid y) \propto f(y, z \mid \theta)\, \pi(\theta).

Here y = (y_0, ..., y_n) excluding y_i^*, and z = y_i^*. Then

(9.11) f(y, y_i^* \mid \theta) = f(y_0, ..., y_n \mid \theta) = \prod_{i=1}^{n} \frac{\Gamma(2\theta)}{\{\Gamma(\theta)\}^2} (y_i - y_{i-1})^{\theta-1} \{1 - (y_i - y_{i-1})\}^{\theta-1}\, I[0 < y_i - y_{i-1} < 1].

Therefore the posterior density of θ given (y, y_i^*) is explicit:

(9.12) \pi(\theta \mid y, y_i^*) \propto f(y, y_i^* \mid \theta)\, \pi(\theta).
To complete the Gibbs sampler, we also need to sample from the
conditional posterior distribution of y_i^*. This has density:

(9.13) f(y_i^* \mid y, \theta) \propto f(y, y_i^* \mid \theta) \propto \left[ (y_i^* - y_{i-1}) \{1 - (y_i^* - y_{i-1})\} (y_{i+1} - y_i^*) \{1 - (y_{i+1} - y_i^*)\} \right]^{\theta-1}

on the region y_i^* ∈ (y_{i−1}, y_{i−1} + 1) ∩ (y_{i+1} − 1, y_{i+1}). This sampling can
be carried out by rejection sampling.
GIBBS SAMPLING
yi ∼ P oisson(θ), i = 1, ..., k
yi ∼ P oisson(λ), i = k + 1, ..., n
θ ∼ Gamma(a1 , b1 )
λ ∼ Gamma(a2 , b2 )
k ∼ discrete uniform over {1, ..., n}
b1 ∼ Gamma(c1 , d1 )
b2 ∼ Gamma(c2 , d2 )
Likelihoods:

f(y_{1:k} \mid \theta) = \prod_{i=1}^{k} f(y_i \mid \theta) = \prod_{i=1}^{k} \frac{e^{-\theta} \theta^{y_i}}{y_i!} = \frac{e^{-k\theta}\, \theta^{\sum_{i=1}^{k} y_i}}{\prod_{i=1}^{k} y_i!}

f(y_{k+1:n} \mid \lambda) = \prod_{j=k+1}^{n} f(y_j \mid \lambda) = \prod_{j=k+1}^{n} \frac{e^{-\lambda} \lambda^{y_j}}{y_j!} = \frac{e^{-(n-k)\lambda}\, \lambda^{\sum_{j=k+1}^{n} y_j}}{\prod_{j=k+1}^{n} y_j!}

Priors (writing Gamma(α, β) with rate parameter β, to match the conditionals below):

\pi(\theta \mid b_1) = \frac{b_1^{a_1}}{\Gamma(a_1)} \theta^{a_1-1} e^{-b_1 \theta}, \qquad \pi(\lambda \mid b_2) = \frac{b_2^{a_2}}{\Gamma(a_2)} \lambda^{a_2-1} e^{-b_2 \lambda},

\pi(b_1) = \frac{d_1^{c_1}}{\Gamma(c_1)} b_1^{c_1-1} e^{-d_1 b_1}, \qquad \pi(b_2) = \frac{d_2^{c_2}}{\Gamma(c_2)} b_2^{c_2-1} e^{-d_2 b_2}.

Here, for the Gamma(α, β) distribution in the rate parameterization, with α > 0 and β > 0,

g(x; \alpha, \beta) = \begin{cases} \frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x} & \text{for } x > 0 \\ 0 & \text{elsewhere.} \end{cases}


Posterior Distribution

π(θ, λ, k, b_1, b_2 | y) ∝ f(y_{1:k} | θ)\, f(y_{k+1:n} | λ)\, π(λ | a_2, b_2)\, π(θ | a_1, b_1)\, π(b_1)\, π(b_2)\, π(k)

Explicitly,

π(θ, λ, k, b_1, b_2 | y) ∝ \left( e^{-k\theta} \theta^{\sum_{i=1}^{k} y_i} \right) \left( e^{-(n-k)\lambda} \lambda^{\sum_{i=k+1}^{n} y_i} \right) \left( b_1^{a_1} \theta^{a_1-1} e^{-b_1 \theta} \right) \left( b_2^{a_2} \lambda^{a_2-1} e^{-b_2 \lambda} \right) \left( b_1^{c_1-1} e^{-d_1 b_1} \right) \left( b_2^{c_2-1} e^{-d_2 b_2} \right) I[k \in \{1, 2, ..., n\}]

So the conditional distributions, each read off from the joint posterior as a
function of one parameter with the others held fixed, are:

(a)
\pi(\theta \mid y, \lambda, k, b_1, b_2) \propto e^{-k\theta} \theta^{\sum_{i=1}^{k} y_i}\, \theta^{a_1-1} e^{-b_1 \theta} = e^{-(k+b_1)\theta}\, \theta^{a_1 + \sum_{i=1}^{k} y_i - 1},

so θ | y, λ, k, b_1, b_2 ∼ Gamma\left( a_1 + \sum_{i=1}^{k} y_i,\; b_1 + k \right).

(b)
\pi(\lambda \mid y, \theta, k, b_1, b_2) \propto e^{-(n-k)\lambda} \lambda^{\sum_{i=k+1}^{n} y_i}\, \lambda^{a_2-1} e^{-b_2 \lambda} = e^{-(n-k+b_2)\lambda}\, \lambda^{a_2 + \sum_{i=k+1}^{n} y_i - 1},

so λ | y, θ, k, b_1, b_2 ∼ Gamma\left( a_2 + \sum_{i=k+1}^{n} y_i,\; b_2 + n - k \right).

(c)
\pi(b_1 \mid y, \theta, \lambda, k, b_2) \propto b_1^{a_1}\, b_1^{c_1-1} e^{-d_1 b_1}\, e^{-b_1 \theta} = b_1^{a_1 + c_1 - 1} e^{-(d_1 + \theta) b_1},

so b_1 | y, θ, λ, k, b_2 ∼ Gamma(a_1 + c_1,\; d_1 + θ).

(d)
\pi(b_2 \mid y, \theta, \lambda, k, b_1) \propto b_2^{a_2}\, b_2^{c_2-1} e^{-d_2 b_2}\, e^{-b_2 \lambda} = b_2^{a_2 + c_2 - 1} e^{-(d_2 + \lambda) b_2},

so b_2 | y, θ, λ, k, b_1 ∼ Gamma(a_2 + c_2,\; d_2 + λ).

(e)
\pi(k \mid y, \theta, \lambda, b_1, b_2) = \frac{e^{(\lambda-\theta)k} (\theta/\lambda)^{\sum_{i=1}^{k} y_i}}{\sum_{j=1}^{n} e^{(\lambda-\theta)j} (\theta/\lambda)^{\sum_{i=1}^{j} y_i}}\, I[k \in \{1, 2, ..., n\}].

As the conditional distribution of k is discrete, it is characterized
by a probability mass function.
DATA AUGMENTATION
Suppose Y_0, Y_1, ..., Y_n is a time series of random variables defined by
Y_0 = 0 and, for i = 1, ..., n, Y_i = Y_{i−1} + S_i where S_i ∼ Beta(θ, θ),
θ > 0. Therefore

Y_i \mid Y_0, Y_1, ..., Y_{i-1} \sim Y_{i-1} + S_i

and the likelihood for the observations (y_0, y_1, ..., y_n) is

f(y_0, ..., y_n \mid \theta) = f(y_0 \mid \theta) \prod_{i=1}^{n} f(y_i \mid y_0, ..., y_{i-1}, \theta) = \prod_{i=1}^{n} f(y_i \mid y_{i-1}, \theta) = \prod_{i=1}^{n} \frac{\Gamma(2\theta)}{\{\Gamma(\theta)\}^2} (y_i - y_{i-1})^{\theta-1} \{1 - (y_i - y_{i-1})\}^{\theta-1}\, I[0 < y_i - y_{i-1} < 1].

If an observation y_i^* is missing, this likelihood is no longer available in
closed form. As a second illustration, suppose instead that Y_1, ..., Y_n are
iid data from the mixture density:

f(y_i \mid \theta) = \frac{1}{2} \left\{ \frac{1}{(2\pi)^{1/2}} e^{-y_i^2/2} + \frac{1}{(2\pi)^{1/2}} e^{-(y_i-\theta)^2/2} \right\},

so that

L(\theta \mid y) = \prod_{i=1}^{n} f(y_i \mid \theta) \propto \prod_{i=1}^{n} \left( e^{-y_i^2/2} + e^{-(y_i-\theta)^2/2} \right).
Let
• z = the additional variables included in the model (z may be just
a single variable or a vector containing several variables),
• θ = the original parameters in the model, with prior π(θ),
• y = the vector of observations.
Then the posterior distribution of (θ, z) is proportional to

\pi(\theta, z \mid y) \propto f(y, z \mid \theta)\, \pi(\theta).

Data augmentation proceeds by carrying out Gibbs sampling, sampling
successively from θ and z to produce a sample from this joint
distribution. The marginal distribution of θ is therefore the posterior
distribution of interest.

• For the missing-data example, y = (y_0, ..., y_n) excluding the missing
observation y_i^*, and z = y_i^*. The posterior density of θ given
(y, y_i^*) is

\pi(\theta \mid y, y_i^*) \propto f(y, y_i^* \mid \theta)\, \pi(\theta),

and the conditional posterior density of y_i^* is

f(y_i^* \mid y, \theta) \propto f(y, y_i^* \mid \theta) \propto \left[ (y_i^* - y_{i-1}) \{1 - (y_i^* - y_{i-1})\} (y_{i+1} - y_i^*) \{1 - (y_{i+1} - y_i^*)\} \right]^{\theta-1},

where y_i^* ∈ (y_{i−1}, y_{i−1} + 1) ∩ (y_{i+1} − 1, y_{i+1}).
• For the mixture example, z is a sequence of n component labels
("heads or tails"), one element per observation: z = (z_1, ..., z_n) with

z_i = 1 if the i-th observation comes from the first component ("head"),
z_i = 2 if the i-th observation comes from the second component ("tail").

Suppose the prior is θ ∼ N(0, 1). Then

f(y_i, z_i \mid \theta) = f(y_i \mid z_i, \theta)\, P(z_i),

where

f(y_i \mid z_i, \theta) \propto \begin{cases} e^{-y_i^2/2} & \text{if } z_i = 1 \\ e^{-(y_i-\theta)^2/2} & \text{if } z_i = 2 \end{cases}

and P(z_i = 1) = P(z_i = 2) = 1/2. Using

\pi(\theta, z \mid y) \propto \left\{ \prod_{i=1}^{n} f(y_i \mid z_i, \theta)\, P(z_i) \right\} \pi(\theta),

the full conditional of θ is

\pi(\theta \mid z, y) \propto e^{-\theta^2/2} \prod_{i: z_i = 2} e^{-(y_i-\theta)^2/2} \;\Longrightarrow\; \theta \mid z, y \sim N\!\left( \frac{\sum_{i: z_i = 2} y_i}{1 + n_2},\; \frac{1}{1 + n_2} \right),

where n_2 = the number of observations for which z_i = 2. The full
conditionals of the labels follow from Bayes' theorem:

P(z_i = 1 \mid \theta, y) = \frac{e^{-y_i^2/2}}{e^{-y_i^2/2} + e^{-(y_i-\theta)^2/2}}, \qquad P(z_i = 2 \mid \theta, y) = \frac{e^{-(y_i-\theta)^2/2}}{e^{-y_i^2/2} + e^{-(y_i-\theta)^2/2}}.

Then a Gibbs sampler is used to simulate the posterior distribution
of (θ, z_1, ..., z_n).
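These two full conditionals give a complete data-augmentation Gibbs sampler for the mixture; a Python sketch on simulated data (the true value θ = 3, the sample size, and the chain settings are all hypothetical choices for illustration):

```python
import math
import random

rng = random.Random(7)

# simulated data from the half/half mixture 0.5*N(0,1) + 0.5*N(theta_true,1)
theta_true = 3.0
y = [rng.gauss(0.0, 1.0) if rng.random() < 0.5 else rng.gauss(theta_true, 1.0)
     for _ in range(300)]

theta, keep = 0.0, []
for it in range(4_000):
    # draw each label z_i from P(z_i = 2 | theta, y_i)
    z = []
    for yi in y:
        p1 = math.exp(-0.5 * yi * yi)
        p2 = math.exp(-0.5 * (yi - theta) ** 2)
        z.append(2 if rng.random() < p2 / (p1 + p2) else 1)
    # draw theta | z, y ~ N(sum_{i: z_i=2} y_i / (1 + n2), 1 / (1 + n2))
    n2 = z.count(2)
    s2 = sum(yi for yi, zi in zip(y, z) if zi == 2)
    theta = rng.gauss(s2 / (1 + n2), (1.0 / (1 + n2)) ** 0.5)
    if it >= 500:   # burn-in
        keep.append(theta)

post_mean = sum(keep) / len(keep)
print(post_mean)  # should recover a value near theta_true
```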
Example 0.2. A genetic linkage experiment yields y = (y_1, y_2, y_3, y_4) = (125, 18, 20, 34),
with cell probabilities \left( \frac{2+\theta}{4}, \frac{1}{4}(1-\theta), \frac{1}{4}(1-\theta), \frac{\theta}{4} \right), 0 ≤ θ ≤ 1. With the prior
θ ∼ Uniform(0, 1), the posterior density of θ is:

\pi(\theta \mid y) \propto f(y \mid \theta)\, \pi(\theta) \propto (2+\theta)^{y_1} (1-\theta)^{y_2+y_3} \theta^{y_4}\, I[\theta \in (0, 1)].

Then we can either
(a) sample the posterior distribution of θ directly (e.g. via rejection
sampling), or
(b) use data augmentation.
Augment the observed data (y_1, y_2, y_3, y_4) by dividing the
first cell into two partitions, with respective probabilities
proportional to θ and 2. That is,

z \mid y, \theta \sim Binomial\left( y_1, \frac{\theta}{2+\theta} \right).

Then the augmented likelihood is:

f(y, z \mid \theta) = f(y \mid \theta)\, \pi(z \mid y, \theta) \propto (2+\theta)^{y_1} (1-\theta)^{y_2+y_3} \theta^{y_4} \binom{y_1}{z} \left( \frac{\theta}{2+\theta} \right)^z \left( \frac{2}{2+\theta} \right)^{y_1-z} = \binom{y_1}{z} 2^{y_1-z} (1-\theta)^{y_2+y_3} \theta^{y_4+z}.

The conditional posterior of θ is:

\pi(\theta \mid y, z) \propto f(y, z \mid \theta)\, \pi(\theta) \propto \theta^{y_4+z} (1-\theta)^{y_2+y_3}\, I[\theta \in (0, 1)], \quad \text{i.e. } \theta \mid y, z \sim Beta(y_4 + z + 1, y_2 + y_3 + 1).
To complete the Gibbs sampler, we also generate draws from
the conditional posterior distribution of z, which is

z \mid y, \theta \sim Binomial\left( y_1, \frac{\theta}{2+\theta} \right).
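Putting the two conditionals together gives the Gibbs sampler for the linkage data; a Python sketch (the binomial draw is done via Bernoulli trials to stay within the standard library, and the chain length is an arbitrary illustrative choice):

```python
import random

rng = random.Random(8)
y1, y2, y3, y4 = 125, 18, 20, 34   # the observed linkage counts

def binom(n, p):
    # small-n binomial draw via n Bernoulli trials
    return sum(rng.random() < p for _ in range(n))

theta, keep = 0.5, []
for it in range(20_000):
    z = binom(y1, theta / (2.0 + theta))                  # z | y, theta
    theta = rng.betavariate(y4 + z + 1, y2 + y3 + 1)      # theta | y, z
    if it >= 2_000:   # burn-in
        keep.append(theta)

post_mean = sum(keep) / len(keep)
print(post_mean)
```

For this classic data set the posterior mass concentrates a little above 0.6, consistent with the maximum of (2 + θ)^125 (1 − θ)^38 θ^34.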
R Codes

Listing 1. Triplot Code in R


####################################################
#                                                  #
#           A Sample Triplot by Anil Aksu          #
#    It is developed to show some basics of R      #
#                                                  #
####################################################

## the range of sampling
x=seq(-4,4,length=101)
## normal densities for the prior, likelihood and posterior
prior=dnorm(x, mean = 0.5, sd = 0.7, log = FALSE)
likelihood=dnorm(x, mean = 0.49, sd = 0.65, log = FALSE)
posterior=dnorm(x, mean = 0.52, sd = 0.5, log = FALSE)

## let's plot them
plot(range(x), range(c(likelihood,prior,posterior)), type='n',
     xlab=expression(paste(theta)),
     ylab=expression(paste("f(", theta, " )")))
lines(x, prior, type='l', col='blue')
lines(x, likelihood, type='l', col='red')
lines(x, posterior, type='l', col='green')

title("Prior, Likelihood and Posterior Distribution")
legend(
  "topright",
  lty=c(1,1,1),
  col=c("blue", "red", "green"),
  legend = c("prior", "likelihood","posterior")
)

Listing 2. Inference Plots Code in R


####################################################
#                                                  #
#     Posterior, Perspective and Contour plots     #
#                  by Anil Aksu                    #
#                                                  #
####################################################

## the range of sampling
x=seq(0,20,length=101)
## a normal density standing in for the posterior
posterior=dnorm(x, mean = 7, sd = 1.5, log = FALSE)

## let's plot it
plot(range(x), range(c(posterior)), type='n',
     xlab=expression(paste(theta)),
     ylab=expression(paste("f(", theta, " | x )")))
lines(x, posterior, type='l', col='blue', lwd=5)

title("Posterior Distribution")
legend = c("posterior")

## perspective plot
x <- seq(-10, 10, length = 30)
y <- x
f <- function(x, y) { r <- sqrt(x^2 + y^2); 10 * sin(r)/r }
z <- outer(x, y, f)
z[is.na(z)] <- 1
op <- par(bg = "white")
persp(x, y, z, theta = 30, phi = 30, expand = 0.5, col = "lightblue")
persp(x, y, z, theta = 30, phi = 30, expand = 0.5, col = "lightblue",
      ltheta = 120, shade = 0.75, ticktype = "detailed",
      xlab = "X", ylab = "Y", zlab = "Sinc( r )") -> res
round(res, 3)

# contour plot
a <- expand.grid(1:20, 1:20)
b <- matrix(a[,1]^2 + a[,2]^2, 20)
filled.contour(x = 1:20, y = 1:20, z = b,
               plot.axes = { axis(1); axis(2); points(10, 10) })

## bimodal (mixture) posterior sampling

## the range of sampling
x=seq(-4,6,length=101)
## a two-component normal mixture posterior
posterior=0.8*dnorm(x, mean = 0, sd = 1, log = FALSE) +
          0.2*dnorm(x, mean = 4, sd = 1, log = FALSE)

## let's plot it
plot(range(x), range(c(posterior)), type='n',
     xlab=expression(paste(theta)),
     ylab=expression(paste("f(", theta, " | x )")))
lines(x, posterior, type='l', col='blue', lwd=5)
legend = c("posterior")

## credible interval posterior plot

## the range of sampling
x=seq(0,20,length=101)
posterior=dnorm(x, mean = 7, sd = 2, log = FALSE)

## let's plot it
plot(range(x), range(c(posterior)), type='n',
     xlab=expression(paste(theta)),
     ylab=expression(paste("f(", theta, " | x )")))
lines(x, posterior, type='l', col='blue', lwd=5)
legend = c("posterior")

Listing 3. Rejection Sampling Plots Code in R


####################################################
#                                                  #
#             Rejection sampling plots             #
#                  by Anil Aksu                    #
#                                                  #
####################################################

require(SMPracticals)
## rejection sampling

## the range of sampling
x=seq(-10,10,length=101)
## an unnormalized posterior (normal mixture) and its envelope
posterior=0.6*dnorm(x, mean = 0, sd = 4, log = FALSE) +
          0.4*dnorm(x, mean = 6, sd = 2, log = FALSE)
envelope=2*dnorm(x, mean = 2, sd = 5, log = FALSE)
## let's plot them
plot(range(x), range(c(posterior,envelope)), type='n',
     xlab=expression(paste(theta)), ylab="")
lines(x, posterior, type='l', col='blue', lwd=5)
lines(x, envelope, type='l', col='red', lwd=5)

title("Rejection Sampling Plot")
legend("topright",
       legend=c(expression(paste("L(", theta, ")", pi, "(", theta, ")")),
                expression(paste(mu, "g(", theta, ")"))),
       lty=1, col=c('blue', 'red'), inset = .02)

## British coal mining accidents
data(coal)
# years of coal mining accidents
years <- unique(as.integer(coal$date))
# the number of accidents in each year
accident <- integer(length(years))
for (i in 1:length(years)) {
  accident[i] <- sum(as.integer(coal$date) == years[i])
}

plot(years, accident, col='blue', lwd=2,
     xlab="year", ylab="# of disasters")
# rug(coal$date)
BIBLIOGRAPHY

1. Allen B. Downey. Think Bayes: Bayesian Statistics in Python. O'Reilly, 2013.
2. D. Gamerman and H. F. Lopes. Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference. Chapman and Hall/CRC, 2006.
3. Peter D. Hoff. A First Course in Bayesian Statistical Methods. Springer Publishing Company, Incorporated, 1st edition, 2009.
4. Sheldon Ross. Introduction to Probability Models. Academic Press, Boston, 2014.
