
Bayesian analysis

Class Notes

Manuel Arellano
Revised: February 7, 2021
Introduction

Bayesian methods have traditionally had limited influence in econometrics, but they
have become more important with the advent of computer-intensive stochastic
simulation algorithms in the 1990s.

This is particularly so in macroeconomics, where applications of Bayesian inference
include vector autoregressions and dynamic stochastic general equilibrium models.

Bayesian approaches are also attractive in models with many parameters, such as
panel models with individual heterogeneity and flexible nonlinear regression models.

Examples include discrete choice models of consumer demand in the fields of
industrial organization and marketing.

2
Introduction (continued)

An empirical study uses data to learn about quantities of interest (parameters).

A likelihood function specifies the information in the data about those quantities.

Such specification typically involves the use of a priori information in the form of
parametric or functional restrictions.

In the Bayesian approach to inference, one assigns a probability measure not only to
the sample space but also to the parameter space.

Specifying a probability distribution over potential parameter values is the
conventional way of modelling uncertainty in decision-making, and offers a way of
incorporating uncertain prior information into statistical procedures.

3
Outline

The following section introduces the Bayesian way of combining a prior distribution
with the likelihood of the data to generate point and interval estimates.

This is followed by some comments on the specification of prior distributions.

Next we turn to asymptotic approximations; the result is a large-sample equivalence
between Bayesian probability statements and frequentist confidence statements.

As a result, frequentist and Bayesian inferences are often very similar and can be
reinterpreted in each other’s terms.

Finally, we review Markov chain Monte Carlo methods (MCMC).

The development of these methods has greatly reduced the computational difficulties
that held back Bayesian applications in the past.

Bayesian methods are now not only generally feasible, but sometimes also a better
practical alternative to frequentist methods.

The upshot is an emerging Bayesian/frequentist synthesis around increasing
agreement on what works for different kinds of problems.

4
Bayesian inference

5
Updating a prior distribution with data

Let the density of the data y = (y1, ..., yn) conditional on an unknown parameter θ be
f(y1, ..., yn | θ).

If y is an iid sample, f(y1, ..., yn | θ) = ∏_{i=1}^n f(yi | θ), where f(yi | θ) is the pdf of yi.

In survey sampling f(yi | θ = θ0) is the pdf of the population, (y1, ..., yn) are n draws
from that population, and θ0 is the true value of θ in the pdf that generated the data.

In short, write f(y | θ) = f(y1, ..., yn | θ), which is also the likelihood function L(θ).

Any prior information about the value of θ is specified in a prior distribution p(θ).

Both the likelihood and the prior are chosen by the researcher.

We combine the prior and the sample, using Bayes' theorem, to obtain the conditional
distribution of the parameter given the data, also known as the posterior distribution:

p(θ | y) = f(y, θ) / f(y) = f(y | θ) p(θ) / ∫ f(y | θ) p(θ) dθ.

Note that as a function of θ, the posterior density is proportional to

p(θ | y) ∝ f(y | θ) p(θ) = L(θ) p(θ).

The posterior density describes how likely it is that a value of θ has generated the
observed data.
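As a numeric sketch (not part of the notes), the posterior can be evaluated on a grid by normalizing likelihood times prior; the Bernoulli data and the flat prior below are hypothetical choices:

```python
import numpy as np

# Hypothetical Bernoulli data: m = 7 successes in n = 10 trials.
y = np.array([1, 1, 1, 0, 1, 0, 1, 1, 0, 1])
m, n = y.sum(), len(y)

theta = np.linspace(0.001, 0.999, 999)         # grid over the parameter space
prior = np.ones_like(theta)                    # flat prior p(theta) = 1
likelihood = theta**m * (1 - theta)**(n - m)   # L(theta) = f(y | theta)
unnorm = likelihood * prior                    # f(y | theta) p(theta)
d = theta[1] - theta[0]
posterior = unnorm / (unnorm.sum() * d)        # divide by f(y) = the integral

print(posterior.sum() * d)                     # integrates to 1
print(theta[posterior.argmax()])               # with a flat prior, mode = MLE = 0.7
```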

6
Point estimation

We can use the posterior density to form optimal point estimates.

The notion of optimality is minimizing mean posterior loss for some loss function ℓ(r):

min_c ∫_Θ ℓ(c − θ) p(θ | y) dθ.

The posterior mean

θ̄ = ∫_Θ θ p(θ | y) dθ

is the point estimate that minimizes mean squared loss ℓ(r) = r².

The posterior median minimizes mean absolute loss ℓ(r) = |r|.

The posterior mode θ̃ is the maximizer of the posterior density and minimizes mean
Dirac loss.

When the prior density is flat, the posterior mode coincides with the maximum
likelihood estimator.
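These three point estimates can be approximated from posterior draws; a sketch with a hypothetical Beta(8, 4) posterior, where the mode is read off a histogram approximation of the density:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical Beta(8, 4) posterior for theta (e.g. from a Bernoulli sample).
draws = rng.beta(8, 4, size=200_000)

post_mean = draws.mean()        # minimizes mean squared loss
post_median = np.median(draws)  # minimizes mean absolute loss
# Mode via a histogram approximation of the posterior density.
counts, edges = np.histogram(draws, bins=200, range=(0, 1))
post_mode = 0.5 * (edges[counts.argmax()] + edges[counts.argmax() + 1])

print(post_mean)    # analytic mean is 8/12 ~ 0.667
print(post_median)
print(post_mode)    # analytic mode is (8-1)/(8+4-2) = 0.7
```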

7
Interval estimation

The posterior quantiles characterize the posterior uncertainty about the parameter,
and they can be used to obtain interval estimates.

Any interval (θℓ, θu) such that

∫_{θℓ}^{θu} p(θ | y) dθ = 1 − α

is called a credible interval with coverage probability 1 − α.

If the posterior density is unimodal, a common choice is the shortest connected
credible interval or the highest posterior density (HPD) interval.

Often an equal-tail-probability interval is favored because of its simplicity. In that
case, θℓ and θu are just the α/2 and 1 − α/2 posterior quantiles, respectively.

Equal-tail intervals tend to be longer than others, except if the posterior is symmetric.

If the posterior is multi-modal, the HPD interval may consist of disjoint segments.

Frequentist confidence intervals and Bayesian credible intervals are the two main
interval estimation methods in statistics.

In a confidence interval the coverage probability is calculated from a sampling density,
whereas in a credible interval it is calculated from a posterior density.
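Both interval types can be computed from posterior draws; a sketch with a hypothetical skewed Beta(3, 9) posterior, where the shortest connected interval (the HPD interval for a unimodal posterior) is found by scanning sorted draws:

```python
import numpy as np

rng = np.random.default_rng(1)
draws = np.sort(rng.beta(3, 9, size=100_000))  # hypothetical skewed posterior
alpha = 0.10

# Equal-tail 90% credible interval: the 0.05 and 0.95 posterior quantiles.
et = np.quantile(draws, [alpha / 2, 1 - alpha / 2])

# Shortest 90% interval: scan all intervals containing 90% of the sorted
# draws and keep the narrowest one.
m = int(np.floor((1 - alpha) * len(draws)))
widths = draws[m:] - draws[:-m]
i = widths.argmin()
hpd = (draws[i], draws[i + m])

print(et, et[1] - et[0])
print(hpd, hpd[1] - hpd[0])  # never wider than the equal-tail interval
```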

8
Bernoulli example

Let us consider a Bernoulli random sample (y1, ..., yn) with likelihood given by

L(θ) = θ^m (1 − θ)^{n−m},

where m = ∑_{i=1}^n yi. The maximum likelihood estimator is

θ̂ = m / n.

Given some prior p(θ), the posterior mode solves

θ̃ = arg max_θ [ln L(θ) + ln p(θ)].

Since θ is a probability, a suitable domain for a prior distribution is the (0, 1) interval.
A convenient choice is the Beta distribution:

p(θ; α, β) = θ^{α−1} (1 − θ)^{β−1} / B(α, β),

where B(α, β) is the beta function:

B(α, β) = ∫_0^1 s^{α−1} (1 − s)^{β−1} ds.

The quantities (α, β) are called prior hyperparameters, to be set according to our a
priori information about θ.

9
Bernoulli example (continued)

The Beta distribution is a convenient prior because the posterior is also Beta:

p(θ | y) ∝ L(θ) p(θ) ∝ θ^{m+α−1} (1 − θ)^{n−m+β−1}.

That is, if θ ~ Beta(α, β) then θ | y ~ Beta(α + m, β + n − m).

It is then said that the Beta distribution is the conjugate prior to the Bernoulli.

The posterior mode is given by

θ̃ = arg max_θ [θ^{m+α−1} (1 − θ)^{n−m+β−1}] = (m + α − 1) / (n + α + β − 2).   (1)

An interesting property of the posterior mode in this example is that it is equivalent
to the MLE of a data set with α − 1 additional ones and β − 1 additional zeros.

Such a data augmentation interpretation provides guidance on how to choose α and β
in describing a priori knowledge about the probability of success in Bernoulli trials.

It also illustrates the vanishing effect of the prior in a large sample: if n is large, θ̃ ≈ θ̂.

However, ML may not be a satisfactory estimator in a small sample that only contains
zeros if the probability of success is known a priori to be greater than zero.
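The conjugate update and the pseudo-data reading of the mode can be checked numerically; the sample and the Beta(4, 6) hyperparameters below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.binomial(1, 0.3, size=50)  # hypothetical Bernoulli sample
m, n = y.sum(), len(y)
alpha, beta = 4.0, 6.0             # hypothetical prior hyperparameters

# Conjugate update: theta | y ~ Beta(alpha + m, beta + n - m).
post_a, post_b = alpha + m, beta + n - m

# Posterior mode from equation (1): (m + alpha - 1) / (n + alpha + beta - 2).
mode = (m + alpha - 1) / (n + alpha + beta - 2)

# Data-augmentation reading: the MLE after appending alpha - 1 extra ones
# and beta - 1 extra zeros to the sample gives the same number.
m_aug = m + (alpha - 1)
n_aug = n + (alpha - 1) + (beta - 1)
mle_aug = m_aug / n_aug

print(mode, mle_aug)  # identical
```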

10
Specification of prior distribution

11
Conjugate priors

There is a diversity of considerations involved in the specification of a prior, not
altogether different from those involved in the specification of a likelihood model.

One consideration in selecting both prior and likelihood is mathematical convenience.

Conjugate priors have played a central role for analytical and computational reasons.

Another advantage of conjugate priors is that they can be interpreted in terms of
additional pseudo-data (as in the Bernoulli example).

A prior is conjugate for a family of distributions if the prior and the posterior are of
the same family.

In general, distributions in the exponential family have conjugate priors.
Some likelihood models together with their conjugate priors are the following:
Bernoulli – Beta
Binomial – Beta
Poisson – Gamma
Normal with known variance – Normal
Exponential – Gamma
Uniform – Pareto
Geometric – Beta

12
Informative priors

The argument for using a probability distribution to specify uncertain a priori
information is more compelling when prior knowledge can be associated with past
experience, or with a process of elicitation of consensus expert views.

Other times, a parameter is a random realization drawn from some population, for
example in a model with individual effects for longitudinal survey data; a situation in
which there exists an actual population prior distribution.

In those cases one would like the prior to accurately express the information available
about the parameters.

However, often little is known a priori and one would like a prior density to just
express lack of information, an issue that we consider next.

13
Flat priors

For a scalar θ taking values on the entire real line, a flat prior distribution that sets
p(θ) = 1 is typically employed as an uninformative prior.

A flat prior is non-informative in the sense of having little impact on the posterior,
which is simply a renormalization of the likelihood into a density for θ.

A flat prior is appealing from the point of view of seeking to summarize the likelihood.

Note that a flat prior is improper in the sense that ∫_Θ p(θ) dθ = ∞.

If an improper prior is combined with a likelihood that cannot be renormalized, the
result is an improper posterior that cannot be used for inference.

Flat priors are often approximated by a proper prior with a large variance.

14
Flat priors (continued)

If p(θ) is uniform, then the prior of a transformation of θ is not uniform.

If θ > 0, a standard reference prior is a flat prior on ln θ, p(ln θ) = 1, which implies
p(θ) = 1/θ.

Similarly, if θ ∈ (0, 1), a flat prior on the logit of θ, ln[θ/(1 − θ)], implies

p(θ) = 1 / [θ (1 − θ)].   (2)

These priors are improper because ∫_0^∞ (1/θ) dθ and ∫_0^1 1/[θ(1 − θ)] dθ both diverge.

If (y1, ..., yn) is i.i.d. N(µ, 1/τ), the standard improper reference prior for (µ, τ) is to
specify independent flat priors on µ and ln τ, so that

p(µ, τ) = p(µ) p(τ) = 1/τ.

Jeffreys prior

It is a rule for choosing a non-informative prior that is invariant to transformation:

p(θ) ∝ [det I(θ)]^{1/2},

where I(θ) is the information matrix.

If γ = h(θ) is one-to-one, applying Jeffreys' rule directly to γ we get the same prior
as applying the rule to θ and then transforming to obtain p[h(θ)].
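The invariance can be checked numerically in the Bernoulli model, where I(θ) = 1/[θ(1 − θ)]; the evaluation point θ = 0.3 below is an arbitrary choice:

```python
import numpy as np

# Bernoulli model: Fisher information I(theta) = 1 / (theta (1 - theta)).
theta = 0.3  # any interior point

# Jeffreys' rule applied directly to theta:
p_theta = (1.0 / (theta * (1 - theta))) ** 0.5

# Reparametrize to the log-odds gamma = ln[theta / (1 - theta)].
# Fisher information transforms as I_gamma = I(theta) * (dtheta/dgamma)^2,
# with dtheta/dgamma = theta (1 - theta).
dtheta_dgamma = theta * (1 - theta)
I_gamma = (1.0 / (theta * (1 - theta))) * dtheta_dgamma ** 2
p_gamma_jeffreys = I_gamma ** 0.5

# Alternatively, transform the theta-scale Jeffreys prior by the Jacobian:
p_gamma_transformed = p_theta * dtheta_dgamma

print(p_gamma_jeffreys, p_gamma_transformed)  # equal: the rule is invariant
```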
15
Bernoulli example continued

Let us illustrate three standard non-informative priors in the Bernoulli example.

The first one is a flat prior in the log-odds scale, leading to (2); this is the Beta(0, 0)
distribution since it is the limit of the numerator of the beta distribution as α, β → 0.

The second is Jeffreys' prior, which in this case is proportional to

p(θ) = 1 / √[θ (1 − θ)],

and corresponds to the Beta(0.5, 0.5) distribution.

The third one is the uniform prior p(θ) = 1, which corresponds to a Beta(1, 1).

All three priors are data augmentation priors:
The Beta(0, 0) prior adds no prior observations.
Jeffreys' prior adds one observation with half a success and half a failure.
The uniform prior adds two observations with one success and one failure.

The ML estimator coincides with the posterior mode for the Beta(1, 1) prior, and
with the posterior mean for the Beta(0, 0) prior.

16
Large-sample Bayesian inference

17
Introduction

We typically resort to large-sample approximations to evaluate the performance of a
point estimator and to obtain a (frequentist) confidence interval.

Here we wish to consider (i) asymptotic approximations to the posterior distribution,
and (ii) the sampling properties of Bayesian estimators in large samples.

The main result is that as n → ∞ the posterior of θ approaches a multivariate normal
distribution, which is independent of the prior.

The convergence is in probability, where the probability is measured with respect to
the true distribution of y.

Posterior asymptotic results formalize the notion that the importance of the prior
diminishes as n increases.

They hold under suitable conditions on the prior, the likelihood, and Θ, including:
A prior that assigns positive probability to a neighborhood about θ0.
A posterior distribution that is not improper.
Identification.
A likelihood that is a continuous function of θ.
A true value θ0 that is not on the boundary of Θ.
18
Consistency of the posterior distribution
If the population distribution g(yi) equals f(yi | θ0) for some θ0, the posterior is
consistent in the sense that it converges to a point mass at θ0 as n → ∞.

When g(yi) is not in f(yi | θ), the pseudo true value θ0 makes f(yi | θ) closest to
g(yi) in the KLD sense. Consistency of the pseudo-posterior also holds in this case.

Discrete parameter space (Gelman et al 2014)

If Θ is finite and Pr(θ = θ0) > 0, then Pr(θ = θ0 | y) → 1 as n → ∞, where θ0 is
the value of θ that minimizes KLD(θ).

To see this, for any θ ≠ θ0 consider the log posterior odds relative to θ0:

ln[p(θ | y) / p(θ0 | y)] = ln[p(θ) / p(θ0)] + ∑_{i=1}^n ln[f(yi | θ) / f(yi | θ0)].   (3)

For fixed values of θ and θ0, if the yi's are iid draws from g(yi), the second term on
the right is a sum of n iid random variables with a mean given by

E ln[f(yi | θ) / f(yi | θ0)] = KLD(θ0) − KLD(θ) ≤ 0.

Thus, as long as θ0 is the unique minimizer of KLD(θ), for θ ≠ θ0 the second term
on the right of (3) is the sum of n iid random variables with negative mean.

By the LLN, the sum approaches −∞ as n → ∞. As long as the first term on the
right is finite (provided p(θ0) > 0), the whole expression approaches −∞ in the limit.

Then p(θ | y)/p(θ0 | y) → 0, and so p(θ | y) → 0.

Moreover, since all probabilities add up to 1, p(θ0 | y) → 1.
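The argument can be simulated with a hypothetical three-point parameter space; the log posterior odds in (3) drift towards −∞ for θ ≠ θ0 as n grows:

```python
import numpy as np

rng = np.random.default_rng(3)
thetas = np.array([0.3, 0.5, 0.7])  # hypothetical finite parameter space
theta0 = 0.5                        # true value, with positive prior mass
prior = np.array([1/3, 1/3, 1/3])

def log_posterior_odds(n):
    """Log posterior odds ln[p(theta|y)/p(theta0|y)] for each theta, as in (3)."""
    y = rng.binomial(1, theta0, size=n)
    m = y.sum()
    loglik = m * np.log(thetas) + (n - m) * np.log(1 - thetas)
    logpost = np.log(prior) + loglik
    return logpost - logpost[thetas == theta0]

for n in (10, 100, 10_000):
    print(n, log_posterior_odds(n))

# For theta != theta0 the odds are sums of n iid terms with negative mean,
# so they drift to -infinity and Pr(theta = theta0 | y) -> 1.
```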
19
Continuous parameter space

If θ has a continuous distribution, p(θ0 | y) is zero for any finite sample, and so the
previous argument does not apply, but it can still be shown that p(θ | y) becomes
more and more concentrated about θ0 as n → ∞, as in the following result.

If θ is defined on a compact set and A is a neighborhood of θ0 with nonzero prior
probability, then Pr(θ ∈ A | y) → 1 as n → ∞, where θ0 minimizes KLD(θ).

Bernoulli example

The posterior distribution in this case is:

p(θ | y) ∝ θ^{m+α−1} (1 − θ)^{n−m+β−1}, i.e. Beta(m + α, n − m + β),

with mean and variance given by

E(θ | y) = (m + α) / (n + α + β) = m/n + O(1/n)

Var(θ | y) = (m + α)(n − m + β) / [(n + α + β)² (n + α + β + 1)] = O(1/n).

As n → ∞, E(θ | y) →p θ0 and Var(θ | y) →p 0, regardless of the prior distribution.

20
Asymptotic normality of the posterior distribution

We have seen that as n → ∞ the posterior distribution converges to a degenerate
measure at the true value θ0 (posterior consistency).

To obtain a non-degenerate limit, we consider the sequence of posterior distributions
of γ = √n (θ − θ̂), whose densities are given by

p(γ | y) = (1/√n) p(θ̂ + γ/√n | y).

The result is that as n → ∞, p(γ | y) approaches a normal distribution. This type
of result is known as the Bernstein-von Mises theorem.

A statement for iid data and a scalar θ, under the standard regularity conditions of
MLE asymptotics, the condition that p(θ) is continuous and positive in an open
neighborhood of θ0, and some other technical conditions, is as follows:

∫ | p(γ | y) − (2πσθ²)^{−1/2} exp[−γ²/(2σθ²)] | dγ →p 0,

where σθ² = 1/I(θ0).

That is, the L1 distance between the posterior, scaled and centered at the random
quantity θ̂, and a N(0, σθ²) density goes to zero in probability.

21
Asymptotic normality of the posterior distribution (continued)

Thus, for large n, p(θ | y) is approximately a random normal density with random
mean parameter θ̂ and a constant variance parameter I(θ0)^{−1}/n:

p(θ | y) ≈ N(θ̂, (1/n) I(θ0)^{−1}).

From a frequentist point of view, this implies that Bayesian methods can be used to
obtain statistically efficient estimators and consistent confidence intervals.

22
Extension to a multidimensional parameter: some intuition
Consider a Taylor expansion of ln p(θ | y) about the posterior mode θ̃:

ln p(θ | y) ≈ ln p(θ̃ | y) + [∂ ln p(θ̃ | y)/∂θ'] (θ − θ̃)
              + (1/2) (θ − θ̃)' [∂² ln p(θ̃ | y)/∂θ∂θ'] (θ − θ̃)
           = c − 0.5 √n (θ − θ̃)' [−(1/n) ∂² ln p(θ̃ | y)/∂θ∂θ'] √n (θ − θ̃).

Note that ∂ ln p(θ̃ | y)/∂θ' = 0. Moreover,

−(1/n) ∂² ln p(θ̃ | y)/∂θ∂θ' = −(1/n) ∂² ln p(θ̃)/∂θ∂θ' − (1/n) ∑_{i=1}^n ∂² ln f(yi | θ̃)/∂θ∂θ'
                             = −(1/n) ∑_{i=1}^n ∂² ln f(yi | θ̃)/∂θ∂θ' + O(1/n) ≈ I(θ̃) + O(1/n).

Thus, the curvature of the log posterior can be approximated by Fisher information:

ln p(θ | y) ≈ c − 0.5 √n (θ − θ̃)' I(θ̃) √n (θ − θ̃).

Dropping terms that do not include θ we get the approximation

p(θ | y) ∝ exp[−0.5 (θ − θ̃)' n I(θ̃) (θ − θ̃)],

which corresponds to the kernel of a multivariate normal density N(θ̃, n^{−1} I(θ̃)^{−1}).
23
23
Asymptotic behavior of the posterior in pseudo-likelihood models

If g(yi) ≠ f(yi | θ) for all θ ∈ Θ, the large-n sampling distribution of the PMLE is

√n (θ̂ − θ0) →d N(0, ΣS),

where ΣS is the sandwich covariance matrix:

ΣS = ΣM V ΣM,

with ΣM = [−E(Hi)]^{−1} = [I(θ0)]^{−1}, V = E(qi qi') and

qi = ∂ ln f(yi | θ0)/∂θ,   Hi = ∂² ln f(yi | θ0)/∂θ∂θ'.

In a correctly specified model the information identity holds, but in general V ≠ ΣM^{−1}.

The large-n shape of a posterior obtained from ∏_{i=1}^n f(yi | θ) becomes close to

θ | y ≈ N(θ̂, (1/n) ΣM).

Thus, misspecification produces a discrepancy between the sampling distribution of θ̂
and the shape of the (pseudo-)likelihood.

24
Asymptotic behavior of the posterior in pseudo-likelihood models (continued)

For the purpose of Bayesian inference about the pseudo truth θ0 it makes sense to
start from the correct large-sample approximation to the likelihood of θ̂ instead of the
(incorrect) approximate likelihood of (y1, ..., yn).

That is, to consider a posterior distribution of the form:

p(θ | θ̂) ∝ exp[−(n/2) (θ − θ̂)' ΣS^{−1} (θ − θ̂)] p(θ)   (4)

instead of the standard pseudo-posterior

p(θ | y) ∝ exp[∑_{i=1}^n ln f(yi | θ)] p(θ).   (5)

The pseudo-posterior (4) relies on the asymptotic likelihood of θ̂ (an "artificial"
normal posterior centered at the MLE with sandwich covariance matrix).

This approach is proposed in Müller (2013).

25
Asymptotic frequentist properties of Bayesian inferences

The posterior mode is consistent and asymptotically normal as n → ∞.

So the large-sample Bayesian statement

[I(θ̃)]^{1/2} (θ − θ̃) | y ≈ N(0, I)

holds alongside the large-sample frequentist statement

[I(θ̃)]^{1/2} (θ − θ̃) | θ ≈ N(0, I).

These results imply that in regular estimation problems the posterior distribution is
asymptotically the same as the repeated sample distribution.

So, for example, a 95% central posterior interval for θ will cover the true value 95%
of the time under repeated sampling with any fixed true θ.

The frequentist statement speaks of probabilities of θ̃(y) whereas the Bayesian
statement speaks of probabilities of θ. Specifically,

Pr(θ ≤ r | y) = ∫ 1(θ ≤ r) p(θ | y) dθ ∝ ∫ 1(θ ≤ r) f(y | θ) p(θ) dθ

Pr[θ̃(y) ≤ r | θ0] = ∫ 1[θ̃(y) ≤ r] f(y | θ0) dy.

These results require that the true data distribution is included in the parametric
likelihood family.
26
Bernoulli example continued

The posterior mode corresponding to the beta prior with parameters (α, β) in (1) and
the maximum likelihood estimator θ̂ = m/n satisfy

√n (θ̃ − θ) = √n (θ̂ − θ) + Rn,

where

Rn = [√n / (n + k)] (α − 1 − k m/n)

and k = α + β − 2.

Since Rn →p 0, it follows that √n (θ̃ − θ) has the same asymptotic distribution as
√n (θ̂ − θ), namely N[0, θ(1 − θ)].

Therefore, the normalized posterior mode has an asymptotic normal distribution,
which is independent of the prior parameters and has the same asymptotic variance as
that of the MLE, so that the posterior mode is asymptotically efficient.
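The decomposition can be verified numerically; the sample counts and the Beta(3, 5) prior below are hypothetical:

```python
import numpy as np

# Hypothetical Bernoulli sample statistics and Beta(alpha, beta) prior.
n, m = 200, 130
alpha, beta = 3.0, 5.0
k = alpha + beta - 2

theta_ml = m / n                        # MLE
theta_mode = (m + alpha - 1) / (n + k)  # posterior mode, equation (1)

# R_n = [sqrt(n) / (n + k)] (alpha - 1 - k m/n)
Rn = np.sqrt(n) / (n + k) * (alpha - 1 - k * m / n)

# Check the decomposition: sqrt(n)(mode - theta) = sqrt(n)(MLE - theta) + R_n,
# i.e. sqrt(n) * (mode - MLE) = R_n.
print(np.sqrt(n) * (theta_mode - theta_ml), Rn)  # equal up to rounding

# R_n shrinks like 1/sqrt(n): the prior washes out in large samples.
for nn in (10**2, 10**4, 10**6):
    mm = int(0.65 * nn)
    print(nn, np.sqrt(nn) / (nn + k) * (alpha - 1 - k * mm / nn))
```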

27
Robustness to statistical principle and its failures

The dual frequentist/Bayesian interpretation of many estimation procedures suggests
that it is possible to aim for robustness to statistical philosophies in statistical
methodology, at least in regular estimation problems.

Even for small samples, many statistical methods can be considered as
approximations to Bayesian inferences based on particular prior distributions.

As a way of understanding a statistical procedure, it is often useful to determine the
implicit underlying prior distribution (Gelman et al 2014).

In the case of unit roots the symmetry of Bayesian probability statements and
classical confidence statements breaks down.

With normal errors and a flat prior the Bayesian posterior is normal even if the true
data generating process is a random walk (Sims and Uhlig 1991).

28
Markov chain Monte Carlo methods

29
Introduction

A Markov chain Monte Carlo method simulates a series of parameter draws such that
the marginal distribution of the series is the posterior distribution of the parameters.

The posterior density is proportional to

p(θ | y) ∝ f(y | θ) p(θ).

Usually f(y | θ) p(θ) is easy to compute.

However, computation of point estimates and credible intervals typically requires the
evaluation of integrals of the form

∫_Θ h(θ) f(y | θ) p(θ) dθ / ∫_Θ f(y | θ) p(θ) dθ

for various functions h(·).

For problems for which no analytic solution exists, MCMC methods provide powerful
tools for evaluating these integrals, especially when θ is high dimensional.

30
Markov chains

MCMC is a collection of computational methods that produce an ergodic Markov
chain with the stationary distribution p(θ | y).

A continuous-state Markov chain is a sequence θ^(1), θ^(2), ... that satisfies the property:

Pr[θ^(j+1) | θ^(j), ..., θ^(1)] = Pr[θ^(j+1) | θ^(j)].

The probability Pr(θ' | θ) of transitioning from state θ to state θ' is called the
transition kernel and we denote it K(θ' | θ).

Our interest will be in the steady-state probability distribution of the process.

Given a starting value θ^(0), a chain θ^(1), θ^(2), ..., θ^(M) is generated using a
transition kernel with stationary distribution p(θ | y), which ensures the convergence
of the marginal distribution of θ^(M) to p(θ | y).

For sufficiently large M, the MCMC methods produce a dependent sample
θ^(1), θ^(2), ..., θ^(M) whose empirical distribution approaches p(θ | y).

The ergodicity and construction of the chains usually imply that as M → ∞,

(1/M) ∑_{j=1}^M h(θ^(j)) →p ∫_Θ h(θ) p(θ | y) dθ.

Analogously, a 90% interval estimate is constructed simply by taking the 0.05 and
0.95 quantiles of the sequence h(θ^(1)), ..., h(θ^(M)).
31
Markov chains (continued)

In the theory of Markov chains one looks for conditions under which there exists an
invariant distribution, and conditions under which iterations of the transition kernel
K(θ' | θ) converge to the invariant distribution.

In the context of MCMC methods the situation is the reverse: the invariant
distribution is known, and in order to generate samples from it the methods look for a
transition kernel whose iterations converge to the invariant distribution.

The problem is to find a suitable K(θ' | θ) that satisfies the invariance property:

p(θ' | y) = ∫ K(θ' | θ) p(θ | y) dθ.   (6)

Under the invariance property, if θ^(j) is a draw from p(θ | y) then θ^(j+1) is also a
draw from p(θ | y).

The steady-state distribution p(θ | y) satisfies the detailed balance condition:

K(θ' | θ) p(θ | y) = K(θ | θ') p(θ' | y)   for all θ, θ'.   (7)

The interpretation of equation (7) is that the amount of mass transitioning from θ to
θ' is the same as the amount of mass that transitions back from θ' to θ.

Two general methods of constructing transition kernels are the Metropolis-Hastings
algorithm and the Gibbs sampler, which we discuss in turn.

32
Metropolis-Hastings method

The MH algorithm proceeds by generating candidates that are either accepted or
rejected according to some probability, which is driven by a ratio of posterior
evaluations. A description of the algorithm is as follows.

Given the posterior density f(y | θ) p(θ), known up to a constant, and a prespecified
conditional density q(θ' | θ) called the "proposal distribution", generate
θ^(1), θ^(2), ..., θ^(M) in the following way:

1 Choose a starting value θ^(0).
2 Draw a proposal θ* from q(θ* | θ^(j)).
3 Update θ^(j+1) from θ^(j) for j = 1, 2, ..., using

θ^(j+1) = θ*     with probability ρ(θ* | θ^(j)),
θ^(j+1) = θ^(j)  with probability 1 − ρ(θ* | θ^(j)),

where

ρ(θ* | θ^(j)) = min{1, [f(y | θ*) p(θ*) q(θ^(j) | θ*)] / [f(y | θ^(j)) p(θ^(j)) q(θ* | θ^(j))]}.
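As a sketch (not from the notes), a random-walk MH sampler for a Bernoulli posterior with a flat prior; the counts m = 30, n = 100 and the proposal scale σ = 0.1 are hypothetical choices, and the exact Beta(31, 71) posterior serves as a check:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical target: Bernoulli likelihood with m = 30 successes in n = 100
# trials and a flat prior, so the posterior is Beta(31, 71) exactly.
m, n = 30, 100

def log_post(theta):
    """Log of f(y | theta) p(theta), up to a constant, with a flat prior."""
    if theta <= 0.0 or theta >= 1.0:
        return -np.inf
    return m * np.log(theta) + (n - m) * np.log(1 - theta)

def metropolis_hastings(n_draws, sigma=0.1, theta0=0.5):
    draws = np.empty(n_draws)
    theta = theta0
    for j in range(n_draws):
        prop = theta + sigma * rng.standard_normal()  # symmetric RW proposal
        # q cancels for a symmetric proposal, so rho = min(1, posterior ratio)
        if np.log(rng.uniform()) < log_post(prop) - log_post(theta):
            theta = prop                               # accept the candidate
        draws[j] = theta                               # else keep current state
    return draws

chain = metropolis_hastings(100_000)[5_000:]  # drop burn-in draws
print(chain.mean())  # Beta(31, 71) mean is 31/102 ~ 0.304
print(chain.std())
```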

33
Intuition for how MH deals with a candidate transition (Letham & Rudin 2012)

If p(θ' | y) > p(θ | y), then for every accepted draw of θ we should have at least as
many accepted draws of θ', and so we always accept the transition θ → θ'.

If p(θ' | y) < p(θ | y), then for every accepted draw θ we should have on average
p(θ' | y)/p(θ | y) accepted draws of θ'. We thus accept the transition with probability
p(θ' | y)/p(θ | y).

Thus, for any proposed transition, we accept it with probability min{1, p(θ' | y)/p(θ | y)},
which corresponds to ρ(θ' | θ) when the proposal distribution is symmetric:
q(θ' | θ) = q(θ | θ'), as is the case in the original Metropolis algorithm.

The chain of draws so produced spends a relatively high proportion of time in the
higher density regions and a lower proportion in the lower density regions.

Because such proportions of time are balanced in the right way, the generated
sequence of parameter draws has the desired marginal distribution in the limit.

A key practical aspect of this calculation is that the posterior constant of integration
is not needed, since ρ(θ' | θ) only depends on a posterior ratio.

34
Choosing a proposal distribution

To guarantee the existence of a stationary distribution, the proposal distribution
q(θ' | θ) should be such that there is a positive density of reaching any state from
any other state.

A popular implementation of the MH algorithm is to use the random walk proposal
distribution:

q(θ' | θ) = N(θ, σ²)

for some variance σ².

In practice, one will try several proposal distributions to find out which is most
suitable in terms of rejection rates and coverage of the parameter space.

Other practical considerations include discarding a certain number of the first draws
to reduce the dependence on the starting point (burn-in), and only retaining every
d-th iteration of the chain to reduce the dependence between draws (thinning).

35
Transition kernel and convergence of the MH algorithm
The MH algorithm describes how to generate a parameter draw θ^(j+1) conditional on
a parameter draw θ^(j).

Since the proposal distribution q(θ' | θ) and the acceptance probability ρ(θ' | θ)
depend only on the current state, the sequence of draws forms a Markov chain.

The MH transition kernel can be written as

K(θ' | θ) = q(θ' | θ) ρ(θ' | θ) + r(θ) δθ(θ').   (8)

The first term q(θ' | θ) ρ(θ' | θ) is the density that θ' is proposed given θ, times the
probability that it is accepted.

To this we add the term r(θ) δθ(θ'), which gives the probability r(θ) that
conditional on θ the proposal is rejected, times the Dirac delta function δθ(θ'), equal
to one if θ' = θ and zero otherwise. Here

r(θ) = 1 − ∫ q(θ' | θ) ρ(θ' | θ) dθ'.

If the proposal is rejected, then the algorithm sets θ^(j+1) = θ^(j), which means that
conditional on the rejection, the transition density contains a point mass at θ' = θ,
which is captured by the Dirac delta function.

For the MH algorithm to generate a sequence of draws from p(θ | y), a necessary
condition is that the posterior distribution is an invariant distribution under the
transition kernel (8), namely that it satisfies condition (6).
36
Gibbs sampling

The Gibbs sampler is a fast sampling method that can be used in situations when we
have access to conditional distributions.

The idea behind the Gibbs sampler is to partition the parameter vector into two
components θ = (θ1, θ2).

Instead of sampling θ^(j+1) directly from K(θ^(j+1) | θ^(j)), one first samples θ1^(j+1) from
p(θ1 | θ2^(j)) and then samples θ2^(j+1) from p(θ2 | θ1^(j+1)).

If (θ1^(j), θ2^(j)) is a draw from the posterior distribution, so is (θ1^(j+1), θ2^(j+1))
generated as above, so that the Gibbs sampler kernel satisfies the invariance property;
that is, it has p(θ1, θ2 | y) as its stationary distribution.

The Gibbs sampler kernel is

K(θ1, θ2 | θ1', θ2') = p(θ1 | θ2') p(θ2 | θ1).

It can be regarded as a special case of MH where the proposal distribution is taken to
be the conditional posterior distribution.

The Gibbs sampler is related to data augmentation. A probit model nicely illustrates
this aspect (Lancaster 2004, Example 4.17).
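A minimal sketch (not from the notes) of a two-block Gibbs sampler for a hypothetical bivariate normal "posterior" with correlation ρ = 0.8, chosen because both conditionals are available in closed form:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical target: bivariate normal with zero means, unit variances and
# correlation rho, so theta1 | theta2 ~ N(rho * theta2, 1 - rho^2), and
# symmetrically for theta2 | theta1.
rho = 0.8

def gibbs(n_draws, burn_in=1_000):
    t1, t2 = 0.0, 0.0                 # arbitrary starting values
    out = np.empty((n_draws, 2))
    for j in range(burn_in + n_draws):
        # Alternate draws from the two conditional posteriors.
        t1 = rho * t2 + np.sqrt(1 - rho**2) * rng.standard_normal()
        t2 = rho * t1 + np.sqrt(1 - rho**2) * rng.standard_normal()
        if j >= burn_in:
            out[j - burn_in] = (t1, t2)
    return out

draws = gibbs(200_000)
print(draws.mean(axis=0))          # close to (0, 0)
print(np.corrcoef(draws.T)[0, 1])  # close to rho = 0.8
```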

37
