0% found this document useful (0 votes)

23 views

Bayesian-inference-slides-2021

Bayesian methods have gained importance in econometrics, particularly in macroeconomics, due to advancements in stochastic simulation algorithms. The document discusses the Bayesian approach to inference, including the combination of prior distributions with likelihoods to generate posterior distributions, and highlights the use of Markov chain Monte Carlo methods to facilitate Bayesian applications. It also covers point and interval estimation, the specification of prior distributions, and the convergence of posterior distributions in large samples.

Uploaded by

jorgebac1718

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

23 views

Bayesian-inference-slides-2021

Uploaded by

jorgebac1718

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 37

Bayesian analysis

Class Notes

Manuel Arellano
Revised: February 7, 2021
Introduction

Bayesian methods have traditionally had limited in‡uence in econometrics, but they
have become more important with the advent of computer-intensive stochastic
simulation algorithms in the 1990s.

This is particularly so in macroeconomics, where applications of Bayesian inference

include vector autoregressions and dynamic stochastic general equilibrium models.

Bayesian approaches are also attractive in models with many parameters, such as
panel models with individual heterogeneity and ‡exible nonlinear regression models.

Examples include discrete choice models of consumer demand in the …elds of

industrial organization and marketing.

2
Introduction (continued)

An empirical study uses data to learn about quantities of interest (parameters).

A likelihood function speci…es the information in the data about those quantities.

Such speci…cation typically involves the use of a priori information in the form of
parametric or functional restrictions.

In the Bayesian approach to inference, one not only assigns a probability measure to
the sample space but also to the parameter space.

Specifying a probability distribution over potential parameter values is the

conventional way of modelling uncertainty in decision-making, and o¤ers a way of
incorporating uncertain prior information into statistical procedures.

3
Outline

The following section introduces the Bayesian way of combining a prior distribution
with the likelihood of the data to generate point and interval estimates.

This is followed by some comments on the speci…cation of prior distributions.

Next we turn to asymptotic approximations; the result is a large-sample equivalence

between Bayesian probability statements and frequentist con…dence statements.

As a result, frequentist and Bayesian inferences are often very similar and can be
reinterpreted in each other’s terms.

Finally, we review Markov chain Monte Carlo methods (MCMC).

The development of these methods has greatly reduced the computational di¢ culties
that held back Bayesian applications in the past.

Bayesian methods are now not only generally feasible, but sometimes also a better
practical alternative to frequentist methods.

The upshot is an emerging Bayesian/frequentist synthesis around increasing

agreement on what works for di¤erent kinds of problems.

4
Bayesian inference

5
Updating a prior distribution with data

Let the density of the data y = (y1 , ..., yn ) conditional on an unknown parameter θ be:
f (y1 , ..., yn j θ ) .
If y is an iid sample f (y1 , ..., yn j θ ) = ∏ni=1 f (yi j θ ) where f (yi j θ ) is the pdf of yi .
In survey sampling f (yi j θ = θ 0 ) is the pdf of the population, (y1 , ..., yn ) are n draws
from such population, and θ 0 is the true value of θ in the pdf that generated the data.
In short write f (y j θ ) = f (y1 , ..., yn j θ ), which is also the likelihood function L (θ ).
Any prior information about the value of θ is speci…ed in a prior distribution p (θ ).
Both the likelihood and the prior are chosen by the researcher.
We combine the prior and the sample, using Bayes’theorem, to obtain the conditional
distribution of the parameter given the data, also known as the posterior distribution:
f (y , θ ) f (y j θ ) p ( θ )
p (θ j y ) = = R .
f (y ) f (y j θ ) p ( θ ) d θ
Note that as a function of θ, the posterior density is proportional to
p ( θ j y ) ∝ f (y j θ ) p ( θ ) = L ( θ ) p ( θ ) .
The posterior density describes how likely it is that a value of θ has generated the
observed data.

6
Point estimation

We can use the posterior density to form optimal point estimates.

The notion of optimality is minimizing mean posterior loss for some loss function ` (r ):
Z
min ` (c θ ) p (θ j y ) d θ
c Θ

The posterior mean Z

θ= θp (θ j y ) d θ
Θ
is the point estimate that minimizes mean squared loss ` (r ) = r 2 .

The posterior median minimizes mean absolute loss ` (r ) = jr j.

The posterior mode e

θ is the maximizer of the posterior density and minimizes mean
Dirac loss.

When the prior density is ‡at, the posterior mode coincides with the maximum
likelihood estimator.

7
Interval estimation

The posterior quantiles characterize the posterior uncertainty about the parameter,
and they can be used to obtain interval estimates.

Any interval (θ ` , θ u ) such that

Z θu
p (θ j y ) d θ = 1 α
θ`
is called a credible interval with coverage probability 1 α.
If the posterior density is unimodal, a common choice is the shortest connected
credible interval or the highest posterior density (HPD) interval.
Often an equal-tail-probability interval is favored because of its simplicity. In such
case, θ ` and θ u are just the α/2 and 1 α/2 posterior quantiles, respectively.
Equal-tail intervals tend to be longer than others, except if the posterior is symmetric.
If the posterior is multi-modal the HPD interval may consist of disjoint segments.

Frequentist con…dence intervals and Bayesian credible intervals are the two main
interval estimation methods in statistics.
In a con…dence interval the coverage probability is calculated from a sampling density,
whereas in a credible interval it is calculated from a posterior density.

8
Bernoulli example

Let us consider a Bernouilli random sample (y1 , ..., yn ) with likelihood given by
L ( θ ) = θ m (1 θ )n m

where m = ∑ni=1 yi . The maximum likelihood estimator is

b m
θ= .
n
Given some prior p (θ ), the posterior mode solves
e
θ = arg max [ln L (θ ) + ln p (θ )] .
θ
Since θ is a probability, a suitable domain for a prior distribution is the (0, 1) interval.
A convenient choice is the Beta distribution:
1
p (θ; α, β) = θ α 1 (1 θ ) β 1
B (α, β)
where B (α, β) is the beta function:
Z 1
1 1
B (α, β) = sα (1 s )β ds.
0
The quantities (α, β) are called prior hyperparameters, to be set according to our a
priori information about θ.

9
Bernoulli example (continued)

The Beta distribution is a convenient prior because the posterior is also Beta:
m +β 1
p (θ j y ) ∝ L (θ ) p (θ ) ∝ θ m +α 1
(1 θ )n
That is, if θ Beta (α, β) then θ j y Beta (α + m, β + n m ).
It is then said that the Beta distribution is the conjugate prior to the Bernoulli.
The posterior mode is given by
h i m+α 1
m +β 1
e
θ = arg max θ m +α 1
(1 θ )n = . (1)
θ n+β 1+α 1
An interesting property of the posterior mode in this example is that it is equivalent
to the MLE of a data set with α 1 additional ones and β 1 additional zeros.
Such data augmentation interpretation provides guidance on how to choose α and β
in describing a priori knowledge about the probability of success in Bernoulli trials.
It also illustrates the vanishing e¤ect of the prior in a large sample: if n is large e
θ b
θ.
However, ML may not be a satisfactory estimator in a small sample that only contains
zeros if the probability of success is known a priori to be greater than zero.

10
Speci…cation of prior distribution

11
Conjugate priors

There is a diversity of considerations involved in the speci…cation of a prior, not

altogether di¤erent from those involved in the speci…cation of a likelihood model.
One consideration in selecting both prior and likelihood is mathematical convenience.
Conjugate priors played a central role for analytical and computational reasons.
Another advantage of conjugate priors is that they can be interpreted in terms of
additional pseudo-data (as in the Bernoulli example).
A prior is conjugate for a family of distributions if the prior and the posterior are of
the same family.
In general, distributions in the exponential family have conjugate priors.
Some likelihood models together with their conjugate priors are the following:
Bernoulli – Beta
Binomial – Beta
Poisson – Gamma
Normal with known variance – Normal
Exponential – Gamma
Uniform – Pareto
Geometric – Beta

12
Informative priors

The argument for using a probability distribution to specify uncertain a priori

information is more compelling when prior knowledge can be associated to past
experience, or to a process of elicitation of consensus expert views.

Other times, a parameter is a random realization drawn from some population, for
example, in a model with individual e¤ects for longitudinal survey data; a situation in
which there exists an actual population prior distribution.

In those cases one would like the prior to accurately express the information available
about the parameters.

However, often little is known a priori and one would like a prior density to just
express lack of information, an issue that we consider next.

13
Flat priors

For a scalar θ taking values on the entire real line a ‡at prior distribution that sets
p (θ ) = 1 is typically employed as an uninformative prior.

A ‡at prior is non-informative in the sense of having little impact on the posterior,
which is simply a renormalization of the likelihood into a density for θ.

A ‡at prior is appealing from the point of view of seeking to summarize the likelihood.

R
Note that a ‡at prior is improper in the sense that Θp (θ ) d θ = ∞.

If an improper prior is combined with a likelihood that cannot be renormalized, the

result is an improper posterior that cannot be used for inference.

Flat priors are often approximated by a proper prior with a large variance.

14
Flat priors (continued)

If p (θ ) is uniform, then the prior of a transformation of θ is not uniform.

If θ > 0, a standard reference prior is a ‡at prior on ln θ, p (ln θ ) = 1, which implies
p (θ ) = 1/θ.
θ
Similarly, if θ 2 (0, 1), a ‡at prior on the logit of θ, ln 1 θ , implies
1
p (θ ) = . (2)
θ (1 θ)
R∞ 1
R1 1
These priors are improper because 0 θ dθ and 0 θ (1 θ ) d θ both diverge.
1
If (y1 , ..., yn ) is i.i.d. N µ, , the standard improper reference prior for (µ, τ ) is to
τ
specify independent ‡at priors on µ and ln τ, so that
p (µ, τ ) = p (µ) p (τ ) = 1/τ.
Je¤reys prior

It is a rule for choosing a non-informative prior that is invariant to transformation:

p (θ ) ∝ [det I (θ )]1/2
where I (θ ) is the information matrix.
If γ = h (θ ) is one-to-one, applying Je¤reys’rule directly to γ we get the same prior
as applying the rule to θ and then transforming to obtain p [h (θ )].
15
Bernoulli example continued

Let us illustrate three standard non-informative priors in the Bernoulli example.

The …rst one is a ‡at prior in the log-odds scale, leading to (2); this is the Beta (0, 0)
distribution since it is the limit of the numerator of the beta distribution as α, β ! 0.
The second is Je¤reys’prior, which in this case is proportional to
1
p (θ ) = p ,
θ (1 θ)
and corresponds to the Beta (0.5, 0.5) distribution.
The third one is the uniform prior p (θ ) = 1, which corresponds to a Beta (1, 1).
All three priors are data augmentation priors:
The Beta (0, 0 ) prior adds no prior observations.
Je¤reys’prior adds one observation with half a success and half a failure.
The uniform prior adds two observations with one success and one failure.

The ML estimator coincides with the posterior mode for the Beta (1, 1) prior, and
with the posterior mean for the Beta (0, 0) prior.

16
Large-sample Bayesian inference

17
Introduction

We typically resort to large-sample approximations to evaluate the performance of a

point estimator and to obtain a (frequentist) con…dence interval.

Here we wish to consider (i) asymptotic approximations to the posterior distribution,

and (ii) the sampling properties of Bayesian estimators in large samples.

The main result is that as n ! ∞ the posterior of θ approaches a multivariate normal

distribution, which is independent of the prior.

The convergence is in probability, where the probability is measured with respect to

the true distribution of y .

Posterior asymptotic results formalize the notion that the importance of the prior
diminishes as n increases.

They hold under suitable conditions on the prior, the likelihood, and Θ, including:
A prior that assigns positive probability to a neighborhood about θ 0 .
A posterior distribution that is not improper.
Identi…cation.
A likelihood that is a continuous function of θ.
A true value θ 0 that is not on the boundary of Θ.
18
Consistency of the posterior distribution
If the population distribution g (yi ) equals f (yi j θ 0 ) for some θ 0 , the posterior is
consistent in the sense that it converges to a point mass at θ 0 as n ! ∞.
When g (yi ) is not in f (yi j θ ), the pseudo true value θ 0 makes f (yi j θ ) closest to
g (yi ) in the KLD sense. Consistency of the pseudo-posterior also holds in this case.
Discrete parameter space (Gelman et al 2014)
If Θ is …nite and Pr (θ = θ 0 ) > 0, then Pr (θ = θ 0 j y ) ! 1 as n ! ∞, where θ 0 is
the value of θ that minimizes KLD (θ ).
To see this, for any θ 6= θ 0 consider the log posterior odds relative to θ 0 :
n
p (θ j y ) p (θ ) f (yi j θ )
ln
p (θ 0 j y )
= ln
p (θ 0 )
+ ∑ ln f (yi j θ 0 )
(3)
i =1
For …xed values of θ and θ 0 , if the yi ’s are iid draws from g (yi ), the second term on
the right is a sum of n iid random variables with a mean given by
f (yi j θ )
E ln = KLD (θ 0 ) KLD (θ ) 0.
f (yi j θ 0 )
Thus, as long as θ 0 is the unique minimizer of KLD (θ ), for θ 6= θ 0 the second term
on the right of (3) is the sum of n iid random variables with negative mean.
By the LLN, the sum approaches ∞ as n ! ∞. As long as the …rst term on the
right is …nite (provided p (θ 0 ) > 0), the whole expression approaches ∞ in the limit.
Then p (θ j y ) /p (θ 0 j y ) ! 0, and so p (θ j y ) ! 0.
Moreover, since all probabilities add up to 1, p (θ 0 j y ) ! 1.
19
Continuous parameter space

If θ has a continuous distribution, p (θ 0 j y ) is zero for any …nite sample, and so the
previous argument does not apply, but it can still be shown that p (θ j y ) becomes
more and more concentrated about θ 0 as n ! ∞, as in the following result.
If θ is de…ned on a compact set and A is a neighborhood of θ 0 with nonzero prior
probability, then Pr (θ 2 A j y ) ! 1 as n ! ∞, where θ 0 minimizes KLD (θ ).

Bernouilli example

The posterior distribution in this case is:

m +β 1
p (θ j y ) ∝ θ m +α 1
(1 θ )n Beta (m + α, n m + β)
with mean and variance given by
m+α m 1
E (θ j y ) = = +O
n+α+β n n
(m + α ) (n m + β ) 1
Var (θ j y ) = =O .
(n + α + β )2 (n + α + β + 1 ) n
p p
As n ! ∞ E (θ j y ) ! θ 0 and Var (θ j y ) ! 0 regardless of the prior distribution.

20
Asymptotic normality of the posterior distribution

We have seen that as n ! ∞ the posterior distribution converges to a degenerate

measure at the true value θ 0 (posterior consistency).
To obtain a non-degenerate limit, we consider the sequence of posterior distributions
p
of γ = n θ b θ , whose densities are given by
1 1
p (γ j y ) = p p b
θ+ p γjy .
n n
The result is that as n ! ∞, p (γ j y ) approaches a normal distribution. This type
of result is known as Bernstein-von Mises Theorem.
A statement for iid data and a scalar θ, under the standard regularity conditions of
MLE asymptotics, the condition that p (θ ) is continuous and positive in an open
neighborhood of θ 0 , and some other technical conditions, is as follows:
Z
!
1 1 2 p
p (γ j y ) q exp 2
γ d γ ! 0.
2πσ 2 2σ θ
θ

where σ2θ = 1/I (θ 0 ).

That is, the L1 distance between the scaled and centered posterior and a N 0, σ2θ
density centered at the random quantity γ goes to zero in probability.

21
Asymptotic normality of the posterior distribution (continued)

Thus, for large n, p (θ j y ) is approximately a random normal density with random

mean parameter b θ and a constant variance parameter I (θ 0 ) 1 /n:

b 1 1
p (θ j y ) N θ, I (θ 0 ) .
n
From a frequentist point of view, this implies that Bayesian methods can be used to
obtain statistically e¢ cient estimators and consistent con…dence intervals.

22
Extension to a multidimensional parameter: some intuition
Consider a Taylor expansion of ln p (θ j y ) about the posterior mode e
θ:
∂ ln p e
θjy 1 0 ∂2 ln p e
θjy
ln p (θ j y ) ln p e
θjy + θ e
θ + θ e
θ θ e
θ
∂θ 0 2 ∂θ∂θ 0
p 0h ip
=c 0.5 n θ e
θ n ∂ ln p e
1 2
θ j y /∂θ∂θ 0 n θ e
θ

Note that ∂ ln p e
θ j y /∂θ 0 = 0. Moreover,
2 e 2 e ∂2 ln f yi j e
1 ∂ ln p θ j y 1 ∂ ln p θ 1 n
θ
n ∂θ∂θ 0
=
n ∂θ∂θ 0
+
n ∑ i =1 ∂θ∂θ 0
1 n
∂2 ln f yi j e
θ 1
=
n ∑ i =1∂θ∂θ 0 I eθ +O
n
Thus, the curvature of the log posterior can be approximated by Fisher information:
p 0 p
ln p (θ j y ) c 0.5 n θ e θ I e θ n θ e θ .
Dropping terms that do not include θ we get the approximation
0
p (θ j y ) ∝ exp[ 0.5 θ e
θ nI e
θ θ e
θ ],
1
which corresponds to the kernel of a multivariate normal density N e
θ, n 1I e
θ .
23
Asymptotic behavior of the posterior in pseudo-likelihood models

If g (yi ) 6= f (yi j θ ) for all θ 2 Θ, the large-n sampling distribution of the PMLE is
p d
n bθ θ 0 ! N (0, ΣS )

where ΣS is the sandwich covariance matrix:

ΣS = ΣM V ΣM ,
1 1
with ΣM = [ E (Hi )] [I (θ 0 )] , V = E (qi qi0 ) and
∂ ln f (yi j θ 0 ) ∂2 ln f (yi j θ 0 )
qi = , Hi =
∂θ ∂θ∂θ 0
In a correctly speci…ed model the information identity holds, but in general V 6= ΣM1 .

The large-n shape of a posterior obtained from Πni=1 f (yi j θ ) becomes close to

b 1
θjy N θ, ΣM .
n
Thus, misspeci…cation produces a discrepancy between the sampling distribution of b
θ
and the shape of the (pseudo)-likelihood.

24
Asymptotic behavior of the posterior in pseudo-likelihood models (continued)

For the purpose of Bayesian inference about the pseudo truth θ 0 it makes sense to
start from the correct large-sample approximation to the likelihood of b
θ instead of the
(incorrect) approximate likelihood of (y1 ...yn ).

That is, to consider a posterior distribution of the form:

1 0
p θjb
θ ∝ exp n θ b
θ ΣS 1 θ b
θ p (θ ) (4)
2
instead of the standard pseudo-posterior
" #
n
p θ j θ ∝ exp ∑ ln f (yi j θ )
b p (θ ) . (5)
i =1

The pseudo-posterior (4) relies on the asymptotic likelihood of b

θ (an "arti…cial"
normal posterior centered at the MLE with sandwich covariance matrix).

This approach is proposed in Müller (2013).

25
Asymptotic frequentist properties of Bayesian inferences

The posterior mode is consistent and asymptotically normal as n ! ∞.

So the large-sample Bayesian statement holds
h i1/2
I e
θ θ e θ jy N (0, I )
alongside the large-sample frequentist statement
h i1/2
I eθ θ e θ jθ N (0, I ) .
These results imply that in regular estimation problems the posterior distribution is
asymptotically the same as the repeated sample distribution.
So, for example, a 95% central posterior interval for θ will cover the true value 95%
of the time under repeated sampling with any …xed true θ.
The frequentist statement speaks of probabilities of e
θ (y ) whereas the Bayesian
statement speaks of probabilities of θ. Speci…cally,
Z Z
Pr (θ r j y ) = 1 (θ r ) p (θ j y ) d θ _ 1 (θ r ) f (y j θ ) p ( θ ) d θ
h i Z
Pr e θ (y ) r j θ 0 = 1 e θ (y ) r f (y j θ 0 ) dy
These results require that the true data distribution is included in the parametric
likelihood family.
26
Bernoulli example continued

The posterior mode corresponding to the beta prior with parameters (α, β) in (1) and
the maximum likelihood estimator b
θ = m/n satisfy
p p
n e
θ θ = n b
θ θ + Rn

where p
n m
Rn = α 1 k .
n+k n
and k = α + β 2.

p p
Since Rn ! 0, it follows that n e
θ θ has the same asymptotic distribution as
p b
n θ θ , namely N [0, θ (1 θ )].

Therefore, the normalized posterior mode has an asymptotic normal distribution,

which is independent of the prior parameters and has the same asymptotic variance as
that of the MLE, so that the posterior mode is asymptotically e¢ cient.

27
Robustness to statistical principle and its failures

The dual frequentist/Bayesian interpretation of many estimation procedures suggests

that it is possible to aim for robustness to statistical philosophies in statistical
methodology, at least in regular estimation problems.

Even for small samples, many statistical methods can be considered as

approximations to Bayesian inferences based on particular prior distributions.

As a way of understanding a statistical procedure, it is often useful to determine the

implicit underlying prior distribution (Gelman et al 2014).

In the case of units roots the symmetry of Bayesian probability statements and
classical con…dence statements breaks down.

With normal errors and a ‡at prior the Bayesian posterior is normal even if the true
data generating process is a random walk (Sims and Uhlig 1991).

28
Markov chain Monte Carlo methods

29
Introduction

A Markov Chain Monte Carlo method simulates a series of parameter draws such that
the marginal distribution of the series is the posterior distribution of the parameters.

The posterior density is proportional to

p ( θ j y ) ∝ f (y j θ ) p ( θ ) .

Usually f (y j θ ) p (θ ) is easy to compute.

However, computation of point estimates and credible intervals typically requires the
evaluation of integrals of the form
R
ΘRh ( θ ) f (y j θ ) p ( θ ) d θ
Θ f (y j θ ) p ( θ ) d θ

for various functions h (.).

For problems for which no analytic solution exists, MCMC methods provide powerful
tools for evaluating these integrals, especially when θ is high dimensional.

30
Markov chains

MCMC is a collection of computational methods that produce an ergodic Markov

chain with the stationary distribution p (θ j y ).
A continuous-state Markov chain is a sequence θ (1 ) , θ (2 ) , ... that satis…es the property:
Pr θ (j +1 ) j θ (j ) , ..., θ (1 ) = Pr θ (j +1 ) j θ (j ) .

The probability Pr θ 0 j θ of transitioning from state θ to state θ 0 is called the

transition kernel and we denote it K θ 0 j θ .
Our interest will be in the steady-state probability distribution of the process.
Given a starting value θ (0 ) , a chain θ (1 ) , θ (2 ) , ..., θ (M ) is generated using a
transition kernel with stationary distribution p (θ j y ), which ensures the convergence
of the marginal distribution of θ (M ) to p (θ j y ).
For su¢ ciently large M , the MCMC methods produce a dependent sample
θ (1 ) , θ (2 ) , ..., θ (M ) whose empirical distribution approaches p (θ j y ).
The ergodicity and construction of the chains usually imply that as M ! ∞,
Z
1 M p
M j∑
b
θ= h θ (j ) ! h (θ ) p (θ j y ) d θ.
=1 Θ

Analogously, a 90% interval estimation is constructed simply by taking the 0.05th and
0.95th quantiles of the sequence h θ (1 ) , ..., h θ (M ) .
31
Markov chains (continued)

In the theory of Markov chains one looks for conditions under which there exists an
invariant distribution, and conditions under which iterations of the transition kernel
K θ 0 j θ converge to the invariant distribution.
In the context of MCMC methods the situation is the reverse: the invariant
distribution is known and in order to generate samples from it the methods look for a
transition kernel whose iterations converge to the invariant distribution.
The problem is to …nd a suitable K θ 0 j θ that satis…es the invariance property :
Z
p θ0 j y = K θ 0 j θ p (θ j y ) d θ. (6)

Under the invariance property, if θ (j ) is a draw from p (θ j y ) then θ (j +1 ) is also a

draw from p (θ j y ).
The steady-state distribution p (θ j y ) satis…es the detailed balance condition:
K θ 0 j θ p (θ j y ) = K θ j θ 0 p θ 0 j y for all θ, θ 0 . (7)
0
The interpretation of equation (7) is that the amount of mass transitioning from θ to
θ is the same as the amount of mass that transitions back from θ to θ 0 .
Two general methods of constructing transition kernels are the Metropolis-Hastings
algorithm and the Gibbs sampler, which we discuss in turn.

32
Metropolis-Hastings method

The MH algorithm proceeds by generating candidates that are either accepted or

rejected according to some probability, which is driven by a ratio of posterior
evaluations. A description of the algorithm is as follows.

Given the posterior density f (y j θ ) p (θ ), known up to a constant, and a prespeci…ed

conditional density q θ 0 j θ called the "proposal distribution", generate
θ (1 ) , θ (2 ) , ..., θ (M ) in the following way:

1 Choose a starting value θ (0 ) .

2 Draw a proposal θ from q θ j θ (j ) .
3 Update θ (j +1 ) from θ (j ) for j = 1, 2..., using
8
< θ with probability ρ θ j θ (j ) ,
θ (j +1 ) =
: θ (j ) with probability 1 ρ θ j θ (j ) ,

where 0 1
f (y j θ ) p ( θ ) q θ (j ) j θ
ρ θ j θ (j ) = min @1, A
f y j θ (j ) p θ (j ) q θ j θ (j )

33
Intuition for how MH deals with a candidate transition (Letham & Rudin 2012)

If p θ 0 j y > p (θ j y ), then for every accepted draw of θ, we should have at least as

many accepted draws of θ 0 and so we always accept the transition θ ! θ 0 .

If p θ 0 j y < p (θ j y ), then for every accepted draw θ, we should have on average

p ( θ 0 jy ) p ( θ 0 jy )
p ( θ jy )
accepted draws of θ 0 . We thus accept the transition with probability p (θ jy ) .

p ( θ 0 jy )
Thus, for any proposed transition, we accept it with probability min 1, p (θ jy ) ,
which corresponds to ρ θ 0 j θ when the proposal distribution is symmetric:
q θ 0 j θ = q θ j θ 0 , as is the case in the original Metropolis algorithm.

The chain of draws so produced spends a relatively high proportion of time in the
higher density regions and a lower proportion in the lower density regions.

Because such proportions of times are balanced in the right way, the generated
sequence of parameter draws has the desired marginal distribution in the limit.

A key practical aspect of this calculation is that the posterior constant of integration
is not needed since ρ θ 0 j θ only depends on a posterior ratio.

34
Choosing a proposal distribution

To guarantee the existence of a stationary distribution, the proposal distribution

q θ 0 j θ should be such that there is a positive density of reaching any state from
any other state.

A popular implementation of the MH algorithm is to use the random walk proposal

distribution:
q θ 0 j θ = N θ, σ2

for some variance σ2 .

In practice, one will try several proposal distributions to …nd out which is most
suitable in terms of rejection rates and coverage of the parameter space.

Other practical considerations include discarding a certain number of the …rst draws
to reduce the dependence on the starting point (burn-in), and only retaining every
d th iteration of the chain to reduce the dependence between draws (thinning).

35
Transition kernel and convergence of the MH algorithm
The MH algorithm describes how to generate a parameter draw θ (j +1 ) conditional on
a parameter draw θ (j ) .
Since the proposal distribution q θ 0 j θ and the acceptance probability ρ θ 0 j θ
depend only on the current state, the sequence of draws forms a Markov chain.
The MH transition kernel can be written as
K θ 0 j θ = q θ 0 j θ ρ θ 0 j θ + r (θ ) δθ θ 0 . (8)
0 0 0
The …rst term q θ j θ ρ θ j θ is the density that θ is proposed given θ, times the
probability that it is accepted.
To this we add the term r (θ ) δθ θ 0 , which gives the probability r (θ ) that
conditional on θ the proposal is rejected times the Dirac delta function δθ θ 0 , equal
to one if θ 0 = θ and zero otherwise. Here
Z
r (θ ) = 1 q θ0 j θ ρ θ0 j θ d θ0.

If the proposal is rejected, then the algorithm sets θ (j +1 ) = θ (j ) , which means that
conditional on the rejection, the transition density contains a point mass at θ = θ 0 ,
which is captured by the Dirac delta function.
For the MH algorithm to generate a sequence of draws from p (θ j y ) a necessary
condition is that the posterior distribution is an invariant distribution under the
transition kernel (8), namely that it satis…es condition (6).
36
Gibbs sampling

The Gibbs sampler is a fast sampling method that can be used in situations when we
have access to conditional distributions.
The idea behind the Gibbs sampler is to partition the parameter vector into two
components θ = (θ 1 , θ 2 ).
(j +1 )
Instead of sampling θ (j +1 ) directly from K θ j θ (j ) , one …rst samples θ 1 from
(j ) (j +1 ) (j +1 )
p θ1 j θ2 and then samples θ 2 from p θ 2 j θ 1 .
(j ) (j ) (j +1 ) (j +1 )
If θ 1 , θ 2 is a draw from the posterior distribution, so is θ 1 , θ2
generated as above, so that the Gibbs sampler kernel satis…es the invariance property;
that is, it has p (θ 1 , θ 2 j y ) as its stationary distribution.
The Gibbs sampler kernel is
K θ 1 , θ 2 j θ 10 , θ 20 = p θ 1 j θ 20 p (θ 2 j θ 1 ) .
It can be regarded as a special case of MH where the proposal distribution is taken to
be the conditional posterior distribution.
The Gibbs sampler is related to data augmentation. A probit model nicely illustrates
this aspect (Lancaster 2004, Example 4.17).

The Plastic Age
No ratings yet
The Plastic Age
64 pages
Lesson 3 Economic Globalization, Poverty, and Inequality
100% (2)
Lesson 3 Economic Globalization, Poverty, and Inequality
13 pages
Alrajhi-0 991078940440279
0% (1)
Alrajhi-0 991078940440279
1 page
bayesian-inference
No ratings yet
bayesian-inference
18 pages
Lecture Notes For Probability and Statistics
No ratings yet
Lecture Notes For Probability and Statistics
7 pages
Bayesian Modelling Tuts-4-9
No ratings yet
Bayesian Modelling Tuts-4-9
6 pages
Mstat Note14 Bayesian Inference FSP
No ratings yet
Mstat Note14 Bayesian Inference FSP
30 pages
MCMC Bayes PDF
No ratings yet
MCMC Bayes PDF
27 pages
Chapter 1 B
No ratings yet
Chapter 1 B
35 pages
Bayesian Inference: The Basics
No ratings yet
Bayesian Inference: The Basics
37 pages
MIT18 650F16 Bayesian Statistics
No ratings yet
MIT18 650F16 Bayesian Statistics
18 pages
Bayesian estimation
No ratings yet
Bayesian estimation
13 pages
MA40189 Notes
No ratings yet
MA40189 Notes
70 pages
Notes4_BayesianLearning
No ratings yet
Notes4_BayesianLearning
8 pages
Chap 2
No ratings yet
Chap 2
28 pages
Introduction To Bayesian Methods: Jessi Cisewski Department of Statistics Yale University
No ratings yet
Introduction To Bayesian Methods: Jessi Cisewski Department of Statistics Yale University
53 pages
Bayesian Data Analysis - Reading Instructions 2: Chapter 2 - Outline
No ratings yet
Bayesian Data Analysis - Reading Instructions 2: Chapter 2 - Outline
36 pages
Bayesian Week2 LectureNotes
No ratings yet
Bayesian Week2 LectureNotes
14 pages
확통1 LectureNote09 on Bayesian Statistical Inference
No ratings yet
확통1 LectureNote09 on Bayesian Statistical Inference
78 pages
Var PPTS
No ratings yet
Var PPTS
249 pages
ln13
No ratings yet
ln13
5 pages
Bayesian Ibrahim
No ratings yet
Bayesian Ibrahim
370 pages
Lecture 10
No ratings yet
Lecture 10
33 pages
zzzz-essential_bayes
No ratings yet
zzzz-essential_bayes
158 pages
Bayesian Statistics: MA501, Statistics For Insurance
No ratings yet
Bayesian Statistics: MA501, Statistics For Insurance
28 pages
CH 5
No ratings yet
CH 5
45 pages
Stat 111
No ratings yet
Stat 111
7 pages
Lecture 6. Bayesian Estimation
No ratings yet
Lecture 6. Bayesian Estimation
14 pages
Conjugate Prior
No ratings yet
Conjugate Prior
5 pages
Bayesian Statistics 01
100% (1)
Bayesian Statistics 01
22 pages
implementation
No ratings yet
implementation
22 pages
PRML Slides 2
No ratings yet
PRML Slides 2
86 pages
25 Intro to Bayesian Inference (1)
No ratings yet
25 Intro to Bayesian Inference (1)
31 pages
Stat 535 C - Statistical Computing & Monte Carlo Methods: Arnaud Doucet
No ratings yet
Stat 535 C - Statistical Computing & Monte Carlo Methods: Arnaud Doucet
23 pages
Baysian Inferences
No ratings yet
Baysian Inferences
20 pages
Bayesian Parameter Estimation
No ratings yet
Bayesian Parameter Estimation
40 pages
Baysian-Slides 16 Bayes Intro
No ratings yet
Baysian-Slides 16 Bayes Intro
49 pages
ANNParameter Estimation-II,III
No ratings yet
ANNParameter Estimation-II,III
2 pages
Lectures 5
No ratings yet
Lectures 5
31 pages
Revision - Bayesian Inference
No ratings yet
Revision - Bayesian Inference
4 pages
Bayes Manuscripts
No ratings yet
Bayes Manuscripts
180 pages
Point Estimation: Definition of Estimators
No ratings yet
Point Estimation: Definition of Estimators
8 pages
15.097: Probabilistic Modeling and Bayesian Analysis
No ratings yet
15.097: Probabilistic Modeling and Bayesian Analysis
42 pages
Modern Bayesian Econometrics
No ratings yet
Modern Bayesian Econometrics
100 pages
Lecture 20 - Bayesian Analysis
No ratings yet
Lecture 20 - Bayesian Analysis
4 pages
ML-Map-and-Bayseian
No ratings yet
ML-Map-and-Bayseian
35 pages
FSMLecture4 - Copy (4)
No ratings yet
FSMLecture4 - Copy (4)
49 pages
Slides 1
No ratings yet
Slides 1
73 pages
Estimation
No ratings yet
Estimation
53 pages
Bayes Lecture Notes
No ratings yet
Bayes Lecture Notes
79 pages
Nonparametric Inference Techniques For High-Dimensional Data: Challenges and Solutions
No ratings yet
Nonparametric Inference Techniques For High-Dimensional Data: Challenges and Solutions
16 pages
Bayesian Inference: A Practical Primer: Outline
No ratings yet
Bayesian Inference: A Practical Primer: Outline
28 pages
Advance Statistics
No ratings yet
Advance Statistics
23 pages
Johnson11MLSS Talk Extras
No ratings yet
Johnson11MLSS Talk Extras
73 pages
Bayesian-Statistics Final 20140416 3
No ratings yet
Bayesian-Statistics Final 20140416 3
38 pages
Lecture 5
No ratings yet
Lecture 5
6 pages
Lecture Material 2.5 - Bayesian Estimation & Concepts
No ratings yet
Lecture Material 2.5 - Bayesian Estimation & Concepts
12 pages
Single Parameter Models
No ratings yet
Single Parameter Models
37 pages
Basics of Probabilistic/Bayesian Modeling and Parameter Estimation
No ratings yet
Basics of Probabilistic/Bayesian Modeling and Parameter Estimation
21 pages
BDA class 8
No ratings yet
BDA class 8
11 pages
ParameterEstimation
No ratings yet
ParameterEstimation
50 pages
Differential Forms
From Everand
Differential Forms
Henri Cartan
5/5 (2)
Square Summable Power Series
From Everand
Square Summable Power Series
Louis de Branges
5/5 (1)
Quantile-methods-slides-2024
No ratings yet
Quantile-methods-slides-2024
35 pages
bootstrap-methods-2020
No ratings yet
bootstrap-methods-2020
16 pages
MI-Firms
No ratings yet
MI-Firms
105 pages
MI-OG
No ratings yet
MI-OG
70 pages
Research Proposal Hilma
No ratings yet
Research Proposal Hilma
14 pages
Lothal Synopsis
100% (1)
Lothal Synopsis
2 pages
Effect of Teaching and Learning Resources On Students Academic Performance at Selected Schools Kagogo Sector, Burera DistrictRlyanda
No ratings yet
Effect of Teaching and Learning Resources On Students Academic Performance at Selected Schools Kagogo Sector, Burera DistrictRlyanda
7 pages
Branch and Bound Search: Node Is A Subproblem (LP Relaxation), Where "Bound" Comes From
No ratings yet
Branch and Bound Search: Node Is A Subproblem (LP Relaxation), Where "Bound" Comes From
12 pages
Celtic Music
100% (3)
Celtic Music
18 pages
Leather Chemistry
75% (4)
Leather Chemistry
16 pages
Dependent Personality Inventory-Revised (DPI-R) - Incorporating A
No ratings yet
Dependent Personality Inventory-Revised (DPI-R) - Incorporating A
85 pages
Student Remediation Plan
No ratings yet
Student Remediation Plan
2 pages
Effectiveness of Utilizing Visuals and Displays in Teaching Kindergarten at Subay Elementary School
No ratings yet
Effectiveness of Utilizing Visuals and Displays in Teaching Kindergarten at Subay Elementary School
2 pages
Object-Oriented Software Engineering: Practical Software Development Using UML and Java
No ratings yet
Object-Oriented Software Engineering: Practical Software Development Using UML and Java
38 pages
Sachal Jo Risalo
No ratings yet
Sachal Jo Risalo
524 pages
Get (eBook PDF) Understanding Employment Relations (UK Higher Education Business Management) free all chapters
100% (10)
Get (eBook PDF) Understanding Employment Relations (UK Higher Education Business Management) free all chapters
42 pages
Transportation Engineering
No ratings yet
Transportation Engineering
71 pages
What Is A Partnership
No ratings yet
What Is A Partnership
79 pages
Social Media Marketing at Reebok India - The Dilemma of ROMI and Beyond
No ratings yet
Social Media Marketing at Reebok India - The Dilemma of ROMI and Beyond
20 pages
Pre Placement Paid Internship ServiceNow
No ratings yet
Pre Placement Paid Internship ServiceNow
4 pages
Kathopnishad
No ratings yet
Kathopnishad
64 pages
Word Problems On Linear Equations
No ratings yet
Word Problems On Linear Equations
7 pages
Omega Air Product Data Sheet Filter Element XR AF and AAF v4.00
No ratings yet
Omega Air Product Data Sheet Filter Element XR AF and AAF v4.00
2 pages
HIkvision Fever Camera
No ratings yet
HIkvision Fever Camera
16 pages
Chapter 9 Term Paper
No ratings yet
Chapter 9 Term Paper
3 pages
Different Formats of Classroom Assessment Tools-Lecture Notes
No ratings yet
Different Formats of Classroom Assessment Tools-Lecture Notes
6 pages
Everything (Almost) You Wanted To Know About A Willow Flute, But Were AFRAID To Ask by Sarah Kirton
No ratings yet
Everything (Almost) You Wanted To Know About A Willow Flute, But Were AFRAID To Ask by Sarah Kirton
8 pages
The Tgrow Model
No ratings yet
The Tgrow Model
18 pages
T.P. Pantzaris and Y. Basiron: 6 The Lauric (Coconut and Palmkernel) Oils 157
No ratings yet
T.P. Pantzaris and Y. Basiron: 6 The Lauric (Coconut and Palmkernel) Oils 157
3 pages
Technological University of The Philippines - Taguig
No ratings yet
Technological University of The Philippines - Taguig
7 pages
NR7505915345058349 Invoice
No ratings yet
NR7505915345058349 Invoice
2 pages

Bayesian-inference-slides-2021

Uploaded by

Bayesian-inference-slides-2021

Uploaded by

Bayesian analysis

This is particularly so in macroeconomics, where applications of Bayesian inference

Examples include discrete choice models of consumer demand in the …elds of

An empirical study uses data to learn about quantities of interest (parameters).

Specifying a probability distribution over potential parameter values is the

This is followed by some comments on the speci…cation of prior distributions.

Next we turn to asymptotic approximations; the result is a large-sample equivalence

Finally, we review Markov chain Monte Carlo methods (MCMC).

The upshot is an emerging Bayesian/frequentist synthesis around increasing

We can use the posterior density to form optimal point estimates.

The posterior mean Z

The posterior median minimizes mean absolute loss ` (r ) = jr j.

The posterior mode e

Any interval (θ ` , θ u ) such that

where m = ∑ni=1 yi . The maximum likelihood estimator is

There is a diversity of considerations involved in the speci…cation of a prior, not

The argument for using a probability distribution to specify uncertain a priori

If an improper prior is combined with a likelihood that cannot be renormalized, the

If p (θ ) is uniform, then the prior of a transformation of θ is not uniform.

It is a rule for choosing a non-informative prior that is invariant to transformation:

Let us illustrate three standard non-informative priors in the Bernoulli example.

We typically resort to large-sample approximations to evaluate the performance of a

Here we wish to consider (i) asymptotic approximations to the posterior distribution,

The main result is that as n ! ∞ the posterior of θ approaches a multivariate normal

The convergence is in probability, where the probability is measured with respect to

The posterior distribution in this case is:

We have seen that as n ! ∞ the posterior distribution converges to a degenerate

where σ2θ = 1/I (θ 0 ).

Thus, for large n, p (θ j y ) is approximately a random normal density with random

where ΣS is the sandwich covariance matrix:

That is, to consider a posterior distribution of the form:

The pseudo-posterior (4) relies on the asymptotic likelihood of b

This approach is proposed in Müller (2013).

The posterior mode is consistent and asymptotically normal as n ! ∞.

Therefore, the normalized posterior mode has an asymptotic normal distribution,

The dual frequentist/Bayesian interpretation of many estimation procedures suggests

Even for small samples, many statistical methods can be considered as

As a way of understanding a statistical procedure, it is often useful to determine the

The posterior density is proportional to

Usually f (y j θ ) p (θ ) is easy to compute.

for various functions h (.).

MCMC is a collection of computational methods that produce an ergodic Markov

The probability Pr θ 0 j θ of transitioning from state θ to state θ 0 is called the

Under the invariance property, if θ (j ) is a draw from p (θ j y ) then θ (j +1 ) is also a

The MH algorithm proceeds by generating candidates that are either accepted or

Given the posterior density f (y j θ ) p (θ ), known up to a constant, and a prespeci…ed

1 Choose a starting value θ (0 ) .

If p θ 0 j y > p (θ j y ), then for every accepted draw of θ, we should have at least as

If p θ 0 j y < p (θ j y ), then for every accepted draw θ, we should have on average

To guarantee the existence of a stationary distribution, the proposal distribution

A popular implementation of the MH algorithm is to use the random walk proposal

for some variance σ2 .

You might also like