Bayesian-inference-slides-2021
Class Notes
Manuel Arellano
Revised: February 7, 2021
Introduction
Bayesian methods have traditionally had limited influence in econometrics, but they have become more important with the advent of computer-intensive stochastic simulation algorithms in the 1990s.
Bayesian approaches are also attractive in models with many parameters, such as panel models with individual heterogeneity and flexible nonlinear regression models.
Introduction (continued)
A likelihood function specifies the information in the data about those quantities. Such specification typically involves the use of a priori information in the form of parametric or functional restrictions.
In the Bayesian approach to inference, one not only assigns a probability measure to
the sample space but also to the parameter space.
Outline
The following section introduces the Bayesian way of combining a prior distribution
with the likelihood of the data to generate point and interval estimates.
As a result, frequentist and Bayesian inferences are often very similar and can be
reinterpreted in each other’s terms.
The development of these methods has greatly reduced the computational difficulties that held back Bayesian applications in the past.
Bayesian methods are now not only generally feasible, but sometimes also a better
practical alternative to frequentist methods.
Bayesian inference
Updating a prior distribution with data
Let the density of the data y = (y₁, ..., yₙ) conditional on an unknown parameter θ be f(y₁, ..., yₙ | θ).
If y is an iid sample, f(y₁, ..., yₙ | θ) = ∏ᵢ₌₁ⁿ f(yᵢ | θ), where f(yᵢ | θ) is the pdf of yᵢ.
In survey sampling, f(yᵢ | θ = θ₀) is the pdf of the population, (y₁, ..., yₙ) are n draws from that population, and θ₀ is the true value of θ in the pdf that generated the data.
In short, write f(y | θ) = f(y₁, ..., yₙ | θ), which is also the likelihood function L(θ).
Any prior information about the value of θ is specified in a prior distribution p(θ).
Both the likelihood and the prior are chosen by the researcher.
We combine the prior and the sample, using Bayes' theorem, to obtain the conditional distribution of the parameter given the data, also known as the posterior distribution:

p(θ | y) = f(y, θ) / f(y) = f(y | θ) p(θ) / ∫ f(y | θ) p(θ) dθ.
Note that, as a function of θ, the posterior density is proportional to

p(θ | y) ∝ f(y | θ) p(θ) = L(θ) p(θ).
The posterior density describes how likely it is that a value of θ has generated the
observed data.
Point estimation
The notion of optimality is minimizing mean posterior loss for some loss function ℓ(r):

min_c ∫_Θ ℓ(c − θ) p(θ | y) dθ.
When the prior density is flat, the posterior mode coincides with the maximum
likelihood estimator.
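As a standard illustration (added here; not part of the original notes), particular loss functions lead to familiar point estimators:
ℓ(r) = r² gives the posterior mean E(θ | y);
ℓ(r) = |r| gives the posterior median;
a 0–1 loss (penalizing any miss by the same amount) gives, in the limit of a small tolerance, the posterior mode.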
Interval estimation
The posterior quantiles characterize the posterior uncertainty about the parameter,
and they can be used to obtain interval estimates.
Frequentist confidence intervals and Bayesian credible intervals are the two main interval estimation methods in statistics.
In a confidence interval the coverage probability is calculated from a sampling density, whereas in a credible interval it is calculated from a posterior density.
Bernoulli example
Let us consider a Bernoulli random sample (y₁, ..., yₙ) with likelihood given by

L(θ) = θ^m (1 − θ)^(n−m),

where m = ∑ᵢ₌₁ⁿ yᵢ is the number of successes.
Bernoulli example (continued)
The Beta distribution is a convenient prior because the posterior is also Beta:

p(θ | y) ∝ L(θ) p(θ) ∝ θ^(m+α−1) (1 − θ)^(n−m+β−1).

That is, if θ ~ Beta(α, β), then θ | y ~ Beta(α + m, β + n − m).
It is then said that the Beta distribution is the conjugate prior to the Bernoulli.
The posterior mode is given by

θ̃ = arg max_θ [θ^(m+α−1) (1 − θ)^(n−m+β−1)] = (m + α − 1) / (n + α + β − 2).    (1)
An interesting property of the posterior mode in this example is that it is equivalent to the MLE of a data set with α − 1 additional ones and β − 1 additional zeros.
Such data augmentation interpretation provides guidance on how to choose α and β
in describing a priori knowledge about the probability of success in Bernoulli trials.
It also illustrates the vanishing effect of the prior in a large sample: if n is large, θ̃ ≈ θ̂.
However, ML may not be a satisfactory estimator in a small sample that only contains
zeros if the probability of success is known a priori to be greater than zero.
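A minimal numerical sketch of this example (added illustration; it relies on numpy and scipy, and the values of α, β, θ₀ and n are hypothetical choices):

    import numpy as np
    from scipy import stats

    # Illustrative prior: alpha - 1 extra "ones" and beta - 1 extra "zeros"
    alpha, beta = 2.0, 2.0

    # Simulated Bernoulli data; theta0 plays the role of the true success probability
    rng = np.random.default_rng(0)
    theta0, n = 0.3, 50
    y = rng.binomial(1, theta0, size=n)
    m = y.sum()                                              # number of successes

    # Conjugacy: the posterior is Beta(alpha + m, beta + n - m)
    posterior = stats.beta(alpha + m, beta + n - m)

    theta_mle = m / n                                        # maximum likelihood estimator
    theta_mode = (m + alpha - 1) / (n + alpha + beta - 2)    # posterior mode, equation (1)
    ci_90 = posterior.ppf([0.05, 0.95])                      # 90% equal-tailed credible interval

    print(theta_mle, theta_mode, posterior.mean(), ci_90)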
Specification of prior distribution
Conjugate priors
Informative priors
Other times, a parameter is a random realization drawn from some population, for
example, in a model with individual effects for longitudinal survey data; a situation in
which there exists an actual population prior distribution.
In those cases one would like the prior to accurately express the information available
about the parameters.
However, often little is known a priori and one would like a prior density to just
express lack of information, an issue that we consider next.
Flat priors
For a scalar θ taking values on the entire real line, a flat prior distribution that sets p(θ) = 1 is typically employed as an uninformative prior.
A flat prior is non-informative in the sense of having little impact on the posterior, which is simply a renormalization of the likelihood into a density for θ.
A flat prior is appealing from the point of view of seeking to summarize the likelihood.
Note that a flat prior is improper in the sense that ∫_Θ p(θ) dθ = ∞.
Flat priors are often approximated by a proper prior with a large variance.
Flat priors (continued)
The ML estimator coincides with the posterior mode for the Beta (1, 1) prior, and
with the posterior mean for the Beta (0, 0) prior.
Large-sample Bayesian inference
Introduction
Posterior asymptotic results formalize the notion that the importance of the prior
diminishes as n increases.
They hold under suitable conditions on the prior, the likelihood, and Θ, including:
A prior that assigns positive probability to a neighborhood about θ₀.
A posterior distribution that is not improper.
Identification.
A likelihood that is a continuous function of θ.
A true value θ₀ that is not on the boundary of Θ.
Consistency of the posterior distribution
If the population distribution g(yᵢ) equals f(yᵢ | θ₀) for some θ₀, the posterior is consistent in the sense that it converges to a point mass at θ₀ as n → ∞.
When g(yᵢ) is not in the family f(yᵢ | θ), the pseudo-true value θ₀ makes f(yᵢ | θ) closest to g(yᵢ) in the sense of the Kullback–Leibler divergence KLD(θ). Consistency of the pseudo-posterior also holds in this case.
Discrete parameter space (Gelman et al., 2014)
If Θ is finite and Pr(θ = θ₀) > 0, then Pr(θ = θ₀ | y) → 1 as n → ∞, where θ₀ is the value of θ that minimizes KLD(θ).
To see this, for any θ ≠ θ₀ consider the log posterior odds relative to θ₀:

ln [p(θ | y) / p(θ₀ | y)] = ln [p(θ) / p(θ₀)] + ∑ᵢ₌₁ⁿ ln [f(yᵢ | θ) / f(yᵢ | θ₀)].    (3)

For fixed values of θ and θ₀, if the yᵢ's are iid draws from g(yᵢ), the second term on the right is a sum of n iid random variables with a mean given by

E ln [f(yᵢ | θ) / f(yᵢ | θ₀)] = KLD(θ₀) − KLD(θ) ≤ 0.
Thus, as long as θ₀ is the unique minimizer of KLD(θ), for θ ≠ θ₀ the second term on the right of (3) is the sum of n iid random variables with negative mean.
By the LLN, the sum approaches −∞ as n → ∞. As long as the first term on the right is finite (which holds provided p(θ₀) > 0), the whole expression approaches −∞ in the limit.
Then p(θ | y) / p(θ₀ | y) → 0, and so p(θ | y) → 0.
Moreover, since all probabilities add up to 1, p(θ₀ | y) → 1.
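A small simulation sketch of this result (added illustration; the grid of values, the sample sizes and θ₀ = 0.4 are hypothetical choices for a Bernoulli likelihood):

    import numpy as np

    # Finite parameter space with a uniform prior over it
    thetas = np.array([0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8])
    prior = np.full(len(thetas), 1.0 / len(thetas))

    rng = np.random.default_rng(1)
    theta0 = 0.4                                  # true value, assumed to belong to the grid

    for n in [10, 100, 1000, 10000]:
        y = rng.binomial(1, theta0, size=n)
        m = y.sum()
        # Log posterior up to a constant: log prior plus Bernoulli log likelihood
        log_post = np.log(prior) + m * np.log(thetas) + (n - m) * np.log(1 - thetas)
        post = np.exp(log_post - log_post.max())  # stabilize before normalizing
        post /= post.sum()
        print(n, post.round(3))                   # mass concentrates on theta0 as n grows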
Continuous parameter space
If θ has a continuous distribution, the posterior probability of the single point θ₀, Pr(θ = θ₀ | y), is zero for any finite sample, and so the previous argument does not apply, but it can still be shown that p(θ | y) becomes more and more concentrated about θ₀ as n → ∞, as in the following result.
If θ is defined on a compact set and A is a neighborhood of θ₀ with nonzero prior probability, then Pr(θ ∈ A | y) → 1 as n → ∞, where θ₀ minimizes KLD(θ).
Bernoulli example
Asymptotic normality of the posterior distribution
Asymptotic normality of the posterior distribution (continued)
p(θ | y) ≈ N(θ̂, n⁻¹ I(θ₀)⁻¹).
From a frequentist point of view, this implies that Bayesian methods can be used to obtain statistically efficient estimators and consistent confidence intervals.
Extension to a multidimensional parameter: some intuition
Consider a Taylor expansion of ln p(θ | y) about the posterior mode θ̃:

ln p(θ | y) ≈ ln p(θ̃ | y) + ∂ln p(θ̃ | y)/∂θ′ (θ − θ̃) + 0.5 (θ − θ̃)′ [∂²ln p(θ̃ | y)/∂θ∂θ′] (θ − θ̃)
            = c − 0.5 √n(θ − θ̃)′ [−n⁻¹ ∂²ln p(θ̃ | y)/∂θ∂θ′] √n(θ − θ̃).

Note that ∂ln p(θ̃ | y)/∂θ′ = 0. Moreover,

−n⁻¹ ∂²ln p(θ̃ | y)/∂θ∂θ′ = −n⁻¹ ∂²ln p(θ̃)/∂θ∂θ′ − n⁻¹ ∑ᵢ₌₁ⁿ ∂²ln f(yᵢ | θ̃)/∂θ∂θ′
                          = −n⁻¹ ∑ᵢ₌₁ⁿ ∂²ln f(yᵢ | θ̃)/∂θ∂θ′ + O(1/n) ≈ I(θ̃) + O(1/n).

Thus, the curvature of the log posterior can be approximated by Fisher information:

ln p(θ | y) ≈ c − 0.5 √n(θ − θ̃)′ I(θ̃) √n(θ − θ̃).

Dropping terms that do not include θ, we get the approximation

p(θ | y) ∝ exp[−0.5 (θ − θ̃)′ n I(θ̃) (θ − θ̃)],

which corresponds to the kernel of a multivariate normal density N(θ̃, n⁻¹ I(θ̃)⁻¹).
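A quick numerical check of this approximation in the Bernoulli example (added illustration; the hyperparameters and data summaries are hypothetical, and I(θ) = 1/[θ(1 − θ)] is the Bernoulli Fisher information):

    import numpy as np
    from scipy import stats

    # Compare the exact Beta posterior with N(posterior mode, (n * I(mode))^{-1})
    alpha, beta, n, m = 2.0, 2.0, 200, 70

    posterior = stats.beta(alpha + m, beta + n - m)
    mode = (m + alpha - 1) / (n + alpha + beta - 2)
    fisher = 1.0 / (mode * (1.0 - mode))             # Fisher information per observation
    approx = stats.norm(loc=mode, scale=np.sqrt(1.0 / (n * fisher)))

    grid = np.linspace(0.25, 0.45, 5)
    print(posterior.pdf(grid).round(2))              # exact posterior density
    print(approx.pdf(grid).round(2))                 # large-sample normal approximation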
Asymptotic behavior of the posterior in pseudo-likelihood models
If g(yᵢ) ≠ f(yᵢ | θ) for all θ ∈ Θ, the large-n sampling distribution of the PMLE is

√n(θ̂ − θ₀) →d N(0, Σ_S).

The large-n shape of a posterior obtained from ∏ᵢ₌₁ⁿ f(yᵢ | θ) becomes close to

θ | y ≈ N(θ̂, n⁻¹ Σ_M).

Thus, misspecification produces a discrepancy between the sampling distribution of θ̂ and the shape of the (pseudo-)likelihood.
Asymptotic behavior of the posterior in pseudo-likelihood models (continued)
For the purpose of Bayesian inference about the pseudo-true value θ₀, it makes sense to start from the correct large-sample approximation to the likelihood of θ̂ instead of the (incorrect) approximate likelihood of (y₁, ..., yₙ).
Asymptotic frequentist properties of Bayesian inferences
The posterior mode corresponding to the Beta prior with parameters (α, β) in (1) and the maximum likelihood estimator θ̂ = m/n satisfy

√n(θ̃ − θ) = √n(θ̂ − θ) + Rₙ,

where

Rₙ = [√n / (n + k)] (α − 1 − k m/n)

and k = α + β − 2.
Since Rₙ → 0, it follows that √n(θ̃ − θ) has the same asymptotic distribution as √n(θ̂ − θ), namely N[0, θ(1 − θ)].
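For completeness, the algebra behind Rₙ (an added step, using θ̃ from (1) and k = α + β − 2):

θ̃ − θ̂ = (m + α − 1)/(n + k) − m/n = [n(α − 1) − k m] / [n(n + k)],

so Rₙ = √n(θ̃ − θ̂) = [√n / (n + k)] (α − 1 − k m/n), which converges to zero as n → ∞.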
Robustness to statistical principle and its failures
In the case of unit roots the symmetry of Bayesian probability statements and classical confidence statements breaks down.
With normal errors and a flat prior the Bayesian posterior is normal even if the true data generating process is a random walk (Sims and Uhlig 1991).
Markov chain Monte Carlo methods
Introduction
A Markov Chain Monte Carlo method simulates a series of parameter draws such that
the marginal distribution of the series is the posterior distribution of the parameters.
p(θ | y) ∝ f(y | θ) p(θ).
However, computation of point estimates and credible intervals typically requires the evaluation of integrals of the form

∫_Θ h(θ) f(y | θ) p(θ) dθ / ∫_Θ f(y | θ) p(θ) dθ.
For problems for which no analytic solution exists, MCMC methods provide powerful
tools for evaluating these integrals, especially when θ is high dimensional.
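As a connecting step (added here), note that this ratio is the posterior expectation E[h(θ) | y], so given MCMC draws θ^(1), ..., θ^(M) from p(θ | y) it is approximated by the sample average

(1/M) ∑ⱼ₌₁ᴹ h(θ^(j)).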
Markov chains
Analogously, a 90% interval estimate is constructed simply by taking the 0.05 and 0.95 quantiles of the sequence h(θ^(1)), ..., h(θ^(M)).
Markov chains (continued)
In the theory of Markov chains one looks for conditions under which there exists an invariant distribution, and conditions under which iterations of the transition kernel K(θ′ | θ) converge to the invariant distribution.
In the context of MCMC methods the situation is the reverse: the invariant distribution is known, and in order to generate samples from it the methods look for a transition kernel whose iterations converge to the invariant distribution.
The problem is to find a suitable K(θ′ | θ) that satisfies the invariance property:

p(θ′ | y) = ∫ K(θ′ | θ) p(θ | y) dθ.    (6)
Metropolis-Hastings method
where

ρ(θ | θ^(j)) = min{1, [f(y | θ) p(θ) q(θ^(j) | θ)] / [f(y | θ^(j)) p(θ^(j)) q(θ | θ^(j))]}.
Intuition for how MH deals with a candidate transition (Letham & Rudin 2012)
Thus, for any proposed transition, we accept it with probability min{1, p(θ′ | y) / p(θ | y)}, which corresponds to ρ(θ′ | θ) when the proposal distribution is symmetric: q(θ′ | θ) = q(θ | θ′), as is the case in the original Metropolis algorithm.
The chain of draws so produced spends a relatively high proportion of time in the higher-density regions and a lower proportion in the lower-density regions.
Because these proportions of time are balanced in the right way, the generated sequence of parameter draws has the desired marginal distribution in the limit.
A key practical aspect of this calculation is that the posterior's constant of integration is not needed, since ρ(θ′ | θ) only depends on a posterior ratio.
Choosing a proposal distribution
In practice, one will try several proposal distributions to find out which is most suitable in terms of rejection rates and coverage of the parameter space.
Other practical considerations include discarding a certain number of the first draws to reduce the dependence on the starting point (burn-in), and only retaining every dth iteration of the chain to reduce the dependence between draws (thinning).
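A minimal random-walk Metropolis sketch incorporating these steps (added illustration; it targets the Beta posterior of the Bernoulli example, and the proposal step size, burn-in length and thinning interval are hypothetical choices):

    import numpy as np

    rng = np.random.default_rng(2)
    alpha, beta, n, m = 2.0, 2.0, 50, 15          # illustrative prior and data summaries

    def log_post(theta):
        """Log posterior kernel of the Beta-Bernoulli example (up to a constant)."""
        if theta <= 0.0 or theta >= 1.0:
            return -np.inf
        return (m + alpha - 1) * np.log(theta) + (n - m + beta - 1) * np.log(1 - theta)

    def metropolis(n_draws=20000, step=0.1, burn_in=2000, thin=5):
        theta = 0.5                                # starting point
        draws = []
        for _ in range(n_draws):
            proposal = theta + step * rng.standard_normal()   # symmetric proposal
            # Accept with probability min(1, posterior ratio); log scale for stability
            if np.log(rng.uniform()) < log_post(proposal) - log_post(theta):
                theta = proposal
            draws.append(theta)
        return np.array(draws[burn_in::thin])      # burn-in and thinning

    sample = metropolis()
    print(sample.mean(), np.quantile(sample, [0.05, 0.95]))   # point and 90% interval estimates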
Transition kernel and convergence of the MH algorithm
The MH algorithm describes how to generate a parameter draw θ^(j+1) conditional on a parameter draw θ^(j).
Since the proposal distribution q(θ′ | θ) and the acceptance probability ρ(θ′ | θ) depend only on the current state, the sequence of draws forms a Markov chain.
The MH transition kernel can be written as

K(θ′ | θ) = q(θ′ | θ) ρ(θ′ | θ) + r(θ) δ_θ(θ′).    (8)

The first term, q(θ′ | θ) ρ(θ′ | θ), is the density that θ′ is proposed given θ, times the probability that it is accepted.
To this we add the term r(θ) δ_θ(θ′), which gives the probability r(θ) that, conditional on θ, the proposal is rejected, times the Dirac delta function δ_θ(θ′), equal to one if θ′ = θ and zero otherwise. Here

r(θ) = 1 − ∫ q(θ′ | θ) ρ(θ′ | θ) dθ′.

If the proposal is rejected, then the algorithm sets θ^(j+1) = θ^(j), which means that, conditional on the rejection, the transition density contains a point mass at θ′ = θ, which is captured by the Dirac delta function.
For the MH algorithm to generate a sequence of draws from p(θ | y), a necessary condition is that the posterior distribution is an invariant distribution under the transition kernel (8), namely that it satisfies condition (6).
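A standard way to verify this condition (added here as a reasoning aid) is through detailed balance: with the MH acceptance probability,

p(θ | y) q(θ′ | θ) ρ(θ′ | θ) = min{p(θ | y) q(θ′ | θ), p(θ′ | y) q(θ | θ′)} = p(θ′ | y) q(θ | θ′) ρ(θ | θ′)

for all θ and θ′, and integrating both sides over θ (and adding the rejection term) gives back the invariance condition (6).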
Gibbs sampling
The Gibbs sampler is a fast sampling method that can be used in situations when we
have access to conditional distributions.
The idea behind the Gibbs sampler is to partition the parameter vector into two components θ = (θ₁, θ₂).
Instead of sampling θ^(j+1) directly from K(θ^(j+1) | θ^(j)), one first samples θ₁^(j+1) from p(θ₁ | θ₂^(j)) and then samples θ₂^(j+1) from p(θ₂ | θ₁^(j+1)).
If (θ₁^(j), θ₂^(j)) is a draw from the posterior distribution, so is (θ₁^(j+1), θ₂^(j+1)) generated as above, so that the Gibbs sampler kernel satisfies the invariance property; that is, it has p(θ₁, θ₂ | y) as its stationary distribution.
The Gibbs sampler kernel is

K(θ₁, θ₂ | θ₁′, θ₂′) = p(θ₁ | θ₂′) p(θ₂ | θ₁).
It can be regarded as a special case of MH where the proposal distribution is taken to
be the conditional posterior distribution.
The Gibbs sampler is related to data augmentation. A probit model nicely illustrates
this aspect (Lancaster 2004, Example 4.17).
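A minimal Gibbs sampler sketch (added illustration): for a bivariate normal "posterior" with zero means, unit variances and correlation ρ, a hypothetical stand-in for p(θ₁, θ₂ | y), both full conditionals are normal, so each update is a direct draw.

    import numpy as np

    # Gibbs sampling from a bivariate normal with zero means, unit variances and
    # correlation rho (an illustrative stand-in for a posterior p(theta1, theta2 | y))
    rng = np.random.default_rng(3)
    rho = 0.8
    cond_sd = np.sqrt(1.0 - rho ** 2)                # sd of each full conditional

    n_draws, burn_in = 10000, 1000
    theta1, theta2 = 0.0, 0.0
    draws = np.empty((n_draws, 2))
    for j in range(n_draws):
        theta1 = rng.normal(rho * theta2, cond_sd)   # draw theta1 | theta2
        theta2 = rng.normal(rho * theta1, cond_sd)   # draw theta2 | theta1
        draws[j] = theta1, theta2

    kept = draws[burn_in:]
    print(kept.mean(axis=0), np.corrcoef(kept.T)[0, 1])   # should be near (0, 0) and rho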