Bayesian Methods in Applied Econometrics, Or, Why Econometrics Should Always and Everywhere Be Bayesian
Christopher A. Sims
Princeton University
[email protected]
August 6, 2007
Introduction
• The first part of the talk makes some unoriginal claims about the role of Bayesian
thinking.
• A summer seminar talk seems a good place to restate claims that are both obvious and outrageous and that therefore (for one of these reasons or the other) are usually excluded from journal articles.
• The latter part of the talk discusses some areas of econometric application
where frequentist asymptotics seems particularly persistent and suggests how
Bayesian approaches might become more practical and prevalent.
© 2007 by Christopher A. Sims. This document may be reproduced for educational and research purposes, so long as the copies contain this notice and are retained for personal use or distributed free.
– A 95% confidence interval contains the true parameter value with probability .95 only before one has seen the data. After the data has been seen, the probability is zero or one.
– Yet confidence intervals are universally interpreted in practice as guides
to post-sample uncertainty.
– They often are reasonable guides, but only because they often are close to posterior probability intervals that would emerge from a Bayesian analysis (a small numerical sketch appears below).
• They often are quite familiar with the language of probability and may ask,
for example, for what the data say about the odds of a parameter being in one
region vs. being in another.
• The Bayesian approach to inference should also be the starting point of our education of econometricians. For the time being, they also need to learn what a confidence region is (what it really is, as opposed to what most of them think it is after a one-year statistics or econometrics course). But I think that full understanding of what confidence regions and hypothesis tests actually are will lead to declining interest in constructing them.
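To make the point about confidence intervals and posterior intervals concrete, here is a minimal numerical sketch in Python, under assumptions chosen purely for illustration: a normal mean with known variance and a flat prior, the textbook case in which the two intervals coincide exactly.

# Sketch: for a normal mean with known variance and a flat prior, the usual
# 95% confidence interval and the equal-tailed 95% posterior interval are the
# same numbers; only their probabilistic interpretations differ (pre-sample
# coverage versus post-sample probability).
import numpy as np

rng = np.random.default_rng(0)
sigma, n, mu_true = 1.0, 25, 0.3
y = rng.normal(mu_true, sigma, size=n)

ybar = y.mean()
se = sigma / np.sqrt(n)
z = 1.959963984540054          # 97.5% quantile of the standard normal

ci = (ybar - z * se, ybar + z * se)          # frequentist 95% confidence interval
posterior = (ybar - z * se, ybar + z * se)   # 95% posterior interval: mu | y ~ N(ybar, sigma^2/n)

print("confidence interval:", ci)
print("posterior interval: ", posterior)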
1.2 Objections
It’s subjective, whereas frequentist approaches are objective
• The objective aspect of Bayesian inference is the set of rules for transforming an
initial distribution into an updated distribution conditional on observations.
• Bayesian thinking therefore makes it clear that, for decision-making, pre-sample beliefs are in general important.
• In such a situation, as was pointed out long ago by Hildreth (1963) and Savage (1977, pp. 14–15), the task is to present useful information about the shape of the likelihood.
• If the dimension is high, present slices of it, marginalizations of it, and implied expected values of functions of the parameter. The functions might be chosen in the light of possible decision applications.
• The conditions that allow these claims about asymptotic likelihood shapes are
almost the same as the “weak” conditions allowing asymptotic distribution
claims for estimators.
Assumptions, continued: Asymptotic results under weak assumptions do not justify small-sample assertions under weak assumptions
• It will be consistent and allow accurate confidence statements for a large class
of functions f — in large samples under weak assumptions.
• I think it’s Don Berry who said, “Bayesian inference is hard in the sense that
thinking is hard.”
• But it is true that Bayesian inference is usually described and often implemented in a bottom-up way: define your model, define your prior, apply Bayes rule, emerge with unique, optimal, correct inference.
• This is indeed often hard. Not just computationally, in the third step, but intellectually, in the first two.
• Frequentist inference could be approached in the same way: Define your model,
derive fully efficient estimators, pay no attention to anything else.
• This is actually even harder, and furthermore is not guaranteed to give you any
answer, or a unique answer as to how to proceed.
Hard, continued: Frequentists take shortcuts because doing it right is harder for
them.
• It is standard practice for frequentists to write papers about methods that are
convenient, or intuitively appealing, and describe their properties, often only
approximately, based on asymptotics.
• People use such procedures, often in preference to more elaborate fully efficient
procedures.
• Bayesians can also analyze the properties of estimators or statistics that are not
optimal given the model and prior.
• They also can use asymptotic approximation as a shortcut when exact derivations are hard or as a way of avoiding the need to commit fully to all the details of a probability model.
• They should probably do more of this, and also stop claiming that it is an advantage of Bayesian procedures that they give “exact small sample results”.
• There is always at least an implicit alternative, and there are always alternatives under which a given test would be senseless.
• The view that formal econometrics leads to “testing” and “rejecting” models without presenting an alternative is part of what has given econometrics a bad name in some quarters (e.g. among macro calibrators).
But much of frequentist econometric theory looks like wasted effort
• Or most of the “testing for breaks” literature. (In the simple case of a single break for a regression model, one can easily compute integrated posteriors for each possible break date, and the plot of these gives the posterior for the break date under a uniform prior. No analysis of the asymptotic behavior of the likelihood, viewed as a stochastic process indexed by the break date, is needed. A computational sketch of this calculation follows this list.)
• These are instances where tracing out the implications of the likelihood is quite
straightforward in a given sample, but frequentist results are much harder —
and less useful.
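To illustrate how straightforward the break-date calculation is, here is a small Python sketch. It assumes a single break in a linear regression with a common error variance, a uniform prior on the break date, and a flat (improper) prior on the two regimes' coefficients and on log σ; because the improper-prior constant is the same for every candidate date, it cancels when the integrated likelihoods are normalized across dates. The data-generating process at the end is made up for the example.

# Posterior over a single break date tau: p(tau | y) is proportional to
#   |X1'X1|^(-1/2) |X2'X2|^(-1/2) (RSS1 + RSS2)^(-(T - 2k)/2),
# the likelihood integrated over the two coefficient vectors and sigma.
import numpy as np

def seg_pieces(y, X):
    # pieces of the integrated likelihood for one regime:
    # -0.5*log|X'X| and the residual sum of squares at the OLS fit
    XtX = X.T @ X
    beta_hat = np.linalg.solve(XtX, X.T @ y)
    rss = float(np.sum((y - X @ beta_hat) ** 2))
    _, logdet = np.linalg.slogdet(XtX)
    return -0.5 * logdet, rss

def break_date_posterior(y, X, min_seg=10):
    T, k = X.shape
    dates = np.arange(min_seg, T - min_seg)   # keep both regimes estimable
    logpost = np.empty(len(dates))
    for i, tau in enumerate(dates):
        ld1, rss1 = seg_pieces(y[:tau], X[:tau])
        ld2, rss2 = seg_pieces(y[tau:], X[tau:])
        logpost[i] = ld1 + ld2 - 0.5 * (T - 2 * k) * np.log(rss1 + rss2)
    post = np.exp(logpost - logpost.max())
    return dates, post / post.sum()

# Made-up example: the intercept shifts at t = 120
rng = np.random.default_rng(1)
T = 200
X = np.column_stack([np.ones(T), rng.normal(size=T)])
y = np.where(np.arange(T) < 120, X @ [0.0, 1.0], X @ [1.0, 1.0]) + rng.normal(scale=0.5, size=T)
dates, post = break_date_posterior(y, X)
print(dates[np.argmax(post)])   # posterior mode for the break date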
• Working out the priors and modeling assumptions that would justify a frequentist procedure as Bayesian, or approximately Bayesian, can be helpful.
• It can reassure us (at least us Bayesians) that the procedure makes sense.
• It can help us identify rare conditions or samples in which the frequentist procedure deviates sharply from its Bayesian approximant, and thereby again lead us to better estimates or procedures.
2 Recent Successes
Macro policy modeling
• The ECB and the New York Fed have research groups working on models using
a Bayesian approach and meant to integrate into the regular policy cycle. Other
central banks and academic researchers are also working on models using these
methods.
– the modeling is directly tied to repeated decision-making use;
– it involves large models with many parameters, so that attempting to proceed without any use of prior distributions is a dead end;
– computational power and MCMC methods now make Bayesian analysis
of large models feasible.
What is MCMC?
• Its appeal is that whenever the posterior density function can be computed for arbitrary points in the parameter space, it is possible with MCMC to generate simulated samples from that posterior density, even though the density corresponds to no known distribution.
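A minimal illustration of the idea, with a random-walk Metropolis sampler in Python; the bivariate-normal target here is just a stand-in for a posterior density that we can evaluate point by point but cannot sample from directly.

# Random-walk Metropolis: all it needs is the (unnormalized) log posterior
# density evaluated at arbitrary points in the parameter space.
import numpy as np

def log_post(theta):
    # hypothetical target: a correlated bivariate normal (correlation 0.8)
    x, y = theta
    return -0.5 * (x**2 + y**2 - 1.6 * x * y) / (1 - 0.8**2)

def metropolis(log_post, theta0, n_draws=50_000, scale=0.5, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    lp = log_post(theta)
    draws = np.empty((n_draws, theta.size))
    for i in range(n_draws):
        proposal = theta + scale * rng.normal(size=theta.size)
        lp_prop = log_post(proposal)
        if np.log(rng.uniform()) < lp_prop - lp:   # accept with probability min(1, ratio)
            theta, lp = proposal, lp_prop
        draws[i] = theta                           # on rejection, repeat the old draw
    return draws

draws = metropolis(log_post, theta0=[0.0, 0.0])
print(draws[5_000:].mean(axis=0))   # posterior means after discarding a burn-in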
Mixed models
In statistics, mixed models, estimated by MCMC, have received a lot of attention.
They should receive more attention from economists.
This type of model is easy to handle by MCMC, harder to make sense of from a frequentist perspective because it has “parameters” that are “random”. It should probably be preferred in most applications to trying to control for conditional heteroscedasticity via “cluster” corrections on standard errors.
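A minimal sketch of one member of this class, a normal model with group-specific “random” means, estimated by Gibbs sampling; the model, priors, and data here are illustrative assumptions, not a recipe from the talk.

# Mixed model: y_{ji} ~ N(theta_j, sigma^2), theta_j ~ N(mu, tau^2), j = 1..J.
# Priors (illustrative): flat on mu, inverse-gamma(1, 1) on sigma^2 and tau^2.
import numpy as np

def gibbs_mixed(groups, n_draws=5_000, seed=0):
    rng = np.random.default_rng(seed)
    J = len(groups)
    n = np.array([len(g) for g in groups])
    ybar = np.array([np.mean(g) for g in groups])
    N = n.sum()
    mu, sigma2, tau2 = ybar.mean(), 1.0, 1.0
    out = []
    for _ in range(n_draws):
        # group means: precision-weighted combination of group data and mu
        prec = n / sigma2 + 1.0 / tau2
        theta = (n * ybar / sigma2 + mu / tau2) / prec + rng.normal(size=J) / np.sqrt(prec)
        # common mean mu given the theta's (flat prior)
        mu = rng.normal(theta.mean(), np.sqrt(tau2 / J))
        # within-group variance sigma^2
        rss = sum(np.sum((np.asarray(g) - t) ** 2) for g, t in zip(groups, theta))
        sigma2 = 1.0 / rng.gamma(1.0 + N / 2, 1.0 / (1.0 + rss / 2))
        # between-group variance tau^2: the "random parameters" have their own variance
        sst = np.sum((theta - mu) ** 2)
        tau2 = 1.0 / rng.gamma(1.0 + J / 2, 1.0 / (1.0 + sst / 2))
        out.append((mu, sigma2, tau2))
    return np.array(out)

groups = [[1.1, 0.9, 1.3], [0.2, -0.1, 0.4, 0.0], [2.0, 1.7, 2.2]]
draws = gibbs_mixed(groups)
print(draws[1_000:].mean(axis=0))   # posterior means of (mu, sigma^2, tau^2)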
3.1 IV, GMM
IV
• The sample moments from which IV estimators are constructed are sufficient
statistics under the model that leads to the LIML likelihood.
• When there are large numbers of instruments, a prior will help in making sense of results, and in avoiding the trivial result from 2SLS when the first stage has negative or small degrees of freedom (a small sketch of this follows the list).
• The trivial answer just points to the fact that IV, LIML, and Bayesian posterior
means all will agree asymptotically.
• The Bayesian analysis makes clear exactly what assumptions are needed to
make “asymptotically justified” probability statements actually justified.
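A small sketch (with made-up data) of the 2SLS point: the estimator is a function of the sample cross-product moments alone, and when the number of instruments reaches the number of observations the first stage fits exactly, so 2SLS collapses to the (inconsistent) OLS estimator.

# 2SLS from sample moments, and the "trivial result" when the number of
# instruments equals the number of observations (zero first-stage degrees of
# freedom: the projection on the instruments reproduces the data exactly).
import numpy as np

def two_sls(y, X, Z):
    # beta_hat = (X'P_Z X)^(-1) X'P_Z y, built from the moments Z'Z, Z'X, Z'y, X'Z
    A = X.T @ Z @ np.linalg.pinv(Z.T @ Z)
    return np.linalg.solve(A @ Z.T @ X, A @ Z.T @ y)

rng = np.random.default_rng(0)
n = 50
z = rng.normal(size=(n, 2))                   # two genuine instruments
v = rng.normal(size=n)
x = z @ np.array([1.0, -0.5]) + v             # endogenous regressor
y = 1.0 * x + v + rng.normal(size=n)          # structural equation, true beta = 1

print(two_sls(y, x[:, None], z))              # close to 1
Z_all = np.column_stack([z, rng.normal(size=(n, n - 2))])   # n instruments, n observations
print(two_sls(y, x[:, None], Z_all))          # numerically equal to OLS of y on x
print(np.linalg.lstsq(x[:, None], y, rcond=None)[0])        # OLS, inconsistent here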
Conservative models
• But we can go farther. We can ask: given the moment assumptions on which IV is based, what is the most “conservative” small-sample probability model satisfying those assumptions?
• A natural formulation: minimize the mutual information, in Shannon’s sense,
between the model parameters and the data.
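In symbols (a gloss on the objective, since the precise formulation is not spelled out here), the quantity being minimized is the mutual information between the parameter θ and the data Y,

I(θ; Y) = E[ log ( p(θ, Y) / ( p(θ) p(Y) ) ) ],

the expected log ratio of the joint density of parameter and data to the product of their marginal densities, with the minimization taken over probability models that satisfy the maintained moment conditions.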
GMM
• Here a model and prior that would lead to the usual procedures is less apparent.
This implies that the posterior depends on the data only through ∑ g(y_i | β).
• The result generally depends, through A, B, and C, on the prior.
• It seems worthwhile to explore in more detail what emerges from this setup.
3.2 Nonparametrics
What applied modelers do in practice
Bayesian sieve
• This leads to doing essentially what was described on the previous slide — estimating finite-dimensional models, expanding or shrinking the number of parameters based on evidence of the extent to which larger models fit better (a crude computational sketch follows this list).
• If the finite-dimensional spaces are dense in the full space, then under some
regularity conditions we can get consistency in the full space.
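One crude way to operationalize “expanding or shrinking the number of parameters based on evidence” is to compare approximate marginal likelihoods across model sizes. The sketch below does this for polynomial regressions of increasing order, using a BIC-style Laplace approximation; the data, model family, and approximation are all illustrative assumptions.

# Fit polynomials of increasing order and compute an approximate log marginal
# likelihood for each (Gaussian likelihood at the ML fit, minus a BIC penalty).
import numpy as np

def approx_log_marglik(y, X):
    n, k = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    sigma2 = np.mean((y - X @ beta) ** 2)
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return loglik - 0.5 * (k + 1) * np.log(n)   # k coefficients plus 1 variance

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-2, 2, size=n)
y = np.sin(x) + 0.3 * rng.normal(size=n)        # true regression function is not a polynomial

orders = range(9)
lml = np.array([approx_log_marglik(y, np.vander(x, d + 1)) for d in orders])
probs = np.exp(lml - lml.max())
probs /= probs.sum()                            # posterior over orders under a uniform prior
for d, p in zip(orders, probs):
    print(f"order {d}: approximate posterior probability {p:.3f}")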
• Countable unions of them are “meagre”, one topological definition of “small”.
• So this looks restrictive. We’ve put probability one on a small subset of the
parameter space.
• But this means they put probability one on subsets that are topologically small in the same sense that a countable union of finite-dimensional subspaces is small.
y_t = f(x_t) + ε_t,
ε_t i.i.d., zero mean conditional on {x_s}. The parameter is f. Assume f is a stochastic process on x as index set, e.g. Gaussian with Cov(f(x), f(z)) = R_f(x − z). Then we have a covariance matrix for a sample of y's, a cross-covariance of the y's with the f(x)'s, and thus a projection formula for E[f(x*) | y_1, ..., y_T].
For x's in the midst of observed x_i's, the weights in the projection look very much like kernel weights. For x's at the boundary of the sample, or in areas where the x_i's are very sparse, the kernel spreads out and becomes asymmetric. This corresponds to what applied workers actually do.
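A sketch of that projection formula in code; the triangular covariance follows the example described next, while the noise variance and data are added illustrative assumptions.

# Gaussian-process view of nonparametric regression: with f a mean-zero
# Gaussian process with covariance R_f(x - z) and i.i.d. N(0, sigma2) noise,
# E[f(x*) | y] = w @ y, and the weights w behave like kernel weights.
import numpy as np

def R_f(d, k=0.12):
    # triangular covariance: 1 - |d|/k on [-k, k], zero outside
    return np.maximum(1.0 - np.abs(d) / k, 0.0)

def gp_weights(x_obs, x_star, sigma2=0.05, k=0.12):
    # weights from the projection (conditional expectation) formula for
    # jointly Gaussian variables: w = [R_f(x_i - x_j) + sigma2 I]^(-1) R_f(x* - x_i)
    K = R_f(x_obs[:, None] - x_obs[None, :], k) + sigma2 * np.eye(len(x_obs))
    c = R_f(x_star - x_obs, k)
    return np.linalg.solve(K, c)

rng = np.random.default_rng(0)
x_obs = np.sort(rng.uniform(0.0, 1.0, size=60))
w_interior = gp_weights(x_obs, 0.5)   # x* surrounded by observations: roughly symmetric, kernel-like
w_boundary = gp_weights(x_obs, 1.0)   # x* at the edge of the sample: weights spread out, asymmetric
print(w_interior.round(2), w_boundary.round(2))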
The plots on the next slide show the weights from an example in which R_f(x) = 1 − |x|/k on [−k, k], with k = .12. More detail is in Sims (2000), from which these graphs were extracted. One of the plots shows a case where the x* value has many observed x_i near-neighbors. The other shows a case of an x* at the boundary of the x range and without many near neighbors.
[Two plots of the projection weights, one for an x* with many near-neighbor observations and one for an x* at the boundary of the x range, are omitted here.]
Conclusion
Lose your inhibitions: Put probabilities on parameters without embarrassment.
References
Hildreth, C. (1963): “Bayesian Statisticians and Remote Clients,” Econometrica, 31(3), 422–438.
Savage, L. J. (1977): “The Shifting Foundations of Statistics,” in Logic, Laws and Life, ed. by R. Colodny, pp. 3–18. University of Pittsburgh Press.