Introduction to Bayesian Econometrics and Decision Theory

Karsten T. Hansen

January 14, 2002
This is only a partial list. A few more (technical) reasons for considering a Bayesian
approach are that it

can easily accommodate inference in non-regular models,
allows for parameter uncertainty when forming predictions,
can test multiple non-nested models,
allows for automatic James-Stein shrinkage estimation using hierarchical models.
Probability theory as logic
Probability spaces are usually introduced in the form of the Kolmogorov axioms. A
probability space (Ω, F, P) consists of a sample space Ω, a set of events F consisting
of subsets of Ω, and a probability measure P with the properties

1. F is a σ-field,
2. P(A) ≥ 0 for all A ∈ F,
3. P(Ω) = 1,
4. for a disjoint collection {A_j ∈ F},  P(∪_j A_j) = Σ_j P(A_j).
These are axioms and hence taken as given. The classical interpretation of the
number P(A) is the relative frequency with which A occurs in a repeated random
experiment as the number of trials goes to infinity.
But why should we base probability theory on exactly these axioms? Indeed,
many have criticized these axioms as arbitrary. Can we derive them from deeper
principles that seem less arbitrary? Yes, and doing so also leads to an alternative
interpretation of the number P(A).
Let us start by noting that in a large part of our lives our human brains are engaged
in plausible reasoning. As an example of plausible reasoning, consider the following little
story from Jaynes' book:
Suppose some dark night a policeman walks down a street, apparently
deserted; but suddenly he hears a burglar alarm, looks across the street,
and sees a jewelry store with a broken window. Then a gentleman wearing a
mask comes crawling out through the broken window, carrying a bag which
turns out to be full of expensive jewelry. The policeman doesn't hesitate
at all in deciding that this gentleman is dishonest. But by what reasoning
process does he arrive at this conclusion?
The policeman's reasoning is clearly not deductive reasoning, which is based on
relationships like

If A is true, then B is true.

Deductive reasoning then proceeds as: A true ⟹ B true, and B false ⟹ A false.
The policeman's reasoning is better described by the following relationship:

If A is true, then B becomes more plausible.

Plausible reasoning then proceeds as:

B is true ⟹ A becomes more plausible.
How can one formalize this kind of reasoning? In chapters 1 and 2 of Jaynes' book it
is shown that given some basic desiderata that a theory of plausible reasoning should
satisfy one can derive the laws of probability from scratch. These desiderata are that
(i) degrees of plausibility are represented by real numbers, (ii) if a conclusion can be
reasoned out in more than one way, then every possible way must lead to the same
result (plus some further weak conditions requiring correspondence of the theory to
common sense).
So according to this approach probability theory is not a theory about limiting relative
frequencies in random experiments, but a formalization of the process of plausible
reasoning, and the interpretation of a probability is

P(A) = the degree of belief in the proposition A.
This subjective definition of probability can now be used to formalize the idea of learning
in an uncertain environment. Suppose my degree of belief in A is P(A). Then I learn
that the proposition B is true. If I believe there is some connection between A and B,
I have then also learned something about A. In particular, the laws of probability (or,
according to the theory above, the laws of plausible reasoning) tell me that

Pr(A|B) = Pr(A ∩ B)/Pr(B) = Pr(B|A)Pr(A)/Pr(B).                                  (1)

Applied to statistical inference, with data y and parameter θ, this updating rule becomes

p(θ|y) = p(y|θ)p(θ)/p(y).                                                        (2)

Here p(y|θ) is the sample distribution of the data given θ and p(θ) is the prior
distribution of θ.
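To make the updating rule concrete, here is a minimal numerical sketch of equation (2), not from the text, for a discrete parameter: a coin has bias θ in {0.2, 0.5, 0.8} with a uniform prior, and we observe 7 heads in 10 tosses.

```python
import numpy as np
from scipy.stats import binom

# Hypothetical discrete illustration of Bayes' rule (2): three candidate biases.
theta = np.array([0.2, 0.5, 0.8])
prior = np.array([1/3, 1/3, 1/3])          # p(theta)

heads, n = 7, 10
likelihood = binom.pmf(heads, n, theta)    # p(y|theta)

marginal = np.sum(likelihood * prior)      # p(y)
posterior = likelihood * prior / marginal  # p(theta|y), equation (2)

for t, post in zip(theta, posterior):
    print(f"theta = {t:.1f}: posterior = {post:.3f}")
```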
So Bayesian statistics is nothing more than a formal model of learning in an uncertain
environment applied to statistical inference. The prior expresses my beliefs about θ
before observing the data; the distribution p(θ|y) expresses my updated beliefs about
θ after observing the data.

Definition 1. p(θ) is the prior distribution of θ, p(θ|y) given in (2) is the posterior
distribution of θ, and

p(y) = ∫ p(y|θ)p(θ) dθ

is the marginal distribution of the data.
Carrying out a Bayesian analysis is deceptively simple and always proceeds as follows:

1. Formulate the sample distribution p(y|θ) and the prior p(θ).
2. Compute the posterior p(θ|y) according to (2).

That's it! All information about θ is now contained in the posterior. For example, the
probability that θ ∈ A is

Pr(θ ∈ A|y) = ∫_A p(θ|y) dθ.                                                     (3)

A couple of things to note:
Definition 4. Given a sample distribution, prior and loss function, a Bayes estimator
θ̂_B(y) is any function of y so that

θ̂_B(y) = argmin_{a ∈ A} ρ(a|y),

where ρ(a|y) = ∫ L(θ, a) p(θ|y) dθ is the posterior expected loss.

Some typical loss functions when θ is one dimensional are

L(θ, a) = (θ − a)²,                                    quadratic,                (4)
L(θ, a) = |θ − a|,                                     absolute error,           (5)
L(θ, a) = k2(θ − a) if θ > a,  k1(a − θ) otherwise.                              (6)

The corresponding Bayes estimators are

θ̂_B(y) = E[θ|y]                                        (posterior mean),         (7)
θ̂_B(y) = median of p(θ|y)                              (posterior median),       (8)
θ̂_B(y) = k2/(k1 + k2) fractile of p(θ|y).                                        (9)
Proof. Consider first the quadratic case. The posterior expected loss is

ρ(a|y) = ∫ (θ − a)² p(θ|y) dθ,

which is a continuous and convex function of a, so

∂ρ(a|y)/∂a = 0  ⟺  ∫ (θ − a*) p(θ|y) dθ = 0  ⟺  a* = ∫ θ p(θ|y) dθ ≡ E[θ|y].

For the generalized absolute error loss case we get

ρ(a|y) = ∫ L(θ, a) p(θ|y) dθ
       = k1 ∫_{−∞}^{a} (a − θ) p(θ|y) dθ + k2 ∫_{a}^{∞} (θ − a) p(θ|y) dθ        (10)
       = k1 ∫_{−∞}^{a} Pr(θ < x|y) dx + k2 ∫_{a}^{∞} Pr(θ > x|y) dx.             (11)

Setting the derivative with respect to a equal to zero gives

k1 Pr(θ < a*|y) − k2 Pr(θ > a*|y) = 0,

which shows that a* is the k2/(k1 + k2) fractile of the posterior. For k1 = k2 we get the
posterior median.
One can also construct loss functions which give the posterior mode as the optimal
Bayes estimator.
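As a quick illustration of these results (a sketch of my own, not from the text), the three Bayes estimators can be read off posterior draws: the mean minimizes quadratic loss (7), the median absolute-error loss (8), and the k2/(k1 + k2) quantile the asymmetric loss (6). The gamma posterior below is an arbitrary stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in posterior draws for theta (any posterior would do for this illustration).
draws = rng.gamma(shape=3.0, scale=2.0, size=50_000)

k1, k2 = 1.0, 3.0  # asymmetric loss (6): overshooting costs k1, undershooting costs k2

bayes_quadratic = draws.mean()                         # posterior mean, eq. (7)
bayes_absolute = np.median(draws)                      # posterior median, eq. (8)
bayes_asymmetric = np.quantile(draws, k2 / (k1 + k2))  # k2/(k1+k2) fractile, eq. (9)

print(f"quadratic loss  -> {bayes_quadratic:.3f}")
print(f"absolute loss   -> {bayes_absolute:.3f}")
print(f"asymmetric loss -> {bayes_asymmetric:.3f}")
```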
Here is a simple example of a Bayesian analysis.
Example 1. Suppose we have a sample y of size n where by assumption yi is sampled
from a normal distribution with mean θ and known variance σ²,

yi | θ ~ iid N(θ, σ²),   i = 1, . . . , n,                                       (12)

so that

p(y|θ) = (2πσ²)^{-n/2} exp{ −(1/(2σ²)) Σ_{i=1}^n (yi − θ)² }.                    (13)

The prior on θ is normal,

p(θ) = (2πσ₀²)^{-1/2} exp{ −(1/(2σ₀²)) (θ − μ₀)² },                              (14)

where μ₀ and σ₀² are the known prior mean and variance. So before observing any data
the best guess of θ is μ₀ (at least under squared error loss). Usually one would take
σ₀² large to express that little is known about θ before observing the data.
The posterior is

p(θ|y) = p(y|θ)p(θ)/p(y) = p(y|θ)p(θ) / ∫ p(y|θ)p(θ) dθ.

Note that

p(y|θ)p(θ) = (2π)^{-(n+1)/2} σ^{-n} σ₀^{-1} exp{ −(1/(2σ²)) Σ_{i=1}^n (yi − θ)² − (1/(2σ₀²))(θ − μ₀)² }.

Then, since

Σ_{i=1}^n (yi − θ)² = Σ_{i=1}^n (yi − ȳ)² + n(θ − ȳ)²                            (15)

and

(n/σ²)(θ − ȳ)² + (1/σ₀²)(θ − μ₀)² = (1/σ̄²)(θ − μ̄)² + (σ₀² + n^{-1}σ²)^{-1}(ȳ − μ₀)²,   (16)

where

μ̄ = [ (n/σ²) ȳ + (1/σ₀²) μ₀ ] / [ (n/σ²) + (1/σ₀²) ],                            (17)
σ̄² = 1 / [ (n/σ²) + (1/σ₀²) ],                                                   (18)
we can write

p(y|θ)p(θ) = p(y) (2π)^{-1/2} σ̄^{-1} exp{ −(1/(2σ̄²)) (θ − μ̄)² },                 (19)

where

p(y) = (2π)^{-n/2} σ^{-n} σ₀^{-1} σ̄ exp{−h(y)},                                  (20)

and

h(y) = (1/(2σ²)) Σ_{i=1}^n (yi − ȳ)² + [2(σ₀² + n^{-1}σ²)]^{-1} (ȳ − μ₀)².

(Equation (16) follows from the identity
a(x − b)² + c(x − d)² = (a + c)[x − (ab + cd)/(a + c)]² + [ac/(a + c)](b − d)².)
Then

p(θ|y) = p(y|θ)p(θ)/p(y) = (2π)^{-1/2} σ̄^{-1} exp{ −(1/(2σ̄²)) (θ − μ̄)² },        (21)

that is,

θ | y ~ N(μ̄, σ̄²).                                                               (22)
To derive this we did more calculations than we actually had to. Remember that when
deriving the posterior for θ we only need to include terms where θ enters. Hence,

p(θ|y) = p(y|θ)p(θ)/p(y)
       ∝ p(y|θ)p(θ)
       ∝ exp{ −(1/(2σ²)) Σ_{i=1}^n (yi − θ)² − (1/(2σ₀²)) (θ − μ₀)² }
       ∝ exp{ −(1/(2σ̄²)) (θ − μ̄)² },

that is,

p(θ|y) ∝ exp{ −(1/(2σ̄²)) (θ − μ̄)² }.                                             (23)
The optimal Bayes estimator is a convex combination of the usual estimator ȳ and the
prior expectation μ₀. When n is large and/or σ₀² is large, most weight is given to ȳ. In
particular,

E[θ|y] → ȳ   as n → ∞,
E[θ|y] → ȳ   as σ₀² → ∞.
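A short numerical sketch of Example 1 (illustrative only; the data and prior settings are my own choices): it computes μ̄ and σ̄² from (17)-(18) and shows the posterior mean moving toward ȳ as the prior variance grows.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up data: n observations from N(theta, sigma^2) with sigma^2 known.
theta_true, sigma2, n = 2.0, 4.0, 50
y = rng.normal(theta_true, np.sqrt(sigma2), size=n)
ybar = y.mean()

def posterior(ybar, n, sigma2, mu0, sigma02):
    """Posterior mean and variance from equations (17)-(18)."""
    prec = n / sigma2 + 1.0 / sigma02
    mu_bar = (n / sigma2 * ybar + mu0 / sigma02) / prec
    return mu_bar, 1.0 / prec

mu0 = 0.0
for sigma02 in [0.1, 1.0, 10.0, 1000.0]:
    mu_bar, var_bar = posterior(ybar, n, sigma2, mu0, sigma02)
    print(f"prior var {sigma02:7.1f}: E[theta|y] = {mu_bar:.3f} (ybar = {ybar:.3f})")
```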
In this example there was a close correspondence between the optimal Bayes estimator
and the classical estimator ȳ. But suppose now we had the knowledge that θ has to be
positive. Suppose also we initially use the prior

p(θ) = I(K > θ > 0) (1/K),

where K is a large positive number. Then we can compute the posterior for K < ∞
and then let K approach infinity. The posterior is then

p(θ|y) ∝ exp{ −(n/(2σ²)) (θ − ȳ)² } I(K > θ > 0) (1/K),

that is,

p(θ|y) = σ̄^{-1} φ((θ − ȳ)/σ̄) I(K > θ > 0) / [Φ((K − ȳ)/σ̄) − Φ(−ȳ/σ̄)]
       → σ̄^{-1} φ((θ − ȳ)/σ̄) I(θ > 0) / Φ(ȳ/σ̄)   as K → ∞,

where now σ̄² = σ²/n and φ and Φ denote the standard normal density and cdf. The
posterior mean is then the mean of a normal distribution truncated to (0, ∞),

E[θ|y] = ȳ + σ̄ φ(ȳ/σ̄) / Φ(ȳ/σ̄).                                                 (24)
Note that the unrestricted estimate is ȳ, which may be negative. Developing the
repeated sample distribution of ȳ under the restriction θ > 0 is a tricky matter. On
the other hand, the posterior analysis is straightforward and E[θ|y] is a reasonable and
intuitive estimator of θ.
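A small sketch (my own numbers, not from the text) that evaluates the restricted posterior mean (24) and compares it with the unrestricted estimate ȳ, cross-checking against scipy's truncated normal.

```python
import numpy as np
from scipy.stats import norm, truncnorm

# Made-up setting where the unrestricted estimate ybar is negative.
ybar, sigma, n = -0.3, 2.0, 25
scale = sigma / np.sqrt(n)          # sigma_bar = sigma / sqrt(n)

# Closed form (24): mean of N(ybar, scale^2) truncated to (0, infinity).
post_mean = ybar + scale * norm.pdf(ybar / scale) / norm.cdf(ybar / scale)

# Cross-check with scipy's truncated normal (a, b are standardized bounds).
a, b = (0.0 - ybar) / scale, np.inf
post_mean_check = truncnorm.mean(a, b, loc=ybar, scale=scale)

print(f"unrestricted estimate ybar = {ybar:.3f}")
print(f"restricted posterior mean  = {post_mean:.3f} (scipy check: {post_mean_check:.3f})")
```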
Models via exchangeability
In criticisms of Bayesian statistics one often meets statements like "This is too restrictive,
since you have to use a prior to do a Bayesian analysis, whereas in classical statistics
you don't." This is correct, but we will now show that under mild conditions there
always exists a prior.
Consider the following example. Suppose you wish to estimate the probability of
unemployment for a group of (similar) individuals. The only information you have
is a sample y = (y1 , . . . , yn ) where yi is one if individual i is employed and zero if
unemployed. Clearly the indices of the observations should not matter in this case.
The joint distribution of the sample p(y1 , . . . , yn ) should be invariant to permutations
of the indices, i.e.,
p(y1 , . . . , yn ) = p(yi(1) , . . . , yi(n) ),
where {i(1), . . . , i(n)} is a permutation of {1, . . . , n}. Such a condition is called exchangeability.
Definition 5. A finite set of random quantities z1 , . . . , zn are said to be exchangeable if
every permutation of z1 , . . . , zn has the same joint distribution as every other permutation. An infinite collection is exchangeable if every finite subcollection is exchangeable.
The relatively weak assumption of exchangeability turns out to have a profound
consequence as shown by a famous theorem by deFinetti.
Theorem 1 (deFinetti's representation theorem). Let z1, z2, . . . be a sequence of 0-1
random quantities. The sequence (z1, . . . , zn) is exchangeable for every n if and only if

p(z1, . . . , zn) = ∫₀¹ Π_{i=1}^n θ^{zi} (1 − θ)^{1−zi} dF(θ),

where

F(θ) = lim_{n→∞} Pr( (1/n) Σ_{i=1}^n zi ≤ θ ).
What does deFinetti's theorem say? It says that if the sequence z1, z2, . . . is considered
exchangeable, then it is as if the zi's are iid Bernoulli given θ,

zi | θ ~ iid Bernoulli(θ),   i = 1, . . . , n,

where θ is a random variable whose distribution F is the limit distribution of the
sample average (1/n) Σ_{i=1}^n zi.
So one way to motivate the Bayesian model

zi | θ ~ iid Bernoulli(θ),   i = 1, . . . , n,                                   (25)
θ ~ π(θ),                                                                        (26)

is to appeal to exchangeability and think about your beliefs about the limit of
(1/n) Σ_{i=1}^n zi when you pick the prior π(θ). The posterior is then computed in
the usual way,

p(θ|y) = p(y|θ)p(θ)/p(y).
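To illustrate the exchangeable Bernoulli model (25)-(26), here is a minimal sketch using a conjugate Beta prior for θ (the Beta choice and all numbers are mine, not the text's), so the posterior is available in closed form.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(2)

# Made-up 0-1 data (e.g., employment indicators).
z = rng.binomial(1, 0.7, size=40)
successes, n = z.sum(), z.size

# Beta(a0, b0) prior for theta; Beta is conjugate for Bernoulli data.
a0, b0 = 1.0, 1.0                      # uniform prior on (0, 1)
a_post, b_post = a0 + successes, b0 + n - successes

print(f"posterior: Beta({a_post:.0f}, {b_post:.0f})")
print(f"posterior mean of theta : {a_post / (a_post + b_post):.3f}")
print(f"sample average of z     : {z.mean():.3f}")
print(f"95% credible interval   : {beta.ppf([0.025, 0.975], a_post, b_post).round(3)}")
```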
Suppose now we also make the assumption that there is a population distribution f(y).
As a measure of the difference between the sample distribution used to compute the
posterior, p(y|θ), and the actual population distribution we can use the Kullback-Leibler
discrepancy,

H(θ) = ∫ log[ f(yi) / p(yi|θ) ] f(yi) dyi.                                       (27)

Let θ* be the value of θ that minimizes this distance. One can show that, under mild
conditions, the posterior p(θ|y) concentrates around θ* as n → ∞. In particular, if
f(yi) = p(yi|θ0), i.e. the sample distribution is correctly specified and the population is
indexed by θ0, then θ* = θ0. For proofs see the textbook by Schervish or Gelman et al.
on the reading list.
So in the case of correct specification, f(yi) = p(yi|θ0), the posterior will concentrate
around the true value θ0 asymptotically, as long as θ0 is contained in the support of
the prior. Under misspecification the posterior will concentrate around the value θ*
that minimizes the distance to the true model.
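The following sketch (all settings are my own, illustrative assumptions) finds the pseudo-true value θ* numerically for a misspecified example: the fitted family is N(θ, 1) while the population is a shifted t distribution with 5 degrees of freedom, so θ* minimizes the Kullback-Leibler discrepancy (27) over a grid.

```python
import numpy as np
from scipy.stats import norm, t

rng = np.random.default_rng(3)

# Population distribution f: a t(5) shifted to 1.5; fitted family: N(theta, 1).
y_pop = 1.5 + t.rvs(df=5, size=100_000, random_state=rng)

# Up to a constant not involving theta, H(theta) = E_f[-log p(y|theta)] + const,
# so we can minimize the expected negative log likelihood over a grid of theta values.
grid = np.linspace(0.0, 3.0, 151)
expected_nll = [-norm.logpdf(y_pop, loc=th, scale=1.0).mean() for th in grid]

theta_star = grid[np.argmin(expected_nll)]
print(f"pseudo-true value theta* ~= {theta_star:.3f}")   # close to the population mean, 1.5
```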
Now we shall consider the frequentist risk properties of Bayes estimators. To this end,
define the frequentist risk of an estimator δ̂ as

r(θ, δ̂) = ∫ L(θ, δ̂(y)) p(y|θ) dy.                                                (28)

Note the difference between this risk measure and the Bayesian risk measure ρ(a|y): the
frequentist risk averages over the data for a given parameter, whereas the Bayesian risk
measure averages over the parameter space given the data. Furthermore, the frequentist
risk is a function of both θ and the proposed estimator δ̂, while the Bayesian risk measure
is only a function of the action a = δ̂(y).
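As an illustration (a sketch with arbitrary settings, not from the text), the frequentist risk (28) of the Bayes estimator from Example 1 can be approximated by simulating repeated samples for a fixed θ and comparing it with the risk of ȳ under squared error loss.

```python
import numpy as np

rng = np.random.default_rng(4)

def risk(theta, estimator, sigma2=1.0, n=10, reps=20_000):
    """Monte Carlo approximation of r(theta, delta) = E[(theta - delta(y))^2 | theta]."""
    y = rng.normal(theta, np.sqrt(sigma2), size=(reps, n))
    return np.mean((estimator(y) - theta) ** 2)

# Bayes estimator from Example 1 with prior N(mu0, sigma02) and sigma^2 = 1 known.
mu0, sigma02, sigma2, n = 0.0, 1.0, 1.0, 10
def bayes_estimator(y):
    ybar = y.mean(axis=1)
    return (n / sigma2 * ybar + mu0 / sigma02) / (n / sigma2 + 1.0 / sigma02)

for theta in [0.0, 0.5, 2.0]:
    r_bayes = risk(theta, bayes_estimator)
    r_ybar = risk(theta, lambda y: y.mean(axis=1))
    print(f"theta = {theta:3.1f}: risk(Bayes) = {r_bayes:.4f}, risk(ybar) = {r_ybar:.4f}")
```

The output shows the usual trade-off: the Bayes estimator has lower risk near the prior mean and higher risk far from it, while the risk of ȳ is constant at σ²/n.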
There are two popular ways to choose estimators optimally based on their frequentist
risk: minimaxity and admissibility. It turns out that there is a close relationship
between admissibility and Bayes estimators.
Definition 6. An estimator δ̂ is inadmissible if there exists an estimator δ̂1 which
dominates δ̂, i.e., such that for every θ,

r(θ, δ̂) ≥ r(θ, δ̂1),

and, for at least one value θ0,

r(θ0, δ̂) > r(θ0, δ̂1).

If δ̂ is not inadmissible it is admissible.
The idea behind admissibility is to reduce the number of potential estimators to
consider. Indeed, it seems hard to defend using an inadmissible estimator.

Under mild conditions Bayes estimators can be shown to be admissible. Under
somewhat stronger conditions one can in fact show the reverse: all admissible estimators
are Bayes estimators (or limits of Bayes estimators). A theorem proving this is called a
complete class theorem, and different versions of complete class theorems exist.
Comparisons between classical and Bayesian inference
The fundamental difference between classical frequentist inference and Bayesian inference is in the use of pre-data versus post-data probability statements.
The frequentist approach is limited to pre-data considerations. This approach answers questions of the following form:
(Q1) Before we have seen the data, what data do we expect to get?
(Q2) If we use the as yet unknown data to estimate parameters by some known algorithm, how accurate do we expect the estimates to be?
(Q3) If the hypothesis being tested is in fact true, what is the probability that we shall
get data indicating that it is true?
These questions can also be answered in the Bayesian approach. However, followers
of the Bayesian approach argue that these questions are not relevant for scientific
inference. What is relevant are post-data questions:
(Q1) After having seen the data, do we have any reason to be surprised by them?
(Q2) After we have seen the data, what parameter estimates can we now make, and
what accuracy are we entitled to claim?
(Q3) What is the probability conditional on the data, that the hypothesis is true?
These post-data questions (Q1)-(Q3) are only meaningful in a Bayesian framework.
In the frequentist approach one cannot talk about the probability of a hypothesis.
The marginal propensity to consume is either .92 or it is not. A frequentist 95 pct.
confidence interval (a, b) does not mean that the probability of a < θ < b is 95 pct.:
θ either belongs to the interval (a, b) or it does not.
Sometimes frequentist and Bayesian procedures give similar results although their
interpretations differ.
Example 2. In Example 1 we found the posterior

p(θ|y) = N(θ | μ̄, σ̄²),

where

μ̄ = [ (n/σ²) ȳ + (1/σ₀²) μ₀ ] / [ (n/σ²) + (1/σ₀²) ],     σ̄² = 1 / [ (n/σ²) + (1/σ₀²) ].

As σ₀² → ∞ this becomes

θ | y ~ N(ȳ, σ²/n),                                                              (29)

so the Bayes estimator is θ̂_B = ȳ. The corresponding frequentist statement is based on
the repeated sampling distribution of ȳ,

ȳ | θ ~ N(θ, σ²/n).                                                              (30)

Conceptually (29) and (30) are very different, but the final statements one would make
about θ would be nearly identical.

We stress once again the difference between (30) and (29). (30) answers the question

(Q1) How much would the estimate of θ vary over the class of all data sets that we
might conceivably get?
Another difference shows up in significance testing. Suppose we observe 9 heads and
3 tails in a sequence of coin tosses and wish to test H0: θ = 1/2. Under binomial
sampling the number of tosses, 12, was fixed in advance; under negative binomial
sampling we tossed until the third tail appeared, and the likelihood is

L2(θ) = C(11, 9) θ^9 (1 − θ)^3.

Suppose we use the test statistic X = number of heads and the decision rule "reject H0
if X ≥ c". The p-value is the probability of observing the data X = 9, or something
more extreme, under H0. This gives

α1 = Pr(X ≥ 9 | θ = 1/2) = Σ_{j=9}^{12} C(12, j) (1/2)^j (1/2)^{12−j} = .075,
α2 = Pr(X ≥ 9 | θ = 1/2) = Σ_{j=9}^{∞} C(2 + j, j) (1/2)^j (1/2)^3 = .0325.
So using a conventional Type I error level α = .05 the two model assumptions lead to
two different conclusions. But there is nothing in the situation that tells us which of
the two models we should use.
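For concreteness, the two tail probabilities can be recomputed directly. This is a small sketch of my own; it reports the exact sums to four decimals, so the last digits may differ slightly from the rounded figures quoted above.

```python
from math import comb

# Binomial scheme: 12 tosses fixed, X = number of heads, reject if X >= 9.
p_binomial = sum(comb(12, j) * 0.5 ** 12 for j in range(9, 13))

# Negative binomial scheme: toss until 3 tails, X = number of heads observed.
# Pr(X = j) = C(2 + j, j) (1/2)^j (1/2)^3; sum the upper tail via the complement.
p_negbinomial = 1.0 - sum(comb(2 + j, j) * 0.5 ** (j + 3) for j in range(0, 9))

print(f"p-value, binomial sampling         : {p_binomial:.4f}")
print(f"p-value, negative binomial sampling: {p_negbinomial:.4f}")
```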
What happens here is that the Neyman-Pearson test procedure allows unobserved
outcomes to affect the results: X values more extreme than 9 were used as evidence
against the null. The prominent Bayesian Harold Jeffreys described this situation as
"a hypothesis that may be true may be rejected because it has not predicted observable
results that have not occurred."
There is also an important difference between frequentist and Bayesian approaches
to the elimination of nuisance parameters. In the frequentist approach nuisance
parameters are usually eliminated by the plug-in method. Suppose we have an estimator
θ̂1 of a parameter θ1 which depends on another parameter θ2:

θ̂1 = θ̂1(y, θ2).

Typically one would get rid of the dependence on θ2 by plugging in an estimate of θ2:

θ̂1 = θ̂1(y, θ̂2(y)).

In the Bayesian approach one gets rid of nuisance parameters by integration. Suppose
the joint posterior distribution of θ1 and θ2 is p(θ1, θ2|y). Inference about θ1 is then
based on the marginal posterior

p(θ1|y) = ∫ p(θ1, θ2|y) dθ2.
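With simulation output the marginalization step is trivial: given joint posterior draws of (θ1, θ2), one simply ignores the θ2 column. A minimal sketch, with an arbitrary bivariate normal standing in for the joint posterior:

```python
import numpy as np

rng = np.random.default_rng(5)

# Stand-in joint posterior draws of (theta1, theta2): a correlated bivariate normal.
mean = np.array([1.0, -0.5])
cov = np.array([[1.0, 0.6],
                [0.6, 2.0]])
draws = rng.multivariate_normal(mean, cov, size=50_000)

theta1 = draws[:, 0]   # marginal posterior draws of theta1: just drop theta2

print(f"E[theta1|y]  ~= {theta1.mean():.3f}")
print(f"sd(theta1|y) ~= {theta1.std(ddof=1):.3f}")
print(f"95% interval ~= {np.percentile(theta1, [2.5, 97.5]).round(3)}")
```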
Bayesian Mechanics
The normal linear regression model
The sampling distribution of the n-vector of observable data y is

p(y|X, β, τ) = N(y | Xβ, τ^{-1} I_n),                                            (31)

where X is the n × k matrix of regressors, β is the k-vector of coefficients and τ is the
error precision. We consider two priors,

p(β, τ) = N(β | β₀, Ω₀^{-1}) G(τ | ν₁, ν₂)    and    p(β, τ) ∝ τ^{-1}.            (32)

The first prior specifies that β and τ are a priori independent, with β having a multivariate
normal prior with mean β₀ and covariance Ω₀^{-1} and τ having a gamma prior
with shape parameter ν₁ and inverse scale parameter ν₂. (We could also have chosen to
work with the variance σ² = 1/τ; the implied prior on σ² would then be an inverse gamma
distribution.)

The second prior is a non-informative prior. This is a prior that you may want to
use if you don't have much prior information about (β, τ) available. (You may be
wondering why τ^{-1} represents a non-informative prior on (β, τ); this will become
clearer below.)
Consider the second prior first. Under p(β, τ) ∝ τ^{-1} the joint posterior is

p(β, τ|y) ∝ τ^{n/2 − 1} exp{ −(τ/2)(y − Xβ)'(y − Xβ) },

and integrating out τ gives the marginal posterior distribution of β,

p(β|y) ∝ [ s(y) + (β − β̂)'X'X(β − β̂) ]^{-n/2}                                    (33)
       ∝ [ 1 + (β − β̂)'X'X(β − β̂)/s(y) ]^{-n/2},                                 (34)

where

β̂ = (X'X)^{-1}X'y,    s(y) = (y − Xβ̂)'(y − Xβ̂).                                  (35)

This is the kernel of a multivariate t distribution with n − k degrees of freedom,

p(β|y) = t_{n−k}( β | β̂, s(y)(X'X)^{-1}/(n − k) ).                               (36)
The marginal posterior of τ is

p(τ|y) = ∫ p(β, τ|y) dβ ∝ τ^{n/2 − 1} exp{ −τ s(y)/2 } ∫ exp{ −(τ/2)(β − β̂)'X'X(β − β̂) } dβ   (37)
       ∝ τ^{(n−k)/2 − 1} exp{ −τ s(y)/2 },                                                     (38)

that is, τ | y ~ G( (n − k)/2, s(y)/2 ), with posterior mean

E[τ|y] = (n − k)/s(y) = 1/σ̂²,                                                    (39)

where σ̂² = s(y)/(n − k).
Now we can see one way in which the prior p(β, τ) ∝ τ^{-1} may be considered non-informative:
the marginal posterior distributions have properties closely resembling the corresponding
repeated sample distributions.
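A small sketch (simulated data and settings of my own) of the posterior quantities under the non-informative prior: it computes β̂, s(y) and σ̂², and draws from the marginal posteriors (36) and (38) by first drawing τ and then β given τ.

```python
import numpy as np

rng = np.random.default_rng(6)

# Simulated data for y = X beta + e, e ~ N(0, tau^{-1} I).
n, k = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta_true, tau_true = np.array([1.0, 2.0, -0.5]), 1.0
y = X @ beta_true + rng.normal(scale=1 / np.sqrt(tau_true), size=n)

XtX = X.T @ X
beta_hat = np.linalg.solve(XtX, X.T @ y)          # OLS estimate
s_y = float((y - X @ beta_hat) @ (y - X @ beta_hat))
sigma2_hat = s_y / (n - k)

# Posterior draws under p(beta, tau) propto 1/tau:
# tau | y ~ Gamma((n-k)/2, rate = s(y)/2), then beta | tau, y ~ N(beta_hat, (tau X'X)^{-1}).
ndraws = 10_000
tau_draws = rng.gamma(shape=(n - k) / 2, scale=2.0 / s_y, size=ndraws)
cov_chol = np.linalg.cholesky(np.linalg.inv(XtX))
beta_draws = beta_hat + (cov_chol @ rng.normal(size=(k, ndraws))).T / np.sqrt(tau_draws)[:, None]

print("posterior mean of beta :", beta_draws.mean(axis=0).round(3))
print("OLS estimate beta_hat  :", beta_hat.round(3))
print(f"E[tau|y] = {tau_draws.mean():.3f}, 1/sigma2_hat = {1 / sigma2_hat:.3f}")
```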
For the first prior we get the joint posterior

p(β, τ|y) ∝ τ^{n/2} exp{ −(τ/2)(y − Xβ)'(y − Xβ) } τ^{ν₁ − 1} exp{−ν₂ τ}
            × exp{ −(1/2)(β − β₀)'Ω₀(β − β₀) }                                   (40)
          = τ^{n/2 + ν₁ − 1} exp{ −(τ/2)[ s(y) + (β − β̂)'X'X(β − β̂) ] − ν₂ τ }
            × exp{ −(1/2)(β − β₀)'Ω₀(β − β₀) }.                                  (41)

This joint posterior of (β, τ) does not lead to convenient expressions for the marginals
of β and τ.
We can, however, derive analytical expressions for the conditional posteriors p(β|τ, y)
and p(τ|β, y). These conditional posteriors turn out to play a fundamental role when
designing simulation algorithms.
Let us first derive the conditional posterior for β given τ. Remember that we then
only need to include terms containing β. We get

p(β|τ, y) ∝ exp{ −(1/2)[ τ(β − β̂)'X'X(β − β̂) + (β − β₀)'Ω₀(β − β₀) ] }.           (42)

Using the identity

(z − a)'A(z − a) + (z − b)'B(z − b) = (z − c)'(A + B)(z − c) + (a − b)'A(A + B)^{-1}B(a − b),

where c = (A + B)^{-1}(Aa + Bb), this can be written as
p(β|τ, y) ∝ exp{ −(1/2)(β − β̄)'Ω̄^{-1}(β − β̄) },

where

Ω̄ = (τ X'X + Ω₀)^{-1},
β̄ = Ω̄ (τ X'y + Ω₀ β₀).

So

p(β|τ, y) = N(β | β̄, Ω̄),                                                         (43)

and, collecting the terms in τ in (41),

p(τ|β, y) = G( τ | n/2 + ν₁, (y − Xβ)'(y − Xβ)/2 + ν₂ ).                         (44)
We will see later that it is extremely easy to simulate draws from p(β|y) and
p(τ|y) using these conditional distributions.
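To preview that point, here is a minimal Gibbs sampler sketch (my own illustrative data and hyperparameters) that alternates between the conditional posteriors (43) and (44); the resulting draws of β and τ approximate the joint posterior (41).

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulated data y = X beta + e, e ~ N(0, 1/tau).
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.5, size=n)

# Prior hyperparameters (illustrative): beta ~ N(beta0, Omega0^{-1}), tau ~ G(nu1, nu2).
beta0, Omega0 = np.zeros(k), 0.01 * np.eye(k)
nu1, nu2 = 2.0, 1.0

XtX, Xty = X.T @ X, X.T @ y
ndraws, tau = 5_000, 1.0
beta_draws, tau_draws = np.empty((ndraws, k)), np.empty(ndraws)

for s in range(ndraws):
    # (43): beta | tau, y ~ N(beta_bar, Omega_bar)
    Omega_bar = np.linalg.inv(tau * XtX + Omega0)
    beta_bar = Omega_bar @ (tau * Xty + Omega0 @ beta0)
    beta = rng.multivariate_normal(beta_bar, Omega_bar)
    # (44): tau | beta, y ~ G(n/2 + nu1, rate = (y - X beta)'(y - X beta)/2 + nu2)
    resid = y - X @ beta
    tau = rng.gamma(shape=n / 2 + nu1, scale=1.0 / (resid @ resid / 2 + nu2))
    beta_draws[s], tau_draws[s] = beta, tau

burn = 500
print("posterior mean of beta:", beta_draws[burn:].mean(axis=0).round(3))
print(f"posterior mean of sigma^2 = 1/tau: {(1 / tau_draws[burn:]).mean():.3f}")
```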
The SURE model
Consider now the model

y_{ij} = x_{ij}' β_j + ε_{ij},   i = 1, . . . , n;  j = 1, . . . , J,            (45)

where the error vectors ε_i = (ε_{i1}, . . . , ε_{iJ})' satisfy

ε_i | Σ ~ iid N(0, Σ),   i = 1, . . . , n.                                       (46)

Stack the J observations for unit i as Y_i = (y_{i1}, . . . , y_{iJ})' and write the model as
Y_i = X_i β + ε_i, where

X_i = blockdiag( x_{i1}', x_{i2}', . . . , x_{iJ}' )    and    β = (β_1', β_2', . . . , β_J')'.

As before we will consider both a non-informative prior and an informative prior.
The usual non-informative prior for this model is

p(β, Σ) ∝ |Σ|^{-(J+1)/2}.                                                        (47)
Combined with the sampling distribution this gives the joint posterior

p(β, Σ|y) ∝ |Σ|^{-(J+1)/2} |Σ|^{-n/2} exp{ −(1/2) Σ_{i=1}^n (Y_i − X_i β)'Σ^{-1}(Y_i − X_i β) }
          = |Σ|^{-(n+J+1)/2} exp{ −(1/2) Σ_{i=1}^n (Y_i − X_i β)'Σ^{-1}(Y_i − X_i β) }.        (48)

For the conditional posterior of β, decompose

Σ_{i=1}^n (Y_i − X_i β)'Σ^{-1}(Y_i − X_i β)
  = Σ_{i=1}^n (Y_i − X_i β̂(Σ))'Σ^{-1}(Y_i − X_i β̂(Σ))
    + (β − β̂(Σ))' ( Σ_{i=1}^n X_i'Σ^{-1}X_i ) (β − β̂(Σ)),                                      (49)
where

β̂(Σ) = ( Σ_{i=1}^n X_i'Σ^{-1}X_i )^{-1} Σ_{i=1}^n X_i'Σ^{-1}Y_i.                 (50)

So

p(β|Σ, y) = N( β | β̂(Σ), ( Σ_{i=1}^n X_i'Σ^{-1}X_i )^{-1} ).                     (51)

The conditional posterior of β is normal, with mean equal to the efficient GLS estimator
(when Σ is known).
To get the conditional posterior for Σ, note that

Σ_{i=1}^n (Y_i − X_i β)'Σ^{-1}(Y_i − X_i β) = Tr[ Σ^{-1} M(β) ],

where

M(β) = Σ_{i=1}^n (Y_i − X_i β)(Y_i − X_i β)'.

Hence

p(Σ|β, y) ∝ |Σ|^{-(n+J+1)/2} exp{ −(1/2) Tr[ Σ^{-1} M(β) ] },                    (52)

so the precision matrix Σ^{-1} has a Wishart conditional posterior,

p(Σ^{-1}|β, y) = W( n, M(β)^{-1} ),                                              (53)

with

E[Σ^{-1}|β, y] = n M(β)^{-1}.

The inverse of this is n^{-1} M(β), which is the usual estimate of the covariance matrix
when β is known.
Next let us derive the conditional posteriors under the proper prior distributions,
β ~ N(β₀, Ω₀^{-1}) and Σ^{-1} ~ W(ν, S^{-1}), independent. The joint posterior is

p(β, Σ|y) ∝ |Σ|^{-(ν+J+1)/2} exp{ −(1/2) Tr[ S Σ^{-1} ] } exp{ −(1/2)(β − β₀)'Ω₀(β − β₀) }
            × |Σ|^{-n/2} exp{ −(1/2) Σ_{i=1}^n (Y_i − X_i β)'Σ^{-1}(Y_i − X_i β) }.            (54)

For the conditional posterior of β, keep only the terms involving β:

p(β|Σ, y) ∝ exp{ −(1/2)(β − β₀)'Ω₀(β − β₀) − (1/2) Σ_{i=1}^n (Y_i − X_i β)'Σ^{-1}(Y_i − X_i β) }
          ∝ exp{ −(1/2)(β − β̄(Σ))' ( Ω₀ + Σ_{i=1}^n X_i'Σ^{-1}X_i ) (β − β̄(Σ)) },

so that

p(β|Σ, y) = N( β | β̄(Σ), Ω̄(Σ) ),                                                 (55)

where

Ω̄(Σ) = ( Ω₀ + Σ_{i=1}^n X_i'Σ^{-1}X_i )^{-1},                                    (56)
β̄(Σ) = Ω̄(Σ) ( Σ_{i=1}^n X_i'Σ^{-1}Y_i + Ω₀ β₀ ).                                 (57)

For the conditional posterior of Σ we get

p(Σ|β, y) ∝ |Σ|^{-(n+ν+J+1)/2} exp{ −(1/2) Tr[ S Σ^{-1} ] − (1/2) Σ_{i=1}^n (Y_i − X_i β)'Σ^{-1}(Y_i − X_i β) }
          = |Σ|^{-(n+ν+J+1)/2} exp{ −(1/2) Tr[ S Σ^{-1} ] − (1/2) Tr[ Σ^{-1} M(β) ] }
          = |Σ|^{-(n+ν+J+1)/2} exp{ −(1/2) Tr[ (S + M(β)) Σ^{-1} ] },

so that

p(Σ^{-1}|β, y) = W( n + ν, (S + M(β))^{-1} ).                                    (58)
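As with the regression model, the conditionals (55)-(58) suggest a Gibbs sampler. A compact sketch follows; the data, hyperparameters and the use of scipy's wishart distribution are my own illustrative choices, not the text's.

```python
import numpy as np
from scipy.linalg import block_diag
from scipy.stats import wishart

rng = np.random.default_rng(8)

# Simulated SUR data: J = 2 equations, each with an intercept and one regressor.
n, J = 100, 2
beta_true = np.array([1.0, 0.5, -1.0, 2.0])          # (beta_1', beta_2')'
Sigma_true = np.array([[1.0, 0.6], [0.6, 1.5]])
k = beta_true.size

X_list = [block_diag(*[np.array([1.0, rng.normal()]) for _ in range(J)]) for _ in range(n)]
Y = np.array([Xi @ beta_true + rng.multivariate_normal(np.zeros(J), Sigma_true)
              for Xi in X_list])

# Prior hyperparameters (illustrative): beta ~ N(beta0, Omega0^{-1}), Sigma^{-1} ~ W(nu, S^{-1}).
beta0, Omega0 = np.zeros(k), 0.01 * np.eye(k)
nu, S = J + 2, np.eye(J)

ndraws, burn = 2_000, 500
beta = np.zeros(k)
beta_draws = np.empty((ndraws, k))

for s in range(ndraws):
    # (58): Sigma^{-1} | beta, y ~ W(n + nu, (S + M(beta))^{-1})
    resid = Y - np.array([Xi @ beta for Xi in X_list])
    M = resid.T @ resid
    Sigma_inv = wishart.rvs(df=n + nu, scale=np.linalg.inv(S + M), random_state=rng)
    # (55)-(57): beta | Sigma, y ~ N(beta_bar(Sigma), Omega_bar(Sigma))
    A = sum(Xi.T @ Sigma_inv @ Xi for Xi in X_list)
    b = sum(Xi.T @ Sigma_inv @ Yi for Xi, Yi in zip(X_list, Y))
    Omega_bar = np.linalg.inv(Omega0 + A)
    beta_bar = Omega_bar @ (b + Omega0 @ beta0)
    beta = rng.multivariate_normal(beta_bar, Omega_bar)
    beta_draws[s] = beta

print("posterior mean of beta:", beta_draws[burn:].mean(axis=0).round(3))
print("true beta             :", beta_true)
```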
Readings

Bayesian foundations and philosophy

Jaynes, E.T. (1994), Probability: The Logic of Science, unpublished book.
Chapters may be downloaded from http://bayes.wustl.edu/etj/prob.html

Jeffreys, H. (1961), Theory of Probability, Oxford University Press.

Bayesian statistics and econometrics

Gelman, A., Carlin, J.B., Stern, H.S., and Rubin, D.B. (1995), Bayesian Data Analysis, Chapman and Hall.

Schervish, M.J. (1995), Theory of Statistics, Springer.

Zellner, A. (1971), An Introduction to Bayesian Inference in Econometrics, Wiley.

Bayesian statistics and decision theory

Berger, J.O. (1985), Statistical Decision Theory and Bayesian Analysis, Springer.

Robert, C.P. (1994), The Bayesian Choice, Springer.