Lecture Notes For Probability and Statistics
Bayesian Inference
Relevant material is in Chapter 11.
1 Introduction
So far we have been using frequentist (or classical) methods. In the frequentist approach,
probability is interpreted as long run frequencies. The goal of frequentist inference is to
create procedures with long run guarantees. Indeed, a better name for frequentist inference
might be procedural inference. Moreover, the guarantees should be uniform over θ if possible.
For example, a confidence interval traps the true value of θ with probability 1 − α, no matter
what the true value of θ is. In frequentist inference, procedures are random while
parameters are fixed, unknown quantities.
In the Bayesian approach, probability is regarded as a measure of subjective degree of
belief. In this framework, everything, including parameters, is regarded as random. There
are no long run frequency guarantees. Bayesian inference is quite controversial.
Note that when we used Bayes estimators in minimax theory, we were not doing Bayesian
inference. We were simply using Bayesian estimators as a method to derive minimax estimators.
One very important point, which causes a lot of confusion, is this:
                        Bayesian                         Frequentist
Probability             subjective degree of belief      limiting frequency
Goal                    analyze beliefs                  create procedures with frequency guarantees
θ                       random variable                  fixed
X                       random variable                  random variable
Use Bayes' theorem?     Yes. To update beliefs.          Yes, if it leads to a procedure with good
                                                         frequentist behavior. Otherwise no.
In the Bayesian approach we start with a prior distribution p(θ) for the parameter; after observing the data, Bayes' theorem gives the posterior

$$
p(\theta|X_1,\ldots,X_n) = \frac{p(X_1,\ldots,X_n|\theta)\, p(\theta)}{p(X_1,\ldots,X_n)}
$$

where

$$
p(X_1,\ldots,X_n) = \int p(X_1,\ldots,X_n|\theta)\, p(\theta)\, d\theta.
$$
Hence,

$$
p(\theta|X_1,\ldots,X_n) \propto L(\theta)\, p(\theta)
$$

where $L(\theta) = p(X_1,\ldots,X_n|\theta)$ is the likelihood function. The interpretation is that $p(\theta|X_1,\ldots,X_n)$ represents your subjective beliefs about θ after observing $X_1,\ldots,X_n$.
A commonly used point estimator is the posterior mean

$$
\bar{\theta} = E(\theta|X_1,\ldots,X_n) = \int \theta\, p(\theta|X_1,\ldots,X_n)\, d\theta = \frac{\int \theta\, L(\theta)\, p(\theta)\, d\theta}{\int L(\theta)\, p(\theta)\, d\theta}.
$$
For interval estimation we use $C = (a, b)$ where $a$ and $b$ are chosen so that

$$
\int_a^b p(\theta|X_1,\ldots,X_n)\, d\theta = 1 - \alpha.
$$
The interpretation is that

$$
P(\theta \in C \,|\, X_1,\ldots,X_n) = 1 - \alpha.
$$
This does not mean that C traps θ with probability 1 − α. We will discuss the distinction
in detail later.
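To make these formulas concrete, here is a minimal numerical sketch (not from the notes): it evaluates the unnormalized posterior L(θ)p(θ) on a grid for a Bernoulli model with a Beta(2, 2) prior (both choices are purely illustrative) and computes the posterior mean and an equal-tail 95 percent posterior interval.

```python
import numpy as np

# Illustrative data: coin flips from a Bernoulli(theta) model (made-up numbers).
x = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])

# Evaluate the unnormalized posterior L(theta) * p(theta) on a grid.
theta = np.linspace(0.001, 0.999, 10_000)
likelihood = theta ** x.sum() * (1 - theta) ** (len(x) - x.sum())   # L(theta)
prior = theta * (1 - theta)                                         # Beta(2, 2) prior, up to a constant
unnorm = likelihood * prior

# Normalize so the posterior integrates to 1 (Riemann sum over the grid).
dtheta = theta[1] - theta[0]
posterior = unnorm / (unnorm.sum() * dtheta)

# Posterior mean: the integral of theta times the posterior.
post_mean = (theta * posterior).sum() * dtheta

# Equal-tail 95% posterior interval from the posterior CDF.
cdf = np.cumsum(posterior) * dtheta
a = theta[np.searchsorted(cdf, 0.025)]
b = theta[np.searchsorted(cdf, 0.975)]
print(f"posterior mean = {post_mean:.3f}, 95% interval = ({a:.3f}, {b:.3f})")
```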
For example, let $X_1,\ldots,X_n \sim \text{Bernoulli}(p)$, let $Y = \sum_{i=1}^n X_i$, and suppose the prior is $p \sim \text{Beta}(\alpha, \beta)$. Then the posterior is proportional to $p^{Y+\alpha-1}(1-p)^{n-Y+\beta-1}$. Therefore, $p|X \sim \text{Beta}(Y + \alpha,\, n - Y + \beta)$. (See page 325 for more details.) The Bayes estimator is

$$
\tilde{p} = \frac{Y + \alpha}{(Y + \alpha) + (n - Y + \beta)} = \frac{Y + \alpha}{\alpha + \beta + n} = (1 - \lambda)\,\hat{p}_{\rm mle} + \lambda\, \bar{p}
$$

where

$$
\bar{p} = \frac{\alpha}{\alpha + \beta}, \qquad \lambda = \frac{\alpha + \beta}{\alpha + \beta + n}.
$$
This is an example of a conjugate prior.
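A quick numerical check of the conjugate update and the shrinkage identity above (the values of n, Y, α and β below are just illustrative):

```python
from scipy import stats

# Illustrative values (not from the notes).
n, Y = 50, 32            # number of trials and number of successes
alpha, beta = 3.0, 3.0   # Beta(alpha, beta) prior

# Conjugate posterior: p | data ~ Beta(Y + alpha, n - Y + beta).
posterior = stats.beta(Y + alpha, n - Y + beta)

# Bayes estimator (posterior mean) computed two ways; the numbers agree.
p_tilde = (Y + alpha) / (alpha + beta + n)
lam = (alpha + beta) / (alpha + beta + n)
p_mle, p_bar = Y / n, alpha / (alpha + beta)
print(p_tilde, (1 - lam) * p_mle + lam * p_bar)

# 95% equal-tail posterior interval.
print(posterior.ppf([0.025, 0.975]))
```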
3 Where Does the Prior Come From?
This is the million dollar question. In principle, the Bayesian is supposed to choose a prior
π that represents their prior information. This will be challenging in high dimensional cases
to say the least. Also, critics will say that someone’s prior opinions should not be included
in a data analysis because this is not scientific.
There has been some effort to define “noninformative priors” but this has not worked out so
well. An example is the Jeffreys prior which is defined to be
$$
p(\theta) \propto \sqrt{I(\theta)},
$$

where $I(\theta)$ is the Fisher information.
You can use a flat prior but be aware that this prior doesn't retain its flatness under transformations. In high dimensional cases, the prior ends up being highly influential. The result
is that Bayesian methods tend to have poor frequentist behavior. We’ll return to this point
soon.
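A small sketch of that non-invariance (the transformation to the log-odds scale is just one illustrative choice): a Uniform(0, 1) prior on a probability p induces a decidedly non-flat prior on ψ = log(p/(1−p)).

```python
import numpy as np

rng = np.random.default_rng(0)

# A flat prior on p in (0, 1) ...
p = rng.uniform(0, 1, size=1_000_000)

# ... induces a non-flat prior on the log-odds psi = log(p / (1 - p)).
psi = np.log(p / (1 - p))

# Histogram of psi on equal-width bins: the counts are far from constant,
# so the prior is no longer "flat" on this scale.
counts, edges = np.histogram(psi, bins=np.arange(-3, 3.5, 0.5))
print(np.round(counts / counts.max(), 2))
# (The induced density is the standard logistic: e^{-psi} / (1 + e^{-psi})^2.)
```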
It is common to use flat priors even if they don't integrate to 1; such priors are called improper. This is possible since the posterior might still integrate to 1 even if the prior doesn't.
In parametric models, under regularity conditions, the posterior is approximately Normal when n is large:

$$
p(\theta|X_1,\ldots,X_n) \approx N\!\left(\hat{\theta}_n,\; \frac{1}{I_n(\hat{\theta}_n)}\right), \qquad (1)
$$

where $\hat{\theta}_n$ is the mle and $I_n$ is the Fisher information. In these cases, the $1 - \alpha$ Bayesian intervals will be approximately the same as the frequentist confidence intervals. That is, an approximate $1 - \alpha$ posterior interval is

$$
C = \hat{\theta}_n \pm \frac{z_{\alpha/2}}{\sqrt{I_n(\hat{\theta}_n)}}
$$
which is the Wald confidence interval. However, this is only true if n is large and the
dimension of the model is fixed.
Let's summarize this point: In low dimensional models, with lots of data and assuming the usual regularity conditions, Bayesian posterior intervals will also be
frequentist confidence intervals. In this case, there is little difference between
the two.
Here is a rough derivation of (1). Note that

$$
\log p(\theta|X_1,\ldots,X_n) = \sum_{i=1}^n \log p(X_i|\theta) + \log p(\theta) - \log C
$$
where C is the normalizing constant. The sum has n terms, and the number of terms grows with the sample size, while the last two terms are O(1). So the sum dominates, that is,
$$
\log p(\theta|X_1,\ldots,X_n) \approx \sum_{i=1}^n \log p(X_i|\theta) = \ell(\theta).
$$

Expanding $\ell(\theta)$ around the mle, $\ell(\theta) \approx \ell(\hat{\theta}_n) - \tfrac{1}{2} I_n(\hat{\theta}_n)(\theta - \hat{\theta}_n)^2$, so the posterior is approximately proportional to $\exp\{-\tfrac{1}{2} I_n(\hat{\theta}_n)(\theta - \hat{\theta}_n)^2\}$, which gives the Normal approximation (1).
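A small simulation sketch of this point (not from the notes; the Bernoulli model, the Beta(2, 2) prior, and the sample size are illustrative assumptions): for large n, the 95 percent equal-tail posterior interval and the Wald interval nearly coincide.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Illustrative setup: Bernoulli(p) data with a Beta(2, 2) prior.
n, p_true, alpha, beta = 2_000, 0.3, 2.0, 2.0
y = rng.binomial(n, p_true)

# Bayesian: 95% equal-tail posterior interval from Beta(y + alpha, n - y + beta).
bayes = stats.beta(y + alpha, n - y + beta).ppf([0.025, 0.975])

# Frequentist: Wald interval p_hat +/- z_{alpha/2} / sqrt(I_n(p_hat)),
# with I_n(p) = n / (p (1 - p)) for the Bernoulli model.
p_hat = y / n
se = np.sqrt(p_hat * (1 - p_hat) / n)
wald = p_hat + stats.norm.ppf(0.975) * np.array([-1, 1]) * se

print("posterior interval:", np.round(bayes, 4))
print("Wald interval:     ", np.round(wald, 4))
```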
Typically, a 1 − α Bayesian interval has coverage lower than 1 − α. Suppose you wake up every day and produce a Bayesian 95 percent interval for some parameter. (A different parameter every day.) The fraction of times your interval contains the true parameter will not be 95 percent. Here are some examples to make this clear.
Example 4 (Normal means). Let $X_i \sim N(\mu_i, 1)$, $i = 1, \ldots, n$. Suppose we use the flat prior $p(\mu_1,\ldots,\mu_n) \propto 1$. Then, with $\mu = (\mu_1,\ldots,\mu_n)$, the posterior for $\mu$ is multivariate Normal with mean $X = (X_1,\ldots,X_n)$ and covariance matrix equal to the identity matrix. Let $\theta = \sum_{i=1}^n \mu_i^2$. Let $C_n = [c_n, \infty)$ where $c_n$ is chosen so that $P(\theta \in C_n | X_1,\ldots,X_n) = 0.95$. How often, in the frequentist sense, does $C_n$ trap θ? Stein (1959) showed that

$$
P_\mu(\theta \in C_n) \to 0 \quad \text{as } n \to \infty.
$$

In other words, the frequentist coverage is

$$
\text{Coverage} = \inf_{\theta} P_\theta(\theta \in C_n) = 0.
$$
This is called sampling to a foregone conclusion and is a real issue in sequential clinical
trials.
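Here is a hedged simulation sketch of the Stein example (the choice µ_i = 1 and the value of n are illustrative). With the flat prior, µ|X ∼ N(X, I), so θ|X is a noncentral chi-squared with n degrees of freedom and noncentrality Σ_i X_i²; the lower credible limit c_n can then be computed exactly and the frequentist coverage estimated by simulation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Illustrative truth: all mu_i = 1, so theta = sum(mu_i^2) = n.
n, reps = 500, 200
mu = np.ones(n)
theta_true = np.sum(mu ** 2)

covered = 0
for _ in range(reps):
    x = rng.normal(mu, 1.0)
    # Flat prior => posterior mu | X ~ N(X, I), so
    # theta | X ~ noncentral chi-squared with n df and noncentrality sum(X_i^2).
    post = stats.ncx2(df=n, nc=np.sum(x ** 2))
    c_n = post.ppf(0.05)              # P(theta >= c_n | X) = 0.95
    covered += (theta_true >= c_n)

print("frequentist coverage of the 95% credible set:", covered / reps)
# For n this large the coverage is essentially 0, illustrating Stein's result.
```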
The unknown parameter does not appear in the likelihood. In fact, there are no unknown
parameters in the likelihood! The likelihood function contains no information at all. The
posterior is the same as the prior.
But we can estimate θ. Let

$$
\hat{\theta} = \frac{1}{N\pi} \sum_{j=1}^N c_j S_j.
$$

Then $E(\hat{\theta}) = \theta$. Hoeffding's inequality implies that

$$
P(|\hat{\theta} - \theta| > \epsilon) \le 2 e^{-2N\epsilon^2\pi^2}.
$$
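The full setup of this example is not shown above; as a hedged reconstruction, suppose $c_1,\ldots,c_N$ are fixed numbers in [0, 1], $\theta = N^{-1}\sum_j c_j$, and each $S_j \sim \text{Bernoulli}(\pi)$ with π known. Under those assumptions, this sketch checks the unbiasedness of $\hat{\theta}$ and the Hoeffding bound by simulation.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hedged reconstruction of the setup: c_1, ..., c_N fixed in [0, 1],
# theta = mean of the c_j, and S_j ~ Bernoulli(pi) with pi known.
N, pi = 10_000, 0.5
c = rng.uniform(0, 1, size=N)      # drawn once, then treated as fixed constants
theta = c.mean()

def theta_hat():
    S = rng.binomial(1, pi, size=N)
    return (c * S).sum() / (N * pi)

draws = np.array([theta_hat() for _ in range(2_000)])
print("theta =", round(theta, 4), " mean of theta_hat =", round(draws.mean(), 4))

# Empirical check of P(|theta_hat - theta| > eps) <= 2 exp(-2 N eps^2 pi^2).
eps = 0.05
print("empirical tail probability:", np.mean(np.abs(draws - theta) > eps))
print("Hoeffding bound:           ", 2 * np.exp(-2 * N * eps ** 2 * pi ** 2))
```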
6 Bayesian Computing
If θ = (θ1 , . . . , θp ) is a vector then the posterior p(θ|X1 , . . . , Xn ) is a multivariate distribution.
If you are interested in one parameter, θ1 for example, then you need to find the marginal
posterior:

$$
p(\theta_1|X_1,\ldots,X_n) = \int p(\theta_1,\ldots,\theta_p|X_1,\ldots,X_n)\, d\theta_2 \cdots d\theta_p.
$$
Usually, this integral is intractable. In practice, we resort to Monte Carlo methods. These
are discussed in 36/10-702.
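As a rough sketch of what those Monte Carlo methods look like (this is not the notes' own example; the Normal model, flat prior, and tuning constants are all illustrative assumptions), here is a minimal random-walk Metropolis sampler for a two-parameter posterior. The marginal posterior of θ₁ is then just the collection of θ₁-coordinates of the draws, with no explicit integration required.

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative data: Normal with unknown mean (theta1) and log-sd (theta2), flat prior.
x = rng.normal(1.5, 2.0, size=100)

def log_post(theta):
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)
    # log posterior = log likelihood + log prior (a flat prior adds only a constant)
    return -len(x) * log_sigma - np.sum((x - mu) ** 2) / (2 * sigma ** 2)

# Random-walk Metropolis: propose theta' = theta + noise, accept with prob min(1, ratio).
theta = np.array([0.0, 0.0])
draws = []
for _ in range(20_000):
    prop = theta + rng.normal(0, 0.2, size=2)
    if np.log(rng.uniform()) < log_post(prop) - log_post(theta):
        theta = prop
    draws.append(theta)
draws = np.array(draws[5_000:])    # drop burn-in

# Marginal posterior of theta1 = mu: just look at the first coordinate of the draws.
print("posterior mean of mu:", draws[:, 0].mean())
print("95% marginal interval:", np.quantile(draws[:, 0], [0.025, 0.975]))
```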
7 Conclusion
Bayesian and frequentist inference are answering two different questions.
Frequentist inference answers the question: How do I construct a procedure that has frequency
guarantees?
Bayesian inference answers the question: How do I update my subjective beliefs after I observe
some data?
In parametric models, if n is large and the dimension of the model is fixed, Bayes and
frequentist procedures will be similar. Otherwise, they can be quite different.