
Lecture Notes 17

Bayesian Inference
Relevant material is in Chapter 11.

1 Introduction
So far we have been using frequentist (or classical) methods. In the frequentist approach,
probability is interpreted as long run frequencies. The goal of frequentist inference is to
create procedures with long run guarantees. Indeed, a better name for frequentist inference
might be procedural inference. Moreover, the guarantees should be uniform over θ if possible.
For example, a confidence interval traps the true value of θ with probability 1 − α, no matter
what the true value of θ is. In frequentist inference, procedures are random while
parameters are fixed, unknown quantities.
In the Bayesian approach, probability is regarded as a measure of subjective degree of
belief. In this framework, everything, including parameters, is regarded as random. There
are no long run frequency guarantees. Bayesian inference is quite controversial.
Note that when we used Bayes estimators in minimax theory, we were not doing Bayesian
inference. We were simply using Bayesian estimators as a method to derive minimax esti-
mators.
One very important point, which causes a lot of confusion, is this:

Using Bayes’ Theorem ≠ Bayesian inference


The difference between Bayesian inference and frequentist inference is the goal.
Bayesian Goal: Quantify and analyze subjective degrees of belief.
Frequentist Goal: Create procedures that have frequency guarantees.
Neither method of inference is right or wrong. Which one you use depends on your goal.
If your goal is to quantify and analyze your subjective degrees of belief, you should use
Bayesian inference. If your goal is to create procedures that have frequency guarantees, then you
should use frequentist procedures.
Sometimes you can do both. That is, sometimes a Bayesian method will also have good
frequentist properties. Sometimes it won’t.
A summary of the main ideas is in Table 1.

                      Bayesian                       Frequentist
Probability           subjective degree of belief    limiting frequency
Goal                  analyze beliefs                create procedures with frequency guarantees
θ                     random variable                fixed
X                     random variable                random variable
Use Bayes’ theorem?   Yes, to update beliefs.        Yes, if it leads to a procedure with good
                                                     frequentist behavior; otherwise no.

Table 1: Bayesian versus Frequentist Inference

To add to the confusion:


Bayes’ nets: directed graphs endowed with distributions. These have nothing to do with
Bayesian inference.
Bayes’ rule: the optimal classification rule in a binary classification problem. This has
nothing to do with Bayesian inference.

2 The Mechanics of Bayes


Let X1 , . . . , Xn ∼ p(x|θ). In Bayes we also include a prior p(θ). It follows from Bayes’
theorem that the posterior distribution of θ given the data is

$$p(\theta|X_1,\ldots,X_n) = \frac{p(X_1,\ldots,X_n|\theta)\,p(\theta)}{p(X_1,\ldots,X_n)}$$

where

$$p(X_1,\ldots,X_n) = \int p(X_1,\ldots,X_n|\theta)\,p(\theta)\,d\theta.$$

Hence,
$$p(\theta|X_1,\ldots,X_n) \propto L(\theta)\,p(\theta)$$
where L(θ) = p(X1, . . . , Xn|θ) is the likelihood function. The interpretation is that p(θ|X1, . . . , Xn)
represents your subjective beliefs about θ after observing X1, . . . , Xn.
A commonly used point estimator is the posterior mean
$$\bar\theta = E(\theta|X_1,\ldots,X_n) = \int \theta\, p(\theta|X_1,\ldots,X_n)\,d\theta = \frac{\int \theta\, L(\theta)\,p(\theta)\,d\theta}{\int L(\theta)\,p(\theta)\,d\theta}.$$

For interval estimation we use C = (a, b) where a and b are chosen so that
$$\int_a^b p(\theta|X_1,\ldots,X_n)\,d\theta = 1-\alpha.$$

The interpretation is that
$$P(\theta \in C \,|\, X_1,\ldots,X_n) = 1-\alpha.$$
This does not mean that C traps θ with probability 1 − α. We will discuss the distinction
in detail later.
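As a quick illustration of these mechanics, here is a small Python sketch (the Poisson model, half-normal prior, and data are made-up choices for illustration) that computes the posterior on a grid as likelihood times prior, normalizes it, and reads off the posterior mean and an equal-tailed 1 − α interval.

```python
# Grid-based posterior: posterior is proportional to likelihood x prior,
# normalized over a grid of parameter values. Model and prior are illustrative.
import numpy as np
from scipy import stats

x = np.array([3, 5, 4, 6, 2, 4])           # hypothetical Poisson(theta) data
theta = np.linspace(0.01, 15, 2000)        # grid over the parameter

log_lik = np.array([stats.poisson.logpmf(x, mu=t).sum() for t in theta])
log_prior = stats.halfnorm.logpdf(theta, scale=5.0)   # a vague, proper prior

log_post = log_lik + log_prior
post = np.exp(log_post - log_post.max())   # unnormalized posterior on the grid
post /= np.trapz(post, theta)              # normalize so it integrates to 1

post_mean = np.trapz(theta * post, theta)
cdf = np.cumsum(post) * (theta[1] - theta[0])
a, b = theta[np.searchsorted(cdf, 0.025)], theta[np.searchsorted(cdf, 0.975)]
print(f"posterior mean {post_mean:.2f}, 95% interval ({a:.2f}, {b:.2f})")
```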

Example 1 Let $X_1,\ldots,X_n \sim \text{Bernoulli}(p)$. Let the prior be $p \sim \text{Beta}(\alpha,\beta)$, that is,

$$p(p) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, p^{\alpha-1}(1-p)^{\beta-1}$$

where

$$\Gamma(\alpha) = \int_0^\infty t^{\alpha-1} e^{-t}\,dt.$$

Set $Y = \sum_i X_i$. Then

$$p(p|X) \propto \underbrace{p^{Y}(1-p)^{n-Y}}_{\text{likelihood}} \times \underbrace{p^{\alpha-1}(1-p)^{\beta-1}}_{\text{prior}} \propto p^{Y+\alpha-1}(1-p)^{n-Y+\beta-1}.$$

Therefore, $p|X \sim \text{Beta}(Y+\alpha,\, n-Y+\beta)$. (See page 325 for more details.) The Bayes
estimator is

$$\tilde p = \frac{Y+\alpha}{(Y+\alpha)+(n-Y+\beta)} = \frac{Y+\alpha}{\alpha+\beta+n} = (1-\lambda)\,\hat p_{\text{mle}} + \lambda\, \bar p$$

where

$$\bar p = \frac{\alpha}{\alpha+\beta}, \qquad \lambda = \frac{\alpha+\beta}{\alpha+\beta+n}.$$

This is an example of a conjugate prior.
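Here is a small sketch of this conjugate update in Python (the prior parameters and simulated data are arbitrary): the posterior is Beta(Y + α, n − Y + β), and its mean reproduces the shrinkage identity (1 − λ)p̂_mle + λp̄.

```python
# Conjugate Beta-Bernoulli update with illustrative numbers.
import numpy as np
from scipy import stats

alpha, beta_ = 2.0, 5.0              # hypothetical prior Beta(alpha, beta)
rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=40)    # simulated Bernoulli(0.3) data
n, Y = len(x), x.sum()

posterior = stats.beta(Y + alpha, n - Y + beta_)

# Posterior mean equals (1 - lam) * mle + lam * prior mean.
lam = (alpha + beta_) / (alpha + beta_ + n)
p_mle = Y / n
p_bar = alpha / (alpha + beta_)
print(posterior.mean(), (1 - lam) * p_mle + lam * p_bar)   # identical values

# An equal-tailed 95% posterior interval.
print(posterior.ppf([0.025, 0.975]))
```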

Example 2 Let $X_1,\ldots,X_n \sim N(\mu, \sigma^2)$ with $\sigma^2$ known. Let $\mu \sim N(m, \tau^2)$. Then

$$E(\mu|X) = \frac{\tau^2}{\tau^2 + \frac{\sigma^2}{n}}\,\overline{X} + \frac{\frac{\sigma^2}{n}}{\tau^2 + \frac{\sigma^2}{n}}\,m$$

and

$$\mathrm{Var}(\mu|X) = \frac{\sigma^2\tau^2/n}{\tau^2 + \frac{\sigma^2}{n}}.$$
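The following sketch (with made-up values for σ, m, τ and simulated data) evaluates these two formulas and cross-checks them against a brute-force grid computation of the posterior.

```python
# Normal-Normal posterior mean/variance formulas, checked against a grid posterior.
import numpy as np
from scipy import stats

sigma, m, tau = 1.0, 0.0, 2.0            # known sd, prior N(m, tau^2)
rng = np.random.default_rng(1)
x = rng.normal(1.5, sigma, size=25)
n, xbar = len(x), x.mean()

w = tau**2 / (tau**2 + sigma**2 / n)     # weight on the sample mean
post_mean = w * xbar + (1 - w) * m
post_var = (sigma**2 * tau**2 / n) / (tau**2 + sigma**2 / n)

# Grid check: posterior is proportional to likelihood x prior.
mu = np.linspace(-5, 5, 4000)
log_post = (stats.norm.logpdf(x[:, None], mu[None, :], sigma).sum(axis=0)
            + stats.norm.logpdf(mu, m, tau))
p = np.exp(log_post - log_post.max()); p /= np.trapz(p, mu)
print(post_mean, np.trapz(mu * p, mu))                    # should agree closely
print(post_var, np.trapz((mu - post_mean)**2 * p, mu))
```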

Example 3 Suppose that $X_1,\ldots,X_n \sim \text{Bernoulli}(p_1)$ and that $Y_1,\ldots,Y_m \sim \text{Bernoulli}(p_2)$.
We are interested in δ = p2 − p1. Let us use the prior p(p1, p2) = 1. The posterior for (p1, p2)
is

$$p(p_1, p_2|\text{Data}) \propto p_1^{X}(1-p_1)^{n-X}\, p_2^{Y}(1-p_2)^{m-Y} \propto g(p_1)\,h(p_2)$$

where $X = \sum_i X_i$, $Y = \sum_i Y_i$, g is a Beta(X + 1, n − X + 1) density and h is a Beta(Y + 1, m −
Y + 1) density. To get the posterior for δ we need to do a change of variables: (p1, p2) → (δ, p2)
to get p(δ, p2|Data). Then we integrate:

$$p(\delta|\text{Data}) = \int p(\delta, p_2|\text{Data})\, dp_2.$$

(An easier approach is to use simulation.)
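Concretely, here is a sketch of that simulation approach (the counts X, Y, n, m are made up): since p1 and p2 are independent Beta variables under the posterior, we draw them directly and look at the induced draws of δ = p2 − p1.

```python
# Posterior of delta = p2 - p1 by direct simulation from the two Beta posteriors.
import numpy as np

rng = np.random.default_rng(2)
n, X = 50, 18      # hypothetical: 18 successes out of 50 in group 1
m, Y = 60, 33      # hypothetical: 33 successes out of 60 in group 2

p1 = rng.beta(X + 1, n - X + 1, size=100_000)
p2 = rng.beta(Y + 1, m - Y + 1, size=100_000)
delta = p2 - p1

print("posterior mean of delta:", delta.mean())
print("95% posterior interval:", np.quantile(delta, [0.025, 0.975]))
```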

3 Where Does the Prior Come From?
This is the million dollar question. In principle, the Bayesian is supposed to choose a prior
π that represents their prior information. This will be challenging in high dimensional cases
to say the least. Also, critics will say that someone’s prior opinions should not be included
in a data analysis because this is not scientific.
There has been some effort to define “noninformative priors” but this has not worked out so
well. An example is the Jeffreys prior, which is defined to be

$$p(\theta) \propto \sqrt{I(\theta)}.$$

For example, for Bernoulli(p) the Fisher information is I(p) = 1/(p(1 − p)), so the Jeffreys
prior is proportional to $p^{-1/2}(1-p)^{-1/2}$, which is a Beta(1/2, 1/2) density.
You can use a flat prior but be aware that this prior doesn’t retain its flatness under trans-
formations. In high dimensional cases, the prior ends up being highly influential. The result
is that Bayesian methods tend to have poor frequentist behavior. We’ll return to this point
soon.
It is common to use flat priors even if they don’t integrate to 1. This is possible since the
posterior might still integrate to 1 even if the prior doesn’t.

4 Large Sample Theory


There is a Bayesian central limit theorem. In nice models, with large n,

$$p(\theta|X_1,\ldots,X_n) \approx N\!\left(\hat\theta,\ \frac{1}{I_n(\hat\theta)}\right) \qquad (1)$$

where $\hat\theta$ is the mle and $I_n$ is the Fisher information. In these cases, the 1 − α Bayesian
intervals will be approximately the same as the frequentist confidence intervals. That is, an
approximate 1 − α posterior interval is

$$C = \hat\theta \pm \frac{z_{\alpha/2}}{\sqrt{I_n(\hat\theta)}}$$

which is the Wald confidence interval. However, this is only true if n is large and the
dimension of the model is fixed.
Let’s summarize this point: In low dimensional models, with lots of data and as-
suming the usual regularity conditions, Bayesian posterior intervals will also be
frequentist confidence intervals. In this case, there is little difference between
the two.
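Here is a sketch of this agreement for the Bernoulli model of Example 1 with a flat Beta(1, 1) prior (the sample size and success count are made up): the exact 95% posterior interval is compared with the Wald interval $\hat p \pm z_{\alpha/2}\sqrt{\hat p(1-\hat p)/n}$, using the fact that $I_n(p) = n/(p(1-p))$ for Bernoulli data.

```python
# Large-sample agreement between the exact Beta posterior interval and the Wald interval.
import numpy as np
from scipy import stats

n, Y = 2000, 1240                      # hypothetical large sample: 1240 successes
alpha_prior, beta_prior = 1.0, 1.0     # flat Beta(1, 1) prior

posterior = stats.beta(Y + alpha_prior, n - Y + beta_prior)
bayes_int = posterior.ppf([0.025, 0.975])

p_hat = Y / n
z = stats.norm.ppf(0.975)
wald_int = p_hat + np.array([-1, 1]) * z * np.sqrt(p_hat * (1 - p_hat) / n)

print("posterior interval:", bayes_int)
print("Wald interval:     ", wald_int)   # nearly identical for large n
```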
Here is a rough derivation of (1). Note that

$$\log p(\theta|X_1,\ldots,X_n) = \sum_{i=1}^n \log p(X_i|\theta) + \log p(\theta) - \log C$$

where C is the normalizing constant. Now the sum has n terms, which grows with the sample
size. The last two terms are O(1). So the sum dominates, that is,

$$\log p(\theta|X_1,\ldots,X_n) \approx \sum_{i=1}^n \log p(X_i|\theta) = \ell(\theta).$$

Next, we note that

$$\ell(\theta) \approx \ell(\hat\theta) + (\theta - \hat\theta)\,\ell'(\hat\theta) + \frac{(\theta - \hat\theta)^2\,\ell''(\hat\theta)}{2}.$$

Now $\ell'(\hat\theta) = 0$ so

$$\ell(\theta) \approx \ell(\hat\theta) + \frac{(\theta - \hat\theta)^2\,\ell''(\hat\theta)}{2}.$$

Thus, approximately,

$$p(\theta|X_1,\ldots,X_n) \propto \exp\!\left(-\frac{(\theta - \hat\theta)^2}{2\sigma^2}\right)$$

where

$$\sigma^2 = -\frac{1}{\ell''(\hat\theta)}.$$

Let $\ell_i(\theta) = \log p(X_i|\theta)$ and let $\theta_0$ denote the true value. Since $\hat\theta \approx \theta_0$,

$$\ell''(\hat\theta) \approx \ell''(\theta_0) = \sum_i \ell_i''(\theta_0) = n\left(\frac{1}{n}\sum_i \ell_i''(\theta_0)\right) \approx -nI_1(\theta_0) \approx -nI_1(\hat\theta) = -I_n(\hat\theta)$$

and therefore $\sigma^2 \approx 1/I_n(\hat\theta)$.
5 Bayes Versus Frequentist


In general, Bayesian and frequentist inferences can be quite different. If C is a 1 − α Bayesian
interval then
$$P(\theta \in C \,|\, X_1,\ldots,X_n) = 1-\alpha.$$
This does not imply that

$$\text{frequentist coverage} = \inf_\theta P_\theta(\theta \in C) = 1-\alpha.$$

Typically, a 1 − α Bayesian interval has coverage lower than 1 − α. Suppose you wake
up every day and produce a Bayesian 95 percent interval for some parameter. (A different
parameter every day.) The fraction of times your interval contains the true parameter will
not be 95 percent. Here are some examples to make this clear.

Example 4 Normal means. Let Xi ∼ N(µi, 1), i = 1, . . . , n. Suppose we use the flat
prior p(µ1, . . . , µn) ∝ 1. Then, with µ = (µ1, . . . , µn), the posterior for µ is multivariate
Normal with mean X = (X1, . . . , Xn) and covariance matrix equal to the identity matrix.
Let $\theta = \sum_{i=1}^n \mu_i^2$. Let Cn = [cn, ∞) where cn is chosen so that P(θ ∈ Cn|X1, . . . , Xn) = .95.
How often, in the frequentist sense, does Cn trap θ? Stein (1959) showed that

$$P_\mu(\theta \in C_n) \to 0 \quad \text{as } n \to \infty.$$

Thus, Pµ(θ ∈ Cn) ≈ 0 even though P(θ ∈ Cn|X1, . . . , Xn) = .95.
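A simulation sketch of this phenomenon (the dimension n, true means, and number of replications are arbitrary choices): under the flat prior, µ|X ∼ N(X, I), so $\theta = \sum_i \mu_i^2$ given X has a noncentral chi-squared distribution with n degrees of freedom and noncentrality $\sum_i X_i^2$; the lower endpoint cn is its 5th percentile.

```python
# Frequentist coverage of the 95% one-sided Bayesian interval from Example 4.
import numpy as np
from scipy import stats

n = 50
mu = np.ones(n)                 # true means (so theta_true = n)
theta_true = np.sum(mu**2)
rng = np.random.default_rng(3)

covered = 0
reps = 500
for _ in range(reps):
    X = rng.normal(mu, 1.0)
    c_n = stats.ncx2.ppf(0.05, df=n, nc=np.sum(X**2))   # P(theta >= c_n | X) = 0.95
    covered += (theta_true >= c_n)

print("frequentist coverage of the 95% posterior interval:",
      covered / reps)           # far below 0.95, and it worsens as n grows
```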

Example 5 Sampling to a Foregone Conclusion. Let X1, X2, . . . ∼ N(θ, 1). Suppose
we continue sampling until T > k where $T = \sqrt{n}\,|\overline{X}_n|$ and k is a fixed number, say, k = 20.
The sample size N is now a random variable. It can be shown that P(N < ∞) = 1. It
can also be shown that the posterior p(θ|X1, . . . , XN) is the same as if N had been fixed
in advance. That is, the randomness in N does not affect the posterior. Now if the prior
p(θ) is smooth then the posterior is approximately $\theta|X_1,\ldots,X_N \sim N(\overline{X}_n, 1/n)$. Hence, if
$C_n = \overline{X}_n \pm 1.96/\sqrt{n}$ then P(θ ∈ Cn|X1, . . . , XN) ≈ .95. Notice that 0 is never in Cn since,
when we stop sampling, T > 20, and therefore (taking $\overline{X}_n > 0$; the case $\overline{X}_n < 0$ is symmetric)

$$\overline{X}_n - \frac{1.96}{\sqrt{n}} > \frac{20}{\sqrt{n}} - \frac{1.96}{\sqrt{n}} > 0. \qquad (2)$$

Hence, when θ = 0, Pθ(θ ∈ Cn) = 0. Thus, the coverage is

$$\text{Coverage} = \inf_\theta P_\theta(\theta \in C_n) = 0.$$

This is called sampling to a foregone conclusion and is a real issue in sequential clinical
trials.

Example 6 Let C = {c1, . . . , cN} be a finite set of constants. For simplicity, assume that
cj ∈ {0, 1} (although this is not important). Let

$$\theta = \frac{1}{N}\sum_{j=1}^N c_j.$$

Suppose we want to estimate θ. We proceed as follows. Let S1, . . . , SN ∼ Bernoulli(π) where
π is known. If Sj = 1 you get to see cj. Otherwise, you do not. (This is an example of
survey sampling.) The likelihood function is

$$\prod_j \pi^{S_j}(1-\pi)^{1-S_j}.$$

The unknown parameter does not appear in the likelihood. In fact, there are no unknown
parameters in the likelihood! The likelihood function contains no information at all. The
posterior is the same as the prior.
But we can estimate θ. Let

$$\hat\theta = \frac{1}{N\pi}\sum_{j=1}^N c_j S_j.$$

Then $E(\hat\theta) = \theta$. Hoeffding’s inequality implies that

$$P(|\hat\theta - \theta| > \epsilon) \le 2e^{-2N\epsilon^2\pi^2}.$$

Hence, $\hat\theta$ is close to θ with high probability. In particular, a 1 − α confidence interval is

$$\hat\theta \pm \sqrt{\frac{\log(2/\alpha)}{2N\pi^2}}.$$
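Here is a sketch of this estimator and interval (N, π, α, and the population are arbitrary choices): each cj is observed independently with known probability π, θ̂ reweights the observed values by 1/π, and the Hoeffding bound gives a distribution-free interval.

```python
# Survey-sampling estimator and Hoeffding confidence interval from Example 6.
import numpy as np

rng = np.random.default_rng(4)
N, pi, alpha = 100_000, 0.1, 0.05
c = rng.binomial(1, 0.37, size=N)         # fixed population of 0/1 constants
theta = c.mean()                          # the target, theta = (1/N) sum c_j

S = rng.binomial(1, pi, size=N)           # S_j = 1 means c_j is observed
theta_hat = (c * S).sum() / (N * pi)

half_width = np.sqrt(np.log(2 / alpha) / (2 * N * pi**2))
print(f"theta = {theta:.4f}, theta_hat = {theta_hat:.4f}")
print(f"95% interval: ({theta_hat - half_width:.4f}, {theta_hat + half_width:.4f})")
```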

6 Bayesian Computing
If θ = (θ1 , . . . , θp ) is a vector then the posterior p(θ|X1 , . . . , Xn ) is a multivariate distribution.
If you are interested in one parameter, θ1 for example, then you need to find the marginal
posterior:

$$p(\theta_1|X_1,\ldots,X_n) = \int \cdots \int p(\theta_1,\ldots,\theta_p|X_1,\ldots,X_n)\, d\theta_2 \cdots d\theta_p.$$

Usually, this integral is intractable. In practice, we resort to Monte Carlo methods. These
are discussed in 36/10-702.
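The basic Monte Carlo idea is simple: if we can draw samples (θ1, . . . , θp) from the joint posterior, the marginal posterior of θ1 is just the distribution of the θ1 coordinates of those draws, and no integration is needed. The sketch below illustrates only the mechanics, using the joint posterior from Example 3 (with the same made-up counts) because it can be sampled exactly.

```python
# Marginal posterior via Monte Carlo: take draws from the joint posterior and
# keep only the coordinate of interest.
import numpy as np

rng = np.random.default_rng(5)
n, X = 50, 18
m, Y = 60, 33

joint = np.column_stack([rng.beta(X + 1, n - X + 1, size=100_000),
                         rng.beta(Y + 1, m - Y + 1, size=100_000)])

p1_draws = joint[:, 0]          # marginal posterior draws for p1
print("marginal posterior mean of p1:", p1_draws.mean())
print("95% marginal interval:", np.quantile(p1_draws, [0.025, 0.975]))
```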

7 Conclusion
Bayesian and frequentist inference are answering two different questions.
Frequentist inference answers the question: How do I construct a procedure that has frequency
guarantees?
Bayesian inference answers the question: How do I update my subjective beliefs after I observe
some data?
In parametric models, if n is large and the dimension of the model is fixed, Bayes and
frequentist procedures will be similar. Otherwise, they can be quite different.
