Lectures 5

Bayesian Statistics

∗ Bayesian statistical inference is an alternative to the usual frequentist inference we have considered up to now.

∗ In frequentist inference the parameters of a distribution are considered fixed unknown quantities and inference is based on the concept of repeated sampling from the underlying distribution.

∗ In Bayesian statistics, all unknown quantities, including the parameters of the distribution, are considered to be random variables.

∗ The distribution of X is then a conditional distribution given a value of the random variable θ.

5-1
Bayesian Inference

∗ As well as the model we need to specify the marginal distribution of θ.

∗ This marginal distribution of θ is called the Prior Distribution since it specifies our knowledge about θ in the absence of any data.

∗ During the inference, we update this distribution based on the observed data so we get the distribution of θ given the observed data.

∗ This distribution is called the Posterior Distribution and all inferences should be based on this distribution.

5-2
Bayes Theorem

∗ The name Bayesian statistics is in honour of Thomas Bayes (c. 1701–1761).

∗ Although the discipline did not emerge until many years after his death, it is based on a theorem that bears his name.

Theorem 5.1 (Bayes Theorem)
Suppose that X and Y are two random variables. Then
f_{Y|X}(y \mid x) = \frac{f_{X|Y}(x \mid y)\, f_Y(y)}{\int f_{X|Y}(x \mid y)\, f_Y(y)\, dy} = \frac{f_{X|Y}(x \mid y)\, f_Y(y)}{f_X(x)}

5-3
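As a quick numerical illustration of the theorem in the discrete case, here is a minimal sketch; the prevalence and test accuracy numbers are made up purely for illustration.

```python
# Minimal discrete illustration of Bayes' Theorem; all numbers are hypothetical.
# Y indicates a condition, X = 1 is a positive test result.
prior_y1 = 0.01           # assumed P(Y = 1)
p_x1_given_y1 = 0.95      # assumed P(X = 1 | Y = 1)
p_x1_given_y0 = 0.05      # assumed P(X = 1 | Y = 0)

# Denominator f_X(x): total probability of a positive test.
p_x1 = p_x1_given_y1 * prior_y1 + p_x1_given_y0 * (1 - prior_y1)

# Bayes' Theorem: P(Y = 1 | X = 1).
posterior_y1 = p_x1_given_y1 * prior_y1 / p_x1
print(posterior_y1)       # approximately 0.161
```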
Bayesian Statistics Process

1. Specify a conditional distribution of the data given the parameters. This is identical to the usual model specification in frequentist statistics.

2. Specify the prior distribution of the model parameters π(θ).

3. Collect the data, X = x.

4. Update the prior distribution based on the data observed to give a Posterior Distribution of the parameters given the observed data x, π(θ | x).

5. All inference is then based on this posterior distribution.

5-4
Finding the Posterior Distribution

∗ Prior distribution π(θ).

∗ Data x1, . . . , xn and likelihood L(θ | x1, . . . , xn).

∗ Posterior distribution
\pi(\theta \mid x_1, \ldots, x_n) = \frac{\pi(\theta)\, L(\theta \mid x_1, \ldots, x_n)}{\int \pi(\theta)\, L(\theta \mid x_1, \ldots, x_n)\, d\theta} \propto \pi(\theta)\, L(\theta \mid x_1, \ldots, x_n)

5-5
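When θ is a scalar, the posterior can be evaluated directly on a grid from the prior and likelihood. The following minimal sketch assumes Bernoulli data and a Beta(2, 2) prior, both chosen purely for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical data: 10 Bernoulli(theta) observations.
x = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])

# Grid of theta values and an assumed Beta(2, 2) prior density pi(theta).
theta = np.linspace(0.001, 0.999, 999)
dtheta = theta[1] - theta[0]
prior = stats.beta.pdf(theta, 2, 2)

# Likelihood L(theta | x1, ..., xn) for i.i.d. Bernoulli data.
likelihood = theta ** x.sum() * (1.0 - theta) ** (len(x) - x.sum())

# Posterior: prior times likelihood, divided by the normalizing integral.
unnorm = prior * likelihood
posterior = unnorm / (unnorm.sum() * dtheta)

print((theta * posterior).sum() * dtheta)   # posterior mean, about 0.56
```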
Bayesian Statistics and Sufficiency

∗ Suppose that T(X) is a minimal sufficient statistic for θ.

∗ Then the Factorization Theorem tells us that the likelihood can be written as
L(θ | x) = g(T(x), θ)h(x)

∗ Hence the posterior distribution
π(θ | x) ∝ π(θ)g(T(x), θ)

∗ Since all inference is based on this posterior distribution, any inference depends on the data only through the value of a minimal sufficient statistic.

5-6
Prior Distributions
∗ The inferences made using Bayesian methods will depend on both the likelihood (as in frequentist statistics) and the prior distribution.

∗ The prior distribution should be an accurate reflection of our beliefs about the parameter before we conduct the experiment from which the data is derived.

∗ This has led to the criticism that Bayesian statistics is very subjective.

∗ Quantifying one's prior knowledge into a distribution can also be very hard.

∗ Even when possible, the resulting posterior distribution may be difficult to deal with.

5-7
Conjugate Priors
Definition 5.1
Given a family F of pdf's (or pmf's) f(x | θ) indexed by a parameter θ, a family Q of prior distributions is said to be conjugate for the family F if the posterior distribution of θ is in the family Q for all f ∈ F, all priors π(θ) ∈ Q and all possible data sets x.

∗ Conjugate families often make the mathematics of Bayesian statistics easier.

∗ They may not adequately describe the prior knowledge about the parameter.

5-8
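For example, the Beta family is conjugate for Bernoulli data: the posterior is again a Beta distribution with updated parameters. A minimal sketch, with made-up data and a Beta(2, 2) prior chosen for illustration:

```python
from scipy import stats

# Conjugacy sketch: a Beta prior with Bernoulli data gives a Beta posterior.
# The prior parameters and the data are hypothetical.
a, b = 2.0, 2.0
data = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]

# Posterior is Beta(a + #successes, b + #failures); no integration is needed.
a_post = a + sum(data)
b_post = b + len(data) - sum(data)
posterior = stats.beta(a_post, b_post)

print(a_post, b_post, posterior.mean())   # Beta(9, 7), posterior mean 0.5625
```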
Non-informative Priors

∗ How do we express prior ignorance about a parameter?

∗ If the set of possible values of θ is a finite interval then we may use a uniform distribution.

∗ Most parameter spaces, however, are infinite.

Definition 5.2
Suppose that θ is a parameter with prior distribution π(θ). The prior distribution is called improper if
\int \pi(\theta)\, d\theta = \infty.

5-9
Jeffrey’s Prior
Definition 5.3
Suppose that X ∼ fX (x | θ ). The usual Fisher Information
Matrix is
∂2
" #
I(θ ) = − E log fX (x | θ )
∂ θ∂ θt
The Jeffrey’s Prior for θ is
q
π(θ ) ∝ |I(θ )|.

∗ Attempt to provide a general noninformative prior.

∗ The Jeffrey’s Prior is often improper although usually


Z
fX (x | θ )π(θ ) dθ < ∞

5-10
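As a concrete sketch, for a Bernoulli(θ) model the Fisher information is I(θ) = 1/(θ(1 − θ)), so the Jeffreys prior is proportional to a Beta(1/2, 1/2) density (and in this case happens to be proper). The short numerical check below is illustrative only.

```python
import numpy as np
from scipy import stats

# Jeffreys prior sketch for a Bernoulli(theta) model (illustrative only).
# The Fisher information is I(theta) = 1 / (theta * (1 - theta)), so
# pi(theta) ∝ sqrt(I(theta)) = theta^(-1/2) * (1 - theta)^(-1/2),
# which is proportional to a Beta(1/2, 1/2) density.
theta = np.linspace(0.01, 0.99, 99)
jeffreys_unnorm = np.sqrt(1.0 / (theta * (1.0 - theta)))

# The ratio to the Beta(1/2, 1/2) density is a constant (about 3.14, i.e. pi),
# confirming the proportionality.
ratio = jeffreys_unnorm / stats.beta.pdf(theta, 0.5, 0.5)
print(ratio.min(), ratio.max())
```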
Hyperparameters

∗ In many cases where we wish to give a proper prior distribution, we will specify that π(θ) = π(θ | τ).

∗ Specification of the prior then involves specification of the hyperparameter(s) τ.

∗ In many cases we use our prior beliefs to decide on the values of the hyperparameters.

∗ For example, we may be willing to specify a prior mean or prior probabilities for certain sets.

∗ If we are unsure about the hyperparameters we may wish to conduct a sensitivity analysis to see how much our inferences change when the hyperparameters change.

5-11
Hierarchical Bayes Models
∗ In many cases we consider the hyperparameters to be random variables with their own prior distribution. We will usually use a non-informative, parameter-free prior, which may be improper, for this second-stage prior.

∗ This leads to the Hierarchical Bayes Model
X ∼ f(x | θ)
θ ∼ πθ(θ | τ)
τ ∼ πτ(τ)

∗ The joint posterior of θ and τ is then
π(θ, τ | x) ∝ f(x | θ) πθ(θ | τ) πτ(τ).

∗ Integration over τ gives the marginal posterior π(θ | x).


5-12
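A small numerical sketch of this structure: the normal likelihood, normal first-stage prior and exponential second-stage prior below are illustrative assumptions, and the integration over τ is done by summation on a grid.

```python
import numpy as np
from scipy import stats

# Hypothetical hierarchical model (a sketch, not taken from the notes):
#   x_i | theta ~ N(theta, 1),  theta | tau ~ N(0, tau^2),  tau ~ Exponential(1)
x = np.array([0.8, 1.4, 0.2, 1.1, 0.9])

theta = np.linspace(-3, 4, 351)
tau = np.linspace(0.05, 5, 200)
T, Th = np.meshgrid(tau, theta)            # (theta, tau) grids

# Unnormalized joint posterior pi(theta, tau | x) on the grid.
loglik = np.sum(stats.norm.logpdf(x[:, None, None], loc=Th, scale=1.0), axis=0)
logprior = stats.norm.logpdf(Th, loc=0.0, scale=T) + stats.expon.logpdf(T)
joint = np.exp(loglik + logprior)

# Marginal posterior pi(theta | x): integrate (sum) over tau, then normalize.
marg_theta = joint.sum(axis=1)
marg_theta /= marg_theta.sum() * (theta[1] - theta[0])
print(theta[np.argmax(marg_theta)])        # posterior mode of theta
```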
Empirical Bayes Methods
∗ Another way of specifying the hyperparameters is to estimate them based on the observed data.

∗ Suppose that the prior for θ is π(θ | τ). Then the joint distribution of X and θ for fixed τ is
π(x, θ | τ) = f(x | θ) π(θ | τ)

∗ Hence the marginal distribution for X is
m(x \mid \tau) = \int f(x \mid \theta)\, \pi(\theta \mid \tau)\, d\theta.

∗ We can then use an estimation technique, such as the method of moments, or maximize m(x | τ) (essentially maximum likelihood estimation) to find a suitable value τ̂. This value is then assumed fixed and inference is based on the posterior
π(θ | x) ∝ f(x | θ) π(θ | τ̂)
5-13
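A minimal empirical Bayes sketch, assuming the common normal-normal setup x_i | θ_i ~ N(θ_i, 1), θ_i ~ N(0, τ²) (an illustrative choice): the marginal likelihood m(x | τ) is maximized numerically and τ̂ is then plugged back in to give the posterior means of the θ_i.

```python
import numpy as np
from scipy import stats, optimize

# Empirical Bayes sketch for a hypothetical normal-normal model:
#   x_i | theta_i ~ N(theta_i, 1),  theta_i ~ N(0, tau^2),
# so marginally x_i ~ N(0, 1 + tau^2), which we maximize over tau.
x = np.array([1.2, -0.4, 2.3, 0.7, -1.5, 1.9, 0.3, -0.8])

def neg_log_marginal(log_tau):
    tau = np.exp(log_tau)                     # optimize on the log scale
    return -np.sum(stats.norm.logpdf(x, loc=0.0, scale=np.sqrt(1.0 + tau**2)))

res = optimize.minimize_scalar(neg_log_marginal)
tau_hat = np.exp(res.x)

# With tau fixed at tau_hat, the posterior mean of each theta_i is the
# observation shrunk toward the prior mean 0.
shrink = tau_hat**2 / (1.0 + tau_hat**2)
print(tau_hat, shrink * x)
```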
Bayesian Point Estimation

∗ Inference should be based on the posterior distribution.

∗ How do we summarize a distribution by a single value?

∗ Some possible choices:

1. Posterior mean: \hat{\theta} = \int \theta\, \pi(\theta \mid x_1, \ldots, x_n)\, d\theta.

2. Posterior mode: \hat{\theta} = \arg\max_{\theta} \pi(\theta \mid x_1, \ldots, x_n).

3. Posterior median: \hat{\theta} = \min\left\{ t \in \Theta : \int_{-\infty}^{t} \pi(\theta \mid x)\, d\theta \geqslant 0.5 \right\}.

4. Decision-theoretic estimates based on a loss function L(θ, θ̂).

5-14
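The sketch below computes the first three summaries for a hypothetical Beta(9, 7) posterior (the posterior that the earlier illustrative Beta(2, 2) prior and Bernoulli data would give).

```python
from scipy import stats, optimize

# Point summaries of a hypothetical Beta(9, 7) posterior.
posterior = stats.beta(9, 7)

post_mean = posterior.mean()                       # posterior mean
post_median = posterior.ppf(0.5)                   # posterior median
post_mode = optimize.minimize_scalar(              # posterior mode by numeric search
    lambda t: -posterior.pdf(t), bounds=(0.0, 1.0), method="bounded").x

print(post_mean, post_median, post_mode)           # ~0.563, ~0.566, ~0.571 (= 8/14)
```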
Bayesian Decision Theory

∗ Let the function L(θ, θ̂(X)) be the Loss Function associated with estimating the unknown value θ by the estimator θ̂(X).

∗ This is a function of θ and the data.

∗ Typical loss functions for estimation are

• Absolute error loss: L(θ, θ̂) = |θ̂ − θ|

• Squared error loss: L(θ, θ̂) = (θ̂ − θ)²

• In some cases the loss can be quantified in terms of resources, and so this can be used as a loss function.

5-15
The Risk Function
∗ Recall that the Risk Function is the average loss when using the Decision Rule (estimator) θ̂, for a given value of θ:
R(\theta, \hat{\theta}) = E\left[ L(\theta, \hat{\theta}(X)) \mid \theta \right]

∗ We would like to find an estimator which makes the risk function small.

∗ In Bayesian statistics, however, θ is a random variable, so the risk, being a function of θ, is also a random variable.

Definition 5.4
Suppose that θ̂1(X) and θ̂2(X) are two possible estimators of a parameter θ ∈ Θ. Then θ̂2 is said to be an inadmissible estimator of θ if
R(θ, θ̂1) ⩽ R(θ, θ̂2) for every θ ∈ Θ, and
R(θ0, θ̂1) < R(θ0, θ̂2) for some θ0 ∈ Θ.
5-16
Bayes Risk and Bayes Rules
Definition 5.5
Suppose that we wish to estimate a parameter θ ∈ Θ and we have the prior distribution π(θ). Let θ̂ be an estimator of θ with risk function (for a specified loss function) R(θ, θ̂). The Bayes Risk is the average risk over all possible values of θ:
R_B(\hat{\theta}) = E\left[ R(\theta, \hat{\theta}) \right] = \int_{\Theta} R(\theta, \hat{\theta})\, \pi(\theta)\, d\theta
The estimator θ̂ which minimizes the Bayes risk is known as the Bayes Rule.

5-17
Finding the Bayes Rule
Theorem 5.2
The Bayes Rule is the estimator which minimizes the posterior
expected loss.

Theorem 5.3
1. When using squared error loss, the Bayes Rule is the posterior expected value, E(θ | x).

2. When using absolute error loss, the Bayes Rule is the posterior median.

5-18
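A quick numerical check of Theorem 5.3, again using the hypothetical Beta(9, 7) posterior: the posterior expected loss is evaluated on a grid for a range of candidate estimates, and the minimizers land at the posterior mean and median respectively.

```python
import numpy as np
from scipy import stats

# Numerical check of Theorem 5.3 on a hypothetical Beta(9, 7) posterior.
posterior = stats.beta(9, 7)
theta = np.linspace(0.001, 0.999, 999)
dens = posterior.pdf(theta)
dtheta = theta[1] - theta[0]

candidates = np.linspace(0.3, 0.8, 501)
sq_loss = [np.sum((c - theta) ** 2 * dens) * dtheta for c in candidates]
abs_loss = [np.sum(np.abs(c - theta) * dens) * dtheta for c in candidates]

print(candidates[np.argmin(sq_loss)], posterior.mean())     # both near the posterior mean
print(candidates[np.argmin(abs_loss)], posterior.ppf(0.5))  # both near the posterior median
```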
Bayesian Hypothesis Tests

∗ Suppose we wish to test the hypothesis
H0: θ ∈ Θ0 versus H1: θ ∈ Θ1

∗ As with estimation, our test is based on the posterior distribution π(θ | x).

∗ Since this is a probability distribution we can evaluate
P(H_0 \text{ is true} \mid x) = P(\theta \in \Theta_0 \mid x) = \int_{\Theta_0} \pi(\theta \mid x)\, d\theta

∗ Generally we will also have
P(H_1 \text{ is true} \mid x) = P(\theta \notin \Theta_0 \mid x) = 1 - P(H_0 \text{ is true} \mid x)

5-19
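For a composite null such as H0: θ ⩽ 0.5, the posterior probability of H0 is just the posterior cdf evaluated at 0.5. A minimal sketch with the hypothetical Beta(9, 7) posterior used earlier:

```python
from scipy import stats

# Posterior probability of a composite hypothesis H0: theta <= 0.5,
# for a hypothetical Beta(9, 7) posterior.
posterior = stats.beta(9, 7)

p_h0 = posterior.cdf(0.5)        # P(theta in Theta_0 | x)
p_h1 = 1.0 - p_h0                # P(theta not in Theta_0 | x)
print(p_h0, p_h1)                # e.g. reject H0 if p_h0 < alpha
```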
Rejection Regions For Bayesian Tests

∗ One possible way to construct a Bayesian hypothesis testing rule would be to reject H0 whenever
P(H1 is true | x) > P(H0 is true | x)

∗ In keeping with the notion that rejecting H0 when it is true is the more serious error, it is more common to reject H0 whenever
P(H0 is true | x) < α

∗ Note that α here has a quite different interpretation from the size of the test. Nonetheless, similar values are usually considered.

5-20
Priors for Bayesian Hypothesis Tests
∗ Usually we will use different priors for hypothesis testing than we do for estimation.

∗ Suppose that Θ0 = {θ0}. If we proceed as in estimation, whereby we put a continuous prior on θ, then we will find
P(θ = θ0 | x) = 0 for any sample x.

∗ When Θ0 = {θ0}, this will often mean we use a mixture of a degenerate distribution at θ = θ0 and a continuous distribution over θ ≠ θ0:
π(θ) = p I(θ = θ0) + (1 − p) π1(θ)

∗ It is common to take p = 0.5 so that
P(H0) = P(θ = θ0) = P(θ ≠ θ0) = P(H1).

5-21
Posterior Odds
∗ From Bayes Theorem we have that
P(\theta \in \Theta_0 \mid x) = \frac{f(x \mid \theta \in \Theta_0)\, P(\theta \in \Theta_0)}{m(x)}
P(\theta \notin \Theta_0 \mid x) = \frac{f(x \mid \theta \notin \Theta_0)\, P(\theta \notin \Theta_0)}{m(x)}

∗ Hence we have that
\frac{P(\theta \notin \Theta_0 \mid x)}{P(\theta \in \Theta_0 \mid x)} = \frac{f(x \mid \theta \notin \Theta_0)\, P(\theta \notin \Theta_0)}{f(x \mid \theta \in \Theta_0)\, P(\theta \in \Theta_0)}

∗ If we choose our prior such that
P(\theta \in \Theta_0) = P(\theta \notin \Theta_0) = 0.5
then the posterior odds in favour of H1 over H0 are
B_{10} = \frac{P(\theta \notin \Theta_0 \mid x)}{P(\theta \in \Theta_0 \mid x)} = \frac{f(x \mid \theta \notin \Theta_0)}{f(x \mid \theta \in \Theta_0)}
5-22
Bayes Factors

Definition 5.6
Suppose that X is a random sample from the joint distribution f(x; θ) and we are testing
H0: θ ∈ Θ0 versus H1: θ ∉ Θ0.
Then the Bayes Factor for this test is defined as
B_{10} = \frac{P(\theta \notin \Theta_0 \mid x)}{P(\theta \in \Theta_0 \mid x)} = \frac{f(x \mid \theta \notin \Theta_0)}{f(x \mid \theta \in \Theta_0)}

∗ The Bayes Factor may be used to decide on which hypothesis to accept given the data.

∗ A common rule of thumb is to reject H0 when B10 > 20.

5-23
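As an illustrative sketch, consider the point null H0: θ = 0.5 against H1: θ ≠ 0.5 for Bernoulli data, with a uniform Beta(1, 1) prior for θ under H1 (the data and prior are made up). The marginal likelihood under H1 has a closed form here, so the Bayes factor is easy to compute.

```python
import numpy as np
from scipy.special import betaln

# Bayes factor sketch for the point null H0: theta = 0.5 vs H1: theta != 0.5,
# with a hypothetical uniform Beta(1, 1) prior for theta under H1.
data = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
s, n = sum(data), len(data)

# Marginal likelihood under H1: integral of theta^s (1 - theta)^(n - s)
# against the Beta(1, 1) prior, which equals the Beta function B(s+1, n-s+1).
log_m1 = betaln(s + 1, n - s + 1)

# Likelihood under the simple null H0: theta = 0.5.
log_m0 = n * np.log(0.5)

B10 = np.exp(log_m1 - log_m0)
print(B10)   # about 0.78 here, far below 20, so H0 would not be rejected
```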
Bayes Factors

∗ When both H0 and H1 are simple hypotheses, the Bayes factor is exactly the likelihood ratio.

∗ For composite hypotheses, however, we will usually specify a prior on θ conditional on which hypothesis is true.

∗ In that case we will get
P(x \mid H_i) = \int f(x \mid H_i, \theta)\, \pi(\theta \mid H_i)\, d\theta, \quad i = 0, 1

∗ See Kass & Raftery (1995), "Bayes Factors", Journal of the American Statistical Association, 90: 773–795, for a nice introduction to Bayes factors.

5-24
Bayesian Interval Estimation

∗ Bayesian interval estimation is also based on the posterior distribution of the parameter.

∗ In frequentist inference we found a random set such that the probability that the set contains the unknown parameter is a certain value 1 − α.

∗ In Bayesian inference we find a fixed set such that the posterior probability that θ is in the set, given the observed data, is equal to 1 − α.

5-25
Bayesian Credible Sets
Definition 5.7
Suppose that X ∼ f(x | θ) and π(θ | x) is the posterior density for θ given the observed data x. A set I(x) which satisfies
\int_{I(x)} \pi(\theta \mid x)\, d\theta = 1 - \alpha
is called a 100(1 − α)% Bayesian credible set for θ.

5-26
Optimal Bayesian Credible Sets

∗ Of course there are many sets I(x) which can satisfy the definition of a Bayesian credible set.

∗ When the posterior density is unimodal we can apply Theorem 12 (page 73 in the third set of lecture notes) to find the shortest such interval.

Theorem 5.4
If the posterior density π(θ | x) is unimodal then, for a given value of α, the shortest 1 − α credible interval for θ is given by
I(x) = {θ : π(θ | x) ⩾ k}
where k satisfies
\int_{\{\theta : \pi(\theta \mid x) \geqslant k\}} \pi(\theta \mid x)\, d\theta = 1 - \alpha.

5-27
Highest Posterior Density Sets
Definition 5.8
Given a data set x1, . . . , xn and a posterior density π(θ | x) for a scalar parameter θ, a Bayesian 100(1 − α)% Highest Posterior Density set for θ is the set of points
C_k(x) = {θ : π(θ | x) ⩾ k}.
The value k is chosen such that
P(\theta \in C_k(x) \mid x) = \int_{C_k(x)} \pi(\theta \mid x)\, d\theta = 1 - \alpha.

∗ When the posterior is unimodal this will be the shortest-length interval.

∗ For multimodal posterior distributions, however, it may not even be an interval but a union of disjoint intervals.

5-28
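A simple way to approximate a highest posterior density set numerically is to lower the threshold k over a grid until the set {θ : π(θ | x) ⩾ k} captures 1 − α of the posterior mass. The sketch below again uses the hypothetical Beta(9, 7) posterior.

```python
import numpy as np
from scipy import stats

# Grid-based sketch of a 95% highest posterior density interval for the
# hypothetical Beta(9, 7) posterior.
posterior = stats.beta(9, 7)
alpha = 0.05

theta = np.linspace(0.0, 1.0, 10001)
dens = posterior.pdf(theta)
dtheta = theta[1] - theta[0]

# Include grid points in order of decreasing density until the accumulated
# probability reaches 1 - alpha; the density there is the threshold k.
order = np.argsort(dens)[::-1]
cum = np.cumsum(dens[order]) * dtheta
k = dens[order][np.searchsorted(cum, 1.0 - alpha)]

in_set = theta[dens >= k]
print(in_set.min(), in_set.max())   # a single interval here, since the posterior is unimodal
```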
Equi-tailed Bayesian Intervals
Definition 5.9
Given a data set x1, . . . , xn and a posterior density π(θ | x) for a scalar parameter θ, a Bayesian 100(1 − α)% equi-tailed credible interval for θ is the interval with end-points θl(x) < θu(x) such that
P(\theta < \theta_l(x) \mid x) = \int_{-\infty}^{\theta_l(x)} \pi(\theta \mid x)\, d\theta = \frac{\alpha}{2}
P(\theta > \theta_u(x) \mid x) = \int_{\theta_u(x)}^{\infty} \pi(\theta \mid x)\, d\theta = \frac{\alpha}{2}

5-29
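For the equi-tailed interval only the α/2 and 1 − α/2 posterior quantiles are needed; a short sketch for the same hypothetical posterior:

```python
from scipy import stats

# Equi-tailed 95% credible interval for the hypothetical Beta(9, 7) posterior:
# simply the alpha/2 and 1 - alpha/2 posterior quantiles.
posterior = stats.beta(9, 7)
alpha = 0.05

theta_l = posterior.ppf(alpha / 2)
theta_u = posterior.ppf(1 - alpha / 2)
print(theta_l, theta_u)
```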
Integration in Bayesian Analysis

∗ All Bayesian inference is based on the posterior distribution.

∗ In many situations (particularly with non-conjugate priors) this distribution may be rather complex.

∗ Most estimates we may use (posterior mean, posterior median, ...) involve integration.

∗ Integration is also necessary to find the posterior probabilities required in testing or interval estimation.

5-30
Monte Carlo Bayesian Analysis
∗ Due to the importance of integration, Monte Carlo methods are widely used in Bayesian statistics.

∗ Usually we only know the posterior distribution up to a constant, and that constant,
\int f(x \mid \theta)\, \pi(\theta)\, d\theta,
may not be easy to find in closed form.

∗ For this reason, methods that do not need this constant are often used.

∗ Accept–Reject and Metropolis–Hastings algorithms are such methods.

∗ In the case of multidimensional θ, the Gibbs Sampler is widely used.
5-31
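The sketch below is a minimal random-walk Metropolis–Hastings sampler; the Bernoulli model, Beta(2, 2) prior and proposal standard deviation of 0.1 are illustrative assumptions. Only the unnormalized posterior is ever evaluated, so the normalizing constant is never needed.

```python
import numpy as np
from scipy import stats

# Random-walk Metropolis-Hastings sketch: sample theta from a posterior known
# only up to a constant (hypothetical Bernoulli data with a Beta(2, 2) prior).
rng = np.random.default_rng(0)
x = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])

def log_unnorm_post(theta):
    if theta <= 0.0 or theta >= 1.0:
        return -np.inf
    log_prior = stats.beta.logpdf(theta, 2, 2)
    log_lik = x.sum() * np.log(theta) + (len(x) - x.sum()) * np.log(1 - theta)
    return log_prior + log_lik              # normalizing constant never required

theta = 0.5
samples = []
for _ in range(20000):
    prop = theta + 0.1 * rng.standard_normal()        # symmetric proposal
    log_accept = log_unnorm_post(prop) - log_unnorm_post(theta)
    if np.log(rng.uniform()) < log_accept:
        theta = prop
    samples.append(theta)

samples = np.array(samples[2000:])                     # discard burn-in
print(samples.mean(), np.quantile(samples, [0.025, 0.975]))
```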
