Lecture Two 2025

The document discusses the general inference problem in statistics, focusing on measurement precision, statistical models, and the inference process. It highlights the challenges of drawing conclusions from limited data and the importance of selecting appropriate models to describe the data-generating mechanism. Additionally, it covers various approaches to statistical inference, including parametric, non-parametric, and Bayesian methods, as well as the goals of inference such as estimation, confidence set construction, and hypothesis testing.

Slide 2.1 / 2.106

2 General inference problem


2.1 Measurement precision
2.2 Statistical Models
2.3 Inference problem
2.4 Goals in Statistical Inference
2.5 Statistical decision theoretic approach to inference

Chapter 2 : General inference problem Page 81 / 748


2.1 Measurement precision Slide 2.2 / 2.106

2.1 Measurement precision

We studied different probability models because they can be used to


describe the population of interest. Finding such a good model is our
final goal. On the way towards this goal, we use the data to identify
the parameters that are used to describe the models.

We stress that the statistical inference problem arises precisely because


we do not know the exact value of the parameter in the model
description and we use the data to work out a proxy for the parameter.

The statistician is confronted with the problem of drawing a conclusion


about the population by using the limited information from the dataset.

Chapter 2 : General inference problem Page 82 / 748


2.1 Measurement precision Slide 2.3 / 2.106

The purpose in Statistical Inference is to draw conclusions from data.

The conclusions might be about predicting further outcomes, evaluating risks of events, or testing hypotheses, among others.

In all cases, inference about the population is to be drawn from


limited information contained in the sample.

Chapter 2 : General inference problem Page 83 / 748


2.1 Measurement precision Slide 2.4 / 2.106

The most common situation in Statistics:

an experiment has been performed (i.i.d.);
the possible results are real numbers that form a vector of observations x = (x1, x2, . . . , xn);
the appropriate sample space is R^n;
there is typically a “hidden” mechanism that generates the data; we are looking for ways to identify it.

Chapter 2 : General inference problem Page 84 / 748


2.1 Measurement precision Slide 2.5 / 2.106

Models will describe this mechanism in some simplistic but hopefully


useful way.

For the model to be more trustworthy, continuous variables, such as


time, interval measurements, etc. should be treated as such, where
feasible. However, in practice, only discrete events can actually be
observed.

Thus we record with some unit of measurement, ∆, determined by


the precision of the measuring instrument. This unit of measurement
is always finite in any real situation.

Chapter 2 : General inference problem Page 85 / 748


2.1 Measurement precision Slide 2.6 / 2.106

If empirical observations were truly continuous, then, with probability


one, no two observed responses would ever be identical. This fact will
sometimes be used in our theoretical derivations.

On the other hand, the real life empirical observations are discrete.
This fact will be utilized by us to keep some of the proofs simpler.
In many cases we will be dealing with the discrete case only, thus
avoiding more involved measure-theoretic arguments.

Chapter 2 : General inference problem Page 86 / 748


2.2 Statistical Models Slide 2.7 / 2.106

2.2 Statistical Models


Having got the observations we would like to calculate the joint density (in the continuous case):

L_X(x) = f_X(x1, x2, . . . , xn) = f_{X1}(x1) · f_{X2}(x2) · · · f_{Xn}(xn)    (1)

(the factorisation into a product assumes independent observations). In the discrete case this will be just the product of the probabilities for each of the measurements to be in a suitable interval of length ∆.

If the observations were independent identically distributed (i.i.d.) then all densities in (1) would be the same:

f_{X1}(x) = f_{X2}(x) = · · · = f_{Xn}(x) = f(x).

This is the most typical situation we will be discussing in our course.
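To make formula (1) concrete, here is a minimal Python sketch (not part of the original slides) that evaluates the joint density of an assumed i.i.d. N(µ, σ²) sample on the log scale; the data and parameter values are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical i.i.d. sample and assumed parameter values (illustration only).
x = np.array([4.2, 5.1, 4.8, 5.5, 4.9])
mu, sigma = 5.0, 0.5

# For independent observations the joint density factorises into marginals,
# so the log joint density (log-likelihood) is a sum of log marginal densities.
log_joint = np.sum(stats.norm.logpdf(x, loc=mu, scale=sigma))
print(log_joint)
```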


Chapter 2 : General inference problem Page 87 / 748
2.2 Statistical Models Slide 2.8 / 2.106

The need for Statistical Inference arises since, typically, our knowledge about f_X(x1, x2, . . . , xn) is incomplete.

Given an inference problem and having collected some data, we construct one or more sets of possible models which may help us to understand the data generating mechanism.

Basically, statistical models are working assumptions about how the


dataset was obtained.

Chapter 2 : General inference problem Page 88 / 748


2.2 Statistical Models Slide 2.9 / 2.106

Example 2.9
If our data were counts of accidents within n = 10 consecutive weeks
on a busy crossroad, it may be reasonable to assume that a Poisson
distribution with an unknown parameter λ has given rise to the data.
That is, we may assume that we have 10 independent realisations of
a Poisson(λ) random variable.

Example 2.10
If, on the other hand, we measured the lengths Xi of 20 baby boys at
age 4 weeks, it would be reasonable to assume a normal distribution for
these data. Symbolically we denote this as follows:

Xi ∼ N (µ, σ2 ), i = 1, 2, . . . , 20.
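A minimal simulation sketch of these two working assumptions; the parameter values (λ = 3 for the accident counts, µ = 54 cm and σ = 2 cm for the lengths) are assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Example 2.9: 10 weekly accident counts, assumed Poisson(lambda) with lambda = 3.
counts = rng.poisson(lam=3.0, size=10)

# Example 2.10: 20 baby-boy lengths (cm), assumed N(mu, sigma^2) with mu = 54, sigma = 2.
lengths = rng.normal(loc=54.0, scale=2.0, size=20)

print(counts)
print(lengths.round(1))
```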

Chapter 2 : General inference problem Page 89 / 748


2.2 Statistical Models Slide 2.10 / 2.106

The models we use, as seen in the examples above, are usually about
the shape of the density or of the cumulative distribution function of
the population from which we have sampled.

These models should represent, as much as possible, the available


prior theoretical knowledge about the data generating mechanism.

It should be noted that in most cases, we do not exactly know which


population distribution to assume for our model.

Chapter 2 : General inference problem Page 90 / 748


2.2 Statistical Models Slide 2.11 / 2.106

Suggesting the set of models to be validated is a difficult matter and there is always a risk involved in this choice.

The reason is that if the contemplated set of models is “too large”,


many of them will be similar and it will be difficult to single out the
model that is best supported by the data.

On the other hand, if the contemplated set of models is “too small”,


there exists the risk that none of them gives an adequate description
of the data.

Choosing the most appropriate model usually involves a close collaboration between the statistician and the people who formulated the inference problem.

Chapter 2 : General inference problem Page 91 / 748


2.2 Statistical Models Slide 2.12 / 2.106

In general, we can view the statistical model as the triplet (X, P, Θ) where:

X is the sample space (i.e. the set of all possible realizations of X = (X1, X2, . . . , Xn));
P is a family of model functions Pθ(X) that depend on the unknown parameter θ;
Θ is the set of possible θ-values, i.e. the parameter space indexing the models.

Chapter 2 : General inference problem Page 92 / 748


2.3 Inference problem Slide 2.13 / 2.106

2.3 Inference problem


The statistical inference problem can be formulated:
Once the random vector X has been observed, what can be said about
which members of P best describe how it was generated?

The reason we are speaking about a problem here is that we do not


know the exact shape of the distribution that generated the data.

The reason that there exists a possibility of making inference rests


in the fact that typically a given observation is much more probable
under some distributions than under others (i.e. the observations give
information about the distribution).

This information should be combined with the a priori information about the distribution to do the inference. There is always some a priori information; it could be more or less specific.
Chapter 2 : General inference problem Page 93 / 748
2.3 Inference problem Slide 2.14 / 2.106

Parametric Inference

When it is specific to such an extent that the shape of the distribution is known up to a finite number of parameters, i.e. the parameter θ is finite-dimensional, we conduct parametric inference.

Most of the classical statistical inference techniques are based on fairly


specific assumptions regarding the population distribution and most
typically the description of the population is in a parametric form.

In introductory textbooks, the student just practices applying standard parametric techniques. However, to be successful in practical statistical analysis, one has to be able to deal with situations where standard parametric assumptions are not justified.

Chapter 2 : General inference problem Page 94 / 748


2.3 Inference problem Slide 2.15 / 2.106

Non-parametric Inference

A whole set of methods and techniques is available that may be


classified as nonparametric procedures. We will be dealing with them
in the later parts of the course.

These procedures allow us to make inferences without or with a very


limited amount of assumptions regarding the functional form of the
underlying population distribution.

If Θ can only be specified as an infinite-dimensional function space, we speak about non-parametric inference.

Chapter 2 : General inference problem Page 95 / 748


2.3 Inference problem Slide 2.16 / 2.106

Nonparametric inference procedures are applicable in more general situations (which is good). However, if they are applied to a situation where a particular parametric distributional shape indeed holds, the nonparametric procedures may not be as efficient as a procedure specifically tailored to the corresponding parametric case (which would be bad if the specific parametric model indeed holds).

Chapter 2 : General inference problem Page 96 / 748


2.3 Inference problem Slide 2.17 / 2.106

Robustness approach
The situation, in practice, might be even more blurred. We may know
that the population is “close” to parametrically describable and yet
“deviates a bit” from the parametric family.

Going over in such cases directly to a purely nonparametric approach


would not properly address the situation since the idea about a
relatively small deviation from the baseline parametric family will
be lost. Hence we can use the robustness approach where we still
keep the idea about the “ideal” parametric model but allow for small
deviations from it.

The aim in such “intermediate” situations is to be “close to efficient” if the parametric model holds but at the same time to be “less sensitive” to small deviations from the ideal model. These important issues will be discussed later in the course.

Chapter 2 : General inference problem Page 97 / 748
2.3 Inference problem Slide 2.18 / 2.106

Illustration of robustness

Chapter 2 : General inference problem Page 98 / 748


2.3 Inference problem Slide 2.19 / 2.106

Consequences of applying robustness approach

Chapter 2 : General inference problem Page 99 / 748


2.3 Inference problem Slide 2.20 / 2.106

Bayesian Inference
Another way to classify the Statistical Inference procedures is by the
way we treat the unknown parameter θ.

If we treat it as unknown but deterministic, then we are in a non-Bayesian setting. If we consider the set of θ-values as quantities that, before collecting the data, have different probabilities of occurring according to some (a priori) distribution, then we are speaking about Bayesian inference.

The Bayesian approach allows us to introduce and utilise any additional (prior) information (when such information is available). This information is entered through the prior distribution over the set Θ of parameter values and reflects our prior belief about how likely any of the parameter values is before obtaining the information from the data.
Chapter 2 : General inference problem Page 100 / 748
2.4 Goals in Statistical Inference Slide 2.21 / 2.106

2.4 Goals in Statistical Inference

Following are the most common goals in inference:

Estimation
Confidence set construction
Hypothesis testing

Chapter 2 : General inference problem Page 101 / 748


2.4 Goals in Statistical Inference Slide 2.22 / 2.106

2.4.1 Estimation

We want to calculate a number (or a k-dimensional vector, or a


single function) as an approximation to the numerical characteristic
in question.

But let us point out immediately that there is little value in calculating
an approximation to an unknown quantity without having an idea
of how “good” the approximation is and how it compares with other
approximations. Hence, immediately questions about confidence
interval (or, more generally, confidence set) construction arise.

To quote A.N. Whitehead, in Statistics we always have to “seek simplicity and distrust it”.

Chapter 2 : General inference problem Page 102 / 748


2.4 Goals in Statistical Inference Slide 2.23 / 2.106

2.4.2 Confidence set construction

After the observations are collected, further information about the set
Θ is added and it becomes plausible that the true distribution belongs
to a smaller family than it was originally postulated, i.e., it becomes
clear that the unknown θ-value belongs to a subset of Θ.

The problem of confidence set construction arises: i.e., determining a


(possibly small) plausible set of θ-values and clarifying the sense in
which the set is plausible.

Chapter 2 : General inference problem Page 103 / 748


2.4 Goals in Statistical Inference Slide 2.24 / 2.106

2.4.3 Hypothesis testing

An experimenter or a statistician sometimes has a theory which when


suitably translated into mathematical language becomes a statement
that the true unknown distribution belongs to a smaller family than
the originally postulated one.

One would like to formulate this theory in the form of a hypothesis. The data can then be used to infer whether the theory complies with the observations or is in such serious disarray as to indicate that the hypothesis is false.

Chapter 2 : General inference problem Page 104 / 748


2.4 Goals in Statistical Inference Slide 2.25 / 2.106

Deeper insight into all of the above goals of inference, and a deeper understanding of the nature of the problems involved in them, is given by Statistical Decision Theory.

Here we define in general terms what a statistical decision rule is and


it turns out that any of the procedures discussed above can be viewed
as a suitably defined decision rule.

Moreover, defining optimal decision rules as solutions to suitably


formulated constrained mathematical optimization problems will
help us to find “best” decision rules in many practically relevant
situations.

Chapter 2 : General inference problem Page 105 / 748


2.5 Statistical decision theoretic approach to inference Slide 2.26 / 2.106

2.5 Statistical decision theoretic approach to inference

Statistical Decision Theory studies all inference problems (estimation,


confidence set construction, hypothesis testing) from a unified point
of view.

All parts of the decision making process are formally defined, a desired
optimality criterion is formulated and a decision is considered optimal
if it optimizes the criterion.

Chapter 2 : General inference problem Page 106 / 748


2.5 Statistical decision theoretic approach to inference Slide 2.27 / 2.106

2.5.1 Introduction

Statistical Decision Theory may be considered as the theory of a two-person game, with one player being the statistician and the other being nature. To specify the game, we define:

Θ - the set of states (of nature);
A - the set of actions (available to the statistician);
L(θ, a) - a real-valued function (the loss) on Θ × A.

Chapter 2 : General inference problem Page 107 / 748


2.5 Statistical decision theoretic approach to inference Slide 2.28 / 2.106

There are some important differences between mathematical theory of


games (that only involves the above triplet) and Statistical Decision
Theory. The most important differences are:

In a two-person game both players are trying to maximize their


winnings (or to minimize their losses), whereas in decision theory
nature chooses a state without this view in mind. Nature can
not be considered an ”intelligent opponent” who would behave
”rationally”.
There is no complete information available (to the statistician)
about nature’s choice.

Chapter 2 : General inference problem Page 108 / 748


2.5 Statistical decision theoretic approach to inference Slide 2.29 / 2.106

In Statistical Decision Theory nature always has the first move


in choosing the ”true state” θ.
The statistician has the chance (and this is most important)
to gather partial information on nature’s choice by sampling
or performing an experiment. This gives the statistician data
X = ( X1 , X2 , . . . , Xn ) that has a distribution L(X|θ) depending
on θ. This is used by the statistician to work out their decision.

Chapter 2 : General inference problem Page 109 / 748


2.5 Statistical decision theoretic approach to inference Slide 2.30 / 2.106

Definition 2.1
A (deterministic) decision function is a function d : X → A from the
sample space to the set of actions.

There is a non-negative loss (a random variable) L(θ, d (X)) incurred


by this action.

We define the risk

R(θ, d) = Eθ L(θ, d(X)).

For a fixed decision rule d, this is a function of θ (the risk function). R(θ, d) is the average loss of the statistician when nature is in the true state θ and the statistician uses the decision rule d.
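As an illustration of the risk function, the following hedged Python sketch approximates R(θ, d) by Monte Carlo for an assumed model X1, . . . , Xn i.i.d. N(θ, 1), the rule d(X) = X̄ and quadratic loss (all of these choices are illustrative, not part of the slides).

```python
import numpy as np

rng = np.random.default_rng(1)

def risk(theta, n=10, reps=20_000):
    """Monte Carlo approximation of R(theta, d) = E_theta L(theta, d(X))
    for quadratic loss and the rule d(X) = sample mean, under X_i ~ N(theta, 1)."""
    X = rng.normal(loc=theta, scale=1.0, size=(reps, n))
    d = X.mean(axis=1)                 # the decision rule applied to each simulated sample
    return np.mean((d - theta) ** 2)   # average loss approximates the risk

for theta in (-1.0, 0.0, 2.0):
    print(theta, risk(theta))          # each value is close to 1/n = 0.1
```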

Chapter 2 : General inference problem Page 110 / 748


2.5 Statistical decision theoretic approach to inference Slide 2.31 / 2.106

2.5.2 Examples
Example 2.11 (Hypothesis testing)
Assume that a data vector X ∼ f (X, θ). Consider testing H0 : θ ≤ θ0
versus H1 : θ > θ0 where θ ∈ R1 is a parameter of interest.

Let A = {a1, a2}, Θ = R^1. Here a1 denotes the action “accept H0” whereas a2 denotes the action “reject H0”. Let

D = {set of all functions from X into A}.


Define

L(θ, a1) = 1 if θ > θ0,  0 if θ ≤ θ0;

L(θ, a2) = 0 if θ > θ0,  1 if θ ≤ θ0.
Chapter 2 : General inference problem Page 111 / 748
2.5 Statistical decision theoretic approach to inference Slide 2.32 / 2.106

Then we have

R(θ, d) = E L(θ, d(X))
        = L(θ, a1) Pθ(d(X) = a1) + L(θ, a2) Pθ(d(X) = a2)
        = Pθ(d(X) = a1) if θ > θ0,
          Pθ(d(X) = a2) if θ ≤ θ0.

Hence

if θ ≤ θ0 : R(θ, d) = Pθ(reject H0) = probability of a type I error,
if θ > θ0 : R(θ, d) = Pθ(accept H0) = probability of a type II error.
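A small numerical sketch of this risk function under assumed ingredients: X̄ computed from n i.i.d. N(θ, 1) observations, θ0 = 0, and a rule d that rejects H0 when X̄ exceeds a hypothetical cut-off c.

```python
import numpy as np
from scipy import stats

n, theta0, c = 10, 0.0, 0.5   # assumed sample size, boundary theta0 and cut-off c

def risk(theta):
    """R(theta, d) under the 0-1 loss above, for the rule that rejects H0 iff xbar > c,
    where xbar is the mean of n i.i.d. N(theta, 1) observations."""
    p_reject = 1.0 - stats.norm.cdf(c, loc=theta, scale=1.0 / np.sqrt(n))
    return p_reject if theta <= theta0 else 1.0 - p_reject

print(risk(0.0))   # risk at theta = theta0: the type I error probability
print(risk(1.0))   # risk at theta = 1 > theta0: the type II error probability
```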

Chapter 2 : General inference problem Page 112 / 748


2.5 Statistical decision theoretic approach to inference Slide 2.33 / 2.106

Example 2.12 (Estimation)


Let now A = Θ with the interpretation that each action corresponds
to selecting a point θ ∈ Θ. Every d (X) maps X into Θ and if we chose

L(θ, d (X)) = (θ − d (X))2 (quadratic loss)

then the decision rule d (which we can call an estimator) has the risk function

R(θ, d) = Eθ (d(X) − θ)² = MSEθ(d(X)).
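To preview why risk functions are compared pointwise (see the discussion of optimal decision rules below), here is a hedged sketch contrasting two estimators of a normal mean under quadratic loss; the model and the constant estimator d2 ≡ 0 are purely illustrative assumptions.

```python
import numpy as np

n = 10
thetas = np.linspace(-2.0, 2.0, 9)

# Assumed model: X_1, ..., X_n i.i.d. N(theta, 1).
# d1(X) = sample mean: R(theta, d1) = Var(Xbar) = 1/n for every theta.
risk_d1 = np.full_like(thetas, 1.0 / n)

# d2(X) = 0 (ignores the data): R(theta, d2) = theta^2.
risk_d2 = thetas ** 2

for t, r1, r2 in zip(thetas, risk_d1, risk_d2):
    print(f"theta={t:+.1f}  R(d1)={r1:.3f}  R(d2)={r2:.3f}")
# Neither risk function is below the other for all theta,
# illustrating that a uniformly best rule need not exist.
```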

Chapter 2 : General inference problem Page 113 / 748


2.5 Statistical decision theoretic approach to inference Slide 2.34 / 2.106

2.5.3 Randomized decision rule

We will see later when studying optimality in hypothesis testing context


that the set of deterministic decision rules D is not convex and it is
difficult to develop a decent mathematical optimization theory over it.

This set is also very small and examples show that very often a
simple randomization of given deterministic rules gives better rules
in the sense of risk minimization. This explains the reason for the
introduction of the randomized decision rules.

Chapter 2 : General inference problem Page 114 / 748


2.5 Statistical decision theoretic approach to inference Slide 2.35 / 2.106

Definition 2.2
A rule δ which chooses di with probability wi, Σ wi = 1, is a randomized decision rule.

For the randomized decision rule δ we have:

L(θ, δ(X)) = Σi wi L(θ, di(X))   and   R(θ, δ) = Σi wi R(θ, di).

The set of all randomized decision rules generated by the set D in the above way will be denoted by D.

Chapter 2 : General inference problem Page 115 / 748


2.5 Statistical decision theoretic approach to inference Slide 2.36 / 2.106

2.5.4 Optimal decision rules

Given a game (Θ, A, L) and a random vector X whose distribution


depends on θ ∈ Θ what (randomized) decision rule δ should the
statistician choose to perform “optimally”?

This is a question that is easy to pose but usually difficult to answer. The reason is that uniformly best decision rules (rules that minimize the risk uniformly for all θ-values) usually do not exist! This leads us to the following two ways out:

Chapter 2 : General inference problem Page 116 / 748


2.5 Statistical decision theoretic approach to inference Slide 2.37 / 2.106

First way out:

Constrain the set of decision rules and try to find a uniformly best rule in this smaller set. This corresponds to looking for optimality under restrictions: we eliminate some of the decision rules since they do not satisfy the restrictions, hoping that in the smaller set of remaining rules we will be able to find a uniformly best one.

Sensible constraints that we introduce in the estimation context are


usually unbiasedness or invariance.

Definition 2.3
A decision rule d is unbiased if

Eθ [ L(θ′ , d (X))] ≥ Eθ [ L(θ, d (X))] for all θ, θ′ ∈ Θ


holds.
Chapter 2 : General inference problem Page 117 / 748
2.5 Statistical decision theoretic approach to inference Slide 2.38 / 2.106

Exercise 2.9 (at lecture)


Show that in the context of estimation of a parameter θ with quadratic
loss function, the above definition is tantamount to the requirement

Eθ d (X) = θ for all θ ∈ Θ,

that is, the new definition is equivalent to the unbiasedness from


classical statistical estimation theory.

Chapter 2 : General inference problem Page 118 / 748


2.5 Statistical decision theoretic approach to inference Slide 2.39 / 2.106

It is obvious that the new definition of unbiasedness is more general and can be applied to a broader class of loss functions.

The same definition also makes sense in hypothesis testing where


we can also introduce unbiased tests in the same way (see later the
separate lecture about optimality in hypothesis testing) and then look
for optimality amongst all unbiased α level tests.

Chapter 2 : General inference problem Page 119 / 748


2.5 Statistical decision theoretic approach to inference Slide 2.40 / 2.106

Second way out:

Reformulate the optimality criterion in a new way. Since “uniformly best, no matter what the θ-value” is too strong a requirement, we can introduce:

the Bayes risk, or
the minimax risk

of a decision rule and try to find the rules that minimize these risks.

This leads to Bayesian and to minimax decision rules.

Chapter 2 : General inference problem Page 120 / 748


2.5 Statistical decision theoretic approach to inference Slide 2.41 / 2.106

2.5.5 Bayesian and minimax decision rules

Bayesian rule:
Think of the θ-parameter as a random variable with a given (known) prior density τ on Θ.

Define the Bayesian risk of the decision rule δ with respect to the prior τ:

r(τ, δ) = E[R(T, δ)] = ∫_Θ R(θ, δ) τ(θ) dθ,

where T is a random variable over Θ with density τ.

Chapter 2 : General inference problem Page 121 / 748


2.5 Statistical decision theoretic approach to inference Slide 2.42 / 2.106

Then the Bayesian rule δτ with respect to the prior τ is defined by:

r(τ, δτ) = inf_{δ∈D} r(τ, δ).

Sometimes a Bayesian rule may not exist and so we ask for an ϵ-Bayes rule. For ϵ > 0, this is any rule δτ^ϵ that satisfies

r(τ, δτ^ϵ) ≤ inf_{δ∈D} r(τ, δ) + ϵ.

Chapter 2 : General inference problem Page 122 / 748


2.5 Statistical decision theoretic approach to inference Slide 2.43 / 2.106

Minimax rule.
Instead of considering uniformly best rules we consider rules that minimize the supremum of the risk over the set Θ. This means safeguarding against the worst possible performance.

The value sup_{θ∈Θ} R(θ, δ) is called the minimax risk of the decision rule δ. Then the rule δ* is called minimax in the set D if

sup_{θ∈Θ} R(θ, δ*) = inf_{δ∈D} sup_{θ∈Θ} R(θ, δ) = the minimax value of the game.

Chapter 2 : General inference problem Page 123 / 748


2.5 Statistical decision theoretic approach to inference Slide 2.44 / 2.106

Again, as in the Bayesian case, even if the minimax value is finite there may not be a minimax decision rule. Hence we introduce the notion of an ϵ-minimax rule δ^ϵ such that

sup_{θ∈Θ} R(θ, δ^ϵ) ≤ inf_{δ∈D} sup_{θ∈Θ} R(θ, δ) + ϵ.

Note that sometimes choosing a minimax rule may turn out to be too pessimistic a strategy, but experience shows that in most cases minimax rules are good.

Chapter 2 : General inference problem Page 124 / 748


2.5 Statistical decision theoretic approach to inference Slide 2.45 / 2.106

2.5.6 Least favorable prior distribution

Now we define the least favorable distribution (i.e. the least favorable prior τ* over the set Θ) by:

inf_{δ∈D} r(τ*, δ) = sup_τ inf_{δ∈D} r(τ, δ).

It indeed deserves its name. If the statistician were told which prior distribution nature was using, they would least like to be told that τ* was nature's prior (since, even though they then perform optimally by choosing the corresponding Bayesian rule, they still face the highest possible value of the Bayesian risk as compared to the other priors).

Chapter 2 : General inference problem Page 125 / 748


2.5 Statistical decision theoretic approach to inference Slide 2.46 / 2.106

2.5.7 Geometric interpretation for finite Θ

Definition 2.4
A set A ⊂ R^k is convex if for all vectors x = (x1, x2, . . . , xk)′ and y = (y1, y2, . . . , yk)′ in A and all α ∈ [0, 1],

α x + (1 − α) y ∈ A.

Now let us assume that Θ has k elements only. Define the risk set of a set D of decision rules as the set of all risk points {R(θ, d), θ ∈ Θ}, d ∈ D. For a fixed d, each such risk point belongs to R^k and by “moving” d within D, we get a set of such k-dimensional vectors.

Chapter 2 : General inference problem Page 126 / 748


2.5 Statistical decision theoretic approach to inference Slide 2.47 / 2.106

Theorem 2.6
The risk set of a set D of randomized decision rules generated by a
given set D of non-randomized decision rules is convex.

Proof.
It is easy to see that if y and y′ are the risk points of δ and δ′ ∈ D, correspondingly, then any point of the form

z = α y + (1 − α) y′

corresponds to (is the risk point of) the randomized decision rule δα ∈ D that chooses δ with probability α and the rule δ′ with probability (1 − α). Hence any such z belongs to the risk set of D. □

Chapter 2 : General inference problem Page 127 / 748


2.5 Statistical decision theoretic approach to inference Slide 2.48 / 2.106

Remark 2.1
The risk set of the set of all randomized rules D generated by the
set D is the smallest convex set containing the risk points of all of
the non-randomized rules in D (i.e. the convex hull of the set of risk
points of D).

Chapter 2 : General inference problem Page 128 / 748


2.5 Statistical decision theoretic approach to inference Slide 2.49 / 2.106

How to illustrate Bayes rules:

Since Θ = (θ1, θ2, . . . , θk), the prior is τ = (p1, p2, . . . , pk) in the case we are dealing with (pi ≥ 0, Σ_{i=1}^k pi = 1). The Bayes risk of any rule δ w.r. to the prior τ is r(τ, δ) = Σ_{i=1}^k pi R(θi, δ).

All points y in the risk set, corresponding to certain rules δ* for which

Σ_{i=1}^k pi yi = r(τ, δ*) = the same value = b,

give rise to the same value b of the Bayesian risk and hence are equivalent from a Bayesian point of view. The value of their risk can be easily illustrated and (at least in the case of k = 2) one can easily illustrate the point in the convex risk set that corresponds to (is the risk point of) the Bayesian rule with respect to the prior τ. (See illustration at lecture.)
Chapter 2 : General inference problem Page 129 / 748
2.5 Statistical decision theoretic approach to inference Slide 2.50 / 2.106

How to illustrate Minimax rules:

In a similar way, the minimax rule can be illustrated in the case of


finite Θ. Three cases will be illustrated at the lecture.

i) The minimax decision rule corresponding to the point at the


lower intersection of S with the line R1 = R2 ;
ii) When S lies entirely to the left of the line R1 = R2 so that
R1 < R2 for every point in S , and therefore the minimax rule is
simply that which minimises R2 ;
iii) When S lies entirely to the right of the line R1 = R2 so that
R1 > R2 for every point in S , and therefore the minimax rule is
simply that which minimises R1 ,

Chapter 2 : General inference problem Page 130 / 748


2.5 Statistical decision theoretic approach to inference Slide 2.51 / 2.106

2.5.8 Example
Let the set Θ = {θ1 , θ2 }. Let X have possible values 0, 1 and 2; the
set A = {a1 , a2 } and let

L(θ1 , a1 ) = L(θ2 , a2 ) = 0, L(θ1 , a2 ) = 1, L(θ2 , a1 ) = 3.

The distributions of X are tabulated as follows:

x          0     1     2
P(x|θ1)   .81   .18   .01
P(x|θ2)   .25   .50   .25

Interpretation: an attempt by the statistician to guess the state of


nature. If his guess is correct, he does not lose anything; if he is
wrong, he loses $1 or $3 depending on the type of error he has made.
In his guess he is supported by one observation X that has a different
distribution under θ1 and under θ2 .
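A small Python sketch (not part of the slides) that tabulates the risk points (R(θ1, d), R(θ2, d)) for the eight non-randomized rules listed in Exercise 2.10 below; it may be used to check the sketch of the risk set.

```python
from itertools import product

# Losses L(theta, a) and sampling distributions P(x | theta) from the example above.
loss = {("theta1", "a1"): 0.0, ("theta1", "a2"): 1.0,
        ("theta2", "a1"): 3.0, ("theta2", "a2"): 0.0}
p = {"theta1": [0.81, 0.18, 0.01], "theta2": [0.25, 0.50, 0.25]}  # P(x = 0, 1, 2 | theta)

# A non-randomized rule assigns an action to each x in {0, 1, 2};
# enumerating them in this order reproduces d1, ..., d8 of the exercise.
rules = list(product(["a1", "a2"], repeat=3))

def risk(theta, rule):
    # R(theta, d) = sum over x of L(theta, d(x)) * P(x | theta)
    return sum(loss[(theta, rule[x])] * p[theta][x] for x in range(3))

for i, rule in enumerate(rules, start=1):
    print(f"d{i}: R(theta1) = {risk('theta1', rule):.2f},  R(theta2) = {risk('theta2', rule):.2f}")
```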
Chapter 2 : General inference problem Page 131 / 748
2.5 Statistical decision theoretic approach to inference Slide 2.52 / 2.106

Exercise 2.10
Now, consider all possible non-randomized decision rules based on
one observation:
 
x    d1(x)  d2(x)  d3(x)  d4(x)  d5(x)  d6(x)  d7(x)  d8(x)
0    a1     a1     a1     a1     a2     a2     a2     a2
1    a1     a1     a2     a2     a1     a1     a2     a2
2    a1     a2     a1     a2     a1     a2     a1     a2
a) Sketch the risk set of all randomized rules generated by
d1 , d2 , .., d8 .
b) Find the minimax rule δ∗ (in D) and compute its risk.
c) For what prior is δ∗ a Bayes rule w.r. to that prior (i.e., what is
the least favorable distribution)?
d) Find the Bayes rule for the prior {1/3, 2/3} over {θ1 , θ2 }. Compute
the value of its Bayes risk.
Chapter 2 : General inference problem Page 132 / 748
2.5 Statistical decision theoretic approach to inference Slide 2.53 / 2.106

2.5.9 Fundamental Lemma

Lemma
If τ∗ is a prior on Θ and the Bayes rule δτ∗ has a constant risk w.r. to
θ (i.e. if R(θ, δτ∗ ) = c0 for all θ ∈ Θ) then:
a) δτ∗ is minimax;
b) τ∗ is the least favorable distribution.

Chapter 2 : General inference problem Page 133 / 748


2.5 Statistical decision theoretic approach to inference Slide 2.54 / 2.106

Proof.

a) We compute the minimax risk of δτ* and compare it to the minimax risk of any other rule δ:

c0 = sup_{θ∈Θ} R(θ, δτ*)
   = ∫_Θ R(θ, δτ*) τ*(θ) dθ      (since the risk is constant for all θ)
   ≤ ∫_Θ R(θ, δ) τ*(θ) dθ        (since δτ* is Bayes w.r. to τ*)
   ≤ sup_{θ∈Θ} R(θ, δ),

which means that δτ* is minimax.

Chapter 2 : General inference problem Page 134 / 748


2.5 Statistical decision theoretic approach to inference Slide 2.55 / 2.106

b) Now take any other prior τ:

inf_δ r(τ, δ) = inf_δ ∫_Θ R(θ, δ) τ(θ) dθ
             ≤ ∫_Θ R(θ, δτ*) τ(θ) dθ
             = ∫_Θ R(θ, δτ*) τ*(θ) dθ
             = r(τ*, δτ*),

hence τ* is least favorable.

Note here we have used the fact that R(θ, δτ*) is constant and

∫_Θ τ*(θ) dθ = ∫_Θ τ(θ) dθ = 1.

Chapter 2 : General inference problem Page 135 / 748
2.5 Statistical decision theoretic approach to inference Slide 2.56 / 2.106

Remark 2.2
The lemma provides a hint on how to find minimax estimators: the minimax estimators are (special) Bayes estimators w.r. to the least favorable prior.

First we can obtain the general form of the Bayes estimator with
respect to ANY given prior. Then we choose a prior for which the
corresponding Bayes rule has its (usual) risk independent of θ, i.e.
constant with respect to θ.

Chapter 2 : General inference problem Page 136 / 748


2.5 Statistical decision theoretic approach to inference Slide 2.57 / 2.106

2.5.10 Finding Bayes rules analytically

This is important in its own right but also as a device to be utilized


in the search for minimax rules.

Given the prior and the observations X = ( X1 , X2 , . . . , Xn ) we can


find the Bayes rule point-wise (i.e. for any given X=x) by solving a
certain minimization problem.

Chapter 2 : General inference problem Page 137 / 748


2.5 Statistical decision theoretic approach to inference Slide 2.58 / 2.106

Notation:

f(X|θ) is the conditional density of X = (X1, X2, . . . , Xn) given θ;
τ(θ) is the prior density on θ;
g(X) is the marginal density of X, i.e. g(X) = ∫_Θ f(X|θ) τ(θ) dθ;
f(X, θ) is the joint density of X and θ:

f(X, θ) = f(X|θ) τ(θ) = h(θ|X) g(X);

h(θ|X) is the posterior density of θ given X = (X1, X2, . . . , Xn):

h(θ|X) = f(X, θ) / g(X) = f(X|θ) τ(θ) / ∫_Θ f(X|θ) τ(θ) dθ.

Chapter 2 : General inference problem Page 138 / 748


2.5 Statistical decision theoretic approach to inference Slide 2.59 / 2.106

Now we formulate a General Theorem regarding calculation of


Bayesian decision rules.

Theorem 2.7
For X ∈ X, a ∈ A and for a given prior τ we define:

Q(X, a) = ∫_Θ L(θ, a) h(θ|X) dθ,

where L(·, ·) is a particular loss function.

Suppose that for each X ∈ X, there exists an action aX ∈ A such that

Q(X, aX) = inf_{a∈A} Q(X, a).

If δτ(X) = aX belongs to D then δτ(X) = aX is the (point-wise defined) Bayes decision rule with respect to the prior τ.
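A numerical illustration of the theorem (a sketch only): for one fixed observed X, we approximate Q(X, a) on a grid of actions and pick the minimizer. The Beta(3, 9) posterior and the quadratic loss are assumptions made for the example; for this loss the minimizer should be (approximately) the posterior mean, in line with Theorem 2.8 below.

```python
import numpy as np
from scipy import stats

# Assumed posterior h(theta | X) for one fixed observed X: Beta(3, 9) (illustrative only).
posterior = stats.beta(3, 9)
theta = np.linspace(0.001, 0.999, 2000)
h = posterior.pdf(theta)
dtheta = theta[1] - theta[0]

def Q(a):
    # Q(X, a) = integral over Theta of L(theta, a) h(theta | X) dtheta, quadratic loss.
    return np.sum((theta - a) ** 2 * h) * dtheta

actions = np.linspace(0.0, 1.0, 1001)
a_best = actions[np.argmin([Q(a) for a in actions])]
print(a_best, posterior.mean())   # the minimizing action is close to the posterior mean
```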
Chapter 2 : General inference problem Page 139 / 748
2.5 Statistical decision theoretic approach to inference Slide 2.60 / 2.106

Proof.

For any decision rule δ we have its Bayesian risk:

r(τ, δ) = ∫_Θ R(θ, δ) τ(θ) dθ
        = ∫_Θ [ ∫_X L(θ, δ(X)) f(X|θ) dX ] τ(θ) dθ
        = ∫_X [ ∫_Θ L(θ, δ(X)) h(θ|X) dθ ] g(X) dX
        = ∫_X Q(X, δ(X)) g(X) dX,

where we use the short-hand notation dX := dX1 dX2 . . . dXn.

Chapter 2 : General inference problem Page 140 / 748


2.5 Statistical decision theoretic approach to inference Slide 2.61 / 2.106

But for every fixed X-value, Q(X, δ(X)) is smallest when δ(X) = aX. Making in this way our “best choice” for each X-value, we will, of course, minimize the value of r(τ, δ). Hence, we should be looking for an action aX that gives the infimum of

inf_{a∈A} ∫_Θ L(θ, a) h(θ|X) dθ. □

We will now apply the general theorem to estimation and to hypothesis testing.

Chapter 2 : General inference problem Page 141 / 748


2.5 Statistical decision theoretic approach to inference Slide 2.62 / 2.106

Theorem 2.8 (case of estimation)


Consider a point estimation problem for a real-valued parameter θ.
The prior over θ is denoted by τ. Then:
a) For a squared error loss L(θ, a) = (θ − a)²:

δτ(X) = E(θ|X) = ∫_Θ θ h(θ|X) dθ.
The Bayesian estimator with respect to quadratic loss is just the
conditional expected value of the parameter given the observed
data.
b) For an absolute error loss L(θ, a) = |θ − a| :

δτ (X) = median of h(θ|X).


The Bayesian estimator with respect to absolute value loss is just
the median of the conditional distribution of the parameter given
the observed data.
Chapter 2 : General inference problem Page 142 / 748
2.5 Statistical decision theoretic approach to inference Slide 2.63 / 2.106

Example 2.13
Show that for a given random variable Y with a finite second moment,
the function q1 (a) = E (Y − a)2 is minimised for a∗ = E (Y ).

Solution:
Setting the derivative with respect to a to zero we get

∂/∂a E(Y − a)² = ∂/∂a [ E(Y²) − 2E(Y)a + a² ] = −2E(Y) + 2a = 0,

from which we deduce that the stationary point is a* = E(Y), and obviously this stationary point gives rise to a minimum since

∂²/∂a² E(Y − a)² = 2 > 0.
Chapter 2 : General inference problem Page 143 / 748
2.5 Statistical decision theoretic approach to inference Slide 2.64 / 2.106

Example 2.14
Show that for a given random variable Y with E|Y| < ∞, the function
q2 (b) = E|Y − b| is minimised for b∗ = median(Y ).

Chapter 2 : General inference problem Page 144 / 748


2.5 Statistical decision theoretic approach to inference Slide 2.65 / 2.106

Solution:
(Continuous case for simplicity.) Denote the density of Y by f(y) and the cdf by F(y). Having in mind the definition of the absolute value, we have:

∂/∂b E(|Y − b|) = ∂/∂b [ ∫_{−∞}^{b} (b − y) f(y) dy + ∫_{b}^{∞} (y − b) f(y) dy ]
               = ∂/∂b [ bF(b) − ∫_{−∞}^{b} y f(y) dy + ∫_{b}^{∞} y f(y) dy − b(1 − F(b)) ]
               = F(b) + b f(b) − b f(b) − b f(b) − (1 − F(b)) + b f(b)
               = 2F(b) − 1 = 0,

from which we deduce that the stationary point b* satisfies F(b*) = 0.5, i.e., b* is the median. And obviously the stationary point b* gives rise to a minimum.

Chapter 2 : General inference problem Page 145 / 748
2.5 Statistical decision theoretic approach to inference Slide 2.66 / 2.106

Remark 2.3
A case in which δτ ∈ D might not be satisfied is point estimation with Θ ≡ A a finite set. Then E(θ|X) might not belong to A, hence E(θ|X) would not be a function X → A and δτ would not be a legitimate estimator. But if Θ ≡ A is convex, it can be shown that always E(θ|X) ∈ A!

Chapter 2 : General inference problem Page 146 / 748


2.5 Statistical decision theoretic approach to inference Slide 2.67 / 2.106

Bayesian Hypothesis testing with a generalized 0−1 loss

A prior τ is given on Θ. Assume that the parameter space Θ is subdivided into two complementary subsets Θ = Θ0 ∪ Θ1 and we are testing

H0 : θ ∈ Θ0 versus H1 : θ ∈ Θ1.

Two actions a0 (accept H0) and a1 (reject H0) are possible and the losses when using these actions are given by:

L(θ, a0) = 0 if θ ∈ Θ0,  c2 if θ ∈ Θ1;      L(θ, a1) = c1 if θ ∈ Θ0,  0 if θ ∈ Θ1.

Chapter 2 : General inference problem Page 147 / 748


2.5 Statistical decision theoretic approach to inference Slide 2.68 / 2.106

These losses make sense in hypothesis testing: a correct guess of H0 or of H1 should not involve any loss (so the loss is set to zero), whereas an incorrect guess should involve some positive loss.

The loss when a type I error occurs is denoted c1 and the loss when a type II error occurs is denoted c2. Since the consequences of the two types of error may not be equally heavy, c1 ≠ c2 in general.

Chapter 2 : General inference problem Page 148 / 748


2.5 Statistical decision theoretic approach to inference Slide 2.69 / 2.106

Theorem 2.9 (case of hypothesis testing)


Assume that the parameter space Θ is subdivided into two comple-
mentary subsets Θ = Θ0 ∪ Θ1 and we are testing

H0 : θ ∈ Θ0 versus H1 : θ ∈ Θ 1 .
with a generalised 0-1 loss function.

Then the test

φ = Reject H0 if P(θ ∈ Θ0 | X) < c2/(c1 + c2),
    Accept H0 if P(θ ∈ Θ0 | X) > c2/(c1 + c2)

is a Bayesian rule (Bayesian test) for the above testing problem with respect to the prior τ.

Chapter 2 : General inference problem Page 149 / 748


2.5 Statistical decision theoretic approach to inference Slide 2.70 / 2.106

Proof.

According to the General Theorem about Bayesian inference, we have


to compare the two quantities Q(X, a0 ) and Q(X, a1 ) below and take
as our action the one that gives the smaller value (in this way we are
minimising the Bayesian risk for the given prior, hence we are deriving
the optimal Bayesian decision in the context of hypothesis testing).

Chapter 2 : General inference problem Page 150 / 748


2.5 Statistical decision theoretic approach to inference Slide 2.71 / 2.106

Now:

Q(X, a0) = ∫_Θ L(θ, a0) h(θ|X) dθ = c2 ∫_{Θ1} h(θ|X) dθ = c2 P(θ ∈ Θ1 | X) = c2 (1 − P(θ ∈ Θ0 | X))

and

Q(X, a1) = ∫_Θ L(θ, a1) h(θ|X) dθ = c1 ∫_{Θ0} h(θ|X) dθ = c1 P(θ ∈ Θ0 | X).
Chapter 2 : General inference problem Page 151 / 748


2.5 Statistical decision theoretic approach to inference Slide 2.72 / 2.106

Hence we would reject H0 when Q(X, a1) < Q(X, a0), i.e. for

{X : c1 P(θ ∈ Θ0 | X) < c2 (1 − P(θ ∈ Θ0 | X))},

which is equivalent to

{X : P(θ ∈ Θ0 | X) < c2/(c1 + c2)}.

In a nutshell, this means that to perform Bayesian hypothesis testing of

H0 : θ ∈ Θ0 versus H1 : θ ∈ Θ1

we need to calculate the posterior conditional probability that θ ∈ Θ0 given the data:

P(θ ∈ Θ0 | X) = ∫_{Θ0} h(θ|X) dθ.

For this calculation we need, as in the case of estimation, the same posterior density h(θ|X) of the parameter given the data.
Chapter 2 : General inference problem Page 152 / 748
2.5 Statistical decision theoretic approach to inference Slide 2.73 / 2.106

Then we compare this posterior conditional probability with the threshold c2/(c1 + c2) ∈ (0, 1) and reject H0 when this posterior probability is not large enough, i.e., is below the threshold.

This makes perfect sense. We also note that the threshold c2/(c1 + c2) to compare with, when the two types of errors are equally weighted (i.e., when c1 = c2 is chosen), is just equal to 1/2.

The case c1 = c2 (so that the ratio c2/c1 = 1) is referred to as the usual zero-one loss in Bayesian hypothesis testing. The general case of different c1 and c2 is referred to as the generalised zero-one loss. □

Chapter 2 : General inference problem Page 153 / 748


2.5 Statistical decision theoretic approach to inference Slide 2.74 / 2.106

2.5.11 Examples
Example 2.15
Let, given θ, the distribution of each Xi, i = 1, 2, . . . , n be Bernoulli with parameter θ, i.e.

f(X|θ) = θ^{Σ Xi} (1 − θ)^{n − Σ Xi},

and assume a beta prior τ for the (random variable) θ over (0, 1):

τ(θ) = [1 / B(α, β)] θ^{α−1} (1 − θ)^{β−1} I_{(0,1)}(θ).

Show that the Bayesian estimator θ̂_B with respect to quadratic loss is:

θ̂_B = (Σ_{i=1}^n Xi + α) / (α + β + n).

Hence, calculate the minimax estimator for the probability of success in the Bernoulli experiment.
Chapter 2 : General inference problem Page 154 / 748
2.5 Statistical decision theoretic approach to inference Slide 2.75 / 2.106

Solution:
We recall first the definition and some properties of the Beta function that is used in the definition of the Beta density. In particular, there is the following relation between B(α, β) and the Gamma function Γ(a) = ∫_0^∞ e^{−x} x^{a−1} dx:

B(α, β) = ∫_0^1 x^{α−1} (1 − x)^{β−1} dx = Γ(α)Γ(β) / Γ(α + β).

Since the gamma function satisfies Γ(a) = (a − 1)Γ(a − 1), after substitution we get the following recurrence relation for the Beta function:

B(α, β) = [(α − 1)/(α + β − 1)] B(α − 1, β).

Chapter 2 : General inference problem Page 155 / 748


2.5 Statistical decision theoretic approach to inference Slide 2.76 / 2.106

For the beta prior we can then find easily:

h(θ|X) = f(X|θ) τ(θ) / ∫_0^1 f(X|θ) τ(θ) dθ = θ^{ΣXi+α−1} (1 − θ)^{n−ΣXi+β−1} / B(ΣXi + α, n − ΣXi + β),

which is again a beta density.

Chapter 2 : General inference problem Page 156 / 748


2.5 Statistical decision theoretic approach to inference Slide 2.77 / 2.106

Hence, the Bayesian estimator

θ̂_τ = ∫_0^1 θ h(θ|X) dθ
    = B(ΣXi + α + 1, n − ΣXi + β) / B(ΣXi + α, n − ΣXi + β)
    = Γ(ΣXi + α + 1) Γ(n + α + β) / [ Γ(n + 1 + α + β) Γ(ΣXi + α) ]     (by the above property of the Beta function)
    = (Σ_{i=1}^n Xi + α) / (α + β + n)
    = (X̄ + α/n) / (1 + (α + β)/n).

The above derivation holds for any beta prior Beta(α, β).
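A quick numerical check of this formula (the data and prior parameters below are made up for illustration):

```python
import numpy as np
from scipy import stats

x = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1])   # hypothetical Bernoulli data
alpha, beta = 2.0, 3.0                          # hypothetical prior parameters
n, s = len(x), x.sum()

closed_form = (s + alpha) / (alpha + beta + n)               # (sum X_i + alpha)/(alpha + beta + n)
posterior_mean = stats.beta(s + alpha, n - s + beta).mean()  # mean of Beta(s + alpha, n - s + beta)
print(closed_form, posterior_mean)                           # the two values agree
```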
Chapter 2 : General inference problem Page 157 / 748
2.5 Statistical decision theoretic approach to inference Slide 2.78 / 2.106

Compare the estimator obtained with the UMVUE X̄ and appreciate


the effect of the prior on the form of the estimator for small and for
large sample size n. (The UMVUE also coincides with the MLE here).

In particular, note that when the sample size n is small, the effect of
the prior (via the values of the parameters α and β) may be significant
and the Bayesian estimator θ̂τ may be very different from the UMVUE
thus expressing the influence of the prior information on our decision.

On the other hand, when the sample size n is very large, we see that

θ̂τ ≈ X̄
holds no matter what the prior. That is, when the sample size
increases, the prior’s effect on the estimator starts disappearing!

Chapter 2 : General inference problem Page 158 / 748


2.5 Statistical decision theoretic approach to inference Slide 2.79 / 2.106

Let us calculate the (usual) risk with respect to quadratic loss of any such Bayes estimator:

R(θ, θ̂τ) = E(θ̂τ − θ)²
         = Varθ(θ̂τ) + (θ − Eθ θ̂τ)²
         = nθ(1 − θ)/(n + α + β)² + ( (nθ + α)/(α + β + n) − θ )²
         = . . .
         = [ nθ − nθ² + (α + β)²θ² + α² − 2α(α + β)θ ] / (n + α + β)².

Chapter 2 : General inference problem Page 159 / 748


2.5 Statistical decision theoretic approach to inference Slide 2.80 / 2.106

For this risk not to depend on θ, it must hold that:

(α + β)² = n    and    2α(α + β) = n.

The solution to this system is α = β = √n/2. Hence the minimax estimator of θ is

θ̂_minimax = (ΣXi + √n/2) / (n + √n).
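A hedged numerical check that with α = β = √n/2 the quadratic risk of this Bayes estimator is indeed constant in θ (the choice n = 20 is arbitrary):

```python
import numpy as np

n = 20
alpha = beta = np.sqrt(n) / 2.0

def risk(theta):
    # R(theta, theta_hat) = Var_theta(theta_hat) + bias^2
    # for theta_hat = (S + alpha)/(n + alpha + beta), S = sum of the X_i.
    denom = n + alpha + beta
    var = n * theta * (1.0 - theta) / denom ** 2
    bias = (n * theta + alpha) / denom - theta
    return var + bias ** 2

print([round(risk(t), 6) for t in (0.1, 0.3, 0.5, 0.7, 0.9)])  # identical values
print(1.0 / (4.0 * (np.sqrt(n) + 1.0) ** 2))                    # the common constant risk
```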

Chapter 2 : General inference problem Page 160 / 748


2.5 Statistical decision theoretic approach to inference Slide 2.81 / 2.106

Exercise 2.11
Suppose a single observation x is available from the uniform distribution with density

f(x|θ) = (1/θ) I_{(x,∞)}(θ), θ > 0.

The prior on θ has density:

τ(θ) = θ exp(−θ), θ > 0.

i) Find the Bayes estimator of θ with respect to quadratic loss.
ii) Find the Bayes estimator of θ with respect to absolute value loss L(θ, a) = |θ − a|.

Chapter 2 : General inference problem Page 161 / 748


2.5 Statistical decision theoretic approach to inference Slide 2.82 / 2.106

Example 2.16
Suppose X1, X2, . . . , Xn have conditional joint density:

f_{X|Θ}(x1, x2, . . . , xn | θ) = θ^n e^{−θ Σ_{i=1}^n xi},   xi > 0 for i = 1, . . . , n; θ > 0,

and a prior density is given by:

τ(θ) = k e^{−kθ}

for θ > 0, where k > 0 is a known constant, i.e. the observations are exponentially distributed given θ, and the prior on θ is also exponential but with a different parameter.
i) Calculate the posterior density of Θ given X1 = x1, X2 = x2, . . . , Xn = xn.
ii) Find the Bayesian estimator of θ with respect to squared error loss.
Chapter 2 : General inference problem Page 162 / 748
2.5 Statistical decision theoretic approach to inference Slide 2.83 / 2.106

Solution:

i) We do not need to calculate the normalising constant here and can shortcut the solution. Looking at the joint density

f(x|θ) τ(θ) = k θ^n e^{−θ(Σ_{i=1}^n xi + k)}

we see that up to a normalising constant this is a

Gamma(n + 1, 1/(Σ_{i=1}^n xi + k))

density, hence the posterior h(θ|x) has to be

Gamma(n + 1, 1/(Σ_{i=1}^n xi + k)).

Chapter 2 : General inference problem Page 163 / 748


2.5 Statistical decision theoretic approach to inference Slide 2.84 / 2.106

ii) For a Bayes estimator with respect to quadratic loss, we have θ̂ = E(θ|X), and for a Gamma(α, β) density (with shape α and scale β) it is known that the expected value is equal to αβ, hence we get immediately

θ̂ = (n + 1) / (Σ_{i=1}^n xi + k).

Chapter 2 : General inference problem Page 164 / 748


2.5 Statistical decision theoretic approach to inference Slide 2.85 / 2.106

Of course, we could also calculate directly:

θ̂ = ∫_0^∞ θ h(θ|x) dθ = [ (Σ_{i=1}^n xi + k)^{n+1} / Γ(n + 1) ] ∫_0^∞ θ^{n+1} e^{−θ(Σ_{i=1}^n xi + k)} dθ,

and after changing variables θ(Σ_{i=1}^n xi + k) = y, dθ = dy/(Σ_{i=1}^n xi + k), we can continue the evaluation:

θ̂ = ∫_0^∞ e^{−y} y^{n+1} dy / [ Γ(n + 1)(Σ_{i=1}^n xi + k) ] = Γ(n + 2) / [ Γ(n + 1)(Σ_{i=1}^n xi + k) ] = (n + 1) / (Σ_{i=1}^n xi + k).

We arrive at the same answer but of course the shortcut solution is simpler. Note however that the shortcut solution does not always work. In general, it is not always possible to guess the posterior from the joint density f(x|θ)τ(θ).
Chapter 2 : General inference problem Page 165 / 748
2.5 Statistical decision theoretic approach to inference Slide 2.86 / 2.106

If the posterior distribution belongs to the same family as the prior,


the prior and posterior are then called conjugate distributions. The
prior itself is called a conjugate prior and, in such cases, the shortcut
approach works.

This is the reason that Bayesian modellers are often looking for
conjugate priors trying to simplify the calculations. If such priors
are difficult to find or are not reasonable suggestions for a prior in a
particular situation, then one needs to resort to the full-scale Bayesian
estimation instead of the shortcut approach.

Chapter 2 : General inference problem Page 166 / 748


2.5 Statistical decision theoretic approach to inference Slide 2.87 / 2.106

Very often, there is no need to calculate explicitly the marginal g(X) in


the formula for the Bayes estimator. This is an important observation
since the calculation of the integral that defines the marginal g(X)
may be difficult so it would be good if it could be avoided.

Using the symbol ∝ to denote proportionality up to a constant between two functions, we can write:

h(θ|X) = f(X|θ) τ(θ) / g(X) ∝ f(X|θ) τ(θ).

Chapter 2 : General inference problem Page 167 / 748


2.5 Statistical decision theoretic approach to inference Slide 2.88 / 2.106

Hence the shape of the posterior h(θ|X) (which, conditionally on X, is a function of θ only) is determined with or without knowing g(X), since the latter only serves to normalise the conditional density to integrate to one.

We could guess the shape of the density h(θ|X ) by just analysing the
product f ( X|θ)τ(θ) as a function of θ.

Chapter 2 : General inference problem Page 168 / 748


2.5 Statistical decision theoretic approach to inference Slide 2.89 / 2.106

In our example, we have

f(X|θ) τ(θ) = θ^{Σ_{i=1}^n Xi + α − 1} (1 − θ)^{n − Σ_{i=1}^n Xi + β − 1}.

This already identifies h(θ|X) as being

Beta(Σ_{i=1}^n Xi + α, n − Σ_{i=1}^n Xi + β)

distributed. But for any Beta distributed random variable with parameters a, b it is known that the expected value is equal to a/(a + b). Hence we get immediately the Bayes estimator

θ̂_B = E(θ|X) = (Σ_{i=1}^n Xi + α) / (α + β + n)

without the need to analyse and calculate the marginal g(X).


Chapter 2 : General inference problem Page 169 / 748
2.5 Statistical decision theoretic approach to inference Slide 2.90 / 2.106

Example 2.17
Let X1 , X2 , . . . , Xn be a random sample from the normal density with
mean µ and variance 1. Consider estimating µ with a squared-error
loss. Assume that the prior τ(µ) is a normal density with mean µ0
and variance 1.

Show that the Bayes estimator of µ is

(µ0 + Σ_{i=1}^n Xi) / (n + 1).

Chapter 2 : General inference problem Page 170 / 748


2.5 Statistical decision theoretic approach to inference Slide 2.91 / 2.106

Solution:
Let X = (X1, . . . , Xn) be the random variables. Setting µ0 = x0 for convenience of the notation, we can write:

h(µ|X = x) ∝ exp( −(1/2) Σ_{i=0}^n (xi − µ)² ) ∝ exp( −[(n+1)/2] [ µ² − 2µ (Σ_{i=0}^n xi)/(n+1) ] ).

Of course this also means (by completing the square with the expression that does not depend on µ)

h(µ|X = x) ∝ exp( −[(n+1)/2] ( µ − (Σ_{i=0}^n xi)/(n+1) )² ),

which implies that h(µ|X = x), being a density, must be the density of

N( (Σ_{i=0}^n xi)/(n+1), 1/(n+1) ).

Chapter 2 : General inference problem Page 171 / 748
2.5 Statistical decision theoretic approach to inference Slide 2.92 / 2.106

Hence, the Bayes estimator (being the posterior mean) would be

(1/(n+1)) Σ_{i=0}^n xi = (1/(n+1)) ( µ0 + Σ_{i=1}^n xi ) = µ0/(n+1) + [n/(n+1)] X̄,

that is, the Bayes estimator is a convex combination of the mean of the prior and of X̄. In this combination, the weight of the prior information diminishes quickly when the sample size increases. The same estimator is obtained with respect to absolute value loss.

Chapter 2 : General inference problem Page 172 / 748


2.5 Statistical decision theoretic approach to inference Slide 2.93 / 2.106

Example 2.18
As part of a quality inspection program, five components are selected
at random from a batch of components to be tested. From past
experience, the parameter θ (the probability of failure), has a beta
distribution with density

τ(θ) = 30θ(1 − θ)4 , 0 ≤ θ ≤ 1.

We wish to test the hypothesis

H0 : θ ≤ 0.2 against H1 : θ > 0.2

using Bayesian hypothesis testing with a zero-one loss. What is your


decision if:
i) In a batch of five there were no failures found.
ii) In a batch of five there was one failure found.
Chapter 2 : General inference problem Page 173 / 748
2.5 Statistical decision theoretic approach to inference Slide 2.94 / 2.106

Solution:
i) X ∼ Bin(5, θ). We have:

P(X = 0|θ) = (1 − θ)^5,

which means that the posterior of θ given the sample is h(θ | X = 0) ∝ (1 − θ)^5 θ(1 − θ)^4 = θ(1 − θ)^9. Hence:

h(θ|X = 0) = 110 θ(1 − θ)^9

(Note: Γ(12)/(Γ(10)Γ(2)) = 11!/(9! 1!) = 110.)

Then we get for the posterior probability given the sample:

P(θ ∈ Θ0 | X = 0) = ∫_0^{0.2} 110 θ(1 − θ)^9 dθ = 0.6779,

and we accept H0 since the above posterior probability is > 1/2.


Chapter 2 : General inference problem Page 174 / 748
2.5 Statistical decision theoretic approach to inference Slide 2.95 / 2.106

ii) Now:

P(X = 1|θ) = 5(1 − θ)^4 θ,

which implies that the posterior of θ given the sample is h(θ | X = 1) ∝ (1 − θ)^4 θ (1 − θ)^4 θ = (1 − θ)^8 θ^2. Hence:

h(θ|X = 1) = [Γ(12)/(Γ(9)Γ(3))] (1 − θ)^8 θ^2 = 495 θ^2 (1 − θ)^8.

Then we get for the posterior probability given the sample:

P(θ ∈ Θ0 | X = 1) = ∫_0^{0.2} 495 θ^2 (1 − θ)^8 dθ = 0.3826 < 1/2,

and we reject H0 since the above posterior probability is < 1/2.
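These two posterior probabilities can be checked numerically; a minimal sketch using the Beta cdf (the posteriors are Beta(2, 10) and Beta(3, 9), as derived above):

```python
from scipy import stats

# Posterior after 0 failures in 5 trials: Beta(2, 10); after 1 failure: Beta(3, 9).
print(stats.beta(2, 10).cdf(0.2))   # about 0.678 > 1/2  ->  accept H0
print(stats.beta(3, 9).cdf(0.2))    # about 0.383 < 1/2  ->  reject H0
```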

Chapter 2 : General inference problem Page 175 / 748


2.5 Statistical decision theoretic approach to inference Slide 2.96 / 2.106

Exercise 2.12
In a sequence of consecutive years 1, 2, . . . , T , an annual number of
high-risk events is recorded by a bank. The random number Nt of
high-risk events in a given year is modelled via Poisson(λ) distribution.
This gives a sequence of independent counts n1 , n2 , . . . , nT . The prior
on λ is Gamma(a, b) with known a > 0, b > 0 :

τ(λ) = λ^{a−1} e^{−λ/b} / (Γ(a) b^a),   λ > 0.
i) Determine the Bayesian estimator of the intensity λ with respect
to quadratic loss.
ii) Assume that the parameters of the prior are a = 2, b = 2. The
bank claims that the yearly intensity λ is no more than 2. Within
the last six years counts were 0, 2, 3, 3, 2, 2. Test the bank’s claim
via Bayesian testing with a zero-one loss.
Chapter 2 : General inference problem Page 176 / 748
2.5 Statistical decision theoretic approach to inference Slide 2.97 / 2.106

2.5.12 What to do if a Bayes rule cannot be found analytically

Integration plays a significant role in the analytic determination of Bayesian estimators and tests. The required integrals may be difficult to obtain in closed form, and numerical methods need to be applied in such situations.

Simple Monte Carlo methods to calculate the integrals

∫_Θ θ f(X|θ) τ(θ) dθ   and   ∫_Θ f(X|θ) τ(θ) dθ

can always be applied.
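A minimal Monte Carlo sketch for these two integrals: draw θ repeatedly from the prior and average. The Bernoulli likelihood and the Beta(2, 2) prior below are assumptions made only to keep the example concrete and checkable against the conjugate result.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])   # hypothetical data
prior = stats.beta(2, 2)                        # hypothetical prior tau(theta)

def likelihood(theta):
    # f(X | theta) for i.i.d. Bernoulli(theta) data
    return theta ** x.sum() * (1.0 - theta) ** (len(x) - x.sum())

theta = prior.rvs(size=200_000, random_state=rng)
w = likelihood(theta)

# Ratio of the two Monte Carlo integrals gives the Bayes estimator under quadratic loss.
posterior_mean_mc = np.sum(theta * w) / np.sum(w)
posterior_mean_exact = (x.sum() + 2) / (2 + 2 + len(x))   # conjugate Beta-Bernoulli result
print(posterior_mean_mc, posterior_mean_exact)
```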
However, besides the simple Monte Carlo methods, there are more
complicated Monte Carlo procedures which are specific and very useful
in Bayesian inference. To motivate these procedures we first consider
a simplified general example given in the following Lemma.
Chapter 2 : General inference problem Page 177 / 748
2.5 Statistical decision theoretic approach to inference Slide 2.98 / 2.106

Lemma
Suppose we generate random variables by the following algorithm:
i) Generate Y ∼ fY (y);
ii) Generate X ∼ fX|Y ( x|Y ).
Then X ∼ fX ( x).

Chapter 2 : General inference problem Page 178 / 748


2.5 Statistical decision theoretic approach to inference Slide 2.99 / 2.106

Proof.
For the cumulative distribution function FX(x) we have:

FX(x) = P(X ≤ x)
      = E[ F_{X|Y}(x|Y) ]
      = ∫_{−∞}^{∞} [ ∫_{−∞}^{x} f_{X|Y}(t|y) dt ] fY(y) dy
      = ∫_{−∞}^{x} [ ∫_{−∞}^{∞} f_{X|Y}(t|y) fY(y) dy ] dt
      = ∫_{−∞}^{x} [ ∫_{−∞}^{∞} f_{X,Y}(t, y) dy ] dt
      = ∫_{−∞}^{x} fX(t) dt.

Hence, the random variable X generated by the algorithm has density fX(x). □
Chapter 2 : General inference problem Page 179 / 748
2.5 Statistical decision theoretic approach to inference Slide 2.100 / 2.106

The above Lemma tells us that if we wanted to calculate an expected value E[W(X)] for any function W(X) with E[W²(X)] < ∞, then we can generate independently the sequence (Y1, X1), (Y2, X2), . . . , (Ym, Xm) for a specified large value m, and then by the Law of Large Numbers we will have

W̄ = (1/m) Σ_{i=1}^m W(Xi) ≈ E[W(X)].

The above simple observation can be generalized in the following algorithm of the Gibbs sampler. Let m be a positive integer and X0 an initial value. Then for i = 1, 2, . . . , m:

i) Generate Yi | Xi−1 ∼ f_{Y|X}(y|x);
ii) Generate Xi | Yi ∼ f_{X|Y}(x|y).
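A toy Gibbs sampler sketch for a case where both full conditionals are known explicitly: X | Y = y ∼ Binomial(m, y) and Y | X = x ∼ Beta(x + a, m − x + b), so that the marginal of X is Beta-Binomial(m, a, b). The constants m, a, b are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
m, a, b = 16, 2.0, 4.0            # arbitrary illustrative constants
iters, burn_in = 20_000, 2_000

x = 0                              # initial value X_0
xs = []
for _ in range(iters):
    y = rng.beta(x + a, m - x + b)   # i)  Y_i | X_{i-1} ~ Beta(x + a, m - x + b)
    x = rng.binomial(m, y)           # ii) X_i | Y_i     ~ Binomial(m, y)
    xs.append(x)

sample = np.array(xs[burn_in:])
# The marginal of X is Beta-Binomial(m, a, b); its mean is m * a / (a + b).
print(sample.mean(), m * a / (a + b))
```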

Chapter 2 : General inference problem Page 180 / 748


2.5 Statistical decision theoretic approach to inference Slide 2.101 / 2.106

In more advanced texts, it can be shown that Yi →d fY(y) and Xi →d fX(x) as i → ∞ (convergence in distribution). Therefore, intuitively, the convergence of the Gibbs sampler can be argued in a manner similar to the Lemma.

The rigorous justification is slightly more involved. The reason is that the pairs

(X1, Y1), (X2, Y2), . . . , (Xk, Yk), (Xk+1, Yk+1)

generated by the Gibbs sampler are not generated independently: we need only the pair (Xk, Yk) (and none of the previous (k − 1) pairs) to generate (Xk+1, Yk+1). Hence we have a Markov chain and, for it, under quite general conditions, the distribution stabilizes (reaches an equilibrium).

Chapter 2 : General inference problem Page 181 / 748


2.5 Statistical decision theoretic approach to inference Slide 2.102 / 2.106

The application of the Gibbs sampler in Bayesian inference can help in overcoming one of the major obstacles of this inference, namely the fact that the prior may not be precisely known. We can allow ourselves more freedom by modeling the prior itself using another random variable. We get the so-called hierarchical Bayes model if we assume:

X|θ ∼ f(x|θ),   Θ|γ ∼ q(θ|γ),   Γ ∼ ψ(γ)

with q(·|·) and ψ(·) known density functions. Here γ is called the hyperparameter. We keep in mind that f(X|θ) does not depend on γ. Keeping g(·) as a generic notation for a density, we get, using the Bayes formula:

g(θ, γ|x) = f(x|θ) q(θ|γ) ψ(γ) / g(x).
Chapter 2 : General inference problem Page 182 / 748
2.5 Statistical decision theoretic approach to inference Slide 2.103 / 2.106

This conditional joint density is proportional to the product of known densities f(x|θ) q(θ|γ) ψ(γ), hence g(θ|x, γ) and g(γ|x, θ) can (in principle) be determined. (When there is no easy analytic way of doing this, the Metropolis-Hastings algorithm can help us simulate from the conditionals. The Metropolis-Hastings algorithm is discussed in a Bayesian statistics course and we will avoid discussing it here.)

We can then start a Gibbs sampler with an initial value γ0 as follows:

i) Θi | X, γi−1 ∼ g(θ|X, γi−1);
ii) Γi | X, θi ∼ g(γ|X, θi).

(In other words, we simulate from the full conditionals: the conditional distributions of each parameter given the other parameters and the data.)

Chapter 2 : General inference problem Page 183 / 748


2.5 Statistical decision theoretic approach to inference Slide 2.104 / 2.106

Taking sufficiently many repetitions, the algorithm will converge under suitable conditions as follows:

Θi →d h(θ|X),   Γi →d g(γ|X)   as i → ∞.

Hence the simple arithmetic average of the Θi values (after possibly discarding some initial iterates before stabilization has occurred) will converge towards the Bayes estimator with respect to quadratic loss for the given hierarchical Bayes model.

In practice, we would generate the stream of values (θ1, γ1), (θ2, γ2), . . . . Then, choosing large values of m and B > m, our Bayes estimate of θ will be the average

(1/(B − m)) Σ_{i=m+1}^{B} θi.
Chapter 2 : General inference problem Page 184 / 748
2.5 Statistical decision theoretic approach to inference Slide 2.105 / 2.106

Remark 2.4
The Gibbs sampler works fine when indeed the conditional distributions
are completely known. The conditional distributions are often only
known up to a (normalizing) proportionality constant.

Interestingly, the Gibbs sampler can still be used in these cases, but drawing from the conditional distribution is more involved. The best algorithm for this case is the Metropolis-Hastings algorithm. For details, see the separate course MATH5960 in Bayesian inference.

Chapter 2 : General inference problem Page 185 / 748


2.5 Statistical decision theoretic approach to inference Slide 2.106 / 2.106

Here we present the essence of the algorithm. Suppose that a density f(x) is only known up to a normalizing constant, i.e. f(x) = c f̃(x) where f̃ is known. Choose an arbitrary, completely known, so-called proposal density u(x′|x). Let the t-th generated data point be x = X_t. For this given x define the set of points

A_x = {x′ : f̃(x′) u(x|x′) < f̃(x) u(x′|x)}.

i) Generate a value x′ randomly from u(x′|x).
ii) If x′ is not in A_x then we put X_{t+1} = x′ as the new simulated point. However, if x′ is in A_x then perform a further randomization and accept x′ with probability f̃(x′) u(x|x′) / [ f̃(x) u(x′|x) ]. If it is accepted, again put X_{t+1} = x′. Otherwise, put X_{t+1} = x.

The theory of the algorithm requires only mild conditions on u(x′|x) to work, but practice shows that, in terms of the computing time needed to run it and of generating a “well mixing” sequence of simulated values, the choice of u(x′|x) needs to be made carefully.
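A minimal sketch of the algorithm for a target known only up to a constant; the target f̃(x) ∝ exp(−x⁴/4) and the N(x, 1) random-walk proposal are arbitrary illustrative choices (for this symmetric proposal the u-terms cancel, but they are written out to mirror the acceptance rule above).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

def f_tilde(x):
    # Target density known only up to a normalizing constant (illustrative choice).
    return np.exp(-x ** 4 / 4.0)

def u_pdf(x_new, x_old):
    # Proposal density u(x' | x): a N(x, 1) random walk (arbitrary choice).
    return stats.norm.pdf(x_new, loc=x_old, scale=1.0)

x, chain = 0.0, []
for _ in range(20_000):
    x_prop = rng.normal(loc=x, scale=1.0)        # i) draw x' from u(. | x)
    ratio = (f_tilde(x_prop) * u_pdf(x, x_prop)) / (f_tilde(x) * u_pdf(x_prop, x))
    # ii) accept automatically if x' is not in A_x (ratio >= 1); otherwise accept w.p. ratio
    if ratio >= 1.0 or rng.uniform() < ratio:
        x = x_prop
    chain.append(x)

sample = np.array(chain[2_000:])
print(sample.mean(), sample.std())   # sample moments of the simulated target
```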
Chapter 2 : General inference problem Page 186 / 748
