Lecture Two 2025
On the other hand, real-life empirical observations are discrete.
We will use this fact to keep some of the proofs simpler.
In many cases we will be dealing with the discrete case only, thus
avoiding more involved measure-theoretic arguments.
In the discrete case this will be just the product of the probabilities
for each of the measurements to be in a suitable interval of length ∆.
Example 2.9
If our data were counts of accidents within n = 10 consecutive weeks
on a busy crossroad, it may be reasonable to assume that a Poisson
distribution with an unknown parameter λ has given rise to the data.
That is, we may assume that we have 10 independent realisations of
a Poisson(λ) random variable.
Example 2.10
If, on the other hand, we measured the lengths Xi of 20 baby boys at
age 4 weeks, it would be reasonable to assume normal distribution for
these data. Symbolically we denote this as follows:
Xi ∼ N (µ, σ2 ), i = 1, 2, . . . , 20.
The models we use, as seen in the examples above, are usually about
the shape of the density or of the cumulative distribution function of
the population from which we have sampled.
Parametric Inference
Non-parametric Inference
Robustness approach
The situation, in practice, might be even more blurred. We may know
that the population is “close” to parametrically describable and yet
“deviates a bit” from the parametric family.
Illustration of robustness
Bayesian Inference
Another way to classify the Statistical Inference procedures is by the
way we treat the unknown parameter θ.
Estimation
Confidence set construction
Hypothesis testing
2.4.1 Estimation
But let us point out immediately that there is little value in calculating
an approximation to an unknown quantity without having an idea
of how “good” the approximation is and how it compares with other
approximations. Hence, immediately questions about confidence
interval (or, more generally, confidence set) construction arise.
After the observations are collected, further information about the set
Θ is added and it becomes plausible that the true distribution belongs
to a smaller family than it was originally postulated, i.e., it becomes
clear that the unknown θ-value belongs to a subset of Θ.
All parts of the decision making process are formally defined, a desired
optimality criterion is formulated and a decision is considered optimal
if it optimizes the criterion.
2.5.1 Introduction
Definition 2.1
A (deterministic) decision function is a function d : X → A from the
sample space to the set of actions.
2.5.2 Examples
Example 2.11 (Hypothesis testing)
Assume that a data vector X ∼ f (X, θ). Consider testing H0 : θ ≤ θ0
versus H1 : θ > θ0 where θ ∈ R1 is a parameter of interest.
Then we have
Hence
then the decision rule d (which we can call estimator) has a risk
function
This set is also very small, and examples show that a simple
randomization of given deterministic rules very often yields better
rules in the sense of risk minimization. This motivates the
introduction of randomized decision rules.
Definition 2.2
A rule δ which chooses d_i with probability w_i, where Σ_i w_i = 1, is a
randomized decision rule. Then

L(θ, δ(X)) = Σ_i w_i L(θ, d_i(X))   and   R(θ, δ) = Σ_i w_i R(θ, d_i).
The set of all randomized decision rules generated by the set D in the
above way will be denoted by D.
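The convex-combination formulas above can be checked with a tiny sketch: the risk of a randomized rule is just the weighted average of the risks of the deterministic rules it mixes. The risk points below are illustrative numbers, not values from the lecture.

```python
def randomized_risk(risk_points, weights):
    """R(theta, delta) = sum_i w_i R(theta, d_i), computed per theta-value."""
    k = len(risk_points[0])                    # number of theta values
    return tuple(sum(w * r[j] for w, r in zip(weights, risk_points))
                 for j in range(k))

# Risk points (R(theta1, d), R(theta2, d)) of two deterministic rules:
d1_risk = (0.0, 1.0)
d2_risk = (1.0, 0.0)

# An equal-weight randomization sits at the midpoint of the segment:
print(randomized_risk([d1_risk, d2_risk], [0.5, 0.5]))  # (0.5, 0.5)
```

This is exactly the geometric picture behind Theorem 2.6 below: mixing two rules traces out the segment between their risk points.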
The reason is that uniformly best decision rules (rules that minimize
the risk uniformly over all θ-values) usually do not exist! This leads
us to the following two ways out:
Definition 2.3
A decision rule d is unbiased if
of a decision rule and try to find the rules that minimize these risks.
Bayesian rule:
Think of the parameter θ as a random variable with a given (known)
prior density τ on Θ.
Define the Bayesian risk of the decision rule δ with respect to the
prior τ:

r(τ, δ) = E[R(θ, δ)] = ∫_Θ R(θ, δ)τ(θ)dθ
Then the Bayesian rule δτ with respect to the prior τ is defined as:
Sometimes a Bayesian rule may not exist and so we ask for an ϵ-Bayes
rule.
Minimax rule.
Instead of considering uniformly best rules we consider rules that
minimize the supremum of the values of the risk over the set Θ. This
means safeguarding against the worst possible performance.
The value

sup_{θ∈Θ} R(θ, δ)

is called the minimax risk of the decision rule δ. Then the rule δ∗ is
called minimax in the set D if
Again, like in the Bayesian case, even if the minimax value is finite
there may not be a minimax decision rule. Hence we introduce the
notion of ϵ-minimax rule δϵ such that
It indeed deserves its name. If the statistician were told which prior
distribution nature was using, they would least like to be told that τ∗
was nature's prior: even though they then always perform optimally by
choosing the corresponding Bayesian rule, they still face the highest
possible value of the Bayesian risk compared to any other prior.
Definition 2.4
A set A ⊂ R^k is convex if for all vectors x⃗ = (x_1, x_2, . . . , x_k)′ and
y⃗ = (y_1, y_2, . . . , y_k)′ in A and all α ∈ [0, 1],

αx⃗ + (1 − α)y⃗ ∈ A.
Now let's assume that Θ has only k elements. Define the risk set of a set
D of decision rules as the set of all risk points {R(θ, d), θ ∈ Θ}, d ∈ D.
For a fixed d, each such risk point belongs to R^k, and by "moving" d
within D we get a set of such k-dimensional vectors.
Theorem 2.6
The risk set of a set D of randomized decision rules generated by a
given set D of non-randomized decision rules is convex.
Proof.
It is easy to see that if ⃗y and ⃗y′ are the risk points of δ and δ′ ∈ D,
correspondingly, then any point in the form
⃗z = α⃗y + (1 − α)⃗y′
corresponds to (is the risk point of) the randomized decision rule δα ∈
D that chooses δ with probability α and the rule δ′ with probability
(1 − α). Hence any such ⃗z belongs to the risk set of D. □
Remark 2.1
The risk set of the set of all randomized rules D generated by the
set D is the smallest convex set containing the risk points of all of
the non-randomized rules in D (i.e. the convex hull of the set of risk
points of D).
All points y⃗ in the risk set, corresponding to certain rules δ∗ for which

Σ_{i=1}^k p_i y_i = r(τ, δ∗) = the same value = b,
give rise to the same value b of the Bayesian risk and hence are
equivalent from a Bayesian point of view. At least in the case k = 2,
one can easily identify the point in the convex risk set that
corresponds to (is the risk point of) the Bayesian rule with respect to
the prior τ. (See illustration at lecture.)
Chapter 2 : General inference problem
2.5 Statistical decision theoretic approach to inference
2.5.8 Example
Let the set Θ = {θ1 , θ2 }. Let X have possible values 0, 1 and 2; the
set A = {a1 , a2 } and let
x          0     1     2
P(x|θ1)   .81   .18   .01
P(x|θ2)   .25   .50   .25
Exercise 2.10
Now, consider all possible non-randomized decision rules based on
one observation:
x    d1(x)  d2(x)  d3(x)  d4(x)  d5(x)  d6(x)  d7(x)  d8(x)
0    a1     a1     a1     a1     a2     a2     a2     a2
1    a1     a1     a2     a2     a1     a1     a2     a2
2    a1     a2     a1     a2     a1     a2     a1     a2
a) Sketch the risk set of all randomized rules generated by
d1 , d2 , .., d8 .
b) Find the minimax rule δ∗ (in D) and compute its risk.
c) For what prior is δ∗ a Bayes rule w.r. to that prior (i.e., what is
the least favorable distribution)?
d) Find the Bayes rule for the prior {1/3, 2/3} over {θ1 , θ2 }. Compute
the value of its Bayes risk.
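As a starting point for part (a), the eight risk points can be computed mechanically. The loss values are not reproduced on this slide, so a plain 0-1 loss is assumed here (a1 is the correct action under θ1 and a2 under θ2); adjust the losses if the lecture uses different ones.

```python
# Risk points (R(theta1, d), R(theta2, d)) of the rules d1, ..., d8 under
# an ASSUMED 0-1 loss: a1 correct under theta1, a2 correct under theta2.
from itertools import product

p_theta1 = [0.81, 0.18, 0.01]   # P(x | theta1), x = 0, 1, 2
p_theta2 = [0.25, 0.50, 0.25]   # P(x | theta2)

risk_points = {}
# product("12", repeat=3) enumerates (d(0), d(1), d(2)) in the table's order:
for i, rule in enumerate(product("12", repeat=3), start=1):
    r1 = sum(p for p, a in zip(p_theta1, rule) if a == "2")  # wrong under theta1
    r2 = sum(p for p, a in zip(p_theta2, rule) if a == "1")  # wrong under theta2
    risk_points[f"d{i}"] = (round(r1, 2), round(r2, 2))

print(risk_points["d1"])  # (0.0, 1.0): always choose a1
print(risk_points["d8"])  # (1.0, 0.0): always choose a2
```

The risk set of the randomized rules is then the convex hull of these eight points.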
Lemma
If τ∗ is a prior on Θ and the Bayes rule δτ∗ has a constant risk w.r. to
θ (i.e. if R(θ, δτ∗ ) = c0 for all θ ∈ Θ) then:
a) δτ∗ is minimax;
b) τ∗ is the least favorable distribution.
Proof.
Note here we have used the fact that R(θ, δτ∗) is constant and

∫_Θ τ∗(θ)dθ = ∫_Θ τ(θ)dθ = 1.
□
Remark 2.2
The lemma provides a hint on how to find minimax estimators: the
minimax estimators are (special) Bayes estimators w.r. to the least
favorable prior.
First we can obtain the general form of the Bayes estimator with
respect to ANY given prior. Then we choose a prior for which the
corresponding Bayes rule has its (usual) risk independent of θ, i.e.
constant with respect to θ.
Notation:

h(θ|X) = f(X, θ)/g(X) = f(X|θ)τ(θ) / ∫_Θ f(X|θ)τ(θ)dθ
Theorem 2.7
For X ∈ X, a ∈ A and for a given prior τ we define:

Q(X, a) = ∫_Θ L(θ, a)h(θ|X)dθ,

Q(X, a_X) = inf_{a∈A} Q(X, a)
Proof.
But for every fixed X-value, Q(X, δ(X)) is smallest when δ(X) = a_X.
Making our "best choice" this way for each X-value will, of course,
minimize the value of r(τ, δ). Hence, we should be looking for an
action a_X that attains the infimum

inf_{a∈A} ∫_Θ L(θ, a)h(θ|X)dθ.
□
Example 2.13
Show that for a given random variable Y with a finite second moment,
the function q1 (a) = E (Y − a)2 is minimised for a∗ = E (Y ).
Solution:
Setting the derivative with respect to a to zero we get

∂/∂a E(Y − a)² = ∂/∂a [E(Y²) − 2E(Y)a + a²] = −2E(Y) + 2a = 0,

so a∗ = E(Y), and this is indeed a minimum since

∂²/∂a² E(Y − a)² = 2 > 0.
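A quick numerical sanity check of Example 2.13, with an illustrative discrete Y:

```python
# q1(a) = E(Y - a)^2 is smallest at a* = E(Y); Y below is illustrative.
ys = [1.0, 2.0, 6.0]                  # equally likely values of Y
mean = sum(ys) / len(ys)              # E(Y) = 3.0

def q1(a):
    """q1(a) = E(Y - a)^2 for the discrete Y above."""
    return sum((y - a) ** 2 for y in ys) / len(ys)

# q1 is larger at any point other than the mean:
assert q1(mean) < q1(mean - 0.5) and q1(mean) < q1(mean + 0.5)
print(mean)  # 3.0
```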
Example 2.14
Show that for a given random variable Y with E|Y| < ∞, the function
q2 (b) = E|Y − b| is minimised for b∗ = median(Y ).
Solution:
(Continuous case for simplicity.) Denote the density of Y by f (y) and
the cdf by F (y). Having in mind the definition of absolute value we
have:
∂/∂b E(|Y − b|) = ∂/∂b [ ∫_{−∞}^{b} (b − y)f(y)dy + ∫_{b}^{∞} (y − b)f(y)dy ]
= ∂/∂b [ bF(b) − ∫_{−∞}^{b} y f(y)dy + ∫_{b}^{∞} y f(y)dy − b(1 − F(b)) ]
= F(b) + b f(b) − b f(b) − b f(b) − (1 − F(b)) + b f(b)
= 2F(b) − 1
= 0,

which gives F(b∗) = 1/2, i.e. b∗ = median(Y).
Remark 2.3
A case in which δτ ∈ D could not be satisfied is point estimation
with Θ ≡ A a finite set. Then E(θ|X) might not belong to A,
hence E(θ|X) would not be a function X → A and δτ would not be
a legitimate estimator. But if Θ ≡ A is convex, it can be shown that
always E(θ|X) ∈ A!
H0 : θ ∈ Θ 0 versus H1 : θ ∈ Θ 1 .
The loss when a first type error occurs is denoted c1 and the
loss when a second type error occurs is denoted c2. Since the
consequences of the two types of error may not be equally heavy,
in general c1 ≠ c2.
H0 : θ ∈ Θ0 versus H1 : θ ∈ Θ 1 .
with a generalised 0-1 loss function.
is a Bayesian rule (Bayesian test) for the above testing problem with
respect to the prior τ.
Proof.
Now:

Q(X, a0) = ∫_Θ L(θ, a0)h(θ|X)dθ
= c2 ∫_{Θ1} h(θ|X)dθ
= c2 P(θ ∈ Θ1 | X)
= c2 (1 − P(θ ∈ Θ0 | X))

and

Q(X, a1) = ∫_Θ L(θ, a1)h(θ|X)dθ
= c1 ∫_{Θ0} h(θ|X)dθ
= c1 P(θ ∈ Θ0 | X).
This makes perfect sense. We also note that when the two types of
errors are equally weighted (i.e., when c1 = c2 is chosen), the
threshold c2/(c1 + c2) to compare with is just equal to 1/2.
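The resulting decision rule is easy to sketch: accept H0 exactly when the posterior probability of Θ0 reaches the threshold c2/(c1 + c2). The numeric inputs below are illustrative.

```python
# Bayesian test with generalised 0-1 loss: compare P(theta in Theta_0 | X)
# with the threshold c2 / (c1 + c2).
def bayes_test(p0, c1, c2):
    """p0 = P(theta in Theta_0 | X); c1, c2 = type I / type II losses."""
    return "accept H0" if p0 >= c2 / (c1 + c2) else "accept H1"

# With equal losses the threshold is 1/2:
print(bayes_test(0.6, 1.0, 1.0))  # accept H0
# A much costlier type II error raises the threshold to 0.9:
print(bayes_test(0.6, 1.0, 9.0))  # accept H1
```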
2.5.11 Examples
Example 2.15
Let, given θ, the distribution of each Xi , i = 1, 2, . . . , n be Bernoulli
with parameter θ, i.e.
f(X|θ) = θ^{ΣXi}(1 − θ)^{n−ΣXi}
and assume a beta prior τ for the (random variable) θ over (0, 1):
τ(θ) = [1/B(α, β)] θ^{α−1}(1 − θ)^{β−1} I_{(0,1)}(θ)
Show that the Bayesian estimator θ̂B with respect to quadratic loss is:
θ̂_B = (Σ_{i=1}^n X_i + α) / (α + β + n)
Solution:
We recall first the definition and some properties of the Beta function
that is used in the definition of the Beta density. In particular, there
is the following relation between B(α, β) and the Gamma function
Γ(a) = ∫_0^∞ e^{−x} x^{a−1} dx:

B(α, β) = ∫_0^1 x^{α−1}(1 − x)^{β−1} dx = Γ(α)Γ(β)/Γ(α + β),

B(α, β) = [(α − 1)/(α + β − 1)] B(α − 1, β).
θ̂_τ = ∫_0^1 θ h(θ|X)dθ
= B(ΣXi + α + 1, n − ΣXi + β) / B(ΣXi + α, n − ΣXi + β)
= Γ(ΣXi + α + 1)Γ(n + α + β) / [Γ(n + 1 + α + β)Γ(ΣXi + α)]
= (Σ_{i=1}^n Xi + α) / (α + β + n)      (by the above property of the Beta function)
= (X̄ + α/n) / (1 + (α + β)/n)
The above derivation holds for any beta prior Beta(α, β).
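The closed form can be verified numerically against direct integration of θ times the posterior density. The data summary and prior parameters below are illustrative choices.

```python
# Check of Example 2.15: posterior mean = (sum Xi + alpha)/(alpha + beta + n).
from math import gamma

alpha, beta_, n, s = 2.0, 3.0, 10, 4          # s = sum of the Xi (illustrative)

def beta_pdf(t, a, b):
    """Beta(a, b) density at t."""
    return gamma(a + b) / (gamma(a) * gamma(b)) * t ** (a - 1) * (1 - t) ** (b - 1)

# The posterior is Beta(s + alpha, n - s + beta_); integrate theta * posterior
# by the midpoint rule over (0, 1):
m = 100_000
post_mean = sum(t * beta_pdf(t, s + alpha, n - s + beta_)
                for t in ((i + 0.5) / m for i in range(m))) / m

closed_form = (s + alpha) / (alpha + beta_ + n)
print(closed_form)  # 0.4
```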
In particular, note that when the sample size n is small, the effect of
the prior (via the values of the parameters α and β) may be significant
and the Bayesian estimator θ̂τ may be very different from the UMVUE
thus expressing the influence of the prior information on our decision.
On the other hand, when the sample size n is very large, we see that
θ̂τ ≈ X̄
holds no matter what the prior. That is, when the sample size
increases, the prior’s effect on the estimator starts disappearing!
Let us calculate the (usual) risk with respect to quadratic loss of any
such Bayes estimator:
The solution to this system is α = β = √n/2. Hence the minimax
estimator of θ is

θ̂_minimax = (ΣXi + √n/2) / (n + √n).
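One can check numerically that this estimator's quadratic risk really is constant in θ, as the construction requires (n = 25 is an illustrative sample size):

```python
# Risk of the minimax estimator under quadratic loss: with S ~ Bin(n, theta),
# R(theta, d) = Var(d) + bias(d)^2, which works out to n / (4 (n + sqrt(n))^2)
# for every theta.
from math import sqrt

def risk(theta, n):
    bias = (n * theta + sqrt(n) / 2) / (n + sqrt(n)) - theta
    var = n * theta * (1 - theta) / (n + sqrt(n)) ** 2
    return var + bias ** 2

n = 25
# The same value appears at every theta:
print([round(risk(t, n), 6) for t in (0.1, 0.3, 0.5, 0.9)])
```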
Exercise 2.11
Suppose a single observation x is available from the uniform distribu-
tion with a density
f(x|θ) = (1/θ) I_{(x,∞)}(θ), θ > 0
The prior on θ has density:
Example 2.16
Suppose X1 , X2 , . . . , Xn have conditional joint density:
f_{X|Θ}(x_1, x_2, . . . , x_n | θ) = θ^n e^{−θ Σ_{i=1}^n x_i}, x_i > 0 for i = 1, . . . , n; θ > 0

and prior density

τ(θ) = k e^{−kθ}

for θ > 0, where k > 0 is a known constant, i.e. the observations are
exponentially distributed given θ, and the prior on θ is also exponential
but with a different parameter.
i) Calculate the posterior density of Θ given X1 = x1 , X2 =
x2 , . . . , Xn = xn
ii) Find the Bayesian estimator of θ with respect to squared error
loss.
Solution:
i) The posterior density of Θ given X_1 = x_1, . . . , X_n = x_n satisfies

h(θ|x) ∝ θ^n e^{−θ(Σ_{i=1}^n x_i + k)},

i.e. the posterior is the Gamma(n + 1, 1/(Σ_{i=1}^n x_i + k)) distribution.

ii) The Bayesian estimator with respect to squared error loss is the
posterior mean:

θ̂ = ∫_0^∞ θ h(θ|x)dθ = [(Σ_{i=1}^n x_i + k)^{n+1} / Γ(n + 1)] ∫_0^∞ θ^{n+1} e^{−θ(Σ_{i=1}^n x_i + k)} dθ

and after changing variables θ(Σ_{i=1}^n x_i + k) = y, dθ = dy/(Σ_{i=1}^n x_i + k), we
can continue the evaluation:

θ̂ = ∫_0^∞ e^{−y} y^{n+1} dy / [Γ(n + 1)(Σ_{i=1}^n x_i + k)] = Γ(n + 2) / [Γ(n + 1)(Σ_{i=1}^n x_i + k)] = (n + 1) / (Σ_{i=1}^n x_i + k).
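A numerical check (with illustrative data values) that the closed form (n + 1)/(Σ x_i + k) matches direct integration of the posterior mean:

```python
# Check of Example 2.16: posterior mean of the Gamma(n + 1, 1/(s + k)) posterior.
from math import exp, gamma

xs = [0.5, 1.2, 0.8, 2.0]      # illustrative exponential observations
k = 1.0                        # prior parameter
n, s = len(xs), sum(xs)

def post(theta):
    """Gamma(n + 1, 1/(s + k)) posterior density at theta."""
    return (s + k) ** (n + 1) * theta ** n * exp(-theta * (s + k)) / gamma(n + 1)

# Midpoint-rule posterior mean over theta in (0, 20) (tail beyond is negligible):
h = 0.0005
num_mean = sum(t * post(t) * h for t in ((i + 0.5) * h for i in range(40_000)))

print((n + 1) / (s + k))  # 0.9090909...
```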
This is the reason Bayesian modellers often look for conjugate priors
to simplify the calculations. If such priors are difficult to find, or
are not reasonable suggestions for a prior in a particular situation,
then one needs to resort to full-scale Bayesian estimation instead of
the shortcut approach.
h(θ|X) = f(X|θ)τ(θ) / g(X) ∝ f(X|θ)τ(θ)
We could guess the shape of the density h(θ|X ) by just analysing the
product f ( X|θ)τ(θ) as a function of θ.
f(X|θ)τ(θ) ∝ θ^{Σ_{i=1}^n X_i + α − 1} (1 − θ)^{n − Σ_{i=1}^n X_i + β − 1},

so, given X, the parameter θ is

Beta(Σ_{i=1}^n X_i + α, n − Σ_{i=1}^n X_i + β)

distributed. But for any Beta distributed random variable with parameters
a, b it is known that the expected value is equal to a/(a + b). Hence
we get immediately the Bayes estimator

θ̂_B = E(θ|X) = (Σ_{i=1}^n X_i + α) / (α + β + n).
Example 2.17
Let X1 , X2 , . . . , Xn be a random sample from the normal density with
mean µ and variance 1. Consider estimating µ with a squared-error
loss. Assume that the prior τ(µ) is a normal density with mean µ0
and variance 1.
Show that the Bayesian estimator of µ is

(µ0 + Σ_{i=1}^n X_i) / (n + 1).
Solution:
Let X = ( X1 , . . . , Xn ) be the random variables. Setting µ0 = x0 for
convenience of the notation, we can write:
h(µ|X = x) ∝ exp(−(1/2) Σ_{i=0}^n (x_i − µ)²) ∝ exp(−[(n + 1)/2][µ² − 2µ Σ_{i=0}^n x_i / (n + 1)])

Of course this also means (by completing the square with the expression
that does not depend on µ)

h(µ|X = x) ∝ exp(−[(n + 1)/2][µ − Σ_{i=0}^n x_i / (n + 1)]²),

i.e. the posterior is normal with mean

(1/(n + 1)) Σ_{i=0}^n x_i = (µ0 + Σ_{i=1}^n x_i) / (n + 1) = µ0/(n + 1) + [n/(n + 1)] X̄.
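A quick check with illustrative numbers that the posterior mean is indeed the stated weighted average of the prior mean µ0 and the sample mean X̄:

```python
# Check of Example 2.17: posterior mean = (mu0 + sum xi) / (n + 1).
mu0 = 2.0
xs = [1.0, 3.0, 2.5, 0.5]      # illustrative observations
n = len(xs)

post_mean = (mu0 + sum(xs)) / (n + 1)
weighted = mu0 / (n + 1) + (n / (n + 1)) * (sum(xs) / n)

print(post_mean, round(weighted, 10))  # 1.8 1.8
```

Note how the prior mean enters with weight 1/(n + 1): it is down-weighted as if it were a single extra observation.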
Example 2.18
As part of a quality inspection program, five components are selected
at random from a batch of components to be tested. From past
experience, the parameter θ (the probability of failure), has a beta
distribution with density
Solution:
i) X ∼ Bin(5, θ). We have:
P(X = 0|θ) = (1 − θ)^5
ii) Now:
h(θ|X = 1) = [Γ(12)/(Γ(9)Γ(3))] θ²(1 − θ)^8 = 495 θ²(1 − θ)^8
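The normalising constant and the fact that the posterior integrates to one are easy to verify numerically:

```python
# Gamma(12)/(Gamma(9) Gamma(3)) = 11!/(8! 2!) = 495, and the resulting
# Beta(3, 9) density 495 theta^2 (1 - theta)^8 integrates to 1.
from math import gamma

c = gamma(12) / (gamma(9) * gamma(3))
m = 20_000
total = sum(c * t ** 2 * (1 - t) ** 8 / m for t in ((i + 0.5) / m for i in range(m)))

print(round(c), round(total, 3))  # 495 1.0
```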
Exercise 2.12
In a sequence of consecutive years 1, 2, . . . , T , an annual number of
high-risk events is recorded by a bank. The random number Nt of
high-risk events in a given year is modelled via Poisson(λ) distribution.
This gives a sequence of independent counts n1 , n2 , . . . , nT . The prior
on λ is Gamma(a, b) with known a > 0, b > 0 :
τ(λ) = λ^{a−1} e^{−λ/b} / (Γ(a) b^a), λ > 0.
i) Determine the Bayesian estimator of the intensity λ with respect
to quadratic loss.
ii) Assume that the parameters of the prior are a = 2, b = 2. The
bank claims that the yearly intensity λ is no more than 2. Within
the last six years counts were 0, 2, 3, 3, 2, 2. Test the bank’s claim
via Bayesian testing with a zero-one loss.
Lemma
Suppose we generate random variables by the following algorithm:
i) Generate Y ∼ fY (y);
ii) Generate X ∼ fX|Y ( x|Y ).
Then X ∼ fX ( x).
Proof.
For the cumulative distribution function F X ( x) we have:
F_X(x) = P(X ≤ x)
= E[F_{X|Y}(x|Y)]
= ∫_{−∞}^{∞} [∫_{−∞}^{x} f_{X|Y}(t|y)dt] f_Y(y)dy
= ∫_{−∞}^{x} [∫_{−∞}^{∞} f_{X|Y}(t|y) f_Y(y)dy]dt
= ∫_{−∞}^{x} [∫_{−∞}^{∞} f_{X,Y}(t, y)dy]dt
= ∫_{−∞}^{x} f_X(t)dt.
Hence, the random variable X generated by the algorithm has a density
f X ( x ). □
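The lemma can be illustrated by simulation. The densities below are an illustrative choice, not taken from the lecture: Y ∼ N(0, 1) and X | Y ∼ N(Y, 1), so the marginal of X is N(0, 2).

```python
# Composition sampling sketch of the Lemma: generating Y ~ f_Y and then
# X | Y ~ f_{X|Y} yields draws from the marginal f_X.
import random

random.seed(0)
xs = []
for _ in range(200_000):
    y = random.gauss(0.0, 1.0)       # step i):  Y ~ f_Y
    xs.append(random.gauss(y, 1.0))  # step ii): X | Y ~ f_{X|Y}

mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / len(xs)
print(round(mean, 2), round(var, 2))  # close to 0.0 and 2.0, as N(0, 2) predicts
```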
W̄ ≈ E [W ( X )].
In more advanced texts, it can be shown that Y_i → f_Y(y) and
X_i → f_X(x) in distribution as i → ∞. Therefore, intuitively, the
convergence of the Gibbs sampler can be argued in a manner similar
to the Lemma.
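A minimal Gibbs sampler sketch makes the scheme concrete. The target here is an illustrative bivariate normal with correlation ρ (not a model from the lecture), whose full conditionals are both N(ρ · other, 1 − ρ²):

```python
# Two-component Gibbs sampler: alternately draw each coordinate from its
# full conditional, discard a burn-in, and average the retained draws.
import random

random.seed(1)
rho = 0.8
sd = (1 - rho ** 2) ** 0.5
theta = gam = 0.0                          # arbitrary starting values
draws = []
for i in range(50_000):
    theta = random.gauss(rho * gam, sd)    # draw from the conditional of theta
    gam = random.gauss(rho * theta, sd)    # draw from the conditional of gamma
    if i >= 1_000:                         # discard burn-in iterates
        draws.append(theta)

est = sum(draws) / len(draws)              # ergodic average of the Theta_i
print(round(est, 2))                       # close to 0.0, the target marginal mean
```

The averaging step at the end is exactly the ergodic average (1/(B − m)) Σ_{i=m+1}^B Θ_i discussed next.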
with q(.|.) and ψ(.) known density functions. Here γ is called the
hyperparameter. We keep in mind that f ( X|θ) does not depend on
γ. Keeping g(.) as a generic notation for a density, we get using the
Bayes formula:
g(θ, γ|x) = f(x|θ)q(θ|γ)ψ(γ) / g(x).
Θ_i → h(θ|X), Γ_i → g(γ|X) in distribution as i → ∞.
Hence the simple arithmetic average of the Θ_i values (after possibly
discarding some initial iterates before stabilization has occurred) will
converge to the Bayes estimator with respect to quadratic loss for the
given hierarchical Bayes model:
(1/(B − m)) Σ_{i=m+1}^{B} θ_i.
Remark 2.4
The Gibbs sampler works fine when indeed the conditional distributions
are completely known. The conditional distributions are often only
known up to a (normalizing) proportionality constant.
Interestingly, the Gibbs sampler can still be used in these cases, but
drawing from the conditional distributions is more involved. The
standard algorithm for this case is the Metropolis-Hastings algorithm;
for details, see the separate course MATH5960 in Bayesian inference.