Stat Inf : Estimation
Characteristics of estimators
Unbiasedness
Consistency
Efficiency
Sufficiency
Cramér-Rao
Cramér-Rao lower bound
Efficiency or bias
Blackwellisation
We often use the notation "ˆ" to represent the estimator of the unknown
parameter, for example θ̂ for θ.
Characteristics of estimators
a. Unbiasedness
b. Consistency
c. Efficiency
d. Sufficiency
a. Unbiased estimators.
To understand this part, it is important to remember the difference
between an estimator (or statistic) and a parameter.
The parameter, generally represented by θ, is a unique but unknown value
(to know it with certainty we would have to carry out a census).
The unknown parameter is, as it were, the intended target.
The statistic or estimator, represented by θ̂, is a function of the
observations, i.e. the sample.
Since the sample is random, the statistic is also random. If the sampling is
repeated several times, possibly with replacement, the results may be
different each time.
It is therefore impossible to guarantee that, for each sample, the estimator
will give exactly the value of the unknown parameter.
We will therefore simply ask that the estimator does not systematically
miss.
To be more precise, we will ask that the estimator be unbiased i.e. that in
expectation it gives the value we are looking for.
Unbiasedness
An estimator Tn = T (x1 , x2 , . . . , xn ) is said to be an unbiased estimator of
γ(θ) if E (Tn ) = γ(θ) for all θ ∈ Θ.
Bias
If E (Tn ) > θ, Tn is said to be positively biased.
If E (Tn ) < θ, Tn is said to be negatively biased.
The amount of bias, b(θ̂), is given by b(θ̂) = E(θ̂) − θ.
Example
Let x1, . . . , xn be i.i.d. with E[xi] = µ < ∞, and define the sample mean

x̄ := (1/n) Σᵢ₌₁ⁿ xi

Then E(x̄) = (1/n) Σᵢ₌₁ⁿ E(xi) = µ, so x̄ is an unbiased estimator of µ.
Reminder
Expectation properties
E (c) = c
E (x + c) = E (x) + c
E (ax + c) = aE (x) + c
E (ax1 + bx2 + c) = aE (x1 ) + bE (x2 ) + c
Variance properties
Var (c) = 0
Var (x + c) = Var (x)
Var (ax + c) = a2 Var (x)
Var (ax1 ± bx2 + c) = a2 Var (x1 ) + b 2 Var (x2 ) ± 2abCov (x1 , x2 )
For the uncorrected sample variance s² := (1/n) Σᵢ₌₁ⁿ (xi − x̄)² = (1/n) Σᵢ₌₁ⁿ xi² − x̄², assuming Var(xi) = σ² < ∞,

E(s²) = E((1/n) Σᵢ₌₁ⁿ xi²) − E(x̄²) = (σ² + µ²) − (σ²/n + µ²) = ((n − 1)/n) σ² < σ²

so s² systematically underestimates σ² : it is a negatively biased estimator of σ².
Exercise
Show that S² := (1/(n − 1)) Σᵢ₌₁ⁿ (xi − x̄)² is an unbiased estimator of σ².
Let us start by computing a more practical formula for the sample variance :

S² = (1/(n − 1)) Σᵢ₌₁ⁿ (xi − x̄)²
   = (1/(n − 1)) Σᵢ₌₁ⁿ (xi² − 2 xi x̄ + x̄²)
   = (1/(n − 1)) Σᵢ₌₁ⁿ xi² − (2n/(n − 1)) x̄² + (n/(n − 1)) x̄²
   = (1/(n − 1)) Σᵢ₌₁ⁿ xi² − (n/(n − 1)) x̄²
Exercise (. . .)
Let us show that S² is an unbiased estimator of σ² :

E(S²) = E[ (1/(n − 1)) Σᵢ₌₁ⁿ xi² − (n/(n − 1)) x̄² ]
      = (1/(n − 1)) Σᵢ₌₁ⁿ E(xi²) − (n/(n − 1)) E(x̄²)
      = (n/(n − 1)) [Var(x) + E²(x)] − (n/(n − 1)) [Var(x̄) + E²(x̄)]
      = (n/(n − 1)) (σ² + µ²) − (n/(n − 1)) (σ²/n + µ²)
      = (n/(n − 1)) ((n − 1)/n) σ² = σ²

S² = (n/(n − 1)) s² is an unbiased estimator of σ² (but not S of σ).
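To make this concrete, here is a small simulation sketch (my own, not from the slides ; the normal population, n = 10 and the number of replications are arbitrary illustration choices) comparing the average of s² = (1/n) Σ(xi − x̄)² and S² = (1/(n − 1)) Σ(xi − x̄)² over many samples.

import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 5.0, 2.0, 10, 100_000   # arbitrary illustration values

samples = rng.normal(mu, sigma, size=(reps, n))
xbar = samples.mean(axis=1, keepdims=True)
ss = ((samples - xbar) ** 2).sum(axis=1)

s2 = ss / n          # biased version, expectation (n-1)/n * sigma^2
S2 = ss / (n - 1)    # unbiased version, expectation sigma^2

print("true sigma^2      :", sigma**2)      # 4.0
print("mean of s^2 (1/n) :", s2.mean())     # close to 3.6
print("mean of S^2       :", S2.mean())     # close to 4.0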
b. Consistent estimator.
An estimator Tn = T (x1 , x2 , . . . , xn ), based on a random sample of size n,
is said to be a consistent estimator of γ(θ), θ ∈ Θ, the parameter space, if
Tn converges to γ(θ) in probability, i.e., if Tn →ᵖ γ(θ) as n → ∞.
In other words, Tn is a consistent estimator of γ(θ) if for every ε > 0, η > 0, there exists a positive integer m = m(ε, η) such that
P{|Tn − γ(θ)| < ε} → 1 as n → ∞,
i.e., P{|Tn − γ(θ)| < ε} > 1 − η, ∀ n ≥ m, where m is some sufficiently large value of n.
Remarks
By Khinchine’s weak law of large numbers, the sample mean x̄ is always a consistent estimator of the population mean µ.
Consistency is a property concerning the behaviour of an estimator for indefinitely large values of the sample size n, i.e., as n → ∞.
Nothing is said about its behaviour for finite n.
Note :
A sufficient condition for consistency is that the estimator is unbiased and
that its variance tends towards 0 when n becomes large.
This condition is however not necessary.
A biased estimator may be consistent, if the bias disappears when the size of the sample increases.

[Figure : "Asymptotic properties of estimators : plims and consistency" — probability density function of an estimator for n = 25, 100, 400 and 1600, becoming more and more concentrated as n grows.]
Example
Prove that in sampling from a N(µ, σ 2 ) population, the sample mean is a
consistent estimator of µ.
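A minimal sketch of the solution (not spelled out on this slide) : since x1, . . . , xn are i.i.d. N(µ, σ²), we have E(x̄) = µ and Var(x̄) = σ²/n. By Chebyshev's inequality, for every ε > 0,

P{|x̄ − µ| ≥ ε} ≤ Var(x̄)/ε² = σ²/(nε²) → 0 as n → ∞,

so x̄ →ᵖ µ, i.e. x̄ is a consistent estimator of µ.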
c. Efficient estimator.
Intuition of efficiency
If, of the two consistent estimators T1 , T2 of a certain parameter θ, we
have V (T1 ) < V (T2 ), for all n
then T1 is more efficient than T2 for all sample sizes.
Efficiency
If T1 is the most efficient estimator with variance V1 and T2 is any other
estimator with variance V2 , then the efficiency E of T2 is defined as :
E = V1/V2
Obviously, E cannot exceed unity.
Relative efficiency
Let T1 and T2 be unbiased estimators with respectively variance V1 and
V2 , then the relative efficiency of T1 with respect to T2 is :
RE(T1, T2) = V2/V1
T1 is relatively more efficient than T2 if RE (T1 , T2 ) ≥ 1
Example
Let x1, . . . , xn be i.i.d. with E[xi] = µ and Var(xi) = σ² < ∞.
x̄ and x1 are 2 unbiased estimators for µ (Why ?).
Which is the most efficient estimator ?
As soon as n > 1, Var(x̄) < Var(x1) ⇒ x̄ is more efficient than x1.
The relative efficiency of x̄ with respect to x1 is given by :

RE(x̄, x1) = Var(x1)/Var(x̄) = σ²/(σ²/n) = n
Exercise
Let x1, x2, x3 be a random sample from a normal population whose µ and σ are unknown.
(1) Show that µ̂1 and µ̂2 are unbiased.
(2) Which estimator of µ is the most efficient, µ̂1 or µ̂2 ?

µ̂1 = (1/4) x1 + (1/2) x2 + (1/4) x3
µ̂2 = (1/3) x1 + (1/3) x2 + (1/3) x3
Exercise (. . .)

E(µ̂1) = E((1/4) x1 + (1/2) x2 + (1/4) x3)
      = (1/4) E(x1) + (1/2) E(x2) + (1/4) E(x3)
      = (1/4) µ + (1/2) µ + (1/4) µ
      = µ

E(µ̂2) = E((1/3) x1 + (1/3) x2 + (1/3) x3)
      = (1/3) E(x1) + (1/3) E(x2) + (1/3) E(x3)
      = (1/3) µ + (1/3) µ + (1/3) µ
      = µ
Exercise (. . .)

Var(µ̂1) = Var((1/4) x1 + (1/2) x2 + (1/4) x3)
        = (1/16) Var(x1) + (1/4) Var(x2) + (1/16) Var(x3)
        = 3σ²/8

Var(µ̂2) = Var((1/3) x1 + (1/3) x2 + (1/3) x3)
        = (1/9) Var(x1) + (1/9) Var(x2) + (1/9) Var(x3)
        = 3σ²/9 = σ²/3

µ̂2 is more efficient than µ̂1, since its variance is smaller.
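As a quick numerical cross-check (my own sketch, not part of the slides ; µ = 0, σ = 1 and the number of replications are arbitrary), the following compares the empirical variances of the two estimators with 3σ²/8 and σ²/3.

import numpy as np

rng = np.random.default_rng(1)
mu, sigma, reps = 0.0, 1.0, 200_000      # arbitrary illustration values

x = rng.normal(mu, sigma, size=(reps, 3))
mu1_hat = 0.25 * x[:, 0] + 0.50 * x[:, 1] + 0.25 * x[:, 2]
mu2_hat = x.mean(axis=1)

print("Var(mu1_hat):", mu1_hat.var(), " theory:", 3 * sigma**2 / 8)   # ~0.375
print("Var(mu2_hat):", mu2_hat.var(), " theory:", sigma**2 / 3)       # ~0.333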
Theorems on MVUE
Thm
An M.V.U. is unique in the sense that if T1 and T2 are M.V.U. estimators
for γ(θ), then T1 = T2 , almost surely.
Thm
Let T1 and T2 be unbiased estimators of γ(θ) with efficiencies e1 and e2
respectively and ρ = ρθ be the correlation coefficient between them. Then
√(e1 e2) − √((1 − e1)(1 − e2)) ≤ ρ ≤ √(e1 e2) + √((1 − e1)(1 − e2))
Corollary
If we take e1 = 1 and e2 = e in the previous equation, we get
√e ≤ ρ ≤ √e ⇒ ρ = √e.
Thm
If T1 is an MVU estimator of γ(θ), θ ∈ Θ and T2 is any other unbiased
estimator of γ(θ) with efficiency e = eθ , then the correlation coefficient
between T1 and T2 is given by ρ = √e, i.e., ρθ = √eθ, ∀θ ∈ Θ.
Thm
If T1 is an MVUE of γ(θ) and T2 is any other unbiased estimator of γ(θ)
with efficiency e < 1, then no unbiased linear combination of T1 and T2
can be an MVUE of γ(θ).
Other result
The correlation coefficient between a most efficient estimator and any other estimator with efficiency e is √e.
d. Sufficient estimator.
Sufficiency
If T = T (x1 , x2 , . . . , xn ) is an estimator of a parameter θ, based on a
sample x1 , x2 , . . . , xn of size n from the population with density f (x, θ)
such that the conditional distribution of x1 , x2 , . . . , xn given T , is
independent of θ, the statistic T is a sufficient estimator for θ.
Example
As an example, the sample mean is sufficient for the mean (µ) of
a normal distribution with known variance. Once the sample mean
is known, no further information about (µ) can be obtained from
the sample itself. On the other hand, for an arbitrary distribution
the median is not sufficient for the mean : even if the median
of the sample is known, knowing the sample itself would provide
further information about the population mean. For example, if
the observations that are less than the median are only slightly
less, but observations exceeding the median exceed it by a large
amount, then this would have a bearing on one’s inference about
the population mean. (source : wikipedia)
(Note : the median is known to be much more robust to extreme values
than the mean.)
Illustration
Let x1 , x2 , . . . , xn be a random sample from a Bernoulli population with
parameter p, 0 < p < 1, i.e.
xi = 1 with probability p
xi = 0 with probability q = (1 − p)
and consider the statistic T = Σᵢ₌₁ⁿ xi ∼ B(n, p).
Illustration (. . .)
The conditional distribution of (x1, x2, . . . , xn) given T is :

P(x1 ∩ x2 ∩ · · · ∩ xn | T = k) = P(x1 ∩ x2 ∩ · · · ∩ xn ∩ T = k) / P(T = k)
  = pᵏ (1 − p)ⁿ⁻ᵏ / [ C(n, k) pᵏ (1 − p)ⁿ⁻ᵏ ] = 1/C(n, k), if Σᵢ₌₁ⁿ xi = k
  = 0, if Σᵢ₌₁ⁿ xi ̸= k

Since this does not depend on p, T = Σᵢ₌₁ⁿ xi is sufficient for p.
One additional word for concreteness (clarification) :
Let us suppose a random sample of size n = 3 in which x1 = 1, x2 = 0, and
x3 = 1. In this case,
P(x1 = 1, x2 = 0, x3 = 1, T = 1) = 0
since Σ xi = 1 + 0 + 1 = 2, which is different from the value T = 1 we are conditioning on, we have an impossible event ⇒ P(·) = 0.
Illustration (. . .)
As soon as Σ xi ̸= k, P(·) = 0.
If now we consider P(x1 = 1, x2 = 0, x3 = 1, T = 2), by independence we have p(1 − p)p = p²(1 − p).
So, in general,

P(x1 ∩ x2 ∩ · · · ∩ xn ∩ T = k) = 0, if Σᵢ₌₁ⁿ xi ̸= k
P(x1 ∩ x2 ∩ · · · ∩ xn ∩ T = k) = pᵏ (1 − p)ⁿ⁻ᵏ, if Σᵢ₌₁ⁿ xi = k

(Source : PennState)
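To see sufficiency "in action", here is a small simulation sketch (mine, not from the slides ; the values of p, n, k are arbitrary) : for two different values of p it estimates the conditional distribution of (x1, x2, x3) given T = 2 and shows that it is the same, roughly uniform over the C(3, 2) = 3 arrangements, hence free of p.

import numpy as np
from collections import Counter

def conditional_given_T(p, n=3, k=2, reps=200_000, seed=0):
    # Simulate Bernoulli(p) samples, keep those with T = sum(x) = k,
    # then tabulate the relative frequency of each arrangement.
    rng = np.random.default_rng(seed)
    x = (rng.random((reps, n)) < p).astype(int)
    kept = x[x.sum(axis=1) == k]
    counts = Counter(map(tuple, kept))
    total = len(kept)
    return {arr: c / total for arr, c in sorted(counts.items())}

print(conditional_given_T(0.3))   # each arrangement with T = 2 has frequency ~ 1/3
print(conditional_given_T(0.8))   # same conditional distribution, ~ 1/3 each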
Factorisation theorem (Neyman) : a statistic t(x) is sufficient for θ if and only if the likelihood function can be expressed in the form

L = gθ[t(x)].h(x)

where gθ[t(x)] depends on θ and on x only through the value of t(x), and h(x) is independent of θ.
Remarks
1 ’A function independent of θ’ means that it does not involve θ but
also that its domain does not contain θ.
For example, f(x) = 1/(2a), a − θ < x < a + θ; −∞ < θ < ∞, depends on θ (through its support).
2 The original sample X = (x1 , x2 , . . . , xn ) is always a sufficient statistic.
3 Koopman’s form of the distributions admitting a sufficient statistic :
L = L(x, θ) = g (x).h(θ).exp{a(θ)Ψ(x)}
where h(θ) and a(θ) are functions of θ and g (x) and Ψ(x) are
functions only of the sample observations.
This equation gives the exponential family of distributions containing
the binomial, Poisson and the normal with unknown mean and
variance.
Remarks
4 Invariance Property of Sufficient Estimator :
If T is a sufficient estimator for the parameter θ and if Ψ(T ) is a one
to one function of T , then Ψ(T ) is sufficient for Ψ(θ).
5 Fisher-Neyman Criterion :
A statistic t1 = t(x1 , x2 , . . . , xn ) is a sufficient estimator of parameter
θ if and only if the likelihood function (joint p.d.f. of the sample) can
be expressed as :
L = Πᵢ₌₁ⁿ f(xi, θ) = g1(t1, θ).k(x1, x2, . . . , xn)
Illustration
Let x1 , x2 , . . . , xn be a random sample from N(µ, σ 2 ) population.
Find sufficient estimators for µ and σ 2 .
Let us write θ = (µ, σ 2 ); −∞ < µ < ∞, 0 < σ 2 < ∞.
Then L = Πᵢ₌₁ⁿ fθ(xi) = (1/(σ√(2π)))ⁿ exp{ −(1/(2σ²)) Σᵢ₌₁ⁿ (xi − µ)² }
       = (1/(σ√(2π)))ⁿ exp{ −(1/(2σ²)) (Σᵢ xi² − 2µ Σᵢ xi + nµ²) }
       = gθ[t(x)].h(x)

where gθ[t(x)] = (1/(σ√(2π)))ⁿ exp{ −(1/(2σ²)) (t2(x) − 2µ t1(x) + nµ²) },
t(x) = {t1(x), t2(x)} = (Σᵢ xi, Σᵢ xi²) and h(x) = 1.

Hence (Σᵢ xi, Σᵢ xi²) is jointly sufficient for θ = (µ, σ²).
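A small numerical illustration (my own sketch, with arbitrary values) : two different samples that share the same t(x) = (Σ xi, Σ xi²) give exactly the same normal log-likelihood, whatever (µ, σ²) we plug in.

import numpy as np

def normal_loglik(x, mu, sigma2):
    # log L(mu, sigma^2) = -n/2 * log(2*pi*sigma^2) - sum((x - mu)^2) / (2*sigma^2)
    x = np.asarray(x, dtype=float)
    n = len(x)
    return -0.5 * n * np.log(2 * np.pi * sigma2) - ((x - mu) ** 2).sum() / (2 * sigma2)

x1 = [1.0, 5.0, 6.0]   # sum = 12, sum of squares = 62
x2 = [2.0, 3.0, 7.0]   # same sum and sum of squares, but a different sample

for mu, s2 in [(0.0, 1.0), (4.0, 2.5), (-1.0, 10.0)]:   # arbitrary parameter values
    print(mu, s2, normal_loglik(x1, mu, s2), normal_loglik(x2, mu, s2))  # identical pairs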
We know that the ’better’ of two unbiased estimators θ̂1 and θ̂2 is the one with the smaller variance, but what about their performance relative to the other unbiased estimators of θ ?
Can there be a θ̂3 with smaller variance than θ̂1 and θ̂2 ?
Can the minimum variance unbiased estimator be identified ?
Cramér-Rao Inequality
If t is an unbiased estimator for γ(θ), a function of parameter θ, then
Var(t) ≥ {(d/dθ) γ(θ)}² / E[((∂/∂θ) log L)²] = {γ′(θ)}² / I(θ)     (1)

In other words, the Cramér-Rao inequality provides a lower bound {γ′(θ)}²/I(θ) to the variance of an unbiased estimator of γ(θ).
Proof
In proving this result, we assume that only a single parameter θ is
unknown. We also take the case of continuous random variables.
The case of discrete random variables can be dealt with similarly on
replacing the multiple integrals by appropriate multiple sums.
Proof (. . . )
3 The range of integration is independent of the parameter θ, so that f(x, θ) is differentiable under the integral sign.
If the range is not independent of θ but f is zero at the extremes of the range, i.e., f(a, θ) = 0 = f(b, θ), then (by the Leibniz integral rule)

(∂/∂θ) ∫ₐᵇ f dx = ∫ₐᵇ (∂f/∂θ) dx − f(a, θ) (∂a/∂θ) + f(b, θ) (∂b/∂θ)
⇒ (∂/∂θ) ∫ₐᵇ f dx = ∫ₐᵇ (∂f/∂θ) dx, since f(a, θ) = 0 = f(b, θ)
Cov(t, (∂/∂θ) log L) = E[t · (∂/∂θ) log L] − E(t) · E[(∂/∂θ) log L] = γ′(θ)     (6)
Proof (. . . )
We have : {r(X, Y)}² ≤ 1 ⇒ {Cov(X, Y)}² ≤ Var(X) · Var(Y)
NB : r(X, Y) = Cov(X, Y)/(SE(X) SE(Y)) or σXY/(σX σY)

∴ {Cov(t, (∂/∂θ) log L)}² ≤ Var(t) · Var((∂/∂θ) log L)
⇒ {γ′(θ)}² ≤ Var(t) [ E{((∂/∂θ) log L)²} − {E((∂/∂θ) log L)}² ]
⇒ {γ′(θ)}² ≤ Var(t) · E{((∂/∂θ) log L)²}   (using E[(∂/∂θ) log L] = 0)
⇒ Var(t) ≥ {γ′(θ)}² / E{((∂/∂θ) log L)²}     (7)
Corollary
If t is an unbiased estimator of the parameter θ itself, i.e., γ(θ) = θ and γ′(θ) = 1, then

Var(t) ≥ 1 / E[((∂/∂θ) log L)²] = 1/I(θ)     (8)

where I(θ) = E[((∂/∂θ) log L)²] is called by R.A. Fisher the amount of information on θ supplied by the sample (x1, . . . , xn), and its reciprocal 1/I(θ) the information limit to the variance of the estimator t = t(x1, . . . , xn).
Remarks :
An unbiased estimator t of γ(θ) for which the Cramér-Rao lower bound in (1) is attained is called a minimum variance bound (MVB) estimator.
We have

I(θ) = E[((∂/∂θ) log L)²] = −E[(∂²/∂θ²) log L]
Let us define p̂ = X̄ = Σᵢ Xᵢ / n, which is clearly an unbiased estimator of p.
Please note first that

Var(p̂) = Var(Σᵢ Xᵢ / n) = (1/n²) Var(Σᵢ Xᵢ) = (1/n²) np(1 − p) = p(1 − p)/n

and

∂ ln Lp(X1, . . . , Xn)/∂p = (1/(p(1 − p))) Σᵢ₌₁ⁿ Xᵢ − n/(1 − p)

The variance of ∂ ln Lp(X1, . . . , Xn)/∂p is therefore

Var(∂ ln Lp(X1, . . . , Xn)/∂p) = (1/(p²(1 − p)²)) Var(Σᵢ₌₁ⁿ Xᵢ)
                               = np(1 − p)/(p²(1 − p)²) = n/(p(1 − p))

Equivalently, using the second derivative,

E(∂² ln Lp(X1, . . . , Xn)/∂p²) = np(−1 + 2p)/(p²(1 − p)²) − n/(1 − p)²
                               = (−n + 2np − np)/(p(1 − p)²) = −n/(p(1 − p))

Since Var(p̂) = p(1 − p)/n = 1/I(p), the Cramér-Rao lower bound is attained : p̂ = X̄ is an MVB estimator of p.
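As an informal numerical check (not from the slides ; p, n and the number of replications are arbitrary), the sketch below estimates Var(p̂) and the variance of the score by simulation and compares them with p(1 − p)/n and n/(p(1 − p)).

import numpy as np

rng = np.random.default_rng(2)
p, n, reps = 0.3, 50, 200_000          # arbitrary illustration values

X = (rng.random((reps, n)) < p).astype(float)
p_hat = X.mean(axis=1)
score = X.sum(axis=1) / (p * (1 - p)) - n / (1 - p)   # d/dp of the log-likelihood at the true p

print("Var(p_hat) :", p_hat.var(), " CRLB p(1-p)/n      :", p * (1 - p) / n)
print("Var(score) :", score.var(), " I(p) = n/(p(1-p))  :", n / (p * (1 - p)))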
Exercise (. . .)
We have :
ln Lµ,σ²(X1, . . . , Xn) = −(n/2) ln(2πσ²) − (1/2) Σᵢ₌₁ⁿ ((Xi − µ)/σ)²
Exercise (. . .)
Differentiating with respect to µ gives ∂ ln Lµ,σ²(X1, . . . , Xn)/∂µ = (1/σ²) Σᵢ₌₁ⁿ (Xi − µ).
The variance of this expression becomes

Var(∂ ln Lµ,σ²(X1, . . . , Xn)/∂µ) = (1/σ⁴) E[(Σᵢ₌₁ⁿ (Xi − µ))²] = (1/σ⁴) nσ² = n/σ²

Let us now compare the inverse of this expression with the variance of X̄ :

Var(X̄) = σ²/n = 1 / Var(∂ ln Lµ,σ²(X1, . . . , Xn)/∂µ)

so X̄ attains the Cramér-Rao lower bound for µ.
Exercise (. . .)
We can also use the alternative formula for the Cramér-Rao lower bound, i.e. 1/(−E[∂² ln Lµ,σ²(X1, . . . , Xn)/∂µ²]), giving

∂² ln Lµ,σ²(X1, . . . , Xn)/∂µ² = −n/σ²

And therefore

1/(−E[∂² ln Lµ,σ²(X1, . . . , Xn)/∂µ²]) = σ²/n
Efficiency or bias
Ideally, we want to find an unbiased and efficient estimator.
However, one could imagine that, in some cases, a very precise estimator with a small bias could be preferred to an unbiased but imprecise estimator.
In other words, how should we compare 2 estimators, one biased and the other unbiased ?
The global measure of accuracy that covers both cases, with or without bias, is the Mean Squared Error (MSE).
Mean Squared Error
The Mean Squared Error of an estimator θ̂ of the parameter θ is MSE(θ̂) = E[(θ̂ − θ)²].
The Mean Squared Error is obviously related to the bias and variance of the estimator.
Thm
MSE (θ̂) = Var (θ̂) + [Bias(θ̂)]2
Proof :
Let θ̂ be an estimator of θ ; recall that the bias is b = E(θ̂) − θ.
Let τ be the expectation of θ̂, i.e. τ = E(θ̂).
MSE(θ̂) = E(θ̂ − θ)²
Let us replace θ̂ − θ by θ̂ − τ + τ − θ = θ̂ − τ + b, so that
(θ̂ − θ)² = (θ̂ − τ)² + b² + 2b(θ̂ − τ)
Let us take the expectation of each term.
The first one gives the variance of θ̂, the second one is the squared bias.
The third one is zero since
E[2b(θ̂ − τ)] = 2b E[θ̂ − τ] = 2b[E(θ̂) − τ] = 0,
because τ = E(θ̂) by definition.
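A quick simulation sketch of this decomposition (my own, with arbitrary values) : for the biased variance estimator s² = (1/n) Σ(xi − x̄)² it compares the empirical MSE with Var(s²) + [Bias(s²)]².

import numpy as np

rng = np.random.default_rng(3)
mu, sigma2, n, reps = 0.0, 4.0, 8, 200_000    # arbitrary illustration values

x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
s2 = x.var(axis=1)                 # ddof = 0 : the biased estimator (1/n) sum (xi - xbar)^2

mse = ((s2 - sigma2) ** 2).mean()
var = s2.var()
bias = s2.mean() - sigma2

print("MSE          :", mse)
print("Var + Bias^2 :", var + bias**2)        # the two numbers agree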
The relative efficiency of two estimators can then be defined in terms of MSE :

RE(θ̂1, θ̂2) = MSE(θ̂2) / MSE(θ̂1)
Consistent estimator
An estimator is consistent or asymptotically consistent when its MSE tends
towards zero as the sample size increases indefinitely.
A consistent estimator obviously has a bias and variance that tend towards
zero.
In a way, this means that the estimator tends to give the right answer with
certainty as the sample approaches a population census.
→ How to obtain an MVU estimator from any unbiased estimator through the use of a sufficient statistic (Blackwellisation).
Rao-Blackwell Theorem
Let U = U(x1, x2, . . . , xn) be an unbiased estimator of the parameter γ(θ) and let T = T(x1, x2, . . . , xn) be a sufficient statistic for γ(θ). Consider the function ϕ(T) of the sufficient statistic defined as ϕ(T) = E(U | T). Then ϕ(T) is also an unbiased estimator of γ(θ) and Var(ϕ(T)) ≤ Var(U).
Rao-Blackwell Theorem
Let θ̂ be an unbiased estimator of the parameter θ with E(θ̂²) < ∞. Suppose that T is sufficient for θ, and let θ* = E(θ̂ | T). Then, for all θ,
E[(θ* − θ)²] ≤ E[(θ̂ − θ)²].
Proof

E[(θ* − θ)²] = E[{E(θ̂ | T) − θ}²] = E[{E(θ̂ − θ | T)}²]
            ≤ E[E{(θ̂ − θ)² | T}] = E[(θ̂ − θ)²]

(the inequality is Jensen's inequality, {E(Z | T)}² ≤ E(Z² | T), applied conditionally).
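A small illustration of Blackwellisation by simulation (my construction, with arbitrary p and n) : start from the crude unbiased estimator U = X1 of a Bernoulli parameter p, condition on the sufficient statistic T = Σ Xi, which gives E(X1 | T) = T/n = X̄, and compare the variances.

import numpy as np

rng = np.random.default_rng(4)
p, n, reps = 0.4, 20, 200_000            # arbitrary illustration values

X = (rng.random((reps, n)) < p).astype(float)
U = X[:, 0]                # crude unbiased estimator : the first observation only
phi_T = X.mean(axis=1)     # E(X1 | T) = T/n, the Rao-Blackwellised estimator

print("E(U)      :", U.mean(),     " Var(U)      :", U.var())       # ~ p, ~ p(1-p)
print("E(phi(T)) :", phi_T.mean(), " Var(phi(T)) :", phi_T.var())   # ~ p, ~ p(1-p)/n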
The first principle of estimation is the most obvious and intuitive. It boils down to :
- estimate the moment of order 1, that is to say the expectation µ of the population, by the sample mean X̄,
- estimate the moment of order 2, that is E(X²) of the population, by (1/n) Σᵢ₌₁ⁿ Xi²,
- and so on. . .
Method of moments
Let X1, . . . , Xn be i.i.d. with parameter θ = (θ1, . . . , θK)′. Let us note
- µ′k(θ) := E[Xᵏ], k = 1, 2, . . . the theoretical moments
- m′k := (1/n) Σᵢ₌₁ⁿ Xᵢᵏ, k = 1, 2, . . . the corresponding empirical moments
Let us assume that the theoretical moments exist and are finite up to order K at least. These moments are functions of the parameter θ.
The method of moments consists in taking as the estimator of θ the solution θ̂ of the system

µ′1(θ) = m′1
. . .
µ′K(θ) = m′K

(i.e. a system of K equations with K unknowns θ1, . . . , θK.)
Examples :
- For a Bernoulli(p) population : µ′1 = E(X) = p, so the method of moments gives p̂ = X̄.
- For a N(µ, σ²) population, the system is
  (µ′1 =) µ = X̄
  (µ′2 =) σ² + µ² = (1/n) Σᵢ₌₁ⁿ Xi²
  so µ̂ = X̄ and σ̂² = (1/n) Σᵢ₌₁ⁿ Xi² − X̄².
- For a uniform population on (0, θ) : E(X) = θ/2, so θ̂ = 2X̄.
  Yet, we know that P(θ̂ = 2X̄ < Xmax) > 0, i.e. the estimate can fall below an observed value.
The method of moments works well in general ; the case of a uniform law gives an illustration of an example for which the method of moments fails to deliver a good estimator.
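A short simulation sketch of this failure (mine, with an arbitrary θ and n) : it counts how often the method-of-moments estimate 2X̄ falls below the sample maximum, which is logically impossible for the true θ.

import numpy as np

rng = np.random.default_rng(5)
theta, n, reps = 10.0, 5, 100_000          # arbitrary illustration values

X = rng.uniform(0.0, theta, size=(reps, n))
theta_mom = 2.0 * X.mean(axis=1)           # method-of-moments estimator
x_max = X.max(axis=1)

share_impossible = (theta_mom < x_max).mean()
print("share of samples with 2*Xbar < max(Xi) :", share_impossible)   # clearly > 0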
As its name suggests, this method uses the likelihood function, defined previously as L(x1, . . . , xn) = Πᵢ₌₁ⁿ f(xi), where f(x) is the distribution or density of the population.
The likelihood function describes the joint distribution or joint density of
the observations.
This method is based on the idea that the values of the sample
observations must be plausible : we therefore try to give the unknown
parameters values that maximise this likelihood.
"Since the result has been observed, it means that it had a high probability
of happening" (or so we hope).
Likelihood function
Let x1, . . . , xn be a random sample of size n from a population whose density (or distribution) is f(x), and let θ be an unknown parameter. The likelihood function can be written (in the continuous case) :

L(x1, . . . , xn, θ) = f(x1, θ) f(x2, θ) . . . f(xn, θ) = Πᵢ₌₁ⁿ f(xi, θ)

You will also find the following notation (respectively continuous and discrete) :

Lθ(X) = Πᵢ₌₁ⁿ fθ(xi) and Lθ(X) = Πᵢ₌₁ⁿ pθ(xi)
The maximum likelihood estimator is the value of θ that maximises the likelihood :
θ̂ = argmaxθ Lθ(X)
or, equivalently,
θ̂ = argmaxθ log Lθ(X)
Thus if there exists a function θ̂ = (θ̂1 , θ̂2 , . . . , θ̂k ) of the sample values
which maximises L for variations in θ, then θ̂ is to be taken as an estimator
of θ. θ̂ is usually called Maximum Likelihood Estimator (MLE).
∂L/∂θ = 0 and ∂²L/∂θ² < 0     (10)

Since L > 0, and log L is a non-decreasing function of L, L and log L attain their extreme values (maxima or minima) at the same values of θ̂. The first of the two equations in (10) can be rewritten as

(1/L) ∂L/∂θ = 0 ⇒ ∂ log L/∂θ = 0     (11)

which is much more convenient from a practical point of view.
If θ is a vector-valued parameter, then θ̂ = (θ̂1, θ̂2, . . . , θ̂k) is given by the solution of the simultaneous equations

(∂/∂θi) log L = (∂/∂θi) log L(θ1, θ2, . . . , θk) = 0, i = 1, 2, . . . , k
Properties of MLE
Under regularity conditions (not displayed)
Thm
With probability approaching unity as n → ∞, the likelihood equation ∂ log L/∂θ = 0 has a solution which converges in probability to the true value θ0.
MLEs are consistent, but not always unbiased : for the Normal distribution, for instance, the MLE of σ² is s² = (1/n) Σ(xi − x̄)², which is biased.
Properties of MLE (. . .)
Thm
If the MLE exists, it is the most efficient in the class of such estimators.
Thm
If a sufficient estimator exists, it is a function of the Maximum Likelihood
Estimator.
Thm
If for a given population with pdf f (x, θ), an MVB estimator T exists for θ
then the likelihood equation will have a solution equal to the estimator T .
Taking logarithms,

log Lp(X1, . . . , Xn) = (Σᵢ₌₁ⁿ Xi) log(p) + (n − Σᵢ₌₁ⁿ Xi) log(1 − p)

and

∂ log Lp(X1, . . . , Xn)/∂p = (1/p) Σᵢ₌₁ⁿ Xi − (1/(1 − p)) (n − Σᵢ₌₁ⁿ Xi)
                           = (Σᵢ₌₁ⁿ Xi) (1/p + 1/(1 − p)) − n/(1 − p)

Setting this derivative equal to zero and multiplying through by p(1 − p) gives

Σᵢ₌₁ⁿ Xi − np = 0,  or  Σᵢ₌₁ⁿ Xi = np

The solution is

p̂ = (1/n) Σᵢ₌₁ⁿ Xi
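The closed form can be cross-checked numerically. Below is a sketch (my own ; the simulated data and the grid are arbitrary) that maximises the Bernoulli log-likelihood over a grid of p values and compares the maximiser with X̄.

import numpy as np

rng = np.random.default_rng(6)
X = (rng.random(100) < 0.35).astype(float)      # simulated Bernoulli data, true p = 0.35

def log_lik(p, X):
    # log L(p) = sum(X)*log(p) + (n - sum(X))*log(1 - p)
    s, n = X.sum(), len(X)
    return s * np.log(p) + (n - s) * np.log(1 - p)

grid = np.linspace(0.001, 0.999, 9999)          # avoid the endpoints 0 and 1
p_hat_numeric = grid[np.argmax(log_lik(grid, X))]

print("numerical maximiser :", p_hat_numeric)
print("closed form Xbar    :", X.mean())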
Lµ,σ²(X1, . . . , Xn) = Πᵢ₌₁ⁿ (1/√(2πσ²)) exp{ −(Xi − µ)²/(2σ²) }
                      = (2πσ²)^(−n/2) exp{ −(1/(2σ²)) Σᵢ₌₁ⁿ (Xi − µ)² }

Taking logarithms,

log Lµ,σ²(X1, . . . , Xn) = −(n/2) log(2π) − (n/2) log(σ²) − (1/(2σ²)) Σᵢ₌₁ⁿ (Xi − µ)²

Setting the partial derivatives with respect to µ and σ² equal to zero yields µ̂ = X̄ and σ̂² = (1/n) Σᵢ₌₁ⁿ (Xi − X̄)².
Lθ(X1, . . . , Xn) = 1/θⁿ, for θ ≥ Xmax := maxᵢ Xi (and 0 otherwise)

log Lθ(X1, . . . , Xn) = −n log(θ)

∂ log Lθ(X1, . . . , Xn)/∂θ = −n/θ < 0

We can say that Lθ(X1, . . . , Xn) is a decreasing function for θ ≥ Xmax, which implies that Lθ(X1, . . . , Xn) is maximised when θ = Xmax.
We therefore have

θ̂ = maxᵢ Xi
Definition
A confidence interval at confidence level (1 − α) for θ is an interval [t1(.), t2(.)] such that :
- t1(.) and t2(.) are statistics
- P[t1(.) ≤ θ ≤ t2(.)] ≥ 1 − α ∀θ

[Figure : a point estimate t(.) compared with interval estimates [t1(.), t2(.)].]

Confidence interval estimation : the smaller the interval, the greater the accuracy
→ it incorporates a margin of error or sampling error.
We therefore have

P[(X̄ − µ)/(σ/√n) ≤ zα/2] = α/2 ∀µ
P[(X̄ − µ)/(σ/√n) ≤ z1−α/2] = 1 − α/2 ∀µ
P[(X̄ − µ)/(σ/√n) ≥ z1−α/2] = α/2 ∀µ

⇒ P[zα/2 ≤ (X̄ − µ)/(σ/√n) ≤ z1−α/2] = 1 − α ∀µ

with zα/2 = −z1−α/2
[Figure : density of (X̄ − µ)/(σ/√n), standard normal, with quantiles zα/2 = −z1−α/2, 0 and z1−α/2.]
P[zα/2 ≤ (X̄ − µ)/(σ/√n) ≤ z1−α/2] = 1 − α ∀µ
P[zα/2 σ/√n ≤ X̄ − µ ≤ z1−α/2 σ/√n] = 1 − α ∀µ
P[X̄ − z1−α/2 σ/√n ≤ µ ≤ X̄ − zα/2 σ/√n] = 1 − α ∀µ
P[X̄ − z1−α/2 σ/√n ≤ µ ≤ X̄ + z1−α/2 σ/√n] = 1 − α ∀µ

with t1(X1, . . . , Xn) = X̄ − z1−α/2 σ/√n and t2(X1, . . . , Xn) = X̄ + z1−α/2 σ/√n.
[Figure : sampling distribution of X̄ around µ, with bounds µ − z1−α/2 σ/√n and µ + z1−α/2 σ/√n.]
The statistician has no way of checking what the true value of µ is and, even if the number of intervals containing µ is the ’expected’ 19 out of 20, it is not possible to know which intervals actually contain µ.

[Figure : 20 confidence intervals and the population mean µ.]
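The figure can be reproduced in spirit with a short simulation (my own sketch ; µ, σ, n and the number of intervals are arbitrary) : it builds 20 independent 95% confidence intervals and reports how many contain the true mean.

import numpy as np

rng = np.random.default_rng(7)
mu, sigma, n, n_intervals, z = 100.0, 15.0, 25, 20, 1.96   # arbitrary illustration values

covered = 0
for _ in range(n_intervals):
    x = rng.normal(mu, sigma, n)
    half_width = z * sigma / np.sqrt(n)
    lo, hi = x.mean() - half_width, x.mean() + half_width
    covered += (lo <= mu <= hi)

print(f"{covered} intervals out of {n_intervals} contain the true mean")  # 19 expected on average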
Exercise
A random sample of n = 64 observations has mean X̄ = 20 ; the population standard deviation is σ = 6.
Find the standard error of the mean, the margin of error, and the lower and upper bounds of an interval at the 95% confidence level for the population mean µ.
Exercise (Solution)
The standard error of the mean : σ/√n = 6/√64 = 0.75
The margin of error : z1−α/2 σ/√n = 1.96 × 0.75 = 1.47
The confidence interval at 95% is the following :

[X̄ − z1−α/2 σ/√n, X̄ + z1−α/2 σ/√n] = [18.53, 21.47]
The confidence level of the interval implies that, in the long term, 95% of
the intervals found by following this procedure contain the true value of the
population mean.
However, we cannot know whether this interval is part of the 95% good or
the 5% bad without knowing µ.
[t1(X1, . . . , Xn), t2(X1, . . . , Xn)] = [X̄ − z1−α/2 S/√n, X̄ + z1−α/2 S/√n] = [X̄ ± z1−α/2 S/√n]

In this particular case, we made the assumption that n is large (this is why we used the Normal distribution).
If n is small, Z = (X̄ − µ)/(S/√n) is not N(0, 1) and we have to use Student’s t distribution (with n − 1 degrees of freedom).
Sample size
In practice, it often happens that we want to find the sample size n necessary for the confidence interval constructed, at confidence level (1 − α), to have a width of at most 2 MoE.
Using the sampling error formula, we obtain :

√n = z1−α/2 σ/MoE ⇒ n = (z1−α/2)² σ²/MoE²

And in the particular case of a proportion :

√n = z1−α/2 √(p̂(1 − p̂))/MoE ⇒ n = (z1−α/2)² p̂(1 − p̂)/MoE²

In the case of Bernoulli sampling, we may want to find the minimum sample size n required to know, at confidence level (1 − α), the unknown proportion p to within 0.01 (to within 1%).
Note that :
(1) the minimum sample size must be an integer (i.e. rounded up) ;
(2) if you have an estimate of p, you can obviously use it.
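For instance, for the 1% case mentioned above at the 95% level, with no prior estimate of p one can use the worst case p̂ = 0.5 ; a tiny helper (my own sketch) computes the rounded-up sample size.

import math

def required_n(moe, z=1.96, p_hat=0.5):
    # Minimum n so that the margin of error for a proportion is at most `moe`.
    # p_hat = 0.5 is the worst case if no prior estimate of p is available.
    return math.ceil(z**2 * p_hat * (1 - p_hat) / moe**2)

print(required_n(0.01))          # 9604 : know p to within 1% at the 95% level
print(required_n(0.02))          # 2401 : to within 2%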
Exercise 1
Scholastic Aptitude Test (SAT) mathematics scores of a random sample of
500 high school seniors in the state of Texas are collected, and the sample
mean and standard deviation are found to be 501 and 112, respectively.
Find a 99% confidence interval on the mean SAT mathematics score for
seniors in the state of Texas.
Source : Probability & Statistics for Engineers & Scientists, Walpole et al.,
9th edition, Prentice Hall
Exercise 2
A sample of 100 voters chosen at random from all the voters in a borough
showed that 54 of them were in favour of a certain candidate.
a) Construct a 98% confidence interval for the percentage of votes received
by this candidate.
b) What sample size would be needed to obtain an estimate to within 2%
with a probability of 95% ?