
Estimation

Characteristics of estimators
Unbiasedness
Consistency
Efficiency
Sufficiency

Cramér-Rao
Cramér-Rao lower bound
Efficiency or bias
Blackwellisation

Methods to find estimators


Method of moments
Maximum Likelihood Method

Confidence Interval Estimation


The aim of statistics is to draw inferences from a number of observations.
Two important problems in statistical inference are estimation and tests of
hypotheses.

Intuitively, the problem is as follows : suppose that a certain characteristic


of the elements in a population can be represented by a random variable X
of density f (x, θ) whose form is known, but which contains an unknown
parameter θ.
Parameter space
Let X be a random variable with p.d.f. f (x, θ).
The unknown parameter θ may take any value on a set Θ.
The p.d.f. will be written f (x, θ), θ ∈ Θ.
The set Θ, i.e. the set of all possible values of θ, is called the parameter
space.

O. Dagnelie (SSSIHL) Stat Inf 2024-25 2 / 104


From a random sample x1 , . . . , xn we can estimate the value of the
unknown parameter θ (or of a function τ (θ) of θ).

Statistic - Parameter - Estimator - Estimate


Any function of the random sample x1 , . . . , xn that is being observed, say
Tn (x1 , . . . , xn ) is called a statistic. Clearly, a statistic is a random variable.
If it is used to estimate an unknown parameter θ of the distribution, it is
called an estimator.
A particular value of an estimator, say Tn (x1 , . . . , xn ), is called an estimate
of θ.

We often use the notation "ˆ" to represent the estimator of the unknown
parameter, for example θ̂ for θ.

O. Dagnelie (SSSIHL) Stat Inf 2024-25 3 / 104


The estimation of θ can be done in two different ways:
- we give ourselves a statistic t(x1 , . . . , xn ) which makes it possible to
evaluate the parameter θ; this is the simple or point estimate
(point estimation).
- we give ourselves two statistics, t1 (x1 , . . . , xn ) and t2 (x1 , . . . , xn ) with
t1 (.) < t2 (.), which can be used to define an interval and the
probability that the interval contains θ; this is confidence interval
estimation.

O. Dagnelie (SSSIHL) Stat Inf 2024-25 4 / 104


Examples of estimators of µ
Let x1 , . . . , xn be independent and identically distributed (i.i.d.)
N(µ, σ²), µ ∈ R, σ² ∈ R⁺.
The following statistics all have values in R, and are therefore estimators of
µ:
1 x̄ (arithmetic mean)
2 x̃ or x^(n)_{1/2} (empirical median)
3 (1/2)(x_min + x_max) (average of the extremes)
4 (1/2)(x^(n)_{1/4} + x^(n)_{3/4}) (middle of the interquartile range)
5 (1/80) ∑_{i=11}^{90} x_i (truncated or "trimmed" mean)
6 x1 (first observation)
7 ...

O. Dagnelie (SSSIHL) Stat Inf 2024-25 5 / 104
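A quick numerical comparison of these estimators helps fix ideas. The sketch below is not part of the original slides; it assumes Python with NumPy, and the values µ = 50, σ = 10, n = 100 are arbitrary. All six statistics are centred near µ, but their dispersions differ markedly, which previews the notion of efficiency discussed below:

import numpy as np

rng = np.random.default_rng(0)
n, reps, mu, sigma = 100, 10_000, 50.0, 10.0

results = {name: [] for name in
           ("mean", "median", "mid-range", "mid-IQR", "trimmed mean", "x1")}

for _ in range(reps):
    x = rng.normal(mu, sigma, size=n)
    xs = np.sort(x)
    results["mean"].append(x.mean())
    results["median"].append(np.median(x))
    results["mid-range"].append((x.min() + x.max()) / 2)
    results["mid-IQR"].append((np.quantile(x, 0.25) + np.quantile(x, 0.75)) / 2)
    results["trimmed mean"].append(xs[10:90].mean())   # drop the 10 smallest and 10 largest
    results["x1"].append(x[0])

for name, vals in results.items():
    vals = np.array(vals)
    print(f"{name:>12}: mean = {vals.mean():6.2f}, sd = {vals.std():5.2f}")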


Characteristics of estimators

Characteristics of estimators

How do you tell the difference between a good estimator and a


not-so-good estimator ?
What are the desirable properties of an estimator ?
The properties of an estimator are, in fact, the properties of its
sampling distribution.

a. Unbiasedness
b. Consistency
c. Efficiency
d. Sufficiency

O. Dagnelie (SSSIHL) Stat Inf 2024-25 6 / 104


Characteristics of estimators Unbiasedness

a. Unbiased estimators.
To understand this part, it is important to remember the difference
between an estimator (or statistic) and a parameter.
The parameter, generally represented by θ, is a unique but unknown value
(to know it with certainty we would have to carry out a census).
The unknown parameter is, as it were, the intended target.
The statistic or estimator, represented by θ̂, is a function of the
observations, i.e. the sample.
Since the sample is random, the statistic is also random. If the sampling is
repeated several times, possibly with replacement, the results may be
different each time.
It is therefore impossible to guarantee that, for each sample, the estimator
will give exactly the value of the unknown parameter.

O. Dagnelie (SSSIHL) Stat Inf 2024-25 7 / 104


Characteristics of estimators Unbiasedness

We will therefore simply ask that the estimator does not systematically
miss.
To be more precise, we will ask that the estimator be unbiased i.e. that in
expectation it gives the value we are looking for.

Unbiasedness
An estimator Tn = T (x1 , x2 , . . . , xn ) is said to be an unbiased estimator of
γ(θ) if E (Tn ) = γ(θ) for all θ ∈ Θ.

Bias
If E (Tn ) > θ, Tn is said to be positively biased.
If E (Tn ) < θ, Tn is said to be negatively biased.
The amount of bias, b(θ̂), is given by

b(θ̂) = E (Tn ) − γ(θ), θ ∈ Θ

Bias can be written b(θ̂) or simply b.


O. Dagnelie (SSSIHL) Stat Inf 2024-25 8 / 104
Characteristics of estimators Unbiasedness

Example
Let x1 , . . . , xn be i.i.d. with E[xi ] = µ < ∞. Then

x̄ := (1/n) ∑_{i=1}^{n} xi

is an unbiased estimator of µ (so is x1 , for instance).


Can you show it ?

O. Dagnelie (SSSIHL) Stat Inf 2024-25 9 / 104


Characteristics of estimators Unbiasedness

Reminder
Expectation properties

E (c) = c
E (x + c) = E (x) + c
E (ax + c) = aE (x) + c
E (ax1 + bx2 + c) = aE (x1 ) + bE (x2 ) + c

Variance properties

Var (c) = 0
Var (x + c) = Var (x)
Var (ax + c) = a2 Var (x)
Var (ax1 ± bx2 + c) = a2 Var (x1 ) + b 2 Var (x2 ) ± 2abCov (x1 , x2 )

O. Dagnelie (SSSIHL) Stat Inf 2024-25 10 / 104


Characteristics of estimators Unbiasedness

Example : Bernoulli Sample


Let x1 , . . . , xn be i.i.d. ∼ Bin(1, p) with p ∈ (0, 1). Then

p̂ := (1/n) ∑_{i=1}^{n} xi

is an unbiased estimator of p since

E[ (1/n) ∑_{i=1}^{n} xi ] = (1/n) ∑_{i=1}^{n} E[xi ] = (1/n) · np = p,   p ∈ (0, 1)

O. Dagnelie (SSSIHL) Stat Inf 2024-25 11 / 104


Characteristics of estimators Unbiasedness

Example : Empirical variance


Let x1 , . . . , xn be i.i.d. with Var(xi ) = σ² < ∞ and E(xi ) = µ. Then

s² := (1/n) ∑_{i=1}^{n} (xi − x̄)² = (1/n) ∑_{i=1}^{n} xi² − x̄²

is a biased estimator of σ². We indeed have

E[s²] = E[ (1/n) ∑_{i=1}^{n} xi² ] − E[x̄²]
      = (n/n) E[x²] − ( Var(x̄) + E²(x̄) )
      = Var(x) + E²(x) − Var(x̄) − E²(x̄)
      = σ² + µ² − σ²/n − µ²
      = ((n − 1)/n) σ² < σ²
O. Dagnelie (SSSIHL) Stat Inf 2024-25 12 / 104
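A small simulation makes the bias visible (a sketch added here, not from the slides; Python/NumPy assumed, with n = 10 and σ² = 4 chosen arbitrarily): the average of s² over many samples is close to ((n − 1)/n)σ², whereas the n − 1 divisor removes the bias, as proved in the next exercise.

import numpy as np

rng = np.random.default_rng(1)
n, reps, sigma2 = 10, 100_000, 4.0

s2_vals, S2_vals = [], []
for _ in range(reps):
    x = rng.normal(0.0, np.sqrt(sigma2), size=n)
    s2_vals.append(x.var(ddof=0))   # divides by n      -> biased estimator s^2
    S2_vals.append(x.var(ddof=1))   # divides by n - 1  -> unbiased estimator S^2

print("E[s^2] ≈", np.mean(s2_vals), "  theory:", (n - 1) / n * sigma2)
print("E[S^2] ≈", np.mean(S2_vals), "  theory:", sigma2)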
Characteristics of estimators Unbiasedness

Exercise
Show that S² := (1/(n − 1)) ∑_{i=1}^{n} (xi − x̄)² is an unbiased estimator of σ².
Let us start by computing a more practical formula for the sampling
variance :

S² = (1/(n − 1)) ∑_{i=1}^{n} (xi − x̄)²
   = (1/(n − 1)) ∑_{i=1}^{n} (xi² − 2xi x̄ + x̄²)
   = (1/(n − 1)) ∑_{i=1}^{n} xi² − (2n/(n − 1)) x̄² + (n/(n − 1)) x̄²
   = (1/(n − 1)) ∑_{i=1}^{n} xi² − (n/(n − 1)) x̄²

O. Dagnelie (SSSIHL) Stat Inf 2024-25 13 / 104


Characteristics of estimators Unbiasedness

Exercise (. . .)
Let us show that S² is an unbiased estimator of σ² :

E(S²) = E[ (1/(n − 1)) ∑_{i=1}^{n} xi² − (n/(n − 1)) x̄² ]
      = (1/(n − 1)) ∑_{i=1}^{n} E(xi²) − (n/(n − 1)) E(x̄²)
      = (n/(n − 1)) [ Var(x) + E²(x) ] − (n/(n − 1)) [ Var(x̄) + E²(x̄) ]
      = (n/(n − 1)) [ σ² + µ² ] − (n/(n − 1)) [ σ²/n + µ² ]
      = (n/(n − 1)) · ((n − 1)/n) · σ² = σ²

S² = (n/(n − 1)) s² is an unbiased estimator of σ² (but S is not an unbiased estimator of σ).

O. Dagnelie (SSSIHL) Stat Inf 2024-25 14 / 104


Characteristics of estimators Consistency

b. Consistent estimator.
An estimator Tn = T (x1 , x2 , . . . , xn ), based on a random sample of size n,
is said to be a consistent estimator of γ(θ), θ ∈ Θ, the parameter space, if
Tn converges to γ(θ) in probability, i.e., if Tn →ᵖ γ(θ) as n → ∞.
In other words, Tn is a consistent estimator of γ(θ) if for every
ε > 0, η > 0, there exists a positive integer m = m(ε, η) such that
P{|Tn − γ(θ)| < ε} → 1 as n → ∞
⇒ P{|Tn − γ(θ)| < ε} > 1 − η  ∀n ≥ m, where m is some sufficiently large value
of n.

Remarks
By Khinchine’s weak law of large numbers, the sample mean x̄ is always a
consistent estimator of the population mean µ.
Consistency is a property concerning the behaviour of an estimator for
indefinitely large values of the sample size n, i.e., as n → ∞.
Nothing is said about its behaviour for finite n.
O. Dagnelie (SSSIHL) Stat Inf 2024-25 15 / 104
Characteristics of estimators Consistency

Note :
A sufficient condition for consistency is that the estimator is unbiased and
that its variance tends towards 0 when n becomes large.
This condition is however not necessary.
A biased estimator may be consistent, if the bias disappears when the size
of the sample increases.

[Figure – Asymptotic properties of estimators: plims and consistency. Probability density functions of an estimator for n = 25, 100, 400 and 1600: as n increases, the sampling distribution concentrates ever more tightly around the true value.]
O. Dagnelie (SSSIHL) Stat Inf 2024-25 16 / 104
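The behaviour shown in the figure can be reproduced numerically. The sketch below is an added illustration (Python/NumPy; µ = 100 and σ = 50 are arbitrary choices roughly matching the plot's scale): as n grows, sd(x̄) shrinks like σ/√n and P(|x̄ − µ| < ε) tends to 1.

import numpy as np

rng = np.random.default_rng(2)
mu, sigma, reps = 100.0, 50.0, 5_000

for n in (25, 100, 400, 1600):
    xbars = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
    close = np.mean(np.abs(xbars - mu) < 2.0)          # P(|x̄ - µ| < 2)
    print(f"n = {n:5d}: sd(x̄) = {xbars.std():5.2f}, P(|x̄ - µ| < 2) ≈ {close:.3f}")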
Characteristics of estimators Consistency

Invariance property of consistent estimators


Thm : If Tn is a consistent estimator of γ(θ) and Ψ{γ(θ)} is a continuous
function of γ(θ), then Ψ(Tn ) is a consistent estimator of Ψ{γ(θ)}.

Sufficient conditions for consistency


Thm : Let {Tn } be a sequence of estimators such that for all θ ∈ Θ,
(i) Eθ (Tn ) → γ(θ), n → ∞ and
(ii) Varθ (Tn ) → 0, as n → ∞
Then, Tn is a consistent estimator of γ(θ).

O. Dagnelie (SSSIHL) Stat Inf 2024-25 17 / 104


Characteristics of estimators Consistency

Example
Prove that in sampling from a N(µ, σ²) population, the sample mean is a
consistent estimator of µ.

In sampling from a N(µ, σ²) population, the sample mean x̄ is also
normally distributed, x̄ ∼ N(µ, σ²/n), i.e., E(x̄) = µ and V(x̄) = σ²/n.
Thus as n → ∞, E(x̄) = µ and V(x̄) → 0.
Hence, by the previous theorem, x̄ is a consistent estimator of µ.

O. Dagnelie (SSSIHL) Stat Inf 2024-25 18 / 104


Characteristics of estimators Efficiency

c. Efficient estimator.

In addition to the need for an estimator to be accurate in terms of


expectation, we also want it to be sufficiently precise.
In other words, we want the observed values to be sufficiently concentrated
around the expected value.
The efficiency of an estimator is therefore linked to the low dispersion of
the results observed, and therefore to its low variance.

Intuition of efficiency
If, of the two consistent estimators T1 , T2 of a certain parameter θ, we
have V (T1 ) < V (T2 ), for all n
then T1 is more efficient than T2 for all sample sizes.

O. Dagnelie (SSSIHL) Stat Inf 2024-25 19 / 104


Characteristics of estimators Efficiency

Most efficient estimator


If in a class of consistent estimators for a parameter, there exists one whose
sampling variance is less than that of any such estimator, it is called the
most efficient estimator. Whenever such an estimator exists, it provides a
criterion for measurement of efficiency of the other estimators.

Efficiency
If T1 is the most efficient estimator with variance V1 and T2 is any other
estimator with variance V2 , then the efficiency E of T2 is defined as :
V1
E=
V2
Obviously, E cannot exceed unity.

O. Dagnelie (SSSIHL) Stat Inf 2024-25 20 / 104


Characteristics of estimators Efficiency

Relative efficiency
Let T1 and T2 be unbiased estimators with respectively variance V1 and
V2 , then the relative efficiency of T1 with respect to T2 is :
V2
RE (T1 , T2 ) =
V1
T1 is relatively more efficient than T2 if RE (T1 , T2 ) ≥ 1

O. Dagnelie (SSSIHL) Stat Inf 2024-25 21 / 104




Characteristics of estimators Efficiency

Minimum Variance Unbiased (M.V.U.) Estimators


If a statistic Tn = T (x1 , x2 , . . . , xn ), based on a sample of size n, is such
that
(i) T is unbiased for γ(θ), for all θ ∈ Θ and
(ii) It has the smallest variance among the class of all unbiased estimators
of γ(θ),
then T is called the minimum variance unbiased estimator (MVUE) of γ(θ).
→ link with Cramér-Rao lower bound

More precisely, T is MVUE of γ(θ) if

Eθ (T ) = γ(θ) for all θ ∈ Θ


and Varθ (T ) ≤ Varθ (T ′ ) for all θ ∈ Θ

where T ′ is any other unbiased estimator of γ(θ).

O. Dagnelie (SSSIHL) Stat Inf 2024-25 23 / 104


Characteristics of estimators Efficiency

Example
Let x1 , . . . , xn be i.i.d. with E[xi ] = µ < ∞ and Var(xi ) = σ² < ∞.
x̄ and x1 are 2 unbiased estimators for µ (Why ?).
Which is the most efficient estimator ?

As soon as n > 1, Var(x̄) < Var(x1 ) ⇒ x̄ is more efficient than x1 .
The relative efficiency of x̄ with respect to x1 is given by :

RE(x̄, x1 ) = Var(x1 )/Var(x̄) = σ²/(σ²/n) = n

The sample mean x̄ is more efficient than x1 if n > 1.

O. Dagnelie (SSSIHL) Stat Inf 2024-25 24 / 104


Characteristics of estimators Efficiency

Exercise
Let x1 , x2 , x3 be a random sample from a normal population whose µ and σ
are unknown.
(1) Show that µ̂1 and µ̂2 are unbiased.
(2) Which estimator of µ is the most efficient, µ̂1 or µ̂2 ?

µ̂1 = (1/4)x1 + (1/2)x2 + (1/4)x3
µ̂2 = (1/3)x1 + (1/3)x2 + (1/3)x3

O. Dagnelie (SSSIHL) Stat Inf 2024-25 25 / 104


Characteristics of estimators Efficiency

Exercise (. . .)

E(µ̂1 ) = E[ (1/4)x1 + (1/2)x2 + (1/4)x3 ]
       = (1/4)E(x1 ) + (1/2)E(x2 ) + (1/4)E(x3 )
       = (1/4)µ + (1/2)µ + (1/4)µ
       = µ

E(µ̂2 ) = E[ (1/3)x1 + (1/3)x2 + (1/3)x3 ]
       = (1/3)E(x1 ) + (1/3)E(x2 ) + (1/3)E(x3 )
       = (1/3)µ + (1/3)µ + (1/3)µ
       = µ
O. Dagnelie (SSSIHL) Stat Inf 2024-25 26 / 104
Characteristics of estimators Efficiency

Exercise (. . .)

Var(µ̂1 ) = Var[ (1/4)x1 + (1/2)x2 + (1/4)x3 ]
         = (1/16)Var(x1 ) + (1/4)Var(x2 ) + (1/16)Var(x3 )
         = 3σ²/8

Var(µ̂2 ) = Var[ (1/3)x1 + (1/3)x2 + (1/3)x3 ]
         = (1/9)Var(x1 ) + (1/9)Var(x2 ) + (1/9)Var(x3 )
         = 3σ²/9 = σ²/3

µ̂2 is more efficient than µ̂1 , since its variance is smaller.
O. Dagnelie (SSSIHL) Stat Inf 2024-25 27 / 104
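As a check, a simulation sketch added here (not in the slides; Python/NumPy, with µ = 10 and σ = 2 arbitrary) shows that both estimators average to µ while their empirical variances reproduce 3σ²/8 and σ²/3:

import numpy as np

rng = np.random.default_rng(3)
mu, sigma, reps = 10.0, 2.0, 200_000

x = rng.normal(mu, sigma, size=(reps, 3))
mu1_hat = 0.25 * x[:, 0] + 0.50 * x[:, 1] + 0.25 * x[:, 2]
mu2_hat = x.mean(axis=1)

print("µ̂1: mean =", mu1_hat.mean(), " var =", mu1_hat.var())   # ≈ 3σ²/8 = 1.5
print("µ̂2: mean =", mu2_hat.mean(), " var =", mu2_hat.var())   # ≈ σ²/3 ≈ 1.33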
Characteristics of estimators Efficiency

Theorems on MVUE

Thm
An M.V.U. is unique in the sense that if T1 and T2 are M.V.U. estimators
for γ(θ), then T1 = T2 , almost surely.

Thm
Let T1 and T2 be unbiased estimators of γ(θ) with efficiencies e1 and e2
respectively and ρ = ρθ be the correlation coefficient between them. Then
√(e1 e2 ) − √{(1 − e1 )(1 − e2 )} ≤ ρ ≤ √(e1 e2 ) + √{(1 − e1 )(1 − e2 )}

Corollary
If we take e1 = 1 and e2 = e in the previous inequality, we get
√e ≤ ρ ≤ √e ⇒ ρ = √e.

O. Dagnelie (SSSIHL) Stat Inf 2024-25 28 / 104


Characteristics of estimators Efficiency

Thm
If T1 is an MVU estimator of γ(θ), θ ∈ Θ and T2 is any other unbiased
estimator of γ(θ) with efficiency e = eθ , then the correlation coefficient
√ √
between T1 and T2 is given by ρ = e, i.e. , ρθ = eθ , ∀θ ∈ Θ.

Thm
If T1 is an MVUE of γ(θ) and T2 is any other unbiased estimator of γ(θ)
with efficiency e < 1, then no unbiased linear combination of T1 and T2
can be an MVUE of γ(θ).

Other result
The correlation coefficient between a most efficient estimator and any
other estimator with efficiency e is √e.

O. Dagnelie (SSSIHL) Stat Inf 2024-25 29 / 104


Characteristics of estimators Sufficiency

d. Sufficient estimator.

An estimator is said to be sufficient for a parameter, if it contains all the


information in the sample regarding the parameter.
Sufficiency can provide insights as to finding estimators with smallest
variance.
Note that maximum likelihood estimators are necessarily functions of
sufficient estimators.

Sufficiency
If T = T (x1 , x2 , . . . , xn ) is an estimator of a parameter θ, based on a
sample x1 , x2 , . . . , xn of size n from the population with density f (x, θ)
such that the conditional distribution of x1 , x2 , . . . , xn given T , is
independent of θ, the statistic T is a sufficient estimator for θ.

O. Dagnelie (SSSIHL) Stat Inf 2024-25 30 / 104


Characteristics of estimators Sufficiency

Example
As an example, the sample mean is sufficient for the mean (µ) of
a normal distribution with known variance. Once the sample mean
is known, no further information about (µ) can be obtained from
the sample itself. On the other hand, for an arbitrary distribution
the median is not sufficient for the mean : even if the median
of the sample is known, knowing the sample itself would provide
further information about the population mean. For example, if
the observations that are less than the median are only slightly
less, but observations exceeding the median exceed it by a large
amount, then this would have a bearing on one’s inference about
the population mean. (source : wikipedia)
(Note : the median is known to be much more robust to extreme values
than the mean.)

O. Dagnelie (SSSIHL) Stat Inf 2024-25 31 / 104


Characteristics of estimators Sufficiency

Illustration
Let x1 , x2 , . . . , xn be a random sample from a Bernoulli population with
parameter p, 0 < p < 1, i.e.

xi = 1 with probability p, and 0 with probability q = (1 − p)

Then T = t(x1 , x2 , . . . , xn ) = x1 + x2 + · · · + xn ∼ B(n, p), with

P(T = k) = (n choose k) p^k (1 − p)^(n−k) ;  k = 0, 1, 2, . . . , n

O. Dagnelie (SSSIHL) Stat Inf 2024-25 32 / 104


Characteristics of estimators Sufficiency

Illustration (. . .)
The conditional distribution of (x1 , x2 , . . . , xn ) given T is :

P(x1 ∩ x2 ∩ · · · ∩ xn | T = k) = P(x1 ∩ x2 ∩ · · · ∩ xn ∩ T = k) / P(T = k)
  = p^k (1 − p)^(n−k) / [ (n choose k) p^k (1 − p)^(n−k) ] = 1/(n choose k),  if ∑_{i=1}^{n} xi = k
  = 0,  if ∑_{i=1}^{n} xi ≠ k

Since this does not depend on p, T = ∑_{i=1}^{n} xi is sufficient for p.
One additional word for concreteness (clarification) :
Let us suppose a random sample of size n = 3 in which x1 = 1, x2 = 0, and
x3 = 1. In this case,

P(x1 = 1, x2 = 0, x3 = 1, T = 1) = 0

since ∑ xi = 1 + 0 + 1 = 2, which is different from T = ∑ xi = 1, we have
an impossible event ⇒ P() = 0.
O. Dagnelie (SSSIHL) Stat Inf 2024-25 33 / 104
Characteristics of estimators Sufficiency

Illustration (. . .)
As soon as ∑ xi ≠ k, P() = 0.
If now we consider P(x1 = 1, x2 = 0, x3 = 1, T = 2), by independence, we have
p(1 − p)p = p²(1 − p).
So, in general,

P(x1 ∩ x2 ∩ · · · ∩ xn ∩ T = k) = 0, if ∑_{i=1}^{n} xi ≠ k
and P(x1 ∩ x2 ∩ · · · ∩ xn ∩ T = k) = p^k (1 − p)^(n−k), if ∑_{i=1}^{n} xi = k

As to the denominator, it is the binomial probability of getting exactly k
successes in n trials with a probability of success p, i.e.
P(T = k) = (n choose k) p^k (1 − p)^(n−k).

(Source : PennState)

O. Dagnelie (SSSIHL) Stat Inf 2024-25 34 / 104
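The argument can be checked by brute force for n = 3 (a sketch added here, plain Python): enumerating every sample and computing P(x1, x2, x3 | T) gives the same conditional probabilities, 1/C(n, k), whatever the value of p.

from itertools import product
from math import comb

def cond_prob(x, p):
    """P(X1=x1, ..., Xn=xn | T = sum(x)) for an i.i.d. Bernoulli(p) sample."""
    n, k = len(x), sum(x)
    joint = p**k * (1 - p)**(n - k)              # P(x1, ..., xn) when sum(x) = k
    p_T = comb(n, k) * p**k * (1 - p)**(n - k)   # P(T = k)
    return joint / p_T

n = 3
for p in (0.2, 0.5, 0.8):
    probs = {x: round(cond_prob(x, p), 6) for x in product((0, 1), repeat=n)}
    print(f"p = {p}:", probs)    # identical for every p: 1 / C(3, sum(x))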


Characteristics of estimators Sufficiency

Factorization Theorem (Neyman)

The necessary and sufficient condition for a distribution to admit a sufficient
statistic is provided by the 'factorization theorem' due to Neyman.
This theorem is particularly useful since it may not be easy to find the
conditional distribution of (x1 , x2 , . . . , xn ) given T (especially since we
would have to find the conditional distribution given T for all considered
sufficient statistics T).
Statement : T = t(x) is sufficient for θ if and only if the joint density
function L of the sample values can be expressed in the form :

L = g_θ[t(x)] · h(x)

where g_θ[t(x)] depends on θ and x only through the value of t(x) and h(x)
is independent of θ.

L = ∏_{i=1}^{n} f(xi , θ) is the joint density of the observations, the sample.

O. Dagnelie (SSSIHL) Stat Inf 2024-25 35 / 104


Characteristics of estimators Sufficiency

Remarks
1 ’A function independent of θ’ means that it does not involve θ, but
also that its domain does not depend on θ.
For example, f(x) = 1/(2a), a − θ < x < a + θ; −∞ < θ < ∞, depends on
θ.
θ.
2 The original sample X = (x1 , x2 , . . . , xn ) is always a sufficient statistic.
3 Koopman’s form of the distributions admitting sufficient statistic :

L = L(x, θ) = g (x).h(θ).exp{a(θ)Ψ(x)}

where h(θ) and a(θ) are functions of θ and g (x) and Ψ(x) are
functions only of the sample observations.
This equation gives the exponential family of distributions containing
the binomial, Poisson and the normal with unknown mean and
variance.

O. Dagnelie (SSSIHL) Stat Inf 2024-25 36 / 104


Characteristics of estimators Sufficiency

Remarks
4 Invariance Property of Sufficient Estimator :
If T is a sufficient estimator for the parameter θ and if Ψ(T ) is a one
to one function of T , then Ψ(T ) is sufficient for Ψ(θ).
5 Fisher-Neyman Criterion :
A statistic t1 = t(x1 , x2 , . . . , xn ) is a sufficient estimator of parameter
θ if and only if the likelihood function (joint p.d.f. of the sample) can
be expressed as :
L = ∏_{i=1}^{n} f(xi , θ) = g1 (t1 , θ) · k(x1 , x2 , . . . , xn )

where g1 (t1 , θ) is the p.d.f. of the statistic t1 and k(x1 , x2 , . . . , xn ) is a


function of sample observations only, independent of θ.
→ working out the p.d.f. of t1 = t(x1 , x2 , . . . , xn ) is not always easy.

O. Dagnelie (SSSIHL) Stat Inf 2024-25 37 / 104


Characteristics of estimators Sufficiency

Illustration
Let x1 , x2 , . . . , xn be a random sample from a N(µ, σ²) population.
Find sufficient estimators for µ and σ².
Let us write θ = (µ, σ²); −∞ < µ < ∞, 0 < σ² < ∞.

Then L = ∏_{i=1}^{n} f_θ(xi ) = ( 1/(σ√(2π)) )ⁿ · exp{ −(1/(2σ²)) ∑_{i=1}^{n} (xi − µ)² }
       = ( 1/(σ√(2π)) )ⁿ · exp{ −(1/(2σ²)) ( ∑ xi² − 2µ ∑ xi + nµ² ) }
       = g_θ[t(x)] · h(x)

where g_θ[t(x)] = ( 1/(σ√(2π)) )ⁿ exp{ −(1/(2σ²)) [ t2 (x) − 2µ t1 (x) + nµ² ] },
t(x) = {t1 (x), t2 (x)} = ( ∑ xi , ∑ xi² ) and h(x) = 1.

Thus t1 (x) = ∑ xi is sufficient for µ and t2 (x) = ∑ xi² is sufficient for σ².
O. Dagnelie (SSSIHL) Stat Inf 2024-25 38 / 104
Cramér-Rao Cramér-Rao lower bound

Cramér-Rao lower bound


Let θ̂1 and θ̂2 be two unbiased estimators of the parameter θ.

We know that the ’better’ of the two is the one with the smaller variance
but what about their performance relative to the other unbiased estimators
of θ ?
Can there be a θ̂3 with smaller variance than θ̂1 and θ̂2 ?
Can the minimum variance unbiased estimator be identified ?

Let a random sample of size n be extracted from a population with density


f (x, θ) where θ is an unknown parameter.
There is a theoretical lower bound on the variance of any unbiased
estimator of θ, known as the Cramér-Rao lower bound.
If the variance of a given θ̂ is equal to the Cramér-Rao lower bound, this
estimator is optimal, in the sense that no other unbiased θ̂ can estimate θ
with greater precision (or smaller variance).

O. Dagnelie (SSSIHL) Stat Inf 2024-25 39 / 104


Cramér-Rao Cramér-Rao lower bound

Harald Cramér - Calyampudi Radhakrishna Rao


Harald Cramér (1893-1985) was born in Stockholm, where he studied
and spent his whole career. He is known for a conjecture, several
theorems, Cramér’s V and Cramér’s inequality.

CR Rao (1920-2023) was born in a Telugu-speaking family in what is now
Vijayanagara, Karnataka, and studied in Andhra Pradesh until his MSc
in Mathematics (Andhra University). He went on to study for an MA in
Statistics at Calcutta University before completing a PhD in
Cambridge (UK), under the supervision of R.A. Fisher.

O. Dagnelie (SSSIHL) Stat Inf 2024-25 40 / 104


Cramér-Rao Cramér-Rao lower bound

Cramér-Rao Inequality
If t is an unbiased estimator of γ(θ), a function of the parameter θ, then

Var(t) ≥ { (d/dθ) γ(θ) }² / E[ ( ∂/∂θ log L )² ] = {γ′(θ)}² / I(θ)    (1)

where I(θ) is the information on θ, supplied by the sample.

In other words, the Cramér-Rao inequality provides a lower bound, {γ′(θ)}²/I(θ), on
the variance of an unbiased estimator of γ(θ).

O. Dagnelie (SSSIHL) Stat Inf 2024-25 41 / 104


Cramér-Rao Cramér-Rao lower bound

Proof
In proving this result, we assume that only a single parameter θ is
unknown. We also take the case of continuous random variables.
The case of discrete random variables can be dealt with similarly on
replacing the multiple integrals by appropriate multiple sums.

We further make the following assumptions, which are known as the


Regularity conditions for Cramér-Rao Inequality :
1 The parameter space Θ is a non-degenerate open interval of the real
line R¹ = (−∞, ∞).

2 For almost all x = (x1 , . . . , xn ), and for all θ ∈ Θ, ∂L(x, θ)/∂θ exists; the
exceptional set, if any, is independent of θ.

O. Dagnelie (SSSIHL) Stat Inf 2024-25 42 / 104


Cramér-Rao Cramér-Rao lower bound

Proof (. . . )
3 The range of integration is independent of the parameter θ, so that
f(x, θ) is differentiable under the integral sign.
If the range is not independent of θ and f is zero at the extremes of the
range, i.e., f(a, θ) = 0 = f(b, θ), then (by the Leibniz integral rule)

∂/∂θ ∫_a^b f dx = ∫_a^b (∂f/∂θ) dx − f(a, θ) (∂a/∂θ) + f(b, θ) (∂b/∂θ)
⇒ ∂/∂θ ∫_a^b f dx = ∫_a^b (∂f/∂θ) dx,  since f(a, θ) = 0 = f(b, θ)

4 The conditions of uniform convergence of integrals are satisfied so
that differentiation under the integral sign is valid.
5 I(θ) = E[ { ∂/∂θ log L(x, θ) }² ] exists and is positive for all θ ∈ Θ.

The Fisher information, I(θ), is defined to be the variance of the score
(the partial derivative with respect to θ of the natural logarithm of the
likelihood function).
O. Dagnelie (SSSIHL) Stat Inf 2024-25 43 / 104
Proof (. . . )
Let X be a random variable following the p.d.f. f(x, θ) and let L be the
likelihood function of the random sample (x1 , . . . , xn ) from this population:

L = L(x, θ) = ∏_{i=1}^{n} f(xi , θ)

Since L is the joint p.d.f. of (x1 , . . . , xn ),

∫ L(x, θ) dx = 1    (2)

where ∫ · dx = ∫∫ · · · ∫ · dx1 dx2 . . . dxn .
Differentiating (2) w.r. to θ and using the mentioned regularity conditions :

∫ (∂L/∂θ) dx = 0 ⇒ ∫ ( ∂/∂θ log L ) L dx = 0 ⇒ E[ ∂/∂θ log L ] = 0    (3)

NB : By using the chain rule, one has ∂/∂θ log L(x, θ) = (1/L(x, θ)) ∂/∂θ L(x, θ)
Proof (. . . )
(3) is obtained thanks to the law of the unconscious statistician :
E[g(X)] = ∫_R g(x) f(x) dx where X has p.d.f. f(x).
Let t = t(x1 , . . . , xn ) be an unbiased estimator of γ(θ) such that

E(t) = γ(θ) ⇒ ∫ t · L dx = γ(θ)    (4)

Differentiating w.r. to θ, we get ∫ t (∂L/∂θ) dx = γ′(θ)

⇒ ∫ t ( ∂/∂θ log L ) L dx = γ′(θ) ⇒ E[ t · ∂/∂θ log L ] = γ′(θ)    (5)

Cov( t, ∂/∂θ log L ) = E[ t · ∂/∂θ log L ] − E(t) · E[ ∂/∂θ log L ] = γ′(θ)    (6)

NB : Cov(X, Y) = E(XY) − E(X)E(Y)


Cramér-Rao Cramér-Rao lower bound

Proof (. . . )
We have : {r(X, Y)}² ≤ 1 ⇒ {Cov(X, Y)}² ≤ Var(X) · Var(Y)
NB : r(X, Y) = Cov(X, Y)/(SE(X) SE(Y)) or σ_XY/(σ_X σ_Y)

∴ { Cov( t, ∂/∂θ log L ) }² ≤ Var(t) · Var( ∂/∂θ log L )
⇒ {γ′(θ)}² ≤ Var(t) · ( E[ ( ∂/∂θ log L )² ] − { E[ ∂/∂θ log L ] }² )
⇒ {γ′(θ)}² ≤ Var(t) · E[ ( ∂/∂θ log L )² ]
⇒ Var(t) ≥ {γ′(θ)}² / E[ ( ∂/∂θ log L )² ]    (7)

which is the Cramér-Rao Inequality.

O. Dagnelie (SSSIHL) Stat Inf 2024-25 46 / 104
Cramér-Rao Cramér-Rao lower bound

Corollary
If t is an unbiased estimator of the parameter θ itself, i.e.,

E(t) = θ ⇒ γ(θ) = θ and γ′(θ) = 1,

then from (1) we get

Var(t) ≥ 1 / E[ ( ∂/∂θ log L )² ] = 1 / I(θ)    (8)

where I(θ) = E[ ( ∂/∂θ log L )² ] is called by R.A. Fisher the amount of
information on θ supplied by the sample (x1 , . . . , xn ), and its reciprocal,
1/I(θ), the information limit to the variance of the estimator
t = t(x1 , . . . , xn ).

O. Dagnelie (SSSIHL) Stat Inf 2024-25 47 / 104


Cramér-Rao Cramér-Rao lower bound

Remarks :
An unbiased estimator t of γ(θ) for which the Cramér-Rao lower bound in
(1) is attained is called a minimum variance bound (MVB) estimator.

We have

I(θ) = E[ ( ∂/∂θ log L )² ] = −E[ ∂²/∂θ² log L ]

This second form is much more convenient to use in practice.

O. Dagnelie (SSSIHL) Stat Inf 2024-25 48 / 104


Cramér-Rao Cramér-Rao lower bound

Cramér-Rao lower bound and sufficient estimators


Sufficient estimators also play a critical role in the search for ef-
ficient estimators – that is, unbiased estimators whose variance
equals the Cramér-Rao lower bound. There will be an infinite num-
ber of unbiased estimators for any unknown parameter in any pdf.
That said, there may be a subset of those unbiased estimators that
are functions of sufficient estimators. If so, it can be proved that
the variance of every unbiased estimator based on a sufficient es-
timator will necessarily be less than the variance of every unbiased
estimator that is not a function of a sufficient estimator. It follows,
then, that to find an efficient estimator for θ, we can restrict our
attention to functions of sufficient estimators for θ.
(Source : An Introduction to Mathematical Statistics and Its Applications
(5th Edition), Larsen and Marx, Prentice Hall, p.329)

O. Dagnelie (SSSIHL) Stat Inf 2024-25 49 / 104


Cramér-Rao Cramér-Rao lower bound

Example : binomial distribution

Let X1 , . . . , Xn be i.i.d. ∼ Bin(1, p) with p ∈ (0, 1).

L_p(X1 , . . . , Xn ) = p^(∑ᵢ Xi) (1 − p)^(n − ∑ᵢ Xi)

Let us define p̂ = X̄ = (1/n) ∑ᵢ Xi , which is clearly an unbiased estimator of p.
Please note first that

Var(p̂) = Var( (1/n) ∑ᵢ Xi ) = (1/n²) Var( ∑ᵢ Xi ) = (1/n²) · np(1 − p) = p(1 − p)/n

How does Var(p̂) compare with the Cramér-Rao lower bound ?

O. Dagnelie (SSSIHL) Stat Inf 2024-25 50 / 104


Cramér-Rao Cramér-Rao lower bound

Example : binomial distribution (. . .)

Taking logarithms, L_p(X1 , . . . , Xn ) becomes

ln L_p(X1 , . . . , Xn ) = ( ∑_{i=1}^{n} Xi ) ln(p) + ( n − ∑_{i=1}^{n} Xi ) ln(1 − p)

and

∂ ln L_p(X1 , . . . , Xn )/∂p = ( ∑_{i=1}^{n} Xi ) · 1/(p(1 − p)) − n/(1 − p)

The variance of ∂ ln L_p(X1 , . . . , Xn )/∂p is therefore

Var( ∂ ln L_p(X1 , . . . , Xn )/∂p ) = ( 1/(p²(1 − p)²) ) Var( ∑_{i=1}^{n} Xi )
                                   = np(1 − p)/(p²(1 − p)²) = n/(p(1 − p))

O. Dagnelie (SSSIHL) Stat Inf 2024-25 51 / 104


Cramér-Rao Cramér-Rao lower bound

Example : binomial distribution (. . .)

We can see that

Var(p̂) = 1/I(θ) = 1/Var( ∂ ln L_p(X1 , . . . , Xn )/∂p )

Since Var(p̂) is equal to the Cramér-Rao lower bound, p̂ = (1/n) ∑ᵢ Xi is the
preferred estimator of the parameter p of the binomial distribution.
No other unbiased estimator of p with lower variance can be found.

O. Dagnelie (SSSIHL) Stat Inf 2024-25 52 / 104
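A numerical check of this statement (a sketch added here; Python/NumPy, with n = 50 and p = 0.3 arbitrary): the Monte-Carlo variance of p̂ essentially coincides with p(1 − p)/n, the Cramér-Rao lower bound computed above.

import numpy as np

rng = np.random.default_rng(4)
n, p, reps = 50, 0.3, 200_000

p_hat = rng.binomial(1, p, size=(reps, n)).mean(axis=1)

print("empirical Var(p̂)      :", p_hat.var())
print("Cramér-Rao lower bound:", p * (1 - p) / n)   # 1 / I(p)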


Cramér-Rao Cramér-Rao lower bound

Example : binomial distribution (. . .)

Taking the expectation of the second partial derivative with respect to the unknown
parameter is another way to reach the Cramér-Rao lower bound :

∂² ln L_p(X1 , . . . , Xn )/∂p² = ( ∑_{i=1}^{n} Xi ) · (−1 + 2p)/(p²(1 − p)²) − n/(1 − p)²

E[ ∂² ln L_p(X1 , . . . , Xn )/∂p² ] = np · (−1 + 2p)/(p²(1 − p)²) − n/(1 − p)²
                                   = (−n + 2np − np)/(p(1 − p)²) = −n/(p(1 − p))

Once again, the variance of the estimator is equal to the Cramér-Rao lower
bound, i.e.

Var(p̂) = 1/( −E[ ∂²/∂θ² log L ] ) = 1/( −E[ ∂² ln L_p(X1 , . . . , Xn )/∂p² ] )

O. Dagnelie (SSSIHL) Stat Inf 2024-25 53 / 104


Cramér-Rao Cramér-Rao lower bound

Exercise : mean of a Gaussian sample

Let X1 , . . . , Xn be i.i.d. ∼ N(µ, σ²) with given σ².
Let us define µ̂ = X̄ = (1/n) ∑ᵢ Xi .
How does Var(µ̂) compare with the Cramér-Rao lower bound ?

O. Dagnelie (SSSIHL) Stat Inf 2024-25 54 / 104


Cramér-Rao Cramér-Rao lower bound

Exercise (. . .)
We have :

ln L_{µ,σ²}(X1 , . . . , Xn ) = −(n/2) ln(2πσ²) − (1/2) ∑_{i=1}^{n} ( (Xi − µ)/σ )²

Therefore the first derivative of the function with respect to µ is

∂ ln L_{µ,σ²}(X1 , . . . , Xn )/∂µ = (1/σ²) ∑_{i=1}^{n} ( Xi − µ )

O. Dagnelie (SSSIHL) Stat Inf 2024-25 55 / 104


Cramér-Rao Cramér-Rao lower bound

Exercise (. . .)
The variance of this expression becomes

Var( ∂ ln L_{µ,σ²}(X1 , . . . , Xn )/∂µ ) = (1/σ⁴) E[ ( ∑_{i=1}^{n} (Xi − µ) )² ] = (1/σ⁴) · nσ² = n/σ²

Let us now compare the inverse of this expression with the variance of X̄ :

Var(X̄) = σ²/n = 1/Var( ∂ ln L_{µ,σ²}(X1 , . . . , Xn )/∂µ ) ⇒

The sample mean X̄ is an efficient estimator of the population mean, µ.

O. Dagnelie (SSSIHL) Stat Inf 2024-25 56 / 104


Cramér-Rao Cramér-Rao lower bound

Exercise (. . .)
We can also use the alternative formula for the Cramér-Rao lower bound,
i.e. 1/( −E[ ∂² ln L_{µ,σ²}(X1 , . . . , Xn )/∂µ² ] ), giving

∂² ln L_{µ,σ²}(X1 , . . . , Xn )/∂µ² = −n/σ²

And therefore

1/( −E[ ∂² ln L_{µ,σ²}(X1 , . . . , Xn )/∂µ² ] ) = σ²/n

This confirms that X̄ is an efficient estimator of µ.

O. Dagnelie (SSSIHL) Stat Inf 2024-25 57 / 104


Cramér-Rao Efficiency or bias

Efficiency or bias
Ideally, we want to find an unbiased and efficient estimator.
However, one could imagine that, in some cases, a very precise estimator
with small bias could be preferred to an unbiased and not precise estimator.
In other words, how should we compare 2 estimators, one biased and
another unbiased ?

This global concept of efficiency, with or without bias, is captured by the Mean
Squared Error (MSE).
Mean Squared Error
The Mean Squared Error of an estimator θ̂ of the parameter θ is

MSE(θ̂) = E[(θ̂ − θ)²]

The Mean Squared Error is obviously related to the bias and variance of
the estimator.
O. Dagnelie (SSSIHL) Stat Inf 2024-25 58 / 104
Cramér-Rao Efficiency or bias

Thm
MSE(θ̂) = Var(θ̂) + [Bias(θ̂)]²

Proof :
Let θ̂ be an estimator of θ; recall that the bias is b = E(θ̂) − θ.
Let τ be the expectation of θ̂, i.e. τ = E(θ̂).
MSE(θ̂) = E[(θ̂ − θ)²].
Let us replace θ̂ − θ by θ̂ − τ + τ − θ = θ̂ − τ + b, so that
(θ̂ − θ)² = (θ̂ − τ)² + b² + 2b(θ̂ − τ).
Let us take the expectation of each term.
The first one gives the variance of θ̂, the second one is the squared bias.
The third one is zero since
E[2b(θ̂ − τ)] = 2b E[θ̂ − τ] = 2b [E(θ̂) − τ] = 0,
because by definition τ = E(θ̂).
O. Dagnelie (SSSIHL) Stat Inf 2024-25 59 / 104
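The decomposition is easy to verify by simulation (a sketch added here; Python/NumPy, using the biased estimator s² of σ² = 4 for a normal sample of size n = 10 as an arbitrary example):

import numpy as np

rng = np.random.default_rng(5)
n, reps, sigma2 = 10, 200_000, 4.0

x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
s2 = x.var(axis=1, ddof=0)                 # biased estimator of sigma^2

mse = np.mean((s2 - sigma2) ** 2)          # direct Monte-Carlo MSE
var, bias = s2.var(), s2.mean() - sigma2
print("MSE          :", mse)
print("Var + Bias^2 :", var + bias ** 2)   # agrees with the MSE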
Cramér-Rao Efficiency or bias

This confirms our previous reasoning.


If two estimators are unbiased, the most efficient is the one with the
smallest variance.
Furthermore, for two estimators with the same variance, the one with the
smaller bias is the more efficient.
We can generalise our definition of efficiency.
Relative efficiency
Let θ̂1 and θ̂2 be two estimators of θ, with or without bias.
The relative efficiency of θ̂1 with respect to θ̂2 is the ratio

RE(θ̂1 , θ̂2 ) = MSE(θ̂2 ) / MSE(θ̂1 )

NB : An estimator θ̂ of θ is said to be efficient if its MSE attains the Cramér-Rao
lower bound.
Note that MSE(θ̂) = Var(θ̂) if θ̂ is an unbiased estimator of θ.

O. Dagnelie (SSSIHL) Stat Inf 2024-25 60 / 104


Cramér-Rao Efficiency or bias

There are also asymptotic definitions of the quality of an estimator.


Asymptotically unbiased estimator
An estimator is asymptotically unbiased if the bias tends towards zero when
the sample size increases indefinitely.

Consistent estimator
An estimator is consistent or asymptotically consistent when its MSE tends
towards zero as the sample size increases indefinitely.

A consistent estimator obviously has a bias and variance that tend towards
zero.
In a way, this means that the estimator tends to give the right answer with
certainty as the sample approaches a population census.

O. Dagnelie (SSSIHL) Stat Inf 2024-25 61 / 104


Cramér-Rao Blackwellisation

Minimum Variance Unbiased Estimator and Blackwellisation


Thanks to Cramér-Rao inequality, we are able to know if an unbiased
estimator is an MVU estimator or not.

The regularity conditions are strict ⇒ its applications are limited.


MVB and MVU may be different since the Cramér-Rao lower bound
may not always be attained.
If the regularity conditions are violated, the least attainable variance
may be less than the Cramér-Rao bound.

→ How to obtain MVU estimator from any unbiased estimator through the
use of sufficient statistic. (Blackwellisation)

O. Dagnelie (SSSIHL) Stat Inf 2024-25 62 / 104


Cramér-Rao Blackwellisation

Rao-Blackwell Theorem
Let U = U(x1 , x2 , . . . , xn ) be an unbiased estimator of parameter γ(θ) and
let T = T (x1 , x2 , . . . , xn ) be sufficient statistic for γ(θ). Consider the
function ϕ(T ) of the sufficient statistic defined as

ϕ(t) = E (U|T = t) (9)

which is independent of θ (since T is sufficient for θ). Then

(i) E [ϕ(T )] = γ(θ), and


(ii) Var [ϕ(T )] ≤ Var (U)

This result implies that starting with an unbiased estimator U, we can


improve upon it by defining a function ϕ(T ) of the sufficient statistics as
given in (9).
This technique of obtaining improved estimators is called Blackwellisation.

O. Dagnelie (SSSIHL) Stat Inf 2024-25 63 / 104
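A classical textbook illustration of Blackwellisation, not taken from these slides, is the estimation of γ(λ) = e^(−λ) = P(X = 0) in a Poisson(λ) sample: U = 1{X1 = 0} is unbiased, T = ∑Xi is sufficient, and ϕ(T) = E(U|T) = (1 − 1/n)^T. The sketch below (Python/NumPy, with λ = 2 and n = 20 arbitrary) shows the variance reduction.

import numpy as np

rng = np.random.default_rng(6)
lam, n, reps = 2.0, 20, 100_000
target = np.exp(-lam)                      # gamma(lambda) = P(X = 0)

x = rng.poisson(lam, size=(reps, n))
U = (x[:, 0] == 0).astype(float)           # crude unbiased estimator: 1{x1 = 0}
T = x.sum(axis=1)                          # sufficient statistic for lambda
phi = (1 - 1 / n) ** T                     # E[U | T], since X1 | T=t ~ Bin(t, 1/n)

for name, est in (("U", U), ("phi(T)", phi)):
    print(f"{name:>7}: mean = {est.mean():.4f} (target {target:.4f}), var = {est.var():.6f}")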


Cramér-Rao Blackwellisation

Rao-Blackwell Theorem (. . .) Proof :


(i) holds because of the law of total expectation : E (X ) = E [E (X |Y )]
(ii) holds because of the law of total variance :
Var(U) = E[Var(U|T)] + Var[ϕ(T)] ≥ Var[ϕ(T)]
Rao-Blackwell theorem enables us to obtain MVU estimators through
sufficient statistic.
If a sufficient estimator exists for a parameter, then in our search for MVU
estimator we may restrict ourselves to functions of the sufficient statistic.

O. Dagnelie (SSSIHL) Stat Inf 2024-25 64 / 104


Cramér-Rao Blackwellisation

Rao-Blackwell Theorem
Let θ̂ be an unbiased estimator of parameter θ with E (θ̂2 ) < ∞. Suppose
that T is sufficient for θ, and let θ∗ = E (θ̂|T ). Then, for all θ,

E (θ∗ − θ)2 ≤ E (θ̂ − θ)2

The inequality is strict unless θ̂ is a function of T .

Proof

E(θ* − θ)² = E[ {E(θ̂|T) − θ}² ] = E[ {E(θ̂ − θ | T)}² ]
           ≤ E[ E{ (θ̂ − θ)² | T } ] = E(θ̂ − θ)²

The inequality follows since, for any W, Var(W) = E(W²) − (EW)² ≥ 0, applied here
to W = (θ̂ − θ) conditionally on T.

O. Dagnelie (SSSIHL) Stat Inf 2024-25 65 / 104


Methods to find estimators Method of moments

Methods for finding estimators

a. Method of moments (Karl Pearson).

The first principle of estimation is the most obvious and intuitive. It boils
down to :
- estimate the moment of order 1, that is to say the expectation µ, of a
population by the sampling mean X̄,
- estimate the moment of order 2, that is E(X²) of the population, by
(1/n) ∑_{i=1}^{n} Xi²,
- and so on...

The principle of estimating a moment of the population by an equivalent


moment of the sample is called ”method of moments”.

O. Dagnelie (SSSIHL) Stat Inf 2024-25 66 / 104


Methods to find estimators Method of moments

Method of moments
Let X1 , . . . , Xn be i.i.d. with parameter θ = (θ1 , . . . , θ_K )′. Let us note
- µ′k (θ) := E[X^k ], k = 1, 2, . . . the theoretical moments
- m′k := (1/n) ∑_{i=1}^{n} Xi^k , k = 1, 2, . . . the corresponding empirical
moments
Let us assume that the theoretical moments exist and are finite up to order
K at least. These moments are functions of the parameter θ.
The method of moments consists in taking as the estimator of θ the
solution θ̂ of the system

µ′1 (θ) = m′1
  ...
µ′K (θ) = m′K

(i.e. a system of K equations with K unknowns θ1 , . . . , θK .)

O. Dagnelie (SSSIHL) Stat Inf 2024-25 67 / 104


Methods to find estimators Method of moments

Note that instead of taking ordinary or non-central moments
(µ′k (θ) := E[X^k ], as we have done), we can also use central moments
E[ (X − E(X))^k ] and standardised moments E[ ((X − µ)/σ)^k ].

Some remarkable moments :

- the expectation, moment of order 1, E(X)
- the variance, central moment of order 2, E[ (X − µ)² ]
- the skewness coefficient, standardised moment of order 3, E[ ((X − µ)/σ)³ ]
- the non-normalised kurtosis, standardised moment of order 4,
E[ ((X − µ)/σ)⁴ ]

O. Dagnelie (SSSIHL) Stat Inf 2024-25 68 / 104


Methods to find estimators Method of moments

Example : Bernoulli Sample

Let X1 , . . . , Xn be i.i.d. ∼ Bin(1, p). In this case, K = 1 and

p ↦ µ′1 = E(X) = p

The estimator for p given by the method of moments is

p̂ = (1/n) ∑_{i=1}^{n} Xi

O. Dagnelie (SSSIHL) Stat Inf 2024-25 69 / 104


Methods to find estimators Method of moments

Example : Gaussian sample

Let X1 , . . . , Xn be i.i.d. ∼ N(µ, σ²). In this case, K = 2 and

(µ, σ²) ↦ ( µ′1 = µ ,  µ′2 = σ² + µ² )

(Note : one of the variance formulas being σ² = E(X²) − µ²)

The system in (µ̂, σ̂²)

(µ′1 =)  µ = X̄
(µ′2 =)  σ² + µ² = (1/n) ∑_{i=1}^{n} Xi²

has for solution

µ̂ = X̄
σ̂² = (1/n) ∑_{i=1}^{n} Xi² − X̄² = s²

O. Dagnelie (SSSIHL) Stat Inf 2024-25 70 / 104
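In practice, the two moment equations are solved directly from the data. A minimal sketch (added here; Python/NumPy, with data simulated from N(5, 9) as an arbitrary example):

import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(5.0, 3.0, size=1_000)   # "observed" sample

m1 = x.mean()                # first empirical moment
m2 = np.mean(x ** 2)         # second empirical moment

mu_hat = m1                  # solves  mu = m1'
sigma2_hat = m2 - m1 ** 2    # solves  sigma^2 + mu^2 = m2'

print("method-of-moments estimates:", mu_hat, sigma2_hat)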


Methods to find estimators Method of moments

Example : Uniform sample

Let X1 , . . . , Xn be i.i.d. ∼ U[0, θ].
It is impossible that X_max > θ (i.e. ∀θ, P(X_max > θ) = 0).
In this case, K = 1 and the first theoretical moment is

µ′1 (θ) = θ/2

The estimator θ̂ given by the method of moments is the solution to the
following equation : (µ′1 =) θ/2 = X̄; the resulting estimator is therefore

θ̂ = 2X̄

Yet, we know that P(θ̂ = 2X̄ < X_max) > 0.
The method of moments works well in general; the case of a uniform law
gives an illustration of an example for which the method of moments fails
to deliver a good estimator.

O. Dagnelie (SSSIHL) Stat Inf 2024-25 71 / 104
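The failure is easy to observe empirically (a sketch added here; Python/NumPy, with θ = 1 and n = 10 arbitrary): in a sizeable fraction of samples, the method-of-moments estimate 2X̄ falls below the sample maximum, an impossible value for θ.

import numpy as np

rng = np.random.default_rng(8)
theta, n, reps = 1.0, 10, 100_000

x = rng.uniform(0.0, theta, size=(reps, n))
mom = 2 * x.mean(axis=1)        # method-of-moments estimator 2 * x-bar
xmax = x.max(axis=1)            # sample maximum (the MLE, derived later)

print("P(2·X̄ < X_max) ≈", np.mean(mom < xmax))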


Methods to find estimators MLE

b. Maximum Likelihood Method.

As its name suggests, this method uses the likelihood function, defined
previously as L(x1 , . . . , xn ) = ∏_{i=1}^{n} f(xi ), where f(x) is the distribution or
density of the population.
The likelihood function describes the joint distribution or joint density of
the observations.

This method is based on the idea that the values of the sample
observations must be plausible : we therefore try to give the unknown
parameters values that maximise this likelihood.
"Since the result has been observed, it means that it had a high probability
of happening" (or so we hope).

O. Dagnelie (SSSIHL) Stat Inf 2024-25 72 / 104


Methods to find estimators MLE

Likelihood function
Let x1 , . . . , xn be a random sample of size n from a population whose density
(or distribution) is f(x) and θ an unknown parameter. The likelihood
function can be written (in the continuous case) :

L(x1 , . . . , xn , θ) = f(x1 , θ) f(x2 , θ) . . . f(xn , θ) = ∏_{i=1}^{n} f(xi , θ)

and, for the discrete case :

L(x1 , . . . , xn , θ) = ∏_{i=1}^{n} p(ki , θ)

You will also find the following notation (respectively continuous and
discrete) :

L_θ(X) = ∏_{i=1}^{n} f_θ(xi )  and  L_θ(X) = ∏_{i=1}^{n} p_θ(xi )
O. Dagnelie (SSSIHL) Stat Inf 2024-25 73 / 104
Methods to find estimators MLE

Maximum Likelihood Estimator

Let L(x1 , . . . , xn , θ) = ∏_{i=1}^{n} f(xi , θ) be the likelihood function for a random
sample x1 , . . . , xn with density function f(x) and unknown parameter θ.
A maximum likelihood estimator (MLE) of θ is any value θ̂ of θ maximising
the likelihood L_θ(X), i.e. :

L_θ̂(X) ≥ L_θ(X) for any value of θ

or
θ̂ = argmax_θ L_θ(X)
or, equivalently,
θ̂ = argmax_θ log L_θ(X)

O. Dagnelie (SSSIHL) Stat Inf 2024-25 74 / 104


Methods to find estimators MLE

The maximum likelihood method is undoubtedly the most widely used


estimation method.
The estimated value of the parameter is the one that makes the
observation plausible (most likely).

Maximum Likelihood Method

The maximum likelihood method consists of taking as estimator of θ the
solution θ̂ of the system

∑_{i=1}^{n} ∂ log f_θ(xi )/∂θ = 0

(i.e. a system of K equations and K unknowns θ1 , . . . , θK , called the likelihood
equations.)

O. Dagnelie (SSSIHL) Stat Inf 2024-25 75 / 104


Methods to find estimators MLE

More formal version


The principle of maximum likelihood consists in finding an estimator for the
unknown parameter θ = (θ1 , θ2 , . . . , θk ) which maximises the likelihood
function L(θ) for variations in parameter, i.e. we wish to find
θ̂ = (θ̂1 , θ̂2 , . . . , θ̂k ) so that

L(θ̂) > L(θ) ∀θ ∈ Θ, i.e. L(θ̂) = sup_θ L(θ), θ ∈ Θ

Thus if there exists a function θ̂ = (θ̂1 , θ̂2 , . . . , θ̂k ) of the sample values
which maximises L for variations in θ, then θ̂ is to be taken as an estimator
of θ. θ̂ is usually called Maximum Likelihood Estimator (MLE).

O. Dagnelie (SSSIHL) Stat Inf 2024-25 76 / 104


Methods to find estimators MLE

More formal version (. . .)

Thus θ̂ is the solution, if any, of

∂L/∂θ = 0 and ∂²L/∂θ² < 0    (10)

Since L > 0, and log L is a non-decreasing function of L, L and log L attain
their extreme values (maxima or minima) at the same values of θ̂. The first
of the two equations in (10) can be rewritten as

(1/L) ∂L/∂θ = 0 ⇒ ∂ log L/∂θ = 0    (11)

which is much more convenient from a practical point of view.
If θ is a vector-valued parameter, then θ̂ = (θ̂1 , θ̂2 , . . . , θ̂k ) is given by the
solution of the simultaneous equations

∂ log L/∂θi = ∂ log L(θ1 , θ2 , . . . , θk )/∂θi = 0,  i = 1, 2, . . . , k
O. Dagnelie (SSSIHL) Stat Inf 2024-25 77 / 104
Methods to find estimators MLE

Properties of MLE
Under regularity conditions (not displayed)

Thm
With probability approaching unity as n → ∞, the likelihood equation
∂ log L/∂θ = 0 has a solution which converges in probability to the true value
θ0 .
MLEs are consistent.

MLEs are consistent but not always unbiased → MLE (σ 2 ) = s 2 for the
Normal distribution.

Thm Asymptotic normality of MLEs

A consistent solution of the likelihood equation is asymptotically normally
distributed about the true value θ0 . Thus, θ̂ is asymptotically N(θ0 , 1/I(θ0 ))
as n → ∞.

O. Dagnelie (SSSIHL) Stat Inf 2024-25 78 / 104


Methods to find estimators MLE

Properties of MLE (. . .)
Thm
If MLE exists, it is the most efficient in the class of such estimators.

Thm
If a sufficient estimator exists, it is a function of the Maximum Likelihood
Estimator.

Thm
If for a given population with pdf f (x, θ), an MVB estimator T exists for θ
then the likelihood equation will have a solution equal to the estimator T .

Thm Invariance property of MLE


If T is the MLE of θ and Ψ(θ) is a one-to-one function of θ, then Ψ(T ) is
the MLE of Ψ(θ).
O. Dagnelie (SSSIHL) Stat Inf 2024-25 79 / 104
Methods to find estimators MLE

Example : Bernoulli sample

Let X1 , . . . , Xn be i.i.d. ∼ Bin(1, p) with p ∈ (0, 1).

L_p(X1 , . . . , Xn ) = p^(∑ᵢ Xi) (1 − p)^(n − ∑ᵢ Xi)

Taking logarithms,

log L_p(X1 , . . . , Xn ) = ( ∑_{i=1}^{n} Xi ) log(p) + ( n − ∑_{i=1}^{n} Xi ) log(1 − p)

and

∂ log L_p(X1 , . . . , Xn )/∂p = (1/p) ∑_{i=1}^{n} Xi − (1/(1 − p)) ( n − ∑_{i=1}^{n} Xi )
                             = ( ∑_{i=1}^{n} Xi ) ( 1/p + 1/(1 − p) ) − n/(1 − p)

O. Dagnelie (SSSIHL) Stat Inf 2024-25 80 / 104


Methods to find estimators MLE

Example : Bernoulli sample (. . .)

To find the maximum, we need to calculate the first-order condition of the
likelihood function :

( ∑_{i=1}^{n} Xi ) · ((1 − p) + p)/(p(1 − p)) − np/(p(1 − p)) = 0

⇒ ∑_{i=1}^{n} Xi − np = 0,  or  ∑_{i=1}^{n} Xi = np

The solution is

p̂ = (1/n) ∑_{i=1}^{n} Xi

O. Dagnelie (SSSIHL) Stat Inf 2024-25 81 / 104


Methods to find estimators MLE

Example : Gaussian sample

Let X1 , . . . , Xn be i.i.d. ∼ N(µ, σ²). The likelihood function is

L_{µ,σ²}(X1 , . . . , Xn ) = ∏_{i=1}^{n} ( 1/(√(2π)σ) ) e^(−(Xi − µ)²/(2σ²))
                         = (2πσ²)^(−n/2) exp( −(1/(2σ²)) ∑_{i=1}^{n} (Xi − µ)² )

Taking logarithms,

log L_{µ,σ²}(X1 , . . . , Xn ) = −(n/2) log(2π) − (n/2) log(σ²) − (1/(2σ²)) ∑_{i=1}^{n} (Xi − µ)²

This likelihood equation system has 2 equations with 2 unknowns, µ and σ².
We therefore need to set the first derivatives with respect to the
unknowns to zero.
O. Dagnelie (SSSIHL) Stat Inf 2024-25 82 / 104
Methods to find estimators MLE

Example : Gaussian sample (. . .)

∂ log L_{µ,σ²}(X1 , . . . , Xn )/∂µ = (1/σ²) ∑_{i=1}^{n} (Xi − µ) = 0

This equation, whose unknown is µ, is satisfied in the case where

∑_{i=1}^{n} Xi − nµ = 0

The solution is therefore

µ̂ = (1/n) ∑_{i=1}^{n} Xi , i.e. µ̂ = X̄

O. Dagnelie (SSSIHL) Stat Inf 2024-25 83 / 104


Methods to find estimators MLE

Example : Gaussian sample (. . .)

With regards to the second first-order condition, µ is substituted by µ̂,
which gives :

∂ log L_{µ,σ²}(X1 , . . . , Xn )/∂σ² = −(n/2)(1/σ²) + (1/(2σ⁴)) ∑_{i=1}^{n} (Xi − µ̂)² = 0
⇒ −(n/2)σ² + (1/2) ∑_{i=1}^{n} (Xi − µ̂)² = 0

The solution is therefore

σ̂² = (1/n) ∑_{i=1}^{n} (Xi − µ̂)², i.e. σ̂² = s² since µ̂ = X̄

O. Dagnelie (SSSIHL) Stat Inf 2024-25 84 / 104
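The closed-form solutions µ̂ = X̄ and σ̂² = s² can be cross-checked by maximising the log-likelihood numerically. The sketch below is an added illustration, assuming NumPy and SciPy are available; the N(5, 9) data are simulated as an arbitrary example.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(9)
x = rng.normal(5.0, 3.0, size=500)

def neg_log_lik(params):
    mu, log_sigma2 = params                  # optimise log(sigma^2) to keep sigma^2 > 0
    sigma2 = np.exp(log_sigma2)
    return 0.5 * len(x) * np.log(2 * np.pi * sigma2) + np.sum((x - mu) ** 2) / (2 * sigma2)

res = minimize(neg_log_lik, x0=[0.0, 0.0])
print("numerical MLE:", res.x[0], np.exp(res.x[1]))
print("closed form  :", x.mean(), x.var(ddof=0))   # X̄ and s², as derived above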


Methods to find estimators MLE

Example : Uniform sample

Let X1 , . . . , Xn be i.i.d. ∼ U[0, θ]. For θ ≥ X_max,

L_θ(X1 , . . . , Xn ) = 1/θⁿ
log L_θ(X1 , . . . , Xn ) = −n log(θ)
∂ log L_θ(X1 , . . . , Xn )/∂θ = −n/θ < 0

We can say that L_θ(X1 , . . . , Xn ) is a decreasing function of θ for θ ≥ X_max,
which implies that L_θ(X1 , . . . , Xn ) is maximised when θ = X_max.
We therefore have

θ̂ = maxᵢ Xi

O. Dagnelie (SSSIHL) Stat Inf 2024-25 85 / 104


Confidence Interval Estimation

Confidence Interval Estimation


Simple or point estimation : one point, one statistic t(X1 , . . . , Xn ) (e.g.
sampling mean) is used to estimate θ (unknown population
parameter)
→ does not give any information about the margin of error
(or the precision of the estimate) . . .
Confidence interval estimation : two statistics t1 (X1 , . . . , Xn ) and
t2 (X1 , . . . , Xn ) the bounds, one lower, one upper with
t1 (.) < t2 (.), defining an interval and the probability that the
interval contains the true value of the estimated parameter.

Definition
A confidence interval at confidence level (1 − α) for θ is an interval
[t1 (.), t2 (.)] such that :
- t1 (.) and t2 (.) are statistics
- P[t1 (.) ≤ θ ≤ t2 (.)] ≥ 1 − α ∀θ

O. Dagnelie (SSSIHL) Stat Inf 2024-25 86 / 104


Confidence Interval Estimation

Point estimation : information provided : a single number t(.) = θ̂ (e.g. the
sample mean X̄).
If we repeat the operation, we obtain another value (for which we know
that the probability of it being equal to θ is zero, i.e. P(θ̂ = θ) = 0).

Confidence interval estimation : the smaller the interval, the greater the
accuracy
→ incorporates a margin of error or sampling error.

[Diagram – a single point estimate t(.) on the real line versus two intervals [t1 (.), t2 (.)] of different widths]

O. Dagnelie (SSSIHL) Stat Inf 2024-25 87 / 104


Confidence Interval Estimation

CI for the mean of a Gaussian sample

Let X1 , . . . , Xn be i.i.d. ∼ N(µ, σ²) with n ≥ 2. Let us suppose that σ² is
known. We know that :

X̄ ∼ N(µ, σ²/n) ⇒ (X̄ − µ)/(σ/√n) ∼ N(0, 1)

We therefore have

P( (X̄ − µ)/(σ/√n) ≤ z_{α/2} ) = α/2 ∀µ
P( (X̄ − µ)/(σ/√n) ≤ z_{1−α/2} ) = 1 − α/2 ∀µ
P( (X̄ − µ)/(σ/√n) ≥ z_{1−α/2} ) = α/2 ∀µ
⇒ P( z_{α/2} ≤ (X̄ − µ)/(σ/√n) ≤ z_{1−α/2} ) = 1 − α ∀µ

with z_{α/2} = −z_{1−α/2}
O. Dagnelie (SSSIHL) Stat Inf 2024-25 88 / 104
Confidence Interval Estimation

[Figure – Standard Normal, N(0, 1): density of (X̄ − µ)/(σ/√n), with central area 1 − α between z_{α/2} = −z_{1−α/2} and z_{1−α/2}, and area α/2 in each tail]

O. Dagnelie (SSSIHL) Stat Inf 2024-25 89 / 104


Confidence Interval Estimation

 
P( z_{α/2} ≤ (X̄ − µ)/(σ/√n) ≤ z_{1−α/2} ) = 1 − α ∀µ
P( z_{α/2} σ/√n ≤ X̄ − µ ≤ z_{1−α/2} σ/√n ) = 1 − α ∀µ
P( X̄ − z_{1−α/2} σ/√n ≤ µ ≤ X̄ − z_{α/2} σ/√n ) = 1 − α ∀µ
P( X̄ − z_{1−α/2} σ/√n ≤ µ ≤ X̄ + z_{1−α/2} σ/√n ) = 1 − α ∀µ

with t1 (X1 , . . . , Xn ) = X̄ − z_{1−α/2} σ/√n and t2 (X1 , . . . , Xn ) = X̄ + z_{1−α/2} σ/√n.

[t1 (X1 , . . . , Xn ), t2 (X1 , . . . , Xn )] = [ X̄ ± z_{1−α/2} σ/√n ] is a confidence interval
for µ with confidence level (1 − α).
Margin of Error : MoE = z_{1−α/2} σ/√n

O. Dagnelie (SSSIHL) Stat Inf 2024-25 90 / 104
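In code, the interval [X̄ ± z_{1−α/2} σ/√n] is a one-liner. A sketch added here (Python, assuming SciPy for the normal quantile; the simulated N(100, 15²) data are an arbitrary example):

import numpy as np
from scipy.stats import norm

def mean_ci_known_sigma(x, sigma, conf=0.95):
    """Two-sided CI for µ with known σ: x̄ ± z_{1-α/2} · σ/√n."""
    xbar, n = np.mean(x), len(x)
    z = norm.ppf(1 - (1 - conf) / 2)
    moe = z * sigma / np.sqrt(n)
    return xbar - moe, xbar + moe

rng = np.random.default_rng(10)
sample = rng.normal(100.0, 15.0, size=64)
print(mean_ci_known_sigma(sample, sigma=15.0))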


Confidence Interval Estimation

[Figure – Sampling distribution of sample means of n observations from a N(µ, σ²) and confidence interval at (1 − α)%: central area 1 − α between µ − z_{1−α/2} σ/√n and µ + z_{1−α/2} σ/√n, with area α/2 in each tail]

A confidence interval for the population mean will be based on the


observed value of the sample mean, i.e. on an observation from this
sampling distribution.

O. Dagnelie (SSSIHL) Stat Inf 2024-25 91 / 104


Confidence Interval Estimation

Let us say we repeat the process of drawing a sample of 100 observations.


This, in no way, affects the true value of µ. On the other hand, x̄ will be
very different from one sample to the next : repeating the same process 20
times, we would obtain 20 different intervals.
 
What P( X̄ − 1.96 σ/√n ≤ µ ≤ X̄ + 1.96 σ/√n ) = .95 tells us is that about 95%
of these intervals, or 19 of them, should contain the true value of µ.

The statistician has no way of checking what the true value of µ is and
even if the number of intervals containing µ is the ’expected’ 19, it is not
possible to know which ones are the true ones.

In any case, in practice, it is much better to take a sample of 2000


observations rather than 20 times a sample of 100 observations.
Why ? → Monte-Carlo simulations

O. Dagnelie (SSSIHL) Stat Inf 2024-25 92 / 104
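This "about 19 out of 20" statement is exactly what a Monte-Carlo simulation shows. The sketch below is an added illustration (Python with NumPy and SciPy; µ = 50, σ = 10, n = 100 are arbitrary): over repeated runs, roughly 95% of the intervals cover µ.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(11)
mu, sigma, n, n_intervals, conf = 50.0, 10.0, 100, 20, 0.95
z = norm.ppf(1 - (1 - conf) / 2)

covered = 0
for _ in range(n_intervals):
    xbar = rng.normal(mu, sigma, size=n).mean()
    lo, hi = xbar - z * sigma / np.sqrt(n), xbar + z * sigma / np.sqrt(n)
    covered += (lo <= mu <= hi)

print(f"{covered} of {n_intervals} intervals contain µ")   # about 19 on average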


Confidence Interval Estimation

[Figure – 20 confidence intervals and the population mean µ]

We expect µ to lie within 19 of the 20 confidence intervals from 20


samples drawn from the same population (95% of cases). The centre of
each confidence interval is the sample mean, X̄ .
O. Dagnelie (SSSIHL) Stat Inf 2024-25 93 / 104
Confidence Interval Estimation

Exercise : time at the grocery store


Let us assume that the time spent by customers in a grocery shop is
distributed with a known population standard deviation of 6 minutes.
A random sample of 64 customers spent an average of 20 minutes
shopping in this grocery store.

Find the standard error of the mean, margin of error, lower and upper
bounds of an interval at the 95% confidence level for the population mean,
µ.

O. Dagnelie (SSSIHL) Stat Inf 2024-25 94 / 104


Confidence Interval Estimation

Exercise (Solution)
The standard error of the mean : σ/√n = 6/√64 = .75
The margin of error : z_{1−α/2} · σ/√n = 1.96 × .75 = 1.47
The confidence interval at 95% is the following :

[ X̄ − z_{1−α/2} σ/√n , X̄ + z_{1−α/2} σ/√n ] = [18.53, 21.47]

The confidence level of the interval implies that, in the long term, 95% of
the intervals found by following this procedure contain the true value of the
population mean.
However, we cannot know whether this interval is part of the 95% good or
the 5% bad without knowing µ.

O. Dagnelie (SSSIHL) Stat Inf 2024-25 95 / 104


Confidence Interval Estimation

CI for the mean of a sample of any distribution

Let X1 , . . . , Xn be i.i.d. with Var(Xi ) = σ² < ∞.
Assume µ = E(Xi ) and X̄ = (1/n) ∑_{i=1}^{n} Xi . We know that :

P( (X̄ − µ)/(S/√n) ≤ x ) → Φ(x)

where x ↦ Φ(x) = P[N(0, 1) ≤ x] is the N(0, 1) distribution function. ⇒

P( z_{α/2} ≤ (X̄ − µ)/(S/√n) ≤ z_{1−α/2} ) ≃ 1 − α ∀µ
P( X̄ − z_{1−α/2} S/√n ≤ µ ≤ X̄ + z_{1−α/2} S/√n ) ≃ 1 − α ∀µ

with t1 (X1 , . . . , Xn ) = X̄ − z_{1−α/2} S/√n and t2 (X1 , . . . , Xn ) = X̄ + z_{1−α/2} S/√n.

O. Dagnelie (SSSIHL) Stat Inf 2024-25 96 / 104


Confidence Interval Estimation

CI for the mean of a sample of any distribution (. . .)

[t1 (X1 , . . . , Xn ), t2 (X1 , . . . , Xn )] = [ X̄ − z_{1−α/2} S/√n , X̄ + z_{1−α/2} S/√n ] = [ X̄ ± z_{1−α/2} S/√n ]

This is a confidence interval for µ at confidence level (1 − α) with the
following margin of error, MoE = z_{1−α/2} S/√n.

In this particular case, we made the assumption that n is large (this is why
we used the Normal distribution).
If n is small, Z = (X̄ − µ)/(S/√n) is not N(0, 1) and we have to use Student's t
distribution (with n − 1 degrees of freedom).

O. Dagnelie (SSSIHL) Stat Inf 2024-25 97 / 104


Confidence Interval Estimation

CI for a proportion (Bernoulli)

Let X1 , . . . , Xn be i.i.d. ∼ Bin(1, p) with p ∈ (0, 1). For p̂ = X̄, we have
E(X̄) = p and Var(X̄) = p(1 − p)/n.
By the de Moivre-Laplace theorem, we know that

P( (X̄ − p)/√(p(1 − p)/n) ≤ x ) → Φ(x)

If n is large enough and p is approximated by p̂, we have

P( z_{α/2} ≤ (p̂ − p)/√(p(1 − p)/n) ≤ z_{1−α/2} ) ≃ 1 − α ∀p
P( p̂ − z_{1−α/2} √(p̂(1 − p̂)/n) ≤ p ≤ p̂ + z_{1−α/2} √(p̂(1 − p̂)/n) ) ≃ 1 − α ∀p

with t1 (X1 , . . . , Xn ) = p̂ − z_{1−α/2} √(p̂(1 − p̂)/n) and t2 (X1 , . . . , Xn ) = p̂ + z_{1−α/2} √(p̂(1 − p̂)/n).

[t1 (X1 , . . . , Xn ), t2 (X1 , . . . , Xn )] = [ p̂ ± z_{1−α/2} √(p̂(1 − p̂)/n) ] is a CI for p with
confidence level (1 − α).
O. Dagnelie (SSSIHL) Stat Inf 2024-25 98 / 104
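A sketch of this (Wald) interval in code, added here for illustration (Python, assuming SciPy; the 54 successes out of 100 and the 98% level anticipate Exercise 2 at the end of the section):

import numpy as np
from scipy.stats import norm

def proportion_ci(successes, n, conf=0.95):
    """Wald interval p̂ ± z_{1-α/2} · sqrt(p̂(1-p̂)/n), as derived on this slide."""
    p_hat = successes / n
    z = norm.ppf(1 - (1 - conf) / 2)
    moe = z * np.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - moe, p_hat + moe

print(proportion_ci(54, 100, conf=0.98))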
Confidence Interval Estimation

Critical values for the standard normal, N(0, 1),


for different confidence levels.
1 − α     α      α/2     z_{1−α/2}
.90       .10    .050    z_{.95}  = 1.645
.95       .05    .025    z_{.975} = 1.960
.99       .01    .005    z_{.995} = 2.576

O. Dagnelie (SSSIHL) Stat Inf 2024-25 99 / 104


Confidence Interval Estimation

Small samples correction


The sampling error formulae for a Gaussian sample and a proportion are
based on the fact that, for a random variable with a standard deviation σ,
the sampling mean has a standard error σ/√n.
When sampling without replacement, the standard error of the sampling
mean is reduced by the reduction factor :

√( (N − n)/(N − 1) )

where N is the population size and n is the sample size.
In the case of sampling without replacement, the formulae for the sampling
error become, respectively for a normal random variable and a proportion,

MoE = z_{1−α/2} (σ/√n) √( (N − n)/(N − 1) )  and  MoE = z_{1−α/2} √( p(1 − p)/n ) √( (N − n)/(N − 1) )

We can see that if the population is very large, the corrective factor is
negligible.
O. Dagnelie (SSSIHL) Stat Inf 2024-25 100 / 104
Confidence Interval Estimation

Sample size
In practice, it often happens that we want to find the sample size n
necessary for the width of the confidence interval constructed, at confidence level
(1 − α), to be at most equal to 2 MoE.
Using the sampling error formula, we obtain :

√n = z_{1−α/2} σ/MoE ⇒ n = z²_{1−α/2} σ²/MoE²

And in the particular case of a proportion :

√n = z_{1−α/2} √(p̂(1 − p̂))/MoE ⇒ n = z²_{1−α/2} p̂(1 − p̂)/MoE²

In the case of Bernoulli sampling, we may want to find the minimum
sample size n required to know, at confidence level (1 − α), the unknown
proportion p to within 0.01 (to within 1%).

O. Dagnelie (SSSIHL) Stat Inf 2024-25 101 / 104


Confidence Interval Estimation

Since we do not know the proportion p, we can replace p(1 − p) by the largest
value it can take, i.e. its value at p = .5 (the value which cancels out ∂(p − p²)/∂p).
We then have

n = z²_{1−α/2} / (4 MoE²)

⇒ for frequent values of α (5% and 1%) and MoE (10%, 5% and 1%) :

          MoE = .1   MoE = .05   MoE = .01
α = .05       97        385         9604
α = .01      166        664        16577

Note that :
(1) the minimum sample size must be an integer (i.e. rounded up) ;
(2) if you have an estimate of p, you can obviously use it.

O. Dagnelie (SSSIHL) Stat Inf 2024-25 102 / 104
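The table can be reproduced with a few lines of code (an added sketch; Python with SciPy). Small discrepancies with the printed values (e.g. the α = .01, MoE = .01 cell) come only from how much z_{1−α/2} is rounded before squaring.

import math
from scipy.stats import norm

def n_required(conf, moe, p=0.5):
    """Minimum n so that the half-width of the proportion CI is at most moe."""
    z = norm.ppf(1 - (1 - conf) / 2)
    return math.ceil(z**2 * p * (1 - p) / moe**2)

for conf in (0.95, 0.99):
    print(f"1 - α = {conf}:", [n_required(conf, moe) for moe in (0.10, 0.05, 0.01)])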


Confidence Interval Estimation

Exercise 1
Scholastic Aptitude Test (SAT) mathematics scores of a random sample of
500 high school seniors in the state of Texas are collected, and the sample
mean and standard deviation are found to be 501 and 112, respectively.
Find a 99% confidence interval on the mean SAT mathematics score for
seniors in the state of Texas.
Source : Probability & Statistics for Engineers & Scientists, Walpole et al.,
9th edition, Prentice Hall

Exercise 2
A sample of 100 voters chosen at random from all the voters in a borough
showed that 54 of them were in favour of a certain candidate.
a) Construct a 98% confidence interval for the percentage of votes received
by this candidate.
b) What sample size would be needed to obtain an estimate to within 2%
with a probability of 95% ?

O. Dagnelie (SSSIHL) Stat Inf 2024-25 103 / 104
