
Summer School in Statistics for Astronomers

VIII
June 4-8, 2012

Inference II:
Maximum Likelihood Estimation,
the Cramér-Rao Inequality, and
the Bayesian Information Criterion

James L Rosenberger

Acknowledgements:
Donald Richards,
Thomas P Hettmansperger
Department of Statistics
Center for Astrostatistics
Penn State University
1
The Method of Maximum Likelihood

R. A. Fisher (1912), “On an absolute criterion for fitting frequency curves,” Messenger of Math. 41, 155–160

Fisher’s first mathematical paper, written while a final-year undergraduate in mathematics and mathematical physics at Cambridge University

It’s not clear what motivated Fisher to study this subject; perhaps it was the influence of his tutor, the astronomer F. J. M. Stratton.

Fisher’s paper started with a criticism of two methods of curve fitting: least squares and the method of moments.

2
X: a random variable

θ is a parameter

f (x; θ): A statistical model for X

X1, . . . , Xn: A random sample from X

We want to construct good estimators for θ

3
Protheroe et al., “Interpretation of cosmic ray composition – The path length distribution,” ApJ 247 (1981)

X: Length of paths

Parameter: θ > 0

Model: The exponential distribution,

f (x; θ) = θ−1 exp(−x/θ), x>0

Under this model,

E(X) = ∫₀^∞ x f(x; θ) dx = θ

Intuition suggests using X̄ to estimate θ

X̄ is unbiased and consistent

4
LF (luminosity function) for globular clusters in the Milky Way; van den Bergh’s normal model,

f(x) = (1/(√(2π) σ)) exp[−(x − µ)²/(2σ²)]

µ: Mean visual absolute magnitude

σ: Std. deviation of visual absolute magnitude

X̄ is a good estimator for µ

S 2 is a good estimator for σ 2

We seek a method which produces good estimators automatically

Fisher’s brilliant idea: The method of maximum likelihood

5
Choose a globular cluster at random; what is
the chance that the LF will be exactly -7.1
mag? Exactly -7.2 mag?

For any continuous random variable X,

P (X = c) = 0

Suppose X ∼ N (µ = −6.9, σ 2 = 1.21)

X has probability density function

f(x) = (1/(√(2π) σ)) exp[−(x − µ)²/(2σ²)]

P(X = −7.1) = 0, but

f(−7.1) = (1/(1.1√(2π))) exp[−(−7.1 + 6.9)²/(2(1.1)²)] = 0.36

6
Interpretation: In one simulation of the random variable X, the “likelihood” of observing the number −7.1 is 0.36

f(−7.2) = 0.35

In one simulation of X, the value x = −7.1 is about 2% more likely to be observed than the value x = −7.2

x = −6.9 is the value which has the greatest (or maximum) likelihood, for it is where the probability density function is at its maximum

7
Return to a general model f (x; θ)

Random sample: X1, . . . , Xn

Recall that the Xi are independent random variables

The joint probability density function of the sample is

f(x1; θ)f(x2; θ) · · · f(xn; θ)

Here the variables are the X’s, while θ is fixed

Fisher’s brilliant idea: Reverse the roles of the x’s and θ

Regard the X’s as fixed and θ as the variable

8
The likelihood function is

L(θ; X1, . . . , Xn) = f (X1; θ)f (X2; θ) · · · f (Xn; θ)

Simpler notation: L(θ)

θ̂, the maximum likelihood estimator of θ, is the value of θ where L is maximized

θ̂ is a function of the X’s

Caution: The MLE is not always unique.

9
Example: “... cosmic ray composition - The
path length distribution ...”

X: Length of paths

Parameter: θ > 0

Model: The exponential distribution,

f (x; θ) = θ−1 exp(−x/θ), x>0

Random sample: X1, . . . , Xn

Likelihood function:

L(θ) = f (X1; θ)f (X2; θ) · · · f (Xn; θ)


= θ−n exp(−(X1 + · · · + Xn)/θ)
= θ−n exp(−nX̄/θ)

10
To maximize L, we use calculus

It is also equivalent to maximize ln L:

ln L(θ) = −n ln(θ) − nX̄θ⁻¹

d/dθ ln L(θ) = −nθ⁻¹ + nX̄θ⁻²

d²/dθ² ln L(θ) = nθ⁻² − 2nX̄θ⁻³

Solve the equation d ln L(θ)/dθ = 0:

θ = X̄

Check that d² ln L(θ)/dθ² < 0 at θ = X̄

ln L(θ) is maximized at θ = X̄

Conclusion: The MLE of θ is θ̂ = X̄
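
A minimal numerical check (a Python sketch with simulated data, not the Protheroe et al. measurements): maximizing ln L numerically recovers the closed-form MLE X̄.

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.5, size=500)   # simulated path lengths; true theta = 2.5

def neg_log_lik(theta):
    # -ln L(theta) = n ln(theta) + n*xbar/theta
    return len(x) * np.log(theta) + x.sum() / theta

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 100.0), method="bounded")
print(res.x, x.mean())   # the numerical maximizer and X-bar agree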

11
LF for globular clusters; X ∼ N (µ, σ 2)

f(x; µ, σ²) = (1/(√(2π) σ)) exp[−(x − µ)²/(2σ²)]

Assume that σ is known (1.1 mag, say)

Random sample: X1, . . . , Xn

Likelihood function:

L(µ) = f(X1; µ)f(X2; µ) · · · f(Xn; µ)

     = (2πσ²)^(−n/2) exp[−(1/(2σ²)) Σ_{i=1}^n (Xi − µ)²]

Maximize ln L using calculus: µ̂ = X̄

12
LF for globular clusters; X ∼ N (µ, σ 2)

f(x; µ, σ²) = (1/√(2πσ²)) exp[−(x − µ)²/(2σ²)]

Both µ and σ are unknown

A likelihood function of two variables,

L(µ, σ²) = f(X1; µ, σ²) · · · f(Xn; µ, σ²)

         = (2πσ²)^(−n/2) exp[−(1/(2σ²)) Σ_{i=1}^n (Xi − µ)²]

ln L = −(n/2) ln(2π) − (n/2) ln(σ²) − (1/(2σ²)) Σ_{i=1}^n (Xi − µ)²

∂/∂µ ln L = (1/σ²) Σ_{i=1}^n (Xi − µ)

∂/∂(σ²) ln L = −n/(2σ²) + (1/(2(σ²)²)) Σ_{i=1}^n (Xi − µ)²

13
Solve for µ and σ² the simultaneous equations:

∂/∂µ ln L = 0,   ∂/∂(σ²) ln L = 0

We also verify that L is concave at the solutions of these equations (Hessian matrix)

Conclusion: The MLEs are

µ̂ = X̄,    σ̂² = (1/n) Σ_{i=1}^n (Xi − X̄)²

µ̂ is unbiased: E(µ̂) = µ

σ̂² is not unbiased: E(σ̂²) = ((n − 1)/n) σ² ≠ σ²

For this reason, we use (n/(n − 1)) σ̂² ≡ S²

14
Calculus cannot always be used to find MLEs

Example: “... cosmic ray composition ...”

Parameter: θ > 0

Model: f(x; θ) = exp(−(x − θ)) for x ≥ θ, and f(x; θ) = 0 for x < θ

Random sample: X1, . . . , Xn


L(θ) = f(X1; θ) · · · f(Xn; θ)
     = exp(−Σ_{i=1}^n (Xi − θ)) if all Xi ≥ θ, and 0 otherwise

X(1): The smallest observation in the sample

“all Xi ≥ θ” is equivalent to “X(1) ≥ θ”

L(θ) = exp(−n(X̄ − θ)) for θ ≤ X(1), and 0 otherwise

On θ ≤ X(1), L(θ) is an increasing function of θ, so it is maximized at the right endpoint

Conclusion: θ̂ = X(1)
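
A quick simulation sketch (Python, assumed values) illustrating that the sample minimum estimates θ here:

import numpy as np

rng = np.random.default_rng(1)
theta_true = 2.0
x = theta_true + rng.exponential(scale=1.0, size=200)   # shifted-exponential sample

theta_hat = x.min()   # L(theta) increases up to X_(1) and is 0 beyond it
print(theta_hat)      # slightly above theta_true; the bias shrinks as n grows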
15
General Properties of the MLE θ̂

(a) θ̂ may not be unbiased. We often can remove this bias by multiplying θ̂ by a constant.

(b) For many models, θ̂ is consistent.

(c) The Invariance Property: For many nice functions g, if θ̂ is the MLE of θ then g(θ̂) is the MLE of g(θ).

(d) The Asymptotic Property: For large n, θ̂ has an approximate normal distribution with mean θ and variance 1/B where

B = nE[(∂/∂θ ln f(X; θ))²]

The asymptotic property can be used to develop confidence intervals for θ
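
For the exponential path-length model, E[(∂/∂θ ln f(X; θ))²] = 1/θ² (this is verified on p. 25), so B = n/θ². A rough Python sketch of the resulting approximate 95% confidence interval, with simulated data and θ estimated in the standard error:

import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(scale=2.5, size=400)   # simulated sample

n, theta_hat = len(x), x.mean()
se = theta_hat / np.sqrt(n)                # approximately 1/sqrt(B), with B = n/theta^2
print(theta_hat - 1.96 * se, theta_hat + 1.96 * se)   # approximate 95% CI for theta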
16
The method of maximum likelihood works well
when intuition fails and no obvious estimator
can be found.

When an obvious estimator exists, the method of ML often will find it.

The method can be applied to many statistical problems: regression analysis, analysis of variance, discriminant analysis, hypothesis testing, principal components, etc.

17
The ML Method for Linear Regression Analysis

Scatterplot data: (x1, y1), . . . , (xn, yn)

Basic assumption: The xi’s are non-random measurements; the yi are observations on Y, a random variable

Statistical model:
Yi = α + βxi + εi,   i = 1, . . . , n

Errors ε1, . . . , εn: a random sample from N(0, σ²)

Parameters: α, β, σ 2

Yi ∼ N (α + βxi, σ 2): The Yi’s are independent

The Yi are not identically distributed, because they have differing means

18
The likelihood function is the joint density func-
tion of the observed data, Y1, . . . , Yn
L(α, β, σ²) = Π_{i=1}^n (1/√(2πσ²)) exp[−(Yi − α − βxi)²/(2σ²)]

            = (2πσ²)^(−n/2) exp[−(1/(2σ²)) Σ_{i=1}^n (Yi − α − βxi)²]

Use partial derivatives to maximize L over all α, β and σ² > 0 (Wise advice: Maximize ln L)

The ML estimators are:

β̂ = Σ_{i=1}^n (xi − x̄)(Yi − Ȳ) / Σ_{i=1}^n (xi − x̄)²,    α̂ = Ȳ − β̂x̄

and

σ̂² = (1/n) Σ_{i=1}^n (Yi − α̂ − β̂xi)²
19
The ML Method for Testing Hypotheses

X ∼ N (µ, σ 2); parameters µ and σ 2


 
Model: f(x; µ, σ²) = (1/√(2πσ²)) exp[−(x − µ)²/(2σ²)]

Random sample: X1, . . . , Xn

We wish to test H0 : µ = 3 vs. Ha : µ ≠ 3

Parameter space: The space of all permissible values of the parameters,
Ω = {(µ, σ) : −∞ < µ < ∞, σ > 0}

H0 and Ha represent restrictions on the parameters, so we are led to parameter subspaces
ω0 = {(µ, σ) : µ = 3, σ > 0}
ωa = {(µ, σ) : µ ≠ 3, σ > 0}

20
L(µ, σ²) = f(X1; µ, σ²) · · · f(Xn; µ, σ²)
         = (2πσ²)^(−n/2) exp[−(1/(2σ²)) Σ_{i=1}^n (Xi − µ)²]

Maximize L(µ, σ²) over ω0 and ωa

The likelihood ratio test statistic is

λ = max_{ω0} L(µ, σ²) / max_{ωa∪ω0} L(µ, σ²) = max_{σ>0} L(3, σ²) / max_{σ>0, µ} L(µ, σ²)

Fact: 0 ≤ λ ≤ 1

L(3, σ²) is maximized over ω0 at

σ² = (1/n) Σ_{i=1}^n (Xi − 3)²

21
max_{ω0} L(3, σ²) = L(3, (1/n) Σ_{i=1}^n (Xi − 3)²)
                  = [n / (2πe Σ_{i=1}^n (Xi − 3)²)]^(n/2)

L(µ, σ²) is maximized over ωa at

µ = X̄,   σ² = (1/n) Σ_{i=1}^n (Xi − X̄)²

max_{ωa∪ω0} L(µ, σ²) = L(X̄, (1/n) Σ_{i=1}^n (Xi − X̄)²)
                     = [n / (2πe Σ_{i=1}^n (Xi − X̄)²)]^(n/2)

22
The likelihood ratio test statistic:

λ = [n / (2πe Σ_{i=1}^n (Xi − 3)²)]^(n/2) ÷ [n / (2πe Σ_{i=1}^n (Xi − X̄)²)]^(n/2)

  = [Σ_{i=1}^n (Xi − X̄)² ÷ Σ_{i=1}^n (Xi − 3)²]^(n/2)

λ is close to 1 iff X̄ is close to 3

λ is close to 0 iff X̄ is far from 3

This particular LRT statistic λ is equivalent to the t-statistic seen earlier

In this case, the ML method discovers the obvious test statistic
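
A small Python sketch (simulated sample, assumed values) verifying that λ is a monotone function of the t-statistic:

import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(loc=3.4, scale=1.0, size=25)   # illustrative sample
n, xbar = len(x), x.mean()

ss_alt = ((x - xbar) ** 2).sum()     # sum of squares at the unrestricted MLE (omega_a ∪ omega_0)
ss_null = ((x - 3.0) ** 2).sum()     # sum of squares with mu fixed at 3 (omega_0)
lam = (ss_alt / ss_null) ** (n / 2)  # likelihood ratio statistic, 0 <= lam <= 1

t = (xbar - 3.0) / (x.std(ddof=1) / np.sqrt(n))
print(lam, (1.0 + t**2 / (n - 1)) ** (-n / 2))   # the two values coincide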

23
Given two unbiased estimators, we prefer the
one with smaller variance

In our quest for unbiased estimators with minimum possible variance, we need to know how small their variances can be

Parameter: θ

X: Random variable with model f (x; θ)

The “support” of f is the region where f > 0

We assume that the “support” of f does not depend on θ

Random sample: X1, . . . , Xn

Y : An unbiased estimator of θ

The Cramér-Rao Inequality: The smallest possible value that Var(Y) can attain is 1/B, where

B = nE[(∂/∂θ ln f(X; θ))²] = −nE[∂²/∂θ² ln f(X; θ)]
24
Example: “... cosmic ray composition - The
path length distribution ...”

X: Length of paths

Parameter: θ > 0

Model: f (x; θ) = θ−1 exp(−x/θ), x>0

ln f(X; θ) = −ln θ − θ⁻¹X

∂²/∂θ² ln f(X; θ) = θ⁻² − 2θ⁻³X

E[∂²/∂θ² ln f(X; θ)] = E(θ⁻² − 2θ⁻³X)
                     = θ⁻² − 2θ⁻³E(X)
                     = θ⁻² − 2θ⁻³θ
                     = −θ⁻²

The smallest possible value of Var(Y) is θ²/n

This is attained by X̄. For this problem, X̄ is the best unbiased estimator of θ
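
A brief simulation sketch (Python, assumed θ and n) comparing the empirical variance of X̄ with the Cramér-Rao bound θ²/n:

import numpy as np

rng = np.random.default_rng(5)
theta, n, reps = 2.5, 50, 20000

xbars = rng.exponential(scale=theta, size=(reps, n)).mean(axis=1)
print(xbars.var(), theta**2 / n)   # empirical Var(X-bar) vs. the Cramer-Rao bound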
25
Y : An unbiased estimator of a parameter θ

We compare Var(Y) with 1/B, the lower bound in the Cramér-Rao inequality:

(1/B) ÷ Var(Y)
B

This number is called the efficiency of Y

Obviously, 0 ≤ efficiency ≤ 1

If Y has 50% efficiency then about 1/0.5 = 2 times as many sample observations are needed for Y to perform as well as the MVUE (the minimum variance unbiased estimator).

The use of Y results in confidence intervals which generally are longer than those arising from the MVUE.

If the MLE is unbiased then, as n becomes large, its efficiency increases to 1.
26
The Cramér-Rao inequality states that if Y is
any unbiased estimator of θ then

Var(Y) ≥ 1 / (nE[(∂/∂θ ln f(X; θ))²])

The Heisenberg uncertainty principle is known to be a consequence of the Cramér-Rao inequality.

Dembo, Cover, and Thomas (1991) provide a unified treatment of the Cramér-Rao inequality, the Heisenberg uncertainty principle, entropy inequalities, Fisher information, and many other inequalities in statistics, mathematics, information theory, and physics. This remarkable paper demonstrates that there is a basic oneness among these various fields.

Reference

Dembo, Cover, and Thomas (1991), “Information-theoretic inequalities,” IEEE Trans. Information Theory 37, 1501–1518.
27
The Bayesian Information Criterion

Suppose that we have two competing statistical models

We can fit these models using residual sums of squares, the method of moments, the method of maximum likelihood, ...

The choice of model cannot be assessed entirely by these methods

By increasing the number of parameters, we can always reduce the residual sums of squares

Polynomial regression: By increasing the number of terms, we can reduce the residual sum of squares

28
More complicated models generally will have
lower residual errors

A standard approach to hypothesis testing for large data sets is to use the Bayesian information criterion (BIC).

The BIC penalizes models with greater numbers of free parameters

Two competing models:
f1(x; θ1, . . . , θm1) and f2(x; φ1, . . . , φm2)

Random sample: X1, . . . , Xn

Likelihood functions:
L1(θ1, . . . , θm1) and L2(φ1, . . . , φm2)

Bayesian Information Criterion:

BIC = 2 ln[L1(θ1, . . . , θm1) / L2(φ1, . . . , φm2)] − (m1 − m2) ln n
The BIC balances any improvement in the like-
lihood with the number of model parameters
used to achieve that improvement
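
A minimal helper (Python) for this definition; the numbers in the example line are the ones used in the exercise below, not new results:

import numpy as np

def bic_comparison(loglik1, m1, loglik2, m2, n):
    # Estimated BIC of Model 1 relative to Model 2, as defined above
    return 2.0 * (loglik1 - loglik2) - (m1 - m2) * np.log(n)

print(bic_comparison(-176.4, 2, -173.0, 3, 100))   # about -2.2 (see p. 35)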
29
Calculate all MLEs θ̂i and φ̂i

Compute the estimated BIC, denoted B̂IC:

B̂IC = 2 ln[L1(θ̂1, . . . , θ̂m1) / L2(φ̂1, . . . , φ̂m2)] − (m1 − m2) ln n

General rules:

B̂IC < 2: Weak evidence that Model 1 is superior to Model 2

2 ≤ B̂IC ≤ 6: Moderate evidence that Model 1 is superior to Model 2

6 < B̂IC ≤ 10: Strong evidence that Model 1 is superior to Model 2

B̂IC > 10: Very strong evidence that Model 1 is superior to Model 2
30
Exercise: Two competing models for globular
cluster LF in the Galaxy

1. A Gaussian model (van den Bergh, 1985)

f(x; µ, σ) = (1/(√(2π) σ)) exp[−(x − µ)²/(2σ²)]

2. A t-distn. model (Secker 1992, AJ 104)

g(x; µ, σ, δ) = [Γ((δ + 1)/2) / (√(πδ) σ Γ(δ/2))] [1 + (x − µ)²/(δσ²)]^(−(δ+1)/2),
−∞ < µ < ∞, σ > 0, δ > 0

In each model, µ is the mean. In Model 1, σ² is the variance; in Model 2, σ is a scale parameter and δ is a shape parameter.

Maximum likelihood calculations suggest that Model 1 is inferior to Model 2.

Question: Is the increase in likelihood due to the larger number of parameters?

This question can be studied using the BIC.


31
We use the data of Secker (1992), Table 1

32
We assume that the data constitute a random sample
Model 1: Write down the likelihood function,

L1(µ, σ) = f(X1; µ, σ) · · · f(Xn; µ, σ)
         = (2πσ²)^(−n/2) exp[−(1/(2σ²)) Σ_{i=1}^n (Xi − µ)²]

Estimate µ with X̄, the ML estimator. Also, estimate σ² with S², a constant multiple of the ML estimator of σ².

Note that

L1(X̄, S) = (2πS²)^(−n/2) exp[−(1/(2S²)) Σ_{i=1}^n (Xi − X̄)²]
         = (2πS²)^(−n/2) exp(−(n − 1)/2)

Calculate x̄ and s², the sample mean and variance of the Milky Way data. Use these values to calculate L1(x̄, s)

Secker (1992, p. 1476): ln L1(x̄, s) = −176.4
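
A small Python sketch of this computation; with the 100 magnitudes of Secker’s Table 1 (not reproduced here) it should give roughly −176.4.

import numpy as np

def log_L1(x):
    # ln L1(x-bar, s) for the Gaussian model, using S^2 with divisor n - 1
    n = len(x)
    s2 = x.var(ddof=1)
    return -0.5 * n * np.log(2 * np.pi * s2) - 0.5 * (n - 1)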


33
Model 2: Write down the likelihood function,

L2(µ, σ, δ) = g(X1; µ, σ, δ) · · · g(Xn; µ, σ, δ)

            = Π_{i=1}^n [Γ((δ + 1)/2) / (√(πδ) σ Γ(δ/2))] [1 + (Xi − µ)²/(δσ²)]^(−(δ+1)/2)

Are the MLEs of µ, σ 2, δ unique?

No explicit formulas for the MLEs are known; we must evaluate them numerically

Substitute the Milky Way data for the Xi’s in the formula for L2, and maximize L2 numerically.
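
One way to carry out this numerical maximization (a Python sketch; the data file name is hypothetical, and the starting values are guesses):

import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

def neg_log_L2(params, x):
    # negative log-likelihood of the t model g(x; mu, sigma, delta) above
    mu, sigma, delta = params
    if sigma <= 0 or delta <= 0:
        return np.inf
    z2 = (x - mu) ** 2 / (delta * sigma ** 2)
    ll = (gammaln((delta + 1) / 2) - gammaln(delta / 2)
          - 0.5 * np.log(np.pi * delta) - np.log(sigma)
          - (delta + 1) / 2 * np.log1p(z2))
    return -ll.sum()

# x = np.loadtxt("secker_table1.txt")   # hypothetical file holding the 100 magnitudes
# res = minimize(neg_log_L2, x0=[-7.0, 1.0, 3.0], args=(x,), method="Nelder-Mead")
# res.x should then be close to Secker's values below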

Secker (1992): µ̂ = −7.31, σ̂ = 1.03, δ̂ = 3.55

Calculate L2(−7.31, 1.03, 3.55)

Secker (1992, p. 1476): ln L2(−7.31, 1.03, 3.55) = −173.0
34
Finally, calculate the estimated BIC:

B̂IC = 2 ln[L1(x̄, s) / L2(−7.31, 1.03, 3.55)] − (m1 − m2) ln n

where m1 = 2, m2 = 3, n = 100

B̂IC = 2[ln L1(x̄, s) − ln L2(−7.31, 1.03, 3.55)] + ln 100
    = 2[−176.4 − (−173.0)] + ln 100
    = −2.2

Apply the General Rules on p. 30 to assess the strength of the evidence that Model 1 may be superior to Model 2.

Since B̂IC < 2 (indeed B̂IC is negative), there is no evidence that the Gaussian model is superior to the t-distribution model.

Reversing the roles of the two models gives B̂IC = 2.2, moderate evidence that the t-distribution model for the GCLF is superior to the Gaussian model.
35
Concluding general remarks on the BIC

The BIC procedure is consistent: If Model 1 is the true model then, as n → ∞, the BIC will determine that it is.

Not all information criteria are consistent.

The BIC is not a panacea; some authors recommend that it be used in conjunction with other information criteria.

There are also difficulties with the BIC

Findley (1991, Ann. Inst. Statist. Math.) studied the performance of the BIC for comparing two models with different numbers of parameters: “Suppose that the log-likelihood-ratio sequence of two models with different numbers of estimated parameters is bounded in probability. Then the BIC will, with asymptotic probability 1, select the model having fewer parameters.”
36
