
Summer School in Statistics for Astronomers

VIII
June 4-8, 2012

Inference II:
Maximum Likelihood Estimation,
the Cramér-Rao Inequality, and
the Bayesian Information Criterion

James L Rosenberger

Acknowledgements:
Donald Richards,
Thomas P Hettmansperger
Department of Statistics
Center for Astrostatistics
Penn State University
1
The Method of Maximum Likelihood

R. A. Fisher (1912), “On an absolute criterion for fitting frequency curves,” Messenger of Math. 41, 155–160

Fisher’s first mathematical paper, written while a final-year undergraduate in mathematics and mathematical physics at Cambridge University

It’s not clear what motivated Fisher to study this subject; perhaps it was the influence of his tutor, the astronomer F. J. M. Stratton.

Fisher’s paper started with a criticism of two methods of curve fitting: least squares and the method of moments.

2
X: a random variable

θ is a parameter

f (x; θ): A statistical model for X

X1, . . . , Xn: A random sample from X

We want to construct good estimators for θ

3
Protheroe et al., “Interpretation of cosmic ray composition – The path length distribution,” ApJ 247 (1981)

X: Length of paths

Parameter: θ > 0

Model: The exponential distribution,

f (x; θ) = θ−1 exp(−x/θ), x>0

Under this model,

E(X) = ∫₀^∞ x f(x; θ) dx = θ

Intuition suggests using X̄ to estimate θ

X̄ is unbiased and consistent

4
LF (luminosity function) for globular clusters in the Milky Way; van den Bergh’s normal model,

f(x) = (1/(√(2π) σ)) exp[−(x − µ)²/(2σ²)]

µ: Mean visual absolute magnitude

σ: Std. deviation of visual absolute magnitude

X̄ is a good estimator for µ

S 2 is a good estimator for σ 2

We seek a method which produces good estimators automatically

Fisher’s brilliant idea: The method of maximum likelihood

5
Choose a globular cluster at random; what is
the chance that the LF will be exactly -7.1
mag? Exactly -7.2 mag?

For any continuous random variable X,

P (X = c) = 0

Suppose X ∼ N (µ = −6.9, σ 2 = 1.21)

X has probability density function

f(x) = (1/(√(2π) σ)) exp[−(x − µ)²/(2σ²)]

P(X = −7.1) = 0, but

f(−7.1) = (1/(1.1√(2π))) exp[−(−7.1 + 6.9)²/(2(1.1)²)] = 0.36

6
Interpretation: In one simulation of the random variable X, the “likelihood” of observing the number −7.1 is 0.36

f(−7.2) = 0.35

In one simulation of X, the value x = −7.1 is about 2% more likely to be observed than the value x = −7.2

x = −6.9 is the value which has the greatest (or maximum) likelihood, for it is where the probability density function is at its maximum

7
Return to a general model f (x; θ)

Random sample: X1, . . . , Xn

Recall that the Xi are independent random variables

The joint probability density function of the sample is

f(x1; θ)f(x2; θ) · · · f(xn; θ)

Here the variables are the X’s, while θ is fixed

Fisher’s brilliant idea: Reverse the roles of the x’s and θ

Regard the X’s as fixed and θ as the variable

8
The likelihood function is

L(θ; X1, . . . , Xn) = f (X1; θ)f (X2; θ) · · · f (Xn; θ)

Simpler notation: L(θ)

θ̂, the maximum likelihood estimator of θ, is the value of θ where L is maximized

θ̂ is a function of the X’s

Caution: The MLE is not always unique.

9
Example: “... cosmic ray composition - The
path length distribution ...”

X: Length of paths

Parameter: θ > 0

Model: The exponential distribution,

f (x; θ) = θ−1 exp(−x/θ), x>0

Random sample: X1, . . . , Xn

Likelihood function:

L(θ) = f (X1; θ)f (X2; θ) · · · f (Xn; θ)


= θ−n exp(−(X1 + · · · + Xn)/θ)
= θ−n exp(−nX̄/θ)

10
To maximize L, we use calculus

It is also equivalent to maximize ln L:

ln L(θ) = −n ln(θ) − nX̄θ⁻¹

d/dθ ln L(θ) = −nθ⁻¹ + nX̄θ⁻²

d²/dθ² ln L(θ) = nθ⁻² − 2nX̄θ⁻³

Solve the equation d ln L(θ)/dθ = 0:

θ = X̄

Check that d² ln L(θ)/dθ² < 0 at θ = X̄

ln L(θ) is maximized at θ = X̄

Conclusion: The MLE of θ is θ̂ = X̄
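
A minimal numerical check (a Python sketch with simulated data, not the Protheroe et al. measurements): maximizing ln L numerically recovers the closed-form MLE X̄.

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.5, size=500)   # simulated path lengths; true theta = 2.5

def neg_log_lik(theta):
    # -ln L(theta) = n ln(theta) + n*xbar/theta
    return len(x) * np.log(theta) + x.sum() / theta

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 100.0), method="bounded")
print(res.x, x.mean())   # the numerical maximizer and X-bar agree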

11
LF for globular clusters; X ∼ N (µ, σ 2)

f(x; µ, σ²) = (1/(√(2π) σ)) exp[−(x − µ)²/(2σ²)]

Assume that σ is known (1.1 mag, say)

Random sample: X1, . . . , Xn

Likelihood function:

L(µ) = f(X1; µ)f(X2; µ) · · · f(Xn; µ)

     = (2πσ²)^(−n/2) exp[−(1/(2σ²)) Σ_{i=1}^n (Xi − µ)²]

Maximize ln L using calculus: µ̂ = X̄

12
LF for globular clusters; X ∼ N (µ, σ 2)

f(x; µ, σ²) = (1/√(2πσ²)) exp[−(x − µ)²/(2σ²)]

Both µ and σ are unknown

A likelihood function of two variables,

L(µ, σ²) = f(X1; µ, σ²) · · · f(Xn; µ, σ²)

         = (2πσ²)^(−n/2) exp[−(1/(2σ²)) Σ_{i=1}^n (Xi − µ)²]

ln L = −(n/2) ln(2π) − (n/2) ln(σ²) − (1/(2σ²)) Σ_{i=1}^n (Xi − µ)²

∂/∂µ ln L = (1/σ²) Σ_{i=1}^n (Xi − µ)

∂/∂(σ²) ln L = −n/(2σ²) + (1/(2(σ²)²)) Σ_{i=1}^n (Xi − µ)²

13
Solve for µ and σ² the simultaneous equations:

∂/∂µ ln L = 0,   ∂/∂(σ²) ln L = 0

We also verify that L is concave at the solutions of these equations (Hessian matrix)

Conclusion: The MLEs are

µ̂ = X̄,    σ̂² = (1/n) Σ_{i=1}^n (Xi − X̄)²

µ̂ is unbiased: E(µ̂) = µ

σ̂² is not unbiased: E(σ̂²) = ((n − 1)/n) σ² ≠ σ²

For this reason, we use (n/(n − 1)) σ̂² ≡ S²

14
Calculus cannot always be used to find MLEs

Example: “... cosmic ray composition ...”

Parameter: θ > 0

Model: f(x; θ) = exp(−(x − θ)) for x ≥ θ, and f(x; θ) = 0 for x < θ

Random sample: X1, . . . , Xn


L(θ) = f(X1; θ) · · · f(Xn; θ)
     = exp(−Σ_{i=1}^n (Xi − θ)) if all Xi ≥ θ, and 0 otherwise

X(1): The smallest observation in the sample

“all Xi ≥ θ” is equivalent to “X(1) ≥ θ”

L(θ) = exp(−n(X̄ − θ)) for θ ≤ X(1), and 0 otherwise

On θ ≤ X(1), L(θ) is an increasing function of θ, so it is maximized at the right endpoint

Conclusion: θ̂ = X(1)
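
A quick simulation sketch (Python, assumed values) illustrating that the sample minimum estimates θ here:

import numpy as np

rng = np.random.default_rng(1)
theta_true = 2.0
x = theta_true + rng.exponential(scale=1.0, size=200)   # shifted-exponential sample

theta_hat = x.min()   # L(theta) increases up to X_(1) and is 0 beyond it
print(theta_hat)      # slightly above theta_true; the bias shrinks as n grows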
15
General Properties of the MLE θ̂

(a) θ̂ may not be unbiased. We often can remove this bias by multiplying θ̂ by a constant.

(b) For many models, θ̂ is consistent.

(c) The Invariance Property: For many nice functions g, if θ̂ is the MLE of θ then g(θ̂) is the MLE of g(θ).

(d) The Asymptotic Property: For large n, θ̂ has an approximate normal distribution with mean θ and variance 1/B where

B = nE[(∂/∂θ ln f(X; θ))²]

The asymptotic property can be used to develop confidence intervals for θ
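
For the exponential path-length model, E[(∂/∂θ ln f(X; θ))²] = 1/θ² (this is verified on p. 25), so B = n/θ². A rough Python sketch of the resulting approximate 95% confidence interval, with simulated data and θ estimated in the standard error:

import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(scale=2.5, size=400)   # simulated sample

n, theta_hat = len(x), x.mean()
se = theta_hat / np.sqrt(n)                # approximately 1/sqrt(B), with B = n/theta^2
print(theta_hat - 1.96 * se, theta_hat + 1.96 * se)   # approximate 95% CI for theta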
16
The method of maximum likelihood works well
when intuition fails and no obvious estimator
can be found.

When an obvious estimator exists, the method of ML often will find it.

The method can be applied to many statistical problems: regression analysis, analysis of variance, discriminant analysis, hypothesis testing, principal components, etc.

17
The ML Method for Linear Regression Analysis

Scatterplot data: (x1, y1), . . . , (xn, yn)

Basic assumption: The xi’s are non-random measurements; the yi are observations on Y, a random variable

Statistical model:
Yi = α + βxi + εi,   i = 1, . . . , n

Errors ε1, . . . , εn: a random sample from N(0, σ²)

Parameters: α, β, σ 2

Yi ∼ N (α + βxi, σ 2): The Yi’s are independent

The Yi are not identically distributed, because they have differing means

18
The likelihood function is the joint density func-
tion of the observed data, Y1, . . . , Yn
L(α, β, σ²) = Π_{i=1}^n (1/√(2πσ²)) exp[−(Yi − α − βxi)²/(2σ²)]

            = (2πσ²)^(−n/2) exp[−(1/(2σ²)) Σ_{i=1}^n (Yi − α − βxi)²]

Use partial derivatives to maximize L over all α, β and σ² > 0 (Wise advice: Maximize ln L)

The ML estimators are:

β̂ = Σ_{i=1}^n (xi − x̄)(Yi − Ȳ) / Σ_{i=1}^n (xi − x̄)²,    α̂ = Ȳ − β̂x̄

and

σ̂² = (1/n) Σ_{i=1}^n (Yi − α̂ − β̂xi)²
19
The ML Method for Testing Hypotheses

X ∼ N (µ, σ 2); parameters µ and σ 2


 
Model: f(x; µ, σ²) = (1/√(2πσ²)) exp[−(x − µ)²/(2σ²)]

Random sample: X1, . . . , Xn

We wish to test H0 : µ = 3 vs. Ha : µ ≠ 3

Parameter space: The space of all permissible values of the parameters,
Ω = {(µ, σ) : −∞ < µ < ∞, σ > 0}

H0 and Ha represent restrictions on the parameters, so we are led to parameter subspaces
ω0 = {(µ, σ) : µ = 3, σ > 0}
ωa = {(µ, σ) : µ ≠ 3, σ > 0}

20
L(µ, σ²) = f(X1; µ, σ²) · · · f(Xn; µ, σ²)
         = (2πσ²)^(−n/2) exp[−(1/(2σ²)) Σ_{i=1}^n (Xi − µ)²]

Maximize L(µ, σ²) over ω0 and ωa

The likelihood ratio test statistic is

λ = max_{ω0} L(µ, σ²) / max_{ωa∪ω0} L(µ, σ²) = max_{σ>0} L(3, σ²) / max_{σ>0, µ} L(µ, σ²)

Fact: 0 ≤ λ ≤ 1

L(3, σ²) is maximized over ω0 at

σ² = (1/n) Σ_{i=1}^n (Xi − 3)²

21
max_{ω0} L(3, σ²) = L(3, (1/n) Σ_{i=1}^n (Xi − 3)²)
                  = [n / (2πe Σ_{i=1}^n (Xi − 3)²)]^(n/2)

L(µ, σ²) is maximized over ωa at

µ = X̄,   σ² = (1/n) Σ_{i=1}^n (Xi − X̄)²

max_{ωa∪ω0} L(µ, σ²) = L(X̄, (1/n) Σ_{i=1}^n (Xi − X̄)²)
                     = [n / (2πe Σ_{i=1}^n (Xi − X̄)²)]^(n/2)

22
The likelihood ratio test statistic:

λ = [n / (2πe Σ_{i=1}^n (Xi − 3)²)]^(n/2) ÷ [n / (2πe Σ_{i=1}^n (Xi − X̄)²)]^(n/2)

  = [Σ_{i=1}^n (Xi − X̄)² ÷ Σ_{i=1}^n (Xi − 3)²]^(n/2)

λ is close to 1 iff X̄ is close to 3

λ is close to 0 iff X̄ is far from 3

This particular LRT statistic λ is equivalent to the t-statistic seen earlier

In this case, the ML method discovers the obvious test statistic
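
A small Python sketch (simulated sample, assumed values) verifying that λ is a monotone function of the t-statistic:

import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(loc=3.4, scale=1.0, size=25)   # illustrative sample
n, xbar = len(x), x.mean()

ss_alt = ((x - xbar) ** 2).sum()     # sum of squares at the unrestricted MLE (omega_a ∪ omega_0)
ss_null = ((x - 3.0) ** 2).sum()     # sum of squares with mu fixed at 3 (omega_0)
lam = (ss_alt / ss_null) ** (n / 2)  # likelihood ratio statistic, 0 <= lam <= 1

t = (xbar - 3.0) / (x.std(ddof=1) / np.sqrt(n))
print(lam, (1.0 + t**2 / (n - 1)) ** (-n / 2))   # the two values coincide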

23
Given two unbiased estimators, we prefer the
one with smaller variance

In our quest for unbiased estimators with minimum possible variance, we need to know how small their variances can be

Parameter: θ

X: Random variable with model f (x; θ)

The “support” of f is the region where f > 0

We assume that the “support” of f does not depend on θ

Random sample: X1, . . . , Xn

Y : An unbiased estimator of θ

The Cramér-Rao Inequality: The smallest possible value that Var(Y) can attain is 1/B, where

B = nE[(∂/∂θ ln f(X; θ))²] = −nE[∂²/∂θ² ln f(X; θ)]
24
Example: “... cosmic ray composition - The
path length distribution ...”

X: Length of paths

Parameter: θ > 0

Model: f (x; θ) = θ−1 exp(−x/θ), x>0

ln f(X; θ) = −ln θ − θ⁻¹X

∂²/∂θ² ln f(X; θ) = θ⁻² − 2θ⁻³X

E[∂²/∂θ² ln f(X; θ)] = E(θ⁻² − 2θ⁻³X)
                     = θ⁻² − 2θ⁻³E(X)
                     = θ⁻² − 2θ⁻³θ
                     = −θ⁻²

The smallest possible value of Var(Y) is θ²/n

This is attained by X̄. For this problem, X̄ is the best unbiased estimator of θ
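
A brief simulation sketch (Python, assumed θ and n) comparing the empirical variance of X̄ with the Cramér-Rao bound θ²/n:

import numpy as np

rng = np.random.default_rng(5)
theta, n, reps = 2.5, 50, 20000

xbars = rng.exponential(scale=theta, size=(reps, n)).mean(axis=1)
print(xbars.var(), theta**2 / n)   # empirical Var(X-bar) vs. the Cramer-Rao bound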
25
Y : An unbiased estimator of a parameter θ

We compare Var(Y) with 1/B, the lower bound in the Cramér-Rao inequality:

(1/B) ÷ Var(Y)
B

This number is called the efficiency of Y

Obviously, 0 ≤ efficiency ≤ 1

If Y has 50% efficiency then about 1/0.5 = 2 times as many sample observations are needed for Y to perform as well as the MVUE (the minimum variance unbiased estimator).

The use of Y results in confidence intervals which generally are longer than those arising from the MVUE.

If the MLE is unbiased then, as n becomes large, its efficiency increases to 1.
26
The Cramér-Rao inequality states that if Y is
any unbiased estimator of θ then

Var(Y) ≥ 1 / (nE[(∂/∂θ ln f(X; θ))²])

The Heisenberg uncertainty principle is known to be a consequence of the Cramér-Rao inequality.

Dembo, Cover, and Thomas (1991) provide a unified treatment of the Cramér-Rao inequality, the Heisenberg uncertainty principle, entropy inequalities, Fisher information, and many other inequalities in statistics, mathematics, information theory, and physics. This remarkable paper demonstrates that there is a basic oneness among these various fields.

Reference

Dembo, Cover, and Thomas (1991), “Information-theoretic inequalities,” IEEE Trans. Information Theory 37, 1501–1518.
27
The Bayesian Information Criterion

Suppose that we have two competing statistical models

We can fit these models using residual sums of squares, the method of moments, the method of maximum likelihood, ...

The choice of model cannot be assessed entirely by these methods

By increasing the number of parameters, we can always reduce the residual sums of squares

Polynomial regression: By increasing the number of terms, we can reduce the residual sum of squares

28
More complicated models generally will have
lower residual errors

A standard approach to hypothesis testing for large data sets is to use the Bayesian information criterion (BIC).

The BIC penalizes models with greater numbers of free parameters

Two competing models:
f1(x; θ1, . . . , θm1) and f2(x; φ1, . . . , φm2)

Random sample: X1, . . . , Xn

Likelihood functions:
L1(θ1, . . . , θm1) and L2(φ1, . . . , φm2)

Bayesian Information Criterion:

BIC = 2 ln[L1(θ1, . . . , θm1) / L2(φ1, . . . , φm2)] − (m1 − m2) ln n
The BIC balances any improvement in the like-
lihood with the number of model parameters
used to achieve that improvement
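
A minimal helper (Python) for this definition; the numbers in the example line are the ones used in the exercise below, not new results:

import numpy as np

def bic_comparison(loglik1, m1, loglik2, m2, n):
    # Estimated BIC of Model 1 relative to Model 2, as defined above
    return 2.0 * (loglik1 - loglik2) - (m1 - m2) * np.log(n)

print(bic_comparison(-176.4, 2, -173.0, 3, 100))   # about -2.2 (see p. 35)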
29
Calculate all MLEs θ̂i and φ̂i

Compute the estimated BIC, denoted B̂IC:

B̂IC = 2 ln[L1(θ̂1, . . . , θ̂m1) / L2(φ̂1, . . . , φ̂m2)] − (m1 − m2) ln n

General rules:

B̂IC < 2: Weak evidence that Model 1 is superior to Model 2

2 ≤ B̂IC ≤ 6: Moderate evidence that Model 1 is superior to Model 2

6 < B̂IC ≤ 10: Strong evidence that Model 1 is superior to Model 2

B̂IC > 10: Very strong evidence that Model 1 is superior to Model 2
30
Exercise: Two competing models for globular
cluster LF in the Galaxy

1. A Gaussian model (van den Bergh, 1985)

f(x; µ, σ) = (1/(√(2π) σ)) exp[−(x − µ)²/(2σ²)]

2. A t-distn. model (Secker 1992, AJ 104)

g(x; µ, σ, δ) = [Γ((δ + 1)/2) / (√(πδ) σ Γ(δ/2))] [1 + (x − µ)²/(δσ²)]^(−(δ+1)/2),
−∞ < µ < ∞, σ > 0, δ > 0

In each model, µ is the mean. In Model 1, σ² is the variance; in Model 2, σ is a scale parameter and δ is a shape parameter.

Maximum likelihood calculations suggest that Model 1 is inferior to Model 2.

Question: Is the increase in likelihood due to the larger number of parameters?

This question can be studied using the BIC.


31
We use the data of Secker (1992), Table 1

32
We assume that the data constitute a random sample
Model 1: Write down the likelihood function,

L1(µ, σ) = f(X1; µ, σ) · · · f(Xn; µ, σ)
         = (2πσ²)^(−n/2) exp[−(1/(2σ²)) Σ_{i=1}^n (Xi − µ)²]

Estimate µ with X̄, the ML estimator. Also, estimate σ² with S², a constant multiple of the ML estimator of σ².

Note that

L1(X̄, S) = (2πS²)^(−n/2) exp[−(1/(2S²)) Σ_{i=1}^n (Xi − X̄)²]
         = (2πS²)^(−n/2) exp(−(n − 1)/2)

Calculate x̄ and s², the sample mean and variance of the Milky Way data. Use these values to calculate L1(x̄, s)

Secker (1992, p. 1476): ln L1(x̄, s) = −176.4
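
A small Python sketch of this computation; with the 100 magnitudes of Secker’s Table 1 (not reproduced here) it should give roughly −176.4.

import numpy as np

def log_L1(x):
    # ln L1(x-bar, s) for the Gaussian model, using S^2 with divisor n - 1
    n = len(x)
    s2 = x.var(ddof=1)
    return -0.5 * n * np.log(2 * np.pi * s2) - 0.5 * (n - 1)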


33
Model 2: Write down the likelihood function,

L2(µ, σ, δ) = g(X1; µ, σ, δ) · · · g(Xn; µ, σ, δ)

            = Π_{i=1}^n [Γ((δ + 1)/2) / (√(πδ) σ Γ(δ/2))] [1 + (Xi − µ)²/(δσ²)]^(−(δ+1)/2)

Are the MLEs of µ, σ 2, δ unique?

No explicit formulas for the MLEs are known; we must evaluate them numerically

Substitute the Milky Way data for the Xi’s in the formula for L2, and maximize L2 numerically.
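
One way to carry out this numerical maximization (a Python sketch; the data file name is hypothetical, and the starting values are guesses):

import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

def neg_log_L2(params, x):
    # negative log-likelihood of the t model g(x; mu, sigma, delta) above
    mu, sigma, delta = params
    if sigma <= 0 or delta <= 0:
        return np.inf
    z2 = (x - mu) ** 2 / (delta * sigma ** 2)
    ll = (gammaln((delta + 1) / 2) - gammaln(delta / 2)
          - 0.5 * np.log(np.pi * delta) - np.log(sigma)
          - (delta + 1) / 2 * np.log1p(z2))
    return -ll.sum()

# x = np.loadtxt("secker_table1.txt")   # hypothetical file holding the 100 magnitudes
# res = minimize(neg_log_L2, x0=[-7.0, 1.0, 3.0], args=(x,), method="Nelder-Mead")
# res.x should then be close to Secker's values below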

Secker (1992): µ̂ = −7.31, σ̂ = 1.03, δ̂ = 3.55

Calculate L2(−7.31, 1.03, 3.55)

Secker (1992, p. 1476): ln L2(−7.31, 1.03, 3.55) = −173.0
34
Finally, calculate the estimated BIC:

B̂IC = 2 ln[L1(x̄, s) / L2(−7.31, 1.03, 3.55)] − (m1 − m2) ln n

where m1 = 2, m2 = 3, n = 100

B̂IC = 2[ln L1(x̄, s) − ln L2(−7.31, 1.03, 3.55)] + ln 100
    = 2[−176.4 − (−173.0)] + ln 100
    = −2.2

Apply the General Rules on p. 30 to assess the strength of the evidence that Model 1 may be superior to Model 2.

Since B̂IC < 2 (indeed B̂IC is negative), there is no evidence that the Gaussian model is superior to the t-distribution model.

Reversing the roles of the two models gives B̂IC = 2.2, moderate evidence that the t-distribution model for the GCLF is superior to the Gaussian model.
35
Concluding general remarks on the BIC

The BIC procedure is consistent: If Model 1 is the true model then, as n → ∞, the BIC will determine that it is.

Not all information criteria are consistent.

The BIC is not a panacea; some authors recommend that it be used in conjunction with other information criteria.

There are also difficulties with the BIC

Findley (1991, Ann. Inst. Statist. Math.) studied the performance of the BIC for comparing two models with different numbers of parameters: “Suppose that the log-likelihood-ratio sequence of two models with different numbers of estimated parameters is bounded in probability. Then the BIC will, with asymptotic probability 1, select the model having fewer parameters.”
36
