Inf 2
Summer School in Statistics for Astronomers VIII
June 4-8, 2012
Inference II:
Maximum Likelihood Estimation,
the Cramér-Rao Inequality, and
the Bayesian Information Criterion
James L Rosenberger
Acknowledgements:
Donald Richards,
Thomas P Hettmansperger
Department of Statistics
Center for Astrostatistics
Penn State University
The Method of Maximum Likelihood
X: a random variable with density function f(x; θ)
θ: an unknown parameter
Protheroe, et al., "Interpretation of cosmic ray composition - The path length distribution," ApJ, 247, 1981
X: Length of paths
Parameter: θ > 0
Model: f(x; \theta) = \theta^{-1} e^{-x/\theta}, \quad x > 0
LF for globular clusters in the Milky Way; van den Bergh's normal model,

f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]
Choose a globular cluster at random; what is
the chance that the LF will be exactly -7.1
mag? Exactly -7.2 mag?
P (X = c) = 0
f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]
P (X = −7.1) = 0, but
f(-7.1) = \frac{1}{1.1\sqrt{2\pi}} \exp\left[-\frac{(-7.1 + 6.9)^2}{2(1.1)^2}\right] = 0.37
Interpretation: In one simulation of the random variable X, the "likelihood" of observing the number -7.1 is 0.37
f (−7.2) = 0.28
Return to a general model f (x; θ)
The likelihood function is

L(\theta) = f(X_1; \theta)\, f(X_2; \theta) \cdots f(X_n; \theta),

the joint density of the observed sample X_1, \ldots, X_n, viewed as a function of θ
Example: "... cosmic ray composition - The path length distribution ..."

X: Length of paths; Parameter: θ > 0; Model: f(x; \theta) = \theta^{-1} e^{-x/\theta}

Likelihood function:

L(\theta) = \prod_{i=1}^n \theta^{-1} e^{-X_i/\theta} = \theta^{-n} \exp\left(-\frac{1}{\theta}\sum_{i=1}^n X_i\right)
To maximize L, we use calculus:

\frac{d}{d\theta} \ln L(\theta) = -\frac{n}{\theta} + \frac{1}{\theta^2}\sum_{i=1}^n X_i = 0 \implies \theta = \bar{X}

ln L(θ) is maximized at \hat\theta = \bar{X}
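As a quick numerical sketch (simulated data; the value of theta_true and the sample size are hypothetical), the grid maximum of the exponential log-likelihood lands at the sample mean, as the calculus predicts:

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 2.5                                     # hypothetical true parameter
x = rng.exponential(scale=theta_true, size=10_000)   # simulated path lengths

# From the derivation above: ln L(theta) = -n ln(theta) - sum(X_i)/theta,
# and setting the derivative to zero gives theta-hat = x-bar.
theta_hat = x.mean()

# Brute-force check: evaluate ln L on a grid and locate its maximum.
grid = np.linspace(0.5, 5.0, 1000)
loglik = -x.size * np.log(grid) - x.sum() / grid
theta_grid = grid[np.argmax(loglik)]                 # lands next to x-bar
```

The grid maximum agrees with the closed-form MLE up to the grid spacing.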
LF for globular clusters; X ∼ N(µ, σ²)

f(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]

Likelihood function:

L(\mu, \sigma^2) = \prod_{i=1}^n f(X_i; \mu, \sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^n (X_i-\mu)^2\right]
The log-likelihood:

\ln L = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (X_i-\mu)^2

\frac{\partial}{\partial\mu} \ln L = \frac{1}{\sigma^2}\sum_{i=1}^n (X_i-\mu)

\frac{\partial}{\partial(\sigma^2)} \ln L = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{i=1}^n (X_i-\mu)^2
Solve for µ and σ² the simultaneous equations:

\frac{\partial}{\partial\mu} \ln L = 0, \qquad \frac{\partial}{\partial(\sigma^2)} \ln L = 0

\hat\mu = \bar{X}, \qquad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (X_i-\bar{X})^2
µ̂ is unbiased: E(µ̂) = µ
\hat\sigma^2 is biased: E(\hat\sigma^2) = \frac{n-1}{n}\sigma^2 \neq \sigma^2. For this reason, we use

\frac{n}{n-1}\,\hat\sigma^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i-\bar{X})^2 \equiv S^2
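A simulation sketch of this bias (the parameter values, sample size, and replicate count are all hypothetical): averaging σ̂² over many samples undershoots σ² by the factor (n−1)/n, while S² does not.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma2, n, reps = 0.0, 4.0, 5, 200_000           # hypothetical settings

samples = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
xbar = samples.mean(axis=1, keepdims=True)
ss = ((samples - xbar) ** 2).sum(axis=1)             # sum of squared deviations

sigma2_hat = ss / n        # MLE: E[sigma2_hat] = (n-1)/n * sigma^2 = 3.2 here (biased)
s2 = ss / (n - 1)          # S^2: E[S^2] = sigma^2 = 4.0 (unbiased)
```

With n = 5 the bias factor is 4/5, so the average of σ̂² sits near 3.2 while the average of S² sits near 4.0.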
Calculus cannot always be used to find MLEs

Parameter: θ > 0

Model:
f(x; \theta) = \begin{cases} \exp(-(x-\theta)), & x \ge \theta \\ 0, & x < \theta \end{cases}

The likelihood L(\theta) = \exp\left(n\theta - \sum_{i=1}^n X_i\right) for \theta \le X_{(1)} (and 0 otherwise) is increasing in θ, so it is maximized at the largest allowable value of θ.

Conclusion: \hat\theta = X_{(1)}, the smallest observation
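A sketch with simulated data (theta_true and the sample size are hypothetical): the log-likelihood rises in θ up to the smallest observation and drops to −∞ beyond it, so a grid search puts the maximum at X(1) rather than at a stationary point of calculus.

```python
import numpy as np

rng = np.random.default_rng(2)
theta_true = 3.0                                     # hypothetical shift parameter
x = rng.exponential(scale=1.0, size=500) + theta_true

def loglik(theta, x):
    """ln L(theta) = n*theta - sum(x) when theta <= min(x); impossible otherwise."""
    if theta > x.min():
        return -np.inf
    return x.size * theta - x.sum()

grid = np.linspace(0.0, 5.0, 2001)
vals = [loglik(t, x) for t in grid]
best = grid[int(np.argmax(vals))]                    # largest grid point <= min(x)

theta_hat = x.min()                                  # the MLE: smallest observation
```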
General Properties of the MLE θ̂
B = nE\left[\left(\frac{\partial}{\partial\theta}\ln f(X;\theta)\right)^2\right]
The ML Method for Linear Regression Analysis
Statistical model:
Y_i = \alpha + \beta x_i + \varepsilon_i, \quad i = 1, \ldots, n, \qquad \varepsilon_i \stackrel{iid}{\sim} N(0, \sigma^2)

Parameters: α, β, σ²
The likelihood function is the joint density function of the observed data, Y_1, \ldots, Y_n:

L(\alpha, \beta, \sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(Y_i - \alpha - \beta x_i)^2}{2\sigma^2}\right]

= (2\pi\sigma^2)^{-n/2} \exp\left[-\frac{\sum_{i=1}^n (Y_i - \alpha - \beta x_i)^2}{2\sigma^2}\right]
The MLEs \hat\alpha and \hat\beta are the least-squares estimates, and

\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (Y_i - \hat\alpha - \hat\beta x_i)^2
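A minimal sketch with simulated data (the true values of α, β, σ and the design are hypothetical): maximizing ln L over (α, β) is the same as minimizing the sum of squared residuals, so the ML fit can be checked against a standard least-squares routine.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x = np.linspace(0.0, 10.0, n)
alpha, beta, sigma = 1.0, 2.0, 0.5                   # hypothetical true values
y = alpha + beta * x + rng.normal(0.0, sigma, n)

# Maximizing ln L over (alpha, beta) minimizes sum of squared residuals,
# so the ML estimates are the least-squares solution.
X = np.column_stack([np.ones(n), x])
(alpha_hat, beta_hat), *_ = np.linalg.lstsq(X, y, rcond=None)

resid = y - alpha_hat - beta_hat * x
sigma2_hat = (resid ** 2).mean()                     # ML estimate: divide by n
```

The (α̂, β̂) pair matches what a least-squares polynomial fit returns, and σ̂² is the mean squared residual.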
The ML Method for Testing Hypotheses
Testing H_0: \mu = 3 against H_a: \mu \neq 3 in the model X ∼ N(µ, σ²):

L(\mu, \sigma^2) = f(X_1; \mu, \sigma^2) \cdots f(X_n; \mu, \sigma^2)
= \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^n (X_i-\mu)^2\right]

The likelihood ratio statistic is \lambda = \max_{\omega_0} L \div \max_{\omega_a \cup \omega_0} L. Fact: 0 ≤ λ ≤ 1

Under H_0, L(3, \sigma^2) is maximized at

\sigma^2 = \frac{1}{n}\sum_{i=1}^n (X_i - 3)^2
\max_{\omega_0} L(3, \sigma^2) = L\!\left(3,\ \frac{1}{n}\sum_{i=1}^n (X_i-3)^2\right) = \left[\frac{n}{2\pi e \sum_{i=1}^n (X_i-3)^2}\right]^{n/2}

Over \omega_a \cup \omega_0, L is maximized at

\mu = \bar{X}, \qquad \sigma^2 = \frac{1}{n}\sum_{i=1}^n (X_i-\bar{X})^2

\max_{\omega_a \cup \omega_0} L(\mu, \sigma^2) = L\!\left(\bar{X},\ \frac{1}{n}\sum_{i=1}^n (X_i-\bar{X})^2\right) = \left[\frac{n}{2\pi e \sum_{i=1}^n (X_i-\bar{X})^2}\right]^{n/2}
The likelihood ratio test statistic:
\lambda = \left[\frac{n}{2\pi e \sum_{i=1}^n (X_i-3)^2}\right]^{n/2} \div \left[\frac{n}{2\pi e \sum_{i=1}^n (X_i-\bar{X})^2}\right]^{n/2}

= \left[\sum_{i=1}^n (X_i-\bar{X})^2 \div \sum_{i=1}^n (X_i-3)^2\right]^{n/2}
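The final form of λ is easy to compute directly. A sketch on simulated data (samples and parameter values hypothetical): when µ really is 3, λ stays near 1; when the true mean is far from 3, λ collapses toward 0, which is the evidence against H0.

```python
import numpy as np

rng = np.random.default_rng(4)

def lam(x, mu0=3.0):
    """lambda = [sum (x - xbar)^2 / sum (x - mu0)^2]^(n/2) for H0: mu = mu0."""
    n = x.size
    num = ((x - x.mean()) ** 2).sum()
    den = ((x - mu0) ** 2).sum()
    return (num / den) ** (n / 2)

x_null = rng.normal(3.0, 1.0, 100)   # sample where H0: mu = 3 is true
x_alt = rng.normal(4.0, 1.0, 100)    # sample where H0 is false

lam_null = lam(x_null)               # stays well away from 0
lam_alt = lam(x_alt)                 # essentially 0: strong evidence against H0
```

Since Σ(x−x̄)² ≤ Σ(x−µ₀)² for any µ₀, the ratio (and hence λ) is always between 0 and 1, matching the "Fact" above.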
Given two unbiased estimators, we prefer the
one with smaller variance
Parameter: θ
Y : An unbiased estimator of θ
Example: X, the length of paths, with parameter θ > 0 and f(x; \theta) = \theta^{-1} e^{-x/\theta}
\ln f(X; \theta) = -\ln\theta - \theta^{-1} X

\frac{\partial^2}{\partial\theta^2} \ln f(X; \theta) = \theta^{-2} - 2\theta^{-3} X

E\left[\frac{\partial^2}{\partial\theta^2} \ln f(X; \theta)\right] = E(\theta^{-2} - 2\theta^{-3} X)
= \theta^{-2} - 2\theta^{-3} E(X)
= \theta^{-2} - 2\theta^{-3}\theta
= -\theta^{-2}
The smallest possible value of Var(Y ) is θ2/n
\text{efficiency of } Y = \frac{1}{B} \div \mathrm{Var}(Y)

Obviously, 0 ≤ efficiency ≤ 1
The Cramér-Rao inequality:

\mathrm{Var}(Y) \ge \frac{1}{nE\left[\left(\frac{\partial}{\partial\theta}\ln f(X;\theta)\right)^2\right]}
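A simulation sketch of the bound (θ, n, and the replicate count are hypothetical): for the exponential model the calculation above gives B = n/θ², so the bound is θ²/n, and the unbiased estimator Y = X̄ attains it.

```python
import numpy as np

rng = np.random.default_rng(5)
theta, n, reps = 2.0, 50, 100_000     # hypothetical parameter and sample size

x = rng.exponential(scale=theta, size=(reps, n))
ybar = x.mean(axis=1)                 # the unbiased estimator Y = X-bar, per replicate

B = n / theta ** 2                    # n * E[(d/dtheta ln f)^2] = n / theta^2
crlb = 1.0 / B                        # Cramér-Rao lower bound = theta^2 / n

var_emp = ybar.var()                  # empirical Var(X-bar) across replicates
efficiency = crlb / var_emp           # close to 1: X-bar attains the bound
```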
More complicated models generally will have
lower residual errors
Likelihood functions: L_1(\theta_1, \ldots, \theta_{m_1}) and L_2(\phi_1, \ldots, \phi_{m_2})

\widehat{BIC} = 2\ln\frac{L_1(\hat\theta_1, \ldots, \hat\theta_{m_1})}{L_2(\hat\phi_1, \ldots, \hat\phi_{m_2})} - (m_1 - m_2)\ln n
General rules:

\widehat{BIC} < 2: Weak evidence that Model 1 is superior to Model 2

2 ≤ \widehat{BIC} ≤ 6: Moderate evidence that Model 1 is superior to Model 2

6 < \widehat{BIC} ≤ 10: Strong evidence that Model 1 is superior to Model 2

\widehat{BIC} > 10: Very strong evidence that Model 1 is superior to Model 2
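A sketch of applying the \widehat{BIC} formula (the data and the model pair are hypothetical, not the globular-cluster exercise): Model 1 is N(µ, σ²) with m₁ = 2 parameters, Model 2 is N(0, σ²) with m₂ = 1, fit to data whose true mean is nonzero, so Model 1 should win decisively.

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(1.0, 1.0, 200)                        # hypothetical data, true mean 1

def norm_loglik(x, mu, sigma2):
    """Gaussian log-likelihood at (mu, sigma2)."""
    n = x.size
    return -0.5 * n * np.log(2 * np.pi * sigma2) - ((x - mu) ** 2).sum() / (2 * sigma2)

# Model 1: N(mu, sigma^2), m1 = 2; MLEs are x-bar and x.var() (the 1/n version)
l1 = norm_loglik(x, x.mean(), x.var())
# Model 2: N(0, sigma^2), m2 = 1; MLE of sigma^2 is the mean of x^2
l2 = norm_loglik(x, 0.0, (x ** 2).mean())

bic_hat = 2.0 * (l1 - l2) - (2 - 1) * np.log(x.size)
```

Here bic_hat is comfortably above 10, i.e. very strong evidence for Model 1 under the rules above.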
Exercise: Two competing models for globular cluster LF in the Galaxy

Model 1 (Gaussian):

f(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]
We assume that the data constitute a random sample
Model 1: Write down the likelihood function,
Note that
L_1(\bar{X}, S) = \frac{1}{(2\pi S^2)^{n/2}} \exp\left[-\frac{1}{2S^2}\sum_{i=1}^n (X_i - \bar{X})^2\right] = (2\pi S^2)^{-n/2} \exp(-(n-1)/2)
Since \widehat{BIC} < -10, we have very strong evidence that the t-distribution model is superior to the Gaussian distribution model.