Maximum Likelihood Estimation
Guy Lebanon
February 19, 2011
Maximum likelihood estimation is the most popular general purpose method for estimating a distribution from a finite sample. It was proposed by Fisher about 100 years ago and has been widely used since.
Definition 1. Let $X^{(1)},\ldots,X^{(n)}$ be sampled iid from a distribution with a parameter $\theta$ that lies in a set $\Theta$. The maximum likelihood estimator (MLE) is the $\theta$ that maximizes the likelihood function
\[ L(\theta) = L(\theta\,|\,X^{(1)},\ldots,X^{(n)}) = p_{\theta}(X^{(1)},\ldots,X^{(n)}) = \prod_{i=1}^{n} p_{\theta}(X^{(i)}) \]
where $p_{\theta}$ above is the density function if $X$ is continuous and the mass function if $X$ is discrete. The MLE is denoted $\hat\theta$ or $\hat\theta_n$ if we wish to emphasize the sample size.
Above, we suppress the dependency of $L$ on $X^{(1)},\ldots,X^{(n)}$ to emphasize that we are treating the likelihood as a function of $\theta$. Note that both $X^{(i)}$ and $\theta$ may be scalars or vectors (not necessarily of the same dimension), and that $L$ may be discrete or continuous in $X$, in $\theta$, in both, or in neither.
1. Strictly monotonic increasing functions $g$ preserve order in the sense that the maximizer of $L(\theta)$ is the same as the maximizer of $g(L(\theta))$. As a consequence, we can find the MLE by obtaining the maximizer of $\log L(\theta)$ rather than of the likelihood itself, which is helpful since it transforms the multiplicative likelihood into a sum (sums are easier to differentiate than products). A common notation for the log of the likelihood is $\ell(\theta) = \log L(\theta)$.

2. Any additive and multiplicative terms in $\ell(\theta)$ that are not a function of $\theta$ may be ignored, since dropping them will not change the maximizer.
3. If $L(\theta)$ is differentiable in $\theta$, we can try to find the MLE by solving the equation $d\ell(\theta)/d\theta = 0$ for scalar $\theta$, or the system of equations $\nabla\ell(\theta) = 0$, i.e., $\partial\ell(\theta)/\partial\theta_j = 0$, $j = 1,\ldots,d$, for vector $\theta$. The obtained solutions are necessarily critical points (maximum, minimum or inflection) of the log-likelihood. To actually prove that the solution is a maximum we need to show, in the scalar case, that $d^2\ell(\theta)/d\theta^2 < 0$, or if $\theta$ is a vector that the Hessian matrix $H(\theta)$ defined by $[H(\theta)]_{ij} = \partial^2\ell(\theta)/\partial\theta_i\,\partial\theta_j$ is negative definite (at the solution of $\nabla\ell(\theta) = 0$).
4. If the above method does not work (we can't solve $\nabla\ell(\theta) = 0$) we can find the MLE by iteratively following the direction of the gradient: initialize $\theta$ randomly and iterate $\theta \leftarrow \theta + \alpha\nabla\ell(\theta)$ (where $\alpha$ is a sufficiently small step size) until convergence, e.g., until $\|\nabla\ell(\theta)\| \le \epsilon$ or $\|\theta^{(t+1)} - \theta^{(t)}\| < \epsilon$. Since the gradient vector points in the direction of steepest ascent, this will bring us to a (possibly local) maximum point.
5. The MLE is invariant in the sense that for all 1-1 functions $h$: denoting the MLE for $\theta$ by $\hat\theta$, we have that the MLE for $h(\theta)$ is $h(\hat\theta)$. The key property is that 1-1 functions have an inverse. If we use the parametrization $h(\theta)$ rather than $\theta$, the likelihood function is $L \circ h^{-1}$ rather than $L$. This can be shown by noting that for 1-1 $h$ and $\eta = h(\theta)$, $p_{\theta}(X) = p_{h^{-1}(\eta)}(X)$ and thus the likelihood function of $\eta$ is $K(\eta) = L(h^{-1}(\eta))$. We conclude that if $\hat\theta$ is the MLE of $\theta$ then
\[ K(h(\hat\theta)) = L(h^{-1}(h(\hat\theta))) = L(\hat\theta) \;\ge\; L(\theta) = L(h^{-1}(h(\theta))) = K(h(\theta)) \quad\text{for all }\theta, \]
and $h(\hat\theta)$ is therefore the MLE of $h(\theta)$.
Example 1: Let $X^{(1)},\ldots,X^{(n)} \sim \mathrm{Ber}(\theta)$, i.e., $p_{\theta}(X^{(i)}) = \theta^{X^{(i)}}(1-\theta)^{1-X^{(i)}}$. The likelihood is
\[ L(\theta) = \prod_{i=1}^{n} \theta^{X^{(i)}}(1-\theta)^{1-X^{(i)}} = \theta^{\sum_i X^{(i)}}\,(1-\theta)^{n-\sum_i X^{(i)}} \]
and the log-likelihood is $\ell(\theta) = \left(\sum_i X^{(i)}\right)\log\theta + \left(n - \sum_i X^{(i)}\right)\log(1-\theta)$.
Setting the log-likelihood derivative to zero yields
\[ 0 = \frac{\sum_i X^{(i)}}{\theta} - \frac{n - \sum_i X^{(i)}}{1-\theta} \quad\text{or}\quad 0 = (1-\theta)\sum_i X^{(i)} - \theta\left(n - \sum_i X^{(i)}\right) = \sum_i X^{(i)} - \theta n, \]
which implies $\hat\theta = \frac{1}{n}\sum_i X^{(i)}$.
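To verify via item 3 above that this critical point is a maximum, note that the second derivative
\[ \frac{d^2\ell(\theta)}{d\theta^2} = -\frac{\sum_i X^{(i)}}{\theta^2} - \frac{n-\sum_i X^{(i)}}{(1-\theta)^2} \]
is strictly negative for every $\theta \in (0,1)$, so $\hat\theta$ is indeed the maximizer.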
Note that the probability $p_{\theta}$ and the log-likelihood are neither continuous nor smooth in $X$ but are both continuous and smooth in $\theta$. For example, the graph below shows the log-likelihood of a sample from a Bernoulli distribution with $\theta = 0.5$ and $n = 3$, which is maximized at $\hat\theta$ equal to the empirical average, in accordance with the mathematical solution above.
> theme_set(theme_bw(base_size=8)); set.seed(0); # for reproducible experiment
> n=3; theta=0.5; samples=rbinom(n,1,theta); samples
[1] 1 0 0
> D=data.frame(theta=seq(0.01,1,length=100));
> D$loglikelihood=sum(samples)*log(D$theta)+(n-sum(samples))*log(1-D$theta);
> mle=which.max(D$loglikelihood);
> p=ggplot(D,aes(theta,loglikelihood)) + geom_line()
> print(qplot(theta,loglikelihood,geom='line',data=D,
+   main='MLE (red solid) vs. True Parameter (blue dashed), n=3')+
+   geom_vline(aes(xintercept=theta[mle]),color=I('red'),size=1.5)+
+   geom_vline(aes(xintercept=0.5),color=I('blue'),lty=2,size=1.5))
[Figure: log-likelihood as a function of theta for the Bernoulli sample; MLE (red solid) vs. true parameter (blue dashed), n=3.]
Contrast the above graph with the one below, which corresponds to $n = 20$, to see the improvement of $\hat\theta_{20}$ over $\hat\theta_{3}$.
[Figure: log-likelihood as a function of theta for the Bernoulli sample; MLE (red solid) vs. true parameter (blue dashed), n=20.]
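As a small illustration of the invariance property (item 5 above) applied to Example 1: the log-odds $h(\theta) = \log\frac{\theta}{1-\theta}$ is a 1-1 function on $(0,1)$, so its MLE is obtained with no additional work as
\[ h(\hat\theta) = \log\frac{\hat\theta}{1-\hat\theta} = \log\frac{\sum_i X^{(i)}}{n - \sum_i X^{(i)}} . \]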
Example 2: Let $X^{(1)},\ldots,X^{(n)} \sim N(\mu,\sigma^2)$, $\theta = (\mu,\sigma^2)$. The log-likelihood is
\begin{align*}
\ell(\theta) &= \log\left( \frac{1}{(2\pi\sigma^2)^{n/2}} \prod_{i=1}^{n} e^{-(X^{(i)}-\mu)^2/(2\sigma^2)} \right) \\
&= c - \frac{n}{2}\log\sigma^2 + \log e^{-\sum_{i=1}^{n}(X^{(i)}-\mu)^2/(2\sigma^2)} \\
&= c - \frac{n}{2}\log\sigma^2 - \sum_{i=1}^{n}\frac{(X^{(i)}-\mu)^2}{2\sigma^2}
\end{align*}
where $c$ is an inconsequential additive constant. Setting the partial derivative with respect to $\mu$ to zero gives
\[ \frac{\partial\ell(\theta)}{\partial\mu} = \sum_{i=1}^{n}\frac{X^{(i)}-\mu}{\sigma^2} = 0 \quad\Rightarrow\quad \hat\mu = \bar X \stackrel{\text{def}}{=} \frac{1}{n}\sum_{i=1}^{n} X^{(i)} . \]
Substituting this in the equation resulting from setting the partial derivative with respect to $\sigma^2$ to zero:
\[ 0 = \frac{\partial\ell(\theta)}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \sum_i\frac{(X^{(i)}-\bar X)^2}{2\sigma^4}, \quad\text{or}\quad -\sigma^2 n + \sum_i (X^{(i)}-\bar X)^2 = 0, \]
which implies $\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(X^{(i)}-\bar X)^2$. By the invariance property (item 5 above), the MLE for the standard deviation is $\hat\sigma = \sqrt{\hat\sigma^2} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(X^{(i)}-\bar X)^2}$.
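As in Example 1, item 3 above can be used to confirm that this critical point is a maximum. At $(\hat\mu,\hat\sigma^2)$ the mixed partial derivative $\partial^2\ell/\partial\mu\,\partial\sigma^2 = -\sum_i (X^{(i)}-\hat\mu)/\hat\sigma^4$ vanishes, so the Hessian there is the diagonal matrix
\[ H(\hat\mu,\hat\sigma^2) = \begin{pmatrix} -n/\hat\sigma^2 & 0 \\ 0 & -n/(2\hat\sigma^4) \end{pmatrix}, \]
which is negative definite.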
In this case $\theta$ is a two dimensional vector and so the likelihood function forms a surface in three dimensions. It is convenient to visualize it using a heat-map or contour plot, which displays the value of $\ell(\theta)$ in terms of colors or equal-value contours. For example, compare the two graphs below showing the likelihood as a function of $\theta = (\mu,\sigma^2)$ for $n = 5$ and $n = 50$. The MLE is clearly more accurate in the latter case.
> n=5; samples=rnorm(n,0,1); # 5 samples from N(0,1)
> mu=seq(-1,1,length=40); v=seq(0.01,2,length=40);
> D=expand.grid(mu=mu,sigma.square=v); # create all combinations of the two parameters
> nloglik=function(theta,samples) { # log-likelihood function
+   l=-log(theta[,2])*n/2; # works for multiple theta values arranged in a data frame
+   for (s in samples) l=l-(s-theta[,1])^2/(2*theta[,2]);
+   return(l);
+ }
> D$loglikelihood=nloglik(D,samples);
> p=ggplot(D,aes(mu,sigma.square,z=exp(loglikelihood)))+stat_contour(size=0.2);
> mle=which.max(D$loglikelihood);
> print(p+geom_point(aes(mu[mle],sigma.square[mle]),color=I('red'),size=3)+
+   geom_point(aes(0,1),color=I('blue'),size=3,shape=22)+opts(title=
+   'Likelihood contours, true parameter (blue square) vs. MLE (red circle), n=5'))
[Figure: likelihood contours over (mu, sigma.square); true parameter (blue square) vs. MLE (red circle), n=5.]
> n=50; samples=rnorm(n,0,1); # 50 samples from N(0,1)
> mu=seq(-1,1,length=40); v=seq(0.01,2,length=40);
> D=expand.grid(mu=mu,sigma.square=v);
> D$loglikelihood=nloglik(D,samples);
> p=ggplot(D,aes(mu,sigma.square,z=exp(loglikelihood)))+stat_contour(size=0.2);
> mle=which.max(D$loglikelihood);
> print(p+geom_point(aes(mu[mle],sigma.square[mle]),color=I('red'),size=3)+
+   geom_point(aes(0,1),color=I('blue'),size=3,shape=22)+opts(title=
+   'Likelihood contours, true parameter (blue square) vs. MLE (red circle), n=50'))
[Figure: likelihood contours over (mu, sigma.square); true parameter (blue square) vs. MLE (red circle), n=50.]
R can also be used to find the MLE using numeric iterative methods such as gradient descent. This can be very useful when setting the log-likelihood gradient to zero does not result in a closed form solution. Below is an example of using R's optim() function to numerically optimize the Gaussian likelihood (even though we don't really need to in this case since there is a closed form solution). The optim() function takes three arguments: an initial parameter value to start the iterative optimization, the objective function to minimize (minus the log-likelihood in our case), and any additional arguments to pass to the objective function (the samples in our case). Comparing the MLE obtained using the iterative numeric algorithm with the grid search MLE obtained in the previous example, we see that the former is more accurate than the latter. As the grid resolution is increased that difference will converge to 0.
> # numerical MLE for Gaussian parameters using the 50 samples from the previous example
> nnloglik=function(theta,samples) { # negative log-likelihood function
+   l=-log(theta[2])*n/2; # here theta is a single (mu, sigma^2) vector
+   for (s in samples) l=l-(s-theta[1])^2/(2*theta[2]);
+   return(-l);
+ }
> iterative.mle=optim(c(0.5,0.5),nnloglik,samples=samples);
> iterative.mle$par # mle using numeric optimization
[1] 0.004178593 1.065951133
> c(D$mu[mle],D$sigma.square[mle]) # mle using grid search (above example)
[1] 0.02564103 1.08153846
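For reference, the closed-form MLE derived in Example 2 can be computed directly from the same samples and compared with the two numeric answers above. It is also straightforward to hand-code the gradient ascent recipe of item 4 for this model; the sketch below uses the partial derivatives computed in Example 2, and the starting point, step size alpha, and iteration count are arbitrary illustrative choices.
> c(mean(samples), mean((samples-mean(samples))^2)) # closed-form MLE (mu, sigma^2)
> # gradient ascent (item 4) on the Gaussian log-likelihood
> gradient=function(theta,samples) { # gradient with respect to (mu, sigma^2)
+   mu=theta[1]; v=theta[2];
+   c(sum(samples-mu)/v, -length(samples)/(2*v)+sum((samples-mu)^2)/(2*v^2));
+ }
> theta=c(0.5,0.5); alpha=0.01; # small fixed step size
> for (t in 1:2000) theta=theta+alpha*gradient(theta,samples);
> theta # should approximately match the closed-form MLE above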
Example 3: $X^{(1)},\ldots,X^{(n)} \sim U[0,\theta]$. In this case $p_{\theta}(X^{(i)}) = \frac{1}{\theta}$ if $X^{(i)} \in [0,\theta]$ and 0 otherwise. The likelihood is $L(\theta) = \theta^{-n}$ if $0 \le X^{(1)},\ldots,X^{(n)} \le \theta$ and 0 otherwise. We need to exercise care in this situation since the definition of the likelihood branches into two options depending on the value of the parameter $\theta$. To treat only one case and not two we write the likelihood as $L(\theta) = \theta^{-n}\, 1_{\{0\le X^{(1)},\ldots,X^{(n)} \le \theta\}}$, where $1_{\{A\}}$ is the indicator function which equals 1 if $A$ is true and 0 otherwise. We can't at this point proceed as before since the likelihood is not a differentiable function of $\theta$ (and neither is the log-likelihood). We therefore do not take derivatives and simply examine the function $L(\theta)$: it is zero for $\theta < \max(X^{(1)},\ldots,X^{(n)})$ and non-zero for $\theta \ge \max(X^{(1)},\ldots,X^{(n)})$, in which case the likelihood function is monotonically decreasing in $\theta$. It follows that $\hat\theta = \max(X^{(1)},\ldots,X^{(n)})$. The graph below plots the likelihood function for $\theta = 1$ overlayed on the samples ($n = 3$) with the MLE and true parameter indicated by vertical lines.
> D=data.frame(theta=seq(0,2,length=100));
> n=3; samples=runif(n,0,1); samples; # 3 samples from U[0,theta], theta=1
[1] 0.72540527 0.48614910 0.06380247
> D$likelihood[D$theta<max(samples)]=0;
> D$likelihood[D$theta>=max(samples)]=D$theta[D$theta>=max(samples)]^(-n);
> p=ggplot(D,aes(theta,likelihood)) + geom_line()
> print(p+opts(title='likelihood, MLE (red solid) vs. true parameter (blue dashed), n=3')+
+   geom_vline(aes(xintercept=max(samples)),color=I('red'),size=1.5)+
+   geom_vline(aes(xintercept=1),color=I('blue'),lty=2,size=1.5)+
+   geom_point(aes(x=samples,y=0,size=3)))
[Figure: likelihood as a function of theta for the uniform sample (n = 3); MLE (red solid) vs. true parameter (blue dashed), with the three samples marked on the horizontal axis.]
A variation of this situation has $X^{(1)},\ldots,X^{(n)} \sim U(0,\theta)$ (as before, but this time the interval is open and not closed). We start as before, but the likelihood at $\theta = \max(X^{(1)},\ldots,X^{(n)})$ is zero. The likelihood increases as $\theta$ approaches $\max(X^{(1)},\ldots,X^{(n)})$ from the right, yet at $\max(X^{(1)},\ldots,X^{(n)})$ itself it is zero. That is, there is no MLE! For any specific value $\theta'$, we can always come up with a $\theta''$ that results in a higher likelihood. Thus there is no value of $\theta$ that maximizes the likelihood.
Another variation has $X^{(1)},\ldots,X^{(n)} \sim U[\theta,\theta+1]$. In this case the likelihood is $L(\theta) = 1_{\{\theta \le X^{(1)},\ldots,X^{(n)} \le \theta+1\}}$. The likelihood is thus either zero or one: since it equals one for many possible values of $\theta$, there are multiple maximizers or MLEs (all $\theta$ for which $\theta \le X^{(1)},\ldots,X^{(n)} \le \theta+1$) rather than a unique one.
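Concretely, $L(\theta) = 1$ exactly when $\theta \le \min_i X^{(i)}$ and $\theta + 1 \ge \max_i X^{(i)}$, so the set of MLEs is the entire interval
\[ \left[\, \max_i X^{(i)} - 1,\ \min_i X^{(i)} \,\right] \]
(which is nonempty since all samples lie within an interval of length one), and any point in it maximizes the likelihood.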
Theoretical Properties
The MLE may be motivated on practical grounds as the most popular estimation technique in statistics. It also has some nice theoretical properties that motivate it. Below is a brief non-formal description of these properties.
Consistency The MLE $\hat\theta_n$ is a function of $X^{(1)},\ldots,X^{(n)}$ and is therefore a random variable. It is sometimes more accurate and sometimes less accurate depending on the samples. It is generally true, however, that as $n$ increases the value of the random variable $\hat\theta_n$ converges to the true parameter value with probability 1:
\[ P\left( \lim_{n\to\infty}\hat\theta_n = \theta \right) = 1 . \]
The main condition for this is identifiability, defined as the requirement that different parameter values correspond to different distributions: $\theta \ne \theta'$ implies $p_{\theta} \ne p_{\theta'}$.
Asymptotic Normality The MLE $\hat\theta_n$ for large $n$ is approximately normally distributed, $N(\theta, I^{-1}(\theta)/n)$, with expectation equal to the true parameter and variance decaying linearly with $n$ (here $I(\theta)$ denotes the Fisher information).
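A small simulation in the spirit of the earlier R examples illustrates consistency for the Bernoulli model of Example 1, where the MLE is the empirical average; the sample sizes below are arbitrary choices.
> # consistency illustration: the Bernoulli MLE approaches theta=0.5 as n grows
> theta=0.5;
> for (n in c(10,100,1000,10000)) {
+   samples=rbinom(n,1,theta);
+   print(c(n, mean(samples))); # the MLE is the empirical average
+ }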
Asymptotic Efficiency Among all unbiased estimators, the MLE has the smallest asymptotic variance.