L08 Maximum Likelihood Estimation
The purpose of these notes is to review the definition of a maximum likelihood estimate (MLE), and show
that the sample mean is the MLE of the µ parameter in a Gaussian. For more details about MLEs, see the
Wikipedia article:
https://en.wikipedia.org/wiki/Maximum_likelihood
Consider a random sample X1 , X2 , . . . , Xn coming from a distribution with parameter θ (for example, they
could be from a Gaussian distribution with parameter µ). Remember the terminology “random sample”
means that the Xi random variables are independent and identically distributed (i.i.d.). Furthermore, let's
assume that each Xi has a probability density function p_{X_i}(x; θ). Given a realization of our random sample,
x1 , x2 , . . . , xn , (remember, these are the actual numbers that we have observed), we define the likelihood
function L(θ) as follows:
\[
L(\theta) = p_{X_1,\dots,X_n}(x_1, x_2, \dots, x_n; \theta) = \prod_{i=1}^{n} p_{X_i}(x_i; \theta),
\]
where the second equality uses the independence of the Xi.
Here, p_{X_1,\dots,X_n} is the joint pdf for all of the Xi variables. This pdf depends on the value of the parameter θ
for the distribution, which is why it appears in the notation after the semicolon. Notice an important point: we are treating
the xi as constants (they are the data that we’ve observed) and L is a function of θ. Maximum likelihood
now says that we want to maximize this likelihood function as a function of θ.
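To make this concrete, here is a minimal sketch in Python/NumPy of a likelihood computed as a product of per-point densities. It assumes the Gaussian case used later in these notes (with σ = 1 known), and the names gaussian_pdf and likelihood are just illustrative:

    import numpy as np

    def gaussian_pdf(x, mu, sigma=1.0):
        # Density of N(mu, sigma^2) evaluated at x.
        return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

    def likelihood(mu, data):
        # L(mu): product of the per-point densities p(x_i; mu),
        # treating the observed data as fixed and mu as the variable.
        return np.prod([gaussian_pdf(x, mu) for x in data])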
To maximize this function, it is easier to think about maximizing its natural log. We can do this because ln
is a monotonically increasing function, so the value of the parameter that maximizes L also maximizes ln L. From
here on we specialize to the Gaussian case: the parameter is the mean, θ = µ, the variance σ² is treated as known,
and each density is

\[
p_{X_i}(x_i; \mu) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(x_i - \mu)^2}{2\sigma^2} \right).
\]

Taking the log turns the product L(µ) into a sum, so the log-likelihood function is

\[
\ell(\mu) = \ln L(\mu) = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2 + C,
\]

where C is a constant in µ (we don't need it to maximize ℓ). Now, defining our estimate of µ to maximize
the log-likelihood, we get
\[
\hat{\mu} = \arg\max_{\mu} \; \ell(\mu) = \arg\min_{\mu} \; \sum_{i=1}^{n} (x_i - \mu)^2 .
\]
Notice we changed the sign in the last equality, and this changes us from a max to a min problem. This is
called least squares, as we are minimizing the sum of squared differences between µ and our data xi. We can
solve this optimization problem exactly using the fact (from calculus) that the derivative of ℓ with respect
to µ will be zero at a maximum. We get
\[
0 = \frac{d\ell}{d\mu} \quad\Longleftrightarrow\quad 0 = \frac{d}{d\mu} \sum_{i=1}^{n} (x_i - \mu)^2 = -2 \sum_{i=1}^{n} (x_i - \mu) = 2n\mu - 2 \sum_{i=1}^{n} x_i .
\]
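Solving this last equation for µ gives

\[
2n\mu = 2 \sum_{i=1}^{n} x_i \quad\Longrightarrow\quad \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i = \bar{x},
\]

so the MLE of µ is the sample mean. (It is indeed a maximum of ℓ, since the second derivative of ℓ is −n/σ² < 0.)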
Here are some plots demonstrating the above MLE of the mean of a Gaussian. First, we generated a random
sample, x1, . . . , x20, from a normal distribution with µ = 3 and σ = 1.
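A minimal sketch of this step in Python/NumPy (the seed and variable names are illustrative, not from the original notes):

    import numpy as np

    rng = np.random.default_rng(0)                  # illustrative seed
    x = rng.normal(loc=3.0, scale=1.0, size=20)     # x_1, ..., x_20 drawn from N(3, 1)

    mu_hat = x.mean()                               # the MLE of mu is the sample mean
    print(mu_hat)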
Next, we plot the likelihood functions, p(xi ; µ), for each of the points separately. Note that the xi points are
plotted on the bottom (x-axis) and each one has its own Gaussian pdf “hill” centered above it. These are the
p(xi ; µ).
[Figure: the data points xi on the horizontal axis (0 to 6), each with its Gaussian likelihood curve p(xi ; µ) centered above it.]
Next, we plot the likelihood function for all of the data, which is just the product of all of the p(xi ; µ). The
vertical line is at the average of the xi data. You can see that the maximum of the likelihood curve is indeed
at the average.
[Figure: Likelihood Function, L(µ ; x), plotted for µ from 0 to 6 (values on the order of 1e−12); a vertical line marks the average of the xi.]
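The same check can be done numerically. Below is a small sketch that evaluates L(µ) on a grid and confirms the argmax lands at the sample average; it assumes the sample x and the likelihood() helper from the earlier sketches are in scope:

    import numpy as np

    mu_grid = np.linspace(0.0, 6.0, 601)                       # candidate values of mu
    L_vals = np.array([likelihood(m, x) for m in mu_grid])     # L(mu) evaluated on the grid

    print(mu_grid[np.argmax(L_vals)], x.mean())                # grid argmax vs. sample average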
Finally, we plot the log-likelihood function (the log of the previous plot, which is just a quadratic). The
maximum is still at the same place.
[Figure: Average Log-Likelihood Function, l(µ ; x), plotted for µ from 0 to 6 (values roughly −6 to −2); the maximum is at the same location as before.]
Now consider a second example: suppose each Xi is a Bernoulli random variable with P(Xi = 1) = θ and
P(Xi = 0) = 1 − θ, so the observed data x1, . . . , xn are zeros and ones. What is the MLE for θ? The likelihood
for a single xi is

\[
p(x_i; \theta) = \theta^{x_i} (1 - \theta)^{1 - x_i}.
\]

Notice this is θ when xi = 1 and 1 − θ when xi = 0. Now the joint likelihood of all the xi is just the product of
these individual likelihoods:
\[
\begin{aligned}
L(\theta) &= p(x_1, \dots, x_n; \theta) \\
          &= p(x_1; \theta) \times p(x_2; \theta) \times \cdots \times p(x_n; \theta) \\
          &= \prod_{i=1}^{n} \theta^{x_i} (1 - \theta)^{1 - x_i} \\
          &= \theta^{\sum_i x_i} (1 - \theta)^{\sum_i (1 - x_i)} \\
          &= \theta^{k} (1 - \theta)^{n - k}, \quad \text{where } k = \sum_{i=1}^{n} x_i .
\end{aligned}
\]
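As a quick check with a tiny made-up dataset (purely illustrative, not from the notes): for n = 4 observations x = (1, 0, 1, 1) we have k = 3, and the product above is

\[
L(\theta) = \theta \cdot (1 - \theta) \cdot \theta \cdot \theta = \theta^{3} (1 - \theta)^{1},
\]

which matches θ^k (1 − θ)^{n−k} with k = 3 and n = 4.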
To maximize L(θ), we can take the derivative (without first taking the log this time):
\[
\begin{aligned}
\frac{dL}{d\theta} &= k\,\theta^{k-1} (1 - \theta)^{n-k} - (n - k)\,\theta^{k} (1 - \theta)^{n-k-1} \\
                   &= \bigl( k(1 - \theta) - (n - k)\theta \bigr)\, \theta^{k-1} (1 - \theta)^{n-k-1} \\
                   &= (k - n\theta)\, \theta^{k-1} (1 - \theta)^{n-k-1} .
\end{aligned}
\]
Setting this to zero (dL/dθ = 0), and then solving for θ, gives the maximum likelihood estimate:
\[
\hat{\theta} = \frac{k}{n} .
\]
This is what we intuitively expect. The value k is the number of ones appearing in our data, so θ̂ is the
proportion of ones in our data.
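As a final numerical sanity check, here is a short Python/NumPy sketch (variable names and the seed are illustrative) that simulates Bernoulli data and compares the closed-form MLE k/n with a brute-force grid maximization of L(θ):

    import numpy as np

    rng = np.random.default_rng(0)                   # illustrative seed
    x = rng.binomial(n=1, p=0.3, size=100)           # 100 Bernoulli(0.3) observations
    k = x.sum()                                      # number of ones in the data

    theta_hat = k / len(x)                           # closed-form MLE: proportion of ones

    theta_grid = np.linspace(0.001, 0.999, 999)      # avoid the endpoints 0 and 1
    L_vals = theta_grid ** k * (1 - theta_grid) ** (len(x) - k)
    theta_argmax = theta_grid[np.argmax(L_vals)]     # grid maximizer of L(theta)

    print(theta_hat, theta_argmax)                   # should agree up to grid spacing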