
Maximum Likelihood Estimation

Foundations of Data Analysis


March 4, 2021

The purpose of these notes is to review the definition of a maximum likelihood estimate (MLE), and show
that the sample mean is the MLE of the µ parameter in a Gaussian. For more details about MLEs, see the
Wikipedia article:
https://en.wikipedia.org/wiki/Maximum_likelihood
Consider a random sample X1 , X2 , . . . , Xn coming from a distribution with parameter θ (for example, they
could be from a Gaussian distribution with parameter µ). Remember that the terminology “random sample” means that the Xi random variables are independent and identically distributed (i.i.d.). Furthermore, let’s
assume that each Xi has a probability density function pXi (x; θ). Given a realization of our random sample,
x1 , x2 , . . . , xn , (remember, these are the actual numbers that we have observed), we define the likelihood
function L(θ) as follows:
$$
\begin{aligned}
L(\theta) &= p_{X_1,\ldots,X_n}(x_1, x_2, \ldots, x_n; \theta) \\
          &= \prod_{i=1}^{n} p_{X_i}(x_i; \theta), \quad \text{using independence of the } X_i.
\end{aligned}
$$

Here, pX1 ,...,Xn is the joint pdf of all of the Xi variables. This pdf depends on the value of the parameter θ for the distribution, which is why θ appears in the notation after the semicolon. Notice an important point: we are treating the xi as constants (they are the data that we have observed), and L is a function of θ. Maximum likelihood now says that we want to maximize this likelihood function as a function of θ.
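To make the definition concrete, here is a minimal Python sketch of the idea: the likelihood is simply the product of the per-observation densities evaluated at the observed data, viewed as a function of θ. The function names, the sample values, and the choice of a Gaussian density with known σ = 1 are all illustrative assumptions, not part of the notes.

```python
import numpy as np
from scipy.stats import norm

def likelihood(theta, data, density):
    """L(theta): product of the per-observation densities p(x_i; theta)."""
    return np.prod([density(x, theta) for x in data])

# Illustrative choice: Gaussian density with unknown mean theta, known sigma = 1.
gaussian_density = lambda x, theta: norm.pdf(x, loc=theta, scale=1.0)

data = [2.1, 3.4, 2.8, 3.9, 2.5]                 # hypothetical observed sample
print(likelihood(3.0, data, gaussian_density))   # L(3.0)
print(likelihood(0.0, data, gaussian_density))   # much smaller: theta = 0 fits the data poorly
```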

MLE of Gaussian mean parameter, µ


Now, let’s work this out for the Gaussian case, i.e., let X1 , X2 , . . . , Xn ∼ N(µ, σ²). We will focus only on the MLE of the µ parameter, essentially treating σ² as a known constant for simplicity of the example. The
likelihood function looks like this:
$$
\begin{aligned}
L(\mu) &= p_{X_1,\ldots,X_n}(x_1, x_2, \ldots, x_n; \mu) \\
       &= \prod_{i=1}^{n} p_{X_i}(x_i; \mu), \quad \text{using independence of the } X_i, \\
       &= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{1}{2\sigma^2}(x_i - \mu)^2\right), \quad \text{using the Gaussian pdf for each } X_i, \\
       &= \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{\!n} \exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2\right), \quad \text{the product turns into a sum inside the exp.}
\end{aligned}
$$

To maximize this function, it is easier to maximize its natural log. We can do this because ln is a monotonically increasing function, so the value of µ that maximizes L also maximizes ln L. So, the log-likelihood function is defined as
$$\ell(\mu) = \ln L(\mu) = -\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2 + C,$$

where $C = n \ln\!\big(1/(\sqrt{2\pi}\,\sigma)\big)$ is constant in µ (we don’t need it to maximize ℓ). Now, defining our estimate of µ to maximize the log-likelihood, we get
$$\hat{\mu} = \arg\max_{\mu} \ell(\mu) = \arg\min_{\mu} \sum_{i=1}^{n}(x_i - \mu)^2.$$

Notice we changed the sign in the last equality, which turns the max problem into a min problem. This is called least squares, since we are minimizing the sum of squared differences between µ and our data xi. We can solve this problem exactly using the fact (from calculus) that the derivative of ℓ with respect to µ is zero at the maximum. Because ℓ differs from the negated sum of squares only by the positive constant factor 1/(2σ²), this is the same as setting the derivative of the sum of squares to zero:

$$0 = \frac{d}{d\mu}\sum_{i=1}^{n}(x_i - \mu)^2 = -2\sum_{i=1}^{n}(x_i - \mu) = 2n\mu - 2\sum_{i=1}^{n} x_i.$$

Solving for µ, we get the sample mean as the MLE:


$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i.$$
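This closed-form result is easy to check numerically. The sketch below (a rough illustration assuming numpy; the grid of candidate µ values and the random seed are arbitrary choices) compares the sample mean against a brute-force maximization of ℓ(µ) over a grid, using the same setup as the plots that follow (n = 20 points from N(3, 1)):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0
x = rng.normal(loc=3.0, scale=sigma, size=20)   # random sample from N(mu=3, sigma=1)

mus = np.linspace(0.0, 6.0, 2001)               # candidate values of mu
# Log-likelihood up to the constant C, which does not affect the arg max.
loglik = np.array([-np.sum((x - mu) ** 2) / (2 * sigma ** 2) for mu in mus])

print(mus[np.argmax(loglik)])   # grid-based maximizer of the log-likelihood
print(np.mean(x))               # the sample mean, i.e., the closed-form MLE
```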

Here are some plots demonstrating the above MLE of the mean of a Gaussian. First, we generated a random
sample, x1 , . . . , x20 from a normal distribution with µ = 3, σ = 1.
Next, we plot the likelihood functions, p(xi ; µ), for each of the points separately. Note that the xi points are
plotted on the bottom (x-axis) and each one has its own Gaussian pdf “hill” centered above it. These are the
p(xi ; µ).

[Figure: Individual Likelihoods Per Point; the curves L(µ ; xi), one per data point, plotted for µ from 0 to 6.]

Next, we plot the likelihood function for all of the data, which is just the product of all of the p(xi ; µ). The
vertical line is at the average of the xi data. You can see that the maximum of the likelihood curve is indeed
at the average.

[Figure: Likelihood Function; L(µ ; x) plotted for µ from 0 to 6, with a vertical line at the average of the xi.]

Finally, we plot the log-likelihood function (the log of the previous plot, which is just a quadratic). The
maximum is still at the same place.

[Figure: Average Log-Likelihood Function; ℓ(µ ; x) plotted for µ from 0 to 6.]
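The three plots above could be reproduced with a short matplotlib sketch along the following lines. This is only an illustration of the construction, not the original plotting code; the random seed, grid, and styling are arbitrary, and it assumes numpy, scipy, and matplotlib are available.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=1.0, size=20)      # sample from N(mu=3, sigma=1)
mus = np.linspace(0.0, 6.0, 500)                 # grid of mu values to plot over

fig, axes = plt.subplots(3, 1, figsize=(6, 9))

# 1. Individual likelihoods p(x_i; mu): one Gaussian "hill" centered over each point.
for xi in x:
    axes[0].plot(mus, norm.pdf(xi, loc=mus, scale=1.0), color="gray")
axes[0].plot(x, np.zeros_like(x), "k|")          # the data points on the x-axis
axes[0].set_title("Individual Likelihoods Per Point")

# 2. Full likelihood: the product of the individual likelihoods.
L = np.prod([norm.pdf(xi, loc=mus, scale=1.0) for xi in x], axis=0)
axes[1].plot(mus, L)
axes[1].axvline(np.mean(x), color="red")         # vertical line at the sample mean
axes[1].set_title("Likelihood Function")

# 3. Log-likelihood: the log of the product, a downward-opening quadratic in mu.
axes[2].plot(mus, np.log(L))
axes[2].axvline(np.mean(x), color="red")
axes[2].set_title("Log-Likelihood Function")

plt.tight_layout()
plt.show()
```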

MLE of a Bernoulli probability


The Bernoulli distribution is the distribution of a binary random variable. If our random variables Xi are now binary, the notation is Xi ∼ Ber(θ). The parameter θ gives the probability that Xi is a one. In other words:

$$P(X_i = 1) = \theta, \qquad P(X_i = 0) = 1 - \theta.$$

Now, what is the MLE for θ? The likelihood for a single xi is:

$$p(x_i; \theta) = \theta^{x_i}(1 - \theta)^{1 - x_i}.$$

Notice this is θ when xi = 1 and 1 − θ when xi = 0. Now the joint likelihood of all xi is just the product of
these individual likelihoods:
$$
\begin{aligned}
L(\theta) &= p(x_1, \ldots, x_n; \theta) \\
          &= p(x_1; \theta) \times p(x_2; \theta) \times \cdots \times p(x_n; \theta) \\
          &= \prod_{i=1}^{n} \theta^{x_i}(1 - \theta)^{1 - x_i} \\
          &= \theta^{\sum_i x_i}(1 - \theta)^{\sum_i (1 - x_i)} \\
          &= \theta^{k}(1 - \theta)^{n - k}, \quad \text{where } k = \sum_{i=1}^{n} x_i.
\end{aligned}
$$
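The last simplification is easy to verify numerically. Here is a tiny sketch (the binary data and the value of θ are hypothetical) checking that the product of the individual Bernoulli likelihoods equals θ^k (1 − θ)^(n−k):

```python
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])       # hypothetical binary data
theta = 0.4                                        # an arbitrary value of theta
n, k = len(x), int(np.sum(x))

product_form   = np.prod(theta ** x * (1 - theta) ** (1 - x))
collapsed_form = theta ** k * (1 - theta) ** (n - k)
print(product_form, collapsed_form)                # the two forms agree
```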

To maximize L(θ), we can take the derivative (without first taking log this time):

$$
\begin{aligned}
\frac{dL}{d\theta} &= k\theta^{k-1}(1 - \theta)^{n-k} - (n - k)\theta^{k}(1 - \theta)^{n-k-1} \\
                   &= \big(k(1 - \theta) - (n - k)\theta\big)\,\theta^{k-1}(1 - \theta)^{n-k-1} \\
                   &= (k - n\theta)\,\theta^{k-1}(1 - \theta)^{n-k-1}.
\end{aligned}
$$

Setting this to zero (dL/dθ = 0), and then solving for θ, gives the maximum likelihood estimate:

$$\hat{\theta} = \frac{k}{n}.$$

This is what we intuitively expect. The value k is the number of ones appearing in our data, so θ̂ is the
proportion of ones in our data.
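As a final sanity check, here is a short sketch (again with hypothetical binary data) comparing the proportion of ones against a brute-force maximization of L(θ) = θ^k (1 − θ)^(n−k) over a grid of θ values:

```python
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])       # hypothetical binary data
n, k = len(x), int(np.sum(x))

thetas = np.linspace(0.001, 0.999, 999)            # grid of candidate theta values
L = thetas ** k * (1 - thetas) ** (n - k)          # Bernoulli likelihood on the grid

print(thetas[np.argmax(L)])   # approximately k / n
print(k / n)                  # the closed-form MLE: the proportion of ones
```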
