Stats, MLE, and Other Stuff
1 SE vs SD
For a collection of measurements of a single parameter, the SD tells us how
spread out our measurements are from the mean measurement value.
Notice that the SE of the mean will depend on how large our random samples
are. Bigger random samples mean that each sample mean will, on average, be
closer to the true population mean, simply because larger samples contain more
data. The exact effect is quantified by the so-called "central limit theorem."
Finally we should recall that the SE can be computed for things other than
the mean. Any statistic with a sampling distribution (so, probably every statis-
tic worth thinking about) has an SE. For instance, suppose we took some sample
data, did a line fit, and recorded the line slope. Repeat that process to get the
sampling distribution of slopes, whose SD is the SE of the slope.
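To make the idea concrete, here is a minimal simulation sketch (the population slope, intercept, noise level, and sample size are all made-up values for illustration): we repeatedly draw a sample, fit a line, and take the SD of the fitted slopes as the SE of the slope.

```python
import numpy as np

rng = np.random.default_rng(0)
true_slope, true_intercept, noise_sd = 2.0, 1.0, 0.5   # assumed "population" values
n_points, n_repeats = 20, 5000

slopes = []
for _ in range(n_repeats):
    x = rng.uniform(0, 10, n_points)                   # draw a fresh random sample
    y = true_slope * x + true_intercept + rng.normal(0, noise_sd, n_points)
    slope, intercept = np.polyfit(x, y, 1)             # straight-line fit
    slopes.append(slope)

print("SE of the slope (SD of the sampling distribution):", np.std(slopes))
```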
we don't know that either. Instead, what we can use is a technique known as
maximum likelihood estimation (MLE). In MLE, we assume some functional
form for our data, then find the values of our function's parameters (things
like slope or y-intercept) that make it most consistent with our data. In the
process, we can estimate the sampling distributions of our parameters.
3 machinery of MLE
MLE is used when you have some data and a model with adjustable parameters.
You want to adjust the parameters so that the model matches the data as well
as possible.
Now for some math. Suppose we run an experiment and get some data sets
[xi ] = [x1 , x2 , ..., xN ] and [yi ] = [y1 , y2 , ..., yN ]. We plot up our data and decide
we want to fit our data to some model function f, where

y = f(x, c1, c2, ...)    (1)

The ci's specify that our model function f can depend on an arbitrary number
of extra parameters. As examples, suppose our model function was the constant
function. Then our model function only needs one extra parameter, c1 :
f (x, c1 ) = c1 (2)
For a linear model,

f(x, c1, c2) = c1 x + c2    (3)

and for a quadratic model,

f(x, c1, c2, c3) = c1 x² + c2 x + c3    (4)
and so on. Models also don’t need to be polynomial.
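In code, such model functions are just ordinary functions of x plus some extra parameters; here is a minimal sketch (the function names are only illustrative):

```python
import numpy as np

def constant_model(x, c1):
    """f(x, c1) = c1, as in equation (2)."""
    return np.full_like(np.asarray(x, dtype=float), c1)

def linear_model(x, c1, c2):
    """f(x, c1, c2) = c1*x + c2, as in equation (3)."""
    return c1 * np.asarray(x) + c2

def quadratic_model(x, c1, c2, c3):
    """f(x, c1, c2, c3) = c1*x**2 + c2*x + c3, as in equation (4)."""
    x = np.asarray(x)
    return c1 * x**2 + c2 * x + c3

def exponential_model(x, c1, c2):
    """A non-polynomial example: f(x, c1, c2) = c1*exp(c2*x)."""
    return c1 * np.exp(c2 * np.asarray(x))
```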
With our model function in hand, we can calculate the probability of obtain-
ing our data assuming that our model is true. By this, we mean that we as-
sume our yi data would be perfectly explained by our model if there were no
noise/uncertainty in our experimental method. With noise added, the yi's are
displaced slightly from the values predicted by our model. See Fig. 1.
[Figure 1: y vs. x, without noise (left panel) and with noise (right panel).]
Mathematically, we are saying that if there is no noise then there exists some
c1 , c2 , ... such that the equality
yi = f (xi , c1 , c2 , c3 ) (5)
holds for all our data. In the presence of noise the above no longer holds, so
we add a fudge factor εi to make the equality work:

yi = f(xi, c1, c2, c3) + εi    (6)
Now we must assume a profile for the noise in our experiment. The most basic
noise profile is a Gaussian (this turns out to often be a good approximation even
when the noise isn't Gaussian) with some fixed variance σ². This means that
the probability of obtaining some noise value ε is

p(ε) = (1/√(2πσ²)) exp(−ε²/(2σ²))    (7)
With (6) and (7) we can define the probability of obtaining a pair of data points
(xi , yi ). Note that
εi = yi − f(xi, c1, c2, ...)    (8)

so

p(xi, yi) = p(εi) = (1/√(2πσ²)) exp(−(yi − f(xi, c1, c2, ...))²/(2σ²))    (9)
Equation (9) gives us the probability of obtaining only one data point pair. We
want the probability of obtaining our entire data set. Assuming all our data
point pairs are independent, the probability of obtaining all our pairs is the
product of all the individual probabilities for obtaining each pair. Thus
p([xi], [yi]) = ∏_{i=1}^N p(xi, yi)    (10)
The ∏ is like a sum, except we multiply all the elements in the sequence
instead of adding them. Equation (10) is the likelihood. Note that to actually
compute (10) we need to assume values of c1, c2, ... (since they appear in (9)).
Thus, it is customary to write the likelihood instead as

p([xi], [yi] | c1, c2, ...) = ∏_{i=1}^N p(xi, yi | c1, c2, ...)    (11)

Things on the right of the bar are assumed to be known before the probability
is calculated. Writing out the Gaussians in (9) and collecting terms, the
likelihood can also be written as

p([xi], [yi] | c1, c2, ...) = (2πσ²)^(−N/2) exp(−χ²/2)    (12)

where

χ² ≡ ∑_{i=1}^N (yi − f(xi, c1, c2, ...))²/σ²    (13)
Recall that the end goal is to find the model parameters c1, c2, ... that maximize
the likelihood (12). This, in general, is hard. Oftentimes it is easier to maximize
the log-likelihood L, where
L ≡ log p([xi], [yi] | c1, c2, ...) = −(N/2) log(2πσ²) − χ²/2    (14)
This is usually done since the likelihood is often an extremely small number,
which makes it hard for computers to work with. Additionally, the log-likelihood
is maximal where the likelihood is maximal, so the results are the same either
way. In this case, maximizing the likelihood reduces to minimizing χ², so this
method is identical to the ordinary least squares (OLS) method. (MLE is far
more powerful than OLS, however, in that it can handle non-Gaussian noise
profiles and more complicated non-linear models.)
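As a small code sketch of (14) (assuming a model(x, *params) callable like the ones sketched above and a known noise level sigma), note that the parameter values maximizing this function are exactly the ones minimizing χ²:

```python
import numpy as np

def log_likelihood(params, x, y, model, sigma):
    """Gaussian log-likelihood, eq (14): L = -N/2 log(2*pi*sigma^2) - chi^2/2."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    resid = y - model(x, *params)
    chi2 = np.sum(resid**2) / sigma**2
    return -0.5 * x.size * np.log(2 * np.pi * sigma**2) - 0.5 * chi2
```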
4 maximizing likelihood
Even when considering the log-likelihood, maximization is a hard problem. In
calculus, maximization is usually carried out by computing derivatives and find-
ing where they go to 0. In this case, to maximize L(c1 , c2 , ...), we would need
to compute the partial derivatives
∂L/∂c1, ∂L/∂c2, ∂L/∂c3, ...    (15)
and find the c1, c2, c3, ... that make all the partial derivatives go to 0. (A
partial derivative with respect to variable c1 is like a regular derivative except
that we assume c2, c3, ... to all be constant; similarly for c2, c3, ....) The
maximizing values of c1, c2, ... I denote as ĉ1, ĉ2, ...; these are also called the
"maximum likelihood estimators" of c1, c2, ...
The above process is sometimes impossible to do with pen & paper, especially
for very complex or high-dimensional likelihood functions. So, more often,
numerical maximization methods are used to approximate where the maxima
occur. Commonly used numerical methods include gradient-ascent methods and
simplex methods.
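As an illustration (not the only way to do it), here is a sketch that maximizes the log-likelihood numerically with SciPy's Nelder-Mead simplex routine by minimizing the negative log-likelihood; the data, model, noise level, and starting guess are all made up for the example:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
sigma = 0.5
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, sigma, x.size)   # synthetic straight-line data

def neg_log_likelihood(params):
    c1, c2 = params
    chi2 = np.sum((y - (c1 * x + c2))**2) / sigma**2
    return 0.5 * x.size * np.log(2 * np.pi * sigma**2) + 0.5 * chi2

result = minimize(neg_log_likelihood, x0=[0.0, 0.0], method="Nelder-Mead")
c1_hat, c2_hat = result.x                          # maximum likelihood estimators
print("ML estimators (slope, intercept):", c1_hat, c2_hat)
```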
However, for simple models, exact solutions are possible using the derivative
method. I show two such solutions in later sections.
Besides the maximizing values ĉ1, ĉ2, ..., we also want their uncertainties.
What we are after are the probability distributions of the parameters given our
data,

p(c1 | [xi], [yi]), p(c2 | [xi], [yi]), p(c3 | [xi], [yi]), ...    (16)
Note that our dataset appears to the right of the bar because the values of our
model parameters should be constrained to fit our observed data. The SD’s of
the above probability distributions estimate the SE’s of the model parameters,
which we will denote σc1 , σc2 , σc3 , ...
How do we get the above? Well, going back, the likelihood function
p([xi], [yi] | c1, c2, ...) gives us the probability of obtaining our data set given
some values for our model parameters. We can invert this probability using
Bayes' rule to obtain the posterior. In most cases,²

p(c1, c2, ... | [xi], [yi]) ∝ p([xi], [yi] | c1, c2, ...)    (17)
The posterior, on the left, is the probability that our model with parameters
c1, c2, ... is true, given our data. The ∝ symbol denotes proportionality, meaning
the left-hand side and right-hand side are equal up to some constant.
To get the distribution of a single parameter, say c1, we integrate the posterior
over all the other parameters:

p(c1 | [xi], [yi]) = ∫ p(c1, c2, c3, ... | [xi], [yi]) dc2 dc3 ...    (18)

This process is known as "marginalization." Oftentimes doing the above integral
is hard. Fortunately, there are other ways to get the posteriors. In particular,
we can approximate our likelihood function with another function that we know
how to marginalize, without even doing an integral.
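Before describing that approximation, note that for low-dimensional problems there is also a brute-force option: evaluate the posterior on a grid of parameter values and sum out the parameters you don't care about. Here is a minimal sketch of (18) for a two-parameter line fit (the data, grid ranges, and noise level are made up):

```python
import numpy as np

rng = np.random.default_rng(4)
sigma = 0.5
x = np.linspace(0, 10, 30)
y = 2.0 * x + 1.0 + rng.normal(0, sigma, x.size)      # synthetic straight-line data

c1_grid = np.linspace(1.5, 2.5, 201)                  # slope values to scan
c2_grid = np.linspace(0.0, 2.0, 201)                  # intercept values to scan
C1, C2 = np.meshgrid(c1_grid, c2_grid, indexing="ij")

# chi^2 on the grid, then the (unnormalized) posterior per eq (17)
chi2 = ((y - (C1[..., None] * x + C2[..., None]))**2 / sigma**2).sum(axis=-1)
posterior = np.exp(-0.5 * (chi2 - chi2.min()))

p_c1 = posterior.sum(axis=1)                          # sum over c2: marginalization, eq (18)
mean_c1 = np.sum(c1_grid * p_c1) / np.sum(p_c1)
sd_c1 = np.sqrt(np.sum((c1_grid - mean_c1)**2 * p_c1) / np.sum(p_c1))
print("posterior SD (SE estimate) of the slope:", sd_c1)
```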
The crux of the approximation just described involves Taylor expanding the
log-likelihood about the ML estimators ĉ1, ĉ2, ... to 2nd order.³ A multivariable
2nd-order Taylor expansion is rather messy, and involves something called the
"Hessian" matrix,
which I will not go into. However, only a subset of the terms in the expansion
(namely those involving the diagonal of the Hessian) matter for our purposes.
So I only write those out explicitly. The expansion of L is
L(c1, c2, ...) ≈ C0 + (1/2) ∑_i (∂²L/∂ci²)(ci − ĉi)² + ...    (19)

where C0 is a constant combining all the 0th-order terms, the sum runs over the
model parameters, and the ... collects all the cross-terms (terms with factors
like ∂²L/∂c1∂c2).
Dropping the cross-terms and exponentiating (19) gives a product of Gaussians,
one for each parameter. Reading off their widths, we can estimate the variances
(the squared SDs of the posterior distributions of our model parameters) as
σc1² = −(∂²L/∂c1²)⁻¹

σc2² = −(∂²L/∂c2²)⁻¹    (23)

...
Thus, we have arrived at how to estimate the SDs of the posterior distributions
(16). Note that (23) is evaluated at c1 = ĉ1, c2 = ĉ2, ... since our Taylor
expansion was centered on those values.
As a side note, a more general approach to (23) involves the "Fisher informa-
tion." Intuitively, we are estimating the "spread" of the probability distributions
for c1, c2, ... with their second derivative, which contains information on the
curvature of the distribution. The Fisher information builds on this concept.
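Here is a small sketch of (23) in code, using a central finite difference to estimate the second derivative of the log-likelihood (it assumes a log_like(params) callable returning the log-likelihood for a parameter vector; the step size h is an arbitrary choice):

```python
import numpy as np

def curvature_se(log_like, params_hat, index, h=1e-4):
    """SE of one parameter from the curvature of the log-likelihood,
    sigma_c^2 = -1 / (d^2 L / dc^2), via central differences (eq 23)."""
    p0 = np.asarray(params_hat, dtype=float)
    p_plus, p_minus = p0.copy(), p0.copy()
    p_plus[index] += h
    p_minus[index] -= h
    d2L = (log_like(p_plus) - 2.0 * log_like(p0) + log_like(p_minus)) / h**2
    return np.sqrt(-1.0 / d2L)
```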
5 SE of the mean

In this simple case the extra machinery of MLE is not strictly necessary. Let's
say we have a population of x values with some true mean µ and variance σ².
Now we sample our population to get N measured x values [x1, ..., xN]. The
mean of this sample, x̄, is
x̄ = (1/N) ∑_{i=1}^N xi    (24)
We ultimately want the SD of x̄; this is the SE of the mean. To do so we
calculate the variance of x̄, using the identity

Var(a X) = a² Var(X)    (25)

where a is a constant. Equation (25) just comes from the fact that computing a
variance involves squaring things. Anyways, this gives
Var(x̄) = Var((1/N) ∑_{i=1}^N xi)
       = (1/N²) Var(∑_{i=1}^N xi)
       = (1/N²) ∑_{i=1}^N Var(xi)
       = (1/N²) ∑_{i=1}^N σ²
       = σ²/N    (26)
So the SE is

SE = √(σ²/N) = σ/√N    (27)
Note that in practice we do not know the true population SD, σ, so the general
approach is to estimate it with the standard deviation of our sample.4
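As a quick numerical sanity check of (27) (the population parameters and sample size below are made up):

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n = 5.0, 2.0, 25                # made-up population mean, SD, and sample size

# build the sampling distribution of the mean by repeated sampling
sample_means = [rng.normal(mu, sigma, n).mean() for _ in range(10_000)]

print("SD of the sample means:", np.std(sample_means))   # empirical SE of the mean
print("sigma / sqrt(N):       ", sigma / np.sqrt(n))     # prediction from eq (27)
```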
6 MLE example: the mean

Now let's redo this with the MLE machinery. We model each measurement as a
constant plus noise, f(x, c1) = c1, so that xi = c1 + εi. The log-likelihood is then

L = −(N/2) log(2πσ²) − (1/(2σ²)) ∑_{i=1}^N (xi − c1)²

and its first derivative is

∂L/∂c1 = (1/σ²) ∑_{i=1}^N (xi − c1)

Set the above to 0 and solve for c1 to find the maximizing value of c1:

ĉ1 = (1/N) ∑_{i=1}^N xi = x̄    (31)
4 If you are astute you’ll notice there’s some trickery which involves switching the order
of the variance calculation and the sum. This is only allowable if our measurements xi are
completely independent of one another.
The ML estimator is just the arithmetic mean! Differentiate L twice with respect
to c1 to find
∂²L/∂c1² = (1/σ²) ∑_{i=1}^N (−1) = −N/σ²    (32)
By the relations in (23), the SE for the parameter c1 is
σc1 = σ/√N    (33)
just as we derived in section 5.
7 MLE example: fitting a line

As a final example, suppose we want to fit our data to a linear model,

f(x, c1, c2) = c1 x + c2    (34)

where c1 is the slope and c2 is the y-intercept. Our χ² takes the form
χ²(c1, c2) ≡ ∑_{i=1}^N (yi − c1 xi − c2)²/σ²    (35)
Our log-likelihood L is
L = log p([xi], [yi] | c1, c2) = −(N/2) log(2πσ²) − χ²(c1, c2)/2    (36)
We need to take the first derivatives of L like before and find the values of
c1 , c2 that make both first derivatives go to 0. This involves solving a system of
equations. One of the equations, for example, is
0 = ∂L/∂c1 = −(1/2) ∂χ²/∂c1
  = −(1/(2σ²)) ∑_{i=1}^N ∂/∂c1 (yi − c1 xi − c2)²
  = −(1/(2σ²)) ∑_{i=1}^N 2(yi − c1 xi − c2)(−xi)

so

0 = ∑_{i=1}^N xi (yi − c1 xi − c2)    (37)
The two equations can be solved. If you were to carry out the maximization
you would find

ĉ1 = (N ∑ xi yi − ∑ xi ∑ yi) / (N ∑ xi² − (∑ xi)²)    (38)

ĉ2 = (∑ yi − ĉ1 ∑ xi) / N    (39)
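A quick way to sanity-check (38) and (39) is to compare them against a library fit on synthetic data; a minimal sketch (the data are made up):

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 30)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, x.size)      # synthetic straight-line data

N = x.size
c1_hat = (N * np.sum(x * y) - np.sum(x) * np.sum(y)) / (N * np.sum(x**2) - np.sum(x)**2)  # eq (38)
c2_hat = (np.sum(y) - c1_hat * np.sum(x)) / N                                             # eq (39)

print("closed form (slope, intercept):", c1_hat, c2_hat)
print("np.polyfit  (slope, intercept):", np.polyfit(x, y, 1))   # should agree
```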
I have verified that (38) and (39) follow from the maximization procedure
described above. Now we want to know the SE's. I will calculate just the SE for
the slope. Differentiate L twice with respect to c1 and find
∂²L/∂c1² = −(1/σ²) ∑_{i=1}^N xi²    (40)
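Plugging (40) into (23) gives an estimate of the slope SE; a short self-contained sketch (σ and the x grid reuse the made-up values from the previous snippet):

```python
import numpy as np

sigma = 0.5                               # assumed noise SD
x = np.linspace(0, 10, 30)                # made-up x values for the fit

d2L_dc1sq = -np.sum(x**2) / sigma**2      # second derivative of L, eq (40)
se_slope = np.sqrt(-1.0 / d2L_dc1sq)      # eq (23): sigma_c1 = sigma / sqrt(sum(x_i^2))
print("SE of the slope:", se_slope)
```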