
The Gaussian Distribution, MLE Estimators, and Introduction to Bayesian Estimation
Prof. Nicholas Zabaras
School of Engineering
University of Warwick
Coventry CV4 7AL
United Kingdom

Email: [email protected]
URL: http://www.zabaras.com/

August 7, 2014



Contents
The Gaussian Distribution, Standard Normal, Degenerate Gaussian
Distribution, Multivariate Gaussian, the Gaussian and Maximum Entropy, the
CLT and the Gaussian Distribution, Convolution of Gaussians, MLE for the
Gaussian, MLE for the Multivariate Gaussian

Sequential MLE Estimation for the Gaussian, Robbins-Monro Algorithm

Bayesian Inference for the Gaussian with Known Variance, Bayesian Inference
for the Gaussian with Known Mean, Bayesian Inference for the Gaussian with
unknown Mean and Variance

Normal-Gamma Distribution, Gaussian-Wishart Distribution

Following closely Chris Bishop's PRML book, Chapter 2, and Kevin Murphy's Machine Learning: A Probabilistic Perspective, Chapter 2.


The Gaussian Distribution
A random variable X is Gaussian or normally distributed, $X \sim N(\mu, \sigma^2)$, if:

$P(X \le t) = \int_{-\infty}^{t} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) dx$

The following can be shown easily with direct integration:

$\mathbb{E}[X] = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) x\,dx = \mu$

$\mathbb{E}[X^2] = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) x^2\,dx = \mu^2 + \sigma^2, \qquad \mathrm{var}[X] = \mathbb{E}[X^2] - \mathbb{E}[X]^2 = \sigma^2$

The following integrals are useful in these derivations:

$\int_{-\infty}^{\infty} e^{-u^2}\,du = \sqrt{\pi}, \qquad \int_{-\infty}^{\infty} u\,e^{-u^2}\,du = 0, \qquad \int_{-\infty}^{\infty} u^2 e^{-u^2}\,du = \frac{\sqrt{\pi}}{2}$

We often work with the precision of a Gaussian, $\lambda = 1/\sigma^2$. The higher $\lambda$, the narrower the distribution.
Standard Normal, CDF, Error Function
Plot of the standard normal PDF $N(x;0,1)$ and the corresponding CDF $\Phi(x;0,1)$. (Run gaussPlotDemo from PMTK.)

$\Phi(x;\mu,\sigma^2) = \int_{-\infty}^{x} N(z\,|\,\mu,\sigma^2)\,dz = \Phi(\tilde z;0,1), \qquad \tilde z = (x-\mu)/\sigma$

$\Phi(z;0,1) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\,dt = \frac{1}{2}\left[1 + \mathrm{erf}\!\left(z/\sqrt{2}\right)\right]$

$\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2}\,dt$
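As a quick check of the erf relation above, here is a minimal Python sketch (my own illustration; the slides reference the MATLAB gaussPlotDemo from PMTK for the plots):

```python
import math

def std_normal_cdf(z):
    """CDF of N(0,1) via the error function: Phi(z) = 0.5*(1 + erf(z/sqrt(2)))."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_cdf(x, mu, sigma):
    """CDF of N(mu, sigma^2), reduced to the standard normal via z = (x - mu)/sigma."""
    return std_normal_cdf((x - mu) / sigma)

# Example: P(X <= mu + sigma) for any Gaussian is about 0.8413
print(normal_cdf(1.0, 0.0, 1.0))
```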


Degenerate Gaussian Distribution
Note that as $\sigma^2 \to 0$, the Gaussian becomes a delta function centered at the mean $\mu$:

$\lim_{\sigma^2 \to 0} N(x\,|\,\mu,\sigma^2) = \delta(x-\mu)$


Multivariate Gaussian
A multivariate $X \in \mathbb{R}^D$ is Gaussian if its probability density is

$N(x\,|\,\mu,\Sigma) = \frac{1}{(2\pi)^{D/2}(\det\Sigma)^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)\right)$

where $\mu \in \mathbb{R}^D$ and $\Sigma \in \mathbb{R}^{D\times D}$ is a symmetric positive definite matrix (the covariance matrix).

We often work with the precision matrix $\Lambda = \Sigma^{-1}$.
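A minimal NumPy sketch of evaluating this density (illustrative only; the function and variable names are mine, and in practice a library routine such as scipy.stats.multivariate_normal would be used):

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Density of N(x | mu, Sigma) for a D-dimensional Gaussian."""
    D = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)            # (x-mu)^T Sigma^{-1} (x-mu)
    norm = (2.0 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

x = np.array([0.5, -0.2])
mu = np.zeros(2)
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
print(mvn_pdf(x, mu, Sigma))
```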


2D Gaussian
[Figure: level sets of 2D Gaussians with full, diagonal, and spherical covariance matrices, shown as surface plots and as contour plots. Run gaussPlot2DDemo from PMTK.]


Multivariate Gaussian: Maximum Entropy
We can show that the multivariate Gaussian maximizes the entropy $H$ subject to the constraints of normalization, given mean $\mu$ and given covariance $\Sigma$:

$\max_{p(x),\lambda,m,L}\; -\int p(x)\ln p(x)\,dx + \lambda\left(\int p(x)\,dx - 1\right) + m^T\left(\int x\,p(x)\,dx - \mu\right) + \mathrm{Tr}\left\{L\left(\int p(x)(x-\mu)(x-\mu)^T dx - \Sigma\right)\right\}$

Setting the derivative wrt $p(x)$ to zero gives:

$0 = -1 - \ln p(x) + \lambda + m^T x + \mathrm{Tr}\left\{L(x-\mu)(x-\mu)^T\right\}$

$p(x) = e^{-1+\lambda+m^T x + (x-\mu)^T L (x-\mu)}$

The coefficients can be found by satisfying the constraints. We start by completing the square.


Multivariate Gaussian: Maximum Entropy
$p(x) = e^{-1+\lambda+m^T x + (x-\mu)^T L (x-\mu)} = e^{-1+\lambda+\mu^T m - \frac{1}{4} m^T L^{-1} m}\; e^{\,y^T L y}, \qquad y = x - \mu + \tfrac{1}{2} L^{-1} m$

Satisfying the mean constraint:

$\int e^{-1+\lambda+\mu^T m - \frac{1}{4} m^T L^{-1} m}\; e^{\,y^T L y} \left(y + \mu - \tfrac{1}{2}L^{-1}m\right) dy = \mu$

The 1st term drops from symmetry, the 2nd gives $\mu$ from normalization, thus we need to have:

$\tfrac{1}{2}L^{-1}m = 0 \;\Rightarrow\; m = 0$


Multivariate Gaussian: Maximum Entropy
With $m = 0$ and $z = x - \mu$:

$p(x) = e^{-1+\lambda + (x-\mu)^T L (x-\mu)}$

Satisfying the covariance constraint:

$\int e^{-1+\lambda}\; e^{\,z^T L z}\, z z^T\, dz = \Sigma$

Note that with $L = -\tfrac{1}{2}\Sigma^{-1}$, the Gaussian integral gives:

$\int e^{\,z^T L z}\, z z^T\, dz = \Sigma\,(2\pi)^{D/2}|\Sigma|^{1/2}$

It remains to select $\lambda$ such that:

$e^{-1+\lambda}(2\pi)^{D/2}|\Sigma|^{1/2} = 1 \;\Rightarrow\; \lambda = 1 + \ln\frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}}$

The optimizing $p(x)$ is now clearly the Gaussian.
Multivariate Gaussian: Maximum Entropy
The entropy of the multivariate Gaussian is now computed as follows:

$H[x] = -\int N(x\,|\,\mu,\Sigma)\ln N(x\,|\,\mu,\Sigma)\,dx$

$= \tfrac{1}{2}\int N(x\,|\,\mu,\Sigma)\left[D\ln(2\pi) + \ln|\Sigma| + (x-\mu)^T\Sigma^{-1}(x-\mu)\right] dx$

$= \tfrac{1}{2}D\ln(2\pi) + \tfrac{1}{2}\ln|\Sigma| + \tfrac{1}{2}\int N(x\,|\,\mu,\Sigma)\,\mathrm{tr}\!\left[(x-\mu)(x-\mu)^T\Sigma^{-1}\right] dx$

$= \tfrac{1}{2}D\ln(2\pi) + \tfrac{1}{2}\ln|\Sigma| + \tfrac{1}{2}\mathrm{tr}\!\left[\left(\int N(x\,|\,\mu,\Sigma)(x-\mu)(x-\mu)^T dx\right)\Sigma^{-1}\right]$

$= \tfrac{1}{2}D\ln(2\pi) + \tfrac{1}{2}\ln|\Sigma| + \tfrac{1}{2}\mathrm{tr}\!\left[\Sigma\Sigma^{-1}\right]$

$= \tfrac{1}{2}\left(D\ln(2\pi) + \ln|\Sigma| + D\right)$
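A small NumPy sketch of this closed form (my own illustration, not from the slides):

```python
import numpy as np

def gaussian_entropy(Sigma):
    """Entropy of N(mu, Sigma): 0.5*(D*ln(2*pi) + ln|Sigma| + D); it does not depend on mu."""
    D = Sigma.shape[0]
    sign, logdet = np.linalg.slogdet(Sigma)
    return 0.5 * (D * np.log(2.0 * np.pi) + logdet + D)

Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
print(gaussian_entropy(Sigma))
```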


Multivariate Gaussian: Maximum Entropy
Using also the KL divergence, one can show that the Gaussian has the largest entropy of any distribution satisfying the mean and 2nd moment constraints. To make the presentation simple, take $\mu = 0$ and consider

$p(x) = N(x\,|\,0,\Sigma), \qquad q(x)\ \text{any density with}\ \int q(x)\,x x^T\,dx = \Sigma$

Then:

$0 \le KL(q\,\|\,p) = \int q(x)\ln\frac{q(x)}{p(x)}\,dx = -\int q(x)\ln p(x)\,dx - H[q]$

$= -\int p(x)\ln p(x)\,dx - H[q] = H[p] - H[q] \;\Rightarrow\; H[p] \ge H[q]$

The intermediate step in the proof above (replacing $q$ by $p$ inside the expectation of $\ln p$) is justified by the moment constraints on $q$ and the fact that $\ln p(x)$ is quadratic in $x$.
The CLT and the Gaussian Distribution
Let $(X_1, X_2, \ldots, X_N)$ be independent and identically distributed (i.i.d.) continuous random variables, each with expectation $\mu$ and variance $\sigma^2$.

Define: $Z_N = \frac{1}{\sigma\sqrt{N}}\left(X_1 + X_2 + \cdots + X_N - N\mu\right)$

As $N \to \infty$, the distribution of $Z_N$ converges to the distribution of a standard normal random variable:

$\lim_{N\to\infty} P(Z_N \le x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\,dt$

If $\bar X_N = \frac{1}{N}\sum_{j=1}^N X_j$, then for $N$ large, $\bar X_N \sim N\!\left(\mu, \frac{\sigma^2}{N}\right)$ approximately.

This is somewhat of a justification for assuming that Gaussian noise is common.


The CLT and the Gaussian Distribution
As an example, assume $N$ variables $(X_1, X_2, \ldots, X_N)$, each of which has a uniform distribution over $[0,1]$, and then consider the distribution of the mean $(X_1 + X_2 + \cdots + X_N)/N$. For large $N$, this distribution tends to a Gaussian. The convergence as $N$ increases can be rapid.

[Figure: histograms of the mean of $N$ uniform variables for increasing $N$; MATLAB code.]
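A small Python sketch of this experiment (a stand-in of mine for the MATLAB code referenced above):

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials = 10000

for N in (1, 2, 10):
    # Mean of N uniform[0,1] variables, repeated n_trials times
    means = rng.uniform(0.0, 1.0, size=(n_trials, N)).mean(axis=1)
    # By the CLT the sample of means looks Gaussian with mean 1/2 and variance 1/(12N)
    print(N, means.mean(), means.var(), 1.0 / (12.0 * N))
```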


The CLT and the Gaussian Distribution
[Figure: histograms over 10000 samples of $\frac{1}{N}\sum_{j=1}^{N} x_{ij}$, where $x_{ij} \sim \mathrm{Beta}(1,5)$, for $N = 1, 5, 10$. Run centralLimitDemo from PMTK.]
The CLT and the Gaussian Distribution
One consequence of this result is that the binomial distribution, which is a distribution over $m$ defined by the sum of $N$ observations of the random binary variable $x$, will tend to a Gaussian as $N \to \infty$.


Example of the Convolution of Gaussians
Consider 2 Gaussians $x_1 \sim N(\mu_1, \lambda_1^{-1})$, $x_2 \sim N(\mu_2, \lambda_2^{-1})$. We want to compute the entropy of the distribution of $x = x_1 + x_2$.

$p(x)$ can be computed from the convolution of the two Gaussians:

$p(x) = \int p(x\,|\,x_2)\,p(x_2)\,dx_2 = \int N(x\,|\,\mu_1 + x_2, \lambda_1^{-1})\, N(x_2\,|\,\mu_2, \lambda_2^{-1})\,dx_2$

We need to complete the square in the exponent in $x_2$:

$-\tfrac{\lambda_1}{2}(x - \mu_1 - x_2)^2 - \tfrac{\lambda_2}{2}(x_2 - \mu_2)^2 = -\tfrac{\lambda_1+\lambda_2}{2}\left(x_2 - \tfrac{\lambda_1(x-\mu_1)+\lambda_2\mu_2}{\lambda_1+\lambda_2}\right)^2 - \tfrac{1}{2}\,\tfrac{\lambda_1\lambda_2}{\lambda_1+\lambda_2}\,(x-\mu_1-\mu_2)^2$

The 1st term is integrated out and the precision of $x$ is:

$\lambda = \frac{\lambda_1\lambda_2}{\lambda_1+\lambda_2}$

Thus the entropy of $x$ is:

$H[x] = \tfrac{1}{2}\ln\left(2\pi e\,\sigma^2\right) = \tfrac{1}{2}\ln\!\left(2\pi e\,\frac{\lambda_1+\lambda_2}{\lambda_1\lambda_2}\right)$
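A quick numerical check of this result (a sketch of mine, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
lam1, lam2 = 2.0, 0.5                    # precisions of x1 and x2
mu1, mu2 = 1.0, -3.0

x = rng.normal(mu1, 1/np.sqrt(lam1), 100000) + rng.normal(mu2, 1/np.sqrt(lam2), 100000)

var_theory = 1/lam1 + 1/lam2             # = (lam1 + lam2)/(lam1*lam2)
H_theory = 0.5 * np.log(2*np.pi*np.e*var_theory)
print(x.var(), var_theory, H_theory)     # sample variance matches the theoretical value
```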
Maximum Likelihood for a Gaussian
Suppose that we have a data set of observations $D = (x_1, \ldots, x_N)^T$, representing $N$ observations of the scalar random variable $X$. The observations are drawn independently from a Gaussian distribution whose mean $\mu$ and variance $\sigma^2$ are unknown.

We would like to determine these parameters from the data set.

Data points that are drawn independently from the same distribution are said to be independent and identically distributed, which is often abbreviated to i.i.d.


Maximum Likelihood for a Gaussian
Because our data set $D$ is i.i.d., we can write the probability of the data set, given $\mu$ and $\sigma^2$, in the form

Likelihood function: $p(x\,|\,\mu,\sigma^2) = \prod_{i=1}^{N} N(x_i\,|\,\mu,\sigma^2)$

This is viewed as a function of $\mu$ and $\sigma^2$.


Max Likelihood for a Gaussian Distribution
Likelihood function: $p(x\,|\,\mu,\sigma^2) = \prod_{i=1}^{N} N(x_i\,|\,\mu,\sigma^2)$

One common criterion for determining the parameters in a probability distribution using an observed data set is to find the parameter values that maximize the likelihood function, i.e. maximizing the probability of the data given the parameters (contrast this with maximizing the probability of the parameters given the data).

We can equivalently maximize the log-likelihood:

$\max_{\mu,\sigma^2} \ln p(x\,|\,\mu,\sigma^2) = \max_{\mu,\sigma^2}\left[-\frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i-\mu)^2 - \frac{N}{2}\ln\sigma^2 - \frac{N}{2}\ln(2\pi)\right]$

$\mu_{ML} = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad \sigma^2_{ML} = \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu_{ML})^2$
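For concreteness, a minimal NumPy sketch of these two estimators (my own illustration):

```python
import numpy as np

def gaussian_mle(x):
    """ML estimates for a univariate Gaussian: sample mean and (biased) sample variance."""
    mu_ml = x.mean()
    var_ml = np.mean((x - mu_ml) ** 2)     # divides by N, not N-1
    return mu_ml, var_ml

rng = np.random.default_rng(0)
x = rng.normal(2.0, 3.0, size=1000)
print(gaussian_mle(x))                      # close to (2.0, 9.0)
```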


Maximum Likelihood for a Gaussian Distribution
$\mu_{ML} = \frac{1}{N}\sum_{i=1}^{N} x_i$ (sample mean), $\qquad \sigma^2_{ML} = \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu_{ML})^2$ (sample variance wrt the ML mean, not the exact mean)

The MLE underestimates the variance (bias due to overfitting) because $\mu_{ML}$ fitted some of the noise in the data.

The maximum likelihood solutions $\mu_{ML}$, $\sigma^2_{ML}$ are functions of the data set values $x_1, \ldots, x_N$. Consider the expectations of these quantities with respect to the data set values, which come from a Gaussian.

Using the equations above you can show that:

$\mathbb{E}[\mu_{ML}] = \mu, \qquad \mathbb{E}\left[\sigma^2_{ML}\right] = \frac{N-1}{N}\,\sigma^2$

In this derivation you need to use: $\mathbb{E}[x_i x_j] = \mu^2$ for $i \ne j$, and $\mathbb{E}[x_i^2] = \sigma^2 + \mu^2$.


Maximum Likelihood for a Gaussian Distribution
We use: $\mathbb{E}[x_i x_j] = \mu^2$ for $i \ne j$, and $\mathbb{E}[x_i^2] = \sigma^2 + \mu^2$.

$\mathbb{E}\left[\sigma^2_{ML}\right] = \mathbb{E}\left[\frac{1}{N}\sum_{n=1}^{N}(x_n-\mu_{ML})^2\right] = \mathbb{E}\left[\frac{1}{N}\sum_{n=1}^{N}\left(x_n - \frac{1}{N}\sum_{m=1}^{N} x_m\right)^2\right]$

$= \mathbb{E}\left[\frac{1}{N}\sum_{n=1}^{N} x_n^2 - \frac{2}{N^2}\sum_{n=1}^{N} x_n \sum_{m=1}^{N} x_m + \frac{1}{N^2}\sum_{m=1}^{N}\sum_{l=1}^{N} x_m x_l\right]$

$= (\mu^2+\sigma^2) - \frac{2}{N^2}\left[N(\mu^2+\sigma^2) + N(N-1)\mu^2\right] + \frac{1}{N^2}\left[N(\mu^2+\sigma^2) + N(N-1)\mu^2\right]$

$= (\mu^2+\sigma^2) - \frac{1}{N^2}\left[N(\mu^2+\sigma^2) + N(N-1)\mu^2\right] = \frac{N-1}{N}\,\sigma^2$


Maximum Likelihood for a Gaussian Distribution
$\mathbb{E}[\mu_{ML}] = \mu, \qquad \mathbb{E}\left[\sigma^2_{ML}\right] = \frac{N-1}{N}\,\sigma^2$

On average the MLE estimate obtains the correct mean but will underestimate the true variance by a factor $(N-1)/N$.

An unbiased estimate of the variance is given as:

$\tilde\sigma^2 = \frac{N}{N-1}\,\sigma^2_{ML} = \frac{1}{N-1}\sum_{i=1}^{N}(x_i-\mu_{ML})^2$

For large $N$, the bias is not a problem.

This result can also be obtained from a Bayesian treatment in which we marginalize over the unknown mean.

The $N-1$ factor takes into account the fact that 1 degree of freedom has been used in fitting the mean, and removes the bias of the MLE.
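A short simulation illustrating this bias (my own sketch; the numbers are what the formula predicts, up to Monte Carlo error):

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials, true_var = 5, 200000, 1.0

x = rng.normal(0.0, np.sqrt(true_var), size=(trials, N))
var_ml = ((x - x.mean(axis=1, keepdims=True)) ** 2).mean(axis=1)    # divide by N
var_unbiased = var_ml * N / (N - 1)                                  # divide by N-1

print(var_ml.mean(), (N - 1) / N * true_var)   # biased estimator: about 0.8
print(var_unbiased.mean())                      # unbiased estimator: about 1.0
```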
MLE for the Multivariate Gaussian
We can easily generalize the earlier results to a multivariate Gaussian. The log-likelihood takes the form:

$\ln p(X\,|\,\mu,\Sigma) = -\frac{ND}{2}\ln 2\pi - \frac{N}{2}\ln|\Sigma| - \frac{1}{2}\sum_{n=1}^{N}(x_n-\mu)^T\Sigma^{-1}(x_n-\mu)$

Setting the derivatives wrt $\mu$ and $\Sigma$ equal to zero gives the following:

$\mu_{ML} = \frac{1}{N}\sum_{n=1}^{N} x_n, \qquad \Sigma_{ML} = \frac{1}{N}\sum_{n=1}^{N}(x_n-\mu_{ML})(x_n-\mu_{ML})^T$

We provide a proof of the calculation of $\Sigma_{ML}$ next.


MLE for the Multivariate Gaussian

$\ln p(X\,|\,\mu,\Sigma) = -\frac{ND}{2}\ln 2\pi - \frac{N}{2}\ln|\Sigma| - \frac{1}{2}\sum_{n=1}^{N}(x_n-\mu)^T\Sigma^{-1}(x_n-\mu)$

We differentiate the log-likelihood wrt $\Sigma^{-1}$. Each contributing term is:

$\frac{\partial}{\partial\Sigma^{-1}}\left[-\frac{N}{2}\ln|\Sigma|\right] = \frac{\partial}{\partial\Sigma^{-1}}\left[\frac{N}{2}\ln|\Sigma^{-1}|\right] = \frac{N}{2}\Sigma^T = \frac{N}{2}\Sigma$ (a useful trick!)

$\frac{\partial}{\partial\Sigma^{-1}}\left[-\frac{1}{2}\sum_{n=1}^{N}(x_n-\mu)^T\Sigma^{-1}(x_n-\mu)\right] = -\frac{1}{2}\frac{\partial}{\partial\Sigma^{-1}}\mathrm{Tr}\left[\Sigma^{-1}\sum_{n=1}^{N}(x_n-\mu)(x_n-\mu)^T\right] = -\frac{N}{2}S^T = -\frac{N}{2}S$ ($S$ symmetric)

where $S = \frac{1}{N}\sum_{n=1}^{N}(x_n-\mu)(x_n-\mu)^T$. Setting the sum of the two terms to zero, we finally obtain $\Sigma_{ML} = S$.

Here we used: $\frac{\partial}{\partial A}\mathrm{Tr}(AB) = B^T$, $\quad\frac{\partial}{\partial A}\ln|A| = (A^{-1})^T$, $\quad|A^{-1}| = |A|^{-1}$, $\quad\mathrm{tr}(AB) = \mathrm{tr}(BA)$.
Appendix: Some Useful Matrix Operations
Show that $\frac{\partial}{\partial A}\mathrm{Tr}(AB) = B^T$.

Indeed:

$\frac{\partial}{\partial A_{mn}}\mathrm{Tr}(AB) = \frac{\partial}{\partial A_{mn}}\sum_{i,k} A_{ik}B_{ki} = B_{nm}, \qquad\text{so}\qquad \frac{\partial}{\partial A}\mathrm{Tr}(AB) = B^T$

Show that $\frac{\partial}{\partial A}\ln|A| = (A^{-1})^T$.

Using the cofactor expansion of the determinant, $|A| = \sum_j (-1)^{i+j}A_{ij}M_{ij}$:

$\frac{\partial}{\partial A_{mn}}\ln|A| = \frac{1}{|A|}\frac{\partial|A|}{\partial A_{mn}} = \frac{(-1)^{m+n}M_{mn}}{|A|} = \left(A^{-1}\right)_{nm}$

where in the last step we used Cramer's rule.


MLE for a Multivariate Gaussian
$\mu_{ML} = \frac{1}{N}\sum_{n=1}^{N} x_n = \bar x, \qquad \Sigma_{ML} = \frac{1}{N}\sum_{n=1}^{N}(x_n-\mu_{ML})(x_n-\mu_{ML})^T = \frac{1}{N}\sum_{n=1}^{N} x_n x_n^T - \bar x\,\bar x^T$

Note that the unconstrained maximization of the log-likelihood gives a symmetric $\Sigma_{ML}$.

As for the univariate case, we can define an unbiased covariance as:

$\tilde\Sigma = \frac{1}{N-1}\sum_{n=1}^{N}(x_n-\mu_{ML})(x_n-\mu_{ML})^T, \qquad \mathbb{E}[\tilde\Sigma] = \Sigma$

To prove this, you will need to use that:

$\mathbb{E}\left[x_n x_m^T\right] = \mu\mu^T + \delta_{mn}\Sigma$
Sequential MLE Estimation for Gaussians
Often we are interested in computing sequentially an estimate of $\mu_{ML}$ as more data arrive. This can easily be done:

$\mu_{ML}^{(N)} = \frac{1}{N}\sum_{n=1}^{N} x_n = \frac{x_N}{N} + \frac{N-1}{N}\cdot\frac{1}{N-1}\sum_{n=1}^{N-1} x_n = \frac{x_N}{N} + \frac{N-1}{N}\,\mu_{ML}^{(N-1)} = \mu_{ML}^{(N-1)} + \frac{1}{N}\left(x_N - \mu_{ML}^{(N-1)}\right)$

Here $1/N$ plays the role of a learning rate and $x_N - \mu_{ML}^{(N-1)}$ of an error signal.

This sequential approach cannot easily be generalized to other cases (non-Gaussians, etc.).
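A minimal sketch of this recursion (variable names are mine; illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(5.0, 2.0, size=1000)

mu = 0.0
for n, x in enumerate(data, start=1):
    mu = mu + (x - mu) / n           # mu^(N) = mu^(N-1) + (x_N - mu^(N-1))/N

print(mu, data.mean())               # the recursion reproduces the batch sample mean
```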
Robbins-Monro Algorithm
A more powerful approach to computing sequentially the MLE estimates is via the Robbins-Monro algorithm.

We review the algorithm by considering the calculation of the zero of a regression function.*

Consider the joint distribution $p(z,\theta)$ of two random variables and define the regression function as:

$f(\theta) \equiv \mathbb{E}[z\,|\,\theta] = \int z\,p(z\,|\,\theta)\,dz$

Assume we are given samples from $p(z,\theta)$ one at a time.

* Effectively, we don't know the regression function $f(\theta)$, but we have data from a noisy version $z$ of it. We take the regression function to be the expectation $\mathbb{E}[z\,|\,\theta]$.

Robbins, H. and S. Monro (1951). A stochastic approximation method. Annals of Mathematical Statistics 22, 400-407.
Fukunaga, K. (1990). Introduction to Statistical Pattern Recognition (Second ed.). Academic Press.
Robbins-Monro Algorithm
$f(\theta) \equiv \mathbb{E}[z\,|\,\theta] = \int z\,p(z\,|\,\theta)\,dz$

We want to find the root $f(\theta^*) = 0$ in a sequential manner. The Robbins-Monro algorithm proceeds as:

$\theta^{(N)} = \theta^{(N-1)} + a_{N-1}\, z\!\left(\theta^{(N-1)}\right)$

The learning coefficients $\{a_N\}$ should satisfy:

$\lim_{N\to\infty} a_N = 0, \qquad \sum_{n=1}^{\infty} a_n = \infty, \qquad \sum_{n=1}^{\infty} a_n^2 < \infty$


Robbins-Monro Algorithm
We can state the MLE calculation $\mu_{ML}$ for our Gaussian example as finding the root of a regression function:

$\frac{1}{N}\sum_{n=1}^{N}\frac{\partial}{\partial\mu}\ln p(x_n\,|\,\mu)\Big|_{\mu_{ML}} = 0 \;\xrightarrow{\ \text{CLT},\ N\to\infty\ }\; \mathbb{E}_x\!\left[\frac{\partial}{\partial\mu}\ln p(x\,|\,\mu)\right]\Big|_{\mu_{ML}} = 0$

so $\mu_{ML}$ is the root of the regression function $f(\mu) = \mathbb{E}_x[z\,|\,\mu]$.

In the context of the Robbins-Monro algorithm,

$z = \frac{\partial}{\partial\mu_{ML}}\ln p(x\,|\,\mu_{ML}) = \frac{x-\mu_{ML}}{\sigma^2}$; $z$ is Gaussian, and $f(\mu_{ML}) = \mathbb{E}[z\,|\,\mu_{ML}] = \frac{\mu-\mu_{ML}}{\sigma^2}$

The algorithm takes the form:

$\mu_{ML}^{(N)} = \mu_{ML}^{(N-1)} + a_{N-1}\,\frac{x_N-\mu_{ML}^{(N-1)}}{\sigma^2}$

Substituting $a_{N-1} = \sigma^2/N$ gives the estimate discussed earlier.
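A small Python sketch of the Robbins-Monro recursion for this Gaussian example (the variable names and the specific schedule are mine, chosen to match the derivation above):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, mu_true = 4.0, 3.0
data = rng.normal(mu_true, np.sqrt(sigma2), size=2000)

mu = 0.0
for n, x in enumerate(data, start=1):
    a = sigma2 / n                        # a_{N-1} = sigma^2 / N recovers the sequential MLE
    mu = mu + a * (x - mu) / sigma2       # theta^(N) = theta^(N-1) + a_{N-1} * z(theta^(N-1))

print(mu, data.mean())                    # converges to the sample mean
```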
Robbins-Monro Algorithm
A graphical interpretation of the algorithm is shown in the corresponding figure: $p(z\,|\,\mu_{ML})$ is Gaussian, the regression function is $f(\mu_{ML}) = \mathbb{E}[z\,|\,\mu_{ML}] = \frac{\mu-\mu_{ML}}{\sigma^2}$, and the update $\mu_{ML}^{(N)} = \mu_{ML}^{(N-1)} + a_{N-1}\frac{x_N-\mu_{ML}^{(N-1)}}{\sigma^2}$ drives the estimate toward the zero of $f$.

The Robbins-Monro algorithm computes the zero of the regression function.

Blum, J. A. (1965). Multidimensional stochastic approximation methods. Annals of Mathematical Statistics 25, 737-744.
Sequential MLE Estimation for Gaussians
Let us now repeat the same calculation but for the MLE estimate of $\sigma^2$ (with the mean $\mu$ treated as known):

$\sigma^2_{(N)} = \frac{1}{N}\sum_{n=1}^{N}(x_n-\mu)^2 = \frac{(x_N-\mu)^2}{N} + \frac{N-1}{N}\,\sigma^2_{(N-1)} = \sigma^2_{(N-1)} + \frac{1}{N}\left[(x_N-\mu)^2 - \sigma^2_{(N-1)}\right]$

If we substitute the expression for the Gaussian likelihood into the Robbins-Monro procedure for maximizing the likelihood:

$\sigma^2_{(N)} = \sigma^2_{(N-1)} + a_{N-1}\,\frac{\partial}{\partial\sigma^2_{(N-1)}}\ln p\!\left(x_N\,|\,\mu,\sigma^2_{(N-1)}\right) = \sigma^2_{(N-1)} + a_{N-1}\left[\frac{(x_N-\mu)^2}{2\sigma^4_{(N-1)}} - \frac{1}{2\sigma^2_{(N-1)}}\right]$

The 2 formulas are identical for $a_{N-1} = \frac{2\sigma^4_{(N-1)}}{N}$.


Sequential MLE: Multivariate Gaussian
To simplify things, assume that $\mu = \mu_{ML}$ is known, and thus:

$\Sigma_{ML}^{(N)} = \frac{1}{N}\sum_{n=1}^{N}(x_n-\mu)(x_n-\mu)^T$

From this equation we can derive:

$\Sigma_{ML}^{(N)} = \Sigma_{ML}^{(N-1)} + \frac{1}{N}\left[(x_N-\mu)(x_N-\mu)^T - \Sigma_{ML}^{(N-1)}\right]$

To apply the Robbins-Monro algorithm, assume that $\Sigma$ is diagonal and as before compute the derivative:

$\frac{\partial}{\partial\Sigma_{ML}^{(N-1)}}\ln p\!\left(x_N\,|\,\mu,\Sigma_{ML}^{(N-1)}\right) = \frac{1}{2}\left(\Sigma_{ML}^{(N-1)}\right)^{-2}(x_N-\mu)(x_N-\mu)^T - \frac{1}{2}\left(\Sigma_{ML}^{(N-1)}\right)^{-1}$

Substituting into the RM algorithm:

$\Sigma_{ML}^{(N)} = \Sigma_{ML}^{(N-1)} + A_{N-1}\left[\frac{1}{2}\left(\Sigma_{ML}^{(N-1)}\right)^{-2}(x_N-\mu)(x_N-\mu)^T - \frac{1}{2}\left(\Sigma_{ML}^{(N-1)}\right)^{-1}\right]$

Thus from the RM algorithm we obtain the exact update by selecting $A_{N-1} = \frac{2}{N}\left(\Sigma_{ML}^{(N-1)}\right)^2$.
Bayesian Inference for the Gaussian: Known Variance

Consider $X_1\,|\,\mu \sim N(\mu,\sigma^2)$, with prior $\mu \sim N(\mu_0,\sigma_0^2)$. We want to infer $\mu$ with the variance $\sigma^2$ taken as known. The case with multiple data points will be considered later on.

Then we can derive the following:

$\pi(\mu\,|\,x_1) \propto f(x_1\,|\,\mu)\,\pi(\mu) \propto \exp\left[-\frac{(x_1-\mu)^2}{2\sigma^2} - \frac{(\mu-\mu_0)^2}{2\sigma_0^2}\right]$

$\pi(\mu\,|\,x_1) \propto \exp\left[-\frac{\mu^2}{2}\left(\frac{1}{\sigma^2}+\frac{1}{\sigma_0^2}\right) + \mu\left(\frac{x_1}{\sigma^2}+\frac{\mu_0}{\sigma_0^2}\right)\right] \propto \exp\left[-\frac{(\mu-\mu_1)^2}{2\sigma_1^2}\right]$

$\mu\,|\,x_1 \sim N(\mu_1,\sigma_1^2)$ with

$\frac{1}{\sigma_1^2} = \frac{1}{\sigma_0^2}+\frac{1}{\sigma^2}, \quad\text{i.e.}\quad \sigma_1^2 = \frac{\sigma_0^2\sigma^2}{\sigma_0^2+\sigma^2}, \qquad\text{and}\qquad \mu_1 = \sigma_1^2\left(\frac{x_1}{\sigma^2}+\frac{\mu_0}{\sigma_0^2}\right)$


Bayesian Inference: Predictive distribution
To predict the distribution of a new observation $X\,|\,\mu \sim N(\mu,\sigma^2)$ in light of $x_1$, we use the predictive distribution as follows:

$f(x\,|\,x_1) = \int \underbrace{f(x\,|\,\mu)}_{\text{likelihood}}\;\underbrace{\pi(\mu\,|\,x_1)}_{\text{posterior}}\,d\mu \propto \int e^{-\frac{(x-\mu)^2}{2\sigma^2}}\, e^{-\frac{(\mu-\mu_1)^2}{2\sigma_1^2}}\,d\mu$

We can complete the square by treating the integrand above as a bivariate Gaussian in $(x,\mu)$. One can verify that:

$-\frac{1}{2}\left[\frac{(x-\mu)^2}{\sigma^2} + \frac{(\mu-\mu_1)^2}{\sigma_1^2}\right] = -\frac{1}{2}\begin{pmatrix} x-\mu_1 & \mu-\mu_1 \end{pmatrix}\Sigma^{-1}\begin{pmatrix} x-\mu_1 \\ \mu-\mu_1 \end{pmatrix} + \text{const.}, \qquad \Sigma^{-1} = \begin{pmatrix} \frac{1}{\sigma^2} & -\frac{1}{\sigma^2} \\ -\frac{1}{\sigma^2} & \frac{1}{\sigma^2}+\frac{1}{\sigma_1^2} \end{pmatrix}$

From the above expression note that: $\Sigma = \begin{pmatrix} \sigma^2+\sigma_1^2 & \sigma_1^2 \\ \sigma_1^2 & \sigma_1^2 \end{pmatrix}$


Bayesian Inference: Predictive distribution
We will see in a follow-up lecture that if we partition the mean and covariance of a multivariate Gaussian as

$x = \begin{pmatrix} x_a \\ x_b \end{pmatrix}, \qquad \mu = \begin{pmatrix} \mu_a \\ \mu_b \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix}$

then the marginal is

$p(x_a) = N(x_a\,|\,\mu_a,\Sigma_{aa})$

In our predictive distribution we need to integrate out $\mu$. Thus, based on the above result and the mean $(\mu_1,\mu_1)$ and covariance $\Sigma = \begin{pmatrix} \sigma^2+\sigma_1^2 & \sigma_1^2 \\ \sigma_1^2 & \sigma_1^2 \end{pmatrix}$, we have:

$f(x\,|\,x_1) = \int \underbrace{f(x\,|\,\mu)}_{\text{likelihood}}\;\underbrace{\pi(\mu\,|\,x_1)}_{\text{posterior}}\,d\mu = N\!\left(x\,|\,\mu_1,\sigma^2+\sigma_1^2\right)$

Note that the variance is the sum of the model variance plus the variance of the posterior uncertainty in $\mu$.


Bayesian Inference for the Gaussian
Consider $X = (x_1, x_2, \ldots, x_N) \sim N(\mu,\sigma^2)$ i.i.d., with prior $\mu \sim N(\mu_0,\sigma_0^2)$.

The likelihood takes the form:

$p(X\,|\,\mu) = \prod_{n=1}^{N} f(x_n\,|\,\mu) = \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\left[-\frac{\sum_{n=1}^{N}(x_n-\mu)^2}{2\sigma^2}\right]$

Note that in terms of $\mu$ this is not a probability density and is not normalized. Introducing the conjugate (Gaussian) prior on $\mu$ leads to:

$\pi(\mu\,|\,X) \propto \prod_{n=1}^{N} f(x_n\,|\,\mu)\,\pi(\mu) \propto \exp\left[-\frac{\sum_{n=1}^{N}(x_n-\mu)^2}{2\sigma^2} - \frac{(\mu-\mu_0)^2}{2\sigma_0^2}\right]$

$\pi(\mu\,|\,X) \propto \exp\left[-\frac{\mu^2}{2}\left(\frac{N}{\sigma^2}+\frac{1}{\sigma_0^2}\right) + \mu\left(\frac{\sum_{n=1}^{N} x_n}{\sigma^2}+\frac{\mu_0}{\sigma_0^2}\right)\right] \propto \exp\left[-\frac{(\mu-\mu_N)^2}{2\sigma_N^2}\right]$
Bayesian Inference for the Gaussian
So the posterior is a Gaussian as before, with $\mu\,|\,X \sim N(\mu_N,\sigma_N^2)$, where

$\frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2}+\frac{N}{\sigma^2}, \quad\text{i.e.}\quad \sigma_N^2 = \frac{\sigma_0^2\sigma^2}{N\sigma_0^2+\sigma^2}$

$\mu_N = \sigma_N^2\left(\frac{\sum_{n=1}^{N} x_n}{\sigma^2}+\frac{\mu_0}{\sigma_0^2}\right) = \frac{N\sigma_0^2}{N\sigma_0^2+\sigma^2}\,\mu_{ML} + \frac{\sigma^2}{N\sigma_0^2+\sigma^2}\,\mu_0$
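A minimal sketch of this conjugate update (a stand-in of mine for the MATLAB implementation referenced later):

```python
import numpy as np

def gaussian_posterior(x, sigma2, mu0, sigma02):
    """Posterior N(mu_N, sigma_N^2) over the mean, with the variance sigma2 known."""
    N = len(x)
    sigma_N2 = sigma02 * sigma2 / (N * sigma02 + sigma2)
    mu_ml = np.mean(x)
    mu_N = (N * sigma02 * mu_ml + sigma2 * mu0) / (N * sigma02 + sigma2)
    return mu_N, sigma_N2

rng = np.random.default_rng(0)
x = rng.normal(0.8, np.sqrt(0.1), size=10)
print(gaussian_posterior(x, sigma2=0.1, mu0=0.0, sigma02=0.1))
```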


Bayesian Inference for the Gaussian
$\mu\,|\,X \sim N(\mu_N,\sigma_N^2)$ with $\frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2}+\frac{N}{\sigma^2}$, $\sigma_N^2 = \frac{\sigma_0^2\sigma^2}{N\sigma_0^2+\sigma^2}$, and $\mu_N = \frac{N\sigma_0^2}{N\sigma_0^2+\sigma^2}\,\mu_{ML} + \frac{\sigma^2}{N\sigma_0^2+\sigma^2}\,\mu_0$

Observe the posterior mean for $N\to\infty$ ($\mu_N\to\mu_{ML}$) and for $N\to 0$ ($\mu_N\to\mu_0$).

The posterior precision is the sum of the precision of the prior plus one contribution of the data precision for each observed data point. As we have seen before, for $N\to\infty$ the posterior peaks around $\mu_{ML}$ and the posterior variance goes to zero, i.e. the point MLE estimate is recovered within the Bayesian paradigm for infinite data.

How about when $\sigma_0^2\to\infty$? In this case note that $\sigma_N^2\to\frac{\sigma^2}{N}$ and $\mu_N\to\mu_{ML}$.


Bayesian Inference for the Gaussian
$\mu\,|\,X \sim N(\mu_N,\sigma_N^2)$ with $\sigma_N^2 = \frac{\sigma_0^2\sigma^2}{N\sigma_0^2+\sigma^2}$, and $\mu_N = \frac{N\sigma_0^2}{N\sigma_0^2+\sigma^2}\,\mu_{ML} + \frac{\sigma^2}{N\sigma_0^2+\sigma^2}\,\mu_0$

[Figure: posterior density of $\mu$ for $N = 0$ (the prior), $N = 1$, $N = 2$, and $N = 10$ data points; MATLAB implementation.]

$X = (x_1, x_2, \ldots, x_N) \sim N(0.8, 0.1)$, with prior $\mu \sim N(0, 0.1)$.


Sequential Bayesian Inference
$\mu\,|\,X \sim N(\mu_N,\sigma_N^2)$ with $\frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2}+\frac{N}{\sigma^2}$, $\sigma_N^2 = \frac{\sigma_0^2\sigma^2}{N\sigma_0^2+\sigma^2}$, and $\mu_N = \frac{N\sigma_0^2}{N\sigma_0^2+\sigma^2}\,\mu_{ML} + \frac{\sigma^2}{N\sigma_0^2+\sigma^2}\,\mu_0$

We can easily derive sequential estimates of the posterior parameters. They are as follows:

$\frac{1}{\sigma_N^2} = \frac{1}{\sigma_{N-1}^2} + \frac{1}{\sigma^2}, \qquad \mu_N = \frac{\sigma_N^2}{\sigma_{N-1}^2}\,\mu_{N-1} + \frac{\sigma_N^2}{\sigma^2}\,x_N$
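A short sketch of these one-point-at-a-time updates (my own illustration; it reproduces the batch posterior above):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 0.1                             # known observation variance
mu_N, sigma_N2 = 0.0, 0.1                # prior N(0, 0.1)

for x in rng.normal(0.8, np.sqrt(sigma2), size=10):
    sigma_new2 = 1.0 / (1.0 / sigma_N2 + 1.0 / sigma2)     # precision adds per observation
    mu_N = sigma_new2 / sigma_N2 * mu_N + sigma_new2 / sigma2 * x
    sigma_N2 = sigma_new2

print(mu_N, sigma_N2)                    # posterior mean moves from 0 toward the sample mean
```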
