Gaussian ML Estimator
Email: [email protected]
URL: http://www.zabaras.com/
August 7, 2014
Contents: Bayesian Inference for the Gaussian with Known Variance; Bayesian Inference for the Gaussian with Known Mean; Bayesian Inference for the Gaussian with Unknown Mean and Variance
The first two moments of the Gaussian follow by direct integration:

$$\mathbb{E}[X] = \frac{1}{\sqrt{2\pi\sigma^{2}}}\int_{-\infty}^{\infty} \exp\!\left(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right) x\,dx = \mu, \qquad \mathbb{E}[X^{2}] = \mu^{2} + \sigma^{2}, \qquad \operatorname{var}[X] = \mathbb{E}[X^{2}] - \mathbb{E}[X]^{2} = \sigma^{2}.$$
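A quick Monte Carlo check of these moments (a minimal NumPy sketch, not part of the original slides; the sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
mu, sigma = 1.5, 0.8
x = rng.normal(mu, sigma, 1_000_000)
print(x.mean(), mu)        # E[X] = mu
print(x.var(), sigma**2)   # var[X] = sigma^2
```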
[Figure: the standard normal density N(x; 0, 1) plotted over x ∈ [−3, 3].]
The cumulative distribution of the Gaussian can be expressed through the standard normal CDF $\Phi$:

$$\int_{-\infty}^{x} \mathcal{N}(z \mid \mu, \sigma^{2})\,dz = \Phi(\tilde z; 0, 1), \qquad \tilde z = (x-\mu)/\sigma,$$

where

$$\Phi(z; 0, 1) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-t^{2}/2}\,dt = \frac{1}{2}\left[1 + \operatorname{erf}\!\left(z/\sqrt{2}\right)\right], \qquad \operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_{0}^{x} e^{-t^{2}}\,dt.$$
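The Φ–erf relation is easy to evaluate with the Python standard library; a minimal sketch (the helper name gauss_cdf is ours, not from the slides):

```python
import math

def gauss_cdf(x, mu=0.0, sigma=1.0):
    """Gaussian CDF via the error function: Phi(z) = 0.5*(1 + erf(z/sqrt(2)))."""
    z = (x - mu) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Sanity checks against known values of the standard normal CDF.
print(gauss_cdf(0.0))                      # 0.5 by symmetry
print(gauss_cdf(1.96))                     # ~0.975
print(gauss_cdf(2.0, mu=1.0, sigma=0.5))   # standardizes to Phi(2) ~ 0.977
```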
In the limit of vanishing variance the Gaussian approaches a Dirac delta centered at the mean:

$$\lim_{\sigma \to 0} \mathcal{N}(x \mid \mu, \sigma^{2}) = \delta(x - \mu).$$
[Figure: surface and contour plots of a two-dimensional Gaussian density (gaussPlot2DDemo from PMTK).]
[Figure: two-dimensional Gaussian densities with diagonal and spherical covariance matrices (surface and contour plots).]
The multivariate Gaussian can also be derived as the maximum-entropy density with fixed mean and covariance. Introduce a Lagrange multiplier λ for normalization, a vector m for the mean constraint, and a matrix L for the covariance constraint

$$\operatorname{Tr}\!\left[L\left(\int p(x)(x-\mu)(x-\mu)^{T}\,dx - \Sigma\right)\right].$$

Stationarity of the entropy functional then gives a density of the form

$$p(x) = \exp\!\left(-1 + \lambda + m^{T}x + (x-\mu)^{T}L\,(x-\mu)\right) = e^{-1+\lambda+\mu^{T}m-\frac{1}{4}m^{T}L^{-1}m}\, \exp\!\left(\left(x-\mu+\tfrac{1}{2}L^{-1}m\right)^{T} L \left(x-\mu+\tfrac{1}{2}L^{-1}m\right)\right).$$

Substituting $y = x - \mu + \tfrac{1}{2}L^{-1}m$ into the mean constraint $\int p(x)\,x\,dx = \mu$,

$$e^{-1+\lambda+\mu^{T}m-\frac{1}{4}m^{T}L^{-1}m} \int \left(y + \mu - \tfrac{1}{2}L^{-1}m\right) e^{y^{T}Ly}\,dy = \mu.$$

The 1st term drops from symmetry, the 2nd gives μ from normalization, thus we need to have

$$\tfrac{1}{2}L^{-1}m = 0 \quad\Rightarrow\quad m = 0.$$

The covariance constraint then reads, with $z = x - \mu$,

$$e^{-1+\lambda} \int z z^{T}\, e^{z^{T}Lz}\,dz = \Sigma.$$

Note that with $L = -\Sigma^{-1}/2$, the Gaussian integral gives

$$\int e^{z^{T}Lz}\, z z^{T}\,dz = \Sigma\,(2\pi)^{D/2}\,|\Sigma|^{1/2},$$

so that $e^{-1+\lambda} = (2\pi)^{-D/2}|\Sigma|^{-1/2}$ and $p(x) = \mathcal{N}(x \mid \mu, \Sigma)$.
The differential entropy of the multivariate Gaussian follows by direct computation:

$$H[x] = -\int \mathcal{N}(x\mid\mu,\Sigma)\ln\mathcal{N}(x\mid\mu,\Sigma)\,dx = \frac{1}{2}D\ln(2\pi) + \frac{1}{2}\ln|\Sigma| + \frac{1}{2}\int \mathcal{N}(x\mid\mu,\Sigma)\,\operatorname{tr}\!\left[(x-\mu)(x-\mu)^{T}\Sigma^{-1}\right]dx$$

$$= \frac{1}{2}D\ln(2\pi) + \frac{1}{2}\ln|\Sigma| + \frac{1}{2}\operatorname{tr}\!\left[\left(\int \mathcal{N}(x\mid\mu,\Sigma)\,(x-\mu)(x-\mu)^{T}\,dx\right)\Sigma^{-1}\right]$$

$$= \frac{1}{2}D\ln(2\pi) + \frac{1}{2}\ln|\Sigma| + \frac{1}{2}\operatorname{tr}\!\left[\Sigma\,\Sigma^{-1}\right] = \frac{1}{2}\left[D\ln(2\pi) + \ln|\Sigma| + D\right].$$
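A numerical check of the entropy formula (a small sketch assuming NumPy and SciPy are available; the frozen distribution's entropy() computes the same quantity):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_entropy(cov):
    """H = 0.5 * (D*ln(2*pi) + ln|Sigma| + D) for an MVN with covariance cov."""
    D = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (D * np.log(2 * np.pi) + logdet + D)

Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
print(gaussian_entropy(Sigma))
print(multivariate_normal(mean=np.zeros(2), cov=Sigma).entropy())  # should match
```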
To see that no other distribution with the same second moments has higher entropy, let $p(x) = \mathcal{N}(x\mid\mu,\Sigma)$ and let $q(x)$ be any density with the same mean and covariance, $\int q(x)(x-\mu)(x-\mu)^{T}dx = \Sigma$. Then:

$$0 \le KL(q\,\|\,p) = \int q(x)\ln\frac{q(x)}{p(x)}\,dx = -\int q(x)\ln p(x)\,dx - H[q].$$

Since $\ln p(x)$ is quadratic in $x$ and $p$ and $q$ share the same first and second moments,

$$\int q(x)\ln p(x)\,dx = \int p(x)\ln p(x)\,dx = -H[p],$$

so that $0 \le H[p] - H[q]$, i.e. $H[p] \ge H[q]$.
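To illustrate the maximum-entropy property numerically, the sketch below compares the Gaussian entropy against a uniform and a Laplace density whose variances are matched to σ²; the closed-form entropies used for the latter two are standard results, not from the slides:

```python
import numpy as np

# Differential entropies of distributions with matched variance sigma^2:
sigma = 1.7
h_gauss   = 0.5 * np.log(2 * np.pi * np.e * sigma**2)   # Gaussian
h_uniform = np.log(sigma * np.sqrt(12.0))               # uniform of width sigma*sqrt(12)
h_laplace = 1.0 + np.log(2.0 * sigma / np.sqrt(2.0))    # Laplace with scale b = sigma/sqrt(2)

print(h_gauss, h_uniform, h_laplace)  # the Gaussian entropy is the largest
```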
[Figure: histograms of the mean of N uniform random variables for N = 1, 5, 10; the distribution approaches a Gaussian as N grows (MATLAB code: centralLimitDemo from PMTK).]
The CLT and the Gaussian Distribution
One consequence of this result is that the binomial distribution, which is a distribution over m defined by the sum of N observations of the random binary variable x, will tend to a Gaussian as N → ∞.
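The slides point to the MATLAB centralLimitDemo from PMTK; below is a rough NumPy equivalent (our own sketch) that checks the limiting mean and standard deviation numerically rather than plotting:

```python
import numpy as np

rng = np.random.default_rng(0)
for N in (1, 5, 10):
    # Means of N iid Uniform(0,1) variables; 10,000 replications each.
    means = rng.random((10_000, N)).mean(axis=1)
    # CLT: sample mean -> 0.5, sample std -> sqrt(1/12) / sqrt(N)
    print(N, means.mean(), means.std(), np.sqrt(1.0 / (12.0 * N)))
```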
Consider the sum $x = x_1 + x_2$ of two independent Gaussians, $x_1 \sim \mathcal{N}(\mu_1, \lambda_1^{-1})$ and $x_2 \sim \mathcal{N}(\mu_2, \lambda_2^{-1})$. The joint density has exponent

$$-\frac{\lambda_1}{2}(x_1-\mu_1)^{2} - \frac{\lambda_2}{2}(x - x_1 - \mu_2)^{2}.$$

The 1st term is integrated out after completing the square in $x_1$, and the precision of x is:

$$\lambda = \frac{\lambda_1\lambda_2}{\lambda_1+\lambda_2}, \qquad \text{i.e.} \qquad \operatorname{var}[x] = \frac{1}{\lambda_1} + \frac{1}{\lambda_2}.$$

Thus the entropy of x is:

$$H[x] = \frac{1}{2}\ln\!\left(2\pi e\,\sigma^{2}\right) = \frac{1}{2}\ln\!\left(2\pi e\,\frac{\lambda_1+\lambda_2}{\lambda_1\lambda_2}\right).$$
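A quick sampling check of the variance-addition rule and the resulting entropy (our sketch; sample size arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
lam1, lam2, n = 2.0, 0.5, 200_000
x = rng.normal(0.0, 1/np.sqrt(lam1), n) + rng.normal(0.0, 1/np.sqrt(lam2), n)
print(x.var(), 1/lam1 + 1/lam2)                              # variances add
print(0.5 * np.log(2*np.pi*np.e * (lam1+lam2)/(lam1*lam2)))  # entropy of the sum
```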
Maximum Likelihood for a Gaussian
Suppose that we have a data set of observations D = (x1, . . . , xN)^T, representing N observations of the scalar random variable X. The observations are drawn independently from a Gaussian distribution whose mean μ and variance σ² are unknown.

$$\text{Likelihood function:} \quad p(D \mid \mu, \sigma^{2}) = \prod_{i=1}^{N} \mathcal{N}(x_i \mid \mu, \sigma^{2}).$$
Maximizing the log-likelihood gives

$$\mu_{ML} = \frac{1}{N}\sum_{n=1}^{N} x_n, \qquad \sigma^{2}_{ML} = \frac{1}{N}\sum_{n=1}^{N}\left(x_n - \mu_{ML}\right)^{2} = \frac{1}{N}\sum_{n=1}^{N}\left(x_n - \frac{1}{N}\sum_{m=1}^{N}x_m\right)^{2}.$$

Expanding the square,

$$\sigma^{2}_{ML} = \frac{1}{N}\sum_{n=1}^{N}x_n^{2} - \frac{2}{N^{2}}\sum_{n=1}^{N}x_n\sum_{m=1}^{N}x_m + \frac{1}{N^{2}}\sum_{m=1}^{N}\sum_{l=1}^{N}x_m x_l = \frac{1}{N}\sum_{n=1}^{N}x_n^{2} - \frac{1}{N^{2}}\sum_{m=1}^{N}\sum_{l=1}^{N}x_m x_l.$$

Taking expectations with $\mathbb{E}[x_m x_l] = \mu^{2} + \delta_{ml}\sigma^{2}$:

$$\mathbb{E}\!\left[\sigma^{2}_{ML}\right] = \left(\mu^{2}+\sigma^{2}\right) - \frac{1}{N^{2}}\left[N(N-1)\mu^{2} + N\left(\mu^{2}+\sigma^{2}\right)\right] = \frac{N-1}{N}\,\sigma^{2},$$

so the ML variance estimate is biased; multiplying by N/(N−1) gives the unbiased estimate.
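The bias result is easy to confirm by simulation; a small NumPy sketch (ours) averaging σ²_ML over many synthetic data sets:

```python
import numpy as np

rng = np.random.default_rng(2)
mu_true, sigma2_true, N = 3.0, 4.0, 5

# Draw 100,000 data sets of size N and average the ML variance estimate.
X = rng.normal(mu_true, np.sqrt(sigma2_true), size=(100_000, N))
s2_ml = ((X - X.mean(axis=1, keepdims=True)) ** 2).mean(axis=1)
print(s2_ml.mean())                  # ~ (N-1)/N * sigma^2 = 3.2
print((N - 1) / N * sigma2_true)     # 3.2
```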
For the multivariate case,

$$\ln p(D \mid \mu, \Sigma) = -\frac{ND}{2}\ln(2\pi) - \frac{N}{2}\ln|\Sigma| - \frac{1}{2}\sum_{n=1}^{N}(x_n-\mu)^{T}\Sigma^{-1}(x_n-\mu),$$

$$\mu_{ML} = \frac{1}{N}\sum_{n=1}^{N} x_n, \qquad \Sigma_{ML} = \frac{1}{N}\sum_{n=1}^{N}(x_n-\mu_{ML})(x_n-\mu_{ML})^{T}.$$

Here we used:

$$\frac{\partial \ln|A|}{\partial A} = \left(A^{-1}\right)^{T}, \qquad |A^{-1}| = |A|^{-1}, \qquad \operatorname{tr}(AB) = \operatorname{tr}(BA).$$
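A minimal sketch of these estimators in NumPy (the function name mvn_mle is ours):

```python
import numpy as np

def mvn_mle(X):
    """ML estimates for a multivariate Gaussian: sample mean and the
    (biased, 1/N) sample covariance, matching the formulas above."""
    N = X.shape[0]
    mu = X.mean(axis=0)
    Xc = X - mu
    Sigma = (Xc.T @ Xc) / N
    return mu, Sigma

rng = np.random.default_rng(3)
X = rng.multivariate_normal([1.0, -2.0], [[2.0, 0.5], [0.5, 1.0]], size=50_000)
mu_hat, Sigma_hat = mvn_mle(X)
print(mu_hat)     # ~ [1, -2]
print(Sigma_hat)  # ~ [[2, 0.5], [0.5, 1]]
```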
Appendix: Some Useful Matrix Operations
Show that

$$\frac{\partial}{\partial A}\operatorname{Tr}(AB) = B^{T} \qquad \text{and} \qquad \frac{\partial}{\partial A}\operatorname{Tr}(A^{T}B) = B.$$

Indeed,

$$\frac{\partial}{\partial A_{mn}}\operatorname{Tr}(AB) = \frac{\partial}{\partial A_{mn}}\sum_{i,k} A_{ik}B_{ki} = B_{nm} \quad\Rightarrow\quad \frac{\partial}{\partial A}\operatorname{Tr}(AB) = B^{T}.$$

Show that

$$\frac{\partial}{\partial A}\ln|A| = \left(A^{-1}\right)^{T}.$$
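Both identities can be verified by finite differences; a sketch (the helper num_grad is illustrative, and A is made symmetric positive definite so that ln|A| is well defined):

```python
import numpy as np

rng = np.random.default_rng(4)
A0 = rng.normal(size=(3, 3))
A = A0 @ A0.T + 3.0 * np.eye(3)   # SPD, so det(A) > 0
B = rng.normal(size=(3, 3))
eps = 1e-6

def num_grad(f, A):
    """Central finite-difference gradient of a scalar matrix function f at A."""
    G = np.zeros_like(A)
    for m in range(A.shape[0]):
        for n in range(A.shape[1]):
            E = np.zeros_like(A); E[m, n] = eps
            G[m, n] = (f(A + E) - f(A - E)) / (2 * eps)
    return G

print(np.allclose(num_grad(lambda M: np.trace(M @ B), A), B.T, atol=1e-5))
print(np.allclose(num_grad(lambda M: np.log(np.linalg.det(M)), A),
                  np.linalg.inv(A).T, atol=1e-5))
```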
The ML estimate of the mean can be computed sequentially. Isolating the contribution of the last data point,

$$\mu_{ML}^{(N)} = \frac{1}{N}\sum_{n=1}^{N}x_n = \frac{1}{N}x_N + \frac{N-1}{N}\,\mu_{ML}^{(N-1)} = \mu_{ML}^{(N-1)} + \frac{1}{N}\left(x_N - \mu_{ML}^{(N-1)}\right).$$
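The sequential update in code; a running-mean sketch (ours) that reproduces the batch mean exactly:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(2.0, 1.0, 1000)

mu = 0.0
for N, xN in enumerate(x, start=1):
    mu += (xN - mu) / N      # mu^(N) = mu^(N-1) + (x_N - mu^(N-1)) / N
print(mu, x.mean())          # identical up to floating point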
In updates of this form, the coefficient (1/N, or more generally $a_{N-1}$) plays the role of a learning rate and the bracketed difference $\left(x_N - \mu_{ML}^{(N-1)}\right)$ is the error signal.

* Effectively, we don't know the regression function f(θ), but we have data on a noisy version z of it. We take the regression function to be the expectation $\mathbb{E}[z \mid \theta]$.

Robbins, H. and S. Monro (1951). A stochastic approximation method. Annals of Mathematical Statistics 22, 400–407.
Fukunaga, K. (1990). Introduction to Statistical Pattern Recognition (Second ed.). Academic Press.
Robbins-Monro Algorithm
The regression function is defined as the conditional expectation

$$f(\theta) \equiv \mathbb{E}[z \mid \theta] = \int z\,p(z \mid \theta)\,dz,$$

and the Robbins-Monro iteration for finding a root of f is

$$\theta^{(N)} = \theta^{(N-1)} + a_{N-1}\,z\!\left(\theta^{(N-1)}\right).$$

For the Gaussian mean, take z to be the derivative of the log-likelihood of a single observation,

$$z = \frac{\partial}{\partial \mu_{ML}}\ln p\!\left(x \mid \mu_{ML}, \sigma^{2}\right) = \frac{x - \mu_{ML}}{\sigma^{2}},$$

so that the update becomes

$$\mu_{ML}^{(N)} = \mu_{ML}^{(N-1)} + a_{N-1}\,\frac{x_N - \mu_{ML}^{(N-1)}}{\sigma^{2}}.$$

Since $p(z \mid \mu_{ML})$ is a Gaussian, the regression function is

$$f(\mu_{ML}) = \mathbb{E}[z \mid \mu_{ML}] = \frac{\mu - \mu_{ML}}{\sigma^{2}},$$

which vanishes at the ML solution. Choosing $a_{N-1} = \sigma^{2}/N$ recovers the sequential formula above.
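A sketch (ours) of the Robbins-Monro iteration for the mean; with the choice a_{N−1} = σ²/N it collapses to the running mean, as noted above:

```python
import numpy as np

rng = np.random.default_rng(6)
mu_true, sigma2 = 2.0, 1.5
x = rng.normal(mu_true, np.sqrt(sigma2), 5000)

mu = 0.0
for N, xN in enumerate(x, start=1):
    a = sigma2 / N                 # a_{N-1} = sigma^2 / N
    mu += a * (xN - mu) / sigma2   # Robbins-Monro step with z = (x - mu)/sigma^2
print(mu, x.mean())                # agree exactly
```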
The variance can be treated in the same way. With the mean μ known, the sequential form of the ML estimate is

$$\sigma^{2}_{(N)} = \frac{1}{N}\sum_{n=1}^{N}(x_n-\mu)^{2} = \frac{N-1}{N}\,\sigma^{2}_{(N-1)} + \frac{1}{N}(x_N-\mu)^{2} = \sigma^{2}_{(N-1)} + \frac{1}{N}\left[(x_N-\mu)^{2} - \sigma^{2}_{(N-1)}\right].$$

If we substitute the expression for the Gaussian likelihood into the Robbins-Monro procedure for maximizing the likelihood:

$$\sigma^{2}_{(N)} = \sigma^{2}_{(N-1)} + a_{N-1}\,\frac{\partial}{\partial \sigma^{2}_{(N-1)}}\ln p\!\left(x_N \mid \mu, \sigma^{2}_{(N-1)}\right) = \sigma^{2}_{(N-1)} + a_{N-1}\left[-\frac{1}{2\sigma^{2}_{(N-1)}} + \frac{(x_N-\mu)^{2}}{2\sigma^{4}_{(N-1)}}\right].$$

The two coincide for the choice $a_{N-1} = 2\sigma^{4}_{(N-1)}/N$.
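The same check for the variance update (our sketch); starting from an arbitrary guess, the first Robbins-Monro step overwrites it and the iterate then tracks the batch estimate exactly:

```python
import numpy as np

rng = np.random.default_rng(7)
mu, sigma2_true = 0.0, 2.5
x = rng.normal(mu, np.sqrt(sigma2_true), 5000)

s2 = 1.0                                 # arbitrary initial guess for sigma^2
for N, xN in enumerate(x, start=1):
    a = 2.0 * s2**2 / N                  # a_{N-1} = 2 sigma^4_{(N-1)} / N
    s2 += a * (-1.0 / (2 * s2) + (xN - mu)**2 / (2 * s2**2))
print(s2, np.mean((x - mu)**2))          # Robbins-Monro vs batch ML estimate
```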
After observing a single data point x1, the posterior for the mean (with the variance σ² known) is

$$\mu \mid x_1 \sim \mathcal{N}(\mu_1, \sigma_1^{2}) \quad \text{with} \quad \frac{1}{\sigma_1^{2}} = \frac{1}{\sigma_0^{2}} + \frac{1}{\sigma^{2}}, \;\text{i.e.}\; \sigma_1^{2} = \frac{\sigma_0^{2}\,\sigma^{2}}{\sigma_0^{2}+\sigma^{2}}, \quad \text{and} \quad \mu_1 = \sigma_1^{2}\left(\frac{x_1}{\sigma^{2}} + \frac{\mu_0}{\sigma_0^{2}}\right).$$

From the above expression note that σ1² is smaller than both σ0² and σ², and that μ1 is a convex combination of the observation and the prior mean:

$$\mu_1 = \frac{\sigma_1^{2}}{\sigma^{2}}\,x_1 + \frac{\sigma_1^{2}}{\sigma_0^{2}}\,\mu_0, \qquad \frac{\sigma_1^{2}}{\sigma^{2}} + \frac{\sigma_1^{2}}{\sigma_0^{2}} = 1.$$

For the predictive distribution we have:

$$f(x \mid x_1) = \int f(x \mid \mu)\,\pi(\mu \mid x_1)\,d\mu = \mathcal{N}\!\left(x \mid \mu_1,\; \sigma^{2} + \sigma_1^{2}\right).$$
With N observations X = (x1, . . . , xN), combining the likelihood with the prior gives

$$\pi(\mu \mid X) \propto \exp\!\left(-\frac{1}{2\sigma^{2}}\sum_{n=1}^{N}(x_n-\mu)^{2}\right) \exp\!\left(-\frac{1}{2\sigma_0^{2}}(\mu-\mu_0)^{2}\right) \propto \exp\!\left(-\frac{1}{2\sigma_N^{2}}(\mu-\mu_N)^{2}\right),$$

where the first factor is the likelihood, the second the prior, and the result is again Gaussian (the posterior).
Bayesian Inference for the Gaussian
So the posterior is a Gaussian as before,

$$\mu \mid X \sim \mathcal{N}(\mu_N, \sigma_N^{2}) \quad \text{with} \quad \frac{1}{\sigma_N^{2}} = \frac{1}{\sigma_0^{2}} + \frac{N}{\sigma^{2}}, \;\text{i.e.}\; \sigma_N^{2} = \frac{\sigma_0^{2}\,\sigma^{2}}{N\sigma_0^{2}+\sigma^{2}},$$

and

$$\mu_N = \sigma_N^{2}\left(\frac{\sum_{n=1}^{N} x_n}{\sigma^{2}} + \frac{\mu_0}{\sigma_0^{2}}\right) = \frac{N\sigma_0^{2}}{N\sigma_0^{2}+\sigma^{2}}\,\mu_{ML} + \frac{\sigma^{2}}{N\sigma_0^{2}+\sigma^{2}}\,\mu_0.$$
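A sketch of the conjugate update (the helper name is ours); it shows μ_N sliding from the prior mean toward μ_ML, and σ_N² shrinking, as N grows:

```python
import numpy as np

def posterior_mean_known_variance(x, mu0, sigma02, sigma2):
    """Posterior N(mu_N, sigma_N^2) for the mean of a Gaussian with known
    variance sigma2, under the conjugate prior N(mu0, sigma02)."""
    N = len(x)
    sigmaN2 = sigma02 * sigma2 / (N * sigma02 + sigma2)
    muN = sigmaN2 * (np.sum(x) / sigma2 + mu0 / sigma02)
    return muN, sigmaN2

rng = np.random.default_rng(8)
x = rng.normal(1.0, 1.0, 10)
for N in (1, 2, 10):
    print(N, posterior_mean_known_variance(x[:N], mu0=0.0, sigma02=1.0, sigma2=1.0))
```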
2 2 2 2
4.5
3.5
3
N=10
2.5
2
N=2