
Lecture 5

Gaussian Models - Part 2

Luigi Freda

ALCOR Lab
DIAG
University of Rome ”La Sapienza”

December 20, 2016

Luigi Freda (”La Sapienza” University) Lecture 5 December 20, 2016 1 / 33


Outline

1 Inference in Jointly Gaussian Distributions


Statement of the Result
Interpolation of Noise-free Data

2 Linear Gaussian Systems


Statement of the Result
Inferring an Unknown Scalar from Noisy Measurements
Inferring an Unknown Vector from Noisy Measurements
Interpolating Noisy Measurements



Intro

once we are given a Gaussian joint distribution p(x1, x2), it is useful to be able to compute the marginals p(x1) and the conditionals p(x1|x2)
in the following slides we see how to compute these probability densities



Marginals and Conditionals

Theorem 1
(Marginals and conditionals for an MVN)
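Suppose $x = (x_1, x_2) \sim \mathcal{N}(x|\mu, \Sigma)$, i.e. x is jointly Gaussian with parameters

$$\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \quad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}, \quad \Lambda = \Sigma^{-1} = \begin{pmatrix} \Lambda_{11} & \Lambda_{12} \\ \Lambda_{21} & \Lambda_{22} \end{pmatrix}$$

then the marginals are given by

$$p(x_1) = \mathcal{N}(x_1|\mu_1, \Sigma_{11})$$

$$p(x_2) = \mathcal{N}(x_2|\mu_2, \Sigma_{22})$$

and the posterior conditional is given by

$$p(x_1|x_2) = \mathcal{N}(x_1|\mu_{1|2}, \Sigma_{1|2})$$

$$\mu_{1|2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2) = \mu_1 - \Lambda_{11}^{-1}\Lambda_{12}(x_2 - \mu_2)$$

$$\Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} = \Lambda_{11}^{-1}$$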



Marginals and Conditionals

from the previous theorem we have

$$p(x_1) = \mathcal{N}(x_1|\mu_1, \Sigma_{11})$$

$$p(x_2) = \mathcal{N}(x_2|\mu_2, \Sigma_{22})$$

$$p(x_1|x_2) = \mathcal{N}(x_1|\mu_{1|2}, \Sigma_{1|2})$$

the marginal and the conditional distributions are Gaussian
for the marginals, we just extract the rows and columns corresponding to x1 or x2
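
The formulas of theorem 1 can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function names `marginal` and `conditional` and the 3D numbers are assumptions made for the example, not part of the lecture.

```python
import numpy as np

def marginal(mu, Sigma, idx):
    """Marginal of N(mu, Sigma) over the components listed in idx."""
    idx = np.asarray(idx)
    return mu[idx], Sigma[np.ix_(idx, idx)]

def conditional(mu, Sigma, idx1, idx2, x2):
    """p(x1 | x2) for a joint Gaussian, using the covariance form of theorem 1."""
    idx1, idx2 = np.asarray(idx1), np.asarray(idx2)
    S11 = Sigma[np.ix_(idx1, idx1)]
    S12 = Sigma[np.ix_(idx1, idx2)]
    S22 = Sigma[np.ix_(idx2, idx2)]
    # Solve linear systems instead of forming the explicit inverse of S22.
    mu_c = mu[idx1] + S12 @ np.linalg.solve(S22, x2 - mu[idx2])
    Sigma_c = S11 - S12 @ np.linalg.solve(S22, S12.T)
    return mu_c, Sigma_c

# Hypothetical 3D example (numbers chosen only for illustration).
mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
mu_c, Sigma_c = conditional(mu, Sigma, [0], [1, 2], np.array([1.5, 0.0]))

# Cross-check against the information form: Sigma_{1|2} = Lambda_11^{-1}.
Lam = np.linalg.inv(Sigma)
Sigma_c_info = np.linalg.inv(Lam[:1, :1])
```

Both forms of the conditional covariance should agree up to rounding, which is a cheap sanity check when implementing the theorem.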



Marginals and Conditionals
Example with a 2D Gaussian

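consider a 2D example with

$$\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \quad \Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}$$

where $\rho = \frac{\mathrm{cov}[X_1, X_2]}{\sigma_1\sigma_2}$ is the correlation coefficient
the marginal p(x1) is a 1D Gaussian, obtained by projecting the joint distribution onto the x1 line

$$p(x_1) = \mathcal{N}(x_1|\mu_1, \sigma_1^2)$$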



Marginals and Conditionals
Example with a 2D Gaussian

suppose we observe X2 = x2; the conditional p(x1|x2) is obtained by slicing p(x1, x2) through the X2 = x2 line

$$p(x_1|x_2) = \mathcal{N}\left(x_1 \,\Big|\, \mu_1 + \frac{\rho\sigma_1\sigma_2}{\sigma_2^2}(x_2 - \mu_2),\; \sigma_1^2 - \frac{(\rho\sigma_1\sigma_2)^2}{\sigma_2^2}\right)$$

left: joint Gaussian distribution p(x1, x2) with a correlation coefficient of 0.8; we plot the 95% contour and the principal axes
center: the unconditional marginal p(x1)
right: the conditional p(x1|x2) = N(x1|0.8, 0.36), obtained by slicing p(x1, x2) at height x2 = 1
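
The numbers quoted for the conditional can be reproduced with a short check. The slide does not state the means and variances used for the figure, so the sketch below assumes zero means and unit variances, which is consistent with the quoted result N(x1|0.8, 0.36).

```python
# Assumed parameters (not stated on the slide): zero means, unit variances.
mu1, mu2 = 0.0, 0.0
sigma1, sigma2 = 1.0, 1.0
rho = 0.8   # correlation coefficient given in the figure description
x2 = 1.0    # observed value given in the figure description

cov12 = rho * sigma1 * sigma2
cond_mean = mu1 + cov12 / sigma2**2 * (x2 - mu2)
cond_var = sigma1**2 - cov12**2 / sigma2**2
print(cond_mean, cond_var)
```

Under these assumptions the conditional variance is smaller than the marginal variance, since observing x2 removes part of the uncertainty on x1.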


Interpolation of Noise-free Data
suppose we want to estimate a 1D function y = f(t), defined on the interval [0, T], starting from N observed points yi = f(ti)
we assume for now that the data is noise-free
hence we want to interpolate the data, i.e. fit a function that goes exactly through the data
question: how does the function behave in between the observed points?
the first step is to assume that the unknown function is smooth
we'll encode the smoothness in a prior



Interpolation of Noise-free Data
in order to encode the prior we start by discretizing the problem
we discretize the interval [0, T] into D equal subintervals such that

$$x_j = f(t_j), \quad t_j = j\Delta, \quad \Delta = \frac{T}{D}, \quad j \in \{1, ..., D\}$$

we can encode the smoothness prior by assuming

$$x_j = \frac{1}{2}(x_{j-1} + x_{j+1}) + \epsilon_j, \quad j \in \{2, ..., D-1\}$$

where $\epsilon_j$ is a Gaussian noise
we assume $\epsilon = [\epsilon_2, ..., \epsilon_{D-1}] \sim \mathcal{N}(0, \frac{1}{\lambda} I)$ where the precision $\lambda$ controls the degree of smoothness
the above equation can be restated in vector form as

$$Lx = \epsilon$$

where

$$L = \frac{1}{2}\begin{pmatrix} -1 & 2 & -1 & & & \\ & -1 & 2 & -1 & & \\ & & \ddots & \ddots & \ddots & \\ & & & -1 & 2 & -1 \end{pmatrix} \in \mathbb{R}^{(D-2) \times D}$$

is a second order finite difference matrix
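
As an illustration, the matrix L can be assembled directly from its definition. The helper name `second_difference_matrix` and the grid size are assumptions made for this sketch.

```python
import numpy as np

def second_difference_matrix(D):
    """Return the (D-2) x D matrix L with rows (1/2) * [-1, 2, -1]."""
    L = np.zeros((D - 2, D))
    rows = np.arange(D - 2)
    L[rows, rows] = -1.0
    L[rows, rows + 1] = 2.0
    L[rows, rows + 2] = -1.0
    return 0.5 * L

L = second_difference_matrix(6)
# A straight line x_j = a + b*j satisfies x_j = (x_{j-1} + x_{j+1}) / 2 exactly,
# so its residual eps = L x is zero.
line = 3.0 + 2.0 * np.arange(1, 7)
eps = L @ line
```

The zero residual on a straight line shows what the prior regards as perfectly smooth: any curvature in x produces a nonzero entry of Lx.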
Interpolation of Noise-free Data

given a vector x, the degree of smoothness can be represented by the norm $\|\epsilon\|$
a smoothness prior should give higher probabilities to vectors x which correspond to smaller $\|\epsilon\|$, hence

$$p(x) \propto \exp\left(-\frac{\lambda}{2}\|Lx\|_2^2\right)$$

where the factor $\lambda$ can be used to weigh the overall smoothness
the smoothness prior can be expressed by using a Gaussian distribution as

$$p(x) = \mathcal{N}(x|\mu_x, \Sigma_x) = \mathcal{N}(x|0, (\lambda L^T L)^{-1}) \propto \exp\left(-\frac{\lambda}{2}\|Lx\|_2^2\right)$$
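
The identification of the exponential with a zero-mean Gaussian can be made explicit by expanding the squared norm. This short derivation uses only the definitions already given:

$$-\frac{\lambda}{2}\|Lx\|_2^2 = -\frac{\lambda}{2}\, x^T L^T L\, x = -\frac{1}{2}(x - 0)^T (\lambda L^T L)(x - 0)$$

which has the form of a Gaussian exponent $-\frac{1}{2}(x - \mu_x)^T \Sigma_x^{-1} (x - \mu_x)$ with $\mu_x = 0$ and precision $\Sigma_x^{-1} = \lambda L^T L$.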



Interpolation of Noise-free Data

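smoothness prior

$$p(x) = \mathcal{N}(x|\mu_x, \Sigma_x) = \mathcal{N}(x|0, (\lambda L^T L)^{-1})$$

let's assume that we have used $\lambda$ to scale L so that we can ignore it
note that $\Lambda_x = L^T L \in \mathbb{R}^{D \times D}$ and, since $L \in \mathbb{R}^{(D-2) \times D}$ has full row rank, one has¹ $\mathrm{rank}(\Lambda_x) = D - 2$
hence $\Lambda_x = L^T L$ is singular and defines an improper prior known as an intrinsic Gaussian random field
however it is possible to show that if we observe N ≥ 2 points, the posterior will be proper

¹ recall that $\mathrm{rank}(L^T L) = \mathrm{rank}(L)$; in general $\mathrm{rank}(AB) \le \min(\mathrm{rank}(A), \mathrm{rank}(B))$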
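
The rank deficiency can be checked numerically. The sketch below uses a hypothetical grid size and also exhibits the two directions (constant and linear vectors) on which the prior places no penalty.

```python
import numpy as np

D = 8  # hypothetical grid size
L = np.zeros((D - 2, D))
for j in range(D - 2):
    L[j, j:j + 3] = [-0.5, 1.0, -0.5]   # (1/2) * [-1, 2, -1]

Lam = L.T @ L
rank = np.linalg.matrix_rank(Lam)

# Constant and linear vectors are annihilated by L, so they lie in the
# null space of Lam and the prior cannot be normalized along them.
const = np.ones(D)
ramp = np.arange(D, dtype=float)
```

Two observations are enough to pin down both null directions, which is in line with the statement that the posterior becomes proper for N ≥ 2.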
Interpolation of Noise-free Data

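now suppose that in our D discretized intervals we have N noise-free observations gathered in $x_2 \in \mathbb{R}^N$ and we want to compute the remaining D − N function values $x_1 \in \mathbb{R}^{D-N}$
we know that

$$p(x_1, x_2) = \mathcal{N}(x|\mu_x, \Sigma_x) = \mathcal{N}(x|0, (L^T L)^{-1})$$

we can partition $L = [L_1, L_2]$ where $L_1 \in \mathbb{R}^{(D-2) \times (D-N)}$ and $L_2 \in \mathbb{R}^{(D-2) \times N}$
one has

$$\Lambda = L^T L = \begin{pmatrix} \Lambda_{11} & \Lambda_{12} \\ \Lambda_{21} & \Lambda_{22} \end{pmatrix} = \begin{pmatrix} L_1^T L_1 & L_1^T L_2 \\ L_2^T L_1 & L_2^T L_2 \end{pmatrix}$$

by using theorem 1 one has

$$p(x_1|x_2) = \mathcal{N}(x_1|\mu_{1|2}, \Sigma_{1|2})$$

$$\mu_{1|2} = \mu_1 - \Lambda_{11}^{-1}\Lambda_{12}(x_2 - \mu_2) = -(L_1^T L_1)^{-1} L_1^T L_2\, x_2$$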
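
A minimal sketch of the noise-free interpolation follows. The grid size, the observed positions and the observed values are hypothetical; the posterior covariance uses the identity Σ1|2 = Λ11⁻¹ from theorem 1.

```python
import numpy as np

D = 20                                   # hypothetical grid size
obs_idx = np.array([0, 7, 13, 19])       # hypothetical observed positions (0-based)
x2 = np.array([1.0, -0.5, 0.8, 0.0])     # hypothetical noise-free observations
hid_idx = np.setdiff1d(np.arange(D), obs_idx)

# Second order finite difference matrix, rows (1/2) * [-1, 2, -1].
L = np.zeros((D - 2, D))
for j in range(D - 2):
    L[j, j:j + 3] = [-0.5, 1.0, -0.5]

L1, L2 = L[:, hid_idx], L[:, obs_idx]    # columns of hidden / observed values
Lam11 = L1.T @ L1
Lam12 = L1.T @ L2

mu_post = -np.linalg.solve(Lam11, Lam12 @ x2)   # mu_{1|2}
Sigma_post = np.linalg.inv(Lam11)               # Sigma_{1|2} = Lam11^{-1}

# Full interpolated curve: observed entries are reproduced exactly.
x_hat = np.empty(D)
x_hat[obs_idx] = x2
x_hat[hid_idx] = mu_post
```

Selecting columns of L by index plays the role of the partition L = [L1, L2], so the observed points do not need to be moved to the end of the vector.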



Interpolation of Noise-free Data

left: Gaussian prior with precision λ = 30
right: Gaussian prior with precision λ = 0.01
the posterior mean $\mu_{1|2}$ equals the observed data at the specified points and smoothly interpolates in between
the plots show the 95% pointwise marginal credibility intervals $\mu_j \pm 2\sqrt{\Sigma_{1|2,jj}}$
N.B.: the variance goes up as we move away from the data


Linear Gaussian System
Problem and Assumptions

problem
suppose we have two variables $x \in \mathbb{R}^{D_x}$ and $y \in \mathbb{R}^{D_y}$
y is a noisy observation of x
x is a hidden variable we want to estimate

assumptions
the prior is

$$p(x) = \mathcal{N}(x|\mu_x, \Sigma_x)$$

the likelihood is

$$p(y|x) = \mathcal{N}(y|Ax + b, \Sigma_{y|x})$$

where $A \in \mathbb{R}^{D_y \times D_x}$ and $b \in \mathbb{R}^{D_y}$ are known

N.B.: the above model is equivalent to assuming $y = Ax + b + \epsilon$ where $\epsilon$ is a noise characterized by the Gaussian distribution $\mathcal{N}(0, \Sigma_{y|x})$



Linear Gaussian System
Theorem

Theorem 2
(Bayes rule for linear Gaussian systems)
Given a linear Gaussian system, as the one described in the previous slide, the posterior p(x|y) is given by

$$p(x|y) = \mathcal{N}(x|\mu_{x|y}, \Sigma_{x|y})$$

$$\Sigma_{x|y}^{-1} = \Sigma_x^{-1} + A^T \Sigma_{y|x}^{-1} A$$

$$\mu_{x|y} = \Sigma_{x|y}\left[A^T \Sigma_{y|x}^{-1}(y - b) + \Sigma_x^{-1}\mu_x\right]$$

In addition the normalization constant p(y) is given by

$$p(y) = \mathcal{N}(y|A\mu_x + b, \Sigma_{y|x} + A\Sigma_x A^T)$$
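
The theorem translates almost literally into code. The sketch below is illustrative only: the function name `linear_gaussian_posterior` and all numerical values are assumptions. It cross-checks the result against conditioning the joint Gaussian of (x, y) with theorem 1.

```python
import numpy as np

def linear_gaussian_posterior(mu_x, Sigma_x, A, b, Sigma_y, y):
    """Posterior N(mu_post, Sigma_post) of x given y = A x + b + noise (theorem 2)."""
    Sx_inv = np.linalg.inv(Sigma_x)
    Sy_inv = np.linalg.inv(Sigma_y)
    Sigma_post = np.linalg.inv(Sx_inv + A.T @ Sy_inv @ A)
    mu_post = Sigma_post @ (A.T @ Sy_inv @ (y - b) + Sx_inv @ mu_x)
    return mu_post, Sigma_post

# Hypothetical system: 2D hidden state, 3D observation.
mu_x = np.array([0.0, 1.0])
Sigma_x = np.array([[1.0, 0.3], [0.3, 2.0]])
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b = np.array([0.5, 0.0, -0.5])
Sigma_y = 0.1 * np.eye(3)                # noise covariance Sigma_{y|x}
y = np.array([0.7, 1.2, 1.1])

mu_post, Sigma_post = linear_gaussian_posterior(mu_x, Sigma_x, A, b, Sigma_y, y)

# Cross-check via the joint Gaussian of (x, y) and theorem 1:
# cov[x, y] = Sigma_x A^T and cov[y] = Sigma_y + A Sigma_x A^T.
S_xy = Sigma_x @ A.T
S_yy = Sigma_y + A @ Sigma_x @ A.T
mu_check = mu_x + S_xy @ np.linalg.solve(S_yy, y - (A @ mu_x + b))
Sigma_check = Sigma_x - S_xy @ np.linalg.solve(S_yy, S_xy.T)
```

The explicit inverses keep the sketch close to the formulas; for larger problems one would typically prefer linear solves or a Cholesky factorization.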



Inferring an Unknown Scalar from Noisy Measurements
Problem

suppose we make N noisy measurements $y_i \in \mathbb{R}$ of some underlying quantity $x \in \mathbb{R}$, i.e.

$$y_i = x + \epsilon_i$$

where $\epsilon_i \sim \mathcal{N}(0, \lambda_y^{-1})$ and $\lambda_y = 1/\sigma^2$
the likelihood is

$$p(y_i|x) = \mathcal{N}(y_i|x, \lambda_y^{-1})$$

we assume a Gaussian prior

$$p(x) = \mathcal{N}(x|\mu_0, \lambda_0^{-1})$$

given $\mathcal{D} = \{y_1, ..., y_N\}$ we want to compute the posterior $p(x|\mathcal{D})$ by using a Bayesian approach



Inferring an Unknown Scalar from Noisy Measurements
Solution

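in order to use theorem 2, we can introduce a variable $y \triangleq [y_1, ..., y_N]^T \in \mathbb{R}^N$, a matrix $A = 1_N \in \mathbb{R}^{N \times 1}$ (a column vector of N ones) and $\Sigma_{y|x} = \lambda_y^{-1} I$
then we get the posterior

$$p(x|y) = \mathcal{N}(x|\mu_N, \lambda_N^{-1})$$

$$\lambda_N = \lambda_0 + N\lambda_y$$

$$\mu_N = \frac{\lambda_y \sum_i y_i + \lambda_0\mu_0}{\lambda_N} = \frac{N\lambda_y \bar{y} + \lambda_0\mu_0}{N\lambda_y + \lambda_0} = \frac{N\lambda_y}{N\lambda_y + \lambda_0}\bar{y} + \frac{\lambda_0}{N\lambda_y + \lambda_0}\mu_0$$

where $\bar{y} \triangleq \frac{1}{N}\sum_i y_i$
in this case the MLE of x is exactly $x_{MLE} = \bar{y}$ since

$$x_{MLE} = \arg\max_x p(\mathcal{D}|x) = \arg\max_x \prod_i p(y_i|x) = \arg\max_x \prod_i \mathcal{N}(y_i|x, \lambda_y^{-1}) = \bar{y}$$

the posterior mean $\mu_N$ is a convex combination of the MLE $\bar{y}$ and the prior mean $\mu_0$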
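
The scalar posterior can be sketched as follows. The function name `scalar_posterior`, the ground truth, the noise level and the prior parameters are all assumptions chosen for illustration.

```python
import numpy as np

def scalar_posterior(y, lam_y, mu0, lam0):
    """Posterior mean and precision of x given measurements y_i = x + noise."""
    N = len(y)
    lam_N = lam0 + N * lam_y
    mu_N = (lam_y * np.sum(y) + lam0 * mu0) / lam_N
    return mu_N, lam_N

rng = np.random.default_rng(0)
x_true, sigma = 2.0, 0.5                 # hypothetical ground truth and noise std
lam_y = 1.0 / sigma**2
y = x_true + sigma * rng.standard_normal(10)

mu_N, lam_N = scalar_posterior(y, lam_y, mu0=0.0, lam0=1.0)

# Convex-combination form: weight w on the MLE (sample mean), 1 - w on the prior mean.
w = len(y) * lam_y / (len(y) * lam_y + 1.0)
mu_N_convex = w * np.mean(y) + (1.0 - w) * 0.0
```

As N grows the weight w tends to 1, so the posterior mean moves from the prior mean towards the sample mean.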



Inferring an Unknown Scalar from Noisy Measurements

posterior

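$$p(x|y) = \mathcal{N}(x|\mu_N, \lambda_N^{-1})$$

$$\lambda_N = \lambda_0 + N\lambda_y$$

$$\mu_N = \frac{N\lambda_y}{N\lambda_y + \lambda_0}\bar{y} + \frac{\lambda_0}{N\lambda_y + \lambda_0}\mu_0$$

note that the posterior mean is written in terms of $N\lambda_y \bar{y}$
having N measurements each of precision $\lambda_y$ is equivalent to having one measurement $\bar{y}$ with precision $N\lambda_y$; this means

$$p(x|y, \lambda_y) = p(x|\bar{y}, N, \lambda_y)$$

in other words $(\bar{y}, N, \lambda_y)$ is a sufficient statistic for the problem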



Inferring an Unknown Scalar from Noisy Measurements
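Case with a single measurement

the procedure can be easily used for online estimation
let $\Sigma_0 \triangleq \lambda_0^{-1}$, $\Sigma_{y|x} \triangleq \lambda_y^{-1}$ and $\Sigma_i \triangleq \lambda_i^{-1}$
if we have a single measurement, i.e. N = 1, one has

$$p(x|y) = \mathcal{N}(x|\mu_1, \Sigma_1)$$

$$\Sigma_1 = \left(\frac{1}{\Sigma_0} + \frac{1}{\Sigma_{y|x}}\right)^{-1} = \frac{\Sigma_0 \Sigma_{y|x}}{\Sigma_0 + \Sigma_{y|x}}$$

$$\mu_1 = \Sigma_1\left(\frac{\mu_0}{\Sigma_0} + \frac{y}{\Sigma_{y|x}}\right) = \mu_0\frac{\Sigma_{y|x}}{\Sigma_0 + \Sigma_{y|x}} + y\frac{\Sigma_0}{\Sigma_0 + \Sigma_{y|x}}$$

where the posterior mean $\mu_1$ can be rewritten as

$$\mu_1 = \mu_0 + (y - \mu_0)\frac{\Sigma_0}{\Sigma_0 + \Sigma_{y|x}}$$

$$\mu_1 = y - (y - \mu_0)\frac{\Sigma_{y|x}}{\Sigma_0 + \Sigma_{y|x}}$$

the last equation is called shrinkage: the data is adjusted towards the prior mean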
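
The online use of the single-measurement formulas can be sketched as a loop in which each posterior becomes the prior for the next measurement. The function name `update` and the numbers are hypothetical; the comparison with the batch formulas for N measurements is included as a consistency check.

```python
import numpy as np

def update(mu, Sigma, y, Sigma_y):
    """One-measurement update: prior N(mu, Sigma) -> posterior after observing y."""
    gain = Sigma / (Sigma + Sigma_y)          # weight given to the new measurement
    mu_new = mu + gain * (y - mu)             # mu_1 = mu_0 + gain * (y - mu_0)
    Sigma_new = Sigma * Sigma_y / (Sigma + Sigma_y)
    return mu_new, Sigma_new

# Hypothetical measurements and parameters.
y = np.array([1.9, 2.3, 2.1, 1.7])
mu0, Sigma0, Sigma_y = 0.0, 1.0, 0.25

# Online: the posterior after each measurement becomes the prior for the next.
mu, Sigma = mu0, Sigma0
for yi in y:
    mu, Sigma = update(mu, Sigma, yi, Sigma_y)

# Batch result from the N-measurement formulas, for comparison.
lam0, lam_y, N = 1.0 / Sigma0, 1.0 / Sigma_y, len(y)
lam_N = lam0 + N * lam_y
mu_N = (lam_y * y.sum() + lam0 * mu0) / lam_N
```

Processing the measurements one at a time gives the same posterior as processing them all at once, which is what makes the recursive form convenient for streaming data.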



Inferring an Unknown Vector from Noisy Measurements
Problem

suppose we make N noisy measurements $y_i \in \mathbb{R}^D$ of some vector $x \in \mathbb{R}^D$, i.e.

$$y_i = x + \epsilon_i$$

where $\epsilon_i \sim \mathcal{N}(0, \Sigma_{y|x})$
the likelihood is

$$p(y_i|x) = \mathcal{N}(y_i|x, \Sigma_{y|x})$$

i.e. a linear Gaussian system where A = I and b = 0
we assume a Gaussian prior

$$p(x) = \mathcal{N}(x|\mu_0, \Sigma_0)$$

given $\mathcal{D} = \{y_1, ..., y_N\}$ we want to compute the posterior $p(x|\mathcal{D})$ by using a Bayesian approach



Inferring an Unknown Vector from Noisy Measurements
Solution

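in order to use theorem 2, we can introduce a variable $\tilde{y} \triangleq [y_1^T, ..., y_N^T]^T \in \mathbb{R}^{ND}$, a matrix

$$\tilde{A} \triangleq \begin{pmatrix} A \\ \vdots \\ A \end{pmatrix} = \begin{pmatrix} I \\ \vdots \\ I \end{pmatrix} \in \mathbb{R}^{ND \times D}$$

and the block-diagonal covariance $\Sigma_{\tilde{y}|x} = \mathrm{diag}(\Sigma_{y|x}, ..., \Sigma_{y|x})$
then we get the posterior

$$p(x|\tilde{y}) = \mathcal{N}(x|\mu_N, \Sigma_N)$$

$$\Sigma_N^{-1} = \Sigma_0^{-1} + N\Sigma_{y|x}^{-1}$$

$$\mu_N = \Sigma_N\left(\Sigma_{y|x}^{-1}(N\bar{y}) + \Sigma_0^{-1}\mu_0\right)$$

where $\bar{y} \triangleq \frac{1}{N}\sum_i y_i$
in this case the MLE of x is exactly $x_{MLE} = \bar{y}$
the expression of the posterior mean $\mu_N$ is very similar to the scalar case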
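
The stacking construction can be verified numerically: the closed-form posterior and the one obtained by applying theorem 2 to the stacked system should coincide. All sizes and values below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
D, N = 2, 5                                    # hypothetical dimension and sample count
x_true = np.array([0.5, 0.5])                  # hypothetical ground truth
Sigma_y = np.array([[0.2, 0.05], [0.05, 0.1]])
mu0, Sigma0 = np.zeros(D), np.eye(D)
Y = rng.multivariate_normal(x_true, Sigma_y, size=N)   # rows are the y_i

# Closed form from the slide.
Sy_inv, S0_inv = np.linalg.inv(Sigma_y), np.linalg.inv(Sigma0)
Sigma_N = np.linalg.inv(S0_inv + N * Sy_inv)
mu_N = Sigma_N @ (Sy_inv @ (N * Y.mean(axis=0)) + S0_inv @ mu0)

# Same posterior from theorem 2 with the stacked quantities.
A_t = np.tile(np.eye(D), (N, 1))               # (N*D) x D stack of identities
S_t_inv = np.kron(np.eye(N), Sy_inv)           # inverse of the block-diagonal covariance
y_t = Y.reshape(-1)                            # y_1, ..., y_N stacked into one vector
Sigma_stack = np.linalg.inv(S0_inv + A_t.T @ S_t_inv @ A_t)
mu_stack = Sigma_stack @ (A_t.T @ S_t_inv @ y_t + S0_inv @ mu0)
```

The closed form never builds the ND-dimensional objects, so it is the one to use in practice; the stacked version is shown only to confirm the derivation.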



Interpolating Noisy Measurements
Problem

assume we have N noisy observations $y_i \in \mathbb{R}$
each $y_i$ corresponds to a distinct linear combination of a vector $x \in \mathbb{R}^D$
for each $y_i$ we have a noise $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$
we can model this setup as a linear Gaussian system

$$y = Ax + \epsilon$$

where $y = [y_1, ..., y_N]^T \in \mathbb{R}^N$, $\epsilon = [\epsilon_1, ..., \epsilon_N]^T \in \mathbb{R}^N$, $\epsilon \sim \mathcal{N}(0, \Sigma_y)$ and $\Sigma_y = \sigma^2 I$
the matrix $A \in \mathbb{R}^{N \times D}$ is known and can be used for selecting out certain components, for instance if N = 2 and D = 4

$$A = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix}$$

we again assume a smoothness prior

$$p(x) = \mathcal{N}(x|\mu_x, \Sigma_x) = \mathcal{N}(x|0, (\lambda L^T L)^{-1})$$

where $\Lambda_x = L^T L$ defines an improper prior known as an intrinsic Gaussian random field
Interpolating Noisy Measurements
Solution

linear Gaussian system

$$y = Ax + \epsilon$$

smoothness prior

$$p(x) = \mathcal{N}(x|\mu_x, \Sigma_x) = \mathcal{N}(x|0, (\lambda L^T L)^{-1})$$

we can apply theorem 2 in order to compute the posterior p(x|y)
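
A minimal sketch of this computation is given below. It uses the precision form of theorem 2, in which the prior enters only through its precision λLᵀL, so the improper prior causes no difficulty as long as the posterior precision is invertible. The function name `interpolate_noisy`, the grid size, the data and the noise level are assumptions made for the example.

```python
import numpy as np

def interpolate_noisy(y, obs_idx, D, lam, sigma):
    """Posterior mean and covariance of x given y = A x + noise, smoothness prior."""
    L = np.zeros((D - 2, D))
    for j in range(D - 2):
        L[j, j:j + 3] = [-0.5, 1.0, -0.5]          # (1/2) * [-1, 2, -1]
    A = np.zeros((len(obs_idx), D))
    A[np.arange(len(obs_idx)), obs_idx] = 1.0      # selects the observed components
    # Precision form of theorem 2 with b = 0 and prior mean 0:
    # the prior precision lam * L^T L is singular, the sum below is not.
    post_prec = lam * (L.T @ L) + (A.T @ A) / sigma**2
    Sigma_post = np.linalg.inv(post_prec)
    mu_post = Sigma_post @ (A.T @ y) / sigma**2
    return mu_post, Sigma_post

# Hypothetical data: 5 noisy samples on a grid of 30 points.
obs_idx = np.array([2, 9, 15, 22, 28])
y = np.array([0.3, 1.1, 0.2, -0.8, 0.1])
mu_strong, S_strong = interpolate_noisy(y, obs_idx, D=30, lam=30.0, sigma=1.0)
mu_weak, S_weak = interpolate_noisy(y, obs_idx, D=30, lam=0.01, sigma=1.0)
```

Running the function with a large and a small λ gives two posteriors that can be compared in terms of both the mean curve and the pointwise variances on the diagonal of the covariance.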



Interpolating Noisy Measurements
Solution

left: interpolation by using λ = 30
strong prior (large λ) ⟹ smooth estimate and low uncertainty
right: interpolation by using λ = 0.01
weak prior (small λ) ⟹ wiggly estimate and high uncertainty
N.B.: the precision λ affects the posterior mean as well as the posterior variance
Interpolating Noisy Measurements
Solution

a MAP solution can be found by maximizing the posterior, i.e.

$$\hat{x}_{MAP} = \arg\max_x \log p(x|y) = \arg\max_x \left[\log p(y|x) + \log p(x)\right]$$

in the case A = I, we can equivalently solve the following optimization problem

$$\hat{x}_{MAP} = \arg\min_x \frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i - y_i)^2 + \frac{\lambda}{2}\sum_{j=1}^{D}\left[(x_j - x_{j-1})^2 + (x_j - x_{j+1})^2\right]$$

where we define $x_0 = x_1$ and $x_{D+1} = x_D$ for simplicity of notation
the previous equation is a discrete approximation to the following problem

$$\arg\min_f \frac{1}{2\sigma^2}\int (f(t) - y(t))^2\, dt + \frac{\lambda}{2}\int [f'(t)]^2\, dt$$

where $f'(t)$ is the first derivative of the function f
the first term measures the fit to the data and the second term penalizes functions that are too wiggly (Tikhonov regularization problem)
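
For the case A = I the discrete problem is quadratic, so its minimizer can be obtained by solving a linear system. The sketch below is one way to do this under the stated boundary convention; the function names `objective` and `map_estimate` and the data are hypothetical.

```python
import numpy as np

def objective(x, y, lam, sigma):
    """Discrete objective from the slide, with x_0 = x_1 and x_{D+1} = x_D."""
    xp = np.concatenate(([x[0]], x, [x[-1]]))      # padded vector x_0, ..., x_{D+1}
    fit = np.sum((x - y) ** 2) / (2.0 * sigma**2)
    rough = np.sum((xp[1:-1] - xp[:-2]) ** 2 + (xp[1:-1] - xp[2:]) ** 2)
    return fit + 0.5 * lam * rough

def map_estimate(y, lam, sigma):
    """Minimizer of the objective, from setting its gradient to zero."""
    D = len(y)
    G = np.diff(np.eye(D), axis=0)                 # (D-1) x D first differences
    # The padded sum counts every interior difference twice, so the penalty is
    # lam * ||G x||^2 and the normal equations are
    # (I / sigma^2 + 2 * lam * G^T G) x = y / sigma^2.
    H = np.eye(D) / sigma**2 + 2.0 * lam * (G.T @ G)
    return np.linalg.solve(H, y / sigma**2)

# Hypothetical noisy samples on a grid of 8 points (A = I, so N = D).
y = np.array([0.0, 0.9, 0.1, 1.2, 0.8, 1.9, 1.1, 2.0])
x_map = map_estimate(y, lam=1.0, sigma=0.5)
```

Increasing `lam` pulls neighbouring entries of `x_map` together, while decreasing it lets the estimate follow the data more closely, which mirrors the role of λ in the continuous problem.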



Credits

Kevin Murphy’s book
