Lecture 06
Gaussian Probability Distributions
Philipp Hennig
04 May 2021
Faculty of Science
Department of Computer Science
Chair for the Methods of Machine Learning
#   Date    Content                        Ex   |   #   Date    Content                        Ex
1   20.04.  Introduction                   1    |  14   09.06.  Logistic Regression            8
2   21.04.  Reasoning under Uncertainty         |  15   15.06.  Exponential Families
3   27.04.  Continuous Variables           2    |  16   16.06.  Graphical Models               9
4   28.04.  Monte Carlo                         |  17   22.06.  Factor Graphs
5   04.05.  Markov Chain Monte Carlo       3    |  18   23.06.  The Sum-Product Algorithm      10
6   05.05.  Gaussian Distributions              |  19   29.06.  Example: Topic Models
7   11.05.  Parametric Regression          4    |  20   30.06.  Mixture Models                 11
8   12.05.  Understanding Deep Learning         |  21   06.07.  EM
9   18.05.  Gaussian Processes             5    |  22   07.07.  Variational Inference          12
10  19.05.  An Example for GP Regression        |  23   13.07.  Example: Topic Models
11  25.05.  Understanding Kernels          6    |  24   14.07.  Example: Inferring Topics      13
12  26.05.  Gauss-Markov Models                 |  25   20.07.  Example: Kernel Topic Models
13  08.06.  GP Classification              7    |  26   21.07.  Revision
The (univariate) Gaussian distribution
an exponentiated square
p(x) = 1/(σ√(2π)) · exp(−(x − µ)²/(2σ²)) =: N(x; µ, σ²)

▶ µ — the mean of x
▶ σ² — the variance of x
▶ σ — the standard deviation of x

[Figure: the univariate Gaussian pdf p(x) over x, with µ − σ, µ, and µ + σ marked on the abscissa.]
Univariate Gaussians
some observations, notation, and conventions
Definition

N(x; µ, σ²) := 1/(σ√(2π)) · exp(−(x − µ)²/(2σ²)),   with µ ∈ ℝ, σ ∈ ℝ₊,

will be called the Gaussian or normal distribution of x. We call x the argument or variable, and µ, σ² the parameters. We write x ∼ N(µ, σ²) to say that the variable x is distributed with pdf N(x; µ, σ²).

▶ ∫ N(x; µ, σ²) dx = 1 and N(x; µ, σ²) > 0 ∀x ∈ ℝ. So N is the density of a probability measure.
▶ Symmetry in x and µ: N(x; µ, σ²) = N(µ; x, σ²).
▶ An exponentiated quadratic polynomial in x, with natural parameters (η, τ):
  N(x; µ, σ²) = exp(a + ηx − ½τx²),  with τ = σ⁻² (the "precision"), η = σ⁻²µ,
  and log-normalizer a = −½(log(2π) − log τ + η²/τ).
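As a quick numerical sanity check of the natural-parameter form (a sketch, not part of the lecture; the parameter values are arbitrary), one can compare it against a standard pdf implementation:

```python
# Check that exp(a + eta*x - tau*x**2 / 2) reproduces N(x; mu, sigma^2).
import numpy as np
from scipy.stats import norm

mu, sigma = 1.3, 0.7          # arbitrary example values
tau = sigma**-2               # precision
eta = mu * tau                # precision-scaled mean
a = -0.5 * (np.log(2 * np.pi) - np.log(tau) + eta**2 / tau)  # log-normalizer

x = np.linspace(-3.0, 5.0, 101)
assert np.allclose(np.exp(a + eta * x - 0.5 * tau * x**2),
                   norm.pdf(x, loc=mu, scale=sigma))
```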
Gaussian Inference
The Gaussian is its own conjugate prior.
Let

p(x) = N(x; µ, σ²)
p(y | x) = N(y; x, ν²).

Then

p(x | y) = p(x) p(y | x) / ∫ p(x) p(y | x) dx = N(x; m, s²),  with

s² := 1/(σ⁻² + ν⁻²),   m := (σ⁻²µ + ν⁻²y)/(σ⁻² + ν⁻²).

[Figure: prior p(x), likelihood p(y | x), and posterior p(x | y) over x.]
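In code, this update is two lines. A minimal sketch (the function name and values are illustrative, not from the lecture):

```python
def posterior(mu, sigma2, y, nu2):
    """Posterior p(x | y) = N(x; m, s2) for prior N(mu, sigma2)
    and one observation y ~ N(x, nu2)."""
    s2 = 1.0 / (1.0 / sigma2 + 1.0 / nu2)   # precisions add
    m = s2 * (mu / sigma2 + y / nu2)        # precision-weighted mean
    return m, s2

m, s2 = posterior(mu=2.0, sigma2=1.0, y=3.0, nu2=0.5)
# m lies between the prior mean and the observation; s2 < min(sigma2, nu2)
```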
Gaussian Inference
Least-Squares Estimation
p(x) = N(x; µ, σ²)
p(y | x) = ∏ᵢ₌₁ᴺ N(yᵢ; x, νᵢ²)

p(x | y) = p(x) p(y | x) / ∫ p(x) p(y | x) dx = N(x; m, s²),  with

s⁻² := σ⁻² + Σᵢ₌₁ᴺ νᵢ⁻²,   s⁻²m := σ⁻²µ + Σᵢ₌₁ᴺ νᵢ⁻² yᵢ.

[Figure: prior, the N observation likelihoods, and the resulting posterior over x.]
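A sketch of this fusion rule for N observations (illustrative values, deliberately vague prior; `fuse` is a name chosen here):

```python
# Fusing N noisy observations of the same latent x: precisions add, and the
# posterior mean is the precision-weighted average of prior and observations.
import numpy as np

def fuse(mu, sigma2, ys, nu2s):
    """Posterior N(m, s2) for prior N(mu, sigma2) and y_i ~ N(x, nu_i^2)."""
    ys, nu2s = np.asarray(ys, float), np.asarray(nu2s, float)
    prec = 1.0 / sigma2 + np.sum(1.0 / nu2s)       # s^{-2}
    m = (mu / sigma2 + np.sum(ys / nu2s)) / prec   # precision-weighted mean
    return m, 1.0 / prec

m, s2 = fuse(mu=0.0, sigma2=10.0, ys=[2.1, 1.9, 2.3], nu2s=[0.5, 0.5, 0.5])
```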
The Multivariate Gaussian distribution
An exponentiated quadratic form
N(x; µ, Σ) = 1/((2π)^{n/2} |Σ|^{1/2}) · exp(−½ (x − µ)ᵀ Σ⁻¹ (x − µ)),   x, µ ∈ ℝⁿ, Σ ∈ ℝⁿˣⁿ spd,

where spd means symmetric positive definite: vᵀΣv > 0 for all v ∈ ℝⁿ, v ≠ 0.
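A sketch of how this density is typically evaluated in practice (a helper written for illustration, not lecture code): a Cholesky factor of the spd covariance gives both the determinant and the quadratic form without explicitly inverting Σ.

```python
# Log-density of a multivariate Gaussian via a Cholesky factorization.
import numpy as np

def log_mvn_pdf(x, mu, Sigma):
    n = mu.size
    L = np.linalg.cholesky(Sigma)    # Sigma = L L^T; fails if Sigma is not spd
    z = np.linalg.solve(L, x - mu)   # whitened residual: z^T z = r^T Sigma^{-1} r
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (n * np.log(2.0 * np.pi) + logdet + z @ z)
```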
The Multivariate Gaussian distribution
Equiprobability lines are ellipsoids
N(x; µ, Σ) = 1/((2π)^{n/2} |Σ|^{1/2}) · exp(−½ (x − µ)ᵀ Σ⁻¹ (x − µ))

▶ ∫ N(x; µ, Σ) dx = 1 and N(x; µ, Σ) > 0 ∀x ∈ ℝⁿ, so N is a probability density.
▶ The equiprobability lines {x : (x − µ)ᵀΣ⁻¹(x − µ) = const.} are ellipsoids centred at µ, with axes along the eigenvectors of Σ.
▶ As in the univariate case, an exponentiated quadratic, with natural parameters η = Σ⁻¹µ and Λ = Σ⁻¹:
  N(x; µ, Σ) = exp(a + ηᵀx − ½ tr(xxᵀΛ))

[Figure: contour plot of a bivariate Gaussian over (x₁, x₂); the contours are concentric ellipses around µ.]
Products of Gaussians are Gaussians
Closure under Multiplication

N(x; a, A) · N(x; b, B) = Z · N(x; c, C),  with

C = (A⁻¹ + B⁻¹)⁻¹,   c = C(A⁻¹a + B⁻¹b),   Z = N(a; b, A + B).

Note the similarity to the univariate case: precisions add, and the mean is the precision-weighted average.

[Figure: contours of the two Gaussian factors and of their (rescaled) product in the (x₁, x₂) plane.]
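A sketch of this identity in code (function name chosen here for illustration); the normalization Z is computed in log space for numerical stability:

```python
# N(x; a, A) * N(x; b, B) = Z * N(x; c, C)
import numpy as np

def multiply_gaussians(a, A, b, B):
    """Return (c, C, Z): product mean, covariance, and normalization."""
    C = np.linalg.inv(np.linalg.inv(A) + np.linalg.inv(B))
    c = C @ (np.linalg.solve(A, a) + np.linalg.solve(B, b))
    S, r, n = A + B, a - b, a.size
    logZ = -0.5 * (n * np.log(2 * np.pi) + np.linalg.slogdet(S)[1]
                   + r @ np.linalg.solve(S, r))   # log N(a; b, A + B)
    return c, C, np.exp(logZ)
```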
Linear Projections of Gaussians are Gaussians
Closure under linear maps
p(z) = N(z; µ, Σ)  ⇒  p(Az) = N(Az; Aµ, AΣAᵀ)

[Figure: a bivariate Gaussian and its image under the linear map A, both shown as contour plots over (x₁, x₂).]
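A quick Monte Carlo sanity check of this closure property (a sketch with arbitrary µ, Σ, A): samples of Az should have empirical mean ≈ Aµ and empirical covariance ≈ AΣAᵀ.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 0.5]])
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, -1.0]])   # maps R^3 -> R^2

z = rng.multivariate_normal(mu, Sigma, size=200_000)
x = z @ A.T
print(np.allclose(x.mean(axis=0), A @ mu, atol=5e-2))       # ~ A mu
print(np.allclose(np.cov(x.T), A @ Sigma @ A.T, atol=5e-2))  # ~ A Sigma A^T
```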
Marginals of Gaussians are Gaussians
Closure under marginalization
Marginalization is the special case of a linear map that simply selects a subset of the variables, A = (I 0):

∫ N( [x₁; x₂]; [µ₁; µ₂], [Σ₁₁, Σ₁₂ ; Σ₂₁, Σ₂₂] ) dx₂ = N(x₁; µ₁, Σ₁₁)

▶ Marginalizing just drops the rows and columns of µ and Σ belonging to the integrated-out variables.
▶ So every finite-dimensional Gaussian is a marginal of infinitely many higher-dimensional ones.

[Figure: a bivariate Gaussian over (x₁, x₂) and its one-dimensional marginal over x₁.]
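In code, marginalization really is just slicing (a sketch; the index set is arbitrary):

```python
# Keep the rows/columns of mu and Sigma belonging to the retained variables.
import numpy as np

mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 0.5]])
keep = [0, 2]                          # integrate out the middle variable
mu_marg = mu[keep]
Sigma_marg = Sigma[np.ix_(keep, keep)]
```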
Cuts through Gaussians are Gaussians
Closure under conditioning
p(x | Ax = y) = p(x, y) / p(y)
             = N( x; µ + ΣAᵀ(AΣAᵀ)⁻¹(y − Aµ), Σ − ΣAᵀ(AΣAᵀ)⁻¹AΣ )

[Figure: a bivariate Gaussian over (x₁, x₂) cut along a line Ax = y; the profile along the cut is again Gaussian.]
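A sketch of this noise-free conditioning rule (helper written for illustration); since the Gram matrix is symmetric, the gain ΣAᵀ(AΣAᵀ)⁻¹ can be computed with a solve instead of an explicit inverse:

```python
import numpy as np

def condition(mu, Sigma, A, y):
    """p(x | Ax = y) for x ~ N(mu, Sigma). Returns posterior (m, S)."""
    G = A @ Sigma @ A.T                   # Gram matrix
    K = np.linalg.solve(G, A @ Sigma).T   # gain Sigma A^T G^{-1} (G symmetric)
    m = mu + K @ (y - A @ mu)
    S = Sigma - K @ (A @ Sigma)
    return m, S
```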
Inference with Gaussians
Since conditioning and marginalization are mapped to linear algebra, so is Bayes’ Theorem
Theorem

If   p(x) = N(x; µ, Σ)
and  p(y | x) = N(y; Ax + b, Λ),
then p(y) = N(y; Aµ + b, AΣAᵀ + Λ)
and  p(x | y) = N( x; µ + ΣAᵀ(AΣAᵀ + Λ)⁻¹(y − (Aµ + b)), Σ − ΣAᵀ(AΣAᵀ + Λ)⁻¹AΣ )
             = N( x; (Σ⁻¹ + AᵀΛ⁻¹A)⁻¹(AᵀΛ⁻¹(y − b) + Σ⁻¹µ), (Σ⁻¹ + AᵀΛ⁻¹A)⁻¹ ).

In the first form, ΣAᵀ(AΣAᵀ + Λ)⁻¹ is the gain, y − (Aµ + b) the residual, and AΣAᵀ + Λ the Gram matrix; the second form is written in terms of the posterior precision matrix Σ⁻¹ + AᵀΛ⁻¹A.
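A sketch of the theorem in its gain/residual/Gram form (function name chosen here; the precision form is equivalent but preferable when Λ is cheap to invert):

```python
import numpy as np

def gaussian_bayes(mu, Sigma, A, b, Lam, y):
    """p(x) = N(mu, Sigma), p(y|x) = N(Ax + b, Lam).
    Returns posterior (m, S) and marginal (y_mean, G)."""
    G = A @ Sigma @ A.T + Lam                 # Gram matrix of p(y)
    gain = np.linalg.solve(G, A @ Sigma).T    # Sigma A^T G^{-1}
    residual = y - (A @ mu + b)
    m = mu + gain @ residual                  # posterior mean
    S = Sigma - gain @ (A @ Sigma)            # posterior covariance
    return m, S, A @ mu + b, G
```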
The Core Insight for All of This
Gaussian inference is linear algebra at its core [image: Konrad Jacobs]
For a block matrix A = [ P, Q ; R, S ] with M := (S − RP⁻¹Q)⁻¹ (the inverse Schur complement of P),

A⁻¹ = [ P⁻¹ + P⁻¹QMRP⁻¹, −P⁻¹QM ; −MRP⁻¹, M ].

Matrix inversion lemma (Woodbury identity):

(Z + UWVᵀ)⁻¹ = Z⁻¹ − Z⁻¹U(W⁻¹ + VᵀZ⁻¹U)⁻¹VᵀZ⁻¹

Matrix determinant lemma:

|Z + UWVᵀ| = |Z| · |W| · |W⁻¹ + VᵀZ⁻¹U|
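Both identities are easy to verify numerically (a sketch, on random well-conditioned matrices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 5, 2
Z = np.eye(n) * 3 + rng.standard_normal((n, n)) * 0.1   # well conditioned
U = rng.standard_normal((n, k))
W = np.eye(k) * 2
V = rng.standard_normal((n, k))

Zi = np.linalg.inv(Z)
inner = np.linalg.inv(np.linalg.inv(W) + V.T @ Zi @ U)
lhs = np.linalg.inv(Z + U @ W @ V.T)
rhs = Zi - Zi @ U @ inner @ V.T @ Zi
print(np.allclose(lhs, rhs))          # matrix inversion lemma

det_lhs = np.linalg.det(Z + U @ W @ V.T)
det_rhs = (np.linalg.det(Z) * np.linalg.det(W)
           * np.linalg.det(np.linalg.inv(W) + V.T @ Zi @ U))
print(np.isclose(det_lhs, det_rhs))   # matrix determinant lemma
```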
Example 1: Conditional Independence, Marginal Correlation
Bayesian Inference with Gaussians [DJC MacKay, The humble Gaussian distribution, 2006]
[Graphical model: x₂ (temperature outside) is a parent of both x₁ (temperature in building 1) and x₃ (temperature in building 2).]
Model: the outside temperature drives both buildings,

x₂ = ν₂,   x₁ = w₁x₂ + ν₁,   x₃ = w₃x₂ + ν₃,   with independent νᵢ ∼ N(µᵢ, σᵢ²).

In vector form, x = Aν with p(ν) = N(ν; µ, diag(σ²)) and

A = [ 1, w₁, 0 ; 0, 1, 0 ; 0, w₃, 1 ]   ⇒

p(x = Aν) = N( x; Aµ =: m, [ w₁²σ₂² + σ₁², w₁σ₂², w₁w₃σ₂² ; w₁σ₂², σ₂², w₃σ₂² ; w₁w₃σ₂², w₃σ₂², w₃²σ₂² + σ₃² ] =: Σ )
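A sketch of the point of this example in code (w₁, w₃, σ² are illustrative values, not from the lecture): the two buildings are marginally correlated, Σ₁₃ = w₁w₃σ₂² ≠ 0, yet conditionally independent given the outside temperature, which shows up as a zero in the precision matrix.

```python
import numpy as np

w1, w3 = 0.9, 1.1
sigma2 = np.array([1.0, 4.0, 1.0])   # var(nu_1), var(nu_2), var(nu_3)
A = np.array([[1.0, w1, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, w3, 1.0]])
Sigma = A @ np.diag(sigma2) @ A.T

print(Sigma[0, 2])                   # w1*w3*sigma_2^2 != 0: marginal correlation
print(np.linalg.inv(Sigma)[0, 2])    # ~0: independent conditional on x2
```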
Example 1: Conditional Independence, Marginal Correlation
A zero in the precision matrix means independence conditional on everything else [DJC MacKay, The humble Gaussian distribution, 2006]
Example 2: Explaining away
Bayesian Inference with Gaussians [DJC MacKay, The humble Gaussian distribution, 2006]
[Graphical model: gas price x₁ and emission price x₃ are parents of the electricity price x₂.]

Model:

x₁ = ν₁,   p(ν₁) = N(ν₁; µ₁, σ₁²)
x₃ = ν₃,   p(ν₃) = N(ν₃; µ₃, σ₃²)
x₂ = w₁x₁ + w₃x₃ + ν₂,   p(ν₂) = N(ν₂; µ₂, σ₂²)

Joint (ordering x = (x₁, x₂, x₃)):

p(x) = N( x; m, [ σ₁², w₁σ₁², 0 ; w₁σ₁², σ₂² + w₁²σ₁² + w₃²σ₃², w₃σ₃² ; 0, w₃σ₃², σ₃² ] =: Σ )

In particular, the two parents are marginally independent:

p(x₁, x₃) = N( [x₁; x₃]; [µ₁; µ₃], [σ₁², 0 ; 0, σ₃²] )
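A sketch of the explaining-away effect (illustrative parameter values, not from the lecture): x₁ and x₃ are independent a priori, but conditioning the joint on the common child x₂ produces a negative off-diagonal entry in the conditional covariance.

```python
import numpy as np

w1, w3 = 1.0, 1.0
s1, s2, s3 = 1.0, 0.25, 1.0                    # variances of nu_1, nu_2, nu_3
Sigma = np.array([[s1,      w1 * s1,                      0.0],
                  [w1 * s1, s2 + w1**2 * s1 + w3**2 * s3, w3 * s3],
                  [0.0,     w3 * s3,                      s3]])

# Condition (x1, x3) on x2 by standard Gaussian conditioning.
idx, obs = [0, 2], [1]
S_aa = Sigma[np.ix_(idx, idx)]
S_ab = Sigma[np.ix_(idx, obs)]
S_bb = Sigma[np.ix_(obs, obs)]
S_cond = S_aa - S_ab @ np.linalg.solve(S_bb, S_ab.T)
print(S_cond[0, 1])   # negative: observing x2 anti-correlates x1 and x3
```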
Example 2: Explaining away
A positive (negative) off-diagonal element in the precision matrix implies negative (positive) correlation, conditional on everything else [DJC MacKay, The humble Gaussian distribution, 2006]
Example 2: Explaining away
Bayesian Inference with Gaussians [DJC MacKay, The humble Gaussian distribution, 2006]
p(x₁, x₃) = N( [x₁; x₃]; [µ₁; µ₃], [σ₁², 0 ; 0, σ₃²] )

p(x₂) = N( x₂; w₁µ₁ + w₃µ₃ + µ₂, σ₂² + w₁²σ₁² + w₃²σ₃² )

p(x₂ | x₁, x₃) = N( x₂; w₁x₁ + w₃x₃ + µ₂, σ₂² )

Conditioning the parents on the child, with w := (w₁, w₃), µ₁,₃ := [µ₁; µ₃], Σ₁,₃ := diag(σ₁², σ₃²):

p(x₁, x₃ | x₂) = N( x₁,₃; µ₁,₃ + Σ₁,₃wᵀ · (x₂ − wµ₁,₃ − µ₂)/(wΣ₁,₃wᵀ + σ₂²), Σ₁,₃ − Σ₁,₃wᵀwΣ₁,₃/(wΣ₁,₃wᵀ + σ₂²) )

             = N( [x₁; x₃]; [µ₁; µ₃] + [w₁σ₁²; w₃σ₃²] · (x₂ − w₁µ₁ − w₃µ₃ − µ₂)/(w₁²σ₁² + w₃²σ₃² + σ₂²),
                  [σ₁², 0 ; 0, σ₃²] − [w₁σ₁²; w₃σ₃²][w₁σ₁², w₃σ₃²]/(w₁²σ₁² + w₃²σ₃² + σ₂²) )

The conditional covariance has the negative off-diagonal entry −w₁w₃σ₁²σ₃²/(w₁²σ₁² + w₃²σ₃² + σ₂²) (for w₁, w₃ > 0): observing the electricity price makes the a-priori independent gas and emission prices anti-correlated — each explains the other away.

[Figure: contour plot over x₁ (gas price, USD/MMBtu) and x₃ (emission price, EUR/t).]
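As a cross-check (a sketch with illustrative values): the closed-form conditional above must agree with generic Gaussian conditioning applied to the full joint over (x₁, x₃, x₂).

```python
import numpy as np

w1, w3 = 1.0, 1.0
mu1, mu2, mu3 = 4.0, 0.0, 15.0
s1, s2, s3 = 1.0, 0.25, 4.0
x2 = 20.0

# Closed form, with w = (w1, w3), Sigma_13 = diag(s1, s3)
w = np.array([w1, w3])
S13 = np.diag([s1, s3])
denom = w @ S13 @ w + s2
m_cf = np.array([mu1, mu3]) + S13 @ w * (x2 - w1*mu1 - w3*mu3 - mu2) / denom
S_cf = S13 - np.outer(S13 @ w, w @ S13) / denom

# Generic conditioning on the joint over (x1, x3, x2)
mu = np.array([mu1, mu3, w1 * mu1 + w3 * mu3 + mu2])
Sigma = np.array([[s1,      0.0,     w1 * s1],
                  [0.0,     s3,      w3 * s3],
                  [w1 * s1, w3 * s3, denom]])
m_gen = mu[:2] + Sigma[:2, 2] * (x2 - mu[2]) / Sigma[2, 2]
S_gen = Sigma[:2, :2] - np.outer(Sigma[:2, 2], Sigma[2, :2]) / Sigma[2, 2]
print(np.allclose(m_cf, m_gen), np.allclose(S_cf, S_gen))
```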
N(x; µ, Σ) = 1/((2π)^{n/2} |Σ|^{1/2}) · exp(−½ (x − µ)ᵀ Σ⁻¹ (x − µ))
Today:
▶ Gaussian distributions provide the linear algebra of inference.
▶ products of Gaussians are Gaussians
▶ linear maps of Gaussian variables are Gaussian variables
▶ marginals of Gaussians are Gaussians
▶ linear conditionals of Gaussians are Gaussians
If all variables in a generative model are linearly related, and the distributions of the parent variables are Gaussian, then all conditionals, joints, and marginals are Gaussian, with means and covariances computable by linear algebra operations.
▶ A zero off-diagonal element in the covariance matrix implies independence if all other variables are integrated out:
  [Σ]ᵢⱼ = 0 ⇒ p(xᵢ, xⱼ) = N(xᵢ; [µ]ᵢ, [Σ]ᵢᵢ) · N(xⱼ; [µ]ⱼ, [Σ]ⱼⱼ)
▶ A zero off-diagonal element in the precision matrix implies independence conditional on all other variables:
  [Σ⁻¹]ᵢⱼ = 0 ⇒ p(xᵢ, xⱼ | x≠ᵢ,ⱼ) = p(xᵢ | x≠ᵢ,ⱼ) · p(xⱼ | x≠ᵢ,ⱼ)
The Toolbox
Framework:
∫ p(x₁, x₂) dx₂ = p(x₁)      p(x₁, x₂) = p(x₁ | x₂) p(x₂)      p(x | y) = p(y | x) p(x) / p(y)
Modelling:
▶ Directed Graphical Models
▶ Gaussian Distributions
▶ …

Computation:
▶ Monte Carlo
▶ Linear algebra / Gaussian inference
▶ …
Probabilistic ML — P. Hennig, SS 2021 — Lecture 06: Gaussian Probability Distributions — © Philipp Hennig, 2021, CC BY-NC-SA 3.0