JG Note
1 Introduction
1.1 Definitions
We list equivalent definitions of jointly Gaussian random variables below.
Definition 1. Let the random vector X := (X1, . . . , Xn)⊤. Let Z ∈ Rℓ be a standard normal random vector (i.e., Z1, . . . , Zℓ are i.i.d. with Zi ∼ N(0, 1)). Then X1, . . . , Xn are jointly Gaussian if there exist µ ∈ Rn and A ∈ Rn×ℓ such that X = AZ + µ.
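As a concrete illustration of Definition 1, here is a minimal sketch assuming NumPy, with an arbitrary mixing matrix A and mean µ (neither comes from the notes):

    import numpy as np

    rng = np.random.default_rng(0)
    n, ell = 2, 3                          # dimensions of X and Z
    A = rng.standard_normal((n, ell))      # arbitrary mixing matrix A
    mu = np.array([1.0, -2.0])             # arbitrary mean vector mu

    Z = rng.standard_normal(ell)           # standard normal vector Z
    X = A @ Z + mu                         # X = AZ + mu is jointly Gaussian by Definition 1
    print(X)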
Definition 2. X1, . . . , Xn are jointly Gaussian if every linear combination u⊤X of X1, . . . , Xn, for any u ∈ Rn, follows a normal distribution.
With X = AZ + µ as in Definition 1, the covariance matrix of X is
\begin{align*}
\Sigma = \mathbb{E}[(X - \mu)(X - \mu)^\top] &= \mathbb{E}[(AZ)(AZ)^\top] \\
&= \mathbb{E}[A Z Z^\top A^\top] \\
&= A\, \mathbb{E}[Z Z^\top]\, A^\top \\
&= A A^\top,
\end{align*}
since E[ZZ⊤] = I for a standard normal vector Z.
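Empirically, the sample covariance of many draws of X = AZ + µ should approach AA⊤. A quick check, again assuming NumPy and arbitrary A and µ:

    import numpy as np

    rng = np.random.default_rng(0)
    n, ell = 2, 3
    A = rng.standard_normal((n, ell))
    mu = np.array([1.0, -2.0])

    Z = rng.standard_normal((500_000, ell))
    X = Z @ A.T + mu                       # many samples of X = AZ + mu, one per row
    print(np.cov(X, rowvar=False))         # close to ...
    print(A @ A.T)                         # ... A A^T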
2 Properties of JG RVs
2.1 Independent iff Uncorrelated
In general, for any two random variables X1, X2, if X1 and X2 are independent, then they are necessarily uncorrelated. The converse fails in general, but it does hold for jointly Gaussian random variables (Theorem 1): if X1 and X2 are jointly Gaussian and uncorrelated, then they are independent.
Proof. Without loss of generality, we will consider the case of two jointly
Gaussian random variables. Extensions to higher dimensions follow by the
same reasoning. Suppose that X1 , X2 are uncorrelated. Recall that the entries
of the covariance matrix are Σi,j = cov(Xi , Xj ), which means that
\[
\Sigma = \begin{pmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{pmatrix}
\quad\text{and}\quad
\Sigma^{-1} = \begin{pmatrix} 1/\sigma_1^2 & 0 \\ 0 & 1/\sigma_2^2 \end{pmatrix}.
\]
Substituting this into the jointly Gaussian PDF, the joint density factorizes:
\[
f_X(x_1, x_2) = \frac{1}{\sqrt{(2\pi)^2 \sigma_1^2 \sigma_2^2}} \exp\!\left( -\frac{1}{2}\left( \frac{(x_1 - \mu_1)^2}{\sigma_1^2} + \frac{(x_2 - \mu_2)^2}{\sigma_2^2} \right) \right)
\]
\[
= \frac{1}{\sqrt{2\pi \sigma_1^2}} \exp\!\left( -\frac{1}{2} \frac{(x_1 - \mu_1)^2}{\sigma_1^2} \right) \cdot \frac{1}{\sqrt{2\pi \sigma_2^2}} \exp\!\left( -\frac{1}{2} \frac{(x_2 - \mu_2)^2}{\sigma_2^2} \right),
\]
which is the product of the marginal densities of X1 and X2. Hence X1 and X2 are independent.
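As a numerical sanity check of this factorization, the following sketch (assuming NumPy; the means and standard deviations are arbitrary) compares the diagonal-covariance joint density above with the product of the two marginal densities:

    import numpy as np

    mu1, mu2 = 1.0, -2.0                   # arbitrary means
    s1, s2 = 0.7, 1.5                      # arbitrary standard deviations

    def joint_pdf(x1, x2):
        # bivariate normal density with diagonal covariance, as displayed above
        return (1.0 / np.sqrt((2 * np.pi) ** 2 * s1**2 * s2**2)
                * np.exp(-0.5 * ((x1 - mu1) ** 2 / s1**2 + (x2 - mu2) ** 2 / s2**2)))

    def marginal_pdf(x, mu, s):
        return 1.0 / np.sqrt(2 * np.pi * s**2) * np.exp(-0.5 * (x - mu) ** 2 / s**2)

    x1 = np.linspace(-3.0, 3.0, 7)
    x2 = np.linspace(-4.0, 2.0, 7)
    print(np.allclose(joint_pdf(x1, x2), marginal_pdf(x1, mu1, s1) * marginal_pdf(x2, mu2, s2)))  # True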
Note. We have shown that for jointly Gaussian random variables, the
variables being uncorrelated implies that they are independent. This does not,
however, mean that any two uncorrelated marginally normally distributed
random variables are necessarily independent. To see why the variables being
jointly Gaussian is so crucial, we will consider an example.
Example 1. Consider X ∼ N(0, 1) and Y = WX, where W is independent of X with
\[
W = \begin{cases} 1 & \text{w.p. } 0.5, \\ -1 & \text{w.p. } 0.5. \end{cases}
\]
Then Y ∼ N(0, 1) as well (flipping the sign of a standard normal does not change its distribution), and cov(X, Y) = E[WX²] = E[W] E[X²] = 0, so X and Y are uncorrelated random variables with normal marginals. Yet they are clearly not independent, since |Y| = |X|. Indeed, X and Y are not jointly Gaussian: X + Y equals 0 with probability 1/2, so it cannot follow a normal distribution.
Therefore, one must ensure that the random variables are jointly Gaussian
before assuming that any of these properties necessarily hold.
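A small simulation sketch of Example 1 (assuming NumPy; the sample size and seed are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200_000
    X = rng.standard_normal(n)
    W = rng.choice([-1.0, 1.0], size=n)    # Rademacher sign, independent of X
    Y = W * X

    print(np.corrcoef(X, Y)[0, 1])         # close to 0: X and Y are uncorrelated
    print(Y.mean(), Y.var())               # close to 0 and 1: Y is (marginally) standard normal
    print(np.mean(X + Y == 0.0))           # close to 0.5: X + Y has an atom at 0,
                                           # so (X, Y) cannot be jointly Gaussian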
where Ai denotes the ith row of A written as a column vector, so that Xi = Ai⊤Z + µi. Now, for any C, D, c, d ∈ R, let U = CX1 + c and V = DX2 + d. Substituting in the above expressions, we find
\[
U = C(A_1^\top Z + \mu_1) + c, \qquad V = D(A_2^\top Z + \mu_2) + d,
\]
so that
\[
\begin{pmatrix} U \\ V \end{pmatrix} = \begin{pmatrix} C A_1^\top \\ D A_2^\top \end{pmatrix} Z + \begin{pmatrix} C\mu_1 + c \\ D\mu_2 + d \end{pmatrix},
\]
which is of the form in Definition 1, so U and V are jointly Gaussian.
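A minimal numerical check of this stacking, assuming NumPy (the matrix A, mean µ, and scalars C, D, c, d below are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.standard_normal((2, 3))        # rows A_1^T and A_2^T
    mu = np.array([0.5, -1.0])
    C, D, c, d = 2.0, -3.0, 1.0, 4.0

    Z = rng.standard_normal(3)
    X = A @ Z + mu                         # X = AZ + mu, so X_i = A_i^T Z + mu_i
    U, V = C * X[0] + c, D * X[1] + d      # affine functions of X_1 and X_2

    B = np.vstack([C * A[0], D * A[1]])    # stacked matrix with rows C A_1^T and D A_2^T
    b = np.array([C * mu[0] + c, D * mu[1] + d])
    print(np.allclose(np.array([U, V]), B @ Z + b))   # True: (U, V) = BZ + b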
Therefore, X −L[X | Y ] and Y are uncorrelated. By Theorem 2, X −L[X | Y ]
and Y are jointly Gaussian since they are linear combinations of X and Y .
Thus, by Theorem 1, the uncorrelated jointly Gaussian X − L[X | Y ] and Y
must be independent.
We know that functions of independent random variables are independent
(see Lemma 1 in the Appendix). This implies that X − L[X | Y ] and φ(Y )
are independent for all functions φ(·). Independent random variables are uncorrelated, so
\[
\operatorname{cov}\!\big(X - L[X \mid Y],\ \varphi(Y)\big) = 0 \quad \text{for every function } \varphi(\cdot);
\]
that is, the error of the linear estimator is uncorrelated with every function of Y.
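A simulation sketch of this orthogonality, assuming NumPy (the covariance matrix and the test function φ(y) = y² are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(2)
    n = 500_000
    cov = np.array([[2.0, 0.8], [0.8, 1.0]])           # arbitrary covariance of (X, Y)
    X, Y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T

    # LLSE of X given Y with zero means: L[X | Y] = (cov(X, Y) / var(Y)) * Y
    resid = X - (cov[0, 1] / cov[1, 1]) * Y

    print(np.cov(resid, Y)[0, 1])          # close to 0: residual uncorrelated with Y
    print(np.cov(resid, Y**2)[0, 1])       # close to 0: also uncorrelated with phi(Y) = Y^2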
Clearly the first point is true for the covariance matrix of jointly Gaussian random variables by definition. In the following subsections, we shall see how to interpret each of these statements in different ways.
Note. In order for the PDF of a multivariate normal to be defined, the covariance matrix must be positive definite, meaning that x⊤Σx > 0 for all x ≠ 0, or equivalently that all eigenvalues of Σ are real and positive.
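One quick way to test this numerically, assuming NumPy (Sigma below is an arbitrary example matrix):

    import numpy as np

    Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])        # arbitrary symmetric matrix
    print(np.all(np.linalg.eigvalsh(Sigma) > 0))      # True iff Sigma is positive definite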
3.2 Projection
Suppose we had a jointly Gaussian vector X and its centered version X̂ =
X − µ, and wanted to find the variance when projecting X̂ along a particular
unit direction u. By the definition of projection, this quantity is
\[
\operatorname{var}(u^\top \hat{X}) = \mathbb{E}\big[(u^\top \hat{X})(\hat{X}^\top u)\big] = u^\top\, \mathbb{E}[\hat{X} \hat{X}^\top]\, u = u^\top \Sigma u.
\]
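A quick simulation check of this formula, assuming NumPy (µ, Σ, and the unit direction u are arbitrary):

    import numpy as np

    rng = np.random.default_rng(3)
    mu = np.array([1.0, -1.0])
    Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])
    u = np.array([0.6, 0.8])               # a unit direction (||u||_2 = 1)

    X = rng.multivariate_normal(mu, Sigma, size=500_000)
    proj = (X - mu) @ u                    # projection of the centered vector onto u

    print(proj.var())                      # close to ...
    print(u @ Sigma @ u)                   # ... u^T Sigma u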
Recall that any real symmetric matrix M (in particular, a covariance matrix Σ) admits a spectral decomposition
\[
M = U \Lambda U^\top,
\]
where U is orthogonal and Λ is diagonal with the (real) eigenvalues of M on its diagonal.
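For a concrete look at this decomposition, assuming NumPy (Sigma below is an arbitrary symmetric example):

    import numpy as np

    Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])
    lam, U = np.linalg.eigh(Sigma)         # eigenvalues (ascending) and orthonormal eigenvectors
    print(np.allclose(Sigma, U @ np.diag(lam) @ U.T))    # True: Sigma = U Lambda U^T
    print(np.allclose(U.T @ U, np.eye(2)))                # True: U is orthogonal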
3.4 Density Level Curves
If we examine the PDF of a JG RV (assuming it has positive definite Σ, so
an inverse exists), the significant term is
g(x) = (x − µ)⊤ Σ−1 (x − µ).
The level curves of g are the sets of points on which g, and hence the density, is constant. It turns out that the level curves of g are hyperellipsoids centered at µ. For additional details, see Section 4.2 in the Appendix.
4 Appendix
4.1 Functions of Independent RVs Are Independent
Lemma 1 (Functions of independent RVs are independent). Let X, Y be two
independent random variables and g, h be real valued functions defined on the
codomains of X and Y respectively. Then, g(X) and h(Y ) are independent
random variables.
Proof.
P(g(X) ∈ A, h(Y ) ∈ B) = P(X ∈ g −1 (A), Y ∈ h−1 (B))
= P(X ∈ g −1 (A)) · P(Y ∈ h−1 (B))
= P(g(X) ∈ A) · P(h(Y ) ∈ B).
4.2 Level Curves of g
Throughout Sections 4.2.1–4.2.3 we take µ = 0; Section 4.2.4 handles nonzero µ.
4.2.1 When Σ = I
Let us start by considering the level curves of g when Σ = I:
g(x) = x⊤ Σ⁻¹ x = x⊤ x = ∥x∥₂².
From this, we can clearly see that the level curves of g are hyperspheres
centered at the origin.
4.2.2 When Σ = Λ
Things get slightly more complicated when we generalize to a positive diagonal
matrix for Σ, but not by much:
\[
g(x) = x^\top \Sigma^{-1} x = x^\top \Lambda^{-1} x = \sum_{i=1}^{\ell} \frac{1}{\lambda_i} x_i^2.
\]
The level curves of g are therefore axis-aligned hyperellipsoids centered at the origin, with semi-axis lengths proportional to √λi along the coordinate directions.
4.2.3 When Σ = U ΛU ⊤
Now let us consider the most general case:
\[
g(x) = x^\top \Sigma^{-1} x = x^\top U \Lambda^{-1} U^\top x = \sum_{i=1}^{\ell} \frac{1}{\lambda_i} (U^\top x)_i^2.
\]
The level curves are again hyperellipsoids with the same semi-axis lengths of √λi. However, this time, the semi-axis directions are not along the coordinate directions, but along the directions defined by the columns of U!
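To make this concrete, the following sketch, assuming NumPy (Sigma is an arbitrary positive definite example), checks that the semi-axis endpoints √λi ui, where ui are the columns of U, all lie on the level curve g(x) = 1:

    import numpy as np

    Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])
    lam, U = np.linalg.eigh(Sigma)         # Sigma = U diag(lam) U^T
    Sigma_inv = np.linalg.inv(Sigma)

    g = lambda x: x @ Sigma_inv @ x
    # Each point sqrt(lam_i) * u_i satisfies g(x) = 1.
    print([g(np.sqrt(lam[i]) * U[:, i]) for i in range(2)])   # both entries close to 1.0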
4.2.4 Nonzero µ
Previously we have assumed µ = 0, but what if that isn’t actually true?
When our random vector has nonzero mean, we effectively have a translation.
The level curves of g will keep the same shape, but will simply be translated so that their center is at µ instead of the origin.