
Department of Mathematics
Ma 3/103 KC Border
Introduction to Probability and Statistics Winter 2021

Lecture 9: Transformations; Joint Distributions

Relevant textbook passages:


Pitman [5]: Chapter 5; Sections 6.4–6.5
Larsen–Marx [4]: Sections 3.7, 3.8, 3.9, 3.11

9.1 Density of a function of a random variable; aka change of variable


Pitman [5]: Section 4.4, pp. 302–309

If X is a random variable with cumulative distribution function F_X and density f_X = F_X′, and g is a (Borel) function, then

Y = g(X)

is a random variable. The cumulative distribution function F_Y of Y is given by

F_Y(y) = P(Y ⩽ y) = P(g(X) ⩽ y).

The density f_Y is then given by

f_Y(y) = (d/dy) P(g(X) ⩽ y),
provided the derivative exists.

There are two special cases of particular interest. If g′(x) > 0 for all x, so that g is strictly increasing, then g has an inverse, and

g(X) ⩽ y ⇐⇒ X ⩽ g^{-1}(y),

so

F_Y(y) = F_X(g^{-1}(y))

is the cumulative distribution function of Y. The density f_Y of Y is found by differentiating this.

Similarly, if g′(x) < 0 for all x, then g is strictly decreasing, and

Y ⩽ y ⇐⇒ g(X) ⩽ y ⇐⇒ X ⩾ g^{-1}(y),

and if F_X is continuous, this is just

F_Y(y) = 1 − F_X(g^{-1}(y)),

and we may differentiate that (with respect to y) to get the density.


Start with the case g′ > 0. By the Inverse Function Theorem [2, Theorem 6.7, p. 252],

(d/dy) g^{-1}(y) = 1 / g′(g^{-1}(y)).


So in this case we have

f_Y(y) = (d/dy) F_X(g^{-1}(y)) = F_X′(g^{-1}(y)) · (d/dy) g^{-1}(y) = f_X(g^{-1}(y)) / g′(g^{-1}(y)).

Or, letting y = g(x),

f_Y(g(x)) = f_X(x)/g′(x).

When g′ < 0, then g is decreasing and we must differentiate 1 − F_X(g^{-1}(y)) to find the density of Y. In this case

f_Y(g(x)) = −f_X(x)/g′(x).
So to sum up:

9.1.1 Theorem (Density of a monotone function of X) Let X be a random variable with cumulative distribution function F_X and density f_X = F_X′, and let

Y = g(X).

If g is everywhere differentiable and either g′(x) > 0 for all x in the range of X, or g′(x) < 0 for all x in the range of X, then Y has a density f_Y given by

f_Y(y) = f_X(g^{-1}(y)) / |g′(g^{-1}(y))|.

9.1.2 Example (Change of scale and location) We say that the random variable Y is a
change of location and scale of the random variable X if
Y = aX + b,  or equivalently,  X = (Y − b)/a,

where a > 0. If X ∼ F_X with density f_X, then

F_Y(y) = P(Y ⩽ y) = P(aX + b ⩽ y) = P(X ⩽ (y − b)/a) = F_X((y − b)/a),

so

f_Y(y) = (1/a) f_X((y − b)/a).

9.1.3 Example Let X ∼ U (0, 1), and let Y = 2X − 1. Then by Example 9.1.2 we have
f_X(x) = 1 for 0 ⩽ x ⩽ 1 and 0 otherwise,   so   f_Y(y) = 1/2 for −1 ⩽ y ⩽ 1 and 0 otherwise.

In other words, Y ∼ U [−1, 1]. □

9.1.4 Example Let Z have the standard normal density


f_Z(z) = (1/√(2π)) e^{−z²/2},

and let Y = σZ + µ. By Example 9.1.2 we have

f_Y(y) = (1/σ) f_Z((y − µ)/σ) = (1/(√(2π) σ)) e^{−(1/2)((y−µ)/σ)²}.


9.1.5 Example Let X ∼ U[0, 1] (so f(x) = 1 for 0 ⩽ x ⩽ 1), and let

g(x) = x^a,

so

g′(x) = a x^{a−1}.

Now

g^{-1}(y) = y^{1/a},

and as x ranges over the unit interval, so does y. To apply the inverse function theorem use

(d/dy) g^{-1}(y) = 1 / g′(g^{-1}(y)) = 1 / (a (g^{-1}(y))^{a−1}) = 1 / (a y^{(a−1)/a}).

So the density of g(X) = X^a is given by f(g^{-1}(y)) · (d/dy) g^{-1}(y), or

f_g(y) = 1/(a y^{(a−1)/a}) for 0 ⩽ y ⩽ 1, and 0 otherwise.
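The change-of-variable formula is easy to sanity-check by simulation. Here is a minimal sketch (assuming numpy and matplotlib are available; the exponent a = 3 is an arbitrary choice, not one used in the text) that compares a histogram of Y = X^a with the density just derived:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
a = 3.0                                  # arbitrary exponent, for illustration only
x = rng.uniform(0.0, 1.0, 100_000)       # X ~ Uniform[0, 1]
y = x ** a                               # Y = g(X) = X^a

# Density derived in Example 9.1.5: f_Y(y) = 1 / (a * y^((a-1)/a)) on (0, 1].
grid = np.linspace(0.01, 1.0, 400)
density = 1.0 / (a * grid ** ((a - 1.0) / a))

plt.hist(y, bins=100, density=True, alpha=0.5, label="simulated Y = X^a")
plt.plot(grid, density, label="derived density")
plt.legend()
plt.show()
```

The histogram should track the curve closely away from y = 0, where the density is unbounded.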


Even if g is not strictly increasing or decreasing, if we can find a nice expression for FY , we
may still be able to find the density of Y .

9.1.6 Example (A non-monotonic transformation) Let X have a Uniform[−1, 1] distribution, and let Y = X². Then 0 ⩽ Y ⩽ 1 and

F_Y(y) = P(X² ⩽ y) = P(−√y ⩽ X ⩽ √y) = √y   (0 ⩽ y ⩽ 1),

and F_Y(y) = 0 for y ⩽ 0 and F_Y(y) = 1 for y ⩾ 1. Then

f_Y(y) = (d/dy) √y = 1/(2√y)   (0 ⩽ y ⩽ 1),

and fY (y) = 0 for y ⩽ 0 or y ⩾ 1. □

9.2 Random vectors and joint distributions


Recall that a random variable X is a real-valued function on the sample space (Ω, F, P ), where
P is a probability measure on Ω; and that it induces a probability measure PX on R, called the
distribution of X, given by

PX (I) = P (X ∈ I) = P {ω ∈ Ω : X(ω) ∈ I} ,

for every interval I in R. The distribution is enough to calculate the expectation of any (Borel)
function of X.
Now suppose I have more than one random variable on the same sample space. Then I can
consider the random vector (X, Y ) or X = (X1 , . . . , Xn ).

• A random vector X defines a probability PX on Rn , called the distribution of X via:

PX (I1 × · · · × In ) = P (X ∈ I1 × · · · × In ) = P {ω ∈ Ω : X1 (ω) ∈ I1 , . . . , Xn (ω) ∈ In } ,

where each Ii is an interval. This distribution is also called the joint distribution of X1 , . . . , Xn .


• We can use this to define a joint cumulative distribution function, denoted FX , by

FX (x1 , . . . , xn ) = P (Xi ⩽ xi , for all i = 1, . . . , n)

• Given a joint cumulative distribution function we can recover the joint distribution. For instance, suppose (X, Y) has joint cumulative distribution function F. The probability of the rectangle (a1, b1] × (a2, b2] is given by

F(b1, b2) − F(a1, b2) − F(b1, a2) + F(a1, a2).

To see this, consult Figure 9.1. The rectangle is the event

((X, Y) ≦ (b1, b2)) \ ( ((X, Y) ≦ (a1, b2)) ∪ ((X, Y) ≦ (b1, a2)) ).

The probability is computed by using the inclusion/exclusion principle to compute the probability of the union and subtracting it from P((X, Y) ≦ (b1, b2)). There are higher dimensional versions, but the expressions are complicated (see, e.g., [1, pp. 394–395]). In this class we shall mostly deal with joint densities. (A small numerical illustration of the rectangle formula appears just after these bullet points.)

Figure 9.1. Joint cumulative distribution function and rectangles. (The corners shown are (a1, a2), (b1, a2), (a1, b2), and (b1, b2).)
• N.B. While the subscript X,Y or X is often used to identify a joint distribution or cumulative
distribution function, it is also frequently omitted. You are supposed to figure out the domain
of the function by inspecting its arguments.
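As a concrete check of the rectangle formula above, here is a small sketch (the helper rect_prob and the choice of independent Uniform[0, 1] variables, whose joint cdf is F(x, y) = xy on the unit square, are hypothetical illustrations, not part of the text):

```python
def rect_prob(F, a1, a2, b1, b2):
    """P((X, Y) in (a1, b1] x (a2, b2]) computed from the joint cdf F."""
    return F(b1, b2) - F(a1, b2) - F(b1, a2) + F(a1, a2)

# Independent Uniform[0, 1] variables have joint cdf F(x, y) = x * y on the unit square.
def F(x, y):
    clamp = lambda t: min(max(t, 0.0), 1.0)
    return clamp(x) * clamp(y)

print(rect_prob(F, 0.2, 0.3, 0.7, 0.9))   # approximately (0.7 - 0.2) * (0.9 - 0.3) = 0.3
```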

9.2.1 Example Let S = {SS, SF, F S, F F } and let P be the probability measure on S defined
by
P(SS) = 7/12,   P(SF) = 3/12,   P(FS) = 1/12,   P(FF) = 1/12.
Define the random variables X and Y by

X(SS) = 1, X(SF ) = 1, X(F S) = 0, X(F F ) = 0,


Y (SS) = 1, Y (SF ) = 0, Y (F S) = 1, Y (F F ) = 0.

That is, X and Y indicate Success or Failure on two different experiments, but the experiments
are not necessarily independent.
Then
P_X(1) = 10/12,   P_X(0) = 2/12,
P_Y(1) = 8/12,    P_Y(0) = 4/12,
P_{X,Y}(1, 1) = 7/12,   P_{X,Y}(1, 0) = 3/12,   P_{X,Y}(0, 1) = 1/12,   P_{X,Y}(0, 0) = 1/12.


9.3 Joint PMFs


Pitman [5]: Section 3.1; also p. 348. Larsen–Marx [4]: Section 3.7

A random vector X on a probability space (Ω, F, P) is discrete if you can enumerate its range. When X_1, . . . , X_n are discrete, the joint probability mass function of the random vector X = (X_1, . . . , X_n) is usually denoted p_X, and is given by

p_X(x_1, x_2, . . . , x_n) = P(X_1 = x_1 and X_2 = x_2 and · · · and X_n = x_n).

If X and Y are independent random variables, then p_{X,Y}(x, y) = p_X(x) p_Y(y).

For a function g of X and Y we have

E g(X, Y) = ∑_x ∑_y g(x, y) p_{X,Y}(x, y).
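A short sketch of the double-sum formula, using the joint pmf of Example 9.2.1 (the test functions g are arbitrary choices for illustration):

```python
# Joint pmf of (X, Y) from Example 9.2.1, keyed by (x, y).
pmf = {(1, 1): 7/12, (1, 0): 3/12, (0, 1): 1/12, (0, 0): 1/12}

def expect(g, pmf):
    """E g(X, Y) = sum over x and y of g(x, y) * p_{X,Y}(x, y)."""
    return sum(g(x, y) * p for (x, y), p in pmf.items())

print(expect(lambda x, y: x * y, pmf))    # E(XY) = 7/12
print(expect(lambda x, y: x + y, pmf))    # E(X + Y) = 10/12 + 8/12 = 3/2
```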

9.4 Joint densities


Pitman [5]: Chapter 5.1. Larsen–Marx [4]: Section 3.7

Let X and Y be random variables on a probability space (Ω, F, P). The random vector (X, Y) has a joint density f_{X,Y}(x, y) if for every rectangle I_1 × I_2 ⊂ R²,

P((X, Y) ∈ I_1 × I_2) = ∫_{I_2} ∫_{I_1} f_{X,Y}(x, y) dx dy.

If X and Y are independent, then f_{X,Y}(x, y) = f_X(x) f_Y(y).

For example,

P(X ⩾ Y) = ∫_{−∞}^{∞} ∫_{y}^{∞} f_{X,Y}(x, y) dx dy.

For a function g of X and Y we have

E g(X, Y) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x, y) f_{X,Y}(x, y) dx dy.

Again, the subscript X,Y or X on a density is frequently omitted.

9.4.1 Theorem If the joint cumulative distribution function F : R^n → R is differentiable, then the joint density is given by

f(x_1, . . . , x_n) = ∂^n F(x_1, . . . , x_n) / (∂x_1 · · · ∂x_n).
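Theorem 9.4.1 can be illustrated symbolically. The sketch below (assuming sympy is available; the joint cdf of two independent Exponential(1) variables is just a convenient example) differentiates a joint cdf once in each variable and recovers the joint density:

```python
import sympy as sp

x, y = sp.symbols("x y", positive=True)

# Joint cdf of two independent Exponential(1) random variables, for x, y > 0.
F = (1 - sp.exp(-x)) * (1 - sp.exp(-y))

# Theorem 9.4.1: the joint density is the mixed partial derivative of F.
f = sp.simplify(sp.diff(F, x, y))
print(f)        # the product of the two exponential densities
```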

9.5 Recovering marginal distributions from joint distributions


So now we have random variables X and Y , and the random vector (X, Y ). They have distri-
butions PX , PY , and PX,Y . How are they related? The marginal distribution of X is just
the distribution PX of X alone. We can recover its probability mass function from the joint
probability mass function pX,Y as follows.

In the discrete case:

p_X(x) = P(X = x) = ∑_y p_{X,Y}(x, y).

Likewise

p_Y(y) = P(Y = y) = ∑_x p_{X,Y}(x, y).

If X and Y are independent random variables, then pX,Y (x, y) = pX (x)pY (y).


Larsen–Marx [4]: p. 169

For the density case, the marginal density of X, denoted f_X, is given by

f_X(x) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dy,

and the marginal density f_Y of Y is given by

f_Y(y) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dx.

The recovery of a marginal density of X from a joint density of X and Y is sometimes described as “integrating out” y.
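Here is a numeric sketch of “integrating out” (assuming scipy is available; the joint density f(x, y) = x + y on the unit square is a hypothetical example, not one from the text):

```python
from scipy.integrate import quad

def f_joint(x, y):
    """Hypothetical joint density: f(x, y) = x + y on the unit square."""
    return x + y if 0 <= x <= 1 and 0 <= y <= 1 else 0.0

def f_X(x):
    """Marginal density of X, obtained by integrating out y."""
    value, _ = quad(lambda y: f_joint(x, y), 0, 1)
    return value

print(f_X(0.3))     # analytically the marginal is x + 1/2, so this prints about 0.8
```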
We just showed that if we know the joint distribution of two random variables, we can
recover their marginal distributions. Can we go the other way? That is, if I have the marginal
distributions of random variables X and Y , can I uniquely recover the joint distribution of X
and Y? The answer is No, as this simple example shows:

9.5.1 Example Suppose X and Y are both Bernoulli random variables with

P (X = 1) = P (Y = 1) = 0.5.

Here are three different joint distributions that give rise to these marginals:

        X = 0   X = 1           X = 0   X = 1           X = 0   X = 1
Y = 1   0.25    0.25    Y = 1   0.20    0.30    Y = 1   0.50    0.00
Y = 0   0.25    0.25    Y = 0   0.30    0.20    Y = 0   0.00    0.50

You can see that there are plenty of other joint distributions that work. □

9.6 The expectation of a sum


I already asserted that the expectation of a sum of random variables was the sum of their
expectations, and proved it for the case of discrete random variables. If the random vector
(X, Y ) has a joint density, then it is straightforward to show that E(X + Y ) = E X + E Y .
Since x + y is a (Borel) function of the vector (x, y), we have (Section 9.4) that
E(X + Y) = ∫∫ (x + y) f_{X,Y}(x, y) dx dy
         = ∫∫ x f_{X,Y}(x, y) dy dx + ∫∫ y f_{X,Y}(x, y) dx dy
         = ∫ x ( ∫ f_{X,Y}(x, y) dy ) dx + ∫ y ( ∫ f_{X,Y}(x, y) dx ) dy
         = ∫ x f_X(x) dx + ∫ y f_Y(y) dy
         = E X + E Y.

9.7 The distribution of a sum


We already know to calculate the expectation of a sum of random variables—since expectation
is a positive linear operator, the expectation of a sum is the sum of the expectations.


We are now in a position to describe the distribution of the sum of two random variables.
Pitman [5]: p. 147. Larsen–Marx [4]: p. 178ff.

Let Z = X + Y.

Discrete case:

P(Z = z) = ∑_{(x,y): x+y=z} p_{X,Y}(x, y) = ∑_x p_{X,Y}(x, z − x).

9.7.1 Density of a sum

If (X, Y ) has joint density fX,Y (x, y), what is the density of X + Y ?
To find the density of a sum, we first find its cumulative distribution function. Now X +Y ⩽ t
if and only if X ⩽ t − Y , so
P(X + Y ⩽ t) = ∫∫_{{(x,y): x ⩽ t−y}} f_{X,Y}(x, y) dx dy = ∫_{−∞}^{∞} ∫_{−∞}^{t−y} f_{X,Y}(x, y) dx dy.

I have written the limits of integration as −∞ and ∞, but the density may well be zero in much of this region, so it helps to pay attention to the nonzero region.

9.7.1 Example Let X and Y be independent Uniform[0, 1] random variables. Their joint density is 1 on the square [0, 1] × [0, 1]. The probability that their sum is ⩽ t is just the area of the part of the square lying below the line x + y = t. For 0 ⩽ t ⩽ 1, this is a triangle with area t²/2. For 1 ⩽ t ⩽ 2, the region is more complicated, but by symmetry it is easy to see that its area is 1 − (2 − t)²/2. So

F_{X+Y}(t) =
  0,                 t ⩽ 0
  t²/2,              0 ⩽ t ⩽ 1
  1 − (2 − t)²/2,    1 ⩽ t ⩽ 2
  1,                 t ⩾ 2.

Recall that the density is the derivative of the cdf, so to find the density we need only differentiate the cumulative distribution function.

9.7.2 Example (continued) The derivative of F_{X+Y} for the example above is

(d/dt) F_{X+Y}(t) =
  0,        t ⩽ 0 or t ⩾ 2
  t,        0 ⩽ t ⩽ 1
  2 − t,    1 ⩽ t ⩽ 2.


More generally, the derivative of the cumulative distribution function is given by (Pitman [5]: pp. 372–373)

f_{X+Y}(t) = (d/dt) P(X + Y ⩽ t) = (d/dt) ∫∫_{{(x,y): x ⩽ t−y}} f_{X,Y}(x, y) dx dy
           = (d/dt) ∫_{−∞}^{∞} ( ∫_{−∞}^{t−y} f_{X,Y}(x, y) dx ) dy
           = ∫_{−∞}^{∞} (d/dt) ( ∫_{−∞}^{t−y} f_{X,Y}(x, y) dx ) dy
           = ∫_{−∞}^{∞} f_{X,Y}(t − y, y) dy.


So if X and Y are independent, we get the convolution


f_{X+Y}(t) = ∫_{−∞}^{∞} f_X(t − y) f_Y(y) dy.
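The convolution formula can be checked numerically. A sketch (assuming scipy is available; it should reproduce the triangular density of Example 9.7.1):

```python
from scipy.integrate import quad

def f_unif(u):
    """Uniform[0, 1] density."""
    return 1.0 if 0 <= u <= 1 else 0.0

def f_sum(t):
    """Density of X + Y at t via the convolution integral."""
    value, _ = quad(lambda y: f_unif(t - y) * f_unif(y), 0, 1)
    return value

for t in (0.5, 1.0, 1.5):
    print(t, f_sum(t))      # expected values 0.5, 1.0, 0.5 from the triangular density
```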

9.8 ⋆ Expectation of a random vector


Since random vectors are just vector-valued functions on a sample space S, we can add them
and multiply them just like any other functions. For example, the sum of random vectors
X = (X1 , . . . , Xn ) and Y = (Y1 , . . . , Yn ) is given by
 
(X + Y)(ω) = X(ω) + Y(ω) = (X_1(ω), . . . , X_n(ω)) + (Y_1(ω), . . . , Y_n(ω)).

Thus the set of random vectors is a vector space. In fact, the subset of random vectors whose
components have a finite expectation is also a vector subspace of the vector space of all random
vectors.
If X = (X1 , . . . , Xn ) is a random vector, and each Xi has expectation E Xi , the expectation
of X is defined to be
E X = (E X1 , . . . , E Xn ).

• Expectation is a linear operator on the space of random vectors. This means that

E(aX + bY ) = a E X + b E Y .

• Expectation is a positive operator on the space of random vectors. For vectors x =


(x1 , . . . , xn ), define x ≧ 0 if xi ⩾ 0 for each i = 1, . . . , n. Then

X ≧ 0 =⇒ E X ≧ 0.

9.9 Covariance
Pitman [5]: § 6.4, p. 430

When X and Y are independent, we proved

Var(X + Y) = Var X + Var Y.

More generally however, since expectation is a positive linear operator,

Var(X + Y) = E((X + Y) − E(X + Y))²
           = E((X − E X) + (Y − E Y))²
           = E((X − E X)² + 2(X − E X)(Y − E Y) + (Y − E Y)²)
           = Var(X) + Var(Y) + 2 E(X − E X)(Y − E Y).

9.9.1 Definition The covariance of X and Y is defined to be

Cov(X, Y ) = E(X − E X)(Y − E Y ). (1)

In general
Var(X + Y ) = Var(X) + Var(Y ) + 2 Cov(X, Y ).


There is another way to write the covariance:

Cov(X, Y ) = E(XY ) − E(X) E(Y ). (2)

Proof : Since expectation is a positive linear operator,


 
Cov(X, Y) = E((X − E(X))(Y − E(Y)))
          = E(XY − X E(Y) − Y E(X) + E(X) E(Y))
          = E(XY) − E(X) E(Y) − E(X) E(Y) + E(X) E(Y)
          = E(XY) − E(X) E(Y).

9.9.2 Remark It follows that for any random variable X,

Cov(X, X) = Var X.

If X and Y are independent, then Cov(X, Y ) = 0.

The converse is not true.

9.9.3 Example (Covariance = 0, but variables are not independent) (Cf. Feller [3,
p. 236])
Let X be a random variable that assumes the values ±1 and ±2, each with probability 1/4.
(E X = 0)
Define Y = X 2 , and let Ȳ = E Y (= 2.5). Then

Cov(X, Y) = E X(Y − Ȳ)
          = (1/4)(1)(1 − Ȳ) + (1/4)(−1)(1 − Ȳ) + (1/4)(2)(4 − Ȳ) + (1/4)(−2)(4 − Ȳ)
          = 0.

But X and Y are not independent:

P(X = 1 & Y = 1) = P(X = 1) = 1/4,

but P(X = 1) = 1/4 and P(Y = 1) = 1/2, so

P(X = 1) · P(Y = 1) = 1/8 ≠ 1/4.

9.9.4 Example (Covariance = 0, but variables are not independent) Let U , V be in-
dependent and identically distributed random variables with E U = E V = 0. Define

X = U + V, Y = U − V.

Since E X = E Y = 0,

Cov(X, Y) = E(XY) = E((U + V)(U − V)) = E(U² − V²) = E U² − E V² = 0

since U and V have the same distribution.


But are X and Y independent?


If U and V are integer-valued, then X and Y are also integer-valued, but more importantly
they have the same parity. That is, X is odd if and only if Y is odd. (This is a handy fact for
KenKen solvers.)
So let U and V be independent and assume the values ±1 and ±2, each with probability
1/4. (E U = E V = 0.) Then
P(X is odd) = P(X is even) = P(Y is odd) = P(Y is even) = 1/2,

but

P(X is even and Y is odd) = 0 ≠ 1/4 = P(X is even) P(Y is odd),
so X and Y are not independent. □
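A simulation sketch of this example (assuming numpy; U and V uniform on {±1, ±2} as in the text) shows the sample covariance near zero while the parity argument pins down the dependence:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
u = rng.choice([-2, -1, 1, 2], size=n)
v = rng.choice([-2, -1, 1, 2], size=n)
x, y = u + v, u - v

print(np.cov(x, y)[0, 1])                      # close to 0

# X and Y always have the same parity, so this event never happens:
print(np.mean((x % 2 == 0) & (y % 2 == 1)))    # 0.0, not 1/4
```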
Pitman [5]: p. 432

9.9.5 Remark The product (X − E X)(Y − E Y) is positive at outcomes ω where X(ω) and
Y (ω) are either both above or both below their means, and negative when one is above and the
other below. So one very loose interpretation of positive covariance is that the random variables
are probably both above average or below average rather than not. Of course this is just a
tendency.

9.9.6 Example (The effects of covariance) For mean zero random variables that have a
positive covariance, the joint density tends to concentrate on the diagonal. Figure 9.2 shows
the joint density of two standard normals with various covariances. Figure 9.4 shows random
samples from these distributions.

9.10 ⋆ A covariance menagerie


Recall that for independent random variables X and Y, Var(X + Y) = Var X + Var Y, and Cov(X, Y) = 0. For any random variable X with finite variance, Var X = E(X²) − (E X)², so E(X²) = Var X + (E X)². Also, if E X = 0, then Cov(X, Y) = E(XY) (Why?).

9.10.1 Theorem (A Covariance Menagerie) Let X_1, . . . , X_n be independent and identically distributed random variables with common mean µ and variance σ². Define

S = ∑_{i=1}^{n} X_i,   and   X̄ = S/n,

and let
Di = Xi − X̄, (i = 1, . . . , n)
be the deviation of Xi from X̄.
Then
1. E(Xi Xj ) = (E Xi )(E Xj ) = µ2 , for i ̸= j (by independence).
2. E(Xi2 ) = σ 2 + µ2 .
3. E(X_i S) = ∑_{j≠i} E(X_i X_j) + E(X_i²) = (n − 1)µ² + σ² + µ² = σ² + nµ².

4. E(Xi X̄) = (σ 2 /n) + µ2 .


5. E(S) = nµ.
6. Var(S) = nσ 2 .


Figure 9.2. Joint density of standard normals, as covariance changes. (Panels: covariance = 0, 0.25, 0.50, 0.75.)

7. E(S 2 ) = nσ 2 + n2 µ2 .
8. E(X̄) = µ.
9. Var(X̄) = σ 2 /n.

10. E(X̄ 2 ) = (σ 2 /n) + µ2 .


11. E(Di ) = 0, i = 1, . . . , n.
12. Var(Di ) = E(Di2 ) = (n − 1)σ 2 /n :

Var(D_i) = E(X_i − X̄)²
         = E(X_i²) − 2 E(X_i X̄) + E(X̄²)
         = (σ² + µ²) − 2((σ²/n) + µ²) + ((σ²/n) + µ²) = (1 − 1/n) σ².

13. Cov(Di , Dj ) = E(Di Dj ) = −σ 2 /n :



E(D_i D_j) = E((X_i − X̄)(X_j − X̄))
           = E(X_i X_j) − E(X_i X̄) − E(X_j X̄) + E(X̄²)
           = µ² − [(σ²/n) + µ²] − [(σ²/n) + µ²] + [(σ²/n) + µ²] = −σ²/n.


Figure 9.3. Contours of the joint density of standard normals, as covariance changes. (Panels: covariance = 0, 0.25, 0.5, 0.75.)

Note that this means that deviations from the mean are negatively correlated. This makes sense,
because if one variate is bigger than the mean, another must be smaller to offset the difference.
14. Cov(Di , S) = E(Di S) = 0.
 
E(D_i S) = E((X_i − (S/n)) S) = E(X_i S) − E(S²)/n
         = (σ² + nµ²) − (nσ² + n²µ²)/n = 0.

15. Cov(Di , X̄) = E(Di X̄) = E(Di S)/n = 0.


The proof of each is a straightforward plug-and-chug calculation. The only reason for writing
this as a theorem is to be able to refer to it easily.
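Each item can also be spot-checked by simulation. A sketch (assuming numpy; n = 5 i.i.d. standard normals, so µ = 0 and σ² = 1, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 5, 200_000
X = rng.standard_normal((reps, n))          # each row is X_1, ..., X_n, i.i.d. N(0, 1)
Xbar = X.mean(axis=1, keepdims=True)
D = X - Xbar                                # deviations D_i = X_i - Xbar

print(np.var(D[:, 0]))                      # item 12: (1 - 1/n) * sigma^2 = 0.8
print(np.mean(D[:, 0] * D[:, 1]))           # item 13: -sigma^2 / n = -0.2
print(np.mean(D[:, 0] * X.sum(axis=1)))     # item 14: Cov(D_i, S) = 0
```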


Figure 9.4. Random samples from standard normals, as covariance changes. (Panels: covariance = 0, 0.25, 0.5, 0.75, 0.95, −0.95.)


9.11 ⋆ Covariance matrix of a random vector


In general, we define the covariance matrix of a random vector X to be the matrix whose (i, j) entry is E(X_i − E X_i)(X_j − E X_j) = Cov(X_i, X_j); that is,

Cov X = [Cov(X_i, X_j)]_{i,j}.

9.12 ⋆ Variance of a linear combination of random variables


9.12.1 Proposition Let X = (X_1, . . . , X_n) be a random vector with covariance matrix

Σ = [Cov(X_i, X_j)]_{i,j},

and let a = (a_1, . . . , a_n). The random variable

Z = a · X = a′X = ∑_{i=1}^{n} a_i X_i

has variance given by

Var Z = a′Σa = ∑_{i=1}^{n} ∑_{j=1}^{n} Cov(X_i, X_j) a_i a_j,

where a is treated as a column vector, and a′ is its transpose, a row vector.

Proof: This just uses the fact that expectation is a positive linear operator. Since adding constants does not change variances, we may subtract means and assume that each E X_i = 0. Then Cov(X_i, X_j) = E(X_i X_j), and Z has mean 0, so

Var Z = E Z² = E( (∑_{i=1}^{n} a_i X_i)(∑_{j=1}^{n} a_j X_j) )
      = E ∑_{i=1}^{n} ∑_{j=1}^{n} X_i X_j a_i a_j = ∑_{i=1}^{n} ∑_{j=1}^{n} E(X_i X_j) a_i a_j
      = ∑_{i=1}^{n} ∑_{j=1}^{n} Cov(X_i, X_j) a_i a_j.

Since Var Z ⩾ 0 and a is arbitrary, we see that Cov X is a positive semidefinite matrix.
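A numeric sketch of the proposition (assuming numpy; the particular Σ and a below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# A hypothetical covariance matrix (positive definite) and weight vector.
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.0, 0.2],
                  [0.3, 0.2, 0.5]])
a = np.array([1.0, -2.0, 0.5])

print(a @ Sigma @ a)                        # Var(a . X) = a' Sigma a

# Compare with the sample variance of a . X when X ~ N(0, Sigma).
X = rng.multivariate_normal(np.zeros(3), Sigma, size=200_000)
print(np.var(X @ a))                        # approximately the same number
```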

9.13 ⋆ An inner product for random variables


Since random variables are just functions on the probability space (Ω, F, P ), the set of random
variables is a vector space under the usual operations of addition of functions and multiplication
by scalars. The collection of random variables that have finite variance is a linear subspace, often
denoted L2 (P ), and it has a natural inner product.


9.13.1 Fact (Inner product on the space L2 (P ) of random variables) Let L2 (P ) de-
note the linear space of random variables that have finite variance. Then

(X, Y ) = E XY

is a real inner product on L2 (P ).


The proof of this is straightforward, and is essentially the same as the proof that the Euclidean inner product on R^m is an inner product.
The next result is just the Cauchy–Schwarz Inequality for this inner product, but I’ve written out a self-contained proof for you.

9.13.2 Cauchy–Schwarz Inequality

(E(XY))² ⩽ (E X²)(E Y²),   (3)

with equality only if X and Y are linearly dependent, that is, only if there exist a, b not both zero such that aX + bY = 0 a.s.

Proof: If either X or Y is zero a.s., then we have equality, so assume X, Y are nonzero. Define the quadratic polynomial Q : R → R by

Q(λ) = E((λX + Y)²) ⩾ 0.

Since expectation is a positive linear operator,

Q(λ) = E(X²)λ² + 2 E(XY)λ + E(Y²).

Since this is always ⩾ 0, the discriminant of the quadratic polynomial Q(λ) is nonpositive,¹ that is, 4(E(XY))² − 4 E(X²) E(Y²) ⩽ 0, or (E(XY))² ⩽ E(X²) E(Y²). Equality in (3) can occur only if the discriminant is zero, in which case Q has a real root. That is, there is some λ for which Q(λ) = E((λX + Y)²) = 0. But this implies that λX + Y = 0 (almost surely).

9.13.3 Corollary

|Cov(X, Y)|² ⩽ Var X Var Y.   (4)

Proof: Apply the Cauchy–Schwarz inequality to the random variables X − E X and Y − E Y. Taking square roots gives the equivalent form |Cov(X, Y)| ⩽ (SD X)(SD Y).

9.14 Covariance is bilinear


Since expectation is a positive linear operator, it is routine to show that

Cov(aX + bY, cZ + dW ) = ac Cov(X, Z) + bc Cov(Y, Z) + ad Cov(X, W ) + bd Cov(Y, W ).


¹ In case you have forgotten how you derived the quadratic formula in Algebra I, rewrite the polynomial as

Q(z) = αz² + βz + γ = (1/α)(αz + β/2)² − (β² − 4αγ)/(4α),

and note that the only way to guarantee that Q(z) ⩾ 0 for all z is to have α > 0 and β² − 4αγ ⩽ 0.


9.15 Correlation

9.15.1 Definition The correlation between X and Y is defined to be


Corr(X, Y) = Cov(X, Y) / ((SD X)(SD Y)).

It is also equal to

Corr(X, Y) = Cov(X*, Y*) = E(X* Y*),

where X* and Y* are the standardizations of X and Y.

Let X have mean µ_X and standard deviation σ_X, and ditto for Y. Recall that

X* = (X − µ_X)/σ_X

has mean 0 and std. dev. 1. Thus by the alternate formula for covariance

Cov(X*, Y*) = E(X* Y*) − E(X*) E(Y*) = E(X* Y*),

since E(X*) = E(Y*) = 0. Now

E(X* Y*) = E( ((X − µ_X)/σ_X) · ((Y − µ_Y)/σ_Y) )
         = (E(XY) − E(X) E(Y)) / (σ_X σ_Y)
         = Corr(X, Y).

Pitman [5]: p. 433

Corollary 9.13.3 (the Cauchy–Schwarz Inequality) implies:

−1 ⩽ Corr(X, Y) ⩽ 1.

If the correlation between X and Y is zero, then the random variables X −E X and Y −E Y
are orthogonal in our inner product.
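A quick numeric sketch (assuming numpy; the linear-plus-noise construction of Y from X is an arbitrary way to produce correlated samples):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)
y = 0.7 * x + rng.standard_normal(100_000)       # correlated with x by construction

standardize = lambda z: (z - z.mean()) / z.std()

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
print(cov_xy / (x.std() * y.std()))              # Corr(X, Y) as Cov / (SD * SD)
print(np.mean(standardize(x) * standardize(y)))  # the same number, as E(X* Y*)
```

Both numbers agree and lie in [−1, 1], as Corollary 9.13.3 requires.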

9.16 Linear transformations of random vectors


Let X = (X_1, . . . , X_n)′ be a random vector, written as a column. Define

µ = E X = (E X_1, . . . , E X_n)′.


Define the covariance matrix of X by

Var X = [Cov(X_i, X_j)] = [E(X_i − µ_i)(X_j − µ_j)] = [σ_{ij}] = E((X − µ)(X − µ)′).

Let A = [a_{ij}] be an m × n matrix of constants, and let

Y = AX

Then, since expectation is a positive linear operator,

E Y = Aµ.

Moreover
Var Y = A(Var X)A′,

since Var Y = E((AX − Aµ)(AX − Aµ)′) = E(A(X − µ)(X − µ)′A′) = A(Var X)A′.
The covariance matrix Σ of a random vector Y = (Y1 , . . . , Yn ) is always positive semidefinite,
since for any vector w of weights, w′ Σw is the variance of the random variable w′ Y , and
variances are always nonnegative.
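The identity Var Y = A(Var X)A′ can be checked numerically in the same spirit (a sketch assuming numpy; the matrices A and Σ below are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(0)

Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.0, 0.2],
                  [0.3, 0.2, 0.5]])      # a hypothetical Var X
A = np.array([[1.0, 0.0, -1.0],
              [2.0, 1.0,  0.5]])         # a 2 x 3 matrix of constants

X = rng.multivariate_normal(np.zeros(3), Sigma, size=200_000)
Y = X @ A.T                              # rows are samples of Y = AX

print(A @ Sigma @ A.T)                   # Var Y = A (Var X) A'
print(np.cov(Y, rowvar=False))           # sample covariance, approximately equal
```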

Bibliography
[1] C. D. Aliprantis and K. C. Border. 2006. Infinite dimensional analysis, 3d. ed. Berlin:
Springer–Verlag.
[2] T. M. Apostol. 1967. Calculus, Volume I: One-variable calculus with an introduction to
linear algebra, 2d. ed. New York: John Wiley & Sons.
[3] W. Feller. 1968. An introduction to probability theory and its applications, 3d. ed., volume 1.
New York: Wiley.
[4] R. J. Larsen and M. L. Marx. 2012. An introduction to mathematical statistics and its
applications, fifth ed. Boston: Prentice Hall.
[5] J. Pitman. 1993. Probability. Springer Texts in Statistics. New York, Berlin, and Heidelberg:
Springer.

