Ma 3/103: Introduction to Probability and Statistics, Winter 2021
KC Border
Lecture 9: Transformations; Joint Distributions
Let X be a random variable and let Y = g(X) for some function g. The density of Y is

f_Y(y) = (d/dy) P(g(X) ⩽ y),

provided the derivative exists.
There are two special cases of particular interest. If g′(x) > 0 for all x, so that g is strictly increasing, then g has an inverse, and

g(X) ⩽ y ⇐⇒ X ⩽ g⁻¹(y),

so

F_Y(y) = F_X(g⁻¹(y))

is the cumulative distribution function of Y. The density f_Y of Y is found by differentiating this.
Similarly, if g′(x) < 0 for all x, then g is strictly decreasing, and

Y ⩽ y ⇐⇒ g(X) ⩽ y ⇐⇒ X ⩾ g⁻¹(y),

so F_Y(y) = 1 − F_X(g⁻¹(y)). In either case, by the Inverse Function Theorem,

(d/dy) g⁻¹(y) = 1 / g′(g⁻¹(y)).
Suppose X has density f_X, and let

Y = g(X).

If g is everywhere differentiable and either g′(x) > 0 for all x in the range of X, or g′(x) < 0 for all x in the range of X, then Y has a density f_Y given by

f_Y(y) = f_X(g⁻¹(y)) / |g′(g⁻¹(y))|.
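To see the formula in action, here is a small simulation sketch (added for illustration, not from the original notes; it assumes NumPy, and the choice g(x) = √x with X exponentially distributed is simply an example). It compares a histogram of Y = g(X) with the density predicted by the formula.

import numpy as np

# Monte Carlo check of f_Y(y) = f_X(g^{-1}(y)) / |g'(g^{-1}(y))|
# for the illustrative choice g(x) = sqrt(x) with X ~ Exponential(1),
# so f_X(x) = exp(-x), g^{-1}(y) = y^2, and g'(g^{-1}(y)) = 1/(2y).
rng = np.random.default_rng(0)
x = rng.exponential(1.0, size=1_000_000)
y = np.sqrt(x)                                  # Y = g(X)

# Empirical density of Y from a histogram.
counts, edges = np.histogram(y, bins=100, range=(0.0, 4.0), density=True)
mids = 0.5 * (edges[:-1] + edges[1:])

# Predicted density: f_X(y^2) / (1/(2y)) = 2 y exp(-y^2).
predicted = np.exp(-mids**2) * 2 * mids

print(np.abs(counts - predicted).max())         # small (Monte Carlo and binning error)

Any other differentiable, strictly monotone g could be checked the same way.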
9.1.2 Example (Change of scale and location) We say that the random variable Y is a change of location and scale of the random variable X if

Y = aX + b,   or equivalently,   X = (Y − b)/a,

where a > 0. If X ∼ F_X with density f_X, then

F_Y(y) = P(Y ⩽ y) = P(aX + b ⩽ y) = P(X ⩽ (y − b)/a) = F_X((y − b)/a),

so

f_Y(y) = (1/a) f_X((y − b)/a).
□
9.1.3 Example Let X ∼ U (0, 1), and let Y = 2X − 1. Then by Example 9.1.2 we have
f_X(x) = 1 for 0 ⩽ x ⩽ 1 (and 0 otherwise),   so   f_Y(y) = 1/2 for −1 ⩽ y ⩽ 1 (and 0 otherwise). □
9.1.4 Example Let Z be a standard Normal random variable, with density

f_Z(z) = (1/√(2π)) e^{−z²/2},

and let Y = σZ + µ. By Example 9.1.2 we have

f_Y(y) = (1/σ) f_Z((y − µ)/σ) = (1/(√(2π) σ)) e^{−½((y−µ)/σ)²},

which is the N(µ, σ²) density.
□
9.1.5 Example Let X ∼ U[0, 1] (so f(x) = 1 for 0 ⩽ x ⩽ 1). Let

g(x) = x^a,

so

g′(x) = a x^{a−1}.

Now

g⁻¹(y) = y^{1/a},

and as x ranges over the unit interval, so does y. To apply the inverse function theorem, use

(d/dy) g⁻¹(y) = 1 / g′(g⁻¹(y)) = 1 / (a (g⁻¹(y))^{a−1}) = 1 / (a y^{(a−1)/a}).

So the density of g(X) = X^a is given by f(g⁻¹(y)) · (d/dy) g⁻¹(y), or

f_{X^a}(y) = 1 / (a y^{(a−1)/a}) for 0 ⩽ y ⩽ 1, and 0 otherwise.
□
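As a cross-check (an addition to the notes, assuming a > 0 so that x ↦ x^a is increasing on [0, 1]), one can also compute the cdf of X^a directly and differentiate:

F_{X^a}(y) = P(X^a ⩽ y) = P(X ⩽ y^{1/a}) = y^{1/a},   0 ⩽ y ⩽ 1,

so

f_{X^a}(y) = (d/dy) y^{1/a} = (1/a) y^{1/a − 1} = 1/(a y^{(a−1)/a}),

which agrees with the change-of-variables formula.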
Even if g is not strictly increasing or decreasing, if we can find a nice expression for FY , we
may still be able to find the density of Y .
For example, suppose X ∼ U[−1, 1] and Y = X², so that g(x) = x² is neither increasing nor decreasing on [−1, 1]. Then F_Y(y) = P(−√y ⩽ X ⩽ √y) = √y for 0 ⩽ y ⩽ 1, so

f_Y(y) = (d/dy) √y = 1/(2√y)   (0 ⩽ y ⩽ 1).
Recall that the distribution of a random variable X determines P(X ∈ I) for every interval in R. The distribution is enough to calculate the expectation of any (Borel) function of X.

Now suppose I have more than one random variable on the same sample space. Then I can consider the random vector (X, Y) or X = (X_1, . . . , X_n). Its distribution assigns a probability to each set of the form I_1 × · · · × I_n, where each I_i is an interval. This distribution is also called the joint distribution of X_1, . . . , X_n.
• Given a joint cumulative distribution function we can recover the joint distribution. For instance, suppose (X, Y) has joint cumulative distribution function F. The probability of the rectangle (a_1, b_1] × (a_2, b_2] is given by

P(a_1 < X ⩽ b_1, a_2 < Y ⩽ b_2) = F(b_1, b_2) − F(a_1, b_2) − F(b_1, a_2) + F(a_1, a_2).

The probability is computed by using the inclusion/exclusion principle to compute the probability of the union and subtracting it from P((X, Y) ≦ (b_1, b_2)).

[Figure: the rectangle (a_1, b_1] × (a_2, b_2] with corners (a_1, a_2), (b_1, a_2), (a_1, b_2), (b_1, b_2).]

There are higher-dimensional versions, but the expressions are complicated (see, e.g., [1, pp. 394–395]). In this class we shall mostly deal with joint densities.
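As an added illustration (not from the notes), here is a small numerical check of the rectangle formula, assuming X and Y are independent U(0, 1) so that F(x, y) = xy on the unit square.

import numpy as np

# Joint cdf of two independent U(0,1) random variables (an assumed example).
def F(x, y):
    return np.clip(x, 0, 1) * np.clip(y, 0, 1)

a1, b1, a2, b2 = 0.2, 0.7, 0.1, 0.5

# Rectangle probability via inclusion/exclusion on the cdf.
p_cdf = F(b1, b2) - F(a1, b2) - F(b1, a2) + F(a1, a2)

# Monte Carlo estimate for comparison.
rng = np.random.default_rng(0)
x, y = rng.random(1_000_000), rng.random(1_000_000)
p_mc = np.mean((a1 < x) & (x <= b1) & (a2 < y) & (y <= b2))

print(p_cdf, p_mc)   # both close to (b1 - a1) * (b2 - a2) = 0.2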
• N.B. While the subscript X,Y or X is often used to identify a joint distribution or cumulative
distribution function, it is also frequently omitted. You are supposed to figure out the domain
of the function by inspecting its arguments.
9.2.1 Example Let S = {SS, SF, F S, F F } and let P be the probability measure on S defined
by
P(SS) = 7/12,   P(SF) = 3/12,   P(FS) = 1/12,   P(FF) = 1/12.
Define the random variables X and Y by

X(SS) = X(SF) = 1,  X(FS) = X(FF) = 0,   and   Y(SS) = Y(FS) = 1,  Y(SF) = Y(FF) = 0.
That is, X and Y indicate Success or Failure on two different experiments, but the experiments
are not necessarily independent.
Then
P_X(1) = 10/12,   P_X(0) = 2/12,
P_Y(1) = 8/12,   P_Y(0) = 4/12,
P_{X,Y}(1, 1) = 7/12,   P_{X,Y}(1, 0) = 3/12,   P_{X,Y}(0, 1) = 1/12,   P_{X,Y}(0, 0) = 1/12.
□
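Here is a quick computational rendering of this example (an added sketch in plain Python), recovering the marginal pmfs from the joint pmf and confirming that X and Y are not independent.

from fractions import Fraction as Fr

# Joint pmf of (X, Y) from Example 9.2.1.
p_xy = {(1, 1): Fr(7, 12), (1, 0): Fr(3, 12), (0, 1): Fr(1, 12), (0, 0): Fr(1, 12)}

# Marginal pmfs: sum the joint pmf over the other variable.
p_x = {x: sum(p for (a, b), p in p_xy.items() if a == x) for x in (0, 1)}
p_y = {y: sum(p for (a, b), p in p_xy.items() if b == y) for y in (0, 1)}

print(p_x[1], p_x[0])                  # 5/6 1/6   (i.e., 10/12 and 2/12)
print(p_y[1], p_y[0])                  # 2/3 1/3   (i.e., 8/12 and 4/12)

# Not independent: P_{X,Y}(1,1) differs from P_X(1) * P_Y(1).
print(p_xy[(1, 1)], p_x[1] * p_y[1])   # 7/12 vs 5/9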
Likewise, the marginal pmf of Y is obtained from the joint pmf by summing over the values of X:

p_Y(y) = P(Y = y) = Σ_x p_{X,Y}(x, y).
If X and Y are independent random variables, then pX,Y (x, y) = pX (x)pY (y).
For the density case, the marginal density of X, denoted f_X, is given by (Larsen–Marx [4], p. 169)

f_X(x) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dy,

and similarly for the marginal density of Y.
9.5.1 Example Suppose X and Y are both Bernoulli random variables with
P (X = 1) = P (Y = 1) = 0.5.
There are many different joint distributions that give rise to these marginals; three are exhibited in the sketch below. You can see that there are plenty of others that work. □
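The following added sketch is one way to see this (these need not be the tables from the original notes): every joint pmf on {0, 1}² with both marginals Bernoulli(1/2) is determined by θ = P(X = 1, Y = 1) ∈ [0, 1/2], and the three values of θ printed below give the independent, perfectly positively dependent, and perfectly negatively dependent cases.

import numpy as np

def joint_from_theta(theta):
    """Joint pmf on {0,1}^2 with both marginals Bernoulli(1/2),
    parameterized by theta = P(X=1, Y=1) in [0, 1/2].
    Rows are X = 1, 0; columns are Y = 1, 0."""
    assert 0 <= theta <= 0.5
    return np.array([[theta,       0.5 - theta],
                     [0.5 - theta, theta      ]])

for theta in (0.25, 0.5, 0.0):
    p = joint_from_theta(theta)
    print(theta, p.sum(axis=1), p.sum(axis=0))   # both marginals are [0.5, 0.5]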
Using the joint distribution, one can show that the expectation of a sum is the sum of the expectations:

E(X + Y) = E X + E Y.
We are now in a position to describe the distribution of the sum of two random variables (Pitman [5], p. 147; Larsen–Marx [4], p. 178ff.). Let Z = X + Y.

Discrete case:

P(Z = z) = Σ_{(x,y): x+y=z} p_{X,Y}(x, y) = Σ_{all x} p_{X,Y}(x, z − x).
If (X, Y ) has joint density fX,Y (x, y), what is the density of X + Y ?
To find the density of a sum, we first find its cumulative distribution function. Now X +Y ⩽ t
if and only if X ⩽ t − Y , so
P(X + Y ⩽ t) = ∬_{{(x,y): x ⩽ t−y}} f_{X,Y}(x, y) dx dy = ∫_{−∞}^{∞} ∫_{−∞}^{t−y} f_{X,Y}(x, y) dx dy.

I have written the limits of integration as −∞ and ∞, but the density may well be zero in much of this region, so it helps to pay attention to the nonzero region.
9.7.1 Example Let X and Y be independent Uniform[0, 1] random variables. Their joint density is 1 on the square [0, 1] × [0, 1]. The probability that their sum is ⩽ t is just the area of the part of the square lying below the line x + y = t. For 0 ⩽ t ⩽ 1, this is a triangle with area t²/2. For 1 ⩽ t ⩽ 2, the region is more complicated, but by symmetry it is easy to see its area is 1 − (2 − t)²/2. So

F_{X+Y}(t) =
    0,                 t ⩽ 0
    t²/2,              0 ⩽ t ⩽ 1
    1 − (2 − t)²/2,    1 ⩽ t ⩽ 2
    1,                 t ⩾ 2.
□
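A quick Monte Carlo check of this cdf (an added sketch, assuming NumPy):

import numpy as np

rng = np.random.default_rng(0)
x = rng.random(1_000_000)
y = rng.random(1_000_000)
s = x + y

def F(t):
    """Cdf of the sum of two independent Uniform[0,1] random variables."""
    if t <= 0:
        return 0.0
    if t <= 1:
        return t**2 / 2
    if t <= 2:
        return 1 - (2 - t)**2 / 2
    return 1.0

for t in (0.3, 1.0, 1.6):
    print(t, np.mean(s <= t), F(t))   # empirical and exact values agree closely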
Recall that the density is the derivative of the cdf, so to find the density we need only differentiate the cumulative distribution function.
9.7.2 Example (continued) The derivative of F_{X+Y} for the example above is

(d/dt) F_{X+Y}(t) =
    0,        t ⩽ 0 or t ⩾ 2
    t,        0 ⩽ t ⩽ 1
    2 − t,    1 ⩽ t ⩽ 2.
□
More generally, the derivative of the cumulative distribution function is given by (Pitman [5], pp. 372–373)

f_{X+Y}(t) = (d/dt) P(X + Y ⩽ t) = (d/dt) ∬_{{(x,y): x ⩽ t−y}} f_{X,Y}(x, y) dx dy
           = (d/dt) ∫_{−∞}^{∞} ∫_{−∞}^{t−y} f_{X,Y}(x, y) dx dy
           = ∫_{−∞}^{∞} (d/dt) ∫_{−∞}^{t−y} f_{X,Y}(x, y) dx dy
           = ∫_{−∞}^{∞} f_{X,Y}(t − y, y) dy.
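When X and Y are independent, f_{X,Y}(t − y, y) = f_X(t − y) f_Y(y), so the last integral is the convolution of the two marginal densities. Here is a rough numerical check (an added sketch, assuming NumPy) against the triangular density found in Example 9.7.2.

import numpy as np

def f_uniform(u):
    """Density of Uniform[0,1], vectorized."""
    return ((0 <= u) & (u <= 1)).astype(float)

def f_sum(t, n=200_001):
    """Approximate the convolution integral of f_X(t - y) f_Y(y) over y in [0, 1]."""
    y = np.linspace(0.0, 1.0, n)
    # The integrand vanishes outside [0, 1], and the interval has length 1,
    # so the integral is approximately the average value of the integrand.
    return float(np.mean(f_uniform(t - y) * f_uniform(y)))

for t in (0.5, 1.0, 1.5):
    exact = t if t <= 1 else 2 - t      # triangular density from Example 9.7.2
    print(t, round(f_sum(t), 3), exact)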
The set of random vectors on a common sample space is a vector space under coordinatewise addition and scalar multiplication. In fact, the subset of random vectors whose components have a finite expectation is a vector subspace of the vector space of all random vectors.
If X = (X1 , . . . , Xn ) is a random vector, and each Xi has expectation E Xi , the expectation
of X is defined to be
E X = (E X1 , . . . , E Xn ).
• Expectation is a linear operator on the space of random vectors. This means that
E(aX + bY ) = a E X + b E Y .
• Moreover, expectation is a positive operator: X ≧ 0 =⇒ E X ≧ 0 (where the inequalities hold coordinatewise).
9.9 Covariance

When X and Y are independent, we proved (Pitman [5], § 6.4, p. 430) that

Var(X + Y) = Var X + Var Y.

In general,

Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y).
Recall that the covariance of X and Y is defined by Cov(X, Y) = E[(X − E X)(Y − E Y)], which also equals E(XY) − (E X)(E Y). In particular,

Cov(X, X) = Var X.
9.9.3 Example (Covariance = 0, but variables are not independent) (Cf. Feller [3,
p. 236])
Let X be a random variable that assumes the values ±1 and ±2, each with probability 1/4.
(E X = 0)
Define Y = X², and let Ȳ = E Y (= 2.5). Then
Cov(X, Y) = E[X(Y − Ȳ)]
          = (1/4)·1·(1 − Ȳ) + (1/4)·(−1)·(1 − Ȳ) + (1/4)·2·(4 − Ȳ) + (1/4)·(−2)·(4 − Ȳ)
          = 0.
But X and Y are not independent. For instance,

P(X = 1 & Y = 1) = P(X = 1) = 1/4,   while   P(X = 1) · P(Y = 1) = (1/4)(1/2) = 1/8. □
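A direct computational check of this example (an added sketch using Python's fractions module for exact arithmetic):

from fractions import Fraction as Fr

xs = [1, -1, 2, -2]
p = Fr(1, 4)                       # each value of X has probability 1/4

EX  = sum(p * x for x in xs)       # 0
EY  = sum(p * x**2 for x in xs)    # 5/2
EXY = sum(p * x * x**2 for x in xs)

cov = EXY - EX * EY
print(EX, EY, cov)                 # 0 5/2 0

# Yet X and Y = X^2 are clearly dependent:
p_x1_y1 = p                        # P(X=1, Y=1) = P(X=1) = 1/4
p_y1 = Fr(1, 2)                    # P(Y=1) = P(X=1) + P(X=-1)
print(p_x1_y1, p * p_y1)           # 1/4 vs 1/8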
9.9.4 Example (Covariance = 0, but variables are not independent) Let U , V be in-
dependent and identically distributed random variables with E U = E V = 0. Define
X = U + V, Y = U − V.
Since E X = E Y = 0,
Cov(X, Y) = E(XY) = E[(U + V)(U − V)] = E(U² − V²) = E U² − E V² = 0,

since U and V are identically distributed.
9.9.6 Example (The effects of covariance) For mean zero random variables that have a
positive covariance, the joint density tends to concentrate on the diagonal. Figure 9.2 shows
the joint density of two standard normals with various covariances. Figure 9.4 shows random
samples from these distributions.
□
Let X_1, . . . , X_n be independent and identically distributed random variables with common mean µ and variance σ². Let S = X_1 + · · · + X_n be their sum, let X̄ = S/n be their average, and let

D_i = X_i − X̄,   (i = 1, . . . , n)

be the deviation of X_i from X̄.
Then
1. E(X_i X_j) = (E X_i)(E X_j) = µ², for i ≠ j (by independence).
2. E(X_i²) = σ² + µ².
3. E(X_i S) = Σ_{j≠i} E(X_i X_j) + E(X_i²) = (n − 1)µ² + (σ² + µ²) = σ² + nµ².
7. E(S²) = nσ² + n²µ².
8. E(X̄) = µ.
9. Var(X̄) = σ²/n.
[Figure 9.3: Contours of the joint density of standard normals, as covariance changes.]
Note that this means that deviations from the mean are negatively correlated. This makes sense,
because if one variate is bigger than the mean, another must be smaller to offset the difference.
14. Cov(Di , S) = E(Di S) = 0.
E(D_i S) = E[(X_i − S/n) S] = E(X_i S) − E(S²)/n = (σ² + nµ²) − (nσ² + n²µ²)/n = 0.
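A simulation sketch (added; it assumes NumPy, and the normal population is an arbitrary choice) illustrating items 8, 9, and 14:

import numpy as np

rng = np.random.default_rng(0)
n, reps = 5, 200_000
mu, sigma = 2.0, 3.0

# reps independent samples of size n from an (arbitrary) normal population.
X = rng.normal(mu, sigma, size=(reps, n))
S = X.sum(axis=1)
Xbar = S / n
D1 = X[:, 0] - Xbar                 # deviation of X_1 from the sample mean

print(Xbar.mean())                  # approx mu            (item 8)
print(Xbar.var())                   # approx sigma**2 / n  (item 9)
print(np.cov(D1, S)[0, 1])          # approx 0             (item 14)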
Let X = (X_1, . . . , X_n) be a random vector, let a be a vector of constants, and consider the linear combination

Z = a · X = a′X = Σ_{i=1}^n a_i X_i.

Then Var Z = Σ_{i=1}^n Σ_{j=1}^n Cov(X_i, X_j) a_i a_j = a′(Cov X)a.
Proof: This just uses the fact that expectation is a positive linear operator. Since adding constants doesn't change variance, we may subtract means and assume that each E X_i = 0. Then Cov(X_i, X_j) = E(X_i X_j), and Z has mean 0, so
Var Z = E Z² = E[(Σ_{i=1}^n a_i X_i)(Σ_{j=1}^n a_j X_j)]
      = E[Σ_{i=1}^n Σ_{j=1}^n X_i X_j a_i a_j] = Σ_{i=1}^n Σ_{j=1}^n E(X_i X_j) a_i a_j
      = Σ_{i=1}^n Σ_{j=1}^n Cov(X_i, X_j) a_i a_j.
Since Var Z ⩾ 0 and a is arbitrary, we see that Cov X is a positive semidefinite matrix.
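A numerical sketch of this identity (added; assumes NumPy, and the particular covariance matrix and weight vector are arbitrary examples):

import numpy as np

rng = np.random.default_rng(0)

# An example covariance matrix (positive semidefinite by construction).
B = rng.normal(size=(3, 3))
Sigma = B @ B.T

# Many samples of X ~ N(0, Sigma) and a fixed weight vector a.
X = rng.multivariate_normal(np.zeros(3), Sigma, size=500_000)
a = np.array([1.0, -2.0, 0.5])

Z = X @ a
print(Z.var())          # empirical Var(a . X)
print(a @ Sigma @ a)    # a' Sigma a -- the two agree up to sampling error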
9.13.1 Fact (Inner product on the space L²(P) of random variables) Let L²(P) denote the linear space of random variables that have finite variance. Then

(X, Y) = E XY

defines an inner product on L²(P), and the Cauchy–Schwarz Inequality holds:

E(XY)² ⩽ E(X²) E(Y²),    (3)

with equality only if X and Y are linearly dependent, that is, only if there exist a, b not both zero such that aX + bY = 0 a.s.
Proof: If either X or Y is zero a.s., then we have equality, so assume X, Y are nonzero. Define the quadratic polynomial Q : R → R by

Q(λ) = E[(λX + Y)²] ⩾ 0.

Since this is always ⩾ 0, the discriminant of the quadratic polynomial Q(λ) is nonpositive,¹ that is, 4 E(XY)² − 4 E(X²) E(Y²) ⩽ 0, or E(XY)² ⩽ E(X²) E(Y²). Equality in (3) can occur only if the discriminant is zero, in which case Q has a real root. That is, there is some λ for which Q(λ) = E[(λX + Y)²] = 0. But this implies that λX + Y = 0 (almost surely).
9.13.3 Corollary

|Cov(X, Y)|² ⩽ Var X · Var Y.    (4)
¹ Write Q(z) = αz² + βz + γ, and note that the only way to guarantee that Q(z) ⩾ 0 for all z is to have α > 0 and β² − 4αγ ⩽ 0.
9.15 Correlation

The correlation between X and Y is defined by

Corr(X, Y) = Cov(X, Y) / (σ_X σ_Y),

where σ_X and σ_Y are the standard deviations of X and Y. It is also equal to

Corr(X, Y) = Cov(X*, Y*) = E(X* Y*),

where X* and Y* are the standardizations of X and Y.
Let X have mean µ_X and standard deviation σ_X, and ditto for Y. Recall that

X* = (X − µ_X)/σ_X

has mean 0 and std. dev. 1. Thus by the alternate formula for covariance,

Cov(X*, Y*) = E(X* Y*) − (E X*)(E Y*) = E(X* Y*).
Now
E(X* Y*) = E[((X − µ_X)/σ_X) · ((Y − µ_Y)/σ_Y)]
         = (E(XY) − E(X) E(Y)) / (σ_X σ_Y)
         = Corr(X, Y).
Corollary 9.13.3 (the Cauchy–Schwarz Inequality) implies (Pitman [5], p. 433) that

−1 ⩽ Corr(X, Y) ⩽ 1.
If the correlation between X and Y is zero, then the random variables X −E X and Y −E Y
are orthogonal in our inner product.
Let

A =
  [ a_11  · · ·  a_1n ]
  [   ⋮            ⋮  ]
  [ a_m1  · · ·  a_mn ]

be an m × n matrix of constants, and let

Y = AX,

where X is an n-dimensional random vector with mean vector µ = E X. Then

E Y = Aµ.
Moreover,

Var Y = A(Var X)A′,

since Var Y = E[(AX − Aµ)(AX − Aµ)′] = E[A(X − µ)(X − µ)′A′] = A(Var X)A′.
The covariance matrix Σ of a random vector Y = (Y1 , . . . , Yn ) is always positive semidefinite,
since for any vector w of weights, w′ Σw is the variance of the random variable w′ Y , and
variances are always nonnegative.
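A short numerical illustration of Var Y = A(Var X)A′ (added; assumes NumPy, with arbitrary example matrices):

import numpy as np

rng = np.random.default_rng(1)

# Arbitrary example: a 2x3 constant matrix A and a 3x3 covariance matrix for X.
A = np.array([[1.0,  2.0, 0.0],
              [0.0, -1.0, 3.0]])
B = rng.normal(size=(3, 3))
Sigma_X = B @ B.T                      # positive semidefinite by construction

X = rng.multivariate_normal(np.zeros(3), Sigma_X, size=500_000)
Y = X @ A.T                            # each row is A x for one sample x

print(np.cov(Y, rowvar=False))         # empirical covariance matrix of Y
print(A @ Sigma_X @ A.T)               # A (Var X) A' -- agrees up to sampling error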
Bibliography
[1] C. D. Aliprantis and K. C. Border. 2006. Infinite dimensional analysis, 3d. ed. Berlin:
Springer–Verlag.
[2] T. M. Apostol. 1967. Calculus, Volume I: One-variable calculus with an introduction to
linear algebra, 2d. ed. New York: John Wiley & Sons.
[3] W. Feller. 1968. An introduction to probability theory and its applications, 3d. ed., volume 1.
New York: Wiley.
[4] R. J. Larsen and M. L. Marx. 2012. An introduction to mathematical statistics and its
applications, fifth ed. Boston: Prentice Hall.
[5] J. Pitman. 1993. Probability. Springer Texts in Statistics. New York, Berlin, and Heidelberg:
Springer.