
Random Vectors

Paolo Zacchia

Probability and Statistics

Lecture 3
Random vectors
It is important to analyze how different random variables relate
to one another. The starting point is the following concept.

Definition 1
Random Vector. A random vector x of length K is a collection of
K random variables X1 , . . . , XK :
 
\[
\boldsymbol{x} = \begin{pmatrix} X_1 \\ \vdots \\ X_K \end{pmatrix}
\]
each with support $\mathcal{X}_k \subseteq \mathbb{R}$ for $k = 1, \dots, K$.


The realizations of random vectors are denoted here with bold, roman lower-case letters, e.g. $\mathbf{x}$:
\[
\mathbf{x} = \begin{pmatrix} x_1 \\ \vdots \\ x_K \end{pmatrix}
\]
Joint distributions
Certain concepts about random variables extend quite naturally
to the multivariate case.

Definition 2
Support of a Random Vector. The support X ⊆ RK of a random
vector x is the Cartesian product of all the supports of the random
variables featured in x.

\[
\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_K
\]

Definition 3
Joint Probability Cumulative Distribution. Given some random
vector x, its joint probability cumulative distribution is defined as the
following function.

\[
F_{\boldsymbol{x}}(\mathbf{x}) = P(\boldsymbol{x} \le \mathbf{x}) = P(X_1 \le x_1 \cap \cdots \cap X_K \le x_K)
\]
Joint discrete distributions
Definition 4
Joint Probability Mass Function. Given some random vector x
composed by discrete random variables only, its joint probability mass
function fx (x) is defined as follows, for all x = (x1 , . . . , xK ) ∈ RK .

fx (x1 , . . . , xK ) = P (X1 = x1 ∩ · · · ∩ XK = xK )

• A joint p.m.f. is related to the joint c.d.f. via the following relationship:
\[
P(\boldsymbol{x} \le \mathbf{x}) = F_{\boldsymbol{x}}(\mathbf{x}) = \sum_{\mathbf{t} \in \mathcal{X}:\, \mathbf{t} \le \mathbf{x}} f_{\boldsymbol{x}}(\mathbf{t})
\]
• . . . and the total probability mass is obviously 1.
\[
P(\boldsymbol{x} \in \mathcal{X}) = \sum_{\mathbf{x} \in \mathcal{X}} f_{\boldsymbol{x}}(\mathbf{x}) = 1
\]
Joint continuous distributions (1/2)
Definition 5
Joint Probability Density Function. Given some random vector
x composed by continuous random variables only, its joint probabil-
ity density function fx (x) is defined as the function that satisfies the
following relationship, for all x = (x1 , . . . , xK ) ∈ RK .
\[
F_{\boldsymbol{x}}(x_1, \dots, x_K) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_K} f_{\boldsymbol{x}}(t_1, \dots, t_K)\, \mathrm{d}t_1 \dots \mathrm{d}t_K
\]

• Given two vectors $\mathbf{a} = (a_1, \dots, a_K)$ and $\mathbf{b} = (b_1, \dots, b_K)$ with $\mathbf{a}, \mathbf{b} \in \mathbb{R}^K$ and $b_k \ge a_k$ for $k = 1, \dots, K$, it is:
\[
P(a_1 \le X_1 \le b_1 \cap \cdots \cap a_K \le X_K \le b_K) = \int_{a_1}^{b_1} \cdots \int_{a_K}^{b_K} f_{\boldsymbol{x}}(x_1, \dots, x_K)\, \mathrm{d}x_1 \dots \mathrm{d}x_K
\]
(Continues. . . )
Joint continuous distributions (2/2)

Definition 5 (repeated)
Joint Probability Density Function. Given some random vector x composed by continuous random variables only, its joint probability density function $f_{\boldsymbol{x}}(\mathbf{x})$ is defined as the function that satisfies the following relationship, for all $\mathbf{x} = (x_1, \dots, x_K) \in \mathbb{R}^K$.
\[
F_{\boldsymbol{x}}(x_1, \dots, x_K) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_K} f_{\boldsymbol{x}}(t_1, \dots, t_K)\, \mathrm{d}t_1 \dots \mathrm{d}t_K
\]

• (Continued.) Clearly, the joint density integrates to 1 over the entire support of x.
\[
P(\boldsymbol{x} \in \mathcal{X}) = \int_{\mathcal{X}_1} \cdots \int_{\mathcal{X}_K} f_{\boldsymbol{x}}(x_1, \dots, x_K)\, \mathrm{d}x_1 \dots \mathrm{d}x_K = 1
\]
Marginal discrete distributions

Definition 7
Marginal Distribution (discrete). For some given random vector x
made of discrete random variables only, the probability mass function
of Xk – the k-th random variable in x – is obtained as:
\[
f_{X_k}(x_k) = \sum_{x_1 \in \mathcal{X}_1} \cdots \sum_{x_{k-1} \in \mathcal{X}_{k-1}} \sum_{x_{k+1} \in \mathcal{X}_{k+1}} \cdots \sum_{x_K \in \mathcal{X}_K} f_{\boldsymbol{x}}(x_1, \dots, x_K)
\]
and thus $F_{X_k}(x_k) = \sum_{t \in \mathcal{X}_k:\, t \le x_k} f_{X_k}(t)$.

Note: the above summation proceeds over all the values in the
support of x, except those of Xk . If k = 1 or k = K, (that is to
say, Xk is either first or last in the list) then the summation is
to be reformulated accordingly.
Example: marginal demographics (1/2)

Recall the example about an imperfect medical treatment with


an imperfect take-up in the population.

• Let X ∈ {0, 1} indicate treatment take-up and Y ∈ {0, 1}


health status. Clearly, here (X, Y ) is a random vector.

• Let x = 1 for a taker, x = 0 for a hesitant, y = 1 if one


stays healthy, y = 0 if one gets sick.

• The entire joint p.m.f. is expressed as follows:

fX,Y (x = 1, y = 1) = 0.40, fX,Y (x = 1, y = 0) = 0.20,


fX,Y (x = 0, y = 1) = 0.15, fX,Y (x = 0, y = 0) = 0.25.

• This is a bivariate Bernoulli distribution.


Example: marginal demographics (2/2)

• The marginal distributions are obtained as follows:

fX (x) = fX,Y (x, y = 1) + fX,Y (x, y = 0) for x = 0, 1


fY (y) = fX,Y (x = 1, y) + fX,Y (x = 0, y) for y = 0, 1

• . . . and they can easily be represented with a table.

Y =0 Y =1 Total
X=0 0.25 0.15 0.40
X=1 0.20 0.40 0.60
Total 0.45 0.55 1

Unsurprisingly, marginal distributions lie at the margins of the table!
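
(Numerical aside, not part of the original slides.) A minimal Python sketch that recovers the marginal distributions of the table above by summing the joint p.m.f. of the treatment example along each dimension; the array layout and variable names are just illustrative choices.

```python
import numpy as np

# Joint p.m.f. of (X, Y) from the example: rows index x = 0, 1; columns index y = 0, 1.
f_xy = np.array([[0.25, 0.15],   # f(x=0, y=0), f(x=0, y=1)
                 [0.20, 0.40]])  # f(x=1, y=0), f(x=1, y=1)

# Marginal p.m.f.s are obtained by summing out the other variable.
f_x = f_xy.sum(axis=1)  # P(X=0), P(X=1) -> [0.40, 0.60]
f_y = f_xy.sum(axis=0)  # P(Y=0), P(Y=1) -> [0.45, 0.55]

print("f_X:", f_x)                # marginal of treatment take-up
print("f_Y:", f_y)                # marginal of health status
print("total mass:", f_xy.sum())  # should equal 1
```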
Marginal continuous distributions

Definition 8
Marginal Distribution (continuous). For some given random vec-
tor x composed by continuous random variables only, the probability
density function of Xk – the k-th random variable in x – is obtained
as:
\[
f_{X_k}(x_k) = \int_{\times_{\ell \ne k} \mathcal{X}_\ell} f_{\boldsymbol{x}}(\mathbf{x})\, \mathrm{d}\mathbf{x}_{-k}
\]
and thus $F_{X_k}(x_k) = \int_{-\infty}^{x_k} f_{X_k}(t)\, \mathrm{d}t$.
Note: here $\times_{\ell \ne k} \mathcal{X}_\ell$ indicates the Cartesian product of all the supports of each random variable in $\boldsymbol{x}$ excluding $X_k$: e.g. for $k \ne 1, K$ it is $\times_{\ell \ne k} \mathcal{X}_\ell = \mathcal{X}_1 \times \cdots \times \mathcal{X}_{k-1} \times \mathcal{X}_{k+1} \times \cdots \times \mathcal{X}_K$.
Similarly, the expression $\mathrm{d}\mathbf{x}_{-k}$ for the differential of the integral is interpreted as the product of all differentials excluding that of $x_k$: $\mathrm{d}\mathbf{x}_{-k} = \mathrm{d}x_1 \dots \mathrm{d}x_{k-1}\, \mathrm{d}x_{k+1} \dots \mathrm{d}x_K$.
Example: the bivariate normal distribution
A two-dimensional random vector x = (X1 , X2 ) is said to follow
a bivariate normal distribution if, given some parameters:

µ1 ∈ R, µ2 ∈ R, σ1 ∈ R++ , σ2 ∈ R++ , ρ ∈ [−1, 1]

the joint density function fX1 ,X2 (x1 , x2 ) is expressed as follows.

\[
f_{X_1,X_2}(x_1, x_2; \mu_1, \mu_2, \sigma_1, \sigma_2, \rho) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \cdot
\exp\left( -\frac{(x_1-\mu_1)^2}{2\sigma_1^2(1-\rho^2)} - \frac{(x_2-\mu_2)^2}{2\sigma_2^2(1-\rho^2)} + \frac{\rho(x_1-\mu_1)(x_2-\mu_2)}{\sigma_1\sigma_2(1-\rho^2)} \right)
\]

• See e.g. the figure in the next slide for µ1 = 1, µ2 = 2, σ1 = 0.5, σ2 = 1 and ρ = 0.4.
• Note: the two marginal distributions are obtained through integration as $X_1 \sim \mathcal{N}(\mu_1, \sigma_1^2)$ and $X_2 \sim \mathcal{N}(\mu_2, \sigma_2^2)$.
Drawing the bivariate normal: ρ = 0.4

[Figure: surface plot of the joint density fX1,X2(x1, x2), with the marginal densities fX1(x1) and fX2(x2) drawn on the side walls.]
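
(Numerical aside, not part of the original slides.) A small check, under the figure's parameter values, that integrating the bivariate normal density of the previous slide over $x_2$ indeed returns the $\mathcal{N}(\mu_1, \sigma_1^2)$ marginal; the grid points are arbitrary.

```python
import numpy as np
from scipy import integrate, stats

mu1, mu2, s1, s2, rho = 1.0, 2.0, 0.5, 1.0, 0.4

def f_biv(x1, x2):
    """Bivariate normal joint density, written exactly as on the slide."""
    q = ((x1 - mu1) ** 2 / s1 ** 2
         - 2 * rho * (x1 - mu1) * (x2 - mu2) / (s1 * s2)
         + (x2 - mu2) ** 2 / s2 ** 2) / (1 - rho ** 2)
    return np.exp(-q / 2) / (2 * np.pi * s1 * s2 * np.sqrt(1 - rho ** 2))

# Integrate x2 out at a few points and compare with the N(mu1, s1^2) density.
for x1 in (0.0, 1.0, 1.5):
    marginal, _ = integrate.quad(lambda x2: f_biv(x1, x2), -np.inf, np.inf)
    print(x1, marginal, stats.norm.pdf(x1, loc=mu1, scale=s1))
```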
Joint discrete-continuous random variables (1/2)

Most ‘real’ random vectors mix discrete and continuous random


variables. The definitions of joint p.m.f. and p.d.f. are not valid
for such random vectors as a whole.

Here is an example.

• Let x = (H, G) where H indicates height and G ∈ {0, 1}


gender (say G = 1 for females and G = 0 for males).

• A full description of this population can be given as follows.

\[
f_{H,G}(h, g=1; \mu_F, \mu_M, \sigma_F, \sigma_M, p) = \frac{1}{\sigma_F}\,\phi\!\left(\frac{h-\mu_F}{\sigma_F}\right) \cdot p
\]
\[
f_{H,G}(h, g=0; \mu_F, \mu_M, \sigma_F, \sigma_M, p) = \frac{1}{\sigma_M}\,\phi\!\left(\frac{h-\mu_M}{\sigma_M}\right) \cdot (1-p)
\]
(Continues. . . )
Joint discrete-continuous random variables (2/2)
• (Continued.) Summing the two expressions delivers the
p.d.f. of H alone:

\[
f_H(h; \mu_F, \mu_M, \sigma_F, \sigma_M, p) = \frac{1}{\sigma_F}\,\phi\!\left(\frac{h-\mu_F}{\sigma_F}\right) \cdot p + \frac{1}{\sigma_M}\,\phi\!\left(\frac{h-\mu_M}{\sigma_M}\right) \cdot (1-p)
\]

• likewise, the marginal distribution of G is returned through


integration of both expressions over h; clearly, G ∼ Be (p).

fG (g = 1; p) = p
fG (g = 0; p) = 1 − p

• This example is stylized! A joint p.m.f. or p.d.f. can still be


defined in a subset of a random vector’s components.
Transformations of random vectors (1/3)
• The analysis of transformations extends to random vectors.
Let y = g (x) where g (·) is a function taking K arguments
and returning J values, with possibly J ≠ K (so |y| = J).

• Interest falls on the joint distribution of y. Such an analysis


is tractable if the transformation is invertible, that is, there is a sequence of functions $g_1^{-1}(\cdot), \dots, g_K^{-1}(\cdot)$ such that:
\[
X_k = g_k^{-1}(Y_1, \dots, Y_J)
\]
for $k = 1, \dots, K$.

• If x is discrete, the p.m.f. of y is obtained as:
\[
f_{\boldsymbol{y}}(\mathbf{y}) = f_{\boldsymbol{x}}\!\left( g_1^{-1}(\mathbf{y}), \dots, g_K^{-1}(\mathbf{y}) \right)
\]
and the c.d.f. $F_{\boldsymbol{y}}(\mathbf{y})$ is derived consequently.


Transformations of random vectors (2/3)
In the continuous case, the results obtained in Lecture 1 for the univariate case can be extended if the transformation g(·) is bijective (one-to-one and onto).
Theorem 1
Joint Density of Transformed Random Vectors. Consider x and
y = g (x): two random vectors of length K that are related through a
bijective transformation g (·) which preserves vector length, X and Y
their respective supports, and fx (x) the joint probability density func-
tion of x, which is continuous on X. If the inverse of the transformation
function, gk−1 (·), is continuously differentiable on Y for k = 1, . . . , K,
the joint probability density function of y can be calculated as:
\[
f_{\boldsymbol{y}}(\mathbf{y}) =
\begin{cases}
f_{\boldsymbol{x}}\!\left( g_1^{-1}(\mathbf{y}), \dots, g_K^{-1}(\mathbf{y}) \right) \cdot \left| \det\!\left( \dfrac{\partial}{\partial \mathbf{y}^{\top}} g^{-1}(\mathbf{y}) \right) \right| & \text{if } \mathbf{y} \in \mathcal{Y} \\[1ex]
0 & \text{if } \mathbf{y} \notin \mathcal{Y}
\end{cases}
\]
where $g^{-1}(\mathbf{y}) = \left( g_1^{-1}(\mathbf{y}), \dots, g_K^{-1}(\mathbf{y}) \right)^{\top}$.
Transformations of random vectors (3/3)
• Note: in the above statement, the notation
\[
\frac{\partial}{\partial \mathbf{y}^{\top}} g^{-1}(\mathbf{y}) =
\begin{pmatrix}
\dfrac{\partial g_1^{-1}(\mathbf{y})}{\partial y_1} & \dfrac{\partial g_1^{-1}(\mathbf{y})}{\partial y_2} & \dots & \dfrac{\partial g_1^{-1}(\mathbf{y})}{\partial y_K} \\[1.5ex]
\dfrac{\partial g_2^{-1}(\mathbf{y})}{\partial y_1} & \dfrac{\partial g_2^{-1}(\mathbf{y})}{\partial y_2} & \dots & \dfrac{\partial g_2^{-1}(\mathbf{y})}{\partial y_K} \\
\vdots & \vdots & \ddots & \vdots \\
\dfrac{\partial g_K^{-1}(\mathbf{y})}{\partial y_1} & \dfrac{\partial g_K^{-1}(\mathbf{y})}{\partial y_2} & \dots & \dfrac{\partial g_K^{-1}(\mathbf{y})}{\partial y_K}
\end{pmatrix}
\]
indicates the $K \times K$ Jacobian matrix of $g^{-1}(\mathbf{y})$, while
\[
\left| \det\!\left( \frac{\partial}{\partial \mathbf{y}^{\top}} g^{-1}(\mathbf{y}) \right) \right|
\]
is the absolute value of its determinant.

• This theorem is an application of Jacobian transformations


from multivariate calculus. It can be further extended when
g (·) is bijective in a partition of the support of x.
Example: bivariate lognormal distribution (1/2)
Consider the previous bivariate normal distribution, and let
\[
\boldsymbol{y} = \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} = g\!\begin{pmatrix} X_1 \\ X_2 \end{pmatrix} = \begin{pmatrix} \exp(X_1) \\ \exp(X_2) \end{pmatrix}
\]
be a transformed random vector $\boldsymbol{y} = (Y_1, Y_2)$. Note that:
\[
\boldsymbol{x} = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} = g^{-1}\!\begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} = \begin{pmatrix} \log(Y_1) \\ \log(Y_2) \end{pmatrix}
\]
hence, here the absolute value of the determinant of the inverse transformation is as follows.
\[
\left| \det\!\left( \frac{\partial}{\partial \mathbf{y}^{\top}} g^{-1}(y_1, y_2) \right) \right| = \left| \det \begin{pmatrix} 1/y_1 & 0 \\ 0 & 1/y_2 \end{pmatrix} \right| = \frac{1}{y_1 y_2} > 0
\]
Example: bivariate lognormal distribution (2/2)
Thus, the joint p.d.f. of the bivariate lognormal distribution (with support in $\mathbb{R}^2_{++}$) can be written as follows.
\[
f_{Y_1,Y_2}(y_1, y_2; \mu_1, \mu_2, \sigma_1, \sigma_2, \rho) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \cdot \frac{1}{y_1 y_2} \cdot
\exp\left( -\frac{(\log y_1 - \mu_1)^2}{2\sigma_1^2(1-\rho^2)} - \frac{(\log y_2 - \mu_2)^2}{2\sigma_2^2(1-\rho^2)} + \frac{\rho(\log y_1 - \mu_1)(\log y_2 - \mu_2)}{\sigma_1\sigma_2(1-\rho^2)} \right)
\]

• The distribution is graphically shown in the next figure for the parameters µ1 = 1, µ2 = 2, σ1 = 0.5, σ2 = 1 and ρ = 0.4.
• (Note: this example is relatively simple because the Jacobian is diagonal.)
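
(Numerical aside, not part of the original slides.) A quick sanity check of the transformation by simulation, under the figure's parameters: exponentiating bivariate normal draws should produce marginals whose sample means match the known lognormal mean $\exp(\mu_k + \sigma_k^2/2)$.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, 2.0])
s1, s2, rho = 0.5, 1.0, 0.4
Sigma = np.array([[s1**2, rho*s1*s2],
                  [rho*s1*s2, s2**2]])

# Draw from the bivariate normal and push the draws through the transformation.
x = rng.multivariate_normal(mu, Sigma, size=200_000)
y = np.exp(x)                       # (Y1, Y2) is bivariate lognormal

# The lognormal marginals have known means: E[Yk] = exp(mu_k + sigma_k^2 / 2).
print(y.mean(axis=0))
print(np.exp(mu + np.diag(Sigma) / 2))
```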
Drawing the bivariate lognormal

[Figure: surface plot of the joint density fY1,Y2(y1, y2), with the marginal densities fY1(y1) and fY2(y2) drawn on the side walls.]
Random matrices
• Random variables can also be arrayed in random matrices: collections of random vectors of equal length.
• If a random matrix X has dimension K × J, it is:
\[
\boldsymbol{X} = \begin{pmatrix} \boldsymbol{x}_1 & \boldsymbol{x}_2 & \dots & \boldsymbol{x}_J \end{pmatrix} =
\begin{pmatrix}
X_{11} & X_{12} & \dots & X_{1J} \\
X_{21} & X_{22} & \dots & X_{2J} \\
\vdots & \vdots & \ddots & \vdots \\
X_{K1} & X_{K2} & \dots & X_{KJ}
\end{pmatrix}
\]
• . . . and the matrix X of its realizations looks alike.
\[
\mathbf{X} = \begin{pmatrix} \mathbf{x}_1 & \mathbf{x}_2 & \dots & \mathbf{x}_J \end{pmatrix} =
\begin{pmatrix}
x_{11} & x_{12} & \dots & x_{1J} \\
x_{21} & x_{22} & \dots & x_{2J} \\
\vdots & \vdots & \ddots & \vdots \\
x_{K1} & x_{K2} & \dots & x_{KJ}
\end{pmatrix}
\]
• All previous concepts also extend to random matrices.
Notation for random objects: summary
• X, Y , upper-case italic (slanted) letter: a random variable.
• x, y, lower-case italic (slanted) letter: the realization of a random variable.
• x, y, bold lower-case italic (slanted) letter: a random vector.
• x, y, bold lower-case roman letter: the realization of a random vector.
• X, Y , bold upper-case italic (slanted) letter: a random matrix.
• X, Y, bold upper-case roman letter: the realization of a random matrix.
Independent random variables
Some random variables that are possibly collected in a random
vector are said to be independent: intuitively, the realization
of one provides no information about those of the others.

In other words, the events described by these random variables


are statistically independent.

Definition 9
Independent Random Variables. Let x = (X, Y ) be a random
vector with joint probability mass or density function fX,Y (x, y), and
marginal mass or density functions fX (x) and fY (y). Lastly, let up-
percase F denote corresponding cumulative distributions instead (joint
or marginal). The two random variables X and Y are independent if
the two equivalent conditions below hold.

fX,Y (x, y) = fX (x) fY (y) ⇐⇒ FX,Y (x, y) = FX (x) FY (y)


Mutually independent random variables

The idea extends naturally to grouped random variables.

Definition 10
Mutually – or Pairwise – Independent Random Variables. Let
x = (X1 , . . . , XK ) be a random vector with joint probability mass
or density function fx (x), and marginal mass or density functions
fX1 (x1 ) , . . . , fXK (xK ). Instead, let Fx (x) denote corresponding cu-
mulative distributions (either joint or marginal). The random variables
X1 , . . . , XK are pairwise independent if every pair of random variables
listed in x are independent, and they are mutually independent if the
two equivalent conditions below hold.
\[
f_{\boldsymbol{x}}(\mathbf{x}) = \prod_{k=1}^{K} f_{X_k}(x_k) \iff F_{\boldsymbol{x}}(\mathbf{x}) = \prod_{k=1}^{K} F_{X_k}(x_k)
\]

Note that while mutual independence implies pairwise independence,


the converse is not true.
Independent random vectors (1/2)
Definitions that are specific to random vectors follow.

Definition 11
Independent Random Vectors. Let (x1 , . . . , xJ ) be a sequence
of J random vectors whose joint probability mass or density function
is written as fx1 ,...,xJ (x1 , . . . , xJ ). Let the joint probability mass or
density functions of an individual nested random vector be fxi (xi ),
where i = 1, . . . , J, and the joint probability mass or density functions
of any two random vectors indexed i, j (with i ≠ j) as $f_{\boldsymbol{x}_i,\boldsymbol{x}_j}(\mathbf{x}_i, \mathbf{x}_j)$.
Lastly, let Fx (x) denote corresponding cumulative distributions instead
(joint or marginal). Any pair of random vectors indexed i and j are
independent if the two equivalent conditions below hold.

fxi ,xj (xi , xj ) = fxi (xi ) fxj (xj )


Fxi ,xj (xi , xj ) = Fxi (xi ) Fxj (xj )

If the above holds for any i, j distinct pair, the J random vectors are
said to be pairwise independent. (Continues. . . )
Independent random vectors (2/2)

Definition 11
(Continued.) If the above holds for any i, j distinct pair, the J ran-
dom vectors are said to be pairwise independent. The J random vectors
are mutually independent if the two equivalent conditions below hold.
\[
f_{\boldsymbol{x}_1, \dots, \boldsymbol{x}_J}(\mathbf{x}_1, \dots, \mathbf{x}_J) = \prod_{i=1}^{J} f_{\boldsymbol{x}_i}(\mathbf{x}_i)
\]
\[
F_{\boldsymbol{x}_1, \dots, \boldsymbol{x}_J}(\mathbf{x}_1, \dots, \mathbf{x}_J) = \prod_{i=1}^{J} F_{\boldsymbol{x}_i}(\mathbf{x}_i)
\]

Note that within each random vector the underlying random variables
are not necessarily independent. In addition, if all the random vectors
in question have length one, all these definitions reduce to those given
above for random variables.
Independent random variables and events (1/2)

It now gets easier to appreciate the connection with statistically


independent events.

Theorem 2
Independence of Events. Any two events that are mapped by two
independent random variables X and Y are statistically independent.
Proof.
(Outline.) This requires showing that, for any two events A ⊂ SX and
B ⊂ SY – where SX and SY are the primitive sample spaces of X and
Y respectively – it is:

P (X ∈ X (A) ∩ Y ∈ Y (B)) = P (X ∈ X (A)) · P (Y ∈ Y (B))

which follows from the definitions of (joint) cumulative distribution,


mass and density functions, and that of independent events.
Independent random variables and events (2/2)

The logic applies to multiple random variables at once.

Theorem 2
Generalization: Mutual Independence between Events. Any
combination of events that are mapped by a sequence of J mutually
independent random vectors (x1 , . . . , xJ ) are mutually independent.
Proof.
(Outline.) Extending the reasoning above, consider a collection of J
events denoted by Ai ⊂ Sxi for i = 1, . . . , J, where Sxi is the primitive
sample space of xi . It must be shown that:
\[
P\!\left( \bigcap_{i=1}^{J} \left( \boldsymbol{x}_i \in \boldsymbol{x}_i(A_i) \right) \right) = \prod_{i=1}^{J} P\!\left( \boldsymbol{x}_i \in \boldsymbol{x}_i(A_i) \right)
\]

which follows by analogous considerations.


Independent functions of random variables (1/2)

Here is a useful result for later: independence between random


variables is preserved by transformations.

Theorem 3
Independence of Functions of Random Variables. Consider two
independent random variables X and Y , and let U = gX (X) be a
transformation of X and V = gY (Y ) a transformation of Y . The two
transformed random variables U and V are independent.
Proof.
(Outline.) This requires showing that
\[
f_{U,V}(u,v) = f_U(u) f_V(v) \iff F_{U,V}(u,v) = F_U(u) F_V(v)
\]
which is achieved by manipulating the inverse mappings $g_X^{-1}([a,b])$ and $g_Y^{-1}([a,b])$ for any appropriate interval $[a,b] \subset \mathbb{R}$, with $a \le b$.
Independent functions of random variables (2/2)

This also applies to a multivariate environment, of course.

Theorem 3
Generalization: Independence of Functions of Random Vec-
tors. Consider a sequence of mutually independent random vectors
(x1 , . . . , xJ ), as well as a sequence of transformations (y1 , . . . , yJ ) such
that yi = gi (xi ) for i = 1, . . . , J. The J transformed random vectors
(y1 , . . . , yJ ) are also themselves mutually independent.
Proof.
(Outline.) The proof extends the logic behind the previous result from
the bivariate case to higher dimensions; it requires manipulating the J
Jacobian transformations at hand.
Random products and random ratios
• Notions of independence are central in statistics, first and
foremost for specifying a framework for estimation.

• They have many applications, including the derivation of


the distribution of random products or random ratios
of two independent random variables.

• Specifically, let X1 and X2 be two independent random variables. Consider transformations like the following:
\[
Y = X_1 X_2 \quad \text{or} \quad Y = \frac{X_1}{X_2}.
\]

• The distribution of Y can be derived via some steps based


upon multivariate transformations. This is illustrated via
examples that are also relevant for statistical inference.
Cauchy distribution as a random ratio (1/4)
Observation 1
If X1 ∼ N (0, 1) and X2 ∼ N (0, 1), and the two random variables X1
and X2 are independent, the random variable Y obtained as
\[
Y = \frac{X_1}{X_2}
\]
is such that $Y \sim \mathrm{Cauchy}(0, 1)$.
Proof.
This is shown in three steps:
1. derive the joint distribution of (X1 , X2 );
2. derive the joint distribution of (Y, Z), where Z = |X2 |;
3. derive the marginal distribution of Y accordingly.
The first step is the easiest. Here is where independence is applied.
\[
f_{X_1,X_2}(x_1, x_2) = \frac{1}{2\pi} \exp\left( -\frac{x_1^2 + x_2^2}{2} \right)
\]
Cauchy distribution as a random ratio (2/4)
Proof.
The second step is not immediate because this transformation is not
bijective. Yet Theorem 1 can be applied to partitions of the support
where the transformation is bijective. Note that if $\mathcal{X} = \mathbb{R}^2$ is split as:
\[
\mathcal{X}_0 = \{(x_1, x_2) : x_2 = 0\}, \quad
\mathcal{X}_1 = \{(x_1, x_2) : x_2 < 0\}, \quad
\mathcal{X}_2 = \{(x_1, x_2) : x_2 > 0\}
\]
the transformation is bijective on $\mathcal{X}_1$ and $\mathcal{X}_2$, with image $\mathcal{Y} = \mathbb{R} \times \mathbb{R}_{+}$.
Also, $P(X_2 = 0) = 0$, so $\mathcal{X}_0$ can be ignored. Consider $\mathcal{X}_1$: there, it is $Z = -X_2$ and thus:
\[
X_1 = g_{1,\mathcal{X}_1}^{-1}(Y, Z) = -YZ \quad \text{and} \quad X_2 = g_{2,\mathcal{X}_1}^{-1}(Y, Z) = -Z;
\]
while in $\mathcal{X}_2$ it is $Z = X_2$ with the following inverse transformation:
\[
X_1 = g_{1,\mathcal{X}_2}^{-1}(Y, Z) = YZ \quad \text{and} \quad X_2 = g_{2,\mathcal{X}_2}^{-1}(Y, Z) = Z.
\]
Cauchy distribution as a random ratio (3/4)
Proof.
Thus, the determinant of the Jacobian in $\mathcal{X}_1$ is:
\[
\det\!\left( \frac{\partial}{\partial \mathbf{y}^{\top}} g_{\mathcal{X}_1}^{-1}(y, z) \right) = \det \begin{pmatrix} -z & -y \\ 0 & -1 \end{pmatrix} = z > 0
\]
which is always positive; in $\mathcal{X}_2$ it is identical.
\[
\det\!\left( \frac{\partial}{\partial \mathbf{y}^{\top}} g_{\mathcal{X}_2}^{-1}(y, z) \right) = \det \begin{pmatrix} z & y \\ 0 & 1 \end{pmatrix} = z > 0
\]
The second step is accomplished by deriving the joint distribution of $(Y, Z)$; specifically this is obtained by separately applying Theorem 1 on both $\mathcal{X}_1$ and $\mathcal{X}_2$ and summing up the resulting density functions. Hence:
\[
f_{Y,Z}(y, z) = \frac{z}{\pi} \exp\left( -\frac{(y^2 + 1) z^2}{2} \right)
\]
and all that is left to do is to integrate out $z$ (step three).


Cauchy distribution as a random ratio (4/4)
Proof.
To proceed, note that the support of $Z$ is $\mathbb{R}_{+}$.
\[
\begin{aligned}
f_Y(y) &= \int_0^{+\infty} f_{Y,Z}(y, z)\, \mathrm{d}z \\
&= \int_0^{+\infty} \frac{z}{\pi} \exp\left( -\frac{(y^2+1) z^2}{2} \right) \mathrm{d}z \\
&= \int_0^{+\infty} \frac{1}{2\pi} \exp\left( -\frac{y^2+1}{2}\, u \right) \mathrm{d}u \\
&= \frac{1}{\pi (y^2 + 1)} \int_0^{+\infty} \frac{y^2+1}{2} \exp\left( -\frac{y^2+1}{2}\, u \right) \mathrm{d}u \\
&= \frac{1}{\pi (y^2 + 1)}
\end{aligned}
\]
The third line applies the change of variable $u = z^2$, while the integral in the fourth line equals one: it is the total probability of an exponential distribution. The result is the standard Cauchy density.
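
(Numerical aside, not part of the original slides.) A minimal Monte Carlo sketch supporting Observation 1: the ratio of two independent standard normal samples behaves like a standard Cauchy, here compared through a Kolmogorov–Smirnov test and a few quantiles; sample size and seed are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 100_000

# Ratio of two independent standard normals.
y = rng.standard_normal(n) / rng.standard_normal(n)

# Compare the empirical distribution with the standard Cauchy via a KS test
# and by looking at a few quantiles (standard Cauchy quantile: tan(pi*(q - 1/2))).
print(stats.kstest(y, "cauchy"))
for q in (0.25, 0.5, 0.75):
    print(q, np.quantile(y, q), stats.cauchy.ppf(q))
```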
Student’s t-distribution as a random ratio (1/3)
Observation 2
If Z ∼ N (0, 1) and X ∼ χ2 (ν), and the two random variables Z and
X are independent, the random variable Y obtained as
\[
Y = \frac{Z}{\sqrt{X/\nu}}
\]
is such that $Y \sim \mathrm{T}(\nu)$.

Proof.
The steps here are identical to those in the 'normals-ratio-to-Cauchy' case; however, in place of $X_2$ here is $W = \sqrt{X/\nu}$ where $X \sim \chi^2(\nu)$. Thus, one should first derive the p.d.f. of $W$. Note that the transformation that defines $W$ is increasing; the inverse is $X = g^{-1}(W) = \nu W^2$ and thus $\frac{\mathrm{d}x}{\mathrm{d}w} = 2\nu w > 0$; the support stays $\mathbb{R}_{++}$ and hence:
\[
f_W(w; \nu) = \frac{\nu^{\nu/2}}{\Gamma\!\left(\frac{\nu}{2}\right) \cdot 2^{\nu/2 - 1}}\, w^{\nu - 1} \exp\left( -\frac{\nu w^2}{2} \right) \quad \text{for } w > 0.
\]
Student’s t-distribution as a random ratio (2/3)
Proof.
Because of independence, the joint density of $\boldsymbol{w} = (Z, W)$ is:
\[
f_{\boldsymbol{w}}(z, w; \nu) = \frac{1}{\sqrt{2\pi}} \frac{\nu^{\nu/2}}{\Gamma\!\left(\frac{\nu}{2}\right) \cdot 2^{\nu/2 - 1}}\, w^{\nu - 1} \exp\left( -\frac{z^2 + \nu w^2}{2} \right)
\]
and this completes the (longer) first step. Yet the second step is easier now: it is necessary to derive the joint distribution of $\boldsymbol{y} = (Y, W)$, but this specific transformation is already bijective and there is no need to split the support. Similarly to the Cauchy case, the determinant of the Jacobian is $w > 0$; and since $Z = YW$, the joint p.d.f. of interest is:
\[
f_{\boldsymbol{y}}(y, w; \nu) = w \cdot f_{\boldsymbol{w}}(yw, w; \nu) = \frac{\nu^{(\nu+1)/2}}{\Gamma\!\left(\frac{\nu}{2}\right) \cdot 2^{(\nu-1)/2}} \frac{1}{\sqrt{\pi\nu}}\, w^{\nu} \exp\left( -\frac{(y^2 + \nu) w^2}{2} \right)
\]
for $y \in \mathbb{R}$ and $w \in \mathbb{R}_{++}$. The last step is about integrating out $w$; this requires a change of variable, $u = w^2$, so as to recognize the integral as the total probability of a Gamma-distributed random variable.
Student’s t-distribution as a random ratio (3/3)
Proof.
\[
\begin{aligned}
f_Y(y; \nu) &= \int_0^{+\infty} f_{Y,W}(y, w)\, \mathrm{d}w \\
&= \frac{1}{\Gamma\!\left(\frac{\nu}{2}\right) \sqrt{\pi\nu}} \frac{\nu^{(\nu+1)/2}}{2^{(\nu-1)/2}} \int_0^{+\infty} w^{\nu} \exp\left( -\frac{(\nu + y^2) w^2}{2} \right) \mathrm{d}w \\
&= \frac{1}{\Gamma\!\left(\frac{\nu}{2}\right) \sqrt{\pi\nu}} \left( \frac{\nu}{2} \right)^{\frac{\nu+1}{2}} \int_0^{+\infty} u^{\frac{\nu-1}{2}} \exp\left( -\frac{\nu}{2}\left( 1 + \frac{y^2}{\nu} \right) u \right) \mathrm{d}u \\
&= \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right) \sqrt{\pi\nu}} \left( 1 + \frac{y^2}{\nu} \right)^{-\frac{\nu+1}{2}} \times \\
&\qquad \times \frac{1}{\Gamma\!\left(\frac{\nu+1}{2}\right)} \int_0^{+\infty} \left( \frac{\nu + y^2}{2} \right)^{\frac{\nu+1}{2}} u^{\frac{\nu-1}{2}} \exp\left( -\frac{\nu + y^2}{2}\, u \right) \mathrm{d}u \\
&= \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right) \sqrt{\pi\nu}} \left( 1 + \frac{y^2}{\nu} \right)^{-\frac{\nu+1}{2}}
\end{aligned}
\]
The result is the p.d.f. of a t-distribution with parameter ν.


Other important random ratios

Observation 3
If X1 ∼ χ2 (ν1 ) and X2 ∼ χ2 (ν2 ), and the two random variables X1
and X2 are independent, the random variable Y obtained as

\[
Y = \frac{X_1/\nu_1}{X_2/\nu_2}
\]

is such that Y ∼ F (ν1 , ν2 ).

Observation 4
If X1 ∼ Γ (α, γ) and X2 ∼ Γ (β, γ), and the two random variables X1
and X2 are independent, the random variables Y and W obtained as
\[
Y = \frac{X_1}{X_1 + X_2}, \qquad W = X_1 + X_2
\]

are independent and such that Y ∼ Beta (α, β) and W ∼ Γ (α + β, γ).


Multivariate moments
The concepts of moments extend straightforwardly to all the marginal distributions of a random vector. Therefore, the r-th uncentered moment of the k-th random variable in a random vector x also obtains as:
\[
\mathrm{E}[X_k^r] = \sum_{x_1 \in \mathcal{X}_1} \cdots \sum_{x_K \in \mathcal{X}_K} x_k^r\, f_{\boldsymbol{x}}(\mathbf{x})
\qquad \text{or} \qquad
\mathrm{E}[X_k^r] = \int_{\mathcal{X}_1} \cdots \int_{\mathcal{X}_K} x_k^r\, f_{\boldsymbol{x}}(\mathbf{x})\, \mathrm{d}\mathbf{x}
\]
if x is all-discrete or all-continuous respectively (where $\mathrm{d}\mathbf{x}$ is the product of all differentials); centered moments are analogous.
\[
\mathrm{E}[(X_k - \mathrm{E}[X_k])^r] = \sum_{x_1 \in \mathcal{X}_1} \cdots \sum_{x_K \in \mathcal{X}_K} (x_k - \mathrm{E}[X_k])^r f_{\boldsymbol{x}}(\mathbf{x})
\]
\[
\mathrm{E}[(X_k - \mathrm{E}[X_k])^r] = \int_{\mathcal{X}_1} \cdots \int_{\mathcal{X}_K} (x_k - \mathrm{E}[X_k])^r f_{\boldsymbol{x}}(\mathbf{x})\, \mathrm{d}\mathbf{x}
\]
Covariance
Multivariate distributions allow for important “cross-moments”
that express to what extent two random variables tend to move
together in a probabilistic sense.

Definition 12
Covariance. For any two random variables $X_k$ and $X_\ell$ belonging to a random vector $\boldsymbol{x}$, their covariance is defined as the expectation of a particular function of $X_k$ and $X_\ell$, i.e. the product of both variables' deviations from their respective means.
\[
\mathrm{Cov}[X_k, X_\ell] = \mathrm{E}[(X_k - \mathrm{E}[X_k])(X_\ell - \mathrm{E}[X_\ell])]
\]

For respectively all-discrete and all-continuous random vectors, covariances can also be expressed as follows.
\[
\mathrm{Cov}[X_k, X_\ell] = \sum_{x_1 \in \mathcal{X}_1} \cdots \sum_{x_K \in \mathcal{X}_K} (x_k - \mathrm{E}[X_k])(x_\ell - \mathrm{E}[X_\ell])\, f_{\boldsymbol{x}}(\mathbf{x})
\]
\[
\mathrm{Cov}[X_k, X_\ell] = \int_{\mathcal{X}_1} \cdots \int_{\mathcal{X}_K} (x_k - \mathrm{E}[X_k])(x_\ell - \mathrm{E}[X_\ell])\, f_{\boldsymbol{x}}(\mathbf{x})\, \mathrm{d}\mathbf{x}
\]
Covariance and Correlation
A covariance takes positive values if the two variables $X_k$ and $X_\ell$ move together, negative values if they proceed in opposite directions. It can be expressed in terms of more fundamental moments.
\[
\begin{aligned}
\mathrm{Cov}[X_k, X_\ell] &= \mathrm{E}[(X_k - \mathrm{E}[X_k])(X_\ell - \mathrm{E}[X_\ell])] \\
&= \mathrm{E}[X_k X_\ell] - \mathrm{E}[X_k \mathrm{E}[X_\ell]] - \mathrm{E}[X_\ell \mathrm{E}[X_k]] + \mathrm{E}[X_k]\mathrm{E}[X_\ell] \\
&= \mathrm{E}[X_k X_\ell] - \mathrm{E}[X_k]\mathrm{E}[X_\ell]
\end{aligned}
\]
Covariances can be normalized, thus becoming correlations.

Definition 13
Correlation. For any two random variables $X_k$ and $X_\ell$ belonging to a random vector $\boldsymbol{x}$, their population correlation is defined as follows.
\[
\mathrm{Corr}[X_k, X_\ell] = \frac{\mathrm{Cov}[X_k, X_\ell]}{\sqrt{\mathrm{Var}[X_k]}\sqrt{\mathrm{Var}[X_\ell]}}
\]
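
(Numerical aside, not part of the original slides.) A short sketch computing the covariance and correlation directly from the definitions, using the joint p.m.f. of the earlier medical-treatment example; the array layout is an illustrative choice.

```python
import numpy as np

# Joint p.m.f. of (X, Y) from the medical-treatment example; rows x = 0, 1; columns y = 0, 1.
f_xy = np.array([[0.25, 0.15],
                 [0.20, 0.40]])
x_vals = np.array([0.0, 1.0])
y_vals = np.array([0.0, 1.0])

# Means, covariance and correlation computed directly from the definitions.
E_x = (x_vals[:, None] * f_xy).sum()
E_y = (y_vals[None, :] * f_xy).sum()
E_xy = (x_vals[:, None] * y_vals[None, :] * f_xy).sum()
cov = E_xy - E_x * E_y
var_x = ((x_vals - E_x)[:, None] ** 2 * f_xy).sum()
var_y = ((y_vals - E_y)[None, :] ** 2 * f_xy).sum()
corr = cov / np.sqrt(var_x * var_y)

print(E_x, E_y, cov, corr)   # E[X] = 0.6, E[Y] = 0.55, Cov = 0.40 - 0.33 = 0.07
```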
Properties of Correlations (1/3)
Theorem 4
Properties of Correlation. For any two random variables X and Y ,
it is:
a. Corr [X, Y ] ∈ [−1, 1], and
b. |Corr [X, Y ]| = 1 if and only if there are some real numbers a ≠ 0
and b such that P (Y = aX + b) = 1. If Corr [X, Y ] = 1 then it is
a > 0, while if Corr [X, Y ] = −1 it is a < 0.

Proof.
Define the following function:
\[
C(t) = \mathrm{E}\!\left[ \left( (X - \mathrm{E}[X]) \cdot t + (Y - \mathrm{E}[Y]) \right)^2 \right] = \mathrm{Var}[X] \cdot t^2 + 2\,\mathrm{Cov}[X, Y] \cdot t + \mathrm{Var}[Y]
\]

which is nonnegative, because it is defined as the expectation of the


square of a random variable. (Continues. . . )
Properties of Correlations (2/3)

Theorem 4
Proof.
(Continued.) Since $C(t) \ge 0$ for every $t$, the quadratic can have at most one real root, hence its discriminant must be nonpositive:
\[
(2\,\mathrm{Cov}[X, Y])^2 - 4\,\mathrm{Var}[X]\,\mathrm{Var}[Y] \le 0
\]
or, equivalently:
\[
-\sqrt{\mathrm{Var}[X]}\sqrt{\mathrm{Var}[Y]} \le \mathrm{Cov}[X, Y] \le \sqrt{\mathrm{Var}[X]}\sqrt{\mathrm{Var}[Y]}
\]
thus a. is proved. Next, consider that $|\mathrm{Corr}[X, Y]| = 1$ only if
\[
\mathrm{Var}[X] \cdot t^2 + 2\,\mathrm{Cov}[X, Y] \cdot t + \mathrm{Var}[Y] = 0
\]
for some $t$, or equivalently, $C(t) = 0$. (Continues. . . )


Properties of Correlations (3/3)
Theorem 4
Proof.
(Continued.) Another way to express this condition is:
\[
P\!\left( \left( (X - \mathrm{E}[X])\, t + (Y - \mathrm{E}[Y]) \right)^2 = 0 \right) = 1
\]
or equivalently:
\[
P\left( (X - \mathrm{E}[X])\, t + (Y - \mathrm{E}[Y]) = 0 \right) = 1
\]
which only occurs if, given $Y = aX + b$:
\[
a = -t, \qquad b = \mathrm{E}[X] \cdot t + \mathrm{E}[Y], \qquad t = -\frac{\mathrm{Cov}[X, Y]}{\mathrm{Var}[X]}
\]
and the proof of b. is completed by showing that $a$ and $\mathrm{Corr}[X, Y]$ must also share the same sign.
Cross-expectation and independence (1/2)
Theorem 5
Cross-expectation of independent random variables. Given two
independent random variables X and Y , it is

E [XY ] = E [X] E [Y ]

if the above moments exist.


Proof.
The left-hand side of the above relationship is the expectation of a random variable defined as the product of X and Y:
\[
\begin{aligned}
\int_{\mathcal{X}} \int_{\mathcal{Y}} x y\, f_{X,Y}(x, y)\, \mathrm{d}y\, \mathrm{d}x
&= \int_{\mathcal{X}} \int_{\mathcal{Y}} x y\, f_X(x) f_Y(y)\, \mathrm{d}y\, \mathrm{d}x \\
&= \int_{\mathcal{X}} x f_X(x)\, \mathrm{d}x \cdot \int_{\mathcal{Y}} y f_Y(y)\, \mathrm{d}y
\end{aligned}
\]
falling back to the product of two expressions that correspond to the definition of the mean (for X and Y respectively); observe that the first equality exploits the definition of independent random variables.
Cross-expectation and independence (2/2)
Corollary 1
(Theorem 5.) Both the covariance and the correlation between two
independent random variables X and Y equal zero.

Corollary 2
(Theorem 5.) Given any transformations U = gX (X), V = gY (Y )
of two independent random variables X and Y whose moments exist,
it is:
E [U V ] = E [U ] E [V ]
because U and V are also independent; this also implies that U and V have zero covariance and correlation, and that all higher moments of X and Y inherit this property. For example:
\[
\begin{aligned}
\mathrm{Var}\!\left[ (X - \mathrm{E}[X])(Y - \mathrm{E}[Y]) \right]
&= \mathrm{E}\!\left[ (X - \mathrm{E}[X])^2 (Y - \mathrm{E}[Y])^2 \right] \\
&= \mathrm{E}\!\left[ (X - \mathrm{E}[X])^2 \right] \mathrm{E}\!\left[ (Y - \mathrm{E}[Y])^2 \right] \\
&= \mathrm{Var}[X]\,\mathrm{Var}[Y]
\end{aligned}
\]
which is best seen by setting $U = (X - \mathrm{E}[X])^2$ and $V = (Y - \mathrm{E}[Y])^2$, and by noting that the expectation of the product of the two centered factors (their covariance) is zero.
Covariance in the bivariate normal (1/4)
In a bivariate normal distribution, the following holds.
\[
\mathrm{E}[X_1 X_2] = \rho\sigma_1\sigma_2 + \mu_1\mu_2
\]
This is shown in a few steps. First, define the transformation from $\boldsymbol{x} = (X_1, X_2)$ to $\boldsymbol{y} = (Y, Z)$ as follows.
\[
Y = \frac{X_1 - \mu_1}{\sigma_1} \cdot \frac{X_2 - \mu_2}{\sigma_2}, \qquad Z = \frac{X_1 - \mu_1}{\sigma_1}
\]
Note that the support of $\boldsymbol{y}$ is $\mathcal{Y} = \mathbb{R}^2$ and the transformation is clearly bijective. The inverse transformation is as follows.
\[
X_1 = g_1^{-1}(Y, Z) = \sigma_1 Z + \mu_1, \qquad X_2 = g_2^{-1}(Y, Z) = \sigma_2 \frac{Y}{Z} + \mu_2
\]
Covariance in the bivariate normal (2/4)
Here the Jacobian has the following absolute determinant:
\[
\left| \det\!\left( \frac{\partial}{\partial \mathbf{y}^{\top}} g^{-1}(y, z) \right) \right|
= \left| \det \begin{pmatrix} 0 & \sigma_1 \\ \dfrac{\sigma_2}{z} & -\dfrac{\sigma_2 y}{z^2} \end{pmatrix} \right|
= \frac{\sigma_1\sigma_2}{|z|} = \frac{\sigma_1\sigma_2}{\sqrt{z^2}}
\]
thus, by Theorem 1 the joint p.d.f. of $\boldsymbol{y} = (Y, Z)$ is as follows.
\[
\begin{aligned}
f_{Y,Z}(y, z; \rho) &= \frac{1}{2\pi\sqrt{(1-\rho^2) z^2}} \exp\left( -\frac{z^2 - 2\rho y + y^2 z^{-2}}{2(1-\rho^2)} \right) \\
&= \phi(z) \cdot \frac{1}{\sqrt{2\pi(1-\rho^2) z^2}} \exp\left( -\frac{(y - \rho z^2)^2}{2(1-\rho^2) z^2} \right)
\end{aligned}
\]
Recall that $\phi(z)$ is the standard normal's p.d.f.; note that
\[
z^2 - 2\rho y + y^2 z^{-2} = \left( 1 - \rho^2 \right) z^2 + \left( y - \rho z^2 \right)^2 z^{-2}
\]
hence the second line.
Covariance in the bivariate normal (3/4)
The next step is to calculate the mean of $Y$.
\[
\begin{aligned}
\mathrm{E}[Y] &= \int_{-\infty}^{+\infty} \phi(z) \left[ \underbrace{\int_{-\infty}^{+\infty} \frac{y}{\sqrt{2\pi(1-\rho^2) z^2}} \exp\left( -\frac{(y - \rho z^2)^2}{2(1-\rho^2) z^2} \right) \mathrm{d}y}_{=\, \rho z^2} \right] \mathrm{d}z \\
&= \rho \int_{-\infty}^{+\infty} z^2 \phi(z)\, \mathrm{d}z = \rho
\end{aligned}
\]
• The simplification in the first line occurs because the inner integral is the mean of a normally distributed random variable with mean $\rho z^2$ and variance $(1-\rho^2) z^2$.
• The remaining integral equals one, since $Z \sim \mathcal{N}(0, 1)$ and $\mathrm{E}[Z^2] = 1$; hence $\mathrm{E}[Y] = \rho$.
Covariance in the bivariate normal (4/4)
In summary, one can conclude that:
\[
\begin{aligned}
\mathrm{E}[X_1 X_2] &= \mathrm{E}\!\left[ (\sigma_1 Z + \mu_1) \left( \sigma_2 \frac{Y}{Z} + \mu_2 \right) \right] \\
&= \sigma_1\sigma_2 \mathrm{E}[Y] + \sigma_1\mu_2 \mathrm{E}[Z] + \sigma_2\mu_1 \mathrm{E}\!\left[ \frac{Y}{Z} \right] + \mu_1\mu_2 \\
&= \rho\sigma_1\sigma_2 + \mu_1\mu_2
\end{aligned}
\]
as postulated, because $\mathrm{E}[Z] = \mathrm{E}[Y/Z] = 0$ are both expectations of random variables following the standard normal distribution.
In light of this result, it is:
\[
\mathrm{Cov}[X_1, X_2] = \rho\sigma_1\sigma_2, \qquad \mathrm{Corr}[X_1, X_2] = \rho
\]
hence, parameter ρ has an interpretation as a correlation (and in fact its range is confined to the [−1, 1] interval).
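
(Numerical aside, not part of the original slides.) A simulation sketch of the result just derived: for the bivariate normal with the figure's parameters, the sample covariance and correlation should approach $\rho\sigma_1\sigma_2$ and $\rho$; sample size and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
mu1, mu2, s1, s2, rho = 1.0, 2.0, 0.5, 1.0, 0.4
Sigma = np.array([[s1**2, rho*s1*s2],
                  [rho*s1*s2, s2**2]])

# Draw a large bivariate normal sample and compare the sample covariance
# and correlation with the theoretical values rho*s1*s2 and rho.
x = rng.multivariate_normal([mu1, mu2], Sigma, size=500_000)
sample_cov = np.cov(x, rowvar=False)
print(sample_cov[0, 1], rho * s1 * s2)           # ~0.2
print(np.corrcoef(x, rowvar=False)[0, 1], rho)   # ~0.4
```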
Drawing the bivariate normal: ρ = −0.4

[Figure: surface plot of the joint density fX1,X2(x1, x2), with the marginal densities fX1(x1) and fX2(x2) drawn on the side walls.]
Drawing the bivariate normal: ρ = 0

[Figure: surface plot of the joint density fX1,X2(x1, x2), with the marginal densities fX1(x1) and fX2(x2) drawn on the side walls.]
Collecting terms
It is convenient to summarize all moments of a random vector of the same order through compact notation. To begin with, the mean vector E[x] gathers all the means.
\[
\mathrm{E}[\boldsymbol{x}] = \begin{pmatrix} \mathrm{E}[X_1] \\ \mathrm{E}[X_2] \\ \vdots \\ \mathrm{E}[X_K] \end{pmatrix}
\]
The variance-covariance matrix Var[x] instead collects its namesakes; it can be seen as the expectation of a random matrix.
\[
\mathrm{Var}[\boldsymbol{x}] = \mathrm{E}\!\left[ (\boldsymbol{x} - \mathrm{E}[\boldsymbol{x}])(\boldsymbol{x} - \mathrm{E}[\boldsymbol{x}])^{\top} \right] =
\begin{pmatrix}
\mathrm{Var}[X_1] & \mathrm{Cov}[X_1, X_2] & \dots & \mathrm{Cov}[X_1, X_K] \\
\mathrm{Cov}[X_2, X_1] & \mathrm{Var}[X_2] & \dots & \mathrm{Cov}[X_2, X_K] \\
\vdots & \vdots & \ddots & \vdots \\
\mathrm{Cov}[X_K, X_1] & \mathrm{Cov}[X_K, X_2] & \dots & \mathrm{Var}[X_K]
\end{pmatrix}
\]
Properties of variance-covariance matrices
• Var[x] has dimension K × K and is symmetric.
• The elements along the diagonal of Var[x] (the variances) are always nonnegative; the elements outside the diagonal (the covariances) can be negative.
• The random matrix expression can be simplified:
\[
\begin{aligned}
\mathrm{Var}[\boldsymbol{x}] &= \mathrm{E}\!\left[ (\boldsymbol{x} - \mathrm{E}[\boldsymbol{x}])(\boldsymbol{x} - \mathrm{E}[\boldsymbol{x}])^{\top} \right] \\
&= \mathrm{E}\!\left[ \boldsymbol{x}\boldsymbol{x}^{\top} \right] - \mathrm{E}[\boldsymbol{x}]\mathrm{E}[\boldsymbol{x}]^{\top} - \mathrm{E}[\boldsymbol{x}]\mathrm{E}[\boldsymbol{x}]^{\top} + \mathrm{E}[\boldsymbol{x}]\mathrm{E}[\boldsymbol{x}]^{\top} \\
&= \mathrm{E}\!\left[ \boldsymbol{x}\boldsymbol{x}^{\top} \right] - \mathrm{E}[\boldsymbol{x}]\mathrm{E}[\boldsymbol{x}]^{\top}
\end{aligned}
\]
• Var[x] is positive semi-definite: for any non-zero vector a of length K, the quadratic form $\mathbf{a}^{\top} \mathrm{Var}[\boldsymbol{x}]\, \mathbf{a} \ge 0$ cannot be negative. This property is demonstrated later.
Example: moments of the bivariate normal
• The mean vector of the bivariate normal is usually denoted as follows.
\[
\boldsymbol{\mu} \equiv \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} = \mathrm{E}[\boldsymbol{x}]
\]
• Instead, here the notation for the variance-covariance matrix is the following.
\[
\boldsymbol{\Sigma} \equiv \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix} = \mathrm{Var}[\boldsymbol{x}]
\]
• Note: Σ satisfies all the properties of a variance-covariance matrix just outlined.
• If x = (X1, X2) follows a bivariate normal distribution with parameters specified as µ and Σ, this is denoted as follows.
\[
\boldsymbol{x} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})
\]
Moments of linear transformations (1/3)
Like in the univariate case, it is useful to derive the moment of
a transformed random vector in terms of the original moments.

Consider simple linear transformations of a random vector x


that return a random variable Y :

Y = aT x = a1 X1 + · · · + aK XK

where a = (a1 , . . . , aK )T has length K like x = (X1 , . . . , XK )T .

The mean of Y is obtained as follows.
\[
\mathrm{E}[Y] = \mathrm{E}\!\left[ \mathbf{a}^{\top} \boldsymbol{x} \right] = \mathrm{E}[a_1 X_1 + \dots + a_K X_K] = a_1 \mathrm{E}[X_1] + \dots + a_K \mathrm{E}[X_K] = \mathbf{a}^{\top} \mathrm{E}[\boldsymbol{x}]
\]
Moments of linear transformations (2/3)
The variance of Y is instead obtained as follows.
\[
\begin{aligned}
\mathrm{Var}[Y] &= \mathrm{Var}\!\left[ \mathbf{a}^{\top} \boldsymbol{x} \right] \\
&= \mathrm{E}\!\left[ \left( \mathbf{a}^{\top} \boldsymbol{x} - \mathrm{E}\!\left[ \mathbf{a}^{\top} \boldsymbol{x} \right] \right) \left( \mathbf{a}^{\top} \boldsymbol{x} - \mathrm{E}\!\left[ \mathbf{a}^{\top} \boldsymbol{x} \right] \right)^{\top} \right] \\
&= \mathrm{E}\!\left[ \mathbf{a}^{\top} (\boldsymbol{x} - \mathrm{E}[\boldsymbol{x}])(\boldsymbol{x} - \mathrm{E}[\boldsymbol{x}])^{\top} \mathbf{a} \right] \\
&= \mathbf{a}^{\top} \mathrm{E}\!\left[ (\boldsymbol{x} - \mathrm{E}[\boldsymbol{x}])(\boldsymbol{x} - \mathrm{E}[\boldsymbol{x}])^{\top} \right] \mathbf{a} \\
&= \mathbf{a}^{\top} \mathrm{Var}[\boldsymbol{x}]\, \mathbf{a}
\end{aligned}
\]
This shows that variance-covariance matrices must be positive semi-definite: any quadratic form $\mathbf{a}^{\top} \mathrm{Var}[\boldsymbol{x}]\, \mathbf{a}$ corresponds with a variance $\mathrm{Var}[Y] \ge 0$. This can be expanded as follows.
\[
\mathrm{Var}[Y] = \sum_{k=1}^{K} \left[ a_k^2 \mathrm{Var}[X_k] + 2 \sum_{\ell=1}^{k-1} a_k a_\ell \mathrm{Cov}[X_k, X_\ell] \right]
\]
Moments of linear transformations (3/3)
What if a linear transformation returns another random vector? Let:
\[
\boldsymbol{y} = \mathbf{a} + \mathbf{B}\boldsymbol{x} = (Y_1, \dots, Y_J)^{\top}
\]
be a J-dimensional vector obtained via a linear transformation of x, where vector a has length J and matrix B has size J × K.
The mean of y is obtained as:
\[
\mathrm{E}[\boldsymbol{y}] = \mathrm{E}[\mathbf{a} + \mathbf{B}\boldsymbol{x}] = \mathbf{a} + \mathbf{B}\,\mathrm{E}[\boldsymbol{x}]
\]
while its variance-covariance is as follows.
\[
\mathrm{Var}[\boldsymbol{y}] = \mathrm{Var}[\mathbf{a} + \mathbf{B}\boldsymbol{x}] = \mathbf{B}\,\mathrm{Var}[\boldsymbol{x}]\,\mathbf{B}^{\top}
\]
To appreciate this expression, note that if $\mathbf{b}_i$ and $\mathbf{b}_j$ collect the i-th and the j-th rows of B, then the ij-th element of Var[y] equals $\mathrm{Cov}\!\left[ \mathbf{b}_i^{\top} \boldsymbol{x}, \mathbf{b}_j^{\top} \boldsymbol{x} \right] = \mathbf{b}_i^{\top} \mathrm{Var}[\boldsymbol{x}]\, \mathbf{b}_j$.
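
(Numerical aside, not part of the original slides.) A small numpy sketch of the two formulas just stated: the quadratic form $\mathbf{a}^{\top}\mathrm{Var}[\boldsymbol{x}]\,\mathbf{a}$ coincides with its expanded sum, and $\mathrm{Var}[\mathbf{a}+\mathbf{B}\boldsymbol{x}] = \mathbf{B}\,\mathrm{Var}[\boldsymbol{x}]\,\mathbf{B}^{\top}$; the matrices V, a and B are purely illustrative.

```python
import numpy as np

# An illustrative variance-covariance matrix for a length-3 random vector x.
V = np.array([[2.0, 0.5, 0.3],
              [0.5, 1.0, -0.2],
              [0.3, -0.2, 1.5]])

# Scalar case: Var[a' x] = a' V a, which equals the expanded sum of
# squared-coefficient variances plus twice the weighted covariances.
a = np.array([1.0, -2.0, 0.5])
quad_form = a @ V @ a
expanded = sum(a[k]**2 * V[k, k] for k in range(3)) + \
           2 * sum(a[k] * a[l] * V[k, l] for k in range(3) for l in range(k))
print(quad_form, expanded)   # identical

# Vector case: Var[a + B x] = B V B' (the additive constant plays no role).
B = np.array([[1.0, 0.0, 1.0],
              [0.0, 2.0, -1.0]])
print(B @ V @ B.T)
```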
Moments of non-linear transformations
Consider instead non-linear transformations like:
\[
\boldsymbol{y} = g(\boldsymbol{x}) = (Y_1, \dots, Y_J)^{\top}
\]
obtained from a generic J-valued function g(·) of x. Like in the univariate case, a Taylor expansion helps the analysis:
\[
\begin{aligned}
g(\boldsymbol{x}) &\approx g(\mathrm{E}[\boldsymbol{x}]) + \frac{\partial}{\partial \mathbf{x}^{\top}} g(\mathrm{E}[\boldsymbol{x}]) \left[ \boldsymbol{x} - \mathrm{E}[\boldsymbol{x}] \right] \\
&\approx \left( g(\mathrm{E}[\boldsymbol{x}]) - \frac{\partial}{\partial \mathbf{x}^{\top}} g(\mathrm{E}[\boldsymbol{x}])\, \mathrm{E}[\boldsymbol{x}] \right) + \frac{\partial}{\partial \mathbf{x}^{\top}} g(\mathrm{E}[\boldsymbol{x}])\, \boldsymbol{x}
\end{aligned}
\]
showing that $\mathrm{E}[g(\boldsymbol{x})] \approx g(\mathrm{E}[\boldsymbol{x}])$ can be a bad approximation. However:
\[
\mathrm{Var}[g(\boldsymbol{x})] \approx \left( \frac{\partial}{\partial \mathbf{x}^{\top}} g(\mathrm{E}[\boldsymbol{x}]) \right) \mathrm{Var}[\boldsymbol{x}] \left( \frac{\partial}{\partial \mathbf{x}^{\top}} g(\mathrm{E}[\boldsymbol{x}]) \right)^{\top}
\]
is a generally good approximation for the variance of y.
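
(Numerical aside, not part of the original slides.) A sketch of this first-order ("delta method") variance approximation for a hypothetical transformation $g(x) = (X_1 X_2, X_1/X_2)$, compared with a Monte Carlo benchmark; the mean vector, covariance matrix and sample size are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative setup: x = (X1, X2) with known mean and variance-covariance matrix.
mu = np.array([2.0, 1.0])
V = np.array([[0.10, 0.02],
              [0.02, 0.02]])

# Non-linear transformation g(x) = (X1 * X2, X1 / X2) and its Jacobian.
def jacobian(x):
    return np.array([[x[1], x[0]],
                     [1.0 / x[1], -x[0] / x[1] ** 2]])

J = jacobian(mu)
var_delta = J @ V @ J.T          # first-order approximation of Var[g(x)]

# Monte Carlo benchmark: simulate x, transform, and take the sample covariance.
x = rng.multivariate_normal(mu, V, size=500_000)
y = np.column_stack([x[:, 0] * x[:, 1], x[:, 0] / x[:, 1]])
print(var_delta)
print(np.cov(y, rowvar=False))
```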
Cross-covariance matrix
Sometimes it is useful to collect all the covariances between the
elements of a random vector x of dimension Kx and those of a
random vector y of dimension Ky .

The resulting Kx × Ky matrix is named the cross-covariance


matrix and is denoted as Cov [x, y].

Just like the variance-covariance matrix, it is also defined as the


expectation of a random matrix.
\[
\mathrm{Cov}[\boldsymbol{x}, \boldsymbol{y}] = \mathrm{E}\!\left[ (\boldsymbol{x} - \mathrm{E}[\boldsymbol{x}])(\boldsymbol{y} - \mathrm{E}[\boldsymbol{y}])^{\top} \right] =
\begin{pmatrix}
\mathrm{Cov}[X_1, Y_1] & \mathrm{Cov}[X_1, Y_2] & \dots & \mathrm{Cov}[X_1, Y_{K_y}] \\
\mathrm{Cov}[X_2, Y_1] & \mathrm{Cov}[X_2, Y_2] & \dots & \mathrm{Cov}[X_2, Y_{K_y}] \\
\vdots & \vdots & \ddots & \vdots \\
\mathrm{Cov}[X_{K_x}, Y_1] & \mathrm{Cov}[X_{K_x}, Y_2] & \dots & \mathrm{Cov}[X_{K_x}, Y_{K_y}]
\end{pmatrix}
\]
Properties of cross-covariance matrices
• Like other expressions involving second-order moments, the one defining the cross-covariance matrix can be simplified.
\[
\begin{aligned}
\mathrm{Cov}[\boldsymbol{x}, \boldsymbol{y}] &= \mathrm{E}\!\left[ (\boldsymbol{x} - \mathrm{E}[\boldsymbol{x}])(\boldsymbol{y} - \mathrm{E}[\boldsymbol{y}])^{\top} \right] \\
&= \mathrm{E}\!\left[ \boldsymbol{x}\boldsymbol{y}^{\top} \right] - \mathrm{E}[\boldsymbol{x}]\mathrm{E}[\boldsymbol{y}]^{\top} - \mathrm{E}[\boldsymbol{x}]\mathrm{E}[\boldsymbol{y}]^{\top} + \mathrm{E}[\boldsymbol{x}]\mathrm{E}[\boldsymbol{y}]^{\top} \\
&= \mathrm{E}\!\left[ \boldsymbol{x}\boldsymbol{y}^{\top} \right] - \mathrm{E}[\boldsymbol{x}]\mathrm{E}[\boldsymbol{y}]^{\top}
\end{aligned}
\]
• If x and y are independent, the cross-covariance matrix is a collection of zeroes, since $\mathrm{E}\!\left[ \boldsymbol{x}\boldsymbol{y}^{\top} \right] = \mathrm{E}[\boldsymbol{x}]\mathrm{E}[\boldsymbol{y}]^{\top}$.
• The cross-covariance matrix between x and y relates with the respective variance-covariance matrices as follows:
\[
\mathrm{Var}[\boldsymbol{x}] - \mathrm{Cov}[\boldsymbol{x}, \boldsymbol{y}]\, \mathrm{Var}[\boldsymbol{y}]^{-1}\, \mathrm{Cov}[\boldsymbol{x}, \boldsymbol{y}]^{\top} \ge 0
\]
meaning that the left-hand side is positive semi-definite.
Cross-covariance matrices of transformations
Sometimes one is interested in the cross-covariance matrices of transformed random vectors.
For linear transformations, if $\boldsymbol{u} = \mathbf{a}_x + \mathbf{B}_x \boldsymbol{x}$ and $\boldsymbol{v} = \mathbf{a}_y + \mathbf{B}_y \boldsymbol{y}$ are obtained from x and y respectively and have length $J_u$ and $J_v$, their $J_u \times J_v$ cross-covariance matrix is as follows.
\[
\mathrm{Cov}[\boldsymbol{u}, \boldsymbol{v}] = \mathrm{Cov}[\mathbf{a}_x + \mathbf{B}_x \boldsymbol{x}, \mathbf{a}_y + \mathbf{B}_y \boldsymbol{y}] = \mathbf{B}_x\, \mathrm{Cov}[\boldsymbol{x}, \boldsymbol{y}]\, \mathbf{B}_y^{\top}
\]
At the same time, if $\boldsymbol{u} = g_x(\boldsymbol{x})$ and $\boldsymbol{v} = g_y(\boldsymbol{y})$ are the result of some non-linear transformations, the following approximation applies.
\[
\mathrm{Cov}[\boldsymbol{u}, \boldsymbol{v}] \approx \left( \frac{\partial}{\partial \mathbf{x}^{\top}} g_x(\mathrm{E}[\boldsymbol{x}]) \right) \mathrm{Cov}[\boldsymbol{x}, \boldsymbol{y}] \left( \frac{\partial}{\partial \mathbf{y}^{\top}} g_y(\mathrm{E}[\boldsymbol{y}]) \right)^{\top}
\]
Multivariate moment generation
Definition 14
Moment generating function (multivariate). Given a random vector $\boldsymbol{x} = (X_1, \dots, X_K)$ with support $\mathcal{X}$, the moment generating function $M_{\boldsymbol{x}}(\mathbf{t})$ is given by the expectation of the transformation $g(\boldsymbol{x}) = \exp(\boldsymbol{x}^{\top} \mathbf{t})$, for $\mathbf{t} = (t_1, \dots, t_K) \in \mathbb{R}^K$.
\[
M_{\boldsymbol{x}}(\mathbf{t}) = \mathrm{E}\!\left[ \exp\!\left( \mathbf{t}^{\top} \boldsymbol{x} \right) \right] = \mathrm{E}\!\left[ \exp\!\left( \sum_{k=1}^{K} t_k X_k \right) \right]
\]

Definition 15
Characteristic function (multivariate). Given a random vector $\boldsymbol{x} = (X_1, \dots, X_K)$ with support $\mathcal{X}$, the characteristic function $\varphi_{\boldsymbol{x}}(\mathbf{t})$ is given by the expectation of the transformation $g(\boldsymbol{x}) = \exp(i\,\mathbf{t}^{\top} \boldsymbol{x})$, for $\mathbf{t} = (t_1, \dots, t_K) \in \mathbb{R}^K$.
\[
\varphi_{\boldsymbol{x}}(\mathbf{t}) = \mathrm{E}\!\left[ \exp\!\left( i\,\mathbf{t}^{\top} \boldsymbol{x} \right) \right] = \mathrm{E}\!\left[ \exp\!\left( i \sum_{k=1}^{K} t_k X_k \right) \right]
\]
Generating multivariate moments
The r-th uncentered moments for each k-th element of the random vector x can be calculated in analogy with the univariate case.
\[
\mathrm{E}[X_k^r] = \left. \frac{\partial^r M_{\boldsymbol{x}}(\mathbf{t})}{\partial t_k^r} \right|_{\mathbf{t}=\mathbf{0}} = \frac{1}{i^r} \cdot \left. \frac{\partial^r \varphi_{\boldsymbol{x}}(\mathbf{t})}{\partial t_k^r} \right|_{\mathbf{t}=\mathbf{0}}
\]
Furthermore, the cross-moments are obtained, for $r, s \in \mathbb{N}$, as:
\[
\mathrm{E}[X_k^r X_\ell^s] = \left. \frac{\partial^{r+s} M_{\boldsymbol{x}}(\mathbf{t})}{\partial t_k^r \partial t_\ell^s} \right|_{\mathbf{t}=\mathbf{0}} = \frac{1}{i^{r+s}} \cdot \left. \frac{\partial^{r+s} \varphi_{\boldsymbol{x}}(\mathbf{t})}{\partial t_k^r \partial t_\ell^s} \right|_{\mathbf{t}=\mathbf{0}}
\]
which can be shown as follows for the case of m.g.f.s.
\[
\frac{\partial^{r+s} M_{\boldsymbol{x}}(\mathbf{t})}{\partial t_k^r \partial t_\ell^s}
= \mathrm{E}\!\left[ \frac{\partial^{r+s}}{\partial t_k^r \partial t_\ell^s} \exp\!\left( \sum_{k=1}^{K} t_k X_k \right) \right]
= \mathrm{E}\!\left[ X_k^r X_\ell^s \exp\!\left( \sum_{k=1}^{K} t_k X_k \right) \right]
\]
Generating the bivariate normal covariance
The m.g.f. of the bivariate normal distribution is:
\[
M_{X_1,X_2}(t_1, t_2) = \mathrm{E}[\exp(t_1 X_1 + t_2 X_2)] = \exp\left( t_1\mu_1 + t_2\mu_2 + \frac{1}{2}\left( t_1^2\sigma_1^2 + 2 t_1 t_2 \rho\sigma_1\sigma_2 + t_2^2\sigma_2^2 \right) \right)
\]
to be proven later. The first moment of $X_1 X_2$ is obtained from:
\[
\begin{aligned}
\frac{\partial^2}{\partial t_1 \partial t_2} M_{X_1,X_2}(t_1, t_2)
&= \left[ \left( \mu_1 + t_1\sigma_1^2 + t_2\rho\sigma_1\sigma_2 \right)\left( \mu_2 + t_2\sigma_2^2 + t_1\rho\sigma_1\sigma_2 \right) + \rho\sigma_1\sigma_2 \right] \times \\
&\qquad \times \exp\left( t_1\mu_1 + t_2\mu_2 + \frac{1}{2}\left( t_1^2\sigma_1^2 + 2 t_1 t_2 \rho\sigma_1\sigma_2 + t_2^2\sigma_2^2 \right) \right)
\end{aligned}
\]
and by evaluating the above expression at $t_1 = 0$ and $t_2 = 0$:
\[
\mathrm{E}[X_1 X_2] = \left. \frac{\partial^2}{\partial t_1 \partial t_2} M_{X_1,X_2}(t_1, t_2) \right|_{t_1, t_2 = 0} = \rho\sigma_1\sigma_2 + \mu_1\mu_2
\]
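
(Symbolic aside, not part of the original slides.) The differentiation above can be checked mechanically with sympy: starting from the m.g.f. on this slide, the mixed second derivative evaluated at zero returns $\rho\sigma_1\sigma_2 + \mu_1\mu_2$.

```python
import sympy as sp

t1, t2, mu1, mu2, s1, s2, rho = sp.symbols("t1 t2 mu1 mu2 sigma1 sigma2 rho")

# Bivariate normal m.g.f. as written on the slide.
M = sp.exp(t1*mu1 + t2*mu2
           + sp.Rational(1, 2)*(t1**2*s1**2 + 2*t1*t2*rho*s1*s2 + t2**2*s2**2))

# Cross-moment E[X1 X2]: differentiate once in each argument, then set t1 = t2 = 0.
EX1X2 = sp.diff(M, t1, t2).subs({t1: 0, t2: 0})
print(sp.simplify(EX1X2))   # rho*sigma1*sigma2 + mu1*mu2
```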
Moment generation and independence (1/2)

Theorem 6
Moment generating functions and characteristic functions of independent random variables. If the random variables belonging to a random vector $\boldsymbol{x} = (X_1, \dots, X_K)$ are mutually independent, the moment generating function of x (if it exists) and the characteristic function of x equal the product of the K moment generating functions (if they exist) and of the K characteristic functions of the K random variables involved, respectively.
\[
M_{\boldsymbol{x}}(\mathbf{t}) = \prod_{k=1}^{K} M_{X_k}(t_k), \qquad
\varphi_{\boldsymbol{x}}(\mathbf{t}) = \prod_{k=1}^{K} \varphi_{X_k}(t_k)
\]
Moment generation and independence (2/2)
Proof.
This is an application of Theorem 3 and Theorem 5 to a sequence of K transformed random variables, $\exp(t_1 X_1), \dots, \exp(t_K X_K)$, which are themselves mutually independent. For m.g.f.s:
\[
\begin{aligned}
M_{\boldsymbol{x}}(\mathbf{t}) = \mathrm{E}\!\left[ \exp\!\left( \mathbf{t}^{\top} \boldsymbol{x} \right) \right]
&= \mathrm{E}\!\left[ \exp\!\left( \sum_{k=1}^{K} t_k X_k \right) \right] \\
&= \mathrm{E}\!\left[ \prod_{k=1}^{K} \exp(t_k X_k) \right] \\
&= \prod_{k=1}^{K} \mathrm{E}[\exp(t_k X_k)] = \prod_{k=1}^{K} M_{X_k}(t_k)
\end{aligned}
\]
and the case of characteristic functions is analogous.


Moment generation, sums, independence (1/2)
Theorem 7
Moment generating and characteristic functions of linear combinations of independent random variables. Consider a random variable Y obtained as the sum of N linearly transformed, mutually independent random variables $\boldsymbol{x} = (X_1, \dots, X_N)$:
\[
Y = \sum_{i=1}^{N} (a_i + b_i X_i)
\]
where $(a_i, b_i) \in \mathbb{R}^2$ for $i = 1, \dots, N$. The moment generating and characteristic functions of Y are obtained as follows.
\[
M_Y(t) = \exp\!\left( t \sum_{i=1}^{N} a_i \right) \prod_{i=1}^{N} M_{X_i}(b_i t), \qquad
\varphi_Y(t) = \exp\!\left( i t \sum_{i=1}^{N} a_i \right) \prod_{i=1}^{N} \varphi_{X_i}(b_i t)
\]
Moment generation, sums, independence (2/2)
Proof.
For moment generating functions this result is obtained as:
\[
\begin{aligned}
M_Y(t) = \mathrm{E}[\exp(tY)] &= \mathrm{E}\!\left[ \exp\!\left( t \sum_{i=1}^{N} (a_i + b_i X_i) \right) \right] \\
&= \exp\!\left( t \sum_{i=1}^{N} a_i \right) \mathrm{E}\!\left[ \exp\!\left( \sum_{i=1}^{N} t b_i X_i \right) \right] \\
&= \exp\!\left( t \sum_{i=1}^{N} a_i \right) \mathrm{E}\!\left[ \prod_{i=1}^{N} \exp(t b_i X_i) \right] \\
&= \exp\!\left( t \sum_{i=1}^{N} a_i \right) \prod_{i=1}^{N} \mathrm{E}[\exp(t b_i X_i)] \\
&= \exp\!\left( t \sum_{i=1}^{N} a_i \right) \prod_{i=1}^{N} M_{X_i}(b_i t)
\end{aligned}
\]
where the second-to-last line exploits mutual independence. The case of characteristic functions is analogous.
Sums of independent random variables (1/6)
This result is extremely powerful: it allows one to quickly derive the distribution of a sum of independent random variables in a number of cases. Consider the following one.

Observation 5
If all the N random variables in the vector $(X_1, \dots, X_N)$ are mutually independent and $X_i \sim \mathrm{Be}(p)$ for $i = 1, \dots, N$, then:
\[
\sum_{i=1}^{N} X_i \sim \mathrm{BN}(p, N).
\]
Proof.
Since $M_{X_i}(t) = p\exp(t) + (1-p)$, it suffices to multiply the N identical moment generating functions:
\[
M_{\sum_{i=1}^{N} X_i}(t) = \left[ p\exp(t) + (1-p) \right]^N
\]
which implies the statement by Theorem 7.
Sums of independent random variables (2/6)
The following observations all assume that the random variables in $(X_1, \dots, X_N)$ are mutually independent. Proofs are omitted.

Observation 6
If $X_i \sim \mathrm{NB}(p, 1)$, it is $\sum_{i=1}^{N} X_i \sim \mathrm{NB}(p, N)$.

Observation 7
If $X_i \sim \mathrm{Pois}(\lambda)$, it is $\sum_{i=1}^{N} X_i \sim \mathrm{Pois}(N\lambda)$.

Observation 8
If $X_i \sim \mathrm{Exp}(\lambda)$, it is $\sum_{i=1}^{N} X_i \sim \Gamma\!\left( N, \lambda^{-1} \right)$.

Observation 9
If $X_i \sim \chi^2(\kappa_i)$, it is $\sum_{i=1}^{N} X_i \sim \chi^2\!\left( \sum_{i=1}^{N} \kappa_i \right)$.

Observation 10
If $X_i \sim \Gamma(\alpha_i, \beta)$, it is $\sum_{i=1}^{N} X_i \sim \Gamma\!\left( \sum_{i=1}^{N} \alpha_i, \beta \right)$.
Sums of independent random variables (3/6)
Observation 11
If $X_i \sim \mathcal{N}(\mu_i, \sigma_i^2)$, for all real $a_i, b_i$ it is as follows.
\[
Y = \sum_{i=1}^{N} (a_i + b_i X_i) \sim \mathcal{N}\!\left( \sum_{i=1}^{N} (a_i + b_i\mu_i),\ \sum_{i=1}^{N} b_i^2\sigma_i^2 \right)
\]
Proof.
This requires a few steps:
\[
\begin{aligned}
M_Y(t) &= \exp\!\left( t \sum_{i=1}^{N} a_i \right) \prod_{i=1}^{N} M_{X_i}(b_i t) \\
&= \exp\!\left( t \sum_{i=1}^{N} a_i \right) \exp\!\left( t \sum_{i=1}^{N} b_i\mu_i + t^2 \sum_{i=1}^{N} \frac{b_i^2\sigma_i^2}{2} \right) \\
&= \exp\!\left( t \sum_{i=1}^{N} (a_i + b_i\mu_i) + t^2 \sum_{i=1}^{N} \frac{b_i^2\sigma_i^2}{2} \right)
\end{aligned}
\]
where the third line expresses a familiar m.g.f. for Y.
Sums of independent random variables (4/6)
• Note: the previous observation states that a linear combination of independent normal random variables is itself normal.
• Later, this result is generalized while relaxing the independence requirement.
• This result also easily extends to lognormal distributions, but in its own way.

Observation 12
If $\log(X_i) \sim \mathcal{N}(\mu_i, \sigma_i^2)$, for all real $a_i, b_i$ it is as follows.
\[
\log\!\left( \prod_{i=1}^{N} \exp(a_i)\, X_i^{b_i} \right) \sim \mathcal{N}\!\left( \sum_{i=1}^{N} (a_i + b_i\mu_i),\ \sum_{i=1}^{N} b_i^2\sigma_i^2 \right)
\]
Proof.
Since $\log\!\left( \prod_{i=1}^{N} \exp(a_i)\, X_i^{b_i} \right) = \sum_{i=1}^{N} \left[ a_i + b_i\log(X_i) \right]$, the previous observation extends easily.
Sums of independent random variables (5/6)

Observation 13
If X1 ∼ Exp (λ1 ) and X2 ∼ Exp (λ2 ) are independent, the random
variable Y = X1 /λ1 − X2 /λ2 is such that Y ∼ Laplace (0, 1).
Proof.
Define the two random variables W1 = X1 /λ1 and W2 = −X2 /λ2 ,
which obviously are independent. By the properties of m.g.f.s for linear
transformations, the m.g.f. of the two transformed random variables is:
−1
MW1 (t) = (1 − t)
−1
MW2 (t) = (1 + t)

and since Y = W1 + W2 , the moment generating function of Y is:


−1
MY (t) = MW1 (t) MW2 (t) = 1 − t2

that is, that of a standard Laplace distribution.


Sums of independent random variables (6/6)
Observation 14
If $X_1 \sim \mathrm{Gumbel}(\mu_1, \sigma)$ and $X_2 \sim \mathrm{Gumbel}(\mu_2, \sigma)$ are independent, the random variable $Y = X_1 - X_2$ is such that $Y \sim \mathrm{Logistic}(\mu_1 - \mu_2, \sigma)$.
Proof.
The m.g.f. of $X_i$ (for $i = 1, 2$) is $M_{X_i}(t) = \exp(\mu_i t)\, \Gamma(1 - \sigma t)$. Thus, the transformed random variables $W_i = -X_i$ (for $i = 1, 2$) have m.g.f.
\[
M_{W_i}(t) = \exp(-\mu_i t)\, \Gamma(1 + \sigma t)
\]
while $X_1$ is independent of $W_2$ and vice versa. Since $Y = X_1 + W_2$:
\[
\begin{aligned}
M_Y(t) &= M_{X_1}(t)\, M_{W_2}(t) \\
&= \exp(\mu_1 t)\, \Gamma(1 - \sigma t) \cdot \exp(-\mu_2 t)\, \Gamma(1 + \sigma t) \\
&= \exp(\mu_1 t - \mu_2 t)\, \frac{\Gamma(1 - \sigma t)\, \Gamma(1 + \sigma t)}{\Gamma(2)} \\
&= \exp\!\left( (\mu_1 - \mu_2) t \right) \cdot \mathrm{B}(1 - \sigma t, 1 + \sigma t)
\end{aligned}
\]
i.e. the logistic's m.g.f. sought after (note that $\Gamma(2) = 1! = 1$).
Fixing realizations
• It is often useful to analyze certain random variables of a
random vector when the realizations of the other random
variables are held constant, or “fixed.”

• This leads to the analysis of conditional distributions


and conditional moments.

• One can also “fix” subsets of the support – not just single
realizations; this case is left aside for the moment.

• This discussion is based on two random vectors x and y of dimension $K_x \ge 1$ and $K_y \ge 1$ (with possibly $K_x \ne K_y$) and supports $\mathcal{X}$ and $\mathcal{Y}$ respectively, expressed as follows.
\[
\boldsymbol{x} = \begin{pmatrix} X_1 & \dots & X_{K_x} \end{pmatrix}^{\top}, \qquad
\boldsymbol{y} = \begin{pmatrix} Y_1 & \dots & Y_{K_y} \end{pmatrix}^{\top}
\]
Conditional mass or density
• In what follows, assume that x contains either only discrete
r.v.s, or only continuous ones, but not both. Same with y.

• Definitions of joint mass/density would adjust accordingly.

Definition 16
Conditional mass or density function. Consider the combined random vector $(\boldsymbol{x}, \boldsymbol{y})$ with joint mass/density function $f_{\boldsymbol{x},\boldsymbol{y}}(\mathbf{x}, \mathbf{y})$. Suppose that the random vector $\boldsymbol{x}$ has a probability mass or density function $f_{\boldsymbol{x}}(\mathbf{x})$. The conditional mass or density function of $\boldsymbol{y}$, given $\boldsymbol{x} = \mathbf{x}$, is defined as follows for all $\mathbf{x} \in \mathcal{X}$:
\[
f_{\boldsymbol{y}|\boldsymbol{x}}(\mathbf{y} \mid \boldsymbol{x} = \mathbf{x}) = \frac{f_{\boldsymbol{x},\boldsymbol{y}}(\mathbf{x}, \mathbf{y})}{f_{\boldsymbol{x}}(\mathbf{x})}
\]
It is a conditional mass function if all the random variables in $\boldsymbol{y}$ are discrete, and a conditional density function if they are all continuous.
Example: conditional normal distribution

Return to the bivariate normal distribution. Whenever one fixes any $X_2 = x_2 \in \mathbb{R}$, the resulting conditional p.d.f. is:
\[
f_{X_1|X_2}(x_1 \mid X_2 = x_2) = \frac{1}{\sqrt{2\pi\sigma_1^2(1-\rho^2)}} \exp\left( -\frac{\left[ x_1 - \mu_1 - \rho\dfrac{\sigma_1}{\sigma_2}(x_2 - \mu_2) \right]^2}{2\sigma_1^2(1-\rho^2)} \right)
\]
so that we write the resulting conditional distribution as:
\[
X_1 \mid X_2 = x_2 \sim \mathcal{N}\!\left( \mu_1 + \rho\frac{\sigma_1}{\sigma_2}(x_2 - \mu_2),\ \sigma_1^2\left( 1 - \rho^2 \right) \right)
\]
and this is symmetrical if one fixes any $X_1 = x_1 \in \mathbb{R}$ instead.
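
(Numerical aside, not part of the original slides.) A simulation sketch of the conditional formulas above: draws of $X_1$ whose paired $X_2$ falls in a narrow band around a fixed value should have mean and variance close to the conditional moments; the band width, fixed value, seed and sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
mu1, mu2, s1, s2, rho = 1.0, 2.0, 0.5, 1.0, 0.4
Sigma = np.array([[s1**2, rho*s1*s2],
                  [rho*s1*s2, s2**2]])

# Simulate the bivariate normal and keep draws with X2 close to a fixed value x2.
x = rng.multivariate_normal([mu1, mu2], Sigma, size=2_000_000)
x2_fixed = 3.0
band = np.abs(x[:, 1] - x2_fixed) < 0.02
x1_cond = x[band, 0]

# Theoretical conditional moments of X1 | X2 = x2 from the slide.
cond_mean = mu1 + rho * (s1 / s2) * (x2_fixed - mu2)
cond_var = s1**2 * (1 - rho**2)

print(x1_cond.mean(), cond_mean)   # both close to 1.2
print(x1_cond.var(), cond_var)     # both close to 0.21
```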
Conditional cumulative distribution
Definition 17
Conditional cumulative distribution. Consider the combined random vector $(\boldsymbol{x}, \boldsymbol{y})$ with joint mass/density function $f_{\boldsymbol{x},\boldsymbol{y}}(\mathbf{x}, \mathbf{y})$. The conditional cumulative distribution of $\boldsymbol{y}$, given $\boldsymbol{x} = \mathbf{x}$, is defined as:
\[
F_{\boldsymbol{y}|\boldsymbol{x}}(\mathbf{y} \mid \boldsymbol{x} = \mathbf{x}) = \sum_{\mathbf{t} \in \mathcal{Y}:\, \mathbf{t} \le \mathbf{y}} f_{\boldsymbol{y}|\boldsymbol{x}}(\mathbf{t} \mid \boldsymbol{x} = \mathbf{x})
\]
if all the random variables in $\boldsymbol{y}$ are discrete, and
\[
F_{\boldsymbol{y}|\boldsymbol{x}}(\mathbf{y} \mid \boldsymbol{x} = \mathbf{x}) = \int_{-\infty}^{y_1} \cdots \int_{-\infty}^{y_{K_y}} f_{\boldsymbol{y}|\boldsymbol{x}}(\mathbf{t} \mid \boldsymbol{x} = \mathbf{x})\, \mathrm{d}\mathbf{t}
\]
if all the random variables in $\boldsymbol{y}$ are continuous.

• If $\mathbf{x}$ is indeterminate/unspecified, the short notation $f_{\boldsymbol{y}|\boldsymbol{x}}(\mathbf{y} \mid \mathbf{x})$ and $F_{\boldsymbol{y}|\boldsymbol{x}}(\mathbf{y} \mid \mathbf{x})$ can also be used.
• In the previous example about the normal distribution, one may denote the conditional p.d.f. as $f_{X_1|X_2}(x_1 \mid x_2)$.
Interpreting conditional distributions
• Adapting the general definitions of (joint) p.m.f. and c.d.f., a conditional p.m.f. or c.d.f. has a direct interpretation in terms of conditional probability.
• For conditional p.d.f.s, the interpretation holds for specified intervals of the non-fixed vector.
\[
P\!\left( a_1 \le Y_1 \le b_1 \cap \dots \cap a_{K_y} \le Y_{K_y} \le b_{K_y} \,\middle|\, X_1 = x_1 \cap \dots \cap X_{K_x} = x_{K_x} \right)
= \int_{a_1}^{b_1} \cdots \int_{a_{K_y}}^{b_{K_y}} f_{\boldsymbol{y}|\boldsymbol{x}}(\mathbf{y} \mid \boldsymbol{x} = \mathbf{x})\, \mathrm{d}\mathbf{y}
\]
• If $\boldsymbol{x}$ is discrete whereas $\boldsymbol{y}$ is continuous – or vice versa – the interpretation must (intuitively) adjust accordingly.
• Recall the example about height across genders. Describing the height of one gender only is an exercise in conditioning.
\[
f_{H|G=1}(h \mid g = 1) = \frac{1}{\sigma_F}\,\phi\!\left( \frac{h - \mu_F}{\sigma_F} \right)
\]
Conditional distributions and independence
• All these definitions are somehow “moot” for independent
random variables/vectors. If x and y are independent:

fx,y (x, y) = fx (x) fy (y)

and thus the following holds.

f y|x ( y| x) = fy (y)
f x|y ( x| y) = fx (x)

• In words, the conditional distribution of y given x is equal


to the unconditional distribution of y, and vice versa.

• This is straightforward given the interpretations in terms of


conditional probability.
Conditional moments: expectation
• Moments are easily defined on conditional distributions too.
• The conditional expectation, also called regression, is defined for discrete random variables as:
\[
\mathrm{E}[\boldsymbol{y} \mid \mathbf{x}] = \sum_{y_1 \in \mathcal{Y}_1} \cdots \sum_{y_{K_y} \in \mathcal{Y}_{K_y}} \mathbf{y}\, f_{\boldsymbol{y}|\boldsymbol{x}}(\mathbf{y} \mid \mathbf{x})
\]
and for continuous random variables as follows.
\[
\mathrm{E}[\boldsymbol{y} \mid \mathbf{x}] = \int_{\mathcal{Y}_1} \cdots \int_{\mathcal{Y}_{K_y}} \mathbf{y}\, f_{\boldsymbol{y}|\boldsymbol{x}}(\mathbf{y} \mid \mathbf{x})\, \mathrm{d}\mathbf{y}
\]
• Note: it is a scalar if $K_y = 1$, a vector if $K_y > 1$.
• It represents the mean of $\boldsymbol{y}$ when $\boldsymbol{x} = \mathbf{x}$ is "fixed."
Conditional moments: variance-covariance (1/2)
• The conditional variance is defined in the discrete case as:
\[
\mathrm{Var}[\boldsymbol{y} \mid \mathbf{x}] = \sum_{y_1 \in \mathcal{Y}_1} \cdots \sum_{y_{K_y} \in \mathcal{Y}_{K_y}} (\mathbf{y} - \mathrm{E}[\boldsymbol{y} \mid \mathbf{x}])(\mathbf{y} - \mathrm{E}[\boldsymbol{y} \mid \mathbf{x}])^{\top}\, f_{\boldsymbol{y}|\boldsymbol{x}}(\mathbf{y} \mid \mathbf{x})
\]
and in the continuous case as follows.
\[
\mathrm{Var}[\boldsymbol{y} \mid \mathbf{x}] = \int_{\mathcal{Y}_1} \cdots \int_{\mathcal{Y}_{K_y}} (\mathbf{y} - \mathrm{E}[\boldsymbol{y} \mid \mathbf{x}])(\mathbf{y} - \mathrm{E}[\boldsymbol{y} \mid \mathbf{x}])^{\top}\, f_{\boldsymbol{y}|\boldsymbol{x}}(\mathbf{y} \mid \mathbf{x})\, \mathrm{d}\mathbf{y}
\]
• Note: it is a scalar if $K_y = 1$, a square matrix if $K_y > 1$. In the latter case it is typically more appropriate to call it the conditional variance-covariance.
Conditional moments: variance-covariance (2/2)
• Similarly to the conditional expectation, the conditional variance-covariance denotes the dispersion or co-movement of/between random variables in $\boldsymbol{y}$ when $\boldsymbol{x} = \mathbf{x}$ is "fixed."
• The conditional variance-covariances inherit all the typical properties of their unconditional counterparts.
• For example, they can be recast as follows.
\[
\begin{aligned}
\mathrm{Var}[\boldsymbol{y} \mid \mathbf{x}] &= \mathrm{E}\!\left[ (\boldsymbol{y} - \mathrm{E}[\boldsymbol{y} \mid \mathbf{x}])(\boldsymbol{y} - \mathrm{E}[\boldsymbol{y} \mid \mathbf{x}])^{\top} \,\middle|\, \mathbf{x} \right] \\
&= \mathrm{E}\!\left[ \boldsymbol{y}\boldsymbol{y}^{\top} \,\middle|\, \mathbf{x} \right] - \mathrm{E}[\boldsymbol{y} \mid \mathbf{x}]\,\mathrm{E}[\boldsymbol{y} \mid \mathbf{x}]^{\top} - \mathrm{E}[\boldsymbol{y} \mid \mathbf{x}]\,\mathrm{E}[\boldsymbol{y} \mid \mathbf{x}]^{\top} + \mathrm{E}[\boldsymbol{y} \mid \mathbf{x}]\,\mathrm{E}[\boldsymbol{y} \mid \mathbf{x}]^{\top} \\
&= \mathrm{E}\!\left[ \boldsymbol{y}\boldsymbol{y}^{\top} \,\middle|\, \mathbf{x} \right] - \mathrm{E}[\boldsymbol{y} \mid \mathbf{x}]\,\mathrm{E}[\boldsymbol{y} \mid \mathbf{x}]^{\top}
\end{aligned}
\]
More about conditional moments
• Examples of conditional moments: the conditional normal distribution of $X_1 \mid X_2 = x_2$ that is derived earlier from the bivariate normal has the following mean and variance.
\[
\mathrm{E}[X_1 \mid X_2 = x_2] = \mu_1 + \rho\frac{\sigma_1}{\sigma_2}(x_2 - \mu_2), \qquad
\mathrm{Var}[X_1 \mid X_2 = x_2] = \sigma_1^2\left( 1 - \rho^2 \right)
\]
• Conditional moments may often be recast as functions of the conditioning random variable. In this case, the notation for random variables/vectors/matrices (as opposed to their realizations) may be used. Examples of this notation follow.
\[
\mathrm{E}[X_1 \mid X_2], \quad \mathrm{E}[\boldsymbol{y} \mid \boldsymbol{x}], \quad \mathrm{Var}[X_1 \mid X_2], \quad \mathrm{Var}[\boldsymbol{y} \mid \boldsymbol{x}]
\]
Law of Iterated Expectations
Theorem 8
Law of Iterated Expectations. Given any two random vectors $\boldsymbol{x}$ and $\boldsymbol{y}$, it is:
\[
\mathrm{E}[\boldsymbol{y}] = \mathrm{E}_{\boldsymbol{x}}[\mathrm{E}[\boldsymbol{y} \mid \boldsymbol{x}]]
\]
where $\mathrm{E}_{\boldsymbol{x}}[\cdot]$ denotes an expectation taken over the support of $\boldsymbol{x}$.
Proof.
In the continuous case, apply the following decomposition:
\[
\begin{aligned}
\mathrm{E}[\boldsymbol{y}] &= \int_{\mathcal{X}} \int_{\mathcal{Y}} \mathbf{y}\, f_{\boldsymbol{x},\boldsymbol{y}}(\mathbf{x}, \mathbf{y})\, \mathrm{d}\mathbf{y}\, \mathrm{d}\mathbf{x} \\
&= \int_{\mathcal{X}} \int_{\mathcal{Y}} \mathbf{y}\, f_{\boldsymbol{y}|\boldsymbol{x}}(\mathbf{y} \mid \mathbf{x})\, f_{\boldsymbol{x}}(\mathbf{x})\, \mathrm{d}\mathbf{y}\, \mathrm{d}\mathbf{x} \\
&= \int_{\mathcal{X}} f_{\boldsymbol{x}}(\mathbf{x}) \left[ \int_{\mathcal{Y}} \mathbf{y}\, f_{\boldsymbol{y}|\boldsymbol{x}}(\mathbf{y} \mid \mathbf{x})\, \mathrm{d}\mathbf{y} \right] \mathrm{d}\mathbf{x} \\
&= \mathrm{E}_{\boldsymbol{x}}[\mathrm{E}[\boldsymbol{y} \mid \boldsymbol{x}]]
\end{aligned}
\]
and the discrete case is analogous (sums substitute integrals).
Example: linear regression (1/4)
• The linear regression model is a cornerstone of statistics
and econometrics and it is intimately linked to conditional
expectations.

• The model relates an endogenous or dependent variable


Yi to some exogenous or independent variables Xi (also
denoted as Zi ).

• Here i denotes the unit of observation (more on this later).

• The conditional expectation function of $Y_i$ given a sequence of $X_{ki}$ (or $Z_{ki}$) for $k = 1, \dots, K$ is modeled as linear.
\[
\mathrm{E}[Y_i \mid X_{1i}, \dots, X_{Ki}] = \beta_0 + \beta_1 X_{1i} + \dots + \beta_K X_{Ki}
\]
Here $(\beta_0, \beta_1, \dots, \beta_K)$ are the parameters of interest.


Example: linear regression (2/4)
• The simplest linear regression model is the bivariate one.
\[
\mathrm{E}[Y_i \mid X_i] = \beta_0 + \beta_1 X_i
\]
This is equivalent to the following expression.
\[
\mathrm{E}[Y_i - \beta_0 - \beta_1 X_i \mid X_i] = 0
\]
• Parameter $\beta_0$ is given the interpretation as the conditional mean of $Y_i$ given $X_i = 0$, or equivalently, as the "constant" coefficient that satisfies the following relationship.
\[
\mathrm{E}[Y_i - \beta_0 - \beta_1 X_i] = 0
\]
• Note this application of the Law of Iterated Expectations.
\[
\mathrm{E}[X_i(Y_i - \beta_0 - \beta_1 X_i)] = \mathrm{E}_X[\mathrm{E}[X_i(Y_i - \beta_0 - \beta_1 X_i) \mid X_i]] = \mathrm{E}_X[X_i \cdot \mathrm{E}[(Y_i - \beta_0 - \beta_1 X_i) \mid X_i]] = 0
\]
Example: linear regression (3/4)
• One can thus put together two equations for two unknowns.
\[
\mathrm{E}[Y_i - \beta_0 - \beta_1 X_i] = 0, \qquad \mathrm{E}[X_i(Y_i - \beta_0 - \beta_1 X_i)] = 0
\]
• After some manipulation, the solution for $\beta_0$ and $\beta_1$ is:
\[
\beta_0 = \mathrm{E}[Y_i] - \frac{\mathrm{Cov}[X_i, Y_i]}{\mathrm{Var}[X_i]} \cdot \mathrm{E}[X_i], \qquad
\beta_1 = \frac{\mathrm{Cov}[X_i, Y_i]}{\mathrm{Var}[X_i]}
\]
although $\beta_0$ is commonly written as $\beta_0 = \mathrm{E}[Y_i] - \beta_1 \mathrm{E}[X_i]$.
• Parameter $\beta_1$ is named the regression slope; it expresses the average response of $Y_i$ to $X_i$, and it is closely related to a measure of correlation.
\[
\beta_1 = \mathrm{Corr}[X_i, Y_i] \cdot \frac{\sqrt{\mathrm{Var}[Y_i]}}{\sqrt{\mathrm{Var}[X_i]}}
\]
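
(Numerical aside, not part of the original slides.) A sketch applying the two formulas for $\beta_0$ and $\beta_1$ to simulated data; the data-generating values (intercept 1, slope 2) and the noise scale are arbitrary choices, and the sample versions of the moments are used in place of the population ones.

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulate a simple bivariate regression: Y = 1 + 2*X + noise.
n = 100_000
x = rng.normal(loc=0.0, scale=1.5, size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=n)

# Population-style formulas applied to sample moments.
beta1 = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
beta0 = y.mean() - beta1 * x.mean()
print(beta0, beta1)                 # close to (1, 2)

# The same slope via the correlation representation.
print(np.corrcoef(x, y)[0, 1] * y.std(ddof=1) / x.std(ddof=1))
```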
Example: linear regression (4/4)

[Figure: surface plot of the conditional densities fYi|Xi(yi | xi) across values of xi, with the linear conditional expectation function E[Yi | Xi] traced through their means.]
Law of Total Variance
Theorem 9
Law of Total Variance (variance decomposition). Given any two random vectors $\boldsymbol{x}$ and $\boldsymbol{y}$:
\[
\mathrm{Var}[\boldsymbol{y}] = \mathrm{Var}_{\boldsymbol{x}}[\mathrm{E}[\boldsymbol{y} \mid \boldsymbol{x}]] + \mathrm{E}_{\boldsymbol{x}}[\mathrm{Var}[\boldsymbol{y} \mid \boldsymbol{x}]]
\]
where $\mathrm{Var}_{\boldsymbol{x}}[\cdot]$ and $\mathrm{E}_{\boldsymbol{x}}[\cdot]$ denote sums/integrals taken over the support of $\boldsymbol{x}$ for every element of the argument vectors/matrices.
Proof.
\[
\begin{aligned}
\mathrm{Var}[\boldsymbol{y}] &= \mathrm{E}\!\left[ \boldsymbol{y}\boldsymbol{y}^{\top} \right] - \mathrm{E}[\boldsymbol{y}]\,\mathrm{E}[\boldsymbol{y}]^{\top} \\
&= \mathrm{E}_{\boldsymbol{x}}\!\left[ \mathrm{E}\!\left[ \boldsymbol{y}\boldsymbol{y}^{\top} \,\middle|\, \boldsymbol{x} \right] \right] - \mathrm{E}_{\boldsymbol{x}}[\mathrm{E}[\boldsymbol{y} \mid \boldsymbol{x}]]\left[ \mathrm{E}_{\boldsymbol{x}}[\mathrm{E}[\boldsymbol{y} \mid \boldsymbol{x}]] \right]^{\top} \\
&= \mathrm{E}_{\boldsymbol{x}}\!\left[ \mathrm{E}\!\left[ \boldsymbol{y}\boldsymbol{y}^{\top} \,\middle|\, \boldsymbol{x} \right] - \mathrm{E}[\boldsymbol{y} \mid \boldsymbol{x}]\,\mathrm{E}[\boldsymbol{y} \mid \boldsymbol{x}]^{\top} \right] \\
&\qquad + \mathrm{E}_{\boldsymbol{x}}\!\left[ \mathrm{E}[\boldsymbol{y} \mid \boldsymbol{x}]\,\mathrm{E}[\boldsymbol{y} \mid \boldsymbol{x}]^{\top} \right] - \mathrm{E}_{\boldsymbol{x}}[\mathrm{E}[\boldsymbol{y} \mid \boldsymbol{x}]]\left[ \mathrm{E}_{\boldsymbol{x}}[\mathrm{E}[\boldsymbol{y} \mid \boldsymbol{x}]] \right]^{\top} \\
&= \mathrm{E}_{\boldsymbol{x}}[\mathrm{Var}[\boldsymbol{y} \mid \boldsymbol{x}]] + \mathrm{Var}_{\boldsymbol{x}}[\mathrm{E}[\boldsymbol{y} \mid \boldsymbol{x}]]
\end{aligned}
\]
Note the various applications of the Law of Iterated Expectations.
Example: income groups (1/4)
• Here is an application for the Law of Total Variance.

• Let X ∈ {1, 2, 3, 4} be a discrete random variable whose role is to identify groups of a population (e.g. ethnicities).

• Let Y be a continuous random variable (e.g. log-income)


that is distributed over the population in question.

• Assume the following conditional distributions.

Y | X = 1 ∼ N (3, 1.5)
Y | X = 2 ∼ N (4.5, 2.5)
Y | X = 3 ∼ N (5.5, 3)
Y | X = 4 ∼ N (7, 4)

• Groups have equal size: P (X = x) = 0.25 for x = 1, 2, 3, 4.


Example: income groups (2/4)

[Figure: the four conditional densities fY|X(y | x) of log-income, one for each group x = 1, 2, 3, 4.]
Example: income groups (3/4)

• By the Law of Iterated Expectations, E[Y] is as follows.
\[
\mathrm{E}[Y] = \mathrm{E}_X[\mathrm{E}[Y \mid X]] = \frac{1}{4}\sum_{x=1}^{4} \mathrm{E}[Y \mid X = x] = 5
\]
• Let interest be on the analysis of income inequality, Var[Y].
• The between-group variation $\mathrm{Var}_X[\mathrm{E}[Y \mid X]]$ is:
\[
\mathrm{Var}_X[\mathrm{E}[Y \mid X]] = \frac{1}{4}\sum_{x=1}^{4} \left( \mathrm{E}[Y \mid X = x] - \mathrm{E}_X[\mathrm{E}[Y \mid X]] \right)^2 = 2.125
\]
• . . . here, this is interpreted as the dispersion of the average (log-)income across the population's groups.
Example: income groups (4/4)

• The within-group variation $\mathrm{E}_X[\mathrm{Var}[Y \mid X]]$ is:
\[
\mathrm{E}_X[\mathrm{Var}[Y \mid X]] = \frac{1}{4}\sum_{x=1}^{4} \mathrm{Var}[Y \mid X = x] = 2.75
\]
• . . . which instead is interpreted as the average dispersion of (log-)income as calculated separately for each group.
• By the Law of Total Variance, here it is:
\[
\mathrm{Var}[Y] = \mathrm{Var}_X[\mathrm{E}[Y \mid X]] + \mathrm{E}_X[\mathrm{Var}[Y \mid X]] = 4.875
\]
• . . . hence, the two components of inequality have about the same effect on the overall income inequality.
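
(Numerical aside, not part of the original slides.) The decomposition for this example can be reproduced in a few lines, using the group means, group variances and equal group probabilities given above.

```python
import numpy as np

# Conditional distributions Y | X = x from the example: mean and variance per group,
# with equal group probabilities P(X = x) = 0.25.
means = np.array([3.0, 4.5, 5.5, 7.0])
variances = np.array([1.5, 2.5, 3.0, 4.0])
probs = np.full(4, 0.25)

overall_mean = (probs * means).sum()                   # E[Y] = 5
between = (probs * (means - overall_mean) ** 2).sum()  # Var_X[E[Y|X]] = 2.125
within = (probs * variances).sum()                     # E_X[Var[Y|X]] = 2.75

print(overall_mean, between, within, between + within) # total Var[Y] = 4.875
```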
Two multivariate distributions
• This lecture concludes with the analysis of two important
multivariate distributions: one discrete, one continuous.

• The discrete distribution is the multinomial distribution:


it generalizes the binomial distribution to allow for multiple
outcomes of a trial.

• The continuous distribution is the multivariate normal


distribution: it extends the bivariate normal to allow for a
random vector of length K ≥ 2 collecting random variables
that are normally distributed and possibly correlated.

• The analysis proceeds along the lines of Lecture 2: for both


analyzed distributions, their support, parameters, p.m.f. or
p.d.f., m.g.f. et cetera are specified.
The multinomial distribution (1/4)
• Consider a variation of the binomial experiment: there are
n trials but these are not Bernoulli.

• Instead, any trial can return one out of K alternatives (e.g.


colors of the balls in an urn with replacement), with K ≥ 2.

• Each alternative has probability $p_k \in [0, 1]$ in each trial (for $k = 1, \dots, K$), with $\sum_{k=1}^{K} p_k = 1$.
• The experiment delivers a list of success counts for every alternative, which can be written as $\boldsymbol{x} = (X_1, \dots, X_K)$.
• Given that $X_k \in \{0, 1, \dots, n\}$ for $k = 1, \dots, K$, the support of this random vector is the following set.
\[
\mathcal{X} = \left\{ \mathbf{x} = (x_1, \dots, x_K) \in \{0, 1, \dots, n\}^K : \sum_{k=1}^{K} x_k = n \right\}
\]
The multinomial distribution (2/4)
• The joint p.m.f. of the multinomial distribution is:
\[
f_{\boldsymbol{x}}(x_1, \dots, x_K; n, p_1, \dots, p_K) = \frac{n!}{\prod_{k=1}^{K} x_k!} \prod_{k=1}^{K} p_k^{x_k}
\]
where the multinomial coefficient $n! \cdot \prod_{k=1}^{K} (x_k!)^{-1}$ counts the number of realizations containing exactly $(x_1, \dots, x_K)$ successes for each alternative out of n draws.
• The joint c.d.f. sums the mass function over points in the support as follows, for $\mathbf{t} = (t_1, \dots, t_K)$.
\[
F_{\boldsymbol{x}}(x_1, \dots, x_K; n, p_1, \dots, p_K) = \sum_{\mathbf{t} \in \mathcal{X}:\, \mathbf{t} \le \mathbf{x}} \frac{n!}{\prod_{k=1}^{K} t_k!} \prod_{k=1}^{K} p_k^{t_k}
\]
The multinomial distribution (3/4)
• The distribution owes its name to the multinomial theorem, which helps show that the total probability mass equals 1.
\[
P(\boldsymbol{x} \in \mathcal{X}) = \sum_{\mathbf{x} \in \mathcal{X}} \frac{n!}{\prod_{k=1}^{K} x_k!} \prod_{k=1}^{K} p_k^{x_k} = \left( \sum_{k=1}^{K} p_k \right)^{n} = 1
\]
• This helps calculate the m.g.f. too.
\[
\begin{aligned}
M_{\boldsymbol{x}}(t_1, \dots, t_K) &= \sum_{\mathbf{x} \in \mathcal{X}} \exp\!\left( \sum_{k=1}^{K} t_k x_k \right) \frac{n!}{\prod_{k=1}^{K} x_k!} \prod_{k=1}^{K} p_k^{x_k} \\
&= \sum_{\mathbf{x} \in \mathcal{X}} \frac{n!}{\prod_{k=1}^{K} x_k!} \prod_{k=1}^{K} \left( p_k \cdot \exp(t_k) \right)^{x_k} \\
&= \left( \sum_{k=1}^{K} p_k \cdot \exp(t_k) \right)^{n}
\end{aligned}
\]
The multinomial distribution (4/4)
• Through the m.g.f. one can show that for all $k = 1, \dots, K$:
\[
\mathrm{E}[X_k] = n p_k, \qquad \mathrm{Var}[X_k] = n p_k (1 - p_k)
\]
and for all $k, \ell = 1, \dots, K$ with $k \ne \ell$:
\[
\mathrm{Cov}[X_k, X_\ell] = -n p_k p_\ell
\]
and the covariance is always negative because an increasing number of successes for one alternative implies a decreasing number for another alternative.
• These moments can be expressed in compact notation as:
\[
\mathrm{E}[\boldsymbol{x}] = n\mathbf{p}, \qquad \mathrm{Var}[\boldsymbol{x}] = n\left( \mathrm{diag}(\mathbf{p}) - \mathbf{p}\mathbf{p}^{\top} \right)
\]
where $\mathbf{p} \equiv (p_1, p_2, \dots, p_K)^{\top}$.
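
(Numerical aside, not part of the original slides.) A simulation sketch of the compact moment formulas: the sample mean and sample covariance matrix of multinomial draws should be close to $n\mathbf{p}$ and $n(\mathrm{diag}(\mathbf{p}) - \mathbf{p}\mathbf{p}^{\top})$; the values of n and p are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 10, np.array([0.2, 0.3, 0.5])

# Theoretical moments of the multinomial distribution.
mean_theory = n * p
var_theory = n * (np.diag(p) - np.outer(p, p))

# Monte Carlo check with numpy's multinomial sampler.
draws = rng.multinomial(n, p, size=200_000)
print(draws.mean(axis=0), mean_theory)
print(np.cov(draws, rowvar=False))   # close to var_theory (off-diagonals negative)
print(var_theory)
```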
The multivariate normal distribution (1/8)
• A random vector $\boldsymbol{x} = (X_1, \dots, X_K)$ of length K follows the multivariate normal distribution with support $\mathcal{X} = \mathbb{R}^K$ if its joint probability density function is as follows.
\[
f_{\boldsymbol{x}}(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{\sqrt{(2\pi)^K |\boldsymbol{\Sigma}|}} \exp\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right)
\]
• The distribution's parameters are collected by a vector $\boldsymbol{\mu}$ of length K and a symmetric, positive semi-definite matrix $\boldsymbol{\Sigma}$ having size $K \times K$ and full rank:
\[
\boldsymbol{\mu} \equiv \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_K \end{pmatrix}
\quad \text{and} \quad
\boldsymbol{\Sigma} \equiv \begin{pmatrix}
\sigma_{11} & \sigma_{12} & \dots & \sigma_{1K} \\
\sigma_{21} & \sigma_{22} & \dots & \sigma_{2K} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{K1} & \sigma_{K2} & \dots & \sigma_{KK}
\end{pmatrix}
\]
with $\sigma_{ij} = \sigma_{ji}$ for $i, j = 1, \dots, K$ and where $|\boldsymbol{\Sigma}|$ denotes the determinant of $\boldsymbol{\Sigma}$.
The multivariate normal distribution (2/8)
• The following notation is used to express that $\boldsymbol{x}$ follows a multivariate normal distribution with given parameters.
\[
\boldsymbol{x} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})
\]
• A special case is the standardized multivariate normal, with $\boldsymbol{\mu} = \mathbf{0}$ and $\boldsymbol{\Sigma} = \mathbf{I}$. If a random vector $\boldsymbol{z}$ follows the standard multivariate normal distribution, this is written as follows.
\[
\boldsymbol{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})
\]
• Since $\boldsymbol{\Sigma}$ is symmetric and positive semi-definite, a Cholesky decomposition can be applied to $\boldsymbol{\Sigma}$ so as to return a matrix $\boldsymbol{\Sigma}^{\frac{1}{2}}$ such that $\boldsymbol{\Sigma}^{-\frac{1}{2}} \boldsymbol{\Sigma}\, \boldsymbol{\Sigma}^{-\frac{1}{2}} = \mathbf{I}$.
• Consequently, $\boldsymbol{x}$ and $\boldsymbol{z}$ are reciprocally related through the following multivariate transformations.
\[
\boldsymbol{z} = \boldsymbol{\Sigma}^{-\frac{1}{2}} (\boldsymbol{x} - \boldsymbol{\mu}), \qquad \boldsymbol{x} = \boldsymbol{\Sigma}^{\frac{1}{2}} \boldsymbol{z} + \boldsymbol{\mu}
\]
The multivariate normal distribution (3/8)
• The c.d.f. obtains from integrating the p.d.f.: similarly as in the univariate case, it has no closed form solution. Showing that the p.d.f. integrates to 1 is tedious here too.
\[
F_{\boldsymbol{x}}(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_K} f_{\boldsymbol{x}}(\mathbf{t}; \boldsymbol{\mu}, \boldsymbol{\Sigma})\, \mathrm{d}\mathbf{t}
\]
• However, obtaining the m.g.f. is relatively easy if one starts from the standardized case $\boldsymbol{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.
\[
\begin{aligned}
M_{\boldsymbol{z}}(\mathbf{t}) &= \int_{\mathbb{R}^K} \exp\!\left( \mathbf{t}^{\top} \mathbf{z} \right) \frac{1}{\sqrt{(2\pi)^K}} \exp\left( -\frac{1}{2} \mathbf{z}^{\top} \mathbf{z} \right) \mathrm{d}\mathbf{z} \\
&= \exp\!\left( \frac{\mathbf{t}^{\top} \mathbf{t}}{2} \right) \int_{\mathbb{R}^K} \frac{1}{\sqrt{(2\pi)^K}} \exp\left( -\frac{(\mathbf{z} - \mathbf{t})^{\top} (\mathbf{z} - \mathbf{t})}{2} \right) \mathrm{d}\mathbf{z} \\
&= \exp\!\left( \frac{\mathbf{t}^{\top} \mathbf{t}}{2} \right)
\end{aligned}
\]
The multivariate normal distribution (4/8)
• The general version of the m.g.f. is then:
\[
\begin{aligned}
M_{\boldsymbol{x}}(\mathbf{t}) &= \mathrm{E}\!\left[ \exp\!\left( \mathbf{t}^{\top} \boldsymbol{x} \right) \right] \\
&= \mathrm{E}\!\left[ \exp\!\left( \mathbf{t}^{\top} \left( \boldsymbol{\Sigma}^{\frac{1}{2}} \boldsymbol{z} + \boldsymbol{\mu} \right) \right) \right] \\
&= \exp\!\left( \mathbf{t}^{\top} \boldsymbol{\mu} \right) \cdot \mathrm{E}\!\left[ \exp\!\left( \mathbf{t}^{\top} \boldsymbol{\Sigma}^{\frac{1}{2}} \boldsymbol{z} \right) \right] \\
&= \exp\!\left( \mathbf{t}^{\top} \boldsymbol{\mu} + \frac{\mathbf{t}^{\top} \boldsymbol{\Sigma}\, \mathbf{t}}{2} \right)
\end{aligned}
\]
because if $\mathbf{d} = \left( \boldsymbol{\Sigma}^{\frac{1}{2}} \right)^{\top} \mathbf{t}$, it is $\mathbf{d}^{\top} \mathbf{d} = \mathbf{t}^{\top} \boldsymbol{\Sigma}\, \mathbf{t}$.
• This allows one to derive all key moments as:
\[
\mathrm{E}[\boldsymbol{x}] = \boldsymbol{\mu}, \qquad \mathrm{Var}[\boldsymbol{x}] = \boldsymbol{\Sigma}
\]
and the covariances lie in the off-diagonal elements of $\boldsymbol{\Sigma}$.
The multivariate normal distribution (5/8)
• If $\boldsymbol{x} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, what is the distribution of $\boldsymbol{y} = \mathbf{a} + \mathbf{B}\boldsymbol{x}$? Note that the transformation $\boldsymbol{y}$ may have length $J \ne K$.
• The m.g.f. of $\boldsymbol{y}$ is:
\[
\begin{aligned}
M_{\boldsymbol{y}}(\mathbf{t}) &= \mathrm{E}\!\left[ \exp\!\left( \mathbf{t}^{\top} \boldsymbol{y} \right) \right] \\
&= \mathrm{E}\!\left[ \exp\!\left( \mathbf{t}^{\top} (\mathbf{a} + \mathbf{B}\boldsymbol{x}) \right) \right] \\
&= \exp\!\left( \mathbf{t}^{\top} \mathbf{a} \right) \cdot \mathrm{E}\!\left[ \exp\!\left( \mathbf{t}^{\top} \mathbf{B}\boldsymbol{x} \right) \right] \\
&= \exp\!\left( \mathbf{t}^{\top} (\mathbf{B}\boldsymbol{\mu} + \mathbf{a}) + \frac{\mathbf{t}^{\top} \mathbf{B}\boldsymbol{\Sigma}\mathbf{B}^{\top} \mathbf{t}}{2} \right)
\end{aligned}
\]
because if $\mathbf{b} = \mathbf{B}^{\top} \mathbf{t}$, it is $\mathbf{b}^{\top} \boldsymbol{\Sigma}\, \mathbf{b} = \mathbf{t}^{\top} \mathbf{B}\boldsymbol{\Sigma}\mathbf{B}^{\top} \mathbf{t}$.
• Therefore:
\[
\boldsymbol{y} \sim \mathcal{N}\!\left( \mathbf{B}\boldsymbol{\mu} + \mathbf{a},\ \mathbf{B}\boldsymbol{\Sigma}\mathbf{B}^{\top} \right)
\]
even if the random variables in $\boldsymbol{x}$ are dependent.
The multivariate normal distribution (6/8)
• Suppose that x = (x1 , x2 ) can be split into two subvectors
x1 and x2 of length K1 and K2 respectively; K1 + K2 = K.
What are the marginal, conditional distributions of x1 , x2 ?

• Partition the original collection of parameters as follows:
\[
\boldsymbol{\mu} = \begin{pmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{pmatrix}
\quad \text{and} \quad
\boldsymbol{\Sigma} = \begin{pmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} \end{pmatrix}
\]
where:
• $\boldsymbol{\mu}_1$ is a vector of length $K_1$, $\boldsymbol{\mu}_2$ one of length $K_2$;
• $\boldsymbol{\Sigma}_{11}$ is a symmetric $K_1 \times K_1$ matrix, $\boldsymbol{\Sigma}_{22}$ a symmetric $K_2 \times K_2$ matrix;
• while $\boldsymbol{\Sigma}_{12}$ and $\boldsymbol{\Sigma}_{21}$ are two matrices, where one is the transpose of the other, having dimension $K_1 \times K_2$ and $K_2 \times K_1$ respectively.
The multivariate normal distribution (7/8)

• (Algebraic digression.) By the properties of partitioned inverse matrices:
\[
\boldsymbol{\Sigma}^{-1} = \begin{pmatrix}
\boldsymbol{\Sigma}_1^{-1} & -\boldsymbol{\Sigma}_1^{-1}\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1} \\
-\boldsymbol{\Sigma}_2^{-1}\boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1} & \boldsymbol{\Sigma}_2^{-1}
\end{pmatrix}
\]
where:
\[
\boldsymbol{\Sigma}_1 \equiv \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}, \qquad
\boldsymbol{\Sigma}_2 \equiv \boldsymbol{\Sigma}_{22} - \boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\Sigma}_{12}
\]
and:
\[
|\boldsymbol{\Sigma}| = |\boldsymbol{\Sigma}_1| \cdot |\boldsymbol{\Sigma}_{22}| = |\boldsymbol{\Sigma}_2| \cdot |\boldsymbol{\Sigma}_{11}|
\]
relating the determinant of $\boldsymbol{\Sigma}$ to those of the matrices expressing its partitioned inverse.
The multivariate normal distribution (8/8)
• All this lets one rewrite the p.d.f. of $\boldsymbol{x}$ in (very) "long" form as:
\[
\begin{aligned}
&f_{\boldsymbol{x}}(\mathbf{x}_1, \mathbf{x}_2; \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \boldsymbol{\Sigma}_{11}, \boldsymbol{\Sigma}_{12}, \boldsymbol{\Sigma}_{21}, \boldsymbol{\Sigma}_{22}) = \frac{1}{\sqrt{(2\pi)^K |\boldsymbol{\Sigma}_1| \cdot |\boldsymbol{\Sigma}_{22}|}} \times \\
&\quad \times \exp\Big( \tfrac{1}{2} (\mathbf{x}_1 - \boldsymbol{\mu}_1)^{\top} \boldsymbol{\Sigma}_1^{-1}\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1} (\mathbf{x}_2 - \boldsymbol{\mu}_2) - \tfrac{1}{2} (\mathbf{x}_1 - \boldsymbol{\mu}_1)^{\top} \boldsymbol{\Sigma}_1^{-1} (\mathbf{x}_1 - \boldsymbol{\mu}_1) \\
&\qquad\quad + \tfrac{1}{2} (\mathbf{x}_2 - \boldsymbol{\mu}_2)^{\top} \boldsymbol{\Sigma}_2^{-1}\boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1} (\mathbf{x}_1 - \boldsymbol{\mu}_1) - \tfrac{1}{2} (\mathbf{x}_2 - \boldsymbol{\mu}_2)^{\top} \boldsymbol{\Sigma}_2^{-1} (\mathbf{x}_2 - \boldsymbol{\mu}_2) \Big)
\end{aligned}
\]
• . . . so the two "marginalized" distributions for $\boldsymbol{x}_1$, $\boldsymbol{x}_2$ are:
\[
\boldsymbol{x}_1 \sim \mathcal{N}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_{11}), \qquad \boldsymbol{x}_2 \sim \mathcal{N}(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_{22})
\]
• . . . while the conditional ones are, reciprocally, as follows.
\[
\boldsymbol{x}_1 \mid \boldsymbol{x}_2 \sim \mathcal{N}\!\left( \boldsymbol{\mu}_1 + \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2),\ \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21} \right)
\]
\[
\boldsymbol{x}_2 \mid \boldsymbol{x}_1 \sim \mathcal{N}\!\left( \boldsymbol{\mu}_2 + \boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}(\mathbf{x}_1 - \boldsymbol{\mu}_1),\ \boldsymbol{\Sigma}_{22} - \boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\Sigma}_{12} \right)
\]
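
(Numerical aside, not part of the original slides.) A minimal sketch applying the conditional-distribution formulas above to an arbitrary three-dimensional example, with $\boldsymbol{x}_1 = (X_1)$ and $\boldsymbol{x}_2 = (X_2, X_3)$; the parameter values and the conditioning point are illustrative assumptions.

```python
import numpy as np

# Illustrative parameters for a length-3 multivariate normal, split as x1 = (X1,), x2 = (X2, X3).
mu = np.array([0.0, 1.0, 2.0])
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
i1, i2 = [0], [1, 2]

S11 = Sigma[np.ix_(i1, i1)]
S12 = Sigma[np.ix_(i1, i2)]
S21 = Sigma[np.ix_(i2, i1)]
S22 = Sigma[np.ix_(i2, i2)]

# Conditional distribution of x1 given x2 = (1.5, 2.5), using the formulas above.
x2_fixed = np.array([1.5, 2.5])
cond_mean = mu[i1] + S12 @ np.linalg.solve(S22, x2_fixed - mu[i2])
cond_var = S11 - S12 @ np.linalg.solve(S22, S21)
print(cond_mean, cond_var)
```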
