Paolo Zacchia
Lecture 3
Random vectors
It is important to analyze how different random variables relate
to one another. The starting point is the following concept.
Definition 1
Random Vector. A random vector x of length K is a collection of
K random variables X1 , . . . , XK :
$$\mathbf{x} = \begin{pmatrix} X_1 \\ \vdots \\ X_K \end{pmatrix}$$
Joint distributions
Certain concepts about random variables extend quite naturally
to the multivariate case.
Definition 2
Support of a Random Vector. The support $\mathcal{X} \subseteq \mathbb{R}^K$ of a random
vector x is the Cartesian product of all the supports of the random
variables featured in x.
$$\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_K$$
Definition 3
Joint Probability Cumulative Distribution. Given some random
vector x, its joint probability cumulative distribution is defined as the
following function.
$$F_{\mathbf{x}}(x) = P(\mathbf{x} \le x) = P(X_1 \le x_1 \cap \cdots \cap X_K \le x_K)$$
Joint discrete distributions
Definition 4
Joint Probability Mass Function. Given some random vector x
composed of discrete random variables only, its joint probability mass
function fx (x) is defined as follows, for all x = (x1 , . . . , xK ) ∈ $\mathbb{R}^K$.
$$f_{\mathbf{x}}(x_1,\dots,x_K) = P(X_1 = x_1 \cap \cdots \cap X_K = x_K)$$
Joint continuous distributions (1/2)
For a random vector x composed of continuous random variables only,
probabilities of events defined by intervals obtain via integration of the
joint density:
$$P(a_1 \le X_1 \le b_1 \cap \cdots \cap a_K \le X_K \le b_K) = F_{\mathbf{x}}(b_1,\dots,b_K) - F_{\mathbf{x}}(a_1,\dots,a_K) = \int_{a_1}^{b_1}\cdots\int_{a_K}^{b_K} f_{\mathbf{x}}(x_1,\dots,x_K)\, dx_1\dots dx_K$$
(Continues. . . )
Joint continuous distributions (2/2)
Definition 6
Joint Probability Density Function. Given some random vector
x composed of continuous random variables only, its joint probability
density function fx (x) is defined as the function that satisfies the
following relationship, for all x = (x1 , . . . , xK ) ∈ $\mathbb{R}^K$.
$$F_{\mathbf{x}}(x_1,\dots,x_K) = \int_{-\infty}^{x_1}\cdots\int_{-\infty}^{x_K} f_{\mathbf{x}}(t_1,\dots,t_K)\, dt_1\dots dt_K$$
Definition 7
Marginal Distribution (discrete). For some given random vector x
made of discrete random variables only, the probability mass function
of Xk – the k-th random variable in x – is obtained as:
$$f_{X_k}(x_k) = \sum_{x_1\in\mathcal{X}_1}\cdots\sum_{x_{k-1}\in\mathcal{X}_{k-1}}\;\sum_{x_{k+1}\in\mathcal{X}_{k+1}}\cdots\sum_{x_K\in\mathcal{X}_K} f_{\mathbf{x}}(x_1,\dots,x_K)$$
and thus $F_{X_k}(x_k) = \sum_{t=\inf\mathcal{X}_k}^{x_k} f_{X_k}(t)$.
Note: the above summation proceeds over all the values in the
support of x, except those of Xk. If k = 1 or k = K (that is to
say, Xk is either first or last in the list), then the summation is
to be reformulated accordingly.
Example: marginal demographics (1/2)
          Y = 0    Y = 1    Total
X = 0     0.25     0.15     0.40
X = 1     0.20     0.40     0.60
Total     0.45     0.55     1
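As a quick numerical illustration (not part of the original slides), the marginal distributions in the table above can be recovered from the joint mass function by summing over the other variable's support; the array below is a minimal sketch using the table's values.

```python
import numpy as np

# Joint p.m.f. of (X, Y) from the table: rows index X in {0, 1}, columns index Y in {0, 1}.
joint = np.array([[0.25, 0.15],
                  [0.20, 0.40]])

f_X = joint.sum(axis=1)  # marginal of X: sum over the support of Y -> [0.40, 0.60]
f_Y = joint.sum(axis=0)  # marginal of Y: sum over the support of X -> [0.45, 0.55]

print(f_X, f_Y, joint.sum())  # the joint mass sums to 1
```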
Definition 8
Marginal Distribution (continuous). For some given random vec-
tor x composed of continuous random variables only, the probability
density function of Xk – the k-th random variable in x – is obtained
as:
$$f_{X_k}(x_k) = \int_{\times_{\ell\ne k}\mathcal{X}_\ell} f_{\mathbf{x}}(\mathbf{x})\, d\mathbf{x}_{-k}$$
and thus $F_{X_k}(x_k) = \int_{-\infty}^{x_k} f_{X_k}(t)\, dt$.
Note: here, $\times_{\ell\ne k}\mathcal{X}_\ell$ indicates the Cartesian product of all the
supports of each random variable in x excluding Xk: e.g. for
k ≠ 1, K it is $\times_{\ell\ne k}\mathcal{X}_\ell = \mathcal{X}_1\times\cdots\times\mathcal{X}_{k-1}\times\mathcal{X}_{k+1}\times\cdots\times\mathcal{X}_K$.
Similarly, the expression $d\mathbf{x}_{-k}$ for the differential of the integral
is interpreted as the product of all differentials excluding that of
$x_k$: $d\mathbf{x}_{-k} = dx_1\dots dx_{k-1}\, dx_{k+1}\dots dx_K$.
Example: the bivariate normal distribution
A two-dimensional random vector x = (X1 , X2 ) is said to follow
a bivariate normal distribution if, given parameters $\mu_1, \mu_2 \in \mathbb{R}$,
$\sigma_1, \sigma_2 > 0$ and $\rho \in (-1, 1)$, its joint probability density function is:
$$f_{X_1,X_2}(x_1,x_2;\mu_1,\mu_2,\sigma_1,\sigma_2,\rho) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}}\cdot \exp\!\left(-\frac{(x_1-\mu_1)^2}{2\sigma_1^2(1-\rho^2)} - \frac{(x_2-\mu_2)^2}{2\sigma_2^2(1-\rho^2)} + \frac{\rho(x_1-\mu_1)(x_2-\mu_2)}{\sigma_1\sigma_2(1-\rho^2)}\right)$$
[Figure: three-dimensional plot of the bivariate normal joint density $f_{X_1,X_2}(x_1,x_2)$ along with the marginal densities $f_{X_1}(x_1)$ and $f_{X_2}(x_2)$.]
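As a sanity check (an illustrative sketch, not from the slides), the density formula above can be compared against SciPy's multivariate_normal for arbitrary parameter values; the values below are made up for the example.

```python
import numpy as np
from scipy.stats import multivariate_normal

def biv_normal_pdf(x1, x2, mu1, mu2, s1, s2, rho):
    """Bivariate normal density written exactly as in the formula above."""
    norm = 2 * np.pi * s1 * s2 * np.sqrt(1 - rho**2)
    expo = (-(x1 - mu1)**2 / (2 * s1**2 * (1 - rho**2))
            - (x2 - mu2)**2 / (2 * s2**2 * (1 - rho**2))
            + rho * (x1 - mu1) * (x2 - mu2) / (s1 * s2 * (1 - rho**2)))
    return np.exp(expo) / norm

mu1, mu2, s1, s2, rho = 1.0, 2.0, 1.5, 0.8, 0.6
cov = np.array([[s1**2, rho * s1 * s2], [rho * s1 * s2, s2**2]])
mvn = multivariate_normal(mean=[mu1, mu2], cov=cov)

x1, x2 = 0.3, 2.5
print(biv_normal_pdf(x1, x2, mu1, mu2, s1, s2, rho))  # slide formula
print(mvn.pdf([x1, x2]))                              # same value from SciPy
```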
Joint discrete-continuous random variables (1/2)
Here is an example: consider a continuous random variable H and a
binary random variable G with P (G = 1) = p; their joint mass/density
function is:
$$f_{H,G}(h, g=1; \mu_F, \mu_M, \sigma_F, \sigma_M, p) = \frac{1}{\sigma_F}\,\phi\!\left(\frac{h-\mu_F}{\sigma_F}\right)\cdot p$$
$$f_{H,G}(h, g=0; \mu_F, \mu_M, \sigma_F, \sigma_M, p) = \frac{1}{\sigma_M}\,\phi\!\left(\frac{h-\mu_M}{\sigma_M}\right)\cdot (1-p)$$
(Continues. . . )
Joint discrete-continuous random variables (2/2)
• (Continued.) Summing the two expressions delivers the
p.d.f. of H alone:
$$f_H(h;\mu_F,\mu_M,\sigma_F,\sigma_M,p) = \frac{1}{\sigma_F}\,\phi\!\left(\frac{h-\mu_F}{\sigma_F}\right)\cdot p + \frac{1}{\sigma_M}\,\phi\!\left(\frac{h-\mu_M}{\sigma_M}\right)\cdot(1-p)$$
• while integrating the two expressions over h delivers the
p.m.f. of G alone:
$$f_G(g=1;p) = p, \qquad f_G(g=0;p) = 1-p.$$
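A minimal sketch of the mixture marginal of H derived above, with made-up parameter values (the µ's, σ's and p are purely illustrative); it also verifies numerically that the mixture integrates to one.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# Illustrative parameter values (not from the slides).
mu_F, mu_M, sigma_F, sigma_M, p = -1.0, 1.0, 1.0, 1.5, 0.4

def f_H(h):
    """Marginal density of H: the two-component normal mixture derived above."""
    return norm.pdf(h, mu_F, sigma_F) * p + norm.pdf(h, mu_M, sigma_M) * (1 - p)

total, _ = quad(f_H, -np.inf, np.inf)
print(total)  # ~1.0: the mixture is a proper density
```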
$$X_k = g_k^{-1}(Y_1,\dots,Y_J)$$
for k = 1, . . . , K.
For example, the bivariate lognormal distribution, obtained by setting
$Y_1 = \exp(X_1)$ and $Y_2 = \exp(X_2)$ where $(X_1, X_2)$ is bivariate normal, has
joint density:
$$f_{Y_1,Y_2}(y_1,y_2;\mu_1,\mu_2,\sigma_1,\sigma_2,\rho) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}}\cdot\frac{1}{y_1y_2}\cdot \exp\!\left(-\frac{(\log y_1-\mu_1)^2}{2\sigma_1^2(1-\rho^2)} - \frac{(\log y_2-\mu_2)^2}{2\sigma_2^2(1-\rho^2)} + \frac{\rho(\log y_1-\mu_1)(\log y_2-\mu_2)}{\sigma_1\sigma_2(1-\rho^2)}\right)$$
[Figure: three-dimensional plot of the bivariate lognormal joint density $f_{Y_1,Y_2}(y_1,y_2)$ along with the marginal densities $f_{Y_1}(y_1)$ and $f_{Y_2}(y_2)$.]
Random matrices
• Random variables can also be arrayed in random matrices:
collections of random vectors of equal length.
• If a random matrix X has dimension K × J, it is:
$$\mathbf{X} = \begin{bmatrix}\mathbf{x}_1 & \mathbf{x}_2 & \cdots & \mathbf{x}_J\end{bmatrix} = \begin{bmatrix} X_{11} & X_{12} & \cdots & X_{1J} \\ X_{21} & X_{22} & \cdots & X_{2J} \\ \vdots & \vdots & \ddots & \vdots \\ X_{K1} & X_{K2} & \cdots & X_{KJ}\end{bmatrix}$$
Definition 9
Independent Random Variables. Let x = (X, Y ) be a random
vector with joint probability mass or density function fX,Y (x, y), and
marginal mass or density functions fX (x) and fY (y). Lastly, let uppercase
F denote the corresponding cumulative distributions (joint or marginal).
The two random variables X and Y are independent if the two equivalent
conditions below hold.
$$f_{X,Y}(x,y) = f_X(x)\, f_Y(y) \iff F_{X,Y}(x,y) = F_X(x)\, F_Y(y)$$
Definition 10
Mutually – or Pairwise – Independent Random Variables. Let
x = (X1 , . . . , XK ) be a random vector with joint probability mass
or density function fx (x), and marginal mass or density functions
fX1 (x1 ) , . . . , fXK (xK ). Instead, let Fx (x) denote corresponding cu-
mulative distributions (either joint or marginal). The random variables
X1 , . . . , XK are pairwise independent if every pair of random variables
listed in x are independent, and they are mutually independent if the
two equivalent conditions below hold.
$$f_{\mathbf{x}}(\mathbf{x}) = \prod_{k=1}^{K} f_{X_k}(x_k) \iff F_{\mathbf{x}}(\mathbf{x}) = \prod_{k=1}^{K} F_{X_k}(x_k)$$
Definition 11
Independent Random Vectors. Let (x1 , . . . , xJ ) be a sequence
of J random vectors whose joint probability mass or density function
is written as fx1 ,...,xJ (x1 , . . . , xJ ). Let the joint probability mass or
density functions of an individual nested random vector be fxi (xi ),
where i = 1, . . . , J, and the joint probability mass or density functions
of any two random vectors indexed i, j (with i ≠ j) as fxi,xj (xi, xj).
Lastly, let Fx (x) denote the corresponding cumulative distributions
(joint or marginal). Any pair of random vectors indexed i and j are
independent if the two equivalent conditions below hold.
$$f_{\mathbf{x}_i,\mathbf{x}_j}(\mathbf{x}_i,\mathbf{x}_j) = f_{\mathbf{x}_i}(\mathbf{x}_i)\, f_{\mathbf{x}_j}(\mathbf{x}_j) \iff F_{\mathbf{x}_i,\mathbf{x}_j}(\mathbf{x}_i,\mathbf{x}_j) = F_{\mathbf{x}_i}(\mathbf{x}_i)\, F_{\mathbf{x}_j}(\mathbf{x}_j)$$
If the above holds for any i, j distinct pair, the J random vectors are
said to be pairwise independent. (Continues. . . )
Independent random vectors (2/2)
Definition 11
(Continued.) If the above holds for any i, j distinct pair, the J ran-
dom vectors are said to be pairwise independent. The J random vectors
are mutually independent if the two equivalent conditions below hold.
$$f_{\mathbf{x}_1,\dots,\mathbf{x}_J}(\mathbf{x}_1,\dots,\mathbf{x}_J) = \prod_{i=1}^{J} f_{\mathbf{x}_i}(\mathbf{x}_i)$$
$$F_{\mathbf{x}_1,\dots,\mathbf{x}_J}(\mathbf{x}_1,\dots,\mathbf{x}_J) = \prod_{i=1}^{J} F_{\mathbf{x}_i}(\mathbf{x}_i)$$
Note that within each random vector the underlying random variables
are not necessarily independent. In addition, if all the random vectors
in question have length one, all these definitions reduce to those given
above for random variables.
Independent random variables and events (1/2)
Theorem 2
Independence of Events. Any two events that are mapped by two
independent random variables X and Y are statistically independent.
Proof.
(Outline.) This requires showing that, for any two events A ⊂ SX and
B ⊂ SY – where SX and SY are the primitive sample spaces of X and
Y respectively – it is:
$$P\left(X \in X(A) \cap Y \in Y(B)\right) = P\left(X \in X(A)\right)\cdot P\left(Y \in Y(B)\right)$$
Theorem 2
Generalization: Mutual Independence between Events. Any
combination of events that are mapped by a sequence of J mutually
independent random vectors (x1 , . . . , xJ ) are mutually independent.
Proof.
(Outline.) Extending the reasoning above, consider a collection of J
events denoted by Ai ⊂ Sxi for i = 1, . . . , J, where Sxi is the primitive
sample space of xi . It must be shown that:
$$P\left(\bigcap_{i=1}^{J}\left(\mathbf{x}_i \in \mathbf{x}_i(A_i)\right)\right) = \prod_{i=1}^{J} P\left(\mathbf{x}_i \in \mathbf{x}_i(A_i)\right)$$
Theorem 3
Independence of Functions of Random Variables. Consider two
independent random variables X and Y , and let U = gX (X) be a
transformation of X and V = gY (Y ) a transformation of Y . The two
transformed random variables U and V are independent.
Proof.
(Outline.) This requires showing that the joint mass/density (or cumulative
distribution) of U and V factors into the product of the corresponding
marginals, which follows from the independence of X and Y.
Theorem 3
Generalization: Independence of Functions of Random Vec-
tors. Consider a sequence of mutually independent random vectors
(x1 , . . . , xJ ), as well as a sequence of transformations (y1 , . . . , yJ ) such
that yi = gi (xi ) for i = 1, . . . , J. The J transformed random vectors
(y1 , . . . , yJ ) are also themselves mutually independent.
Proof.
(Outline.) The proof extends the logic behind the previous result from
the bivariate case to higher dimensions; it requires manipulating the J
Jacobian transformations at hand.
Random products and random ratios
• Notions of independence are central in statistics, first and
foremost for specifying a framework for estimation.
$$\mathcal{X}_0 = \{(x_1,x_2): x_2 = 0\}, \quad \mathcal{X}_1 = \{(x_1,x_2): x_2 < 0\}, \quad \mathcal{X}_2 = \{(x_1,x_2): x_2 > 0\}$$
The third line applies the change of variable u = z², while the integral
in the fourth line drops out: it is the total probability of an exponential
distribution, hence it equals one. The result is the standard Cauchy density.
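A short simulation sketch (assumed set-up: X1 and X2 independent standard normals, as in the derivation above) confirming that their ratio behaves like a standard Cauchy variable.

```python
import numpy as np
from scipy.stats import cauchy, kstest

rng = np.random.default_rng(0)
x1 = rng.standard_normal(200_000)
x2 = rng.standard_normal(200_000)
y = x1 / x2  # ratio of two independent standard normals

# Kolmogorov-Smirnov test against the standard Cauchy: a large p-value is expected
print(kstest(y, cauchy.cdf))
```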
Student’s t-distribution as a random ratio (1/3)
Observation 2
If Z ∼ N (0, 1) and X ∼ χ2 (ν), and the two random variables Z and
X are independent, the random variable Y obtained as
$$Y = \frac{Z}{\sqrt{X/\nu}}$$
is such that Y ∼ T (ν).
Proof.
The steps here are identical to those in the ‘normals-ratio-to-Cauchy’ case;
however, in place of X2 here is $W = \sqrt{X/\nu}$ where X ∼ χ²(ν). Thus,
one should first derive the p.d.f. of W. Note that the transformation
that defines W is increasing; the inverse is $X = g^{-1}(W) = \nu W^2$ and
thus $\frac{dx}{dw} = 2\nu w > 0$; the support stays R++ and hence:
$$f_W(w;\nu) = \frac{\nu^{\frac{\nu}{2}}}{\Gamma\!\left(\frac{\nu}{2}\right)\cdot 2^{\frac{\nu}{2}-1}}\, w^{\nu-1}\exp\!\left(-\frac{\nu w^2}{2}\right) \quad \text{for } w>0.$$
Student’s t-distribution as a random ratio (2/3)
Proof.
Because of independence, the joint density of w = (Z, W) is:
$$f_{\mathbf{w}}(z,w;\nu) = \frac{1}{\sqrt{2\pi}}\,\frac{\nu^{\frac{\nu}{2}}}{\Gamma\!\left(\frac{\nu}{2}\right)\cdot 2^{\frac{\nu}{2}-1}}\, w^{\nu-1}\exp\!\left(-\frac{z^2+\nu w^2}{2}\right)$$
and this completes the (longer) first step. Yet the second step is easier
now: it is necessary to derive the joint distribution of y = (Y, W), but
this specific transformation is already bijective and there is no need to
split the support. Similarly to the Cauchy case, the determinant of the
Jacobian is w > 0; and since Z = Y W, the joint p.d.f. of interest is:
$$f_{\mathbf{y}}(y,w;\nu) = w\cdot f_{\mathbf{w}}(yw, w;\nu) = \frac{\nu^{\frac{\nu+1}{2}}}{\Gamma\!\left(\frac{\nu}{2}\right)\cdot 2^{\frac{\nu-1}{2}}}\,\frac{1}{\sqrt{\pi\nu}}\, w^{\nu}\exp\!\left(-\frac{\left(y^2+\nu\right)w^2}{2}\right)$$
for y ∈ R and w ∈ R++. The last step is about integrating out w; this
requires a change of variable, u = w², so as to recognize the integral as
the total probability of a Gamma-distributed random variable.
Student’s t-distribution as a random ratio (3/3)
Proof.
$$\begin{aligned}
f_Y(y;\nu) &= \int_0^{+\infty} f_{Y,W}(y,w)\,dw\\
&= \frac{1}{\Gamma\!\left(\frac{\nu}{2}\right)}\frac{1}{\sqrt{\pi\nu}}\frac{\nu^{\frac{\nu+1}{2}}}{2^{\frac{\nu-1}{2}}}\int_0^{+\infty} w^{\nu}\exp\!\left(-\frac{\left(\nu+y^2\right)w^2}{2}\right)dw\\
&= \frac{1}{\Gamma\!\left(\frac{\nu}{2}\right)}\frac{1}{\sqrt{\pi\nu}}\frac{\nu^{\frac{\nu+1}{2}}}{2^{\frac{\nu+1}{2}}}\int_0^{+\infty} u^{\frac{\nu-1}{2}}\exp\!\left(-\frac{\nu}{2}\left(1+\frac{y^2}{\nu}\right)u\right)du\\
&= \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)\sqrt{\pi\nu}}\left(1+\frac{y^2}{\nu}\right)^{-\frac{\nu+1}{2}}\times\\
&\qquad\times\int_0^{+\infty}\frac{1}{\Gamma\!\left(\frac{\nu+1}{2}\right)}\left(\frac{\nu+y^2}{2}\right)^{\frac{\nu+1}{2}} u^{\frac{\nu-1}{2}}\exp\!\left(-\frac{\nu+y^2}{2}\,u\right)du\\
&= \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)\sqrt{\pi\nu}}\left(1+\frac{y^2}{\nu}\right)^{-\frac{\nu+1}{2}}
\end{aligned}$$
which is the p.d.f. of the Student’s t-distribution with ν degrees of freedom.
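The result of Observation 2 can also be checked by simulation; the sketch below (illustrative, not part of the slides) compares draws of Z/√(X/ν) with the Student's t-distribution via a Kolmogorov–Smirnov test.

```python
import numpy as np
from scipy.stats import t, kstest

rng = np.random.default_rng(1)
nu = 5
z = rng.standard_normal(200_000)
x = rng.chisquare(nu, 200_000)
y = z / np.sqrt(x / nu)  # the ratio defined in Observation 2

# Compare against the Student's t distribution with nu degrees of freedom
print(kstest(y, lambda q: t.cdf(q, df=nu)))
```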
Observation 3
If X1 ∼ χ2 (ν1 ) and X2 ∼ χ2 (ν2 ), and the two random variables X1
and X2 are independent, the random variable Y obtained as
$$Y = \frac{X_1/\nu_1}{X_2/\nu_2}$$
is such that Y follows Snedecor’s F-distribution with (ν1, ν2) degrees of
freedom, Y ∼ F (ν1, ν2).
Observation 4
If X1 ∼ Γ (α, γ) and X2 ∼ Γ (β, γ), and the two random variables X1
and X2 are independent, the random variables Y and W obtained as
$$Y = \frac{X_1}{X_1+X_2}, \qquad W = X_1+X_2$$
are such that Y ∼ Beta (α, β) and W ∼ Γ (α + β, γ), with Y and W
independent of each other.
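A simulation sketch of Observation 4 (illustrative shape and scale values), checking that X1/(X1 + X2) is Beta-distributed and roughly uncorrelated with X1 + X2.

```python
import numpy as np
from scipy.stats import beta, kstest

rng = np.random.default_rng(2)
a, b, scale = 2.0, 3.0, 1.5
x1 = rng.gamma(shape=a, scale=scale, size=200_000)
x2 = rng.gamma(shape=b, scale=scale, size=200_000)
y = x1 / (x1 + x2)

# Y should follow a Beta(a, b) distribution regardless of the common scale
print(kstest(y, lambda q: beta.cdf(q, a, b)))
print(np.corrcoef(y, x1 + x2)[0, 1])  # near zero: Y and W are independent
```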
Definition 12
Covariance. For any two random variables Xk and X` belonging to
a random vector x, their covariance is defined as the expectation of
a particular function of Xk and X` , i.e. the product of both variables’
deviations from their respective means.
$$\mathrm{Cov}[X_k, X_\ell] = E\left[(X_k - E[X_k])(X_\ell - E[X_\ell])\right]$$
The correlation between Xk and Xℓ is defined in turn as the covariance
scaled by the two standard deviations:
$$\mathrm{Corr}[X_k, X_\ell] = \frac{\mathrm{Cov}[X_k, X_\ell]}{\sqrt{\mathrm{Var}[X_k]}\sqrt{\mathrm{Var}[X_\ell]}}$$
Properties of Correlations (1/3)
Theorem 4
Properties of Correlation. For any two random variables X and Y ,
it is:
a. Corr [X, Y ] ∈ [−1, 1], and
b. |Corr [X, Y ]| = 1 if and only if there are some real numbers a ≠ 0
and b such that P (Y = aX + b) = 1. If Corr [X, Y ] = 1 then it is
a > 0, while if Corr [X, Y ] = −1 it is a < 0.
Proof.
Define the following function:
$$C(t) = E\left[\left((X - E[X])\cdot t + (Y - E[Y])\right)^2\right] = \mathrm{Var}[X]\cdot t^2 + 2\,\mathrm{Cov}[X,Y]\cdot t + \mathrm{Var}[Y]$$
Theorem 4
Proof.
(Continued.) Since C(t) is nonnegative for every t (being the expectation
of a square), the quadratic above cannot have two distinct real roots;
hence its discriminant cannot be positive:
$$\left(2\,\mathrm{Cov}[X,Y]\right)^2 - 4\,\mathrm{Var}[X]\,\mathrm{Var}[Y] \le 0,$$
or, equivalently:
$$-\sqrt{\mathrm{Var}[X]}\sqrt{\mathrm{Var}[Y]} \le \mathrm{Cov}[X,Y] \le \sqrt{\mathrm{Var}[X]}\sqrt{\mathrm{Var}[Y]},$$
which proves a. As for b., |Corr[X,Y]| = 1 holds exactly when the
discriminant is zero, i.e. when C(t) = 0 for some value of t; since C(t) is
the expectation of a squared quantity, this is equivalent to:
$$P\left((X - E[X])\,t + (Y - E[Y]) = 0\right) = 1,$$
that is, P (Y = aX + b) = 1 with
$$a = -t, \qquad b = E[X]\cdot t + E[Y],$$
where the root in question is
$$t = -\frac{\mathrm{Cov}[X,Y]}{\mathrm{Var}[X]},$$
so that $a = \mathrm{Cov}[X,Y]/\mathrm{Var}[X]$ shares the sign of the correlation, which
delivers the sign statements in b.
For any two independent random variables X and Y whose moments
exist, Theorem 5 establishes that:
$$E[XY] = E[X]\,E[Y]$$
Corollary 2
(Theorem 5.) Given any transformations U = gX (X), V = gY (Y )
of two independent random variables X and Y whose moments exist,
it is:
$$E[UV] = E[U]\,E[V]$$
because U and V are also independent; this also implies that U and V
have zero covariance and correlation, and that all higher moments of X
and Y inherit this property as well. For example:
$$\mathrm{Var}\left[(X-E[X])(Y-E[Y])\right] = E\left[(X-E[X])^2(Y-E[Y])^2\right] = E\left[(X-E[X])^2\right]E\left[(Y-E[Y])^2\right] = \mathrm{Var}[X]\,\mathrm{Var}[Y]$$
which is best seen by setting U = (X − E[X])² and V = (Y − E[Y])², and
by noting that the centered product (X − E[X])(Y − E[Y]) has zero mean
(its expectation is the covariance).
Covariance in the bivariate normal (1/4)
In a bivariate normal distribution, the following holds.
$$E[X_1X_2] = \rho\sigma_1\sigma_2 + \mu_1\mu_2$$
To show this, one can work with the transformation y = (Y, Z), where
Z = (X1 − µ1)/σ1 is the first standardized variable and Y is the product
of the two standardized variables; the inverse transformation is:
$$X_1 = g_1^{-1}(Y,Z) = \sigma_1 Z + \mu_1, \qquad X_2 = g_2^{-1}(Y,Z) = \sigma_2\,\frac{Y}{Z} + \mu_2$$
Covariance in the bivariate normal (2/4)
Here the Jacobian has the following absolute determinant:
$$\left|\det\left(\frac{\partial}{\partial\mathbf{y}^{T}}\,\mathbf{g}^{-1}(y,z)\right)\right| = \left|\det\begin{pmatrix} 0 & \sigma_1\\[4pt] \dfrac{\sigma_2}{z} & -\dfrac{\sigma_2 y}{z^2}\end{pmatrix}\right| = \frac{\sigma_1\sigma_2}{|z|} = \frac{\sigma_1\sigma_2}{\sqrt{z^2}}$$
thus, by Theorem 1 the joint p.d.f. of y = (Y, Z) is as follows.
$$f_{Y,Z}(y,z;\rho) = \frac{1}{2\pi\sqrt{(1-\rho^2)\,z^2}}\exp\!\left(-\frac{z^2 - 2\rho y + y^2 z^{-2}}{2(1-\rho^2)}\right) = \phi(z)\cdot\frac{1}{\sqrt{2\pi(1-\rho^2)\,z^2}}\exp\!\left(-\frac{\left(y-\rho z^2\right)^2}{2(1-\rho^2)\,z^2}\right)$$
[Figure: three-dimensional plot of the bivariate normal joint density $f_{X_1,X_2}(x_1,x_2)$ with the marginal densities $f_{X_1}(x_1)$ and $f_{X_2}(x_2)$.]
Drawing the bivariate normal: ρ = 0
[Figure: three-dimensional plot of the bivariate normal joint density $f_{X_1,X_2}(x_1,x_2)$ with the marginal densities $f_{X_1}(x_1)$ and $f_{X_2}(x_2)$, drawn for ρ = 0.]
Collecting terms
It is convenient to summarize all moments of a random vector of
the same order through compact notation. To begin with, the
mean vector E [x] gathers all the means.
$$E[\mathbf{x}] = \begin{pmatrix} E[X_1]\\ E[X_2]\\ \vdots\\ E[X_K]\end{pmatrix}$$
The variance-covariance matrix Var [x] instead collects its
namesakes; it can be seen as the expectation of a random matrix.
$$\mathrm{Var}[\mathbf{x}] = E\left[(\mathbf{x}-E[\mathbf{x}])(\mathbf{x}-E[\mathbf{x}])^{T}\right] = \begin{pmatrix} \mathrm{Var}[X_1] & \mathrm{Cov}[X_1,X_2] & \cdots & \mathrm{Cov}[X_1,X_K]\\ \mathrm{Cov}[X_2,X_1] & \mathrm{Var}[X_2] & \cdots & \mathrm{Cov}[X_2,X_K]\\ \vdots & \vdots & \ddots & \vdots\\ \mathrm{Cov}[X_K,X_1] & \mathrm{Cov}[X_K,X_2] & \cdots & \mathrm{Var}[X_K]\end{pmatrix}$$
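A minimal numerical sketch (illustrative parameter values) of how the mean vector and the variance-covariance matrix are estimated from draws of a random vector, mirroring the definitions above.

```python
import numpy as np

rng = np.random.default_rng(3)
# Simulated draws of a random vector of length K = 3 (illustrative data)
x = rng.multivariate_normal(mean=[0.0, 1.0, -2.0],
                            cov=[[2.0, 0.5, 0.0],
                                 [0.5, 1.0, -0.3],
                                 [0.0, -0.3, 1.5]],
                            size=100_000)

mean_vec = x.mean(axis=0)            # sample analogue of E[x]
dev = x - mean_vec
cov_mat = dev.T @ dev / x.shape[0]   # sample analogue of E[(x - E[x])(x - E[x])^T]

print(mean_vec)
print(cov_mat)                       # close to np.cov(x, rowvar=False)
```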
Properties of variance-covariance matrices
• Var [x] has dimension K × K and is symmetric.
x ∼ N (µ, Σ)
Moments of linear transformations (1/3)
Like in the univariate case, it is useful to derive the moments of
a transformed random vector in terms of the original moments.
$$Y = \mathbf{a}^{T}\mathbf{x} = a_1X_1 + \dots + a_KX_K$$
$$\mathbf{y} = \mathbf{g}(\mathbf{x}) = (Y_1,\dots,Y_J)^{T}$$
Definition 15
Characteristic function (multivariate). Given a random vector
x = (X1 , . . . , XK ) with support X, the characteristic function ϕx (t) is
given by the expectation of the transformation $g(\mathbf{x}) = \exp\!\left(i\,\mathbf{t}^{T}\mathbf{x}\right)$, for
t = (t1 , . . . , tK ) ∈ $\mathbb{R}^K$.
$$\varphi_{\mathbf{x}}(\mathbf{t}) = E\left[\exp\!\left(i\,\mathbf{t}^{T}\mathbf{x}\right)\right] = E\left[\exp\!\left(i\sum_{k=1}^{K}t_kX_k\right)\right]$$
Generating multivariate moments
The r-th moments of each k-th element of the random vector x can be
calculated in analogy with the univariate case.
$$E[X_k^r] = \left.\frac{\partial^r M_{\mathbf{x}}(\mathbf{t})}{\partial t_k^r}\right|_{\mathbf{t}=\mathbf{0}} = \frac{1}{i^r}\cdot\left.\frac{\partial^r \varphi_{\mathbf{x}}(\mathbf{t})}{\partial t_k^r}\right|_{\mathbf{t}=\mathbf{0}}$$
Cross-moments are obtained through cross-derivatives; in the bivariate
normal case, for example:
$$\frac{\partial^2}{\partial t_1\partial t_2}M_{X_1,X_2}(t_1,t_2) = \left[\left(\mu_1 + t_1\sigma_1^2 + t_2\rho\sigma_1\sigma_2\right)\left(\mu_2 + t_2\sigma_2^2 + t_1\rho\sigma_1\sigma_2\right) + \rho\sigma_1\sigma_2\right]\times$$
$$\times\exp\!\left(t_1\mu_1 + t_2\mu_2 + \frac{1}{2}\left(t_1^2\sigma_1^2 + 2t_1t_2\rho\sigma_1\sigma_2 + t_2^2\sigma_2^2\right)\right)$$
and by evaluating the above expression for t1 = 0 and t2 = 0:
$$E[X_1X_2] = \left.\frac{\partial^2}{\partial t_1\partial t_2}M_{X_1,X_2}(t_1,t_2)\right|_{t_1,t_2=0} = \rho\sigma_1\sigma_2 + \mu_1\mu_2$$
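The cross-derivative computation above can be reproduced symbolically; the sketch below uses SymPy and the bivariate normal m.g.f. as written in the display.

```python
import sympy as sp

t1, t2, mu1, mu2, s1, s2, rho = sp.symbols('t1 t2 mu1 mu2 sigma1 sigma2 rho')

# Moment generating function of the bivariate normal distribution
M = sp.exp(t1*mu1 + t2*mu2
           + sp.Rational(1, 2)*(t1**2*s1**2 + 2*t1*t2*rho*s1*s2 + t2**2*s2**2))

cross = sp.diff(M, t1, t2)          # second cross-derivative
EX1X2 = cross.subs({t1: 0, t2: 0})  # evaluate at t = 0

print(sp.simplify(EX1X2))           # rho*sigma1*sigma2 + mu1*mu2
```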
Moment generation and independence (1/2)
Theorem 6
Moment generating functions and characteristic functions of
independent random variables. If the random variables belonging
to a random vector x = (X1 , . . . , XK ) are pairwise independent, the
moment generating function of x (if it exists) and the characteristic
function of x equal the product of the K moment generating functions
(if they exist) and the K characteristic functions of the K random
variables involved, respectively.
$$M_{\mathbf{x}}(\mathbf{t}) = \prod_{k=1}^{K}M_{X_k}(t_k)$$
$$\varphi_{\mathbf{x}}(\mathbf{t}) = \prod_{k=1}^{K}\varphi_{X_k}(t_k)$$
Moment generation and independence (2/2)
Proof.
This is an application of Theorem 3 and Theorem 5 to a sequence of K
transformed random variables, exp (t1X1), . . . , exp (tKXK), which are
themselves mutually independent. For m.g.f.s:
$$M_{\mathbf{x}}(\mathbf{t}) = E\left[\exp\!\left(\mathbf{t}^{T}\mathbf{x}\right)\right] = E\left[\exp\!\left(\sum_{k=1}^{K}t_kX_k\right)\right] = E\left[\prod_{k=1}^{K}\exp(t_kX_k)\right] = \prod_{k=1}^{K}E\left[\exp(t_kX_k)\right] = \prod_{k=1}^{K}M_{X_k}(t_k)$$
Observation 5
If all the N random variables in the vector (X1 , . . . , XN ) are pairwise
independent and Xi ∼ Be (p) for i = 1, . . . , N then:
$$\sum_{i=1}^{N}X_i \sim \mathrm{BN}(p, N).$$
Proof.
If MXi (t) = p exp (t) + (1 − p), it suffices to multiply the N identical
moment generating functions:
$$M_{\sum_{i=1}^{N}X_i}(t) = \left[p\exp(t) + (1-p)\right]^{N}$$
which is recognized as the m.g.f. of the binomial distribution.
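A quick simulation sketch of Observation 5 (illustrative N and p): sums of independent Bernoulli draws have the Binomial(N, p) probability mass function.

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(4)
N, p = 10, 0.3
bern = rng.binomial(1, p, size=(200_000, N))  # N independent Bernoulli(p) draws per row
sums = bern.sum(axis=1)

# Empirical frequencies of the sum versus the Binomial(N, p) p.m.f.
emp = np.bincount(sums, minlength=N + 1) / len(sums)
print(np.round(emp, 4))
print(np.round(binom.pmf(np.arange(N + 1), N, p), 4))
```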
Observation 6
If Xi ∼ NB (p, 1), it is $\sum_{i=1}^{N}X_i \sim \mathrm{NB}(p, N)$.
Observation 7
If Xi ∼ Pois (λ), it is $\sum_{i=1}^{N}X_i \sim \mathrm{Pois}(N\lambda)$.
Observation 8
If Xi ∼ Exp (λ), it is $\sum_{i=1}^{N}X_i \sim \Gamma\!\left(N, \lambda^{-1}\right)$.
Observation 9
If Xi ∼ χ² (κi), it is $\sum_{i=1}^{N}X_i \sim \chi^2\!\left(\sum_{i=1}^{N}\kappa_i\right)$.
Observation 10
If Xi ∼ Γ (αi, β), it is $\sum_{i=1}^{N}X_i \sim \Gamma\!\left(\sum_{i=1}^{N}\alpha_i, \beta\right)$.
Sums of independent random variables (3/6)
Observation 11
If Xi ∼ N (µi, σi²), for all real ai, bi it is as follows.
$$Y = \sum_{i=1}^{N}(a_i + b_iX_i) \sim N\!\left(\sum_{i=1}^{N}(a_i + b_i\mu_i),\; \sum_{i=1}^{N}b_i^2\sigma_i^2\right)$$
Proof.
This requires a few steps:
$$M_Y(t) = \exp\!\left(t\sum_{i=1}^{N}a_i\right)\prod_{i=1}^{N}M_{X_i}(b_it) = \exp\!\left(t\sum_{i=1}^{N}a_i\right)\exp\!\left(t\sum_{i=1}^{N}b_i\mu_i + t^2\sum_{i=1}^{N}\frac{b_i^2\sigma_i^2}{2}\right) = \exp\!\left(t\sum_{i=1}^{N}(a_i + b_i\mu_i) + t^2\sum_{i=1}^{N}\frac{b_i^2\sigma_i^2}{2}\right)$$
and the last expression is the m.g.f. of the normal distribution in question.
Observation 12
If log (Xi) ∼ N (µi, σi²), for all real ai, bi it is as follows.
$$\log\!\left(\prod_{i=1}^{N}\exp(a_i)\,X_i^{b_i}\right) \sim N\!\left(\sum_{i=1}^{N}(a_i + b_i\mu_i),\; \sum_{i=1}^{N}b_i^2\sigma_i^2\right)$$
Proof.
Since $\log\!\left(\prod_{i=1}^{N}\exp(a_i)\,X_i^{b_i}\right) = \sum_{i=1}^{N}\left[a_i + b_i\log(X_i)\right]$, the previous
observation extends easily.
Sums of independent random variables (5/6)
Observation 13
If X1 ∼ Exp (λ1 ) and X2 ∼ Exp (λ2 ) are independent, the random
variable Y = X1 /λ1 − X2 /λ2 is such that Y ∼ Laplace (0, 1).
Proof.
Define the two random variables W1 = X1/λ1 and W2 = −X2/λ2,
which obviously are independent. By the properties of m.g.f.s for linear
transformations, the m.g.f.s of the two transformed random variables are:
$$M_{W_1}(t) = (1-t)^{-1}, \qquad M_{W_2}(t) = (1+t)^{-1}$$
so that, by Theorem 6, the m.g.f. of Y = W1 + W2 is their product:
$$M_Y(t) = (1-t)^{-1}(1+t)^{-1} = \left(1-t^2\right)^{-1},$$
i.e. the Laplace’s m.g.f. sought after (note that Γ (2) = 1! = 1).
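A simulation sketch of Observation 13; note that it assumes the parametrization implied by the m.g.f.s above, in which X/λ is a standard exponential variable (λ acting as a scale parameter).

```python
import numpy as np
from scipy.stats import laplace, kstest

rng = np.random.default_rng(5)
lam1, lam2 = 2.0, 0.5
# With this parametrization, X/lambda is a standard exponential variable
x1 = rng.exponential(scale=lam1, size=200_000)
x2 = rng.exponential(scale=lam2, size=200_000)
y = x1 / lam1 - x2 / lam2

# Compare against the standard Laplace distribution
print(kstest(y, laplace.cdf))
```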
Fixing realizations
• It is often useful to analyze certain random variables of a
random vector when the realizations of the other random
variables are held constant, or “fixed.”
• One can also “fix” subsets of the support – not just single
realizations; this case is left aside for the moment.
Definition 16
Conditional mass or density function. Consider the combined ran-
dom vector (x, y) with joint mass/density function fx,y (x, y). Suppose
that the random vector x has a probability mass or density function
fx (x). The conditional mass or density function of y, given x = x, is
defined as follows for all x ∈ X:
$$f_{\mathbf{y}|\mathbf{x}}(\mathbf{y}\,|\,\mathbf{x}=\mathbf{x}) = \frac{f_{\mathbf{x},\mathbf{y}}(\mathbf{x},\mathbf{y})}{f_{\mathbf{x}}(\mathbf{x})}$$
For example, in the bivariate normal distribution the conditional density
of X1 given X2 = x2 is:
$$f_{X_1|X_2}(x_1\,|\,X_2=x_2) = \frac{1}{\sqrt{2\pi\sigma_1^2(1-\rho^2)}}\exp\!\left(-\frac{\left[x_1 - \mu_1 - \rho\frac{\sigma_1}{\sigma_2}(x_2-\mu_2)\right]^2}{2\sigma_1^2(1-\rho^2)}\right)$$
Note that if x and y are independent, conditioning is immaterial:
$$f_{\mathbf{y}|\mathbf{x}}(\mathbf{y}\,|\,\mathbf{x}) = f_{\mathbf{y}}(\mathbf{y}), \qquad f_{\mathbf{x}|\mathbf{y}}(\mathbf{x}\,|\,\mathbf{y}) = f_{\mathbf{x}}(\mathbf{x})$$
The conditional variance of y given x is defined analogously: in the
discrete case,
$$\mathrm{Var}[\mathbf{y}|\mathbf{x}] = \sum_{y_1\in\mathcal{Y}_1}\cdots\sum_{y_{K_y}\in\mathcal{Y}_{K_y}}(\mathbf{y}-E[\mathbf{y}|\mathbf{x}])(\mathbf{y}-E[\mathbf{y}|\mathbf{x}])^{T}\, f_{\mathbf{y}|\mathbf{x}}(\mathbf{y}|\mathbf{x})$$
and in the continuous case,
$$\mathrm{Var}[\mathbf{y}|\mathbf{x}] = \int_{\mathcal{Y}}(\mathbf{y}-E[\mathbf{y}|\mathbf{x}])(\mathbf{y}-E[\mathbf{y}|\mathbf{x}])^{T}\, f_{\mathbf{y}|\mathbf{x}}(\mathbf{y}|\mathbf{x})\, d\mathbf{y}.$$
Conditional moments such as E[X1|X2], E[y|x], Var[X1|X2] and Var[y|x],
seen as functions of the conditioning variables, are themselves random.
Law of Iterated Expectations
Theorem 8
Law of Iterated Expectations. Given any two random vectors x
and y, it is:
E [y] = Ex [E [ y| x]]
where Ex [·] denotes an expectation taken over the support of x.
Proof.
In the continuous case, apply the following decomposition:
$$E[\mathbf{y}] = \int_{\mathcal{X}}\int_{\mathcal{Y}}\mathbf{y}\, f_{\mathbf{x},\mathbf{y}}(\mathbf{x},\mathbf{y})\, d\mathbf{y}\, d\mathbf{x} = \int_{\mathcal{X}}\int_{\mathcal{Y}}\mathbf{y}\, f_{\mathbf{y}|\mathbf{x}}(\mathbf{y}|\mathbf{x})\, f_{\mathbf{x}}(\mathbf{x})\, d\mathbf{y}\, d\mathbf{x} = \int_{\mathcal{X}} f_{\mathbf{x}}(\mathbf{x})\left[\int_{\mathcal{Y}}\mathbf{y}\, f_{\mathbf{y}|\mathbf{x}}(\mathbf{y}|\mathbf{x})\, d\mathbf{y}\right] d\mathbf{x} = E_{\mathbf{x}}\left[E[\mathbf{y}|\mathbf{x}]\right]$$
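A minimal simulation sketch of the Law of Iterated Expectations with a made-up conditional mean function E[Y|X] = 2X and X uniform on {1, 2, 3, 4}.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 500_000
x = rng.choice([1, 2, 3, 4], size=n)        # a discrete conditioning variable
y = 2.0 * x + rng.standard_normal(n)        # E[Y | X = x] = 2x by construction

e_y = y.mean()                              # direct estimate of E[Y]
cond_means = np.array([y[x == v].mean() for v in [1, 2, 3, 4]])
e_e_y_given_x = (cond_means * 0.25).sum()   # E_X[E[Y | X]], X uniform on {1, ..., 4}

print(e_y, e_e_y_given_x)                   # both close to 5.0
```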
E [ Yi | Xi ] = β0 + β1 Xi
This is equivalent to the following expression.
E [ Yi − β0 − β1 Xi | Xi ] = 0
[Figure: conditional densities $f_{Y_i|X_i}(y_i|x_i)$ at different values of $x_i$, with the conditional expectation function $E[Y_i|X_i]$ running through them.]
Law of Total Variance
Theorem 9
Law of Total Variance (variance decomposition). Given any two
random vectors x and y:
$$\mathrm{Var}[\mathbf{y}] = E_{\mathbf{x}}\left[\mathrm{Var}[\mathbf{y}|\mathbf{x}]\right] + \mathrm{Var}_{\mathbf{x}}\left[E[\mathbf{y}|\mathbf{x}]\right]$$
where Varx [·] and Ex [·] denote sums/integrals taken over the support
of x for every element of the argument vectors/matrices.
Proof.
$$\begin{aligned}
\mathrm{Var}[\mathbf{y}] &= E\left[\mathbf{y}\mathbf{y}^{T}\right] - E[\mathbf{y}]\,E[\mathbf{y}]^{T}\\
&= E_{\mathbf{x}}\left[E\left[\mathbf{y}\mathbf{y}^{T}\middle|\mathbf{x}\right]\right] - E_{\mathbf{x}}\left[E[\mathbf{y}|\mathbf{x}]\right]\left[E_{\mathbf{x}}\left[E[\mathbf{y}|\mathbf{x}]\right]\right]^{T}\\
&= E_{\mathbf{x}}\left[E\left[\mathbf{y}\mathbf{y}^{T}\middle|\mathbf{x}\right] - E[\mathbf{y}|\mathbf{x}]\,E[\mathbf{y}|\mathbf{x}]^{T}\right]\\
&\qquad + E_{\mathbf{x}}\left[E[\mathbf{y}|\mathbf{x}]\,E[\mathbf{y}|\mathbf{x}]^{T}\right] - E_{\mathbf{x}}\left[E[\mathbf{y}|\mathbf{x}]\right]\left[E_{\mathbf{x}}\left[E[\mathbf{y}|\mathbf{x}]\right]\right]^{T}\\
&= E_{\mathbf{x}}\left[\mathrm{Var}[\mathbf{y}|\mathbf{x}]\right] + \mathrm{Var}_{\mathbf{x}}\left[E[\mathbf{y}|\mathbf{x}]\right]
\end{aligned}$$
Y | X = 1 ∼ N (3, 1.5)
Y | X = 2 ∼ N (4.5, 2.5)
Y | X = 3 ∼ N (5.5, 3)
Y | X = 4 ∼ N (7, 4)
[Figure: the four conditional densities of Y given X = 1, 2, 3, 4, plotted over the support of Y.]
Example: income groups (3/4)
$$E[Y] = E_X\left[E[Y|X]\right] = \frac{1}{4}\sum_{x=1}^{4}E[Y|X=x] = 5$$
$$\mathrm{Var}_X\left[E[Y|X]\right] = \frac{1}{4}\sum_{x=1}^{4}\left(E[Y|X=x] - E_X\left[E[Y|X]\right]\right)^2 = 2.125$$
$$E_X\left[\mathrm{Var}[Y|X]\right] = \frac{1}{4}\sum_{x=1}^{4}\mathrm{Var}[Y|X=x] = 2.75$$
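The arithmetic above, together with the Law of Total Variance, can be reproduced in a few lines; the sketch below also reports the implied total variance, 2.125 + 2.75 = 4.875.

```python
import numpy as np

cond_means = np.array([3.0, 4.5, 5.5, 7.0])   # E[Y | X = x] for x = 1, ..., 4
cond_vars = np.array([1.5, 2.5, 3.0, 4.0])    # Var[Y | X = x]
probs = np.full(4, 0.25)                      # X uniform over the four groups

e_y = probs @ cond_means                      # = 5
var_of_means = probs @ (cond_means - e_y)**2  # = 2.125
mean_of_vars = probs @ cond_vars              # = 2.75

print(e_y, var_of_means, mean_of_vars, var_of_means + mean_of_vars)  # total Var[Y] = 4.875
```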
where the multinomial coefficient $n!\cdot\prod_{k=1}^{K}(x_k!)^{-1}$ counts
the number of realizations containing exactly (x1 , . . . , xK )
successes for each alternative out of n draws.
• The joint c.d.f. sums the mass function over points in the
support as follows, for t = (t1 , . . . , tK ).
$$F_{\mathbf{x}}(x_1,\dots,x_K; n, p_1,\dots,p_K) = \sum_{\mathbf{t}\in\mathcal{X}:\,\mathbf{t}\le\mathbf{x}}\frac{n!}{\prod_{k=1}^{K}t_k!}\prod_{k=1}^{K}p_k^{t_k}$$
The multinomial distribution (3/4)
• The distribution owes its name to the multinomial theorem,
which helps show that the total probability mass equals 1.
$$P(\mathbf{x}\in\mathcal{X}) = \sum_{\mathbf{x}\in\mathcal{X}}\frac{n!}{\prod_{k=1}^{K}x_k!}\prod_{k=1}^{K}p_k^{x_k} = \left(\sum_{k=1}^{K}p_k\right)^{n} = 1$$
• For each k = 1, . . . , K it is:
$$E[X_k] = np_k, \qquad \mathrm{Var}[X_k] = np_k(1-p_k)$$
and for all k, ℓ = 1, . . . , K with k ≠ ℓ:
$$\mathrm{Cov}[X_k, X_\ell] = -np_kp_\ell$$
and the covariance is always negative, because an increasing
number of successes for one alternative implies a decreasing
number for another alternative.
• These moments can be expressed in compact notation as:
$$E[\mathbf{x}] = n\mathbf{p}, \qquad \mathrm{Var}[\mathbf{x}] = n\left(\mathrm{diag}(\mathbf{p}) - \mathbf{p}\mathbf{p}^{T}\right)$$
where $\mathbf{p} \equiv \begin{pmatrix}p_1 & p_2 & \cdots & p_K\end{pmatrix}^{T}$.
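A simulation sketch (illustrative n and p) checking the multinomial moment formulas above.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 20, np.array([0.2, 0.5, 0.3])
draws = rng.multinomial(n, p, size=200_000)

print(draws.mean(axis=0))                # close to n * p
print(np.cov(draws, rowvar=False))       # close to n * (diag(p) - p p^T)
print(n * (np.diag(p) - np.outer(p, p)))
```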
The multivariate normal distribution (1/8)
• A random vector x = (X1 , . . . , XK ) of length K follows the
multivariate normal distribution with support X = RK if
its joint probability density function is as follows.
$$f_{\mathbf{x}}(\mathbf{x};\boldsymbol{\mu},\boldsymbol{\Sigma}) = \frac{1}{\sqrt{(2\pi)^{K}\,|\boldsymbol{\Sigma}|}}\exp\!\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{T}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)$$
E [x] = µ
Var [x] = Σ
• Therefore, for any linear transformation y = a + Bx:
$$\mathbf{y} \sim N\!\left(\mathbf{B}\boldsymbol{\mu} + \mathbf{a},\; \mathbf{B}\boldsymbol{\Sigma}\mathbf{B}^{T}\right)$$
even if the random variables in x are dependent.
The multivariate normal distribution (6/8)
• Suppose that x = (x1 , x2 ) can be split into two subvectors
x1 and x2 of length K1 and K2 respectively; K1 + K2 = K.
What are the marginal, conditional distributions of x1 , x2 ?
Partition the parameters conformably as:
$$\boldsymbol{\mu} = \begin{pmatrix}\boldsymbol{\mu}_1\\ \boldsymbol{\mu}_2\end{pmatrix}, \qquad \boldsymbol{\Sigma} = \begin{pmatrix}\boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12}\\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22}\end{pmatrix}$$
where:
• µ1 is a vector of length K1 , µ2 one of length K2 ;
• Σ11 a symmetric K1 × K1 matrix, Σ22 a symmetric
K2 × K2 matrix;
• while Σ12 and Σ21 are two matrices, where one is the
transpose of the other, having dimension K1 × K2 and
K2 × K1 respectively.
The multivariate normal distribution (7/8)
where:
$$\boldsymbol{\Sigma}_1 \equiv \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}, \qquad \boldsymbol{\Sigma}_2 \equiv \boldsymbol{\Sigma}_{22} - \boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\Sigma}_{12}$$
and:
$$|\boldsymbol{\Sigma}| = |\boldsymbol{\Sigma}_1|\cdot|\boldsymbol{\Sigma}_{22}| = |\boldsymbol{\Sigma}_2|\cdot|\boldsymbol{\Sigma}_{11}|$$
relating the determinant of Σ to those of the matrices ex-
pressing its partitioned inverse.
The multivariate normal distribution (8/8)
• All this lets rewrite the p.d.f. of x in (very) “long” form as:
$$f_{\mathbf{x}}(\mathbf{x}_1,\mathbf{x}_2;\boldsymbol{\mu}_1,\boldsymbol{\mu}_2,\boldsymbol{\Sigma}_{11},\boldsymbol{\Sigma}_{12},\boldsymbol{\Sigma}_{21},\boldsymbol{\Sigma}_{22}) = \frac{1}{\sqrt{(2\pi)^{K}\,|\boldsymbol{\Sigma}_1|\cdot|\boldsymbol{\Sigma}_{22}|}}\times$$
$$\times\exp\!\left(\frac{1}{2}(\mathbf{x}_1-\boldsymbol{\mu}_1)^{T}\boldsymbol{\Sigma}_1^{-1}\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{x}_2-\boldsymbol{\mu}_2) - \frac{1}{2}(\mathbf{x}_1-\boldsymbol{\mu}_1)^{T}\boldsymbol{\Sigma}_1^{-1}(\mathbf{x}_1-\boldsymbol{\mu}_1)\right.$$
$$\left. + \frac{1}{2}(\mathbf{x}_2-\boldsymbol{\mu}_2)^{T}\boldsymbol{\Sigma}_2^{-1}\boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}(\mathbf{x}_1-\boldsymbol{\mu}_1) - \frac{1}{2}(\mathbf{x}_2-\boldsymbol{\mu}_2)^{T}\boldsymbol{\Sigma}_2^{-1}(\mathbf{x}_2-\boldsymbol{\mu}_2)\right)$$
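A numerical sketch (random illustrative parameters) verifying that the “long” form of the density above coincides with the compact multivariate normal density, with Σ1 and Σ2 computed as indicated.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(8)
K1, K2 = 2, 2
A = rng.standard_normal((K1 + K2, K1 + K2))
Sigma = A @ A.T + np.eye(K1 + K2)          # a random positive definite covariance
mu = rng.standard_normal(K1 + K2)

S11, S12 = Sigma[:K1, :K1], Sigma[:K1, K1:]
S21, S22 = Sigma[K1:, :K1], Sigma[K1:, K1:]
S1 = S11 - S12 @ np.linalg.inv(S22) @ S21  # the Sigma_1, Sigma_2 defined above
S2 = S22 - S21 @ np.linalg.inv(S11) @ S12

x = rng.standard_normal(K1 + K2)
d1, d2 = x[:K1] - mu[:K1], x[K1:] - mu[K1:]

expo = (0.5 * d1 @ np.linalg.inv(S1) @ S12 @ np.linalg.inv(S22) @ d2
        - 0.5 * d1 @ np.linalg.inv(S1) @ d1
        + 0.5 * d2 @ np.linalg.inv(S2) @ S21 @ np.linalg.inv(S11) @ d1
        - 0.5 * d2 @ np.linalg.inv(S2) @ d2)
long_form = np.exp(expo) / np.sqrt((2 * np.pi) ** (K1 + K2)
                                   * np.linalg.det(S1) * np.linalg.det(S22))

print(long_form)
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))  # same value
```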