Multiple Random Variables

The document discusses the joint distribution of a pair of random variables X and Y. It defines the joint distribution function FXY(x,y) as the probability that X is less than or equal to x and Y is less than or equal to y. FXY(x,y) must satisfy certain properties, including being non-decreasing in each argument and satisfying FXY(x2,y2) - FXY(x2,y1) - FXY(x1,y2) + FXY(x1,y1) ≥ 0 for all x1 < x2 and y1 < y2. The joint distribution fully characterizes the probability of the random vector (X,Y).


A pair of random variables

◮ Let X, Y be random variables on the same probability


space (Ω, F, P )
◮ Each of X, Y maps Ω to ℜ.
◮ We can think of the pair of random variables as a
vector-valued function that maps Ω to ℜ2 .

(Figure: the pair (X, Y ) maps the sample space Ω into ℜ2 )

P S Sastry, IISc, E1 222 Aug 2021 1/248


◮ Just as in the case of a single rv, we can think of the
induced probability space for the case of a pair of rv’s too.
◮ That is, by defining the pair of random variables, we
essentially create a new probability space with sample
space being ℜ2 .
◮ The events now would be the Borel subsets of ℜ2 .
◮ Recall that ℜ2 is cartesian product of ℜ with itself.
◮ So, we can create Borel subsets of ℜ2 by cartesian
product of Borel subsets of ℜ.

B 2 = σ ({B1 × B2 : B1 , B2 ∈ B})

where B is the Borel σ-algebra we considered earlier, and


B 2 is the set of Borel sets of ℜ2 .



◮ Recall that B is the smallest σ-algebra containing all
intervals.
◮ Let I1 , I2 ⊂ ℜ be intervals. Then I1 × I2 ⊂ ℜ2 is known
as a cylindrical set.
(Figure: the cylindrical set [a, b] × [c, d] in ℜ2 )

◮ B 2 is the smallest σ-algebra containing all cylindrical sets.


◮ We saw that B is also the smallest σ-algebra containing
all intervals of the form (−∞, x].
◮ Similarly B 2 is the smallest σ-algebra containing
cylindrical sets of the form (−∞, x] × (−∞, y].
◮ Let X, Y be random variables on the probability space
(Ω, F, P )
◮ This gives rise to a new probability space (ℜ2 , B 2 , PXY )
with PXY given by

PXY (B) = P [(X, Y ) ∈ B], ∀B ∈ B 2


= P ({ω : (X(ω), Y (ω)) ∈ B})

(Here, B ⊂ ℜ2 )
◮ Recall that for a single rv, the resulting probability space
is (ℜ, B, PX ) with

PX (B) = P [X ∈ B] = P ({ω : X(ω) ∈ B})

(Here, B ⊂ ℜ)



◮ In the case of a single rv, we define a distribution
function, FX , which essentially assigns probability to all
intervals of the form (−∞, x].
◮ This FX uniquely determines PX (B) for all Borel sets, B.
◮ In a similar manner we define a joint distribution function
FXY for a pair of random variables.
◮ FXY (x, y) would be PXY ((−∞, x] × (−∞, y]).
◮ FXY fixes the probability of all cylindrical sets of the form
(−∞, x] × (−∞, y] and hence uniquely determines the
probability of all Borel sets of ℜ2 .



Joint distribution of a pair of random variables

◮ Let X, Y be random variables on the same probability


space (Ω, F, P )
◮ The joint distribution function of X, Y is FXY : ℜ2 → ℜ,
defined by

FXY (x, y) = P [X ≤ x, Y ≤ y]
= P ({ω : X(ω) ≤ x} ∩ {ω : Y (ω) ≤ y})

◮ The joint distribution function is the probability of the


intersection of the events [X ≤ x] and [Y ≤ y].



Properties of Joint Distribution Function

◮ Joint distribution function:

FXY (x, y) = P [X ≤ x, Y ≤ y]

◮ FXY (−∞, y) = FXY (x, −∞) = 0, ∀x, y;


FXY (∞, ∞) = 1
(These are actually limits: limx→−∞ FXY (x, y) = 0, ∀y)
◮ FXY is non-decreasing in each of its arguments
◮ FXY is right continuous and has left-hand limits in each
of its arguments
◮ These are straight-forward extensions of single rv case
◮ But there is another crucial property satisfied by FXY .



◮ Recall that, for the case of a single rv, given x1 < x2 , we
have
P [x1 < X ≤ x2 ] = FX (x2 ) − FX (x1 )
◮ The LHS above is a probability.
Hence the RHS should be non-negative
The RHS is non-negative because FX is non-decreasing.
◮ We will now derive a similar expression in the case of two
random variables.
◮ Here, the probability we want is that of the pair of rv’s
being in a cylindrical set.



◮ Let x1 < x2 and y1 < y2 . We want
P [x1 < X ≤ x2 , y1 < Y ≤ y2 ].
◮ Consider the Borel set B = (−∞, x2 ] × (−∞, y2 ].
(Figure: the quadrant (−∞, x2 ] × (−∞, y2 ] partitioned into B1 , B2 , B3 )

B ≜ (−∞, x2 ] × (−∞, y2 ] = B1 + (B2 ∪ B3 )


B1 = (x1 , x2 ] × (y1 , y2 ]
B2 = (−∞, x2 ] × (−∞, y1 ]
B3 = (−∞, x1 ] × (−∞, y2 ]
B2 ∩ B3 = (−∞, x1 ] × (−∞, y1 ]




FXY (x2 , y2 ) = P [X ≤ x2 , Y ≤ y2 ] = P [(X, Y ) ∈ B]


= P [(X, Y ) ∈ B1 + (B2 ∪ B3 )]
= P [(X, Y ) ∈ B1 ] + P [(X, Y ) ∈ (B2 ∪ B3 )]

P [(X, Y ) ∈ B2 ] = P [X ≤ x2 , Y ≤ y1 ] = FXY (x2 , y1 )


P [(X, Y ) ∈ B3 ] = P [X ≤ x1 , Y ≤ y2 ] = FXY (x1 , y2 )
P [(X, Y ) ∈ B2 ∩ B3 ] = P [X ≤ x1 , Y ≤ y1 ] = FXY (x1 , y1 )

P [(X, Y ) ∈ B1 ] = FXY (x2 , y2 ) − P [(X, Y ) ∈ (B2 ∪ B3 )]


= FXY (x2 , y2 ) − FXY (x2 , y1 ) − FXY (x1 , y2 ) + FXY (x1 , y1 )



◮ What we showed is the following.
◮ For x1 < x2 and y1 < y2

P [x1 < X ≤ x2 , y1 < Y ≤ y2 ] = FXY (x2 , y2 ) − FXY (x2 , y1 )


−FXY (x1 , y2 ) + FXY (x1 , y1 )

◮ This means FXY should satisfy

FXY (x2 , y2 )−FXY (x2 , y1 )−FXY (x1 , y2 )+FXY (x1 , y1 ) ≥ 0

for all x1 < x2 and y1 < y2


◮ This is an additional condition that a function has to
satisfy to be the joint distribution function of a pair of
random variables



Properties of Joint Distribution Function
◮ Joint distribution function: FXY : ℜ2 → ℜ

FXY (x, y) = P [X ≤ x, Y ≤ y]
◮ It satisfies
1. FXY (−∞, y) = FXY (x, −∞) = 0, ∀x, y;
FXY (∞, ∞) = 1
2. FXY is non-decreasing in each of its arguments
3. FXY is right continuous and has left-hand limits in each
of its arguments
4. For all x1 < x2 and y1 < y2

FXY (x2 , y2 )−FXY (x2 , y1 )−FXY (x1 , y2 )+FXY (x1 , y1 ) ≥ 0

◮ Any F : ℜ2 → ℜ satisfying the above would be a joint


distribution function.
◮ Let X, Y be two discrete random variables (defined on
the same probability space).
◮ Let X ∈ {x1 , · · · xn } and Y ∈ {y1 , · · · , ym }.
◮ We define the joint probability mass function of X and Y
as
fXY (xi , yj ) = P [X = xi , Y = yj ]
(fXY (x, y) is zero for all other values of x, y)
◮ The fXY would satisfy
◮ fXY (x, y) ≥ 0, ∀x, y and Σi Σj fXY (xi , yj ) = 1
◮ This is a straight-forward extension of the pmf of a single
discrete rv.



Example

◮ Let Ω = (0, 1) with the ‘usual’ probability.


◮ So, each ω is a real number between 0 and 1
◮ Let X(ω) be the digit in the first decimal place in ω and
let Y (ω) be the digit in the second decimal place.
◮ If ω = 0.2576 then X(ω) = 2 and Y (ω) = 5
◮ Easy to see that X, Y ∈ {0, 1, · · · , 9}.
◮ We want to calculate the joint pmf of X and Y



Example
◮ What is the event [X = 4]?

[X = 4] = {ω : X(ω) = 4} = [0.4, 0.5)


◮ What is the event [Y = 3]?

[Y = 3] = [0.03, 0.04) ∪ [0.13, 0.14) ∪ · · · ∪ [0.93, 0.94)


◮ What is the event [X = 4, Y = 3]?
It is the intersection of the above

[X = 4, Y = 3] = [0.43, 0.44)
◮ Hence the joint pmf of X and Y is

fXY (x, y) = P [X = x, Y = y] = 0.01, x, y ∈ {0, 1, · · · , 9}

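The joint pmf in this example can be checked with a short script (a sketch; the function name is ours): for ω uniform on (0, 1), the event [X = x, Y = y] is exactly the interval [(10x + y)/100, (10x + y + 1)/100), so its probability is that interval's length.

```python
# Each pair (x, y) of first/second decimal digits corresponds to one
# sub-interval of (0, 1) of length 0.01 under the uniform probability.
def joint_pmf(x, y):
    lo = (10 * x + y) / 100.0
    hi = (10 * x + y + 1) / 100.0
    return hi - lo  # P[X = x, Y = y] = length of the interval

probs = [joint_pmf(x, y) for x in range(10) for y in range(10)]
total = sum(probs)
```

All 100 values come out as 0.01 and they sum to 1, matching the slide.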


Example
◮ Consider the random experiment of rolling two dice.
Ω = {(ω1 , ω2 ) : ω1 , ω2 ∈ {1, 2, · · · , 6}}
◮ Let X be the maximum of the two numbers and let Y be
the sum of the two numbers.
◮ Easy to see X ∈ {1, 2, · · · , 6} and Y ∈ {2, 3, · · · , 12}
◮ What is the event [X = m, Y = n]? (We assume m, n
are in the correct range)
[X = m, Y = n] = {(ω1 , ω2 ) ∈ Ω : max(ω1 , ω2 ) = m, ω1 +ω2 = n}
◮ For this to be a non-empty set, we must have
m < n ≤ 2m
◮ Then [X = m, Y = n] = {(m, n − m), (n − m, m)}
◮ Is this always true? No! What if n = 2m?
[X = 3, Y = 6] = {(3, 3)},
[X = 4, Y = 6] = {(4, 2), (2, 4)}
◮ So, P [X = m, Y = n] is either 2/36 or 1/36 (assuming
m, n satisfy other requirements)
Example
◮ We can now write the joint pmf.
◮ Assume 1 ≤ m ≤ 6 and 2 ≤ n ≤ 12. Then

fXY (m, n) = 2/36 if m < n < 2m;  fXY (m, n) = 1/36 if n = 2m

(fXY (m, n) is zero in all other cases)
◮ Does this satisfy requirements of joint pmf?

Σ_{m,n} fXY (m, n) = Σ_{m=1}^{6} Σ_{n=m+1}^{2m−1} (2/36) + Σ_{m=1}^{6} (1/36)
= (2/36) Σ_{m=1}^{6} (m − 1) + 6 · (1/36)
= (2/36)(21 − 6) + 6/36 = 1
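This pmf is easy to verify by brute force over the 36 equally likely outcomes (a sketch using exact rational arithmetic):

```python
from fractions import Fraction
from collections import defaultdict

# X = max of the two dice, Y = their sum; each outcome has probability 1/36.
pmf = defaultdict(Fraction)
for w1 in range(1, 7):
    for w2 in range(1, 7):
        pmf[(max(w1, w2), w1 + w2)] += Fraction(1, 36)

total = sum(pmf.values())
# Every nonzero entry is 2/36 (when m < n < 2m) or 1/36 (when n = 2m).
values = set(pmf.values())
```

The enumeration reproduces the two cases on the slide, e.g. fXY(3, 6) = 1/36 while fXY(4, 6) = 2/36.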
Joint Probability mass function

◮ Let X ∈ {x1 , x2 , · · · } and Y ∈ {y1 , y2 , · · · } be discrete


random variables.
◮ The joint pmf: fXY (x, y) = P [X = x, Y = y].
◮ The joint pmf satisfies:
◮ fXY (x, y) ≥ 0, ∀x, y and Σi Σj fXY (xi , yj ) = 1
◮ Given the joint pmf, we can get the joint df as

FXY (x, y) = Σ_{i: xi ≤x} Σ_{j: yj ≤y} fXY (xi , yj )



◮ Given sets {x1 , x2 , · · · } and {y1 , y2 , · · · },
◮ suppose fXY : ℜ2 → [0, 1] is such that
◮ fXY (x, y) = 0 unless x = xi for some i and y = yj for some j, and
◮ Σi Σj fXY (xi , yj ) = 1.
◮ Then fXY is a joint pmf.
◮ This is because, if we define

FXY (x, y) = Σ_{i: xi ≤x} Σ_{j: yj ≤y} fXY (xi , yj )

then FXY satisfies all properties of a df.


◮ We normally specify a pair of discrete random variables by
giving the joint pmf



◮ Given the joint pmf, we can (in principle) compute the
probability of any event involving the two discrete random
variables.
P [(X, Y ) ∈ B] = Σ_{i,j: (xi ,yj )∈B} fXY (xi , yj )

◮ Now, events can be specified in terms of relations


between the two rv’s too

[X < Y + 2] = {ω : X(ω) < Y (ω) + 2}

◮ Thus,

P [X < Y + 2] = Σ_{i,j: xi <yj +2} fXY (xi , yj )



◮ Take the example: 2 dice, X is max and Y is sum
◮ fXY (m, n) = 0 unless m = 1, · · · , 6 and n = 2, · · · , 12.
For this range
fXY (m, n) = 2/36 if m < n < 2m;  fXY (m, n) = 1/36 if n = 2m

◮ Suppose we want P [Y = X + 2].

P [Y = X + 2] = Σ_{m,n: n=m+2} fXY (m, n) = Σ_{m=1}^{6} fXY (m, m + 2)
= Σ_{m=2}^{6} fXY (m, m + 2)  (since we need m + 2 ≤ 2m)
= 1/36 + 4 · (2/36) = 9/36

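The same enumeration over the 36 outcomes confirms P[Y = X + 2] = 9/36 (a sketch):

```python
from fractions import Fraction

# Count outcomes (w1, w2) with w1 + w2 == max(w1, w2) + 2,
# i.e. with min(w1, w2) == 2, each having probability 1/36.
p = sum(Fraction(1, 36)
        for w1 in range(1, 7)
        for w2 in range(1, 7)
        if w1 + w2 == max(w1, w2) + 2)
```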


Joint density function
◮ Let X, Y be two continuous rv’s with df FXY .
◮ If there exists a function fXY that satisfies
FXY (x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} fXY (x′ , y ′ ) dy ′ dx′ , ∀x, y

then we say that X, Y have a joint probability density


function which is fXY
◮ Please note the difference in the definition of joint pmf
and joint pdf.
◮ When X, Y are discrete we defined a joint pmf
◮ We are not saying that if X, Y are continuous rv’s then a
joint density exists.



properties of joint density

◮ The joint density (or joint pdf) of X, Y is fXY that


satisfies
FXY (x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} fXY (x′ , y ′ ) dy ′ dx′ , ∀x, y

◮ Since FXY is non-decreasing in each argument, we must


have fXY (x, y) ≥ 0.
◮ ∫_{−∞}^{∞} ∫_{−∞}^{∞} fXY (x′ , y ′ ) dy ′ dx′ = 1 is needed to ensure FXY (∞, ∞) = 1.



properties of joint density

◮ The joint density fXY satisfies the following


1. fXY (x, y) ≥ 0, ∀x, y
2. ∫_{−∞}^{∞} ∫_{−∞}^{∞} fXY (x′ , y ′ ) dy ′ dx′ = 1

◮ These are very similar to the properties of the density of a


single rv



Example: Joint Density
◮ Consider the function
f (x, y) = 2, 0 < x < y < 1 (f (x, y) = 0, otherwise)

◮ Let us show this is a density


∫_{−∞}^{∞} ∫_{−∞}^{∞} f (x, y) dx dy = ∫_{0}^{1} ∫_{0}^{y} 2 dx dy = ∫_{0}^{1} 2y dy = 1

◮ We can say this density is uniform over the region


(Figure: the region 0 < x < y < 1, a triangle in the unit square)

The figure is not a plot of the density function!!


properties of joint density

◮ The joint density fXY satisfies the following


1. fXY (x, y) ≥ 0, ∀x, y
2. ∫_{−∞}^{∞} ∫_{−∞}^{∞} fXY (x′ , y ′ ) dy ′ dx′ = 1
◮ Any function fXY : ℜ2 → ℜ satisfying the above two is a
joint density function.
◮ Given fXY satisfying the above, define
FXY (x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} fXY (x′ , y ′ ) dy ′ dx′ , ∀x, y

◮ Then we can show FXY is a joint distribution.



◮ fXY (x, y) ≥ 0 and ∫_{−∞}^{∞} ∫_{−∞}^{∞} fXY (x′ , y ′ ) dy ′ dx′ = 1
◮ Define

FXY (x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} fXY (x′ , y ′ ) dy ′ dx′ , ∀x, y

◮ Then, FXY (−∞, y) = FXY (x, −∞) = 0, ∀x, y and


FXY (∞, ∞) = 1
◮ Since fXY (x, y) ≥ 0, FXY is non-decreasing in each
argument.
◮ Since it is given as an integral, the above also shows that
FXY is continuous in each argument.
◮ The only property left is the special property of FXY we
mentioned earlier.



∆ ≜ FXY (x2 , y2 ) − FXY (x1 , y2 ) − FXY (x2 , y1 ) + FXY (x1 , y1 ).

◮ We need to show ∆ ≥ 0 if x1 < x2 and y1 < y2 .


◮ We have

∆ = ∫_{−∞}^{x2} ∫_{−∞}^{y2} fXY dy dx − ∫_{−∞}^{x1} ∫_{−∞}^{y2} fXY dy dx − ∫_{−∞}^{x2} ∫_{−∞}^{y1} fXY dy dx + ∫_{−∞}^{x1} ∫_{−∞}^{y1} fXY dy dx
= ∫_{−∞}^{x2} [ ∫_{−∞}^{y2} fXY dy − ∫_{−∞}^{y1} fXY dy ] dx − ∫_{−∞}^{x1} [ ∫_{−∞}^{y2} fXY dy − ∫_{−∞}^{y1} fXY dy ] dx



◮ Thus we have

∆ = ∫_{−∞}^{x2} [ ∫_{−∞}^{y2} fXY dy − ∫_{−∞}^{y1} fXY dy ] dx − ∫_{−∞}^{x1} [ ∫_{−∞}^{y2} fXY dy − ∫_{−∞}^{y1} fXY dy ] dx
= ∫_{−∞}^{x2} ∫_{y1}^{y2} fXY dy dx − ∫_{−∞}^{x1} ∫_{y1}^{y2} fXY dy dx
= ∫_{x1}^{x2} ∫_{y1}^{y2} fXY dy dx ≥ 0

◮ This actually shows

P [x1 ≤ X ≤ x2 , y1 ≤ Y ≤ y2 ] = ∫_{x1}^{x2} ∫_{y1}^{y2} fXY dy dx



◮ What we showed is the following
◮ Any function fXY : ℜ2 → ℜ that satisfies
◮ fXY (x, y) ≥ 0, ∀x, y
◮ ∫_{−∞}^{∞} ∫_{−∞}^{∞} fXY (x, y) dx dy = 1
is a joint density function.
◮ This is because now FXY (x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} fXY (x′ , y ′ ) dy ′ dx′ would satisfy all conditions for a df.
◮ Convenient to specify joint density (when it exists)
◮ We also showed

P [x1 ≤ X ≤ x2 , y1 ≤ Y ≤ y2 ] = ∫_{x1}^{x2} ∫_{y1}^{y2} fXY dy dx
x1 y1

◮ In general
P [(X, Y ) ∈ B] = ∫∫_{B} fXY (x, y) dx dy, ∀B ∈ B 2



◮ Let us consider the example

f (x, y) = 2, 0 < x < y < 1

◮ Suppose we want the probability of [Y > X + 0.5]

P [Y > X + 0.5] = P [(X, Y ) ∈ {(x, y) : y > x + 0.5}]
= ∫∫_{{(x,y): y>x+0.5}} fXY (x, y) dx dy
= ∫_{0.5}^{1} ∫_{0}^{y−0.5} 2 dx dy
= ∫_{0.5}^{1} 2(y − 0.5) dy
= (y² − y) |_{0.5}^{1} = (1 − 0.25) − (1 − 0.5) = 0.25



◮ We can look at it geometrically
(Figure: the triangle 0 < x < y < 1 with the sub-triangle {y > x + 0.5} shaded)

◮ The probability of the event we want is the area of the


small triangle divided by that of the big triangle.

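The value 0.25 can also be checked numerically with a midpoint Riemann sum over the unit square (a sketch; the grid size is an arbitrary choice):

```python
# Integrate f(x, y) = 2 on {0 < x < y < 1} over the event {y > x + 0.5}.
n = 400
h = 1.0 / n
prob = 0.0
for i in range(n):
    x = (i + 0.5) * h
    for j in range(n):
        y = (j + 0.5) * h
        if x < y and y > x + 0.5:
            prob += 2.0 * h * h
```

The sum lands close to 0.25, with a small error from grid cells straddling the region's boundary.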


Marginal Distributions
◮ Let X, Y be random variables with joint distribution
function FXY .
◮ We know FXY (x, y) = P [X ≤ x, Y ≤ y].
◮ Hence

FXY (x, ∞) = P [X ≤ x, Y ≤ ∞] = P [X ≤ x] = FX (x)

◮ We define the marginal distribution functions of X, Y by

FX (x) = FXY (x, ∞); FY (y) = FXY (∞, y)

◮ These are simply distribution functions of X and Y


obtained from the joint distribution function.



Marginal mass functions
◮ Let X ∈ {x1 , x2 , · · · } and Y ∈ {y1 , y2 , · · · }
◮ Let fXY be their joint mass function.
◮ Then

P [X = xi ] = Σj P [X = xi , Y = yj ] = Σj fXY (xi , yj )

(This is because the events [Y = yj ], j = 1, · · · , form a partition and P (A) = Σi P (ABi ) when {Bi } is a partition)
◮ We define the marginal mass functions of X and Y as

fX (xi ) = Σj fXY (xi , yj );  fY (yj ) = Σi fXY (xi , yj )

◮ These are mass functions of X and Y obtained from the


joint mass function
marginal density functions
◮ Let X, Y be continuous rv's with joint density fXY .
◮ Then we know FXY (x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} fXY (x′ , y ′ ) dy ′ dx′
◮ Hence, we have

FX (x) = FXY (x, ∞) = ∫_{−∞}^{x} ∫_{−∞}^{∞} fXY (x′ , y ′ ) dy ′ dx′ = ∫_{−∞}^{x} ( ∫_{−∞}^{∞} fXY (x′ , y ′ ) dy ′ ) dx′

◮ Since X is a continuous rv, this means

fX (x) = ∫_{−∞}^{∞} fXY (x, y) dy

We call this the marginal density of X.
◮ Similarly, the marginal density of Y is

fY (y) = ∫_{−∞}^{∞} fXY (x, y) dx

◮ These are pdf's of X and Y obtained from the joint density.
Example
◮ Rolling two dice, X is max, Y is sum
◮ We had, for 1 ≤ m ≤ 6 and 2 ≤ n ≤ 12,

fXY (m, n) = 2/36 if m < n < 2m;  fXY (m, n) = 1/36 if n = 2m

◮ We know fX (m) = Σn fXY (m, n), m = 1, · · · , 6.
◮ Given m, for what values of n is fXY (m, n) > 0 ?
We can only have n = m + 1, · · · , 2m.
◮ Hence we get

fX (m) = Σ_{n=m+1}^{2m} fXY (m, n) = Σ_{n=m+1}^{2m−1} (2/36) + 1/36 = (2/36)(m − 1) + 1/36 = (2m − 1)/36



Example
◮ Consider the joint density

fXY (x, y) = 2, 0 < x < y < 1

◮ The marginal density of X is: for 0 < x < 1,

fX (x) = ∫_{−∞}^{∞} fXY (x, y) dy = ∫_{x}^{1} 2 dy = 2(1 − x)

Thus, fX (x) = 2(1 − x), 0 < x < 1
◮ We can easily verify this is a density:

∫_{−∞}^{∞} fX (x) dx = ∫_{0}^{1} 2(1 − x) dx = (2x − x²) |_{0}^{1} = 1



We have: fXY (x, y) = 2, 0 < x < y < 1
◮ We can similarly find density of Y .

◮ For 0 < y < 1,

fY (y) = ∫_{−∞}^{∞} fXY (x, y) dx = ∫_{0}^{y} 2 dx = 2y

◮ Thus, fY (y) = 2y, 0 < y < 1 and

∫_{0}^{1} 2y dy = y² |_{0}^{1} = 1

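Both marginals can be recovered numerically from the joint density (a sketch; `integrate` is a simple midpoint rule defined here for the check):

```python
# Marginals of f(x,y) = 2 on 0 < x < y < 1: fX(x) = 2(1-x), fY(y) = 2y.
def f_joint(x, y):
    return 2.0 if 0 < x < y < 1 else 0.0

def integrate(g, a, b, n=10000):
    # midpoint Riemann sum of g over [a, b]
    h = (b - a) / n
    return sum(g(a + (k + 0.5) * h) for k in range(n)) * h

fX_025 = integrate(lambda y: f_joint(0.25, y), 0.0, 1.0)  # ~ 2*(1 - 0.25)
fY_080 = integrate(lambda x: f_joint(x, 0.80), 0.0, 1.0)  # ~ 2*0.80
```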


◮ If we are given the joint df or joint pmf/joint density of
X, Y , then the individual df or pmf/pdf are uniquely
determined.
◮ However, given individual pdf of X and Y , we cannot
determine the joint density. (same is true of pmf or df)
◮ There can be many different joint density functions all
having the same marginals



Conditional distributions

◮ Let X, Y be rv’s on the same probability space


◮ We define the conditional distribution of X given Y by

FX|Y (x|y) = P [X ≤ x|Y = y]

(For now ignore the case of P [Y = y] = 0).


◮ Note that FX|Y : ℜ2 → ℜ
◮ FX|Y (x|y) is a notation. We could write FX|Y (x, y).



◮ Conditional distribution of X given Y is

FX|Y (x|y) = P [X ≤ x|Y = y]

It is the conditional probability of [X ≤ x] given (or


conditioned on) [Y = y].
◮ Consider example: rolling 2 dice, X is max, Y is sum

P [X ≤ 4|Y = 3] = 1; P [X ≤ 4|Y = 9] = 0

◮ This is what conditional distribution captures.


◮ For every value of y, FX|Y (x|y) is a distribution function
in the variable x.
◮ It defines a new distribution for X based on knowing the
value of Y .



◮ Let: X ∈ {x1 , x2 , · · · } and Y ∈ {y1 , y2 , · · · }. Then

FX|Y (x|yj ) = P [X ≤ x|Y = yj ] = P [X ≤ x, Y = yj ] / P [Y = yj ]

(We define FX|Y (x|y) only when y = yj for some j).
◮ For each yj , FX|Y (x|yj ) is a df of a discrete rv in x.
◮ Since X is a discrete rv, we can write the above as

FX|Y (x|yj ) = P [X ≤ x, Y = yj ] / P [Y = yj ] = Σ_{i: xi ≤x} P [X = xi , Y = yj ] / P [Y = yj ]
= Σ_{i: xi ≤x} ( fXY (xi , yj ) / fY (yj ) )



Conditional mass function
◮ We got

FX|Y (x|yj ) = Σ_{i: xi ≤x} ( fXY (xi , yj ) / fY (yj ) )

◮ We define the conditional mass function of X given Y as

fX|Y (xi |yj ) = fXY (xi , yj ) / fY (yj ) = P [X = xi |Y = yj ]

◮ Note that

Σi fX|Y (xi |yj ) = 1, ∀yj ;  and  FX|Y (x|yj ) = Σ_{i: xi ≤x} fX|Y (xi |yj )



Example: Conditional pmf
◮ Consider the random experiment of tossing a coin n
times.
◮ Let X denote the number of heads and let Y denote the
toss number on which the first head comes.
◮ For 1 ≤ k ≤ n,

fY |X (k|1) = P [Y = k|X = 1] = P [Y = k, X = 1] / P [X = 1]
= p(1 − p)^{n−1} / ( nC1 p(1 − p)^{n−1} )
= 1/n
◮ Given there is only one head, it is equally likely to occur
on any toss.
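A brute-force check over all 2^n coin sequences confirms that, given exactly one head, its position is uniform (a sketch; n = 5 and p = 0.3 are arbitrary choices for the check):

```python
from itertools import product

n, p = 5, 0.3
p_x1 = 0.0   # P[X = 1]
joint = {}   # k -> P[Y = k, X = 1], where k = toss of the first (only) head
for seq in product([0, 1], repeat=n):
    prob = 1.0
    for b in seq:
        prob *= p if b else 1 - p
    if sum(seq) == 1:
        p_x1 += prob
        k = seq.index(1) + 1
        joint[k] = joint.get(k, 0.0) + prob

cond = {k: v / p_x1 for k, v in joint.items()}  # P[Y = k | X = 1]
```

Every conditional probability comes out as 1/n, independent of p, as derived above.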
◮ The conditional mass function is

fX|Y (xi |yj ) = P [X = xi |Y = yj ] = fXY (xi , yj ) / fY (yj )

◮ This gives us the useful identity

fXY (xi , yj ) = fX|Y (xi |yj ) fY (yj )

( P [X = xi , Y = yj ] = P [X = xi |Y = yj ] P [Y = yj ] )
◮ This gives us the total probability rule for discrete rv's

fX (xi ) = Σj fXY (xi , yj ) = Σj fX|Y (xi |yj ) fY (yj )

◮ This is same as

P [X = xi ] = Σj P [X = xi |Y = yj ] P [Y = yj ]

(P (A) = Σj P (A|Bj ) P (Bj ) when B1 , · · · form a partition)
Bayes Rule for discrete Random Variable
◮ We have

fXY (xi , yj ) = fX|Y (xi |yj )fY (yj ) = fY |X (yj |xi )fX (xi )

◮ This gives us Bayes rule for discrete rv's

fX|Y (xi |yj ) = fY |X (yj |xi ) fX (xi ) / fY (yj )
= fY |X (yj |xi ) fX (xi ) / Σk fXY (xk , yj )
= fY |X (yj |xi ) fX (xi ) / Σk fY |X (yj |xk ) fX (xk )



◮ Let X, Y be continuous rv’s with joint density, fXY .
◮ We once again want to define conditional df

FX|Y (x|y) = P [X ≤ x|Y = y]

◮ But the conditioning event, [Y = y] has zero probability.


◮ Hence we define conditional df as follows

FX|Y (x|y) = lim P [X ≤ x|Y ∈ [y, y + δ]]


δ↓0

◮ This is well defined if the limit exists.


◮ The limit exists for all y where fY (y) > 0 (and for all x)



◮ The conditional df is given by (assuming fY (y) > 0)

FX|Y (x|y) = lim_{δ↓0} P [X ≤ x|Y ∈ [y, y + δ]]
= lim_{δ↓0} P [X ≤ x, Y ∈ [y, y + δ]] / P [Y ∈ [y, y + δ]]
= lim_{δ↓0} ( ∫_{−∞}^{x} ∫_{y}^{y+δ} fXY (x′ , y ′ ) dy ′ dx′ ) / ( ∫_{y}^{y+δ} fY (y ′ ) dy ′ )
= lim_{δ↓0} ( ∫_{−∞}^{x} fXY (x′ , y) δ dx′ + o(δ) ) / ( fY (y) δ + o(δ) )
= ∫_{−∞}^{x} ( fXY (x′ , y) / fY (y) ) dx′

◮ We define the conditional density of X given Y as

fX|Y (x|y) = fXY (x, y) / fY (y)



◮ Let X, Y have joint density fXY .
◮ The conditional df of X given Y is

FX|Y (x|y) = lim_{δ↓0} P [X ≤ x|Y ∈ [y, y + δ]]

◮ This exists if fY (y) > 0 and then it has a density:

FX|Y (x|y) = ∫_{−∞}^{x} ( fXY (x′ , y) / fY (y) ) dx′ = ∫_{−∞}^{x} fX|Y (x′ |y) dx′

◮ This conditional density is given by

fX|Y (x|y) = fXY (x, y) / fY (y)

◮ We (once again) have the useful identity

fXY (x, y) = fX|Y (x|y) fY (y) = fY |X (y|x) fX (x)



Example
fXY (x, y) = 2, 0 < x < y < 1
◮ We saw that the marginal densities are

fX (x) = 2(1 − x), 0 < x < 1; fY (y) = 2y, 0 < y < 1

◮ Hence the conditional densities are given by

fX|Y (x|y) = fXY (x, y) / fY (y) = 1/y,  0 < x < y < 1
fY |X (y|x) = fXY (x, y) / fX (x) = 1/(1 − x),  0 < x < y < 1

◮ We can see this intuitively:
Conditioned on Y = y, X is uniform over (0, y).
Conditioned on X = x, Y is uniform over (x, 1).
◮ The identity fXY (x, y) = fX|Y (x|y)fY (y) can be used to
specify the joint density of two continuous rv’s
◮ We can specify the marginal density of one and the
conditional density of the other given the first.
◮ This may actually be the model of how the rv's are
generated.



Example
◮ Let X be uniform over (0, 1) and let Y be uniform over
0 to X. Find the density of Y .
◮ What we are given is

fX (x) = 1, 0 < x < 1;  fY |X (y|x) = 1/x, 0 < y < x < 1

◮ Hence the joint density is: fXY (x, y) = 1/x, 0 < y < x < 1.
◮ Hence the density of Y is

fY (y) = ∫_{−∞}^{∞} fXY (x, y) dx = ∫_{y}^{1} (1/x) dx = − ln(y), 0 < y < 1

◮ We can verify it to be a density:

∫_{0}^{1} − ln(y) dy = −y ln(y) |_{0}^{1} + ∫_{0}^{1} y · (1/y) dy = 1
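Both steps can be checked numerically (a sketch; the midpoint rule below handles the integrable singularity of −ln y at 0 well enough for a tolerance check):

```python
import math

def midpoint(g, a, b, n=100000):
    # midpoint Riemann sum of g over [a, b]
    h = (b - a) / n
    return sum(g(a + (k + 0.5) * h) for k in range(n)) * h

# fY(0.2) = integral of 1/x over [0.2, 1], which should equal -ln(0.2) = ln 5
fY_02 = midpoint(lambda x: 1.0 / x, 0.2, 1.0)
# and the density -ln(y) should integrate to 1 over (0, 1)
total = midpoint(lambda y: -math.log(y), 0.0, 1.0)
```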
◮ We have the identity

fXY (x, y) = fX|Y (x|y) fY (y)

◮ By integrating both sides,

fX (x) = ∫_{−∞}^{∞} fXY (x, y) dy = ∫_{−∞}^{∞} fX|Y (x|y) fY (y) dy

◮ This is a continuous analogue of the total probability rule.
◮ But note that, since X is a continuous rv, fX (x) is NOT P [X = x]
◮ In case of discrete rv, the mass function value fX (x) is equal to P [X = x] and we had

fX (x) = Σy fX|Y (x|y) fY (y)

◮ It is as if one can simply replace pmf by pdf and summation by integration!!
◮ While often that gives the right result, one needs to be very careful
◮ We have the identity

fXY (x, y) = fX|Y (x|y) fY (y) = fY |X (y|x)fX (x)

◮ This gives rise to Bayes rule for continuous rv's

fX|Y (x|y) = fY |X (y|x) fX (x) / fY (y)
= fY |X (y|x) fX (x) / ∫_{−∞}^{∞} fY |X (y|x′ ) fX (x′ ) dx′

◮ This is essentially identical to Bayes rule for discrete rv’s.


We have essentially put the pdf wherever there was pmf



◮ To recap, we started by defining conditional distribution
function.
FX|Y (x|y) = P [X ≤ x|Y = y]
◮ When X, Y are discrete, we define this only for y = yj .
That is, we define it only for all values that Y can take.
◮ When X, Y have joint density, we defined it by
FX|Y (x|y) = lim P [X ≤ x|Y ∈ [y, y + δ]]
δ↓0

This limit exists and FX|Y is well defined if fY (y) > 0.


That is, essentially again for all values that Y can take.
◮ In the discrete case, we define fX|Y as the pmf
corresponding to FX|Y . This conditional pmf can also be
defined as a conditional probability
◮ In the continuous case fX|Y is the density corresponding
to FX|Y .
◮ In both cases we have: fXY (x, y) = fX|Y (x|y)fY (y)
◮ This gives total probability rule and Bayes rule for random
variables
◮ Now, let X be a continuous rv and let Y be discrete rv.
◮ We can define FX|Y as

FX|Y (x|y) = P [X ≤ x|Y = y]

This is well defined for all values that y takes. (We


consider only those y)
◮ Since X is a continuous rv, this df would have a density:

FX|Y (x|y) = ∫_{−∞}^{x} fX|Y (x′ |y) dx′

◮ Hence we can write

P [X ≤ x, Y = y] = FX|Y (x|y) P [Y = y] = ∫_{−∞}^{x} fX|Y (x′ |y) fY (y) dx′



◮ We now get

FX (x) = P [X ≤ x] = Σy P [X ≤ x, Y = y]
= Σy ∫_{−∞}^{x} fX|Y (x′ |y) fY (y) dx′
= ∫_{−∞}^{x} Σy fX|Y (x′ |y) fY (y) dx′

◮ This gives us

fX (x) = Σy fX|Y (x|y) fY (y)
◮ This is another version of total probability rule.


◮ Earlier we derived this when X, Y are discrete.
◮ The formula is true even when X is continuous
Only difference is we need to take fX as the density of X.
◮ When X, Y are discrete we have

fX (x) = Σy fX|Y (x|y) fY (y)

◮ When X is continuous and Y is discrete, we defined fX|Y (x|y) to be the density corresponding to FX|Y (x|y) = P [X ≤ x|Y = y]
◮ Then we once again get

fX (x) = Σy fX|Y (x|y) fY (y)

However, now, fX is a density (and not a mass function).
fX|Y is also a density now.
◮ Suppose Y ∈ {1, 2, 3} and fY (i) = λi .
Let fX|Y (x|i) = fi (x). Then

fX (x) = λ1 f1 (x) + λ2 f2 (x) + λ3 f3 (x)

This is called a mixture density model.
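A mixture like this is straightforward to evaluate and check numerically (a sketch; the weights and the Gaussian components below are arbitrary illustrative choices):

```python
import math

def gauss(x, mu, var):
    # Gaussian density with mean mu and variance var
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

lams = [0.5, 0.3, 0.2]                         # fY(i) = lambda_i
comps = [(0.0, 1.0), (3.0, 2.0), (-2.0, 0.5)]  # (mean, variance) of f_i

def f_mix(x):
    # fX(x) = sum_i lambda_i * f_i(x)
    return sum(l * gauss(x, mu, v) for l, (mu, v) in zip(lams, comps))

# The mixture is itself a density: midpoint sum over a wide interval gives 1.
n, a, b = 20000, -15.0, 15.0
h = (b - a) / n
total = sum(f_mix(a + (k + 0.5) * h) for k in range(n)) * h
```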
◮ Continuing with X continuous rv and Y discrete. We have

FX|Y (x|y) = P [X ≤ x|Y = y] = ∫_{−∞}^{x} fX|Y (x′ |y) dx′

◮ We also have

P [X ≤ x, Y = y] = ∫_{−∞}^{x} fX|Y (x′ |y) fY (y) dx′

◮ Hence we can define a 'joint density'

fXY (x, y) = fX|Y (x|y) fY (y)

◮ This is a kind of mixed density and mass function.
◮ We will not be using such 'joint densities' here



◮ Continuing with X continuous rv and Y discrete
◮ Can we define fY |X (y|x)?
◮ Since Y is discrete, this (conditional) mass function is fY |X (y|x) = P [Y = y|X = x]
But the conditioning event has zero probability.
We now know how to handle it:

fY |X (y|x) = lim_{δ↓0} P [Y = y|X ∈ [x, x + δ]]

◮ For simplifying this we note the following:

P [X ≤ x, Y = y] = ∫_{−∞}^{x} fX|Y (x′ |y) fY (y) dx′
⇒ P [X ∈ [x, x + δ], Y = y] = ∫_{x}^{x+δ} fX|Y (x′ |y) fY (y) dx′



◮ We have

fY |X (y|x) = lim_{δ↓0} P [Y = y|X ∈ [x, x + δ]]
= lim_{δ↓0} P [Y = y, X ∈ [x, x + δ]] / P [X ∈ [x, x + δ]]
= lim_{δ↓0} ( ∫_{x}^{x+δ} fX|Y (x′ |y) fY (y) dx′ ) / ( ∫_{x}^{x+δ} fX (x′ ) dx′ )
= fX|Y (x|y) fY (y) / fX (x)
⇒ fY |X (y|x) fX (x) = fX|Y (x|y) fY (y)

◮ This gives us further versions of the total probability rule and Bayes rule.



◮ First let us look at the total probability rule possibilities
◮ When X is continuous rv and Y is discrete rv, we derived

fY |X (y|x)fX (x) = fX|Y (x|y) fY (y)

Note that fY is mass fn, fX is density and so on.


◮ Since fX|Y is a density (corresponding to FX|Y ),

∫_{−∞}^{∞} fX|Y (x|y) dx = 1

◮ Hence we get

fY (y) = ∫_{−∞}^{∞} fY |X (y|x) fX (x) dx

◮ Earlier we derived the same formula when X, Y have a


joint density.
◮ Let us review all the total probability formulas

1. fX (x) = Σy fX|Y (x|y) fY (y)

◮ We first derived this when X, Y are discrete.
◮ But now we proved this holds whenever Y is discrete:
if X is continuous then fX , fX|Y are densities; if X is also discrete they are mass functions

2. fY (y) = ∫_{−∞}^{∞} fY |X (y|x) fX (x) dx

◮ We first proved it when X, Y have a joint density.
We now know it holds also when X is continuous and Y is discrete. In that case fY is a mass function



◮ When X is continuous rv and Y is discrete rv, we derived

fY |X (y|x)fX (x) = fX|Y (x|y) fY (y)

◮ This once again gives rise to Bayes rule:

fY |X (y|x) = fX|Y (x|y) fY (y) / fX (x) ;  fX|Y (x|y) = fY |X (y|x) fX (x) / fY (y)
◮ Earlier we showed this holds when X, Y are both discrete or both continuous.
◮ Thus Bayes rule holds in all four possible scenarios
◮ Only difference is we need to interpret fX or fX|Y as
mass functions when X is discrete and as densities when
X is a continuous rv
◮ In general, one refers to these always as densities since
the actual meaning would be clear from context.
Example

◮ Consider a communication system. The transmitter puts out 0 or 5 volts for the bits 0 and 1, and the voltage measured by the receiver is the sent voltage plus noise added by the channel.
◮ We assume the noise has Gaussian density with mean zero and variance σ 2 .
◮ We want the probability that the sent bit is 1 when the measured voltage at the receiver is x (to decide what is sent).
◮ Let X be the measured voltage and let Y be sent bit.
◮ We want to calculate fY |X (1|x).
◮ We want to use the Bayes rule to calculate this



◮ We need fX|Y . What does our model say?
◮ fX|Y (x|1) is Gaussian with mean 5 and variance σ 2 and
fX|Y (x|0) is Gaussian with mean zero and variance σ 2

P [Y = 1|X = x] = fY |X (1|x) = fX|Y (x|1) fY (1) / fX (x)
◮ We need fY (1), fY (0). Let us take them to be same.
◮ In practice we only want to know whether
fY |X (1|x) > fY |X (0|x)
◮ Then we do not need to calculate fX (x).
We only need ratio of fY |X (1|x) and fY |X (0|x).



◮ The ratio of the two probabilities is

fY |X (1|x) / fY |X (0|x) = ( fX|Y (x|1) fY (1) ) / ( fX|Y (x|0) fY (0) )
= exp( −(x − 5)² / (2σ²) ) / exp( −(x − 0)² / (2σ²) )
= exp( −0.5 σ⁻² (x² − 10x + 25 − x²) )
= exp( 0.5 σ⁻² (10x − 25) )

(the 1/(σ√2π) factors cancel)
◮ We are only interested in whether the above is greater than 1 or not.
◮ The ratio is greater than 1 if 10x > 25, i.e., x > 2.5
◮ So, if X > 2.5 we will conclude bit 1 is sent. Intuitively obvious!

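The decision rule can be written directly from this ratio (a sketch; the function names are ours and `sigma` is a free parameter):

```python
import math

def posterior_ratio(x, sigma=1.0):
    # fY|X(1|x) / fY|X(0|x) with equal priors: exp((10x - 25) / (2 sigma^2))
    return math.exp((10.0 * x - 25.0) / (2.0 * sigma ** 2))

def decide(x, sigma=1.0):
    # Decide bit 1 iff the ratio exceeds 1, i.e. iff x > 2.5
    return 1 if posterior_ratio(x, sigma) > 1.0 else 0
```

Note that the threshold 2.5 does not depend on σ; the noise variance only changes how sharply the ratio moves away from 1.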


◮ We did not calculate fX (x) in the above.
◮ We can calculate it if we want.
◮ Using the total probability rule,

fX (x) = Σy fX|Y (x|y) fY (y) = fX|Y (x|1) fY (1) + fX|Y (x|0) fY (0)
= (1/2) (1/(σ√2π)) e^{−(x−5)²/(2σ²)} + (1/2) (1/(σ√2π)) e^{−x²/(2σ²)}

◮ It is a mixture density



◮ As we saw, given the joint distribution we can calculate
all the marginals.
◮ However, there can be many joint distributions with the
same marginals.
◮ Let F1 , F2 be one dimensional df’s of continuous rv’s with
f1 , f2 being the corresponding densities.
Define a function f : ℜ2 → ℜ by
f (x, y) = f1 (x)f2 (y) [1 + α(2F1 (x) − 1)(2F2 (y) − 1)]
where α ∈ (−1, 1).
◮ First note that f (x, y) ≥ 0, ∀α ∈ (−1, 1).
For different α we get different functions.
◮ We first show that f (x, y) is a joint density.
◮ For this, we note the following:

∫_{−∞}^{∞} f1 (x) F1 (x) dx = [ (F1 (x))² / 2 ]_{−∞}^{∞} = 1/2



f (x, y) = f1 (x) f2 (y) [1 + α(2F1 (x) − 1)(2F2 (y) − 1)]

∫_{−∞}^{∞} ∫_{−∞}^{∞} f (x, y) dx dy = ∫_{−∞}^{∞} f1 (x) dx ∫_{−∞}^{∞} f2 (y) dy
     + α ∫_{−∞}^{∞} (2f1 (x)F1 (x) − f1 (x)) dx ∫_{−∞}^{∞} (2f2 (y)F2 (y) − f2 (y)) dy
   = 1

because 2 ∫_{−∞}^{∞} f1 (x) F1 (x) dx = 1, so each factor in the second term is zero. This also shows

∫_{−∞}^{∞} f (x, y) dx = f2 (y);   ∫_{−∞}^{∞} f (x, y) dy = f1 (x)



◮ Thus infinitely many joint distributions can all have the
same marginals.
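The construction can be sanity-checked numerically. A sketch using Uniform(0,1) marginals, so f1 = f2 = 1 and F1(x) = x and the density becomes 1 + α(2x − 1)(2y − 1) on the unit square (the grid size and the value of α are arbitrary choices):

```python
import numpy as np

def fgm_density(x, y, alpha):
    # f(x,y) = f1(x) f2(y) [1 + alpha (2 F1(x) - 1)(2 F2(y) - 1)]
    # specialized to Uniform(0,1) marginals.
    return 1.0 + alpha * (2 * x - 1) * (2 * y - 1)

n = 1000
pts = (np.arange(n) + 0.5) / n            # midpoint grid on (0, 1)
X, Y = np.meshgrid(pts, pts, indexing="ij")
f = fgm_density(X, Y, alpha=0.7)

total = f.mean()             # midpoint rule: integral over the unit square
marginal_x = f.mean(axis=1)  # integrate out y: should give f1(x) = 1 for every x
```

The total mass is 1 and the x-marginal is identically 1, for every α, even though the joint density itself changes with α.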
◮ So, in general, the marginals cannot determine the joint
distribution.
◮ An important special case where this is possible is that of
independent random variables



Independent Random Variables
◮ Two random variables X, Y are said to be independent if
for all Borel sets B1 , B2 , the events [X ∈ B1 ] and
[Y ∈ B2 ] are independent.
◮ If X, Y are independent then

P [X ∈ B1 , Y ∈ B2 ] = P [X ∈ B1 ] P [Y ∈ B2 ], ∀B1 , B2 ∈ B

◮ In particular

FXY (x, y) = P [X ≤ x, Y ≤ y] = P [X ≤ x]P [Y ≤ y] = FX (x) FY (y)

◮ Theorem: X, Y are independent if and only if
FXY (x, y) = FX (x)FY (y).



◮ Suppose X, Y are independent discrete rv’s

fXY (x, y) = P [X = x, Y = y] = P [X = x]P [Y = y] = fX (x)fY (y)

The joint mass function is a product of marginals.


◮ Suppose fXY (x, y) = fX (x)fY (y). Then
FXY (x, y) = Σ_{xi ≤x, yj ≤y} fXY (xi , yj ) = Σ_{xi ≤x, yj ≤y} fX (xi ) fY (yj )
           = Σ_{xi ≤x} fX (xi ) Σ_{yj ≤y} fY (yj ) = FX (x) FY (y)

◮ So, X, Y are independent if and only if
fXY (x, y) = fX (x)fY (y)



◮ Let X, Y be independent continuous rv
FXY (x, y) = FX (x)FY (y) = ∫_{−∞}^{x} fX (x′) dx′ ∫_{−∞}^{y} fY (y′) dy′
           = ∫_{−∞}^{y} ∫_{−∞}^{x} fX (x′) fY (y′) dx′ dy′

◮ This implies the joint density is the product of the marginals.
◮ Now, suppose fXY (x, y) = fX (x)fY (y). Then

FXY (x, y) = ∫_{−∞}^{y} ∫_{−∞}^{x} fXY (x′, y′) dx′ dy′
           = ∫_{−∞}^{y} ∫_{−∞}^{x} fX (x′) fY (y′) dx′ dy′
           = ∫_{−∞}^{x} fX (x′) dx′ ∫_{−∞}^{y} fY (y′) dy′ = FX (x) FY (y)

◮ So, X, Y are independent if and only if
fXY (x, y) = fX (x)fY (y)
◮ Let X, Y be independent.
◮ Then P [X ∈ B1 |Y ∈ B2 ] = P [X ∈ B1 ].
◮ Hence, we get FX|Y (x|y) = FX (x).
◮ This also implies fX|Y (x|y) = fX (x).
◮ This is true for all the four possibilities of X, Y being
continuous/discrete.



More than two rv
◮ Everything we have done so far is easily extended to
multiple random variables.
◮ Let X, Y, Z be rv on the same probability space.
◮ We define joint distribution function by

FXY Z (x, y, z) = P [X ≤ x, Y ≤ y, Z ≤ z]

◮ If all three are discrete then the joint mass function is

fXY Z (x, y, z) = P [X = x, Y = y, Z = z]

◮ If they are continuous, they have a joint density if

FXY Z (x, y, z) = ∫_{−∞}^{z} ∫_{−∞}^{y} ∫_{−∞}^{x} fXY Z (x′, y′, z′) dx′ dy′ dz′



◮ Easy to see that the joint mass function satisfies
1. fXY Z (x, y, z) ≥ 0 and is non-zero only for countably
many tuples.
2. Σ_{x,y,z} fXY Z (x, y, z) = 1
◮ Similarly the joint density satisfies
1. fXY Z (x, y, z) ≥ 0
2. ∫_{−∞}^{∞} ∫_{−∞}^{∞} ∫_{−∞}^{∞} fXY Z (x, y, z) dx dy dz = 1
◮ These are straight-forward generalizations
◮ The properties of joint distribution function such as it
being non-decreasing in each argument etc are easily seen
to hold here too.
◮ Generalizing the special property of the df (relating to
probability of cylindrical sets) is a little more complicated.
◮ We specify multiple random variables either through joint
mass function or joint density function.



◮ Now we get many different marginals:

FXY (x, y) = FXY Z (x, y, ∞); FZ (z) = FXY Z (∞, ∞, z) and so on

◮ Similarly we get
fY Z (y, z) = ∫_{−∞}^{∞} fXY Z (x, y, z) dx;

fX (x) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} fXY Z (x, y, z) dy dz

◮ Any marginal is a joint density of a subset of these rv’s


and we obtain it by integrating the (full) joint density
with respect to the remaining variables.
◮ We obtain the marginal mass functions for a subset of the
rv’s also similarly where we sum over the remaining
variables.
◮ We have to be a little careful in dealing with these when
some random variables are discrete and others are
continuous.
◮ Suppose X is continuous and Y, Z are discrete. We do
not have any joint density or mass function as such.
◮ However, the joint df is always well defined.
◮ Suppose we want marginal joint distribution of X, Y . We
know how to get FXY by marginalization.
◮ Then we can get fX (a density), fY (a mass fn), fX|Y
(conditional density) and fY |X (conditional mass fn)
◮ With these we can generally calculate most quantities of
interest.



◮ Like in case of marginals, there are different types of
conditional distributions now.
◮ We can always define conditional distribution functions
like

FXY |Z (x, y|z) = P [X ≤ x, Y ≤ y|Z = z]


FX|Y Z (x|y, z) = P [X ≤ x|Y = y, Z = z]

◮ In all such cases, if the conditioning random variables are


continuous, we define the above as a limit.
◮ For example when Z is continuous

FXY |Z (x, y|z) = lim_{δ↓0} P [X ≤ x, Y ≤ y | Z ∈ [z, z + δ]]



◮ If X, Y, Z are all discrete then, all conditional mass
functions are defined by appropriate conditional
probabilities. For example,
fX|Y Z (x|y, z) = P [X = x|Y = y, Z = z]
◮ Thus the following are obvious

fXY |Z (x, y|z) = fXY Z (x, y, z) / fZ (z)

fX|Y Z (x|y, z) = fXY Z (x, y, z) / fY Z (y, z)

fXY Z (x, y, z) = fZ|Y X (z|y, x) fY |X (y|x) fX (x)

◮ For example, the first one above follows from

P [X = x, Y = y|Z = z] = P [X = x, Y = y, Z = z] / P [Z = z]



◮ When X, Y, Z have joint density, all such relations hold
for the appropriate (conditional) densities. For example,

FZ|XY (z|x, y) = lim_{δ↓0} P [Z ≤ z, X ∈ [x, x + δ], Y ∈ [y, y + δ]] / P [X ∈ [x, x + δ], Y ∈ [y, y + δ]]

               = lim_{δ↓0} ( ∫_{−∞}^{z} ∫_{x}^{x+δ} ∫_{y}^{y+δ} fXY Z (x′, y′, z′) dy′ dx′ dz′ ) / ( ∫_{x}^{x+δ} ∫_{y}^{y+δ} fXY (x′, y′) dy′ dx′ )

               = ∫_{−∞}^{z} ( fXY Z (x, y, z′) / fXY (x, y) ) dz′ = ∫_{−∞}^{z} fZ|XY (z′|x, y) dz′

◮ Thus we get

fXY Z (x, y, z) = fZ|XY (z|x, y) fXY (x, y) = fZ|XY (z|x, y) fY |X (y|x) fX (x)



◮ We can similarly talk about the joint distribution of any
finite number of rv’s
◮ Let X1 , X2 , · · · , Xn be rv’s on the same probability space.
◮ We denote it as a vector X or X. We can think of it as a
mapping, X : Ω → ℜn .
◮ We can write the joint distribution as

FX (x) = P [X ≤ x] = P [Xi ≤ xi , i = 1, · · · , n]

◮ We represent by fX (x) the joint density or mass function.


Sometimes we also write it as fX1 ···Xn (x1 , · · · , xn )
◮ We use similar notation for marginal and conditional
distributions



Independence of multiple random variables

◮ Random variables X1 , X2 , · · · , Xn are said to be
independent if the events [Xi ∈ Bi ], i = 1, · · · , n are
independent, for all Borel sets Bi .
(Recall the definition of independence of a set of events)
◮ Independence implies that the marginals would determine
the joint distribution.
◮ If X, Y, Z are independent then
fXY Z (x, y, z) = fX (x)fY (y)fZ (z)
◮ For independent random variables, the joint mass
function (or density function) is product of individual
mass functions (or density functions)



Example
◮ Let a joint density be given by

fXY Z (x, y, z) = K, 0<z<y<x<1

First let us determine K.


∫_{−∞}^{∞} ∫_{−∞}^{∞} ∫_{−∞}^{∞} fXY Z (x, y, z) dz dy dx = ∫_0^1 ∫_0^x ∫_0^y K dz dy dx
                                                          = K ∫_0^1 ∫_0^x y dy dx
                                                          = K ∫_0^1 (x²/2) dx
                                                          = K/6  ⇒ K = 6



fXY Z (x, y, z) = 6, 0<z<y<x<1

◮ Suppose we want to find the (marginal) joint distribution


of X and Z.
fXZ (x, z) = ∫_{−∞}^{∞} fXY Z (x, y, z) dy
           = ∫_z^x 6 dy,  0 < z < x < 1
           = 6(x − z),  0 < z < x < 1



◮ We got the joint density as

fXZ (x, z) = 6(x − z), 0<z<x<1

◮ We can verify this is a joint density

∫_{−∞}^{∞} ∫_{−∞}^{∞} fXZ (x, z) dz dx = ∫_0^1 ∫_0^x 6(x − z) dz dx
                                        = ∫_0^1 ( 6x [z]_0^x − 6 [z²/2]_0^x ) dx
                                        = ∫_0^1 ( 6x² − 3x² ) dx
                                        = 3 [x³/3]_0^1 = 1



◮ The joint density of X, Y, Z is

fXY Z (x, y, z) = 6, 0<z<y<x<1

◮ The joint density of X, Z is

fXZ (x, z) = 6(x − z), 0<z<x<1

◮ Hence,

fY |XZ (y|x, z) = fXY Z (x, y, z) / fXZ (x, z) = 1/(x − z),  0 < z < y < x < 1



Functions of multiple random variables
◮ Let X, Y be random variables on the same probability
space.
◮ Let g : ℜ2 → ℜ.
◮ Let Z = g(X, Y ). Then Z is a rv
◮ This is analogous to functions of a single rv
[Figure: (X, Y ) maps the sample space to ℜ² and g maps ℜ² to ℜ, giving Z = g(X, Y ); a Borel set B ⊂ ℜ pulls back to a set B′ ⊂ ℜ²]



◮ let Z = g(X, Y )
◮ We can determine distribution of Z from the joint
distribution of X, Y

FZ (z) = P [Z ≤ z] = P [g(X, Y ) ≤ z]

◮ For example, if X, Y are discrete, then

fZ (z) = P [Z = z] = P [g(X, Y ) = z] = Σ_{xi ,yj : g(xi ,yj )=z} fXY (xi , yj )



◮ Let X, Y be discrete rv’s. Let Z = min(X, Y ).

fZ (z) = P [min(X, Y ) = z]
       = P [X = z, Y > z] + P [Y = z, X > z] + P [X = Y = z]
       = Σ_{y>z} P [X = z, Y = y] + Σ_{x>z} P [X = x, Y = z] + P [X = z, Y = z]
       = Σ_{y>z} fXY (z, y) + Σ_{x>z} fXY (x, z) + fXY (z, z)

◮ Now suppose X, Y are independent and both of them


have geometric distribution with the same parameter, p.
◮ Such random variables are called independent and
identically distributed or iid random variables.



◮ Now we can get pmf of Z as (note Z ∈ {1, 2, · · · })

fZ (z) = P [X = z, Y > z] + P [Y = z, X > z] + P [X = Y = z]
       = P [X = z]P [Y > z] + P [Y = z]P [X > z] + P [X = z]P [Y = z]
       = 2 p(1 − p)^{z−1} (1 − p)^z + ( p(1 − p)^{z−1} )²
       = 2p(1 − p)^{2z−1} + p²(1 − p)^{2z−2}
       = p(1 − p)^{2z−2} (2(1 − p) + p)
       = (2 − p) p (1 − p)^{2z−2}
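A quick cross-check of this pmf against a direct (truncated) sum over the joint pmf; p = 0.3 is an arbitrary choice:

```python
p = 0.3

def geom(k):
    # P[X = k] for Geometric(p), k = 1, 2, ...
    return p * (1 - p) ** (k - 1)

def pmf_min_direct(z, cutoff=400):
    # P[X = z, Y > z] + P[Y = z, X > z] + P[X = Y = z],
    # with the geometric tail truncated at `cutoff` (negligible error)
    tail = sum(geom(k) for k in range(z + 1, cutoff))
    return 2 * geom(z) * tail + geom(z) ** 2

def pmf_min_formula(z):
    return (2 - p) * p * (1 - p) ** (2 * z - 2)
```

Note the formula is itself a geometric pmf with parameter 1 − (1 − p)² = 2p − p², consistent with min of iid geometrics being geometric.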



◮ We can show this is a pmf

Σ_{z=1}^{∞} fZ (z) = Σ_{z=1}^{∞} (2 − p) p (1 − p)^{2z−2}
                   = (2 − p) p Σ_{z=1}^{∞} (1 − p)^{2z−2}
                   = (2 − p) p · 1/(1 − (1 − p)²)
                   = (2 − p) p · 1/(2p − p²) = 1



◮ Let us consider the max and min functions, in general.
◮ Let Z = max(X, Y ). Then we have

FZ (z) = P [Z ≤ z] = P [max(X, Y ) ≤ z]
= P [X ≤ z, Y ≤ z]
= FXY (z, z)
= FX (z)FY (z), if X, Y are independent
= (FX (z))2 , if they are iid

◮ This is true of all random variables.


◮ Suppose X, Y are iid continuous rv. Then density of Z is

fZ (z) = 2FX (z)fX (z)



◮ Suppose X, Y are iid uniform over (0, 1)
◮ Then we get df and pdf of Z = max(X, Y ) as

FZ (z) = z 2 , 0 < z < 1; and fZ (z) = 2z, 0 < z < 1

FZ (z) = 0 for z ≤ 0 and FZ (z) = 1 for z ≥ 1 and
fZ (z) = 0 outside (0, 1)



◮ This is easily generalized to n random variables.
◮ Let Z = max(X1 , · · · , Xn )

FZ (z) = P [Z ≤ z] = P [max(X1 , X2 , · · · , Xn ) ≤ z]
= P [X1 ≤ z, X2 ≤ z, · · · , Xn ≤ z]
= FX1 ···Xn (z, · · · , z)
= FX1 (z) · · · FXn (z), if they are independent
= (FX (z))n , if they are iid
where we take FX as the common df

◮ For example if all Xi are uniform over (0, 1) and ind, then
FZ (z) = z n , 0 < z < 1



◮ Consider Z = min(X, Y ) and X, Y independent

FZ (z) = P [Z ≤ z] = P [min(X, Y ) ≤ z]

◮ It is difficult to write this in terms of joint df of X, Y .


◮ So, we consider the following

P [Z > z] = P [min(X, Y ) > z]


= P [X > z, Y > z]
= P [X > z]P [Y > z], using independence
= (1 − FX (z))(1 − FY (z))
= (1 − FX (z))2 , if they are iid

Hence, FZ (z) = 1 − (1 − FX (z))(1 − FY (z))


◮ We can once again find density of Z if X, Y are
continuous
◮ Suppose X, Y are iid uniform (0, 1).
◮ Z = min(X, Y )

FZ (z) = 1 − (1 − FX (z))2 = 1 − (1 − z)2 , 0 < z < 1

◮ We get the density of Z as

fZ (z) = 2(1 − z), 0 < z < 1



◮ min fn is also easily generalized to n random variables
◮ Let Z = min(X1 , X2 , · · · , Xn )

P [Z > z] = P [min(X1 , X2 , · · · , Xn ) > z]


= P [X1 > z, · · · , Xn > z]
= P [X1 > z] · · · P [Xn > z], using independence
= (1 − FX1 (z)) · · · (1 − FXn (z))
= (1 − FX (z))n , if they are iid

◮ Hence, when Xi are iid, the df of Z is

FZ (z) = 1 − (1 − FX (z))n

where FX is the common df
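Both the max and min formulas are easy to check by simulation for iid Uniform(0,1) variables (n, the sample size, the seed and the point z = 0.7 are arbitrary choices):

```python
import random

random.seed(0)
n, trials, z = 5, 200_000, 0.7
count_max = count_min = 0
for _ in range(trials):
    xs = [random.random() for _ in range(n)]
    count_max += max(xs) <= z   # event [max <= z], probability z**n
    count_min += min(xs) <= z   # event [min <= z], probability 1 - (1-z)**n

emp_max = count_max / trials
emp_min = count_min / trials
```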



Joint distribution of max and min
◮ X, Y iid with df F and density f
Z = max(X, Y ) and W = min(X, Y ).
◮ We want joint distribution function of Z and W .
◮ We can use the following
P [Z ≤ z] = P [Z ≤ z, W ≤ w] + P [Z ≤ z, W > w]

P [Z ≤ z, W > w] = P [w < X ≤ z, w < Y ≤ z] = (F (z) − F (w))²

P [Z ≤ z] = P [X ≤ z, Y ≤ z] = (F (z))²
◮ So, we get FZW as
FZW (z, w) = P [Z ≤ z, W ≤ w]
           = P [Z ≤ z] − P [Z ≤ z, W > w]
           = (F (z))² − (F (z) − F (w))²
◮ Is this correct for all values of z, w?
◮ We have P [w < X ≤ z, w < Y ≤ z] = (F (z) − F (w))² only when
w ≤ z.
◮ Otherwise it is zero.
◮ Hence we get FZW as

FZW (z, w) = (F (z))²                       if w > z
           = (F (z))² − (F (z) − F (w))²    if w ≤ z

◮ We can get the joint density of Z, W as

fZW (z, w) = ∂²FZW (z, w) / ∂z ∂w = 2f (z)f (w),  w ≤ z



◮ Let X, Y be iid uniform over (0, 1).
◮ Define Z = max(X, Y ) and W = min(X, Y ).
◮ Then the joint density of Z, W is

fZW (z, w) = 2f (z)f (w), w ≤ z


= 2, 0 < w ≤ z < 1



Order Statistics
◮ Let X1 , · · · , Xn be iid with density f .
◮ Let X(k) denote the k th smallest of these.
◮ That is, X(k) = gk (X1 , · · · , Xn ) where gk : ℜn → ℜ and
the value of gk (x1 , · · · , xn ) is the k th smallest of the
numbers x1 , · · · , xn .
◮ X(1) = min(X1 , · · · , Xn ), X(n) = max(X1 , · · · , Xn )
◮ The joint distribution of X(1) , · · · X(n) is called the order
statistics.
◮ Earlier, we calculated the order statistics for the case
n = 2.
◮ It can be shown that

fX(1) ···X(n) (x1 , · · · , xn ) = n! Π_{i=1}^{n} f (xi ),  x1 < x2 < · · · < xn



Marginal distributions of X(k)

◮ Let X1 , · · · , Xn be iid with df F and density f .


◮ Let X(k) denote the k th smallest of these.
◮ We want the distribution of X(k) .
◮ The event [X(k) ≤ y] is:
“at least k of these are less than or equal to y”
◮ We want probability of this event.



Marginal distributions of X(k)
◮ X1 , · · · , Xn iid with df F and density f .
◮ P [Xi ≤ y] = F (y) for any i and y.
◮ Since they are independent, we have, e.g.,

P [X1 ≤ y, X2 > y, X3 ≤ y] = (F (y))2 (1 − F (y))

◮ Hence, the probability that exactly k of these n random
variables are less than or equal to y is
nCk (F (y))^k (1 − F (y))^{n−k}
◮ Hence we get

FX(k) (y) = Σ_{j=k}^{n} nCj (F (y))^j (1 − F (y))^{n−j}

We can get the density by differentiating this.
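For iid Uniform(0,1) variables F(y) = y, so the formula can be checked against a simulation of the k-th smallest value (n, k, y and the sample size are arbitrary choices):

```python
import random
from math import comb

n, k, y = 7, 3, 0.4
# FX(k)(y) = sum over j >= k of C(n,j) F(y)^j (1 - F(y))^(n-j), F(y) = y
theory = sum(comb(n, j) * y ** j * (1 - y) ** (n - j) for j in range(k, n + 1))

random.seed(1)
trials = 200_000
hits = 0
for _ in range(trials):
    sample = sorted(random.random() for _ in range(n))
    hits += sample[k - 1] <= y        # k-th smallest sits at index k-1
emp = hits / trials
```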


Sum of two discrete rv’s
◮ Let X, Y ∈ {0, 1, · · · }
◮ Let Z = X + Y . Then we have

fZ (z) = P [X + Y = z] = Σ_{x,y: x+y=z} P [X = x, Y = y]
       = Σ_{k=0}^{z} P [X = k, Y = z − k]
       = Σ_{k=0}^{z} fXY (k, z − k)

◮ Now suppose X, Y are independent. Then

fZ (z) = Σ_{k=0}^{z} fX (k) fY (z − k)



◮ Now suppose X, Y are independent Poisson with
parameters λ1 , λ2 . And, Z = X + Y .
fZ (z) = Σ_{k=0}^{z} fX (k) fY (z − k)
       = Σ_{k=0}^{z} (λ1^k / k!) e^{−λ1} (λ2^{z−k} / (z − k)!) e^{−λ2}
       = e^{−(λ1+λ2)} (1/z!) Σ_{k=0}^{z} ( z! / (k!(z − k)!) ) λ1^k λ2^{z−k}
       = e^{−(λ1+λ2)} (1/z!) (λ1 + λ2)^z
◮ Z is Poisson with parameter λ1 + λ2
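The identity can be verified numerically by comparing the discrete convolution with the Poisson(λ1 + λ2) pmf (the parameter values here are arbitrary):

```python
import math

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam1, lam2 = 2.0, 3.5

def pmf_sum(z):
    # convolution: sum over k of fX(k) fY(z - k)
    return sum(poisson_pmf(k, lam1) * poisson_pmf(z - k, lam2)
               for k in range(z + 1))
```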



Sum of two continuous rv
◮ Let X, Y have a joint density fXY . Let Z = X + Y

FZ (z) = P [Z ≤ z] = P [X + Y ≤ z]
       = ∫∫_{{(x,y): x+y≤z}} fXY (x, y) dy dx
       = ∫_{x=−∞}^{∞} ∫_{y=−∞}^{z−x} fXY (x, y) dy dx

change variable y to t: t = x + y; dt = dy; y = z − x ⇒ t = z

       = ∫_{x=−∞}^{∞} ∫_{t=−∞}^{z} fXY (x, t − x) dt dx
       = ∫_{−∞}^{z} ( ∫_{−∞}^{∞} fXY (x, t − x) dx ) dt

◮ This gives us the density of Z


◮ X, Y have joint density fXY . Z = X + Y . Then
fZ (z) = ∫_{−∞}^{∞} fXY (x, z − x) dx

◮ Now suppose X and Y are independent. Then


fZ (z) = ∫_{−∞}^{∞} fX (x) fY (z − x) dx

Density of the sum of independent random variables is
the convolution of their densities.

fX+Y = fX ∗ fY (Convolution)



Distribution of sum of iid uniform rv’s
◮ Suppose X, Y are iid uniform over (−1, 1).
◮ let Z = X + Y . We want fZ .
◮ The density of X, Y is fX (x) = fY (x) = 1/2 on (−1, 1). (figure omitted)
◮ fZ is convolution of this density with itself.



◮ fX (x) = 0.5, −1 < x < 1. fY is also same
◮ Note that Z takes values in [−2, 2]
fZ (z) = ∫_{−∞}^{∞} fX (t) fY (z − t) dt

◮ For the integrand to be non-zero we need


◮ −1 < t < 1 ⇒ t < 1, t > −1
◮ −1 < z − t < 1 ⇒ t < z + 1, t > z − 1
◮ Hence we need:
t < min(1, z + 1), t > max(−1, z − 1)
◮ Hence, for z < 0, we need −1 < t < z + 1
and, for z ≥ 0 we need z − 1 < t < 1
◮ Thus we get

fZ (z) = ∫_{−1}^{z+1} (1/4) dt = (z + 2)/4,  if − 2 ≤ z < 0
       = ∫_{z−1}^{1} (1/4) dt = (2 − z)/4,  if 0 ≤ z ≤ 2



◮ Thus, the density of sum of two ind rv’s that are uniform
over (−1, 1) is

fZ (z) = (z + 2)/4,  if − 2 < z < 0
       = (2 − z)/4,  if 0 < z < 2

◮ This is a triangle with vertices (−2, 0), (0, 0.5), (2, 0)
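A numerical convolution reproduces this triangle (the grid resolution is an arbitrary choice):

```python
import numpy as np

m = 4000
dt = 2.0 / m
t = -1 + (np.arange(m) + 0.5) * dt      # midpoint grid on (-1, 1)

def f_sum(z):
    # f_Z(z) = \int f_X(t) f_Y(z - t) dt with f_X = f_Y = 1/2 on (-1, 1)
    inside = np.abs(z - t) < 1
    return float(np.sum(0.5 * 0.5 * inside) * dt)

def triangle(z):
    return (2 - abs(z)) / 4 if abs(z) < 2 else 0.0
```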



Independence of functions of random variable

◮ Suppose X and Y are independent.


◮ Then g(X) and h(Y ) are independent
◮ This is because [g(X) ∈ B1 ] = [X ∈ B̃1 ] for some Borel
set, B̃1 and similarly [h(Y ) ∈ B2 ] = [Y ∈ B̃2 ]
◮ Hence, [g(X) ∈ B1 ] and [h(Y ) ∈ B2 ] are independent.



Independence of functions of random variable

◮ This is easily generalized to functions of multiple random


variables.
◮ If X, Y are vector random variables (or random vectors),
independence implies [X ∈ B1 ] is independent of
[Y ∈ B2 ] for all borel sets B1 , B2 (in appropriate spaces).
◮ Then g(X) would be independent of h(Y).
◮ That is, suppose X1 , · · · , Xm , Y1 , · · · , Yn are
independent.
◮ Then, g(X1 , · · · , Xm ) is independent of h(Y1 , · · · , Yn ).



◮ Let X1 , X2 , X3 be independent continuous rv
◮ Z = X1 + X2 + X3 .
◮ Can we find density of Z?
◮ Let W = X1 + X2 . We know how to find its density
◮ Then Z = W + X3 and W and X3 are independent.
◮ So, density of Z is the convolution of the densities of W
and X3 .



◮ Suppose X, Y are iid exponential rv’s.

fX (x) = λ e−λx , x > 0


◮ Let Z = X + Y . Then, the density of Z is

fZ (z) = ∫_{−∞}^{∞} fX (x) fY (z − x) dx
       = ∫_0^z λ e^{−λx} λ e^{−λ(z−x)} dx
       = λ² e^{−λz} ∫_0^z dx = λ² z e^{−λz}

◮ Thus, the sum of two independent exponential random variables
has a gamma distribution:

fZ (z) = (λz) λ e^{−λz} ,  z > 0
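A numerical check of this convolution (the λ value and grid size are arbitrary choices):

```python
import math

lam = 1.5

def f_exp(x):
    return lam * math.exp(-lam * x) if x > 0 else 0.0

def conv_at(z, m=2000):
    # midpoint rule for \int_0^z f(x) f(z - x) dx
    dx = z / m
    return dx * sum(f_exp((i + 0.5) * dx) * f_exp(z - (i + 0.5) * dx)
                    for i in range(m))

def gamma2(z):
    # the derived Gamma density: lambda^2 z e^{-lambda z}
    return lam ** 2 * z * math.exp(-lam * z)
```

The integrand is actually constant in x (λ² e^{−λz}), so the midpoint rule matches the closed form essentially exactly.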



Sum of independent gamma rv

◮ The Gamma density with parameters α > 0 and λ > 0 is given by

f (x) = (1/Γ(α)) λ^α x^{α−1} e^{−λx} ,  x > 0
We will call this Gamma(α, λ).
◮ The α is called the shape parameter and λ is called the
rate parameter.
◮ For α = 1 this is the exponential density.
◮ Let X ∼ Gamma(α1 , λ), Y ∼ Gamma(α2 , λ).
Suppose X, Y are independent.
◮ Let Z = X + Y . Then Z ∼ Gamma(α1 + α2 , λ).



fZ (z) = ∫_{−∞}^{∞} fX (x) fY (z − x) dx

       = ∫_0^z (1/Γ(α1)) λ^{α1} x^{α1−1} e^{−λx} (1/Γ(α2)) λ^{α2} (z − x)^{α2−1} e^{−λ(z−x)} dx

       = ( λ^{α1+α2} e^{−λz} / (Γ(α1)Γ(α2)) ) ∫_0^z z^{α1−1} (x/z)^{α1−1} z^{α2−1} (1 − x/z)^{α2−1} dx

change the variable: t = x/z (⇒ z^{−1} dx = dt)

       = ( λ^{α1+α2} e^{−λz} / (Γ(α1)Γ(α2)) ) z^{α1+α2−1} ∫_0^1 t^{α1−1} (1 − t)^{α2−1} dt

       = ( 1/Γ(α1 + α2) ) λ^{α1+α2} z^{α1+α2−1} e^{−λz}

because

∫_0^1 t^{α1−1} (1 − t)^{α2−1} dt = Γ(α1)Γ(α2) / Γ(α1 + α2)
◮ If X, Y are independent gamma random variables then
X + Y also has gamma distribution.
◮ If X ∼ Gamma(α1 , λ), and Y ∼ Gamma(α2 , λ), then
X + Y ∼ Gamma(α1 + α2 , λ).



Sum of independent Gaussians

◮ Sum of independent Gaussians random variables is a


Gaussian rv
◮ If X ∼ N (µ1 , σ12 ) and Y ∼ N (µ2 , σ22 ) and X, Y are
independent, then
X + Y ∼ N (µ1 + µ2 , σ12 + σ22 )
◮ We can show this.
◮ The algebra is a little involved.
◮ There is a calculation trick that is often useful with
Gaussian density



A Calculation Trick

I = ∫_{−∞}^{∞} exp( −(1/(2K)) (x² − 2bx + c) ) dx
  = ∫_{−∞}^{∞} exp( −(1/(2K)) ((x − b)² + c − b²) ) dx
  = ∫_{−∞}^{∞} exp( −(x − b)²/(2K) ) exp( −(c − b²)/(2K) ) dx
  = exp( −(c − b²)/(2K) ) √(2πK)

because

(1/√(2πK)) ∫_{−∞}^{∞} exp( −(x − b)²/(2K) ) dx = 1
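A numeric sanity check of the identity; the values of b, c, K and the integration range are arbitrary, the range just has to cover the bulk of the Gaussian:

```python
import math

def integrand(x, b, c, K):
    return math.exp(-(x * x - 2 * b * x + c) / (2 * K))

def lhs(b, c, K, lo=-30.0, hi=30.0, m=120_000):
    # midpoint-rule approximation of the integral I
    dx = (hi - lo) / m
    return dx * sum(integrand(lo + (i + 0.5) * dx, b, c, K) for i in range(m))

def rhs(b, c, K):
    # the closed form: exp(-(c - b^2)/(2K)) sqrt(2 pi K)
    return math.exp(-(c - b * b) / (2 * K)) * math.sqrt(2 * math.pi * K)
```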



◮ We next look at a general theorem that is quite useful in
dealing with functions of multiple random variables.
◮ This result is only for continuous random variables.



◮ Let X1 , · · · , Xn be continuous random variables with
joint density fX1 ···Xn . We define Y1 , · · · Yn by
Y1 = g1 (X1 , · · · , Xn ) ··· Yn = gn (X1 , · · · , Xn )
We think of gi as components of g : ℜn → ℜn .
◮ We assume g is continuous with continuous first partials
and is invertible.
◮ Let h be the inverse of g. That is
X1 = h1 (Y1 , · · · , Yn ) ··· Xn = hn (Y1 , · · · , Yn )
◮ Each of gi , hi are ℜn → ℜ functions and we can write
them as
yi = gi (x1 , · · · , xn ); ··· xi = hi (y1 , · · · , yn )
We denote the partial derivatives of these functions by ∂xi /∂yj etc.
◮ The Jacobian of the inverse transformation is

J = ∂(x1 , · · · , xn ) / ∂(y1 , · · · , yn )
  = det | ∂x1 /∂y1   ∂x1 /∂y2   · · ·   ∂x1 /∂yn |
        | ∂x2 /∂y1   ∂x2 /∂y2   · · ·   ∂x2 /∂yn |
        |    ..         ..       ..        ..    |
        | ∂xn /∂y1   ∂xn /∂y2   · · ·   ∂xn /∂yn |

◮ We assume that J is non-zero in the range of the


transformation
◮ Theorem: Under the above conditions, we have

fY1 ···Yn (y1 , · · · , yn ) = |J|fX1 ···Xn (h1 (y1 , · · · , yn ), · · · , hn (y1 , · · · , yn ))

Or, more compactly, fY (y) = |J|fX (h(y))



◮ Let X1 , X2 have a joint density, fX . Consider

Y1 = g1 (X1 , X2 ) = X1 + X2 (g1 (a, b) = a + b)


Y2 = g2 (X1 , X2 ) = X1 − X2 (g2 (a, b) = a − b)

This transformation is invertible


Y1 + Y2
X1 = h1 (Y1 , Y2 ) = (h1 (a, b) = (a + b)/2)
2
Y1 − Y2
X2 = h2 (Y1 , Y2 ) = (h2 (a, b) = (a − b)/2)
2
The Jacobian is: J = det | 0.5  0.5 ; 0.5  −0.5 | = −0.5.

◮ This gives: fY1 Y2 (y1 , y2 ) = 0.5 fX1 X2 ((y1 + y2 )/2, (y1 − y2 )/2)
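As a concrete check, take X1, X2 iid N(0,1) (my choice, not from the slides); then Y1 = X1 + X2 and Y2 = X1 − X2 are independent N(0,2) variables, and the formula above reproduces the product of their densities:

```python
import math

def phi(x, var=1.0):
    # N(0, var) density
    return math.exp(-x * x / (2 * var)) / math.sqrt(2 * math.pi * var)

def f_Y(y1, y2):
    # change-of-variable formula: |J| = 1/2,
    # h1(y) = (y1 + y2)/2, h2(y) = (y1 - y2)/2
    return 0.5 * phi((y1 + y2) / 2) * phi((y1 - y2) / 2)
```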



Proof of Theorem
◮ Let B = (−∞, y1 ] × · · · × (−∞, yn ] ⊂ ℜn . Then
FY (y) = FY1 ···Yn (y1 , · · · , yn ) = P [Yi ≤ yi , i = 1, · · · , n]
       = ∫_B fY1 ···Yn (y1′ , · · · , yn′ ) dy1′ · · · dyn′

◮ Define

g^{−1} (B) = {(x1 , · · · , xn ) ∈ ℜn : g(x1 , · · · , xn ) ∈ B}
           = {(x1 , · · · , xn ) ∈ ℜn : gi (x1 , · · · , xn ) ≤ yi , i = 1, · · · , n}

◮ Then we have

FY1 ···Yn (y1 , · · · , yn ) = P [gi (X1 , · · · , Xn ) ≤ yi , i = 1, · · · , n]
                            = ∫_{g^{−1}(B)} fX1 ···Xn (x′1 , · · · , x′n ) dx′1 · · · dx′n



Proof of Theorem
◮ B = (−∞, y1 ] × · · · × (−∞, yn ].
◮ g −1 (B) = {(x1 , · · · , xn ) ∈ ℜn : g(x1 , · · · , xn ) ∈ B}

FY (y1 , · · · , yn ) = P [gi (X1 , · · · , Xn ) ≤ yi , i = 1, · · · , n]
                     = ∫_{g^{−1}(B)} fX1 ···Xn (x′1 , · · · , x′n ) dx′1 · · · dx′n

change variables: yi′ = gi (x′1 , · · · , x′n ), i = 1, · · · , n
(x′1 , · · · , x′n ) ∈ g^{−1} (B) ⇒ (y1′ , · · · , yn′ ) ∈ B
x′i = hi (y1′ , · · · , yn′ ),  dx′1 · · · dx′n = |J| dy1′ · · · dyn′

FY (y1 , · · · , yn ) = ∫_B fX1 ···Xn (h1 (y′ ), · · · , hn (y′ )) |J| dy1′ · · · dyn′

⇒ fY1 ···Yn (y1 , · · · , yn ) = fX1 ···Xn (h1 (y), · · · , hn (y)) |J|



◮ X1 , · · · Xn are continuous rv with joint density

Y1 = g1 (X1 , · · · , Xn ) ··· Yn = gn (X1 , · · · , Xn )

◮ The transformation is continuous with continuous first


partials and is invertible and

X1 = h1 (Y1 , · · · , Yn ) ··· Xn = hn (Y1 , · · · , Yn )

◮ We assume the Jacobian of the inverse transform, J, is


non-zero
◮ Then the density of Y is

fY1 ···Yn (y1 , · · · , yn ) = |J|fX1 ···Xn (h1 (y1 , · · · , yn ), · · · , hn (y1 , · · · , yn ))

◮ Called multidimensional change of variable formula



◮ Let X, Y have joint density fXY . Let Z = X + Y .
◮ We want to find fZ using the theorem.
◮ To use the theorem, we need an invertible transformation
of ℜ2 onto ℜ2 of which one component is x + y.
◮ Take Z = X + Y and W = X − Y . This is invertible.
◮ X = (Z + W )/2 and Y = (Z − W )/2. The Jacobian is

J = det | 1/2  1/2 ; 1/2  −1/2 | = −1/2

◮ Hence we get

fZW (z, w) = (1/2) fXY ((z + w)/2, (z − w)/2)

◮ Now we get the density of Z as

fZ (z) = ∫_{−∞}^{∞} (1/2) fXY ((z + w)/2, (z − w)/2) dw



◮ Let Z = X + Y and W = X − Y . Then

fZ (z) = ∫_{−∞}^{∞} (1/2) fXY ((z + w)/2, (z − w)/2) dw

change the variable: t = (z + w)/2 ⇒ dt = (1/2) dw
⇒ w = 2t − z ⇒ z − w = 2z − 2t

fZ (z) = ∫_{−∞}^{∞} fXY (t, z − t) dt
       = ∫_{−∞}^{∞} fXY (z − s, s) ds

◮ We get the same result as earlier. If X, Y are independent,

fZ (z) = ∫_{−∞}^{∞} fX (t) fY (z − t) dt



◮ Let Z = X + Y and W = X − Y . We got

fZW (z, w) = (1/2) fXY ((z + w)/2, (z − w)/2)

◮ Now we can calculate fW also.

fW (w) = ∫_{−∞}^{∞} (1/2) fXY ((z + w)/2, (z − w)/2) dz

change the variable: t = (z + w)/2 ⇒ dt = (1/2) dz
⇒ z = 2t − w ⇒ z − w = 2t − 2w

fW (w) = ∫_{−∞}^{∞} fXY (t, t − w) dt
       = ∫_{−∞}^{∞} fXY (s + w, s) ds



Example
◮ Let X, Y be iid U [0, 1]. Let Z = X − Y .
fZ (z) = ∫_{−∞}^{∞} fX (t) fY (t − z) dt

◮ For the integrand to be non-zero
◮ 0 ≤ t ≤ 1 ⇒ t ≥ 0, t ≤ 1
◮ 0 ≤ t − z ≤ 1 ⇒ t ≥ z, t ≤ 1 + z
◮ ⇒ max(0, z) ≤ t ≤ min(1, 1 + z)
◮ Thus, we get the density as (note Z ∈ (−1, 1))

fZ (z) = ∫_0^{1+z} 1 dt = 1 + z,  if − 1 ≤ z ≤ 0
       = ∫_z^1 1 dt = 1 − z,      if 0 ≤ z ≤ 1

◮ Thus, when X, Y ∼ U (0, 1) iid

fX−Y (z) = 1 − |z|,  −1 < z < 1



◮ We showed that

fX+Y (z) = ∫_{−∞}^{∞} fXY (t, z − t) dt = ∫_{−∞}^{∞} fXY (z − t, t) dt

fX−Y (w) = ∫_{−∞}^{∞} fXY (t, t − w) dt = ∫_{−∞}^{∞} fXY (t + w, t) dt

◮ Suppose X, Y are discrete. Then we have

fX+Y (z) = P [X + Y = z] = Σ_k P [X = k, Y = z − k] = Σ_k fXY (k, z − k)

fX−Y (w) = P [X − Y = w] = Σ_k P [X = k, Y = k − w] = Σ_k fXY (k, k − w)



Distribution of product of random variables
◮ We want density of Z = XY .
◮ We need one more function to make an invertible
transformation
◮ A possible choice: Z = XY, W = Y
◮ This is invertible: X = Z/W, Y = W

J = det | 1/w  −z/w² ; 0  1 | = 1/w

◮ Hence we get

fZW (z, w) = (1/|w|) fXY (z/w, w)

◮ Thus we get the density of the product as

fZ (z) = ∫_{−∞}^{∞} (1/|w|) fXY (z/w, w) dw
Density of XY
◮ Let X, Y have joint density fXY . Let Z = XY .
◮ We can find density of XY directly also (but it is more
complicated)
◮ Let Az = {(x, y) ∈ ℜ² : xy ≤ z} ⊂ ℜ².

FZ (z) = P [XY ≤ z] = P [(X, Y ) ∈ Az ] = ∫∫_{Az} fXY (x, y) dy dx

◮ We need to find limits for integrating over Az
◮ If x > 0, then xy ≤ z ⇒ y ≤ z/x
If x < 0, then xy ≤ z ⇒ y ≥ z/x

FZ (z) = ∫_{−∞}^{0} ∫_{z/x}^{∞} fXY (x, y) dy dx + ∫_{0}^{∞} ∫_{−∞}^{z/x} fXY (x, y) dy dx



FZ (z) = ∫_{−∞}^{0} ∫_{z/x}^{∞} fXY (x, y) dy dx + ∫_{0}^{∞} ∫_{−∞}^{z/x} fXY (x, y) dy dx

◮ Change variable from y to t using t = xy
y = t/x; dy = (1/x) dt; y = z/x ⇒ t = z

FZ (z) = ∫_{−∞}^{0} ∫_{z}^{−∞} (1/x) fXY (x, t/x) dt dx + ∫_{0}^{∞} ∫_{−∞}^{z} (1/x) fXY (x, t/x) dt dx
       = ∫_{−∞}^{0} ∫_{−∞}^{z} (1/|x|) fXY (x, t/x) dt dx + ∫_{0}^{∞} ∫_{−∞}^{z} (1/|x|) fXY (x, t/x) dt dx
       = ∫_{−∞}^{z} ( ∫_{−∞}^{∞} (1/|x|) fXY (x, t/x) dx ) dt

(flipping the inner limits for x < 0 absorbs the sign, giving 1/|x|)

This shows: fZ (z) = ∫_{−∞}^{∞} (1/|x|) fXY (x, z/x) dx




example

◮ Let X, Y be iid U (0, 1). Let Z = XY .

fZ (z) = ∫_{−∞}^{∞} (1/|w|) fX (z/w) fY (w) dw

◮ We need: 0 < w < 1 and 0 < z/w < 1. Hence

fZ (z) = ∫_z^1 (1/w) dw = − ln(z),  0 < z < 1
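A simulation check: integrating the density, FZ(z) = ∫_0^z −ln t dt = z − z ln z, which the empirical df of products of uniforms should match (the seed, sample size and the point z = 0.3 are arbitrary choices):

```python
import math
import random

random.seed(3)
trials = 200_000
z = 0.3
# empirical P[XY <= z] for X, Y iid U(0,1)
hits = sum(random.random() * random.random() <= z for _ in range(trials))
emp = hits / trials
theory = z - z * math.log(z)   # \int_0^z -ln(t) dt
```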



◮ X, Y have a joint density and Z = XY . Then

fZ (z) = ∫_{−∞}^{∞} (1/|w|) fXY (z/w, w) dw

◮ Suppose X, Y are discrete and Z = XY

fZ (0) = P [X = 0 or Y = 0] = Σ_x fXY (x, 0) + Σ_y fXY (0, y) − fXY (0, 0)

fZ (k) = Σ_{y≠0} P [X = k/y, Y = y] = Σ_{y≠0} fXY (k/y, y),  k ≠ 0

◮ We cannot always interchange density and mass
functions!!



◮ We wanted density of Z = XY .
◮ We used: Z = XY and W = Y .
◮ We could have used: Z = XY and W = X.
◮ This is invertible: X = W and Y = Z/W .

J = det | 0  1 ; 1/w  −z/w² | = −1/w

◮ This gives

fZW (z, w) = (1/|w|) fXY (w, z/w)

fZ (z) = ∫_{−∞}^{∞} (1/|w|) fXY (w, z/w) dw

◮ The fZ should be the same in both cases.
Distributions of quotients
◮ X, Y have joint density and Z = X/Y .
◮ We can take: Z = X/Y, W = Y
◮ This is invertible: X = ZW, Y = W

J = det | w  z ; 0  1 | = w

◮ Hence we get

fZW (z, w) = |w| fXY (zw, w)

◮ Thus we get the density of the quotient as

fZ (z) = ∫_{−∞}^{∞} |w| fXY (zw, w) dw



example
◮ Let X, Y be iid U (0, 1). Let Z = X/Y .
Note Z ∈ (0, ∞)

fZ (z) = ∫_{−∞}^{∞} |w| fX (zw) fY (w) dw

◮ We need 0 < w < 1 and 0 < zw < 1 ⇒ w < 1/z.
◮ So, when z ≤ 1, w goes from 0 to 1; when z > 1, w goes
from 0 to 1/z.
◮ Hence we get the density as

fZ (z) = ∫_0^1 w dw = 1/2,          if 0 < z ≤ 1
       = ∫_0^{1/z} w dw = 1/(2z²),  if 1 < z < ∞
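A simulation check using FZ(2) = 1/2 + ∫_1^2 1/(2z²) dz = 3/4; the denominator uses 1 − random() so it lies in (0, 1] and cannot be zero (seed and sample size are arbitrary choices):

```python
import random

random.seed(4)
trials = 200_000
hits = 0
for _ in range(trials):
    x = random.random()
    y = 1.0 - random.random()     # in (0, 1], avoids division by zero
    hits += (x / y) <= 2.0
emp = hits / trials               # should be near F_Z(2) = 3/4
```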



◮ X, Y have a joint density and Z = X/Y

fZ (z) = ∫_{−∞}^{∞} |w| fXY (zw, w) dw

◮ Suppose X, Y are discrete and Z = X/Y

fZ (z) = P [Z = z] = P [X/Y = z] = Σ_y P [X = yz, Y = y] = Σ_y fXY (yz, y)



◮ We chose: Z = X/Y and W = Y .
◮ We could have taken: Z = X/Y and W = X
◮ The inverse is: X = W and Y = W/Z

J = −w/z², so |J| = |w|/z²

◮ Thus we get the density of the quotient as

fZ (z) = ∫_{−∞}^{∞} (|w|/z²) fXY (w, w/z) dw

put t = w/z ⇒ dt = dw/z, w = tz

       = ∫_{−∞}^{∞} |t| fXY (tz, t) dt

◮ We can show that the density of the quotient is the same in both
these approaches.
Summary: Densities of standard functions of rv’s

◮ We derived densities of the sum, difference, product and
quotient of random variables.

fX+Y (z) = ∫_{−∞}^{∞} fXY (t, z − t) dt = ∫_{−∞}^{∞} fXY (z − t, t) dt

fX−Y (z) = ∫_{−∞}^{∞} fXY (t, t − z) dt = ∫_{−∞}^{∞} fXY (t + z, t) dt

fX∗Y (z) = ∫_{−∞}^{∞} (1/|t|) fXY (z/t, t) dt = ∫_{−∞}^{∞} (1/|t|) fXY (t, z/t) dt

f(X/Y ) (z) = ∫_{−∞}^{∞} |t| fXY (zt, t) dt = ∫_{−∞}^{∞} (|t|/z²) fXY (t, t/z) dt



Exchangeable Random Variables
◮ X1 , X2 , · · · , Xn are said to be exchangeable if their joint
distribution is same as that of any permutation of them.
◮ let (i1 , · · · , in ) be a permutation of (1, 2, · · · , n). Then
joint df of (Xi1 , · · · , Xin ) should be same as that
(X1 , · · · , Xn )
◮ Take n = 3. Suppose FX1 X2 X3 (a, b, c) = g(a, b, c). If they
are exchangeable, then

FX2 X3 X1 (a, b, c) = P [X2 ≤ a, X3 ≤ b, X1 ≤ c]


= P [X1 ≤ c, X2 ≤ a, X3 ≤ b]
= g(c, a, b) = g(a, b, c)

◮ The df or density should be “symmetric” in its variables if


the random variables are exchangeable.



◮ Consider the density of three random variables
f (x, y, z) = (2/3)(x + y + z),  0 < x, y, z < 1
◮ They are exchangeable (because f (x, y, z) = f (y, x, z))
◮ If random variables are exchangeable then they are
identically distributed.
FXY Z (a, ∞, ∞) = FXY Z (∞, ∞, a) ⇒ FX (a) = FZ (a)
◮ The above example shows that exchangeable random
variables need not be independent. The joint density is
not factorizable.
∫_0^1 ∫_0^1 (2/3)(x + y + z) dy dz = 2(x + 1)/3
◮ So, the joint density is not the product of marginals



Expectation of functions of multiple rv
◮ Theorem: Let Z = g(X1 , · · · , Xn ) = g(X). Then

E[Z] = ∫_{ℜn} g(x) dFX (x)

◮ That is, if they have a joint density, then

E[Z] = ∫_{ℜn} g(x) fX (x) dx

◮ Similarly, if all Xi are discrete

E[Z] = Σ_x g(x) fX (x)

P S Sastry, IISc, E1 222 Aug 2021 147/248


◮ Let Z = X + Y . Let X, Y have joint density fXY . Then

E[X + Y ] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (x + y) fXY (x, y) dx dy
          = ∫_{−∞}^{∞} x (∫_{−∞}^{∞} fXY (x, y) dy) dx + ∫_{−∞}^{∞} y (∫_{−∞}^{∞} fXY (x, y) dx) dy
          = ∫_{−∞}^{∞} x fX (x) dx + ∫_{−∞}^{∞} y fY (y) dy
          = E[X] + E[Y ]

◮ Expectation is a linear operator.
◮ This is true for all random variables.


◮ We saw E[X + Y ] = E[X] + E[Y ].
◮ Let us calculate Var(X + Y ):

Var(X + Y ) = E[((X + Y ) − E[X + Y ])²]
            = E[((X − EX) + (Y − EY ))²]
            = E[(X − EX)²] + E[(Y − EY )²] + 2E[(X − EX)(Y − EY )]
            = Var(X) + Var(Y ) + 2Cov(X, Y )

where we define the covariance between X, Y as

Cov(X, Y ) = E[(X − EX)(Y − EY )]


◮ We define covariance between X and Y by

Cov(X, Y ) = E [(X − EX)(Y − EY )]


= E [XY − X(EY ) − Y (EX) + EX EY ]
= E[XY ] − EX EY

◮ Note that Cov(X, Y ) can be positive or negative


◮ X and Y are said to be uncorrelated if Cov(X, Y ) = 0
◮ If X and Y are uncorrelated then

Var(X + Y ) = Var(X) + Var(Y )

◮ Note that E[X + Y ] = E[X] + E[Y ] for all random


variables.



Example

◮ Consider the joint density

fXY (x, y) = 2, 0 < x < y < 1

◮ We want to calculate Cov(X, Y ):

EX = ∫_0^1 ∫_x^1 x · 2 dy dx = 2 ∫_0^1 x(1 − x) dx = 1/3

EY = ∫_0^1 ∫_0^y y · 2 dx dy = 2 ∫_0^1 y² dy = 2/3

E[XY ] = ∫_0^1 ∫_0^y xy · 2 dx dy = 2 ∫_0^1 y (y²/2) dy = 1/4

◮ Hence, Cov(X, Y ) = E[XY ] − EX EY = 1/4 − 2/9 = 1/36
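A Monte Carlo check of these numbers (a sketch, using the fact that the (min, max) of two independent U(0, 1) variables has exactly this joint density, 2 on 0 < x < y < 1):

```python
import numpy as np

rng = np.random.default_rng(0)
# (min, max) of two independent U(0,1) samples has joint density 2 on 0 < x < y < 1
u = rng.random((1_000_000, 2))
x, y = u.min(axis=1), u.max(axis=1)

assert abs(x.mean() - 1 / 3) < 0.005        # EX = 1/3
assert abs(y.mean() - 2 / 3) < 0.005        # EY = 2/3
assert abs((x * y).mean() - 1 / 4) < 0.005  # E[XY] = 1/4
cov = (x * y).mean() - x.mean() * y.mean()
assert abs(cov - 1 / 36) < 0.005            # Cov(X,Y) = 1/36
```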
Independent random variables are uncorrelated

◮ Suppose X, Y are independent. Then

E[XY ] = ∫∫ x y fXY (x, y) dx dy
       = ∫∫ x y fX (x) fY (y) dx dy
       = ∫ x fX (x) dx ∫ y fY (y) dy = EX EY

◮ Then, Cov(X, Y ) = E[XY ] − EX EY = 0.
◮ X, Y independent ⇒ X, Y uncorrelated


Uncorrelated random variables may not be independent

◮ Suppose X ∼ N (0, 1). Then EX = EX³ = 0.
◮ Let Y = X². Then

E[XY ] = EX³ = 0 = EX EY

◮ Thus X, Y are uncorrelated.
◮ Are they independent? No. For example,

P [X > 2 | Y < 1] = 0 ≠ P [X > 2]

◮ X, Y uncorrelated does not imply they are independent.
◮ We define the correlation coefficient of X, Y by

ρXY = Cov(X, Y ) / √(Var(X) Var(Y ))

◮ If X, Y are uncorrelated then ρXY = 0.
◮ We will show that |ρXY | ≤ 1.
◮ Hence −1 ≤ ρXY ≤ 1, ∀X, Y


◮ We have E[(αX + βY )²] ≥ 0, ∀α, β ∈ ℜ:

α²E[X²] + β²E[Y²] + 2αβE[XY ] ≥ 0, ∀α, β ∈ ℜ

Take α = −E[XY ]/E[X²]:

(E[XY ])²/E[X²] + β²E[Y²] − 2β(E[XY ])²/E[X²] ≥ 0, ∀β ∈ ℜ

aβ² + bβ + c ≥ 0, ∀β ⇒ b² − 4ac ≤ 0

⇒ 4(E[XY ])⁴/(E[X²])² − 4E[Y²](E[XY ])²/E[X²] ≤ 0
⇒ (E[XY ])⁴/(E[X²])² ≤ E[Y²](E[XY ])²/E[X²]
⇒ (E[XY ])² ≤ E[X²]E[Y²]


◮ We showed that

(E[XY ])² ≤ E[X²]E[Y²]

◮ Take X − EX in place of X and Y − EY in place of Y in the above algebra.
◮ This gives us

(E[(X − EX)(Y − EY )])² ≤ E[(X − EX)²] E[(Y − EY )²]
⇒ (Cov(X, Y ))² ≤ Var(X) Var(Y )

◮ Hence we get

ρ²XY = (Cov(X, Y ) / √(Var(X) Var(Y )))² ≤ 1

◮ The equality holds here only if E[(αX + βY )²] = 0 for some α, β.
Thus, |ρXY | = 1 only if αX + βY = 0.
◮ The correlation coefficient of X, Y is ±1 only when Y is a linear function of X.
Linear Least Squares Estimation

◮ Suppose we want to approximate Y as an affine function


of X.
◮ We want a, b to minimize E [(Y − (aX + b))2 ]
◮ For a fixed a, what is the b that minimizes
E [((Y − aX) − b)2 ] ?
◮ We know the best b here is:
b = E[Y − aX] = EY − aEX.
◮ So, we want to find the best a to minimize
J(a) = E [(Y − aX − (EY − aEX))2 ]



◮ We want to find a to minimize

J(a) = E[(Y − aX − (EY − aEX))²]
     = E[((Y − EY ) − a(X − EX))²]
     = E[(Y − EY )² + a²(X − EX)² − 2a(Y − EY )(X − EX)]
     = Var(Y ) + a²Var(X) − 2aCov(X, Y )

◮ So, the optimal a satisfies

2aVar(X) − 2Cov(X, Y ) = 0 ⇒ a = Cov(X, Y )/Var(X)


◮ The final mean square error, say J∗, is

J∗ = Var(Y ) + a²Var(X) − 2aCov(X, Y )
   = Var(Y ) + (Cov(X, Y )/Var(X))² Var(X) − 2 (Cov(X, Y )/Var(X)) Cov(X, Y )
   = Var(Y ) − (Cov(X, Y ))²/Var(X)
   = Var(Y ) (1 − (Cov(X, Y ))²/(Var(Y ) Var(X)))
   = Var(Y ) (1 − ρ²XY )


◮ The best mean-square approximation of Y as a ‘linear’ function of X is

Y = (Cov(X, Y )/Var(X)) X + (EY − (Cov(X, Y )/Var(X)) EX)

◮ Called the line of regression of Y on X.
◮ If Cov(X, Y ) = 0 then this reduces to approximating Y by a constant, EY .
◮ The final mean square error is Var(Y )(1 − ρ²XY ).
◮ If ρXY = ±1 then the error is zero.
◮ If ρXY = 0 the final error is Var(Y ).
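A quick numerical illustration (a sketch with synthetic data): the slope Cov(X, Y)/Var(X) and intercept EY − a EX computed from sample moments coincide with the least-squares line fitted by np.polyfit:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=5000)
y = 2.0 * x + rng.normal(size=5000)   # Y linear in X plus noise

cov = np.mean((x - x.mean()) * (y - y.mean()))
a = cov / x.var()                     # slope: Cov(X,Y)/Var(X)
b = y.mean() - a * x.mean()           # intercept: EY - a EX

# np.polyfit with degree 1 minimizes the same sum of squared residuals
a_ls, b_ls = np.polyfit(x, y, 1)
assert abs(a - a_ls) < 1e-6 and abs(b - b_ls) < 1e-6
```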


◮ The covariance of X, Y is

Cov(X, Y ) = E[(X−EX) (Y −EY )] = E[XY ]−EX EY

Note that Cov(X, X) = Var(X)


◮ X, Y are called uncorrelated if Cov(X, Y ) = 0.
◮ X, Y independent ⇒ X, Y uncorrelated.
◮ Uncorrelated random variables need not necessarily be
independent
◮ Covariance plays an important role in linear least squares
estimation.
◮ Informally, covariance captures the ‘linear dependence’
between the two random variables.



Covariance Matrix
◮ Let X1 , · · · , Xn be random variables (on the same
probability space)
◮ We represent them as a vector X.
◮ As a notation, all vectors are column vectors:
X = (X1 , · · · , Xn )T
◮ We denote E[X] = (EX1 , · · · , EXn )T
◮ The n × n matrix whose (i, j)th element is Cov(Xi , Xj ) is
called the covariance matrix (or variance-covariance
matrix) of X. Denoted as ΣX or ΣX
 
Cov(X1 , X1 ) Cov(X1 , X2 ) · · · Cov(X1 , Xn )
 Cov(X2 , X1 ) Cov(X2 , X2 ) · · · Cov(X2 , Xn ) 
ΣX =  .. .. .. ..
 

 . . . . 
Cov(Xn , X1 ) Cov(Xn , X2 ) · · · Cov(Xn , Xn )



Covariance matrix

◮ If a = (a1 , · · · , an )T then a aT is an n × n matrix whose (i, j)th element is ai aj .
◮ Hence we get

ΣX = E[(X − EX)(X − EX)T ]

◮ This is because ((X − EX)(X − EX)T )ij = (Xi − EXi )(Xj − EXj ) and (ΣX )ij = E[(Xi − EXi )(Xj − EXj )].


◮ Recall the following about vectors and matrices.
◮ Let a, b ∈ ℜⁿ be column vectors. Then

(aT b)² = (aT b)(aT b) = (bT a)(aT b) = bT (a aT ) b

◮ Let A be an n × n matrix with elements aij . Then

bT A b = Σ_{i,j=1}^n bi bj aij

where b = (b1 , · · · , bn )T .
◮ A is said to be positive semidefinite if bT A b ≥ 0, ∀b.


◮ ΣX is a real symmetric matrix.
◮ It is positive semidefinite.
◮ Let a ∈ ℜⁿ and let Y = aT X.
◮ Then, EY = aT EX. We get the variance of Y as

Var(Y ) = E[(Y − EY )²] = E[(aT X − aT EX)²]
        = E[(aT (X − EX))²]
        = E[aT (X − EX)(X − EX)T a]
        = aT E[(X − EX)(X − EX)T ] a
        = aT ΣX a

◮ This gives aT ΣX a ≥ 0, ∀a.
◮ This shows ΣX is positive semidefinite.
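The identity Var(aᵀX) = aᵀΣX a holds exactly for sample moments as well, which gives an easy numerical check (a sketch; the mixing matrix is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
# Arbitrary mixing matrix to make the coordinates correlated
A = np.array([[1.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 1.0]])
X = rng.normal(size=(10_000, 3)) @ A

Sigma = np.cov(X, rowvar=False, bias=True)  # sample covariance matrix (ddof=0)
a = np.array([1.0, -2.0, 0.5])
Y = X @ a                                   # the linear combination a^T X per sample

# Var(a^T X) = a^T Sigma a (an exact algebraic identity for sample moments)
assert abs(Y.var() - a @ Sigma @ a) < 1e-8
```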


◮ Y = aT X = Σ_i ai Xi – a linear combination of the Xi ’s.
◮ We know how to find its mean and variance:

EY = aT EX = Σ_i ai EXi
Var(Y ) = aT ΣX a = Σ_{i,j} ai aj Cov(Xi , Xj )

◮ Specifically, by taking all components of a to be 1, we get

Var(Σ_{i=1}^n Xi ) = Σ_{i,j=1}^n Cov(Xi , Xj ) = Σ_{i=1}^n Var(Xi ) + Σ_{i=1}^n Σ_{j≠i} Cov(Xi , Xj )

◮ If the Xi are independent, the variance of the sum is the sum of the variances.


◮ Covariance matrix ΣX positive semidefinite because

aT ΣX a = Var(aT X) ≥ 0

◮ ΣX would be positive definite if aT ΣX a > 0, ∀a ≠ 0


◮ It would fail to be positive definite if Var(aT X) = 0 for
some nonzero a.
◮ Var(Z) = E[(Z − EZ)2 ] = 0 implies Z = EZ, a
constant.
◮ Hence, ΣX fails to be positive definite only if there is a
non-zero linear combination of Xi ’s that is a constant.



◮ The covariance matrix is a real symmetric positive semidefinite matrix.
◮ It has real, non-negative eigenvalues.
◮ It would have n linearly independent eigenvectors.
◮ These also have some interesting roles.
◮ We consider one simple example.


◮ Let Y = aT X and assume ||a|| = 1.
◮ Y is the projection of X along the direction a.
◮ Suppose we want to find a direction along which the variance is maximized.
◮ We want to maximize aT ΣX a subject to aT a = 1.
◮ The Lagrangian is aT ΣX a + η(1 − aT a).
◮ Equating the gradient to zero, we get

ΣX a = ηa

◮ So, a should be an eigenvector (with eigenvalue η).
◮ Then the variance would be aT ΣX a = ηaT a = η.
◮ Hence the direction is the eigenvector corresponding to the largest eigenvalue.
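This can be illustrated numerically (a sketch): the variance of the projection aᵀX is the quadratic form aᵀΣa, it equals the largest eigenvalue when a is the corresponding eigenvector, and no other unit direction does better:

```python
import numpy as np

rng = np.random.default_rng(3)
# Correlated 2-d data with most of its spread along one direction
X = rng.normal(size=(20_000, 2)) @ np.array([[2.0, 1.8], [1.8, 2.0]])
Sigma = np.cov(X, rowvar=False, bias=True)

w, V = np.linalg.eigh(Sigma)   # eigenvalues in ascending order
top = V[:, -1]                 # unit eigenvector for the largest eigenvalue

var_top = top @ Sigma @ top    # variance of the projection along `top`
assert abs(var_top - w[-1]) < 1e-8

# No other unit direction gives a larger projection variance
for theta in np.linspace(0.0, np.pi, 181):
    aa = np.array([np.cos(theta), np.sin(theta)])
    assert aa @ Sigma @ aa <= var_top + 1e-8
```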


Joint moments

◮ Given two random variables, X, Y


◮ The joint moment of order (i, j) is defined by

mij = E[X i Y j ]

m10 = EX, m01 = EY , m11 = E[XY ] and so on


◮ Similarly joint central moments of order (i, j) are defined
by
sij = E (X − EX)i (Y − EY )j
 

s10 = s01 = 0, s11 = Cov(X, Y ), s20 = Var(X) and so on


◮ We can similarly define joint moments of multiple random
variables



◮ We can define the moment generating function of X, Y by

MXY (s, t) = E[e^{sX+tY} ], s, t ∈ ℜ

◮ This is easily generalized to n random variables:

MX (s) = E[e^{sᵀX} ], s ∈ ℜⁿ

◮ Once again, we can get all the moments by differentiating the moment generating function:

(∂/∂si ) MX (s) |_{s=0} = EXi

◮ More generally,

(∂^{m+n}/∂si^n ∂sj^m ) MX (s) |_{s=0} = E[Xi^n Xj^m ]


Conditional Expectation

◮ Suppose X, Y have a joint density fXY .
◮ Consider the conditional density fX|Y (x|y). This is a density in x for every value of y.
◮ Since it is a density, we can use it in an expectation integral: ∫ g(x) fX|Y (x|y) dx.
◮ This is like an expectation of g(X) since fX|Y (x|y) is a density in x.
◮ However, its value would be a function of y.
◮ That is, this is a kind of expectation that is a function of Y (and hence is a random variable).
◮ It is called conditional expectation.


◮ Let X, Y be discrete random variables (on the same probability space).
◮ The conditional expectation of h(X) conditioned on Y is a function of Y , and is defined by E[h(X)|Y ] = g(Y ) where

g(y) = E[h(X)|Y = y] = Σ_x h(x) fX|Y (x|y)

◮ Thus

E[h(X)|Y = y] = Σ_x h(x) fX|Y (x|y) = Σ_x h(x) P [X = x|Y = y]

◮ Note that E[h(X)|Y ] is a random variable.
◮ Let X, Y have joint density fXY .
◮ The conditional expectation of h(X) conditioned on Y is a function of Y , and its value for any y is defined by

E[h(X)|Y = y] = ∫_{−∞}^{∞} h(x) fX|Y (x|y) dx

◮ Once again, what this means is that E[h(X)|Y ] = g(Y ) where

g(y) = ∫_{−∞}^{∞} h(x) fX|Y (x|y) dx


A simple example

◮ Consider the joint density

fXY (x, y) = 2, 0 < x < y < 1

◮ We calculated the conditional densities earlier:

fX|Y (x|y) = 1/y, fY |X (y|x) = 1/(1 − x), 0 < x < y < 1

◮ Now we can calculate the conditional expectation:

E[X|Y = y] = ∫_{−∞}^{∞} x fX|Y (x|y) dx = ∫_0^y x (1/y) dx = (1/y)(y²/2) = y/2

◮ This gives: E[X|Y ] = Y /2.
◮ We can show E[Y |X] = (1 + X)/2.
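A Monte Carlo check of E[X|Y ] = Y/2 (a sketch; again sampling the density 2 on 0 < x < y < 1 as the (min, max) of two uniforms, and approximating the conditioning by binning Y):

```python
import numpy as np

rng = np.random.default_rng(4)
u = rng.random((2_000_000, 2))
x, y = u.min(axis=1), u.max(axis=1)   # joint density 2 on 0 < x < y < 1

# Conditioning on Y in a thin bin around y0: the mean of X should be about y0/2
for y0 in [0.3, 0.6, 0.9]:
    sel = np.abs(y - y0) < 0.01
    assert abs(x[sel].mean() - y0 / 2) < 0.02
```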
◮ The conditional expectation is defined by

E[h(X)|Y = y] = Σ_x h(x) fX|Y (x|y), if X, Y are discrete
E[h(X)|Y = y] = ∫_{−∞}^{∞} h(x) fX|Y (x|y) dx, if X, Y have a joint density

◮ We can actually define E[h(X, Y )|Y ] also as above. That is,

E[h(X, Y )|Y = y] = ∫_{−∞}^{∞} h(x, y) fX|Y (x|y) dx

◮ It has all the properties of expectation:
1. E[a|Y ] = a where a is a constant
2. E[ah1 (X) + bh2 (X)|Y ] = aE[h1 (X)|Y ] + bE[h2 (X)|Y ]
3. h1 (X) ≥ h2 (X) ⇒ E[h1 (X)|Y ] ≥ E[h2 (X)|Y ]


◮ Conditional expectation also has some extra properties
which are very important
◮ E [E[h(X)|Y ]] = E[h(X)]
◮ E[h1 (X)h2 (Y )|Y ] = h2 (Y )E[h1 (X)|Y ]
◮ E[h(X, Y )|Y = y] = E[h(X, y)|Y = y]
◮ We will justify each of these.
◮ The last property above follows directly from the
definition.



◮ The expectation of a conditional expectation is the unconditional expectation:

E[ E[h(X)|Y ] ] = E[h(X)]

In the above, the LHS is the expectation of a function of Y .
◮ Let us denote g(Y ) = E[h(X)|Y ]. Then

E[ E[h(X)|Y ] ] = E[g(Y )]
= ∫_{−∞}^{∞} g(y) fY (y) dy
= ∫_{−∞}^{∞} (∫_{−∞}^{∞} h(x) fX|Y (x|y) dx) fY (y) dy
= ∫_{−∞}^{∞} ∫_{−∞}^{∞} h(x) fXY (x, y) dy dx
= ∫_{−∞}^{∞} h(x) fX (x) dx
= E[h(X)]
◮ Any factor that depends only on the conditioning variable behaves like a constant inside a conditional expectation:

E[h1 (X) h2 (Y )|Y ] = h2 (Y )E[h1 (X)|Y ]

◮ Let us denote g(Y ) = E[h1 (X) h2 (Y )|Y ]. Then

g(y) = E[h1 (X) h2 (Y )|Y = y]
     = ∫_{−∞}^{∞} h1 (x)h2 (y) fX|Y (x|y) dx
     = h2 (y) ∫_{−∞}^{∞} h1 (x) fX|Y (x|y) dx
     = h2 (y) E[h1 (X)|Y = y]
⇒ E[h1 (X) h2 (Y )|Y ] = g(Y ) = h2 (Y )E[h1 (X)|Y ]


◮ A very useful property of conditional expectation is E[ E[X|Y ] ] = E[X] (assuming all expectations exist).
◮ We can see this in our earlier example:

fXY (x, y) = 2, 0 < x < y < 1

◮ We easily get: EX = 1/3 and EY = 2/3.
◮ We also showed E[X|Y ] = Y /2. So

E[ E[X|Y ] ] = E[Y /2] = (2/3)/2 = 1/3 = E[X]

◮ Similarly,

E[ E[Y |X] ] = E[(1 + X)/2] = (1 + 1/3)/2 = 2/3 = E[Y ]


Example

◮ Let X, Y be random variables with joint density given by

fXY (x, y) = e^{−y} , 0 < x < y < ∞

◮ The marginal densities are:

fX (x) = ∫_{−∞}^{∞} fXY (x, y) dy = ∫_x^∞ e^{−y} dy = e^{−x} , x > 0
fY (y) = ∫_{−∞}^{∞} fXY (x, y) dx = ∫_0^y e^{−y} dx = y e^{−y} , y > 0

Thus, X is exponential and Y is gamma.
◮ Hence we have

EX = 1; Var(X) = 1; EY = 2; Var(Y ) = 2


fXY (x, y) = e^{−y} , 0 < x < y < ∞

◮ Let us calculate the covariance of X and Y :

E[XY ] = ∫∫ xy fXY (x, y) dx dy = ∫_0^∞ ∫_0^y xy e^{−y} dx dy = ∫_0^∞ (1/2) y³ e^{−y} dy = 3

◮ Hence, Cov(X, Y ) = E[XY ] − EX EY = 3 − 2 = 1.
◮ ρXY = 1/√2


◮ Recall the joint and marginal densities:

fXY (x, y) = e^{−y} , 0 < x < y < ∞
fX (x) = e^{−x} , x > 0; fY (y) = ye^{−y} , y > 0

◮ The conditional densities will be

fX|Y (x|y) = fXY (x, y)/fY (y) = e^{−y}/(ye^{−y}) = 1/y, 0 < x < y < ∞
fY |X (y|x) = fXY (x, y)/fX (x) = e^{−y}/e^{−x} = e^{−(y−x)} , 0 < x < y < ∞


◮ The conditional densities are

fX|Y (x|y) = 1/y; fY |X (y|x) = e^{−(y−x)} , 0 < x < y < ∞

◮ We can now calculate the conditional expectations:

E[X|Y = y] = ∫_0^y x fX|Y (x|y) dx = ∫_0^y x (1/y) dx = y/2

Thus E[X|Y ] = Y /2.

E[Y |X = x] = ∫ y fY |X (y|x) dy = ∫_x^∞ y e^{−(y−x)} dy
            = e^x ([−ye^{−y} ]_x^∞ + ∫_x^∞ e^{−y} dy)
            = e^x (xe^{−x} + e^{−x} ) = 1 + x

Thus, E[Y |X] = 1 + X.
◮ We got

E[X|Y ] = Y /2; E[Y |X] = 1 + X

◮ Using this we can verify:

E[ E[X|Y ] ] = E[Y /2] = EY /2 = 2/2 = 1 = EX
E[ E[Y |X] ] = E[1 + X] = 1 + 1 = 2 = EY


◮ A property of conditional expectation is

E[ E[X|Y ] ] = E[X]

◮ We assume that all three expectations exist.
◮ It is very useful in calculating expectations:

EX = E[ E[X|Y ] ] = Σ_y E[X|Y = y] fY (y)  or  ∫ E[X|Y = y] fY (y) dy

◮ It can be used to calculate probabilities of events too:

P (A) = E[IA ] = E[ E[IA |Y ] ]


◮ Let X be geometric and we want EX.
◮ X is number of tosses needed to get head
◮ Let Y ∈ {0, 1} be outcome of first toss. (1 for head)

E[X] = E[ E[X|Y ] ]
= E[X|Y = 1] P [Y = 1] + E[X|Y = 0] P [Y = 0]
= E[X|Y = 1] p + E[X|Y = 0] (1 − p)
= 1 p + (1 + EX)(1 − p)
⇒ EX (1 − (1 − p)) = p + (1 − p)
⇒ EX p = 1
1
⇒ EX =
p

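The answer EX = 1/p is easy to confirm by simulation (a sketch; NumPy's geometric sampler counts trials up to and including the first success, matching X here):

```python
import numpy as np

rng = np.random.default_rng(5)
p = 0.3
# Generator.geometric counts trials up to and including the first success
samples = rng.geometric(p, size=1_000_000)
assert abs(samples.mean() - 1 / p) < 0.01   # EX = 1/p
```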


◮ P [X = k|Y = 1] = 1 if k = 1 (otherwise it is zero) and hence E[X|Y = 1] = 1.
◮ Also,

P [X = k|Y = 0] = 0 if k = 1; = (1 − p)^{k−1} p/(1 − p) = (1 − p)^{k−2} p if k ≥ 2

Hence

E[X|Y = 0] = Σ_{k=2}^∞ k (1 − p)^{k−2} p
           = Σ_{k=2}^∞ (k − 1)(1 − p)^{k−2} p + Σ_{k=2}^∞ (1 − p)^{k−2} p
           = Σ_{k′=1}^∞ k′ (1 − p)^{k′−1} p + Σ_{k′=1}^∞ (1 − p)^{k′−1} p
           = EX + 1


Another example

◮ Example: multiple rounds of the party game


◮ Let Rn denote number of rounds when you start with n
people.
◮ We want R̄n = E [Rn ].
◮ We want to use E [Rn ] = E[ E [Rn |Xn ] ]
◮ We need to think of a useful Xn .
◮ Let Xn be the number of people who got their own hat in
the first round with n people.



◮ Rn – number of rounds when you start with n people.
◮ Xn – number of people who got their own hat in the first round.

E[Rn ] = E[ E[Rn |Xn ] ]
       = Σ_{i=0}^n E[Rn |Xn = i] P [Xn = i]
       = Σ_{i=0}^n (1 + E[Rn−i ]) P [Xn = i]
       = Σ_{i=0}^n P [Xn = i] + Σ_{i=0}^n E[Rn−i ] P [Xn = i]

If we can guess the value of E[Rn ] then we can prove it using mathematical induction.


◮ What would be E[Xn ]?
◮ Let Yi ∈ {0, 1} denote whether or not the ith person got his own hat.
◮ We know

E[Yi ] = P [Yi = 1] = (n − 1)!/n! = 1/n

Now, Xn = Σ_{i=1}^n Yi and hence EXn = Σ_{i=1}^n E[Yi ] = 1.
◮ Hence a good guess is E[Rn ] = n.
◮ We verify it using mathematical induction. We know E[R1 ] = 1.


◮ Assume: E[Rk ] = k, 1 ≤ k ≤ n − 1.

E[Rn ] = Σ_{i=0}^n P [Xn = i] + Σ_{i=0}^n E[Rn−i ] P [Xn = i]
       = 1 + E[Rn ] P [Xn = 0] + Σ_{i=1}^n E[Rn−i ] P [Xn = i]
       = 1 + E[Rn ] P [Xn = 0] + Σ_{i=1}^n (n − i) P [Xn = i]

E[Rn ] (1 − P [Xn = 0]) = 1 + n(1 − P [Xn = 0]) − Σ_{i=1}^n i P [Xn = i]
                        = 1 + n(1 − P [Xn = 0]) − E[Xn ]
                        = 1 + n(1 − P [Xn = 0]) − 1
⇒ E[Rn ] = n


Analysis of Quicksort

◮ Given n numbers we want to sort them. There are many algorithms.
◮ Complexity – order of the number of comparisons needed.
◮ Quicksort: Choose a pivot. Separate the numbers into two parts – those less than and those greater than the pivot – and proceed recursively.
◮ Separating into two parts takes n − 1 comparisons.
◮ Suppose the two parts contain m and n − m − 1 numbers. The number of comparisons needed to separate each of them into two parts depends on m.
◮ So, the final number of comparisons depends on the ‘number of rounds’.


quicksort details

◮ Given {x1 , · · · , xn }.
◮ Choose the first as pivot:

{xj1 , xj2 , · · · , xjm } x1 {xk1 , xk2 , · · · , xk_{n−1−m} }

◮ Suppose rn is the number of comparisons. If we get (roughly) equal parts, then

rn ≈ n + 2r_{n/2} = n + 2(n/2 + 2r_{n/4} ) = n + n + 4r_{n/4} = · · · = n log2 (n)

◮ If all the rest go into one part, then

rn = n + r_{n−1} = n + (n − 1) + r_{n−2} = · · · = n(n + 1)/2

◮ If you are lucky, O(n log(n)) comparisons.
◮ If unlucky, in the worst case, O(n²) comparisons.
◮ Question: ‘on the average’ how many comparisons?
Average case complexity of quicksort

◮ Assume the pivot is equally likely to be the smallest, second smallest, or, in general, the mth smallest.
◮ Mn – number of comparisons.
◮ Define: X = j if the pivot is the j th smallest.
◮ Given X = j we know Mn = (n − 1) + Mj−1 + Mn−j .

E[Mn ] = E[ E[Mn |X] ] = Σ_{j=1}^n E[Mn |X = j] P [X = j]
       = Σ_{j=1}^n (1/n) E[(n − 1) + Mj−1 + Mn−j ]
       = (n − 1) + (2/n) Σ_{k=1}^{n−1} E[Mk ], (taking M0 = 0)

◮ This is a recurrence relation. (A little complicated to solve.)
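The recurrence is easy to iterate numerically. Its known solution (stated here as an outside fact, not derived on the slide) is E[M_n] = 2(n + 1)H_n − 4n, where H_n is the nth harmonic number, so the average case is O(n log n):

```python
# Recurrence from the slide: E[M_n] = (n-1) + (2/n) * sum_{k=1}^{n-1} E[M_k], M_0 = 0
M = [0.0]
for n in range(1, 51):
    M.append((n - 1) + 2.0 * sum(M) / n)

# Known closed form of this recurrence: 2(n+1)H_n - 4n
def closed(n):
    H = sum(1.0 / k for k in range(1, n + 1))
    return 2 * (n + 1) * H - 4 * n

for n in [1, 2, 10, 50]:
    assert abs(M[n] - closed(n)) < 1e-9
```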
Least squares estimation

◮ We want to estimate Y as a function of X.
◮ We want an estimate with minimum mean square error.
◮ We want to solve (the min is over all functions g)

min_g E[(Y − g(X))²]

◮ Earlier we considered only linear functions: g(X) = aX + b.
◮ Now we want the ‘best’ function (linear or nonlinear).
◮ The solution now turns out to be

g∗ (X) = E[Y |X]

◮ Let us prove this.


◮ We want to show that for all g

E[(E[Y | X] − Y )²] ≤ E[(g(X) − Y )²]

◮ We have

(g(X) − Y )² = ((g(X) − E[Y | X]) + (E[Y | X] − Y ))²
             = (g(X) − E[Y | X])² + (E[Y | X] − Y )²
               + 2 (g(X) − E[Y | X])(E[Y | X] − Y )

◮ Now we can take expectation on both sides.
◮ We first show that the expectation of the last term on the RHS above is zero.


First consider the last term:

E[(g(X) − E[Y | X])(E[Y | X] − Y )]
= E[ E[(g(X) − E[Y | X])(E[Y | X] − Y ) | X] ]  (because E[Z] = E[ E[Z|X] ])
= E[(g(X) − E[Y | X]) E[(E[Y | X] − Y ) | X]]  (because E[h1 (X)h2 (Z)|X] = h1 (X) E[h2 (Z)|X])
= E[(g(X) − E[Y | X]) (E[Y | X] − E[Y | X])]
= 0


◮ We earlier got

(g(X) − Y )² = (g(X) − E[Y | X])² + (E[Y | X] − Y )²
               + 2 (g(X) − E[Y | X])(E[Y | X] − Y )

◮ Hence we get

E[(g(X) − Y )²] = E[(g(X) − E[Y | X])²] + E[(E[Y | X] − Y )²]
                ≥ E[(E[Y | X] − Y )²]

◮ Since the above is true for all functions g, we get

g∗ (X) = E[Y | X]


Sum of random number of random variables

◮ Let X1 , X2 , · · · be iid rv’s on the same probability space. Suppose EXi = µ < ∞, ∀i.
◮ Let N be a positive integer valued rv that is independent of all Xi (EN < ∞).
◮ Let S = Σ_{i=1}^N Xi .
◮ We want to calculate ES.
◮ We can use

E[S] = E[ E[S|N ] ]


◮ We have

E[S|N = n] = E[Σ_{i=1}^N Xi | N = n]
           = E[Σ_{i=1}^n Xi | N = n]  (since E[h(X, Y )|Y = y] = E[h(X, y)|Y = y])
           = Σ_{i=1}^n E[Xi | N = n] = Σ_{i=1}^n E[Xi ] = nµ

◮ Hence we get

E[S|N ] = N µ ⇒ E[S] = E[N ]E[X1 ]


Wald’s formula

◮ We took S = Σ_{i=1}^N Xi with N independent of all Xi .
◮ With iid Xi , the formula ES = EN EX1 is valid even under some dependence between N and the Xi .
◮ Here is one version of the assumptions needed:
A1 E[|X1 |] < ∞ and EN < ∞ (Xi iid).
A2 E[Xn I_{[N ≥n]} ] = E[Xn ]P [N ≥ n], ∀n.
◮ Let SN = Σ_{i=1}^N Xi . Then, ESN = EX1 EN .
◮ Suppose the event [N ≤ n − 1] depends only on X1 , · · · , Xn−1 .
◮ Such an N is called a stopping time.
◮ Then the event [N ≤ n − 1], and hence its complement [N ≥ n], is independent of Xn and hence A2 holds.


Wald’s formula

◮ In the general case, we do not need the Xi to be iid.
◮ Here is one version of Wald’s formula. We assume
1. E[|Xi |] < ∞, ∀i and EN < ∞.
2. E[Xn I_{[N ≥n]} ] = E[Xn ]P [N ≥ n], ∀n.
◮ Let SN = Σ_{i=1}^N Xi and let TN = Σ_{i=1}^N E[Xi ].
◮ Then, ESN = ETN .
◮ If E[Xi ] is the same for all i, ESN = EX1 EN .


Variance of random sum

◮ S = Σ_{i=1}^N Xi , Xi iid, independent of N . We want Var(S).

E[S²] = E[(Σ_{i=1}^N Xi )²] = E[ E[(Σ_{i=1}^N Xi )² | N ] ]

◮ As earlier, we have

E[(Σ_{i=1}^N Xi )² | N = n] = E[(Σ_{i=1}^n Xi )² | N = n] = E[(Σ_{i=1}^n Xi )²]


◮ Let Y = Σ_{i=1}^n Xi , with Xi iid.
◮ Then, Var(Y ) = n Var(X1 ).
◮ Hence we have

E[Y²] = Var(Y ) + (EY )² = n Var(X1 ) + (nEX1 )²

◮ Using this,

E[(Σ_{i=1}^N Xi )² | N = n] = E[(Σ_{i=1}^n Xi )²] = n Var(X1 ) + (nEX1 )²

◮ Hence

E[(Σ_{i=1}^N Xi )² | N ] = N Var(X1 ) + N²(EX1 )²


◮ S = Σ_{i=1}^N Xi (Xi iid). We got

E[S²] = E[ E[S²|N ] ] = EN Var(X1 ) + E[N²](EX1 )²

◮ Now we can calculate the variance of S as

Var(S) = E[S²] − (ES)²
       = EN Var(X1 ) + E[N²](EX1 )² − (EN EX1 )²
       = EN Var(X1 ) + (EX1 )² (E[N²] − (EN )²)
       = EN Var(X1 ) + Var(N ) (EX1 )²
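Both moment formulas can be checked by simulation (a sketch): take N ∼ Poisson(λ) and Xi ∼ Exp(1), so ES = λ · 1 = λ and Var(S) = λ · 1 + λ · 1² = 2λ. Given N, the sum of N iid Exp(1) variables has a Gamma(N, 1) law:

```python
import numpy as np

rng = np.random.default_rng(6)
lam, trials = 4.0, 200_000
N = rng.poisson(lam, size=trials)
# Sum of N iid Exp(1) variables has a Gamma(N, 1) law (0 when N = 0)
S = rng.standard_gamma(N)

assert abs(S.mean() - lam) < 0.05     # E[S] = EN * EX1 = lam
assert abs(S.var() - 2 * lam) < 0.2   # Var(S) = lam*1 + lam*1^2 = 2*lam
```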


Another Example
◮ We toss a (biased) coin till we get k consecutive heads.
Let Nk denote the number of tosses needed.
◮ N1 would be geometric.
◮ We want E[Nk ]. What rv should we condition on?
◮ Useful rv here is Nk−1

E[Nk | Nk−1 = n] = (n + 1)p + (1 − p)(n + 1 + E[Nk ])

◮ Thus we get the recurrence relation

E[Nk ] = E[ E[Nk | Nk−1 ] ]


= E [(Nk−1 + 1)p + (1 − p)(Nk−1 + 1 + E[Nk ])]



◮ We have

E[Nk ] = E[(Nk−1 + 1)p + (1 − p)(Nk−1 + 1 + E[Nk ])]

◮ Denoting Mk = E[Nk ], we get

Mk = pMk−1 + p + (1 − p)Mk−1 + (1 − p) + (1 − p)Mk
pMk = Mk−1 + 1
Mk = (1/p)Mk−1 + 1/p
   = (1/p)((1/p)Mk−2 + 1/p) + 1/p = (1/p)² Mk−2 + (1/p)² + 1/p
   = (1/p)^{k−1} M1 + Σ_{j=1}^{k−1} (1/p)^j
   = (1 − p^k )/((1 − p)p^k ), taking M1 = 1/p
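A simulation check of E[Nk ] = (1 − p^k)/((1 − p)p^k) (a sketch; for p = 0.5 and k = 3 the formula gives 14):

```python
import numpy as np

rng = np.random.default_rng(7)

# Toss a coin with head probability p until k consecutive heads appear;
# return the number of tosses used.
def tosses_until_k_heads(k, p):
    run = n = 0
    while run < k:
        n += 1
        run = run + 1 if rng.random() < p else 0
    return n

p, k = 0.5, 3
sim = np.mean([tosses_until_k_heads(k, p) for _ in range(20_000)])
exact = (1 - p**k) / ((1 - p) * p**k)   # = 14 for p = 0.5, k = 3
assert abs(sim - exact) < 0.5
```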


◮ As mentioned earlier, we can use the conditional
expectation to calculate probabilities of events also.

P (A) = E[IA ] = E [ E [IA |Y ] ]

E[IA |Y = y] = P [IA = 1|Y = y] = P (A|Y = y)

◮ Thus, we get

P (A) = E[IA ] = E [ E [IA |Y ] ]


X
= P (A|Y = y)P [Y = y], when Y is discrete
y
Z
= P (A|Y = y) fY (y) dy, when Y is continuous



Example

◮ Let X, Y be independent continuous rv’s.
◮ We want to calculate P [X ≤ Y ].
◮ We can calculate it by integrating the joint density over A = {(x, y) : x ≤ y}:

P [X ≤ Y ] = ∫∫_A fX (x) fY (y) dx dy
           = ∫_{−∞}^{∞} fY (y) (∫_{−∞}^y fX (x) dx) dy
           = ∫_{−∞}^{∞} FX (y) fY (y) dy

◮ If X, Y are iid then P [X < Y ] = 0.5.


◮ We can also use the conditional expectation method here:

P [X ≤ Y ] = ∫_{−∞}^{∞} P [X ≤ Y | Y = y] fY (y) dy
           = ∫_{−∞}^{∞} P [X ≤ y | Y = y] fY (y) dy
           = ∫_{−∞}^{∞} P [X ≤ y] fY (y) dy
           = ∫_{−∞}^{∞} FX (y) fY (y) dy


Another Example

◮ Consider a sequence of Bernoulli trials where p, the probability of success, is random.
◮ We first choose p uniformly over (0, 1) and then perform n tosses.
◮ Let X be the number of heads.
◮ Conditioned on knowledge of p, we know the distribution of X:

P [X = k | p] = nCk p^k (1 − p)^{n−k}

◮ Now we can calculate P [X = k] using the conditioning argument.


◮ Assuming p is chosen uniformly from (0, 1), we get

P [X = k] = ∫ P [X = k | p] f (p) dp
          = ∫_0^1 nCk p^k (1 − p)^{n−k} · 1 dp
          = nCk k!(n − k)!/(n + 1)!
            (because ∫_0^1 p^k (1 − p)^{n−k} dp = Γ(k + 1)Γ(n − k + 1)/Γ(n + 2))
          = 1/(n + 1)

◮ So, we get: P [X = k] = 1/(n + 1), k = 0, 1, · · · , n.
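The perhaps surprising uniform answer P[X = k] = 1/(n + 1) is easy to verify by simulation (a sketch):

```python
import numpy as np

rng = np.random.default_rng(8)
n, trials = 5, 600_000
p = rng.random(trials)    # p ~ Uniform(0,1), freshly drawn for each experiment
X = rng.binomial(n, p)    # n tosses with the sampled success probability

counts = np.bincount(X, minlength=n + 1) / trials
# each k in {0,...,n} should have probability 1/(n+1) = 1/6
assert np.allclose(counts, 1.0 / (n + 1), atol=0.005)
```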


Tower property of Conditional Expectation

◮ Conditional expectation satisfies

E[ E[h(X)|Y, Z] | Y ] = E[h(X)|Y ]

Note that all these can be random vectors.


◮ Let

g1 (Y, Z) = E[h(X)|Y, Z]
g2 (Y ) = E[g1 (Y, Z)|Y ]

We want to show g2 (Y ) = E[h(X)|Y ]



◮ Recall: g1 (Y, Z) = E[h(X)|Y, Z], g2 (Y ) = E[g1 (Y, Z)|Y ]. Then

g2 (y) = ∫ g1 (y, z) fZ|Y (z|y) dz
       = ∫ (∫ h(x) fX|Y Z (x|y, z) dx) fZ|Y (z|y) dz
       = ∫ h(x) (∫ fX|Y Z (x|y, z) fZ|Y (z|y) dz) dx
       = ∫ h(x) (∫ fXZ|Y (x, z|y) dz) dx
       = ∫ h(x) fX|Y (x|y) dx

◮ Thus we get

E[ E[h(X)|Y, Z] | Y ] = E[h(X)|Y ]


Gaussian or Normal distribution

◮ The Gaussian or normal density is given by

f (x) = (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)} , −∞ < x < ∞

◮ If X has this density, we denote it as X ∼ N (µ, σ²). We showed EX = µ and Var(X) = σ².
◮ The density is a ‘bell-shaped’ curve.


◮ Standard Normal rv — X ∼ N (0, 1).
◮ The distribution function of the standard normal is

Φ(x) = ∫_{−∞}^x (1/√(2π)) e^{−t²/2} dt

◮ Suppose X ∼ N (µ, σ²). Then

P [a ≤ X ≤ b] = ∫_a^b (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)} dx
              (take y = (x − µ)/σ ⇒ dy = (1/σ) dx)
              = ∫_{(a−µ)/σ}^{(b−µ)/σ} (1/√(2π)) e^{−y²/2} dy
              = Φ((b − µ)/σ) − Φ((a − µ)/σ)

◮ We can express the probability of events involving any Normal rv using Φ.
◮ X ∼ N (0, 1). Then its mgf is

MX (t) = E[e^{tX} ] = ∫_{−∞}^{∞} e^{tx} (1/√(2π)) e^{−x²/2} dx
       = (1/√(2π)) ∫_{−∞}^{∞} e^{−(x²−2tx)/2} dx
       = (1/√(2π)) ∫_{−∞}^{∞} e^{−((x−t)²−t²)/2} dx
       = e^{t²/2} (1/√(2π)) ∫_{−∞}^{∞} e^{−(x−t)²/2} dx
       = e^{t²/2}

◮ Now let Y = σX + µ. Then Y ∼ N (µ, σ²). The mgf of Y is

MY (t) = E[e^{t(σX+µ)} ] = e^{tµ} E[e^{(tσ)X} ] = e^{tµ} MX (tσ) = e^{µt + σ²t²/2}


Multi-dimensional Gaussian Distribution

◮ The n-dimensional Gaussian density is given by

fX (x) = (1/(|Σ|^{1/2} (2π)^{n/2} )) e^{−(1/2)(x−µ)ᵀ Σ⁻¹ (x−µ)} , x ∈ ℜⁿ

◮ µ ∈ ℜⁿ and Σ ∈ ℜⁿˣⁿ are parameters of the density, and Σ is symmetric and positive definite.
◮ If X1 , · · · , Xn have the above joint density, they are said to be jointly Gaussian.
◮ We denote this by X ∼ N (µ, Σ).
◮ We will now show that this is a joint density function.


◮ We begin by showing the following is a density (when M is symmetric positive definite):

fY (y) = C e^{−(1/2) yᵀM y}

Let I = ∫_{ℜⁿ} C e^{−(1/2) yᵀM y} dy.
◮ Since M is real symmetric, there exists an orthogonal transform L, with L⁻¹ = Lᵀ and |L| = 1, such that LᵀM L is diagonal.
◮ Let LᵀM L = diag(m1 , · · · , mn ).
◮ Then for any z ∈ ℜⁿ,

zᵀLᵀM Lz = Σ_i mi zi²


◮ We now get

I = ∫_{ℜⁿ} C e^{−(1/2) yᵀM y} dy
  (change variable: z = L⁻¹y = Lᵀy ⇒ y = Lz)
  = ∫_{ℜⁿ} C e^{−(1/2) zᵀLᵀM Lz} dz  (note that |L| = 1)
  = ∫_{ℜⁿ} C e^{−(1/2) Σ_i mi zi²} dz
  = C Π_{i=1}^n ∫_ℜ e^{−(1/2) mi zi²} dzi
  = C Π_{i=1}^n √(2π/mi )


◮ We will first relate m1 · · · mn to the matrix M .
◮ By definition, LᵀM L = diag(m1 , · · · , mn ). Hence

diag(1/m1 , · · · , 1/mn ) = (LᵀM L)⁻¹ = L⁻¹M⁻¹(Lᵀ)⁻¹ = LᵀM⁻¹L

◮ Since |L| = 1, we get

|LᵀM⁻¹L| = |M⁻¹| = 1/(m1 · · · mn )

Putting all this together,

∫_{ℜⁿ} C e^{−(1/2) yᵀM y} dy = C Π_{i=1}^n √(2π/mi ) = C (2π)^{n/2} |M⁻¹|^{1/2}

⇒ (1/((2π)^{n/2} |M⁻¹|^{1/2} )) ∫_{ℜⁿ} e^{−(1/2) yᵀM y} dy = 1


◮ We showed the following is a density (taking M⁻¹ = Σ):

fY (y) = (1/((2π)^{n/2} |Σ|^{1/2} )) e^{−(1/2) yᵀΣ⁻¹y} , y ∈ ℜⁿ

◮ Let X = Y + µ. Then

fX (x) = fY (x − µ) = (1/((2π)^{n/2} |Σ|^{1/2} )) e^{−(1/2)(x−µ)ᵀΣ⁻¹(x−µ)}

◮ This is the multidimensional Gaussian distribution.


◮ Consider Y with joint density

fY (y) = (1/((2π)^{n/2} |Σ|^{1/2} )) e^{−(1/2) yᵀΣ⁻¹y} , y ∈ ℜⁿ

◮ As earlier, let M = Σ⁻¹ and let LᵀM L = diag(m1 , · · · , mn ).
◮ Define Z = (Z1 , · · · , Zn )ᵀ = LᵀY. Then Y = LZ.
◮ Recall |L| = 1, |M⁻¹| = (m1 · · · mn )⁻¹.
◮ Then the density of Z is

fZ (z) = (1/((2π)^{n/2} |M⁻¹|^{1/2} )) e^{−(1/2) zᵀLᵀM Lz}
       = (1/((2π)^{n/2} (1/(m1 · · · mn ))^{1/2} )) e^{−(1/2) Σ_i mi zi²}
       = Π_{i=1}^n (1/√(2π(1/mi ))) e^{−zi²/(2(1/mi ))}

This shows that Zi ∼ N (0, 1/mi ) and the Zi are independent.
◮ If Y has density fY and Z = LᵀY then Zi ∼ N (0, 1/mi ) and the Zi are independent. Hence,

ΣZ = diag(1/m1 , · · · , 1/mn ) = LᵀM⁻¹L

◮ Also, since E[Zi ] = 0, ΣZ = E[ZZᵀ].
◮ Since Y = LZ, E[Y] = 0 and

ΣY = E[YYᵀ] = E[LZZᵀLᵀ] = LE[ZZᵀ]Lᵀ = L(LᵀM⁻¹L)Lᵀ = M⁻¹

◮ Thus, if Y has density

fY (y) = (1/((2π)^{n/2} |Σ|^{1/2} )) e^{−(1/2) yᵀΣ⁻¹y} , y ∈ ℜⁿ

then EY = 0 and ΣY = M⁻¹ = Σ.
◮ Let Y have density

fY (y) = (1/((2π)^{n/2} |Σ|^{1/2} )) e^{−(1/2) yᵀΣ⁻¹y} , y ∈ ℜⁿ

◮ Let X = Y + µ. Then

fX (x) = (1/((2π)^{n/2} |Σ|^{1/2} )) e^{−(1/2)(x−µ)ᵀΣ⁻¹(x−µ)}

◮ We have

EX = E[Y + µ] = µ
ΣX = E[(X − µ)(X − µ)ᵀ] = E[YYᵀ] = Σ


Multi-dimensional Gaussian density

◮ X = (X1 , · · · , Xn )ᵀ are said to be jointly Gaussian if

fX (x) = (1/((2π)^{n/2} |Σ|^{1/2} )) e^{−(1/2)(x−µ)ᵀΣ⁻¹(x−µ)}

◮ EX = µ and ΣX = Σ.
◮ Suppose Cov(Xi , Xj ) = 0, ∀i ≠ j ⇒ Σij = 0, ∀i ≠ j.
◮ Then Σ is diagonal. Let Σ = diag(σ1², · · · , σn²). Then

fX (x) = (1/((2π)^{n/2} σ1 · · · σn )) e^{−(1/2) Σ_{i=1}^n ((xi −µi )/σi )²} = Π_{i=1}^n (1/(σi √(2π))) e^{−(1/2)((xi −µi )/σi )²}

◮ This implies the Xi are independent.
◮ If X1 , · · · , Xn are jointly Gaussian then uncorrelatedness implies independence.
◮ Let X = (X1 , · · · , Xn )ᵀ be jointly Gaussian:

fX (x) = (1/((2π)^{n/2} |Σ|^{1/2} )) e^{−(1/2)(x−µ)ᵀΣ⁻¹(x−µ)}

◮ Let Y = X − µ.
◮ Let M = Σ⁻¹ and L be such that LᵀM L = diag(m1 , · · · , mn ).
◮ Let Z = (Z1 , · · · , Zn )ᵀ = LᵀY.
◮ Then we saw that Zi ∼ N (0, 1/mi ) and the Zi are independent.
◮ If X1 , · · · , Xn are jointly Gaussian then there is a ‘linear’ transform that transforms them into independent random variables.


Moment generating function

◮ Let X = (X1 , · · · , Xn )ᵀ be jointly Gaussian.
◮ Let Y = X − µ and Z = (Z1 , · · · , Zn )ᵀ = LᵀY as earlier.
◮ The moment generating function of X is given by

MX (s) = E[e^{sᵀX} ]
       = E[e^{sᵀ(Y+µ)} ] = e^{sᵀµ} E[e^{sᵀY} ]
       = e^{sᵀµ} E[e^{sᵀLZ} ]
       = e^{sᵀµ} E[e^{uᵀZ} ], where u = Lᵀs
       = e^{sᵀµ} MZ (u)


◮ Since the Zi are independent, it is easy to get MZ .
◮ We know Zi ∼ N (0, 1/mi ). Hence

MZi (ui ) = e^{(1/2)(1/mi )ui²} = e^{ui²/(2mi )}

MZ (u) = E[e^{uᵀZ} ] = Π_{i=1}^n E[e^{ui Zi} ] = Π_{i=1}^n e^{ui²/(2mi )} = e^{Σ_i ui²/(2mi )}

◮ We derived earlier

MX (s) = e^{sᵀµ} MZ (u), where u = Lᵀs


◮ We got

      MX(s) = exp(sᵀµ) MZ(u);   u = Lᵀs;   MZ(u) = exp(Σᵢ ui²/(2mi))

◮ Earlier we have shown LᵀM⁻¹L = diag(1/m1, · · · , 1/mn), where M⁻¹ = Σ. Now we get

      (1/2) Σᵢ ui²/mi = (1/2) uᵀ(LᵀM⁻¹L)u = (1/2) sᵀM⁻¹s = (1/2) sᵀΣs

◮ Hence we get

      MX(s) = exp(sᵀµ + (1/2) sᵀΣs)

◮ This is the moment generating function of the multi-dimensional Normal density.
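The closed form MX(s) = exp(sᵀµ + (1/2)sᵀΣs) can be sanity-checked against a Monte Carlo estimate of E[exp(sᵀX)] (a sketch assuming numpy; µ, Σ, and s below are arbitrary, with s kept small so the sample average is stable):

```python
import numpy as np

rng = np.random.default_rng(1)

mu = np.array([0.5, -0.3])
Sigma = np.array([[1.0, 0.4],
                  [0.4, 0.5]])
s = np.array([0.3, -0.2])     # a small, arbitrary argument for the mgf

X = rng.multivariate_normal(mu, Sigma, size=500_000)
mc_mgf = np.mean(np.exp(X @ s))                      # E[exp(sᵀX)] by simulation
closed_form = np.exp(s @ mu + 0.5 * s @ Sigma @ s)   # exp(sᵀµ + (1/2) sᵀΣs)

assert np.isclose(mc_mgf, closed_form, rtol=0.01)
```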


◮ Let X, Y be jointly Gaussian. For simplicity let EX = EY = 0.
◮ Let Var(X) = σx², Var(Y) = σy²; let ρXY = ρ ⇒ Cov(X, Y) = ρσxσy.
◮ Now, the covariance matrix and its inverse are given by

      Σ = [ σx²     ρσxσy ]        Σ⁻¹ = 1/(σx²σy²(1 − ρ²)) [ σy²      −ρσxσy ]
          [ ρσxσy   σy²   ]                                  [ −ρσxσy   σx²    ]

◮ The joint density of X, Y is given by

      fXY(x, y) = 1/(2πσxσy √(1 − ρ²)) exp( −1/(2(1 − ρ²)) [ x²/σx² + y²/σy² − 2ρxy/(σxσy) ] )

◮ This is the bivariate Gaussian density.
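The expanded bivariate formula should agree exactly with the general matrix form of the density; a quick check (numpy assumed, numbers arbitrary):

```python
import numpy as np

def mvn_density(v, Sigma):
    # Zero-mean multivariate normal density in matrix form
    n = len(v)
    quad = v @ np.linalg.solve(Sigma, v)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma))

def bivariate_density(x, y, sx, sy, rho):
    # Expanded bivariate Gaussian density (zero means)
    z = x**2 / sx**2 + y**2 / sy**2 - 2 * rho * x * y / (sx * sy)
    return np.exp(-z / (2 * (1 - rho**2))) / (2 * np.pi * sx * sy * np.sqrt(1 - rho**2))

sx, sy, rho = 1.5, 0.8, 0.6
Sigma = np.array([[sx**2,         rho * sx * sy],
                  [rho * sx * sy, sy**2        ]])

x, y = 0.9, -0.4      # arbitrary evaluation point
assert np.isclose(bivariate_density(x, y, sx, sy, rho),
                  mvn_density(np.array([x, y]), Sigma))
```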


◮ Suppose X, Y are jointly Gaussian (with the density above).
◮ Then all the marginals and conditionals would be Gaussian.
◮ X ∼ N(0, σx²) and Y ∼ N(0, σy²).
◮ fX|Y(x|y) would be a Gaussian density with mean yρσx/σy and variance σx²(1 − ρ²).
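Since E[X | Y = y] = yρσx/σy is linear in y, the least-squares slope of X on Y should match ρσx/σy and the residual variance should match σx²(1 − ρ²); a simulation sketch (numpy assumed, parameters arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
sx, sy, rho = 2.0, 1.0, -0.5
Sigma = np.array([[sx**2,         rho * sx * sy],
                  [rho * sx * sy, sy**2        ]])
X, Y = rng.multivariate_normal([0, 0], Sigma, size=400_000).T

# Regression slope of X on Y: Cov(X,Y)/Var(Y) = ρσx/σy (the conditional mean coefficient)
slope = np.cov(X, Y)[0, 1] / np.var(Y)
# Conditional (residual) variance: Var(X − slope·Y) = σx²(1 − ρ²)
resid_var = np.var(X - slope * Y)

assert np.isclose(slope, rho * sx / sy, atol=0.02)
assert np.isclose(resid_var, sx**2 * (1 - rho**2), rtol=0.02)
```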


◮ Let X = (X1, · · · , Xn)ᵀ be jointly Gaussian.
◮ Then we call X a Gaussian vector.
◮ It is possible that Xi, i = 1, · · · , n are individually Gaussian but X is not a Gaussian vector.
◮ For example, X, Y may be individually Gaussian while their joint density is not the bivariate normal density.
◮ Gaussian vectors have some special properties. (E.g., uncorrelatedness implies independence.)
◮ It is important to note that ‘individually Gaussian’ does not mean ‘jointly Gaussian’.
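A standard counterexample of this kind can be simulated (a sketch assuming numpy; the random-sign construction below is the usual textbook one, not one stated in these slides): take X ∼ N(0,1) and Y = SX with S an independent random sign. Each of X, Y is standard normal and they are uncorrelated, yet (X, Y) is not a Gaussian vector, since X + Y has an atom at 0.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

X = rng.standard_normal(n)
S = rng.choice([-1.0, 1.0], size=n)   # random sign, independent of X
Y = S * X                             # Y is also N(0, 1)

# Y is standard normal and Cov(X, Y) = E[S X²] = 0 ...
assert abs(np.mean(Y)) < 0.01 and abs(np.var(Y) - 1.0) < 0.02
assert abs(np.cov(X, Y)[0, 1]) < 0.01

# ... but X + Y equals 0 (when S = −1) or 2X (when S = +1), each with
# probability 1/2, so it has an atom at 0 and cannot be Gaussian.
frac_zero = np.mean(X + Y == 0.0)
assert abs(frac_zero - 0.5) < 0.01
```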


◮ The multi-dimensional Gaussian density has some important properties.
◮ We have seen some of them earlier.
◮ If X1, · · · , Xn are jointly Gaussian then they are independent if they are uncorrelated.
◮ Suppose X1, · · · , Xn are jointly Gaussian and have zero means. Then there is an orthogonal transform Y = AX such that Y1, · · · , Yn are jointly Gaussian and independent.
◮ Another important property is the following:
◮ X1, · · · , Xn are jointly Gaussian if and only if tᵀX is Gaussian for all non-zero t ∈ ℜⁿ.
◮ We will prove this using moment generating functions.


◮ Suppose X = (X1, · · · , Xn)ᵀ is jointly Gaussian and let W = tᵀX.
◮ Let µX and ΣX denote the mean vector and covariance matrix of X. Then

      µw ≜ EW = tᵀµX;   σw² ≜ Var(W) = tᵀΣXt

◮ The mgf of W is given by

      MW(u) = E[exp(uW)] = E[exp(u tᵀX)]
            = MX(ut) = exp(u tᵀµX + (1/2) u² tᵀΣXt)
            = exp(uµw + (1/2) u²σw²)

  showing that W is Gaussian.
◮ This shows the density of Xi is Gaussian for each i. For example, if we take t = (1, 0, 0, · · · , 0)ᵀ then tᵀX would be X1.
◮ Now suppose W = tᵀX is Gaussian for all t ≠ 0. Then

      MW(u) = exp(uµw + (1/2) u²σw²) = exp(u tᵀµX + (1/2) u² tᵀΣXt)

◮ This implies

      E[exp(u tᵀX)] = exp(u tᵀµX + (1/2) u² tᵀΣXt),   ∀u ∈ ℜ, ∀t ∈ ℜⁿ, t ≠ 0

  Taking u = 1 gives

      E[exp(tᵀX)] = exp(tᵀµX + (1/2) tᵀΣXt),   ∀t

  This implies X is jointly Gaussian.
◮ This is a defining property of the multidimensional Gaussian density.


◮ Let X = (X1, · · · , Xn)ᵀ be jointly Gaussian.
◮ Let A be a k × n matrix with rank k.
◮ Then Y = AX is jointly Gaussian.
◮ We will once again show this using the moment generating function.
◮ Let µx and Σx denote the mean vector and covariance matrix of X; similarly µy and Σy for Y.
◮ We have µy = Aµx and

      Σy = E[(Y − µy)(Y − µy)ᵀ]
         = E[(A(X − µx))(A(X − µx))ᵀ]
         = E[A(X − µx)(X − µx)ᵀAᵀ]
         = A E[(X − µx)(X − µx)ᵀ] Aᵀ = AΣxAᵀ


◮ The mgf of Y is

      MY(s) = E[exp(sᵀY)]   (s ∈ ℜᵏ)
            = E[exp(sᵀAX)]
            = MX(Aᵀs)   (recall MX(t) = exp(tᵀµx + (1/2) tᵀΣxt))
            = exp(sᵀAµx + (1/2) sᵀAΣxAᵀs)
            = exp(sᵀµy + (1/2) sᵀΣys)

  This shows Y is jointly Gaussian.
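The moment relations µy = Aµx and Σy = AΣxAᵀ are easy to verify by simulation (a sketch assuming numpy; A, µx, Σx below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)

mu_x = np.array([1.0, 0.0, -2.0])
Sigma_x = np.array([[2.0, 0.5, 0.0],
                    [0.5, 1.0, 0.3],
                    [0.0, 0.3, 1.5]])
A = np.array([[1.0, -1.0, 0.0],      # a 2×3 matrix with rank 2
              [0.5,  0.5, 1.0]])

X = rng.multivariate_normal(mu_x, Sigma_x, size=300_000)
Y = X @ A.T                          # Y = AX, applied row-wise to the samples

# Empirical mean and covariance of Y match Aµx and AΣxAᵀ
assert np.allclose(Y.mean(axis=0), A @ mu_x, atol=0.02)
assert np.allclose(np.cov(Y.T), A @ Sigma_x @ A.T, atol=0.03)
```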


◮ X is jointly Gaussian and A is a k × n matrix with rank k.
◮ Then Y = AX is jointly Gaussian.
◮ This shows all marginals of X are Gaussian.
◮ For example, if you take A to be

      A = [ 1 0 0 · · · 0 ]
          [ 0 1 0 · · · 0 ]

  then Y = (X1, X2)ᵀ.


◮ Finding the distribution of a rv by calculating its mgf is useful in many situations.
◮ Let X1, X2, · · · be iid with mgf MX(t). Let SN = Σ_{i=1}^N Xi, where N is a positive-integer-valued rv that is independent of all the Xi.
◮ We want to find the distribution of SN.
◮ We can calculate the mgf of SN in terms of MX and the distribution of N.
◮ We can use properties of conditional expectation for this.


The mgf of SN is MSN(t) = E[exp(tSN)].

      E[exp(tSN) | N = n] = E[exp(t Σ_{i=1}^N Xi) | N = n]
                          = E[exp(t Σ_{i=1}^n Xi) | N = n]
                          = E[exp(t Σ_{i=1}^n Xi)] = E[∏_{i=1}^n exp(tXi)]
                          = ∏_{i=1}^n E[exp(tXi)] = (MX(t))ⁿ

◮ Hence we get

      E[exp(tSN) | N] = (MX(t))ᴺ


◮ We can now find the mgf of SN as

      MSN(t) = E[exp(tSN)]
             = E[ E[exp(tSN) | N] ]
             = E[(MX(t))ᴺ]
             = Σ_{n=1}^∞ (MX(t))ⁿ fN(n)
             = GN(MX(t))

  where GN(s) = E[sᴺ] is the generating function of N.
◮ This method is useful for finding the distribution of SN when we can recognize the distribution from its mgf.
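A worked example of this recognition step (our choice, not one from the slides): take N ∼ Geometric(p) on {1, 2, ...} with GN(s) = ps/(1 − (1−p)s), and Xi ∼ Exponential(λ) with MX(t) = λ/(λ − t). Then GN(MX(t)) = pλ/(pλ − t), which we recognize as the mgf of Exponential(pλ). A simulation sketch (numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(5)
p, lam = 0.25, 2.0
trials = 50_000

# N ~ Geometric(p) on {1, 2, ...}; X_i ~ Exponential(rate λ), all independent
N = rng.geometric(p, size=trials)
S = np.array([rng.exponential(1.0 / lam, size=n).sum() for n in N])

# G_N(M_X(t)) = pλ/(pλ − t) is the mgf of Exponential(rate pλ),
# so S_N should have mean 1/(pλ) and variance 1/(pλ)²
assert np.isclose(S.mean(), 1.0 / (p * lam), rtol=0.02)
assert np.isclose(S.var(), 1.0 / (p * lam) ** 2, rtol=0.05)
```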


◮ We can also find the distribution function of SN directly, using the technique of conditional expectations.
◮ FSN(s) = P[SN ≤ s] and we know how to find probabilities of events using conditional expectation.

      P[Σ_{i=1}^N Xi ≤ s] = Σ_{n=1}^∞ P[Σ_{i=1}^N Xi ≤ s | N = n] P[N = n]
                          = Σ_{n=1}^∞ P[Σ_{i=1}^n Xi ≤ s] P[N = n]


Jensen’s Inequality
◮ Let g : ℜ → ℜ be a convex function. Then
g(EX) ≤ E[g(X)]
◮ For example, (EX)2 ≤ E [X 2 ]
◮ Function g is convex if (see figure on left)
g(αx+(1−α)y) ≤ αg(x)+(1−α)g(y), ∀x, y, ∀0 ≤ α ≤ 1
◮ If g is convex, then, given any x0 , exists λ(x0 ) such that
(see figure on right)
g(x) ≥ g(x0 ) + λ(x0 )(x − x0 ), ∀x

P S Sastry, IISc, E1 222 Aug 2021 245/248


Jensen’s Inequality: Proof
◮ We have: ∀x0 , ∃λ(x0 ) such that

g(x) ≥ g(x0 ) + λ(x0 )(x − x0 ), ∀x

◮ Take x0 = EX and x = X(ω). Then

g(X(ω)) ≥ g(EX) + λ(EX)(X(ω) − EX), ∀ω

◮ Y (ω) ≥ Z(ω), ∀ω ⇒ Y ≥ Z ⇒ EY ≥ EZ
Hence we get

g(X) ≥ g(EX) + λ(EX)(X − EX)


⇒ E[g(X)] ≥ g(EX) + λ(EX) E[X − EX] = g(EX)

◮ This completes the proof

P S Sastry, IISc, E1 222 Aug 2021 246/248
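The inequality can be checked exactly on a small discrete distribution (a sketch with numpy; the pmf below is an arbitrary example):

```python
import numpy as np

# A simple discrete rv: values and probabilities (arbitrary example)
x = np.array([-1.0, 0.0, 2.0, 5.0])
px = np.array([0.2, 0.3, 0.4, 0.1])

EX = np.sum(px * x)

# g(x) = exp(x) is convex, so Jensen gives exp(EX) ≤ E[exp(X)]
assert np.exp(EX) <= np.sum(px * np.exp(x))
# Another convex g: g(x) = x², giving (EX)² ≤ E[X²]
assert EX**2 <= np.sum(px * x**2)
```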


◮ Consider the set of all mean-zero random variables.
◮ It is closed under addition and scalar (real number) multiplication.
◮ Cov(X, Y) = E[XY] satisfies
  1. Cov(X, Y) = Cov(Y, X)
  2. Cov(X, X) = Var(X) ≥ 0, and it is zero only if X = 0
  3. Cov(aX, Y) = a Cov(X, Y)
  4. Cov(X1 + X2, Y) = Cov(X1, Y) + Cov(X2, Y)
◮ Thus Cov(X, Y) is an inner product here.
◮ The Cauchy-Schwarz inequality (|xᵀy| ≤ ||x|| ||y||) gives

      |Cov(X, Y)| ≤ √(Cov(X, X)) √(Cov(Y, Y)) = √(Var(X) Var(Y))

◮ This is the same as |ρXY| ≤ 1.
◮ A generalization of the Cauchy-Schwarz inequality is the Hölder inequality.
Hölder Inequality
◮ For all p, q with p, q > 1 and 1/p + 1/q = 1,

      E[|XY|] ≤ (E|X|^p)^{1/p} (E|Y|^q)^{1/q}

  (We assume all the expectations are finite.)
◮ If we take p = q = 2,

      E[|XY|] ≤ √(E[X²] E[Y²])

◮ This is the same as the Cauchy-Schwarz inequality, and it implies |ρXY| ≤ 1:

      Cov(X, Y) = E[(X − EX)(Y − EY)]
                ≤ E|(X − EX)(Y − EY)|
                ≤ √(E[(X − EX)²] E[(Y − EY)²])
                = √(Var(X) Var(Y))
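Hölder's inequality can also be checked exactly on a small joint pmf, for several conjugate pairs (p, q) (a sketch with numpy; the pmf is an arbitrary example):

```python
import numpy as np

# A fixed joint pmf on pairs (x, y) — arbitrary example values
xy = np.array([[1.0, 2.0], [-2.0, 1.0], [3.0, -1.0], [0.5, 4.0]])
pr = np.array([0.1, 0.4, 0.3, 0.2])
x, y = xy[:, 0], xy[:, 1]

for p in (1.5, 2.0, 3.0):
    q = p / (p - 1.0)                 # conjugate exponent: 1/p + 1/q = 1
    lhs = np.sum(pr * np.abs(x * y))                                   # E|XY|
    rhs = np.sum(pr * np.abs(x)**p)**(1/p) * np.sum(pr * np.abs(y)**q)**(1/q)
    assert lhs <= rhs + 1e-12         # Hölder: E|XY| ≤ (E|X|^p)^{1/p}(E|Y|^q)^{1/q}
```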
