ST119: Probability 2 Lecture notes for Week 1

Review: distributions of random variables


Recall that $(\Omega, \mathcal{F}, P)$ is a probability space if $\Omega$ is a non-empty set – the sample space; $\mathcal{F}$ is a $\sigma$-algebra on $\Omega$ (i.e. a collection of subsets of $\Omega$ which is closed under complementation and taking countable unions) – the events; and $P$ is a probability measure on $(\Omega, \mathcal{F})$, i.e.

$P : \mathcal{F} \to [0, 1];$

$P(\Omega) = 1;$

$P\left(\bigcup_{n=1}^{\infty} A_n\right) = \sum_{n=1}^{\infty} P(A_n)$ if the $(A_n)$ are disjoint events.

Definition 1. The distribution of a random variable $X$ (defined on a probability space $(\Omega, \mathcal{F}, P)$) is the collection of values

$\{P(X \in A) : A \subseteq \mathbb{R}\}. \quad (\star)$

Informally speaking, the distribution of $X$ is the set of all possible answers to questions of the kind: “What is the probability that $X$ . . . ?”, where [. . .] can be any behaviour we can wonder about (for instance: “What is the probability that $X$ is between 2 and 5?”, “What is the probability that $X$ is equal to 15?”, “What is the probability that $X$ is negative?”).
In a more rigorous treatment of probability, we must constrain the expression $(\star)$ by saying that the sets $A$ that can appear inside $P(X \in A)$ must be “reasonable” (the technical term is Borel set). What exactly is meant by a “reasonable” (Borel) set is a topic in Measure Theory. It suffices to say that all sets we would naturally think of, and all sets that will be encountered in this module, are “reasonable” (Borel). Formally, the Borel $\sigma$-algebra is the smallest $\sigma$-algebra of subsets of $\mathbb{R}$ which contains the (open/closed/arbitrary) intervals.

Remark 1. We can characterise the distribution of a random variable (r.v.) in many ways, but one is to just give the values of $P(X \in A)$ for $A$ an arbitrary interval, or even just intervals of the form $(-\infty, x]$.

We also recall the following definition.


Definition 2. Let $X$ be a random variable. The (cumulative) distribution function (c.d.f.) of $X$ is the function $F_X : \mathbb{R} \to [0, 1]$ defined by

$F_X(x) = P(X \le x), \quad x \in \mathbb{R},$

so that the distribution of $X$ is characterised by its cdf.

Review: discrete random variables
Definition 3. A random variable $X$ is discrete if there exists a finite or countably infinite set $\{x_1, x_2, \ldots\}$ of real numbers such that

$P(X = x_i) > 0$ for all $i$, and $\sum_{i \ge 1} P(X = x_i) = 1.$

The set $\{x_1, x_2, \ldots\}$ is called the support of $X$, and is denoted $\mathrm{supp}(X)$. The probability mass function (p.m.f.) of $X$ is the function $p_X : \mathrm{supp}(X) \to [0, 1]$ given by

$p_X(x) = P(X = x), \quad x \in \mathrm{supp}(X).$

Remark 2. Of course, we have

$\sum_{x \in \mathrm{supp}(X)} p_X(x) = 1.$

And for any Borel set $A \subseteq \mathbb{R}$, we have

$P(X \in A) = \sum_{x \in \mathrm{supp}(X) \cap A} p_X(x).$

We could extend the definition of $p_X$ to the whole of $\mathbb{R}$ by setting

$p_X(x) = 0$ for $x \notin \mathrm{supp}(X).$

In the following example, we consider the cumulative distribution function of a discrete random
variable.

Example 1. Let $X$ be a discrete random variable whose support is the set $\mathbb{N}_0 = \{0, 1, \ldots\}$. Let us sketch the graph of the cumulative distribution function $F_X$. First observe that

$x < 0 \implies F_X(x) = P(X \le x) = 0.$

Next, note that

$F_X(0) = P(X \le 0) = P(X < 0) + P(X = 0) = 0 + p_X(0) = p_X(0)$

and for any $x \in (0, 1)$,

$F_X(x) = P(X \le x) = P(X < 0) + P(X = 0) + P(0 < X \le x) = 0 + p_X(0) + 0 = p_X(0).$

By arguing similarly, we conclude that

$F_X(x) = \begin{cases} 0 & \text{if } x < 0, \\ p_X(0) & \text{if } x \in [0, 1), \\ p_X(0) + p_X(1) & \text{if } x \in [1, 2), \\ p_X(0) + p_X(1) + p_X(2) & \text{if } x \in [2, 3), \\ \cdots \end{cases}$
The graph of $F_X$ looks like this:

[Figure: step-function graph of $F_X$ over $x = 1, 2, 3, 4$, with jumps of height $p_X(k)$ at each integer $k$, passing through the levels $p_X(0)$, $p_X(0)+p_X(1)$, $p_X(0)+p_X(1)+p_X(2), \ldots$ and approaching 1.]
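To make this concrete computationally, here is a small Python sketch (not part of the original notes; the choice of the Poisson(2) pmf is purely illustrative) that evaluates $F_X$ at a few points by summing the pmf:

    import math

    def poisson_pmf(k, lam=2.0):
        # p_X(k) = lam^k * e^(-lam) / k!  (an illustrative pmf on N_0)
        return lam**k * math.exp(-lam) / math.factorial(k)

    def cdf(x, pmf):
        # F_X(x) = sum of p_X(k) over integers 0 <= k <= x
        if x < 0:
            return 0.0
        return sum(pmf(k) for k in range(int(math.floor(x)) + 1))

    for x in [-0.5, 0.0, 0.9, 1.0, 2.5]:
        print(f"F_X({x}) = {cdf(x, poisson_pmf):.4f}")

The output is constant between consecutive integers, exhibiting the step-function shape sketched above.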

Common families of discrete distributions


Definition 4 (Bernoulli distribution). We say that a random variable $X$ has the Bernoulli distribution with parameter $p \in (0, 1)$ if $\mathrm{supp}(X) = \{0, 1\}$ and the probability mass function of $X$ is

$p_X(1) = p, \quad p_X(0) = 1 - p.$

We write $X \sim \mathrm{Ber}(p)$. We also say that $X$ is a Bernoulli random variable.

Example 2. Assume that we roll a fair die.* This is modelled by the probability space given by $\Omega = \{1, \ldots, 6\}$ and $P$ such that $P(\{i\}) = 1/6$ for $i = 1, \ldots, 6$. Let $X$ be the random variable given by

$X(\omega) = \begin{cases} 1 & \text{if } \omega = 6; \\ 0 & \text{otherwise.} \end{cases}$

Then, $X$ has a Bernoulli distribution, since its support is $\{0, 1\}$. The parameter is

$p_X(1) = P(X = 1) = P(\{6\}) = 1/6.$

We thus write $X \sim \mathrm{Ber}(1/6)$. Next, let $Y$ be the random variable given by

$Y(\omega) = \begin{cases} 1 & \text{if the die result is even;} \\ 0 & \text{otherwise.} \end{cases}$

Again, $Y$ is a Bernoulli random variable, since its support is $\{0, 1\}$. The parameter is

$p_Y(1) = P(Y = 1) = P(\text{die result is even}) = P(\{2, 4, 6\}) = 1/2.$

So $Y \sim \mathrm{Ber}(1/2)$.

* “Die” is actually the correct singular of “dice”, as in the phrase “the die is cast” attributed to Caesar (speaking in Latin, of course: ‘alea iacta est’) when he crossed the river Rubicon, initiating the Roman civil war which led to his installation as dictator for life.
Bernoulli random variables are usually associated to Bernoulli trials. A Bernoulli trial is a
random experiment in which we label a subset of the possible outcomes as successes, and the
remaining possible outcomes as failures (so there is an underlying interpretation that we are
rooting for certain outcomes). For instance, in the above example of the roll of a fair die, we
could imagine that we are hoping for a 6. Then, the die roll would be a Bernoulli trial, with 6
being a success, and anything else being a failure.
Whenever we have a Bernoulli trial, we can define a Bernoulli random variable X, by saying
that X = 1 if the trial is a success, and X = 0 if the trial is a failure.
Definition 5 (Binomial distribution). Let $n \in \mathbb{N}$ and $p \in (0, 1)$. A random variable $X$ has the binomial distribution with parameters $n$ and $p$ if $\mathrm{supp}(X) = \{0, \ldots, n\}$ and $X$ has the probability mass function

$p_X(k) = \binom{n}{k} \cdot p^k \cdot (1-p)^{n-k}, \quad k \in \{0, \ldots, n\}.$

We write $X \sim \mathrm{Bin}(n, p)$. We also say that $X$ is a binomial random variable.

Note that $p_X$ is indeed a probability mass function since, by the Binomial Theorem,

$1 = (p + (1-p))^n = \sum_{k=0}^{n} \binom{n}{k} \cdot p^k \cdot (1-p)^{n-k}.$

A random variable with the $\mathrm{Bin}(n, p)$ distribution serves to model the number of successes obtained when we perform $n$ independent Bernoulli trials, each with success probability $p$. Let us justify this statement. Let us first model the performing of $n$ independent Bernoulli trials by taking the sample space

$\Omega = \{\omega_1 \omega_2 \ldots \omega_n : \omega_i \in \{\mathrm{S}, \mathrm{F}\} \text{ for each } i\},$

where ‘S’ represents success and ‘F’ represents failure. That is, each outcome $\omega \in \Omega$ is a “word” consisting of $n$ letters, all of which are either S or F. The probability of an outcome $\omega = \omega_1 \ldots \omega_n$ is given by

$P(\{\omega\}) = p^{\text{number of S in } \omega} \cdot (1-p)^{\text{number of F in } \omega}.$

For instance, if $n = 3$ and $p = 1/3$, we have

$\Omega = \{\mathrm{SSS}, \mathrm{SSF}, \mathrm{SFS}, \mathrm{SFF}, \mathrm{FSS}, \mathrm{FSF}, \mathrm{FFS}, \mathrm{FFF}\}$

and

$P(\{\mathrm{SSS}\}) = (1/3)^3 \cdot (2/3)^0 = 1/27,$
$P(\{\mathrm{SSF}\}) = P(\{\mathrm{SFS}\}) = P(\{\mathrm{FSS}\}) = (1/3)^2 \cdot (2/3)^1 = 2/27,$
$P(\{\mathrm{SFF}\}) = P(\{\mathrm{FSF}\}) = P(\{\mathrm{FFS}\}) = (1/3)^1 \cdot (2/3)^2 = 4/27,$
$P(\{\mathrm{FFF}\}) = (1/3)^0 \cdot (2/3)^3 = 8/27.$
Returning to the general case with arbitrary $n$ and $p$, let $X$ denote the number of successes obtained in the $n$ trials. Equivalently,

$X(\omega) = \text{number of S in } \omega.$

Let us fix $k \in \{0, \ldots, n\}$ and compute $P(X = k)$. The event $\{X = k\}$ is the event that we have obtained $k$ successes out of the $n$ trials. That is,

$P(X = k) = \sum_{\omega : \text{number of S in } \omega \text{ is } k} P(\{\omega\}) = \sum_{\omega : \text{number of S in } \omega \text{ is } k} p^k \cdot (1-p)^{n-k}$

$= p^k \cdot (1-p)^{n-k} \cdot \#\{\omega : \text{the number of S in } \omega \text{ is } k\} = \binom{n}{k} \cdot p^k \cdot (1-p)^{n-k},$

so $X \sim \mathrm{Bin}(n, p)$ as claimed.
Note that if $n = 1$, then the $\mathrm{Bin}(n, p)$ distribution is the same as $\mathrm{Ber}(p)$.
Definition 6 (Geometric distribution). Let $p \in (0, 1)$. A random variable $X$ has the geometric distribution with parameter $p$ if $\mathrm{supp}(X) = \mathbb{N} = \{1, 2, \ldots\}$ and $X$ has the probability mass function

$p_X(k) = p \cdot (1-p)^{k-1}, \quad k \in \mathbb{N}.$

We write $X \sim \mathrm{Geom}(p)$.

To see that $p_X$ is indeed a probability mass function, note that

$\sum_{k=1}^{\infty} p_X(k) = \sum_{k=1}^{\infty} p \cdot (1-p)^{k-1} = p \cdot \sum_{\ell=0}^{\infty} (1-p)^{\ell} = p \cdot \frac{1}{1-(1-p)} = p \cdot \frac{1}{p} = 1.$

Random variables having the geometric distribution arise in the following situation. Suppose that we repeatedly perform independent Bernoulli trials, all with the same probability $p$ of success. Then, the number of trials performed to obtain the first success has a geometric distribution with parameter $p$.
Remark 3. Some people use a slightly different definition for the geometric distribution with parameter $p$; namely, they count the number of failed trials performed before the first success is obtained. So, if success is obtained in the first trial, the number of failed trials is zero and $X = 0$. This gives a p.m.f. $\tilde{p}_X(k) = p \cdot (1-p)^k$, for $k \in \mathbb{N}_0$. We will not adopt this form.

Example 3. We roll a die repeatedly until we roll a 6 for the first time. Let $X$ be the total number of times we roll the die. Then, $X \sim \mathrm{Geom}(1/6)$.

Definition 7 (Poisson distribution). Let $\lambda > 0$. A random variable $X$ has the Poisson distribution with parameter $\lambda$ if $\mathrm{supp}(X) = \mathbb{N}_0 = \{0, 1, \ldots\}$ and the probability mass function of $X$ is

$p_X(k) = \frac{\lambda^k}{k!} \cdot e^{-\lambda}, \quad k \in \mathbb{N}_0$

(recall that $0! = 1$). We write $X \sim \mathrm{Poi}(\lambda)$.

To show that $p_X$ is indeed a probability mass function, we compute

$\sum_{k=0}^{\infty} p_X(k) = \sum_{k=0}^{\infty} \frac{\lambda^k}{k!} \cdot e^{-\lambda} = e^{-\lambda} \cdot \sum_{k=0}^{\infty} \frac{\lambda^k}{k!} = e^{-\lambda} \cdot e^{\lambda} = 1.$

Random variables that count rare occurrences among many trials (such as the number of accidents on a road throughout a year, or the number of typos in a book) typically follow (at least approximately) the Poisson distribution. This is made rigorous in the following proposition.

Proposition 1 (Poisson approximation to the binomial distribution). Let $\lambda$ be a positive real number. Let $X \sim \mathrm{Poisson}(\lambda)$. For each integer $n > \lambda$, let $X_n \sim \mathrm{Bin}(n, \lambda/n)$. Then, for each $k \in \mathbb{N}_0$, we have

$p_{X_n}(k) \xrightarrow{n \to \infty} p_X(k).$

This proposition says that, when we perform a large number ($= n$) of independent Bernoulli trials, all with the same probability ($= \lambda/n$) of success, then the number of successes we obtain is approximately distributed as $\mathrm{Poisson}(\lambda)$.
Proof. (This proof is not examinable.) We start by writing

$p_{X_n}(k) = \binom{n}{k} \cdot \left(\frac{\lambda}{n}\right)^k \cdot \left(1 - \frac{\lambda}{n}\right)^{n-k} = \frac{n(n-1)\cdots(n-k+1)}{k!} \cdot \frac{\lambda^k}{n^k} \cdot \left(1 - \frac{\lambda}{n}\right)^{-k} \cdot \left(1 - \frac{\lambda}{n}\right)^{n}.$

We rearrange the right-hand side as

$\frac{\lambda^k}{k!} \cdot \underbrace{\frac{n}{n} \cdot \frac{n-1}{n} \cdots \frac{n-k+1}{n}}_{=:A_n} \cdot \underbrace{\left(1 - \frac{\lambda}{n}\right)^{-k}}_{=:B_n} \cdot \underbrace{\left(1 - \frac{\lambda}{n}\right)^{n}}_{=:C_n}.$

Next, note that

$A_n = 1 \cdot \left(1 - \frac{1}{n}\right) \cdot \left(1 - \frac{2}{n}\right) \cdots \left(1 - \frac{k-1}{n}\right) \xrightarrow{n \to \infty} 1$

and

$B_n = \left(1 - \frac{\lambda}{n}\right)^{-k} \xrightarrow{n \to \infty} 1.$

Finally, using the Taylor approximation $\log(1 + x) = x + E(x)$, where $\lim_{x \to 0} \frac{|E(x)|}{x} = 0$, we obtain

$C_n = \exp\left\{n \cdot \log\left(1 - \frac{\lambda}{n}\right)\right\} = \exp\left\{n \cdot \left(-\frac{\lambda}{n} + E\left(-\frac{\lambda}{n}\right)\right)\right\} = \exp\left\{-\lambda + \lambda \cdot \frac{E(-\lambda/n)}{-\lambda/n}\right\} \xrightarrow{n \to \infty} e^{-\lambda}.$

Combining the three limits gives $p_{X_n}(k) \to \frac{\lambda^k}{k!} e^{-\lambda} = p_X(k)$, as claimed.

Exercise 1. Assume that the probability of a typo in any word in a book is $10^{-5}$, independently of each other word. The book contains $2 \cdot 10^5$ words. Using a Poisson approximation to the binomial distribution, estimate the probability that there are exactly two typos in the book.

Solution. The number of typos (successes), denoted $X$, has the binomial distribution with parameters $n = 2 \cdot 10^5$ and $p = 10^{-5}$. We approximate this by a random variable $Y$ with Poisson distribution with parameter $\lambda = np = 2 \cdot 10^5 \cdot 10^{-5} = 2$. Then,

$P(X = 2) \approx P(Y = 2) = p_Y(2) = e^{-2} \cdot \frac{2^2}{2!} \approx 0.271.$

Remark 4. The above proposition and exercise show that we can approximate $\mathrm{Bin}(n, p)$ by $\mathrm{Poi}(np)$ when $n$ is large and $p$ is small. Of course, saying “large” and “small” is a bit vague, and we may want to have more quantitative statements. Although we will not go further in this module, it is important to mention that there are more precise formulations of the above proposition, which give some information about the size of the error in the Poisson approximation to the binomial. A theorem due to Lucien Le Cam states that, if $X \sim \mathrm{Bin}(n, p)$ and $Y \sim \mathrm{Poi}(\lambda)$ with $\lambda = np$, then $|p_X(k) - p_Y(k)| < np^2$ for all $k$. For instance, in the typo exercise above, the error in the approximation is smaller than $2 \cdot 10^5 \cdot (10^{-5})^2 = 2 \cdot 10^{-5}$.
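As an illustration (not in the original notes), the following Python sketch compares the $\mathrm{Bin}(n, \lambda/n)$ pmf with the $\mathrm{Poi}(\lambda)$ pmf for $\lambda = 2$, checking the observed error against the Le Cam bound $np^2$:

    import math

    def binom_pmf(k, n, p):
        return math.comb(n, k) * p**k * (1 - p)**(n - k)

    def poisson_pmf(k, lam):
        return lam**k * math.exp(-lam) / math.factorial(k)

    lam = 2.0
    for n in [10, 100, 1000]:
        p = lam / n
        # largest pointwise gap |p_{X_n}(k) - p_X(k)| over the first few k
        err = max(abs(binom_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(20))
        print(f"n = {n:4d}: max error = {err:.6f}, Le Cam bound n*p^2 = {n * p * p:.6f}")

As $n$ grows the error shrinks roughly like $1/n$, in line with the bound $np^2 = \lambda^2/n$.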

Review: continuous random variables


We recall the definition of continuous random variables:

Definition 8. A random variable $X$ is called continuous (or is said to have a continuous distribution) if there exists a non-negative function $f_X : \mathbb{R} \to \mathbb{R}$ such that

$P(X \in B) = \int_B f_X(y)\, dy \quad \text{for any } B \subseteq \mathbb{R}.$

The function $f_X$ is called the probability density function of $X$.

Remark 5. Technically, we should say that such a r.v. is absolutely continuous, since there are r.v.s with continuous, but not differentiable, cdfs.

If $X$ is continuous, then

$F_X(x) = \int_{-\infty}^{x} f_X(y)\, dy \quad \text{and} \quad F_X'(x) = f_X(x) \text{ for every } x \text{ at which } f_X \text{ is continuous.}$

Moreover, for any $x \in \mathbb{R}$ we have

$P(X = x) = 0.$

Common families of continuous distributions
Definition 9 (Uniform distribution). Let $a, b \in \mathbb{R}$ with $a < b$. A random variable $X$ has the uniform distribution on $(a, b)$ if it has cumulative distribution function given by

$F_X(x) = \begin{cases} 0 & \text{if } x \le a, \\ \frac{x-a}{b-a} & \text{if } a < x < b, \\ 1 & \text{if } x \ge b. \end{cases}$

We write $X \sim \mathrm{Unif}(a, b)$.

The corresponding probability density function is

$f_X(x) = \begin{cases} \frac{1}{b-a} & \text{if } a < x < b, \\ 0 & \text{otherwise.} \end{cases}$

[Figure: graphs of $F_X$ (rising linearly from 0 at $a$ to 1 at $b$) and $f_X$ (constant $\frac{1}{b-a}$ on $(a, b)$); $F_X(x_0)$ equals the area under $f_X$ to the left of $x_0$.]

Exercise 2. Let $X \sim \mathrm{Unif}(-1, 10)$. Find the probability that $X$ is positive.

Solution. $P(X > 0) = 1 - P(X \le 0) = 1 - F_X(0) = 1 - \frac{0-(-1)}{10-(-1)} = 1 - \frac{1}{11} = \frac{10}{11}.$

Definition 10 (Exponential distribution). Let $\lambda > 0$. A random variable $X$ has the exponential distribution with parameter $\lambda$ if it has cumulative distribution function given by

$F_X(x) = \begin{cases} 1 - e^{-\lambda x} & \text{if } x > 0, \\ 0 & \text{if } x \le 0. \end{cases}$

We write $X \sim \mathrm{Exp}(\lambda)$.

The corresponding probability density function is

$f_X(x) = \begin{cases} \lambda e^{-\lambda x} & \text{if } x > 0, \\ 0 & \text{if } x \le 0. \end{cases}$

[Figure: graphs of $F_X$ (increasing from 0 towards 1, with slope $\lambda$ at the origin) and $f_X$ (decreasing from $\lambda$ at 0, with slope $-\lambda^2$ there); $F_X(x_0)$ equals the area under $f_X$ to the left of $x_0$.]
The exponential distribution is commonly used to model the lifetime of entities that have a lack-of-memory property. To explain what this means, let us think of light bulbs. Suppose that the lifetime of light bulbs of a particular brand has an $\mathrm{Exponential}(\lambda)$ distribution (assume that we turn on the light and don’t turn it off until the bulb burns out). Then, the memoryless property means that, regardless of whether the bulb has just been activated or has been active for a certain amount of time, the distribution of the remaining lifetime is the same. Mathematically, this is expressed by the following identity, which holds for all $s, t > 0$:

$P(X > t + s \mid X > t) = P(X > s).$

You will prove this in the exercises.

We now see a distribution that generalizes the exponential distribution.


[y + ((-)
E *
+ -
*
H = k- 5(k 1)
+


=
( D! (1)
-

Definition 11 (Gamma function and distribution). First define the Gamma function $\Gamma$ as

$\Gamma(w) := \int_0^\infty x^{w-1} \cdot e^{-x}\, dx, \quad w \in (0, \infty)$

(for $w \in \mathbb{N}$, one has $\Gamma(w) = (w-1)!$). Let $w > 0$ and $\lambda > 0$. A random variable $X$ has the Gamma distribution with parameters $w$ and $\lambda$ if it has probability density function

$f_X(x) = \begin{cases} \frac{\lambda^w}{\Gamma(w)} \cdot x^{w-1} \cdot e^{-\lambda x} & \text{if } x > 0, \\ 0 & \text{otherwise.} \end{cases}$

We write $X \sim \mathrm{Gamma}(w, \lambda)$.

To see that $f_X$ is indeed a density function, we compute:

$\int_0^\infty f_X(x)\, dx = \int_0^\infty \frac{\lambda^w}{\Gamma(w)} \cdot x^{w-1} \cdot e^{-\lambda x}\, dx$

$= \frac{\lambda^w}{\Gamma(w)} \cdot \int_0^\infty x^{w-1} \cdot e^{-\lambda x}\, dx \quad (\text{substitute } \lambda x = y)$

$= \frac{\lambda^w}{\Gamma(w)} \cdot \int_0^\infty \left(\frac{y}{\lambda}\right)^{w-1} \cdot e^{-y}\, \frac{dy}{\lambda} = \frac{1}{\Gamma(w)} \cdot \int_0^\infty y^{w-1} \cdot e^{-y}\, dy = 1.$

It is worth noting that the Gamma distribution with parameters $w = 1$ and $\lambda$ is equal to the exponential distribution with parameter $\lambda$.

Definition 12 (Normal distribution). Let $\mu \in \mathbb{R}$ and $\sigma > 0$. A random variable $X$ has the normal (or Gaussian) distribution with parameters $\mu$ and $\sigma^2$ if it has probability density function

$f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \cdot e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad x \in \mathbb{R}.$

We write $X \sim N(\mu, \sigma^2)$.

The graph of $f_X$ is a ‘bell curve’, symmetric about $\mu$. The value of $\sigma$ controls the dispersion of the curve. See the figure below.
The computation to show that $f_X$ is a probability density function is a bit involved and requires knowledge of calculus in $\mathbb{R}^d$, so we omit it.
The normal distribution often arises when we measure the dispersion of some real-world quantity about its average value. As we will see later, this is justified by the Central Limit Theorem, which says that the normal distribution arises as a universal limiting distribution in probability theory.

[Figure: normal density curves for $(\mu, \sigma) = (-0.5, 0.3)$, $(0, 0.3)$, $(0.5, 0.3)$, $(0, 0.2)$ and $(0, 0.7)$, illustrating how $\mu$ shifts the curve and $\sigma$ controls its spread.]
ST119: Probability 2 Lecture notes for Week 2

Expectation of common families of distributions


We recall that, for a random variable $X$,

$E[X] = \begin{cases} \sum_{x \in \mathrm{supp}(X)} x \cdot p_X(x) & \text{if } X \text{ is discrete;} \\ \int_{-\infty}^{\infty} x \cdot f_X(x)\, dx & \text{if } X \text{ is continuous.} \end{cases}$

Remark 1. It is convenient to extend the definition of the probability mass function $p_X$ of a discrete random variable $X$ to the whole of $\mathbb{R}$ by defining

$p_X(x) = 0 \quad \text{for } x \notin \mathrm{supp}(X).$

The following fact will be useful.

Proposition 1. Let $X$ be a discrete random variable whose support is a subset of $\mathbb{N}_0$. Then,

$E[X] = \sum_{n \ge 1} P(X \ge n).$

Proof. We have

$E[X] = \sum_{n=1}^{\infty} n \cdot p_X(n) = \sum_{n=1}^{\infty} \left(\sum_{j=1}^{n} 1\right) p_X(n).$

The above double sum is over all pairs $(n, j)$ such that $n \in \mathbb{N}$ and $j \in \{1, \ldots, n\}$. We can rewrite the sum as

$\sum_{j=1}^{\infty} \sum_{n=j}^{\infty} p_X(n) = \sum_{j=1}^{\infty} P(X \ge j),$

as required.
An analogous statement also holds for continuous random variables:

Proposition 2. Let $X$ be a continuous random variable which only takes positive values. Then,

$E[X] = \int_0^\infty P(X > x)\, dx.$

Proof. Similarly to the discrete case (the interchange of the two integrals below is justified by Tonelli's Theorem), we have

$E[X] = \int_0^\infty x \cdot f_X(x)\, dx = \int_0^\infty \left(\int_0^x 1\, dy\right) f_X(x)\, dx$

$= \int_0^\infty \int_0^x f_X(x)\, dy\, dx = \int_0^\infty \int_y^\infty f_X(x)\, dx\, dy = \int_0^\infty P(X > y)\, dy.$
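As a quick numerical check of Proposition 1 (a sketch, not part of the notes), take $X \sim \mathrm{Geom}(p)$ with $p = 0.3$, for which $P(X \ge n) = (1-p)^{n-1}$ and $E[X] = 1/p$:

    p = 0.3
    N = 500  # truncation point; the tail beyond this is negligible

    tail_sum = sum((1 - p) ** (n - 1) for n in range(1, N))          # sum of P(X >= n)
    direct = sum(n * p * (1 - p) ** (n - 1) for n in range(1, N))    # sum of n * p_X(n)

    print(tail_sum, direct, 1 / p)  # all three approximately 3.3333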

We will now revisit the families of distributions seen in Week 1. For a random variable following
each of the distributions we have seen, we will compute the expectation (as a function of the
parameters).
Bernoulli distribution. If $X \sim \mathrm{Ber}(p)$, then

$E[X] = 0 \cdot p_X(0) + 1 \cdot p_X(1) = p_X(1) = p.$

Binomial distribution. For $X \sim \mathrm{Bin}(n, p)$, we have

$E[X] = \sum_{k=0}^{n} k \cdot \binom{n}{k} \cdot p^k \cdot (1-p)^{n-k} = \sum_{k=1}^{n} k \cdot \binom{n}{k} \cdot p^k \cdot (1-p)^{n-k} = \sum_{k=1}^{n} k \cdot \frac{n!}{k!(n-k)!} \cdot p^k \cdot (1-p)^{n-k}.$

Using the facts that $k \cdot \frac{1}{k!} = \frac{1}{(k-1)!}$ for $k \ge 1$ and that $n! = n \cdot (n-1)!$, the above becomes

$np \cdot \sum_{k=1}^{n} \frac{(n-1)!}{(k-1)!(n-k)!} \cdot p^{k-1} \cdot (1-p)^{n-k}.$

With the change of indices $m = n - 1$ and $j = k - 1$, the above becomes

$np \cdot \sum_{j=0}^{m} \frac{m!}{j!(m-j)!} \cdot p^j \cdot (1-p)^{m-j}.$

We now note that the expression inside the sum is equal to $p_Y(j)$, where $Y$ is a random variable with the $\mathrm{Bin}(m, p)$ distribution. Hence, the above equals

$np \cdot \sum_{j=0}^{m} p_Y(j) = np \cdot 1 = np.$

In conclusion, $E[X] = np$.

Geometric distribution. We will use the formula for a geometric series: for $a \in (-1, 1)$, we have

$\sum_{n=0}^{\infty} a^n = \frac{1}{1-a} \quad \text{and} \quad \sum_{n=1}^{\infty} a^n = \frac{a}{1-a}.$

Let $X \sim \mathrm{Geom}(p)$. We will compute $E[X]$ by two methods.

Method 1.

$E[X] = \sum_{k=1}^{\infty} k \cdot p(1-p)^{k-1} = p \sum_{k=1}^{\infty} k(1-p)^{k-1} = -p \sum_{k=1}^{\infty} \frac{d}{dp}\left((1-p)^k\right).$

We will now exchange the derivative with the infinite sum (we omit the rigorous justification for doing so) to obtain

$-p \cdot \frac{d}{dp} \sum_{k=1}^{\infty} (1-p)^k = -p \cdot \frac{d}{dp}\left(\frac{1-p}{1-(1-p)}\right) = -p \cdot \frac{d}{dp}\left(\frac{1}{p} - 1\right) = -p \cdot \left(-\frac{1}{p^2}\right) = \frac{1}{p}.$

In conclusion, $E[X] = 1/p$.

Method 2. By Proposition 1, we have

$E[X] = \sum_{n=1}^{\infty} P(X \ge n) = \sum_{n=1}^{\infty} \sum_{m=n}^{\infty} p_X(m) = \sum_{n=1}^{\infty} \sum_{m=n}^{\infty} p(1-p)^{m-1} = p \sum_{n=1}^{\infty} \sum_{m=n}^{\infty} (1-p)^{m-1}. \quad (\star)$

With the change of variable $\ell = m - n$ (so that $m = \ell + n$, and $\ell = 0$ when $m = n$), we have that

$\sum_{m=n}^{\infty} (1-p)^{m-1} = \sum_{\ell=0}^{\infty} (1-p)^{\ell+n-1} = (1-p)^{n-1} \sum_{\ell=0}^{\infty} (1-p)^{\ell} = \frac{(1-p)^{n-1}}{p}.$

Plugging this into $(\star)$, we obtain

$E[X] = p \cdot \frac{1}{p} \sum_{n=1}^{\infty} (1-p)^{n-1} = \sum_{j=0}^{\infty} (1-p)^j = \frac{1}{p}.$

Poisson distribution. If $X \sim \mathrm{Poi}(\lambda)$, then

$E[X] = \sum_{k=0}^{\infty} k \cdot \frac{\lambda^k}{k!} e^{-\lambda} = \lambda \sum_{k=1}^{\infty} \frac{\lambda^{k-1}}{(k-1)!} e^{-\lambda} = \lambda \sum_{j=0}^{\infty} \frac{\lambda^j}{j!} e^{-\lambda} = \lambda \sum_{j=0}^{\infty} p_X(j) = \lambda.$

Uniform distribution. If $X \sim \mathrm{Unif}(a, b)$, then

$E[X] = \int_a^b x \cdot \frac{1}{b-a}\, dx = \frac{1}{b-a} \cdot \frac{b^2 - a^2}{2} = \frac{1}{b-a} \cdot \frac{(b-a)(b+a)}{2} = \frac{a+b}{2}.$

Exponential distribution. If $X \sim \mathrm{Exp}(\lambda)$, then, by Proposition 2,

$E[X] = \int_0^\infty P(X > x)\, dx = \int_0^\infty (1 - F_X(x))\, dx = \int_0^\infty e^{-\lambda x}\, dx = \left[-\frac{e^{-\lambda x}}{\lambda}\right]_{x=0}^{\infty} = \frac{1}{\lambda}.$

Normal distribution. You will prove in the exercises that if $X \sim N(\mu, \sigma^2)$, then $E[X] = \mu$; that is, the parameter $\mu$ is the mean (or expectation) associated to the distribution. Here we will show that $\sigma^2$ is the variance. Recall that, for a random variable $X$, the variance is defined as

$\mathrm{Var}(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2.$

We compute, for $X \sim N(\mu, \sigma^2)$ (already using the fact that $E[X] = \mu$),

$\mathrm{Var}(X) = E[(X - \mu)^2] = \int_{-\infty}^{\infty} (x - \mu)^2 \cdot \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}\, dx,$

where we used the formula $E[g(X)] = \int_{-\infty}^{\infty} g(x) \cdot f_X(x)\, dx$. Using the substitution $y = \frac{x - \mu}{\sigma}$, the above becomes

$\frac{\sigma^2}{\sqrt{2\pi}} \int_{-\infty}^{\infty} y^2 \cdot e^{-\frac{y^2}{2}}\, dy.$

We now integrate by parts,

$u = y, \quad dv = y \cdot e^{-\frac{y^2}{2}}\, dy \implies du = dy, \quad v = -e^{-\frac{y^2}{2}},$

and then

$\frac{\sigma^2}{\sqrt{2\pi}} \int_{-\infty}^{\infty} y^2 \cdot e^{-\frac{y^2}{2}}\, dy = \frac{\sigma^2}{\sqrt{2\pi}} \left[\left(-y e^{-\frac{y^2}{2}}\right)_{y=-\infty}^{\infty} + \int_{-\infty}^{\infty} e^{-\frac{y^2}{2}}\, dy\right] = \frac{\sigma^2}{\sqrt{2\pi}} \left[0 + \sqrt{2\pi}\right] = \sigma^2,$

where we used that $\frac{1}{\sqrt{2\pi}} e^{-y^2/2}$ is the density of a standard normal random variable, so it integrates to 1.

Gamma distribution. For $X \sim \mathrm{Gamma}(w, \lambda)$, a computation along the same lines as the density check above gives $E[X] = w/\lambda$.

Negative binomial distribution. Let $H \sim \mathrm{NBin}(k, p)$ be the number of independent $\mathrm{Ber}(p)$ trials performed until we have $k$ successes. Then

$p_H(x) = P(\text{exactly } k-1 \text{ successes in the first } x-1 \text{ trials, then a success}) = \binom{x-1}{k-1} p^k (1-p)^{x-k}, \quad x \in \{k, k+1, \ldots\}.$

Writing $H = \sum_{i=1}^{k} G_i$, where the $G_i \sim \mathrm{Geom}(p)$ are independent, gives $E[H] = \sum_{i=1}^{k} E[G_i] = \frac{k}{p}$.
The distribution of the transformation of a random variable
We now study questions of the following kind: let $X$ be a random variable whose distribution is known to us, and let $g : \mathbb{R} \to \mathbb{R}$; then, what is the distribution of $Y = g(X)$? Although this can be hard to answer in some situations, we will see some cases where it is straightforward.
A general comment: in this lecture and in the rest of the module, when we say “determine the distribution of a random variable”, this can be understood as determining the cumulative distribution function of the random variable. It is also satisfactory to determine the probability mass function (in case the random variable is discrete) or probability density function (in case it is continuous).
A word about inversion. Let $g : \mathbb{R} \to \mathbb{R}$ be a function. For any $y \in \mathbb{R}$, we define the set

$g^{-1}(y) := \{x \in \mathbb{R} : g(x) = y\} \subseteq \mathbb{R}.$

That is, $g^{-1}(y)$ is the set of $x$ that are mapped to $y$ by $g$.

In case $g$ is a bijection (meaning that it is one-to-one and its image is equal to $\mathbb{R}$), $g^{-1}(y)$ is a set consisting of a single real number, so in that case we regard $g^{-1}(y)$ as a number rather than a set. Then $g^{-1}$ is also a function from $\mathbb{R}$ to $\mathbb{R}$, satisfying $g^{-1}(g(x)) = x$ and $g(g^{-1}(y)) = y$.
Finally, don’t confuse $g^{-1}(y)$ with $\frac{1}{g(y)}$. We will in general avoid using a $-1$ exponent to indicate “one over something” when it can lead to confusion.
Discrete case. Let $X$ be a discrete random variable and $g : \mathbb{R} \to \mathbb{R}$. Then, $Y = g(X)$ is also discrete, and has support

$\mathrm{supp}(Y) = \{g(x) : x \in \mathrm{supp}(X)\}.$

In order to describe the distribution of $Y$, we find its probability mass function $p_Y$:

$p_Y(y) = P(Y = y) = P(g(X) = y) = \sum_{x \in \mathrm{supp}(X) :\, g(x) = y} P(X = x) = \sum_{x \in \mathrm{supp}(X) :\, g(x) = y} p_X(x).$

We thus obtain the formula

$p_Y(y) = \sum_{x \in \mathrm{supp}(X) \cap g^{-1}(y)} p_X(x), \quad y \in \mathrm{supp}(Y). \quad (1)$

In case $g : \mathbb{R} \to \mathbb{R}$ is a bijection, the above formula becomes

$p_Y(y) = p_X(g^{-1}(y)). \quad (2)$

Example 1. Let $X$ be a discrete random variable and $Y = aX + b$, where $a, b \in \mathbb{R}$, $a \neq 0$.
In this case, $g(x) = ax + b$ is a bijection (increasing if $a > 0$ and decreasing if $a < 0$). The inverse function is given by

$g^{-1}(y) = \frac{y - b}{a}, \quad y \in \mathbb{R}.$

We obtain that the support of $Y$ is

$\{ax + b : x \in \mathrm{supp}(X)\},$

and the probability mass function of $Y$ is

$p_Y(y) = p_X\left(\frac{y - b}{a}\right), \quad y \in \mathrm{supp}(Y).$

Example 2. Let $X \sim \mathrm{Geom}(p)$ and $Y = |\sin(\frac{1}{2}\pi X)|$. Let us determine the support and distribution of $Y$.

Since $X \sim \mathrm{Geom}(p)$, the support of $X$ is $\mathbb{N}$ and the probability mass function of $X$ is $p_X(k) = p(1-p)^{k-1}$, for $k \in \mathbb{N}$. Define $g(x) = |\sin(\frac{1}{2}\pi x)|$, so that $Y = g(X)$. We note that:

if $x \in \mathbb{N}$ is even, then we can write $x = 2k$ for some $k \in \mathbb{N}$, so

$g(x) = g(2k) = |\sin(\tfrac{1}{2}\pi \cdot 2k)| = |\sin(\pi k)| = 0;$

if $x \in \mathbb{N}$ is odd, then we can write $x = 2k + 1$ for some $k \in \mathbb{N}_0$, so

$g(x) = g(2k+1) = |\sin(\tfrac{1}{2}\pi \cdot (2k+1))| = |\sin(\pi k + \tfrac{\pi}{2})| = 1.$

This shows that the support of $Y$ is $\{0, 1\}$. Next, by (1) we have

$p_Y(0) = \sum_{x \in \mathbb{N} \cap g^{-1}(0)} p_X(x) \quad \text{and} \quad p_Y(1) = \sum_{x \in \mathbb{N} \cap g^{-1}(1)} p_X(x).$

We have

$g^{-1}(0) = \{x \in \mathbb{R} : |\sin(\tfrac{1}{2}\pi x)| = 0\} = \{\ldots, -2, 0, 2, \ldots\},$
$g^{-1}(1) = \{x \in \mathbb{R} : |\sin(\tfrac{1}{2}\pi x)| = 1\} = \{\ldots, -3, -1, 1, 3, \ldots\},$

so

$\mathbb{N} \cap g^{-1}(0) = \{2, 4, 6, \ldots\}, \quad \mathbb{N} \cap g^{-1}(1) = \{1, 3, 5, \ldots\}.$

Therefore,

$p_Y(1) = p_X(1) + p_X(3) + p_X(5) + \cdots = p + p(1-p)^2 + p(1-p)^4 + \cdots$

$= p \sum_{j=0}^{\infty} \left((1-p)^2\right)^j = p \cdot \frac{1}{1-(1-p)^2} = \frac{p}{2p - p^2} = \frac{1}{2-p}.$

This shows that $Y$ has a Bernoulli distribution with parameter $p_Y(1) = \frac{1}{2-p}$.

Continuous case. Assume $X$ is a continuous random variable, $g : \mathbb{R} \to \mathbb{R}$ and $Y = g(X)$.
In this case, to determine the distribution of $Y$, it is no longer helpful to consider probabilities of events of the form $\{X = x\}$, since such probabilities are all zero!
It is often possible to obtain the cumulative distribution function of $Y$ through a direct method, as we now illustrate.
Example 3. Let $X$ be a continuous random variable, and let $Y = X^2$. Let us find the cumulative distribution function and probability density function of $Y$ in terms of those of $X$.
We start with

$F_Y(y) = P(Y \le y) = P(X^2 \le y).$

Now, in case $y < 0$, the event $\{X^2 \le y\}$ is empty, so $F_Y(y) = 0$. On the other hand, if $y \ge 0$, then $\{X^2 \le y\} = \{-\sqrt{y} \le X \le \sqrt{y}\}$, so

$F_Y(y) = P(-\sqrt{y} \le X \le \sqrt{y}) = P(-\sqrt{y} < X \le \sqrt{y}) = F_X(\sqrt{y}) - F_X(-\sqrt{y}).$

To summarize,

$F_Y(y) = \begin{cases} F_X(\sqrt{y}) - F_X(-\sqrt{y}) & \text{if } y \ge 0, \\ 0 & \text{otherwise.} \end{cases}$

In case we also want the density function of $Y$, we can differentiate. For $y < 0$ we have $f_Y(y) = 0$, since $F_Y$ is constant on $(-\infty, 0)$. For $y > 0$,

$f_Y(y) = F_Y'(y) = \frac{d}{dy}\left(F_X(\sqrt{y}) - F_X(-\sqrt{y})\right) = f_X(\sqrt{y}) \cdot \frac{1}{2\sqrt{y}} - f_X(-\sqrt{y}) \cdot \left(-\frac{1}{2\sqrt{y}}\right) = \frac{1}{2\sqrt{y}} \left(f_X(\sqrt{y}) + f_X(-\sqrt{y})\right).$

(A technicality: we haven’t found $f_Y$ at $y = 0$, because $F_Y$ may fail to be differentiable there. This is not a problem, though. The density function only serves the purpose of being integrated to give probabilities, through the formula $P(Y \in B) = \int_B f_Y(y)\, dy$, and the integral is not affected by the value of the function at a single point.)
The following proposition allows us to handle cases where $g$ is a bijection.

Proposition 3. Let $X$ be a continuous random variable. Let $I, J \subseteq \mathbb{R}$ be intervals, assume that $X$ takes values in $I$, and let $g : I \to J$ be a differentiable function that is strictly monotone (either strictly increasing or strictly decreasing). Then, $Y := g(X)$ has probability density function

$f_Y(y) = f_X(g^{-1}(y)) \cdot \left|\frac{d}{dy} g^{-1}(y)\right|, \quad y \in J.$

Proof. (This proof is non-examinable.) First assume that $g$ is increasing. Then,

$F_Y(y) = P(Y \le y) = P(g(X) \le y) = P(X \le g^{-1}(y)) = F_X(g^{-1}(y)).$

Differentiation and the chain rule then give

$f_Y(y) = \frac{d}{dy} F_Y(y) = f_X(g^{-1}(y)) \cdot \frac{d}{dy} g^{-1}(y) = f_X(g^{-1}(y)) \cdot \left|\frac{d}{dy} g^{-1}(y)\right|.$

The last equality follows from the fact that $g^{-1}$ is increasing when $g$ is increasing (so its derivative is non-negative), which is easy to check.
Next, assume that $g$ is decreasing. Then,

$F_Y(y) = P(Y \le y) = P(g(X) \le y) = P(X \ge g^{-1}(y)) = 1 - F_X(g^{-1}(y)).$

Again differentiating,

$f_Y(y) = \frac{d}{dy}\left(1 - F_X(g^{-1}(y))\right) = -f_X(g^{-1}(y)) \cdot \frac{d}{dy} g^{-1}(y) = f_X(g^{-1}(y)) \cdot \left|\frac{d}{dy} g^{-1}(y)\right|,$

since $g^{-1}$ is decreasing in this case, so it has a negative derivative, and thus $-\frac{d}{dy} g^{-1}(y) = \left|\frac{d}{dy} g^{-1}(y)\right|$.
Example 4. Let $X \sim N(\mu, \sigma^2)$ and $Y = aX + b$, where $a, b \in \mathbb{R}$, $a \neq 0$. Let us use the above proposition to find $f_Y$. We take $g(x) = ax + b$, so $g$ maps $\mathbb{R}$ onto $\mathbb{R}$, and is increasing if $a > 0$ and decreasing if $a < 0$. The inverse is $g^{-1}(y) = \frac{y-b}{a}$. We then get

$f_Y(y) = f_X(g^{-1}(y)) \cdot \left|\frac{d}{dy} g^{-1}(y)\right| = \frac{1}{\sqrt{2\pi\sigma^2}} \cdot e^{-\frac{(\frac{y-b}{a} - \mu)^2}{2\sigma^2}} \cdot \frac{1}{|a|} = \frac{1}{\sqrt{2\pi(\sigma a)^2}} \cdot e^{-\frac{(y - (a\mu + b))^2}{2(\sigma a)^2}}.$

This shows that $Y \sim N(a\mu + b, a^2\sigma^2)$. In particular, if $X \sim N(0, 1)$ and $Y = aX + b$, then $Y \sim N(b, a^2)$.

Example 5. Let $X$ be a continuous random variable with probability density function

$f_X(x) = \begin{cases} 2x & \text{if } x \in (0, 1), \\ 0 & \text{otherwise.} \end{cases}$

Determine the distribution of $V := X^3$.

Solution. The function $g(x) = x^3$ is differentiable and increasing, so we can apply the proposition. We have $g^{-1}(y) = y^{1/3}$, and then

$f_V(y) = f_X(g^{-1}(y)) \cdot \left|\frac{d}{dy} g^{-1}(y)\right|.$

Note that $g^{-1}(y) \in (0, 1)$ if and only if $y \in (0, 1)$, so $f_X(g^{-1}(y)) > 0$ if and only if $y \in (0, 1)$. Also, $\frac{d}{dy} g^{-1}(y) = \frac{1}{3} \cdot y^{-2/3}$. We then obtain

$f_V(y) = \begin{cases} 2y^{1/3} \cdot \frac{1}{3} \cdot y^{-2/3} = \frac{2}{3} \cdot y^{-1/3} & \text{if } y \in (0, 1), \\ 0 & \text{otherwise.} \end{cases}$

The cumulative distribution function of $V$ is given by

$F_V(y) = \begin{cases} 0 & \text{if } y < 0, \\ \int_0^y \frac{2}{3} \cdot t^{-1/3}\, dt = y^{2/3} & \text{if } 0 \le y \le 1, \\ 1 & \text{if } y > 1. \end{cases}$

Alternate solution. Instead, we could have started by computing

$F_X(x) = \int_{-\infty}^{x} f_X(t)\, dt = \begin{cases} 0 & \text{if } x \le 0; \\ x^2 & \text{if } x \in (0, 1); \\ 1 & \text{if } x \ge 1, \end{cases}$

and then

$F_V(y) = P(V \le y) = P(X^3 \le y) = P(X \le y^{1/3}) = F_X(y^{1/3}) = \begin{cases} 0 & \text{if } y < 0; \\ y^{2/3} & \text{if } 0 \le y \le 1; \\ 1 & \text{if } y > 1, \end{cases}$

and then differentiating to obtain the same answer as before.

The functions $F_V$ and $f_V$ are plotted below:

[Figure: graphs of $F_V$ (rising from 0 to 1 on $[0, 1]$ as $y^{2/3}$) and $f_V$ (decreasing on $(0, 1)$, equal to $\frac{2}{3} y^{-1/3}$, passing through $2/3$ at $y = 1$).]
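The computations in Example 5 can also be checked by simulation. A minimal Python sketch (not from the notes): since $F_X(x) = x^2$ on $(0, 1)$, inverse-transform sampling gives $X = \sqrt{U}$ with $U \sim \mathrm{Unif}(0, 1)$, and we compare the empirical cdf of $V = X^3$ with $y^{2/3}$:

    import random

    random.seed(1)
    n = 100_000
    # Inverse transform: F_X(x) = x^2 on (0, 1), so X = sqrt(U) with U ~ Unif(0, 1).
    vs = [(random.random() ** 0.5) ** 3 for _ in range(n)]  # V = X^3

    for y in [0.1, 0.3, 0.5, 0.9]:
        empirical = sum(v <= y for v in vs) / n
        print(f"P(V <= {y}) ~ {empirical:.4f}  vs  y^(2/3) = {y ** (2 / 3):.4f}")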
ST119: Probability 2 Lecture notes for Week 3

Distribution of sums: discrete case


We now consider the following problem: we have two random variables X and Y whose joint
distribution is known to us, and we want to determine the distribution of X + Y . We address
this separately for discrete and continuous joint distributions, starting with the former.
Assume that X and Y are discrete with joint probability mass function pX,Y . Also assume
that X and Y both only take integer values. What is the probability mass function of Z :=
X + Y ? For m 2 Z we compute p(z 2)
=2 x( 2)
= =
,
,

X a+
y

pZ (m) = P(Z = m) = P(X + Y = m) = P(X = k, X + Y = m) [Px x(x x)


=
,
,
z -

k2Z
X
= P(X = k, k + Y = m)
k2Z
X X
= P(X = k, Y = m k) = pX,Y (k, m k).
k2Z k2Z

In case X and Y are independent, we have pX,Y (x, y) = pX (x) · pY (y), so the above also gives
X
pZ (m) = pX (k) · pY (m k).
k2Z

Formulas of this kind are known as discrete convolution formulas.
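In code, the discrete convolution is a double loop over the two supports. Here is a small Python helper (an illustration, not part of the notes), applied to the sum of two fair dice:

    def convolve(p_x, p_y):
        # p_Z(m) = sum over k of p_X(k) * p_Y(m - k), with pmfs stored as dicts
        p_z = {}
        for k, px in p_x.items():
            for j, py in p_y.items():
                p_z[k + j] = p_z.get(k + j, 0.0) + px * py
        return p_z

    die = {i: 1 / 6 for i in range(1, 7)}   # pmf of one fair die
    print(convolve(die, die))               # pmf of the sum: p_Z(7) = 6/36, etc.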


The following proposition is a nice application of this last formula.
Proposition 1. Let $X$ and $Y$ be independent, $X \sim \mathrm{Poi}(\lambda)$ and $Y \sim \mathrm{Poi}(\mu)$. Then, $X + Y \sim \mathrm{Poi}(\lambda + \mu)$.

Proof. Let $Z := X + Y$. Note that $Z \ge 0$ (since $X \ge 0$ and $Y \ge 0$), so $p_Z(m) = 0$ if $m < 0$. For $m \ge 0$, we apply the formula

$p_Z(m) = \sum_{k \in \mathbb{Z}} p_X(k) \cdot p_Y(m - k).$

Noting that $p_X(k) = 0$ for $k < 0$ and $p_Y(m - k) = 0$ for $k > m$, the right-hand side becomes

$\sum_{k=0}^{m} p_X(k) \cdot p_Y(m - k) = \sum_{k=0}^{m} \frac{\lambda^k}{k!} e^{-\lambda} \cdot \frac{\mu^{m-k}}{(m-k)!} e^{-\mu}.$

We multiply and divide the right-hand side by $m!$; rearranging terms, it becomes

$e^{-(\lambda+\mu)} \cdot \frac{1}{m!} \sum_{k=0}^{m} \frac{m!}{k!(m-k)!} \cdot \lambda^k \cdot \mu^{m-k} = e^{-(\lambda+\mu)} \cdot \frac{(\lambda+\mu)^m}{m!}.$

This is the probability mass function of the $\mathrm{Poisson}(\lambda + \mu)$ distribution.


The following is a statement involving sums of Bernoulli random variables.
Proposition 2. Let $X_1, \ldots, X_n$ be independent random variables, all with the Bernoulli distribution with parameter $p$. Then, $X_1 + \cdots + X_n \sim \mathrm{Bin}(n, p)$.

Although we could also prove it using discrete convolutions, we will not do so, because the proposition follows immediately from our interpretations of Bernoulli and binomial random variables in terms of Bernoulli trials. The sum $X_1 + \cdots + X_n$ is doing exactly what the binomial random variable does: counting the number of successes out of the $n$ trials.
Example 1. In Week 2, we proved that $X \sim \mathrm{Bin}(n, p)$ has $E[X] = np$ by doing a long computation. With the above proposition, we obtain a much easier way to verify this. Indeed, letting $X_1, \ldots, X_n \sim \mathrm{Ber}(p)$ be independent and letting $X = \sum_{i=1}^{n} X_i$, we have that

$E[X] = E\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n} E[X_i] = \sum_{i=1}^{n} p = np.$

Corollary 3. Let $X$ and $Y$ be independent, $X \sim \mathrm{Bin}(m, p)$ and $Y \sim \mathrm{Bin}(n, p)$. Then, $X + Y \sim \mathrm{Bin}(m + n, p)$.

Proof. Let $X_1, \ldots, X_{m+n}$ be independent random variables with the Bernoulli distribution with parameter $p$. Define

$W := \sum_{i=1}^{m} X_i, \quad Z := \sum_{i=m+1}^{m+n} X_i.$

By the previous proposition, we have that $W \sim \mathrm{Bin}(m, p)$ and $Z \sim \mathrm{Bin}(n, p)$. Moreover, since $W$ depends only on $X_1, \ldots, X_m$ and $Z$ depends only on $X_{m+1}, \ldots, X_{m+n}$ (and these two sets of random variables have nothing in common), we have that $W$ and $Z$ are independent. We then have, for all $(x, y)$,

$p_{W,Z}(x, y) = p_W(x) \cdot p_Z(y) = p_X(x) \cdot p_Y(y) = p_{X,Y}(x, y),$

so $(W, Z)$ has the same distribution as $(X, Y)$. Then, $W + Z$ has the same distribution as $X + Y$. Finally, the previous proposition implies that $W + Z \sim \mathrm{Bin}(m + n, p)$, so $X + Y$ also has this distribution.

Problems with counting and expectation


The following problem illustrates a method to compute expectations of random variables that
represent some sort of counting procedure.
Example 2. A group of $k$ children play a game. The children stand in line, each holding a ball, and a little way from the front of the line is a table with $n$ identical baskets. In turn, each child:

- chooses a basket uniformly at random (so that each basket has probability $\frac{1}{n}$ of being chosen by a given child);
- attempts to throw the ball inside the chosen basket, and has probability of success equal to $q \in (0, 1)$.

Assume that the trial of each child is independent of the others. Let $X$ be the number of baskets that are empty after all children have taken their turn. Find the expectation of $X$.
Solution. For $i = 1, \ldots, n$, we let

$X_i = \begin{cases} 1 & \text{if basket } i \text{ ends up being empty;} \\ 0 & \text{otherwise.} \end{cases}$

Note that

$X = \sum_{i=1}^{n} X_i.$

We thus have

$E[X] = \sum_{i=1}^{n} E[X_i].$

Since each $X_i$ is a Bernoulli random variable, its expectation is equal to the probability that it equals one, which we now compute:

$P(X_i = 1) = P\left(\bigcap_{\ell=1}^{k} \{\text{child } \ell \text{ does not put a ball inside basket } i\}\right)$

$= \prod_{\ell=1}^{k} P(\text{child } \ell \text{ does not put a ball inside basket } i) = \left(P(\text{child 1 does not put a ball inside basket } i)\right)^k = \left(1 - \frac{q}{n}\right)^k.$

This gives

$E[X] = \sum_{i=1}^{n} \left(1 - \frac{q}{n}\right)^k = n\left(1 - \frac{q}{n}\right)^k.$

The coupon collector problem


A child buys coupons to try to complete a coupon book. There are $N$ different types of coupons. They are sold in sealed envelopes; each envelope contains a single coupon, which is equally likely to be of any of the $N$ types. Let $X$ be the number of coupons bought until the collection is completed. We will compute the expectation of $X$.
We define auxiliary random variables $Y_1, \ldots, Y_N$. We let $Y_1 = 1$ (that is, $Y_1$ is actually deterministic, always equal to 1). For $k \in \{2, \ldots, N\}$, we define $Y_k$ as the number of coupons bought from (and not including) the moment when her collection reached $k - 1$ different types to (and including) the moment when her collection reached $k$ different types. To illustrate this, assume that the coupon types are A, B, C, and the sequence of coupons bought by the child is:

A, A, A, B, B, A, A, B, C.

In this case, we have that $Y_1 = 1$, $Y_2 = 3$ and $Y_3 = 5$. We now make two observations.

- First, we have

$X = Y_1 + \ldots + Y_N.$

Indeed, the total number of coupons bought is equal to the sum of the numbers of coupons bought between the successive times when the collection grows.

- Second, for $k \ge 2$, the distribution of $Y_k$ is $\mathrm{Geometric}\left(\frac{N-k+1}{N}\right)$. To see this, assume that the child just bought the coupon that made her collection grow to $k - 1$ different types. From this moment (and not including it) until (and including) the moment when the next unseen coupon is bought, the child is making Bernoulli trials, where success means buying an unseen coupon (and thus has probability $\frac{N-(k-1)}{N} = \frac{N-k+1}{N}$).
Putting these two observations together, and using the fact that the expectation of a $\mathrm{Geometric}(p)$ random variable is $1/p$, we obtain

$E[X] = E[Y_1 + Y_2 + \ldots + Y_N] = E[Y_1] + E[Y_2] + \ldots + E[Y_N]$

$= \frac{N}{N} + \frac{N}{N-1} + \frac{N}{N-2} + \ldots + \frac{N}{1} = N \sum_{k=1}^{N} \frac{1}{k}.$

The number $\sum_{k=1}^{N} \frac{1}{k}$ is very close to $\int_1^N \frac{1}{x}\, dx = \log(N)$. Hence, $E[X]$ is very close to $N \log(N)$.

It is not hard to see that the random variables $Y_1, \ldots, Y_N$ that we used in the above reasoning are actually independent. However, we never needed this: we just used the equality $E[X] = \sum_k E[Y_k]$ – the expectation of a sum equals the sum of the expectations – for which no independence is required.
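A short simulation of the coupon collector problem (a sketch, not from the notes), comparing the empirical mean with $N \sum_{k=1}^{N} 1/k \approx N \log N$:

    import math
    import random

    random.seed(0)
    N, trials = 50, 5_000

    def coupons_needed(N):
        seen, count = set(), 0
        while len(seen) < N:
            seen.add(random.randrange(N))  # each envelope is uniform over the N types
            count += 1
        return count

    avg = sum(coupons_needed(N) for _ in range(trials)) / trials
    exact = N * sum(1 / k for k in range(1, N + 1))
    print(f"simulated: {avg:.1f},  N*sum(1/k): {exact:.1f},  N*log(N): {N * math.log(N):.1f}")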

Distribution of sums: continuous case


The following proposition gives the convolution formula for jointly continuous and independent
random variables.
Proposition 4. Let $X$ and $Y$ be independent continuous random variables with density functions $f_X$ and $f_Y$, respectively. Then, $Z := X + Y$ has density function given by

$f_Z(z) = \int_{-\infty}^{\infty} f_X(x) \cdot f_Y(z - x)\, dx = \int_{-\infty}^{\infty} f_X(z - y) \cdot f_Y(y)\, dy.$

Example 3. Let $X$ and $Y$ be independent, both with the exponential distribution with parameter $\lambda > 0$, that is,

$f_X(x) = f_Y(x) = \begin{cases} \lambda e^{-\lambda x} & \text{if } x > 0; \\ 0 & \text{otherwise.} \end{cases}$

Let $Z := X + Y$. We start with the convolution formula,

$f_Z(z) = \int_{-\infty}^{\infty} f_X(x) \cdot f_Y(z - x)\, dx.$

Now, note that the product inside the integral is equal to zero when $x < 0$ (since $f_X(x) = 0$ then) and when $x > z$ (since $f_Y(z - x) = 0$ then). The integral is then equal to

$\int_0^z \lambda e^{-\lambda x} \cdot \lambda e^{-\lambda(z-x)}\, dx = \int_0^z \lambda^2 \cdot e^{-\lambda z}\, dx = \lambda^2 \cdot z \cdot e^{-\lambda z}.$

Hence, $f_Z(z) = \lambda^2 \cdot z \cdot e^{-\lambda z}$ for $z > 0$; comparing with Definition 11 of Week 1, this says that $Z \sim \mathrm{Gamma}(2, \lambda)$.
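This can be checked by simulation (a sketch, not part of the notes): exponential samples are generated by inverse transform, since $-\log(U)/\lambda \sim \mathrm{Exp}(\lambda)$ for $U \sim \mathrm{Unif}(0, 1)$, and the empirical density of $Z$ is compared with $\lambda^2 z e^{-\lambda z}$:

    import math
    import random

    random.seed(0)
    lam, n = 1.5, 200_000
    # -log(1 - U)/lam ~ Exp(lam); sum two independent copies to get Z = X + Y.
    zs = [-(math.log(1.0 - random.random()) + math.log(1.0 - random.random())) / lam
          for _ in range(n)]

    h = 0.1  # bin width for the empirical density
    for z in [0.5, 1.0, 2.0]:
        emp = sum(z - h / 2 < v <= z + h / 2 for v in zs) / (n * h)
        print(f"z = {z}: empirical {emp:.3f}, formula {lam**2 * z * math.exp(-lam * z):.3f}")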

Review: variance
Let us recall the definition of the variance of a random variable.
Definition 1. For a random variable $X$, we define

$\mathrm{Var}(X) = E[(X - E[X])^2].$

As you have seen, it is easy to prove that the alternative formula holds:

$\mathrm{Var}(X) = E[X^2] - (E[X])^2.$

Recall that, given real numbers $a, b$, we have

$\mathrm{Var}(aX + b) = a^2 \cdot \mathrm{Var}(X).$

Moreover, if $X$ and $Y$ are independent random variables, we have

$\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y).$

Let us now use this fact to compute the variance of a binomial random variable.

Example 4. Let $X_1, \ldots, X_n$ be independent, all with the $\mathrm{Ber}(p)$ distribution. Recall that $X = \sum_{i=1}^{n} X_i$ follows a $\mathrm{Bin}(n, p)$ distribution. We have that

$\mathrm{Var}(X) = \mathrm{Var}\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} \mathrm{Var}(X_i).$

The variance of the $X_i$'s is easy to compute. We already found that $E[X_i] = p$. Also note that, since $X_i$ only attains the values 0 and 1, we have that $X_i^2 = X_i$ (since $0^2 = 0$ and $1^2 = 1$), so $E[X_i^2] = E[X_i] = p$, and

$\mathrm{Var}(X_i) = E[X_i^2] - (E[X_i])^2 = p - p^2 = p(1-p).$

We then obtain

$\mathrm{Var}(X) = \sum_{i=1}^{n} p(1-p) = np(1-p).$

Markov’s and Chebyshev’s inequalities


In ST118, you practiced computing the expectations and variances of random variables, but you haven't really seen much about what can be done with these quantities.
We will now give some practical uses of expectation and variance. Namely, they will allow us to make estimates about probabilities involving the random variable in question. These estimates are two inequalities: Markov's inequality and Chebyshev's inequality.

Theorem 1 (Markov's inequality). Let $X$ be a non-negative random variable whose expectation is well defined. We then have

$P(X \ge x) \le \frac{E[X]}{x} \quad \text{for all } x > 0.$

Proof. Fix $x > 0$. Define the random variable

$Y := \begin{cases} x & \text{if } X \ge x; \\ 0 & \text{otherwise.} \end{cases}$

We have that $X \ge Y$, since:

- if $X \ge x$, then $Y = x$, so $X \ge Y$;
- if $X \in [0, x)$, then $Y = 0$, so $X \ge Y$.

This also gives $E[X] \ge E[Y]$. Next, note that $Y$ is a discrete random variable (it only attains the values 0 and $x$) with

$p_Y(x) = P(X \ge x), \quad p_Y(0) = P(X < x).$

Hence,

$E[X] \ge E[Y] = 0 \cdot p_Y(0) + x \cdot p_Y(x) = x \cdot p_Y(x) = x \cdot P(X \ge x).$

Rearranging this, we obtain the desired inequality.

Example 5. Suppose that each week, a company produces on average 50 items. Give an upper bound for the probability that a week's production exceeds 75 items.
Terminology: An upper bound for an unknown quantity $p$ is any real number $a$ such that $p \le a$. This terminology is typically used when we are trying to narrow down the (unknown) value of $p$ as best we can. The smaller an upper bound, the more informative it is. For instance, in this exercise, we could say that 1 is an upper bound for the probability, which is obviously true, but doesn't give any information.

Solution. The assumption means that a week's production is a random variable $X$ with $E[X] = 50$. We use Markov's inequality to estimate

$P(X > 75) \le P(X \ge 75) \le \frac{E[X]}{75} = \frac{50}{75} = \frac{2}{3}.$

It is important to note that Markov's inequality does not always give a useful bound. Indeed, if $X$ is a non-negative random variable with expectation equal to $\mu$, and $x \in (0, \mu]$, then in the inequality

$P(X \ge x) \le \frac{\mu}{x},$

the right-hand side is at least 1, so the bound only tells us that the probability is smaller than or equal to 1.
While Markov's inequality gives a bound on the probability that a random variable is large, Chebyshev's inequality gives a bound on the probability that a random variable is far from its expectation.
Theorem 2 (Chebyshev's inequality). Let $X$ be a random variable whose variance is well defined. Then,

$P(|X - E[X]| \ge x) \le \frac{\mathrm{Var}(X)}{x^2} \quad \text{for all } x > 0.$

We emphasize that here no assumption is made concerning the sign of $X$.

Proof. Let $Y := (X - E[X])^2$. Then, $Y$ is non-negative and

$E[Y] = E\left[(X - E[X])^2\right] = \mathrm{Var}(X);$

in particular, $Y$ has finite expectation. Next, note that for any $x > 0$, we have

$\{|X - E[X]| \ge x\} = \{(X - E[X])^2 \ge x^2\} = \{Y \ge x^2\}.$

Hence, by Markov's inequality we have

$P(|X - E[X]| \ge x) = P(Y \ge x^2) \le \frac{E[Y]}{x^2} = \frac{\mathrm{Var}(X)}{x^2}.$

Example 6. We roll a fair six-sided die $n$ times. Let $X$ denote the number of 6's obtained. Give an upper bound for the probability that $|X - \frac{n}{6}|$ is larger than $\sqrt{n}$.

Solution. Note that $X \sim \mathrm{Bin}(n, 1/6)$. Recalling that a $\mathrm{Bin}(n, p)$ random variable has expectation $np$ and variance $np(1-p)$, we have

$E[X] = \frac{n}{6}, \quad \mathrm{Var}(X) = \frac{5n}{36}.$

By Chebyshev's inequality,

$P\left(\left|X - \frac{n}{6}\right| \ge \sqrt{n}\right) \le \frac{\mathrm{Var}(X)}{(\sqrt{n})^2} = \frac{\frac{5}{36} \cdot n}{n} = \frac{5}{36}.$
ST119: Probability 2 Lecture notes for Week 4

Covariance
We now introduce the covariance, a concept that is closely related to the variance. Rather than being associated to a single random variable $X$, the covariance is associated to a pair of random variables $X$ and $Y$. Roughly speaking, the covariance of $X$ and $Y$ measures the degree to which these random variables tend to vary together.

Definition 1. Let $X$ and $Y$ be two random variables. The covariance of $X$ and $Y$ is defined as

$\mathrm{Cov}(X, Y) = E\left[(X - E[X])(Y - E[Y])\right],$

whenever this expectation exists.

Before we look at examples, let us list a few first properties of the covariance. First, the covariance is symmetric, that is,

$\mathrm{Cov}(X, Y) = \mathrm{Cov}(Y, X).$

Second, we clearly have

$\mathrm{Cov}(X, X) = \mathrm{Var}(X).$

Third, in the same way that there are two formulas to compute the variance (namely, $E[(X - E[X])^2]$ and $E[X^2] - (E[X])^2$), there is also an alternative formula for the covariance:

Proposition 1. Let $X$ and $Y$ be two random variables. Then,

$\mathrm{Cov}(X, Y) = E[XY] - E[X]E[Y].$

Proof.

$\mathrm{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY - XE[Y] - YE[X] + E[X]E[Y]]$
$= E[XY] - E[X]E[Y] - E[X]E[Y] + E[X]E[Y] = E[XY] - E[X]E[Y].$

Example 1. Let $X$ and $Y$ be discrete random variables with joint probability mass function

$p_{X,Y}(-2, -2) = \frac{1}{6}, \quad p_{X,Y}(-2, 2) = \frac{1}{3}, \quad p_{X,Y}(3, 2) = \frac{1}{2}.$

Let us find the covariance of $X$ and $Y$. We need to compute $E[XY]$, $E[X]$ and $E[Y]$, which we now do:

$E[XY] = (-2)(-2) \cdot p_{X,Y}(-2, -2) + (-2) \cdot 2 \cdot p_{X,Y}(-2, 2) + 3 \cdot 2 \cdot p_{X,Y}(3, 2) = \frac{7}{3},$
$E[X] = (-2) \cdot p_{X,Y}(-2, -2) + (-2) \cdot p_{X,Y}(-2, 2) + 3 \cdot p_{X,Y}(3, 2) = \frac{1}{2},$
$E[Y] = (-2) \cdot p_{X,Y}(-2, -2) + 2 \cdot p_{X,Y}(-2, 2) + 2 \cdot p_{X,Y}(3, 2) = \frac{4}{3}.$

Hence,

$\mathrm{Cov}(X, Y) = E[XY] - E[X]E[Y] = \frac{7}{3} - \frac{1}{2} \cdot \frac{4}{3} = \frac{5}{3}.$

Example 2. In each of the items below, a set $A \subseteq \mathbb{R}^2$ is equal to the union of the red squares.

[Figure: three panels. In (a), the red squares are $[-3, -1]^2$ and $[1, 3]^2$; in (b), they are $[-3, -1] \times [1, 3]$ and $[1, 3] \times [-3, -1]$; in (c), all four of these squares appear.]

In each item, consider a pair of jointly continuous random variables $X$ and $Y$ whose joint probability density function is given by $f_{X,Y}(x, y) = C$ if $(x, y) \in A$ (and 0 otherwise). Find $C$ and $\mathrm{Cov}(X, Y)$ in each case.

Solution.

(a) Since

$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dx\, dy = \int_{-3}^{-1}\int_{-3}^{-1} C\, dx\, dy + \int_{1}^{3}\int_{1}^{3} C\, dx\, dy = 8C$

and this double integral should equal 1, we obtain $C = \frac{1}{8}$. It is easy to see that the distributions of $X$ and $Y$ are both symmetric about 0, so $E[X] = E[Y] = 0$. We compute

$E[XY] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} xy \cdot f_{X,Y}(x, y)\, dx\, dy = \frac{1}{8}\int_{-3}^{-1}\int_{-3}^{-1} xy\, dx\, dy + \frac{1}{8}\int_{1}^{3}\int_{1}^{3} xy\, dx\, dy = 4.$

We then have

$\mathrm{Cov}(X, Y) = E[XY] - E[X]E[Y] = 4.$

(b) Similarly to above, we have $C = \frac{1}{8}$, and again by symmetry, $E[X] = E[Y] = 0$. Next,

$E[XY] = \frac{1}{8}\int_{1}^{3}\int_{-3}^{-1} xy\, dx\, dy + \frac{1}{8}\int_{-3}^{-1}\int_{1}^{3} xy\, dx\, dy = -4.$

Hence,

$\mathrm{Cov}(X, Y) = -4$

in this case.

(c) Here we get $C = \frac{1}{16}$, and simple symmetry considerations show that $E[XY] = E[X] = E[Y] = 0$, so

$\mathrm{Cov}(X, Y) = 0.$

In fact, in this last case it can be shown that $X$ and $Y$ are independent, which, as we will see next, always implies that the covariance between them is zero.

Proposition 2. Let $X$ and $Y$ be two independent random variables. Then,

$\mathrm{Cov}(X, Y) = 0.$

Proof. Recall that if $X$ and $Y$ are independent, then $E[XY] = E[X]E[Y]$, so

$\mathrm{Cov}(X, Y) = E[XY] - E[X]E[Y] = 0.$
The following example shows that the converse to the above proposition is in general not true.

Example 3 (Zero covariance does not imply independence). Let $X$ be a discrete random variable with probability mass function

$p_X(-2) = p_X(-1) = p_X(1) = p_X(2) = \frac{1}{4}.$

Let $Y = X^2$. Then,

$\mathrm{Cov}(X, Y) = E[XY] - E[X]E[Y] = E[X^3] - E[X]E[X^2].$

Note that the distribution of $X$ is symmetric about 0, and the same holds for $X^3$, so $E[X] = E[X^3] = 0$, so by the above we get $\mathrm{Cov}(X, Y) = 0$. However, $X$ and $Y$ are not independent; this can for example be shown by noting that

$P(X = 1, Y = 4) = 0 \neq P(X = 1) \cdot P(Y = 4).$

We will now see that the covariance is linear in each of its two arguments.

Proposition 3 (Covariance of sums). For random variables $X$, $Y$ and $Z$ and real numbers $a, b$, we have

$\mathrm{Cov}(aX + bY, Z) = \mathrm{Cov}(Z, aX + bY) = a \cdot \mathrm{Cov}(X, Z) + b \cdot \mathrm{Cov}(Y, Z).$

We leave the verification of this property as an exercise. Note that linearity can be used repeatedly to give

$\mathrm{Cov}\left(\sum_{i=1}^{m} a_i X_i,\ \sum_{j=1}^{n} b_j Y_j\right) = \sum_{i=1}^{m}\sum_{j=1}^{n} a_i b_j \cdot \mathrm{Cov}(X_i, Y_j).$

Variance of sums
Let us recall that, if $X$ and $Y$ are independent random variables, then $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$. This equality need not be true when $X$ and $Y$ are not independent. We now see a formula that holds in all cases.

Proposition 4 (Variance of sums). For random variables $X$ and $Y$, we have

$\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X, Y).$

More generally, for random variables $X_1, \ldots, X_n$, we have

$\mathrm{Var}\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} \mathrm{Var}(X_i) + 2 \sum_{1 \le i < j \le n} \mathrm{Cov}(X_i, X_j).$

Proof. We prove the second formula, since it is more general:

$\mathrm{Var}\left(\sum_{i=1}^{n} X_i\right) = \mathrm{Cov}\left(\sum_{i=1}^{n} X_i,\ \sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n}\sum_{j=1}^{n} \mathrm{Cov}(X_i, X_j)$

$= \sum_{i=1}^{n} \mathrm{Cov}(X_i, X_i) + \sum_{\substack{1 \le i, j \le n, \\ i \neq j}} \mathrm{Cov}(X_i, X_j) = \sum_{i=1}^{n} \mathrm{Var}(X_i) + 2 \sum_{1 \le i < j \le n} \mathrm{Cov}(X_i, X_j).$

Note that, in the above proposition, if the $X_i$'s are independent, then $\mathrm{Cov}(X_i, X_j) = 0$ for all $i \neq j$, so we recover the fact that the variance of the sum is equal to the sum of the variances.
Example 4. A group of $N$ people, all of whom have hats, throw their hats on the floor of a room and shuffle them. Then each person takes a hat from the floor at random (assume that the shuffling is perfect, in the sense that at the end of the procedure, all the possible allocations of hats to the people are equally likely). Let $X$ denote the number of people who recover their own hats. Find the expectation and variance of $X$.

Solution. We enumerate the people from 1 to $N$, and for $i = 1, \ldots, N$, define the random variable

$X_i = \begin{cases} 1 & \text{if person } i \text{ recovers their own hat;} \\ 0 & \text{otherwise.} \end{cases}$

We have $X = \sum_{i=1}^{N} X_i$, so we can compute the expectation and variance of $X$ using the formulas

$E[X] = \sum_{i=1}^{N} E[X_i], \quad \mathrm{Var}(X) = \sum_{i=1}^{N} \mathrm{Var}(X_i) + \sum_{i \neq j} \mathrm{Cov}(X_i, X_j).$

In order to apply these formulas, we need $E[X_i]$ and $\mathrm{Var}(X_i)$ for each $i$, and $\mathrm{Cov}(X_i, X_j)$ for each $i \neq j$. Since person $i$ is equally likely to recover any of the $N$ hats, we have

$P(X_i = 1) = \frac{1}{N}.$

Hence, $X_i \sim \mathrm{Ber}(1/N)$ and we have

$E[X_i] = \frac{1}{N}, \quad \mathrm{Var}(X_i) = \frac{1}{N} \cdot \frac{N-1}{N} = \frac{N-1}{N^2}.$

Also, for $i \neq j$ we note that $X_i X_j$ is also a Bernoulli random variable (it only attains the values 0 and 1) and

$E[X_i X_j] = P(X_i X_j = 1) = P(X_i = 1, X_j = 1) = \frac{(N-2)!}{N!} = \frac{1}{N(N-1)}$

(the third equality follows from the fact that there are $(N-2)!$ outcomes where people $i$ and $j$ recover their own hats, and $N!$ outcomes in total).
Then,

$\mathrm{Cov}(X_i, X_j) = E[X_i X_j] - E[X_i]E[X_j] = \frac{1}{N(N-1)} - \frac{1}{N^2} = \frac{1}{N^2(N-1)}.$

We are now ready to compute

$E[X] = \sum_{i=1}^{N} \frac{1}{N} = N \cdot \frac{1}{N} = 1$

and

$\mathrm{Var}(X) = \sum_{i=1}^{N} \frac{N-1}{N^2} + \sum_{i \neq j} \frac{1}{N^2(N-1)} = \frac{N-1}{N} + N(N-1) \cdot \frac{1}{N^2(N-1)} = 1.$
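A simulation of the hat-matching example (a sketch, not part of the notes), estimating both $E[X]$ and $\mathrm{Var}(X)$ via random permutations:

    import random

    random.seed(0)
    N, trials = 20, 50_000

    counts = []
    for _ in range(trials):
        hats = list(range(N))
        random.shuffle(hats)                                 # uniformly random allocation
        counts.append(sum(hats[i] == i for i in range(N)))   # people with their own hat

    mean = sum(counts) / trials
    var = sum((c - mean) ** 2 for c in counts) / trials
    print(f"E[X] ~ {mean:.3f}, Var(X) ~ {var:.3f}")  # both should be close to 1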

Cauchy-Schwarz inequality
Next, we look at an important inequality.
Theorem 1 (Cauchy-Schwarz inequality). Given two random variables $X$ and $Y$, we have

$|\mathrm{Cov}(X, Y)| \le \sqrt{\mathrm{Var}(X) \cdot \mathrm{Var}(Y)}.$

Proof. We give a classical proof of this inequality, which is often seen in Linear Algebra. Define

$\varphi(t) := \mathrm{Var}(X - tY), \quad t \in \mathbb{R}.$

We have

$\varphi(t) = \mathrm{Cov}(X - tY, X - tY) = \mathrm{Cov}(X, X) - 2t\,\mathrm{Cov}(X, Y) + t^2\,\mathrm{Cov}(Y, Y) = \mathrm{Var}(X) - 2t\,\mathrm{Cov}(X, Y) + t^2\,\mathrm{Var}(Y).$

Note that $\varphi(t) \ge 0$ for all $t$, since the variance is always non-negative. Now, when a quadratic function $t \mapsto at^2 + bt + c$ is non-negative for all $t \in \mathbb{R}$, the discriminant $b^2 - 4ac$ must be less than or equal to zero. This gives

$(-2\,\mathrm{Cov}(X, Y))^2 - 4\,\mathrm{Var}(Y)\,\mathrm{Var}(X) \le 0,$

which immediately gives the desired inequality.

Correlation coefficient
The covariance is a useful quantity that describes how two random variables vary together. However, it has one disadvantage: it is not scale invariant. To explain what this means, suppose that $X$ and $Y$ are two random variables, both measuring lengths in meters. Assume that $U$ and $V$ give the same measurements as $X$ and $Y$, respectively, but in centimeters, that is, $U = 100X$ and $V = 100Y$. Then,

$\mathrm{Cov}(U, V) = \mathrm{Cov}(100X, 100Y) = 100 \cdot 100 \cdot \mathrm{Cov}(X, Y) = 10^4 \cdot \mathrm{Cov}(X, Y).$

This means that changing the scale also changes the covariance. To obtain a scale-invariant quantity, we make the following definition.

Definition 2. The correlation coefficient between two random variables $X$ and $Y$ is defined as

$\rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X) \cdot \mathrm{Var}(Y)}}.$

Proposition 5. Let $X$ and $Y$ be random variables. We have

$-1 \le \rho(X, Y) \le 1.$

Moreover, for any $a, b, c, d \in \mathbb{R}$ with $a, c > 0$, we have

$\rho(aX + b, cY + d) = \rho(X, Y).$

Proof. The first statement is an immediate consequence of the Cauchy-Schwarz inequality. For the second statement, first note that

$\mathrm{Cov}(aX + b, cY + d) = ac \cdot \mathrm{Cov}(X, Y) + c \cdot \mathrm{Cov}(b, Y) + a \cdot \mathrm{Cov}(X, d) + \mathrm{Cov}(b, d).$

We now note that the covariance between a constant and any other random variable is equal to zero. For instance,

$\mathrm{Cov}(b, Y) = E[bY] - E[b] \cdot E[Y] = b \cdot E[Y] - b \cdot E[Y] = 0.$

We then obtain from the above that

$\mathrm{Cov}(aX + b, cY + d) = ac \cdot \mathrm{Cov}(X, Y).$

Then,

$\rho(aX + b, cY + d) = \frac{\mathrm{Cov}(aX + b, cY + d)}{\sqrt{\mathrm{Var}(aX + b) \cdot \mathrm{Var}(cY + d)}} = \frac{ac \cdot \mathrm{Cov}(X, Y)}{\sqrt{a^2\,\mathrm{Var}(X) \cdot c^2\,\mathrm{Var}(Y)}}$

$= \frac{ac \cdot \mathrm{Cov}(X, Y)}{|a| \cdot |c| \cdot \sqrt{\mathrm{Var}(X) \cdot \mathrm{Var}(Y)}} = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X) \cdot \mathrm{Var}(Y)}} = \rho(X, Y).$

It is useful to observe that $\rho(X, X) = 1$ and $\rho(X, -X) = -1$. Two random variables $X$ and $Y$ are called uncorrelated if $\rho(X, Y) = 0$. Note that if $X$ and $Y$ are independent, then they are uncorrelated, but the converse is not in general true.

Example 5. Let $X_1$ and $X_2$ be two independent random variables with expectation 0 and variance 1.

1. Find $\rho(X_1, X_2)$.

2. Let $Y_1 := X_1$ and $Y_2 := cX_1 + \sqrt{1 - c^2} \cdot X_2$, where $c \in [-1, 1]$. Determine $E[Y_2]$, $E[Y_2^2]$ and $\rho(Y_1, Y_2)$. (This gives a way to create two random variables with correlation $c$ out of two independent ones.)

Solution.

1. Since $X_1$ and $X_2$ are independent, we have $\rho(X_1, X_2) = 0$.

2. We start with

$E[Y_2] = E\left[cX_1 + \sqrt{1 - c^2} \cdot X_2\right] = c \cdot \underbrace{E[X_1]}_{=0} + \sqrt{1 - c^2} \cdot \underbrace{E[X_2]}_{=0} = 0.$

Next,

$E[Y_2^2] = E\left[\left(cX_1 + \sqrt{1 - c^2} \cdot X_2\right)^2\right] = E\left[c^2 X_1^2 + (1 - c^2) X_2^2 + 2c\sqrt{1 - c^2} \cdot X_1 X_2\right]$

$= c^2 \cdot E[X_1^2] + (1 - c^2) \cdot E[X_2^2] + 2c\sqrt{1 - c^2} \cdot E[X_1] \cdot E[X_2] = c^2 \cdot E[X_1^2] + (1 - c^2) \cdot E[X_2^2].$

Noting that

$1 = \mathrm{Var}(X_1) = E[X_1^2] - E[X_1]^2 = E[X_1^2], \quad 1 = \mathrm{Var}(X_2) = E[X_2^2] - E[X_2]^2 = E[X_2^2],$

we obtain that $E[Y_2^2] = c^2 + 1 - c^2 = 1$. For the correlation coefficient, we first compute

$\mathrm{Cov}(Y_1, Y_2) = \mathrm{Cov}\left(X_1, cX_1 + \sqrt{1 - c^2}\, X_2\right) = c \cdot \mathrm{Var}(X_1) + \sqrt{1 - c^2} \cdot \mathrm{Cov}(X_1, X_2) = c.$

Then, since $\mathrm{Var}(Y_1) = \mathrm{Var}(Y_2) = 1$,

$\rho(Y_1, Y_2) = \frac{\mathrm{Cov}(Y_1, Y_2)}{\sqrt{\mathrm{Var}(Y_1) \cdot \mathrm{Var}(Y_2)}} = c.$
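The construction in item 2 is a standard recipe for producing a pair of random variables with a prescribed correlation $c$ out of two independent ones. A Python sketch (not from the notes; taking $X_1, X_2$ standard normal is our own illustrative choice), estimating the sample correlation:

    import math
    import random

    random.seed(0)
    c, n = 0.6, 100_000
    pairs = []
    for _ in range(n):
        x1, x2 = random.gauss(0, 1), random.gauss(0, 1)     # independent N(0, 1)
        pairs.append((x1, c * x1 + math.sqrt(1 - c * c) * x2))

    # sample correlation coefficient of (Y1, Y2)
    mean1 = sum(y1 for y1, _ in pairs) / n
    mean2 = sum(y2 for _, y2 in pairs) / n
    cov = sum((y1 - mean1) * (y2 - mean2) for y1, y2 in pairs) / n
    var1 = sum((y1 - mean1) ** 2 for y1, _ in pairs) / n
    var2 = sum((y2 - mean2) ** 2 for _, y2 in pairs) / n
    print("sample correlation:", cov / math.sqrt(var1 * var2))  # close to 0.6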

ST119: Probability 2 Lecture notes for Week 5

Conditional distributions: discrete case


Recall that, for two events $A, B$ with $P(B) > 0$, we define the conditional probability of $A$ given $B$ as

$P(A|B) = \frac{P(A \cap B)}{P(B)}.$

Also recall that this can be interpreted as our “updated degree of belief that $A$ occurs, after we are told that $B$ occurs”. For example, when rolling a fair die, the conditional probability that the result is 2, conditional on the result being an even number, is 1/3.
We now want to extend the notion of conditioning to random variables, starting with the discrete case.

Definition 1. Let $X$ and $Y$ be discrete random variables. The conditional probability mass function of $X$ given $Y$ is the function

$p_{X|Y}(x|y) = P(X = x \mid Y = y) = \frac{p_{X,Y}(x, y)}{p_Y(y)}, \quad \text{for } x \in \mathrm{supp}(X),\ y \in \mathrm{supp}(Y).$
Hence, $p_{X|Y}(x|y)$ is the probability that $X = x$, given that $Y = y$. Let us make some comments about this definition:

- Note that, for all $y \in \mathrm{supp}(Y)$,

$\sum_x p_{X|Y}(x|y) = \sum_x \frac{p_{X,Y}(x, y)}{p_Y(y)} = \frac{p_Y(y)}{p_Y(y)} = 1.$

So, for each fixed $y$, the function that maps $x$ to $p_{X|Y}(x|y)$ is a probability mass function. We refer to the distribution associated to this probability mass function as the distribution of $X$ given that $Y = y$.

- The following equalities are often very useful:

$p_X(x) = \sum_y p_{X,Y}(x, y) = \sum_y p_{X|Y}(x|y) \cdot p_Y(y). \quad (1)$

- Recall that two discrete random variables $X$ and $Y$ are independent if and only if $p_{X,Y}(x, y) = p_X(x) p_Y(y)$ for all $x, y$. This is equivalent to saying that $p_{X|Y}(x|y) = p_X(x)$ for all $x$ and all $y \in \mathrm{supp}(Y)$.

Example 1. Let $X$ and $Y$ be discrete with joint probability mass function given by the following table:

x\y |   1   |   2   |   3
 0  | 1/12  | 1/12  | 1/12
 1  |   0   |  1/2  |  1/4

Let us find $p_{X|Y}(x|y)$ for all choices of $x$ and $y$. Note first that

$p_Y(1) = \frac{1}{12} + 0 = \frac{1}{12}, \quad p_Y(2) = \frac{1}{12} + \frac{1}{2} = \frac{7}{12}, \quad p_Y(3) = \frac{1}{12} + \frac{1}{4} = \frac{1}{3}.$

Then,

$p_{X|Y}(0|1) = \frac{1/12}{1/12} = 1, \quad p_{X|Y}(1|1) = \frac{0}{1/12} = 0,$
$p_{X|Y}(0|2) = \frac{1/12}{7/12} = \frac{1}{7}, \quad p_{X|Y}(1|2) = \frac{1/2}{7/12} = \frac{6}{7},$
$p_{X|Y}(0|3) = \frac{1/12}{1/3} = \frac{1}{4}, \quad p_{X|Y}(1|3) = \frac{1/4}{1/3} = \frac{3}{4}.$

Note that $p_{X|Y}(x|y)$ is the proportion that $p_{X,Y}(x, y)$ represents of the total mass given by $p_Y(y)$ (in the table: the proportion that the entry in position $(x, y)$ represents of the total mass of its column).

Example 2. Let $Y \sim \mathrm{Bin}(n, p)$. Let $q \in (0, 1)$ and suppose that $X$ is a random variable with $\mathrm{supp}(X) = \{0, \ldots, n\}$ and such that, for each $y \in \{0, \ldots, n\}$,

$p_{X|Y}(x|y) = \binom{y}{x} \cdot q^x \cdot (1 - q)^{y-x} \quad \text{for } x \in \{0, \ldots, y\}, \text{ and } 0 \text{ otherwise.}$

We can understand the random variable $X$ as the result of a two-step procedure:

- we first sample $Y \sim \mathrm{Bin}(n, p)$; say its value is $y$;
- we use this $y$ to now sample a $\mathrm{Bin}(y, q)$ random variable.

Let us find pX . For each x 2 {0, . . . , n}, as in (1) we have


n
X
pX (x) = pX|Y (x|y) · pY (y).
y=0

Noting that pX|Y (x|y) = 0 if y < x, the above becomes


n
X n
X -
y! n!
y=x
pX|Y (x|y) · pY (y) =
y=x
x!(y x)!
· q x · (1 q)y x
·
y!(n
- y)!
· py · (1 p)n y <

L
Xn
n! (n x)!
= · (pq)x · · (1 q)y x
· py x
· (1 p)n y .
x!(n x)! y=x
(y x)!(n y)!

Changing ` = y x, this becomes


n x
X (n x)!
n!
· (pq)x · · (1 q)` · p` · (1 p)n ` x
x!(n x)! `=0
`!(n x `)! ~

/q,)
(p(1 -
n!
= · (pq)x · (p(1 q) + (1 p)) n x
x!(n x)!
n!
= · (pq)x · (1 pq)n x .
x!(n x)!

This shows that X ⇠ Bin(n, pq).

Example 3. Let $X$ and $Y$ be independent with $X \sim \mathrm{Poi}(\lambda)$ and $Y \sim \mathrm{Poi}(\mu)$. Let $Z = X + Y$. Find $p_{X|Z}(x|z)$.

Solution. Recall from the Week 3 lecture notes (first proposition) that $Z \sim \mathrm{Poi}(\lambda + \mu)$. Note that, given that $Z = z$, $X$ can take any value in $\{0, \ldots, z\}$. Hence, $p_{X|Z}(x|z)$ is defined for all pairs $(x, z)$ such that $z \in \mathbb{N}_0$ and $x \in \{0, \ldots, z\}$. We compute

$p_{X|Z}(x|z) = \frac{p_{X,Z}(x, z)}{p_Z(z)} = \frac{p_{X,Y}(x, z - x)}{p_Z(z)} = \frac{p_X(x) \cdot p_Y(z - x)}{p_Z(z)},$

where the last equality follows from the independence between $X$ and $Y$. The right-hand side equals

$\frac{\frac{\lambda^x}{x!} e^{-\lambda} \cdot \frac{\mu^{z-x}}{(z-x)!} e^{-\mu}}{\frac{(\lambda+\mu)^z}{z!} e^{-(\lambda+\mu)}} = \frac{z!}{x!(z-x)!} \cdot \left(\frac{\lambda}{\lambda+\mu}\right)^x \cdot \left(\frac{\mu}{\lambda+\mu}\right)^{z-x}.$

This shows that, conditional on $Z = z$, the distribution of $X$ is $\mathrm{Bin}(z, \lambda/(\lambda+\mu))$.
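This conditional-distribution fact can also be observed empirically. A sketch (not part of the notes; the Poisson sampler uses Knuth's classical method): simulate independent Poissons, keep the pairs with $X + Y = z$, and compare the conditional mean of $X$ with the binomial mean $z\lambda/(\lambda+\mu)$:

    import math
    import random

    random.seed(0)

    def poisson(lam):
        # Knuth's method: multiply uniforms until the product drops below e^(-lam).
        L, k, prod = math.exp(-lam), 0, 1.0
        while True:
            prod *= random.random()
            if prod <= L:
                return k
            k += 1

    lam, mu, z = 2.0, 3.0, 4
    xs = []
    for _ in range(200_000):
        x, y = poisson(lam), poisson(mu)
        if x + y == z:
            xs.append(x)

    print("conditional mean of X given Z = 4:", sum(xs) / len(xs))
    print("z * lam / (lam + mu):", z * lam / (lam + mu))  # = 1.6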

Conditional distributions: continuous case


Definition 2. Let $X$ and $Y$ be jointly continuous random variables. The conditional probability density function of $X$ given $Y$ is the function

$f_{X|Y}(x|y) = \frac{f_{X,Y}(x, y)}{f_Y(y)},$

defined for all $x \in \mathbb{R}$ and all $y$ such that $f_Y(y) > 0$.

As in the discrete case, let us make some comments about this definition.

- As a function of $x$ for fixed $y$, $f_{X|Y}(x|y)$ is a probability density function, since

$\int_{-\infty}^{\infty} f_{X|Y}(x|y)\, dx = \int_{-\infty}^{\infty} \frac{f_{X,Y}(x, y)}{f_Y(y)}\, dx = \frac{1}{f_Y(y)} \int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dx = 1.$

We refer to the distribution that corresponds to this probability density function as the distribution of $X$ given that $Y = y$. Note that this is just a manner of speaking, since formally $\{Y = y\}$ is an event of probability zero, so we are not really allowed to condition on it.

- While in the discrete case we had $p_{X|Y}(x|y) = \frac{P(X = x,\, Y = y)}{P(Y = y)}$, it does not make sense to write such a quotient for $f_{X|Y}(x|y)$ (both the numerator and the denominator are zero!).

- Assume that $A \subseteq \mathbb{R}^2$ is a set of the form $A = \{(x, y) : x_1 \le x \le x_2,\ a(x) \le y \le b(x)\}$, i.e. the region between the graphs of two functions $a$ and $b$ over an interval $[x_1, x_2]$. Then, we have

$P((X, Y) \in A) = \int_{x_1}^{x_2} \int_{a(x)}^{b(x)} f_{X,Y}(x, y)\, dy\, dx = \int_{x_1}^{x_2} f_X(x) \int_{a(x)}^{b(x)} f_{Y|X}(y|x)\, dy\, dx.$

- As in the discrete case, $X$ and $Y$ are independent if and only if $f_{X|Y}(x|y) = f_X(x)$ for all $y$ such that $f_Y(y) > 0$.

The bivariate normal distribution


We will now study an important kind of distribution, that is a two-dimensional version of the
normal distribution. Remembering this joint p.d.f. is not required for the exam.
Definition 3. Two random variables X and Y have a bivariate normal distribution with
parameters
    µX ∈ R,    µY ∈ R,    σX² > 0,    σY² > 0,    ρ ∈ (−1, 1)

if the joint probability density function of X and Y is:

    fX,Y(x, y) = [1/(2π σX σY √(1 − ρ²))] · exp{ −[1/(2(1 − ρ²))] [ ((x − µX)/σX)² + ((y − µY)/σY)² − 2ρ (x − µX)(y − µY)/(σX σY) ] }

for x, y ∈ R (the notation exp{z} means e^z). We write

    (X, Y) ~ N(µX, σX², µY, σY², ρ).

We will now see several properties of this joint distribution, starting with the marginals.
Proposition 1. If (X, Y) ~ N(µX, σX², µY, σY², ρ), then X ~ N(µX, σX²) and Y ~ N(µY, σY²).

Proof. This proof is not examinable. We will only prove the statement for X (since the
statement for Y is treated in the same way, or even better, by symmetry). In order to render
the expression for fX,Y(x, y) more manageable, we let w = (y − µY)/σY, so that

    ((x − µX)/σX)² + ((y − µY)/σY)² − 2ρ (x − µX)(y − µY)/(σX σY) = ((x − µX)/σX)² + w² − 2ρ w (x − µX)/σX.    (2)
We now complete the square as follows:
    w² − 2ρ w (x − µX)/σX = ( w − ρ(x − µX)/σX )² − ρ² ((x − µX)/σX)².

With this at hand, the right-hand side of (2) becomes


    ( w − ρ(x − µX)/σX )² + (1 − ρ²) ((x − µX)/σX)².
We then have

    fX,Y(x, y) = [1/(2π σX σY √(1 − ρ²))] · exp{ −[1/(2(1 − ρ²))] [ ( w − ρ(x − µX)/σX )² + (1 − ρ²)((x − µX)/σX)² ] }

    = [1/(2π σX σY √(1 − ρ²))] · exp{ −[1/(2(1 − ρ²))] ( w − ρ(x − µX)/σX )² − (x − µX)²/(2σX²) }.    (3)

Now, we compute the marginal probability density function of X using
    fX(x) = ∫_{−∞}^{∞} fX,Y(x, y) dy,

replacing the expression for fX,Y(x, y) by what we obtained in (3), and using the substitution w = (y − µY)/σY (which gives dy = σY dw). We then obtain

    fX(x) = [1/(2π σX σY √(1 − ρ²))] · exp{ −(x − µX)²/(2σX²) } · σY ∫_{−∞}^{∞} exp{ −[1/(2(1 − ρ²))] ( w − ρ(x − µX)/σX )² } dw

    = [1/(√(2π) σX)] · exp{ −(x − µX)²/(2σX²) } · [ 1/√(2π(1 − ρ²)) ∫_{−∞}^{∞} exp{ −( w − ρ(x − µX)/σX )²/(2(1 − ρ²)) } dw ]

    = [1/(√(2π) σX)] · exp{ −(x − µX)²/(2σX²) },

since the bracketed factor is the integral of a normal density, hence equal to 1.

As you may have guessed, the value ρ ∈ (−1, 1) is the correlation coefficient between X and Y,
but we will not prove that now. Next, we consider conditional density functions:

Proposition 2. Assume that (X, Y) ~ N(µX, σX², µY, σY², ρ). Then, conditionally on Y = y,
the distribution of X is

    N( µX + ρ (σX/σY)(y − µY),  (1 − ρ²) σX² ).

Proof. This proof is not examinable. Throughout this proof, we will denote expressions
that depend on constants and on y, but not on x, by C1 , C2 , etc. With this convention, we
can write
    fX|Y(x|y) = fX,Y(x, y)/fY(y) = C1 · fX,Y(x, y)

    = C2 · exp{ −[1/(2(1 − ρ²))] [ ((x − µX)/σX)² + ((y − µY)/σY)² − 2ρ (x − µX)(y − µY)/(σX σY) ] }

    = C2 · exp{ −[1/(2(1 − ρ²))] [ x²/σX² − 2xµX/σX² + µX²/σX² + ((y − µY)/σY)² − 2ρ x(y − µY)/(σX σY) + 2ρ µX(y − µY)/(σX σY) ] }.

The terms µX²/σX², ((y − µY)/σY)² and 2ρ µX(y − µY)/(σX σY) do not involve x, so they can be absorbed into the constant:

    = C3 · exp{ −[1/(2(1 − ρ²)σX²)] [ x² − 2xµX − 2ρ (σX/σY) x(y − µY) ] }

    = C3 · exp{ −[1/(2(1 − ρ²)σX²)] [ x² − 2x( µX + ρ (σX/σY)(y − µY) ) ] }.

We now complete the square
    x² − 2x( µX + ρ (σX/σY)(y − µY) ) = ( x − ( µX + ρ (σX/σY)(y − µY) ) )² + C̃,

where C̃ again does not depend on x. We thus obtained


    fX|Y(x|y) = C4 · exp{ −( x − ( µX + ρ (σX/σY)(y − µY) ) )² / (2(1 − ρ²)σX²) }.    (4)
We now observe that


    1 = ∫_{−∞}^{∞} fX|Y(x|y) dx = C4 ∫_{−∞}^{∞} exp{ −( x − ( µX + ρ (σX/σY)(y − µY) ) )² / (2(1 − ρ²)σX²) } dx.    (5)

On the other hand, integrating the density of N( µX + ρ (σX/σY)(y − µY), (1 − ρ²)σX² ), we have

    1 = [1/√(2π(1 − ρ²)σX²)] ∫_{−∞}^{∞} exp{ −( x − ( µX + ρ (σX/σY)(y − µY) ) )² / (2(1 − ρ²)σX²) } dx.    (6)

Comparing (5) and (6) shows that

    C4 = 1/√(2π(1 − ρ²)σX²),

which together with (4) completes the proof.


Proposition 3. Assume that (X, Y) ~ N(µX, σX², µY, σY², ρ). Then, X and Y are independent if and only if ρ = 0.

Proof. Recall that X and Y are independent if and only if fX,Y(x, y) = fX(x)fY(y) holds for
all x, y. Since fY(y) is (strictly) positive when Y is normally distributed, we have fX,Y(x, y) =
fX(x)fY(y) for all x, y if and only if fX|Y(x|y) = fX(x) for all x, y. By the above proposition,
we can see that this holds if and only if ρ = 0.
Remark 1. We have seen earlier that for two random variables X and Y, we have that

X, Y independent implies ρX,Y = 0, but ρX,Y = 0 does not imply X, Y independent.

We now see that, if we also know that (X, Y) follows a bivariate normal distribution, then the
equivalence holds, that is, X and Y are independent if and only if ρX,Y = 0.

ST119: Probability 2 Lecture notes for Week 6

Conditional expectations: definition and properties


Last week, we defined the conditional distribution of X given that Y = y, both in the
discrete and in the continuous setting. We observed that in the discrete setting, the func-
tion x 7! pX|Y (x|y) is a probability mass function (for any fixed y), and in the continuous
setting, x 7! fX|Y (x|y) is a probability density function (for any fixed y). The next natural
step to take is to use this probability mass/density function to define an expectation.

Definition 1. 1. Let X and Y be discrete random variables. The conditional expecta-


tion of X given that Y = y is defined, for all y ∈ supp(Y), as

    E[X|Y = y] = Σ_x x · pX|Y(x|y),

(assuming, for cases where the sum is infinite, that it is well defined).

2. Let X and Y be jointly continuous random variables. The conditional expectation


of X given that Y = y is defined, for all y for which fY (y) > 0, as
    E[X|Y = y] = ∫_{−∞}^{∞} x · fX|Y(x|y) dx

(assuming that the integral converges).

Example 1. Let X and Y be jointly continuous with fX,Y(x, y) = 2 if 0 < x < y < 1 (and 0
otherwise). Let us compute E[X|Y = y] for all y ∈ (0, 1). We have that, for all y ∈ (0, 1),

    fY(y) = ∫_0^y fX,Y(x, y) dx = 2y,

so
    fX|Y(x|y) = fX,Y(x, y)/fY(y) = 2/(2y) = 1/y,    0 < x < y,

which gives

    E[X|Y = y] = ∫_0^y x · (1/y) dx = (1/y) · (y²/2) = y/2.
Let us observe that the conditional expectation E[X|Y = y] satisfies the same properties as
the (unconditional) expectation you have already studied. For instance, for a function g, we
have, in the discrete case,
    E[g(X)|Y = y] = Σ_x g(x) · pX|Y(x|y)

and in the continuous case,


    E[g(X)|Y = y] = ∫_{−∞}^{∞} g(x) · fX|Y(x|y) dx.

The following is a very important and intuitive fact.

Proposition 1. Let X and Y be random variables. Then, if X and Y are independent,

E[X|Y = y] = E[X]

for all y such that the left-hand side is defined.

Proof. Assume that X and Y are independent. In the discrete case, we have pX,Y (x, y) =
pX (x)pY (y), so
    pX|Y(x|y) = pX,Y(x, y)/pY(y) = pX(x)pY(y)/pY(y) = pX(x),

so

    E[X|Y = y] = Σ_x x · pX|Y(x|y) = Σ_x x · pX(x) = E[X].

The continuous case is treated similarly.

Example 2. Assume that X and Y are independent random variables with X ~ Poi(λ)
and Y ~ Poi(µ). Let Z = X + Y. In the example in page 3 of the Week 5 lecture notes, we
have seen that, for all z ∈ N0,

    pX|Z(x|z) = \binom{z}{x} (λ/(λ + µ))^x (µ/(λ + µ))^{z−x},    x ∈ {0, . . . , z},

that is, the distribution of X given that Z = z is Bin(z, λ/(λ + µ)). This implies that

    E[X|Z = z] = (λ/(λ + µ)) · z.

Let us now compute E[Z|X = x]. We have

E[Z|X = x] = E[X + Y |X = x] = E[X|X = x] + E[Y |X = x] = x + E[Y ] = x + µ.

Similarly,
E[Z|Y = y] = y + λ.

Finally, we give the definition of conditional variance:


Definition 2. Let X and Y be random variables. Then, the conditional variance of X
given that Y = y is defined as

    Var(X|Y = y) = E[ (X − E[X|Y = y])² | Y = y ] = E[X²|Y = y] − (E[X|Y = y])².

Computing expectations by conditioning


In ST118, you have seen that if we have events A1, . . . , An such that Ω = A1 ∪ . . . ∪ An,
and Ai ∩ Aj = ∅ when i ≠ j, then for any event B, we have

    P(B) = Σ_{i=1}^{n} P(B|Ai) · P(Ai).

This is sometimes called the law of total probability. Analogously, in this module, we have
formulas of the kind

    pX(x) = Σ_y pX|Y(x|y) · pY(y)

in the discrete case, and

    fX(x) = ∫_{−∞}^{∞} fX|Y(x|y) · fY(y) dy
in the continuous case. We will now see an analogous formula for computing conditional ex-
pectations.

Theorem 1. 1. Let X and Y be discrete random variables. We have


    E[X] = Σ_{y∈supp(Y)} E[X|Y = y] · pY(y)

(assuming, for cases where the sum is infinite, that it is well defined).

2. Let X and Y be jointly continuous random variables. We have


    E[X] = ∫_{−∞}^{∞} E[X|Y = y] · fY(y) dy

(assuming that the integral converges).

Proof. In the discrete case,


    E[X] = Σ_x x · pX(x) = Σ_x x · Σ_y pX|Y(x|y) · pY(y)
    = Σ_y ( Σ_x x · pX|Y(x|y) ) · pY(y) = Σ_y E[X|Y = y] · pY(y).

In the continuous case,


    E[X] = ∫_{−∞}^{∞} x · fX(x) dx = ∫_{−∞}^{∞} x ∫_{−∞}^{∞} fX|Y(x|y) · fY(y) dy dx
    = ∫_{−∞}^{∞} ( ∫_{−∞}^{∞} x · fX|Y(x|y) dx ) fY(y) dy = ∫_{−∞}^{∞} E[X|Y = y] · fY(y) dy.

Example 3. Let X and Y be jointly continuous, with Y ~ Unif(0, 1) and, for y ∈ (0, 1),

    fX|Y(x|y) = 1/y,    0 < x < y.

To find E[X], we can compute:


    E[X] = ∫_0^1 E[X|Y = y] · fY(y) dy = ∫_0^1 E[X|Y = y] dy.    (∗)

We compute the conditional expectation,


    E[X|Y = y] = ∫_0^y x · fX|Y(x|y) dx = ∫_0^y x · (1/y) dx = y/2,

so the right-hand side of (∗) is

    ∫_0^1 (y/2) dy = 1/4.
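A quick simulation of this two-stage experiment (our addition): draw Y uniform on (0, 1), then draw X uniform on (0, Y), and average the results.

    import random

    reps = 200_000
    total = 0.0
    for _ in range(reps):
        y = random.random()           # Y ~ Unif(0, 1)
        x = random.uniform(0.0, y)    # X | Y = y ~ Unif(0, y), i.e. fX|Y(x|y) = 1/y
        total += x
    print(total / reps)   # close to E[X] = 1/4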

Example 4. Problem: find the expected amount of money I spend in a visit to the bookstore,
given the following information. Every time I go there, I buy a random number N of books.
The prices of the books I buy are independent random variables (also independent of N ) and
all follow the same distribution. The expectation of N is µ and the expectation of the price
of a book I buy is ν.
Solution. It is assumed that the prices of the books I pick all follow the same distribution.
Let X1 , X2 , . . . be independent random variables, all following this distribution. Then, the
amount of money I spend in a visit to the bookstore is the random variable
    X = Σ_{i=1}^{N} Xi.

Note that X is a sum with a random number N of terms. Also, for the case where N = 0, let
us adopt the convention that the sum Σ_{i=1}^{0} Xi means zero (this is reasonable: usually, when
we write "Σ_{i=a}^{b}", we mean that the sum goes over all i's that satisfy a ≤ i ≤ b; this is empty
when a > b, so the sum is void). We can then compute

    E[X] = Σ_{n=0}^{∞} E[X|N = n] · pN(n) = Σ_{n=1}^{∞} E[X|N = n] · pN(n) = Σ_{n=1}^{∞} E[ Σ_{i=1}^{N} Xi | N = n ] · pN(n).

On the event {N = n}, we can replace Σ_{i=1}^{N} Xi by Σ_{i=1}^{n} Xi, so the right-hand side above equals

    Σ_{n=1}^{∞} E[ Σ_{i=1}^{n} Xi | N = n ] · pN(n).

Noting that N is independent of X1, X2, . . ., we have that

    E[ Σ_{i=1}^{n} Xi | N = n ] = E[ Σ_{i=1}^{n} Xi ] = Σ_{i=1}^{n} E[Xi] = n · ν.

Hence,

    E[X] = Σ_{n=1}^{∞} n · ν · pN(n) = ν Σ_{n=1}^{∞} n · pN(n) = ν · E[N] = ν · µ.
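Here is a hedged simulation of the bookstore example (our addition, with an illustrative choice of distributions: N uniform on {0, . . . , 6}, so µ = 3, and Exp(1/2) prices, so ν = 2; the argument above only uses the two means).

    import random

    reps = 100_000
    total = 0.0
    for _ in range(reps):
        n_books = random.randint(0, 6)   # N, with mu = E[N] = 3
        # total spent: a sum with a random number of terms, each of mean nu = 2
        total += sum(random.expovariate(0.5) for _ in range(n_books))
    print(total / reps, 3 * 2)   # empirical mean vs nu * mu = 6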

Example 5 (Correlation in bivariate normal). Recall from Week 5 that X and Y follow
a bivariate normal distribution with parameters µX, µY, σX², σY² and ρ if they have joint
probability density function

    fX,Y(x, y) = [1/(2π σX σY √(1 − ρ²))] · exp{ −[1/(2(1 − ρ²))] [ ((x − µX)/σX)² + ((y − µY)/σY)² − 2ρ (x − µX)(y − µY)/(σX σY) ] }.

We will now show that ρ is the correlation coefficient of X and Y. To this end, we must
compute

    Cov(X, Y)/√(Var(X)Var(Y)) = (E[XY] − µX µY)/(σX σY),    (1)

where the equality follows from X ~ N(µX, σX²), Y ~ N(µY, σY²), which we found in Proposition 1 in the Week 5 notes. It remains to compute E[XY], which we do by conditioning:

    E[XY] = ∫_{−∞}^{∞} E[XY|Y = y] · fY(y) dy = ∫_{−∞}^{∞} E[Xy|Y = y] · fY(y) dy
    = ∫_{−∞}^{∞} y · E[X|Y = y] · fY(y) dy.    (2)

In Proposition 2 in the Week 5 notes, we showed that conditionally on Y = y, the distribution
of X is N( µX + ρ (σX/σY)(y − µY), (1 − ρ²)σX² ). In particular, E[X|Y = y] = µX + ρ (σX/σY)(y − µY),
so (2) becomes

    ∫_{−∞}^{∞} y · ( µX + ρ (σX/σY)(y − µY) ) · fY(y) dy = E[ µX Y + ρ (σX/σY)(Y² − µY Y) ]
    = µX · E[Y] + ρ (σX/σY)(E[Y²] − µY E[Y])
    = µX µY + ρ (σX/σY) · σY² = µX µY + ρ σX σY.

This shows that the right-hand side of (1) equals ρ, as required.
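A simulation sketch of this fact (our addition), using the standard construction of a bivariate normal pair from two independent standard normals: X = µX + σX Z1 and Y = µY + σY(ρZ1 + √(1 − ρ²)Z2). The parameter values are illustrative.

    import math, random

    muX, muY, sX, sY, rho = 1.0, -2.0, 2.0, 0.5, 0.6
    n = 200_000
    xs, ys = [], []
    for _ in range(n):
        z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
        xs.append(muX + sX * z1)
        ys.append(muY + sY * (rho * z1 + math.sqrt(1 - rho**2) * z2))

    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    vx = sum((x - mx) ** 2 for x in xs) / n
    vy = sum((y - my) ** 2 for y in ys) / n
    print(cov / math.sqrt(vx * vy))   # empirical correlation, close to rho = 0.6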


ST119: Probability 2 Lecture notes for Week 7

(Conditional) expectations and prediction


We will conclude our study of conditional expectation by saying a few words about the purpose
of the (conditional) expectation for predicting the value of a random variable.
Expectations and prediction. Let us forget about conditional expectations for a second
and go back to the expectation E[X] of a random variable X. The following is an interesting
property of E[X]:
Proposition 1. Let X be a random variable. The value of c ∈ R that minimizes E[(X − c)²]
is c = E[X].

Proof. We compute

    E[(X − c)²] = E[X²] − 2cE[X] + E[c²] = c² − 2E[X] · c + E[X²].

This is a quadratic function of c which is minimized at the value of c for which the derivative
with respect to c (that is, 2c − 2E[X]) equals zero, namely c = E[X].
We can interpret this proposition as follows. Suppose that we don’t know the value that X will
attain, but we want to choose a deterministic, fixed constant c which is a “good prediction”
for X. There are different possible ways to say what a good prediction is, but in the present
setting, let us say that we would like to guarantee that the squared difference (X − c)² is
not too large. Since this squared difference is also a random variable, we choose c that minimizes its expectation. By the above proposition, the optimal choice for c is precisely c = E[X].
Conditional expectations and prediction. Now suppose that we have two random vari-
ables, X and Y , and that we observe the value of Y , but not the value of X. Suppose that, for
each possible result Y = y that we observe, we want to come up with a good prediction for X
(in the sense of the previous paragraph). This means that now, instead of choosing a single
value c 2 R, we want to choose a function c(y), which gives the prediction for X when Y = y.
The following proposition tells us that the best choice is taking c(y) = E[X|Y = y].
Proposition 2. Let X and Y be random variables. Among all functions c(y), the one that
minimizes E[(X − c(Y))²] is c(y) = E[X|Y = y].

Proof. Assume that X and Y are discrete. We have

    E[(X − c(Y))²] = Σ_y E[(X − c(Y))² | Y = y] · pY(y)
    = Σ_y E[(X − c(y))² | Y = y] · pY(y).

Now, repeating the proof of the previous proposition, we obtain that for each fixed value of y,
the number c(y) that minimizes E[(X − c(y))² | Y = y] is c(y) = E[X|Y = y]. This concludes
the proof in the discrete case, and the jointly continuous case is treated similarly.

The Gamma function and the Gamma distribution


Our goal for the rest of this week is to study a mathematical model, called the Poisson process,
for systems such as queues, in which new “arrivals” occur from time to time. Before we define
Poisson processes, we recall the Gamma distribution.


The Gamma function is defined by

    Γ(w) := ∫_0^∞ x^{w−1} · e^{−x} dx,    w ∈ (0, ∞).

A key property of the Gamma function is given by the following:

Proposition 3. For any m ∈ N, we have

    Γ(m) = (m − 1)!

(with the convention that 0! = 1).


A random variable X has Gamma distribution with parameters w and λ if it has probability
density function

    fX(x) = (λ^w/Γ(w)) · x^{w−1} · e^{−λx} if x > 0,    and fX(x) = 0 otherwise.

We write X ~ Gamma(w, λ).

Recall that the Gamma distribution with parameters w = 1 and λ is equal to the exponential
distribution with parameter λ.

Proposition 4. Let T1, . . . , Tn be independent random variables, all with the exponential
distribution with parameter λ. Then,

    Z := Σ_{i=1}^{n} Ti ~ Gamma(n, λ).

Proof. We argue by induction on n. The base case n = 1 holds because Gamma(1, λ) is precisely the Exp(λ) distribution, as recalled above.


Assume that the result holds for n ∈ N, and let T1, . . . , Tn+1 be independent, all with the
exponential distribution with parameter λ. Define

    Y = Σ_{i=1}^{n} Ti,    Z = Σ_{i=1}^{n+1} Ti,    so that Z = Y + Tn+1.

By the induction hypothesis, we have that Y ~ Gamma(n, λ), so that

    fY(y) = (λ^n/Γ(n)) · y^{n−1} · e^{−λy},    y > 0.
Note that Tn+1 is independent of T1, . . . , Tn, so it is independent of Y. Hence, by the convolution
formula,

    fZ(z) = fY+Tn+1(z) = ∫_0^z fY(x) · fTn+1(z − x) dx
    = ∫_0^z (λ^n/Γ(n)) x^{n−1} e^{−λx} · λ e^{−λ(z−x)} dx
    = (λ^{n+1}/Γ(n)) · e^{−λz} ∫_0^z x^{n−1} dx
    = (λ^{n+1}/Γ(n)) · e^{−λz} · (z^n/n) = (λ^{n+1}/Γ(n + 1)) · e^{−λz} · z^n,

2
where the last step follows from the fact that Γ(n + 1) = nΓ(n) = n!; so Z has the probability
density function of the Gamma(n + 1, λ) distribution.
We readily obtain the following consequence:
Corollary 5. Let m, n ∈ N. Let Z1, Z2 be independent random variables, with Z1 ~ Gamma(m, λ)
and Z2 ~ Gamma(n, λ). Then,

    Z1 + Z2 ~ Gamma(m + n, λ).
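A small check of Proposition 4 (our addition, with illustrative values of n and λ): simulate Z = T1 + ··· + Tn and compare its empirical mean and variance with those of Gamma(n, λ), which are n/λ and n/λ² respectively.

    import random

    lam, n, reps = 1.5, 4, 100_000
    samples = [sum(random.expovariate(lam) for _ in range(n)) for _ in range(reps)]
    mean = sum(samples) / reps
    var = sum((s - mean) ** 2 for s in samples) / reps
    print(mean, n / lam)        # both close to 2.667
    print(var, n / lam ** 2)    # both close to 1.778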

Poisson process
Consider a bank branch where clients arrive at random times. Starting time from t = 0 (say,
noon), let
0 < S1 < S2 < · · ·
denote the successive arrival times of clients. Define

    T1 = S1,    T2 = S2 − S1,    . . . ,    Tn = Sn − Sn−1,    . . .

so that, for n ≥ 2, Tn is the time elapsed between the arrival of client n − 1 and client n. Also
note that

    Sn = Σ_{i=1}^{n} Ti,    n ∈ N.

We make the assumption that

    T1, T2, . . . are independent, identically distributed (iid) ~ Exp(λ).

Next, define

    Nt = number of clients that have arrived by time t,    t ≥ 0,

so Nt is:

    Nt = 0 if 0 ≤ t < S1;    Nt = 1 if S1 ≤ t < S2;    Nt = 2 if S2 ≤ t < S3;    and so on.
We can plot Nt as a step function of t: it equals 0 on [0, S1) and jumps by 1 at each of the arrival times S1, S2, S3, S4, . . .

Adopting the convention that S0 = 0, a more elegant way to express Nt is:

    Nt = max{n : Sn ≤ t}.

We are now ready to give the

3
Definition 1. As above, let T1, T2, . . . be independent, all ~ Exp(λ), let S0 = 0, Sn = Σ_{i=1}^{n} Ti
for n ∈ N, and let Nt = max{n : Sn ≤ t} for t ≥ 0. The family of random variables

    (Nt)_{t≥0}

is called a Poisson process with intensity λ.

A family of random variables indexed by time is called a stochastic process. The Poisson
process is the only stochastic process we will encounter in this module. Later you will see many
more, such as Markov chains, Brownian motion, martingales, diffusion processes etc.
Why 'Poisson process'?

Theorem 1. Let (Nt : t ≥ 0) be a Poisson process with intensity λ. Then, for any t ≥ 0,
Nt ~ Poi(λt).

Proof. We take random variables T1, T2, . . . and S0, S1, . . . as in the definition of the Poisson
process. First note that

    pNt(0) = P(Nt = 0) = P(T1 > t) = 1 − FT1(t) = 1 − (1 − e^{−λt}) = e^{−λt}.

Now fix j ∈ N. We observe that

    pNt(j) = P(Nt = j) = P(Sj ≤ t, Sj+1 > t)

(to justify the last equality: the event that there are exactly j arrivals at time t is equal to
the event that arrival number j occurred before or at time t, and arrival j + 1 occurred after
time t). Recalling that Sj+1 = Sj + Tj+1, the probability on the right-hand side above equals

    P(Sj ≤ t, Sj + Tj+1 > t) = P(Sj ≤ t, Tj+1 > t − Sj).

This corresponds to the event that (Sj, Tj+1) belongs to the set {(x, y) : 0 ≤ x ≤ t, y > t − x}.

So we have to integrate fSj,Tj+1 over this set to obtain the desired probability. To this end,
let us find fSj,Tj+1. Note that Sj and Tj+1 are independent. We know that Tj+1 ~ Exp(λ);
moreover, by the corollary seen earlier, we have Sj ~ Gamma(j, λ). We thus have

    fSj,Tj+1(x, y) = fSj(x) · fTj+1(y) = ( (λ^j x^{j−1}/Γ(j)) · e^{−λx} ) · ( λ · e^{−λy} ).

4
We are now ready to compute

    P(Sj ≤ t, Tj+1 > t − Sj) = ∫_0^t ∫_{t−x}^{∞} fSj,Tj+1(x, y) dy dx
    = (λ^j/Γ(j)) ∫_0^t x^{j−1} · e^{−λx} · ( ∫_{t−x}^{∞} λ e^{−λy} dy ) dx
    = (λ^j/Γ(j)) ∫_0^t x^{j−1} · e^{−λx} · e^{−λ(t−x)} dx
    = (λ^j/Γ(j)) · e^{−λt} ∫_0^t x^{j−1} dx
    = (λ^j/Γ(j)) · e^{−λt} · (t^j/j) = ((λt)^j/j!) · e^{−λt},

where in the last equality we used Γ(j) = (j − 1)!. We have now proved that pNt(j) is equal to
the probability mass function of the Poisson(λt) distribution evaluated at j, as desired.
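A simulation sketch of Theorem 1 (our addition, illustrative parameters): generate Exp(λ) interarrival times, count arrivals up to time t, and compare the empirical mean and variance of Nt with λt.

    import random

    lam, t, reps = 2.0, 3.0, 50_000

    def count_arrivals():
        # sum interarrival times until they exceed t; the count so far is N_t
        s, n = 0.0, 0
        while True:
            s += random.expovariate(lam)
            if s > t:
                return n
            n += 1

    counts = [count_arrivals() for _ in range(reps)]
    mean = sum(counts) / reps
    var = sum((c - mean) ** 2 for c in counts) / reps
    print(mean, var, lam * t)   # a Poi(lam*t) variable has mean = variance = 6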
The following is a stronger version of the above theorem. We will not give a proof in this module.
Theorem 2. Let (Nt : t ≥ 0) be a Poisson process with intensity λ. Then, for any sequence of
times 0 < t1 < · · · < tk:

    Nt1 ~ Poi(λt1),    Nt2 − Nt1 ~ Poi(λ(t2 − t1)),    · · ·    Ntk − Ntk−1 ~ Poi(λ(tk − tk−1)),

and these random variables are independent.

ST119: Probability 2 Lecture notes for Week 8

Moments and moment-generating function


We give a name to expectations of powers of a random variable.
Definition. Let X be a random variable. For k ∈ N, we define the kth moment of X as

    E[X^k]

whenever the expectation exists. In this case, we say that X has finite kth moment.

We calculate moments of random variables using the familiar formula


    E[X^k] = Σ_x x^k · pX(x) if X is discrete;    E[X^k] = ∫_{−∞}^{∞} x^k · fX(x) dx if X is continuous.

Let us see a simple example.


Example. Let X ~ Exp(λ); recall that this means that fX(x) = λe^{−λx} for x > 0. Let us
compute the moments of X. We have already computed the expectation (which is the first
moment),

    E[X] = 1/λ.

Now let k ∈ {2, 3, . . .}. We have

    E[X^k] = ∫_{−∞}^{∞} x^k · fX(x) dx = ∫_0^∞ x^k · λe^{−λx} dx.    (1)

We could compute the right-hand side by integrating by parts repeatedly, but there is a
shortcut. Recall that Y ~ Gamma(w, λ) if it has density fY(x) = (λ^w/Γ(w)) x^{w−1} e^{−λx}, for x > 0. So we
take w = k + 1, so that

    1 = ∫_{−∞}^{∞} fY(x) dx = ∫_0^∞ (λ^{k+1}/Γ(k + 1)) · x^k · e^{−λx} dx.

The idea is now to manipulate the integral in (1) to make the Gamma density appear:

    ∫_0^∞ x^k · λe^{−λx} dx = (Γ(k + 1)/λ^k) · ∫_0^∞ (λ^{k+1}/Γ(k + 1)) · x^k · e^{−λx} dx,

and the last integral equals 1.

In conclusion,

    E[X^k] = Γ(k + 1)/λ^k = k!/λ^k.
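A Monte Carlo check of this formula (our addition, with illustrative λ and k):

    import random

    lam, k, reps = 2.0, 3, 200_000
    est = sum(random.expovariate(lam) ** k for _ in range(reps)) / reps
    print(est)           # empirical estimate of E[X^3]
    print(6 / lam ** 3)  # k!/lam^k = 3!/2^3 = 0.75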

What are moments good for? We have already seen some important uses of the expectation
and the variance (which is calculated from the expectation and the second moment). Namely,
they appear in inequalities that give us information about the distribution of random variables,
when exact formulas are difficult to obtain. Recall in particular that Markov's inequality stated
that, for a non-negative random variable X, we have P(X ≥ x) ≤ E[X]/x. The following theorem
says that this bound can be improved (at least asymptotically) in case X has finite higher-order
moments.

Theorem (Markov’s inequality, higher-order moments). Let X be a non-negative ran-
dom variable with finite kth moment. Then,

E[X k ]
P(X > x) 6 , x > 0.
xk
Proof. Let Y = X k . We write
P(X > x) = P(X k > xk ) = P(Y > xk ).
By Markov’s inequality, the right-hand side is smaller than or equal to
E[Y ] E[X k ]
= .
xk xk

Note that the upper bound for P(X ≥ x) obtained above is E[X^k]/x^k, whereas the one from the
standard Markov's inequality is E[X]/x. For large values of x, E[X^k]/x^k is much smaller than E[X]/x, so
the upper bound obtained from the above theorem is much better.
Definition. The moment-generating function of a random variable X is the function MX
defined as

    MX(t) := E[e^{tX}]

for all t ∈ R for which the expectation is well defined.

Before we explain the name “moment-generating function”, let us compute it in a few examples.
Example. Let X ~ Ber(p). Then, for any t ∈ R,

    MX(t) = E[e^{tX}] = e^{t·0} · pX(0) + e^{t·1} · pX(1) = 1 − p + e^t · p = 1 + p(e^t − 1).
= 1 p + et · p = 1 + p(et 1).
2
Example. Let X ⇠ N (0, 1), so that fX (x) = p12⇡ e x /2 for x 2 R. The moment-generating
function of X can be computed as follows:
Z 1 Z 1
1 x2 1 x2 +2tx
=>

MX (t) = E[e ] =
tX tx
e · p ·e 2 dx = p · e 2 dx.
1 2⇡ 2⇡ 1

We complete the square,

    x² − 2tx = (x − t)² − t²,

so the above becomes

    (1/√(2π)) ∫_{−∞}^{∞} e^{−((x−t)²−t²)/2} dx = e^{t²/2} · (1/√(2π)) ∫_{−∞}^{∞} e^{−(x−t)²/2} dx.

Since (1/√(2π)) e^{−(x−t)²/2} is fY(x), where Y ~ N(t, 1), we have

    (1/√(2π)) ∫_{−∞}^{∞} e^{−(x−t)²/2} dx = 1.

In conclusion,

    MX(t) = e^{t²/2},    t ∈ R.

The following theorem explains the reason for the name “moment-generating function”.

Theorem 1. Assume that MX exists in a neighborhood of 0, that is, there exists ε > 0 such
that for all t ∈ (−ε, ε) we have that MX(t) exists. Then, for k = 0, 1, . . ., the kth moment
of X exists, and we have

    E[X^k] = (d^k/dt^k) MX(t) |_{t=0}.

Although we do not give a full proof of this theorem, let us sketch the idea involved. Using the
Taylor expansion of the exponential function, we have

    MX(t) = E[e^{tX}] = E[ Σ_{k=0}^{∞} (tX)^k/k! ].

Now (after giving a rigorous justification, omitted here, for exchanging the expectation with
an infinite sum), the right-hand side becomes

    Σ_{k=0}^{∞} E[(tX)^k]/k! = 1 + E[X] · t + (E[X²]/2!) · t² + (E[X³]/3!) · t³ + · · · .

Differentiating the right-hand side k times with respect to t and evaluating the result at t = 0
gives the desired equality.
Before seeing an example of application of the above theorem, we prove some simple properties
of moment-generating functions.

Proposition 1. Assume that all expectations in the statement are well defined.

1. For any a, b ∈ R,

    M_{aX+b}(t) = e^{tb} · MX(at).

2. If X and Y are independent, then

MX+Y (t) = MX (t) · MY (t).

More generally, if X1, . . . , Xn are independent, then

    M_{X1+···+Xn}(t) = Π_{i=1}^{n} M_{Xi}(t).

Proof. 1. We compute

    M_{aX+b}(t) = E[e^{t(aX+b)}] = e^{tb} · E[e^{(at)X}] = e^{tb} · MX(at).

2. If X and Y are independent, then so are e^{tX} and e^{tY}. Therefore,

    M_{X+Y}(t) = E[e^{t(X+Y)}] = E[e^{tX} · e^{tY}] = E[e^{tX}] · E[e^{tY}] = MX(t) · MY(t).

3
Example. Let X ~ N(0, 1) again. Using an example in the lecture notes of Week 2 (top
of page 7), we have obtained that, if µ ∈ R and σ² > 0, then Y = σX + µ has the N(µ, σ²)
distribution. Using the above proposition, we can now compute

    MY(t) = M_{σX+µ}(t) = e^{tµ} · MX(σt).

Earlier we have found that MX(t) = e^{t²/2}, so we now obtain

    MY(t) = e^{tµ} · e^{(σt)²/2} = e^{tµ + σ²t²/2}.

Now that we know the moment-generating function of Y ~ N(µ, σ²), let us use Theorem 1
to re-obtain that E[Y] = µ and Var(Y) = σ². We first compute

    E[Y] = (d/dt) MY(t) |_{t=0} = (tσ² + µ) · exp{σ²t²/2 + µt} |_{t=0} = µ

and

    E[Y²] = (d²/dt²) MY(t) |_{t=0} = [ σ² · exp{σ²t²/2 + µt} + (tσ² + µ)² · exp{σ²t²/2 + µt} ] |_{t=0} = σ² + µ².

Hence,

    Var(Y) = E[Y²] − E[Y]² = σ² + µ² − µ² = σ².

Recall that it was much harder, earlier in the module, to compute this variance through
a direct computation using the definition. Now, if we want higher moments (such
as E[Y³], E[Y⁴], . . .), it is relatively straightforward to obtain them by further differentiating MY(t).
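If a computer algebra system such as sympy is available (an assumption on our part, not something the notes use), Theorem 1 can be checked mechanically:

    import sympy as sp

    t, mu, sigma = sp.symbols('t mu sigma', real=True)
    M = sp.exp(mu * t + sigma**2 * t**2 / 2)   # mgf of N(mu, sigma^2), as derived above

    m1 = sp.diff(M, t, 1).subs(t, 0)           # first moment
    m2 = sp.diff(M, t, 2).subs(t, 0)           # second moment
    print(sp.simplify(m1))           # mu
    print(sp.simplify(m2 - m1**2))   # sigma^2, the variance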

Moment-generating functions characterize distributions


The following theorem says that if two random variables have the same moment-generating
function, and moreover this moment-generating function is well defined in a neighborhood of
the origin, then these random variables have the same distribution.
Theorem 2. Let X and Y be two random variables. Assume that the moment-generating
functions of X and Y (denoted by MX and MY, respectively) exist and are finite on an interval
of the form (−ε, ε). Assume further that

    MX(t) = MY(t) for all t ∈ (−ε, ε).

Then, X and Y have the same distribution.


2
Example. Let X ⇠ N (µX , X ) and Y ⇠ N (µY , Y2 ) be independent. Let us find the moment-
generating function of X + Y . By the above example, we have:
⇢ 2 2 ⇢ 2 2
t X t Y
MX (t) = exp + tµX , MY (t) = exp + tµY .
2 2

Since X and Y are independent, we can use part (2) of Proposition 1 to obtain

    M_{X+Y}(t) = MX(t) · MY(t) = exp{µX t + σX²t²/2} · exp{µY t + σY²t²/2}
    = exp{(µX + µY)t + (σX² + σY²)t²/2}.

This shows that X + Y has the same moment-generating function as an N(µX + µY, σX² + σY²)
random variable. Since this moment-generating function is defined in a neighborhood of the
origin, we conclude that X + Y ~ N(µX + µY, σX² + σY²).
Although we will not cover it in this module, let us mention that there is an alternative to the
moment-generating function, called the characteristic function. For a random variable X,
it is defined as

    φX(t) = E[e^{itX}],    t ∈ R,

where i is the imaginary unit, i = √−1.

ST119: Probability 2 Lecture notes for Week 9

Law of Large Numbers


For the Law of Large Numbers, we consider sums of the form

    (X1 + · · · + Xn)/n,
where X1 , X2 , . . . are independent and identically distributed random variables. Sums of this
form are averages. For example, if
2, 3, 3, 3, 4, 6, 3, 3, 4, 2, 5, 1, 2, 1, 6, 5, 2, 6, 1, 6
are the results obtained from rolling a fair die twenty times, we would consider the average
    (1/20) · (2 + 3 + 3 + 3 + 4 + 6 + 3 + 3 + 4 + 2 + 5 + 1 + 2 + 1 + 6 + 5 + 2 + 6 + 1 + 6),
which in this case equals 3.4. Although the value of this average is still random, it will most
times be very close to 3.5 (especially if n is very large), which is the expected value of a single
die roll. This is what the Law of Large Numbers says: that an average of many independent
and identically distributed random variables ends up being close to the expectation of these
random variables.
Theorem. (Weak Law of Large Numbers). Let X1, X2, . . . be a sequence of independent
random variables, each with mean µ and variance σ². Define

    X̄n = (X1 + · · · + Xn)/n,    n ∈ N.

Then,

    for any ε > 0, we have P(|X̄n − µ| > ε) → 0 as n → ∞.    (1)

We can interpret (1) as follows. Let us say that we pick a number ε > 0, possibly very small,
and then we start declaring that two real numbers a and b are far from each other if the distance
between them (= |a − b|) is more than ε. If ε is tiny, this is a very demanding notion of 'far';
for instance, if ε = 10^{−10}, we are saying that two numbers that differ by more than 10^{−10} are
far from each other. No matter: even when we are this demanding, if we take n large enough
in our sampling X1, X2, . . . , Xn, then the average X̄n will be close to µ, with high probability.
It is important to note that no assumptions on the specific distributions of the Xi ’s are made.
We only require the Xi ’s to have finite mean and variance, for each i.
Proof. Let X̄n = (X1 + · · · + Xn)/n. Note that

    E[X̄n] = (1/n) Σ_{i=1}^{n} E[Xi] = (1/n) · nµ = µ

and (using independence)

    Var(X̄n) = (1/n²) Σ_{i=1}^{n} Var(Xi) = (1/n²) · nσ² = σ²/n.
Fix ε > 0. By Chebyshev's inequality, we have that

    P(|X̄n − µ| > ε) = P(|X̄n − E[X̄n]| > ε) ≤ Var(X̄n)/ε² = σ²/(ε²n) → 0 as n → ∞.

This implies that X̄n converges to µ in probability as n → ∞, which is the desired result.
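A quick illustration in code (our addition), using the die-rolling example from the start of this section: the running average of iid rolls approaches µ = 3.5.

    import random

    for n in (10, 1_000, 100_000):
        avg = sum(random.randint(1, 6) for _ in range(n)) / n
        print(n, avg)   # the averages concentrate around 3.5 as n grows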


Remark. The reason for the word 'weak' in 'Weak Law of Large Numbers' is that there is
also a Strong Law of Large Numbers. The difference between the two laws has to do with
different forms of convergence for sequences of random variables.
Recall that for a sequence of real numbers (xn : n ∈ N), there is a well established notion of
convergence, which you may have seen in Calculus or Analysis:

    xn → x as n → ∞ if for every ε > 0 there exists n0 ∈ N such that |xn − x| < ε for all n ≥ n0.

In contrast, for sequences of random variables, there are several possible definitions of convergence. Although this is not examinable, let us briefly look at two of these notions, just
so that you can see the difference between the Weak and the Strong Laws.
Given a sequence of random variables (Yn : n ∈ N) and a random variable Y, we say that (Yn)
converges to Y in probability if

    for every ε > 0, we have P(|Yn − Y| > ε) → 0 as n → ∞.

Note that the Weak Law of Large Numbers says that the sequence (Yn), defined by Yn = X̄n
for each n, converges in probability to the random variable Y that is constant, equal to E[X1].
Next, we say that a sequence of random variables (Yn : n ∈ N) converges almost surely to
a random variable Y if

    P( lim_{n→∞} Yn = Y ) = 1.

It is possible to prove that this form of convergence is stronger: if Yn converges to Y almost
surely, then it also converges to Y in probability. The Strong Law of Large Numbers is a
statement involving almost sure convergence, hence a stronger form of convergence than the
Weak Law.

Example 1. Let X1, X2, . . . be independent random variables, all uniformly distributed in
the interval (0, 1). For each n ∈ N, define

    Yn = (X1² + · · · + Xn²)/n.

Prove that there is a constant c ∈ R such that

    for any ε > 0, P(|Yn − c| > ε) → 0 as n → ∞.

Solution. The random variables Z1 = X1², Z2 = X2², . . . are independent and identically
distributed. Their expectation is equal to

    E[Z1] = E[X1²] = ∫_0^1 x² · fX1(x) dx = ∫_0^1 x² dx = 1/3.

They also have finite variance, since Var(Z1) = Var(X1²) = E[X1⁴] − E[X1²]² = ∫_0^1 x⁴ dx − (∫_0^1 x² dx)²
and both integrals are finite. By the Law of Large Numbers, we have that

    for any ε > 0, P(|Yn − 1/3| > ε) → 0 as n → ∞.

Example 2. We roll a die successively and deem each result of 6 a success (other results
are failures). Prove that the probability that we need to roll the die more than 7n times to
obtain n successes tends to zero, as n → ∞.
Solution. The number of times we need to roll the die to obtain n successes is X1 + · · · + Xn,
where X1 is the number of rolls until (and including) the first success, and for i > 1, Xi is the
number of rolls from (and not including) the (i − 1)-th success to (and including) the i-th
success. Here's an example, with n = 4:

    1, 5, 3, 4, 4, 6 (X1 = 6),    3, 6 (X2 = 2),    6 (X3 = 1),    5, 5, 2, 3, 3, 1, 6 (X4 = 7).

Note that X1, X2, . . . are independent and identically distributed, all with the geometric
distribution with parameter p = 1/6. This distribution has expectation equal to 6 and finite
variance. The Law of Large Numbers gives

    for any ε > 0, P( |(X1 + . . . + Xn)/n − 6| > ε ) → 0 as n → ∞.

We then write

    P(X1 + · · · + Xn > 7n) = P( (X1 + . . . + Xn)/n > 7 )
    = P( (X1 + . . . + Xn)/n − 6 > 1 )
    ≤ P( |(X1 + . . . + Xn)/n − 6| ≥ 1 ) → 0 as n → ∞,

as required.

Example 3. Suppose that we are interested in finding the area of a two-dimensional set A
contained in the square {(x, y) : −1 ≤ x, y ≤ 1}. The way to find this area exactly is to solve
the integral

    Area(A) = ∫∫_A 1 dx dy.
For certain sets A, this integral may be too hard or impossible to solve. We will see now how
we can use random variables and the Law of Large Numbers to approximate the value of the
area. This example is a rudimentary form of the Monte Carlo method.
Suppose that we have a computer that can generate a sequence X1, X2, . . . of independent
random variables, all with the Unif(−1, 1) distribution. We first create random vectors

    (X1, X2), (X3, X4), . . .

Note that these two-dimensional vectors are independent, all with the same probability density
function

    fX1,X2(x, y) = fX1(x) · fX2(y) = 1/4 if −1 ≤ x, y ≤ 1,    and 0 otherwise.

Next, define, for i ≥ 1,

    Zi = 1 if (X_{2i−1}, X_{2i}) ∈ A,    and Zi = 0 otherwise.

Alternatively, we could write

    Zi = h(X_{2i−1}, X_{2i}),

where h is the function given by h(x, y) = 1 if (x, y) ∈ A, and h(x, y) = 0 otherwise.
Note that Z1, Z2, . . . are independent and identically distributed, with expectation

    E[Z1] = E[h(X1, X2)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} h(x, y) · fX1,X2(x, y) dx dy
    = (1/4) ∫_{−1}^{1} ∫_{−1}^{1} h(x, y) dx dy = (1/4) ∫∫_A 1 dx dy = (1/4) · Area(A).

Now, the Law of Large Numbers tells us that, for n large,

    (Z1 + · · · + Zn)/n is close to E[Z1] = (1/4) · Area(A) with high probability.

Note that the quotient (Z1 + · · · + Zn)/n is obtained as follows: we take the two-dimensional
vectors (X1, X2), (X3, X4), . . ., (X_{2n−1}, X_{2n}), count how many of them lie inside A, and divide
the result by n. The outcome of this computation would, with high probability, be close to
the area we want, divided by 4.
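Here is the method in code (our addition), for the illustrative choice where A is the unit disk, whose area π we pretend not to know; any set with a membership test would work the same way.

    import random

    n, hits = 200_000, 0
    for _ in range(n):
        x, y = random.uniform(-1, 1), random.uniform(-1, 1)
        if x * x + y * y <= 1:      # h(x, y): is the point in A?
            hits += 1
    # hits/n estimates Area(A)/4, so multiply by 4
    print(4 * hits / n)   # close to pi = 3.14159...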

The Law of Large Numbers requires the random variables involved to be independent. Sometimes, even if independence doesn't hold, the method of proof of the Law of Large Numbers
(with Chebyshev's inequality) can be very useful. This is illustrated by the next example, which
revisits Exercise 5 of Week 3.
Example 4. In an N × N square grid (with N ≥ 4), we color each of the unit squares black
with probability 1/3 (and leave it uncolored with probability 2/3), independently. The whole
grid has (N − 3)² sub-grids of dimensions 4 × 4. Let YN be the proportion of these sub-grids
in which we see a particular fixed 4 × 4 picture (shown in the original notes; it has 4 black
and 12 uncolored squares).
Prove that the probability that YN exceeds 10^{−3} tends to zero as N → ∞.
Solution. Let SN be the set of all 4 × 4 sub-grids of the N × N grid. For each s ∈ SN, define

    Xs = 1 if s shows the depicted picture,    and Xs = 0 otherwise.

We then have

    YN = Σ_{s∈SN} Xs / (N − 3)².

We need to consider

    P(YN > 10^{−3}) = P( Σ_{s∈SN} Xs > 10^{−3} · (N − 3)² ).

We would now like to use Chebyshev's inequality, so we write the right-hand side above as

    P( Σ_{s∈SN} Xs − E[Σ_{s∈SN} Xs] > 10^{−3} · (N − 3)² − E[Σ_{s∈SN} Xs] ).    (2)

We then compute

    E[ Σ_{s∈SN} Xs ] = Σ_{s∈SN} E[Xs] = (N − 3)² · (1/3)⁴ · (2/3)¹² = (2¹²/3¹⁶) · (N − 3)²,

so (2) can be written as

    P( Σ_{s∈SN} Xs − E[Σ_{s∈SN} Xs] > a · (N − 3)² ),

where a = 10^{−3} − 2¹²/3¹⁶. Using a calculator, we see that a > 0. By Chebyshev's inequality, the
probability above is smaller than

    Var( Σ_{s∈SN} Xs ) / (a² (N − 3)⁴).    (3)
To prove that this tends to zero as N → ∞, we need to check that the variance in the
numerator grows slower than the denominator. For this, we do not need to compute the
variance exactly, but just to do some rough estimate. We start with

    Var( Σ_{s∈SN} Xs ) = Σ_{s∈SN} Var(Xs) + Σ_{s∈SN} Σ_{s′∈SN, s′≠s} Cov(Xs, Xs′).

If two sub-grids s and s′ do not overlap, then Xs and Xs′ are independent, so Cov(Xs, Xs′) = 0.
Hence, the right-hand side above is equal to

    Σ_{s∈SN} Var(Xs) + Σ_{s∈SN} Σ_{s′∈SN, s′≠s, s′∩s≠∅} Cov(Xs, Xs′).    (4)

Recall that

    Var(Xs) = E[Xs²] − E[Xs]²,    Cov(Xs, Xs′) = E[Xs Xs′] − E[Xs]E[Xs′]

and, since the Xs are Bernoulli random variables, all the expectations above are between 0
and 1. This gives

    Var(Xs) ≤ 1,    Cov(Xs, Xs′) ≤ 1

for any s and s′. Hence, the expression in (4) is smaller than

    Σ_{s∈SN} 1 + Σ_{s∈SN} Σ_{s′∈SN, s′≠s, s′∩s≠∅} 1 = (N − 3)² + Σ_{s∈SN} Ms,

where Ms is the number of 4 × 4 sub-grids that are different from s and overlap with s. It is not
hard to see that we can find some large constant C (not depending on N) such that Ms ≤ C
for all s. Then,

    Σ_{s∈SN} Ms ≤ C(N − 3)²,

and we conclude that

    Var( Σ_{s∈SN} Xs ) ≤ (C + 1)(N − 3)².

Then, the quotient in (3) is smaller than

    (C + 1)(N − 3)² / (a²(N − 3)⁴) → 0 as N → ∞.

ST119: Probability 2 Lecture notes for Week 10

Preliminaries for the Central Limit Theorem


Before we discuss the Central Limit Theorem, let us give some definitions and recall some
topics seen earlier.

Standardized version of a random variable


Definition. Let X be a random variable with finite expectation and variance. We define the
standardized version of X to be the random variable Z given by

    Z = (X − E[X]) / √Var(X).

We observe that, regardless of the distribution of X, we have E[Z] = 0 and Var(Z) = 1. Indeed,

    E[Z] = E[ (X − E[X])/√Var(X) ] = (1/√Var(X)) · (E[X] − E[X]) = 0,

    Var(Z) = E[Z²] = (1/Var(X)) · E[(X − E[X])²] = Var(X)/Var(X) = 1.

Normal distribution: review, table for distribution function


Recall that a random variable X has the N(µ, σ²) distribution if X is continuous with probability density
function

    fX(x) = (1/(√(2π)σ)) · e^{−(x−µ)²/(2σ²)},    x ∈ R.

We have E[X] = µ and Var(X) = σ². Also recall that, if X ~ N(µ, σ²) and Y = aX + b,
where a, b ∈ R, with a ≠ 0, then Y ~ N(aµ + b, a²σ²) (this is easy to check using moment-generating functions). In particular, we have that

    X ~ N(µ, σ²)  =⇒  (X − µ)/σ ~ N(0, 1).    (1)

Now let Z ~ N(0, 1). Recall that FZ is the cumulative distribution function of Z, given by

    FZ(x) = P(Z ≤ x) = ∫_{−∞}^{x} fZ(y) dy = ∫_{−∞}^{x} (1/√(2π)) · e^{−y²/2} dy,    x ∈ R.

As we have observed in class, there is no explicit expression for the anti-derivative of e^{−y²/2},
so there is no hope of getting a more informative formula for FZ(x). In practice, we use
approximations for FZ(x). The (approximate) values of FZ(x) for x belonging to the set

    {0, 0.01, 0.02, . . . , 3.98, 3.99}

are recorded in a standard normal table, in the last page of these notes. The table is read
as follows. We take x written in the decimal form a.bc, where a, b and c are integers. We then
find FZ(x) in the table by going to row a.b and column c.
You may now ask:

(i) How about x > 4?

(ii) How about x < 0?

The answer to (i) is easy: the value of FZ(x) for x ≥ 4 is so close to 1 that, in this four-digit
approximation, its value is rounded to 1. In fact, this already happens for x ≥ 3.9, as can be
seen from the last row of the table.
Regarding (ii), we can use symmetry to find FZ(x) for x < 0. Since fZ(x) = fZ(−x), we have
that

    x < 0  =⇒  FZ(x) = ∫_{−∞}^{x} fZ(y) dy = ∫_{−x}^{∞} fZ(y) dy = 1 − FZ(−x).

So, for example, we can do:

    FZ(−2.5) = 1 − FZ(2.5) ≈ 1 − 0.9938 = 0.0062.

We can also use the table to get approximations for P(a < X ≤ b), since

    P(a < X ≤ b) = P(X ≤ b) − P(X ≤ a) = FX(b) − FX(a);

also recall that, since the distribution of X is continuous, we have

    P(a < X ≤ b) = P(a ≤ X < b) = P(a ≤ X ≤ b) = P(a < X < b).


Next: using (1), it is possible to use the table to find FX(x) even when X ~ N(µ, σ²)
with (µ, σ²) different from (0, 1). Indeed, letting Z = (X − µ)/σ, we have:

    FX(x) = P(X ≤ x) = P( (X − µ)/σ ≤ (x − µ)/σ ) = P( Z ≤ (x − µ)/σ ) = FZ( (x − µ)/σ ).
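In code, one can avoid the table altogether by using the error function from Python's standard library (our aside, not part of the notes): FZ(x) = (1 + erf(x/√2))/2.

    import math

    def Phi(x):
        # standard normal cdf via the error function
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    print(Phi(1.73))    # ~ 0.9582, matching the table
    mu, sigma = 10.0, 2.0   # illustrative parameters
    print(Phi((13.0 - mu) / sigma))   # F_X(13) for X ~ N(10, 4): Phi(1.5) ~ 0.9332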

Central Limit Theorem


Let X1, X2, . . . be independent and identically distributed random variables with mean µ and
variance σ². Write

    Sn = X1 + · · · + Xn,    X̄n = Sn/n.

The Law of Large Numbers says that, for any ε > 0, we have

    P(|X̄n − µ| ≤ ε) → 1 as n → ∞.

This can be rewritten as

    P(|Sn − µn| ≤ εn) → 1 as n → ∞.    (2)

Noting that ε can be very small, we see that εn is very small compared to µn. Hence, we
interpret (2) as saying:

when n is large, with high probability, the difference between Sn and µn is very small compared to n.

Note that Sn − µn can be regarded as a "fluctuation": as a first-order approximation, the Law
of Large Numbers gives the prediction that Sn should be µn, so the difference between Sn and
this prediction is a fluctuation.
We observe that "small compared to n" is still quite vague. For example, the three functions f(n) = n^{1/4}, g(n) = n^{1/2}, and h(n) = n^{9/10} all grow much more slowly than n, and yet they
get far from each other as n gets large. We might then wish for more precise knowledge about
the difference Sn − µn.

From the Central Limit Theorem, we will find out that Sn − µn is typically comparable to √n
(in other words, (Sn − µn)/(σ√n) is typically not too large and not too small). Much more interestingly,
the theorem says that

    the distribution of (Sn − µn)/(σ√n) is close to N(0, 1).

This is true regardless of the distribution of X1, X2, . . .! The only important thing is that they
are independent and identically distributed, with finite mean and variance. In this sense, the
normal distribution can be seen as a sort of "universal attractor" in Probability Theory: it
arises as the limit of sequences of the form (Sn − µn)/(σ√n), regardless of the specific distribution that
we start with.
We are now ready to state the theorem. We will not see a proof in this module.
Theorem. (Central Limit Theorem) Let X1, X2, . . . be independent and identically distributed random variables, each with mean µ and variance σ² ≠ 0. Let Sn = X1 + · · · + Xn.
Then, for any x ∈ R, we have

    P( (Sn − µn)/(σ√n) ≤ x ) → (1/√(2π)) ∫_{−∞}^{x} e^{−y²/2} dy as n → ∞.

More generally, for any a, b ∈ R, a < b, we have

    P( a < (Sn − µn)/(σ√n) < b ) → (1/√(2π)) ∫_{a}^{b} e^{−y²/2} dy as n → ∞.

It is worth noting that (Sn − µn)/(σ√n) is the standardized version of the random variable Sn. We now
look at several applications of the Central Limit Theorem.
Example. Let X1 , X2 , . . . be independent, all with the Unif(0, 1) distribution. Estimate the
probability that X1 + · · · + X100 > 45.
Solution. Earlier we showed that for Y ~ Unif(a, b), we have E[Y] = (a + b)/2 and Var(Y) = (b − a)²/12.
Hence,

    µ = E[X1] = 1/2,    σ² = Var(X1) = 1/12.
Then,
    P(X1 + · · · + X100 > 45) = P( (X1 + · · · + X100 − (1/2)·100) / √((1/12)·100) > (45 − (1/2)·100) / √((1/12)·100) )
    ≈ P( Z > (45 − (1/2)·100) / √((1/12)·100) ),

where Z ~ N(0, 1). Simplifying the fraction and using a calculator, the above equals (approximately)

    P(Z > −1.73) = P(Z < 1.73) = FZ(1.73),

where the first equality follows by symmetry of the normal density about 0. Using the normal
table, we obtain

    FZ(1.73) ≈ 0.9582.
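A direct simulation of this example (our addition) agrees with the normal approximation:

    import random

    reps, hits = 100_000, 0
    for _ in range(reps):
        if sum(random.random() for _ in range(100)) > 45:
            hits += 1
    print(hits / reps)   # close to F_Z(1.73) ~ 0.9582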

Example. A fair die is rolled 12000 times. Use the Central Limit Theorem to find values
of a and b such that

    P(1900 < S ≤ 2200) ≈ (1/√(2π)) ∫_{a}^{b} e^{−x²/2} dx,
where S is the total number of 6’s obtained.

Solution. Define, for i = 1, . . . , 12000,

    Xi := 1 if the ith roll shows 6,    and Xi := 0 otherwise.

The Xi ’s are clearly independent and identically distributed. Each has Bernoulli distribution
with parameter p = 16 , so

1 5
µ = E[Xi ] = , 2
= Var(Xi ) = .
6 36
We have S = Σ_{i=1}^{12000} Xi. We now compute

    P(1900 < S ≤ 2200)
    = P( (1900 − (1/6)·12000)/√((5/36)·12000) < (S − (1/6)·12000)/√((5/36)·12000) ≤ (2200 − (1/6)·12000)/√((5/36)·12000) )
    ≈ P( (1900 − 2000)/√((5/36)·12000) < Z ≤ (2200 − 2000)/√((5/36)·12000) ),    (3)

where Z ~ N(0, 1). Simplifying the fractions and using a calculator, the above is approximately

    P(−2.45 < Z ≤ 4.89) = (1/√(2π)) ∫_{−2.45}^{4.89} e^{−y²/2} dy.
Example. Prove that

    lim_{n→∞} (1/e^n) Σ_{i=0}^{n} n^i/i! = 1/2.
(Are you sure this is a probability question?)

Solution. The trick is to think of the Poisson distribution. Recall that, if X ~ Poi(λ), then
it has probability mass function

    pX(i) = e^{−λ} λ^i/i!,    i ∈ N0.

Replacing λ by n, we see that

    (1/e^n) Σ_{i=0}^{n} n^i/i! = Σ_{i=0}^{n} e^{−n} n^i/i! = Σ_{i=0}^{n} pYn(i),

where Yn ~ Poi(n). The next step is to remember that Yn has the same distribution as

    Sn = X1 + · · · + Xn,

where X1, . . . , Xn are independent random variables, all with the Poi(1) distribution. Hence,

    Σ_{i=0}^{n} (n^i/i!) · e^{−n} = P(Yn ≤ n) = P(Sn ≤ n) = P( (X1 + · · · + Xn − µ·n)/√n ≤ (n − µ·n)/√n ).

Note that µ = E[X1] = 1, so (without having to bother about σ) we have (n − µ·n)/√n = 0 and then
the right-hand side above equals

    P( (X1 + · · · + Xn − µ·n)/√n ≤ 0 ).

By the Central Limit Theorem, as n → ∞, the above converges to

    (1/√(2π)) ∫_{−∞}^{0} e^{−y²/2} dy = 1/2.
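A numeric check of this limit (our addition); the sum is evaluated in log space to avoid overflow for large n.

    import math

    for n in (10, 100, 1_000, 10_000):
        # e^{-n} * n^i / i! computed as exp(-n + i*log(n) - log(i!))
        total = sum(math.exp(-n + i * math.log(n) - math.lgamma(i + 1))
                    for i in range(n + 1))
        print(n, total)   # this is P(Poi(n) <= n), slowly approaching 1/2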

Standard Normal cumulative distribution function
The value given in the table is FX(x) for X ~ N(0, 1).

x 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09

0.0 0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359
0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753
0.2 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141
0.3 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517
0.4 0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879
0.5 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224
0.6 0.7257 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.7549
0.7 0.7580 0.7611 0.7642 0.7673 0.7703 0.7734 0.7764 0.7794 0.7823 0.7852
0.8 0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.8133
0.9 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389
1.0 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.8621
1.1 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.8830
1.2 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015
1.3 0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.9177
1.4 0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.9319
1.5 0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9429 0.9441
1.6 0.9452 0.9463 0.9474 0.9484 0.9495 0.9505 0.9515 0.9525 0.9535 0.9545
1.7 0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616 0.9625 0.9633
1.8 0.9641 0.9649 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693 0.9699 0.9706
1.9 0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.9750 0.9756 0.9761 0.9767
2.0 0.9772 0.9778 0.9783 0.9788 0.9793 0.9798 0.9803 0.9808 0.9812 0.9817
2.1 0.9821 0.9826 0.9830 0.9834 0.9838 0.9842 0.9846 0.9850 0.9854 0.9857
2.2 0.9861 0.9864 0.9868 0.9871 0.9875 0.9878 0.9881 0.9884 0.9887 0.9890
2.3 0.9893 0.9896 0.9898 0.9901 0.9904 0.9906 0.9909 0.9911 0.9913 0.9916
2.4 0.9918 0.9920 0.9922 0.9925 0.9927 0.9929 0.9931 0.9932 0.9934 0.9936
2.5 0.9938 0.9940 0.9941 0.9943 0.9945 0.9946 0.9948 0.9949 0.9951 0.9952
2.6 0.9953 0.9955 0.9956 0.9957 0.9959 0.9960 0.9961 0.9962 0.9963 0.9964
2.7 0.9965 0.9966 0.9967 0.9968 0.9969 0.9970 0.9971 0.9972 0.9973 0.9974
2.8 0.9974 0.9975 0.9976 0.9977 0.9977 0.9978 0.9979 0.9979 0.9980 0.9981
2.9 0.9981 0.9982 0.9982 0.9983 0.9984 0.9984 0.9985 0.9985 0.9986 0.9986
3.0 0.9987 0.9987 0.9987 0.9988 0.9988 0.9989 0.9989 0.9989 0.9990 0.9990
3.1 0.9990 0.9991 0.9991 0.9991 0.9992 0.9992 0.9992 0.9992 0.9993 0.9993
3.2 0.9993 0.9993 0.9994 0.9994 0.9994 0.9994 0.9994 0.9995 0.9995 0.9995
3.3 0.9995 0.9995 0.9995 0.9996 0.9996 0.9996 0.9996 0.9996 0.9996 0.9997
3.4 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9998
3.5 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998
3.6 0.9998 0.9998 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999
3.7 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999
3.8 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999
3.9 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
