ST119: Probability 2 Lecture notes for Week 1
Recall that a probability measure is a function P : F → [0, 1] such that:
P(Ω) = 1;
P(⋃_{n=1}^∞ An) = ∑_{n=1}^∞ P(An) if the (An) are disjoint events.
Informally speaking, the distribution of X is the set of all possible answers to questions of the kind: “What is the probability that X [. . .]?”, where [. . .] can be any behavior we can wonder about (for instance: “What is the probability that X is between 2 and 5?”, “What is the probability that X is equal to 15?”, “What is the probability that X is negative?”).
In a more rigorous treatment of probability, we must constrain the expression (⋆) by saying that the sets A that can appear inside P(X ∈ A) must be “reasonable” (the technical term is Borel set). What exactly is meant by a “reasonable” (Borel) set is a topic in Measure Theory. It suffices to say that all sets we would naturally think of, and all sets that will be encountered in this module, are “reasonable” (Borel). Formally, the Borel σ-algebra is the smallest σ-algebra of subsets of R which contains the (open/closed/arbitrary) intervals.
Remark 1. We can characterise the distribution of a random variable (r.v.) in many ways, but one is to just give the values of
P(X ∈ A)
for A an arbitrary interval, or even just ones of the form (−∞, x].
Review: discrete random variables
Definition 3. A random variable X is discrete if there exists a finite or countably infinite set {x1, x2, . . .} of real numbers such that
P(X = xi) > 0 for all i, and ∑_{i≥1} P(X = xi) = 1.
The set {x1, x2, . . .} is called the support of X, and is denoted supp(X). The probability mass function (p.m.f.) of X is the function pX : supp(X) → [0, 1] given by pX(x) = P(X = x).
In the following example, we consider the cumulative distribution function of a discrete random
variable.
Example 1. Let X be a discrete random variable whose support is the set N0 = {0, 1, . . .}. Let us sketch the graph of the cumulative distribution function FX. First observe that FX is constant between consecutive points of the support, and has a jump of size pX(n) at each n ∈ N0.
[Figure: graph of FX, a non-decreasing step function increasing toward 1, with jumps at x = 0, 1, 2, 3, 4, . . . ]
Example 2. Assume that we roll a fair die. This is modelled by the probability space given by Ω = {1, . . . , 6} and P such that P({i}) = 1/6 for i = 1, . . . , 6. Let X be the random variable given by
X(ω) = 1 if ω = 6; X(ω) = 0 otherwise.
Then, X has a Bernoulli distribution, since its support is {0, 1}. The parameter is P(X = 1) = P({6}) = 1/6, so X ∼ Ber(1/6). Now let Y be the random variable that equals 1 if the roll is even and 0 otherwise. Again, Y is also a Bernoulli random variable, since its support is {0, 1}. The parameter is P(Y = 1) = P({2, 4, 6}) = 1/2. So Y ∼ Ber(1/2).
Bernoulli random variables are usually associated to Bernoulli trials. A Bernoulli trial is a
random experiment in which we label a subset of the possible outcomes as successes, and the
remaining possible outcomes as failures (so there is an underlying interpretation that we are
rooting for certain outcomes). For instance, in the above example of the roll of a fair die, we
could imagine that we are hoping for a 6. Then, the die roll would be a Bernoulli trial, with 6
being a success, and anything else being a failure.
Whenever we have a Bernoulli trial, we can define a Bernoulli random variable X, by saying
that X = 1 if the trial is a success, and X = 0 if the trial is a failure.
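As a quick illustration (our own sketch, not part of the original notes), the following Python snippet builds the Bernoulli random variable of Example 2 from a die-roll trial; the function name die_roll_bernoulli is ours.

    import random

    def die_roll_bernoulli():
        # One Bernoulli trial: roll a fair die; success means rolling a 6.
        omega = random.randint(1, 6)      # outcome of the die roll
        return 1 if omega == 6 else 0     # the Bernoulli random variable X

    # Estimate P(X = 1); the result should be close to 1/6 ~ 0.1667.
    n = 100_000
    print(sum(die_roll_bernoulli() for _ in range(n)) / n)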
Definition 5 (Binomial distribution). Let n ∈ N and p ∈ (0, 1). A random variable X has binomial distribution with parameters n and p if supp(X) = {0, . . . , n} and X has the probability mass function
pX(k) = \binom{n}{k} · p^k · (1 − p)^{n−k}, k ∈ {0, . . . , n}.
Note that pX is indeed a probability mass function since, by the Binomial Theorem,
1 = (p + (1 − p))^n = ∑_{k=0}^n \binom{n}{k} · p^k · (1 − p)^{n−k}.
A random variable with the Bin(n, p) distribution serves to model the number of successes obtained when we perform n independent Bernoulli trials, each with success probability p. Let us justify this statement. We first model the performing of n independent Bernoulli trials by taking the sample space
Ω = {S, F}^n,
where ‘S’ represents success and ‘F’ represents failure. That is, each outcome ω ∈ Ω is a “word” consisting of n letters, all of which are either S or F. The probability of an outcome ω = ω1 . . . ωn is given by
P({ω}) = p^{number of S in ω} · (1 − p)^{number of F in ω}.
For instance, if n = 3 and p = 1/3, we have
Ω = {SSS, SSF, SFS, SFF, FSS, FSF, FFS, FFF}
and
P({SSS}) = (1/3)^3 · (2/3)^0 = 1/27,
P({SSF}) = P({SFS}) = P({FSS}) = (1/3)^2 · (2/3)^1 = 2/27,
P({SFF}) = P({FSF}) = P({FFS}) = (1/3)^1 · (2/3)^2 = 4/27,
P({FFF}) = (1/3)^0 · (2/3)^3 = 8/27.
Returning to the general case with arbitrary n and p, let X denote the number of successes obtained in the n trials. Equivalently,
X(ω) = number of S in ω.
Let us fix k ∈ {0, . . . , n} and compute P(X = k). The event {X = k} is the event that we have obtained k successes out of the n trials. That is,
P(X = k) = ∑_{ω: the number of S in ω is k} P({ω}) = ∑_{ω: the number of S in ω is k} p^k · (1 − p)^{n−k} = \binom{n}{k} · p^k · (1 − p)^{n−k},
since there are exactly \binom{n}{k} words ω containing k letters S. This is pX(k) for X ∼ Bin(n, p), as claimed.
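To make the counting argument concrete, the following Python sketch (our own illustration, with arbitrary parameter values) compares the formula \binom{n}{k} p^k (1 − p)^{n−k} with a direct simulation of n Bernoulli trials.

    import random
    from math import comb

    n, p, k = 10, 1/3, 4

    # Exact pmf: P(X = k) = C(n, k) * p^k * (1-p)^(n-k).
    exact = comb(n, k) * p**k * (1 - p)**(n - k)

    # Simulation: count how often exactly k of n Bernoulli(p) trials succeed.
    trials = 200_000
    hits = sum(1 for _ in range(trials)
               if sum(random.random() < p for _ in range(n)) == k)
    print(exact, hits / trials)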
Definition 6 (Geometric distribution). Let p ∈ (0, 1). A random variable X has geometric distribution with parameter p if supp(X) = N = {1, 2, . . .} and pX(k) = p · (1 − p)^{k−1} for k ∈ N. We write X ∼ Geom(p).
Random variables having the geometric distribution arise in the following situation. Suppose that we repeatedly perform independent Bernoulli trials, all with the same probability p of success. Then, the number of trials performed to obtain the first success has a geometric distribution with parameter p.
Remark 3. Some people use a slightly different definition for the geometric distribution with parameter p; namely, they count the number of failed trials performed before the first success is obtained. So, if success is obtained in the first trial, the number of failed trials is zero and X = 0. This gives a p.m.f. p̃X(k) = p · (1 − p)^k, for k ∈ N0. We will not adopt this form.
Example 3. We roll a die repeatedly until we roll a 6 for the first time. Let X be the total number of times we roll the die. Then, X ∼ Geom(1/6).
Definition 7 (Poisson distribution). Let λ > 0. A random variable X has Poisson distribution with parameter λ if supp(X) = N0 = {0, 1, . . .} and the probability mass function of X is
pX(k) = (λ^k/k!) · e^{−λ}, k ∈ N0
(recall that 0! = 1). We write X ∼ Poi(λ).
Random variables that count rare occurrences among many trials (such as: the number of accidents on a road throughout a year, or the number of typos in a book) typically follow (at least approximately) the Poisson distribution. This is made rigorous in the following proposition.
Proposition 1 (Poisson approximation to the binomial distribution). Let λ be a positive real number. Let X ∼ Poisson(λ). For each integer n > λ, let Xn ∼ Bin(n, λ/n). Then, for each k ∈ N0, we have
pXn(k) → pX(k) as n → ∞.
This proposition says that, when we perform a large number (= n) of independent Bernoulli trials, all with the same probability (= λ/n) of success, then the number of successes we obtain is approximately distributed as Poisson(λ).
Proof. (This proof is not examinable.) We start by writing
pXn(k) = \binom{n}{k} · (λ/n)^k · (1 − λ/n)^{n−k}
= [n(n − 1) · · · (n − k + 1)/n^k] · (λ^k/k!) · (1 − λ/n)^{−k} · (1 − λ/n)^n
=: An · (λ^k/k!) · Bn · Cn.
We have
An = n(n − 1) · · · (n − k + 1)/n^k → 1 as n → ∞,
and
Bn = (1 − λ/n)^{−k} → 1 as n → ∞.
Finally, using the Taylor approximation log(1 + x) = x + E(x), where lim_{x→0} |E(x)|/x = 0, we obtain
Cn = exp{n · log(1 − λ/n)} = exp{n · (−λ/n + E(−λ/n))}
= exp{−λ + λ · E(−λ/n)/(λ/n)} → e^{−λ} as n → ∞,
since E(−λ/n)/(λ/n) → 0. Combining the limits of An, Bn and Cn gives pXn(k) → (λ^k/k!) · e^{−λ} = pX(k), as required.
Exercise 1. Assume that the probability of a typo in any word in a book is 10^{−5}, independently of each other word. The book contains 2 · 10^5 words. Using a Poisson approximation to the binomial distribution, estimate the probability that there are exactly two typos in the book.
Solution. The number of typos (successes), denoted X, has the binomial distribution with parameters n = 2 · 10^5 and p = 10^{−5}. We approximate this by a random variable Y with Poisson distribution with parameter λ = np = 2 · 10^5 · 10^{−5} = 2. Then,
P(X = 2) ≈ P(Y = 2) = pY(2) = e^{−2} · 2^2/2! ≈ 0.271.
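The following Python check (our own illustration, using the numbers of the exercise) compares the exact binomial probability with the Poisson approximation; the observed error is also well within the Le Cam bound np² = 2 · 10^{−5} discussed in the next remark.

    from math import comb, exp, factorial

    n, p = 2 * 10**5, 10**-5
    lam = n * p  # = 2

    binom = comb(n, 2) * p**2 * (1 - p)**(n - 2)   # exact Bin(n, p) pmf at k = 2
    poisson = exp(-lam) * lam**2 / factorial(2)    # Poi(2) pmf at k = 2
    print(binom, poisson, abs(binom - poisson))    # difference < n*p^2 = 2e-5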
Remark 4. The above proposition and exercise show that we can approximate Bin(n, p) by Poi(np) when n is large and p is small. Of course, saying “large” and “small” is a bit vague, and we may want to have more quantitative statements. Although we will not go further in this module, it is important to mention that there are more precise formulations of the above proposition, which give some information about the size of the error in the Poisson approximation to the binomial. A theorem due to Lucien Le Cam states that, if X ∼ Bin(n, p) and Y ∼ Poi(λ) with λ = np, then |pX(k) − pY(k)| < np² for all k. For instance, in the typo exercise above, the error in the approximation is smaller than 2 · 10^5 · (10^{−5})² = 2 · 10^{−5}.
Remark 5. Technically, we should say that such a r.v. is absolutely continuous, since there are r.v.s with continuous, but not differentiable, cdfs.
If X is continuous, then
FX(x) = ∫_{−∞}^x fX(y) dy, and FX′(x) = fX(x) for every x at which fX is continuous.
Common families of continuous distributions
Definition 9 (Uniform distribution). Let a, b ∈ R with a < b. A random variable X has the uniform distribution on (a, b) if it has cumulative distribution function given by
FX(x) = 0 if x ≤ a; (x − a)/(b − a) if a < x < b; 1 if x ≥ b.
[Figure: two panels. Left: graph of FX, rising linearly from 0 at a to 1 at b, with FX(x0) marked. Right: graph of the density fX, constant on (a, b), with the area up to x0 shaded; this area equals FX(x0).]
Definition 10 (Exponential distribution). Let λ > 0. A random variable X has the exponential distribution with parameter λ if it has cumulative distribution function
FX(x) = 1 − e^{−λx} if x ≥ 0, and FX(x) = 0 if x < 0,
or equivalently, probability density function fX(x) = λe^{−λx} for x ≥ 0 (and 0 otherwise). We write X ∼ Exp(λ).
[Figure: two panels. Left: graph of FX, increasing from 0 toward 1 with slope λ at the origin, with FX(x0) marked. Right: graph of fX, decreasing from λ, with the area up to x0 shaded.]
The exponential distribution is commonly used to model the lifetime of entities that have a lack of memory property. To explain what this means, let us think of light bulbs. Suppose that the lifetime of light bulbs of a particular brand has an Exponential(λ) distribution (assume that we turn on the light and don't turn it off until the bulb burns out). Then, the memoryless property means that: regardless of whether the bulb has just been activated, or it has been active for a certain amount of time, the distribution of the remaining lifetime is the same.
Mathematically, this is expressed by the following identity, which holds for all s, t ≥ 0:
P(X > t + s | X > t) = P(X > s).
You will prove this in the exercises.
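The memoryless identity is easy to probe numerically. The following Python sketch (our own, with arbitrary λ, s, t) estimates both sides of the identity by simulation.

    import random

    lam, s, t = 1.5, 0.7, 1.2
    N = 500_000
    samples = [random.expovariate(lam) for _ in range(N)]

    p_t = sum(x > t for x in samples) / N                    # P(X > t)
    given_s = [x for x in samples if x > s]
    p_cond = sum(x > s + t for x in given_s) / len(given_s)  # P(X > s+t | X > s)
    print(p_t, p_cond)  # both close to exp(-lam * t)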
Definition 11 (Gamma function and distribution). First define the Gamma function Γ as
Γ(w) := ∫_0^∞ x^{w−1} · e^{−x} dx, w ∈ (0, ∞).
(Note that Γ(w) = (w − 1)! for w ∈ N.) Let w > 0 and λ > 0. A random variable X has Gamma distribution with parameters w and λ if it has probability density function
fX(x) = (λ^w/Γ(w)) · x^{w−1} · e^{−λx} if x > 0, and fX(x) = 0 otherwise.
We write X ∼ Gamma(w, λ).
It is worth noting that the Gamma distribution with parameters w = 1 and λ is equal to the exponential distribution with parameter λ. (For w = n ∈ N, one can also obtain Gamma(n, λ) by adding n independent Exp(λ) random variables, as we will see later in the module.)
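A quick numerical sanity check of these facts in Python (our own illustration): Γ(w) = (w − 1)! for integer w, and the Gamma(1, λ) density coincides with the Exp(λ) density.

    from math import gamma, factorial, exp

    # Gamma(w) = (w - 1)! for integer w.
    for w in range(1, 7):
        print(w, gamma(w), factorial(w - 1))

    # Gamma(1, lam) density equals the Exp(lam) density at any point x.
    lam, x = 2.0, 1.3
    f_gamma = lam**1 / gamma(1) * x**0 * exp(-lam * x)
    f_exp = lam * exp(-lam * x)
    print(f_gamma, f_exp)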
Normal distribution. Let µ ∈ R and σ² > 0. A random variable X has the normal distribution with parameters µ and σ² if it has probability density function
fX(x) = (1/√(2πσ²)) · e^{−(x−µ)²/(2σ²)}, x ∈ R.
We write X ∼ N(µ, σ²). Note that X − µ ∼ N(0, σ²).
The graph of fX is a ‘bell curve’, symmetric about µ. The value of σ controls the dispersion of the curve. See the figure below.
The computation to show that fX is a probability density function is a bit involved and requires knowledge of calculus in R^d, so we omit it.
The normal distribution often arises when we measure the dispersion of some real-world quantity about its average value. As we will see later, this is justified by the Central Limit Theorem, which says that the normal distribution arises as a universal limiting distribution in probability theory.
[Figure: bell curves of the N(µ, σ²) density for µ = 0, σ = 0.2 and for µ = 0, σ = 0.7.]
ST119: Probability 2 Lecture notes for Week 2
Proposition 1. Let X be a random variable taking values in N0. Then, E[X] = ∑_{j=1}^∞ P(X ≥ j).
Proof. We have
E[X] = ∑_{n=1}^∞ n · pX(n) = ∑_{n=1}^∞ ∑_{j=1}^n 1 · pX(n).
The above double sum is over all pairs (n, j) such that n ∈ N and j ∈ {1, . . . , n}. We can rewrite the sum as
∑_{j=1}^∞ ∑_{n=j}^∞ pX(n) = ∑_{j=1}^∞ P(X ≥ j),
as required.
An analogous statement also holds for continuous random variables:
Proposition 2. Let X be a continuous random variable which only takes positive values. Then,
E[X] = ∫_0^∞ P(X > x) dx.
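Both tail formulas are easy to check numerically. The following Python sketch (ours; the infinite sums are truncated at 2000 terms) verifies the discrete version for X ∼ Geom(p), for which P(X ≥ j) = (1 − p)^{j−1}.

    # Check E[X] = sum_{j>=1} P(X >= j) for X ~ Geom(p), support {1, 2, ...}.
    p = 0.3
    mean = sum(k * p * (1 - p)**(k - 1) for k in range(1, 2000))
    tails = sum((1 - p)**(j - 1) for j in range(1, 2000))  # P(X >= j)
    print(mean, tails, 1 / p)  # all three close to 1/0.3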
We will now revisit the families of distributions seen in Week 1. For a random variable following
each of the distributions we have seen, we will compute the expectation (as a function of the
parameters).
Bernoulli distribution. If X ∼ Ber(p), then
E[X] = 0 · (1 − p) + 1 · p = p.
Binomial distribution. If X ∼ Bin(n, p), then
E[X] = ∑_{k=0}^n k · \binom{n}{k} · p^k · (1 − p)^{n−k}
= ∑_{k=1}^n k · \binom{n}{k} · p^k · (1 − p)^{n−k}
= ∑_{k=1}^n k · [n!/(k!(n − k)!)] · p^k · (1 − p)^{n−k}.
Writing k · n!/(k!(n − k)!) = n · (n − 1)!/((k − 1)!(n − k)!) and substituting j = k − 1 and m = n − 1, the above equals
np · ∑_{j=0}^m \binom{m}{j} · p^j · (1 − p)^{m−j}.
We now note that the expression inside the sum is equal to pY(j), where Y is a random variable with the Bin(m, p) distribution. Hence, the above equals
np · ∑_{j=0}^m pY(j) = np · 1 = np.
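A one-line numerical check of E[X] = np in Python (our own illustration, with arbitrary n and p):

    from math import comb

    n, p = 12, 0.4
    mean = sum(k * comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1))
    print(mean, n * p)  # both equal 4.8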
Geometric distribution. We will use the formula for a geometric series: for a ∈ (−1, 1), we have
∑_{n=0}^∞ a^n = 1/(1 − a) and ∑_{n=1}^∞ a^n = a/(1 − a).
Let X ∼ Geom(p). We will compute E[X] by two methods.
Method 1.
E[X] = ∑_{k=1}^∞ k · p(1 − p)^{k−1} = p ∑_{k=1}^∞ k(1 − p)^{k−1} = p ∑_{k=1}^∞ (−d/dp)(1 − p)^k.
We will now exchange the derivative with the infinite sum (we omit the rigorous justification for doing so) to obtain
−p · (d/dp) ∑_{k=1}^∞ (1 − p)^k = −p · (d/dp)[(1 − p)/(1 − (1 − p))] = −p · (d/dp)(1/p − 1) = −p · (−1/p²) = 1/p.
In conclusion, E[X] = 1/p.
Method 2. By Proposition 1, we have
E[X] = ∑_{n=1}^∞ P(X ≥ n) = ∑_{n=1}^∞ ∑_{m=n}^∞ pX(m) = ∑_{n=1}^∞ ∑_{m=n}^∞ p(1 − p)^{m−1}. (⋆)
With the change of variable ℓ = m − n (so that m = ℓ + n, and ℓ = 0 when m = n), we have that
∑_{m=n}^∞ (1 − p)^{m−1} = ∑_{ℓ=0}^∞ (1 − p)^{ℓ+n−1} = (1 − p)^{n−1} ∑_{ℓ=0}^∞ (1 − p)^ℓ = (1 − p)^{n−1}/p.
Plugging this into (⋆), we obtain
E[X] = p · (1/p) ∑_{n=1}^∞ (1 − p)^{n−1} = ∑_{j=0}^∞ (1 − p)^j = 1/p.
Exponential distribution. Let X ∼ Exp(λ). By Proposition 2,
E[X] = ∫_0^∞ P(X > x) dx = ∫_0^∞ (1 − FX(x)) dx = ∫_0^∞ e^{−λx} dx = [−e^{−λx}/λ]_{x=0}^∞ = 1/λ.
Normal distribution. You will prove in the exercises that if X ∼ N(µ, σ²), then E[X] = µ; that is, the parameter µ is the mean (or expectation) associated to the distribution. Here we will show that σ² is the variance. Recall that, for a random variable X, the variance is defined as
Var(X) = E[(X − E[X])²] = E[X²] − (E[X])².
We compute, for X ∼ N(µ, σ²) (already using the fact that E[X] = µ),
Var(X) = E[(X − µ)²] = ∫_{−∞}^∞ (x − µ)² · (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)} dx,
where we used the formula E[g(X)] = ∫_{−∞}^∞ g(x) · fX(x) dx. Using the substitution y = (x − µ)/σ, the above becomes
(σ²/√(2π)) ∫_{−∞}^∞ y² · e^{−y²/2} dy.
We now integrate by parts,
u = y, dv = y · e^{−y²/2} dy ⟹ du = dy, v = −e^{−y²/2},
and then
(σ²/√(2π)) ∫_{−∞}^∞ y² e^{−y²/2} dy = (σ²/√(2π)) [(−y e^{−y²/2})|_{y=−∞}^∞ + ∫_{−∞}^∞ e^{−y²/2} dy]
= (σ²/√(2π)) (0 + √(2π)) = σ²,
where we used that (1/√(2π)) e^{−y²/2} is the N(0, 1) density, so it integrates to 1.
Gamma distribution. Let G ∼ Gamma(w, λ). Then E[G] = w/λ; this can be checked by writing out ∫_0^∞ x · fG(x) dx and recognising a Gamma(w + 1, λ) density inside the integral.
Negative Binomial distribution. Perform independent Ber(p) trials and let H ∼ NBin(k, p) be the number of trials until we have k successes. Then, for x ∈ {k, k + 1, . . .},
pH(x) = P(exactly k − 1 successes in the first x − 1 trials, then a success) = \binom{x−1}{k−1} · p^k · (1 − p)^{x−k}.
Writing H = G1 + · · · + Gk, where the Gi ∼ Geom(p) are independent (Gi is the number of trials between the (i − 1)-th and the i-th success), we get
E[H] = ∑_{i=1}^k E[Gi] = k · (1/p) = k/p.
The distribution of the transformation of a random variable
We now study questions of the following kind: let X be a random variable whose distribution is known to us, and let g : R → R; then, what is the distribution of Y = g(X)? Although this can be hard to answer in some situations, we will see some cases where it is straightforward.
A general comment: in this lecture and in the rest of the module, when we say “determine
the distribution of a random variable”, this can be understood as determining the cumulative
distribution function of the random variable. It is also satisfactory to determine the probability
mass function (in case the random variable is discrete) or probability density function (in case
it is continuous).
A word about inversion. Let g : R → R be a function. For any y ∈ R, we define the set
g^{−1}(y) := {x ∈ R : g(x) = y} ⊆ R.
For instance, if X is a discrete random variable and Y = aX + b with a ≠ 0, then the support of Y is
{ax + b : x ∈ supp(X)}.
Example 2. Let X ∼ Geom(p) and Y = |sin(πX/2)|. Let us determine the support and distribution of Y.
Since X ∼ Geom(p), the support of X is N and the probability mass function of X is pX(k) = p(1 − p)^{k−1}, for k ∈ N. Define g(x) = |sin(πx/2)|, so that Y = g(X). We note that g(x) = 0 when x is even, and g(x) = 1 when x is odd.
This shows that the support of Y is {0, 1}. Next, by (1) we have
pY(0) = ∑_{x ∈ N ∩ g^{−1}(0)} pX(x) and pY(1) = ∑_{x ∈ N ∩ g^{−1}(1)} pX(x).
We have
N ∩ g^{−1}(0) = {2, 4, 6, . . .}, N ∩ g^{−1}(1) = {1, 3, 5, . . .}.
Therefore,
pY(1) = ∑_{j=0}^∞ p(1 − p)^{2j} = p/(1 − (1 − p)²) = 1/(2 − p).
This shows that Y has a Bernoulli distribution with parameter pY(1) = 1/(2 − p).
To summarize,
FY(y) = FX(√y) − FX(−√y) if y ≥ 0, and FY(y) = 0 otherwise.
In case we also want the density function of Y, we can differentiate. For y < 0 we have fY(y) = 0, since FY is constant on (−∞, 0). For y > 0,
fY(y) = FY′(y) = (d/dy)(FX(√y) − FX(−√y))
= fX(√y) · (1/(2√y)) − fX(−√y) · (−1/(2√y))
= (1/(2√y)) · (fX(√y) + fX(−√y)).
Proposition. Let X be a continuous random variable with density fX, and let g be differentiable and strictly monotone on an interval containing supp(X), mapping it onto an interval J. Then Y = g(X) has density
fY(y) = fX(g^{−1}(y)) · |(d/dy) g^{−1}(y)|, y ∈ J.
Proof. Assume first that g is increasing. Then,
FY(y) = P(Y ≤ y) = P(g(X) ≤ y) = P(X ≤ g^{−1}(y)) = FX(g^{−1}(y)).
Differentiating,
fY(y) = (d/dy) FY(y) = fX(g^{−1}(y)) · (d/dy) g^{−1}(y) = fX(g^{−1}(y)) · |(d/dy) g^{−1}(y)|.
The last equality follows from the fact that g^{−1} is increasing when g is increasing, which is easy to check.
Next, assume that g is decreasing. Then,
FY(y) = P(Y ≤ y) = P(g(X) ≤ y) = P(X ≥ g^{−1}(y)) = 1 − FX(g^{−1}(y)).
Again differentiating,
fY(y) = (d/dy)(1 − FX(g^{−1}(y))) = −fX(g^{−1}(y)) · (d/dy) g^{−1}(y) = fX(g^{−1}(y)) · |(d/dy) g^{−1}(y)|,
since g^{−1} is decreasing in this case, so it has a negative derivative, and |(d/dy) g^{−1}(y)| = −(d/dy) g^{−1}(y).
Example 3. Let U ∼ Unif(0, 1) and Y = e^U − 1. The function g(x) = e^x − 1 is increasing and maps [0, 1] onto [0, e − 1], with g^{−1}(y) = log(1 + y). Hence, for y ∈ [0, e − 1],
FY(y) = P(e^U − 1 ≤ y) = P(U ≤ log(1 + y)) = log(1 + y),
so fY(y) = 1/(1 + y) for y ∈ (0, e − 1) (and 0 otherwise).
Example 4. Let X ∼ N(µ, σ²) and Y = aX + b, where a, b ∈ R, a ≠ 0. Let us use the above proposition to find fY. We take g(x) = ax + b, so g maps R into R, and is increasing if a > 0 and decreasing if a < 0. The inverse is g^{−1}(y) = (y − b)/a. We then get
fY(y) = fX(g^{−1}(y)) · |(d/dy) g^{−1}(y)| = (1/√(2πσ²)) · e^{−((y−b)/a − µ)²/(2σ²)} · (1/|a|)
= (1/√(2π(aσ)²)) · e^{−(y−(aµ+b))²/(2(aσ)²)}.
This shows that Y ∼ N(aµ + b, a²σ²). In particular, if X ∼ N(0, 1) and Y = aX + b, then Y ∼ N(b, a²).
Example 5. Let X be a continuous random variable with probability density function fX(x) = 2x for x ∈ (0, 1) (and 0 otherwise), and let V = X³. Let us find fV and FV.
Solution. The function g(x) = x³ is differentiable and increasing, so we can apply the theorem. We have g^{−1}(y) = y^{1/3}, and then
fV(y) = fX(g^{−1}(y)) · |(d/dy) g^{−1}(y)|.
Note that g^{−1}(y) ∈ (0, 1) if and only if y ∈ (0, 1), so fX(g^{−1}(y)) > 0 if and only if y ∈ (0, 1). Also, (d/dy) g^{−1}(y) = (1/3) · y^{−2/3}. We then obtain
fV(y) = 2y^{1/3} · (1/3) · y^{−2/3} = (2/3) · y^{−1/3} if y ∈ (0, 1), and fV(y) = 0 otherwise,
and then
FV(y) = P(V ≤ y) = P(X³ ≤ y) = P(X ≤ y^{1/3}) = FX(y^{1/3}) = y^{2/3} if 0 ≤ y ≤ 1; FV(y) = 0 if y < 0; FV(y) = 1 if y > 1.
The functions FV and fV are plotted below:
[Figure: graph of FV(y) = y^{2/3}, increasing from 0 to 1 on [0, 1], and graph of fV(y) = (2/3)y^{−1/3} on (0, 1), decreasing and equal to 2/3 at y = 1.]
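The example can also be checked by simulation. In the Python sketch below (our own), we sample X with density 2x on (0, 1) via the inverse-CDF method (FX(x) = x², so X = √U with U ∼ Unif(0, 1)) and compare the empirical CDF of V = X³ with y^{2/3}.

    import random

    N = 200_000
    # X = U^(1/2) has CDF x^2 (density 2x on (0,1)); hence V = X^3 = U^(3/2).
    vs = [random.random() ** 1.5 for _ in range(N)]
    for y in (0.2, 0.5, 0.8):
        print(y, sum(v <= y for v in vs) / N, y ** (2 / 3))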
ST119: Probability 2 Lecture notes for Week 3
Let X and Y be discrete random variables taking values in Z, and let Z = X + Y. Then, for m ∈ Z,
pZ(m) = P(X + Y = m) = ∑_{k∈Z} P(X = k, X + Y = m)
= ∑_{k∈Z} P(X = k, k + Y = m)
= ∑_{k∈Z} P(X = k, Y = m − k) = ∑_{k∈Z} pX,Y(k, m − k).
In case X and Y are independent, we have pX,Y(x, y) = pX(x) · pY(y), so the above also gives
pZ(m) = ∑_{k∈Z} pX(k) · pY(m − k).
Proposition. Let X ∼ Poi(λ) and Y ∼ Poi(µ) be independent. Then X + Y ∼ Poi(λ + µ).
Proof. Let Z := X + Y. Note that Z ≥ 0 (since X ≥ 0 and Y ≥ 0), so pZ(m) = 0 if m < 0. For m ≥ 0, we apply the formula
pZ(m) = ∑_{k∈Z} pX(k) · pY(m − k).
Noting that pX(k) = 0 for k < 0 and pY(m − k) = 0 for k > m, the right-hand side becomes
∑_{k=0}^m pX(k) · pY(m − k) = ∑_{k=0}^m (λ^k/k!) · e^{−λ} · (µ^{m−k}/(m − k)!) · e^{−µ}.
We multiply and divide the right-hand side by m!; rearranging terms, it becomes
e^{−(λ+µ)} · (1/m!) · ∑_{k=0}^m [m!/(k!(m − k)!)] · λ^k · µ^{m−k} = e^{−(λ+µ)} · (λ + µ)^m/m!,
where we used the Binomial Theorem. This is the Poi(λ + µ) p.m.f., as required.
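The convolution identity can be verified numerically; this Python sketch (ours, with arbitrary λ, µ and m) convolves the two Poisson p.m.f.s and compares the result with the Poi(λ + µ) p.m.f.

    from math import exp, factorial

    lam, mu, M = 1.3, 2.1, 60
    pX = [exp(-lam) * lam**k / factorial(k) for k in range(M)]
    pY = [exp(-mu) * mu**k / factorial(k) for k in range(M)]

    m = 5
    conv = sum(pX[k] * pY[m - k] for k in range(m + 1))        # discrete convolution
    direct = exp(-(lam + mu)) * (lam + mu)**m / factorial(m)   # Poi(lam+mu) pmf at m
    print(conv, direct)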
Proposition. Let X1, . . . , Xn be independent random variables, all with the Ber(p) distribution. Then X1 + · · · + Xn ∼ Bin(n, p).
Although we could also prove this using discrete convolutions, we will not do so, because the proposition follows immediately from our interpretations of Bernoulli and binomial random variables with Bernoulli trials. The sum X1 + · · · + Xn is doing exactly what the binomial random variable does: counting the number of successes out of the n trials.
Example 1. In Week 2, we proved that X ∼ Bin(n, p) has E[X] = np by doing a long computation. With the above proposition, we obtain a much easier way to verify this. Indeed, letting X1, . . . , Xn ∼ Ber(p) be independent and letting X = ∑_{i=1}^n Xi, we have that
E[X] = E[∑_{i=1}^n Xi] = ∑_{i=1}^n E[Xi] = ∑_{i=1}^n p = np.
Proposition. Let X ∼ Bin(m, p) and Y ∼ Bin(n, p) be independent. Then X + Y ∼ Bin(m + n, p).
Proof. Let X1, . . . , Xm+n be independent random variables with the Bernoulli distribution with parameter p. Define
W := ∑_{i=1}^m Xi, Z := ∑_{i=m+1}^{n+m} Xi.
By the previous proposition, we have that W ∼ Bin(m, p) and Z ∼ Bin(n, p). Moreover, since W depends only on X1, . . . , Xm and Z depends only on Xm+1, . . . , Xn+m (and these two sets of random variables have nothing in common), we have that W and Z are independent. We then have, for all (x, y),
P(W = x, Z = y) = P(W = x) · P(Z = y) = P(X = x) · P(Y = y) = P(X = x, Y = y),
so (W, Z) has the same distribution as (X, Y). Then, W + Z has the same distribution as X + Y. Finally, the previous proposition implies that W + Z ∼ Bin(m + n, p), so X + Y also has this distribution.
Example 2. There are n baskets and k children. Each child chooses one of the n baskets uniformly at random, then attempts to throw the ball inside the chosen basket, and has probability of success equal to q ∈ (0, 1).
Assume that the trial of each child is independent of the others. Let X be the number of baskets that are empty after all children have taken their turn. Find the expectation of X.
Solution. For i = 1, . . . , n, let Xi = 1 if basket i is empty after all children have taken their turn, and Xi = 0 otherwise. Note that
X = ∑_{i=1}^n Xi.
We thus have
E[X] = ∑_{i=1}^n E[Xi].
Since each Xi is a Bernoulli random variable, its expectation is equal to the probability that it equals one, which we now compute:
P(Xi = 1) = P(⋂_{ℓ=1}^k {child ℓ does not put a ball inside basket i})
= ∏_{ℓ=1}^k P(child ℓ does not put a ball inside basket i)
= (P(child 1 does not put a ball inside basket i))^k = (1 − q/n)^k.
This gives
E[X] = ∑_{i=1}^n (1 − q/n)^k = n(1 − q/n)^k.
First, we have
X = Y1 + . . . + YN.
Indeed, the total number of coupons bought is equal to the sum of the total number of coupons bought between the successive times when the collection grows.
Second, for k ≥ 2, the distribution of Yk is Geometric((N − k + 1)/N). To see this, assume that the child just bought the coupon that made her collection grow to k − 1 different types. From this moment (and not including it) until (and including) the moment when the next unseen coupon is bought, the child is making Bernoulli trials, where success means buying an unseen coupon (and thus has probability (N − (k − 1))/N = (N − k + 1)/N).
Putting these two observations together, and using the fact that the expectation of a Geometric(p) random variable is 1/p, we obtain
E[X] = E[Y1 + Y2 + . . . + YN]
= E[Y1] + E[Y2] + . . . + E[YN]
= N/N + N/(N − 1) + N/(N − 2) + . . . + N/1 = N ∑_{k=1}^N 1/k.
The number ∑_{k=1}^N 1/k is very close to ∫_1^N (1/x) dx = log(N). Hence, E[X] is very close to N log(N).
It is not hard to see that the random variables Y1, . . . , YN that we used in the above reasoning are actually independent. However, we never needed this: we just used the equality E[X] = ∑ E[Yk] (the expectation of a sum equals the sum of the expectations), for which no independence is required.
Example 3. Let X and Y be independent, both with the Exponential distribution with parameter λ > 0, that is,
fX(x) = fY(x) = λe^{−λx} if x ≥ 0, and 0 otherwise.
Let Z = X + Y; by the convolution formula for densities, fZ(z) = ∫_{−∞}^∞ fX(x) · fY(z − x) dx. Now, note that the product inside the integral is equal to zero when x < 0 (since fX(x) = 0 then) and when x > z (since fY(z − x) = 0 then). The integral is then equal to
∫_0^z λe^{−λx} · λe^{−λ(z−x)} dx = λ² · e^{−λz} ∫_0^z dx = λ² · z · e^{−λz}.
Hence, fZ(z) = λ² · z · e^{−λz}, for z ≥ 0.
Review: variance
Let us recall the definition of the variance of a random variable.
Definition 1. For a random variable X, we define
Var(X) = E[(X − E[X])²],
whenever this expectation is well defined.
As you have seen, it is easy to prove that the alternate formula Var(X) = E[X²] − (E[X])² holds, and that
Var(aX + b) = a² · Var(X).
Recall also that if X1, . . . , Xn are independent, then Var(X1 + · · · + Xn) = Var(X1) + · · · + Var(Xn). Let us now use these facts to compute the variance of a binomial random variable.
Example 4. Let X1, . . . , Xn be independent, all with the Ber(p) distribution. Recall that X = ∑_{i=1}^n Xi follows a Bin(n, p) distribution. We have that
Var(X) = Var(∑_{i=1}^n Xi) = ∑_{i=1}^n Var(Xi).
The variance of the Xi's is easy to compute. We already found that E[Xi] = p. Also note that, since Xi only attains the values 0 and 1, we have that Xi² = Xi (since 0² = 0 and 1² = 1), so E[Xi²] = E[Xi] = p, and
Var(Xi) = E[Xi²] − (E[Xi])² = p − p² = p(1 − p).
We then obtain
Var(X) = ∑_{i=1}^n p(1 − p) = np(1 − p).
Theorem 1 (Markov's inequality). Let X be a non-negative random variable. Then,
P(X ≥ x) ≤ E[X]/x for all x > 0.
Proof. Fix x > 0. Define the random variable
Y := x if X ≥ x; Y := 0 otherwise.
Then X ≥ Y: indeed, if X ≥ x, then Y = x ≤ X, and if X ∈ [0, x), then Y = 0 ≤ X.
This also gives E[X] ≥ E[Y]. Next, note that Y is a discrete random variable (it only attains the values 0 and x) with
pY(x) = P(X ≥ x), pY(0) = P(X < x).
Hence,
E[X] ≥ E[Y] = 0 · pY(0) + x · pY(x) = x · pY(x) = x · P(X ≥ x).
Rearranging this, we obtain the desired inequality.
Example 5. Suppose that each week, a company produces on average 50 items. Give an
upper bound for the probability that a week’s production exceeds 75 items.
Terminology: An upper bound for an unknown quantity p is any real number a such that p ≤ a. This terminology is typically used when we are trying to narrow down the (unknown) value of p as best we can. The smaller an upper bound is, the more informative it is. For instance, in this exercise, we could say that 1 is an upper bound for the probability, which is obviously true, but doesn't give any information.
Solution. The assumption means that a week's production is a random variable X with E[X] = 50. We use Markov's inequality to estimate
P(X > 75) ≤ P(X ≥ 75) ≤ E[X]/75 = 50/75 = 2/3.
It is important to note that Markov’s inequality does not always give a useful bound. Indeed,
if X is a non-negative random variable with expectation equal to µ, and x 2 (0, µ], then in the
inequality
µ
P(X > x) 6 ,
x
the right-hand side is larger than 1, so the bound only tells us that the probability is smaller
than or equal to 1.
While Markov’s inequality gives a bound on the probability that a random variable is large,
Chebyshev’s inequality gives a bound on the probability that a random variable is far from its
expectation.
Theorem 2 (Chebyshev’s inequality). Let X be a random variable whose variance is well
defined. Then,
Var(X)
P(|X E[X]| > x) 6 for all x > 0.
x2
We emphasize that here no assumption is made concerning the sign of X.
Proof. Let Y := (X − E[X])². Then, Y is non-negative and
E[Y] = E[(X − E[X])²] = Var(X);
in particular, Y has finite expectation. Next, note that for any x > 0, we have
{|X − E[X]| ≥ x} = {(X − E[X])² ≥ x²} = {Y ≥ x²}.
Hence, by Markov's inequality we have
P(|X − E[X]| ≥ x) = P(Y ≥ x²) ≤ E[Y]/x² = Var(X)/x².
Example 6. We roll a fair six-sided die n times. Let X denote the number of 6's obtained. Give an upper bound to the probability that |X − n/6| is larger than √n.
Solution. Note that X ∼ Bin(n, 1/6). Recalling that a Bin(n, p) random variable has expectation np and variance np(1 − p), we have
E[X] = n/6, Var(X) = 5n/36.
By Chebyshev's inequality,
P(|X − n/6| ≥ √n) ≤ Var(X)/(√n)² = (5n/36)/n = 5/36.
ST119: Probability 2 Lecture notes for Week 4
Covariance
We now introduce the covariance, a concept that is closely related to the variance. Rather
than being associated to a single random variable X, the covariance is associated to a pair of
random variables X and Y . Roughly speaking, the covariance of X and Y measures the degree
with which these random variables tend to vary together.
Definition 1. Let X and Y be two random variables. The covariance of X and Y is defined as
Cov(X, Y) = E[(X − E[X])(Y − E[Y])],
whenever this expectation exists.
Before we look at examples, let us list a few first properties of the covariance. First, the
covariance is symmetric, that is,
Cov(X, Y ) = Cov(Y, X).
Second, we clearly have
Cov(X, X) = Var(X).
Third, in the same way that there are two formulas to compute the variance (namely, E[(X − E[X])²] and E[X²] − (E[X])²), there is also an alternate formula for the covariance:
Proposition 1. Let X and Y be two random variables. Then,
Cov(X, Y) = E[XY] − E[X]E[Y].
Example 1. Let X and Y be discrete random variables with joint probability mass function
pX,Y(−2, −2) = 1/6, pX,Y(−2, 2) = 1/3, pX,Y(3, 2) = 1/2.
Let us find the covariance of X and Y. We need to compute E[XY], E[X] and E[Y], which we now do:
E[XY] = (−2) · (−2) · pX,Y(−2, −2) + (−2) · 2 · pX,Y(−2, 2) + 3 · 2 · pX,Y(3, 2) = 7/3,
E[X] = (−2) · pX,Y(−2, −2) + (−2) · pX,Y(−2, 2) + 3 · pX,Y(3, 2) = 1/2,
E[Y] = (−2) · pX,Y(−2, −2) + 2 · pX,Y(−2, 2) + 2 · pX,Y(3, 2) = 4/3.
Hence,
Cov(X, Y) = E[XY] − E[X]E[Y] = 7/3 − (1/2) · (4/3) = 5/3.
Example 2. In each of the items below, a set A ⊆ R² is equal to the union of the red squares.
[Figure: three panels; in each, A is shown as a union of red squares in the plane.]
In each item, consider a pair of jointly continuous random variables X and Y whose joint probability density function is given by fX,Y(x, y) = C if (x, y) ∈ A (and 0 otherwise). Find C and Cov(X, Y) in each case.
Solution.
(a) Since
∫_{−∞}^∞ ∫_{−∞}^∞ fX,Y(x, y) dx dy = ∫_{−3}^{−1} ∫_{−3}^{−1} C dx dy + ∫_1^3 ∫_1^3 C dx dy = 8C
and this double integral should equal 1, we obtain C = 1/8. It is easy to see that the distributions of X and Y are both symmetric about 0, so E[X] = E[Y] = 0. We compute
E[XY] = ∫_{−∞}^∞ ∫_{−∞}^∞ xy · fX,Y(x, y) dx dy
= (1/8) ∫_{−3}^{−1} ∫_{−3}^{−1} xy dx dy + (1/8) ∫_1^3 ∫_1^3 xy dx dy = 4.
We then have
Cov(X, Y) = E[XY] − E[X]E[Y] = 4.
(b) Similarly to above, we have C = 1/8, and again by symmetry, E[X] = E[Y] = 0. Next,
E[XY] = (1/8) ∫_1^3 ∫_{−3}^{−1} xy dx dy + (1/8) ∫_{−3}^{−1} ∫_1^3 xy dx dy = −4.
Hence,
Cov(X, Y) = −4
in this case.
Proposition 2. Let X and Y be two independent random variables. Then,
Cov(X, Y ) = 0.
Proof. Recall that if X and Y are independent, then E[XY] = E[X]E[Y], so
Cov(X, Y) = E[XY] − E[X]E[Y] = 0.
The following example shows that the converse to the above proposition is in general not true.
Example 3 (Zero covariance does not imply independence). Let X be a discrete random variable with probability mass function
pX(−2) = pX(−1) = pX(1) = pX(2) = 1/4.
Let Y = X². Then,
Cov(X, Y) = E[XY] − E[X]E[Y] = E[X³] − E[X] · E[X²].
Note that the distribution of X is symmetric about 0, and the same holds for X³, so E[X] = E[X³] = 0, so by the above we get Cov(X, Y) = 0. However, X and Y are not independent; this can for example be shown by noting that
P(X = 1, Y = 4) = 0 ≠ P(X = 1) · P(Y = 4) = (1/4) · (1/2).
We will now see that the covariance is linear in each of its two arguments.
Proposition 3 (Covariance of sums). For random variables X, Y and Z and real numbers a, b, we have
Cov(aX + bY, Z) = a · Cov(X, Z) + b · Cov(Y, Z).
We leave the verification of this property as an exercise. Note that linearity can be used repeatedly to give
Cov(∑_{i=1}^m ai Xi, ∑_{j=1}^n bj Yj) = ∑_{i=1}^m ∑_{j=1}^n ai bj · Cov(Xi, Yj).
Variance of sums
Let us recall that, if X and Y are independent random variables, then Var(X + Y) = Var(X) + Var(Y). This equality need not be true when X and Y are not independent. We now see a formula that holds in all cases.
Proposition 4 (Variance of sums). For random variables X and Y, we have
Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y),
and, more generally, for random variables X1, . . . , Xn,
Var(∑_{i=1}^n Xi) = ∑_{i=1}^n Var(Xi) + 2 ∑_{1≤i<j≤n} Cov(Xi, Xj).
Proof. We prove the second formula, since it is more general:
Var(∑_{i=1}^n Xi) = Cov(∑_{i=1}^n Xi, ∑_{i=1}^n Xi) = ∑_{i=1}^n ∑_{j=1}^n Cov(Xi, Xj)
= ∑_{i=1}^n Cov(Xi, Xi) + ∑_{1≤i,j≤n, i≠j} Cov(Xi, Xj)
= ∑_{i=1}^n Var(Xi) + 2 ∑_{1≤i<j≤n} Cov(Xi, Xj).
Note that, in the above proposition, if the Xi's are independent, then Cov(Xi, Xj) = 0 for all i ≠ j, so we re-obtain that the variance of the sum is equal to the sum of the variances.
Example 4. A group of N people, all of whom have hats, throw their hats on the floor of a room and shuffle them. Then each person takes a hat from the floor at random (assume that the shuffling is perfect, in the sense that at the end of the procedure, all the possible allocations of hats to the people are equally likely). Let X denote the number of people who recover their own hats. Find the expectation and variance of X.
Solution. We enumerate the people from 1 to N, and for i = 1, . . . , N, define the random variable
Xi = 1 if person i recovers their own hat; Xi = 0 otherwise.
We have X = ∑_{i=1}^N Xi, so we can compute the expectation and variance of X using the formulas
E[X] = ∑_{i=1}^N E[Xi], Var(X) = ∑_{i=1}^N Var(Xi) + ∑_{i≠j} Cov(Xi, Xj).
In order to apply these formulas, we need E[Xi] and Var(Xi) for each i and Cov(Xi, Xj) for each i ≠ j. Since person i is equally likely to recover any of the N hats, we have
P(Xi = 1) = 1/N.
Hence, Xi ∼ Ber(1/N) and we have
E[Xi] = 1/N, Var(Xi) = (1/N) · (N − 1)/N = (N − 1)/N².
Also, for i ≠ j we note that Xi Xj is also a Bernoulli random variable (it only attains the values 0 and 1) and
E[Xi Xj] = P(Xi Xj = 1) = P(Xi = 1, Xj = 1) = (N − 2)!/N! = 1/(N(N − 1))
(the third equality follows from the fact that there are (N − 2)! outcomes where people i and j recover their own hats, and N! outcomes in total).
Then,
Cov(Xi, Xj) = E[Xi Xj] − E[Xi]E[Xj] = 1/(N(N − 1)) − 1/N² = 1/(N²(N − 1)).
Hence,
E[X] = ∑_{i=1}^N 1/N = N · (1/N) = 1
and
Var(X) = ∑_{i=1}^N (N − 1)/N² + ∑_{i≠j} 1/(N²(N − 1))
= (N − 1)/N + N(N − 1) · 1/(N²(N − 1)) = 1.
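A simulation of the hat matching example (our own Python sketch, with N = 10 as an arbitrary choice) confirms that both the mean and the variance of X are close to 1.

    import random

    N, runs = 10, 200_000
    counts = []
    for _ in range(runs):
        hats = list(range(N))
        random.shuffle(hats)                                  # random allocation
        counts.append(sum(i == h for i, h in enumerate(hats)))  # fixed points

    m = sum(counts) / runs
    var = sum((c - m)**2 for c in counts) / runs
    print(m, var)  # both close to 1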
Cauchy-Schwarz inequality
Next, we look at an important inequality.
Theorem 1 (Cauchy-Schwarz inequality). Given two random variables X and Y, we have
|Cov(X, Y)| ≤ √(Var(X) · Var(Y)).
Proof. We give a classical proof of this inequality, that is often seen in Linear Algebra. Define
φ(t) := E[((X − E[X]) − t(Y − E[Y]))²] = Var(X) − 2t Cov(X, Y) + t² Var(Y).
Note that φ(t) ≥ 0 for all t, since the variance is always non-negative. Now, when a quadratic function t ↦ at² + bt + c is non-negative for all t ∈ R, the discriminant b² − 4ac must be less than or equal to zero. This gives
4 Cov(X, Y)² − 4 Var(X) Var(Y) ≤ 0,
which rearranges to the desired inequality.
Correlation coefficient
The covariance is a useful quantity that describes how two random variables vary together. However, it has one disadvantage: it is not scale invariant. To explain what this means, suppose that X and Y are two random variables, both measuring lengths in meters. Assume that U and V give the same measurements as X and Y, respectively, but in centimeters, that is, U = 100X and V = 100Y. Then,
Cov(U, V) = Cov(100X, 100Y) = 100² · Cov(X, Y) = 10000 · Cov(X, Y).
This means that changing the scale also changes the covariance. To obtain a scale-invariant
quantity, we make the following definition.
Definition 2. The correlation coefficient between two random variables X and Y is defined as
ρ(X, Y) = Cov(X, Y)/√(Var(X) · Var(Y)).
Proposition 5. For random variables X, Y and real numbers a, b, c, d with a, c > 0:
−1 ≤ ρ(X, Y) ≤ 1;
ρ(aX + b, cY + d) = ρ(X, Y).
Proof. The first statement is an immediate consequence of the Cauchy-Schwarz inequality. For the second statement, first note that
Var(aX + b) = a² · Var(X) and Var(cY + d) = c² · Var(Y).
We now note that the covariance between a constant and any other random variable is equal to zero. For instance, Cov(b, cY + d) = 0. Using the linearity of the covariance, this gives
Cov(aX + b, cY + d) = ac · Cov(X, Y).
Then,
ρ(aX + b, cY + d) = Cov(aX + b, cY + d)/√(Var(aX + b) · Var(cY + d)) = ac · Cov(X, Y)/√(a² · Var(X) · c² · Var(Y))
= ac · Cov(X, Y)/(|a| · |c| · √(Var(X) · Var(Y))) = Cov(X, Y)/√(Var(X) · Var(Y)) = ρ(X, Y),
where in the last step we used that a, c > 0, so ac = |a| · |c|.
It is useful to observe that ρ(X, X) = 1 and ρ(X, −X) = −1. Two random variables X and Y are called uncorrelated if ρ(X, Y) = 0. Note that if X and Y are independent, then they are uncorrelated, but the converse is not in general true.
Example 5. Let X1 and X2 be two independent random variables with expectation 0 and variance 1.
1. Find ρ(X1, X2).
2. Let Y1 := X1 and Y2 := cX1 + √(1 − c²) · X2, where c ∈ [−1, 1]. Determine E[Y2], E[Y2²] and ρ(Y1, Y2).
Solution.
1. Since X1 and X2 are independent, we have Cov(X1, X2) = 0, so ρ(X1, X2) = 0.
2. We start with
E[Y2] = E[cX1 + √(1 − c²) · X2] = c · E[X1] + √(1 − c²) · E[X2] = 0,
since E[X1] = E[X2] = 0. Next,
E[Y2²] = E[(cX1 + √(1 − c²) · X2)²] = E[c² · X1² + (1 − c²) · X2² + 2c√(1 − c²) · X1 · X2]
= c² · E[X1²] + (1 − c²) · E[X2²] + 2c√(1 − c²) · E[X1] · E[X2]
= c² · E[X1²] + (1 − c²) · E[X2²].
Noting that E[Xi²] = Var(Xi) + (E[Xi])² = 1 for i = 1, 2, we get E[Y2²] = c² + (1 − c²) = 1. Moreover,
Cov(Y1, Y2) = Cov(X1, cX1 + √(1 − c²) X2) = c · Var(X1) + √(1 − c²) · Cov(X1, X2) = c,
and Var(Y1) = 1, Var(Y2) = E[Y2²] − (E[Y2])² = 1, so
ρ(Y1, Y2) = Cov(Y1, Y2)/√(Var(Y1) · Var(Y2)) = c.
ST119: Probability 2 Lecture notes for Week 5
Definition 1. Let X and Y be discrete random variables. For y ∈ supp(Y), the conditional probability mass function of X given Y = y is
pX|Y(x|y) = P(X = x | Y = y) = pX,Y(x, y)/pY(y), for x ∈ supp(X), y ∈ supp(Y).
Hence, pX|Y(x|y) is the probability that X = x, given that Y = y. Let us make some comments about this definition:
For each fixed y ∈ supp(Y), we have ∑_x pX|Y(x|y) = 1. So, for each fixed y, the function that maps x into pX|Y(x|y) is a probability mass function. We refer to the distribution associated to this probability mass function as the distribution of X given that Y = y.
Recall that two discrete random variables X and Y are independent if and only if pX,Y(x, y) = pX(x)pY(y) for all x, y. This is equivalent to saying that pX|Y(x|y) = pX(x) for all x and all y ∈ supp(Y).
Example 1. Let X and Y be discrete with joint probability mass function given by the following table:
x\y    1      2      3
0      1/12   1/12   1/12
1      0      1/2    1/4
Let us find pX|Y(x|y) for all choices of x and y. Note first that
pY(1) = 1/12 + 0 = 1/12, pY(2) = 1/12 + 1/2 = 7/12, pY(3) = 1/12 + 1/4 = 1/3.
Then,
pX|Y(0|1) = (1/12)/(1/12) = 1, pX|Y(1|1) = 0/(1/12) = 0,
pX|Y(0|2) = (1/12)/(7/12) = 1/7, pX|Y(1|2) = (1/2)/(7/12) = 6/7,
pX|Y(0|3) = (1/12)/(1/3) = 1/4, pX|Y(1|3) = (1/4)/(1/3) = 3/4.
Note that pX|Y(x|y) is the proportion that pX,Y(x, y) represents of the total mass given by pY(y) (in the table: the proportion that the entry in position (x, y) represents of the total mass of its column).
Example 2. Let Y ∼ Bin(n, p). Let q ∈ (0, 1) and suppose that X is a random variable with supp(X) = {0, . . . , n} and such that, for each y ∈ {0, . . . , n},
pX|Y(x|y) = \binom{y}{x} · q^x · (1 − q)^{y−x} for x ∈ {0, . . . , y}, and 0 otherwise.
That is, conditionally on Y = y, X ∼ Bin(y, q). Let us show that X ∼ Bin(n, pq). Using pY(y) = \binom{n}{y} p^y (1 − p)^{n−y}, we compute, for x ∈ {0, . . . , n},
pX(x) = ∑_{y=x}^n pX|Y(x|y) · pY(y) = ∑_{y=x}^n \binom{y}{x} q^x (1 − q)^{y−x} \binom{n}{y} p^y (1 − p)^{n−y}
= [n!/(x!(n − x)!)] · (pq)^x · ∑_{y=x}^n [(n − x)!/((y − x)!(n − y)!)] · (p(1 − q))^{y−x} · (1 − p)^{n−y}
= [n!/(x!(n − x)!)] · (pq)^x · (p(1 − q) + (1 − p))^{n−x}
= [n!/(x!(n − x)!)] · (pq)^x · (1 − pq)^{n−x}.
Hence, X ∼ Bin(n, pq).
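The conclusion X ∼ Bin(n, pq) (a “thinning” of a binomial) can be checked by simulation; the Python sketch below (ours, with arbitrary n, p, q) compares the empirical mean of X with npq.

    import random

    n, p, q, runs = 20, 0.6, 0.3, 200_000

    def sample_X():
        y = sum(random.random() < p for _ in range(n))     # Y ~ Bin(n, p)
        return sum(random.random() < q for _ in range(y))  # X | Y=y ~ Bin(y, q)

    xs = [sample_X() for _ in range(runs)]
    print(sum(xs) / runs, n * p * q)  # mean of Bin(n, pq) is n*p*q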
Example 3. Let X and Y be independent with X ∼ Poi(λ) and Y ∼ Poi(µ). Let Z = X + Y. Find pX|Z(x|z).
Solution. Recall from the Week 3 lecture notes (first proposition) that Z ∼ Poi(λ + µ). Note that, given that Z = z, X can take any value in {0, . . . , z}. Hence, pX|Z(x|z) is defined for all pairs (x, z) such that z ∈ N0 and x ∈ {0, . . . , z}. We compute
pX|Z(x|z) = P(X = x, Z = z)/P(Z = z) = P(X = x, Y = z − x)/P(Z = z) = P(X = x) · P(Y = z − x)/P(Z = z),
where the last equality follows from the independence between X and Y. The right-hand side equals
[e^{−λ} λ^x/x!] · [e^{−µ} µ^{z−x}/(z − x)!] / [e^{−(λ+µ)} (λ + µ)^z/z!] = [z!/(x!(z − x)!)] · (λ/(λ + µ))^x · (µ/(λ + µ))^{z−x}.
Definition. Let X and Y be jointly continuous random variables. The conditional probability density function of X given Y = y is
fX|Y(x|y) = fX,Y(x, y)/fY(y),
for all y such that fY(y) > 0.
As in the discrete case, let us make some comments about this definition.
As a function of x for fixed y, fX|Y(x|y) is a probability density function, since
∫_{−∞}^∞ fX|Y(x|y) dx = ∫_{−∞}^∞ [fX,Y(x, y)/fY(y)] dx = (1/fY(y)) ∫_{−∞}^∞ fX,Y(x, y) dx = 1.
We refer to the distribution that corresponds to this probability density function as the distribution of X given that Y = y. Note that this is just a way to say things, since formally {Y = y} is an event of probability zero, so we are not really allowed to condition on it.
While in the discrete case we had pX|Y(x|y) = P(X = x, Y = y)/P(Y = y), it does not make sense to write such a quotient for fX|Y(x|y) (both the numerator and the denominator are zero!).
Assume that A ⊆ R² is a set of the form
A = {(x, y) : x1 ≤ x ≤ x2, a(x) ≤ y ≤ b(x)}.
Then, we have
P((X, Y) ∈ A) = ∫_{x1}^{x2} ∫_{a(x)}^{b(x)} fX,Y(x, y) dy dx = ∫_{x1}^{x2} fX(x) ∫_{a(x)}^{b(x)} fY|X(y|x) dy dx.
As in the discrete case, X and Y are independent if and only if fX|Y(x|y) = fX(x) for all y such that fY(y) > 0.
Definition (Bivariate normal distribution). We say that (X, Y) ∼ N(µX, σX², µY, σY², ρ) if the joint probability density function is
fX,Y(x, y) = [1/(2π σX σY √(1 − ρ²))] · exp{ −[1/(2(1 − ρ²))] [ ((x − µX)/σX)² + ((y − µY)/σY)² − 2ρ (x − µX)(y − µY)/(σX σY) ] }.
We will now see several properties of this joint distribution, starting with the marginals.
Proposition 1. If (X, Y) ∼ N(µX, σX², µY, σY², ρ), then X ∼ N(µX, σX²) and Y ∼ N(µY, σY²).
Proof. This proof is not examinable. We will only prove the statement for X (since the statement for Y is treated in the same way, or even better, by symmetry). In order to render the expression for fX,Y(x, y) more manageable, we let w = (y − µY)/σY, so that
((x − µX)/σX)² + ((y − µY)/σY)² − 2ρ (x − µX)(y − µY)/(σX σY) = ((x − µX)/σX)² + w² − 2ρ w (x − µX)/σX. (2)
We now complete the square as follows:
w² − 2ρ w (x − µX)/σX = (w − ρ(x − µX)/σX)² − ρ² ((x − µX)/σX)².
Plugging this into (2), the expression becomes
(1 − ρ²) ((x − µX)/σX)² + (w − ρ(x − µX)/σX)². (3)
Now, we compute the marginal probability density function of X using
fX(x) = ∫_{−∞}^∞ fX,Y(x, y) dy,
replacing the expression for fX,Y(x, y) by what we obtained in (3), and using the substitution w = (y − µY)/σY (which gives dy = σY dw). We then obtain
fX(x) = [1/(2π σX σY √(1 − ρ²))] · exp{−(x − µX)²/(2σX²)} · σY ∫_{−∞}^∞ exp{ −[1/(2(1 − ρ²))] (w − ρ(x − µX)/σX)² } dw
= [1/(√(2π) σX)] · exp{−(x − µX)²/(2σX²)} × [1/√(2π(1 − ρ²))] ∫_{−∞}^∞ exp{ −[1/(2(1 − ρ²))] (w − ρ(x − µX)/σX)² } dw
= [1/√(2πσX²)] · exp{−(x − µX)²/(2σX²)},
where in the last step we used that the integrand is the density of a normal distribution (with mean ρ(x − µX)/σX and variance 1 − ρ²), so the integral equals 1.
As you may have guessed, the value ρ ∈ (−1, 1) is the correlation coefficient between X and Y, but we will not prove that now. Next, we consider conditional density functions:
Proposition 2. Assume that (X, Y) ∼ N(µX, σX², µY, σY², ρ). Then, conditionally on Y = y, the distribution of X is
N(µX + ρ (σX/σY)(y − µY), (1 − ρ²) σX²).
Proof. This proof is not examinable. Throughout this proof, we will denote expressions that depend on constants and on y, but not on x, by C1, C2, etc. With this convention, we can write
fX|Y(x|y) = fX,Y(x, y)/fY(y) = C1 · fX,Y(x, y)
= C2 · exp{ −[1/(2(1 − ρ²))] [ ((x − µX)/σX)² + ((y − µY)/σY)² − 2ρ (x − µX)(y − µY)/(σX σY) ] }
= C2 · exp{ −[1/(2(1 − ρ²))] [ x²/σX² − 2xµX/σX² + µX²/σX² + ((y − µY)/σY)² − 2ρ x(y − µY)/(σX σY) + 2ρ µX(y − µY)/(σX σY) ] }
(the terms µX²/σX², ((y − µY)/σY)² and 2ρ µX(y − µY)/(σX σY) do not involve x, so they can be absorbed into the constant)
= C3 · exp{ −[1/(2(1 − ρ²)σX²)] [ x² − 2xµX − 2ρ (σX/σY) x(y − µY) ] }
= C3 · exp{ −[1/(2(1 − ρ²)σX²)] [ x² − 2x (µX + ρ (σX/σY)(y − µY)) ] }.
We now complete the square:
x² − 2x (µX + ρ (σX/σY)(y − µY)) = (x − (µX + ρ (σX/σY)(y − µY)))² + C̃,
where C̃ does not depend on x. Hence,
fX|Y(x|y) = C4 · exp{ −(x − (µX + ρ (σX/σY)(y − µY)))² / (2(1 − ρ²)σX²) }.
On the other hand, integrating the density of N(µX + ρ (σX/σY)(y − µY), (1 − ρ²)σX²), we have
1 = ∫_{−∞}^∞ [1/√(2π(1 − ρ²)σX²)] exp{ −(x − (µX + ρ (σX/σY)(y − µY)))²/(2(1 − ρ²)σX²) } dx. (6)
Since fX|Y(·|y) is a probability density function and is a constant multiple of the integrand in (6), the constant C4 must equal 1/√(2π(1 − ρ²)σX²), which proves the proposition.
Proposition 3. Assume that (X, Y) ∼ N(µX, σX², µY, σY², ρ). Then X and Y are independent if and only if ρ = 0.
Proof. Recall that X and Y are independent if and only if fX,Y(x, y) = fX(x)fY(y) holds for all x, y. Since fY(y) is (strictly) positive when Y is normally distributed, we have fX,Y(x, y) = fX(x)fY(y) for all x, y if and only if fX|Y(x|y) = fX(x) for all x, y. By the above proposition, we can see that this holds if and only if ρ = 0.
Remark 1. We have seen earlier that for two random variables X and Y, independence implies ρX,Y = 0, but that the converse implication does not hold in general. We now see that, if we also know that (X, Y) follow a bivariate normal distribution, then equivalence holds, that is, X, Y are independent if and only if ρX,Y = 0.
ST119: Probability 2 Lecture notes for Week 6
Definition 1. Let X and Y be random variables. The conditional expectation of X given Y = y is
E[X|Y = y] := ∑_x x · pX|Y(x|y)
if X and Y are discrete, and
E[X|Y = y] := ∫_{−∞}^∞ x · fX|Y(x|y) dx
if X and Y are jointly continuous
(assuming, for cases where the sum is infinite, that it is well defined).
Example 1. Let X and Y be jointly continuous with fX,Y(x, y) = 2 if 0 < x < y < 1 (and 0 otherwise). Let us compute E[X|Y = y] for all y ∈ (0, 1). We have that, for all y ∈ (0, 1),
fY(y) = ∫_0^y fX,Y(x, y) dx = 2y,
so
fX|Y(x|y) = fX,Y(x, y)/fY(y) = 2/(2y) = 1/y, 0 < x < y,
which gives
E[X|Y = y] = ∫_0^y x · (1/y) dx = (1/y) · (y²/2) = y/2.
Let us observe that the conditional expectation E[X|Y = y] satisfies the same properties as the (unconditional) expectation you have already studied. For instance, for a function g, we have, in the discrete case,
E[g(X)|Y = y] = ∑_x g(x) · pX|Y(x|y).
Proposition 1. Let X and Y be random variables. Then, if X and Y are independent,
E[X|Y = y] = E[X].
Proof. Assume that X and Y are independent. In the discrete case, we have pX,Y(x, y) = pX(x)pY(y), so
pX|Y(x|y) = pX,Y(x, y)/pY(y) = pX(x)pY(y)/pY(y) = pX(x),
so
E[X|Y = y] = ∑_x x · pX|Y(x|y) = ∑_x x · pX(x) = E[X].
Example 2. Assume that X and Y are independent random variables with X ∼ Poi(λ) and Y ∼ Poi(µ). Let Z = X + Y. In the example in page 3 of the Week 5 lecture notes, we have seen that, for all z ∈ N0,
pX|Z(x|z) = \binom{z}{x} (λ/(λ + µ))^x (µ/(λ + µ))^{z−x}, x ∈ {0, . . . , z},
that is, the distribution of X given that Z = z is Bin(z, λ/(λ + µ)). This implies that
E[X|Z = z] = [λ/(λ + µ)] · z.
Similarly,
E[Z|Y = y] = y + λ.
Recall that, for an event A and a discrete random variable Y, we have P(A) = ∑_y P(A | Y = y) · P(Y = y). This is sometimes called the law of total probability. Analogously, in this module, we have formulas of the kind
pX(x) = ∑_y pX|Y(x|y) · pY(y)
in the discrete case, and
fX(x) = ∫_{−∞}^∞ fX|Y(x|y) · fY(y) dy
in the continuous case. We will now see an analogous formula for computing expectations via conditioning.
Proposition 2 (Partition formula for expectations). Let X and Y be random variables. Then,
E[X] = ∑_y E[X|Y = y] · pY(y)
if Y is discrete, and
E[X] = ∫_{−∞}^∞ E[X|Y = y] · fY(y) dy
if X and Y are jointly continuous
(assuming, for cases where the sum is infinite, that it is well defined).
Example 3. Let X and Y be jointly continuous, with Y ∼ Unif(0, 1) and, for y ∈ (0, 1),
fX|Y(x|y) = 1/y, 0 < x < y.
Then, for x ∈ (0, 1),
fX(x) = ∫_{−∞}^∞ fX|Y(x|y) · fY(y) dy = ∫_x^1 (1/y) dy = −log(x).
Example 4. Problem: find the expected amount of money I spend in a visit to the bookstore, given the following information. Every time I go there, I buy a random number N of books. The prices of the books I buy are independent random variables (also independent of N) and all follow the same distribution. The expectation of N is µ and the expectation of the price of a book I buy is ν.
Solution. It is assumed that the prices of the books I pick all follow the same distribution. Let X1, X2, . . . be independent random variables, all following this distribution. Then, the amount of money I spend in a visit to the bookstore is the random variable
X = ∑_{i=1}^N Xi,
where N is independent of X1, X2, . . . . By Proposition 2,
E[X] = ∑_{n=1}^∞ E[∑_{i=1}^N Xi | N = n] · pN(n).
On the event {N = n}, we can replace ∑_{i=1}^N Xi by ∑_{i=1}^n Xi, so the right-hand side above equals
∑_{n=1}^∞ E[∑_{i=1}^n Xi | N = n] · pN(n).
Since N is independent of X1, . . . , Xn, we have E[∑_{i=1}^n Xi | N = n] = E[∑_{i=1}^n Xi] = n · ν. Hence,
E[X] = ∑_{n=1}^∞ n · ν · pN(n) = ν ∑_{n=1}^∞ n · pN(n) = ν · E[N] = ν · µ.
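Here is a small Python simulation of ours illustrating E[X] = νµ; the specific distributions chosen (N uniform on {1, . . . , 5}, so µ = 3, and exponential prices with mean ν = 12) are arbitrary illustrative assumptions.

    import random

    def visit_cost():
        n = random.randint(1, 5)  # N uniform on {1,...,5}, so E[N] = 3
        return sum(random.expovariate(1 / 12) for _ in range(n))  # prices, mean 12

    runs = 200_000
    print(sum(visit_cost() for _ in range(runs)) / runs, 12 * 3)  # ~ nu * mu = 36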
Example 5 (Correlation in bivariate normal). Recall from Week 5 that X and Y follow a bivariate normal distribution with parameters µX, µY, σX², σY² and ρ if they have joint probability density function
fX,Y(x, y) = [1/(2π σX σY √(1 − ρ²))] · exp{ −[1/(2(1 − ρ²))] [ ((x − µX)/σX)² + ((y − µY)/σY)² − 2ρ (x − µX)(y − µY)/(σX σY) ] }.
We will now show that ρ is the correlation coefficient of X and Y. To this end, we must compute
Cov(X, Y)/√(Var(X)Var(Y)) = (E[XY] − µX µY)/(σX σY), (1)
where the equality follows from X ∼ N(µX, σX²), Y ∼ N(µY, σY²), which we found in Proposition 1 in the Week 5 notes. It remains to compute E[XY], which we do by conditioning:
E[XY] = ∫_{−∞}^∞ E[XY|Y = y] · fY(y) dy = ∫_{−∞}^∞ E[Xy|Y = y] · fY(y) dy
= ∫_{−∞}^∞ y · E[X|Y = y] · fY(y) dy. (2)
In Proposition 2 in the Week 5 notes, we showed that conditionally on Y = y, the distribution of X is N(µX + ρ (σX/σY)(y − µY), (1 − ρ²)σX²). In particular, E[X|Y = y] = µX + ρ (σX/σY)(y − µY), so (2) becomes
∫_{−∞}^∞ y · (µX + ρ (σX/σY)(y − µY)) · fY(y) dy = E[µX Y + ρ (σX/σY)(Y² − µY Y)]
= µX · E[Y] + ρ (σX/σY)(E[Y²] − µY E[Y])
= µX µY + ρ (σX/σY) · σY² = µX µY + ρ σX σY.
Plugging this into (1), we obtain that the correlation coefficient of X and Y equals ρ, as claimed.
Remark. More generally, suppose X1, X2, . . . are independent with common expectation E[X1], and N is independent of all the Xi's. Let SN = X1 + · · · + XN. Then, as in Example 4,
E[SN | N = n] = E[X1 + · · · + Xn] = n · E[X1],
so
E[SN] = ∑_n E[SN | N = n] · pN(n) = E[N] · E[X1].
ST119: Probability 2 Lecture notes for Week 7
Proposition 1. Let X be a random variable with E[X²] < ∞. Among all constants c ∈ R, the quantity E[(X − c)²] is minimized by c = E[X].
Proof. We compute
E[(X − c)²] = E[X²] − 2c E[X] + c².
This is a quadratic function of c which is minimized at the value of c for which the derivative with respect to c (that is, 2c − 2E[X]) equals zero, namely c = E[X].
We can interpret this proposition as follows. Suppose that we don't know the value that X will attain, but we want to choose a deterministic, fixed constant c which is a “good prediction” for X. There are different possible ways to say what a good prediction is, but in the present setting, let us say that we would like to guarantee that the squared difference (X − c)² is not too large. Since this squared difference is also a random variable, we choose c that minimizes its expectation. By the above proposition, the optimal choice for c is precisely c = E[X].
Conditional expectations and prediction. Now suppose that we have two random vari-
ables, X and Y , and that we observe the value of Y , but not the value of X. Suppose that, for
each possible result Y = y that we observe, we want to come up with a good prediction for X
(in the sense of the previous paragraph). This means that now, instead of choosing a single
value c 2 R, we want to choose a function c(y), which gives the prediction for X when Y = y.
The following proposition tells us that the best choice is taking c(y) = E[X|Y = y].
Proposition 2. Let X and Y be random variables. Among all functions c(y), the one that minimizes E[(X − c(Y))²] is c(y) = E[X|Y = y].
Proof. In the discrete case, we can write
E[(X − c(Y))²] = ∑_y E[(X − c(y))² | Y = y] · pY(y).
Now, repeating the proof of the previous proposition, we obtain that for each fixed value of y, the number c(y) that minimizes E[(X − c(y))² | Y = y] is c(y) = E[X|Y = y]. This concludes the proof in the discrete case, and the jointly continuous case is treated similarly.
For example, if X ∼ Poi(λ) and Z ∼ Poi(µ) are independent and Y = X + Z, then Y ∼ Poi(λ + µ) and, conditionally on Y = y, we have X ∼ Bin(y, λ/(λ + µ)); so the best prediction of X given Y = y is E[X|Y = y] = λy/(λ + µ).
Recall that Γ(m) = (m − 1)! for m ∈ N, and that the Gamma distribution with parameters w = 1 and λ is equal to the exponential distribution with parameter λ.
Proposition 4. Let T1, . . . , Tn be independent random variables, all with the exponential distribution with parameter λ. Then,
Z := ∑_{i=1}^n Ti ∼ Gamma(n, λ).
Proof (sketch). One argues by induction on n, computing the density of the sum of a Gamma(n, λ) random variable and an independent Exp(λ) random variable by convolution; the computation gives
fZ(z) = (λ^{n+1}/n!) · z^n · e^{−λz}, z > 0,
where the last step follows from the fact that Γ(n + 1) = n Γ(n) = n!; so Z has the probability density function of the Gamma(n + 1, λ) distribution.
We readily obtain the following consequence:
Corollary 5. Let m, n ∈ N and λ > 0. Let Z1, Z2 be independent random variables, with Z1 ∼ Gamma(m, λ) and Z2 ∼ Gamma(n, λ). Then,
Z1 + Z2 ∼ Gamma(m + n, λ).
Poisson process
Consider a bank branch where clients arrive at random times. Measuring time from t = 0 (say, noon), let
0 < S1 < S2 < · · ·
denote the successive arrival times of clients. Define
T1 = S1, T2 = S2 − S1, . . . , Tn = Sn − Sn−1, . . . ,
so that, for n ≥ 2, Tn is the time elapsed between the arrival of client n − 1 and client n. Also note that
Sn = ∑_{i=1}^n Ti, n ∈ N.
Next, define
Nt = number of clients that have arrived by time t, t ≥ 0,
so Nt is:
Nt = 0 if 0 ≤ t < S1; 1 if S1 ≤ t < S2; 2 if S2 ≤ t < S3; and so on.
We plot this as follows:
[Figure: the staircase path t ↦ Nt, jumping from 0 to 1 at S1, to 2 at S2, to 3 at S3, to 4 at S4; the interarrival times T1, T2, . . . are the lengths of the horizontal steps.]
More concisely,
Nt = max{n : Sn ≤ t}.
Definition 1. As above, let T1, T2, . . . be independent, all ∼ Exp(λ), let S0 = 0, Sn = ∑_{i=1}^n Ti for n ∈ N, and let Nt = max{n : Sn ≤ t} for t ≥ 0. The family of random variables
(Nt)_{t≥0}
is called a Poisson process with intensity λ.
A family of random variables indexed by time is called a stochastic process. The Poisson process is the only stochastic process we will encounter in this module. Later you will see many more, such as Markov chains, Brownian motion, martingales, diffusion processes etc.
Why ‘Poisson process’?
Theorem 1. Let (Nt : t ≥ 0) be a Poisson process with intensity λ. Then, for any t > 0,
Nt ∼ Poi(λt).
Proof. Fix t > 0 and j ≥ 1 (the case j = 0 is simpler: pNt(0) = P(T1 > t) = e^{−λt}). We have
pNt(j) = P(Nt = j) = P(Sj ≤ t, Sj+1 > t)
(to justify the last equality: the event that there are exactly j arrivals at time t is equal to the event that arrival number j occurred before or at time t, and arrival j + 1 occurred after time t). Recalling that Sj+1 = Sj + Tj+1, the probability on the right-hand side above equals
P(Sj ≤ t, Tj+1 > t − Sj).
This corresponds to the event that (Sj, Tj+1) belongs to the blue set in the picture:
[Figure: the set {(x, y) : 0 ≤ x ≤ t, y > t − x} in the (x, y)-plane, shaded in blue.]
So we have to integrate fSj,Tj+1 over this set to obtain the desired probability. To this end, let us find fSj,Tj+1. Note that Sj and Tj+1 are independent. We know that Tj+1 ∼ Exp(λ); moreover, by the corollary seen earlier, we have Sj ∼ Gamma(j, λ). We thus have
fSj,Tj+1(x, y) = fSj(x) · fTj+1(y) = [(λ^j · x^{j−1}/Γ(j)) · e^{−λx}] · [λ · e^{−λy}].
We are now ready to compute
P(Sj ≤ t, Tj+1 > t − Sj) = ∫_0^t ∫_{t−x}^∞ fSj,Tj+1(x, y) dy dx
= (λ^{j+1}/Γ(j)) ∫_0^t x^{j−1} · e^{−λx} ∫_{t−x}^∞ e^{−λy} dy dx
= (λ^j/Γ(j)) ∫_0^t x^{j−1} · e^{−λx} · e^{−λ(t−x)} dx
= (λ^j/Γ(j)) · e^{−λt} ∫_0^t x^{j−1} dx
= (λ^j/Γ(j)) · e^{−λt} · t^j/j = ((λt)^j/j!) · e^{−λt},
where in the last equality we used Γ(j) = (j − 1)!. We have now proved that pNt(j) is equal to the probability mass function of the Poisson(λt) distribution evaluated at j, as desired.
The following is a stronger version of the above theorem. We will not give a proof in this module.
Theorem 2. Let (Nt : t ≥ 0) be a Poisson process with intensity λ. Then, for any sequence of times 0 < t1 < · · · < tk, the increments
Nt1, Nt2 − Nt1, . . . , Ntk − Ntk−1
are independent, with Nti − Nti−1 ∼ Poi(λ(ti − ti−1)) for each i.
ST119: Probability 2 Lecture notes for Week 8
Definition. Let k ∈ N. The kth moment of a random variable X is
E[X^k],
whenever the expectation exists. In this case, we say that X has finite kth moment.
Example. Let X ∼ Exp(λ). Then, for k ∈ N,
E[X^k] = ∫_0^∞ x^k · λ e^{−λx} dx. (1)
We could compute the right-hand side by integrating by parts repeatedly, but there is a shortcut. Recall that Y ∼ Gamma(w, λ) if it has density fY(x) = (λ^w/Γ(w)) · x^{w−1} · e^{−λx}, for x > 0. So we take w = k + 1, so that
1 = ∫_{−∞}^∞ fY(x) dx = ∫_0^∞ (λ^{k+1}/Γ(k + 1)) · x^k · e^{−λx} dx.
The idea is now to manipulate the integral in (1) to make the Gamma density appear:
∫_0^∞ x^k · λ e^{−λx} dx = (Γ(k + 1)/λ^k) · ∫_0^∞ (λ^{k+1}/Γ(k + 1)) · x^k · e^{−λx} dx = Γ(k + 1)/λ^k.
In conclusion,
E[X^k] = Γ(k + 1)/λ^k = k!/λ^k.
What are moments good for? We have already seen some important uses of the expectation and the variance (which is calculated from the expectation and the second moment). Namely, they appear in inequalities that give us information about the distribution of random variables, when exact formulas are difficult to obtain. Recall in particular that Markov's inequality stated that, for a non-negative random variable X, we have P(X ≥ x) ≤ E[X]/x. The following theorem says that this bound can be improved (at least asymptotically) in case X has finite higher-order moments.
Theorem (Markov’s inequality, higher-order moments). Let X be a non-negative ran-
dom variable with finite kth moment. Then,
E[X k ]
P(X > x) 6 , x > 0.
xk
Proof. Let Y = X k . We write
P(X > x) = P(X k > xk ) = P(Y > xk ).
By Markov’s inequality, the right-hand side is smaller than or equal to
E[Y ] E[X k ]
= .
xk xk
Note that the upper bound for P(X > x) obtained above is E[X ] k
xk
, whereas the one from the
standard Markov’s inequality is x . For large values of x, xk is much smaller than E[X]
E[X] E[X ]
k
x
, so
the upper bound obtained from the above theorem is much better.
Definition. The moment-generating function of a random variable X is the function MX defined as
MX(t) := E[e^{tX}]
for all t ∈ R for which the expectation is well defined.
Before we explain the name “moment-generating function”, let us compute it in a few examples.
Example. Let X ∼ Ber(p). Then, for any t ∈ R,
MX(t) = E[e^{tX}] = e^{t·0} · (1 − p) + e^{t·1} · p = 1 − p + p e^t.
Example. Let X ∼ N(0, 1). Then, for any t ∈ R,
MX(t) = E[e^{tX}] = ∫_{−∞}^∞ e^{tx} · (1/√(2π)) e^{−x²/2} dx = e^{t²/2} ∫_{−∞}^∞ (1/√(2π)) e^{−(x−t)²/2} dx,
where we completed the square: tx − x²/2 = t²/2 − (x − t)²/2. Since (1/√(2π)) e^{−(x−t)²/2} is fY(x), where Y ∼ N(t, 1), we have
∫_{−∞}^∞ (1/√(2π)) e^{−(x−t)²/2} dx = 1.
In conclusion,
MX(t) = e^{t²/2}, t ∈ R.
The following theorem explains the reason for the name “moment-generating function”.
Theorem 1. Assume that MX exists in a neighborhood of 0, that is, there exists ε > 0 such that for all t ∈ (−ε, ε) we have that MX(t) exists. Then, for k = 0, 1, . . ., the kth moment of X exists, and we have
E[X^k] = (d^k/dt^k) MX(t) |_{t=0}.
Although we do not give a full proof of this theorem, let us sketch the idea involved. Using the Taylor expansion of the exponential function, we have
MX(t) = E[e^{tX}] = E[∑_{k=0}^∞ (tX)^k/k!].
Now (after giving a rigorous justification, omitted here, for exchanging the expectation with an infinite sum), the right-hand side becomes
∑_{k=0}^∞ E[(tX)^k]/k! = 1 + E[X] · t + (E[X²]/2!) · t² + (E[X³]/3!) · t³ + · · · .
Differentiating the right-hand side k times with respect to t and evaluating the result at t = 0 gives the desired equality.
Before seeing an example of application of the above theorem, we prove some simple properties
of moment-generating functions.
Proposition 1. Assume that all expectations in the statement are well defined.
1. For any a, b ∈ R,
M_{aX+b}(t) = e^{tb} · MX(at).
2. If X and Y are independent, then
M_{X+Y}(t) = MX(t) · MY(t).
Proof. 1. We compute
M_{aX+b}(t) = E[e^{t(aX+b)}] = e^{tb} · E[e^{(at)X}] = e^{tb} · MX(at).
2. By independence of X and Y (and hence of e^{tX} and e^{tY}),
M_{X+Y}(t) = E[e^{tX} · e^{tY}] = E[e^{tX}] · E[e^{tY}] = MX(t) · MY(t).
Example. Let X ∼ N(0, 1) again. Using an example in the lecture notes of Week 2 (top of page 7), we have that, if µ ∈ R and σ² > 0, then Y = σX + µ has the N(µ, σ²) distribution. Using the above proposition, we can now compute
MY(t) = M_{σX+µ}(t) = e^{tµ} · MX(σt) = exp{σ²t²/2 + µt}.
Now that we know the moment-generating function of Y ∼ N(µ, σ²), let us use Theorem 1 to re-obtain that E[Y] = µ and Var(Y) = σ². We first compute
E[Y] = (d/dt) MY(t) |_{t=0} = (σ²t + µ) · exp{σ²t²/2 + µt} |_{t=0} = µ
and
E[Y²] = (d²/dt²) MY(t) |_{t=0} = [σ² · exp{σ²t²/2 + µt} + (σ²t + µ)² · exp{σ²t²/2 + µt}] |_{t=0} = σ² + µ².
Hence,
Var(Y) = E[Y²] − E[Y]² = σ² + µ² − µ² = σ².
Recall that it was much harder, earlier in the module, to compute this variance through a direct computation using the definition. Now, if we want higher moments (such as E[Y³], E[Y⁴], . . .) it is relatively straightforward to obtain them by further differentiating MY(t).
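Differentiating the moment-generating function can also be automated with a computer algebra system; the following sketch (ours) uses the sympy library, assuming it is available, to recover E[Y] = µ and Var(Y) = σ² from MY.

    import sympy as sp

    t, mu, sigma = sp.symbols('t mu sigma', positive=True)
    M = sp.exp(mu * t + sigma**2 * t**2 / 2)   # mgf of N(mu, sigma^2)

    EY = sp.diff(M, t, 1).subs(t, 0)           # first moment
    EY2 = sp.diff(M, t, 2).subs(t, 0)          # second moment
    print(sp.simplify(EY), sp.simplify(EY2 - EY**2))  # mu and sigma**2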
Now let X ∼ N(µX, σX²) and Y ∼ N(µY, σY²) be independent. Since X and Y are independent, we can use part (2) of Proposition 1 to obtain
M_{X+Y}(t) = MX(t) · MY(t) = exp{µX t + σX² t²/2} · exp{µY t + σY² t²/2}
= exp{(µX + µY) t + (σX² + σY²) t²/2}.
This shows that X + Y has the same moment-generating function as an N(µX + µY, σX² + σY²) random variable. Since this moment-generating function is defined in a neighborhood of the origin, we conclude that X + Y ∼ N(µX + µY, σX² + σY²).
Although we will not cover it in this module, let us mention that there is an alternative to the moment-generating function, called the characteristic function. For a random variable X, it is defined as
φX(t) = E[e^{itX}], t ∈ R,
where i is the imaginary unit, i = √−1.
ST119: Probability 2 Lecture notes for Week 9
Theorem (Weak Law of Large Numbers). Let X1, X2, . . . be independent, identically distributed random variables with finite mean µ and finite variance σ², and let X̄n = (X1 + · · · + Xn)/n. Then,
for every ε > 0, P(|X̄n − µ| > ε) → 0 as n → ∞. (1)
We can interpret (1) as follows. Let us say that we pick a number ε > 0, possibly very small, and then we start declaring that two real numbers a and b are far from each other if the distance between them (= |a − b|) is more than ε. If ε is tiny, this is a very demanding notion of ‘far’; for instance, if ε = 10^{−10}, we are saying that two numbers that differ by more than 10^{−10} are far from each other. No matter: even when we are this demanding, if we take n large enough in our sampling X1, X2, . . . , Xn, then the average X̄n will be close to µ, with high probability.
It is important to note that no assumptions on the specific distributions of the Xi ’s are made.
We only require the Xi ’s to have finite mean and variance, for each i.
Proof. Let X̄n = (X1 + · · · + Xn)/n. Note that
E[X̄n] = (1/n) ∑_{i=1}^n E[Xi] = (1/n) · nµ = µ
and, by independence,
Var(X̄n) = (1/n²) ∑_{i=1}^n Var(Xi) = σ²/n.
By Chebyshev's inequality, for any ε > 0,
P(|X̄n − µ| > ε) ≤ Var(X̄n)/ε² = σ²/(nε²) → 0 as n → ∞.
Remark. The reason for the word ‘weak’ in ‘Weak Law of Large Numbers’ is that there is also a Strong Law of Large Numbers. The difference between the two laws has to do with different forms of convergence for sequences of random variables.
Recall that for a sequence of real numbers (xn : n ∈ N), there is a well established notion of convergence, which you may have seen in Calculus or Analysis:
xn → x as n → ∞ if for every ε > 0 there exists n0 ∈ N such that |xn − x| < ε for all n ≥ n0.
In contrast, for sequences of random variables, there are several possible definitions of convergence. Although this is not examinable, let us briefly look at two of these notions, just so that you can see the difference between the Weak and the Strong Laws.
Given a sequence of random variables (Yn : n ∈ N) and a random variable Y, we say that (Yn) converges to Y in probability if
for every ε > 0, we have P(|Yn − Y| > ε) → 0 as n → ∞.
Note that the Weak Law of Large Numbers says that the sequence (Yn ), defined by Yn = X̄n
for each n, converges in probability to the random variable Y that is constant, equal to E[X1 ].
Next, we say that a sequence of random variables (Yn : n ∈ N) converges almost surely to a random variable Y if
P(lim_{n→∞} Yn = Y) = 1.
Example 1. Let X1, X2, . . . be independent random variables, all with the Unif(0, 1) distribution, and let
Yn = (X1² + · · · + Xn²)/n.
Prove that there is a constant c ∈ R such that
for any ε > 0, P(|Yn − c| > ε) → 0 as n → ∞.
Solution. The random variables Z1 = X1², Z2 = X2², . . . are independent and identically distributed. Their expectation is equal to
E[Z1] = E[X1²] = ∫_{−∞}^∞ x² · fX1(x) dx = ∫_0^1 x² dx = 1/3.
They also have finite variance, since Var(Z1) = Var(X1²) = E[X1⁴] − E[X1²]² = ∫_0^1 x⁴ dx − (∫_0^1 x² dx)², and both integrals are finite. By the Law of Large Numbers, we have that
for any ε > 0, P(|Yn − 1/3| > ε) → 0 as n → ∞.
Example 2. We roll a die successively and deem each result of 6 a success (other results are failures). Prove that the probability that we need to roll the die more than $7n$ times to obtain $n$ successes tends to zero as $n \to \infty$.
Solution. The number of times we need to roll the die to obtain $n$ successes is $X_1 + \cdots + X_n$, where $X_1$ is the number of rolls until (and including) the first success, and for $i > 1$, $X_i$ is the number of rolls from (and not including) the $(i-1)$-th success to (and including) the $i$-th success. Here's an example, with $n = 4$:
\[
\underbrace{1, 5, 3, 4, 4, 6}_{X_1 = 6},\ \underbrace{3, 6}_{X_2 = 2},\ \underbrace{6}_{X_3 = 1},\ \underbrace{5, 5, 2, 3, 3, 1, 6}_{X_4 = 7}.
\]
Note that $X_1, X_2, \ldots$ are independent and identically distributed, all with the geometric distribution with parameter $p = \frac{1}{6}$. This distribution has expectation equal to 6 and finite variance. The Law of Large Numbers gives
\[
\text{for any } \varepsilon > 0, \quad P\Big(\Big|\frac{X_1 + \cdots + X_n}{n} - 6\Big| > \varepsilon\Big) \xrightarrow{n\to\infty} 0.
\]
We then write
\[
P(X_1 + \cdots + X_n > 7n) = P\Big(\frac{X_1 + \cdots + X_n}{n} > 7\Big) = P\Big(\frac{X_1 + \cdots + X_n}{n} - 6 > 1\Big) \le P\Big(\Big|\frac{X_1 + \cdots + X_n}{n} - 6\Big| > 1\Big) \xrightarrow{n\to\infty} 0,
\]
as required.
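A quick numerical illustration of this decay, assuming numpy is available. The sketch uses the equivalence that needing more than $7n$ rolls for $n$ successes means seeing fewer than $n$ sixes among the first $7n$ rolls:

```python
import numpy as np

rng = np.random.default_rng(0)

# Needing more than 7n rolls for n sixes <=> fewer than n sixes among the
# first 7n rolls; the number of sixes in 7n fair-die rolls is Bin(7n, 1/6).
for n in [5, 20, 100, 400]:
    sixes = rng.binomial(7 * n, 1/6, size=100_000)
    print(n, np.mean(sixes < n))   # tends to 0 as n grows
```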
Example 3. Suppose that we are interested in finding the area of a two-dimensional set $A$ contained in the square $\{(x, y) : -1 \le x, y \le 1\}$. The way to find this area exactly is to solve the integral
\[
\mathrm{Area}(A) = \iint_A 1\, dx\, dy.
\]
For certain sets $A$, this integral may be too hard or impossible to solve. We will now see how we can use random variables and the Law of Large Numbers to approximate the value of the area. This example is a rudimentary form of the Monte Carlo method.
Suppose that we have a computer that can generate a sequence $X_1, X_2, \ldots$ of independent random variables, all with the Unif(−1, 1) distribution. We first create random vectors
\[
(X_1, X_2),\ (X_3, X_4),\ \ldots
\]
Note that these two-dimensional vectors are independent, all with the same probability density function
\[
f_{X_1, X_2}(x, y) = f_{X_1}(x)\cdot f_{X_2}(y) = \begin{cases} \frac{1}{4} & \text{if } -1 \le x, y \le 1; \\ 0 & \text{otherwise.} \end{cases}
\]
Next, define, for $i \ge 1$,
\[
Z_i = \begin{cases} 1 & \text{if } (X_{2i-1}, X_{2i}) \in A; \\ 0 & \text{otherwise.} \end{cases}
\]
Alternatively, we could write $Z_i = h(X_{2i-1}, X_{2i})$, where $h$ is the function
\[
h(x, y) = \begin{cases} 1 & \text{if } (x, y) \in A; \\ 0 & \text{otherwise.} \end{cases}
\]
Note that $Z_1, Z_2, \ldots$ are independent and identically distributed, with expectation
\[
E[Z_1] = E[h(X_1, X_2)] = \int_{-1}^1\!\int_{-1}^1 h(x, y)\cdot f_{X_1, X_2}(x, y)\, dx\, dy = \frac{1}{4}\int_{-1}^1\!\int_{-1}^1 h(x, y)\, dx\, dy = \frac{1}{4}\iint_A 1\, dx\, dy = \frac{1}{4}\cdot\mathrm{Area}(A).
\]
They also have finite variance, being Bernoulli random variables. By the Law of Large Numbers, $\frac{Z_1 + \cdots + Z_n}{n}$ converges in probability to $\frac{1}{4}\cdot\mathrm{Area}(A)$, so for large $n$ the (computable) quantity $4\cdot\frac{Z_1 + \cdots + Z_n}{n}$ is, with high probability, a good approximation of $\mathrm{Area}(A)$.
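Here is a minimal sketch of the method, assuming numpy is available and taking $A$ to be the unit disc, so that the true answer $\mathrm{Area}(A) = \pi$ is known and we can judge the accuracy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Monte Carlo area estimate: A = unit disc inside the square [-1, 1]^2.
n = 1_000_000
x = rng.uniform(-1.0, 1.0, size=n)
y = rng.uniform(-1.0, 1.0, size=n)
z = (x**2 + y**2 <= 1.0)        # Z_i = h(X_{2i-1}, X_{2i})
print(4 * z.mean())             # ~ pi = Area(unit disc)
```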
The Law of Large Numbers requires the random variables involved to be independent. Sometimes, even if independence doesn't hold, the method of proof of the Law of Large Numbers (with Chebyshev's inequality) can be very useful. This is illustrated by the next example, which revisits Exercise 5 of Week 3.
Example 4. In an $N \times N$ square grid (with $N \ge 4$), we color each of the unit squares black with probability $1/3$ (and leave it uncolored with probability $2/3$), independently. The whole grid has $(N-3)^2$ sub-grids of dimensions $4 \times 4$. Let $Y_N$ be the proportion of these sub-grids in which we see the picture from Exercise 5 of Week 3 (a fixed pattern in which 4 of the 16 unit squares are black and the other 12 are uncolored; the figure is omitted here). Prove that the probability that $Y_N$ exceeds $10^{-3}$ tends to zero as $N \to \infty$.
Solution. Let $S_N$ be the set of all $4 \times 4$ sub-grids of the $N \times N$ grid. For each $s \in S_N$, define
\[
X_s = \begin{cases} 1 & \text{if } s \text{ shows the depicted picture;} \\ 0 & \text{otherwise.} \end{cases}
\]
We then have
\[
Y_N = \frac{\sum_{s \in S_N} X_s}{(N-3)^2}.
\]
We need to consider
\[
P(Y_N > 10^{-3}) = P\Big(\sum_{s \in S_N} X_s > 10^{-3}\cdot(N-3)^2\Big).
\]
We would now like to use Chebyshev's inequality, so we write the right-hand side above as
\[
P\Big(\sum_{s \in S_N} X_s - E\Big[\sum_{s \in S_N} X_s\Big] > 10^{-3}\cdot(N-3)^2 - E\Big[\sum_{s \in S_N} X_s\Big]\Big). \tag{2}
\]
We then compute
\[
E\Big[\sum_{s \in S_N} X_s\Big] = \sum_{s \in S_N} E[X_s] = (N-3)^2\cdot\Big(\frac{1}{3}\Big)^4\cdot\Big(\frac{2}{3}\Big)^{12} = \frac{2^{12}}{3^{16}}\cdot(N-3)^2,
\]
so that the probability in (2) equals
\[
P\Big(\sum_{s \in S_N} X_s - E\Big[\sum_{s \in S_N} X_s\Big] > a\cdot(N-3)^2\Big),
\]
where $a = 10^{-3} - \frac{2^{12}}{3^{16}}$. Using a calculator, we see that $a > 0$. By Chebyshev's inequality, the probability above is smaller than
\[
\frac{\mathrm{Var}\big(\sum_{s \in S_N} X_s\big)}{a^2 (N-3)^4}. \tag{3}
\]
To prove that this tends to zero as $N \to \infty$, we need to check that the variance in the numerator grows slower than the denominator. For this, we do not need to compute the variance exactly; a rough estimate suffices. We start with
\[
\mathrm{Var}\Big(\sum_{s \in S_N} X_s\Big) = \sum_{s \in S_N} \mathrm{Var}(X_s) + \sum_{s \in S_N}\ \sum_{\substack{s' \in S_N, \\ s' \ne s}} \mathrm{Cov}(X_s, X_{s'}).
\]
If two sub-grids $s$ and $s'$ do not overlap, then $X_s$ and $X_{s'}$ are independent, so $\mathrm{Cov}(X_s, X_{s'}) = 0$. Hence, the right-hand side above is equal to
\[
\sum_{s \in S_N} \mathrm{Var}(X_s) + \sum_{s \in S_N}\ \sum_{\substack{s' \in S_N,\ s' \ne s, \\ s' \cap s \ne \varnothing}} \mathrm{Cov}(X_s, X_{s'}). \tag{4}
\]
Recall that
\[
\mathrm{Var}(X_s) = E[X_s^2] - E[X_s]^2, \qquad \mathrm{Cov}(X_s, X_{s'}) = E[X_s X_{s'}] - E[X_s]\,E[X_{s'}],
\]
and, since the $X_s$ are Bernoulli random variables, all the expectations above are between 0 and 1. This gives
\[
\mathrm{Var}(X_s) \le 1, \qquad \mathrm{Cov}(X_s, X_{s'}) \le 1
\]
for any $s$ and $s'$. Hence, the expression in (4) is smaller than
\[
\sum_{s \in S_N} 1 + \sum_{s \in S_N}\ \sum_{\substack{s' \in S_N,\ s' \ne s, \\ s' \cap s \ne \varnothing}} 1 = (N-3)^2 + \sum_{s \in S_N} M_s,
\]
where $M_s$ is the number of $4 \times 4$ sub-grids that are different from $s$ and overlap with $s$. It is not hard to see that we can find some large constant $C$ (not depending on $N$) such that $M_s \le C$ for all $s$. Then,
\[
\sum_{s \in S_N} M_s \le C(N-3)^2,
\]
so the expression in (3) is at most
\[
\frac{(C+1)(N-3)^2}{a^2 (N-3)^4} \xrightarrow{N\to\infty} 0.
\]
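For concreteness, here is a simulation sketch assuming numpy is available. Since the actual picture is omitted above, we assume, purely for illustration, that it is the pattern with the four diagonal squares black:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 4x4 pattern (the real one is in Exercise 5 of Week 3):
# diagonal squares black (True), the other 12 uncolored (False).
pattern = np.eye(4, dtype=bool)

def Y(N):
    grid = rng.random((N, N)) < 1/3            # black with probability 1/3
    hits = 0
    for i in range(N - 3):
        for j in range(N - 3):
            hits += np.array_equal(grid[i:i+4, j:j+4], pattern)
    return hits / (N - 3)**2

for N in [50, 200, 400]:
    print(N, Y(N))   # concentrates near (1/3)^4 * (2/3)^12 ~ 9.5e-5 < 1e-3
```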
ST119: Probability 2 Lecture notes for Week 10
Let $X$ be a random variable with finite, nonzero variance, and define
\[
Z = \frac{X - E[X]}{\sqrt{\mathrm{Var}(X)}}.
\]
We observe that, regardless of the variance of $X$, we have $E[Z] = 0$ and $\mathrm{Var}(Z) = 1$. Indeed,
\[
E[Z] = E\left[\frac{X - E[X]}{\sqrt{\mathrm{Var}(X)}}\right] = \frac{1}{\sqrt{\mathrm{Var}(X)}}\cdot(E[X] - E[X]) = 0,
\]
\[
\mathrm{Var}(Z) = E[Z^2] = \frac{1}{\mathrm{Var}(X)}\cdot E\big[(X - E[X])^2\big] = \frac{\mathrm{Var}(X)}{\mathrm{Var}(X)} = 1.
\]
\[
X \sim N(\mu, \sigma^2) \implies \frac{X - \mu}{\sigma} \sim N(0, 1). \tag{1}
\]
Now let $Z \sim N(0, 1)$. Recall that $F_Z$ is the cumulative distribution function of $Z$, given by
\[
F_Z(x) = P(Z \le x) = \int_{-\infty}^x f_Z(y)\, dy = \int_{-\infty}^x \frac{1}{\sqrt{2\pi}}\, e^{-y^2/2}\, dy, \qquad x \in \mathbb{R}.
\]
As we have observed in class, there is no explicit expression for the anti-derivative of $e^{-y^2/2}$, so there is no hope of getting a more informative formula for $F_Z(x)$. In practice, we use approximations for $F_Z(x)$. The (approximate) values of $F_Z(x)$ for $x$ belonging to the set $\{0.00, 0.01, 0.02, \ldots, 3.99\}$ are recorded in a standard normal table, on the last page of these notes. The table is read as follows. We take $x$ written in decimal form as $a.bc$, where $a$, $b$ and $c$ are digits. We then find $F_Z(x)$ in the table by going to row $a.b$ and column $c$.
You may now ask:
(i) How about $x > 3.99$, beyond the range of the table?
(ii) How about $x < 0$?
The answer to (i) is easy: the value of $F_Z(x)$ for $x \ge 4$ is so close to 1 that, in this four-digit approximation, its value is rounded to 1. In fact, this already happens for $x \ge 3.9$, as can be seen from the last row of the table.
Regarding (ii), we can use symmetry to find $F_Z(x)$ for $x < 0$. Since $f_Z(x) = f_Z(-x)$, we have that
\[
x < 0 \implies F_Z(x) = \int_{-\infty}^x f_Z(y)\, dy = \int_{-x}^{\infty} f_Z(y)\, dy = 1 - F_Z(-x).
\]
We can also use the table to get approximations for $P(a < Z \le b)$, since
\[
P(a < Z \le b) = F_Z(b) - F_Z(a).
\]
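When no table is at hand, $F_Z$ can also be evaluated via the error function, which is available in Python's standard library. A minimal sketch:

```python
from math import erf, sqrt

# F_Z(x) = (1 + erf(x / sqrt(2))) / 2 for Z ~ N(0, 1)
def Phi(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

print(round(Phi(1.73), 4))              # 0.9582, as in the table
print(round(Phi(-1.00), 4))             # 0.1587 = 1 - Phi(1.00), by symmetry
print(round(Phi(2.00) - Phi(-1.00), 4)) # P(-1 < Z <= 2) ~ 0.8186
```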
Let $X_1, X_2, \ldots$ be independent and identically distributed random variables, with mean $\mu$ and variance $\sigma^2$, and let $S_n = X_1 + \cdots + X_n$. From the Central Limit Theorem, we will find out that $S_n - \mu n$ is typically comparable to $\sqrt{n}$ (in other words, $\frac{S_n - \mu n}{\sqrt{n}}$ is typically not too large and not too small). Much more interestingly, the theorem says that
\[
\text{the distribution of } \frac{S_n - \mu n}{\sigma\sqrt{n}} \text{ is close to } N(0, 1).
\]
This is true regardless of the distribution of $X_1, X_2, \ldots$! The only important thing is that they are independent and identically distributed, with finite mean and variance. In this sense, the normal distribution can be seen as a sort of "universal attractor" in Probability Theory: it arises as the limit of sequences of the form $\frac{S_n - \mu n}{\sigma\sqrt{n}}$, regardless of the specific distribution that we start with.
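The following sketch illustrates this universality, assuming numpy is available: standardized sums of uniform random variables already look standard normal for moderate $n$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Standardize S_n for X_i ~ Unif(0,1): mu = 1/2, sigma^2 = 1/12.
n, reps = 500, 20_000
mu, sigma = 0.5, np.sqrt(1/12)
S = rng.random((reps, n)).sum(axis=1)
T = (S - mu * n) / (sigma * np.sqrt(n))

# Compare empirical probabilities with N(0,1) table values.
for x in [-1.0, 0.0, 1.0, 1.96]:
    print(x, np.mean(T <= x))   # ~ 0.1587, 0.5, 0.8413, 0.975
```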
We are now ready to state the theorem. We will not see a proof in this module.
Theorem. (Central Limit Theorem) Let $X_1, X_2, \ldots$ be independent and identically distributed random variables, each with mean $\mu$ and variance $\sigma^2 \ne 0$. Let $S_n = X_1 + \cdots + X_n$. Then, for any $x \in \mathbb{R}$, we have
\[
P\Big(\frac{S_n - \mu n}{\sigma\sqrt{n}} \le x\Big) \xrightarrow{n\to\infty} \frac{1}{\sqrt{2\pi}}\int_{-\infty}^x e^{-y^2/2}\, dy.
\]
More generally, for any $a, b \in \mathbb{R}$, $a < b$, we have
\[
P\Big(a < \frac{S_n - \mu n}{\sigma\sqrt{n}} < b\Big) \xrightarrow{n\to\infty} \frac{1}{\sqrt{2\pi}}\int_a^b e^{-y^2/2}\, dy = F_Z(b) - F_Z(a),
\]
where $Z \sim N(0, 1)$. Simplifying the fraction and using a calculator, the above equals (approximately)
\[
P(Z > -1.73) = P(Z < 1.73) = F_Z(1.73),
\]
where the first equality follows by symmetry of the normal density about 0. Using the normal table, we obtain $F_Z(1.73) \approx 0.9582$.
Example. A fair die is rolled 12000 times. Use the Central Limit Theorem to find values of $a$ and $b$ such that
\[
P(1900 < S \le 2200) \approx \frac{1}{\sqrt{2\pi}}\int_a^b e^{-x^2/2}\, dx,
\]
where $S$ is the total number of 6's obtained.
Solution. For $i = 1, \ldots, 12000$, let $X_i = 1$ if the $i$-th roll results in a 6, and $X_i = 0$ otherwise. The $X_i$'s are clearly independent and identically distributed. Each has Bernoulli distribution with parameter $p = \frac{1}{6}$, so
\[
\mu = E[X_i] = \frac{1}{6}, \qquad \sigma^2 = \mathrm{Var}(X_i) = \frac{5}{36}.
\]
We have $S = \sum_{i=1}^{12000} X_i$. We now compute
\[
P(1900 < S \le 2200) = P\left(\frac{1900 - \frac{1}{6}\cdot 12000}{\sqrt{\frac{5}{36}\cdot 12000}} < \frac{S - \frac{1}{6}\cdot 12000}{\sqrt{\frac{5}{36}\cdot 12000}} \le \frac{2200 - \frac{1}{6}\cdot 12000}{\sqrt{\frac{5}{36}\cdot 12000}}\right)
\]
\[
\approx P\left(\frac{1900 - \frac{1}{6}\cdot 12000}{\sqrt{\frac{5}{36}\cdot 12000}} < Z \le \frac{2200 - \frac{1}{6}\cdot 12000}{\sqrt{\frac{5}{36}\cdot 12000}}\right), \tag{3}
\]
where $Z \sim N(0, 1)$. Simplifying the fractions and using a calculator, the above is approximately
\[
P(-2.45 < Z \le 4.89) = \frac{1}{\sqrt{2\pi}}\int_{-2.45}^{4.89} e^{-y^2/2}\, dy,
\]
so we may take $a = -2.45$ and $b = 4.89$.
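One can check how good this approximation is by simulating $S \sim \mathrm{Bin}(12000, 1/6)$ directly; a quick sketch, assuming numpy is available:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)

# S ~ Bin(12000, 1/6); estimate P(1900 < S <= 2200) by simulation.
S = rng.binomial(12000, 1/6, size=200_000)
print(np.mean((S > 1900) & (S <= 2200)))   # empirical probability

# CLT approximation: Phi(4.89) - Phi(-2.45)
Phi = lambda x: 0.5 * (1.0 + erf(x / sqrt(2.0)))
print(Phi(4.89) - Phi(-2.45))              # ~ 0.9929
```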
Example. Prove that
\[
\lim_{n\to\infty} \frac{1}{e^n}\sum_{i=0}^n \frac{n^i}{i!} = \frac{1}{2}.
\]
(Are you sure this is a probability question?)
Solution. The trick is to think of the Poisson distribution. Recall that, if $X \sim \mathrm{Poi}(\lambda)$, then it has probability mass function
\[
p_X(i) = e^{-\lambda}\,\frac{\lambda^i}{i!}, \qquad i \in \mathbb{N}_0.
\]
Replacing $\lambda$ by $n$, we see that
\[
\frac{1}{e^n}\sum_{i=0}^n \frac{n^i}{i!} = \sum_{i=0}^n e^{-n}\,\frac{n^i}{i!} = \sum_{i=0}^n p_{Y_n}(i),
\]
where $Y_n \sim \mathrm{Poi}(n)$. The next step is to remember that $Y_n$ has the same distribution as
\[
S_n = X_1 + \cdots + X_n,
\]
where $X_1, \ldots, X_n$ are independent random variables, all with the $\mathrm{Poi}(1)$ distribution. Hence,
\[
\sum_{i=0}^n e^{-n}\,\frac{n^i}{i!} = P(Y_n \le n) = P(S_n \le n) = P\Big(\frac{S_n - \mu n}{\sigma\sqrt{n}} \le \frac{n - \mu n}{\sigma\sqrt{n}}\Big).
\]
Note that $\mu = E[X_1] = 1$, so (without having to bother about $\sigma$) we have $\frac{n - \mu n}{\sigma\sqrt{n}} = 0$, and then the right-hand side above equals
\[
P\Big(\frac{S_n - \mu n}{\sigma\sqrt{n}} \le 0\Big).
\]
By the Central Limit Theorem, this converges to
\[
\frac{1}{\sqrt{2\pi}}\int_{-\infty}^0 e^{-y^2/2}\, dy = F_Z(0) = \frac{1}{2},
\]
which completes the proof.
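A numerical check of the limit (a sketch in plain Python; the terms $n^i/i!$ are accumulated iteratively to avoid overflowing large powers and factorials):

```python
from math import exp

# f(n) = e^{-n} * sum_{i=0}^{n} n^i / i!, via term_i = term_{i-1} * (n / i)
def f(n):
    term, total = 1.0, 1.0          # the i = 0 term equals 1
    for i in range(1, n + 1):
        term *= n / i
        total += term
    return exp(-n) * total

for n in [1, 10, 100, 700]:
    print(n, f(n))   # approaches 1/2 (slowly, at rate ~ 1/sqrt(n))
```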
Standard Normal cumulative distribution function
The value given in the table is $F_X(x)$ for $X \sim N(0, 1)$.
x 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359
0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753
0.2 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141
0.3 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517
0.4 0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879
0.5 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224
0.6 0.7257 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.7549
0.7 0.7580 0.7611 0.7642 0.7673 0.7703 0.7734 0.7764 0.7794 0.7823 0.7852
0.8 0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.8133
0.9 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389
1.0 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.8621
1.1 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.8830
1.2 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015
1.3 0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.9177
1.4 0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.9319
1.5 0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9429 0.9441
1.6 0.9452 0.9463 0.9474 0.9484 0.9495 0.9505 0.9515 0.9525 0.9535 0.9545
1.7 0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616 0.9625 0.9633
1.8 0.9641 0.9649 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693 0.9699 0.9706
1.9 0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.9750 0.9756 0.9761 0.9767
2.0 0.9772 0.9778 0.9783 0.9788 0.9793 0.9798 0.9803 0.9808 0.9812 0.9817
2.1 0.9821 0.9826 0.9830 0.9834 0.9838 0.9842 0.9846 0.9850 0.9854 0.9857
2.2 0.9861 0.9864 0.9868 0.9871 0.9875 0.9878 0.9881 0.9884 0.9887 0.9890
2.3 0.9893 0.9896 0.9898 0.9901 0.9904 0.9906 0.9909 0.9911 0.9913 0.9916
2.4 0.9918 0.9920 0.9922 0.9925 0.9927 0.9929 0.9931 0.9932 0.9934 0.9936
2.5 0.9938 0.9940 0.9941 0.9943 0.9945 0.9946 0.9948 0.9949 0.9951 0.9952
2.6 0.9953 0.9955 0.9956 0.9957 0.9959 0.9960 0.9961 0.9962 0.9963 0.9964
2.7 0.9965 0.9966 0.9967 0.9968 0.9969 0.9970 0.9971 0.9972 0.9973 0.9974
2.8 0.9974 0.9975 0.9976 0.9977 0.9977 0.9978 0.9979 0.9979 0.9980 0.9981
2.9 0.9981 0.9982 0.9982 0.9983 0.9984 0.9984 0.9985 0.9985 0.9986 0.9986
3.0 0.9987 0.9987 0.9987 0.9988 0.9988 0.9989 0.9989 0.9989 0.9990 0.9990
3.1 0.9990 0.9991 0.9991 0.9991 0.9992 0.9992 0.9992 0.9992 0.9993 0.9993
3.2 0.9993 0.9993 0.9994 0.9994 0.9994 0.9994 0.9994 0.9995 0.9995 0.9995
3.3 0.9995 0.9995 0.9995 0.9996 0.9996 0.9996 0.9996 0.9996 0.9996 0.9997
3.4 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9998
3.5 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998
3.6 0.9998 0.9998 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999
3.7 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999
3.8 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999
3.9 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000