
5 Expectation and variance

In Sections 3 and 4 we encountered a wide range of random variables. It is often useful to describe
or summarise their behaviour, and in this section we introduce several quantities for doing so. We pay
particular attention to two of key importance: the expectation and the variance of a random variable.

5.1 Expectation

Definition 5.1 (Expectation – discrete case). Let X be a discrete random variable. The expectation
of X is given by

E[X] := Σ_{x∈S_X} x · P(X = x),

provided the series Σ_{x∈S_X} |x| · P(X = x) converges. If the series Σ_{x∈S_X} |x| · P(X = x) diverges
then the expectation is undefined¹.

Remark 5.2. Let X be a discrete random variable.

• The expectation of X is often called the mean or the average of X.

• If S_X is a finite set then Σ_{x∈S_X} |x| · P(X = x) converges, and so the expectation always exists.

This might not be true if S_X is infinite (see Example 5.10 below).

Example 5.3. Let X satisfy P(X = 2) = 0.4, P(X = 4) = 0.1 and P(X = 10) = 0.5. Then

E[X] = 2 · (0.4) + 4 · (0.1) + 10 · (0.5) = 6.2.

When betting on a game of chance, an important random variable is the amount of money gained over
the game. The expectation is then often used to decide if the game is worth playing.

Example 5.4. Take two kings and four aces, shuffle the six cards thoroughly and then draw two
without replacement. We win £1.25 if we draw two aces; otherwise, we pay £1. Should we play?
Let X denote the amount gained in a game (negative if we lose). The question gives P(X = 1.25) =
C(4, 2)/C(6, 2) = 0.4 and P(X = −1) = 0.6, so

E[X] = (1.25) · P(X = 1.25) + (−1) · P(X = −1) = (1.25) · (0.4) + (−1) · (0.6) = −0.1.

As the expected gain is negative, on average we will lose money and so should not play².
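As a quick sanity check (a Python illustration of our own, not part of the notes), we can enumerate all C(6, 2) = 15 equally likely draws and average the gain directly:

```python
from itertools import combinations

# Exhaustive check of Example 5.4: 4 aces and 2 kings, draw 2 without
# replacement; gain 1.25 on two aces, otherwise -1.
deck = ['A', 'A', 'A', 'A', 'K', 'K']
draws = list(combinations(deck, 2))                  # 15 equally likely pairs
gains = [1.25 if pair == ('A', 'A') else -1.0 for pair in draws]
expected_gain = sum(gains) / len(gains)              # matches E[X] = -0.1
```

Six of the fifteen pairs are two aces, recovering P(X = 1.25) = 0.4 and the expected gain −0.1.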

Let us calculate the expectations of some familiar random variables.


¹ If a series is absolutely convergent then the order in which we sum its elements does not affect the answer, as proven
in 1SAS. This is why absolute convergence is important here.
² Hopefully this sounds like a good criterion for deciding whether to play, but at the moment this relies
only on our intuition. We will obtain a more rigorous justification in Section 6.

Example 5.5 (Constant). If X has P(X = c) = 1 for some c ∈ R then E[X] = c · P(X = c) = c.

Example 5.6 (Discrete uniform). Let X follow the uniform distribution on {1, . . . , n}, that is P(X =
k) = 1/n for all k = 1, . . . , n. Then,

E[X] = Σ_{i=1}^n i · P(X = i) = Σ_{i=1}^n i/n = n(n + 1)/(2n) = (n + 1)/2.

Example 5.7 (Binomial). On problem sheet 4 you will show that if X ∼ binn,p then E[X] = np.

Example 5.8 (Geometric). Let X ∼ geo_p with p ∈ (0, 1). Here there is a small trick. Taking q = 1 − p,

p · E[X] = E[X] − q · E[X] = Σ_{k=1}^∞ kpq^{k−1} − Σ_{k=1}^∞ kpq^k = Σ_{k=1}^∞ kpq^{k−1} − Σ_{k=2}^∞ (k − 1)pq^{k−1} = Σ_{k=1}^∞ pq^{k−1} = 1.

The third equality here follows by a change of variable and the final equality follows by summing a
geometric series. Rearranging we get E[X] = 1/p.
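The value 1/p can be checked numerically (a Python sketch of our own; the truncation length is arbitrary) by summing the series Σ k · p · q^{k−1} directly:

```python
# Numerical check of E[X] = 1/p for X ~ geo_p by truncating the series
# sum over k of k * p * q^(k-1); the tail beyond 10,000 terms is negligible
# for moderate p.
def geometric_mean_truncated(p, terms=10_000):
    q = 1 - p
    return sum(k * p * q ** (k - 1) for k in range(1, terms + 1))

approx = geometric_mean_truncated(0.25)   # close to 1/0.25 = 4
```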

Example 5.9 (Poisson). Let X ∼ Poi_λ with λ > 0. Then,

E[X] = Σ_{k=0}^∞ k · P(X = k) = Σ_{k=1}^∞ k e^{−λ} λ^k/k! = e^{−λ} Σ_{k=1}^∞ λ^k/(k − 1)! = λe^{−λ} Σ_{ℓ=0}^∞ λ^ℓ/ℓ! = λe^{−λ} e^λ = λ,

where we substituted ℓ = k − 1 in the fourth equality.

Example 5.10 (No expectation). Let X be a discrete random variable with P(X = n) = 1/(n(n + 1)) for
n ∈ ℕ. These probabilities do indeed sum to one since:

Σ_{n=1}^m 1/(n(n + 1)) = Σ_{n=1}^m 1/n − Σ_{n=1}^m 1/(n + 1) = 1 − 1/(m + 1) → 1, as m → ∞.

However Σ_{n=1}^∞ n · P(X = n) = Σ_{n=1}^∞ 1/(n + 1), which diverges³. Therefore E[X] is not defined in this case.

The expectation of continuous random variables is defined very similarly.

Definition 5.11 (Expectation – continuous case). Let X be a continuous random variable with density
f_X. Then the expectation of X is given by

E[X] := ∫_{−∞}^∞ x · f_X(x) dx,

provided ∫_{−∞}^∞ |x| · f_X(x) dx exists. If ∫_{−∞}^∞ |x| · f_X(x) dx does not exist then E[X] is undefined.

Example 5.12 (Continuous uniform). Let X ∼ unif[a, b] with density f_X(x) = (b − a)^{−1} for x ∈ [a, b]
and f_X(x) = 0 for x ∉ [a, b]. Then,

E[X] = ∫_{−∞}^∞ x · f_X(x) dx = (b − a)^{−1} ∫_a^b x dx = (b² − a²)/(2(b − a)) = (a + b)/2.
³ As the harmonic series Σ_{n=1}^∞ 1/n diverges.

Example 5.13 (Exponential). Let X ∼ exp_λ where λ > 0. Then X has density function f_X(x) =
λ · e^{−λx} for x > 0 and f_X(x) = 0 otherwise. Then,

E[X] = ∫_{−∞}^∞ x · f_X(x) dx = ∫_0^∞ x · λe^{−λx} dx = [x(−e^{−λx})]_0^∞ + ∫_0^∞ e^{−λx} dx = 1/λ,

where we used integration by parts in the second-to-last equality.

Example 5.14. Let f(x) = (1/4)|sin(x)| for x ∈ [0, 2π] and f(x) = 0 elsewhere. Then f gives a density
function (check this!) and if X is a random variable with this density

E[X] = ∫_{−∞}^∞ x · f(x) dx = (1/4) ∫_0^{2π} x · |sin(x)| dx = (1/4) ∫_0^π x · sin(x) dx − (1/4) ∫_π^{2π} x · sin(x) dx.

Integration by parts (with u = x and v = −cos(x)) gives ∫_a^b x sin(x) dx = [−x · cos(x) + sin(x)]_a^b. Thus

E[X] = (1/4) · [−x · cos(x) + sin(x)]_0^π − (1/4) · [−x · cos(x) + sin(x)]_π^{2π} = (π + 0)/4 − (−2π − π)/4 = π.

5.2 Key properties of expectation

The next theorem, together with Remark 5.16, is absolutely central to why expectation is so useful.

Theorem 5.15. Let X, Y be discrete or continuous random variables with well-defined expectations.

(i) Given any a, b ∈ R, we have E[aX + bY ] = a · E[X] + b · E[Y ]. (linearity of expectation)

(ii) If X, Y are independent then E[X · Y ] = E[X] · E[Y ].

We postpone the proof of Theorem 5.15 to the end of the section so that we can highlight some aspects
of the theorem and see some applications.

Remark 5.16. The following points are of key importance.

• Linearity of expectation (Theorem 5.15 (i )) extends by induction on n to give that for any n ≥ 1,
a1 , . . . , an ∈ R and random variables X1 , . . . , Xn with well-defined expectations, we have

E[a1 X1 + · · · + an Xn ] = a1 E[X1 ] + · · · + an E[Xn ].

• From Section 3 we know that if X, Y are independent random variables then f(X), g(Y) are also
independent random variables for any functions f, g : ℝ → ℝ. It follows from Theorem 5.15 (ii)
that E[X²Y³] = E[X²] · E[Y³], that E[sin(X) · e^Y] = E[sin(X)] · E[e^Y], etc.⁴

• Independence is key in (ii); in general, for random variables X, Y we have E[X · Y] ≠ E[X] · E[Y].

⁴ Provided all these expectations are well defined.

Example 5.17 (Dice). Let X1 , . . . , Xn be random variables with the uniform distribution on {1, . . . , 6}
(representing fair dice rolls). Then the random variable Y = X1 + · · · + Xn has SY = {n, n +
1, . . . , 6n − 1, 6n} but P(Y = k) is quite tricky to calculate. However E[Y ] = E[X1 + · · · + Xn ] =
E[X1 ] + · · · + E[Xn ] = nE[X1 ] by Theorem 5.15(i). But E[X1 ] = 3.5 by Example 5.6, so E[Y ] = 3.5n.

Example 5.18 (Sum of digits). Let X be uniformly distributed on {0, 1, . . . , 999}. Let Y be its
sum of digits. What is E[Y ]? We can write X = X1 + 10X2 + 100X3 , where Xi ∈ {0, 1, . . . , 9}
follows the uniform distribution on this set. In particular, E[Xi ] = (0 + 1 + 2 + · · · + 9)/10 = 4.5. As
Y = X1 + X2 + X3 , we find E[Y ] = E[X1 ] + E[X2 ] + E[X3 ] = 13.5 using linearity of expectation.
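The answer 13.5 can be confirmed exactly by enumeration (a Python illustration of our own, not part of the notes):

```python
# Exact check of Example 5.18: average digit sum of X uniform on {0, ..., 999}.
digit_sums = [sum(int(d) for d in str(x)) for x in range(1000)]
mean_digit_sum = sum(digit_sums) / len(digit_sums)   # equals 13.5
```

Each of the three digit positions contributes an average of 4.5, in line with the linearity argument above.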

Example 5.19 (Expectation of the normal distribution). We first focus on the standard normal,
N ∼ N(0, 1). Using that xe^{−x²/2} is an odd function, by symmetry

E[N] = (1/√(2π)) ∫_{−∞}^∞ xe^{−x²/2} dx = (1/√(2π)) ∫_{−∞}^0 xe^{−x²/2} dx + (1/√(2π)) ∫_0^∞ xe^{−x²/2} dx = 0.   (1)

Now consider X ∼ N(µ, σ²). By Proposition 4.19 we have µ + σN ∼ N(µ, σ²) and it follows from
Theorem 5.15 (i) that E[X] = E[µ + σN] = E[µ] + σ · E[N] = µ, by (1).

To further exploit linearity of expectation we need the notion of a Bernoulli random variable.

Definition 5.20 (Bernoulli distribution). The Bernoulli distribution with parameter p, with p ∈ [0, 1],
is the probability distribution on {0, 1} given by

Berp (1) = p, Berp (0) = 1 − p.

A random variable X follows the Bernoulli distribution with parameter p if SX = {0, 1} and P(X =
k) = Berp (k) for all k ∈ SX = {0, 1}. In this case we write X ∼ Berp . (Note that bin1,p = Berp .)

Example 5.21 (Bernoulli). If X ∼ Berp then E[X] = 1 · P(X = 1) + 0 · P(X = 0) = p.

Example 5.22. Let X ∼ Ber_{0.5} and set Y = 1 − X. Then Y ∼ Ber_{0.5} and X · Y = 0. This gives
E[X · Y] = 0 ≠ 1/4 = E[X] · E[Y]; see the third bullet point in Remark 5.16.

In computing expectations, it is often possible to 'break up' a random variable X into a sum of simpler
ones, X = Σ_{i=1}^n X_i, and then use linearity of expectation to find E[X] = Σ_{i=1}^n E[X_i]. The following
three examples illustrate this method, which is widely applicable.

Example 5.23 (Binomial). Let X ∼ bin_{n,p}. We will show E[X] = np. We know that X counts the
number of occurrences of independent events A_1, . . . , A_n, where each event happens with probability
p. Thus, X = X_1 + X_2 + · · · + X_n with Bernoulli random variables X_1, . . . , X_n, where X_i = 1 if and
only if A_i occurs. Linearity of expectation and Example 5.21 give E[X] = Σ_{i=1}^n E[X_i] = Σ_{i=1}^n p = np.
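The identity E[X] = np can also be checked directly against the binomial mass function (a Python sketch of our own; the function name is ours):

```python
from math import comb

# Direct check of E[X] = np for X ~ bin_{n,p}: sum k * P(X = k) over the
# support {0, ..., n} using the binomial mass function.
def binomial_mean(n, p):
    return sum(k * comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(n + 1))
```

For instance, binomial_mean(10, 0.3) returns a value numerically equal to np = 3.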

Example 5.24 (Hypergeometric). We show that if X ∼ hyp_{n,r,t} where n, r ≤ t then E[X] = n · r/t.
One could compute this by hand, exploiting identities for binomial coefficients (exercise!). Instead we
present a computation-free proof.
Recall that X counts the number of red balls in a sample of size n from an urn containing r red balls
and t balls in total. Label the red balls from 1 to r and the remaining balls from r + 1 to t. Let Xi = 1

if the i-th ball is in the sample and Xi = 0 otherwise. Then the number of red balls X = X1 + · · · + Xr .
By linearity of expectation,

E[X] = E[X1 ] + · · · + E[Xr ]. (2)


But E[X_i] = P(X_i = 1) = C(t − 1, n − 1)/C(t, n) = n/t, as we draw n balls without replacement and are interested
in the event that we draw a specific ball. Putting this value into (2) gives E[X] = n · r/t.


Example 5.25 (Birthday paradox). Place 50 people in a room and assume that their birthdays are
uniformly distributed over the whole year, independently of each other. Let X denote the number of
days in the year on which someone in the group has their birthday. What is E[X]?
The distribution of X is intricate, and we will not even think of studying it. Instead, for i = 1, . . . , 365,
let Ai be the event that somebody celebrates on day i. Then, P (Aci ) = (364/365)50 . Hence, P (Ai ) =
1 − (364/365)50 . As we can write X = X1 + X2 + · · · + X365 with Bernoulli random variables
X1 , . . . , X365 , where Xi = 1 if and only if Ai occurs, linearity of expectation gives

E[X] = 365 · E[X_1] = 365 · P(A_1) = 365 · (1 − (364/365)^50) = 46.786 [3dp].

Remark 5.26. The fact that linearity of expectation works without independence is a huge strength.
Note that in the previous two examples, the random variables {Xi } were not independent.
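The birthday computation above lends itself to a Monte Carlo check (a Python sketch of our own; the trial count and seed are arbitrary choices):

```python
import random

# Estimate the expected number of distinct birthdays among 50 people in a
# 365-day year; should be close to 365 * (1 - (364/365)**50), about 46.79.
def mean_distinct_days(people=50, days=365, trials=20_000, seed=1):
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        total += len({rng.randrange(days) for _ in range(people)})
    return total / trials
```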

Example 5.27 (Dice). We roll a fair die repeatedly and let X be the number of rolls necessary for
all faces to show up at least once. What is E[X]?
To calculate this, for each i ∈ {1, . . . , 6} let T_i denote the roll when the i-th new face first appears⁵.
In particular, we have T1 = 1 and T6 = X. Then we can write

X = T1 + (T2 − T1 ) + (T3 − T2 ) + (T4 − T3 ) + (T5 − T4 ) + (T6 − T5 ).

By linearity of expectation, our calculation now reduces to calculating E[Ti+1 − Ti ] for i ∈ {1, . . . , 5}.
What is the distribution of Ti+1 − Ti if i ∈ {1, . . . , 5}? At time Ti , we have seen i faces. The number of
rolls until the next new face appears follows the geometric distribution with parameter pi = (6 − i)/6.
Hence, Ti+1 − Ti ∼ geopi and E [Ti+1 − Ti ] = 1/pi = 6/(6 − i). By linearity of expectation, we find

E[X] = 1 + Σ_{i=1}^5 E[T_{i+1} − T_i] = 1 + Σ_{i=1}^5 6/(6 − i) = 1 + 6 Σ_{i=1}^5 1/i = 14.7.

This calculation is a variant of the coupon collector problem, which tends to appear quite often⁶.
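The value 14.7 is easy to confirm by simulation (a Python illustration of our own; trial count and seed are arbitrary):

```python
import random

# Average number of fair-die rolls until all six faces have appeared;
# Example 5.27 gives expectation 14.7.
def rolls_until_all_faces(rng):
    seen, rolls = set(), 0
    while len(seen) < 6:
        seen.add(rng.randrange(1, 7))
        rolls += 1
    return rolls

rng = random.Random(0)
trials = 20_000
estimate = sum(rolls_until_all_faces(rng) for _ in range(trials)) / trials
```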

The following result is also very useful in calculating expectations.

Lemma 5.28. Let g : R → R be a function.


⁵ Note: T_i is not the first time that face i appears.
⁶ You might like to see the Wikipedia article for more information here.

(i) If X is a discrete random variable, we have

E[g(X)] = Σ_{x∈S_X} g(x) · P(X = x),

provided Σ_{x∈S_X} |g(x)| · P(X = x) converges.

(ii) If X is a continuous random variable with density f_X then

E[g(X)] = ∫_{−∞}^∞ g(x) · f_X(x) dx,

provided ∫_{−∞}^∞ |g(x)| · f_X(x) dx converges.

Again we postpone the proof so that we can first see how the lemma applies.

Example 5.29. Lemma 5.28 might look a little odd, but it is very convenient. To see why, note that
if X is a discrete random variable then by Definition 5.1 we have

E[X²] = Σ_{x∈S_{X²}} x · P(X² = x).

This might require determining S_{X²} and the probabilities P(X² = x). Lemma 5.28 shows this is not
necessary; taking g(x) = x² we also have E[X²] = Σ_{x∈S_X} x² · P(X = x).

Example 5.30 (Karate). We chop a stick of length 1 into two pieces at a uniformly chosen point.
What is the expected length of the longer of the two pieces?

Letting X ∼ unif[0, 1] be the breaking point, we want E[g(X)] where g(x) = max(x, 1 − x). Thus,

E[g(X)] = ∫_0^1 max(x, 1 − x) · 1 dx = ∫_0^{0.5} (1 − x) dx + ∫_{0.5}^1 x dx = [−(1 − x)²/2]_0^{0.5} + [x²/2]_{0.5}^1 = 3/4.
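The integral can be checked numerically with a midpoint Riemann sum (a Python illustration of our own; the grid size is arbitrary):

```python
# Midpoint Riemann sum for E[max(X, 1-X)] with X ~ unif[0, 1].
# The integrand is piecewise linear with a kink at 0.5, so with an even
# number of cells the midpoint rule is exact; the answer is 3/4.
n = 100_000
estimate = sum(max((i + 0.5) / n, 1 - (i + 0.5) / n) for i in range(n)) / n
```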

5.2.1 The proofs of Theorem 5.15 and Lemma 5.28.

Proof of Theorem 5.15. We only consider the case of discrete random variables in the proof. To
simplify the notation, we write SX = {x1 , x2 , . . .} and SY = {y1 , y2 , . . .}, noting that these sets could
be finite or infinite.

To prove (i) let Z = aX + bY with S_Z = {ax_i + by_k : i ≥ 1, k ≥ 1}. We have

E[Z] = Σ_{z∈S_Z} z · P(Z = z) = Σ_{z∈S_Z} z · P(⋃_{i≥1,k≥1: ax_i+by_k=z} {X = x_i, Y = y_k})
     = Σ_{z∈S_Z} z · (Σ_{i≥1,k≥1: ax_i+by_k=z} P(X = x_i, Y = y_k))
     = Σ_{i≥1} Σ_{k≥1} (ax_i + by_k) · P(X = x_i, Y = y_k).

The series for E[X] and E[Y] are absolutely convergent, so we can rearrange this sum by 1SAS results:

E[Z] = a · Σ_{i≥1} x_i · (Σ_{k≥1} P(X = x_i, Y = y_k)) + b · Σ_{k≥1} y_k · (Σ_{i≥1} P(X = x_i, Y = y_k))
     = a · Σ_{i≥1} x_i · P(X = x_i) + b · Σ_{k≥1} y_k · P(Y = y_k)
     = a · E[X] + b · E[Y].

To prove (ii), note that by replacing aX + bY with X · Y in (i), the analogue of the calculation for
E[Z] gives

E[X · Y] = Σ_{i≥1} Σ_{k≥1} x_i y_k · P(X = x_i, Y = y_k) = Σ_{i≥1} Σ_{k≥1} x_i y_k · P(X = x_i) · P(Y = y_k)
         = (Σ_{i≥1} x_i · P(X = x_i)) · (Σ_{k≥1} y_k · P(Y = y_k))
         = E[X] · E[Y].

The second equality above is the key point at which independence is used (recalling Definition 3.33).
The third equality is clear if the series are finite, but holds for infinite series by results from 1SAS, as
both series are absolutely convergent. The final equality holds by definition of E[X] and E[Y ].

Proof of Lemma 5.28. We only prove (i). By definition of E[g(X)], we have

E[g(X)] = Σ_{y∈S_{g(X)}} y · P(g(X) = y) = Σ_{y∈S_{g(X)}} y · P(X ∈ {x ∈ S_X : g(x) = y})
        = Σ_{y∈S_{g(X)}} y · (Σ_{x∈S_X: g(x)=y} P(X = x))
        = Σ_{x∈S_X} (Σ_{y∈S_{g(X)}: g(x)=y} g(x) · P(X = x))
        = Σ_{x∈S_X} g(x) · P(X = x).

The fourth equality holds by switching the order of summation (i.e. the order of x and y) and the
final equality holds by noting that {y ∈ S_{g(X)} : g(x) = y} = {g(x)}, a set of size 1.

5.3 Variance

Definition 5.31 (Variance – discrete case). Let X be a discrete random variable with well-defined
expectation. The variance of X is given by

Var(X) := E[(X − E[X])²] = Σ_{x∈S_X} (x − E[X])² · P(X = x),   (3)

provided the series in (3) converges. The standard deviation of X is then given by

σ_X := √Var(X).

If the series in (3) diverges then both Var(X) and σ_X are undefined.

Remark 5.32. Let X be a discrete random variable with well-defined expectation and variance.

• The variance of X is always non-negative, by definition in (3).

• Both the variance and the standard deviation give a measure of the typical distance of the
random variable X from E[X]. These quantities have different benefits:

(i) typically |X − E[X]| is not much larger than σX (proven in Section 6), while
(ii) variance behaves much better algebraically (see Proposition 5.39 and Theorem 5.50 below).

Example 5.33. We return to the game from Example 5.4. As X satisfies P(X = 1.25) = 0.4,
P(X = −1) = 0.6 and E[X] = −0.1, we deduce

Var(X) = (1.25 − (−0.1))2 · 0.4 + (−1 − (−0.1))2 · 0.6 = 1.215.

Example 5.34 (Constant random variable). We return to Example 5.5 with a random variable X
satisfying P(X = c) = 1 for some c ∈ ℝ. Then, E[X] = c and Var(X) = (c − c)² · P(X = c) = 0. This
makes sense as there is no variation in the values which X can attain.

Example 5.35. Let X follow the uniform distribution on {49, 50, 51} and Y follow the uniform
distribution on {1, . . . , 99}. Then E[X] = E[Y] = 50. However the random variable X attains values
very close to E[X] and has a small standard deviation, σ_X = √(2/3). The random variable Y is evenly
spread on {1, . . . , 99}, and σ_Y is much larger; σ_Y = 28.57 . . . as shown in Example 5.40.

Definition 5.36 (Variance – continuous case). Let X be a random variable with density f_X and
well-defined expectation. Then the variance of X is given by

Var(X) = ∫_{−∞}^∞ (x − E[X])² · f_X(x) dx,

provided the integral exists, and the corresponding standard deviation by σ_X = √Var(X).

Remark 5.37. More generally, for g : ℝ → ℝ, if g(X) has a well-defined expectation, then

Var(g(X)) = ∫_{−∞}^∞ (g(x) − E[g(X)])² · f_X(x) dx,   σ_{g(X)} = √Var(g(X)).

Example 5.38. Coming back to Example 5.30, recall that we had X ∼ unif[0, 1] and were interested
in g(X) where g(x) = max(x, 1 − x). Our calculation gave E[g(X)] = 3/4.
Var(g(X)) = ∫_0^1 (max(x, 1 − x) − 3/4)² dx = ∫_0^{1/2} (1 − x − 3/4)² dx + ∫_{1/2}^1 (x − 3/4)² dx
          = [−(1/4 − x)³/3]_0^{1/2} + [(x − 3/4)³/3]_{1/2}^1 = (4 · (1/4)³)/3 = 1/48.

5.4 Key properties of variance

Proposition 5.39. Let X be a discrete or continuous random variable with well-defined expectation
and variance. Then the following hold:

(i) Var(X) = E[X 2 ] − (E[X])2 ,

(ii) given a, b ∈ R we have Var(aX + b) = a2 · Var(X),

(iii) Var(X) = 0 if and only if P(X = x0 ) = 1 for some x0 ∈ R.

Proof. We will prove (i)–(iii) assuming X is a discrete random variable. For (i) note that

Var(X) = E[(X − E[X])²] = E[X² − 2 · X · E[X] + E[X]²]
       = E[X²] − 2E[X] · E[X] + E[X]²
       = E[X²] − E[X]².

The second equality holds by expanding (X − E[X])² and the third holds by Theorem 5.15 (i).
To see (ii) note that by Theorem 5.15 (i) we have E[aX + b] = a · E[X] + b. It follows that

Var(aX + b) = E[(aX + b − E[aX + b])²] = E[(aX − a · E[X])²] = a² Var(X).

To prove (iii), note that if P(X = x_0) = 1 then given y ∈ ℝ \ {x_0} we have P(X = y) ≤ P(X ≠ x_0) = 0.
It follows that E[X] = x_0 · 1 + 0 = x_0 and Var(X) = (x_0 − E[X])² · 1 + 0 = 0.
On the other hand, suppose Var(X) = 0. Then given any y ∈ S_X we have

(y − E[X])² · P(X = y) ≤ Σ_{x∈S_X} (x − E[X])² · P(X = x) = Var(X) = 0.

Thus if y ≠ E[X] then P(X = y) = 0. It follows that P(X = E[X]) + 0 = Σ_{x∈S_X} P(X = x) = P(Ω) = 1,
and so P(X = E[X]) = 1.

Example 5.40 (Discrete uniform). Let X follow the uniform distribution on {1, . . . , n}. Then by
Example 5.6 we have E[X] = (n + 1)/2. As

E[X²] = Σ_{i=1}^n i²/n = n(n + 1)(2n + 1)/(6n) = (n + 1)(2n + 1)/6,

by Proposition 5.39 (i) we obtain that Var(X) = E[X²] − (E[X])² = (n² − 1)/12.

Example 5.41 (Bernoulli). If X ∼ Berp then E[X] = p by Example 5.21 and we find that Var(X) =
(1 − p)2 · p + p2 · (1 − p) = p(1 − p).

Example 5.42 (Binomial). Var(X) = np(1 − p) for X ∼ binn,p (see problem sheet 4).

Example 5.43 (Hypergeometric). If X ∼ hyp_{n,r,t} then Var(X) = n · (r/t) · ((t − r)/t) · ((t − n)/(t − 1)). We will not
prove this, but if you are interested, see Example 8.30 in Introduction to Probability by Anderson,
Seppäläinen and Valkó.

Example 5.44 (Geometric). Let X ∼ geo_p where p ∈ (0, 1). Letting q := 1 − p we have

E[X²] = Σ_{k=1}^∞ k²(1 − q)q^{k−1} = Σ_{k=0}^∞ (k + 1)²q^k − Σ_{k=1}^∞ k²q^k = 2 Σ_{k=1}^∞ kq^k + Σ_{k=0}^∞ q^k = 2 · (q/p) · E[X] + 1/p.

Since E[X] = 1/p by Example 5.8, we find Var(X) = E[X²] − (E[X])² = 2q/p² + 1/p − 1/p² = q/p² = (1 − p)/p².

Example 5.45 (Poisson). Let X ∼ Poi_λ with λ > 0. In Example 5.9 we saw that E[X] = λ. A direct
computation of E[X²] is cumbersome, and it is better to consider E[X(X − 1)]:

E[X(X − 1)] = Σ_{k=2}^∞ k(k − 1)e^{−λ} λ^k/k! = e^{−λ} Σ_{k=2}^∞ λ^k/(k − 2)! = λ²e^{−λ} Σ_{ℓ=0}^∞ λ^ℓ/ℓ! = λ²e^{−λ} e^λ = λ²,

where we substituted ℓ = k − 2. We obtain Var(X) = E[X²] − (E[X])² = E[X(X − 1)] + E[X] − (E[X])² = λ² + λ − λ² = λ.

Example 5.46. (Exercise!) Taking Y = g(X) as in Example 5.30, you can check that Var(Y ) = 1/48.
Similarly for the random variable X from Example 5.14, we have Var(X) = π 2 /2 − 2.

Example 5.47 (Continuous uniform distribution). Let X ∼ unif[a, b]. In Example 5.12 we had
already shown that E[X] = (a + b)/2. To find the variance, compute

E[X²] = (b − a)^{−1} ∫_a^b x² dx = (b³ − a³)/(3(b − a)) = (a² + ab + b²)/3.

Thus, Var(X) = E[X²] − (E[X])² = (b − a)²/12.

Example 5.48 (Exponential). Let X ∼ exp_λ where λ > 0. Then

E[X²] = ∫_0^∞ λx² e^{−λx} dx = [−x² e^{−λx}]_0^∞ + 2 ∫_0^∞ x e^{−λx} dx = 0 + 2E[X]/λ = 2/λ²,

using Example 5.13. This gives Var(X) = E[X²] − (E[X])² = 1/λ².
Example 5.49 (Normal). Again let N ∼ N(0, 1) with density f(x) = (2π)^{−1/2} e^{−x²/2}, x ∈ ℝ, recalling
that E[N] = 0 from Example 5.19. To compute Var(N) = E[N²] − (E[N])² = E[N²], we perform integration
by parts with u = x, v′ = xe^{−x²/2}:

Var(N) = E[N²] = (1/√(2π)) ∫_{−∞}^∞ x · xe^{−x²/2} dx = (1/√(2π)) ([−xe^{−x²/2}]_{−∞}^∞ + ∫_{−∞}^∞ e^{−x²/2} dx) = ∫_{−∞}^∞ f(x) dx = 1.

As µ + σN ∼ N(µ, σ²), it follows from Proposition 5.39 (ii) that Var(X) = σ² for X ∼ N(µ, σ²).

The following theorem shows that variance is additive for independent random variables.

Theorem 5.50. Let X_1, . . . , X_n be discrete or continuous random variables with well-defined expectations
and variances. If X_1, . . . , X_n are independent, then

Var(X_1 + · · · + X_n) = Σ_{i=1}^n Var(X_i).

Proof. We again prove this only for discrete random variables. It is convenient to set µ_i = E[X_i] for
each i ∈ [n]. By linearity of expectation we have E[Σ_{i=1}^n X_i] = Σ_{i=1}^n E[X_i] = Σ_{i=1}^n µ_i. Secondly,
given 1 ≤ i ≠ j ≤ n the random variables X_i and X_j are independent, so the random variables X_i − µ_i
and X_j − µ_j are also independent by Proposition 3.38. Thus

Var(Σ_{i=1}^n X_i) = E[(Σ_{i=1}^n X_i − E[Σ_{i=1}^n X_i])²] = E[(Σ_{i=1}^n (X_i − µ_i))²]
                  = E[Σ_{i=1}^n Σ_{j=1}^n (X_i − µ_i) · (X_j − µ_j)]
                  = Σ_{i=1}^n E[(X_i − µ_i)²] + Σ_{i=1}^n Σ_{j=1, j≠i}^n E[(X_i − µ_i) · (X_j − µ_j)]
                  = Σ_{i=1}^n Var(X_i) + Σ_{i=1}^n Σ_{j=1, j≠i}^n E[X_i − µ_i] · E[X_j − µ_j]
                  = Σ_{i=1}^n Var(X_i).

The second-to-last equality uses that Var(X_i) = E[(X_i − µ_i)²] for all i = 1, . . . , n and, by
Theorem 5.15 (ii), that X_i − µ_i and X_j − µ_j are independent for i ≠ j. The final equality holds as
E[X_i − µ_i] = 0 for all i = 1, . . . , n.

Example 5.51. Let X, Y be independent random variables with Var(X) = 2 and Var(Y ) = 5. What
is Var(3X − 5Y )? By Proposition 3.38 the random variables 3X, −5Y are independent and so

Var(3X − 5Y ) = Var(3X) + Var(−5Y ) = 32 Var(X) + (−5)2 Var(Y ) = 18 + 125 = 143.

Remark 5.52. Let X, Y be random variables with well-defined variances. In general, we do not have
Var(X + Y) = Var(X) + Var(Y). For example, let X be an arbitrary random variable with well-defined
variance Var(X) > 0 and set Y = −X. Then 0 = Var(X + Y) ≠ Var(X) + Var(Y) = 2Var(X), where
we used Proposition 5.39 (ii) with a = −1, b = 0 in the last step.

Example 5.53 (Binomial). Let X ∼ bin_{n,p}. As in Example 5.23 we have X = Σ_{i=1}^n X_i with
X_i ∼ Ber_p for all i = 1, . . . , n. By additivity of variances for independent random variables, since the
random variables X_1, . . . , X_n are independent (the events A_1, . . . , A_n are independent), we obtain

Var(X) = Var(Σ_{i=1}^n X_i) = Σ_{i=1}^n Var(X_i) = np(1 − p),

where we used the expression for the variance of a Bernoulli random variable from Example 5.41.

For fixed p ∈ [0, 1], the value E[X] in Example 5.53 is proportional to n as n → ∞, while the standard
deviation σ_X grows like √n. In particular, we find that the fluctuations of X are much smaller than
E[X] for large n.

5.5 Median

Another quantity which plays an important role in statistics is the median.

Definition 5.54 (Median). Let X be a discrete or continuous random variable. A value m ∈ ℝ is
called a median of X if P(X ≥ m) ≥ 1/2 and P(X ≤ m) ≥ 1/2.

Remark 5.55.

• In general, the median of a random variable is not necessarily unique.

• Like the expectation, the median gives an estimate of the ‘typical value’ of X. Statisticians tend
to have a soft spot for the median and often choose it over the expectation to represent data.
One reason is that it is more stable and less sensitive to extreme values than the expectation.

• On the downside, the median doesn’t behave very well algebraically and the median analogues of
the useful properties in Theorem 5.15 all tend to fail. In particular, the median is rarely linear.

Example 5.56 (Continuous uniform distribution). Let X ∼ unif[a, b] with distribution function

F_X(t) = 0                 if t < a,
         (t − a)/(b − a)   if a ≤ t ≤ b,
         1                 if t > b.

Solving F_X(t) = 1/2 shows that m = (a + b)/2 = E[X] is the unique median of X.

Example 5.57 (Discrete uniform distribution). Let X have the uniform distribution on {1, . . . , n}.
Given x ∈ ℝ with 1 ≤ x ≤ n, we have P(X ≤ x) = ⌊x⌋/n and P(X ≥ x) = (⌊n − x⌋ + 1)/n. If n is
odd, then E[X] = (n + 1)/2 is the unique median of X. If n is even, however, then every number in the
interval [n/2, n/2 + 1] is a median of X.

Example 5.58. We roll a fair die and as usual let Ω = {1, . . . , 6} and P denote the uniform distribution
on Ω. We also let X : Ω → ℝ denote the outcome of the roll, so that X(i) = i. As seen above,
any number in [3, 4] is a median of X.
Now instead assign a new value K to the face 6, where K is some very large number (maybe K = 1000).
Let Y be the outcome of rolling the modified die. Then, for K ≥ 4, any median of Y still lies in
the interval [3, 4]. However, we find E[Y] = (K + 15)/6 and so the mean changes drastically⁷.

⁷ One can show that, for any median m, we have |E[X] − m| ≤ σ_X, so median and expectation are close to each other
if σ_X is small. (A proof of this inequality is difficult.)

5.6 Statistical applications

Hypothesis tests. Randomised double-blind clinical trials work as follows: a group of patients is
randomly subdivided into two subgroups T (for treatment) and C (for control). Patients in group T
are given a certain treatment while group C patients receive placebos. To minimise any kind of bias
neither participants nor doctors know the partition into groups T and C.
In 1948 the first randomised clinical trial was held in the UK leading to a breakthrough in tuberculosis
treatment using the antibiotic streptomycin. Six months of treatment saw the following results:

                   improvement   no improvement   total
treatment (T)           39             16           55
no treatment (C)        17             35           52
total                   56             51          107

The numbers look convincing, but one might still ask if such an extreme result could have occurred
by coincidence. Assuming that the conditions of 56 of 107 patients were determined to improve (inde-
pendently of any treatment), the number X of those patients in group T follows the hypergeometric
distribution with n = 55, r = 56 and t = 107. Thus, E[X] = 55 · 56/107 = 28.78 . . . The probability
for a deviation of at least 39 − 28.78 . . . from the mean is
P(X ≤ 18) + P(X ≥ 39) = Σ_{k=4}^{18} C(56, k) · C(51, 55 − k)/C(107, 55) + Σ_{k=39}^{55} C(56, k) · C(51, 55 − k)/C(107, 55) ≈ 0.001.

This probability is called the p-value associated with the data. In hypothesis testing, one fixes a
significance level α (often 0.01, 0.05 or 0.10) and rejects the null hypothesis (here: treatment has no
effect) if the p-value is smaller than α.⁸
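The tail sum above can be reproduced with the standard library alone (a Python sketch of our own; `hyp_pmf` is our helper name):

```python
from math import comb

# Hypergeometric tail probabilities for the streptomycin table:
# n = 55 patients in group T drawn from t = 107, of whom r = 56 improve.
def hyp_pmf(k, n=55, r=56, t=107):
    return comb(r, k) * comb(t - r, n - k) / comb(t, n)

# The support runs from k = n - (t - r) = 4 up to k = 55.
p_value = (sum(hyp_pmf(k) for k in range(4, 19))
           + sum(hyp_pmf(k) for k in range(39, 56)))   # small p-value
```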

Mean estimation. A central problem in statistics is the following: given independent realisations
X̂_1, . . . , X̂_n (that is, data) from an unknown distribution which has mean µ and variance σ² ≥ 0, we
would like to estimate µ. The obvious choice is the sample mean

µ̂_n = n^{−1}(X̂_1 + · · · + X̂_n).

By linearity of expectation, we find E[µ̂_n] = µ. Further, by Proposition 5.39 (ii) and Theorem 5.50
we have Var(µ̂_n) = σ²/n. As n → ∞, the fluctuations of µ̂_n around its expectation µ become smaller,
giving that µ̂_n is extremely likely⁹ to be a good estimator for the true mean µ.
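The relation Var(µ̂_n) = σ²/n is easy to see in a small simulation (a Python illustration of our own; parameters and seed are arbitrary), here with unif[0, 1] data so that σ² = 1/12:

```python
import random

# Empirical variance of the sample mean of n uniform[0, 1] variables;
# theory predicts (1/12)/n.
def sample_mean_variance(n, trials=20_000, seed=2):
    rng = random.Random(seed)
    means = [sum(rng.random() for _ in range(n)) / n for _ in range(trials)]
    grand = sum(means) / trials
    return sum((m - grand) ** 2 for m in means) / trials
```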

⁸ This line of argument is known as Fisher's exact test after Sir Ronald Fisher (1890–1962), the founder of modern
statistical science. You will study hypothesis tests in much greater detail in the Year 2 module 2S.
⁹ The law of large numbers, discussed in Section 6, will make this statement more precise.

Variance estimation. In the setting of the previous example, we sometimes also want to estimate
σ². The obvious choice would be n^{−1}((X̂_1 − µ)² + · · · + (X̂_n − µ)²). Indeed, this quantity is a good
approximation for σ²; however, in applications, it is useless since we do not know the true value of µ.
Instead, we work with the sample variance
σ̂_n² := n^{−1} Σ_{i=1}^n (X̂_i − µ̂_n)².

By rearranging this expression (a one-line calculation) we obtain that σ̂_n² = n^{−1} Σ_{i=1}^n X̂_i² − µ̂_n². Since
E[µ̂_n²] = Var(µ̂_n) + (E[µ̂_n])² = σ²/n + µ² and E[X̂_i²] = σ² + µ², it follows that

E[σ̂_n²] = ((n − 1)/n) · σ².   (4)
Similar to estimation of the mean, to show that σ̂_n² is a good estimator for σ², we would like to prove
that its variance tends to zero as n → ∞. A lengthy calculation (a page or two in length) gives that

Var(σ̂_n²) = (E[X̂_i⁴] − σ⁴)/n + (E[X̂_i⁴] + σ⁴)/n² + (E[X̂_i⁴] − σ⁴)/n³.

Provided E[X̂_i⁴] < ∞ we obtain Var(σ̂_n²) → 0 as n → ∞, and σ̂_n² is very likely to be close to E[σ̂_n²].

Remark 5.59. In statistics, we often seek¹⁰ an estimator θ̂ for a parameter θ. The estimator is said
to be unbiased if θ is the average value (or expectation) of θ̂. From above, E[µ̂_n] = µ and so µ̂_n is an
unbiased estimator for µ. On the other hand, from (4) we see that σ̂_n² is a biased estimator. This can
be corrected by instead taking the unbiased estimator

(n − 1)^{−1} Σ_{i=1}^n (X̂_i − µ̂_n)².
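The bias in (4) and its correction can be verified exactly in a tiny case (a Python illustration of our own): n = 2 samples from Ber_{1/2}, where σ² = 1/4, so (4) predicts E[σ̂_n²] = 1/8 while the corrected estimator averages to 1/4:

```python
from itertools import product

# Average the two variance estimators over all 2^2 equally likely samples
# of size n = 2 from Ber_{1/2} (true variance 1/4).
n = 2
samples = list(product([0, 1], repeat=n))

def var_estimate(xs, denom):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / denom

biased = sum(var_estimate(s, n) for s in samples) / len(samples)        # 1/8
unbiased = sum(var_estimate(s, n - 1) for s in samples) / len(samples)  # 1/4
```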

Most important takeaways in this chapter. You should

• know the definitions of expectation, variance, standard deviation and median for discrete and
continuous random variables,

• be familiar with the main properties of expectation and variance,

• be able to compute expectation, variance and median in simple examples,

• know the expectation of the binomial, hypergeometric, geometric, Poisson, discrete and continuous
uniform, exponential and normal distributions,

• know the variance of the binomial, Poisson, geometric and normal distributions,

• be familiar with the Bernoulli distribution and know its expectation and variance,

• be able to exploit linearity of expectation to compute the expectation of complicated random
variables,

• appreciate the concept of hypothesis tests.


¹⁰ E.g. selecting a group of voters to estimate the proportion of an electorate in favour of a referendum result.

Tables of formula and comparisons

The following table extends the previous table from Section 4 (page 3), to include expressions for the
expectation and variance.

                            discrete r.v.                    continuous r.v.

mass/density function       p_X(k), k ∈ S_X                  f_X(x), x ∈ ℝ

distribution function F_X   Σ_{k∈S_X, k≤t} p_X(k)            ∫_{−∞}^t f_X(x) dx
                            (F_X is a step function)         (F_X is continuous)

connection                  P(X = k) = F_X(k) − P(X < k)     f_X(x) = F_X′(x)

expectation E[X]            Σ_{k∈S_X} k · p_X(k)             ∫_{−∞}^∞ x · f_X(x) dx

expectation E[g(X)]         Σ_{k∈S_X} g(k) · p_X(k)          ∫_{−∞}^∞ g(x) · f_X(x) dx

variance Var(X)             E[X²] − (E[X])²                  E[X²] − (E[X])²

The second table below gives formulae for the expectation and variance of common random variables,
and a reference to the example above in which each formula was derived¹¹.

distribution           parameters   expectation   Ex. no.   variance                                      Ex. no.

(discrete) uniform     n            (n + 1)/2     5.6       (n² − 1)/12                                   5.40
Bernoulli              p            p             5.21      p(1 − p)                                      5.41
binomial               n, p         np            5.23      np(1 − p)                                     5.53
hypergeometric         n, r, t      n · r/t       5.24      n · (r/t) · ((t − r)/t) · ((t − n)/(t − 1))   5.43
geometric              p            1/p           5.8       (1 − p)/p²                                    5.44
Poisson                λ            λ             5.9       λ                                             5.45
(continuous) uniform   a < b        (a + b)/2     5.12      (b − a)²/12                                   5.47
exponential            λ            1/λ           5.13      1/λ²                                          5.48
normal                 µ, σ²        µ             5.19      σ²                                            5.49

¹¹ In the electronic version of these notes, the example number references all have hyperlinks, which might be convenient.

