Random Variables
This is definitely one of the most important sections in the entire text! The Central Limit Theorem is used
everywhere in statistics (hypothesis testing), and it also has applications in computing probabilities. We'll
see three results here, each more powerful and surprising than the last.
If X1, ..., Xn are iid random variables with mean µ and variance σ², then we define the sample mean to be
$$\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i.$$
We'll see the following results:
• The expectation of the sample mean E[X̄n] is exactly the true mean µ, and the variance Var(X̄n) = σ²/n goes to 0 as you get more samples.
• (Law of Large Numbers) As n → ∞, the sample mean X̄n converges (in probability) to the true mean µ. That is, as you get more samples, you will be able to get an excellent estimate of µ.
• (Central Limit Theorem) In fact, X̄n follows a Normal distribution as n → ∞ (in practice, n as low as 30 is often good enough for this approximation). When we talk about the distribution of X̄n, this means: if we take n samples and take the sample mean, another n samples and take the sample mean, and so on, how will these sample means look in a histogram? This is crazy: regardless of what the distribution of the Xi's is (discrete or continuous), their average will be approximately Normal! We'll see pictures and describe this more soon!
Before we start, we will define the sample mean of n random variables and compute its mean and variance. By linearity of expectation,
$$\mathbb{E}\left[\bar{X}_n\right] = \mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right] = \frac{1}{n}\sum_{i=1}^{n} \mathbb{E}\left[X_i\right] = \frac{1}{n} \cdot n\mu = \mu,$$
and, since the Xi are independent (so their variances add) and constants come out of the variance squared,
$$\text{Var}\left(\bar{X}_n\right) = \text{Var}\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2}\sum_{i=1}^{n} \text{Var}\left(X_i\right) = \frac{1}{n^2} \cdot n\sigma^2 = \frac{\sigma^2}{n}.$$
Again, none of this is “mind-blowing” to prove: we just used linearity of expectation and properties of
variance to show this.
What is this saying? Basically, if you wanted to estimate the mean height of the U.S. population by sampling
n people uniformly at random:
• In expectation, your sample average will be "on point" at E[X̄n] = µ. This even includes the case n = 1: if you just sample one person, on average you will be correct. However, the variance is high.
• The variance of your estimate (the sample mean) of the true mean goes down (σ²/n) as your sample size n gets larger. This makes sense, right? If you have more samples, you have more confidence in your estimate because you are more "sure" (less variance). (The simulation sketch below checks both of these facts numerically.)
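If you want to convince yourself of these two facts numerically, here is a minimal simulation sketch in Python (the Exponential population and the particular values of n and the number of trials below are arbitrary choices for illustration; any distribution with mean µ and variance σ² behaves the same way):

```python
import numpy as np

# Sanity-check sketch (not a proof): estimate the mean and variance of the sample
# mean by simulation. The Exponential population is an arbitrary illustrative choice.
rng = np.random.default_rng(seed=312)
mu, sigma2 = 2.0, 4.0        # Exp(rate 0.5) has mean 1/0.5 = 2 and variance 1/0.5^2 = 4
n, trials = 50, 100_000

# Each row is one experiment: n iid samples, reduced to one sample mean.
sample_means = rng.exponential(scale=2.0, size=(trials, n)).mean(axis=1)

print(sample_means.mean())   # ≈ mu = 2           (E[X̄_n] = mu)
print(sample_means.var())    # ≈ sigma2 / n = 0.08 (Var(X̄_n) = sigma^2 / n)
```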
In fact, as n → ∞, the variance of the sample mean approaches 0. A distribution with mean µ and variance
0 is essentially the degenerate random variable that takes on µ with probability 1. We’ll actually see that
the Law of Large Numbers argues exactly that!
Using the fact that the variance approaches 0 as n → ∞, we can argue that, by averaging more and more
samples (n → ∞), we get a really good estimate of the true mean µ, since the variance of the sample mean
is σ²/n → 0 (as we showed earlier). Here is the formal mathematical statement:
$$\text{(Weak Law of Large Numbers)} \quad \text{for every } \varepsilon > 0, \quad \lim_{n \to \infty} P\left(\left|\bar{X}_n - \mu\right| > \varepsilon\right) = 0.$$
$$\text{(Strong Law of Large Numbers)} \quad P\left(\lim_{n \to \infty} \bar{X}_n = \mu\right) = 1.$$
The SLLN implies the WLLN, but not vice versa. The difference is subtle and basically comes down to swapping
the limit and probability operations.
The proof of the WLLN will be given in 6.1 when we prove Chebyshev's inequality, but the proof of the SLLN
is out of the scope of this class and much harder.
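Here is a small Python sketch of the LLN in action (the Unif(0, 1) population is just an illustrative choice): the running sample mean of more and more iid draws settles down to the true mean 0.5.

```python
import numpy as np

# Illustration of the Law of Large Numbers: the running sample mean of iid
# Unif(0, 1) draws (true mean mu = 0.5) settles down to 0.5 as n grows.
rng = np.random.default_rng(seed=312)
x = rng.uniform(0.0, 1.0, size=1_000_000)
running_mean = np.cumsum(x) / np.arange(1, x.size + 1)   # X̄_n for n = 1, 2, 3, ...

for n in (10, 1_000, 100_000, 1_000_000):
    print(n, running_mean[n - 1])    # drifts toward 0.5 as n gets large
```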
(Central Limit Theorem) Let X1, ..., Xn be a sequence of independent and identically distributed random variables with mean
µ and (finite) variance σ². We've seen that the sample mean X̄n has mean µ and variance σ²/n. Then
as n → ∞, the following equivalent statements hold:

1. $\bar{X}_n \to \mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right)$.

2. $\frac{\bar{X}_n - \mu}{\sqrt{\sigma^2 / n}} \to \mathcal{N}(0, 1)$.

3. $\sum_{i=1}^{n} X_i \sim \mathcal{N}\left(n\mu, n\sigma^2\right)$. This is not "technically" correct, but is useful for applications.

4. $\frac{\sum_{i=1}^{n} X_i - n\mu}{\sqrt{n\sigma^2}} \to \mathcal{N}(0, 1)$.

The mean and variance are not a surprise (we computed these at the beginning of these notes for any
sample mean); the importance of the CLT is that, regardless of the distribution of the Xi's, the sample mean
approaches a Normal distribution as n → ∞.
We will prove the Central Limit Theorem in 5.11 using MGFs, but take a second to appreciate this crazy
result! The LLN says that as n → ∞, the sample mean X̄n of iid variables converges to µ. The CLT says
that, as n → ∞, the sample mean actually converges to a Normal distribution! For any original distribution
of the Xi's (discrete or continuous), the average/sum will become approximately normally distributed.
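If it helps, here is a rough simulation sketch of what that claim means (the Exponential population, n = 100, and the 200,000 repetitions are all arbitrary illustrative choices): we compute many independent sample means, standardize each one as in form 2 above, and compare their empirical CDF to Φ.

```python
import numpy as np
from scipy.stats import norm

# Sketch of the "distribution of the sample mean": repeat the experiment many times,
# each time averaging n iid Exponential(rate 1) samples (a skewed, very non-Normal
# population, chosen arbitrarily), then standardize each sample mean as in form 2
# of the CLT above. The standardized means should look roughly N(0, 1).
rng = np.random.default_rng(seed=312)
n, trials = 100, 200_000
mu, sigma2 = 1.0, 1.0                      # Exp(1) has mean 1 and variance 1

means = rng.exponential(scale=1.0, size=(trials, n)).mean(axis=1)
z = (means - mu) / np.sqrt(sigma2 / n)     # standardized sample means

for t in (-1.0, 0.0, 1.0):
    # Empirical P(Z <= t) vs the standard Normal CDF Phi(t): roughly equal,
    # and the agreement improves as n grows.
    print(t, np.mean(z <= t), norm.cdf(t))
```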
If you're still having trouble figuring out what "the distribution of the sample mean" means, that's
completely normal (double pun!). Let's consider n = 2, so we just take the average of X1 and X2, which is
(X1 + X2)/2. The distribution of X1 + X2 means: if we repeatedly sample X1, X2 and add them, what might
the density look like? For example, if X1, X2 ∼ Unif(0, 1) (continuous), we showed the density of X1 + X2
looked like a triangle. We figured out how to compute the PMF/PDF of the sum using convolution in 5.5,
and the average is just this sum divided by 2: (X1 + X2)/2, whose PMF/PDF you can find by transforming RVs
as in 4.4. On the next page, you'll see exactly the CLT applied to these Uniform distributions. With n = 1, it
looks (and is) Uniform. When n = 2, you get the triangular shape. And as n gets larger, it starts looking
more and more like a Normal!
You'll see some examples below of how we start with some arbitrary distributions and how the density
function of their mean becomes shaped like a Gaussian (you know how to compute the PDF of the mean now,
using convolution from 5.5 and transforming RVs from 4.4)!
On the next two pages, we'll see some visual "proof" of this surprising result!
• The first (n = 1) of the four graphs below shows a discrete $\frac{1}{29} \cdot \text{Unif}(0, 29)$ PMF in the dots (and a blue line with the curve of the normal distribution with the same mean and variance). That is, $P(X = k) = \frac{1}{30}$ for each value k in the range $\left\{0, \frac{1}{29}, \frac{2}{29}, \dots, \frac{28}{29}, 1\right\}$.
• The second graph (n = 2) shows the average of two of these random variables, again with a blue line giving the curve of the normal distribution with the same mean and variance. Remember we expected this triangular distribution when summing either discrete or continuous Uniforms (e.g., when summing two fair 6-sided die rolls, you're most likely to get a 7, and the probability goes down linearly as you approach 2 or 12; see the example in 5.5 if you forgot how we got this!).
• The third (n = 3) and fourth (n = 4) graphs show the average of 3 and 4 identically distributed random variables respectively, each with the distribution shown in the first graph. We can see that as we average more, the average approaches a normal distribution.
Again, if you don't believe me, you can compute the PMF yourself using convolution: first convolve two copies of this discrete uniform PMF, then convolve the result with a third copy, and a fourth!
Despite each Xi being a discrete random variable, when we take an average of many of them, there are increasingly
many values the average can take between 0 and 1. The average of these iid discrete RVs starts to resemble a continuous
Normal random variable after averaging just 4 of them!
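If you want to try that convolution yourself, here is a small numerical sketch (just an illustration; the plotting step is left out) that builds the PMFs behind these four graphs:

```python
import numpy as np

# Numerical version of the plots above: start from the discrete uniform PMF on
# {0, 1/29, ..., 28/29, 1} (30 equally likely values) and convolve it with itself
# to get the PMF of the sum of n = 2, 3, 4 iid copies. The *average* is just the
# sum divided by n, which rescales the support back into [0, 1] but keeps the
# same probabilities.
pmf1 = np.full(30, 1 / 30)                    # P(X = k/29) = 1/30 for k = 0, ..., 29

pmf = pmf1.copy()
for n in range(2, 5):
    pmf = np.convolve(pmf, pmf1)              # PMF of X_1 + ... + X_n
    support = np.arange(pmf.size) / 29 / n    # values taken by (X_1 + ... + X_n) / n
    # Plotting pmf against support gives the dots in the n-th graph above.
    print(n, support.size, round(pmf.sum(), 6))   # total probability stays 1
```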
Image Credit: Larry Ruzzo (a previous University of Washington CSE 312 instructor).
You might still be skeptical, because the Uniform distribution is "nice" and already looked pretty "Normal"
even with n = 2 samples. We now illustrate the same idea with a strange distribution, shown with dots in the first
(n = 1) of the four graphs below (instead of a "nice" uniform distribution). Even
this crazy distribution looks nearly Normal after averaging just 4 of them. This is the power of the CLT!
What we are getting at here is that, regardless of the distribution, as we average more independent and
identically distributed random variables, the average follows a Normal distribution (with the same mean and
variance as the sample mean).
Now let’s see how we can apply the CLT to problems! There were four different equivalent forms (just
scaling/shifting) stated, but I find it easier to just look at the problem and decide what’s best. Seeing
examples is the best way to understand!
Example(s)
Let's consider the example of flipping a fair coin 40 times independently. What's the probability of
getting between 15 and 25 heads (inclusive)? First compute this exactly, and then give an approximation using
the CLT.
Solution: Define X to be the number of heads in the 40 flips. Then we have X ∼ Bin(n = 40, p = 1/2), so we
just sum the Binomial PMF:
$$P(15 \le X \le 25) = \sum_{k=15}^{25} \binom{40}{k} \left(\frac{1}{2}\right)^k \left(1 - \frac{1}{2}\right)^{40-k} \approx 0.9193.$$
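As a quick check, here is one way to compute this exact sum with scipy (just a sketch; any tool that can evaluate the Binomial PMF/CDF works):

```python
from scipy.stats import binom

# Exact answer: sum the Bin(40, 1/2) PMF over k = 15, ..., 25 (equivalently, take
# a difference of CDFs).
n, p = 40, 0.5
print(binom.pmf(range(15, 26), n, p).sum())      # ≈ 0.9193
print(binom.cdf(25, n, p) - binom.cdf(14, n, p)) # same value via the CDF
```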
Now, let's use the CLT. Since X can be thought of as the sum of 40 iid Ber(1/2) RVs, we can apply the
CLT. We have E[X] = np = 40(1/2) = 20 and Var(X) = np(1 − p) = 40(1/2)(1 − 1/2) = 10. So we can use the
approximation X ≈ N(µ = 20, σ² = 10):
$$P(15 \le X \le 25) \approx P\left(\frac{15 - 20}{\sqrt{10}} \le Z \le \frac{25 - 20}{\sqrt{10}}\right) \approx P(-1.58 \le Z \le 1.58) = \Phi(1.58) - \Phi(-1.58) \approx 0.8859.$$
This is decent, but noticeably below the exact answer of 0.9193.
Notice that in computing P(15 ≤ X ≤ 25) exactly, we sum over 25 − 15 + 1 = 11 terms of the
PMF. However, our integral P(15 ≤ N(20, 10) ≤ 25) has width 25 − 15 = 10. We'll always be off by one,
since the number of integers in [a, b] is (b − a) + 1 (for integers a ≤ b) and not b − a (e.g., the number of
integers in [12, 15] is (15 − 12) + 1 = 4: {12, 13, 14, 15}).
The continuity correction says we should extend by 0.5 in each direction: that is, we should ask for P(a − 0.5 ≤ X ≤ b + 0.5)
instead, so the width becomes b − a + 1. If we redo the final calculation to approximate
P(15 ≤ X ≤ 25) using the central limit theorem, now with the continuity correction, we get the following:
Example(s)
Use the continuity correction to get a better estimate than we did earlier for the coin problem.
Solution: We'll apply the exact same steps, except changing the bounds from 15 and 25 to 14.5 and 25.5:
$$\begin{aligned}
P(15 \le X \le 25) &\approx P(14.5 \le \mathcal{N}(20, 10) \le 25.5) && \text{[apply continuity correction]} \\
&= P\left(\frac{14.5 - 20}{\sqrt{10}} \le Z \le \frac{25.5 - 20}{\sqrt{10}}\right) \\
&\approx P(-1.74 \le Z \le 1.74) \\
&= \Phi(1.74) - \Phi(-1.74) \\
&\approx 0.9182
\end{aligned}$$
Notice that this is much closer to the exact answer from the first part of the prior example (0.9193) than
approximating with the central limit theorem without the continuity correction!
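Here is a small sketch that reproduces both approximations side by side (using scipy's Normal CDF for Φ; the numbers may differ slightly from table lookups due to rounding):

```python
from scipy.stats import norm

# CLT approximation of P(15 <= X <= 25) with X ≈ N(20, 10), both without and with
# the continuity correction, compared against the exact answer 0.9193.
mu, sd = 20, 10 ** 0.5

print(norm.cdf(25, mu, sd) - norm.cdf(15, mu, sd))      # ≈ 0.886 (no correction)
print(norm.cdf(25.5, mu, sd) - norm.cdf(14.5, mu, sd))  # ≈ 0.918 (with correction)
```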
Note: If you are applying the CLT to sums/averages of continuous RVs instead, you should not
apply the continuity correction.
See the additional exercises below to get more practice with the CLT!
5.7.5 Exercises
1. Each day, the number of customers who come to the CSE 312 probability gift shop is approximately
Poi(11). Approximate the probability that, after the quarter ends (9 × 7 = 63 days), we had at
least 700 customers.
Solution: The total number of customers that come is X = X1 + · · · + X63, where each Xi ∼ Poi(11)
has E[Xi] = Var(Xi) = λ = 11 from the chart. By the CLT, X ≈ N(µ = 63 · 11 = 693, σ² = 63 · 11 = 693) (sum of
the means and sum of the variances). Hence, using the continuity correction (since the Xi are discrete),
$$P(X \ge 700) \approx P\left(Z \ge \frac{699.5 - 693}{\sqrt{693}}\right) \approx P(Z \ge 0.25) = 1 - \Phi(0.25) \approx 0.40.$$
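If you'd like to sanity-check this approximation, here is a sketch that compares it against the exact Poisson probability (using the fact that a sum of independent Poissons is Poisson):

```python
from scipy.stats import norm, poisson

# Check of Exercise 1: the sum of independent Poissons is Poisson, so X ~ Poi(693)
# exactly, while the CLT gives X ≈ N(693, 693). Compare the continuity-corrected
# CLT estimate of P(X >= 700) with the exact Poisson value.
mu = 63 * 11                                          # 693 (mean and variance)
clt = 1 - norm.cdf(699.5, loc=mu, scale=mu ** 0.5)    # P(X >= 700) via the CLT
exact = 1 - poisson.cdf(699, mu)                      # P(X >= 700) = 1 - P(X <= 699)
print(clt, exact)                                     # both come out around 0.40
```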
2. what's the probability my flashlight can operate for the entirety of my trip?
Solution: The total lifetime of the batteries is X = X1 + · · · + X18, where each Xi ∼ Exp(0.1)
has E[Xi] = 1/0.1 = 10 and Var(Xi) = 1/0.1² = 100. Hence, E[X] = 180 and Var(X) = 1800 by linearity
of expectation and since variance adds for independent RVs. In fact, X ∼ Gamma(r = 18, λ = 0.1),
but we don't have a closed form for its CDF. By the CLT, X ≈ N(µ = 180, σ² = 1800), so
we can approximate the probability that X exceeds the length of the trip by standardizing and using Φ, just as above.
Note that we don't use the continuity correction here, because the RVs we are summing are already
continuous.
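Here is a sketch of how this computation would go, comparing the CLT approximation with the exact Gamma CDF; since the trip length isn't restated here, the 150-hour value in the code is purely hypothetical:

```python
from scipy.stats import gamma, norm

# Check of Exercise 2. The trip length isn't restated in these notes, so the
# 150 hours below is a made-up value for illustration only. The exact distribution
# is X ~ Gamma(r = 18, lambda = 0.1); the CLT gives X ≈ N(180, 1800), with no
# continuity correction since the X_i are continuous.
trip_hours = 150                                          # hypothetical trip length
clt = 1 - norm.cdf(trip_hours, loc=180, scale=1800 ** 0.5)
exact = 1 - gamma.cdf(trip_hours, a=18, scale=1 / 0.1)    # scipy's scale = 1/lambda
print(clt, exact)    # P(total battery lifetime exceeds the trip length), two ways
```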