Variance
4.1 Overview
The expected value of a random variable gives a crude measure for the
“center of location” of the distribution of that random variable. For instance,
if the distribution is symmetric about a value µ then the expected value
equals µ. To refine the picture of a distribution about its “center of location”
we need some measure of spread (or concentration) around that value. For
many distributions the simplest measure to calculate is the variance (or,
more precisely, the square root of the variance).
Remark. In the Chapter on the normal distribution you will find more
refined probability approximations involving the variance.
The Tchebychev inequality gives the right insight when dealing with
sums of random variables, for which variances are easy to calculate. Suppose
EY = µY and EZ = µZ. Then

var(Y + Z) = E[(Y − µY) + (Z − µZ)]²
           = E[(Y − µY)² + 2(Y − µY)(Z − µZ) + (Z − µZ)²]
           = var(Y) + 2 cov(Y, Z) + var(Z),

where cov(Y, Z) = E(Y − µY)(Z − µZ) denotes the covariance between Y and Z.
The same sort of expansion gives a formula for the covariance of two linear
combinations, cov(aU + bV, cY + dZ). It is easier to see the pattern if we work
with the centered random variables U′ = U − µU, . . . , Z′ = Z − µZ. For then
the left-hand side becomes

E(aU′ + bV′)(cY′ + dZ′) = E(ac U′Y′ + bc V′Y′ + ad U′Z′ + bd V′Z′).

The expected values in the last line correspond to the four covariances:

cov(aU + bV, cY + dZ) = ac cov(U, Y) + bc cov(V, Y) + ad cov(U, Z) + bd cov(V, Z).
Sometimes it is easier to subtract off the expected values at the end of
the calculation, by means of the formulae cov(Y, Z) = E(Y Z) − (EY )(EZ)
and, as a particular case, var(X) = E(X²) − (EX)². Both formulae follow
via an expansion of the product:
cov(Y, Z) = E (Y Z − µY Z − µZ Y + µY µZ )
= E(Y Z) − µY EZ − µZ EY + µY µZ
= E(Y Z) − µY µZ .
In particular, for random variables X1, . . . , Xn with zero covariances,

var(X1 + · · · + Xn) = var(X1) + · · · + var(Xn).

You should check the last assertion by expanding out the quadratic in the
variables Xi − EXi, observing how all the cross-product terms disappear
because of the zero covariances. These facts lead to a useful concentration
property.
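As a quick numerical check of these identities (my own illustration, not part of the notes), the following Python sketch simulates two independent random variables, compares the two forms of the covariance, and compares the variance of the sum with the sum of the variances. The particular distributions are arbitrary choices.

```python
# Numerical sketch: cov(Y, Z) = E(YZ) - (EY)(EZ), and for (nearly) zero
# covariance the cross-product term drops out of var(Y + Z).
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
Y = rng.exponential(scale=2.0, size=n)      # independent draws (arbitrary choice)
Z = rng.normal(loc=1.0, scale=3.0, size=n)

# cov(Y, Z) two ways: E(Y - EY)(Z - EZ) and E(YZ) - (EY)(EZ).
lhs = np.mean((Y - Y.mean()) * (Z - Z.mean()))
rhs = np.mean(Y * Z) - Y.mean() * Z.mean()
print(lhs, rhs)                             # identical up to rounding

# var(Y + Z) = var(Y) + var(Z) + 2 cov(Y, Z); the covariance term is ~0 here.
print(np.var(Y + Z), np.var(Y) + np.var(Z) + 2 * lhs)
```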
Given that X = xi, we know that g(X) = g(xi), but we get no help with
understanding the behavior of h(Y). Thus independence implies

E(g(X)h(Y) | X = xi) = g(xi) E(h(Y) | X = xi) = g(xi) Eh(Y).

Deduce that

Eg(X)h(Y) = Σ_i P{X = xi} g(xi) Eh(Y) = Eg(X) Eh(Y).
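A small sketch of this factorization (an illustration of mine; the pmfs and the functions g and h are arbitrary choices): for independent discrete X and Y the joint pmf is the product of the marginals, so the double sum factors exactly.

```python
# Exact check that E g(X)h(Y) = Eg(X) Eh(Y) when X and Y are independent.
from itertools import product

x_pmf = {0: 0.2, 1: 0.5, 2: 0.3}        # P{X = x} (arbitrary)
y_pmf = {1: 0.6, 4: 0.4}                # P{Y = y} (arbitrary)
g = lambda x: x ** 2 + 1
h = lambda y: 3 * y

E_gh = sum(px * py * g(x) * h(y)
           for (x, px), (y, py) in product(x_pmf.items(), y_pmf.items()))
E_g = sum(px * g(x) for x, px in x_pmf.items())
E_h = sum(py * h(y) for y, py in y_pmf.items())
print(E_gh, E_g * E_h)                   # the two numbers agree
```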
<4.4> Example. Consider two independent rolls of a fair die. Let X denote the
value rolled the first time and Y denote the value rolled the second time.
The random variables X and Y are independent, and they have the same
distribution. Consequently cov(X, Y ) = 0, and var(X) = var(Y ).
The two random variables X + Y and X − Y are uncorrelated:
cov(X + Y, X − Y )
= cov(X, X) + cov(X, −Y ) + cov(Y, X) + cov(Y, −Y )
= var(X) − cov(X, Y ) + cov(Y, X) − var(Y )
= 0.
Nevertheless, the sum and difference are not independent. For example,
P{X + Y = 12} = P{X = 6}P{Y = 6} = 1/36
but
P{X + Y = 12 | X − Y = 5} = P{X + Y = 12 | X = 6, Y = 1} = 0.
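The whole example can be checked by brute force over the 36 equally likely outcomes. The sketch below (my own check, not part of the notes) does exactly that with exact rational arithmetic.

```python
# Enumerate the 36 outcomes of two fair dice: cov(X + Y, X - Y) is exactly 0,
# yet X + Y and X - Y are not independent.
from itertools import product
from fractions import Fraction

outcomes = list(product(range(1, 7), repeat=2))       # (X, Y), each prob 1/36
p = Fraction(1, 36)

def E(f):
    return sum(p * f(x, y) for x, y in outcomes)

cov = E(lambda x, y: (x + y) * (x - y)) - E(lambda x, y: x + y) * E(lambda x, y: x - y)
print(cov)                                            # 0

# Dependence: P{X + Y = 12} = 1/36, but P{X + Y = 12 | X - Y = 5} = 0.
p_sum12 = sum(p for x, y in outcomes if x + y == 12)
p_diff5 = sum(p for x, y in outcomes if x - y == 5)
p_both = sum(p for x, y in outcomes if x + y == 12 and x - y == 5)
print(p_sum12, p_both / p_diff5)                      # 1/36 and 0
```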
<4.5> Example. Until quite recently, in the Decennial Census of Housing and
Population the Census Bureau would obtain more detailed information about
the population via a more extensive list of questions sent to only a random
sample of housing units. For an area like New Haven, about 1 in 6 units
would receive the so-called “long form”.
For example, one question on the long form asked for the number of
rooms in the housing unit. We could imagine the population of all units
numbered 1, 2, . . . , N , with the ith unit containing yi rooms. Complete
enumeration would reveal the value of the population average,
ȳ = (y1 + y2 + · · · + yN)/N.
A sample can provide a good estimate of ȳ with less work.
Suppose a sample of n housing units is selected from the population
without replacement. (For the Decennial Census, n ≈ N/6.) The answer
from each unit is a random variable that could take each of the values
y1 , y2 , . . . , yN , each with probability 1/N .
Remark. It might be better to think of a random variable that takes
each of the values 1, 2, . . . , N with probability 1/N , then take the
corresponding number of rooms as the value of the random variable
that is recorded. Otherwise we can fall into verbal ambiguities when
many of the units have the same number of rooms.
In particular, each Yi has expected value

EYi = (y1 + y2 + · · · + yN)/N = ȳ,

and consequently the sample average Ȳ = (Y1 + · · · + Yn)/n also has expected
value ȳ. Notice also that each Yi has the same variance,

var(Yi) = (1/N) Σ_{j=1}^N (yj − ȳ)²,

a quantity that I will denote by σ².
The variance of the sample average is

var(Ȳ) = (1/n²) var(Y1 + · · · + Yn) = (1/n²) Σ_i Σ_j cov(Yi, Yj).

(What formula did I just rederive?) There are n variance terms and n(n − 1)
covariance terms. We know that each Yi has variance σ², regardless of the
dependence between the variables. The effect of the dependence shows up in
the covariance terms. By symmetry, cov(Yi, Yj) is the same for each pair
i ≠ j, a value that I will denote by c. Thus, for sampling without replacement,

(∗)    var(Ȳ) = (1/n²)[nσ² + n(n − 1)c] = σ²/n + (n − 1)c/n.
We can calculate c directly, from the fact that the pair (Y1 , Y2 ) takes
each of N (N − 1) pairs of values (yi , yj ) with equal probability. Thus
c = cov(Y1, Y2) = (1/(N(N − 1))) Σ_{i≠j} (yi − ȳ)(yj − ȳ).
If we added the “diagonal” terms (yi − ȳ)² to the sum we would have the
expansion for the product
(Σ_{i=1}^N (yi − ȳ)) (Σ_{j=1}^N (yj − ȳ)),

which equals zero because Σ_{i=1}^N yi = N ȳ. The expression for the covariance
simplifies to
c = cov(Y1, Y2) = (1/(N(N − 1))) [0 − Σ_{i=1}^N (yi − ȳ)²] = −σ²/(N − 1).
Substitution in formula (∗) then gives
var(Ȳ) = (σ²/n) (1 − (n − 1)/(N − 1)) = (σ²/n) · (N − n)/(N − 1).
Compare with the σ²/n for var(Ȳ) under sampling with replacement.
The correction factor (N − n)/(N − 1) is close to 1 if the sample size n is
small compared with the population size N, but it can decrease the variance
of Ȳ appreciably if n/N is not small. For example, if n ≈ N/6 (as with the
Census long form) the correction factor is approximately 5/6.
If n = N, the correction factor is zero. That is, var(Ȳ) = 0 if the
whole population is sampled. Indeed, when n = N we know that Ȳ equals
the population mean, ȳ, a constant. A random variable that always takes
the same constant value has zero variance. Thus the right-hand side of (∗)
must reduce to zero when we put n = N , which gives a quick method for
establishing the equality c = −σ 2 /(N − 1), without all the messing around
with sums of products and products of sums.
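For a small artificial population it is easy to confirm the final expression for var(Ȳ) by enumerating every ordered sample without replacement. The sketch below (my own check, with an arbitrary made-up population of N = 6 units) compares the directly computed variance of Ȳ with (σ²/n)(N − n)/(N − 1).

```python
# Brute-force check of the without-replacement variance formula.
from itertools import permutations

y = [1, 2, 2, 3, 5, 8]                       # rooms per unit (arbitrary); N = 6
N, n = len(y), 2
ybar = sum(y) / N
sigma2 = sum((yi - ybar) ** 2 for yi in y) / N

samples = list(permutations(y, n))           # all ordered samples without replacement
means = [sum(s) / n for s in samples]
var_direct = sum((m - ybar) ** 2 for m in means) / len(samples)

var_formula = (sigma2 / n) * (N - n) / (N - 1)
print(var_direct, var_formula)               # the two agree
```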
<4.6> Example. Consider a two stage method for generating a random vari-
able Z. Suppose we have k different random variables Y1 , . . . , Yk , with
EYi = µi and var(Yi ) = σi2 . Suppose also that we have a random method
for selecting which variable to choose: a random variable X that is inde-
pendent of all the Yi ’s, with P{X = i} = pi for i = 1, 2, . . . , k, where
p1 + p2 + · · · + pk = 1. If X takes the value i, define Z to equal Yi .
The variability in Z is due to two effects: the variability of each Yi ; and
the variability of X. Conditional on X = i, we have Z equal to Yi , and
E(Z | X = i) = E(Yi) = µi,
var(Z | X = i) = E[(Z − µi)² | X = i] = var(Yi) = σi².

Averaging out over the distribution of X, we get

EZ = Σ_i P{X = i} E(Z | X = i) = Σ_i pi µi,

a quantity that I will denote by µ̄. Similarly,

var(Z) = E(Z − µ̄)² = Σ_i pi E[(Z − µ̄)² | X = i].

If we could replace the µ̄ in the ith summand by µi, the sum would become a
weighted average of conditional variances. To achieve such an effect, rewrite
(Z − µ̄)² as

(Z − µi)² + 2(Z − µi)(µi − µ̄) + (µi − µ̄)².
On the right-hand side, the first term has conditional expectation σi² (given
X = i), and the middle term disappears because E(Z | X = i) = µi. With those
simplifications, the expression for the variance becomes
var(Z) = Σ_i pi σi² + Σ_i pi (µi − µ̄)².

In more compact notation,

var(Z) = E[var(Z | X)] + var[E(Z | X)],

where E(Z | X) denotes the random variable that takes the value µi when X
takes the value i, and var(Z | X) denotes the random variable that takes
the value σi² when X takes the value i.
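A simulation makes the decomposition concrete. In the sketch below (my own illustration; the choices of pi, µi, σi and of normal distributions for the Yi are arbitrary), Z is generated by the two-stage method and its sample variance is compared with Σ_i pi σi² + Σ_i pi (µi − µ̄)².

```python
# Two-stage generation of Z: choose a component with probabilities p, then
# draw from that component; compare sample variance with the decomposition.
import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.2, 0.5, 0.3])                 # P{X = i} (arbitrary)
mu = np.array([0.0, 2.0, 5.0])                # E Y_i
sigma = np.array([1.0, 0.5, 2.0])             # sd of Y_i (normal, for concreteness)

n = 1_000_000
X = rng.choice(3, size=n, p=p)                # stage one: choose the component
Z = rng.normal(loc=mu[X], scale=sigma[X])     # stage two: draw from the chosen Y_i

mu_bar = np.dot(p, mu)
theory = np.dot(p, sigma ** 2) + np.dot(p, (mu - mu_bar) ** 2)
print(Z.var(), theory)                        # close for large n
```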