Chapter 3: Variance Reduction
1 Introduction
Variance reduction is the search for alternative and more accurate estimators
of a given quantity. The possibility of variance reduction is what separates
Monte Carlo from direct simulation. Simple variance reduction methods often
are remarkably effective and easy to implement. It is good to think about them
as you wait for a long Monte Carlo computation to finish. In some applications,
such as rare event simulation and quantum chemistry, they make practical what
would be impossible otherwise. Most advanced Monte Carlo is some kind of
variance reduction.
Among the many variance reduction techniques, which may be used in com-
bination, are control variates, partial integration, systematic sampling, and im-
portance sampling. The method of control variates is useful when a crude version
of the problem can be solved explicitly. This is often the case in simple prob-
lems (possibly the definition of “simple”) such as pricing problems in quantitative
finance, where the crude solvable version could be Black-Scholes. Partial inte-
gration, also called Rao-Blackwellization, lowers variance by replacing integrals
over some variables or over parts of space by their averages. Systematic sam-
pling methods range from the simplest, antithetic variates, to the slightly more
sophisticated stratified sampling, to quasi Monte Carlo integration. Importance
sampling has appeared already as sampling with a weight function. It also is the
basis of reweighting and score function strategies for sensitivity analysis. Meth-
ods for rare event sampling mostly use importance functions, often suggested
by the mathematical theory of large deviations.
2 Control variates
Suppose X is a random variable and that we want to evaluate
A = E[V (X)] .
We may estimate A by generating L independent samples of X and taking
\hat{A} = \frac{1}{L} \sum_{k=1}^{L} V(X_k) .    (1)
The error of this estimator is of order \sigma_V / \sqrt{L}, where \sigma_V^2 = var(V(X)). Thus, the
number of samples, and the run time, needed to achieve a given accuracy is
proportional to the variance.
A control variate is an easily evaluated random variable, W(X), so that
B = E[W(X)] is known. If W(X) is correlated with V(X), with covariance
C_{VW} = cov(V(X), W(X)), then for a suitable coefficient \alpha the random variable

Z = V(X) - \alpha \left( W(X) - B \right)    (2)

can have less variance than V(X). This will make the control variate estimator
\hat{A} = \frac{1}{L} \sum_{k=1}^{L} \left( V(X_k) - \alpha W(X_k) \right) + \alpha B    (3)
more accurate than the simple one (1), often dramatically so.
We choose α to minimize the variance of Z in (2). The variance is
\sigma_Z^2 = \sigma_V^2 - 2\alpha C_{VW} + \alpha^2 \sigma_W^2 .
The optimal α is
\alpha^* = \frac{C_{VW}}{\sigma_W^2} ,    (4)
and the corresponding optimal variance is
\sigma_Z^2 = \sigma_V^2 - \frac{C_{VW}^2}{\sigma_W^2} = \sigma_V^2 \left( 1 - \rho_{VW}^2 \right) ,    (5)
where \rho_{VW} = C_{VW} / (\sigma_V \sigma_W) is the correlation coefficient of V and W. In practice
C_{VW} and \sigma_W^2 may not be known, but we can estimate them from the same Monte
Carlo data, which gives

\hat{\alpha}^* = \frac{\hat{C}_{VW}}{\hat{\sigma}_W^2} ,    (6)

and the estimator

\hat{A} = \hat{A}^{(1)} - \hat{\alpha}^* \, \frac{1}{L} \sum_{k=1}^{L} \left( W_k - B \right) ,

where \hat{A}^{(1)} is the simple estimator (1) and W_k = W(X_k).
The estimate (6) may not be a very accurate estimate of (4), but the performance
does not depend strongly on \alpha when \alpha is close to \alpha^*, where the derivative of \sigma_Z^2
with respect to \alpha is zero.
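Here is a minimal Python sketch of the estimators (1), (3), and (6) on a toy problem chosen
only for illustration (estimate A = E[e^U] for U uniform on [0,1], with control variate W = U,
whose mean B = 1/2 is known); it is not one of the posted Matlab scripts, and the variable
names and sample size are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
L = 100_000
U = rng.random(L)            # samples X_k (here X = U, uniform on [0, 1])
V = np.exp(U)                # V(X_k); the exact answer is A = e - 1
W = U                        # control variate W(X_k)
B = 0.5                      # known mean of W

A_simple = V.mean()                              # simple estimator (1)
alpha = np.cov(V, W)[0, 1] / W.var(ddof=1)       # estimated optimal coefficient (6)
A_cv = np.mean(V - alpha * W) + alpha * B        # control variate estimator (3)

print("simple estimate      ", A_simple)
print("control variate est. ", A_cv)
print("var(V)               ", V.var(ddof=1))
print("var(Z)               ", np.var(V - alpha * (W - B), ddof=1))

Because e^U and U are strongly correlated, var(Z) here is far smaller than var(V), so (3) needs
far fewer samples than (1) for the same accuracy.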
One can use more than one control variate. Given W1 (X), . . ., Wn (X), we
can form
Z = V(X) - \sum_{l=1}^{n} \alpha_l W_l(X) .    (7)
The optimal coefficients, the αl that minimize var(Z), are found by solving the
system of linear equations
cov(V, W_l) = \sum_{m=1}^{n} cov(W_l, W_m) \, \alpha_m , \qquad l = 1, \ldots, n .    (8)
Should the coefficients in (8) be unknown, we can estimate them from Monte
Carlo data as above.
Let V_S denote the control variate sum on the right of (7), so that V = Z + V_S.
The optimality condition for the coefficients \alpha_l is that V_S be uncorrelated with
Z. If this were not so, we could use W = V_S as an additional control variate and
further reduce the variance. Because Z and V_S are uncorrelated, var(V) = var(Z) + var(V_S).
In statisticians’ terminology, the total variance of V is the sum of the explained
part, var(V_S), and the unexplained part, var(Z).
Linear algebra has a geometrical way to express this. Given a random
variable, X, there is a vector space consisting of mean zero functions of X
with finite variance. If V(X) - A and W(X) - B are two such, their in-
ner product is \langle V - A, W - B \rangle = cov(V, W). The corresponding squared length is
\| V - A \|^2 = \langle V - A, V - A \rangle = var(V). In this vector space is the subspace, S,
spanned by the vectors W_l(X) - B_l. Minimizing var(Z) in (7) is the same as
finding the V_S \in S that minimizes \| (V - A) - V_S \|^2. This V_S is the element of S
closest to V - A. In this way we write V = Z + V_S with V_S perpendicular to Z.
Example: From the introduction. Let B \subset R^3 be the unit ball of points with
|x| \le 1. Suppose X and Y are independent and uniformly distributed in B and
try to evaluate

E\left[ \frac{e^{-\lambda|X-Y|}}{|X-Y|} \right] .
Since the functional V(X,Y) = e^{-\lambda|X-Y|} / |X-Y| depends on |X-Y|, we seek control
variates that have this dependence, the difficulty being finding functionals whose
expected value is known. One possibility is W_1(X,Y) = |X-Y|^2, with

E[W_1] = E\left[ |X|^2 \right] - 2 E\left[ \langle X, Y \rangle \right] + E\left[ |Y|^2 \right] .
The middle term on the right vanishes because X and Y are independent. The
other two each are equal to 3/5, so E[W_1] = 6/5. With \lambda = .2, the improvement
takes us from var(V) \approx .99 to var(Z) \approx .72, about 26% lower. Another possi-
bility is W_2 = |X-Y|^4, with E[W_2] = 6/7 + 6/5. Using these two control variates
together gives var(Z) \approx .58, an almost 50% reduction. The Matlab program
that does this, CV1.m, is posted.
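A minimal Python sketch of this computation (it is not the posted CV1.m, which is Matlab;
the ball sampler, seed, and sample size are choices made here) estimates the coefficients by
solving the sample version of (8):

import numpy as np

rng = np.random.default_rng(1)
lam, L = 0.2, 200_000

def sample_ball(n):
    # Uniform points in the unit ball: random direction times radius U^(1/3).
    z = rng.standard_normal((n, 3))
    z /= np.linalg.norm(z, axis=1, keepdims=True)
    return z * rng.random((n, 1)) ** (1.0 / 3.0)

X, Y = sample_ball(L), sample_ball(L)
r = np.linalg.norm(X - Y, axis=1)
V = np.exp(-lam * r) / r
W = np.column_stack((r**2, r**4))                # control variates W_1, W_2
B = np.array([6.0/5.0, 6.0/7.0 + 6.0/5.0])       # their known expected values

# Sample covariances, then solve the sample version of (8) for the coefficients.
C = np.cov(np.column_stack((V, W)), rowvar=False)
alpha = np.linalg.solve(C[1:, 1:], C[0, 1:])

Z = V - (W - B) @ alpha
print("estimate        ", Z.mean())
print("var(V), var(Z)  ", V.var(ddof=1), Z.var(ddof=1))

Up to sampling error, this should reproduce the variance reductions quoted above.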
This example shows a relatively modest variance reduction from two not very
insightful control variates. Variance reduction methods that seem impressive
in one dimensional examples may become less effective in higher dimensional
problems, as this relatively modest six dimensional problem illustrates.
3 Partial averaging
Partial averaging, or Rao-Blackwellization, reduces variance by averaging over
some of the variables or over part of the integration domain. For example, sup-
pose (X, Y) is a random variable with probability density f(x, y). Let V(X, Y)
be a random variable and

\tilde{V}(x) = E\left[ V(X,Y) \mid X = x \right] = \frac{ \int V(x,y) f(x,y)\, dy }{ \int f(x,y)\, dy } .    (9)
A simple inequality shows that except in the trivial case where V already was
independent of y,
var(\tilde{V}) < var(V) .    (10)

In fact, the reader can check that

var(V) = var(\tilde{V}) + E\left[ \left( V - \tilde{V} \right)^2 \right] .    (11)
The conclusion is that if a problem can be solved partially, if some of the integrals
(9) can be computed explicitly, the remaining problem is easier.
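As a quick numerical check of (10) and (11), here is a small Python sketch on a toy problem
of our own choosing (V(X,Y) = e^{X+Y} with X, Y independent and uniform on [0,1], for which
the partial average \tilde{V}(x) = E[V | X = x] = e^x (e - 1) is explicit):

import numpy as np

rng = np.random.default_rng(2)
L = 200_000
X, Y = rng.random(L), rng.random(L)

V = np.exp(X + Y)                 # V(X, Y)
Vt = np.exp(X) * (np.e - 1.0)     # partial average over Y, computed exactly

print("estimates of A      ", V.mean(), Vt.mean())   # both are unbiased
print("var(V)              ", V.var(ddof=1))
print("var(V tilde)        ", Vt.var(ddof=1))        # smaller, as in (10)
print("var(V tilde) + E[(V - V tilde)^2]",
      Vt.var(ddof=1) + np.mean((V - Vt) ** 2))       # close to var(V), as in (11)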
A more abstract and general version of the partial averaging method is
that if G is a sub \sigma-algebra and

\tilde{V} = E\left[ V \mid G \right] ,
then we again have (11) and the variance reduction property (10). Of course,
the method still depends on being able to evaluate Ve efficiently.
Subset averaging is another concrete realization of the partial averaging prin-
ciple. Suppose B is a subset (i.e., an event) and that E [V | B] is known. If
\tilde{V}(x) = \begin{cases} E[V \mid B] & \text{if } x \in B, \\ V(x) & \text{if } x \notin B, \end{cases}

then again var(\tilde{V}) < var(V) except in trivial situations. For example, we might
take B to be the largest set for which E [V | B] can be evaluated by symmetry.
Example. Consider just the Y integration in the previous example:

E_Y\left[ \frac{e^{-\lambda|X-Y|}}{|X-Y|} \right] = \frac{3}{4\pi} \int_{|y| \le 1} \frac{e^{-\lambda|x-y|}}{|x-y|}\, dy .

Over the ball B_x = \{ y : |x-y| < 1-|x| \} about x, the largest ball centered at x that
stays inside B, the conditional expectation can be computed explicitly:

E_Y\left[ V(x,Y) \mid B_x \right]
= u(x) = \frac{3}{\lambda^2} \, (1-|x|)^{-3} \left( 1 - e^{-\lambda(1-|x|)} \left( 1 + \lambda(1-|x|) \right) \right) .    (12)
Therefore

A = E_{(X,Y)}\left[ \frac{e^{-\lambda|X-Y|}}{|X-Y|} \right] = E_{(X,Y)}\left[ \tilde{V}(X,Y) \right] ,

where

\tilde{V}(X,Y) = \begin{cases} \dfrac{e^{-\lambda|X-Y|}}{|X-Y|} & \text{if } |X-Y| \ge 1-|X| , \\ u(X) & \text{if } |X-Y| < 1-|X| . \end{cases}
Computational experiments (Matlab script CV3.m posted) with \lambda = .2 show
that var(\tilde{V}) \approx .61. We may further reduce the variance using the earlier control
variates W_1 = |X-Y|^2 and W_2 = |X-Y|^4. Using only W_1 gives var(Z) \approx .35.
Using W_1 and W_2 together gives var(Z) \approx .24. Thus, the combined effects of
not very sophisticated partial averaging and two simple control variates reduce
the variance, and the work needed to achieve a given accuracy, by a factor of 4
(from .99 to .24).
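The following Python sketch (again not the posted CV3.m; the ball sampler and parameter
choices are made here for illustration) implements the partially averaged \tilde{V} using u(x)
from (12):

import numpy as np

rng = np.random.default_rng(3)
lam, L = 0.2, 200_000

def sample_ball(n):
    # Uniform points in the unit ball: random direction times radius U^(1/3).
    z = rng.standard_normal((n, 3))
    z /= np.linalg.norm(z, axis=1, keepdims=True)
    return z * rng.random((n, 1)) ** (1.0 / 3.0)

def u(xn):
    # Conditional expectation (12) over the ball of radius 1 - |x| about x.
    r = 1.0 - xn
    return 3.0 / (lam**2 * r**3) * (1.0 - np.exp(-lam * r) * (1.0 + lam * r))

X, Y = sample_ball(L), sample_ball(L)
d = np.linalg.norm(X - Y, axis=1)
xn = np.linalg.norm(X, axis=1)

V = np.exp(-lam * d) / d
Vt = np.where(d < 1.0 - xn, u(xn), V)   # subset-averaged V tilde; u(xn) is used only where d < 1 - |x|

print("estimates       ", V.mean(), Vt.mean())
print("var(V), var(Vt) ", V.var(ddof=1), Vt.var(ddof=1))

The control variates W_1 and W_2 can then be layered on top of \tilde{V} exactly as before.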
4 Importance sampling

Importance sampling evaluates an expectation with respect to a density f by
sampling a well chosen different distribution, g. To get an unbiased estimator, write
A = \int V(x) f(x)\, dx = \int V(x) \, \frac{f(x)}{g(x)} \, g(x)\, dx = E_g\left[ V(X) L(X) \right] .    (13)
The ratio L(x) = f(x)/g(x) may be called the likelihood ratio, the importance
function, the score function, or the Radon-Nikodym derivative. Of course we
must have g(x) > 0 whenever f (x) > 0. Estimators using (13) are unbiased for
any g. Our task is to find distributions g that we can sample so that
var_g[ V(X) L(X) ] is as small as possible.
A basic application is estimating the probability p = P(X \in B) = E_f[ I_B(X) ] of an
unlikely event B. Sampling f directly, the natural estimator is the fraction of hits,
\hat{p} = N/L, where N = \sum_{k=1}^{L} I_B(X_k) counts the samples that land in B. Since
var(\hat{p}) = p(1-p)/L \approx p/L when p is small,

\frac{\sigma(\hat{p})}{p} \approx \frac{1}{\sqrt{pL}} = \frac{1}{\sqrt{E[N]}} .    (15)
That is, the relative error is governed by the expected number of hits. If I want
10% accuracy in \hat{p}, I need to generate about 100 hits, which could mean very
many mostly fruitless trials.
2 This means I_B(x) = 1 if x \in B and I_B(x) = 0 if x \notin B.
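As a small Python illustration of (15), with an event chosen here for the purpose, consider
estimating p = P(X > 3) for X standard normal by direct sampling:

import numpy as np

rng = np.random.default_rng(4)
L = 100_000
X = rng.standard_normal(L)

N = np.sum(X > 3.0)        # number of hits
p_hat = N / L
print("p_hat                    ", p_hat)
print("hits N                   ", N)
print("predicted relative error ", 1.0 / np.sqrt(N))   # the estimate (15)

Here p is about 1.3e-3, so roughly 135 of the 10^5 samples hit and the predicted relative
error is close to 10%.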
The first goal of importance sampling is to make hits more likely. But this is
not enough. Often the conditional density within B, f_B(x) = f(x)/P(B)
for x \in B, is sharply peaked within B. Informally, we say that rare events
happen in predictable ways. If g is peaked in the wrong parts of B, the resulting
variance could be larger than for the naive estimator. To be a good importance
sampler, g samples must hit B often, and in roughly the same way f samples
that hit B do.
The mathematical study of rare events is the theory of large deviations3 . A
typical large deviations theorem says Pn (B) ∼ e−Rn , where n is some measure
of the problem size. The proof usually involves a change of measure of the kind
we have been discussing, a g that knows how the rare f samples that hit B do so.
4 This is a general way, called Watson’s lemma, to estimate integrals like this one.
The interested reader will be able to show that the error in this approximation
is O(n^{-3/2}), so that

P(S \ge b) = e^{-n b^2 / 2} \, \frac{1}{\sqrt{2\pi}\, b} \, \frac{1}{\sqrt{n}} + O\left( n^{-3/2} \right) .

This is of the general form (16) with R = b^2/2 and C = 1/(\sqrt{2\pi}\, b).
This formula illustrates the predictability of rare events. If B is the event
S \ge b, then most of the hits in B have S only slightly above b. In fact, P(S \ge
b + \epsilon \mid S \ge b) \sim e^{-\epsilon b n}. Moreover, we explore the mechanism for samples with S \ge b
by finding h_b(y), the conditional probability density of, say, Y_1, given that S \ge b.
In this Gaussian world, the density of Y_1 when S = b + \epsilon is Gaussian and
(after some thought) h_b \approx N(b + \epsilon, 1). Since \epsilon is small for large n, this gives
h_b(y) \to N(b, 1) as n \to \infty. This suggests that we can sample more effectively
by drawing the Y_k from N(b, 1) than from N(0, 1).
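A minimal Python sketch of this suggestion (the values of n, b, and the number of trials are
choices made here): estimate P(S \ge b), with S the mean of n standard normals, by drawing
the Y_k from N(b, 1) and weighting with the likelihood ratio (13), which for this Gaussian shift
is L(Y) = exp(-b \sum_k Y_k + n b^2/2):

import numpy as np
from math import erfc, sqrt

rng = np.random.default_rng(5)
n, b, M = 20, 1.0, 100_000                 # n summands, threshold b, M Monte Carlo trials

Y = rng.standard_normal((M, n)) + b        # Y_k drawn from N(b, 1) instead of N(0, 1)
S = Y.mean(axis=1)
LR = np.exp(-b * Y.sum(axis=1) + n * b * b / 2.0)    # likelihood ratio f(Y)/g(Y)
samples = (S >= b) * LR
est = samples.mean()                                  # importance sampling estimator (13)

exact = 0.5 * erfc(b * sqrt(n) / sqrt(2.0))           # P(N(0,1) >= b sqrt(n))
print("importance sampling estimate", est)
print("exact value                 ", exact)
print("relative standard error     ", samples.std(ddof=1) / (est * sqrt(M)))

Here p is of order 4e-6, so direct sampling of S would produce a hit only about once in several
hundred thousand trials, while roughly half of the shifted samples land in B.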
The proof of Cramer’s theorem for general h(y) builds on these observations.
The probability density for S is

\phi(s) = n \int \cdots \int h(y_1) \cdots h(y_n)\, \delta(y_1 + \cdots + y_n - ns)\, dy_1 \cdots dy_n .

The analysis uses the exponential moments

Z(\lambda) = \int e^{\lambda y} h(y)\, dy = E\left[ e^{\lambda Y} \right] .    (18)
The hypothesis of Cramer’s theorem is that the exponential moments (18) are
finite, at least for a suitable range of \lambda. This implies that h(y) decays expo-
nentially in some average sense as y \to \infty. Clearly, the force \lambda changes the
expected value of Y. Under the tilted density (17), h_\lambda(y) = e^{\lambda y} h(y) / Z(\lambda), this
expected value is

\mu(\lambda) = E_\lambda[Y] = \frac{1}{Z(\lambda)} \int y\, e^{\lambda y} h(y)\, dy .    (19)

Note that h_b in the Gaussian case has the form (17) with \lambda chosen so that
\mu(\lambda) = b. In general, define \lambda^*(s) by
\mu(\lambda^*(s)) = s .    (20)
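As a simple illustration (an example added here, not one worked out in the notes), take
h(y) = e^{-y} for y \ge 0. Then

Z(\lambda) = \frac{1}{1 - \lambda} \quad (\lambda < 1), \qquad \mu(\lambda) = \frac{1}{1 - \lambda} ,

and solving \mu(\lambda^*(s)) = s gives \lambda^*(s) = 1 - 1/s for s > 1.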
It is easy to see that if \lambda^*(s) exists for a given s, it is unique. Using this \lambda^*, we
have

\frac{1}{n} \, e^{n \lambda^*(s) s} \, Z(\lambda^*(s))^{-n} \, \phi(s)
= \int \cdots \int h_{\lambda^*}(y_1) \cdots h_{\lambda^*}(y_n)\, \delta(y_1 + \cdots + y_n - ns)\, dy_1 \cdots dy_n .
What is special about using \lambda = \lambda^*(s) is that the right side is the probability
density, at the point ns, of a sum of n iid random variables with density h_{\lambda^*} and
mean s. This allows us to use the central limit theorem to approximate the right
side. Let

\sigma_{\lambda^*}^2 = E_{\lambda^*}\left[ (Y - s)^2 \right] = \frac{1}{Z(\lambda^*)} \int (y - s)^2 e^{\lambda^* y} h(y)\, dy .

Then the right side is approximately the peak value of a N(ns, n \sigma_{\lambda^*}^2) density,
which is

\frac{1}{\sqrt{2\pi n}\, \sigma_{\lambda^*}} .
Altogether⁶

\phi(s) = Z(\lambda^*(s))^{n} \, e^{-n \lambda^*(s) s} \, \sqrt{n} \left( \frac{1}{\sqrt{2\pi}\, \sigma_{\lambda^*}} + O\left( n^{-1/2} \right) \right) .
This shows that

\phi(s) = e^{-R(s) n} \, \sqrt{n} \left( d(s) + O\left( n^{-1/2} \right) \right) ,

with

R(s) = s \lambda^*(s) - \ln\left\{ Z(\lambda^*(s)) \right\} ,    (21)

and

d(s) = \frac{1}{\sqrt{2\pi}\, \sigma_{\lambda^*}} .
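Continuing the exponential illustration from above, Z(\lambda^*(s)) = s, and the tilted density
h_{\lambda^*} is exponential with mean s, so \sigma_{\lambda^*} = s. Then (21) gives

R(s) = s - 1 - \ln s , \qquad d(s) = \frac{1}{\sqrt{2\pi}\, s} .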
As in the Gaussian case,

p = \int_{s=b}^{\infty} \phi(s)\, ds ,
and most of the mass is near s = b. Again using the Watson lemma, we expand
6 The error term comes from the error term in the central limit theorem.