Mathematical Finance

This document provides an overview of measure-theoretic probability and Lebesgue integration. It begins by defining Lebesgue measure for intervals on the real line based on length, then extends this to higher dimensions based on area and volume. It introduces Lebesgue-measurable sets and defines the Lebesgue integral for non-negative functions using simple approximations, then extends this to signed functions by splitting them into positive and negative parts. Finally, it discusses Lp spaces and compares the Lebesgue and Riemann integrals.


Chapter III: MEASURE-THEORETIC PROBABILITY

1. Measure
The language of option pricing involves that of probability, which in turn
involves that of measure theory. This originated with Henri LEBESGUE
(1875-1941), in his 1902 thesis, ‘Intégrale, longueur, aire’. We begin with
the simplest case.
Length.
The length µ(I) of an interval I = (a, b), [a, b], [a, b) or (a, b] should be
b − a: µ(I) = b − a. The length of the disjoint union I = ⋃_{r=1}^n I_r of intervals
I_r should be the sum of their lengths:

   µ( ⋃_{r=1}^n I_r ) = Σ_{r=1}^n µ(I_r)     (finite additivity).

Consider now an infinite sequence I_1, I_2, . . . (ad infinitum) of disjoint intervals.
Letting n → ∞ suggests that length should again be additive over disjoint
intervals:

   µ( ⋃_{r=1}^∞ I_r ) = Σ_{r=1}^∞ µ(I_r)     (countable additivity).

For I an interval and A ⊆ I a subset of length µ(A), the length of the
complement I \ A := I ∩ A^c of A in I should be

   µ(I \ A) = µ(I) − µ(A)     (complementation).

If A ⊆ B and B has length µ(B) = 0, then A should have length 0 also:

A ⊆ B & µ(B) = 0 ⇒ µ(A) = 0 (completeness).

Let F be the smallest class of sets A ⊂ R containing the intervals, closed


under countable disjoint unions and complements, and complete (containing
all subsets of sets of length 0 as sets of length 0). The above suggests – what
Lebesgue showed – that length can be sensibly defined on the sets F on the
line, but on no others. There are others – but they are hard to construct
(in technical language: the Axiom of Choice (AC), or some variant of it such

as Zorn’s Lemma, is needed to demonstrate the existence of non-measurable
sets – but all such proofs are highly non-constructive). So: some but not all
subsets of the line have a length.¹ These are called the Lebesgue-measurable
sets, and form the class F described above; length, defined on F, is called
Lebesgue measure µ (on the real line, R).
Area.
The area of a rectangle R = (a1 , b1 ) × (a2 , b2 ) – with or without any of
its perimeter included – should be µ(R) = (b1 − a1 ) × (b2 − a2 ). The area of
a finite or countably infinite union of disjoint rectangles should be the sum
of their areas:

   µ( ⋃_{n=1}^∞ R_n ) = Σ_{n=1}^∞ µ(R_n)     (countable additivity).

If R is a rectangle and A ⊆ R with area µ(A), the area of the complement


R \ A should be

µ(R \ A) = µ(R) − µ(A) (complementation).

If A ⊆ B and B has area 0, A should have area 0:

   A ⊆ B & µ(B) = 0 ⇒ µ(A) = 0     (completeness).

Let F be the smallest class of sets, containing the rectangles, closed under
finite or countably infinite unions, closed under complements, and complete
(containing all subsets of sets of area 0 as sets of area 0). Lebesgue showed
that area can be sensibly defined on the sets in F and no others. The sets
A ∈ F are called the Lebesgue-measurable sets in the plane R2 ; area, defined
on F, is called Lebesgue measure in the plane. So: some but not all sets in
the plane have an area.
Volume.
Similarly in three-dimensional space R3 , starting with the volume of a
cuboid C = (a1 , b1 ) × (a2 , b2 ) × (a3 , b3 ) as

µ(C) = (b1 − a1 ) · (b2 − a2 ) · (b3 − a3 ).


¹ There are alternatives to AC, under which all sets are measurable. So it is not so much
a question of whether AC is true or not, but of what axioms of Set Theory we assume.
Background: Model Theory in Mathematical Logic, etc.

Euclidean space.
Similarly in k-dimensional Euclidean space Rk . We start with
   µ( ∏_{i=1}^k (a_i, b_i) ) = ∏_{i=1}^k (b_i − a_i),

and obtain the class F of Lebesgue-measurable sets in Rk, and Lebesgue measure µ in Rk.
Probability.
The unit cube [0, 1]k in Rk has Lebesgue measure 1. It can be used to
model the uniform distribution (density f (x) = 1 if x ∈ [0, 1]k , 0 otherwise),
with probability = length/area/volume if k = 1/2/3.
Note. If a property holds everywhere except on a set of measure zero, we
say it holds almost everywhere (a.e.) [French: presque partout, p.p.; German:
fast überall, f.u.]. If it holds everywhere except on a set of probability zero,
we say it holds almost surely (a.s.) [or, with probability one].

2. Integral.
1. Indicators.
We start in dimension k = 1 for simplicity, and consider the simplest
calculus formula ∫_a^b 1 dx = b − a. We rewrite this as

   I(f) := ∫_{−∞}^{∞} f(x) dx = b − a     if f(x) = I_{[a,b]}(x),

the indicator function of [a, b] (1 in [a, b], 0 outside it), and similarly for the
other three choices about end-points.
2. Simple functions.
A function f is called simple if it is a finite linear combination of indica-
tors: f = Σ_{i=1}^n c_i f_i for constants c_i and indicator functions f_i of intervals I_i.
One then extends the definition of the integral from indicator functions to
simple functions by linearity:
   I( Σ_{i=1}^n c_i f_i ) := Σ_{i=1}^n c_i I(f_i)

for constants c_i and indicators f_i of intervals I_i.
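By way of illustration, here is a small Python sketch (not part of the notes; the intervals and coefficients are arbitrary) of this linearity: the integral of a simple function f = Σ c_i f_i is just Σ c_i µ(I_i).

# Integral of a simple function by linearity: sum of c_i * length(I_i).
# The disjoint intervals I_i and coefficients c_i below are made-up examples.
intervals = [(0.0, 1.0), (1.0, 3.0), (3.0, 4.5)]   # the I_i
coeffs    = [2.0, -1.0, 0.5]                        # the c_i

I_f = sum(c * (b - a) for c, (a, b) in zip(coeffs, intervals))
print(I_f)   # 2*1 + (-1)*2 + 0.5*1.5 = 0.75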


3. Non-negative measurable functions.

Call f a (Lebesgue-) measurable function if, for all c, the set {x : f(x) ≤
c} is a Lebesgue-measurable set (§1). If f is a non-negative measurable
function, we quote that it is possible to construct f as the increasing limit
of a sequence of simple functions fn :

fn (x) ↑ f (x) for all x ∈ R (n → ∞), fn simple.

We then define the integral of f as

   I(f) := lim_{n→∞} I(f_n)     (≤ ∞)

(we quote that this does indeed define I(f ): the value does not depend on
which approximating sequence (fn ) we use). Since fn increases in n, so does
I(fn ) (the integral is order-preserving), so either I(fn ) increases to a finite
limit, or diverges to ∞. In the first case, we say f is (Lebesgue-) integrable
with (Lebesgue-) integral I(f) = lim I(f_n), or ∫ f(x) dx = lim ∫ f_n(x) dx, or
simply ∫ f = lim ∫ f_n.
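As a concrete instance of this construction (an illustrative sketch, not from the notes), take f(x) = x on [0, 1] and the standard staircase approximations f_n(x) = ⌊2^n x⌋/2^n; each f_n is simple, f_n ↑ f, and I(f_n) increases to the Lebesgue integral 1/2:

# Staircase (dyadic) approximation of f(x) = x on [0,1] by simple functions:
# f_n takes the value k/2^n on [k/2^n, (k+1)/2^n), so I(f_n) is a finite sum.
for n in range(1, 8):
    I_fn = sum((k / 2**n) * (1 / 2**n) for k in range(2**n))
    print(n, I_fn)   # 0.25, 0.375, ... increasing towards 1/2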
4. Measurable functions.
If f is a measurable function that may change sign, we split it into its
positive and negative parts, f± :

   f+(x) := max(f(x), 0),     f−(x) := − min(f(x), 0),

   f(x) = f+(x) − f−(x),     |f(x)| = f+(x) + f−(x).

If both f+ and f− are integrable, we say that f is too, and define


   ∫ f := ∫ f+ − ∫ f−.

Then, in particular, |f | is also integrable, and


   ∫ |f| = ∫ f+ + ∫ f−.

Note. The Lebesgue integral is, by construction, an absolute integral: f is


integrable iff |f | is integrable. Thus, for instance, the well-known formula
   ∫_0^∞ (sin x)/x dx = π/2

has no meaning for Lebesgue integrals, since ∫_1^∞ |sin x / x| dx diverges to +∞
like ∫_1^∞ (1/x) dx. It has to be replaced by the limit relation

   ∫_0^X (sin x)/x dx → π/2     (X → ∞).
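A quick numerical check (an illustrative sketch, not part of the notes) contrasts the two behaviours: the truncated integrals of sin x / x settle down to π/2, while those of |sin x / x| keep growing.

import numpy as np

# Riemann sums on a fine grid: integral of sin(x)/x over [0, X] tends to pi/2,
# but the integral of |sin(x)/x| grows without bound (roughly like (2/pi) log X).
for X in (10, 100, 1000):
    x = np.linspace(1e-8, X, 2_000_000)
    dx = x[1] - x[0]
    f = np.sin(x) / x
    print(X, np.sum(f) * dx, np.sum(np.abs(f)) * dx)
print(np.pi / 2)   # 1.5707...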

The class of (Lebesgue-) integrable functions f on R is written L(R) or (for


reasons explained below) L1 (R) – abbreviated to L1 or L.
Higher dimensions.
In Rk, we start instead from k-dimensional boxes. If f is the indicator of
a box B = [a1, b1] × [a2, b2] × · · · × [ak, bk], ∫ f := ∏_{i=1}^k (bi − ai). We then ex-
tend to simple functions by linearity, to non-negative measurable functions
by taking increasing limits, and to measurable functions by splitting into
positive and negative parts.

Lp spaces.
For p ≥ 1, the Lp spaces Lp (Rk ) on Rk are the spaces of measurable
functions f with Lp -norm
   ‖f‖_p := ( ∫ |f|^p )^{1/p} < ∞.
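For instance (an illustrative numerical sketch, not from the notes), for f(x) = e^{−x} on (0, ∞) the L1 and L2 norms are 1 and 1/√2, which a truncated Riemann sum reproduces:

import numpy as np

# L^p norms of f(x) = exp(-x) on (0, infinity), approximated on [0, 50];
# exact values: ||f||_1 = 1, ||f||_2 = 1/sqrt(2) ~ 0.7071.
x = np.linspace(0.0, 50.0, 1_000_000)
dx = x[1] - x[0]
f = np.exp(-x)
for p in (1, 2):
    print(p, (np.sum(f**p) * dx) ** (1 / p))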

Riemann integrals.
Our first exposure to integration is the ‘Sixth-Form integral’, taught non-
rigorously at school. Mathematics undergraduates are taught a rigorous in-
tegral (in their first or second years), the Riemann integral [B. RIEMANN
(1826-1866)] – essentially this is just a rigorisation of the school integral.
It is much easier to set up than the Lebesgue integral, but much harder to
manipulate.
For finite intervals [a, b], we quote:
(i) for any function f Riemann-integrable on [a, b], it is Lebesgue-integrable
to the same value (but many more functions are Lebesgue integrable);
(ii) f is Riemann-integrable on [a, b] iff it is continuous a.e. on [a, b]. Thus the
question, “Which functions are Riemann-integrable?” cannot be answered
without the language of measure theory – which then gives one the techni-
cally superior Lebesgue integral anyway.
Note. Integration is like summation (which is why Leibniz gave us the in-
tegral sign ∫, as an elongated S). Lebesgue was a very practical man – his

father was a tradesman – and used to think about integration in the follow-
ing way. Think of a shopkeeper totalling up his day’s takings. The Riemann
integral is like adding up the takings – notes and coins – in the order in
which they arrived. By contrast, the Lebesgue integral is like totalling up
the takings in order of size - from the smallest coins up to the largest notes.
This is obviously better! In mathematical effect, it exchanges ‘integrating by
x-values’ (abscissae) with ‘integrating by y-values’ (ordinates).

Lebesgue-Stieltjes integral.
Suppose that F (x) is a non-decreasing function on R:

   F(x) ≤ F(y)     if x ≤ y

(prime example: F a probability distribution function). Such functions can


have at most countably many discontinuities, which are at worst jumps. We
may without loss re-define F at jumps so as to be right-continuous.
We now generalise the starting points above:
(i) Measure. We take µ((a, b]) := F(b) − F(a).
(ii) Integral. We take ∫_a^b 1 dF := F(b) − F(a).
We may now follow through the successive extension procedures used above.
We obtain:
(i) Lebesgue-Stieltjes measure µ, or µF,
(ii) Lebesgue-Stieltjes integral ∫ f dµ, or ∫ f dµF, or even ∫ f dF.
Similarly in higher dimensions; we omit further details.
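In the simplest case F is the distribution function of a discrete random variable, with jumps p_i at points x_i; then ∫ f dF collapses to a weighted sum of the jumps. A small sketch (the jump points and sizes below are made up for illustration):

# Lebesgue-Stieltjes integral against a purely discrete, right-continuous F:
# F jumps by p_i at x_i, so the integral of f dF is sum_i f(x_i) * p_i.
xs    = [0.0, 1.0, 2.5]    # jump locations (hypothetical)
jumps = [0.2, 0.5, 0.3]    # jump sizes, summing to 1, so F is a c.d.f.

def integral_dF(f):
    return sum(f(x) * p for x, p in zip(xs, jumps))

print(integral_dF(lambda x: x))        # the mean: 0.2*0 + 0.5*1 + 0.3*2.5 = 1.25
print(integral_dF(lambda x: x ** 2))   # the second moment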
Finite variation (FV).
If instead of being monotone non-decreasing, F is the difference of two
such functions, F = F1 − F2, we can define the integrals ∫ f dF1, ∫ f dF2 as
above, and then define

   ∫ f dF = ∫ f d(F1 − F2) := ∫ f dF1 − ∫ f dF2.

If [a, b] is a finite interval and F is defined on [a, b], a finite collection of
points x0, x1, . . . , xn with a = x0 < x1 < · · · < xn = b is called a partition
of [a, b], P say. The sum Σ_{i=1}^n |F(xi) − F(xi−1)| is called the variation of
F over the partition. The least upper bound of this over all partitions P is
called the variation of F over the interval [a, b], V_a^b(F):

   V_a^b(F) := sup_P Σ_i |F(xi) − F(xi−1)|.

This may be +∞; but if V_a^b(F) < ∞, F is said to be of finite variation (FV)
on [a, b], F ∈ FV_a^b (bounded variation, BV, is also used). If F is of finite
variation on all finite intervals, F is said to be locally of finite variation,
F ∈ FV_loc; if F is of finite variation on the real line, F is of finite variation,
F ∈ FV.
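As an illustrative sketch (not in the notes), take F(x) = sin x on [0, 2π], which is piecewise monotone with V_0^{2π}(F) = ∫_0^{2π} |cos x| dx = 4; every partition sum is at most 4, and finer partitions approach it:

import numpy as np

# Partition sums of |F(x_i) - F(x_{i-1})| for F = sin on [0, 2*pi]; each sum is
# bounded by the variation V = 4, and refining the partition approaches it.
F = np.sin
a, b = 0.0, 2 * np.pi
for n in (2, 3, 5, 50, 500):
    x = np.linspace(a, b, n + 1)
    print(n, np.sum(np.abs(np.diff(F(x)))))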
We quote (Jordan’s theorem) that the following are equivalent:
(i) F is locally of finite variation;
(ii) F is the difference F = F1 − F2 of two monotone functions.
So the above procedure defines the integral ∫ f dF when the integrator F is
of finite variation.

3. Probability.
Probability spaces.
The mathematical theory of probability can be traced to 1654, to corre-
spondence between PASCAL (1623-1662) and FERMAT (1601-1665). How-
ever, the theory remained both incomplete and non-rigorous till the 20th
century. It turns out that the Lebesgue theory of measure and integral
sketched above is exactly the machinery needed to construct a rigorous the-
ory of probability adequate for modelling reality (option pricing, etc.) for
us. This was realised by the great Russian mathematician and probabilist
A.N.KOLMOGOROV (1903-1987), whose classic book of 1933, Grundbegriffe
der Wahrscheinlichkeitsrechnung [Foundations of probability theory] inaugu-
rated the modern era in probability.
Recall from your first course on probability that, to describe a random
experiment mathematically, we begin with the sample space Ω, the set of all
possible outcomes. Each point ω of Ω, or sample point, represents a possible
– random – outcome of performing the random experiment. For a set A ⊆ Ω
of points ω we want to know the probability P (A) (or Pr(A), pr(A)). We
clearly want
1. P (∅) = 0, P (Ω) = 1.
2. P (A) ≥ 0 for all A.
3. If A1, A2, . . . , An are disjoint, P( ⋃_{i=1}^n Ai ) = Σ_{i=1}^n P(Ai)
(finite additivity – fa), which, as above, we will strengthen to
3*. If A1, A2, . . . (ad inf.) are disjoint,

   P( ⋃_{i=1}^∞ Ai ) = Σ_{i=1}^∞ P(Ai)     (countable additivity – ca).

4. If B ⊆ A and P (A) = 0, then P (B) = 0 (completeness).
Then by 1 and 3 (with A = A1 , Ω \ A = A2 ),

P (Ac ) = P (Ω \ A) = 1 − P (A).

So the class F of subsets of Ω whose probabilities P (A) are defined should


be closed under countable, disjoint unions and complements, and contain the
empty set ∅ and the whole space Ω. Such a class is called a σ-field of subsets
of Ω [or sometimes a σ-algebra, which one would write A]. For each A ∈ F,
P (A) should be defined (and satisfy 1, 2, 3∗, 4 above). So, P : F → [0, 1] is a
set-function,
P : A 7→ P (A) ∈ [0, 1] (A ∈ F).
The sets A ∈ F are called events. Finally, 4 says that all subsets of null sets
– events with probability zero (we will call the empty set ∅ empty, not null) –
should themselves be null sets (completeness). A probability space, or Kolmogorov triple,
is a triple (Ω, F, P ) satisfying these Kolmogorov axioms 1,2,3*,4 above. A
probability space is a mathematical model of a random experiment.
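The simplest illustration (hypothetical, not from the notes) is a fair die: Ω = {1, . . . , 6}, F the power set, P(A) = |A|/6, and the axioms can be checked directly:

from itertools import combinations

# A finite Kolmogorov triple: Omega = {1,...,6}, F = all subsets, P(A) = |A|/6.
Omega = frozenset(range(1, 7))
F = [frozenset(s) for r in range(7) for s in combinations(Omega, r)]
P = lambda A: len(A) / len(Omega)

print(len(F))                            # 64 = 2^6 events
print(P(frozenset()), P(Omega))          # axiom 1: 0.0 and 1.0
A1, A2 = frozenset({1, 2}), frozenset({5, 6})
print(P(A1 | A2), P(A1) + P(A2))         # finite additivity: both 2/3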

Random variables.
Next, recall random variables X from your first probability course. Given
a random outcome ω, you can calculate the value X(ω) of X (a scalar – a
real number, say; similarly for vector-valued random variables, or random
vectors). So, X is a function from Ω to R, X : Ω → R,

X : ω 7→ X(ω) (ω ∈ Ω).

Recall also that the distribution function of X is defined by


 
   F(x), or FX(x), := P({ω : X(ω) ≤ x}), or P(X ≤ x)     (x ∈ R).

We can only deal with functions X for which all these probabilities are de-
fined. So, for each x, we need {ω : X(ω) ≤ x} ∈ F. We summarize this by
saying that X is measurable with respect to the σ-field F (of events), briefly,
X is F-measurable. Then, X is called a random variable [non-F-measurable
X cannot be handled, and so are left out]. So,
(i) a random variable X is an F-measurable function on Ω;
(ii) a function on Ω is a random variable (is measurable) iff its distribution
function is defined.

Generated σ-fields.
The smallest σ-field containing all the sets {ω : X(ω) ≤ x} for all real x
[equivalently, {X < x}, {X ≥ x}, {X > x}]² is called the σ-field generated
by X, written σ(X). Thus,

X is F-measurable [is a random variable] iff σ(X) ⊆ F.

When the (random) value X(ω) is known, we know which of the events in the
σ-field generated by X have happened: these are the events {ω : X(ω) ∈ B},
where B runs through the Borel σ-field [the σ-field generated by the intervals
– it makes no difference whether open, closed etc.] on the line.
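For a finite Ω and a discrete X this is very concrete (an illustrative sketch, not part of the notes): σ(X) consists of all unions of the level sets {ω : X(ω) = value}, so 'knowing X' means knowing which of these events occurred.

from itertools import combinations

# sigma(X) for a finite Omega: all unions of the level sets of X.
# Hypothetical example: Omega = {1,...,6}, X = parity of the outcome.
Omega = [1, 2, 3, 4, 5, 6]
X = lambda w: w % 2

levels = {}
for w in Omega:
    levels.setdefault(X(w), set()).add(w)
blocks = list(levels.values())     # the partition of Omega generating sigma(X)

sigma_X = [set().union(*c) for r in range(len(blocks) + 1)
           for c in combinations(blocks, r)]
print(blocks)      # [{1, 3, 5}, {2, 4, 6}]
print(sigma_X)     # [set(), {1, 3, 5}, {2, 4, 6}, {1, 2, 3, 4, 5, 6}]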

Interpretation.
Think of σ(X) as representing what we know when we know X, or in
other words the information contained in X (or in knowledge of X). This is
from the following result, due to J. L. DOOB (1910-2004), which we quote:

σ(X) ⊆ σ(Y ) iff X = g(Y )

for some measurable function g. For, knowing Y means we know X := g(Y )


– but not vice-versa, unless the function g is one-to-one [injective], when the
inverse function g −1 exists, and we can go back via Y = g −1 (X).

Expectation.
A measure (II.1) determines an integral (II.2). A probability measure P ,
being a special kind of measure [a measure of total mass one] determines a
special kind of integral, called an expectation.
Definition. The expectation E of a random variable X on (Ω, F, P ) is
defined by

   E[X] := ∫_Ω X dP,     or     ∫_Ω X(ω) dP(ω).

If X is real-valued, say, with distribution function F , recall (Ch. I) that EX


is defined in your first course on probability by
   E[X] := ∫ x f(x) dx     if X has a density f,

or, if X is discrete, taking values xn (n = 1, 2, . . .) with probability function
f(xn) (≥ 0), Σ f(xn) = 1,

   E[X] := Σ xn f(xn)

(weighted average of possible values, weighted according to their probability).

² Here, and in Measure Theory, whether intervals are open, closed or half-open doesn't
matter. In Topology, such distinctions are crucial. One can combine Topology and
Measure Theory, but we must leave this here.


These two formulae are the special cases (for the density and discrete cases)
of the general formula

   E[X] := ∫_{−∞}^∞ x dF(x),

where the integral on the right is a Lebesgue-Stieltjes integral. This in turn


agrees with the definition above, since if F is the distribution function of X,
   ∫_Ω X dP = ∫_{−∞}^∞ x dF(x)

follows by the change of variable formula for the measure-theoretic integral,


on applying the map X : Ω → R (we quote this: see any book on Measure
Theory).
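A quick Monte Carlo sanity check of the discrete and density forms of EX (an illustrative sketch; the distributions below are chosen arbitrarily):

import numpy as np

rng = np.random.default_rng(0)

# Discrete case: X takes values 0, 1, 3 with probabilities 0.5, 0.3, 0.2.
xs, ps = np.array([0.0, 1.0, 3.0]), np.array([0.5, 0.3, 0.2])
print((xs * ps).sum())                               # E[X] = 0.9
print(rng.choice(xs, p=ps, size=1_000_000).mean())   # sample mean, ~0.9

# Density case: X ~ Exp(1), E[X] = integral of x exp(-x) dx = 1.
x = np.linspace(0.0, 50.0, 1_000_000)
print(np.sum(x * np.exp(-x)) * (x[1] - x[0]))        # ~1
print(rng.exponential(1.0, size=1_000_000).mean())   # ~1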
Glossary. We now have two parallel languages, measure-theoretic and prob-
abilistic:
Measure                      Probability
Integral                     Expectation
Measurable set               Event
Measurable function          Random variable
almost everywhere (a.e.)     almost surely (a.s.)

§4. Equivalent Measures and Radon-Nikodym derivatives.


Given two measures P and Q defined on the same σ-field F, we say that
P is absolutely continuous with respect to Q, written

P << Q,

if P (A) = 0 whenever Q(A) = 0, A ∈ F. We quote from measure theory the


vitally important Radon-Nikodym theorem: P << Q iff there exists an (F-)
measurable function f such that
   P(A) = ∫_A f dQ     ∀A ∈ F

(note that since the integral of anything over a null set is zero, any P so
representable is certainly absolutely continuous with respect to Q – the point
is that the converse holds). Since P(A) = ∫_A dP, this says that

   ∫_A dP = ∫_A f dQ     ∀A ∈ F.

By analogy with the chain rule of ordinary calculus, we write dP/dQ for f ;
then

   ∫_A dP = ∫_A (dP/dQ) dQ     ∀A ∈ F.

Symbolically,

   if P << Q,     dP = (dP/dQ) dQ.
The measurable function (= random variable) dP/dQ is called the Radon-
Nikodym derivative (RN-derivative) of P with respect to Q.
If P << Q and also Q << P , we call P and Q equivalent measures,
written P ∼ Q. Then dP/dQ and dQ/dP both exist, and
   (dP/dQ) = 1 / (dQ/dP).
For P ∼ Q, P (A) = 0 iff Q(A) = 0: P and Q have the same null sets. Taking
negations: P ∼ Q iff P, Q have the same sets of positive measure. Taking
complements: P ∼ Q iff P, Q have the same sets of probability one [the same
a.s. sets]. Thus the following are equivalent: P ∼ Q iff P , Q have the same
null sets/the same a.s. sets/the same sets of positive measure.
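A concrete illustration (a sketch with hypothetical measures, not from the notes): P = N(1, 1) and Q = N(0, 1) are equivalent on R, with dP/dQ(x) = exp(x − 1/2), the ratio of their densities; the Radon-Nikodym relation P(A) = E_Q[1_A dP/dQ] can be checked by simulation.

import numpy as np

rng = np.random.default_rng(0)

# P = N(1,1), Q = N(0,1): equivalent measures with dP/dQ(x) = exp(x - 1/2).
dPdQ = lambda x: np.exp(x - 0.5)
A = lambda x: x > 1.0                    # the event A = {X > 1}

x_P = rng.normal(1.0, 1.0, 1_000_000)    # sample from P
x_Q = rng.normal(0.0, 1.0, 1_000_000)    # sample from Q

print(np.mean(A(x_P)))                   # P(A) directly: ~0.5
print(np.mean(A(x_Q) * dPdQ(x_Q)))       # P(A) = E_Q[1_A dP/dQ]: ~0.5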
Note. Far from being an abstract theoretical result, the Radon-Nikodym
theorem is of key practical importance, in two ways:
(a) It is the key to the concept of conditioning (‘using what we know’ – §5,
§6 below), which is of central importance throughout,
(b) The concept of equivalent measures is central to the key idea of math-
ematical finance, risk-neutrality, and hence to its main results, the Black-
Scholes formula, the Fundamental Theorem of Asset Pricing (FTAP), etc.
The key to all this is that prices should be the discounted expected values
under the equivalent martingale measure. Thus equivalent measures, and
the operation of change of measure, are of central economic and financial
importance. We shall return to this later in connection with the main math-
ematical result on change of measure, Girsanov’s theorem (VII.4).

Recall that we first met the phrase ‘equivalent martingale measure’ in
II.5 above. We now know what a measure is, and what equivalent measures
are; we will learn about martingales in III.3 below.

§5. Conditional Expectations.


Suppose that X is a random variable, whose expectation exists (i.e.
E[|X|] < ∞, or X ∈ L1 ). Then E[X], the expectation of X, is a scalar
(a number) – non-random. The expectation operator E averages out all the
randomness in X, to give its mean (a weighted average of the possible values
of X, weighted according to their probability, in the discrete case).
It often happens that we have partial information about X – for instance,
we may know the value of a random variable Y which is associated with X,
i.e. carries information about X. We may want to average out over the
remaining randomness. This is an expectation conditional on our partial in-
formation, or more briefly a conditional expectation.
This idea will be familiar already from elementary courses, in two cases
(see e.g. [BF]):
1. Discrete case, based on the formula

P (A|B) := P (A ∩ B)/P (B) if P (B) > 0.

If X takes values x1 , · · · , xm with probabilities f1 (xi ) > 0, Y takes values


y1 , · · · , yn with probabilities f2 (yj ) > 0, (X, Y ) takes values (xi , yj ) with
probabilities f(xi, yj) > 0, then
(i) f1(xi) = Σ_j f(xi, yj),   f2(yj) = Σ_i f(xi, yj),
(ii) P(Y = yj | X = xi) = P(X = xi, Y = yj)/P(X = xi) = f(xi, yj)/f1(xi)
     = f(xi, yj) / Σ_j f(xi, yj).

This is the conditional distribution of Y given X = xi , written


   fY|X(yj | xi) = f(xi, yj)/f1(xi) = f(xi, yj) / Σ_j f(xi, yj).

Its expectation is
   E[Y | X = xi] = Σ_j yj fY|X(yj | xi)
                 = Σ_j yj f(xi, yj) / Σ_j f(xi, yj).

But this approach only works when the events on which we condition have
positive probability, which only happens in the discrete case.
2. Density case. Formally replacing the sums above by integrals: if (X, Y )
has density f (x, y),
   X has density f1(x) := ∫_{−∞}^∞ f(x, y) dy,     Y has density f2(y) := ∫_{−∞}^∞ f(x, y) dx.

We define the conditional density of Y given X = x by the continuous ana-


logue of the discrete formula above:
   fY|X(y | x) := f(x, y)/f1(x) = f(x, y) / ∫_{−∞}^∞ f(x, y) dy.

Its expectation is
   E[Y | X = x] = ∫_{−∞}^∞ y fY|X(y | x) dy = ∫_{−∞}^∞ y f(x, y) dy / ∫_{−∞}^∞ f(x, y) dy.

Example: Bivariate normal distribution, N(µ1, µ2, σ1², σ2², ρ). Here

   E[Y | X = x] = µ2 + ρ (σ2/σ1) (x − µ1),
the familiar regression line of statistics (linear model: [BF, Ch. 1]). See I.4.
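This formula is easy to check by simulation (an illustrative sketch; the parameter values are arbitrary): condition on X landing in a narrow band around x and compare the empirical mean of Y with the regression line.

import numpy as np

rng = np.random.default_rng(0)

# Bivariate normal with hypothetical parameters; check E[Y | X = x0] numerically.
mu1, mu2, s1, s2, rho = 0.0, 1.0, 1.0, 2.0, 0.6
cov = [[s1**2, rho*s1*s2], [rho*s1*s2, s2**2]]
X, Y = rng.multivariate_normal([mu1, mu2], cov, size=2_000_000).T

x0 = 0.5
band = np.abs(X - x0) < 0.01                 # condition on X close to x0
print(Y[band].mean())                        # empirical E[Y | X ~ x0]
print(mu2 + rho * (s2 / s1) * (x0 - mu1))    # regression line value: 1.6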

Kolmogorov’s approach: conditional expectations via σ-fields


The problem is that joint densities need not exist – do not exist, in general.
One of the great contributions of Kolmogorov’s classic book of 1933 was the
realization that measure theory – specifically, the Radon-Nikodym theorem
– provides a way to treat conditioning in general, without assuming that we
are in the discrete case or density case above.
Recall that the probability triple is (Ω, F, P ). Take B a sub-σ-field of F,
B ⊂ F (recall: a σ-field represents information; the big σ-field F represents
‘knowing everything’, the small σ-field B represents ‘knowing something’).
Suppose that Y is a non-negative random variable whose expectation
exists: E[Y ] < ∞. The set-function
   Q(B) := ∫_B Y dP     (B ∈ B)

is non-negative (because Y is), σ-additive – because
   ∫_B Y dP = Σ_n ∫_{Bn} Y dP

if B = ∪n Bn , Bn disjoint – and defined on the σ-algebra B, so is a measure


on B. If P (B) = 0, then Q(B) = 0 also (the integral of anything over a
null set is zero), so Q << P . By the Radon-Nikodym theorem (III.4), there
exists a Radon-Nikodym derivative of Q with respect to P on B, which is
B-measurable [in the Radon-Nikodym theorem as stated in III.4, we had F in
place of B, and got a random variable, i.e. an F-measurable function. Here,
we just replace F by B.] Following Kolmogorov (1933), we call this Radon-
Nikodym derivative the conditional expectation of Y given (or conditional on)
B, E[Y |B]: this is B-measurable, integrable, and satisfies
   ∫_B Y dP = ∫_B E[Y |B] dP     ∀B ∈ B.     (∗)

In the general case, where Y is a random variable whose expectation exists


(E[|Y |] < ∞) but which can take values of both signs, decompose Y as

Y = Y+ − Y−

and define E[Y |B] by linearity as

E[Y |B] := E[Y+ |B] − E[Y− |B].

Suppose now that B is the σ-field generated by a random variable X:


B = σ(X) (so B represents the information contained in X, or what we
know when we know X). Then E[Y |B] = E[Y |σ(X)], which is written more
simply as E[Y |X]. Its defining property is
   ∫_B Y dP = ∫_B E[Y |X] dP     ∀B ∈ σ(X).

Similarly, if B = σ(X1 , · · · , Xn ) (B is the information in (X1 , · · · , Xn )) we


write E[Y | σ(X1, · · · , Xn)] as E[Y | X1, · · · , Xn]:

   ∫_B Y dP = ∫_B E[Y | X1, · · · , Xn] dP     ∀B ∈ σ(X1, · · · , Xn).
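On a finite Ω the defining property (*) can be verified by hand; here is a small sketch (a hypothetical example, not from the notes) with Ω = {0, . . . , 5}, P uniform, X the parity of the outcome and Y(ω) = ω:

import numpy as np

# E[Y | X] on a finite probability space: constant on each level set of X,
# equal there to the P-weighted average of Y; then (*) holds for B in sigma(X).
Omega = np.arange(6)
P = np.full(6, 1 / 6)
X = Omega % 2
Y = Omega.astype(float)

EY_given_X = np.array([(Y[X == X[w]] * P[X == X[w]]).sum() / P[X == X[w]].sum()
                       for w in Omega])

for v in (0, 1):                    # the generating events B = {X = v}
    B = (X == v)
    print((Y[B] * P[B]).sum(), (EY_given_X[B] * P[B]).sum())   # equal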

Note.
1. To check that something is a conditional expectation: we have to check
that it integrates the right way over the right sets [i.e., as in (*)].
2. From (*): if two things integrate the same way over all sets B ∈ B, they
have the same conditional expectation given B.
3. For notational convenience, we use E[Y |B] and E_B Y interchangeably.
4. The conditional expectation thus defined coincides with any we may have
already encountered - in regression or multivariate analysis, for example.
However, this may not be immediately obvious. The conditional expectation
defined above – via σ-fields and the Radon-Nikodym theorem – is rightly
called by Williams ([W], p.84) ‘the central definition of modern probability’.
It may take a little getting used to. As with all important but non-obvious
definitions, it proves its worth in action: see III.6 below for properties of con-
ditional expectations, and Chapter IV for stochastic processes, particularly
martingales [defined in terms of conditional expectations].

§6. Properties of Conditional Expectations.

1. B = {∅, Ω}. Here B is the smallest possible σ-field (any σ-field of subsets
of Ω contains ∅ and Ω), and represents ‘knowing nothing’.

E[Y |{∅, Ω}] = EY.

Proof. We have to check (*) of §5 for B = ∅ and B = Ω. For B = ∅ both


sides are zero; for B = Ω both sides are EY . //

2. B = F. Here B is the largest possible σ-field: ‘knowing everything’.

   E[Y |F] = Y     P-a.s.

Proof. We have to check (*) for all sets B ∈ F. The only integrand that
integrates like Y over all sets is Y itself, or a function agreeing with Y except
on a set of measure zero.
Note. When we condition on F (‘knowing everything’), we know Y (because
we know everything). There is thus no uncertainty left in Y to average out,
so taking the conditional expectation (averaging out remaining randomness)
has no effect, and leaves Y unaltered.

3. If Y is B-measurable, E[Y |B] = Y   P-a.s.
Proof. Recall that Y is always F-measurable (this is the definition of Y being
a random variable). For B ⊂ F, Y may not be B-measurable, but if it is,
the proof above applies with B in place of F.
Note. If Y is B-measurable, when we are given B (that is, when we condition
on it), we know Y . That makes Y effectively a constant, and when we take
the expectation of a constant, we get the same constant.

4. If Y is B-measurable, E[Y Z|B] = Y E[Z|B]   P-a.s.


We refer for the proof of this to [W], p.90, proof of (j).
Note. Williams calls this property ‘taking out what is known’. To remem-
ber it: if Y is B-measurable, then given B we know Y , so Y is effectively a
constant, so can be taken out through the integration signs in (*), which is
what we have to check (with Y Z in place of Y ).

5. If C ⊂ B, E[E[Y |B]|C] = E[Y |C] a.s.


Proof. E_C[E_B Y] is C-measurable, and for C ∈ C ⊂ B,

   ∫_C E_C[E_B Y] dP = ∫_C E_B Y dP     (definition of E_C, as C ∈ C)
                     = ∫_C Y dP         (definition of E_B, as C ∈ B).

So E_C[E_B Y] satisfies the defining relation for E_C Y. Being also C-measurable,
it is E_C Y (a.s.). //

5’. If C ⊂ B, E[E[Y |C]|B] = E[Y |C] a.s.


Proof. E[Y |C] is C-measurable, so B-measurable as C ⊂ B, so E[.|B] has no
effect on it, by 3.

Note. 5, 5’ are the two forms of the iterated conditional expectations property.
When conditioning on two σ-fields, one larger (finer), one smaller (coarser),
the coarser rubs out the effect of the finer, either way round. This is also
called the coarse-averaging property, or (Williams [W]) the tower property.

6. Conditional Mean Formula. E[E[Y |B]] = EY   P-a.s.


Proof. Take C = {∅, Ω} in 5 and use 1. //
Example. Check this for the bivariate normal distribution considered above.
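A simulation sketch of this check (illustrative; same hypothetical parameters as before): for the bivariate normal, E[Y |X] = µ2 + ρ(σ2/σ1)(X − µ1), and its mean should equal EY = µ2.

import numpy as np

rng = np.random.default_rng(0)

# Conditional Mean Formula for the bivariate normal: E[E[Y|X]] = E[Y] = mu2.
mu1, mu2, s1, s2, rho = 0.0, 1.0, 1.0, 2.0, 0.6
cov = [[s1**2, rho*s1*s2], [rho*s1*s2, s2**2]]
X, Y = rng.multivariate_normal([mu1, mu2], cov, size=1_000_000).T

E_Y_given_X = mu2 + rho * (s2 / s1) * (X - mu1)
print(E_Y_given_X.mean(), Y.mean())    # both ~ mu2 = 1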

Note. Compare this with the Conditional Variance Formula of Statistics: see
e.g. SMF, IV.6, or Ch. VIII.

7. Role of independence. If Y is independent of B,

E[Y |B] = E[Y ] a.s.

Proof. See [W], p.88, 90, property (k).

Note. In the elementary definition P (A|B) := P (A∩B)/P (B) (if P (B) > 0),
if A and B are independent (that is, if P (A ∩ B) = P (A).P (B)), then
P (A|B) = P (A): conditioning on something independent has no effect. One
would expect this familiar and elementary fact to hold in this more general
situation also. It does – and the proof of this rests on the proof above.

Projections. In Property 5 (tower property), take B = C:

E[E[X|C]|C] = E[X|C].

This says that the operation of taking conditional expectation given a sub-
σ-field C is idempotent – doing it twice is the same as doing it once. Also,
taking conditional expectation is a linear operation (it is defined via an in-
tegral, and integration is linear). Recall from Linear Algebra that we have
met such idempotent linear operations before. They are the projections.
(Example: (x, y, z) 7→ (x, y, 0) projects from 3-dimensional space onto the
(x, y)-plane.) This view of conditional expectation as projection is useful
and powerful; see e.g. [BK], [BF] or
[N] J. Neveu, Discrete-parameter martingales (North-Holland, 1975), I.2.
It is particularly useful when one has not yet got used to conditional expec-
tation defined measure-theoretically as above, as it gives us an alternative
(and perhaps more familiar) way to think.
