Conditional Probabilities and Expectations
Steve Cheng
March 2, 2008
Contents
1 Basic definition
2 Intuitive explanations
5 Conditional probabilities
9 Change of variable
12 Bibliography
Purpose
In this note, we give a rigorous definition of conditional probabilities and expectations, and some fundamental results about them. We assume that the reader is already familiar with the intuitive notion of conditional probability (P(A | B) for P(B) > 0). Our exposition will also, of course, depend on some measure theory, including the Lebesgue–Radon–Nikodym theorem.
The author has written this note because he still does not readily encounter introductions to conditional probability that are theoretically rigorous and yet not afraid to delve into, explain and justify the intuition behind the concepts. (Though J. Michael Steele's book referenced in the bibliography comes close, even as that author remarks that the abstract definition of conditional probability is "not easy to love; fortunately, love is not required".)
Copyright matters
Permission is granted to copy, distribute and/or modify this document under the terms of
the GNU Free Documentation License, Version 1.2 or any later version published by the
Free Software Foundation; with no Invariant Sections, with no Front-Cover Texts, and
with no Back-Cover Texts.
1 Basic definition

Let (Ω, F, P) be a probability space, let G ⊆ F be a sub-σ-algebra, and let Y be an integrable real-valued random variable. The Lebesgue–Radon–Nikodym theorem, applied to the finite signed measure A ↦ E[Y 1A] on G (which is absolutely continuous with respect to P|G), furnishes a G-measurable function g such that

    E[Y 1A] = E[g 1A] ,    for A ∈ G.    (1)

Moreover, the theorem says that the function g is actually unique up to P|G-null sets.

Definition 1.1. The conditional expectation of Y given G, denoted by E[Y | G], is defined to be any of the G-measurable functions g : Ω → ℝ that satisfy equation (1).

Though there are many candidate functions g, any two of them differ only on a set of P|G-measure zero.

As the conditional expectation can only be defined when Y is integrable (E|Y| < ∞), we will tacitly make such assumptions in our work unless stated otherwise.
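When G is generated by a finite partition, the Radon–Nikodym construction reduces to averaging Y over each cell of the partition. A minimal Python sketch of this (the fair die, the even/odd partition, and the function names are invented for illustration):

```python
from fractions import Fraction

# Finite sample space: a fair die with outcomes 0..5, each of probability 1/6.
omega = range(6)
p = {w: Fraction(1, 6) for w in omega}
Y = {w: w + 1 for w in omega}            # Y = face value (1..6)

# G is the sigma-algebra generated by the partition {even faces}, {odd faces}.
partition = [{0, 2, 4}, {1, 3, 5}]

def cond_exp(Y, partition, p):
    """E[Y | G] for G generated by a finite partition:
    on each cell A the value is E[Y 1_A] / P(A)."""
    g = {}
    for A in partition:
        pA = sum(p[w] for w in A)
        avg = sum(Y[w] * p[w] for w in A) / pA
        for w in A:
            g[w] = avg
    return g

g = cond_exp(Y, partition, p)

# Defining equation (1): E[Y 1_A] = E[g 1_A] for every cell A of the partition
# (and hence for every A in G, by additivity).
for A in partition:
    assert sum(Y[w] * p[w] for w in A) == sum(g[w] * p[w] for w in A)

print(g[0], g[1])   # averages: 3 on the even faces, 4 on the odd faces
```

Here g is constant on each cell, which is exactly what G-measurability means for a partition-generated σ-algebra.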
2 Intuitive explanations

We now want to give some intuition for the seemingly strange definition of conditional expectation just given.
for all B ∈ F₀ .
¹ The simplest case is Ω₀ = ℝ and F₀ = B_ℝ, but Ω₀ could also be ℝⁿ. Usually we will assume that F₀ contains all singleton sets; otherwise Proposition 2.1 might fail to hold with ℝ replaced by Ω₀. Also we want to be allowed to talk about seemingly simple sets like [X = x].
Since the integrals on the extreme left and right are equal on all A ∈ G, the integrands h ∘ X and g must be equal P|G-almost surely.
Definition 3.1. The symbol E[Y | X] is to mean the same thing as E[Y | σ(X)]. It is called the conditional expectation of Y given the random variable X.

There always exists a measurable function h : Ω₀ → ℝ such that h(X(ω)) = E[Y | X](ω) for almost all ω. The value h(x) is often denoted by E[Y | X = x], even though it may not always be well-defined as a single number. When we use this notation we will usually have a specific version of the function h in mind.
Equation (2) stated in the new notation becomes:

    E[Y 1[X∈B] ] = E[ E[Y | X] 1[X∈B] ] = ∫_{x∈B} E[Y | X = x] dPX ,    for all B ∈ F₀.    (3)
To summarize, the construction just given is basically the same as the first construction used for E[Y | G], only that (Ω₀, F₀, PX) replaces (Ω, G, P|G) in the original. Since they are so similar, when we discuss properties of E[Y | G] and E[Y | X = x] hereafter, we will usually state and prove the properties only for E[Y | G], as the modifications for E[Y | X = x] are trivial.
Intuitively, the function E[Y | X] answers the question: What is the average value of
Y given the values that the random variable X takes on?
Or, the operation E[· | X] can be thought of as extracting the part of a random variable that can be predicted as a function of X (a σ(X)-measurable random variable).
Note that the arguments in this section also give an easy-to-digest characterization of real-valued integrable functions² that are σ(X)-measurable for some measurable X: they are exactly those functions that can be expressed as an integrable function of X almost everywhere.
The following property is trivial but well-known in naïve probability theory (when G = σ(X)):

² This characterization is also valid in general measure theory, provided the measure involved is σ-finite.
Proposition 4.1 (The law of total probability). For any real-valued random variable Y ,

    E[ E[Y | G] ] = E[Y ] .

Proof. Take A to be the entire probability space Ω ∈ G in equation (1).
In a sense, the conditional expectation has been defined in such a way that a generalization of the law of total probability holds: equation (1) must hold on all G-measurable subsets, not just on Ω. The generalized law of total probability makes the satisfying functions unique up to sets of measure zero, whereas many other G-measurable functions Z satisfy only E[Z 1Ω] = E[Y 1Ω], including of course the constant Z = EY .
It is easy to check the linearity property of conditional expectations, as suggested by
the notation:
Proposition 4.2. For constants a, b, c R, almost surely
E[aX + bY + c | G] = a E[X | G] + b E[Y | G] + c .
Proof. The Radon-Nikodym derivative operates linearly on signed measures, and obviously E[c | G] = c.
And the comparison property:
Proposition 4.3. If Y ≥ 0, then almost surely

    E[Y | G] ≥ 0 ,

which in combination with Proposition 4.2 leads to the generalized triangle inequality:

    | E[Y | G] | ≤ E[ |Y | | G ] .

Proof. The Radon–Nikodym derivative of a positive measure is almost everywhere non-negative.
The same properties hold when G is replaced with a random variable X.
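Linearity and the comparison property are easy to observe numerically when conditioning on a discrete random variable, where the conditional expectation is just a group average. A sketch (the three-valued key and the particular functions of it are invented assumptions):

```python
import random

random.seed(0)
n = 100_000
# K takes three values and generates the conditioning sigma-algebra sigma(K);
# U is auxiliary noise.  Both are invented for this illustration.
samples = [(random.randrange(3), random.random()) for _ in range(n)]
keys = [k for k, _ in samples]

def cond_exp(values, keys):
    """Empirical E[value | key]: the average of the values within each key-group."""
    sums, counts = {}, {}
    for k, v in zip(keys, values):
        sums[k] = sums.get(k, 0.0) + v
        counts[k] = counts.get(k, 0) + 1
    means = {k: sums[k] / counts[k] for k in sums}
    return [means[k] for k in keys]

Y = [k + u for k, u in samples]          # Y >= 0 by construction
Z = [u * u for _, u in samples]

eY = cond_exp(Y, keys)
eZ = cond_exp(Z, keys)
eLin = cond_exp([2 * y + 3 * z + 1 for y, z in zip(Y, Z)], keys)

# Linearity: E[2Y + 3Z + 1 | K] = 2 E[Y|K] + 3 E[Z|K] + 1 (up to float rounding).
assert all(abs(l - (2 * a + 3 * b + 1)) < 1e-8 for l, a, b in zip(eLin, eY, eZ))
# Comparison: Y >= 0 implies E[Y | K] >= 0.
assert min(eY) >= 0.0
```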
We end this section with the tower property of conditional expectation, which is used so frequently that it would not do it justice to omit:
Proposition 4.4. If G1 and G2 are two σ-algebras with G1 ⊆ G2 , then for any random variable Y ,

    E[ E[Y | G1 ] | G2 ] = E[Y | G1 ] = E[ E[Y | G2 ] | G1 ] .
Explanation: Filtering a random variable on a coarser -algebra, then a sharper/finer
one, or vice versa, is the same as filtering just on the coarser one. Or, conditioning can
only subtract from, and not add to, the randomness of a random variable.
Proof. The first equality is immediate: E[Y | G1 ] is G1 -measurable by definition, and hence is already G2 -measurable, so re-conditioning on G2 changes nothing.
For the second equality, we must show that E[ E[Y | G2 ] | G1 ] is just another version of E[Y | G1 ], satisfying its defining equation (1). All we need to do is let B ∈ G1 ⊆ G2 and push some symbols around:

    E[ E[ E[Y | G2 ] | G1 ] 1B ] = E[ E[Y | G2 ] 1B ] = E[Y 1B ] .
Finally, if X and Y are not necessarily non-negative, write them in terms of their
positive and negative parts, and apply linearity to obtain the same conclusion.
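The tower property can be checked concretely on nested finite partitions, where conditioning is just cell averaging (the uniform twelve-point space and the coarse/fine groupings below are invented for illustration):

```python
from fractions import Fraction

# Uniform distribution on the twelve points 0..11 (an invented toy space).
omega = list(range(12))

coarse = lambda w: w // 6    # generates G1: cells {0..5}, {6..11}
fine = lambda w: w // 3      # generates G2: four cells refining G1, so G1 is in G2
Y = lambda w: w * w

def cond_exp(f, label):
    """E[f | sigma(label)] on a uniform finite space: averages over the cells."""
    cells = {}
    for w in omega:
        cells.setdefault(label(w), []).append(w)
    avg = {k: sum(Fraction(f(w)) for w in ws) / len(ws) for k, ws in cells.items()}
    return lambda w: avg[label(w)]

e2 = cond_exp(Y, fine)            # E[Y | G2]
e21 = cond_exp(e2, coarse)        # E[ E[Y | G2] | G1 ]
e1 = cond_exp(Y, coarse)          # E[Y | G1]

# Tower property: conditioning on the finer, then the coarser sigma-algebra
# agrees with conditioning on the coarser one directly.
assert all(e21(w) == e1(w) for w in omega)
```

Averaging the fine-cell averages over a coarse cell reproduces the coarse-cell average exactly, which is the discrete content of Proposition 4.4.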
5 Conditional probabilities
We have delayed the discussion of conditional probabilities, because they are defined by
a similar process as conditional expectations, and we want to avoid repeating proofs of
what are mostly the same thing. But with conditional expectations available, they are
easy to define:
Definition 5.1. The conditional probability of an event A given G is the function
P(A | G) = E[1A | G] .
Definition 5.2. The conditional probability of an event A given a random variable X : Ω → Ω₀ is the function

    P(A | X) = E[1A | X] .
Intuitively, the right-hand sides are the average value of 1A given G (or X). This
average value of the indicator function 1A is the probability of the event A occurring.
Also, given a value X(ω) = x ∈ Ω₀ , evaluating P(A | X = x) = E[1A | X = x] = E[1A | X](ω) gives the conditional probability of A occurring when we know that X = x. However, we cannot take this too literally, because if the event [X = x] occurs with probability zero, P(A | X = x) might not have a single value across all versions of the function P(A | X).
Reinterpreting equations (1) and (3), we arrive at the following defining relations:

    P(A ∩ E) = ∫_E P(A | G) dP|G ,    E ∈ G,    (4)

    P(A ∩ [X ∈ B]) = ∫_{x∈B} P(A | X = x) dPX ,    B ∈ F₀.    (5)

But these are just the Bayes rules for conditional probabilities!
Definition 5.3. For any event B with positive probability of occurrence, we also define

    P(A | B) = P(A | 1B = 1) ,
    E(Y | B) = E(Y | 1B = 1) .

Applying equation (3) shows that P(A | B) just defined agrees with the usual definition:

    P(A ∩ B) = E[1A 1B ] = ∫_{{1}} P(A | 1B ) dP_{1B} = P(A | B) P(B) ,

so that

    P(A | B) = P(A ∩ B) / P(B) .    (6)
If B is a null set, P(A | B) is not well-defined, and this is of course where the traditional
definition of P(A | B) fails.
If B is the event [X = x], the definition just made is (fortunately!) consistent with
that of P(A | X = x).
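The agreement with the usual definition can be verified on a small discrete example; here a toy two-dice space, with the events A and B chosen arbitrarily:

```python
from fractions import Fraction

# Two fair dice: 36 equally likely outcomes.
omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]
prob = Fraction(1, 36)

A = {w for w in omega if w[0] + w[1] == 7}    # the sum is 7
B = {w for w in omega if w[0] == 3}           # the first die shows 3

def P(E):
    return prob * len(E)

# Conditioning on the event that the indicator 1_B equals 1 reduces to the
# elementary formula (6): P(A | B) = P(A intersect B) / P(B).
cond = P(A & B) / P(B)
assert cond == Fraction(1, 6)
```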
For the sake of completeness, we mention the following intuitive fact, obviously true for the naïve definition of conditional probability:

Proposition 5.1. Let A be an event. Then P(A | G) = P(A) almost surely if and only if A is independent of all events in G.
Proof. For the "if" direction: for all E ∈ G,

    E[P(A) 1E ] = P(A) P(E) = P(A ∩ E) = E[1A 1E ] = E[P(A | G) 1E ] ,
so P(A) satisfies the condition for being P(A | G). The converse follows from the same
equation.
The proof of the following theorem requires a result from the next section. But rest
assured, the next section does not depend on the results here, nor is the present result
crucial to understand immediately. The material is put into this order only to make it
easier to digest, and to build intuition on how independence and conditional expectations
are related.
Corollary 5.2. E[f (Y ) | G] = E[f (Y )] for all PY -integrable functions f , if and only if Y
is a random variable independent of all the events in G.
(The random variable Y can have a codomain Ω₀ that is not necessarily ℝ.)
Proof. Note that Y being independent of all events in G means exactly that the events [Y ∈ B] are independent of all events in G.

For the "if" direction, Proposition 5.1 already proves the case f = 1B for measurable sets B ∈ F₀ , setting A = [Y ∈ B]. For arbitrary integrable f , approximate it with a sequence of linear combinations fn of indicator functions converging to it pointwise everywhere and dominated by |f | itself. Employing dominated convergence (Proposition 6.4) to take limits, we find E[f (Y ) | G] = E[f (Y )].

For the "only if" direction, simply select the particular cases f = 1B for measurable sets B ∈ F₀ , and apply the "only if" direction of Proposition 5.1.
For a little intuition on Corollary 5.2, take the example of G = σ(X) and f = identity. If Y is independent of X, then Y cannot be expressed as a function of X at all, unless it is constant.³ Thus knowing the value of X cannot possibly give additional information about Y; this is the content of the equation E[Y | X] = E[Y ], the constant average value.
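Proposition 5.1 and Corollary 5.2 predict that conditioning on an independent quantity does nothing; a quick Monte Carlo illustration (the sample size and distributions are arbitrary choices):

```python
import random

random.seed(1)
n = 200_000
X = [random.randrange(4) for _ in range(n)]
Y = [random.gauss(2.0, 1.0) for _ in range(n)]   # generated independently of X

# Empirical E[Y | X = x] for each value x; independence predicts all of these
# are close to the unconditional mean E[Y] = 2.
groups = {}
for x, y in zip(X, Y):
    groups.setdefault(x, []).append(y)
cond_means = {x: sum(ys) / len(ys) for x, ys in groups.items()}
overall = sum(Y) / n

assert all(abs(m - overall) < 0.05 for m in cond_means.values())
```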
This section is devoted to developing the analogues of the convergence theorems for normal
expectations or integrals.
Theorem 6.1. Taking conditional expectations is an L1 contraction:

    E| E[Y | G] | ≤ E|Y | ,  or equivalently,  ‖E[Y | G]‖_{L1} ≤ ‖Y ‖_{L1} .

Proof. E| E[Y | G] | ≤ E[ E[ |Y | | G] ] = E|Y | , using the generalized triangle inequality (Proposition 4.3).
Corollary 6.2. If the real-valued random variables X1 , X2 , . . . converge to X in L1 , then
E[Xn | G] converge to E[X | G] in L1 .
Proof. Apply the previous theorem to Yn = Xn − X and take n → ∞.
= E[X 1B ] − ε P(B) ,

and this is impossible unless P(B) = 0. Letting ε ↘ 0, we see that the event where E[Xn | G] does not increase to E[X | G] must have probability zero.
³ This conclusion also follows if Y is merely uncorrelated with all functions of X. For a proof, refer to the last remarks in Section 2.
Simpler proof. Apply the next result on dominated convergence, as Xn are dominated by
X. Of course, normally we cannot reduce the Dominated Convergence Theorem to the
Monotone Convergence Theorem this way, because E|X| in general might be infinite, but
in that case E[X | G] cannot be defined anyway.
Another formulation of the monotone convergence is possible that perhaps does not so
trivially reduce to dominated convergence. Suppose we do not assume that X has finite
mean, but that Z = limn E[Xn | G] has finite mean. Then we can conclude that X has
finite mean and Z = E[X | G].
Theorem 6.4 (Dominated convergence). If the random variables Xn converge to a random variable X almost surely, and the |Xn | are dominated by another random variable Z with finite expectation, then almost surely,

    lim_{n→∞} E[ |Xn − X| | G ] = 0 ,  and hence  lim_{n→∞} E[Xn | G] = E[X | G] .
Proof. Set Yn = |Xn − X|. Then almost surely, the Yn are dominated by 2Z, while the E[Yn | G] are dominated by 2 E[Z | G]. The last quantity has a finite expectation of 2 E[Z]. For any B ∈ G, we have:

    E[ lim_{n→∞} E[Yn | G] 1B ] = lim_{n→∞} E[ E[Yn | G] 1B ]    (dominated convergence on E[Yn | G] 1B )
                                = lim_{n→∞} E[Yn 1B ]
                                = 0    (dominated convergence on Yn ).

Thus the G-measurable random variable lim_{n→∞} E[Yn | G] is equal to 0 almost surely.
    ∫_E P(A | G) dP|G = P(A ∩ E) ≥ 0 ,    E ∈ G,

so that P(A | G) ≥ 0 almost surely. Next, since

    1 = P(Ω) = ∫_Ω P(Ω | G) dP|G ,

we have ∫_Ω (1 − P(Ω | G)) dP|G = 0, whence P(Ω | G) = 1 almost surely. Finally, for pairwise disjoint events A1 , A2 , . . . , since

    Σ_n ∫ P(An | G) dP|G = ∫ ( Σ_n P(An | G) ) dP|G ,

the first and final integrands are equal almost surely. (The exchange of summation and integration is allowed since the integrands are non-negative.)
The key phrase in Proposition 7.1 is "almost surely". There is a P|G-null set N where the (in)equalities may fail. The set N depends on the events A, because each conditional probability is separately constructed for each A. There may not necessarily exist a single null set N for which the (in)equalities hold everywhere else for every event A ∈ F.

On the other hand, if we identify a countable set of events A that we are interested in, then we avoid this problem. For each A there is an exceptional null set, and the countable union of all of these is again a null set; everywhere else on Ω all the relevant relations hold.
Given a random variable Y : Ω → ℝ, a natural category of events to look at are those of the form Y⁻¹( (−∞, y] ). If we restrict y ∈ Q ∪ {−∞, +∞}, there are at most countably many of these, and yet knowing only their probabilities already determines the probability distribution of Y . So the idea is to construct a version of B ↦ P([Y ∈ B] | G), which at almost all sample points in Ω becomes a probability distribution.

In our formal construction, we will also generalize to random variables that are ℝⁿ-valued.
(7)
except on a null set. For each of the countably many pairs (A, B) ∈ D × D there is such a null set. Taking their union, we obtain a null set M for which relation (7) holds everywhere except on M .
For each y ∈ Qⁿ , define

    F (y) = P( [Y ∈ Dy ] | G )(ω) ,    ω ∈ Ω \ (N ∪ M ).    (8)

For general z ∈ ℝⁿ , F is then extended by setting

    F (z) = inf_{y∈Qⁿ : zj ≤ yj} F (y) ,    ω ∈ Ω \ (N ∪ M ).    (9)

With this definition,

    F (z) = inf_{y∈ℝⁿ : z ≠ y, zj ≤ yj} F (y) ,    (10)

for z ∈ ℝⁿ \ Qⁿ .
Actually equation (10) holds for z ∈ Qⁿ as well. Firstly, because F is increasing in each variable, the infimum can be taken over only points of the form yn = z + (1/n, . . . , 1/n), for n ∈ N, with no change. And secondly, by Fatou's Lemma,

    0 ≤ ∫ ( lim inf_{n→∞} F (yn ) − F (z) ) dP|G ≤ lim inf_{n→∞} ∫ ( F (yn ) − F (z) ) dP|G = 0 ,

since the integrals on the right are differences of probabilities P([Y ∈ D_{yn}]) − P([Y ∈ Dz ]), and D_{yn} decreases to Dz .
Thus, lim inf_{n→∞} F (yn ) − F (z) = inf_n F (yn ) − F (z) = 0 for all ω except on a null set Nz . Provided we strip away these null sets too (for z ∈ Qⁿ) at the beginning, equation (10), equivalent to right-continuity of F , holds in general.
With the same sort of argument using Fatou's Lemma, we can prove that, save for a null set, F → 0 if one of the variables tends to −∞, and F → 1 if all the variables tend to +∞.
For those ω which are on the exceptional null sets, we can define F (z) arbitrarily, say as P([Y ∈ Dz ]).
Thus we now know F is a cumulative distribution function, which then has a corresponding probability measure μ_ω.

The claim is that μ_ω(B) is the conditional probability. All that remains is to show that μ_ω(B) = P([Y ∈ B] | G)(ω).

The first point is that ω ↦ μ_ω(B) should be G-measurable. This is unfortunately somewhat technical: it involves the monotone class theorem, the same sort of argument used to prove measurability in Fubini's Theorem.

Let M₀ = {B ∈ B_ℝⁿ : ω ↦ μ_ω(B) is G-measurable}. By equation (8), μ_ω(D) is measurable for all the sets D ∈ D. For finite disjoint unions and complements B of sets from D, μ_ω(B) is measurable too, because it can be obtained by addition and subtraction of the various functions μ_ω(D) for D ∈ D. And if Bn are sets in M₀ increasing or decreasing to B, then μ_ω(B) = lim_n μ_ω(Bn ) is a limit of G-measurable functions and hence is measurable. This shows M₀ is a monotone class containing the algebra generated by D; by the monotone class theorem, M₀ equals the σ-algebra generated by D, that is, B_ℝⁿ .
The rest is easy. Consider

    B ↦ ∫_E μ_ω(B) dP|G ,

which defines a positive measure, and, by definition, agrees with the measure B ↦ P([Y ∈ B] ∩ E) for B ∈ D. Since D generates B_ℝⁿ , the two measures are ultimately equal. As this is true for all E ∈ G, we have μ(B) = P([Y ∈ B] | G) as desired.
Definition 7.1. Let Y : Ω → Ω₀ be measurable, for a measurable space (Ω₀ , F₀ ). Any function μ : Ω × F₀ → [0, 1] such that

(i) for each ω ∈ Ω, μ_ω : F₀ → [0, 1] is a probability measure, and
(ii) for each B ∈ F₀ , μ_ω(B) = P([Y ∈ B] | G)(ω) for P|G-almost all ω

is called a conditional probability measure for Y given G. In general, we denote these functions by PY|G .

A conditional probability measure for Y given a random variable X is similarly defined, and denoted PY|X .
Theorem 7.2 says that PY|G (or PY|X ) exists at least if Ω₀ = ℝⁿ .

We make a brief note that it exists also if Ω₀ = ℝ^ℕ . That is, given a countable number of random variables Yn : Ω → ℝ, we can still construct PY|G for Y = (Y1 , Y2 , . . . ). This is done by the same kind of procedures used to construct sample spaces for an infinite number of random variables:
(11)
where σ, τ are ordered finite subsets of ℕ (with no repetition of members), and P_σ stands for the finite-dimensional conditional probability measures for (Y_{σ(1)} , Y_{σ(2)} , . . . , Y_{σ(|σ|)} ) constructed in Theorem 7.2. For each σ, τ, equation (11) is found to hold for every measurable E ∈ B_{ℝ^{|σ|}} except for a null set. But there are only countably many possible pairs of σ, τ, so we can obtain a single null set outside of which equation (11) holds everywhere.
Then the Kolmogorov Existence Theorem allows us to construct

    P(a1 < Y1 ≤ b1 , a2 < Y2 ≤ b2 , . . . | G)(ω)

as a probability measure for each ω.

(We cannot go much further than this, to construct conditional probability measures for an uncountable number of variables, because by taking the variables 1E for every event E, we would be able to construct P(E | G)(ω) for every event E simultaneously.)
    E[Y | G] = ∫_{y∈ℝ} y dPY|G .
Proof. The approximation theorem for measurable functions furnishes a sequence of random variables Y1+ , Y2+ , . . . , such that Yn+ ≥ 0 and Yn+ ↗ Y+ = max(0, Y ). In fact they have the explicit expression:

    Yn+ = Σ_{k=1}^{n2ⁿ−1} (k/2ⁿ) 1[Y ∈ Dn,k ] + n 1[Y ∈ Dn,∞ ] ,    Dn,k = [ k/2ⁿ , (k+1)/2ⁿ ) ,  Dn,∞ = [n, ∞) .
Then we have

    E[Yn+ | G] = Σ_{k=1}^{n2ⁿ−1} (k/2ⁿ) P(Y ∈ Dn,k | G) + n P(Y ∈ Dn,∞ | G)    (linearity of E[· | G])

              = Σ_{k=1}^{n2ⁿ−1} (k/2ⁿ) PY|G (Dn,k ) + n PY|G (Dn,∞ )

              = ∫_ℝ ( Σ_{k=1}^{n2ⁿ−1} (k/2ⁿ) 1_{Dn,k} + n 1_{Dn,∞} ) dPY|G .
The last integrand is the nth approximation to the positive part of the identity function y ↦ y.

By monotone convergence (Proposition 6.3), E[Y+ | G] may be obtained as a limit of E[Yn+ | G]. Therefore,

    E[Y+ | G] = lim_{n→∞} E[Yn+ | G] = ∫_{y∈(0,∞)} y dPY|G ,

and likewise −E[Y− | G] = ∫_{y∈(−∞,0)} y dPY|G . Hence

    E[Y | G] = E[Y+ | G] − E[Y− | G] = ∫_{y∈(−∞,∞)} y dPY|G .
Example 8.1. If X, Y are real-valued random variables with a joint probability density fX,Y , then we calculate that

    E[Y | X = x] = ∫ y ( fX,Y (x, y) / fX (x) ) dy ,    fX (x) = ∫ fX,Y (x, y) dy .

Not surprisingly, the conditional probability density (that is, the Radon–Nikodym derivative fX,Y (x, y)/fX (x)) appears to be the infinitesimal version of the elementary equation (6) for the conditional probability.
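As a concrete check of this formula, for a standard bivariate normal with correlation ρ the computation yields the well-known E[Y | X = x] = ρx; the sketch below evaluates both integrals by the trapezoid rule (the density and tolerances are this example's assumptions, not anything from the text):

```python
import math

rho = 0.6   # correlation of the assumed standard bivariate normal

def f_xy(x, y):
    """Joint density of a standard bivariate normal with correlation rho."""
    q = (x * x - 2 * rho * x * y + y * y) / (1 - rho * rho)
    return math.exp(-q / 2) / (2 * math.pi * math.sqrt(1 - rho * rho))

def cond_mean(x, lo=-10.0, hi=10.0, steps=4000):
    """E[Y | X = x] = (integral of y f(x,y) dy) / (integral of f(x,y) dy),
    both computed with the trapezoid rule; fX(x) cancels in the ratio."""
    h = (hi - lo) / steps
    num = den = 0.0
    for i in range(steps + 1):
        y = lo + i * h
        w = h * (0.5 if i in (0, steps) else 1.0)
        num += w * y * f_xy(x, y)
        den += w * f_xy(x, y)
    return num / den

# For the bivariate normal, the conditional mean is exactly rho * x.
for x in (-1.0, 0.5, 2.0):
    assert abs(cond_mean(x) - rho * x) < 1e-6
```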
9 Change of variable

For ordinary expectations, the change-of-variable formulas read:

    E[f (X)] = ∫_Ω f (X) dP = ∫_{x∈ℝ} f (x) dPX ,    E[Y ] = ∫_Ω Y dP = ∫_{y∈ℝ} y dPY .    (12)
The theorem of the previous section gives E[Y | G] = ∫ y dPY|G , which is the counterpart to the last integral in (12). The first integral in (12) has no counterpart for conditional expectations, since we do not have available a conditional measure B ↦ P(B | G)(ω) that is defined for all events B. But the second integral in (12) ought to have an analogue for conditional expectations, namely:

    E[f (X) | G] = ∫_{x∈ℝ} f (x) dPX|G .
10
The aim of this section is to rigorously generalize two facts well known from the nave
definition of conditional probability:
1. For any events A, B (with P(B) > 0), P(A ∩ B | B) = P(A | B); i.e., if we are given B, and asked to calculate conditional probabilities, then of course B happens for certain. For the same reason, P(A ∩ Bᶜ | B) = 0.
2. This is related to the first fact. Suppose f (X, Y ) is a measurable function of two
random variables X and Y , and we want to compute E[f (X, Y ) | X]. Since X is a
given in the conditional probability, in the integral calculations X may be assumed
to be constant. So, for instance (Proposition 4.5), E[XY | X] = X E[Y | X] .
In what follows, (Ω, F, P) is a probability space, G ⊆ F is a σ-algebra, and Ω₁ , Ω₂ are two other measurable spaces. Also X : Ω → Ω₁ will be G-measurable, while Y : Ω → Ω₂ will be F-measurable.
Theorem 10.1. Let ν be a version of the conditional probability measure PY|X . Then the product measure μ = δX ⊗ ν gives a version of the conditional probability measure PX,Y|G . (Here δx denotes the point-mass measure at x ∈ Ω₁ .)
Proof. Clearly, μ is a probability measure everywhere on Ω.

We prove that ω ↦ μ_ω(S) is G-measurable for every measurable S ⊆ Ω₁ × Ω₂ by appealing to the monotone class theorem (again). Taking S of the form A × B, where A ⊆ Ω₁ , B ⊆ Ω₂ are measurable, the function μ_ω(S) = δX (A) ν(B) = 1[X∈A] ν(B) is G-measurable because X and ν(B) are. And G-measurability is preserved under finite disjoint unions of sets A × B, and under increasing and decreasing limits. So it follows that μ(S) is G-measurable for every S in the product σ-algebra of Ω₁ × Ω₂ .
Proof. Given a version of PY|G , we calculate using the version of PX,Y|G that Theorem 10.1 constructs:

    E[f (X, Y ) | G] = ∫_{Ω₁×Ω₂} f dPX,Y|G    (Theorem 9.1)

                     = ∫_{Ω₁×Ω₂} f d(δX ⊗ PY|G )    (Theorem 10.1)

                     = ∫_{y∈Ω₂} f (X, y) dPY|G .
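Fact 2 above, that X may be treated as a constant inside E[· | X], can also be observed empirically: for group averages over the values of a discrete X, the identity E[XY | X] = X E[Y | X] holds exactly (the particular distributions below are invented):

```python
import random

random.seed(2)
n = 100_000
X = [random.randrange(3) for _ in range(n)]
Y = [x + random.random() for x in X]      # Y depends on X plus independent noise

def cond_exp(values, keys):
    """Empirical conditional expectation given the key: group averages."""
    s, c = {}, {}
    for k, v in zip(keys, values):
        s[k] = s.get(k, 0.0) + v
        c[k] = c.get(k, 0) + 1
    m = {k: s[k] / c[k] for k in s}
    return [m[k] for k in keys]

lhs = cond_exp([x * y for x, y in zip(X, Y)], X)     # E[XY | X]
rhs = [x * e for x, e in zip(X, cond_exp(Y, X))]     # X * E[Y | X]

# X is constant on each conditioning cell, so it factors out of the average.
assert max(abs(a - b) for a, b in zip(lhs, rhs)) < 1e-8
```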
11
V ⊥ M .

for every B ∈ G,
and E[V 1B ] = 0 by the definition of the orthogonal projection. Since the conditional
expectation is unique, we must have U = E[Y | G].
It is also not hard to give an intuitive description of M⊥ : it consists of all L² random variables, with zero mean, that are uncorrelated with every G-measurable random variable. Indeed, since 1 ∈ M, we must have E(V · 1) = E[V ] = 0 for every V ∈ M⊥ . Then E[U V ] = E[(U − EU )(V − EV )], so V ∈ M⊥ is orthogonal to U if and only if V is uncorrelated with U .
A slicker way of recognizing that U = E[Y | G] is to recall that the image of the orthogonal projection onto M can be characterized as the unique vector in M closest in norm to the pre-image:

Proposition 11.1. If Y is an L² real-valued random variable, then the best estimate of Y , in the least-squares sense, using only G-measurable functions, is E[Y | G]. That is,

    E[ (Y − X)² ] ≥ E[ (Y − E[Y | G])² ]

for all G-measurable random variables X, with equality if and only if X = E[Y | G].

(The lower bound can also be written as: E[ Var(Y | G) ] = Var(Y ) − Var( E[Y | G] ).)
Proof. The proof is completely analogous to the proof of the well-known case where X is restricted to constants. We have:

    E[ (Y − X)² ] = E(X²) − 2E(XY ) + E(Y ²)
                  = E(X²) − 2E[ X E(Y | G) ] + E[ E(Y ² | G) ]
                  = E[ X² − 2E(Y | G) X + E(Y ² | G) ] .

The outermost integrand is a quadratic in X, which is minimized when X equals the G-measurable function E[Y | G].
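Proposition 11.1 can be illustrated empirically: among functions of X, the group-mean estimate attains the smallest mean squared error (the distribution of (X, Y) and the competing estimators below are arbitrary choices):

```python
import math
import random

random.seed(3)
n = 50_000
X = [random.randrange(4) for _ in range(n)]
Y = [math.sin(x) + random.gauss(0.0, 1.0) for x in X]

# Empirical E[Y | X]: the group mean for each of the four values of X.
s, c = {}, {}
for x, y in zip(X, Y):
    s[x] = s.get(x, 0.0) + y
    c[x] = c.get(x, 0) + 1
best = {x: s[x] / c[x] for x in s}

def mse(g):
    """Empirical mean squared error E[(Y - g(X))^2]."""
    return sum((y - g[x]) ** 2 for x, y in zip(X, Y)) / n

competitor = {x: best[x] + 0.1 for x in best}    # a shifted estimate
constant = {x: sum(Y) / n for x in best}         # the plain mean EY

# The conditional mean is the least-squares best predictor among functions of X.
assert mse(best) <= mse(competitor)
assert mse(best) <= mse(constant)
```

Shifting every group mean by δ raises the mean squared error by exactly δ², the quadratic penalty visible in the proof above.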
In fact, this Hilbert-space argument can be turned around, to prove the existence of E[· | G] without recourse to the Lebesgue–Radon–Nikodym theorem!⁴
Example 11.1 (Orthonormal basis expansion of conditional expectation). Let X and Y be two L² random variables whose joint distribution is known, and suppose we want to compute the conditional expectation E[Y | X] = E[Y | σ(X)].

Recall, from linear algebra, that an orthogonal projection can be evaluated from its known action on an orthonormal basis {Zn } spanning the target subspace, here the σ(X)-measurable L² random variables:

    E[Y | σ(X)] = Σ_n ⟨Y, Zn ⟩ Zn .
⁴ Incidentally, the Lebesgue–Radon–Nikodym theorem has a nice proof using Hilbert-space methods also.
U = F (X) .
Example 11.3 (Conditional expectation for discrete random variables). The only time that the orthonormal basis in Example 11.1 can be taken to be a finite set is when X has finite range {x1 , . . . , xN }. In this case, the obvious orthonormal basis to use is Zn = 1(X = xn ) / √P(X = xn ). Then we arrive at the familiar expression:
    E[Y | X] = Σ_{n=1}^{N} ( E[Y 1(X = xn )] / P(X = xn ) ) 1(X = xn ) .
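The familiar expression can be implemented verbatim with exact rational arithmetic; the four-point sample space below is an invented toy example:

```python
from fractions import Fraction

# A four-point toy sample space with equal probabilities (invented example).
outcomes = [("a", 1), ("a", 2), ("b", 3), ("b", 5)]
probs = [Fraction(1, 4)] * 4
X = [o[0] for o in outcomes]
Y = [o[1] for o in outcomes]

# E[Y | X] = sum over n of  E[Y 1(X = x_n)] / P(X = x_n) * 1(X = x_n).
values = sorted(set(X))
num = {x: sum(p * y for p, y, xv in zip(probs, Y, X) if xv == x) for x in values}
den = {x: sum(p for p, xv in zip(probs, X) if xv == x) for x in values}
cond = [num[x] / den[x] for x in X]      # E[Y | X] evaluated at each outcome

assert cond == [Fraction(3, 2), Fraction(3, 2), 4, 4]
```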
12 Bibliography
References
[Bouleau]
[Folland]
Gerald B. Folland, Real Analysis: Modern Techniques and Their Applications, second ed. Wiley-Interscience, 1999.
[Rosenthal]
[Schmetterer]
[Steele]