
Probability Theory

Mark Cerenzia∗

Introduction
Probability theory seeks to quantify and model “randomness” in an experiment with outcomes
we interpret to be unpredictable (the prototypes are flipping a coin and rolling a die). Such an
experiment contrasts with a “deterministic” one that yields a reliable or mostly predictable outcome
(e.g., hitting a billiard ball from a given position with exactly the same force and direction).
Randomness is observed to occur in almost every area of human inquiry:
• physics (the microscopic world is modeled to be fundamentally random),
• biology (modeling the spread of a disease – both through a population and within a person’s
body – and mutations via evolution leading to diversity of lifeforms),
• chemistry (chemical reactions occur by random meeting of molecules),
• engineering (noise in transmitting messages, data compression, structures withstanding un-
predictable effects, ...),
• economics (stock price fluctuations are inherently random, risk management and insurance),
• sociology (social networks, voting models, surveys and data collection, ...),
• computer science (randomized algorithms can be the best known for certain hard problems),
• machine learning (drawing reliable conclusions about large data sets from a smaller sample),
• pure mathematics (besides probability being a theory in its own right, it offers important
intuition and methods for other areas of math, especially analysis, combinatorics, and number
theory!), ...
Despite this broad list, consistent phenomena and certain basic principles govern the appearance
of probability in all these areas.
A fascinating and arguably unique aspect of probability theory is that counterintuitive or “para-
doxical” results are accessible even at the most elementary, naive level. Here is an instructive
example (the “paradox of division”)1 :
Question: You and a friend are playing a fair game where the first to get 6 rounds wins a
box of Thomas Sweet’s meltaways. You have 5 rounds and your friend has 3 rounds when the
final bell rings before Winter break...how can you fairly divide the box?

∗ Although the choice of presentation and much of the commentary is my own, these notes draw heavily from many
sources: Gábor Székely's Paradoxes in Probability Theory and Mathematical Statistics, Ramon van Handel's ORF 309
notes, and Saeed Ghahramani's Fundamentals of Probability, with Stochastic Processes.
1. This paradox was published in 1494, but known as early as 1380; it wasn't solved correctly until 1654.

Here are some answers:
• (Pacioli, 1494) There was a total of 8 rounds played and the division should correspond to
the “proportions earned,” so give 5/8 of the pot to you and 3/8 to your friend.
• (Tartaglia, 1499-1557)2 You are only 2 rounds ahead of your friend, which is only 1/3 of the
total 6 rounds required to win, so you deserve at least 1/3 of the pot but the remainder should
be split equally. In summary, give 2/3 of the pot to you and 1/3 to your friend.
• (Pascal and Fermat, 1654) Let us imagine you were able to keep playing the game for 3 more
rounds (even if you win with 6 in the first or second). There are 8 possible such futures and
the relative frequency of future gameplay where you win is 7/8 and only 1/8 where your friend
wins. Hence, give 7/8 of the pot to you and only 1/8 to your friend.
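To see the counting behind the third answer explicitly, here is a minimal Python sketch (an illustration under the stated rules, not part of the original argument) that enumerates the 8 equally likely futures of three more rounds:

    from itertools import product

    # The 8 equally likely futures of 3 more rounds.
    # 'Y' means you win the round, 'F' means your friend does.
    futures = list(product("YF", repeat=3))

    # You (at 5 round wins) take the pot with >= 1 more win;
    # your friend (at 3 round wins) needs all 3 remaining rounds.
    you_win = sum(1 for f in futures if f.count("Y") >= 1)

    print(you_win, "/", len(futures))   # prints: 7 / 8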
So which of these different answers is correct? Our rigorous mathematical work in this course will
allow us to identify one of these answers as the correct one; namely, we will interpret the relative
frequency in the third answer as “probability of winning.” The ambiguity that can arise when
tackling such questions suggests we spend some time developing rigorous machinery for working
with models of randomness to be confident in our answers (as well as our questions!).

Interpretations of probability
Historically, it was a serious philosophical difficulty to understand how one could precisely quan-
tify or predict the unpredictable, thus hindering the development of probability theory for many
centuries. There are two main interpretations of the “probability of an event”:
• (Objectivist) it is (the limit of) the relative frequency that the event occurs when repeating
an experiment indefinitely. This is the frequentist approach.
• (Subjectivist) it is the degree of belief or confidence in the event occurring, e.g., what pro-
portion of a dollar would you pay to play a game that pays $1 if the event occurs and $0
otherwise. This relies on some experience/expertise to guess a “prior” degree of belief upon
which we can improve (say, using experimental data) to update to a “posterior” degree of
belief. This is the Bayesian approach.
Either of these needs to be rigorized somehow in order to be useful. For example, let us denote
the event as E and its “probability” by P(E). The first approach above suggests we define
P(E) := (number of outcomes where E occurs) / (number of possible outcomes).
However, let us go through history and see how famous intellects used this definition to arrive at
contrasting answers to the following
Question: When flipping two coins at once, what is the probability of at least one heads?
• (Leibniz and D'Alembert)3 There are three outcomes (0 heads, 1 heads, 2 heads), two of which
are favorable; hence,
P(E) := (number of outcomes where E occurs) / (number of possible outcomes) = 2/3.
2. Tartaglia was exceptionally brilliant: he could learn a new language in a week and was among the first to derive
the roots of cubic polynomials, which he used to win a math competition.
3. This answer by D'Alembert was published as an entry in his Encyclopédie in 1754.

• (Cardano and Galileo) There are four outcomes (TT, HT, TH, HH), three of which are favorable;
hence,
P(E) := (number of outcomes where E occurs) / (number of possible outcomes) = 3/4.
The lesson here is that a completely intuitive approach can lead even the most brilliant minds to
error. Further, the “correct” answer 3/4 requires modeling the setting of the question carefully, but
once done, we will have rules (“axioms”) so that there is no ambiguity about what P(E) is.
Even once we have the probabilistic framework and axioms, counterintuitive or “paradoxical”
results may readily still occur. The rigor we develop below, however, allows us to analyze these
examples to refine our intuition. For example, the following question of the Chevalier de Méré is what
led Pascal and his pen pal Fermat to begin a systematic study of probability in 1654. It
is considered the impetus for developing a systematic theory, eventually yielding the axiomatic
approach we take for granted today (though it was not fully developed until Kolmogorov, 1933).
Question (de Méré): In four throws of a die, it is favorable to bet on the occurrence of
at least one 6, i.e., you would win over 1/2 the time. From this, we try to draw another
favorability assessment based on an intuitive “rule of proportionality”: since the chance of
getting one 6 (namely, 1/6) is six times as much as the chance of a double 6 when rolling
two dice (namely, 1/36), then we should have the same favorability assessment for betting on
the occurrence of at least one double 6 in six times as many throws of a pair of dice, i.e., in
twenty-four throws. However, although the first scenario is favorable (win over 1/2 the time),
the second is not (win less than 1/2 the time). Why is the proportionality rule wrong here?
• (Cardano) The probability of a double 6 in a roll of a pair of dice is 1/36, so we only need
to throw the two dice 18 times to arrive at 18/36 = 1/2 chance of at least one double 6. In
particular, the event of at least one double 6 in 24 throws of a pair of dice is surely favorable.4
• (Pascal, Fermat, Newton) The probability of the first event is 1 − (5/6)^4 ≈ 0.51774691 > 1/2,
while that of the second event is 1 − (35/36)^24 ≈ 0.49140388 < 1/2. However, 25 throws is
favorable, i.e., 1 − (35/36)^25 ≈ 0.50553155 > 1/2.
• (de Moivre, 1718) Suppose you are betting on an outcome that has probability p ∈ [0, 1] of
occurring. Then we are looking for the very first value (preferably integer) x > 0 such that
1 − (1 − p)^x ≥ 1/2, or 1/2 ≥ (1 − p)^x, which admits solution x = log 2 / (− log(1 − p)). Now recall
the series expansion − log(1 − p) = p + p^2/2 + p^3/3 + · · · . Hence, if p ∈ [0, 1] is small, then we have
the approximation
x = log 2 / (− log(1 − p)) = log 2 / (p + p^2/2 + p^3/3 + · · · ) ≈ (log 2)/p.
In particular, if we take the critical number of plays to be x∗ = (log 2)/p, then the “proportionality
rule” holds; more precisely, if the probability p decreases by a factor 1/α, α > 1, to p/α, then the
critical value x∗ increases to x∗ · α = log 2/(p/α). This formalizes de Méré's “proportionality rule”
intuition as well as explains why it is faulty: it is only as good as the above approximation!5
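The reader can verify all of these numbers directly; a quick Python sketch (assuming only the standard library) checks the favorability computations and compares de Moivre's exact and approximate critical values:

    import math

    # Favorability checks from the bullets above.
    print(1 - (5/6)**4)      # ~0.5177  (> 1/2: favorable)
    print(1 - (35/36)**24)   # ~0.4914  (< 1/2: unfavorable)
    print(1 - (35/36)**25)   # ~0.5055  (> 1/2: favorable)

    # de Moivre: exact critical x versus the approximation (log 2)/p.
    for p in (1/6, 1/36):
        exact = math.log(2) / -math.log(1 - p)
        print(p, exact, math.log(2) / p)

Note the approximation overshoots slightly: for p = 1/36 the exact threshold is about 24.6, which already shows why 24 throws fall short while 25 suffice.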
4. This argument is nonsense: by his logic, one will get a double 6 with probability > 1 in n throws when n > 36.
In the rigorous language below, Cardano is not appreciating that the sample space changes with each added throw.
5. If X is the (random!) number of rolls of a die it takes you before getting a 6 and Y is the (random!) number
of rolls of a pair of dice it takes you before getting a double 6, then we will see that the “best guesses” of X and Y
are, respectively, EX = 6 and EY = 36, so indeed one expects it to take six times as many rolls for the favorable
outcome to occur (this is another instance of the “proportionality rule”).

1 Axioms of Probability
Let us revisit the relative frequency interpretation of the probability that some event, denoted A,
occurs (for concreteness, imagine the experiment is flipping two coins and A is the event of at least
one heads). Suppose we repeat this random experiment N times and let N_A be the number of times
that the event A occurs. Note that 0 ≤ N_A ≤ N and that N_A implicitly depends on N. Prior to
Kolmogorov's 1933 work, the most satisfactory definition of the probability of A was as the limit of
the relative frequencies P(A) := lim_{N→∞} N_A/N. This definition faces a number of problems:

1. There is no way to realistically perform an experiment infinitely many times, and even if we
accept an approximation, there is no systematic way to discuss the error of the approximation.

2. There is no reason to assume this limit should always exist, and even if we demand it does,
we face an issue of uniqueness (do not forget the numbers are random!).

3. If we adopt this relative frequency definition, then we are excluding the “degree of confidence”
interpretation of probability, which accounts for a lot of probabilistic statements.6

The axiomatic approach to probability theory (initiated by Kolmogorov, 1933) abstracts the key
properties that a definition like “P(A) := lim_{N→∞} N_A/N” should satisfy as axioms to serve as the basis
for investigations of probabilistic problems. Before giving these axioms, we need terminology.

Definition 1.1. A random experiment is an experiment where all possible outcomes are known but
it can be unpredictable which outcome will occur before the experiment is performed.
The sample space Ω of a random experiment is a set representing all possible outcomes.

Intuitively, any subset E of outcomes in Ω, E ⊂ Ω, can be interpreted as an event for the random
experiment. Events are often determined by some English statement whose validity (true or false)
is known after the experiment is performed. More precisely, upon realizing the experiment, some
outcome ω occurs and we say that the event E occurs if ω ∈ E and that it does not occur otherwise.
Thus, E represents the occurrence of any of the outcomes ω ∈ E. However, we emphasize a subtle
distinction that can be made here:

⋆ We distinguish “all possible outcomes” from “those events the modeler can observe” ⋆

Indeed, the relative frequency definition P(A) := lim_{N→∞} N_A/N implicitly assumes the modeler can
observe an event A in order to reliably record the number NA of times that A occurs.
This distinction is better understood through an example. Suppose Ω = {1, 2, 3, 4, 5, 6} repre-
sents the possible outcomes of a roll of a die, but an experimenter with bad eyesight can only dis-
tinguish whether a small, medium, or large “blob” occurs, corresponding to the events A1 := {1, 2},
A2 := {3, 4}, A3 := {5, 6}, respectively. Further note that the modeler can observe/distinguish
compound events, such as A := A1 ∪ A2 = {1, 2, 3, 4}. In fact, the collection of all events the
modeler can observe/distinguish is completely given by7

F := {∅, A1 , A2 , A3 , A1 ∪ A2 , A1 ∪ A3 , A2 ∪ A3 , A1 ∪ A2 ∪ A3 }
= {∅, {1, 2}, {3, 4}, {5, 6}, {1, 2, 3, 4}, {1, 2, 5, 6}, {3, 4, 5, 6}, Ω}
6. Namely, those for which one cannot easily repeat an experiment under similar conditions, such as “the probability
the gas price rises next month is 60%.”
7. Notice the empty set ∅ represents the event of nothing happening at all, which we assume the observer will record
as N_∅ = 0 for any number of trials N.

We give the collection “F” of all such events that the modeler can observe a special status, namely, as
a structure that formalizes the ways statements can be combined, which corresponds mathematically
to set operations with the events we can observe.
Definition 1.2. A collection F of subsets of Ω is called a σ-algebra if it satisfies
(a) Ω ∈ F (this is the event that any of the possible outcomes occurs);
(b) if A ∈ F, then Ac ∈ F (we write Ac := Ω \ A);
(c) if A1, A2, ... ∈ F, then ∪_{i=1}^∞ A_i ∈ F.
In summary, a σ-algebra F is a nonempty collection of events that is closed under the set operations
of taking complements and countable unions.
We tend to reserve the term event for subsets E ⊂ Ω that are in F and thus observable.
Finally, to complete our model of a random experiment, we need a quantification P(E) for the
“degree of confidence” for the occurrence of any given event E ∈ F. Although this description
sounds like we are already taking a philosophical stance on how to interpret probabilities (“subjec-
tivist”), the axioms below are in fact natural, common sense rules based on our intuition of what
properties relative frequency (“objectivist”) should satisfy.8
Definition 1.3. A probability rule or measure P on (Ω, F) is a function P on F satisfying
(a) 0 ≤ P(E) ≤ 1 for all events E ∈ F.9

(b) P(Ω) = 1;10

(c) if A, B ∈ F are disjoint, i.e., A ∩ B = ∅, then P(A ∪ B) = P(A) + P(B).11


More generally, if A1, A2, ... is a sequence of disjoint members of F, i.e., A_i ∩ A_j = ∅ for all
pairs of distinct indices i ≠ j in N, then
P(∪_{i=1}^∞ A_i) = Σ_{i=1}^∞ P(A_i).

The triple (Ω, F, P) is called a probability space, and for each E ∈ F, P(E) is the probability of E.
Example 1.1. Consider a finite sample space Ω in which all outcomes ω are observable and equally
likely. This means that

F = 2^Ω := {collection of all possible subsets of Ω}

and the probability of any event A ∈ F is given by
P(A) := |A| / |Ω|,

where |A| is the number of outcomes that make up A. Although this completely determines the
probability space (Ω, F, P), we are still left with the problem of counting |A| and |Ω| systematically.
8. Laws of large numbers refer to theorems confirming that the relative frequency interpretation of probability holds
in certain limiting senses.
9. Probability is a “degree of confidence” or a limit of relative frequencies, so it should be between 0 and 1.
10. One of the total list of possible outcomes under consideration certainly will occur.
11. In a repeated experiment, disjoint events cannot occur simultaneously, so the number of experiments where A ∪ B
occurs is equal to the sum of the number of experiments where A occurs and the number of those where B occurs.

To give a concrete example, consider the random experiment of rolling a pair of distinguishable
dice. The sample space can then be given by
Ω = {(i, j) : 1 ≤ i, j ≤ 6}.
Both the collection of events F we can observe and their likelihoods P are determined by the above.
Thus we have constructed a probability space (Ω, F, P). Consider now the event that “the sum of
the numbers is 8”. As a subset, this is given by the collection of ordered pairs
E = {(2, 6), (3, 5), (4, 4), (5, 3), (6, 2)} ⊂ Ω.
Although we cannot predict whether or not the event E occurs, it is obviously in F, and we
can compute its likelihood as
P(E) := |E| / |Ω| = 5/36.
This framework in fact dominated the early developments of probability; indeed, most probability
problems prior to the 20th century involved combinatorial considerations.
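This kind of equally likely counting is easy to mechanize. Here is a minimal Python sketch (my own illustration of the example above) that builds Ω and the event E and recovers P(E) = 5/36:

    from fractions import Fraction

    # Sample space for two distinguishable dice.
    omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]

    # Event: the sum of the numbers is 8.
    E = [(i, j) for (i, j) in omega if i + j == 8]

    print(E)                              # [(2, 6), (3, 5), (4, 4), (5, 3), (6, 2)]
    print(Fraction(len(E), len(omega)))   # 5/36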
Example 1.2. One could easily imagine natural infinite sample spaces, such as Ω = [0, +∞)
representing possible arrival times of buses at a stop or Ω = R3 representing possible positions for
the flight of a fly. In either of these cases, the collection F of events is often more complicated since
P(E) will often at least be related to the length/area/volume of E.
More precisely, when working in R, F usually involves the σ-algebra B_R generated by all the
open intervals (a, b), a < b, of R.12 In the general case of R^d, d ≥ 1, the σ-algebra B_{R^d} generated by
all open d-dimensional balls becomes relevant. In either case, elements E of the collection B_{R^d} are
referred to as Borel sets.
The previous example allows us to set up a continuous probability rule so natural it hardly
requires motivation. More precisely, we want to model the notion of selecting a point X “uniformly
at random” from a set Ω ⊂ R or Ω ⊂ R^d.
For A ⊂ R^d, let Meas(A) denote some notion of “measure” or size of the set A. When d = 1, 2, 3,
it corresponds to the length, area, and volume, respectively, of the set A. Thus, in general, Meas(A)
usually refers to the d-dimensional volume of A, but depending on the context, it could also be the
(d − 1)-dimensional surface area (if, say, A is a sphere, i.e., the boundary of a ball) in R^d.
When the reader takes measure theory some day, they will learn that the natural collection
of sets A admitting a notion of measure is exactly the σ-algebra B_{R^d} of Borel sets generated by the
open or closed balls in R^d, introduced in the previous example.
Definition 1.4. Given Ω ∈ B_{R^d}, the probability rule P formalizing the notion of choosing a point
uniformly at random in the set Ω is defined by
P(E) = Meas(Ω ∩ E) / Meas(Ω), for any E ∈ B_{R^d}.
In the special case that Ω is the d-dimensional hyperrectangle Ω = [a1, b1] × · · · × [ad, bd], we have
Meas(Ω) = ∏_{k=1}^d (b_k − a_k).
12. Here, one can work with different classes of subsets of R, such as infinite rays (−∞, x), (−∞, x], closed sets [a, b],
half-open/closed (a, b], [a, b), etc. However, the point is that any of these classes can form sets of any other of these
classes through countably many set operations, so they all generate the same σ-algebra.

For another special case, if Ω is a d-dimensional ball of radius r > 0 centered at x ∈ R^d, i.e.,
Ω = B_r(x) := {p ∈ R^d : ∥p − x∥ ≤ r}, then
Meas(Ω) = (π^k/k!) · r^{2k} if d = 2k, and Meas(Ω) = (2(4π)^k k!/(2k + 1)!) · r^{2k+1} if d = 2k + 1.

The reader can think of modeling a random position of a shot on a dartboard or the random
arrival of the cable technician between 2 p.m. and 4 p.m., choosing a random position on the Earth,
etc.. Intuitively, in each of these examples, the probability of an event can be thought of as the
quotient of the favorable length/area/volume and the total length/area/volume.
The uniform distribution might not seem exciting, but it allows us to venture into an interesting
class of examples based on elementary geometry.
Example 1.3. A circle of radius 1 is inscribed in a square with sides of length 2. A point is selected
at random from the square. What is the probability that it is inside the circle?
Let Ω = [0, 2] × [0, 2] and let D1 be the disk centered at the center (1, 1) of the square, which in
particular has radius r = 1. Then the desired probability is given by
P(D1) = Area(D1) / Area([0, 2] × [0, 2]) = π/4.
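The relative frequency interpretation suggests a simulation check: sample many points uniformly from the square and record the fraction landing in the disk. A minimal Python sketch, assuming nothing beyond the standard library:

    import random

    N = 10**6
    hits = 0
    for _ in range(N):
        # Uniform point in the square [0, 2] x [0, 2].
        x, y = random.uniform(0, 2), random.uniform(0, 2)
        # Inside the unit disk centered at (1, 1)?
        if (x - 1)**2 + (y - 1)**2 <= 1:
            hits += 1

    print(hits / N)   # ~ 0.7854 ~ pi/4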

Remark. We emphasize there is no general recipe for determining a probability rule/measure P:
it is the business of statistics or probabilistic modeling to determine P for a given model, while
probability theory simply takes P or (Ω, F, P) as given.

Basic properties and formulas


The reader can readily observe that there are many other common sense properties we expect
relative frequencies to satisfy. Fortunately, all such properties can be shown to follow from the
basic definition of probability space (Ω, F, P). We now list some of the most fundamental, and
prove them carefully for the sake of illustrating the axioms.
Proposition 1.1. For A ∈ F, write Ac := Ω \ A. Some basic relations are
(a) P(Ac ) = 1 − P(A).13
(b) If A ⊆ B,14 then P(B \ A) = P(B) − P(A), so that P(B) = P(A) + P(B\A) ≥ P(A);
(c) P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
We next provide detailed proofs of these properties here in the notes, but we emphasize such
details are not the purpose of this course.
Proof. Proof of (a): Simply note that A ∪ Ac = Ω and A ∩ Ac = ∅. Hence, by definition of a
probability rule,
1 = P(Ω) = P(A ∪ Ac ) = P(A) + P(Ac ),
so rearranging gives the result.
Proof of (b): Since A ⊂ B, we have that
B = B ∩ Ω = B ∩ (A ∪ A^c) = (A ∪ B) ∩ (A ∪ A^c) [since A ⊂ B] = A ∪ (B ∩ A^c) [distributivity] = A ∪ (B \ A),
13. A^c is the event that A does not occur.
14. The relation “A ⊆ B” means “the event A implies the event B.”

and A ∩ (B \ A) = ∅. Hence, again by definition of a probability rule, we have
P(B) = P(A ∪ (B \ A)) = P(A) + P(B \ A) ≥ P(A),
since P(B \ A) ≥ 0.
Proof of (c): Note that A ∩ B ⊂ A ∪ B, so by the previous item, we have that
P(A ∪ B) = P(A ∩ B) + P((A ∪ B) \ (A ∩ B))
= P(A ∩ B) + P((A \ (A ∩ B)) ∪ (B \ (A ∩ B)))   [distributivity]
= P(A ∩ B) + P(A \ (A ∩ B)) + P(B \ (A ∩ B))   [disjoint union]
= P(A ∩ B) + [P(A) − P(A ∩ B)] + [P(B) − P(A ∩ B)]   [part (b)]
= P(A) + P(B) − P(A ∩ B)
as required.
Alternative Proof of (c): First we note that
A ∪ (B \ (A ∩ B)) = A ∪ (B ∩ (A ∩ B)^c)
= A ∪ (B ∩ (A^c ∪ B^c))   [De Morgan's]
= (A ∪ B) ∩ (A ∪ (A^c ∪ B^c))   [distributivity]
= (A ∪ B) ∩ ((A ∪ A^c) ∪ B^c)   [associativity]
= (A ∪ B) ∩ (Ω ∪ B^c)
= (A ∪ B) ∩ Ω   [B^c ⊂ Ω]
= A ∪ B.
Since A ∩ B ⊂ B and since A and B \ (A ∩ B) are disjoint, we have
P(A ∪ B) = P(A ∪ (B \ (A ∩ B))) = P(A) + P(B \ (A ∩ B)) = P(A) + P(B) − P(A ∩ B)   [part (b)],
as required.
Remark. It is misleading to think F will consist of every subset of Ω, which is the “finest” possible
collection of events. The “coarsest” collection of events is F = {∅, Ω}.
For an intermediate and more interesting example, suppose Ω = {1, 2, 3, 4, 5, 6} represents the
possible outcomes of a roll of a die, but the experimenter has bad eyesight and can only distinguish
whether the events B1 := {1, 2, 3} and B2 := {4, 5, 6} = B1c occur. This situation can be modeled
with the σ-algebra G = {∅, B1 , B1c , Ω}. Note that even if the experimenter somehow concludes that
P(B1 ) = P(B1c ) = 1/2, this does not imply the die is fair! Perhaps after putting his glasses on to
get back to F = 2^Ω, the experimenter notices neither 2 nor 5 ever occur, so through some statistics
he concludes it is a trick die with P({i}) = 0 for i = 2, 5.
More generally, given any collection A of subsets of Ω, let σ(A) denote the smallest σ-algebra
containing A. For an important concrete example, given a (countable) partition B1, B2, . . . , Bn, . . .
of subsets of Ω, we can generate the σ-algebra G = σ(B1, B2, . . . , Bn, . . .) = {∪_{m∈I} B_m : I ⊂ N} (this
collection of unions is indexed by all subsets I of N, including I = ∅, so the collection includes ∅ as
well). In particular, if the partition involves a finite number n of sets B1, . . . , Bn, then the σ-algebra
G := σ(B1, B2, . . . , Bn) = {∪_{m∈I} B_m : I ⊂ {1, . . . , n}} will consist of 2^n elements. Intuitively, we can
think of the B_i as “atoms” that serve as the smallest building blocks of the σ-algebra.
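To see the 2^n count concretely, here is a short Python sketch (my illustration, using the three “blob” atoms of the bad-eyesight example above) that generates the σ-algebra by forming all unions of atoms:

    from itertools import combinations

    # Atoms of the partition from the bad-eyesight example.
    atoms = [frozenset({1, 2}), frozenset({3, 4}), frozenset({5, 6})]

    # The generated sigma-algebra: one union per subset of atoms
    # (the empty subset of atoms yields the empty event).
    sigma = set()
    for k in range(len(atoms) + 1):
        for combo in combinations(atoms, k):
            sigma.add(frozenset().union(*combo))

    print(len(sigma))                      # 2**3 = 8 events
    for event in sorted(sigma, key=sorted):
        print(sorted(event))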

2 Conditioning and independence
Conditional probability
Assume as usual we are working on a probability space (Ω, F, P). Suppose we know an event B ∈ F
with P(B) > 0 has occurred. How do we update our calculation of likelihoods?
To give a definition in our axiomatic framework, we return to interpreting probabilities as relative
frequency. Suppose we want to update the probability of an event A ∈ F given B has occurred,
denoted by P(A|B). As before, we perform an experiment N times, but now we plan to ignore any
outcomes where B does not occur. Accordingly, when doing this, in order to record an outcome
where A occurs, we must also record B, i.e., we are looking for outcomes where both A and B
occur. To summarize in our notation before, this gives us
P(A|B) := lim_{N→∞} N_{A∩B}/N_B = lim_{N→∞} (N_{A∩B}/N) / (N_B/N) = (lim_{N→∞} N_{A∩B}/N) / (lim_{N→∞} N_B/N) = P(A ∩ B)/P(B).

For an alternative derivation of this formula, suppose we denote the updated probability rule
given B has occurred by PB . We then know it should satisfy PB (B) = 1 and further that A occurs if
and only if A∩B occurs so we expect PB (A) to be proportional to P(A∩B), i.e., PB (A) = K·P(A∩B).
Combining these two considerations immediately gives K = 1/P(B).
Either derivation leads us to define the conditional probability rule P_B for any A ∈ F by
P_B(A) := P(A ∩ B) / P(B).

In particular, one can show (Ω, F, PB ) is also a probability space/model; more informally, con-
ditional probabilities behave just like probabilities. In practice, we write it as in the following
definition.

Definition 2.1. Given B ∈ F with P(B) > 0, define the conditional probability that A occurs given
that B occurs by
P(A|B) := P(A ∩ B) / P(B).

Example 2.1 (Two Children paradox). For the next two questions, assume that all possible pairs
of gender are equally likely.

1. I tell you I have two children, at least one a girl. What is the probability both are girls?

2. I tell you I have two children, at least one a girl. You are greeted at my front door by a girl
whom you correctly assume to be my daughter. What is the probability the other is a girl?

(Solution of 1) Let B be the event that I have at least one daughter and A be the event both
children are daughters. Then P(B) = 3/4 and P(A) = 1/4, so

P(A|B) = P(A ∩ B)/P(B) = (1/4)/(3/4) = 1/3.

(Solution of 2) The only randomness is the gender of the other, unseen child, and there is a 1/2
chance that child is a girl.

Discussion: This pair of problems constitutes a paradox because it does not seem like you are
learning anything you didn’t already know upon being greeted by a girl. However, there is a
difference between being told one is a girl (reducing the sample space to {GB, BG, GG}) versus
seeing a girl (reducing the sample space to {GB, GG}, recording the seen girl as the first child).
Another way to think about it is that the first case above concerns choosing a random family
with two children that has a girl, while the second case concerns choosing a random child from such
a family that is found to be a girl. So imagine an urn containing two children. The setting of the
first case corresponds to being told the urn has at least one girl, while the setting of the second
corresponds to sampling a child from the urn and seeing it is a girl.
If the reader is not satisfied with these explanations, try to convince yourself the distinction
between being told and seeing will be felt if you revisit these problems using Bayes’ theorem.15
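If the reader prefers empirical evidence for the told-versus-seen distinction, the following Python sketch (my own illustration) simulates both protocols, assuming girls and boys are equally likely and the greeting child is chosen uniformly:

    import random

    N = 10**6
    told_girl = told_both = seen_girl = seen_both = 0

    for _ in range(N):
        kids = [random.choice("GB"), random.choice("GB")]

        # Protocol 1: you are told "at least one is a girl".
        if "G" in kids:
            told_girl += 1
            told_both += kids.count("G") == 2

        # Protocol 2: a uniformly chosen child greets you and is a girl.
        greeter = random.randrange(2)
        if kids[greeter] == "G":
            seen_girl += 1
            seen_both += kids.count("G") == 2

    print(told_both / told_girl)   # ~ 1/3
    print(seen_both / seen_girl)   # ~ 1/2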
At first, it might seem like a pain to compute conditional probabilities; after all, you ostensibly
have to figure out the intersection and then compute two probabilities. But recall, in our notation
above, that (Ω, F, PB ) is a probability space in its own right. However, this space can be simplified
by restricting it further by replacing Ω with Ω ∩ B = B and F with F ∩ B (the σ-algebra obtained
from intersecting all events in F with B). It can often be more convenient to work directly with
the reduction of sample space (B, F ∩ B, PB ).16
Example 2.2 (Batteries). There are 10 good and 3 dead batteries. You confirm by random testing
that 4 are good and set them aside. What is the probability that the fifth tested is dead?
We simply reduce the sample space to 6 good and 3 dead, so the desired answer is 3/9 = 1/3.
There are other conceptual advantages of reduction of sample space.
Exercise: You need to choose one of your three children to do a chore, but your only source of
randomness is a fair coin. How do you use it to fairly decide which one will do the chore?

Three basic formulas of conditional probability


It is natural to consider a sequence of events A1 , . . . , An as reflecting a temporal order. Further,
reduction of sample space can often make it easy to compute the sequence of conditional probabilities

P(A2 |A1 ), P(A3 |A1 ∩ A2 ), . . . , P(An |A1 ∩ A2 ∩ · · · ∩ An−1 ).

The next result says that these probabilities are enough to compute the event they all occur.
Theorem 2.1 (Law of Multiplication17). Given events A1, . . . , An ∈ F with P(A1 ∩ · · · ∩ An−1) > 0
(so that every conditional probability below is defined), we have
P(A1 ∩ · · · ∩ An) = P(A1) · P(A2|A1) · P(A3|A1 ∩ A2) · · · P(An|A1 ∩ A2 ∩ · · · ∩ An−1).

Example 2.3. Let us return to the scenario of testing 10 good and 3 dead batteries. What is the
probability we are lucky and find the 3 dead batteries on our first 3 tests?
Let D_i, i = 1, . . . , 13, represent the events that the ith tested battery (without replacement)
is dead. Then we have
P(D1 ∩ D2 ∩ D3) = P(D1) · P(D2|D1) · P(D3|D1 ∩ D2) = (3/13) · (2/12) · (1/11) = 1/286,
15. Some refer to this as a “Bayesian analysis.”
16. Compare this space with our relative frequency derivation of conditional probability above.
17. Some texts think of intersection as a form of “multiplication” and even write AB := A ∩ B, hence the name.

where each factor has been computed as in the previous example via reduction of sample space
reasoning. This route avoids figuring out the combinatorics of these events under the original
sample space of all possible distinguishable orders of testings.18
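A quick simulation sketch in Python confirms both the exact value and its relative frequency interpretation (shuffling the batteries plays the role of testing without replacement):

    import random
    from fractions import Fraction

    # Exact value via the Law of Multiplication.
    print(Fraction(3, 13) * Fraction(2, 12) * Fraction(1, 11))   # 1/286

    # Simulation: shuffle 10 good ('G') and 3 dead ('D') batteries
    # and check whether the first three tested are all dead.
    N = 10**6
    lucky = 0
    for _ in range(N):
        batteries = list("G" * 10 + "D" * 3)
        random.shuffle(batteries)
        lucky += batteries[:3] == ["D", "D", "D"]

    print(lucky / N)   # ~ 0.0035 ~ 1/286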

Next, conditional probabilities are useful even when they are not strictly necessary. More pre-
cisely, the next result shows how to break apart an unconditional probability into (usually simpler)
conditional pieces according to a partition of Ω. Recall a collection of subsets B1, B2, ..., Bn of Ω is a
partition of Ω if Ω = ∪_{k=1}^n B_k and the B_i are pairwise disjoint, i.e., B_i ∩ B_j = ∅ for all 1 ≤ i ≠ j ≤ n.

Theorem 2.2 (Law of Total Probability). For any events A and B such that 0 < P(B) < 1,
P(A) = P(A ∩ B) + P(A ∩ B^c) = P(A|B) P(B) + P(A|B^c) P(B^c),
where the second equality uses the Law of Multiplication. More generally, let B1, B2, ..., Bn be a
partition of Ω such that P(B_i) > 0 for all i. Then
P(A) = Σ_{i=1}^n P(A ∩ B_i) = Σ_{i=1}^n P(A|B_i) P(B_i).

Finally, the previous two formulas of conditional probability lead to arguably the most important
one: Bayes' Theorem. In words, it expresses the probability of an earlier event given later ones as
a ratio of probabilities of later events given earlier ones. Accordingly, this theorem is the heart of
developing the Subjectivist interpretation of probability rigorously.

Theorem 2.3 (Bayes' Theorem). For any events A and B19 such that P(A) > 0, 0 < P(B) < 1,
P(B|A) = P(A|B)P(B) / [P(A|B)P(B) + P(A|B^c)P(B^c)].
More generally, if B1, B2, ..., Bn is a partition of Ω, each with positive probability, then
P(B_i|A) = P(A|B_i)P(B_i) / Σ_{j=1}^n P(A|B_j)P(B_j).

In words again, the last result says that the a priori probabilities (namely, P(Bi ) and P(A|Bi ))
determine the a posteriori probabilities (namely, P(Bi |A)).

Example 2.4. My wallet contains either a one dollar bill or a twenty dollar bill. I then receive a one
dollar bill as change that I put into my wallet. Later, I randomly pull out a one dollar bill. What
is the probability the other remaining bill is a single dollar?
Let B be the event the wallet originally contained a single dollar and let A be the event you
randomly pull out a dollar bill later. Then by Bayes’ formula, we have

P(B|A) = P(A|B)P(B) / [P(A|B)P(B) + P(A|B^c)P(B^c)] = (1 · 1/2) / (1 · 1/2 + 1/2 · 1/2) = 2/3.

Another classic application concerns the counterintuitive observation that there is a surprisingly
high chance of a false positive even for a reliable test.
18. Though the combinatorics here is quite straightforward.
19. Think of A as an “After” event and B or the B_i as “Before” events.

Example 2.5 (False positive for a rare disease). Suppose if you have a disease, a test gives positive
95% of the time, while if you do not have the disease, the test gives a (false) positive only 2% of
the time. But suppose the disease is rare, say, only .1% of the population has it.
If you receive a positive from the test, what is the probability you actually have the disease?
Let A be the event the test comes back positive and let B be the event you have the disease.
Then we have
P(B|A) = P(A|B)P(B) / [P(A|B)P(B) + P(A|B^c)P(B^c)] = (.95 · .001) / (.95 · .001 + .02 · .999) ≈ 0.045.
Reflect on this calculation: you only have a 4.5% chance that you have the disease even though a
reliable test came back positive! Although a false positive is rare (.02 chance), having the disease is
even more rare (.001 chance); hence, it’s more likely you get a false positive than have the disease.
For the test to be useful, the chance of a false positive must be (much) smaller than the chance of
having the disease.
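Since the arithmetic is the whole story here, a small Python sketch of the Bayes computation lets the reader vary the rates (the second call uses a hypothetical, much rarer false positive rate for comparison):

    def posterior(sensitivity, false_pos, prevalence):
        """P(disease | positive) via Bayes' theorem with partition {B, B^c}."""
        num = sensitivity * prevalence
        den = num + false_pos * (1 - prevalence)
        return num / den

    # The numbers from the example: 95% sensitivity, 2% false positives,
    # 0.1% prevalence.
    print(posterior(0.95, 0.02, 0.001))     # ~ 0.045

    # A hypothetical rarer false positive makes the test far more informative.
    print(posterior(0.95, 0.0001, 0.001))   # ~ 0.905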
Example 2.6 (Monty Hall Paradox). Question: You’re on a game show and the host Monty Hall
presents you with three doors. You are told there is a prize behind one of them but the other two
have nothing. Since it’s the 60th anniversary of the show, the host tells you that after you make a
selection of door to open, he will reveal one of the remaining two doors as empty and give you the
option to switch. Strategically, do you accept the offer to switch or does it not matter?
Answer: Here is the most elegant explanation of the answer to this problem I know. Suppose
you adopt the strategy of switching. Then you will win if you originally chose either of the two
empty doors. Hence, switching wins 2/3 of the time! If you adopt the strategy of staying, then you
can only win if you originally chose the correct door, which occurs only 1/3 of the time. Thus, you
should switch.
Exercise: Give a more rigorous solution to this problem using Bayes’ theorem so no one can
dispute the answer.
Modification: Instead of the host revealing the door, suppose he allows an audience member
to select one of the other two doors and it ends up being empty. Does this affect your answer?20
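Before attempting the exercise, the reader may enjoy simulating the two strategies; a Python sketch, assuming the standard protocol in which the host knowingly reveals an empty door:

    import random

    N = 10**5
    stay_wins = switch_wins = 0

    for _ in range(N):
        prize = random.randrange(3)
        choice = random.randrange(3)

        # Host knowingly opens an empty door you didn't pick.
        host = random.choice([d for d in range(3) if d != choice and d != prize])

        # Staying wins iff the original choice was right.
        stay_wins += (choice == prize)
        # Switching moves to the one remaining closed door.
        switched = next(d for d in range(3) if d != choice and d != host)
        switch_wins += (switched == prize)

    print(stay_wins / N)     # ~ 1/3
    print(switch_wins / N)   # ~ 2/3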

Independence
To motivate the next definition, if knowledge of B with P(B) > 0 does not update the likelihood
that an event A occurs, then this can be written as P(A|B) = P(A). Working from the definition
of conditional probability, this last relation holds if and only if P(A ∩ B) = P(A) · P(B), which
is a more symmetric relationship than the description above. More precisely, this relation should
also imply the converse situation, namely, that knowledge of A with P(A) > 0 does not update the
likelihood of the event B, i.e., P(B|A) = P(B). At least one advantage of “P(A ∩ B) = P(A) · P(B)”
over the other two relations is that it does not require P(A), P(B) > 0 and it succinctly captures
the idea that the occurrences of A and B are independent of each other.
Definition 2.2. Events A and B are called independent if

P(A ∩ B) = P(A) · P(B).


20. Some authors indicate this problem was not well-specified when popularly posed in the “Ask Marilyn” column of
1990, suggesting subjectivist vs objectivist interpretations will lead to two distinct model-specifications; see Problems
4.20/21, Crack 2021. The “Sleeping Beauty” problem more significantly suffers from being under-specified.
To be clear: for fully specified problems in probability, there will be no difference between the Bayesian vs.
Frequentist approaches.

Lemma 2.1. If A and B are independent, then so are A and B c as well as Ac and B c .

Proof. Simply note that by the law of total probability and then independence,
P(A ∩ B^c) = P(A) − P(A ∩ B) = P(A) − P(A)P(B) = P(A) · [1 − P(B)] = P(A) · P(B^c).
In turn, the independence of B^c and A^c follows from the independence of B^c and A by applying
the result just proved. This completes the proof.
Warning: A, B being mutually exclusive, i.e., disjoint, is not the same as being independent!
In fact, if A, B are mutually exclusive with P(A), P(B) > 0, then they are immediately dependent:
the occurrence of one precludes the occurrence of the other, i.e., P(A|B) = 0 = P(B|A).
Given a probability space, we can of course endeavor to test if two given events are independent
by checking whether the condition above holds or not. However, it is very common to take inde-
pendence as a modeling assumption, built into our model of the random experiment based on our
intuition.

Example 2.7. Consider the experiment of flipping two coins with bias p ∈ [0, 1]. Let Hi be the
event the ith coin is heads, i = 1, 2. We assume H1 , H2 are independent, so we have

P(H1 ∩ H2) = P(H1) · P(H2) = p^2.

By the Lemma above, we thus also have

P(H1 ∩ H2^c) = P(H1^c ∩ H2) = p(1 − p), P(H1^c ∩ H2^c) = (1 − p)^2.

Notice we would have arrived at the same results had we instead written Ω = {HH, HT, TH, TT},
F = 2^Ω with H1 = {HT, HH}, H2 = {TH, HH}, and finally we can define P by
P({HH}) := p^2, P({HT}) = P({TH}) = p(1 − p), P({TT}) := (1 − p)^2.

Notice then that


P(H1) = P(H2) = p(1 − p) + p^2 = p · [(1 − p) + p] = p.
With the probability space in hand, we can deduce the independence of H1 and H2 by a direct
check rather than assuming it:

P(H1 ∩ H2) = P({HH}) = p^2 = p · p = P(H1) · P(H2).

Example 2.8 (Jailor Paradox). Suppose three alleged criminals, Alex, Ben, and Chris, are awaiting
their fates in jail. The jailer tells them he learned that two of them will go free while one of them
will spend life in prison, and he knows who. Regardless of the results, the jailer likes Alex and
regularly chats with him. After warming the jailer up a bit, Alex asks the jailer if he could at
least tell him the name of one of the two who will go free. The jailer says legally he cannot tell
Alex whether Alex will go free or not, but still, if the jailer gives the requested info, he argues he will
increase Alex's chance of life imprisonment, from 1/3 to 1/2, which he could not do to his friend.
Question: Is the jailer’s reasoning correct? After all, we know probabilities can change given
new information...
Answer: Let us suppose the jailer provides Alex with a name of one of the free men. Let
A, B, C be the respective events that either Alex, Ben, and Chris gets life in prison, and let J be

the event the jailer tells Alex that, say, Chris goes free (the case of telling Ben goes free is the
same). Then by Bayes’ formula, we have
P(A|J) = P(J|A)P(A) / [P(J|A)P(A) + P(J|B)P(B) + P(J|C)P(C)] = (1/2 · 1/3) / (1/2 · 1/3 + 1 · 1/3 + 0 · 1/3) = 1/3 = P(A).
Hence, Alex’s chance of life imprisonment does not actually change since the event that Alex gets
life is independent of the name the jailer gives him.
Discussion: The key point is the so-called “Principle of Restricted Choice vs Free Choice”: if
Alex will go free, the jailer is restricted to give the one remaining free man; if Alex will get life, the
jailer has a choice of what name to give. Restricted choice occurs 2/3 of the time and free choice
1/3 of the time, i.e., restricted choice is twice as likely as free choice. We see this
weighting exhibited in the Bayesian analysis above by comparing P(J|B) = 1 (restricted choice)
with P(J|A) = 1/2 (free choice).
Now suppose we have three events A, B, C that are pairwise independent, i.e., knowledge of any
one does not update the likelihood of any one of the others. Then is it true that they are “all
independent” in the sense that the knowledge of the joint occurrence of any two events does not
update the likelihood of the remaining one? The next example gives a negative answer.
Example 2.9 (Paradox of Pairwise Independence of S. N. Bernstein). Continue with the setting and
notation of the coin example above but remove the bias, i.e., take p = 1/2. Let A := H1 = {HT, HH},
B := H2 = {TH, HH}, and C := {HT, TH}, i.e., C is the event of exactly one heads.
Now a calculation similar to showing A and B are independent shows the same for the pairs
(A, C), (B, C). For example, since P(C) = 1/2,

P(A ∩ C) = P({HT }) = 1/4 = 1/2 · 1/2 = P(A) · P(C).

However, the occurrence of any two of A, B, C determines the remaining: if A ∩ B = {HH} occurs,
then C cannot occur, i.e., P(A ∩ B|C) = 0; if A ∩ C = {HT } occurs, then B cannot occur; and if
B ∩ C = {T H} occurs, then A cannot occur.
Remark. It is actually not so intuitively clear that the pairs (A, C), (B, C) are independent. In
fact, when p ̸= 1/2, these events are NOT independent.
The last example leads us to modify our definition of independence for multiple events.
Definition 2.3. Events A1, . . . , An ∈ F are independent if for all distinct indices i, j1, . . . , jk in
{1, . . . , n}, 1 ≤ k < n (with P(A_{j1} ∩ · · · ∩ A_{jk}) > 0), we have
P(A_i | A_{j1} ∩ · · · ∩ A_{jk}) = P(A_i).

If A1, . . . , An ∈ F are independent as above, then by the Law of Multiplication, we have
P(A_{j1} ∩ · · · ∩ A_{jn}) = P(A_{j1}) · P(A_{j2}|A_{j1}) · · · P(A_{jn}|A_{j1} ∩ · · · ∩ A_{jn−1}) = P(A_{j1}) · · · P(A_{jn}),
where the last equality uses independence.
Thus, as before, we can update this definition into a more elegant, symmetric form that has the
added benefit of not requiring certain probability be positive.
Definition 2.4. Events A1, . . . , An ∈ F are independent if for all distinct indices j1, . . . , jk in
{1, . . . , n}, 1 ≤ k ≤ n, we have
P(A_{j1} ∩ · · · ∩ A_{jk}) = P(A_{j1}) · · · P(A_{jk}).

Exercise: How many equations does one have to check to confirm independence of n sets?

Example 2.10. Suppose we flip independently five p-coins, p ∈ [0, 1], with H_i the event the ith toss is
heads for i = 1, . . . , 5. Then, by independence,
P(4 heads, 1 tails) = P(H1 ∩ H2 ∩ H3 ∩ H4 ∩ H5^c) + P(H1 ∩ H2 ∩ H3 ∩ H4^c ∩ H5)
+ P(H1 ∩ H2 ∩ H3^c ∩ H4 ∩ H5) + P(H1 ∩ H2^c ∩ H3 ∩ H4 ∩ H5)
+ P(H1^c ∩ H2 ∩ H3 ∩ H4 ∩ H5) = 5p^4(1 − p).

Remark. Although the axiomatic approach to probabilistic models emphasized the sample space
Ω, the events F, and the probability measure P, the reader likely noticed that in many of the
above examples, we did not bother to define explicitly the probability space, instead working with
basic modeling assumptions to get answers. In a word, the probability space is suppressed and left
implicit, but it is still “under the hood.”
There is nothing wrong with this as long as we are careful! If we ever sense there is a misconception
or mistake in our reasoning, we can always resort to working out more details of the probability
model until we are satisfied. But to repeat, for the sake of performing computations, modeling
assumptions and the “rules of probability” are usually enough.

Random variables
The sample space Ω in general can be quite abstract, e.g., its elements ω can be symbols representing
certain (sometimes complicated) outcomes. However, we are often interested in more quantitative
expressions of these ω (after all, we want to do mathematics!). Accordingly, we are led to consider
real-valued functions, often written generally as X : Ω → R, ω ↦ X(ω). The simplest such function
(and ubiquitous in probability/statistics) is the following.

Definition 2.5. Given any subset A ⊂ Ω, the indicator function of A is defined by
1_A(ω) := 1 if ω ∈ A, and 0 if ω ∉ A.
Thus, 1_A(ω) “indicates” whether ω is in A or not.

Nota Bene: Let X : Ω → R, ω ↦ X(ω) be an arbitrary function. For a < b, a common
notation is
{a < X < b} := {ω ∈ Ω : a < X(ω) < b} = X^{−1}((a, b)).
Then, a common abuse of notation for an indicator of the form 1_{X^{−1}((a,b))}(ω) = 1_{(a,b)}(X(ω)) is to
simply write 1_{a<X<b}. In general, probability theory adopts the stylistic choice to suppress writing
an outcome ω whenever it is convenient to do so.
Indicator functions offer a different perspective on events E ∈ F, the main “random” objects
we have considered so far; namely, 1E (ω) tells us whether an event occurs when the outcome ω
is realized in performing an experiment. More general functions X : Ω → R, ω 7→ X(ω) offer
a quantitative way to express more complicated statements than simply “yes or no.” The only
technical condition that needs to be met is that these statements produce events whose likelihoods
we can compute!

Definition 2.6. Given a function X : Ω → R, define σ(X) to be the smallest σ-algebra containing
{a < X < b} := {ω ∈ Ω : a < X(ω) < b} = X^{−1}((a, b)), for all a, b ∈ R with a < b.21
An F-random variable is a real-valued function X : Ω → R such that σ(X) ⊂ F. To check this
condition, it is sufficient to show {a < X < b} ∈ F for all a, b ∈ R with a < b.

Example 2.11. Consider Ω = {(i, j) : 1 ≤ i, j ≤ 6}, F = 2^Ω, and each outcome is equally likely,
P({(i, j)}) = 1/36. Then (Ω, F, P) models the experiment of rolling two distinguishable dice.
Let X denote the sum of the dice, i.e., if ω = (i, j), then X(ω) = i + j. Then X is an F-random
variable with range R_X = {2, . . . , 12}. Accordingly, we can compute probabilities based on the
events it assumes certain values:
P(X = 4) = P({ω ∈ Ω : X(ω) = 4}) = P({(1, 3), (2, 2), (3, 1)}) = 3/36 = 1/12.
Finally, taking F = 2^Ω as above is a bit trivial: any function X : Ω → R whatsoever will be
an F-random variable! Intuitively, this means any function X is “observable.” In contrast, let us
imagine that the person running the experiment is wearing foggy goggles and can only distinguish
whether the outcome of each roll is in the set {1, 2, 3} or not. To model this fogged situation,
we take the same Ω and P as above; in particular, X is still a well-defined function X : Ω → R.
However, the sub σ-algebra G representing the set of events we can observe is
G = σ({1, 2, 3} × {1, 2, 3}, {1, 2, 3} × {4, 5, 6}, {4, 5, 6} × {1, 2, 3}, {4, 5, 6} × {4, 5, 6}).
In particular, X is not a G-random variable! Indeed, {X = 2} = X^{−1}({2}) = {(1, 1)} ∉ G, so
σ(X) ⊄ G. More intuitively, the value of X cannot be observed clearly through the “lens” of G.22

Remark. An important lesson of these notes is that σ-algebras are not merely a technical device
but in fact serve as an important modeling tool; indeed, we can interpret the collection σ(X) as
the information obtained from observing the function X. More precisely, σ(X) is the collection of
events one knows have occurred or not from observing the various values that X can assume.

Definition 2.7. Let (Ω, F, P) be a probability space. An F-random variable X : Ω → R is discrete
if it takes values in some countable subset R_X ⊂ R. In this case, one only needs to check that
all the events {X = x}, x ∈ R_X, are in F. Further, we define the probability mass function of X
(abbreviated “pmf”) by
f_X(x) := P(X = x) = P({ω ∈ Ω : X(ω) = x}).
Note the quantity fX (x) := P(X = x) can admit the interpretation as the fraction of repeated
experiments in which X takes the value x.
21. One can work with different classes of subsets of R, such as infinite rays (−∞, x), (−∞, x], closed sets [a, b],
half-open/closed (a, b], [a, b), etc. However, the point is that any of these classes can form sets of any other of these
classes through countably many set operations, so they all generate the same σ-algebra B_R of R, called the Borel
sets, so that σ(X) is simply all the preimages of the Borel sets.
22. Recall F is information available to the modeller of a random experiment, in addition to being the domain of
definition of P (as we have argued, these two interpretations are equivalent). Thus, any sub σ-algebra G ⊂ F admits
the interpretation as a model of restricted information, i.e., G is a foggier lens than the cleaner lens F.

We say that X is continuous if there exists a function f_X : R → [0, +∞), called the probability
density function (abbreviated “pdf”), such that
P(a < X < b) = ∫_a^b f_X(x) dx.
Notice this implies P(X = x) = 0 for all x ∈ R.


Formally, given a probability space (Ω, F, P), the distribution of a random variable X is the
probability measure P ∘ X^{−1}, which is defined on (R, B_R) by P ∘ X^{−1}(E) := P(X ∈ E) for E ∈ B_R.
However, for the purpose of this course, we will usually determine the “distribution of a random
variable X” by declaring or identifying the pmf or pdf fX (x) of X.

Example 2.12. Examples of random variables not modeled as discrete are the arrival time T ∈
[0, +∞) of a bus and the position X ∈ R^3 of a fly in a room.

General definition of independence of random variables (Optional)


For the sake of mathematical precision, here is the general definition of independence between
arbitrary collections of events.23

Definition 2.8. Suppose we are given an index set I and for each i ∈ I, 𝒜_i ⊂ F is an arbitrary
collection of events. Then we refer to the collections 𝒜_i, i ∈ I, as independent if
P(∩_{i∈J} A_i) = ∏_{i∈J} P(A_i)
for all finite subsets J of I and all A_i ∈ 𝒜_i, i ∈ J.

Although this last definition may seem overly abstract, it allows us to easily talk generally about
independence between random variables, which has a concrete interpretation we develop later.

Definition 2.9. A collection of F-random variables X_i : Ω → R, i ∈ I, is called independent if
the sub σ-algebras σ(X_i) ⊂ F, i ∈ I, are independent.

23. In the next section we will introduce the notion in terms of joint mass/density functions of random variables;
namely, X and Y will be independent if f_{(X,Y)}(x, y) = f_X(x) f_Y(y).

3 Discrete Random Variables
First, writing out (Ω, F, P) formally is not always so straightforward, but this need not stop us from
working with examples. Indeed, we can suppress the sample space (Ω, F, P) in favor of prescribing
modeling assumptions and using probability rules. Often, we will be content to merely describe
the experiment, relevant random variables, their dependence or independence, their distributions,
etc. This point will become more and more important as the models become more complicated.
The following example provides an illustration of this point.

Example 3.1. Suppose now we repeatedly flip a p-coin, p ∈ [0, 1]. Let Hi be the event the ith toss
is heads. We are implicitly assuming the Hi are independent events24 . Let X give the number of
the first flip that is heads. Then we can compute some probabilities of events determined by X:

1. We have for any n ∈ N, by independence,
P(X = n) = P(H1^c ∩ · · · ∩ H_{n−1}^c ∩ H_n) = P(H1^c) · · · P(H_{n−1}^c) P(H_n) = (1 − p)^{n−1} · p.

2. We have for any n ∈ N, by independence,
P(X > n) = P(H1^c ∩ · · · ∩ H_n^c) = P(H1^c) · · · P(H_n^c) = (1 − p)^n.

3. What about the probability that we get heads for the first time at the latest on the nth flip?
By the previous item, this is simply P(X ≤ n) = 1 − P(X > n) = 1 − (1 − p)^n.
But we can also compute this as P(X ≤ n) = Σ_{k=1}^n P(X = k) = Σ_{k=1}^n (1 − p)^{k−1} p.
Hence, we reproduce the geometric sum identity 1 − (1 − p)^n = Σ_{k=1}^n (1 − p)^{k−1} p.
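A Python sketch (my illustration) comparing the empirical relative frequencies of the first heads with the formula P(X = n) = (1 − p)^{n−1} p:

    import random

    p, N = 0.3, 10**6
    counts = {}
    for _ in range(N):
        n = 1
        while random.random() >= p:   # flip until the first heads
            n += 1
        counts[n] = counts.get(n, 0) + 1

    for n in range(1, 6):
        print(n, counts[n] / N, (1 - p)**(n - 1) * p)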

Discrete distributions and independence


Keeping in line with suppressing the probability space (Ω, F, P), it is often not necessary to write
out or even know what X(ω) is for all ω ∈ Ω. Rather, we just need to know all the quantities
P(X = x) for x ∈ R_X.

Definition 3.1. Let X be a discrete random variable taking values in R_X. Then the distribution
of X is the vector P_X = (f_X(x))_{x∈R_X}, i.e., P_X({x}) = f_X(x) = P(X = x) for x ∈ R_X and
Σ_{x∈R_X} P_X({x}) = 1. If X and Y have the same distribution, i.e., P_X = P_Y, then we write X ∼ Y or
X =^d Y (read “X is equal to Y in distribution”) and say “X, Y are identically distributed.”

WARNING: Let us emphasize: different random variables may have the same distribution!
For example, suppose X is the number of heads in two flips of one fair coin and Y is the number
of heads in two flips of some other fair coin. Then PX = PY , but X(ω) need not be equal to Y (ω).
Here, we could take our sample space to be

Ω = {ω = (ω1, ω2) : ω_i ∈ {HH, HT, TH, TT} for i = 1, 2},

where X(ω) only depends on ω1 and Y (ω) only depends on ω2 . This remark also points out that
when describing random variables, we need a sample space “rich enough” to support them.
24. Although we technically only defined independence for finitely many events, the extension to arbitrarily many
events is fine as long as we are checking the condition on finitely many at a time.

Example 3.2. 1. Bernoulli distribution X ∼ Bernoulli(p). A biased p-coin is flipped and X
indicates whether it is heads or not by respectively assuming the value 1 or 0. Then
f_X(k) = P(X = k) = p if k = 1, and 1 − p if k = 0.

2. Binomial distribution X ∼ Bin(n, p). A biased p-coin is flipped n times and X is the number
of heads that appear. Then
f_X(k) = P(X = k) = (n choose k) p^k (1 − p)^{n−k}, k = 0, 1, . . . , n.

Alternatively, X can be treated as the sum of n independent Bernoulli variables:

X = Y1 + · · · + Yn where Y_i ∼ Bernoulli(p)

3. Geometric distribution X ∼ Geometric(p). A random variable X has the Geometric distribution
with parameter p ∈ [0, 1] if
f_X(k) = p(1 − p)^{k−1}, k = 1, 2, ...

4. Poisson distribution X ∼ Pois(λ). If a random variable takes values in the set {0, 1, 2, ...}
with mass function
f_X(k) = e^{−λ} λ^k / k!, k = 0, 1, 2, ...,
where λ > 0, then X is said to have Poisson distribution with parameter λ.
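Each of these mass functions is easy to tabulate; the following Python sketch (my addition) prints a sanity check that each pmf sums to (approximately) 1 over a suitable range of values:

    from math import comb, exp, factorial

    def bernoulli(p):
        return {1: p, 0: 1 - p}

    def binomial(n, p):
        return {k: comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}

    def geometric(p, K=60):
        return {k: p * (1 - p)**(k - 1) for k in range(1, K + 1)}

    def poisson(lam, K=60):
        return {k: exp(-lam) * lam**k / factorial(k) for k in range(K + 1)}

    for name, pmf in [("Bernoulli(0.3)", bernoulli(0.3)),
                      ("Bin(10, 0.3)", binomial(10, 0.3)),
                      ("Geometric(0.3)", geometric(0.3)),
                      ("Pois(2)", poisson(2))]:
        # Masses should sum to ~1; the geometric and Poisson tails
        # beyond K are negligible for these parameters.
        print(name, round(sum(pmf.values()), 6))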

Discrete statistics
Recall the quantity f_X(x) := P(X = x) admits the interpretation as the relative frequency that X
assumes the value x upon repeating an experiment. More precisely, suppose we repeat an experiment
N times and observe X take the values x1, . . . , xN in R_X, the range of X. Then we have
f_X(x) = P(X = x) = lim_{N→∞} (1/N) Σ_{k=1}^N 1_{{x_k}}(x).   (1)

However, these quantities do not tell us the typical size of X. In a repeated experiment, this
role should be served by the average value we observe.

Definition 3.2. The expectation or mean of a random variable X is defined by25
E[X] = Σ_{x∈R_X} x f_X(x) = Σ_{x∈R_X} x P(X = x),
where implicitly the sum is over the values X can assume. More generally, for a function g : R → R,
E[g(X)] = Σ_{x∈R_X} g(x) f_X(x).
25. Whenever this sum is absolutely convergent.

To see why this last formula holds, suppose we repeat an experiment N times and observe X
take the values x1, . . . , xN in R_X, the range of X. Then the average of the values g(x_k) is given by
(1/N) Σ_{k=1}^N g(x_k) = (1/N) Σ_{k=1}^N Σ_{x∈R_X} g(x) 1_{{x_k}}(x) = Σ_{x∈R_X} g(x) · ((1/N) Σ_{k=1}^N 1_{{x_k}}(x)) → Σ_{x∈R_X} g(x) · P(X = x) as N → ∞,
where the convergence follows from (1). Hence, our definition of expectation is completely consistent
with our intuition based on the relative frequency interpretation of probabilities.26
Example 3.3. The simplest example is given by X = 1_E for some E ⊂ Ω. Then we have
E[X] = 1 · P(E) + 0 · (1 − P(E)) = P(E),
as long as E ∈ F (so that “P(E)” is defined). Put another way, X = 1_E is an F-random variable
if and only if E ∈ F.
Remark. So far in this section, we assumed our discrete random variables to assume numerical
quantities. However, it is perfectly reasonable to work with functions X that assume values in a more
general space D. Such functions X : Ω → D are referred to as random elements. Unfortunately,
statistics like the expectation of X may then no longer make sense.
Fortunately, if say D is discrete, the definition above tells us how to work with the random
variable Y = ϕ(X) for general functions ϕ : D → R.
As an example, suppose X gives the suit of a randomly drawn card from a normal deck, i.e.,
takes values in D = {♣, ♢, ♠, ♡}. Suppose we draw a card and get paid ϕ(♣) = −1, ϕ(♢) = 0,
ϕ(♡) = 1, and ϕ(♠) = 7. Although “E[X]” makes no sense, your expected earnings are given by
E[ϕ(X)] = (1/4) · (−1 + 0 + 1 + 7) = 1.75.
Remark. Although expectation (and other statistics) of a random variable X seems to require
knowledge of the distribution PX of X, it is sometimes hard to compute the numbers P(X = x) for
all x ∈ RX (see the sums Pn or Hn in the birthday problems below, where the combinatorics would
otherwise get quite difficult).
However, if X is determined by certain modeling assumptions that allow one to break its expec-
tation apart in terms of simpler random variables, it can often be easier to compute E[X] before even
knowing the distribution of X explicitly! This crucial point of probability and statistics (namely,
that one actually uses values like the expectation and variance to get a handle on the distribution
rather than the other way around) will be explored immediately in the next subsection.

Random vectors and independence of random variables


If X, Y : Ω → R are F-random variables, then the F-random vector Z := (X, Y ) takes values in D = R² and, by a remark above, we know that we can define
$$E[\varphi(X,Y)] = E[\varphi(Z)] = \sum_{z\in R_Z} \varphi(z)\cdot P(Z = z) = \sum_{(x,y)\in R_X\times R_Y} \varphi(x,y)\cdot P(X = x, Y = y),$$
for any ϕ : D = R² → R. Here, {X = x, Y = y} := {X = x} ∩ {Y = y}.


WARNING: It is not enough to know the distributions PX and PY separately to compute statistics like the last quantity; one must have access to all the numbers P(X = x, Y = y).
²⁶As mentioned before, this point will be made even more precise with the law of large numbers.

Definition 3.3. Given discrete random variables X, Y : Ω → R, their joint distribution is determined by the list
$$P_{X,Y} = \big(f_{(X,Y)}(x,y)\big)_{x\in R_X,\, y\in R_Y},$$
where the joint probability mass function of X and Y is given by
$$f_{(X,Y)}(x,y) := P(X = x, Y = y).$$
Notice that since Ω = ∪x∈RX {X = x} = ∪y∈RY {Y = y}, by the law of total probability we have
$$P(X = x) = \sum_{y\in R_Y} P(X = x, Y = y), \qquad P(Y = y) = \sum_{x\in R_X} P(X = x, Y = y).$$
This immediately gives us linearity of expectation: taking ϕ(X, Y ) = X + Y and splitting the sum \(\sum_{x,y}(x+y)f_{(X,Y)}(x,y)\) into its two marginal pieces, we have
$$E[X + Y] = E[X] + E[Y].$$
This innocuous property yields many interesting results.

Example 3.4. In flipping a p-coin n times, what is the expected number of heads?
Let X be the number of heads in n flips. Writing Hi for the event that the ith flip shows heads, we can write X = 1H1 + · · · + 1Hn . Since E[1Hi ] = P(Hi ) = p, by linearity of expectation we have
$$E[X] = E[1_{H_1} + \cdots + 1_{H_n}] = E[1_{H_1}] + \cdots + E[1_{H_n}] = n \cdot p.$$
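A short simulation (a sketch assuming the hypothetical values n = 20 and p = 0.3) confirms the formula:

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, trials = 20, 0.3, 100_000          # hypothetical parameters
    flips = rng.random((trials, n)) < p      # each row is n flips of a p-coin
    print(flips.sum(axis=1).mean(), n * p)   # empirical mean vs. n*p = 6.0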

We emphasize that we did not use independence of the events Hi in the previous example. To drive this point home, let us do a more interesting example.

Example 3.5. There is a party with n people. How many pairs of people have the same birthday?
Let Xi denote the birthday of the ith person, 1 ≤ i ≤ n. For 1 ≤ i ̸= j ≤ n, we reason that²⁷
$$P(X_i = X_j) = \sum_{d\in D} P(X_i = d, X_j = d) = \sum_{d\in D} P(X_i = d)\, P(X_j = d) = \sum_{d\in D} \frac{1}{365}\cdot\frac{1}{365} = \frac{1}{365}.$$
Notice here we did not even have to know where the random variables/elements take their values (either D = {1, . . . , 365} or the set D of dates of the form “September 15” is just fine); rather, we only needed to know the joint distributions P(Xi = d, Xj = d) and that |D| = 365.
Although we used independence above, the events of matching pairs are not independent (for distinct i, j, k, the joint occurrence of A = {Xi = Xj } and B = {Xj = Xk } implies C = {Xi = Xk } occurs). Nevertheless, since we can write the random number of matching pairs as $P_n = \sum_{1\le i<j\le n} 1_{\{X_i = X_j\}}$, we have
$$E[P_n] = \sum_{1\le i<j\le n} E[1_{\{X_i = X_j\}}] = \sum_{1\le i<j\le n} P(X_i = X_j) = \binom{n}{2}\cdot\frac{1}{365} = \frac{n(n-1)}{2}\cdot\frac{1}{365}.$$

Finally, for fun, suppose next that the host of this party with n people asks everyone to state
their birthday. She then asks anyone who has the same birthday as another to raise a hand. How
many hands do you expect to see raised?
²⁷Alternatively, given the birthday of one person, there is a 1/365 chance the other person matches.

The answer is not twice the one above! Let Hn be the random number of hands raised. Then this can be written
$$H_n = \sum_{i=1}^n 1_{\{\exists k\neq i \,:\, X_i = X_k\}}.$$
Each of these indicators has expectation given by
$$P(\exists k\neq i : X_i = X_k) = 1 - P(\forall k\neq i,\ X_i \neq X_k) = 1 - \left(\frac{364}{365}\right)^{n-1}.$$
Hence, we conclude by linearity of expectation that the desired answer is
$$E[H_n] = n\cdot\left(1 - \left(\frac{364}{365}\right)^{n-1}\right).$$
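Both formulas are easy to check by simulation; the following sketch (with n = 30 and 20,000 simulated parties, my own choices) estimates E[Pn ] and E[Hn ]:

    import numpy as np

    rng = np.random.default_rng(0)
    n, trials = 30, 20_000
    bdays = rng.integers(0, 365, size=(trials, n))        # uniform random birthdays

    pairs, hands = [], []
    for b in bdays:
        eq = b[:, None] == b[None, :]                     # matrix of matches
        pairs.append((eq.sum() - n) // 2)                 # off-diagonal, each pair once
        counts = np.bincount(b, minlength=365)
        hands.append((counts[b] >= 2).sum())              # people whose birthday repeats
    print(np.mean(pairs), n * (n - 1) / 2 / 365)          # E[P_n] ~ 1.19
    print(np.mean(hands), n * (1 - (364/365) ** (n-1)))   # E[H_n] ~ 2.29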

One case where it is enough to know the distributions PX and PY is when X and Y are inde-
pendent, which occurred naturally as an implicit modeling assumption in some examples above.

Definition 3.4. We say that discrete random variables/elements X and Y are independent if the
events {X = x} and {Y = y} are independent for all x ∈ RX and y ∈ RY , i.e.,

f(X,Y ) (x, y) = P(X = x, Y = y) = P(X = x) · P(Y = y) = fX (x) · fY (y), x ∈ RX , y ∈ RY .

The phrase “X, Y are independent and identically distributed” is abbreviated as “X, Y are i.i.d.”

Theorem 3.1. Suppose X and Y are independent random variables/elements. Then for any
g : RX → R, h : RY → R, the random variables g(X) and h(Y ) are independent too, and further

E[g(X) · h(Y )] = E[g(X)] · E[h(Y )].
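The product rule in Theorem 3.1 is also easy to see numerically; here is a minimal sketch (an independent die/coin pair and test functions of my own choosing):

    import numpy as np

    rng = np.random.default_rng(0)
    N = 200_000
    X = rng.integers(1, 7, size=N)      # independent die rolls ...
    Y = rng.integers(0, 2, size=N)      # ... and coin flips
    g = lambda x: x ** 2
    h = lambda y: 3 * y - 1
    print(np.mean(g(X) * h(Y)), np.mean(g(X)) * np.mean(h(Y)))  # nearly equal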

4 Conditional expectation
4.1 Elementary Conditional Expectation
Let us first assume X : Ω → R is a discrete random variable (so the set of values it can assume is exhausted by a countable set RX ). Then recall the expectation of any statistic or transformation g : R → R of this variable is given by
$$E[g(X)] = \sum_x g(x)\cdot P(X = x) = \sum_x g(x)\, f_X(x),$$
where fX (x) is the probability mass function of X. Intuitively, this weighted average is our best guess of g(X) given no other information.
If we learn something new, say, that an event B has occurred, then we can update this guess by instead computing with respect to the conditional measure PB introduced before; namely, the (elementary) conditional expectation E[g(X)|B] of g(X) given B ∈ F is defined as
$$E[g(X)\,|\,B] := E_B[g(X)] = \sum_x g(x)\, P_B(X = x) = \sum_x g(x)\, P(X = x\,|\,B). \tag{2}$$

Example 4.1. A fair die is rolled twice. What is the conditional expectation of the sum of outcomes given the first roll shows 1?
Let Z be the sum of the outcomes and let B be the event the first roll shows 1. Notice that although Z ∈ {2, . . . , 12}, we have P(Z = k|B) = 0 if k > 7. We thus compute for every k ∈ {2, . . . , 7},
$$P(Z = k\,|\,B) = \frac{P(Z = k, B)}{P(B)} = \frac{1/36}{1/6} = \frac{1}{6}.$$
Hence, we have
$$E[Z\,|\,B] = \sum_{k=2}^7 k\cdot P(Z = k\,|\,B) = \frac{1}{6}\sum_{k=2}^7 k = \frac{9}{2} = 4.5.$$
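A quick simulation (a sketch; the 500,000-trial sample size is an arbitrary choice) estimates E[Z|B] by averaging Z only over those trials where B occurs:

    import numpy as np

    rng = np.random.default_rng(0)
    rolls = rng.integers(1, 7, size=(500_000, 2))   # two fair dice per trial
    Z = rolls.sum(axis=1)
    B = rolls[:, 0] == 1                            # first roll shows 1
    print(Z[B].mean())                              # ~4.5 = E[Z|B]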

Next, we derive an important alternative form of the conditional expectation. If z ̸= 0 is such that the set {g(X) = z} ∩ B is nonempty, then {g(X) = z} ∩ B = {g(X)1B = z}, so that
$$z\, P(\{g(X) = z\}\cap B) = z\, P(g(X)1_B = z);$$
otherwise, if z = 0 or the set is empty, we have z P({g(X) = z} ∩ B) = 0. These considerations give
$$E[g(X)\,|\,B] = \sum_z z\, P(g(X) = z\,|\,B) = \sum_z z\,\frac{P(\{g(X)=z\}\cap B)}{P(B)} = \sum_z z\,\frac{P(g(X)1_B = z)}{P(B)} = \frac{E[g(X)1_B]}{P(B)}. \tag{3}$$
In words, the resulting expression is the (normalized) average value of g(X) over the set B. Although these manipulations relied on X being discrete, the expression E[g(X)|B] = E[g(X)1B ]/P(B) makes sense for any random variable X, so we take it as the definition of conditional expectation given B for more general random variables.
Definition 4.1. Given an event B with P(B) > 0 and a random variable X (discrete, continuous, mixed, etc.) with E|g(X)| < ∞, the conditional expectation of g(X) given B is defined by
$$E[g(X)\,|\,B] := \frac{E[g(X)1_B]}{P(B)}.$$

As a special case of the above, if we take B = {Y = y} for some discrete random variable Y with P(Y = y) > 0, then (3) becomes
$$E[g(X)\,|\,Y = y] = \frac{E[g(X)1_{\{Y=y\}}]}{P(Y = y)} = \sum_x g(x)\,\frac{P(X = x, Y = y)}{P(Y = y)} = \sum_x g(x)\,\frac{f_{(X,Y)}(x,y)}{f_Y(y)}. \tag{4}$$

Remark. For the relative frequency intuition: in repeating an experiment, E[g(X)|Y = y] is the average value of the realizations of g(X), ignoring those outcomes for which Y = y does not occur. That is, if (x1 , y1 ), . . . , (xN , yN ) occur, we write B = {Y = y} and let NB := #{k : yk = y}, then
$$\frac{1}{N_B}\sum_{k=1}^N g(x_k)\,1_{\{y_k = y\}} = \frac{N}{N_B}\sum_{x\in R_X} g(x)\left(\frac{1}{N}\sum_{k=1}^N 1_{\{(x_k,y_k)\}}(x,y)\right) \xrightarrow{N\to\infty} \frac{1}{P(Y = y)}\sum_{x\in R_X} g(x)\, P(X = x, Y = y),$$
which is exactly (4).


Definition 4.2. Given discrete random variables X : Ω → RX , Y : Ω → RY , the conditional distribution of X given Y = y is determined by the list
$$P_{X|Y=y} = \big(f_{X|Y}(x|y)\big)_{x\in R_X},$$
where the conditional mass function of X given Y = y is defined by
$$f_{X|Y}(x|y) := \frac{f_{(X,Y)}(x,y)}{f_Y(y)} = P(X = x\,|\,Y = y).$$
Remark. Certain modeling assumptions such as independence can be very powerful for solving problems and computation. Likewise, declaring conditional distributions serves as another common component of the modeling assumptions.
For example, assume P is a random variable taking values in [0, 1], say, uniformly or beta distributed. Then given a realization P = p, we flip a p-coin n times and record the number of heads X. This determines the conditional distribution of X given P = p as Binomial(n, p).
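To illustrate, here is a sketch of this two-stage model in Python (the choices Beta(2, 5) and n = 10 are hypothetical); note how the unconditional mean of X comes out as n · E[P ], a fact justified by the law of total expectation below:

    import numpy as np

    rng = np.random.default_rng(0)
    n, trials = 10, 200_000
    P = rng.beta(2.0, 5.0, size=trials)        # random success probability
    X = rng.binomial(n, P)                     # X | P = p ~ Binomial(n, p)
    print(X.mean(), n * 2.0 / (2.0 + 5.0))     # ~ n * E[P] = 10 * 2/7 ≈ 2.857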

4.2 Conditional expectation given a random variable


Now suppose more generally we want to update our computation of likelihoods and statistics given
an observation of some (discrete) random variable Y . Then the update will depend on the realization
of Y , i.e., it will depend on the value y of Y that we observe. But once we observe the event Y = y,
we naturally expect E[g(X)|Y = y] to be our best guess. Hence, we define the conditional expectation
of g(X) given Y to be the following function on Ω:
$$E[g(X)\,|\,Y](\omega) := E[g(X)\,|\,Y = y], \quad \text{if } Y(\omega) = y,$$
or more explicitly,
$$E[g(X)\,|\,Y](\omega) := \sum_{y\in R_Y} E[g(X)\,|\,Y = y]\cdot 1_{\{y\}}(Y(\omega)). \tag{5}$$

In words, the conditional expectation is a list or array of elementary conditional expectations.


We can arrive at this same object in a slightly alternative manner, though it leads to some
confusing expressions. Let ψ(y) := E[g(X)|Y = y] (defined implicitly for y such that P(Y = y) > 0).
Then we define the random variable
$$E[g(X)\,|\,Y](\omega) := \psi(Y(\omega)) \tag{6}$$

to be the conditional expectation of g(X) given Y . Notice in particular that E[g(X)|Y ] is a deter-
ministic function of the random variable Y .
It is easy to see the two definitions (5), (6) are actually the same. Indeed, suppose we take (6) as our definition of E[g(X)|Y ]. The events By := {Y = y}, y ∈ RY , form a partition of Ω so that $1_\Omega = \sum_{y\in R_Y} 1_{B_y}$. Then we have
$$\psi(Y) = E[g(X)|Y]\cdot 1_\Omega = E[g(X)|Y]\cdot\!\! \sum_{y\in R_Y} 1_{B_y} = \sum_{y\in R_Y} E[g(X)|Y]\, 1_{B_y} = \sum_{y\in R_Y} E[g(X)|Y = y]\, 1_{B_y}, \tag{7}$$
which is precisely the definition (5) since 1By = 1{y} (Y ).


Example 4.2. Let X be the outcome of a die roll and let Y = 1{1,2,3} indicate whether it is less
than or equal to 3 or not. Then we have
$$E[X|Y] = E[X|Y = 1]\, 1_{\{Y=1\}} + E[X|Y = 0]\, 1_{\{Y=0\}} = 2\cdot 1_{\{1,2,3\}} + 5\cdot 1_{\{4,5,6\}}.$$
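We can reproduce this random variable empirically; the following sketch (simulation sizes are my own choice) estimates the two subaverages and assembles E[X|Y ] as a function of Y :

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.integers(1, 7, size=500_000)              # die rolls
    Y = (X <= 3).astype(int)                          # Y = indicator of {1,2,3}
    cond = {1: X[Y == 1].mean(), 0: X[Y == 0].mean()}
    print(cond)                                       # ~{1: 2.0, 0: 5.0}
    E_X_given_Y = np.where(Y == 1, cond[1], cond[0])  # the random variable E[X|Y]
    print(E_X_given_Y[:5], Y[:5])                     # its value depends on omega through Y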
WARNING: Do not write ψ(Y ) = E[g(X)|Y = Y ] unless you want to confuse yourself. Doing so makes it seem like we are conditioning on the event {Y = Y } = Ω and thus aren’t really conditioning at all. We have introduced the random variable E[X|Y ] so as to emphasize that
$$E[g(X)\,|\,Y](\omega) = \psi(Y(\omega)) = E[g(X)\,|\,Y = Y(\omega)] = \sum_{x\in R_X} g(x)\,\frac{f_{(X,Y)}(x, Y(\omega))}{f_Y(Y(\omega))}, \tag{8}$$
which is well defined.


Remark. When X, Y are continuous random variables with densities fX , fY , respectively, and joint density function f(X,Y ) , the expression (8) naturally leads us to the definition
$$E[g(X)\,|\,Y](\omega) := \int_{-\infty}^\infty g(x)\,\frac{f_{(X,Y)}(x, Y(\omega))}{f_Y(Y(\omega))}\, dx.$$
In either the discrete or continuous case, the quantity
$$f_{X|Y}(x|y) := \frac{f_{(X,Y)}(x,y)}{f_Y(y)}$$
is the conditional mass or density of X given Y = y, and we have
$$E[g(X)\,|\,Y](\omega) = \begin{cases} \displaystyle\sum_{x\in R_X} g(x)\, f_{X|Y}(x|Y(\omega)), & \text{discrete case} \\[2ex] \displaystyle\int_{-\infty}^\infty g(x)\, f_{X|Y}(x|Y(\omega))\, dx, & \text{continuous case.} \end{cases}$$
Theorem 4.1 (Law of Total Expectation). We have
$$E[X] = E\big[E[X|Y]\big] = \sum_{y\in R_Y} E[X\,|\,Y = y]\cdot P(Y = y).$$

In words,
⋆ the average of X is equal to an average of its sub-averages given Y ⋆
The law of total expectation is a very useful computational fact, as it is usually much easier to compute the subaverages E[X|Y ] first. This law will be proven as a consequence of Proposition 4.1 below, but for now we give an application after a quick definition. Since P(A) = E[1A ], we naturally define conditional probability by
$$P(A\,|\,Y) := E[1_A\,|\,Y].$$
Being a conditional expectation, this object also satisfies the law of total expectation.
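For instance, consider a two-stage experiment (an illustration of my own choosing): roll a fair die to get N , then flip N fair coins and let X count the heads. Since E[X|N ] = N/2, the law of total expectation gives E[X] = E[N/2] = 3.5/2 = 1.75. A minimal simulation sketch:

    import numpy as np

    rng = np.random.default_rng(0)
    N = rng.integers(1, 7, size=300_000)   # first stage: die roll
    X = rng.binomial(N, 0.5)               # second stage: N fair coin flips
    print(X.mean())                        # ~1.75 = E[E[X|N]]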

4.3 Characterizing Property
The following fundamental property says the law of total expectation remains true when the av-
erages and subaverages involved are both restricted or reweighted in the same way, as long as the
restriction/reweighting is based only on Y .
Proposition 4.1. For any function/statistic h(Y ) obtained from Y , we have

$$E\big[E[X|Y]\,h(Y)\big] = E[X h(Y)]. \tag{9}$$

The proof is a direct computation: suppose for simplicity that X, Y are discrete; then
$$E\big[E[X|Y]\,h(Y)\big] = E\left[\sum_x x\,\frac{f_{(X,Y)}(x,Y)}{f_Y(Y)}\, h(Y)\right] = \sum_x\sum_y x\, h(y)\,\frac{f_{(X,Y)}(x,y)}{f_Y(y)}\cdot f_Y(y) = \sum_{x,y} x\, h(y)\, f_{(X,Y)}(x,y) = E[X h(Y)].$$

Interestingly, the property (9) is all that is required to conclude that any other function ϕ(Y ) of Y will be farther from X in mean-squared distance than E[X|Y ]; more precisely, letting h(Y ) := ϕ(Y ) − E[X|Y ], we have
$$\begin{aligned}
E(\phi(Y) - X)^2 &= E(\phi(Y) - E[X|Y] + E[X|Y] - X)^2 \\
&= E(E[X|Y] - X)^2 + E(\phi(Y) - E[X|Y])^2 + 2\,E\big[(\phi(Y) - E[X|Y])(E[X|Y] - X)\big] \\
&\ge E(E[X|Y] - X)^2 + 2\,E\big[(\phi(Y) - E[X|Y])(E[X|Y] - X)\big] \qquad (10) \\
&= E(E[X|Y] - X)^2 + 2\,E\big[h(Y)(E[X|Y] - X)\big] \\
&= E(E[X|Y] - X)^2,
\end{aligned}$$
where the inequality follows since the dropped term E(ϕ(Y ) − E[X|Y ])² is nonnegative and the last equality follows from (9).
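Numerically, one can watch this minimization happen; this sketch (reusing the die-roll setting of Example 4.2, with competitor functions chosen arbitrarily) shows every other function of Y incurs a larger mean-squared error:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.integers(1, 7, size=400_000)
    Y = (X <= 3).astype(int)
    best = np.where(Y == 1, 2.0, 5.0)                  # E[X|Y] from Example 4.2
    print(np.mean((best - X) ** 2))                    # ~2/3, the minimal value
    for a, b in [(1.5, 5.0), (2.0, 4.0), (3.5, 3.5)]:  # arbitrary competitors phi(Y)
        phi = np.where(Y == 1, a, b)
        print(a, b, np.mean((phi - X) ** 2))           # each is at least as large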

4.4 General definition of conditional expectation


The key property (9) seems to be all that is required for proving the fact (10) that says:

⋆ E[X|Y ] is the closest (deterministic) function of Y to X in mean-squared distance ⋆

Being a minimizer, one would (correctly) suspect that E[X|Y ] is the unique function of Y satisfying
(9). In anticipation of the geometric interpretation of (9) below that says E(X|Y ) can be interpreted
as an orthogonal projection of X, we state the following very natural fact:

H is a σ(Y )-random variable if and only if H = h(Y ) for some deterministic function h.

Hence, we can rewrite (9) as

E[(E[X|Y ] − X)H] = 0, for every σ(Y )-random variable H. (11)

But this last statement (11) clarifies that the key property (9) really only depends on Y through
the σ-algebra “σ(Y )” it generates. This makes complete sense! The conditional expectation should

only really depend on the information we get about the experiment from observing Y , and this
information is exactly modeled by σ(Y ), as we have emphasized many times by now. For example,
recall the simple random variable Y = 1{1,2,3} indicates whether a die roll is less than 4 or not,
but it should be at least intuitively clear that observing Y gives the same information as observing
Y ′ = π 1{1,2,3} + e 1{4,5,6} , i.e., σ(Y ) = σ(Y ′ ). Accordingly, conditioning on Y should yield the same
values as conditioning on Y ′ .
Further, given these observations, why should we restrict to the information gained from ob-
serving only a single random variable? Indeed, it is easy to talk about the information G =
σ(Y1 , Y2 , . . . , Yn , . . .) obtained from observing many random variables.
Hence, we are naturally led to take (11) as the defining property for conditional expectation
given general information σ-algebra G ⊂ F.
Definition 4.3. Suppose X satisfies E|X| < ∞ and suppose G ⊂ F is a sub σ-algebra. Then the
conditional expectation E[X|G] of X given G is any F-random variable Z satisfying
1. Z is a G-random variable, i.e., σ(Z) ⊂ G;
2. for every G-random variable H,
$$E[(Z - X)H] = 0. \tag{12}$$

For any random variable Y , we define E[X|Y ] := E[X|σ(Y )]. If X = 1B for some B ∈ F, then we
sometimes write P(B|G) := E[1B |G].
Although this definition may seem abstract at first, the reader should not forget we built up to
it from the most elementary, concrete principles. Further, one can readily arrive at basic properties
useful for manipulating and computing conditional expectations in practice. Before stating these
basic properties in Proposition 4.2, we give the promised geometric interpretation:
Define an “inner product” ⟨X, Y ⟩ := EXY of random variables X, Y satisfying EX 2 < ∞,
EY 2 < ∞. Then (12) says that for any G random variable H,

⟨H, E[X|G] − X⟩ = 0.

This restates the minimizing property (10) as “E[X|G] is the orthogonal projection in L2 (Ω, F, P)
of X onto the space L2 (Ω, G, P) of G random variables with finite second moment.”
Proposition 4.2.
1. At one extreme, if X is independent of every event in G, then
$$E[g(X)\,|\,G] = E[g(X)].$$
At the other extreme, if σ(X) ⊂ G, we have E[g(X)|G] = g(X).


2. If σ(Y ) ⊂ G, then E[Xh(Y )|G] = h(Y )E[X|G] (“pull out what you know”).
3. Linearity: E[X1 + X2 |G] = E[X1 |G] + E[X2 |G]
4. Monotonicity: if X1 ≤ X2 , then E[X1 |G] ≤ E[X2 |G].
5. Jensen’s Inequality: if ϕ is convex with E|ϕ(X)| < ∞, then ϕ(E[X|G]) ≤ E[ϕ(X)|G]
6. If G1 ⊂ G2 , then E[E[X|G1 ]|G2 ] = E[E[X|G2 ]|G1 ] = E[X|G1 ] (“smaller σ-algebras dull vision”).
(Interpretation: observing X through a foggy lens placed over a clear lens, or through the
clear lens placed over the foggy lens, is just the same as observing X through the foggy lens)

7. Generalizing the first item, if X is independent of G and Y is a G-random variable, then E[h(X, Y )|G] = ψ(Y ), where ψ(y) := E[h(X, y)].
(Interpretation: if G gives no new information to update your best guess of X and you know Y exactly given G, then just hold Y fixed while averaging over X.)
The next result shows the abstract notion of conditional expectation subsumes the basic definitions
for discrete and continuous variables given in the introduction of this section.
Proposition 4.3. Assume Ω = ∪n≥1 Bn , where the Bi are disjoint with P(Bi ) > 0 (a partition of Ω). Let G := σ(B1 , B2 , . . . , Bn , . . .) = {∪m∈I Bm }I⊂N . Then
$$P(A\,|\,G) := E(1_A\,|\,G) = \sum_{k=1}^\infty P(A\,|\,B_k)\, 1_{B_k},$$
and more generally,
$$E(X\,|\,G) = \sum_{k=1}^\infty E[X\,|\,B_k]\, 1_{B_k} = \sum_{k=1}^\infty \frac{E(X 1_{B_k})}{P(B_k)}\, 1_{B_k}.$$

Thus the abstract notion of conditional expectation as a random variable can be thought of as
a list or array of elementary conditional expectations: if we know Bi occurs (i.e., we observe an
outcome ω ∈ Bi ), then our best guess of the random variable X is its scaled average value over Bi .
Proof. We need to check that the expression $Z := \sum_{k=1}^{\infty} \frac{E(X 1_{B_k})}{P(B_k)}\, 1_{B_k}$ satisfies both items in Definition 4.3 that uniquely characterize E(X|G). First, we observe that Z is a linear combination of the random variables 1Bk , and so σ(Z) ⊂ G. To deduce the second defining property, if A = Bm for some m ≥ 1, then by disjointness 1Bk · 1A = 1Bk ∩Bm = 1Bk 1(k=m) and so we have
$$E[Z 1_A] = \sum_{k=1}^{\infty} \frac{E(X 1_{B_k})}{P(B_k)}\cdot E[1_{B_k} 1_A] = \sum_{k=1}^{\infty} \frac{E(X 1_{B_k})}{P(B_k)}\cdot E[1_{B_k}]\, 1_{(k=m)} = \frac{E(X 1_{B_m})}{P(B_m)}\cdot P(B_m) = E(X 1_{B_m}) = E[X 1_A].$$
The property E[Z1A ] = E[X1A ] then holds for any A = ∪m∈I Bm ∈ G by linearity of expectation and the fact that $1_A = \sum_{m\in I} 1_{B_m}$. It is a measure-theoretic fact that any G-random variable H is a limit of linear combinations of such indicators, which together with a measure-theoretic convergence theorem implies the identity holds for such general H. This completes the proof.
Here is a quick example of the final property above in action.
Example 4.3 (Buffon’s Needle, 1733). Let 0 < ℓ < d. Drop a needle of length ℓ on the Euclidean plane ruled with parallel lines a constant distance d > 0 apart. What is the probability the needle intersects one of the lines?
Let (X, Θ) ∼ Uniform((0, d/2) × (0, π)), where X represents the distance from the center of the needle to the nearest ruling line and where Θ represents the angle formed by the line determined by the needle and the nearest ruling line. Consider the right triangle formed by the ruling line, X, and the point on the ruling line whose angle is Θ. Then the length H of the hypotenuse satisfies H := X/ sin Θ. The event of interest is exactly when H < ℓ/2. Hence, using the law of total expectation and item 7 of Proposition 4.2, we compute
$$P(H < \ell/2) = P\left(X < \frac{\ell}{2}\sin\Theta\right) = E\left[P\left(X < \frac{\ell}{2}\sin\Theta \,\Big|\, \Theta\right)\right] = E\left[\frac{(\ell/2)\sin\Theta}{d/2}\right] = \frac{\ell}{d}\int_0^\pi \sin\theta\,\frac{d\theta}{\pi} = \frac{2\ell}{\pi d}.$$
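This value is the basis of a classical Monte Carlo estimate of π; here is a sketch of the experiment (ℓ = 1, d = 2, and the sample size are illustrative choices):

    import numpy as np

    rng = np.random.default_rng(0)
    l, d, trials = 1.0, 2.0, 1_000_000
    X = rng.uniform(0, d / 2, size=trials)       # center-to-nearest-line distance
    Theta = rng.uniform(0, np.pi, size=trials)   # angle with the ruling lines
    hits = X < (l / 2) * np.sin(Theta)
    print(hits.mean(), 2 * l / (np.pi * d))      # empirical vs. 2l/(pi*d) ≈ 0.3183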
