Math 272 Probability
Dirk Zeindler
January 9, 2021
Contents

1 Introduction
  1.1 Practical information
    1.1.1 Course format
    1.1.2 Videos
    1.1.3 Lectures
    1.1.4 Exercises
    1.1.5 Online seminars
    1.1.6 Assessments
    1.1.7 Office hour
    1.1.8 Plan of online seminars and lectures
    1.1.9 Other
  1.2 Motivation and course overview

3 Discrete random variables
  3.1 Motivation and definition of random variables
  3.2 Probability mass functions
  3.3 Events involving several values
  3.4 Expectation
  3.5 Variance
  3.6 Important distributions of random variables
    3.6.1 Degenerate distribution
    3.6.2 Bernoulli distribution
    3.6.3 Binomial distribution
    3.6.4 Geometric distribution
    3.6.5 Poisson distribution
  3.7 Joint distribution and independence
    3.7.1 Joint distribution
    3.7.2 Independent random variables
  3.8 Law of large numbers
    3.8.1 Motivating example
    3.8.2 Law of large numbers for random variables
  3.9 Fair games
  3.10 Central limit theorem

B Sets
  B.1 Definition
  B.2 Set operations

Index
Chapter 1
Introduction
This chapter contains some practical information about the course. Begin by watching
this short video Sec1, in which I introduce myself and the course.
If you are taking part in this course from China, the following link might be helpful.
1.1.2 Videos
The videos mentioned in these lecture notes have embedded links, so if you click on
them, they should take you straight to the video. You may need to be logged in to your
University account to watch them. If the link does not work, all these videos can also
be accessed from a repository on the MATH272 Moodle page; just search for the unique
identifier, which is "Sec1" in the case at hand. Sec1 refers to Section 1; for all other
videos, this identifier is followed by a lower-case letter a, b, c, . . . indicating the sequence of
the videos within a section. Please report any dead links to me using the online forums,
as described below.
1.1.3 Lectures
This course corresponds to 20 lectures, at 4 lectures per week. All these lectures
are planned as self-study. The material of each lecture is listed in Table 1.1. I suggest
that you study one lecture per day on Monday–Thursday, but you are welcome to choose
your own pattern, as long as you complete the first two lectures of each week before the
first online seminar and the other two before the second (see below for details of the
online seminars). Note that you do not need to complete a lecture in one sitting: you can
study shorter portions of it if that suits you better. The lectures involve a mix of reading
these notes and watching videos of me discussing and explaining selected results. Expect
to spend around two hours per lecture to learn and reflect on the material and solve the
exercises. However, not all lectures will take exactly the same amount of time to complete:
some are slightly longer, some perhaps slightly shorter, as they have been divided at the
most natural break points, and you may find some topics easier than others.
1.1.4 Exercises
On the Moodle page of MATH272 you can find exercise questions for each of the weeks
11–15. These are designed to test your understanding of the material covered up to that
point and to prepare you for the exam. Note that you do not have to submit solutions
to these exercises and they do not count towards your grade. The solutions will be
released on Friday at 12:00 each week.
Don’t worry if you don’t get everything right the first time around; that is perfectly
normal in mathematics and an integral part of the learning process. See if you can work
out from the notes or model solutions where you went wrong, go back and try again. If
you are unable to figure it out on your own or in your study group, please ask me for
help, either in the corresponding online seminar or my office hours.
However, the lecture recording does not always work, so watching the recording afterwards
should be used only as a last resort.
1.1.6 Assessments
The assessments consist of Moodle quizzes and an end-of-module test.
There are 5 weekly online quizzes in weeks 11–15. The deadline for each is
Friday at 23.55 (UK time). You can find them on the Moodle page of MATH272 in
the section 'Assessments for Probability'.

There is a written end-of-module test for the probability part of Math272. The
submission deadline is Tuesday at 17.00 in week 16. The questions will be available on
Moodle 48 hours prior to this deadline in the section 'Assessments for Probability',
that is, from Sunday at 17.00. When scanning your solutions, make sure that they
are legible and submitted in the right order and right orientation. Solutions that
fail to meet these criteria will lose marks, or may not be marked at all.
1.1.9 Other
Technical support: For IT help and support, including help with Moodle or Teams,
please consult the ISS help page.
Further reading: These lecture notes are self-contained. However, you can find a list
of helpful books in the Resource list at the top of the MATH272 Moodle page.
Past exam papers: Student Based Services operate an archive of past exam papers,
which you may find helpful when revising for the exam.
Table 1.1: Plan of lectures and online seminars.

Week 11: Lectures 1 (pages 7–15) and 2 (pages 15–22), then Online seminar 1;
         Lectures 3 (pages 22–26) and 4 (pages 26–33), then Online seminar 2.
Week 12: Lectures 5 (pages 33–36) and 6 (pages 36–41), then Online seminar 3;
         Lectures 7 (pages 42–46) and 8 (pages 46–48), then Online seminar 4.
Week 13: Lectures 9 (pages 48–55) and 10 (pages 55–58), then Online seminar 5;
         Lectures 11 (pages 59–63) and 12 (pages 63–67), then Online seminar 6.
Week 14: Lectures 13 (pages 69–75) and 14 (pages 75–84), then Online seminar 7;
         Lectures 15 (pages 84–91) and 16 (pages 91–98), then Online seminar 8.
Week 15: Lectures 17 (pages 99–105) and 18 (pages 105–111), then Online seminar 9;
         Lectures 19 (pages 111–116) and 20 (pages 125–128), then Online seminar 10.
1.2 Motivation and course overview

Would you rather be given £1, or toss a coin 10 times, winning £1000 if it comes
up heads every time?

How many people do you need in a room to have a better than 50% chance that
two of them share a birthday?
A suspect's fingerprints match fingerprints found at a crime scene, and the probability
that an innocent person's fingerprints would match is some given small number. How
strongly does this suggest that the suspect is guilty?

A network is formed by taking 100 points and connecting each pair with some fixed
probability. What can we say about the number and size of connected components
in the network?
These questions all relate to situations where there is some randomness, that is, events
we cannot predict with certainty. This could be either because they happen by chance
(tossing a coin) or because they are beyond our knowledge (the innocence or guilt of a
suspect). Probability theory has been introduced to study this uncertainty. This study
involves mathematics, logic, and considerations of the underlying physical mechanisms
that cause variability.
Let us now briefly comment on the above questions. The first question is not an entirely
mathematical one and there is no right or wrong answer; it will depend on circumstances.
However, some tools from probability can help us describe the choice quantitatively. In
each case the average amount we expect to win and the degree of variation in our gain
are relevant. Such questions are the topic of game theory.
The second question is a simple calculation which we will see how to do later (see Ex-
ample 2.5.25). If you haven't seen this before, try to guess what the answer will be.
Surprisingly, this question is already complicated enough to mislead the intuition of many
people.
The third question emphasises how important it is to work out exactly what information
we are given and what we want to know. The court would have to consider the probability
that the suspect is innocent under the assumption that the fingerprints match. The numer-
ical probability given in the question is the probability that the fingerprints match under
the assumption that the suspect is innocent. These are in general completely different
quantities. The erroneous assumption that they are equal is sometimes called the prosecutor's
fallacy. The mathematical tool for considering probabilities given certain assumptions is
conditional probability, see Section 2.6, and Bayes' theorem, see Section 2.6.4.
The final question describes the construction of a random graph. These objects are well
studied both as pure mathematical objects in their own right and as models to help
understand networks in a range of areas such as epidemiology and computing.
In this course we will give an introduction to discrete probability. More precisely, we define
discrete probability spaces (Ω, P) using the axioms of Kolmogorov. We then study some
examples based on counting problems and, in particular, give the answer to the second
question above. Furthermore, we also consider conditional probability, random variables
and the law of large numbers. Notice that many results and definitions in this course are
true on arbitrary probability spaces. For such results, we put the word 'discrete' in
brackets in the requirements. However, we will not go into the general theory.
Chapter 2
2.1 Definition
The study of probability and its laws is more than 300 years old. Bernoulli's law of large
numbers was one of the first probabilistic laws; it dates back to 1705 and was published
posthumously in 1713. Surprisingly, the question of a precise (axiomatic) definition of
probability was an open problem for a long time. In his famous lecture of 1900, Hilbert
presented the axiomatisation of probability and physics(!) as the sixth of his 23 open
problems. Kolmogorov finally found the right approach in 1933. His idea is based on
measure theory instead of the intuitive idea of probability as the limit of relative
frequencies. We will not consider here the general definition of a probability space since
this would go much too far for this course. We will restrict ourselves to discrete
probability spaces and random variables.
We begin with the definition of a discrete probability space. Please watch first the
following video Sec2.1a and then continue reading. Probabilities are naturally associated
with random experiments, and to find the right definition we consider the following
examples as motivation.
Example 2.1.1.
a) Anton tosses a coin. He says to himself that there are only two possible results:
Heads or Tails. Since he doesn't know much about the coin, he thinks that he will get
Heads in 1/2 of the cases and Tails in 1/2 of the cases.

b) Bea rolls a die. She says to herself that there are only six possible results: 1, 2, 3, 4, 5, 6.
Furthermore, she says that this is a fair die and thus each number will occur in 1/6 of
the cases.

c) Chris has an urn with four red balls and one blue ball, and the balls can only be
distinguished by their colour. Chris now draws a ball without looking at its colour. Chris
says to himself that there are only two possible results: red or blue. Furthermore, he
says to himself that he will get a red ball in 4/5 of the cases and a blue ball in 1/5 of
the cases.
In these three examples of random experiments, we always do the same:

First, we specify the possible results of our random experiment.

with N ∈ N ∪ {∞} the cardinality of the set Ω (the number of elements in Ω). In words,
this means that we can introduce a ranking on Ω and thus have a 'first' element, a 'second'
element, a 'third' element and so on. In particular, each finite set is countable, the natural
numbers N are countable and the integers Z are also countable. Examples of uncountable
sets are the real numbers R and each interval [a, b] ⊂ R with a < b. We can now define
Definition 2.1.3 (Discrete probability space, Version 1). A discrete probability space
consists of a pair (Ω, P), with Ω a countable set and a function P : Ω → [0, 1] with

∑_{ω∈Ω} P[ω] = 1.

We call Ω the sample space, P the probability measure and each ω ∈ Ω a sample point.
An illustration of a discrete probability space is given in Figure 2.1. The blue dots in
Figure 2.1 represent the sample points and the values near the points the corresponding
probabilities. An intuitive interpretation of Figure 2.1 is to see Ω as a lake, the sample
points ω as fish in the lake and the probabilities P[ω] as the chance that the fish ω will
be caught next.
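As an informal illustration (Python is not part of this course; the sketch below is only my way of making the definition concrete), one can model a discrete probability space as a dictionary mapping sample points to probabilities, here for Bea's fair die from Example 2.1.1:

    # Discrete probability space (Omega, P) for a fair die:
    # the keys form the sample space Omega, the values the probability measure P.
    P = {outcome: 1 / 6 for outcome in range(1, 7)}

    # The two defining properties of Definition 2.1.3:
    # every P[omega] lies in [0, 1], and the probabilities sum to 1.
    assert all(0 <= p <= 1 for p in P.values())
    assert abs(sum(P.values()) - 1) < 1e-12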
Exercise 2.1.5. Give a (formal) example of a probability space.
This is now the right point to watch the video Sec2.1b about definitions and examples.
We will use the ∑-notation for sums in this lecture to simplify the occurring expressions.
In particular, we write the second property of P in Definition 2.1.3 as

∑_{j=1}^{N} P[ω_j] = 1,    (2.1.8)

where N ∈ N ∪ {∞} is the cardinality of Ω (the number of elements in Ω). If you are not
familiar with the ∑-notation for sums, you should first watch the video Sec2.1c and then
read Appendix A. We give there a brief introduction to the ∑-notation for sums, where
we introduce everything we need in this course. Note that from now on we will use the
∑-notation frequently.
Definition 2.1.3 is maybe not the most intuitive way to define a probability. It would be
much more intuitive to determine the probability of an outcome of an experiment as follows:
perform n independent trials with n very large and consider the fraction of trials in which
the desired outcome occurred. Clearly, enlarging n gives a more precise result, so we would
like to choose n as large as possible. We therefore let n go to infinity and define the
probability of this outcome as the limit of the observed fractions. This definition, however,
would bring some disadvantages. For example, the proportion would typically depend on
the particular trials. In a four-fold coin toss one can obtain H, H, T, H as well as
T, T, H, H as a result (H = 'Heads' and T = 'Tails'). Definition 2.1.3 can be
interpreted as omitting the limit process and working directly with the limiting
proportion. Of course, one can ask whether Definition 2.1.3 really has this intuitive
property. We will see in Section 3.8 that this is indeed the case.
Definition 2.1.3 is very direct, but it unfortunately cannot be used to define probability
spaces on uncountable sets such as the real numbers R. We will therefore give in
Definition 2.1.19 an equivalent formulation using Kolmogorov's axioms, which can also be
used in general. However, we will consider only discrete probability spaces in this lecture.
Example 2.1.6.

a) We roll a die. What is the probability that the number is divisible by 3? The natural
answer is

P['divisible by 3'] = P[{3, 6}] = P[3] + P[6].    (2.1.9)

b) Anton, Bea, Chris and Donald play a game. The probability that Anton wins is 1/6,
that Bea wins is 1/3, that Chris wins is 3/8 and that Donald wins is 1/8. The proba-
bility space (Ω, P) is in this case

Ω = {'Anton', 'Bea', 'Chris', 'Donald'} and    (2.1.10)

P['Anton'] = 1/6, P['Bea'] = 1/3, P['Chris'] = 3/8, P['Donald'] = 1/8.    (2.1.11)

A possible question here is: what is the probability that Chris or Bea wins the game?
The natural answer is

P['Chris or Bea wins'] = P[{'Bea', 'Chris'}] = P['Bea'] + P['Chris'] = 1/3 + 3/8 = 17/24.    (2.1.12)

Another question might be: what is the probability that Donald doesn't win the game?
The natural answer is

P['Donald loses'] = P['Anton or Bea or Chris wins'] = P[{'Anton', 'Bea', 'Chris'}]
= P['Anton'] + P['Bea'] + P['Chris'] = 1/6 + 1/3 + 3/8 = 7/8.    (2.1.13)
If one analyses Example 2.1.6, one sees that we are interested in subsets A of Ω (A ⊂ Ω)
and that we assign to each subset A a probability by summing the probabilities of all
elements contained in A. This leads to the following definition.

An illustration of an event A is given in Figure 2.2. Using again the intuitive picture of Ω
as a lake and the ω ∈ Ω as fish in the lake, one can interpret an event A as a restricted
area in which some fish are enclosed. The important point here is that the fish can
neither enter nor leave this area.
Figure 2.2: Illustration of an event A in a probability space (Ω, P).
Remark 2.1.8. Notice that a sample space often contains many sample points ω, so in an
illustration like Figure 2.2 one normally drops the sample points ω. Such an illustration
is then called a Venn diagram, see also Appendix B.
It is clear from Definition 2.1.7 that one part of working with probability spaces and
events is to convert a description like 'The number is less than 4' into a subset of the
sample space. Sometimes it is also useful to find, for an event A ⊂ Ω, a description like
'The number is greater than 2'.
Example 2.1.9. Peter has three red, two blue, two white and four green
shirts. He now chooses a shirt purely at random. What is the probability
that it is red? We define

Ω = {r1, r2, r3, b1, b2, w1, w2, g1, g2, g3, g4} and P[ω] = 1/11 for all ω ∈ Ω.    (2.1.15)

The event of selecting a red shirt is then A = {r1, r2, r3}, and thus P[A] = 3/11.
Example 2.1.10. We choose purely at random a number between 1 and 12, i.e. we take
Ω = {1, 2, . . . , 12} and P[ω] = 1/12 for all ω ∈ Ω, and consider the events

A = 'divisible by 2' = {2, 4, 6, 8, 10, 12}, B = 'divisible by 3' = {3, 6, 9, 12},
C = 'a prime number' = {2, 3, 5, 7, 11} and D = 'greater than 9' = {10, 11, 12}.
The probability of the event A is then

P[A] = P[2] + P[4] + P[6] + P[8] + P[10] + P[12] = 6/12 = 1/2.

Similarly, we obtain

P[B] = 4/12, P[C] = 5/12 and P[D] = 3/12.
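These numbers can be checked by summing point masses exactly as in Definition 2.1.3. A small Python sketch (an illustration only, assuming the uniform space on {1, . . . , 12} described above):

    from fractions import Fraction

    P = {k: Fraction(1, 12) for k in range(1, 13)}   # uniform on {1, ..., 12}

    def prob(event):
        # P[A] is the sum of P[omega] over all omega in the event A
        return sum(P[w] for w in event)

    A = {2, 4, 6, 8, 10, 12}   # divisible by 2
    B = {3, 6, 9, 12}          # divisible by 3
    print(prob(A), prob(B))    # 1/2 and 1/3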
The video Sec2.1e contains a further example of how to translate a description like 'The
number is less than 4' into a subset of the sample space.
Exercise 2.1.11. Suppose we have two soccer teams, each consisting of 11
players. Suppose we choose a player uniformly at random. Give a formal description
of this experiment and of the events 'We select a player from the first team' and 'We
select a keeper'. Compute the probability of these events.
In most situations, we are interested in more than one event. Please watch at this point
the video Sec2.1f. When we work with more than one event, we require the basic notions
of set theory. We expect that the reader is already familiar with these, but for
completeness we have given a short introduction in Appendix B, which covers everything
we need in this course.
Before we discuss some properties of P[.], we would like to introduce a few ways of speak-
ing. We say that an event A occurs in a random experiment if an ω ∈ A has been obtained
in this random experiment. Otherwise we say that the event A didn't occur. For example,
let us roll a die and consider the event A = {4, 5, 6} ('the number is greater than 3'). If
we roll a '4' then A occurs, and if we roll a '2' then A doesn't occur. We interpret P[A]
as the chance that the event A occurs in a random experiment. We have listed some
further terms and ways of speaking in Table 2.1. Notice that events like A ∪ B or A ∩ B
can be illustrated with Venn diagrams, see Figure 2.3 and Appendix B.
Fact | Ways of speaking | As formula
A surely occurs. | A is the sure event. | A = Ω
A never occurs. | A is the impossible event. | A = ∅
A does not occur. | The event is the complement of A. | A^c := Ω \ A
A and B occur. | A as well as B occurs. | A ∩ B
A or B occurs. | At least one of the events A or B occurs. | A ∪ B
If A occurs, then B doesn't occur. | A and B are disjoint or mutually exclusive. | A ∩ B = ∅
If A occurs, then also B occurs. | A is a subset of B. | A ⊂ B
A occurs if and only if B does not occur. | A and B are complementary events. | A = B^c
A occurs if and only if at least one of the events A_i occurs. | A is the union of the A_i. | A = ⋃_{i=1}^{n} A_i
A occurs if and only if all A_i occur. | A is the intersection of the A_i. | A = ⋂_{i=1}^{n} A_i

Table 2.1: Designation of events (A, A_1, A_2, . . . , A_n and B are events in some (Ω, P)).
Example 2.1.12. Let us consider again the situation of Example 2.1.10. If we are, for
instance, interested in the event

E = 'the number is divisible by 3 and greater than 9',

we then obviously have E = {12} and P[E] = 1/12. On the other hand, we can use the
fourth row of Table 2.1 and write E = B ∩ D. Let us now consider the event F = A ∪ C.
We then have F = {2, 3, 4, 5, 6, 7, 8, 10, 11, 12} and P[F] = 10/12. In words, F can be
written as 'the number is divisible by 2 or a prime number'.
Example 2.1.13. An urn contains 100 balls, numbered 00, 01, . . . , 99. We now draw a
ball purely at random. We define in this situation

Ω = {00, . . . , 99} and P[ω] = 1/100 for all ω ∈ Ω.    (2.1.19)

A := 'The first digit is a 4' = {40, 41, 42, 43, 44, 45, 46, 47, 48, 49},    (2.1.20)
B := 'The second digit is a 6' = {06, 16, 26, 36, 46, 56, 66, 76, 86, 96}.    (2.1.21)
Figure 2.3: Venn diagrams: (a) 'A or B', the union A ∪ B; (b) 'A and B', the intersection
A ∩ B; (c) 'A but not B', the set difference A \ B; (d) 'A does not occur', the complement A^c.
Notice that the events A^c and B^c each contain 90 elements. It is thus not convenient
to write A^c and B^c out explicitly as we did for A and B.
Exercise 2.1.14. Give an example of two events A, B on a probability space (Ω, P).
Compute A ∩ B, A ∪ B and A \ B and give a description of these events.
We now prove some properties of P. The following video Sec2.1g explains Lemma 2.1.15
and Lemma 2.1.17 below and gives some additional explanations using diagrams.
Lemma 2.1.15. Let (Ω, P) be a (discrete) probability space and let A, B ⊂ Ω be events.
We then have

a) if A ∩ B = ∅, then P[A ∪ B] = P[A] + P[B];
b) P[A ∪ B] ≤ P[A] + P[B];
c) if A ⊂ B, then P[A] ≤ P[B];
d) P[A^c] = 1 − P[A].
Proof. We begin with Point a). An illustration of A ∪ B for A and B disjoint is given in
Figure 2.4(a). We have by definition

P[A ∪ B] = ∑_{ω∈A∪B} P[ω].    (2.1.30)

Since A ∩ B = ∅, each ω ∈ A ∪ B belongs either to A or to B, but not to both. We can
thus split the sum ∑_{ω∈A∪B} into a sum over all ω ∈ A and a sum over all ω ∈ B, and
no ω is counted twice, see also Lemma A.3.2. In formulas (see also Lemma A.3.2), this means

P[A ∪ B] = ∑_{ω∈A∪B} P[ω] = ∑_{ω∈A} P[ω] + ∑_{ω∈B} P[ω] = P[A] + P[B].    (2.1.31)
Figure 2.4: (a) Disjoint events A and B; (b) non-disjoint events A and B.
We now come to Point b). Here the events A and B can have a non-empty intersection,
and thus it is possible that an ω ∈ Ω is contained in both, see Figure 2.4(b). If we write

P[A] + P[B] = ∑_{ω∈A} P[ω] + ∑_{ω∈B} P[ω],    (2.1.32)

then all ω ∈ A ∩ B occur twice in this sum, while all ω ∈ A \ B and all ω ∈ B \ A occur
just once. If we now take a look at

P[A ∪ B] = ∑_{ω∈A∪B} P[ω],    (2.1.33)

then all ω ∈ A ∪ B occur only once, including those ω ∈ A ∩ B. This completes the proof
of this point since P[ω] ≥ 0.

We now come to Point c). Since A ⊂ B, we can write B = A ∪ (B \ A). Because
A ∩ (B \ A) = ∅, we can use Point a) and get

P[B] = P[A] + P[B \ A] ≥ P[A],

since P[B \ A] ≥ 0. For Point d), we use that A ∩ A^c = ∅ and A ∪ A^c = Ω. Using again
Point a) gives

1 = P[Ω] = P[A] + P[A^c],

and rearranging gives P[A^c] = 1 − P[A].
Exercise 2.1.16. Let A and B be two events on a probability space (Ω, P). Show that

P[A ∪ B] = P[A] + P[B] − P[A ∩ B].
In Lemma 2.1.15, points a) and b), we considered the union of two events. However, in
many cases one considers the union of more than two events, and often even the union of
infinitely many events. Fortunately, the argumentation in the proof of Lemma 2.1.15 can
also be used for such cases. Obviously, we have to introduce a suitable notation for the
union of (infinitely) many events (see Definition B.2.2). We set

⋃_{i=1}^{n} A_i := A_1 ∪ A_2 ∪ A_3 ∪ . . . ∪ A_n and ⋃_{i=1}^{∞} A_i := A_1 ∪ A_2 ∪ A_3 ∪ . . . .    (2.1.37)
We now have

Lemma 2.1.17. Let (Ω, P) be a (discrete) probability space and let A_1, A_2, . . . be events.
We then have

a) if the A_i are pairwise disjoint, then P[⋃_{i=1}^{∞} A_i] = ∑_{i=1}^{∞} P[A_i];
b) P[⋃_{i=1}^{∞} A_i] ≤ ∑_{i=1}^{∞} P[A_i].
The proof of Lemma 2.1.17 is similar to the proof of Lemma 2.1.15 and we thus omit it.
Remark 2.1.18. Notice that Lemma 2.1.17 is true on general probability spaces. However,
it is important to mention that one can apply Lemma 2.1.17 only for countable
S unions of
events. It is for instance not allowed to apply Lemma 2.1.17 to the union x∈R Ax .
With the help of Lemma 2.1.15 and Lemma 2.1.17, we can now give an equivalent definition
of a discrete probability space (Ω, P). The reason we do this is that Definition 2.1.3 can
only be used if the sample space is finite or countable. However, for the normal
distribution or the Black-Scholes model one requires a more complicated sample space. On
the other hand, Definition 2.1.19 below can be generalised to arbitrary sample spaces.
Definition 2.1.19 (Discrete probability space, Version 2). A discrete probability space
consists of a pair (Ω, P), with Ω a countable set and P assigning to each subset A ⊂ Ω
a value in [0, 1] with

P[Ω] = 1 and

P[⋃_{i=1}^{∞} A_i] = ∑_{i=1}^{∞} P[A_i] for all pairwise disjoint events A_1, A_2, . . .
(sigma-additivity).
Remark 2.1.20. The P in Definition 2.1.19 can be seen as a function P : PΩ → [0, 1], where
PΩ is the power set of Ω (see Definition B.1.10).
It is quite easy to see that Definitions 2.1.3 and 2.1.19 are equivalent. It follows directly
from Lemma 2.1.17.a) that a probability in the sense of Definition 2.1.3 is also a
probability in the sense of Definition 2.1.19. For the opposite direction, we define
P[ω] := P[{ω}], where {ω} is the subset of Ω containing just the element ω. Sigma-additivity
then gives

1 = P[Ω] = P[⋃_{ω∈Ω} {ω}] = ∑_{ω∈Ω} P[{ω}] = ∑_{ω∈Ω} P[ω].    (2.1.41)

Thus a probability in the sense of Definition 2.1.19 is also a probability in the sense of
Definition 2.1.3. Notice that we have used in (2.1.41) that Ω is countable.
The three properties of P in Definition 2.1.19 are known as Kolmogorov's axioms for prob-
ability. These axioms can now be used to define probability measures on arbitrary
sample spaces.
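On a small finite space, the axioms can be verified by brute force. A short Python sketch (an illustration only) for the game of Example 2.1.6 b), where P is defined on all subsets by summing point masses:

    from fractions import Fraction
    from itertools import chain, combinations

    Omega = ('Anton', 'Bea', 'Chris', 'Donald')
    p = dict(zip(Omega, (Fraction(1, 6), Fraction(1, 3),
                         Fraction(3, 8), Fraction(1, 8))))

    def P(A):
        # probability of an event A, defined by summing point masses
        return sum(p[w] for w in A)

    # all 16 subsets of Omega
    events = list(chain.from_iterable(combinations(Omega, r) for r in range(5)))

    assert P(Omega) == 1                      # first axiom: P[Omega] = 1
    for A in events:                          # additivity on disjoint events
        for B in events:
            if set(A).isdisjoint(B):
                assert P(set(A) | set(B)) == P(A) + P(B)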
Definition 2.2.1. Let Ω be a finite set. A probability measure P is called classical
measure or Laplace measure or uniform measure (and (Ω, P) a classical or Laplace
experiment) if

P[ω] = 1/|Ω| for all ω ∈ Ω.    (2.2.1)

Here |Ω| denotes the cardinality of Ω (the number of elements in Ω). A Laplace
experiment is often also called a 'purely random' or a 'fair' experiment.
By definition, all possible outcomes in a Laplace experiment have the same probability,
and we thus have for each event

P[A] = |A|/|Ω| for all A ⊂ Ω.    (2.2.2)

The elements of A are sometimes also called the favourable outcomes and the elements
of Ω the possible outcomes. One thus says that P[A] is the number of favourable outcomes
divided by the number of possible outcomes.
Example 2.2.2 (The roll of two dice). We roll two fair, distinguishable dice together and
then add the numbers. We now consider the following events:

A = 'the sum is even' and B = 'the sum is at most 5'.

First we specify the sample space. We could choose Ω = {2, 3, . . . , 12}. Unfortunately, not
every value between 2 and 12 has the same probability of occurring. For instance, 2 can
only occur as 1 + 1 while 5 = 1 + 4 = 2 + 3 = 3 + 2 = 4 + 1. We thus set

Ω = {(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6),
(2, 1), (2, 2), (2, 3), (2, 4), (2, 5), (2, 6),
(3, 1), (3, 2), (3, 3), (3, 4), (3, 5), (3, 6),
(4, 1), (4, 2), (4, 3), (4, 4), (4, 5), (4, 6),
(5, 1), (5, 2), (5, 3), (5, 4), (5, 5), (5, 6),
(6, 1), (6, 2), (6, 3), (6, 4), (6, 5), (6, 6)}.

With this choice of Ω, all ω ∈ Ω have the same probability P[ω] = 1/|Ω| = 1/36. We thus
now have a Laplace experiment. In this setting, the events A and B are

A = {(1, 1), (1, 3), (1, 5), (2, 2), . . . , (5, 3), (5, 5), (6, 2), (6, 4), (6, 6)},    (2.2.5)
B = {(1, 1), (1, 2), (1, 3), (1, 4), (2, 1), (2, 2), (2, 3), (3, 1), (3, 2), (4, 1)}.    (2.2.6)

The probabilities of the events A and B are thus

P[A] = |A|/|Ω| = 18/36 = 1/2 and P[B] = |B|/|Ω| = 10/36 = 5/18.    (2.2.7)
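The Laplace formula P[A] = |A|/|Ω| lends itself to direct enumeration on a computer. A brief Python sketch reproducing (2.2.7) (assuming the event descriptions above; an illustration only):

    from itertools import product
    from fractions import Fraction

    Omega = list(product(range(1, 7), repeat=2))   # the 36 equally likely pairs

    A = [w for w in Omega if sum(w) % 2 == 0]      # the sum is even
    B = [w for w in Omega if sum(w) <= 5]          # the sum is at most 5

    print(Fraction(len(A), len(Omega)))            # 1/2
    print(Fraction(len(B), len(Omega)))            # 5/18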
Example 2.2.3. One card is randomly selected from an ordinary 52-card playing deck.
What is the probability that we draw a king? At this point one needs to know that an
ordinary playing deck contains exactly 4 kings. Also, we assume that one cannot
distinguish the cards when they are face down. Thus each card has the same probability
of being drawn. This implies that

P['We draw a king'] = 4/52.    (2.2.8)
Example 2.2.4. A club has 100 members. A survey revealed the following statistics:

                     Men   Women
not vegetarian        48      12
vegetarian            16      24

In this situation, we set

Ω = {(i, g, v); i = 1, . . . , 100},    (2.2.9)

where i = 1, . . . , 100 is the number of the member, g the gender and v denotes whether
he/she is vegetarian or not. We now choose a member purely at random, and thus each
member has the probability P[{ω}] = 1/100 of being selected. The probability that the
selected member is a vegetarian is thus

P[{ω; ω is vegetarian}] = (16 + 24)/100 = 2/5,    (2.2.10)

since 16 men and 24 women are vegetarian. If we knew beforehand that we would select a
woman, then the situation would look different. In this case, the probability that the
selected member is a vegetarian is 24/36 = 2/3. However, this question in fact belongs to
conditional probability.
One thing one should keep in mind is that not all experiments are Laplace experiments.
Here are two examples.

Two football teams play a match against each other. One could think there is a
50:50 chance of victory for each team, but this is not the case. The talent of the
players, the strategy of the coach and much more have an influence on the chance of
winning. This match could only be a Laplace experiment if a team would virtually play
against itself.

Suppose we throw a frisbee instead of a coin. A frisbee has, like a coin, two sides.
However, the two sides are not symmetric, which argues against a Laplace experiment.
Also, in many cases a random experiment can be mistaken for a Laplace experiment.
Such things can happen even to very good mathematicians.
Example 2.2.5 (Leibniz's Error). Gottfried Wilhelm von Leibniz was an outstanding
German mathematician. Among his most outstanding achievements are the ideas of dif-
ferential and integral calculus, which he developed independently of Isaac Newton at the
same time. However, Leibniz was not infallible. He stated in the Opera Omnia (Leibniz,
1768, p. 217) that

... for example, with two dice, it is equally likely to throw twelve points, than
to throw eleven; because one or the other can be done in only one manner.

Let us now see why this is wrong. To get 12 points, both dice must show a 6. On the other
hand, to get 11 points, one die must show a 5 and the other a 6. This can be achieved in
two ways: the first die gives five and the second six, and vice-versa. Thus we get

P['The sum of the numbers is 11'] = 2/36 and P['The sum of the numbers is 12'] = 1/36.

But what happens if we can't distinguish the dice? Does this affect the above mentioned
probabilities? The answer is no. To see this, let us suppose we roll two identical dice at
the same time, which we have marked with invisible ink. We thus can distinguish the dice,
and the probability that the sum is 11 is therefore 2/36. Everybody else cannot distinguish
the dice, but sees the same sums as we do. Thus seeing or not seeing the invisible ink has
no influence on the probability of seeing the sum 11, which must therefore be 2/36 for
everybody.
To compute the probability of each element of the sample space, we think of the balls in
the urn as if they were numbered: 1–3 stand for 'a', 4 and 5 for 'n'.

Figure 2.5: Illustration of the urn in Example 2.3.1

In the numbered situation, there are 25 ways to draw the balls. These are all combinations
(i, j) with 1 ≤ i, j ≤ 5. Nine of these combinations lead to (a, a) ∈ Ω, namely all pairs
(i, j) with 1 ≤ i ≤ 3 and 1 ≤ j ≤ 3. Six combinations lead to (a, n) ∈ Ω, namely all (i, j)
with 1 ≤ i ≤ 3 and j = 4 or j = 5. Six further combinations lead to (n, a) ∈ Ω and four
to (n, n) ∈ Ω. Since all balls have the same probability of being drawn, all combinations
(i, j) with 1 ≤ i, j ≤ 5 have the same probability of being drawn. Since there are 25
possibilities, each pair (i, j) has the probability 1/25 of being drawn. We thus obtain

P[(a, a)] = 9/25, P[(a, n)] = 6/25, P[(n, a)] = 6/25 and P[(n, n)] = 4/25.    (2.3.1)
Another way to reach this conclusion is to represent the experiment by a tree diagram, see
Figure 2.6. Let us now describe how to obtain Figure 2.6. We start with a dot as the root
of our tree. In the first step of our experiment, we can draw a ball with an 'a' or with an
'n', with probabilities 3/5 and 2/5. We thus add two branches to the root of our tree, one
for 'a' and one for 'n'. Furthermore, we write next to each branch the probability that the
corresponding ball is drawn. In the second step, we repeat the first step, but this time
adding the new branches to the branches obtained in the first step. The important
observation at this point is that the probabilities of the combinations (a, a), (a, n), (n, a)
and (n, n) can also be obtained by multiplying the probabilities along the path
corresponding to a combination.
Example 2.3.2. We consider the same situation as in Example 2.3.1, but with the differ-
ence that we do not return the ball to the urn. We thus have to adjust the probabilities
on the branches corresponding to the second level of our experiment. If we draw an 'a' in
the first step, then two 'a's and two 'n's are left in the urn. Thus the probability of
drawing an 'a' is 1/2, and of drawing an 'n' also 1/2. If we draw an 'n' in the first step,
then three 'a's and one 'n' are left in the urn. Thus the probability of drawing an 'a' is
3/4 and of drawing an 'n' is 1/4. This now leads to the tree in Figure 2.7. The
multiplication approach gives

P[(a, a)] = P[(a, n)] = P[(n, a)] = 3/10 and P[(n, n)] = 1/10.    (2.3.2)

If one computes the probabilities as in the first half of Example 2.3.1, one obtains the
same result.
Exercise 2.3.3. Compute P [(a, a)] in Example 2.3.2 as in the first half of Example 2.3.1
and compare it to (2.3.2).
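If you want to double-check your answer afterwards, the counting argument can be automated. A small Python sketch (an illustration only) that enumerates the numbered balls and all ordered draws without replacement:

    from itertools import permutations
    from fractions import Fraction

    balls = ['a', 'a', 'a', 'n', 'n']          # balls numbered 0..4, cf. Figure 2.5
    draws = list(permutations(range(5), 2))    # 20 equally likely ordered draws

    def prob(first, second):
        hits = [(i, j) for (i, j) in draws
                if balls[i] == first and balls[j] == second]
        return Fraction(len(hits), len(draws))

    print(prob('a', 'a'), prob('a', 'n'), prob('n', 'a'), prob('n', 'n'))
    # 3/10 3/10 3/10 1/10, in agreement with (2.3.2)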
The observations in Examples 2.3.1 and 2.3.2 lead to the following proposition:
Proposition 2.3.4 (Path rule).

The probability of a sample point ω is the product of the probabilities along the
path leading to ω.

The probability of an event A is the sum over the probabilities of all paths
leading to an ω ∈ A.

The graphical illustration of all of these paths as a tree, as in Figure 2.6 and
Figure 2.7, is called a probability tree or tree diagram.
The proof of this proposition requires conditional probability, and we thus postpone it,
see Proposition 2.6.16 and Section 2.6.3. We will now apply the path rule in various
examples.
Example 2.3.5. There are 9 passengers in a boat: 4 are smugglers and 5 are honest
passengers. A customs officer selects 3 persons for a check and all 3 are smugglers. Has
the customs officer selected them randomly? To make a decision, we compute the
probability of getting 3 smugglers when we select them purely at random. We set

Ω = {(i, j, k); i, j, k ∈ {s, h}},

where s denotes smuggler and h honest. The path corresponding to the result (s, s, s) is

· →(4/9) s →(3/8) s →(2/7) s.    (2.3.3)

Thus the probability of selecting three smugglers is

P[(s, s, s)] = 4/9 · 3/8 · 2/7 = 1/21,    (2.3.4)

which is less than 5%. With such a small probability, it is to be expected that the customs
officer has some ability to judge human nature.
Example 2.3.6. In a drawer there are 4 black, 6 red and 2 white socks. In the
dark, somebody randomly chooses two socks. What is the probability that
they have the same colour? We define in this situation

Ω = {(i, j); i, j ∈ {b, r, w}}.    (2.3.5)

The following three paths lead to equal colours:

· →(4/12) b →(3/11) b, · →(6/12) r →(5/11) r, · →(2/12) w →(1/11) w.    (2.3.6)

We thus have

P[{(i, j); i = j}] = 4/12 · 3/11 + 6/12 · 5/11 + 2/12 · 1/11 = 1/3.    (2.3.7)
Example 2.3.7. Each time Prof.Q sees 7 persons standing together, he
bets 100:1 that at least two of them are born on the same weekday. Is
this a good bet? We could try to compute the probability that Prof.Q wins. The prob-
lem is that we would have to consider all the different outcomes, like 'All are born on the
same weekday' or 'Two are born on the same weekday and all others are born on different
weekdays'. This is possible, but it is much easier to compute the probability that Prof.Q
loses his bet and then to use Lemma 2.1.15.d. We set

Ω = {(i_1, . . . , i_7); 1 ≤ i_j ≤ 7},    (2.3.8)

where i_j denotes the weekday on which person j is born. We can assume at this point
that all weekdays have the same probability. Under this assumption, the following path
corresponds to the event that Prof.Q loses:

· →(7/7) {(i_1); 1 ≤ i_1 ≤ 7} →(6/7) {(i_1, i_2); i_1 ≠ i_2, 1 ≤ i_1, i_2 ≤ 7} →(5/7) · · ·
· · · →(1/7) {(i_1, . . . , i_7); i_j ≠ i_k for j ≠ k, 1 ≤ i_1, . . . , i_7 ≤ 7}.    (2.3.9)

The path rule thus gives

P['Prof.Q loses'] = 7/7 · 6/7 · 5/7 · . . . · 1/7 = 7!/7^7 ≈ 0.006.

By Lemma 2.1.15.d, Prof.Q therefore wins with probability ≈ 0.994, so this is a very good
bet for him.
Example 2.3.8. In the city M. there exist only two kinds of weather: rainy (R) or dry (D).
If it is rainy today, then the probability is 5/6 that it is also rainy tomorrow and 1/6
that it is dry tomorrow. If it is dry today, then the probability is 3/10 that it is also dry
tomorrow and 7/10 that it is rainy tomorrow. Today it is dry: what is the probability
that it is dry in two days? Apparently there are two paths leading to dry weather:

D →(3/10) D →(3/10) D and D →(7/10) R →(1/6) D.    (2.3.11)

The path rule gives

P['dry in two days'] = 3/10 · 3/10 + 7/10 · 1/6 = 9/100 + 7/60 = 31/150 ≈ 0.21.
Recall that in a Laplace experiment we have

P[A] = |A|/|Ω| for all A ⊂ Ω.
If one would like to determine the probability P[A] for some A ⊂ Ω, one has to count
the elements of A. At first glance, this looks like a simple task, but it is often very
difficult. We will therefore look in this section and in Section 2.5 at some helpful counting
rules. The topic of this section is the product rule and the sum rule.
At this point, we would like to remark that the computation of the size of a finite set
and the finding of counting principles is a subdiscipline of discrete mathematics, more
precisely of Combinatorics. We should also mention that there are many problems in
Combinatorics which can easily be formulated and explained to everybody, but which are
extremely hard to solve, and many of them are still unresolved.
Proposition 2.4.1 (Sum rule). Let A_1, . . . , A_n be pairwise disjoint finite sets. Then

|A_1 ∪ A_2 ∪ . . . ∪ A_n| = |A_1| + |A_2| + . . . + |A_n|.

The proof of Proposition 2.4.1 is obvious and can, for example, be deduced directly from
Lemma 2.1.17.a) if one uses for P the Laplace probability measure on Ω. We have given
in Figure 2.8 an illustration of the sum rule with n = 3.
Figure 2.8: Illustration of the sum rule. We have here |A_1 ∪ A_2 ∪ A_3| = 2 + 3 + 3.
Proposition 2.4.2 (Product rule). Suppose an experiment has s levels and the ex-
periment has n_i possible outcomes at level i (1 ≤ i ≤ s), no matter what happens at
the other levels. Then the whole experiment has

n_1 · n_2 · . . . · n_s    (2.4.1)

possible outcomes.
We have given in Figure 2.9 an illustration with tree diagrams. Notice that in
Figure 2.9(a) we always have three possibilities at the second level, no matter what
happened at the first level. Also, we always have two possibilities at the third level, no
matter what happened at the first two levels. We thus can apply the product rule to the
experiment in Figure 2.9(a) and get 2 · 3 · 2 possible outcomes. On the other hand, in
Figure 2.9(b) we have either two or three possibilities at the second level, depending on
the result of the first level, and similarly at the third level. We thus cannot apply the
product rule to the experiment in Figure 2.9(b).
Figure 2.9: (a) The product rule can be applied; (b) the product rule cannot be applied.
Remark 2.4.3.
a) In Proposition 2.4.2, a level is allowed to have an influence on the later levels, as
long as it has no influence on the number of outcomes.
The proof of Proposition 2.4.2 is, as for Proposition 2.4.1, quite obvious. If one would
like, one can deduce the product rule from the principles of probability theory. For this,
one would need the independence of events, see Section 2.7. We leave this to the reader
and instead show the usefulness of these rules with some examples.
Example 2.4.4. We toss a coin with the sides 0 and 1 six times. How many
possible outcomes do we have? Each coin toss has two possible outcomes. Further-
more, no level has an influence on the other levels. We thus have

2 · 2 · 2 · 2 · 2 · 2 = 2^6 = 64    (2.4.2)

possible outcomes of our experiment. In general, if we toss a coin n times, then we
have 2^n possible outcomes.
Example 2.4.5. An urn contains 5 balls with the numbers 1 to 5. We now draw
a ball three times without replacement. How many possible outcomes do we
have? At the first level we have 5 possibilities, at the second 4 possibilities and at the
third 3 possibilities. Each level has an influence on which numbers remain in the urn,
but no influence on the number of remaining balls. We thus can use the product rule and
get

5 · 4 · 3 = 60    (2.4.3)

possible outcomes.
4 · 5 · 1 + 2 · 5 · 1 + 4 · 2 · 1 + 4 · 5 · 3 = 98 (2.4.4)
possible combinations.
A sample can be drawn with replacement or without replacement, and it can be ordered
or unordered. This gives four cases, of which we will consider three in this section. The
resulting formulas are:

                        ordered    unordered
with replacement        n^s        \binom{n+s-1}{s}
without replacement     (n)_s      \binom{n}{s}

We will not study the case 'unordered with replacement' and have given the formula here
only for completeness.
Suppose an urn contains n balls with the numbers 1 to n. We now draw a ball, note the
number and put it back into the urn. This procedure is repeated s times. This experiment
has s levels, where no level has an influence on the other levels. Furthermore, we have
n possibilities at each level. The product rule thus implies that we have n^s possible
outcomes for the whole experiment.
Remark 2.5.1.
a) In an ordered sample, the order in which the elements of the sample have been drawn
is important. A good example is words. If we draw the letters L, I, S, T, E, N from
the alphabet, then depending on the order of the letters one can get the words
LISTEN, ENLIST, SILENT or TINSEL.
Obviously, all of these have a different meaning.
b) If a ball in the above experiment is drawn purely at random, i.e. we have a Laplace
experiment, then each of the n^s possible outcomes has the probability 1/n^s.
c) There also exists another (a 'dual') presentation of the same experiment. Instead of
drawing a ball s times, we throw s balls into n numbered urns. To be more precise, we
have s distinguishable (for instance numbered) balls and n distinguishable urns. We
now place these balls into the urns, see Figure 2.11. In this experiment we can place
as many balls as we like into each urn. Since each of the s balls falls into one of
the n urns, we can represent this experiment by a vector of length s:

(a_1, . . . , a_s) ∈ {1, . . . , n}^s.    (2.5.1)

The number a_j stands for the urn in which the j-th ball has been placed. Since each
a_j can take n different values, the total number of possibilities is

n · . . . · n (s times) = n^s.    (2.5.2)
Figure 2.12: Illustration of a function X : Ω → B.
Example 2.5.5. How many subsets does a set Ω with s elements have (including
the set Ω itself and the empty set ∅)? Notice that this question is equivalent to
computing the cardinality of the power set PΩ of Ω (see Definition B.1.10). We compute
the number of subsets of Ω with the help of Example 2.5.4. We define for a subset A ⊂ Ω
a function 1_A as

1_A : Ω → {0, 1} with 1_A(ω) := 1 if ω ∈ A, and 1_A(ω) := 0 if ω ∉ A.    (2.5.4)

It is clear that different subsets lead to different functions and that a subset can be
recovered from the assigned function. We thus see that the number of subsets of Ω is
equal to the number of functions from Ω to {0, 1}. This number has been computed in
Example 2.5.4. Since Ω has s elements and {0, 1} has two elements, Ω has 2^s subsets.
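The correspondence between subsets and {0, 1}-valued functions can be enumerated by a computer. A brief Python sketch (a toy check for s = 3; an illustration only):

    from itertools import product

    Omega = ['x', 'y', 'z']                        # a set with s = 3 elements

    # Each function 1_A : Omega -> {0, 1} is a tuple of s values;
    # it corresponds to exactly one subset A.
    subsets = [{w for w, v in zip(Omega, values) if v == 1}
               for values in product([0, 1], repeat=len(Omega))]

    print(len(subsets))                            # 2**3 = 8 subsets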
Remark 2.5.6. The function 1_A is called the indicator function and is, for instance,
helpful for writing probabilities as expectations of random variables, see Section 3.8.
Example 2.5.7. How many possible bets can be made in football toto? At this
point one needs to know that in football toto one bets on 11 football games. For each
game, one has to choose between victory of the home team ('1'), draw ('0') or victory of
the guest team ('2'). Thus a bet in football toto corresponds to an experiment with 11
levels and with 3 possibilities at each level. We therefore have

3^11 = 177147

possible bets in football toto. A possible question at this point is: how many bets are
completely wrong? Per game, there are 2 wrong tips and we thus have in total

2^11 = 2048

completely wrong bets. If we compare this with the 177147 possible bets, we see there are
only a few completely wrong bets (ca. 1.1%). It is thus very likely that somebody,
even if he or she has absolutely no knowledge about football, gives the right tip for at
least one game.
Figure 2.13: Urn with n = 5 balls, after drawing twice without replacement.
We again have s levels in this experiment. The difference to the previous experiment is
that the different levels are no longer copies of each other. Obviously, the drawn balls
have an influence on which numbers remain in the urn, but not on the number of
balls at each level. We thus can apply the product rule. At the first level we have n
possibilities, at the second level we have n − 1 possibilities, . . . , and at the s-th level we
have n − s + 1 possibilities. We thus get with the product rule

(n)_s := n · (n − 1) · . . . · (n − s + 1)    (2.5.5)

possible outcomes of our experiment. The notation (n)_s is called the falling factorial.
There are other notations for the falling factorial (n)_s used in the literature, such as
nPs and P(n, s), but in this course you should use only the notation (n)_s.
If s = n, i.e. the urn is empty at the end, we have

n! := (n)_n = n · (n − 1) · . . . · 3 · 2 · 1    (2.5.6)

possible outcomes of our experiment. The notation n! is called the factorial. In particular
we see that n different elements can be arranged in n! ways. Thus there are n! different
permutations of n different elements. Typically one defines 0! = 1. Note that (n)_s = 0
for s > n, and the formula thus also gives the right number of possible samples for s > n
(the urn contains only n balls).
Remark 2.5.8.
a) If we draw a sample purely at random (n balls and s levels), then each sample has the
probability

1/(n)_s = (n − s)!/n!.    (2.5.7)
b) We also have a dual description here, which is often more convenient. We have to
distribute s different balls into n different urns, where s ≤ n. The difference to
Section 2.5.1 is that this time we can place at most one ball in each urn, see Figure 2.14.
How many possibilities do we have? The first ball has n possibilities, the second ball
has n − 1 possibilities, . . . , and the s-th ball has n − s + 1 possibilities. This gives

(n)_s = n · (n − 1) · . . . · (n − s + 1)

different possible outcomes of the experiment.
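For numerical work, the falling factorial is easy to implement; recent Python versions also ship it as math.perm. A minimal sketch (the name falling_factorial is just my illustrative choice):

    import math

    def falling_factorial(n, s):
        # (n)_s = n * (n - 1) * ... * (n - s + 1); equals 0 for s > n
        result = 1
        for k in range(s):
            result *= n - k
        return result

    print(falling_factorial(5, 3))                        # 60, as in Example 2.4.5
    print(falling_factorial(5, 3) == math.perm(5, 3))     # True (math.perm needs Python >= 3.8)
    print(falling_factorial(9, 9) == math.factorial(9))   # True, cf. Example 2.5.9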
Example 2.5.9. Anna has 9 (different) books and would like to arrange them
on a bookshelf. In how many ways can she do this? Obviously, the order of the
books plays a role and each book can only be selected once. We thus have (9)_9 = 9!
possibilities.
Example 2.5.10. 8 sprinters run in the 100m final of a sports event. In
how many ways can the medals be distributed? We have a gold, a silver and a
bronze medal. Each sprinter can get only one medal and we thus have (8)3 = 8 · 7 · 6
possibilities.
The following video Sec2.5d contains some additional comments about the Examples 2.5.11,
2.5.12 and 2.5.13 below.
Example 2.5.11. A list of n cities is given and a salesman has to visit
each city on this list. In how many ways can he visit these cities if he
would like to visit each city only once? Obviously, the number of possible
routes is the number of permutations of n elements and is thus n!. If furthermore the
distances between each pair of cities are known, then a natural question is: what is the
shortest possible route that the salesman can take? It is surprisingly hard to find
the answer to this question. This problem is known as the Travelling Salesman Problem
(TSP) and it plays an important role in operations research.
At this point, one might ask why the travelling salesman problem is so hard. Why not
test all possible routes and then choose the optimal one? The reason is that the value
of n! grows extremely fast as n goes to infinity, much faster than exponentially. To get
an impression of how fast n! grows, we consider an example.
Example 2.5.12. How heavy are 100! rice grains? We have 100! ≈ 9.333 · 10^157
and one grain of rice weighs around 25 mg = 0.000025 kg. Thus 100! rice grains weigh
around 2.333 · 10^153 kg. As a comparison, our solar system weighs around 2 · 10^30 kg
and our Milky Way around 3.60 · 10^41 kg.
a_n/b_n = n^2/(n^2 − 5n) = 1/(1 − 5/n) → 1 as n → ∞,    (2.5.11)

a_n/c_n = n^2/((n^2 + 3n)/2) = 2/(1 + 3/n) → 2 as n → ∞.    (2.5.12)

One should point out that a_n ∼ b_n does not imply that a_n − b_n converges to 0. For
Stirling's formula this is definitely wrong. Another example is, for instance,
a_n = n + log n and b_n = n. We have here

a_n/b_n = 1 + (log n)/n → 1 as n → ∞ =⇒ a_n ∼ b_n.    (2.5.13)
Example 2.5.15. How many injective functions exist from Ω = {1, . . . , s} to
B = {1, . . . , n}? A function X is injective if X(i) ≠ X(j) for all i ≠ j with i, j ∈ Ω. For
instance, the function in Figure 2.12 is not injective as X(4) = X(5). If s > n, then it
is obvious that no injective function exists. If s ≤ n, then one can interpret an injective
function

X : Ω → B

as transporting s numbered balls into n urns, where each urn only has enough space for
one ball. The above comments now show that we have (n)_s injective functions.
Example 2.5.16. In how many ways can one arrange 8 rooks on a chessboard
so that they cannot hit each other? At this point one should know that a
chessboard has 8 × 8 squares and that a rook can move any number of vacant squares verti-
cally or horizontally. Thus, the rooks have to be arranged so that there is precisely one
rook in each row and each column. We now arrange the rooks column by column. For the
first rook we have 8 possibilities, for the second rook we have 7 possibilities, and so on.
We thus have in total

8! = 40320

possible arrangements. A similar argument shows that for n rooks on an n × n chessboard
one has n! possible arrangements so that they cannot hit each other.
The video Sec2.5e discusses a further example not contained in the notes.
(n)_s = n!/(n − s)!    (2.5.14)

is the number of ordered samples of {1, . . . , n} with s elements. On the other hand,
there exist s! possible arrangements of s elements, and thus each unordered sample with
s elements corresponds to s! ordered samples. We thus have

\binom{n}{s} = (n)_s/s! = n!/((n − s)! s!) = \binom{n}{n−s}.    (2.5.15)
Remark 2.5.17.
a) We have \binom{n}{0} = 1. The reason is that each set has only one subset with 0
elements: the empty set.

b) Summing over all subset sizes gives

∑_{k=0}^{n} \binom{n}{k} = 2^n.    (2.5.16)

This equation can also be obtained from the binomial law. We have

∑_{k=0}^{n} \binom{n}{k} = ∑_{k=0}^{n} \binom{n}{k} 1^k 1^{n−k} = (1 + 1)^n = 2^n.    (2.5.17)
Example 2.5.18. The Lotto problem: an urn contains 49 balls with numbers from 1 to
49. Seven of these balls are red and all others white. We now choose a sample of 6 balls
without considering the order. We have in total

\binom{49}{6} = (49 · 48 · 47 · 46 · 45 · 44)/(1 · 2 · 3 · 4 · 5 · 6) = 13983816

possible samples. If we choose a sample purely at random, then each sample has the proba-
bility 1/13983816. One question we could ask is, for instance: what is the probability that
a sample consists only of red balls? From the above considerations we know that there are
\binom{7}{6} samples with only red balls. Since we have a Laplace experiment, we get

P['6 red balls'] = \binom{7}{6} / \binom{49}{6}.    (2.5.18)

More generally, what is the probability that our sample contains precisely s red balls? We
have \binom{7}{s} possibilities to choose s red balls. Furthermore, we have \binom{42}{6−s}
ways to choose the remaining 6 − s balls from the 42 white balls. Using the product rule,
we obtain

\binom{7}{s} · \binom{42}{6−s}    (2.5.19)

possible samples with s red balls. Thus

P['s red balls'] = \binom{7}{s} · \binom{42}{6−s} / \binom{49}{6}.    (2.5.20)
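Formula (2.5.20) is easy to evaluate with math.comb. A short Python sketch (an illustration only; the final line is a sanity check, since every sample has some number of red balls between 0 and 6):

    from math import comb

    def p_red(s):
        # binom(7, s) * binom(42, 6 - s) / binom(49, 6), cf. (2.5.20)
        return comb(7, s) * comb(42, 6 - s) / comb(49, 6)

    print(comb(49, 6))                       # 13983816
    print(p_red(6))                          # probability of 6 red balls
    print(sum(p_red(s) for s in range(7)))   # 1.0 up to rounding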
Example 2.5.19. A parliament consists of n members. They would like to
build a committee consisting of s members, one of whom is the chairman. How
many possibilities are there? One possibility is to choose first the chairman (there
are n = \binom{n}{1} possibilities) and then to choose the s − 1 remaining members of the
committee (there are \binom{n−1}{s−1} possibilities to do this). Thus there are

n · \binom{n−1}{s−1}    (2.5.21)

possible ways to build this committee. An alternative way is to choose first the s members
of the committee (there are \binom{n}{s} ways) and then to choose the chairman among the
members of the committee (there are s = \binom{s}{1} possibilities). Since both ways must
give the same number of possible committees, we have

n · \binom{n−1}{s−1} = \binom{n}{s} · s  =⇒  (n/s) · \binom{n−1}{s−1} = \binom{n}{s}.    (2.5.22)
Had we started with distinctly labelled As, say A_1 and A_2, then we would have 24 words
in the list. We can use this to find a quick way to compute the number of different words.
When both A_1 and A_2 are replaced by A, then A_1A_2BC and A_2A_1BC both go to AABC. We
thus see that each word in the first list corresponds to precisely 2 words in the second
list. Hence we have in total 4!/2 = 24/2 = 12 different words.
Another way to get to this result is to interpret the problem as distributing the 4 letters
A, A, B, C over 4 seats. First, we choose 2 seats for the As; there are \binom{4}{2}
possibilities. Since 2 seats remain, there are \binom{2}{1} possibilities to choose a seat
for B. Finally, there is 1 seat remaining for C and we thus have only 1 = \binom{1}{1}
possibility. Using the product rule, we obtain

\binom{4}{2} · \binom{2}{1} · \binom{1}{1} = 12.    (2.5.23)
Example 2.5.21. Let us now extend Example 2.5.20 a little bit. In how many ways can
one arrange three As, four Bs and five Cs? (We assume at this point that we cannot
distinguish identical letters.) In total, we have 12 letters and one can permute them in
12! ways, but as in Example 2.5.20, not all permutations give a different word.
As above, we interpret this problem as distributing the letters over 12 seats. First, there
are \binom{12}{3} possibilities to choose the 3 seats for the As. Since 9 seats remain,
there are \binom{9}{4} possibilities to choose the 4 seats for the Bs. Finally, there are
5 seats remaining for the 5 Cs and thus there is 1 = \binom{5}{5} possibility. In total we
get with the product rule

\binom{12}{3} · \binom{9}{4} · \binom{5}{5} = 12!/(3!4!5!) = 27720.    (2.5.24)
Example 2.5.22. In a similar way one can compute the number of possible arrangements
of 0s and 1s, if the number of 0s is a and the number of 1s is b (a, b ∈ N). In total we
have a + b elements to permute. Arguing as in Examples 2.5.20 and 2.5.21 gives

\binom{a+b}{a} = \binom{a+b}{b}

possibilities to arrange the 0s and 1s.
Example 2.5.23. What is the probability that we throw 'Heads' exactly 5 times
when we throw a fair coin 10 times? We begin by determining Ω and P. We have

Ω = {(ω_1, . . . , ω_10); ω_j ∈ {0, 1}} and P[ω] = 1/2^10 for all ω ∈ Ω,

where 1 stands for 'Heads' and 0 for 'Tails'. Let A be the event that we throw exactly 5
times 'Heads'. All ω consist only of 1s and 0s, and thus all ω ∈ A have five 1s and five
0s. Since we are in the situation of a Laplace experiment, we have to compute the
cardinality of A. We thus need to know in how many ways one can arrange five 0s and five
1s. But this is precisely the topic of Example 2.5.22 and we have |A| = \binom{10}{5}.
Therefore

P[A] = |A|/|Ω| = \binom{10}{5}/2^10 = 252/1024 ≈ 1/4.    (2.5.25)
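The count |A| = \binom{10}{5} can be double-checked by listing all 2^10 outcomes. A brief Python sketch (an illustration only):

    from itertools import product
    from math import comb

    Omega = list(product([0, 1], repeat=10))        # all 2**10 outcomes
    A = [w for w in Omega if sum(w) == 5]           # exactly five 1s ('Heads')

    print(len(A), comb(10, 5))                      # both 252
    print(len(A) / len(Omega))                      # 252/1024 = 0.24609375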
2.5.4 Mixed samples
We consider in this section some combinatorial examples involving the three cases we have
studied in the previous sections. The video Sec2.5h contains some examples not contained
in the notes.
Example 2.5.24. Peter has 11 (different) books: 4 about Geometry, 5 about Algebra and
2 about Probability. Peter would like to arrange them on a bookshelf.

b) Suppose now that Peter would like to order the books by topic, with
Geometry first, then Algebra and then Probability. How many
arrangements are possible? There are 4! ways to arrange the Geometry books, 5!
ways to arrange the Algebra books and 2! ways to arrange the Probability books. Since
the arrangement within each topic has no influence on the arrangements of the other
topics, we can use the product rule and get 4! · 5! · 2! possibilities.

c) How many arrangements are possible if Peter would like to have the
books about the same topic standing together, but with no preference
for the order of the topics? First, we choose the order of the topics. We have
3 topics and can thus arrange them in 3! = 6 ways. Furthermore, for each possible
order of the topics we have 4! · 5! · 2! possibilities to arrange the books. We thus get
with the sum rule in total 6 · 4! · 5! · 2! possible arrangements.
Example 2.5.25. How many people do you need in a room to have a better
than 50% chance that two of them share a birthday? Let n be the number of
people in the room. Let us compute the probability that all n people have a different
birthday. For simplicity, we assume at this point that the year has 365 days and each day
has the same probability. There are now (365)n = 365 · 364 · . . . · (365 − n + 1) possibilities
if all persons should have a different birthday (ordered sample without replacement). Fur-
thermore, there are 365n possibilities for the birthdays if there are no restrictions (ordered
sample with replacement). We thus get
P['At least 2 persons have the same birthday'] = 1 − (365)_n / 365^n.    (2.5.26)
Testing some numbers for n (with the computer), one obtains for n = 23 that the above probability is 0.507. We thus see that already a relatively small group has a good chance that 2 persons share a birthday.
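One can reproduce this computation with a few lines of Python; the following is a minimal sketch using only the standard library:

    # Numerical check of (2.5.26): probability that at least two of n
    # people share a birthday, under the simplifying 365-day model.
    def p_shared(n):
        p_all_different = 1.0
        for i in range(n):
            p_all_different *= (365 - i) / 365   # builds (365)_n / 365^n factor by factor
        return 1 - p_all_different

    print(p_shared(22))   # ~0.476
    print(p_shared(23))   # ~0.507, the first n with probability > 1/2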
Example 2.5.26.
b.) How many codewords are possible if there are no restrictions on how often a colour can occur? This is an ordered sample with replacement and we thus have 6^4 = 6 · 6 · 6 · 6 = 1296 possibilities.
Example 2.5.29. How many divisors does the number 1'000'000'000 have (including 1 and 1'000'000'000 itself)? We have 1'000'000'000 = 10^9 = 2^9 · 5^9. Thus a divisor of 1'000'000'000 has the form 2^k · 5^ℓ with 0 ≤ k, ℓ ≤ 9. This gives 10 · 10 = 100 divisors.
2.6 Conditional probability
Please watch now the following video Sec2.6a and then proceed reading. In this section, we study the probability that an event A occurs if we already know something about the outcome of the experiment. We will then call this probability the conditional probability. As motivation, we take a look at two examples.
Example 2.6.2. Let us consider again the situation of Example 2.3.2. We have there an urn containing three 'a's and two 'n's and we draw two times a ball without replacement. We now compute the probability to draw an 'n' in the second step. This is
P['n' in the second step] = P[(a, n)] + P[(n, n)] = 3/10 + 1/10 = 2/5.    (2.6.4)
If we know nothing about the outcome of the experiment, then we expect in 2 of 5 cases to draw an 'n'. However, the situation is different if we already know that the first ball is an 'a'. In this case, there are two 'a's and two 'n's left in the urn. The probability to draw an 'n' is thus 1/2. On the other hand, if we know that the first ball is an 'n', then there is one 'n' and three 'a's left in the urn. Thus the probability to draw an 'n' is 1/4.
We have made a similar observation in Example 2.2.4 with the club with 100 members. If we choose a member purely at random, then the probability to select a vegetarian is 0.4. On the other hand, if we already know that the selected member is a woman, then the probability to select a vegetarian is 2/3, which is a lot higher. We thus see that some knowledge about the outcome of an experiment can have a significant influence on the probability of an event.
We will now introduce a clear, formal description for the above ad hoc computations. For this, let (Ω, P) be a probability space and A, B ⊂ Ω two events. We now assume that we already know that the event B will occur. What is in this situation the probability
that also A occurs? We will call this probability P [A|B]. At this point, we have not yet
given a proper definition of P [A|B], but we have to have in Example 2.6.1
P['50 times head' | 'We have tossed 49 times head'] = 1/2    (2.6.5)
and in Example 2.6.2
P['n' in the second step | 'n' in the first step] = 1/4.    (2.6.6)
Our first aim is to find a proper mathematical definition. Let us begin with the special
cases A = B c and A = B. For this, suppose that someone is doing a random experiment,
but does not tell you the result. He only tells you that the event B has occurred.
• P[B^c|B]: If ω ∉ B, then this ω cannot have occurred, because we already know that an ω ∈ B has occurred. Thus this ω has in this situation probability 0. In formulas, P[ω|B] = 0 for all ω ∉ B. Similarly, we deduce P[B^c|B] = 0.
• P [B|B]: In words, P [B|B] is the probability that B occurs if we already know that B
occurs. Clearly, the only possibility is P [B|B] = 1.
Let us now take a look at an arbitrary event A. As motivation to find the right definition
for P [A|B], we take a look at Figure 2.15.
Obviously, only those ω ∈ A which are also in B can occur. This corresponds to the hatched part in Figure 2.15. Thus only the elements in the intersection A ∩ B can occur for the event A. The probability P[A|B] thus should be proportional to P[A ∩ B]. A potential candidate for P[A|B] is P[A ∩ B]. However, we don't have P[B|B] = 1 for this candidate since
P[B ∩ B] = P[B],
and this probability is usually less than 1. To reach P[B|B] = 1, we divide P[A ∩ B] by P[B]. At this point we have to assume that P[B] ≠ 0. We can do this without further ado: we obviously get logical problems if we try to compute the probability of the event A under the assumption that an event B with P[B] = 0 occurs. We thus define
Definition 2.6.3. Let (Ω, P) be a (discrete) probability space and A, B ⊂ Ω be given with P[B] ≠ 0. We then define the conditional probability of A given B as
P[A|B] := P[A ∩ B] / P[B].    (2.6.7)
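Definition 2.6.3 is easy to turn into code. The following Python sketch (the function names are ours, chosen for illustration) computes conditional probabilities on a finite sample space with events given as sets:

    # A minimal sketch of Definition 2.6.3 for a finite sample space.
    def prob(event, P):
        return sum(P[w] for w in event)

    def cond_prob(A, B, P):
        pB = prob(B, P)
        assert pB != 0, "P[B] must be non-zero"
        return prob(A & B, P) / pB

    # Fair die: P[{6} | 'number is even'] = (1/6)/(1/2) = 1/3.
    P = {w: 1/6 for w in range(1, 7)}
    print(cond_prob({6}, {2, 4, 6}, P))   # 0.333...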
Example 2.6.6. In New York, the probability that a random teenager owns a skateboard
is 0.48 and that a random teenager owns a skateboard and roller blades is 0.39. What
is the probability that a teenager owns roller blades if we know that the teenager owns a
skateboard? This probability is
0.39/0.48 ≈ 0.81.    (2.6.15)
The following video Sec2.6b discusses a further example not contained in the notes.
Exercise 2.6.7. We roll a fair die. Find the probability that the number is > 2 and the probability that the number is a prime number. Also find the probability that the number is > 2 given that the number is prime.
In most cases, our intuition agrees with the computed conditional probability. However,
there are examples, where this is not the case.
Example 2.6.8. We have three indistinguishable purses and each contains two coins. One
purse contains two gold coins, another contains two silver coins and the third contains
one gold coin and a silver coin.
GG GS SS
A purse is selected at random, then at random a coin is selected from it. The selected coin
turns out to be gold. What is the probability that the other coin in the purse is also gold?
P[G left | G selected] = P[G selected ∩ G left] / P[G selected] = P[two G purse] / P[G selected] = (1/3)/(1/2) = 2/3.
At first sight, one would expect 1/2. Let us thus take a closer look at this experiment to see if the 1/2 or the 2/3 is more reasonable. First, we illustrate it with a tree diagram, see Figure 2.16. We see that there are 3 paths where we select a gold coin. Two of them correspond to the first purse and one to the second purse. Since all paths in this experiment have the same probability, we must have 2/3. Thus the 2/3 is indeed the right result. At this point one might ask: why does our intuition fail? A possible explanation is that one might think the events A = 'We select a gold coin' and B = 'We select a purse with gold coin(s)' are the same, but they are not. We clearly have A ⊂ B, but the event 'We select a silver coin and a gold coin is remaining' is contained in B and is not contained in A. Thus A ≠ B.
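If one does not trust the computation, one can also simulate the experiment. The following Python sketch (a Monte Carlo estimate, so the output only approximates 2/3) repeats the experiment many times:

    # Monte Carlo check of Example 2.6.8 with the standard library.
    import random

    purses = [("G", "G"), ("G", "S"), ("S", "S")]
    selected_gold = gold_left = 0
    for _ in range(100_000):
        purse = random.choice(purses)           # choose a purse at random
        coin, other = random.sample(purse, 2)   # draw one coin, the other stays
        if coin == "G":
            selected_gold += 1
            if other == "G":
                gold_left += 1
    print(gold_left / selected_gold)   # close to 2/3, not 1/2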
We next study the properties of conditional probability. We start in Section 2.6.1 with some basic observations. We then give three rules for computations with conditional probability. These three are the multiplication rule (see Lemmas 2.6.13 and 2.6.14), the law of total probability (see Theorems 2.6.17 and 2.6.21) and Bayes' theorem (see Theorem 2.6.23). In particular, we will use the multiplication rule and the law of total probability to justify tree diagrams and the path rule, see Proposition 2.6.16.
Furthermore
P[A|B1] + P[A|B2] = P[A|B] + P[A|B^c] = 1/3 + P[A ∩ B^c]/P[B^c] = 1/3 + P[{3}]/P[{1, 3, 5}] = 2/3.
Thus the third point is not true. Note that P[A|B1 ∪ B2] ≠ P[A|B1] + P[A|B2] for most events A, B1 and B2. Let us now summarise these three observations as a lemma.
Lemma 2.6.9. Let (Ω, P) be a (discrete) probability space. We then have:
a) Let A and B be events with P[A] > 0 and P[B] > 0. We then have in general
P[A|B] ≠ P[B|A].    (2.6.18)
b) Let A1 and A2 be disjoint events and B an event with P[B] > 0. Then
P[A1 ∪ A2|B] = P[A1|B] + P[A2|B].    (2.6.19)
c) Let A be an arbitrary event and B1, B2 be disjoint events with P[B1] > 0 and P[B2] > 0. We then have in general
P[A|B1 ∪ B2] ≠ P[A|B1] + P[A|B2].    (2.6.20)
Notice that the term ‘in general’ indicates that (2.6.18) and (2.6.20) are true for most
events, but not all. Indeed, if we choose A = B then P [A|B] = P [B|A] and if we choose
an A with P [A] = 0 then P [A|B1 ∪ B2 ] = P [A|B1 ] + P [A|B2 ].
We now show that we can extend the property (2.6.19) a little bit. We have
Lemma 2.6.10. Let (Ω, P) be a (discrete) probability space. Let further B ⊂ Ω be given with P[B] > 0. Then
A ↦ P[A|B]
is a probability measure on Ω.
Proof of Lemma 2.6.10. We verify here the properties of Definition 2.1.19. First we have to check that 0 ≤ P[A|B] ≤ 1 for each A ⊂ Ω. The inequality 0 ≤ P[A|B] is obvious. On the other hand, we have A ∩ B ⊂ B and thus P[A ∩ B] ≤ P[B]. We get
P[A|B] = P[A ∩ B]/P[B] ≤ P[B]/P[B] = 1.
Similarly, we get
P[Ω|B] = P[Ω ∩ B]/P[B] = P[B]/P[B] = 1.
In contrast, for a fixed event A, the map B ↦ P[A|B] is not a probability measure on Ω. This follows immediately from Lemma 2.6.9. Another way to see this is for instance with B = Ω. We then have
P[A|Ω] = P[A ∩ Ω]/P[Ω] = P[A]/1 = P[A].    (2.6.25)
Thus B ↦ P[A|B] could only be a probability measure if P[A] = 1. But if we have P[A] = 1, then P[A|B] = 1 for all B with P[B] ≠ 0.
Exercise 2.6.12. Let A be an event with P[A] = 1. Show that P[A ∩ B] = P[B] for all events B ⊂ Ω.
Lemma 2.6.13 (Multiplication rule). Let (Ω, P) be a (discrete) probability space and A, B ⊂ Ω with P[B] > 0. We then have
P[A ∩ B] = P[A|B] · P[B].    (2.6.26)
Figure 2.17: The path —P[B]→ B —P[A|B]→ A in a two level experiment.
Proof. Equation (2.6.26) follows immediately from Definition 2.6.3. More precisely
P[A|B] = P[A ∩ B]/P[B]  ⟺  P[A|B] · P[B] = P[A ∩ B].    (2.6.27)
It is very convenient to interpret Lemma 2.6.13 in the context of the path rule. However,
Lemma 2.6.13 is true for all events A and B with P [B] > 0 and an interpretation as path
as in Figure 2.17 is not required. Thus the path rule for a two level experiment is just a
special case of Lemma 2.6.13.
We are of course also interested in random experiments with more than two levels. We
thus have to extend Lemma 2.6.13 a little bit. For this, suppose we are interested in the
probability that A1 occurs in the first level, A2 in the second level, . . . , and An in the
n-th level. In other words, we are interested in the probability that A1 and A2 and . . . and An occur, i.e. we are interested in P[⋂_{i=1}^n Ai] (see (2.1.37)). The path corresponding
to this probability is given in Figure 2.18. The probabilities at the arrows can now be
—P[A1]→ A1 —P[A2|A1]→ A2 —P[A3|A1∩A2]→ A3 —P[A4|A1∩A2∩A3]→ A4 · · ·
interpreted as follows: P [A1 ] is the probability that in the first step the event A1 occurs,
P [A2 |A1 ] is the probability that A2 occurs in the second step if A1 occurs in the first step,
P [A3 |A1 ∩ A2 ] is the probability that A3 occurs in the third step if A1 and A2 occur in
the first and second step and so on. The following lemma shows that the path rule in
Figure 2.18 gives the right answer.
Lemma 2.6.14 (Multiplication rule, general case). Let A1, . . . , An ⊂ Ω be given with P[A1 ∩ A2 ∩ · · · ∩ An−1] > 0. We then have
P[⋂_{i=1}^n Ai] = P[A1] · P[A2|A1] · P[A3|A1 ∩ A2] · . . . · P[An|A1 ∩ · · · ∩ An−1].    (2.6.28)
Proof. We apply Lemma 2.6.13 with B = A1 ∩ · · · ∩ An−1 and A = An and get
P[⋂_{i=1}^n Ai] = P[An|A1 ∩ · · · ∩ An−1] · P[A1 ∩ · · · ∩ An−1].
We then apply Lemma 2.6.13 in the same way to P[A1 ∩ · · · ∩ An−1] and split off P[An−1|A1 ∩ · · · ∩ An−2]. We then proceed in this way till only P[A1] is left. This completes the proof.
Similar to Lemma 2.6.13, Lemma 2.6.14 is true for arbitrary events A1, . . . , An and does not require that A1, . . . , An can be interpreted as a path in a multi level experiment.
Let us now consider a concrete example to compare the results obtained with the path rule and with the multiplication rule and check if both methods indeed give the same result.
Example 2.6.15. We have an urn with 5 red and 2 blue balls. We now draw 3 times purely randomly a ball without replacement. We denote by Cj the colour of the ball drawn in the j-th turn. We write r for red and b for blue. We now compute P[C1 = r, C2 = b, C3 = r] with the path rule (see Proposition 2.3.4) and with Lemma 2.6.14. We begin with the path rule. The interesting path is
· —5/7→ r —2/6→ rb —4/5→ rbr    (2.6.31)
and thus
P[C1 = r, C2 = b, C3 = r] = 5/7 · 2/6 · 4/5 = 4/21.    (2.6.32)
If one now uses Lemma 2.6.14, one obtains
P[C1 = r, C2 = b, C3 = r] = P[C1 = r] · P[C2 = b|C1 = r] · P[C3 = r|C1 = r, C2 = b].
We then have
P[C1 = r] = 5/7 and P[C2 = b|C1 = r] = P[C2 = b, C1 = r]/P[C1 = r] = ((5 · 2)/(7 · 6)) / (5/7) = 2/6.
Thus P [C1 = r] and P [C2 = b|C1 = r] are probabilities at the first and the second arrow
in (2.6.31). A similar argument shows that P [C3 = r|C1 = r, C2 = b] is the probability at
the third arrow in (2.6.31). Thus the path rule and the multiplication rule give the same
results as expected.
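One can also verify (2.6.32) by brute force. The following Python sketch enumerates all equally likely ordered draws of 3 balls from the urn:

    # Brute-force check of Example 2.6.15.
    from itertools import permutations
    from fractions import Fraction

    balls = ["r"] * 5 + ["b"] * 2
    draws = list(permutations(balls, 3))                 # all ordered triples of distinct balls
    hits = sum(1 for d in draws if d == ("r", "b", "r"))
    print(Fraction(hits, len(draws)))                    # 4/21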
If one also takes a look at the Examples 2.3.1 and 2.3.2, one sees immediately that the entries at the tree diagrams are also conditional probabilities, obtained with the multiplication rule. We thus can summarize this section as
Proposition 2.6.16. The path rule in Proposition 2.3.4 is true and is a special case
of Lemma 2.6.14.
Before we verify this, we have to take a look at the probability tree in Figure 2.19.
Figure 2.19: Probability tree for a two level experiment with a splitting into two cases.
Why do we have in the first level a splitting into B and B^c for some event B? Let us reformulate this question as follows: When we have in the first level a splitting into two cases B1 and B2, what should these events fulfil? First, B1 and B2 have to be different cases. Thus B1 and B2 have to be mutually exclusive, i.e. B1 ∩ B2 = ∅. If this were not true, then B1 and B2 could both occur at the same time, which we obviously don't want. Second, all possibilities have to be covered by B1 and B2 and thus B1 ∪ B2 = Ω. If we had B1 ∪ B2 ≠ Ω, then there would exist at least one ω ∈ Ω belonging neither to B1 nor to B2. Thus this ω wouldn't be covered by the cases B1 and B2. Obviously, we also don't want this. The two properties B1 ∩ B2 = ∅ and B1 ∪ B2 = Ω clearly imply B1 = B2^c. For simplicity, we write just B instead of B1. We now justify (2.6.34). We have
Theorem 2.6.17 (Law of total probability, simple version). Let (Ω, P) be a (discrete) probability space and A, B ⊂ Ω with 0 < P[B] < 1. Then
P[A] = P[A|B] · P[B] + P[A|B^c] · P[B^c].
Proof. The splitting in Figure 2.20 together with Lemma 2.1.15.a) gives
P[A] = P[A ∩ Ω] = P[A ∩ (B ∪ B^c)] = P[(A ∩ B) ∪ (A ∩ B^c)]
     = P[A ∩ B] + P[A ∩ B^c] = P[A ∩ B] · P[B]/P[B] + P[A ∩ B^c] · P[B^c]/P[B^c]
     = P[A|B] · P[B] + P[A|B^c] · P[B^c].
We have used that the events A ∩ B and A ∩ B c are disjoint.
Example 2.6.18. We have two identical urns with red and blue balls. The first urn
contains 4 red and 5 blue balls and the second urn contains 3 red and 6 blue balls. We
now choose randomly an urn and draw a ball. What is the probability that we draw a red
ball? It is easy to determine the probability to draw a red ball if we know which urn we
have chosen. We thus introduce the events
A := ‘A red ball is drawn’, (2.6.36)
B := ‘The first urn has been chosen’. (2.6.37)
Clearly, B c is the event ‘The second urn has been chosen’. We now have
P[B] = P[B^c] = 1/2, P[A|B] = 4/9 and P[A|B^c] = 3/9.    (2.6.38)
We thus get with Theorem 2.6.17
P[A] = 1/2 · 4/9 + 1/2 · 3/9 = 7/18.    (2.6.39)
Similar to Lemmas 2.6.13 and 2.6.14, Theorem 2.6.17 does not require the interpretation as a two level experiment.
Theorem 2.6.17 can be used if the sample space Ω splits into two parts, i.e. we have two different cases. Of course, we would also like to consider situations with more than two different cases. For instance, how do we handle Example 2.6.18 if we had three urns? Let us consider the probability tree for a two level experiment with three possible cases in the first level, see Figure 2.21.
Figure 2.21: Probability tree for a two level experiment with a splitting into three cases.
Similar to the considerations after (2.6.34), we would like the events B1, B2 and B3 to be mutually exclusive and to cover the whole of Ω. Events with these properties are called a partition of Ω. This can of course be extended to a general number of events.
Definition 2.6.20 (Partition). Let (Ω, P) be a (discrete) probability space and B1, . . . , Bn be events. We call the events B1, . . . , Bn a partition of Ω if
Bi ∩ Bj = ∅ for all i ≠ j and Ω = ⋃_{i=1}^n Bi.
Theorem 2.6.21 (Law of total probability, general version). Let (Ω, P) be a (discrete) probability space and A be an arbitrary event and B1, . . . , Bn be a partition of Ω. We then have
P[A] = ∑_{i=1}^n P[A|Bi] · P[Bi].    (2.6.40)
The idea for the proof of Theorem 2.6.21 is the same as for Theorem 2.6.17. More precisely,
this means we split Ω and use conditional probability to consider each part separately. We
have given in Figure 2.22 an illustration for the splitting of Ω into three events B1 , B2 , B3 .
Theorem 2.6.21 now completes our justification of probability trees and also completes
the proof of Proposition 2.3.4.
Example 2.6.22. We have three identical urns with red and blue balls. The first urn
contains 4 red and 5 blue balls, the second urn contains 3 red and 6 blue balls and the
third urn contains 5 red and 4 blue balls. We now choose randomly an urn and draw a
ball. What is the probability that we draw a red ball? It is easy to determine the probability to draw a red ball if we know which urn we have chosen. We thus introduce the events
A := 'A red ball is drawn' and Bi := 'The i-th urn has been chosen' for i = 1, 2, 3.
Clearly, we have Bi ∩ Bj = ∅ for i 6= j since it is not possible to choose more than one
urn. Furthermore, B1 ∪ B2 ∪ B3 = Ω since we have to choose one of the three urns. We
now have
P[B1] = P[B2] = P[B3] = 1/3, P[A|B1] = 4/9, P[A|B2] = 3/9 and P[A|B3] = 5/9.    (2.6.45)
We thus get with Theorem 2.6.21
P[A] = 1/3 · 4/9 + 1/3 · 3/9 + 1/3 · 5/9 = 4/9.    (2.6.46)
Theorem 2.6.23 (Bayes' Theorem). Let (Ω, P) be a (discrete) probability space, A, B ⊂ Ω with P[A] > 0 and P[B] > 0. Then
P[B|A] = P[B] · P[A|B] / P[A].    (2.6.47)
Proof. We have
P[B|A] = P[A ∩ B]/P[A] = (P[A ∩ B]/P[A]) · (P[B]/P[B]) = (P[A ∩ B]/P[B]) · (P[B]/P[A]) = P[A|B] · P[B]/P[A].
Exercise 2.6.24. Find a sufficient and necessary condition that P [A|B] = P [B|A].
Corollary 2.6.25. Let (Ω, P) be a (discrete) probability space and A be an arbitrary event and B1, . . . , Bn be a partition of Ω. We then have for all 1 ≤ i ≤ n
P[Bi|A] = P[Bi] · P[A|Bi] / (∑_{j=1}^n P[Bj] · P[A|Bj]).    (2.6.48)
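Corollary 2.6.25 translates directly into a small function. The following Python sketch (function and variable names are ours) computes the posterior probabilities P[Bi|A] from the priors P[Bi] and the likelihoods P[A|Bi]; we test it with the numbers of Example 2.6.22:

    # A sketch of Corollary 2.6.25: posterior weights from priors and likelihoods.
    def bayes(priors, likelihoods):
        joint = [p * l for p, l in zip(priors, likelihoods)]
        total = sum(joint)                  # law of total probability, (2.6.40)
        return [j / total for j in joint]

    # Three urns from Example 2.6.22: which urn did a red ball come from?
    print(bayes([1/3, 1/3, 1/3], [4/9, 3/9, 5/9]))   # [1/3, 1/4, 5/12]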
Example 2.6.26. Suppose we have two dice, one is fair and one is manipulated. The manipulated die always gives the number 6 and cannot be visually distinguished from the fair one. We choose purely randomly a die and we would like to know if this die is the fair or the manipulated one. We introduce the events
F := 'The fair die has been chosen' and M := 'The manipulated die has been chosen'.
We obviously have P[F] = P[M] = 1/2 since we don't know more about the chosen die. We now roll the chosen die two times and get two times the number 6. What is the probability that this is the manipulated die? This means we have to compute P[M|'two times a 6']. We now get with Theorem 2.6.23
P[M|'two times a 6'] = P['two times a 6'|M] · P[M] / P['two times a 6'].
We know P[M] and clearly P['two times a 6'|M] = 1. We thus have to compute P['two times a 6']. We use here the path rule; there are two paths leading to 'two times a 6'. These paths are
—1/2→ F —(1/6)^2→ 'two times a 6'  and  —1/2→ M —1→ 'two times a 6'.    (2.6.52)
We thus get
P['two times a 6'] = 1/2 · (1/6)^2 + 1/2 · 1.    (2.6.53)
This then gives
P[M|'two times a 6'] = (1/2 · 1) / (1/2 · (1/6)^2 + 1/2 · 1) ≈ 0.97.    (2.6.54)
We thus see that it is very likely that we have chosen the manipulated die.
The video Sec2.6h contains an extension of Example 2.6.26.
Example 2.6.27. The infection rate for BSE in a herd of cows is 0.1 %, i.e. 1 of 1000 cows is suffering from BSE. Since this is a very dangerous disease, one has invented a test for this disease: If a cow has BSE, then the test is positive with a probability of 99.9 %. If a cow does not have BSE, then the test is negative with a probability of 95 %, i.e. the test shows that the cow is healthy. What is the probability that a positively tested cow has BSE?
We use the probability space (Ω, P) with Ω = {All cows in the herd} and P[ω] = 1/|Ω|. We now introduce the events
B := 'The cow has BSE' and ⊕ := 'The test is positive'.
We are interested in P [B|⊕] and know P [⊕|B] , P [⊕|B c ] and P [B]. We thus use (2.6.47)
and get
P[B|⊕] = P[⊕|B] · P[B] / P[⊕].    (2.6.55)
We now have to compute P[⊕]. Since we know P[⊕|B] and P[⊕|B^c], we can use Theorem 2.6.17 and get
P[⊕] = P[⊕|B] · P[B] + P[⊕|B^c] · P[B^c].
Using the given probabilities, we obtain
P[B|⊕] = (0.999 · 0.001) / (0.999 · 0.001 + 0.05 · 0.999) = 1/(1 + 50) = 1/51 ≈ 0.0196.    (2.6.58)
If one looks at the rates of the test above, one would expect a very good test; however, only roughly 1 of 50 positively tested cows really has BSE. It thus would be a good idea to test a positively tested cow a second time.
Exercise 2.6.28.
a) Explain why the test in Example 2.6.27 is good to identify healthy cows, but is not good to find sick cows. (Hint: Compute P[B|⊖], where ⊖ := 'The test is negative'.)
We have mentioned in Section 2.6.3 and at the beginning of this section that we have in general P[A|B] ≠ P[B|A]. However, our intuition very easily confounds P[A|B] with P[B|A]. Let us consider an example. The following example is also available as video Sec2.6i.
Example 2.6.29. The company SDPT would like to sell a new kind of polygraph test to the police. SDPT says that this test is 100% positive if the tested person lies. Should the police buy this test? Most people would intuitively say: If what SDPT claims is correct, then this is a good test and the police should buy it. Let us now take a more careful look at this test and check how good it really is. We introduce
L := 'The tested person lies', T := 'The tested person says the truth' and ⊕ := 'The test is positive'.
The main question at this point is: If the test is positive, does this imply that the tested
person lies? To answer this question, we compute the probability that the tested person
lies if the test is positive. This means we have to compute P [L|⊕]. We know from SDPT
that P [⊕|L] = 1. Applying Bayes theorem then gives
P[L|⊕] = P[⊕|L] · P[L] / P[⊕] = P[⊕|L] · P[L] / (P[⊕|L] · P[L] + P[⊕|T] · P[T]).    (2.6.59)
Obviously, some information is missing. P [L] and P [T ] clearly depend on the situation.
On the other hand, the probability P [⊕|T ] only depends on the test itself and can be
interpreted as the probability that the test is positive if the tested person says the truth.
Ideally, one would have P[⊕|T] = 0 (⟺ P[⊖|T] = 1). In this case, we would have
P [L|⊕] = 1 and the test of SDPT would be perfect. However, SDPT says nothing about
P [⊕|T ]. In principle it would be possible that P [⊕|T ] = 1. This would mean that the test
is always positive when the tested person says the truth. In other words, the test would be
always positive no matter whether the tested person says the truth or lies. Note that the claim of SDPT would still be correct, but the test itself would obviously be useless. We thus see that it is not enough to know only P[⊕|L] to make a reasonable decision.
The video Sec2.6j contains another example that shows that our intuition is often incorrect
when it comes to conditional probabilities.
Definition 2.7.1. Let (Ω, P) be a (discrete) probability space. Two events A, B ⊂ Ω are called (stochastically) independent if
P[A ∩ B] = P[A] · P[B].    (2.7.2)
Example 2.7.2. Let us roll two distinguishable, fair dice and consider the events
A = {‘The number of the first is divisible by 3’} and (2.7.3)
B = {‘The number of the second is divisible by 2’}. (2.7.4)
We set Ω = {(i, j); 1 ≤ i, j ≤ 6}. This is obviously a Laplace experiment (see Definition 2.2.1) and thus P[ω] = 1/|Ω| = 1/36. We have P[A] = 12/36 = 1/3 and P[B] = 18/36 = 1/2. We get
P[A ∩ B] = 6/36 = 1/6 = 1/3 · 1/2 = P[A] · P[B].    (2.7.8)
Thus the events A and B are independent.
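The check in Example 2.7.2 can also be done by enumeration. The following Python sketch computes P[A ∩ B] and P[A] · P[B] exactly:

    # Direct check of Example 2.7.2 on Omega = {(i,j); 1 <= i,j <= 6}.
    from fractions import Fraction

    omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]
    A = {w for w in omega if w[0] % 3 == 0}   # first number divisible by 3
    B = {w for w in omega if w[1] % 2 == 0}   # second number divisible by 2

    def P(E):
        return Fraction(len(E), len(omega))

    print(P(A & B), P(A) * P(B))   # both 1/6, so A and B are independent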
The video Sec2.7b discusses Example 2.7.3.
Example 2.7.3. Let us consider again the situation of Example 2.3.1 and check if our
intuition is correct. More precisely: Let us show that the first and second level of this
experiment are indeed independent in the meaning of Definition 2.7.1. Recall
Ω = {(a, a), (a, n), (n, a), (n, n)} and    (2.7.9)
P[{(a, a)}] = 9/25, P[{(a, n)}] = P[{(n, a)}] = 6/25 and P[{(n, n)}] = 4/25.    (2.7.10)
We consider the events A := 'The first draw gives an a' and B := 'The second draw gives an a'.
Exercise 2.7.4. Let (Ω, P)be as in Example 2.7.3 and show that the events Ac and B
are independent.
Example 2.7.5. A possible question is: Is the interest of people in mathematics and physics (stochastically) independent? We write M for the set of people interested in mathematics and P for the set of people interested in physics. A survey shows that P[M] = 0.7, P[P] = 0.6 and P[M ∩ P] = 0.48 (if we choose a person uniformly at random). We now have
P[M] · P[P] = 0.42 ≠ P[M ∩ P].    (2.7.14)
Thus the events M and P are not independent.
In everyday use, one calls two events independent if they belong to two different experiments which do not have an influence on each other. For stochastic independence this is not required. Let us give an example.
Example 2.7.7. Let us consider the same situation as in Example 2.6.4. This means we
look at the throw of a fair die and study the events
We now show that A and B are independent (even though both events refer to the same
throw!). Indeed, we have computed in Example 2.6.4 that
P[A|B] = P[A ∩ B]/P[B] = 1/3 = P[A].    (2.7.16)
We thus have P[A|B] = P[A] and therefore A and B are independent.
Unfortunately, our intuition for independence is not always correct. It can occur that we guess two events are independent, but they are not. This is for instance the case in Example 2.6.8. A further famous example is
Example 2.7.8 (Monty Hall problem). Suppose we are in a game show. The show master gives us the choice between three doors: behind two doors is a goat and behind one door a car. After we have picked a door, the show master opens a remaining door with a goat. The show master then asks us: 'Would you like to keep your door or change it?'. Intuitively, one would say that both doors have the same probability and it does not matter if we keep the door or change. Let us now compute the probability to win the car. For this, we define
A := {The car is behind the first chosen door.},
Z := {A goat is behind the first chosen door.},
W := {We win the car.},
L := {We don't win the car.}.
We then have P[A] = 1/3 and P[Z] = 2/3. Let us now consider three different strategies. In all three cases, we have the same probability tree, see Figure 2.23. Notice that we have not specified in Figure 2.23 the probabilities at the branches with 'keep the door' and 'change the door' since they depend on the strategy.
Strategy 1: Keep the door.
With this strategy, the branches with ‘keep the door’ have probability 1 and the
branches with 'change the door' have probability 0. We get
P[W] = P[A] · P[W|A] + P[Z] · P[W|Z] = P[A] · 1 + P[Z] · 0 = 1/3.    (2.7.17)
Strategy 2: Change the door.
Here we see that only the right path of the tree leads to the car and thus
P[W] = P[Z] = 2/3.    (2.7.18)
Strategy 3: Choose the door randomly.
We have in this case
P[W] = P[W|A] · P[A] + P[W|Z] · P[Z] = 1/2 · P[A] + 1/2 · P[Z] = 1/2.    (2.7.19)
We thus see that the first level of the experiment has an influence on the second level and
thus both levels cannot be independent.
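The three strategies can also be compared by simulation. The following Python sketch (a Monte Carlo estimate) plays the game many times with the 'keep' and the 'change' strategy:

    # Monte Carlo sketch of the Monty Hall problem from Example 2.7.8.
    import random

    def play(change):
        doors = [0, 1, 2]
        car = random.choice(doors)
        pick = random.choice(doors)
        # the show master opens a remaining door with a goat
        opened = random.choice([d for d in doors if d != pick and d != car])
        if change:
            pick = next(d for d in doors if d != pick and d != opened)
        return pick == car

    for change in (False, True):
        wins = sum(play(change) for _ in range(100_000))
        print("change" if change else "keep", wins / 100_000)   # ~1/3 vs ~2/3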
The video Sec2.7c considers a further example not contained in the notes.
Exercise 2.7.9. Consider the situation of Example 2.7.8, but this time with 1000 doors, goats behind 999 of them, and 998 doors opened after the first step. Compute the probabilities to win the car for each strategy in Example 2.7.8.
We would now like to consider the following question: if A and B are independent, are Ac and B then also independent? Intuitively, one would say yes. The following lemma shows that this is indeed true.
Lemma 2.7.10. Let (Ω, P) be a (discrete) probability space and A, B ⊂ Ω be two events. The following statements are equivalent:
a) A and B are independent.
b) A and B^c are independent.
c) A^c and B are independent.
d) A^c and B^c are independent.
Proof. It is enough to prove the equivalence of a) and b). The equivalence of a) and c) follows by interchanging A and B. Furthermore, the equivalence of c) and d) follows by replacing A by A^c. We now have
P[A ∩ B] = P[A] · P[B] ⟺ P[A ∩ B] + P[A ∩ B^c] = P[A] · P[B] + P[A ∩ B^c]
⟺ P[(A ∩ B) ∪ (A ∩ B^c)] = P[A] · P[B] + P[A ∩ B^c]
⟺ P[A] = P[A] · P[B] + P[A ∩ B^c] ⟺ P[A ∩ B^c] = P[A](1 − P[B]) = P[A] · P[B^c].
We have used in the second equivalence that A ∩ B and A ∩ B^c are disjoint and in the third equivalence that A = (A ∩ B) ∪ (A ∩ B^c). We get with this computation the equivalence of a) and b).
We now define the independence of more than two events. A possible approach is to extend (2.7.2). More precisely:
Definition 2.7.11. Let (Ω, P) be a (discrete) probability space. The events A1, . . . , An with n ≥ 2 are called pairwise independent if
P[Ai ∩ Aj] = P[Ai] · P[Aj] for all i ≠ j.
The question at this point is whether Definition 2.7.11 gets close enough to our intuition of independence. Let us thus think about which properties we would like to have. For this, we look at an example where we definitely should have independence.
Example 2.7.12. Let an urn be given with 8 balls of which 5 are white and 3 black. We
now draw 4 times purely randomly a ball with replacement. We define
Aj := 'The ball in the j-th draw is white.' with 1 ≤ j ≤ 4.    (2.7.21)
One now can show (for instance with the path rule) that
P [Ai ∩ Aj ] = P [Ai ] · P [Aj ] for all i 6= j (2.7.22)
and thus the events A1, A2, A3 and A4 are pairwise independent. Similarly, one can show
P[Ai ∩ Aj ∩ Ak] = P[Ai] · P[Aj] · P[Ak] for all distinct i, j, k, and    (2.7.23)
P[A1 ∩ A2 ∩ A3 ∩ A4] = P[A1] · P[A2] · P[A3] · P[A4].    (2.7.24)
The question at this point is whether the equations (2.7.22), (2.7.23) and (2.7.24) are characteristic for independent events or whether they are only special to this example. Let us thus see whether we can find an interpretation for these equations. We know from the motivation of the independence of two events at the beginning of Section 2.7.1 that
P[Ai ∩ Aj] = P[Ai] · P[Aj] ⟺ P[Ai|Aj] = P[Ai] for i ≠ j.    (2.7.25)
In words this means the occurrence of the event Aj has no influence on the probability that
the event Ai occurs. Do the equations (2.7.23) and (2.7.24) have a similar interpretation?
A possible starting point to answer this question is: What should happen to P [A3 ] and
P [A4 ] if we know that A1 and A2 occur? If the events A1 , A2 , A3 and A4 are independent,
then the natural answer is that this should have of course no influence on the probabilities
P [A3 ] and P [A4 ]. Using conditional probability, this means we should have
P [A3 |A1 ∩ A2 ] = P [A3 ] and P [A4 |A1 ∩ A2 ] = P [A4 ] . (2.7.26)
A simple computation shows that
P [A3 |A1 ∩ A2 ] = P [A3 ] ⇐⇒ P [A1 ∩ A2 ∩ A3 ] = P [A1 ] · P [A2 ] · P [A3 ] and (2.7.27)
P [A4 |A1 ∩ A2 ] = P [A4 ] ⇐⇒ P [A1 ∩ A2 ∩ A4 ] = P [A1 ] · P [A2 ] · P [A4 ] . (2.7.28)
The equation (2.7.23) thus can be interpreted as follows: The occurrence of the events A1
and A2 has no influence on the probability that the event A3 occurs. Similarly for A4 .
With a similar computation one can show that
P[A4|A1 ∩ A2 ∩ A3] = P[A4] ⟺ P[A1 ∩ A2 ∩ A3 ∩ A4] = P[A1] · P[A2] · P[A3] · P[A4].
Thus (2.7.24) can be interpreted as: The occurrence of the events A1 , A2 and A3 has no
influence on the probability that the event A4 occurs. We thus see that the three equations
(2.7.22),(2.7.23) and (2.7.24) get indeed very close to our intuition for independence (for
four events). If we think now about n independent events, then we should have similar
properties. More precisely, if the events A1, . . . , An are independent, then we should have
P[Ai1 ∩ Ai2 ∩ · · · ∩ Aid] = P[Ai1] · P[Ai2] · . . . · P[Aid]    (2.7.29)
for all 1 ≤ i1 < i2 < · · · < id ≤ n and all 2 ≤ d ≤ n. Unfortunately, pairwise independence is not sufficient for this: one can construct three pairwise independent events A, B and C with
P[A ∩ B ∩ C] = P[∅] = 0.    (2.7.31)
Thus the events A, B and C are pairwise independent, but the property (2.7.29) is not fulfilled for d = 3 and thus A, B and C are not independent.
We thus see that Definition 2.7.11 is not strong enough to fulfil our intuitive interpretation of independence. Another possible approach is to try
P[A1 ∩ A2 ∩ · · · ∩ An] = P[A1] · P[A2] · . . . · P[An].    (2.7.32)
The question is again whether (2.7.32) implies pairwise independence and (2.7.29). Unfortunately, this is also not the case.
Example 2.7.14. We consider the Laplace experiment on Ω = {1, 2, 3, 4, 5, 6, 7, 8} with three events A, B and C. It follows
P[A] = P[B] = P[C] = 1/2, P[A ∩ B ∩ C] = P[{1}] = 1/8,    (2.7.34)
but
P[A ∩ B] = P[{1}] = 1/8, P[A ∩ C] = P[{1, 2}] = 1/4 = P[{1, 5}] = P[B ∩ C].    (2.7.35)
Thus (2.7.32) is fulfilled, but the events A, B, C are not pairwise independent (P[A ∩ B] = 1/8 ≠ 1/4 = P[A] · P[B]) and thus (2.7.29) is also not fulfilled.
Since neither Definition 2.7.11 nor the approach (2.7.32) is strong enough, we have to use (2.7.29) as the definition for the independence of more than two events. We thus set
Definition 2.7.15. Let (Ω, P) be a (discrete) probability space. The events A1, . . . , An are called independent if we have
P[Ai1 ∩ Ai2 ∩ · · · ∩ Aid] = P[Ai1] · P[Ai2] · . . . · P[Aid]
for all 1 ≤ i1 < i2 < · · · < id ≤ n and all 2 ≤ d ≤ n.
If the events A1, . . . , An are independent in the sense of Definition 2.7.15, then they are also pairwise independent: one simply has to choose d = 2. Furthermore, it is easy to see that Definition 2.7.15 also implies that each subfamily of independent events A1, . . . , An is again independent.
Independence has many advantages. For instance, if we consider the path rule in Propo-
sition 2.3.4, then we can compute for independent events the probability of each part in
a path without paying attention to the rest of the path.
The question is how to check the independence of n events. The best strategy is to check first all cases with d = 2, then all cases with d = 3 and so on. This means we first check for all i < j
P[Ai ∩ Aj] = P[Ai] · P[Aj],
which obviously means we check first if the events are pairwise independent. After this we check for all i < j < k
P[Ai ∩ Aj ∩ Ak] = P[Ai] · P[Aj] · P[Ak], and so on.
Example 2.7.16. Let us consider the probability space (Ω, P) with
Ω = {1, 2, 3, . . . , 120} and P[ω] = 1/120    (2.7.39)
and the events A2, A3 and A5 with Ak := {ω ∈ Ω; ω is divisible by k}.
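The checking strategy described above can be automated. The following Python sketch does this for Example 2.7.16, assuming (as the notation A2, A3, A5 suggests) that Ak is the event that ω is divisible by k; it verifies every product required by Definition 2.7.15 for d = 2 and d = 3:

    # Independence check for the divisibility events on {1,...,120}.
    from fractions import Fraction
    from itertools import combinations

    omega = range(1, 121)
    events = {k: {w for w in omega if w % k == 0} for k in (2, 3, 5)}

    def P(E):
        return Fraction(len(E), 120)

    for d in (2, 3):
        for ks in combinations((2, 3, 5), d):
            inter = set(omega)
            prod = Fraction(1)
            for k in ks:
                inter &= events[k]
                prod *= P(events[k])
            print(ks, P(inter) == prod)   # True in every case: independent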
Chapter 3
Discrete random variables
We introduce in this chapter discrete random variables and study some of their properties. This includes the probability mass function, the expectation, the variance and independence of random variables. Also, we study the sum of independent random variables and state the law of large numbers and the central limit theorem.
3.1 Motivation and definition of random variables
Suppose we throw a fair die, or we choose purely at random one of the six cars in Table 3.1. In both experiments we can use
Ω = {1, 2, 3, 4, 5, 6}    (3.1.1)
as sample space. However, there is clearly a huge difference between both experiments.
If we roll a die then the number ‘1’ is a perfect description of the outcome. On the other
hand, the number ‘1’ is not really a good description of a car. An example of a more
accurate description is given in Table 3.1.
We thus see that a car has more properties than just being the number ‘1’. This obser-
vation is of course not restricted to cars. In fact, in almost all random experiments a
selected ω has many properties. For instance, if we select a random citizen in UK, then
this citizen has a first name, a family name, an age, a gender, a weight and so on. It is
thus natural to write an ω ∈ Ω as
ω = (ω^(1), ω^(2), ω^(3), . . . ),
where each ω^(j) is the value of a certain property. For instance, we could write for the car1 in Table 3.1
car1 = (BMW, blue, . . . ).
However, in most situations we do not need to know all properties of a selected ω, only
some ‘interesting’ ones. For a car, these are for instance the brand or the colour. The
serial number of the engine on the other hand is (normally) not really interesting. We thus
would like to have a mathematical description of a property. Let us therefore consider an
example to find the right mathematical definition.
Example 3.1.1. Let us take a look at the six cars in Table 3.1. One interesting property
is for instance the colour. We now can write
colour(car1) = blue, colour(car2) = grey and colour(car3) = white.    (3.1.4)
Another interesting property is the brand. Here we can write
brand(car1) = BMW, brand(car2) = BMW and brand(car3) = VW.    (3.1.5)
Figure 3.1: Illustration of the properties ‘colour’ and ‘brand’ of a car in Table 3.1.
Studying the two properties in Example 3.1.1 more carefully, we see that a property can be characterised as follows:
First, we determine the possible values of the property. We write S for the set of
these values. In Example 3.1.1 we have for the colour
A possible question at this point is: Are the arrows in Figure 3.2 fixed or are they also random? Let us consider again the colour of a randomly selected car. Clearly, the colour of a car doesn't change and we thus always assign to each car the same colour. Similarly, the brand of a car doesn't change and we thus always assign to each car the same brand. The arrows in Figure 3.2 are therefore fixed. We can summarize this as follows: A property itself is not random, only the values a property takes are random.
If we now consider the illustration in Figure 3.2 and the above observation, we see that we can realise a property as a function X : Ω → S. Let us consider some further examples of properties.
Example 3.1.2.
b) If we are interested in the eye colours of the students of this course, then Ω is the set of all students of this course, S is the set of possible eye colours and X assigns to each student his/her eye colour.
c) An insurance company would like to study the BMI (Body Mass index) of their clients.
Then Ω is the set of all clients, S = R+ and
X(ω) = (the weight of ω) / (the size of ω)^2.    (3.1.8)
d ) If a bank is interested in the amount of money their clients have on their bank ac-
counts, then we choose Ω as the set of all clients of the bank, S = R+ and X assigns
each client the amount of money on his/her bank account.
e) We can study also more abstract sample spaces and properties. A possible question is
if a natural number is a prime number. We set in this case Ω = N, S = {0, 1} and
define X : N → S as
X(ω) := 1 if ω is a prime number, and X(ω) := 0 otherwise.    (3.1.9)
Definition 3.1.3. Let (Ω, P)be a discrete probability space. A discrete random vari-
able is a function
X : Ω → R,
ω → X(ω).
When we are working with random variables, we suggest that the reader keeps in mind that random variables are nothing other than properties. We also recommend keeping the illustration in Figure 3.2 in mind. Notice that an illustration like Figure 3.2 often needs much space. It is thus sometimes more convenient to write the values of a random variable directly next to the sample points, see Figure 3.3.
Figure 3.3: Alternative illustration of a random variable X.
The blue dots represent the sample points ω ∈ Ω. The values near the sample points are
the value of a random variable X of this sample point.
Random variables are important as they are the result of most real experiments. Also, the additional structure imposed by the number system, such as ordering, enables further development of the ideas in the previous chapter. There are some points we would like to emphasise about random variables:
• The name random variable can be misleading since a random variable is neither a variable nor random in any sense. As we have discussed above, a random variable is a property. Thus a random variable itself is not random, only the values a random variable takes are random. Mathematically, a random variable is only a function.
• We will always use capital letters for random variables like X or Y. Also, we typically write X instead of X(ω) since the dependence on ω is most of the time clear from the context.
• Every time the experiment is conducted, exactly one value of the random variable is observed; this is called a realisation of the random variable.
• We have used in Example 3.1.2 the name X for all random variables. This is of course not always a good idea and can lead to confusion. We will thus use from now on more intuitive names, for instance G for 'gender' or C for 'colour', but we will always use capital letters for random variables. However, we write X and Y for generic random variables.
Some further examples of random variables are
Example 3.1.4.
a) Let Ω = {1, 2, 3, 4, 5, 6} and P[ω] = 1/6 for all ω ∈ Ω. We thus have again the model for the throw of a fair die. We are interested here in the question if the thrown number is even or odd. A possibility to study this question is to introduce the random variable
X : Ω → R with X(ω) = 1 for ω ∈ {2, 4, 6} and X(ω) = 0 for ω ∈ {1, 3, 5}.    (3.1.10)
b) We draw purely randomly an integer between 1 and 16. In this case Ω = {1, . . . , 16} and P[ω] = 1/16 for all ω ∈ Ω. The following random variable describes the number of 1's in the decimal representation of the drawn number:
X(ω) = 1 for ω ∈ {1, 10, 12, 13, 14, 15, 16}, X(ω) = 2 for ω = 11, and X(ω) = 0 otherwise.    (3.1.11)
Random variables are (essentially) functions and we can perform mathematical operations with them, such as summing them, multiplying them by constants or squaring them. These operations then again give a random variable. As an example, let X and Y be random variables; then X + Y, aX (for a ∈ R) and X^2 are also random variables, defined pointwise, e.g. (X + Y)(ω) := X(ω) + Y(ω).
Example 3.1.5. An insurance company is interested in the body mass index of some clients. These clients are listed in Table 3.1.5. The body mass index itself is not listed in Table 3.1.5, but can be computed from the size and the weight of a client, see (3.1.8). Let us denote by S the size of a client (for instance S(client1) = 1.66 m), by W the weight of a client (for instance W(client1) = 60 kg) and
by BMI the body mass index of a client. If we now pick uniformly at random one of the 5 clients, then S, W and BMI are clearly random variables and the BMI of a client is given by BMI = W/S^2. We have for instance
BMI(client1) = W(client1)/S^2(client1) = 60/(1.66)^2 = 21.77.    (3.1.13)
Example 3.1.6.
a) Let Ω = {1, 2, 3, 4, 5, 6} and P[ω] = 1/6 for all ω ∈ Ω.
b) Suppose Ω = {(i, j); i, j = 1, . . . , 6} is the sample space resulting from rolling two dice and P[ω] = 1/36 for all ω ∈ Ω. In both cases we can define random variables on this space.
3.2 Probability mass functions
Definition 3.2.2. Let (Ω, P) be a (discrete) probability space and X a random variable on this space. We then define for x ∈ R
{X = x} := {ω ∈ Ω; X(ω) = x}.
Furthermore, we define
P[X = x] := P[{X = x}].
P[X = 3] = P[Y = 8] = 10/100 = 1/10.    (3.2.9)
The idea behind the events {X = x} is to group together all ω ∈ Ω where X has the same value. An illustration of this idea is given in Figure 3.5.
Figure 3.5: (a) The blue points represent the sample points ω ∈ Ω and the values near the points stand for the values of X at these points. (b) The random variable X has only the values 1, 2 and 3. Thus we split Ω into three regions, each region standing for one of the values 1, 2 and 3. Finally, we move each ω to the corresponding region.
The question at this point is if the Figures 3.4 and 3.5 are typical illustrations for the events {X = x}. Of course there are random variables with more than 3 values, but apart from this, one can ask:
• Can an ω ∈ Ω be contained in two different events {X = x1} and {X = x2}?
• Can there be an ω ∈ Ω which is not contained in any event {X = x}?
Figure 3.6: Impossible illustrations for a random variable X with values 1, 2 and 3: (a) an ω ∈ Ω in {X = 1} and in {X = 2}; (b) an ω ∈ Ω not in any {X = x}.
These two questions can be visualised graphically, see Figure 3.6. We thus can reformulate the two questions above as: Are the illustrations in Figure 3.6 possible for a random variable X and the events {X = x}? Intuitively, it is quite clear that the answer to both questions is no. For instance, let us denote by X the brand of a randomly selected car in the UK. Obviously, no car has more than one brand. Furthermore, each car has a brand.
Let us formulate the answer to both questions as a proposition for a general random variable.
Proposition 3.2.5. Let (Ω, P) be a (discrete) probability space and X a random variable on this space. We then have
{X = x1} ∩ {X = x2} = ∅ for all x1 ≠ x2 and Ω = ⋃_{x∈R} {X = x}.    (3.2.11)
Proposition 3.2.5 now shows that Ω is covered by the events {X = x} and thus each ω ∈ Ω is contained in some event {X = x}. Proposition 3.2.5 also shows that no ω can occur in more than one {X = x}. In other words, each ω ∈ Ω is contained in precisely one event {X = x}.
A natural question at this point is: Ω is just countable, so why do we use in (3.2.11) the real numbers R? The reason is that we have made no restrictions on the values the random variable X can take. Thus in principle X can take each value x ∈ R. However, {X = x} ≠ ∅ for only countably many x ∈ R. On the other hand, if one already knows that X : Ω → S where S ⊂ R is countable, then one can rewrite (3.2.11) as
Ω = ⋃_{x∈S} {X = x}.    (3.2.12)
Proof of Proposition 3.2.5. We have by definition
{X = x1 } ∩ {X = x2 } = {ω ∈ Ω; X(ω) = x1 } ∩ {ω ∈ Ω; X(ω) = x2 }
= {ω ∈ Ω; X(ω) = x1 and X(ω) = x2 }.
Since X is a function Ω → R, X(ω) can have only one value for each ω. Thus we can have either X(ω) = x1 or X(ω) = x2, but not both at the same time (if x1 ≠ x2). Thus {X = x1} ∩ {X = x2} is the empty set ∅. This proves the first equation.
We now come to the second equation. Since {X = x} ⊂ Ω for all x ∈ R, we have ⋃_{x∈R} {X = x} ⊂ Ω. On the other hand, if an ω ∈ Ω is given, then X(ω) is a real number x ∈ R. Thus this ω is contained in {X = x} for this x. This implies ω ∈ ⋃_{x∈R} {X = x}. This completes the proof.
Definition 3.2.6. Let (Ω, P) be a discrete probability space. The probability mass function of a discrete random variable X is defined by
pX(x) := P[X = x] for x ∈ R.
The probability mass function is a very important quantity of a random variable. One of the reasons is that in practical applications one can often find (a good approximation of) the probability mass function of a given random variable. On the other hand, it is much more difficult to find a suitable sample space in a practical application.
Remark 3.2.7. The terminology distribution is used in probability theory for several objects. For instance, it is also used for the probability density function of a continuous random variable. However, we consider here only discrete random variables and thus distribution is here synonymous with probability mass function.
Substituting all ω ∈ Ω into X, we see that X can have only the values 0, 1, 4 and 9. The probability mass function of X is thus
pX(0) = P[{2}] = 1/6, pX(1) = P[{1, 3}] = 2/6, pX(4) = P[{0, 4}] = 2/6, pX(9) = P[{5}] = 1/6
and pX(x) = 0 for x ∈ R \ {0, 1, 4, 9}.
An illustration is given in Figure 3.7.
Let us now make an important observation: if we sum all (non-zero) pX (x), we get
pX (0) + pX (1) + pX (4) + pX (9) = 1. (3.2.16)
Example 3.2.9. Suppose our sample space consists of the outcomes of throwing a fair die, and suppose we gamble on the outcome: we lose £1 if the outcome is 1, 2 or 3; win nothing if the outcome is 4; win £2 if the outcome is 5 or 6. We define X to be the random variable giving the profit. Explicitly, we have Ω = {1, 2, 3, 4, 5, 6}, P[ω] = 1/6 for all ω ∈ Ω and
X(1) = X(2) = X(3) = −1, X(4) = 0, X(5) = X(6) = 2.
We now get
{X = −1} = {1, 2, 3}, {X = 0} = {4}, {X = 2} = {5, 6} and {X = x} = ∅ for x ∉ {−1, 0, 2}.
The probability mass function of X is thus
pX(−1) = P[{1, 2, 3}] = 1/2, pX(0) = P[{4}] = 1/6, pX(2) = P[{5, 6}] = 1/3 and
pX(x) = 0 for x ∉ {−1, 0, 2}.
If we sum similarly to Example 3.2.8 all (non-zero) pX (x), we get
pX (−1) + pX (0) + pX (2) = 1. (3.2.17)
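The passage from (Ω, P, X) to the probability mass function is easy to implement. The following Python sketch computes pX for the profit X of Example 3.2.9 by grouping sample points with the same value:

    # Probability mass function of the profit X from Example 3.2.9.
    from fractions import Fraction

    P = {w: Fraction(1, 6) for w in range(1, 7)}
    X = {1: -1, 2: -1, 3: -1, 4: 0, 5: 2, 6: 2}

    pmf = {}
    for w, pw in P.items():
        pmf[X[w]] = pmf.get(X[w], Fraction(0)) + pw   # group omegas with equal value

    print(pmf)                  # {-1: 1/2, 0: 1/6, 2: 1/3}
    print(sum(pmf.values()))    # 1, as observed in (3.2.17)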
We now show that the observation made in Example 3.2.8 and 3.2.9 in the equations 3.2.16
and 3.2.17 is true in general.
Proposition 3.2.10. Let (Ω, P) be a probability space and X be a discrete random variable on this space. We then have
0 ≤ pX(x) ≤ 1 for all x ∈ R and ∑_{x∈R, pX(x)≠0} pX(x) = 1.
This proposition can be interpreted in words as: If we sum pX(x) over all possible values of X, then we get 1. Note that this is one of the few results which is true for discrete random variables only.
Remark 3.2.11. If we define S ⊂ R as S := {x ∈ R; pX(x) ≠ 0}, then the set S is a countable subset of R (see Definition 2.1.2) since X is discrete. We then can write
∑_{x∈R, pX(x)≠0} pX(x) = ∑_{x∈S} pX(x).    (3.2.21)
Proof of Proposition 3.2.10. We have pX(x) = P[{X = x}]. It follows from the definition of a probability space (Ω, P) that 0 ≤ P[{X = x}] ≤ 1. This implies the first point. For the second point, we use Proposition 3.2.5 and Lemma 2.1.17.a) and get
1 = P[Ω] = P[⋃_{x∈R} {X = x}] = P[⋃_{x∈R, {X=x}≠∅} {X = x}] = ∑_{x∈R, {X=x}≠∅} P[{X = x}] = ∑_{x∈R, pX(x)≠0} pX(x).
Notice that ⋃_{x∈R} {X = x} and ⋃_{x∈R, {X=x}≠∅} {X = x} are the same sets since we have only omitted empty sets. However, this step is important since we can use Lemma 2.1.17.a) only for countable unions of sets and thus cannot apply it to ⋃_{x∈R} {X = x} directly (since R is uncountable). This is the point where we need that X is discrete.
Customers n         0     1     2     3     4     5     6    total
Probability pN(n)   0.05  0.07  0.12  0.19  0.31  0.20  0.06   1
Table 3.1: Table for the distribution of N in Example 3.2.12
Example 3.2.12. A small branch of a bank considers the number of customers served per hour. We denote this number by N. A survey then leads to Table 3.1. We thus have for instance P[N = 3] = 0.19. Proposition 3.2.10 now states that the sum of the probabilities pN(k) = P[N = k] over all possible values of N (here k ∈ {0, 1, 2, 3, 4, 5, 6}) gives 1, i.e.
∑_{k=0}^{6} pN(k) = 1.
It is easy to check that this is indeed true. Notice that in principle one would also have had to include in this example P[N = 7], P[N = 8], . . . ; however, Table 3.1 indicates that P[N = k] = 0 for k ≥ 7 and we thus do not need to consider them.
It is clear that the underlying probability space (Ω, P) in Example 3.2.12 is quite complicated. However, it is enough to know Table 3.1 if we are only interested in how many customers are served in an hour. One thus can forget the underlying probability space for such questions. This is a general phenomenon. Almost all important information about a discrete random variable X is contained in the probability mass function. If one thus knows the probability mass function, one can forget the underlying probability space. Therefore the probability mass function abstracts the underlying probability space. However, there are different random variables which have the same probability mass function.
Example 3.2.13. We consider again the random variable X from Example 3.1.4.a). Furthermore, we consider also a second probability space (Ω, P) with Ω = {H, T}, P[H] = P[T] = 1/2 and a random variable Y with
Y(ω) = 1 for ω = H and Y(ω) = 0 for ω = T.    (3.2.23)
Then X and Y have the same distribution, i.e. pX(x) = pY(x) for all x ∈ R. Explicitly, we have
                      x = 0   x = 1   x ∉ {0, 1}
P[X = x] = P[Y = x]    0.5     0.5       0
3.3 Events involving several values
Days stayed x       4      5      6      7      8      9      10+    total
Probability pL(x)   0.038  0.114  0.430  0.300  0.080  0.030  0.008    1
Table 3.2: Table for the distribution of L in Example 3.3.1
Example 3.3.1. The length of stay in hospital after surgery is modelled as a random variable L. Table 3.2 gives the distribution of L. We thus know that the probability to stay 5 days in hospital is P[L = 5] = 0.114. On the other hand, what is the probability to stay at most a week in the hospital? The natural answer in this situation is
P[L ≤ 7] = P[L = 4] + P[L = 5] + P[L = 6] + P[L = 7]    (3.3.1)
         = 0.038 + 0.114 + 0.430 + 0.300 = 0.882.    (3.3.2)
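The following short Python check reproduces (3.3.1) from Table 3.2 (we encode the category '10+' as the single value 10, which does not matter for this computation):

    # Check of (3.3.1) from the distribution of L in Table 3.2.
    p_L = {4: 0.038, 5: 0.114, 6: 0.430, 7: 0.300, 8: 0.080, 9: 0.030, 10: 0.008}

    print(sum(p for x, p in p_L.items() if x <= 7))   # 0.882 (up to floating point)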
Let us now justify the answer in Example 3.3.1. First, we have considered in Example 3.3.1 the probability P[L ≤ 7] and thus the event {L ≤ 7}. Strictly speaking, we have not yet given the definition of such events. This definition is of course straightforward, but we state it here for completeness. Thus let X be a discrete random variable on some probability space (Ω, P). We then set
{X ∈ A} := {ω ∈ Ω; X(ω) ∈ A} for A ⊂ R and P[X ∈ A] := P[{X ∈ A}].
In particular, {X ≤ x} = {X ∈ (−∞, x]} and {X ≥ x} = {X ∈ [x, ∞)}.
Proposition 3.3.2. Let (Ω, P) be a discrete probability space and X be a discrete random variable on this space. We then have for all x ∈ R
P[X ≤ x] = ∑_{y≤x, pX(y)≠0} pX(y) and P[X ≥ x] = ∑_{y≥x, pX(y)≠0} pX(y).    (3.3.6)
Proof. The identity {X ∈ A} = ⋃_{x∈A} {X = x} follows immediately from the definition of {X ∈ A}. For the identity
P[X ∈ A] = ∑_{x∈A, pX(x)≠0} pX(x),
we can use precisely the same argument as in the proof of Proposition 3.2.10 and we thus omit it. The equation (3.3.6) then follows with A = (−∞, x] and A = [x, ∞) respectively.
Applying this proposition to P[L ≤ 7] then immediately gives (3.3.1) since P[L = k] ≠ 0 only for k = 4, 5, 6, 7 in the considered interval.
Example 3.3.3. We consider again the situation in Example 3.3.1. What is the probability of being in hospital for at most 6 days, between 5 and 7 days, and at least 7 days? We have
at most 6 days: P[L ≤ 6] = P[L = 4] + P[L = 5] + P[L = 6] = 0.582.
3.4 Expectation
We will now consider the mean of a random variable. More precisely, we are interested in what happens if we repeat the same experiment very often and then consider the mean over all realisations of X.
Example 3.4.1. Suppose we have a random experiment with Ω = {a, b, c}, P[a] = P[b] = P[c] = 1/3 and a random variable X with X(a) = 1, X(b) = 3 and X(c) = −5. We now repeat this experiment 10 times. A possible realisation is
a, a, c, b, a, c, b, b, c, b.    (3.4.1)
Example 3.4.2. We play a betting game. For this game, we have to toss a fair coin. If we toss 'Head', we win £1 and if we toss 'Tail', we have to pay £1. How much do we win (or lose) on average per game if we play this game 1000 times? If our interpretation of probability is correct, then we toss around 500 times 'Head' and around 500 times 'Tail'. Thus our gain on average per game should be around
(500 · 1 + 500 · (−1))/1000 = 0.
Notice that the right hand side of (3.4.4) is independent of n. This observation is the motivation for the following definition.
Definition 3.4.4. Let (Ω, P) be a discrete probability space and X a random variable on this space. The expectation of X is defined as
E[X] = ∑_{ω∈Ω} X(ω) · P[ω]    (3.4.5)
if we have
E[|X|] := ∑_{ω∈Ω} |X(ω)| · P[ω] < ∞.    (3.4.6)
If (3.4.6) is fulfilled, we say that the expectation of X exists and write E [|X|] < ∞.
We will see in Chapter 3.8 that the expectation describes indeed the mean of a random
variable very well if we repeat the corresponding experiment very often.
Example 3.4.5. Let us compute the expectation for the throw of a fair die. In this situation, we have
Ω = {1, 2, 3, 4, 5, 6}, P[ω] = 1/6 and X(ω) = ω.    (3.4.7)
The expectation of X is then
E[X] = 1 · 1/6 + 2 · 1/6 + . . . + 6 · 1/6 = 21/6 = 7/2.
Before we come to two important computation rules, we would like to say some words about the condition in (3.4.6). This condition is required so that the expectation of a random variable is well defined and the sum in (3.4.5) is independent of the order of the summands. Let us give an example.
Example 3.4.8. We consider the geometric distribution on N with parameter p = 1/2. This means Ω = N and P[k] = 2^{−k} for k ∈ N. We define the random variables
It is thus obvious that the expectation of both random variables is not well defined. Of
course, the condition (3.4.6) is not fulfilled for X and Y .
Example 3.4.8 shows that (3.4.6) is important. One thus has to verify (3.4.6) each time one computes the expectation of a random variable. However, (3.4.6) is automatically fulfilled if Ω is finite. Since most of our sample spaces are finite or (3.4.6) is obviously fulfilled, we will not dwell on (3.4.6) in our computations (except where necessary).
Next we prove two important computation rules for the expectation.
Theorem 3.4.9. Let X be a random variable with distribution pX(x) and with E[|X|] < ∞. We then have
E[X] = ∑_{x∈R, pX(x)≠0} x · pX(x) = ∑_{x∈R, pX(x)≠0} x · P[X = x],    (3.4.18)
and for each function f : R → R with E[|f(X)|] < ∞
E[f(X)] = ∑_{x∈R, pX(x)≠0} f(x) · pX(x).    (3.4.19)
The main idea of Theorem 3.4.9 is similar to the idea of the probability mass function, see Figure 3.5. Instead of summing over all ω ∈ Ω separately, we first group together all ω where X has the same value and then sum over all possible values of X.
Proof of Theorem 3.4.9. By assumption, we have ∑_{ω∈Ω} |X(ω)| P[ω] < +∞. We are thus allowed to reorder the summands in the sum ∑_{ω∈Ω} X(ω) P[ω] without changing the value of the sum. We now split the sum over ω ∈ Ω and combine all ω where X has the same value. In formulas this gives
E[X] = ∑_{ω∈Ω} X(ω) P[ω] = ∑_{x∈R} ∑_{ω: X(ω)=x} X(ω) · P[ω] = ∑_{x∈R} x · P[X = x] = ∑_{x∈R} x · pX(x).
This proves (3.4.18). The proof of (3.4.19) is almost identical and we thus omit it.
Example 3.4.10. We now apply Theorem 3.4.9 to the random variable in Example 3.2.8
and compare the result with Example 3.4.7. We have
E[X] = 0 · P[X = 0] + 1 · P[X = 1] + 4 · P[X = 4] + 9 · P[X = 9]
     = 0 · 1/6 + 1 · 2/6 + 4 · 2/6 + 9 · 1/6 = 19/6.    (3.4.20)
This value agrees of course with the value in Example 3.4.7.
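The following Python sketch computes E[X] in both ways, assuming (consistently with the probability mass function of Example 3.2.8) that X(ω) = (ω − 2)^2 on Ω = {0, 1, 2, 3, 4, 5}:

    # E[X] via the definition (3.4.5) and via Theorem 3.4.9.
    from fractions import Fraction

    P = {w: Fraction(1, 6) for w in range(6)}     # Omega = {0,...,5}
    X = lambda w: (w - 2) ** 2                    # values 0, 1, 4, 9

    via_omega = sum(X(w) * P[w] for w in P)       # sum over sample points
    pmf = {}
    for w in P:
        pmf[X(w)] = pmf.get(X(w), Fraction(0)) + P[w]
    via_values = sum(x * p for x, p in pmf.items())   # sum over values of X

    print(via_omega, via_values)   # both 19/6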
Example 3.4.11. We consider again Example 3.1.4.b). We have in this situation Ω = {1, . . . , 16}, P[{ω}] = 1/16 and
X(ω) = 1 for ω ∈ {1, 10, 12, 13, 14, 15, 16}, X(ω) = 2 for ω = 11, and X(ω) = 0 otherwise.    (3.4.21)
The distribution of X is then
P[X = 0] = P[{2, 3, 4, 5, 6, 7, 8, 9}] = 8/16,    (3.4.22)
P[X = 1] = P[{1, 10, 12, 13, 14, 15, 16}] = 7/16,    (3.4.23)
P[X = 2] = P[{11}] = 1/16.    (3.4.24)
We thus get
E[X] = 0 · 8/16 + 1 · 7/16 + 2 · 1/16 = 9/16,    (3.4.25)
E[X^2] = 0^2 · 8/16 + 1^2 · 7/16 + 2^2 · 1/16 = 11/16.    (3.4.26)
We can now prove a further important property of the expectation: it is linear in X.
Theorem 3.4.12 (Linearity of the expectation). Let a, b ∈ R and X and Y be random variables on the same probability space (Ω, P) with E[|X|] < ∞ and E[|Y|] < ∞. Then
E[aX + bY] = aE[X] + bE[Y].
Proof. We have
E[aX + bY] = ∑_{ω∈Ω} (aX(ω) + bY(ω)) P[ω] = a ∑_{ω∈Ω} X(ω) P[ω] + b ∑_{ω∈Ω} Y(ω) P[ω] = aE[X] + bE[Y].
We have stated Theorem 3.4.12 for two random variables, but Theorem 3.4.12 is also valid for more than two random variables. We omit the proof of this statement as it is almost the same as for Theorem 3.4.12.
Let us now make a useful observation. Each a ∈ R can be interpreted as a random variable on a probability space (Ω, P). For this, we identify a ∈ R with the constant function
a : Ω → R, ω ↦ a,    (3.4.28)
i.e. the function that maps all sample points ω to a ∈ R. An illustration can be found in Figure 3.8. We immediately get with the definition of the expectation that E[a] = a for all a ∈ R. Combining this with Theorem 3.4.12 gives
Figure 3.8: Illustration of the constant random variable.
Corollary 3.4.13. Let a ∈ R and X and Y be random variables on the same probability space (Ω, P) with E[|X|] < ∞ and E[|Y|] < ∞. Then
a) E[X + Y] = E[X] + E[Y],
b) E[aX] = aE[X],
c) E[X + a] = E[X] + a,
d) E[E[X]] = E[X].
All points in this corollary follow immediately from Theorem 3.4.12. Only point d) requires some explanation, as it is somewhat confusing at first glance. Here one has to observe that E[X] is a real number. Thus d) follows immediately from the observations between Theorem 3.4.12 and Corollary 3.4.13.
Example 3.4.14. Let Ω = {1, 2, 3, 4, 5, 6} and P[ω] = 1/6 for all ω ∈ Ω. We define
X(ω) := 1 if ω is divisible by 2 and X(ω) := 0 otherwise,    (3.4.30)
and
Y(ω) := ω if ω is divisible by 3 and Y(ω) := 0 otherwise.    (3.4.31)
We now compute E[2X + Y] directly and then with the help of Theorems 3.4.9 and 3.4.12.
The direct computation gives

    E[2X + Y] = (0+0) · 1/6 + (2+0) · 1/6 + (0+3) · 1/6 + (2+0) · 1/6 + (0+0) · 1/6 + (2+6) · 1/6
              = 15/6 = 5/2.   (3.4.32)
Next we use Theorems 3.4.9 and 3.4.12. We have

    P[X = 0] = P[X = 1] = 1/2  and  P[Y = 0] = 4/6, P[Y = 3] = P[Y = 6] = 1/6.   (3.4.33)

Theorem 3.4.9 therefore implies

    E[X] = 0 · 1/2 + 1 · 1/2 = 1/2  and  E[Y] = 0 · 4/6 + 3 · 1/6 + 6 · 1/6 = 3/2.   (3.4.34)

We now get with Theorem 3.4.12

    E[2X + Y] = 2 E[X] + E[Y] = 2 · 1/2 + 3/2 = 5/2.   (3.4.35)
The results of both computations agree of course.
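Both routes can also be carried out mechanically. The following Python snippet is our own
illustrative sketch (it is not part of the original notes): it computes E[2X + Y] from
Example 3.4.14 once directly over Ω and once via the probability mass function, as in
Theorem 3.4.9.

    # Sketch: E[2X + Y] from Example 3.4.14, computed in two ways.
    from fractions import Fraction
    from collections import defaultdict

    omega = range(1, 7)                        # sample space {1, ..., 6}
    P = {w: Fraction(1, 6) for w in omega}     # uniform probability measure

    X = lambda w: 1 if w % 2 == 0 else 0
    Y = lambda w: w if w % 3 == 0 else 0
    Z = lambda w: 2 * X(w) + Y(w)              # the random variable 2X + Y

    # Direct computation: sum Z(w) * P[w] over all sample points.
    direct = sum(Z(w) * P[w] for w in omega)

    # Via the pmf: group sample points on which Z takes the same value ...
    pmf = defaultdict(Fraction)
    for w in omega:
        pmf[Z(w)] += P[w]
    # ... and then sum x * p_Z(x) over the values, as in (3.4.18).
    via_pmf = sum(x * p for x, p in pmf.items())

    print(direct, via_pmf)                     # both print 5/2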
3.5 Variance

The expectation of a random variable X can be interpreted as its mean or average
behaviour. How well E[X] describes X varies from case to case. For instance, consider
the random variables X, Y and Z with X ≡ 0, P[Y = +1] = P[Y = −1] = 1/2 and
P[Z = 1000] = P[Z = −1000] = 1/2. We clearly have E[X] = E[Y] = E[Z] = 0. For
X, this value is identical with the value of X itself and thus describes X very well. On
the other hand, for Z the difference between Z and E[Z] is always 1000. It is thus
natural to introduce a quantity that measures how much a random variable X differs
on average from its expectation E[X]. The naive approach E[X − E[X]] gives

    E[X − E[X]] = E[X] − E[E[X]] = E[X] − E[X] = 0

and is thus fruitless for all X. It has been agreed (for many reasons) to use the expectation
of the square of X − E[X]:
Definition 3.5.1. Let (Ω, P) be a (discrete) probability space and X a random variable
on this space with E[|X|] < ∞. The variance of X is defined as

    V(X) = E[(X − E[X])²].   (3.5.1)

We have explicitly

    V(X) = Σ_{ω∈Ω} (X(ω) − E[X])² · P[ω]   (3.5.2)
         = Σ_{x∈R: P[X=x]≠0} (x − E[X])² · P[X = x].   (3.5.3)
We have used for (3.5.3) Theorem 3.4.9 with f (x) = (x − E [X])2 .
Remark 3.5.2. Apart from the variance, one often also uses the standard deviation σ
(spoken 'sigma') of a random variable X. The standard deviation is defined as
σ(X) := √V(X). However, we will use here only the variance.
Example 3.5.3. Let Ω = {0, 1} and P[0] = P[1] = 1/2. We consider the random variables
X, Y and Z with

    X(ω) = 0,  Y(0) = −1, Y(1) = 1,  Z(0) = −1000, Z(1) = 1000.   (3.5.4)

These X, Y and Z agree with the random variables at the beginning of this section. We
have E[X] = E[Y] = E[Z] = 0 and

    V(X) = 0,   (3.5.5)
    V(Y) = E[Y²] = Y(0)² · 1/2 + Y(1)² · 1/2 = 1,   (3.5.6)
    V(Z) = E[Z²] = Z(0)² · 1/2 + Z(1)² · 1/2 = 10⁶.   (3.5.7)
Example 3.5.4. We consider again Example 3.1.4.b). We have in this situation Ω =
{1, . . . , 16}, P[{ω}] = 1/16 and

    X(ω) = 1 for ω ∈ {1, 10, 12, 13, 14, 15, 16},
           2 for ω = 11,   (3.5.8)
           0 otherwise.

The distribution of X is

    P[X = 0] = 8/16,  P[X = 1] = 7/16,  P[X = 2] = 1/16.   (3.5.9)

We have computed in Example 3.4.11 that E[X] = 9/16. We use here (3.5.3) and get

    V(X) = (0 − 9/16)² · 8/16 + (1 − 9/16)² · 7/16 + (2 − 9/16)² · 1/16 = 95/256.
Theorem 3.5.5 (Shifting theorem). Let (Ω, P) be a (discrete) probability space and
X a random variable on this space with E[|X|] < ∞. We then have

    V(X) = E[X²] − (E[X])².   (3.5.10)

Proof. We have

    V(X) = E[(X − E[X])²] = E[X² − 2X E[X] + (E[X])²]
         = E[X²] − 2 E[X] E[X] + (E[X])² = E[X²] − (E[X])²,

where we used the linearity of the expectation (Theorem 3.4.12) and that E[X] is a real
number.
Example 3.5.6. Let us compute the variance for the throw of a fair die. In this situation,
we have

    Ω = {1, 2, 3, 4, 5, 6},  P[ω] = 1/6  and  X(ω) = ω.   (3.5.11)

We know from Example 3.4.5 that E[X] = 21/6 = 7/2. Similarly, we get

    E[X²] = 1² · 1/6 + 2² · 1/6 + . . . + 6² · 1/6 = 91/6.   (3.5.12)

The variance of X is then

    V(X) = E[X²] − (E[X])² = 91/6 − (7/2)² = 91/6 − 49/4 = 35/12.   (3.5.13)
Let us now give with the formula (3.5.2) a heuristic interpretation of the variance. Roughly
speaking, a small variance means that a typical value of X is close to E[X], while a large
variance means that a typical value of X is not so close to E[X]. This is of course not
a rigorous mathematical statement. However, let us try to see why it makes sense.
Notice that each summand in (3.5.2) is non-negative and thus

    'V(X) small'  =⇒  '(X(ω) − E[X])² · P[ω] is small for all ω ∈ Ω'.   (3.5.14)

Suppose now that 'X(ω) is far away from E[X]' and therefore '(X(ω) − E[X])² is large'.
Since '(X(ω) − E[X])² · P[ω] is small', we must have that 'P[ω] is very small' and thus
it is very unlikely that this ω occurs. The question at this point is whether we can make
this statement more rigorous. More precisely, can we say anything about the probability
that the distance between X and E[X] is greater than a given number a > 0? In formulas,
can we say anything about

    P[|X − E[X]| > a]?   (3.5.15)
An illustration is given in Figure 3.9.

Figure 3.9: Illustration of X ∉ (E[X] − a, E[X] + a).

We can clearly not expect to get an explicit expression for P[|X − E[X]| > a] just from
the variance. However, we can give an upper bound. More precisely, we get with the
Chebyschev inequality

    P[|X − E[X]| ≥ a] ≤ V(X)/a²  for all a > 0.   (3.5.17)
We will postpone the precise formulation of the Chebyschev inequality to Section 3.8, see
Theorem 3.8.5, and give instead an example.
Example 3.5.7. A factory produces screws which should have a width of 5mm. Since the
production process is not perfect, there are always minor differences in the width of the
produced screws. Suppose that the width can be modelled by a random variable W with
E[W] = 5mm and V(W) = 0.005mm². What is the probability that a randomly chosen
screw has a width between 4.9mm and 5.1mm?
The given information is not enough to compute the desired probability exactly. However,
we can give a lower bound with the Chebyschev inequality. We have

    P[4.9 < W < 5.1] = P[−0.1 < W − 5 < 0.1] = P[|W − 5| < 0.1]
                     = P[|W − E[W]| < 0.1] = 1 − P[|W − E[W]| ≥ 0.1]
                     ≥ 1 − V(W)/0.1² = 1 − 0.005/0.01 = 1/2.

We thus see that at least half of the screws have a width between 4.9mm and 5.1mm.
Note that this does not imply that P[4.9 ≤ W ≤ 5.1] is around 0.5. It is indeed possible
that P[4.9 ≤ W ≤ 5.1] is close to 1.
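To get a feeling for how conservative the bound is, here is a small sketch of our own; the
two distributions below are assumptions chosen by us (the example does not specify a
distribution), both with E[W] = 5 and V(W) = 0.005.

    # Sketch: Chebyschev's lower bound 1/2 versus the exact probability
    # for two assumed distributions with mean 5 and variance 0.005.
    import math

    dist_a = {4.9: 0.25, 5.0: 0.50, 5.1: 0.25}   # bound attained exactly
    c = math.sqrt(0.005)                          # c ≈ 0.0707
    dist_b = {5 - c: 0.5, 5 + c: 0.5}             # true probability is 1

    for dist in (dist_a, dist_b):
        mean = sum(w * p for w, p in dist.items())
        var = sum((w - mean) ** 2 * p for w, p in dist.items())
        exact = sum(p for w, p in dist.items() if 4.9 < w < 5.1)
        print(mean, round(var, 4), exact)         # compare exact with 1/2

For dist_a the bound 1/2 is attained exactly; for dist_b the true probability is 1, so the
Chebyschev bound can be very pessimistic.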
Exercise 3.5.8. Give an example of a random variable W with E[W] = 5 and
V(W) = 0.005 such that P[4.9 ≤ W ≤ 5.1] is close to 1.
Example 3.5.9. A school holds an exam at the end of each term. The exam traditionally
contains 10 questions, each worth 10 points. Let us denote by T the total points achieved
by a randomly chosen student. The experience of several years shows that E[T] = 40 and
V(T) = 200. How likely is it that a student scores more than 90 points?
As in Example 3.5.7, we cannot compute this probability explicitly, but we can give an
upper bound. We have

    P[T ≥ 90] = P[T − E[T] ≥ 90 − E[T]] = P[T − E[T] ≥ 50]
              ≤ P[|T − E[T]| ≥ 50] ≤ V(T)/50² = 200/2500 = 0.08.   (3.5.18)

We therefore have P[T ≥ 90] ≤ 0.08 and thus P[T ≥ 90] cannot be very high.
Example 3.5.10. A university is interested in the daily study time of its students. For
this, let us denote by D the daily study time of a randomly picked student. A survey
shows that E[D] = 4 hours and V(D) = 2. How likely is it that a random student studies
more than 5 hours or less than 3 hours per day?
As in Examples 3.5.7 and 3.5.9, we cannot compute this probability explicitly. We use
again (3.5.17) to give an upper bound:

    P[D ≥ 5 or D ≤ 3] = P[D − E[D] ≥ 1 or D − E[D] ≤ −1] = P[|D − E[D]| ≥ 1]
                      ≤ V(D)/1² = 2/1 = 2.   (3.5.19)

We therefore get with (3.5.17) that P[D ≥ 5 or D ≤ 3] ≤ 2. Obviously, this inequality is
useless, as we already know from the definition of a probability that P[D ≥ 5 or D ≤ 3] ≤ 1.
We thus see that the Chebyschev inequality does not always give useful bounds.
At this point, one might ask: why do we use the Chebyschev inequality to get bounds
instead of computing P[|X − E[X]| > a] directly? The reason is that there are many
situations where we can (easily) compute the variance V(X), while it is very difficult
to compute the probability P[|X − E[X]| > a] explicitly. We will see this for instance in
Section 3.8.1.
As the Chebyschev inequality indicates, it is important to be able to compute the variance
of a random variable efficiently. We thus look at how to compute the variances V(aX) and
V(X + Y). For this, we require the covariance of two random variables.
Definition 3.5.11. Let (Ω, P) be a (discrete) probability space and X and Y be
random variables on this space with E[X²] < ∞ and E[Y²] < ∞. The covariance of
X and Y is defined as

    Cov(X, Y) := E[(X − E[X])(Y − E[Y])] = E[X · Y] − E[X] E[Y].

If Cov(X, Y) = 0, then X and Y are called uncorrelated.

Notice that the identity E[(X − E[X])(Y − E[Y])] = E[X · Y] − E[X] E[Y] follows with a
computation similar to the proof of Theorem 3.5.5. Furthermore, the assumptions E[X²] <
∞ and E[Y²] < ∞ in Definition 3.5.11 imply that Cov(X, Y) is well defined. This follows
from the Cauchy–Schwarz inequality for random variables,

    E[|X · Y|] ≤ √(E[X²] E[Y²]).

Example 3.5.12. We toss a coin twice, where each throw independently shows 'Head'
with probability p. We set Xj = 1 if the j-th throw shows 'Head' and Xj = 0 otherwise,
and we compute Cov(X1, X2).
We first compute the expectation and variance of X1 and X2. A simple computation shows
that P[X1 = 1] = P[X2 = 1] = p and P[X1 = 0] = P[X2 = 0] = 1 − p. Random variables
with this distribution are called Bernoulli distributed with parameter p, see also Section 3.6.
Thus

    E[X1] = E[X2] = p  and  V(X1) = V(X2) = E[Xj²] − p² = p − p² = p(1 − p).

Furthermore, we have

    E[X1 · X2] = 1 · P[X1 = 1, X2 = 1] = p².

This then gives

    Cov(X1, X2) = E[X1 X2] − E[X1] E[X2] = p² − p · p = 0,

i.e. X1 and X2 are uncorrelated.
Theorem 3.5.13. Let (Ω, P) be a (discrete) probability space, a, b ∈ R and X and Y
be random variables on this space with E[X²] < ∞ and E[Y²] < ∞. We then have
a) V(X) = Cov(X, X),
b) V(X + a) = V(X) and Cov(X + a, Y + b) = Cov(X, Y),
c) V(aX) = a² V(X) and Cov(aX, bY) = ab Cov(X, Y),
d) V(X + Y) = V(X) + V(Y) + 2 Cov(X, Y),
e) if X and Y are uncorrelated, then V(X + Y) = V(X) + V(Y).

Proof. Point a) follows immediately from the definitions of the variance and the covariance
(see Definition 3.5.1 and Definition 3.5.11).
For point b), note that we have E[X − a] = E[X] − a (see Corollary 3.4.13) and thus

    V(X − a) = E[((X − a) − E[X − a])²] = E[(X − E[X])²] = V(X).

The argument for the covariance is similar and we thus omit it.
For c), we use the shifting theorem and get

    V(aX) = E[(aX)²] − (E[aX])² = a² E[X²] − a² (E[X])² = a² V(X).

The argument for the covariance is again similar and we thus omit it.
For the proof of d), we define X′ := X − E[X] and Y′ := Y − E[Y]. It follows from b)
that it is enough to prove d) for X′ and Y′. The advantage of X′ and Y′ compared to
X and Y is that E[X′] = E[Y′] = 0, so the variance and covariance are easier to compute.
Explicitly, we have

    V(X′ + Y′) = E[(X′ + Y′)²] = E[X′²] + E[Y′²] + 2 E[X′Y′]
               = V(X′) + V(Y′) + 2 Cov(X′, Y′).   (3.5.31)

Point e) follows directly from d).
Corollary 3.5.14. Let (Ω, P) be a (discrete) probability space and X1, . . . , Xn be
random variables on this space with E[Xj²] < ∞ for all 1 ≤ j ≤ n.
a) We have

    V(Σ_{i=1}^n Xi) = Σ_{j=1}^n V(Xj) + Σ_{i≠j} Cov(Xi, Xj)   (3.5.32)
                    = Σ_{j=1}^n V(Xj) + 2 Σ_{i<j} Cov(Xi, Xj).   (3.5.33)

b) If X1, . . . , Xn are pairwise uncorrelated, then V(Σ_{j=1}^n Xj) = Σ_{j=1}^n V(Xj).

The proof of Corollary 3.5.14 is almost the same as the proof of points d) and e) in
Theorem 3.5.13 and we thus omit it.
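As a small illustration (our own sketch, not from the notes), point d) of Theorem 3.5.13
can be checked exactly for the random variables X and Y of Example 3.4.14:

    # Sketch: verify V(X + Y) = V(X) + V(Y) + 2 Cov(X, Y) for Example 3.4.14.
    from fractions import Fraction

    omega = range(1, 7)
    p = Fraction(1, 6)

    X = {w: (1 if w % 2 == 0 else 0) for w in omega}
    Y = {w: (w if w % 3 == 0 else 0) for w in omega}

    def E(f):                                  # expectation over (Omega, P)
        return sum(f[w] * p for w in omega)

    def V(f):                                  # variance via Theorem 3.5.5
        return E({w: f[w] ** 2 for w in omega}) - E(f) ** 2

    cov = E({w: X[w] * Y[w] for w in omega}) - E(X) * E(Y)
    S = {w: X[w] + Y[w] for w in omega}
    print(V(S), V(X) + V(Y) + 2 * cov)         # the two values agree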
3.6 Important distributions of random variables
We consider in this section some important distributions of random variables. We have
summarized their probability mass function, expectation and variance in Table 3.3.
Name       | Short name | Parameter         | Obtained values        | pX(x)                             | E[X] | V(X)
Degenerate | –          | c ∈ R             | {c}                    | pX(x) = 1 if x = c, 0 otherwise   | c    | 0
Bernoulli  | Ber(p)     | 0 ≤ p ≤ 1         | {0, 1}                 | pX(1) = p, pX(0) = 1 − p          | p    | p(1 − p)
Binomial   | B(n, p)    | 0 ≤ p ≤ 1, n ∈ N  | {0, 1, 2, . . . , n}   | pX(k) = C(n, k) pᵏ (1 − p)ⁿ⁻ᵏ     | np   | np(1 − p)
Geometric  | Geo(p)     | 0 < p ≤ 1         | N = {1, 2, 3, . . . }  | pX(k) = p (1 − p)ᵏ⁻¹              | 1/p  | (1 − p)/p²
Poisson    | Poi(λ)     | λ ∈ R₊            | N₀ = {0, 1, 2, . . . } | pX(k) = e⁻λ λᵏ / k!               | λ    | λ

Table 3.3: Important distributions of discrete random variables; C(n, k) = n!/(k!(n − k)!)
denotes the binomial coefficient.

3.6.1 Degenerate distribution

Let c ∈ R. A random variable X is called degenerate with parameter c if P[X = c] = 1.
In this case E[X] = c and E[X²] = c², and thus

    V(X) = E[X²] − (E[X])² = c² − c² = 0.   (3.6.2)

3.6.2 Bernoulli distribution

Let 0 ≤ p ≤ 1. A random variable X is called Bernoulli distributed with parameter p
(short X ∼ Ber(p)), if X takes only the values 0 and 1 and

    P[X = 1] = p,  P[X = 0] = 1 − p.   (3.6.3)
A Bernoulli random variable X can be interpreted as the throw of one coin, where '1' is
interpreted as 'Head' and '0' as 'Tail'. In this situation p = P[X = 1] is the probability
of tossing 'Head'. In general, a Bernoulli random variable can be viewed as a 'success or
fail' experiment or as a '1 or 0' experiment. Some further examples are
  I win or lose the next game,
  the next person smokes or doesn't smoke,
  the next baby is a boy or a girl.
The expectation of a Bernoulli distributed random variable is

    E[X] = 0 · (1 − p) + 1 · p = p.   (3.6.4)

The variance has already been computed in Example 3.5.12 and is

    V(X) = p(1 − p).   (3.6.5)
3.6.3 Binomial distribution

Let 0 ≤ p ≤ 1 and n ∈ N. A random variable X is called binomial distributed with
parameters n and p (short X ∼ B(n, p)), if X takes values in {0, 1, . . . , n} and

    P[X = k] = C(n, k) pᵏ (1 − p)ⁿ⁻ᵏ  for 0 ≤ k ≤ n.   (3.6.6)

A binomial random variable counts the number of 'successes' in n independent 'success
or fail' experiments, each with success probability p. Examples are
  the number of 'Heads' when one tosses a coin n times,
  the number of games I win when I play a game n times.
It remains to show that (3.6.6) indeed defines a probability distribution on {0, . . . , n}.
We use the binomial theorem (a + b)ⁿ = Σ_{k=0}^n C(n, k) aᵏ bⁿ⁻ᵏ and get with a = p,
b = 1 − p

    Σ_{k=0}^n P[X = k] = Σ_{k=0}^n C(n, k) pᵏ (1 − p)ⁿ⁻ᵏ = (p + (1 − p))ⁿ = 1.   (3.6.10)
Next we compute the expectation and the variance. As before, we use Theorem 3.4.9
and get

    E[X] = Σ_{k=0}^n k · P[X = k] = Σ_{k=0}^n k C(n, k) pᵏ (1 − p)ⁿ⁻ᵏ.   (3.6.11)

It is possible to compute this expression directly, but it is easier to use Theorem 3.4.12
instead. We use the observation that X is the number of 'Heads' when one tosses a
coin n times. We thus can write

    X = X1 + · · · + Xn   (3.6.12)

with Xj = 1 if the j-th throw shows 'Head' and Xj = 0 otherwise. Obviously, each Xj
is Ber(p) distributed. We thus get with Theorem 3.4.12

    E[X] = E[X1] + · · · + E[Xn] = n · p.   (3.6.13)
We argue similarly for the computation of the variance V(X) and use (3.6.12). We thus
have to compute

    V(X) = V(X1 + · · · + Xn).

We now would like to apply Corollary 3.5.14. For this, we have to compute the covariance
Cov(Xi, Xj) for i ≠ j and 1 ≤ i, j ≤ n. We have computed this in the case n = 2 in
Example 3.5.12 and have shown that the random variables X1 and X2 are uncorrelated.
This result is indeed true for general n. In other words, the random variables X1, . . . ,
Xn are all pairwise uncorrelated. One can show this for instance as in Example 3.5.12.
However, we will show in Chapter 3.7 that the random variables X1, . . . , Xn are
independent and thus automatically uncorrelated (see Example 3.7.15 and Theorem 3.7.17).
Since all Xj are Bernoulli distributed, it follows from Corollary 3.5.14 that

    V(X) = V(X1 + · · · + Xn) = Σ_{j=1}^n V(Xj) = n · p(1 − p).   (3.6.15)
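The decomposition (3.6.12) is also convenient in code. The following sketch (ours, not
from the notes) simulates B(n, p) as a sum of n independent Bernoulli variables and
compares the empirical mean and variance with np and np(1 − p).

    # Sketch: simulate X ~ B(n, p) as a sum of n Bernoulli(p) variables.
    import random

    n, p, runs = 20, 0.3, 100_000
    samples = []
    for _ in range(runs):
        x = sum(1 for _ in range(n) if random.random() < p)  # X = X1+...+Xn
        samples.append(x)

    mean = sum(samples) / runs
    var = sum((x - mean) ** 2 for x in samples) / runs
    print(mean, n * p)              # both close to 6.0
    print(var, n * p * (1 - p))     # both close to 4.2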
3.6.4 Geometric distribution
Let 0 < p ≤ 1 be given. A random variable X is called geometric distributed (short
X ∼ Geo(p)), if X takes only values in N = {1, 2, 3, . . . } and

    P[X = k] = p · (1 − p)ᵏ⁻¹,  k ∈ N.   (3.6.16)

Let us now describe the model behind this distribution. Suppose we have a coin which
shows 'Head' with probability p when it is tossed. We toss this coin until 'Head' occurs
for the first time. We assume at this point that each throw is independent of the others.
This leads to the tree diagram in Figure 3.10. We now define X to be the number of
tosses needed until 'Head' occurs for the first time. We then have

    P[X = 1] = p,  P[X = 2] = (1 − p) · p,  P[X = 3] = (1 − p)² · p,  . . . .   (3.6.18)

We thus see that X has the distribution (3.6.16). Apart from tossing a coin until the first
'Head' occurs, the geometric distribution Geo(p) also arises as the number of 'success
or fail' experiments needed until the first 'success', when the experiments are performed
independently and each experiment has success probability p. Further examples are
  the number of games needed until I win a game for the first time,
Remark 3.6.1. Notice that there is also a shifted version of the geometric distribution,
which is defined on N₀ = {0, 1, 2, 3, . . . } as

    P[X = k] = p · (1 − p)ᵏ,  k ∈ N₀.   (3.6.19)

This corresponds to counting only the number of 'Tails' before the first 'Head' occurs.
Unfortunately, the term 'geometric distribution' is used in the literature for both versions,
so one always has to check which one is meant. We use in this course only (3.6.16) and
thus the term 'geometric distribution' always refers to (3.6.16).
It remains to show that (3.6.16) indeed gives a probability distribution. If p = 1, then
P[X = 1] = 1 and P[X = k] = 0 for all k ≥ 2. We thus get

    Σ_{k=1}^∞ P[X = k] = 1 + 0 + 0 + . . . = 1.   (3.6.21)

For 0 < p < 1, we use the sum formula of the geometric series,

    Σ_{k=0}^∞ qᵏ = 1/(1 − q)  (for |q| < 1),

and get

    Σ_{k=1}^∞ P[X = k] = p Σ_{k=1}^∞ (1 − p)ᵏ⁻¹ = p · 1/(1 − (1 − p)) = 1.

A computation based on differentiating the geometric series further shows that

    E[X] = Σ_{k=1}^∞ k p (1 − p)ᵏ⁻¹ = 1/p.   (3.6.24)
We thus see that we have to toss the coin on average 1/p times until we get the first 'Head'.
For a fair coin, this is two times, since p = 1/2. The formula is also correct in the case
p = 0 if we interpret 1/0 as ∞. Indeed, p = 0 corresponds to 'only Tail occurs' and we thus
would wait infinitely long. We have of course excluded the case p = 0 in (3.6.16) since it
does not give a probability distribution (p = 0 implies P[X = k] = 0 for all k ∈ N). The
variance of a geometric distributed X is given by

    V(X) = (1 − p)/p².   (3.6.25)
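A quick simulation (our own sketch) of the waiting time until the first 'Head'
illustrates E[X] = 1/p:

    # Sketch: simulate Geo(p) as the number of tosses until the first 'Head'
    # and compare the empirical mean with 1/p.
    import random

    p, runs = 0.25, 100_000
    total = 0
    for _ in range(runs):
        tosses = 1
        while random.random() >= p:   # toss until 'Head' (prob. p) occurs
            tosses += 1
        total += tosses
    print(total / runs, 1 / p)        # both close to 4.0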
3.6.5 Poisson distribution

Let λ > 0 be given. A random variable X is called Poisson distributed with parameter λ
(short X ∼ Poi(λ)), if X takes values in N₀ = {0, 1, 2, . . . } and

    P[X = k] = e⁻λ λᵏ / k!,  k ∈ N₀.   (3.6.26)

Using the exponential series Σ_{k=0}^∞ λᵏ/k! = eλ, one checks that this is a probability
distribution and that

    E[X] = Σ_{k=1}^∞ k e⁻λ λᵏ / k! = λ e⁻λ Σ_{k=1}^∞ λᵏ⁻¹ / (k − 1)! = λ.

For the variance, it is easiest to first compute E[X(X − 1)] and then use
E[X²] = E[X(X − 1)] + E[X] to get E[X²]. We have

    E[X(X − 1)] = Σ_{k=0}^∞ k(k − 1) e⁻λ λᵏ / k! = e⁻λ Σ_{k=2}^∞ λᵏ / (k − 2)!
                = λ² e⁻λ Σ_{k=0}^∞ λᵏ / k! = λ².   (3.6.29)

It now follows

    V(X) = E[X²] − E[X]² = E[X(X − 1)] + E[X] − E[X]² = λ² + λ − λ² = λ.   (3.6.30)
3.7 Joint distribution and independence

3.7.1 Joint distribution

Definition 3.7.1. Let X and Y be discrete random variables on the same probability
space (Ω, P). We then set

    {X = x, Y = y} := {ω ∈ Ω; X(ω) = x and Y(ω) = y} = {X = x} ∩ {Y = y}   (3.7.1)

and

    P[X = x, Y = y] := P[{X = x, Y = y}].

The joint distribution of more than two random variables is defined similarly.
The idea behind the definition of {X = x, Y = y} is similar to the idea behind the
definition of {X = x}. More precisely, one groups together all ω ∈ Ω on which X and Y
take the given values. Thus {X = x, Y = y} can be heuristically interpreted as 'X has the
value x and Y has the value y'. Let us consider an example.
Example 3.7.2. Consider the probability space (Ω, P) with

    Ω = {1, 2, 3, 4, 5, 6, 7, 8}  with  P[ω] = 1/8 for all ω ∈ Ω.   (3.7.2)

Further, we define random variables X with values in {1, 2, 3} and Y with values in
{−1, 1}. The events {X = x, Y = y} can be illustrated graphically, see Figure 3.11. We
now get

    P[X = 2, Y = 1] = 2/8,  P[X = 3, Y = 1] = 1/8,
    P[X = 1, Y = −1] = 2/8,  P[X = 2, Y = −1] = 0,  P[X = 3, Y = −1] = 2/8.
If one now takes a look at Example 3.7.2 and at Figure 3.11, one sees that the events
{X = x, Y = y} cover the sample space Ω, i.e. we have in Example 3.7.2

    Ω = {1, 2, 3, 4, 5, 6, 7, 8} = ⋃_{x∈{1,2,3}, y∈{−1,1}} {X = x, Y = y}.

Also, we see that the events {X = x, Y = y} are disjoint, i.e.

    {X = x₁, Y = y₁} ∩ {X = x₂, Y = y₂} = ∅  if x₁ ≠ x₂ or y₁ ≠ y₂.
These two observations are also true in general, but we will not prove them here. Notice
that we have made a similar observation for the events {X = x}, see Proposition 3.2.5.
More important is that we can recover the distribution of X (and of Y ) from the joint
distribution of X and Y . We have
Lemma 3.7.3. Let X and Y be discrete random variables on the same probability
space (Ω, P). We then have for each x ∈ R

    P[X = x] = Σ_{y∈R: P[X=x,Y=y]>0} P[X = x, Y = y].   (3.7.4)
As an illustration, let us look again at Example 3.7.2 and consider P[X = 1]. Since
{X = 1} = {1, 3, 7}, we see in Figure 3.11 that {X = 1} consists of all ω in the left
column, and thus P[X = 1] is the sum over all P[ω] with ω in the left column. This sum
can be split up into the sum over all P[ω] with ω in the upper left box and the sum over
all P[ω] with ω in the lower left box. The first sum corresponds to P[X = 1, Y = 1] and
the second sum corresponds to P[X = 1, Y = −1]. Thus, we have in this illustration

    P[X = 1] = P[X = 1, Y = 1] + P[X = 1, Y = −1].
Proof of Lemma 3.7.3. To show (3.7.4) in general, we use Proposition 3.2.5 and get

    P[X = x] = P[{X = x} ∩ Ω] = P[{X = x} ∩ ⋃_{y∈R: {Y=y}≠∅} {Y = y}]
             = P[⋃_{y∈R: {X=x}∩{Y=y}≠∅} ({X = x} ∩ {Y = y})]
             = Σ_{y∈R: {X=x}∩{Y=y}≠∅} P[{X = x} ∩ {Y = y}]
             = Σ_{y∈R: P[X=x,Y=y]>0} P[X = x, Y = y].

We also used that at most countably many of the sets {Y = y} are non-empty.
Example 3.7.4. We consider again an urn containing 100 balls, numbered with 00, 01,
. . . , 99, from which we draw one ball purely at random. Thus Ω = {00, . . . , 99} and
P[ω] = 1/100 for all ω ∈ Ω. We consider here the random variables X and Y, where X is
the first digit and Y is the second digit of the drawn ball. As an example, we compute
P[X = 3, Y = 8]. We have

    {X = 3} = {ω ∈ Ω; X(ω) = 3} = {30, 31, . . . , 39},
    {Y = 8} = {ω ∈ Ω; Y(ω) = 8} = {08, 18, . . . , 98}.

Thus |{X = 3}| = |{Y = 8}| = 10 and P[X = 3] = P[Y = 8] = 10/100 = 1/10. Furthermore,

    {X = 3, Y = 8} = {X = 3} ∩ {Y = 8} = {ω ∈ Ω; X(ω) = 3 and Y(ω) = 8} = {38}.

Thus P[X = 3, Y = 8] = |{38}|/100 = 1/100. With a similar argument, one can show that

    P[X = 3, Y = j] = 1/100  for all j ∈ {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}.   (3.7.5)

Applying (3.7.4) then gives

    P[X = 3] = P[X = 3, Y = 0] + P[X = 3, Y = 1] + · · · + P[X = 3, Y = 9]
             = Σ_{j=0}^9 P[X = 3, Y = j] = Σ_{j=0}^9 1/100 = 1/10.   (3.7.6)
The best way to present the joint distribution of two random variables is with a table.
We use an example to illustrate this.
Example 3.7.5. We roll a fair die twice. Let X1 be the number of the first throw, X2 the
number of the second throw and Y = max{X1, X2}. We set here

    Ω = {(x₁, x₂); 1 ≤ x₁, x₂ ≤ 6}  and  P[ω] = 1/36.   (3.7.7)

We then have for instance

    P[X1 = 1, Y = 2] = P[{(1, 2)}] = 1/36,   (3.7.8)
    P[X1 = 2, Y = 2] = P[{(2, 1), (2, 2)}] = 2/36,   (3.7.9)
    P[X1 = 3, Y = 2] = P[∅] = 0.   (3.7.10)

The complete joint distribution of X1 and Y is visible in Table 3.4.
A positive side effect of the table is that we can easily read off the distributions of X1 and Y
by summing the values in the rows respectively in the columns. This follows from (3.7.4).
We have for instance

    P[Y = 3] = 1/36 + 1/36 + 3/36 = 5/36  and   (3.7.11)
    P[Y = 6] = 1/36 + 1/36 + 1/36 + 1/36 + 1/36 + 6/36 = 11/36.   (3.7.12)
X1 \ Y     | 1    | 2    | 3    | 4    | 5    | 6     | Row sum
1          | 1/36 | 1/36 | 1/36 | 1/36 | 1/36 | 1/36  | 1/6 = P[X1 = 1]
2          | 0    | 2/36 | 1/36 | 1/36 | 1/36 | 1/36  | 1/6 = P[X1 = 2]
3          | 0    | 0    | 3/36 | 1/36 | 1/36 | 1/36  | 1/6 = P[X1 = 3]
4          | 0    | 0    | 0    | 4/36 | 1/36 | 1/36  | 1/6 = P[X1 = 4]
5          | 0    | 0    | 0    | 0    | 5/36 | 1/36  | 1/6 = P[X1 = 5]
6          | 0    | 0    | 0    | 0    | 0    | 6/36  | 1/6 = P[X1 = 6]
Column sum | 1/36 | 3/36 | 5/36 | 7/36 | 9/36 | 11/36 |

Table 3.4: Joint distribution of X1 and Y in Example 3.7.5.
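The table can also be generated by brute-force enumeration of Ω. The following sketch
(ours) builds the joint distribution of X1 and Y = max(X1, X2) and recovers the marginal
of Y via the column sums, i.e. via Lemma 3.7.3.

    # Sketch: joint distribution of X1 and Y = max(X1, X2) for two dice,
    # built by enumerating all 36 outcomes; marginals are row/column sums.
    from fractions import Fraction
    from collections import defaultdict
    from itertools import product

    joint = defaultdict(Fraction)
    for x1, x2 in product(range(1, 7), repeat=2):
        joint[(x1, max(x1, x2))] += Fraction(1, 36)

    # Marginal of Y via Lemma 3.7.3: sum over the first coordinate.
    marginal_Y = defaultdict(Fraction)
    for (x1, y), prob in joint.items():
        marginal_Y[y] += prob

    print(joint[(2, 2)])     # 1/18  (= 2/36)
    print(marginal_Y[6])     # 11/36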
Example 3.7.6. Let Ω = {−3, −2, −1, 1, 2, 3} and P[ω] = 1/6. We consider the random
variables

    X(ω) = ω,  |X|(ω) = |ω|  and  Y(ω) = sign(ω) = 1 if ω > 0, −1 if ω < 0.   (3.7.13)

We compute the joint distribution of X and |X| and the joint distribution of |X| and Y.
We have for instance

    P[X = −1, |X| = 1] = P[−1] = 1/6  and  P[X = −2, |X| = 1] = P[∅] = 0,   (3.7.14)
    P[|X| = 2, Y = 1] = P[2] = 1/6  and  P[|X| = 1, Y = −1] = P[−1] = 1/6.   (3.7.15)

The remaining cases can be computed similarly and are summarized in Table 3.5.

(a) Joint distribution of X and |X|

X \ |X|    | 1   | 2   | 3   | Row sum
−3         | 0   | 0   | 1/6 | 1/6
−2         | 0   | 1/6 | 0   | 1/6
−1         | 1/6 | 0   | 0   | 1/6
1          | 1/6 | 0   | 0   | 1/6
2          | 0   | 1/6 | 0   | 1/6
3          | 0   | 0   | 1/6 | 1/6
Column sum | 1/3 | 1/3 | 1/3 |

(b) Joint distribution of |X| and Y

Y \ |X|    | 1   | 2   | 3   | Row sum
−1         | 1/6 | 1/6 | 1/6 | 1/2
1          | 1/6 | 1/6 | 1/6 | 1/2
Column sum | 1/3 | 1/3 | 1/3 |

Table 3.5: The joint distributions in Example 3.7.6.
One can ask oneself at this point whether the joint distribution of two random variables is
determined by the individual distributions of those random variables. The following
example shows that this is not the case.
Example 3.7.7. We toss a coin twice. The first coin is a fair coin. For the second throw,
we choose between two different coins. If the outcome of the first throw is 'Head', we
choose a coin which shows 'Head' with probability p. If the outcome of the first throw is
'Tail', we choose a coin which shows 'Head' with probability 1 − p. The value of p is here
arbitrary, except of course that 0 ≤ p ≤ 1. This leads to the tree diagram in Figure 3.12.
We define Xj = 1 if the j-th throw shows 'Head' and Xj = 0 otherwise. We then get for
instance with the path rule (see Proposition 2.3.4)

    P[X1 = 1, X2 = 0] = P[(H, T)] = 1/2 · (1 − p).   (3.7.16)

The remaining cases can be computed similarly and are summarized in Table 3.6.

X1 \ X2    | 0       | 1       | Row sum
0          | p/2     | (1−p)/2 | 1/2
1          | (1−p)/2 | p/2     | 1/2
Column sum | 1/2     | 1/2     |

Table 3.6: Joint distribution of X1 and X2 in Example 3.7.7.

Considering the row and column sums, we see that X1 and X2 are always Bernoulli(1/2)
distributed. However, the joint distribution of X1 and X2 depends on p and is in
particular not determined by the distributions of X1 and X2!
We will see in Chapter 3.7.2 that the joint distribution of two random variables is deter-
mined by the individual distributions of those random variables if the random variables
are independent, see Definition 3.7.9.
Similarly to Theorem 3.4.9, one can show
Theorem 3.7.8. Let (Ω, P) be a probability space and X and Y be discrete random
variables on this space with E[X²] < ∞, E[Y²] < ∞ (and E[|XY|] < ∞). Then

    E[X · Y] = Σ_{x,y∈R} x · y · P[X = x, Y = y].   (3.7.17)

The proof of Theorem 3.7.8 is almost identical to the proof of Theorem 3.4.9. We thus
omit it.
3.7.2 Independent random variables

If the events {X1 = x1} and {X2 = x2} are independent in the sense of Section 2.7, then

    P[X1 = x1, X2 = x2] = P[X1 = x1] · P[X2 = x2].   (3.7.19)

Since x1 and x2 are arbitrary, (3.7.19) has to hold for all x1 and x2. In view of this
observation, it is natural to call X1 and X2 independent if (3.7.19) is fulfilled for all x1
and x2. We thus set

Definition 3.7.9. Let (Ω, P) be a probability space and X1 and X2 be discrete random
variables on this space. X1 and X2 are called (stochastically) independent, if

    P[X1 = x1, X2 = x2] = P[X1 = x1] · P[X2 = x2]   (3.7.20)

for all x1, x2 ∈ R. If the random variables X1 and X2 are not independent, then we
call them dependent.

In contrast to independent events (see Section 2.7), one can extend this definition without
problems to more than two random variables. We set
Definition 3.7.10. Let (Ω, P) be a probability space and X1, . . . , Xn be discrete random
variables on this space. X1, . . . , Xn are called (stochastically) independent, if

    P[X1 = x1, . . . , Xn = xn] = P[X1 = x1] · . . . · P[Xn = xn]   (3.7.21)

for all x1, . . . , xn ∈ R.

Remark 3.7.11. Note that one has to verify (3.7.21) only for those x1, . . . , xn ∈ R with
P[Xj = xj] > 0 for all 1 ≤ j ≤ n.
Example 3.7.12. We consider again the coin toss of two coins in Example 3.5.12, using
the same notation. We now show that the random variables X1 and X2 are independent.
We have for instance

    P[X1 = 0, X2 = 1] = P[{(T, H)}] = (1 − p) · p = P[X1 = 0] · P[X2 = 1].   (3.7.22)

The computations for the remaining cases are similar and are summarized in Table 3.7.
From this, one sees immediately that X1 and X2 are independent.

X1 \ X2    | 0           | 1       | Row sum
0          | (1−p)·(1−p) | p·(1−p) | 1 − p
1          | p·(1−p)     | p²      | p
Column sum | 1 − p       | p       |

Table 3.7: Joint distribution of X1 and X2 in Example 3.7.12.

If we consider Table 3.7, we see a further advantage of the tabular representation of the
joint distribution: we can easily check whether two random variables are independent.
For this, one only has to check that each entry in the table is the product of its row and
column sums. We have highlighted in Table 3.7 the entries corresponding to
P[X1 = 0, X2 = 1].
Example 3.7.13. Let us consider again Example 3.7.6. Table 3.5 shows that X and |X|
are not independent while |X| and Y are independent.
Example 3.7.14. Let X and Y be independent random variables with

    P[X = 1] = 1/6,  P[X = 2] = 1/3,  P[X = 3] = 1/2  and   (3.7.23)
    P[Y = 1] = 1/4,  P[Y = 2] = 3/4.   (3.7.24)

The computation of the joint distribution of X and Y is illustrated in Table 3.8.

(a) Step 1: The distributions of X and Y

X \ Y      | 1   | 2   | Row sum
1          |     |     | 1/6
2          |     |     | 1/3
3          |     |     | 1/2
Column sum | 1/4 | 3/4 |

(b) Step 2: Completing the remaining entries

X \ Y      | 1         | 2         | Row sum
1          | 1/4 · 1/6 | 3/4 · 1/6 | 1/6
2          | 1/4 · 1/3 | 3/4 · 1/3 | 1/3
3          | 1/4 · 1/2 | 3/4 · 1/2 | 1/2
Column sum | 1/4       | 3/4       |

Table 3.8: The creation of the table of the joint distribution in Example 3.7.14.
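For independent random variables, the joint table is just the outer product of the
marginals. A short sketch of our own:

    # Sketch: for independent X and Y, the joint pmf is the product of the
    # marginals, exactly as in Step 2 of Table 3.8.
    from fractions import Fraction as F

    pX = {1: F(1, 6), 2: F(1, 3), 3: F(1, 2)}
    pY = {1: F(1, 4), 2: F(3, 4)}

    joint = {(x, y): px * py for x, px in pX.items() for y, py in pY.items()}
    print(joint[(2, 2)])                      # 1/4  (= 1/3 * 3/4)
    print(sum(joint.values()))                # 1, so this is a distribution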
Next, we extend Example 3.7.12 and show that the n-fold coin toss is independent.

Example 3.7.15. We toss a coin n times, where in each throw 'Head' occurs with
probability p. For this, we set

    Ω = {0, 1}ⁿ = {(ω₁, ω₂, . . . , ωₙ); ωj ∈ {0, 1}}.   (3.7.25)

We now use the path rule to determine the probability P[ω] of each ω ∈ Ω. As an example,
let us consider n = 5 and ω = (1, 0, 0, 1, 1):

    · →(p) 1 →(1−p) 0 →(1−p) 0 →(p) 1 →(p) 1,   (3.7.26)

and thus P[(1, 0, 0, 1, 1)] = p³(1 − p)². In general, we get

    P[(ω₁, ω₂, . . . , ωₙ)] = pᵏ (1 − p)ⁿ⁻ᵏ  with  k := Σ_{j=1}^n ωj = |{1 ≤ j ≤ n; ωj = 1}|.   (3.7.27)
If one compares the definition of independent events (see Definition 2.7.15) with
Definition 3.7.10, one sees a small difference. For independent events A1, . . . , An, we had
to require the product property for each subfamily of A1, . . . , An to get a suitable
definition of independence. On the other hand, we consider in (3.7.21) only the whole
family. The question at this point is: is (3.7.21) sufficient? If (3.7.21) is a suitable
definition for independent random variables, then

    P[X1 = x1, X2 = x2] = P[X1 = x1] · P[X2 = x2]  and   (3.7.30)
    P[X1 = x1, X2 = x2, X3 = x3] = P[X1 = x1] · P[X2 = x2] · P[X3 = x3]   (3.7.31)

should follow from (3.7.21). In words: if X1, . . . , Xn are independent, then each subfamily
should also be independent, in particular X1, X2 and X1, X2, X3. The following lemma
shows that this is indeed the case: if X1, . . . , Xn are independent and 1 ≤ j1 < . . . < jd ≤ n,
then

    P[Xj1 = xj1, . . . , Xjd = xjd] = P[Xj1 = xj1] · P[Xj2 = xj2] · . . . · P[Xjd = xjd].
Next we consider the expectation of independent random variables. We have

Theorem 3.7.17. Let (Ω, P) be a probability space and X and Y be independent,
discrete random variables on this space with E[|X|] < ∞ and E[|Y|] < ∞. Then the
expectation of X · Y exists and we have

    E[X · Y] = E[X] · E[Y].   (3.7.33)

In particular, the random variables X and Y are then also uncorrelated (see
Definition 3.5.11). Notice that independent random variables are thus always uncorrelated.
However, the reverse direction does not need to be true, see Example 3.7.18.

Proof. We first assume that the expectation of X · Y exists. We then get with
Theorem 3.7.8

    E[X · Y] = Σ_{ω∈Ω} X(ω)Y(ω) P[ω] = Σ_{x,y∈R} xy · P[X = x, Y = y]
             = Σ_{x,y∈R} xy · P[X = x] P[Y = y]
             = (Σ_{x∈R} x · P[X = x]) (Σ_{y∈R} y · P[Y = y]) = E[X] · E[Y].

It remains to prove the existence of E[X · Y]. It follows similarly that

    E[|X · Y|] ≤ (Σ_{x∈R} |x| · P[X = x]) (Σ_{y∈R} |y| · P[Y = y]) = E[|X|] · E[|Y|] < ∞.
As mentioned after Theorem 3.7.17, we now give an example of two random variables X
and Y which are uncorrelated, but not independent.
Example 3.7.18. Let

    Ω = {0, 1, −1}  and  P[0] = 1/2,  P[1] = P[−1] = 1/4.   (3.7.34)

We also define X(ω) = ω and Y(ω) = |X(ω)| = |ω|. We then have

    E[X] = X(−1) · P[−1] + X(0) · P[0] + X(1) · P[1] = −1/4 + 0 + 1/4 = 0,   (3.7.35)
    E[Y] = Y(−1) · P[−1] + Y(0) · P[0] + Y(1) · P[1] = 1/4 + 0 + 1/4 = 1/2,   (3.7.36)
    E[XY] = −1 · P[−1] + 0 · P[0] + 1 · P[1] = 0.   (3.7.37)

We thus have Cov(X, Y) = E[XY] − E[X] E[Y] = 0. Therefore X and Y are uncorrelated.
Obviously, X and Y are not independent. We have for instance

    0 = P[X = 0, Y = 1] ≠ P[X = 0] · P[Y = 1] = 1/2 · 1/2 = 1/4.   (3.7.38)
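This example is also easy to check mechanically (our own sketch):

    # Sketch: X and Y = |X| from Example 3.7.18 are uncorrelated but dependent.
    from fractions import Fraction as F

    P = {0: F(1, 2), 1: F(1, 4), -1: F(1, 4)}

    EX  = sum(w * p for w, p in P.items())            # 0
    EY  = sum(abs(w) * p for w, p in P.items())       # 1/2
    EXY = sum(w * abs(w) * p for w, p in P.items())   # 0
    print(EXY - EX * EY)                              # Cov(X, Y) = 0

    # Dependence: P[X = 0, Y = 1] = 0, but P[X = 0] * P[Y = 1] = 1/4.
    print(P[0] * (P[1] + P[-1]))                      # 1/4 != 0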
Independent random variables have further interesting properties. However, we will not
discuss them here, but we would like to emphasise that independent random variables
are 'user friendly'.
3.8 Law of large numbers

3.8.1 Motivating example

Suppose Peter has to pass a test: he rolls a fair die and passes if the average of the rolled
numbers is at most 4. He may choose whether the die is rolled a) once, b) twice or
c) 1000 times. Which option should he choose? The natural approach is to compute in
all three cases the probability of passing the test and then to compare the results.

Rolling the die once

If the die is rolled only once, we can easily compute this probability. We denote by X1
the number in the first (and only) throw of the die. We get with Proposition 3.3.2

    P['Peter passes'] = P[X1 ≤ 4] = P[X1 = 1] + P[X1 = 2] + P[X1 = 3] + P[X1 = 4]
                      = 4/6 = 2/3.   (3.8.1)

Rolling the die twice

If the die is rolled twice, we denote by X2 the number in the second throw and get

    P['Peter passes'] = P[(X1 + X2)/2 ≤ 4] = P[X1 + X2 ≤ 8] = Σ_{k=2}^8 P[X1 + X2 = k],   (3.8.2)
where we have used for the last equality again Proposition 3.3.2. We thus have to compute
P[X1 + X2 = k] for 2 ≤ k ≤ 8. We have for instance

    P[X1 + X2 = 2] = P[X1 = 1, X2 = 1] = 1/36,
    P[X1 + X2 = 3] = P[X1 = 1, X2 = 2] + P[X1 = 2, X2 = 1] = 2/36.
In a similar way, one can compute P [X1 + X2 = k] for the other values of k. The results
are listed in Table 3.9.
k              | 2    | 3    | 4    | 5    | 6    | 7    | 8    | 9    | 10   | 11   | 12
P[X1 + X2 = k] | 1/36 | 2/36 | 3/36 | 4/36 | 5/36 | 6/36 | 5/36 | 4/36 | 3/36 | 2/36 | 1/36

Table 3.9: Distribution of X1 + X2.
We insert the probabilities of Table 3.9 into equation (3.8.2) and get

    P['Peter passes'] = P[X1 + X2 ≤ 8] = (1 + 2 + 3 + 4 + 5 + 6 + 5)/36 = 26/36.   (3.8.3)

We now clearly have 2/3 = 24/36 < 26/36 and thus option b) is better for Peter than a).
Rolling the die 1000 times

Let us try to argue as in the previous two cases. We thus have here 1000 random variables
X1, . . . , X1000, where Xj is the number in the j-th throw (1 ≤ j ≤ 1000). As before, we roll
the die independently and the random variables X1, . . . , X1000 are therefore independent.
We thus have

    P['Peter passes'] = P[(X1 + X2 + · · · + X1000)/1000 ≤ 4]
                      = P[Σ_{j=1}^1000 Xj ≤ 4000] = Σ_{k=1000}^4000 P[Σ_{j=1}^1000 Xj = k].   (3.8.4)

This is clearly very hard to compute explicitly. For instance, if we would like to compute
P[Σ_{j=1}^1000 Xj = 4000], then we have to find the number of x1, . . . , x1000 with
xj ∈ {1, . . . , 6} and x1 + · · · + x1000 = 4000.
We write X̄_n := (X1 + · · · + Xn)/n for the mean of the first n throws.

Random variable | X1           | (X1 + X2)/2  | X̄_1000
Expectation     | 7/2          | 7/2          | 7/2
Variance        | 35/12 ≈ 2.92 | 35/24 ≈ 1.46 | 35/12000 ≈ 0.003

Table 3.10: Expectation and the variance for the random variables in (3.8.6).

Before we interpret the entries in Table 3.10, let us briefly explain how they are obtained.
We know from Example 3.4.5 and Example 3.5.6 that

    E[X1] = 7/2 = 3.5  and  V(X1) = 35/12 ≈ 2.91.   (3.8.7)

Each Xj clearly has the same distribution as X1. Thus E[Xj] = E[X1] and V(Xj) = V(X1)
for all j. Also, the Xj are independent and thus uncorrelated (see Theorem 3.7.17). It
then follows with Theorem 3.4.12 and Theorem 3.5.13 that

    E[(X1 + X2)/2] = (E[X1] + E[X2])/2 = (7/2 + 7/2)/2 = 7/2,
    V((X1 + X2)/2) = (1/2)² V(X1 + X2) = (1/4)(V(X1) + V(X2)) = V(X1)/2 = 35/24.
The computations for E[X̄_1000] and V(X̄_1000) are similar and we thus omit them.
Table 3.10 shows that all three random variables have the same expectation, but the
variance of X̄_1000 is much smaller than the variances of X1 and (X1 + X2)/2. We have
noted in Section 3.5 that a small variance suggests that a typical value of a random
variable is close to the expectation. Thus Table 3.10 suggests that a typical value of
X̄_1000 is much closer to 3.5 than a typical value of X1 or (X1 + X2)/2. Thus Peter
should choose to roll the die 1000 times, since we then expect with high probability a
value around 3.5.
If one examines the computations in Section 3.8.1 for Table 3.10, one sees immediately
that for the expectation and variance of (X1 + X2)/2 and X̄_1000 we only need the
expectation and variance of X1, since all Xj have the same distribution and are
independent. Thus this computation also holds for X̄_n with arbitrary n. In formulas,
we have

    E[X̄_n] = E[X1]  and  V(X̄_n) = V(X1)/n.   (3.8.9)

Thus the expectation of X̄_n is the same for each n ∈ N, but the variance of X̄_n tends
to 0 as n → ∞. This suggests that X̄_n is for large n very close to E[X1] (at least with
high probability). To see if this intuition is correct, let us simulate the random variable
X̄_n for different values of n. For this, we roll again a fair die and thus X1 has the same
distribution as in Section 3.8.1. For a fixed n, we let the computer simulate many
realisations of X̄_n and draw a histogram of the obtained values.
Figure 3.13: Histograms of the simulated values of X̄_n for (a) n = 1, (b) n = 10,
(c) n = 100, (d) n = 1000, (e) n = 10000.
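Such a simulation can be reproduced with a few lines (our own sketch; we only print
summary numbers instead of drawing histograms):

    # Sketch: simulate the mean of n die rolls many times and watch the
    # spread around 3.5 shrink as n grows (compare Figure 3.13).
    import random

    def mean_of_rolls(n):
        return sum(random.randint(1, 6) for _ in range(n)) / n

    for n in (1, 10, 100, 1000, 10000):
        means = [mean_of_rolls(n) for _ in range(500)]
        avg = sum(means) / len(means)
        spread = (sum((m - avg) ** 2 for m in means) / len(means)) ** 0.5
        print(n, round(avg, 3), round(spread, 3))   # spread ≈ sqrt(35/(12n))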
Lemma 3.8.1. Let (Ω, P) be a (discrete) probability space and X a random variable
on this space with E[X²] < ∞. If V(X) = 0, then P[X = E[X]] = 1.

Lemma 3.8.1 can be interpreted in words as follows: if a random variable X has variance
0, then X is (essentially) a constant.

Proof. We have

    0 = V(X) = Σ_{ω∈Ω} (X(ω) − E[X])² P[ω].   (3.8.11)

Since each summand in (3.8.11) is non-negative, we must have (X(ω) − E[X])² P[ω] = 0
for all ω ∈ Ω. This is only possible if P[ω] = 0 or X(ω) = E[X] for each ω ∈ Ω.
Example 3.8.2. Consider the case X1 ∼ Bernoulli(1/2). We then have E[Xj] = 1/2. We
first look at even n and write n = 2N. We get

    P[X̄_n = 1/2] = P[Σ_{j=1}^n Xj = n/2] = P[Σ_{j=1}^{2N} Xj = N] = C(2N, N)/2^{2N},   (3.8.13)

since the sum of 2N independent Bernoulli(1/2) distributed random variables is
Binomial(2N, 1/2) distributed, see Section 3.6.3. Applying Stirling's formula (see (2.5.9))
to C(2N, N) then shows that

    P[X̄_n = 1/2] ∼ 1/√(πN) → 0  as n → ∞.   (3.8.14)

If n is odd, then n/2 is not an integer and thus

    P[X̄_n = 1/2] = P[Σ_{j=1}^n Xj = n/2] = 0.   (3.8.15)
The observation in Example 3.8.2 is indeed true for most distributions of X1, including
the one originating from the throw of a fair die. We do not give the proof since it is
(much) more complicated than the computations in Example 3.8.2.
Does Example 3.8.2 show that our intuition 'X̄_n → E[X1]' is wrong? Let us consider
the following situation:
  If we toss a fair coin 2000 times, would it be surprising to get 'Head' 1003 times
  instead of 1000 times?
  If we roll a fair die 3000 times, would it be surprising to see the 6 exactly 502 times?
Most people would say: of course not! We thus can only expect that 'X̄_n is approximately
E[X1]' with high probability, but not that 'X̄_n is equal to E[X1]' with high probability.
The sentence 'X̄_n is approximately E[X1]' is of course not a rigorous mathematical
expression. We make this statement mathematically rigorous by introducing a small error
ε, with ε > 0 a fixed constant. We now consider the question: what is the probability
that the distance between X̄_n and E[X1] is smaller than ε? We have given in Figure 3.14
an illustration of this question using the histogram in Figure 3.13.(b). The red part
corresponds to all outcomes where the distance between X̄_n and E[X1] is smaller than ε
and the blue part corresponds to all outcomes where the distance is greater than ε.
Notice that 'the distance between X̄_n and E[X1] is smaller than ε' is written in formulas
as |X̄_n − E[X1]| < ε and the probability of this event is

    P[|X̄_n − E[X1]| < ε].   (3.8.16)

If our intuition is correct, then P[|X̄_n − E[X1]| < ε] should converge to 1 as n → ∞.
Furthermore, this should be true for each fixed ε > 0. This is indeed true and is known
as the Law of large numbers.
Figure 3.14: Illustration of the event 'the distance between X̄_n and E[X1] is less than ε'.
Theorem 3.8.3 (Law of large numbers). Let (Ω, P) be a (discrete) probability space
and X1, X2, . . . , Xn be independent random variables with the same distribution and
E[X1²] < ∞. We then have for any fixed ε > 0

    P[|X̄_n − E[X1]| < ε] → 1  as n → ∞.   (3.8.17)

Theorem 3.8.3 can be interpreted in words as: the mean X̄_n of the random variables X1,
. . . , Xn is for large n close to E[X1] with high probability, and this probability tends to 1
as n → ∞. Theorem 3.8.3 might look a little abstract at first. Let us thus give an example
of how to use it.
Example 3.8.4. Suppose we roll a fair die n times independently. How often does the 3
occur? If n is small, we cannot say much about it. However, if n is large, we can use
the law of large numbers. We model this experiment by introducing a sequence of random
variables X1, X2, . . . , Xn with

    Xj = 1, if 3 occurs in the j-th throw,
         0, if 3 does not occur in the j-th throw.   (3.8.18)

By assumption, we roll the die independently and therefore X1, X2, . . . , Xn are
independent. Also, it is clear that Xj ∼ Ber(p) with p = 1/6. Since each Xj is either 0
or 1, we can interpret

    X1 + · · · + Xn = Σ_{j=1}^n Xj = 'the number of rolled 3's'   (3.8.19)

and therefore

    X̄_n = (X1 + · · · + Xn)/n = (1/n) Σ_{j=1}^n Xj = 'the fraction of rolled 3's'.   (3.8.20)

We have E[X̄_n] = E[X1] = 1/6. Let us now consider the error ε = 1/100. The law of
large numbers then tells us that

    P[|X̄_n − 1/6| < 1/100] → 1  as n → ∞,

i.e. for large n, the fraction of rolled 3's lies with high probability between 1/6 − 1/100
and 1/6 + 1/100.
Theorem 3.8.5 (Chebyschev inequality). Let (Ω, P) be a probability space and X be
a (discrete) random variable with E[X²] < ∞. We then have for each ε > 0

    P[|X − E[X]| ≥ ε] ≤ V(X)/ε².   (3.8.23)
We have mentioned the Chebyschev inequality already in Section 3.5, see (3.5.17). We also
have given there some examples on how it can be used to get bounds for some probabilities,
see the Examples 3.5.7, 3.5.9 and 3.5.10.
Proof. We have

    V(X) = E[(X − E[X])²] = Σ_{x∈R} (x − E[X])² · P[X = x]
         ≥ Σ_{x∈R: |x−E[X]|≥ε} (x − E[X])² · P[X = x] ≥ ε² Σ_{x∈R: |x−E[X]|≥ε} P[X = x]
         = ε² P[|X − E[X]| ≥ ε].

We have used in the first inequality that squares are always non-negative. Division by ε²
completes the proof.
Proof of Theorem 3.8.3. Notice that (3.8.17) is equivalent to

    lim_{n→∞} P[|X̄_n − E[X1]| ≥ ε] = 0.   (3.8.24)

We apply the Chebyschev inequality to X̄_n and get with (3.8.9)

    P[|X̄_n − E[X1]| ≥ ε] ≤ V(X̄_n)/ε² = V(X1)/(n ε²).   (3.8.25)

The right hand side of (3.8.25) thus converges to 0 as n → ∞. This completes the proof
since the left hand side is always non-negative.
Remark 3.8.6.
The assumption E[Xj²] < ∞ in Theorem 3.8.3 is not necessary and can be dropped.
However, the proof without this assumption is considerably more complex and we thus
do not give it here.
The law of large numbers can be used to verify that our definition of a probability
measure has the limiting property described on Page 9 and at the beginning of this
section. The argument is similar to Example 3.8.4.
Theorem 3.8.3 is the so-called weak law of large numbers. There is also a strong law of
large numbers. It states that

    (1/n) Σ_{i=1}^n Xi → E[X1]

for almost all ω ∈ Ω. This convergence is much stronger than the convergence in
Theorem 3.8.3, but we will not explain the difference here. The strong law of large
numbers also requires the existence of infinitely many identically distributed random
variables on the same probability space. Unfortunately, this is not possible on a
discrete probability space. This is not necessary for the weak law of large numbers.
One can now consider the following game: Tom and Jerry toss a fair coin several times.
Each time 'Head' occurs, Tom gets £1 from Jerry; otherwise Jerry gets £1 from Tom.
The law of large numbers now states that the average win or deficit per game goes to 0
as the number of rounds tends to infinity. Does this mean that with high probability
both players are in the lead around 50% of the time in this game? No! The Arcsine law
says the converse: the probability is very high that the same player is leading almost
the whole time. We will not study this here.
3.9 Fair games

Example 3.9.1. Peter plays the following game: he receives £3.20, then rolls a fair die
and pays £1 for each point rolled. His gain (in £) is thus distributed as follows:

Number | 1   | 2   | 3   | 4    | 5    | 6
Gain   | 2.2 | 1.2 | 0.2 | −0.8 | −1.8 | −2.8
P[·]   | 1/6 | 1/6 | 1/6 | 1/6  | 1/6  | 1/6

The gain and deficit in this game do not cancel each other, i.e.

    E[gain] = (2.2 + 1.2 + 0.2 − 0.8 − 1.8 − 2.8) · 1/6 = −0.3.
The average deficit per game is thus £0.30. One cannot really call such a game fair. On
the other hand, if Peter would get £3.50 instead of £3.20, then on average the gain and
deficit would be the same. Thus no side would have an advantage. In this case, we could
call the game fair.
We will thus call a game fair if nobody has an advantage, so that gain and deficit 'cancel
in the long run'. If we think about the Law of large numbers, this is nothing other than
requiring that the expected gain per game is 0.
Definition 3.9.3. Let X be the gain of a betting game (negative values stand for
deficit). We call this game fair, if the expected gain is 0, i.e. if
E [X] = 0.
This definition is now purely mathematical and does not require our intuition. This is a
big advantage, because our intuition can often be misleading. For instance, one could try
to argue in Example 3.9.1 that '3 times 1/6 is 0.5!'. Another such situation is the Monty
Hall problem in Example 2.7.8.
Example 3.9.4. We see in Figure 3.15 a wheel of fortune, where each sector has the
same probability to occur. The numbers in the sectors stand for the money we win if
the corresponding sector occurs. If we would like to take part in this game, we first
have to pay a certain amount E. How large must E be so that this is a fair game?
Let Y be the money we can win in this game without considering E. We then have
E[Y] = 5 · 1/5 + 1 · 2/5 = 1.4. The total gain of this game is X = Y − E and the game
is fair if 0 = E[X] = E[Y − E] = 1.4 − E. Thus E must be £1.40.
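A fair entry price can be computed mechanically (our sketch, using the wheel from
Example 3.9.4):

    # Sketch: the fair entry price of a betting game is the expected payout,
    # here for the wheel of fortune from Example 3.9.4.
    payouts = {5: 0.2, 1: 0.4, 0: 0.4}           # payout -> probability

    expected_payout = sum(v * p for v, p in payouts.items())
    entry = expected_payout                       # then E[Y - entry] = 0
    print(entry)                                  # 1.4, i.e. £1.40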
Example 3.9.5. Victor and Peter play the following game: Victor can toss a fair coin as
often as he would like, but at most 4 times. Each time Victor tosses 'Head', he gets £1
from Peter. Each time Victor tosses 'Tail', he has to pay £1 to Peter. Let us now consider
some strategies for Victor and decide which one he should use.
Let Xj be the gain of Victor under strategy Sj. Obviously, we have E[X0] = 0, since we
have for a fair coin P['Head'] = P['Tail'] = 1/2.
Let us now consider strategy S1. Since Victor stops after the first 'Head', the only possible
realisations are 'H', 'TH', 'TTH', 'TTTH' and 'TTTT'. One computes E[X1] = 0 in the
same way. For strategy S2, the distribution of the gain is

X2   | 2   | 1   | 0    | −2   | −4
P[·] | 1/4 | 1/4 | 3/16 | 4/16 | 1/16

and thus E[X2] = 2 · 1/4 + 1 · 1/4 + 0 · 3/16 − 2 · 4/16 − 4 · 1/16 = 0.
The following theorem shows that this example is only a special case of a more general
phenomenon.
Theorem 3.9.6. In a fair game, each strategy leads to an expected gain of 0.

The definition of a fair game used in Theorem 3.9.6 is slightly more general than the one
in Definition 3.9.3. However, we will not go into details. Furthermore, we should mention
that in Theorem 3.9.6 only strategies are allowed which cannot see into the future and
which have to stop after finitely many steps.
One can ask oneself whether betting games in the casino are fair. As an example, let us
consider Roulette.

Example 3.9.7. In Roulette, one can bet on the numbers 0 to 36, where each number has
the same probability to occur. There are many allowed combinations of bets; also, the 0
has a special role. We consider here only betting on a single number (French 'Plein',
English 'Full number'). Furthermore, we ignore the special role of the 0. If we bet on the
right number, then the payout ratio is 35:1. If we bet on the wrong number, then we lose
our money. Suppose we bet £1; then the average gain per game is

    35 · 1/37 + (−1) · 36/37 = −1/37 < 0.
We thus see that this game is not fair. This is also true for all other allowed combinations,
with or without the special role of the 0. A possible question at this point is: 1/37 ≈ 0.03
is very small, so how much money can the casino make with this advantage? Let us make
a rough computation for one roulette table. Suppose the casino is open 8 hours, 20 rounds
of roulette are played per hour and 10 guests bet £20 per round. Thus around £32,000
pass over the roulette table in one day. The law of large numbers then states that the
gain for this table is around 32000/37 ≈ £865 per day.
Example 3.9.7 shows that in Roulette, the casino is the winner in the long run. This is
also true for all other games in the casino. At this point, one should not really be
surprised, since the aim of the casino is to make money.
3.10 Central limit theorem

We discovered in Section 3.8 with the law of large numbers a limit theorem of probability
theory. Such limit theorems are the strength of probability theory, since they show that
there are regular patterns in the apparent chaos of random sequences. Also, we briefly
illustrated at the beginning of Section 3.8 why the law of large numbers is an important
result. However, we cannot use the law of large numbers for concrete computations. As
an example, suppose that we toss a fair coin 1000 times with success probability p = 1/2.
The law of large numbers then tells us that the number of 'Heads' should be close to 500,
but that is all we get out of it. We could use the Chebyschev inequality as in
Example 3.5.7 to get some upper or lower bounds, but they are often not very good.
Much more useful is the central limit theorem. Together with the law of large numbers,
it is something like the first and second principle of probability theory. However, the
name central limit theorem does not come from its central role in probability theory; it
comes from the fact that it approximates very well events in the centre, i.e. events that
are close to the expectation.
Let us first recall some facts about histograms. Histograms are used to present large
datasets consisting of real numbers, which are too large to be presented in a table. If a
large dataset is given, then a histogram is constructed as follows:
  choose a range covering the data and split it into intervals,
  determine the fraction of the values falling into each interval and
  draw over each interval a rectangle with area equal to the fraction of values.

Example 3.10.1. Consider the dataset

    L = {3.5, 2.75, 3.06, 4.7, 3.52, 3.99, 3.12, 3.90, 2.98, 2.81, 4.07, 4.01, 3.74, 3.56}.   (3.10.1)

We use the range [2, 5] and the intervals I1 = [2, 3], I2 = [3, 3.5], I3 = [3.5, 4], I4 = [4, 5].
The numbers and fractions of values in each interval are displayed in Table 3.14 and the
resulting histogram in Figure 3.16.

         | I1    | I2    | I3    | I4
number   | 3     | 2     | 6     | 3
fraction | ≈ 22% | ≈ 14% | ≈ 42% | ≈ 22%

Table 3.14: Number and fraction of values of the dataset L in Example 3.10.1.
Figure 3.16: Histogram for Example 3.10.1
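The bookkeeping for such a histogram is easy to script (our own sketch, reproducing
Table 3.14):

    # Sketch: count the fraction of values of L in each histogram interval.
    L = [3.5, 2.75, 3.06, 4.7, 3.52, 3.99, 3.12, 3.90, 2.98, 2.81,
         4.07, 4.01, 3.74, 3.56]
    intervals = [(2, 3), (3, 3.5), (3.5, 4), (4, 5)]

    for lo, hi in intervals:
        count = sum(1 for x in L if lo <= x < hi)   # half-open [lo, hi)
        print((lo, hi), count, f"{100 * count / len(L):.0f}%")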
We would like to make some brief comments about histograms. Often one uses the number
instead of the fraction of values in the intervals. Also, one normally uses intervals of the
same size, but this is not essential. An important point is how one determines the number
of intervals. A usual method is Sturges' rule. We denote by N the number of values in
the dataset and by k the number of intervals. Then Sturges' rule says

    k ≈ log₂ N for large N  and  k ≈ √N for small N.   (3.10.2)

Let us now return to the simulations of X̄_n from Section 3.8, where X1, . . . , Xn are the
rolls of a fair die, i.e.

    P[Xj = 1] = P[Xj = 2] = . . . = P[Xj = 6] = 1/6.   (3.10.3)

We used in Figure 3.13 the range [1, 6]. This time we use a range which is better adapted
to the obtained data. This then gives Figure 3.17.
Figure 3.17: Histograms for the mean X̄_n for the roll of a fair die.
We see in Figure 3.17 that all histograms have the same shape, but a different range.
We will see that this is a general phenomenon. Let us now illustrate how one can use the
histograms in Figure 3.17 to approximate the probabilities of some events. Suppose we
are interested in the probability

    P[3.48 ≤ X̄_10000 ≤ 3.51].   (3.10.4)

As we have seen in Section 3.8.1, this is very hard to compute explicitly. However, we have
simulated X̄_10000 with the computer to obtain Figure 3.17 and thus have a large dataset
with experimental values of X̄_10000. A small part of this dataset is

    [3.498, 3.507, 3.464, 3.529, 3.512, 3.510, 3.488, 3.501, 3.505, 3.500, 3.551, 3.503].   (3.10.5)

Since the dataset is large, the probability P[3.48 ≤ X̄_10000 ≤ 3.51] should be close to the
fraction of values in the interval [3.48, 3.51] in this dataset. If for instance 25% of the
values are in [3.48, 3.51], then we expect that P[3.48 ≤ X̄_10000 ≤ 3.51] ≈ 0.25. Notice that
one draws in a histogram over each used interval a rectangle such that its area is the
fraction of values in this interval. If we thus would like to determine the fraction of values
in the interval [3.48, 3.51], we have to determine the area of all rectangles over [3.48, 3.51].
This corresponds to the area of the red part in Figure 3.18.(a). Obviously, we can
approximate the shape of the histogram by a nice curve f(x), see Figure 3.18.(b).

Figure 3.18: Illustration of the probability P[3.48 ≤ X̄_10000 ≤ 3.51]; (a) area of the
histogram rectangles, (b) the approximating curve f(x).

Thus we can approximate the area of all rectangles over [3.48, 3.51] by the area between
the curve f(x) and the x-axis over the interval [3.48, 3.51]. We thus get

    P[3.48 ≤ X̄_10000 ≤ 3.51] ≈ ∫_{3.48}^{3.51} f(x) dx,   (3.10.6)
by the definition of the integral of the function f. The question at this point is: what
is f?
The function f in Figure 3.18.(b) looks similar to the Gaussian bell curve e^(−x²). Some
shifting and stretching of e^(−x²) then shows that the Gaussian bell curve is indeed a
good candidate for the function f. More precisely, if we write µ := E[X1] and σ² := V(X1),
then we can use the function

    f(x) = 1/√(2π σ²/n) · e^(−(x−µ)²/(2σ²/n)),   (3.10.7)

which is the curve plotted in Figure 3.18.(b). An interesting observation is that f only
depends on the number of throws n, the expectation µ = E[X1] and the variance
σ² = V(X1), but not otherwise on the distribution of X1. The important point is now that
the behaviour of X̄_n for an arbitrary random variable X1 is almost the same as for the
roll of a fair die, i.e. the histograms as in Figure 3.18 have the same shape and this shape
can be very well approximated by the function f in (3.10.7). However, it is often not so
convenient to work directly with X̄_n. It is indeed much more convenient to shift and
scale X̄_n such that it has expectation 0 and variance 1. In formulas, this means one
normally works with

    X̃_n := √(n/σ²) (X̄_n − µ) = (Σ_{j=1}^n Xj − nµ)/√(nσ²) = Σ_{j=1}^n (Xj − µ)/√(nσ²).   (3.10.8)
Exercise 3.10.2.
a) Verify that X̃_n = (X̄_n − E[X̄_n])/√(V(X̄_n)),
b) check that E[X̃_n] = 0 and V(X̃_n) = 1.
Let us now formulate the above observation in a mathematically rigorous way using the
random variable X̃_n. This is an important result and is known as the central limit
theorem.
Theorem 3.10.3 (Central limit theorem). Let (Ω, P) be a (discrete) probability space
and X1, X2, . . . , Xn be independent random variables with the same distribution and
E[X1] = µ, V(X1) = σ² > 0. We then have for all a, b with −∞ ≤ a < b ≤ ∞

    P[a ≤ Σ_{j=1}^n (Xj − µ)/√(nσ²) ≤ b] → 1/√(2π) ∫_a^b e^(−x²/2) dx = Φ(b) − Φ(a)   (3.10.9)

as n → ∞, where Φ is defined as

    Φ(t) := 1/√(2π) ∫_{−∞}^t e^(−x²/2) dx  for all t ∈ R.   (3.10.10)

We will not give a proof of Theorem 3.10.3 here, since this requires some probabilistic
tools which we will not discuss in this course. The special case X1 ∼ Ber(p) can be proved
quite directly with Stirling's formula. However, this is a little technical and we thus will
not give it either.
Remark 3.10.4.
a) The integral on the RHS of (3.10.9) has a probabilistic interpretation: it is the
probability P[a ≤ N ≤ b], where N is a continuous random variable with normal
distribution (sometimes also called Gaussian distribution). We will not go into further
detail, since we consider here only discrete random variables.
b) The technical term for the convergence in Theorem 3.10.3 is convergence in law or
convergence in distribution. We will not need this notion and will thus not discuss it
further.
Let us now show how to use the central limit theorem to get approximations for sums of
independent random variables. The idea behind this is as follows: suppose that (cₙ)ₙ∈N
is a sequence with cₙ → c. If we know c explicitly and are interested in cₙ for large n,
then we can hope that cₙ ≈ c. In view of Theorem 3.10.3, we hope that we have for
large n

    P[a ≤ Σ_{j=1}^n (Xj − µ)/√(nσ²) ≤ b] ≈ 1/√(2π) ∫_a^b e^(−x²/2) dx.   (3.10.11)

Considering again the probability in (3.10.4), the approximation in (3.10.6) is precisely
what we get with (3.10.11). Thus one of the advantages of the central limit theorem is
that it is no longer necessary to create a huge dataset of samples to obtain
approximations. Note that one can replace ≤ by < in (3.10.11) without further ado.
We would like to highlight at this point that (3.10.11) very often gives a good
approximation, but in general we do not have equality in (3.10.11). Especially for
discrete random variables, there is no hope of having equality in (3.10.11).
Let us now give examples of how to use the central limit theorem (and thus (3.10.11)) to
get approximations for events involving sums of independent random variables.

Example 3.10.5. Suppose 120 persons have booked a certain flight. The experience of
the airline shows that on average only 90% of the passengers actually take the flight.
What is the probability that more than 105 passengers take the flight? To answer this
question, we number the passengers from 1 to 120 and define for 1 ≤ j ≤ 120

    Xj = 1, if passenger j takes the flight,
         0, otherwise.   (3.10.12)

Then Xj ∼ Ber(p) with p = 0.9, and we assume the Xj to be independent. The number of
passengers taking the flight is X1 + · · · + X120 with E[Σ Xj] = 120p = 108 and
V(Σ Xj) = 120 p(1 − p) = 10.8. We standardize as in (3.10.8) and set

    b := (105 − 120p)/√(120 p(1 − p)) = (105 − 108)/√10.8 ≈ −0.91.

This gives

    P[X1 + . . . + X120 ≤ 105] = P[Σ_{j=1}^{120} (Xj − p)/√(120 p(1 − p)) ≤ b].   (3.10.16)

Using (3.10.11) for the RHS of (3.10.16) gives the approximation

    P[Σ_{j=1}^{120} (Xj − p)/√(120 p(1 − p)) ≤ b] ≈ Φ(b) ≈ Φ(−0.91) ≈ 0.18.   (3.10.17)

The probability that more than 105 passengers take the flight is thus approximately
1 − 0.18 = 0.82.
The procedure in the general case is (almost) the same as in Example 3.10.5. We start
with a sum of independent random variables, subtract its expectation and divide by the
square root of its variance. This gives a new random variable with expectation 0 and
variance 1, to which we can apply (3.10.11). Let us illustrate this with a further example.
Example 3.10.6. Suppose that an employee at the customer service desk of a phone
company needs on average 5 minutes to handle a customer's call. We assume that the
times needed per call are Poisson distributed and independent of each other. Suppose
that the employee receives 100 calls a day. What is the probability that the time needed
per call is on average more than 6 minutes? We denote by Xj the time needed for the
j-th call. Thus Xj ∼ Poi(5) and the Xj are independent of each other. We are therefore
interested in

    P[(X1 + · · · + X100)/100 > 6] = P[X̄_100 > 6].   (3.10.18)

We now have n = 100, µ = E[X1] = 5 and σ² = V(X1) = 5, and thus

    P[X̄_100 > 6] = P[X̄_100 − E[X̄_100] > 6 − E[X̄_100]]
                 = P[(X̄_100 − E[X̄_100])/√(V(X̄_100)) > (6 − E[X̄_100])/√(V(X̄_100))]
                 = P[√(n/σ²)(X̄_100 − µ) > (6 − 5) √(100/5)] = P[√(n/σ²)(X̄_100 − µ) > √20].

We now know from (3.10.8) that the last expression has the required form. We thus set
a = √20 and b = ∞ and get with (3.10.11)

    P[X̄_100 > 6] ≈ Φ(∞) − Φ(√20) = 1 − Φ(√20) ≈ 0.
One important question at this point is of course: how good is the approximation in
(3.10.11)? The quality of the approximation clearly depends on the distribution of X1,
and one can easily construct examples such that the approximation in (3.10.11) is very
bad even for large n. However, such examples are normally very exotic. We can say
roughly that the approximation for 'real life examples' is typically very good, even for
relatively small values of n. As a general guideline, one can use (3.10.11) for 'real life
examples' for n ≥ 30. As an illustration, let us take a look at an example where we can
compute the probability P[Σ_{j=1}^n Xj ≤ x] exactly.
Example 3.10.7. Let us roll a fair die 6000 times independently. What is the probability
of rolling the 6 at least 1100 times? Similarly to Example 3.10.5, we define for
1 ≤ j ≤ 6000

    Xj = 1, if 6 occurs in the j-th throw,
         0, otherwise.   (3.10.19)

We then have that Xj ∼ Ber(p) with p = 1/6 and all Xj are independent. We are thus
interested in the probability

    P[Σ_{j=1}^{6000} Xj ≥ 1100] = Σ_{k=1100}^{6000} C(6000, k) (1/6)ᵏ (5/6)⁶⁰⁰⁰⁻ᵏ.   (3.10.20)

A naive approach would be to type this expression directly into a computer.
Surprisingly, this expression with n = 6000 is already large enough that many computer
programs run into problems. However, there are several programs which can work
directly with the binomial distribution, for instance Maple or R. If one uses such a
program, one gets

    P[Σ_{j=1}^{6000} Xj ≥ 1100] ≈ 0.00029.
Let us now use the central limit theorem to compute an approximation for P [X ≥ 1100].
We have n = 6000, µ = E [X1 ] = 16 and σ 2 = V(X1 ) = 61 · 65 . We thus get
"6000 # " P6000 # " P6000 #
X j=1 Xj − np
1100 − np j=1 X j − 1000 100
P Xj ≥ 1100 = P p ≥p =P 100
√ ≥ 100 √
j=1 np(1 − p) np(1 − p) 6
3 6
3
Z ∞
1 2 √
≈ √ √ e−x /2 dx = 1 − Φ(2 3) = 0.00027.
2 3 2π
We thus see that our approximation agrees with the exact value to 4 decimal places;
the magnitude of the error is in this case 1/100 %. We would like to point out here
that n = 6000 is already quite big and that one cannot expect the error to be this tiny
for relatively small n. Indeed, if we consider n = 300 and look at
P[X_1 + . . . + X_300 ≥ 55], then a computer program gives ≈ 0.24, while the central
limit theorem gives 1 − Φ(0.77) ≈ 0.22. The error is thus no longer as tiny as before,
but it is still reasonably small.
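A sketch of this comparison in Python, again assuming scipy is available; the helper function clt_vs_exact is a hypothetical name introduced here for illustration:

```python
from math import sqrt
from scipy.stats import binom, norm

def clt_vs_exact(n, x, p=1/6):
    """Compare the exact value of P[X_1 + ... + X_n >= x] for Bernoulli(p)
    summands with its central limit theorem approximation."""
    exact = binom.sf(x - 1, n, p)                # exact binomial tail
    z = (x - n * p) / sqrt(n * p * (1 - p))      # standardised threshold
    return exact, 1 - norm.cdf(z)

print(clt_vs_exact(6000, 1100))  # approximately (0.00029, 0.00027)
print(clt_vs_exact(300, 55))     # approximately (0.24, 0.22)
```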
Another point we have to discuss briefly is that the Gaussian density e^{−x^2/2}/√(2π)
has no elementary antiderivative. It is thus not possible to give a closed expression
for its antiderivative Φ(x). However, with our modern computers, calculators and
smartphones, this is not a big problem: a sufficiently accurate approximation of Φ(x)
(for a given x ∈ R) can be obtained in a very short time with low computing power and
a simple algorithm. For completeness, we would like to show here the classical approach.
Without a computer, it obviously takes much time to compute Φ(x) manually. One has
therefore calculated Φ(x) manually for many values of x and tabulated these values.
Table 3.10 shows a small excerpt from this table. The full table can be found in
Appendix C.
0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359
0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753
0.2 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141
0.3 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517
0.4 0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879
The table reads as follows: to find Φ(x) for a value x between 0 and 3.49, locate the
first decimal place of x in the left margin of the table and the second decimal place
in the header row. As an example, we have Φ(0.14) = 0.5557. For x ≥ 3.5, we use
Φ(x) ≈ 1. For x < 0, one uses that Φ(x) = 1 − Φ(−x); for instance
Φ(−0.09) = 1 − Φ(0.09) = 0.4641. If one requires Φ(x) for an x with more than two
decimal places, one interpolates between the values in the table. We will not explain this here.
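In practice one rarely needs the printed table: Φ(x) can be evaluated through the error function erf, which is available in essentially every programming language. A minimal sketch in Python using only the standard library:

```python
from math import erf, sqrt

def Phi(x: float) -> float:
    """Standard normal distribution function, via the identity
    Phi(x) = (1 + erf(x / sqrt(2))) / 2; erf handles negative x directly."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

print(round(Phi(0.14), 4))   # 0.5557, matching the table entry above
print(round(Phi(-0.09), 4))  # 0.4641, consistent with Phi(-x) = 1 - Phi(x)
```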
Note that there are also other useful limit theorems apart from the central limit theorem.
One is for instance the Poisson limit theorem, which can be used for sums of independent
Bernoulli random variables with small success probability. In this situation, the Poisson
limit theorem often gives better approximations than the central limit theorem. However,
we will not discuss this here.
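To give a flavour nevertheless (purely as an illustration, not part of the course material), the following sketch compares the exact binomial tail with both approximations for one arbitrarily chosen small success probability, assuming scipy is available:

```python
from scipy.stats import binom, norm, poisson

# X ~ Bin(1000, 0.005): a sum of 1000 Bernoulli variables with small
# success probability; E[X] = 5. We approximate P[X >= 10].
n, p, x = 1000, 0.005, 10
exact = binom.sf(x - 1, n, p)
pois = poisson.sf(x - 1, mu=n * p)          # Poisson limit theorem
z = (x - n * p) / (n * p * (1 - p)) ** 0.5
clt = 1 - norm.cdf(z)                       # central limit theorem
print(exact, pois, clt)  # the Poisson value is much closer to the exact one
```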
Appendix A

Sums

A.1 Finite sums

A long sum such as x_1 + x_2 + · · · + x_12 can be abbreviated with the summation symbol ∑:

x_1 + x_2 + · · · + x_12 = ∑_{i=1}^{12} x_i. (A.1.1)

This notation indicates that we let i vary over all integers between 1 and 12 and then
build the sum over all terms which occur when we substitute i into the expression in the
sum. We have for instance
1 + 2 + 3 + 4 + 5 = ∑_{i=1}^{5} i,   2·3 + 2·4 + 2·5 + 2·6 = ∑_{i=3}^{6} 2i. (A.1.4)
The variable i is called the summation index. It is an auxiliary variable, which can be
replaced by any other variable. We thus have

∑_{i=1}^{12} x_i = ∑_{j=1}^{12} x_j = ∑_{k=1}^{12} x_k.
Of course, we can also use upper and lower limits other than 1 and 12. For instance,

∑_{i=a}^{b} x_i = x_a + x_{a+1} + · · · + x_b.

One requires at this point that a ≤ b, and one defines ∑_{i=a}^{b} x_i = 0 if b < a.
In the special case a = b, one gets

∑_{i=a}^{a} x_i = x_a.
Until now, we have used the summation index mostly as an index for the summands. The
expression we would like to sum can of course depend in a different way on the summation
index. In general, we thus have

∑_{i=a}^{b} f(i) = f(a) + f(a+1) + · · · + f(b−1) + f(b), (A.1.5)

where f is a function which assigns to each i between a and b a value. To make this a
little bit clearer, we discuss an example in detail. Consider

∑_{i=1}^{7} (2i)^i.

Here i is the summation index; it starts at 1 and runs to 7. We thus have to build
the sum over all i ∈ {1, 2, 3, 4, 5, 6, 7}. When we insert the value i = 1 into the term
in the sum, we obtain (2·1)^1. For i = 2 we get (2·2)^2, for i = 3 we get (2·3)^3, . . . ,
and finally for i = 7 we get (2·7)^7. We thus have

∑_{i=1}^{7} (2i)^i = (2·1)^1 + (2·2)^2 + (2·3)^3 + (2·4)^4 + (2·5)^5 + (2·6)^6 + (2·7)^7.
The sum written with ∑ is much more compact, and it is much easier to keep an overview.
Consider for instance

∑_{i=1}^{100} i(i+1)(i+2) = 1·2·3 + 2·3·4 + · · · + 100·101·102. (A.1.6)

If one writes out the right-hand side of (A.1.6) term by term, one gets

6 + 24 + 60 + · · · + 1030200.

In this form, obviously almost nobody would be able to recognise the right sum. Often,
apart from the summation index, other parameters occur in the expression inside the sum.
Similarly to the integration of functions, the summation index determines over which
parameter one builds the sum. Let us consider the following example:
∑_{i=1}^{4} x_{ki} = x_{k1} + x_{k2} + x_{k3} + x_{k4},

∑_{k=1}^{4} x_{ki} = x_{1i} + x_{2i} + x_{3i} + x_{4i}.
To write a given sum as in (A.1.1) with the help of the sum symbol ∑ is more difficult
and requires that one recognises the structure of the summands.
Example A.1.1.
a) Let us study the sum

1 + 5 + 5^2 + · · · + 5^n.

Note that 5^0 = 1 and 5^1 = 5. Thus the sum is 5^0 + 5^1 + · · · + 5^n and the general form
of the term is 5^k. We thus have

1 + 5 + 5^2 + · · · + 5^n = ∑_{k=0}^{n} 5^k.
b) If one considers the sum y + 2y^2 + 3y^3 + 4y^4 + 5y^5, then the general term has the form
k·y^k and k varies between 1 and 5. We thus have

y + 2y^2 + 3y^3 + 4y^4 + 5y^5 = ∑_{k=1}^{5} k y^k.
c) We consider next

\binom{5}{4} q + 4 \binom{5}{3} q^2 + 9 \binom{5}{2} q^3 + 16 \binom{5}{1} q^4 + 25 q^5.

The exponent of q starts at 1 and ends at 5. The binomial coefficients have the form
\binom{5}{5−k}, where 5−k runs from 4 down to 0. The coefficient in front of the binomial
coefficient has the form k^2, where k runs from 1 to 5. This gives

\binom{5}{4} q + 4 \binom{5}{3} q^2 + 9 \binom{5}{2} q^3 + 16 \binom{5}{1} q^4 + 25 q^5 = ∑_{k=1}^{5} k^2 \binom{5}{5−k} q^k.
Let us now consider some properties of the sum notation.

Lemma A.1.2. Let m ≤ n be integers, let a_m, . . . , a_n and b_m, . . . , b_n be real numbers
and let c ∈ R. We then have

∑_{k=m}^{n} (a_k + b_k) = ∑_{k=m}^{n} a_k + ∑_{k=m}^{n} b_k,   ∑_{k=m}^{n} (a_k − b_k) = ∑_{k=m}^{n} a_k − ∑_{k=m}^{n} b_k,

∑_{k=m}^{n} c·a_k = c · ∑_{k=m}^{n} a_k.

Proof. We have
∑_{k=m}^{n} (a_k + b_k) = (a_m + b_m) + (a_{m+1} + b_{m+1}) + · · · + (a_n + b_n)
= (a_m + · · · + a_n) + (b_m + · · · + b_n) = ∑_{k=m}^{n} a_k + ∑_{k=m}^{n} b_k

and

∑_{k=m}^{n} c·a_k = (c·a_m) + (c·a_{m+1}) + . . . + (c·a_n) = c·(a_m + · · · + a_n) = c · ∑_{k=m}^{n} a_k.
The remaining equations are proven similarly and we thus omit the proofs.
Remark A.1.3. In general, we have

∑_{k=m}^{n} a_k·b_k ≠ ( ∑_{k=m}^{n} a_k ) · ( ∑_{k=m}^{n} b_k ). (A.1.11)
Some important sums are collected in the following lemma.

Lemma A.1.5. Let n ∈ N and a, b, q ∈ R with q ≠ 1. We then have

∑_{i=1}^{n} i = n(n+1)/2,   ∑_{i=1}^{n} i^2 = n(n+1)(2n+1)/6, (A.1.12)

∑_{i=1}^{n} i^3 = (n(n+1)/2)^2 = n^2(n+1)^2/4,   ∑_{i=0}^{n} q^i = (q^{n+1} − 1)/(q − 1), (A.1.13)

∑_{k=0}^{n} \binom{n}{k} a^{n−k} b^k = (a + b)^n,   ∑_{i=0}^{∞} q^i = 1/(1 − q) for |q| < 1. (A.1.14)
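These identities are easy to check numerically; the following sketch in plain Python (an illustration, not part of the notes) verifies them for one choice of n and q:

```python
# Numerical check of the identities in Lemma A.1.5; no libraries needed.
n, q = 10, 0.5

assert sum(i for i in range(1, n + 1)) == n * (n + 1) // 2
assert sum(i**2 for i in range(1, n + 1)) == n * (n + 1) * (2 * n + 1) // 6
assert sum(i**3 for i in range(1, n + 1)) == (n * (n + 1) // 2) ** 2

# Partial geometric sum, valid for q != 1.
assert abs(sum(q**i for i in range(n + 1)) - (q**(n + 1) - 1) / (q - 1)) < 1e-12
print("all identities confirmed for n =", n)
```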
We omit here the proof of Lemma A.1.5. Until now, we have worked only with one summation
index. One can of course also define sums over more than one summation index. One
defines in this case

∑_{i,j=1}^{n} a_{ij} := ∑_{i=1}^{n} ( ∑_{j=1}^{n} a_{ij} ).
A.2 Infinite sums

We now would like to give a meaning to sums with infinitely many summands, such as
∑_{j=1}^{∞} a_j. If only finitely many a_j ≠ 0, then we obviously don't have a problem:
we then build the sum only over all non-zero a_j. In formulas, this means

∑_{j=1}^{∞} a_j := ∑_{j∈N, a_j≠0} a_j.

If infinitely many a_j ≠ 0, we have to define the infinite sum as a limit. We set in this case

∑_{j=1}^{∞} a_j := lim_{n→∞} ∑_{j=1}^{n} a_j
if the limit exists. An example is for instance

∑_{k=1}^{∞} 1/k^2 = π^2/6. (A.2.1)
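The convergence in (A.2.1) can be observed numerically; a small illustration in plain Python (not part of the original notes):

```python
from math import pi

# Partial sums of (A.2.1): sum of 1/k^2 converges (slowly) to pi^2 / 6.
target = pi ** 2 / 6
for n in (10, 100, 10000):
    partial = sum(1 / k ** 2 for k in range(1, n + 1))
    print(n, partial, target - partial)  # the remaining gap behaves like 1/n
```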
However, it is not possible to assign to every infinite sum a meaningful value. To see
this, let us consider

∑_{k=1}^{n} (−1)^k = (−1) + 1 + (−1) + 1 + (−1) + 1 + · · · + (−1)^n = −1 if n is odd, and 0 if n is even.

We thus see that the value jumps up and down between −1 and 0 as n → ∞, and thus
the limit lim_{n→∞} ∑_{k=1}^{n} (−1)^k does not exist. Therefore it is not possible to
assign a value to ∑_{k=1}^{∞} (−1)^k.
Another important point is that in probability theory we often rearrange the order of the
summands in infinite sums. This is of course only meaningful if this rearranging of the
summands has no influence on the value of the sum. If one rearranges only finitely many
summands, then this has no influence on the value of the sum. However, if one rearranges
infinitely many summands, it can happen that the sum takes a different value. A famous
example is the alternating harmonic series

∑_{n=1}^{∞} (−1)^{n+1}/n = 1 − 1/2 + 1/3 − 1/4 + · · · . (A.2.2)

In the order given in (A.2.2), the sum has the value ln(2). However, the order of the
summands is not relevant for every infinite sum. The following theorem gives necessary
and sufficient conditions under which the rearrangement of summands has no influence
on the value.
Theorem A.2.1. Suppose that ∑_{j=1}^{∞} a_j converges. Then the value of the sum is
independent of the order of the summands if and only if the sum converges absolutely,
i.e. if and only if ∑_{j=1}^{∞} |a_j| < ∞.
Remark A.2.2. Notice that (almost) all sums we consider are absolutely convergent. We
will thus only mention the absolute convergence if it is important or necessary.

We will not prove Theorem A.2.1. One can ask oneself: what about the sum

1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + · · · ? (A.2.5)

The most natural approach would be to define this sum as +∞, and this value would
obviously be independent of the order of the summands in (A.2.5), since all summands are
the same. However, one cannot really make computations with +∞ (what should
+∞ − (+∞) or +∞/+∞ be?). One thus normally excludes +∞ to avoid problems. The
next theorem shows that sums which are not absolutely convergent can cause big problems.
Theorem A.2.3. Suppose that ∑_{j=1}^{∞} a_j converges, but not absolutely. Then the
sum can take, by suitable rearrangement of the order of the summands, every value in R
and also ±∞.
A.3 Sums over index sets

A further version of the sum notation is to write a condition under the summation symbol;
one then sums over all indices fulfilling this condition. This version is useful if the range
over which one would like to sum cannot be easily described. Also, we can express the
sums in the Sections A.1 and A.2 in this way. For example,

∑_{1≤k≤10} a_k = ∑_{k=1}^{10} a_k = a_1 + · · · + a_10.
We can also build sums over an arbitrary index set I. We write in this case

∑_{i∈I} a_i. (A.3.1)

We will need this for instance in the definition of the probability of an event, see
Definition 2.1.7. Let us consider an example.

Example A.3.1. Let I = {1, 2, 4, 9} and J = {−2, −1, 0, 1, 2}. We then have

∑_{i∈I} a_i = a_1 + a_2 + a_4 + a_9   and   ∑_{i∈J} a_i = a_{−2} + a_{−1} + a_0 + a_1 + a_2. (A.3.2)
Lemma A.3.2. Let I and J be disjoint sets, i.e. there is no element which is contained
in both I and J. We then have

∑_{i∈I∪J} a_i = ∑_{i∈I} a_i + ∑_{i∈J} a_i. (A.3.3)
The proof is similar to the proof of Lemma A.1.2 and we thus omit it.
If we now look at (A.3.2), we see that the sets I and J in this example are not disjoint,
since the elements 1 and 2 are contained in both I and J. Therefore we cannot apply
Lemma A.3.2. Indeed, we have in this case

∑_{i∈I} a_i + ∑_{i∈J} a_i = a_{−2} + a_{−1} + a_0 + 2a_1 + 2a_2 + a_4 + a_9, (A.3.4)

∑_{i∈I∪J} a_i = a_{−2} + a_{−1} + a_0 + a_1 + a_2 + a_4 + a_9. (A.3.5)

Thus (A.3.4) and (A.3.5) are equal if and only if a_1 + a_2 = 0. In other words, we have
equality only in special cases, and thus (A.3.3) does not hold in general without the
disjointness assumption. Let us now consider another example.
Example A.3.3. Let I = {3, 5} and J = {4, 9}. In this case I and J are disjoint, i.e.
we have I ∩ J = ∅. Thus we can apply Lemma A.3.2 and get

∑_{i∈I} a_i + ∑_{i∈J} a_i = (a_3 + a_5) + (a_4 + a_9) = a_3 + a_4 + a_5 + a_9 = ∑_{i∈I∪J} a_i. (A.3.6)
In the above examples, we considered only subsets of the integers; in other words, we
always had I ⊂ Z and J ⊂ Z. This condition is not necessary, and we can use arbitrary
sets as index sets; the elements of the index set need not be numbers at all.
For completeness, we have to highlight that we have the same problem as in Section A.2.
If the index set I is finite, then the sum

∑_{i∈I} a_i (A.3.7)

is always well defined. However, if the index set is infinite, then we have to be more
careful. In this case, we need that at most countably many a_i ≠ 0 (see Definition 2.1.2).
Further, we have to introduce a numbering i_1, i_2, . . . of all i ∈ I with a_i ≠ 0 and define

∑_{i∈I} a_i := lim_{n→∞} ∑_{k=1}^{n} a_{i_k}   if   ∑_{k=1}^{∞} |a_{i_k}| < ∞. (A.3.8)

The condition ∑_{k=1}^{∞} |a_{i_k}| < ∞ is required so that the sum does not depend on
the choice of the numbering of the indices i_k. An example is the expression for the
expectation of a discrete random variable X, see Theorem 3.4.9. We have in this case
I = R and P[X = x] = 0 except for countably many x_1, x_2, . . . . We thus have

E[X] = ∑_{x∈R} x · P[X = x] = ∑_{k=1}^{∞} x_k · P[X = x_k],

where the sequence (x_k)_{k∈N} consists of all x ∈ R with P[X = x] > 0 (in an arbitrary
order).
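As an illustration (not part of the original text), the expectation of a discrete random variable can be computed by summing over exactly the countably many points with positive probability; the pmf below, a fair six-sided die, is a hypothetical choice:

```python
# Expectation of a discrete random variable via E[X] = sum_k x_k * P[X = x_k].
pmf = {x: 1 / 6 for x in range(1, 7)}  # P[X = x] = 1/6 for x = 1, ..., 6

expectation = sum(x * p for x, p in pmf.items())
print(expectation)  # 3.5
```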
Appendix B
Sets
Sets are one of the most basic objects in mathematics and occur directly or indirectly
almost everywhere. In the context of probability theory, they are used to model events,
see Definition 2.1.7. We give in this appendix a brief introduction and a summary of the
properties we need in this course.
B.1 Definition
Definition B.1.1. A set is a collection of objects in which each object occurs precisely
once. The objects which make up the set are called the elements or members of the set.
We write ω ∈ Ω if the element ω is contained in the set Ω; otherwise we write ω ∉ Ω.

Sets are conventionally denoted by capital letters. If one writes down a set, one
displays the elements between curly braces { }.
b) The set of all prime numbers less than 20 is B = {2, 3, 5, 7, 11, 13, 17, 19}.
We have 3 ∈ B, 11 ∈ B and 4 ∈ / B.
Important examples of sets are the natural numbers N, the integers Z, the rational
numbers Q and the real numbers R.

Definition B.1.3. The cardinality of a set A, denoted |A|, is the number of elements
in A. If the set has infinitely many elements, then its cardinality is ∞.
Remark B.1.5. If the cardinality |A| of a set A is infinite, one often distinguishes whether
the set is countably infinite (i.e. the set is countable, see Definition 2.1.2) or uncountably
infinite. We do not need this distinction and thus write for simplicity just ∞ in both cases.
Definition B.1.6. The set A is a subset of the set B if every element of A is also an
element of B. In this case we write A ⊂ B. Otherwise, we write A 6⊂ B.
Subsets of a set B can be specified for instance by a defining property, as in
A = {ω ∈ B : ω has a given property}.
Remark B.1.7. Notice that there are different conventions in the literature for the subset
symbols. Some authors write A ⊆ B if A is a subset of B and A ⊊ B if A is a proper
subset of B, i.e. A ⊆ B with A ≠ B. Some others write A ⊂ B instead of A ⊊ B. In
this course, A ⊂ B always indicates that A is a subset of B; in particular, A may be
equal to B.
Example B.1.8.
b) We have N ⊂ Z, Q ⊂ R, Q 6⊂ N.
The notion of a subset now gives us an elegant way to say when two sets are equal.

Definition B.1.9. Two sets A and B are equal if and only if A ⊂ B and B ⊂ A. We
write in this case A = B. Otherwise we write A ≠ B.

If one speaks about subsets of a set, one should also speak about the power set of a set.
Definition B.1.10. The power set PΩ of a set Ω is defined as the set of all subsets
of Ω, including the set Ω itself and the empty set ∅. In formulas,

PΩ = {A : A ⊂ Ω}.
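For finite sets, the power set can be enumerated mechanically. A minimal sketch in Python using the standard itertools library (an illustration, not part of the notes):

```python
from itertools import combinations

def power_set(omega):
    """All subsets of omega, including the empty set and omega itself."""
    elems = list(omega)
    return [set(c) for r in range(len(elems) + 1)
                   for c in combinations(elems, r)]

print(power_set({1, 2, 3}))  # 8 subsets: {}, {1}, ..., {1, 2, 3}
```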
B.2 Set operations

Definition B.2.1. Let A and B be subsets of a set Ω.

The union A ∪ B is the set of all elements ω ∈ Ω that are in A or in B (or in both).
The intersection A ∩ B is the set of all elements ω ∈ Ω that are in A and in B.
The set difference A \ B is the set of elements that are in A but not in B.
The complementary set A^c of the set A is the set consisting of those elements
ω ∈ Ω that are not in A (but are in Ω).
A useful way to illustrate sets is with Venn diagrams, see Figure B.1. These were
introduced by the English mathematician John Venn (1834–1923). Drawing Venn diagrams
for three or fewer sets is reasonably straightforward. Good drawings of larger numbers of
intersecting sets have been sought, with those by Anthony Edwards being particularly well
known, see Figure B.2.
Often we have to work with more than two or three sets, and it can then be inconvenient
to write something like A_1 ∪ A_2 ∪ · · · ∪ A_n. In analogy to the sum notation, one
therefore uses the abbreviations ⋃_{i=1}^{n} A_i := A_1 ∪ · · · ∪ A_n and
⋂_{i=1}^{n} A_i := A_1 ∩ · · · ∩ A_n, and similarly for infinite families of sets (see
Definition B.2.2).
[Figure B.1: (a) the union A ∪ B; (b) the intersection A ∩ B.]
[Figure B.2: Venn-style diagrams for (a) three sets and (b) four sets.]
We will now prove some properties of the set operations. To keep the notation simple,
we restrict ourselves to two or three sets.
Proposition B.2.3 (Properties of the union ∪). Let A, B and C be subsets of a set
Ω. We then have
A ∪ B = B ∪ A, (A ∪ B) ∪ C = A ∪ (B ∪ C), (B.2.4)
A ⊂ A ∪ B, A∪A=A (B.2.5)
A ∪ ∅ = A, A ⊂ B if and only if A ∪ B = B. (B.2.6)
Proposition B.2.4 (Properties of the intersection ∩). Let A, B and C be subsets
of a set Ω. We then have
A ∩ B = B ∩ A, (A ∩ B) ∩ C = A ∩ (B ∩ C), (B.2.7)
A ∩ B ⊂ A, A∩A=A (B.2.8)
A ∩ ∅ = ∅, A ⊂ B if and only if A ∩ B = A. (B.2.9)
The proofs of Propositions B.2.3 and B.2.4 are straightforward and follow almost
immediately from the definitions. We thus omit them. Some further properties of sets
are collected in

Proposition B.2.5. Let A, B and C be subsets of a set Ω. We then have

A ∪ A^c = Ω,   A ∩ A^c = ∅, (B.2.10)
A \ B = A ∩ B^c,   (A \ B)^c = A^c ∪ B, (B.2.11)
(A ∪ B) ∩ C = (A ∩ C) ∪ (B ∩ C),   (A ∩ B) ∪ C = (A ∪ C) ∩ (B ∪ C). (B.2.12)

The proof of Proposition B.2.5 is also straightforward and can be done for instance
with Venn diagrams. We thus omit this proof as well.
Finally, we have the rules of De Morgan.
[Figure B.3: (a) the sets A (green) and B (purple); (b) A ∩ B (red) and A^c ∪ B^c (blue).]
Proposition B.2.6 (De Morgan's laws). For all subsets A, B and A_1, A_2, . . . of a set Ω
we have

(A ∪ B)^c = A^c ∩ B^c, (B.2.13)
(A ∩ B)^c = A^c ∪ B^c, (B.2.14)
( ⋂_{i=1}^{∞} A_i )^c = ⋃_{i=1}^{∞} A_i^c. (B.2.15)
Proof. We prove first (B.2.14). The best way to prove (B.2.14) is to use Venn diagrams,
see Figure B.3. Equation (B.2.13) then follows immediately from (B.2.14) with Ã := A^c
and B̃ := B^c, using that (A^c)^c = A. Equation (B.2.15) can be seen as follows:

ω ∈ ( ⋂_{i=1}^{∞} A_i )^c ⟺ ω ∉ ⋂_{i=1}^{∞} A_i ⟺ there exists an i with ω ∉ A_i
⟺ there exists an i with ω ∈ A_i^c ⟺ ω ∈ ⋃_{i=1}^{∞} A_i^c.
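De Morgan's laws are also easy to check on concrete finite sets, for instance with Python's built-in set operations (a sketch for illustration only; complements are taken relative to a small universe Ω):

```python
# Brute-force check of (B.2.13) and (B.2.14) on a small universe.
omega = set(range(10))
A = {0, 1, 2, 3}
B = {2, 3, 5, 7}

assert omega - (A | B) == (omega - A) & (omega - B)  # (A u B)^c = A^c n B^c
assert omega - (A & B) == (omega - A) | (omega - B)  # (A n B)^c = A^c u B^c
print("De Morgan's laws confirmed on this example")
```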
Appendix C

Table of the standard normal distribution

The table lists Φ(x) for 0.00 ≤ x ≤ 3.49; the first decimal place of x is in the left
margin and the second decimal place in the header row (cf. Section 3.10).
0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359
0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753
0.2 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141
0.3 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517
0.4 0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879
0.5 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224
0.6 0.7257 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.7549
0.7 0.7580 0.7611 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.7852
0.8 0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.8133
0.9 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389
1.0 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.8621
1.1 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.8830
1.2 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015
1.3 0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.9177
1.4 0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.9319
1.5 0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9429 0.9441
1.6 0.9452 0.9463 0.9474 0.9484 0.9495 0.9505 0.9515 0.9525 0.9535 0.9545
1.7 0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616 0.9625 0.9633
1.8 0.9641 0.9649 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693 0.9699 0.9706
1.9 0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.9750 0.9756 0.9761 0.9767
2.0 0.9772 0.9778 0.9783 0.9788 0.9793 0.9798 0.9803 0.9808 0.9812 0.9817
2.1 0.9821 0.9826 0.9830 0.9834 0.9838 0.9842 0.9846 0.9850 0.9854 0.9857
2.2 0.9861 0.9864 0.9868 0.9871 0.9875 0.9878 0.9881 0.9884 0.9887 0.9890
2.3 0.9893 0.9896 0.9898 0.9901 0.9904 0.9906 0.9909 0.9911 0.9913 0.9916
2.4 0.9918 0.9920 0.9922 0.9925 0.9927 0.9929 0.9931 0.9932 0.9934 0.9936
2.5 0.9938 0.9940 0.9941 0.9943 0.9945 0.9946 0.9948 0.9949 0.9951 0.9952
2.6 0.9953 0.9955 0.9956 0.9957 0.9959 0.9960 0.9961 0.9962 0.9963 0.9964
2.7 0.9965 0.9966 0.9967 0.9968 0.9969 0.9970 0.9971 0.9972 0.9973 0.9974
2.8 0.9974 0.9975 0.9976 0.9977 0.9977 0.9978 0.9979 0.9979 0.9980 0.9981
2.9 0.9981 0.9982 0.9982 0.9983 0.9984 0.9984 0.9985 0.9985 0.9986 0.9986
3.0 0.9987 0.9987 0.9987 0.9988 0.9988 0.9989 0.9989 0.9989 0.9990 0.9990
3.1 0.9990 0.9991 0.9991 0.9991 0.9992 0.9992 0.9992 0.9992 0.9993 0.9993
3.2 0.9993 0.9993 0.9994 0.9994 0.9994 0.9994 0.9994 0.9995 0.9995 0.9995
3.3 0.9995 0.9995 0.9995 0.9996 0.9996 0.9996 0.9996 0.9996 0.9996 0.9997
3.4 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9998
Appendix D

Some useful series and sums
Exponential series: We have for all real x, y ∈ R, with e Euler's number,

e^x = exp(x) = 1 + x + x^2/2! + x^3/3! + . . . = ∑_{i=0}^{∞} x^i/i!,

exp(x + y) = exp(x) · exp(y).
Geometric series: For |p| < 1 we have

∑_{j=0}^{∞} p^j = 1 + p + p^2 + . . . = (1 − p)^{−1}. (D.0.1)

Using this result, similar results can be derived. Differentiating each side of (D.0.1)
term by term with respect to p gives:
Weighted geometric sums:

(1 − p)^{−2} = 1 + 2p + 3p^2 + . . . = ∑_{j=1}^{∞} j p^{j−1},

2(1 − p)^{−3} = 2 + 6p + 12p^2 + . . . = ∑_{j=2}^{∞} j(j − 1) p^{j−2}.
Binomial theorem: For any positive integer n ∈ N and real numbers p, q ∈ R,

(p + q)^n = \binom{n}{0} p^n q^0 + \binom{n}{1} p^{n−1} q^1 + . . . + \binom{n}{k} p^{n−k} q^k + . . . + \binom{n}{n} p^0 q^n
= ∑_{k=0}^{n} \binom{n}{k} p^{n−k} q^k.
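A quick numerical check of the binomial theorem, using only Python's standard library (an illustration with arbitrarily chosen values):

```python
from math import comb

# Verify (p + q)^n = sum of binom(n, k) * p^(n-k) * q^k for one choice of values.
n, p, q = 7, 0.3, 1.2
lhs = (p + q) ** n
rhs = sum(comb(n, k) * p ** (n - k) * q ** k for k in range(n + 1))
assert abs(lhs - rhs) < 1e-9
print(lhs, rhs)
```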
Symbol index
⋂_{i=1}^{n} A_i . . . . . . . . . . The intersection A_1 ∩ A_2 ∩ . . . ∩ A_n, see Definition B.2.2
⋃_{i=1}^{∞} A_i . . . . . . . . . . The union A_1 ∪ A_2 ∪ A_3 ∪ . . ., see Definition B.2.2
⋃_{i=1}^{n} A_i . . . . . . . . . . The union A_1 ∪ A_2 ∪ . . . ∪ A_n, see Definition B.2.2
\binom{n}{s} . . . . . . . . . . Binomial coefficient, see (2.5.15)
∅ . . . . . . . . . . . . . . . . . . . . . . The empty set {}
P . . . . . . . . . . . . . . . . . . . . . . Probability measure, see Definition 2.1.3 and 2.1.19
V(X) . . . . . . . . . . . . . . . . . . . Variance of a random variable, see Definition 3.5.1
PΩ . . . . . . . . . . . . . . . . . . . . The power set of Ω, see Definition B.1.10
ω . . . . . . . . . . . . . . . . . . . . . . a sample point in a probability space (Ω, P),
see Definition 2.1.3 and 2.1.19
ω ∈ Ω . . . . . . . . . . . . . . . . . . . The element ω is contained in the set Ω,
see Definition B.1.1
ω ∉ Ω . . . . . . . . . . The element ω is not contained in the set Ω, see Definition B.1.1
∑_{i=1}^{n}, ∑ . . . . . . . . . . notation for sums, see Appendix A
Ber(p) . . . . . . . . . . . . . . . . . . Bernoulli distribution, see Section 3.6.2
Geo(p) . . . . . . . . . . . . . . . . . . Geometric distribution, see Section 3.6.4
Poi(λ) . . . . . . . . . . . . . . . . . . . Poisson distribution, see Section 3.6.5
a ≈ b . . . . . . . . . . . . . . . . . . . b is an approximation of a.
an ∼ bn . . . . . . . . . . . . . . . . . . an and bn are asymptotically equal, see Page 35
e . . . . . . . . . . . . . . . . . . . . . . Euler’s number, e ≈ 2.71828.
n! . . . . . . . . . . . . . . . . . . . . . Factorial, short notation for
n · (n − 1) · . . . · 2 · 1, see (2.5.6)
pX (x) . . . . . . . . . . . . . . . . . . . The probability mass function, see Definition 3.2.6
|A| . . . . . . . . . . . . . . . . . . . . . Cardinality of a set A, see Definition B.1.3
(Ω, P) . . . . . . . . . . . . . . . . . . . Probability space, see Definition 2.1.3 and 2.1.19
Ω . . . . . . . . . . . . . . . . . . . . . Sample space, see Definition 2.1.3 and 2.1.19
Index

sigma-additivity, 14
absolutely convergent, 134
Bayes' Theorem, 48
Bernoulli distribution, 90
binomial coefficient, 29
Binomial distribution, 91
Binomial theorem, 92, 148
capital Pi notation, 22
cardinality, 4, 6, 15, 25, 32, 138
Cauchy-Schwarz inequality, 87
Chebyschev inequality, 85, 114, 120
classical experiment, 15
classical measure, 15
combinatorics, 21
complementary, 139
conditional probability, 35, 36
    Bayes' Theorem, 48
    law of total probability, 47
    multiplication rule, 42
convergence in distribution, 124
convergence in law, 124
countable, 4, 15
covariance, 87
dependent
    random variable, 102, 103
disjoint, 12, 139
distribution, 71
    joint distribution, 96
event, 7
    complement, 9
    impossible, 9
    independence, 52, 59
    occurs, 9
    pairwise independent, 56
    sure, 9
expectation, 78
    Bernoulli distribution, 91
    Binomial distribution, 92
    degenerate distribution, 90
    geometric distribution, 94
    Poisson distribution, 95
experiment
    classical, 15
    fair, 15
    Laplace, 15
    multi level, 17
    purely random, 15
    single level, 17
exponential series, 147
factorial, 26
favourable outcomes, 15
Gaussian bell curve, 123
Gaussian distribution, 124
Geometric distribution, 93
Geometric sum, 94
geometric sum, 147
    partial, 147
harmonic series, 134
independence
    events, 59
    random variable, 102, 103
indicator function, 26
induced sample space, 73
inequality
    Chebyschev, 114
    triangle, 81, 148
intersection, 139
joint distribution, 96
Kolmogorov's axioms, 2, 15
Laplace experiment, 15
Laplace measure, 15
Law of large numbers, 107, 113
law of total probability, 39, 47
multiplication rule, 38, 42
normal distribution, 124
pairwise independent, 56
partial Geometric sum, 147
partition, 46, 49
path rule, 19, 41, 43, 44, 49, 56, 59, 91
Pochhammer symbol, 26
Poisson distribution, 95
Poisson limit theorem, 95, 128
possible outcomes, 15
power set, 25, 139
probability mass function, 71
probability measure, 5
probability space, discrete, 5, 14
probability tree, 19
product rule, 22
random variable
    Bernoulli, 90
    Binomial, 91
    covariance, 87
    dependent, 102, 103
    discrete, 64
    distribution, 96
    geometric, 93
    independent, 102, 103
    Poisson, 95
    uncorrelated, 87, 88
sample point, 5
sample space, 5
set, 137
    elements, 137
set difference, 139
shifting theorem, 84
standard deviation, 83
Stirling's formula, 27, 124
Sturges rule, 121
subset, 12, 138
sum rule, 21
summation index, 129
travelling salesman problem, 27
tree diagram, 17, 19
triangle inequality, 81, 148
uncorrelated, 87, 88
uncountable, 4, 73
union, 139
variance, 83
    Bernoulli distribution, 91
    Binomial distribution, 92
    degenerate distribution, 90
    geometric distribution, 95
    Poisson distribution, 95
Venn diagram, 7, 9, 139