Math 272 Probability
Dirk Zeindler
January 9, 2021
Contents

1 Introduction
  1.1 Practical information
    1.1.1 Course format
    1.1.2 Videos
    1.1.3 Lectures
    1.1.4 Exercises
    1.1.5 Online seminars
    1.1.6 Assessments
    1.1.7 Office hour
    1.1.8 Plan of online seminars and lectures
    1.1.9 Other
  1.2 Motivation and course overview

3 Discrete random variables
  3.1 Motivation and definition of random variables
  3.2 Probability mass functions
  3.3 Events involving several values
  3.4 Expectation
  3.5 Variance
  3.6 Important distributions of random variables
    3.6.1 Degenerate distribution
    3.6.2 Bernoulli distribution
    3.6.3 Binomial distribution
    3.6.4 Geometric distribution
    3.6.5 Poisson distribution
  3.7 Joint distribution and independence
    3.7.1 Joint distribution
    3.7.2 Independent random variables
  3.8 Law of large numbers
    3.8.1 Motivating example
    3.8.2 Law of large numbers for random variables
  3.9 Fair games
  3.10 Central limit theorem

B Sets
  B.1 Definition
  B.2 Set operations

Index
Chapter 1
Introduction
This chapter contains some practical information about the course. Begin by watching
this short video Sec1, in which I introduce myself and the course.
If you are taking part in this course from China, the following link might be helpful.
1.1.2 Videos
The videos mentioned in these lecture notes have embedded links, so if you click on
them, they should take you straight to the video. You may need to be logged in to your
University account to watch them. If the link does not work, all these videos can also
be accessed from a repository on the MATH272 Moodle page; just search for the unique
identifier, which is "Sec1" in the case at hand. Sec1 refers to Section 1; for all other
videos, this identifier is followed by a lower-case letter a, b, c, . . . indicating the sequence of
the videos within a section. Please report any dead links to me using the online forums,
as described below.
1.1.3 Lectures
This course corresponds to 20 lectures, at 4 lectures per week. All these lectures
are planned as self-study. The material of each lecture is listed in Table 1.1. I suggest
that you study one lecture per day on Monday–Thursday, but you are welcome to choose
your own pattern, as long as you complete the first two lectures of each week before the
first online seminar and the other two before the second (see below for details of the
online seminars). Note that you do not need to complete a lecture in one sitting: you can
study shorter portions of it if that suits you better. The lectures involve a mix of reading
these notes and watching videos of me discussing and explaining selected results. Expect
to spend around two hours per lecture to learn and reflect on the material and solve the
exercises. However, not all lectures will take exactly the same amount of time to complete:
some are slightly longer, some perhaps slightly shorter, as they have been divided at the
most natural break points, and you may find some topics easier than others.
1.1.4 Exercises
On the Moodle page of MATH272 you can find exercise questions for each of the weeks
11–15. These are designed to test your understanding of the material covered up to that
point and to prepare you for the exam. Note that you do not have to submit solutions
to these exercises and they do not count towards your grade. The solutions will be
released on Friday at 12:00 each week.
Don’t worry if you don’t get everything right the first time around; that is perfectly
normal in mathematics and an integral part of the learning process. See if you can work
out from the notes or model solutions where you went wrong, go back and try again. If
you are unable to figure it out on your own or in your study group, please ask me for
help, either in the corresponding online seminar or my office hours.
However, the lecture recording does not always work, so watching the recording afterwards
should be used only as a last resort.
1.1.6 Assessments
The assessments consist of Moodle quizzes and an end-of-module test.
There are 5 weekly online quizzes in weeks 11–15. The deadline for each is
Friday at 23.55 (UK time). You can find them on the Moodle page of MATH272 in
the section 'Assessments for Probability'.

There is a written end-of-module test for the probability part of Math272. The
submission deadline is Tuesday at 17.00 in week 16. The questions will be available on
Moodle 48 hours prior to this deadline in the section 'Assessments for Probability',
that is, from Sunday at 17.00. When scanning your solutions, make sure that they
are legible and submitted in the right order and right orientation. Solutions that
fail to meet these criteria will lose marks, or may not be marked at all.
1.1.9 Other
Technical support: For IT help and support, including help with Moodle or Teams,
please consult the ISS help page.
Further reading: These lecture notes are self-contained. However, you can find a list
of helpful books in the Resource list at the top of the MATH272 Moodle page.
Past exam papers: Student Based Services operate an archive of past exam papers,
which you may find helpful when revising for the exam.
Table 1.1: Plan of lectures and online seminars.

Week 11: Lectures 1 (pages 7–15) and 2 (pages 15–22), then Online seminar 1;
         Lectures 3 (pages 22–26) and 4 (pages 26–33), then Online seminar 2.
Week 12: Lectures 5 (pages 33–36) and 6 (pages 36–41), then Online seminar 3;
         Lectures 7 (pages 42–46) and 8 (pages 46–48), then Online seminar 4.
Week 13: Lectures 9 (pages 48–55) and 10 (pages 55–58), then Online seminar 5;
         Lectures 11 (pages 59–63) and 12 (pages 63–67), then Online seminar 6.
Week 14: Lectures 13 (pages 69–75) and 14 (pages 75–84), then Online seminar 7;
         Lectures 15 (pages 84–91) and 16 (pages 91–98), then Online seminar 8.
Week 15: Lectures 17 (pages 99–105) and 18 (pages 105–111), then Online seminar 9;
         Lectures 19 (pages 111–116) and 20 (pages 125–128), then Online seminar 10.
1.2 Motivation and course overview

Would you rather be given £1, or toss a coin 10 times, winning £1000 if it comes
up heads every time?

How many people do you need in a room to have a better than 50% chance that
two of them share a birthday?
A suspect's fingerprints match fingerprints found at a crime scene, and the probability
that an innocent person's fingerprints would match is some given small number. How
strongly does this suggest that the suspect is guilty?

A network is formed by taking 100 points and connecting each pair with some fixed
probability. What can we say about the number and size of connected components
in the network?
These questions all relate to situations where there is some randomness, that is, events
we cannot predict with certainty. This could be either because they happen by chance
(tossing a coin) or because they are beyond our knowledge (the innocence or guilt of a
suspect). Probability theory has been introduced to study this uncertainty. This study
involves mathematics, logic, and considerations of the underlying physical mechanisms
that cause variability.
Let us now briefly comment on the above questions. The first question is not an entirely
mathematical one and there is no right or wrong answer; it will depend on circumstances.
However, some tools from probability can help us describe the choice quantitatively. In
each case the average amount we expect to win and the degree of variation in our gain
are relevant. Such questions are the topic of game theory.
The second question is a simple calculation which we will see how to do later (see Ex-
ample 2.5.25). If you haven't seen this before, try to guess what the answer will be.
Surprisingly, this question is already complicated enough to mislead the intuition of many
people.
The third question emphasises how important it is to work out exactly what information
we are given and what we want to know. The court would have to consider the probability
that the suspect is innocent under the assumption that the fingerprints match. The numer-
ical probability given in the question is the probability that the fingerprints match under
the assumption that the suspect is innocent. These are in general completely different
quantities. The erroneous assumption that they are equal is sometimes called the prosecutor's
fallacy. The mathematical tool for considering probabilities given certain assumptions is
conditional probability, see Section 2.6, and Bayes' theorem, see Section 2.6.4.
The final question describes the construction of a random graph. These objects are well
studied both as pure mathematical objects in their own right and as models to help
understand networks in a range of areas such as epidemiology and computing.
In this course we will give an introduction to discrete probability. More precisely, we define
discrete probability spaces (Ω, P) using the axioms of Kolmogorov. We then study some
examples based on counting problems and, in particular, give the answer to the second
question above. Furthermore, we also consider conditional probability, random variables
and the law of large numbers. Notice that many results and definitions in this course are
true on arbitrary probability spaces. For such results, we put the word 'discrete' in
brackets in the requirements. However, we will not go into the general theory.
Chapter 2
2.1 Definition
The study of probability and its laws is more than 300 years old. Bernoulli's law of large
numbers was one of the first probabilistic laws; it dates back to 1705 and was published
posthumously in 1713. Surprisingly, the question of a precise (axiomatic) definition of
probability was an open problem for a long time. In his famous lecture of 1900, Hilbert
presented the axiomatisation of probability and physics(!) as the sixth of his 23 open
problems. Kolmogorov finally found the right approach in 1933. His idea is based on
measure theory instead of the intuitive idea of probability as the limit of relative
frequencies. We will not consider here the general definition of a probability space since
this would go much too far for this course. We will restrict ourselves to discrete
probability spaces and random variables.
We begin with the definition of a discrete probability space. Please watch first the
following video Sec2.1a and then continue reading. Probabilities are naturally associated
with random experiments, and to find the right definition we consider the following
examples as motivation.
Example 2.1.1.
a) Anton tosses a coin. He says to himself that there are only two possible results:
Heads or Tails. Since he doesn't know much about the coin, he thinks that he will get
Heads in 1/2 of the cases and Tails in 1/2 of the cases.

b) Bea rolls a die. She says to herself that there are only six possible results: 1, 2, 3, 4, 5, 6.
Furthermore, she says that this is a fair die and thus each number will occur in 1/6 of
the cases.

c) Chris has an urn with four red balls and one blue ball, and the balls can only be
distinguished by their colour. Chris now draws a ball without looking at its colour. Chris
says to himself that there are only two possible results: red or blue. Furthermore, he
says to himself that he will get a red ball in 4/5 of the cases and a blue ball in 1/5 of
the cases.
In these three examples of random experiments, we always do the same:

First, we specify the possible results of our random experiment.

with N ∈ N ∪ {∞} the cardinality of the set Ω (the number of elements in Ω). In words,
this means that we can introduce a ranking on Ω and thus have a 'first' element, a 'second'
element, a 'third' element and so on. In particular, each finite set is countable, the natural
numbers N are countable and the integers Z are also countable. Examples of uncountable
sets are the real numbers R and each interval [a, b] ⊂ R with a < b. We can now define
Definition 2.1.3 (Discrete probability space, Version 1). A discrete probability space
consists of a pair (Ω, P), with Ω a countable set and a function P : Ω → [0, 1] with

∑_{ω∈Ω} P[ω] = 1.

We call Ω the sample space, P the probability measure and each ω ∈ Ω a sample point.
An illustration of a discrete probability space is given in Figure 2.1. The blue dots in
Figure 2.1 represent the sample points and the values near the points the corresponding
probabilities. An intuitive interpretation of Figure 2.1 is to see Ω as a lake, the sample
points ω as fish in the lake and the probabilities P[ω] as the chance that the fish ω will
be caught next.
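As an informal illustration (Python is not part of this course; the sketch below is only my way of making the definition concrete), one can model a discrete probability space as a dictionary mapping sample points to probabilities, here for Bea's fair die from Example 2.1.1:

    # Discrete probability space (Omega, P) for a fair die:
    # the keys form the sample space Omega, the values the probability measure P.
    P = {outcome: 1 / 6 for outcome in range(1, 7)}

    # The two defining properties of Definition 2.1.3:
    # every P[omega] lies in [0, 1], and the probabilities sum to 1.
    assert all(0 <= p <= 1 for p in P.values())
    assert abs(sum(P.values()) - 1) < 1e-12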
Exercise 2.1.5. Give a (formal) example of a probability space.
This is now the right point to watch the video Sec2.1b about definitions and examples.
We will use the ∑-notation for sums in this lecture to simplify the occurring expressions.
In particular, we write the second property of P in Definition 2.1.3 as

∑_{j=1}^{N} P[ω_j] = 1,    (2.1.8)

where N ∈ N ∪ {∞} is the cardinality of Ω (the number of elements in Ω). If you are not
familiar with the ∑-notation for sums, you should first watch the video Sec2.1c and then
read Appendix A. We give there a brief introduction to the ∑-notation for sums, where
we introduce everything we need in this course. Note that from now on we will use the
∑-notation frequently.
Definition 2.1.3 is maybe not the most intuitive way to define a probability. It would be
much more intuitive to determine the probability of an outcome of an experiment as follows:
perform n independent trials with n very large and consider the fraction of trials in which
the desired outcome occurred. Clearly, enlarging n gives a more precise result, so we would
like to choose n as large as possible. We therefore let n go to infinity and define the
probability of this outcome as the limit of the observed fractions. This definition, however,
would bring some disadvantages. For example, the proportion would typically depend on
the particular trials. In a four-fold coin toss one can obtain H, H, T, H as well as
T, T, H, H as a result (H = 'Heads' and T = 'Tails'). Definition 2.1.3 can be
interpreted as omitting the limit process and working directly with the limiting
proportion. Of course, one can ask whether Definition 2.1.3 really has this intuitive
property. We will see in Section 3.8 that this is indeed the case.
Definition 2.1.3 is very direct, but it unfortunately cannot be used to define probability
spaces on uncountable sets such as the real numbers R. We will therefore give in
Definition 2.1.19 an equivalent formulation using Kolmogorov's axioms, which can also be
used in general. However, we will consider only discrete probability spaces in this lecture.
Example 2.1.6.

a) We roll a die. What is the probability that the number is divisible by 3? The natural
answer is

P['divisible by 3'] = P[{3, 6}] = P[3] + P[6].    (2.1.9)

b) Anton, Bea, Chris and Donald play a game. The probability that Anton wins is 1/6,
that Bea wins is 1/3, that Chris wins is 3/8 and that Donald wins is 1/8. The proba-
bility space (Ω, P) is in this case

Ω = {'Anton', 'Bea', 'Chris', 'Donald'} and    (2.1.10)

P['Anton'] = 1/6, P['Bea'] = 1/3, P['Chris'] = 3/8, P['Donald'] = 1/8.    (2.1.11)

A possible question here is: what is the probability that Chris or Bea wins the game?
The natural answer is

P['Chris or Bea wins'] = P[{'Bea', 'Chris'}] = P['Bea'] + P['Chris'] = 1/3 + 3/8 = 17/24.    (2.1.12)

Another question might be: what is the probability that Donald doesn't win the game?
The natural answer is

P['Donald loses'] = P['Anton or Bea or Chris wins'] = P[{'Anton', 'Bea', 'Chris'}]
= P['Anton'] + P['Bea'] + P['Chris'] = 1/6 + 1/3 + 3/8 = 7/8.    (2.1.13)
If one analyses Example 2.1.6, one sees that we are interested in subsets A of Ω (A ⊂ Ω)
and that we assign to each subset A a probability by summing the probabilities of all
elements contained in A. This leads to the following definition.

An illustration of an event A is given in Figure 2.2. Using again the intuitive picture of Ω
as a lake and the ω ∈ Ω as fish in the lake, one can interpret an event A as a restricted
area in which some fish are enclosed. The important point here is that the fish can
neither enter nor leave this area.
Figure 2.2: Illustration of an event A in a probability space (Ω, P).
Remark 2.1.8. Notice that a sample space often contains many sample points ω, so in an
illustration like Figure 2.2 one normally drops the sample points ω. Such an illustration
is then called a Venn diagram, see also Appendix B.
It is clear from Definition 2.1.7 that one part of working with probability spaces and
events is to convert a description like 'The number is less than 4' into a subset of the
sample space. Sometimes it is also useful to find, for an event A ⊂ Ω, a description like
'The number is greater than 2'.
Example 2.1.9. Peter has three red, two blue, two white and four green
shirts. He now chooses a shirt purely at random. What is the probability
that it is red? We define

Ω = {r1, r2, r3, b1, b2, w1, w2, g1, g2, g3, g4} and P[ω] = 1/11 for all ω ∈ Ω.    (2.1.15)

The event of selecting a red shirt is then A = {r1, r2, r3}, and thus P[A] = 3/11.
Example 2.1.10. We choose purely at random a number between 1 and 12, i.e. we take
Ω = {1, 2, . . . , 12} and P[ω] = 1/12 for all ω ∈ Ω, and consider the events

A = 'divisible by 2' = {2, 4, 6, 8, 10, 12}, B = 'divisible by 3' = {3, 6, 9, 12},
C = 'a prime number' = {2, 3, 5, 7, 11} and D = 'greater than 9' = {10, 11, 12}.
The probability of the event A is then

P[A] = P[2] + P[4] + P[6] + P[8] + P[10] + P[12] = 6/12 = 1/2.

Similarly, we obtain

P[B] = 4/12, P[C] = 5/12 and P[D] = 3/12.
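These numbers can be checked by summing point masses exactly as in Definition 2.1.3. A small Python sketch (an illustration only, assuming the uniform space on {1, . . . , 12} described above):

    from fractions import Fraction

    P = {k: Fraction(1, 12) for k in range(1, 13)}   # uniform on {1, ..., 12}

    def prob(event):
        # P[A] is the sum of P[omega] over all omega in the event A
        return sum(P[w] for w in event)

    A = {2, 4, 6, 8, 10, 12}   # divisible by 2
    B = {3, 6, 9, 12}          # divisible by 3
    print(prob(A), prob(B))    # 1/2 and 1/3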
The video Sec2.1e contains a further example of how to translate a description like 'The
number is less than 4' into a subset of the sample space.
Exercise 2.1.11. Suppose we have two soccer teams, each consisting of 11
players. Suppose we choose a player uniformly at random. Give a formal description
of this experiment and of the events 'We select a player from the first team' and 'We
select a keeper'. Compute the probability of these events.
In most situations, we are interested in more than one event. Please watch at this point
the video Sec2.1f. When we work with more than one event, we require the basic notions
of set theory. We expect that the reader is already familiar with these, but for
completeness we have given a short introduction in Appendix B, which covers everything
we need in this course.
Before we discuss some properties of P[.], we would like to introduce a few ways of speak-
ing. We say that an event A occurs in a random experiment if an ω ∈ A has been obtained
in this random experiment. Otherwise we say that the event A didn't occur. For example,
let us roll a die and consider the event A = {4, 5, 6} ('the number is greater than 3'). If
we roll a '4' then A occurs, and if we roll a '2' then A doesn't occur. We interpret P[A]
as the chance that the event A occurs in a random experiment. We have listed some
further terms and ways of speaking in Table 2.1. Notice that events like A ∪ B or A ∩ B
can be illustrated with Venn diagrams, see Figure 2.3 and Appendix B.
Fact | Ways of speaking | As formula
A surely occurs. | A is the sure event. | A = Ω
A never occurs. | A is the impossible event. | A = ∅
A does not occur. | The event is the complement of A. | A^c := Ω \ A
A and B occur. | A as well as B occurs. | A ∩ B
A or B occurs. | At least one of the events A or B occurs. | A ∪ B
If A occurs, then B doesn't occur. | A and B are disjoint or mutually exclusive. | A ∩ B = ∅
If A occurs, then also B occurs. | A is a subset of B. | A ⊂ B
A occurs if and only if B does not occur. | A and B are complementary events. | A = B^c
A occurs if and only if at least one of the events A_i occurs. | A is the union of the A_i. | A = ⋃_{i=1}^{n} A_i
A occurs if and only if all A_i occur. | A is the intersection of the A_i. | A = ⋂_{i=1}^{n} A_i

Table 2.1: Designation of events (A, A_1, A_2, . . . , A_n and B are events in some (Ω, P)).
Example 2.1.12. Let us consider again the situation of Example 2.1.10. If we are, for
instance, interested in the event

E = 'the number is divisible by 3 and greater than 9',

we then obviously have E = {12} and P[E] = 1/12. On the other hand, we can use the
fourth row of Table 2.1 and write E = B ∩ D. Let us now consider the event F = A ∪ C.
We then have F = {2, 3, 4, 5, 6, 7, 8, 10, 11, 12} and P[F] = 10/12. In words, F can be
written as 'the number is divisible by 2 or a prime number'.
Example 2.1.13. An urn contains 100 balls, numbered 00, 01, . . . , 99. We now draw a
ball purely at random. We define in this situation

Ω = {00, . . . , 99} and P[ω] = 1/100 for all ω ∈ Ω.    (2.1.19)

A := 'The first digit is a 4' = {40, 41, 42, 43, 44, 45, 46, 47, 48, 49},    (2.1.20)
B := 'The second digit is a 6' = {06, 16, 26, 36, 46, 56, 66, 76, 86, 96}.    (2.1.21)
Figure 2.3: Venn diagrams: (a) 'A or B', the union A ∪ B; (b) 'A and B', the intersection
A ∩ B; (c) 'A but not B', the set difference A \ B; (d) 'A does not occur', the complement A^c.
Notice that the events A^c and B^c each contain 90 elements. It is thus not convenient
to write A^c and B^c out explicitly as we did for A and B.
Exercise 2.1.14. Give an example of two events A, B on a probability space (Ω, P).
Compute A ∩ B, A ∪ B and A \ B and give a description of these events.
We now prove some properties of P. The following video Sec2.1g explains Lemma 2.1.15
and Lemma 2.1.17 below and gives some additional explanations using diagrams.
Lemma 2.1.15. Let (Ω, P) be a (discrete) probability space and let A, B ⊂ Ω be events.
We then have

a) if A ∩ B = ∅, then P[A ∪ B] = P[A] + P[B];
b) P[A ∪ B] ≤ P[A] + P[B];
c) if A ⊂ B, then P[A] ≤ P[B];
d) P[A^c] = 1 − P[A].
Proof. We begin with Point a). An illustration of A ∪ B for A and B disjoint is given in
Figure 2.4(a). We have by definition

P[A ∪ B] = ∑_{ω∈A∪B} P[ω].    (2.1.30)

Since A ∩ B = ∅, each ω ∈ A ∪ B belongs either to A or to B, but not to both. We can
thus split the sum ∑_{ω∈A∪B} into a sum over all ω ∈ A and a sum over all ω ∈ B, and
no ω is counted twice, see also Lemma A.3.2. In formulas (see also Lemma A.3.2), this means

P[A ∪ B] = ∑_{ω∈A∪B} P[ω] = ∑_{ω∈A} P[ω] + ∑_{ω∈B} P[ω] = P[A] + P[B].    (2.1.31)
Figure 2.4: (a) Disjoint events A and B; (b) non-disjoint events A and B.
We now come to Point b). Here the events A and B can have a non-empty intersection,
and thus it is possible that an ω ∈ Ω is contained in both, see Figure 2.4(b). If we write

P[A] + P[B] = ∑_{ω∈A} P[ω] + ∑_{ω∈B} P[ω],    (2.1.32)

then all ω ∈ A ∩ B occur twice in this sum, while all ω ∈ A \ B and all ω ∈ B \ A occur
just once. If we now take a look at

P[A ∪ B] = ∑_{ω∈A∪B} P[ω],    (2.1.33)

then all ω ∈ A ∪ B occur only once, including those ω ∈ A ∩ B. This completes the proof
of this point since P[ω] ≥ 0.

We now come to Point c). Since A ⊂ B, we can write B = A ∪ (B \ A). Because
A ∩ (B \ A) = ∅, we can use Point a) and get

P[B] = P[A] + P[B \ A] ≥ P[A],

since P[B \ A] ≥ 0. For Point d), we use that A ∩ A^c = ∅ and A ∪ A^c = Ω. Using again
Point a) gives

1 = P[Ω] = P[A] + P[A^c],

and rearranging gives P[A^c] = 1 − P[A].
Exercise 2.1.16. Let A and B be two events on a probability space (Ω, P). Show that

P[A ∪ B] = P[A] + P[B] − P[A ∩ B].
In Lemma 2.1.15, points a) and b), we considered the union of two events. However, in
many cases one considers the union of more than two events, and often even the union of
infinitely many events. Fortunately, the argumentation in the proof of Lemma 2.1.15 can
also be used for such cases. Obviously, we have to introduce a suitable notation for the
union of (infinitely) many events (see Definition B.2.2). We set

⋃_{i=1}^{n} A_i := A_1 ∪ A_2 ∪ A_3 ∪ . . . ∪ A_n and ⋃_{i=1}^{∞} A_i := A_1 ∪ A_2 ∪ A_3 ∪ . . . .    (2.1.37)
We now have

Lemma 2.1.17. Let (Ω, P) be a (discrete) probability space and let A_1, A_2, . . . be events.
We then have

a) if the A_i are pairwise disjoint, then P[⋃_{i=1}^{∞} A_i] = ∑_{i=1}^{∞} P[A_i];
b) P[⋃_{i=1}^{∞} A_i] ≤ ∑_{i=1}^{∞} P[A_i].
The proof of Lemma 2.1.17 is similar to the proof of Lemma 2.1.15 and we thus omit it.
Remark 2.1.18. Notice that Lemma 2.1.17 is true on general probability spaces. However,
it is important to mention that one can apply Lemma 2.1.17 only for countable
S unions of
events. It is for instance not allowed to apply Lemma 2.1.17 to the union x∈R Ax .
With the help of Lemma 2.1.15 and Lemma 2.1.17, we can now give an equivalent definition
of a discrete probability space (Ω, P). The reason we do this is that Definition 2.1.3 can
only be used if the sample space is finite or countable. However, for the normal
distribution or the Black-Scholes model one requires a more complicated sample space. On
the other hand, Definition 2.1.19 below can be generalised to arbitrary sample spaces.
Definition 2.1.19 (Discrete probability space, Version 2). A discrete probability space
consists of a pair (Ω, P), with Ω a countable set and P assigning to each subset A ⊂ Ω
a value in [0, 1] with

P[Ω] = 1 and

P[⋃_{i=1}^{∞} A_i] = ∑_{i=1}^{∞} P[A_i] for all pairwise disjoint events A_1, A_2, . . .
(sigma-additivity).
Remark 2.1.20. The P in Definition 2.1.19 can be seen as a function P : PΩ → [0, 1], where
PΩ is the power set of Ω (see Definition B.1.10).
It is quite easy to see that Definitions 2.1.3 and 2.1.19 are equivalent. It follows directly
from Lemma 2.1.17.a) that a probability in the sense of Definition 2.1.3 is also a
probability in the sense of Definition 2.1.19. For the opposite direction, we define
P[ω] := P[{ω}], where {ω} is the subset of Ω containing just the element ω. Sigma-additivity
then gives

1 = P[Ω] = P[⋃_{ω∈Ω} {ω}] = ∑_{ω∈Ω} P[{ω}] = ∑_{ω∈Ω} P[ω].    (2.1.41)

Thus a probability in the sense of Definition 2.1.19 is also a probability in the sense of
Definition 2.1.3. Notice that we have used in (2.1.41) that Ω is countable.
The three properties of P in Definition 2.1.19 are known as Kolmogorov's axioms for prob-
ability. These axioms can now be used to define probability measures on arbitrary
sample spaces.
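On a small finite space, the axioms can be verified by brute force. A short Python sketch (an illustration only) for the game of Example 2.1.6 b), where P is defined on all subsets by summing point masses:

    from fractions import Fraction
    from itertools import chain, combinations

    Omega = ('Anton', 'Bea', 'Chris', 'Donald')
    p = dict(zip(Omega, (Fraction(1, 6), Fraction(1, 3),
                         Fraction(3, 8), Fraction(1, 8))))

    def P(A):
        # probability of an event A, defined by summing point masses
        return sum(p[w] for w in A)

    # all 16 subsets of Omega
    events = list(chain.from_iterable(combinations(Omega, r) for r in range(5)))

    assert P(Omega) == 1                      # first axiom: P[Omega] = 1
    for A in events:                          # additivity on disjoint events
        for B in events:
            if set(A).isdisjoint(B):
                assert P(set(A) | set(B)) == P(A) + P(B)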
Definition 2.2.1. Let Ω be a finite set. A probability measure P is called classical
measure or Laplace measure or uniform measure (and (Ω, P) a classical or Laplace
experiment) if

P[ω] = 1/|Ω| for all ω ∈ Ω.    (2.2.1)

Here |Ω| denotes the cardinality of Ω (the number of elements in Ω). A Laplace
experiment is often also called a 'purely random' or a 'fair' experiment.
By definition, all possible outcomes in a Laplace experiment have the same probability,
and we thus have for each event

P[A] = |A|/|Ω| for all A ⊂ Ω.    (2.2.2)

The elements of A are sometimes also called the favourable outcomes and the elements
of Ω the possible outcomes. One thus says that P[A] is the number of favourable outcomes
divided by the number of possible outcomes.
Example 2.2.2 (The roll of two dice). We roll two fair, distinguishable dice together and
then add the numbers. We now consider the following events:

A = 'the sum is even' and B = 'the sum is at most 5'.

First we specify the sample space. We could choose Ω = {2, 3, . . . , 12}. Unfortunately, not
every value between 2 and 12 has the same probability of occurring. For instance, 2 can
only occur as 1 + 1 while 5 = 1 + 4 = 2 + 3 = 3 + 2 = 4 + 1. We thus set

Ω = {(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6),
(2, 1), (2, 2), (2, 3), (2, 4), (2, 5), (2, 6),
(3, 1), (3, 2), (3, 3), (3, 4), (3, 5), (3, 6),
(4, 1), (4, 2), (4, 3), (4, 4), (4, 5), (4, 6),
(5, 1), (5, 2), (5, 3), (5, 4), (5, 5), (5, 6),
(6, 1), (6, 2), (6, 3), (6, 4), (6, 5), (6, 6)}.

With this choice of Ω, all ω ∈ Ω have the same probability P[ω] = 1/|Ω| = 1/36. We thus
now have a Laplace experiment. In this setting, the events A and B are

A = {(1, 1), (1, 3), (1, 5), (2, 2), . . . , (5, 3), (5, 5), (6, 2), (6, 4), (6, 6)},    (2.2.5)
B = {(1, 1), (1, 2), (1, 3), (1, 4), (2, 1), (2, 2), (2, 3), (3, 1), (3, 2), (4, 1)}.    (2.2.6)

The probabilities of the events A and B are thus

P[A] = |A|/|Ω| = 18/36 = 1/2 and P[B] = |B|/|Ω| = 10/36 = 5/18.    (2.2.7)
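The Laplace formula P[A] = |A|/|Ω| lends itself to direct enumeration on a computer. A brief Python sketch reproducing (2.2.7) (assuming the event descriptions above; an illustration only):

    from itertools import product
    from fractions import Fraction

    Omega = list(product(range(1, 7), repeat=2))   # the 36 equally likely pairs

    A = [w for w in Omega if sum(w) % 2 == 0]      # the sum is even
    B = [w for w in Omega if sum(w) <= 5]          # the sum is at most 5

    print(Fraction(len(A), len(Omega)))            # 1/2
    print(Fraction(len(B), len(Omega)))            # 5/18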
Example 2.2.3. One card is randomly selected from an ordinary 52-card playing deck.
What is the probability that we draw a king? At this point one needs to know that an
ordinary playing deck contains exactly 4 kings. Also, we assume that one cannot
distinguish the cards when they are face down. Thus each card has the same probability
of being drawn. This implies that

P['We draw a king'] = 4/52.    (2.2.8)
Example 2.2.4. A club has 100 members. A survey revealed the following statistics:

                     Men   Women
not vegetarian        48      12
vegetarian            16      24

In this situation, we set

Ω = {(i, g, v); i = 1, . . . , 100},    (2.2.9)

where i = 1, . . . , 100 is the number of the member, g the gender and v denotes whether
he/she is vegetarian or not. We now choose a member purely at random, and thus each
member has the probability P[{ω}] = 1/100 of being selected. The probability that the
selected member is a vegetarian is thus

P[{ω; ω is vegetarian}] = (16 + 24)/100 = 2/5,    (2.2.10)

since 16 men and 24 women are vegetarian. If we knew beforehand that we would select a
woman, then the situation would look different. In this case, the probability that the
selected member is a vegetarian is 24/36 = 2/3. However, this question in fact belongs to
conditional probability.
One thing one should keep in mind is that not all experiments are Laplace experiments.
Here are two examples.

Two football teams play a match against each other. One could think there is a
50:50 chance of victory for each team, but this is not the case. The talent of the
players, the strategy of the coach and much more have an influence on the chance of
winning. This match could only be a Laplace experiment if a team would virtually play
against itself.

Suppose we throw a frisbee instead of a coin. A frisbee has, like a coin, two sides.
However, the two sides are not symmetric, which argues against a Laplace experiment.
Also, in many cases a random experiment can be mistaken for a Laplace experiment.
Such things can happen even to very good mathematicians.
Example 2.2.5 (Leibniz's Error). Gottfried Wilhelm von Leibniz was an outstanding
German mathematician. Among his most outstanding achievements are the ideas of dif-
ferential and integral calculus, which he developed independently of Isaac Newton at the
same time. However, Leibniz was not infallible. He stated in the Opera Omnia (Leibniz,
1768, p. 217) that

... for example, with two dice, it is equally likely to throw twelve points, than
to throw eleven; because one or the other can be done in only one manner.

Let us now see why this is wrong. To get 12 points, both dice must show a 6. On the other
hand, to get 11 points, one die must show a 5 and the other a 6. This can be achieved in
two ways: the first die gives five and the second six, and vice-versa. Thus we get

P['The sum of the numbers is 11'] = 2/36 and P['The sum of the numbers is 12'] = 1/36.

But what happens if we can't distinguish the dice? Does this affect the above mentioned
probabilities? The answer is no. To see this, let us suppose we roll two identical dice at
the same time, which we have marked with invisible ink. We thus can distinguish the dice,
and the probability that the sum is 11 is therefore 2/36. Everybody else cannot distinguish
the dice, but sees the same sums as we do. Thus seeing or not seeing the invisible ink has
no influence on the probability of seeing the sum 11, which must therefore be 2/36 for
everybody.
To compute the probability of each element of the sample space, we think of the balls in
the urn as if they were numbered: 1–3 stand for 'a', 4 and 5 for 'n'.

Figure 2.5: Illustration of the urn in Example 2.3.1

In the numbered situation, there are 25 ways to draw the balls. These are all combinations
(i, j) with 1 ≤ i, j ≤ 5. Nine of these combinations lead to (a, a) ∈ Ω, namely all pairs
(i, j) with 1 ≤ i ≤ 3 and 1 ≤ j ≤ 3. Six combinations lead to (a, n) ∈ Ω, namely all (i, j)
with 1 ≤ i ≤ 3 and j = 4 or j = 5. Six further combinations lead to (n, a) ∈ Ω and four
to (n, n) ∈ Ω. Since all balls have the same probability of being drawn, all combinations
(i, j) with 1 ≤ i, j ≤ 5 have the same probability of being drawn. Since there are 25
possibilities, each pair (i, j) has the probability 1/25 of being drawn. We thus obtain

P[(a, a)] = 9/25, P[(a, n)] = 6/25, P[(n, a)] = 6/25 and P[(n, n)] = 4/25.    (2.3.1)
Another way to reach this conclusion is to represent the experiment by a tree diagram, see
Figure 2.6. Let us now describe how to obtain Figure 2.6. We start with a dot as the root
of our tree. In the first step of our experiment, we can draw a ball with an 'a' or with an
'n', with probabilities 3/5 and 2/5. We thus add two branches to the root of our tree, one
for 'a' and one for 'n'. Furthermore, we write next to each branch the probability that the
corresponding ball is drawn. In the second step, we repeat the first step, but this time
adding the new branches to the branches obtained in the first step. The important
observation at this point is that the probabilities of the combinations (a, a), (a, n), (n, a)
and (n, n) can also be obtained by multiplying the probabilities along the path
corresponding to a combination.
Example 2.3.2. We consider the same situation as in Example 2.3.1, but with the differ-
ence that we do not return the ball to the urn. We thus have to adjust the probabilities
on the branches corresponding to the second level of our experiment. If we draw an 'a' in
the first step, then two 'a's and two 'n's are left in the urn. Thus the probability of
drawing an 'a' is 1/2, and of drawing an 'n' also 1/2. If we draw an 'n' in the first step,
then three 'a's and one 'n' are left in the urn. Thus the probability of drawing an 'a' is
3/4 and of drawing an 'n' is 1/4. This now leads to the tree in Figure 2.7. The
multiplication approach gives

P[(a, a)] = P[(a, n)] = P[(n, a)] = 3/10 and P[(n, n)] = 1/10.    (2.3.2)

If one computes the probabilities as in the first half of Example 2.3.1, one obtains the
same result.
Exercise 2.3.3. Compute P [(a, a)] in Example 2.3.2 as in the first half of Example 2.3.1
and compare it to (2.3.2).
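If you want to double-check your answer afterwards, the counting argument can be automated. A small Python sketch (an illustration only) that enumerates the numbered balls and all ordered draws without replacement:

    from itertools import permutations
    from fractions import Fraction

    balls = ['a', 'a', 'a', 'n', 'n']          # balls numbered 0..4, cf. Figure 2.5
    draws = list(permutations(range(5), 2))    # 20 equally likely ordered draws

    def prob(first, second):
        hits = [(i, j) for (i, j) in draws
                if balls[i] == first and balls[j] == second]
        return Fraction(len(hits), len(draws))

    print(prob('a', 'a'), prob('a', 'n'), prob('n', 'a'), prob('n', 'n'))
    # 3/10 3/10 3/10 1/10, in agreement with (2.3.2)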
The observations in Examples 2.3.1 and 2.3.2 lead to the following proposition:
Proposition 2.3.4 (Path rule).

The probability of a sample point ω is the product of the probabilities along the
path leading to ω.

The probability of an event A is the sum over the probabilities of all paths
leading to an ω ∈ A.

The graphical illustration of all of these paths as a tree, as in Figure 2.6 and
Figure 2.7, is called a probability tree or tree diagram.
The proof of this proposition requires conditional probability, and we thus postpone it,
see Proposition 2.6.16 and Section 2.6.3. We will now apply the path rule in various
examples.
Example 2.3.5. There are 9 passengers in a boat: 4 are smugglers and 5 are honest
passengers. A customs officer selects 3 persons for a check and all 3 are smugglers. Has
the customs officer selected them randomly? To make a decision, we compute the
probability of getting 3 smugglers when we select them purely at random. We set

Ω = {(i, j, k); i, j, k ∈ {s, h}},

where s denotes smuggler and h honest. The path corresponding to the result (s, s, s) is

· →(4/9) s →(3/8) s →(2/7) s.    (2.3.3)

Thus the probability of selecting three smugglers is

P[(s, s, s)] = 4/9 · 3/8 · 2/7 = 1/21,    (2.3.4)

which is less than 5%. With such a small probability, it is to be expected that the customs
officer has some ability to judge human nature.
Example 2.3.6. In a drawer there are 4 black, 6 red and 2 white socks. In the
dark, somebody randomly chooses two socks. What is the probability that
they have the same colour? We define in this situation

Ω = {(i, j); i, j ∈ {b, r, w}}.    (2.3.5)

The following three paths lead to equal colours:

· →(4/12) b →(3/11) b, · →(6/12) r →(5/11) r, · →(2/12) w →(1/11) w.    (2.3.6)

We thus have

P[{(i, j); i = j}] = 4/12 · 3/11 + 6/12 · 5/11 + 2/12 · 1/11 = 1/3.    (2.3.7)
Example 2.3.7. Each time Prof.Q sees 7 persons standing together, he
bets 100:1 that at least two of them are born on the same weekday. Is
this a good bet? We could try to compute the probability that Prof.Q wins. The prob-
lem is that we would have to consider all the different outcomes, like 'All are born on the
same weekday' or 'Two are born on the same weekday and all others are born on different
weekdays'. This is possible, but it is much easier to compute the probability that Prof.Q
loses his bet and then to use Lemma 2.1.15.d. We set

Ω = {(i_1, . . . , i_7); 1 ≤ i_j ≤ 7},    (2.3.8)

where i_j denotes the weekday on which person j is born. We can assume at this point
that all weekdays have the same probability. Under this assumption, the following path
corresponds to the event that Prof.Q loses:

· →(7/7) {(i_1); 1 ≤ i_1 ≤ 7} →(6/7) {(i_1, i_2); i_1 ≠ i_2, 1 ≤ i_1, i_2 ≤ 7} →(5/7) · · ·
· · · →(1/7) {(i_1, . . . , i_7); i_j ≠ i_k for j ≠ k, 1 ≤ i_1, . . . , i_7 ≤ 7}.    (2.3.9)

The path rule thus gives

P['Prof.Q loses'] = 7/7 · 6/7 · 5/7 · . . . · 1/7 = 7!/7^7 ≈ 0.006.

By Lemma 2.1.15.d, Prof.Q therefore wins with probability ≈ 0.994, so this is a very good
bet for him.
Example 2.3.8. In the city M. there exist only two kinds of weather: rainy (R) or dry (D).
If it is rainy today, then the probability is 5/6 that it is also rainy tomorrow and 1/6
that it is dry tomorrow. If it is dry today, then the probability is 3/10 that it is also dry
tomorrow and 7/10 that it is rainy tomorrow. Today it is dry: what is the probability
that it is dry in two days? Apparently there are two paths leading to dry weather:

D →(3/10) D →(3/10) D and D →(7/10) R →(1/6) D.    (2.3.11)

The path rule gives

P['dry in two days'] = 3/10 · 3/10 + 7/10 · 1/6 = 9/100 + 7/60 = 31/150 ≈ 0.21.
Recall that in a Laplace experiment we have

P[A] = |A|/|Ω| for all A ⊂ Ω.
If one would like to determine the probability P[A] for some A ⊂ Ω, one has to count
the elements of A. At first glance, this looks like a simple task, but it is often very
difficult. We will therefore look in this section and in Section 2.5 at some helpful counting
rules. The topic of this section is the product rule and the sum rule.
At this point, we would like to remark that the computation of the size of a finite set
and the finding of counting principles is a subdiscipline of discrete mathematics, more
precisely of Combinatorics. We should also mention that there are many problems in
Combinatorics which can easily be formulated and explained to everybody, but which are
extremely hard to solve, and many of them are still unresolved.
Proposition 2.4.1 (Sum rule). Let A_1, . . . , A_n be pairwise disjoint finite sets. Then

|A_1 ∪ A_2 ∪ . . . ∪ A_n| = |A_1| + |A_2| + . . . + |A_n|.

The proof of Proposition 2.4.1 is obvious and can, for example, be deduced directly from
Lemma 2.1.17.a) if one uses for P the Laplace probability measure on Ω. We have given
in Figure 2.8 an illustration of the sum rule with n = 3.
Figure 2.8: Illustration of the sum rule. We have here |A_1 ∪ A_2 ∪ A_3| = 2 + 3 + 3.
Proposition 2.4.2 (Product rule). Suppose an experiment has s levels and the ex-
periment has n_i possible outcomes at level i (1 ≤ i ≤ s), no matter what happens at
the other levels. Then the whole experiment has

n_1 · n_2 · . . . · n_s    (2.4.1)

possible outcomes.
We have given in Figure 2.9 an illustration with tree diagrams. Notice that in
Figure 2.9(a) we always have three possibilities at the second level, no matter what
happened at the first level. Also, we always have two possibilities at the third level, no
matter what happened at the first two levels. We thus can apply the product rule to the
experiment in Figure 2.9(a) and get 2 · 3 · 2 possible outcomes. On the other hand, in
Figure 2.9(b) we have either two or three possibilities at the second level, depending on
the result of the first level, and similarly at the third level. We thus cannot apply the
product rule to the experiment in Figure 2.9(b).
Figure 2.9: (a) The product rule can be applied; (b) the product rule cannot be applied.
Remark 2.4.3.
a) In Proposition 2.4.2, a level is allowed to have an influence on the later levels, as
long as it has no influence on the number of outcomes.
The proof of Proposition 2.4.2 is, as for Proposition 2.4.1, quite obvious. If one would
like, one can deduce the product rule from the principles of probability theory. For this,
one would need the independence of events, see Section 2.7. We leave this to the reader
and instead show the usefulness of these rules with some examples.
Example 2.4.4. We toss a coin with the sides 0 and 1 six times. How many
possible outcomes do we have? Each coin toss has two possible outcomes. Further-
more, no level has an influence on the other levels. We thus have

2 · 2 · 2 · 2 · 2 · 2 = 2^6 = 64    (2.4.2)

possible outcomes of our experiment. In general, if we toss a coin n times, then we
have 2^n possible outcomes.
Example 2.4.5. An urn contains 5 balls with the numbers 1 to 5. We now draw
a ball three times without replacement. How many possible outcomes do we
have? At the first level we have 5 possibilities, at the second 4 possibilities and at the
third 3 possibilities. Each level has an influence on which numbers remain in the urn,
but no influence on the number of remaining balls. We thus can use the product rule and
get

5 · 4 · 3 = 60    (2.4.3)

possible outcomes.
4 · 5 · 1 + 2 · 5 · 1 + 4 · 2 · 1 + 4 · 5 · 3 = 98 (2.4.4)
possible combinations.
A sample can be drawn with replacement or without replacement, and it can be ordered
or unordered. This gives four cases, of which we will consider three in this section. The
resulting formulas are:

                        ordered    unordered
with replacement        n^s        \binom{n+s-1}{s}
without replacement     (n)_s      \binom{n}{s}

We will not study the case 'unordered with replacement' and have given the formula here
only for completeness.
Suppose an urn contains n balls with the numbers 1 to n. We now draw a ball, note the
number and put it back into the urn. This procedure is repeated s times. This experiment
has s levels, where no level has an influence on the other levels. Furthermore, we have
n possibilities at each level. The product rule thus implies that we have n^s possible
outcomes for the whole experiment.
Remark 2.5.1.
a) In an ordered sample, the order in which the elements of the sample have been drawn
is important. A good example is words. If we draw the letters L, I, S, T, E, N from
the alphabet, then depending on the order of the letters one can get the words
LISTEN, ENLIST, SILENT or TINSEL.
Obviously, all of these have a different meaning.
b) If a ball in the above experiment is drawn purely at random, i.e. we have a Laplace
experiment, then each of the n^s possible outcomes has the probability 1/n^s.
c) There also exists another (a 'dual') presentation of the same experiment. Instead of
drawing a ball s times, we throw s balls into n numbered urns. To be more precise, we
have s distinguishable (for instance numbered) balls and n distinguishable urns. We
now place these balls into the urns, see Figure 2.11. In this experiment we can place
as many balls as we like into each urn. Since each of the s balls falls into one of
the n urns, we can represent this experiment by a vector of length s:

(a_1, . . . , a_s) ∈ {1, . . . , n}^s.    (2.5.1)

The number a_j stands for the urn in which the j-th ball has been placed. Since each
a_j can take n different values, the total number of possibilities is

n · . . . · n (s times) = n^s.    (2.5.2)
Figure 2.12: Illustration of a function X : Ω → B.
Example 2.5.5. How many subsets does a set Ω with s elements have (including
the set Ω itself and the empty set ∅)? Notice that this question is equivalent to
computing the cardinality of the power set PΩ of Ω (see Definition B.1.10). We compute
the number of subsets of Ω with the help of Example 2.5.4. We define for a subset A ⊂ Ω
a function 1_A as

1_A : Ω → {0, 1} with 1_A(ω) := 1 if ω ∈ A, and 1_A(ω) := 0 if ω ∉ A.    (2.5.4)

It is clear that different subsets lead to different functions and that a subset can be
recovered from the assigned function. We thus see that the number of subsets of Ω is
equal to the number of functions from Ω to {0, 1}. This number has been computed in
Example 2.5.4. Since Ω has s elements and {0, 1} has two elements, Ω has 2^s subsets.
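The correspondence between subsets and {0, 1}-valued functions can be enumerated by a computer. A brief Python sketch (a toy check for s = 3; an illustration only):

    from itertools import product

    Omega = ['x', 'y', 'z']                        # a set with s = 3 elements

    # Each function 1_A : Omega -> {0, 1} is a tuple of s values;
    # it corresponds to exactly one subset A.
    subsets = [{w for w, v in zip(Omega, values) if v == 1}
               for values in product([0, 1], repeat=len(Omega))]

    print(len(subsets))                            # 2**3 = 8 subsets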
Remark 2.5.6. The function 1_A is called the indicator function and is, for instance,
helpful for writing probabilities as expectations of random variables, see Section 3.8.
Example 2.5.7. How many possible bets can be made in football toto? At this
point one needs to know that in football toto one bets on 11 football games. For each
game, one has to choose between victory of the home team ('1'), draw ('0') or victory of
the guest team ('2'). Thus a bet in football toto corresponds to an experiment with 11
levels and with 3 possibilities at each level. We therefore have

3^11 = 177147

possible bets in football toto. A possible question at this point is: how many bets are
completely wrong? Per game, there are 2 wrong tips and we thus have in total

2^11 = 2048

completely wrong bets. If we compare this with the 177147 possible bets, we see there are
only a few completely wrong bets (ca. 1.1%). It is thus very likely that somebody,
even if he or she has absolutely no knowledge about football, gives the right tip for at
least one game.
Figure 2.13: Urn with n = 5 balls, after drawing twice without replacement.
We again have s levels in this experiment. The difference to the previous experiment is
that the different levels are no longer copies of each other. Obviously, the drawn balls
have an influence on which numbers remain in the urn, but not on the number of
balls at each level. We thus can apply the product rule. At the first level we have n
possibilities, at the second level we have n − 1 possibilities, . . . , and at the s-th level we
have n − s + 1 possibilities. We thus get with the product rule

(n)_s := n · (n − 1) · . . . · (n − s + 1)    (2.5.5)

possible outcomes of our experiment. The notation (n)_s is called the falling factorial.
There are other notations for the falling factorial (n)_s used in the literature, such as
nPs and P(n, s), but in this course you should use only the notation (n)_s.
If s = n, i.e. the urn is empty at the end, we have

n! := (n)_n = n · (n − 1) · . . . · 3 · 2 · 1    (2.5.6)

possible outcomes of our experiment. The notation n! is called the factorial. In particular
we see that n different elements can be arranged in n! ways. Thus there are n! different
permutations of n different elements. Typically one defines 0! = 1. Note that (n)_s = 0
for s > n, and the formula thus also gives the right number of possible samples for s > n
(the urn contains only n balls).
Remark 2.5.8.
a) If we draw a sample purely at random (n balls and s levels), then each sample has the
probability

1/(n)_s = (n − s)!/n!.    (2.5.7)
b) We also have a dual description here, which is often more convenient. We have to
distribute s different balls into n different urns, where s ≤ n. The difference to
Section 2.5.1 is that this time we can place at most one ball in each urn, see Figure 2.14.
How many possibilities do we have? The first ball has n possibilities, the second ball
has n − 1 possibilities, . . . , and the s-th ball has n − s + 1 possibilities. This gives

(n)_s = n · (n − 1) · . . . · (n − s + 1)

different possible outcomes of the experiment.
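For numerical work, the falling factorial is easy to implement; recent Python versions also ship it as math.perm. A minimal sketch (the name falling_factorial is just my illustrative choice):

    import math

    def falling_factorial(n, s):
        # (n)_s = n * (n - 1) * ... * (n - s + 1); equals 0 for s > n
        result = 1
        for k in range(s):
            result *= n - k
        return result

    print(falling_factorial(5, 3))                        # 60, as in Example 2.4.5
    print(falling_factorial(5, 3) == math.perm(5, 3))     # True (math.perm needs Python >= 3.8)
    print(falling_factorial(9, 9) == math.factorial(9))   # True, cf. Example 2.5.9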
Example 2.5.9. Anna has 9 (different) books and would like to arrange them
on a bookshelf. In how many ways can she do this? Obviously, the order of the
books plays a role and each book can only be selected once. We thus have (9)_9 = 9!
possibilities.
Example 2.5.10. 8 sprinters run in the 100m final of a sports event. In
how many ways can the medals be distributed? We have a gold, a silver and a
bronze medal. Each sprinter can get only one medal and we thus have (8)3 = 8 · 7 · 6
possibilities.
The following video Sec2.5d contains some additional comments about the Examples 2.5.11,
2.5.12 and 2.5.13 below.
Example 2.5.11. A list of n cities is given and a salesman has to visit
each city on this list. In how many ways can he visit these cities if he
would like to visit each city only once? Obviously, the number of possible
routes is the number of permutations of n elements and is thus n!. If furthermore the
distances between each pair of cities are known, then a natural question is: what is the
shortest possible route that the salesman can take? It is surprisingly hard to find
the answer to this question. This problem is known as the Travelling Salesman Problem
(TSP) and it plays an important role in operations research.
At this point, one might ask why the travelling salesman problem is so hard. Why not
test all possible routes and then choose the optimal one? The reason is that the value
of n! grows extremely fast as n goes to infinity, much faster than exponentially. To get
an impression of how fast n! grows, we consider an example.
Example 2.5.12. How heavy are 100! rice grains? We have 100! ≈ 9.333 · 10^157
and one grain of rice weighs around 25 mg = 0.000025 kg. Thus 100! rice grains weigh
around 2.333 · 10^153 kg. As a comparison, our solar system weighs around 2 · 10^30 kg
and our Milky Way around 3.60 · 10^41 kg.
a_n/b_n = n^2/(n^2 − 5n) = 1/(1 − 5/n) → 1 as n → ∞,    (2.5.11)

a_n/c_n = n^2/((n^2 + 3n)/2) = 2/(1 + 3/n) → 2 as n → ∞.    (2.5.12)

One should point out that a_n ∼ b_n does not imply that a_n − b_n converges to 0. For
Stirling's formula this is definitely wrong. Another example is, for instance,
a_n = n + log n and b_n = n. We have here

a_n/b_n = 1 + (log n)/n → 1 as n → ∞ =⇒ a_n ∼ b_n.    (2.5.13)
Example 2.5.15. How many injective functions exist from Ω = {1, . . . , s} to
B = {1, . . . , n}? A function X is injective if X(i) ≠ X(j) for all i ≠ j with i, j ∈ Ω. For
instance, the function in Figure 2.12 is not injective as X(4) = X(5). If s > n, then it
is obvious that no injective function exists. If s ≤ n, then one can interpret an injective
function

X : Ω → B

as transporting s numbered balls into n urns, where each urn only has enough space for
one ball. The above comments now show that we have (n)_s injective functions.
Example 2.5.16. In how many ways can one arrange 8 rooks on a chessboard
so that they cannot hit each other? At this point one should know that a
chessboard has 8 × 8 squares and that a rook can move any number of vacant squares verti-
cally or horizontally. Thus, the rooks have to be arranged so that there is precisely one
rook in each row and each column. We now arrange the rooks column by column. For the
first rook we have 8 possibilities, for the second rook we have 7 possibilities, and so on.
We thus have in total

8! = 40320

possible arrangements. A similar argument shows that for n rooks on an n × n chessboard
one has n! possible arrangements so that they cannot hit each other.
The video Sec2.5e discusses a further example not contained in the notes.
(n)_s = n!/(n − s)!    (2.5.14)

is the number of ordered samples of {1, . . . , n} with s elements. On the other hand,
there exist s! possible arrangements of s elements, and thus each unordered sample with
s elements corresponds to s! ordered samples. We thus have

\binom{n}{s} = (n)_s/s! = n!/((n − s)! s!) = \binom{n}{n−s}.    (2.5.15)
Remark 2.5.17.
a) We have \binom{n}{0} = 1. The reason is that each set has only one subset with 0
elements: the empty set.

b) Summing over all subset sizes gives

∑_{k=0}^{n} \binom{n}{k} = 2^n.    (2.5.16)

This equation can also be obtained from the binomial law. We have

∑_{k=0}^{n} \binom{n}{k} = ∑_{k=0}^{n} \binom{n}{k} 1^k 1^{n−k} = (1 + 1)^n = 2^n.    (2.5.17)
Example 2.5.18. The Lotto problem: an urn contains 49 balls with numbers from 1 to
49. Seven of these balls are red and all others white. We now choose a sample of 6 balls
without considering the order. We have in total

\binom{49}{6} = (49 · 48 · 47 · 46 · 45 · 44)/(1 · 2 · 3 · 4 · 5 · 6) = 13983816

possible samples. If we choose a sample purely at random, then each sample has the proba-
bility 1/13983816. One question we could ask is, for instance: what is the probability that
a sample consists only of red balls? From the above considerations we know that there are
\binom{7}{6} samples with only red balls. Since we have a Laplace experiment, we get

P['6 red balls'] = \binom{7}{6} / \binom{49}{6}.    (2.5.18)

More generally, what is the probability that our sample contains precisely s red balls? We
have \binom{7}{s} possibilities to choose s red balls. Furthermore, we have \binom{42}{6−s}
ways to choose the remaining 6 − s balls from the 42 white balls. Using the product rule,
we obtain

\binom{7}{s} · \binom{42}{6−s}    (2.5.19)

possible samples with s red balls. Thus

P['s red balls'] = \binom{7}{s} · \binom{42}{6−s} / \binom{49}{6}.    (2.5.20)
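Formula (2.5.20) is easy to evaluate with math.comb. A short Python sketch (an illustration only; the final line is a sanity check, since every sample has some number of red balls between 0 and 6):

    from math import comb

    def p_red(s):
        # binom(7, s) * binom(42, 6 - s) / binom(49, 6), cf. (2.5.20)
        return comb(7, s) * comb(42, 6 - s) / comb(49, 6)

    print(comb(49, 6))                       # 13983816
    print(p_red(6))                          # probability of 6 red balls
    print(sum(p_red(s) for s in range(7)))   # 1.0 up to rounding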
Example 2.5.19. A parliament consists of n members. They would like to
build a committee consisting of s members, one of whom is the chairman. How
many possibilities are there? One possibility is to choose first the chairman (there
are n = \binom{n}{1} possibilities) and then to choose the s − 1 remaining members of the
committee (there are \binom{n−1}{s−1} possibilities to do this). Thus there are

n · \binom{n−1}{s−1}    (2.5.21)

possible ways to build this committee. An alternative way is to choose first the s members
of the committee (there are \binom{n}{s} ways) and then to choose the chairman among the
members of the committee (there are s = \binom{s}{1} possibilities). Since both ways must
give the same number of possible committees, we have

n · \binom{n−1}{s−1} = \binom{n}{s} · s  =⇒  (n/s) · \binom{n−1}{s−1} = \binom{n}{s}.    (2.5.22)
Had we started with distinctly labelled As, say A_1 and A_2, then we would have 24 words
in the list. We can use this to find a quick way to compute the number of different words.
When both A_1 and A_2 are replaced by A, then A_1A_2BC and A_2A_1BC both go to AABC. We
thus see that each word in the first list corresponds to precisely 2 words in the second
list. Hence we have in total 4!/2 = 24/2 = 12 different words.
Another way to get to this result is to interpret the problem as distributing the 4 letters
A, A, B, C over 4 seats. First, we choose 2 seats for the As; there are \binom{4}{2}
possibilities. Since 2 seats remain, there are \binom{2}{1} possibilities to choose a seat
for B. Finally, there is 1 seat remaining for C and we thus have only 1 = \binom{1}{1}
possibility. Using the product rule, we obtain

\binom{4}{2} · \binom{2}{1} · \binom{1}{1} = 12.    (2.5.23)
Example 2.5.21. Let us now extend Example 2.5.20 a little bit. In how many ways can
one arrange three As, four Bs and five Cs? (We assume at this point that we cannot
distinguish identical letters.) In total, we have 12 letters and one can permute them in
12! ways, but as in Example 2.5.20, not all permutations give a different word.
As above, we interpret this problem as distributing the letters over 12 seats. First, there
are \binom{12}{3} possibilities to choose the 3 seats for the As. Since 9 seats remain,
there are \binom{9}{4} possibilities to choose the 4 seats for the Bs. Finally, there are
5 seats remaining for the 5 Cs and thus there is 1 = \binom{5}{5} possibility. In total we
get with the product rule

\binom{12}{3} · \binom{9}{4} · \binom{5}{5} = 12!/(3!4!5!) = 27720.    (2.5.24)
Example 2.5.22. In a similar way one can compute the number of possible arrangements
of 0s and 1s, if the number of 0s is a and the number of 1s is b (a, b ∈ N). In total we
have a + b elements to permute. Arguing as in Examples 2.5.20 and 2.5.21 gives

\binom{a+b}{a} = \binom{a+b}{b}

possibilities to arrange the 0s and 1s.
Example 2.5.23. What is the probability that we throw 'Heads' exactly 5 times
when we throw a fair coin 10 times? We begin by determining Ω and P. We have

Ω = {(ω_1, . . . , ω_10); ω_j ∈ {0, 1}} and P[ω] = 1/2^10 for all ω ∈ Ω,

where 1 stands for 'Heads' and 0 for 'Tails'. Let A be the event that we throw exactly 5
times 'Heads'. All ω consist only of 1s and 0s, and thus all ω ∈ A have five 1s and five
0s. Since we are in the situation of a Laplace experiment, we have to compute the
cardinality of A. We thus need to know in how many ways one can arrange five 0s and five
1s. But this is precisely the topic of Example 2.5.22 and we have |A| = \binom{10}{5}.
Therefore

P[A] = |A|/|Ω| = \binom{10}{5}/2^10 = 252/1024 ≈ 1/4.    (2.5.25)
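The count |A| = \binom{10}{5} can be double-checked by listing all 2^10 outcomes. A brief Python sketch (an illustration only):

    from itertools import product
    from math import comb

    Omega = list(product([0, 1], repeat=10))        # all 2**10 outcomes
    A = [w for w in Omega if sum(w) == 5]           # exactly five 1s ('Heads')

    print(len(A), comb(10, 5))                      # both 252
    print(len(A) / len(Omega))                      # 252/1024 = 0.24609375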
2.5.4 Mixed samples
We consider in this section some combinatorial examples involving the three cases we have
studied in the previous sections. The video Sec2.5h contains some examples not contained
in the notes.
Example 2.5.24. Peter has 11 (different) books: 4 about Geometry, 5 about Algebra and
2 about Probability. Peter would like to arrange them on a bookshelf.

b) Suppose now that Peter would like to order the books by topic, with
Geometry first, then Algebra and then Probability. How many
arrangements are possible? There are 4! ways to arrange the Geometry books, 5!
ways to arrange the Algebra books and 2! ways to arrange the Probability books. Since
the arrangement within each topic has no influence on the arrangements of the other
topics, we can use the product rule and get 4! · 5! · 2! possibilities.

c) How many arrangements are possible if Peter would like to have the
books about the same topic standing together, but with no preference
for the order of the topics? First, we choose the order of the topics. We have
3 topics and can thus arrange them in 3! = 6 ways. Furthermore, for each possible
order of the topics we have 4! · 5! · 2! possibilities to arrange the books. We thus get
with the sum rule in total 6 · 4! · 5! · 2! possible arrangements.
Example 2.5.25. How many people do you need in a room to have a better
than 50% chance that two of them share a birthday? Let n be the number of
people in the room. Let us compute the probability that all n people have a different
birthday. For simplicity, we assume at this point that the year has 365 days and each day
has the same probability. There are now (365)n = 365 · 364 · . . . · (365 − n + 1) possibilities
if all persons should have a different birthday (ordered sample without replacement). Fur-
thermore, there are 365n possibilities for the birthdays if there are no restrictions (ordered
sample with replacement). We thus get
P['At least 2 persons have the same birthday'] = 1 − (365)_n / 365^n.    (2.5.26)
Testing some numbers for n (with the computer), one obtains for n = 23 that the above probability is 0.507. We thus see that already a relatively small group has a good chance that 2 persons share a birthday.
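One can reproduce this computation with a few lines of Python; the following is a minimal sketch using only the standard library:

    # Numerical check of (2.5.26): probability that at least two of n
    # people share a birthday, under the simplifying 365-day model.
    def p_shared(n):
        p_all_different = 1.0
        for i in range(n):
            p_all_different *= (365 - i) / 365   # builds (365)_n / 365^n factor by factor
        return 1 - p_all_different

    print(p_shared(22))   # ~0.476
    print(p_shared(23))   # ~0.507, the first n with probability > 1/2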
Example 2.5.26.
b.) How many codewords are possible if there are no restrictions on how often a colour can occur? This is an ordered sample with replacement and we thus have 6^4 = 6 · 6 · 6 · 6 = 1296 possibilities.
Example 2.5.29. How many divisors does the number 1'000'000'000 have (including 1 and 1'000'000'000 itself)? We have 1'000'000'000 = 10^9 = 2^9 · 5^9. Thus a divisor of 1'000'000'000 has the form 2^k · 5^ℓ with 0 ≤ k, ℓ ≤ 9. This gives 10 · 10 = 100 divisors.
2.6 Conditional probability
Please watch now the following video Sec2.6a and then proceed reading. In this section, we study the probability that an event A occurs if we already know something about the outcome of the experiment. We will then call this probability the conditional probability. As motivation, we take a look at two examples.
Example 2.6.2. Let us consider again the situation of Example 2.3.2. We have there an urn containing three 'a's and two 'n's and we draw two times a ball without replacement. We now compute the probability to draw an 'n' in the second step. This is
P['n' in the second step] = P[(a, n)] + P[(n, n)] = 3/10 + 1/10 = 2/5.    (2.6.4)
If we know nothing about the outcome of the experiment, then we expect in 2 of 5 cases to draw an 'n'. However, the situation is different if we already know that the first ball is an 'a'. In this case, there are two 'a's and two 'n's left in the urn. The probability to draw an 'n' is thus 1/2. On the other hand, if we know that the first ball is an 'n', then there is one 'n' and three 'a's left in the urn. Thus the probability to draw an 'n' is 1/4.
We have made a similar observation in Example 2.2.4 with the club with 100 members. If we choose a member purely at random, then the probability to select a vegetarian is 0.4. On the other hand, if we already know that the selected member is a woman, then the probability to select a vegetarian is 2/3, which is a lot higher. We thus see that some knowledge about the outcome of an experiment can have a significant influence on the probability of an event.
We will now introduce a clear, formal description for the above ad hoc computations. For this, let (Ω, P) be a probability space and A, B ⊂ Ω two events. We now assume that we already know that the event B will occur. What is in this situation the probability
that also A occurs? We will call this probability P [A|B]. At this point, we have not yet
given a proper definition of P [A|B], but we have to have in Example 2.6.1
P['50 times head' | 'We have tossed 49 times head'] = 1/2    (2.6.5)
and in Example 2.6.2
P['n' in the second step | 'n' in the first step] = 1/4.    (2.6.6)
Our first aim is to find a proper mathematical definition. Let us begin with the special
cases A = B c and A = B. For this, suppose that someone is doing a random experiment,
but does not tell you the result. He only tells you that the event B has occurred.
• P[B^c|B]: If ω ∉ B, then this ω cannot have occurred, because we already know that an ω ∈ B has occurred. Thus this ω has in this situation probability 0. In formulas, P[ω|B] = 0 for all ω ∉ B. Similarly, we deduce P[B^c|B] = 0.
• P [B|B]: In words, P [B|B] is the probability that B occurs if we already know that B
occurs. Clearly, the only possibility is P [B|B] = 1.
Let us now take a look at an arbitrary event A. As motivation to find the right definition
for P [A|B], we take a look at Figure 2.15.
Obviously, only those ω ∈ A which are also in B can occur. This corresponds to the hatched part in Figure 2.15. Thus only the elements in the intersection A ∩ B can occur for the event A. The probability P[A|B] thus should be proportional to P[A ∩ B]. A potential candidate for P[A|B] is P[A ∩ B]. However, we don't have P[B|B] = 1 for this candidate since
P[B ∩ B] = P[B],
and this probability is usually less than 1. To reach P[B|B] = 1, we divide P[A ∩ B] by P[B]. At this point we have to assume that P[B] ≠ 0. We can do this without further ado: we obviously get logical problems if we try to compute the probability of the event A under the assumption that an event B with P[B] = 0 occurs. We thus define
Definition 2.6.3. Let (Ω, P) be a (discrete) probability space and A, B ⊂ Ω be given with P[B] ≠ 0. We then define the conditional probability of A given B as
P[A|B] := P[A ∩ B] / P[B].    (2.6.7)
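Definition 2.6.3 is easy to turn into code. The following Python sketch (the function names are ours, chosen for illustration) computes conditional probabilities on a finite sample space with events given as sets:

    # A minimal sketch of Definition 2.6.3 for a finite sample space.
    def prob(event, P):
        return sum(P[w] for w in event)

    def cond_prob(A, B, P):
        pB = prob(B, P)
        assert pB != 0, "P[B] must be non-zero"
        return prob(A & B, P) / pB

    # Fair die: P[{6} | 'number is even'] = (1/6)/(1/2) = 1/3.
    P = {w: 1/6 for w in range(1, 7)}
    print(cond_prob({6}, {2, 4, 6}, P))   # 0.333...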
Example 2.6.6. In New York, the probability that a random teenager owns a skateboard
is 0.48 and that a random teenager owns a skateboard and roller blades is 0.39. What
is the probability that a teenager owns roller blades if we know that the teenager owns a
skateboard? This probability is
0.39/0.48 ≈ 0.81.    (2.6.15)
The following video Sec2.6b discusses a further example not contained in the notes.
Exercise 2.6.7. We roll a fair die. Find the probability that the number is > 2 and the probability that the number is a prime number. Also find the probability that the number is > 2 given that the number is prime.
In most cases, our intuition agrees with the computed conditional probability. However,
there are examples, where this is not the case.
Example 2.6.8. We have three indistinguishable purses and each contains two coins. One
purse contains two gold coins, another contains two silver coins and the third contains
one gold coin and a silver coin.
GG GS SS
A purse is selected at random, then at random a coin is selected from it. The selected coin
turns out to be gold. What is the probability that the other coin in the purse is also gold?
P[G left | G selected] = P[G selected ∩ G left] / P[G selected] = P[two G purse] / P[G selected] = (1/3)/(1/2) = 2/3.
At first sight, one would expect 1/2. Let us thus take a closer look at this experiment to see if the 1/2 or the 2/3 is more reasonable. First, we illustrate it with a tree diagram, see Figure 2.16. We see that there are 3 paths where we select a gold coin. Two of them correspond to the first purse and one to the second purse. Since all paths in this experiment have the same probability, we must have 2/3. Thus the 2/3 is indeed the right result. At this point one might ask: why does our intuition fail? A possible explanation is that one might think the events A = 'We select a gold coin' and B = 'We select a purse with gold coin(s)' are the same, but they are not. We clearly have A ⊂ B, but the event 'We select a silver coin and a gold coin is remaining' is contained in B and is not contained in A. Thus A ≠ B.
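If one does not trust the computation, one can also simulate the experiment. The following Python sketch (a Monte Carlo estimate, so the output only approximates 2/3) repeats the experiment many times:

    # Monte Carlo check of Example 2.6.8 with the standard library.
    import random

    purses = [("G", "G"), ("G", "S"), ("S", "S")]
    selected_gold = gold_left = 0
    for _ in range(100_000):
        purse = random.choice(purses)           # choose a purse at random
        coin, other = random.sample(purse, 2)   # draw one coin, the other stays
        if coin == "G":
            selected_gold += 1
            if other == "G":
                gold_left += 1
    print(gold_left / selected_gold)   # close to 2/3, not 1/2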
We next study the properties of conditional probability. We start in Section 2.6.1 with some basic observations. We then give three rules for computations with conditional probability. These three are the multiplication rule (see Lemmas 2.6.13 and 2.6.14), the law of total probability (see Theorems 2.6.17 and 2.6.21) and Bayes' theorem (see Theorem 2.6.23). In particular, we will use the multiplication rule and the law of total probability to justify tree diagrams and the path rule, see Proposition 2.6.16.
Furthermore
P[A|B1] + P[A|B2] = P[A|B] + P[A|B^c] = 1/3 + P[A ∩ B^c]/P[B^c] = 1/3 + P[{3}]/P[{1, 3, 5}] = 2/3.
Thus the third point is not true. Note that P[A|B1 ∪ B2] ≠ P[A|B1] + P[A|B2] for most events A, B1 and B2. Let us now summarise these three observations as a lemma.
Lemma 2.6.9. Let (Ω, P) be a (discrete) probability space. We then have:
a) Let A and B be events with P[A] > 0 and P[B] > 0. We then have in general
P[A|B] ≠ P[B|A].    (2.6.18)
b) Let A1 and A2 be disjoint events and B an event with P[B] > 0. Then
P[A1 ∪ A2|B] = P[A1|B] + P[A2|B].    (2.6.19)
c) Let A be an arbitrary event and B1, B2 be disjoint events with P[B1] > 0 and P[B2] > 0. We then have in general
P[A|B1 ∪ B2] ≠ P[A|B1] + P[A|B2].    (2.6.20)
Notice that the term ‘in general’ indicates that (2.6.18) and (2.6.20) are true for most
events, but not all. Indeed, if we choose A = B then P [A|B] = P [B|A] and if we choose
an A with P [A] = 0 then P [A|B1 ∪ B2 ] = P [A|B1 ] + P [A|B2 ].
We now show that we can extend the property (2.6.19) a little bit. We have
Lemma 2.6.10. Let (Ω, P) be a (discrete) probability space. Let further B ⊂ Ω be given with P[B] > 0. Then
A ↦ P[A|B]
is a probability measure on Ω.
Proof of Lemma 2.6.10. We verify here the properties of Definition 2.1.19. First we have to check that 0 ≤ P[A|B] ≤ 1 for each A ⊂ Ω. The inequality 0 ≤ P[A|B] is obvious. On the other hand, we have A ∩ B ⊂ B and thus P[A ∩ B] ≤ P[B]. We get
P[A|B] = P[A ∩ B]/P[B] ≤ P[B]/P[B] = 1.
Similarly, we get
P[Ω|B] = P[Ω ∩ B]/P[B] = P[B]/P[B] = 1.
In contrast, for a fixed event A, the map B ↦ P[A|B] is not a probability measure on Ω. This follows immediately from Lemma 2.6.9. Another way to see this is for instance with B = Ω. We then have
P[A|Ω] = P[A ∩ Ω]/P[Ω] = P[A]/1 = P[A].    (2.6.25)
Thus B ↦ P[A|B] could only be a probability measure if P[A] = 1. But if we have P[A] = 1, then P[A|B] = 1 for all B with P[B] ≠ 0.
Exercise 2.6.12. Let A be an event with P[A] = 1. Show that P[A ∩ B] = P[B] for all events B ⊂ Ω.
Lemma 2.6.13 (Multiplication rule). Let (Ω, P) be a (discrete) probability space and A, B ⊂ Ω with P[B] > 0. We then have
P[A ∩ B] = P[A|B] · P[B].    (2.6.26)
Figure 2.17: The path —P[B]→ B —P[A|B]→ A in a two level experiment.
Proof. Equation (2.6.26) follows immediately from Definition 2.6.3. More precisely
P[A|B] = P[A ∩ B]/P[B]  ⟺  P[A|B] · P[B] = P[A ∩ B].    (2.6.27)
It is very convenient to interpret Lemma 2.6.13 in the context of the path rule. However,
Lemma 2.6.13 is true for all events A and B with P [B] > 0 and an interpretation as path
as in Figure 2.17 is not required. Thus the path rule for a two level experiment is just a
special case of Lemma 2.6.13.
We are of course also interested in random experiments with more than two levels. We
thus have to extend Lemma 2.6.13 a little bit. For this, suppose we are interested in the
probability that A1 occurs in the first level, A2 in the second level, . . . , and An in the
n-th level. In other words, we are interested in the probability that A1 and A2 and . . . and An occur, i.e. we are interested in P[⋂_{i=1}^n Ai] (see (2.1.37)). The path corresponding
to this probability is given in Figure 2.18. The probabilities at the arrows can now be
—P[A1]→ A1 —P[A2|A1]→ A2 —P[A3|A1∩A2]→ A3 —P[A4|A1∩A2∩A3]→ A4 · · ·
interpreted as follows: P [A1 ] is the probability that in the first step the event A1 occurs,
P [A2 |A1 ] is the probability that A2 occurs in the second step if A1 occurs in the first step,
P [A3 |A1 ∩ A2 ] is the probability that A3 occurs in the third step if A1 and A2 occur in
the first and second step and so on. The following lemma shows that the path rule in
Figure 2.18 gives the right answer.
Lemma 2.6.14 (Multiplication rule, general case). Let A1, . . . , An ⊂ Ω be given with P[A1 ∩ A2 ∩ · · · ∩ An−1] > 0. We then have
P[⋂_{i=1}^n Ai] = P[A1] · P[A2|A1] · P[A3|A1 ∩ A2] · . . . · P[An|A1 ∩ · · · ∩ An−1].    (2.6.28)
Proof. We apply Lemma 2.6.13 with B = A1 ∩ · · · ∩ An−1 and A = An and get
P[⋂_{i=1}^n Ai] = P[An|A1 ∩ · · · ∩ An−1] · P[A1 ∩ · · · ∩ An−1].
We then apply Lemma 2.6.13 in the same way to P[A1 ∩ · · · ∩ An−1] and split off P[An−1|A1 ∩ · · · ∩ An−2]. We then proceed in this way till only P[A1] is left. This completes the proof.
Similar to Lemma 2.6.13, Lemma 2.6.14 is true for arbitrary events A1, . . . , An and does not require that A1, . . . , An can be interpreted as a path in a multi level experiment.
Let us now consider a concrete example to compare the results obtained with the path rule and with the multiplication rule and check if both methods indeed give the same result.
Example 2.6.15. We have an urn with 5 red and 2 blue balls. We now draw 3 times purely randomly a ball without replacement. We denote by Cj the colour of the ball drawn in the j-th turn. We write r for red and b for blue. We now compute P[C1 = r, C2 = b, C3 = r] with the path rule (see Proposition 2.3.4) and with Lemma 2.6.14. We begin with the path rule. The interesting path is
· —5/7→ r —2/6→ rb —4/5→ rbr    (2.6.31)
and thus
P[C1 = r, C2 = b, C3 = r] = 5/7 · 2/6 · 4/5 = 4/21.    (2.6.32)
If one now uses Lemma 2.6.14, one obtains
P[C1 = r, C2 = b, C3 = r] = P[C1 = r] · P[C2 = b|C1 = r] · P[C3 = r|C1 = r, C2 = b].
We then have
P[C1 = r] = 5/7 and P[C2 = b|C1 = r] = P[C2 = b, C1 = r]/P[C1 = r] = ((5 · 2)/(7 · 6)) / (5/7) = 2/6.
Thus P [C1 = r] and P [C2 = b|C1 = r] are probabilities at the first and the second arrow
in (2.6.31). A similar argument shows that P [C3 = r|C1 = r, C2 = b] is the probability at
the third arrow in (2.6.31). Thus the path rule and the multiplication rule give the same
results as expected.
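One can also verify (2.6.32) by brute force. The following Python sketch enumerates all equally likely ordered draws of 3 balls from the urn:

    # Brute-force check of Example 2.6.15.
    from itertools import permutations
    from fractions import Fraction

    balls = ["r"] * 5 + ["b"] * 2
    draws = list(permutations(balls, 3))                 # all ordered triples of distinct balls
    hits = sum(1 for d in draws if d == ("r", "b", "r"))
    print(Fraction(hits, len(draws)))                    # 4/21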
If one also takes a look at the Examples 2.3.1 and 2.3.2, one sees immediately that the entries at the tree diagrams are also conditional probabilities, obtained with the multiplication rule. We thus can summarize this section as
Proposition 2.6.16. The path rule in Proposition 2.3.4 is true and is a special case
of Lemma 2.6.14.
Before we verify this, we have to take a look at the probability tree in Figure 2.19.
Figure 2.19: Probability tree for a two level experiment with a splitting into two cases.
Why do we have in the first level a splitting into B and B^c for some event B? Let us reformulate this question as follows: When we have in the first level a splitting into two cases B1 and B2, what should these events fulfil? First, B1 and B2 have to be different cases. Thus B1 and B2 have to be mutually exclusive, i.e. B1 ∩ B2 = ∅. If this were not true, then B1 and B2 could both occur at the same time, which we obviously don't want. Second, all possibilities have to be covered by B1 and B2 and thus B1 ∪ B2 = Ω. If we had B1 ∪ B2 ≠ Ω, then there would exist at least one ω ∈ Ω belonging neither to B1 nor to B2. Thus this ω wouldn't be covered by the cases B1 and B2. Obviously, we also don't want this. The two properties B1 ∩ B2 = ∅ and B1 ∪ B2 = Ω clearly imply B1 = B2^c. For simplicity, we write just B instead of B1. We now justify (2.6.34). We have
Theorem 2.6.17 (Law of total probability, simple version). Let (Ω, P) be a (discrete) probability space and A, B ⊂ Ω with 0 < P[B] < 1. Then
P[A] = P[A|B] · P[B] + P[A|B^c] · P[B^c].
Proof. The splitting in Figure 2.20 together with Lemma 2.1.15.a) gives
P[A] = P[A ∩ Ω] = P[A ∩ (B ∪ B^c)] = P[(A ∩ B) ∪ (A ∩ B^c)]
     = P[A ∩ B] + P[A ∩ B^c] = P[A ∩ B] · P[B]/P[B] + P[A ∩ B^c] · P[B^c]/P[B^c]
     = P[A|B] · P[B] + P[A|B^c] · P[B^c].
We have used that the events A ∩ B and A ∩ B c are disjoint.
Example 2.6.18. We have two identical urns with red and blue balls. The first urn
contains 4 red and 5 blue balls and the second urn contains 3 red and 6 blue balls. We
now choose randomly an urn and draw a ball. What is the probability that we draw a red
ball? It is easy to determine the probability to draw a red ball if we know which urn we
have chosen. We thus introduce the events
A := ‘A red ball is drawn’, (2.6.36)
B := ‘The first urn has been chosen’. (2.6.37)
Clearly, B c is the event ‘The second urn has been chosen’. We now have
P[B] = P[B^c] = 1/2, P[A|B] = 4/9 and P[A|B^c] = 3/9.    (2.6.38)
We thus get with Theorem 2.6.17
P[A] = 1/2 · 4/9 + 1/2 · 3/9 = 7/18.    (2.6.39)
Similar to Lemmas 2.6.13 and 2.6.14, Theorem 2.6.17 does not require the interpretation as a two level experiment.
Theorem 2.6.17 can be used if the sample space Ω splits into two parts, i.e. we have two different cases. Of course, we would also like to consider situations with more than two different cases. For instance, how do we handle Example 2.6.18 if we had three urns? Let us consider the probability tree for a two level experiment with three possible cases in the first level, see Figure 2.21.
Figure 2.21: Probability tree for a two level experiment with a splitting into three cases.
Similar to the considerations after (2.6.34), we would like the events B1, B2 and B3 to be mutually exclusive and to cover the whole of Ω. Events with these properties are called a partition of Ω. This can of course be extended to a general number of events.
Definition 2.6.20 (Partition). Let (Ω, P) be a (discrete) probability space and B1, . . . , Bn be events. We call the events B1, . . . , Bn a partition of Ω if
Bi ∩ Bj = ∅ for all i ≠ j and Ω = ⋃_{i=1}^n Bi.
Theorem 2.6.21 (Law of total probability, general version). Let (Ω, P) be a (discrete) probability space and A be an arbitrary event and B1, . . . , Bn be a partition of Ω. We then have
P[A] = ∑_{i=1}^n P[A|Bi] · P[Bi].    (2.6.40)
The idea for the proof of Theorem 2.6.21 is the same as for Theorem 2.6.17. More precisely,
this means we split Ω and use conditional probability to consider each part separately. We
have given in Figure 2.22 an illustration for the splitting of Ω into three events B1 , B2 , B3 .
Theorem 2.6.21 now completes our justification of probability trees and also completes
the proof of Proposition 2.3.4.
Example 2.6.22. We have three identical urns with red and blue balls. The first urn
contains 4 red and 5 blue balls, the second urn contains 3 red and 6 blue balls and the
third urn contains 5 red and 4 blue balls. We now choose randomly an urn and draw a
ball. What is the probability that we draw a red ball? It is easy to determine the probability to draw a red ball if we know which urn we have chosen. We thus introduce the events
A := 'A red ball is drawn' and Bi := 'The i-th urn has been chosen' for i = 1, 2, 3.
Clearly, we have Bi ∩ Bj = ∅ for i 6= j since it is not possible to choose more than one
urn. Furthermore, B1 ∪ B2 ∪ B3 = Ω since we have to choose one of the three urns. We
now have
P[B1] = P[B2] = P[B3] = 1/3, P[A|B1] = 4/9, P[A|B2] = 3/9 and P[A|B3] = 5/9.    (2.6.45)
We thus get with Theorem 2.6.21
P[A] = 1/3 · 4/9 + 1/3 · 3/9 + 1/3 · 5/9 = 4/9.    (2.6.46)
Theorem 2.6.23 (Bayes' Theorem). Let (Ω, P) be a (discrete) probability space, A, B ⊂ Ω with P[A] > 0 and P[B] > 0. Then
P[B|A] = P[B] · P[A|B] / P[A].    (2.6.47)
Proof. We have
P[B|A] = P[A ∩ B]/P[A] = (P[A ∩ B]/P[A]) · (P[B]/P[B]) = (P[A ∩ B]/P[B]) · (P[B]/P[A]) = P[A|B] · P[B]/P[A].
Exercise 2.6.24. Find a sufficient and necessary condition that P [A|B] = P [B|A].
Corollary 2.6.25. Let (Ω, P) be a (discrete) probability space and A be an arbitrary event and B1, . . . , Bn be a partition of Ω. We then have for all 1 ≤ i ≤ n
P[Bi|A] = P[Bi] · P[A|Bi] / (∑_{j=1}^n P[Bj] · P[A|Bj]).    (2.6.48)
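Corollary 2.6.25 translates directly into a small function. The following Python sketch (function and variable names are ours) computes the posterior probabilities P[Bi|A] from the priors P[Bi] and the likelihoods P[A|Bi]; we test it with the numbers of Example 2.6.22:

    # A sketch of Corollary 2.6.25: posterior weights from priors and likelihoods.
    def bayes(priors, likelihoods):
        joint = [p * l for p, l in zip(priors, likelihoods)]
        total = sum(joint)                  # law of total probability, (2.6.40)
        return [j / total for j in joint]

    # Three urns from Example 2.6.22: which urn did a red ball come from?
    print(bayes([1/3, 1/3, 1/3], [4/9, 3/9, 5/9]))   # [1/3, 1/4, 5/12]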
Example 2.6.26. Suppose we have two dice, one is fair and one is manipulated. The manipulated die always gives the number 6 and cannot be visually distinguished from the fair one. We choose purely randomly a die and we would like to know if this die is the fair or the manipulated one. We introduce the events
F := 'The fair die has been chosen' and M := 'The manipulated die has been chosen'.
We obviously have P[F] = P[M] = 1/2 since we don't know more about the chosen die. We now roll the chosen die two times and get two times the number 6. What is the probability that this is the manipulated die? This means we have to compute P[M|'two times a 6']. We now get with Theorem 2.6.23
P[M|'two times a 6'] = P['two times a 6'|M] · P[M] / P['two times a 6'].
We know P[M] and clearly P['two times a 6'|M] = 1. We thus have to compute P['two times a 6']. We use here the path rule; there are two paths leading to 'two times a 6'. These paths are
—1/2→ F —(1/6)^2→ 'two times a 6'  and  —1/2→ M —1→ 'two times a 6'.    (2.6.52)
We thus get
P['two times a 6'] = 1/2 · (1/6)^2 + 1/2 · 1.    (2.6.53)
This then gives
P[M|'two times a 6'] = (1/2 · 1) / (1/2 · (1/6)^2 + 1/2 · 1) ≈ 0.97.    (2.6.54)
We thus see that it is very likely that we have chosen the manipulated die.
The video Sec2.6h contains an extension of Example 2.6.26.
Example 2.6.27. The infection rate for BSE in a herd of cows is 0.1 %, i.e. 1 of 1000 cows is suffering from BSE. Since this is a very dangerous disease, one has invented a test for this disease: If a cow has BSE, then the test is positive with a probability of 99.9 %. If a cow does not have BSE, then the test is negative with a probability of 95 %, i.e. the test shows that the cow is healthy. What is the probability that a positively tested cow has BSE?
We use the probability space (Ω, P) with Ω = {All cows in the herd} and P[ω] = 1/|Ω|. We now introduce the events
B := 'The cow has BSE' and ⊕ := 'The test is positive'.
We are interested in P [B|⊕] and know P [⊕|B] , P [⊕|B c ] and P [B]. We thus use (2.6.47)
and get
P[B|⊕] = P[⊕|B] · P[B] / P[⊕].    (2.6.55)
We now have to compute P[⊕]. Since we know P[⊕|B] and P[⊕|B^c], we can use Theorem 2.6.17 and get
P[⊕] = P[⊕|B] · P[B] + P[⊕|B^c] · P[B^c].
Using the given probabilities, we obtain
P[B|⊕] = (0.999 · 0.001) / (0.999 · 0.001 + 0.05 · 0.999) = 1/(1 + 50) = 1/51 ≈ 0.0196.    (2.6.58)
If one looks at the rates of the test above, one would expect a very good test; however, only roughly 1 of 50 positively tested cows really has BSE. It thus would be a good idea to test a positively tested cow a second time.
Exercise 2.6.28.
a) Explain why the test in Example 2.6.27 is good to identify healthy cows, but is not good to find sick cows. (Hint: Compute P[B|⊖], where ⊖ := 'The test is negative'.)
We have mentioned in Section 2.6.3 and at the beginning of this section that we have in general P[A|B] ≠ P[B|A]. However, our intuition very easily confounds P[A|B] with P[B|A]. Let us consider an example. The following example is also available as video Sec2.6i.
Example 2.6.29. The company SDPT would like to sell a new kind of polygraph test to the police. SDPT says that this test is 100% positive if the tested person lies. Should the police buy this test? Most people would intuitively say: If what SDPT claims is correct, then this is a good test and the police should buy it. Let us now take a more careful look at this test and check how good it really is. We introduce
L := 'The tested person lies', T := 'The tested person says the truth' and ⊕ := 'The test is positive'.
The main question at this point is: If the test is positive, does this imply that the tested
person lies? To answer this question, we compute the probability that the tested person
lies if the test is positive. This means we have to compute P [L|⊕]. We know from SDPT
that P [⊕|L] = 1. Applying Bayes theorem then gives
P[L|⊕] = P[⊕|L] · P[L] / P[⊕] = P[⊕|L] · P[L] / (P[⊕|L] · P[L] + P[⊕|T] · P[T]).    (2.6.59)
Obviously, some information is missing. P [L] and P [T ] clearly depend on the situation.
On the other hand, the probability P [⊕|T ] only depends on the test itself and can be
interpreted as the probability that the test is positive if the tested person says the truth.
Ideally, one would have P[⊕|T] = 0 (⟺ P[⊖|T] = 1). In this case, we would have
P [L|⊕] = 1 and the test of SDPT would be perfect. However, SDPT says nothing about
P [⊕|T ]. In principle it would be possible that P [⊕|T ] = 1. This would mean that the test
is always positive when the tested person says the truth. In other words, the test would be
always positive no matter whether the tested person says the truth or lies. Note that the claim of SDPT would still be correct, but the test itself would obviously be useless. We thus see that it is not enough to know only P[⊕|L] to make a reasonable decision.
The video Sec2.6j contains another example that shows that our intuition is often incorrect
when it comes to conditional probabilities.
Definition 2.7.1. Let (Ω, P) be a (discrete) probability space. Two events A, B ⊂ Ω are called (stochastically) independent if
P[A ∩ B] = P[A] · P[B].    (2.7.2)
Example 2.7.2. Let us roll two distinguishable, fair dice and consider the events
A = {‘The number of the first is divisible by 3’} and (2.7.3)
B = {‘The number of the second is divisible by 2’}. (2.7.4)
We set Ω = {(i, j); 1 ≤ i, j ≤ 6}. This is obviously a Laplace experiment (see Definition 2.2.1) and thus P[ω] = 1/|Ω| = 1/36. We have P[A] = 12/36 = 1/3 and P[B] = 18/36 = 1/2. We get
P[A ∩ B] = 6/36 = 1/6 = 1/3 · 1/2 = P[A] · P[B].    (2.7.8)
Thus the events A and B are independent.
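The check in Example 2.7.2 can also be done by enumeration. The following Python sketch computes P[A ∩ B] and P[A] · P[B] exactly:

    # Direct check of Example 2.7.2 on Omega = {(i,j); 1 <= i,j <= 6}.
    from fractions import Fraction

    omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]
    A = {w for w in omega if w[0] % 3 == 0}   # first number divisible by 3
    B = {w for w in omega if w[1] % 2 == 0}   # second number divisible by 2

    def P(E):
        return Fraction(len(E), len(omega))

    print(P(A & B), P(A) * P(B))   # both 1/6, so A and B are independent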
The video Sec2.7b discusses Example 2.7.3.
Example 2.7.3. Let us consider again the situation of Example 2.3.1 and check if our
intuition is correct. More precisely: Let us show that the first and second level of this
experiment are indeed independent in the meaning of Definition 2.7.1. Recall
Ω = {(a, a), (a, n), (n, a), (n, n)} and    (2.7.9)
P[{(a, a)}] = 9/25, P[{(a, n)}] = P[{(n, a)}] = 6/25 and P[{(n, n)}] = 4/25.    (2.7.10)
We consider the events A := 'The first draw gives an a' and B := 'The second draw gives an a'.
Exercise 2.7.4. Let (Ω, P)be as in Example 2.7.3 and show that the events Ac and B
are independent.
Example 2.7.5. A possible question is: Is the interest of people in mathematics and physics (stochastically) independent? We write M for the set of people interested in mathematics and P for the set of people interested in physics. A survey shows that P[M] = 0.7, P[P] = 0.6 and P[M ∩ P] = 0.48 (if we choose a person uniformly at random). We now have
P[M] · P[P] = 0.42 ≠ P[M ∩ P].    (2.7.14)
Thus the events M and P are not independent.
In everyday use, one calls two events independent if they belong to two different experiments which do not have an influence on each other. For stochastic independence this is not required. Let us give an example.
Example 2.7.7. Let us consider the same situation as in Example 2.6.4. This means we
look at the throw of a fair die and study the events
We now show that A and B are independent (even though both events refer to the same
throw!). Indeed, we have computed in Example 2.6.4 that
P[A|B] = P[A ∩ B]/P[B] = 1/3 = P[A].    (2.7.16)
We thus have P[A|B] = P[A] and therefore A and B are independent.
Unfortunately, our intuition for independence is not always correct. It can occur that we guess two events are independent, but they are not. This is for instance the case in Example 2.6.8. A further famous example is
Example 2.7.8 (Monty Hall problem). Suppose we are in a game show. The show master gives us the choice between three doors: behind two doors is a goat and behind one door a car. After we have picked a door, the show master opens a remaining door with a goat. The show master then asks us: 'Would you like to keep your door or change it?'. Intuitively, one would say that both doors have the same probability and it does not matter if we keep the door or change. Let us now compute the probability to win the car. For this, we define
A := {The car is behind the first chosen door.},
Z := {A goat is behind the first chosen door.},
W := {We win the car.},
L := {We don't win the car.}.
We then have P[A] = 1/3 and P[Z] = 2/3. Let us now consider three different strategies. In all three cases, we have the same probability tree, see Figure 2.23. Notice that we have not specified in Figure 2.23 the probabilities at the branches with 'keep the door' and 'change the door' since they depend on the strategy.
Strategy 1: Keep the door.
With this strategy, the branches with ‘keep the door’ have probability 1 and the
branches with 'change the door' have probability 0. We get
P[W] = P[A] · P[W|A] + P[Z] · P[W|Z] = P[A] · 1 + P[Z] · 0 = 1/3.    (2.7.17)
Strategy 2: Change the door.
Here we see that only the right path of the tree leads to the car and thus
P[W] = P[Z] = 2/3.    (2.7.18)
Strategy 3: Choose the door randomly.
We have in this case
P[W] = P[W|A] · P[A] + P[W|Z] · P[Z] = 1/2 · P[A] + 1/2 · P[Z] = 1/2.    (2.7.19)
We thus see that the first level of the experiment has an influence on the second level and
thus both levels cannot be independent.
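The three strategies can also be compared by simulation. The following Python sketch (a Monte Carlo estimate) plays the game many times with the 'keep' and the 'change' strategy:

    # Monte Carlo sketch of the Monty Hall problem from Example 2.7.8.
    import random

    def play(change):
        doors = [0, 1, 2]
        car = random.choice(doors)
        pick = random.choice(doors)
        # the show master opens a remaining door with a goat
        opened = random.choice([d for d in doors if d != pick and d != car])
        if change:
            pick = next(d for d in doors if d != pick and d != opened)
        return pick == car

    for change in (False, True):
        wins = sum(play(change) for _ in range(100_000))
        print("change" if change else "keep", wins / 100_000)   # ~1/3 vs ~2/3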
The video Sec2.7c considers a further example not contained in the notes.
Exercise 2.7.9. Consider the situation of Example 2.7.8, but this time with 1000 doors, goats behind 999 of them, and 998 doors opened after the first step. Compute the probabilities to win the car for each strategy in Example 2.7.8.
We would now like to consider the following question: if A and B are independent, are Ac and B then also independent? Intuitively, one would say yes. The following lemma shows that this is indeed true.
Lemma 2.7.10. Let (Ω, P) be a (discrete) probability space and A, B ⊂ Ω be two events. The following statements are equivalent:
a) A and B are independent.
b) A and B^c are independent.
c) A^c and B are independent.
d) A^c and B^c are independent.
Proof. It is enough to prove the equivalence of a) and b). The equivalence of a) and c) follows by interchanging A and B. Furthermore, the equivalence of c) and d) follows by replacing A by A^c. We now have
P[A ∩ B] = P[A] · P[B] ⟺ P[A ∩ B] + P[A ∩ B^c] = P[A] · P[B] + P[A ∩ B^c]
⟺ P[(A ∩ B) ∪ (A ∩ B^c)] = P[A] · P[B] + P[A ∩ B^c]
⟺ P[A] = P[A] · P[B] + P[A ∩ B^c] ⟺ P[A ∩ B^c] = P[A](1 − P[B]) = P[A] · P[B^c].
We have used in the second equivalence that A ∩ B and A ∩ B^c are disjoint and in the third equivalence that A = (A ∩ B) ∪ (A ∩ B^c). We get with this computation the equivalence of a) and b).
We now define the independence of more than two events. A possible approach is to extend (2.7.2). More precisely:
Definition 2.7.11. Let (Ω, P) be a (discrete) probability space. The events A1, . . . , An with n ≥ 2 are called pairwise independent if
P[Ai ∩ Aj] = P[Ai] · P[Aj] for all i ≠ j.
The question at this point is whether Definition 2.7.11 gets close enough to our intuition of independence. Let us thus think about which properties we would like to have. For this, we look at an example where we definitely should have independence.
Example 2.7.12. Let an urn be given with 8 balls of which 5 are white and 3 black. We
now draw 4 times purely randomly a ball with replacement. We define
Aj := 'The ball in the j-th draw is white.' with 1 ≤ j ≤ 4.    (2.7.21)
One now can show (for instance with the path rule) that
P [Ai ∩ Aj ] = P [Ai ] · P [Aj ] for all i 6= j (2.7.22)
and thus the events A1, A2, A3 and A4 are pairwise independent. Similarly, one can show
P[Ai ∩ Aj ∩ Ak] = P[Ai] · P[Aj] · P[Ak] for all distinct i, j, k, and    (2.7.23)
P[A1 ∩ A2 ∩ A3 ∩ A4] = P[A1] · P[A2] · P[A3] · P[A4].    (2.7.24)
The question at this point is whether the equations (2.7.22), (2.7.23) and (2.7.24) are characteristic for independent events or whether they are only special to this example. Let us thus see whether we can find an interpretation for these equations. We know from the motivation of the independence of two events at the beginning of Section 2.7.1 that
P[Ai ∩ Aj] = P[Ai] · P[Aj] ⟺ P[Ai|Aj] = P[Ai] for i ≠ j.    (2.7.25)
In words this means the occurrence of the event Aj has no influence on the probability that
the event Ai occurs. Do the equations (2.7.23) and (2.7.24) have a similar interpretation?
A possible starting point to answer this question is: What should happen to P [A3 ] and
P [A4 ] if we know that A1 and A2 occur? If the events A1 , A2 , A3 and A4 are independent,
then the natural answer is that this should have of course no influence on the probabilities
P [A3 ] and P [A4 ]. Using conditional probability, this means we should have
P [A3 |A1 ∩ A2 ] = P [A3 ] and P [A4 |A1 ∩ A2 ] = P [A4 ] . (2.7.26)
A simple computation shows that
P [A3 |A1 ∩ A2 ] = P [A3 ] ⇐⇒ P [A1 ∩ A2 ∩ A3 ] = P [A1 ] · P [A2 ] · P [A3 ] and (2.7.27)
P [A4 |A1 ∩ A2 ] = P [A4 ] ⇐⇒ P [A1 ∩ A2 ∩ A4 ] = P [A1 ] · P [A2 ] · P [A4 ] . (2.7.28)
The equation (2.7.23) thus can be interpreted as follows: The occurrence of the events A1
and A2 has no influence on the probability that the event A3 occurs. Similarly for A4 .
With a similar computation one can show that
P[A4|A1 ∩ A2 ∩ A3] = P[A4] ⟺ P[A1 ∩ A2 ∩ A3 ∩ A4] = P[A1] · P[A2] · P[A3] · P[A4].
Thus (2.7.24) can be interpreted as: The occurrence of the events A1 , A2 and A3 has no
influence on the probability that the event A4 occurs. We thus see that the three equations
(2.7.22),(2.7.23) and (2.7.24) get indeed very close to our intuition for independence (for
four events). If we think now about n independent events, then we should have similar
properties. More precisely, if the events A1, . . . , An are independent, then we should have
P[Ai1 ∩ Ai2 ∩ · · · ∩ Aid] = P[Ai1] · P[Ai2] · . . . · P[Aid]    (2.7.29)
for all 1 ≤ i1 < i2 < · · · < id ≤ n and all 2 ≤ d ≤ n. Unfortunately, pairwise independence is not sufficient for this: one can construct three pairwise independent events A, B and C with
P[A ∩ B ∩ C] = P[∅] = 0.    (2.7.31)
Thus the events A, B and C are pairwise independent, but the property (2.7.29) is not fulfilled for d = 3 and thus A, B and C are not independent.
We thus see that Definition 2.7.11 is not strong enough to fulfil our intuitive interpretation of independence. Another possible approach is to try
P[A1 ∩ A2 ∩ · · · ∩ An] = P[A1] · P[A2] · . . . · P[An].    (2.7.32)
The question is again whether (2.7.32) implies pairwise independence and (2.7.29). Unfortunately, this is also not the case.
Example 2.7.14. We consider the Laplace experiment on Ω = {1, 2, 3, 4, 5, 6, 7, 8} with three events A, B and C. It follows
P[A] = P[B] = P[C] = 1/2, P[A ∩ B ∩ C] = P[{1}] = 1/8,    (2.7.34)
but
P[A ∩ B] = P[{1}] = 1/8, P[A ∩ C] = P[{1, 2}] = 1/4 = P[{1, 5}] = P[B ∩ C].    (2.7.35)
Thus (2.7.32) is fulfilled, but the events A, B, C are not pairwise independent (P[A ∩ B] = 1/8 ≠ 1/4 = P[A] · P[B]) and thus (2.7.29) is also not fulfilled.
Since neither Definition 2.7.11 nor the approach (2.7.32) is strong enough, we have to use (2.7.29) as the definition for the independence of more than two events. We thus set
Definition 2.7.15. Let (Ω, P) be a (discrete) probability space. The events A1, . . . , An are called independent if we have
P[Ai1 ∩ Ai2 ∩ · · · ∩ Aid] = P[Ai1] · P[Ai2] · . . . · P[Aid]
for all 1 ≤ i1 < i2 < · · · < id ≤ n and all 2 ≤ d ≤ n.
If the events A1, . . . , An are independent in the sense of Definition 2.7.15, then they are also pairwise independent: one simply has to choose d = 2. Furthermore, it is easy to see that Definition 2.7.15 also implies that each subfamily of independent events A1, . . . , An is again independent.
Independence has many advantages. For instance, if we consider the path rule in Propo-
sition 2.3.4, then we can compute for independent events the probability of each part in
a path without paying attention to the rest of the path.
The question is how to check the independence of n events. The best strategy is to check first all cases with d = 2, then all cases with d = 3 and so on. This means we first check for all i < j
P[Ai ∩ Aj] = P[Ai] · P[Aj],
which obviously means we check first if the events are pairwise independent. After this we check for all i < j < k
P[Ai ∩ Aj ∩ Ak] = P[Ai] · P[Aj] · P[Ak], and so on.
Example 2.7.16. Let us consider the probability space (Ω, P) with
Ω = {1, 2, 3, . . . , 120} and P[ω] = 1/120    (2.7.39)
and the events A2, A3 and A5 with Ak := {ω ∈ Ω; ω is divisible by k}.
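The checking strategy described above can be automated. The following Python sketch does this for Example 2.7.16, assuming (as the notation A2, A3, A5 suggests) that Ak is the event that ω is divisible by k; it verifies every product required by Definition 2.7.15 for d = 2 and d = 3:

    # Independence check for the divisibility events on {1,...,120}.
    from fractions import Fraction
    from itertools import combinations

    omega = range(1, 121)
    events = {k: {w for w in omega if w % k == 0} for k in (2, 3, 5)}

    def P(E):
        return Fraction(len(E), 120)

    for d in (2, 3):
        for ks in combinations((2, 3, 5), d):
            inter = set(omega)
            prod = Fraction(1)
            for k in ks:
                inter &= events[k]
                prod *= P(events[k])
            print(ks, P(inter) == prod)   # True in every case: independent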
Chapter 3
Discrete random variables
We introduce in this chapter discrete random variables and study some of their properties. This includes the probability mass function, the expectation, the variance and independence of random variables. Also, we study the sum of independent random variables and state the law of large numbers and the central limit theorem.
3.1 Motivation and definition of random variables
Suppose we throw a fair die, or we choose purely at random one of the six cars in Table 3.1. In both experiments we can use
Ω = {1, 2, 3, 4, 5, 6}    (3.1.1)
as sample space. However, there is clearly a huge difference between both experiments.
If we roll a die then the number ‘1’ is a perfect description of the outcome. On the other
hand, the number ‘1’ is not really a good description of a car. An example of a more
accurate description is given in Table 3.1.
We thus see that a car has more properties than just being the number ‘1’. This obser-
vation is of course not restricted to cars. In fact, in almost all random experiments a
selected ω has many properties. For instance, if we select a random citizen in UK, then
this citizen has a first name, a family name, an age, a gender, a weight and so on. It is
thus natural to write an ω ∈ Ω as
ω = (ω^(1), ω^(2), ω^(3), . . . ),
where each ω^(j) is the value of a certain property. For instance, we could write for the car1 in Table 3.1
car1 = (BMW, blue, . . . ).
However, in most situations we do not need to know all properties of a selected ω, only
some ‘interesting’ ones. For a car, these are for instance the brand or the colour. The
serial number of the engine on the other hand is (normally) not really interesting. We thus
would like to have a mathematical description of a property. Let us therefore consider an
example to find the right mathematical definition.
Example 3.1.1. Let us take a look at the six cars in Table 3.1. One interesting property
is for instance the colour. We now can write
colour(car1) = blue, colour(car2) = grey and colour(car3) = white.    (3.1.4)
Another interesting property is the brand. Here we can write
brand(car1) = BMW, brand(car2) = BMW and brand(car3) = VW.    (3.1.5)
Figure 3.1: Illustration of the properties ‘colour’ and ‘brand’ of a car in Table 3.1.
Studying the two properties in Example 3.1.1 more carefully, we see that a property can be characterised as follows:
First, we determine the possible values of the property. We write S for the set of
these values. In Example 3.1.1 we have for the colour
A possible question at this point is: Are the arrows in Figure 3.2 fixed or are they also random? Let us consider again the colour of a randomly selected car. Clearly, the colour of a car doesn't change and we thus always assign to each car the same colour. Similarly, the brand of a car doesn't change and we thus always assign to each car the same brand. The arrows in Figure 3.2 are therefore fixed. We can summarize this as follows: A property itself is not random, only the values a property takes are random.
If we now consider the illustration in Figure 3.2 and the above observation, we see that we can realise a property as a function X : Ω → S. Let us consider some further examples of properties.
Example 3.1.2.
b) If we are interested in the eye colours of the students of this course, then Ω is the set of all students of this course, S is the set of possible eye colours and X assigns to each student his/her eye colour.
c) An insurance company would like to study the BMI (Body Mass index) of their clients.
Then Ω is the set of all clients, S = R+ and
X(ω) = (the weight of ω) / (the size of ω)^2.    (3.1.8)
d ) If a bank is interested in the amount of money their clients have on their bank ac-
counts, then we choose Ω as the set of all clients of the bank, S = R+ and X assigns
each client the amount of money on his/her bank account.
e) We can study also more abstract sample spaces and properties. A possible question is
if a natural number is a prime number. We set in this case Ω = N, S = {0, 1} and
define X : N → S as
X(ω) := 1 if ω is a prime number, and X(ω) := 0 otherwise.    (3.1.9)
Definition 3.1.3. Let (Ω, P)be a discrete probability space. A discrete random vari-
able is a function
X : Ω → R,
ω → X(ω).
When we are working with random variables, we suggest that the reader keeps in mind that random variables are nothing other than properties. We also recommend keeping the illustration in Figure 3.2 in mind. Notice that an illustration like Figure 3.2 often needs much space. It is thus sometimes more convenient to write the values of a random variable directly next to the sample points, see Figure 3.3.
Figure 3.3: Alternative illustration of a random variable X.
The blue dots represent the sample points ω ∈ Ω. The values near the sample points are
the value of a random variable X of this sample point.
Random variables are important as they are the result of most real experiments. Also, the additional structure imposed by the number system, such as ordering, enables further development of the ideas in the previous chapter. There are some points we would like to emphasise about random variables:
• The name random variable can be misleading since a random variable is neither a variable nor random in any sense. As we have discussed above, a random variable is a property. Thus a random variable itself is not random, only the values a random variable takes are random. Mathematically, a random variable is only a function.
• We will always use capital letters for random variables like X or Y. Also, we typically write X instead of X(ω) since the dependence on ω is most of the time clear from the context.
• Every time the experiment is conducted, exactly one value of the random variable is observed; this is called a realisation of the random variable.
• We have used in Example 3.1.2 the name X for all random variables. This is of course not always a good idea and can lead to confusion. We will thus use from now on more intuitive names, for instance G for 'gender' or C for 'colour', but we will always use capital letters for random variables. However, we write X and Y for generic random variables.
Some further examples of random variables are
Example 3.1.4.
a) Let Ω = {1, 2, 3, 4, 5, 6} and P[ω] = 1/6 for all ω ∈ Ω. We thus have again the model for the throw of a fair die. We are interested here in the question if the thrown number is even or odd. A possibility to study this question is to introduce the random variable
X : Ω → R with X(ω) = 1 for ω ∈ {2, 4, 6} and X(ω) = 0 for ω ∈ {1, 3, 5}.    (3.1.10)
b) We draw purely randomly an integer between 1 and 16. In this case Ω = {1, . . . , 16} and P[ω] = 1/16 for all ω ∈ Ω. The following random variable describes the number of 1's in the decimal representation of the drawn number:
X(ω) = 1 for ω ∈ {1, 10, 12, 13, 14, 15, 16}, X(ω) = 2 for ω = 11, and X(ω) = 0 otherwise.    (3.1.11)
Random variables are (essentially) functions and we can perform mathematical operations with them, such as summing them, multiplying them by constants or squaring them. These operations then again give a random variable. As an example, let X and Y be random variables; then X + Y, aX (for a ∈ R) and X^2 are also random variables, defined pointwise, e.g. (X + Y)(ω) := X(ω) + Y(ω).
Example 3.1.5. An insurance company is interested in the body mass index of some clients. These clients are listed in Table 3.1.5. The body mass index itself is not listed in Table 3.1.5, but can be computed from the size and the weight of a client, see (3.1.8). Let us denote by S the size of a client (for instance S(client1) = 1.66 m), by W the weight of a client (for instance W(client1) = 60 kg) and
by BMI the body mass index of a client. If we now pick uniformly at random one of the 5 clients, then S, W and BMI are clearly random variables and the BMI of a client is given by BMI = W/S^2. We have for instance
BMI(client1) = W(client1)/S^2(client1) = 60/(1.66)^2 = 21.77.    (3.1.13)
Example 3.1.6.
a) Let Ω = {1, 2, 3, 4, 5, 6} and P[ω] = 1/6 for all ω ∈ Ω.
b) Suppose Ω = {(i, j); i, j = 1, . . . , 6} is the sample space resulting from rolling two dice and P[ω] = 1/36 for all ω ∈ Ω. In both cases we can define random variables on this space.
3.2 Probability mass functions
Definition 3.2.2. Let (Ω, P) be a (discrete) probability space and X a random variable on this space. We then define for x ∈ R
{X = x} := {ω ∈ Ω; X(ω) = x}.
Furthermore, we define
P[X = x] := P[{X = x}].
P[X = 3] = P[Y = 8] = 10/100 = 1/10.    (3.2.9)
The idea behind the events {X = x} is to group together all ω ∈ Ω where X has the same value. An illustration of this idea is given in Figure 3.5.
Figure 3.5: (a) The blue points represent the sample points ω ∈ Ω and the values near the points stand for the values of X at these points. (b) The random variable X has only the values 1, 2 and 3. Thus we split Ω into three regions, each region standing for one of the values 1, 2 and 3. Finally, we move each ω to the corresponding region.
The question at this point is if the Figures 3.4 and 3.5 are typical illustrations for the events {X = x}. Of course there are random variables with more than 3 values, but apart from this, one can ask:
• Can an ω ∈ Ω be contained in two different events {X = x1} and {X = x2}?
• Can there be an ω ∈ Ω which is not contained in any event {X = x}?
Figure 3.6: Impossible illustrations for a random variable X with values 1, 2 and 3: (a) an ω ∈ Ω in {X = 1} and in {X = 2}; (b) an ω ∈ Ω not in any {X = x}.
These two questions can be visualised graphically, see Figure 3.6. We thus can reformulate the two questions above as: Are the illustrations in Figure 3.6 possible for a random variable X and the events {X = x}? Intuitively, it is quite clear that the answer to both questions is no. For instance, let us denote by X the brand of a randomly selected car in the UK. Obviously, no car has more than one brand. Furthermore, each car has a brand.
Let us formulate the answer to both questions as a proposition for a general random variable.
Proposition 3.2.5. Let (Ω, P) be a (discrete) probability space and X a random variable on this space. We then have
{X = x1} ∩ {X = x2} = ∅ for all x1 ≠ x2 and Ω = ⋃_{x∈R} {X = x}.    (3.2.11)
Proposition 3.2.5 now shows that Ω is covered by the events {X = x} and thus each ω ∈ Ω is contained in some event {X = x}. Proposition 3.2.5 also shows that no ω can occur in more than one {X = x}. In other words, each ω ∈ Ω is contained in precisely one event {X = x}.
A natural question at this point is: Ω is just countable, so why do we use in (3.2.11) the real numbers R? The reason is that we have made no restrictions on the values the random variable X can take. Thus in principle X can take each value x ∈ R. However, {X = x} ≠ ∅ for only countably many x ∈ R. On the other hand, if one already knows that X : Ω → S where S ⊂ R is countable, then one can rewrite (3.2.11) as
Ω = ⋃_{x∈S} {X = x}.    (3.2.12)
Proof of Proposition 3.2.5. We have by definition
{X = x1 } ∩ {X = x2 } = {ω ∈ Ω; X(ω) = x1 } ∩ {ω ∈ Ω; X(ω) = x2 }
= {ω ∈ Ω; X(ω) = x1 and X(ω) = x2 }.
Since X is a function Ω → R, X(ω) can have only one value for each ω. Thus we can have either X(ω) = x1 or X(ω) = x2, but not both at the same time (if x1 ≠ x2). Thus {X = x1} ∩ {X = x2} is the empty set ∅. This proves the first equation.
We now come to the second equation. Since {X = x} ⊂ Ω for all x ∈ R, we have ⋃_{x∈R} {X = x} ⊂ Ω. On the other hand, if an ω ∈ Ω is given, then X(ω) is a real number x ∈ R. Thus this ω is contained in {X = x} for this x. This implies ω ∈ ⋃_{x∈R} {X = x}. This completes the proof.
Definition 3.2.6. Let (Ω, P) be a discrete probability space. The probability mass function of a discrete random variable X is defined by
pX(x) := P[X = x] for x ∈ R.
The probability mass function is a very important quantity of a random variable. One of the reasons is that in practical applications one can often find (a good approximation of) the probability mass function of a given random variable. On the other hand, it is much more difficult to find a suitable sample space in a practical application.
Remark 3.2.7. The terminology distribution is used in probability theory for several objects. For instance, it is also used for the probability density function of a continuous random variable. However, we consider here only discrete random variables and thus distribution is here synonymous with probability mass function.
Substituting all ω ∈ Ω into X, we see that X can have only the values 0, 1, 4 and 9. The probability mass function of X is thus
pX(0) = P[{2}] = 1/6, pX(1) = P[{1, 3}] = 2/6, pX(4) = P[{0, 4}] = 2/6, pX(9) = P[{5}] = 1/6
and pX(x) = 0 for x ∈ R \ {0, 1, 4, 9}.
An illustration is given in Figure 3.7.
Let us now make an important observation: if we sum all (non-zero) pX (x), we get
pX (0) + pX (1) + pX (4) + pX (9) = 1. (3.2.16)
Example 3.2.9. Suppose our sample space consists of the outcomes of throwing a fair die, and suppose we gamble on the outcome: we lose £1 if the outcome is 1, 2 or 3; win nothing if the outcome is 4; win £2 if the outcome is 5 or 6. We define X to be the random variable giving the profit. Explicitly, we have Ω = {1, 2, 3, 4, 5, 6}, P[ω] = 1/6 for all ω ∈ Ω and
X(1) = X(2) = X(3) = −1, X(4) = 0, X(5) = X(6) = 2.
We now get
{X = −1} = {1, 2, 3}, {X = 0} = {4}, {X = 2} = {5, 6} and {X = x} = ∅ for x ∉ {−1, 0, 2}.
The probability mass function of X is thus
pX(−1) = P[{1, 2, 3}] = 1/2, pX(0) = P[{4}] = 1/6, pX(2) = P[{5, 6}] = 1/3 and
pX(x) = 0 for x ∉ {−1, 0, 2}.
If we sum similarly to Example 3.2.8 all (non-zero) pX (x), we get
pX (−1) + pX (0) + pX (2) = 1. (3.2.17)
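The passage from (Ω, P, X) to the probability mass function is easy to implement. The following Python sketch computes pX for the profit X of Example 3.2.9 by grouping sample points with the same value:

    # Probability mass function of the profit X from Example 3.2.9.
    from fractions import Fraction

    P = {w: Fraction(1, 6) for w in range(1, 7)}
    X = {1: -1, 2: -1, 3: -1, 4: 0, 5: 2, 6: 2}

    pmf = {}
    for w, pw in P.items():
        pmf[X[w]] = pmf.get(X[w], Fraction(0)) + pw   # group omegas with equal value

    print(pmf)                  # {-1: 1/2, 0: 1/6, 2: 1/3}
    print(sum(pmf.values()))    # 1, as observed in (3.2.17)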
We now show that the observation made in Example 3.2.8 and 3.2.9 in the equations 3.2.16
and 3.2.17 is true in general.
Proposition 3.2.10. Let (Ω, P) be a probability space and X be a discrete random variable on this space. We then have
0 ≤ pX(x) ≤ 1 for all x ∈ R and ∑_{x∈R, pX(x)≠0} pX(x) = 1.
This proposition can be interpreted in words as: If we sum pX(x) over all possible values of X, then we get 1. Note that this is one of the few results which is true for discrete random variables only.
Remark 3.2.11. If we define S ⊂ R as S := {x ∈ R; pX(x) ≠ 0}, then the set S is a countable subset of R (see Definition 2.1.2) since X is discrete. We then can write
∑_{x∈R, pX(x)≠0} pX(x) = ∑_{x∈S} pX(x).    (3.2.21)
Proof of Proposition 3.2.10. We have pX(x) = P[{X = x}]. It follows from the definition of a probability space (Ω, P) that 0 ≤ P[{X = x}] ≤ 1. This implies the first point. For the second point, we use Proposition 3.2.5 and Lemma 2.1.17.a) and get
1 = P[Ω] = P[⋃_{x∈R} {X = x}] = P[⋃_{x∈R, {X=x}≠∅} {X = x}] = ∑_{x∈R, {X=x}≠∅} P[{X = x}] = ∑_{x∈R, pX(x)≠0} pX(x).
Notice that ⋃_{x∈R} {X = x} and ⋃_{x∈R, {X=x}≠∅} {X = x} are the same sets since we have only omitted empty sets. However, this step is important since we can use Lemma 2.1.17.a) only for countable unions of sets and thus cannot apply it to ⋃_{x∈R} {X = x} directly (since R is uncountable). This is the point where we need that X is discrete.
Customers n         0     1     2     3     4     5     6    total
Probability pN(n)   0.05  0.07  0.12  0.19  0.31  0.20  0.06   1
Table 3.1: Table for the distribution of N in Example 3.2.12
Example 3.2.12. A small branch of a bank considers the number of customers served per hour. We denote this number by N. A survey then leads to Table 3.1. We thus have for instance P[N = 3] = 0.19. Proposition 3.2.10 now states that the sum of the probabilities pN(k) = P[N = k] over all possible values of N (here k ∈ {0, 1, 2, 3, 4, 5, 6}) gives 1, i.e.
∑_{k=0}^{6} pN(k) = 1.
It is easy to check that this is indeed true. Notice that in principle one would also have had to include in this example P[N = 7], P[N = 8], . . . ; however, Table 3.1 indicates that P[N = k] = 0 for k ≥ 7 and we thus do not need to consider them.
It is clear that the underlying probability space (Ω, P) in Example 3.2.12 is quite complicated. However, it is enough to know Table 3.1 if we are only interested in how many customers are served in an hour. One thus can forget the underlying probability space for such questions. This is a general phenomenon. Almost all important information about a discrete random variable X is contained in the probability mass function. If one thus knows the probability mass function, one can forget the underlying probability space. Therefore the probability mass function abstracts the underlying probability space. However, there are different random variables which have the same probability mass function.
Example 3.2.13. We consider again the random variable X from Example 3.1.4.a). Furthermore, we consider also a second probability space (Ω, P) with Ω = {H, T}, P[H] = P[T] = 1/2 and a random variable Y with
Y(ω) = 1 for ω = H and Y(ω) = 0 for ω = T.    (3.2.23)
Then X and Y have the same distribution, i.e. pX(x) = pY(x) for all x ∈ R. Explicitly, we have
                      x = 0   x = 1   x ∉ {0, 1}
P[X = x] = P[Y = x]    0.5     0.5       0
3.3 Events involving several values
Days stayed x       4      5      6      7      8      9      10+    total
Probability pL(x)   0.038  0.114  0.430  0.300  0.080  0.030  0.008    1
Table 3.2: Table for the distribution of L in Example 3.3.1
Example 3.3.1. The length of stay in hospital after surgery is modelled as a random variable L. Table 3.2 gives the distribution of L. We thus know that the probability to stay 5 days in hospital is P[L = 5] = 0.114. On the other hand, what is the probability to stay at most a week in the hospital? The natural answer in this situation is
P[L ≤ 7] = P[L = 4] + P[L = 5] + P[L = 6] + P[L = 7]    (3.3.1)
         = 0.038 + 0.114 + 0.430 + 0.300 = 0.882.    (3.3.2)
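The following short Python check reproduces (3.3.1) from Table 3.2 (we encode the category '10+' as the single value 10, which does not matter for this computation):

    # Check of (3.3.1) from the distribution of L in Table 3.2.
    p_L = {4: 0.038, 5: 0.114, 6: 0.430, 7: 0.300, 8: 0.080, 9: 0.030, 10: 0.008}

    print(sum(p for x, p in p_L.items() if x <= 7))   # 0.882 (up to floating point)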
Let us now justify the answer in Example 3.3.1. First, we have considered in Example 3.3.1 the probability P[L ≤ 7] and thus the event {L ≤ 7}. Strictly speaking, we have not yet given the definition of such events. This definition is of course straightforward, but we state it here for completeness. Thus let X be a discrete random variable on some probability space (Ω, P). We then set
{X ∈ A} := {ω ∈ Ω; X(ω) ∈ A} for A ⊂ R and P[X ∈ A] := P[{X ∈ A}].
In particular, {X ≤ x} = {X ∈ (−∞, x]} and {X ≥ x} = {X ∈ [x, ∞)}.
Proposition 3.3.2. Let (Ω, P) be a discrete probability space and X be a discrete random variable on this space. We then have for all x ∈ R
P[X ≤ x] = ∑_{y≤x, pX(y)≠0} pX(y) and P[X ≥ x] = ∑_{y≥x, pX(y)≠0} pX(y).    (3.3.6)
Proof. The identity {X ∈ A} = ⋃_{x∈A} {X = x} follows immediately from the definition of {X ∈ A}. For the identity
P[X ∈ A] = ∑_{x∈A, pX(x)≠0} pX(x),
we can use precisely the same argument as in the proof of Proposition 3.2.10 and we thus omit it. The equation (3.3.6) then follows with A = (−∞, x] and A = [x, ∞) respectively.
Applying this proposition to P[L ≤ 7] then immediately gives (3.3.1) since P[L = k] ≠ 0 only for k = 4, 5, 6, 7 in the considered interval.
Example 3.3.3. We consider again the situation in Example 3.3.1. What is the probability of being in hospital for at most 6 days, between 5 and 7 days, and at least 7 days? We have
at most 6 days: P[L ≤ 6] = P[L = 4] + P[L = 5] + P[L = 6] = 0.582.
3.4 Expectation
We will now consider the mean of a random variable. More precisely, we are interested in what happens if we repeat the same experiment very often and then consider the mean over all realisations of X.
Example 3.4.1. Suppose we have a random experiment with Ω = {a, b, c}, P[a] = P[b] = P[c] = 1/3 and a random variable X with X(a) = 1, X(b) = 3 and X(c) = −5. We now repeat this experiment 10 times. A possible realisation is
a, a, c, b, a, c, b, b, c, b.    (3.4.1)
Example 3.4.2. We play a betting game. For this game, we have to toss a fair coin. If we toss 'Head', we win £1 and if we toss 'Tail', we have to pay £1. How much do we win (or lose) on average per game if we play this game 1000 times? If our interpretation of probability is correct, then we toss around 500 times 'Head' and around 500 times 'Tail'. Thus our gain on average per game should be around
(500 · 1 + 500 · (−1))/1000 = 0.
Notice that the right hand side of (3.4.4) is independent of n. This observation is the motivation for the following definition.
Definition 3.4.4. Let (Ω, P) be a discrete probability space and X a random variable on this space. The expectation of X is defined as
E[X] = ∑_{ω∈Ω} X(ω) · P[ω]    (3.4.5)
if we have
E[|X|] := ∑_{ω∈Ω} |X(ω)| · P[ω] < ∞.    (3.4.6)
If (3.4.6) is fulfilled, we say that the expectation of X exists and write E [|X|] < ∞.
We will see in Chapter 3.8 that the expectation describes indeed the mean of a random
variable very well if we repeat the corresponding experiment very often.
Example 3.4.5. Let us compute the expectation for the throw of a fair die. In this situation, we have
Ω = {1, 2, 3, 4, 5, 6}, P[ω] = 1/6 and X(ω) = ω.    (3.4.7)
The expectation of X is then
E[X] = 1 · 1/6 + 2 · 1/6 + . . . + 6 · 1/6 = 21/6 = 7/2.
Before we come to two important computation rules, we would like to say some words about the condition in (3.4.6). This condition is required so that the expectation of a random variable is well defined and the sum in (3.4.5) is independent of the order of the summands. Let us give an example.
Example 3.4.8. We consider the geometric distribution on N with parameter p = 1/2. This means Ω = N and P[k] = 2^{−k} for k ∈ N. We define the random variables
It is thus obvious that the expectation of both random variables is not well defined. Of
course, the condition (3.4.6) is not fulfilled for X and Y .
Example 3.4.8 shows that (3.4.6) is important. One thus has to verify (3.4.6) each time one computes the expectation of a random variable. However, (3.4.6) is automatically fulfilled if Ω is finite. Since most of our sample spaces are finite or (3.4.6) is obviously fulfilled, we will not dwell on (3.4.6) in our computations (except where necessary).
Next we prove two important computation rules for the expectation.
Theorem 3.4.9. Let X be a random variable with distribution pX(x) and with E[|X|] < ∞. We then have
E[X] = ∑_{x∈R, pX(x)≠0} x · pX(x) = ∑_{x∈R, pX(x)≠0} x · P[X = x],    (3.4.18)
and for each function f : R → R with E[|f(X)|] < ∞
E[f(X)] = ∑_{x∈R, pX(x)≠0} f(x) · pX(x).    (3.4.19)
The main idea of Theorem 3.4.9 is similar to the idea of the probability mass function, see Figure 3.5. Instead of summing over all ω ∈ Ω separately, we first group together all ω where X has the same value and then sum over all possible values of X.
Proof of Theorem 3.4.9. By assumption, we have ∑_{ω∈Ω} |X(ω)| P[ω] < +∞. We are thus allowed to reorder the summands in the sum ∑_{ω∈Ω} X(ω) P[ω] without changing the value of the sum. We now split the sum over ω ∈ Ω and combine all ω where X has the same value. In formulas this gives
E[X] = ∑_{ω∈Ω} X(ω) P[ω] = ∑_{x∈R} ∑_{ω: X(ω)=x} X(ω) · P[ω] = ∑_{x∈R} x · P[X = x] = ∑_{x∈R} x · pX(x).
This proves (3.4.18). The proof of (3.4.19) is almost identical and we thus omit it.
Example 3.4.10. We now apply Theorem 3.4.9 to the random variable in Example 3.2.8
and compare the result with Example 3.4.7. We have
E[X] = 0 · P[X = 0] + 1 · P[X = 1] + 4 · P[X = 4] + 9 · P[X = 9]
     = 0 · 1/6 + 1 · 2/6 + 4 · 2/6 + 9 · 1/6 = 19/6.    (3.4.20)
This value agrees of course with the value in Example 3.4.7.
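The following Python sketch computes E[X] in both ways, assuming (consistently with the probability mass function of Example 3.2.8) that X(ω) = (ω − 2)^2 on Ω = {0, 1, 2, 3, 4, 5}:

    # E[X] via the definition (3.4.5) and via Theorem 3.4.9.
    from fractions import Fraction

    P = {w: Fraction(1, 6) for w in range(6)}     # Omega = {0,...,5}
    X = lambda w: (w - 2) ** 2                    # values 0, 1, 4, 9

    via_omega = sum(X(w) * P[w] for w in P)       # sum over sample points
    pmf = {}
    for w in P:
        pmf[X(w)] = pmf.get(X(w), Fraction(0)) + P[w]
    via_values = sum(x * p for x, p in pmf.items())   # sum over values of X

    print(via_omega, via_values)   # both 19/6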
Example 3.4.11. We consider again Example 3.1.4.b). We have in this situation Ω = {1, . . . , 16}, P[{ω}] = 1/16 and
X(ω) = 1 for ω ∈ {1, 10, 12, 13, 14, 15, 16}, X(ω) = 2 for ω = 11, and X(ω) = 0 otherwise.    (3.4.21)
The distribution of X is then
P[X = 0] = P[{2, 3, 4, 5, 6, 7, 8, 9}] = 8/16,    (3.4.22)
P[X = 1] = P[{1, 10, 12, 13, 14, 15, 16}] = 7/16,    (3.4.23)
P[X = 2] = P[{11}] = 1/16.    (3.4.24)
We thus get
E[X] = 0 · 8/16 + 1 · 7/16 + 2 · 1/16 = 9/16,    (3.4.25)
E[X^2] = 0^2 · 8/16 + 1^2 · 7/16 + 2^2 · 1/16 = 11/16.    (3.4.26)
We can now prove a further important property of the expectation: it is linear in X.
Theorem 3.4.12 (Linearity of the expectation). Let a, b ∈ R and X and Y be random variables on the same probability space (Ω, P) with E[|X|] < ∞ and E[|Y|] < ∞. Then
E[aX + bY] = aE[X] + bE[Y].
Proof. We have
E[aX + bY] = ∑_{ω∈Ω} (aX(ω) + bY(ω)) P[ω] = a ∑_{ω∈Ω} X(ω) P[ω] + b ∑_{ω∈Ω} Y(ω) P[ω] = aE[X] + bE[Y].
We have stated Theorem 3.4.12 for two random variables, but Theorem 3.4.12 is also valid for more than two random variables. We omit the proof of this statement as it is almost the same as for Theorem 3.4.12.
Let us now make a useful observation. Each a ∈ R can be interpreted as a random variable on a probability space (Ω, P). For this, we identify a ∈ R with the constant function
a : Ω → R, ω ↦ a,    (3.4.28)
i.e. the function that maps all sample points ω to a ∈ R. An illustration can be found in Figure 3.8. We immediately get with the definition of the expectation that E[a] = a for all a ∈ R. Combining this with Theorem 3.4.12 gives
Figure 3.8: Illustration of the constant random variable.
Corollary 3.4.13. Let a ∈ R and X and Y be random variables on the same probability space (Ω, P) with E[|X|] < ∞ and E[|Y|] < ∞. Then
a) E[X + Y] = E[X] + E[Y],
b) E[aX] = aE[X],
c) E[X + a] = E[X] + a,
d) E[E[X]] = E[X].
All points in this corollary follow immediately from Theorem 3.4.12. Only point d) requires some explanation, as it is somewhat confusing at first glance. Here one has to observe that E[X] is a real number. Thus d) follows immediately from the observations between Theorem 3.4.12 and Corollary 3.4.13.
Example 3.4.14. Let Ω = {1, 2, 3, 4, 5, 6} and P[ω] = 1/6 for all ω ∈ Ω. We define
X(ω) := 1 if ω is divisible by 2 and X(ω) := 0 otherwise,    (3.4.30)
and
Y(ω) := ω if ω is divisible by 3 and Y(ω) := 0 otherwise.    (3.4.31)
We now compute E[2X + Y] directly and then with the help of Theorems 3.4.9 and 3.4.12.
The direct computation gives

    E[2X + Y] = (0+0) · 1/6 + (2+0) · 1/6 + (0+3) · 1/6 + (2+0) · 1/6 + (0+0) · 1/6 + (2+6) · 1/6
              = 15/6 = 5/2.   (3.4.32)
Next we use Theorems 3.4.9 and 3.4.12. We have

    P[X = 0] = P[X = 1] = 1/2  and  P[Y = 0] = 4/6, P[Y = 3] = P[Y = 6] = 1/6.   (3.4.33)

Theorem 3.4.9 therefore implies

    E[X] = 0 · 1/2 + 1 · 1/2 = 1/2  and  E[Y] = 0 · 4/6 + 3 · 1/6 + 6 · 1/6 = 3/2.   (3.4.34)

We now get with Theorem 3.4.12

    E[2X + Y] = 2 E[X] + E[Y] = 2 · 1/2 + 3/2 = 5/2.   (3.4.35)
The results of both computations agree of course.
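Both routes can also be carried out mechanically. The following Python snippet is our own
illustrative sketch (it is not part of the original notes): it computes E[2X + Y] from
Example 3.4.14 once directly over Ω and once via the probability mass function, as in
Theorem 3.4.9.

    # Sketch: E[2X + Y] from Example 3.4.14, computed in two ways.
    from fractions import Fraction
    from collections import defaultdict

    omega = range(1, 7)                        # sample space {1, ..., 6}
    P = {w: Fraction(1, 6) for w in omega}     # uniform probability measure

    X = lambda w: 1 if w % 2 == 0 else 0
    Y = lambda w: w if w % 3 == 0 else 0
    Z = lambda w: 2 * X(w) + Y(w)              # the random variable 2X + Y

    # Direct computation: sum Z(w) * P[w] over all sample points.
    direct = sum(Z(w) * P[w] for w in omega)

    # Via the pmf: group sample points on which Z takes the same value ...
    pmf = defaultdict(Fraction)
    for w in omega:
        pmf[Z(w)] += P[w]
    # ... and then sum x * p_Z(x) over the values, as in (3.4.18).
    via_pmf = sum(x * p for x, p in pmf.items())

    print(direct, via_pmf)                     # both print 5/2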
3.5 Variance

The expectation of a random variable X can be interpreted as its mean or average
behaviour. How well E[X] describes X varies from case to case. For instance, consider
the random variables X, Y and Z with X ≡ 0, P[Y = +1] = P[Y = −1] = 1/2 and
P[Z = 1000] = P[Z = −1000] = 1/2. We clearly have E[X] = E[Y] = E[Z] = 0. For
X, this value is identical with the value of X itself and thus describes X very well. On
the other hand, for Z the difference between Z and E[Z] is always 1000. It is thus
natural to introduce a quantity that measures how much a random variable X differs
on average from its expectation E[X]. The naive approach E[X − E[X]] gives

    E[X − E[X]] = E[X] − E[E[X]] = E[X] − E[X] = 0

and is thus fruitless for all X. It has been agreed (for many reasons) to use the expectation
of the square of X − E[X]:
Definition 3.5.1. Let (Ω, P) be a (discrete) probability space and X a random variable
on this space with E[|X|] < ∞. The variance of X is defined as

    V(X) = E[(X − E[X])²].   (3.5.1)

We have explicitly

    V(X) = Σ_{ω∈Ω} (X(ω) − E[X])² · P[ω]   (3.5.2)
         = Σ_{x∈R: P[X=x]≠0} (x − E[X])² · P[X = x].   (3.5.3)
We have used for (3.5.3) Theorem 3.4.9 with f (x) = (x − E [X])2 .
Remark 3.5.2. Apart from the variance, one often also uses the standard deviation σ
(spoken 'sigma') of a random variable X. The standard deviation is defined as
σ(X) := √V(X). However, we will use here only the variance.
Example 3.5.3. Let Ω = {0, 1} and P[0] = P[1] = 1/2. We consider the random variables
X, Y and Z with

    X(ω) = 0,  Y(0) = −1, Y(1) = 1,  Z(0) = −1000, Z(1) = 1000.   (3.5.4)

These X, Y and Z agree with the random variables at the beginning of this section. We
have E[X] = E[Y] = E[Z] = 0 and

    V(X) = 0,   (3.5.5)
    V(Y) = E[Y²] = Y(0)² · 1/2 + Y(1)² · 1/2 = 1,   (3.5.6)
    V(Z) = E[Z²] = Z(0)² · 1/2 + Z(1)² · 1/2 = 10⁶.   (3.5.7)
Example 3.5.4. We consider again Example 3.1.4.b). We have in this situation Ω =
{1, . . . , 16}, P[{ω}] = 1/16 and

    X(ω) = 1 for ω ∈ {1, 10, 12, 13, 14, 15, 16},
           2 for ω = 11,   (3.5.8)
           0 otherwise.

The distribution of X is

    P[X = 0] = 8/16,  P[X = 1] = 7/16,  P[X = 2] = 1/16.   (3.5.9)

We have computed in Example 3.4.11 that E[X] = 9/16. We use here (3.5.3) and get

    V(X) = (0 − 9/16)² · 8/16 + (1 − 9/16)² · 7/16 + (2 − 9/16)² · 1/16 = 95/256.
Theorem 3.5.5 (Shifting theorem). Let (Ω, P) be a (discrete) probability space and
X a random variable on this space with E[|X|] < ∞. We then have

    V(X) = E[X²] − (E[X])².   (3.5.10)

Proof. We have

    V(X) = E[(X − E[X])²] = E[X² − 2X E[X] + (E[X])²]
         = E[X²] − 2 E[X] E[X] + (E[X])² = E[X²] − (E[X])²,

where we used the linearity of the expectation (Theorem 3.4.12) and that E[X] is a real
number.
Example 3.5.6. Let us compute the variance for the throw of a fair die. In this situation,
we have

    Ω = {1, 2, 3, 4, 5, 6},  P[ω] = 1/6  and  X(ω) = ω.   (3.5.11)

We know from Example 3.4.5 that E[X] = 21/6 = 7/2. Similarly, we get

    E[X²] = 1² · 1/6 + 2² · 1/6 + . . . + 6² · 1/6 = 91/6.   (3.5.12)

The variance of X is then

    V(X) = E[X²] − (E[X])² = 91/6 − (7/2)² = 91/6 − 49/4 = 35/12.   (3.5.13)
Let us now give with the formula (3.5.2) a heuristic interpretation of the variance. Roughly
speaking, a small variance means that a typical value of X is close to E[X], while a large
variance means that a typical value of X is not so close to E[X]. This is of course not
a rigorous mathematical statement. However, let us try to see why it makes sense.
Notice that each summand in (3.5.2) is non-negative and thus

    'V(X) small'  =⇒  '(X(ω) − E[X])² · P[ω] is small for all ω ∈ Ω'.   (3.5.14)

Suppose now that 'X(ω) is far away from E[X]' and therefore '(X(ω) − E[X])² is large'.
Since '(X(ω) − E[X])² · P[ω] is small', we must have that 'P[ω] is very small' and thus
it is very unlikely that this ω occurs. The question at this point is whether we can make
this statement more rigorous. More precisely, can we say anything about the probability
that the distance between X and E[X] is greater than a given number a > 0? In formulas,
can we say anything about

    P[|X − E[X]| > a]?   (3.5.15)
An illustration is given in Figure 3.9.

Figure 3.9: Illustration of X ∉ (E[X] − a, E[X] + a).

We can clearly not expect to get an explicit expression for P[|X − E[X]| > a] just from
the variance. However, we can give an upper bound. More precisely, we get with the
Chebyschev inequality

    P[|X − E[X]| ≥ a] ≤ V(X)/a²  for all a > 0.   (3.5.17)
We will postpone the precise formulation of the Chebyschev inequality to Section 3.8, see
Theorem 3.8.5, and give instead an example.
Example 3.5.7. A factory produces screws which should have a width of 5mm. Since the
production process is not perfect, there are always minor differences in the width of the
produced screws. Suppose that the width can be modelled by a random variable W with
E[W] = 5mm and V(W) = 0.005mm². What is the probability that a randomly chosen
screw has a width between 4.9mm and 5.1mm?
The given information is not enough to compute the desired probability exactly. However,
we can give a lower bound with the Chebyschev inequality. We have

    P[4.9 < W < 5.1] = P[−0.1 < W − 5 < 0.1] = P[|W − 5| < 0.1]
                     = P[|W − E[W]| < 0.1] = 1 − P[|W − E[W]| ≥ 0.1]
                     ≥ 1 − V(W)/0.1² = 1 − 0.005/0.01 = 1/2.

We thus see that at least half of the screws have a width between 4.9mm and 5.1mm.
Note that this does not imply that P[4.9 ≤ W ≤ 5.1] is around 0.5. It is indeed possible
that P[4.9 ≤ W ≤ 5.1] is close to 1.
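To get a feeling for how conservative the bound is, here is a small sketch of our own; the
two distributions below are assumptions chosen by us (the example does not specify a
distribution), both with E[W] = 5 and V(W) = 0.005.

    # Sketch: Chebyschev's lower bound 1/2 versus the exact probability
    # for two assumed distributions with mean 5 and variance 0.005.
    import math

    dist_a = {4.9: 0.25, 5.0: 0.50, 5.1: 0.25}   # bound attained exactly
    c = math.sqrt(0.005)                          # c ≈ 0.0707
    dist_b = {5 - c: 0.5, 5 + c: 0.5}             # true probability is 1

    for dist in (dist_a, dist_b):
        mean = sum(w * p for w, p in dist.items())
        var = sum((w - mean) ** 2 * p for w, p in dist.items())
        exact = sum(p for w, p in dist.items() if 4.9 < w < 5.1)
        print(mean, round(var, 4), exact)         # compare exact with 1/2

For dist_a the bound 1/2 is attained exactly; for dist_b the true probability is 1, so the
Chebyschev bound can be very pessimistic.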
Exercise 3.5.8. Give an example of a random variable W with E[W] = 5 and
V(W) = 0.005 such that P[4.9 ≤ W ≤ 5.1] is close to 1.
Example 3.5.9. A school holds an exam at the end of each term. The exam traditionally
contains 10 questions, each worth 10 points. Let us denote by T the total points achieved
by a randomly chosen student. The experience of several years shows that E[T] = 40 and
V(T) = 200. How likely is it that a student scores more than 90 points?
As in Example 3.5.7, we cannot compute this probability explicitly, but we can give an
upper bound. We have

    P[T ≥ 90] = P[T − E[T] ≥ 90 − E[T]] = P[T − E[T] ≥ 50]
              ≤ P[|T − E[T]| ≥ 50] ≤ V(T)/50² = 200/2500 = 0.08.   (3.5.18)

We therefore have P[T ≥ 90] ≤ 0.08 and thus P[T ≥ 90] cannot be very high.
Example 3.5.10. A university is interested in the daily study time of its students. For
this, let us denote by D the daily study time of a randomly picked student. A survey
shows that E[D] = 4 hours and V(D) = 2. How likely is it that a random student studies
more than 5 hours or less than 3 hours per day?
As in Examples 3.5.7 and 3.5.9, we cannot compute this probability explicitly. We use
again (3.5.17) to give an upper bound:

    P[D ≥ 5 or D ≤ 3] = P[D − E[D] ≥ 1 or D − E[D] ≤ −1] = P[|D − E[D]| ≥ 1]
                      ≤ V(D)/1² = 2/1 = 2.   (3.5.19)

We therefore get with (3.5.17) that P[D ≥ 5 or D ≤ 3] ≤ 2. Obviously, this inequality is
useless, as we already know from the definition of a probability that P[D ≥ 5 or D ≤ 3] ≤ 1.
We thus see that the Chebyschev inequality does not always give useful bounds.
At this point, one might ask: why do we use the Chebyschev inequality to get bounds
instead of computing P[|X − E[X]| > a] directly? The reason is that there are many
situations where we can (easily) compute the variance V(X), while it is very difficult
to compute the probability P[|X − E[X]| > a] explicitly. We will see this for instance in
Section 3.8.1.
As the Chebyschev inequality indicates, it is important to be able to compute the variance
of a random variable efficiently. We thus look at how to compute the variances V(aX) and
V(X + Y). For this, we require the covariance of two random variables.
Definition 3.5.11. Let (Ω, P) be a (discrete) probability space and X and Y be
random variables on this space with E[X²] < ∞ and E[Y²] < ∞. The covariance of
X and Y is defined as

    Cov(X, Y) := E[(X − E[X])(Y − E[Y])] = E[X · Y] − E[X] E[Y].

If Cov(X, Y) = 0, then X and Y are called uncorrelated.

Notice that the identity E[(X − E[X])(Y − E[Y])] = E[X · Y] − E[X] E[Y] follows with a
computation similar to the proof of Theorem 3.5.5. Furthermore, the assumptions E[X²] <
∞ and E[Y²] < ∞ in Definition 3.5.11 imply that Cov(X, Y) is well defined. This follows
from the Cauchy–Schwarz inequality for random variables,

    E[|X · Y|] ≤ √(E[X²] E[Y²]).

Example 3.5.12. We toss a coin twice, where each throw independently shows 'Head'
with probability p. We set Xj = 1 if the j-th throw shows 'Head' and Xj = 0 otherwise,
and we compute Cov(X1, X2).
We first compute the expectation and variance of X1 and X2. A simple computation shows
that P[X1 = 1] = P[X2 = 1] = p and P[X1 = 0] = P[X2 = 0] = 1 − p. Random variables
with this distribution are called Bernoulli distributed with parameter p, see also Section 3.6.
Thus

    E[X1] = E[X2] = p  and  V(X1) = V(X2) = E[Xj²] − p² = p − p² = p(1 − p).

Furthermore, we have

    E[X1 · X2] = 1 · P[X1 = 1, X2 = 1] = p².

This then gives

    Cov(X1, X2) = E[X1 X2] − E[X1] E[X2] = p² − p · p = 0,

i.e. X1 and X2 are uncorrelated.
Theorem 3.5.13. Let (Ω, P) be a (discrete) probability space, a, b ∈ R and X and Y
be random variables on this space with E[X²] < ∞ and E[Y²] < ∞. We then have
a) V(X) = Cov(X, X),
b) V(X + a) = V(X) and Cov(X + a, Y + b) = Cov(X, Y),
c) V(aX) = a² V(X) and Cov(aX, bY) = ab Cov(X, Y),
d) V(X + Y) = V(X) + V(Y) + 2 Cov(X, Y),
e) if X and Y are uncorrelated, then V(X + Y) = V(X) + V(Y).

Proof. Point a) follows immediately from the definitions of the variance and the covariance
(see Definition 3.5.1 and Definition 3.5.11).
For point b), note that we have E[X − a] = E[X] − a (see Corollary 3.4.13) and thus

    V(X − a) = E[((X − a) − E[X − a])²] = E[(X − E[X])²] = V(X).

The argument for the covariance is similar and we thus omit it.
For c), we use the shifting theorem and get

    V(aX) = E[(aX)²] − (E[aX])² = a² E[X²] − a² (E[X])² = a² V(X).

The argument for the covariance is again similar and we thus omit it.
For the proof of d), we define X′ := X − E[X] and Y′ := Y − E[Y]. It follows from b)
that it is enough to prove d) for X′ and Y′. The advantage of X′ and Y′ compared to
X and Y is that E[X′] = E[Y′] = 0, so the variance and covariance are easier to compute.
Explicitly, we have

    V(X′ + Y′) = E[(X′ + Y′)²] = E[X′²] + E[Y′²] + 2 E[X′Y′]
               = V(X′) + V(Y′) + 2 Cov(X′, Y′).   (3.5.31)

Point e) follows directly from d).
Corollary 3.5.14. Let (Ω, P) be a (discrete) probability space and X1, . . . , Xn be
random variables on this space with E[Xj²] < ∞ for all 1 ≤ j ≤ n.
a) We have

    V(Σ_{i=1}^n Xi) = Σ_{j=1}^n V(Xj) + Σ_{i≠j} Cov(Xi, Xj)   (3.5.32)
                    = Σ_{j=1}^n V(Xj) + 2 Σ_{i<j} Cov(Xi, Xj).   (3.5.33)

b) If X1, . . . , Xn are pairwise uncorrelated, then V(Σ_{j=1}^n Xj) = Σ_{j=1}^n V(Xj).

The proof of Corollary 3.5.14 is almost the same as the proof of points d) and e) in
Theorem 3.5.13 and we thus omit it.
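As a small illustration (our own sketch, not from the notes), point d) of Theorem 3.5.13
can be checked exactly for the random variables X and Y of Example 3.4.14:

    # Sketch: verify V(X + Y) = V(X) + V(Y) + 2 Cov(X, Y) for Example 3.4.14.
    from fractions import Fraction

    omega = range(1, 7)
    p = Fraction(1, 6)

    X = {w: (1 if w % 2 == 0 else 0) for w in omega}
    Y = {w: (w if w % 3 == 0 else 0) for w in omega}

    def E(f):                                  # expectation over (Omega, P)
        return sum(f[w] * p for w in omega)

    def V(f):                                  # variance via Theorem 3.5.5
        return E({w: f[w] ** 2 for w in omega}) - E(f) ** 2

    cov = E({w: X[w] * Y[w] for w in omega}) - E(X) * E(Y)
    S = {w: X[w] + Y[w] for w in omega}
    print(V(S), V(X) + V(Y) + 2 * cov)         # the two values agree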
3.6 Important distributions of random variables
We consider in this section some important distributions of random variables. We have
summarized their probability mass function, expectation and variance in Table 3.3.
Name       | Short name | Parameter         | Obtained values        | pX(x)                             | E[X] | V(X)
Degenerate | –          | c ∈ R             | {c}                    | pX(x) = 1 if x = c, 0 otherwise   | c    | 0
Bernoulli  | Ber(p)     | 0 ≤ p ≤ 1         | {0, 1}                 | pX(1) = p, pX(0) = 1 − p          | p    | p(1 − p)
Binomial   | B(n, p)    | 0 ≤ p ≤ 1, n ∈ N  | {0, 1, 2, . . . , n}   | pX(k) = C(n, k) pᵏ (1 − p)ⁿ⁻ᵏ     | np   | np(1 − p)
Geometric  | Geo(p)     | 0 < p ≤ 1         | N = {1, 2, 3, . . . }  | pX(k) = p (1 − p)ᵏ⁻¹              | 1/p  | (1 − p)/p²
Poisson    | Poi(λ)     | λ ∈ R₊            | N₀ = {0, 1, 2, . . . } | pX(k) = e⁻λ λᵏ / k!               | λ    | λ

Table 3.3: Important distributions of discrete random variables; C(n, k) = n!/(k!(n − k)!)
denotes the binomial coefficient.

3.6.1 Degenerate distribution

Let c ∈ R. A random variable X is called degenerate with parameter c if P[X = c] = 1.
In this case E[X] = c and E[X²] = c², and thus

    V(X) = E[X²] − (E[X])² = c² − c² = 0.   (3.6.2)

3.6.2 Bernoulli distribution

Let 0 ≤ p ≤ 1. A random variable X is called Bernoulli distributed with parameter p
(short X ∼ Ber(p)), if X takes only the values 0 and 1 and

    P[X = 1] = p,  P[X = 0] = 1 − p.   (3.6.3)
A Bernoulli random variable X can be interpreted as the throw of one coin, where '1' is
interpreted as 'Head' and '0' as 'Tail'. In this situation p = P[X = 1] is the probability
of tossing 'Head'. In general, a Bernoulli random variable can be viewed as a 'success or
fail' experiment or as a '1 or 0' experiment. Some further examples are
  I win or lose the next game,
  the next person smokes or doesn't smoke,
  the next baby is a boy or a girl.
The expectation of a Bernoulli distributed random variable is

    E[X] = 0 · (1 − p) + 1 · p = p.   (3.6.4)

The variance has already been computed in Example 3.5.12 and is

    V(X) = p(1 − p).   (3.6.5)
3.6.3 Binomial distribution

Let 0 ≤ p ≤ 1 and n ∈ N. A random variable X is called binomial distributed with
parameters n and p (short X ∼ B(n, p)), if X takes values in {0, 1, . . . , n} and

    P[X = k] = C(n, k) pᵏ (1 − p)ⁿ⁻ᵏ  for 0 ≤ k ≤ n.   (3.6.6)

A binomial random variable counts the number of 'successes' in n independent 'success
or fail' experiments, each with success probability p. Examples are
  the number of 'Heads' when one tosses a coin n times,
  the number of games I win when I play a game n times.
It remains to show that (3.6.6) indeed defines a probability distribution on {0, . . . , n}.
We use the binomial theorem (a + b)ⁿ = Σ_{k=0}^n C(n, k) aᵏ bⁿ⁻ᵏ and get with a = p,
b = 1 − p

    Σ_{k=0}^n P[X = k] = Σ_{k=0}^n C(n, k) pᵏ (1 − p)ⁿ⁻ᵏ = (p + (1 − p))ⁿ = 1.   (3.6.10)
Next we compute the expectation and the variance. As before, we use Theorem 3.4.9
and get

    E[X] = Σ_{k=0}^n k · P[X = k] = Σ_{k=0}^n k C(n, k) pᵏ (1 − p)ⁿ⁻ᵏ.   (3.6.11)

It is possible to compute this expression directly, but it is easier to use Theorem 3.4.12
instead. We use the observation that X is the number of 'Heads' when one tosses a
coin n times. We thus can write

    X = X1 + · · · + Xn   (3.6.12)

with Xj = 1 if the j-th throw shows 'Head' and Xj = 0 otherwise. Obviously, each Xj
is Ber(p) distributed. We thus get with Theorem 3.4.12

    E[X] = E[X1] + · · · + E[Xn] = n · p.   (3.6.13)
We argue similarly for the computation of the variance V(X) and use (3.6.12). We thus
have to compute

    V(X) = V(X1 + · · · + Xn).

We now would like to apply Corollary 3.5.14. For this, we have to compute the covariance
Cov(Xi, Xj) for i ≠ j and 1 ≤ i, j ≤ n. We have computed this in the case n = 2 in
Example 3.5.12 and have shown that the random variables X1 and X2 are uncorrelated.
This result is indeed true for general n. In other words, the random variables X1, . . . ,
Xn are all pairwise uncorrelated. One can show this for instance as in Example 3.5.12.
However, we will show in Chapter 3.7 that the random variables X1, . . . , Xn are
independent and thus automatically uncorrelated (see Example 3.7.15 and Theorem 3.7.17).
Since all Xj are Bernoulli distributed, it follows from Corollary 3.5.14 that

    V(X) = V(X1 + · · · + Xn) = Σ_{j=1}^n V(Xj) = n · p(1 − p).   (3.6.15)
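The decomposition (3.6.12) is also convenient in code. The following sketch (ours, not
from the notes) simulates B(n, p) as a sum of n independent Bernoulli variables and
compares the empirical mean and variance with np and np(1 − p).

    # Sketch: simulate X ~ B(n, p) as a sum of n Bernoulli(p) variables.
    import random

    n, p, runs = 20, 0.3, 100_000
    samples = []
    for _ in range(runs):
        x = sum(1 for _ in range(n) if random.random() < p)  # X = X1+...+Xn
        samples.append(x)

    mean = sum(samples) / runs
    var = sum((x - mean) ** 2 for x in samples) / runs
    print(mean, n * p)              # both close to 6.0
    print(var, n * p * (1 - p))     # both close to 4.2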
3.6.4 Geometric distribution
Let 0 < p ≤ 1 be given. A random variable X is called geometric distributed (short
X ∼ Geo(p)), if X takes only values in N = {1, 2, 3, . . . } and

    P[X = k] = p · (1 − p)ᵏ⁻¹,  k ∈ N.   (3.6.16)

Let us now describe the model behind this distribution. Suppose we have a coin which
shows 'Head' with probability p when it is tossed. We toss this coin until 'Head' occurs
for the first time. We assume at this point that each throw is independent of the others.
This leads to the tree diagram in Figure 3.10. We now define X to be the number of
tosses needed until 'Head' occurs for the first time. We then have

    P[X = 1] = p,  P[X = 2] = (1 − p) · p,  P[X = 3] = (1 − p)² · p,  . . . .   (3.6.18)

We thus see that X has the distribution (3.6.16). Apart from tossing a coin until the first
'Head' occurs, the geometric distribution Geo(p) also arises as the number of 'success
or fail' experiments needed until the first 'success', when the experiments are performed
independently and each experiment has success probability p. Further examples are
  the number of games needed until I win a game for the first time,
Remark 3.6.1. Notice that there is also a shifted version of the geometric distribution,
which is defined on N₀ = {0, 1, 2, 3, . . . } as

    P[X = k] = p · (1 − p)ᵏ,  k ∈ N₀.   (3.6.19)

This corresponds to counting only the number of 'Tails' before the first 'Head' occurs.
Unfortunately, the term 'geometric distribution' is used in the literature for both versions,
so one always has to check which one is meant. We use in this course only (3.6.16) and
thus the term 'geometric distribution' always refers to (3.6.16).
It remains to show that (3.6.16) indeed gives a probability distribution. If p = 1, then
P[X = 1] = 1 and P[X = k] = 0 for all k ≥ 2. We thus get

    Σ_{k=1}^∞ P[X = k] = 1 + 0 + 0 + . . . = 1.   (3.6.21)

For 0 < p < 1, we use the sum formula of the geometric series,

    Σ_{k=0}^∞ qᵏ = 1/(1 − q)  (for |q| < 1),

and get

    Σ_{k=1}^∞ P[X = k] = p Σ_{k=1}^∞ (1 − p)ᵏ⁻¹ = p · 1/(1 − (1 − p)) = 1.

A computation based on differentiating the geometric series further shows that

    E[X] = Σ_{k=1}^∞ k p (1 − p)ᵏ⁻¹ = 1/p.   (3.6.24)
We thus see that we have to toss the coin on average 1/p times until we get the first 'Head'.
For a fair coin, this is two times, since p = 1/2. The formula is also correct in the case
p = 0 if we interpret 1/0 as ∞. Indeed, p = 0 corresponds to 'only Tail occurs' and we thus
would wait infinitely long. We have of course excluded the case p = 0 in (3.6.16) since it
does not give a probability distribution (p = 0 implies P[X = k] = 0 for all k ∈ N). The
variance of a geometric distributed X is given by

    V(X) = (1 − p)/p².   (3.6.25)
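A quick simulation (our own sketch) of the waiting time until the first 'Head'
illustrates E[X] = 1/p:

    # Sketch: simulate Geo(p) as the number of tosses until the first 'Head'
    # and compare the empirical mean with 1/p.
    import random

    p, runs = 0.25, 100_000
    total = 0
    for _ in range(runs):
        tosses = 1
        while random.random() >= p:   # toss until 'Head' (prob. p) occurs
            tosses += 1
        total += tosses
    print(total / runs, 1 / p)        # both close to 4.0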
3.6.5 Poisson distribution

Let λ > 0 be given. A random variable X is called Poisson distributed with parameter λ
(short X ∼ Poi(λ)), if X takes values in N₀ = {0, 1, 2, . . . } and

    P[X = k] = e⁻λ λᵏ / k!,  k ∈ N₀.   (3.6.26)

Using the exponential series Σ_{k=0}^∞ λᵏ/k! = eλ, one checks that this is a probability
distribution and that

    E[X] = Σ_{k=1}^∞ k e⁻λ λᵏ / k! = λ e⁻λ Σ_{k=1}^∞ λᵏ⁻¹ / (k − 1)! = λ.

For the variance, it is easiest to first compute E[X(X − 1)] and then use
E[X²] = E[X(X − 1)] + E[X] to get E[X²]. We have

    E[X(X − 1)] = Σ_{k=0}^∞ k(k − 1) e⁻λ λᵏ / k! = e⁻λ Σ_{k=2}^∞ λᵏ / (k − 2)!
                = λ² e⁻λ Σ_{k=0}^∞ λᵏ / k! = λ².   (3.6.29)

It now follows

    V(X) = E[X²] − E[X]² = E[X(X − 1)] + E[X] − E[X]² = λ² + λ − λ² = λ.   (3.6.30)
3.7 Joint distribution and independence

3.7.1 Joint distribution

Definition 3.7.1. Let X and Y be discrete random variables on the same probability
space (Ω, P). We then set

    {X = x, Y = y} := {ω ∈ Ω; X(ω) = x and Y(ω) = y} = {X = x} ∩ {Y = y}   (3.7.1)

and

    P[X = x, Y = y] := P[{X = x, Y = y}].

The joint distribution of more than two random variables is defined similarly.
The idea behind the definition of {X = x, Y = y} is similar to the idea behind the
definition of {X = x}. More precisely, one groups together all ω ∈ Ω on which X and Y
take the given values. Thus {X = x, Y = y} can be heuristically interpreted as 'X has the
value x and Y has the value y'. Let us consider an example.
Example 3.7.2. Consider the probability space (Ω, P) with

    Ω = {1, 2, 3, 4, 5, 6, 7, 8}  with  P[ω] = 1/8 for all ω ∈ Ω.   (3.7.2)

Further, we define random variables X with values in {1, 2, 3} and Y with values in
{−1, 1}. The events {X = x, Y = y} can be illustrated graphically, see Figure 3.11. We
now get

    P[X = 2, Y = 1] = 2/8,  P[X = 3, Y = 1] = 1/8,
    P[X = 1, Y = −1] = 2/8,  P[X = 2, Y = −1] = 0,  P[X = 3, Y = −1] = 2/8.
If one now takes a look at Example 3.7.2 and at Figure 3.11, one sees that the events
{X = x, Y = y} cover the sample space Ω, i.e. we have in Example 3.7.2

    Ω = {1, 2, 3, 4, 5, 6, 7, 8} = ⋃_{x∈{1,2,3}, y∈{−1,1}} {X = x, Y = y}.

Also, we see that the events {X = x, Y = y} are disjoint, i.e.

    {X = x₁, Y = y₁} ∩ {X = x₂, Y = y₂} = ∅  if x₁ ≠ x₂ or y₁ ≠ y₂.
These two observations are also true in general, but we will not prove them here. Notice
that we have made a similar observation for the events {X = x}, see Proposition 3.2.5.
More important is that we can recover the distribution of X (and of Y ) from the joint
distribution of X and Y . We have
Lemma 3.7.3. Let X and Y be discrete random variables on the same probability
space (Ω, P). We then have for each x ∈ R

    P[X = x] = Σ_{y∈R: P[X=x,Y=y]>0} P[X = x, Y = y].   (3.7.4)
As an illustration, let us look again at Example 3.7.2 and consider P[X = 1]. Since
{X = 1} = {1, 3, 7}, we see in Figure 3.11 that {X = 1} consists of all ω in the left
column, and thus P[X = 1] is the sum over all P[ω] with ω in the left column. This sum
can be split up into the sum over all P[ω] with ω in the upper left box and the sum over
all P[ω] with ω in the lower left box. The first sum corresponds to P[X = 1, Y = 1] and
the second sum corresponds to P[X = 1, Y = −1]. Thus, we have in this illustration

    P[X = 1] = P[X = 1, Y = 1] + P[X = 1, Y = −1].
Proof of Lemma 3.7.3. To show (3.7.4) in general, we use Proposition 3.2.5 and get

    P[X = x] = P[{X = x} ∩ Ω] = P[{X = x} ∩ ⋃_{y∈R: {Y=y}≠∅} {Y = y}]
             = P[⋃_{y∈R: {X=x}∩{Y=y}≠∅} ({X = x} ∩ {Y = y})]
             = Σ_{y∈R: {X=x}∩{Y=y}≠∅} P[{X = x} ∩ {Y = y}]
             = Σ_{y∈R: P[X=x,Y=y]>0} P[X = x, Y = y].

We also used that at most countably many of the sets {Y = y} are non-empty.
Example 3.7.4. We consider again an urn containing 100 balls, numbered with 00, 01,
. . . , 99, from which we draw one ball purely at random. Thus Ω = {00, . . . , 99} and
P[ω] = 1/100 for all ω ∈ Ω. We consider here the random variables X and Y, where X is
the first digit and Y is the second digit of the drawn ball. As an example, we compute
P[X = 3, Y = 8]. We have

    {X = 3} = {ω ∈ Ω; X(ω) = 3} = {30, 31, . . . , 39},
    {Y = 8} = {ω ∈ Ω; Y(ω) = 8} = {08, 18, . . . , 98}.

Thus |{X = 3}| = |{Y = 8}| = 10 and P[X = 3] = P[Y = 8] = 10/100 = 1/10. Furthermore,

    {X = 3, Y = 8} = {X = 3} ∩ {Y = 8} = {ω ∈ Ω; X(ω) = 3 and Y(ω) = 8} = {38}.

Thus P[X = 3, Y = 8] = |{38}|/100 = 1/100. With a similar argument, one can show that

    P[X = 3, Y = j] = 1/100  for all j ∈ {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}.   (3.7.5)

Applying (3.7.4) then gives

    P[X = 3] = P[X = 3, Y = 0] + P[X = 3, Y = 1] + · · · + P[X = 3, Y = 9]
             = Σ_{j=0}^9 P[X = 3, Y = j] = Σ_{j=0}^9 1/100 = 1/10.   (3.7.6)
The best way to present the joint distribution of two random variables is with a table.
We use an example to illustrate this.
Example 3.7.5. We roll a fair die twice. Let X1 be the number of the first throw, X2 the
number of the second throw and Y = max{X1, X2}. We set here

    Ω = {(x₁, x₂); 1 ≤ x₁, x₂ ≤ 6}  and  P[ω] = 1/36.   (3.7.7)

We then have for instance

    P[X1 = 1, Y = 2] = P[{(1, 2)}] = 1/36,   (3.7.8)
    P[X1 = 2, Y = 2] = P[{(2, 1), (2, 2)}] = 2/36,   (3.7.9)
    P[X1 = 3, Y = 2] = P[∅] = 0.   (3.7.10)

The complete joint distribution of X1 and Y is visible in Table 3.4.
A positive side effect of the table is that we can easily read off the distributions of X1 and Y
by summing the values in the rows respectively in the columns. This follows from (3.7.4).
We have for instance

    P[Y = 3] = 1/36 + 1/36 + 3/36 = 5/36  and   (3.7.11)
    P[Y = 6] = 1/36 + 1/36 + 1/36 + 1/36 + 1/36 + 6/36 = 11/36.   (3.7.12)
X1 \ Y     | 1    | 2    | 3    | 4    | 5    | 6     | Row sum
1          | 1/36 | 1/36 | 1/36 | 1/36 | 1/36 | 1/36  | 1/6 = P[X1 = 1]
2          | 0    | 2/36 | 1/36 | 1/36 | 1/36 | 1/36  | 1/6 = P[X1 = 2]
3          | 0    | 0    | 3/36 | 1/36 | 1/36 | 1/36  | 1/6 = P[X1 = 3]
4          | 0    | 0    | 0    | 4/36 | 1/36 | 1/36  | 1/6 = P[X1 = 4]
5          | 0    | 0    | 0    | 0    | 5/36 | 1/36  | 1/6 = P[X1 = 5]
6          | 0    | 0    | 0    | 0    | 0    | 6/36  | 1/6 = P[X1 = 6]
Column sum | 1/36 | 3/36 | 5/36 | 7/36 | 9/36 | 11/36 |

Table 3.4: Joint distribution of X1 and Y in Example 3.7.5.
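The table can also be generated by brute-force enumeration of Ω. The following sketch
(ours) builds the joint distribution of X1 and Y = max(X1, X2) and recovers the marginal
of Y via the column sums, i.e. via Lemma 3.7.3.

    # Sketch: joint distribution of X1 and Y = max(X1, X2) for two dice,
    # built by enumerating all 36 outcomes; marginals are row/column sums.
    from fractions import Fraction
    from collections import defaultdict
    from itertools import product

    joint = defaultdict(Fraction)
    for x1, x2 in product(range(1, 7), repeat=2):
        joint[(x1, max(x1, x2))] += Fraction(1, 36)

    # Marginal of Y via Lemma 3.7.3: sum over the first coordinate.
    marginal_Y = defaultdict(Fraction)
    for (x1, y), prob in joint.items():
        marginal_Y[y] += prob

    print(joint[(2, 2)])     # 1/18  (= 2/36)
    print(marginal_Y[6])     # 11/36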
Example 3.7.6. Let Ω = {−3, −2, −1, 1, 2, 3} and P[ω] = 1/6. We consider the random
variables

    X(ω) = ω,  |X|(ω) = |ω|  and  Y(ω) = sign(ω) = 1 if ω > 0, −1 if ω < 0.   (3.7.13)

We compute the joint distribution of X and |X| and the joint distribution of |X| and Y.
We have for instance

    P[X = −1, |X| = 1] = P[−1] = 1/6  and  P[X = −2, |X| = 1] = P[∅] = 0,   (3.7.14)
    P[|X| = 2, Y = 1] = P[2] = 1/6  and  P[|X| = 1, Y = −1] = P[−1] = 1/6.   (3.7.15)

The remaining cases can be computed similarly and are summarized in Table 3.5.

(a) Joint distribution of X and |X|

X \ |X|    | 1   | 2   | 3   | Row sum
−3         | 0   | 0   | 1/6 | 1/6
−2         | 0   | 1/6 | 0   | 1/6
−1         | 1/6 | 0   | 0   | 1/6
1          | 1/6 | 0   | 0   | 1/6
2          | 0   | 1/6 | 0   | 1/6
3          | 0   | 0   | 1/6 | 1/6
Column sum | 1/3 | 1/3 | 1/3 |

(b) Joint distribution of |X| and Y

Y \ |X|    | 1   | 2   | 3   | Row sum
−1         | 1/6 | 1/6 | 1/6 | 1/2
1          | 1/6 | 1/6 | 1/6 | 1/2
Column sum | 1/3 | 1/3 | 1/3 |

Table 3.5: The joint distributions in Example 3.7.6.
One can ask oneself at this point whether the joint distribution of two random variables is
determined by the individual distributions of those random variables. The following
example shows that this is not the case.
Example 3.7.7. We toss a coin twice. The first coin is a fair coin. For the second throw,
we choose between two different coins. If the outcome of the first throw is 'Head', we
choose a coin which shows 'Head' with probability p. If the outcome of the first throw is
'Tail', we choose a coin which shows 'Head' with probability 1 − p. The value of p is here
arbitrary, except of course that 0 ≤ p ≤ 1. This leads to the tree diagram in Figure 3.12.
We define Xj = 1 if the j-th throw shows 'Head' and Xj = 0 otherwise. We then get for
instance with the path rule (see Proposition 2.3.4)

    P[X1 = 1, X2 = 0] = P[(H, T)] = 1/2 · (1 − p).   (3.7.16)

The remaining cases can be computed similarly and are summarized in Table 3.6.

X1 \ X2    | 0       | 1       | Row sum
0          | p/2     | (1−p)/2 | 1/2
1          | (1−p)/2 | p/2     | 1/2
Column sum | 1/2     | 1/2     |

Table 3.6: Joint distribution of X1 and X2 in Example 3.7.7.

Considering the row and column sums, we see that X1 and X2 are always Bernoulli(1/2)
distributed. However, the joint distribution of X1 and X2 depends on p and is in
particular not determined by the distributions of X1 and X2!
We will see in Chapter 3.7.2 that the joint distribution of two random variables is deter-
mined by the individual distributions of those random variables if the random variables
are independent, see Definition 3.7.9.
Similarly to Theorem 3.4.9, one can show
Theorem 3.7.8. Let (Ω, P) be a probability space and X and Y be discrete random
variables on this space with E[X²] < ∞, E[Y²] < ∞ (and E[|XY|] < ∞). Then

    E[X · Y] = Σ_{x,y∈R} x · y · P[X = x, Y = y].   (3.7.17)

The proof of Theorem 3.7.8 is almost identical to the proof of Theorem 3.4.9. We thus
omit it.
3.7.2 Independent random variables

If the events {X1 = x1} and {X2 = x2} are independent in the sense of Section 2.7, then

    P[X1 = x1, X2 = x2] = P[X1 = x1] · P[X2 = x2].   (3.7.19)

Since x1 and x2 are arbitrary, (3.7.19) has to hold for all x1 and x2. In view of this
observation, it is natural to call X1 and X2 independent if (3.7.19) is fulfilled for all x1
and x2. We thus set

Definition 3.7.9. Let (Ω, P) be a probability space and X1 and X2 be discrete random
variables on this space. X1 and X2 are called (stochastically) independent, if

    P[X1 = x1, X2 = x2] = P[X1 = x1] · P[X2 = x2]   (3.7.20)

for all x1, x2 ∈ R. If the random variables X1 and X2 are not independent, then we
call them dependent.

In contrast to independent events (see Section 2.7), one can extend this definition without
problems to more than two random variables. We set
Definition 3.7.10. Let (Ω, P) be a probability space and X1, . . . , Xn be discrete random
variables on this space. X1, . . . , Xn are called (stochastically) independent, if

    P[X1 = x1, . . . , Xn = xn] = P[X1 = x1] · . . . · P[Xn = xn]   (3.7.21)

for all x1, . . . , xn ∈ R.

Remark 3.7.11. Note that one has to verify (3.7.21) only for those x1, . . . , xn ∈ R with
P[Xj = xj] > 0 for all 1 ≤ j ≤ n.
Example 3.7.12. We consider again the coin toss of two coins in Example 3.5.12, using
the same notation. We now show that the random variables X1 and X2 are independent.
We have for instance

    P[X1 = 0, X2 = 1] = P[{(T, H)}] = (1 − p) · p = P[X1 = 0] · P[X2 = 1].   (3.7.22)

The computations for the remaining cases are similar and are summarized in Table 3.7.
From this, one sees immediately that X1 and X2 are independent.

X1 \ X2    | 0           | 1       | Row sum
0          | (1−p)·(1−p) | p·(1−p) | 1 − p
1          | p·(1−p)     | p²      | p
Column sum | 1 − p       | p       |

Table 3.7: Joint distribution of X1 and X2 in Example 3.7.12.

If we consider Table 3.7, we see a further advantage of the tabular representation of the
joint distribution: we can easily check whether two random variables are independent.
For this, one only has to check that each entry in the table is the product of its row and
column sums. We have highlighted in Table 3.7 the entries corresponding to
P[X1 = 0, X2 = 1].
Example 3.7.13. Let us consider again Example 3.7.6. Table 3.5 shows that X and |X|
are not independent while |X| and Y are independent.
Example 3.7.14. Let X and Y be independent random variables with

    P[X = 1] = 1/6,  P[X = 2] = 1/3,  P[X = 3] = 1/2  and   (3.7.23)
    P[Y = 1] = 1/4,  P[Y = 2] = 3/4.   (3.7.24)

The computation of the joint distribution of X and Y is illustrated in Table 3.8.

(a) Step 1: The distributions of X and Y

X \ Y      | 1   | 2   | Row sum
1          |     |     | 1/6
2          |     |     | 1/3
3          |     |     | 1/2
Column sum | 1/4 | 3/4 |

(b) Step 2: Completing the remaining entries

X \ Y      | 1         | 2         | Row sum
1          | 1/4 · 1/6 | 3/4 · 1/6 | 1/6
2          | 1/4 · 1/3 | 3/4 · 1/3 | 1/3
3          | 1/4 · 1/2 | 3/4 · 1/2 | 1/2
Column sum | 1/4       | 3/4       |

Table 3.8: The creation of the table of the joint distribution in Example 3.7.14.
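For independent random variables, the joint table is just the outer product of the
marginals. A short sketch of our own:

    # Sketch: for independent X and Y, the joint pmf is the product of the
    # marginals, exactly as in Step 2 of Table 3.8.
    from fractions import Fraction as F

    pX = {1: F(1, 6), 2: F(1, 3), 3: F(1, 2)}
    pY = {1: F(1, 4), 2: F(3, 4)}

    joint = {(x, y): px * py for x, px in pX.items() for y, py in pY.items()}
    print(joint[(2, 2)])                      # 1/4  (= 1/3 * 3/4)
    print(sum(joint.values()))                # 1, so this is a distribution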
Next, we extend Example 3.7.12 and show that the n-fold coin toss is independent.

Example 3.7.15. We toss a coin n times, where in each throw 'Head' occurs with
probability p. For this, we set

    Ω = {0, 1}ⁿ = {(ω₁, ω₂, . . . , ωₙ); ωj ∈ {0, 1}}.   (3.7.25)

We now use the path rule to determine the probability P[ω] of each ω ∈ Ω. As an example,
let us consider n = 5 and ω = (1, 0, 0, 1, 1):

    · →(p) 1 →(1−p) 0 →(1−p) 0 →(p) 1 →(p) 1,   (3.7.26)

and thus P[(1, 0, 0, 1, 1)] = p³(1 − p)². In general, we get

    P[(ω₁, ω₂, . . . , ωₙ)] = pᵏ (1 − p)ⁿ⁻ᵏ  with  k := Σ_{j=1}^n ωj = |{1 ≤ j ≤ n; ωj = 1}|.   (3.7.27)
If one compares the definition of independent events (see Definition 2.7.15) with
Definition 3.7.10, one sees a small difference. For independent events A1, . . . , An, we had
to require the product property for each subfamily of A1, . . . , An to get a suitable
definition of independence. On the other hand, we consider in (3.7.21) only the whole
family. The question at this point is: is (3.7.21) sufficient? If (3.7.21) is a suitable
definition for independent random variables, then

    P[X1 = x1, X2 = x2] = P[X1 = x1] · P[X2 = x2]  and   (3.7.30)
    P[X1 = x1, X2 = x2, X3 = x3] = P[X1 = x1] · P[X2 = x2] · P[X3 = x3]   (3.7.31)

should follow from (3.7.21). In words: if X1, . . . , Xn are independent, then each subfamily
should also be independent, in particular X1, X2 and X1, X2, X3. The following lemma
shows that this is indeed the case: if X1, . . . , Xn are independent and 1 ≤ j1 < . . . < jd ≤ n,
then

    P[Xj1 = xj1, . . . , Xjd = xjd] = P[Xj1 = xj1] · P[Xj2 = xj2] · . . . · P[Xjd = xjd].
Next we consider the expectation of independent random variables. We have

Theorem 3.7.17. Let (Ω, P) be a probability space and X and Y be independent,
discrete random variables on this space with E[|X|] < ∞ and E[|Y|] < ∞. Then the
expectation of X · Y exists and we have

    E[X · Y] = E[X] · E[Y].   (3.7.33)

In particular, the random variables X and Y are then also uncorrelated (see
Definition 3.5.11). Notice that independent random variables are thus always uncorrelated.
However, the reverse direction does not need to be true, see Example 3.7.18.

Proof. We first assume that the expectation of X · Y exists. We then get with
Theorem 3.7.8

    E[X · Y] = Σ_{ω∈Ω} X(ω)Y(ω) P[ω] = Σ_{x,y∈R} xy · P[X = x, Y = y]
             = Σ_{x,y∈R} xy · P[X = x] P[Y = y]
             = (Σ_{x∈R} x · P[X = x]) (Σ_{y∈R} y · P[Y = y]) = E[X] · E[Y].

It remains to prove the existence of E[X · Y]. It follows similarly that

    E[|X · Y|] ≤ (Σ_{x∈R} |x| · P[X = x]) (Σ_{y∈R} |y| · P[Y = y]) = E[|X|] · E[|Y|] < ∞.
As mentioned after Theorem 3.7.17, we now give an example of two random variables X
and Y which are uncorrelated, but not independent.
Example 3.7.18. Let

    Ω = {0, 1, −1}  and  P[0] = 1/2,  P[1] = P[−1] = 1/4.   (3.7.34)

We also define X(ω) = ω and Y(ω) = |X(ω)| = |ω|. We then have

    E[X] = X(−1) · P[−1] + X(0) · P[0] + X(1) · P[1] = −1/4 + 0 + 1/4 = 0,   (3.7.35)
    E[Y] = Y(−1) · P[−1] + Y(0) · P[0] + Y(1) · P[1] = 1/4 + 0 + 1/4 = 1/2,   (3.7.36)
    E[XY] = −1 · P[−1] + 0 · P[0] + 1 · P[1] = 0.   (3.7.37)

We thus have Cov(X, Y) = E[XY] − E[X] E[Y] = 0. Therefore X and Y are uncorrelated.
Obviously, X and Y are not independent. We have for instance

    0 = P[X = 0, Y = 1] ≠ P[X = 0] · P[Y = 1] = 1/2 · 1/2 = 1/4.   (3.7.38)
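This example is also easy to check mechanically (our own sketch):

    # Sketch: X and Y = |X| from Example 3.7.18 are uncorrelated but dependent.
    from fractions import Fraction as F

    P = {0: F(1, 2), 1: F(1, 4), -1: F(1, 4)}

    EX  = sum(w * p for w, p in P.items())            # 0
    EY  = sum(abs(w) * p for w, p in P.items())       # 1/2
    EXY = sum(w * abs(w) * p for w, p in P.items())   # 0
    print(EXY - EX * EY)                              # Cov(X, Y) = 0

    # Dependence: P[X = 0, Y = 1] = 0, but P[X = 0] * P[Y = 1] = 1/4.
    print(P[0] * (P[1] + P[-1]))                      # 1/4 != 0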
Independent random variables have further interesting properties. However, we will not
discuss them here, but we would like to emphasise that independent random variables
are 'user friendly'.
3.8 Law of large numbers

3.8.1 Motivating example

Suppose Peter has to pass a test: he rolls a fair die and passes if the average of the rolled
numbers is at most 4. He may choose whether the die is rolled a) once, b) twice or
c) 1000 times. Which option should he choose? The natural approach is to compute in
all three cases the probability of passing the test and then to compare the results.

Rolling the die once

If the die is rolled only once, we can easily compute this probability. We denote by X1
the number in the first (and only) throw of the die. We get with Proposition 3.3.2

    P['Peter passes'] = P[X1 ≤ 4] = P[X1 = 1] + P[X1 = 2] + P[X1 = 3] + P[X1 = 4]
                      = 4/6 = 2/3.   (3.8.1)

Rolling the die twice

If the die is rolled twice, we denote by X2 the number in the second throw and get

    P['Peter passes'] = P[(X1 + X2)/2 ≤ 4] = P[X1 + X2 ≤ 8] = Σ_{k=2}^8 P[X1 + X2 = k],   (3.8.2)
where we have used for the last equality again Proposition 3.3.2. We thus have to compute
P[X1 + X2 = k] for 2 ≤ k ≤ 8. We have for instance

    P[X1 + X2 = 2] = P[X1 = 1, X2 = 1] = 1/36,
    P[X1 + X2 = 3] = P[X1 = 1, X2 = 2] + P[X1 = 2, X2 = 1] = 2/36.
In a similar way, one can compute P [X1 + X2 = k] for the other values of k. The results
are listed in Table 3.9.
k              | 2    | 3    | 4    | 5    | 6    | 7    | 8    | 9    | 10   | 11   | 12
P[X1 + X2 = k] | 1/36 | 2/36 | 3/36 | 4/36 | 5/36 | 6/36 | 5/36 | 4/36 | 3/36 | 2/36 | 1/36

Table 3.9: Distribution of X1 + X2.
We insert the probabilities of Table 3.9 into equation (3.8.2) and get

    P['Peter passes'] = P[X1 + X2 ≤ 8] = (1 + 2 + 3 + 4 + 5 + 6 + 5)/36 = 26/36.   (3.8.3)

We now clearly have 2/3 = 24/36 < 26/36 and thus option b) is better for Peter than a).
Rolling the die 1000 times

Let us try to argue as in the previous two cases. We thus have here 1000 random variables
X1, . . . , X1000, where Xj is the number in the j-th throw (1 ≤ j ≤ 1000). As before, we roll
the die independently and the random variables X1, . . . , X1000 are therefore independent.
We thus have

    P['Peter passes'] = P[(X1 + X2 + · · · + X1000)/1000 ≤ 4]
                      = P[Σ_{j=1}^1000 Xj ≤ 4000] = Σ_{k=1000}^4000 P[Σ_{j=1}^1000 Xj = k].   (3.8.4)

This is clearly very hard to compute explicitly. For instance, if we would like to compute
P[Σ_{j=1}^1000 Xj = 4000], then we have to find the number of x1, . . . , x1000 with
xj ∈ {1, . . . , 6} and x1 + · · · + x1000 = 4000.
We write X̄_n := (X1 + · · · + Xn)/n for the mean of the first n throws.

Random variable | X1           | (X1 + X2)/2  | X̄_1000
Expectation     | 7/2          | 7/2          | 7/2
Variance        | 35/12 ≈ 2.92 | 35/24 ≈ 1.46 | 35/12000 ≈ 0.003

Table 3.10: Expectation and the variance for the random variables in (3.8.6).

Before we interpret the entries in Table 3.10, let us briefly explain how they are obtained.
We know from Example 3.4.5 and Example 3.5.6 that

    E[X1] = 7/2 = 3.5  and  V(X1) = 35/12 ≈ 2.91.   (3.8.7)

Each Xj clearly has the same distribution as X1. Thus E[Xj] = E[X1] and V(Xj) = V(X1)
for all j. Also, the Xj are independent and thus uncorrelated (see Theorem 3.7.17). It
then follows with Theorem 3.4.12 and Theorem 3.5.13 that

    E[(X1 + X2)/2] = (E[X1] + E[X2])/2 = (7/2 + 7/2)/2 = 7/2,
    V((X1 + X2)/2) = (1/2)² V(X1 + X2) = (1/4)(V(X1) + V(X2)) = V(X1)/2 = 35/24.
The computations for E[X̄_1000] and V(X̄_1000) are similar and we thus omit them.
Table 3.10 shows that all three random variables have the same expectation, but the
variance of X̄_1000 is much smaller than the variances of X1 and (X1 + X2)/2. We have
noted in Section 3.5 that a small variance suggests that a typical value of a random
variable is close to the expectation. Thus Table 3.10 suggests that a typical value of
X̄_1000 is much closer to 3.5 than a typical value of X1 or (X1 + X2)/2. Thus Peter
should choose to roll the die 1000 times, since we then expect with high probability a
value around 3.5.
If one examines the computations in Section 3.8.1 for Table 3.10, one sees immediately
that for the expectation and variance of (X1 + X2)/2 and X̄_1000 we only need the
expectation and variance of X1, since all Xj have the same distribution and are
independent. Thus this computation also holds for X̄_n with arbitrary n. In formulas,
we have

    E[X̄_n] = E[X1]  and  V(X̄_n) = V(X1)/n.   (3.8.9)

Thus the expectation of X̄_n is the same for each n ∈ N, but the variance of X̄_n tends
to 0 as n → ∞. This suggests that X̄_n is for large n very close to E[X1] (at least with
high probability). To see if this intuition is correct, let us simulate the random variable
X̄_n for different values of n. For this, we roll again a fair die and thus X1 has the same
distribution as in Section 3.8.1. For a fixed n, we let the computer simulate many
realisations of X̄_n and draw a histogram of the obtained values.
Figure 3.13: Histograms of the simulated values of X̄_n for (a) n = 1, (b) n = 10,
(c) n = 100, (d) n = 1000, (e) n = 10000.
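Such a simulation can be reproduced with a few lines (our own sketch; we only print
summary numbers instead of drawing histograms):

    # Sketch: simulate the mean of n die rolls many times and watch the
    # spread around 3.5 shrink as n grows (compare Figure 3.13).
    import random

    def mean_of_rolls(n):
        return sum(random.randint(1, 6) for _ in range(n)) / n

    for n in (1, 10, 100, 1000, 10000):
        means = [mean_of_rolls(n) for _ in range(500)]
        avg = sum(means) / len(means)
        spread = (sum((m - avg) ** 2 for m in means) / len(means)) ** 0.5
        print(n, round(avg, 3), round(spread, 3))   # spread ≈ sqrt(35/(12n))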
Lemma 3.8.1. Let (Ω, P) be a (discrete) probability space and X a random variable
on this space with E[X²] < ∞. If V(X) = 0, then P[X = E[X]] = 1.

Lemma 3.8.1 can be interpreted in words as follows: if a random variable X has variance
0, then X is (essentially) a constant.

Proof. We have

    0 = V(X) = Σ_{ω∈Ω} (X(ω) − E[X])² P[ω].   (3.8.11)

Since each summand in (3.8.11) is non-negative, we must have (X(ω) − E[X])² P[ω] = 0
for all ω ∈ Ω. This is only possible if P[ω] = 0 or X(ω) = E[X] for each ω ∈ Ω.
Example 3.8.2. Consider the case X1 ∼ Bernoulli(1/2). We then have E[Xj] = 1/2. We
first look at even n and write n = 2N. We get

    P[X̄_n = 1/2] = P[Σ_{j=1}^n Xj = n/2] = P[Σ_{j=1}^{2N} Xj = N] = C(2N, N)/2^{2N},   (3.8.13)

since the sum of 2N independent Bernoulli(1/2) distributed random variables is
Binomial(2N, 1/2) distributed, see Section 3.6.3. Applying Stirling's formula (see (2.5.9))
to C(2N, N) then shows that

    P[X̄_n = 1/2] ∼ 1/√(πN) → 0  as n → ∞.   (3.8.14)

If n is odd, then n/2 is not an integer and thus

    P[X̄_n = 1/2] = P[Σ_{j=1}^n Xj = n/2] = 0.   (3.8.15)
The observation in Example 3.8.2 is indeed true for most distributions of X1, including
the one originating from the throw of a fair die. We do not give the proof since it is
(much) more complicated than the computations in Example 3.8.2.
Does Example 3.8.2 show that our intuition 'X̄_n → E[X1]' is wrong? Let us consider
the following situation:
  If we toss a fair coin 2000 times, would it be surprising to get 'Head' 1003 times
  instead of 1000 times?
  If we roll a fair die 3000 times, would it be surprising to see the 6 exactly 502 times?
Most people would say: of course not! We thus can only expect that 'X̄_n is approximately
E[X1]' with high probability, but not that 'X̄_n is equal to E[X1]' with high probability.
The sentence 'X̄_n is approximately E[X1]' is of course not a rigorous mathematical
expression. We make this statement mathematically rigorous by introducing a small error
ε, with ε > 0 a fixed constant. We now consider the question: what is the probability
that the distance between X̄_n and E[X1] is smaller than ε? We have given in Figure 3.14
an illustration of this question using the histogram in Figure 3.13.(b). The red part
corresponds to all outcomes where the distance between X̄_n and E[X1] is smaller than ε
and the blue part corresponds to all outcomes where the distance is greater than ε.
Notice that 'the distance between X̄_n and E[X1] is smaller than ε' is written in formulas
as |X̄_n − E[X1]| < ε and the probability of this event is

    P[|X̄_n − E[X1]| < ε].   (3.8.16)

If our intuition is correct, then P[|X̄_n − E[X1]| < ε] should converge to 1 as n → ∞.
Furthermore, this should be true for each fixed ε > 0. This is indeed true and is known
as the Law of large numbers.
Figure 3.14: Illustration of the event 'the distance between X̄_n and E[X1] is less than ε'.
Theorem 3.8.3 (Law of large numbers). Let (Ω, P) be a (discrete) probability space
and X1, X2, . . . , Xn be independent random variables with the same distribution and
E[X1²] < ∞. We then have for any fixed ε > 0

    P[|X̄_n − E[X1]| < ε] → 1  as n → ∞.   (3.8.17)

Theorem 3.8.3 can be interpreted in words as: the mean X̄_n of the random variables X1,
. . . , Xn is for large n close to E[X1] with high probability, and this probability tends to 1
as n → ∞. Theorem 3.8.3 might look a little abstract at first. Let us thus give an example
of how to use it.
Example 3.8.4. Suppose we roll a fair die n times independently. How often does the 3
occur? If n is small, we cannot say much about it. However, if n is large, we can use
the law of large numbers. We model this experiment by introducing a sequence of random
variables X1, X2, . . . , Xn with

    Xj = 1, if 3 occurs in the j-th throw,
         0, if 3 does not occur in the j-th throw.   (3.8.18)

By assumption, we roll the die independently and therefore X1, X2, . . . , Xn are
independent. Also, it is clear that Xj ∼ Ber(p) with p = 1/6. Since each Xj is either 0
or 1, we can interpret

    X1 + · · · + Xn = Σ_{j=1}^n Xj = 'the number of rolled 3's'   (3.8.19)

and therefore

    X̄_n = (X1 + · · · + Xn)/n = (1/n) Σ_{j=1}^n Xj = 'the fraction of rolled 3's'.   (3.8.20)

We have E[X̄_n] = E[X1] = 1/6. Let us now consider the error ε = 1/100. The law of
large numbers then tells us that

    P[|X̄_n − 1/6| < 1/100] → 1  as n → ∞,

i.e. for large n, the fraction of rolled 3's lies with high probability between 1/6 − 1/100
and 1/6 + 1/100.
Theorem 3.8.5 (Chebyschev inequality). Let (Ω, P) be a probability space and X be
a (discrete) random variable with E[X²] < ∞. We then have for each ε > 0

    P[|X − E[X]| ≥ ε] ≤ V(X)/ε².   (3.8.23)
We have mentioned the Chebyschev inequality already in Section 3.5, see (3.5.17). We also
have given there some examples on how it can be used to get bounds for some probabilities,
see the Examples 3.5.7, 3.5.9 and 3.5.10.
Proof. We have

    V(X) = E[(X − E[X])²] = Σ_{x∈R} (x − E[X])² · P[X = x]
         ≥ Σ_{x∈R: |x−E[X]|≥ε} (x − E[X])² · P[X = x] ≥ ε² Σ_{x∈R: |x−E[X]|≥ε} P[X = x]
         = ε² P[|X − E[X]| ≥ ε].

We have used in the first inequality that squares are always non-negative. Division by ε²
completes the proof.
Proof of Theorem 3.8.3. Notice that (3.8.17) is equivalent to

    lim_{n→∞} P[|X̄_n − E[X1]| ≥ ε] = 0.   (3.8.24)

We apply the Chebyschev inequality to X̄_n and get with (3.8.9)

    P[|X̄_n − E[X1]| ≥ ε] ≤ V(X̄_n)/ε² = V(X1)/(n ε²).   (3.8.25)

The right hand side of (3.8.25) thus converges to 0 as n → ∞. This completes the proof
since the left hand side is always non-negative.
Remark 3.8.6.
The assumption E[Xj²] < ∞ in Theorem 3.8.3 is not necessary and can be dropped.
However, the proof without this assumption is considerably more complex and we thus
do not give it here.
The law of large numbers can be used to verify that our definition of a probability
measure has the limiting property described on Page 9 and at the beginning of this
section. The argument is similar to Example 3.8.4.
Theorem 3.8.3 is the so-called weak law of large numbers. There is also a strong law of
large numbers. It states that

    (1/n) Σ_{i=1}^n Xi → E[X1]

for almost all ω ∈ Ω. This convergence is much stronger than the convergence in
Theorem 3.8.3, but we will not explain the difference here. The strong law of large
numbers also requires the existence of infinitely many identically distributed random
variables on the same probability space. Unfortunately, this is not possible on a
discrete probability space. This is not necessary for the weak law of large numbers.
One can now consider the following game: Tom and Jerry toss a fair coin several times.
Each time 'Head' occurs, Tom gets £1 from Jerry; otherwise Jerry gets £1 from Tom.
The law of large numbers now states that the average win or deficit per game goes to 0
as the number of rounds tends to infinity. Does this mean that with high probability
both players are in the lead around 50% of the time in this game? No! The Arcsine law
says the converse: the probability is very high that the same player is leading almost
the whole time. We will not study this here.
3.9 Fair games

Example 3.9.1. Peter plays the following game: he receives £3.20, then rolls a fair die
and pays £1 for each point rolled. His gain (in £) is thus distributed as follows:

Number | 1   | 2   | 3   | 4    | 5    | 6
Gain   | 2.2 | 1.2 | 0.2 | −0.8 | −1.8 | −2.8
P[·]   | 1/6 | 1/6 | 1/6 | 1/6  | 1/6  | 1/6

The gain and deficit in this game do not cancel each other, i.e.

    E[gain] = (2.2 + 1.2 + 0.2 − 0.8 − 1.8 − 2.8) · 1/6 = −0.3.
The average deficit per game is thus £0.30. One cannot really call such a game fair. On
the other hand, if Peter would get £3.50 instead of £3.20, then on average the gain and
deficit would be the same. Thus no side would have an advantage. In this case, we could
call the game fair.
We will thus call a game fair if nobody has an advantage, so that gain and deficit 'cancel
in the long run'. If we think about the Law of large numbers, this is nothing other than
requiring that the expected gain per game is 0.
Definition 3.9.3. Let X be the gain of a betting game (negative values stand for
deficit). We call this game fair, if the expected gain is 0, i.e. if
E [X] = 0.
This definition is now purely mathematical and does not require our intuition. This is a
big advantage, because our intuition can often be misleading. For instance, one could try
to argue in Example 3.9.1 that '3 times 1/6 is 0.5!'. Another such situation is the Monty
Hall problem in Example 2.7.8.
Example 3.9.4. We see in Figure 3.15 a wheel of fortune, where each sector has the
same probability to occur. The numbers in the sectors stand for the money we win if
the corresponding sector occurs. If we would like to take part in this game, we first
have to pay a certain amount E. How large must E be so that this is a fair game?
Let Y be the money we can win in this game without considering E. We then have
E[Y] = 5 · 1/5 + 1 · 2/5 = 1.4. The total gain of this game is X = Y − E and the game
is fair if 0 = E[X] = E[Y − E] = 1.4 − E. Thus E must be £1.40.
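A fair entry price can be computed mechanically (our sketch, using the wheel from
Example 3.9.4):

    # Sketch: the fair entry price of a betting game is the expected payout,
    # here for the wheel of fortune from Example 3.9.4.
    payouts = {5: 0.2, 1: 0.4, 0: 0.4}           # payout -> probability

    expected_payout = sum(v * p for v, p in payouts.items())
    entry = expected_payout                       # then E[Y - entry] = 0
    print(entry)                                  # 1.4, i.e. £1.40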
Example 3.9.5. Victor and Peter play the following game: Victor can toss a fair coin as
often as he would like, but at most 4 times. Each time Victor tosses 'Head', he gets £1
from Peter. Each time Victor tosses 'Tail', he has to pay £1 to Peter. Let us now consider
some strategies for Victor and decide which one he should use.
Let Xj be the gain of Victor under strategy Sj. Obviously, we have E[X0] = 0, since we
have for a fair coin P['Head'] = P['Tail'] = 1/2.
Let us now consider strategy S1. Since Victor stops after the first 'Head', the only possible
realisations are 'H', 'TH', 'TTH', 'TTTH' and 'TTTT'. One computes E[X1] = 0 in the
same way. For strategy S2, the distribution of the gain is

X2   | 2   | 1   | 0    | −2   | −4
P[·] | 1/4 | 1/4 | 3/16 | 4/16 | 1/16

and thus E[X2] = 2 · 1/4 + 1 · 1/4 + 0 · 3/16 − 2 · 4/16 − 4 · 1/16 = 0.
The following theorem shows that this example is only a special case of a more general
phenomenon.
Theorem 3.9.6. In a fair game, each strategy leads to an expected gain of 0.

The definition of a fair game used in Theorem 3.9.6 is slightly more general than the one
in Definition 3.9.3. However, we will not go into details. Furthermore, we should mention
that in Theorem 3.9.6 only strategies are allowed which cannot see into the future and
which have to stop after finitely many steps.
One can ask oneself whether betting games in the casino are fair. As an example, let us
consider Roulette.

Example 3.9.7. In Roulette, one can bet on the numbers 0 to 36, where each number has
the same probability to occur. There are many allowed combinations of bets; also, the 0
has a special role. We consider here only betting on a single number (French 'Plein',
English 'Full number'). Furthermore, we ignore the special role of the 0. If we bet on the
right number, then the payout ratio is 35:1. If we bet on the wrong number, then we lose
our money. Suppose we bet £1; then the average gain per game is

    35 · 1/37 + (−1) · 36/37 = −1/37 < 0.
We thus see that this game is not fair. This is also true for all other allowed combinations,
with or without the special role of the 0. A possible question at this point is: 1/37 ≈ 0.03
is very small, so how much money can the casino make with this advantage? Let us make
a rough computation for one roulette table. Suppose the casino is open 8 hours, 20 rounds
of roulette are played per hour and 10 guests bet £20 per round. Thus around £32,000
pass over the roulette table in one day. The law of large numbers then states that the
gain for this table is around 32000/37 ≈ £865 per day.
Example 3.9.7 shows that in Roulette, the casino is the winner in the long run. This is
also true for all other games in the casino. At this point, one should not really be
surprised, since the aim of the casino is to make money.
3.10 Central limit theorem

We discovered in Section 3.8 with the law of large numbers a limit theorem of probability
theory. Such limit theorems are the strength of probability theory, since they show that
there are regular patterns in the apparent chaos of random sequences. Also, we briefly
illustrated at the beginning of Section 3.8 why the law of large numbers is an important
result. However, we cannot use the law of large numbers for concrete computations. As
an example, suppose that we toss a fair coin 1000 times with success probability p = 1/2.
The law of large numbers then tells us that the number of 'Heads' should be close to 500,
but that is all we get out of it. We could use the Chebyschev inequality as in
Example 3.5.7 to get some upper or lower bounds, but they are often not very good.
Much more useful is the central limit theorem. Together with the law of large numbers,
it is something like the first and second principle of probability theory. However, the
name central limit theorem does not come from its central role in probability theory; it
comes from the fact that it approximates very well events in the centre, i.e. events that
are close to the expectation.
Let us first recall some facts about histograms. Histograms are used to present large
datasets consisting of real numbers, which are too large to be presented in a table. If a
large dataset is given, then a histogram is constructed as follows:
  choose a range covering the data and split it into intervals,
  determine the fraction of the values falling into each interval and
  draw over each interval a rectangle with area equal to the fraction of values.

Example 3.10.1. Consider the dataset

    L = {3.5, 2.75, 3.06, 4.7, 3.52, 3.99, 3.12, 3.90, 2.98, 2.81, 4.07, 4.01, 3.74, 3.56}.   (3.10.1)

We use the range [2, 5] and the intervals I1 = [2, 3], I2 = [3, 3.5], I3 = [3.5, 4], I4 = [4, 5].
The numbers and fractions of values in each interval are displayed in Table 3.14 and the
resulting histogram in Figure 3.16.

         | I1    | I2    | I3    | I4
number   | 3     | 2     | 6     | 3
fraction | ≈ 22% | ≈ 14% | ≈ 42% | ≈ 22%

Table 3.14: Number and fraction of values of the dataset L in Example 3.10.1.
Figure 3.16: Histogram for Example 3.10.1
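The bookkeeping for such a histogram is easy to script (our own sketch, reproducing
Table 3.14):

    # Sketch: count the fraction of values of L in each histogram interval.
    L = [3.5, 2.75, 3.06, 4.7, 3.52, 3.99, 3.12, 3.90, 2.98, 2.81,
         4.07, 4.01, 3.74, 3.56]
    intervals = [(2, 3), (3, 3.5), (3.5, 4), (4, 5)]

    for lo, hi in intervals:
        count = sum(1 for x in L if lo <= x < hi)   # half-open [lo, hi)
        print((lo, hi), count, f"{100 * count / len(L):.0f}%")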
We would like to make some brief comments about histograms. Often one uses the number
instead of the fraction of values in the intervals. Also, one normally uses intervals of the
same size, but this is not essential. An important point is how one determines the number
of intervals. A usual method is Sturges' rule. We denote by N the number of values in
the dataset and by k the number of intervals. Then Sturges' rule says

    k ≈ log₂ N for large N  and  k ≈ √N for small N.   (3.10.2)

Let us now return to the simulations of X̄_n from Section 3.8, where X1, . . . , Xn are the
rolls of a fair die, i.e.

    P[Xj = 1] = P[Xj = 2] = . . . = P[Xj = 6] = 1/6.   (3.10.3)

We used in Figure 3.13 the range [1, 6]. This time we use a range which is better adapted
to the obtained data. This then gives Figure 3.17.
Figure 3.17: Histograms for the mean X̄_n for the roll of a fair die.
We see in Figure 3.17 that all histograms have the same shape, but a different range.
We will see that this is a general phenomenon. Let us now illustrate how one can use the
histograms in Figure 3.17 to approximate the probabilities of some events. Suppose we
are interested in the probability

    P[3.48 ≤ X̄_10000 ≤ 3.51].   (3.10.4)

As we have seen in Section 3.8.1, this is very hard to compute explicitly. However, we have
simulated X̄_10000 with the computer to obtain Figure 3.17 and thus have a large dataset
with experimental values of X̄_10000. A small part of this dataset is

    [3.498, 3.507, 3.464, 3.529, 3.512, 3.510, 3.488, 3.501, 3.505, 3.500, 3.551, 3.503].   (3.10.5)

Since the dataset is large, the probability P[3.48 ≤ X̄_10000 ≤ 3.51] should be close to the
fraction of values in the interval [3.48, 3.51] in this dataset. If for instance 25% of the
values are in [3.48, 3.51], then we expect that P[3.48 ≤ X̄_10000 ≤ 3.51] ≈ 0.25. Notice that
one draws in a histogram over each used interval a rectangle such that its area is the
fraction of values in this interval. If we thus would like to determine the fraction of values
in the interval [3.48, 3.51], we have to determine the area of all rectangles over [3.48, 3.51].
This corresponds to the area of the red part in Figure 3.18.(a). Obviously, we can
approximate the shape of the histogram by a nice curve f(x), see Figure 3.18.(b).

Figure 3.18: Illustration of the probability P[3.48 ≤ X̄_10000 ≤ 3.51]; (a) area of the
histogram rectangles, (b) the approximating curve f(x).

Thus we can approximate the area of all rectangles over [3.48, 3.51] by the area between
the curve f(x) and the x-axis over the interval [3.48, 3.51]. We thus get

    P[3.48 ≤ X̄_10000 ≤ 3.51] ≈ ∫_{3.48}^{3.51} f(x) dx,   (3.10.6)
by the definition of the integral of the function f. The question at this point is: what
is f?
The function f in Figure 3.18.(b) looks similar to the Gaussian bell curve e^(−x²). Some
shifting and stretching of e^(−x²) then shows that the Gaussian bell curve is indeed a
good candidate for the function f. More precisely, if we write µ := E[X1] and σ² := V(X1),
then we can use the function

    f(x) = 1/√(2π σ²/n) · e^(−(x−µ)²/(2σ²/n)),   (3.10.7)

which is the curve plotted in Figure 3.18.(b). An interesting observation is that f only
depends on the number of throws n, the expectation µ = E[X1] and the variance
σ² = V(X1), but not otherwise on the distribution of X1. The important point is now that
the behaviour of X̄_n for an arbitrary random variable X1 is almost the same as for the
roll of a fair die, i.e. the histograms as in Figure 3.18 have the same shape and this shape
can be very well approximated by the function f in (3.10.7). However, it is often not so
convenient to work directly with X̄_n. It is indeed much more convenient to shift and
scale X̄_n such that it has expectation 0 and variance 1. In formulas, this means one
normally works with

    X̃_n := √(n/σ²) (X̄_n − µ) = (Σ_{j=1}^n Xj − nµ)/√(nσ²) = Σ_{j=1}^n (Xj − µ)/√(nσ²).   (3.10.8)
Exercise 3.10.2.
a) Verify that X̃_n = (X̄_n − E[X̄_n])/√(V(X̄_n)),
b) check that E[X̃_n] = 0 and V(X̃_n) = 1.
Let us now formulate the above observation in a mathematically rigorous way using the
random variable X̃_n. This is an important result and is known as the central limit
theorem.
Theorem 3.10.3 (Central limit theorem). Let (Ω, P) be a (discrete) probability space
and X1, X2, . . . , Xn be independent random variables with the same distribution and
E[X1] = µ, V(X1) = σ² > 0. We then have for all a, b with −∞ ≤ a < b ≤ ∞

    P[a ≤ Σ_{j=1}^n (Xj − µ)/√(nσ²) ≤ b] → 1/√(2π) ∫_a^b e^(−x²/2) dx = Φ(b) − Φ(a)   (3.10.9)

as n → ∞, where Φ is defined as

    Φ(t) := 1/√(2π) ∫_{−∞}^t e^(−x²/2) dx  for all t ∈ R.   (3.10.10)

We will not give a proof of Theorem 3.10.3 here, since this requires some probabilistic
tools which we will not discuss in this course. The special case X1 ∼ Ber(p) can be proved
quite directly with Stirling's formula. However, this is a little technical and we thus will
not give it either.
Remark 3.10.4.
a) The integral on the RHS of (3.10.9) has a probabilistic interpretation: it is the
probability P[a ≤ N ≤ b], where N is a continuous random variable with normal
distribution (sometimes also called Gaussian distribution). We will not go into further
detail, since we consider here only discrete random variables.
b) The technical term for the convergence in Theorem 3.10.3 is convergence in law or
convergence in distribution. We will not need this notion and will thus not discuss it
further.
Let us now show how to use the central limit theorem to get approximations for sums of
independent random variables. The idea behind this is as follows: suppose that (cₙ)ₙ∈N
is a sequence with cₙ → c. If we know c explicitly and are interested in cₙ for large n,
then we can hope that cₙ ≈ c. In view of Theorem 3.10.3, we hope that we have for
large n

    P[a ≤ Σ_{j=1}^n (Xj − µ)/√(nσ²) ≤ b] ≈ 1/√(2π) ∫_a^b e^(−x²/2) dx.   (3.10.11)

Considering again the probability in (3.10.4), the approximation in (3.10.6) is precisely
what we get with (3.10.11). Thus one of the advantages of the central limit theorem is
that it is no longer necessary to create a huge dataset of samples to obtain
approximations. Note that one can replace ≤ by < in (3.10.11) without further ado.
We would like to highlight at this point that (3.10.11) very often gives a good
approximation, but in general we do not have equality in (3.10.11). Especially for
discrete random variables, there is no hope of having equality in (3.10.11).
Let us now give examples of how to use the central limit theorem (and thus (3.10.11)) to
get approximations for events involving sums of independent random variables.

Example 3.10.5. Suppose 120 persons have booked a certain flight. The experience of
the airline shows that on average only 90% of the passengers actually take the flight.
What is the probability that more than 105 passengers take the flight? To answer this
question, we number the passengers from 1 to 120 and define for 1 ≤ j ≤ 120

    Xj = 1, if passenger j takes the flight,
         0, otherwise.   (3.10.12)

Then Xj ∼ Ber(p) with p = 0.9, and we assume the Xj to be independent. The number of
passengers taking the flight is X1 + · · · + X120 with E[Σ Xj] = 120p = 108 and
V(Σ Xj) = 120 p(1 − p) = 10.8. We standardize as in (3.10.8) and set

    b := (105 − 120p)/√(120 p(1 − p)) = (105 − 108)/√10.8 ≈ −0.91.

This gives

    P[X1 + . . . + X120 ≤ 105] = P[Σ_{j=1}^{120} (Xj − p)/√(120 p(1 − p)) ≤ b].   (3.10.16)

Using (3.10.11) for the RHS of (3.10.16) gives the approximation

    P[Σ_{j=1}^{120} (Xj − p)/√(120 p(1 − p)) ≤ b] ≈ Φ(b) ≈ Φ(−0.91) ≈ 0.18.   (3.10.17)

The probability that more than 105 passengers take the flight is thus approximately
1 − 0.18 = 0.82.
The procedure in the general case is (almost) the same as in Example 3.10.5. We start
with a sum of independent random variables, subtract its expectation and divide by the
square root of its variance. This gives a new random variable with expectation 0 and
variance 1, to which we can apply (3.10.11). Let us illustrate this with a further example.
Example 3.10.6. Suppose that an employee at the customer service desk of a phone
company needs on average 5 minutes to handle a customer's call. We assume that the
times needed per call are Poisson distributed and independent of each other. Suppose
that the employee receives 100 calls a day. What is the probability that the time needed
per call is on average more than 6 minutes? We denote by Xj the time needed for the
j-th call. Thus Xj ∼ Poi(5) and the Xj are independent of each other. We are therefore
interested in

    P[(X1 + · · · + X100)/100 > 6] = P[X̄_100 > 6].   (3.10.18)

We now have n = 100, µ = E[X1] = 5 and σ² = V(X1) = 5, and thus

    P[X̄_100 > 6] = P[X̄_100 − E[X̄_100] > 6 − E[X̄_100]]
                 = P[(X̄_100 − E[X̄_100])/√(V(X̄_100)) > (6 − E[X̄_100])/√(V(X̄_100))]
                 = P[√(n/σ²)(X̄_100 − µ) > (6 − 5) √(100/5)] = P[√(n/σ²)(X̄_100 − µ) > √20].

We now know from (3.10.8) that the last expression has the required form. We thus set
a = √20 and b = ∞ and get with (3.10.11)

    P[X̄_100 > 6] ≈ Φ(∞) − Φ(√20) = 1 − Φ(√20) ≈ 0.
One important question at this point is of course: how good is the approximation in
(3.10.11)? The quality of the approximation clearly depends on the distribution of X1,
and one can easily construct examples such that the approximation in (3.10.11) is very
bad even for large n. However, such examples are normally very exotic. We can say
roughly that the approximation for 'real life examples' is typically very good, even for
relatively small values of n. As a general guideline, one can use (3.10.11) for 'real life
examples' for n ≥ 30. As an illustration, let us take a look at an example where we can
compute the probability P[Σ_{j=1}^n Xj ≤ x] exactly.
Example 3.10.7. Let us roll a fair die 6000 times independently. What is the probability
of rolling the 6 at least 1100 times? Similarly to Example 3.10.5, we define for
1 ≤ j ≤ 6000

    Xj = 1, if 6 occurs in the j-th throw,
         0, otherwise.   (3.10.19)

We then have that Xj ∼ Ber(p) with p = 1/6 and all Xj are independent. We are thus
interested in the probability

    P[Σ_{j=1}^{6000} Xj ≥ 1100] = Σ_{k=1100}^{6000} C(6000, k) (1/6)ᵏ (5/6)⁶⁰⁰⁰⁻ᵏ.   (3.10.20)

A naive approach would be to type this expression directly into a computer.
Surprisingly, this expression with n = 6000 is already large enough that many computer
programs run into problems. However, there are several programs which can work
directly with the binomial distribution, for instance Maple or R. If one uses such a
program, one gets

    P[Σ_{j=1}^{6000} Xj ≥ 1100] ≈ 0.00029.
Let us now use the central limit theorem to compute an approximation for P [X ≥ 1100].
We have n = 6000, µ = E [X1 ] = 16 and σ 2 = V(X1 ) = 61 · 65 . We thus get
"6000 # " P6000 # " P6000 #
X j=1 Xj − np
1100 − np j=1 X j − 1000 100
P Xj ≥ 1100 = P p ≥p =P 100
√ ≥ 100 √
j=1 np(1 − p) np(1 − p) 6
3 6
3
Z ∞
1 2 √
≈ √ √ e−x /2 dx = 1 − Φ(2 3) = 0.00027.
2 3 2π
We thus see that our approximation agrees with the exact value to 4 decimal places;
the magnitude of the error is in this case 1/100 %. We would like to point out here
that n = 6000 is already quite big and that one cannot expect the error to be this tiny
for relatively small n. Indeed, if we consider n = 300 and look at
P[X_1 + . . . + X_300 ≥ 55], then a computer program gives ≈ 0.24, while the central
limit theorem gives 1 − Φ(0.77) ≈ 0.22. The error is thus no longer as tiny as before,
but it is still reasonably small.
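A sketch of this comparison in Python, again assuming scipy is available; the helper function clt_vs_exact is a hypothetical name introduced here for illustration:

```python
from math import sqrt
from scipy.stats import binom, norm

def clt_vs_exact(n, x, p=1/6):
    """Compare the exact value of P[X_1 + ... + X_n >= x] for Bernoulli(p)
    summands with its central limit theorem approximation."""
    exact = binom.sf(x - 1, n, p)                # exact binomial tail
    z = (x - n * p) / sqrt(n * p * (1 - p))      # standardised threshold
    return exact, 1 - norm.cdf(z)

print(clt_vs_exact(6000, 1100))  # approximately (0.00029, 0.00027)
print(clt_vs_exact(300, 55))     # approximately (0.24, 0.22)
```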
Another point we have to discuss briefly is that the Gaussian density e^{−x^2/2}/√(2π)
has no elementary antiderivative. It is thus not possible to give a closed expression
for its antiderivative Φ(x). However, with our modern computers, calculators and
smartphones, this is not a big problem: a sufficiently accurate approximation of Φ(x)
(for a given x ∈ R) can be obtained in a very short time with low computing power and
a simple algorithm. For completeness, we would like to show here the classical approach.
Without a computer, it obviously takes much time to compute Φ(x) manually. One has
therefore calculated Φ(x) manually for many values of x and tabulated these values.
Table 3.10 shows a small excerpt from this table. The full table can be found in
Appendix C.
0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359
0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753
0.2 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141
0.3 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517
0.4 0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879
The table reads as follows: to find Φ(x) for a value x between 0 and 3.49, locate the
first decimal place of x in the left margin of the table and the second decimal place
in the header row. As an example, we have Φ(0.14) = 0.5557. For x ≥ 3.5, we use
Φ(x) ≈ 1. For x < 0, one uses that Φ(x) = 1 − Φ(−x); for instance
Φ(−0.09) = 1 − Φ(0.09) = 0.4641. If one requires Φ(x) for an x with more than two
decimal places, one interpolates between the values in the table. We will not explain this here.
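In practice one rarely needs the printed table: Φ(x) can be evaluated through the error function erf, which is available in essentially every programming language. A minimal sketch in Python using only the standard library:

```python
from math import erf, sqrt

def Phi(x: float) -> float:
    """Standard normal distribution function, via the identity
    Phi(x) = (1 + erf(x / sqrt(2))) / 2; erf handles negative x directly."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

print(round(Phi(0.14), 4))   # 0.5557, matching the table entry above
print(round(Phi(-0.09), 4))  # 0.4641, consistent with Phi(-x) = 1 - Phi(x)
```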
Note that there are also other useful limit theorems apart from the central limit theorem.
One is for instance the Poisson limit theorem, which can be used for sums of independent
Bernoulli random variables with small success probability. In this situation, the Poisson
limit theorem often gives better approximations than the central limit theorem. However,
we will not discuss this here.
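To give a flavour nevertheless (purely as an illustration, not part of the course material), the following sketch compares the exact binomial tail with both approximations for one arbitrarily chosen small success probability, assuming scipy is available:

```python
from scipy.stats import binom, norm, poisson

# X ~ Bin(1000, 0.005): a sum of 1000 Bernoulli variables with small
# success probability; E[X] = 5. We approximate P[X >= 10].
n, p, x = 1000, 0.005, 10
exact = binom.sf(x - 1, n, p)
pois = poisson.sf(x - 1, mu=n * p)          # Poisson limit theorem
z = (x - n * p) / (n * p * (1 - p)) ** 0.5
clt = 1 - norm.cdf(z)                       # central limit theorem
print(exact, pois, clt)  # the Poisson value is much closer to the exact one
```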
Appendix A

Sums

A.1 Finite sums

A long sum such as x_1 + x_2 + · · · + x_12 can be abbreviated with the summation symbol ∑:

x_1 + x_2 + · · · + x_12 = ∑_{i=1}^{12} x_i. (A.1.1)

This notation indicates that we let i vary over all integers between 1 and 12 and then
build the sum over all terms which occur when we substitute i into the expression in the
sum. We have for instance
1 + 2 + 3 + 4 + 5 = ∑_{i=1}^{5} i,   2·3 + 2·4 + 2·5 + 2·6 = ∑_{i=3}^{6} 2i. (A.1.4)
The variable i is called the summation index. It is an auxiliary variable, which can be
replaced by any other variable. We thus have

∑_{i=1}^{12} x_i = ∑_{j=1}^{12} x_j = ∑_{k=1}^{12} x_k.
Of course, we can also use upper and lower limits other than 1 and 12. For instance,

∑_{i=a}^{b} x_i = x_a + x_{a+1} + · · · + x_b.

One requires at this point that a ≤ b, and one defines ∑_{i=a}^{b} x_i = 0 if b < a.
In the special case a = b, one gets

∑_{i=a}^{a} x_i = x_a.
Until now, we have used the summation index mostly as an index for the summands. The
expression we would like to sum can of course depend in a different way on the summation
index. In general, we thus have

∑_{i=a}^{b} f(i) = f(a) + f(a+1) + · · · + f(b−1) + f(b), (A.1.5)

where f is a function which assigns to each i between a and b a value. To make this a
little bit clearer, we discuss an example in detail. Consider

∑_{i=1}^{7} (2i)^i.

Here i is the summation index; it starts at 1 and runs to 7. We thus have to build
the sum over all i ∈ {1, 2, 3, 4, 5, 6, 7}. When we insert the value i = 1 into the term
in the sum, we obtain (2·1)^1. For i = 2 we get (2·2)^2, for i = 3 we get (2·3)^3, . . . ,
and finally for i = 7 we get (2·7)^7. We thus have

∑_{i=1}^{7} (2i)^i = (2·1)^1 + (2·2)^2 + (2·3)^3 + (2·4)^4 + (2·5)^5 + (2·6)^6 + (2·7)^7.
The sum written with ∑ is much more compact, and it is much easier to keep an overview.
Consider for instance

∑_{i=1}^{100} i(i+1)(i+2) = 1·2·3 + 2·3·4 + · · · + 100·101·102. (A.1.6)

If one writes out the right-hand side of (A.1.6) term by term, one gets

6 + 24 + 60 + · · · + 1030200.

In this form, obviously almost nobody would be able to recognise the right sum. Often,
apart from the summation index, other parameters occur in the expression inside the sum.
Similarly to the integration of functions, the summation index determines over which
parameter one builds the sum. Let us consider the following example:
∑_{i=1}^{4} x_{ki} = x_{k1} + x_{k2} + x_{k3} + x_{k4},

∑_{k=1}^{4} x_{ki} = x_{1i} + x_{2i} + x_{3i} + x_{4i}.
To write a given sum as in (A.1.1) with the help of the sum symbol ∑ is more difficult
and requires that one recognises the structure of the summands.
Example A.1.1.
a) Let us study the sum

1 + 5 + 5^2 + · · · + 5^n.

Note that 5^0 = 1 and 5^1 = 5. Thus the sum is 5^0 + 5^1 + · · · + 5^n and the general form
of the term is 5^k. We thus have

1 + 5 + 5^2 + · · · + 5^n = ∑_{k=0}^{n} 5^k.
b) If one considers the sum y + 2y^2 + 3y^3 + 4y^4 + 5y^5, then the general term has the form
k·y^k and k varies between 1 and 5. We thus have

y + 2y^2 + 3y^3 + 4y^4 + 5y^5 = ∑_{k=1}^{5} k y^k.
c) We consider next

\binom{5}{4} q + 4 \binom{5}{3} q^2 + 9 \binom{5}{2} q^3 + 16 \binom{5}{1} q^4 + 25 q^5.

The exponent of q starts at 1 and ends at 5. The binomial coefficients have the form
\binom{5}{5−k}, where 5−k runs from 4 down to 0. The coefficient in front of the binomial
coefficient has the form k^2, where k runs from 1 to 5. This gives

\binom{5}{4} q + 4 \binom{5}{3} q^2 + 9 \binom{5}{2} q^3 + 16 \binom{5}{1} q^4 + 25 q^5 = ∑_{k=1}^{5} k^2 \binom{5}{5−k} q^k.
Let us now consider some properties of the sum notation.

Lemma A.1.2. Let m ≤ n be integers, let a_m, . . . , a_n and b_m, . . . , b_n be real numbers
and let c ∈ R. We then have

∑_{k=m}^{n} (a_k + b_k) = ∑_{k=m}^{n} a_k + ∑_{k=m}^{n} b_k,   ∑_{k=m}^{n} (a_k − b_k) = ∑_{k=m}^{n} a_k − ∑_{k=m}^{n} b_k,

∑_{k=m}^{n} c·a_k = c · ∑_{k=m}^{n} a_k.

Proof. We have
∑_{k=m}^{n} (a_k + b_k) = (a_m + b_m) + (a_{m+1} + b_{m+1}) + · · · + (a_n + b_n)
= (a_m + · · · + a_n) + (b_m + · · · + b_n) = ∑_{k=m}^{n} a_k + ∑_{k=m}^{n} b_k

and

∑_{k=m}^{n} c·a_k = (c·a_m) + (c·a_{m+1}) + . . . + (c·a_n) = c·(a_m + · · · + a_n) = c · ∑_{k=m}^{n} a_k.
The remaining equations are proven similarly and we thus omit the proofs.
Remark A.1.3. In general, we have

∑_{k=m}^{n} a_k·b_k ≠ ( ∑_{k=m}^{n} a_k ) · ( ∑_{k=m}^{n} b_k ). (A.1.11)
Some important sums are collected in the following lemma.

Lemma A.1.5. Let n ∈ N and a, b, q ∈ R with q ≠ 1. We then have

∑_{i=1}^{n} i = n(n+1)/2,   ∑_{i=1}^{n} i^2 = n(n+1)(2n+1)/6, (A.1.12)

∑_{i=1}^{n} i^3 = (n(n+1)/2)^2 = n^2(n+1)^2/4,   ∑_{i=0}^{n} q^i = (q^{n+1} − 1)/(q − 1), (A.1.13)

∑_{k=0}^{n} \binom{n}{k} a^{n−k} b^k = (a + b)^n,   ∑_{i=0}^{∞} q^i = 1/(1 − q) for |q| < 1. (A.1.14)
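These identities are easy to check numerically; the following sketch in plain Python (an illustration, not part of the notes) verifies them for one choice of n and q:

```python
# Numerical check of the identities in Lemma A.1.5; no libraries needed.
n, q = 10, 0.5

assert sum(i for i in range(1, n + 1)) == n * (n + 1) // 2
assert sum(i**2 for i in range(1, n + 1)) == n * (n + 1) * (2 * n + 1) // 6
assert sum(i**3 for i in range(1, n + 1)) == (n * (n + 1) // 2) ** 2

# Partial geometric sum, valid for q != 1.
assert abs(sum(q**i for i in range(n + 1)) - (q**(n + 1) - 1) / (q - 1)) < 1e-12
print("all identities confirmed for n =", n)
```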
We omit here the proof of Lemma A.1.5. Until now, we have worked only with one summation
index. One can of course also define sums over more than one summation index. One
defines in this case

∑_{i,j=1}^{n} a_{ij} := ∑_{i=1}^{n} ( ∑_{j=1}^{n} a_{ij} ).
A.2 Infinite sums

We now would like to give a meaning to sums with infinitely many summands, such as
∑_{j=1}^{∞} a_j. If only finitely many a_j ≠ 0, then we obviously don't have a problem:
we then build the sum only over all non-zero a_j. In formulas, this means

∑_{j=1}^{∞} a_j := ∑_{j∈N, a_j≠0} a_j.

If infinitely many a_j ≠ 0, we have to define the infinite sum as a limit. We set in this case

∑_{j=1}^{∞} a_j := lim_{n→∞} ∑_{j=1}^{n} a_j
if the limit exists. An example is for instance

∑_{k=1}^{∞} 1/k^2 = π^2/6. (A.2.1)
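The convergence in (A.2.1) can be observed numerically; a small illustration in plain Python (not part of the original notes):

```python
from math import pi

# Partial sums of (A.2.1): sum of 1/k^2 converges (slowly) to pi^2 / 6.
target = pi ** 2 / 6
for n in (10, 100, 10000):
    partial = sum(1 / k ** 2 for k in range(1, n + 1))
    print(n, partial, target - partial)  # the remaining gap behaves like 1/n
```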
However, it is not possible to assign to every infinite sum a meaningful value. To see
this, let us consider

∑_{k=1}^{n} (−1)^k = (−1) + 1 + (−1) + 1 + (−1) + 1 + · · · + (−1)^n = −1 if n is odd, and 0 if n is even.

We thus see that the value jumps up and down between −1 and 0 as n → ∞, and thus
the limit lim_{n→∞} ∑_{k=1}^{n} (−1)^k does not exist. Therefore it is not possible to
assign a value to ∑_{k=1}^{∞} (−1)^k.
Another important point is that in probability theory we often rearrange the order of the
summands in infinite sums. This is of course only meaningful if this rearranging of the
summands has no influence on the value of the sum. If one rearranges only finitely many
summands, then this has no influence on the value of the sum. However, if one rearranges
infinitely many summands, it can happen that the sum takes a different value. A famous
example is the alternating harmonic series

∑_{n=1}^{∞} (−1)^{n+1}/n = 1 − 1/2 + 1/3 − 1/4 + · · · . (A.2.2)

In the order given in (A.2.2), the sum has the value ln(2). However, the order of the
summands is not relevant for every infinite sum. The following theorem gives necessary
and sufficient conditions under which the rearrangement of summands has no influence
on the value.
Theorem A.2.1. Suppose that ∑_{j=1}^{∞} a_j converges. Then the value of the sum is
independent of the order of the summands if and only if the sum converges absolutely,
i.e. if and only if ∑_{j=1}^{∞} |a_j| < ∞.
Remark A.2.2. Notice that (almost) all sums we consider are absolutely convergent. We
will thus only mention the absolute convergence if it is important or necessary.

We will not prove Theorem A.2.1. One can ask oneself: what about the sum

1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + · · · ? (A.2.5)

The most natural approach would be to define this sum as +∞, and this value would
obviously be independent of the order of the summands in (A.2.5), since all summands are
the same. However, one cannot really make computations with +∞ (what should
+∞ − (+∞) or +∞/+∞ be?). One thus normally excludes +∞ to avoid problems. The
next theorem shows that sums which are not absolutely convergent can cause big problems.
Theorem A.2.3. Suppose that ∑_{j=1}^{∞} a_j converges, but not absolutely. Then the
sum can take, by suitable rearrangement of the order of the summands, every value in R
and also ±∞.
A.3 Sums over index sets

A further version of the sum notation is to write a condition under the summation symbol;
one then sums over all indices fulfilling this condition. This version is useful if the range
over which one would like to sum cannot be easily described. Also, we can express the
sums in the Sections A.1 and A.2 in this way. For example,

∑_{1≤k≤10} a_k = ∑_{k=1}^{10} a_k = a_1 + · · · + a_10.
We can also build sums over an arbitrary index set I. We write in this case

∑_{i∈I} a_i. (A.3.1)

We will need this for instance in the definition of the probability of an event, see
Definition 2.1.7. Let us consider an example.

Example A.3.1. Let I = {1, 2, 4, 9} and J = {−2, −1, 0, 1, 2}. We then have

∑_{i∈I} a_i = a_1 + a_2 + a_4 + a_9   and   ∑_{i∈J} a_i = a_{−2} + a_{−1} + a_0 + a_1 + a_2. (A.3.2)
Lemma A.3.2. Let I and J be disjoint sets, i.e. there is no element which is contained
in both I and J. We then have

∑_{i∈I∪J} a_i = ∑_{i∈I} a_i + ∑_{i∈J} a_i. (A.3.3)
The proof is similar to the proof of Lemma A.1.2 and we thus omit it.
If we now look at (A.3.2), we see that the sets I and J in this example are not disjoint,
since the elements 1 and 2 are contained in both I and J. Therefore we cannot apply
Lemma A.3.2. Indeed, we have in this case

∑_{i∈I} a_i + ∑_{i∈J} a_i = a_{−2} + a_{−1} + a_0 + 2a_1 + 2a_2 + a_4 + a_9, (A.3.4)

∑_{i∈I∪J} a_i = a_{−2} + a_{−1} + a_0 + a_1 + a_2 + a_4 + a_9. (A.3.5)

Thus (A.3.4) and (A.3.5) are equal if and only if a_1 + a_2 = 0. In other words, we have
equality only in special cases, and thus (A.3.3) does not hold in general without the
disjointness assumption. Let us now consider another example.
Example A.3.3. Let I = {3, 5} and J = {4, 9}. In this case I and J are disjoint, i.e.
we have I ∩ J = ∅. Thus we can apply Lemma A.3.2 and get

∑_{i∈I} a_i + ∑_{i∈J} a_i = (a_3 + a_5) + (a_4 + a_9) = a_3 + a_4 + a_5 + a_9 = ∑_{i∈I∪J} a_i. (A.3.6)
In the above examples, we considered only subsets of the integers; in other words, we
always had I ⊂ Z and J ⊂ Z. This condition is not necessary, and we can use arbitrary
sets as index sets; the elements of the index set need not be numbers at all.
For completeness, we have to highlight that we have the same problem as in Section A.2.
If the index set I is finite, then the sum

∑_{i∈I} a_i (A.3.7)

is always well defined. However, if the index set is infinite, then we have to be more
careful. In this case, we need that at most countably many a_i ≠ 0 (see Definition 2.1.2).
Further, we have to introduce a numbering i_1, i_2, . . . of all i ∈ I with a_i ≠ 0 and define

∑_{i∈I} a_i := lim_{n→∞} ∑_{k=1}^{n} a_{i_k}   if   ∑_{k=1}^{∞} |a_{i_k}| < ∞. (A.3.8)

The condition ∑_{k=1}^{∞} |a_{i_k}| < ∞ is required so that the sum does not depend on
the choice of the numbering of the indices i_k. An example is the expression for the
expectation of a discrete random variable X, see Theorem 3.4.9. We have in this case
I = R and P[X = x] = 0 except for countably many x_1, x_2, . . . . We thus have

E[X] = ∑_{x∈R} x · P[X = x] = ∑_{k=1}^{∞} x_k · P[X = x_k],

where the sequence (x_k)_{k∈N} consists of all x ∈ R with P[X = x] > 0 (in an arbitrary
order).
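As an illustration (not part of the original text), the expectation of a discrete random variable can be computed by summing over exactly the countably many points with positive probability; the pmf below, a fair six-sided die, is a hypothetical choice:

```python
# Expectation of a discrete random variable via E[X] = sum_k x_k * P[X = x_k].
pmf = {x: 1 / 6 for x in range(1, 7)}  # P[X = x] = 1/6 for x = 1, ..., 6

expectation = sum(x * p for x, p in pmf.items())
print(expectation)  # 3.5
```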
Appendix B
Sets
Sets are one of the most basic objects in mathematics and occur directly or indirectly
almost everywhere. In the context of probability theory, they are used to model events,
see Definition 2.1.7. We give in this appendix a brief introduction and a summary of the
properties we need in this course.
B.1 Definition
Definition B.1.1. A set is a collection of objects in which each object occurs precisely
once. The objects which make up the set are called the elements or members of the set.
We write ω ∈ Ω if the element ω is contained in the set Ω; otherwise we write ω ∉ Ω.

Sets are conventionally denoted by capital letters. If one writes down a set, one
displays the elements between curly braces { }.
b) The set of all prime numbers less than 20 is B = {2, 3, 5, 7, 11, 13, 17, 19}.
We have 3 ∈ B, 11 ∈ B and 4 ∈ / B.
Important examples of sets are the natural numbers N, the integers Z, the rational
numbers Q and the real numbers R.

Definition B.1.3. The cardinality of a set A, denoted |A|, is the number of elements
in A. If the set has infinitely many elements, then its cardinality is ∞.
Remark B.1.5. If the cardinality |A| of a set A is infinite, one often distinguishes whether
the set is countably infinite (i.e. the set is countable, see Definition 2.1.2) or uncountably
infinite. We do not need this distinction and thus write for simplicity just ∞ in both cases.
Definition B.1.6. The set A is a subset of the set B if every element of A is also an
element of B. In this case we write A ⊂ B. Otherwise, we write A 6⊂ B.
Subsets of a set B can be specified for instance by a defining property, as in
A = {ω ∈ B : ω has a given property}.
Remark B.1.7. Notice that there are different conventions in the literature for the subset
symbols. Some authors write A ⊆ B if A is a subset of B and A ⊊ B if A is a proper
subset of B, i.e. A ⊆ B with A ≠ B. Some others write A ⊂ B instead of A ⊊ B. In
this course, A ⊂ B always indicates that A is a subset of B; in particular, A may be
equal to B.
Example B.1.8.
b) We have N ⊂ Z, Q ⊂ R, Q 6⊂ N.
The notion of a subset now gives us an elegant way to say when two sets are equal.

Definition B.1.9. Two sets A and B are equal if and only if A ⊂ B and B ⊂ A. We
write in this case A = B. Otherwise we write A ≠ B.

If one speaks about subsets of a set, one should also speak about the power set of a set.
Definition B.1.10. The power set PΩ of a set Ω is defined as the set of all subsets
of Ω, including the set Ω itself and the empty set ∅. In formulas,

PΩ = {A : A ⊂ Ω}.
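For finite sets, the power set can be enumerated mechanically. A minimal sketch in Python using the standard itertools library (an illustration, not part of the notes):

```python
from itertools import combinations

def power_set(omega):
    """All subsets of omega, including the empty set and omega itself."""
    elems = list(omega)
    return [set(c) for r in range(len(elems) + 1)
                   for c in combinations(elems, r)]

print(power_set({1, 2, 3}))  # 8 subsets: {}, {1}, ..., {1, 2, 3}
```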
B.2 Set operations

Definition B.2.1. Let A and B be subsets of a set Ω.

The union A ∪ B is the set of all elements ω ∈ Ω that are in A or in B (or in both).
The intersection A ∩ B is the set of all elements ω ∈ Ω that are in A and in B.
The set difference A \ B is the set of elements that are in A but not in B.
The complementary set A^c of the set A is the set consisting of those elements
ω ∈ Ω that are not in A (but are in Ω).
A useful way to illustrate sets is with Venn diagrams, see Figure B.1. These were
introduced by the English mathematician John Venn (1834–1923). Drawing Venn diagrams
for three or fewer sets is reasonably straightforward. Good drawings of larger numbers of
intersecting sets have been sought, with those by Anthony Edwards being particularly well
known, see Figure B.2.
Often we have to work with more than two or three sets, and it can then be inconvenient
to write something like A_1 ∪ A_2 ∪ · · · ∪ A_n. In analogy to the sum notation, one
therefore uses the abbreviations ⋃_{i=1}^{n} A_i := A_1 ∪ · · · ∪ A_n and
⋂_{i=1}^{n} A_i := A_1 ∩ · · · ∩ A_n, and similarly for infinite families of sets (see
Definition B.2.2).
[Figure B.1: (a) the union A ∪ B; (b) the intersection A ∩ B.]
[Figure B.2: Venn-style diagrams for (a) three sets and (b) four sets.]
We will now prove some properties of the set operations. To keep the notation simple,
we restrict ourselves to two or three sets.
Proposition B.2.3 (Properties of the union ∪). Let A, B and C be subsets of a set
Ω. We then have
A ∪ B = B ∪ A, (A ∪ B) ∪ C = A ∪ (B ∪ C), (B.2.4)
A ⊂ A ∪ B, A∪A=A (B.2.5)
A ∪ ∅ = A, A ⊂ B if and only if A ∪ B = B. (B.2.6)
Proposition B.2.4 (Properties of the intersection ∩). Let A, B and C be subsets
of a set Ω. We then have
A ∩ B = B ∩ A, (A ∩ B) ∩ C = A ∩ (B ∩ C), (B.2.7)
A ∩ B ⊂ A, A∩A=A (B.2.8)
A ∩ ∅ = ∅, A ⊂ B if and only if A ∩ B = A. (B.2.9)
The proofs of Propositions B.2.3 and B.2.4 are straightforward and follow almost
immediately from the definitions. We thus omit them. Some further properties of sets
are collected in

Proposition B.2.5. Let A, B and C be subsets of a set Ω. We then have

A ∪ A^c = Ω,   A ∩ A^c = ∅, (B.2.10)
A \ B = A ∩ B^c,   (A \ B)^c = A^c ∪ B, (B.2.11)
(A ∪ B) ∩ C = (A ∩ C) ∪ (B ∩ C),   (A ∩ B) ∪ C = (A ∪ C) ∩ (B ∪ C). (B.2.12)

The proof of Proposition B.2.5 is also straightforward and can be done for instance
with Venn diagrams. We thus omit this proof as well.
Finally, we have the rules of De Morgan.
[Figure B.3: (a) the sets A (green) and B (purple); (b) A ∩ B (red) and A^c ∪ B^c (blue).]
Proposition B.2.6 (De Morgan's laws). For all subsets A, B and A_1, A_2, . . . of a set Ω
we have

(A ∪ B)^c = A^c ∩ B^c, (B.2.13)
(A ∩ B)^c = A^c ∪ B^c, (B.2.14)
( ⋂_{i=1}^{∞} A_i )^c = ⋃_{i=1}^{∞} A_i^c. (B.2.15)
Proof. We prove first (B.2.14). The best way to prove (B.2.14) is to use Venn diagrams,
see Figure B.3. Equation (B.2.13) then follows immediately from (B.2.14) with Ã := A^c
and B̃ := B^c, using that (A^c)^c = A. Equation (B.2.15) can be seen as follows:

ω ∈ ( ⋂_{i=1}^{∞} A_i )^c ⟺ ω ∉ ⋂_{i=1}^{∞} A_i ⟺ there exists an i with ω ∉ A_i
⟺ there exists an i with ω ∈ A_i^c ⟺ ω ∈ ⋃_{i=1}^{∞} A_i^c.
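De Morgan's laws are also easy to check on concrete finite sets, for instance with Python's built-in set operations (a sketch for illustration only; complements are taken relative to a small universe Ω):

```python
# Brute-force check of (B.2.13) and (B.2.14) on a small universe.
omega = set(range(10))
A = {0, 1, 2, 3}
B = {2, 3, 5, 7}

assert omega - (A | B) == (omega - A) & (omega - B)  # (A u B)^c = A^c n B^c
assert omega - (A & B) == (omega - A) | (omega - B)  # (A n B)^c = A^c u B^c
print("De Morgan's laws confirmed on this example")
```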
Appendix C

Table of the standard normal distribution

The table lists Φ(x) for 0.00 ≤ x ≤ 3.49; the first decimal place of x is in the left
margin and the second decimal place in the header row (cf. Section 3.10).
0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359
0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753
0.2 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141
0.3 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517
0.4 0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879
0.5 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224
0.6 0.7257 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.7549
0.7 0.7580 0.7611 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.7852
0.8 0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.8133
0.9 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389
1.0 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.8621
1.1 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.8830
1.2 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015
1.3 0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.9177
1.4 0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.9319
1.5 0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9429 0.9441
1.6 0.9452 0.9463 0.9474 0.9484 0.9495 0.9505 0.9515 0.9525 0.9535 0.9545
1.7 0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616 0.9625 0.9633
1.8 0.9641 0.9649 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693 0.9699 0.9706
1.9 0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.9750 0.9756 0.9761 0.9767
2.0 0.9772 0.9778 0.9783 0.9788 0.9793 0.9798 0.9803 0.9808 0.9812 0.9817
2.1 0.9821 0.9826 0.9830 0.9834 0.9838 0.9842 0.9846 0.9850 0.9854 0.9857
2.2 0.9861 0.9864 0.9868 0.9871 0.9875 0.9878 0.9881 0.9884 0.9887 0.9890
2.3 0.9893 0.9896 0.9898 0.9901 0.9904 0.9906 0.9909 0.9911 0.9913 0.9916
2.4 0.9918 0.9920 0.9922 0.9925 0.9927 0.9929 0.9931 0.9932 0.9934 0.9936
2.5 0.9938 0.9940 0.9941 0.9943 0.9945 0.9946 0.9948 0.9949 0.9951 0.9952
2.6 0.9953 0.9955 0.9956 0.9957 0.9959 0.9960 0.9961 0.9962 0.9963 0.9964
2.7 0.9965 0.9966 0.9967 0.9968 0.9969 0.9970 0.9971 0.9972 0.9973 0.9974
2.8 0.9974 0.9975 0.9976 0.9977 0.9977 0.9978 0.9979 0.9979 0.9980 0.9981
2.9 0.9981 0.9982 0.9982 0.9983 0.9984 0.9984 0.9985 0.9985 0.9986 0.9986
3.0 0.9987 0.9987 0.9987 0.9988 0.9988 0.9989 0.9989 0.9989 0.9990 0.9990
3.1 0.9990 0.9991 0.9991 0.9991 0.9992 0.9992 0.9992 0.9992 0.9993 0.9993
3.2 0.9993 0.9993 0.9994 0.9994 0.9994 0.9994 0.9994 0.9995 0.9995 0.9995
3.3 0.9995 0.9995 0.9995 0.9996 0.9996 0.9996 0.9996 0.9996 0.9996 0.9997
3.4 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9998
Appendix D

Some useful series and sums
Exponential series: We have for all real x, y ∈ R, with e Euler's number,

e^x = exp(x) = 1 + x + x^2/2! + x^3/3! + . . . = ∑_{i=0}^{∞} x^i/i!,

exp(x + y) = exp(x) · exp(y).
Geometric series: For |p| < 1 we have

∑_{j=0}^{∞} p^j = 1 + p + p^2 + . . . = (1 − p)^{−1}. (D.0.1)

Using this result, similar results can be derived. Differentiating each side of (D.0.1)
term by term with respect to p gives:
Weighted geometric sums:

(1 − p)^{−2} = 1 + 2p + 3p^2 + . . . = ∑_{j=1}^{∞} j p^{j−1},

2(1 − p)^{−3} = 2 + 6p + 12p^2 + . . . = ∑_{j=2}^{∞} j(j − 1) p^{j−2}.
Binomial theorem: For any positive integer n ∈ N and real numbers p, q ∈ R,

(p + q)^n = \binom{n}{0} p^n q^0 + \binom{n}{1} p^{n−1} q^1 + . . . + \binom{n}{k} p^{n−k} q^k + . . . + \binom{n}{n} p^0 q^n
= ∑_{k=0}^{n} \binom{n}{k} p^{n−k} q^k.
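A quick numerical check of the binomial theorem, using only Python's standard library (an illustration with arbitrarily chosen values):

```python
from math import comb

# Verify (p + q)^n = sum of binom(n, k) * p^(n-k) * q^k for one choice of values.
n, p, q = 7, 0.3, 1.2
lhs = (p + q) ** n
rhs = sum(comb(n, k) * p ** (n - k) * q ** k for k in range(n + 1))
assert abs(lhs - rhs) < 1e-9
print(lhs, rhs)
```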
Symbol index
⋂_{i=1}^{n} A_i . . . . . . . . . . The intersection A_1 ∩ A_2 ∩ . . . ∩ A_n, see Definition B.2.2
⋃_{i=1}^{∞} A_i . . . . . . . . . . The union A_1 ∪ A_2 ∪ A_3 ∪ . . ., see Definition B.2.2
⋃_{i=1}^{n} A_i . . . . . . . . . . The union A_1 ∪ A_2 ∪ . . . ∪ A_n, see Definition B.2.2
\binom{n}{s} . . . . . . . . . . Binomial coefficient, see (2.5.15)
∅ . . . . . . . . . . . . . . . . . . . . . . The empty set {}
P . . . . . . . . . . . . . . . . . . . . . . Probability measure, see Definition 2.1.3 and 2.1.19
V(X) . . . . . . . . . . . . . . . . . . . Variance of a random variable, see Definition 3.5.1
PΩ . . . . . . . . . . . . . . . . . . . . The power set of Ω, see Definition B.1.10
ω . . . . . . . . . . . . . . . . . . . . . . a sample point in a probability space (Ω, P),
see Definition 2.1.3 and 2.1.19
ω ∈ Ω . . . . . . . . . . . . . . . . . . . The element ω is contained in the set Ω,
see Definition B.1.1
ω ∉ Ω . . . . . . . . . . The element ω is not contained in the set Ω, see Definition B.1.1
∑_{i=1}^{n}, ∑ . . . . . . . . . . notation for sums, see Appendix A
Ber(p) . . . . . . . . . . . . . . . . . . Bernoulli distribution, see Section 3.6.2
Geo(p) . . . . . . . . . . . . . . . . . . Geometric distribution, see Section 3.6.4
Poi(λ) . . . . . . . . . . . . . . . . . . . Poisson distribution, see Section 3.6.5
a ≈ b . . . . . . . . . . . . . . . . . . . b is an approximation of a.
an ∼ bn . . . . . . . . . . . . . . . . . . an and bn are asymptotically equal, see Page 35
e . . . . . . . . . . . . . . . . . . . . . . Euler’s number, e ≈ 2.71828.
n! . . . . . . . . . . . . . . . . . . . . . Factorial, short notation for
n · (n − 1) · . . . · 2 · 1, see (2.5.6)
pX (x) . . . . . . . . . . . . . . . . . . . The probability mass function, see Definition 3.2.6
|A| . . . . . . . . . . . . . . . . . . . . . Cardinality of a set A, see Definition B.1.3
(Ω, P) . . . . . . . . . . . . . . . . . . . Probability space, see Definition 2.1.3 and 2.1.19
Ω . . . . . . . . . . . . . . . . . . . . . Sample space, see Definition 2.1.3 and 2.1.19
Index

sigma-additivity, 14
absolutely convergent, 134
Bayes' Theorem, 48
Bernoulli distribution, 90
binomial coefficient, 29
Binomial distribution, 91
Binomial theorem, 92, 148
capital Pi notation, 22
cardinality, 4, 6, 15, 25, 32, 138
Cauchy-Schwarz inequality, 87
Chebyschev inequality, 85, 114, 120
classical experiment, 15
classical measure, 15
combinatorics, 21
complementary, 139
conditional probability, 35, 36
    Bayes' Theorem, 48
    law of total probability, 47
    multiplication rule, 42
convergence in distribution, 124
convergence in law, 124
countable, 4, 15
covariance, 87
dependent
    random variable, 102, 103
disjoint, 12, 139
distribution, 71
    joint distribution, 96
event, 7
    complement, 9
    impossible, 9
    independence, 52, 59
    occurs, 9
    pairwise independent, 56
    sure, 9
expectation, 78
    Bernoulli distribution, 91
    Binomial distribution, 92
    degenerate distribution, 90
    geometric distribution, 94
    Poisson distribution, 95
experiment
    classical, 15
    fair, 15
    Laplace, 15
    multi level, 17
    purely random, 15
    single level, 17
exponential series, 147
factorial, 26
favourable outcomes, 15
Gaussian bell curve, 123
Gaussian distribution, 124
Geometric distribution, 93
Geometric sum, 94
geometric sum, 147
    partial, 147
harmonic series, 134
independence
    events, 59
    random variable, 102, 103
indicator function, 26
induced sample space, 73
inequality
    Chebyschev, 114
    triangle, 81, 148
intersection, 139
joint distribution, 96
Kolmogorov's axioms, 2, 15
Laplace experiment, 15
Laplace measure, 15
Law of large numbers, 107, 113
law of total probability, 39, 47
multiplication rule, 38, 42
normal distribution, 124
pairwise independent, 56
partial Geometric sum, 147
partition, 46, 49
path rule, 19, 41, 43, 44, 49, 56, 59, 91
Pochhammer symbol, 26
Poisson distribution, 95
Poisson limit theorem, 95, 128
possible outcomes, 15
power set, 25, 139
probability mass function, 71
probability measure, 5
probability space, discrete, 5, 14
probability tree, 19
product rule, 22
random variable
    Bernoulli, 90
    Binomial, 91
    covariance, 87
    dependent, 102, 103
    discrete, 64
    distribution, 96
    geometric, 93
    independent, 102, 103
    Poisson, 95
    uncorrelated, 87, 88
sample point, 5
sample space, 5
set, 137
    elements, 137
set difference, 139
shifting theorem, 84
standard deviation, 83
Stirling's formula, 27, 124
Sturges rule, 121
subset, 12, 138
sum rule, 21
summation index, 129
travelling salesman problem, 27
tree diagram, 17, 19
triangle inequality, 81, 148
uncorrelated, 87, 88
uncountable, 4, 73
union, 139
variance, 83
    Bernoulli distribution, 91
    Binomial distribution, 92
    degenerate distribution, 90
    geometric distribution, 95
    Poisson distribution, 95
Venn diagram, 7, 9, 139