A Short Introduction To Probability
© 2018 D.P. Kroese. These notes can be used for educational purposes, provided
they are kept in their original form, including this title page.
4 Joint Distributions
4.1 Joint Distribution and Independence
4.1.1 Discrete Joint Distributions
4.1.2 Continuous Joint Distributions
4.2 Expectation
4.3 Conditional Distribution
D.P. Kroese and J.C.C. Chan (2014). Statistical Modeling and Computation,
Springer, New York.
2. do the tutorial exercises and the exercises in the appendix, which are
there to help you with the “technical” side of things; you will learn here
how to apply the concepts learned in the lectures,
3. carry out random experiments on the computer. This will give you a
better intuition about how randomness works.
All of these will be essential if you wish to understand probability beyond “filling
in the formulas”.
Throughout these notes I try to use a uniform notation in which, as a rule, the
number of symbols is kept to a minimum. For example, I prefer qij to q(i, j),
Xt to X(t), and EX to E[X].
The symbol “:=” denotes “is defined as”. We will also use the abbreviations
r.v. for random variable and i.i.d. (or iid) for independent and identically
distributed.
I will use the sans serif font to denote probability distributions. For example
Bin denotes the binomial distribution, and Exp the exponential distribution.
Numbering
All references to Examples, Theorems, etc. are of the same form. For example,
Theorem 1.2 refers to the second theorem of Chapter 1. References to formulas
appear between brackets. For example, (3.4) refers to formula 4 of Chapter 3.
Literature
1. tossing a die,
Here x is a vector with 1s and 0s, indicating Heads and Tails, say. Typical
outcomes for three such experiments are given in Figure 1.1.
Figure 1.1: Three experiments where a fair coin is tossed 100 times. The dark
bars indicate when “Heads” (=1) appears.
We can also plot the average number of “Heads” against the number of tosses.
In the same Matlab program, this is done in two extra lines of code:
y = cumsum(x)./[1:100]
plot(y)
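For completeness, a possible version of the full program is sketched below (the variable names and the use of rand are my own assumptions, since the original program is not shown in these notes):

x = (rand(1,100) < 0.5);   % 100 fair coin tosses: 1 = Heads, 0 = Tails
y = cumsum(x)./(1:100);    % running average number of Heads after each toss
plot(y)                    % produces a plot like Figure 1.2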
The result of three such experiments is depicted in Figure 1.2. Notice that the
average number of Heads seems to converge to 1/2, but there is a lot of random
fluctuation.
Figure 1.2: The average number of Heads against the number of tosses, for three experiments of 100 coin tosses.
Example 1.2 (Control Chart) Control charts, see Figure 1.3, are frequently
used in manufacturing as a method for quality control. Each hour the average
output of the process is measured — for example, the average weight of 10
bags of sugar — to assess if the process is still “in control”, for example, if the
machine still puts on average the correct amount of sugar in the bags. When
the measured value exceeds the Upper Control Limit or falls below the Lower
Control Limit, an alarm is raised that the process is out of control, e.g., the
machine needs to be adjusted, because it puts either too much or not enough
sugar in the bags. The question
is how to set the control limits, since the random process naturally fluctuates
around its “centre” or “target” line.
Figure 1.3: A control chart, with centre line µ, upper control limit (UCL) at µ + c and lower control limit (LCL) at µ − c, for hourly measurements 1, 2, . . . , 12.
Example 1.4 A 4-engine aeroplane is able to fly on just one engine on each
wing. All engines are unreliable.
Number the engines: 1, 2 (left wing) and 3, 4 (right wing). Observe which engines
work properly during a specified period of time. There are 2^4 = 16 possible
outcomes of the experiment. Which outcomes lead to “system failure”? Moreover,
if the probability of failure within some time period is known for each of
the engines, what is the probability of failure for the entire system? Again this
can be viewed as a random experiment.
[Figure: the number of bytes per interval, for intervals 125 to 155.]
Definition 1.1 The sample space Ω of a random experiment is the set of all
possible outcomes of the experiment.
Ω = {(1, 1), (1, 2), . . . , (1, 6), (2, 1), . . . , (6, 6)}.
Here (x1, . . . , x10) represents the outcome that the height of the first se-
lected person is x1, the height of the second person is x2, et cetera.
Notice that for modelling purposes it is often easier to take the sample space
larger than necessary. For example the actual lifetime of a machine would
certainly not span the entire positive real axis. And the heights of the 10
selected people would not exceed 3 metres.
1.3 Events
Often we are not interested in a single outcome but in whether or not one of a
group of outcomes occurs. Such subsets of the sample space are called events.
Events will be denoted by capital letters A, B, C, . . . . We say that event A
occurs if the outcome of the experiment is one of the elements in A.
A = {(4, 6), (5, 5), (5, 6), (6, 4), (6, 5), (6, 6)}.
A = [0, 1000) .
3. The event that out of fifty selected people, five are left-handed,
A = {5} .
Example 1.5 (Coin Tossing) Suppose that a coin is tossed 3 times, and that
we “record” every head and tail (not only the number of heads or tails). The
sample space can then be written as
Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT},
where, for example, HTH means that the first toss is heads, the second tails,
and the third heads. An alternative sample space is the set {0, 1}3 of binary
vectors of length 3, e.g., HTH corresponds to (1,0,1), and THH to (0,1,1).
A = {HHH, HT H, T HH, T T H} .
Since events are sets, we can apply the usual set operations to them:
Example 1.6 Suppose we cast two dice consecutively. The sample space is
Ω = {(1, 1), (1, 2), . . . , (1, 6), (2, 1), . . . , (6, 6)}. Let A = {(6, 1), . . . , (6, 6)} be
the event that the first die is 6, and let B = {(1, 6), (2, 6), . . . , (6, 6)} be the event
that the second die is 6. Then A ∩ B = {(6, 1), . . . , (6, 6)} ∩ {(1, 6), . . . , (6, 6)} =
{(6, 6)} is the event that both dice are 6.
Figure 1.8: A Venn diagram
Example 1.7 (System Reliability) In Figure 1.9 three systems are depicted,
each consisting of 3 unreliable components. The series system works if and only
if (abbreviated as iff) all components work; the parallel system works iff at least
one of the components works; and the 2-out-of-3 system works iff at least 2 out
of 3 components work.
Figure 1.9: Three unreliable systems
Let Ai be the event that the ith component is functioning, i = 1, 2, 3; and let
Da , Db , Dc be the events that respectively the series, parallel and 2-out-of-3
system is functioning. Then,
Da = A1 ∩ A2 ∩ A3 ,
and
Db = A1 ∪ A2 ∪ A3 .
Also,
Dc = (A1 ∩ A2) ∪ (A1 ∩ A3) ∪ (A2 ∩ A3) .
Two useful results in the theory of sets are the following, due to De Morgan:
If {Ai } is a collection of events (sets) then
( ∪i Ai )^c = ∩i Ai^c    (1.1)
and
( ∩i Ai )^c = ∪i Ai^c .    (1.2)
This is easily proved via Venn diagrams. Note that if we interpret Ai as the
event that a component works, then the left-hand side of (1.1) is the event that
the corresponding parallel system is not working. The right-hand side is the event
that all components are not working. Clearly these two events are the same.
1.4 Probability
The third ingredient in the model for a random experiment is the specification
of the probability of the events. It tells us how likely it is that a particular event
will occur.
Axiom 2 just states that the probability of the “certain” event Ω is 1. Property
(1.3) is the crucial property of a probability, and is sometimes referred to as the
sum rule. It just states that if an event can happen in a number of different
ways that cannot happen at the same time, then the probability of this event is
simply the sum of the probabilities of the composing events.
Note that a probability rule P has exactly the same properties as the common
“area measure”. For example, the total area of the union of the triangles in
Figure 1.10 is equal to the sum of the areas of the individual triangles. This
Figure 1.10: The probability measure has the same properties as the “area”
measure: the total area of the triangles is the sum of the areas of the individual
triangles.
is how you should interpret property (1.3). But instead of measuring areas, P
measures probabilities.
1. P(∅) = 0.
2. A ⊂ B =⇒ P(A) ≤ P(B).
3. P(A) ≤ 1.
4. P(Ac ) = 1 − P(A).
Proof.
Example 1.8 Consider the experiment where we throw a fair die. How should
we define Ω and P?
Obviously, Ω = {1, 2, . . . , 6}; and some common sense shows that we should
define P by
P(A) = |A| / 6 ,   A ⊂ Ω,
where |A| denotes the number of elements in set A. For example, the probability
of getting an even number is P({2, 4, 6}) = 3/6 = 1/2.
[Figure: a finite sample space Ω of equally likely outcomes, with an event A as a subset of these outcomes.]
P(A) = |A| / |Ω| .
Thus for such sample spaces the calculation of probabilities reduces to counting
the number of outcomes (in A and Ω).
Example 1.9 We draw at random a point in the interval [0, 1]. Each point is
equally likely to be drawn. How do we specify the model for this experiment?
The sample space is obviously Ω = [0, 1], which is a continuous sample space.
We cannot define P via the elementary events {x}, x ∈ [0, 1] because each of
these events must have probability 0 (!). However we can define P as follows:
For each 0 ≤ a ≤ b ≤ 1, let
P([a, b]) = b − a .
This completely specifies P. In particular, we can find the probability that the
point falls into any (sufficiently nice) set A as the length of that set.
1.5 Counting
2. Consider a horse race with 8 horses. How many ways are there to gamble
on the placings (1st, 2nd, 3rd)?
[Figure: an urn with n numbered balls, from which k balls are taken; the balls may or may not be replaced, and the order may or may not be noted.]
Consider an urn with n different balls, numbered 1, . . . , n from which k balls are
drawn. This can be done in a number of different ways. First, the balls can be
drawn one-by-one, or one could draw all the k balls at the same time. In the first
case the order in which the balls are drawn can be noted, in the second case
that is not possible. In the latter case we can (and will) still assume the balls are
drawn one-by-one, but that the order is not noted. Second, once a ball is drawn,
it can either be put back into the urn (after the number is recorded), or left
out. This is called, respectively, drawing with and without replacement. All
in all there are 4 possible experiments: (ordered, with replacement), (ordered,
without replacement), (unordered, without replacement) and (unordered, with
replacement). The art is to recognise a seemingly unrelated counting problem
as one of these four urn problems. For the 4 examples above we have the
following
by sets, e.g., {1, 2, 3} = {3, 2, 1}. We now consider for each of the four cases
how to count the number of arrangements. For simplicity we consider for each
case how the counting works for n = 4 and k = 3, and then state the general
situation.
Here, after we draw each ball, we note the number on the ball and put the ball
back. Let n = 4, k = 3. Some possible outcomes are (1, 1, 1), (4, 1, 2), (2, 3, 2),
(4, 2, 1), . . . To count how many such arrangements there are, we can reason as
follows: we have three positions (·, ·, ·) to fill in. Each position can have the
numbers 1, 2, 3 or 4, so the total number of possibilities is 4 × 4 × 4 = 4^3 = 64.
This is illustrated via the following tree diagram:
[Tree diagram: the first, second and third positions each branch into the values 1, 2, 3, 4; the leaves correspond to outcomes such as (1,1,1) and (3,2,1).]
Here we draw again k numbers (balls) from the set {1, 2, . . . , n}, and note the
order, but now do not replace them. Let n = 4 and k = 3. Again there
are 3 positions to fill (·, ·, ·), but now the numbers cannot be the same, e.g.,
(1,4,2), (3,2,1), etc. Such an ordered arrangement is called a permutation of size k from the set {1, . . . , n}.
[Tree diagram: the first position branches into 1, 2, 3, 4, and each later position branches into the values not yet used; the leaves correspond to outcomes such as (2,3,1), (2,3,4) and (3,2,1).]
This time we draw k numbers from {1, . . . , n} but do not replace them (no
replication), and do not note the order (so we could draw them in one grab).
Taking again n = 4 and k = 3, a possible outcome is {1, 2, 4}, {1, 2, 3}, etc.
If we noted the order, there would be n Pk outcomes, amongst which would be
(1,2,4),(1,4,2),(2,1,4),(2,4,1),(4,1,2) and (4,2,1). Notice that these 6 permuta-
tions correspond to the single unordered arrangement {1, 2, 4}. Such unordered
arrangements are called combinations.
Note the two different notations for this number. We will use the second one.
Taking n = 4, k = 3, possible outcomes are {3, 3, 4}, {1, 2, 4}, {2, 2, 2}, etc.
The trick to solve this counting problem is to represent the outcomes in a
different way, via an ordered vector (x1 , . . . , xn ) representing how many times
an element in {1, . . . , 4} occurs. For example, {3, 3, 4} corresponds to (0, 0, 2, 1)
and {1, 2, 4} corresponds to (1, 1, 0, 1). Thus, we can count how many distinct
vectors (x1 , . . . , xn ) there are such that the sum of the components is 3, and
each xi can take value 0,1,2 or 3. Another way of looking at this is to consider
placing k = 3 balls into n = 4 urns, numbered 1,. . . ,4. Then (0, 0, 2, 1) means
that the third urn has 2 balls and the fourth urn has 1 ball. One way to
distribute the balls over the urns is to distribute n − 1 = 3 “separators” and
k = 3 balls over n − 1 + k = 6 positions, as indicated in Figure 1.13.
[Figure 1.13: six positions, labelled 1 to 6, over which the three balls and three separators are distributed.]
The number of ways this can be done is equal to the number of ways k
positions for the balls can be chosen out of n − 1 + k positions, that is,
\binom{n+k-1}{k}.
We thus have:
Returning to our original four problems, we can now solve them easily:
1. The total number of ways the exam can be completed is 3^20 = 3,486,784,401.
2. The number of placings is 8P3 = 336.
3. The number of possible combinations of CDs is \binom{20}{3} = 1140.
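As a quick numerical check (not part of the original notes), these three counts can be computed in Matlab:

3^20              % ordered, with replacement: 3486784401
8*7*6             % ordered, without replacement (8P3): 336
nchoosek(20,3)    % unordered, without replacement: 1140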
More examples
Here are some more examples. Not all problems can be directly related to the
4 problems above. Some require additional reasoning. However, the counting
principles remain the same.
1. In how many ways can the numbers 1,. . . ,5 be arranged, such as 13524,
25134, etc?
Answer: 5! = 120.
2. How many different arrangements are there of the numbers 1,2,. . . ,7, such
that the first 3 numbers are 1,2,3 (in any order) and the last 4 numbers
are 4,5,6,7 (in any order)?
Answer: 3! × 4!.
3. How many different arrangements are there of the word “arrange”, such
as “aarrnge”, “arrngea”, etc?
Answer: Convert this into a ball drawing problem with 7 balls, numbered
1,. . . ,7. Balls 1 and 2 correspond to ’a’, balls 3 and 4 to ’r’, ball 5 to ’n’,
ball 6 to ’g’ and ball 7 to ’e’. The total number of permutations of the
numbers is 7!. However, since, for example, (1,2,3,4,5,6,7) is identical to
(2,1,3,4,5,6,7) (when substituting the letters back), we must divide 7! by
2! × 2! to account for the 4 ways the two ’a’s and ’r’s can be arranged. So
the answer is 7!/4 = 1260.
4. An urn has 1000 balls, labelled 000, 001, . . . , 999. How many balls are
there that have all numbers in ascending order (for example 047 and 489,
but not 033 or 321)?
Answer: There are 10 × 9 × 8 = 720 balls with different numbers. Each
triple of numbers can be arranged in 3! = 6 ways, and only one of these
is in ascending order. So the total number of balls in ascending order is
720/6 = 120.
5. In a group of 20 people each person has a different birthday. How many
different arrangements of these birthdays are there (assuming each year
has 365 days)?
Answer: 365P20.
Once we’ve learned how to count, we can apply the equilikely principle to
calculate probabilities:
1. What is the probability that out of a group of 40 people all have different
birthdays?
Answer: Choosing the birthdays is like choosing 40 balls with replace-
ment from an urn containing the balls 1,. . . ,365. Thus, our sample
space Ω consists of vectors of length 40, whose components are cho-
sen from {1, . . . , 365}. There are |Ω| = 36540 such vectors possible,
and all are equally likely. Let A be the event that all 40 people have
different birthdays. Then, |A| = 365 P40 = 365!/325! It follows that
P(A) = |A|/|Ω| ≈ 0.109, so not very big!
2. What is the probability that in 10 tosses with a fair coin we get exactly
5 Heads and 5 Tails?
Answer: Here Ω consists of vectors of length 10 consisting of 1s (Heads)
and 0s (Tails), so there are 210 of them, and all are equally likely. Let A
be the event of exactly 5 heads. We must count how many binary vectors
there are with exactly 5 1s. This is equivalent to determining in how
many ways the positions of the 5 1s can be chosen out of 10 positions,
that is, \binom{10}{5}. Consequently, P(A) = \binom{10}{5}/2^{10} = 252/1024 ≈ 0.25.
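Both probabilities can be checked with a line of Matlab each (a small sketch of my own):

p_birthdays = prod((365:-1:326)/365)    % P(all 40 birthdays different), approx 0.109
p_fiveheads = nchoosek(10,5)/2^10       % P(exactly 5 Heads in 10 tosses), approx 0.246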
P(A | B) = P(A ∩ B) / P(B)    (1.4)
Example 1.10 We throw two dice. Given that the sum of the eyes is 10, what
is the probability that one 6 is cast?
Then, A ∩ B = {(4, 6), (6, 4)}. And, since all elementary events are equally
likely, we have
P(A | B) = (2/36) / (3/36) = 2/3 .
Suppose, again without loss of generality that Monte opens curtain B. The
contestant is now offered the opportunity to switch to curtain C. Should the
contestant stay with his/her original choice (A) or switch to the other unopened
curtain (C)?
Notice that the sample space consists here of 4 possible outcomes: Ac: The
prize is behind A, and Monte opens C; Ab: The prize is behind A, and Monte
opens B; Bc: The prize is behind B, and Monte opens C; and Cb: The prize
is behind C, and Monte opens B. Let A, B, C be the events that the prize
is behind A, B and C, respectively. Note that A = {Ac, Ab}, B = {Bc} and
C = {Cb}, see Figure 1.14.
Ac: 1/6    Ab: 1/6    Cb: 1/3    Bc: 1/3
Figure 1.14: The sample space for the Monte Hall problem.
Now, obviously P(A) = P(B) = P(C), and since Ac and Ab are equally likely,
we have P({Ab}) = P({Ac}) = 1/6. Monte opening curtain B means that we
have information that event {Ab, Cb} has occurred. The probability that the
prize is under A given this event, is therefore
P(A | B is opened) = P({Ac, Ab} ∩ {Ab, Cb}) / P({Ab, Cb}) = P({Ab}) / P({Ab, Cb}) = (1/6) / (1/6 + 1/3) = 1/3 .
This is what we expected: the fact that Monte opens a curtain does not
give us any extra information about whether the prize is behind A. So one could
think that it does not matter whether we switch or not. But wait a minute! What about
P(B | B is opened)? Obviously this is 0 — opening curtain B means that we
know that event B cannot occur. It follows then that P(C | B is opened) must
be 2/3, since a conditional probability behaves like any other probability and
must satisfy axiom 2 (sum up to 1). Indeed,
P(C | B is opened) = P({Cb} ∩ {Ab, Cb}) / P({Ab, Cb}) = P({Cb}) / P({Ab, Cb}) = (1/3) / (1/6 + 1/3) = 2/3 .
Hence, given the information that B is opened, it is twice as likely that the
prize is behind C as behind A. Thus, the contestant should switch!
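A small simulation supports this conclusion. The sketch below is my own illustration; it uses the observation that the switching strategy wins exactly when the prize is not behind the originally chosen curtain:

N = 1e5; wins = 0;
for i = 1:N
    prize = randi(3);            % curtain hiding the prize (1 = A, 2 = B, 3 = C)
    choice = 1;                  % the contestant initially picks curtain A
    % Monte opens a curtain without the prize; switching then wins
    % if and only if the prize is not behind the original choice
    wins = wins + (prize ~= choice);
end
wins/N                           % should be close to 2/3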
Proof. We only show the proof for 3 events, since the case of n > 3 events follows
similarly. By applying (1.4) to P(B | A) and P(C | A ∩ B), the left-hand side of
(1.6) becomes
P(A) P(B | A) P(C | A ∩ B) = P(A) × (P(A ∩ B)/P(A)) × (P(A ∩ B ∩ C)/P(A ∩ B)) = P(A ∩ B ∩ C) ,
Example 1.12 We draw consecutively 3 balls from a bowl with 5 white and 5
black balls, without putting them back. What is the probability that all balls
will be black?
Solution: Let Ai be the event that the ith ball is black. We wish to find the
probability of A1 A2 A3 , which by the product rule (1.6) is
P(A1) P(A2 | A1) P(A3 | A1 A2) = (5/10) × (4/9) × (3/8) ≈ 0.083.
Note that this problem can also be easily solved by counting arguments, as in
the previous section.
Now P(Ak | Ak−1) = (365 − k + 1)/365, because given that the first k − 1 people
have different birthdays, there are no duplicate birthdays if and only if the
birthday of the k-th person is chosen from the 365 − (k − 1) remaining birthdays.
Thus, we obtain (1.7). More generally, the probability that n randomly selected
people have different birthdays is
P(An) = (365/365) × (364/365) × (363/365) × · · · × ((365 − n + 1)/365) ,   n ≥ 1 .
A graph of P(An ) against n is given in Figure 1.15. Note that the probability
P(An ) rapidly decreases to zero. Indeed, for n = 23 the probability of having
no duplicate birthdays is already less than 1/2.
[Figure 1.15: P(An) plotted against n for n up to 60; the probability decreases rapidly towards 0.]
[Figure: a partition B1, . . . , B6 of the sample space Ω.]
Then, by the sum rule, P(A) = \sum_{i=1}^{n} P(A ∩ Bi), and hence, by the definition of
conditional probability, we have
P(A) = \sum_{i=1}^{n} P(A | Bi) P(Bi) .
Combining the Law of Total Probability with the definition of conditional prob-
ability gives Bayes’ Rule:
P(Bj | A) = P(A | Bj) P(Bj) / \sum_{i=1}^{n} P(A | Bi) P(Bi) .
Example 1.14 A company has three factories (1, 2 and 3) that produce the
same chip, producing 15%, 35% and 50% of the total production, respectively. The
probability that a chip is faulty is 0.01, 0.05 and 0.02 for factories 1, 2 and 3,
respectively. Suppose a chip is selected at random and turns out to be faulty.
What is the probability that it was produced by factory 1?
Let Bi denote the event that the chip is produced by factory i. The {Bi } form
a partition of Ω. Let A denote the event that the chip is faulty. By Bayes’ rule,
P(B1 | A) = (0.15 × 0.01) / (0.15 × 0.01 + 0.35 × 0.05 + 0.5 × 0.02) ≈ 0.052 .
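The same computation can be done in a few lines of Matlab (a sketch):

p = [0.15 0.35 0.50];       % P(B_i): production shares of factories 1, 2, 3
f = [0.01 0.05 0.02];       % P(A | B_i): probability of a faulty chip for each factory
pA = sum(f.*p);             % law of total probability: P(A) = 0.029
pB1 = f(1)*p(1)/pA          % Bayes' rule: P(B_1 | A), approx 0.052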
1.6.3 Independence
This definition covers the case B = ∅ (empty set). We can extend the definition
to arbitrarily many events:
Example 1.15 (A Coin Toss Experiment and the Binomial Law) We flip
a coin n times. We can write the sample space as the set of binary n-tuples:
Here 0 represents Tails and 1 represents Heads. For example, the outcome
(0, 1, 0, 1, . . .) means that the first time Tails is thrown, the second time Heads,
the third time Tails, the fourth time Heads, etc.
How should we define P? Let Ai denote the event of Heads during the ith throw,
i = 1, . . . , n. Then, P should be such that the events A1 , . . . , An are independent.
And, moreover, P(Ai ) should be the same for all i. We don’t know whether the
coin is fair or not, but we can call this probability p (0 ≤ p ≤ 1).
These two rules completely specify P. For example, the probability that the
first k throws are Heads and the last n − k are Tails is p^k (1 − p)^{n−k}.
Also, let Bk be the event that there are k Heads in total. The probability of
this event is the sum of the probabilities of the elementary events {(x1, . . . , xn)} such
that x1 + · · · + xn = k. Each of these events has probability p^k (1 − p)^{n−k}, and
there are \binom{n}{k} of these. Thus,
P(Bk) = \binom{n}{k} p^k (1 − p)^{n−k} ,   k = 0, 1, . . . , n .
so that with the product law and the mutual independence of A1^c, . . . , A_{k−1}^c, Ak we
have the geometric law:
Example 2.1 (Sum of two dice) Suppose we toss two fair dice and note
their sum. If we throw the dice one-by-one and observe each throw, the sample
space is Ω = {(1, 1), . . . , (6, 6)}. The function X, defined by X(i, j) = i + j, is
a random variable, which maps the outcome (i, j) to the sum i + j, as depicted
in Figure 2.1. Note that all the outcomes in the “encircled” set are mapped to
8. This is the set of all outcomes whose sum is 8. A natural notation for this
set is to write {X = 8}. Since this set has 5 outcomes, and all outcomes in Ω
are equally likely, we have
P({X = 8}) = 5/36 .
This notation is very suggestive and convenient. From a non-mathematical
viewpoint we can interpret X as a “random” variable. That is a variable that
can take on several values, with certain probabilities. In particular it is not
difficult to check that
P({X = x}) = (6 − |7 − x|) / 36 ,   x = 2, . . . , 12.
[Figure 2.1: the sample space Ω of the two-dice experiment (a 6 × 6 grid of outcomes (i, j)) and the random variable X, which maps each outcome to the sum i + j ∈ {2, . . . , 12}.]
We usually denote random variables with capital letters from the last part of the
alphabet, e.g. X, X1 , X2 , . . . , Y, Z. Random variables allow us to use natural
and intuitive notations for certain events, such as {X = 10}, {X > 1000},
{max(X, Y ) ≤ Z}, etc.
Example 2.2 We flip a coin n times. In Example 1.15 we can find a probability
model for this random experiment. But suppose we are not interested in the
complete outcome, e.g., (0,1,0,1,1,0. . . ), but only in the total number of heads
(1s). Let X be the total number of heads. X is a “random variable” in the
true sense of the word: X could lie anywhere between 0 and n. What we are
interested in, however, is the probability that X takes certain values. That is,
we are interested in the probability distribution of X. Example 1.15 now
suggests that
P(X = k) = \binom{n}{k} p^k (1 − p)^{n−k} ,   k = 0, 1, . . . , n .   (2.1)
This contains all the information about X that we could possibly wish to know.
Example 2.1 suggests how we can justify this mathematically. Define X as the
function that assigns to each outcome ω = (x1 , . . . , xn ) the number x1 +· · ·+xn .
Then clearly X is a random variable in mathematical terms (that is, a function).
Moreover, the event that there are exactly k Heads in n throws can be written
as
{ω ∈ Ω : X(ω) = k}.
If we abbreviate this to {X = k}, and further abbreviate P({X = k}) to
P(X = k), then we obtain exactly (2.1).
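As an illustration (my own, with n = 10, p = 1/2 and 10,000 replications chosen arbitrarily), the following Matlab sketch compares the empirical distribution of the number of Heads with the pmf (2.1):

n = 10; p = 0.5; N = 10000;
X = sum(rand(N,n) < p, 2);          % N replications of the number of Heads in n tosses
emp = zeros(1,n+1); pmf = zeros(1,n+1);
for k = 0:n
    emp(k+1) = mean(X == k);                        % empirical P(X = k)
    pmf(k+1) = nchoosek(n,k)*p^k*(1-p)^(n-k);       % binomial pmf (2.1)
end
[(0:n)' emp' pmf']                  % compare empirical and exact probabilities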
We give some more examples of random variables without specifying the sample
space.
The set of all possible values a random variable X can take is called the range
of X. We further distinguish between discrete and continuous random variables:
One way to specify the probability distribution is to give the probabilities of all
events of the form {X ≤ x}, x ∈ R. This leads to the following definition.
Note that above we should have written P({X ≤ x}) instead of P(X ≤ x).
From now on we will use this type of abbreviation throughout the course. In
Figure 2.2 the graph of a cdf is depicted.
The following properties of F are a direct consequence of the three axioms
for P.
[Figure 2.2: the graph of a cdf F(x).]
4. 0 ≤ F (x) ≤ 1.
Proof. We will prove (1) and (2) in STAT3004. For (3), suppose x ≤ y and
define A = {X ≤ x} and B = {X ≤ y}. Then, obviously, A ⊂ B (for example
if {X ≤ 3} then this implies {X ≤ 4}). Thus, by (2) on page 14, P(A) ≤ P(B),
which proves (3). Property (4) follows directly from the fact that 0 ≤ P(A) ≤ 1
for any event A — and hence in particular for A = {X ≤ x}.
Any function F with the above properties can be used to specify the distribution
of a random variable X. Suppose that X has cdf F . Then the probability that
X takes a value in the interval (a, b] (excluding a, including b) is given by
Namely, P(X ≤ b) = P({X ≤ a}∪{a < X ≤ b}), where the events {X ≤ a} and
{a < X ≤ b} are disjoint. Thus, by the sum rule: F (b) = F (a) + P(a < X ≤ b),
which leads to the result above. Note however that
P(a ≤ X ≤ b) = F(b) − F(a) + P(X = a) = F(b) − F(a) + F(a) − lim_{h↓0} F(a − h) .
Example 2.3 Toss a die and let X be its face value. X is discrete with range
{1, 2, 3, 4, 5, 6}. If the die is fair the probability mass function is given by
x       1     2     3     4     5     6     Σ
f(x)   1/6   1/6   1/6   1/6   1/6   1/6    1
Example 2.4 Toss two dice and let X be the largest face value showing. The
pmf of X can be found to satisfy
x       1      2      3      4      5      6       Σ
f(x)   1/36   3/36   5/36   7/36   9/36   11/36    1
The probability that the maximum is at least 3 is P(X ≥ 3) = \sum_{x=3}^{6} f(x) = 32/36 = 8/9.
[Figure: a pdf f(x), with the area under the curve between a and b corresponding to P(a < X ≤ b).]
Note that the corresponding cdf F is simply a primitive (also called anti-
derivative) of the pdf f . In particular,
F(x) = P(X ≤ x) = \int_{-∞}^{x} f(u) du .
Example 2.5 Draw a random number from the interval of real numbers [0, 2].
Each number is equally possible. Let X represent the number. What is the
probability density function f and the cdf F of X?
By differentiating F we find
f(x) = 1/2 for 0 ≤ x ≤ 2, and f(x) = 0 otherwise.
Note that this density is constant on the interval [0, 2] (and zero elsewhere),
reflecting that each point in [0,2] is equally likely. Note also that we have
modelled this random experiment using a continuous random variable and its
pdf (and cdf). Compare this with the more “direct” model of Example 1.9.
Describing an experiment via a random variable and its pdf, pmf or cdf seems
much easier than describing the experiment by giving the probability space. In
fact, we have not used a probability space in the above examples.
2.3 Expectation
Definition 2.4 Let X be a discrete random variable with pmf f . The expec-
tation (or expected value) of X, denoted by EX, is defined by
EX = \sum_x x P(X = x) = \sum_x x f(x) .
Namely, in the long run the fractions of times the sum is equal to 2, 3, 4,
. . . are 1/36, 2/36, 3/36, . . ., so the average pay-out per game is the weighted sum of
2, 3, 4, . . . with the weights being the probabilities/fractions. Thus the game is
“fair” if the average profit (pay-out − d) is zero.
[Figure: point masses p1, p2, . . . , pn placed at positions x1, x2, . . . , xn; the balance point is EX.]
Then the centre of mass, the place where we can “balance” the weights, is
centre of mass = x1 p1 + · · · + xn pn ,
For continuous random variables we can define the expectation in a similar way:
Definition 2.5 Let X be a continuous random variable with pdf f . The ex-
pectation (or expected value) of X, denoted by EX, is defined by
EX = \int x f(x) dx .
Theorem 2.1 If X is discrete with pmf f , then for any real-valued function g
E g(X) = \sum_x g(x) f(x) .
Proof. We prove it for the discrete case only. Let Y = g(X), where X is a
discrete random variable with pmf fX , and g is a function. Y is again a random
variable. The pmf of Y , fY satisfies
fY(y) = P(Y = y) = P(g(X) = y) = \sum_{x: g(x)=y} P(X = x) = \sum_{x: g(x)=y} fX(x) .
Example 2.7 Find EX 2 if X is the outcome of the toss of a fair die. We have
EX^2 = 1^2 · (1/6) + 2^2 · (1/6) + 3^2 · (1/6) + · · · + 6^2 · (1/6) = 91/6 .
1. E(a X + b) = a EX + b .
Proof. Suppose X has pmf f . Then 1. follows (in the discrete case) from
E(aX + b) = \sum_x (ax + b) f(x) = a \sum_x x f(x) + b \sum_x f(x) = a EX + b .
The continuous case is proved analogously, by replacing the sum with an inte-
gral.
The square root of the variance is called the standard deviation. The number
EX r is called the rth moment of X.
The following important properties for variance hold for discrete or continu-
ous random variables and follow easily from the definitions of expectation and
variance.
1. Var(X) = EX 2 − (EX)2
2. Var(aX + b) = a2 Var(X)
2.4 Transforms
We will shortly introduce this as the Poisson distribution, but for now this is
not important. The PGF of X is given by
G(z) = \sum_{x=0}^{∞} z^x e^{−λ} λ^x / x!
     = e^{−λ} \sum_{x=0}^{∞} (zλ)^x / x!
     = e^{−λ} e^{zλ} = e^{−λ(1−z)} .
Proof. By definition
Thus, G′(0) = P(X = 1). Differentiating again, we see that G′′(0) = 2 P(X =
2), and in general the n-th derivative of G at zero is G^{(n)}(0) = n! P(X = n),
which completes the proof.
Thus we have the uniqueness property: two pmf’s are the same if and only if
their PGFs are the same.
Another useful property of the PGF is that we can obtain the moments of X
by differentiating G and evaluating it at z = 1.
G′(z) = d(E z^X)/dz = E X z^{X−1} .
G′′(z) = d(E X z^{X−1})/dz = E X(X − 1) z^{X−2} .
G′′′(z) = E X(X − 1)(X − 2) z^{X−3} .
Et cetera. If you’re not convinced, write out the expectation as a sum, and use
the fact that the derivative of the sum is equal to the sum of the derivatives
(although we need a little care when dealing with infinite sums).
In particular,
EX = G′(1),
and
Var(X) = G′′(1) + G′(1) − (G′(1))^2 .
As for the PGF, the moment generating function has the uniqueness property:
Two MGFs are the same if and only if their corresponding distribution functions
are the same.
Remark 2.1 The transforms discussed here are particularly useful when deal-
ing with sums of independent random variables. We will return to them in
Chapters 4 and 5.
In this section we give a number of important discrete distributions and list some
of their properties. Note that the pmf of each of these distributions depends on
one or more parameters; so in fact we are dealing with families of distributions.
We write X ∼ Ber(p). Despite its simplicity, this is one of the most important
distributions in probability! It models for example a single coin toss experiment.
The cdf is given in Figure 2.5.
[Figure 2.5: the cdf of the Ber(p) distribution; it jumps from 0 to 1 − p at x = 0 and from 1 − p to 1 at x = 1.]
Here are some important properties of the Bernoulli distribution. Some of these
properties can be proved more easily after we have discussed multiple random
variables.
[Figure: the pmf of the Bin(10, 0.7) distribution, for k = 0, 1, . . . , 10.]
G(z) = E z^X = E z^{X1+···+Xn} = E z^{X1} · · · E z^{Xn} = (1 − p + zp) × · · · × (1 − p + zp) = (1 − p + zp)^n .
Specifically,
G(z) = \sum_{k=0}^{n} z^k \binom{n}{k} p^k (1 − p)^{n−k} = \sum_{k=0}^{n} \binom{n}{k} (zp)^k (1 − p)^{n−k} = (1 − p + zp)^n .
Note that once we have obtained the PGF, we can obtain the expectation
and variance as G′(1) = np and G′′(1) + G′(1) − (G′(1))^2 = n(n − 1)p^2 +
np − n^2 p^2 = np(1 − p).
Again we look at a sequence of coin tosses but count a different thing. Let X
be the number of tosses needed before the first head occurs. Then
t t · · · t h   (x − 1 Tails, then Heads),
and this has probability (1 − p)^{x−1} p. See also Example 1.16 on page 28. Such a
random variable X is said to have a geometric distribution with parameter p.
We write X ∼ G(p). An example of the graph of the pmf is given in Figure 2.7.
[Figure 2.7: the pmf of the G(0.3) distribution, for x = 1, 2, . . . , 16.]
We give some more properties, including the expectation, variance and PGF of
the geometric distribution. It is easiest to start with the PGF:
using the well-known result for geometric sums: 1 + a + a^2 + · · · = 1/(1 − a),
for |a| < 1.
Var(X) = G′′(1) + G′(1) − (G′(1))^2 = 2(1 − p)/p^2 + 1/p − 1/p^2 = (1 − p)/p^2 .
This is obvious from the fact that {X > k} corresponds to the event of k
consecutive failures.
Now, the event {X > k + x} is a subset of {X > k}, hence their intersection
is {X > k + x}. Moreover, the probabilities of the events {X > k + x} and
{X > k} are (1 − p)^{k+x} and (1 − p)^k, respectively, so that
P(X > k + x | X > k) = (1 − p)^{k+x} / (1 − p)^k = (1 − p)^x = P(X > x),
as required.
coin tossing experiment where we toss a coin n times with success probability
λ/n. Let X be the number of successes. Then, as we have seen X ∼ Bin(n, λ/n).
In other words,
P(X = k) = \binom{n}{k} (λ/n)^k (1 − λ/n)^{n−k}
         = (λ^k / k!) × [n(n − 1) · · · (n − k + 1) / n^k] × (1 − λ/n)^n × (1 − λ/n)^{−k} .
As n → ∞, the second and fourth factors go to 1, and the third factor goes to
e^{−λ} (this is one of the defining properties of the exponential function). Hence,
we have
lim_{n→∞} P(X = k) = (λ^k / k!) e^{−λ} ,
which shows that the Poisson distribution is a limiting case of the binomial one.
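This convergence can also be observed numerically. The following Matlab sketch (parameter values chosen for illustration only) computes the Bin(n, λ/n) probabilities on a log scale and compares them with the Poisson ones:

lambda = 10; k = 0:20;
poipmf = exp(-lambda)*lambda.^k./factorial(k);       % Poisson probabilities
for n = [20 100 1000]
    % Bin(n, lambda/n) pmf, computed via log-gamma to avoid overflow
    binpmf = exp(gammaln(n+1) - gammaln(k+1) - gammaln(n-k+1) ...
                 + k*log(lambda/n) + (n-k)*log(1-lambda/n));
    disp(max(abs(binpmf - poipmf)))                  % maximal difference shrinks with n
end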
An example of the graph of its pmf is given in Figure 2.8.
[Figure 2.8: the pmf of the Poisson distribution with λ = 10, for k = 0, 1, . . . , 20.]
G(z) = e^{−λ(1−z)} .
Thus for the Poisson distribution the variance and expectation are the
same.
Consider an urn with N balls, r of which are red. We draw at random n balls
from the urn without replacement. The number of red balls amongst the n
chosen balls has a Hyp(n, r, N ) distribution. Namely, if we number the red balls
1, . . . , r and the remaining balls r+1, . . . , N , then the total number of outcomes
of the random experiment is \binom{N}{n}, and each of these outcomes is equally likely.
The number of outcomes in the event “k balls are red” is \binom{r}{k} × \binom{N−r}{n−k}, because
the k red balls have to be drawn from the r red balls, and the remaining n − k balls
have to be drawn from the N − r non-red balls. In table form we have:
            red     not red    total
Selected     k       n − k       n
Total        r       N − r       N
Example 2.9 Five cards are selected from a full deck of 52 cards. Let X be
the number of Aces. Then X ∼ Hyp(n = 5, r = 4, N = 52).
P
k 0 1 2 3 4
[Figure: the pdf of the U[a, b] distribution, constant on the interval [a, b].]
We have
EX = \int_a^b x/(b − a) dx = (b^2 − a^2)/(2(b − a)) = (a + b)/2 .
This can be seen more directly by observing that the pdf is symmetric around
c = (a+b)/2, and that the expectation is therefore equal to the symmetry point
c. For the variance we have
Var(X) = EX^2 − (EX)^2 = \int_a^b x^2/(b − a) dx − ((a + b)/2)^2 = · · · = (b − a)^2 / 12 .
A more elegant way to derive this is to use the fact that X can be thought of
as the sum X = a + (b − a)U , where U ∼ U[0, 1]. Namely, for x ∈ [a, b]
P(X ≤ x) = (x − a)/(b − a) = P(U ≤ (x − a)/(b − a)) = P(a + (b − a)U ≤ x) .
Thus, we have Var(X) = Var(a + (b − a)U ) = (b − a)2 Var(U ). And
Var(U) = EU^2 − (EU)^2 = \int_0^1 u^2 du − (1/2)^2 = 1/3 − 1/4 = 1/12 .
[Figure: the pdfs f(x) of the Exp(λ) distribution for λ = 0.5, 1 and 2.]
EX = M′(0) = λ/(λ − s)^2 |_{s=0} = 1/λ .
Proof. By (1.4)
P(X > s + t | X > s) = P(X > s + t, X > s) / P(X > s) = P(X > s + t) / P(X > s)
                     = e^{−λ(t+s)} / e^{−λs} = e^{−λt} = P(X > t),
where in the second equation we have used the fact that the event {X > s + t}
is contained in the event {X > s} hence the intersection of these two sets is
{X > s + t}.
For example, when X denotes the lifetime of a machine, then given the fact
that the machine is alive at time s, the remaining lifetime of the machine, i.e.
X − s, has the same exponential distribution as a completely new machine. In
other words, the machine has no memory of its age and does not “deteriorate”
(although it will break down eventually).
It is not too difficult to prove that the exponential distribution is the only
continuous (positive) distribution with the memoryless property.
The normal (or Gaussian) distribution is the most important distribution in the
study of statistics. We say that a random variable has a normal distribution
with parameters µ and σ 2 if its density function f is given by
f(x) = \frac{1}{σ\sqrt{2π}} e^{−\frac{1}{2}\left(\frac{x−µ}{σ}\right)^2} ,   x ∈ R.   (2.9)
We write X ∼ N(µ, σ 2 ). The parameters µ and σ 2 turn out to be the expectation
and variance of the distribution, respectively. If µ = 0 and σ = 1 then
f(x) = \frac{1}{\sqrt{2π}} e^{−x^2/2} ,
[Figure: the pdfs f(x) of the N(0, 0.25), N(0, 1) and N(2, 1) distributions.]
1. If X ∼ N(µ, σ 2 ), then
(X − µ)/σ ∼ N(0, 1) .   (2.10)
Thus by subtracting the mean and dividing by the standard deviation
we obtain a standard normal distribution. This procedure is called stan-
dardisation.
Proof. Let X ∼ N(µ, σ 2 ), and Z = (X − µ)/σ. Then,
EZ^2 = \left[ −\frac{z}{\sqrt{2π}} e^{−z^2/2} \right]_{−∞}^{∞} + \int_{−∞}^{∞} \frac{1}{\sqrt{2π}} e^{−z^2/2} dz = 0 + 1 = 1 ,
since the last integrand is the pdf of the standard normal distribution.
where the second integrand is the pdf of the N(s, 1) distribution, which
therefore integrates to 1. Now, for general X ∼ N(µ, σ 2 ) write X =
µ + σZ. Then,
E e^{sX} = E e^{s(µ+σZ)} = e^{sµ} E e^{sσZ} = e^{sµ} e^{σ^2 s^2/2} = e^{sµ + σ^2 s^2/2} .
Figure 2.12: Pdfs for the χ_n^2-distribution, for various degrees of freedom n (here n = 1, 2, 3, 4).
As a consequence, we have
EX = M′(0) = (α/λ) (λ/(λ − s))^{α+1} |_{s=0} = α/λ ,
and, similarly,
Var(X) = α/λ^2 .
3.1 Introduction
This chapter deals with the execution of random experiments via the computer,
also called stochastic simulation. In a typical stochastic simulation, random-
ness is introduced into simulation models via independent uniformly distributed
random variables, also called random numbers. These random numbers are
then used as building blocks to simulate more general stochastic systems.
A good random number generator captures all the important statistical properties of true random sequences,
even though the sequence is generated by a deterministic algorithm. For this
reason, these generators are sometimes called pseudorandom.
The most common methods for generating pseudorandom sequences use the so-
called linear congruential generators. These generate a deterministic sequence
of numbers by means of the recursive formula
X_{i+1} = a X_i + c (mod m) ,   (3.1)
where the initial value, X0 , is called the seed, and the a, c and m (all positive
integers) are called the multiplier, the increment and the modulus, respectively.
Note that applying the modulo-m operator in (3.1) means that aXi +c is divided
by m, and the remainder is taken as the value of Xi+1 . Thus, each Xi can only
assume a value from the set {0, 1, . . . , m − 1}, and the quantities
Xi
Ui = , (3.2)
m
called pseudorandom numbers, constitute approximations to the true sequence
of uniform random variables. Note that the sequence {Xi } will repeat itself in
at most m steps, and will therefore be periodic with period not exceeding m.
For example, let a = c = X0 = 3 and m = 5; then the sequence obtained from
the recursive formula Xi+1 = 3Xi + 3 (mod 5) is Xi = 3, 2, 4, 0, 3, which has
period 4, while m = 5. In the special case where c = 0, (3.1) simply reduces to X_{i+1} = a X_i (mod m).
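The small example above can be reproduced with a few lines of Matlab (a sketch):

a = 3; c = 3; m = 5; X = 3;        % multiplier, increment, modulus and seed of the example
for i = 1:6
    X = mod(a*X + c, m);           % the recursion (3.1)
    fprintf('%d ', X);             % prints 2 4 0 3 2 4 : period 4
end
fprintf('\n');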
(Readers not acquainted with the notion inf should read min.) It is easy to
show that if U ∼ U(0, 1), then
X = F^{−1}(U)   (3.5)
has cdf F.
Thus, to generate a random variable X with cdf F , draw U ∼ U(0, 1) and set
X = F −1 (U ). Figure 3.1 illustrates the inverse-transform method given by the
following algorithm.
1. Generate U ∼ U(0, 1).
2. Return X = F^{−1}(U).
[Figure 3.1: the inverse-transform method; U is drawn on the vertical axis of the graph of the cdf F(x), and X = F^{−1}(U) is read off on the horizontal axis.]
The cdf is
F(x) = 0 for x < 0,   F(x) = \int_0^x 2u du = x^2 for 0 ≤ x ≤ 1,   F(x) = 1 for x > 1.
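Since F(x) = x^2 on [0, 1], the inverse is F^{−1}(u) = √u, and the inverse-transform method amounts to taking the square root of a uniform random number. A minimal Matlab sketch:

U = rand(1,100000);     % uniform random numbers
X = sqrt(U);            % X = F^(-1)(U) has pdf f(x) = 2x on [0,1]
hist(X,50)              % the histogram should increase roughly linearly on [0,1]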
[Figure 3.2: the inverse-transform method for a discrete random variable; the cdf F(x) is a step function with jumps p1, . . . , p5, and X is the smallest value x at which F(x) exceeds U.]
The algorithm for generating a random variable from F can thus be written as
follows:
1. Generate U ∼ U(0, 1).
2. Find the smallest positive integer k such that U ≤ F(x_k), and return X = x_k.
min(find(cumsum(p)> rand));
Much of the execution time in Algorithm 3.2 is spent in making the comparisons
of Step 2. This time can be reduced by using efficient search techniques.
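For completeness, the one-line command above can be embedded in a small self-contained script; the pmf p below is an arbitrary example of my own choosing:

p = [0.1 0.2 0.3 0.25 0.15];             % an example pmf on the values x = 1,...,5
N = 10000; X = zeros(1,N);
for i = 1:N
    X(i) = min(find(cumsum(p) > rand));  % Step 2 of Algorithm 3.2
end
hist(X,1:5)                              % frequencies should be approximately N*p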
with respect to x. Even in the case where F −1 exists in an explicit form, the
inverse-transform method may not necessarily be the most efficient random
variable generation method.
The next two subsections present algorithms for generating variables from com-
monly used continuous and discrete distributions. Of the numerous algorithms
available we have tried to select those which are reasonably efficient and rela-
tively simple to implement.
Exponential Distribution
There are many alternative procedures for generating variables from the expo-
nential distribution. The interested reader is referred to Luc Devroye’s book
Non-Uniform Random Variate Generation, Springer-Verlag, 1986. (The entire
book can be downloaded for free.)
f(x) = \frac{1}{σ\sqrt{2π}} \exp\left( −\frac{(x − µ)^2}{2σ^2} \right) ,   −∞ < x < ∞ ,   (3.9)
where µ is the mean (or expectation) and σ 2 the variance of the distribution.
Bernoulli distribution
In Figure 1.1 on page 6 typical outcomes are given of 100 independent Bernoulli
random variables, each with success parameter p = 0.5.
Binomial distribution
Recall that a binomial random variable X can be viewed as the total number of
successes in n independent Bernoulli experiments, each with success probability
p; see Example 1.15. Denoting the result of the i-th trial by Xi = 1 (success)
or Xi = 0 (failure), we can write X = X1 + · · · + Xn with the {Xi } iid Ber(p)
random variables. The simplest generation algorithm can thus be written as
follows:
It is worthwhile to note that if Y ∼ Bin(n, p), then n−Y ∼ Bin(n, 1−p). Hence,
to enhance efficiency, one may elect to generate X from Bin(n, p) according to
X = Y1, where Y1 ∼ Bin(n, p), if p ≤ 1/2 ;
X = n − Y2, where Y2 ∼ Bin(n, 1 − p), if p > 1/2 .
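In Matlab, the Bernoulli-sum algorithm can be written in a single vectorised line (a sketch; the parameter values are arbitrary):

n = 10; p = 0.7; N = 10000;
X = sum(rand(N,n) < p, 2);    % N draws from Bin(n,p), each the sum of n Bernoulli(p) variables
mean(X)                       % should be close to n*p = 7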
Geometric Distribution
The random variable X can be interpreted as the number of trials required
until the first success occurs, in a series of independent Bernoulli trials with
success parameter p. Note that P(X > m) = (1 − p)^m.
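One standard way to generate such a variable, based on the relation P(X > m) = (1 − p)^m, is to set X = ⌈log U / log(1 − p)⌉ with U ∼ U(0, 1). This is a sketch of that approach; the omitted part of the notes may intend a different algorithm:

p = 0.3; N = 10000;
U = rand(1,N);
X = ceil(log(U)./log(1-p));   % P(X > m) = P(U < (1-p)^m) = (1-p)^m, so X ~ G(p)
mean(X)                       % should be close to 1/p = 3.33...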
Joint Distributions
Often a random experiment is described via more than one random variable.
Examples are:
2. We flip a coin repeatedly. Let Xi = 1 if the ith flip is “heads” and 0 else.
The experiment is described by the sequence X1 , X2 , . . . of coin flips.
How can we specify the behaviour of the random variables above? We should
not just specify the pdf or pmf of the individual random variables, but also
say something about the “interaction” (or lack thereof) between the random
variables. For example, in the third experiment above if the height Y is large,
we expect that X is large as well. On the other hand, for the first and second
experiment it is reasonable to assume that information about one of the random
variables does not give extra information about the others. What we need to
specify is the joint distribution of the random variables.
The theory for multiple random variables is quite similar to that of a single
random variable. The most important extra feature is perhaps the concept of
independence of random variables. Independent random variables play a crucial
role in stochastic modelling.
If we know F then we can in principle derive any probability involving the Xi ’s.
Note the abbreviation on the right-hand side. We will henceforth use this kind
of abbreviation throughout the notes.
Similar to the 1-dimensional case we distinguish between the case where the Xi
are discrete and continuous. The corresponding joint distributions are again
called discrete and continuous, respectively.
To see how things work in the discrete case, let’s start with an example.
Example 4.1 In a box are three dice. Die 1 is a normal die; die 2 has no 6
face, but instead two 5 faces; die 3 has no 5 face, but instead two 6 faces. The
experiment consists of selecting a die at random, followed by a toss with that
die. Let X be the die number that is selected, and let Y be the face value of
that die. The probabilities P(X = x, Y = y) are specified below.
         y
x        1      2      3      4      5      6      Σ
1       1/18   1/18   1/18   1/18   1/18   1/18   1/3
2       1/18   1/18   1/18   1/18   1/9    0      1/3
3       1/18   1/18   1/18   1/18   0      1/9    1/3
Σ       1/6    1/6    1/6    1/6    1/6    1/6     1
We sometimes write fX1 ,...,Xn instead of f to show that this is the pmf of the
random variables X1 , . . . , Xn . Or, if X is the corresponding random vector, we
could write fX instead.
Note that, by the sum rule, if we are given the joint pmf of X1 , . . . , Xn we can
in principle calculate all possible probabilities involving these random variables.
For example, in the 2-dimensional case
P((X, Y) ∈ B) = \sum_{(x,y) ∈ B} P(X = x, Y = y) ,
for any subset B of possible values for (X, Y ). In particular, we can find the
pmf of X by summing the joint pmf over all possible values of y:
P(X = x) = \sum_y P(X = x, Y = y) .
The converse is not true: from the individual distributions (so-called marginal
distribution) of X and Y we cannot in general reconstruct the joint distribution
of X and Y . We are missing the “dependency” information. E.g., in Exam-
ple 4.1 we cannot reconstruct the inside of the two-dimensional table if only
given the column and row totals.
However, there is one important exception to this, namely when we are dealing
with independent random variables. We have so far only defined what inde-
pendence is for events. The following definition says that random variables
X1 , . . . , Xn are independent if the events {X1 ∈ A1 }, . . . , {Xn ∈ An } are in-
dependent for any subsets A1 , . . . , An of R. Intuitively, this means that any
information about one of them does not affect our knowledge about the others.
for all x1 , x2 , . . . , xn .
Example 4.2 We repeat the experiment in Example 4.1 with three ordinary
fair dice. What are now the joint probabilities in the table? Since the events
{X = x} and {Y = y} are now independent, each entry in the pmf table is
(1/3) × (1/6). Clearly in the first experiment not all events {X = x} and {Y = y} are
independent (why not?).
This completely specifies our model. In particular we can find any probability
related to the Xi ’s. For example, let X = X1 + · · · + Xn be the total number of
Heads in n tosses. Obviously X is a random variable that takes values between
0 and n. Denote by A the set of all binary vectors x = (x1, . . . , xn) such that
\sum_{i=1}^{n} x_i = k. Note that A has \binom{n}{k} elements. We now have
P(X = k) = \sum_{x ∈ A} P(X1 = x1, . . . , Xn = xn)
         = \sum_{x ∈ A} P(X1 = x1) · · · P(Xn = xn) = \sum_{x ∈ A} p^k (1 − p)^{n−k}
         = \binom{n}{k} p^k (1 − p)^{n−k} .
In other words, X ∼ Bin(n, p). Compare this to what we did in Example 1.15
on page 27.
Remark 4.1 If fX1 ,...,Xn denotes the joint pmf of X1 , . . . , Xn and fXi the
marginal pmf of Xi , i = 1, . . . , n, then the theorem above states that inde-
pendence of the Xi ’s is equivalent to
fX1 ,...,Xn (x1 , . . . , xn ) = fX1 (x1 ) · · · fXn (xn )
for all possible x1 , . . . , xn .
Multinomial Distribution
Example 4.4 We independently throw n balls into k urns, such that each ball
is thrown in urn i with probability pi , i = 1, . . . , k.
[Figure: urns 1, 2, 3, . . . , k; each ball lands in urn i with probability p_i.]
Let Xi be the total number of balls in urn i, i = 1, . . . , k. We show that
(X1 , . . . , Xk ) ∼ Mnom(n, p1 , . . . , pk ). Let x1 , . . . , xk be integers between 0 and
n that sum up to n. The probability that the first x1 balls fall in the first urn,
the next x2 balls fall in the second urn, etcetera, is
p_1^{x_1} p_2^{x_2} · · · p_k^{x_k} .
To find the probability that there are x1 balls in the first urn, x2 in the second,
etcetera, we have to multiply the probability above with the number of ways in
which we can fill the urns with x1 , x2 , . . . , xk balls, i.e. n!/(x1 ! x2 ! · · · xk !). This
gives (4.2).
Remark 4.3 Note that for the binomial distribution there are only two possible
urns. Also, note that for each i = 1, . . . , k, Xi ∼ Bin(n, pi ).
Joint distributions for continuous random variables are usually defined via
the joint pdf. The results are very similar to the discrete case discussed in
Section 4.1.1. Compare this section also with the 1-dimensional case in Sec-
tion 2.2.2.
for all a1 , . . . , bn .
We sometimes write fX1 ,...,Xn instead of f to show that this is the pdf of the
random variables X1 , . . . , Xn . Or, if X is the corresponding random vector, we
could write fX instead.
Note that if the joint pdf is given, then in principle we can calculate all proba-
bilities. Specifically, in the 2-dimensional case we have
P((X, Y) ∈ B) = \int\int_{(x,y) ∈ B} f(x, y) dx dy ,   (4.3)
for any subset B of R^2. Thus, the calculation of probabilities
is reduced to integration.
Similarly to the discrete case, if X1 , . . . , Xn have joint pdf f , then the (individ-
ual, or marginal) pdf of each Xi can be found by integrating f over all other
variables. For example, in the two-dimensional case
fX(x) = \int_{−∞}^{∞} f(x, y) dy .
However, we usually cannot reconstruct the joint pdf from the marginal pdf’s
unless we assume that the random variables are independent. The defi-
nition of independence is exactly the same as for discrete random variables,
see Definition 4.2. But, more importantly, we have the following analogue of
Theorem 4.1.
Example 4.5 Consider the experiment where we select randomly and indepen-
dently n points from the interval [0,1]. We can carry this experiment out using
a calculator or computer, using the random generator. On your calculator this
means pushing the RAN# or Rand button. Here is a possible outcome, or
realisation, of the experiment, for n = 12.
0.9451226800 0.2920864820 0.0019900900 0.8842189383 0.8096459523
0.3503489150 0.9660027079 0.1024852543 0.7511286891 0.9528386400
0.2923353821 0.0837952423
A model for this experiment is: Let X1 , . . . , Xn be independent random vari-
ables, each with a uniform distribution on [0,1]. The joint pdf of X1 , . . . , Xn is
very simple, namely
f (x1 , . . . , xn ) = 1, 0 ≤ x1 ≤ 1, . . . , 0 ≤ xn ≤ 1 ,
(and 0 else). In principle we can now calculate any probability involving the
Xi's. For example, for the case n = 2, what is the probability
P( (X1^2 + X2^2) / (X1 X2) > sin(X1 − X2) ) ?
The answer, by (4.3), is
\int\int_A 1 dx1 dx2 = Area(A),
where
A = { (x1, x2) ∈ [0, 1]^2 : (x1^2 + x2^2) / (x1 x2) > sin(x1 − x2) } .
(Here [0, 1]^2 is the unit square in R^2.)
Remark 4.4 The type of model used in the previous example, i.e., X1 , . . . , Xn
are independent and all have the same distribution, is the most widely used
model in statistics. We say that X1 , . . . , Xn is a random sample of size n,
from some given distribution. In Example 4.5 X1 , . . . , Xn is a random sample
from a U[0, 1]-distribution. In Example 4.3 we also had a random sample, this
time from a Ber(p)-distribution. The common distribution of a random sample
is sometimes called the sampling distribution.
Using the computer we can generate the outcomes of random samples from
many (sampling) distributions. In Figure 4.1 the outcomes of two random
samples, both of size 1000, are depicted in a histogram. Here the x-axis is
divided into 20 intervals, and the number of points in each interval is counted.
The first sample is from the U[0, 1]-distribution, and the second sample is from
the N(1/2, 1/12)-distribution. The matlab commands are:
figure(1)
hist(rand(1,1000),20)
figure(2)
hist(1/2 + randn(1,1000)*sqrt(1/12),20)
Note that the true expectation and variance of the distributions are the same.
However, the “density” of points in the two samples is clearly different, and
follows the shape of the corresponding pdf’s.
Figure 4.1: A histogram of a random sample of size 1000 from the U[0, 1]-
distribution (above) and the N(1/2, 1/12)-distribution (below).
4.2 Expectation
Similar to the 1-dimensional case, the expected value of any real-valued function
of X1 , . . . , Xn is the weighted average of all values that this function can take.
Example 4.6 Let X and Y be continuous, possibly dependent, r.v.’s with joint
pdf f . Then,
E(X + Y) = \int_{−∞}^{∞}\int_{−∞}^{∞} (x + y) f(x, y) dx dy
         = \int_{−∞}^{∞}\int_{−∞}^{∞} x f(x, y) dx dy + \int_{−∞}^{∞}\int_{−∞}^{∞} y f(x, y) dx dy
         = \int_{−∞}^{∞} x fX(x) dx + \int_{−∞}^{∞} y fY(y) dy
         = EX + EY .
Y = a + b1 X1 + b2 X2 + · · · + bn Xn
EY = a + b1 EX1 + · · · + bn EXn
= a + b1 µ1 + · · · + bn µn
Notice that we do not use anywhere the independence of the {Xi }. Now, let
X ∼ Hyp(n, r, N ), and let p = r/N . We can think of X as the total number of
red balls when n balls are drawn from an urn with r red balls and N − r other
balls. Without loss of generality we may assume the balls are drawn one-by-one.
Let Xi = 1 if the i-th ball is red, and 0 otherwise. Then, again X = X1 + · · · + Xn,
and each Xi ∼ Ber(p), but now the {Xi} are dependent. However, this does not
affect the result (4.5), so that the expectation of X is np = nr/N .
Proof. We prove it only for the 2-dimensional continuous case. Let f denote
the joint pdf of X and Y , and fX and fY the marginals. Then, f (x, y) =
fX (x)fY (y) for all x, y. Thus
EXY = \int\int x y f(x, y) dx dy = \int\int x y fX(x) fY(y) dx dy
    = \left( \int x fX(x) dx \right) \left( \int y fY(y) dy \right) = EX EY .
Example 4.8 Let X ∼ Poi(λ), then we saw in Example 2.8 on page 38 that
its PGF is given by
G(z) = e^{−λ(1−z)} .   (4.6)
Now let Y ∼ Poi(µ) be independent of X. Then, the PGF of X + Y is given by
Example 4.9 The MGF of X ∼ Gam(α, λ) is given, see (2.13) on page 52, by
E[e^{sX}] = \left( \frac{λ}{λ − s} \right)^α .
As a special case, the moment generating function of the Exp(λ) distribution is
given by λ/(λ − s). Now let X1 , . . . , Xn be iid Exp(λ) random variables. The
MGF of Sn = X1 + · · · + Xn is
E[e^{s Sn}] = E[e^{sX1} · · · e^{sXn}] = E[e^{sX1}] · · · E[e^{sXn}] = \left( \frac{λ}{λ − s} \right)^n ,
which shows that Sn ∼ Gam(n, λ).
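A quick Monte Carlo check of this fact (a sketch of my own; it only compares the sample mean and variance of Sn with the Gam(n, λ) values n/λ and n/λ²):

lambda = 2; n = 5; N = 100000;
X = -log(rand(N,n))/lambda;      % N x n matrix of iid Exp(lambda) variables (inverse transform)
S = sum(X,2);                    % N realisations of S_n = X_1 + ... + X_n
[mean(S) var(S)]                 % compare with n/lambda = 2.5 and n/lambda^2 = 1.25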
The covariance is a measure for the amount of linear dependency between the
variables. If small values of X (smaller than the expected value of X) go
together with small values of Y , and at the same time large values of X go
together with large values of Y , then Cov(X, Y ) will be positive. If on the other
hand small values of X go together with large values of Y , and large values of
X go together with small values of Y , then Cov(X, Y ) will be negative.
For easy reference we list some important properties of the variance and covari-
ance. The proofs follow directly from the definitions of covariance and variance
and the properties of the expectation.
1. Var(X) = EX 2 − (EX)2 .
2. Var(aX + b) = a2 Var(X).
3. Cov(X, Y ) = EXY − EXEY .
4. Cov(X, Y ) = Cov(Y, X).
5. Cov(aX + bY, Z) = a Cov(X, Z) + b Cov(Y, Z).
6. Cov(X, X) = Var(X).
7. Var(X + Y ) = Var(X) + Var(Y ) + 2 Cov(X, Y ).
8. X and Y independent =⇒ Cov(X, Y ) = 0.
7. By property 6. we have
Var(X + Y ) = Cov(X + Y, X + Y ) .
Proof. Let a be an arbitrary real number, and denote the standard deviations
of X and Y by σX and σY . Obviously the variance of ±aX + Y is always
non-negative. Thus, using the properties of covariance and variance
Var(−aX + Y) = a^2 σ_X^2 + σ_Y^2 − 2a Cov(X, Y) ≥ 0 .
Y = a + b1 X1 + b2 X2 + · · · + bn Xn
Var(Y) = Cov(b1 X1 + b2 X2 + · · · + bn Xn , b1 X1 + b2 X2 + · · · + bn Xn)
       = \sum_{i=1}^{n} Cov(bi Xi , bi Xi) + 2 \sum_{i<j} Cov(bi Xi , bj Xj) .
Since Cov(bi Xi , bi Xi) = b_i^2 Var(Xi) and all covariance terms are zero because of
the independence of Xi and Xj (i ≠ j), the result follows.
For the hypergeometric case we must include the covariance terms as well:
Var(X) = Var(X1) + · · · + Var(Xn) + 2 \sum_{i<j} Cov(Xi , Xj) .
Definition 4.6 For any random vector X we define the expectation vector
as the vector of expectations
The covariance matrix Σ is defined as the matrix whose (i, j)th element is Cov(Xi , Xj). In vector notation we can write
µ = EX
and
Σ = E(X − µ)(X − µ)T .
Note that µ and Σ take the same role as µ and σ 2 in the 1-dimensional case.
We sometimes write µX and ΣX if we wish to emphasise that µ and Σ belong
to the vector X.
uT Σ u ≥ 0 .
To see this, suppose Σ is the covariance matrix of some random vector X with
expectation vector µ. Write Y = X − µ. Then
u^T Σ u = u^T E[Y Y^T] u = E[u^T Y Y^T u] = E[(Y^T u)^T (Y^T u)] = E[(Y^T u)^2] ≥ 0 .
Suppose X and Y are both discrete or both continuous, with joint pmf/pdf
fX,Y , and suppose fX (x) > 0. The conditional pdf/pmf of Y given X = x is
defined as
   fY|X (y | x) = fX,Y (x, y) / fX (x) ,    for all y .                    (4.7)
We can interpret fY |X (· | x) as the pmf/pdf of Y given the information that
X takes the value x. For discrete random variables the definition is simply a
consequence or rephrasing of the conditioning formula (1.4) on page 23. Namely,
   fY|X (y | x) = P(Y = y | X = x) = P(X = x, Y = y) / P(X = x) = fX,Y (x, y) / fX (x) .
Example 4.11 We draw “uniformly” a point (X, Y ) from the 10 points on the
triangle D below, so that each point is equally likely to be drawn. The joint
and marginal pmf’s are easy to determine. The joint pmf is

   fX,Y (x, y) = P(X = x, Y = y) = 1/10 ,    (x, y) ∈ D ,

[Figure: the triangle D of 10 points (x, y) with 1 ≤ x ≤ y ≤ 4.]

and the marginal pmf’s are

   fX (x) = P(X = x) = (5 − x)/10 ,    x ∈ {1, 2, 3, 4},

and

   fY (y) = P(Y = y) = y/10 ,    y ∈ {1, 2, 3, 4}.
Clearly X and Y are not independent. In fact, if we know that X = 2, then Y
can only take the values 2, 3 or 4. The corresponding probabilities are

   P(Y = y | X = 2) = P(Y = y, X = 2) / P(X = 2) = (1/10) / (3/10) = 1/3 .

In other words, the conditional pmf of Y given X = 2 is

   fY|X (y | 2) = fX,Y (2, y) / fX (2) = 1/3 ,    y = 2, 3, 4 .
Thus, given X = 2, Y takes the values 2,3 and 4 with equal probability.
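The same calculation can be done numerically. The Matlab sketch below (added for illustration) stores the joint pmf of Example 4.11 as a 4 × 4 matrix, with rows indexed by x and columns by y, and recovers the marginals and the conditional pmf of Y given X = 2:

f = triu(ones(4))/10;         % joint pmf of Example 4.11: f(x,y) = 1/10 for 1 <= x <= y <= 4
fX = sum(f,2);                % marginal pmf of X: (5-x)/10 for x = 1,...,4
fY = sum(f,1);                % marginal pmf of Y: y/10 for y = 1,...,4
f(2,:)/fX(2)                  % conditional pmf of Y given X = 2: [0 1/3 1/3 1/3]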
When X is continuous, we can no longer directly apply (1.4) to define the
conditional density. Instead, we define first the conditional cdf of Y given
X = x as the limit
   FY (y | x) := lim_{h→0} FY (y | x < X ≤ x + h) .
Since the conditional pmf (pdf) has all the properties of a probability mass
(density) function, it makes sense to define the corresponding conditional ex-
pectation as (in the continuous case)
   E[Y | X = x] = ∫ y fY|X (y | x) dy .
Conditional pmf’s and pdf’s for more than two random variables are defined
analogously. For example, the conditional pmf of Xn given X1 , . . . , Xn−1 is
given by
   fXn|X1,...,Xn−1 (xn | x1 , . . . , xn−1 ) = P(X1 = x1 , . . . , Xn = xn ) / P(X1 = x1 , . . . , Xn−1 = xn−1 ) .
Functions of Random
Variables and Limit Theorems
Example 5.1 Let X be a continuous random variable with pdf fX , and let
Y = aX + b, where a 6= 0. We wish to determine the pdf fY of Y . We first
express the cdf of Y in terms of the cdf of X. Suppose first that a > 0. We
have for any y

   FY (y) = P(aX + b ≤ y) = P(X ≤ (y − b)/a) = FX ((y − b)/a) .

Differentiating this with respect to y gives fY (y) = fX ((y − b)/a) /a. For a < 0
we get similarly fY (y) = fX ((y − b)/a) /(−a). Thus in general

   fY (y) = (1/|a|) fX ((y − b)/a) .                                       (5.1)
As another example, let X ∼ N(0, 1) and Y = X². Then, for y > 0,

   FY (y) = P(X² ≤ y) = P(−√y ≤ X ≤ √y) = 2Φ(√y) − 1 ,

so that, by differentiation,

   fY (y) = φ(√y)/√y = (1/√(2πy)) e^{−y/2} ,    y > 0 .

This is exactly the formula for the pdf of a χ₁²-distribution. Thus Y ∼ χ₁².
Now let X1 , . . . , Xn be independent random variables, all with cdf F, and let
Z = max(X1 , . . . , Xn ) and Y = min(X1 , . . . , Xn ). Then

   P(Z ≤ z) = P(X1 ≤ z, X2 ≤ z, . . . , Xn ≤ z) = P(X1 ≤ z)P(X2 ≤ z) · · · P(Xn ≤ z),

where the second equation follows from the independence assumption. It follows
that

   FZ (z) = (F (z))^n .
Similarly,
P(Y > y) = P(X1 > y, X2 > y, . . . , Xn > y) = P(X1 > y)P(X2 > y) · · · P(Xn > y),
so that
FY (y) = 1 − (1 − F (y))n .
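As an illustration (not in the original text), the following Matlab lines estimate P(Z ≤ z) for the maximum Z of n independent U(0, 1) random variables and compare it with F(z)^n = z^n; the values n = 5 and z = 0.8 are arbitrary choices:

n = 5; N = 1e5; z = 0.8;
Z = max(rand(n,N),[],1);           % N realisations of the maximum of n U(0,1) variables
[mean(Z <= z), z^n]                % empirical P(Z <= z) vs. the exact value z^n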
To get an idea of the distribution of Z = X + Y , where X and Y are independent
U[0, 1] random variables, we can plot a histogram of 10,000 simulated values of Z,
for example via the Matlab command

hist(rand(1,10000)+rand(1,10000),50)

[Figure: histogram of 10,000 simulated values of Z = X + Y , on the interval [0, 2].]
This looks remarkably like a triangle. Perhaps the true pdf of Z = X + Y has a
triangular shape? This is indeed easily proved. Namely, first observe that the
pdf of Z must be symmetrical around 1. Thus to find the pdf, it suffices to find
its form for z ∈ [0, 1]. Take such a z. Then, see Figure 5.2,
   FZ (z) = P(Z ≤ z) = P((X, Y ) ∈ A) = ∫∫_A f (x, y) dx dy = area(A) = ½ z² ,
where we have used the fact that the joint density f (x, y) is equal to 1 on the
square [0, 1] × [0, 1]. By differentiating the cdf FZ we get the pdf fZ
fZ (z) = z, z ∈ [0, 1],
and by symmetry
fZ (z) = 2 − z, z ∈ [1, 2],
which is indeed a triangular density. If we rescaled the histogram such that the
total area under the bars would be 1, the fit with the true distribution would
be very good.
[Figure 5.2: the region A = {(x, y) ∈ [0, 1]² : x + y ≤ z}, bounded by the line x + y = z.]
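The rescaling mentioned above can be carried out as follows; this is an illustrative Matlab sketch, with the bin count 50 matching the earlier hist command:

z = rand(1,10000) + rand(1,10000);          % 10,000 realisations of Z = X + Y
[counts, centres] = hist(z, 50);            % histogram counts and bin centres
binwidth = centres(2) - centres(1);
bar(centres, counts/(10000*binwidth), 1)    % histogram rescaled to have total area 1
hold on
t = 0:0.01:2;
plot(t, min(t, 2 - t), 'r')                 % the triangular pdf f_Z
hold off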
Linear Transformations
Consider Figure 5.3. For any fixed x, let z = Ax. Hence, x = A−1 z. Consider
the n-dimensional cube C = [z1 , z1 + h] × · · · × [zn , zn + h]. Let D be the image
of C under A−1 , i.e., the parallelepiped of all points x such that Ax ∈ C. Then,
   P(Z ∈ C) ≈ h^n fZ (z) .

Now recall from linear algebra that any n-dimensional rectangle with “volume”
V is transformed into an n-dimensional parallelepiped with volume V |A|, where
|A| := | det(A)|. Hence D has volume h^n /|A|, so that also

   P(Z ∈ C) = P(X ∈ D) ≈ fX (x) h^n /|A| .

Equating the two expressions and substituting x = A^{−1} z gives

   fZ (z) = fX (A^{−1} z) / |A| ,    z ∈ Rⁿ .                              (5.4)
[Figure 5.3: the linear map A sends the parallelepiped D in x-space to the cube C in z-space; A^{−1} maps C back to D.]
General Transformations
We can apply the same technique as for the linear transformation to general
transformations x ↦ g(x), written out:

   (x1 , x2 , . . . , xn ) ↦ (g1 (x), g2 (x), . . . , gn (x)) .
For example, let X and Y be independent standard normal random variables, and
let (R, Θ) be the corresponding polar coordinates. The joint pdf fR,Θ of R and Θ
is given by

   fR,Θ (r, θ) = (1/(2π)) r e^{−r²/2} ,    for r ≥ 0 and θ ∈ [0, 2π).

Namely, specifying x and y in terms of r and θ gives

   x = r cos θ    and    y = r sin θ ,

a transformation with Jacobian determinant r.
The result now follows from the transformation rule (5.5), noting that the joint
pdf of X and Y is fX,Y (x, y) = (1/(2π)) e^{−(x²+y²)/2}. It is not difficult to verify that
R and Θ are independent, that Θ ∼ U[0, 2π) and that P(R > r) = e^{−r²/2}. This
means that R has the same distribution as √V , with V ∼ Exp(1/2). Namely,
P(√V > v) = P(V > v²) = e^{−v²/2}. Both Θ and R are easy to generate, and
are transformed via (5.6) to independent standard normal random variables.
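Written out in Matlab, this generation method (often called the Box–Muller approach) might look as follows; the sketch below is an illustration, not code from the notes, and uses 10^5 samples as an arbitrary choice:

N = 1e5;
Theta = 2*pi*rand(1,N);            % Theta ~ U[0, 2*pi)
R = sqrt(-2*log(rand(1,N)));       % R = sqrt(V) with V ~ Exp(1/2), via V = -2*log(U)
X = R.*cos(Theta);                 % X and Y are independent N(0,1) random variables
Y = R.*sin(Theta);
[mean(X), var(X), mean(Y), var(Y)] % should be close to 0, 1, 0 and 1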
Let X1 , . . . , Xn be independent N(0, 1) random variables and write X = (X1 , . . . , Xn )^T.
In analogy to the 1-dimensional case Z = µ + σX, we now define

   Z = µ + B X ,                                                           (5.8)
for some (m×n) matrix B. Note that by Theorem 5.1 Z has expectation vector
µ and covariance matrix Σ = BB T . Any random vector of the form (5.8) is
said to have a jointly normal (or multi-variate normal) distribution. We write
Z ∼ N(µ, Σ).
Write Y = B X. If B is an invertible n × n matrix, it follows from (5.4) that

   fY (y) = 1/√( (2π)^n |Σ| ) e^{−½ y^T Σ^{−1} y} .
Because Z is obtained from Y by simply adding a constant vector µ, we have
fZ (z) = fY (z − µ), and therefore
   fZ (z) = 1/√( (2π)^n |Σ| ) e^{−½ (z−µ)^T Σ^{−1} (z−µ)} ,    z ∈ Rⁿ .    (5.9)
[Figure: the pdf of the bivariate standard normal distribution for correlation coefficients ρ = 0, ρ = 0.5 and ρ = 0.9.]
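A common way to sample from (5.9) is to take B in (5.8) to be a matrix square root of Σ, for instance a Cholesky factor. The Matlab sketch below is an illustration with an arbitrarily chosen µ and Σ:

mu = [1; 2];                       % an arbitrary expectation vector (assumption)
Sigma = [1 0.5; 0.5 2];            % an arbitrary covariance matrix (assumption)
B = chol(Sigma,'lower');           % Cholesky factor, so that B*B' = Sigma
N = 1e4;
X = randn(2,N);                    % columns are independent standard normal vectors
Z = repmat(mu,1,N) + B*X;          % columns of Z are draws from N(mu, Sigma)
plot(Z(1,:), Z(2,:), '.')          % scatterplot of the generated points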
One of the most (if not the most) important properties of the normal distri-
bution is that linear combinations of independent normal random variables are
normally distributed. Here is a more precise formulation.
Let X1 , . . . , Xn be independent random variables with Xi ∼ N(µi , σi²), i = 1, . . . , n.
Then, for any numbers a, b1 , . . . , bn ,

   Y = a + Σ_{i=1}^n bi Xi  ∼  N( a + Σ_{i=1}^n bi µi ,  Σ_{i=1}^n bi² σi² ) .        (5.13)
Proof. The easiest way to prove this is by using moment generating functions.
First, recall that the MGF of a N(µ, σ 2 )-distributed random variable X is given
by
   MX (s) = e^{µ s + ½ σ² s²} .
Let MY be the moment generating function of Y . Since X1 , . . . , Xn are inde-
pendent, we have
   MY (s) = E exp{ a s + Σ_{i=1}^n bi Xi s }
          = e^{as} Π_{i=1}^n MXi (bi s)
          = e^{as} Π_{i=1}^n exp{ µi (bi s) + ½ σi² (bi s)² }
          = exp{ s a + s Σ_{i=1}^n bi µi + ½ Σ_{i=1}^n bi² σi² s² } ,

which is the MGF of the N( a + Σ_{i=1}^n bi µi , Σ_{i=1}^n bi² σi² ) distribution. This
completes the proof.
Remark 5.2 Note that from Theorems 4.3 and 4.6 we had already established
the expectation and variance of Y in (5.13). But we have now found that the
distribution is normal.
Example 5.8 A machine produces ball bearings with a N(1, 0.01) diameter
(cm). The balls are placed on a sieve with a N(1.1, 0.04) diameter. The diameters
of the balls and the sieve are assumed to be independent of each other.
Question: What is the probability that a ball will fall through?
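One way to answer this (interpreting “falls through” as the ball diameter being smaller than the sieve diameter): by (5.13) the difference between the sieve and ball diameters is N(0.1, 0.05), so the probability is Φ(0.1/√0.05) ≈ 0.67. The Matlab sketch below checks this by simulation; the sample size is an arbitrary choice:

N = 1e6;
Dball  = 1   + 0.1*randn(1,N);              % ball diameters:  N(1, 0.01), st. dev. 0.1
Dsieve = 1.1 + 0.2*randn(1,N);              % sieve diameters: N(1.1, 0.04), st. dev. 0.2
mean(Dball < Dsieve)                        % Monte Carlo estimate of P(ball falls through)
0.5*(1 + erf((0.1/sqrt(0.05))/sqrt(2)))     % exact value Phi(0.1/sqrt(0.05)), about 0.67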
In this section we briefly discuss two of the main results in probability: the
Law of Large Numbers (LLN) and the Central Limit Theorem (CLT). Both are
about sums of independent random variables.
Let X1 , X2 , . . . be i.i.d. random variables and write Sn = X1 + · · · + Xn . Suppose
EXi = µ and Var(Xi ) = σ². We assume that both µ and σ² are finite.
By the rules for expectation and variance we know that
ESn = n EX1 = nµ
and
Var(Sn ) = n Var(X1 ) = nσ 2 .
Moreover, if the Xi have moment generating function M , then the MGF of Sn
is simply given by
   E e^{s(X1 +···+Xn )} = E e^{sX1} · · · E e^{sXn} = [M (s)]^n .
The law of large numbers roughly states that Sn /n is close to µ, for large n.
Here is a more precise statement.
Proof. First, for any z > 0, and any positive random variable Z we have
   EZ = ∫_0^z t f (t) dt + ∫_z^∞ t f (t) dt ≥ ∫_z^∞ t f (t) dt
      ≥ ∫_z^∞ z f (t) dt = z P(Z ≥ z) ,

so that P(Z ≥ z) ≤ EZ/z (Markov’s inequality).
There is also a strong law of large numbers, which implies the weak law,
but is more difficult to prove. It states the following:
   P( lim_{n→∞} Sn /n = µ ) = 1 .
The Central Limit Theorem says something about the approximate distribution
of Sn (or Sn /n). Roughly it says this:
To see the CLT in action consider Figure 5.4. The first picture shows the pdf’s
of S1 , . . . , S4 for the case where the Xi have a U[0, 1] distribution. The second
shows the same, but now for an Exp(1) distribution. We clearly see convergence
to a bell shaped curve.
Figure 5.4: Illustration of the CLT for the uniform and exponential distribution
The CLT is not restricted to continuous distributions. For example, Figure 5.5
shows the cdf of S30 in the case where the Xi have a Bernoulli distribution with
success probability 1/2. Note that S30 ∼ Bin(30, 1/2), see Example 4.3.
Figure 5.5: The cdf of the Bin(30, 1/2)-distribution and its normal approximation.
In general we have: if X1 , X2 , . . . are i.i.d. with expectation µ and variance σ² < ∞,
then for large n the distribution of Sn is approximately N(nµ, nσ²). More precisely,

   P( (Sn − nµ)/(σ√n) ≤ x ) → Φ(x) ,    as n → ∞,

for all x, where Φ is the cdf of the standard normal distribution.
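The following Matlab sketch (an illustration, not from the text) shows the CLT at work in the spirit of Figure 5.4: it plots a density-scaled histogram of Sn for n = 30 summands with a U[0, 1] distribution, together with the approximating N(n/2, n/12) pdf:

n = 30; N = 1e5;
S = sum(rand(n,N),1);                       % N realisations of S_n with U[0,1] summands
[counts, centres] = hist(S, 50);
binwidth = centres(2) - centres(1);
bar(centres, counts/(N*binwidth), 1)        % density-scaled histogram of S_n
hold on
mu = n/2; sig2 = n/12;                      % ES_n = n/2 and Var(S_n) = n/12
x = mu-4*sqrt(sig2) : 0.01 : mu+4*sqrt(sig2);
plot(x, exp(-(x-mu).^2/(2*sig2))/sqrt(2*pi*sig2), 'r')   % the N(n/2, n/12) pdf
hold off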
(a) Find the sample space, if we observe the exact sequences of Heads
(= 1) and Tails (= 0).
(b) Find the sample space, if we observe only the total number of Heads.
(a) How many possible outcomes of the experiment are there, if we put
each ball back into the urn before we draw the next?
(b) Answer the same question as above, but now if we don’t put the balls
back.
(c) Calculate the probability that in case (a) we draw 3 times the same
ball.
4. Let P(A) = 0.9 and P(B) = 0.8. Show that P(A ∩ B) ≥ 0.7.
5. What is the probability that none of 54 people in a room share the same
birthday?
(a) Find the probability that both dice show the same face.
(b) Find the same probability, using the extra information that the sum
of the dice is not greater than 4.
7. We draw 3 cards from a full deck of cards, noting the order. Number the
cards from 1 to 52.
(a) Give the sample space. Is each elementary event equally likely?
(b) What is the probability that we draw 3 Aces?
(c) What is the probability that we draw 1 Ace, 1 King and 1 Queen?
(d) What is the probability that we draw no pictures (no A,K,Q,J)?
8. We draw at random a number in the interval [0,1] such that each number
is “equally likely”. Think of the random generator on your calculator.
(a) Determine the probability that we draw a number less than 1/2.
(b) What is the probability that we draw a number between 1/3 and
3/4?
(c) Suppose we do the experiment two times (independently), giving us
two numbers in [0,1]. What is the probability that the sum of these
numbers is greater than 1/2? Explain your reasoning.
10. We draw 4 cards (at the same time) from a deck of 52, not noting the
order. Calculate the probability of drawing one King, Queen, Jack and
Ace.
11. In a group of 20 people there are three brothers. The group is separated at
random into two groups of 10. What is the probability that the brothers
are in the same group?
12. How many binary vectors are there of length 20 with exactly 5 ones?
13. We draw at random 5 balls from an urn with 20 balls (numbered 1,. . . ,20),
without replacement or order. How many different possible combinations
are there?
15. Consider the following system. Each component has a probability 0.1 of
    failing. What is the probability that the system works?

    [Figure: component 1 in series with the parallel combination of components 2 and 3.]
16. Two fair dice are thrown and the smallest of the face values, Z say, is
noted.
    (a) Give the pmf of Z in table form:

          z        : *  *  *  ...
          P(Z = z) : *  *  *  ...
(b) Calculate the expectation of Z.
17. In a large population 40% votes for A and 60% for B. Suppose we select
at random 10 people. What is the probability that in this group exactly
4 people will vote for A?
6. (a) 1/6.
(b) Let A be the event that the dice show the same face, and B the event
that the sum is not greater than 4. Then B = {(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (3, 1)},
and A ∩ B = {(1, 1), (2, 2)}. Hence, P(A | B) = 2/6 = 1/3.
7. (a) Ω = {(1, 2, 3), . . . , (52, 51, 50)}. Each elementary event is equally
likely.
    (b) (4 × 3 × 2)/(52 × 51 × 50).
    (c) (12 × 8 × 4)/(52 × 51 × 50).
    (d) (36 × 35 × 34)/(52 × 51 × 50).
8. (a) 1/2.
(b) 5/12.
(c) 7/8.
10. |Ω| = (52 choose 4), with all outcomes equally likely. Note that the outcomes
    are represented as unordered sets, e.g., {1, 2, 3, 4}. Let A be the event of
    drawing one K, Q, J and Ace each. Then, |A| = 4 × 4 × 4 × 4 = 4⁴.
    Thus, P(A) = 4⁴ / (52 choose 4).
12. (20 choose 5), because we have to choose the 5 positions for the 1s, out of 20
    positions.
13. (20 choose 5).
14. Let B be the event that a 1 was sent, and A the event that a 1 is received.
    Then, P(A | B) = 0.95, and P(Ac | B c ) = 0.90. Thus, P(Ac | B) = 0.05 and
    P(A | B c ) = 0.10. Moreover, P(B) = 2/3 and P(B c ) = 1/3. By Bayes:

       P(B | A) = P(A | B) P(B) / ( P(A | B) P(B) + P(A | B c ) P(B c ) )
                = (0.95 × 2/3) / (0.95 × 2/3 + 0.10 × 1/3) = 0.95 .
15. Let Ai be the event that component i works, i = 1, 2, 3, and let A be the
    event that the system works. We have A = A1 ∩ (A2 ∪ A3 ). Thus, by the
    independence of the Ai ’s:

       P(A) = P(A1 ) (1 − P(Ac2 )P(Ac3 )) = 0.9 × (1 − 0.1 × 0.1) = 0.891 .
16. (a)   z        :  1      2      3      4      5      6
          P(Z = z) :  11/36  9/36   7/36   5/36   3/36   1/36
(b) E[Z] = 1×11/36+2×9/36+3×7/36+4×5/36+5×3/36+6×1/36.
17. Let X be the number that vote for A. Then X ∼ Bin(10, 0.4). Hence,
    P(X = 4) = (10 choose 4) (0.4)⁴ (0.6)⁶.
18. The region where the largest coordinate is less than z is a square with
area z 2 (make a picture). Divide this area by the area of the unit square
(1), to obtain P(Z ≤ z) = z 2 , for all z ∈ [0, 1]. Thus,
    F (z) = P(Z ≤ z) = 0 for z < 0,  z² for 0 ≤ z ≤ 1,  and 1 for z > 1.
19. (a) f (x) = 3x² for 0 ≤ x ≤ 1, and f (x) = 0 otherwise.
    (b) ∫_{1/2}^{3/4} f (x) dx = F (3/4) − F (1/2) = (3/4)³ − (1/2)³.
    (c) E[X] = ∫_0^1 x · 3x² dx = 3 ∫_0^1 x³ dx = 3/4.
(a) What is the probability that the second die is 3, given that the sum
of the dice is 6?
(b) What is the probability that the first die is 3 and the second not 3?
(a) What is the probability that box 1 has 2, box 2 has 5 and box 3 has
3 balls?
(b) What is the probability that box 1 remains empty?
6. Consider again the experiment where we throw two fair dice one after the
other. Let the random variable Z denote the sum of the dice.
12. Let X ∼ N(0, 1). Find P(X ≤ 1.4) from the table of the N(0, 1) distribu-
tion. Also find P(X > −1.3).
13. Let Y ∼ N(1, 4). Find P(Y ≤ 3), and P(−1 ≤ Y ≤ 2).
15. We draw at random 5 numbers from 1,. . . 100, with replacement (for ex-
ample, drawing number 9 twice is possible). What is the probability that
exactly 3 numbers are even?
16. We draw at random 5 numbers from 1,. . . 100, without replacement. What
is the probability that exactly 3 numbers are even?
19. A random variable X takes the values 0, 2, 5 with probabilities 1/2, 1/3, 1/6,
respectively. What is the expectation of X?
21. We draw at random 10 balls from an urn with 25 red and 75 white balls.
What is the expected number of red balls amongst the 10 balls drawn?
Does it matter if we draw the balls with or without replacement?
24. We repeatedly throw two fair dice until two sixes are thrown. What is
the expected number of throws required?
25. Suppose we divide the population of Brisbane (say 1,000,000 people) ran-
domly in groups of 3.
(a) How many groups would you expect there to be in which all persons
have the same birthday?
(b) What is the probability that there is at least one group in which all
persons have the same birthday?
(a) What is the probability that the component is still functioning after
2 years?
(b) What is the probability that the component is still functioning after
10 years, given it is still functioning after 7 years?
27. Let X ∼ N(0, 1). Prove that Var(X) = 1. Use this to show that Var(Y ) =
σ 2 , for Y ∼ N(µ, σ 2 ).
28. Let X ∼ Exp(1). Use the moment generating function to show that
E[X n ] = n!.
29. Explain how to generate random numbers from the Exp(10) distribution.
Sketch a graph of the scatterplot of 10 such numbers.
30. Explain how to generate random numbers from the U[10, 15] distribution.
Sketch a graph of the scatterplot of 10 such numbers.
31. Suppose we can generate random numbers from the N(0, 1) distribution,
e.g., via Matlab's randn function. How can we generate from the N(3, 9)
distribution?
2. (a) 1/5 (conditional probability; the possible outcomes are (1, 5), (2, 4), . . . , (5, 1),
       and in only one of these the second die is 3).
   (b) 1/6 × 5/6 (independent events).
3. (a) (20 choose 10)/2^20 ≈ 0.176197.
   (b) Σ_{k=15}^{20} (20 choose k)/2^20 ≈ 0.0206947.
4. (a) (35/36)⁹ × (1/36) ≈ 0.021557. (geometric formula)
   (b) 1 − (35/36)^100 ≈ 0.94022.
5. (a) 10!/(2! 5! 3!) × (1/4)² (1/2)⁵ (1/4)³ = 315/4096 ≈ 0.076904. (multinomial)
   (b) (3/4)^10 ≈ 0.0563135.
11. Y ∼ N(1, 4). (affine transformation of a normal r.v. gives again a normal
r.v.)
12. P(X ≤ 1.4) = Φ(1.4) ≈ 0.9192. P(X > −1.3) = P(X < 1.3) (by symme-
try of the pdf — make a picture). P(X < 1.3) = P(X ≤ 1.3) = Φ(1.3) ≈
0.9032.
15. Let X be the total number of “even” numbers. Then, X ∼ Bin(5, 1/2),
    and P(X = 3) = (5 choose 3)/32 = 10/32 = 0.3125.
16. Let X be the total number of “even” numbers. Then X ∼ Hyp(5, 50, 100).
    Hence P(X = 3) = (50 choose 3)(50 choose 2)/(100 choose 5) = 6125/19206 ≈ 0.318911.
17. λ = 3600/100 = 36.
19. 3/2.
20. 5
21. It does matter for the distribution of the number of red balls, which is Bin(10, 1/4)
if we replace and Hyp(10, 25, 100) if we don’t replace. However the expectation
is the same for both cases: 2.5.
22. Var(X) = EX² − (EX)². By symmetry EX = 1/2, and EX² = ∫_0^1 x² dx = 1/3.
    So Var(X) = 1/12.
25. Let N be the number of groups in which each person has the same birthday.
Then N ∼ Bin(333333, 1/3652 ). Hence (a) EN ≈ 2.5, and (b) P(N > 0) = 1 −
P(N = 0) = 1 − (1 − 1/3652 )333333 ≈ 0.92. [Alternatively N has approximately
a Poi(2.5) distribution, so P(N = 0) ≈ e−2.5 , which gives the same answer 0.92.]
28. M (s) = λ/(λ − s). Differentiate: M 0 (s) = λ/(λ − s)2 , M 00 (s) = 2λ/(λ − s)3 ,
. . . , M (n) (s) = n! λ/(λ − s)n+1 . Now apply the moment formula.
29. Draw U ∼ U(0, 1). Return X = −(1/10) log U.
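In Matlab this recipe can be carried out directly; the sketch below (added for illustration) generates the 10 numbers asked for in exercise 29 and plots them:

U = rand(1,10);                    % ten U(0,1) random numbers
X = -log(U)/10;                    % ten Exp(10) random numbers (mean about 1/10)
plot(X,'o')                        % scatterplot of the generated numbers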
(a) Let M be the smallest of the n numbers, and X̄ the average of the
n numbers. Express M and X̄ in terms of X1 , . . . , Xn .
           y:   1     3     6     8
   x = 2:       0     0.1   0.1   0
   x = 5:       0.2   0     0     0
   x = 6:       0     0.2   0.1   0.3
(a) How would you estimate the unknown parameter µ if you had the
data x1 , . . . , xn ?
(b) Show that EX̄ = µ.
(c) Show that VarX̄ goes to 0 as n → ∞.
(d) What is the distribution of X̄?
(e) Discuss the implications of (b) and (c) for your estimation of the
unknown µ.
10. Let X ∼ Bin(100, 1/2). Approximate, using the CLT, the probability
P(X ≥ 60).
11. Let X have a uniform distribution on the interval [1, 3]. Define Y = X 2 −4.
Derive the probability density function (pdf) of Y . Make sure you also
specify where this pdf is zero.
14. Some revision questions. Please make sure you can comfortably answer
the questions below by heart.
(a) Give the formula for the pmf of the following distributions:
i. Bin(n, p),
ii. G(p),
iii. Poi(λ).
(b) Give the formula for the pdf of the following distributions:
i. U(a, b),
ii. Exp(λ),
iii. N(0, 1).
16. A lift can carry a maximum of 650 kg. Suppose that the weight of a person
    is normally distributed with expectation 75 kg and standard deviation 10
    kg. Let Zn be the total weight of n randomly selected persons.
    (a) What is the probability that Z8 exceeds the maximum carrying weight?
    (b) What is the largest n for which P(Zn ≥ 650) is less than 0.01?
17. The thickness of a printed circuit board is required to lie between the
specification limits of 0.150 - 0.004 and 0.150 + 0.004 cm. A machine
produces circuit boards with a thickness that is normally distributed with
mean 0.151 cm and standard deviation 0.003 cm.
(a) What is the probability that the thickness X of a circuit board which
is produced by this machine falls within the specification limits?
(b) Now consider the mean thickness X̄ = (X1 +· · ·+X25 )/25 for a batch
of 25 circuit boards. What is the probability that this batch mean
will fall within the specification limits? Assume that X1 , . . . , X25
are independent random variables with the same distribution as X
above.
18. We draw n numbers independently and uniformly from the interval [0,1]
and note their sum Sn .
19. Consider the following game: You flip 10 fair coins, all at once, and count
how many Heads you have. I’ll pay you out the squared number of Heads,
in dollars. However, you will need to pay me some money in advance. How
much would you be prepared to give me if you could play this game as many
times as you’d like?
(b) The joint pdf of U and V follows from the transformation rule:
   fU,V (u, v) = fX,Y (x, y) u = u e^{−(x+y)} = u e^{−u} ,    for u ≥ 0 and 0 ≤ v ≤ 1.
8. Note that f (x, y) can be written as the product of f1 (x) = e−x , x ≥ 0 and
f2 (y) = 2e−2y , y ≥ 0. It follows that X and Y are independent random
variables, and that X ∼ Exp(1) and Y ∼ Exp(2).
10. Similar to question 6: P(X ≥ 60) ≈ P(Y ≥ 60), with Y ∼ N(50, 25).
Moreover, P(Y ≥ 60) = P(Z ≥ (60 − 50)/5) = P(Z ≥ 2) = 1 − Φ(2) =
0.02275, where Z ∼ N(0, 1).
11. First draw the graph of the function y = x² − 4 on the interval [1,3].
    Note that the function is increasing from −3 to 5. To find the pdf, first
    calculate the cdf: for −3 ≤ y ≤ 5,

       FY (y) = P(X² − 4 ≤ y) = P(X ≤ √(y + 4)) = (√(y + 4) − 1)/2 ,

    since X ∼ U[1, 3]. Differentiating gives fY (y) = 1/(4√(y + 4)) for −3 ≤ y ≤ 5,
    and fY (y) = 0 otherwise.
13. (a) [Graph of a N(2, 5) pdf: a bell-shaped curve centred at 2, plotted for arguments between −4 and 10.]
    (b) P(Y ≤ 5) = Φ((5 − 2)/√5) ≈ 0.9101.
    (c) Z ∼ N(3 × 2 − 4, 3² × 5) = N(2, 45). P(1 ≤ Z ≤ 5) = P((1 − 2)/√45 ≤ V ≤ (5 − 2)/√45),
        with V ∼ N(0, 1). The latter probability is equal to
        Φ(3/√45) − (1 − Φ(1/√45)) ≈ 0.2319.
15. (a) [Two plots: Xn against n and Yn against n, for n = 0, . . . , 100.]
16. (a) Z8 ∼ N(8 × 75, 8 × 100) = N(600, 800). P(Z8 ≥ 650) = 1 − P(Z8 ≤ 650)
        = 1 − Φ((650 − 600)/√800) = 1 − Φ(1.7678) ≈ 0.0385.
    (b) P(Zn ≥ 650) = 1 − Φ((650 − 75n)/√(100n)). For n = 8 the probability
        is 0.0385. For n = 7 it is much smaller than 0.01. So the largest
        such n is n = 7.
(b) N(10, 20/12), because the expectation of U(0, 1) is 1/2 and the vari-
ance is 1/12.
    (c) P(X̄ > 0.6) = P(X1 + · · · + X20 > 12) ≈ P(Y > 12), with Y ∼ N(10, 20/12).
        Now, P(Y > 12) = 1 − P(Y ≤ 12) = 1 − Φ((12 − 10)/√(20/12)) = 1 − Φ(1.5492) ≈ 0.0607.
Sample Exams
B.1 Exam 1
1. Two fair dice are thrown and the sum of the face values, Z say, is noted.
    (a) Give the pmf of Z in table form: [4]

          z        : *  *  *  ...
          P(Z = z) : *  *  *  ...
(b) Calculate the variance of Z. [4]
(c) Consider the game, where a player throws two fair dice, and is paid
Y = (Z − 7)2 dollars, with Z the sum of face values. To enter the
game the player is required to pay 5 dollars.
What is the expected profit (or loss) of the player, if he/she plays
the game 100 times (each time paying 5 dollars to play)? [4]
[Figure: blueberries are sorted by a sieve of diameter d; those that pass through are classed as “small”, the rest as “large”.]
(a) How large should the diameter of the sieve be, so that the proportion
of large blueberries is 30%? [6]
(b) Suppose that the diameter is chosen such as in (a). What is the
probability that out of 1000 blueberries, fewer than 280 end up in
the “large” class? [6]
4. We draw a random vector (X, Y ) non-uniformly from the triangle (0, 0)–
(1, 0)–(1, 1)
[Figure: the triangle with vertices (0, 0), (1, 0) and (1, 1) in the (x, y)-plane.]
in the following way: First we draw X uniformly on [0, 1]. Then, given
X = x we draw Y uniformly on [0, x].
B.2 Exam 2
1. Two fair dice are thrown and the smallest of the face values, Z say, is
noted.
       f (x) = 4 e^{−4(x−1)} ,  x ≥ 1 ,    and    f (x) = 0 ,  x < 1 .
Summary of Formulas
2. P(Ac ) = 1 − P(A).
9. In particular (continuous), F (x) = ∫_{−∞}^x f (u) du.
11. Marginal from joint pdf: fX (x) = ∫ fX,Y (x, y) dy.
   Distr.         pmf                                          x ∈
   Ber(p)         p^x (1 − p)^{1−x}                            {0, 1}
   Bin(n, p)      (n choose x) p^x (1 − p)^{n−x}               {0, 1, . . . , n}
   Poi(λ)         e^{−λ} λ^x / x!                              {0, 1, . . .}
   G(p)           p (1 − p)^{x−1}                              {1, 2, . . .}
   Hyp(n, r, N)   (r choose x)(N−r choose n−x)/(N choose n)    {0, . . . , n}
   Distr.       pdf                                x ∈
   U[a, b]      1/(b − a)                          [a, b]
   Exp(λ)       λ e^{−λx}                          R+
   Gam(α, λ)    λ^α x^{α−1} e^{−λx} / Γ(α)         R+
   N(µ, σ²)     (1/(σ√(2π))) e^{−½((x−µ)/σ)²}      R
14. Conditional probability: P(A | B) = P(A ∩ B)/P(B).
   Distr.          EX          Var(X)
   Ber(p)          p           p(1 − p)
   Bin(n, p)       np          np(1 − p)
   G(p)            1/p         (1 − p)/p²
   Poi(λ)          λ           λ
   Hyp(n, pN, N)   np          np(1 − p)(N − n)/(N − 1)
   U(a, b)         (a + b)/2   (b − a)²/12
   Exp(λ)          1/λ         1/λ²
   Gam(α, λ)       α/λ         α/λ²
   N(µ, σ²)        µ           σ²
Var(X) = EX 2 − (EX)2 .
Var(aX + b) = a2 Var(X).
cov(X, Y ) = EXY − EXEY .
cov(X, Y ) = cov(Y, X).
cov(aX + bY, Z) = a cov(X, Z) + b cov(Y, Z).
cov(X, X) = Var(X).
Var(X+Y ) = Var(X)+Var(Y )+2 cov(X, Y ).
X and Y independent =⇒ cov(X, Y ) = 0.
   Ber(p)      1 − p + zp
   Bin(n, p)   (1 − p + zp)^n
   G(p)        zp/(1 − z(1 − p))
   Poi(λ)      e^{−λ(1−z)}
36. P(N = n) = (1/n!) G^{(n)}(0). (n-th derivative, at 0)
37. EN = G′(1).
46. Linear transformation: fZ (z) = fX (A^{−1} z)/|A|.
47. General transformation: fZ (z) = fX (x)/|Jx (g)|, with x = g^{−1}(z), where
    |Jx (g)| is the Jacobian of g evaluated at x.
4. Geometric sum: 1 + a + a² + · · · + a^n = (1 − a^{n+1})/(1 − a)   (a ≠ 1).
   If |a| < 1 then 1 + a + a² + · · · = 1/(1 − a).
5. Logarithms:
6. Exponential:
   (a) e^x = 1 + x + x²/2! + x³/3! + · · · .
   (b) e^x = lim_{n→∞} (1 + x/n)^n .
   (c) e^{x+y} = e^x e^y .
7. Differentiation:
   (a) (f + g)′ = f ′ + g′ .
   (b) (f g)′ = f ′ g + f g′ .
   (c) (f /g)′ = (f ′ g − f g′)/g² .
   (d) (d/dx) x^n = n x^{n−1} .
   (e) (d/dx) e^x = e^x .
   (f) (d/dx) log(x) = 1/x .
Statistical Tables
This table gives the cumulative distribution function (cdf) Φ of a N(0, 1)-
distributed random variable Z.
   Φ(z) = P(Z ≤ z) = (1/√(2π)) ∫_{−∞}^z e^{−x²/2} dx .
The last column gives the probability density function (pdf) ϕ of the N(0, 1)-
distribution
   φ(z) = (1/√(2π)) e^{−z²/2} .

In the table below the decimal point of Φ(z) is omitted; for example, the entry 5040
in row 0.0, column .01 means Φ(0.01) = 0.5040.
z .00 .01 .02 .03 .04 .05 .06 .07 .08 .09 φ(z)
0.0 5000 5040 5080 5120 5160 5199 5239 5279 5319 5359 0.3989
0.1 5398 5438 5478 5517 5557 5596 5636 5675 5714 5753 0.3970
0.2 5793 5832 5871 5910 5948 5987 6026 6064 6103 6141 0.3910
0.3 6179 6217 6255 6293 6331 6368 6406 6443 6480 6517 0.3814
0.4 6554 6591 6628 6664 6700 6736 6772 6808 6844 6879 0.3683
0.5 6915 6950 6985 7019 7054 7088 7123 7157 7190 7224 0.3521
0.6 7257 7291 7324 7357 7389 7422 7454 7486 7517 7549 0.3332
0.7 7580 7611 7642 7673 7704 7734 7764 7794 7823 7852 0.3123
0.8 7881 7910 7939 7967 7995 8023 8051 8078 8106 8133 0.2897
0.9 8159 8186 8212 8238 8264 8289 8315 8340 8365 8389 0.2661
1.0 8413 8438 8461 8485 8508 8531 8554 8577 8599 8621 0.2420
1.1 8643 8665 8686 8708 8729 8749 8770 8790 8810 8830 0.2179
1.2 8849 8869 8888 8907 8925 8944 8962 8980 8997 9015 0.1942
1.3 9032 9049 9066 9082 9099 9115 9131 9147 9162 9177 0.1714
1.4 9192 9207 9222 9236 9251 9265 9279 9292 9306 9319 0.1497
1.5 9332 9345 9357 9370 9382 9394 9406 9418 9429 9441 0.1295
1.6 9452 9463 9474 9484 9495 9505 9515 9525 9535 9545 0.1109
1.7 9554 9564 9573 9582 9591 9599 9608 9616 9625 9633 0.0940
1.8 9641 9649 9656 9664 9671 9678 9686 9693 9699 9706 0.0790
1.9 9713 9719 9726 9732 9738 9744 9750 9756 9761 9767 0.0656
2.0 9772 9778 9783 9788 9793 9798 9803 9808 9812 9817 0.0540
2.1 9821 9826 9830 9834 9838 9842 9846 9850 9854 9857 0.0440
2.2 9861 9864 9868 9871 9875 9878 9881 9884 9887 9890 0.0355
2.3 9893 9896 9898 9901 9904 9906 9909 9911 9913 9916 0.0283
2.4 9918 9920 9922 9925 9927 9929 9931 9932 9934 9936 0.0224
2.5 9938 9940 9941 9943 9945 9946 9948 9949 9951 9952 0.0175
2.6 9953 9955 9956 9957 9959 9960 9961 9962 9963 9964 0.0136
2.7 9965 9966 9967 9968 9969 9970 9971 9972 9973 9974 0.0104
2.8 9974 9975 9976 9977 9977 9978 9979 9979 9980 9981 0.0079
2.9 9981 9982 9982 9983 9984 9984 9985 9985 9986 9986 0.0060
3.0 9987 9987 9987 9988 9988 9989 9989 9989 9990 9990 0.0044
3.1 9990 9991 9991 9991 9992 9992 9992 9992 9993 9993 0.0033
3.2 9993 9993 9994 9994 9994 9994 9994 9995 9995 9995 0.0024
3.3 9995 9995 9995 9996 9996 9996 9996 9996 9996 9997 0.0017
3.4 9997 9997 9997 9997 9997 9997 9997 9997 9997 9998 0.0012
3.5 9998 9998 9998 9998 9998 9998 9998 9998 9998 9998 0.0009
3.6 9998 9998 9999 9999 9999 9999 9999 9999 9999 9999 0.0006