ST102/ST109 Elementary Statistical Theory Course Pack 2022/23 (Michaelmas Term)
Course pack
Dr James Abdey
lse.ac.uk/statistics
The author asserts copyright over all material in this course guide except where
otherwise indicated. All rights reserved. No part of this work may be reproduced in any
form, or by any means, without permission in writing from the author.
Contents
Preliminaries
0.1 Organisation
0.2 Course materials
0.3 Supplementary reading
0.4 Assessment
0.5 Syllabus
0.6 And finally, words of wisdom from a wise (youngish) man

Preface
0.7 The role of statistics in the research process
1 Data visualisation and descriptive statistics

2 Probability theory
2.1 Synopsis of chapter
2.2 Learning outcomes
2.3 Introduction
2.4 Set theory: the basics
2.5 Axiomatic definition of probability
2.5.1 Basic properties of probability
2.6 Classical probability and counting rules
2.6.1 Brute force: listing and counting
2.6.2 Combinatorial counting methods
2.6.3 Combining counts: rules of sum and product
2.7 Conditional probability and Bayes’ theorem
2.7.1 Independence of multiple events
2.7.2 Independent versus mutually exclusive events
2.7.3 Conditional probability of independent events
2.7.4 Chain rule of conditional probabilities
2.7.5 Total probability formula
2.7.6 Bayes’ theorem
2.8 Overview of chapter
2.9 Key terms and concepts

3 Random variables
3.1 Synopsis of chapter
3.2 Learning outcomes
3.3 Introduction
3.4 Discrete random variables
Preliminaries
0.1 Organisation
Course lecturer: Dr James Abdey.
• Email: [email protected]
• Office hours: see the Student Hub.
Lectures:
• Tuesdays 09:00–11:00, Peacock Theatre, Michaelmas term weeks 1–10.
Lecture recordings will be made available via Moodle, but note these should not be
a substitute for attending lectures!
0.3 Supplementary reading
However, you will all be aware of the considerable, dare I say significant,
heterogeneity within the student body. Therefore, some may seek additional
reassurance in the form of recommended reading.2
I should stress that purchase of a textbook is at your sole discretion. I suspect this
decision will be based on the extent to which your student finances have been
studiously managed to date. For the shopaholics among you, the recommended text
is:
• Larsen, R.J. and M.L. Marx (2017). An Introduction to Mathematical Statistics
and Its Applications, Pearson, sixth edition.3
Of course, numerous titles are available covering the topics frequently found in
first-year undergraduate service-level courses in statistics. Again, due to the
doubtless heterogeneous preferences among you all, some may find one author’s
style readily accessible, while others may despair at the baffling presentation of
material.4 Consequently, my best advice would be to sample5 a range of textbooks,
and choose your preferred one. Any textbook would act as a supplement to the
course materials. In particular, textbooks are filled with additional exercises to
check understanding – and if you’re lucky, they’ll give you (some) solutions too!
0.4 Assessment
Classes will involve going through the solutions to exercises, and full solutions will
be made available on Moodle upon completion of all that week’s classes. Further
unseen problems will also be covered. As if the sheer joy of studying the discipline
was not incentive enough to engage with the exercises, they are the best
preparation for the. . .
• two-hour written examination in week 0 of Lent term – this has 50%
weighting for ST102, and 100% weighting for ST109. For details, please see the
‘Past exam papers (January)’ section of Moodle.
You will need a scientific calculator for both the classes and the examination. The
only permitted calculators for in-person examinations are the Casio fx-83 or
fx-85 range, available from many retailers.
2 Of course, if I’ve done my job properly, these notes should transcend all other publications!
3 Second-hand earlier editions will be just as valid.
4 One clearly hopes the former applies to this humble author.
5 A sacred word in statistics.
0.5 Syllabus
The full syllabus for ST102 (Michaelmas term) and ST109 Elementary Statistical
Theory can be found in the table of contents.
0.6 And finally, words of wisdom from a wise (youngish) man

(Not so) many years ago, I too was an undergraduate. Fresh-faced and full of
enthusiasm (some things never change), I discovered the strategy for success in
statistics.6 As you embark on this statistical voyage, I feel compelled to share the
following with you.
Preface

0.7 The role of statistics in the research process
Research may be about almost any topic: physics, biology, medicine, economics, history,
literature etc. Most of our examples will be from the social sciences: economics,
management, finance, sociology, political science, psychology etc. Research in this sense
is not just what universities do. Governments, businesses, and all of us as individuals do
it too. Statistics is used in essentially the same way for all of these.
Examples of such research questions include the following.

• Understanding the gender pay gap: what has competition got to do with it?
• Heeding the push from below: how do social movements persuade the rich to listen to the poor?
We can think of the empirical research process as having five key stages.

1. Formulating the research question(s).
2. Research design: deciding what kinds of data to collect, how and from where.
3. Collecting the data.
4. Analysis of the data, to answer the research questions.
5. Reporting the answers and their implications.
The main job of statistics is the analysis of data, although it also informs other stages
of the research process. Statistics are used when the data are quantitative, i.e. in the
form of numbers.
Statistical analysis of quantitative data has the following features.

• It can cope with large volumes of data, in which case the first task is to provide an understandable summary of the data. This is the job of descriptive statistics.
• It can deal with situations where the observed data are regarded as only a part (a sample) from all the data which could have been obtained (the population). There is then uncertainty in the conclusions. Measuring this uncertainty is the job of statistical inference.
We conclude this preface with an example of how statistics can be used to help answer
a research question.
Gill and Spriggs (2005): Assessing the impact of CCTV. Home Office Research
Study 292.
• Intervention: CCTV cameras installed in the target area but not in the control area.
• Comparison: measures of crime and the fear of crime in the target and control areas in the 12 months before and the 12 months after the intervention.
• Level of crime: the number of crimes recorded by the police, in the 12 months before and the 12 months after the intervention.
• Fear of crime: a survey of residents of the areas.
• Respondents: random samples of residents in each of the areas.
• In each area, one sample before the intervention date and one about 12
months after.
• Sample sizes:
                Before    After
Target area       172      168
Control area      215      242
• Question considered here: ‘In general, how much, if at all, do you worry
that you or other people in your household will be victims of crime?’ (from
1 = ‘all the time’ to 5 = ‘never’).
Statistical analysis of the data.
% of respondents who worry ‘sometimes’, ‘often’ or ‘all the time’:
                 Target area                    Control area
          [a]      [b]                   [c]      [d]                        Confidence
         Before   After   Change        Before   After   Change      RES     interval
           26       23      −3            53       46      −7        0.98    0.55–1.74
It is possible to calculate various statistics from these data. For example, the Relative Effect Size, RES = ([d]/[c])/([b]/[a]) = 0.98, is a summary measure which compares the changes in the two areas.
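As a quick arithmetic check, the RES values for this table and for the recorded-crime table which follows below can be reproduced with a few lines of code (the function name is ours, not the study's):

```python
def relative_effect_size(a, b, c, d):
    """RES = (d/c) / (b/a): the control area's after/before ratio
    divided by the target area's after/before ratio."""
    return (d / c) / (b / a)

# Fear of crime (% who worry): target 26 -> 23, control 53 -> 46
print(round(relative_effect_size(26, 23, 53, 46), 2))    # 0.98
# Recorded crimes: target 112 -> 101, control 73 -> 88
print(round(relative_effect_size(112, 101, 73, 88), 2))  # 1.34
```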
Here RES < 1, which means that the observed change in the reported fear of crime has been slightly less favourable in the target area than in the control area.
However, there is uncertainty because of sampling: only 168 and 242 individuals
were actually interviewed at each time in each area, respectively.
The confidence interval for RES includes 1, which means that changes in the
self-reported fear of crime in the two areas are ‘not statistically significantly
different’ from each other.
The number of (any kind of) recorded crimes:
                 Target area                    Control area
          [a]      [b]                   [c]      [d]                        Confidence
         Before   After   Change        Before   After   Change      RES     interval
          112      101     −11            73       88      +15       1.34    0.79–1.89
Now the RES > 1, which means that the observed change in the number of
crimes has been worse in the control area than in the target area.
However, the numbers of crimes in each area are fairly small, which means that
these estimates of the changes in crime rates are fairly uncertain.
The confidence interval for RES again includes 1, which means that the changes
in crime rates in the two areas are not statistically significantly different from
each other.
In summary, this study did not support the claim that the introduction of CCTV
reduces crime or the fear of crime.
If you want to read more about research of this question, see Welsh and
Farrington (2008). Effects of closed circuit television surveillance on crime.
Campbell Systematic Reviews 2008:17.
Many of the statistical terms and concepts mentioned above have not been explained yet
– that is what the rest of the course is for! However, the study serves as an interesting
example of how statistics can be employed in the social sciences to investigate research questions.
Chapter 1
Data visualisation and descriptive statistics
1.3 Introduction
Starting point: a collection of numerical data (a sample) has been collected in order to
answer some questions. Statistical analysis may have two broad aims.
1. Description: summarising the data and presenting their main features in an understandable way (descriptive statistics).

2. Statistical inference: use the observed data to draw conclusions about some broader population.
Sometimes ‘1.’ is the only aim. Even when ‘2.’ is the main aim, ‘1.’ is still an essential
first step.
Data do not just speak for themselves. There are usually simply too many numbers to
make sense of just by staring at them. Descriptive statistics attempt to summarise
some key features of the data to make them understandable and easy to
communicate. These summaries may be graphical or numerical (tables or
individual summary statistics).
Example 1.1 We consider data for 155 countries on three variables from around
2002. The data can be found in the file ‘Countries.csv’. The variables are the
following.
• Region of the country, coded 1 = Africa, 2 = Asia, 3 = Europe, 4 = Latin America, 5 = Northern America and 6 = Oceania, which is a nominal scale.
• Level of democracy, a score from 0 (lowest) to 10 (highest), which is an ordinal scale.
• Gross domestic product per capita (GDP per capita) (i.e. per person, in $000s), which is a ratio scale.
The statistical data in a sample are typically stored in a data matrix, as shown in
Figure 1.1.
Rows of the data matrix correspond to different units (subjects/observations), and columns correspond to the variables. Here, region, the level of democracy, and GDP per capita are the variables.
A continuous variable can, in principle, take any real values within some interval.
In Example 1.1, GDP per capita is continuous, taking any non-negative value.
A variable is discrete if it is not continuous, i.e. if it can only take certain values,
but not any others.
In Example 1.1, region and the level of democracy are discrete, with possible
values of 1, 2, . . . , 6, and 0, 1, 2, . . . , 10, respectively.
Many discrete variables have only a finite number of possible values. In Example 1.1, the
region variable has 6 possible values, and the level of democracy has 11 possible values.
The simplest possibility is a binary, or dichotomous, variable, with just two possible
values. For example, a person’s sex could be recorded as 1 = female and 2 = male.1
A discrete variable can also have an unlimited number of possible values.
Example 1.2 In Example 1.1, the levels of democracy have a meaningful ordering,
from less democratic to more democratic countries. The numbers assigned to the
different levels must also be in this order, i.e. a larger number = more democratic.
In contrast, different regions (Africa, Asia, Europe, Latin America, Northern
America and Oceania) do not have such an ordering. The numbers used for the
region variable are just labels for different regions. A different numbering (such as
6 = Africa, 5 = Asia, 1 = Europe, 3 = Latin America, 2 = Northern America and
4 = Oceania) would be just as acceptable as the one we originally used. Some
statistical methods are appropriate for variables with both ordered and unordered
values, some only in the ordered case. Unordered categories are nominal data;
ordered categories are ordinal data.
1 Note that because sex is a nominal variable, the coding is arbitrary. We could also have, for example, 0 = male and 1 = female, or 0 = female and 1 = male. However, it is important to remember which coding has been used!
2 In practice, of course, there is a finite number of internet users in the world. However, it is reasonable to treat this variable as taking an unlimited number of possible values.
1.5 The sample distribution

The sample distribution of a variable consists of:

• a list of the values of the variable which are observed in the sample
• the number of times each value occurs (the counts or frequencies of the observed values).
When the number of different observed values is small, we can show the whole sample
distribution as a frequency table of all the values and their frequencies.
Example 1.3 Continuing with Example 1.1, the observations of the region variable
in the sample are:
3 5 3 3 3 5 3 3 6 3 2 3 3 3 3
3 3 2 2 2 3 6 2 3 2 2 2 3 3 2
2 3 3 3 2 4 3 2 3 1 4 3 1 3 3
4 4 4 1 2 4 3 4 3 2 1 2 3 1 3
2 1 4 2 4 3 1 4 6 2 1 3 4 2 1
4 4 4 2 3 2 4 1 4 1 4 2 2 2 4
2 2 1 4 2 1 4 2 2 4 4 1 6 3 1
2 1 2 2 1 1 2 1 1 3 2 2 1 2 4
2 1 2 1 1 2 1 2 1 2 1 1 1 1 1
1 1 1 2 1 1 1 1 1 2 1 1 1 1 1
1 1 1 2 1
Region                  Frequency (count)   Relative frequency (%)
(1) Africa                     48                  31.0
(2) Asia                       44                  28.4
(3) Europe                     34                  21.9
(4) Latin America              23                  14.8
(5) Northern America            2                   1.3
(6) Oceania                     4                   2.6
Total                         155                 100

For example, the relative frequency for Africa is 100 × (48/155) = 31.0%.
Here ‘%’ is the percentage of countries in a region, out of the 155 countries in the sample. This is a measure of proportion (that is, relative frequency).
Similarly, for the level of democracy, the frequency table is:
Level of democracy   Frequency      %     Cumulative %
        0               35        22.6        22.6
        1               12         7.7        30.3
        2                4         2.6        32.9
        3                6         3.9        36.8
        4                5         3.2        40.0
        5                5         3.2        43.2
        6               12         7.7        50.9
        7               13         8.4        59.3
        8               16        10.3        69.6
        9               15         9.7        79.3
       10               32        20.6       100
Total                  155       100
‘Cumulative %’ for a value of the variable is the sum of the percentages for that
value and all lower-numbered values.
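To reproduce such a frequency table in code, a minimal sketch using only the Python standard library (the data vector is abbreviated here; the full vector is the listing in Example 1.3):

```python
from collections import Counter

# Region codes for the 155 countries (first few shown; the full
# vector is the data listed in Example 1.3)
region = [3, 5, 3, 3, 3, 5, 3, 3, 6, 3, 2, 3]  # ... and so on

counts = Counter(region)
n = len(region)
cumulative = 0.0
for value in sorted(counts):
    pct = 100 * counts[value] / n
    cumulative += pct
    print(f"{value}: frequency {counts[value]}, "
          f"{pct:.1f}%, cumulative {cumulative:.1f}%")
```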
Example 1.4 Continuing with Example 1.1, a table of frequencies for GDP per
capita where values have been grouped into non-overlapping intervals is shown
below. Figure 1.3 shows a histogram of GDP per capita with a greater number of
intervals to better display the sample distribution.
Example 1.5 Figure 1.4 shows a (more-or-less) symmetric sample distribution for
diastolic blood pressure.
Figure 1.4: Diastolic blood pressures of 4,489 respondents aged 25 or over, Health Survey
for England, 2002.
1.6 Measures of central tendency

(Figure: histogram of examination marks for a sample of students, with marks from 0 to 100 on the horizontal axis and frequency on the vertical axis.)
We begin with measures of central tendency. These answer the question: where is
the ‘centre’ or ‘average’ of the distribution?
We consider the following measures of central tendency:

• mean
• median
• mode.
Example 1.7 We use Xi to denote the value of X for unit i, where i can take
values 1, 2, . . . , n, and n is the sample size.
Therefore, the n observations of X in the dataset (the sample) are X1 , X2 , . . . , Xn .
These can also be written as Xi , for i = 1, 2, . . . , n.
The sum of the observations X_1, X_2, . . . , X_n is written as:

∑_{i=1}^{n} X_i = X_1 + X_2 + · · · + X_n.

This may be written as ∑_i X_i, or just ∑ X_i. Other versions of the same idea are infinite sums:

∑_{i=1}^{∞} X_i = X_1 + X_2 + · · ·.

(Sample) mean: the sample mean of X_1, X_2, . . . , X_n is:

X̄ = (1/n) ∑_{i=1}^{n} X_i.

For example, the mean of the observations 1, 4 and 7 is:

X̄ = (1 + 4 + 7)/3 = 12/3 = 4.
If a variable has a small number of distinct values, X̄ is easy to calculate from the
frequency table. For example, the level of democracy has just 11 different values
which occur in the sample 35, 12, . . . , 32 times each, respectively.
Suppose X has K different values X_1, X_2, . . . , X_K, with corresponding frequencies f_1, f_2, . . . , f_K. Therefore, ∑_{j=1}^{K} f_j = n and:

X̄ = (∑_{j=1}^{K} f_j X_j) / (∑_{j=1}^{K} f_j) = (f_1 X_1 + f_2 X_2 + · · · + f_K X_K) / n.

In our example, the mean of the level of democracy (where K = 11) is:

X̄ = (35 × 0 + 12 × 1 + · · · + 32 × 10) / (35 + 12 + · · · + 32) = (0 + 12 + · · · + 320) / 155 ≈ 5.3.
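The same calculation in code, as a check, with the frequencies taken from the table above:

```python
# Frequencies f_j of democracy levels 0, 1, ..., 10 from the table
freqs = [35, 12, 4, 6, 5, 5, 12, 13, 16, 15, 32]
values = range(11)

n = sum(freqs)                                    # 155
mean = sum(f * x for f, x in zip(freqs, values)) / n
print(n, round(mean, 1))                          # 155 5.3
```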
Deviations from the mean X̄ (= 4) and from the median (= 3):

 i     X_i    X_i − X̄   (X_i − X̄)²   X_i − 3   (X_i − 3)²
 1      1       −3          9           −2          4
 2      2       −2          4           −1          1
 3      3       −1          1            0          0
 4      5       +1          1           +2          4
 5      9       +5         25           +6         36
Sum    20        0         40           +5         45
We see that the sum of deviations from the mean is 0, i.e. we have:

∑_{i=1}^{n} (X_i − X̄) = 0.
The mean is ‘in the middle’ of the observations X1 , X2 , . . . , Xn , in the sense that
positive and negative values of the deviations Xi − X̄ cancel out, when summed over all
the observations.
Also, the smallest possible value of the sum of squared deviations ∑_{i=1}^{n} (X_i − C)² for any constant C is obtained when C = X̄.
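To see why, note the standard decomposition (a short derivation, using the fact just established that the deviations from the mean sum to zero):

∑_{i=1}^{n} (X_i − C)² = ∑_{i=1}^{n} ((X_i − X̄) + (X̄ − C))² = ∑_{i=1}^{n} (X_i − X̄)² + n(X̄ − C)²

since the cross term 2(X̄ − C) ∑_{i=1}^{n} (X_i − X̄) vanishes. The second term is non-negative, and is zero exactly when C = X̄, which proves the claim.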
Median
The (sample) median, q50 , of a variable X is the value which is ‘in the middle’ of
the ordered sample.
For example, if n = 4, then q50 = (X_(2) + X_(3))/2, the average of the two middle values of the ordered sample X_(1) ≤ X_(2) ≤ X_(3) ≤ X_(4).
Example 1.10 Continuing with Example 1.1, n = 155, so q50 = X(78) . For the level
of democracy, the median is 6.
From a table of frequencies, the median is the value for which the cumulative
percentage first reaches 50% (or, if a cumulative % is exactly 50%, the average of the
corresponding value of X and the next highest value).
The ordered values of the level of democracy are:
(.0) (.1) (.2) (.3) (.4) (.5) (.6) (.7) (.8) (.9)
(0.) 0 0 0 0 0 0 0 0 0
(1.) 0 0 0 0 0 0 0 0 0 0
(2.) 0 0 0 0 0 0 0 0 0 0
(3.) 0 0 0 0 0 0 1 1 1 1
(4.) 1 1 1 1 1 1 1 1 2 2
(5.) 2 2 3 3 3 3 3 3 4 4
(6.) 4 4 4 5 5 5 5 5 6 6
(7.) 6 6 6 6 6 6 6 6 6 6
(8.) 7 7 7 7 7 7 7 7 7 7
(9.) 7 7 7 8 8 8 8 8 8 8
(10.) 8 8 8 8 8 8 8 8 8 9
(11.) 9 9 9 9 9 9 9 9 9 9
(12.) 9 9 9 9 10 10 10 10 10 10
(13.) 10 10 10 10 10 10 10 10 10 10
(14.) 10 10 10 10 10 10 10 10 10 10
(15.) 10 10 10 10 10 10
The median can thus be determined from the frequency table of the level of democracy: the cumulative % first reaches 50% at the value 6.

To see the effect of outliers, consider the observations:

1, 2, 4, 5, 8

for which the median is 4 and the mean is 4. Now add a single extreme observation:

1, 2, 4, 5, 8, 100.

The median is now 4.5, and the mean is 20. In general, the mean is affected much more
than the median by outliers, i.e. unusually small or large observations. Therefore, you
should identify outliers early on and investigate them – perhaps there has been a data
entry error, which can simply be corrected. If deemed genuine outliers, a decision has to
be made about whether or not to remove them.
For an exactly symmetric distribution, the mean and median are equal.
When summarising variables with skewed distributions, it is useful to report both the
mean and the median.
                           Mean    Median
Level of democracy          5.3      6
GDP per capita              8.6      4.7
Diastolic blood pressure   74.2     73.5
Examination marks          56.6     57.0
1.6.7 Mode
The (sample) mode of a variable is the value which has the highest frequency (i.e.
appears most often) in the data.
Example 1.12 For Example 1.1, the modal region is 1 (Africa) and the mode of
the level of democracy is 0.
The mode is not very useful for continuous variables which have many different values,
such as GDP per capita in Example 1.1. A variable can have several modes (i.e. be
multimodal). For example, GDP per capita has modes 0.8 and 1.9, both with 5
countries out of the 155.
The mode is the only measure of central tendency which can be used even when the
values of a variable have no ordering, such as for the (nominal) region variable in
Example 1.1.
1.7 Measures of dispersion
Example 1.13 A small example of determining the sum of the squared deviations from the (sample) mean, which is used to calculate common measures of dispersion.

 i     X_i    X_i²    X_i − X̄    (X_i − X̄)²
 1      1       1        −3           9
 2      2       4        −2           4
 3      3       9        −1           1
 4      5      25        +1           1
 5      9      81        +5          25
Sum    20     120         0          40

Here X̄ = 4, ∑ X_i² = 120 and ∑ (X_i − X̄)² = 40.
The (sample) variance of X_1, X_2, . . . , X_n is defined as:

S² = (1/(n − 1)) ∑_{i=1}^{n} (X_i − X̄)²

and the (sample) standard deviation is S = √S².

These are the most commonly-used measures of dispersion. The standard deviation is more understandable than the variance, because the standard deviation is expressed in the same units as X (rather than the variance, which is expressed in squared units).

A useful rule-of-thumb for interpretation is that for many symmetric distributions, such as the ‘normal’ distribution:

• about 2/3 of the observations are between X̄ − S and X̄ + S, that is, within one (sample) standard deviation about the (sample) mean
• about 95% of the observations are between X̄ − 2S and X̄ + 2S, that is, within two (sample) standard deviations about the (sample) mean.
Remember that standard deviations (and variances) are never negative, and they are zero only if all the X_i observations are the same (that is, there is no variation in the data).
If we are using a frequency table, we can also calculate:

S² = (1/(n − 1)) (∑_{j=1}^{K} f_j X_j² − n X̄²).
Returning to the small dataset of Example 1.13, where X̄ = 4, ∑ X_i² = 120 and ∑ (X_i − X̄)² = 40, we have:
S² = (1/(n − 1)) ∑ (X_i − X̄)² = 40/4 = 10 = (1/(n − 1)) (∑ X_i² − n X̄²) = (120 − 5 × 4²)/4

and S = √S² = √10 = 3.16.
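A quick numerical check of both formulae for this small dataset, as a sketch:

```python
import math

x = [1, 2, 3, 5, 9]
n = len(x)
mean = sum(x) / n                                   # 4.0

# Definitional formula: sum of squared deviations / (n - 1)
s2_dev = sum((xi - mean) ** 2 for xi in x) / (n - 1)
# Computational formula: (sum of squares - n * mean^2) / (n - 1)
s2_alt = (sum(xi ** 2 for xi in x) - n * mean ** 2) / (n - 1)

print(s2_dev, s2_alt, round(math.sqrt(s2_dev), 2))  # 10.0 10.0 3.16
```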
The median, q50 , is basically the value which divides the sample into the smallest 50%
of observations and the largest 50%. If we consider other percentage splits, we get other
(sample) quantiles (percentiles), qc .
• The first quartile, q25 or Q1, is the value which divides the sample into the smallest 25% of observations and the largest 75%.
• The third quartile, q75 or Q3, is the value which divides the sample into the smallest 75% of observations and the largest 25%.
The extremes in this spirit are the minimum, X(1) (the ‘0% quantile’, so to
speak), and the maximum, X(n) (the ‘100% quantile’).
These are no longer ‘in the middle’ of the sample, but they are more general
measures of location of the sample distribution.
Two simple measures of dispersion based on these quantities are the range, X_(n) − X_(1) (the maximum minus the minimum), and the interquartile range (IQR), Q3 − Q1. The range is, clearly, extremely sensitive to outliers, since it depends on nothing but the extremes of the distribution, i.e. the minimum and maximum observations. The IQR focuses on the middle 50% of the distribution, so it is completely insensitive to outliers.
1.7.4 Boxplots
A boxplot (in full, a box-and-whiskers plot) summarises some key features of a sample distribution using quantiles. The plot comprises the following.

• The box, whose edges are the first and third quartiles (Q1 and Q3). Hence the box captures the middle 50% of the data, and the length of the box is the interquartile range.
• The line inside the box, which is the median.
• The bottom whisker, which extends either to the minimum or to a length of 1.5 times the interquartile range below the first quartile, whichever is closer to the first quartile.
• The top whisker, which extends either to the maximum or to a length of 1.5 times the interquartile range above the third quartile, whichever is closer to the third quartile.
• Points beyond 1.5 times the interquartile range below the first quartile or above the third quartile, which are regarded as outliers and plotted as individual points.
A much longer whisker (and/or outliers) in one direction relative to the other indicates
a skewed distribution, as does a median line not in the middle of the box.
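One way such a plot might be produced is sketched below using matplotlib, whose default whisker rule is exactly the 1.5 × IQR rule described above; the data here are invented for illustration:

```python
import matplotlib.pyplot as plt

# Hypothetical right-skewed sample, e.g. incomes in $000s
data = [2, 3, 3, 4, 5, 5, 6, 7, 8, 9, 12, 14, 18, 35, 42]

fig, ax = plt.subplots()
ax.boxplot(data, whis=1.5)   # whiskers at 1.5 * IQR (the default)
ax.set_ylabel("Value")
plt.show()
```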
Example 1.16 Figure 1.7 displays a boxplot of GDP per capita using the sample
of 155 countries introduced in Example 1.1. Some summary statistics for this
variable are reported below.
                  Mean   Median   Standard deviation    IQR    Range
GDP per capita     8.6     4.7           9.5             9.7    37.3
(Figure 1.7: boxplot of GDP per capita; the maximum value is 37.8, and the points beyond the top whisker are outliers.)
1.8 Associations between two variables

1.8.1 Scatterplots
A scatterplot shows the values of two continuous variables against each other, plotted
as points in a two-dimensional coordinate system.
Example 1.17 A plot of data for 164 countries is shown in Figure 1.8, which plots the following variables:

• GDP per capita (in $000s)
• a measure of the level of corruption in the country.
Interpretation: it appears that virtually all countries with high levels of corruption
have relatively low GDP per capita. At lower levels of corruption there is a positive
association, where countries with very low levels of corruption also tend to have high
GDP per capita.
Example 1.18 Figure 1.9 is a time series of an index of prices of consumer goods
and services in the UK for the period 1800–2009 (Office for National Statistics; scaled
so that the price level in 1974 = 100). This shows the price inflation over this period.
Example 1.19 Figure 1.10 shows side-by-side boxplots of GDP per capita for the
different regions in Example 1.1.
GDP per capita in African countries tends to be very low. There is a handful of
countries with somewhat higher GDPs per capita (shown as outliers in the plot).
The median for Asia is not much higher than for Africa. However, the
distribution in Asia is very much skewed to the right, with a tail of countries
with very high GDPs per capita.
The boxplots for Northern America and Oceania are not very useful, because they are based on very few countries (two and four countries, respectively).
Example 1.20 The table below reports the results from a survey of 972 private
investors.3 The variables are as follows.
(Figure 1.10: side-by-side boxplots of GDP per capita, in $000s, for the six regions: Africa, Asia, Europe, Latin America, Northern America and Oceania.)
Interpretation: look at the row percentages. For example, 17.8% of those aged under
45, but only 5.2% of those aged 65 and over, think that short-term gains are ‘very
important’. Among the respondents, the older age groups seem to be less concerned
with quick profits than the younger age groups.
3 Lewellen, W.G., R.C. Lease and G.G. Schlarbaum (1977). ‘Patterns of investment strategy and behavior among individual investors’. The Journal of Business, 50(3), pp. 296–333.
Chapter 2
Probability theory
2.2 Learning outcomes

By the end of this chapter, you should be able to:

• explain the fundamental ideas of random experiments, sample spaces and events
• list the axioms of probability and be able to derive all the common probability rules from them
• list the formulae for the number of combinations and permutations of k objects out of n, and be able to routinely use such results in problems
• explain conditional probability and the concept of independent events
• prove the law of total probability and apply it to problems where there is a partition of the sample space
• prove Bayes’ theorem and apply it to find conditional probabilities.
2.3 Introduction
Consider the following hypothetical example. A country will soon hold a referendum
about whether it should leave the European Union (EU). An opinion poll of a random
sample of people in the country is carried out.
950 respondents say that they plan to vote in the referendum. They answer the question
‘Will you vote ‘Yes’ or ‘No’ to leaving the EU?’ as follows:
          Answer
        Yes     No     Total
Count   513    437      950
%       54%    46%     100%
However, we are not interested in just this sample of 950 respondents, but in the
population which they represent, that is, all likely voters.
Statistical inference will allow us to say things like the following about the
population.
‘The null hypothesis that π = 0.50, against the alternative hypothesis that
π > 0.50, is rejected at the 5% significance level.’
In short, the opinion poll gives statistically significant evidence that ‘Yes’ voters are in
the majority among likely voters. Such methods of statistical inference will be discussed
later in the course.
The inferential statements about the opinion poll rely on assumptions and results about probability, random variables, sampling and statistical inference. In the next few chapters, we will learn about these ideas, among others.
In statistical inference, the data we have observed are regarded as a sample from a
broader population, selected with a random process.
Values in a sample are also random. We cannot predict the precise values which
will be observed before we actually collect the sample.
A preview of probability
Experiment: for example, rolling a single die and recording the outcome.
Sample space S: the set of all possible outcomes, here {1, 2, 3, 4, 5, 6}.
Event: any subset A of the sample space, for example A = {4, 5, 6}.1
Another event is B = {1, 2, 3, 4, 5}. The symbol ∈ denotes membership of a set; for this B:

1 ∈ B and 2 ∈ B
6 ∉ B and 1.5 ∉ B.
/ A.
The familiar Venn diagrams help to visualise statements about sets. However, Venn
diagrams are not formal proofs of results in set theory.
Example 2.3 In Figure 2.1, the darkest area in the middle is A ∩ B, the total
shaded area is A ∪ B, and the white area is (A ∪ B)c = Ac ∩ B c .
Subset: A is a subset of B, written A ⊂ B, when x ∈ A ⇒ x ∈ B, i.e. when every element of A is also an element of B.
Example 2.4 An example of the distinction between subsets and non-subsets is:
{1, 2, 3} ⊂ {1, 2, 3, 4}, because all elements appear in the larger set
{1, 2, 5} ⊄ {1, 2, 3, 4}, because the element 5 does not appear in the larger set.
Two sets A and B are equal (A = B) if they have exactly the same elements. This
implies that A ⊂ B and B ⊂ A.
A ∪ B = {x | x ∈ A or x ∈ B}.
That is, the set of those elements which belong to A or B (or both). An example is
shown in Figure 2.3.
1 Strictly speaking not all subsets are events, as discussed later.
For example, let A = {1, 2, 3, 4}, B = {2, 3} and C = {4, 5, 6}. Then:

A ∪ B = {1, 2, 3, 4}
A ∪ C = {1, 2, 3, 4, 5, 6}
B ∪ C = {2, 3, 4, 5, 6}.
A ∩ B = {x | x ∈ A and x ∈ B}.
That is, the set of those elements which belong to both A and B. An example is
shown in Figure 2.4.
For the same sets A, B and C as above:

A ∩ B = {2, 3}
A ∩ C = {4}
B ∩ C = ∅.
Both set operators can also be applied to more than two sets, such as A ∩ B ∩ C.
Concise notation for the unions and intersections of sets A_1, A_2, . . . , A_n is:

⋃_{i=1}^{n} A_i = A_1 ∪ A_2 ∪ · · · ∪ A_n

and:

⋂_{i=1}^{n} A_i = A_1 ∩ A_2 ∩ · · · ∩ A_n.

These can also be used for an infinite number of sets, i.e. when n is replaced by ∞.
Complement (‘not’)
Suppose S is the set of all possible elements which are under consideration. In
probability, S will be referred to as the sample space.
It follows that A ⊂ S for every set A we may consider. The complement of A with respect to S is:

Aᶜ = {x | x ∈ S and x ∉ A}.
That is, the set of those elements of S that are not in A. An example is shown in
Figure 2.5.
We now consider some useful properties of set operators. In proofs and derivations
about sets, you can use the following results without proof.
Commutativity:
A ∩ B = B ∩ A and A ∪ B = B ∪ A.
Associativity:
A ∩ (B ∩ C) = (A ∩ B) ∩ C and A ∪ (B ∪ C) = (A ∪ B) ∪ C.
Distributive laws:
A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)
and:
A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C).
De Morgan’s laws:

(A ∩ B)ᶜ = Aᶜ ∪ Bᶜ and (A ∪ B)ᶜ = Aᶜ ∩ Bᶜ.
If S is the sample space and A and B are any sets in S, you can also use the following
results without proof:
∅c = S.
∅ ⊂ A, A ⊂ A and A ⊂ S.
A ∩ A = A and A ∪ A = A.
A ∩ Ac = ∅ and A ∪ Ac = S.
If B ⊂ A, A ∩ B = B and A ∪ B = A.
A ∩ ∅ = ∅ and A ∪ ∅ = A.
A ∩ S = A and A ∪ S = S.
∅ ∩ ∅ = ∅ and ∅ ∪ ∅ = ∅.
Sets A and B are disjoint, or mutually exclusive, if they have no elements in common, i.e. if:

A ∩ B = ∅.

Sets A_1, A_2, . . . , A_n are pairwise disjoint if all pairs of sets from them are disjoint, i.e. A_i ∩ A_j = ∅ for all i ≠ j.
Partition
The sets A_1, A_2, . . . , A_n form a partition of the set A if they are pairwise disjoint and if ⋃_{i=1}^{n} A_i = A, that is, if A_1, A_2, . . . , A_n are collectively exhaustive of A.

Therefore, a partition divides the entire set A into non-overlapping pieces A_i, as shown in Figure 2.6 for n = 3. Similarly, an infinite collection of sets A_1, A_2, . . . form a partition of A if they are pairwise disjoint and ⋃_{i=1}^{∞} A_i = A.
For example, suppose A ⊂ B, so that A ∪ B = B. We have:

A ∩ (B ∩ Aᶜ) = (A ∩ Aᶜ) ∩ B = ∅ ∩ B = ∅

and:

A ∪ (B ∩ Aᶜ) = (A ∪ B) ∩ (A ∪ Aᶜ) = B ∩ S = B.

Hence A and B ∩ Aᶜ are mutually exclusive and collectively exhaustive of B, and so they form a partition of B.
Example 2.8 If the experiment is ‘select a trading day at random and record the
% change in the FTSE 100 index from the previous trading day’, then the outcome
is the % change in the FTSE 100 index.
S = [−100, +∞) for the % change in the FTSE 100 index (in principle).
An event of interest might be A = {x | x > 0} – the event that the daily change is
positive, i.e. the FTSE 100 index gains value from the previous trading day.
The sample space and events are represented as sets. For two events A and B, set
operations are then interpreted as follows.
Axioms of probability
‘Probability’ is formally defined as a function P (·) from subsets (events) of the sample
space S onto real numbers.2 Such a function is a probability function if it satisfies
the following axioms (‘self-evident truths’).
Axiom 1: P(A) ≥ 0 for all events A.

Axiom 2: P(S) = 1.

Axiom 3: if the events A_1, A_2, . . . are pairwise disjoint, then P(A_1 ∪ A_2 ∪ · · ·) = P(A_1) + P(A_2) + · · ·.
The axioms require that a probability function must always satisfy these requirements.
2 The precise definition also requires a careful statement of which subsets of S are allowed as events, which we can skip on this course.
Axiom 2 requires that the outcome is some element from the sample space with
certainty (that is, with probability 1). In other words, the experiment must have
some outcome.
Probability property: P(∅) = 0.

Proof: ∅ and ∅ are disjoint (∅ ∩ ∅ = ∅) and ∅ ∪ ∅ = ∅, so the axioms give P(∅) = P(∅) + P(∅). However, the only real number for P(∅) which satisfies this is P(∅) = 0.
Figure 2.7: Venn diagram depicting three mutually exclusive sets, A1 , A2 and A3 . Note
although A2 and A3 have touching boundaries, there is no actual intersection and hence
they are (pairwise) mutually exclusive.
That is, we can simply sum probabilities of mutually exclusive sets. This is very useful
for deriving further results.
Probability property: for any event A, P(Aᶜ) = 1 − P(A). This follows because A and Aᶜ are disjoint and A ∪ Aᶜ = S, so P(A) + P(Aᶜ) = P(S) = 1.

Probability property: P(A) ≤ 1 for all events A.
Proof (by contradiction): If it was true that P (A) > 1 for some A, then we would have:
P (Ac ) = 1 − P (A) < 0.
This violates Axiom 1, so cannot be true. Therefore, it must be that P (A) ≤ 1 for all A.
Putting this and Axiom 1 together, we get:
0 ≤ P (A) ≤ 1
for all events A.
Probability property: if A ⊂ B, then P(A) ≤ P(B).

Proof: A and B ∩ Aᶜ form a partition of B, so:

P(B) = P(A) + P(B ∩ Aᶜ) ≥ P(A)

since P(B ∩ Aᶜ) ≥ 0.
Probability property: for any two events A and B:

P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Proof: partitioning A ∪ B, A and B into disjoint pieces gives:

P(A ∪ B) = P(A ∩ Bᶜ) + P(A ∩ B) + P(Aᶜ ∩ B)
P(A) = P(A ∩ Bᶜ) + P(A ∩ B)
P(B) = P(Aᶜ ∩ B) + P(A ∩ B)

and hence:

P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
These show that the probability function has the kinds of values we expect of something
called a ‘probability’.
Example 2.9 Suppose that, in a given population:

• 86% spend at least 1 hour watching television (event A, with P(A) = 0.86)
• 19% spend at least 1 hour reading newspapers (event B, with P(B) = 0.19)
• 15% spend at least 1 hour watching television and at least 1 hour reading newspapers (P(A ∩ B) = 0.15).

The probability that a randomly selected person spends at least 1 hour watching television or reading newspapers (or both) is then:

P(A ∪ B) = P(A) + P(B) − P(A ∩ B) = 0.86 + 0.19 − 0.15 = 0.90

and the probability of doing neither is 1 − P(A ∪ B) = 0.10.
Probability theory tells us how to work with the probability function and derive
‘probabilities of events’ from it. However, it does not tell us what ‘probability’ really
means.
There are several alternative interpretations of the real-world meaning of ‘probability’
in this sense. One of them is outlined below. The mathematical theory of probability
and calculations on probabilities are the same whichever interpretation we assign to
‘probability’. So, in this course, we do not need to discuss the matter further.
Example 2.10 How should we interpret the following, as statements about the real
world of coins and babies?
‘The probability that a tossed coin comes up heads is 0.5.’ If we tossed a coin a
large number of times, and the proportion of heads out of those tosses was 0.5,
the ‘probability of heads’ could be said to be 0.5, for that coin.
‘The probability is 0.51 that a child born in the UK today is a boy.’ If the
proportion of boys among a large number of live births was 0.51, the
‘probability of a boy’ could be said to be 0.51.
A key question is how to determine appropriate numerical values of P (A) for the
probabilities of particular events.
This is usually done empirically, by observing actual realisations of the experiment and
using them to estimate probabilities. In the simplest cases, this basically applies the
frequency definition to observed data.
If I toss a coin 10,000 times, and 5,023 of the tosses come up heads, it seems
that, approximately, P (heads) = 0.5, for that coin.
Of the 7,098,667 live births in England and Wales in the period 1999–2009,
51.26% were boys. So we could assign the value of about 0.51 to the probability
of a boy in this population.
Standard illustrations of classical probability are devices used in games of chance, such as:

• tossing a coin (or several coins) and recording which side lands face up
• rolling one or more dice
• drawing playing cards from a well-shuffled deck.

We will use these often, not because they are particularly important but because they provide simple examples for illustrating various results in probability.
Suppose that the sample space, S, contains m equally likely outcomes, and that event A
consists of k ≤ m of these outcomes. Therefore:
k number of outcomes in A
P (A) = = .
m total number of outcomes in the sample space, S
That is, the probability of A is the proportion of outcomes which belong to A out of all
possible outcomes.
In the classical case, the probability of any event can be determined by counting the
number of outcomes which belong to the event, and the total number of possible
outcomes.
Example 2.12 Rolling two dice, what is the probability that the sum of the two
scores is 5?
S = {(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6),
(2, 1), (2, 2), (2, 3), (2, 4), (2, 5), (2, 6),
(3, 1), (3, 2), (3, 3), (3, 4), (3, 5), (3, 6),
(4, 1), (4, 2), (4, 3), (4, 4), (4, 5), (4, 6),
(5, 1), (5, 2), (5, 3), (5, 4), (5, 5), (5, 6),
(6, 1), (6, 2), (6, 3), (6, 4), (6, 5), (6, 6)}.

The event of interest is A = {(1, 4), (2, 3), (3, 2), (4, 1)}, and hence P(A) = 4/36 = 1/9.
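Classical probabilities like this are easy to confirm by brute-force enumeration; a minimal sketch:

```python
from itertools import product

sample_space = list(product(range(1, 7), repeat=2))   # all 36 outcomes
event = [(a, b) for (a, b) in sample_space if a + b == 5]

print(len(event), len(sample_space))    # 4 36
print(len(event) / len(sample_space))   # 0.111... = 1/9
```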
Now that we have a way of obtaining probabilities for events in the classical case, we
can use it together with the rules of probability.
The formula P (A) = 1 − P (Ac ) is convenient when we want P (A) but the probability of
the complementary event Ac , i.e. P (Ac ), is easier to find.
Example 2.13 When rolling two fair dice, what is the probability that the sum of
the dice is greater than 3?
The complement is that the sum is at most 3, i.e. the complementary event is Aᶜ = {(1, 1), (1, 2), (2, 1)}. Therefore:

P(A) = 1 − P(Aᶜ) = 1 − 3/36 = 33/36 = 11/12.
The formula:
P (A ∪ B) = P (A) + P (B) − P (A ∩ B)
says that the probability that A or B happens (or both happen) is the sum of the
probabilities of A and B, minus the probability that both A and B happen.
Example 2.14 When rolling two fair dice, what is the probability that the two scores are equal (event A) or that the total score is greater than 10 (event B)?

Here A = {(1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6)}, B = {(5, 6), (6, 5), (6, 6)} and A ∩ B = {(6, 6)}, so:

P(A ∪ B) = P(A) + P(B) − P(A ∩ B) = 6/36 + 3/36 − 1/36 = 8/36 = 2/9.
Example 2.15 Consider a group of four people, where each pair of people is either
connected (= friends) or not. How many different patterns of connections are there
(ignoring the identities of who is friends with whom)?
The answer is 11. See the patterns in Figure 2.8.
The number of ways of selecting k objects out of n depends on:

• whether the order of selection matters (an ordered sequence versus an unordered set)
• whether the selection is with replacement (an object can be selected more than once) or without replacement (an object can be selected only once).
Suppose first that the selection of k objects out of n is treated as an ordered sequence, with replacement, so that each of the n objects may appear several times in the selection. Therefore:

• n objects are available for selection for the 1st object in the sequence
• n objects are available for selection for the 2nd object in the sequence
• . . . and so on, until n objects are available for selection for the kth object in the sequence.

Multiplying these together, the number of possible ordered sequences is n × n × · · · × n = nᵏ.
Suppose that the selection of k objects out of n is again treated as an ordered sequence, but that selection is now without replacement, so that each object can appear at most once. Now:

• n objects are available for selection for the 1st object in the sequence
• n − 1 objects are available for selection for the 2nd object
• . . . and so on, until n − k + 1 objects are available for selection for the kth object.

Multiplying these together, the number of possible ordered sequences is:

n × (n − 1) × · · · × (n − k + 1). (2.2)
Factorials
The number of ordered sets of n objects, selected without replacement from n objects,
is:
n! = n × (n − 1) × · · · × 2 × 1.
The number n! (read ‘n factorial’) is the total number of different ways in which
n objects can be arranged in an ordered sequence. This is known as the number of
permutations of n objects.
We also define 0! = 1.
Using factorials, the quantity (2.2) can be written as:

n × (n − 1) × · · · × (n − k + 1) = n!/(n − k)!.
Suppose now that the identities of the objects in the selection matter, but the order
does not.
For example, the sequences (1, 2, 3), (1, 3, 2), (2, 1, 3), (2, 3, 1), (3, 1, 2), (3, 2, 1) are
now all treated as the same, because they all contain the elements 1, 2 and 3.
Each unordered set of k distinct objects corresponds to k! ordered sequences, so the number of unordered subsets of k objects selected without replacement from n is:

\binom{n}{k} = n!/(k! (n − k)!).

The number \binom{n}{k} is known as the binomial coefficient (read ‘n choose k’). Note that because 0! = 1, \binom{n}{0} = \binom{n}{n} = 1, so there is only 1 way of selecting 0 or n out of n objects.
                 With replacement          Without replacement
Ordered                nᵏ                       n!/(n − k)!
Unordered       \binom{n+k−1}{k}          \binom{n}{k} = n!/(k! (n − k)!)
We have not discussed the unordered, with replacement case which is non-examinable.
It is provided here only for completeness with an illustration given in Example 2.16.
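Python’s standard library computes all four counts in this table directly; a small sketch with illustrative values n = 10 and k = 3:

```python
import math

n, k = 10, 3
print(n ** k)                   # ordered, with replacement: 1000
print(math.perm(n, k))          # ordered, without replacement: 720
print(math.comb(n, k))          # unordered, without replacement: 120
print(math.comb(n + k - 1, k))  # unordered, with replacement: 220
```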
Example 2.17 Suppose we have k = 3 people (Amy, Bob and Sam). How many
different sets of birthdays can they have (day and month, ignoring the year, and
pretending 29 February does not exist, so that n = 365) in the following cases?
1. It makes a difference who has which birthday (ordered), i.e. Amy (1 January),
Bob (5 May) and Sam (5 December) is different from Amy (5 May), Bob (5
December) and Sam (1 January), and different people can have the same
birthday (with replacement). The number of different sets of birthdays is:
(365)³ = 48,627,125.
2. It makes a difference who has which birthday (ordered), and different people
must have different birthdays (without replacement). The number of different
sets of birthdays is:
365!/(365 − 3)! = 365 × 364 × 363 = 48,228,180.
3. Only the dates matter, but not who has which one (unordered), i.e. Amy (1 January), Bob (5 May) and Sam (5 December) is treated as the same as Amy (5 May), Bob (5 December) and Sam (1 January), and different people must have different birthdays (without replacement). The number of different sets of birthdays is:

\binom{365}{3} = (365 × 364 × 363)/3! = 8,038,030.
Example 2.18 Consider a room with r people in it. What is the probability that
at least two of them have the same birthday (call this event A)? In particular, what
is the smallest r for which P (A) > 1/2?
Assume that all days are equally likely.
Label the people 1 to r, so that we can treat them as an ordered list and talk about
person 1, person 2 etc. We want to know how many ways there are to assign
birthdays to this list of people. We note the following.
1. The number of all possible sequences of birthdays, allowing repeats (i.e. with
replacement) is (365)r .
2. The number of sequences where all birthdays are different (i.e. without
replacement) is 365!/(365 − r)!.
Here ‘1.’ is the size of the sample space, and ‘2.’ is the number of outcomes which
satisfy Ac , the complement of the case in which we are interested.
Therefore:

P(Aᶜ) = (365!/(365 − r)!)/(365)ʳ = (365 × 364 × · · · × (365 − r + 1))/(365)ʳ

and:

P(A) = 1 − P(Aᶜ) = 1 − (365 × 364 × · · · × (365 − r + 1))/(365)ʳ.
Computing P(A) for increasing values of r shows that the probability rises surprisingly quickly: the smallest r for which P(A) > 1/2 is r = 23, for which P(A) ≈ 0.507.
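A short sketch which tabulates P(A) for a few group sizes:

```python
def p_shared_birthday(r):
    """P(at least two of r people share a birthday), 365 equally likely days."""
    p_all_different = 1.0
    for i in range(r):
        p_all_different *= (365 - i) / 365
    return 1 - p_all_different

for r in (10, 20, 22, 23, 30, 50):
    print(r, round(p_shared_birthday(r), 3))
# r = 23 is the smallest group size with P(A) > 1/2 (about 0.507)
```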
Rule of sum: if a task can be performed in K mutually exclusive ways, and way j can be carried out in m_j different ways, then the total number of ways of performing the task is:

m_1 + m_2 + · · · + m_K.

Rule of product: if a task consists of K separate steps carried out in sequence, and step j can be carried out in m_j different ways, then the total number of ways of performing the whole task is:

m_1 × m_2 × · · · × m_K.
Example 2.19 (The ST102 Moodle site contains a separate note which explains
playing cards and hands, and shows how to calculate the probabilities of all the
hands. This is for reference only – you do not need to memorise the different types of
hands!)
Five playing cards are drawn from a well-shuffled deck of 52 playing cards. What is
the probability that the cards form a hand which is higher than ‘a flush’ ? The cards
in a hand are treated as an unordered set.
First, we determine the size of the sample space which is all unordered subsets of 5
cards selected from 52. So the size of the sample space is:
\binom{52}{5} = 52!/(5! × 47!) = (52 × 51 × 50 × 49 × 48)/(5 × 4 × 3 × 2 × 1) = 2,598,960.
A ‘full house’ is three cards of the same rank and two of another rank, for example:
♦2 ♠2 ♣2 ♦4 ♠4.
We can break the number of ways of choosing these into two steps.
• The total number of ways of selecting the three: the rank of these can be any of the 13 ranks. There are four cards of this rank, so the three of that rank can be chosen in \binom{4}{3} = 4 ways. So the total number of different triplets is 13 × 4 = 52.

• The total number of ways of selecting the two: the rank of these can be any of the remaining 12 ranks, and the two cards of that rank can be chosen in \binom{4}{2} = 6 ways. So the total number of different pairs (with a different rank than the triplet) is 12 × 6 = 72.
The rule of product then says that the total number of full houses is:
52 × 72 = 3,744.
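The same count done in code, as a check:

```python
import math

triplets = 13 * math.comb(4, 3)   # choose a rank, then 3 of its 4 cards
pairs = 12 * math.comb(4, 2)      # a remaining rank, then 2 of its 4 cards
full_houses = triplets * pairs    # rule of product: 3,744
hands = math.comb(52, 5)          # 2,598,960 possible 5-card hands

print(full_houses, round(full_houses / hands, 6))   # 3744 0.001441
```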
A summary of the numbers of all types of 5-card hands, and their probabilities, can be found in the separate note on the ST102 Moodle site mentioned above.
In this section we consider:

• independence
• conditional probability
• Bayes’ theorem.
These ideas give us ways of updating probabilities of events after we learn that some other event has happened.
Independence: events A and B are (statistically) independent if:

P(A ∩ B) = P(A) P(B).

Intuitively, independence means that:

• if A happens, this does not affect the probability of B happening (and vice versa)
• if you are told that A has happened, this does not give you any new information about the value of P(B) (and vice versa).
Example 2.20 Suppose we roll two dice, and assume that all combinations of their values are equally likely. Any event which concerns only the first die is then independent of any event which concerns only the second die.
Note that there is a difference between pairwise independence and full independence.
The following example illustrates.
Example 2.21 It can be cold in London. Four impoverished teachers dress to feel
warm. Teacher A has a hat and a scarf and gloves, Teacher B only has a hat, Teacher
C only has a scarf and Teacher D only has gloves. One teacher out of the four is
selected at random. We now show that although each pair of the events H = ‘the selected teacher has a hat’, S = ‘the selected teacher has a scarf’ and G = ‘the selected teacher has gloves’ is independent, the three events together are not independent.
Two teachers have a hat, two teachers have a scarf, and two teachers have gloves, so:

P(H) = 2/4 = 1/2,  P(S) = 2/4 = 1/2  and  P(G) = 2/4 = 1/2.

Only one teacher has both a hat and a scarf, so:

P(H ∩ S) = 1/4

and similarly:

P(H ∩ G) = 1/4  and  P(S ∩ G) = 1/4.
From these results, we can verify that:
P (H ∩ S) = P (H) P (S)
P (H ∩ G) = P (H) P (G)
P (S ∩ G) = P (S) P (G)
and so the events are pairwise independent. However, one teacher has a hat, a scarf and gloves, so:

P(H ∩ S ∩ G) = 1/4 ≠ 1/8 = P(H) P(S) P(G).
Hence the three events are not independent. If the selected teacher has a hat and a
scarf, then we know that the teacher has gloves. There is no independence for all
three events together.
Conditional probability
Consider two events A and B. Suppose you are told that B has occurred. How does
this affect the probability of event A?
The answer is given by the conditional probability of A given that B has occurred,
or the conditional probability of A given B for short, defined as:
P(A | B) = P(A ∩ B)/P(B)

assuming that P(B) > 0. The conditional probability is not defined if P(B) = 0.
Example 2.22 Suppose we roll two independent fair dice again. Consider the events A = ‘at least one of the dice shows a 5’ and B = ‘the sum of the two dice is at most 6’.
These are shown in Figure 2.10. Now P (A) = 11/36 ≈ 0.31, P (B) = 15/36 and
P (A ∩ B) = 2/36. Therefore, the conditional probability of A given B is:
P(A | B) = P(A ∩ B)/P(B) = (2/36)/(15/36) = 2/15 ≈ 0.13.
(Figure 2.10: the 36 outcomes of the two dice, with the outcomes belonging to A and to B marked.)
Example 2.23 In Example 2.22, when we are told that the conditioning event B
has occurred, we know we are within the green line in Figure 2.10. So the 15
outcomes within it become the new sample space. There are 2 outcomes which
satisfy A and which are inside this new sample space, so:
P(A | B) = 2/15 = (number of outcomes of A within B)/(number of outcomes of B).
Multiplying the definition of conditional probability through by P(B) gives the chain rule for two events:

P(A ∩ B) = P(A | B) P(B).

That is, the probability that both A and B occur is the probability that A occurs given that B has occurred, multiplied by the probability that B occurs.
More generally, for events A_1, A_2, . . . , A_K:

P(A_1 ∩ A_2 ∩ · · · ∩ A_K) = P(A_1) P(A_2 | A_1) P(A_3 | A_1, A_2) · · · P(A_K | A_1, . . . , A_{K−1})

where, for example, P(A_3 | A_1, A_2) is shorthand for P(A_3 | A_1 ∩ A_2). The events can be taken in any order, as shown in Example 2.24.
Example 2.25 Suppose you draw 4 cards from a deck of 52 playing cards. What is
the probability of A = ‘the cards are the 4 aces (cards of rank 1)’ ?
We could calculate this using counting rules. There are \binom{52}{4} = 270,725 possible subsets of 4 different cards, and only 1 of these consists of the 4 aces. Therefore, P(A) = 1/270,725.
Let us try with conditional probabilities. Define Ai as ‘the ith card is an ace’, so
that A = A1 ∩ A2 ∩ A3 ∩ A4 . The necessary probabilities are:
P (A1 ) = 4/52 since there are initially 4 aces in the deck of 52 playing cards
P (A2 | A1 ) = 3/51. If the first card is an ace, 3 aces remain in the deck of 51
playing cards from which the second card will be drawn
P (A3 | A1 , A2 ) = 2/50
P(A_4 | A_1, A_2, A_3) = 1/49.

The chain rule then gives:

P(A) = P(A_1) P(A_2 | A_1) P(A_3 | A_1, A_2) P(A_4 | A_1, A_2, A_3) = (4/52) × (3/51) × (2/50) × (1/49) = 1/270,725

in agreement with the counting-rule calculation.
We now return to probabilities of partitions like the situation shown in Figure 2.11.
Figure 2.11: On the left, a Venn diagram depicting A = A1 ∪ A2 ∪ A3 , and on the right
the ‘paths’ to A.
Both diagrams in Figure 2.11 represent the partition A = A1 ∪ A2 ∪ A3 . For the next
results, it will be convenient to use diagrams like the one on the right in Figure 2.11,
where A1 , A2 and A3 are symbolised as different ‘paths’ to A.
We now develop powerful methods of calculating sums like:
P (A) = P (A1 ) + P (A2 ) + P (A3 ).
Figure 2.12: On the left, a Venn diagram depicting the set A and the partition of S, and on the right the ‘paths’ to A.

Total probability formula: if B_1, B_2, . . . , B_K form a partition of the sample space S, with P(B_i) > 0 for all i, then for any event A:

P(A) = ∑_{i=1}^{K} P(A | B_i) P(B_i).
Example 2.26 Any event B has the property that B and its complement B c
partition the sample space. So if we take K = 2, B1 = B and B2 = B c in the formula
of total probability, we get:
P (A) = P (A | B) P (B) + P (A | B c ) P (B c )
= P (A | B) P (B) + P (A | B c )(1 − P (B)).
(Diagram: the two ‘paths’ to A, via B and via Bᶜ.)
Example 2.27 Suppose that 1 in 10,000 people (0.01%) has a particular disease. A
diagnostic test for the disease has 99% sensitivity. If a person has the disease, the
test will give a positive result with a probability of 0.99. The test has 99% specificity.
If a person does not have the disease, the test will give a negative result with a
probability of 0.99.
Let B denote the presence of the disease, and B c denote no disease. Let A denote a
positive test result. We want to calculate P (A).
The probabilities we need are P (B) = 0.0001, P (B c ) = 0.9999, P (A | B) = 0.99 and
P (A | B c ) = 0.01. Therefore:
P (A) = P (A | B) P (B) + P (A | B c ) P (B c )
= 0.99 × 0.0001 + 0.01 × 0.9999
= 0.010098.
What is the probability that we got there via, say, B1 ? In other words, what is the
conditional probability P (B1 | A)? This situation is depicted in Figure 2.14.
So we need:

P(B_j | A) = P(A ∩ B_j)/P(A)

and we already know how to get this.
Bayes’ theorem

Using the chain rule and the total probability formula, we have:

P(B_j | A) = P(A | B_j) P(B_j) / ∑_{i=1}^{K} P(A | B_i) P(B_i)

for each j = 1, 2, . . . , K.
Example 2.28 Continuing with Example 2.27, let B denote the presence of the
disease, B c denote no disease, and A denote a positive test result.
We want to calculate P (B | A), i.e. the probability that a person has the disease,
given that the person has received a positive test result.
The probabilities we need are, from Example 2.27: P(B) = 0.0001, P(Bᶜ) = 0.9999, P(A | B) = 0.99 and P(A | Bᶜ) = 0.01. Therefore:
P(B | A) = P(A | B) P(B)/(P(A | B) P(B) + P(A | Bᶜ) P(Bᶜ)) = (0.99 × 0.0001)/0.010098 ≈ 0.0098.
Why is this so small? The reason is because most people do not have the disease and
the test has a small, but non-zero, false positive rate P (A | B c ). Therefore, most
positive test results are actually false positives.
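The same calculation wrapped in a small helper function, as a sketch (the parameter names are ours):

```python
def posterior_given_positive(prior, sensitivity, false_positive_rate):
    """P(disease | positive test), by Bayes' theorem with the
    partition {disease, no disease}."""
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_positive

print(round(posterior_given_positive(0.0001, 0.99, 0.01), 4))  # 0.0098
```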
Example 2.29 You are taking part in a gameshow. The host of the show, who is
known as Monty, shows you three outwardly identical boxes. In one of them is a
prize, and the other two are empty.
You are asked to select, but not open, one of the boxes. After you have done so,
Monty, who knows where the prize is, opens one of the two remaining boxes.
He always opens a box he knows to be empty, and randomly chooses which box to
open when he has more than one option (which happens when your initial choice
contains the prize).
After opening the empty box, Monty gives you the choice of either switching to the
other unopened box or sticking with your original choice. You then receive whatever
is in the box you choose.
What should you do, assuming you want to win the prize?
Suppose the three boxes are numbered 1, 2 and 3. Let us define the following events:

B_i = ‘the prize is in Box i’, for i = 1, 2, 3
M_i = ‘Monty opens Box i’, for i = 1, 2, 3.
Suppose you choose Box 1 first, and then Monty opens Box 3 (the answer works the
same way for all combinations of these). So Boxes 1 and 2 remain unopened.
What we want to know now are the conditional probabilities P (B1 | M3 ) and
P (B2 | M3 ).
You should switch boxes if P (B2 | M3 ) > P (B1 | M3 ), and stick with your original
choice otherwise. (You would be indifferent about switching if it was the case that
P (B2 | M3 ) = P (B1 | M3 ).)
Suppose that you first choose Box 1, and then Monty opens Box 3. Bayes’ theorem
tells us that:
P(B_2 | M_3) = P(M_3 | B_2) P(B_2) / (P(M_3 | B_1) P(B_1) + P(M_3 | B_2) P(B_2) + P(M_3 | B_3) P(B_3)).
We can assign values to each of these. Before any boxes are opened, P(B_1) = P(B_2) = P(B_3) = 1/3. Given the rules Monty follows: P(M_3 | B_1) = 1/2 (he chooses at random between Boxes 2 and 3), P(M_3 | B_2) = 1 (he must open Box 3) and P(M_3 | B_3) = 0 (he never opens the box with the prize). Therefore:

P(B_2 | M_3) = (1 × 1/3) / ((1/2) × (1/3) + 1 × (1/3) + 0 × (1/3)) = (1/3)/(1/2) = 2/3

and hence P(B_1 | M_3) = 1/3. Since P(B_2 | M_3) > P(B_1 | M_3), you should switch boxes.
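A simulation is a nice sanity check of the 2/3 answer; this sketch assumes Monty behaves exactly as described above:

```python
import random

def play(switch, trials=100_000):
    wins = 0
    for _ in range(trials):
        prize = random.randint(1, 3)
        choice = 1                                  # you pick Box 1
        # Monty opens an empty box which you did not choose
        options = [b for b in (1, 2, 3) if b != choice and b != prize]
        monty = random.choice(options)
        if switch:
            choice = next(b for b in (1, 2, 3) if b not in (choice, monty))
        wins += (choice == prize)
    return wins / trials

print(play(switch=False))   # about 1/3
print(play(switch=True))    # about 2/3
```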
Example 2.30 You are waiting for your bag at the baggage reclaim carousel of an
airport. Suppose that you know that there are 200 bags to come from your flight,
and you are counting the distinct bags which come out. Suppose that x bags have
arrived, and your bag is not among them. What is the probability that your bag will
not arrive at all, i.e. that it has been lost (or at least delayed)?
Define A = ‘your bag has been lost’ and x = ‘your bag is not among the first x bags
to arrive’. What we want to know is the conditional probability P (A | x) for any
x = 0, 1, 2, . . . , 200. The conditional probabilities the other way round are as follows.
P (x | A) = 1 for all x. If your bag has been lost, it will not arrive!
P (x | Ac ) = (200 − x)/200 if we assume that bags come out in a completely
random order.
By Bayes’ theorem, for any x:

P(A | x) = P(x | A) P(A) / (P(x | A) P(A) + P(x | Aᶜ) P(Aᶜ)) = P(A) / (P(A) + ((200 − x)/200)(1 − P(A)))

where P(A) is the prior probability that a bag is lost, which differs between airlines.

• For Air Malta, P(A | 199) = 0.469. So even when only 1 bag remains to arrive, the probability is less than 0.5 that your bag has been lost.
• For British Airways, P(A | 199) = 0.825. Also, we see that P(A | 197) = 0.541 is the first probability over 0.5.
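A sketch of the calculation. The prior probabilities used here are hypothetical values chosen because they reproduce the figures quoted above; they are not given in the text:

```python
def p_lost_given_not_arrived(prior, x, total=200):
    """P(bag lost | not among the first x bags), by Bayes' theorem."""
    p_not_arrived_if_ok = (total - x) / total   # bags come out in random order
    return prior / (prior + p_not_arrived_if_ok * (1 - prior))

# Hypothetical priors: these reproduce 0.469 and 0.825 at x = 199
for airline, prior in [("Air Malta", 0.0044), ("BA", 0.023)]:
    print(airline, round(p_lost_given_not_arrived(prior, 199), 3))
```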
Figure 2.15: Plot of P (A | x) as a function of x for the two airlines in Example 2.30, Air
Malta and British Airways (BA).
Chapter 3
Random variables
3.2 Learning outcomes

By the end of this chapter, you should be able to:

• define a random variable and distinguish it from the values which it takes
• explain the difference between discrete and continuous random variables
• find the mean and the variance of simple random variables, whether discrete or continuous
• prove and use simple properties of expected values and variances.
3.3 Introduction
In Chapter 1, we considered descriptive statistics for a sample of observations of a
variable X. Here we will represent the observations as a sequence of variables, denoted
as:
X1 , X2 , . . . , Xn
where n is the sample size.
In statistical inference, the observations will be treated as a sample drawn at random
from a population. We will then think of each observation Xi of a variable X as an
outcome of an experiment.
The experiment is ‘select a unit at random from the population and record its
value of X’.
The outcome is the observed value Xi of X.
Because variables X in statistical data are recorded as numbers, we can now focus on
experiments where the outcomes are also numbers – random variables.
59
3. Random variables
Random variable

A random variable is an experiment for which the outcomes are numbers.¹ This means that for a random variable:

the sample space S is a set of real numbers

the outcomes are numbers in this sample space (instead of 'outcomes', we often call them the values of the random variable).
There are two main types of random variables – discrete and continuous² – depending on the nature of S, i.e. the possible values of the random variable.
Notation
P (a < X < b) denotes the probability that X is between the numbers a and b.
You will notice that many of the quantities we define for random variables are
analogous to sample quantities defined in Chapter 1.
¹ This definition is a bit informal, but it is sufficient for this course.
² Strictly speaking, a discrete random variable is not just a random variable which is not continuous, as there are many others, such as mixture distributions.
3.4 Discrete random variables
Example 3.1 The following two examples will be used throughout this chapter: the size X of a randomly selected household (a discrete variable with a small number of possible values), and the number of failed free throws before a basketball player's first success (a discrete variable with infinitely many possible values).
Example 3.2 Consider the following probability distribution for the household size, X.³

 x (number of people in the household)   P(X = x)
 1                                       0.3002
 2                                       0.3417
 3                                       0.1551
 4                                       0.1336
 5                                       0.0494
 6                                       0.0145
 7                                       0.0034
 8                                       0.0021
Probability function

The probability function (pf) of a discrete random variable X is the function p(x) = P(X = x) for all real x.
We can talk of p(x) both as the pf of the random variable X, and as the pf of the
probability distribution of X. Both mean the same thing.
Alternative terminology: the pf of a discrete random variable is also often called the
probability mass function (pmf).
Alternative notation: instead of p(x), the pf is also often denoted by, for example, pX (x)
– especially when it is necessary to indicate clearly to which random variable the
function corresponds.
³ Source: ONS, National report for the 2001 Census, England and Wales. Table UV51.
Example 3.3 Continuing Example 3.2, here we can simply list all the values:
\[
p(x) =
\begin{cases}
0.3002 & \text{for } x = 1 \\
0.3417 & \text{for } x = 2 \\
0.1551 & \text{for } x = 3 \\
0.1336 & \text{for } x = 4 \\
0.0494 & \text{for } x = 5 \\
0.0145 & \text{for } x = 6 \\
0.0034 & \text{for } x = 7 \\
0.0021 & \text{for } x = 8 \\
0 & \text{otherwise.}
\end{cases}
\]
These are clearly all non-negative, and their sum is \( \sum_{x=1}^{8} p(x) = 1 \).
A graphical representation of the pf is shown in Figure 3.1.
[Figure 3.1: bar chart of the pf p(x) of household size, for x = 1, 2, . . . , 8.]
For the next example, we need to remember the following results from mathematics, concerning sums of geometric series. If r ≠ 1, then:
\[
\sum_{x=0}^{n-1} a r^x = \frac{a(1 - r^n)}{1 - r}
\]
and, if |r| < 1, the sum to infinity is:
\[
\sum_{x=0}^{\infty} a r^x = \frac{a}{1 - r}.
\]
Example 3.4 In the basketball example, the number of possible values is infinite, so we cannot simply list the values of the pf. So we try to express it as a formula. Suppose that each free throw succeeds with probability π, independently of all other throws, and let X denote the number of failures before the first success.

Hence the probability that the first success occurs after x failures is the probability of a sequence of x failures followed by a success, i.e. the probability is:
\[
(1 - \pi)^x \, \pi.
\]
So the pf of the random variable X (the number of failures before the first success) is:
\[
p(x) =
\begin{cases}
(1 - \pi)^x \, \pi & \text{for } x = 0, 1, 2, \ldots \\
0 & \text{otherwise}
\end{cases}
\tag{3.1}
\]
where 0 ≤ π ≤ 1. Let us check that (3.1) satisfies the conditions for a pf: the values are clearly all non-negative, and, using the sum to infinity of a geometric series (for π > 0),
\[
\sum_{x=0}^{\infty} (1 - \pi)^x \pi = \frac{\pi}{1 - (1 - \pi)} = 1.
\]
Figure 3.2: Probability function for Example 3.4. π = 0.7 indicates a fairly good free-throw shooter. π = 0.3 indicates a pretty poor free-throw shooter.
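As an illustrative aside, the two pf conditions can be checked numerically. A minimal Python sketch, truncating the infinite sum at a point where the tail is negligible:

```python
import numpy as np

def geom_pf(x, pi):
    return (1 - pi)**x * pi           # p(x) = (1 - pi)^x * pi

x = np.arange(2000)                   # truncation; the tail beyond is negligible
for pi in (0.7, 0.3):
    p = geom_pf(x, pi)
    print(pi, p.sum(), (x * p).sum()) # sum ~ 1, mean ~ (1 - pi)/pi
```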
The cumulative distribution function (cdf) of a discrete random variable X is defined as \( F(x) = P(X \le x) = \sum_{x_i \le x} p(x_i) \), i.e. the sum of the probabilities of the possible values of X which are less than or equal to x.
Example 3.5 Continuing with the household size example, values of F(x) at all possible values of X are:

 x (number of people)   p(x)     F(x)
 1                      0.3002   0.3002
 2                      0.3417   0.6419
 3                      0.1551   0.7970
 4                      0.1336   0.9306
 5                      0.0494   0.9800
 6                      0.0145   0.9945
 7                      0.0034   0.9979
 8                      0.0021   1.0000
For the basketball example, we can write:
\[
F(x) =
\begin{cases}
0 & \text{for } x < 0 \\
1 - (1 - \pi)^{x+1} & \text{for } x = 0, 1, 2, \ldots.
\end{cases}
\]
The cdf is shown in graphical form in Figure 3.4.
[Figure 3.3: cdf F(x) of the household size distribution – a step function rising from 0 to 1 over x = 1, . . . , 8.]
The cdf of a discrete random variable is a step function: F(x) jumps by p(xi) at each possible value xi, and is constant between the jumps; at such an xi, the value of F(xi) is the value at the top of the jump (i.e. F(x) is right-continuous).
Either the pf or the cdf can be used to calculate the probabilities of any events for a
discrete random variable.
Example 3.7 Continuing with the household size example (for the probabilities, see Example 3.5), we have, for instance, P(X ≤ 3) = F(3) = 0.7970, P(X > 2) = 1 − F(2) = 1 − 0.6419 = 0.3581, and P(2 ≤ X ≤ 4) = F(4) − F(1) = 0.9306 − 0.3002 = 0.6304.
[Figure 3.4: cdf F(x) of the distribution in Example 3.4, for π = 0.7 and π = 0.3; x = number of failures.]
The expected value (or mean) of X is denoted E(X), and defined as:
\[
E(X) = \sum_{x_i \in S} x_i \, p(x_i).
\]
This can also be written more concisely as \( E(X) = \sum_x x \, p(x) \), or simply \( E(X) = \sum x \, p(x) \).
We can talk of E(X) as the expected value of both the random variable X, and of the
probability distribution of X.
Alternative notation: instead of E(X), the symbol µ (the lower-case Greek letter ‘mu’),
or µX , is often used.
Suppose a sample contains K distinct values x1, x2, . . . , xK, with frequencies f1, f2, . . . , fK. The sample mean is then:
\[
\bar{X} = \frac{f_1 x_1 + f_2 x_2 + \cdots + f_K x_K}{f_1 + f_2 + \cdots + f_K}
= x_1 \hat{p}(x_1) + x_2 \hat{p}(x_2) + \cdots + x_K \hat{p}(x_K)
= \sum_{i=1}^{K} x_i \, \hat{p}(x_i)
\]
where:
\[
\hat{p}(x_i) = \frac{f_i}{\sum\limits_{i=1}^{K} f_i}
\]
are the sample proportions. Compare this with the expected value of a random variable with possible values x1, . . . , xK:
\[
E(X) = x_1 p(x_1) + x_2 p(x_2) + \cdots + x_K p(x_K) = \sum_{i=1}^{K} x_i \, p(x_i).
\]
So X̄ uses the sample proportions, p̂(xi), whereas E(X) uses the population probabilities, p(xi).
For the household size distribution, E(X) is calculated as follows:

 x (number of people)   p(x)     x p(x)
 1                      0.3002   0.3002
 2                      0.3417   0.6834
 3                      0.1551   0.4653
 4                      0.1336   0.5344
 5                      0.0494   0.2470
 6                      0.0145   0.0870
 7                      0.0034   0.0238
 8                      0.0021   0.0168
 Sum                             2.3579 = E(X)
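As an illustrative aside, calculations like the table above are easy to reproduce in software. A minimal Python sketch (E(X²) is included here for use when the variance is discussed later):

```python
x = [1, 2, 3, 4, 5, 6, 7, 8]
p = [0.3002, 0.3417, 0.1551, 0.1336, 0.0494, 0.0145, 0.0034, 0.0021]

mean = sum(xi * pi for xi, pi in zip(x, p))     # E(X)  = 2.3579
ex2  = sum(xi**2 * pi for xi, pi in zip(x, p))  # E(X^2) = 7.2585
print(mean, ex2, ex2 - mean**2)                 # variance ~ 1.699
```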
Example 3.9 For the basketball example, p(x) = (1 − π)x π for x = 0, 1, 2, . . ., and
0 otherwise.
Substituting y = x − 1:
\[
E(X) = \sum_{x=0}^{\infty} x (1-\pi)^x \pi
= \sum_{y=0}^{\infty} (y+1)(1-\pi)^{y+1} \pi
= (1-\pi) \underbrace{\sum_{y=0}^{\infty} y (1-\pi)^y \pi}_{=\,E(X)} + (1-\pi) \underbrace{\sum_{y=0}^{\infty} (1-\pi)^y \pi}_{=\,1}
= (1-\pi)(E(X) + 1).
\]
Solving E(X) = (1 − π) E(X) + (1 − π) for E(X) gives:
\[
E(X) = \frac{1 - \pi}{\pi}.
\]
So, before scoring a basket, a fairly good free-throw shooter (with π = 0.7) misses on
average about 0.42 shots, and a pretty poor free-throw shooter (with π = 0.3) misses
on average about 2.33 shots.
Example 3.10 To illustrate the use of expected values, let us consider the game of
roulette, from the point of view of the casino (‘The House’).
Suppose a player puts a bet of £1 on ‘red’. If the ball lands on any of the 18 red
numbers, the player gets that £1 back, plus another £1 from The House. If the result
is one of the 18 black numbers or the green 0, the player loses the £1 to The House.
We assume that the roulette wheel is unbiased, i.e. that all 37 numbers have equal
probabilities. What can we say about the probabilities and expected values of wins
and losses?
Define the random variable X = 'money received by The House'. Its possible values are −1 (the player wins) and 1 (the player loses). The probability function is:
\[
p(x) =
\begin{cases}
18/37 & \text{for } x = -1 \\
19/37 & \text{for } x = 1 \\
0 & \text{otherwise.}
\end{cases}
\]
The expected value is therefore:
\[
E(X) = (-1) \times \frac{18}{37} + 1 \times \frac{19}{37} = \frac{1}{37} \approx 0.027.
\]
On average, The House expects to win 2.7p for every £1 which players bet on red.
This expected gain is known as the house edge. It is positive for all possible bets in
roulette.
The edge is the expected gain from a single bet. Usually, however, players bet again
if they win at first – gambling can be addictive!
Consider a player who starts with £10 and bets £1 on red repeatedly until the
player either has lost all of the £10 or doubled their money to £20.
It can be shown that the probability that such a player reaches £20 before they go down to £0 is about 0.368. Define X = 'money received by The House', with the probability function:
\[
p(x) =
\begin{cases}
0.368 & \text{for } x = -10 \\
0.632 & \text{for } x = 10 \\
0 & \text{otherwise}
\end{cases}
\]
so that E(X) = (−10) × 0.368 + 10 × 0.632 = £2.64.
On average, The House can expect to keep about 26.4% of the money which players
like this bring to the table.
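As an illustrative aside, both the quoted probability 0.368 and the expected take can be checked by simulation. A minimal Python sketch (the function name is our own):

```python
import random

def house_take(start=10, goal=20, p_red=18/37, rng=random.Random(1)):
    money = start
    while 0 < money < goal:
        money += 1 if rng.random() < p_red else -1  # £1 on red each spin
    return start - money        # what The House keeps: +10 or -10

n = 20_000
takes = [house_take() for _ in range(n)]
print(sum(t == -10 for t in takes) / n)   # P(player doubles) ~ 0.368
print(sum(takes) / n)                     # expected take ~ £2.64
```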
In general, E(g(X)) ≠ g(E(X)) when g(X) is a non-linear function of X.
Suppose X is a random variable and a and b are constants, i.e. known numbers which are not random variables. Then:
\[
E(aX + b) = a\,E(X) + b.
\]
Proof: We have:
\[
E(aX + b) = \sum_x (ax + b)\,p(x)
= \sum_x ax\,p(x) + \sum_x b\,p(x)
= a \sum_x x\,p(x) + b \sum_x p(x)
= a\,E(X) + b
\]
using:

i. \( \sum_x x\,p(x) = E(X) \), by definition of the expected value

ii. \( \sum_x p(x) = 1 \), by definition of the probability function.
In the special case a = 0, the result E(aX + b) = a E(X) + b gives E(b) = b: the expected value of a constant is the constant itself.
The variance of X is defined as Var(X) = E((X − E(X))²), and the standard deviation as sd(X) = √Var(X). Both Var(X) and sd(X) are always ≥ 0, and both are measures of the dispersion (variation) of the random variable X.
Alternative notation: the variance is often denoted σ 2 (‘sigma squared’) and the
standard deviation by σ (‘sigma’).
An alternative formula: the variance can also be calculated as:
\[
\operatorname{Var}(X) = E((X - E(X))^2) = E(X^2) - (E(X))^2.
\]
For the household size example, E(X²) = 7.259, so Var(X) = 7.259 − (2.358)² = 1.699 and \( \operatorname{sd}(X) = \sqrt{\operatorname{Var}(X)} = \sqrt{1.699} = 1.30 \).
Example 3.14 For the basketball example, p(x) = (1 − π)^x π for x = 0, 1, 2, . . ., and 0 otherwise. It can be shown (although the proof is beyond the scope of the course) that for this distribution:
\[
\operatorname{Var}(X) = \frac{1 - \pi}{\pi^2}.
\]
In the two cases we have used as examples, Var(X) = 0.3/(0.7)² ≈ 0.61 when π = 0.7, and Var(X) = 0.7/(0.3)² ≈ 7.78 when π = 0.3.
So the variation in how many free throws a pretty poor shooter misses before the
first success is much higher than the variation for a fairly good shooter.
If X is a random variable and a and b are constants, then:
\[
\operatorname{Var}(aX + b) = a^2 \operatorname{Var}(X).
\]
Proof:
\[
\operatorname{Var}(aX + b) = E\big( ((aX + b) - E(aX + b))^2 \big)
= E\big( (aX - a\,E(X))^2 \big)
= E\big( a^2 (X - E(X))^2 \big)
= a^2\, E\big( (X - E(X))^2 \big)
= a^2 \operatorname{Var}(X).
\]
Therefore, sd(aX + b) = |a| sd(X).
If a = 0, this gives:
Var(b) = 0.
That is, the variance of a constant is 0. The converse also holds – if a random variable
has a variance of 0, it is actually a constant.
To summarise, for any random variable X and constants a and b:
\[
E(aX + b) = a\,E(X) + b
\]
\[
\operatorname{Var}(aX + b) = a^2 \operatorname{Var}(X) \quad \text{and} \quad \operatorname{sd}(aX + b) = |a|\,\operatorname{sd}(X)
\]
\[
E(b) = b \quad \text{and} \quad \operatorname{Var}(b) = \operatorname{sd}(b) = 0.
\]
We define \( \operatorname{Var}(X) = E((X - E(X))^2) = E(X^2) - (E(X))^2 \) and \( \operatorname{sd}(X) = \sqrt{\operatorname{Var}(X)} \). Also, Var(X) ≥ 0 and sd(X) ≥ 0 always, and Var(X) = sd(X) = 0 only if X is a constant.
Example 3.15 For further practice, let us consider a discrete random variable X
which has possible values 0, 1, 2, . . . , n, where n is a known positive integer, and X
has the following probability function:
\[
p(x) =
\begin{cases}
\binom{n}{x} \pi^x (1 - \pi)^{n-x} & \text{for } x = 0, 1, 2, \ldots, n \\
0 & \text{otherwise}
\end{cases}
\]
where 0 ≤ π ≤ 1 is a parameter.
Note: the examination may also contain questions like this. The difficulty of such
questions depends partly on the form of p(x), and what kinds of manipulations are
needed to work with it. So questions of this type may be very easy, or quite hard!
For the values x = 0, 1, 2, . . . , n, the value of the cdf is:
\[
F(x) = P(X \le x) = \sum_{y=0}^{x} \binom{n}{y} \pi^y (1 - \pi)^{n-y}.
\]
This does not simplify into a simple formula, so we just calculate the values from the definition, by summation.
Since X is a discrete random variable, F(x) is a step function. For E(X), we have:
\[
\begin{aligned}
E(X) &= \sum_{x=0}^{n} x \binom{n}{x} \pi^x (1-\pi)^{n-x}
= \sum_{x=1}^{n} x \binom{n}{x} \pi^x (1-\pi)^{n-x} \\
&= \sum_{x=1}^{n} \frac{n(n-1)!}{(x-1)!\,((n-1)-(x-1))!}\, \pi\, \pi^{x-1} (1-\pi)^{n-x} \\
&= n\pi \sum_{x=1}^{n} \binom{n-1}{x-1} \pi^{x-1} (1-\pi)^{n-x}
= n\pi \sum_{y=0}^{n-1} \binom{n-1}{y} \pi^{y} (1-\pi)^{(n-1)-y} \\
&= n\pi \times 1 = n\pi
\end{aligned}
\]
where y = x − 1, and the last summation is over all the values of the pf of another binomial distribution, this time with possible values 0, 1, 2, . . . , n − 1 and probability parameter π.
The variance of the distribution is Var(X) = nπ(1 − π). This is not derived here, but
will be proved in a different way later.
The moment generating function (mgf) of a random variable X is defined as MX(t) = E(e^{tX}). The form of the mgf is not interesting or informative in itself. Instead, the reason we define the mgf is that it is a convenient tool for deriving means and variances of distributions, using the following results:
\[
M_X'(0) = E(X) \quad \text{and} \quad M_X''(0) = E(X^2)
\]
so that Var(X) = M″X(0) − (M′X(0))². This is useful if the mgf is easier to derive than E(X) and Var(X) directly.
Other moments about zero are obtained from the mgf similarly:
\[
M_X^{(k)}(0) = E(X^k) \quad \text{for } k = 1, 2, \ldots.
\]
For the basketball example:
\[
M_X(t) = E(e^{tX}) = \sum_{x=0}^{\infty} e^{tx} (1-\pi)^x \pi = \pi \sum_{x=0}^{\infty} \big( (1-\pi)e^t \big)^x = \frac{\pi}{1 - e^t(1-\pi)}
\]
using the sum to infinity of a geometric series, for t < − ln(1 − π) to ensure convergence of the sum.
From the mgf MX(t) = π/(1 − e^t(1 − π)) we obtain:
\[
M_X'(t) = \frac{\pi(1-\pi)e^t}{(1 - e^t(1-\pi))^2}
\quad \text{and} \quad
M_X''(t) = \frac{\pi(1-\pi)e^t\,(1 - (1-\pi)e^t)(1 + (1-\pi)e^t)}{(1 - e^t(1-\pi))^4}
\]
and hence M′X(0) = (1 − π)/π = E(X), in agreement with the direct derivation above, and M″X(0) = (1 − π)(2 − π)/π² = E(X²), giving Var(X) = (1 − π)/π² as stated in Example 3.14.
Note: this uses the series expansion of the exponential function from calculus, i.e. that for any number a, we have:
\[
e^a = \sum_{x=0}^{\infty} \frac{a^x}{x!} = 1 + a + \frac{a^2}{2!} + \frac{a^3}{3!} + \cdots.
\]
From the mgf \( M_X(t) = e^{\lambda(e^t - 1)} \) we obtain:
\[
M_X'(t) = \lambda e^t e^{\lambda(e^t - 1)}
\quad \text{and} \quad
M_X''(t) = \lambda e^t (1 + \lambda e^t)\, e^{\lambda(e^t - 1)}
\]
and hence:
\[
M_X'(0) = \lambda = E(X) \qquad \text{and} \qquad M_X''(0) = \lambda(1 + \lambda) = E(X^2)
\]
and:
\[
\operatorname{Var}(X) = E(X^2) - (E(X))^2 = \lambda(1 + \lambda) - \lambda^2 = \lambda.
\]
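As an illustrative aside, mgf differentiation of this kind can be automated with a computer algebra system. A minimal Python sketch using the sympy library (assuming it is available):

```python
import sympy as sp

t, lam, pi = sp.symbols('t lambda pi', positive=True)

for M in (sp.exp(lam * (sp.exp(t) - 1)),        # Poisson mgf
          pi / (1 - sp.exp(t) * (1 - pi))):      # geometric (basketball) mgf
    m1 = sp.diff(M, t).subs(t, 0)                # E(X) = M'(0)
    m2 = sp.diff(M, t, 2).subs(t, 0)             # E(X^2) = M''(0)
    # prints E(X) and Var(X): lambda, lambda; (1-pi)/pi, (1-pi)/pi**2
    print(sp.simplify(m1), sp.simplify(m2 - m1**2))
```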
If the mgfs mentioned in these statements exist, then the following apply.
The mgf uniquely determines a probability distribution. In other words, if for two
random variables X and Y we have MX (t) = MY (t) (for points around t = 0), then
X and Y have the same distribution.
If X1, X2, . . . , Xn are independent and Y = X1 + X2 + · · · + Xn, then MY(t) = MX1(t) MX2(t) · · · MXn(t) and, in particular, if all the Xi s have the same distribution (that of X), then MY(t) = (MX(t))^n.
3.5 Continuous random variables
A random variable X is continuous if its possible values form a continuum. In other words, the set of possible values (the sample space) is the real numbers R, or one or more intervals in R.
Example 3.19 As a running example, consider the sizes X of claims (in £1,000s) made on a home insurance policy. Suppose the policy has a deductible of £999, so all claims are at least £1,000, i.e. X ≥ 1.
Most of the concepts introduced for discrete random variables have exact or
approximate analogies for continuous random variables, and many results are the same
for both types. However, there are some differences in the details. The most obvious
difference is that wherever in the discrete case there are sums over the possible values of
the random variable, in the continuous case these are integrals.
Suppose we model X with the probability density function (pdf):
\[
f(x) =
\begin{cases}
\alpha k^\alpha / x^{\alpha+1} & \text{for } x \ge k \\
0 & \text{otherwise}
\end{cases}
\]
where α > 0 is a parameter, and k > 0 (the smallest possible value of X) is a known number. In our example, k = 1 (due to the deductible). A probability distribution with this pdf is known as the Pareto distribution. A graph of this pdf when α = 2.2 is shown in Figure 3.5.
Unlike for probability functions of discrete random variables, in the continuous case values of the probability density function are not probabilities of individual values, i.e. f(x) ≠ P(X = x). In fact, for a continuous random variable:
\[
P(X = x) = 0 \quad \text{for all } x. \tag{3.3}
\]
That is, the probability that X has any particular value exactly is always 0.

Because of (3.3), with a continuous random variable we do not need to be very careful about differences between < and ≤, and between > and ≥. Therefore, the following probabilities are all equal:
\[
P(a < X < b) = P(a \le X < b) = P(a < X \le b) = P(a \le X \le b).
\]
[Figure 3.5: pdf f(x) of the Pareto distribution with k = 1 and α = 2.2.]
Example 3.20 In Figure 3.6, the shaded area is \( P(1.5 < X \le 3) = \int_{1.5}^{3} f(x)\,dx \).
Properties of pdfs
The pdf f (x) of any continuous random variable must satisfy the following conditions.
1. We require:
\[
f(x) \ge 0 \quad \text{for all } x.
\]
2. We require:
\[
\int_{-\infty}^{\infty} f(x)\,dx = 1.
\]
[Figure 3.6: the Pareto pdf with the area corresponding to P(1.5 < X ≤ 3) shaded.]
Example 3.21 Continuing with the insurance example, we check that the conditions hold for the pdf:
\[
f(x) =
\begin{cases}
\alpha k^\alpha / x^{\alpha+1} & \text{for } x \ge k \\
0 & \text{otherwise.}
\end{cases}
\]
1. Clearly, f(x) ≥ 0 for all x, since α > 0, k^α > 0 and x^{α+1} ≥ k^{α+1} > 0.
2. We have:
\[
\int_{-\infty}^{\infty} f(x)\,dx
= \int_{k}^{\infty} \frac{\alpha k^\alpha}{x^{\alpha+1}}\,dx
= \alpha k^\alpha \int_{k}^{\infty} x^{-\alpha-1}\,dx
= \alpha k^\alpha \left[ \frac{x^{-\alpha}}{-\alpha} \right]_{k}^{\infty}
= (-k^\alpha)(0 - k^{-\alpha})
= 1.
\]
The general properties of the cdf stated previously also hold for continuous
distributions. The cdf of a continuous distribution is not a step function, so results
on discrete-specific properties do not hold in the continuous case. A continuous cdf
is a smooth, continuous function of x.
Conversely, the pdf is obtained from the cdf by differentiation: f(x) = F′(x).
[Figure: cdf F(x) of the insurance claim distribution, plotted for x from 1 to 7.]
Example 3.23 Continuing with the insurance example (with k = 1 and α = 2.2), integrating the pdf gives the cdf:
\[
F(x) =
\begin{cases}
0 & \text{for } x < 1 \\
1 - x^{-2.2} & \text{for } x \ge 1
\end{cases}
\]
so that, for example, \( P(X > 2) = 2^{-2.2} \approx 0.22 \).
Example 3.24 Consider now a continuous random variable with the following pdf:
\[
f(x) =
\begin{cases}
\lambda e^{-\lambda x} & \text{for } x \ge 0 \\
0 & \text{otherwise}
\end{cases}
\tag{3.5}
\]
where λ > 0 is a parameter. This is the pdf of the exponential distribution. The uses of this distribution will be discussed in the next chapter.
Since:
\[
\int_{0}^{x} \lambda e^{-\lambda t}\,dt = \left[ -e^{-\lambda t} \right]_{0}^{x} = 1 - e^{-\lambda x}
\]
the cdf of the exponential distribution is:
\[
F(x) =
\begin{cases}
0 & \text{for } x < 0 \\
1 - e^{-\lambda x} & \text{for } x \ge 0.
\end{cases}
\]
1. Clearly, f(x) ≥ 0 for all x, since λ > 0.

2. Since we have just done the integration to derive the cdf F(x), we can also use it to show that f(x) integrates to one. This follows from:
\[
\int_{-\infty}^{\infty} f(x)\,dx = P(-\infty < X < \infty) = \lim_{x \to \infty} F(x) - \lim_{x \to -\infty} F(x) = 1 - 0 = 1.
\]
Mixed distributions

Some random variables have distributions which are partly discrete and partly continuous. For example, let X denote the amount paid out by an insurance policy, and suppose that P(X = 0) = π for some π ∈ (0, 1). Here π is the probability that a policy results in no payment.
no payment.
Among the rest, X follows a continuous distribution with the probabilities
distributed as (1 − π)f (x), where f (x) is a continuous pdf over x > 0. In other
words, this spreads the remaining probability (1 − π) over different non-zero values
of payments. For example, we could use the Pareto distribution for this loss
distribution f (x) (or actually as a distribution of X + k, since the company only
pays the amount above the deductible, k).
Suppose X is a continuous random variable with pdf f(x). Definitions of its expected value, the expected value of any transformation g(X), the variance and standard deviation are the same as for discrete distributions, except that summation is replaced by integration:
\[
E(X) = \int_{-\infty}^{\infty} x\,f(x)\,dx
\]
\[
E(g(X)) = \int_{-\infty}^{\infty} g(x)\,f(x)\,dx
\]
\[
\operatorname{Var}(X) = E((X - E(X))^2) = \int_{-\infty}^{\infty} (x - E(X))^2 f(x)\,dx = E(X^2) - (E(X))^2
\]
\[
\operatorname{sd}(X) = \sqrt{\operatorname{Var}(X)}.
\]
Example 3.25 For the Pareto distribution, introduced in Example 3.19, we have:
\[
\begin{aligned}
E(X) &= \int_{-\infty}^{\infty} x\,f(x)\,dx = \int_{k}^{\infty} x\,f(x)\,dx
= \int_{k}^{\infty} x\, \frac{\alpha k^\alpha}{x^{\alpha+1}}\,dx
= \int_{k}^{\infty} \frac{\alpha k^\alpha}{x^{\alpha}}\,dx \\
&= \frac{\alpha k}{\alpha - 1} \underbrace{\int_{k}^{\infty} \frac{(\alpha-1) k^{\alpha-1}}{x^{(\alpha-1)+1}}\,dx}_{=\,1}
= \frac{\alpha k}{\alpha - 1} \quad (\text{for } \alpha > 1).
\end{aligned}
\]
Here the last step follows because the last integrand has the form of the Pareto pdf with parameter α − 1, so its integral from k to ∞ is 1. This integral converges only if α − 1 > 0, i.e. if α > 1.
Similarly:
\[
E(X^2) = \int_{k}^{\infty} x^2 f(x)\,dx
= \int_{k}^{\infty} x^2\, \frac{\alpha k^\alpha}{x^{\alpha+1}}\,dx
= \int_{k}^{\infty} \frac{\alpha k^\alpha}{x^{\alpha-1}}\,dx
= \frac{\alpha k^2}{\alpha - 2} \underbrace{\int_{k}^{\infty} \frac{(\alpha-2) k^{\alpha-2}}{x^{(\alpha-2)+1}}\,dx}_{=\,1}
= \frac{\alpha k^2}{\alpha - 2} \quad (\text{for } \alpha > 2)
\]
and hence:
\[
\operatorname{Var}(X) = E(X^2) - (E(X))^2 = \frac{\alpha k^2}{\alpha - 2} - \left( \frac{\alpha k}{\alpha - 1} \right)^2 = \left( \frac{k}{\alpha - 1} \right)^2 \frac{\alpha}{\alpha - 2}.
\]
Expected values and variances are said to be infinite when the corresponding integral
does not exist (i.e. does not have a finite value).
For the Pareto distribution, the distribution is defined for all α > 0, but the mean is
infinite if α < 1 and the variance is infinite if α < 2. This happens because for small
values of α the distribution has very heavy tails, i.e. the probabilities of very large
values of X are non-negligible.
This is actually useful in some insurance applications, for example liability insurance
and medical insurance. There most claims are relatively small, but there is a
non-negligible probability of extremely large claims. The Pareto distribution with a
small α can be a reasonable representation of such situations. Figure 3.8 shows plots of
Pareto cdfs with α = 2.2 and α = 0.8. When α = 0.8, the distribution is so heavy-tailed
that E(X) is infinite.
[Figure 3.8: cdfs F(x) of Pareto distributions with α = 2.2 and α = 0.8, for x from 0 to 50.]
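As an illustrative aside, the heavy tails can be seen by simulation, generating Pareto values by the inverse-cdf method (a standard technique, not described in the text). A minimal Python sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def pareto_sample(alpha, k, size):
    # inverse-cdf method: F(x) = 1 - (k/x)^alpha  =>  x = k * U^(-1/alpha)
    return k * rng.uniform(size=size) ** (-1 / alpha)

for alpha in (2.2, 0.8):
    x = pareto_sample(alpha, 1.0, 1_000_000)
    # for alpha = 2.2 the sample mean settles near alpha/(alpha-1) = 1.83;
    # for alpha = 0.8 it keeps growing with the sample size (infinite mean)
    print(alpha, x.mean())
```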
For the exponential distribution (Example 3.26), integrating by parts gives:
\[
E(X^2) = \int_{0}^{\infty} x^2\, \lambda e^{-\lambda x}\,dx = \frac{2}{\lambda} \int_{0}^{\infty} x\, \lambda e^{-\lambda x}\,dx = \frac{2}{\lambda^2}
\]
where the last step follows because the last integral is simply E(X) = 1/λ again. Finally:
\[
\operatorname{Var}(X) = E(X^2) - (E(X))^2 = \frac{2}{\lambda^2} - \frac{1}{\lambda^2} = \frac{1}{\lambda^2}.
\]
The properties of the mgf stated in Section 3.4.8 also hold for continuous distributions.
If the expected value E(etX ) is infinite, the random variable X does not have an mgf.
For example, the Pareto distribution does not have an mgf for positive t.
For the exponential distribution, the mgf (Example 3.27) is:
\[
M_X(t) = E(e^{tX}) = \int_{0}^{\infty} e^{tx}\, \lambda e^{-\lambda x}\,dx = \frac{\lambda}{\lambda - t} \quad \text{for } t < \lambda
\]
from which we get \( M_X'(t) = \lambda/(\lambda - t)^2 \) and \( M_X''(t) = 2\lambda/(\lambda - t)^3 \), so:
\[
E(X) = M_X'(0) = \frac{1}{\lambda} \quad \text{and} \quad E(X^2) = M_X''(0) = \frac{2}{\lambda^2}
\]
and \( \operatorname{Var}(X) = E(X^2) - (E(X))^2 = 2/\lambda^2 - 1/\lambda^2 = 1/\lambda^2 \).
These agree with the results derived with a bit more work in Example 3.26.
The median of a random variable (i.e. of its probability distribution) is similar in spirit to the sample median: it is the value m which satisfies F(m) = 1/2. For the exponential distribution (Example 3.29), F(m) = 1 − e^{−λm} = 1/2 gives:
\[
e^{-\lambda m} = \frac{1}{2} \quad \Leftrightarrow \quad -\lambda m = -\ln 2 \quad \Leftrightarrow \quad m = \frac{\ln 2}{\lambda}.
\]
Chapter 4
Common distributions of random
variables
4.2 Learning outcomes

By the end of this chapter, you should be able to:

use the common discrete and continuous distributions introduced in this chapter

state properties of these distributions such as the expected value and variance.
4.3 Introduction
In statistical inference we will treat observations:
X1 , X2 , . . . , Xn
(the sample) as values of a random variable X, which has some probability distribution
(the population distribution).
How do we choose the population distribution? There is a large number of standard probability distributions, such that for most purposes we can find a suitable standard distribution among them.
This part of the course introduces some of the most common standard distributions for
discrete and continuous random variables.
Probability distributions may differ from each other in a broader or narrower sense. In
the broader sense, we have different families of distributions which may have quite
different characteristics, for example:
among continuous distributions: different sets of possible values (for example, all real numbers x, x ≥ 0, or x ∈ [0, 1]); symmetric versus skewed distributions.
The ‘distributions’ discussed in this chapter are really families of distributions in this
sense.
In the narrower sense, individual distributions within a family differ in having different
values of the parameters of the distribution. The parameters determine the mean and
variance of the distribution, values of probabilities from it etc.
In the statistical analysis of a random variable X we typically:

select a family of distributions which matches the key characteristics of X

use observed data to choose (estimate) values for the parameters of that distribution, and perform statistical inference on them.
Here we need a family of discrete distributions with only two possible values (0
and 1). The Bernoulli distribution (discussed in the next section), which has one
parameter π (the probability that Xi = 1) is appropriate.
Within the family of Bernoulli distributions, we use the one where the value of π is our best estimate based on the observed data; with 513 'successes' observed in 950 trials, this is π̂ = 513/950 = 0.54.
For the discrete uniform, Bernoulli, binomial, Poisson, continuous uniform, exponential
and normal distributions:
you should memorise their pf/pdf, cdf (if given), mean, variance and median (if
given)
you can use these in any examination question without proof, unless the question
directly asks you to derive them again.
For the other distributions mentioned in this course:

you do not need to memorise their pf/pdf or cdf; if needed for a question, these will be provided

if a question involves means, variances or other properties of these distributions, these will either be provided, or the question will ask you to derive them.
4.4 Common discrete distributions

The discrete uniform distribution is not very common in applications, but it is useful as a reference point for more complex distributions.
The Bernoulli distribution is used for variables with two possible outcomes, coded 0 and 1, for example:

agree / disagree

male / female
\[
E(X^2) = \sum_{x=0}^{1} x^2 p(x) = 0^2 \times (1 - \pi) + 1^2 \times \pi = \pi
\]
and:
\[
\operatorname{Var}(X) = E(X^2) - (E(X))^2 = \pi - \pi^2 = \pi(1 - \pi). \tag{4.4}
\]
The moment generating function is:
\[
M_X(t) = \sum_{x=0}^{1} e^{tx} p(x) = e^0 (1 - \pi) + e^t \pi = (1 - \pi) + \pi e^t.
\]
Suppose X1, X2, . . . , Xn are independent Bernoulli trials, each with the same probability of success π. Let X denote the total number of successes in these n trials. X follows a binomial distribution with parameters n and π, where n ≥ 1 is a known integer and 0 ≤ π ≤ 1.
This is often written as:
X ∼ Bin(n, π).
Example 4.4 A multiple choice test has 4 questions, each with 4 possible answers.
James is taking the test, but has no idea at all about the correct answers. So he
guesses every answer and, therefore, has the probability of 1/4 of getting any
individual question correct.
Let X denote the number of correct answers in James’ test. X follows the binomial
distribution with n = 4 and π = 0.25, i.e. we have:
X ∼ Bin(4, 0.25).
For example, what is the probability that James gets 3 of the 4 questions correct?
Here it is assumed that the guesses are independent, and each has the probability
π = 0.25 of being correct. The probability of any particular sequence of 3 correct
and 1 incorrect answers, for example 1110, is π 3 (1 − π)1 , where ‘1’ denotes a correct
answer and ‘0’ denotes an incorrect answer.
However, we do not care about the order of the 1s and 0s, only about the number of
1s. So 1101 and 1011, for example, also count as 3 correct answers. Each of these
also has the probability π 3 (1 − π)1 .
The total number of sequences with three 1s (and, therefore, one 0) is the number of locations for the three 1s which can be selected in the sequence of 4 answers. This is \( \binom{4}{3} = 4 \). Therefore, the probability of obtaining three 1s is:
\[
\binom{4}{3} \pi^3 (1 - \pi)^1 = 4 \times (0.25)^3 \times (0.75)^1 \approx 0.0469.
\]
We have already shown that (4.5) satisfies the conditions for being a probability
function in the previous chapter (see Example 3.15).
Example 4.6 Suppose a multiple choice examination has 20 questions, each with 4
possible answers. Consider again James who guesses each one of the answers. Let X
denote the number of correct answers by such a student, so that we have
X ∼ Bin(20, 0.25). For such a student, the expected number of correct answers is
E(X) = 20 × 0.25 = 5.
The teacher wants to set the pass mark of the examination so that, for such a
student, the probability of passing is less than 0.05. What should the pass mark be?
In other words, what is the smallest x such that P (X ≥ x) < 0.05, i.e. such that
P (X < x) ≥ 0.95?
Calculating the probabilities of x = 0, 1, 2, . . . , 20 we get (rounded to 2 decimal
places):
 x     0     1     2     3     4     5     6     7     8     9     10
 p(x)  0.00  0.02  0.07  0.13  0.19  0.20  0.17  0.11  0.06  0.03  0.01

 x     11    12    13    14    15    16    17    18    19    20
 p(x)  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
Calculating the cumulative probabilities, we find that F (7) = P (X < 8) = 0.898 and
F (8) = P (X < 9) = 0.959. Therefore, P (X ≥ 8) = 0.102 > 0.05 and also
P (X ≥ 9) = 0.041 < 0.05. The pass mark should be set at 9.
More generally, consider a student who has the same probability π of the correct
answer for every question, so that X ∼ Bin(20, π). Figure 4.1 shows plots of the
probabilities for π = 0.25, 0.5, 0.7 and 0.9.
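As an illustrative aside, the cumulative probabilities used in Example 4.6 can be reproduced with standard software. A minimal Python sketch using scipy:

```python
from scipy.stats import binom

X = binom(20, 0.25)
for x in range(6, 12):
    # P(X >= x) = 1 - F(x - 1); we want the smallest x with P(X >= x) < 0.05
    print(x, 1 - X.cdf(x - 1))
# P(X >= 8) = 0.102 and P(X >= 9) = 0.041, so the pass mark is 9
```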
[Figure 4.1: pfs of the Bin(20, π) distribution for π = 0.25, 0.5, 0.7 and 0.9 (four panels).]
Poisson distributions are used for counts of occurrences of various kinds. To give a
formal motivation, suppose that we consider the number of occurrences of some
phenomenon in time, and that the process which generates the occurrences satisfies the
following conditions:
1. The numbers of occurrences in any two disjoint intervals of time are independent of
each other.
2. The probability of two or more occurrences at the same time is negligibly small.
3. The probability of one occurrence in any short time interval of length t is approximately λt, for some constant λ > 0.

When these conditions hold, the number of occurrences X in a time interval of unit length follows a Poisson distribution with parameter λ > 0, denoted X ∼ Poisson(λ), with pf:
\[
p(x) = \frac{e^{-\lambda} \lambda^x}{x!} \quad \text{for } x = 0, 1, 2, \ldots
\]
and 0 otherwise; for this distribution E(X) = Var(X) = λ.
Example 4.7 Examples of variables for which we might use a Poisson distribution include counts of occurrences over time, such as the number of telephone calls received per hour, or the number of traffic accidents at a junction per week.
Because λ is the rate per unit of time, its value also depends on the unit of time (that
is, the length of interval) we consider.
Example 4.8 If X is the number of arrivals per hour and X ∼ Poisson(1.5), then if
Y is the number of arrivals per two hours, Y ∼ Poisson(1.5 × 2) = Poisson(3).
Example 4.9 Figure 4.2 shows the probabilities p(x) for x = 0, 1, 2, . . . , 10 for
X ∼ Poisson(2) and X ∼ Poisson(4).
[Figure 4.2: pfs p(x) of the Poisson(2) and Poisson(4) distributions for x = 0, 1, 2, . . . , 10.]
When n is large and π is small, the distribution of X ∼ Bin(n, π) is well approximated by the Poisson distribution with λ = nπ. This is sometimes known as the 'law of small numbers'.
Example 4.11 A classic example (from Bortkiewicz (1898) Das Gesetz der kleinen
Zahlen) helps to remember the key elements of the ‘law of small numbers’.
Figure 4.3 shows the numbers of soldiers killed by horsekick in each of 14 army corps
of the Prussian army in each of the years spanning 1875–94.
Suppose that the number of men killed by horsekicks in one corps in one year is X ∼ Bin(n, π), where n (the number of soldiers in a corps) is large and π (the probability that a particular soldier is killed by a horsekick in a year) is small.
The sample mean of the counts is x̄ = 0.7, which we use as λ for the Poisson
distribution. X ∼ Poisson(0.7) is indeed a good fit to the data, as shown in Figure
4.4.
Figure 4.3: Numbers of soldiers killed by horsekick in each of 14 army corps of the Prussian
army in each of the years spanning 1875–94. Source: Bortkiewicz (1898) Das Gesetz der
kleinen Zahlen, Leipzig: Teubner.
[Figure 4.4: sample proportions of yearly counts of men killed, compared with the Poisson(0.7) probabilities, for 0 to 6 men killed.]
Example 4.12 An airline is selling tickets to a flight with 198 seats. It knows that,
on average, about 1% of customers who have bought tickets fail to arrive for the
flight. Because of this, the airline overbooks the flight by selling 200 tickets. What is
the probability that everyone who arrives for the flight will get a seat?
Let X denote the number of people who fail to turn up. Using the binomial distribution, X ∼ Bin(200, 0.01). Everyone who arrives gets a seat if X ≥ 2, and we have:
\[
P(X \ge 2) = 1 - P(X = 0) - P(X = 1) = 1 - (0.99)^{200} - 200(0.01)(0.99)^{199} \approx 0.595.
\]
Alternatively, using the Poisson approximation with λ = nπ = 2:
\[
P(X \ge 2) \approx 1 - e^{-2} - 2e^{-2} \approx 0.594.
\]
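As an illustrative aside, both the exact binomial value and its Poisson approximation are one-liners in software. A minimal Python sketch using scipy:

```python
from scipy.stats import binom, poisson

# everyone gets a seat iff at least 2 of the 200 ticket-holders fail to turn up
print(1 - binom.cdf(1, 200, 0.01))   # exact binomial: ~0.5954
print(1 - poisson.cdf(1, 2))         # Poisson approximation, lambda = 2: ~0.5940
```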
Geometric(π) distribution.
• Distribution of the number of failures in Bernoulli trials before the first success.
• π is the probability of success at each trial.
• The sample space is 0, 1, 2, . . ..
• See the basketball example in Chapter 3.
Hypergeometric(n, A, B) distribution.
• Experiment where initially A + B objects are available for selection, and A of
them represent ‘success’.
• n objects are selected at random, without replacement.
• Hypergeometric is then the distribution of the number of successes.
• The sample space is the integers x where max{0, n − B} ≤ x ≤ min{n, A}.
• If the selection was with replacement, the distribution of the number of
successes would be Bin(n, A/(A + B)).
Multinomial(n, π1 , π2 , . . . , πk ) distribution.
• Here π1 + π2 + · · · + πk = 1, and the πi s are the probabilities of the values
1, 2, . . . , k.
• If n = 1, the sample space is 1, 2, . . . , k. This is essentially a generalisation of
the discrete uniform distribution, but with non-equal probabilities πi .
• If n > 1, the sample space is the vectors (n1 , n2 , . . . , nk ) where ni ≥ 0 for all i,
and n1 + n2 + · · · + nk = n. This is essentially a generalisation of the binomial
to the case where each trial has k ≥ 2 possible outcomes, and the random
variable records the numbers of each outcome in n trials. Note that with
k = 2, Multinomial(n, π1 , π2 ) is essentially the same as Bin(n, π) with π = π2
(or with π = π1 ).
• When n > 1, the multinomial distribution is the distribution of a multivariate
random variable, as discussed later in the course.
4.5 Common continuous distributions

In this section we consider the:
Uniform distribution.
Exponential distribution.
Normal distribution.
The continuous uniform distribution on an interval [a, b], denoted X ∼ Uniform[a, b], has pdf:
\[
f(x) =
\begin{cases}
1/(b - a) & \text{for } a \le x \le b \\
0 & \text{otherwise.}
\end{cases}
\]
The pdf is 'flat', as shown in Figure 4.5 (along with the cdf). Clearly, f(x) ≥ 0 for all x, and:
\[
\int_{-\infty}^{\infty} f(x)\,dx = \int_{a}^{b} \frac{1}{b-a}\,dx = \frac{1}{b-a} \big[ x \big]_{a}^{b} = \frac{1}{b-a}(b - a) = 1.
\]
The cdf is:
\[
F(x) = P(X \le x) = \int_{a}^{x} f(t)\,dt =
\begin{cases}
0 & \text{for } x < a \\
(x - a)/(b - a) & \text{for } a \le x \le b \\
1 & \text{for } x > b.
\end{cases}
\]
Figure 4.5: Continuous uniform distribution pdf (left) and cdf (right).
It was shown in the previous chapter that this satisfies the conditions for a pdf (see
Example 3.24). The general shape of the pdf is that of ‘exponential decay’, as shown in
Figure 4.6 (hence the name).
The cdf of the Exp(λ) distribution is:
\[
F(x) =
\begin{cases}
0 & \text{for } x < 0 \\
1 - e^{-\lambda x} & \text{for } x \ge 0.
\end{cases}
\]
The cdf is shown in Figure 4.7 for λ = 1.6.
For X ∼ Exp(λ), we have:
\[
E(X) = \frac{1}{\lambda} \quad \text{and} \quad \operatorname{Var}(X) = \frac{1}{\lambda^2}.
\]
These have been derived in the previous chapter (see Example 3.26). The median of the distribution, also previously derived (see Example 3.29), is:
\[
m = \frac{\ln 2}{\lambda} = (\ln 2) \times \frac{1}{\lambda} = (\ln 2)\,E(X) \approx 0.69 \times E(X).
\]
[Figure 4.6: pdf f(x) of the exponential distribution, showing 'exponential decay'. Figure 4.7: cdf F(x) of Exp(1.6).]
Note that the median is always smaller than the mean, because the distribution is
skewed to the right.
The moment generating function of the exponential distribution (derived in Example 3.27) is:
\[
M_X(t) = \frac{\lambda}{\lambda - t} \quad \text{for } t < \lambda.
\]
The exponential is, among other things, a basic distribution of waiting times of various
kinds. This arises from a connection between the Poisson distribution – the simplest
distribution for counts – and the exponential.
If the number of events per unit of time has a Poisson distribution with parameter
λ, the time interval (measured in the same units of time) between two successive
events has an exponential distribution with the same parameter λ.
E(X) = λ for Pois(λ), i.e. a large λ means many events per unit of time, on average.
E(X) = 1/λ for Exp(λ), i.e. a large λ means short waiting times between successive
events, on average.
For example, suppose customers arrive at random at a rate of λ = 1.6 per minute, so that the number of arrivals per minute is Poisson(1.6) and the waiting time X (in minutes) between successive arrivals is Exp(1.6). From this exponential distribution, the expected waiting time between arrivals of customers is E(X) = 1/1.6 = 0.625 (minutes) and the median is calculated to be (ln 2) × 0.625 = 0.433.

We can also calculate probabilities of waiting times between arrivals, using the cumulative distribution function:
\[
F(x) =
\begin{cases}
0 & \text{for } x < 0 \\
1 - e^{-1.6x} & \text{for } x \ge 0.
\end{cases}
\]
For example, \( P(X \le 1) = 1 - e^{-1.6} \approx 0.80 \) and \( P(X > 2) = e^{-3.2} \approx 0.04 \).
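As an illustrative aside, a minimal Python sketch reproducing these quantities:

```python
import math

lam = 1.6                                     # arrivals per minute
F = lambda x: 1 - math.exp(-lam * x)          # cdf of Exp(1.6)
print(F(1))               # P(X <= 1) ~ 0.798
print(1 - F(2))           # P(X > 2)  ~ 0.041
print(math.log(2) / lam)  # median    ~ 0.433 minutes
```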
Many variables have distributions which are approximately normal, for example
heights of humans or animals, and weights of various products.
[Figure: example pdfs of skewed distributions with parameters alpha and beta – panels for (alpha, beta) = (0.5, 1), (1, 0.5), (2, 1) and (2, 0.25).]
The normal distribution has a crucial role in statistical inference. This will be discussed later in the course.
The pdf of the normal distribution is:
\[
f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right) \quad \text{for } -\infty < x < \infty
\]
where π is the mathematical constant (i.e. π = 3.14159 . . .), and µ and σ² are parameters, with −∞ < µ < ∞ and σ² > 0.

A random variable X with this pdf is said to have a normal distribution with mean µ and variance σ², denoted X ∼ N(µ, σ²).
Clearly, f(x) ≥ 0 for all x. Also, it can be shown that \( \int_{-\infty}^{\infty} f(x)\,dx = 1 \) (do not attempt to show this), so f(x) really is a pdf.
The proof of the second point, which is somewhat elaborate, is shown in a separate note
on the ST102 Moodle site. This note is not examinable!
If X ∼ N(µ, σ²), then E(X) = µ and Var(X) = σ², and, therefore, the standard deviation is sd(X) = σ.
A (non-examinable) proof of this is given in a separate note on the ST102 Moodle site.
It uses the moment generating function of the normal distribution, which is shown to be:
\[
M_X(t) = \exp\!\left( \mu t + \frac{\sigma^2 t^2}{2} \right) \quad \text{for } -\infty < t < \infty.
\]
The mean can also be inferred from the observation that the normal pdf is symmetric
about µ. This also implies that the median of the normal distribution is µ.
The normal density is the so-called ‘bell curve’. The two parameters affect it as follows.
N (0, 1) and N (5, 1) have the same dispersion but different location: the N (5, 1)
curve is identical to the N (0, 1) curve, but shifted 5 units to the right
N (0, 1) and N (0, 9) have the same location but different dispersion: the N (0, 9)
curve is centered at the same value, 0, as the N (0, 1) curve, but spread out more
widely.
[Figure: pdfs of N(0, 1), N(5, 1) and N(0, 9), illustrating differences of location and dispersion.]
We now consider one of the convenient properties of the normal distribution. Suppose
X is a random variable, and we consider the linear transformation Y = aX + b, where a
and b are constants.
Whatever the distribution of X, it is true that E(Y ) = a E(X) + b and also that
Var(Y ) = a2 Var(X).
Furthermore, if X is normally distributed, then so is Y . In other words, if
X ∼ N (µ, σ 2 ), then:
Y = aX + b ∼ N (aµ + b, a2 σ 2 ). (4.7)
This type of result is not true in general. For other families of distributions, the
distribution of Y = aX + b is not always in the same family as X.
Let us apply (4.7) with a = 1/σ and b = −µ/σ, to get:
\[
Z = \frac{1}{\sigma} X - \frac{\mu}{\sigma} = \frac{X - \mu}{\sigma} \sim N\!\left( \frac{\mu}{\sigma} - \frac{\mu}{\sigma},\ \left( \frac{1}{\sigma} \right)^2 \sigma^2 \right) = N(0, 1).
\]
Statistical tables of the standard normal distribution (for example, Table 3 of Murdoch and Barnes' Statistical Tables) appear to have two limitations.

1. They are only for N(0, 1), not for N(µ, σ²) for any other µ and σ².
2. Even for N(0, 1), they only show probabilities for z ≥ 0.

We next show how these are not really limitations, starting with '2.'.
The key to using the tables is that the standard normal distribution is symmetric about
0. This means that for an interval in one tail, its ‘mirror image’ in the other tail has the
same probability. Another way to justify these results is that if Z ∼ N (0, 1), then also
−Z ∼ N (0, 1).
Suppose that z ≥ 0, so that −z ≤ 0. Table 3 shows:
\[
P(Z > z) = 1 - \Phi(z) = P_z
\]
which is called Pz for short. From it, we also get the following probabilities.

P(Z ≤ z) = Φ(z) = 1 − P(Z > z) = 1 − Pz.

P(Z ≤ −z) = Φ(−z) = P(−Z ≥ z) = P(Z ≥ z) = P(Z > z) = Pz.

P(Z > −z) = 1 − Φ(−z) = P(−Z < z) = P(Z < z) = 1 − Pz.
In each of these, ≤ can be replaced by <, and ≥ by > (see Section 3.5). Figure 4.11
shows tail probabilities for the standard normal distribution.
Example 4.15 Consider the 0.2005 value in the '0.8' row and '0.04' column of Table 3 of Murdoch and Barnes' Statistical Tables, which shows that P(Z > 0.84) = 0.2005, and hence, for example, Φ(0.84) = P(Z ≤ 0.84) = 0.7995.
Example 4.16 Let X denote the diastolic blood pressure of a randomly selected
person in England. This is approximately distributed as X ∼ N (74.2, 127.87).
Suppose we want to know the probabilities of the following intervals: X > 90, X < 60 and 60 ≤ X ≤ 90.
Standardising X gives Z = (X − 74.2)/11.31 ∼ N(0, 1) (note that σ = √127.87 = 11.31), and we can refer values of this standardised variable to Table 3 of Murdoch and Barnes' Statistical Tables.
\[
P(X > 90) = P\!\left( \frac{X - 74.2}{11.31} > \frac{90 - 74.2}{11.31} \right) = P(Z > 1.40) = 1 - \Phi(1.40) = 1 - 0.9192 = 0.0808
\]
and:
\[
P(X < 60) = P\!\left( \frac{X - 74.2}{11.31} < \frac{60 - 74.2}{11.31} \right) = P(Z < -1.26) = P(Z > 1.26) = 1 - \Phi(1.26) = 1 - 0.8962 = 0.1038.
\]
Finally:
\[
P(60 \le X \le 90) = P(X \le 90) - P(X < 60) = 0.8152.
\]
These probabilities are shown in Figure 4.12.
[Figure 4.12: pdf of N(74.2, 127.87) with the regions X < 60 (probability 0.10), 60 ≤ X ≤ 90 (0.82) and X > 90 (0.08) marked.]
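As an illustrative aside, these probabilities can be reproduced without tables. A minimal Python sketch using scipy (small differences from the values above arise from rounding z to two decimal places when using tables):

```python
from scipy.stats import norm

X = norm(74.2, 127.87 ** 0.5)    # N(74.2, 127.87), so sd = 11.31
print(X.sf(90))                  # P(X > 90)        ~ 0.08
print(X.cdf(60))                 # P(X < 60)        ~ 0.10
print(X.cdf(90) - X.cdf(60))     # P(60 <= X <= 90) ~ 0.82
```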
Figure 4.13: Some probabilities around the mean for the normal distribution, for example P(µ − σ < X < µ + σ) ≈ 0.683.
Unfortunately, there is one small caveat. The binomial distribution is discrete, but the
normal distribution is continuous. To see why this is problematic, consider the following.
Suppose X ∼ Bin(40, 0.4). Since X is discrete, such that x = 0, 1, 2, . . . , 40, then:
P (X ≤ 4) = P (X ≤ 4.5) = P (X < 5)
since P (4 < X ≤ 4.5) = 0 and P (4.5 < X < 5) = 0 due to the ‘gaps’ in the probability
mass for this distribution. In contrast if Y ∼ N (16, 9.6), then:
P (Y ≤ 4) < P (Y ≤ 4.5) < P (Y < 5)
since P (4 < Y < 4.5) > 0 and P (4.5 < Y < 5) > 0 because this is a continuous
distribution.
The accepted way to circumvent this problem is to use a continuity correction which
corrects for the effects of the transition from a discrete Bin(n, π) distribution to a
continuous N (nπ, nπ(1 − π)) distribution.
Continuity correction

To approximate probabilities for a discrete X ∼ Bin(n, π) using the continuous Y ∼ N(nπ, nπ(1 − π)), each integer value x of X is represented by the interval (x − 0.5, x + 0.5). Thus P(X ≤ x) is approximated by P(Y ≤ x + 0.5), and P(X ≥ x) by P(Y ≥ x − 0.5).
Example 4.17 In the UK general election in May 2010, the Conservative Party
received 36.1% of the votes. We carry out an opinion poll in November 2014, where
we survey 1,000 people who say they voted in 2010, and ask who they would vote for
if a general election was held now. Let X denote the number of people who say they
would now vote for the Conservative Party.
Suppose we assume that X ∼ Bin(1,000, 0.361), with normal approximation Y ∼ N(361, 230.68) (since nπ = 361 and nπ(1 − π) = 230.68). What is the probability that at least 400 respondents say they would vote for the Conservative Party? Using the continuity correction:
\[
P(X \ge 400) \approx P(Y \ge 399.5) = P\!\left( \frac{Y - 361}{\sqrt{230.68}} \ge \frac{399.5 - 361}{\sqrt{230.68}} \right) = P(Z \ge 2.53) = 1 - \Phi(2.53) = 0.0057.
\]
3. Suppose that 300 respondents in the actual survey say they would vote for the
Conservative Party now. What do you conclude from this?
From the answer to Question 2, we know that P (X ≤ 300) < 0.01, if π = 0.361.
In other words, if the Conservatives’ support remains 36.1%, we would be very
unlikely to get a random sample where only 300 (or fewer) respondents would
say they would vote for the Conservative Party.
Now X = 300 is actually observed. We can then conclude one of two things (if
we exclude other possibilities, such as a biased sample or lying by the
respondents).
(a) The Conservatives’ true level of support is still 36.1% (or even higher), but
by chance we ended up with an unusual sample with only 300 of their
supporters.
(b) The Conservatives’ true level of support is currently less than 36.1% (in
which case getting 300 in the sample would be more probable).
Here (b) seems a more plausible conclusion than (a). This kind of reasoning is
the basis of statistical significance tests.
There are two kinds of statistics, the kind you look up and the kind you make
up.
(Rex Stout)
Chapter 5
Multivariate random variables
5.3 Introduction
So far, we have considered univariate situations, that is one random variable at a time.
Now we will consider multivariate situations, that is two or more random variables at
once, and together.
In particular, we consider two somewhat different types of multivariate situations.
A multivariate random variable (random vector) collects several random variables together:
\[
\mathbf{X} = (X_1, X_2, \ldots, X_n)'
\]
with observed values x = (x1, x2, . . . , xn)′.

5.4 Joint probability functions

The joint probability function of a discrete multivariate random variable is defined as:
\[
p(x_1, x_2, \ldots, x_n) = P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n)
\]
for all vectors (x1, x2, . . . , xn) of n real numbers. The value p(x1, x2, . . . , xn) of the joint probability function is itself a single number, not a vector.
In the bivariate case, this is:
p(x, y) = P (X = x, Y = y)
which we sometimes write as pX,Y (x, y) to make the random variables clear.
Example 5.2 Consider a randomly selected football match in the English Premier League (EPL), and the two random variables X = the number of goals scored by the home team, and Y = the number of goals scored by the visiting team.
Suppose both variables have possible values 0, 1, 2 and 3 (to keep this example
simple, we have recorded the small number of scores of 4 or greater also as 3).
Consider the joint distribution of (X, Y ). We use probabilities based on data from
the 2009–10 EPL season.
Suppose the values of pX,Y(x, y) = p(x, y) = P(X = x, Y = y) are the following:

               Y = y
 X = x    0      1      2      3
 0        0.100  0.031  0.039  0.031
 1        0.100  0.146  0.092  0.015
 2        0.085  0.108  0.092  0.023
 3        0.062  0.031  0.039  0.006
The joint probability function gives probabilities of values of (X, Y ), for example:
A 1–1 draw, which is the most probable single result, has probability
P (X = 1, Y = 1) = p(1, 1) = 0.146.
For example, starting from the joint distribution of (X1, X2, X3, X4), the marginal joint pf of (X1, X2) is:
\[
p_{X_1, X_2}(x_1, x_2) = \sum_{x_3} \sum_{x_4} p(x_1, x_2, x_3, x_4)
\]
where the sum is of the values of the joint pf of (X1, X2, X3, X4) over all possible values of X3 and X4.
The simplest marginal distributions are those of individual variables in the multivariate
random variable.
The marginal pf is then obtained by summing the joint pf over all the other variables.
The resulting marginal distribution is univariate, and its pf is a univariate pf.
For the bivariate distribution of (X, Y ) the univariate marginal distributions are
those of X and Y individually. Their marginal pfs are:
\[
p_X(x) = \sum_y p(x, y) \quad \text{and} \quad p_Y(y) = \sum_x p(x, y).
\]
Example 5.4 Continuing with the football example introduced in Example 5.2, the joint and marginal probability functions are:

               Y = y
 X = x    0      1      2      3      pX(x)
 0        0.100  0.031  0.039  0.031  0.201
 1        0.100  0.146  0.092  0.015  0.353
 2        0.085  0.108  0.092  0.023  0.308
 3        0.062  0.031  0.039  0.006  0.138
 pY(y)    0.347  0.316  0.262  0.075  1.000
For example:
\[
p_X(0) = \sum_{y=0}^{3} p(0, y) = p(0, 0) + p(0, 1) + p(0, 2) + p(0, 3) = 0.100 + 0.031 + 0.039 + 0.031 = 0.201.
\]
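As an illustrative aside, marginal pfs are simply row and column sums of the joint table, which is easy to verify in software. A minimal Python sketch:

```python
import numpy as np

# joint pf p(x, y): rows x = home goals 0..3, columns y = away goals 0..3
p = np.array([[0.100, 0.031, 0.039, 0.031],
              [0.100, 0.146, 0.092, 0.015],
              [0.085, 0.108, 0.092, 0.023],
              [0.062, 0.031, 0.039, 0.006]])

print(p.sum(axis=1))   # marginal pf of X: [0.201, 0.353, 0.308, 0.138]
print(p.sum(axis=0))   # marginal pf of Y: [0.347, 0.316, 0.262, 0.075]
```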
Even for a multivariate random variable, expected values E(Xi ), variances Var(Xi ) and
medians of individual variables are obtained from the univariate (marginal)
distributions of Xi , as defined in Chapter 3.
Example 5.6 For a randomly selected man (aged over 16) in England, let X = height (in cm) and Y = weight (in kg); the joint distribution of (X, Y) is approximately bivariate normal.
Figure 5.2: Bivariate joint pdf (contour plot) for Example 5.6, with height (cm) and weight (kg) on the axes.
[Figure: the same bivariate joint pdf f(x, y) of height (cm) and weight (kg), drawn as a surface plot.]
5.7 Conditional distributions

Let x be one possible value of X, for which pX(x) > 0. The conditional distribution of Y given that X = x is the discrete probability distribution with the pf:
\[
p_{Y|X}(y \mid x) = \frac{p_{X,Y}(x, y)}{p_X(x)}
\]
for any value y.
Example 5.7 Recall that in the football example the joint and marginal pfs were:

               Y = y
 X = x    0      1      2      3      pX(x)
 0        0.100  0.031  0.039  0.031  0.201
 1        0.100  0.146  0.092  0.015  0.353
 2        0.085  0.108  0.092  0.023  0.308
 3        0.062  0.031  0.039  0.006  0.138
 pY(y)    0.347  0.316  0.262  0.075  1.000
We can now calculate the conditional pf of Y given X = x for each x, i.e. of away goals given home goals. For example, pY|X(0 | 0) = p(0, 0)/pX(0) = 0.100/0.201 = 0.498 and pY|X(1 | 0) = 0.031/0.201 = 0.154.
           pY|X(y | x) when y is:
 X = x    0      1      2      3      Sum
 0        0.498  0.154  0.194  0.154  1.00
 1        0.283  0.414  0.261  0.042  1.00
 2        0.276  0.351  0.299  0.075  1.00
 3        0.449  0.225  0.283  0.043  1.00
For example:

if the home team scores 0 goals, the probability that the visiting team scores 1 goal is pY|X(1 | 0) = 0.154

if the home team scores 1 goal, the probability that the visiting team wins the match is pY|X(2 | 1) + pY|X(3 | 1) = 0.261 + 0.042 = 0.303.
The conditional distribution and pf of X given Y = y (for any y such that pY (y) > 0) is
defined similarly, with the roles of X and Y reversed:
\[
p_{X|Y}(x \mid y) = \frac{p_{X,Y}(x, y)}{p_Y(y)}
\]
for any value x.
Conditional distributions are general and are not limited to the bivariate case. If X
and/or Y are vectors of random variables, the conditional pf of Y given X = x is:
\[
p_{\mathbf{Y}|\mathbf{X}}(\mathbf{y} \mid \mathbf{x}) = \frac{p_{\mathbf{X},\mathbf{Y}}(\mathbf{x}, \mathbf{y})}{p_{\mathbf{X}}(\mathbf{x})}
\]
where pX,Y (x, y) is the joint pf of the random vector (X, Y), and pX (x) is the marginal
pf of the random vector X.
The conditional mean (expected value) of Y given X = x is the mean of the conditional distribution:
\[
E_{Y|X}(Y \mid x) = \sum_y y\, p_{Y|X}(y \mid x).
\]
So, if the home team scores 0 goals, the expected number of goals by the visiting team is EY|X(Y | 0) = 0 × 0.498 + 1 × 0.154 + 2 × 0.194 + 3 × 0.154 = 1.00. EY|X(Y | x) for x = 1, 2 and 3 are obtained similarly.
Here X is the number of goals by the home team, and Y is the number of goals by the visiting team:

           pY|X(y | x) when y is:
 X = x    0      1      2      3      EY|X(Y | x)
 0        0.498  0.154  0.194  0.154  1.00
 1        0.283  0.414  0.261  0.042  1.06
 2        0.276  0.351  0.299  0.075  1.17
 3        0.449  0.225  0.283  0.043  0.92
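As an illustrative aside, the conditional table and conditional means can be reproduced in a few lines. A minimal Python sketch, re-entering the joint pf from Example 5.2:

```python
import numpy as np

p = np.array([[0.100, 0.031, 0.039, 0.031],
              [0.100, 0.146, 0.092, 0.015],
              [0.085, 0.108, 0.092, 0.023],
              [0.062, 0.031, 0.039, 0.006]])

cond = p / p.sum(axis=1, keepdims=True)  # p_{Y|X}(y | x), one row per x
y = np.arange(4)
print(cond.round(3))                     # the conditional pf table
print(cond @ y)                          # E(Y | x): [1.00, 1.06, 1.17, 0.92]
```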
[Figure: expected away goals E(Y | x) plotted against home goals x = 0, 1, 2, 3.]
Example 5.9 For a randomly selected man (aged over 16) in England, consider
X = height (in cm) and Y = weight (in kg). The joint distribution of (X, Y ) is
approximately bivariate normal (see Example 5.6).
The conditional distribution of Y given X = x is then a normal distribution for each
x, with the following parameters:
In other words, the conditional mean depends on x, but the conditional variance
does not. For example:
For women, this conditional distribution is normal with the following parameters:
[Figure: conditional mean of weight (kg) given height (cm), plotted separately for men and women.]
5.8 Covariance and correlation

We next consider two measures of association which are used to summarise the strength of an association in a single number: covariance and correlation (scaled covariance).
5.8.1 Covariance

Definition of covariance

The covariance of two random variables X and Y is defined as:
\[
\operatorname{Cov}(X, Y) = E\big( (X - E(X))(Y - E(Y)) \big) = E(XY) - E(X)\,E(Y).
\]

Properties of covariance

The covariance of a random variable with itself is the variance of the random variable:
\[
\operatorname{Cov}(X, X) = \operatorname{Var}(X).
\]
5.8.2 Correlation

Definition of correlation

The correlation of X and Y is defined as:
\[
\operatorname{Corr}(X, Y) = \frac{\operatorname{Cov}(X, Y)}{\operatorname{sd}(X)\,\operatorname{sd}(Y)}.
\]
When Cov(X, Y ) = 0, then Corr(X, Y ) = 0. When this is the case, we say that X
and Y are uncorrelated.
Correlation and covariance are measures of the strength of the linear (‘straight-line’)
association between X and Y .
The further the correlation is from 0, the stronger is the linear association. The most
extreme possible values of correlation are −1 and +1, which are obtained when Y is an
exact linear function of X.
Corr(X, Y ) = +1 when Y = aX + b with a > 0.
Corr(X, Y ) = −1 when Y = aX + b with a < 0.
Example 5.10 Recall the joint pf pX,Y(x, y) in the football example. The table below shows, for each combination of x and y, the value of xy above its probability p(x, y):

                   Y = y
 X = x    0         1         2         3
 0        xy = 0    xy = 0    xy = 0    xy = 0
          0.100     0.031     0.039     0.031
 1        xy = 0    xy = 1    xy = 2    xy = 3
          0.100     0.146     0.092     0.015
 2        xy = 0    xy = 2    xy = 4    xy = 6
          0.085     0.108     0.092     0.023
 3        xy = 0    xy = 3    xy = 6    xy = 9
          0.062     0.031     0.039     0.006

From these values and their probabilities, we can derive the probability distribution of XY.
The distribution of XY is therefore:

 XY = xy      0      1      2      3      4      6      9
 P(XY = xy)   0.448  0.146  0.200  0.046  0.092  0.062  0.006

Hence E(XY) = 1 × 0.146 + 2 × 0.200 + 3 × 0.046 + 4 × 0.092 + 6 × 0.062 + 9 × 0.006 = 1.478. From the marginal distributions we also have:

E(X) = 1.383, E(Y) = 1.065, E(X²) = 2.827, E(Y²) = 2.039

Var(X) = 2.827 − (1.383)² = 0.9143 and Var(Y) = 2.039 − (1.065)² = 0.9048.

Therefore:
\[
\operatorname{Cov}(X, Y) = E(XY) - E(X)\,E(Y) = 1.478 - 1.383 \times 1.065 = 0.0051
\]
and:
\[
\operatorname{Corr}(X, Y) = \frac{0.0051}{\sqrt{0.9143 \times 0.9048}} = 0.0056.
\]
The numbers of goals scored by the home and visiting teams are very nearly uncorrelated (i.e. not linearly associated).
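As an illustrative aside, a minimal Python sketch reproducing the covariance and correlation directly from the joint pf:

```python
import numpy as np

p = np.array([[0.100, 0.031, 0.039, 0.031],
              [0.100, 0.146, 0.092, 0.015],
              [0.085, 0.108, 0.092, 0.023],
              [0.062, 0.031, 0.039, 0.006]])
x = np.arange(4)

px, py = p.sum(axis=1), p.sum(axis=0)              # marginal pfs
ex, ey = px @ x, py @ x                            # E(X), E(Y)
exy = sum(p[i, j] * i * j for i in x for j in x)   # E(XY)
cov = exy - ex * ey
corr = cov / np.sqrt((px @ x**2 - ex**2) * (py @ x**2 - ey**2))
print(round(cov, 4), round(corr, 4))               # ~0.0051, ~0.0056
```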
Sample covariance: for sample pairs (x1, y1), . . . , (xn, yn), the sample covariance is \( S_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \).

Sample correlation: the sample correlation is \( r = S_{xy}/(S_x S_y) \), where Sx and Sy are the sample standard deviations of the two variables.
Plot (f) shows that r can be 0 even if two variables are clearly related, if that
relationship is not linear.
5.9 Independent random variables

Discrete random variables X1, X2, . . . , Xn are independent if and only if their joint pf is:
\[
p(x_1, x_2, \ldots, x_n) = p_1(x_1)\, p_2(x_2) \cdots p_n(x_n)
\]
for all numbers x1, x2, . . . , xn, where p1(x1), p2(x2), . . . , pn(xn) are the univariate marginal pfs of X1, X2, . . . , Xn, respectively.
Similarly, continuous random variables X1, X2, . . . , Xn are independent if and only if their joint pdf is:
\[
f(x_1, x_2, \ldots, x_n) = f_1(x_1)\, f_2(x_2) \cdots f_n(x_n)
\]
for all x1, x2, . . . , xn, where f1(x1), f2(x2), . . . , fn(xn) are the univariate marginal pdfs of X1, X2, . . . , Xn, respectively.
If two random variables are independent, they are also uncorrelated, i.e. we have:
Cov(X, Y ) = 0 and Corr(X, Y ) = 0.
This will be proved later.
The reverse is not true, i.e. two random variables can be dependent even when their
correlation is 0. This can happen when the dependence is non-linear.
5.10 Sums and products of random variables
We consider general sums (linear combinations) and products of random variables:
\[
\sum_{i=1}^{n} a_i X_i = a_1 X_1 + a_2 X_2 + \cdots + a_n X_n
\]
and:
\[
\prod_{i=1}^{n} a_i X_i = (a_1 X_1)(a_2 X_2) \cdots (a_n X_n)
\]
where a1, a2, . . . , an are constants.
Example 5.15 In the football example, the sum Z = X + Y is the total number of
goals scored in a match.
Its probability function is obtained from the joint pf pX,Y (x, y), that is:
Z=z 0 1 2 3 4 5 6
pZ (z) 0.100 0.131 0.270 0.293 0.138 0.062 0.006
However, what can we say about such distributions in general, in cases where we cannot
derive them as easily?
                       Sums                               Products
 Mean                  Yes                                Only for independent variables
 Variance              Yes                                No
 Distributional form   Normal: yes; some other            No
                       distributions: only for
                       independent variables
In particular, for n = 2:
\[
E(a_1 X_1 + a_2 X_2) = a_1 E(X_1) + a_2 E(X_2)
\]
and:
\[
\operatorname{Var}(a_1 X_1 + a_2 X_2) = a_1^2 \operatorname{Var}(X_1) + a_2^2 \operatorname{Var}(X_2) + 2 a_1 a_2 \operatorname{Cov}(X_1, X_2).
\]
These results also hold whenever Cov(Xi , Xj ) = 0 for all i 6= j, even if the random
variables are not independent.
There is no corresponding simple result for the means of products of dependent random
variables. There is also no simple result for the variances of products of random
variables, even when they are independent.
Recall:
\[
\operatorname{Cov}(X, Y) = E(XY) - E(X)\,E(Y).
\]
Proof: if X and Y are independent, then:
\[
E(XY) = \sum_x \sum_y xy\, p_X(x)\, p_Y(y) = \Big( \sum_x x\, p_X(x) \Big) \Big( \sum_y y\, p_Y(y) \Big) = E(X)\,E(Y)
\]
and hence, for independent X and Y:
\[
\operatorname{Cov}(X, Y) = \operatorname{Corr}(X, Y) = 0.
\]
The mean (and, given the variances and covariances, the variance) can thus always be obtained for the linear combination:
\[
a_1 X_1 + a_2 X_2 + \cdots + a_n X_n + b
\]
whatever the joint distribution of X1, X2, . . . , Xn. This is usually all we can say about the distribution of this sum.
In particular, the form of the distribution of the sum (i.e. its pf/pdf) depends on the
joint distribution of X1 , X2 , . . . , Xn , and there are no simple general results about that.
For example, even if X and Y have distributions from the same family, the distribution
of X + Y is often not from that same family. However, such results are available for a
few special cases.
For independent random variables:

If Xi ∼ Bin(ni, π) independently, then \( \sum_i X_i \sim \operatorname{Bin}\big( \sum_i n_i, \pi \big) \).

If Xi ∼ Pois(λi) independently, then \( \sum_i X_i \sim \operatorname{Pois}\big( \sum_i \lambda_i \big) \).
An easy proof that the mean and variance of X ∼ Bin(n, π) are E(X) = nπ and Var(X) = nπ(1 − π) is as follows. Write X = Z1 + Z2 + · · · + Zn, where Z1, . . . , Zn are independent Bernoulli(π) random variables. Then E(X) = Σ E(Zi) = nπ and, by independence, Var(X) = Σ Var(Zi) = nπ(1 − π), using (4.4).
All sums (linear combinations) of normally distributed random variables are also
normally distributed.
Suppose X1 , X2 , . . . , Xn are normally distributed random variables, with Xi ∼ N (µi , σi2 )
for i = 1, 2, . . . , n, and a1 , a2 , . . . , an and b are constants, then:
\[
\sum_{i=1}^{n} a_i X_i + b \sim N(\mu, \sigma^2)
\]
where:
\[
\mu = \sum_{i=1}^{n} a_i \mu_i + b
\quad \text{and} \quad
\sigma^2 = \sum_{i=1}^{n} a_i^2 \sigma_i^2 + 2 \mathop{\sum\sum}_{i < j} a_i a_j \operatorname{Cov}(X_i, X_j).
\]
If the Xi s are independent (or just uncorrelated), i.e. if Cov(Xi, Xj) = 0 for all i ≠ j, the variance simplifies to \( \sigma^2 = \sum_{i=1}^{n} a_i^2 \sigma_i^2 \).
Example 5.17 Suppose that in the population of English people aged 16 or over:
the heights of men (in cm) follow a normal distribution with mean 174.9 and
standard deviation 7.39
the heights of women (in cm) follow a normal distribution with mean 161.3 and
standard deviation 6.85.
Suppose we select one man and one woman at random and independently of each
other. Denote the man’s height by X and the woman’s height by Y . What is the
probability that the man is at most 10 cm taller than the woman?
In other words, what is the probability that the difference between X and Y is at most 10? Since X and Y are independent, we have:
\[
D = X - Y \sim N(\mu_X - \mu_Y,\ \sigma_X^2 + \sigma_Y^2) = N(174.9 - 161.3,\ (7.39)^2 + (6.85)^2) = N(13.6,\ (10.08)^2).
\]
Therefore:
\[
P(D \le 10) = P\!\left( Z \le \frac{10 - 13.6}{10.08} \right) = P(Z \le -0.36) = 1 - \Phi(0.36) = 1 - 0.6406 = 0.3594.
\]
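As an illustrative aside, a minimal Python sketch for the final step (the exact value differs slightly from the table-based answer, which rounds z to two decimal places):

```python
from scipy.stats import norm
import math

sd = math.sqrt(7.39**2 + 6.85**2)   # sd of D = X - Y, ~10.08
D = norm(13.6, sd)
print(D.cdf(10))                    # P(D <= 10) ~ 0.36
```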
Statistics are like bikinis. What they reveal is suggestive, but what they conceal
is vital.
(Aaron Levenstein)
Appendix A
Data visualisation and descriptive
statistics
1. \( \sum_{i=1}^{n} a = n \times a \).

   Proof: \( \sum_{i=1}^{n} a = \underbrace{(a + a + \cdots + a)}_{n \text{ times}} = n \times a \).
2. \( \sum_{i=1}^{n} aX_i = a \sum_{i=1}^{n} X_i \).

   Proof: \( \sum_{i=1}^{n} aX_i = (aX_1 + aX_2 + \cdots + aX_n) = a(X_1 + X_2 + \cdots + X_n) = a \sum_{i=1}^{n} X_i \).
3. \( \sum_{i=1}^{n} (X_i + Y_i) = \sum_{i=1}^{n} X_i + \sum_{i=1}^{n} Y_i \).

   Proof:
\[
\sum_{i=1}^{n} (X_i + Y_i) = \big( (X_1 + X_2 + \cdots + X_n) + (Y_1 + Y_2 + \cdots + Y_n) \big) = \sum_{i=1}^{n} X_i + \sum_{i=1}^{n} Y_i.
\]
Sometimes sets of numbers may be indexed with two (or even more) subscripts, for
example as Xij , for i = 1, 2, . . . , n and j = 1, 2, . . . , m.
Summation over both indices is written as:
\[
\sum_{i=1}^{n} \sum_{j=1}^{m} X_{ij} = \sum_{i=1}^{n} (X_{i1} + X_{i2} + \cdots + X_{im}).
\]
Product notation

1. \( \prod_{i=1}^{n} aX_i = a^n \prod_{i=1}^{n} X_i \).

2. \( \prod_{i=1}^{n} a = a^n \).

3. \( \prod_{i=1}^{n} X_i Y_i = \prod_{i=1}^{n} X_i \prod_{i=1}^{n} Y_i \).
The mean is ‘in the middle’ of the observations X1 , X2 , . . . , Xn , in the sense that
positive and negative values of the deviations Xi − X̄ cancel out, when summed over
all the observations, that is:
\[
\sum_{i=1}^{n} (X_i - \bar{X}) = 0.
\]
Proof: (The proof uses the definition of X̄ and the properties of summation introduced
earlier. Note that X̄ is a constant in the summation, because it has the same value for
all i.)
\[
\sum_{i=1}^{n} (X_i - \bar{X}) = \sum_{i=1}^{n} X_i - \sum_{i=1}^{n} \bar{X} = \sum_{i=1}^{n} X_i - n\bar{X} = \sum_{i=1}^{n} X_i - n\, \frac{\sum_{i=1}^{n} X_i}{n} = \sum_{i=1}^{n} X_i - \sum_{i=1}^{n} X_i = 0.
\]
The smallest possible value of the sum of squared deviations \( \sum_{i=1}^{n} (X_i - C)^2 \), for any constant C, is obtained when C = X̄.
Proof:
\[
\begin{aligned}
\sum (X_i - C)^2 &= \sum (X_i - \bar{X} + \bar{X} - C)^2 \\
&= \sum \big( (X_i - \bar{X}) + (\bar{X} - C) \big)^2 \\
&= \sum \big( (X_i - \bar{X})^2 + 2(X_i - \bar{X})(\bar{X} - C) + (\bar{X} - C)^2 \big) \\
&= \sum (X_i - \bar{X})^2 + 2(\bar{X} - C) \underbrace{\sum (X_i - \bar{X})}_{=\,0} + n(\bar{X} - C)^2 \\
&= \sum (X_i - \bar{X})^2 + n(\bar{X} - C)^2 \\
&\ge \sum (X_i - \bar{X})^2
\end{aligned}
\]
since n(X̄ − C)² ≥ 0 for any choice of C. Equality is obtained only when C = X̄, so that n(X̄ − C)² = 0.
\[
\sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=1}^{n} X_i^2 - n\bar{X}^2.
\]
Proof: We have:
\[
\sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=1}^{n} (X_i^2 - 2 X_i \bar{X} + \bar{X}^2)
= \sum_{i=1}^{n} X_i^2 - 2\bar{X} \underbrace{\sum_{i=1}^{n} X_i}_{=\,n\bar{X}} + \underbrace{\sum_{i=1}^{n} \bar{X}^2}_{=\,n\bar{X}^2}
= \sum_{i=1}^{n} X_i^2 - n\bar{X}^2.
\]
Therefore, the sample variance can also be calculated as:
\[
S^2 = \frac{1}{n-1} \left( \sum_{i=1}^{n} X_i^2 - n\bar{X}^2 \right)
\]
(and the standard deviation \( S = \sqrt{S^2} \) again).

This formula is most convenient for calculations done by hand, when summary statistics such as \( \sum_i X_i \) and \( \sum_i X_i^2 \) are provided.
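As an illustrative aside, the equivalence of the two formulae is easy to confirm numerically. A minimal Python sketch with made-up data:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 5.0, 7.0, 9.0])   # hypothetical data
n = len(x)

s2_direct   = ((x - x.mean())**2).sum() / (n - 1)
s2_shortcut = ((x**2).sum() - n * x.mean()**2) / (n - 1)
print(s2_direct, s2_shortcut)                   # identical values
```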
Sample moment

The kth sample moment about zero is \( m_k = \frac{1}{n} \sum_{i=1}^{n} X_i^k \), and the kth sample moment about the mean (central moment) is \( m'_k = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^k \).

In other words, these are sample averages of the powers \( X_i^k \) and \( (X_i - \bar{X})^k \), respectively. Clearly:
\[
\bar{X} = m_1 \quad \text{and} \quad S^2 = \frac{n}{n-1}\, m'_2 = \frac{1}{n-1} \big( n m_2 - n (m_1)^2 \big).
\]
Moments of powers 3 and 4 are used in two more summary statistics which are
described next, for reference only.
These are used much less often than measures of central tendency and dispersion.
A distribution with high kurtosis (i.e. leptokurtic) has a sharp peak and a high proportion of observations in the tails far from the peak. A distribution with low kurtosis (i.e. platykurtic) is 'flat', with no pronounced peak: most of the observations are spread evenly around the middle, and the tails are weak.
A sample measure of kurtosis is:
\[
g_2 = \frac{m'_4}{(m'_2)^2} - 3 = \frac{\sum_i (X_i - \bar{X})^4 / n}{\big( \sum_i (X_i - \bar{X})^2 / n \big)^2} - 3.
\]
g2 > 0 for leptokurtic and g2 < 0 for platykurtic distributions, and g2 = 0 for the normal
distribution (introduced in Chapter 4). Some software packages define a measure of
kurtosis without the −3, i.e. ‘excess kurtosis’.
This is how computer software calculates general sample quantiles (or how you can do
so by hand, if you ever needed to).
Suppose we need to calculate the cth sample quantile, qc , where 0 < c < 100. Let
R = (n + 1)c/100, and define r as the integer part of R and f = R − r as the fractional
part (if R is an integer, r = R and f = 0). It follows that:
\[
q_c = X_{(r)} + f\,(X_{(r+1)} - X_{(r)}) = (1 - f)\,X_{(r)} + f\,X_{(r+1)}.
\]
For example, if n = 10, the median (c = 50) has R = 11 × 0.5 = 5.5, so r = 5 and f = 0.5, giving q50 = 0.5X(5) + 0.5X(6), i.e. the average of the fifth and sixth ordered observations.
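As an illustrative aside, the rule is straightforward to implement. A minimal Python sketch (the function name is our own):

```python
def quantile(data, c):
    """cth sample quantile (0 < c < 100), using R = (n + 1) * c / 100."""
    xs = sorted(data)
    R = (len(xs) + 1) * c / 100
    r, f = int(R), R - int(R)
    if r < 1:
        return xs[0]
    if r >= len(xs):
        return xs[-1]
    return (1 - f) * xs[r - 1] + f * xs[r]   # X_(r) is xs[r - 1]

data = list(range(1, 11))                     # n = 10
print(quantile(data, 50), quantile(data, 25)) # 5.5 and 2.75
```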
(The question asks us to show that \( \sum_{i=1}^{n} \sum_{j=1}^{n} (x_i - x_j)^2 = 2n \sum_{i=1}^{n} (x_i - \bar{x})^2 \).)

Solution:

Begin with the left-hand side and proceed as follows:
\[
\sum_{i=1}^{n} \sum_{j=1}^{n} (x_i - x_j)^2 = \sum_{i=1}^{n} \left[ \sum_{j=1}^{n} (x_i^2 - 2 x_i x_j + x_j^2) \right].
\]
Now, recall that \( \bar{x} = \sum_{i=1}^{n} x_i / n \), so re-write as:
\[
= \sum_{i=1}^{n} \left[ n x_i^2 - 2 x_i n \bar{x} + \sum_{j=1}^{n} x_j^2 \right].
\]
Re-arrange again:
\[
= n \sum_{i=1}^{n} x_i^2 - 2n\bar{x} \sum_{i=1}^{n} x_i + \left( \sum_{j=1}^{n} x_j^2 \right) \sum_{i=1}^{n} 1 = n \sum_{i=1}^{n} x_i^2 - 2n^2\bar{x}^2 + n \sum_{j=1}^{n} x_j^2.
\]
Finally, add terms, factor out 2n, apply the 'x̄ trick' . . . and you're done!
\[
= 2n \left[ \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 \right] = 2n \left[ \sum_{i=1}^{n} (x_i - \bar{x})^2 \right].
\]
Hint: there are three terms in the expression of (a), nine terms in (b) and six terms in (c). Write out the terms, and try to find ways to simplify them which avoid the need for a lot of messy algebra!
(c) s.d.y = |a| s.d.x , where s.d.y is the standard deviation of y etc.
What are the mean and standard deviation of the set {x1 + k, x2 + k, . . . , xn + k}
where k is a constant? What are the mean and standard deviation of the set
{cx1 , cx2 , . . . , cxn } where c is a constant? Justify your answers with reference to the
above results.
Appendix B
Probability theory
Therefore:
\[
2\pi^2 - 3\pi + 0.8 = 0 \quad \Rightarrow \quad \pi = \frac{3 \pm \sqrt{9 - 6.4}}{4}.
\]
Hence π = 0.346887, since the other root is > 1!
However:
\[
P(A \mid B) = \frac{P(A \cap B)}{P(B)} > P(A) \quad \text{i.e.} \quad P(A \cap B) > P(A)\,P(B).
\]
Hence, since \( P(A^c \cap B^c) = 1 - P(A \cup B) = 1 - P(A) - P(B) + P(A \cap B) \):
\[
P(A^c \mid B^c) = \frac{1 - P(A) - P(B) + P(A \cap B)}{1 - P(B)} > \frac{1 - P(A) - P(B) + P(A)\,P(B)}{1 - P(B)} = \frac{(1 - P(A))(1 - P(B))}{1 - P(B)} = 1 - P(A) = P(A^c).
\]
4. A and B are any two events in the sample space S. The binary set operator ∨
denotes an exclusive union, such that:
A ∨ B = (A ∪ B) ∩ (A ∩ B)ᶜ = {s | s ∈ A or B, and s ∉ (A ∩ B)}.
Show, from the axioms of probability, that:
(a) P (A ∨ B) = P (A) + P (B) − 2 × P (A ∩ B)
(b) P (A ∨ B | A) = 1 − P (B | A).
Solution:
(a) We have:
A ∨ B = (A ∩ B c ) ∪ (B ∩ Ac ).
By axiom 3, noting that (A ∩ B c ) and (B ∩ Ac ) are disjoint:
P (A ∨ B) = P (A ∩ B c ) + P (B ∩ Ac ).
We can write A = (A ∩ B) ∪ (A ∩ B c ), hence (using axiom 3):
P (A ∩ B c ) = P (A) − P (A ∩ B).
Similarly, P (B ∩ Ac ) = P (B) − P (A ∩ B), hence:
P (A ∨ B) = P (A) + P (B) − 2 × P (A ∩ B).
(b) We have:

P(A ∨ B | A) = P((A ∨ B) ∩ A)/P(A)
             = P(A ∩ Bᶜ)/P(A)
             = (P(A) − P(A ∩ B))/P(A)
             = P(A)/P(A) − P(A ∩ B)/P(A)
             = 1 − P(B | A).
Solution:
Bayes’ theorem is:

P(Bj | A) = P(A | Bj) P(Bj) / ∑_{i=1}^{K} P(A | Bi) P(Bi).

By definition:

P(Bj | A) = P(Bj ∩ A)/P(A) = P(A | Bj) P(Bj)/P(A).

If {Bi}, for i = 1, 2, . . . , K, is a partition of the sample space S, then:

P(A) = ∑_{i=1}^{K} P(A ∩ Bi) = ∑_{i=1}^{K} P(A | Bi) P(Bi).

Substituting this expression for P(A) into the previous equation gives the theorem.
6. A man has two bags. Bag A contains five keys and bag B contains seven keys. Only
one of the twelve keys fits the lock which he is trying to open. The man selects a
bag at random, picks out a key from the bag at random and tries that key in the
lock. What is the probability that the key he has chosen fits the lock?
Solution:
Define a partition {Ci}, such that:

C1 = key in bag A and bag A chosen  ⇒  P(C1) = 5/12 × 1/2 = 5/24
C2 = key in bag B and bag A chosen  ⇒  P(C2) = 7/12 × 1/2 = 7/24
C3 = key in bag A and bag B chosen  ⇒  P(C3) = 5/12 × 1/2 = 5/24
C4 = key in bag B and bag B chosen  ⇒  P(C4) = 7/12 × 1/2 = 7/24.

Hence we require, defining the event F = ‘key fits’:

P(F) = 1/5 × P(C1) + 1/7 × P(C4) = 1/5 × 5/24 + 1/7 × 7/24 = 1/12.
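The same answer can be reached by brute-force enumeration over the partition; a minimal sketch using exact fractions:

```python
from fractions import Fraction

half = Fraction(1, 2)  # each bag is chosen with probability 1/2
p_fit = Fraction(0)
for key_in_A in (True, False):
    p_key = Fraction(5, 12) if key_in_A else Fraction(7, 12)
    for chose_A in (True, False):
        n_bag = 5 if chose_A else 7
        # the chosen key fits only if the chosen bag is the one holding the key
        p_fit_given = Fraction(1, n_bag) if key_in_A == chose_A else Fraction(0)
        p_fit += p_key * half * p_fit_given

print(p_fit)  # 1/12
```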
7. Continuing with Question 6, suppose the first key chosen does not fit the lock.
What is the probability that the bag chosen:
(a) is bag A?
(b) contains the required key?
Solution:
(a) By Bayes’ theorem:

P(bag A | Fᶜ) = (P(Fᶜ | C1) P(C1) + P(Fᶜ | C2) P(C2)) / ∑_{i=1}^{4} P(Fᶜ | Ci) P(Ci)

where:

P(Fᶜ | C1) = 4/5,  P(Fᶜ | C2) = 1,  P(Fᶜ | C3) = 1  and  P(Fᶜ | C4) = 6/7.

Hence:

P(bag A | Fᶜ) = (4/5 × 5/24 + 1 × 7/24)/(4/5 × 5/24 + 1 × 7/24 + 1 × 5/24 + 6/7 × 7/24) = (11/24)/(22/24) = 1/2.

(b) Similarly:

P(right bag | Fᶜ) = (P(Fᶜ | C1) P(C1) + P(Fᶜ | C4) P(C4)) / ∑_{i=1}^{4} P(Fᶜ | Ci) P(Ci) = (4/24 + 6/24)/(22/24) = 10/22 = 5/11.
8. Assume that a calculator has a ‘random number’ key and that when the key is
pressed an integer between 0 and 999 inclusive is generated at random, all numbers
being generated independently of one another.
(a) What is the probability that the number generated is less than 300?
(b) If two numbers are generated, what is the probability that both are less than
300?
(c) If two numbers are generated, what is the probability that the first number
exceeds the second number?
(d) If two numbers are generated, what is the probability that the first number
exceeds the second number, and their sum is exactly 300?
(e) If five numbers are generated, what is the probability that at least one number
occurs more than once?
Solution:
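Each part reduces to counting equally likely outcomes among the 1,000 (or 1,000², or 1,000⁵) possibilities; a minimal sketch with the counting arguments as comments:

```python
from math import prod

N = 1000  # integers 0, ..., 999, each with probability 1/1000

p_a = 300 / N                        # (a) generated number is < 300
p_b = p_a ** 2                       # (b) both < 300, by independence
p_c = (1 - 1 / N) / 2                # (c) first > second, by symmetry
# (d) pairs (x, y) with x > y and x + y = 300: x = 151, ..., 300 (150 pairs)
p_d = 150 / N ** 2
# (e) at least one repeat among five numbers: complement of all distinct
p_e = 1 - prod((N - i) / N for i in range(5))

print(p_a, p_b, p_c, p_d, round(p_e, 5))
# 0.3  0.09  0.4995  0.00015  0.00997
```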
= P(∪_{i=1}^{∞} (Ai ∩ B))/P(B)
= ∑_{i=1}^{∞} P(Ai ∩ B)/P(B)
= ∑_{i=1}^{∞} P(Ai | B)

where the equation on the second line follows from (B.1) in the question, since the
Ai ∩ B are also events in S, and they are pairwise mutually exclusive (i.e.
(Ai ∩ B) ∩ (Aj ∩ B) = ∅ for all i ≠ j).
10. Suppose that three components numbered 1, 2 and 3 have probabilities of failure
π1 , π2 and π3 , respectively. Determine the probability of a system failure in each of
the following cases where component failures are assumed to be independent.
(a) Parallel system – the system fails if all components fail.
(b) Series system – the system fails unless all components do not fail (i.e. it fails if at least one component fails).
(c) Mixed system – the system fails if component 1 fails or if both component 2
and component 3 fail.
Solution:
(a) Since the component failures are independent, the probability of system failure
is π1 π2 π3 .
(b) The probability that component i does not fail is 1 − πi , hence the probability
that the system does not fail is (1 − π1 )(1 − π2 )(1 − π3 ), and so the probability
that the system fails is:
1 − (1 − π1 )(1 − π2 )(1 − π3 ).
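For part (c), by independence and inclusion-exclusion, the failure probability is π1 + π2π3 − π1π2π3. All three cases can be checked by enumerating the eight independent failure patterns; a minimal sketch with illustrative values for the πi:

```python
from itertools import product

p = [0.1, 0.2, 0.3]  # illustrative failure probabilities pi_1, pi_2, pi_3

def system_failure_prob(system_fails):
    """Sum P(pattern) over the 8 independent component-failure patterns."""
    total = 0.0
    for pattern in product([True, False], repeat=3):  # True = component fails
        prob = 1.0
        for fail, pi in zip(pattern, p):
            prob *= pi if fail else 1 - pi
        if system_fails(pattern):
            total += prob
    return total

print(system_failure_prob(all))                                # (a) parallel
print(system_failure_prob(any))                                # (b) series
print(system_failure_prob(lambda f: f[0] or (f[1] and f[2])))  # (c) mixed
# Compare with: p[0]*p[1]*p[2], 1-(1-p[0])*(1-p[1])*(1-p[2]),
# and p[0] + p[1]*p[2] - p[0]*p[1]*p[2]
```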
11. Why is S = {1, 1, 2} not a sensible way to try to define a sample space?
Solution:
Because there is no need to list the elementary outcome ‘1’ twice. It is much clearer
to write S = {1, 2}.
12. Write out all the events for the sample space S = {a, b, c}. (There are eight of
them.)
Solution:
The possible events are {a}, {b}, {c}, {a, b}, {a, c}, {b, c}, {a, b, c} (the sample
space S) and ∅.
13. For an event A, work out a simpler way to express the events A ∩ S, A ∪ S, A ∩ ∅
and A ∪ ∅.
Solution:
We have:
A ∩ S = A, A ∪ S = S, A ∩ ∅ = ∅ and A ∪ ∅ = A.
14. If all elementary outcomes are equally likely, S = {a, b, c, d}, A = {a, b, c} and
B = {c, d}, find P (A | B) and P (B | A).
Solution:
S has 4 elementary outcomes which are equally likely, so each elementary outcome
has probability 1/4.
We have:

P(A | B) = P(A ∩ B)/P(B) = P({c})/P({c, d}) = (1/4)/(1/4 + 1/4) = 1/2

and:

P(B | A) = P(B ∩ A)/P(A) = P({c})/P({a, b, c}) = (1/4)/(1/4 + 1/4 + 1/4) = 1/3.
15. Suppose that we toss a fair coin twice. The sample space is given by
S = {HH, HT, T H, T T }, where the elementary outcomes are defined in the
obvious way – for instance HT is heads on the first toss and tails on the second
toss. Show that if all four elementary outcomes are equally likely, then the events
‘heads on the first toss’ and ‘heads on the second toss’ are independent.
Solution:
Note carefully here that we have equally likely elementary outcomes (due to the
coin being fair), so that each has probability 1/4, and the independence follows.
The event ‘heads on the first toss’ is A = {HH, HT } and has probability 1/2,
because it is specified by two elementary outcomes. The event ‘heads on the second
toss’ is B = {HH, T H} and has probability 1/2. The event ‘heads on the first toss
and the second toss’ is A ∩ B = {HH} and has probability 1/4. So the
multiplication property P (A ∩ B) = 1/4 = 1/2 × 1/2 = P (A) P (B) is satisfied, and
the two events are independent.
16. Show that if A and B are disjoint events, and are also independent, then P(A) = 0
or P(B) = 0.
Solution:
It is important to get the logical flow in the right direction here. We are told that
A and B are disjoint events, that is:
A ∩ B = ∅.
So:
P (A ∩ B) = 0.
We are also told that A and B are independent, that is:
P (A ∩ B) = P (A) P (B).
It follows that:
0 = P (A) P (B)
and so either P (A) = 0 or P (B) = 0.
(Note that independence and disjointness are quite different ideas.)
17. Write down the condition for three events A, B and C to be independent.
Solution:
Applying the product rule, we must have:

P(A ∩ B ∩ C) = P(A) P(B) P(C).

Moreover, since all subsets of two events from A, B and C must be independent,
we must also have:
P (A ∩ B) = P (A) P (B)
P (A ∩ C) = P (A) P (C)
and:
P (B ∩ C) = P (B) P (C).
One must check that all four conditions hold to verify independence of A, B and C.
18. Prove the simplest version of Bayes’ theorem from first principles.
Solution:
Applying the definition of conditional probability, we have:
P(B | A) = P(B ∩ A)/P(A) = P(A ∩ B)/P(A) = P(A | B) P(B)/P(A).
19. A statistics teacher knows from past experience that a student who does their
homework consistently has a probability of 0.95 of passing the examination,
whereas a student who does not do their homework has a probability of 0.30 of
passing.
(a) If 25% of students do their homework consistently, what percentage can expect
to pass?
(b) If a student chosen at random from the group gets a pass, what is the
probability that the student has done their homework consistently?
Solution:
Here the random experiment is to choose a student at random, and to record
whether the student passes (P ) or fails (F ), and whether the student has done
their homework consistently (C) or has not (N). The sample space is
S = {P C, P N, F C, F N }. We use the events Pass = {P C, P N }, and Fail
= {F C, F N }. We consider the sample space partitioned by Homework
= {P C, F C}, and No Homework = {P N, F N }.
(Notice that F = Pᶜ and N = Cᶜ.)
(a) The first part of the example asks for the denominator of Bayes’ theorem, i.e. the total probability of a pass:

P(Pass) = P(Pass | Homework) P(Homework) + P(Pass | No Homework) P(No Homework) = 0.95 × 0.25 + 0.30 × 0.75 = 0.4625

so 46.25% of students can expect to pass.

(b) By Bayes’ theorem:

P(Homework | Pass) = P(Pass | Homework) P(Homework)/P(Pass) = 0.2375/0.4625 ≈ 0.5135.
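A two-line check of the arithmetic (a minimal sketch):

```python
p_pass_given_hw, p_pass_given_no, p_hw = 0.95, 0.30, 0.25

# (a) total probability of passing
p_pass = p_pass_given_hw * p_hw + p_pass_given_no * (1 - p_hw)
print(p_pass)                              # 0.4625, i.e. 46.25%

# (b) Bayes' theorem: P(homework | pass)
print(p_pass_given_hw * p_hw / p_pass)     # ~0.5135
```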
21. A, B and C throw a die in that order until a six appears. The person who throws
the first six wins. What are their respective chances of winning?
Solution:
We must assume that the game finishes with probability one (it would be proved in
a more advanced subject). If A, B and C all throw and fail to get a six, then their
respective chances of winning are as at the start of the game. We can call each
completed set of three throws a round. Let us denote the probabilities of winning
by P(A), P(B) and P(C) for A, B and C, respectively. Therefore:

P(A) = 1/6 + (5/6)³ P(A)  ⇒  P(A) = (1/6)/(1 − (5/6)³) = 36/91.

Similarly, P(B) = (5/6)(1/6) + (5/6)³ P(B), giving P(B) = 30/91, and
P(C) = (5/6)²(1/6) + (5/6)³ P(C), giving P(C) = 25/91.
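The recursion can be checked with exact fractions (a minimal sketch):

```python
from fractions import Fraction

p6 = Fraction(1, 6)
miss = Fraction(5, 6)

# P(A) = 1/6 + (5/6)^3 * P(A): A wins now, or all three miss a full
# round and the game restarts with A throwing again.
pA = p6 / (1 - miss ** 3)
pB = miss * p6 / (1 - miss ** 3)       # B needs A to miss first
pC = miss ** 2 * p6 / (1 - miss ** 3)  # C needs A and B to miss first

print(pA, pB, pC, pA + pB + pC)  # 36/91, 30/91, 25/91, 1
```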
22. In men’s singles tennis, matches are played on the best-of-five-sets principle.
Therefore, the first player to win three sets wins the match, and a match may
consist of three, four or five sets. Assuming that two players are perfectly evenly
matched, and that sets are independent events, calculate the probabilities that a
match lasts three sets, four sets and five sets, respectively.
Solution:
Suppose that the two players are A and B. We calculate the probability that A
wins a three-, four- or five-set match, and then, since the players are evenly
matched, double these probabilities for the final answer.
P(‘A wins in 3 sets’) = P(‘A wins 1st set’ ∩ ‘A wins 2nd set’ ∩ ‘A wins 3rd set’)
                      = P(‘A wins 1st set’) P(‘A wins 2nd set’) P(‘A wins 3rd set’)
                      = 1/2 × 1/2 × 1/2 = 1/8.

Therefore, the total probability that the match lasts three sets is:

2 × 1/8 = 1/4.
If A wins in four sets, the possible winning patterns are BAAA, ABAA and AABA.
Each of these patterns has probability (1/2)4 by using the same argument as in the
case of 3 sets. So the probability that A wins in four sets is 3 × (1/16) = 3/16.
Therefore, the total probability of a match lasting four sets is 2 × (3/16) = 3/8.
The probability of a five-set match should be 1 − 3/8 − 1/4 = 3/8, but let us check
this directly. The winning patterns for A in a five-set match are BBAAA, BABAA, BAABA, ABBAA, ABABA and AABBA.
Each of these has probability (1/2)5 because of the independence of the sets. So the
probability that A wins in five sets is 6 × (1/32) = 3/16. Therefore, the total
probability of a five-set match is 3/8, as before.
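These probabilities can also be verified by enumerating all 2⁵ equally likely sequences of set winners (a minimal sketch):

```python
from itertools import product

counts = {3: 0.0, 4: 0.0, 5: 0.0}
for sets in product("AB", repeat=5):
    # Play sets one by one; the match stops when either player reaches three.
    a = b = 0
    for i, winner in enumerate(sets, start=1):
        a += winner == "A"
        b += winner == "B"
        if a == 3 or b == 3:
            counts[i] += 0.5 ** 5  # each full length-5 sequence has probability 1/32
            break

print(counts)  # {3: 0.25, 4: 0.375, 5: 0.375}
```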
1. (a) A, B and C are any three events in the sample space S. Prove that:
3. (a) Show that if A and B are independent events in a sample space, then Ac and
B c are also independent.
(b) Show that if X and Y are mutually exclusive events in a sample space, then
X c and Y c are not in general mutually exclusive.
4. In a game of tennis, each point is won by one of the two players A and B. The
usual rules of scoring for tennis apply. That is, the winner of the game is the player
who first scores four points, unless each player has won three points, when deuce is
called and play proceeds until one player is two points ahead of the other and
hence wins the game.
A is serving and has probability 2/3 of winning any point. The result of each
point is assumed to be independent of every other point.
(a) Show that the probability of A winning the game without deuce being called is
496/729.
(b) Find the probability of deuce being called.
158
B.2. Practice questions
(c) If deuce is called, show that A’s subsequent probability of winning the game is
4/5.
(d) Hence determine A’s overall chance of winning the game.
Appendix C
Random variables
and:
so π = 8/25 = 0.32.
πe^t / (1 − e^t(1 − π)).
Solution:
(a) Working from the definition:

MX(t) = E(e^{tX}) = ∑_{x∈S} e^{tx} p(x) = ∑_{x=1}^{∞} e^{tx} (1 − π)^{x−1} π
      = πe^t ∑_{x=1}^{∞} (e^t(1 − π))^{x−1}
      = πe^t / (1 − e^t(1 − π))

summing the geometric series (valid for e^t(1 − π) < 1). Differentiating gives
M′X(t) = πe^t/(1 − e^t(1 − π))², and therefore:

E(X) = M′X(0) = π/(1 − (1 − π))² = π/π² = 1/π.
Solution:
(a) We have:

∫_0^1 f(x) dx = 1  ⇒  ∫_0^1 (ax + bx²) dx = [ax²/2 + bx³/3]_0^1 = a/2 + b/3 = 1.
(c) Finally:

E(X²) = ∫_0^1 x² · 6x(1 − x) dx = ∫_0^1 (6x³ − 6x⁴) dx = [6x⁴/4 − 6x⁵/5]_0^1 = 0.3

and so the variance is:

Var(X) = E(X²) − (E(X))² = 0.3 − 0.25 = 0.05.
Solution:
(a) A sketch of the cumulative distribution function, G(w), shows a jump from 0 to
1/3 at w = 0, a continuous increase towards 1 − (2/3)e⁻¹ as w approaches 2, and a
final jump to 1 at w = 2.
(c) We have:

P(W > 1) = 1 − G(1) = (2/3)e^{−1/2}

P(W = 2) = (2/3)e^{−1}

and:

P(W ≤ 1.5 | W > 0.5) = P(0.5 < W ≤ 1.5)/P(W > 0.5)
                     = (G(1.5) − G(0.5))/(1 − G(0.5))
                     = ((1 − (2/3)e^{−1.5/2}) − (1 − (2/3)e^{−0.5/2}))/((2/3)e^{−0.5/2})
                     = 1 − e^{−1/2}.
Solution:
(a) Clearly, f(x) ≥ 0 for all x and ∫_{−∞}^{∞} f(x) dx = 1. This can be seen
geometrically, since f(x) defines two rectangles, one with base 1 and height
1/4, the other with base 1 and height 3/4, giving a total area of 1/4 + 3/4 = 1.
(b) We have:

E(X) = ∫_{−∞}^{∞} x f(x) dx = ∫_0^1 (x/4) dx + ∫_1^2 (3x/4) dx = [x²/8]_0^1 + [3x²/8]_1^2 = 1/8 + 3/2 − 3/8 = 5/4.

The median is most simply found geometrically. The area to the right of the
point x = 4/3 is 0.5, i.e. the rectangle with base 2 − 4/3 = 2/3 and height 3/4
gives an area of 2/3 × 3/4 = 1/2. Hence the median is 4/3.
(c) For the variance, we proceed as follows:

E(X²) = ∫_{−∞}^{∞} x² f(x) dx = ∫_0^1 (x²/4) dx + ∫_1^2 (3x²/4) dx = [x³/12]_0^1 + [x³/4]_1^2 = 1/12 + 2 − 1/4 = 11/6.

Hence the variance is:

Var(X) = E(X²) − (E(X))² = 11/6 − 25/16 = 88/48 − 75/48 = 13/48 ≈ 0.2708.
(d) The cdf is:

F(x) = 0 for x < 0,  x/4 for 0 ≤ x ≤ 1,  3x/4 − 1/2 for 1 < x ≤ 2,  and 1 for x > 2.
(e) P(X = 1) = 0, since the cdf is continuous, and:

P(X > 1.5 | X > 0.5) = P({X > 1.5} ∩ {X > 0.5})/P(X > 0.5) = P(X > 1.5)/P(X > 0.5)
                     = (0.5 × 0.75)/(1 − 0.5 × 0.25)
                     = 0.375/0.875
                     = 3/7 ≈ 0.4286.
(f) The moment generating function is:

MX(t) = E(e^{tX}) = ∫_{−∞}^{∞} e^{tx} f(x) dx = ∫_0^1 (e^{tx}/4) dx + ∫_1^2 (3e^{tx}/4) dx
      = [e^{tx}/4t]_0^1 + [3e^{tx}/4t]_1^2
      = (e^t − 1)/4t + 3(e^{2t} − e^t)/4t
      = (3e^{2t} − 2e^t − 1)/4t.
E((X − E(X))³)/σ³.
(f) If a sample of five observations is drawn at random from the distribution, find
the probability that all the observations exceed 1.5.
Solution:
(a) Clearly, f(x) ≥ 0 for all x and:

∫_0^2 (x³/4) dx = [x⁴/16]_0^2 = 1.
Solution:
(a) Clearly, f(x) ≥ 0 for all x since λ² > 0, x ≥ 0 and e^{−λx} ≥ 0.
To show that ∫_{−∞}^{∞} f(x) dx = 1, integrate by parts:

∫_{−∞}^{∞} f(x) dx = ∫_0^∞ λ²x e^{−λx} dx
                  = [λ²x × e^{−λx}/(−λ)]_0^∞ + ∫_0^∞ λ² (e^{−λx}/λ) dx
                  = 0 + ∫_0^∞ λe^{−λx} dx
                  = 1

since the final integrand is the pdf of the exponential distribution. A similar
integration by parts gives E(X) = 2/λ.
For the variance:

E(X²) = ∫_0^∞ x² λ²x e^{−λx} dx = [−x³λ e^{−λx}]_0^∞ + ∫_0^∞ 3x² λe^{−λx} dx = 6/λ².

So, Var(X) = 6/λ² − (2/λ)² = 2/λ².
Solution:
(a) We have:
i. P(X = 0) = F(0) = 1 − a.
ii. P(X = 1) = lim_{x→1} (F(1) − F(x)) = 1 − (1 − ae⁻¹) = ae⁻¹.
iii. f(x) = ae⁻ˣ, for 0 ≤ x < 1, and 0 otherwise.
iv. The mean is:

E(X) = 0 × (1 − a) + 1 × ae⁻¹ + ∫_0^1 x ae⁻ˣ dx
     = ae⁻¹ + [−xae⁻ˣ]_0^1 + ∫_0^1 ae⁻ˣ dx
     = ae⁻¹ − ae⁻¹ + [−ae⁻ˣ]_0^1
     = a(1 − e⁻¹).
(a) Determine the constant k and derive the cumulative distribution function,
F (x), of X.
(b) Find E(X) and Var(X).
Solution:
(a) We have:

∫_{−∞}^{∞} f(x) dx = ∫_0^π k sin(x) dx = 1.

Therefore:

[k(−cos(x))]_0^π = 2k = 1  ⇒  k = 1/2.

The cdf is hence:

F(x) = 0 for x < 0,  (1 − cos(x))/2 for 0 ≤ x ≤ π,  and 1 for x > π.

(b) By the symmetry of f(x) about π/2, E(X) = π/2. Next:

E(X²) = ∫_0^π x² (1/2) sin(x) dx = [x² (1/2)(−cos(x))]_0^π + ∫_0^π x cos(x) dx
      = π²/2 + [x sin(x)]_0^π − ∫_0^π sin(x) dx
      = π²/2 − [−cos(x)]_0^π
      = π²/2 − 2.

Therefore, the variance is:

Var(X) = E(X²) − (E(X))² = π²/2 − 2 − π²/4 = π²/4 − 2.
10. (a) Define the cumulative distribution function (cdf) of a random variable and
state the principal properties of such a function.
(b) Identify which, if any, of the following functions could be a cdf under suitable
choices of the constants a and b. Explain why (or why not) each function
satisfies the properties required of a cdf and the constraints which may be
required in respect of the constants a and b.
i. F (x) = a(b − x)2 for −1 ≤ x ≤ 1.
ii. F (x) = a(1 − xb ) for −1 ≤ x ≤ 1.
iii. F (x) = a − b exp(−x/2) for 0 ≤ x ≤ 2.
Solution:
(a) We defined the cdf to be F(x) = P(X ≤ x), where:
• 0 ≤ F(x) ≤ 1
• F(x) is non-decreasing
• dF(x)/dx = f(x) and F(x) = ∫_{−∞}^{x} f(t) dt for continuous X
• F(x) → 0 as x → −∞ and F(x) → 1 as x → ∞.
(b) i. Okay: a = 0.25 and b = −1, giving F(x) = (x + 1)²/4.
ii. Not okay: F(1) = a(1 − 1) = 0 whatever the value of b, so F would have to decrease to reach 0 at x = 1, contradicting the non-decreasing property.
iii. Okay: a = b = (1 − e⁻¹)⁻¹, so that F(0) = 0 and F(2) = 1.
11. Suppose that random variable X has the range {x1 , x2 , . . .}, where x1 < x2 < · · · .
Prove the following results:

∑_{i=1}^{∞} p(xi) = 1  and  F(xk) = ∑_{i=1}^{k} p(xi).
Solution:
The events X = x1, X = x2, . . . are disjoint, so we can write:

∑_{i=1}^{∞} p(xi) = ∑_{i=1}^{∞} P(X = xi) = P(X = x1 ∪ X = x2 ∪ · · · ) = P(S) = 1.

In words, this result states that the sum of the probabilities of all the possible
values X can take is equal to 1.
For the second equation, we have:

F(xk) = P(X ≤ xk) = P(X = xk ∪ X ≤ x_{k−1}).
Continuing to expand the second event in the same way, down to x1, gives:

F(xk) = P(X ≤ xk) = P(X = x1 ∪ X = x2 ∪ · · · ∪ X = xk) = ∑_{i=1}^{k} p(xi).
12. At a charity event, the organisers sell 100 tickets to a raffle. At the end of the
event, one of the tickets is selected at random and the person with that number
wins a prize. Carol buys ticket number 22. Janet buys tickets numbered 1–5. What
is the probability for each of them to win the prize?
Solution:
Let X denote the number on the winning ticket. Since all values between 1 and 100
are equally likely, X has a discrete ‘uniform’ distribution such that:
P(‘Carol wins’) = P(X = 22) = p(22) = 1/100 = 0.01

and:

P(‘Janet wins’) = P(X ≤ 5) = F(5) = 5/100 = 0.05.
100
13. What is the expectation of the random variable X if the only possible value it can
take is c?
Solution:
We have p(c) = 1, so X is effectively a constant, even though it is called a random
variable. Its expectation is:
E(X) = ∑_{∀x} x p(x) = c p(c) = c × 1 = c.  (C.1)
Solution:
We have:

E(X − E(X)) = E(X) − E(E(X)).

Since E(X) is just a number, as opposed to a random variable, (C.1) tells us that
its expectation is equal to itself. Therefore, we can write:

E(X − E(X)) = E(X) − E(X) = 0.
15. Show that if Var(X) = 0 then p(µ) = 1. (We say in this case that X is almost
surely equal to its mean.)
Solution:
From the definition of variance, we have:
Var(X) = E((X − µ)²) = ∑_{∀x} (x − µ)² p(x) ≥ 0
because the squared term (x − µ)2 is non-negative (as is p(x)). The only case where
it is equal to 0 is when x − µ = 0, that is, when x = µ. Therefore, the random
variable X can only take the value µ, and we have p(µ) = P (X = µ) = 1.
(b) at each trial, the rat chooses with equal probability between the doors which it
has not so far tried
(c) the rat never chooses the same door on two successive trials, but otherwise
chooses at random with equal probabilities.
Appendix D
Common distributions of random
variables
Solution:
(a) We have:

E(X) = ∑_{x=0}^{n} x C(n, x) π^x (1 − π)^{n−x}
     = ∑_{x=1}^{n} x C(n, x) π^x (1 − π)^{n−x}
     = ∑_{x=1}^{n} [n(n−1)!/((x−1)! ((n−1)−(x−1))!)] π π^{x−1} (1 − π)^{n−x}
     = nπ ∑_{x=1}^{n} C(n−1, x−1) π^{x−1} (1 − π)^{(n−1)−(x−1)}
     = nπ ∑_{y=0}^{n−1} C(n−1, y) π^y (1 − π)^{(n−1)−y}
     = nπ

where C(n, x) denotes the binomial coefficient and the final sum equals 1, being the
sum of the Bin(n − 1, π) pf over all its values.
(b) We have:

E(X(X − 1)) = ∑_{x=0}^{n} x(x−1) C(n, x) π^x (1 − π)^{n−x}
            = ∑_{x=2}^{n} x(x−1) C(n, x) π^x (1 − π)^{n−x}
            = ∑_{x=2}^{n} [n(n−1)(n−2)!/((x−2)! ((n−2)−(x−2))!)] π² π^{x−2} (1 − π)^{n−x}
            = n(n−1)π² ∑_{x=2}^{n} C(n−2, x−2) π^{x−2} (1 − π)^{(n−2)−(x−2)}
            = n(n−1)π² ∑_{y=0}^{n−2} C(n−2, y) π^y (1 − π)^{(n−2)−y}
            = n(n−1)π².
(c) We have (if r < n):

E(X(X−1) · · · (X−r)) = ∑_{x=0}^{n} x(x−1) · · · (x−r) C(n, x) π^x (1 − π)^{n−x}
                      = ∑_{x=r+1}^{n} x(x−1) · · · (x−r) C(n, x) π^x (1 − π)^{n−x}

and extracting the factor n(n−1) · · · (n−r) π^{r+1}, exactly as in (a) and (b), leaves
a sum of the Bin(n − r − 1, π) pf over all its values, which equals 1. Hence:

E(X(X−1) · · · (X−r)) = n(n−1) · · · (n−r) π^{r+1}.
Solution:
(a) Xn = ∑_{i=1}^{n} Bi takes the values 0, 1, 2, . . . , n. Any sequence consisting of x 1s and
n − x 0s has probability π^x (1 − π)^{n−x} and gives a value Xn = x. There are C(n, x)
such sequences, so:

P(Xn = x) = C(n, x) π^x (1 − π)^{n−x}

and 0 otherwise. Hence E(Bi) = π and Var(Bi) = π(1 − π), which means
E(Xn) = nπ and Var(Xn) = nπ(1 − π).
P(Y = y) = (1 − π)^{y−1} π
f(x) = (β^α/Γ(α)) x^{α−1} e^{−βx}  for x > 0  (D.1)

and 0 otherwise, where α > 0 and β > 0 are parameters, and Γ(α) is the value of
the gamma function such that:

Γ(α) = ∫_0^∞ x^{α−1} e^{−x} dx.
The gamma function has a finite value for all α > 0. Two of its properties are that:
• Γ(1) = 1
• Γ(α) = (α − 1) Γ(α − 1) for all α > 1.
(a) The function f(x) defined by (D.1) satisfies all the conditions for being a pdf.
Show that this implies the following result about an integral:

∫_0^∞ x^{α−1} e^{−βx} dx = Γ(α)/β^α  for any α > 0, β > 0.
Using this result and the known properties of the exponential distribution,
derive the expected value of X ∼ Gamma(α, β) when α is a positive integer
(i.e. α = 1, 2, . . .).
Solution:
(a) This follows immediately from the general property of pdfs that
∫_{−∞}^{∞} f(x) dx = 1, applied to the specific pdf here. We have:

∫_0^∞ (β^α/Γ(α)) x^{α−1} e^{−βx} dx = 1  ⇒  ∫_0^∞ x^{α−1} e^{−βx} dx = Γ(α)/β^α.
(b) With α = 1, the pdf becomes f (x) = βe−βx for x ≥ 0, and 0 otherwise. This is
the pdf of the exponential distribution with parameter β, i.e. X ∼ Exp(β).
(c) We have:

MX(t) = E(e^{tX}) = ∫_0^∞ e^{tx} f(x) dx = ∫_0^∞ e^{tx} (β^α/Γ(α)) x^{α−1} e^{−βx} dx
      = (β^α/Γ(α)) ∫_0^∞ x^{α−1} e^{−(β−t)x} dx
      = (β^α/Γ(α)) × Γ(α)/(β − t)^α
      = (β/(β − t))^α

which is finite when β − t > 0, i.e. when t < β. The second-to-last step follows
by substituting β − t for β in the result in (a).
(d) i. We have:

E(X) = ∫_{−∞}^{∞} x f(x) dx = ∫_0^∞ x (β^α/Γ(α)) x^{α−1} e^{−βx} dx
     = (β^α/Γ(α)) ∫_0^∞ x^{(α+1)−1} e^{−βx} dx
     = (β^α/Γ(α)) × Γ(α + 1)/β^{α+1}
     = (β^α/Γ(α)) × αΓ(α)/β^{α+1}
     = α/β

using (a) and the gamma function property stated in the question.
ii. The first derivative of MX(t) is:

M′X(t) = α (β/(β − t))^{α−1} × β/(β − t)².

Therefore:

E(X) = M′X(0) = α/β.
(e) When α is a positive integer, by the result stated in the question, we have
X = ∑_{i=1}^{α} Yi, where Y1, Y2, . . . , Yα are independent random variables each
distributed as Gamma(1, β), i.e. as exponential with parameter β as concluded
in (b). The expected value of the exponential distribution can be taken as
given from the lectures, so E(Yi) = 1/β for each i = 1, 2, . . . , α. Therefore,
using the general result on expected values of sums:

E(X) = E(∑_{i=1}^{α} Yi) = ∑_{i=1}^{α} E(Yi) = α × 1/β = α/β.
4. James enjoys playing Solitaire on his laptop. One day, he plays the game
repeatedly. He has found, from experience, that the probability of success in any
game is 1/3 and is independent of the outcomes of other games.
(a) What is the probability that his first success occurs in the fourth game he
plays? What is the expected number of games he needs to play to achieve his
first success?
(b) What is the probability of three successes in ten games? What is the expected
number of successes in ten games?
(c) Use a suitable approximation to find the probability of less than 25 successes
in 100 games. You should justify the use of the approximation.
(d) What is the probability that his third success occurs in the tenth game he
plays?
Solution:
(a) P (first success in 4th game) = (2/3)3 × (1/3) = 8/81 ≈ 0.1. This is a
geometric distribution, for which E(X) = 1/π = 1/(1/3) = 3.
(b) Use X ∼ Bin(10, 1/3), such that E(X) = 10 × 1/3 ≈ 3.33, and:

P(X = 3) = C(10, 3) (1/3)³ (2/3)⁷ ≈ 0.2601.
(d) This is a negative binomial distribution (used for the trial number of the kth
success) with a pf given by:

p(x) = C(x−1, k−1) π^k (1 − π)^{x−k}  for x = k, k + 1, k + 2, . . .

Here k = 3, π = 1/3 and x = 10, so the required probability is
C(9, 2) (1/3)³ (2/3)⁷ ≈ 0.0780.
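All four parts can be checked numerically; a minimal sketch (part (c) uses the normal approximation to the binomial, and the continuity correction there is one reasonable choice):

```python
from math import comb, erf, sqrt

p = 1 / 3

# (a) geometric: first success in game 4; mean number of games is 1/p = 3
print((1 - p) ** 3 * p)                    # ~0.0988

# (b) binomial: three successes in ten games
print(comb(10, 3) * p ** 3 * (1 - p) ** 7) # ~0.2601

# (c) normal approximation to Bin(100, 1/3): mean 33.33, variance 22.22;
# P(X < 25) = P(X <= 24), approximated with a continuity correction at 24.5
mu, var = 100 * p, 100 * p * (1 - p)
z = (24.5 - mu) / sqrt(var)
print(0.5 * (1 + erf(z / sqrt(2))))        # ~0.030

# (d) negative binomial: third success in the tenth game
print(comb(9, 2) * p ** 3 * (1 - p) ** 7)  # ~0.0780
```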
5. You may assume that 15% of individuals in a large population are left-handed.
(a) If a random sample of 40 individuals is taken, find the probability that exactly
6 are left-handed.
(b) If a random sample of 400 individuals is taken, find the probability that
exactly 60 are left-handed by using a suitable approximation. Briefly discuss
the appropriateness of the approximation.
(c) What is the smallest possible size of a randomly chosen sample if we wish to
be 99% sure of finding at least one left-handed individual in the sample?
Solution:
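A sketch of the three calculations (the continuity correction in (b), approximating P(X = 60) by the interval (59.5, 60.5), is the usual choice):

```python
from math import comb, erf, log, sqrt, ceil

p = 0.15

# (a) exact binomial: exactly 6 left-handed out of 40
print(comb(40, 6) * p ** 6 * (1 - p) ** 34)          # ~0.174

# (b) normal approximation to Bin(400, 0.15): mean 60, variance 51
mu, sd = 400 * p, sqrt(400 * p * (1 - p))
phi = lambda z: 0.5 * (1 + erf(z / sqrt(2)))         # standard normal cdf
print(phi((60.5 - mu) / sd) - phi((59.5 - mu) / sd)) # ~0.056

# (c) smallest n with 1 - 0.85^n >= 0.99, i.e. 0.85^n <= 0.01
print(ceil(log(0.01) / log(0.85)))                   # 29
```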
6. Show that the moment generating function (mgf) of a Poisson distribution with
parameter λ is given by:

MX(t) = exp(λ(exp(t) − 1)), writing exp(θ) ≡ e^θ.

Hence show that the mean and variance of the distribution are both λ.
Solution:
We have:

MX(t) = E(exp(Xt)) = ∑_{x=0}^{∞} exp(xt) (λ^x/x!) exp(−λ)
      = exp(−λ) ∑_{x=0}^{∞} (λ exp(t))^x / x!
      = exp(−λ) exp(λ exp(t))
      = exp(λ(exp(t) − 1)).

Differentiating, M′X(t) = λ exp(t) MX(t), so E(X) = M′X(0) = λ. Differentiating
again, M″X(t) = (λ exp(t) + (λ exp(t))²) MX(t), so E(X²) = M″X(0) = λ + λ², and
hence Var(X) = λ + λ² − λ² = λ.
8. People entering an art gallery are counted by the attendant at the door. Assume
that people arrive in accordance with a Poisson distribution, with one person
arriving every 2 minutes. The attendant leaves the door unattended for 5 minutes.
(a) Calculate the probability that:
i. nobody will enter the gallery in this time
ii. 3 or more people will enter the gallery in this time.
(b) Find, to the nearest second, the length of time for which the attendant could
leave the door unattended for there to be a probability of 0.90 of no arrivals in
that time.
(c) Comment briefly on the assumption of a Poisson distribution in this context.
Solution:
(a) λ = 1 for a two-minute interval, so λ = 2.5 for a five-minute interval. Therefore:
P (no arrivals) = e−2.5 = 0.0821
and:
P (≥ 3 arrivals) = 1−pX (0)−pX (1)−pX (2) = 1−e−2.5 (1+2.5+3.125) = 0.4562.
(b) For an interval of N minutes, the parameter is N/2. We need p(0) = 0.90, so
e−N/2 = 0.90 giving N/2 = − ln(0.90) and N = 0.21 minutes, or 13 seconds.
(c) The rate is unlikely to be constant: there will be more people at lunchtimes or
early evenings etc. There are also likely to be several arrivals within a small
period – couples, groups etc. – so it is quite unlikely that the Poisson will
provide a good model.
Solution:
(a) The survivor function is:

S(y) = P(Y > y) = ∫_y^∞ λe^{−λx} dx = [−e^{−λx}]_y^∞ = e^{−λy}.
(b) The age-specific failure rate is constant, indicating it does not vary with age.
This is unlikely to be true in practice!
10. For the binomial distribution with a probability of success of 0.25 in an individual
trial, calculate the probability that, in 50 trials, there are at least 8 successes:
(a) using the normal approximation without a continuity correction
(b) using the normal approximation with a continuity correction.
Compare these results with the exact probability of 0.9547 and comment.
Solution:
We seek P(X ≥ 8) using the normal approximation Y ∼ N(12.5, 9.375).
(a) Without a continuity correction:

P(Y ≥ 8) = P(Z ≥ (8 − 12.5)/√9.375) = P(Z ≥ −1.47) = 0.9292.

The required probability could have been expressed as P(X > 7), or indeed using
any number in [7, 8), for example:

P(Y > 7) = P(Z ≥ (7 − 12.5)/√9.375) = P(Z ≥ −1.80) = 0.9641.
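Part (b) repeats the calculation with the continuity correction point 7.5, halfway between 7 and 8; a sketch of both, for comparison with the exact value 0.9547:

```python
from math import erf, sqrt

mu, var = 50 * 0.25, 50 * 0.25 * 0.75   # 12.5 and 9.375
phi = lambda z: 0.5 * (1 + erf(z / sqrt(2)))  # standard normal cdf

print(1 - phi((8 - mu) / sqrt(var)))    # (a) no correction:   ~0.9292
print(1 - phi((7.5 - mu) / sqrt(var)))  # (b) with correction: ~0.9487
# The continuity-corrected value is noticeably closer to the exact 0.9547.
```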
11. A greengrocer has a very large pile of oranges on his stall. The pile of fruit is a
mixture of 50% old fruit with 50% new fruit; one cannot tell which are old and
which are new. However, 20% of old oranges are mouldy inside, but only 10% of
new oranges are mouldy. Suppose that you choose 5 oranges at random. What is
the distribution of the number of mouldy oranges in your sample?
Solution:
For an orange chosen at random, the event ‘mouldy’ is the union of the disjoint
events ‘mouldy’ ∩ ‘new’ and ‘mouldy’ ∩ ‘old’. So:

P(‘mouldy’) = 0.2 × 0.5 + 0.1 × 0.5 = 0.15.
As the pile of oranges is very large, we can assume that the results for the five
oranges will be independent, so we have 5 independent trials each with probability
of ‘mouldy’ equal to 0.15. The distribution of the number of mouldy oranges will be
a binomial distribution with n = 5 and π = 0.15.
12. Underground trains on the Northern line have a probability 0.05 of failure between
Golders Green and King’s Cross. Supposing that the failures are all independent,
what is the probability that out of 10 journeys between Golders Green and King’s
Cross more than 8 do not have a breakdown?
Solution:
The probability of no breakdown on one journey is π = 1 − 0.05 = 0.95, so the
number of journeys without a breakdown, X, has a Bin(10, 0.95) distribution. We
want P (X > 8), which is:
P(X > 8) = p(9) + p(10)
         = C(10, 9) × (0.95)⁹ × (0.05)¹ + C(10, 10) × (0.95)¹⁰ × (0.05)⁰
         = 0.3151 + 0.5987
         = 0.9138.
13. Suppose that the normal rate of infection for a certain disease in cattle is 25%. To
test a new serum which may prevent infection, three experiments are carried out.
The test for infection is not always valid for some particular cattle, so the
experimental results are incomplete – we cannot always tell whether a cow is
infected or not. The results of the three experiments are:
(a) 10 animals are injected; all 10 remain free from infection
(b) 17 animals are injected; more than 15 remain free from infection and there are
2 doubtful cases
(c) 23 animals are injected; more than 20 remain free from infection and there are
three doubtful cases.
Which experiment provides the strongest evidence in favour of the serum?
Solution:
These experiments involve tests on different cattle, which one might expect to
behave independently of one another. The probability of infection without injection
with the serum might also reasonably be assumed to be the same for all cattle. So
the distribution which we need here is the binomial distribution. If the serum has
no effect, then the probability of infection for each of the cattle is 0.25.
One way to assess the evidence of the three experiments is to calculate the
probability of the result of the experiment if the serum had no effect at all. If it has
an effect, then one would expect larger numbers of cattle to remain free from
infection, so the experimental results as given do provide some clue as to whether
the serum has an effect, in spite of their incompleteness.
Let X(n) be the number of cattle infected, out of a sample of n. We are assuming
that X(n) ∼ Bin(n, 0.25).
(a) With 10 trials, the probability of 0 infected if the serum has no effect is:

P(X(10) = 0) = C(10, 0) × (0.75)¹⁰ = (0.75)¹⁰ = 0.0563.
(b) With 17 trials, the probability of more than 15 remaining uninfected if the
serum has no effect is:
(c) With 23 trials, the probability of more than 20 remaining free from infection if
the serum has no effect is:
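These two binomial tail probabilities, together with (a), can be evaluated directly; a minimal sketch:

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Bin(n, p)."""
    return sum(comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(k + 1))

# Probability of each experiment's result if the serum has no effect (pi = 0.25):
print(binom_cdf(0, 10, 0.25))   # (a) none of 10 infected:      ~0.0563
print(binom_cdf(1, 17, 0.25))   # (b) at most 1 of 17 infected: ~0.0501
print(binom_cdf(2, 23, 0.25))   # (c) at most 2 of 23 infected: ~0.0492
# The smallest of these marks the strongest evidence in favour of the serum.
```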
14. In a large industrial plant there is an accident on average every two days.
(a) What is the chance that there will be exactly two accidents in a given week?
(b) What is the chance that there will be two or more accidents in a given week?
(c) If James goes to work there for a four-week period, what is the probability
that no accidents occur while he is there?
Solution:
Here we have counts of random events over time, which is a typical application for
the Poisson distribution. We are assuming that accidents are equally likely to occur
at any time and are independent. The mean for the Poisson distribution is 0.5 per
day.
Let X be the number of accidents in a week. The probability of exactly two
accidents in a given week is found by using the parameter λ = 5 × 0.5 = 2.5 (5
working days a week assumed).
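With λ = 2.5 per week, parts (a) and (b) are then direct Poisson calculations; a minimal sketch:

```python
from math import exp

lam = 2.5  # accidents per five-day working week

print(exp(-lam) * lam ** 2 / 2)   # (a) exactly two accidents: ~0.2565
print(1 - exp(-lam) * (1 + lam))  # (b) two or more accidents: ~0.7127
```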
(c) If James goes to the industrial plant and does not change the probability of an
accident simply by being there (he might bring bad luck, or be superbly
safety-conscious!), then over 4 weeks there are 20 working days, and the
probability of no accident comes from a Poisson random variable with mean
10. If Y is the number of accidents while James is there, the probability of no
accidents is:
pY(0) = e^{−10} (10)⁰/0! = 0.0000454.
James is very likely to be there when there is an accident!
15. The chance that a lottery ticket has a winning number is 0.0000001.
(a) If 10,000,000 people buy tickets which are independently numbered, what is
the probability there is no winner?
(b) What is the probability that there is exactly 1 winner?
(c) What is the probability that there are exactly 2 winners?
Solution:
The number of winning tickets, X, will be distributed as:
X ∼ Bin(10,000,000, 0.0000001).
Since n is large and π is small, the Poisson distribution should provide a good
approximation. The Poisson parameter is:
λ = nπ = 10,000,000 × 0.0000001 = 1
and so we set X ∼ Pois(1). We have:
p(0) = e⁻¹ 1⁰/0! = 0.3679,  p(1) = e⁻¹ 1¹/1! = 0.3679  and  p(2) = e⁻¹ 1²/2! = 0.1839.
Using the exact binomial distribution of X, the results are:

p(0) = C(10⁷, 0) × (10⁻⁷)⁰ × (1 − 10⁻⁷)^{10⁷} = 0.3679
p(1) = C(10⁷, 1) × (10⁻⁷)¹ × (1 − 10⁻⁷)^{10⁷−1} = 0.3679
p(2) = C(10⁷, 2) × (10⁻⁷)² × (1 − 10⁻⁷)^{10⁷−2} = 0.1839.
Notice that, in this case, the Poisson approximation is correct to at least 4 decimal
places.
16. Suppose that X ∼ Uniform[0, 1]. Compute P (X > 0.2), P (X ≥ 0.2) and
P (X 2 > 0.04).
Solution:
We have a = 0 and b = 1, and can use the formula for P (c < X ≤ d), for constants
c and d. Hence:
P(X > 0.2) = P(0.2 < X ≤ 1) = (1 − 0.2)/(1 − 0) = 0.8.
Also:
P (X ≥ 0.2) = P (X = 0.2) + P (X > 0.2) = 0 + P (X > 0.2) = 0.8.
Finally:
P (X 2 > 0.04) = P (X < −0.2) + P (X > 0.2) = 0 + P (X > 0.2) = 0.8.
17. Suppose that the service time for a customer at a fast food outlet has an
exponential distribution with parameter 1/3 (customers per minute). What is the
probability that a customer waits more than 4 minutes?
Solution:
The distribution of X is Exp(1/3), so the probability is:
P (X > 4) = 1 − F (4) = 1 − (1 − e−(1/3)×4 ) = 1 − 0.7364 = 0.2636.
18. Suppose that the distribution of men’s heights in London, measured in cm, is
N (175, 62 ). Find the proportion of men whose height is:
(a) under 169 cm
(b) over 190 cm
(c) between 169 cm and 190 cm.
Solution:
The values of interest are 169 and 190. The corresponding z-values are:

z1 = (169 − 175)/6 = −1  and  z2 = (190 − 175)/6 = 2.5.
Using values from Table 3 of Murdoch and Barnes’ Statistical Tables, we have:
P (X < 169) = P (Z < −1) = Φ(−1)
= 1 − Φ(1) = 1 − 0.8413 = 0.1587
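Parts (b) and (c) standardise in the same way; a sketch of all three proportions:

```python
from math import erf, sqrt

mu, sigma = 175, 6
phi = lambda z: 0.5 * (1 + erf(z / sqrt(2)))  # standard normal cdf

p_under = phi((169 - mu) / sigma)       # (a) under 169 cm: ~0.1587
p_over = 1 - phi((190 - mu) / sigma)    # (b) over 190 cm:  ~0.0062
print(p_under, p_over, 1 - p_under - p_over)  # (c) between: ~0.8351
```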
19. Two statisticians disagree about the distribution of IQ scores for a population
under study. Both agree that the distribution is normal, and that σ = 15, but A
says that 5% of the population have IQ scores greater than 134.6735, whereas B
says that 10% of the population have IQ scores greater than 109.224. What is the
difference between the mean IQ score as assessed by A and that as assessed by B?
Solution:
The standardised z-value giving 5% in the upper tail is 1.6449, and for 10% it is
1.2816. So, converting to the scale of IQ scores (multiplying by σ = 15), the values are:

µA + 1.6449 × 15 = µA + 24.6735 = 134.6735, so µA = 110

whereas:

µB + 1.2816 × 15 = µB + 19.224 = 109.224, so µB = 90.

The difference is µA − µB = 110 − 90 = 20.
(a) Sketch f (z) and explain why it can serve as the pdf for a random variable Z.
(b) Determine the moment generating function of Z.
(c) Use the mgf to find E(Z), Var(Z), E(Z 3 ) and E(Z 4 ).
(You may assume that −1 < t < 1, for the mgf, which will ensure convergence.)
Hence find E(X) and Var(X). (The wording of the question implies that you use
the result which you have just proved. Other methods of derivation will not be
accepted!)
5. Cars independently pass a point on a busy road at an average rate of 150 per hour.
(a) Assuming a Poisson distribution, find the probability that none passes in a
given minute.
(b) What is the expected number passing in two minutes?
(c) Find the probability that the expected number actually passes in a given
two-minute period.
6. James goes fishing every Saturday. The number of fish he catches follows a Poisson
distribution. On a proportion π of the days he goes fishing, he does not catch
anything. He makes it a rule to take home the first, and then every other, fish
which he catches, i.e. the first, third, fifth fish etc.
(a) Using a Poisson distribution, find the mean number of fish he catches.
(b) Show that the probability that he takes home the last fish he catches is
(1 − π 2 )/2.
There are two kinds of statistics, the kind you look up and the kind you make
up.
(Rex Stout)
Appendix E
Multivariate random variables
Solution:
(a) The joint distribution (with marginal probabilities) is:
W =w
0 2 4 pZ (z)
−1 0.00 0.00 0.16 0.16
Z=z 0 0.00 0.08 0.24 0.32
1 0.16 0.12 0.00 0.28
2 0.24 0.00 0.00 0.24
pW (w) 0.40 0.20 0.40 1.00
(b) It is straightforward to see that:
P (W = 2 ∩ Z = 1) 0.12 3
P (W = 2 | Z = 1) = = = .
P (Z = 1) 0.28 7
For E(W | Z = 0), we have:

E(W | Z = 0) = ∑_w w P(W = w | Z = 0) = 0 × (0/0.32) + 2 × (0.08/0.32) + 4 × (0.24/0.32) = 3.5.
We see E(W ) = 2 (by symmetry), and:
E(Z) = −1 × 0.16 + 0 × 0.32 + 1 × 0.28 + 2 × 0.24 = 0.6.
Also:

E(WZ) = ∑_w ∑_z wz p(w, z) = (−4) × 0.16 + 2 × 0.12 = −0.4

hence:

Cov(W, Z) = E(WZ) − E(W) E(Z) = −0.4 − 2 × 0.6 = −1.6.
X=x
−1 0 1
−1 0.05 0.15 0.10
Y =y 0 0.10 0.05 0.25
1 0.10 0.05 0.15
Solution:
The conditional distribution of X given Y = 1 is:

X = x | Y = 1       −1   0    1
pX|Y=1(x | Y = 1)   1/3  1/6  1/2

(b) From the conditional distribution we see:

E(X | Y = 1) = −1 × 1/3 + 0 × 1/6 + 1 × 1/2 = 1/6.

E(Y) = 0 (by symmetry), and so Var(Y) = E(Y²) = 0.6. E(X) = 0.25 and
E(X²) = 0.75, so Var(X) = 0.75 − (0.25)² = 0.6875.
(Note that Var(X) and Var(Y) are not strictly necessary here!)
Next:

E(XY) = ∑_x ∑_y xy p(x, y) = (−1)(−1)(0.05) + (1)(−1)(0.10) + (−1)(1)(0.10) + (1)(1)(0.15) = 0.

So:

Cov(X, Y) = E(XY) − E(X) E(Y) = 0  ⇒  Corr(X, Y) = 0.

(c) X and Y are not independent random variables since, for example:

p(−1, −1) = 0.05 ≠ 0.25 × 0.30 = pX(−1) pY(−1).
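This is a standard example of variables that are uncorrelated yet dependent; both facts can be verified straight from the table (a minimal sketch):

```python
# Joint pf from the table above: keys are (x, y) pairs
p = {(-1, -1): 0.05, (0, -1): 0.15, (1, -1): 0.10,
     (-1, 0): 0.10, (0, 0): 0.05, (1, 0): 0.25,
     (-1, 1): 0.10, (0, 1): 0.05, (1, 1): 0.15}

px = {x: sum(p[(x, y)] for y in (-1, 0, 1)) for x in (-1, 0, 1)}
py = {y: sum(p[(x, y)] for x in (-1, 0, 1)) for y in (-1, 0, 1)}

e_xy = sum(x * y * p[(x, y)] for x, y in p)
e_x = sum(x * q for x, q in px.items())
e_y = sum(y * q for y, q in py.items())

print(e_xy - e_x * e_y)              # covariance = 0
print(p[(-1, -1)], px[-1] * py[-1])  # 0.05 != 0.075: not independent
```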
where:

πi = e^{iθ}/(1 + e^{iθ})

for i = 1, 2, . . . , n. Derive the joint probability function, p(x1, x2, . . . , xn).
Solution:
Since the Xi s are independent (but not identically distributed) random variables,
we have:

p(x1, x2, . . . , xn) = ∏_{i=1}^{n} p(xi) = ∏_{i=1}^{n} πi^{xi} (1 − πi)^{1−xi} = exp(θ ∑_{i=1}^{n} i xi) / ∏_{i=1}^{n} (1 + e^{iθ})

using πi^{xi} (1 − πi)^{1−xi} = e^{iθxi}/(1 + e^{iθ}) for each Bernoulli term.
Solution:
Since the Xi s are independent (and identically distributed) random variables, we
have:

p(x1, x2, . . . , xn) = ∏_{i=1}^{n} p(xi).
6. The random variables X1 and X2 are independent and have the common
distribution given in the table below:
X=x 0 1 2 3
pX (x) 0.2 0.4 0.3 0.1
Solution:
(a) With W = max(X1, X2) and Y = min(X1, X2), the joint distribution of W and Y is:
                    W = w
          0        1            2            3
Y = y  0  (0.2)²   2(0.2)(0.4)  2(0.2)(0.3)  2(0.2)(0.1)
       1  0        (0.4)(0.4)   2(0.4)(0.3)  2(0.4)(0.1)
       2  0        0            (0.3)(0.3)   2(0.3)(0.1)
       3  0        0            0            (0.1)(0.1)
pW(w)     (0.2)²   (0.8)(0.4)   (1.5)(0.3)   (1.9)(0.1)

which is:

                    W = w
          0     1     2     3
Y = y  0  0.04  0.16  0.12  0.04
       1  0.00  0.16  0.24  0.08
       2  0.00  0.00  0.09  0.06
       3  0.00  0.00  0.00  0.01
pW(w)     0.04  0.32  0.45  0.19
7. Consider two random variables X and Y . X can take the values −1, 0 and 1, and
Y can take the values 0, 1 and 2. The joint probabilities for each pair are given by
the following table:
X = −1 X = 0 X = 1
Y =0 0.10 0.20 0.10
Y =1 0.10 0.05 0.10
Y =2 0.10 0.05 0.20
Solution:
(a) The marginal distribution of X is:
X=x −1 0 1
pX (x) 0.3 0.3 0.4
The marginal distribution of Y is:
Y =y 0 1 2
pY (y) 0.40 0.25 0.35
Hence:
E(X) = −1 × 0.3 + 0 × 0.3 + 1 × 0.4 = 0.1
and:
E(Y ) = 0 × 0.40 + 1 × 0.25 + 2 × 0.35 = 0.95.
(b) We have:

Cov(U, V) = Cov(X + Y, X − Y)
          = E((X + Y)(X − Y)) − E(X + Y) E(X − Y)
          = E(X² − Y²) − (E(X) + E(Y))(E(X) − E(Y))

hence:

Cov(U, V) = E(X²) − E(Y²) − ((E(X))² − (E(Y))²) = Var(X) − Var(Y) = 0.69 − 0.7475 = −0.0575

using E(X²) = 0.7 and E(Y²) = 1.65 from the marginal distributions.
(c) U = 1 is achieved for (X, Y ) pairs (−1, 2), (0, 1) or (1, 0). The corresponding
values of V are −3, −1 and 1. We have:
P(V = −3 | U = 1) = 0.1/0.25 = 2/5
P(V = −1 | U = 1) = 0.05/0.25 = 1/5
P(V = 1 | U = 1) = 0.1/0.25 = 2/5

hence:

E(V | U = 1) = −3 × 2/5 + (−1) × 1/5 + 1 × 2/5 = −1.
8. Two refills for a ballpoint pen are selected at random from a box containing three
blue refills, two red refills and three green refills. Define the following random
variables:
X = the number of blue refills selected
Y = the number of red refills selected.
196
E.1. Worked examples
Solution:
We have:

P(X = 1, Y = 1) = P(BR) + P(RB) = 3/8 × 2/7 + 2/8 × 3/7 = 3/14.
(b) We have:
X=x
0 1 2
0 3/28 9/28 3/28
Y =y 1 3/14 3/14 0
2 1/28 0 0
(c) The marginal distribution of X is:
X=x 0 1 2
pX (x) 10/28 15/28 3/28
Hence:
10 15 3 3
E(X) = 0 × +1× +2× = .
28 28 28 4
The marginal distribution of Y is:
Y =y 0 1 2
pY (y) 15/28 12/28 1/28
Hence:
15 12 1 1
E(Y ) = 0 × +1× +2× = .
28 28 28 2
The conditional distribution of X given Y = 1 is:
X = x|Y = 1 0 1
pX|Y =1 (x | y = 1) 1/2 1/2
Hence:
1 1 1
E(X | Y = 1) = 0 × +1× = .
2 2 2
(d) We have E(XY) = 1 × 1 × 3/14 = 3/14 (the only non-zero term), hence:

Cov(X, Y) = E(XY) − E(X) E(Y) = 3/14 − 3/4 × 1/2 = 12/56 − 21/56 = −9/56.

(e) Since Cov(X, Y) ≠ 0, a necessary condition for independence fails to hold. The
random variables are not independent.
9. Show that the marginal distributions of a bivariate distribution are not enough to
define the bivariate distribution itself.
Solution:
Here we must show that there are two distinct bivariate distributions with the
same marginal distributions. It is easiest to think of the simplest case where X and
Y each take only two values, say 0 and 1.
Suppose the marginal distributions of X and Y are the same, with
p(0) = p(1) = 0.5. One possible bivariate distribution with these marginal
distributions is the one for which there is independence between X and Y . This has
pX,Y (x, y) = pX (x) pY (y) for all x, y. Writing it in full:
pX,Y (0, 0) = pX,Y (1, 0) = pX,Y (0, 1) = pX,Y (1, 1) = 0.5 × 0.5 = 0.25.
The table of probabilities for this choice of independence is shown in the first table
below.
Trying some other value for pX,Y (0, 0), like 0.2, gives the second table below.
X\Y   0     1           X\Y   0    1
 0    0.25  0.25         0    0.2  0.3
 1    0.25  0.25         1    0.3  0.2
The construction of these probabilities is done by making sure the row and column
totals are equal to 0.5, and so we now have a second distribution with the same
marginal distributions as the first.
This example is very simple, but one can almost always construct many bivariate
distributions with the same marginal distributions even for continuous random
variables.
11. There are different ways to write the covariance. Show that:

Cov(X, Y) = E(XY) − E(X) E(Y)

and:

Cov(X, Y) = E((X − E(X))Y) = E(X(Y − E(Y))).
Solution:
Working directly from the definition:

Cov(X, Y) = E((X − E(X))(Y − E(Y))) = E(XY − X E(Y) − E(X) Y + E(X) E(Y)) = E(XY) − E(X) E(Y).

Similarly:

E((X − E(X))Y) = E(XY − E(X) Y) = E(XY) − E(X) E(Y) = Cov(X, Y).

The remaining result follows by an argument symmetric with the last one.
12. Suppose that Var(X) = Var(Y) = 1, and that X and Y have correlation coefficient
ρ. Show that it follows from Var(X − ρY) ≥ 0 that ρ² ≤ 1.
Solution:
We have, since Cov(X, Y) = ρ when both variances equal 1:

Var(X − ρY) = Var(X) + ρ² Var(Y) − 2ρ Cov(X, Y) = 1 + ρ² − 2ρ² = 1 − ρ².

Hence 1 − ρ² ≥ 0, and so ρ² ≤ 1.
X = x      −1  0  1
P(X = x)    a  b  a

E(X) = −1 × a + 0 × b + 1 × a = 0
E(X²) = 1 × a + 0 × b + 1 × a = 2a
E(X³) = −1 × a + 0 × b + 1 × a = 0
There are many possible choices for a and b which give a valid probability
distribution, for instance a = 0.25 and b = 0.5.
14. A fair coin is thrown n times, each throw being independent of the ones before. Let
R = ‘the number of heads’, and S = ‘the number of tails’. Find the covariance of R
and S. What is the correlation of R and S?
Solution:
One can go about this in a straightforward way. If Xi is the number of heads and
Yi is the number of tails on the ith throw, then the joint distribution of Xi and Yi is
given by:

Xi\Yi   0    1
 0      0    0.5
 1      0.5  0

so E(Xi Yi) = 0 and Cov(Xi, Yi) = 0 − E(Xi) E(Yi) = −0.25. Since the throws are
independent, summing over the n throws gives:

Cov(R, S) = −0.25n.

Also Var(R) = Var(S) = 0.25n (add the variances of the Xi s or Yi s). The
correlation between R and S works out as −0.25n/0.25n = −1.
15. Suppose that X and Y have a bivariate distribution. Find the covariance of the
new random variables W = aX + bY and V = cX + dY where a, b, c and d are
constants.
200
E.2. Practice questions
Solution:
The covariance of W and V is:

Cov(W, V) = Cov(aX + bY, cX + dY) = ac Var(X) + bd Var(Y) + (ad + bc) Cov(X, Y)

expanding by the bilinearity of covariance.
16. Following on from Question 15, show that, if the variances of X and Y are the
same, then W = X + Y and V = X − Y are uncorrelated.
Solution:
Here we have a = b = c = 1 and d = −1. Substituting into the formula found above:

σWV = σ²X − σ²Y = 0.
There is no assumption here that X and Y are independent. It is not true that W
and V are independent without further restrictions on X and Y .
(b) For random variables X and Y , and constants a, b, c and d, show that:
3. X and Y are discrete random variables which can assume values 0, 1 and 2 only.
(a) Draw up a table to describe the joint distribution of X and Y and find the
value of the constant A.
(b) Describe the marginal distributions of X and Y .
(c) Give the conditional distribution of X | Y = 1 and find E(X | Y = 1).
(d) Are X and Y independent? Give a reason for your answer.
Statistics are like bikinis. What they reveal is suggestive, but what they conceal
is vital.
(Aaron Levenstein)
Appendix F
Solutions to Practice questions
F.1. Chapter 1 – Data visualisation and descriptive statistics

1. (a) We have:

∑_{j=1}^{3} (Yj − Ȳ) = (Y1 − Ȳ) + (Y2 − Ȳ) + (Y3 − Ȳ).

However:

3Ȳ = Y1 + Y2 + Y3

hence:

∑_{j=1}^{3} (Yj − Ȳ) = 3Ȳ − 3Ȳ = 0.
(b) We have:

∑_{j=1}^{3} ∑_{k=1}^{3} (Yj − Ȳ)(Yk − Ȳ) = (Y1 − Ȳ)((Y1 − Ȳ) + (Y2 − Ȳ) + (Y3 − Ȳ))
                                         + (Y2 − Ȳ)((Y1 − Ȳ) + (Y2 − Ȳ) + (Y3 − Ȳ))
                                         + (Y3 − Ȳ)((Y1 − Ȳ) + (Y2 − Ȳ) + (Y3 − Ȳ))
                                         = (∑_{j=1}^{3} (Yj − Ȳ))²

hence, using (a):

∑_{j=1}^{3} ∑_{k=1}^{3} (Yj − Ȳ)(Yk − Ȳ) = 0² = 0.
(c) We have:

∑_{j=1}^{3} ∑_{k=1}^{3} (Yj − Ȳ)(Yk − Ȳ) = ∑∑_{j≠k} (Yj − Ȳ)(Yk − Ȳ) + ∑_{j=1}^{3} (Yj − Ȳ)².

We have written the nine terms in the left-hand expression as the sum of the
six terms for which j ≠ k, and the three terms for which j = k. Since the
left-hand side is zero by (b), it follows that:

∑∑_{j≠k} (Yj − Ȳ)(Yk − Ȳ) = −∑_{j=1}^{3} (Yj − Ȳ)².
2. (a) We have:

ȳ = ∑_{i=1}^{n} yi / n = ∑_{i=1}^{n} (axi + b) / n = (a ∑_{i=1}^{n} xi + nb) / n = ax̄ + b.
(b) Multiply out the square within the summation sign and then evaluate the
three expressions, remembering that x̄ is a constant with respect to summation
and can be taken outside the summation sign as a common factor, i.e. we have:

∑_{i=1}^{n} (xi − x̄)² = ∑_{i=1}^{n} (xi² − 2xi x̄ + x̄²)
                      = ∑_{i=1}^{n} xi² − 2x̄ ∑_{i=1}^{n} xi + nx̄²
                      = ∑_{i=1}^{n} xi² − 2nx̄² + nx̄²
                      = ∑_{i=1}^{n} xi² − nx̄²

hence the result. Recall that ∑_{i=1}^{n} xi = nx̄.
(c) It is probably best to work with variances to avoid the square roots. The
variance of the y values, say s²y, is given by:

s²y = (1/n) ∑_{i=1}^{n} (yi − ȳ)²
    = (1/n) ∑_{i=1}^{n} (axi + b − (ax̄ + b))²
    = a² (1/n) ∑_{i=1}^{n} (xi − x̄)²
    = a² s²x.

The result follows on taking the square root, observing that the standard
deviation cannot be a negative quantity.
Adding a constant k to each value of a dataset adds k to the mean and leaves the
standard deviation unchanged. This corresponds to a transformation yi = axi + b
with a = 1 and b = k. Apply (a) and (c) with these values.
Multiplying each value of a dataset by a constant c multiplies the mean by c and
also the standard deviation by |c|. This corresponds to a transformation yi = cxi
with a = c and b = 0. Apply (a) and (c) with these values.
F.2. Chapter 2 – Probability theory
Since A ⊂ A ∪ B and B ⊂ A ∪ B, we have P(A) ≤ P(A ∪ B) and P(B) ≤ P(A ∪ B).
Adding and dividing by two:

(P(A) + P(B))/2 ≤ P(A ∪ B).

Similarly, A ∩ B ⊂ A and A ∩ B ⊂ B, so P(A ∩ B) ≤ P(A) and
P(A ∩ B) ≤ P(B). Adding, 2P(A ∩ B) ≤ P(A) + P(B), so:

P(A ∩ B) ≤ (P(A) + P(B))/2.
(b) To show that X c and Y c are not necessarily mutually exclusive when X and Y
are mutually exclusive, the best approach is to find a counterexample.
Attempts to ‘prove’ the result directly are likely to be logically flawed.
Look for a simple example. Suppose we roll a die. Let X = {6} be the event of
obtaining a 6, and let Y = {5} be the event of obtaining a 5. Obviously X and
Y are mutually exclusive, but Xᶜ = {1, 2, 3, 4, 5} and Yᶜ = {1, 2, 3, 4, 6} have
Xᶜ ∩ Yᶜ ≠ ∅, so Xᶜ and Yᶜ are not mutually exclusive.
4. (a) A will win the game without deuce if he or she wins four points, including the
last point, before B wins three points. This can occur in three ways.
• A wins four straight points, i.e. AAAA with probability (2/3)4 = 16/81.
• B wins just one point in the game. There are 4 C1 ways for this to happen,
namely BAAAA, ABAAA, AABAA and AAABA. Each has probability
(1/3)(2/3)4 , so the probability of one of these outcomes is given by
4(1/3)(2/3)4 = 64/243.
• B wins just two points in the game. There are 5 C2 ways for this to
happen, namely BBAAAA, BABAAA, BAABAA, BAAABA,
ABBAAA, ABABAA, ABAABA, AABBAA, AABABA and
AAABBA. Each has probability (1/3)2 (2/3)4 , so the probability of one of
these outcomes is given by 10(1/3)2 (2/3)4 = 160/729.
Therefore, the probability that A wins without a deuce must be the sum of
these, namely:

16/81 + 64/243 + 160/729 = (144 + 192 + 160)/729 = 496/729.
(b) We can mimic the above argument to find the probability that B wins the
game without a deuce. That is, the probability of four straight points to B is
(1/3)4 = 1/81, the probability that A wins just one point in the game is
4(2/3)(1/3)4 = 8/243, and the probability that A wins just two points is
10(2/3)2 (1/3)4 = 40/729. So the probability of B winning without a deuce is
1/81 + 8/243 + 40/729 = 73/729 and so the probability of deuce is
1 − 496/729 − 73/729 = 160/729.
(c) Either: suppose deuce has been called. The probability that A wins the set
without further deuces is the probability that the next two points go AA –
with probability (2/3)2 .
The probability of exactly one further deuce is that the next four points go
ABAA or BAAA – with probability (2/3)3 (1/3) + (2/3)3 (1/3) = (2/3)4 .
The probability of exactly two further deuces is that the next six points go
ABABAA, ABBAAA, BAABAA or BABAAA – with probability
4(2/3)4 (1/3)2 = (2/3)6 .
Continuing this way, the probability that A wins after three further deuces is
(2/3)8 and the overall probability that A wins after deuce has been called is
(2/3)2 + (2/3)4 + (2/3)6 + (2/3)8 + · · · .
This is a geometric progression (GP) with first term a = (2/3)2 and common
ratio (2/3)2 , so the overall probability that A wins after deuce has been called
is a/(1 − r) (sum to infinity of a GP) which is:
(2/3)²/(1 − (2/3)²) = (4/9)/(5/9) = 4/5.
Or (quicker!): given a deuce, the next 2 balls can yield the following results.
A wins with probability (2/3)2 , B wins with probability (1/3)2 , and deuce
with probability 4/9.
Hence P (A wins | deuce) = (2/3)2 + (4/9) P (A wins | deuce) and solving
immediately gives P (A wins | deuce) = 4/5.
(d) We have:

P(A wins) = P(A wins without deuce) + P(deuce) × P(A wins | deuce) = 496/729 + (160/729) × (4/5) = 496/729 + 128/729 = 624/729 ≈ 0.856.
F.3. Chapter 3 – Random variables

E(X) = 1 × 1/3 + 2 × 2/3 = 5/3

E(X²) = 1 × 1/3 + 4 × 2/3 = 3

E(1/X) = 1 × 1/3 + (1/2) × 2/3 = 2/3

and, clearly, E(X²) ≠ (E(X))² = 25/9 and E(1/X) ≠ 1/E(X) = 3/5 in this case. A
single counterexample like this is enough to show that such identities do not hold
in general.
(b) We have:

E((X1 + X2 + · · · + Xn)/n) = E(X1 + X2 + · · · + Xn)/n
                            = (E(X1) + E(X2) + · · · + E(Xn))/n
                            = (µ + µ + · · · + µ)/n
                            = nµ/n
                            = µ.
Var((X1 + X2 + · · · + Xn)/n) = Var(X1 + X2 + · · · + Xn)/n²
                             = (Var(X1) + Var(X2) + · · · + Var(Xn))/n²  (by independence)
                             = (σ² + σ² + · · · + σ²)/n²
                             = nσ²/n²
                             = σ²/n.
3. Suppose n subjects are procured. The probability that a single subject does not
have the abnormality is 0.96. Using independence, the probability that none of the
subjects has the abnormality is (0.96)n .
The probability that at least one subject has the abnormality is 1 − (0.96)n . We
require the smallest whole number n for which 1 − (0.96)n > 0.95, i.e. we have
(0.96)n < 0.05.
We can solve the inequality by ‘trial and error’, but it is neater to take logs:
n ln(0.96) < ln(0.05), so n > ln(0.05)/ln(0.96) = 73.39 (the inequality reverses
because ln(0.96) < 0). Rounding up, 74 subjects should be procured.
(c) For the ‘forgetful’ rat (short-term, but not long-term, memory):

P(X = 1) = 1/4
P(X = 2) = (3/4) × (1/3)
P(X = 3) = (3/4) × (2/3) × (1/3)
. . .
P(X = r) = (3/4) × (2/3)^{r−2} × (1/3)  (for r ≥ 2).

Therefore:

E(X) = 1/4 + (3/4)(1/3)(2 + 3 × (2/3) + 4 × (2/3)² + · · ·)
     = 1/4 + (1/4)(2 + 3 × (2/3) + 4 × (2/3)² + · · ·)
     = 1/4 + (1/4) × 12
     = 3.25

since ∑_{k=2}^{∞} k (2/3)^{k−2} = 12.
Note that 2.5 < 3.25 < 4, so the intelligent rat needs the least trials on average,
while the stupid rat needs the most, as we would expect!
F.4. Chapter 4 – Common distributions of random variables
(b) Using the Poisson approximation with λ = 100 × 0.0228 = 2.28, we have the
following.
i. P (N = 0) ≈ e−2.28 = 0.1023.
ii. P (N ≤ 2) ≈ e−2.28 + e−2.28 × 2.28 + e−2.28 × (2.28)2 /2! = 0.6013.
The approximations are good (note there will be some rounding error, but the
values are close with the two methods). It is not surprising that there is close
agreement since n is large, π is small and nπ < 5.
MX(t) = E(e^{Xt}) = e^t × 1/k + e^{2t} × 1/k + · · · + e^{kt} × 1/k = (e^t + e^{2t} + · · · + e^{kt})/k.

The bracketed part of this expression is a geometric progression where the first
term is e^t and the common ratio is e^t.
Using the well-known result for the sum of k terms of a geometric progression, we
obtain:

MX(t) = (1/k) × e^t(1 − (e^t)^k)/(1 − e^t) = e^t(1 − e^{kt})/(k(1 − e^t)).
3. (a) For f(z) to serve as a pdf, we require (i.) f(z) ≥ 0 for all z, and (ii.)
∫_{−∞}^{∞} f(z) dz = 1. The first condition certainly holds for f(z). The second also
holds since:

∫_{−∞}^{∞} f(z) dz = ∫_{−∞}^{0} (1/2)e^{−|z|} dz + ∫_0^∞ (1/2)e^{−|z|} dz
                  = ∫_{−∞}^{0} (1/2)e^z dz + ∫_0^∞ (1/2)e^{−z} dz
                  = [e^z/2]_{−∞}^{0} − [e^{−z}/2]_0^∞
                  = 1/2 + 1/2
                  = 1.
= 1/(2(1 + t)) − 1/(2(t − 1))
= (1 − t²)^{−1}

where the condition −1 < t < 1 ensures the integrands are 0 at the infinite
limits.
(c) We can find the various moments by differentiating MZ(t), but it is simpler to
expand it:

MZ(t) = (1 − t²)^{−1} = 1 + t² + t⁴ + · · · .

Matching this with MZ(t) = 1 + E(Z)t + E(Z²)t²/2! + E(Z³)t³/3! + E(Z⁴)t⁴/4! + · · ·
gives E(Z) = 0, E(Z²) = 2 (so Var(Z) = 2), E(Z³) = 0 and E(Z⁴) = 24.
4. For X ∼ Bin(n, π), P(X = x) = C(n, x) π^x (1 − π)^{n−x}. So, for E(X), we have:

E(X) = ∑_{x=0}^{n} x C(n, x) π^x (1 − π)^{n−x}
     = ∑_{x=1}^{n} x C(n, x) π^x (1 − π)^{n−x}
     = ∑_{x=1}^{n} [n(n−1)!/((x−1)! ((n−1)−(x−1))!)] π π^{x−1} (1 − π)^{n−x}
     = nπ ∑_{x=1}^{n} C(n−1, x−1) π^{x−1} (1 − π)^{(n−1)−(x−1)}
     = nπ ∑_{y=0}^{n−1} C(n−1, y) π^y (1 − π)^{(n−1)−y}
     = nπ × 1
     = nπ

where y = x − 1, and the last summation is over all the values of the pf of another
binomial distribution, this time with possible values 0, 1, 2, . . . , n − 1 and
probability parameter π.
Similarly:

E(X(X − 1)) = ∑_{x=0}^{n} x(x−1) C(n, x) π^x (1 − π)^{n−x}
            = ∑_{x=2}^{n} [x(x−1) n!/(x! (n−x)!)] π^x (1 − π)^{n−x}
            = n(n−1)π² ∑_{x=2}^{n} [(n−2)!/((x−2)! (n−x)!)] π^{x−2} (1 − π)^{n−x}
            = n(n−1)π² ∑_{y=0}^{n−2} [(n−2)!/(y! (n−y−2)!)] π^y (1 − π)^{n−y−2}
            = n(n−1)π² ∑_{y=0}^{m} [m!/(y! (m−y)!)] π^y (1 − π)^{m−y}  (writing m = n − 2)
            = n(n−1)π².

Hence:

Var(X) = E(X(X − 1)) + E(X) − (E(X))² = n(n−1)π² + nπ − n²π² = nπ(1 − π).
5. (a) A rate of 150 cars per hour is a rate of 2.5 per minute. Using a Poisson
distribution with λ = 2.5, P(none passes) = e^{−2.5} × (2.5)⁰/0! = e^{−2.5} = 0.0821.
(b) The expected number passing in two minutes is 2 × 2.5 = 5.
(c) The probability of 5 cars passing in two minutes is e^{−5} × 5⁵/5! = 0.1755.
6. (a) Let X denote the number of fish caught, such that X ∼ Pois(λ), i.e.
P(X = x) = e^{−λ} λ^x/x!, where the parameter λ is as yet unknown. So
P(X = 0) = e^{−λ} λ⁰/0! = e^{−λ}.
However, we know P(X = 0) = π. So e^{−λ} = π, giving −λ = ln(π) and hence the
mean is λ = ln(1/π).
(b) James will take home the last fish caught if he catches 1, 3, 5, . . . fish. So we
require:

P(X = 1) + P(X = 3) + P(X = 5) + · · · = e^{−λ} (λ + λ³/3! + λ⁵/5! + · · ·).

Now we know:

e^λ = 1 + λ + λ²/2! + λ³/3! + · · ·

and:

e^{−λ} = 1 − λ + λ²/2! − λ³/3! + · · · .

Subtracting gives:

e^λ − e^{−λ} = 2(λ + λ³/3! + λ⁵/5! + · · ·).

Therefore:

e^{−λ} (e^λ − e^{−λ})/2 = (1 − e^{−2λ})/2 = (1 − π²)/2
as required.

F.5. Chapter 5 – Multivariate random variables
(b) We have:
as required.
2. (a) We have:

E(∑_{i=1}^{k} ai Xi) = ∑_{i=1}^{k} E(ai Xi) = ∑_{i=1}^{k} ai E(Xi).
(b) We have:

Var(∑_{i=1}^{k} ai Xi) = E[(∑_{i=1}^{k} ai Xi − ∑_{i=1}^{k} ai E(Xi))²] = E[(∑_{i=1}^{k} ai (Xi − E(Xi)))²]
= ∑_{i=1}^{k} ai² E((Xi − E(Xi))²) + ∑_{1≤i≠j≤k} ai aj E((Xi − E(Xi))(Xj − E(Xj)))
= ∑_{i=1}^{k} ai² Var(Xi) + ∑_{1≤i≠j≤k} ai aj E(Xi − E(Xi)) E(Xj − E(Xj))  (by independence)
= ∑_{i=1}^{k} ai² Var(Xi)

since E(Xi − E(Xi)) = 0 for each i.
Additional note: remember there are two ways to compute the variance:
Var(X) = E((X − µ)2 ) and Var(X) = E(X 2 ) − (E(X))2 . The former is more
convenient for analytical derivations/proofs (see above), while the latter should be
used to compute variances for common distributions such as Poisson or exponential
distributions. Actually it is rather difficult to compute the variance for a Poisson
distribution using the formula Var(X) = E((X − µ)2 ) directly.
lse.ac.uk/statistics Department of Statistics
The London School of Economics
and Political Science
Houghton Street
London WC2A 2AE
Email: [email protected]
Telephone: +44 (0)20 7852 3709
The London School of Economics and Political Science is a School of the University of London. It is a
charity and is incorporated in England as a company limited by guarantee under the Companies Acts
(Reg no 70527).
The School seeks to ensure that people are treated equitably, regardless of age, disability, race,
nationality, ethnic or national origin, gender, religion, sexual orientation or personal circumstances.