
ST102/ST109

Elementary Statistical Theory

Course pack

2022/23 (Michaelmas term)

Dr James Abdey

lse.ac.uk/statistics
ST102/ST109

Elementary Statistical Theory

Course pack

© James Abdey 2022–23

The author asserts copyright over all material in this course guide except where
otherwise indicated. All rights reserved. No part of this work may be reproduced in any
form, or by any means, without permission in writing from the author.
Contents


Preliminaries ix
0.1 Organisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
0.2 Course materials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
0.3 Supplementary reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
0.4 Assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
0.5 Syllabus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
0.6 And finally, words of wisdom from a wise (youngish) man . . . . . . . . . xi

Preface xiii
0.7 The role of statistics in the research process . . . . . . . . . . . . . . . . xiii

1 Data visualisation and descriptive statistics 1


1.1 Synopsis of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.4 Continuous and discrete variables . . . . . . . . . . . . . . . . . . . . . . 4
1.5 The sample distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.5.1 Bar charts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.5.2 Sample distributions of variables with many values . . . . . . . . 7
1.5.3 Skewness of distributions . . . . . . . . . . . . . . . . . . . . . . . 8
1.6 Measures of central tendency . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.6.1 Notation for variables . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.6.2 Summation notation . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.6.3 The sample mean . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.6.4 The (sample) median . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.6.5 Sensitivity to outliers . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.6.6 Skewness, means and medians . . . . . . . . . . . . . . . . . . . . 13
1.6.7 Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.7 Measures of dispersion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.7.1 Variance and standard deviation . . . . . . . . . . . . . . . . . . . 15
1.7.2 Sample quantiles . . . . . . . . . . . . . . . . . . . . . . . . . . . 16


1.7.3 Quantile-based measures of dispersion . . . . . . . . . . . . . . . 17


1.7.4 Boxplots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.8 Associations between two variables . . . . . . . . . . . . . . . . . . . . . 18
1.8.1 Scatterplots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.8.2 Line plots (time series plots) . . . . . . . . . . . . . . . . . . . . . 19
1.8.3 Side-by-side boxplots for comparisons . . . . . . . . . . . . . . . . 20
1.8.4 Two-way contingency tables . . . . . . . . . . . . . . . . . . . . . 20
1.9 Overview of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.10 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

2 Probability theory 23
2.1 Synopsis of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.4 Set theory: the basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.5 Axiomatic definition of probability . . . . . . . . . . . . . . . . . . . . . 31
2.5.1 Basic properties of probability . . . . . . . . . . . . . . . . . . . . 32
2.6 Classical probability and counting rules . . . . . . . . . . . . . . . . . . . 36
2.6.1 Brute force: listing and counting . . . . . . . . . . . . . . . . . . . 38
2.6.2 Combinatorial counting methods . . . . . . . . . . . . . . . . . . 38
2.6.3 Combining counts: rules of sum and product . . . . . . . . . . . . 43
2.7 Conditional probability and Bayes’ theorem . . . . . . . . . . . . . . . . 44
2.7.1 Independence of multiple events . . . . . . . . . . . . . . . . . . . 45
2.7.2 Independent versus mutually exclusive events . . . . . . . . . . . 46
2.7.3 Conditional probability of independent events . . . . . . . . . . . 48
2.7.4 Chain rule of conditional probabilities . . . . . . . . . . . . . . . . 48
2.7.5 Total probability formula . . . . . . . . . . . . . . . . . . . . . . . 50
2.7.6 Bayes’ theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
2.8 Overview of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
2.9 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

3 Random variables 59
3.1 Synopsis of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.2 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.3 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.4 Discrete random variables . . . . . . . . . . . . . . . . . . . . . . . . . . 61


3.4.1 Probability distribution of a discrete random variable . . . . . . . 61


3.4.2 The cumulative distribution function (cdf) . . . . . . . . . . . . . 65
3.4.3 Properties of the cdf for discrete distributions . . . . . . . . . . . 66
3.4.4 General properties of the cdf . . . . . . . . . . . . . . . . . . . . . 66
3.4.5 Properties of a discrete random variable . . . . . . . . . . . . . . 67
3.4.6 Expected value versus sample mean . . . . . . . . . . . . . . . . . 68
3.4.7 Moments of a random variable . . . . . . . . . . . . . . . . . . . . 73
3.4.8 The moment generating function . . . . . . . . . . . . . . . . . . 75
3.5 Continuous random variables . . . . . . . . . . . . . . . . . . . . . . . . 77
3.5.1 Moment generating functions . . . . . . . . . . . . . . . . . . . . 86
3.5.2 Median of a random variable . . . . . . . . . . . . . . . . . . . . . 86
3.6 Overview of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
3.7 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

4 Common distributions of random variables 89


4.1 Synopsis of chapter content . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.2 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.3 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.4 Common discrete distributions . . . . . . . . . . . . . . . . . . . . . . . . 91
4.4.1 Discrete uniform distribution . . . . . . . . . . . . . . . . . . . . 91
4.4.2 Bernoulli distribution . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.4.3 Binomial distribution . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.4.4 Poisson distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.4.5 Connections between probability distributions . . . . . . . . . . . 98
4.4.6 Poisson approximation of the binomial distribution . . . . . . . . 98
4.4.7 Some other discrete distributions . . . . . . . . . . . . . . . . . . 100
4.5 Common continuous distributions . . . . . . . . . . . . . . . . . . . . . . 101
4.5.1 The (continuous) uniform distribution . . . . . . . . . . . . . . . 101
4.5.2 Exponential distribution . . . . . . . . . . . . . . . . . . . . . . . 102
4.5.3 Two other distributions . . . . . . . . . . . . . . . . . . . . . . . 104
4.5.4 Normal (Gaussian) distribution . . . . . . . . . . . . . . . . . . . 105
4.5.5 Normal approximation of the binomial distribution . . . . . . . . 112
4.6 Overview of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
4.7 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . 115


5 Multivariate random variables 117


5.1 Synopsis of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
5.2 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
5.3 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
5.4 Joint probability functions . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.5 Marginal distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.6 Continuous multivariate distributions . . . . . . . . . . . . . . . . . . . . 121
5.7 Conditional distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
5.7.1 Properties of conditional distributions . . . . . . . . . . . . . . . . 124
5.7.2 Conditional mean and variance . . . . . . . . . . . . . . . . . . . 125
5.7.3 Continuous conditional distributions . . . . . . . . . . . . . . . . 125
5.8 Covariance and correlation . . . . . . . . . . . . . . . . . . . . . . . . . . 126
5.8.1 Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
5.8.2 Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
5.8.3 Sample covariance and correlation . . . . . . . . . . . . . . . . . . 129
5.9 Independent random variables . . . . . . . . . . . . . . . . . . . . . . . . 131
5.9.1 Joint distribution of independent random variables . . . . . . . . 132
5.10 Sums and products of random variables . . . . . . . . . . . . . . . . . . . 133
5.10.1 Distributions of sums and products . . . . . . . . . . . . . . . . . 133
5.10.2 Expected values and variances of sums of random variables . . . . 134
5.10.3 Expected values of products of independent random variables . . 135
5.10.4 Some proofs of previous results . . . . . . . . . . . . . . . . . . . 135
5.10.5 Distributions of sums of random variables . . . . . . . . . . . . . 136
5.11 Overview of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
5.12 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . 138

A Data visualisation and descriptive statistics 139


A.1 (Re)vision of fundamentals . . . . . . . . . . . . . . . . . . . . . . . . . . 139
A.2 Worked example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
A.3 Practice questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145

B Probability theory 147


B.1 Worked examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
B.2 Practice questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158

C Random variables 161


C.1 Worked examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161


C.2 Practice questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172

D Common distributions of random variables 175


D.1 Worked examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
D.2 Practice questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188

E Multivariate random variables 191


E.1 Worked examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
E.2 Practice questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201

F Solutions to Practice questions 203


F.1 Chapter 1 – Data visualisation and descriptive statistics . . . . . . . . . 203
F.2 Chapter 2 – Probability theory . . . . . . . . . . . . . . . . . . . . . . . 205
F.3 Chapter 3 – Random variables . . . . . . . . . . . . . . . . . . . . . . . . 208
F.4 Chapter 4 – Common distributions of random variables . . . . . . . . . . 210
F.5 Chapter 5 – Multivariate random variables . . . . . . . . . . . . . . . . . 215

Preliminaries

0.1 Organisation
Course lecturer: Dr. James Abdey.
• Email: [email protected]
• Office hours: see the Student Hub.

Lectures:
• Tuesdays 09:00–11:00, Peacock Theatre, Michaelmas term weeks 1–10.
Lecture recordings will be made available via Moodle, but note these should not be
a substitute for attending lectures!

Example workshops (optional):


• Tuesdays 15:00–16:00, Peacock Theatre, Michaelmas term weeks 1–11.
Recordings of these sessions will also be available.

Classes (90-minute duration):


• Michaelmas term weeks 2–11.

0.2 Course materials


All course materials will be available on Moodle. Your principal resource should be this
course pack featuring (i) lecture material (which covers the entire syllabus for the
Michaelmas term in detail1 ), and (ii) an examples manual. You are strongly advised to
read the relevant material in these notes before the corresponding lecture – this initial
exposure to the material will enhance your understanding of the lecture itself!
The optional ‘Example workshops’ will, unsurprisingly, be workshops of extra examples.
No new material will be covered in these sessions – instead I will go through
examination-style questions providing further practice of the course material.
References to the examples manual will be made in these sessions.

0.3 Supplementary reading


In principle, the course materials should be sufficient to fully understand all the
exciting topics which you will encounter on this statistical journey of discovery.
1
Conditional on your artistic tendencies, feel free to run wild with highlighters, a spectrum of Post-it
notes etc. to annotate this tome. With luck, you may be referring back to this at a later stage in your
studies!


However, you will all be aware of the considerable, dare I say significant,
heterogeneity within the student body. Therefore, some may seek additional
reassurance in the form of recommended reading.2

I should stress that purchase of a textbook is at your sole discretion. I suspect this
decision will be based on the extent to which your student finances have been
studiously managed to date. For the shopaholics among you, the recommended text
is:
• Larsen, R.J. and M.J. Marx (2017). An Introduction to Mathematical Statistics
and Its Applications, Pearson, sixth edition.3

Of course, numerous titles are available covering the topics frequently found in
first-year undergraduate service-level courses in statistics. Again, due to the
doubtless heterogeneous preferences among you all, some may find one author’s
style readily accessible, while others may despair at the baffling presentation of
material.4 Consequently, my best advice would be to sample5 a range of textbooks,
and choose your preferred one. Any textbook would act as a supplement to the
course materials. In particular, textbooks are filled with additional exercises to
check understanding – and if you’re lucky, they’ll give you (some) solutions too!

0.4 Assessment
Classes will involve going through the solutions to exercises, and full solutions will
be made available on Moodle upon completion of all that week’s classes. Further
unseen problems will also be covered. As if the sheer joy of studying the discipline
was not incentive enough to engage with the exercises, they are the best
preparation for the. . .
• two-hour written examination in week 0 of Lent term – this has 50%
weighting for ST102, and 100% weighting for ST109. For details, please see the
‘Past exam papers (January)’ section of Moodle.

Statistical tables are provided in the examination. Specifically, these will be from:
Murdoch, J. and J.A. Barnes, Statistical Tables, fourth edition (probably), Palgrave Macmillan.
You do not need to purchase this. Relevant abstracts (Tables 3, 7, 8 and 9) are
provided in electronic form on Moodle.

You will need a scientific calculator for both the classes and the examination. The
only permitted calculators for in-person examinations are the Casio fx-83 or
fx-85 range, available from many retailers.
2
Of course, if I’ve done my job properly, these notes should transcend all other publications!
3
Second-hand earlier editions will be just as valid.
4
One clearly hopes the former applies to this humble author.
5
A sacred word in statistics.


0.5 Syllabus

The full syllabus for ST102 (Michaelmas term) and ST109 Elementary Statistical
Theory can be found in the table of contents.

0.6 And finally, words of wisdom from a wise (youngish) man

(Not so) many years ago, I too was an undergraduate. Fresh-faced and full of
enthusiasm (some things never change), I discovered the strategy for success in
statistics.6 As you embark on this statistical voyage, I feel compelled to share the
following with you.

Statistics is fundamentally a cumulative discipline – that is, the following chapters
are not a series of self-contained units; rather, they build on each other sequentially.
As such, you are strongly advised to thoroughly study and understand all topics,
since accruing this knowledge will make it easier to make sense of later topics.

Repetition is the key to success. Repetition is the key to success. Familiarity
breeds recognition of how to solve problems (and hence ‘examination questions’), so
repeatedly attempting exercise sets and related questions in the examples manual
is highly recommended. To illustrate this point, consider the following:
Cdnuolt blveiee taht I cluod aulaclty uesdnatnrd waht I was rdanieg. The
phaonmneal pweor of the hmuan mnid, aoccdrnig to rarseech at
Cmabrigde Uinervtisy. It dn’seot mttaer in waht oredr the ltteers in a
wrod are, the olny iprmoatnt tihng is taht the frist and lsat ltteer be in
the rghit pclae. The rset can be a taotl mses and you can sitll raed it
wouthit a porbelm.
Contrast this with:
Miittluvraae asilyans sattes an idtenossiy ctuoonr epilsle is the
itternoiecson of a panle pleralal to the xl-yapne and the sruacfe of a
btiiarave nmarol dbttiisruein.
I suspect most of you could pretty much follow the first passage due to your
familiarity with most of the ‘true’ words. The second passage is constructed in
exactly the same way, but is probably harder to comprehend due to your relative
unfamiliarity with these words and the difficulty of the concepts involved.7
Therefore, repeated exposure to something eases comprehension and your capacity
to perform well in statistics follows this exact idea.
6
I expect this approach is not restricted to statistics, so feel free to apply it in an interdisciplinary
setting.
7
For the interested reader, the correct ‘translation’ is ‘Multivariate analysis states an isodensity
contour ellipse is the intersection of a plane parallel to the xy-plane and the surface of a bivariate
normal distribution’. And before you panic, no we will not be covering isodensity contour ellipses!


So, to conclude, perseverance with problem solving is your passport to a strong
examination performance. Attempting (ideally successfully!) the weekly exercise sets
is of paramount importance. Here endeth the first lesson.

Practise, practise, practise.


(James Abdey)

Preface

Torture numbers, and they’ll confess to anything.


(Gregg Easterbrook)

0.7 The role of statistics in the research process


Before we get into details, let us begin with the ‘big picture’. First, some definitions.

Research: trying to answer questions about the world in a systematic (scientific) way.

Empirical research: doing research by first collecting relevant information (data) about the world.

Research may be about almost any topic: physics, biology, medicine, economics, history,
literature etc. Most of our examples will be from the social sciences: economics,
management, finance, sociology, political science, psychology etc. Research in this sense
is not just what universities do. Governments, businesses, and all of us as individuals do
it too. Statistics is used in essentially the same way for all of these.

Example 0.1 It all starts with a question.

Can labour regulation hinder economic performance?

Understanding the gender pay gap: what has competition got to do with it?

Children and online risk: powerless victims or resourceful participants?

Refugee protection as a collective action problem: is the European Union (EU) shirking its responsibilities?

Do directors perform for pay?

Heeding the push from below: how do social movements persuade the rich to
listen to the poor?

Does devolution lead to regional inequalities in welfare activity?

The childhood origins of adult socio-economic disadvantage: do cohort and gender matter?

Parental care as unpaid family labour: how do spouses share?


Key stages of the empirical research process

We can think of the empirical research process as having five key stages.

1. Formulating the research question.

2. Research design: deciding what kinds of data to collect, how and from where.

3. Collecting the data.

4. Analysis of the data to answer the research question.

5. Reporting the answer and how it was obtained.

The main job of statistics is the analysis of data, although it also informs other stages
of the research process. Statistics are used when the data are quantitative, i.e. in the
form of numbers.
Statistical analysis of quantitative data has the following features.

It can cope with large volumes of data, in which case the first task is to provide an
understandable summary of the data. This is the job of descriptive statistics.

It can deal with situations where the observed data are regarded as only a part (a
sample) from all the data which could have been obtained (the population). There
is then uncertainty in the conclusions. Measuring this uncertainty is the job of
statistical inference.

We conclude this preface with an example of how statistics can be used to help answer
a research question.

Example 0.2 CCTV, crime and fear of crime.


Our research question is what is the effect of closed-circuit television (CCTV)
surveillance on:

the number of recorded crimes?

the fear of crime felt by individuals?

We illustrate this using part of the following study.

Gill and Spriggs (2005): Assessing the impact of CCTV. Home Office Research
Study 292.

The research design of the study comprised the following.

Target area: a housing estate in northern England.

Control area: a second, comparable housing estate.


Intervention: CCTV cameras installed in the target area but not in the
control area.
Compare measures of crime and the fear of crime in the target and control
areas in the 12 months before and 12 months after the intervention.

The data and data collection were as follows.

Level of crime: the number of crimes recorded by the police, in the 12 months
before and 12 months after the intervention.
Fear of crime: a survey of residents of the areas.
• Respondents: random samples of residents in each of the areas.
• In each area, one sample before the intervention date and one about 12
months after.
• Sample sizes:
Before After
Target area 172 168
Control area 215 242

• Question considered here: ‘In general, how much, if at all, do you worry
that you or other people in your household will be victims of crime?’ (from
1 = ‘all the time’ to 5 = ‘never’).
Statistical analysis of the data.
% of respondents who worry ‘sometimes’, ‘often’ or ‘all the time’:
Target area:   Before [a] = 26,  After [b] = 23,  Change = −3
Control area:  Before [c] = 53,  After [d] = 46,  Change = −7
RES = 0.98,  Confidence interval: 0.55–1.74

It is possible to calculate various statistics, for example the Relative Effect Size
RES = ([d]/[c])/([b]/[a]) = 0.98 is a summary measure which compares the
changes in the two areas.
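(Plugging in the values from the table: RES = (46/53)/(23/26) ≈ 0.868/0.885 ≈ 0.98.)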
RES < 1, which means that the observed change in the reported fear of crime
has been a bit less good in the target area.
However, there is uncertainty because of sampling: only 168 and 242 individuals
were actually interviewed at each time in each area, respectively.
The confidence interval for RES includes 1, which means that changes in the
self-reported fear of crime in the two areas are ‘not statistically significantly
different’ from each other.
The number of (any kind of) recorded crimes:
Target area:   Before [a] = 112,  After [b] = 101,  Change = −11
Control area:  Before [c] = 73,   After [d] = 88,   Change = +15
RES = 1.34,  Confidence interval: 0.79–1.89


Now the RES > 1, which means that the observed change in the number of
crimes has been worse in the control area than in the target area.

However, the numbers of crimes in each area are fairly small, which means that
these estimates of the changes in crime rates are fairly uncertain.

The confidence interval for RES again includes 1, which means that the changes
in crime rates in the two areas are not statistically significantly different from
each other.

In summary, this study did not support the claim that the introduction of CCTV
reduces crime or the fear of crime.

If you want to read more about research of this question, see Welsh and
Farrington (2008). Effects of closed circuit television surveillance on crime.
Campbell Systematic Reviews 2008:17.

Many of the statistical terms and concepts mentioned above have not been explained yet
– that is what the rest of the course is for! However, it serves as an interesting example
of how statistics can be employed in the social sciences to investigate research questions.

Chapter 1
Data visualisation and descriptive
statistics

1.1 Synopsis of chapter


Graphical representations of data provide us with a useful view of the distribution of
variables. In this chapter, we shall cover a selection of approaches for displaying data
visually – each being appropriate in certain situations. We then consider descriptive
statistics, whose main objective is to interpret key features of a dataset numerically.
Graphs and charts have little intrinsic value per se; rather, their main function is to
bring out interesting features of a dataset. For this reason, simple descriptions should be
preferred to complicated graphics.
Although data visualisation is useful as a preliminary form of data analysis to get a ‘feel’
for the data, in practice we also need to be able to summarise data numerically. We
introduce descriptive statistics and distinguish between measures of location, measures
of dispersion and skewness. All these statistics provide useful summaries of raw datasets.

1.2 Learning outcomes


At the end of this chapter, you should be able to:

interpret and summarise raw data on social science variables graphically

interpret and summarise raw data on social science variables numerically

calculate basic measures of location and dispersion

describe the skewness of a distribution and interpret boxplots

discuss the key terms and concepts introduced in the chapter.

1.3 Introduction
Starting point: a collection of numerical data (a sample) has been collected in order to
answer some questions. Statistical analysis may have two broad aims.


1. Descriptive statistics: summarise the data which were collected, in order to make them more understandable.

2. Statistical inference: use the observed data to draw conclusions about some
broader population.

Sometimes ‘1.’ is the only aim. Even when ‘2.’ is the main aim, ‘1.’ is still an essential
first step.
Data do not just speak for themselves. There are usually simply too many numbers to
make sense of just by staring at them. Descriptive statistics attempt to summarise
some key features of the data to make them understandable and easy to
communicate. These summaries may be graphical or numerical (tables or
individual summary statistics).

Example 1.1 We consider data for 155 countries on three variables from around
2002. The data can be found in the file ‘Countries.csv’. The variables are the
following.

Region of the country.


• This is a nominal variable coded (in alphabetical order) as follows:
1 = Africa, 2 = Asia, 3 = Europe, 4 = Latin America, 5 = Northern
America, 6 = Oceania.

The level of democracy, i.e. a democracy index, in the country.


• This is an 11-point ordinal scale from 0 (lowest level of democracy) to 10
(highest level of democracy).

Gross domestic product per capita (GDP per capita) (i.e. per person, in
$000s), which is measured on a ratio scale.

The statistical data in a sample are typically stored in a data matrix, as shown in
Figure 1.1.
Rows of the data matrix correspond to different units (subjects/observations).

Here, each unit is a country.

The number of units in a dataset is the sample size, typically denoted by n.

Here, n = 155 countries.

Columns of the data matrix correspond to variables, i.e. different characteristics of the units.

Here, region, the level of democracy, and GDP per capita are the variables.


Figure 1.1: Example of a data matrix.


1.4 Continuous and discrete variables


Different variables may have different properties. These determine which kinds of
statistical methods are suitable for the variables.

Continuous and discrete variables

A continuous variable can, in principle, take any real values within some interval.

In Example 1.1, GDP per capita is continuous, taking any non-negative value.

A variable is discrete if it is not continuous, i.e. if it can only take certain values,
but not any others.

In Example 1.1, region and the level of democracy are discrete, with possible
values of 1, 2, . . . , 6, and 0, 1, 2, . . . , 10, respectively.

Many discrete variables have only a finite number of possible values. In Example 1.1, the
region variable has 6 possible values, and the level of democracy has 11 possible values.
The simplest possibility is a binary, or dichotomous, variable, with just two possible
values. For example, a person’s sex could be recorded as 1 = female and 2 = male.1
A discrete variable can also have an unlimited number of possible values.

For example, the number of visitors to a website in a day: 0, 1, 2, . . ..2

Example 1.2 In Example 1.1, the levels of democracy have a meaningful ordering,
from less democratic to more democratic countries. The numbers assigned to the
different levels must also be in this order, i.e. a larger number = more democratic.
In contrast, different regions (Africa, Asia, Europe, Latin America, Northern
America and Oceania) do not have such an ordering. The numbers used for the
region variable are just labels for different regions. A different numbering (such as
6 = Africa, 5 = Asia, 1 = Europe, 3 = Latin America, 2 = Northern America and
4 = Oceania) would be just as acceptable as the one we originally used. Some
statistical methods are appropriate for variables with both ordered and unordered
values, some only in the ordered case. Unordered categories are nominal data;
ordered categories are ordinal data.

1
Note that because sex is a nominal variable, the coding is arbitrary. We could also have, for example,
0 = male and 1 = female, or 0 = female and 1 = male. However, it is important to remember which
coding has been used!
2
In practice, of course, there is a finite number of internet users in the world. However, it is reasonable
to treat this variable as taking an unlimited number of possible values.


1.5 The sample distribution


The sample distribution of a variable consists of:

a list of the values of the variable which are observed in the sample
the number of times each value occurs (the counts or frequencies of the observed
values).

When the number of different observed values is small, we can show the whole sample
distribution as a frequency table of all the values and their frequencies.

Example 1.3 Continuing with Example 1.1, the observations of the region variable
in the sample are:

3 5 3 3 3 5 3 3 6 3 2 3 3 3 3

3 3 2 2 2 3 6 2 3 2 2 2 3 3 2

2 3 3 3 2 4 3 2 3 1 4 3 1 3 3

4 4 4 1 2 4 3 4 3 2 1 2 3 1 3

2 1 4 2 4 3 1 4 6 2 1 3 4 2 1

4 4 4 2 3 2 4 1 4 1 4 2 2 2 4

2 2 1 4 2 1 4 2 2 4 4 1 6 3 1

2 1 2 2 1 1 2 1 1 3 2 2 1 2 4

2 1 2 1 1 2 1 2 1 2 1 1 1 1 1

1 1 1 2 1 1 1 1 1 2 1 1 1 1 1

1 1 1 2 1

We may construct a frequency table for the region variable as follows:

Region                    Frequency (count)    Relative frequency (%)
(1) Africa                        48                 31.0 (= 100 × 48/155)
(2) Asia 44 28.4
(3) Europe 34 21.9
(4) Latin America 23 14.8
(5) Northern America 2 1.3
(6) Oceania 4 2.6
Total 155 100


Here ‘%’ is the percentage of countries in a region, out of the 155 countries in the
sample. This is a measure of proportion (that is, relative frequency).
Similarly, for the level of democracy, the frequency table is:

Level of democracy     Frequency     %     Cumulative %
0 35 22.6 22.6
1 12 7.7 30.3
2 4 2.6 32.9
3 6 3.9 36.8
4 5 3.2 40.0
5 5 3.2 43.2
6 12 7.7 50.9
7 13 8.4 59.3
8 16 10.3 69.6
9 15 9.7 79.3
10 32 20.6 100
Total 155 100

‘Cumulative %’ for a value of the variable is the sum of the percentages for that
value and all lower-numbered values.
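For example, the cumulative % for a level of democracy of 1 is 22.6 + 7.7 = 30.3.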

1.5.1 Bar charts


A bar chart is the graphical equivalent of the table of frequencies. Figure 1.2 displays
the region variable data as a bar chart. The relative frequencies of each region are
clearly visible.

Figure 1.2: Example of a bar chart showing the region variable.
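The frequency table and bar chart above can be produced with a few lines of R (the software mentioned later in this chapter for drawing histograms). A minimal sketch, assuming the data from ‘Countries.csv’ have been read into a data frame called countries with a column region coded 1 to 6 – the object and column names are illustrative, not prescribed:

    countries <- read.csv("Countries.csv")   # read the data from Example 1.1

    freq <- table(countries$region)           # frequency (count) of each region code
    freq
    round(100 * prop.table(freq), 1)          # relative frequencies, e.g. 100 * 48/155

    barplot(freq,                             # bar chart of the frequencies (cf. Figure 1.2)
            names.arg = c("Africa", "Asia", "Europe", "Latin America",
                          "Northern America", "Oceania"),
            ylab = "Frequency")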


1.5.2 Sample distributions of variables with many values


If a variable has many distinct values, listing frequencies of all of them is not very
practical.
A solution is to group the values into non-overlapping intervals, and produce a table or
graph of the frequencies within the intervals. The most common graph used for this is a
histogram.
A histogram is like a bar chart, but without gaps between bars, and often uses more
bars (intervals of values) than is sensible in a table. Histograms are usually drawn using
statistical software, such as R. You can let the software choose the intervals and the
number of bars.

Example 1.4 Continuing with Example 1.1, a table of frequencies for GDP per
capita where values have been grouped into non-overlapping intervals is shown
below. Figure 1.3 shows a histogram of GDP per capita with a greater number of
intervals to better display the sample distribution.

GDP per capita (in $000s) Frequency %


[0, 2) 49 31.6
[2, 5) 32 20.6
[5, 10) 29 18.7
[10, 20) 21 13.5
[20, 30) 19 12.3
[30, 50) 5 3.2
Total 155 100

(Figure: histogram with frequency on the vertical axis and GDP per capita, in thousands of U.S. dollars, on the horizontal axis.)
Figure 1.3: Histogram of GDP per capita.
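A minimal R sketch of how a histogram like Figure 1.3 might be drawn, assuming the GDP per capita values are stored in a numeric vector gdp (an illustrative name, for example a column of the countries data frame):

    hist(gdp,
         main = "Histogram of GDP per capita",
         xlab = "GDP per capita (thousands of U.S. dollars)",
         ylab = "Frequency")

    # To use the interval boundaries from the frequency table above instead
    # (right = FALSE gives left-closed intervals [0, 2), [2, 5) and so on):
    hist(gdp, breaks = c(0, 2, 5, 10, 20, 30, 50), right = FALSE)

Left to its defaults, R chooses the number of intervals automatically, which is usually more than would be sensible in a printed table.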


1.5.3 Skewness of distributions


Skewness and symmetry are terms used to describe the general shape of a sample
distribution.
From Figure 1.3, it is clear that a small number of countries has much larger values of
GDP per capita than the majority of countries in the sample. The distribution of GDP
per capita has a ‘long right tail’. Such a distribution is called positively skewed (or
skewed to the right).
A distribution with a longer left tail (i.e. toward small values) is negatively skewed (or
skewed to the left). A distribution is symmetric if it is not skewed in either direction.

Example 1.5 Figure 1.4 shows a (more-or-less) symmetric sample distribution for
diastolic blood pressure.
(Figure: histogram with proportion on the vertical axis and diastolic blood pressure, from about 40 to 120, on the horizontal axis.)

Figure 1.4: Diastolic blood pressures of 4,489 respondents aged 25 or over, Health Survey
for England, 2002.

Example 1.6 Figure 1.5 shows a (slightly) negatively-skewed distribution of marks


in an examination. Note the data relate to all candidates sitting the examination.
Therefore, the histogram shows the population distribution, not a sample
distribution.

1.6 Measures of central tendency


Frequency tables, bar charts and histograms aim to summarise the whole sample
distribution of a variable. Next we consider descriptive statistics, which summarise one
feature of the sample distribution in a single number: summary statistics.


(Figure: histogram of examination marks, with frequency on the vertical axis and marks from 0 to 100 on the horizontal axis.)

Figure 1.5: Final examination marks of a first-year statistics course.

We begin with measures of central tendency. These answer the question: where is
the ‘centre’ or ‘average’ of the distribution?
We consider the following measures of central tendency:

mean (i.e. the average, sample mean or arithmetic mean)

median

mode.

1.6.1 Notation for variables


In formulae, a generic variable is denoted by a single letter. In these course notes,
usually X. However, any other letter (Y , W etc.) can also be used, as long as it is used
consistently. A letter with a subscript denotes a single observation of a variable.

Example 1.7 We use Xi to denote the value of X for unit i, where i can take
values 1, 2, . . . , n, and n is the sample size.
Therefore, the n observations of X in the dataset (the sample) are X1 , X2 , . . . , Xn .
These can also be written as Xi , for i = 1, 2, . . . , n.

1.6.2 Summation notation


Let X1, X2, . . . , Xn (i.e. Xi, for i = 1, 2, . . . , n) be a set of n numbers. The sum of the
numbers is written as:
\[
\sum_{i=1}^{n} X_i = X_1 + X_2 + \cdots + X_n.
\]


This may be written as $\sum_i X_i$, or just $\sum X_i$. Other versions of the same idea are:

infinite sums: $\sum_{i=1}^{\infty} X_i = X_1 + X_2 + \cdots$

sums of sets of observations other than 1 to n, for example:
\[
\sum_{i=2}^{n/2} X_i = X_2 + X_3 + \cdots + X_{n/2}.
\]

1.6.3 The sample mean


The sample mean (‘arithmetic mean’, ‘mean’ or ‘average’) is the most common
measure of central tendency. The sample mean of a variable X is denoted X̄. It is the
‘sum of the observations’ divided by the ‘number of observations’ (sample size)
expressed as:
\[
\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}.
\]
Example 1.8 The mean $\bar{X} = \sum_i X_i / n$ of the numbers 1, 4 and 7 is:
\[
\frac{1+4+7}{3} = \frac{12}{3} = 4.
\]

Example 1.9 For the variables in Example 1.1:

the level of democracy has X̄ = 5.3


GDP per capita has X̄ = 8.6 (in $000s)
for region the mean is not meaningful(!), because the values of the variable do
not have a meaningful ordering.

The frequency table of the level of democracy is:

Level of democracy Xj     Frequency fj     %     Cumulative %
0 35 22.6 22.6
1 12 7.7 30.3
2 4 2.6 32.9
3 6 3.9 36.8
4 5 3.2 40.0
5 5 3.2 43.2
6 12 7.7 50.9
7 13 8.4 59.3
8 16 10.3 69.6
9 15 9.7 79.3
10 32 20.6 100
Total 155 100


If a variable has a small number of distinct values, X̄ is easy to calculate from the
frequency table. For example, the level of democracy has just 11 different values
which occur in the sample 35, 12, . . . , 32 times each, respectively.
Suppose X has K different values $X_1, X_2, \ldots, X_K$, with corresponding frequencies
$f_1, f_2, \ldots, f_K$. Therefore, $\sum_{j=1}^{K} f_j = n$ and:
\[
\bar{X} = \frac{\sum_{j=1}^{K} f_j X_j}{\sum_{j=1}^{K} f_j} = \frac{f_1 X_1 + f_2 X_2 + \cdots + f_K X_K}{f_1 + f_2 + \cdots + f_K} = \frac{f_1 X_1 + f_2 X_2 + \cdots + f_K X_K}{n}.
\]
In our example, the mean of the level of democracy (where K = 11) is:
\[
\bar{X} = \frac{35 \times 0 + 12 \times 1 + \cdots + 32 \times 10}{35 + 12 + \cdots + 32} = \frac{0 + 12 + \cdots + 320}{155} \approx 5.3.
\]
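The same calculation is easily checked in R, using the values and frequencies from the table above:

    x <- 0:10                                        # possible values of the level of democracy
    f <- c(35, 12, 4, 6, 5, 5, 12, 13, 16, 15, 32)   # their frequencies

    n <- sum(f)             # sample size, 155
    xbar <- sum(f * x) / n  # sample mean, approximately 5.3
    xbar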

Why is the mean a good summary of the central tendency?

Consider the following small dataset:

Deviations:
from X̄ (= 4) from the median (= 3)
i Xi Xi − X̄ (Xi − X̄)2 Xi − 3 (Xi − 3)2
1 1 −3 9 −2 4
2 2 −2 4 −1 1
3 3 −1 1 0 0
4 5 +1 1 +2 4
5 9 +5 25 +6 36
Sum 20 0 40 +5 45
X̄ = 4

We see that the sum of deviations from the mean is 0, i.e. we have:

\[
\sum_{i=1}^{n} (X_i - \bar{X}) = 0.
\]

The mean is ‘in the middle’ of the observations X1 , X2 , . . . , Xn , in the sense that
positive and negative values of the deviations Xi − X̄ cancel out, when summed over all
the observations.
Also, the smallest possible value of the sum of squared deviations $\sum_{i=1}^{n} (X_i - C)^2$, for any constant C, is obtained when $C = \bar{X}$.
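One way to see why the last claim holds is to expand the sum of squared deviations about an arbitrary constant C:
\[
\sum_{i=1}^{n} (X_i - C)^2 = \sum_{i=1}^{n} \left( (X_i - \bar{X}) + (\bar{X} - C) \right)^2 = \sum_{i=1}^{n} (X_i - \bar{X})^2 + n(\bar{X} - C)^2
\]
since the cross-term $2(\bar{X} - C) \sum_{i=1}^{n} (X_i - \bar{X})$ is zero by the previous result. The second term is never negative and equals zero only when $C = \bar{X}$, so the sum of squared deviations is minimised by taking $C = \bar{X}$.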


1.6.4 The (sample) median


Let X(1) , X(2) , . . . , X(n) denote the sample values of X when ordered from the smallest
to the largest, known as the order statistics, such that:

X(1) is the smallest observed value (the minimum) of X

X(n) is the largest observed value (the maximum) of X.

Median

The (sample) median, q50 , of a variable X is the value which is ‘in the middle’ of
the ordered sample.

If n is odd, then q50 = X((n+1)/2) .

For example, if n = 3, q50 = X(2) : (1) (2) (3).

If n is even, q50 = (X(n/2) + X(n/2+1) )/2.

For example, if n = 4, q50 = (X(2) + X(3) )/2: (1) (2) (3) (4).

Example 1.10 Continuing with Example 1.1, n = 155, so q50 = X(78) . For the level
of democracy, the median is 6.
From a table of frequencies, the median is the value for which the cumulative
percentage first reaches 50% (or, if a cumulative % is exactly 50%, the average of the
corresponding value of X and the next highest value).
The ordered values of the level of democracy are:

(.0) (.1) (.2) (.3) (.4) (.5) (.6) (.7) (.8) (.9)
(0.) 0 0 0 0 0 0 0 0 0
(1.) 0 0 0 0 0 0 0 0 0 0
(2.) 0 0 0 0 0 0 0 0 0 0
(3.) 0 0 0 0 0 0 1 1 1 1
(4.) 1 1 1 1 1 1 1 1 2 2
(5.) 2 2 3 3 3 3 3 3 4 4
(6.) 4 4 4 5 5 5 5 5 6 6
(7.) 6 6 6 6 6 6 6 6 6 6
(8.) 7 7 7 7 7 7 7 7 7 7
(9.) 7 7 7 8 8 8 8 8 8 8
(10.) 8 8 8 8 8 8 8 8 8 9
(11.) 9 9 9 9 9 9 9 9 9 9
(12.) 9 9 9 9 10 10 10 10 10 10
(13.) 10 10 10 10 10 10 10 10 10 10
(14.) 10 10 10 10 10 10 10 10 10 10
(15.) 10 10 10 10 10 10


The median can be determined from the frequency table of the level of democracy:

Level of democracy Xj     Frequency fj     %     Cumulative %
0 35 22.6 22.6
1 12 7.7 30.3
2 4 2.6 32.9
3 6 3.9 36.8
4 5 3.2 40.0
5 5 3.2 43.2
6 12 7.7 50.9
7 13 8.4 59.3
8 16 10.3 69.6
9 15 9.7 79.3
10 32 20.6 100
Total 155 100

1.6.5 Sensitivity to outliers


For the following small ordered dataset, the mean and median are both 4:

1, 2, 4, 5, 8.

Suppose we add one observation to get the ordered sample:

1, 2, 4, 5, 8, 100.

The median is now 4.5, and the mean is 20. In general, the mean is affected much more
than the median by outliers, i.e. unusually small or large observations. Therefore, you
should identify outliers early on and investigate them – perhaps there has been a data
entry error, which can simply be corrected. If deemed genuine outliers, a decision has to
be made about whether or not to remove them.

1.6.6 Skewness, means and medians


Due to its sensitivity to outliers, the mean, more than the median, is pulled toward the
longer tail of the sample distribution.

For a positively-skewed distribution, the mean is larger than the median.

For a negatively-skewed distribution, the mean is smaller than the median.

For an exactly symmetric distribution, the mean and median are equal.

When summarising variables with skewed distributions, it is useful to report both the
mean and the median.


Example 1.11 For the datasets considered previously:

Mean Median
Level of democracy 5.3 6
GDP per capita 8.6 4.7
Diastolic blood pressure 74.2 73.5
Examination marks 56.6 57.0

1.6.7 Mode
The (sample) mode of a variable is the value which has the highest frequency (i.e.
appears most often) in the data.

Example 1.12 For Example 1.1, the modal region is 1 (Africa) and the mode of
the level of democracy is 0.

The mode is not very useful for continuous variables which have many different values,
such as GDP per capita in Example 1.1. A variable can have several modes (i.e. be
multimodal). For example, GDP per capita has modes 0.8 and 1.9, both with 5
countries out of the 155.
The mode is the only measure of central tendency which can be used even when the
values of a variable have no ordering, such as for the (nominal) region variable in
Example 1.1.

1.7 Measures of dispersion


Central tendency is not the whole story. The two sample distributions in Figure 1.6
have the same mean, but they are clearly not the same. In one (red) the values have
more dispersion (variation) than in the other.

Figure 1.6: Two sample distributions.


Example 1.13 A small example determining the sum of the squared deviations
from the (sample) mean, used to calculate common measures of dispersion.

Deviations from X̄
i Xi Xi2 Xi − X̄ (Xi − X̄)2
1 1 1 −3 9
2 2 4 −2 4
3 3 9 −1 1
4 5 25 +1 1
5 9 81 +5 25
Sum 20 120 0 40
where $\bar{X} = 4$, $\sum X_i^2 = 120$ and $\sum (X_i - \bar{X})^2 = 40$.

1.7.1 Variance and standard deviation


The first measures of dispersion, the sample variance and its square root, the sample
standard deviation, are based on (Xi − X̄)2 , i.e. the squared deviations from the mean.

Sample variance and standard deviation

The sample variance of a variable X, denoted $S^2$ (or $S_X^2$), is defined as:
\[
S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2.
\]

The sample standard deviation of X, denoted $S$ (or $S_X$), is the positive square
root of the sample variance:
\[
S = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2}.
\]

These are the most commonly-used measures of dispersion. The standard deviation is
more understandable than the variance, because the standard deviation is expressed in
the same units as X (rather than the variance, which is expressed in squared units).
A useful rule-of-thumb for interpretation is that for many symmetric distributions, such
as the ‘normal’ distribution:

about 2/3 of the observations are between X̄ − S and X̄ + S, that is, within one
(sample) standard deviation about the (sample) mean

about 95% of the observations are between X̄ − 2 × S and X̄ + 2 × S, that is,


within two (sample) standard deviations about the (sample) mean.

Remember that standard deviations (and variances) are never negative, and they are
zero only if all the Xi observations are the same (that is, there is no variation in the
data).
If we are using a frequency table, we can also calculate:

\[
S^2 = \frac{1}{n-1} \left( \sum_{j=1}^{K} f_j X_j^2 - n\bar{X}^2 \right).
\]

Example 1.14 Consider the following simple dataset:

Deviations from X̄
i Xi Xi2 Xi − X̄ (Xi − X̄)2
1 1 1 −3 9
2 2 4 −2 4
3 3 9 −1 1
4 5 25 +1 1
5 9 81 +5 25
Sum 20 120 0 40
where $\bar{X} = 4$, $\sum X_i^2 = 120$ and $\sum (X_i - \bar{X})^2 = 40$.

We have:
\[
S^2 = \frac{1}{n-1} \sum (X_i - \bar{X})^2 = \frac{40}{4} = 10 = \frac{1}{n-1} \left( \sum X_i^2 - n\bar{X}^2 \right) = \frac{120 - 5 \times 4^2}{4}
\]
and $S = \sqrt{S^2} = \sqrt{10} = 3.16$.
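As a quick check in R, whose var() and sd() functions use the same n − 1 divisor:

    x <- c(1, 2, 3, 5, 9)
    var(x)   # sample variance: 10
    sd(x)    # sample standard deviation: 3.162278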

1.7.2 Sample quantiles

The median, q50 , is basically the value which divides the sample into the smallest 50%
of observations and the largest 50%. If we consider other percentage splits, we get other
(sample) quantiles (percentiles), qc .

Example 1.15 Some special quantiles are given below.

The first quartile, q25 or Q1 , is the value which divides the sample into the
smallest 25% of observations and the largest 75%.

The third quartile, q75 or Q3 , gives the 75%–25% split.

The extremes in this spirit are the minimum, X(1) (the ‘0% quantile’, so to
speak), and the maximum, X(n) (the ‘100% quantile’).

These are no longer ‘in the middle’ of the sample, but they are more general
measures of location of the sample distribution.


1.7.3 Quantile-based measures of dispersion

Range and interquartile range

Two measures based on quantile-type statistics are the:

range: X(n) − X(1) = maximum − minimum

interquartile range (IQR): IQR = q75 − q25 = Q3 − Q1 .

The range is, clearly, extremely sensitive to outliers, since it depends on nothing but the
extremes of the distribution, i.e. the minimum and maximum observations. The IQR
focuses on the middle 50% of the distribution, so it is completely insensitive to outliers.
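In R, these measures can be computed as follows for a numeric vector x of observations. Note that software packages use slightly different conventions for sample quantiles, so the results may differ a little from hand calculations based on a frequency table.

    max(x) - min(x)                          # the range
    quantile(x, probs = c(0.25, 0.5, 0.75))  # Q1, the median (q50) and Q3
    IQR(x)                                   # interquartile range, Q3 - Q1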

1.7.4 Boxplots
A boxplot (in full, a box-and-whiskers plot) summarises some key features of a sample
distribution using quantiles. The plot is comprised of the following.

The line inside the box, which is the median.

The box, whose edges are the first and third quartiles (Q1 and Q3 ). Hence the box
captures the middle 50% of the data. Therefore, the length of the box is the
interquartile range.

The bottom whisker extends either to the minimum or up to a length of 1.5 times
the interquartile range below the first quartile, whichever is closer to the first
quartile.

The top whisker extends either to the maximum or up to a length of 1.5 times the
interquartile range above the third quartile, whichever is closer to the third quartile.

Points beyond 1.5 times the interquartile range below the first quartile or above the
third quartile are regarded as outliers, and plotted as individual points.

A much longer whisker (and/or outliers) in one direction relative to the other indicates
a skewed distribution, as does a median line not in the middle of the box.

Example 1.16 Figure 1.7 displays a boxplot of GDP per capita using the sample
of 155 countries introduced in Example 1.1. Some summary statistics for this
variable are reported below.

Standard
Mean Median deviation IQR Range
GDP per capita 8.6 4.7 9.5 9.7 37.3


(Figure: boxplot of GDP per capita, annotated with minimum = 0.5, 1st quartile = 1.7, median = 4.7, 3rd quartile = 11.4, IQR = 11.4 − 1.7 = 9.7, largest observation at most 1.5 × IQR = 14.6 above the 3rd quartile = 23.7, and maximum = 37.8; points above 23.7 are plotted individually as outliers.)
Figure 1.7: Boxplot of GDP per capita.
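A boxplot such as Figure 1.7 can be drawn in R with a single call; a minimal sketch, again assuming the GDP per capita values are in a numeric vector gdp:

    boxplot(gdp, ylab = "GDP per capita")

    summary(gdp)   # minimum, quartiles, median, mean and maximum in one call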

1.8 Associations between two variables


So far, we have tried to summarise (some aspect of) the sample distribution of one
variable at a time.
However, we can also look at two (or more) variables together. The key question is then
whether some values of one variable tend to occur frequently together with particular
values of another, for example high values with high values. This would be an example
of an association between the variables. Such associations are central to most
interesting research questions, so you will hear much more about them in the future.
Some common methods of descriptive statistics for two-variable associations are
introduced here, but only very briefly now and mainly through examples.
The best way to summarise two variables together depends on whether the variables
have ‘few’ or ‘many’ possible values. We illustrate one method for each combination, as
listed below.

‘Many’ versus ‘many’: scatterplots (including line plots).

‘Few’ versus ‘many’: side-by-side boxplots.

‘Few’ versus ‘few’: two-way contingency tables (cross-tabulations).

1.8.1 Scatterplots
A scatterplot shows the values of two continuous variables against each other, plotted
as points in a two-dimensional coordinate system.


Example 1.17 A plot of data for 164 countries is shown in Figure 1.8 which plots
the following variables.

On the horizontal axis (the x-axis): a World Bank measure of ‘control of


corruption’, where high values indicate low levels of corruption.

On the vertical axis (the y-axis): GDP per capita in $.

Interpretation: it appears that virtually all countries with high levels of corruption
have relatively low GDP per capita. At lower levels of corruption there is a positive
association, where countries with very low levels of corruption also tend to have high
GDP per capita.

Figure 1.8: GDP per capita plotted against control of corruption.
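A scatterplot of this kind is produced in R with the plot() function; a sketch assuming two numeric vectors corruption and gdp (illustrative names) holding the values for the 164 countries:

    plot(corruption, gdp,
         xlab = "Control of corruption",
         ylab = "GDP per capita")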

1.8.2 Line plots (time series plots)


A common special case of a scatterplot is a line plot (time series plot), where the
variable on the x-axis is time. The points are connected in time order by lines, to show
how the variable on the y-axis changes over time.

Example 1.18 Figure 1.9 is a time series of an index of prices of consumer goods
and services in the UK for the period 1800–2009 (Office for National Statistics; scaled
so that the price level in 1974 = 100). This shows the price inflation over this period.


Figure 1.9: UK index of prices of consumer goods and services.
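In R, a line plot is simply a scatterplot with the points joined in time order; a sketch assuming vectors year and price_index (illustrative names):

    plot(year, price_index, type = "l",   # type = "l" joins the points with lines
         xlab = "Year", ylab = "Price index (1974 = 100)")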

1.8.3 Side-by-side boxplots for comparisons


Boxplots are useful for comparisons of how the distribution of a continuous variable
varies across different groups, i.e. across different levels of a discrete variable.

Example 1.19 Figure 1.10 shows side-by-side boxplots of GDP per capita for the
different regions in Example 1.1.

GDP per capita in African countries tends to be very low. There is a handful of
countries with somewhat higher GDPs per capita (shown as outliers in the plot).

The median for Asia is not much higher than for Africa. However, the
distribution in Asia is very much skewed to the right, with a tail of countries
with very high GDPs per capita.

The median in Europe is high, and the distribution is fairly symmetric.

The boxplots for Northern America and Oceania are not very useful, because
they are based on very few countries (two and three countries, respectively).
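Side-by-side boxplots such as those in Figure 1.10 can be obtained in R using the formula interface; a sketch assuming a data frame countries with columns gdp and region:

    boxplot(gdp ~ region, data = countries,
            xlab = "Region", ylab = "GDP per capita")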

1.8.4 Two-way contingency tables


A (two-way) contingency table (or cross-tabulation) shows the frequencies in the
sample of each possible combination of the values of two discrete variables. Such tables
often show the percentages within each row or column of the table.

Example 1.20 The table below reports the results from a survey of 972 private
investors.3 The variables are as follows.

Row variable: age as a discrete, grouped variable (four categories).


Figure 1.10: Side-by-side boxplots of GDP per capita by region.

Column variable: how much importance the respondent places on short-term gains from his/her investments (four levels).

Interpretation: look at the row percentages. For example, 17.8% of those aged under
45, but only 5.2% of those aged 65 and over, think that short-term gains are ‘very
important’. Among the respondents, the older age groups seem to be less concerned
with quick profits than the younger age groups.

Importance of short-term gains


Slightly Very
Age group Irrelevant important Important important Total
Under 45 37 45 38 26 146
(25.3) (30.8) (26.0) (17.8) (100)
45–54 111 77 57 37 282
(39.4) (27.3) (20.2) (13.1) (100)
55–64 153 49 31 20 253
(60.5) (19.4) (12.3) (7.9) (100)
65 and over 193 64 19 15 291
(66.3) (22.0) (6.5) (5.2) (100)
Total 494 235 145 98 972
(50.8) (24.2) (14.9) (10.1) (100)

Numbers in parentheses are percentages within the rows. For example,


25.3 = (37/146) × 100.
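A cross-tabulation with row percentages of this kind can be produced in R; a sketch assuming two factors age_group and importance (illustrative names) recorded for the 972 respondents:

    tab <- table(age_group, importance)          # two-way contingency table of counts
    tab
    round(100 * prop.table(tab, margin = 1), 1)  # percentages within each row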

3
Lewellen, W.G., R.C. Lease and G.G. Schlarbaum (1977). ‘Patterns of investment strategy and
behavior among individual investors’. The Journal of Business, 50(3), pp. 296–333.


1.9 Overview of chapter


This chapter has looked at different ways of presenting data visually. Which type of
diagram is most appropriate will be determined by the types of data being analysed.
You should be able to interpret any important features which are apparent from the
diagram. This chapter has also introduced some quantitative approaches to
summarising data, known as descriptive statistics. We have distinguished measures of
location, dispersion and skewness. Although descriptive statistics serve as a very basic
form of statistical analysis, they nevertheless are extremely useful for capturing the
main characteristics of a dataset. Therefore, any statistical analysis of data should start
with data visualisation and the calculation of descriptive statistics!

1.10 Key terms and concepts


(Arithmetic) mean Association
Bar chart Binary
Boxplot Contingency table
Continuous Count
Data matrix Descriptive statistics
Dichotomous Discrete
Distribution Frequency
Frequency table Histogram
Interquartile range Line plot
Maximum Measures of central tendency
Measures of dispersion Median
Minimum Mode
Nominal Order statistics
Ordinal Outlier
Proportion Quantile
Quartile Range
Relative frequency Sample distribution
Sample size Scatterplot
Skewness Standard deviation
Symmetry Unit
Variable Variance

The average human has one breast and one testicle.


(Des McHale)

Chapter 2
Probability theory

2.1 Synopsis of chapter


Probability theory is very important for statistics because it provides the rules which
allow us to reason about uncertainty and randomness, which is the basis of statistics.
Independence and conditional probability are profound ideas, but they must be fully
understood in order to think clearly about any statistical investigation.

2.2 Learning outcomes


After completing this chapter, you should be able to:

explain the fundamental ideas of random experiments, sample spaces and events
list the axioms of probability and be able to derive all the common probability
rules from them
list the formulae for the number of combinations and permutations of k objects out
of n, and be able to routinely use such results in problems
explain conditional probability and the concept of independent events
prove the law of total probability and apply it to problems where there is a
partition of the sample space
prove Bayes’ theorem and apply it to find conditional probabilities.

2.3 Introduction
Consider the following hypothetical example. A country will soon hold a referendum
about whether it should leave the European Union (EU). An opinion poll of a random
sample of people in the country is carried out.
950 respondents say that they plan to vote in the referendum. They answer the question
‘Will you vote ‘Yes’ or ‘No’ to leaving the EU?’ as follows:

Answer
Yes No Total
Count 513 437 950
% 54% 46% 100%


However, we are not interested in just this sample of 950 respondents, but in the
population which they represent, that is, all likely voters.
Statistical inference will allow us to say things like the following about the
population.

‘A 95% confidence interval for the population proportion, π, of ‘Yes’ voters is


(0.5083, 0.5717).’

‘The null hypothesis that π = 0.50, against the alternative hypothesis that
π > 0.50, is rejected at the 5% significance level.’

In short, the opinion poll gives statistically significant evidence that ‘Yes’ voters are in
the majority among likely voters. Such methods of statistical inference will be discussed
later in the course.
The inferential statements about the opinion poll rely on the following assumptions and
results.

Each response Xi is a realisation of a random variable from a Bernoulli


distribution with probability parameter π.

The responses X1 , X2 , . . . , Xn are independent of each other.

The sampling distribution of the sample mean (proportion) X̄ has expected


value π and variance π(1 − π)/n.

By use of the central limit theorem, the sampling distribution is approximately


a normal distribution.

In the next few chapters, we will learn about the terms in bold, among others.

The need for probability in statistics

In statistical inference, the data we have observed are regarded as a sample from a
broader population, selected with a random process.

Values in a sample are variable. If we collected a different sample we would not


observe exactly the same values again.

Values in a sample are also random. We cannot predict the precise values which
will be observed before we actually collect the sample.

Probability theory is the branch of mathematics which deals with randomness. So we


need to study this first.

A preview of probability

The first basic concepts in probability will be the following.


Experiment: for example, rolling a single die and recording the outcome.

Outcome of the experiment: for example, rolling a 3.

Sample space S: the set of all possible outcomes, here {1, 2, 3, 4, 5, 6}.

Event: any subset A of the sample space, for example A = {4, 5, 6}.1

Probability of an event A, P (A), will be defined as a function which assigns


probabilities (real numbers) to events (sets). This uses the language and concepts of set
theory. So we need to study the basics of set theory first.

2.4 Set theory: the basics


A set is a collection of elements (also known as ‘members’ of the set).

Example 2.1 The following are all examples of sets.

A = {Amy, Bob, Sam}.

B = {1, 2, 3, 4, 5}.

C = {x | x is a prime number} = {2, 3, 5, 7, 11, . . .}.

D = {x | x ≥ 0} (that is, the set of all non-negative real numbers).

Membership of sets and the empty set

x ∈ A means that object x is an element of set A.


x ∉ A means that object x is not an element of set A.

The empty set, denoted ∅, is the set with no elements, i.e. x ∉ ∅ is true for every
object x, and x ∈ ∅ is not true for any object x.

Example 2.2 If A = {1, 2, 3, 4, 5}, then:

1 ∈ A and 2 ∈ A

6 ∉ A and 1.5 ∉ A.

The familiar Venn diagrams help to visualise statements about sets. However, Venn
diagrams are not formal proofs of results in set theory.

Example 2.3 In Figure 2.1, the darkest area in the middle is A ∩ B, the total
shaded area is A ∪ B, and the white area is (A ∪ B)c = Ac ∩ B c .


Figure 2.1: Venn diagram depicting A ∪ B (the total shaded area).

Subsets and equality of sets

A ⊂ B means that set A is a subset of set B, defined as:

A⊂B when x ∈ A ⇒ x ∈ B.

Hence A is a subset of B if every element of A is also an element of B. An example


is shown in Figure 2.2.

Figure 2.2: Venn diagram depicting a subset, where A ⊂ B.

Example 2.4 An example of the distinction between subsets and non-subsets is:

{1, 2, 3} ⊂ {1, 2, 3, 4}, because all elements appear in the larger set

{1, 2, 5} ⊄ {1, 2, 3, 4}, because the element 5 does not appear in the larger set.

Two sets A and B are equal (A = B) if they have exactly the same elements. This
implies that A ⊂ B and B ⊂ A.

Unions of sets (‘or’)

The union, denoted ∪, of two sets is:

A ∪ B = {x | x ∈ A or x ∈ B}.

That is, the set of those elements which belong to A or B (or both). An example is
shown in Figure 2.3.

1 Strictly speaking not all subsets are events, as discussed later.


Figure 2.3: Venn diagram depicting the union of two sets.

Example 2.5 If A = {1, 2, 3, 4}, B = {2, 3} and C = {4, 5, 6}, then:

A ∪ B = {1, 2, 3, 4}

A ∪ C = {1, 2, 3, 4, 5, 6}

B ∪ C = {2, 3, 4, 5, 6}.

Intersections of sets (‘and’)

The intersection, denoted ∩, of two sets is:

A ∩ B = {x | x ∈ A and x ∈ B}.

That is, the set of those elements which belong to both A and B. An example is
shown in Figure 2.4.

Figure 2.4: Venn diagram depicting the intersection of two sets.

Example 2.6 If A = {1, 2, 3, 4}, B = {2, 3} and C = {4, 5, 6}, then:

A ∩ B = {2, 3}

A ∩ C = {4}

B ∩ C = ∅.


Unions and intersections of many sets

Both set operators can also be applied to more than two sets, such as A ∩ B ∩ C.
Concise notation for the unions and intersections of sets A1 , A2 , . . . , An is:
⋃_{i=1}^{n} Ai = A1 ∪ A2 ∪ · · · ∪ An

and:

⋂_{i=1}^{n} Ai = A1 ∩ A2 ∩ · · · ∩ An .

These can also be used for an infinite number of sets, i.e. when n is replaced by ∞.

Complement (‘not’)

Suppose S is the set of all possible elements which are under consideration. In
probability, S will be referred to as the sample space.
It follows that A ⊂ S for every set A we may consider. The complement of A with
respect to S is:
Ac = {x | x ∈ S and x ∉ A}.
That is, the set of those elements of S that are not in A. An example is shown in
Figure 2.5.

Figure 2.5: Venn diagram depicting the complement of a set.

We now consider some useful properties of set operators. In proofs and derivations
about sets, you can use the following results without proof.


Properties of set operators

Commutativity:

A ∩ B = B ∩ A and A ∪ B = B ∪ A.

Associativity:

A ∩ (B ∩ C) = (A ∩ B) ∩ C and A ∪ (B ∪ C) = (A ∪ B) ∪ C.

Distributive laws:

A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)

and:
A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C).

De Morgan’s laws:

(A ∩ B)c = Ac ∪ B c and (A ∪ B)c = Ac ∩ B c .

Further properties of set operators

If S is the sample space and A and B are any sets in S, you can also use the following
results without proof:

∅c = S.

∅ ⊂ A, A ⊂ A and A ⊂ S.

A ∩ A = A and A ∪ A = A.

A ∩ Ac = ∅ and A ∪ Ac = S.

If B ⊂ A, A ∩ B = B and A ∪ B = A.

A ∩ ∅ = ∅ and A ∪ ∅ = A.

A ∩ S = A and A ∪ S = S.

∅ ∩ ∅ = ∅ and ∅ ∪ ∅ = ∅.


Mutually exclusive events

Two sets A and B are disjoint or mutually exclusive if:

A ∩ B = ∅.

Sets A1 , A2 , . . . , An are pairwise disjoint if all pairs of sets from them are disjoint,
i.e. Ai ∩ Aj = ∅ for all i ≠ j.

Partition

The sets A1 , A2 , . . . , An form a partition of the set A if they are pairwise disjoint
and if ⋃_{i=1}^{n} Ai = A, that is, A1 , A2 , . . . , An are collectively exhaustive of A.

Therefore, a partition divides the entire set A into non-overlapping pieces Ai , as
shown in Figure 2.6 for n = 3. Similarly, an infinite collection of sets A1 , A2 , . . . form
a partition of A if they are pairwise disjoint and ⋃_{i=1}^{∞} Ai = A.


Figure 2.6: The partition of the set A into A1 , A2 and A3 .

Example 2.7 Suppose that A ⊂ B. Show that A and B ∩ Ac form a partition of B.

We have:
A ∩ (B ∩ Ac ) = (A ∩ Ac ) ∩ B = ∅ ∩ B = ∅
and:
A ∪ (B ∩ Ac ) = (A ∪ B) ∩ (A ∪ Ac ) = B ∩ S = B.
Hence A and B ∩ Ac are mutually exclusive and collectively exhaustive of B, and so
they form a partition of B.


2.5 Axiomatic definition of probability


First, we consider four basic concepts in probability.
An experiment is a process which produces outcomes and which can have several
different outcomes. The sample space S is the set of all possible outcomes of the
experiment. An event is any subset A of the sample space such that A ⊂ S.

Example 2.8 If the experiment is ‘select a trading day at random and record the
% change in the FTSE 100 index from the previous trading day’, then the outcome
is the % change in the FTSE 100 index.
S = [−100, +∞) for the % change in the FTSE 100 index (in principle).
An event of interest might be A = {x | x > 0} – the event that the daily change is
positive, i.e. the FTSE 100 index gains value from the previous trading day.

The sample space and events are represented as sets. For two events A and B, set
operations are then interpreted as follows.

A ∩ B: both A and B happen.


A ∪ B: either A or B happens (or both happen).
Ac : A does not happen, i.e. something other than A happens.

Once we introduce probabilities of events, we can also say that:

the sample space, S, is a certain event


the empty set, ∅, is an impossible event.

Axioms of probability

‘Probability’ is formally defined as a function P (·) from subsets (events) of the sample
space S onto real numbers.2 Such a function is a probability function if it satisfies
the following axioms (‘self-evident truths’).

Axiom 1: P (A) ≥ 0 for all events A.

Axiom 2: P (S) = 1.

Axiom 3: If events A1 , A2 , . . . are pairwise disjoint (i.e. Ai ∩ Aj = ∅ for all
i ≠ j), then:

P( ⋃_{i=1}^{∞} Ai ) = ∑_{i=1}^{∞} P (Ai ).

The axioms require that a probability function must always satisfy these requirements.
2 The precise definition also requires a careful statement of which subsets of S are allowed as events,
which we can skip on this course.


Axiom 1 requires that probabilities are always non-negative.

Axiom 2 requires that the outcome is some element from the sample space with
certainty (that is, with probability 1). In other words, the experiment must have
some outcome.

Axiom 3 states that if events A1 , A2 , . . . are mutually exclusive, the probability of


their union is simply the sum of their individual probabilities.
All other properties of the probability function can be derived from the axioms. We
begin by showing that a result like Axiom 3 also holds for finite collections of mutually
exclusive sets.

2.5.1 Basic properties of probability

Probability property

For the empty set, ∅, we have:


P (∅) = 0. (2.1)

Proof: Since ∅ ∩ ∅ = ∅ and ∅ ∪ ∅ = ∅, Axiom 3 gives:



P (∅) = P (∅ ∪ ∅ ∪ · · · ) = ∑_{i=1}^{∞} P (∅).

However, the only real number for P (∅) which satisfies this is P (∅) = 0.


Probability property (finite additivity)

If A1 , A2 , . . . , An are pairwise disjoint, then:


P( ⋃_{i=1}^{n} Ai ) = ∑_{i=1}^{n} P (Ai ).

Proof: In Axiom 3, set An+1 = An+2 = · · · = ∅, so that:

P( ⋃_{i=1}^{n} Ai ) = ∑_{i=1}^{∞} P (Ai ) = ∑_{i=1}^{n} P (Ai ) + ∑_{i=n+1}^{∞} P (Ai ) = ∑_{i=1}^{n} P (Ai )

since P (Ai ) = P (∅) = 0 for i = n + 1, n + 2, . . ..



In pictures, the previous result means that in a situation like the one shown in Figure
2.7, the probability of the combined event A = A1 ∪ A2 ∪ A3 is simply the sum of the
probabilities of the individual events:

P (A) = P (A1 ) + P (A2 ) + P (A3 ).


A2
A1

A3

Figure 2.7: Venn diagram depicting three mutually exclusive sets, A1 , A2 and A3 . Note
although A2 and A3 have touching boundaries, there is no actual intersection and hence
they are (pairwise) mutually exclusive.

That is, we can simply sum probabilities of mutually exclusive sets. This is very useful
for deriving further results.

Probability property

For any event A, we have:


P (Ac ) = 1 − P (A).

Proof: We have that A ∪ Ac = S and A ∩ Ac = ∅. Therefore:


1 = P (S) = P (A ∪ Ac ) = P (A) + P (Ac )
using the previous result, with n = 2, A1 = A and A2 = Ac .


Probability property

For any event A, we have:


P (A) ≤ 1.

Proof (by contradiction): If it was true that P (A) > 1 for some A, then we would have:
P (Ac ) = 1 − P (A) < 0.
This violates Axiom 1, so cannot be true. Therefore, it must be that P (A) ≤ 1 for all A.
Putting this and Axiom 1 together, we get:
0 ≤ P (A) ≤ 1
for all events A.


Probability property

For any two events A and B, if A ⊂ B, then P (A) ≤ P (B).


Proof: We proved in Example 2.7 that we can partition B as B = A ∪ (B ∩ Ac ) where


the two sets in the union are disjoint. Therefore:

P (B) = P (A ∪ (B ∩ Ac )) = P (A) + P (B ∩ Ac ) ≥ P (A)

since P (B ∩ Ac ) ≥ 0.


Probability property

For any two events A and B, then:

P (A ∪ B) = P (A) + P (B) − P (A ∩ B).

Proof: Using partitions:

P (A ∪ B) = P (A ∩ B c ) + P (A ∩ B) + P (Ac ∩ B)

P (A) = P (A ∩ B c ) + P (A ∩ B)

P (B) = P (Ac ∩ B) + P (A ∩ B)

and hence:

P (A ∪ B) = (P (A) − P (A ∩ B)) + P (A ∩ B) + (P (B) − P (A ∩ B))


= P (A) + P (B) − P (A ∩ B).

In summary, the probability function has the following properties.

P (S) = 1 and P (∅) = 0.

0 ≤ P (A) ≤ 1 for all events A.

If A ⊂ B, then P (A) ≤ P (B).

These show that the probability function has the kinds of values we expect of something
called a ‘probability’.

P (Ac ) = 1 − P (A).

P (A ∪ B) = P (A) + P (B) − P (A ∩ B).

These are useful for deriving probabilities of new events.

Example 2.9 Suppose that, on an average weekday, of all adults in a country:

86% spend at least 1 hour watching television (event A, with P (A) = 0.86)


19% spend at least 1 hour reading newspapers (event B, with P (B) = 0.19)

15% spend at least 1 hour watching television and at least 1 hour reading
newspapers (P (A ∩ B) = 0.15).

We select a member of the population for an interview at random. For example, we


then have:

P (Ac ) = 1 − P (A) = 1 − 0.86 = 0.14, which is the probability that the


respondent watches less than 1 hour of television

P (A ∪ B) = P (A) + P (B) − P (A ∩ B) = 0.86 + 0.19 − 0.15 = 0.90, which is the


probability that the respondent spends at least 1 hour watching television or
reading newspapers (or both).

What does ‘probability’ mean?

Probability theory tells us how to work with the probability function and derive
‘probabilities of events’ from it. However, it does not tell us what ‘probability’ really
means.
There are several alternative interpretations of the real-world meaning of ‘probability’
in this sense. One of them is outlined below. The mathematical theory of probability
and calculations on probabilities are the same whichever interpretation we assign to
‘probability’. So, in this course, we do not need to discuss the matter further.

Frequency interpretation of probability

This states that the probability of an outcome A of an experiment is the proportion


(relative frequency) of trials in which A would be the outcome if the experiment was
repeated a very large number of times under similar conditions.

Example 2.10 How should we interpret the following, as statements about the real
world of coins and babies?

‘The probability that a tossed coin comes up heads is 0.5.’ If we tossed a coin a
large number of times, and the proportion of heads out of those tosses was 0.5,
the ‘probability of heads’ could be said to be 0.5, for that coin.

‘The probability is 0.51 that a child born in the UK today is a boy.’ If the
proportion of boys among a large number of live births was 0.51, the
‘probability of a boy’ could be said to be 0.51.

How to find probabilities?

A key question is how to determine appropriate numerical values of P (A) for the
probabilities of particular events.


This is usually done empirically, by observing actual realisations of the experiment and
using them to estimate probabilities. In the simplest cases, this basically applies the
frequency definition to observed data.

Example 2.11 Consider the following.

If I toss a coin 10,000 times, and 5,023 of the tosses come up heads, it seems
that, approximately, P (heads) = 0.5, for that coin.

Of the 7,098,667 live births in England and Wales in the period 1999–2009,
51.26% were boys. So we could assign the value of about 0.51 to the probability
of a boy in this population.

The estimation of probabilities of events from observed data is an important part of


statistics.

2.6 Classical probability and counting rules


Classical probability is a simple special case where values of probabilities can be
found by just counting outcomes. This requires that:

the sample space contains only a finite number of outcomes


all of the outcomes are equally likely.

Standard illustrations of classical probability are devices used in games of chance, such
as:

tossing a coin (heads or tails) one or more times


rolling one or more dice (each scored 1, 2, 3, 4, 5 or 6)
drawing one or more playing cards from a deck of 52 cards.

We will use these often, not because they are particularly important but because they
provide simple examples for illustrating various results in probability.
Suppose that the sample space, S, contains m equally likely outcomes, and that event A
consists of k ≤ m of these outcomes. Therefore:
k number of outcomes in A
P (A) = = .
m total number of outcomes in the sample space, S
That is, the probability of A is the proportion of outcomes which belong to A out of all
possible outcomes.
In the classical case, the probability of any event can be determined by counting the
number of outcomes which belong to the event, and the total number of possible
outcomes.


Example 2.12 Rolling two dice, what is the probability that the sum of the two
scores is 5?

The sample space is the 36 ordered pairs:

S = {(1, 1), (1, 2), (1, 3), (1, 4) , (1, 5), (1, 6),
(2, 1), (2, 2), (2, 3) , (2, 4), (2, 5), (2, 6),
(3, 1), (3, 2) , (3, 3), (3, 4), (3, 5), (3, 6),
(4, 1) , (4, 2), (4, 3), (4, 4), (4, 5), (4, 6),
(5, 1), (5, 2), (5, 3), (5, 4), (5, 5), (5, 6),
(6, 1), (6, 2), (6, 3), (6, 4), (6, 5), (6, 6)}.

The event of interest is A = {(1, 4), (2, 3), (3, 2), (4, 1)}.

The probability is P (A) = 4/36 = 1/9.
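As a quick sanity check, the counting above can be reproduced by brute-force enumeration. A minimal Python sketch (for illustration only):

    from itertools import product

    outcomes = list(product(range(1, 7), repeat=2))   # the 36 ordered pairs
    event_a = [pair for pair in outcomes if sum(pair) == 5]

    print(len(event_a), len(outcomes))    # 4 36
    print(len(event_a) / len(outcomes))   # 0.111... = 1/9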

Now that we have a way of obtaining probabilities for events in the classical case, we
can use it together with the rules of probability.
The formula P (A) = 1 − P (Ac ) is convenient when we want P (A) but the probability of
the complementary event Ac , i.e. P (Ac ), is easier to find.

Example 2.13 When rolling two fair dice, what is the probability that the sum of
the dice is greater than 3?

The complement is that the sum is at most 3, i.e. the complementary event is
Ac = {(1, 1), (1, 2), (2, 1)}.

Therefore, P (A) = 1 − 3/36 = 33/36 = 11/12.

The formula:
P (A ∪ B) = P (A) + P (B) − P (A ∩ B)
says that the probability that A or B happens (or both happen) is the sum of the
probabilities of A and B, minus the probability that both A and B happen.

Example 2.14 When rolling two fair dice, what is the probability that the two
scores are equal (event A) or that the total score is greater than 10 (event B)?

P (A) = 6/36, P (B) = 3/36 and P (A ∩ B) = 1/36.

So P (A ∪ B) = P (A) + P (B) − P (A ∩ B) = (6 + 3 − 1)/36 = 8/36 = 2/9.

How to count the outcomes

In general, it is useful to know about three ways of counting.


Figure 2.8: Friendship patterns in a four-person network.

Listing and counting all outcomes.

Combinatorial methods: choosing k objects out of n objects.

Combining different methods: rules of sum and product.

2.6.1 Brute force: listing and counting


In small problems, just listing all possibilities is often quickest.

Example 2.15 Consider a group of four people, where each pair of people is either
connected (= friends) or not. How many different patterns of connections are there
(ignoring the identities of who is friends with whom)?
The answer is 11. See the patterns in Figure 2.8.
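The count of 11 can also be verified by brute force in code. The sketch below is purely illustrative (the canonical helper is my own construction, not part of the course material): it enumerates all 2^6 = 64 possible sets of friendships among 4 people and keeps one representative of each pattern once the identities of the people are ignored.

    from itertools import combinations, permutations

    people = range(4)
    pairs = list(combinations(people, 2))          # the 6 possible friendships

    def canonical(edges):
        # Smallest relabelled version of an edge set: patterns which differ
        # only by who is who share the same canonical form.
        best = None
        for perm in permutations(people):
            relabelled = frozenset(frozenset((perm[a], perm[b])) for a, b in edges)
            key = tuple(sorted(tuple(sorted(e)) for e in relabelled))
            if best is None or key < best:
                best = key
        return best

    patterns = set()
    for r in range(len(pairs) + 1):
        for edges in combinations(pairs, r):       # every subset of the 6 pairs
            patterns.add(canonical(edges))

    print(len(patterns))                           # 11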

2.6.2 Combinatorial counting methods


A powerful set of counting methods answers the following question: how many ways are
there to select k objects out of n distinct objects?
The answer will depend on:

whether the selection is with replacement (an object can be selected more than
once) or without replacement (an object can be selected only once)

whether the selected set is treated as ordered or unordered.


Ordered sets, with replacement

Suppose that the selection of k objects out of n needs to be:

ordered, so that the selection is an ordered sequence where we distinguish between


the 1st object, 2nd, 3rd etc.

with replacement, so that each of the n objects may appear several times in the
selection.

Therefore:

n objects are available for selection for the 1st object in the sequence

n objects are available for selection for the 2nd object in the sequence

. . . and so on, until n objects are available for selection for the kth object in the
sequence.

Therefore, the number of possible ordered sequences of k objects selected with


replacement from n objects is:
n × n × · · · × n (k times) = n^k .

Ordered sets, without replacement

Suppose that the selection of k objects out of n is again treated as an ordered sequence,
but that selection is now:

ordered, so that the selection is an ordered sequence where we distinguish between


the 1st object, 2nd, 3rd etc.

without replacement, so that if an object is selected once, it cannot be selected


again.

Now:

n objects are available for selection for the 1st object in the sequence

n − 1 objects are available for selection for the 2nd object

n − 2 objects are available for selection for the 3rd object

. . . and so on, until n − k + 1 objects are available for selection for the kth object.

Therefore, the number of possible ordered sequences of k objects selected without


replacement from n objects is:

n × (n − 1) × · · · × (n − k + 1). (2.2)


An important special case is when k = n.

Factorials

The number of ordered sets of n objects, selected without replacement from n objects,
is:
n! = n × (n − 1) × · · · × 2 × 1.
The number n! (read ‘n factorial’) is the total number of different ways in which
n objects can be arranged in an ordered sequence. This is known as the number of
permutations of n objects.
We also define 0! = 1.

Using factorials, (2.2) can be written as:

n × (n − 1) × · · · × (n − k + 1) = n!/(n − k)!.

Unordered sets, without replacement

Suppose now that the identities of the objects in the selection matter, but the order
does not.

For example, the sequences (1, 2, 3), (1, 3, 2), (2, 1, 3), (2, 3, 1), (3, 1, 2), (3, 2, 1) are
now all treated as the same, because they all contain the elements 1, 2 and 3.

The number of such unordered subsets (combinations) of k out of n objects is


determined as follows.

The number of ordered sequences is n!/(n − k)!.

Among these, every different combination of k distinct elements appears k! times,


in different orders.

Ignoring the ordering, there are:

(n choose k) = n!/(k! (n − k)!)

different combinations, for each k = 0, 1, 2, . . . , n.

The number (n choose k) is known as the binomial coefficient. Note that because
0! = 1, we have (n choose 0) = (n choose n) = 1, so there is only 1 way of selecting
0 or n out of n objects.

Summary of the combinatorial counting rules

The number of ways of selecting k objects out of n distinct objects can be summarised
as follows:


                   With replacement            Without replacement
Ordered            n^k                         n!/(n − k)!
Unordered          (n + k − 1 choose k)        (n choose k) = n!/(k! (n − k)!)

We have not discussed the unordered, with replacement case which is non-examinable.
It is provided here only for completeness with an illustration given in Example 2.16.

Example 2.16 We consider an outline of the proof, using n = 5 and k = 3 for


illustration.
Half-graphically, let x denote selected values and | the ‘walls’ between different
distinct values. For example:

x|xx||| denotes the selection of set (1, 2, 2)

x||x||x denotes the set (1, 3, 5)

||||xxx denotes the set (5, 5, 5).


In general, we have a sequence of n + k − 1 symbols, i.e. n − 1 walls (|) and k
selections (x). The number of different unordered sets of k objects selected with
replacement from n objects is the number of different ways of choosing the locations
of the xs in this, that is:

(n + k − 1 choose k).
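The four counting rules summarised in the table above can be evaluated directly with Python's standard library (math.perm and math.comb exist from Python 3.8 onwards). A small illustrative sketch for n = 5 and k = 3:

    import math

    n, k = 5, 3

    ordered_with      = n ** k                    # 125
    ordered_without   = math.perm(n, k)           # n!/(n-k)! = 60
    unordered_without = math.comb(n, k)           # n!/(k!(n-k)!) = 10
    unordered_with    = math.comb(n + k - 1, k)   # 35 (the non-examinable case)

    print(ordered_with, ordered_without, unordered_without, unordered_with)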

Example 2.17 Suppose we have k = 3 people (Amy, Bob and Sam). How many
different sets of birthdays can they have (day and month, ignoring the year, and
pretending 29 February does not exist, so that n = 365) in the following cases?

1. It makes a difference who has which birthday (ordered), i.e. Amy (1 January),
Bob (5 May) and Sam (5 December) is different from Amy (5 May), Bob (5
December) and Sam (1 January), and different people can have the same
birthday (with replacement). The number of different sets of birthdays is:

(365)^3 = 48,627,125.

2. It makes a difference who has which birthday (ordered), and different people
must have different birthdays (without replacement). The number of different
sets of birthdays is:
365!/(365 − 3)! = 365 × 364 × 363 = 48,228,180.

3. Only the dates matter, but not who has which one (unordered), i.e. Amy (1
January), Bob (5 May) and Sam (5 December) is treated as the same as Amy (5
May), Bob (5 December) and Sam (1 January), and different people must have


different birthdays (without replacement). The number of different sets of


birthdays is:

(365 choose 3) = 365!/(3! (365 − 3)!) = (365 × 364 × 363)/(3 × 2 × 1) = 8,038,030.

Example 2.18 Consider a room with r people in it. What is the probability that
at least two of them have the same birthday (call this event A)? In particular, what
is the smallest r for which P (A) > 1/2?
Assume that all days are equally likely.
Label the people 1 to r, so that we can treat them as an ordered list and talk about
person 1, person 2 etc. We want to know how many ways there are to assign
birthdays to this list of people. We note the following.

1. The number of all possible sequences of birthdays, allowing repeats (i.e. with
replacement) is (365)^r .

2. The number of sequences where all birthdays are different (i.e. without
replacement) is 365!/(365 − r)!.

Here ‘1.’ is the size of the sample space, and ‘2.’ is the number of outcomes which
satisfy Ac , the complement of the case in which we are interested.
Therefore:
P (Ac ) = (365!/(365 − r)!) / (365)^r = (365 × 364 × · · · × (365 − r + 1)) / (365)^r

and:

P (A) = 1 − P (Ac ) = 1 − (365 × 364 × · · · × (365 − r + 1)) / (365)^r .
The probabilities P (A) of at least two people sharing a birthday, for different values
of the number of people, r, are given in the following table:

r P (A) r P (A) r P (A) r P (A)


2 0.003 12 0.167 22 0.476 32 0.753
3 0.008 13 0.194 23 0.507 33 0.775
4 0.016 14 0.223 24 0.538 34 0.795
5 0.027 15 0.253 25 0.569 35 0.814
6 0.040 16 0.284 26 0.598 36 0.832
7 0.056 17 0.315 27 0.627 37 0.849
8 0.074 18 0.347 28 0.654 38 0.864
9 0.095 19 0.379 29 0.681 39 0.878
10 0.117 20 0.411 30 0.706 40 0.891
11 0.141 21 0.444 31 0.730 41 0.903
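The probabilities in the table can be reproduced with a short loop. The sketch below (illustrative only) confirms, for example, that r = 23 is the smallest group size for which P (A) > 1/2.

    p_all_different = 1.0
    for r in range(2, 42):
        p_all_different *= (365 - r + 1) / 365   # the rth person misses the first r-1 birthdays
        p_shared = 1 - p_all_different           # P(A) for a room of r people
        if r in (22, 23, 41):
            print(r, round(p_shared, 3))         # 22 0.476, 23 0.507, 41 0.903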


2.6.3 Combining counts: rules of sum and product


Even more complex cases can be handled by combining counts.

Rule of sum

If an element can be selected in m1 ways from set 1, or m2 ways from set 2, . . . or


mK ways from set K, the total number of possible selections is:

m1 + m2 + · · · + mK .

Rule of product

If, in an ordered sequence of K elements, element 1 can be selected in m1 ways, and


then element 2 in m2 ways, . . . and then element K in mK ways, the total number
of possible sequences is:
m1 × m2 × · · · × mK .

Example 2.19 (The ST102 Moodle site contains a separate note which explains
playing cards and hands, and shows how to calculate the probabilities of all the
hands. This is for reference only – you do not need to memorise the different types of
hands!)
Five playing cards are drawn from a well-shuffled deck of 52 playing cards. What is
the probability that the cards form a hand which is higher than ‘a flush’ ? The cards
in a hand are treated as an unordered set.
First, we determine the size of the sample space which is all unordered subsets of 5
cards selected from 52. So the size of the sample space is:

(52 choose 5) = 52!/(5! × 47!) = (52 × 51 × 50 × 49 × 48)/(5 × 4 × 3 × 2 × 1) = 2,598,960.

The hand is higher than a flush if it is a:

‘straight flush’ or ‘four-of-a-kind’ or ‘full house’.


The rule of sum says that the number of hands better than a flush is:

number of straight flushes + number of four-of-a-kinds + number of full houses


= 40 + 624 + 3,744
= 4,408.

Therefore, the probability we want is:


4,408/2,598,960 ≈ 0.0017.
How did we get the counts above?


For full houses, shown next.

For the others, see the ST102 Moodle site.

A ‘full house’ is three cards of the same rank and two of another rank, for example:

♦2 ♠2 ♣2 ♦4 ♠4.

We can break the number of ways of choosing these into two steps.

The total number of ways of selecting the three: the rank of these can be any of
the 13 ranks. There are four cards of this rank, so the three of that rank can be
chosen in (4 choose 3) = 4 ways. So the total number of different triplets is 13 × 4 = 52.
The total number of ways of selecting the two: the rank of these can be any of
the remaining 12 ranks, and the two cards of that rank can be chosen in (4 choose 2) = 6
ways. So the total number of different pairs (with a different rank than the
triplet) is 12 × 6 = 72.

The rule of product then says that the total number of full houses is:

52 × 72 = 3,744.

The following is a summary of the numbers of all types of 5-card hands, and their
probabilities:

Hand Number Probability


Straight flush 40 0.000015
Four-of-a-kind 624 0.00024
Full house 3,744 0.00144
Flush 5,108 0.0020
Straight 10,200 0.0039
Three-of-a-kind 54,912 0.0211
Two pairs 123,552 0.0475
One pair 1,098,240 0.4226
High card 1,302,540 0.5012
Total 2,598,960 1.0
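The counts above are easy to check numerically; a minimal Python sketch (for illustration) reproduces the number of full houses and the probability of a hand higher than a flush.

    from math import comb

    total_hands = comb(52, 5)                        # 2,598,960

    full_houses = 13 * comb(4, 3) * 12 * comb(4, 2)  # rank of the triple, then the pair
    print(full_houses)                               # 3744

    better_than_flush = 40 + 624 + full_houses       # straight flushes + four-of-a-kinds + full houses
    print(better_than_flush / total_hands)           # about 0.0017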

2.7 Conditional probability and Bayes’ theorem


Next we introduce some of the most important concepts in probability:

independence

conditional probability

Bayes’ theorem.


These give us powerful tools for:

deriving probabilities of combinations of events

updating probabilities of events, after we learn that some other event has happened.

Independence

Two events A and B are (statistically) independent if:

P (A ∩ B) = P (A) P (B).

Independence is sometimes denoted A ⊥⊥ B. Intuitively, independence means that:

if A happens, this does not affect the probability of B happening (and vice versa)

if you are told that A has happened, this does not give you any new information
about the value of P (B) (and vice versa).

For example, independence is often a reasonable assumption when A and B


correspond to physically separate experiments.

Example 2.20 Suppose we roll two dice. We assume that all combinations of the
values of them are equally likely. Define the events:

A = ‘Score of die 1 is not 6’

B = ‘Score of die 2 is not 6’.

Therefore:

P (A) = 30/36 = 5/6

P (B) = 30/36 = 5/6

P (A ∩ B) = 25/36 = 5/6 × 5/6 = P (A) P (B), so A and B are independent.

2.7.1 Independence of multiple events


Events A1 , A2 , . . . , An are independent if the probability of the intersection of any subset
of these events is the product of the individual probabilities of the events in the subset.
This implies the important result that if events A1 , A2 , . . . , An are independent, then:

P (A1 ∩ A2 ∩ · · · ∩ An ) = P (A1 ) P (A2 ) · · · P (An ).

Note that there is a difference between pairwise independence and full independence.
The following example illustrates.


Example 2.21 It can be cold in London. Four impoverished teachers dress to feel
warm. Teacher A has a hat and a scarf and gloves, Teacher B only has a hat, Teacher
C only has a scarf and Teacher D only has gloves. One teacher out of the four is
selected at random. We show below that although each pair of the events H = ‘the teacher
selected has a hat’, S = ‘the teacher selected has a scarf’ and G = ‘the teacher
selected has gloves’ is independent, the three events together are not independent.
Two teachers have a hat, two teachers have a scarf, and two teachers have gloves, so:
P (H) = 2/4 = 1/2, P (S) = 2/4 = 1/2 and P (G) = 2/4 = 1/2.

Only one teacher has both a hat and a scarf, so:

P (H ∩ S) = 1/4

and similarly:

P (H ∩ G) = 1/4 and P (S ∩ G) = 1/4.
From these results, we can verify that:

P (H ∩ S) = P (H) P (S)
P (H ∩ G) = P (H) P (G)
P (S ∩ G) = P (S) P (G)

and so the events are pairwise independent. However, one teacher has a hat, a scarf
and gloves, so:
P (H ∩ S ∩ G) = 1/4 ≠ P (H) P (S) P (G).
Hence the three events are not independent. If the selected teacher has a hat and a
scarf, then we know that the teacher has gloves. There is no independence for all
three events together.
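The distinction between pairwise and full independence can also be checked by direct enumeration over the four equally likely teachers; a sketch for illustration:

    # Each teacher is represented by the set of items he or she owns.
    teachers = [{"hat", "scarf", "gloves"}, {"hat"}, {"scarf"}, {"gloves"}]

    def prob(*items):
        # Probability that a randomly selected teacher owns all the given items.
        return sum(all(item in t for item in items) for t in teachers) / len(teachers)

    print(prob("hat"), prob("scarf"), prob("gloves"))            # 0.5 0.5 0.5
    print(prob("hat", "scarf") == prob("hat") * prob("scarf"))   # True (pairwise independent)
    print(prob("hat", "scarf", "gloves") ==
          prob("hat") * prob("scarf") * prob("gloves"))          # False (not fully independent)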

2.7.2 Independent versus mutually exclusive events


The idea of independent events is quite different from that of mutually exclusive
(disjoint) events, as shown in Figure 2.9.
For mutually exclusive events A ∩ B = ∅, and so, from (2.1), P (A ∩ B) = 0. For
independent events, P (A ∩ B) = P (A) P (B). So since P (A ∩ B) = 0 ≠ P (A) P (B) in
general (except in the uninteresting case when P (A) = 0 or P (B) = 0), then mutually
exclusive events and independent events are different.
In fact, mutually exclusive events are extremely non-independent (i.e. dependent). For
example, if you know that A has happened, you know for certain that B has not
happened. There is no particularly helpful way to represent independent events using a
Venn diagram.


Figure 2.9: Venn diagram depicting mutually exclusive events.

Conditional probability

Consider two events A and B. Suppose you are told that B has occurred. How does
this affect the probability of event A?

The answer is given by the conditional probability of A given that B has occurred,
or the conditional probability of A given B for short, defined as:

P (A | B) = P (A ∩ B) / P (B)

assuming that P (B) > 0. The conditional probability is not defined if P (B) = 0.

Example 2.22 Suppose we roll two independent fair dice again. Consider the
following events.

A = ‘at least one of the scores is 2’.

B = ‘the sum of the scores is greater than 7’.

These are shown in Figure 2.10. Now P (A) = 11/36 ≈ 0.31, P (B) = 15/36 and
P (A ∩ B) = 2/36. Therefore, the conditional probability of A given B is:

P (A | B) = P (A ∩ B) / P (B) = (2/36) / (15/36) = 2/15 ≈ 0.13.

Learning that B has occurred causes us to revise (update) the probability of A


downward, from 0.31 to 0.13.

One way to think about conditional probability is that when we condition on B, we


redefine the sample space to be B.


Figure 2.10: Events A, B and A ∩ B for Example 2.22.

Example 2.23 In Example 2.22, when we are told that the conditioning event B
has occurred, we know we are within the green line in Figure 2.10. So the 15
outcomes within it become the new sample space. There are 2 outcomes which
satisfy A and which are inside this new sample space, so:
P (A | B) = 2/15 = (number of cases of A within B) / (number of cases of B).
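The same conditional probability can be confirmed by enumerating the 36 equally likely outcomes (an illustrative sketch):

    from itertools import product

    outcomes = list(product(range(1, 7), repeat=2))
    b = [o for o in outcomes if sum(o) > 7]    # conditioning event B
    a_and_b = [o for o in b if 2 in o]         # A within the new sample space B

    print(len(a_and_b), len(b))                # 2 15
    print(len(a_and_b) / len(b))               # 0.1333... = 2/15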

2.7.3 Conditional probability of independent events


If A ⊥⊥ B, i.e. P (A ∩ B) = P (A) P (B), and P (B) > 0 and P (A) > 0, then:
P (A | B) = P (A ∩ B) / P (B) = P (A) P (B) / P (B) = P (A)

and:

P (B | A) = P (A ∩ B) / P (A) = P (A) P (B) / P (A) = P (B).
In other words, if A and B are independent, learning that B has occurred does not
change the probability of A, and learning that A has occurred does not change the
probability of B. This is exactly what we would expect under independence.

2.7.4 Chain rule of conditional probabilities


Since P (A | B) = P (A ∩ B)/P (B), then:
P (A ∩ B) = P (A | B) P (B).


That is, the probability that both A and B occur is the probability that A occurs given
that B has occurred multiplied by the probability that B occurs. An intuitive graphical
version of this is:


The path to A is to get first to B, and then from B to A.


It is also true that:
P (A ∩ B) = P (B | A) P (A)
and you can use whichever is more convenient. Very often some version of this chain
rule is much easier than calculating P (A ∩ B) directly.
The chain rule generalises to multiple events:

P (A1 ∩ A2 ∩ · · · ∩ An ) = P (A1 ) P (A2 | A1 ) P (A3 | A1 , A2 ) · · · P (An | A1 , A2 , . . . , An−1 )

where, for example, P (A3 | A1 , A2 ) is shorthand for P (A3 | A1 ∩ A2 ). The events can be
taken in any order, as shown in Example 2.24.

Example 2.24 For n = 3, we have:

P (A1 ∩ A2 ∩ A3 ) = P (A1 ) P (A2 | A1 ) P (A3 | A1 , A2 )


= P (A1 ) P (A3 | A1 ) P (A2 | A1 , A3 )
= P (A2 ) P (A1 | A2 ) P (A3 | A1 , A2 )
= P (A2 ) P (A3 | A2 ) P (A1 | A2 , A3 )
= P (A3 ) P (A1 | A3 ) P (A2 | A1 , A3 )
= P (A3 ) P (A2 | A3 ) P (A1 | A2 , A3 ).

Example 2.25 Suppose you draw 4 cards from a deck of 52 playing cards. What is
the probability of A = ‘the cards are the 4 aces (cards of rank 1)’ ?
We could calculate this using counting rules. There are (52 choose 4) = 270,725 possible
subsets of 4 different cards, and only 1 of these consists of the 4 aces. Therefore,
P (A) = 1/270,725.
Let us try with conditional probabilities. Define Ai as ‘the ith card is an ace’, so
that A = A1 ∩ A2 ∩ A3 ∩ A4 . The necessary probabilities are:

P (A1 ) = 4/52 since there are initially 4 aces in the deck of 52 playing cards

P (A2 | A1 ) = 3/51. If the first card is an ace, 3 aces remain in the deck of 51
playing cards from which the second card will be drawn

P (A3 | A1 , A2 ) = 2/50


P (A4 | A1 , A2 , A3 ) = 1/49.

Putting these together with the chain rule gives:

P (A) = P (A1 ) P (A2 | A1 ) P (A3 | A1 , A2 ) P (A4 | A1 , A2 , A3 )


= 4/52 × 3/51 × 2/50 × 1/49 = 24/6,497,400 = 1/270,725.
Here we could obtain the result in two ways. However, there are very many situations
where classical probability and counting rules are not usable, whereas conditional
probabilities and the chain rule are completely general and always applicable.
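Both routes to the answer are easy to verify numerically; a minimal sketch (for illustration) using exact fractions:

    from fractions import Fraction
    from math import comb

    # Counting route: one favourable subset out of C(52, 4).
    print(Fraction(1, comb(52, 4)))                               # 1/270725

    # Chain rule route: multiply the conditional probabilities.
    p = Fraction(4, 52) * Fraction(3, 51) * Fraction(2, 50) * Fraction(1, 49)
    print(p)                                                      # 1/270725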

More methods for summing probabilities

We now return to probabilities of partitions like the situation shown in Figure 2.11.

Figure 2.11: On the left, a Venn diagram depicting A = A1 ∪ A2 ∪ A3 , and on the right
the ‘paths’ to A.

Both diagrams in Figure 2.11 represent the partition A = A1 ∪ A2 ∪ A3 . For the next
results, it will be convenient to use diagrams like the one on the right in Figure 2.11,
where A1 , A2 and A3 are symbolised as different ‘paths’ to A.
We now develop powerful methods of calculating sums like:
P (A) = P (A1 ) + P (A2 ) + P (A3 ).

2.7.5 Total probability formula


Suppose B1 , B2 , . . . , BK form a partition of the sample space. Therefore, A ∩ B1 ,
A ∩ B2 , . . ., A ∩ BK form a partition of A, as shown in Figure 2.12.
In other words, think of event A as the union of all the A ∩ Bi s, i.e. of ‘all the paths to
A via different intervening events Bi ’.
To get the probability of A, we now:

1. apply the chain rule to each of the paths:


P (A ∩ Bi ) = P (A | Bi ) P (Bi )

2. add up the probabilities of the paths:


P (A) = ∑_{i=1}^{K} P (A ∩ Bi ) = ∑_{i=1}^{K} P (A | Bi ) P (Bi ).

Figure 2.12: On the left, a Venn diagram depicting the set A and the partition of S, and
on the right the ‘paths’ to A.

This is known as the formula of total probability. It looks complicated, but it is


actually often far easier to use than trying to find P (A) directly.

Example 2.26 Any event B has the property that B and its complement B c
partition the sample space. So if we take K = 2, B1 = B and B2 = B c in the formula
of total probability, we get:

P (A) = P (A | B) P (B) + P (A | B c ) P (B c )
= P (A | B) P (B) + P (A | B c )(1 − P (B)).


Example 2.27 Suppose that 1 in 10,000 people (0.01%) has a particular disease. A
diagnostic test for the disease has 99% sensitivity. If a person has the disease, the
test will give a positive result with a probability of 0.99. The test has 99% specificity.
If a person does not have the disease, the test will give a negative result with a
probability of 0.99.
Let B denote the presence of the disease, and B c denote no disease. Let A denote a
positive test result. We want to calculate P (A).
The probabilities we need are P (B) = 0.0001, P (B c ) = 0.9999, P (A | B) = 0.99 and


P (A | B c ) = 0.01. Therefore:

P (A) = P (A | B) P (B) + P (A | B c ) P (B c )
= 0.99 × 0.0001 + 0.01 × 0.9999
= 0.010098.

2.7.6 Bayes’ theorem


So far we have considered how to calculate P (A) for an event A which can happen in
different ways, ‘via’ different events B1 , B2 , . . . , BK .
Now we reverse the question. Suppose we know that A has occurred, as shown in Figure
2.13.

Figure 2.13: Paths to A indicating that A has occurred.

What is the probability that we got there via, say, B1 ? In other words, what is the
conditional probability P (B1 | A)? This situation is depicted in Figure 2.14.

Figure 2.14: A being achieved via B1 .

So we need:
P (Bj | A) = P (A ∩ Bj ) / P (A)
and we already know how to get this.

P (A ∩ Bj ) = P (A | Bj ) P (Bj ) from the chain rule.


P (A) = ∑_{i=1}^{K} P (A | Bi ) P (Bi ) from the total probability formula.


Bayes’ theorem

Using the chain rule and the total probability formula, we have:

P (Bj | A) = P (A | Bj ) P (Bj ) / ∑_{i=1}^{K} P (A | Bi ) P (Bi )

which holds for each Bj , j = 1, 2, . . . , K. This is known as Bayes’ theorem.

Example 2.28 Continuing with Example 2.27, let B denote the presence of the
disease, B c denote no disease, and A denote a positive test result.
We want to calculate P (B | A), i.e. the probability that a person has the disease,
given that the person has received a positive test result.
The probabilities we need are:

P (B) = 0.0001 P (B c ) = 0.9999


P (A | B) = 0.99 and P (A | B c ) = 0.01.

Therefore:
P (B | A) = P (A | B) P (B) / (P (A | B) P (B) + P (A | B c ) P (B c )) = (0.99 × 0.0001)/0.010098 ≈ 0.0098.

Why is this so small? The reason is because most people do not have the disease and
the test has a small, but non-zero, false positive rate P (A | B c ). Therefore, most
positive test results are actually false positives.
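Examples 2.27 and 2.28 can be reproduced in a few lines (a sketch for illustration): the total probability formula gives P (A), and Bayes' theorem then gives P (B | A).

    p_b = 0.0001                     # prevalence of the disease, P(B)
    p_a_given_b = 0.99               # sensitivity, P(A | B)
    p_a_given_not_b = 0.01           # false positive rate, P(A | B^c)

    # Total probability formula.
    p_a = p_a_given_b * p_b + p_a_given_not_b * (1 - p_b)
    print(p_a)                       # 0.010098

    # Bayes' theorem.
    p_b_given_a = p_a_given_b * p_b / p_a
    print(round(p_b_given_a, 4))     # 0.0098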

Example 2.29 You are taking part in a gameshow. The host of the show, who is
known as Monty, shows you three outwardly identical boxes. In one of them is a
prize, and the other two are empty.
You are asked to select, but not open, one of the boxes. After you have done so,
Monty, who knows where the prize is, opens one of the two remaining boxes.
He always opens a box he knows to be empty, and randomly chooses which box to
open when he has more than one option (which happens when your initial choice
contains the prize).
After opening the empty box, Monty gives you the choice of either switching to the
other unopened box or sticking with your original choice. You then receive whatever
is in the box you choose.
What should you do, assuming you want to win the prize?
Suppose the three boxes are numbered 1, 2 and 3. Let us define the following events.

B1 , B2 , B3 : the prize is in Box 1, 2 and 3, respectively.


M1 , M2 , M3 : Monty opens Box 1, 2 and 3, respectively.


Suppose you choose Box 1 first, and then Monty opens Box 3 (the answer works the
same way for all combinations of these). So Boxes 1 and 2 remain unopened.
What we want to know now are the conditional probabilities P (B1 | M3 ) and
P (B2 | M3 ).
You should switch boxes if P (B2 | M3 ) > P (B1 | M3 ), and stick with your original
choice otherwise. (You would be indifferent about switching if it was the case that
P (B2 | M3 ) = P (B1 | M3 ).)
Suppose that you first choose Box 1, and then Monty opens Box 3. Bayes’ theorem
tells us that:
P (B2 | M3 ) = P (M3 | B2 ) P (B2 ) / [P (M3 | B1 ) P (B1 ) + P (M3 | B2 ) P (B2 ) + P (M3 | B3 ) P (B3 )].
We can assign values to each of these.

The prize is initially equally likely to be in any of the boxes. Therefore,


P (B1 ) = P (B2 ) = P (B3 ) = 1/3.
If the prize is in Box 1 (which you choose), Monty chooses at random between
the two remaining boxes, i.e. Boxes 2 and 3. Hence P (M3 | B1 ) = 1/2.
If the prize is in one of the two boxes you did not choose, Monty cannot open
that box, and must open the other one. Hence P (M3 | B2 ) = 1 and so
P (M3 | B3 ) = 0.

Putting these probabilities into the formula gives:


P (B2 | M3 ) = (1 × 1/3) / (1/2 × 1/3 + 1 × 1/3 + 0 × 1/3) = 2/3
and hence P (B1 | M3 ) = 1 − P (B2 | M3 ) = 1/3 (because also P (M3 | B3 ) = 0 and so
P (B3 | M3 ) = 0).
The same calculation applies to every combination of your first choice and Monty’s
choice. Therefore, you will always double your probability of winning the prize if you
switch from your original choice to the box that Monty did not open.
The Monty Hall problem has been called a ‘cognitive illusion’, because something
about it seems to mislead most people’s intuition. In experiments, around 85% of
people tend to get the answer wrong at first.
The most common incorrect response is that the probabilities of the remaining boxes
after Monty’s choice are both 1/2, so that you should not (or rather need not) switch.
This is typically based on ‘no new information’ reasoning. Since we know in advance
that Monty will open one empty box, the fact that he does so appears to tell us
nothing new and should not cause us to favour either of the two remaining boxes –
hence a probability of 1/2 for each.
It is true that Monty’s choice tells you nothing new about the probability of your
original choice, which remains at 1/3. However, it tells us a lot about the other two
boxes. First, it tells us everything about the box he chose, namely that it does not
contain the prize. Second, all of the probability of that box gets ‘inherited’ by the
box neither you nor Monty chose, which now has the probability 2/3.
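A Monte Carlo simulation is a useful sanity check on the Bayes' theorem argument. The sketch below (illustrative only) estimates the winning probabilities of the 'stick' and 'switch' strategies; they should come out close to 1/3 and 2/3.

    import random

    def play(switch, trials=100_000):
        wins = 0
        for _ in range(trials):
            prize = random.randrange(3)
            choice = random.randrange(3)
            # Monty opens an empty box other than the contestant's choice.
            opened = random.choice([b for b in range(3) if b != choice and b != prize])
            if switch:
                choice = next(b for b in range(3) if b != choice and b != opened)
            wins += (choice == prize)
        return wins / trials

    print(play(switch=False))   # approximately 1/3
    print(play(switch=True))    # approximately 2/3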


Example 2.30 You are waiting for your bag at the baggage reclaim carousel of an
airport. Suppose that you know that there are 200 bags to come from your flight,
and you are counting the distinct bags which come out. Suppose that x bags have
arrived, and your bag is not among them. What is the probability that your bag will
not arrive at all, i.e. that it has been lost (or at least delayed)?
Define A = ‘your bag has been lost’ and x = ‘your bag is not among the first x bags
to arrive’. What we want to know is the conditional probability P (A | x) for any
x = 0, 1, 2, . . . , 200. The conditional probabilities the other way round are as follows.
P (x | A) = 1 for all x. If your bag has been lost, it will not arrive!
P (x | Ac ) = (200 − x)/200 if we assume that bags come out in a completely
random order.

Using Bayes’ theorem, we get:


P (A | x) = P (x | A) P (A) / (P (x | A) P (A) + P (x | Ac ) P (Ac ))
          = P (A) / (P (A) + ((200 − x)/200)(1 − P (A))).
Obviously, P (A | 200) = 1. If the bag has not arrived when all 200 have come out, it
has been lost!
For other values of x we need P (A). This is the general probability that a bag gets
lost, before you start observing the arrival of the bags from your particular flight.
This kind of probability is known as the prior probability of an event A.
Let us assign values to P (A) based on some empirical data. Statistics by the
Association of European Airlines (AEA) show how many bags were ‘mishandled’ per
1,000 passengers the airlines carried. This is not exactly what we need (since not all
passengers carry bags, and some have several), but we will use it anyway. In
particular, we will compare the results for the best and the worst of the AEA in 2006:

Air Malta: P (A) = 0.0044


British Airways: P (A) = 0.023.

Figure 2.15 shows a plot of P (A | x) as a function of x for these two airlines.


The probabilities are fairly small, even for large values of x.

For Air Malta, P (A | 199) = 0.469. So even when only 1 bag remains to arrive,
the probability is less than 0.5 that your bag has been lost.
For British Airways, P (A | 199) = 0.825. Also, we see that P (A | 197) = 0.541 is
the first probability over 0.5.

This is because the baseline probability of lost bags, P (A), is low.


So, the moral of the story is that even when nearly everyone else has collected their
bags and left, do not despair!
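The curve in Figure 2.15 can be computed directly from the Bayes' theorem expression above; a minimal sketch for illustration:

    def prob_lost(x, prior, n_bags=200):
        # P(A | x): probability the bag is lost, given x bags have arrived without it.
        return prior / (prior + ((n_bags - x) / n_bags) * (1 - prior))

    for prior, airline in [(0.0044, "Air Malta"), (0.023, "British Airways")]:
        print(airline, round(prob_lost(199, prior), 3))
        # Air Malta 0.469, British Airways 0.825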


Figure 2.15: Plot of P (A | x) as a function of x for the two airlines in Example 2.30, Air
Malta and British Airways (BA).

2.8 Overview of chapter


This chapter introduced some formal terminology related to probability theory. The
axioms of probability were introduced, from which various other probability results were
derived. There followed a brief discussion of counting rules (using permutations and
combinations). The important concepts of independence and conditional probability
were discussed, and Bayes’ theorem was derived.

2.9 Key terms and concepts


Axiom Bayes’ theorem
Binomial coefficient Chain rule
Classical probability Collectively exhaustive
Combination Complement
Conditional probability Counting
Disjoint Element
Empty set Experiment
Event Factorial
Independence Intersection
Mutually exclusive Outcome
Pairwise disjoint Partition
Permutation Probability (theory)
Relative frequency Sample space
Set Subset
Total probability Union
Venn diagram With(out) replacement


There are lies, damned lies and statistics.


(Mark Twain)

Chapter 3
Random variables

3.1 Synopsis of chapter


This chapter introduces the concept of random variables and probability distributions.
These distributions are univariate, which means that they are used to model a single
numerical quantity. The concepts of expected value and variance are also discussed.

3.2 Learning outcomes


After completing this chapter, you should be able to:

define a random variable and distinguish it from the values which it takes
explain the difference between discrete and continuous random variables
find the mean and the variance of simple random variables whether discrete or
continuous
demonstrate how to proceed and use simple properties of expected values and
variances.

3.3 Introduction
In Chapter 1, we considered descriptive statistics for a sample of observations of a
variable X. Here we will represent the observations as a sequence of variables, denoted
as:
X1 , X2 , . . . , Xn
where n is the sample size.
In statistical inference, the observations will be treated as a sample drawn at random
from a population. We will then think of each observation Xi of a variable X as an
outcome of an experiment.

The experiment is ‘select a unit at random from the population and record its
value of X’.
The outcome is the observed value Xi of X.

Because variables X in statistical data are recorded as numbers, we can now focus on
experiments where the outcomes are also numbers – random variables.


Random variable

A random variable is an experiment for which the outcomes are numbers.1 This
means that for a random variable:

the sample space, S, is the set of real numbers R, or a subset of R

the outcomes are numbers in this sample space (instead of ‘outcomes’, we often
call them the values of the random variable)

events are sets of numbers (values) in this sample space.

Discrete and continuous random variables

There are two main types of random variables, depending on the nature of S, i.e. the
possible values of the random variable.

A random variable is continuous if S is all of R or some interval(s) of it, for


example [0, 1] or [0, ∞).

A random variable is discrete if it is not continuous.2 More precisely, a discrete


random variable takes a finite or countably infinite number of values.

Notation

A random variable is typically denoted by an upper-case letter, for example X (or Y ,


W etc.). A specific value of a random variable is often denoted by a lower-case letter,
for example x.
Probabilities of values of a random variable are written as follows.

P (X = x) denotes the probability that (the value of) X is x.

P (X > 0) denotes the probability that X is positive.

P (a < X < b) denotes the probability that X is between the numbers a and b.

Random variables versus samples

You will notice that many of the quantities we define for random variables are
analogous to sample quantities defined in Chapter 1.

1 This definition is a bit informal, but it is sufficient for this course.
2 Strictly speaking, a discrete random variable is not just a random variable which is not continuous,
as there are many others, such as mixture distributions.


Random variable Sample


Probability distribution Sample distribution
Mean (expected value) Sample mean (average)
Variance Sample variance
Standard deviation Sample standard deviation
Median Sample median

This is no accident. In statistics, the population is represented as following a probability


distribution, and quantities for an observed sample are then used as estimators of the
analogous quantities for the population.

3.4 Discrete random variables

Example 3.1 The following two examples will be used throughout this chapter.

1. The number of people living in a randomly selected household in England.


• For simplicity, we use the value 8 to represent ‘8 or more’ (because 9 and
above are not reported separately in official statistics).
• This is a discrete random variable, with possible values of 1, 2, 3, 4, 5, 6, 7
and 8.

2. A person throws a basketball repeatedly from the free-throw line, trying to


make a basket. Consider the following random variable.
The number of unsuccessful throws before the first successful throw.
• The possible values of this are 0, 1, 2, . . ..

3.4.1 Probability distribution of a discrete random variable


The probability distribution (or just distribution) of a discrete random variable X
is specified by:

its possible values, x (i.e. its sample space, S)

the probabilities of the possible values, i.e. P (X = x) for all x ∈ S.

So we first need to develop a convenient way of specifying the probabilities.


Example 3.2 Consider the following probability distribution for the household
size, X.3

Number of people
in the household, x P (X = x)
1 0.3002
2 0.3417
3 0.1551
4 0.1336
5 0.0494
6 0.0145
7 0.0034
8 0.0021

Probability function

The probability function (pf) of a discrete random variable X, denoted by p(x),


is a real-valued function which, for any number x, satisfies:

p(x) = P (X = x).

We can talk of p(x) both as the pf of the random variable X, and as the pf of the
probability distribution of X. Both mean the same thing.

Alternative terminology: the pf of a discrete random variable is also often called the
probability mass function (pmf).

Alternative notation: instead of p(x), the pf is also often denoted by, for example, pX (x)
– especially when it is necessary to indicate clearly to which random variable the
function corresponds.

Necessary conditions for a probability function

To be a pf of a discrete random variable X with sample space S, a function p(x)


must satisfy the following conditions.

1. p(x) ≥ 0 for all real numbers x.


2. ∑_{xi ∈ S} p(xi ) = 1, i.e. the sum of probabilities of all possible values of X is 1.

The pf is defined for all real numbers x, but p(x) = 0 for any x ∉ S, i.e. for any value
x which is not one of the possible values of X.

3 Source: ONS, National report for the 2001 Census, England and Wales. Table UV51.


Example 3.3 Continuing Example 3.2, here we can simply list all the values:

          0.3002   for x = 1
          0.3417   for x = 2
          0.1551   for x = 3
          0.1336   for x = 4
  p(x) =  0.0494   for x = 5
          0.0145   for x = 6
          0.0034   for x = 7
          0.0021   for x = 8
          0        otherwise.

These are clearly all non-negative, and their sum is ∑_{x=1}^{8} p(x) = 1.
A graphical representation of the pf is shown in Figure 3.1.
Figure 3.1: Probability function for Example 3.3 (a bar chart of p(x) against x, the number of people in the household).
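These calculations are easy to reproduce on a computer. The following Python sketch (purely illustrative; the probabilities are simply those listed above) checks the two conditions for a pf and evaluates a couple of probabilities.

```python
# Household size pf from Examples 3.2/3.3: p(x) for x = 1, ..., 8.
p = {1: 0.3002, 2: 0.3417, 3: 0.1551, 4: 0.1336,
     5: 0.0494, 6: 0.0145, 7: 0.0034, 8: 0.0021}

# Condition 1: all probabilities are non-negative.
assert all(prob >= 0 for prob in p.values())

# Condition 2: the probabilities sum to 1.
print(round(sum(p.values()), 4))                 # 1.0

# P(X = 3) and P(X > 5) = p(6) + p(7) + p(8).
print(p[3])                                      # 0.1551
print(round(sum(p[x] for x in (6, 7, 8)), 4))    # 0.02
```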

For the next example, we need to remember the following results from mathematics, concerning sums of geometric series. If r ≠ 1, then:

∑_{x=0}^{n−1} a r^x = a(1 − r^n)/(1 − r)

and if |r| < 1, then:

∑_{x=0}^{∞} a r^x = a/(1 − r).


Example 3.4 In the basketball example, the number of possible values is infinite, so we cannot simply list the values of the pf. So we try to express it as a formula. Suppose that:

• the probability of a successful throw is π at each throw and, therefore, the probability of an unsuccessful throw is 1 − π

• outcomes of different throws are independent.

Hence the probability that the first success occurs after x failures is the probability of a sequence of x failures followed by a success, i.e. the probability is:

(1 − π)^x π.

So the pf of the random variable X (the number of failures before the first success) is:

p(x) = (1 − π)^x π   for x = 0, 1, 2, . . .
p(x) = 0             otherwise                                            (3.1)

where 0 ≤ π ≤ 1. Let us check that (3.1) satisfies the conditions for a pf.

• Clearly, p(x) ≥ 0 for all x, since π ≥ 0 and 1 − π ≥ 0.

• Using the sum to infinity of a geometric series, we get:

∑_{x=0}^{∞} p(x) = ∑_{x=0}^{∞} (1 − π)^x π = π ∑_{x=0}^{∞} (1 − π)^x = π × 1/(1 − (1 − π)) = π/π = 1.

The expression of the pf involves a parameter π (the probability of a successful throw), a number for which we can choose different values. This defines a whole ‘family’ of individual distributions, one for each value of π. For example, Figure 3.2 shows values of p(x) for two values of π reflecting fairly good and pretty poor free-throw shooters, respectively.
Figure 3.2: Probability function for Example 3.4, plotted against x (the number of failures) for π = 0.7 (a fairly good free-throw shooter) and π = 0.3 (a pretty poor free-throw shooter).
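As an illustration (not part of the original example), the short Python sketch below evaluates the pf in (3.1) for the two values of π used in Figure 3.2 and confirms numerically that the probabilities sum to 1.

```python
# pf of the number of failures before the first success: p(x) = (1 - pi)**x * pi.
def p(x, pi):
    return (1 - pi) ** x * pi

for pi in (0.7, 0.3):
    probs = [p(x, pi) for x in range(200)]    # the first 200 terms are plenty here
    print(pi, round(sum(probs), 6))           # each sum is 1.0 to 6 decimal places
    print([round(v, 3) for v in probs[:5]])   # p(0), ..., p(4)
```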


3.4.2 The cumulative distribution function (cdf)

Another way to specify a probability distribution is to give its cumulative distribution function (cdf) (or just simply distribution function).

Cumulative distribution function (cdf)

The cdf is denoted F(x) (or F_X(x)) and defined as:

F(x) = P(X ≤ x)   for all real numbers x.

For a discrete random variable it is given by:

F(x) = ∑_{xᵢ ∈ S, xᵢ ≤ x} p(xᵢ)

i.e. the sum of the probabilities of the possible values of X which are less than or equal to x.

Example 3.5 Continuing with the household size example, values of F (x) at all
possible values of X are:

Number of people
in the household, x p(x) F (x)
1 0.3002 0.3002
2 0.3417 0.6419
3 0.1551 0.7970
4 0.1336 0.9306
5 0.0494 0.9800
6 0.0145 0.9945
7 0.0034 0.9979
8 0.0021 1.0000

These are shown in graphical form in Figure 3.3.

Example 3.6 In the basketball example, p(x) = (1 − π)^x π for x = 0, 1, 2, . . .. We can calculate a simple formula for the cdf, using the sum of a geometric series. Since, for any non-negative integer y, we obtain:

∑_{x=0}^{y} p(x) = ∑_{x=0}^{y} (1 − π)^x π = π ∑_{x=0}^{y} (1 − π)^x = π × (1 − (1 − π)^{y+1})/(1 − (1 − π)) = 1 − (1 − π)^{y+1}

we can write:

F(x) = 0                    for x < 0
F(x) = 1 − (1 − π)^{x+1}    for x = 0, 1, 2, . . . .

The cdf is shown in graphical form in Figure 3.4.


Figure 3.3: Cumulative distribution function for Example 3.5 (F(x) against x, the number of people in the household).

3.4.3 Properties of the cdf for discrete distributions

The cdf F(x) of a discrete random variable X is a step function such that:

• F(x) remains constant in all intervals between possible values of X

• at a possible value xᵢ of X, F(x) jumps up by the amount p(xᵢ) = P(X = xᵢ)

• at such an xᵢ, the value of F(xᵢ) is the value at the top of the jump (i.e. F(x) is right-continuous).

3.4.4 General properties of the cdf


These hold for both discrete and continuous random variables.

1. 0 ≤ F (x) ≤ 1 for all x (since F (x) is a probability).

2. F (x) → 0 as x → −∞, and F (x) → 1 as x → ∞.

3. F (x) is a non-decreasing function, i.e. if x1 < x2 , then F (x1 ) ≤ F (x2 ).

4. For any x1 < x2 , P (x1 < X ≤ x2 ) = F (x2 ) − F (x1 ).

Either the pf or the cdf can be used to calculate the probabilities of any events for a
discrete random variable.

Example 3.7 Continuing with the household size example (for the probabilities,
see Example 3.5), then:

P (X = 1) = p(1) = F (1) = 0.3002


Figure 3.4: Cumulative distribution function for Example 3.6, plotted against x (the number of failures) for π = 0.7 and π = 0.3.

P (X = 2) = p(2) = F (2) − F (1) = 0.3417

P (X ≤ 2) = p(1) + p(2) = F (2) = 0.6419

P (X = 3 or 4) = p(3) + p(4) = F (4) − F (2) = 0.2887

P (X > 5) = p(6) + p(7) + p(8) = 1 − F (5) = 0.0200

P (X ≥ 5) = p(5) + p(6) + p(7) + p(8) = 1 − F (4) = 0.0694.
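These values can be checked mechanically. A minimal Python sketch, reusing the pf from Example 3.3 and building the cdf by summation:

```python
# pf of household size, as in Example 3.3.
p = {1: 0.3002, 2: 0.3417, 3: 0.1551, 4: 0.1336,
     5: 0.0494, 6: 0.0145, 7: 0.0034, 8: 0.0021}

# cdf: F(x) = sum of p(x_i) over possible values x_i <= x.
def F(x):
    return sum(prob for xi, prob in p.items() if xi <= x)

print(round(F(1), 4))            # P(X = 1)      = 0.3002
print(round(F(2) - F(1), 4))     # P(X = 2)      = 0.3417
print(round(F(2), 4))            # P(X <= 2)     = 0.6419
print(round(F(4) - F(2), 4))     # P(X = 3 or 4) = 0.2887
print(round(1 - F(5), 4))        # P(X > 5)      = 0.02
print(round(1 - F(4), 4))        # P(X >= 5)     = 0.0694
```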

3.4.5 Properties of a discrete random variable

Let X be a discrete random variable with sample space S and pf p(x).

Expected value of a discrete random variable

The expected value (or mean) of X is denoted E(X), and defined as:

E(X) = ∑_{xᵢ ∈ S} xᵢ p(xᵢ).

This can also be written more concisely as E(X) = ∑_x x p(x) or E(X) = ∑ x p(x).

We can talk of E(X) as the expected value of both the random variable X, and of the
probability distribution of X.
Alternative notation: instead of E(X), the symbol µ (the lower-case Greek letter ‘mu’),
or µX , is often used.


3.4.6 Expected value versus sample mean

The mean (expected value) E(X) of a probability distribution is analogous to the sample mean (average) X̄ of a sample distribution.

This is easiest to see when the sample space is finite. Suppose the random variable X can have K different values x₁, x₂, . . . , x_K, and their frequencies in a sample are f₁, f₂, . . . , f_K, respectively. Therefore, the sample mean of X is:

X̄ = (f₁x₁ + f₂x₂ + · · · + f_K x_K)/(f₁ + f₂ + · · · + f_K) = x₁ p̂(x₁) + x₂ p̂(x₂) + · · · + x_K p̂(x_K) = ∑_{i=1}^{K} xᵢ p̂(xᵢ)

where:

p̂(xᵢ) = fᵢ / ∑_{i=1}^{K} fᵢ

are the sample proportions of the values xᵢ.

The expected value of the random variable X is:

E(X) = x₁ p(x₁) + x₂ p(x₂) + · · · + x_K p(x_K) = ∑_{i=1}^{K} xᵢ p(xᵢ).

So X̄ uses the sample proportions, p̂(xᵢ), whereas E(X) uses the population probabilities, p(xᵢ).

Example 3.8 Continuing with the household size example:

Number of people
in the household, x p(x) x p(x)
1 0.3002 0.3002
2 0.3417 0.6834
3 0.1551 0.4653
4 0.1336 0.5344
5 0.0494 0.2470
6 0.0145 0.0870
7 0.0034 0.0238
8 0.0021 0.0168
Sum 2.3579
= E(X)

The expected number of people in a randomly selected household is 2.36.

Example 3.9 For the basketball example, p(x) = (1 − π)x π for x = 0, 1, 2, . . ., and
0 otherwise.


The expected value of X is then:

E(X) = ∑_{xᵢ ∈ S} xᵢ p(xᵢ) = ∑_{x=0}^{∞} x (1 − π)^x π
     = ∑_{x=1}^{∞} x (1 − π)^x π                      (starting from x = 1)
     = (1 − π) ∑_{x=1}^{∞} x (1 − π)^{x−1} π
     = (1 − π) ∑_{y=0}^{∞} (y + 1)(1 − π)^y π          (using y = x − 1)
     = (1 − π) ( ∑_{y=0}^{∞} y (1 − π)^y π + ∑_{y=0}^{∞} (1 − π)^y π )
     = (1 − π)(E(X) + 1)
     = (1 − π) E(X) + (1 − π)

where, in the penultimate line, the first sum is E(X) itself and the second sum is 1. From this we can solve:

E(X) = (1 − π)/(1 − (1 − π)) = (1 − π)/π.

Hence, for example:

E(X) = 0.3/0.7 ≈ 0.43   for π = 0.7
E(X) = 0.7/0.3 ≈ 2.33   for π = 0.3.

So, before scoring a basket, a fairly good free-throw shooter (with π = 0.7) misses on average about 0.43 shots, and a pretty poor free-throw shooter (with π = 0.3) misses on average about 2.33 shots.
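These expected values can also be approximated by simulation. The sketch below is illustrative only; the seed and number of replications are arbitrary choices.

```python
import random

def failures_before_success(pi, rng):
    """Simulate throws until the first success; return the number of failures."""
    count = 0
    while rng.random() >= pi:   # a throw fails with probability 1 - pi
        count += 1
    return count

rng = random.Random(102)
n = 100_000
for pi in (0.7, 0.3):
    mean = sum(failures_before_success(pi, rng) for _ in range(n)) / n
    print(pi, round(mean, 2))   # close to 0.43 for pi = 0.7 and 2.33 for pi = 0.3
```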

Example 3.10 To illustrate the use of expected values, let us consider the game of
roulette, from the point of view of the casino (‘The House’).
Suppose a player puts a bet of £1 on ‘red’. If the ball lands on any of the 18 red
numbers, the player gets that £1 back, plus another £1 from The House. If the result
is one of the 18 black numbers or the green 0, the player loses the £1 to The House.
We assume that the roulette wheel is unbiased, i.e. that all 37 numbers have equal
probabilities. What can we say about the probabilities and expected values of wins
and losses?


Define the random variable X = ‘money received by The House’. Its possible values are −1 (the player wins) and 1 (the player loses). The probability function is:

p(x) = 18/37   for x = −1
p(x) = 19/37   for x = 1
p(x) = 0       otherwise.

Therefore, the expected value is:

E(X) = (−1 × 18/37) + (1 × 19/37) = +0.027.

On average, The House expects to win 2.7p for every £1 which players bet on red. This expected gain is known as the house edge. It is positive for all possible bets in roulette.

The edge is the expected gain from a single bet. Usually, however, players bet again if they win at first – gambling can be addictive!

Consider a player who starts with £10 and bets £1 on red repeatedly until the player either has lost all of the £10 or doubled their money to £20.

It can be shown that the probability that such a player reaches £20 before they go down to £0 is about 0.368. Define X = ‘money received by The House’, with the probability function:

p(x) = 0.368   for x = −10
p(x) = 0.632   for x = 10
p(x) = 0       otherwise.

Therefore, the expected value is:

E(X) = (−10 × 0.368) + (10 × 0.632) = +2.64.

On average, The House can expect to keep about 26.4% of the money which players like this bring to the table.
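The 0.368 figure comes from the theory of the ‘gambler's ruin’, which is beyond the scope of this course, but it is easily checked by simulation. The following sketch (illustrative; the seed and number of simulated players are arbitrary) plays out the strategy repeatedly and estimates both the probability of reaching £20 and The House's average takings.

```python
import random

def play_until_done(rng, start=10, target=20, p_win=18/37):
    """Bet £1 on red until the player is ruined or reaches the target.
    Return the amount The House ends up receiving from this player (+10 or -10)."""
    money = start
    while 0 < money < target:
        money += 1 if rng.random() < p_win else -1
    return start - money   # +10 if the player is ruined, -10 if they double up

rng = random.Random(102)
n = 100_000
results = [play_until_done(rng) for _ in range(n)]
print(round(results.count(-10) / n, 3))   # about 0.368: probability of reaching £20
print(round(sum(results) / n, 2))         # about 2.64: The House's average takings
```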

Expected values of functions of a random variable

Let g(X) be a function (‘transformation’) of a discrete random variable X. This is also a random variable, and its expected value is:

E(g(X)) = ∑_x g(x) p_X(x)

where p_X(x) = p(x) is the probability function of X.

Example 3.11 The expected value of the square of X is:

E(X²) = ∑_x x² p(x).


In general:

E(g(X)) ≠ g(E(X))

when g(X) is a non-linear function of X.

Example 3.12 Note that:

E(X²) ≠ (E(X))²   and   E(1/X) ≠ 1/E(X).

Expected values of linear transformations

Suppose X is a random variable and a and b are constants, i.e. known numbers which are not random variables. Therefore:

E(aX + b) = a E(X) + b.

Proof: We have:

E(aX + b) = ∑_x (ax + b) p(x)
          = ∑_x ax p(x) + ∑_x b p(x)
          = a ∑_x x p(x) + b ∑_x p(x)
          = a E(X) + b

where the last step follows from:

i. ∑_x x p(x) = E(X), by definition of E(X)

ii. ∑_x p(x) = 1, by definition of the probability function.

A special case of the result:

E(aX + b) = a E(X) + b

is obtained when a = 0, which gives:

E(b) = b.

That is, the expected value of a constant is the constant itself.


Variance and standard deviation of a discrete random variable

The variance of a discrete random variable X is defined as:

Var(X) = E((X − E(X))²) = ∑_x (x − E(X))² p(x).

The standard deviation of X is sd(X) = √Var(X).

Both Var(X) and sd(X) are always ≥ 0. Both are measures of the dispersion (variation) of the random variable X.

Alternative notation: the variance is often denoted σ² (‘sigma squared’) and the standard deviation by σ (‘sigma’).

An alternative formula: the variance can also be calculated as:

Var(X) = E(X²) − (E(X))².

This will be proved later.

Example 3.13 Continuing with the household size example:

x     p(x)     x p(x)    (x − E(X))²   (x − E(X))² p(x)   x²    x² p(x)
1     0.3002   0.3002    1.844         0.554              1     0.300
2     0.3417   0.6834    0.128         0.044              4     1.367
3     0.1551   0.4653    0.412         0.064              9     1.396
4     0.1336   0.5344    2.696         0.360              16    2.138
5     0.0494   0.2470    6.981         0.345              25    1.235
6     0.0145   0.0870    13.265        0.192              36    0.522
7     0.0034   0.0238    21.549        0.073              49    0.167
8     0.0021   0.0168    31.833        0.067              64    0.134
Sum            2.3579                  1.699                    7.259
               = E(X)                  = Var(X)                 = E(X²)

Therefore, Var(X) = E((X − E(X))²) = 1.699 = 7.259 − (2.358)² = E(X²) − (E(X))², and sd(X) = √Var(X) = √1.699 = 1.30.
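The agreement between the two variance formulas can be verified directly, for example with the following Python sketch (using the same pf values as above).

```python
# Household size pf, as before.
p = {1: 0.3002, 2: 0.3417, 3: 0.1551, 4: 0.1336,
     5: 0.0494, 6: 0.0145, 7: 0.0034, 8: 0.0021}

EX  = sum(x * prob for x, prob in p.items())         # E(X)
EX2 = sum(x**2 * prob for x, prob in p.items())      # E(X^2)

var_definition = sum((x - EX)**2 * prob for x, prob in p.items())
var_shortcut   = EX2 - EX**2

print(round(EX, 4), round(EX2, 4))    # 2.3579 and 7.2585 (7.259 in the rounded table)
print(round(var_definition, 4), round(var_shortcut, 4))   # both approximately 1.699
print(round(var_shortcut ** 0.5, 2))  # sd(X) = 1.30
```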

Example 3.14 For the basketball example, p(x) = (1 − π)^x π for x = 0, 1, 2, . . ., and 0 otherwise. It can be shown (although the proof is beyond the scope of the course) that for this distribution:

Var(X) = (1 − π)/π².

In the two cases we have used as examples:

Var(X) = 0.3/(0.7)² = 0.61 and sd(X) = 0.78   for π = 0.7
Var(X) = 0.7/(0.3)² = 7.78 and sd(X) = 2.79   for π = 0.3.


So the variation in how many free throws a pretty poor shooter misses before the
first success is much higher than the variation for a fairly good shooter.

Variances of linear transformations

If X is a random variable and a and b are constants, then:

Var(aX + b) = a² Var(X).

Proof:

Var(aX + b) = E(((aX + b) − E(aX + b))²)
            = E((aX + b − a E(X) − b)²)
            = E((aX − a E(X))²)
            = E(a²(X − E(X))²)
            = a² E((X − E(X))²)
            = a² Var(X).

Therefore, sd(aX + b) = |a| sd(X).

If a = 0, this gives:

Var(b) = 0.

That is, the variance of a constant is 0. The converse also holds – if a random variable has a variance of 0, it is actually a constant.

Summary of properties of E(X) and Var(X)

If X is a random variable and a and b are constants, then:

• E(aX + b) = a E(X) + b

• Var(aX + b) = a² Var(X) and sd(aX + b) = |a| sd(X)

• E(b) = b and Var(b) = sd(b) = 0.

We define Var(X) = E((X − E(X))²) = E(X²) − (E(X))² and sd(X) = √Var(X). Also, Var(X) ≥ 0 and sd(X) ≥ 0 always, and Var(X) = sd(X) = 0 only if X is a constant.

3.4.7 Moments of a random variable

We can also define, for each k = 1, 2, . . ., the following:

• the kth moment about zero is µ_k = E(X^k)

• the kth central moment is µ′_k = E((X − E(X))^k).

Clearly, µ₁ = µ = E(X) and µ′₂ = Var(X). These will be mentioned again in Chapter 7.

Example 3.15 For further practice, let us consider a discrete random variable X which has possible values 0, 1, 2, . . . , n, where n is a known positive integer, and X has the following probability function:

p(x) = (n choose x) π^x (1 − π)^{n−x}   for x = 0, 1, 2, . . . , n
p(x) = 0                                otherwise

where (n choose x) = n!/(x! (n − x)!) denotes the binomial coefficient, and π is a probability parameter such that 0 ≤ π ≤ 1.

A random variable like this follows the binomial distribution. We will discuss its motivation and uses in the next chapter.

Here, we consider the following tasks for this distribution.

• Show that p(x) satisfies the conditions for a probability function.

• Calculate probabilities from p(x).

• Write down the cumulative distribution function, F(x).

• Derive the expected value, E(X).

Note: the examination may also contain questions like this. The difficulty of such questions depends partly on the form of p(x), and what kinds of manipulations are needed to work with it. So questions of this type may be very easy, or quite hard!

To show that p(x) is a probability function, we need to show the following.

1. p(x) ≥ 0 for all x. This is clearly true, since (n choose x) ≥ 0, π ≥ 0 and 1 − π ≥ 0.

2. ∑_{x=0}^{n} p(x) = 1. This is easiest to show by using the binomial theorem, which states that, for any integer n ≥ 0 and any real numbers y and z:

(y + z)^n = ∑_{x=0}^{n} (n choose x) y^x z^{n−x}.                                 (3.2)

If we choose y = π and z = 1 − π in (3.2), we get:

1 = 1^n = (π + (1 − π))^n = ∑_{x=0}^{n} (n choose x) π^x (1 − π)^{n−x} = ∑_{x=0}^{n} p(x).

The cdf does not simplify into a simple formula, so we just calculate its values from the definition, by summation. For the values x = 0, 1, 2, . . . , n, the value of the cdf is:

F(x) = P(X ≤ x) = ∑_{y=0}^{x} (n choose y) π^y (1 − π)^{n−y}.


Since X is a discrete random variable, F(x) is a step function. For E(X), we have:

E(X) = ∑_{x=0}^{n} x (n choose x) π^x (1 − π)^{n−x}
     = ∑_{x=1}^{n} x (n choose x) π^x (1 − π)^{n−x}
     = ∑_{x=1}^{n} [n(n − 1)! / ((x − 1)! ((n − 1) − (x − 1))!)] π π^{x−1} (1 − π)^{n−x}
     = nπ ∑_{x=1}^{n} (n − 1 choose x − 1) π^{x−1} (1 − π)^{n−x}
     = nπ ∑_{y=0}^{n−1} (n − 1 choose y) π^y (1 − π)^{(n−1)−y}
     = nπ × 1
     = nπ

where y = x − 1, and the last summation is over all the values of the pf of another binomial distribution, this time with possible values 0, 1, 2, . . . , n − 1 and probability parameter π.

The variance of the distribution is Var(X) = nπ(1 − π). This is not derived here, but will be proved in a different way later.
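Both results – that the probabilities sum to 1 and that E(X) = nπ – are easy to confirm numerically for particular parameter values. A brief Python sketch (n = 20 and π = 0.25 are arbitrary illustrative choices):

```python
from math import comb

n, pi = 20, 0.25
probs = [comb(n, x) * pi**x * (1 - pi)**(n - x) for x in range(n + 1)]

print(round(sum(probs), 10))                                # 1.0
print(round(sum(x * p for x, p in enumerate(probs)), 10))   # n * pi = 5.0
print(round(sum(x**2 * p for x, p in enumerate(probs)) - (n * pi)**2, 10))
# the last line gives n * pi * (1 - pi) = 3.75, matching Var(X) = npi(1 - pi)
```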

3.4.8 The moment generating function

Moment generating function

The moment generating function (mgf) of a discrete random variable X is defined as:

M_X(t) = E(e^{tX}) = ∑_x e^{tx} p(x).

M_X(t) is a function of real numbers t. It is not a random variable itself.

The form of the mgf is not interesting or informative in itself. Instead, the reason we define the mgf is that it is a convenient tool for deriving means and variances of distributions, using the following results:

M′_X(0) = E(X)   and   M″_X(0) = E(X²)

which also gives:

Var(X) = E(X²) − (E(X))² = M″_X(0) − (M′_X(0))².

This is useful if the mgf is easier to derive than E(X) and Var(X) directly.


Other moments about zero are obtained from the mgf similarly:

M_X^{(k)}(0) = E(X^k)   for k = 1, 2, . . . .

Example 3.16 In the basketball example, we considered the distribution with p(x) = (1 − π)^x π for x = 0, 1, 2, . . ..

The mgf for this distribution is:

M_X(t) = E(e^{tX}) = ∑_{x=0}^{∞} e^{tx} p(x)
       = ∑_{x=0}^{∞} e^{tx} (1 − π)^x π
       = π ∑_{x=0}^{∞} (e^t(1 − π))^x
       = π / (1 − e^t(1 − π))

using the sum to infinity of a geometric series, for t < −ln(1 − π) to ensure convergence of the sum.

From the mgf M_X(t) = π/(1 − e^t(1 − π)) we obtain:

M′_X(t) = π(1 − π)e^t / (1 − e^t(1 − π))²

M″_X(t) = π(1 − π)e^t (1 − (1 − π)e^t)(1 + (1 − π)e^t) / (1 − e^t(1 − π))⁴

and hence (since e⁰ = 1):

M′_X(0) = (1 − π)/π = E(X)

M″_X(0) = (1 − π)(2 − π)/π² = E(X²)

and:

Var(X) = E(X²) − (E(X))² = (1 − π)(2 − π)/π² − (1 − π)²/π² = (1 − π)/π².

Example 3.17 Consider a discrete random variable X with possible values 0, 1, 2, . . ., a parameter λ > 0, and the following pf:

p(x) = e^{−λ} λ^x / x!   for x = 0, 1, 2, . . .
p(x) = 0                 otherwise.

The mgf for this distribution is:

M_X(t) = ∑_{x=0}^{∞} e^{tx} e^{−λ} λ^x / x! = e^{−λ} ∑_{x=0}^{∞} (e^t λ)^x / x! = e^{−λ} e^{λe^t} = e^{λ(e^t − 1)}.

Note: this uses the series expansion of the exponential function from calculus, i.e. that for any number a, we have:

e^a = ∑_{x=0}^{∞} a^x / x! = 1 + a + a²/2! + a³/3! + · · · .

From the mgf M_X(t) = e^{λ(e^t − 1)} we obtain:

M′_X(t) = λe^t e^{λ(e^t − 1)}

M″_X(t) = λe^t (1 + λe^t) e^{λ(e^t − 1)}

and hence:

M′_X(0) = λ = E(X)

M″_X(0) = λ(1 + λ) = E(X²)

and:

Var(X) = E(X²) − (E(X))² = λ(1 + λ) − λ² = λ.

Other useful properties of moment generating functions

If the mgfs mentioned in these statements exist, then the following apply.

• The mgf uniquely determines a probability distribution. In other words, if for two random variables X and Y we have M_X(t) = M_Y(t) (for points around t = 0), then X and Y have the same distribution.

• If Y = aX + b where X is a random variable and a and b are constants, then:

  M_Y(t) = e^{bt} M_X(at).

• Suppose that the random variables X₁, X₂, . . . , X_n are independent (a concept which will be defined in Chapter 5) and we also define Y = X₁ + X₂ + · · · + X_n. Then:

  M_Y(t) = ∏_{i=1}^{n} M_{Xᵢ}(t)

and, in particular, if all the Xᵢs have the same distribution (that of X), then M_Y(t) = (M_X(t))^n.

3.5 Continuous random variables

A random variable (and its probability distribution) is continuous if it can have an uncountably infinite number of possible values.⁴

4 Strictly speaking, having an uncountably infinite number of possible values does not necessarily imply that it is a continuous random variable. For example, the Cantor distribution (not covered in this course) is neither a discrete nor an absolutely continuous probability distribution, nor is it a mixture of these. However, we will not consider this matter any further in this course.


In other words, the set of possible values (the sample space) is the real numbers R,
or one or more intervals in R.

Example 3.18 An example of a continuous random variable, used here as an approximating model, is the size of claim made on an insurance policy (i.e. a claim by the customer to the insurance company), in £000s.

• Suppose the policy has a deductible of £999, so all claims are at least £1,000.

• Therefore, the possible values of this random variable are {x | x ≥ 1}.

Most of the concepts introduced for discrete random variables have exact or
approximate analogies for continuous random variables, and many results are the same
for both types. However, there are some differences in the details. The most obvious
difference is that wherever in the discrete case there are sums over the possible values of
the random variable, in the continuous case these are integrals.

Probability density function (pdf)

For a continuous random variable X, the probability function is replaced by the probability density function (pdf), denoted f(x) (or f_X(x)).

Example 3.19 Continuing the insurance example in Example 3.18, we consider a pdf of the following form:

f(x) = αk^α / x^{α+1}   for x ≥ k
f(x) = 0                otherwise

where α > 0 is a parameter, and k > 0 (the smallest possible value of X) is a known number. In our example, k = 1 (due to the deductible). A probability distribution with this pdf is known as the Pareto distribution. A graph of this pdf when α = 2.2 is shown in Figure 3.5.

Unlike for probability functions of discrete random variables, in the continuous case values of the probability density function are not probabilities of individual values, i.e. f(x) ≠ P(X = x). In fact, for a continuous random variable:

P(X = x) = 0   for all x.                                                 (3.3)

That is, the probability that X has any particular value exactly is always 0.

Because of (3.3), with a continuous random variable we do not need to be very careful about differences between < and ≤, and between > and ≥. Therefore, the following probabilities are all equal:

P(a < X < b), P(a ≤ X ≤ b), P(a < X ≤ b) and P(a ≤ X < b).


Figure 3.5: Probability density function for Example 3.19 (Pareto pdf with k = 1 and α = 2.2).

Probabilities of intervals for continuous random variables

Integrals of the pdf give probabilities of intervals of values such that:

P(a < X ≤ b) = ∫_a^b f(x) dx

for any two numbers a < b.

In other words, the probability that the value of X is between a and b is the area under f(x) between a and b. Here a can also be −∞, and/or b can be ∞.

Example 3.20 In Figure 3.6, the shaded area is P(1.5 < X ≤ 3) = ∫_{1.5}^{3} f(x) dx.

Properties of pdfs

The pdf f(x) of any continuous random variable must satisfy the following conditions.

1. We require f(x) ≥ 0 for all x.

2. We require ∫_{−∞}^{∞} f(x) dx = 1.

These are analogous to the conditions for probability functions of discrete distributions.


Figure 3.6: Probability density function showing P(1.5 < X ≤ 3) as a shaded area.

Example 3.21 Continuing with the insurance example, we check that the conditions hold for the pdf:

f(x) = αk^α / x^{α+1}   for x ≥ k
f(x) = 0                otherwise

where α > 0 and k > 0.

1. Clearly, f(x) ≥ 0 for all x, since α > 0, k^α > 0 and x^{α+1} ≥ k^{α+1} > 0.

2. We have:

∫_{−∞}^{∞} f(x) dx = ∫_{k}^{∞} αk^α / x^{α+1} dx = αk^α ∫_{k}^{∞} x^{−α−1} dx = αk^α [−x^{−α}/α]_{k}^{∞} = (−k^α)(0 − k^{−α}) = 1.


Cumulative distribution function

The cumulative distribution function (cdf) of a continuous random variable X is defined exactly as for discrete random variables, i.e. the cdf is:

F(x) = P(X ≤ x)   for all real numbers x.

The general properties of the cdf stated previously also hold for continuous distributions. The cdf of a continuous distribution is not a step function, so results on discrete-specific properties do not hold in the continuous case. A continuous cdf is a smooth, continuous function of x.

Relationship between the cdf and pdf

The cdf is obtained from the pdf through integration:

F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(t) dt   for all x.

The pdf is obtained from the cdf through differentiation:

f(x) = F′(x).

Example 3.22 Continuing the insurance example, for x ≥ k:

∫_{−∞}^{x} f(t) dt = ∫_{k}^{x} αk^α / t^{α+1} dt
                   = (−k^α) ∫_{k}^{x} (−α) t^{−α−1} dt
                   = (−k^α) [t^{−α}]_{k}^{x}
                   = (−k^α)(x^{−α} − k^{−α})
                   = 1 − k^α x^{−α}
                   = 1 − (k/x)^α.

Therefore:

F(x) = 0             for x < k
F(x) = 1 − (k/x)^α   for x ≥ k.                                           (3.4)

If we were given (3.4), we could obtain the pdf by differentiation, since F′(x) = 0 when x < k, and:

F′(x) = −k^α (−α) x^{−α−1} = αk^α / x^{α+1}   for x ≥ k.

A plot of the cdf is shown in Figure 3.7.


Figure 3.7: Cumulative distribution function for Example 3.22.

Probabilities from cdfs and pdfs

Since P(X ≤ x) = F(x), it follows that P(X > x) = 1 − F(x). In general, for any two numbers a < b, we have:

P(a < X ≤ b) = ∫_a^b f(x) dx = F(b) − F(a).

Example 3.23 Continuing with the insurance example (with k = 1 and α = 2.2), then:

P(X ≤ 1.5) = F(1.5) = 1 − (1/1.5)^{2.2} ≈ 0.59

P(X ≤ 3) = F(3) = 1 − (1/3)^{2.2} ≈ 0.91

P(X > 3) = 1 − F(3) ≈ 1 − 0.91 = 0.09

P(1.5 ≤ X ≤ 3) = F(3) − F(1.5) ≈ 0.91 − 0.59 = 0.32.
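These four probabilities follow directly from the cdf in (3.4), as the following minimal Python sketch (with k = 1 and α = 2.2, as in the example) shows.

```python
k, alpha = 1.0, 2.2

def F(x):
    """cdf of the Pareto distribution from (3.4)."""
    return 0.0 if x < k else 1.0 - (k / x) ** alpha

print(round(F(1.5), 2))           # P(X <= 1.5)      = 0.59
print(round(F(3), 2))             # P(X <= 3)        = 0.91
print(round(1 - F(3), 2))         # P(X > 3)         = 0.09
print(round(F(3) - F(1.5), 2))    # P(1.5 <= X <= 3) = 0.32
```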

Example 3.24 Consider now a continuous random variable with the following pdf:

f(x) = λe^{−λx}   for x ≥ 0
f(x) = 0          otherwise                                               (3.5)

where λ > 0 is a parameter. This is the pdf of the exponential distribution. The uses of this distribution will be discussed in the next chapter.

Since:

∫_0^x λe^{−λt} dt = [−e^{−λt}]_0^x = 1 − e^{−λx}

the cdf of the exponential distribution is:

F(x) = 0             for x < 0
F(x) = 1 − e^{−λx}   for x ≥ 0.


We now show that (3.5) satisfies the conditions for a pdf.

1. Since λ > 0 and e^a > 0 for any a, f(x) ≥ 0 for all x.

2. Since we have just done the integration to derive the cdf F(x), we can also use it to show that f(x) integrates to one. This follows from:

∫_{−∞}^{∞} f(x) dx = P(−∞ < X < ∞) = lim_{x→∞} F(x) − lim_{x→−∞} F(x)

which here is lim_{x→∞} (1 − e^{−λx}) − 0 = (1 − 0) − 0 = 1.

Mixed distributions

A random variable can also be a mixture of discrete and continuous parts.


For example, consider the sizes of payments which an insurance company needs to make
on all insurance policies of a particular type. Most policies result in no claims or claims
below the deductible, so the payment for them is 0. For those policies which do result in
a claim, the size of each claim is some number greater than 0.
Consider a random variable X which is a mixture of two components.

• P(X = 0) = π for some π ∈ (0, 1). Here π is the probability that a policy results in no payment.

• Among the rest, X follows a continuous distribution with the probabilities distributed as (1 − π)f(x), where f(x) is a continuous pdf over x > 0. In other words, this spreads the remaining probability (1 − π) over different non-zero values of payments. For example, we could use the Pareto distribution for this loss distribution f(x) (or actually as a distribution of X + k, since the company only pays the amount above the deductible, k).

Expected value and variance of a continuous distribution

Suppose X is a continuous random variable with pdf f(x). Definitions of its expected value, the expected value of any transformation g(X), the variance and standard deviation are the same as for discrete distributions, except that summation is replaced by integration:

E(X) = ∫_{−∞}^{∞} x f(x) dx

E(g(X)) = ∫_{−∞}^{∞} g(x) f(x) dx

Var(X) = E((X − E(X))²) = ∫_{−∞}^{∞} (x − E(X))² f(x) dx = E(X²) − (E(X))²

sd(X) = √Var(X).


Example 3.25 For the Pareto distribution, introduced in Example 3.19, we have:

E(X) = ∫_{−∞}^{∞} x f(x) dx = ∫_{k}^{∞} x f(x) dx
     = ∫_{k}^{∞} x αk^α / x^{α+1} dx
     = ∫_{k}^{∞} αk^α / x^α dx
     = [αk/(α − 1)] ∫_{k}^{∞} (α − 1)k^{α−1} / x^{(α−1)+1} dx
     = αk/(α − 1)   (for α > 1).

Here the last step follows because the last integrand has the form of the Pareto pdf with parameter α − 1, so its integral from k to ∞ is 1. This integral converges only if α − 1 > 0, i.e. if α > 1.

Similarly:

E(X²) = ∫_{k}^{∞} x² f(x) dx = ∫_{k}^{∞} x² αk^α / x^{α+1} dx
      = ∫_{k}^{∞} αk^α / x^{α−1} dx
      = [αk²/(α − 2)] ∫_{k}^{∞} (α − 2)k^{α−2} / x^{(α−2)+1} dx
      = αk²/(α − 2)   (for α > 2)

and hence:

Var(X) = E(X²) − (E(X))² = αk²/(α − 2) − (αk/(α − 1))² = (k/(α − 1))² × α/(α − 2).

In our insurance example, where k = 1 and α = 2.2, we have:

E(X) = (2.2 × 1)/(2.2 − 1) ≈ 1.8   and   Var(X) = (1/(2.2 − 1))² × 2.2/(2.2 − 2) ≈ 7.6.
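One way to get a feel for these results is by simulation. Solving u = F(x) from (3.4) gives x = k(1 − u)^{−1/α}, so uniform random numbers can be transformed into Pareto draws (the ‘inverse transform’ method, not covered in this course). The sketch below is purely illustrative; the seed and sample size are arbitrary.

```python
import random

k, alpha = 1.0, 2.2
rng = random.Random(102)

# Inverse of the Pareto cdf: F^{-1}(u) = k * (1 - u) ** (-1 / alpha).
draws = [k * (1.0 - rng.random()) ** (-1.0 / alpha) for _ in range(200_000)]

mean = sum(draws) / len(draws)
var = sum((x - mean) ** 2 for x in draws) / len(draws)
print(round(mean, 2))   # close to alpha * k / (alpha - 1) = 1.83
print(round(var, 2))    # roughly 7.6, though the heavy tail makes this estimate noisy
```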

Means and variances can be ‘infinite’

Expected values and variances are said to be infinite when the corresponding integral
does not exist (i.e. does not have a finite value).


For the Pareto distribution, the distribution is defined for all α > 0, but the mean is infinite if α ≤ 1 and the variance is infinite if α ≤ 2. This happens because for small values of α the distribution has very heavy tails, i.e. the probabilities of very large values of X are non-negligible.
This is actually useful in some insurance applications, for example liability insurance
and medical insurance. There most claims are relatively small, but there is a
non-negligible probability of extremely large claims. The Pareto distribution with a
small α can be a reasonable representation of such situations. Figure 3.8 shows plots of
Pareto cdfs with α = 2.2 and α = 0.8. When α = 0.8, the distribution is so heavy-tailed
that E(X) is infinite.
Figure 3.8: Pareto distribution cdfs for α = 2.2 and α = 0.8.

Example 3.26 Consider the exponential distribution introduced in Example 3.24.

To find E(X) we can use integration by parts, considering x λe^{−λx} as the product of the functions f = x and g′ = λe^{−λx} (so that g = −e^{−λx}). Therefore:

E(X) = ∫_0^∞ x λe^{−λx} dx = [−x e^{−λx}]_0^∞ − ∫_0^∞ −e^{−λx} dx
     = [−x e^{−λx}]_0^∞ − (1/λ)[e^{−λx}]_0^∞
     = (0 − 0) − (1/λ)(0 − 1)
     = 1/λ.


To obtain E(X²), we choose f = x² and g′ = λe^{−λx}, and use integration by parts:

E(X²) = ∫_0^∞ x² λe^{−λx} dx = [−x² e^{−λx}]_0^∞ + 2 ∫_0^∞ x e^{−λx} dx
      = 0 + (2/λ) ∫_0^∞ x λe^{−λx} dx
      = 2/λ²

where the last step follows because the last integral is simply E(X) = 1/λ again.

Finally:

Var(X) = E(X²) − (E(X))² = 2/λ² − 1/λ² = 1/λ².

3.5.1 Moment generating functions

The moment generating function (mgf) of a continuous random variable X is defined as for discrete random variables, with summation replaced by integration:

M_X(t) = E(e^{tX}) = ∫_{−∞}^{∞} e^{tx} f(x) dx.

The properties of the mgf stated in Section 3.4.8 also hold for continuous distributions. If the expected value E(e^{tX}) is infinite, the random variable X does not have an mgf. For example, the Pareto distribution does not have an mgf for positive t.

Example 3.27 For the exponential distribution, we have:

M_X(t) = E(e^{tX}) = ∫_0^∞ e^{tx} λe^{−λx} dx = ∫_0^∞ λe^{−(λ−t)x} dx
       = [λ/(λ − t)] ∫_0^∞ (λ − t)e^{−(λ−t)x} dx = λ/(λ − t)   (for t < λ)

where the final integral equals 1, since its integrand is the pdf of an Exp(λ − t) distribution. From the mgf we get M′_X(t) = λ/(λ − t)² and M″_X(t) = 2λ/(λ − t)³, so:

E(X) = M′_X(0) = 1/λ   and   E(X²) = M″_X(0) = 2/λ²

and Var(X) = E(X²) − (E(X))² = 2/λ² − 1/λ² = 1/λ².

These agree with the results derived with a bit more work in Example 3.26.
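If SciPy is available, the integrals defining E(X) and E(X²) can also be evaluated numerically and compared with 1/λ and 2/λ². A hedged sketch (λ = 1.6 is an arbitrary choice):

```python
import math
from scipy.integrate import quad

lam = 1.6

def pdf(x):
    return lam * math.exp(-lam * x)   # exponential pdf for x >= 0

EX,  _ = quad(lambda x: x * pdf(x), 0, math.inf)
EX2, _ = quad(lambda x: x**2 * pdf(x), 0, math.inf)

print(round(EX, 4), round(1 / lam, 4))               # both 0.625
print(round(EX2 - EX**2, 4), round(1 / lam**2, 4))   # both 0.3906
```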

3.5.2 Median of a random variable


Recall that the sample median is essentially the observation ‘in the middle’ of a set of
data, i.e. where half of the observations in the sample are smaller than the median and
half of the observations are larger.


The median of a random variable (i.e. of its probability distribution) is similar in spirit.

Median of a random variable

The median, m, of a continuous random variable X is the value which satisfies:

F (m) = 0.5. (3.6)

So once we know F (x), we can find the median by solving (3.6).

A more precise general definition of the median of any probability distribution is as follows. Let X be a random variable with the cumulative distribution function F(x). The median m of X is any number which satisfies:

P(X ≤ m) = F(m) ≥ 0.5

and:

P(X ≥ m) = 1 − F(m) + P(X = m) ≥ 0.5.

For a continuous distribution P(X = m) = 0 for any m, so this reduces to F(m) = 0.5. If, for a discrete distribution, F(x_m) = 0.5 exactly for some possible value x_m, the median is not unique. Instead, all values from x_m up to and including the next largest possible value are medians.

Example 3.28 For the Pareto distribution we have:

F(x) = 1 − (k/x)^α   for x ≥ k.

So F(m) = 1 − (k/m)^α = 1/2 when:

(k/m)^α = 1/2   ⇔   k/m = 1/2^{1/α}   ⇔   m = k × 2^{1/α}.

For example:

• when k = 1 and α = 2.2, the median is m = 2^{1/2.2} = 1.37

• when k = 1 and α = 0.8, the median is m = 2^{1/0.8} = 2.38.

Example 3.29 For the exponential distribution we have:

F(x) = 1 − e^{−λx}   for x ≥ 0.

So F(m) = 1 − e^{−λm} = 1/2 when:

e^{−λm} = 1/2   ⇔   −λm = −ln 2   ⇔   m = (ln 2)/λ.


3.6 Overview of chapter


This chapter has formally introduced random variables, making a distinction between
discrete and continuous random variables. Properties of probability distributions were
discussed, including the determination of expected values and variances.

3.7 Key terms and concepts


Binomial distribution Constant
Continuous Cumulative distribution function
Discrete Estimators
Expected value Experiment
Exponential distribution Interval
Median Moment generating function
Outcome Parameter
Pareto distribution Probability density function
Probability distribution Probability (mass) function
Random variable Standard deviation
Step function Variance

The death of one man is a tragedy. The death of millions is a statistic.


(Stalin to Churchill, Potsdam 1945)

Chapter 4
Common distributions of random variables

4.1 Synopsis of chapter content


This chapter formally introduces common ‘families’ of probability distributions which
can be used to model various real-world phenomena.

4.2 Learning outcomes


After completing this chapter, you should be able to:

• summarise basic distributions such as the uniform, Bernoulli, binomial, Poisson, exponential and normal

• calculate probabilities of events for these distributions using the probability function, probability density function or cumulative distribution function

• determine probabilities using statistical tables, where appropriate

• state properties of these distributions such as the expected value and variance.

4.3 Introduction

In statistical inference we will treat observations:

X₁, X₂, . . . , X_n

(the sample) as values of a random variable X, which has some probability distribution (the population distribution). How do we choose the probability distribution?

• Usually we do not try to invent new distributions from scratch.

• Instead, we use one of many existing standard distributions.

• There is a large number of such distributions, such that for most purposes we can find a suitable standard distribution.


This part of the course introduces some of the most common standard distributions for
discrete and continuous random variables.
Probability distributions may differ from each other in a broader or narrower sense. In
the broader sense, we have different families of distributions which may have quite
different characteristics, for example:

continuous versus discrete

among discrete: a finite versus an infinite number of possible values

among continuous: different sets of possible values (for example, all real numbers x,
x ≥ 0, or x ∈ [0, 1]); symmetric versus skewed distributions.

The ‘distributions’ discussed in this chapter are really families of distributions in this
sense.
In the narrower sense, individual distributions within a family differ in having different
values of the parameters of the distribution. The parameters determine the mean and
variance of the distribution, values of probabilities from it etc.
In the statistical analysis of a random variable X we typically:

select a family of distributions based on the basic characteristics of X

use observed data to choose (estimate) values for the parameters of that
distribution, and perform statistical inference on them.

Example 4.1 An opinion poll on a referendum, where each Xᵢ is an answer to the question ‘Will you vote ‘Yes’ or ‘No’ to leaving the European Union?’ has answers recorded as Xᵢ = 0 if ‘No’ and Xᵢ = 1 if ‘Yes’. In a poll of 950 people, 513 answered ‘Yes’.

How do we choose a distribution to represent Xᵢ?

• Here we need a family of discrete distributions with only two possible values (0 and 1). The Bernoulli distribution (discussed in the next section), which has one parameter π (the probability that Xᵢ = 1), is appropriate.

• Within the family of Bernoulli distributions, we use the one where the value of π is our best estimate based on the observed data. This is π̂ = 513/950 = 0.54.

Distributions in the examination

For the discrete uniform, Bernoulli, binomial, Poisson, continuous uniform, exponential
and normal distributions:

you should memorise their pf/pdf, cdf (if given), mean, variance and median (if
given)

you can use these in any examination question without proof, unless the question
directly asks you to derive them again.


For any other distributions:

you do not need to memorise their pf/pdf or cdf; if needed for a question, these will
be provided
if a question involves means, variances or other properties of these distributions,
these will either be provided, or the question will ask you to derive them.

4.4 Common discrete distributions

For discrete random variables, we will consider the following distributions.

• Discrete uniform distribution.

• Bernoulli distribution.

• Binomial distribution.

• Poisson distribution.

4.4.1 Discrete uniform distribution

Suppose a random variable X has k possible values 1, 2, . . . , k. X has a discrete uniform distribution if all of these values have the same probability, i.e. if:

p(x) = P(X = x) = 1/k   for x = 1, 2, . . . , k
p(x) = 0                otherwise.

Example 4.2 A simple example of the discrete uniform distribution is the distribution of the score of a fair die, with k = 6.

The discrete uniform distribution is not very common in applications, but it is useful as
a reference point for more complex distributions.

Mean and variance of a discrete uniform distribution

Calculating directly from the definition,¹ we have:

E(X) = ∑_{x=1}^{k} x p(x) = (1 + 2 + · · · + k)/k = (k + 1)/2                     (4.1)

and:

E(X²) = ∑_{x=1}^{k} x² p(x) = (1² + 2² + · · · + k²)/k = (k + 1)(2k + 1)/6.       (4.2)

Therefore:

Var(X) = E(X²) − (E(X))² = (k² − 1)/12.

1 (4.1) and (4.2) make use, respectively, of ∑_{i=1}^{n} i = n(n + 1)/2 and ∑_{i=1}^{n} i² = n(n + 1)(2n + 1)/6.


4.4.2 Bernoulli distribution


A Bernoulli trial is an experiment with only two possible outcomes. We will number
these outcomes 1 and 0, and refer to them as ‘success’ and ‘failure’, respectively.

Example 4.3 Examples of outcomes of Bernoulli trials are:

agree / disagree

male / female

employed / not employed

owns a car / does not own a car

business goes bankrupt / continues trading.

The Bernoulli distribution is the distribution of the outcome of a single Bernoulli trial. This is the distribution of a random variable X with the following probability function:

p(x) = π^x (1 − π)^{1−x}   for x = 0, 1
p(x) = 0                   otherwise.

Therefore, P(X = 1) = π and P(X = 0) = 1 − P(X = 1) = 1 − π, and no other values are possible. Such a random variable X has a Bernoulli distribution with (probability) parameter π. This is often written as:

X ∼ Bernoulli(π).

If X ∼ Bernoulli(π), then:

E(X) = ∑_{x=0}^{1} x p(x) = 0 × (1 − π) + 1 × π = π                               (4.3)

E(X²) = ∑_{x=0}^{1} x² p(x) = 0² × (1 − π) + 1² × π = π

and:

Var(X) = E(X²) − (E(X))² = π − π² = π(1 − π).                                     (4.4)

The moment generating function is:

M_X(t) = ∑_{x=0}^{1} e^{tx} p(x) = e⁰(1 − π) + e^t π = (1 − π) + πe^t.

4.4.3 Binomial distribution

Suppose we carry out n Bernoulli trials such that:

• at each trial, the probability of success is π

• different trials are statistically independent events.


Let X denote the total number of successes in these n trials. X follows a binomial
distribution with parameters n and π, where n ≥ 1 is a known integer and 0 ≤ π ≤ 1.
This is often written as:
X ∼ Bin(n, π).

The binomial distribution was first encountered in Example 3.15.

Example 4.4 A multiple choice test has 4 questions, each with 4 possible answers. James is taking the test, but has no idea at all about the correct answers. So he guesses every answer and, therefore, has the probability of 1/4 of getting any individual question correct.

Let X denote the number of correct answers in James’ test. X follows the binomial distribution with n = 4 and π = 0.25, i.e. we have:

X ∼ Bin(4, 0.25).

For example, what is the probability that James gets 3 of the 4 questions correct?

Here it is assumed that the guesses are independent, and each has the probability π = 0.25 of being correct. The probability of any particular sequence of 3 correct and 1 incorrect answers, for example 1110, is π³(1 − π)¹, where ‘1’ denotes a correct answer and ‘0’ denotes an incorrect answer.

However, we do not care about the order of the 1s and 0s, only about the number of 1s. So 1101 and 1011, for example, also count as 3 correct answers. Each of these also has the probability π³(1 − π)¹.

The total number of sequences with three 1s (and, therefore, one 0) is the number of locations for the three 1s which can be selected in the sequence of 4 answers. This is (4 choose 3) = 4. Therefore, the probability of obtaining three 1s is:

(4 choose 3) π³(1 − π)¹ = 4 × (0.25)³ × (0.75)¹ ≈ 0.0469.

Binomial distribution probability function

In general, the probability function of X ∼ Bin(n, π) is:

p(x) = (n choose x) π^x (1 − π)^{n−x}   for x = 0, 1, 2, . . . , n
p(x) = 0                                otherwise.                                (4.5)

We have already shown that (4.5) satisfies the conditions for being a probability function in the previous chapter (see Example 3.15).

Example 4.5 Continuing Example 4.4, where X ∼ Bin(4, 0.25), we have:

p(0) = (4 choose 0) × (0.25)⁰ × (0.75)⁴ = 0.3164,   p(1) = (4 choose 1) × (0.25)¹ × (0.75)³ = 0.4219,
p(2) = (4 choose 2) × (0.25)² × (0.75)² = 0.2109,   p(3) = (4 choose 3) × (0.25)³ × (0.75)¹ = 0.0469,
p(4) = (4 choose 4) × (0.25)⁴ × (0.75)⁰ = 0.0039.

If X ∼ Bin(n, π), then:

E(X) = nπ

and:

Var(X) = nπ(1 − π).

The expected value E(X) was derived in the previous chapter (see Example 3.15). The variance will be derived later. These can also be obtained from the moment generating function:

M_X(t) = ((1 − π) + πe^t)^n.

Example 4.6 Suppose a multiple choice examination has 20 questions, each with 4
possible answers. Consider again James who guesses each one of the answers. Let X
denote the number of correct answers by such a student, so that we have
X ∼ Bin(20, 0.25). For such a student, the expected number of correct answers is
E(X) = 20 × 0.25 = 5.
The teacher wants to set the pass mark of the examination so that, for such a
student, the probability of passing is less than 0.05. What should the pass mark be?
In other words, what is the smallest x such that P (X ≥ x) < 0.05, i.e. such that
P (X < x) ≥ 0.95?
Calculating the probabilities of x = 0, 1, 2, . . . , 20 we get (rounded to 2 decimal
places):

x 0 1 2 3 4 5 6 7 8 9 10
p(x) 0.00 0.02 0.07 0.13 0.19 0.20 0.17 0.11 0.06 0.03 0.01
x 11 12 13 14 15 16 17 18 19 20
p(x) 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

Calculating the cumulative probabilities, we find that F (7) = P (X < 8) = 0.898 and
F (8) = P (X < 9) = 0.959. Therefore, P (X ≥ 8) = 0.102 > 0.05 and also
P (X ≥ 9) = 0.041 < 0.05. The pass mark should be set at 9.
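With a computer, the search for the pass mark is immediate using the binomial cdf. A sketch assuming SciPy is installed:

```python
from scipy.stats import binom

n, pi = 20, 0.25
# Find the smallest x with P(X >= x) < 0.05, i.e. P(X <= x - 1) >= 0.95.
for x in range(n + 1):
    if 1 - binom.cdf(x - 1, n, pi) < 0.05:
        print(x)                      # 9
        break

print(round(binom.cdf(7, n, pi), 3))  # F(7) = 0.898
print(round(binom.cdf(8, n, pi), 3))  # F(8) = 0.959
```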
More generally, consider a student who has the same probability π of the correct
answer for every question, so that X ∼ Bin(20, π). Figure 4.1 shows plots of the
probabilities for π = 0.25, 0.5, 0.7 and 0.9.


Figure 4.1: Probability plots for Example 4.6, for π = 0.25 (E(X) = 5), π = 0.5 (E(X) = 10), π = 0.7 (E(X) = 14) and π = 0.9 (E(X) = 18); each panel plots the probability of each number of correct answers.

4.4.4 Poisson distribution

The possible values of the Poisson distribution are the non-negative integers 0, 1, 2, . . ..

Poisson distribution probability function

The probability function of the Poisson distribution is:

p(x) = e^{−λ} λ^x / x!   for x = 0, 1, 2, . . .
p(x) = 0                 otherwise                                                (4.6)

where λ > 0 is a parameter.

If a random variable X has a Poisson distribution with parameter λ, this is often denoted by:

X ∼ Poisson(λ)   or   X ∼ Pois(λ).

If X ∼ Poisson(λ), then:

E(X) = λ

and:

Var(X) = λ.

These can also be obtained from the moment generating function (see Example 3.17):

M_X(t) = e^{λ(e^t − 1)}.


Poisson distributions are used for counts of occurrences of various kinds. To give a
formal motivation, suppose that we consider the number of occurrences of some
phenomenon in time, and that the process which generates the occurrences satisfies the
following conditions:

1. The numbers of occurrences in any two disjoint intervals of time are independent of
each other.

2. The probability of two or more occurrences at the same time is negligibly small.

3. The probability of one occurrence in any short time interval of length t is λt for
some constant λ > 0.

In essence, these state that individual occurrences should be independent, sufficiently


rare, and happen at a constant rate λ per unit of time. A process like this is a Poisson
process.
If occurrences are generated by a Poisson process, then the number of occurrences in a
randomly selected time interval of length t = 1, X, follows a Poisson distribution with
mean λ, i.e. X ∼ Poisson(λ).
The single parameter λ of the Poisson distribution is, therefore, the rate of occurrences
per unit of time.

Example 4.7 Examples of variables for which we might use a Poisson distribution:

The number of telephone calls received at a call centre per minute.

The number of accidents on a stretch of motorway per week.

The number of customers arriving at a checkout per minute.

The number of misprints per page of newsprint.

Because λ is the rate per unit of time, its value also depends on the unit of time (that
is, the length of interval) we consider.

Example 4.8 If X is the number of arrivals per hour and X ∼ Poisson(1.5), then if
Y is the number of arrivals per two hours, Y ∼ Poisson(1.5 × 2) = Poisson(3).

λ is also the mean of the distribution, i.e. E(X) = λ.


Both motivations suggest that distributions with higher values of λ have higher
probabilities of large values of X.

Example 4.9 Figure 4.2 shows the probabilities p(x) for x = 0, 1, 2, . . . , 10 for
X ∼ Poisson(2) and X ∼ Poisson(4).


Figure 4.2: Probability plots for Example 4.9, showing p(x) for x = 0, 1, 2, . . . , 10 when λ = 2 and when λ = 4.

Example 4.10 Customers arrive at a bank on weekday afternoons randomly at an average rate of 1.6 customers per minute. Let X denote the number of arrivals per minute and Y denote the number of arrivals per 5 minutes.

We assume a Poisson distribution for both, such that:

X ∼ Poisson(1.6)

and:

Y ∼ Poisson(1.6 × 5) = Poisson(8).

1. What is the probability that no customer arrives in a one-minute interval?

For X ∼ Poisson(1.6), the probability P(X = 0) is:

p_X(0) = e^{−λ} λ⁰/0! = e^{−1.6} (1.6)⁰/0! = e^{−1.6} = 0.2019.

2. What is the probability that more than two customers arrive in a one-minute interval?

P(X > 2) = 1 − P(X ≤ 2) = 1 − (P(X = 0) + P(X = 1) + P(X = 2)), which is:

1 − p_X(0) − p_X(1) − p_X(2) = 1 − e^{−1.6}(1.6)⁰/0! − e^{−1.6}(1.6)¹/1! − e^{−1.6}(1.6)²/2!
                             = 1 − e^{−1.6} − 1.6e^{−1.6} − 1.28e^{−1.6}
                             = 1 − 3.88e^{−1.6}
                             = 0.2167.

3. What is the probability that no more than 1 customer arrives in a five-minute interval?

For Y ∼ Poisson(8), the probability P(Y ≤ 1) is:

p_Y(0) + p_Y(1) = e^{−8} 8⁰/0! + e^{−8} 8¹/1! = e^{−8} + 8e^{−8} = 9e^{−8} = 0.0030.


4.4.5 Connections between probability distributions


There are close connections between some probability distributions, even across
different families of them. Some connections are exact, i.e. one distribution is exactly
equal to another, for particular values of the parameters. For example, Bernoulli(π) is
the same distribution as Bin(1, π).
Some connections are approximate (or asymptotic), i.e. one distribution is closely
approximated by another under some limiting conditions. We next discuss one of these,
the Poisson approximation of the binomial distribution.

4.4.6 Poisson approximation of the binomial distribution


Suppose that:

X ∼ Bin(n, π)

n is large and π is small.

Under such circumstances, the distribution of X is well-approximated by a Poisson(λ)


distribution with λ = nπ.
The connection is exact at the limit, i.e. Bin(n, π) → Poisson(λ) if n → ∞ and π → 0 in
such a way that nπ = λ remains constant.
This ‘law of small numbers’ provides another motivation for the Poisson distribution.

Example 4.11 A classic example (from Bortkiewicz (1898) Das Gesetz der kleinen
Zahlen) helps to remember the key elements of the ‘law of small numbers’.
Figure 4.3 shows the numbers of soldiers killed by horsekick in each of 14 army corps
of the Prussian army in each of the years spanning 1875–94.
Suppose that the number of men killed by horsekicks in one corps in one year is
X ∼ Bin(n, π), where:

n is large – the number of men in a corps (perhaps 50,000)

π is small – the probability that a man is killed by a horsekick.

X should be well-approximated by a Poisson distribution with some mean λ. The


sample frequencies and proportions of different counts are as follows:

Number killed 0 1 2 3 4 More


Count 144 91 32 11 2 0
% 51.4 32.5 11.4 3.9 0.7 0

The sample mean of the counts is x̄ = 0.7, which we use as λ for the Poisson
distribution. X ∼ Poisson(0.7) is indeed a good fit to the data, as shown in Figure
4.4.


Figure 4.3: Numbers of soldiers killed by horsekick in each of 14 army corps of the Prussian
army in each of the years spanning 1875–94. Source: Bortkiewicz (1898) Das Gesetz der
kleinen Zahlen, Leipzig: Teubner.
Figure 4.4: Fit of the Poisson(0.7) distribution to the data in Example 4.11 (Poisson probabilities plotted against the sample proportions of men killed).

Example 4.12 An airline is selling tickets to a flight with 198 seats. It knows that,
on average, about 1% of customers who have bought tickets fail to arrive for the
flight. Because of this, the airline overbooks the flight by selling 200 tickets. What is
the probability that everyone who arrives for the flight will get a seat?
Let X denote the number of people who fail to turn up. Using the binomial
distribution, X ∼ Bin(200, 0.01). We have:

P (X ≥ 2) = 1 − P (X = 0) − P (X = 1) = 1 − 0.1340 − 0.2707 = 0.5953.

Using the Poisson approximation, X ∼ Poisson(200 × 0.01) = Poisson(2).

P (X ≥ 2) = 1 − P (X = 0) − P (X = 1) = 1 − e−2 − 2e−2 = 1 − 3e−2 = 0.5940.


4.4.7 Some other discrete distributions


Just their names and short comments are given here, so that you have an idea of what
else there is. You may meet some of these in future courses.

Geometric(π) distribution.
• Distribution of the number of failures in Bernoulli trials before the first success.
• π is the probability of success at each trial.
• The sample space is 0, 1, 2, . . ..
• See the basketball example in Chapter 3.

Negative binomial(r, π) distribution.


• Distribution of the number of failures in Bernoulli trials before r successes
occur.
• π is the probability of success at each trial.
• The sample space is 0, 1, 2, . . ..
• Negative binomial(1, π) is the same as Geometric(π).

Hypergeometric(n, A, B) distribution.
• Experiment where initially A + B objects are available for selection, and A of
them represent ‘success’.
• n objects are selected at random, without replacement.
• Hypergeometric is then the distribution of the number of successes.
• The sample space is the integers x where max{0, n − B} ≤ x ≤ min{n, A}.
• If the selection was with replacement, the distribution of the number of
successes would be Bin(n, A/(A + B)).

Multinomial(n, π1 , π2 , . . . , πk ) distribution.
• Here π1 + π2 + · · · + πk = 1, and the πi s are the probabilities of the values
1, 2, . . . , k.
• If n = 1, the sample space is 1, 2, . . . , k. This is essentially a generalisation of
the discrete uniform distribution, but with non-equal probabilities πi .
• If n > 1, the sample space is the vectors (n1 , n2 , . . . , nk ) where ni ≥ 0 for all i,
and n1 + n2 + · · · + nk = n. This is essentially a generalisation of the binomial
to the case where each trial has k ≥ 2 possible outcomes, and the random
variable records the numbers of each outcome in n trials. Note that with
k = 2, Multinomial(n, π1 , π2 ) is essentially the same as Bin(n, π) with π = π2
(or with π = π1 ).
• When n > 1, the multinomial distribution is the distribution of a multivariate
random variable, as discussed later in the course.


4.5 Common continuous distributions

For continuous random variables, we will consider the following distributions.

• Uniform distribution.

• Exponential distribution.

• Normal distribution.

4.5.1 The (continuous) uniform distribution

The (continuous) uniform distribution has non-zero probabilities only on an interval [a, b], where a < b are given numbers. The probability that its value is in an interval within [a, b] is proportional to the length of the interval. In other words, all intervals (within [a, b]) which have the same length have the same probability.

Uniform distribution pdf

The pdf of the (continuous) uniform distribution is:

f(x) = 1/(b − a)   for a ≤ x ≤ b
f(x) = 0           otherwise.

A random variable X with this pdf may be written as X ∼ Uniform[a, b].

The pdf is ‘flat’, as shown in Figure 4.5 (along with the cdf). Clearly, f(x) ≥ 0 for all x, and:

∫_{−∞}^{∞} f(x) dx = ∫_a^b 1/(b − a) dx = [x/(b − a)]_a^b = (b − a)/(b − a) = 1.

The cdf is F(x) = P(X ≤ x) = ∫_a^x f(t) dt, giving:

F(x) = 0                   for x < a
F(x) = (x − a)/(b − a)     for a ≤ x ≤ b
F(x) = 1                   for x > b.

Therefore, the probability of an interval [x₁, x₂], where a ≤ x₁ < x₂ ≤ b, is:

P(x₁ ≤ X ≤ x₂) = F(x₂) − F(x₁) = (x₂ − x₁)/(b − a).

So the probability depends only on the length of the interval, x₂ − x₁.

If X ∼ Uniform[a, b], we have:

E(X) = (a + b)/2 = median of X

and:

Var(X) = (b − a)²/12.

The mean and median also follow from the fact that the distribution is symmetric about (a + b)/2, i.e. the midpoint of the interval [a, b].


Figure 4.5: Continuous uniform distribution pdf (left) and cdf (right).

4.5.2 Exponential distribution

Exponential distribution pdf

A random variable X has the exponential distribution with the parameter λ (where λ > 0) if its probability density function is:

f(x) = λe^{−λx}   for x ≥ 0
f(x) = 0          otherwise.

This is often denoted X ∼ Exponential(λ) or X ∼ Exp(λ).

It was shown in the previous chapter that this satisfies the conditions for a pdf (see Example 3.24). The general shape of the pdf is that of ‘exponential decay’, as shown in Figure 4.6 (hence the name).

The cdf of the Exp(λ) distribution is:

F(x) = 0             for x < 0
F(x) = 1 − e^{−λx}   for x ≥ 0.

The cdf is shown in Figure 4.7 for λ = 1.6.

For X ∼ Exp(λ), we have:

E(X) = 1/λ

and:

Var(X) = 1/λ².

These have been derived in the previous chapter (see Example 3.26). The median of the distribution, also previously derived (see Example 3.29), is:

m = (ln 2)/λ = (ln 2) × (1/λ) = (ln 2) E(X) ≈ 0.69 × E(X).

Figure 4.6: Exponential distribution pdf.


Figure 4.7: Exponential distribution cdf for λ = 1.6.

Note that the median is always smaller than the mean, because the distribution is
skewed to the right.
The moment generating function of the exponential distribution (derived in Example 3.27) is:

M_X(t) = λ/(λ − t)   for t < λ.

Uses of the exponential distribution

The exponential is, among other things, a basic distribution of waiting times of various
kinds. This arises from a connection between the Poisson distribution – the simplest
distribution for counts – and the exponential.

If the number of events per unit of time has a Poisson distribution with parameter
λ, the time interval (measured in the same units of time) between two successive
events has an exponential distribution with the same parameter λ.


Note that the expected values of these behave as we would expect.

E(X) = λ for Pois(λ), i.e. a large λ means many events per unit of time, on average.

E(X) = 1/λ for Exp(λ), i.e. a large λ means short waiting times between successive
events, on average.

Example 4.13 Consider Example 4.10.

The number of customers arriving at a bank per minute has a Poisson distribution with parameter λ = 1.6.

Therefore, the time X, in minutes, between the arrivals of two successive customers follows an exponential distribution with parameter λ = 1.6.

From this exponential distribution, the expected waiting time between arrivals of
customers is E(X) = 1/1.6 = 0.625 (minutes) and the median is calculated to be
(ln 2) × 0.625 = 0.433.
We can also calculate probabilities of waiting times between arrivals, using the
cumulative distribution function:
F(x) = 0 for x < 0,   and   F(x) = 1 − e^(−1.6x) for x ≥ 0.

For example:

P (X ≤ 1) = F (1) = 1 − e−1.6×1 = 1 − e−1.6 = 0.7981.


The probability is about 0.8 that two arrivals are at most a minute apart.

P (X > 3) = 1 − F (3) = e−1.6×3 = e−4.8 = 0.0082.


The probability of a gap of 3 minutes or more between arrivals is very small.
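If you wish, these values can be checked on a computer (not examinable, and not needed for the course). A minimal Python sketch using SciPy, assuming it is installed, is:

from scipy import stats

lam = 1.6
X = stats.expon(scale=1/lam)   # SciPy uses the scale = 1/lambda parameterisation

print(X.mean())                # 1/1.6 = 0.625 minutes
print(X.median())              # (ln 2)/1.6 ≈ 0.433 minutes
print(X.cdf(1))                # P(X <= 1) = 1 - e^(-1.6) ≈ 0.7981
print(1 - X.cdf(3))            # P(X > 3) = e^(-4.8) ≈ 0.0082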

4.5.3 Two other distributions


These are generalisations of the uniform and exponential distributions. Only their
names and short comments are given here, just so that you know they exist. You may
meet these again in future courses.

Beta(α, β) distribution, shown in Figure 4.8.


• Generalising the uniform distribution, these are distributions for a closed
interval, which is taken to be [0, 1].
• Therefore, the sample space is {x | 0 ≤ x ≤ 1}.
• Unlike for the uniform distribution, the pdf is generally not flat.
• Beta(1, 1) is the same as Uniform[0, 1].


Gamma(α, β) distribution, shown in Figure 4.9.


• Generalising the exponential distribution, this is a two-parameter family of
skewed distributions for positive values.
• The sample space is {x | x > 0}.
• Gamma(1, β) is the same as Exp(β).

Figure 4.8: Beta distribution density functions, for (α, β) = (0.5, 1), (1, 2), (1, 1), (0.5, 0.5), (2, 2) and (4, 2).

4.5.4 Normal (Gaussian) distribution


The normal distribution is by far the most important probability distribution in
statistics. This is for three broad reasons.

Many variables have distributions which are approximately normal, for example
heights of humans or animals, and weights of various products.

The normal distribution has extremely convenient mathematical properties, which make it a useful default choice of distribution in many contexts.

Even when a variable is not itself even approximately normally distributed, functions of several observations of the variable (‘sampling distributions’) are often approximately normal, due to the central limit theorem. Because of this, the normal distribution has a crucial role in statistical inference. This will be discussed later in the course.

Figure 4.9: Gamma distribution density functions, for (α, β) = (0.5, 1), (1, 0.5), (2, 1) and (2, 0.25).

Normal distribution pdf

The pdf of the normal distribution is:

f(x) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²))   for −∞ < x < ∞

where π is the mathematical constant (i.e. π = 3.14159 . . .), and µ and σ 2 are
parameters, with −∞ < µ < ∞ and σ 2 > 0.
A random variable X with this pdf is said to have a normal distribution with mean
µ and variance σ 2 , denoted X ∼ N (µ, σ 2 ).

Clearly, f(x) ≥ 0 for all x. Also, it can be shown that ∫_{−∞}^{∞} f(x) dx = 1 (do not attempt to show this), so f(x) really is a pdf.
The proof of the second point, which is somewhat elaborate, is shown in a separate note
on the ST102 Moodle site. This note is not examinable!


If X ∼ N (µ, σ 2 ), then:
E(X) = µ
and:
Var(X) = σ 2
and, therefore, the standard deviation is sd(X) = σ.
A (non-examinable) proof of this is given in a separate note on the ST102 Moodle site.
It uses the moment generating function of the normal distribution, which is shown to be:

M_X(t) = exp(µt + σ²t²/2)   for −∞ < t < ∞.

The mean can also be inferred from the observation that the normal pdf is symmetric
about µ. This also implies that the median of the normal distribution is µ.
The normal density is the so-called ‘bell curve’. The two parameters affect it as follows.

The mean µ determines the location of the curve.

The variance σ 2 determines the dispersion (spread) of the curve.

Example 4.14 Figure 4.10 shows that:

N(0, 1) and N(5, 1) have the same dispersion but different location: the N(5, 1) curve is identical to the N(0, 1) curve, but shifted 5 units to the right.

N(0, 1) and N(0, 9) have the same location but different dispersion: the N(0, 9) curve is centred at the same value, 0, as the N(0, 1) curve, but spread out more widely.

Figure 4.10: Various normal distributions.


Linear transformations of the normal distribution

We now consider one of the convenient properties of the normal distribution. Suppose
X is a random variable, and we consider the linear transformation Y = aX + b, where a
and b are constants.
Whatever the distribution of X, it is true that E(Y ) = a E(X) + b and also that
Var(Y ) = a2 Var(X).
Furthermore, if X is normally distributed, then so is Y . In other words, if
X ∼ N (µ, σ 2 ), then:
Y = aX + b ∼ N (aµ + b, a2 σ 2 ). (4.7)
This type of result is not true in general. For other families of distributions, the
distribution of Y = aX + b is not always in the same family as X.
Let us apply (4.7) with a = 1/σ and b = −µ/σ, to get:
Z = (1/σ)X − µ/σ = (X − µ)/σ ∼ N(µ/σ − µ/σ, (1/σ)²σ²) = N(0, 1).

The transformed variable Z = (X − µ)/σ is known as a standardised variable or a z-score.

The distribution of the z-score is N(0, 1), i.e. the normal distribution with mean µ = 0 and variance σ² = 1 (and, therefore, a standard deviation of σ = 1). This is known as the standard normal distribution. Its density function is:

f(x) = (1/√(2π)) exp(−x²/2)   for −∞ < x < ∞.
The cumulative distribution function of the normal distribution is:

F(x) = ∫_{−∞}^{x} (1/√(2πσ²)) exp(−(t − µ)²/(2σ²)) dt.

In the special case of the standard normal distribution, the cdf is:

F(x) = Φ(x) = ∫_{−∞}^{x} (1/√(2π)) exp(−t²/2) dt.

Note, this is often denoted Φ(x).


Such integrals cannot be evaluated in a closed form, so we use statistical tables of them,
specifically a table of Φ(x) (or we could use a computer, but not in the examination).
In the examination, you will have a table of some values of Φ(z), the cdf of Z ∼ N (0, 1)
(Table 3 in Murdoch and Barnes’ Statistical Tables). This is also on the ST102 Moodle
site for use in the exercises.
Since Table 3 uses the notation Φ(z) (for z-score), we will do so too below. Φ(x) and
Φ(z) mean the same thing, of course.
Table 3 shows values of 1 − Φ(z) = P (Z > z) for z ≥ 0. This table can be used to
calculate probabilities of any intervals for any normal distribution, but how? The table
seems to be incomplete.


1. It is only for N (0, 1), not for N (µ, σ 2 ) for any other µ and σ 2 .
2. Even for N (0, 1), it only shows probabilities for z ≥ 0.

We next show how these are not really limitations, starting with ‘2.’.
The key to using the tables is that the standard normal distribution is symmetric about
0. This means that for an interval in one tail, its ‘mirror image’ in the other tail has the
same probability. Another way to justify these results is that if Z ∼ N (0, 1), then also
−Z ∼ N (0, 1).
Suppose that z ≥ 0, so that −z ≤ 0. Table 3 shows:
P (Z > z) = 1 − Φ(z) = Pz
which is called Pz for short. From it, we also get the following probabilities.

P (Z ≤ z) = Φ(z) = 1 − P (Z > z) = 1 − Pz .
P (Z ≤ −z) = Φ(−z) = P (−Z ≥ z) = P (Z ≥ z) = P (Z > z) = Pz .
P (Z > −z) = 1 − Φ(−z) = P (−Z < z) = P (Z < z) = 1 − Pz .

In each of these, ≤ can be replaced by <, and ≥ by > (see Section 3.5). Figure 4.11
shows tail probabilities for the standard normal distribution.


Figure 4.11: Tail probabilities for the standard normal distribution.

If Z ∼ N(0, 1), then for any two numbers z1 < z2:

P(z1 < Z ≤ z2) = Φ(z2) − Φ(z1)
where Φ(z2 ) and Φ(z1 ) are obtained using the rules above.
Reality check: remember that:
Φ(0) = P (Z ≤ 0) = 0.5 = P (Z > 0) = 1 − Φ(0).
So if you ever end up with results like P (Z ≤ −1) = 0.7 or P (Z ≤ 1) = 0.2 or
P (Z > 2) = 0.95, these must be wrong! (See property 3 of cdfs in Section 3.4.4.)


Example 4.15 Consider the 0.2005 value in the ‘0.8’ row and ‘0.04’ column of
Table 3 of Murdoch and Barnes’ Statistical Tables, which shows that:

1 − Φ(0.84) = P (Z > 0.84) = 0.2005.

Using the results above, we then also have:

P (Z ≤ 0.84) = Φ(0.84) = 1 − 0.2005 = 0.7995

P (Z ≤ −0.84) = P (Z ≥ 0.84) = 0.2005

P (Z ≥ −0.84) = P (Z ≤ 0.84) = 0.7995

P (−0.84 ≤ Z ≤ 0.84) = P (Z ≤ 0.84) − P (Z ≤ −0.84) = 0.5990.
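Outside the examination, a computer can play the role of Table 3. A short, non-examinable Python/SciPy check of these four values (assuming SciPy is available) is:

from scipy import stats

Z = stats.norm(0, 1)

print(1 - Z.cdf(0.84))              # P(Z > 0.84) ≈ 0.2005
print(Z.cdf(0.84))                  # P(Z <= 0.84) ≈ 0.7995
print(Z.cdf(-0.84))                 # P(Z <= -0.84) ≈ 0.2005, by symmetry
print(Z.cdf(0.84) - Z.cdf(-0.84))   # P(-0.84 <= Z <= 0.84) ≈ 0.5990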

Probabilities for any normal distribution

How about a normal distribution X ∼ N (µ, σ 2 ), for any other µ and σ 2 ?


What if we want to calculate, for any a < b, P (a < X ≤ b) = F (b) − F (a)?
Remember that (X − µ)/σ = Z ∼ N (0, 1). If we apply this transformation to all parts
of the inequalities, we get:
P(a < X ≤ b) = P((a − µ)/σ < (X − µ)/σ ≤ (b − µ)/σ)
             = P((a − µ)/σ < Z ≤ (b − µ)/σ)
             = Φ((b − µ)/σ) − Φ((a − µ)/σ)
which can be calculated using Table 3 of Murdoch and Barnes’ Statistical Tables. (Note
that this also covers the cases of the one-sided inequalities P (X ≤ b), with a = −∞,
and P (X > a), with b = ∞.)

Example 4.16 Let X denote the diastolic blood pressure of a randomly selected
person in England. This is approximately distributed as X ∼ N (74.2, 127.87).
Suppose we want to know the probabilities of the following intervals:

X > 90 (high blood pressure)

X < 60 (low blood pressure)

60 ≤ X ≤ 90 (normal blood pressure).

These are calculated using standardisation with µ = 74.2, σ² = 127.87 and, therefore, σ = 11.31. So here:

(X − 74.2)/11.31 = Z ∼ N(0, 1)


and we can refer values of this standardised variable to Table 3 of Murdoch and
Barnes’ Statistical Tables.
 
P(X > 90) = P((X − 74.2)/11.31 > (90 − 74.2)/11.31)
          = P(Z > 1.40)
          = 1 − Φ(1.40)
          = 1 − 0.9192
          = 0.0808

and:
 
P(X < 60) = P((X − 74.2)/11.31 < (60 − 74.2)/11.31)
          = P(Z < −1.26)
          = P(Z > 1.26)
          = 1 − Φ(1.26)
          = 1 − 0.8962
          = 0.1038.

Finally:
P(60 ≤ X ≤ 90) = P(X ≤ 90) − P(X < 60) = 0.9192 − 0.1038 = 0.8154.
These probabilities are shown in Figure 4.12.
Figure 4.12: Distribution of diastolic blood pressure for Example 4.16, with the three regions marked (low: 0.10, mid: 0.82, high: 0.08).
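For interest only (not examinable), the same probabilities can be computed directly in Python with SciPy; the results differ very slightly from the table-based answers because z is not rounded to two decimal places:

from scipy import stats

mu, sigma = 74.2, 127.87 ** 0.5        # sigma ≈ 11.31
X = stats.norm(mu, sigma)

print(1 - X.cdf(90))                   # P(X > 90) ≈ 0.081
print(X.cdf(60))                       # P(X < 60) ≈ 0.105
print(X.cdf(90) - X.cdf(60))           # P(60 <= X <= 90) ≈ 0.814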


Some probabilities around the mean

The following results hold for all normal distributions.

P(µ − σ < X < µ + σ) = 0.683. In other words, about 68.3% of the total probability is within 1 standard deviation of the mean.
P (µ − 1.96 × σ < X < µ + 1.96 × σ) = 0.950.
P (µ − 2 × σ < X < µ + 2 × σ) = 0.954.
P (µ − 2.58 × σ < X < µ + 2.58 × σ) = 0.99.
P (µ − 3 × σ < X < µ + 3 × σ) = 0.997.

The first two of these are illustrated graphically in Figure 4.13.


Figure 4.13: Some probabilities around the mean for the normal distribution.

4.5.5 Normal approximation of the binomial distribution


For 0 < π < 1, the binomial distribution Bin(n, π) tends to the normal distribution
N (nπ, nπ(1 − π)) as n → ∞.
Less formally, the binomial distribution is well-approximated by the normal distribution
when the number of trials n is reasonably large.
For a given n, the approximation is best when π is not very close to 0 or 1. One
rule-of-thumb is that the approximation is good enough when nπ > 5 and n(1 − π) > 5.
Illustrations of the approximation are shown in Figure 4.14 for different values of n and
π. Each plot shows values of the pf of Bin(n, π), and the pdf of the normal
approximation, N (nπ, nπ(1 − π)).
When the normal approximation is appropriate, we can calculate probabilities for
X ∼ Bin(n, π) using Y ∼ N (nπ, nπ(1 − π)) and Table 3 of Murdoch and Barnes’
Statistical Tables.


Figure 4.14: Examples of the normal approximation of the binomial distribution, for (n, π) = (10, 0.5), (25, 0.5), (25, 0.25), (10, 0.9), (25, 0.9) and (50, 0.9).

Unfortunately, there is one small caveat. The binomial distribution is discrete, but the
normal distribution is continuous. To see why this is problematic, consider the following.
Suppose X ∼ Bin(40, 0.4). Since X is discrete, such that x = 0, 1, 2, . . . , 40, then:
P (X ≤ 4) = P (X ≤ 4.5) = P (X < 5)
since P (4 < X ≤ 4.5) = 0 and P (4.5 < X < 5) = 0 due to the ‘gaps’ in the probability
mass for this distribution. In contrast if Y ∼ N (16, 9.6), then:
P (Y ≤ 4) < P (Y ≤ 4.5) < P (Y < 5)
since P (4 < Y < 4.5) > 0 and P (4.5 < Y < 5) > 0 because this is a continuous
distribution.
The accepted way to circumvent this problem is to use a continuity correction which
corrects for the effects of the transition from a discrete Bin(n, π) distribution to a
continuous N (nπ, nπ(1 − π)) distribution.

Continuity correction

This technique involves representing each discrete binomial value x, for 0 ≤ x ≤ n, by the continuous interval (x − 0.5, x + 0.5). Great care is needed to determine which x values are included in the required probability. Suppose we are approximating X ∼ Bin(n, π) with Y ∼ N(nπ, nπ(1 − π)), then:

P (X < 4) = P (X ≤ 3) ⇒ P (Y < 3.5) (since 4 is excluded)


P (X ≤ 4) = P (X < 5) ⇒ P (Y < 4.5) (since 4 is included)
P (1 ≤ X < 6) = P (1 ≤ X ≤ 5) ⇒ P (0.5 < Y < 5.5) (since 1 to 5 are included).


Example 4.17 In the UK general election in May 2010, the Conservative Party
received 36.1% of the votes. We carry out an opinion poll in November 2014, where
we survey 1,000 people who say they voted in 2010, and ask who they would vote for
if a general election was held now. Let X denote the number of people who say they
would now vote for the Conservative Party.
Suppose we assume that X ∼ Bin(1,000, 0.361).

1. What is the probability that X ≥ 400?


Using the normal approximation, noting n = 1,000 and π = 0.361, with
Y ∼ N (1,000 × 0.361, 1,000 × 0.361 × 0.639) = N (361, 230.68), we get:

P(X ≥ 400) ≈ P(Y ≥ 399.5)
           = P((Y − 361)/√230.68 ≥ (399.5 − 361)/√230.68)
           = P(Z ≥ 2.53)
           = 1 − Φ(2.53)
           = 0.0057.

The exact probability from the binomial distribution is P (X ≥ 400) = 0.0059.


Without the continuity correction, the normal approximation would give 0.0051.

2. What is the largest number x for which P (X ≤ x) < 0.01?


We need the largest x which satisfies:
 
P(X ≤ x) ≈ P(Y ≤ x + 0.5) = P(Z ≤ (x + 0.5 − 361)/√230.68) < 0.01.
According to Table 3 of Murdoch and Barnes’ Statistical Tables, the smallest z
which satisfies P (Z ≥ z) < 0.01 is z = 2.33, so the largest z which satisfies
P (Z ≤ z) < 0.01 is z = −2.33. We then need to solve:
(x + 0.5 − 361)/√230.68 ≤ −2.33

which gives x ≤ 325.1. The largest integer value which satisfies this is x = 325.
Therefore, P (X ≤ x) < 0.01 for all x ≤ 325.
The sum of the exact binomial probabilities from 0 to x is 0.0093 for x = 325,
and 0.011 for x = 326. The normal approximation gives exactly the correct
answer in this instance.

3. Suppose that 300 respondents in the actual survey say they would vote for the
Conservative Party now. What do you conclude from this?
From the answer to Question 2, we know that P (X ≤ 300) < 0.01, if π = 0.361.
In other words, if the Conservatives’ support remains 36.1%, we would be very
unlikely to get a random sample where only 300 (or fewer) respondents would
say they would vote for the Conservative Party.


Now X = 300 is actually observed. We can then conclude one of two things (if
we exclude other possibilities, such as a biased sample or lying by the
respondents).
(a) The Conservatives’ true level of support is still 36.1% (or even higher), but
by chance we ended up with an unusual sample with only 300 of their
supporters.
(b) The Conservatives’ true level of support is currently less than 36.1% (in
which case getting 300 in the sample would be more probable).

Here (b) seems a more plausible conclusion than (a). This kind of reasoning is
the basis of statistical significance tests.
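As a non-examinable aside, the exact binomial probability and the continuity-corrected normal approximation in Question 1 can be compared with a few lines of Python using SciPy (assuming it is installed):

from scipy import stats

n, pi = 1000, 0.361
exact = 1 - stats.binom.cdf(399, n, pi)               # P(X >= 400) exactly, ≈ 0.0059

mu, var = n * pi, n * pi * (1 - pi)                   # 361 and 230.68
approx = 1 - stats.norm.cdf(399.5, mu, var ** 0.5)    # with continuity correction, ≈ 0.0057

print(exact, approx)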

4.6 Overview of chapter


This chapter has introduced some common discrete and continuous probability
distributions. Their properties, uses and applications have been discussed. The
relationships between some of these distributions have also been covered.

4.7 Key terms and concepts


Bernoulli distribution Binomial distribution
Central limit theorem Continuity correction
Continuous uniform distribution Discrete uniform distribution
Exponential distribution Moment
Moment generating function Normal distribution
Parameter Poisson distribution
Standardised variable Standard normal distribution
z-score

There are two kinds of statistics, the kind you look up and the kind you make
up.
(Rex Stout)


Chapter 5
Multivariate random variables

5.1 Synopsis of chapter


Almost all applications of statistical methods deal with several measurements on the
same, or connected, items. To think statistically about several measurements on a
randomly selected item, you must understand some of the concepts for joint
distributions of random variables.

5.2 Learning outcomes


After completing this chapter, you should be able to:

arrange the probabilities for a discrete bivariate distribution in tabular form


define marginal and conditional distributions, and determine them for a discrete
bivariate distribution
recall how to define and determine independence for two random variables
define and compute expected values for functions of two random variables and
demonstrate how to prove simple properties of expected values
provide the definition of covariance and correlation for two random variables and
calculate these.

5.3 Introduction
So far, we have considered univariate situations, that is one random variable at a time.
Now we will consider multivariate situations, that is two or more random variables at
once, and together.
In particular, we consider two somewhat different types of multivariate situations.

1. Several different variables – such as the height and weight of a person.


2. Several observations of the same variable, considered together – such as the heights
of all n people in a sample.

Suppose that X1 , X2 , . . . , Xn are random variables, then the vector:

X = (X1, X2, . . . , Xn)′


is a multivariate random variable (here n-variate), also known as a random vector. Its possible values are the vectors:

x = (x1, x2, . . . , xn)′

where each xi is a possible value of the random variable Xi , for i = 1, 2, . . . , n.


The joint probability distribution of a multivariate random variable X is defined by
the possible values x, and their probabilities.
For now, we consider just the simplest multivariate case, a bivariate random variable
where n = 2. This is sufficient for introducing most of the concepts of multivariate
random variables.
For notational simplicity, we will use X and Y instead of X1 and X2 . A bivariate
random variable is then the pair (X, Y ).

Example 5.1 In this chapter, we consider the following examples.


Discrete bivariate example – for a football match:

X = the number of goals scored by the home team

Y = the number of goals scored by the visiting (away) team.

Continuous bivariate example – for a person:

X = the person’s height

Y = the person’s weight.

5.4 Joint probability functions


When the random variables in (X1 , X2 , . . . , Xn ) are either all discrete or all continuous,
we also call the multivariate random variable either discrete or continuous, respectively.
For a discrete multivariate random variable, the joint probability distribution is
described by the joint probability function, defined as:

p(x1 , x2 , . . . , xn ) = P (X1 = x1 , X2 = x2 , . . . , Xn = xn )

for all vectors (x1 , x2 , . . . , xn ) of n real numbers. The value p(x1 , x2 , . . . , xn ) of the joint
probability function is itself a single number, not a vector.
In the bivariate case, this is:

p(x, y) = P (X = x, Y = y)

which we sometimes write as pX,Y (x, y) to make the random variables clear.


Example 5.2 Consider a randomly selected football match in the English Premier
League (EPL), and the two random variables:

X = the number of goals scored by the home team

Y = the number of goals scored by the visiting (away) team.

Suppose both variables have possible values 0, 1, 2 and 3 (to keep this example
simple, we have recorded the small number of scores of 4 or greater also as 3).
Consider the joint distribution of (X, Y ). We use probabilities based on data from
the 2009–10 EPL season.
Suppose the values of pX,Y (x, y) = p(x, y) = P (X = x, Y = y) are the following:

Y =y
X=x 0 1 2 3
0 0.100 0.031 0.039 0.031
1 0.100 0.146 0.092 0.015
2 0.085 0.108 0.092 0.023
3 0.062 0.031 0.039 0.006

and p(x, y) = 0 for all other (x, y).


Note that this satisfies the conditions for a probability function.

1. p(x, y) ≥ 0 for all (x, y).


2. Σ_{x=0}^{3} Σ_{y=0}^{3} p(x, y) = 0.100 + 0.031 + · · · + 0.006 = 1.000.

The joint probability function gives probabilities of values of (X, Y ), for example:

A 1–1 draw, which is the most probable single result, has probability

P (X = 1, Y = 1) = p(1, 1) = 0.146.

The match is a draw with probability:

P (X = Y ) = p(0, 0) + p(1, 1) + p(2, 2) + p(3, 3) = 0.344.

The match is won by the home team with probability:

P (X > Y ) = p(1, 0) + p(2, 0) + p(2, 1) + p(3, 0) + p(3, 1) + p(3, 2) = 0.425.

More than 4 goals are scored in the match with probability:

P (X + Y > 4) = p(2, 3) + p(3, 2) + p(3, 3) = 0.068.
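A convenient way to experiment with this joint pf (not examinable) is to store it as a matrix in Python with NumPy, with rows indexed by x and columns by y; the event probabilities above then become simple array operations:

import numpy as np

# Rows are x = 0, 1, 2, 3 (home goals); columns are y = 0, 1, 2, 3 (away goals).
p = np.array([[0.100, 0.031, 0.039, 0.031],
              [0.100, 0.146, 0.092, 0.015],
              [0.085, 0.108, 0.092, 0.023],
              [0.062, 0.031, 0.039, 0.006]])

print(p.sum())                  # 1.0, so this is a valid joint pf
print(np.trace(p))              # P(X = Y) = 0.344, a draw
print(np.tril(p, k=-1).sum())   # P(X > Y) = 0.425, a home win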


5.5 Marginal distributions

Consider a multivariate discrete random variable X = (X1 , X2 , . . . , Xn ).


The marginal distribution of a subset of the variables in X is the (joint) distribution
of this subset. The joint pf of these variables (the marginal pf) is obtained by
summing the joint pf of X over the variables which are not included in the subset.

Example 5.3 Consider X = (X1, X2, X3, X4), and the marginal distribution of the subset (X1, X2). The marginal pf of (X1, X2) is:

p_{1,2}(x1, x2) = P(X1 = x1, X2 = x2) = Σ_{x3} Σ_{x4} p(x1, x2, x3, x4)

where the sum is of the values of the joint pf of (X1 , X2 , X3 , X4 ) over all possible
values of X3 and X4 .

The simplest marginal distributions are those of individual variables in the multivariate
random variable.
The marginal pf is then obtained by summing the joint pf over all the other variables.
The resulting marginal distribution is univariate, and its pf is a univariate pf.

Marginal distributions for discrete bivariate distributions

For the bivariate distribution of (X, Y ) the univariate marginal distributions are
those of X and Y individually. Their marginal pfs are:
pX(x) = Σ_y p(x, y)   and   pY(y) = Σ_x p(x, y).

Example 5.4 Continuing with the football example introduced in Example 5.2, the
joint and marginal probability functions are:

Y =y
X=x 0 1 2 3 pX (x)
0 0.100 0.031 0.039 0.031 0.201
1 0.100 0.146 0.092 0.015 0.353
2 0.085 0.108 0.092 0.023 0.308
3 0.062 0.031 0.039 0.006 0.138
pY (y) 0.347 0.316 0.262 0.075 1.000

and p(x, y) = pX (x) = pY (y) = 0 for all other (x, y).


For example:
pX(0) = Σ_{y=0}^{3} p(0, y) = p(0, 0) + p(0, 1) + p(0, 2) + p(0, 3)
      = 0.100 + 0.031 + 0.039 + 0.031
      = 0.201.

Even for a multivariate random variable, expected values E(Xi ), variances Var(Xi ) and
medians of individual variables are obtained from the univariate (marginal)
distributions of Xi , as defined in Chapter 3.

Example 5.5 Consider again the football example.

The expected number of goals scored by the home team is:

E(X) = Σ_x x pX(x) = 0 × 0.201 + 1 × 0.353 + 2 × 0.308 + 3 × 0.138 = 1.383.

The expected number of goals scored by the visiting team is:

E(Y) = Σ_y y pY(y) = 0 × 0.347 + 1 × 0.316 + 2 × 0.262 + 3 × 0.075 = 1.065.
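The marginal pfs and these expected values follow from the same NumPy array by summing over rows or columns (a non-examinable illustration):

import numpy as np

p = np.array([[0.100, 0.031, 0.039, 0.031],
              [0.100, 0.146, 0.092, 0.015],
              [0.085, 0.108, 0.092, 0.023],
              [0.062, 0.031, 0.039, 0.006]])
x = y = np.arange(4)

p_x = p.sum(axis=1)   # marginal pf of X: [0.201, 0.353, 0.308, 0.138]
p_y = p.sum(axis=0)   # marginal pf of Y: [0.347, 0.316, 0.262, 0.075]

print(x @ p_x)        # E(X) = 1.383
print(y @ p_y)        # E(Y) = 1.065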

5.6 Continuous multivariate distributions


If all the random variables in X = (X1 , X2 , . . . , Xn ) are continuous, the joint
distribution of X is specified by its joint probability density function
f (x1 , x2 , . . . , xn ).
Marginal distributions are defined as in the discrete case, but with integration instead
of summation.
There will be no questions on continuous multivariate joint probability density
functions in the examination. Only discrete multivariate joint probability functions may
appear in the examination. So just a brief example of the continuous case is given here,
to give you an idea of such distributions.

Example 5.6 For a randomly selected man (aged over 16) in England, let:

X = his height (in cm)

Y = his weight (in kg).

The univariate marginal distributions of X and Y are approximately normal, with:

X ∼ N (174.9, (7.39)2 ) and Y ∼ N (84.2, (15.63)2 )


and the bivariate joint distribution of (X, Y ) is a bivariate normal distribution.


Plots of the univariate and bivariate probability density functions are shown in
Figures 5.1, 5.2 and 5.3.

Figure 5.1: Univariate marginal pdfs of height (cm) and weight (kg) for Example 5.6.


Figure 5.2: Bivariate joint pdf (contour plot) of height (cm) and weight (kg) for Example 5.6.

5.7 Conditional distributions


Consider discrete variables X and Y , with joint pf p(x, y) = pX,Y (x, y) and marginal pfs
pX (x) and pY (y), respectively.


Figure 5.3: Bivariate joint pdf of height (cm) and weight (kg) for Example 5.6.

Conditional distributions of discrete bivariate distributions

Let x be one possible value of X, for which pX (x) > 0. The conditional
distribution of Y given that X = x is the discrete probability distribution with
the pf:

pY|X(y | x) = P(Y = y | X = x) = P(X = x and Y = y)/P(X = x) = pX,Y(x, y)/pX(x)

for any value y.


This is the conditional probability function of Y given X = x.

Example 5.7 Recall that in the football example the joint and marginal pfs were:

Y =y
X=x 0 1 2 3 pX (x)
0 0.100 0.031 0.039 0.031 0.201
1 0.100 0.146 0.092 0.015 0.353
2 0.085 0.108 0.092 0.023 0.308
3 0.062 0.031 0.039 0.006 0.138
pY (y) 0.347 0.316 0.262 0.075 1.000

We can now calculate the conditional pf of Y given X = x for each x, i.e. of away
goals given home goals. For example:

pY|X(y | 0) = pY|X(y | X = 0) = pX,Y(0, y)/pX(0) = pX,Y(0, y)/0.201.

So, for example, pY |X (1 | 0) = pX,Y (0, 1)/0.201 = 0.031/0.201 = 0.154.


Calculating these for each value of x gives:

pY |X (y | x) when y is:
X=x 0 1 2 3 Sum
0 0.498 0.154 0.194 0.154 1.00
1 0.283 0.414 0.261 0.042 1.00
2 0.276 0.351 0.299 0.075 1.00
3 0.449 0.225 0.283 0.043 1.00

So, for example:

if the home team scores 0 goals, the probability that the visiting team scores 1
goal is pY |X (1 | 0) = 0.154

if the home team scores 1 goal, the probability that the visiting team wins the
match is pY |X (2 | 1) + pY |X (3 | 1) = 0.261 + 0.042 = 0.303.

5.7.1 Properties of conditional distributions


Each different value of x defines a different conditional distribution and conditional pf
pY |X (y | x). Each value of pY |X (y | x) is a conditional probability of the kind previously
defined. Defining events A = {Y = y} and B = {X = x}, then:
P(A | B) = P(A ∩ B)/P(B) = P(Y = y and X = x)/P(X = x)
         = P(Y = y | X = x)
         = pX,Y(x, y)/pX(x)
         = pY|X(y | x).
A conditional distribution is itself a probability distribution, and a conditional pf is a
pf. Clearly, pY |X (y | x) ≥ 0 for all y, and:
Σ_y pY|X(y | x) = (Σ_y pX,Y(x, y))/pX(x) = pX(x)/pX(x) = 1.

The conditional distribution and pf of X given Y = y (for any y such that pY (y) > 0) is
defined similarly, with the roles of X and Y reversed:
pX|Y(x | y) = pX,Y(x, y)/pY(y)
for any value x.
Conditional distributions are general and are not limited to the bivariate case. If X
and/or Y are vectors of random variables, the conditional pf of Y given X = x is:
pY|X(y | x) = pX,Y(x, y)/pX(x)


where pX,Y (x, y) is the joint pf of the random vector (X, Y), and pX (x) is the marginal
pf of the random vector X.

5.7.2 Conditional mean and variance


Since a conditional distribution is a probability distribution, it also has a mean
(expected value) and variance (and median etc.).
These are known as the conditional mean and conditional variance, and are
denoted, respectively, by:

EY |X (Y | x) and VarY |X (Y | x).

Example 5.8 In the football example, we have:


E_{Y|X}(Y | 0) = Σ_y y pY|X(y | 0) = 0 × 0.498 + 1 × 0.154 + 2 × 0.194 + 3 × 0.154 = 1.00.

So, if the home team scores 0 goals, the expected number of goals by the visiting
team is EY |X (Y | 0) = 1.00.
EY |X (Y | x) for x = 1, 2 and 3 are obtained similarly.
Here X is the number of goals by the home team, and Y is the number of goals by
the visiting team:

pY |X (y | x) when y is:
X=x 0 1 2 3 EY |X (Y | x)
0 0.498 0.154 0.194 0.154 1.00
1 0.283 0.414 0.261 0.042 1.06
2 0.276 0.351 0.299 0.075 1.17
3 0.449 0.225 0.283 0.043 0.92

Plots of the conditional means are shown in Figure 5.4.
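Dividing each row of the joint pf by the corresponding marginal probability pX(x) reproduces the conditional pfs and conditional means above; here is a brief, non-examinable NumPy sketch:

import numpy as np

p = np.array([[0.100, 0.031, 0.039, 0.031],
              [0.100, 0.146, 0.092, 0.015],
              [0.085, 0.108, 0.092, 0.023],
              [0.062, 0.031, 0.039, 0.006]])
y = np.arange(4)

p_x = p.sum(axis=1, keepdims=True)
p_y_given_x = p / p_x                # row x holds p_{Y|X}(y | x)

print(p_y_given_x.round(3))          # each row sums to 1
print(p_y_given_x @ y)               # conditional means, approximately [1.00, 1.06, 1.17, 0.92]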

5.7.3 Continuous conditional distributions


Suppose X and Y are continuous, with joint pdf fX,Y (x, y) and marginal pdfs fX (x)
and fY (y), respectively.
The conditional distribution of Y given that X = x is the continuous probability
distribution with the pdf:
fY|X(y | x) = fX,Y(x, y)/fX(x)
which is defined if fX (x) > 0. For a conditional distribution of X given Y = y,
fX|Y (x | y) is defined similarly, with the roles of X and Y reversed.
Unlike in the discrete case, this is not a conditional probability. However, fY |X (y | x) is a
pdf of a continuous random variable, so the conditional distribution is itself a
continuous probability distribution.


Figure 5.4: Conditional means of away goals given home goals, E(Y | x), for Example 5.8.

Example 5.9 For a randomly selected man (aged over 16) in England, consider
X = height (in cm) and Y = weight (in kg). The joint distribution of (X, Y ) is
approximately bivariate normal (see Example 5.6).
The conditional distribution of Y given X = x is then a normal distribution for each
x, with the following parameters:

EY |X (Y | x) = −58.1 + 0.81x and VarY |X (Y | x) = 208.

In other words, the conditional mean depends on x, but the conditional variance
does not. For example:

EY |X (Y | 160) = 71.5 and EY | X (Y | 190) = 95.8.

For women, this conditional distribution is normal with the following parameters:

EY |X (Y | x) = −23.0 + 0.58x and VarY |X (Y | x) = 221.

The conditional means are shown in Figure 5.5.

5.8 Covariance and correlation


Suppose that the conditional distributions pY |X (y | x) of a random variable Y given
different values x of a random variable X are not all the same, i.e. the conditional
distribution of Y ‘depends on’ the value of X.
Therefore, there is said to be an association (or dependence) between X and Y .
If two random variables are associated (dependent), knowing the value of one (for
example, X) will help to predict the likely value of the other (for example, Y ).


Figure 5.5: Conditional means of weight (kg) given height (cm), for men and for women, for Example 5.9.

We next consider two measures of association which are used to summarise the
strength of an association in a single number: covariance and correlation (scaled
covariance).

5.8.1 Covariance

Definition of covariance

The covariance of two random variables X and Y is defined as:

Cov(X, Y ) = Cov(Y, X) = E((X − E(X))(Y − E(Y ))).

This can also be expressed as the more convenient formula:

Cov(X, Y ) = E(XY ) − E(X) E(Y ).

This result will be proved later.


(Note that these involve expected values of products of two random variables, which
have not been defined yet. We will do so later in this chapter.)

Properties of covariance

Suppose X and Y are random variables, and a, b, c and d are constants.

The covariance of a random variable with itself is the variance of the random
variable:

Cov(X, X) = E(XX) − E(X) E(X) = E(X 2 ) − (E(X))2 = Var(X).


The covariance of a random variable and a constant is 0:


Cov(a, X) = E(aX) − E(a) E(X) = a E(X) − a E(X) = 0.

The covariance of linear transformations of random variables is:


Cov(aX + b, cY + d) = ac Cov(X, Y ).

5.8.2 Correlation

Definition of correlation

The correlation of two random variables X and Y is defined as:


Corr(X, Y) = Corr(Y, X) = Cov(X, Y)/√(Var(X) Var(Y)) = Cov(X, Y)/(sd(X) sd(Y)).

When Cov(X, Y ) = 0, then Corr(X, Y ) = 0. When this is the case, we say that X
and Y are uncorrelated.

Correlation and covariance are measures of the strength of the linear (‘straight-line’)
association between X and Y .
The further the correlation is from 0, the stronger is the linear association. The most
extreme possible values of correlation are −1 and +1, which are obtained when Y is an
exact linear function of X.
Corr(X, Y ) = +1 when Y = aX + b with a > 0.
Corr(X, Y ) = −1 when Y = aX + b with a < 0.

If Corr(X, Y ) > 0, we say that X and Y are positively correlated.


If Corr(X, Y ) < 0, we say that X and Y are negatively correlated.

Example 5.10 Recall the joint pf pX,Y (x, y) in the football example:

                              Y = y
X = x          0           1           2           3
  0        xy = 0      xy = 0      xy = 0      xy = 0
            0.100       0.031       0.039       0.031
  1        xy = 0      xy = 1      xy = 2      xy = 3
            0.100       0.146       0.092       0.015
  2        xy = 0      xy = 2      xy = 4      xy = 6
            0.085       0.108       0.092       0.023
  3        xy = 0      xy = 3      xy = 6      xy = 9
            0.062       0.031       0.039       0.006

In each cell, the upper entry is the value of xy for that combination of x and y, and the lower entry is its probability pX,Y(x, y).
From these and their probabilities, we can derive the probability distribution of XY .


For example:

P (XY = 2) = pX,Y (1, 2) + pX,Y (2, 1) = 0.092 + 0.108 = 0.200.

The pf of the product XY is:

XY = xy 0 1 2 3 4 6 9
P (XY = xy) 0.448 0.146 0.200 0.046 0.092 0.062 0.006

Hence:

E(XY ) = 0 × 0.448 + 1 × 0.146 + 2 × 0.200 + · · · + 9 × 0.006 = 1.478.

From the marginal pfs pX (x) and pY (y) we get:

E(X) = 1.383
E(Y ) = 1.065
E(X 2 ) = 2.827
E(Y 2 ) = 2.039
Var(X) = 2.827 − (1.383)2 = 0.9143
Var(Y ) = 2.039 − (1.065)2 = 0.9048.

Therefore, the covariance of X and Y is:

Cov(X, Y ) = E(XY ) − E(X) E(Y ) = 1.478 − 1.383 × 1.065 = 0.00511

and the correlation is:


Corr(X, Y) = Cov(X, Y)/√(Var(X) Var(Y)) = 0.00511/√(0.9143 × 0.9048) = 0.00562.

The numbers of goals scored by the home and visiting teams are very nearly
uncorrelated (i.e. not linearly associated).
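These covariance and correlation calculations can be reproduced in a few lines of Python with NumPy (not examinable):

import numpy as np

p = np.array([[0.100, 0.031, 0.039, 0.031],
              [0.100, 0.146, 0.092, 0.015],
              [0.085, 0.108, 0.092, 0.023],
              [0.062, 0.031, 0.039, 0.006]])
x = y = np.arange(4)

E_X = x @ p.sum(axis=1)                       # 1.383
E_Y = y @ p.sum(axis=0)                       # 1.065
E_XY = x @ p @ y                              # sum of x*y*p(x, y) over all x, y = 1.478

cov = E_XY - E_X * E_Y                        # ≈ 0.0051
var_x = (x**2) @ p.sum(axis=1) - E_X**2       # ≈ 0.9143
var_y = (y**2) @ p.sum(axis=0) - E_Y**2       # ≈ 0.9048
print(cov / (var_x * var_y) ** 0.5)           # Corr(X, Y) ≈ 0.0056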

5.8.3 Sample covariance and correlation

We have just introduced covariance and correlation, two new characteristics of


probability distributions (population distributions). We now discuss their sample
equivalents.
Let (X1 , Y1 ), (X2 , Y2 ), . . . , (Xn , Yn ) be a sample of n pairs of observed values of two
random variables X and Y .
We can use these observations to calculate sample versions of the covariance and
correlation between X and Y . These are measures of association in the sample, i.e.
descriptive statistics. They are also estimates of the corresponding population quantities
Cov(X, Y ) and Corr(X, Y ). The uses of these sample measures will be discussed in
more detail later in the course.


Sample covariance

The sample covariance of random variables X and Y is calculated as:


Ĉov(X, Y) = (1/(n − 1)) Σ_{i=1}^{n} (Xi − X̄)(Yi − Ȳ)

where X̄ and Ȳ are the sample means of X and Y , respectively.

Sample correlation

The sample correlation of random variables X and Y is calculated as:


r = Ĉov(X, Y)/(S_X S_Y) = Σ_{i=1}^{n} (Xi − X̄)(Yi − Ȳ) / √( Σ_{i=1}^{n} (Xi − X̄)² Σ_{i=1}^{n} (Yi − Ȳ)² )

where SX and SY are the sample standard deviations of X and Y , respectively.

r is always between −1 and +1, and is equal to −1 or +1 only if X and Y are perfectly linearly related in the sample.

r = 0 if X and Y are uncorrelated (not linearly related) in the sample.
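The formulas above translate directly into code. The following non-examinable Python/NumPy sketch uses a small made-up data set (the numbers are hypothetical, purely for illustration):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

cov_xy = ((x - x.mean()) * (y - y.mean())).sum() / (n - 1)   # sample covariance
r = cov_xy / (x.std(ddof=1) * y.std(ddof=1))                 # sample correlation

print(cov_xy, r)                     # r is close to +1 for these strongly related values
print(np.corrcoef(x, y)[0, 1])       # the same r, via NumPy's built-in function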

Example 5.11 Figure 5.6 shows different examples of scatterplots of observations


of X and Y , and different values of the sample correlation, r. The line shown in each
plot is the best-fitting (least squares) line for the scatterplot (which will be
introduced later in the course).

In (a), X and Y are perfectly linearly related, and r = 1.

Plots (b), (c) and (e) show relationships of different strengths.

In (c), the variables are negatively correlated.

In (d), there is no linear relationship, and r = 0.

Plot (f) shows that r can be 0 even if two variables are clearly related, if that
relationship is not linear.


Figure 5.6: Scatterplots depicting various sample correlations as discussed in Example 5.11: (a) r = 1, (b) r = 0.85, (c) r = −0.5, (d) r = 0, (e) r = 0.92, (f) r = 0.

5.9 Independent random variables


Two discrete random variables X and Y are associated if pY |X (y | x) depends on x.
What if it does not, i.e. what if:
pY|X(y | x) = pX,Y(x, y)/pX(x) = pY(y)   for all x and y
so that knowing the value of X does not help to predict Y ?
This implies that:
pX,Y (x, y) = pX (x) pY (y) for all x, y. (5.1)
X and Y are independent of each other if and only if (5.1) is true.

Independent random variables

In general, suppose that X1 , X2 , . . . , Xn are discrete random variables. These are


independent if and only if their joint pf is:

p(x1 , x2 , . . . , xn ) = p1 (x1 ) p2 (x2 ) · · · pn (xn )

for all numbers x1 , x2 , . . . , xn , where p1 (x1 ), p2 (x2 ), . . . , pn (xn ) are the univariate
marginal pfs of X1 , X2 , . . . , Xn , respectively.
Similarly, continuous random variables X1 , X2 , . . . , Xn are independent if and only
if their joint pdf is:

f (x1 , x2 , . . . , xn ) = f1 (x1 ) f2 (x2 ) · · · fn (xn )

for all x1 , x2 , . . . , xn , where f1 (x1 ), f2 (x2 ), . . . , fn (xn ) are the univariate marginal pdfs
of X1 , X2 , . . . , Xn , respectively.


If two random variables are independent, they are also uncorrelated, i.e. we have:
Cov(X, Y ) = 0 and Corr(X, Y ) = 0.
This will be proved later.
The reverse is not true, i.e. two random variables can be dependent even when their
correlation is 0. This can happen when the dependence is non-linear.

Example 5.12 The football example is an instance of this. The conditional


distributions pY |X (y | x) are clearly not all the same, but the correlation is very
nearly 0 (see Example 5.10).
Another example is plot (f) in Figure 5.6, where the dependence is not linear, but
quadratic.

5.9.1 Joint distribution of independent random variables


When random variables are independent, we can easily derive their joint pf or pdf as
the product of their univariate marginal distributions. This is particularly simple if all
the marginal distributions are the same.

Example 5.13 Suppose that X1 , X2 , . . . , Xn are independent, and each of them


follows the Poisson distribution with the same mean λ. Therefore, the marginal pf of
each Xi is:
p(xi) = e^(−λ) λ^(xi) / xi!

and the joint pf of the random variables is:

p(x1, x2, . . . , xn) = p(x1) p(x2) · · · p(xn) = Π_{i=1}^{n} e^(−λ) λ^(xi) / xi! = e^(−nλ) λ^(Σ_i xi) / Π_i xi!.

Example 5.14 For a continuous example, suppose that X1 , X2 , . . . , Xn are


independent, and each of them follows a normal distribution with the same mean µ
and same variance σ 2 . Therefore, the marginal pdf of each Xi is:
f(xi) = (1/√(2πσ²)) exp(−(xi − µ)²/(2σ²))

and the joint pdf of the variables is:

f(x1, x2, . . . , xn) = f(x1) f(x2) · · · f(xn) = Π_{i=1}^{n} f(xi)
                     = Π_{i=1}^{n} (1/√(2πσ²)) exp(−(xi − µ)²/(2σ²))
                     = (2πσ²)^(−n/2) exp(−(1/(2σ²)) Σ_{i=1}^{n} (xi − µ)²).


5.10 Sums and products of random variables


Suppose X1 , X2 , . . . , Xn are random variables. We now go from the multivariate setting
back to the univariate setting, by considering univariate functions of X1 , X2 , . . . , Xn . In
particular, we consider sums and products like:
Σ_{i=1}^{n} ai Xi + b = a1 X1 + a2 X2 + · · · + an Xn + b     (5.2)

and:

Π_{i=1}^{n} ai Xi = (a1 X1)(a2 X2) · · · (an Xn)

where a1 , a2 , . . . , an and b are constants.


Each such sum or product is itself a univariate random variable. The probability
distribution of such a function depends on the joint distribution of X1 , X2 , . . . , Xn .

Example 5.15 In the football example, the sum Z = X + Y is the total number of
goals scored in a match.
Its probability function is obtained from the joint pf pX,Y (x, y), that is:

Z=z 0 1 2 3 4 5 6
pZ (z) 0.100 0.131 0.270 0.293 0.138 0.062 0.006

For example, pZ(1) = pX,Y(0, 1) + pX,Y(1, 0) = 0.031 + 0.100 = 0.131. The mean of Z is then E(Z) = Σ_z z pZ(z) = 2.448.

Another example is the distribution of XY (see Example 5.10).

However, what can we say about such distributions in general, in cases where we cannot
derive them as easily?

5.10.1 Distributions of sums and products


General results for the distributions of sums and products of random variables are
available as follows:

                        Sums                                     Products
Mean                    Yes                                      Only for independent variables
Variance                Yes                                      No
Distributional form     Normal: yes                              No
                        Some other distributions:
                        only for independent variables


5.10.2 Expected values and variances of sums of random variables
We state, without proof, the following important result.
If X1 , X2 , . . . , Xn are random variables with means E(X1 ), E(X2 ), . . . , E(Xn ),
respectively, and a1 , a2 , . . . , an and b are constants, then:
E(Σ_{i=1}^{n} ai Xi + b) = E(a1 X1 + a2 X2 + · · · + an Xn + b)
                        = a1 E(X1) + a2 E(X2) + · · · + an E(Xn) + b
                        = Σ_{i=1}^{n} ai E(Xi) + b.     (5.3)

Two simple special cases of this, when n = 2, are:

E(X + Y) = E(X) + E(Y), obtained by choosing X1 = X, X2 = Y, a1 = a2 = 1 and b = 0

E(X − Y) = E(X) − E(Y), obtained by choosing X1 = X, X2 = Y, a1 = 1, a2 = −1 and b = 0.

Example 5.16 In the football example, we have previously shown that


E(X) = 1.383, E(Y ) = 1.065 and E(X + Y ) = 2.448. So E(X + Y ) = E(X) + E(Y ),
as the theorem claims.

If X1, X2, . . . , Xn are random variables with variances Var(X1), Var(X2), . . . , Var(Xn), respectively, and covariances Cov(Xi, Xj) for i ≠ j, and a1, a2, . . . , an and b are constants, then:

Var(Σ_{i=1}^{n} ai Xi + b) = Σ_{i=1}^{n} ai² Var(Xi) + 2 Σ Σ_{i<j} ai aj Cov(Xi, Xj).     (5.4)

In particular, for n = 2:

Var(X + Y ) = Var(X) + Var(Y ) + 2 × Cov(X, Y )


Var(X − Y ) = Var(X) + Var(Y ) − 2 × Cov(X, Y ).

If X1, X2, . . . , Xn are independent random variables, then Cov(Xi, Xj) = 0 for all i ≠ j, and so (5.4) simplifies to:

Var(Σ_{i=1}^{n} ai Xi) = Σ_{i=1}^{n} ai² Var(Xi).     (5.5)

In particular, for n = 2, when X and Y are independent:

Var(X + Y ) = Var(X) + Var(Y )


Var(X − Y ) = Var(X) + Var(Y ).


These results also hold whenever Cov(Xi, Xj) = 0 for all i ≠ j, even if the random variables are not independent.

5.10.3 Expected values of products of independent random variables

If X1, X2, . . . , Xn are independent random variables and a1, a2, . . . , an are constants, then:

E(Π_{i=1}^{n} ai Xi) = E((a1 X1)(a2 X2) · · · (an Xn)) = Π_{i=1}^{n} ai E(Xi).

In particular, when X and Y are independent:

E(XY ) = E(X) E(Y ).

There is no corresponding simple result for the means of products of dependent random
variables. There is also no simple result for the variances of products of random
variables, even when they are independent.

5.10.4 Some proofs of previous results


With these new results, we can now prove some results which were stated earlier.
Recall:
Var(X) = E(X 2 ) − (E(X))2 .
Proof:

Var(X) = E((X − E(X))2 )


= E(X 2 − 2E(X)X + (E(X))2 )
= E(X 2 ) − 2 E(X) E(X) + (E(X))2
= E(X 2 ) − 2(E(X))2 + (E(X))2
= E(X 2 ) − (E(X))2

using (5.3), with X1 = X 2 , X2 = X, a1 = 1, a2 = −2E(X) and b = (E(X))2 .




Recall:
Cov(X, Y ) = E(XY ) − E(X) E(Y ).
Proof:

Cov(X, Y ) = E((X − E(X))(Y − E(Y )))


= E(XY − E(Y )X − E(X)Y + E(X) E(Y ))
= E(XY ) − E(Y ) E(X) − E(X) E(Y ) + E(X) E(Y )
= E(XY ) − E(X) E(Y )


using (5.3), with X1 = XY , X2 = X, X3 = Y , a1 = 1, a2 = −E(Y ), a3 = −E(X) and


b = E(X) E(Y ).


Recall that if X and Y are independent, then:

Cov(X, Y ) = Corr(X, Y ) = 0.

Proof:

Cov(X, Y ) = E(XY ) − E(X) E(Y ) = E(X) E(Y ) − E(X) E(Y ) = 0

since E(XY ) = E(X) E(Y ) when X and Y are independent.


Since Corr(X, Y ) = Cov(X, Y )/[sd(X) sd(Y )], Corr(X, Y ) = 0 whenever Cov(X, Y ) = 0.


5.10.5 Distributions of sums of random variables


We now know the expected value and variance of the sum:

a1 X1 + a2 X2 + · · · + an Xn + b

whatever the joint distribution of X1 , X2 , . . . , Xn . This is usually all we can say about
the distribution of this sum.
In particular, the form of the distribution of the sum (i.e. its pf/pdf) depends on the
joint distribution of X1 , X2 , . . . , Xn , and there are no simple general results about that.
For example, even if X and Y have distributions from the same family, the distribution
of X + Y is often not from that same family. However, such results are available for a
few special cases.

Sums of independent binomial and Poisson random variables

Suppose X1 , X2 , . . . , Xn are random variables, and we consider the unweighted sum:


Σ_{i=1}^{n} Xi = X1 + X2 + · · · + Xn.

That is, the general sum given by (5.2), with a1 = a2 = · · · = an = 1 and b = 0.


The following results hold when the random variables X1 , X2 , . . . , Xn are independent,
but not otherwise.

If Xi ∼ Bin(ni, π), then Σ_i Xi ∼ Bin(Σ_i ni, π).

If Xi ∼ Pois(λi), then Σ_i Xi ∼ Pois(Σ_i λi).


Application to the binomial distribution

An easy proof that the mean and variance of X ∼ Bin(n, π) are E(X) = nπ and
Var(X) = nπ(1 − π) is as follows.

1. Let Z1 , Z2 , . . . , Zn be independent random variables, each distributed as


Zi ∼ Bernoulli(π) = Bin(1, π).
2. It is easy to show that E(Zi ) = π and Var(Zi ) = π(1 − π) for each i = 1, 2, . . . , n
(see (4.3) and (4.4)).
3. Also, Σ_{i=1}^{n} Zi = X ∼ Bin(n, π) by the result above for sums of independent binomial random variables.

4. Therefore, using the results (5.3) and (5.5), we have:

   E(X) = Σ_{i=1}^{n} E(Zi) = nπ   and   Var(X) = Σ_{i=1}^{n} Var(Zi) = nπ(1 − π).
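A quick simulation (not examinable; the seed and sample sizes are arbitrary choices) illustrates the argument: sums of independent Bernoulli(π) variables have the Bin(n, π) mean and variance.

import numpy as np

rng = np.random.default_rng(1)
n, pi, reps = 20, 0.3, 100_000

sums = rng.binomial(1, pi, size=(reps, n)).sum(axis=1)   # each row: sum of n Bernoulli draws

print(sums.mean(), n * pi)                # both ≈ 6.0
print(sums.var(), n * pi * (1 - pi))      # both ≈ 4.2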

Sums of normally distributed random variables

All sums (linear combinations) of normally distributed random variables are also
normally distributed.
Suppose X1 , X2 , . . . , Xn are normally distributed random variables, with Xi ∼ N (µi , σi2 )
for i = 1, 2, . . . , n, and a1 , a2 , . . . , an and b are constants, then:
Σ_{i=1}^{n} ai Xi + b ∼ N(µ, σ²)

where:

µ = Σ_{i=1}^{n} ai µi + b   and   σ² = Σ_{i=1}^{n} ai² σi² + 2 Σ Σ_{i<j} ai aj Cov(Xi, Xj).

If the Xi s are independent (or just uncorrelated), i.e. if Cov(Xi, Xj) = 0 for all i ≠ j, the variance simplifies to σ² = Σ_{i=1}^{n} ai² σi².

Example 5.17 Suppose that in the population of English people aged 16 or over:

the heights of men (in cm) follow a normal distribution with mean 174.9 and
standard deviation 7.39

the heights of women (in cm) follow a normal distribution with mean 161.3 and
standard deviation 6.85.

Suppose we select one man and one woman at random and independently of each
other. Denote the man’s height by X and the woman’s height by Y . What is the
probability that the man is at most 10 cm taller than the woman?


In other words, what is the probability that the difference between X and Y is at
most 10?
Since X and Y are independent we have:
D = X − Y ∼ N(µX − µY, σX² + σY²)
           = N(174.9 − 161.3, (7.39)² + (6.85)²)
           = N(13.6, (10.08)²).

The probability we need is:


 
P(D ≤ 10) = P((D − 13.6)/10.08 ≤ (10 − 13.6)/10.08)
          = P(Z ≤ −0.36)
          = P(Z ≥ 0.36)
          = 0.3594

using Table 3 of Murdoch and Barnes’ Statistical Tables.


The probability that a randomly selected man is at most 10 cm taller than a
randomly selected woman is about 0.3594.
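A brief, non-examinable check of this probability in Python with SciPy (assuming it is available):

from scipy import stats

mu_d = 174.9 - 161.3                       # 13.6
sd_d = (7.39**2 + 6.85**2) ** 0.5          # ≈ 10.08

print(stats.norm.cdf(10, mu_d, sd_d))      # P(D <= 10) ≈ 0.36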

5.11 Overview of chapter


This chapter has introduced how to deal with more than one random variable at a time.
Focusing mainly on discrete bivariate distributions, the relationships between joint,
marginal and conditional distributions were explored. Sums and products of random
variables concluded the chapter.

5.12 Key terms and concepts


Association Bivariate
Conditional distribution Conditional mean
Conditional variance Correlation
Covariance Dependence
Independence Joint probability distribution
Joint probability (density) function Marginal distribution
Multivariate Random vector
Uncorrelated Univariate

Statistics are like bikinis. What they reveal is suggestive, but what they conceal
is vital.
(Aaron Levenstein)

Appendix A
Data visualisation and descriptive
statistics

A.1 (Re)vision of fundamentals


Properties of the summation operator

Let Xi and Yi , for i = 1, 2, . . . , n, be sets of n numbers. Let a denote a constant, i.e. a


number with the same value for all i.
All of the following results follow simply from the properties of addition (if you are still
not convinced, try them with n = 3).

1. Σ_{i=1}^{n} a = n × a.

   • Proof: Σ_{i=1}^{n} a = (a + a + · · · + a) [n times] = n × a.

2. Σ_{i=1}^{n} aXi = a Σ_{i=1}^{n} Xi.

   • Proof: Σ_{i=1}^{n} aXi = (aX1 + aX2 + · · · + aXn) = a(X1 + X2 + · · · + Xn) = a Σ_{i=1}^{n} Xi.

3. Σ_{i=1}^{n} (Xi + Yi) = Σ_{i=1}^{n} Xi + Σ_{i=1}^{n} Yi.

   • Proof: Rearranging the elements of the summation, we get:

     Σ_{i=1}^{n} (Xi + Yi) = ((X1 + Y1) + (X2 + Y2) + · · · + (Xn + Yn))
                           = (X1 + X2 + · · · + Xn) + (Y1 + Y2 + · · · + Yn)
                           = Σ_{i=1}^{n} Xi + Σ_{i=1}^{n} Yi.


Extension: double (triple etc.) summation

Sometimes sets of numbers may be indexed with two (or even more) subscripts, for
example as Xij , for i = 1, 2, . . . , n and j = 1, 2, . . . , m.
Summation over both indices is written as:
Σ_{i=1}^{n} Σ_{j=1}^{m} Xij = Σ_{i=1}^{n} (Xi1 + Xi2 + · · · + Xim)
                            = (X11 + X12 + · · · + X1m) + (X21 + X22 + · · · + X2m)
                              + · · · + (Xn1 + Xn2 + · · · + Xnm).

The order of summation can be changed, that is:


Σ_{i=1}^{n} Σ_{j=1}^{m} Xij = Σ_{j=1}^{m} Σ_{i=1}^{n} Xij.

Product notation

The analogous notation for the product of a set of numbers is:


Π_{i=1}^{n} Xi = X1 × X2 × · · · × Xn.

The following results follow from the properties of multiplication.

1. Π_{i=1}^{n} aXi = a^n Π_{i=1}^{n} Xi.

2. Π_{i=1}^{n} a = a^n.

3. Π_{i=1}^{n} Xi Yi = (Π_{i=1}^{n} Xi)(Π_{i=1}^{n} Yi).

The sum of deviations from the mean is 0

The mean is ‘in the middle’ of the observations X1 , X2 , . . . , Xn , in the sense that
positive and negative values of the deviations Xi − X̄ cancel out, when summed over
all the observations, that is:
Σ_{i=1}^{n} (Xi − X̄) = 0.

Proof: (The proof uses the definition of X̄ and the properties of summation introduced
earlier. Note that X̄ is a constant in the summation, because it has the same value for


all i.)

Σ_{i=1}^{n} (Xi − X̄) = Σ_{i=1}^{n} Xi − Σ_{i=1}^{n} X̄ = Σ_{i=1}^{n} Xi − nX̄ = Σ_{i=1}^{n} Xi − n (Σ_{i=1}^{n} Xi)/n
                     = Σ_{i=1}^{n} Xi − Σ_{i=1}^{n} Xi = 0.

The mean minimises the sum of squared deviations

The smallest possible value of the sum of squared deviations Σ_{i=1}^{n} (Xi − C)², for any constant C, is obtained when C = X̄.

Proof:

Σ (Xi − C)² = Σ (Xi − X̄ + X̄ − C)²
            = Σ ((Xi − X̄) + (X̄ − C))²
            = Σ ((Xi − X̄)² + 2(Xi − X̄)(X̄ − C) + (X̄ − C)²)
            = Σ (Xi − X̄)² + 2 Σ (Xi − X̄)(X̄ − C) + Σ (X̄ − C)²
            = Σ (Xi − X̄)² + 2(X̄ − C) Σ (Xi − X̄) + n(X̄ − C)²
            = Σ (Xi − X̄)² + n(X̄ − C)²     (since Σ (Xi − X̄) = 0)
            ≥ Σ (Xi − X̄)²

since n(X̄ − C)2 ≥ 0 for any choice of C. Equality is obtained only when C = X̄, so
that n(X̄ − C)2 = 0.


An alternative formula for the variance

The sum of squares in S 2 can also be expressed as:

Σ_{i=1}^{n} (Xi − X̄)² = Σ_{i=1}^{n} Xi² − nX̄².


Proof: We have:
Σ_{i=1}^{n} (Xi − X̄)² = Σ_{i=1}^{n} (Xi² − 2Xi X̄ + X̄²)
                      = Σ_{i=1}^{n} Xi² − 2X̄ Σ_{i=1}^{n} Xi + Σ_{i=1}^{n} X̄²
                      = Σ_{i=1}^{n} Xi² − 2nX̄² + nX̄²     (using Σ_{i=1}^{n} Xi = nX̄ and Σ_{i=1}^{n} X̄² = nX̄²)
                      = Σ_{i=1}^{n} Xi² − nX̄².


Therefore, the sample variance can also be calculated as:

S² = (1/(n − 1)) (Σ_{i=1}^{n} Xi² − nX̄²)

(and the standard deviation S = √S² again).

This formula is most convenient for calculations done by hand when summary statistics such as Σ_i Xi² and Σ_i Xi are provided.
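A quick, non-examinable check in Python with NumPy (the data values are made up) confirms that the two formulas for S² agree:

import numpy as np

x = np.array([3.0, 7.0, 8.0, 5.0, 12.0, 9.0])
n = len(x)

s2_direct = ((x - x.mean()) ** 2).sum() / (n - 1)
s2_shortcut = ((x ** 2).sum() - n * x.mean() ** 2) / (n - 1)

print(s2_direct, s2_shortcut, x.var(ddof=1))   # all three are equal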

Sample moment

Sample moments will be formally introduced in Chapter 7 (in Lent term).


Let us define, for a variable X and for each k = 1, 2, . . ., the following:

the kth sample moment about zero is:


mk = (Σ_{i=1}^{n} Xi^k)/n

the kth central sample moment is:


m′k = (Σ_{i=1}^{n} (Xi − X̄)^k)/n.

In other words, these are sample averages of the powers Xi^k and (Xi − X̄)^k, respectively. Clearly:

X̄ = m1   and   S² = (n/(n − 1)) m′2 = (1/(n − 1))(n m2 − n(m1)²).

Moments of powers 3 and 4 are used in two more summary statistics which are
described next, for reference only.
These are used much less often than measures of central tendency and dispersion.


Sample skewness (non-examinable)

A measure of the skewness of the distribution of a variable X is:

g1 = m′3/S³ = (Σ_i (Xi − X̄)³/n) / (Σ_i (Xi − X̄)²/(n − 1))^(3/2).
For this measure, g1 = 0 for a symmetric distribution, g1 > 0 for a positively-skewed
distribution, and g1 < 0 for a negatively-skewed distribution.
For example, g1 = 1.24 for the (positively skewed) GDP per capita distribution shown
in Chapter 1 of the main course notes, and g1 = 0.006 for the (fairly symmetric)
diastolic blood pressure distribution.

Sample kurtosis (non-examinable)

Kurtosis refers to yet another characteristic of a sample distribution. This has to do


with the relative sizes of the ‘peak’ and tails of the distribution (think about shapes of
histograms).

A distribution with high kurtosis (i.e. leptokurtic) has a sharp peak and a high
proportion of observations in the tails far from the peak.
A distribution with low kurtosis (i.e. platykurtic) is ‘flat’, with no pronounced peak, most of the observations spread evenly around the middle, and weak tails.

A sample measure of kurtosis is:

g2 = m′4/(m′2)² − 3 = (Σ_i (Xi − X̄)⁴/n) / (Σ_i (Xi − X̄)²/n)² − 3.
g2 > 0 for leptokurtic and g2 < 0 for platykurtic distributions, and g2 = 0 for the normal
distribution (introduced in Chapter 4). Some software packages define a measure of
kurtosis without the −3, i.e. ‘excess kurtosis’.

Calculation of sample quantiles (non-examinable)

This is how computer software calculates general sample quantiles (or how you can do
so by hand, if you ever needed to).
Suppose we need to calculate the cth sample quantile, qc , where 0 < c < 100. Let
R = (n + 1)c/100, and define r as the integer part of R and f = R − r as the fractional
part (if R is an integer, r = R and f = 0). It follows that:
qc = X(r) + f (X(r+1) − X(r) ) = (1 − f )X(r) + f X(r+1) .
For example, if n = 10:

for q50 (the median): R = 5.5, r = 5, f = 0.5, and so we have:

q50 = X(5) + 0.5(X(6) − X(5)) = 0.5(X(5) + X(6))

as before

for q25 (the first quartile): R = 2.75, r = 2, f = 0.75, and so:

q25 = X(2) + 0.75(X(3) − X(2)) = 0.25X(2) + 0.75X(3).
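If you want to automate this rule, here is a minimal, non-examinable Python sketch (the function name sample_quantile is ours, not a standard library function, and no care is taken over extreme values of c):

import numpy as np

def sample_quantile(x, c):
    # The cth sample quantile (0 < c < 100) using the (n + 1)c/100 rule above.
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    R = (n + 1) * c / 100
    r = int(R)                              # integer part
    f = R - r                               # fractional part
    return (1 - f) * x[r - 1] + f * x[r]    # x[r - 1] is X_(r), as Python indexes from 0

x = np.arange(1, 11)                # n = 10 observations: 1, 2, ..., 10
print(sample_quantile(x, 50))       # 5.5 = 0.5 * (X_(5) + X_(6))
print(sample_quantile(x, 25))       # 2.75 = 0.25 * X_(2) + 0.75 * X_(3)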


A.2 Worked example


1. Show that:

   Σ_{i=1}^{n} Σ_{j=1}^{n} (xi − xj)² = 2n [Σ_{i=1}^{n} (xi − x̄)²].

   Solution:
   Begin with the left-hand side and proceed as follows:

   Σ_{i=1}^{n} Σ_{j=1}^{n} (xi − xj)² = Σ_{i=1}^{n} [Σ_{j=1}^{n} (xi − xj)²].

   Now, expand the square:

   = Σ_{i=1}^{n} [Σ_{j=1}^{n} (xi² − 2xi xj + xj²)].

   Next, sum separately inside [ ] so we have:

   = Σ_{i=1}^{n} [Σ_{j=1}^{n} xi² + Σ_{j=1}^{n} (−2xi xj) + Σ_{j=1}^{n} xj²].

   Now, factor out xi terms inside [ ] to give:

   = Σ_{i=1}^{n} [xi² Σ_{j=1}^{n} 1 − 2xi Σ_{j=1}^{n} xj + Σ_{j=1}^{n} xj²].

   Now, recall that x̄ = Σ_{i=1}^{n} xi / n, so re-write as:

   = Σ_{i=1}^{n} [n xi² − 2xi nx̄ + Σ_{j=1}^{n} xj²].

   Next, expand the bracket:

   = Σ_{i=1}^{n} n xi² + Σ_{i=1}^{n} (−2xi nx̄) + Σ_{i=1}^{n} (Σ_{j=1}^{n} xj²).

   Re-arrange again:

   = n Σ_{i=1}^{n} xi² − 2nx̄ Σ_{i=1}^{n} xi + (Σ_{j=1}^{n} xj²) Σ_{i=1}^{n} 1.

   Apply the ‘x̄ trick’ once more:

   = n Σ_{i=1}^{n} xi² − 2nx̄ × nx̄ + (Σ_{j=1}^{n} xj²) n.

   Factor out the common n to give:

   = n [Σ_{i=1}^{n} xi² − 2nx̄² + Σ_{j=1}^{n} xj²].

   Without loss of generality, we can re-define the index j as index i so:

   = n [Σ_{i=1}^{n} xi² − 2nx̄² + Σ_{i=1}^{n} xi²].

   Finally, add terms, factor out 2n, apply the ‘x̄ trick’ . . . and you’re done!

   = 2n [Σ_{i=1}^{n} xi² − nx̄²] = 2n [Σ_{i=1}^{n} (xi − x̄)²].
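A quick numerical check of the identity (not examinable; the four data values are arbitrary) can be done in Python with NumPy:

import numpy as np

x = np.array([2.0, 5.0, 7.0, 11.0])
n = len(x)

lhs = ((x[:, None] - x[None, :]) ** 2).sum()    # double sum of (x_i - x_j)^2 over i and j
rhs = 2 * n * ((x - x.mean()) ** 2).sum()

print(lhs, rhs)                                 # both equal 342.0 for these values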

A.3 Practice questions


Try to solve the questions before looking at the solutions – promise?! Solutions are
located in Appendix F.

1. Let Y1, Y2 and Y3 be real numbers with Ȳ = (Y1 + Y2 + Y3)/3. Show that:

(a) Σ_{j=1}^{3} (Yj − Ȳ) = 0

(b) Σ_{j=1}^{3} Σ_{k=1}^{3} (Yj − Ȳ)(Yk − Ȳ) = 0

(c) Σ_{j=1}^{3} Σ_{k=1, k≠j}^{3} (Yj − Ȳ)(Yk − Ȳ) = − Σ_{j=1}^{3} (Yj − Ȳ)².

Hint: there are three terms in the expression of (a), nine terms in (b) and six terms
in (c). Write out the terms, and try and find ways to simplify them which avoid the
need for a lot of messy algebra!

2. For constants a and b, show that:

(a) ȳ = ax̄ + b, where yi = axi + b for i = 1, 2, . . . , n

(b) Σ_{i=1}^{n} (xi − x̄)² = Σ_{i=1}^{n} xi² − nx̄²

(c) s.d.y = |a| s.d.x , where s.d.y is the standard deviation of y etc.
What are the mean and standard deviation of the set {x1 + k, x2 + k, . . . , xn + k}
where k is a constant? What are the mean and standard deviation of the set
{cx1 , cx2 , . . . , cxn } where c is a constant? Justify your answers with reference to the
above results.

The average human has one breast and one testicle.


(Des McHale)

Appendix B
Probability theory

B.1 Worked examples


1. A and B are independent events. Suppose that P (A) = 2π, P (B) = π and
P (A ∪ B) = 0.8. Evaluate π.
Solution:
We have:

P (A ∪ B) = 0.8 = P (A) + P (B) − P (A ∩ B) = P (A) + P (B) − P (A) P (B) = 2π + π − 2π².

Therefore:

    2π² − 3π + 0.8 = 0  ⇒  π = (3 ± √(9 − 6.4)) / 4.

Hence π = 0.346887, since the other root is > 1!

2. A and B are events such that P (A | B) > P (A). Prove that:

P (Ac | B c ) > P (Ac )

where Ac and B c are the complements of A and B, respectively, and P (B c ) > 0.


Solution:
From the definition of conditional probability:

    P (Ac | B c) = P (Ac ∩ B c) / P (B c) = P ((A ∪ B)c) / P (B c) = (1 − P (A) − P (B) + P (A ∩ B)) / (1 − P (B)).

However:

    P (A | B) = P (A ∩ B) / P (B) > P (A), i.e. P (A ∩ B) > P (A) P (B).

Hence:

    P (Ac | B c) > (1 − P (A) − P (B) + P (A) P (B)) / (1 − P (B)) = 1 − P (A) = P (Ac).


3. A, B and C are independent events. Prove that A and (B ∪ C) are independent.


Solution:
Using the distributive law:
P (A ∩ (B ∪ C)) = P ((A ∩ B) ∪ (A ∩ C))
= P (A ∩ B) + P (A ∩ C) − P (A ∩ B ∩ C)
= P (A) P (B) + P (A) P (C) − P (A) P (B) P (C)
= P (A) (P (B) + P (C) − P (B) P (C))
= P (A) P (B ∪ C).

4. A and B are any two events in the sample space S. The binary set operator ∨
denotes an exclusive union, such that:
A ∨ B = (A ∪ B) ∩ (A ∩ B)c = {s | s ∈ A or B, and s ∉ (A ∩ B)}.
Show, from the axioms of probability, that:
(a) P (A ∨ B) = P (A) + P (B) − 2 × P (A ∩ B)
(b) P (A ∨ B | A) = 1 − P (B | A).

Solution:
(a) We have:
A ∨ B = (A ∩ B c ) ∪ (B ∩ Ac ).
By axiom 3, noting that (A ∩ B c ) and (B ∩ Ac ) are disjoint:
P (A ∨ B) = P (A ∩ B c ) + P (B ∩ Ac ).
We can write A = (A ∩ B) ∪ (A ∩ B c ), hence (using axiom 3):
P (A ∩ B c ) = P (A) − P (A ∩ B).
Similarly, P (B ∩ Ac ) = P (B) − P (A ∩ B), hence:
P (A ∨ B) = P (A) + P (B) − 2 × P (A ∩ B).

(b) We have:

    P (A ∨ B | A) = P ((A ∨ B) ∩ A) / P (A)
                  = P (A ∩ B c) / P (A)
                  = (P (A) − P (A ∩ B)) / P (A)
                  = P (A)/P (A) − P (A ∩ B)/P (A)
                  = 1 − P (B | A).


5. State and prove Bayes’ theorem.

Solution:
Bayes’ theorem is:

    P (Bj | A) = P (A | Bj) P (Bj) / Σ_{i=1}^{K} P (A | Bi) P (Bi).

By definition:

    P (Bj | A) = P (Bj ∩ A) / P (A) = P (A | Bj) P (Bj) / P (A).

If {Bi}, for i = 1, 2, . . . , K, is a partition of the sample space S, then:

    P (A) = Σ_{i=1}^{K} P (A ∩ Bi) = Σ_{i=1}^{K} P (A | Bi) P (Bi).

Hence the result.

6. A man has two bags. Bag A contains five keys and bag B contains seven keys. Only
one of the twelve keys fits the lock which he is trying to open. The man selects a
bag at random, picks out a key from the bag at random and tries that key in the
lock. What is the probability that the key he has chosen fits the lock?

Solution:
Define a partition {Ci}, such that:

    C1 = key in bag A and bag A chosen  ⇒  P (C1) = 5/12 × 1/2 = 5/24
    C2 = key in bag B and bag A chosen  ⇒  P (C2) = 7/12 × 1/2 = 7/24
    C3 = key in bag A and bag B chosen  ⇒  P (C3) = 5/12 × 1/2 = 5/24
    C4 = key in bag B and bag B chosen  ⇒  P (C4) = 7/12 × 1/2 = 7/24.

Hence we require, defining the event F = ‘key fits’:

    P (F ) = 1/5 × P (C1) + 1/7 × P (C4) = 1/5 × 5/24 + 1/7 × 7/24 = 1/12.

7. Continuing with Question 6, suppose the first key chosen does not fit the lock.
What is the probability that the bag chosen:
(a) is bag A?
(b) contains the required key?


Solution:

(a) We require P (bag A | F c) which is:

    P (bag A | F c) = (P (F c | C1) P (C1) + P (F c | C2) P (C2)) / Σ_{i=1}^{4} P (F c | Ci) P (Ci).

The conditional probabilities are:

    P (F c | C1) = 4/5,  P (F c | C2) = 1,  P (F c | C3) = 1  and  P (F c | C4) = 6/7.

Hence:

    P (bag A | F c) = (4/5 × 5/24 + 1 × 7/24) / (4/5 × 5/24 + 1 × 7/24 + 1 × 5/24 + 6/7 × 7/24) = 1/2.

(b) We require P (right bag | F c) which is:

    P (right bag | F c) = (P (F c | C1) P (C1) + P (F c | C4) P (C4)) / Σ_{i=1}^{4} P (F c | Ci) P (Ci)
                        = (4/5 × 5/24 + 6/7 × 7/24) / (4/5 × 5/24 + 1 × 7/24 + 1 × 5/24 + 6/7 × 7/24)
                        = 5/11.
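A rough Monte Carlo check of Questions 6 and 7, written as a minimal Python sketch (the simulation design and variable names are our own), is:

    import random

    trials = 200_000
    fits = fails = bag_a_given_fail = right_bag_given_fail = 0

    for _ in range(trials):
        key_in_a = random.random() < 5 / 12          # the correct key is one of the 12
        bag = 'A' if random.random() < 0.5 else 'B'  # bag chosen at random
        n_keys = 5 if bag == 'A' else 7
        right_bag = (bag == 'A') == key_in_a
        # the chosen key fits only if the correct key is in the chosen bag
        fit = right_bag and random.random() < 1 / n_keys
        if fit:
            fits += 1
        else:
            fails += 1
            bag_a_given_fail += bag == 'A'
            right_bag_given_fail += right_bag

    print(fits / trials)                    # approx 1/12
    print(bag_a_given_fail / fails)         # approx 1/2
    print(right_bag_given_fail / fails)     # approx 5/11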

8. Assume that a calculator has a ‘random number’ key and that when the key is
pressed an integer between 0 and 999 inclusive is generated at random, all numbers
being generated independently of one another.
(a) What is the probability that the number generated is less than 300?
(b) If two numbers are generated, what is the probability that both are less than
300?
(c) If two numbers are generated, what is the probability that the first number
exceeds the second number?
(d) If two numbers are generated, what is the probability that the first number
exceeds the second number, and their sum is exactly 300?
(e) If five numbers are generated, what is the probability that at least one number
occurs more than once?

Solution:

(a) Simply 300/1,000 = 0.3.


(b) Simply 0.3 × 0.3 = 0.09.


(c) Suppose P (first greater) = x, then by symmetry we have that P (second greater) = x. However, the probability that both are equal is (by counting {0, 0}, {1, 1}, . . . , {999, 999}):

    1,000/1,000,000 = 0.001.

Hence x + x + 0.001 = 1, so x = 0.4995.

(d) The following cases apply: {300, 0}, {299, 1}, . . . , {151, 149}, i.e. there are 150 possibilities from (10)⁶. So the required probability is:

    150/1,000,000 = 0.00015.

(e) The probability that they are all different is:

    999/1,000 × 998/1,000 × 997/1,000 × 996/1,000.

Subtracting from 1 gives the required probability, i.e. 0.009965.
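The arithmetic in parts (c)–(e) can be reproduced with a few lines of Python (a minimal sketch of our own):

    # (c) probability that the first number exceeds the second
    p_equal = 1000 / 1000**2
    print((1 - p_equal) / 2)                    # 0.4995

    # (d) first exceeds second and their sum is exactly 300
    pairs = sum(1 for a in range(1000) for b in range(1000) if a > b and a + b == 300)
    print(pairs / 1000**2)                      # 150/1,000,000 = 0.00015

    # (e) at least one repeat among five generated numbers
    p_all_different = 1.0
    for k in range(5):
        p_all_different *= (1000 - k) / 1000
    print(1 - p_all_different)                  # approx 0.009965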

9. If C1, C2, . . . are events in S which are pairwise mutually exclusive (i.e. Ci ∩ Cj = ∅ for all i ≠ j), then, by the axioms of probability:

    P (∪_{i=1}^{∞} Ci) = Σ_{i=1}^{∞} P (Ci).     (B.1)

Suppose that A1, A2, . . . are pairwise mutually exclusive events in S. Prove that a property like (B.1) also holds for conditional probabilities given some event B, i.e. prove that:

    P (∪_{i=1}^{∞} Ai | B) = Σ_{i=1}^{∞} P (Ai | B).

You can assume that all unions and intersections of Ai and B are also events in S.

Solution:
We have:

    P (∪_{i=1}^{∞} Ai | B) = P ((∪_{i=1}^{∞} Ai) ∩ B) / P (B)
                           = P (∪_{i=1}^{∞} (Ai ∩ B)) / P (B)
                           = Σ_{i=1}^{∞} P (Ai ∩ B) / P (B)
                           = Σ_{i=1}^{∞} P (Ai | B)

where the third equality follows from (B.1) in the question, since the events Ai ∩ B are also events in S, and they are pairwise mutually exclusive (i.e. (Ai ∩ B) ∩ (Aj ∩ B) = ∅ for all i ≠ j).


10. Suppose that three components numbered 1, 2 and 3 have probabilities of failure
π1 , π2 and π3 , respectively. Determine the probability of a system failure in each of
the following cases where component failures are assumed to be independent.
(a) Parallel system – the system fails if all components fail.
(b) Series system – the system fails unless all components do not fail.
(c) Mixed system – the system fails if component 1 fails or if both component 2
and component 3 fail.

Solution:
(a) Since the component failures are independent, the probability of system failure
is π1 π2 π3 .
(b) The probability that component i does not fail is 1 − πi , hence the probability
that the system does not fail is (1 − π1 )(1 − π2 )(1 − π3 ), and so the probability
that the system fails is:
1 − (1 − π1 )(1 − π2 )(1 − π3 ).

(c) Components 2 and 3 may be combined to form a notional component 4 with


failure probability π2 π3 . So the system is equivalent to a component with
failure probability π1 and another component with failure probability π2 π3 ,
these being connected in series. Therefore, the failure probability is:
1 − (1 − π1 )(1 − π2 π3 ) = π1 + π2 π3 − π1 π2 π3 .

11. Why is S = {1, 1, 2} not a sensible way to try to define a sample space?
Solution:
Because there is no need to list the elementary outcome ‘1’ twice. It is much clearer
to write S = {1, 2}.

12. Write out all the events for the sample space S = {a, b, c}. (There are eight of
them.)
Solution:
The possible events are {a}, {b}, {c}, {a, b}, {a, c}, {b, c}, {a, b, c} (the sample
space S) and ∅.

13. For an event A, work out a simpler way to express the events A ∩ S, A ∪ S, A ∩ ∅
and A ∪ ∅.
Solution:
We have:
A ∩ S = A, A ∪ S = S, A ∩ ∅ = ∅ and A ∪ ∅ = A.

14. If all elementary outcomes are equally likely, S = {a, b, c, d}, A = {a, b, c} and
B = {c, d}, find P (A | B) and P (B | A).


Solution:
S has 4 elementary outcomes which are equally likely, so each elementary outcome
has probability 1/4.
We have:

    P (A | B) = P (A ∩ B) / P (B) = P ({c}) / P ({c, d}) = (1/4) / (1/4 + 1/4) = 1/2

and:

    P (B | A) = P (B ∩ A) / P (A) = P ({c}) / P ({a, b, c}) = (1/4) / (1/4 + 1/4 + 1/4) = 1/3.

15. Suppose that we toss a fair coin twice. The sample space is given by
S = {HH, HT, T H, T T }, where the elementary outcomes are defined in the
obvious way – for instance HT is heads on the first toss and tails on the second
toss. Show that if all four elementary outcomes are equally likely, then the events
‘heads on the first toss’ and ‘heads on the second toss’ are independent.
Solution:
Note carefully here that we have equally likely elementary outcomes (due to the
coin being fair), so that each has probability 1/4, and the independence follows.
The event ‘heads on the first toss’ is A = {HH, HT } and has probability 1/2,
because it is specified by two elementary outcomes. The event ‘heads on the second
toss’ is B = {HH, T H} and has probability 1/2. The event ‘heads on the first toss
and the second toss’ is A ∩ B = {HH} and has probability 1/4. So the
multiplication property P (A ∩ B) = 1/4 = 1/2 × 1/2 = P (A) P (B) is satisfied, and
the two events are independent.

16. Show that if A and B are disjoint events, and are also independent, then P (A) = 0 or P (B) = 0.
Solution:
It is important to get the logical flow in the right direction here. We are told that
A and B are disjoint events, that is:

A ∩ B = ∅.

So:
P (A ∩ B) = 0.
We are also told that A and B are independent, that is:

P (A ∩ B) = P (A) P (B).

It follows that:
0 = P (A) P (B)
and so either P (A) = 0 or P (B) = 0.
Note that independence and disjointness are not similar ideas.


17. Write down the condition for three events A, B and C to be independent.

Solution:
Applying the product rule, we must have:

P (A ∩ B ∩ C) = P (A) P (B) P (C).

Therefore, since all subsets of two events from A, B and C must be independent,
we must also have:

P (A ∩ B) = P (A) P (B)
P (A ∩ C) = P (A) P (C)

and:
P (B ∩ C) = P (B) P (C).

One must check that all four conditions hold to verify independence of A, B and C.

18. Prove the simplest version of Bayes’ theorem from first principles.

Solution:
Applying the definition of conditional probability, we have:

    P (B | A) = P (B ∩ A) / P (A) = P (A ∩ B) / P (A) = P (A | B) P (B) / P (A).

19. A statistics teacher knows from past experience that a student who does their
homework consistently has a probability of 0.95 of passing the examination,
whereas a student who does not do their homework has a probability of 0.30 of
passing.
(a) If 25% of students do their homework consistently, what percentage can expect
to pass?
(b) If a student chosen at random from the group gets a pass, what is the
probability that the student has done their homework consistently?

Solution:
Here the random experiment is to choose a student at random, and to record whether the student passes (P ) or fails (F ), and whether the student has done their homework consistently (C) or has not (N ). (Notice that F = P c and N = C c.) The sample space is S = {P C, P N, F C, F N }. We use the events Pass = {P C, P N } and Fail = {F C, F N }. We consider the sample space partitioned by Homework = {P C, F C} and No Homework = {P N, F N }.


(a) The first part of the example asks for the denominator of Bayes’ theorem:

P (Pass) = P (Pass | Homework) P (Homework)


+ P (Pass | No Homework) P (No Homework)
= 0.95 × 0.25 + 0.30 × (1 − 0.25)
= 0.2375 + 0.225
= 0.4625.

(b) Now applying Bayes’ theorem:

    P (Homework | Pass) = P (Homework ∩ Pass) / P (Pass)
                        = P (Pass | Homework) P (Homework) / P (Pass)
                        = (0.95 × 0.25) / 0.4625
                        = 0.5135.

Alternatively, we could arrange the calculations in a tree diagram.
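The same calculation can be checked with a few lines of Python (a minimal sketch; the variable names are our own):

    p_pass_given_hw = 0.95
    p_pass_given_no_hw = 0.30
    p_hw = 0.25

    # total probability of a pass (the denominator of Bayes' theorem)
    p_pass = p_pass_given_hw * p_hw + p_pass_given_no_hw * (1 - p_hw)
    # posterior probability of consistent homework given a pass
    p_hw_given_pass = p_pass_given_hw * p_hw / p_pass

    print(p_pass)            # 0.4625
    print(p_hw_given_pass)   # approx 0.5135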

20. Plagiarism is a serious problem for assessors of coursework. One check on


plagiarism is to compare the coursework with a standard text. If the coursework
has plagiarised the text, then there will be a 95% chance of finding exactly two
phrases which are the same in both coursework and text, and a 5% chance of
finding three or more phrases. If the work is not plagiarised, then these
probabilities are both 50%.


Suppose that 5% of coursework is plagiarised. An assessor chooses some coursework at random. What is the probability that it has been plagiarised if it has exactly two phrases in common with the text? (Try making a guess before doing the calculation!)
What if there are three or more phrases? Did you manage to get a roughly correct guess of these results before calculating?
Solution:
Suppose that two phrases are the same. We use Bayes’ theorem:

    P (plagiarised | two the same) = (0.95 × 0.05) / (0.95 × 0.05 + 0.5 × 0.95) = 0.0909.

Finding two phrases has increased the chance the work is plagiarised from 5% to 9.1%. Did you get anywhere near 9% when guessing? Now suppose that we find three or more phrases:

    P (plagiarised | three or more the same) = (0.05 × 0.05) / (0.05 × 0.05 + 0.5 × 0.95) = 0.0052.

It seems that no plagiariser is silly enough to keep three or more phrases the same, so if we find three or more, the chance of the work being plagiarised falls from 5% to 0.5%! How close did you get by guessing?

21. A, B and C throw a die in that order until a six appears. The person who throws
the first six wins. What are their respective chances of winning?
Solution:
We must assume that the game finishes with probability one (it would be proved in
a more advanced subject). If A, B and C all throw and fail to get a six, then their
respective chances of winning are as at the start of the game. We can call each
completed set of three throws a round. Let us denote the probabilities of winning
by P (A), P (B) and P (C) for A, B and C, respectively. Therefore:

    P (A) = P (A wins on the 1st throw) + P (A wins in some round after the 1st round)
          = 1/6 + P (A, B and C fail on the 1st throw and A wins after the 1st round)
          = 1/6 + P (A, B and C fail in the 1st round)
                × P (A wins after the 1st round | A, B and C fail in the 1st round)
          = 1/6 + P (No six in first 3 throws) P (A)
          = 1/6 + (5/6)³ P (A)
          = 1/6 + (125/216) P (A).

So (1 − 125/216) P (A) = 1/6, and P (A) = 216/(91 × 6) = 36/91.

Similarly:

    P (B) = P (B wins in the 1st round) + P (B wins after the 1st round)
          = P (A fails with the 1st throw and B throws a six on the 1st throw)
                + P (All fail in the 1st round and B wins after the 1st round)
          = P (A fails with the 1st throw) P (B throws a six with the 1st throw)
                + P (All fail in the 1st round) P (B wins after the 1st | All fail in the 1st)
          = 5/6 × 1/6 + (5/6)³ P (B).

So, (1 − 125/216) P (B) = 5/36, and P (B) = 5(216)/(91 × 36) = 30/91.

In the same way, P (C) = (5/6)(5/6)(1/6)(216/91) = 25/91.

Notice that P (A) + P (B) + P (C) = 1. You may, on reflection, think that this rather long solution could be shortened, by considering the relative winning chances of A, B and C.
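A rough Monte Carlo check of these winning probabilities, as a minimal Python sketch of our own:

    import random

    wins = {'A': 0, 'B': 0, 'C': 0}
    games = 100_000
    for _ in range(games):
        finished = False
        while not finished:
            for player in ('A', 'B', 'C'):       # A, B and C throw in that order
                if random.randint(1, 6) == 6:
                    wins[player] += 1
                    finished = True
                    break

    for player in ('A', 'B', 'C'):
        print(player, wins[player] / games)      # approx 36/91, 30/91, 25/91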

22. In men’s singles tennis, matches are played on the best-of-five-sets principle.
Therefore, the first player to win three sets wins the match, and a match may
consist of three, four or five sets. Assuming that two players are perfectly evenly
matched, and that sets are independent events, calculate the probabilities that a
match lasts three sets, four sets and five sets, respectively.
Solution:
Suppose that the two players are A and B. We calculate the probability that A
wins a three-, four- or five-set match, and then, since the players are evenly
matched, double these probabilities for the final answer.

P (‘A wins in 3 sets’) = P (‘A wins 1st set’ ∩ ‘A wins 2nd set’ ∩ ‘A wins 3rd set’).

Since the sets are independent, we have:

    P (‘A wins in 3 sets’) = P (‘A wins 1st set’) P (‘A wins 2nd set’) P (‘A wins 3rd set’) = 1/2 × 1/2 × 1/2 = 1/8.

Therefore, the total probability that the game lasts three sets is:

    2 × 1/8 = 1/4.
If A wins in four sets, the possible winning patterns are:

BAAA, ABAA and AABA.


Each of these patterns has probability (1/2)4 by using the same argument as in the
case of 3 sets. So the probability that A wins in four sets is 3 × (1/16) = 3/16.
Therefore, the total probability of a match lasting four sets is 2 × (3/16) = 3/8.
The probability of a five-set match should be 1 − 3/8 − 1/4 = 3/8, but let us check
this directly. The winning patterns for A in a five-set match are:

BBAAA, BABAA, BAABA, ABBAA, ABABA and AABBA.

Each of these has probability (1/2)5 because of the independence of the sets. So the
probability that A wins in five sets is 6 × (1/32) = 3/16. Therefore, the total
probability of a five-set match is 3/8, as before.
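These probabilities can also be checked by enumerating all possible set sequences; a minimal Python sketch (our own illustration) is:

    from itertools import product

    counts = {3: 0.0, 4: 0.0, 5: 0.0}
    for sets in product('AB', repeat=5):
        # play the sets one by one and stop as soon as one player has won three
        a = b = 0
        for i, winner in enumerate(sets, start=1):
            a += winner == 'A'
            b += winner == 'B'
            if a == 3 or b == 3:
                counts[i] += 1 / 2 ** 5    # each full 5-set sequence has probability (1/2)^5
                break

    print(counts)   # {3: 0.25, 4: 0.375, 5: 0.375}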

B.2 Practice questions


Try to solve the questions before looking at the solutions – promise?! Solutions are
located in Appendix F.

1. (a) A, B and C are any three events in the sample space S. Prove that:

P (A∪B∪C) = P (A)+P (B)+P (C)−P (A∩B)−P (B∩C)−P (A∩C)+P (A∩B∩C).

(b) A and B are events in a sample space S. Show that:

    P (A ∩ B) ≤ (P (A) + P (B)) / 2 ≤ P (A ∪ B).

2. Suppose A and B are events with P (A) = p, P (B) = 2p and P (A ∪ B) = 0.75.


(a) Evaluate p and P (A | B) if A and B are independent events.
(b) Evaluate p and P (A | B) if A and B are mutually exclusive events.

3. (a) Show that if A and B are independent events in a sample space, then Ac and
B c are also independent.
(b) Show that if X and Y are mutually exclusive events in a sample space, then
X c and Y c are not in general mutually exclusive.

4. In a game of tennis, each point is won by one of the two players A and B. The
usual rules of scoring for tennis apply. That is, the winner of the game is the player
who first scores four points, unless each player has won three points, when deuce is
called and play proceeds until one player is two points ahead of the other and
hence wins the game.
A is serving and has a probability of winning any point of 2/3. The result of each
point is assumed to be independent of every other point.
(a) Show that the probability of A winning the game without deuce being called is
496/729.
(b) Find the probability of deuce being called.


(c) If deuce is called, show that A’s subsequent probability of winning the game is
4/5.
(d) Hence determine A’s overall chance of winning the game.

There are lies, damned lies and statistics.


(Mark Twain)

Appendix C
Random variables

C.1 Worked examples


1. Toward the end of the financial year, James is considering whether to accept an
offer to buy his stock option now, rather than wait until the normal exercise time.
If he sells now, his profit will be £120,000. If he waits until the exercise time, his
profit will be £200,000, provided that there is no crisis in the markets before that
time; if there is a crisis, the option will be worthless and he would expect a net loss
of £50,000. What action should he take to maximise his expected profit if the
probability of crisis is:
(a) 0.5?
(b) 0.1?
For what probability of a crisis would James be indifferent between the two courses
of action if he wishes to maximise his expected profit?
Solution:
Let π = probability of crisis, then:

S = E(profit given James sells) = £120,000

and:

W = E(profit given James waits) = £200,000(1 − π) + (−£50,000)π.

(a) If π = 0.5, then S = £120,000 and W = £75,000, so S > W , hence James


should sell now.
(b) If π = 0.1, then S = £120,000 and W = £175,000, so S < W , hence James
should wait until the exercise time.
To be indifferent, we require S = W , i.e. we have:

£200,000 − £250,000 π = £120,000

so π = 8/25 = 0.32.

2. Suppose the random variable X has a geometric distribution with parameter π, which has the following probability function:

    p(x) = (1 − π)^{x−1} π  for x = 1, 2, . . .,  and 0 otherwise.

(a) Show that its moment generating function is:

    πe^t / (1 − e^t(1 − π)).

(b) Hence show that the mean of the distribution is 1/π.

Solution:
(a) Working from the definition:

    MX(t) = E(e^{tX}) = Σ_{x∈S} e^{tx} p(x) = Σ_{x=1}^{∞} e^{tx} (1 − π)^{x−1} π
          = πe^t Σ_{x=1}^{∞} (e^t(1 − π))^{x−1}
          = πe^t / (1 − e^t(1 − π))

using the sum to infinity of a geometric series.

(b) Differentiating:

    M′X(t) = ((1 − e^t(1 − π))πe^t + πe^t(e^t(1 − π))) / (1 − e^t(1 − π))² = πe^t / (1 − e^t(1 − π))².

Therefore:

    E(X) = M′X(0) = π / (1 − (1 − π))² = π/π² = 1/π.
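As a rough check of the result E(X) = 1/π, the following minimal Python simulation (our own sketch, with π = 1/3 chosen arbitrarily) estimates the mean of a geometric random variable:

    import random

    pi = 1 / 3
    trials = 100_000

    def geometric(pi):
        """Number of Bernoulli(pi) trials up to and including the first success."""
        x = 1
        while random.random() >= pi:
            x += 1
        return x

    mean = sum(geometric(pi) for _ in range(trials)) / trials
    print(mean)   # approx 1/pi = 3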

3. A continuous random variable, X, has a probability density function, f (x), defined by:

    f (x) = ax + bx²  for 0 ≤ x ≤ 1,  and 0 otherwise

and E(X) = 1/2. Determine:

(a) the constants a and b
(b) the cumulative distribution function, F (x), of X
(c) the variance, Var(X).

Solution:
(a) We have:

    ∫₀¹ f (x) dx = 1  ⇒  ∫₀¹ (ax + bx²) dx = [ax²/2 + bx³/3]₀¹ = 1

i.e. we have a/2 + b/3 = 1.
Also, we know E(X) = 1/2, hence:

    ∫₀¹ x(ax + bx²) dx = [ax³/3 + bx⁴/4]₀¹ = 1/2

i.e. we have:

    a/3 + b/4 = 1/2  ⇒  a = 6 and b = −6.

Hence f (x) = 6x(1 − x) for 0 ≤ x ≤ 1, and 0 otherwise.

(b) We have:

    F (x) = 0 for x < 0,  3x² − 2x³ for 0 ≤ x ≤ 1,  and 1 for x > 1.

(c) Finally:

    E(X²) = ∫₀¹ x²(6x(1 − x)) dx = ∫₀¹ (6x³ − 6x⁴) dx = [6x⁴/4 − 6x⁵/5]₀¹ = 0.3

and so the variance is:

    Var(X) = E(X²) − (E(X))² = 0.3 − 0.25 = 0.05.
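The values found in (a) and (c) can be checked by crude numerical integration; a minimal Python sketch (our own illustration) is:

    # midpoint-rule check that f(x) = 6x(1 - x) integrates to 1 on [0, 1],
    # has mean 1/2 and variance 0.05
    N = 100_000
    dx = 1 / N
    total = mean = second = 0.0
    for i in range(N):
        x = (i + 0.5) * dx
        fx = 6 * x * (1 - x)
        total += fx * dx
        mean += x * fx * dx
        second += x * x * fx * dx

    print(total, mean, second - mean ** 2)   # approx 1, 0.5, 0.05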

4. The waiting time, W, of a traveller queueing at a taxi rank is distributed according to the cumulative distribution function, G(w), defined by:

    G(w) = 0 for w < 0,  1 − (2/3) exp(−w/2) for 0 ≤ w < 2,  and 1 for w ≥ 2.

(a) Sketch the cumulative distribution function.
(b) Is the random variable W discrete, continuous or mixed?
(c) Evaluate P (W > 1), P (W = 2), P (W ≤ 1.5 | W > 0.5) and E(W ).

Solution:
(a) [Sketch of G(w): it jumps from 0 to 1/3 at w = 0, increases along 1 − (2/3)e^{−w/2} for 0 ≤ w < 2, and jumps from 1 − (2/3)e^{−1} to 1 at w = 2.]

(b) We see the distribution is mixed, with discrete ‘atoms’ at 0 and 2.

(c) We have:

    P (W > 1) = 1 − G(1) = (2/3)e^{−1/2}

    P (W = 2) = (2/3)e^{−1}

    P (W ≤ 1.5 | W > 0.5) = P (0.5 < W ≤ 1.5) / P (W > 0.5)
                          = (G(1.5) − G(0.5)) / (1 − G(0.5))
                          = ((1 − (2/3)e^{−1.5/2}) − (1 − (2/3)e^{−0.5/2})) / ((2/3)e^{−0.5/2})
                          = 1 − e^{−1/2}.

Finally, the mean is:

    E(W ) = 1/3 × 0 + (2/3)e^{−1} × 2 + ∫₀² w (1/3)e^{−w/2} dw
          = (4/3)e^{−1} + [we^{−w/2}/(3 × (−1/2))]₀² + ∫₀² (2/3)e^{−w/2} dw
          = (4/3)e^{−1} − (4/3)e^{−1} + [(2/3)e^{−w/2}/(−1/2)]₀²
          = (4/3)(1 − e^{−1}).

5. A random variable X has the following pdf:



1/4 for 0 ≤ x ≤ 1

f (x) = 3/4 for 1 < x ≤ 2

0 otherwise.

(a) Explain why f (x) can serve as a pdf.


(b) Find the mean and median of the distribution.
(c) Find the variance, Var(X).
(d) Write down the cdf of X.
(e) Find P (X = 1) and P (X > 1.5 | X > 0.5).
(f) Derive the moment generating function of X.


Solution:
R∞
(a) Clearly, f (x) ≥ 0 for all x and −∞ f (x) dx = 1. This can be seen
geometrically, since f (x) defines two rectangles, one with base 1 and height
1/4, the other with base 1 and height 3/4, giving a total area of 1/4 + 3/4 = 1.
(b) We have:
Z ∞ Z 1 Z 2  2 1  2 2
x 3x x 3x 1 3 3 5
E(X) = x f (x) dx = dx+ dx = + = + − = .
−∞ 0 4 1 4 8 0 8 1 8 2 8 4
The median is most simply found geometrically. The area to the right of the
point x = 4/3 is 0.5, i.e. the rectangle with base 2 − 4/3 = 2/3 and height 3/4,
giving an area of 2/3 × 3/4 = 1/2. Hence the median is 4/3.
(c) For the variance, we proceed as follows:
Z ∞ Z 1 2 Z 2 2  3 1  3 2
2 2 x 3x x x 1 1 11
E(X ) = x f (x) dx = dx+ dx = + = +2− = .
−∞ 0 4 1 4 12 0 4 1 12 4 6
Hence the variance is:
11 25 88 75 13
Var(X) = E(X 2 ) − (E(X))2 = − = − = ≈ 0.2708.
6 16 48 48 48
(d) The cdf is:


 0 for x<0

x/4 for 0≤x≤1
F (x) =


 3x/4 − 1/2 for 1<x≤2

1 for x > 2.
(e) P (X = 1) = 0, since the cdf is continuous, and:
P ({X > 1.5} ∩ {X > 0.5}) P (X > 1.5)
P (X > 1.5 | X > 0.5) = =
P (X > 0.5) P (X > 0.5)
0.5 × 0.75
=
1 − 0.5 × 0.25
0.375
=
0.875
3
= ≈ 0.4286.
7
(f) The moment generating function is:
Z ∞ 1 Z 2 tx
etx
Z
tX tx 3e
MX (t) = E(e ) = e f (x) dx = dx + dx
−∞ 0 4 1 4
 tx 1  tx 2
e 3e
= +
4t 0 4t 1
1 t 3
= (e − 1) + (e2t − et )
4t 4t
1
3e2t − 2et − 1 .

=
4t


6. A continuous random variable X has the following pdf:


(
x3 /4 for 0 ≤ x ≤ 2
f (x) =
0 otherwise.

(a) Explain why f (x) can serve as a pdf.


(b) Find the mean and mode of the distribution.
(c) Determine the cdf, F (x), of X.
(d) Find the variance, Var(X).
(e) Find the skewness of X, given by:

E((X − E(X))3 )
.
σ3
(f) If a sample of five observations is drawn at random from the distribution, find
the probability that all the observations exceed 1.5.

Solution:
(a) Clearly, f (x) ≥ 0 for all x and:
2  4 2
x3
Z
x
dx = = 1.
0 4 16 0

(b) The mean is:


∞ 2  5 2
x4
Z Z
x 32
E(X) = x f (x) dx = dx = = = 1.6
−∞ 0 4 20 0 20

and the mode is 2 (where the density reaches a maximum).


(c) The cdf is: 
0
 for x < 0
4
F (x) = x /16 for 0 ≤ x ≤ 2

1 for x > 2.

(d) For the variance, we first find E(X 2 ), given by:


2 2  6 2
x5
Z Z
2 2 x 64 8
E(X ) = x f (x) dx = dx = = =
0 0 4 24 0 24 3
8 64 8
⇒ Var(X) = E(X 2 ) − (E(X))2 = − = ≈ 0.1067.
3 25 75

(e) The third moment about zero is:


2 2  7 2
x6
Z Z
3 3 x 128
E(X ) = x f (x) dx = dx = = ≈ 4.5714.
0 0 4 28 0 28


Letting E(X) = µ, the numerator is:


E((X − E(X))3 ) = E(X 3 ) − 3µ E(X 2 ) + 3µ2 E(X) − µ3
= 4.5714 − (3 × 1.6 × 2.6667) + (3 × (1.6)3 ) − (1.6)3
which is −0.0368, and the denominator is (0.1067)3/2 = 0.0349, hence the
skewness is −1.0544.
(f) The probability of a single observation exceeding 1.5 is:
Z 2 Z 2 3  4 2
x x
f (x) dx = dx = = 1 − 0.3164 = 0.6836.
1.5 1.5 4 16 1.5
So the probability of all five exceeding 1.5 is, by independence:
(0.6836)5 = 0.1493.

7. Consider the function:


(
λ2 xe−λx for x ≥ 0
f (x) =
0 otherwise.
(a) Show that this function has the characteristics of a probability density
function.
(b) Evaluate E(X) and Var(X).

Solution:
(a) Clearly, f (x) ≥ 0 for all x since λ2 > 0, x ≥ 0 and e−λx ≥ 0.
R∞
To show, −∞ f (x) dx = 1, we have:
Z ∞ Z ∞
f (x) dx = λ2 xe−λx dx
−∞ 0
∞ Z ∞
e−λx e−λx

2
= λx + λ2 dx
−λ 0 0 λ
Z ∞
=0+ λe−λx dx
0

= 1 (provided λ > 0).

(b) For the mean:


Z ∞
E(X) = x λ2 xe−λx dx
0
h i∞ Z ∞
−λx
= − x λe 2
+ 2xλe−λx dx
0 0

2
=0+ (from the exponential distribution).
λ
For the variance:
Z ∞ i∞ Z ∞
−λx
h
−λx 6
2
E(X ) = 2 2
x λ xe 3
dx = − x λe + 3x2 λe−λx dx = .
0 0 0 λ2
2 2 2
So, Var(X) = 6/λ − (2/λ) = 2/λ .


8. A random variable, X, has a cumulative distribution function, F (x), defined by:



0
 for x < 0
F (x) = 1 − ae −x
for 0 ≤ x < 1

1 for x ≥ 1.

(a) Derive expressions for:


i. P (X = 0)
ii. P (X = 1)
iii. the pdf of X (where it is continuous)
iv. E(X).
(b) Suppose that E(X) = 0.75(1 − e−1 ). Evaluate the median of X and Var(X).

Solution:
(a) We have:
i. P (X = 0) = F (0) = 1 − a.
ii. P (X = 1) = lim (F (1) − F (x)) = 1 − (1 − ae−1 ) = ae−1 .
x→1
−x
iii. f (x) = ae , for 0 ≤ x < 1, and 0 otherwise.
iv. The mean is:
Z 1
−1
E(X) = 0 × (1 − a) + 1 × (ae ) + x ae−x dx
0
h i1 Z 1
= ae−1 + − xae−x + ae−x dx
0 0
h i1
= ae−1 − ae−1 + − ae−x
0

= a(1 − e−1 ).

(b) The median, m, satisfies:


 
−m 2
F (m) = 0.5 = 1 − 0.75e ⇒ m = − ln = 0.4055.
3
Recall Var(X) = E(X 2 ) − (E(X))2 , so:
Z 1
−1
2 2 2
E(X ) = 0 × (1 − a) + 1 × (ae ) + x2 ae−x dx
0
h i1 Z 1
= ae−1 + − x2 ae−x +2 xae−x dx
0 0

= ae−1 − ae−1 + 2(a − 2ae−1 )


= 2a − 4ae−1 .
Hence:
Var(X) = 2a − 4ae−1 − a2 (1 + e−2 − 2e−1 ) = 0.1716.


9. A continuous random variable, X, has a probability density function, f (x), defined


by: (
k sin(x) for 0 ≤ x ≤ π
f (x) =
0 otherwise.

(a) Determine the constant k and derive the cumulative distribution function,
F (x), of X.
(b) Find E(X) and Var(X).

Solution:
(a) We have: Z ∞ Z π
f (x) dx = k sin(x) dx = 1.
−∞ 0

Therefore: h iπ 1
k(− cos(x)) = 2k = 1 ⇒ k= .
0 2
The cdf is hence:

0
 for x < 0
F (x) = (1 − cos(x))/2 for 0 ≤ x ≤ π

1 for x > π.

(b) By symmetry, E(X) = π/2. Alternatively:


Z π iπ Z π 1
1 1h π 1h iπ π
E(X) = x sin(x) dx = x(− cos(x)) + cos(x) dx = + sin(x) = .
0 2 2 0 0 2 2 2 0 2

Next:
Z π  π Z π
2 21 1 2
E(X ) = x sin(x) dx = x (− cos(x)) + x cos(x) dx
0 2 2 0 0

π2 h iπ Z π
= + x sin(x) − sin(x) dx
2 0 0

π2 h iπ
= − − cos(x)
2 0

π2
= − 2.
2
Therefore, the variance is:

π2 π2 π2
Var(X) = E(X 2 ) − (E(X))2 = −2− = − 2.
2 4 4

10. (a) Define the cumulative distribution function (cdf) of a random variable and
state the principal properties of such a function.


(b) Identify which, if any, of the following functions could be a cdf under suitable
choices of the constants a and b. Explain why (or why not) each function
satisfies the properties required of a cdf and the constraints which may be
required in respect of the constants a and b.
i. F (x) = a(b − x)2 for −1 ≤ x ≤ 1.
ii. F (x) = a(1 − xb ) for −1 ≤ x ≤ 1.
iii. F (x) = a − b exp(−x/2) for 0 ≤ x ≤ 2.

Solution:
(a) We defined the cdf to be F (x) = P (X ≤ x) where:
• 0 ≤ F (x) ≤ 1
• F (x) is non-decreasing
• dF (x)/dx = f (x) and F (x) = ∫_{−∞}^{x} f (t) dt for continuous X
• F (x) → 0 as x → −∞ and F (x) → 1 as x → ∞.
(b) i. Okay. a = 0.25 and b = −1.
ii. Not okay. At x = 1, F (x) = 0, which would mean a decreasing function.
iii. Okay. a = b > 0 and b = (1 − e−1 )−1 .

11. Suppose that random variable X has the range {x1 , x2 , . . .}, where x1 < x2 < · · · .
Prove the following results:

    Σ_{i=1}^{∞} p(xi) = 1,   p(xk) = F (xk) − F (xk−1),   F (xk) = Σ_{i=1}^{k} p(xi).

Solution:
The events X = x1 , X = x2 , . . . are disjoint, so we can write:

    Σ_{i=1}^{∞} p(xi) = Σ_{i=1}^{∞} P (X = xi) = P (X = x1 ∪ X = x2 ∪ · · · ) = P (S) = 1.

In words, this result states that the sum of the probabilities of all the possible
values X can take is equal to 1.
For the second equation, we have:

F (xk ) = P (X ≤ xk ) = P (X = xk ∪ X ≤ xk−1 ).

The two events on the right-hand side are disjoint, so:

F (xk ) = P (X = xk ) + P (X ≤ xk−1 ) = p(xk ) + F (xk−1 )


which immediately gives the required result.


For the final result, we can write:

    F (xk) = P (X ≤ xk) = P (X = x1 ∪ X = x2 ∪ · · · ∪ X = xk) = Σ_{i=1}^{k} p(xi).

12. At a charity event, the organisers sell 100 tickets to a raffle. At the end of the
event, one of the tickets is selected at random and the person with that number
wins a prize. Carol buys ticket number 22. Janet buys tickets numbered 1–5. What
is the probability for each of them to win the prize?

Solution:
Let X denote the number on the winning ticket. Since all values between 1 and 100
are equally likely, X has a discrete ‘uniform’ distribution such that:

1
P (‘Carol wins’) = P (X = 22) = p(22) = = 0.01
100
and:
5
P (‘Janet wins’) = P (X ≤ 5) = F (5) = = 0.05.
100

13. What is the expectation of the random variable X if the only possible value it can
take is c?

Solution:
We have p(c) = 1, so X is effectively a constant, even though it is called a random
variable. Its expectation is:
    E(X) = Σ_{∀x} x p(x) = c p(c) = c × 1 = c.     (C.1)

This is intuitively correct; on average, a constant must be equal to itself!

14. Show that E(X − E(X)) = 0.

Solution:
We have:
E(X − E(X)) = E(X) − E(E(X))
Since E(X) is just a number, as opposed to a random variable, (C.1) tells us that
its expectation is equal to itself. Therefore, we can write:

E(X − E(X)) = E(X) − E(X) = 0.


15. Show that if Var(X) = 0 then p(µ) = 1. (We say in this case that X is almost
surely equal to its mean.)
Solution:
From the definition of variance, we have:
    Var(X) = E((X − µ)²) = Σ_{∀x} (x − µ)² p(x) ≥ 0

because the squared term (x − µ)2 is non-negative (as is p(x)). The only case where
it is equal to 0 is when x − µ = 0, that is, when x = µ. Therefore, the random
variable X can only take the value µ, and we have p(µ) = P (X = µ) = 1.

C.2 Practice questions


Try to solve the questions before looking at the solutions – promise?! Solutions are
located in Appendix F.

1. Construct suitable examples to show that for a random variable X:


(a) E(X 2 ) 6= (E(X))2 in general
(b) E(1/X) 6= 1/E(X) in general.

2. (a) Let X be a random variable. Show that:

Var(X) = E(X(X − 1)) − E(X)(E(X) − 1).

(b) Let X1 , X2 , . . . , Xn be independent random variables. Assume that all have a


mean of µ and a variance of σ 2 . Find expressions for the mean and variance of
the random variable (X1 + X2 + · · · + Xn )/n.

3. A doctor wishes to procure subjects possessing a certain chromosome abnormality


which is present in 4% of the population. How many randomly chosen independent
subjects should be procured if the doctor wishes to be 95% confident that at least
one subject has the abnormality?

4. In an investigation of animal behaviour, rats have to choose between four doors.


One of them, behind which is food, is ‘correct’. If an incorrect choice is made, the
rat is returned to the starting point and chooses again, continuing as long as
necessary until the correct choice is made. The random variable X is the serial
number of the trial on which the correct choice is made.
Find the probability function and expectation of X under each of the following
hypotheses:
(a) each door is equally likely to be chosen on each trial, and all trials are
mutually independent


(b) at each trial, the rat chooses with equal probability between the doors which it
has not so far tried
(c) the rat never chooses the same door on two successive trials, but otherwise
chooses at random with equal probabilities.

The death of one man is a tragedy. The death of millions is a statistic.


(Stalin to Churchill, Potsdam 1945)

Appendix D
Common distributions of random
variables

D.1 Worked examples


1. The random variable X has a binomial distribution with parameters n and π.
Derive expressions for:
(a) E(X)
(b) E(X(X − 1))
(c) E(X(X − 1) · · · (X − r)).

Solution:
(a) We have (writing C(n, x) for the binomial coefficient ‘n choose x’):

    E(X) = Σ_{x=0}^{n} x C(n, x) π^x (1 − π)^{n−x}
         = Σ_{x=1}^{n} x C(n, x) π^x (1 − π)^{n−x}
         = Σ_{x=1}^{n} [n(n − 1)! / ((x − 1)! ((n − 1) − (x − 1))!)] π π^{x−1} (1 − π)^{n−x}
         = nπ Σ_{x=1}^{n} C(n − 1, x − 1) π^{x−1} (1 − π)^{(n−1)−(x−1)}
         = nπ Σ_{y=0}^{n−1} C(n − 1, y) π^y (1 − π)^{(n−1)−y}
         = nπ.

(b) We have:

    E(X(X − 1)) = Σ_{x=0}^{n} x(x − 1) C(n, x) π^x (1 − π)^{n−x}
                = Σ_{x=2}^{n} x(x − 1) C(n, x) π^x (1 − π)^{n−x}
                = Σ_{x=2}^{n} [n(n − 1)(n − 2)! / ((x − 2)! ((n − 2) − (x − 2))!)] π² π^{x−2} (1 − π)^{n−x}
                = n(n − 1)π² Σ_{x=2}^{n} C(n − 2, x − 2) π^{x−2} (1 − π)^{(n−2)−(x−2)}
                = n(n − 1)π² Σ_{y=0}^{n−2} C(n − 2, y) π^y (1 − π)^{(n−2)−y}
                = n(n − 1)π².

(c) We have (if r < n):

    E(X(X − 1) · · · (X − r)) = Σ_{x=0}^{n} x(x − 1) · · · (x − r) C(n, x) π^x (1 − π)^{n−x}
                             = Σ_{x=r+1}^{n} x(x − 1) · · · (x − r) C(n, x) π^x (1 − π)^{n−x}
                             = n(n − 1) · · · (n − r)π^{r+1}
                                   × Σ_{x=r+1}^{n} C(n − (r + 1), x − (r + 1)) π^{x−(r+1)} (1 − π)^{(n−(r+1))−(x−(r+1))}
                             = n(n − 1) · · · (n − r)π^{r+1}.

2. Suppose {Bi} is an infinite sequence of independent Bernoulli trials with:

    P (Bi = 0) = 1 − π  and  P (Bi = 1) = π

for all i.

(a) Derive the distribution of Xn = Σ_{i=1}^{n} Bi and the expected value and variance of Xn.
(b) Let Y = min{i : Bi = 1}. Derive the distribution of Y and obtain an expression for P (Y > y).

Solution:
(a) Xn = Σ_{i=1}^{n} Bi takes the values 0, 1, 2, . . . , n. Any sequence consisting of x 1s and n − x 0s has a probability π^x (1 − π)^{n−x} and gives a value Xn = x. There are C(n, x) such sequences, so:

    P (Xn = x) = C(n, x) π^x (1 − π)^{n−x}

and 0 otherwise. Hence E(Bi) = π and Var(Bi) = π(1 − π), which means E(Xn) = nπ and Var(Xn) = nπ(1 − π).

(b) Y = min{i : Bi = 1} takes the values 1, 2, . . ., hence:

    P (Y = y) = (1 − π)^{y−1} π

and 0 otherwise. It follows that P (Y > y) = (1 − π)^y.

3. A continuous random variable X has the gamma distribution, denoted


X ∼ Gamma(α, β), if its probability density function (pdf) is of the form:

    f (x) = (β^α / Γ(α)) x^{α−1} e^{−βx}  for x > 0     (D.1)

and 0 otherwise, where α > 0 and β > 0 are parameters, and Γ(α) is the value of the gamma function such that:

    Γ(α) = ∫₀^∞ x^{α−1} e^{−x} dx.

The gamma function has a finite value for all α > 0. Two of its properties are that:
• Γ(1) = 1
• Γ(α) = (α − 1) Γ(α − 1) for all α > 1.
(a) The function f (x) defined by (D.1) satisfies all the conditions for being a pdf. Show that this implies the following result about an integral:

    ∫₀^∞ x^{α−1} e^{−βx} dx = Γ(α)/β^α  for any α > 0, β > 0.

(b) The Gamma(1, β) distribution is the same as another distribution with a


different name. What is this other distribution? Justify your answer.
(c) Show that if X ∼ Gamma(α, β), the moment generating function of X is:

    MX(t) = (β/(β − t))^α

which is defined when t < β.


(d) Suppose that X ∼ Gamma(α, β). Derive the expected value of X:
i. using the pdf and the definition of the expected value
ii. using the moment generating function.
(e) If X1 , X2 , . . . , Xk are independent random variables such that
Xi ∼ Gamma(αi , β) for i = 1, 2, . . . , k, then:
    Σ_{i=1}^{k} Xi ∼ Gamma(Σ_{i=1}^{k} αi, β).

Using this result and the known properties of the exponential distribution,
derive the expected value of X ∼ Gamma(α, β) when α is a positive integer
(i.e. α = 1, 2, . . .).


Solution:
(a) This
R ∞ follows immediately from the general property of pdfs that
−∞
f (x) dx = 1, applied to the specific pdf here. We have:

Γ(α) ∞ β α α−1 −βx


Z Z ∞
Γ(α)
α
= α x e dx = xα−1 e−βx dx.
β β 0 Γ(α) 0

(b) With α = 1, the pdf becomes f (x) = βe−βx for x ≥ 0, and 0 otherwise. This is
the pdf of the exponential distribution with parameter β, i.e. X ∼ Exp(β).
(c) We have:
∞ ∞
β α α−1 −βx
Z Z
tX tx
MX (t) = E(e ) = e f (x) dx = etx x e dx
0 0 Γ(α)

βα
Z
= etx xα−1 e−βx dx
Γ(α) 0

βα
Z
= xα−1 e−(β−t)x dx
Γ(α) 0

βα Γ(α)
= ×
Γ(α) (β − t)α
 α
β
=
β−t
which is finite when β − t > 0, i.e. when t < β. The second-to-last step follows
by substituting β − t for β in the result in (a).
(d) i. We have:
∞ ∞
β α α−1 −βx
Z Z
E(X) = x f (x) dx = x x e dx
−∞ 0 Γ(α)
Z ∞
βα
= x(α+1)−1 e−βx dx
Γ(α) 0

β α Γ(α + 1)
=
Γ(α) β α+1
β α αΓ(α)
=
Γ(α) β α+1
α
=
β
using (a) and the gamma function property stated in the question.
ii. The first derivative of MX (t) is:
 α−1
β β
MX0 (t) =α .
β−t (β − t)2
Therefore:
α
E(X) = MX0 (0) = .
β


(e) When α is a positive integer, by the result stated in the question, we have

X= Yi , where Y1 , Y2 , . . . , Yα are independent random variables each
i=1
distributed as Gamma(1, β), i.e. as exponential with parameter β as concluded
in (b). The expected value of the exponential distribution can be taken as
given from the lectures, so E(Yi ) = 1/β for each i = 1, 2, . . . , α. Therefore,
using the general result on expected values of sums:
α
! α
X X 1 α
E(X) = E Yi = E(Yi ) = α × = .
i=1 i=1
β β

4. James enjoys playing Solitaire on his laptop. One day, he plays the game
repeatedly. He has found, from experience, that the probability of success in any
game is 1/3 and is independent of the outcomes of other games.
(a) What is the probability that his first success occurs in the fourth game he
plays? What is the expected number of games he needs to play to achieve his
first success?
(b) What is the probability of three successes in ten games? What is the expected
number of successes in ten games?
(c) Use a suitable approximation to find the probability of less than 25 successes
in 100 games. You should justify the use of the approximation.
(d) What is the probability that his third success occurs in the tenth game he
plays?

Solution:
(a) P (first success in 4th game) = (2/3)³ × (1/3) = 8/81 ≈ 0.1. This is a geometric distribution, for which E(X) = 1/π = 1/(1/3) = 3.

(b) Use X ∼ Bin(10, 1/3), such that E(X) = 10 × 1/3 = 3.33, and:

    P (X = 3) = C(10, 3) (1/3)³ (2/3)⁷ ≈ 0.2601.

(c) Approximate Bin(100, 1/3) by:

    N(100 × 1/3, 100 × 1/3 × 2/3) = N(33.3, 200/9).

The approximation seems reasonable since n = 100 is ‘large’, π = 1/3 is quite close to 0.5, nπ > 5 and n(1 − π) > 5. Using a continuity correction:

    P (X ≤ 24.5) = P (Z ≤ (24.5 − 33.3)/√(200/9)) = P (Z ≤ −1.87) ≈ 0.0307.

(d) This is a negative binomial distribution (used for the trial number of the kth success) with a pf given by:

    p(x) = C(x − 1, k − 1) π^k (1 − π)^{x−k}  for x = k, k + 1, k + 2, . . .

and 0 otherwise. Hence we require:

    P (X = 10) = C(9, 2) (1/3)³ (2/3)⁷ ≈ 0.0780.

Alternatively, you could calculate the probability of 2 successes in 9 trials, followed by a further success.
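The probabilities in this question can be reproduced with a few lines of Python (a minimal sketch of our own, using only the standard library):

    import math

    p = 1 / 3

    # (a) first success in the fourth game (geometric distribution)
    print((1 - p) ** 3 * p)                                   # approx 0.0988

    # (b) exactly three successes in ten games (binomial distribution)
    print(math.comb(10, 3) * p ** 3 * (1 - p) ** 7)           # approx 0.2601

    # (c) normal approximation with continuity correction for fewer than 25 successes in 100 games
    mu, var = 100 * p, 100 * p * (1 - p)
    z = (24.5 - mu) / math.sqrt(var)
    print(0.5 * (1 + math.erf(z / math.sqrt(2))))             # approx 0.031, cf. 0.0307 above

    # (d) third success in the tenth game (negative binomial distribution)
    print(math.comb(9, 2) * p ** 3 * (1 - p) ** 7)            # approx 0.0780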

5. You may assume that 15% of individuals in a large population are left-handed.
(a) If a random sample of 40 individuals is taken, find the probability that exactly
6 are left-handed.
(b) If a random sample of 400 individuals is taken, find the probability that
exactly 60 are left-handed by using a suitable approximation. Briefly discuss
the appropriateness of the approximation.
(c) What is the smallest possible size of a randomly chosen sample if we wish to
be 99% sure of finding at least one left-handed individual in the sample?

Solution:

(a) Let X ∼ Bin(40, 0.15), hence:


 
40
P (X = 6) = × (0.15)6 × (0.85)34 = 0.1742.
6

(b) Use a normal approximation with a continuity correction. We require


P (59.5 < X < 60.5), where X ∼ N (60, 51) since X has mean nπ and variance
nπ(1 − π) with n = 400 and π = 0.15. Standardising, this is
2 × P (0 < Z ≤ 0.07) = 0.0558, approximately.
Rules-of-thumb for use of the approximation are that n is ‘large’, π is close to
0.5, and nπ and n(1 − π) are both at least 5. The first and last of these
definitely hold. There is some doubt whether a value of 0.15 can be considered
close to 0.5, so use with caution!
(c) Given a sample of size n, P (no left-handers) = (0.85)n . Therefore:

P (at least 1 left-hander) = 1 − (0.85)n .

We require 1 − (0.85)n > 0.99, or (0.85)n < 0.01. This gives:


 n
1
100 <
0.85
or:
ln(100)
n> = 28.34.
ln(1.1765)
Rounding up, this gives a sample size of 29.


6. Show that the moment generating function (mgf) of a Poisson distribution with
parameter λ is given by:
MX (t) = exp(λ(exp(t) − 1)), writing exp(θ) ≡ e^θ.
Hence show that the mean and variance of the distribution are both λ.
Solution:
We have:

    MX(t) = E(exp(Xt)) = Σ_{x=0}^{∞} exp(xt) (λ^x/x!) exp(−λ)
          = exp(−λ) Σ_{x=0}^{∞} (λ exp(t))^x / x!
          = exp(−λ) exp(λ exp(t))
          = exp(λ(exp(t) − 1)).

We have that MX(0) = exp(0) = 1. Now, taking logs:

    ln MX(t) = λ(exp(t) − 1).

Now differentiate:

    M′X(t)/MX(t) = λ exp(t)  ⇒  M′X(t) = MX(t)λ exp(t).

Differentiating again, we get:

    M″X(t) = M′X(t)λ exp(t) + MX(t)λ exp(t).

We note E(X) = M′X(0) = MX(0)λ exp(0) = λ, also:

    Var(X) = M″X(0) − (M′X(0))² = λ² + λ − λ² = λ.

7. In javelin throwing competitions, the throws of athlete A are normally distributed.


It has been found that 15% of her throws exceed 43 metres, while 3% exceed 45
metres. What distance will be exceeded by 90% of her throws?
Solution:
Suppose X ∼ N (µ, σ 2 ) is the random variable for throws. P (X > 43) = 0.15 leads
to µ = 43 − 1.035 × σ (using Table 3 of Murdoch and Barnes’ Statistical Tables).
Similarly, P (X > 45) = 0.03 leads to µ = 45 − 1.88 × σ. Solving yields µ = 40.55
and σ = 2.367, hence X ∼ N (40.55, (2.367)2 ). So:
    P (X > x) = 0.90  ⇒  (x − 40.55)/2.367 = −1.28.
Hence x = 37.52 metres.


8. People entering an art gallery are counted by the attendant at the door. Assume
that people arrive in accordance with a Poisson distribution, with one person
arriving every 2 minutes. The attendant leaves the door unattended for 5 minutes.
(a) Calculate the probability that:
i. nobody will enter the gallery in this time
ii. 3 or more people will enter the gallery in this time.
(b) Find, to the nearest second, the length of time for which the attendant could
leave the door unattended for there to be a probability of 0.90 of no arrivals in
that time.
(c) Comment briefly on the assumption of a Poisson distribution in this context.

Solution:
(a) λ = 1 for a two-minute interval, so λ = 2.5 for a five-minute interval. Therefore:
P (no arrivals) = e−2.5 = 0.0821
and:
P (≥ 3 arrivals) = 1−pX (0)−pX (1)−pX (2) = 1−e−2.5 (1+2.5+3.125) = 0.4562.

(b) For an interval of N minutes, the parameter is N/2. We need p(0) = 0.90, so
e−N/2 = 0.90 giving N/2 = − ln(0.90) and N = 0.21 minutes, or 13 seconds.
(c) The rate is unlikely to be constant: more people at lunchtimes or early
evenings etc. Likely to be several arrivals in a small period – couples, groups
etc. Quite unlikely the Poisson will provide a good model.

9. The random variable Y, representing the life-span of an electronic component, is distributed according to a probability density function f (y), where y > 0. The survivor function, S, is defined as S(y) = P (Y > y) and the age-specific failure rate, φ(y), is defined as f (y)/S(y). Suppose f (y) = λe^{−λy}, i.e. Y ∼ Exp(λ).

(a) Derive expressions for S(y) and φ(y).
(b) Comment briefly on the implications of the age-specific failure rate you have derived in the context of the exponentially-distributed component life-spans.

Solution:
(a) The survivor function is:

    S(y) = P (Y > y) = ∫_y^∞ λe^{−λx} dx = [−e^{−λx}]_y^∞ = e^{−λy}.

The age-specific failure rate is:

    φ(y) = f (y)/S(y) = λe^{−λy}/e^{−λy} = λ.

(b) The age-specific failure rate is constant, indicating it does not vary with age. This is unlikely to be true in practice!


10. For the binomial distribution with a probability of success of 0.25 in an individual
trial, calculate the probability that, in 50 trials, there are at least 8 successes:
(a) using the normal approximation without a continuity correction
(b) using the normal approximation with a continuity correction.
Compare these results with the exact probability of 0.9547 and comment.
Solution:
We seek P (X ≥ 8) using the normal approximation Y ∼ N (12.5, 9.375).
(a) So, without a continuity correction:

    P (Y ≥ 8) = P (Z ≥ (8 − 12.5)/√9.375) = P (Z ≥ −1.47) = 0.9292.

The required probability could have been expressed as P (X > 7), or indeed any number in [7, 8), for example:

    P (Y > 7) = P (Z ≥ (7 − 12.5)/√9.375) = P (Z ≥ −1.80) = 0.9641.

(b) With a continuity correction:

    P (Y > 7.5) = P (Z ≥ (7.5 − 12.5)/√9.375) = P (Z ≥ −1.63) = 0.9484.

Compared to 0.9547, using the continuity correction yields the closer


approximation.
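A minimal Python sketch (our own illustration, using only the standard library) reproduces the exact probability and both approximations:

    import math

    n, p = 50, 0.25
    mu, sd = n * p, math.sqrt(n * p * (1 - p))

    def phi(z):
        """Standard normal cdf via the error function."""
        return 0.5 * (1 + math.erf(z / math.sqrt(2)))

    exact = sum(math.comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(8, n + 1))
    no_cc = 1 - phi((8 - mu) / sd)        # without continuity correction
    with_cc = 1 - phi((7.5 - mu) / sd)    # with continuity correction

    print(exact, no_cc, with_cc)          # approx 0.9547, 0.929 and 0.949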

11. A greengrocer has a very large pile of oranges on his stall. The pile of fruit is a
mixture of 50% old fruit with 50% new fruit; one cannot tell which are old and
which are new. However, 20% of old oranges are mouldy inside, but only 10% of
new oranges are mouldy. Suppose that you choose 5 oranges at random. What is
the distribution of the number of mouldy oranges in your sample?
Solution:
For an orange chosen at random, the event ‘mouldy’ is the union of the disjoint
events ‘mouldy’ ∩ ‘new’ and ‘mouldy’ ∩ ‘old’. So:

P (‘mouldy’) = P (‘mouldy’ ∩ ‘new’) + P (‘mouldy’ ∩ ‘old’)


= P (‘mouldy’ | ‘new’) P (‘new’) + P (‘mouldy’ | ‘old’) P (‘old’)
= 0.1 × 0.5 + 0.2 × 0.5
= 0.15.

As the pile of oranges is very large, we can assume that the results for the five
oranges will be independent, so we have 5 independent trials each with probability
of ‘mouldy’ equal to 0.15. The distribution of the number of mouldy oranges will be
a binomial distribution with n = 5 and π = 0.15.


12. Underground trains on the Northern line have a probability 0.05 of failure between
Golders Green and King’s Cross. Supposing that the failures are all independent,
what is the probability that out of 10 journeys between Golders Green and King’s
Cross more than 8 do not have a breakdown?
Solution:
The probability of no breakdown on one journey is π = 1 − 0.05 = 0.95, so the
number of journeys without a breakdown, X, has a Bin(10, 0.95) distribution. We
want P (X > 8), which is:
P (X > 8) = p(9) + p(10)
   
10 9 1 10
= × (0.95) × (0.05) + × (0.95)10 × (0.05)0
9 10
= 0.3151 + 0.5987
= 0.9138.

13. Suppose that the normal rate of infection for a certain disease in cattle is 25%. To
test a new serum which may prevent infection, three experiments are carried out.
The test for infection is not always valid for some particular cattle, so the
experimental results are incomplete – we cannot always tell whether a cow is
infected or not. The results of the three experiments are:
(a) 10 animals are injected; all 10 remain free from infection
(b) 17 animals are injected; more than 15 remain free from infection and there are
2 doubtful cases
(c) 23 animals are infected; more than 20 remain free from infection and there are
three doubtful cases.
Which experiment provides the strongest evidence in favour of the serum?
Solution:
These experiments involve tests on different cattle, which one might expect to
behave independently of one another. The probability of infection without injection
with the serum might also reasonably be assumed to be the same for all cattle. So
the distribution which we need here is the binomial distribution. If the serum has
no effect, then the probability of infection for each of the cattle is 0.25.
One way to assess the evidence of the three experiments is to calculate the
probability of the result of the experiment if the serum had no effect at all. If it has
an effect, then one would expect larger numbers of cattle to remain free from
infection, so the experimental results as given do provide some clue as to whether
the serum has an effect, in spite of their incompleteness.
Let X(n) be the number of cattle infected, out of a sample of n. We are assuming
that X(n) ∼ Bin(n, 0.25).
(a) With 10 trials, the probability of 0 infected if the serum has no effect is:
 
10
P (X(10) = 0) = × (0.75)10 = (0.75)10 = 0.0563.
0


(b) With 17 trials, the probability of more than 15 remaining uninfected if the
serum has no effect is:

P (X(17) < 2) = P (X(17) = 0) + P (X(17) = 1)


   
17 17 17
= × (0.75) + × (0.25)1 × (0.75)16
0 1
= (0.75)17 + 17 × (0.25)1 × (0.75)16
= 0.0075 + 0.0426
= 0.0501.

(c) With 23 trials, the probability of more than 20 remaining free from infection if
the serum has no effect is:

P (X(23) < 3) = P (X(23) = 0) + P (X(23) = 1) + P (X(23) = 2)


   
23 23 23
= × (0.75) + × (0.25)1 × (0.75)22
0 1
 
23
+ × (0.25)2 × (0.75)21
2
23 × 22
= 0.7523 + 23 × 0.25 × (0.75)22 + × (0.25)2 × (0.75)21
2
= 0.0013 + 0.0103 + 0.0376
= 0.0492.

The most surprising-looking event in these three experiments is that of experiment


3, and so we can say that this experiment offered the most support for the use of
the serum.

14. In a large industrial plant there is an accident on average every two days.
(a) What is the chance that there will be exactly two accidents in a given week?
(b) What is the chance that there will be two or more accidents in a given week?
(c) If James goes to work there for a four-week period, what is the probability
that no accidents occur while he is there?

Solution:
Here we have counts of random events over time, which is a typical application for
the Poisson distribution. We are assuming that accidents are equally likely to occur
at any time and are independent. The mean for the Poisson distribution is 0.5 per
day.
Let X be the number of accidents in a week. The probability of exactly two
accidents in a given week is found by using the parameter λ = 5 × 0.5 = 2.5 (5
working days a week assumed).


(a) The probability of exactly two accidents in a week is:


e−2.5 (2.5)2
p(2) = = 0.2565.
2!
(b) The probability of two or more accidents in a given week is:
P (X ≥ 2) = 1 − p(0) − p(1) = 0.7127.

(c) If James goes to the industrial plant and does not change the probability of an
accident simply by being there (he might bring bad luck, or be superbly
safety-conscious!), then over 4 weeks there are 20 working days, and the
probability of no accident comes from a Poisson random variable with mean
10. If Y is the number of accidents while James is there, the probability of no
accidents is:
e−10 (10)0
pY (0) = = 0.0000454.
0!
James is very likely to be there when there is an accident!

15. The chance that a lottery ticket has a winning number is 0.0000001.
(a) If 10,000,000 people buy tickets which are independently numbered, what is
the probability there is no winner?
(b) What is the probability that there is exactly 1 winner?
(c) What is the probability that there are exactly 2 winners?

Solution:
The number of winning tickets, X, will be distributed as:
X ∼ Bin(10,000,000, 0.0000001).
Since n is large and π is small, the Poisson distribution should provide a good
approximation. The Poisson parameter is:
λ = nπ = 10,000,000 × 0.0000001 = 1
and so we set X ∼ Pois(1). We have:
e−1 10 e−1 11 e−1 12
p(0) = = 0.3679, p(1) = = 0.3679 and p(2) = = 0.1839.
0! 1! 2!
Using the exact binomial distribution of X, the results are:
(10)7
 
7
p(0) = × ((10)−7 )0 × (1 − (10)−7 )(10) = 0.3679
0
(10)7
 
7
p(1) = × ((10)−7 )1 × (1 − (10)−7 )(10) −1 = 0.3679
1
(10)7
 
7
p(2) = × ((10)−7 )2 × (1 − (10)−7 )(10) −2 = 0.1839.
2
Notice that, in this case, the Poisson approximation is correct to at least 4 decimal
places.


16. Suppose that X ∼ Uniform[0, 1]. Compute P (X > 0.2), P (X ≥ 0.2) and
P (X 2 > 0.04).
Solution:
We have a = 0 and b = 1, and can use the formula for P (c < X ≤ d), for constants
c and d. Hence:
1 − 0.2
P (X > 0.2) = P (0.2 < X ≤ 1) = = 0.8.
1−0
Also:
P (X ≥ 0.2) = P (X = 0.2) + P (X > 0.2) = 0 + P (X > 0.2) = 0.8.
Finally:
P (X 2 > 0.04) = P (X < −0.2) + P (X > 0.2) = 0 + P (X > 0.2) = 0.8.

17. Suppose that the service time for a customer at a fast food outlet has an
exponential distribution with parameter 1/3 (customers per minute). What is the
probability that a customer waits more than 4 minutes?
Solution:
The distribution of X is Exp(1/3), so the probability is:
P (X > 4) = 1 − F (4) = 1 − (1 − e−(1/3)×4 ) = 1 − 0.7364 = 0.2636.

18. Suppose that the distribution of men’s heights in London, measured in cm, is
N (175, 62 ). Find the proportion of men whose height is:
(a) under 169 cm
(b) over 190 cm
(c) between 169 cm and 190 cm.

Solution:
The values of interest are 169 and 190. The corresponding z-values are:
169 − 175 190 − 175
z1 = = −1 and z2 = = 2.5.
6 6
Using values from Table 3 of Murdoch and Barnes’ Statistical Tables, we have:
P (X < 169) = P (Z < −1) = Φ(−1)
= 1 − Φ(1) = 1 − 0.8413 = 0.1587

P (X > 190) = P (Z > 2.5) = 1 − Φ(2.5)


= 1 − 0.9938 = 0.0062
and:
P (169 < X < 190) = P (−1 < Z < 2.5) = Φ(2.5) − Φ(−1)
= 0.9938 − 0.1587 = 0.8351.


19. Two statisticians disagree about the distribution of IQ scores for a population
under study. Both agree that the distribution is normal, and that σ = 15, but A
says that 5% of the population have IQ scores greater than 134.6735, whereas B
says that 10% of the population have IQ scores greater than 109.224. What is the
difference between the mean IQ score as assessed by A and that as assessed by B?
Solution:
The standardised z-value giving 5% in the upper tail is 1.6449, and for 10% it is
1.2816. So, converting to the scale for IQ scores, the values are:

1.6449 × 15 = 24.6735 and 1.2816 × 15 = 19.224.

Write the means according to A and B as µA and µB , respectively. Therefore:

µA + 24.6735 = 134.6735

so:
µA = 110
whereas:
µB + 19.224 = 109.224
so µB = 90. The difference µA − µB = 110 − 90 = 20.

D.2 Practice questions


Try to solve the questions before looking at the solutions – promise?! Solutions are
located in Appendix F.

1. At one stage in the manufacture of an article a piston of circular cross-section has


to fit into a similarly-shaped cylinder. The distributions of diameters of pistons and
cylinders are known to be normal with parameters as follows.
• Piston diameters: mean 10.42 cm, standard deviation 0.03 cm.
• Cylinder diameters: mean 10.52 cm, standard deviation 0.04 cm.
If pairs of pistons and cylinders are selected at random for assembly, for what
proportion will the piston not fit into the cylinder (i.e. for which the piston
diameter exceeds the cylinder diameter)?
(a) What is the chance that in 100 pairs, selected at random:
i. every piston will fit?
ii. not more than two of the pistons will fail to fit?
(b) Calculate both of these probabilities:
i. exactly
ii. using a Poisson approximation.
Discuss the appropriateness of using this approximation.


2. If X has the discrete uniform distribution such that P (X = i) = 1/k for


i = 1, 2, . . . , k, show that its moment generating function is:
MX(t) = e^t (1 − e^{kt}) / (k(1 − e^t)).
(Do not attempt to find the mean and variance using the mgf.)

3. Let f (z) be defined as:


f(z) = (1/2) e^{−|z|}  for all real values of z.

(a) Sketch f (z) and explain why it can serve as the pdf for a random variable Z.
(b) Determine the moment generating function of Z.
(c) Use the mgf to find E(Z), Var(Z), E(Z 3 ) and E(Z 4 ).
(You may assume that −1 < t < 1, for the mgf, which will ensure convergence.)

4. Show that for a binomial random variable X ∼ Bin(n, π), then:


E(X) = nπ Σ_{x=1}^{n} [(n − 1)! / ((x − 1)! (n − x)!)] π^{x−1} (1 − π)^{n−x}.

Hence find E(X) and Var(X). (The wording of the question implies that you use
the result which you have just proved. Other methods of derivation will not be
accepted!)

5. Cars independently pass a point on a busy road at an average rate of 150 per hour.
(a) Assuming a Poisson distribution, find the probability that none passes in a
given minute.
(b) What is the expected number passing in two minutes?
(c) Find the probability that the expected number actually passes in a given
two-minute period.

6. James goes fishing every Saturday. The number of fish he catches follows a Poisson
distribution. On a proportion π of the days he goes fishing, he does not catch
anything. He makes it a rule to take home the first, and then every other, fish
which he catches, i.e. the first, third, fifth fish etc.
(a) Using a Poisson distribution, find the mean number of fish he catches.
(b) Show that the probability that he takes home the last fish he catches is
(1 − π 2 )/2.

There are two kinds of statistics, the kind you look up and the kind you make
up.
(Rex Stout)

Appendix E
Multivariate random variables

E.1 Worked examples


1. X and Y are independent random variables with distributions as follows:
X=x 0 1 2 Y =y 1 2
pX (x) 0.4 0.2 0.4 pY (y) 0.4 0.6
The random variables W and Z are defined by W = 2X and Z = Y − X,
respectively.
(a) Compute the joint distribution of W and Z.
(b) Evaluate P (W = 2 | Z = 1), E(W | Z = 0) and Cov(W, Z).

Solution:
(a) The joint distribution (with marginal probabilities) is:
W =w
0 2 4 pZ (z)
−1 0.00 0.00 0.16 0.16
Z=z 0 0.00 0.08 0.24 0.32
1 0.16 0.12 0.00 0.28
2 0.24 0.00 0.00 0.24
pW (w) 0.40 0.20 0.40 1.00
(b) It is straightforward to see that:
P(W = 2 | Z = 1) = P(W = 2 ∩ Z = 1) / P(Z = 1) = 0.12/0.28 = 3/7.
For E(W | Z = 0), we have:
E(W | Z = 0) = Σ_w w P(W = w | Z = 0) = 0 × 0.00/0.32 + 2 × 0.08/0.32 + 4 × 0.24/0.32 = 3.5.
We see E(W ) = 2 (by symmetry), and:
E(Z) = −1 × 0.16 + 0 × 0.32 + 1 × 0.28 + 2 × 0.24 = 0.6.
Also:
E(WZ) = Σ_w Σ_z wz p(w, z) = −4 × 0.16 + 2 × 0.12 = −0.4

hence:
Cov(W, Z) = E(W Z) − E(W ) E(Z) = −0.4 − 2 × 0.6 = −1.6.
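These joint-distribution calculations are easy to reproduce by brute force; the following Python sketch (an optional check, not part of the original solution) enumerates all (x, y) pairs and recovers the conditional probability, the conditional expectation and the covariance found above:

    pX = {0: 0.4, 1: 0.2, 2: 0.4}
    pY = {1: 0.4, 2: 0.6}

    # build the joint distribution of W = 2X and Z = Y - X
    joint = {}
    for x, px in pX.items():
        for y, py in pY.items():
            w, z = 2 * x, y - x
            joint[(w, z)] = joint.get((w, z), 0) + px * py

    p_z1 = sum(p for (w, z), p in joint.items() if z == 1)
    print(joint.get((2, 1), 0) / p_z1)                                  # P(W=2 | Z=1) = 3/7

    p_z0 = sum(p for (w, z), p in joint.items() if z == 0)
    print(sum(w * p for (w, z), p in joint.items() if z == 0) / p_z0)   # E(W | Z=0) = 3.5

    EW = sum(w * p for (w, z), p in joint.items())
    EZ = sum(z * p for (w, z), p in joint.items())
    EWZ = sum(w * z * p for (w, z), p in joint.items())
    print(EWZ - EW * EZ)                                                # Cov(W, Z) = -1.6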


2. The joint probability distribution of the random variables X and Y is:

X=x
−1 0 1
−1 0.05 0.15 0.10
Y =y 0 0.10 0.05 0.25
1 0.10 0.05 0.15

(a) Identify the marginal distributions of X and Y and the conditional


distribution of X given Y = 1.
(b) Evaluate E(X | Y = 1) and the correlation coefficient of X and Y .
(c) Are X and Y independent random variables?

Solution:

(a) The marginal and conditional distributions are, respectively:


X=x −1 0 1 Y =y −1 0 1
pX (x) 0.25 0.25 0.50 pY (y) 0.30 0.40 0.30

X = x|Y = 1 −1 0 1
pX|Y =1 (x | Y = 1) 1/3 1/6 1/2
(b) From the conditional distribution we see:

E(X | Y = 1) = −1 × 1/3 + 0 × 1/6 + 1 × 1/2 = 1/6.
E(Y ) = 0 (by symmetry), and so Var(Y ) = E(Y 2 ) = 0.6.
E(X) = 0.25 and:

Var(X) = E(X 2 ) − (E(X))2 = 0.75 − (0.25)2 = 0.6875.

(Note that Var(X) and Var(Y ) are not strictly necessary here!)
Next:
E(XY) = Σ_x Σ_y xy p(x, y)

= (−1)(−1)(0.05) + (1)(−1)(0.1) + (−1)(1)(0.1) + (1)(1)(0.15)


= 0.

So:
Cov(X, Y ) = E(XY ) − E(X) E(Y ) = 0 ⇒ Corr(X, Y ) = 0.

(c) X and Y are not independent random variables since, for example:

P(X = 1, Y = −1) = 0.1 ≠ P(X = 1) P(Y = −1) = 0.5 × 0.3 = 0.15.


3. X1 , X2 , . . . , Xn are independent Bernoulli random variables. The probability


function of Xi is given by:
p(xi) = (1 − πi)^{1−xi} πi^{xi}  for xi = 0, 1,  and 0 otherwise

where:

πi = e^{iθ} / (1 + e^{iθ})
for i = 1, 2, . . . , n. Derive the joint probability function, p(x1 , x2 , . . . , xn ).
Solution:
Since the Xi s are independent (but not identically distributed) random variables,
we have:
p(x1, x2, . . . , xn) = Π_{i=1}^{n} p(xi).

So, the joint probability function is:


p(x1, x2, . . . , xn) = Π_{i=1}^{n} [1/(1 + e^{iθ})]^{1−xi} [e^{iθ}/(1 + e^{iθ})]^{xi}
                      = Π_{i=1}^{n} e^{iθxi}/(1 + e^{iθ})
                      = e^{θ Σ_{i=1}^{n} i xi} / Π_{i=1}^{n} (1 + e^{iθ}).

4. X1 , X2 , . . . , Xn are independent random variables with the common probability


density function:

f(x) = λ^2 x e^{−λx}  for x ≥ 0,  and 0 otherwise.
Derive the joint probability density function, f (x1 , x2 , . . . , xn ).
Solution:
Since the Xi s are independent (and identically distributed) random variables, we
have:
f(x1, x2, . . . , xn) = Π_{i=1}^{n} f(xi).

So, the joint probability density function is:


f(x1, x2, . . . , xn) = Π_{i=1}^{n} λ^2 xi e^{−λxi} = λ^{2n} (Π_{i=1}^{n} xi) e^{−λx1−λx2−···−λxn} = λ^{2n} (Π_{i=1}^{n} xi) e^{−λ Σ_{i=1}^{n} xi}.

5. X1 , X2 , . . . , Xn are independent random variables with the common probability


function:
p(x) = \binom{m}{x} θ^x / (1 + θ)^m  for x = 0, 1, 2, . . . , m
and 0 otherwise. Derive the joint probability function, p(x1 , x2 , . . . , xn ).


Solution:
Since the Xi s are independent (and identically distributed) random variables, we
have:

p(x1, x2, . . . , xn) = Π_{i=1}^{n} p(xi).

So, the joint probability function is:


p(x1, x2, . . . , xn) = Π_{i=1}^{n} \binom{m}{xi} θ^{xi} / (1 + θ)^m = (Π_{i=1}^{n} \binom{m}{xi}) (θ^{x1} θ^{x2} ··· θ^{xn}) / (1 + θ)^{nm} = (Π_{i=1}^{n} \binom{m}{xi}) θ^{Σ_{i=1}^{n} xi} / (1 + θ)^{nm}.

6. The random variables X1 and X2 are independent and have the common
distribution given in the table below:

X=x 0 1 2 3
pX (x) 0.2 0.4 0.3 0.1

The random variables W and Y are defined by W = max(X1 , X2 ) and


Y = min(X1 , X2 ).
(a) Calculate the table of probabilities which defines the joint distribution of W
and Y .
(b) Find:
i. the marginal distribution of W
ii. the conditional distribution of Y given W = 2
iii. E(Y | W = 2) and Var(Y | W = 2)
iv. Cov(W, Y ).

Solution:
(a) The joint distribution of W and Y is:
W =w
0 1 2 3
0 (0.2)2 2(0.2)(0.4) 2(0.2)(0.3) 2(0.2)(0.1)
Y =y 1 0 (0.4)(0.4) 2(0.4)(0.3) 2(0.4)(0.1)
2 0 0 (0.3)(0.3) 2(0.3)(0.1)
3 0 0 0 (0.1)(0.1)
(0.2)2 (0.8)(0.4) (1.5)(0.3) (1.9)(0.1)
which is:
W =w
0 1 2 3
0 0.04 0.16 0.12 0.04
Y =y 1 0.00 0.16 0.24 0.08
2 0.00 0.00 0.09 0.06
3 0.00 0.00 0.00 0.01
0.04 0.32 0.45 0.19


(b) i. Hence the marginal distribution of W is:


W =w 0 1 2 3
pW (w) 0.04 0.32 0.45 0.19
ii. The conditional distribution of Y | W = 2 is:
Y = y|W = 2 0 1 2 3
pY |W =2 (y | W = 2) 4/15 8/15 2/10 0
= 0.26̇ = 0.53̇ = 0.2 0
iii. We have:
E(Y | W = 2) = 0 × 4/15 + 1 × 8/15 + 2 × 2/10 + 3 × 0 = 0.93̇
and:

Var(Y | W = 2) = E(Y 2 | W = 2)−(E(Y | W = 2))2 = 1.3̇−(0.93̇)2 = 0.4622.

iv. E(W Y ) = 1.69, E(W ) = 1.79 and E(Y ) = 0.81, therefore:

Cov(W, Y ) = E(W Y ) − E(W ) E(Y ) = 1.69 − 1.79 × 0.81 = 0.2401.
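A brute-force check of parts (a) and (b)(iv) is straightforward; a Python sketch that enumerates the 16 ordered pairs (X1, X2) and rebuilds the joint distribution of the maximum and minimum:

    pX = {0: 0.2, 1: 0.4, 2: 0.3, 3: 0.1}

    joint = {}
    for x1, p1 in pX.items():
        for x2, p2 in pX.items():
            w, y = max(x1, x2), min(x1, x2)
            joint[(w, y)] = joint.get((w, y), 0) + p1 * p2

    EW = sum(w * p for (w, y), p in joint.items())
    EY = sum(y * p for (w, y), p in joint.items())
    EWY = sum(w * y * p for (w, y), p in joint.items())
    print(EW, EY, EWY)          # 1.79, 0.81, 1.69
    print(EWY - EW * EY)        # Cov(W, Y) = 0.2401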

7. Consider two random variables X and Y . X can take the values −1, 0 and 1, and
Y can take the values 0, 1 and 2. The joint probabilities for each pair are given by
the following table:
X = −1 X = 0 X = 1
Y =0 0.10 0.20 0.10
Y =1 0.10 0.05 0.10
Y =2 0.10 0.05 0.20

(a) Calculate the marginal distributions and expected values of X and Y .


(b) Calculate the covariance of the random variables U and V , where U = X + Y
and V = X − Y .
(c) Calculate E(V | U = 1).

Solution:
(a) The marginal distribution of X is:
X=x −1 0 1
pX (x) 0.3 0.3 0.4
The marginal distribution of Y is:
Y =y 0 1 2
pY (y) 0.40 0.25 0.35
Hence:
E(X) = −1 × 0.3 + 0 × 0.3 + 1 × 0.4 = 0.1
and:
E(Y ) = 0 × 0.40 + 1 × 0.25 + 2 × 0.35 = 0.95.


(b) We have:

Cov(U, V ) = Cov(X + Y, X − Y )
= E((X + Y )(X − Y )) − E(X + Y ) E(X − Y )
= E(X 2 − Y 2 ) − (E(X) + E(Y ))(E(X) − E(Y ))

E(X 2 ) = ((−1)2 × 0.3) + (02 × 0.3) + (12 × 0.4) = 0.7

E(Y 2 ) = (02 × 0.4) + (12 × 0.25) + (22 × 0.35) = 1.65

hence:

Cov(U, V ) = (0.7 − 1.65) − (0.1 + 0.95)(0.1 − 0.95) = −0.0575.

(c) U = 1 is achieved for (X, Y ) pairs (−1, 2), (0, 1) or (1, 0). The corresponding
values of V are −3, −1 and 1. We have:

P (U = 1) = 0.1 + 0.05 + 0.1 = 0.25

P(V = −3 | U = 1) = 0.1/0.25 = 2/5

P(V = −1 | U = 1) = 0.05/0.25 = 1/5

P(V = 1 | U = 1) = 0.1/0.25 = 2/5

hence:
     
E(V | U = 1) = (−3 × 2/5) + (−1 × 1/5) + (1 × 2/5) = −1.

8. Two refills for a ballpoint pen are selected at random from a box containing three
blue refills, two red refills and three green refills. Define the following random
variables:
X = the number of blue refills selected
Y = the number of red refills selected.

(a) Show that P (X = 1, Y = 1) = 3/14.


(b) Form the table showing the joint probability distribution of X and Y .
(c) Calculate E(X), E(Y ) and E(X | Y = 1).
(d) Find the covariance between X and Y .
(e) Are X and Y independent random variables? Give a reason for your answer.


Solution:

(a) With the obvious notation B = blue and R = red:

P(X = 1, Y = 1) = P(BR) + P(RB) = (3/8 × 2/7) + (2/8 × 3/7) = 3/14.

(b) We have:
X=x
0 1 2
0 3/28 9/28 3/28
Y =y 1 3/14 3/14 0
2 1/28 0 0
(c) The marginal distribution of X is:
X=x 0 1 2
pX (x) 10/28 15/28 3/28
Hence:
E(X) = 0 × 10/28 + 1 × 15/28 + 2 × 3/28 = 3/4.
The marginal distribution of Y is:
Y =y 0 1 2
pY (y) 15/28 12/28 1/28
Hence:
E(Y) = 0 × 15/28 + 1 × 12/28 + 2 × 1/28 = 1/2.
The conditional distribution of X given Y = 1 is:
X = x|Y = 1 0 1
pX|Y =1 (x | y = 1) 1/2 1/2
Hence:
E(X | Y = 1) = 0 × 1/2 + 1 × 1/2 = 1/2.

(d) The distribution of XY is:


XY = xy 0 1
pXY (xy) 22/28 6/28
Hence:
E(XY) = 0 × 22/28 + 1 × 6/28 = 3/14
and:
Cov(X, Y) = E(XY) − E(X) E(Y) = 3/14 − (3/4 × 1/2) = −9/56.

(e) Since Cov(X, Y) ≠ 0, a necessary condition for independence fails to hold. The
random variables are not independent.
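The whole joint distribution can also be checked by enumerating ordered draws without replacement; a Python sketch (optional, using exact fractions):

    from itertools import permutations
    from fractions import Fraction

    box = ['B'] * 3 + ['R'] * 2 + ['G'] * 3      # three blue, two red, three green refills
    joint = {}
    for pair in permutations(range(8), 2):       # 56 ordered pairs, each with probability 1/56
        x = sum(box[i] == 'B' for i in pair)     # number of blue refills drawn
        y = sum(box[i] == 'R' for i in pair)     # number of red refills drawn
        joint[(x, y)] = joint.get((x, y), Fraction(0)) + Fraction(1, 56)

    EX = sum(x * p for (x, y), p in joint.items())
    EY = sum(y * p for (x, y), p in joint.items())
    EXY = sum(x * y * p for (x, y), p in joint.items())
    print(joint[(1, 1)], EX, EY)    # 3/14, 3/4, 1/2
    print(EXY - EX * EY)            # covariance -9/56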


9. Show that the marginal distributions of a bivariate distribution are not enough to
define the bivariate distribution itself.
Solution:
Here we must show that there are two distinct bivariate distributions with the
same marginal distributions. It is easiest to think of the simplest case where X and
Y each take only two values, say 0 and 1.
Suppose the marginal distributions of X and Y are the same, with
p(0) = p(1) = 0.5. One possible bivariate distribution with these marginal
distributions is the one for which there is independence between X and Y . This has
pX,Y (x, y) = pX (x) pY (y) for all x, y. Writing it in full:

pX,Y (0, 0) = pX,Y (1, 0) = pX,Y (0, 1) = pX,Y (1, 1) = 0.5 × 0.5 = 0.25.

The table of probabilities for this choice of independence is shown in the first table
below.
Trying some other value for pX,Y (0, 0), like 0.2, gives the second table below.

X/Y 0 1 X/Y 0 1
0 0.25 0.25 0 0.2 0.3
1 0.25 0.25 1 0.3 0.2

The construction of these probabilities is done by making sure the row and column
totals are equal to 0.5, and so we now have a second distribution with the same
marginal distributions as the first.
This example is very simple, but one can almost always construct many bivariate
distributions with the same marginal distributions even for continuous random
variables.

10. Show that if:


P(X ≤ x ∩ Y ≤ y) = (1 − e^{−x})(1 − e^{−2y})
for all x, y ≥ 0, then X and Y are independent random variables, each with an
exponential distribution.
Solution:
The right-hand side of the result given is the product of the cdf of an exponential
random variable X with parameter 1 (mean 1) and the cdf of an exponential random
variable Y with parameter 2 (mean 1/2). Since the joint cdf factorises into these two
marginal cdfs for all x, y ≥ 0, the result follows from the definition of independent
random variables.

11. There are different ways to write the covariance. Show that:

Cov(X, Y ) = E(XY ) − E(X) E(Y )

and:
Cov(X, Y ) = E((X − E(X))Y ) = E(X(Y − E(Y ))).


Solution:
Working directly from the definition:

Cov(X, Y ) = E((X − E(X))(Y − E(Y )))


= E(XY − X E(Y ) − E(X)Y + E(X) E(Y ))
= E(XY ) + E(−X E(Y )) + E(−E(X)Y ) + E(E(X) E(Y ))
= E(XY ) − E(X) E(Y ) − E(X) E(Y ) + E(X) E(Y )
= E(XY ) − E(X) E(Y ).

For the second part, we begin with the right-hand side:

E((X − E(X))Y ) = E(XY − E(X)Y )


= E(XY ) + E(−E(X)Y )
= E(XY ) − E(X) E(Y )
= Cov(X, Y ).

The remaining result follows by an argument symmetric with the last one.

12. Suppose that Var(X) = Var(Y ) = 1, and that X and Y have correlation coefficient
ρ. Show that it follows from Var(X − ρY ) ≥ 0 that ρ2 ≤ 1.
Solution:
We have:

0 ≤ Var(X − ρY ) = Var(X) − 2ρ Cov(X, Y ) + ρ2 Var(Y ) = 1 − 2ρ2 + ρ2 = (1 − ρ2 ).

Hence 1 − ρ2 ≥ 0, and so ρ2 ≤ 1.

13. The distribution of a random variable X is:

X=x −1 0 1
P (X = x) a b a

Show that X and X 2 are uncorrelated.


Solution:
This is an example of two random variables X and Y = X 2 which are uncorrelated,
but obviously dependent. The bivariate distribution of (X, Y ) in this case is
singular because of the complete functional dependence between them.
We have:

E(X) = −1 × a + 0 × b + 1 × a = 0
E(X 2 ) = +1 × a + 0 × b + 1 × a = 2a
E(X 3 ) = −1 × a + 0 × b + 1 × a = 0


and we must show that the covariance is zero:

Cov(X, Y ) = E(XY ) − E(X) E(Y ) = E(X 3 ) − E(X) E(X 2 ) = 0 − 0 × 2a = 0.

There are many possible choices for a and b which give a valid probability
distribution, for instance a = 0.25 and b = 0.5.

14. A fair coin is thrown n times, each throw being independent of the ones before. Let
R = ‘the number of heads’, and S = ‘the number of tails’. Find the covariance of R
and S. What is the correlation of R and S?
Solution:
One can go about this in a straightforward way. If Xi is the number of heads and
Yi is the number of tails on the ith throw, then the distribution of Xi and Yi is
given by:

X/Y 0 1
0 0 0.5
1 0.5 0

From this table, we compute the following:

E(Xi ) = E(Yi ) = 0 × 0.5 + 1 × 0.5 = 0.5

E(Xi2 ) = E(Yi2 ) = 0 × 0.5 + 1 × 0.5 = 0.5

Var(Xi ) = Var(Yi ) = 0.5 − (0.5)2 = 0.25

E(Xi Yi ) = 0 × 0.5 + 0 × 0.5 = 0

Cov(Xi , Yi ) = E(Xi Yi ) − E(Xi ) E(Yi ) = 0 − 0.25 = −0.25.


Now, R = Σ_i Xi and S = Σ_i Yi, so Cov(R, S) = Σ_i Σ_j Cov(Xi, Yj). Different throws
are independent, so Cov(Xi, Yj) = 0 whenever i ≠ j, leaving only the n within-throw
terms:

Cov(R, S) = −0.25n.

Since R + S = n is a fixed quantity, there is a complete linear dependence between


R and S. We have R = n − S, so the correlation between R and S should be −1.
This can be checked directly since:

Var(R) = Var(S) = 0.25n

(add the variances of the Xi s or Yi s). The correlation between R and S works out
as −0.25n/0.25n = −1.
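A short simulation makes the perfect negative correlation easy to see; a Python sketch using numpy (the choice n = 20 is arbitrary and not from the original question):

    import numpy as np

    rng = np.random.default_rng(0)
    n, reps = 20, 100_000
    heads = rng.binomial(n, 0.5, size=reps)    # number of heads in each experiment
    tails = n - heads                          # the number of tails is then determined

    print(np.cov(heads, tails)[0, 1])          # close to -0.25 * n = -5
    print(np.corrcoef(heads, tails)[0, 1])     # -1 (up to floating-point rounding)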

15. Suppose that X and Y have a bivariate distribution. Find the covariance of the
new random variables W = aX + bY and V = cX + dY where a, b, c and d are
constants.


Solution:
The covariance of W and V is:

E(WV) − E(W) E(V) = E(acX^2 + bdY^2 + (ad + bc)XY)
                    − (ac (E(X))^2 + bd (E(Y))^2 + (ad + bc) E(X) E(Y))
                  = ac(E(X^2) − (E(X))^2) + bd(E(Y^2) − (E(Y))^2)
                    + (ad + bc)(E(XY) − E(X) E(Y))
                  = ac σX^2 + bd σY^2 + (ad + bc) σXY.

16. Following on from Question 15, show that, if the variances of X and Y are the
same, then W = X + Y and V = X − Y are uncorrelated.
Solution:
Here we have a = b = c = 1 and d = −1. Substituting into the formula found above:
σWV = σX^2 − σY^2 = 0.

There is no assumption here that X and Y are independent. It is not true that W
and V are independent without further restrictions on X and Y .

E.2 Practice questions


Try to solve the questions before looking at the solutions – promise?! Solutions are
located in Appendix F.

1. (a) For random variables X and Y , show that:

Cov(X + Y, X − Y ) = Var(X) − Var(Y ).

(b) For random variables X and Y , and constants a, b, c and d, show that:

Cov(a + bX, c + dY ) = bd Cov(X, Y ).

2. Let X1 , X2 , . . . , Xk be independent random variables, and a1 , a2 , . . . , ak be


constants. Show that:
(a) E(Σ_{i=1}^{k} ai Xi) = Σ_{i=1}^{k} ai E(Xi)

(b) Var(Σ_{i=1}^{k} ai Xi) = Σ_{i=1}^{k} ai^2 Var(Xi).

3. X and Y are discrete random variables which can assume values 0, 1 and 2 only.

P (X = x, Y = y) = A(x + y) for some constant A and x, y ∈ {0, 1, 2}.


(a) Draw up a table to describe the joint distribution of X and Y and find the
value of the constant A.
(b) Describe the marginal distributions of X and Y .
(c) Give the conditional distribution of X | Y = 1 and find E(X | Y = 1).
(d) Are X and Y independent? Give a reason for your answer.

Statistics are like bikinis. What they reveal is suggestive, but what they conceal
is vital.
(Aaron Levenstein)

Appendix F
Solutions to Practice questions

F.1 Chapter 1 – Data visualisation and descriptive


statistics

1. (a) We have:
Σ_{j=1}^{3} (Yj − Ȳ) = (Y1 − Ȳ) + (Y2 − Ȳ) + (Y3 − Ȳ).

However:

3Ȳ = Y1 + Y2 + Y3

hence:

Σ_{j=1}^{3} (Yj − Ȳ) = 3Ȳ − 3Ȳ = 0.

(b) We have:

Σ_{j=1}^{3} Σ_{k=1}^{3} (Yj − Ȳ)(Yk − Ȳ) = (Y1 − Ȳ)((Y1 − Ȳ) + (Y2 − Ȳ) + (Y3 − Ȳ))
                                          + (Y2 − Ȳ)((Y1 − Ȳ) + (Y2 − Ȳ) + (Y3 − Ȳ))
                                          + (Y3 − Ȳ)((Y1 − Ȳ) + (Y2 − Ȳ) + (Y3 − Ȳ))
                                        = ((Y1 − Ȳ) + (Y2 − Ȳ) + (Y3 − Ȳ))^2

and 3Ȳ = Y1 + Y2 + Y3 as above, so:

Σ_{j=1}^{3} Σ_{k=1}^{3} (Yj − Ȳ)(Yk − Ȳ) = 0^2 = 0.

(c) We have:

Σ_{j=1}^{3} Σ_{k=1}^{3} (Yj − Ȳ)(Yk − Ȳ) = Σ_{j≠k} (Yj − Ȳ)(Yk − Ȳ) + Σ_{j=1}^{3} (Yj − Ȳ)^2.

We have written the nine terms in the left-hand expression as the sum of the
six terms for which j 6= k, and the three terms for which j = k.


However, we showed in (b) that the left-hand expression is in fact 0, so:


0 = Σ_{j≠k} (Yj − Ȳ)(Yk − Ȳ) + Σ_{j=1}^{3} (Yj − Ȳ)^2

from which the result follows.

2. (a) We have:
ȳ = (Σ_{i=1}^{n} yi)/n = (Σ_{i=1}^{n} (axi + b))/n = (a Σ_{i=1}^{n} xi + nb)/n = ax̄ + b.
(b) Multiply out the square within the summation sign and then evaluate the
three expressions, remembering that x̄ is a constant with respect to summation
and can be taken outside the summation sign as a common factor, i.e. we have:
Σ_{i=1}^{n} (xi − x̄)^2 = Σ_{i=1}^{n} (xi^2 − 2xi x̄ + x̄^2)
                        = Σ_{i=1}^{n} xi^2 − 2x̄ Σ_{i=1}^{n} xi + n x̄^2
                        = Σ_{i=1}^{n} xi^2 − 2n x̄^2 + n x̄^2

hence the result. Recall that Σ_{i=1}^{n} xi = n x̄.

(c) It is probably best to work with variances to avoid the square roots. The
variance of y values, say s2y , is given by:
sy^2 = (1/n) Σ_{i=1}^{n} (yi − ȳ)^2
     = (1/n) Σ_{i=1}^{n} (axi + b − (ax̄ + b))^2
     = a^2 (1/n) Σ_{i=1}^{n} (xi − x̄)^2
     = a^2 sx^2.

The result follows on taking the square root, observing that the standard
deviation cannot be a negative quantity.
Adding a constant k to each value of a dataset adds k to the mean and leaves the
standard deviation unchanged. This corresponds to a transformation yi = axi + b
with a = 1 and b = k. Apply (a) and (c) with these values.
Multiplying each value of a dataset by a constant c multiplies the mean by c and
also the standard deviation by |c|. This corresponds to a transformation yi = cxi
with a = c and b = 0. Apply (a) and (c) with these values.


F.2 Chapter 2 – Probability theory


1. (a) We know P (E ∪ F ) = P (E) + P (F ) − P (E ∩ F ).
Consider A ∪ B ∪ C as (A ∪ B) ∪ C (i.e. as the union of the two sets A ∪ B and
C) and then apply the result above to obtain:

P (A ∪ B ∪ C) = P ((A ∪ B) ∪ C) = P (A ∪ B) + P (C) − P ((A ∪ B) ∩ C).

Now (A ∪ B) ∩ C = (A ∩ C) ∪ (B ∩ C) – a Venn diagram can be drawn to


check this.
So:

P (A∪B ∪C) = P (A∪B)+P (C)−(P (A∩C)+P (B ∩C)−P ((A∩C)∩(B ∩C)))

using the earlier result again for A ∩ C and B ∩ C.


Now (A ∩ C) ∩ (B ∩ C) = A ∩ B ∩ C and if we apply the earlier result once
more for A and B, we obtain:

P (A∪B∪C) = P (A)+P (B)−P (A∩B)+P (C)−P (A∩C)−P (B∩C)+P (A∩B∩C)

which is the required result.


(b) Use the result that if X ⊂ Y then P (X) ≤ P (Y ) for events X and Y .
Since A ⊂ A ∪ B and B ⊂ A ∪ B, we have P (A) ≤ P (A ∪ B) and
P (B) ≤ P (A ∪ B).
Adding these inequalities, P (A) + P (B) ≤ 2P (A ∪ B) so:

(P(A) + P(B))/2 ≤ P(A ∪ B).
Similarly, A ∩ B ⊂ A and A ∩ B ⊂ B, so P (A ∩ B) ≤ P (A) and
P (A ∩ B) ≤ P (B).
Adding, 2P (A ∩ B) ≤ P (A) + P (B) so:

P(A ∩ B) ≤ (P(A) + P(B))/2.

2. (a) We know that P (A ∪ B) = P (A) + P (B) − P (A ∩ B). For independent events


A and B, P (A ∩ B) = P (A) P (B), so P (A ∪ B) = P (A) + P (B) − P (A) P (B)
gives 0.75 = p + 2p − 2p2 , or 2p2 − 3p + 0.75 = 0.
Solving the quadratic equation gives:

p = (3 − √3)/4 ≈ 0.317

discarding the other root, (3 + √3)/4, since it exceeds 1 and so cannot be a probability.
Since A and B are independent, P (A | B) = P (A) = p = 0.317.


(b) For mutually exclusive events, P (A ∪ B) = P (A) + P (B), so 0.75 = p + 2p,


leading to p = 0.25.
Here P (A ∩ B) = 0, so P (A | B) = P (A ∩ B)/P (B) = 0.

3. (a) We are given that A and B are independent, so P (A ∩ B) = P (A) P (B). We


need to show a similar result for Ac and B c , namely we need to show that
P (Ac ∩ B c ) = P (Ac ) P (B c ).
Now Ac ∩ B c = (A ∪ B)c from basic set theory (draw a Venn diagram), hence:

P (Ac ∩ B c ) = P ((A ∪ B)c )


= 1 − P (A ∪ B)
= 1 − (P (A) + P (B) − P (A ∩ B))
= 1 − P (A) − P (B) + P (A ∩ B)
= 1 − P (A) − P (B) + P (A) P (B) (independence assumption)
= (1 − P (A))(1 − P (B)) (factorising)
= P (Ac ) P (B c ) (as required).

(b) To show that X c and Y c are not necessarily mutually exclusive when X and Y
are mutually exclusive, the best approach is to find a counterexample.
Attempts to ‘prove’ the result directly are likely to be logically flawed.
Look for a simple example. Suppose we roll a die. Let X = {6} be the event of
obtaining a 6, and let Y = {5} be the event of obtaining a 5. Obviously X and
Y are mutually exclusive, but X c = {1, 2, 3, 4, 5} and Y c = {1, 2, 3, 4, 6} have
X c ∩ Y c ≠ ∅, so X c and Y c are not mutually exclusive.

4. (a) A will win the game without deuce if he or she wins four points, including the
last point, before B wins three points. This can occur in three ways.
• A wins four straight points, i.e. AAAA, with probability (2/3)^4 = 16/81.
• B wins just one point in the game. There are 4C1 = 4 ways for this to happen,
namely BAAAA, ABAAA, AABAA and AAABA. Each has probability
(1/3)(2/3)^4, so the probability of one of these outcomes is given by
4(1/3)(2/3)^4 = 64/243.
• B wins just two points in the game. There are 5C2 = 10 ways for this to
happen, namely BBAAAA, BABAAA, BAABAA, BAAABA,
ABBAAA, ABABAA, ABAABA, AABBAA, AABABA and
AAABBA. Each has probability (1/3)^2(2/3)^4, so the probability of one of
these outcomes is given by 10(1/3)^2(2/3)^4 = 160/729.
Therefore, the probability that A wins without a deuce must be the sum of
these, namely:
16/81 + 64/243 + 160/729 = (144 + 192 + 160)/729 = 496/729.

(b) We can mimic the above argument to find the probability that B wins the
game without a deuce. That is, the probability of four straight points to B is
(1/3)^4 = 1/81, the probability that A wins just one point in the game is
4(2/3)(1/3)^4 = 8/243, and the probability that A wins just two points is
10(2/3)^2(1/3)^4 = 40/729. So the probability of B winning without a deuce is
1/81 + 8/243 + 40/729 = 73/729 and so the probability of deuce is
1 − 496/729 − 73/729 = 160/729.
(c) Either: suppose deuce has been called. The probability that A wins the game
without further deuces is the probability that the next two points go AA –
with probability (2/3)^2.
The probability of exactly one further deuce is that the next four points go
ABAA or BAAA – with probability (2/3)^3(1/3) + (2/3)^3(1/3) = (2/3)^4.
The probability of exactly two further deuces is that the next six points go
ABABAA, ABBAAA, BAABAA or BABAAA – with probability
4(2/3)^4(1/3)^2 = (2/3)^6.
Continuing this way, the probability that A wins after three further deuces is
(2/3)^8 and the overall probability that A wins after deuce has been called is
(2/3)^2 + (2/3)^4 + (2/3)^6 + (2/3)^8 + · · · .
This is a geometric progression (GP) with first term a = (2/3)^2 and common
ratio r = (2/3)^2, so the overall probability that A wins after deuce has been called
is a/(1 − r) (sum to infinity of a GP), which is:

(2/3)^2 / (1 − (2/3)^2) = (4/9)/(5/9) = 4/5.

Or (quicker!): given a deuce, the next 2 points can yield the following results.
A wins with probability (2/3)^2, B wins with probability (1/3)^2, and deuce
with probability 4/9.
Hence P(A wins | deuce) = (2/3)^2 + (4/9) P(A wins | deuce) and solving
immediately gives P(A wins | deuce) = 4/5.
(d) We have:

P(A wins the game) = P(A wins without deuce being called)
                     + P(deuce is called) P(A wins | deuce is called)
                   = 496/729 + (160/729 × 4/5)
                   = 496/729 + 128/729
                   = 624/729.

Aside: so the probability of B winning the game is 1 − 624/729 = 105/729. It


follows that A is about six times as likely as B to win the game although the
probability of winning any point is only twice that of B. Another example of the
counterintuitive nature of probability.
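The answer 624/729 ≈ 0.856 is also easy to confirm by simulation; a Python sketch of the game (first player to four points wins, with a two-point lead required once deuce is reached):

    import random

    random.seed(1)

    def a_wins_game(p=2/3):
        """Play one game in which A wins each point independently with probability p."""
        a = b = 0
        while True:
            if random.random() < p:
                a += 1
            else:
                b += 1
            if a >= 4 and a - b >= 2:
                return True
            if b >= 4 and b - a >= 2:
                return False

    reps = 200_000
    print(sum(a_wins_game() for _ in range(reps)) / reps)   # about 0.856
    print(624 / 729)                                        # exact value, 0.8560 to 4 dp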


F.3 Chapter 3 – Random variables

1. We require a counterexample. A simple one will suffice – there is no merit in


complexity. Let the discrete random variable X assume values 1 and 2 with
probabilities 1/3 and 2/3, respectively. (Obviously, there are many other examples
we could have chosen.) Therefore:

E(X) = 1 × 1/3 + 2 × 2/3 = 5/3

E(X^2) = 1 × 1/3 + 4 × 2/3 = 3

E(1/X) = 1 × 1/3 + (1/2) × 2/3 = 2/3

and, clearly, E(X^2) ≠ (E(X))^2 and E(1/X) ≠ 1/E(X) in this case. A single
counterexample is enough to show that these equalities do not hold in general.

2. (a) Recall that Var(X) = E(X^2) − (E(X))^2. Now, working backwards:

E(X(X − 1)) − E(X)(E(X) − 1) = E(X 2 − X) − (E(X))2 + E(X)


= E(X 2 ) − E(X) − E(X)2 + E(X)
(using standard properties of expectation) = E(X 2 ) − (E(X))2
= Var(X).

(b) We have:

 
E((X1 + X2 + · · · + Xn)/n) = E(X1 + X2 + · · · + Xn)/n
                            = (E(X1) + E(X2) + · · · + E(Xn))/n
                            = (µ + µ + · · · + µ)/n
                            = nµ/n
                            = µ.


 
Var((X1 + X2 + · · · + Xn)/n) = Var(X1 + X2 + · · · + Xn)/n^2
              (by independence) = (Var(X1) + Var(X2) + · · · + Var(Xn))/n^2
                                = (σ^2 + σ^2 + · · · + σ^2)/n^2
                                = nσ^2/n^2
                                = σ^2/n.

3. Suppose n subjects are procured. The probability that a single subject does not
have the abnormality is 0.96. Using independence, the probability that none of the
subjects has the abnormality is (0.96)n .
The probability that at least one subject has the abnormality is 1 − (0.96)n . We
require the smallest whole number n for which 1 − (0.96)n > 0.95, i.e. we have
(0.96)n < 0.05.
We can solve the inequality by ‘trial and error’, but it is neater to take logs.
n ln(0.96) < ln(0.05), so n > ln(0.05)/ ln(0.96), or n > 73.39. Rounding up, 74
subjects should be procured.
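A two-line check of the boundary in Python (one line using the log formula, one by direct search):

    import math

    print(math.log(0.05) / math.log(0.96))   # about 73.39, so n must be at least 74

    n = 1
    while 1 - 0.96**n <= 0.95:               # stop at the first n with 1 - 0.96^n > 0.95
        n += 1
    print(n)                                 # 74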

4. (a) For the ‘stupid’ rat:


P(X = 1) = 1/4
P(X = 2) = 3/4 × 1/4
...
P(X = r) = (3/4)^{r−1} × 1/4.
This is a ‘geometric distribution’ with π = 1/4, which gives E(X) = 1/π = 4.
(b) For the ‘intelligent’ rat:
P(X = 1) = 1/4
P(X = 2) = 3/4 × 1/3 = 1/4
P(X = 3) = 3/4 × 2/3 × 1/2 = 1/4
P(X = 4) = 3/4 × 2/3 × 1/2 × 1 = 1/4.
Hence E(X) = (1 + 2 + 3 + 4)/4 = 10/4 = 2.5.


(c) For the ‘forgetful’ rat (short-term, but not long-term, memory):

P(X = 1) = 1/4
P(X = 2) = 3/4 × 1/3
P(X = 3) = 3/4 × 2/3 × 1/3
...
P(X = r) = 3/4 × (2/3)^{r−2} × 1/3  (for r ≥ 2).

Therefore:

E(X) = 1/4 + 3/4 × (2 × 1/3 + 3 × 2/3 × 1/3 + 4 × (2/3)^2 × 1/3 + · · ·)
     = 1/4 + 1/4 × (2 + 3 × 2/3 + 4 × (2/3)^2 + · · ·).

There is more than one way to evaluate this sum.

E(X) = 1/4 + 1/4 × ((1 + 2/3 + (2/3)^2 + · · ·) + (1 + 2 × 2/3 + 3 × (2/3)^2 + · · ·))
     = 1/4 + 1/4 × (3 + 9)
     = 3.25.

Note that 2.5 < 3.25 < 4, so the intelligent rat needs the fewest trials on average,
while the stupid rat needs the most, as we would expect!

F.4 Chapter 4 – Common distributions of random


variables
1. Let P ∼ N (10.42, (0.03)2 ) for the pistons, and C ∼ N (10.52, (0.04)2 ) for the
cylinders. It follows that D ∼ N (0.1, (0.05)2 ) for the difference (adding the
variances, assuming independence). The piston will fit if D > 0. We require:
 
P(D > 0) = P(Z > (0 − 0.1)/0.05) = P(Z > −2) = 0.9772

so a proportion of 1 − 0.9772 = 0.0228 will not fit.


The number of pistons, N , failing to fit out of 100 will be a binomial random
variable such that N ∼ Bin(100, 0.0228).


(a) Calculating directly, we have the following.


i. P(N = 0) = (0.9772)^100 = 0.0996.
ii. P(N ≤ 2) = (0.9772)^100 + 100 × (0.9772)^99 (0.0228) + \binom{100}{2} (0.9772)^98 (0.0228)^2 = 0.6005.

(b) Using the Poisson approximation with λ = 100 × 0.0228 = 2.28, we have the
following.
i. P(N = 0) ≈ e^{−2.28} = 0.1023.
ii. P(N ≤ 2) ≈ e^{−2.28} + e^{−2.28} × 2.28 + e^{−2.28} × (2.28)^2/2! = 0.6013.
The approximations are good (note there will be some rounding error, but the
values are close with the two methods). It is not surprising that there is close
agreement since n is large, π is small and nπ < 5.
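A sketch of the same comparison in Python with scipy (the values differ very slightly from those above because 0.0228 is a rounded version of the exact tail probability):

    from scipy.stats import norm, binom, poisson

    pi = norm.cdf(0, loc=0.1, scale=0.05)        # P(piston fails to fit), about 0.02275
    lam = 100 * pi                               # Poisson parameter for 100 pairs

    print(binom.pmf(0, 100, pi), binom.cdf(2, 100, pi))    # exact: about 0.100 and 0.602
    print(poisson.pmf(0, lam), poisson.cdf(2, lam))        # approximation: about 0.103 and 0.603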

2. We have P (X = 1) = P (X = 2) = · · · = P (X = k) = 1/k. Therefore:

MX(t) = E(e^{Xt}) = e^t × 1/k + e^{2t} × 1/k + · · · + e^{kt} × 1/k
      = (1/k)(e^t + e^{2t} + · · · + e^{kt}).

The bracketed part of this expression is a geometric progression where the first
term is et and the common ratio is et .
Using the well-known result for the sum of k terms of a geometric progression, we
obtain:
MX(t) = (1/k) × e^t(1 − (e^t)^k)/(1 − e^t) = e^t(1 − e^{kt})/(k(1 − e^t)).

3. (a) For f(z) to serve as a pdf, we require (i.) f(z) ≥ 0 for all z, and (ii.)
∫_{−∞}^{∞} f(z) dz = 1. The first condition certainly holds for f(z). The second also
holds since:

∫_{−∞}^{∞} f(z) dz = ∫_{−∞}^{0} (1/2)e^{−|z|} dz + ∫_{0}^{∞} (1/2)e^{−|z|} dz
                   = ∫_{−∞}^{0} (1/2)e^{z} dz + ∫_{0}^{∞} (1/2)e^{−z} dz
                   = [e^{z}/2]_{−∞}^{0} − [e^{−z}/2]_{0}^{∞}
                   = 1/2 + 1/2
                   = 1.

A sketch of f(z) shows a curve symmetric about z = 0, with peak value 1/2 at the origin and exponential decay on either side.


(b) The moment generating function is:


MZ(t) = E(e^{Zt}) = ∫_{−∞}^{0} (1/2)e^{−|z|} e^{zt} dz + ∫_{0}^{∞} (1/2)e^{−|z|} e^{zt} dz
      = ∫_{−∞}^{0} (1/2)e^{z} e^{zt} dz + ∫_{0}^{∞} (1/2)e^{−z} e^{zt} dz
      = ∫_{−∞}^{0} (1/2)e^{z(1+t)} dz + ∫_{0}^{∞} (1/2)e^{z(t−1)} dz
      = [e^{z(1+t)}/(2(1 + t))]_{−∞}^{0} + [e^{z(t−1)}/(2(t − 1))]_{0}^{∞}
      = 1/(2(1 + t)) − 1/(2(t − 1))
      = (1 − t^2)^{−1}

where the condition −1 < t < 1 ensures the integrands are 0 at the infinite
limits.
(c) We can find the various moments by differentiating MZ(t), but it is simpler to
expand it:

MZ(t) = (1 − t^2)^{−1} = 1 + t^2 + t^4 + · · · .

Now the coefficient of t is 0, so E(Z) = 0. The coefficient of t^2 is 1, so
E(Z^2)/2 = 1, and Var(Z) = E(Z^2) − (E(Z))^2 = 2.
The coefficient of t^3 is 0, so E(Z^3) = 0. The coefficient of t^4 is 1, so
E(Z^4)/4! = 1, and so E(Z^4) = 24.
Note that the first and third of these results follow directly from the fact
(illustrated in the sketch) that the distribution is symmetric about z = 0.
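A numerical sanity check of these moments (an optional sketch using scipy's quad; the odd moments vanish by symmetry, and for even k the kth moment reduces to ∫_0^∞ z^k e^{−z} dz = k!):

    import math
    from scipy.integrate import quad

    # E(Z^k) = 0 for odd k by symmetry; for even k,
    # E(Z^k) = 2 * ∫_0^∞ z^k (1/2) e^{-z} dz = k!
    for k in (2, 4):
        val, _ = quad(lambda z, k=k: z**k * math.exp(-z), 0, math.inf)
        print(k, val, math.factorial(k))    # approximately 2 and 24, matching k!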


4. For X ∼ Bin(n, π), P(X = x) = \binom{n}{x} π^x (1 − π)^{n−x}. So, for E(X), we have:

E(X) = Σ_{x=0}^{n} x \binom{n}{x} π^x (1 − π)^{n−x}
     = Σ_{x=1}^{n} x \binom{n}{x} π^x (1 − π)^{n−x}
     = Σ_{x=1}^{n} [n(n − 1)! / ((x − 1)! ((n − 1) − (x − 1))!)] π π^{x−1} (1 − π)^{n−x}
     = nπ Σ_{x=1}^{n} \binom{n − 1}{x − 1} π^{x−1} (1 − π)^{n−x}
     = nπ Σ_{y=0}^{n−1} \binom{n − 1}{y} π^{y} (1 − π)^{(n−1)−y}
     = nπ × 1
     = nπ

where y = x − 1, and the last summation is over all the values of the pf of another
binomial distribution, this time with possible values 0, 1, 2, . . . , n − 1 and
probability parameter π.

Similarly:

E(X(X − 1)) = Σ_{x=0}^{n} x(x − 1) \binom{n}{x} π^x (1 − π)^{n−x}
            = Σ_{x=2}^{n} [x(x − 1) n! / (x! (n − x)!)] π^x (1 − π)^{n−x}
            = n(n − 1)π^2 Σ_{x=2}^{n} [(n − 2)! / ((x − 2)! (n − x)!)] π^{x−2} (1 − π)^{n−x}
            = n(n − 1)π^2 Σ_{y=0}^{n−2} [(n − 2)! / (y! (n − y − 2)!)] π^{y} (1 − π)^{n−y−2}

with y = x − 2. Now let m = n − 2, so:

E(X(X − 1)) = n(n − 1)π^2 Σ_{y=0}^{m} [m! / (y! (m − y)!)] π^{y} (1 − π)^{m−y}
            = n(n − 1)π^2

since the summation is 1, as before.


Finally, noting Practice question 2 in Chapter 3:

Var(X) = E(X(X − 1)) − E(X)(E(X) − 1)
       = n(n − 1)π^2 − nπ(nπ − 1)
       = −nπ^2 + nπ
       = nπ(1 − π).

5. (a) A rate of 150 cars per hour is a rate of 2.5 per minute. Using a Poisson
distribution with λ = 2.5, P(none passes) = e^{−2.5} × (2.5)^0/0! = e^{−2.5} = 0.0821.

(b) The expected number of cars passing in two minutes is 2 × 2.5 = 5.

(c) The probability of 5 cars passing in two minutes is e^{−5} × 5^5/5! = 0.1755.

6. (a) Let X denote the number of fish caught, such that X ∼ Pois(λ).
P(X = x) = e^{−λ} λ^x/x!, where the parameter λ is as yet unknown, so
P(X = 0) = e^{−λ} λ^0/0! = e^{−λ}.
However, we know P(X = 0) = π. So e^{−λ} = π, giving −λ = ln(π) and
λ = ln(1/π).

(b) James will take home the last fish caught if he catches 1, 3, 5, . . . fish. So we
require:

P(X = 1) + P(X = 3) + P(X = 5) + · · · = e^{−λ} λ^1/1! + e^{−λ} λ^3/3! + e^{−λ} λ^5/5! + · · ·
                                       = e^{−λ} (λ^1/1! + λ^3/3! + λ^5/5! + · · ·).

Now we know:
e^{λ} = 1 + λ + λ^2/2! + λ^3/3! + · · ·

and:

e^{−λ} = 1 − λ + λ^2/2! − λ^3/3! + · · · .
Subtracting gives:

e^{λ} − e^{−λ} = 2(λ + λ^3/3! + λ^5/5! + · · ·).

Hence the required probability is:

e^{−λ} (e^{λ} − e^{−λ})/2 = (1 − e^{−2λ})/2 = (1 − π^2)/2

since e^{−λ} = π above gives e^{−2λ} = π^2.


F.5 Chapter 5 – Multivariate random variables


1. (a) Recall that for any random variable U , we have Var(U ) = E(U 2 ) − (E(U ))2 ,
E(kU ) = k E(U ), where k is a constant, and for random variables U and V ,
E(U + V ) = E(U ) + E(V ), and also Cov(U, V ) = E(U V ) − E(U ) E(V ). We
have:

Cov(X + Y, X − Y ) = E((X + Y )(X − Y )) − E(X + Y ) E(X − Y )


= E(X 2 − XY + Y X − Y 2 ) − (E(X) + E(Y ))(E(X) − E(Y ))
= E(X 2 ) − E(Y 2 ) − (E(X))2 + E(X) E(Y ) − E(Y ) E(X) + (E(Y ))2
= E(X 2 ) − (E(X))2 − (E(Y 2 ) − (E(Y ))2 )
= Var(X) − Var(Y )

as required.
(b) We have:

Cov(a + bX, c + dY ) = E((a + bX)(c + dY )) − E(a + bX) E(c + dY )


= E(ac + adY + bcX + bdXY ) − (a + b E(X))(c + d E(Y ))
= ac + ad E(Y ) + bc E(X) + bd E(XY )
− ac − ad E(Y ) − bc E(X) − bd E(X) E(Y )
= bd E(XY ) − bd E(X) E(Y )
= bd(E(XY ) − E(X) E(Y ))
= bd Cov(X, Y )

as required.

2. (a) We have:

E(Σ_{i=1}^{k} ai Xi) = Σ_{i=1}^{k} E(ai Xi) = Σ_{i=1}^{k} ai E(Xi).

(b) We have:

Var(Σ_{i=1}^{k} ai Xi) = E((Σ_{i=1}^{k} ai Xi − Σ_{i=1}^{k} ai E(Xi))^2) = E((Σ_{i=1}^{k} ai (Xi − E(Xi)))^2)
= Σ_{i=1}^{k} ai^2 E((Xi − E(Xi))^2) + Σ_{1≤i≠j≤k} ai aj E((Xi − E(Xi))(Xj − E(Xj)))
= Σ_{i=1}^{k} ai^2 Var(Xi) + Σ_{1≤i≠j≤k} ai aj E(Xi − E(Xi)) E(Xj − E(Xj)) = Σ_{i=1}^{k} ai^2 Var(Xi)

where the cross-expectations factorise by independence and each E(Xi − E(Xi)) = 0.


Additional note: remember there are two ways to compute the variance:
Var(X) = E((X − µ)2 ) and Var(X) = E(X 2 ) − (E(X))2 . The former is more
convenient for analytical derivations/proofs (see above), while the latter should be
used to compute variances for common distributions such as Poisson or exponential
distributions. Actually it is rather difficult to compute the variance for a Poisson
distribution using the formula Var(X) = E((X − µ)2 ) directly.

3. (a) The joint distribution table is:


X=x
0 1 2
0 0 A 2A
Y =y 1 A 2A 3A
2 2A 3A 4A
Since Σ_x Σ_y pX,Y(x, y) = 1, we have A = 1/18.

(b) The marginal distribution of X (similarly of Y ) is:


X=x 0 1 2
P (X = x) 3A = 1/6 6A = 1/3 9A = 1/2
(c) The distribution of X | Y = 1 is:
X = x|y = 1 0 1 2
PX|Y =1 (X = x | y = 1) A/6A = 1/6 2A/6A = 1/3 3A/6A = 1/2
Hence E(X | Y = 1) = (0 × 1/6) + (1 × 1/3) + (2 × 1/2) = 4/3.
(d) Even though the distributions of X and X | Y = 1 are the same, X and Y are
not independent. For example, P(X = 0, Y = 0) = 0 although P(X = 0) ≠ 0
and P(Y = 0) ≠ 0.

lse.ac.uk/statistics Department of Statistics
The London School of Economics
and Political Science
Houghton Street
London WC2A 2AE
Email: [email protected]
Telephone: +44 (0)20 7852 3709

The London School of Economics and Political Science is a School of the University of London. It is a
charity and is incorporated in England as a company limited by guarantee under the Companies Acts
(Reg no 70527).

The School seeks to ensure that people are treated equitably, regardless of age, disability, race,
nationality, ethnic or national origin, gender, religion, sexual orientation or personal circumstances.
