
Statistics 2

J.S. Abdey
ST104b
2014

Undergraduate study in
Economics, Management,
Finance and the Social Sciences

This is an extract from a subject guide for an undergraduate course offered as part of the
University of London International Programmes in Economics, Management, Finance and
the Social Sciences. Materials for these programmes are developed by academics at the
London School of Economics and Political Science (LSE).
For more information, see: www.londoninternational.ac.uk
This guide was prepared for the University of London International Programmes by:
James S. Abdey, BA (Hons), MSc, PGCertHE, PhD, Department of Statistics, London School
of Economics and Political Science.
This is one of a series of subject guides published by the University. We regret that due
to pressure of work the author is unable to enter into any correspondence relating to, or
arising from, the guide. If you have any comments on this subject guide, favourable or
unfavourable, please use the form at the back of this guide.

University of London International Programmes


Publications Office
Stewart House
32 Russell Square
London WC1B 5DN
United Kingdom
www.londoninternational.ac.uk

Published by: University of London


© University of London 2014
The University of London asserts copyright over all material in this subject guide except
where otherwise indicated. All rights reserved. No part of this work may be reproduced
in any form, or by any means, without permission in writing from the publisher. We make
every effort to respect copyright. If you think we have inadvertently used your copyright
material, please let us know.
Contents

1 Introduction 1
1.1 Route map to the guide . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Introduction to the subject area . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 Syllabus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Aims of the course . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.5 Learning outcomes for the course . . . . . . . . . . . . . . . . . . . . . . 3
1.6 Overview of learning resources . . . . . . . . . . . . . . . . . . . . . . . . 3
1.6.1 The subject guide . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.6.2 Essential reading . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.6.3 Further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.6.4 Online study resources (the Online Library and the VLE) . . . . . 6
1.7 Examination advice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2 Probability theory 9
2.1 Synopsis of chapter content . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Aims of the chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4 Essential reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.5 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.6 Set theory: the basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.7 Axiomatic definition of probability . . . . . . . . . . . . . . . . . . . . . 17
2.7.1 Basic properties of probability . . . . . . . . . . . . . . . . . . . . 18
2.8 Classical probability and counting rules . . . . . . . . . . . . . . . . . . . 22
2.8.1 Combinatorial counting methods . . . . . . . . . . . . . . . . . . 23
2.9 Conditional probability and Bayes’ theorem . . . . . . . . . . . . . . . . 27
2.9.1 Total probability formula . . . . . . . . . . . . . . . . . . . . . . . 32
2.9.2 Bayes’ theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.10 Overview of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.11 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.12 Learning activities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37


2.13 Reminder of learning outcomes . . . . . . . . . . . . . . . . . . . . . . . 38


2.14 Sample examination questions . . . . . . . . . . . . . . . . . . . . . . . . 39

3 Random variables 41
3.1 Synopsis of chapter content . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.2 Aims of the chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.4 Essential reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.5 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.6 Discrete random variables . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.7 Continuous random variables . . . . . . . . . . . . . . . . . . . . . . . . 56
3.8 Overview of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.9 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.10 Learning activities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.11 Reminder of learning outcomes . . . . . . . . . . . . . . . . . . . . . . . 67
3.12 Sample examination questions . . . . . . . . . . . . . . . . . . . . . . . . 67

4 Common distributions of random variables 69


4.1 Synopsis of chapter content . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.2 Aims of the chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.3 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.4 Essential reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.5 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.6 Common discrete distributions . . . . . . . . . . . . . . . . . . . . . . . . 71
4.6.1 Discrete uniform distribution . . . . . . . . . . . . . . . . . . . . 71
4.6.2 Bernoulli distribution . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.6.3 Binomial distribution . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.6.4 Poisson distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.6.5 Connections between probability distributions . . . . . . . . . . . 78
4.6.6 Poisson approximation of the binomial distribution . . . . . . . . 78
4.6.7 Some other discrete distributions . . . . . . . . . . . . . . . . . . 80
4.7 Common continuous distributions . . . . . . . . . . . . . . . . . . . . . . 81
4.7.1 The (continuous) uniform distribution . . . . . . . . . . . . . . . 81
4.7.2 Exponential distribution . . . . . . . . . . . . . . . . . . . . . . . 83
4.7.3 Two other distributions . . . . . . . . . . . . . . . . . . . . . . . 85
4.7.4 Normal (Gaussian) distribution . . . . . . . . . . . . . . . . . . . 85


4.7.5 Normal approximation of the binomial distribution . . . . . . . . 91


4.8 Overview of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.9 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.10 Learning activities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.11 Reminder of learning outcomes . . . . . . . . . . . . . . . . . . . . . . . 96
4.12 Sample examination questions . . . . . . . . . . . . . . . . . . . . . . . . 96

5 Multivariate random variables 99


5.1 Synopsis of chapter content . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.2 Aims of the chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.3 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.4 Essential reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.5 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.6 Joint probability functions . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.6.1 Marginal distributions . . . . . . . . . . . . . . . . . . . . . . . . 102
5.7 Conditional distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.7.1 Properties of conditional distributions . . . . . . . . . . . . . . . . 105
5.7.2 Conditional mean and variance . . . . . . . . . . . . . . . . . . . 105
5.8 Covariance and correlation . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.8.1 Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.8.2 Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.8.3 Sample covariance and correlation . . . . . . . . . . . . . . . . . . 109
5.9 Independent random variables . . . . . . . . . . . . . . . . . . . . . . . . 111
5.9.1 Joint distribution of independent random variables . . . . . . . . 112
5.10 Sums and products of random variables . . . . . . . . . . . . . . . . . . . 113
5.10.1 Expected values and variances of sums of random variables . . . . 114
5.10.2 Expected values of products of independent random variables . . 115
5.10.3 Some proofs of previous results . . . . . . . . . . . . . . . . . . . 115
5.10.4 Distributions of sums of random variables . . . . . . . . . . . . . 116
5.11 Overview of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.12 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.13 Learning activities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.14 Reminder of learning outcomes . . . . . . . . . . . . . . . . . . . . . . . 119
5.15 Sample examination questions . . . . . . . . . . . . . . . . . . . . . . . . 120


6 Sampling distributions of statistics 121


6.1 Synopsis of chapter content . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.2 Aims of the chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.3 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.4 Essential reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.5 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
6.6 Random samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
6.7 Statistics and their sampling distributions . . . . . . . . . . . . . . . . . 124
6.8 Sampling distribution of a statistic . . . . . . . . . . . . . . . . . . . . . 124
6.9 Sample mean from a normal population . . . . . . . . . . . . . . . . . . . 126
6.10 The central limit theorem . . . . . . . . . . . . . . . . . . . . . . . . . . 130
6.11 Some common sampling distributions . . . . . . . . . . . . . . . . . . . . 132
6.11.1 The χ2 distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 133
6.11.2 (Student’s) t distribution . . . . . . . . . . . . . . . . . . . . . . . 135
6.11.3 The F distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 137
6.12 Prelude to statistical inference . . . . . . . . . . . . . . . . . . . . . . . . 137
6.12.1 Population versus random sample . . . . . . . . . . . . . . . . . . 138
6.12.2 Parameter versus statistic . . . . . . . . . . . . . . . . . . . . . . 139
6.12.3 Difference between ‘Probability’ and ‘Statistics’ . . . . . . . . . . 140
6.13 Overview of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
6.14 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
6.15 Learning activities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
6.16 Reminder of learning outcomes . . . . . . . . . . . . . . . . . . . . . . . 142
6.17 Sample examination questions . . . . . . . . . . . . . . . . . . . . . . . . 143

7 Point estimation 145


7.1 Synopsis of chapter content . . . . . . . . . . . . . . . . . . . . . . . . . 145
7.2 Aims of the chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
7.3 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
7.4 Essential reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
7.5 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
7.6 Estimation criteria: bias, variance and mean squared error . . . . . . . . 146
7.7 Method of moments (MM) estimation . . . . . . . . . . . . . . . . . . . . 151
7.8 Least squares (LS) estimation . . . . . . . . . . . . . . . . . . . . . . . . 153
7.9 Maximum likelihood (ML) estimation . . . . . . . . . . . . . . . . . . . . 154
7.10 Overview of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159


7.11 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . 159


7.12 Learning activities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
7.13 Reminder of learning outcomes . . . . . . . . . . . . . . . . . . . . . . . 160
7.14 Sample examination questions . . . . . . . . . . . . . . . . . . . . . . . . 161

8 Interval estimation 163


8.1 Synopsis of chapter content . . . . . . . . . . . . . . . . . . . . . . . . . 163
8.2 Aims of the chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
8.3 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
8.4 Essential and further reading . . . . . . . . . . . . . . . . . . . . . . . . . 163
8.5 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
8.6 Interval estimation for means of normal distributions . . . . . . . . . . . 164
8.6.1 An important property of normal samples . . . . . . . . . . . . . 166
8.6.2 Means of non-normal distributions . . . . . . . . . . . . . . . . . 166
8.7 Use of the chi-squared distribution . . . . . . . . . . . . . . . . . . . . . 167
8.8 Interval estimation for variances of normal distributions . . . . . . . . . . 168
8.9 Overview of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
8.10 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
8.11 Learning activities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
8.12 Reminder of learning outcomes . . . . . . . . . . . . . . . . . . . . . . . 169
8.13 Sample examination questions . . . . . . . . . . . . . . . . . . . . . . . . 169

9 Hypothesis testing 171


9.1 Synopsis of chapter content . . . . . . . . . . . . . . . . . . . . . . . . . 171
9.2 Aims of the chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
9.3 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
9.4 Essential reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
9.5 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
9.6 Introductory examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
9.7 Setting p-value, significance level, test statistic . . . . . . . . . . . . . . . 173
9.7.1 General setting of hypothesis tests . . . . . . . . . . . . . . . . . 175
9.7.2 Statistical testing procedure . . . . . . . . . . . . . . . . . . . . . 175
9.7.3 Two-sided tests for normal means . . . . . . . . . . . . . . . . . . 176
9.7.4 One-sided tests for normal means . . . . . . . . . . . . . . . . . . 177
9.8 t tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
9.9 General approach to statistical tests . . . . . . . . . . . . . . . . . . . . . 179


9.10 Two types of error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180


9.11 Tests for variances of normal distributions . . . . . . . . . . . . . . . . . 180
9.12 Summary: tests for µ and σ² in N (µ, σ²) . . . . . . . . . . . . . . . . 182
9.13 Comparing two normal means with paired observations . . . . . . . . . . 183
9.14 Comparing two normal means . . . . . . . . . . . . . . . . . . . . . . . . 184
9.14.1 Tests on µX − µY with known σ²X and σ²Y . . . . . . . . . . . . . 184
9.14.2 Tests on µX − µY with σ²X = σ²Y but unknown . . . . . . . . . . . 185
9.15 Tests for correlation coefficients . . . . . . . . . . . . . . . . . . . . . . . 187
9.15.1 Tests for correlation coefficients . . . . . . . . . . . . . . . . . . . 189
9.16 Tests for the ratio of two normal variances . . . . . . . . . . . . . . . . . 190
9.17 Summary: tests for two normal distributions . . . . . . . . . . . . . . . . 192
9.18 Overview of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
9.19 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
9.20 Learning activities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
9.21 Reminder of learning outcomes . . . . . . . . . . . . . . . . . . . . . . . 195
9.22 Sample examination questions . . . . . . . . . . . . . . . . . . . . . . . . 195

10 Analysis of variance (ANOVA) 197


10.1 Synopsis of chapter content . . . . . . . . . . . . . . . . . . . . . . . . . 197
10.2 Aims of the chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
10.3 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
10.4 Essential reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
10.5 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
10.6 Testing for equality of three population means . . . . . . . . . . . . . . . 198
10.7 One-way analysis of variance . . . . . . . . . . . . . . . . . . . . . . . . . 199
10.8 From one-way to two-way ANOVA . . . . . . . . . . . . . . . . . . . . . 206
10.9 Two-way analysis of variance . . . . . . . . . . . . . . . . . . . . . . . . 207
10.10 Overview of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
10.11 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . . 213
10.12 Learning activities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
10.13 Reminder of learning outcomes . . . . . . . . . . . . . . . . . . . . . . . 214
10.14 Sample examination questions . . . . . . . . . . . . . . . . . . . . . . . 215

11 Linear regression 217


11.1 Synopsis of chapter content . . . . . . . . . . . . . . . . . . . . . . . . . 217
11.2 Aims of the chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217


11.3 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217


11.4 Essential reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
11.5 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
11.6 Introductory examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
11.7 Simple linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
11.8 Inference for parameters in normal regression models . . . . . . . . . . . 223
11.9 Regression ANOVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
11.10 Confidence intervals for E(y) . . . . . . . . . . . . . . . . . . . . . . . . 227
11.11 Prediction intervals for y . . . . . . . . . . . . . . . . . . . . . . . . . . 228
11.12 Multiple linear regression models . . . . . . . . . . . . . . . . . . . . . 229
11.13 Multiple regression using Minitab . . . . . . . . . . . . . . . . . . . . . 231
11.14 Overview of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
11.15 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . . 233
11.16 Learning activities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
11.17 Reminder of learning outcomes . . . . . . . . . . . . . . . . . . . . . . . 234
11.18 Sample examination questions . . . . . . . . . . . . . . . . . . . . . . . 234

A Sample examination paper 237

B Sample examination paper – Examiners’ commentary 241


Chapter 1
Introduction

1.1 Route map to the guide


This subject guide provides you with a framework for covering the syllabus of the
ST104b Statistics 2 half course and directs you to additional resources such as
readings and the virtual learning environment (VLE).
The following 10 chapters will cover important aspects of elementary statistical theory,
upon which many applications in EC2020 Elements of econometrics draw heavily.
The chapters are not a series of self-contained topics, rather they build on each other
sequentially. As such, you are strongly advised to follow the subject guide in chapter
order. There is little point in rushing past material which you have only partially
understood in order to reach the final chapter. Once you have completed your work on
all of the chapters, you will be ready for examination revision. A good place to start is
the sample examination paper which you will find at the end of the subject guide.
ST104b Statistics 2 extends the work of ST104a Statistics 1 and provides a precise
and accurate treatment of probability, distribution theory and statistical inference. As
such there will be a strong emphasis on mathematical statistics as important discrete
and continuous probability distributions are covered and properties of these
distributions are investigated.
Point estimation techniques are discussed including method of moments, least squares
and maximum likelihood estimation. Confidence interval construction and statistical
hypothesis testing follow. Analysis of variance and a treatment of linear regression
models, featuring the interpretation of computer-generated regression output and
implications for prediction, round off the course.
Collectively, these topics provide a solid training in statistical analysis. As such,
ST104b Statistics 2 is of considerable value to those intending to pursue further
study in statistics, econometrics and/or empirical economics. Indeed, the quantitative
skills developed in the subject guide are readily applicable to all fields involving real
data analysis.

1.2 Introduction to the subject area


Why study statistics?

By successfully completing this half course, you will understand the ideas of
randomness and variability, and the way in which they link to probability theory. This
will allow the use of a systematic and logical collection of statistical techniques of great

practical importance in many applied areas. The examples in this subject guide will
concentrate on the social sciences, but the methods are important for the physical
sciences too. This subject aims to provide a grounding in probability theory and some
of the most common statistical methods.
The material in ST104b Statistics 2 is necessary as preparation for other subjects
you may study later on in your degree. The full details of the ideas discussed in this
subject guide will not always be required in these other subjects, but you will need to
have a solid understanding of the main concepts. This can only be achieved by seeing
how the ideas emerge in detail.

How to study statistics

For statistics, you need some familiarity with abstract mathematical ideas, as well as
the ability and common sense to apply these to real-life problems. The concepts you will
encounter in probability and statistical inference are hard to absorb by just reading
about them in a book. You need to read, then think a little, then try some problems,
and then read and think some more. This procedure should be repeated until the
problems are easy to do; you should not spend a long time reading and forget about
solving problems.

1.3 Syllabus
The syllabus of ST104b Statistics 2 is as follows:

Probability: Set theory: the basics; Axiomatic definition of probability; Classical


probability and counting rules; Conditional probability and Bayes’ theorem.

Random variables: Discrete random variables; Continuous random variables.

Common distributions of random variables: Common discrete distributions;


Common continuous distributions.

Multivariate random variables: Joint probability functions; Conditional


distributions; Covariance and correlation; Independent random variables; Sums and
products of random variables.

Sampling distributions of statistics: Random samples; Statistics and their


sampling distributions; Sampling distribution of a statistic; Sample mean from a
normal population; The central limit theorem; Some common sampling
distributions; Prelude to statistical inference.

Point estimation: Estimation criteria: bias, variance and mean squared error;
Method of moments estimation; Least squares estimation; Maximum likelihood
estimation.

Interval estimation: Interval estimation for means of normal distributions; Use of


the chi-squared distribution; Confidence intervals for normal variances.

Hypothesis testing: Setting p-value, significance level, test statistic; t tests;
General approach to statistical tests; Two types of error; Tests for normal variances;
Comparing two normal means with paired observations; Comparing two normal
means; Tests for correlation coefficients; Tests for the ratio of two normal variances.

Analysis of variance (ANOVA): One-way analysis of variance; Two-way


analysis of variance.

Linear regression: Simple linear regression; Inference for parameters in normal


regression models; Regression ANOVA; Confidence intervals for E(y); Prediction
intervals for y; Multiple linear regression models.

1.4 Aims of the course


The aim of this half course is to develop students’ knowledge of elementary statistical
theory. The emphasis is on topics that are of importance in applications to
econometrics, finance and the social sciences. Concepts and methods that provide the
foundation for more specialised courses in statistics are introduced.

1.5 Learning outcomes for the course


At the end of this half course, and having completed the Essential reading and
activities, you should be able to:

apply and be competent users of standard statistical operators and be able to recall
a variety of well-known distributions and their respective moments

explain the fundamentals of statistical inference and apply these principles to


justify the use of an appropriate model and perform hypothesis tests in a number
of different settings

demonstrate understanding that statistical techniques are based on assumptions


and the plausibility of such assumptions must be investigated when analysing real
problems.

1.6 Overview of learning resources

1.6.1 The subject guide


This course builds on the ideas encountered in ST104a Statistics 1. Although this
subject guide offers a complete treatment of the course material, students may wish to
consider purchasing a textbook. Apart from the textbooks recommended in this subject
guide, you may wish to look in bookshops and libraries for alternative textbooks which
may help you. A critical part of a good statistics textbook is the collection of problems
to solve, and you may want to look at several different textbooks just to see a range of

practice questions, especially for tricky topics. The subject guide is there mainly to
describe the syllabus and to show the level of understanding expected.
The subject guide is divided into chapters which should be worked through in the order
in which they appear. There is little point in rushing past material you only partly
understand to get to later chapters, as the presentation is somewhat sequential and not
a series of self-contained topics. You should be familiar with the earlier chapters and
have a solid understanding of them before moving on to the later ones.
The following procedure is recommended:

1. Read the introductory comments.


2. Consult the appropriate section of your textbook.
3. Study the chapter content, examples and learning activities.
4. Go through the learning outcomes carefully.
5. Attempt some of the problems from your textbook.
6. Refer back to this subject guide, or to the textbook, or to supplementary texts, to
improve your understanding until you are able to work through the problems
confidently.

The last two steps are the most important. It is easy to think that you have understood
the material after reading it, but working through problems is the crucial test of
understanding. Problem-solving should take up most of your study time.
Each chapter of the subject guide has suggestions for reading from the main textbook.
Usually, you will only need to read the material in the main textbook (see ‘Essential
reading’ below), but it may be helpful from time to time to look at others.

Basic notation

We often use the symbol □ to denote the end of a proof, where we have finished
explaining why a particular result is true. This is just to make it clear where the proof
ends and the following text begins.

Time management

About one-third of your self-study time should be spent reading and the rest should be
spent solving problems. An internal student would expect maybe 15 hours of formal
teaching and another 50 hours of private study to be enough to cover the subject. Of
the 50 hours of private study, about 17 hours should be spent on the initial study of the
textbook and subject guide. The remaining 33 hours should be spent on attempting
problems, which may well require more reading.

Calculators

A calculator may be used when answering questions on the examination paper for
ST104b Statistics 2. It must comply in all respects with the specification given in the

Regulations. You should also refer to the admission notice you will receive when
entering the examination and the ‘Notice on permitted materials’.
Make sure you accustom yourself to using your chosen calculator and feel comfortable
with it. Specifically, calculators must:

have no external wires

must be:

hand held

compact and portable

quiet in operation

non-programmable

and must:

not be capable of receiving, storing or displaying user-supplied non-numerical data.

The Regulations state: ‘The use of a calculator that communicates or displays textual
messages, graphical or algebraic information is strictly forbidden. Where a calculator is
permitted in the examination, it must be a non-scientific calculator. Where calculators
are permitted, only calculators limited to performing just basic arithmetic operations
may be used. This is to encourage candidates to show the Examiners the steps taken in
arriving at the answer.’

Computers

If you are aiming to carry out serious statistical analysis (which is beyond the level of
this course) you will probably want to use some statistical software package such as
Minitab, R or SPSS. It is not necessary for this course to have such software available,
but if you do have access to it you may benefit from using it in your study of the
material.

1.6.2 Essential reading


Newbold, P., W.L. Carlson and B.M. Thorne, Statistics for Business and
Economics. (London: Prentice-Hall, 2012) eighth edition [ISBN 9780273767060].

Statistical tables

Lindley, D.V. and W.F. Scott, New Cambridge Statistical Tables. (Cambridge:
Cambridge University Press, 1995) second edition [ISBN 978-0521484855].

These statistical tables are the same ones that are distributed for use in the
examination, so it is advisable that you become familiar with them, rather than those
at the end of a textbook.

1.6.3 Further reading
Please note that, as long as you read the Essential reading, you are then free to read
around the subject area in any text, paper or online resource. You will need to support
your learning by reading as widely as possible and by thinking about how these
principles apply in the real world. To help you read extensively, you have free access to
the virtual learning environment (VLE) and University of London Online Library (see
below).
Other useful texts for this course include:

Johnson, R.A. and G.K. Bhattacharyya, Statistics: Principles and Methods. (New
York: John Wiley and Sons, 2010) sixth edition [ISBN 9780470505779].
Larsen, R.J. and M.L. Marx, Introduction to Mathematical Statistics and Its
Applications (Pearson, 2013) fifth edition [ISBN 9781292023557].

While Newbold et al. is the main textbook for this course, there are many that are just
as good. You are encouraged to look at those listed above and at any others you may
find. It may be necessary to look at several textbooks for a single topic, as you may find
that the approach of one textbook suits you better than that of another.

1.6.4 Online study resources (the Online Library and the VLE)
In addition to the subject guide and the Essential reading, it is crucial that you take
advantage of the study resources that are available online for this course, including the
virtual learning environment (VLE) and the Online Library.
You can access the VLE, the Online Library and your University of London email
account via the Student Portal at:
http://my.londoninternational.ac.uk
You should have received your login details for the Student Portal with your official
offer, which was emailed to the address that you gave on your application form. You
have probably already logged in to the Student Portal in order to register! As soon as
you registered, you will automatically have been granted access to the VLE, Online
Library and your fully functional University of London email account.
If you forget your login details, please click on the ‘Forgotten your password’ link on the
login page.

The VLE

The VLE, which complements this subject guide, has been designed to enhance your
learning experience, providing additional support and a sense of community. It forms an
important part of your study experience with the University of London and you should
access it regularly.
The VLE provides a range of resources for EMFSS courses:

Self-testing activities: Doing these allows you to test your own understanding of the
subject material.

Electronic study materials: The printed materials that you receive from the
University of London are available to download, including updated reading lists
and references.

Past examination papers and Examiners’ commentaries: These provide advice on


how each examination question might best be answered.

A student discussion forum: This is an open space for you to discuss interests and
experiences, seek support from your peers, work collaboratively to solve problems
and discuss subject material.

Videos: There are recorded academic introductions to the subject, interviews and
debates and, for some courses, audio-visual tutorials and conclusions.

Recorded lectures: For some courses, where appropriate, the sessions from previous
years’ Study Weekends have been recorded and made available.

Study skills: Expert advice on preparing for examinations and developing your
digital literacy skills.

Feedback forms.

Some of these resources are available for certain courses only, but we are expanding our
provision all the time and you should check the VLE regularly for updates.

Making use of the Online Library

The Online Library contains a huge array of journal articles and other resources to help
you read widely and extensively.
To access the majority of resources via the Online Library you will either need to use
your University of London Student Portal login details, or you will be required to
register and use an Athens login:
http://tinyurl.com/ollathens
The easiest way to locate relevant content and journal articles in the Online Library is
to use the Summon search engine.
If you are having trouble finding an article listed in a reading list, try removing any
punctuation from the title, such as single quotation marks, question marks and colons.
For further advice, please see the online help pages:
www.external.shl.lon.ac.uk/summon/about.php

Additional material

There is a lot of computer-based teaching material available freely over the web. A
fairly comprehensive list can be found in the ‘Books & Manuals’ section of
http://statpages.org
Unless otherwise stated, all websites in this subject guide were accessed in April 2014.
We cannot guarantee, however, that they will stay current and you may need to

perform an internet search to find the relevant pages.

1.7 Examination advice


Important: the information and advice given here are based on the examination
structure used at the time this subject guide was written. Please note that subject
guides may be used for several years. Because of this we strongly advise you to always
check both the current Regulations for relevant information about the examination, and
the VLE where you should be advised of any forthcoming changes. You should also
carefully check the rubric/instructions on the paper you actually sit and follow those
instructions.
Remember, it is important to check the VLE for:

up-to-date information on examination and assessment arrangements for this course

where available, past examination papers and Examiners’ commentaries for the
course which give advice on how each question might best be answered.

The examination is by a two-hour unseen question paper. No books may be taken into
the examination, but the use of calculators is permitted, and statistical tables and a
formula sheet are provided (the formula sheet can be found in past examination papers
available on the VLE).
The examination paper has a variety of questions, some quite short and others longer.
All questions must be answered correctly for full marks. You may use your calculator
whenever you feel it is appropriate, always remembering that the Examiners can give
marks only for what appears on the examination script. Therefore, it is important to
always show your working.
In terms of the examination, as always, it is important to manage your time carefully
and not to dwell on one question for too long – move on and focus on solving the easier
questions, coming back to harder ones later.

Chapter 2
Probability theory

2.1 Synopsis of chapter content


Probability is very important for statistics because it provides the rules that allow us to
reason about uncertainty and randomness, which is the basis of statistics. Independence
and conditional probability are profound ideas, but they must be fully understood in
order to think clearly about any statistical investigation.

2.2 Aims of the chapter


The aims of this chapter are to:

understand the concept of probability

work with independent events and determine conditional probabilities

work with probability problems.

2.3 Learning outcomes


After completing this chapter, and having completed the Essential reading and
activities, you should be able to:

explain the fundamental ideas of random experiments, sample spaces and events

list the axioms of probability and be able to derive all the common probability
rules from them

list the formulae for the number of combinations and permutations of k objects out
of n, and be able to routinely use such results in problems

explain conditional probability and the concept of independent events

prove the law of total probability and apply it to problems where there is a
partition of the sample space

prove Bayes’ theorem and apply it to find conditional probabilities.


2.4 Essential reading

Newbold, P., W.L. Carlson and B.M. Thorne, Statistics for Business and
Economics. (London: Prentice-Hall, 2012) eighth edition [ISBN 9780273767060]
Chapter 3.

In addition there is essential ‘watching’ of this chapter’s accompanying video tutorials


accessible via the ST104b Statistics 2 area at http://my.londoninternational.ac.uk

2.5 Introduction
Consider the following hypothetical example: A country will soon hold a referendum
about whether it should join the European Union (EU). An opinion poll of a random
sample of people in the country is carried out.
950 respondents say that they plan to vote in the referendum. They answer the question
‘Will you vote Yes or No to joining the EU?’ as follows:

Answer
Yes No Total
Count 513 437 950
% 54% 46% 100%

However, we are not interested in just this sample of 950 respondents, but in the
population that they represent, that is all likely voters.
Statistical inference will allow us to say things like the following about the
population:

‘A 95% confidence interval for the population proportion, π, of ‘Yes’ voters is


(0.508, 0.572).’

‘The null hypothesis that π = 0.5, against the alternative hypothesis that π > 0.5,
is rejected at the 5% significance level.’

In short, the opinion poll gives statistically significant evidence that ‘Yes’ voters are in
the majority among likely voters. Such methods of statistical inference will be discussed
later in the course.
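The quoted interval and test decision can be reproduced with a short calculation. The
sketch below, written in Python purely for illustration, applies the normal
approximation to the sampling distribution of the sample proportion, which is
developed later in the course.

import math

n, yes = 950, 513
p_hat = yes / n                                  # sample proportion of 'Yes' voters

# 95% confidence interval for pi: p_hat +/- 1.96 * (estimated standard error)
se = math.sqrt(p_hat * (1 - p_hat) / n)
print(round(p_hat - 1.96 * se, 3), round(p_hat + 1.96 * se, 3))   # 0.508 0.572

# Test of H0: pi = 0.5 against H1: pi > 0.5 at the 5% significance level
z = (p_hat - 0.5) / math.sqrt(0.5 * 0.5 / n)
print(round(z, 2), z > 1.645)                    # 2.47 True, so H0 is rejected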
The inferential statements about the opinion poll rely on the following assumptions and
results:

Each response Xi is a realisation of a random variable from a Bernoulli


distribution with probability parameter π.

The responses X1 , X2 , . . . , Xn are independent of each other.

The sampling distribution of the sample mean (proportion) X̄ has expected


value π and variance π (1 − π)/n.


By use of the central limit theorem, the sampling distribution is approximately


a normal distribution.

In the next few chapters, we will learn about the terms in bold, among others.
The need for probability in statistics

In statistical inference, the data that we have observed are regarded as a sample from a
broader population, selected with a random process:

Values in a sample are variable: If we collected a different sample we would not


observe exactly the same values again.

Values in a sample are also random: We cannot predict the precise values that will
be observed before we actually collect the sample.

Probability theory is the branch of mathematics that deals with randomness. So we


need to study this first.

A preview of probability

The first basic concepts in probability will be the following:

Experiment: For example, rolling a single die and recording the outcome.

Outcome of the experiment: For example, rolling a 3.

Sample space S: The set of all possible outcomes; here {1, 2, 3, 4, 5, 6}.

Event: Any subset A of the sample space, for example A = {4, 5, 6}.1

Probability, P (A), will be defined as a function which assigns probabilities (real


numbers) to events (sets). This uses the language and concepts of set theory. So we
need to study the basics of set theory first.

2.6 Set theory: the basics


A set is a collection of elements (also known as ‘members’ of the set).

Example 2.1 The following are all examples of sets:

A = {Amy, Bob, Sam}.

B = {1, 2, 3, 4, 5}.

C = {x | x is a prime number} = {2, 3, 5, 7, 11, . . . }.

D = {x | x ≥ 0} (that is, the set of all non-negative real numbers).


1
Strictly speaking not all subsets are events, as discussed later.


Membership of sets and the empty set

x ∈ A means that object x is an element of set A.


x ∉ A means that object x is not an element of set A.
The empty set, denoted ∅, is the set with no elements, i.e. x ∉ ∅ is true for every
object x, and x ∈ ∅ is not true for any object x.

Example 2.2 If A = {1, 2, 3, 4, 5}, then:

1 ∈ A and 2 ∈ A.

6 ∉ A and 1.5 ∉ A.

The familiar Venn diagrams help to visualise statements about sets. However, Venn
diagrams are not formal proofs of results in set theory.

Example 2.3 In Figure 2.1, the darkest area in the middle is A ∩ B, the total
shaded area is A ∪ B, and the white area is (A ∪ B)c = Ac ∩ B c .

Figure 2.1: Venn diagram depicting A ∪ B (the total shaded area).

Subsets and equality of sets

A ⊂ B means that set A is a subset of set B, defined as:

A ⊂ B when x ∈ A ⇒ x ∈ B.

Hence A is a subset of B if every element of A is also an element of B. An example


is shown in Figure 2.2.

Example 2.4 An example of the distinction between subsets and non-subsets is:

{1, 2, 3} ⊂ {1, 2, 3, 4}, because all elements appear in the larger set.

{1, 2, 5} ⊄ {1, 2, 3, 4}, because the element 5 does not appear in the larger set.


Figure 2.2: Venn diagram depicting a subset, where A ⊂ B.

Two sets A and B are equal (A = B) if they have exactly the same elements. This
implies that A ⊂ B and B ⊂ A.

Unions of sets (‘or’)

The union, denoted ∪, of two sets is:

A ∪ B = {x | x ∈ A or x ∈ B}.

That is, the set of those elements which belong to A or B (or both). An example is
shown in Figure 2.3.

Figure 2.3: Venn diagram depicting the union of two sets.

Example 2.5 If A = {1, 2, 3, 4}, B = {2, 3} and C = {4, 5, 6}, then:

A ∪ B = {1, 2, 3, 4}

A ∪ C = {1, 2, 3, 4, 5, 6}

B ∪ C = {2, 3, 4, 5, 6}.

Intersections of sets (‘and’)

The intersection, denoted ∩, of two sets is:

A ∩ B = {x | x ∈ A and x ∈ B}.

That is, the set of those elements which belong to both A and B. An example is
shown in Figure 2.4.


Figure 2.4: Venn diagram depicting the intersection of two sets.

Example 2.6 If A = {1, 2, 3, 4}, B = {2, 3} and C = {4, 5, 6}, then:

A ∩ B = {2, 3}

A ∩ C = {4}

B ∩ C = ∅.
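These set operations have direct counterparts in most programming languages. Purely as
an illustration, the following Python sketch reproduces the unions and intersections of
Examples 2.5 and 2.6 using built-in sets.

A = {1, 2, 3, 4}
B = {2, 3}
C = {4, 5, 6}

print(A | B)   # union A ∪ B:        {1, 2, 3, 4}
print(A | C)   # union A ∪ C:        {1, 2, 3, 4, 5, 6}
print(B | C)   # union B ∪ C:        {2, 3, 4, 5, 6}
print(A & B)   # intersection A ∩ B: {2, 3}
print(A & C)   # intersection A ∩ C: {4}
print(B & C)   # intersection B ∩ C: set(), i.e. the empty set ∅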

Unions and intersections of many sets

Both set operators can also be applied to more than two sets, such as A ∩ B ∩ C.
Concise notation for the unions and intersections of sets A1 , A2 , . . . , An is:
⋃_{i=1}^{n} Ai = A1 ∪ A2 ∪ · · · ∪ An

⋂_{i=1}^{n} Ai = A1 ∩ A2 ∩ · · · ∩ An .

These can also be used for an infinite number of sets, i.e. when n is replaced by ∞.

Complement (‘not’)

Suppose S is the set of all possible elements which are under consideration. In
probability, S will be referred to as the sample space.
It follows that A ⊂ S for every set A we may consider. The complement of A with
respect to S is:
Ac = {x | x ∈ S and x ∉ A}.
That is, the set of those elements of S that are not in A. An example is shown in
Figure 2.5.


Figure 2.5: Venn diagram depicting the complement of a set.

Properties of set operators

In proofs and derivations about sets, you can use the following results without proof:

Commutativity:

A ∩ B = B ∩ A and A ∪ B = B ∪ A.

Associativity:

A ∩ (B ∩ C) = (A ∩ B) ∩ C and A ∪ (B ∪ C) = (A ∪ B) ∪ C.

Distributive laws:

A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C) and A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C).

De Morgan’s laws:

(A ∩ B)c = Ac ∪ B c and (A ∪ B)c = Ac ∩ B c .

If S is the sample space and A and B are any sets in S, you can also use the following
results without proof:

∅c = S.

∅ ⊂ A, A ⊂ A and A ⊂ S.

A ∩ A = A and A ∪ A = A.

A ∩ Ac = ∅ and A ∪ Ac = S.

If B ⊂ A, A ∩ B = B and A ∪ B = A.

A ∩ ∅ = ∅ and A ∪ ∅ = A.

A ∩ S = A and A ∪ S = S.

∅ ∩ ∅ = ∅ and ∅ ∪ ∅ = ∅.
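These identities are established formally from the definitions above, but they can also
be checked mechanically on a small example. The following Python sketch, included
purely as an illustration, verifies De Morgan's laws for every pair of subsets of a
small sample space S, using the set difference S − A as the complement Ac.

from itertools import combinations

S = {1, 2, 3, 4}
subsets = [set(c) for r in range(len(S) + 1) for c in combinations(S, r)]

for A in subsets:
    for B in subsets:
        # (A ∩ B)^c = A^c ∪ B^c   and   (A ∪ B)^c = A^c ∩ B^c
        assert S - (A & B) == (S - A) | (S - B)
        assert S - (A | B) == (S - A) & (S - B)

print("De Morgan's laws hold for all", len(subsets), "subsets of S")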


Mutually exclusive events

Two sets A and B are disjoint or mutually exclusive if:


A ∩ B = ∅.

Sets A1 , A2 , . . . , An are pairwise disjoint if all pairs of sets from them are disjoint,
i.e. Ai ∩ Aj = ∅ for all i ≠ j.

Partition

The sets A1 , A2 , . . . , An form a partition of the set A if they are pairwise disjoint
and if ⋃_{i=1}^{n} Ai = A, that is, A1 , A2 , . . . , An are collectively exhaustive of A.
Therefore, a partition divides the entire set A into non-overlapping pieces Ai , as
shown in Figure 2.6 for n = 3. Similarly, an infinite collection of sets A1 , A2 , . . . form
a partition of A if they are pairwise disjoint and ⋃_{i=1}^{∞} Ai = A.


Figure 2.6: The partition of the set A into A1 , A2 and A3 .

Example 2.7 Suppose that A ⊂ B. Show that A and B ∩ Ac form a partition of B.

We have:
A ∩ (B ∩ Ac ) = (A ∩ Ac ) ∩ B = ∅ ∩ B = ∅
and:
A ∪ (B ∩ Ac ) = (A ∪ B) ∩ (A ∪ Ac ) = B ∩ S = B.

Hence A and B ∩ Ac are mutually exclusive and collectively exhaustive of B, and so


they form a partition of B.
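For a concrete check of this result, purely as an illustration, take an assumed sample
space S = {1, 2, 3, 4, 5, 6} with B = {1, 2, 3, 4} and A = {1, 2} ⊂ B; the Python sketch
below confirms that A and B ∩ Ac are mutually exclusive and collectively exhaustive of B.

S = {1, 2, 3, 4, 5, 6}       # assumed sample space containing B
B = {1, 2, 3, 4}
A = {1, 2}                    # here A ⊂ B

piece = B & (S - A)           # B ∩ Ac = {3, 4}
print(A & piece == set())     # True: the two sets are mutually exclusive
print(A | piece == B)         # True: together they are collectively exhaustive of B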


2.7 Axiomatic definition of probability


First, we consider four basic concepts in probability.
An experiment is a process which produces outcomes and which can have several
different outcomes. The sample space S is the set of all possible outcomes of the
experiment. An event is any subset A of the sample space such that A ⊂ S.

Example 2.8 If the experiment is ‘select a trading day at random and record the
% change in the FTSE 100 index from the previous trading day’, then the outcome
is the % change in the FTSE 100 index.
S = [−100, +∞) for the % change in the FTSE 100 index (in principle).
An event of interest might be A = {x | x > 0} – the event that the daily change is
positive, i.e. the FTSE 100 index gains value from the previous trading day.

The sample space and events are represented as sets. For two events A and B, set
operations are then interpreted as follows:

A ∩ B: both A and B happen.

A ∪ B: either A or B happens (or both happen).

Ac : A does not happen, i.e. something other than A happens.

Once we introduce probabilities of events, we can also say that:

the sample space S is a certain event

the empty set ∅ is an impossible event.

Axioms of probability

‘Probability’ is formally defined as a function P (A) from subsets (events) of the


sample space S onto real numbers.2 Such a function is a probability function if it
satisfies the following axioms (‘self-evident truths’):

Axiom 1: P (A) ≥ 0 for all events A.

Axiom 2: P (S) = 1.

Axiom 3: If events A1 , A2 , . . . are pairwise disjoint (i.e. Ai ∩ Aj = ∅ for all
i ≠ j), then:

P (⋃_{i=1}^{∞} Ai) = Σ_{i=1}^{∞} P (Ai).

2
The precise definition also requires a careful statement of which subsets of S are allowed as events;
we can skip that on this course.


The axioms require that a probability function must always satisfy these requirements:

Axiom 1 requires that probabilities are always non-negative.


Axiom 2 requires that the outcome is some element from the sample space with
certainty (that is, with probability 1). In other words, the experiment must have
some outcome.

Axiom 3 states that if events A1 , A2 , . . . are mutually exclusive, the probability of


their union is simply the sum of their individual probabilities.

All other properties of the probability function can be derived from the axioms. We
begin by showing that a result like Axiom 3 also holds for finite collections of mutually
exclusive sets.

2.7.1 Basic properties of probability

Probability property

For the empty set, ∅, we have:


P (∅) = 0. (2.1)

Proof : Since ∅ ∩ ∅ = ∅ and ∅ ∪ ∅ = ∅, Axiom 3 gives:



P (∅) = P (∅ ∪ ∅ ∪ · · · ) = Σ_{i=1}^{∞} P (∅).

But the only real number for P (∅) which satisfies this is P (∅) = 0. □

Probability property

(Finite additivity:) If A1 , A2 , . . . , An are pairwise disjoint, then:


P (⋃_{i=1}^{n} Ai) = Σ_{i=1}^{n} P (Ai).

Proof : In Axiom 3, set An+1 = An+2 = · · · = ∅, so that:



P (⋃_{i=1}^{n} Ai) = Σ_{i=1}^{∞} P (Ai) = Σ_{i=1}^{n} P (Ai) + Σ_{i=n+1}^{∞} P (Ai) = Σ_{i=1}^{n} P (Ai)

since P (Ai) = P (∅) = 0 for i = n + 1, n + 2, . . .. □


In pictures, the previous result means that in a situation like the one shown in Figure
2.7, the probability of the combined event A = A1 ∪ A2 ∪ A3 is simply the sum of the
probabilities of the individual events:

P (A) = P (A1 ) + P (A2 ) + P (A3 ).



Figure 2.7: Venn diagram depicting three mutually exclusive sets, A1 , A2 and A3 . Note
although A2 and A3 have touching boundaries, there is no actual intersection and hence
they are (pairwise) mutually exclusive.

That is, we can simply sum probabilities of mutually exclusive sets. This is very useful
for deriving further results.

Probability property

For any event A, we have:


P (Ac ) = 1 − P (A).

Proof : We have that A ∪ Ac = S and A ∩ Ac = ∅. Therefore:

1 = P (S) = P (A ∪ Ac ) = P (A) + P (Ac )

using the previous result, with n = 2, A1 = A and A2 = Ac . □

Probability property

For any event A, we have:


P (A) ≤ 1.

Proof (by contradiction): If it was true that P (A) > 1 for some A, then we would have:

P (Ac ) = 1 − P (A) < 0.

This violates Axiom 1, so cannot be true. Therefore it must be that P (A) ≤ 1 for all A.
Putting this and Axiom 1 together, we get:

0 ≤ P (A) ≤ 1

for all events A. □

Probability property

For any two events A and B, if A ⊂ B, then P (A) ≤ P (B).


Proof : We proved in Example 2.7 that we can partition B as B = A ∪ (B ∩ Ac ) where


the two sets in the union are disjoint. Therefore:
P (B) = P (A ∪ (B ∩ Ac )) = P (A) + P (B ∩ Ac ) ≥ P (A)

since P (B ∩ Ac ) ≥ 0. □

Activity 2.1 For any two events A and B, prove that:

P (A ∪ B) = P (A) + P (B) − P (A ∩ B).

In summary, the probability function has the following properties:

P (S) = 1 and P (∅) = 0.

0 ≤ P (A) ≤ 1 for all events A.

If A ⊂ B, then P (A) ≤ P (B).

These show that the probability function has the kinds of values we expect of something
called a ‘probability’.

P (Ac ) = 1 − P (A).

P (A ∪ B) = P (A) + P (B) − P (A ∩ B).

These are useful for deriving probabilities of new events.

Example 2.9 Suppose that, on an average weekday, of all adults in a country:

86% spend at least 1 hour watching television (event A, with P (A) = 0.86).

19% spend at least 1 hour reading newspapers (event B, with P (B) = 0.19).

15% spend at least 1 hour watching television, and at least 1 hour reading
newspapers (P (A ∩ B) = 0.15).

We select a member of the population for an interview at random. Then, for


example, we have:

P (Ac ) = 1 − P (A) = 1 − 0.86 = 0.14: the probability that the respondent


watches less than 1 hour of television.

P (A ∪ B) = P (A) + P (B) − P (A ∩ B) = 0.86 + 0.19 − 0.15 = 0.90: the


probability that the respondent spends at least 1 hour watching television or
reading newspapers (or both).


What does ‘probability’ mean?

Probability theory tells us how to work with the probability function and derive
‘probabilities of events’ from it. However, it does not tell us what ‘probability’ really
means.
There are several alternative interpretations of the real-world meaning of ‘probability’
in this sense. One of them is outlined below. The mathematical theory of probability
and calculations on probabilities are the same whichever interpretation we assign to
‘probability’. So, in this course, we do not need to discuss the matter further.

Frequency interpretation of probability

This states that the probability of an outcome A of an experiment is the proportion


(relative frequency) of trials in which A would be the outcome if the experiment was
repeated a very large number of times under similar conditions.

Example 2.10 How should we interpret the following, as statements about the real
world of coins and babies?

‘The probability that a tossed coin comes up heads is 0.5.’ If we tossed a coin a
large number of times, and the proportion of heads out of those tosses was 0.5,
the ‘probability of heads’ could be said to be 0.5, for that coin.

‘The probability is 0.51 that a child born in the UK today is a boy.’ If the
proportion of boys among a large number of live births was 0.51, the
‘probability of a boy’ could be said to be 0.51.

How to find probabilities?

A key question is how to determine appropriate numerical values P (A) for the
probabilities of particular events.
This is usually done empirically, by observing actual realisations of the experiment and
using them to estimate probabilities. In the simplest cases, this basically applies the
frequency definition to observed data.

Example 2.11

If I toss a coin 10,000 times, and 5,023 of the tosses come up heads, it seems
that, approximately, P (heads) = 0.5, for that coin.

Of the 7,098,667 live births in England and Wales in the period 1999–2009,
51.26% were boys. So we could assign the value of about 0.51 to the probability
of a boy in that population.

The estimation of probabilities of events from observed data is an important part of


statistics.
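The frequency interpretation, and the empirical estimation of probabilities from data,
are easy to mimic on a computer. The Python sketch below is purely illustrative (the
seed and the numbers of tosses are arbitrary choices): it simulates tosses of a fair
coin and shows the relative frequency of heads settling near 0.5 as the number of
tosses grows, in the spirit of Example 2.11.

import random

random.seed(1)                                   # arbitrary seed, for reproducibility
for n in (100, 10_000, 1_000_000):
    heads = sum(random.random() < 0.5 for _ in range(n))
    print(n, heads / n)                          # relative frequency of heads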


2.8 Classical probability and counting rules

Classical probability is a simple special case where values of probabilities can be


found by just counting outcomes. This requires that:

The sample space contains only a finite number of outcomes.

All of the outcomes are equally probable.

Standard illustrations of classical probability are devices used in games of chance:

Tossing a coin (heads or tails) one or more times.

Rolling one or more dice (each scored 1, 2, 3, 4, 5 or 6).

Drawing one or more playing cards from a deck of 52 cards.

We will use these often, not because they are particularly important but because they
provide simple examples for illustrating various results in probability.
Suppose that the sample space S contains m equally likely outcomes, and that event A
consists of k ≤ m of these outcomes. Then:
P (A) = k/m = (number of outcomes in A) / (total number of outcomes in sample space S).

That is, the probability of A is the proportion of outcomes that belong to A out of all
possible outcomes.
In the classical case, the probability of any event can be determined by counting the
number of outcomes that belong to the event, and the total number of possible
outcomes.

Example 2.12 Rolling two dice, what is the probability that the sum of the two
scores is 5?

Sample space: the 36 ordered pairs:

S = {(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6),
     (2, 1), (2, 2), (2, 3), (2, 4), (2, 5), (2, 6),
     (3, 1), (3, 2), (3, 3), (3, 4), (3, 5), (3, 6),
     (4, 1), (4, 2), (4, 3), (4, 4), (4, 5), (4, 6),
     (5, 1), (5, 2), (5, 3), (5, 4), (5, 5), (5, 6),
     (6, 1), (6, 2), (6, 3), (6, 4), (6, 5), (6, 6)}.

Outcomes in the event: A = {(1, 4), (2, 3), (3, 2), (4, 1)}.

The probability: P (A) = 4/36 = 1/9.
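Such classical probabilities are easy to verify by brute-force enumeration. The short Python sketch below is purely illustrative:

    from itertools import product

    outcomes = list(product(range(1, 7), repeat=2))   # the 36 ordered pairs
    event_a = [pair for pair in outcomes if sum(pair) == 5]
    print(len(event_a), len(outcomes))                # 4 36
    print(len(event_a) / len(outcomes))               # 0.111... = 1/9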


Now that we have a way of obtaining probabilities for events in the classical case, we
can use it together with the rules of probability.
The formula P (A) = 1 − P (Ac ) is convenient when we want P (A) but the probability of
the complementary event Ac , i.e. P (Ac ), is easier to find.

Example 2.13 When rolling two fair dice, what is the probability that the sum of
the dice is greater than 3?

The complement is that the sum is at most 3, i.e. the complementary event is
Ac = {(1, 1), (1, 2), (2, 1)}.

Therefore, P (A) = 1 − 3/36 = 33/36 = 11/12.

The formula:
P (A ∪ B) = P (A) + P (B) − P (A ∩ B)
says that the probability that A or B happens (or both happen) is the sum of the
probabilities of A and B, minus the probability that both A and B happen.

Example 2.14 When rolling two fair dice, what is the probability that the two
scores are equal (event A) or that the total score is greater than 10 (event B)?

P (A) = 6/36, P (B) = 3/36 and P (A ∩ B) = 1/36.

So P (A ∪ B) = P (A) + P (B) − P (A ∩ B) = (6 + 3 − 1)/36 = 8/36 = 2/9.

2.8.1 Combinatorial counting methods


A powerful set of counting methods answers the following question: How many ways are
there to select k objects out of n distinct objects?
The answer will depend on two things:

Whether the selection is with replacement (an object can be selected more than
once) or without replacement (an object can be selected only once).

Whether the selected set is treated as ordered or unordered.

Ordered sets, with replacement

Suppose that the selection of k objects out of n needs to be:

ordered, so that the selection is an ordered sequence where we distinguish between


the 1st object, 2nd, 3rd etc.

with replacement, so that each of the n objects may appear several times in the
selection.


Then:

n objects are available for selection for the 1st object in the sequence
n objects are available for selection for the 2nd object in the sequence

. . . and so on, until n objects are available for selection for the kth object in the
sequence.

The number of possible ordered sequences of k objects selected with replacement from n
objects is therefore:
n × n × · · · × n = n^k   (k factors).

Ordered sets, without replacement

Suppose that the selection of k objects out of n is again treated as an ordered sequence,
but that selection is now:

ordered, so that the selection is an ordered sequence where we distinguish between


the 1st object, 2nd, 3rd etc.

without replacement: if an object is selected once, it cannot be selected again.

Now:

n objects are available for selection for the 1st object in the sequence

n − 1 objects are available for selection for the 2nd object

n − 2 objects are available for selection for the 3rd object

. . . and so on, until n − k + 1 objects are available for selection for the kth object.

The number of possible ordered sequences of k objects selected without replacement


from n objects is therefore:

n × (n − 1) × · · · × (n − k + 1). (2.2)

An important special case is when k = n.

Factorials

The number of ordered sets of n objects, selected without replacement from n objects,
is:
n! = n × (n − 1) × · · · × 2 × 1.
The number n! (read ‘n factorial’) is the total number of different ways in which
n objects can be arranged in an ordered sequence. This is known as the number of
permutations of n objects.
We also define 0! = 1.


Using factorials, (2.2) can be written as:

n × (n − 1) × · · · × (n − k + 1) = n! / (n − k)!.
Unordered sets, without replacement

Suppose now that the identities of the objects in the selection matter, but the order
does not.

For example, the sequences (1, 2, 3), (1, 3, 2), (2, 1, 3), (2, 3, 1), (3, 1, 2), (3, 2, 1) are
now all treated as the same, because they all contain the elements 1, 2 and 3.

The number of such unordered subsets (combinations) of k out of n objects is


determined as follows:

The number of ordered sequences is n!/(n − k)!.

Among these, every different combination of k distinct elements appears k! times,


in different orders.

Ignoring the ordering, there are therefore:


 
C(n, k) = n! / ((n − k)! k!)

different combinations, for each k = 0, 1, . . . , n.


The number C(n, k) is known as the binomial coefficient (read ‘n choose k’). Note that
because 0! = 1, C(n, 0) = C(n, n) = 1, so there is only 1 way of selecting 0 or n out of n objects.

Example 2.15 Suppose we have k = 3 people (Amy, Bob and Sam). How many
different sets of birthdays can they have (day and month, ignoring the year, and
pretending February 29th does not exist, so that n = 365), in the following cases?

1. It makes a difference who has which birthday (ordered ), i.e. Amy (January 1st),
Bob (May 5th) and Sam (December 5th) is different from Amy (May 5th), Bob
(December 5th) and Sam (January 1st), and different people can have the same
birthday (with replacement). The number of different sets of birthdays is:

365^3 = 48,627,125.

2. It makes a difference who has which birthday (ordered ), and different people
must have different birthdays (without replacement). The number of different
sets of birthdays is:
365! / (365 − 3)! = 365 × 364 × 363 = 48,228,180.


3. Only the dates matter, but not who has which one (unordered ), i.e. Amy
(January 1st), Bob (May 5th) and Sam (December 5th) is treated as the same
as Amy (May 5th), Bob (December 5th) and Sam (January 1st), and different
people must have different birthdays (without replacement). The number of
different sets of birthdays is:

C(365, 3) = 365! / ((365 − 3)! 3!) = (365 × 364 × 363) / (3 × 2 × 1) = 8,038,030.
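These three counts can be reproduced with Python's standard library (math.perm and math.comb require Python 3.8 or later); the sketch is purely illustrative:

    import math

    with_replacement = 365 ** 3              # ordered, with replacement
    ordered_no_repl = math.perm(365, 3)      # ordered, without replacement
    unordered = math.comb(365, 3)            # unordered, without replacement
    print(with_replacement, ordered_no_repl, unordered)
    # 48627125 48228180 8038030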

Example 2.16 Consider a room with r people in it. What is the probability that
at least two of them have the same birthday (call this event A)? In particular, what
is the smallest r for which P (A) > 1/2?
Assume that all days are equally likely.
Label the people 1 to r, so that we can treat them as an ordered list and talk about
person 1, person 2 etc. We want to know how many ways there are to assign
birthdays to this list of people. We note the following:

1. The number of all possible sequences of birthdays, allowing repeats (i.e. with
replacement) is 365^r .

2. The number of sequences where all birthdays are different (i.e. without
replacement) is 365!/(365 − r)!.

Here ‘1.’ is the size of the sample space, and ‘2.’ is the number of outcomes which
satisfy Ac , the complement of the case in which we are interested.
Therefore:
P (Ac ) = [365!/(365 − r)!] / 365^r = [365 × 364 × · · · × (365 − r + 1)] / 365^r

and:

P (A) = 1 − P (Ac ) = 1 − [365 × 364 × · · · × (365 − r + 1)] / 365^r .
Probabilities, for P (A), of at least two people sharing a birthday, for different values
of the number of people r are given in the following table:

r P (A) r P (A) r P (A) r P (A)


2 0.003 12 0.167 22 0.476 32 0.753
3 0.008 13 0.194 23 0.507 33 0.775
4 0.016 14 0.223 24 0.538 34 0.795
5 0.027 15 0.253 25 0.569 35 0.814
6 0.040 16 0.284 26 0.598 36 0.832
7 0.056 17 0.315 27 0.627 37 0.849
8 0.074 18 0.347 28 0.654 38 0.864
9 0.095 19 0.379 29 0.681 39 0.878
10 0.117 20 0.411 30 0.706 40 0.891
11 0.141 21 0.444 31 0.730 41 0.903
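The entries in this table are straightforward to reproduce numerically; a short, purely illustrative Python sketch is:

    def p_shared_birthday(r):
        # P(at least two of r people share a birthday), assuming 365 equally likely days
        p_all_different = 1.0
        for i in range(r):
            p_all_different *= (365 - i) / 365
        return 1 - p_all_different

    print(round(p_shared_birthday(23), 3))   # 0.507, the first value above 1/2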


2.9 Conditional probability and Bayes’ theorem


Next we introduce some of the most important concepts in probability:
Independence

Conditional probability

Bayes’ theorem.

These give us powerful tools for:

deriving probabilities of combinations of events

updating probabilities of events, after we learn that some other events have
happened.

Independence

Two events A and B are (statistically) independent if:

P (A ∩ B) = P (A) P (B).

Independence is sometimes denoted A ⊥⊥ B. Intuitively, independence means that:

if A happens, this does not affect the probability of B happening (and vice versa)

if you are told that A has happened, this does not give you any new information
about the value of P (B) (and vice versa).
For example, independence is often a reasonable assumption when A and B
correspond to physically separate experiments.

Example 2.17 Suppose we roll two dice. We assume that all combinations of the
values of them are equally likely. Define the events:

A = ‘Score of die 1 is not 6’

B = ‘Score of die 2 is not 6’.

Then:

P (A) = 30/36 = 5/6

P (B) = 30/36 = 5/6

P (A ∩ B) = 25/36 = 5/6 × 5/6 = P (A) P (B), so A and B are independent.


Independence of multiple events

Events A1 , A2 , . . . , An are independent if the probability of the intersection of any subset
of these events is the product of the individual probabilities of the events in the subset.
This implies the important result that if events A1 , A2 , . . . , An are independent, then:

P (A1 ∩ A2 ∩ · · · ∩ An ) = P (A1 ) P (A2 ) · · · P (An ).

Note that there is a difference between pairwise independence and full independence.
The following example illustrates.

Example 2.18 It can be cold in London. Four impoverished teachers dress to feel
warm. Teacher A has a hat and a scarf and gloves, Teacher B has only a hat,
Teacher C has only a scarf and Teacher D has only gloves. One teacher out of the
four is selected at random. Show that although each pair of events H = ‘the teacher
selected has a hat’, S = ‘the teacher selected has a scarf’, and G = ‘the teacher
selected has gloves’ are independent, all three of these events are not independent.
Two teachers have a hat, two teachers have a scarf, and two teachers have gloves, so:
P (H) = 2/4 = 1/2,   P (S) = 2/4 = 1/2   and   P (G) = 2/4 = 1/2.
Only one teacher has both a hat and a scarf, so:
P (H ∩ S) = 1/4

and similarly:

P (H ∩ G) = 1/4   and   P (S ∩ G) = 1/4.
From these results, we can verify that:

P (H ∩ S) = P (H) P (S)
P (H ∩ G) = P (H) P (G)
P (S ∩ G) = P (S) P (G)

and so the events are pairwise independent. But one teacher has a hat, a scarf and
gloves, so:
P (H ∩ S ∩ G) = 1/4 ≠ P (H) P (S) P (G).
Hence the three events are not independent. If the selected teacher has a hat and a
scarf, then we know that the teacher has gloves. There is no independence for all
three events together.

Independent versus mutually exclusive events

The idea of independent events is quite different from that of mutually exclusive
(disjoint) events, as shown in Figure 2.8.


Figure 2.8: Venn diagram depicting mutually exclusive events.

For mutually exclusive events A ∩ B = ∅, and so, from (2.1), P (A ∩ B) = 0. For
independent events, P (A ∩ B) = P (A) P (B). Since P (A ∩ B) = 0 ≠ P (A) P (B) in
general (except in the uninteresting case that P (A) = 0 or P (B) = 0), mutually
exclusive events and independent events are different.
In fact, mutually exclusive events are extremely non-independent (i.e. dependent). For
example, if you know that A has happened, you know for certain that B has not
happened. There is no particularly helpful way to represent independent events using a
Venn diagram.

Conditional probability

Consider two events A and B. Suppose that you are told that B has occurred. How
does this affect the probability of event A?
The answer is given by the conditional probability of A given that B has occurred,
or the conditional probability of A given B for short, defined as:

P (A | B) = P (A ∩ B) / P (B)

assuming that P (B) > 0. The conditional probability is not defined if P (B) = 0.

Example 2.19 Suppose we roll two independent fair dice again. Consider the
following events:

A = ‘at least one of the scores is 2’.

B = ‘the sum of the scores is greater than 7’.

These are shown in Figure 2.9. Now P (A) = 11/36 ≈ 0.31, P (B) = 15/36 and
P (A ∩ B) = 2/36. The conditional probability of A given B is therefore:

P (A | B) = P (A ∩ B) / P (B) = (2/36) / (15/36) = 2/15 ≈ 0.13.

Learning that B has happened causes us to revise (update) the probability of A
downwards, from 0.31 to 0.13.


Figure 2.9: Events A, B and A ∩ B for Example 2.19 (the 36 ordered outcomes of the two dice, with the outcomes belonging to A and to B marked).

One way to think about conditional probability is that when we condition on B, we
redefine the sample space to be B.

Example 2.20 In Example 2.19, when we are told that the conditioning event B
has occurred, we know we are within the green line in Figure 2.9. So the 15
outcomes within it become the new sample space. There are 2 outcomes which
satisfy A and which are inside this new sample space, so:
P (A | B) = 2/15 = (cases of A within B) / (cases of B).

Conditional probability of independent events

If A ⊥⊥ B, i.e. P (A ∩ B) = P (A) P (B), and P (B) > 0 and P (A) > 0, then:
P (A | B) = P (A ∩ B) / P (B) = P (A) P (B) / P (B) = P (A)

and:

P (B | A) = P (A ∩ B) / P (A) = P (A) P (B) / P (A) = P (B).
In other words, if A and B are independent, learning that B has occurred does not
change the probability of A, and learning that A has occurred does not change the
probability of B. This is exactly what we would expect under independence.

Chain rule of conditional probabilities

Since P (A | B) = P (A ∩ B)/P (B), then:


P (A ∩ B) = P (A | B) P (B).


That is, the probability that both A and B occur is the probability that A occurs given
that B has occurred multiplied by the probability that B occurs. An intuitive graphical
version of this is a small path diagram, start → B → A.

The path to A is to get first to B, and then from B to A.


It is also true that:
P (A ∩ B) = P (B | A) P (A)

and you can use whichever is more convenient. Very often some version of this chain
rule is much easier than calculating P (A ∩ B) directly.
The chain rule generalises to multiple events:

P (A1 ∩ A2 ∩ · · · ∩ An ) = P (A1 ) P (A2 | A1 ) P (A3 | A1 , A2 ) · · · P (An | A1 , . . . , An−1 )

where, for example, P (A3 | A1 , A2 ) is shorthand for P (A3 | A1 ∩ A2 ). The events can be
taken in any order, as shown in Example 2.21.

Example 2.21 For n = 3, we have:

P (A1 ∩ A2 ∩ A3 ) = P (A1 ) P (A2 | A1 ) P (A3 | A1 , A2 )


= P (A1 ) P (A3 | A1 ) P (A2 | A1 , A3 )
= P (A2 ) P (A1 | A2 ) P (A3 | A1 , A2 )
= P (A2 ) P (A3 | A2 ) P (A1 | A2 , A3 )
= P (A3 ) P (A1 | A3 ) P (A2 | A1 , A3 )
= P (A3 ) P (A2 | A3 ) P (A1 | A2 , A3 ).

Example 2.22 Suppose you draw 4 cards from a deck of 52 playing cards. What is
the probability of A = ‘the cards are the 4 aces (cards of rank 1)’ ?
We could calculate this using counting rules. There are C(52, 4) = 270,725 possible
subsets of 4 different cards, and only 1 of these consists of the 4 aces. Therefore
P (A) = 1/270,725.
Let us try with conditional probabilities. Define Ai as ‘the ith card is an ace’, so
that A = A1 ∩ A2 ∩ A3 ∩ A4 . The necessary probabilities are:

P (A1 ) = 4/52 since there are initially 4 aces in the deck of 52 playing cards.

P (A2 | A1 ) = 3/51. If the first card is an ace, 3 aces remain in the deck of 51
playing cards from which the second card will be drawn.

P (A3 | A1 , A2 ) = 2/50.

P (A4 | A1 , A2 , A3 ) = 1/49.


Putting these together with the chain rule gives:

P (A) = P (A1 ) P (A2 | A1 ) P (A3 | A1 , A2 ) P (A4 | A1 , A2 , A3 )
      = (4/52) × (3/51) × (2/50) × (1/49) = 24/6,497,400 = 1/270,725.
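As a purely illustrative numerical check of both routes to this answer:

    from fractions import Fraction
    from math import comb

    p_chain = Fraction(4, 52) * Fraction(3, 51) * Fraction(2, 50) * Fraction(1, 49)
    p_count = Fraction(1, comb(52, 4))
    print(p_chain, p_count, p_chain == p_count)   # 1/270725 1/270725 True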

Here we could obtain the result in two ways. However, there are very many situations
where classical probability and counting rules are not usable, whereas conditional
probabilities and the chain rule are completely general and always applicable.

More methods for summing probabilities

We now return to probabilities of partitions like the situation shown in Figure 2.10.

Figure 2.10: On the left, a Venn diagram depicting A = A1 ∪ A2 ∪ A3 , and on the right
the ‘paths’ to A.

Both diagrams in Figure 2.10 represent the partition A = A1 ∪ A2 ∪ A3 . For the next
results, it will be convenient to use diagrams like the one on the right in Figure 2.10,
where A1 , A2 and A3 are symbolised as different ‘paths’ to A.
We now develop powerful methods of calculating sums like:

P (A) = P (A1 ) + P (A2 ) + P (A3 ).

2.9.1 Total probability formula


Suppose B1 , B2 , . . . , BK form a partition of the sample space. Then
A ∩ B1 , A ∩ B2 , . . . , A ∩ BK form a partition of A, as shown in Figure 2.11.
In other words, think of event A as the union of all the A ∩ Bi s, i.e. of ‘all the paths to
A via different intervening events Bi ’.
To get the probability of A, we now:

1. Apply the chain rule to each of the paths:

P (A ∩ Bi ) = P (A | Bi ) P (Bi ).

2. Add up the probabilities of the paths:

P (A) = Σ_{i=1}^{K} P (A ∩ Bi ) = Σ_{i=1}^{K} P (A | Bi ) P (Bi ).


Figure 2.11: On the left, a Venn diagram depicting the set A and the partition of S, and
on the right the ‘paths’ to A.

This is known as the formula of total probability. It looks complicated, but it is
actually often far easier to use than trying to find P (A) directly.

Example 2.23 Any event B has the property that B and its complement B c
partition the sample space. So if we take K = 2, B1 = B and B2 = B c in the formula
of total probability, we get:

P (A) = P (A | B) P (B) + P (A | B c ) P (B c )
= P (A | B) P (B) + P (A | B c ) [1 − P (B)].

(Diagrammatically, there are two ‘paths’ to A here: one via B and one via B c .)

Example 2.24 Suppose that 1 in 10,000 people (0.01%) has a particular disease. A
diagnostic test for the disease has 99% sensitivity: If a person has the disease, the
test will give a positive result with a probability of 0.99. The test has 99%
specificity: If a person does not have the disease, the test will give a negative result
with a probability of 0.99.
Let B denote the presence of the disease, and B c denote no disease. Let A denote a
positive test result. We want to calculate P (A).
The probabilities we need are P (B) = 0.0001, P (B c ) = 0.9999, P (A | B) = 0.99 and
P (A | B c ) = 0.01, and therefore:

P (A) = P (A | B) P (B) + P (A | B c ) P (B c )
= 0.99 × 0.0001 + 0.01 × 0.9999
= 0.010098.


2.9.2 Bayes’ theorem


So far we have considered how to calculate P (A) for an event A which can happen in
different ways, ‘via’ different events B1 , B2 , . . . , BK .
Now we reverse the question: Suppose we know that A has happened, as shown in
Figure 2.12.

Figure 2.12: Paths to A indicating that A has occurred.

What is the probability that we got there via, say, B1 ? In other words, what is the
conditional probability P (B1 | A)? This situation is depicted in Figure 2.13.

Figure 2.13: A being achieved via B1 .

So we need:
P (Bj | A) = P (A ∩ Bj ) / P (A)
and we already know how to get this:

P (A ∩ Bj ) = P (A | Bj ) P (Bj ) from the chain rule.

P (A) = Σ_{i=1}^{K} P (A | Bi ) P (Bi ) from the total probability formula.

Bayes’ theorem

Using the chain rule and the total probability formula, we have:

P (Bj | A) = P (A | Bj ) P (Bj ) / Σ_{i=1}^{K} P (A | Bi ) P (Bi )

which holds for each Bj , j = 1, . . . , K. This is known as Bayes’ theorem.


Example 2.25 Continuing with Example 2.24, let B denote the presence of the
disease, B c denote no disease, and A denote a positive test result.
We want to calculate P (B | A), i.e. the probability that a person has the disease,
given that the person has received a positive test result.
The probabilities we need are:

P (B) = 0.0001,   P (B c ) = 0.9999,   P (A | B) = 0.99   and   P (A | B c ) = 0.01.

Then:
P (B | A) = P (A | B) P (B) / [P (A | B) P (B) + P (A | B c ) P (B c )]
          = (0.99 × 0.0001) / 0.010098 ≈ 0.0098.

Why is this so small? The reason is because most people do not have the disease and
the test has a small, but non-zero, false positive rate P (A | B c ). Therefore most
positive test results are actually false positives.
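A tiny Python sketch (purely illustrative) reproduces both the total probability and the posterior probability:

    prior = 0.0001       # P(B): prevalence of the disease
    sens = 0.99          # P(A | B): sensitivity
    fpr = 0.01           # P(A | not B): false positive rate

    p_positive = sens * prior + fpr * (1 - prior)    # total probability formula
    posterior = sens * prior / p_positive            # Bayes' theorem
    print(round(p_positive, 6), round(posterior, 4)) # 0.010098 0.0098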

Example 2.26 You are waiting for your bag at the baggage return carousel of an
airport. Suppose that you know that there are 200 bags to come from your flight,
and you are counting the distinct bags that come out. Suppose that x bags have
arrived, and your bag is not among them. What is the probability that your bag will
not arrive at all, i.e. that it has been lost (or at least delayed)?
Define A = ‘your bag has been lost’ and x = ‘your bag is not among the first x bags
to arrive’. What we want to know is the conditional probability P (A | x) for any
x = 0, 1, 2, . . . , 200. The conditional probabilities the other way round are:

P (x | A) = 1 for all x. If the bag has been lost, it will not arrive!

P (x | Ac ) = (200 − x)/200 if we assume that bags come out in a completely


random order.
Using Bayes’ theorem, we get:

P (A | x) = P (x | A) P (A) / [P (x | A) P (A) + P (x | Ac ) P (Ac )]
          = P (A) / [P (A) + ((200 − x)/200) (1 − P (A))].

Obviously, P (A | 200) = 1. If the bag has not arrived when all 200 have come out, it
has been lost!
For other values of x we need P (A). This is the general probability that a bag gets
lost, before you start observing the arrival of the bags from your particular flight.
This kind of probability is known as the prior probability of an event A.
Let us assign values to P (A) based on some empirical data. Statistics by the
Association of European Airlines (AEA) show how many bags were ‘mishandled’ per
1,000 passengers the airlines carried. This is not exactly what we need (since not all


passengers carry bags, and some have several), but we will use it anyway. In
particular, we will compare the results for the best and the worst of the AEA in 2006:
Air Malta: P (A) = 0.0044.

British Airways: P (A) = 0.023.

Figure 2.14 shows a plot of P (A | x) as a function of x for these two airlines.


The probabilities are fairly small even for large values of x.

For Air Malta, P (A | 199) = 0.469. So even when only 1 bag remains to arrive,
the probability is less than 0.5 that your bag has been lost.

For British Airways, P (A | 199) = 0.825. Also, we see that P (A | 197) = 0.541 is
the first probability over 0.5.

This is because the baseline probability of lost bags, P (A), is low.


So, the moral of the story is that even when nearly everyone else has collected their
bags and left, do not despair!
Figure 2.14: Plot of P (A | x) as a function of x (the number of bags arrived) for the two
airlines in Example 2.26, Air Malta and British Airways (BA).
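The probabilities quoted above can be reproduced directly from the formula; the following Python sketch is purely illustrative:

    def p_lost_given_x(prior, x, n_bags=200):
        # P(bag lost | the first x bags have arrived without it), from Bayes' theorem
        return prior / (prior + ((n_bags - x) / n_bags) * (1 - prior))

    for prior in (0.0044, 0.023):                    # Air Malta, British Airways
        print(prior, round(p_lost_given_x(prior, 199), 3))
    # 0.0044 0.469
    # 0.023 0.825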

2.10 Overview of chapter


This chapter introduced some formal terminology related to probability. The axioms of
probability were introduced, from which various other probability results can be
derived. There followed a brief discussion of counting rules (using permutations and
combinations). The important concepts of independence and conditional probability
were discussed, and Bayes’ theorem was derived.


2.11 Key terms and concepts


Axiom                        Combination
Bayes' theorem               Complement
Conditional probability      Disjoint
Elementary outcome           Empty set
Experiment                   Event
Independence                 Intersection
Mutually exclusive           Outcome
Partition                    Permutation
Probability                  Random experiment
Sample space                 Set
Union                        Venn diagram

2.12 Learning activities


1. Why is S = {1, 1, 2} not a sensible way to try to define a sample space?

2. Write out all the events for the sample space S = {a, b, c}. (There are eight of
them.)

3. For an event A, work out a simpler way to express the events A ∩ S, A ∪ S, A ∩ ∅


and A ∪ ∅.

4. If all elementary outcomes are equally likely, S = {a, b, c, d}, A = {a, b, c} and
B = {c, d}, find P (A | B) and P (B | A).

5. Suppose that we toss a fair coin twice. The sample space is therefore
S = {HH, HT, T H, T T }, where the elementary outcomes are defined in the
obvious way – for instance HT is heads on the first toss and tails on the second
toss. Show that if all four elementary outcomes are equally likely, then the events
‘heads on the first toss’ and ‘heads on the second toss’ are independent.

6. Show that if A and B are disjoint events, and are also independent, then P (A) = 0
or P (B) = 0. (Notice that independence and disjointness are not similar ideas.)

7. Write down the condition for the three events A, B and C to be independent.

8. Prove Bayes’ theorem from first principles.

9. A statistics teacher knows from past experience that a student who does homework
consistently has a probability of 0.95 of passing the examination, whereas a student
who does not do homework at all has a probability of 0.30 of passing the
examination.
(a) If 25% of students do their homework consistently, what percentage can expect
to pass the examination?


(b) If a student chosen at random from the group gets a pass in the examination,
what is the probability that the student had done homework consistently?
10. Plagiarism is a serious problem for assessors of coursework. One check on
plagiarism is to compare the coursework with a standard textbook. If the
coursework has plagiarised that textbook, then there will be a 95% chance of
finding exactly two phrases which are the same in both the coursework and
textbook, and a 5% chance of finding three or more phrases which are the same. If
the work is not plagiarised, then these probabilities are both 50%.
Suppose that 5% of coursework is plagiarised. An assessor chooses some coursework
at random.
(a) What is the probability that it has been plagiarised if it has exactly two
phrases in the textbook? (Try making a guess before doing the calculation.)
(b) Repeat (a) for three phrases in the textbook?
Did you manage to get a roughly correct guess of these results before calculating?

11. A box contains three red balls and two green balls. Two balls are taken from it
without replacement.
(a) What is the probability that none of the balls taken is red?
(b) Repeat (a) for 1 ball and 2 balls.
(c) Show that the probability that the first ball taken is red is the same as the
probability that the second ball taken is red.

12. Amy, Bob and Claire throw a fair die in that order until a six appears. The person
who throws the first six wins. What are their respective chances of winning?

13. In men’s singles tennis, matches are played on the best-of-five-sets principle.
Therefore the first player to win three sets wins the match, and a match may
consist of three, four or five sets. Assuming that two players are perfectly evenly
matched, and that sets are independent events, calculate the probabilities that a
match lasts three sets, four sets and five sets, respectively.

Solutions to these questions can be found on the VLE in the ST104b Statistics 2 area
at http://my.londoninternational.ac.uk

2.13 Reminder of learning outcomes


After completing this chapter, and having completed the Essential reading and
activities, you should be able to:

explain the fundamental ideas of random experiments, sample spaces and events

list the axioms of probability and be able to derive all the common probability
rules from them


list the formulae for the number of combinations and permutations of k objects out
of n, and be able to routinely use such results in problems
explain conditional probability and the concept of independent events
prove the law of total probability and apply it to problems where there is a
partition of the sample space
prove Bayes’ theorem and apply it to find conditional probabilities.

2.14 Sample examination questions


1. (a) A, B and C are any three events in the sample space S. Prove that:
P (A∪B∪C) = P (A)+P (B)+P (C)−P (A∩B)−P (B∩C)−P (A∩C)+P (A∩B∩C).

(b) A and B are events in a sample space S. Show that:


P (A ∩ B) ≤ [P (A) + P (B)] / 2 ≤ P (A ∪ B).

2. Suppose A and B are events with P (A) = p, P (B) = 2p and P (A ∪ B) = 0.75.


(a) Evaluate p and P (A | B) if A and B are independent events.
(b) Evaluate p and P (A | B) if A and B are mutually exclusive events.

3. (a) Show that if A and B are independent events in a sample space, then Ac and
B c are also independent.
(b) Show that if X and Y are mutually exclusive events in a sample space, then
X c and Y c are not in general mutually exclusive.

4. In a game of tennis, each point is won by one of the two players A and B. The
usual rules of scoring for tennis apply. That is, the winner of the game is the player
who first scores four points, unless each player has won three points, when deuce is
called and play proceeds until one player is two points ahead of the other and
hence wins the game.
A is serving and has a probability of winning any point of 2/3. The result of each
point is assumed to be independent of every other point.
(a) Show that the probability of A winning the game without deuce being called is
496/729.
(b) Find the probability of deuce being called.
(c) If deuce is called, show that A’s subsequent probability of winning the game is
4/5.
(d) Hence determine A’s overall probability of winning the game.

Solutions to these questions can be found on the VLE in the ST104b Statistics 2 area
at http://my.londoninternational.ac.uk

Chapter 3
Random variables
3.1 Synopsis of chapter content
This chapter introduces the concept of random variables and probability distributions.
These distributions are univariate, which means that they are used to model a single
numerical quantity. The concepts of expected value and variance are also discussed.

3.2 Aims of the chapter


The aims of this chapter are to:

be familiar with the concept of random variables


be able to explain what a probability distribution is
be able to determine the expected value and variance of a random variable.

3.3 Learning outcomes


After completing this chapter, and having completed the Essential reading and
activities, you should be able to:

define a random variable and distinguish it from the values that it takes
explain the difference between discrete and continuous random variables
find the mean and the variance of simple random variables whether discrete or
continuous
demonstrate how to proceed and use simple properties of expected values and
variances.

3.4 Essential reading


Newbold, P., W.L. Carlson and B.M. Thorne Statistics for Business and
Economics. (London: Prentice-Hall, 2012) eighth edition [ISBN 9780273767060]
Chapters 4 and 5.

In addition there is essential ‘watching’ of this chapter’s accompanying video tutorials


accessible via the ST104b Statistics 2 area at http://my.londoninternational.ac.uk


3.5 Introduction
In ST104a Statistics 1, we considered descriptive statistics for a sample of
observations of a variable X. Here we will represent the observations as a sequence of
variables, denoted as:
X1 , X2 , . . . , Xn
where n is the sample size.
In statistical inference, the observations will be treated as a sample drawn at random
from a population. We will then think of each observation Xi of a variable X as an
outcome of an experiment.

The experiment is ‘select a unit at random from the population and record its
value of X’.

The outcome is the observed value Xi of X.

Because variables X in statistical data are recorded as numbers, we can now focus on
experiments where the outcomes are also numbers – random variables.

Random variable

A random variable is an experiment for which the outcomes are numbers.1 This
means that for a random variable:

The sample space, S, is the set of real numbers R, or a subset of R.

The outcomes are numbers in this sample space. Instead of ‘outcomes’, we often
call them the values of the random variable.

Events are sets of numbers (values) in this sample space.

Discrete and continuous random variables

There are two main types of random variables, depending on the nature of S, i.e. the
possible values of the random variable.

A random variable is continuous if S is all of R or some interval(s) of it, for


example [0, 1] or [0, ∞).

A random variable is discrete if it is not continuous.2 More precisely, a discrete


random variable takes a finite or countably infinite number of values.

1. This definition is a bit informal, but it is sufficient for this course.
2. Strictly speaking, a discrete random variable is not just a random variable which is not continuous,
as there are many others, such as mixture distributions.


Notation

A random variable is typically denoted by an upper-case letter, for example X (or Y ,


W , etc.). A specific value of a random variable is often denoted by a lower-case letter,
for example, x.
Probabilities of values of a random variable are written like this:
P (X = x) denotes the probability that (the value of) X is x.

P (X > 0) denotes the probability that X is positive.

P (a < X < b) denotes the probability that X is between the numbers a and b.

Random variables versus samples

You will notice that many of the quantities we define for random variables are
analogous to sample quantities defined in ST104a Statistics 1.

Random variable Sample


Probability distribution Sample distribution
Mean (expected value) Sample mean (average)
Variance and standard deviation Sample variance and standard deviation
Median Sample median

This is no accident. In statistics, the population is represented as following a probability


distribution, and quantities for an observed sample are then used as estimators of the
analogous quantities for the population.

3.6 Discrete random variables

Example 3.1 The following two examples will be used throughout this chapter.

1. Number of people living in a randomly selected household in England.


• For simplicity, we use the value 8 to represent ‘8 or more’ (because 9 and
above are not reported separately in official statistics).
• This is a discrete random variable, with possible values of 1, 2, 3, 4, 5, 6, 7
and 8.

2. A person throws a basketball repeatedly from the free-throw line, trying to


make a basket. Consider the following random variable:
Number of unsuccessful throws before the first successful throw.
• The possible values of this are 0, 1, 2, . . . .


Probability distribution of a discrete random variable

The probability distribution (or just distribution) of a discrete random variable X


is specified by:

its possible values x (i.e. its sample space S)


the probabilities of the possible values, i.e. P (X = x) for all x ∈ S.

So we first need to develop a convenient way of specifying the probabilities.

Example 3.2 Consider the following probability distribution for the household
size, X.

Number of people
in household (x) P (X = x)
1 0.3002
2 0.3417
3 0.1551
4 0.1336
5 0.0494
6 0.0145
7 0.0034
8 0.0021

Probability function

The probability function (pf) of a discrete random variable X, denoted by p(x),


is a real-valued function such that for any number x the function is:

p(x) = P (X = x).

We can talk of p(x) both as the pf of the random variable X, and as the pf of the
probability distribution of X. Both mean the same thing.

Alternative terminology: the pf of a discrete random variable is also often called the
probability mass function (pmf).

Alternative notation: instead of p(x), the pf is also often denoted by, for example, pX (x)
– especially when it is necessary to indicate clearly to which random variable the
function corresponds.


Necessary conditions for a probability function

To be a pf of a discrete random variable X with sample space S, a function p(x)


must satisfy the following conditions:

1. p(x) ≥ 0 for all real numbers x.

2. Σ_{xi ∈S} p(xi ) = 1, i.e. the sum of the probabilities of all possible values of X is 1.

The pf is defined for all real numbers x, but p(x) = 0 for any x ∉ S, i.e. for any
value x that is not one of the possible values of X.

Example 3.3 Continuing Example 3.2, here we can simply list all the values:

p(x) = 0.3002 if x = 1
       0.3417 if x = 2
       0.1551 if x = 3
       0.1336 if x = 4
       0.0494 if x = 5
       0.0145 if x = 6
       0.0034 if x = 7
       0.0021 if x = 8
       0      otherwise.
These are clearly all non-negative, and their sum is Σ_{x=1}^{8} p(x) = 1.
A graphical representation of the pf is shown in Figure 3.1.

For the next example, we need to remember the following results from mathematics,
concerning sums of geometric series. If r ≠ 1, then:

Σ_{x=0}^{n−1} a r^x = a(1 − r^n) / (1 − r)

and if |r| < 1, then:

Σ_{x=0}^{∞} a r^x = a / (1 − r).

Example 3.4 In the basketball example, the number of possible values is infinite,
so we cannot simply list the values of the pf. So we try to express it as a formula.
Suppose that:

the probability of a successful throw is π at each throw, and therefore the


probability of an unsuccessful throw is 1 − π


Figure 3.1: Probability function p(x) for Example 3.3, plotted against x (number of people in the household).

outcomes of different throws are independent.

Then the probability that the first success occurs after x failures is the probability of
a sequence of x failures followed by a success, i.e. the probability is:

(1 − π)^x π.

So the pf of the random variable X (the number of failures before the first success) is:

p(x) = (1 − π)^x π   for x = 0, 1, 2, . . . , and p(x) = 0 otherwise   (3.1)

where 0 ≤ π ≤ 1. Let us check that (3.1) satisfies the conditions for a pf.

Clearly, p(x) ≥ 0 for all x, since π ≥ 0 and 1 − π ≥ 0.

Using the sum to infinity of a geometric series, we get:

Σ_{x=0}^{∞} p(x) = Σ_{x=0}^{∞} (1 − π)^x π = π Σ_{x=0}^{∞} (1 − π)^x = π · 1/(1 − (1 − π)) = π/π = 1.

The expression of the pf involves a parameter π (the probability of a successful


throw), a number for which we can choose different values. This defines a whole
‘family’ of individual distributions, one for each choice of π. For example, Figure 3.2
shows values of p(x) for two values of π reflecting good and poor free-throw shooters.


Figure 3.2: Probability function for Example 3.4, plotted against x (number of failures). π = 0.7: a fairly good free-throw shooter. π = 0.3: a pretty poor free-throw shooter.
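A short, purely illustrative Python check of (3.1):

    def p(x, pi):
        # pf of the number of failures before the first success
        return (1 - pi) ** x * pi if x >= 0 else 0.0

    total = sum(p(x, 0.3) for x in range(1000))            # truncated sum, effectively 1
    print(round(total, 6), p(0, 0.7), round(p(3, 0.3), 4)) # 1.0 0.7 0.1029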

The cumulative distribution function (cdf)

Another way to specify a probability distribution is to give its cumulative
distribution function (cdf ), or simply distribution function.

Cumulative distribution function (cdf)

The cdf is denoted F (x) (or FX (x)) and defined as:

F (x) = P (X ≤ x) for all real numbers x.

For a discrete random variable it is given by:

F (x) = Σ_{xi ∈S, xi ≤x} p(xi )

i.e. the sum of the probabilities of those possible values of X that are less than or
equal to x.

Example 3.5 Continuing with the household size example, values of F (x) at all
possible values of X are:

Number of people
in household (x) p(x) F (x)
1 0.3002 0.3002
2 0.3417 0.6419
3 0.1551 0.7970
4 0.1336 0.9306
5 0.0494 0.9800
6 0.0145 0.9945
7 0.0034 0.9979
8 0.0021 1.0000


These are shown in graphical form in Figure 3.3.

Figure 3.3: Cumulative distribution function F (x) for Example 3.5, plotted against x (number of people in the household).

Example 3.6 In the basketball example, p(x) = (1 − π)x π for x = 0, 1, 2, . . . . We


can calculate a simple formula for the cdf, using the sum of a geometric series. Since,
for any non-negative integer y, we obtain:
Σ_{x=0}^{y} p(x) = Σ_{x=0}^{y} (1 − π)^x π = π Σ_{x=0}^{y} (1 − π)^x = π · [1 − (1 − π)^{y+1}] / [1 − (1 − π)] = 1 − (1 − π)^{y+1}

we can write:

F (x) = 0 when x < 0, and F (x) = 1 − (1 − π)^{x+1} when x = 0, 1, 2, . . . .

The cdf is shown in graphical form in Figure 3.4.

Properties of the cdf for discrete distributions

The cdf F (x) of a discrete random variable X is a step function such that:

F (x) remains constant in all intervals between possible values of X.

At a possible value xi of X, F (x) jumps up by the amount p(xi ) = P (X = xi ).

At such an xi , the value of F (xi ) is the value at the top of the jump (i.e. F (x) is
right-continuous).


Figure 3.4: Cumulative distribution function F (x) for Example 3.6, plotted against x (number of failures), for π = 0.7 and π = 0.3.

General properties of the cdf

These hold for both discrete and continuous random variables:

1. 0 ≤ F (x) ≤ 1 for all x (since F (x) is a probability).

2. F (x) → 0 as x → −∞, and F (x) → 1 as x → ∞.

3. F (x) is a non-decreasing function, i.e. if x1 < x2 , then F (x1 ) ≤ F (x2 ).

4. For any x1 < x2 , P (x1 < X ≤ x2 ) = F (x2 ) − F (x1 ).

Either the pf or the cdf can be used to calculate the probabilities of any events for a
discrete random variable.

Example 3.7 Continuing with the household size example (for the probabilities,
see Example 3.5), then:

P (X = 1) = p(1) = F (1) = 0.3002.

P (X = 2) = p(2) = F (2) − F (1) = 0.3417.

P (X ≤ 2) = p(1) + p(2) = F (2) = 0.6419.

P (X = 3 or 4) = p(3) + p(4) = F (4) − F (2) = 0.2887.

P (X > 5) = p(6) + p(7) + p(8) = 1 − F (5) = 0.0200.

P (X ≥ 5) = p(5) + p(6) + p(7) + p(8) = 1 − F (4) = 0.0694.
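These probabilities follow directly from cumulative sums of the pf; an illustrative Python sketch:

    p = {1: 0.3002, 2: 0.3417, 3: 0.1551, 4: 0.1336,
         5: 0.0494, 6: 0.0145, 7: 0.0034, 8: 0.0021}

    def F(x):
        # cdf of the household-size distribution
        return sum(prob for value, prob in p.items() if value <= x)

    print(round(F(2), 4), round(F(4) - F(2), 4), round(1 - F(5), 4))
    # 0.6419 0.2887 0.02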


Properties of a discrete random variable

Let X be a discrete random variable with sample space S and pf p(x).

Expected value of a discrete random variable

The expected value (or mean) of X is denoted E(X), and defined as:

E(X) = Σ_{xi ∈S} xi p(xi ).

This can also be written more concisely as E(X) = Σ_x x p(x) or E(X) = Σ x p(x).

We can talk of E(X) as the expected value of both the random variable X, and of the
probability distribution of X.
Alternative notation: Instead of E(X), the symbol µ (the lower-case Greek letter ‘mu’),
or µX , is often used.

Expected value versus sample mean

The mean (expected value) E(X) of a probability distribution is analogous to the


sample mean (average) X̄ of a sample distribution.
This is easiest to see when the sample space is finite. Suppose the random variable X
can have K different values X1 , . . . , XK , and their frequencies in a sample are
f1 , . . . , fK , respectively. Then the sample mean of X is:

X̄ = (f1 x1 + · · · + fK xK ) / (f1 + · · · + fK ) = x1 p̂(x1 ) + · · · + xK p̂(xK ) = Σ_{i=1}^{K} xi p̂(xi )

where:

p̂(xi ) = fi / (f1 + · · · + fK )

are the sample proportions of the values xi .

The expected value of the random variable X is:

E(X) = x1 p(x1 ) + · · · + xK p(xK ) = Σ_{i=1}^{K} xi p(xi ).

So X̄ uses the sample proportions p̂(xi ), whereas E(X) uses the population probabilities
p(xi ).


Example 3.8 Continuing with the household size example:

Number of people
in household (x) p(x) x p(x)
1 0.3002 0.3002
2 0.3417 0.6834
3 0.1551 0.4653
4 0.1336 0.5344
5 0.0494 0.2470
6 0.0145 0.0870
7 0.0034 0.0238
8 0.0021 0.0168
Sum 2.3579
= E(X)

The expected number of people in a randomly selected household is 2.36.

Example 3.9 For the basketball example, p(x) = (1 − π)x π for x = 0, 1, 2, . . . , and
0 otherwise.
The expected value of X is then:
E(X) = Σ_{xi ∈S} xi p(xi ) = Σ_{x=0}^{∞} x (1 − π)^x π

(starting from x = 1)   = Σ_{x=1}^{∞} x (1 − π)^x π

                        = (1 − π) Σ_{x=1}^{∞} x (1 − π)^{x−1} π

(using y = x − 1)       = (1 − π) Σ_{y=0}^{∞} (y + 1) (1 − π)^y π

                        = (1 − π) [ Σ_{y=0}^{∞} y (1 − π)^y π + Σ_{y=0}^{∞} (1 − π)^y π ]

                        = (1 − π) [E(X) + 1]   (the first sum in the brackets is E(X), the second is 1)

                        = (1 − π) E(X) + (1 − π)

from which we can solve:

E(X) = (1 − π) / (1 − (1 − π)) = (1 − π) / π.


So, for example:

E(X) = 0.3/0.7 = 0.42 when π = 0.7.

E(X) = 0.7/0.3 = 2.33 when π = 0.3.

So, before scoring a basket, a good free-throw shooter (with π = 0.7) misses on
average about 0.42 shots, and a poor shooter (with π = 0.3) misses on average about
2.33 shots.
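A truncated-sum check of this formula (purely illustrative):

    def expected_failures(pi, n_terms=10_000):
        # E(X) approximated by a truncated sum of x * p(x)
        return sum(x * (1 - pi) ** x * pi for x in range(n_terms))

    print(round(expected_failures(0.7), 3), round(expected_failures(0.3), 3))
    # 0.429 2.333, matching (1 - pi) / pi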

Expected values of functions of a random variable

Let g(X) be a function (‘transformation’) of a discrete random variable X. This is


also a random variable, and its expected value is:

E(g(X)) = Σ_x g(x) pX (x)

where pX (x) = p(x) is the probability function of X.

Example 3.10 The expected value of the square of X is:

E(X^2 ) = Σ_x x^2 p(x).

In general:
E[g(X)] ≠ g[E(X)]

when g(X) is a nonlinear function of X.

Example 3.11 Note that:

E(X^2 ) ≠ (E(X))^2   and   E(1/X) ≠ 1/E(X).

Expected values of linear transformations

Suppose X is a random variable and a and b are constants, i.e. known numbers
that are not random variables. Then:

E(aX + b) = aE(X) + b.


Proof:

E(aX + b) = Σ_x (ax + b) p(x)
          = Σ_x ax p(x) + Σ_x b p(x)
          = a Σ_x x p(x) + b Σ_x p(x)
          = a E(X) + b

where the last step follows from:

i. Σ_x x p(x) = E(X), by definition of E(X).

ii. Σ_x p(x) = 1, by definition of the probability function.

A special case of the result:


E(aX + b) = aE(X) + b

is obtained when a = 0, which gives:

E(b) = b.

That is, the expected value of a constant is the constant itself.

Variance and standard deviation of a discrete random variable

The variance of a discrete random variable X is defined as:

Var(X) = E[(X − E(X))^2 ] = Σ_x (x − E(X))^2 p(x).

The standard deviation of X is sd(X) = √Var(X).

Both Var(X) and sd(X) are always ≥ 0. Both are measures of the dispersion (variation)
of the distribution of X.

Alternative notation: The variance is often denoted σ 2 (‘sigma-squared’) and standard


deviation by σ (‘sigma’).

An alternative formula: The variance can also be calculated as:

Var(X) = E(X 2 ) − (E(X))2 .

This will be proved later.


Example 3.12 Continuing with the household size example:

x    p(x)     x p(x)    (x − E(X))^2    (x − E(X))^2 p(x)    x^2    x^2 p(x)
1    0.3002   0.3002    1.844           0.554                1      0.300
2    0.3417   0.6834    0.128           0.044                4      1.367
3    0.1551   0.4653    0.412           0.064                9      1.396
4    0.1336   0.5344    2.696           0.360                16     2.138
5    0.0494   0.2470    6.981           0.345                25     1.235
6    0.0145   0.0870    13.265          0.192                36     0.522
7    0.0034   0.0238    21.549          0.073                49     0.167
8    0.0021   0.0168    31.833          0.067                64     0.134
Sum           2.3579                    1.699                       7.259
              = E(X)                    = Var(X)                    = E(X^2 )

Var(X) = E[(X − E(X))^2 ] = 1.699 = 7.259 − (2.358)^2 = E(X^2 ) − (E(X))^2 , and
sd(X) = √Var(X) = √1.699 = 1.30.
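An illustrative Python check of this table:

    p = {1: 0.3002, 2: 0.3417, 3: 0.1551, 4: 0.1336,
         5: 0.0494, 6: 0.0145, 7: 0.0034, 8: 0.0021}

    mean = sum(x * prob for x, prob in p.items())
    var = sum((x - mean) ** 2 * prob for x, prob in p.items())
    print(round(mean, 4), round(var, 3), round(var ** 0.5, 2))   # 2.3579 1.699 1.3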

Example 3.13 For the basketball example, p(x) = (1 − π)x π for x = 0, 1, 2, . . . ,


and 0 otherwise. It can be shown (although the proof is beyond the scope of the
course) that for this distribution:
Var(X) = (1 − π) / π^2 .
In the two cases we have used as examples:

Var(X) = 0.3/(0.7)^2 = 0.61 and sd(X) = 0.78 when π = 0.7.

Var(X) = 0.7/(0.3)^2 = 7.78 and sd(X) = 2.79 when π = 0.3.

So the variation in how many free throws a poor shooter misses before the first
success is much higher than the variation for a good shooter.

Variances of linear transformations

If X is a random variable and a and b are constants, then:

Var(aX + b) = a^2 Var(X).

Proof:

Var(aX + b) = E[((aX + b) − E(aX + b))^2 ]
            = E[(aX + b − a E(X) − b)^2 ]
            = E[(aX − a E(X))^2 ]
            = E[a^2 (X − E(X))^2 ]
            = a^2 E[(X − E(X))^2 ]
            = a^2 Var(X).


Therefore, sd(aX + b) = |a| sd(X). 

If a = 0, this gives:
Var(b) = 0.
That is, the variance of a constant is 0. The converse also holds – if a random variable
has a variance of 0, it is actually a constant.

Example 3.14 For further practice, let us consider a discrete random variable X
which has possible values 0, 1, 2 . . . , n, where n is a known positive integer, and X
has the following probability function:

p(x) = C(n, x) π^x (1 − π)^{n−x}   for x = 0, 1, 2, . . . , n, and p(x) = 0 otherwise,

where C(n, x) = n!/[x!(n − x)!] denotes the binomial coefficient, and π is a probability
parameter which can have values 0 ≤ π ≤ 1.


A random variable like this follows the binomial distribution. We will discuss its
motivation and uses later in the next chapter.
Here, we consider the following tasks for this distribution:

Show that p(x) satisfies the conditions for a probability function.


Calculate probabilities from p(x).
Write down the cumulative distribution function.
Derive the expected value, E(X).

To show that p(x) is a probability function, we need to show the following:

1. p(x) ≥ 0 for all x. This is clearly true, since C(n, x) ≥ 0, π ≥ 0 and 1 − π ≥ 0.

2. Σ_{x=0}^{n} p(x) = 1. This is easiest to show by using the binomial theorem, which states
that, for any integer n ≥ 0 and any real numbers y and z:

(y + z)^n = Σ_{x=0}^{n} C(n, x) y^x z^{n−x} .   (3.2)

If we choose y = π and z = 1 − π in (3.2), we get:

1 = 1^n = [π + (1 − π)]^n = Σ_{x=0}^{n} C(n, x) π^x (1 − π)^{n−x} = Σ_{x=0}^{n} p(x).
x x=0

This does not simplify into a simple formula, so we just calculate the values
from the definition, by summation.
At the values x = 0, 1, . . . , n, the value of the cdf is:
x  
X n y
F (x) = P (X ≤ x) = π (1 − π)n−y .
y=0
y


Since X is a discrete random variable, F (x) is a step function. For E(X), we have:

E(X) = Σ_{x=0}^{n} x C(n, x) π^x (1 − π)^{n−x}

      = Σ_{x=1}^{n} x C(n, x) π^x (1 − π)^{n−x}

      = π Σ_{x=1}^{n} [n (n − 1)! / ((x − 1)! [(n − 1) − (x − 1)]!)] π^{x−1} (1 − π)^{n−x}

      = nπ Σ_{x=1}^{n} C(n − 1, x − 1) π^{x−1} (1 − π)^{n−x}

      = nπ Σ_{y=0}^{n−1} C(n − 1, y) π^y (1 − π)^{(n−1)−y}

      = nπ · 1

      = nπ

where y = x − 1, and the last summation is over all the values of the pf of another
binomial distribution, this time with possible values 0, 1, . . . , n − 1 and probability
parameter π.
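A small numerical check of these results (purely illustrative):

    from math import comb

    def binom_pf(x, n, pi):
        # pf of the binomial distribution with parameters n and pi
        return comb(n, x) * pi ** x * (1 - pi) ** (n - x)

    n, pi = 10, 0.3
    total = sum(binom_pf(x, n, pi) for x in range(n + 1))
    mean = sum(x * binom_pf(x, n, pi) for x in range(n + 1))
    print(round(total, 10), round(mean, 10))   # 1.0 3.0, i.e. n * pi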

3.7 Continuous random variables


A random variable (and its probability distribution) is continuous if it can have an
uncountably infinite number of possible values.3

In other words, the set of possible values (sample space) is the real numbers R, or
one or more intervals in R.

Example 3.15 An example of a continuous random variable, used here as an


approximating model, is the size of claim made on an insurance policy (i.e. a claim
by the customer to the insurance company), in £000s.

Suppose the policy has a deductible of £999, so all claims are at least £1,000.

The possible values of this random variable are therefore {x | x ≥ 1}.

Most of the concepts introduced for discrete random variables have exact or
approximate analogies for continuous random variables, and many results are the same
3. Strictly speaking, having an uncountably infinite number of possible values does not necessarily
imply that it is a continuous random variable. For example, the Cantor distribution (not covered in
ST104b Statistics 2) is neither a discrete nor an absolutely continuous probability distribution, nor is
it a mixture of these. However, we will not consider this matter any further in this course.


for both types. But there are some differences in the details. The most obvious
difference is that wherever in the discrete case there are sums over the possible values of
the random variable, in the continuous case these are integrals.

Probability density function (pdf)

For a continuous random variable X, the probability function is replaced by the


probability density function (pdf), denoted as f (x) [or fX (x)].

Example 3.16 Continuing the insurance example in Example 3.15, we consider a


pdf of the following form:

f (x) = 0 when x < k, and f (x) = α k^α / x^{α+1} when x ≥ k

where α > 0 is a parameter, and k > 0 (the smallest possible value of X) is a known
number. In our example, k = 1 (due to the deductible). A probability distribution
with this pdf is known as the Pareto distribution. A graph of this pdf when
α = 2.2 is shown in Figure 3.5.
Figure 3.5: Probability density function f (x) for Example 3.16.

Unlike for probability functions of discrete random variables, in the continuous case
values of the probability density function are not probabilities of individual values, i.e.
f (x) ≠ P (X = x). In fact, for a continuous distribution:

P (X = x) = 0 for all x. (3.3)

That is, the probability that X has any particular value exactly is always 0.


Because of (3.3), with a continuous distribution we do not need to be very careful about
differences between < and ≤, and between > and ≥. Therefore, the following
probabilities are all equal:

P (a < X < b), P (a ≤ X ≤ b), P (a < X ≤ b) and P (a ≤ X < b).

Probabilities of intervals for continuous random variables

Integrals of the pdf give probabilities of intervals of values:


P (a < X ≤ b) = ∫_a^b f (x) dx

for any two numbers a < b.


In other words, the probability that the value of X is between a and b is the area
under f (x) between a and b. Here a can also be −∞, and/or b can be +∞.

Example 3.17 In Figure 3.6, the shaded area is P (1.5 < X ≤ 3) = ∫_{1.5}^{3} f (x) dx.
Figure 3.6: Probability density function showing P (1.5 < X ≤ 3).


Properties of pdfs

The pdf f (x) of any continuous random variable must satisfy the following conditions:

1. f (x) ≥ 0 for all x.

2. ∫_{−∞}^{∞} f (x) dx = 1.

These are analogous to the conditions for probability functions of discrete
distributions.

Example 3.18 Continuing with the insurance example, we check that the
conditions hold for the pdf:

f (x) = 0 when x < k, and f (x) = α k^α / x^{α+1} when x ≥ k

where α > 0 and k > 0.

1. Clearly, f (x) ≥ 0 for all x, since α > 0, k^α > 0 and x^{α+1} ≥ k^{α+1} > 0.

2. ∫_{−∞}^{∞} f (x) dx = ∫_k^{∞} (α k^α / x^{α+1}) dx = α k^α ∫_k^{∞} x^{−α−1} dx

   = α k^α · [x^{−α} / (−α)]_k^{∞}

   = (−k^α ) · (0 − k^{−α})

   = 1.

Cumulative distribution function

The cumulative distribution function (cdf) of a continuous random variable X


is defined exactly as for discrete random variables, i.e. the cdf is:

F (x) = P (X ≤ x) for all real numbers x.

The general properties of the cdf stated previously also hold for continuous
distributions. The cdf of a continuous distribution is not a step function, so results
on discrete-specific properties do not hold in the continuous case. A continuous cdf
is a smooth, continuous function of x.


Relationship between the cdf and pdf

The cdf is obtained from the pdf through integration:


P (X ≤ x) = F (x) = ∫_{−∞}^{x} f (t) dt   for all x.

3 The pdf is obtained from the cdf through differentiation:

f (x) = F ′(x).

Example 3.19 Continuing the insurance example:


∫_{−∞}^{x} f (t) dt = ∫_k^{x} (α k^α / t^{α+1}) dt

                    = (−k^α ) ∫_k^{x} (−α) t^{−α−1} dt

                    = (−k^α ) [t^{−α}]_k^{x}

                    = (−k^α )(x^{−α} − k^{−α})

                    = 1 − k^α x^{−α}

                    = 1 − (k/x)^α .

Therefore:

F (x) = 0 when x < k, and F (x) = 1 − (k/x)^α when x ≥ k.   (3.4)

If we were given (3.4), we could obtain the pdf by differentiation, since F ′(x) = 0
when x < k, and:

F ′(x) = −k^α (−α) x^{−α−1} = α k^α / x^{α+1}   when x ≥ k.
A plot of the cdf is shown in Figure 3.7.

Probabilities from cdfs and pdfs

Since P (X ≤ x) = F (x), it follows that P (X > x) = 1 − F (x). In general, for any


two numbers a < b, we have:
P (a < X ≤ b) = ∫_a^b f (x) dx = F (b) − F (a).


Figure 3.7: Cumulative distribution function F (x) for Example 3.19.

Example 3.20 Continuing with the insurance example (with k = 1 and α = 2.2),
then:

P (X ≤ 1.5) = F (1.5) = 1 − (1/1.5)^2.2 ≈ 0.59

P (X ≤ 3) = F (3) = 1 − (1/3)^2.2 ≈ 0.91

P (X > 3) = 1 − F (3) ≈ 1 − 0.91 = 0.09

P (1.5 ≤ X ≤ 3) = F (3) − F (1.5) ≈ 0.91 − 0.59 = 0.32.
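An illustrative Python check using the cdf in (3.4):

    def pareto_cdf(x, k=1.0, alpha=2.2):
        # cdf of the Pareto distribution, equation (3.4)
        return 0.0 if x < k else 1 - (k / x) ** alpha

    print(round(pareto_cdf(1.5), 2), round(pareto_cdf(3), 2),
          round(1 - pareto_cdf(3), 2), round(pareto_cdf(3) - pareto_cdf(1.5), 2))
    # 0.59 0.91 0.09 0.32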

Example 3.21 Consider now a continuous random variable with the following pdf:

f (x) = λ e^{−λx} for x > 0, and f (x) = 0 for x ≤ 0   (3.5)

where λ > 0 is a parameter. This is the pdf of the exponential distribution. The
uses of this distribution will be discussed in the next chapter.
Since:

∫_0^{x} λ e^{−λt} dt = [−e^{−λt}]_0^{x} = 1 − e^{−λx}

the cdf of the exponential distribution is:

F (x) = 0 for x ≤ 0, and F (x) = 1 − e^{−λx} for x > 0.

We now show that (3.5) satisfies the conditions for a pdf.

1. Since λ > 0 and e^a > 0 for any a, f (x) ≥ 0 for all x.


2. Since we have just done the integration to derive the cdf F (x), we can also use
it to show that f (x) integrates to one. This follows from:
∫_{−∞}^{∞} f (x) dx = P (−∞ < X < ∞) = lim_{x→∞} F (x) − lim_{x→−∞} F (x)

which here is lim_{x→∞} (1 − e^{−λx}) − 0 = (1 − 0) − 0 = 1.

Expected value and variance of a continuous distribution

Suppose X is a continuous random variable with pdf f (x). Definitions of its expected
value, the expected value of any transformation g(X), variance and standard
deviation are the same as for discrete distributions, except that summation is replaced
by integration:
E(X) = ∫_{−∞}^{∞} x f(x) dx

E[g(X)] = ∫_{−∞}^{∞} g(x) f(x) dx

Var(X) = E[(X − E(X))^2] = ∫_{−∞}^{∞} (x − E(X))^2 f(x) dx = E(X^2) − (E(X))^2

sd(X) = √Var(X).

Example 3.22 For the Pareto distribution, introduced in Example 3.16, we have:
E(X) = ∫_{−∞}^{∞} x f(x) dx = ∫_{k}^{∞} x f(x) dx

     = ∫_{k}^{∞} x · (α k^α / x^{α+1}) dx

     = ∫_{k}^{∞} (α k^α / x^α) dx

     = (α k / (α − 1)) ∫_{k}^{∞} ((α − 1) k^{α−1} / x^{(α−1)+1}) dx

     = α k / (α − 1)     (if α > 1).

Here the last step follows because the last integrand has the form of the Pareto pdf
with parameter α − 1, so its integral from k to ∞ is 1. This integral converges only if
α − 1 > 0, i.e. if α > 1.


Similarly:

E(X^2) = ∫_{k}^{∞} x^2 f(x) dx = ∫_{k}^{∞} x^2 · (α k^α / x^{α+1}) dx

       = ∫_{k}^{∞} (α k^α / x^{α−1}) dx

       = (α k^2 / (α − 2)) ∫_{k}^{∞} ((α − 2) k^{α−2} / x^{(α−2)+1}) dx

       = α k^2 / (α − 2)     (if α > 2)

and therefore:

Var(X) = E(X^2) − (E(X))^2 = α k^2 / (α − 2) − (α k / (α − 1))^2 = (k / (α − 1))^2 · (α / (α − 2)).

In our insurance example, where k = 1 and α = 2.2, we have:

E(X) = (2.2 × 1) / (2.2 − 1) ≈ 1.8   and   Var(X) = (1 / (2.2 − 1))^2 × (2.2 / (2.2 − 2)) ≈ 7.6.

Example 3.23 Consider the exponential distribution introduced in Example 3.21.


To find E(X) we can use integration by parts by considering x λ e^{−λx} as the product
of the functions f = x and g′ = λ e^{−λx} (so that g = −e^{−λx}). Then:

E(X) = ∫_{0}^{∞} x λ e^{−λx} dx = [−x e^{−λx}]_{0}^{∞} − ∫_{0}^{∞} −e^{−λx} dx

     = [−x e^{−λx}]_{0}^{∞} − [(1/λ) e^{−λx}]_{0}^{∞}

     = [0 − 0] − (1/λ)[0 − 1]

     = 1/λ.

To obtain E(X^2), we choose f = x^2 and g′ = λ e^{−λx}, and use integration by parts:

E(X^2) = ∫_{0}^{∞} x^2 λ e^{−λx} dx = [−x^2 e^{−λx}]_{0}^{∞} + 2 ∫_{0}^{∞} x e^{−λx} dx

       = 0 + (2/λ) ∫_{0}^{∞} x λ e^{−λx} dx

       = 2/λ^2

where the last step follows because the last integral is simply E(X) = 1/λ again.
Finally:

Var(X) = E(X^2) − (E(X))^2 = 2/λ^2 − 1/λ^2 = 1/λ^2.


Means and variances can be ‘infinite’

Expected values and variances are said to be infinite when the corresponding integral
does not exist (i.e. does not have a finite value).
For the Pareto distribution, the distribution is defined for all α > 0, but the mean is
infinite if α ≤ 1 and the variance is infinite if α ≤ 2. This happens because for small
values of α the distribution has very heavy tails, i.e. the probabilities of very large
values of X are non-negligible.
This is actually useful in some insurance applications, for example liability insurance
and medical insurance. There most claims are relatively small, but there is a
non-negligible probability of extremely large claims. The Pareto distribution with a
small α can be a reasonable representation of such situations. Figure 3.8 shows plots of
Pareto cdfs with α = 2.2 and α = 0.8. When α = 0.8, the distribution is so heavy-tailed
that E(X) is infinite.

Figure 3.8: Pareto distribution cdfs.

Median of a random variable

Recall from ST104a Statistics 1 that the sample median is essentially the observation
‘in the middle’ of a set of data, i.e. where half of the observations in the sample are
smaller than the median and half of the observations are larger.
The median of a random variable (i.e. of its probability distribution) is similar in spirit.


Median of a random variable

The median, m, of a continuous random variable X is the value which satisfies:

F (m) = 0.5. (3.6)

So once we know F (x), we can find the median by solving (3.6).



Example 3.24 For the Pareto distribution we have:

F(x) = 1 − (k/x)^α for x ≥ k.

So F(m) = 1 − (k/m)^α = 1/2 when:

(k/m)^α = 1/2  ⇔  k/m = 1/2^{1/α}  ⇔  m = k · 2^{1/α}.

For example:

When k = 1 and α = 2.2, the median is m = 2^{1/2.2} = 1.37.

When k = 1 and α = 0.8, the median is m = 2^{1/0.8} = 2.38.

Example 3.25 For the exponential distribution we have:

F(x) = 1 − e^{−λx} for x > 0.

So F(m) = 1 − e^{−λm} = 1/2 when:

e^{−λm} = 1/2  ⇔  −λm = −log 2  ⇔  m = (log 2)/λ.
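
Both medians are quick to evaluate directly. The sketch below, assuming Python is available, computes the two Pareto medians above and, for comparison, the exponential median for an illustrative rate of λ = 1.6.

import math

# Pareto medians for the two illustrative parameter sets above
for k, alpha in [(1.0, 2.2), (1.0, 0.8)]:
    print(k * 2**(1 / alpha))    # 1.37 and 2.38 respectively

# Exponential median for an illustrative rate lambda = 1.6
lam = 1.6
print(math.log(2) / lam)         # approximately 0.433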

3.8 Overview of chapter


This chapter has formally introduced random variables, making a distinction between
discrete and continuous random variables. Properties of probability distributions were
discussed, including the determination of expected values and variances.

3.9 Key terms and concepts


Constant                            Continuous
Cumulative distribution function    Discrete
Estimators                          Expected value
Experiment                          Interval
Median                              Outcome
Parameter                           Probability density function
Probability distribution            Probability function
Random variable                     Standard deviation
Variance

3.10 Learning activities


1. Suppose that the random variable X takes the values {x1 , x2 , . . .}, where
x1 < x2 < · · · . Prove the following results:
(a) ∑_{i=1}^{∞} p(x_i) = 1.

(b) p(x_k) = F(x_k) − F(x_{k−1}).

(c) F(x_k) = ∑_{i=1}^{k} p(x_i).

2. At a charity event, the organisers sell 100 tickets to a raffle. At the end of the
event, one of the tickets is selected at random and the person with that number
wins a prize. Carol buys ticket number 22. Janet buys tickets numbered 1–5. What
is the probability that each of them wins the prize?

3. A greengrocer has a very large pile of oranges on his stall. The pile of fruit is a
mixture of 50% old fruit with 50% new fruit; one cannot tell which are old fruit
and which are new fruit. However, 20% of old oranges are mouldy inside, but only
10% of new oranges are mouldy. Suppose that you choose 5 oranges at random.
What is the distribution of the number of mouldy oranges in your sample?

4. What is the expectation of the random variable X if the only possible value it can
take is c?

5. Show that E(X − E(X)) = 0.

6. Show that if Var(X) = 0 then p(µX ) = 1. (We say in this case that X is almost
surely equal to its mean.)

7. For a random variable X and constants a and b, prove that:

Var(aX + b) = a2 Var(X).

Solutions to these questions can be found on the VLE in the ST104b Statistics 2 area
at http://my.londoninternational.ac.uk


3.11 Reminder of learning outcomes


After completing this chapter, and having completed the Essential reading and
activities, you should be able to:

define a random variable and distinguish it from the values that it takes
explain the difference between discrete and continuous random variables

find the mean and the variance of simple random variables whether discrete or
continuous

demonstrate how to proceed and use simple properties of expected values and
variances.

3.12 Sample examination questions


1. In an investigation of animal behaviour, rats have to choose between four doors.
One of them, behind which is food, is ‘correct’. If an incorrect choice is made, the
rat is returned to the starting point and chooses again, continuing as long as
necessary until the correct choice is made. The random variable X is the serial
number of the trial on which the correct choice is made.
Find the probability function and expectation of X under each of the following
hypotheses:
(a) Each door is equally likely to be chosen on each trial, and all trials are
mutually independent.
(b) At each trial, the rat chooses with equal probability between the doors that it
has not so far tried.
(c) The rat never chooses the same door on two successive trials, but otherwise
chooses at random with equal probabilities.

2. Construct suitable examples to show that, for a random variable X, then:


(a) E(X^2) ≠ (E(X))^2, in general.
(b) E(1/X) ≠ 1/E(X), in general.

3. (a) Let X be a random variable. Show that:

Var(X) = E(X(X − 1)) − E(X)(E(X) − 1).

(b) Let X1 , X2 , . . . , Xn be independent random variables. Assume that all have a


mean of µ and a variance of σ 2 . Find expressions for the mean and variance of
the random variable (X1 + X2 + · · · + Xn )/n.

Solutions to these questions can be found on the VLE in the ST104b Statistics 2 area
at http://my.londoninternational.ac.uk

Chapter 4
Common distributions of random
variables

4.1 Synopsis of chapter content


This chapter formally introduces common ‘families’ of probability distributions which
can be used to model various real-world phenomena.

4.2 Aims of the chapter


The aims of this chapter are to:

be familiar with common probability distributions of both discrete and continuous


types

be familiar with the main properties of each common distribution introduced.

4.3 Learning outcomes


After completing this chapter, and having completed the Essential reading and
activities, you should be able to:

summarise basic distributions such as the uniform, Bernoulli, binomial, Poisson,


exponential and normal

calculate probabilities of events for these distributions using the probability


function, probability density function or cumulative distribution function

determine probabilities using statistical tables, where appropriate

state properties of these distributions such as the expected value and variance.

4.4 Essential reading


Newbold, P., W.L. Carlson and B.M. Thorne Statistics for Business and
Economics. (London: Prentice-Hall, 2012) eighth edition [ISBN 9780273767060]
Chapters 4 and 5.


In addition there is essential ‘watching’ of this chapter’s accompanying video tutorials


accessible via the ST104b Statistics 2 area at http://my.londoninternational.ac.uk

4.5 Introduction
In statistical inference we will treat observations:

X1 , X2 , . . . , Xn

(the sample) as values of a random variable X, which has some probability distribution
(population distribution).
How to choose that probability distribution?

Usually we do not try to invent distributions from scratch.

Instead, we use one of many existing standard distributions.

There is a large number of such distributions, such that for most purposes we can
find a suitable standard distribution.

This part of the course introduces some of the most common standard distributions for
discrete and continuous random variables.
Probability distributions may differ from each other in a broader or narrower sense. In
the broader sense, we have different families of distributions which may have quite
different characteristics, for example:

continuous versus discrete

among discrete: finite versus infinite number of possible values

among continuous: different sets of possible values (for example, all real numbers x,
x > 0, or x ∈ [0, 1]); symmetric versus skewed distributions.

The ‘distributions’ discussed in this chapter are really families of distributions in this
sense.
In the narrower sense, individual distributions within a family differ in having different
values of the parameters of the distribution. The parameters determine the mean and
variance of the distribution, values of probabilities from it, etc.
In the statistical analysis of a random variable X we typically:

select a family of distributions based on the basic characteristics of X

use observed data to choose (estimate) values for the parameters of that
distribution, and perform statistical inference on them.

Example 4.1 An opinion poll on a referendum, where each Xi is an answer to the


question ‘Will you vote Yes or No to joining the European Union?’ has answers


recorded as Xi = 0 if ‘No’ and Xi = 1 if ‘Yes’. In a poll of 950 people, 513 answered


‘Yes’.
How do we choose a distribution to represent Xi ?

Here we need a family of discrete distributions with only two possible values (0
and 1). The Bernoulli distribution (discussed in the next section), which has
one parameter π (the probability of Xi = 1) is appropriate.

Within the family of Bernoulli distributions, we use the one where the value of
π is our best estimate based on the observed data. This is π̂ = 513/950 = 0.54.

4.6 Common discrete distributions
For discrete random variables, we will consider the following distributions:

Discrete uniform distribution

Bernoulli distribution

Binomial distribution

Poisson distribution.

4.6.1 Discrete uniform distribution


Suppose a random variable X has k possible values 1, 2, . . . , k. X has a discrete
uniform distribution if all of these values have the same probability, i.e. if:
p(x) = P(X = x) = 1/k for x = 1, 2, . . . , k, and 0 otherwise.

Example 4.2 A simple example of the discrete uniform distribution is the


distribution of the score of a fair die, with k = 6.

The discrete uniform distribution is not very common in applications, but it is useful as
a reference point for more complex distributions.

Mean and variance of a discrete uniform distribution

Calculating directly from the definition, we have:

E(X) = ∑_{x=1}^{k} x p(x) = (1 + 2 + · · · + k)/k = (k + 1)/2     (4.1)

and:

E(X^2) = (1^2 + 2^2 + · · · + k^2)/k = (k + 1)(2k + 1)/6.     (4.2)

So:

Var(X) = E(X^2) − (E(X))^2 = (k^2 − 1)/12.

((4.1) and (4.2) make use, respectively, of ∑_{i=1}^{n} i = n(n + 1)/2 and ∑_{i=1}^{n} i^2 = n(n + 1)(2n + 1)/6.)
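
For instance, for the fair die of Example 4.2 these formulas can be checked by direct enumeration. The sketch below assumes Python is available and uses k = 6.

k = 6                                     # a fair die, as in Example 4.2
values = range(1, k + 1)
mean = sum(x / k for x in values)
variance = sum(x**2 / k for x in values) - mean**2
print(mean, variance)                     # 3.5 and 2.9166..., matching (k + 1)/2 and (k**2 - 1)/12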

4.6.2 Bernoulli distribution


A Bernoulli trial is an experiment with only two possible outcomes. We will number
these outcomes 1 and 0, and refer to them as ‘success’ and ‘failure’, respectively.

Example 4.3 Examples of outcomes of Bernoulli trials are:

Agree / Disagree

Male / Female

Employed / Not employed

Owns a car / Does not own a car

Business goes bankrupt / Continues trading.

The Bernoulli distribution is the distribution of the outcome of a single Bernoulli


trial. This is the distribution of a random variable X with the following probability
function:

p(x) = π^x (1 − π)^{1−x} for x = 0, 1, and 0 otherwise.

Therefore P (X = 1) = π and P (X = 0) = 1 − P (X = 1) = 1 − π, and no other values


are possible. Such a random variable X has a Bernoulli distribution with (probability)
parameter π. This is often written as:

X ∼ Bernoulli(π).

If X ∼ Bernoulli(π), then:

E(X) = ∑_{x=0}^{1} x p(x) = 0 × (1 − π) + 1 × π = π     (4.3)

E(X^2) = ∑_{x=0}^{1} x^2 p(x) = 0^2 × (1 − π) + 1^2 × π = π

Var(X) = E(X^2) − (E(X))^2 = π − π^2 = π(1 − π).     (4.4)

4.6.3 Binomial distribution


Suppose we carry out n Bernoulli trials such that:


at each trial, the probability of success is π

different trials are statistically independent events.

Let X denote the total number of successes in these n trials. Then X follows a
binomial distribution with parameters n and π, where n ≥ 1 is a known integer and
0 ≤ π ≤ 1. This is often written as:

X ∼ Bin(n, π).

The binomial distribution was first encountered in Example 3.14.


Example 4.4 A multiple choice test has 4 questions, each with 4 possible answers.
Bob is taking the test, but has no idea at all about the correct answers. So he
guesses every answer and therefore has the probability of 1/4 of getting any
individual question correct.
Let X denote the number of correct answers in Bob’s test. X follows the binomial
distribution with n = 4 and π = 0.25, i.e. we have:

X ∼ Bin(4, 0.25).

For example, what is the probability that Bob gets 3 of the 4 questions correct?
Here it is assumed that the guesses are independent, and each has the probability
π = 0.25 of being correct. The probability of any particular sequence of 3 correct
and 1 incorrect answers, for example 1110, is π^3 (1 − π)^1, where ‘1’ denotes a correct
answer and ‘0’ denotes an incorrect answer.
However, we do not care about the order of the 0s and 1s, only about the number of
1s. So 1101 and 1011, for example, also count as 3 correct answers. Each of these
also has the probability π^3 (1 − π)^1.
The total number of sequences with three 1s (and therefore one 0) is the number of
locations for the three 1s that can be selected in the sequence of 4 answers. This is
(4 choose 3) = 4. Therefore the probability of obtaining three 1s is:

(4 choose 3) π^3 (1 − π)^1 = 4 × 0.25^3 × 0.75^1 ≈ 0.0469.

Binomial distribution probability function

In general, the probability function of X ∼ Bin(n, π) is:

p(x) = (n choose x) π^x (1 − π)^{n−x} for x = 0, 1, . . . , n, and 0 otherwise.     (4.5)

We have already shown that (4.5) satisfies the conditions for being a probability
function in the previous chapter (see Example 3.14).


Example 4.5 Continuing Example 4.4, where X ∼ Bin(4, 0.25), we have:


   
p(0) = (4 choose 0) × (0.25)^0 × (0.75)^4 = 0.3164
p(1) = (4 choose 1) × (0.25)^1 × (0.75)^3 = 0.4219
p(2) = (4 choose 2) × (0.25)^2 × (0.75)^2 = 0.2109
p(3) = (4 choose 3) × (0.25)^3 × (0.75)^1 = 0.0469
p(4) = (4 choose 4) × (0.25)^4 × (0.75)^0 = 0.0039.

If X ∼ Bin(n, π), then:

E(X) = n π
Var(X) = n π (1 − π).

The expected value E(X) was derived in the previous chapter. The variance will be
derived later.

Example 4.6 Suppose a multiple choice examination has 20 questions, each with 4
possible answers. Consider again a student who guesses each one of the answers. Let
X denote the number of correct answers by such a student, so that we have
X ∼ Bin(20, 0.25). For such a student, the expected number of correct answers is
E(X) = 20 × 0.25 = 5.
The teacher wants to set the pass mark of the examination so that, for such a
student, the probability of passing is less than 0.05. What should the pass mark be?
In other words, what is the smallest x such that P (X ≥ x) < 0.05, i.e. such that
P (X < x) ≥ 0.95?
Calculating the probabilities of x = 0, 1, . . . , 20 we get (rounded to 2 decimal places):
x 0 1 2 3 4 5 6 7 8 9 10
p(x) 0.00 0.02 0.07 0.13 0.19 0.20 0.17 0.11 0.06 0.03 0.01
x 11 12 13 14 15 16 17 18 19 20
p(x) 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Calculating the cumulative probabilities, we find that F (7) = P (X < 8) = 0.898 and
F (8) = P (X < 9) = 0.959. Therefore P (X ≥ 8) = 0.102 > 0.05 and also
P (X ≥ 9) = 0.041 < 0.05. The pass mark should be set at 9.
More generally, consider a student who has the same probability π of the correct
answer for every question, so that X ∼ Bin(20, π). Figure 4.1 shows plots of the
probabilities for π = 0.25, 0.5, 0.7 and 0.9.
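
The pass-mark calculation in Example 4.6 can be reproduced by accumulating binomial probabilities until the cdf reaches 0.95. The sketch below assumes Python 3.8 or later (for math.comb) and is purely illustrative.

from math import comb

n, p = 20, 0.25                           # Bin(20, 0.25), as in Example 4.6

def pmf(x):
    return comb(n, x) * p**x * (1 - p)**(n - x)

cdf = 0.0
for x in range(n + 1):
    cdf += pmf(x)
    if cdf >= 0.95:                       # P(X >= x + 1) < 0.05 from here on
        print("pass mark:", x + 1)        # prints 9
        break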

4.6.4 Poisson distribution


The possible values of the Poisson distribution are the non-negative integers
0, 1, 2, . . . .


Figure 4.1: Probability plots for Example 4.6, for π = 0.25 (E(X) = 5), π = 0.5 (E(X) = 10), π = 0.7 (E(X) = 14) and π = 0.9 (E(X) = 18).

Poisson distribution probability function

The probability function of the Poisson distribution is:

p(x) = e^{−λ} λ^x / x! for x = 0, 1, 2, . . . , and 0 otherwise     (4.6)

where λ > 0 is a parameter.

Activity 4.1 Show that (4.6) satisfies the conditions to be a probability function.
Hint: You can use the following result from standard calculus. For any number a,

a
X ax
e = .
x=0
x!

If a random variable X has a Poisson distribution with parameter λ, this is often


denoted by:
X ∼ Poisson(λ) or X ∼ Pois(λ).

If X ∼ Poisson(λ), then:
E(X) = λ
Var(X) = λ.


Activity 4.2 Prove that the mean and variance of a Poisson-distributed random
variable are both equal to λ.

Poisson distributions are used for counts of occurrences of various kinds. To give a
formal motivation, suppose that we consider the number of occurrences of some
phenomenon in time, and that the process that generates the occurrences satisfies the
following conditions:

1. The numbers of occurrences in any two disjoint intervals of time are independent of
each other.
2. The probability of two or more occurrences at the same time is negligibly small.
3. The probability of one occurrence in any short time interval of length t is λt for
some constant λ > 0.

In essence, these state that individual occurrences should be independent, sufficiently


rare, and happen at a constant rate λ per unit of time. A process like this is a Poisson
process.
If occurrences are generated by a Poisson process, then the number of occurrences in a
randomly selected time interval of length t = 1, X, follows a Poisson distribution with
mean λ, i.e. X ∼ Poisson(λ).
The single parameter λ of the Poisson distribution is therefore the rate of occurrences
per unit of time.

Example 4.7 Examples of variables for which we might use a Poisson distribution:

Number of telephone calls received at a call centre per minute.

Number of accidents on a stretch of motorway per week.

Number of customers arriving at a checkout per minute.

Number of misprints per page of newsprint.

Because λ is the rate per unit of time, its value also depends on the unit of time (that
is, the length of interval) we consider.

Example 4.8 If X is the number of arrivals per hour and X ∼ Poisson(1.5), then if
Y is the number of arrivals per two hours, Y ∼ Poisson(2 × 1.5) = Poisson(3).

λ is also the mean of the distribution, i.e. E(X) = λ.


Both motivations suggest that distributions with higher values of λ have higher
probabilities of large values of X.

Example 4.9 Figure 4.2 shows the probabilities p(x) for x = 0, 1, 2, . . . , 10 for
X ∼ Poisson(2) and X ∼ Poisson(4).



Figure 4.2: Probability plots for Example 4.9.

Example 4.10 Customers arrive at a bank on weekday afternoons randomly at an


average rate of 1.6 customers per minute. Let X denote the number of arrivals per
minute and Y denote the number of arrivals per 5 minutes.
We assume a Poisson distribution for both, such that:
X ∼ Poisson(1.6)
and Y ∼ Poisson(5 × 1.6) = Poisson(8).

1. What is the probability that no customer arrives in a one-minute interval?

   For X ∼ Poisson(1.6), the probability P(X = 0) is:

   p_X(0) = e^{−λ} λ^0 / 0! = e^{−1.6} (1.6)^0 / 0! = e^{−1.6} = 0.2019.

2. What is the probability that more than two customers arrive in a one-minute
   interval?

   P(X > 2) = 1 − P(X ≤ 2) = 1 − [P(X = 0) + P(X = 1) + P(X = 2)], which is:

   1 − p_X(0) − p_X(1) − p_X(2) = 1 − e^{−1.6} (1.6)^0 / 0! − e^{−1.6} (1.6)^1 / 1! − e^{−1.6} (1.6)^2 / 2!
                                = 1 − e^{−1.6} − 1.6 e^{−1.6} − 1.28 e^{−1.6}
                                = 1 − 3.88 e^{−1.6}
                                = 0.2167.

3. What is the probability that no more than 1 customer arrives in a five-minute
   interval?

   For Y ∼ Poisson(8), the probability P(Y ≤ 1) is:

   p_Y(0) + p_Y(1) = e^{−8} (8)^0 / 0! + e^{−8} (8)^1 / 1! = e^{−8} + 8 e^{−8} = 9 e^{−8} = 0.0030.
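
These Poisson probabilities follow directly from the probability function (4.6). The sketch below assumes Python is available and reproduces the three answers.

import math

def pois_pmf(x, lam):
    # Poisson probability function: e^(-lam) * lam**x / x!
    return math.exp(-lam) * lam**x / math.factorial(x)

print(pois_pmf(0, 1.6))                                  # approximately 0.2019
print(1 - sum(pois_pmf(x, 1.6) for x in range(3)))       # P(X > 2), approximately 0.2167
print(sum(pois_pmf(x, 8) for x in range(2)))             # P(Y <= 1), approximately 0.0030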


A word on calculators

In the examination you will be allowed a basic calculator only. To calculate binomial
and Poisson probabilities directly requires access to a ‘factorial’ key (for the binomial)
and ‘e’ key (for the Poisson), which will not appear on a basic calculator. Note that any
probability calculations which are required in the examination will be possible on a
basic calculator. For example, for the Poisson probabilities in Example 4.10, it would be
acceptable to give your answers in terms of e (in the simplest form).

4.6.5 Connections between probability distributions


There are close connections between some probability distributions, even across
different families of them. Some connections are exact, i.e. one distribution is exactly
equal to another, for particular values of the parameters. For example, Bernoulli(π) is
the same distribution as Bin(1, π).
Some connections are approximate (or asymptotic), i.e. one distribution is closely
approximated by another under some limiting conditions. We next discuss one of these,
the Poisson approximation of the binomial distribution.

4.6.6 Poisson approximation of the binomial distribution


Suppose that:

X ∼ Bin(n, π).

n is large and π is small.

Under such circumstances, the distribution of X is well-approximated by a Poisson(λ)


distribution with λ = n π.
The connection is exact at the limit, i.e. Bin(n, π) → Poisson(λ) if n → ∞ and π → 0
in such a way that n π = λ remains constant.

Activity 4.3 Suppose that X ∼ Bin(n, π) and Y ∼ Poisson(λ). Show that, if


n → ∞ and π → 0 in such a way that n π = λ remains constant, then, for any x, we
have:
P (X = x) → P (Y = x) as n → ∞.
Hint 1: Because n π = λ remains constant, substitute λ/n for π from the beginning.
Hint 2: One step of the proof uses the limit definition of the exponential function,
which states that, for any number y, we have:

lim_{n→∞} (1 + y/n)^n = e^y.

This ‘law of small numbers’ provides another motivation for the Poisson distribution.


Example 4.11 A classic example (from Bortkiewicz (1898) Das Gesetz der kleinen
Zahlen) helps to remember the key elements of the ‘law of small numbers’.
Figure 4.3 shows the numbers of soldiers killed by horsekick in each of 14 Army
Corps of the Prussian army in each of the years spanning 1875–94.
Suppose that the number of men killed by horsekicks in one corps in one year is
X ∼ Bin(n, π), where:

n is large – the number of men in a corps (perhaps 50,000).

π is small – the probability that a man is killed by a horsekick.


Then X should be well-approximated by a Poisson distribution with some mean λ.
The sample frequencies and proportions of different counts are as follows:

Number killed 0 1 2 3 4 More


Count 144 91 32 11 2 0
% 51.4 32.5 11.4 3.9 0.7 0

The sample mean of the counts is x̄ = 0.7, which we use as λ for the Poisson
distribution. X ∼ Poisson(0.7) is indeed a good fit to the data, as shown in Figure
4.4.

Figure 4.3: Numbers of soldiers killed by horsekick in each of 14 army corps of the Prussian
army in each of the years 1875–94. Source: Bortkiewicz (1898) Das Gesetz der kleinen
Zahlen, Leipzig: Teubner.

Example 4.12 An airline is selling tickets to a flight with 198 seats. It knows that,
on average, about 1% of customers who have bought tickets fail to arrive for the
flight. Because of this, the airline overbooks the flight by selling 200 tickets. What
is the probability that everyone who arrives for the flight will get a seat?
Let X denote the number of people who fail to turn up. Using the binomial
distribution, X ∼ Bin(200, 0.01). We have:

P (X ≥ 2) = 1 − P (X = 0) − P (X = 1) = 1 − 0.1340 − 0.2707 = 0.5953.



Figure 4.4: Fit of Poisson distribution to the data in Example 4.11.

Using the Poisson approximation, X ∼ Poisson(200 × 0.01) = Poisson(2).

P(X ≥ 2) = 1 − P(X = 0) − P(X = 1) = 1 − e^{−2} − 2e^{−2} = 1 − 3e^{−2} = 0.5940.
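
The quality of the approximation can be seen by computing the exact and approximate answers side by side. The sketch below assumes Python 3.8 or later and uses n = 200 and π = 0.01.

from math import comb, exp, factorial

n, p = 200, 0.01
lam = n * p

binom_pmf = lambda x: comb(n, x) * p**x * (1 - p)**(n - x)
pois_pmf  = lambda x: exp(-lam) * lam**x / factorial(x)

print(1 - binom_pmf(0) - binom_pmf(1))    # exact binomial:        approximately 0.5953
print(1 - pois_pmf(0) - pois_pmf(1))      # Poisson approximation: approximately 0.5940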

4.6.7 Some other discrete distributions


Just their names and short comments are given here, so that you have an idea of what
else there is.

Geometric(π) distribution:
• Distribution of the number of failures in Bernoulli trials before the first success.
• π is the probability of success at each trial.
• Sample space is 0, 1, 2, . . . .
• See the basketball example in Chapter 3.

Negative binomial(r, π) distribution:


• Distribution of the number of failures in Bernoulli trials before r successes
occur.
• π is the probability of success at each trial.
• Sample space is 0, 1, 2, . . . .
• Negative Binomial(1, π) is the same as Geometric(π).

Hypergeometric(n, A, B) distribution:
• Experiment where initially A + B objects are available for selection, and A of
them represent ‘success’.
• n objects are selected at random, without replacement.
• Hypergeometric is then the distribution of the number of successes.


• Sample space is the integers x where max{0, n − B} ≤ x ≤ min{n, A}.


• If the selection was with replacement, the distribution of the number of
successes would be Bin(n, A/(A + B)).

Multinomial(n, π1 , π2 , . . . , πk ) distribution:
• Here π1 + π2 + · · · + πk = 1, and the πi s are the probabilities of the values
1, 2, . . . , k.
• If n = 1, the sample space is 1, 2, . . . , k. This is essentially a generalisation of
the discrete uniform distribution, but with non-equal probabilities πi .
• If n > 1, the sample space is the vectors (n1 , n2 , . . . , nk ) where ni ≥ 0 for all i,
and n1 + n2 + · · · + nk = n. This is essentially a generalisation of the binomial
to the case where each trial has K ≥ 2 possible outcomes, and the random
variable records the numbers of each outcome in n trials. Note that with
K = 2, Multinomial(n, π1 , π2 ) is essentially the same as Bin(n, π) with π = π2
(or with π = π1 ).
• When n > 1, the multinomial is the distribution of a multivariate random
variable, as discussed later in the course.

4.7 Common continuous distributions


For continuous random variables, we will consider the following distributions:

Uniform distribution

Exponential distribution

Normal distribution.

4.7.1 The (continuous) uniform distribution


The (continuous) uniform distribution has non-zero probabilities only on an interval
[a, b], where a < b are given numbers. The probability that its value is in an interval
within [a, b] is proportional to the length of that interval. In other words, all intervals
(within [a, b]) which have the same length have the same probability.

Uniform distribution pdf

The pdf of the (continuous) uniform distribution is:


f(x) = 1/(b − a) for a ≤ x ≤ b, and 0 otherwise.

A random variable X with this pdf may be written as X ∼ Uniform[a, b].


The pdf is ‘flat’, as shown in Figure 4.5 (along with the cdf). Clearly, f(x) ≥ 0 for all x,
and:

∫_{−∞}^{∞} f(x) dx = ∫_{a}^{b} 1/(b − a) dx = [x]_{a}^{b} / (b − a) = (b − a)/(b − a) = 1.

The cdf is:

F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(t) dt, which equals 0 for x < a, (x − a)/(b − a) for a ≤ x ≤ b, and 1 for x > b.

Activity 4.4 Derive the cdf for the continuous uniform distribution.

The probability of an interval [x1 , x2 ], where a ≤ x1 < x2 ≤ b, is therefore:

P(x_1 ≤ X ≤ x_2) = F(x_2) − F(x_1) = (x_2 − x_1)/(b − a).
So the probability depends only on the length of the interval, x2 − x1 .

Figure 4.5: Continuous uniform distribution pdf (left) and cdf (right).

If X ∼ Uniform[a, b], we have:

E(X) = (b + a)/2 = median of X

Var(X) = (b − a)^2 / 12.

The mean and median also follow from the fact that the distribution is symmetric
about (b + a)/2, i.e. the midpoint of the interval [a, b].

Activity 4.5 Derive the mean and variance of the continuous uniform distribution.


4.7.2 Exponential distribution

Exponential distribution pdf

A random variable X has the exponential distribution with the parameter λ


(where λ > 0) if its probability density function is:
f(x) = λ e^{−λx} for x > 0, and 0 otherwise.

This is often denoted X ∼ Exponential(λ) or X ∼ Exp(λ).


It was shown in the previous chapter that this satisfies the conditions for a pdf (see
Example 3.21). The general shape of the pdf is that of ‘exponential decay’, as shown in
Figure 4.6 (hence the name).

Figure 4.6: Exponential distribution pdf.

The cdf of the Exponential(λ) distribution is:


F(x) = 0 for x ≤ 0, and F(x) = 1 − e^{−λx} for x > 0.

The cdf is shown in Figure 4.7 for λ = 1.6.


For X ∼ Exponential(λ), we have:

E(X) = 1/λ
Var(X) = 1/λ^2.

These have been derived in the previous chapter (see Example 3.23). The median of the
distribution, also previously derived (see Example 3.25), is:

m = (log 2)/λ = (log 2) × (1/λ) = (log 2) E(X) ≈ 0.69 E(X).


Figure 4.7: Exponential distribution cdf for λ = 1.6.

Note that the median is always smaller than the mean, because the distribution is
skewed to the right.

Uses of the exponential distribution

The exponential is, among other things, a basic distribution of waiting times of
various kinds. This arises from a connection between the Poisson distribution – the
simplest distribution for counts – and the exponential.

If the number of events per unit of time has a Poisson distribution with parameter
λ, the time interval (measured in the same units of time) between two successive
events has an exponential distribution with the same parameter λ.

Note that the expected values of these behave as we would expect.

E(X) = λ for Poisson(λ), i.e. a large λ means many events per unit of time, on
average.

E(X) = 1/λ for Exponential(λ), i.e. a large λ means short waiting times between
successive events, on average.

Example 4.13 Consider Example 4.10.

The number of customers arriving at a bank per minute has a Poisson


distribution with parameter λ = 1.6.

Then the time X, in minutes, between the arrivals of two successive customers
follows an exponential distribution with parameter λ = 1.6.
From this exponential distribution, the expected waiting time between arrivals of
customers is E(X) = 1/1.6 = 0.625 (minutes) and the median is calculated to be
(log 2) × 0.625 = 0.433.


We can also calculate probabilities of waiting times between arrivals, using the
cumulative distribution function:
F(x) = 0 for x ≤ 0, and F(x) = 1 − e^{−1.6x} for x > 0.

For example:

P(X ≤ 1) = F(1) = 1 − e^{−1.6×1} = 1 − e^{−1.6} = 0.7981.

The probability is about 0.8 that two arrivals are at most a minute apart.

P(X > 3) = 1 − F(3) = e^{−1.6×3} = e^{−4.8} = 0.0082.

The probability of a gap of 3 minutes or more between arrivals is very small.
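
The sketch below, assuming Python is available, evaluates the mean, the median and these two probabilities from the Exponential(1.6) distribution.

import math

lam = 1.6                                  # arrivals per minute

exp_cdf = lambda x: 1 - math.exp(-lam * x) if x > 0 else 0.0

print(1 / lam)                             # mean waiting time, 0.625 minutes
print(math.log(2) / lam)                   # median waiting time, approximately 0.433 minutes
print(exp_cdf(1))                          # P(X <= 1), approximately 0.7981
print(1 - exp_cdf(3))                      # P(X > 3),  approximately 0.0082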

4.7.3 Two other distributions


These are generalisations of the uniform and exponential distributions. Only their
names and short comments are given here, just so that you know they exist.

Beta(α, β) distribution, shown in Figure 4.8.


• Generalising the uniform, these are distributions for a closed interval, which is
taken to be [0, 1].
• Sample space is therefore {x | 0 ≤ x ≤ 1}.
• Unlike for the uniform distribution, the pdf is generally not flat.
• Beta(1, 1) is the same as Uniform[0, 1].

Gamma(α, β) distribution, shown in Figure 4.9.


• Generalising the exponential distribution, this is a two-parameter family of
skewed distributions for positive values.
• Sample space is {x | x > 0}.
• Gamma(1, β) is the same as Exponential(β).

4.7.4 Normal (Gaussian) distribution


The normal distribution is by far the most important probability distribution in
statistics. This is for three broad reasons:

Many variables have distributions that are approximately normal, for example
heights of humans, animals and weights of various products.
The normal distribution has extremely convenient mathematical properties, which
make it a useful default choice of distribution in many contexts.
Even when a variable is not itself even approximately normally distributed,
functions of several observations of the variable (‘sampling distributions’) are often
approximately normal, due to the central limit theorem. Because of this, the


Figure 4.8: Beta distribution density functions, for (α, β) = (0.5, 1), (1, 2), (1, 1), (0.5, 0.5), (2, 2) and (4, 2).

normal distribution has a crucial role in statistical inference. This will be discussed
later in the course.

Normal distribution pdf

The pdf of the normal distribution is:

f(x) = (1/√(2πσ^2)) exp(−(x − µ)^2 / (2σ^2)) for −∞ < x < ∞

where π is the mathematical constant (i.e. π = 3.14159...), and µ and σ^2 are
parameters, with −∞ < µ < ∞ and σ^2 > 0.

A random variable X with this pdf is said to have a normal distribution with mean
µ and variance σ^2, denoted X ∼ N(µ, σ^2).

Clearly, f(x) ≥ 0 for all x. Also, it can be shown that ∫_{−∞}^{∞} f(x) dx = 1 (do not attempt
to show this), so f(x) really is a pdf.
If X ∼ N(µ, σ^2), then:

E(X) = µ
Var(X) = σ^2

and the standard deviation is therefore sd(X) = σ.


The mean can also be inferred from the observation that the normal pdf is symmetric
about µ. This also implies that the median of the normal distribution is µ.
The normal density is the so-called ‘bell curve’. The two parameters affect it as follows:


Figure 4.9: Gamma distribution density functions, for (α, β) = (0.5, 1), (1, 0.5), (2, 1) and (2, 0.25).

The mean µ determines the location of the curve.

The variance σ 2 determines the dispersion (spread) of the curve.

Example 4.14 Figure 4.10 shows that:

N (0, 1) and N (5, 1) have the same dispersion but different location: the
N (5, 1) curve is identical to the N (0, 1) curve, but shifted 5 units to the right.
N (0, 1) and N (0, 9) have the same location but different dispersion: the
N (0, 9) curve is centered at the same value, 0, as the N (0, 1) curve, but spread
out more widely.

Linear transformations of the normal distribution

We now consider one of the convenient properties of the normal distribution. Suppose
X is a random variable, and we consider the linear transformation Y = aX + b, where a
and b are constants.
Whatever the distribution of X, it is true that E(Y ) = aE(X) + b and also that
Var(Y ) = a2 Var(X).
Furthermore, if X is normally distributed, then so is Y . In other words, if
X ∼ N(µ, σ^2), then:

Y = aX + b ∼ N(aµ + b, a^2 σ^2).     (4.7)

This type of result is not true in general. For other families of distributions, the
distribution of Y = aX + b is not always in the same family as X.



Figure 4.10: Various normal distributions.

Let us apply (4.7) with a = 1/σ and b = −µ/σ, to get:

Z = (1/σ) X − µ/σ = (X − µ)/σ ∼ N((1/σ) · µ − µ/σ, (1/σ)^2 · σ^2) = N(0, 1).

The transformed variable Z = (X − µ)/σ is known as a standardised variable or a
z-score.

The distribution of the z-score is N(0, 1), i.e. the normal distribution with mean µ = 0
and variance σ^2 = 1 (and therefore a standard deviation of σ = 1). This is known as the
standard normal distribution. Its density function is:

f(x) = (1/√(2π)) exp(−x^2/2) for −∞ < x < ∞.

The cumulative distribution function of the normal distribution is:

F(x) = ∫_{−∞}^{x} (1/√(2πσ^2)) exp(−(t − µ)^2 / (2σ^2)) dt.

In the special case of the standard normal distribution, the cdf is:

F(x) = Φ(x) = ∫_{−∞}^{x} (1/√(2π)) exp(−t^2/2) dt.

Note, this is often denoted Φ(x).


Such integrals cannot be evaluated in a closed form, so we use statistical tables of them,
specifically a table of Φ(x) (or we could use a computer, but not in the examination).
In the examination, you will have a table of some values of Φ(x), the cdf of
Z ∼ N (0, 1). Specifically, Table 4 of the New Cambridge Statistical Tables shows values
of Φ(x) = P (Z ≤ x) for x ≥ 0. This table can be used to calculate probabilities of any
intervals for any normal distribution. But how? The table seems to be incomplete:


1. It is only for N (0, 1), not for N (µ, σ 2 ) for any other µ and σ 2 .
2. Even for N (0, 1), it only shows probabilities for x ≥ 0.

The key to using the tables is that the standard normal distribution is symmetric about
0. This means that for an interval in one tail, its ‘mirror image’ in the other tail has the
same probability. Another way to justify these results is that if Z ∼ N (0, 1), then
−Z ∼ N (0, 1) also. See ST104a Statistics 1 for a discussion of how to use Table 4 of
the New Cambridge Statistical Tables.

Probabilities for any normal distribution


How about a normal distribution X ∼ N(µ, σ^2), for any other µ and σ^2?

What if we want to calculate, for any a < b, P(a < X ≤ b) = F(b) − F(a)?

Remember that (X − µ)/σ = Z ∼ N(0, 1). If we apply this transformation to all parts
of the inequalities, we get:

P(a < X ≤ b) = P((a − µ)/σ < (X − µ)/σ ≤ (b − µ)/σ)
             = P((a − µ)/σ < Z ≤ (b − µ)/σ)
             = Φ((b − µ)/σ) − Φ((a − µ)/σ)

which can be calculated using Table 4 of the New Cambridge Statistical Tables. (Note
that this also covers the cases of the one-sided inequalities P(X ≤ b), with a = −∞,
and P(X > a), with b = ∞.)

Example 4.15 Let X denote the diastolic blood pressure of a randomly selected
person in England. This is approximately distributed as X ∼ N (74.2, 127.87).
Suppose we want to know the probabilities of the following intervals:

X > 90 (high blood pressure)

X < 60 (low blood pressure)

60 ≤ X ≤ 90 (normal blood pressure).

These are calculated using standardisation with µ = 74.2 and σ 2 = 127.87, and
therefore σ = 11.31. So here:
(X − 74.2)/11.31 = Z ∼ N(0, 1)
and we can refer values of this standardised variable to Table 4 of the New


Cambridge Statistical Tables.


 
P(X > 90) = P((X − 74.2)/11.31 > (90 − 74.2)/11.31)
          = P(Z > 1.40)
          = 1 − Φ(1.40)
          = 1 − 0.9192
          = 0.0808.

P(X < 60) = P((X − 74.2)/11.31 < (60 − 74.2)/11.31)
          = P(Z < −1.26)
          = P(Z > 1.26)
          = 1 − Φ(1.26)
          = 1 − 0.8962
          = 0.1038.

Finally:

P(60 ≤ X ≤ 90) = P(X ≤ 90) − P(X < 60) = 0.8152.

These probabilities are shown in Figure 4.11.



Figure 4.11: Distribution of blood pressure for Example 4.15.
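
Outside the examination, such normal probabilities can be computed without tables by evaluating the standard normal cdf Φ directly. The sketch below assumes Python is available and expresses Φ through the error function; small differences from the values above arise because the text rounds z to two decimal places before using Table 4.

from math import erf, sqrt

def Phi(z):
    # standard normal cdf, via the error function
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma = 74.2, sqrt(127.87)

print(1 - Phi((90 - mu) / sigma))                         # P(X > 90),        approximately 0.08
print(Phi((60 - mu) / sigma))                             # P(X < 60),        approximately 0.10
print(Phi((90 - mu) / sigma) - Phi((60 - mu) / sigma))    # P(60 <= X <= 90), approximately 0.82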


Some probabilities around the mean

The following results hold for all normal distributions:

P (µ − σ < X < µ + σ) = 0.683. In other words, about 68.3% of the total


probability is within 1 standard deviation of the mean.

P (µ − 1.96σ < X < µ + 1.96σ) = 0.950.

P (µ − 2σ < X < µ + 2σ) = 0.954.

P (µ − 2.58σ < X < µ + 2.58σ) = 0.99.


P (µ − 3σ < X < µ + 3σ) = 0.997.

The first two of these are illustrated graphically in Figure 4.12.


Figure 4.12: Some probabilities around the mean for the normal distribution.

4.7.5 Normal approximation of the binomial distribution


For 0 < π < 1, the binomial distribution Bin(n, π) tends to the normal distribution
N (n π, n π (1 − π)) as n → ∞.
Less formally: The binomial is well-approximated by the normal when the number of
trials n is reasonably large.
For a given n, the approximation is best when π is not very close to 0 or 1. One
rule-of-thumb is that the approximation is good enough when n π > 5 and
n (1 − π) > 5. Illustrations of the approximation are shown in Figure 4.13 for different
values of n and π. Each plot shows values of the pf of Bin(n, π), and the pdf of the
normal approximation, N (n π, n π (1 − π)).
When the normal approximation is appropriate, we can calculate probabilities for
X ∼ Bin(n, π) using Y ∼ N (n π, n π (1 − π)) and Table 4 of the New Cambridge
Statistical Tables.


Figure 4.13: Examples of the normal approximation of the binomial distribution, for (n, π) = (10, 0.5), (25, 0.5), (25, 0.25), (10, 0.9), (25, 0.9) and (50, 0.9).

Unfortunately, there is one small caveat. The binomial distribution is discrete, but the
normal distribution is continuous. To see why this is problematic, consider the following.
Suppose X ∼ Bin(40, 0.4). Since X is discrete, such that x = 0, 1, . . . , 40, then:

P (X ≤ 4) = P (X ≤ 4.5) = P (X < 5)

since P (4 < X ≤ 4.5) = 0 and P (4.5 < X < 5) = 0 due to the ‘gaps’ in the probability
mass for this distribution. In contrast if Y ∼ N (16, 9.6), then:

P (Y ≤ 4) < P (Y ≤ 4.5) < P (Y < 5)

since P (4 < Y < 4.5) > 0 and P (4.5 < Y < 5) > 0 because this is a continuous
distribution.
The accepted way to circumvent this problem is to use a continuity correction which
corrects for the effects of the transition from a discrete Bin(n, π) distribution to a
continuous N (n π, n π (1 − π)) distribution.

Continuity correction

This technique involves representing each discrete binomial value x, for 0 ≤ x ≤ n,


by the continuous interval (x − 0.5, x + 0.5). Great care is needed to determine which
x values are included in the required probability. Suppose we are approximating
X ∼ Bin(n, π) with Y ∼ N (n π, n π (1 − π)), then:

P (X < 4) = P (X ≤ 3) ⇒ P (Y < 3.5) (since 4 is excluded)


P (X ≤ 4) = P (X < 5) ⇒ P (Y < 4.5) (since 4 is included)
P (1 ≤ X < 6) = P (1 ≤ X ≤ 5) ⇒ P (0.5 < Y < 5.5) (since 1 to 5 are included).


Example 4.16 In the UK general election in May 2010, the Conservative Party
received 36.1% of the votes. We carry out an opinion poll in November 2014, where
we survey 1,000 people who say they voted in 2010, and ask who they would vote for
if a general election was held now. Let X denote the number of people who say they
would now vote for the Conservative Party.
Suppose we assume that X ∼ Bin(1000, 0.361).

1. What is the probability that X ≥ 400?

   Using the normal approximation, noting n = 1000 and π = 0.361, with
   Y ∼ N(1000 × 0.361, 1000 × 0.361 × 0.639) = N(361, 230.68), we get:

   P(X ≥ 400) ≈ P(Y ≥ 399.5)
             = P((Y − 361)/√230.68 ≥ (399.5 − 361)/√230.68)
             = P(Z ≥ 2.53)
             = 1 − Φ(2.53)
             = 0.0057.

   The exact probability from the binomial distribution is P(X ≥ 400) = 0.0059.
   Without the continuity correction, the normal approximation would give 0.0051.
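
The exact binomial probability and the normal approximations with and without the continuity correction can be compared directly. The sketch below assumes Python 3.8 or later.

from math import comb, erf, sqrt

n, p = 1000, 0.361
mu, var = n * p, n * p * (1 - p)

Phi = lambda z: 0.5 * (1 + erf(z / sqrt(2)))

exact = sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(400, n + 1))
print(exact)                                   # approximately 0.0059
print(1 - Phi((399.5 - mu) / sqrt(var)))       # with continuity correction,    approximately 0.0057
print(1 - Phi((400 - mu) / sqrt(var)))         # without continuity correction, approximately 0.0051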
2. What is the largest number x for which P(X ≤ x) < 0.01?

   We need the largest x which satisfies:

   P(X ≤ x) ≈ P(Y ≤ x + 0.5) = P(Z ≤ (x + 0.5 − 361)/√230.68) < 0.01.

   According to Table 4 of the New Cambridge Statistical Tables, the smallest z
   which satisfies P(Z ≥ z) < 0.01 is z = 2.33, so the largest z which satisfies
   P(Z ≤ z) < 0.01 is z = −2.33. We then need to solve:

   (x + 0.5 − 361)/√230.68 ≤ −2.33

   which gives x ≤ 325.1. The largest integer value which satisfies this is x = 325.
   Therefore P(X ≤ x) < 0.01 for all x ≤ 325.

   The sum of the exact binomial probabilities from 0 to x is 0.0093 for x = 325,
   and 0.011 for x = 326. The normal approximation gives exactly the correct
   answer in this instance.
3. Suppose that 300 respondents in the actual survey say they would vote for the
Conservative Party now. What do you conclude from this?
From the answer to Question 2, we know that P (X ≤ 300) < 0.01, if π = 0.361.
In other words, if the Conservatives’ support remains 36.1%, we would be very
unlikely to get a random sample where only 300 (or fewer) respondents would
say they would vote for the Conservative Party.
Now X = 300 is actually observed. We can then conclude one of two things (if
we exclude other possibilities, such as a biased sample or lying by the
respondents):


(a) The Conservatives’ true level of support is still 36.1% (or even higher), but
by chance we ended up with an unusual sample with only 300 of their
supporters.
(b) The Conservatives’ true level of support is currently less than 36.1% (in
which case getting 300 in the sample would be more probable).

Here (b) seems a more plausible conclusion than (a). This kind of reasoning is
the basis of statistical significance tests.

4.8 Overview of chapter


This chapter has introduced some common discrete and continuous probability
distributions. Their properties, uses and applications have been discussed. The
relationships between some of these distributions have also been discussed.

4.9 Key terms and concepts


Bernoulli distribution        Binomial distribution
Central limit theorem         Continuity correction
Exponential distribution      Normal distribution
Parameter                     Poisson distribution
Population distribution       Standardised variable
Uniform distribution          z-score

4.10 Learning activities


1. London Underground trains on the Northern Line have a probability of 0.05 of
failure between Golders Green and King’s Cross. Supposing that the failures are all
independent, what is the probability that out of 10 journeys between Golders
Green and King’s Cross more than 8 do not have a breakdown?

2. Suppose that the normal rate of infection for a certain disease in cattle is 25%. To
test a new serum which may prevent infection, three experiments are carried out.
The test for infection is not always valid for some particular cattle, so the
experimental results are incomplete – we cannot always tell whether a cow is
infected or not. The results of the three experiments are:
i. 10 animals are injected; all 10 remain free from infection.
ii. 17 animals are injected; more than 15 remain free from infection and there are
2 doubtful cases.
iii. 23 animals are injected; more than 20 remain free from infection and there are
3 doubtful cases.
Which experiment provides the strongest evidence in favour of the serum?


3. In a large industrial plant there is an accident on average every two days.


(a) What is the chance that there will be exactly two accidents in a given week?
(b) Repeat (a) for the chance of two or more accidents in a given week.
(c) If Karen goes to work there for a four-week period what is the probability that
no accident occurs while she is there?

4. The chance that a lottery ticket has a winning number is 0.0000001. Suppose
10,000,000 people buy tickets that are independently numbered.
(a) What is the probability there is no winner?
(b) What is the probability there is exactly 1 winner?
(c) What is the probability there are exactly 2 winners?

5. Suppose that X ∼ Uniform[0, 1]. Compute P (X > 0.2), P (X ≥ 0.2) and


P(X^2 > 0.04).

6. Suppose that the service time for a customer at a fast food outlet has an
exponential distribution with parameter 1/3 (customers per minute). What is the
probability that a customer waits more than 4 minutes?

7. Suppose that the distribution of men’s heights in London, measured in cm, is


N(175, 6^2). Find the proportion of men whose height is:
(a) under 169 cm
(b) over 190 cm
(c) between 169 cm and 190 cm.

8. Two statisticians disagree about the distribution of IQ scores for a population


under study. Both agree that the distribution is normal, and that σ = 15, but A
says that 5% of the population have IQ scores greater than 134.6735, whereas B
says that 10% of the population have IQ scores greater than 109.224. What is the
difference between the mean IQ score as assessed by A and that as assessed by B?

9. Helmut goes fishing every Saturday. The number of fish he catches follows a
Poisson distribution. On a proportion p of the days he goes fishing, he does not
catch anything. He makes it a rule to take home the first fish and then every other
fish that he catches (i.e. the first, third, fifth fish and so on).
(a) Using a Poisson distribution, find the mean number of fish he catches.
(b) Show that the probability that he takes home the last fish he catches is
(1 − p^2)/2.

Solutions to these questions can be found on the VLE in the ST104b Statistics 2 area
at http://my.londoninternational.ac.uk


4.11 Reminder of learning outcomes


After completing this chapter, and having completed the Essential reading and
activities, you should be able to:

summarise basic distributions such as the uniform, Bernoulli, binomial, Poisson,


exponential and normal

calculate probabilities of events for these distributions using the probability


function, probability density function or cumulative distribution function
determine probabilities using statistical tables, where appropriate

state properties of these distributions such as the expected value and variance.

4.12 Sample examination questions


1. A doctor wishes to procure subjects possessing a certain chromosome abnormality
which is present in 4% of the population. How many randomly chosen independent
subjects should be procured if the doctor wishes to be 95% confident that at least
one subject has the abnormality?

2. At one stage in the manufacture of an article a piston of circular cross-section has


to fit into a similarly-shaped cylinder. The distributions of diameters of pistons and
cylinders are known to be normal with the following parameters:
• Piston diameters: mean 10.42 cm, standard deviation 0.03 cm.
• Cylinder diameters: mean 10.52 cm, standard deviation 0.04 cm.
(a) If pairs of pistons and cylinders are selected at random for assembly, for what
proportion will the piston not fit into the cylinder (i.e. for which the piston
diameter exceeds the cylinder diameter)?
(b) What is the chance that in 100 pairs, selected at random:
i. every piston will fit
ii. not more than two of the pistons will fail to fit?
(c) Calculate the probabilities in (b) using a Poisson approximation. Discuss the
appropriateness of using this approximation.

3. Show that, for a binomial random variable such that X ∼ Bin(n, π), we have:
E(X) = n π ∑_{x=1}^{n} [(n − 1)! / ((x − 1)!(n − x)!)] π^{x−1} (1 − π)^{n−x}.

Hence find E(X) and Var(X).

[The wording of the question implies that you use the result that you have just
proved. Other methods of derivation will not be accepted!]


4. Cars independently pass a point on a busy road at an average rate of 150 per hour.
(a) Assuming a Poisson distribution, find the probability that none passes in a
given minute.
(b) What is the expected number passing in two minutes?
(c) Find the probability that the expected number actually passes in a given
two-minute period.
Other motor vehicles (vans, motorcycles etc.) pass the same point independently at
the rate of 75 per hour. Assume a Poisson distribution for these vehicles too.
(d) What is the probability of one car and one other motor vehicle in a two-minute
period?
Solutions to these questions can be found on the VLE in the ST104b Statistics 2 area
at http://my.londoninternational.ac.uk

