A Level Statistics-2
J.S. Abdey
ST104b
2014
Undergraduate study in
Economics, Management,
Finance and the Social Sciences
This is an extract from a subject guide for an undergraduate course offered as part of the
University of London International Programmes in Economics, Management, Finance and
the Social Sciences. Materials for these programmes are developed by academics at the
London School of Economics and Political Science (LSE).
For more information, see: www.londoninternational.ac.uk
This guide was prepared for the University of London International Programmes by:
James S. Abdey, BA (Hons), MSc, PGCertHE, PhD, Department of Statistics, London School
of Economics and Political Science.
This is one of a series of subject guides published by the University. We regret that due
to pressure of work the author is unable to enter into any correspondence relating to, or
arising from, the guide. If you have any comments on this subject guide, favourable or
unfavourable, please use the form at the back of this guide.
Contents
1 Introduction
1.1 Route map to the guide
1.2 Introduction to the subject area
1.3 Syllabus
1.4 Aims of the course
1.5 Learning outcomes for the course
1.6 Overview of learning resources
1.6.1 The subject guide
1.6.2 Essential reading
1.6.3 Further reading
1.6.4 Online study resources (the Online Library and the VLE)
1.7 Examination advice
2 Probability theory
2.1 Synopsis of chapter content
2.2 Aims of the chapter
2.3 Learning outcomes
2.4 Essential reading
2.5 Introduction
2.6 Set theory: the basics
2.7 Axiomatic definition of probability
2.7.1 Basic properties of probability
2.8 Classical probability and counting rules
2.8.1 Combinatorial counting methods
2.9 Conditional probability and Bayes' theorem
2.9.1 Total probability formula
2.9.2 Bayes' theorem
2.10 Overview of chapter
2.11 Key terms and concepts
2.12 Learning activities
3 Random variables
3.1 Synopsis of chapter content
3.2 Aims of the chapter
3.3 Learning outcomes
3.4 Essential reading
3.5 Introduction
3.6 Discrete random variables
3.7 Continuous random variables
3.8 Overview of chapter
3.9 Key terms and concepts
3.10 Learning activities
3.11 Reminder of learning outcomes
3.12 Sample examination questions
Chapter 1
Introduction
By successfully completing this half course, you will understand the ideas of
randomness and variability, and the way in which they link to probability theory. This
will allow the use of a systematic and logical collection of statistical techniques of great
practical importance in many applied areas. The examples in this subject guide will
concentrate on the social sciences, but the methods are important for the physical
sciences too. This subject aims to provide a grounding in probability theory and some
of the most common statistical methods.
The material in ST104b Statistics 2 is necessary as preparation for other subjects
you may study later on in your degree. The full details of the ideas discussed in this
subject guide will not always be required in these other subjects, but you will need to
have a solid understanding of the main concepts. This can only be achieved by seeing
how the ideas emerge in detail.
For statistics, you need some familiarity with abstract mathematical ideas, as well as
the ability and common sense to apply these to real-life problems. The concepts you will
encounter in probability and statistical inference are hard to absorb by just reading
about them in a book. You need to read, then think a little, then try some problems,
and then read and think some more. This procedure should be repeated until the
problems are easy to do; you should not spend a long time reading and forget about
solving problems.
1.3 Syllabus
The syllabus of ST104b Statistics 2 is as follows:
Point estimation: Estimation criteria: bias, variance and mean squared error;
Method of moments estimation; Least squares estimation; Maximum likelihood
estimation.
Hypothesis testing: Setting p-value, significance level, test statistic; t tests;
General approach to statistical tests; Two types of error; Tests for normal variances;
Comparing two normal means with paired observations; Comparing two normal
means; Tests for correlation coefficients; Tests for the ratio of two normal variances.
apply and be competent users of standard statistical operators and be able to recall
a variety of well-known distributions and their respective moments
practice questions, especially for tricky topics. The subject guide is there mainly to
describe the syllabus and to show the level of understanding expected.
The subject guide is divided into chapters which should be worked through in the order
in which they appear. There is little point in rushing past material you only partly
understand to get to later chapters, as the presentation is somewhat sequential and not
a series of self-contained topics. You should be familiar with the earlier chapters and
have a solid understanding of them before moving on to the later ones.
The following procedure is recommended:
The last two steps are the most important. It is easy to think that you have understood
the material after reading it, but working through problems is the crucial test of
understanding. Problem-solving should take up most of your study time.
Each chapter of the subject guide has suggestions for reading from the main textbook.
Usually, you will only need to read the material in the main textbook (see ‘Essential
reading’ below), but it may be helpful from time to time to look at others.
Basic notation
We often use the symbol □ to denote the end of a proof, where we have finished
explaining why a particular result is true. This is just to make it clear where the proof
ends and the following text begins.
Time management
About one-third of your self-study time should be spent reading and the rest should be
spent solving problems. An internal student would expect maybe 15 hours of formal
teaching and another 50 hours of private study to be enough to cover the subject. Of
the 50 hours of private study, about 17 hours should be spent on the initial study of the
textbook and subject guide. The remaining 33 hours should be spent on attempting
problems, which may well require more reading.
Calculators
A calculator may be used when answering questions on the examination paper for
ST104b Statistics 2. It must comply in all respects with the specification given in the
Regulations. You should also refer to the admission notice you will receive when
entering the examination and the ‘Notice on permitted materials’.
Make sure you accustom yourself to using your chosen calculator and feel comfortable
with it. Specifically, calculators must be:

hand held

quiet in operation

non-programmable
The Regulations state: ‘The use of a calculator that communicates or displays textual
messages, graphical or algebraic information is strictly forbidden. Where a calculator is
permitted in the examination, it must be a non-scientific calculator. Where calculators
are permitted, only calculators limited to performing just basic arithmetic operations
may be used. This is to encourage candidates to show the Examiners the steps taken in
arriving at the answer.’
Computers
If you are aiming to carry out serious statistical analysis (which is beyond the level of
this course) you will probably want to use some statistical software package such as
Minitab, R or SPSS. It is not necessary for this course to have such software available,
but if you do have access to it you may benefit from using it in your study of the
material.
Statistical tables
Lindley, D.V. and W.F. Scott, New Cambridge Statistical Tables. (Cambridge:
Cambridge University Press, 1995) second edition [ISBN 978-0521484855].
These statistical tables are the same ones that are distributed for use in the
examination, so it is advisable that you become familiar with them, rather than those
at the end of a textbook.
1.6.3 Further reading
Please note that, as long as you read the Essential reading, you are then free to read
around the subject area in any text, paper or online resource. You will need to support
your learning by reading as widely as possible and by thinking about how these
principles apply in the real world. To help you read extensively, you have free access to
the virtual learning environment (VLE) and University of London Online Library (see
below).
Other useful texts for this course include:
Johnson, R.A. and G.K. Bhattacharyya, Statistics: Principles and Methods. (New
York: John Wiley and Sons, 2010) sixth edition [ISBN 9780470505779].
Larsen, R.J. and M.L. Marx, Introduction to Mathematical Statistics and Its
Applications (Pearson, 2013) fifth edition [ISBN 9781292023557].
While Newbold et al. is the main textbook for this course, there are many that are just
as good. You are encouraged to look at those listed above and at any others you may
find. It may be necessary to look at several textbooks for a single topic, as you may find
that the approach of one textbook suits you better than that of another.
1.6.4 Online study resources (the Online Library and the VLE)
In addition to the subject guide and the Essential reading, it is crucial that you take
advantage of the study resources that are available online for this course, including the
virtual learning environment (VLE) and the Online Library.
You can access the VLE, the Online Library and your University of London email
account via the Student Portal at:
http://my.londoninternational.ac.uk
You should have received your login details for the Student Portal with your official
offer, which was emailed to the address that you gave on your application form. You
have probably already logged in to the Student Portal in order to register! As soon as
you registered, you will automatically have been granted access to the VLE, Online
Library and your fully functional University of London email account.
If you forget your login details, please click on the ‘Forgotten your password’ link on the
login page.
The VLE
The VLE, which complements this subject guide, has been designed to enhance your
learning experience, providing additional support and a sense of community. It forms an
important part of your study experience with the University of London and you should
access it regularly.
The VLE provides a range of resources for EMFSS courses:
Self-testing activities: Doing these allows you to test your own understanding of the
subject material.
Electronic study materials: The printed materials that you receive from the
University of London are available to download, including updated reading lists
and references.
A student discussion forum: This is an open space for you to discuss interests and
experiences, seek support from your peers, work collaboratively to solve problems
and discuss subject material.
Videos: There are recorded academic introductions to the subject, interviews and
debates and, for some courses, audio-visual tutorials and conclusions.
Recorded lectures: For some courses, where appropriate, the sessions from previous
years’ Study Weekends have been recorded and made available.
Study skills: Expert advice on preparing for examinations and developing your
digital literacy skills.
Feedback forms.
Some of these resources are available for certain courses only, but we are expanding our
provision all the time and you should check the VLE regularly for updates.
The Online Library contains a huge array of journal articles and other resources to help
you read widely and extensively.
To access the majority of resources via the Online Library you will either need to use
your University of London Student Portal login details, or you will be required to
register and use an Athens login:
http://tinyurl.com/ollathens
The easiest way to locate relevant content and journal articles in the Online Library is
to use the Summon search engine.
If you are having trouble finding an article listed in a reading list, try removing any
punctuation from the title, such as single quotation marks, question marks and colons.
For further advice, please see the online help pages:
www.external.shl.lon.ac.uk/summon/about.php
Additional material
There is a lot of computer-based teaching material available freely over the web. A
fairly comprehensive list can be found in the ‘Books & Manuals’ section of
http://statpages.org
Unless otherwise stated, all websites in this subject guide were accessed in April 2014.
We cannot guarantee, however, that they will stay current and you may need to
perform an internet search to find the relevant pages.
Where available, past examination papers and Examiners' commentaries for the course
give advice on how each question might best be answered.
The examination is by a two-hour unseen question paper. No books may be taken into
the examination, but the use of calculators is permitted, and statistical tables and a
formula sheet are provided (the formula sheet can be found in past examination papers
available on the VLE).
The examination paper has a variety of questions, some quite short and others longer.
All questions must be answered correctly for full marks. You may use your calculator
whenever you feel it is appropriate, always remembering that the Examiners can give
marks only for what appears on the examination script. Therefore, it is important to
always show your working.
In terms of the examination, as always, it is important to manage your time carefully
and not to dwell on one question for too long – move on and focus on solving the easier
questions, coming back to harder ones later.
Chapter 2
Probability theory
explain the fundamental ideas of random experiments, sample spaces and events
list the axioms of probability and be able to derive all the common probability
rules from them
list the formulae for the number of combinations and permutations of k objects out
of n, and be able to routinely use such results in problems
prove the law of total probability and apply it to problems where there is a
partition of the sample space
Newbold, P., W.L. Carlson and B.M. Thorne, Statistics for Business and
Economics. (London: Prentice-Hall, 2012) eighth edition [ISBN 9780273767060]
Chapter 3.
2.5 Introduction
Consider the following hypothetical example: A country will soon hold a referendum
about whether it should join the European Union (EU). An opinion poll of a random
sample of people in the country is carried out.
950 respondents say that they plan to vote in the referendum. They answer the question
‘Will you vote Yes or No to joining the EU?’ as follows:
          Yes    No     Total
Count     513    437    950
%         54%    46%    100%
However, we are not interested in just this sample of 950 respondents, but in the
population that they represent, that is all likely voters.
Statistical inference will allow us to say things like the following about the
population:
‘The null hypothesis that π = 0.5, against the alternative hypothesis that π > 0.5,
is rejected at the 5% significance level.’
In short, the opinion poll gives statistically significant evidence that ‘Yes’ voters are in
the majority among likely voters. Such methods of statistical inference will be discussed
later in the course.
The inferential statements about the opinion poll rely on the following assumptions and
results:
10
2.6. Set theory: the basics
In the next few chapters, we will learn about the terms in bold, among others.
The need for probability in statistics
In statistical inference, the data that we have observed are regarded as a sample from a
broader population, selected with a random process:
Values in a sample are also random: We cannot predict the precise values that will
be observed before we actually collect the sample.
A preview of probability
Experiment: For example, rolling a single die and recording the outcome.
Sample space S: The set of all possible outcomes; here {1, 2, 3, 4, 5, 6}.
Event: Any subset A of the sample space, for example A = {4, 5, 6} or B = {1, 2, 3, 4, 5}.
1 ∈ A and 2 ∈ A.
6 ∉ A and 1.5 ∉ A.
The familiar Venn diagrams help to visualise statements about sets. However, Venn
diagrams are not formal proofs of results in set theory.
Example 2.3 In Figure 2.1, the darkest area in the middle is A ∩ B, the total
shaded area is A ∪ B, and the white area is (A ∪ B)c = Ac ∩ Bc.
Example 2.4 An example of the distinction between subsets and non-subsets is:
{1, 2, 3} ⊂ {1, 2, 3, 4}, because all elements appear in the larger set.
{1, 2, 5} ⊄ {1, 2, 3, 4}, because the element 5 does not appear in the larger set.
Two sets A and B are equal (A = B) if they have exactly the same elements. This
implies that A ⊂ B and B ⊂ A.
A ∪ B = {x | x ∈ A or x ∈ B}.
That is, the set of those elements which belong to A or B (or both). An example is
shown in Figure 2.3.
For example, if A = {1, 2, 3, 4}, B = {2, 3} and C = {4, 5, 6}, then:
A ∪ B = {1, 2, 3, 4}
A ∪ C = {1, 2, 3, 4, 5, 6}
B ∪ C = {2, 3, 4, 5, 6}.
A ∩ B = {x | x ∈ A and x ∈ B}.
That is, the set of those elements which belong to both A and B. An example is
shown in Figure 2.4.
A ∩ B = {2, 3}
A ∩ C = {4}
B ∩ C = ∅.
Both set operators can also be applied to more than two sets, such as A ∩ B ∩ C.
Concise notation for the unions and intersections of sets A1 , A2 , . . . , An is:
⋃_{i=1}^{n} Ai = A1 ∪ A2 ∪ · · · ∪ An   and   ⋂_{i=1}^{n} Ai = A1 ∩ A2 ∩ · · · ∩ An.
These can also be used for an infinite number of sets, i.e. when n is replaced by ∞.
Complement (‘not’)
Suppose S is the set of all possible elements which are under consideration. In
probability, S will be referred to as the sample space.
It follows that A ⊂ S for every set A we may consider. The complement of A with
respect to S is:
Ac = {x | x ∈ S and x ∉ A}.
That is, the set of those elements of S that are not in A. An example is shown in
Figure 2.5.
In proofs and derivations about sets, you can use the following results without proof:
Commutativity:
A ∩ B = B ∩ A and A ∪ B = B ∪ A.
Associativity:
A ∩ (B ∩ C) = (A ∩ B) ∩ C and A ∪ (B ∪ C) = (A ∪ B) ∪ C.
Distributive laws:
A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C) and A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C).
De Morgan’s laws: (A ∩ B)c = Ac ∪ Bc and (A ∪ B)c = Ac ∩ Bc (illustrated, along with the identities below, in the sketch after this list).
If S is the sample space and A and B are any sets in S, you can also use the following
results without proof:
∅c = S.
∅ ⊂ A, A ⊂ A and A ⊂ S.
A ∩ A = A and A ∪ A = A.
A ∩ Ac = ∅ and A ∪ Ac = S.
If B ⊂ A, A ∩ B = B and A ∪ B = A.
A ∩ ∅ = ∅ and A ∪ ∅ = A.
A ∩ S = A and A ∪ S = S.
∅ ∩ ∅ = ∅ and ∅ ∪ ∅ = ∅.
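These identities are easy to check numerically. Below is a minimal sketch using Python's built-in set type; the particular sets S (the universe), A, B and C are arbitrary choices made for illustration only.

```python
# A quick numerical check of the identities above, using Python's set type.
S = set(range(1, 11))                      # the sample space / universe
A, B, C = {1, 2, 3, 4}, {2, 3}, {4, 5, 6}  # arbitrary illustrative sets

def complement(X, universe=S):
    """Complement of X with respect to the universe S."""
    return universe - X

# Distributive laws.
assert A & (B | C) == (A & B) | (A & C)
assert A | (B & C) == (A | B) & (A | C)

# De Morgan's laws.
assert complement(A & B) == complement(A) | complement(B)
assert complement(A | B) == complement(A) & complement(B)

# Two of the listed properties involving the empty set and S.
assert A & complement(A) == set() and A | complement(A) == S
print("All identities hold for this choice of A, B and C.")
```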
Sets A1 , A2 , . . . , An are pairwise disjoint if all pairs of sets from them are disjoint,
i.e. Ai ∩ Aj = ∅ for all i ≠ j.
Partition
The sets A1, A2, . . . , An form a partition of the set A if they are pairwise disjoint and if ⋃_{i=1}^{n} Ai = A, that is, A1, A2, . . . , An are collectively exhaustive of A.

Therefore, a partition divides the entire set A into non-overlapping pieces Ai, as shown in Figure 2.6 for n = 3. Similarly, an infinite collection of sets A1, A2, . . . form a partition of A if they are pairwise disjoint and ⋃_{i=1}^{∞} Ai = A.
Figure 2.6: A partition of the set A into A1, A2 and A3.
For example, if A ⊂ B, then A and B ∩ Ac form a partition of B. We have:
A ∩ (B ∩ Ac ) = (A ∩ Ac ) ∩ B = ∅ ∩ B = ∅
and:
A ∪ (B ∩ Ac ) = (A ∪ B) ∩ (A ∪ Ac ) = B ∩ S = B.
2.7 Axiomatic definition of probability
Example 2.8 If the experiment is ‘select a trading day at random and record the
% change in the FTSE 100 index from the previous trading day’, then the outcome
is the % change in the FTSE 100 index.
S = [−100, +∞) for the % change in the FTSE 100 index (in principle).
An event of interest might be A = {x | x > 0} – the event that the daily change is
positive, i.e. the FTSE 100 index gains value from the previous trading day.
The sample space and events are represented as sets. For two events A and B, set
operations are then interpreted as follows:
Axioms of probability

Axiom 1: P (A) ≥ 0 for every event A.

Axiom 2: P (S) = 1.

Axiom 3: If A1, A2, . . . are pairwise disjoint (mutually exclusive) events, then P (A1 ∪ A2 ∪ · · ·) = P (A1) + P (A2) + · · ·.
2 The precise definition also requires a careful statement of which subsets of S are allowed as events; we can skip that on this course.
The axioms require that a probability function must always satisfy these three requirements.
All other properties of the probability function can be derived from the axioms. We
begin by showing that a result like Axiom 3 also holds for finite collections of mutually
exclusive sets.
Probability property: P (∅) = 0.

Applying Axiom 3 to the pairwise disjoint sets ∅, ∅, . . . gives P (∅) = P (∅) + P (∅) + · · ·. But the only real number for P (∅) which satisfies this is P (∅) = 0.
Probability property: If A1, A2, . . . , An are pairwise disjoint (mutually exclusive) events, then P (A1 ∪ A2 ∪ · · · ∪ An) = P (A1) + P (A2) + · · · + P (An).
Figure 2.7: Venn diagram depicting three mutually exclusive sets, A1 , A2 and A3 . Note
although A2 and A3 have touching boundaries, there is no actual intersection and hence
they are (pairwise) mutually exclusive.
That is, we can simply sum probabilities of mutually exclusive sets. This is very useful
for deriving further results.
Probability property: P (Ac) = 1 − P (A) for every event A. This follows from the previous property, since A and Ac are disjoint and A ∪ Ac = S, so P (A) + P (Ac) = P (S) = 1.

Probability property: P (A) ≤ 1 for every event A.

Proof (by contradiction): If it was true that P (A) > 1 for some A, then we would have P (Ac) = 1 − P (A) < 0. This violates Axiom 1, so cannot be true. Therefore it must be that P (A) ≤ 1 for all A.
Putting this and Axiom 1 together, we get:
0 ≤ P (A) ≤ 1
Probability property: If A ⊂ B, then P (A) ≤ P (B).

Proof: If A ⊂ B, then A and B ∩ Ac form a partition of B, so P (B) = P (A) + P (B ∩ Ac) ≥ P (A),
since P (B ∩ Ac) ≥ 0.
These show that the probability function has the kinds of values we expect of something
called a ‘probability’.
P (Ac ) = 1 − P (A).
86% spend at least 1 hour watching television (event A, with P (A) = 0.86).
19% spend at least 1 hour reading newspapers (event B, with P (B) = 0.19).
15% spend at least 1 hour watching television, and at least 1 hour reading
newspapers (P (A ∩ B) = 0.15).
Probability theory tells us how to work with the probability function and derive
‘probabilities of events’ from it. However, it does not tell us what ‘probability’ really
means.
There are several alternative interpretations of the real-world meaning of ‘probability’
in this sense. One of them is outlined below. The mathematical theory of probability
and calculations on probabilities are the same whichever interpretation we assign to
‘probability’. So, in this course, we do not need to discuss the matter further.
Example 2.10 How should we interpret the following, as statements about the real
world of coins and babies?
‘The probability that a tossed coin comes up heads is 0.5.’ If we tossed a coin a
large number of times, and the proportion of heads out of those tosses was 0.5,
the ‘probability of heads’ could be said to be 0.5, for that coin.
‘The probability is 0.51 that a child born in the UK today is a boy.’ If the
proportion of boys among a large number of live births was 0.51, the
‘probability of a boy’ could be said to be 0.51.
A key question is how to determine appropriate numerical values P (A) for the
probabilities of particular events.
This is usually done empirically, by observing actual realisations of the experiment and
using them to estimate probabilities. In the simplest cases, this basically applies the
frequency definition to observed data.
Example 2.11
If I toss a coin 10,000 times, and 5,023 of the tosses come up heads, it seems
that, approximately, P (heads) = 0.5, for that coin.
Of the 7,098,667 live births in England and Wales in the period 1999–2009,
51.26% were boys. So we could assign the value of about 0.51 to the probability
of a boy in that population.
We will use these often, not because they are particularly important but because they
provide simple examples for illustrating various results in probability.
Suppose that the sample space S contains m equally likely outcomes, and that event A
consists of k ≤ m of these outcomes. Then:
P (A) = k/m = (number of outcomes in A)/(total number of outcomes in sample space S).
That is, the probability of A is the proportion of outcomes that belong to A out of all
possible outcomes.
In the classical case, the probability of any event can be determined by counting the
number of outcomes that belong to the event, and the total number of possible
outcomes.
Example 2.12 Rolling two dice, what is the probability that the sum of the two
scores is 5?
S = {(1, 1), (1, 2), (1, 3), (1, 4) , (1, 5), (1, 6),
(2, 1), (2, 2), (2, 3) , (2, 4), (2, 5), (2, 6),
(3, 1), (3,2) , (3, 3), (3, 4), (3, 5), (3, 6),
(4,1) , (4, 2), (4, 3), (4, 4), (4, 5), (4, 6),
(5, 1), (5, 2), (5, 3), (5, 4), (5, 5), (5, 6),
(6, 1), (6, 2), (6, 3), (6, 4), (6, 5), (6, 6)}.
Outcomes in the event: A = {(1, 4), (2, 3), (3, 2), (4, 1)}.
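The event therefore contains 4 of the 36 equally likely outcomes, so P (A) = 4/36 = 1/9. A minimal sketch that confirms this by brute-force enumeration (Python is used here purely for illustration):

```python
# Enumerate the 36 equally likely outcomes of rolling two dice and count
# those whose sum is 5, as in the example above.
from fractions import Fraction
from itertools import product

sample_space = list(product(range(1, 7), repeat=2))   # 36 outcomes
event_A = [outcome for outcome in sample_space if sum(outcome) == 5]

print(event_A)                                         # [(1, 4), (2, 3), (3, 2), (4, 1)]
print(Fraction(len(event_A), len(sample_space)))       # 1/9
```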
Now that we have a way of obtaining probabilities for events in the classical case, we
can use it together with the rules of probability.
The formula P (A) = 1 − P (Ac) is convenient when we want P (A) but the probability of
the complementary event Ac, i.e. P (Ac), is easier to find.
Example 2.13 When rolling two fair dice, what is the probability that the sum of
the dice is greater than 3?
The complement is that the sum is at most 3, i.e. the complementary event is
Ac = {(1, 1), (1, 2), (2, 1)}, so P (Ac) = 3/36 and hence P (A) = 1 − 3/36 = 33/36 = 11/12.
The formula:
P (A ∪ B) = P (A) + P (B) − P (A ∩ B)
says that the probability that A or B happens (or both happen) is the sum of the
probabilities of A and B, minus the probability that both A and B happen.
Example 2.14 When rolling two fair dice, what is the probability that the two
scores are equal (event A) or that the total score is greater than 10 (event B)?
Here P (A) = 6/36, P (B) = 3/36 and P (A ∩ B) = 1/36, so P (A ∪ B) = P (A) + P (B) − P (A ∩ B) = 8/36 = 2/9.
Whether the selection is with replacement (an object can be selected more than
once) or without replacement (an object can be selected only once).
with replacement, so that each of the n objects may appear several times in the
selection.
Then:
n objects are available for selection for the 1st object in the sequence
n objects are available for selection for the 2nd object in the sequence
. . . and so on, until n objects are available for selection for the kth object in the
sequence.
The number of possible ordered sequences of k objects selected with replacement from n
objects is therefore:
n × n × · · · × n = n^k   (k factors).
Suppose that the selection of k objects out of n is again treated as an ordered sequence,
but that selection is now:
Now:
n objects are available for selection for the 1st object in the sequence
. . . and so on, until n − k + 1 objects are available for selection for the kth object.
n × (n − 1) × · · · × (n − k + 1). (2.2)
Factorials
The number of ordered sets of n objects, selected without replacement from n objects,
is:
n! = n × (n − 1) × · · · × 2 × 1.
The number n! (read ‘n factorial’) is the total number of different ways in which
n objects can be arranged in an ordered sequence. This is known as the number of
permutations of n objects.
We also define 0! = 1.
n × (n − 1) × · · · × (n − k + 1) = n!/(n − k)!.
Unordered sets, without replacement
Suppose now that the identities of the objects in the selection matter, but the order
does not.
For example, the sequences (1, 2, 3), (1, 3, 2), (2, 1, 3), (2, 3, 1), (3, 1, 2), (3, 2, 1) are
now all treated as the same, because they all contain the elements 1, 2 and 3.
Example 2.15 Suppose we have k = 3 people (Amy, Bob and Sam). How many
different sets of birthdays can they have (day and month, ignoring the year, and
pretending February 29th does not exist, so that n = 365), in the following cases?
1. It makes a difference who has which birthday (ordered ), i.e. Amy (January 1st),
Bob (May 5th) and Sam (December 5th) is different from Amy (May 5th), Bob
(December 5th) and Sam (January 1st), and different people can have the same
birthday (with replacement). The number of different sets of birthdays is:
365^3 = 48,627,125.
2. It makes a difference who has which birthday (ordered ), and different people
must have different birthdays (without replacement). The number of different
sets of birthdays is:
365!/(365 − 3)! = 365 × 364 × 363 = 48,228,180.
3. Only the dates matter, but not who has which one (unordered ), i.e. Amy
(January 1st), Bob (May 5th) and Sam (December 5th) is treated as the same
as Amy (May 5th), Bob (December 5th) and Sam (January 1st), and different
people must have different birthdays (without replacement). The number of
different sets of birthdays is:

365!/((365 − 3)! 3!) = (365 × 364 × 363)/(3 × 2 × 1) = 8,038,030.
Example 2.16 Consider a room with r people in it. What is the probability that
at least two of them have the same birthday (call this event A)? In particular, what
is the smallest r for which P (A) > 1/2?
Assume that all days are equally likely.
Label the people 1 to r, so that we can treat them as an ordered list and talk about
person 1, person 2 etc. We want to know how many ways there are to assign
birthdays to this list of people. We note the following:
1. The number of all possible sequences of birthdays, allowing repeats (i.e. with
replacement) is 365^r.
2. The number of sequences where all birthdays are different (i.e. without
replacement) is 365!/(365 − r)!.
Here ‘1.’ is the size of the sample space, and ‘2.’ is the number of outcomes which
satisfy Ac , the complement of the case in which we are interested.
Therefore:

P (Ac) = [365!/(365 − r)!]/365^r = (365 × 364 × · · · × (365 − r + 1))/365^r

and:

P (A) = 1 − P (Ac) = 1 − (365 × 364 × · · · × (365 − r + 1))/365^r.
Probabilities, for P (A), of at least two people sharing a birthday, for different values
of the number of people r are given in the following table:
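These probabilities are straightforward to compute. The sketch below evaluates P (A) for a few illustrative values of r (the particular values chosen are an assumption, not taken from the original table) and confirms that the smallest r for which P (A) > 1/2 is r = 23.

```python
# Birthday problem: P(A) = 1 - (365 * 364 * ... * (365 - r + 1)) / 365^r.
def p_shared_birthday(r: int) -> float:
    p_all_different = 1.0
    for i in range(r):
        p_all_different *= (365 - i) / 365
    return 1 - p_all_different

for r in (5, 10, 20, 22, 23, 30, 50):
    print(f"r = {r:2d}: P(A) = {p_shared_birthday(r):.3f}")

smallest = next(r for r in range(1, 366) if p_shared_birthday(r) > 0.5)
print("Smallest r with P(A) > 1/2:", smallest)   # 23
```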
2.9 Conditional probability and Bayes’ theorem
This section introduces conditional probability and Bayes’ theorem, which are used for
updating probabilities of events after we learn that some other events have happened.
Independence

Two events A and B are (statistically) independent if:

P (A ∩ B) = P (A) P (B).

Intuitively, this means that:
if A happens, this does not affect the probability of B happening (and vice versa)
if you are told that A has happened, this does not give you any new information
about the value of P (B) (and vice versa).
For example, independence is often a reasonable assumption when A and B
correspond to physically separate experiments.
Example 2.17 Suppose we roll two dice. We assume that all combinations of the
values of them are equally likely. Define the events:
Then:
Note that there is a difference between pairwise independence and full independence.
The following example illustrates.
Example 2.18 It can be cold in London. Four impoverished teachers dress to feel
warm. Teacher A has a hat and a scarf and gloves, Teacher B has only a hat,
Teacher C has only a scarf and Teacher D has only gloves. One teacher out of the
four is selected at random. Show that although each pair of events H = ‘the teacher
selected has a hat’, S = ‘the teacher selected has a scarf’, and G = ‘the teacher
selected has gloves’ are independent, all three of these events are not independent.
Two teachers have a hat, two teachers have a scarf, and two teachers have gloves, so:
P (H) = 2/4 = 1/2, P (S) = 2/4 = 1/2 and P (G) = 2/4 = 1/2.
Only one teacher has both a hat and a scarf, so:
P (H ∩ S) = 1/4
and similarly P (H ∩ G) = 1/4 and P (S ∩ G) = 1/4.
From these results, we can verify that:
P (H ∩ S) = P (H) P (S)
P (H ∩ G) = P (H) P (G)
P (S ∩ G) = P (S) P (G)
and so the events are pairwise independent. But one teacher has a hat, a scarf and
gloves, so:
P (H ∩ S ∩ G) = 1/4 ≠ P (H) P (S) P (G).
Hence the three events are not independent. If the selected teacher has a hat and a
scarf, then we know that the teacher has gloves. There is no independence for all
three events together.
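The same conclusion can be checked by direct enumeration. In the sketch below each teacher is represented by the set of items they own; since one of the four teachers is selected at random, every probability is a count divided by 4.

```python
from fractions import Fraction

teachers = [{"hat", "scarf", "gloves"},   # Teacher A
            {"hat"},                      # Teacher B
            {"scarf"},                    # Teacher C
            {"gloves"}]                   # Teacher D

def prob(*items):
    """P(the randomly selected teacher owns all of the given items)."""
    count = sum(1 for t in teachers if all(item in t for item in items))
    return Fraction(count, len(teachers))

# Pairwise independence holds...
assert prob("hat", "scarf") == prob("hat") * prob("scarf")
assert prob("hat", "gloves") == prob("hat") * prob("gloves")
assert prob("scarf", "gloves") == prob("scarf") * prob("gloves")

# ...but the three events together are not independent.
assert prob("hat", "scarf", "gloves") != prob("hat") * prob("scarf") * prob("gloves")
print(prob("hat", "scarf", "gloves"))     # 1/4, not 1/8
```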
The idea of independent events is quite different from that of mutually exclusive
(disjoint) events, as shown in Figure 2.8.
Conditional probability
Consider two events A and B. Suppose that you are told that B has occurred. How
does this affect the probability of event A?
The answer is given by the conditional probability of A given that B has occurred,
or the conditional probability of A given B for short, defined as:
P (A | B) = P (A ∩ B)/P (B)
assuming that P (B) > 0. The conditional probability is not defined if P (B) = 0.
Example 2.19 Suppose we roll two independent fair dice again. Consider the
following events:
These are shown in Figure 2.9. Now P (A) = 11/36 ≈ 0.31, P (B) = 15/36 and
P (A ∩ B) = 2/36. The conditional probability of A given B is therefore:

P (A | B) = P (A ∩ B)/P (B) = (2/36)/(15/36) = 2/15 ≈ 0.13.
Example 2.20 In Example 2.19, when we are told that the conditioning event B
has occurred, we know we are within the green line in Figure 2.9. So the 15
outcomes within it become the new sample space. There are 2 outcomes which
satisfy A and which are inside this new sample space, so:
P (A | B) = 2/15 = (number of cases of A within B)/(number of cases of B).
If A ⊥⊥ B, i.e. P (A ∩ B) = P (A) P (B), and P (B) > 0 and P (A) > 0, then:
P (A | B) = P (A ∩ B)/P (B) = P (A) P (B)/P (B) = P (A)

and:

P (B | A) = P (A ∩ B)/P (A) = P (A) P (B)/P (A) = P (B).
In other words, if A and B are independent, learning that B has occurred does not
change the probability of A, and learning that A has occurred does not change the
probability of B. This is exactly what we would expect under independence.
P (A ∩ B) = P (A | B) P (B) = P (B | A) P (A).

That is, the probability that both A and B occur is the probability that A occurs given
that B has occurred multiplied by the probability that B occurs. An intuitive graphical
version of this is a two-step ‘path’: first B occurs, and then A occurs given that B has
occurred. The two forms of the chain rule give the same probability,
and you can use whichever is more convenient. Very often some version of this chain
rule is much easier than calculating P (A ∩ B) directly.
The chain rule generalises to multiple events:

P (A1 ∩ A2 ∩ · · · ∩ An) = P (A1) P (A2 | A1) P (A3 | A1, A2) · · · P (An | A1, . . . , An−1)
where, for example, P (A3 | A1 , A2 ) is shorthand for P (A3 | A1 ∩ A2 ). The events can be
taken in any order, as shown in Example 2.21.
Example 2.22 Suppose you draw 4 cards from a deck of 52 playing cards. What is
the probability of A = ‘the cards are the 4 aces (cards of rank 1)’ ?
We could calculate this using counting rules. There are 52!/(48! 4!) = 270,725 possible
subsets of 4 different cards, and only 1 of these consists of the 4 aces. Therefore
P (A) = 1/270,725.
Let us try with conditional probabilities. Define Ai as ‘the ith card is an ace’, so
that A = A1 ∩ A2 ∩ A3 ∩ A4 . The necessary probabilities are:
P (A1 ) = 4/52 since there are initially 4 aces in the deck of 52 playing cards.
P (A2 | A1 ) = 3/51. If the first card is an ace, 3 aces remain in the deck of 51
playing cards from which the second card will be drawn.
P (A3 | A1 , A2 ) = 2/50.
P (A4 | A1 , A2 , A3 ) = 1/49.
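Multiplying these conditional probabilities gives the same answer as the counting argument, P (A) = 1/270,725. A minimal sketch of both calculations:

```python
from fractions import Fraction
from math import comb

# Counting-rule answer: one favourable subset out of C(52, 4) possible hands.
counting_answer = Fraction(1, comb(52, 4))

# Chain rule: P(A1) P(A2 | A1) P(A3 | A1, A2) P(A4 | A1, A2, A3).
chain_rule_answer = (Fraction(4, 52) * Fraction(3, 51)
                     * Fraction(2, 50) * Fraction(1, 49))

assert counting_answer == chain_rule_answer
print(chain_rule_answer)                  # 1/270725
```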
Here we could obtain the result in two ways. However, there are very many situations
where classical probability and counting rules are not usable, whereas conditional
probabilities and the chain rule are completely general and always applicable.
We now return to probabilities of partitions like the situation shown in Figure 2.10.
Figure 2.10: On the left, a Venn diagram depicting A = A1 ∪ A2 ∪ A3 , and on the right
the ‘paths’ to A.
Both diagrams in Figure 2.10 represent the partition A = A1 ∪ A2 ∪ A3 . For the next
results, it will be convenient to use diagrams like the one on the right in Figure 2.10,
where A1 , A2 and A3 are symbolised as different ‘paths’ to A.
We now develop powerful methods of calculating sums like:

P (A) = P (A ∩ B1) + P (A ∩ B2) + · · · + P (A ∩ BK)

where B1, B2, . . . , BK form a partition of the sample space, and where each term is obtained from the chain rule as:

P (A ∩ Bi) = P (A | Bi) P (Bi).

Combining these gives the total probability formula:

P (A) = Σ_{i=1}^{K} P (A | Bi) P (Bi).
Figure 2.11: On the left, a Venn diagram depicting the set A and the partition of S, and
on the right the ‘paths’ to A.
Example 2.23 Any event B has the property that B and its complement B c
partition the sample space. So if we take K = 2, B1 = B and B2 = B c in the formula
of total probability, we get:
P (A) = P (A | B) P (B) + P (A | B c ) P (B c )
= P (A | B) P (B) + P (A | B c ) [1 − P (B)].
Example 2.24 Suppose that 1 in 10,000 people (0.01%) has a particular disease. A
diagnostic test for the disease has 99% sensitivity: If a person has the disease, the
test will give a positive result with a probability of 0.99. The test has 99%
specificity: If a person does not have the disease, the test will give a negative result
with a probability of 0.99.
Let B denote the presence of the disease, and B c denote no disease. Let A denote a
positive test result. We want to calculate P (A).
The probabilities we need are P (B) = 0.0001, P (B c ) = 0.9999, P (A | B) = 0.99 and
P (A | B c ) = 0.01, and therefore:
P (A) = P (A | B) P (B) + P (A | B c ) P (B c )
= 0.99 × 0.0001 + 0.01 × 0.9999
= 0.010098.
What is the probability that we got there via, say, B1 ? In other words, what is the
conditional probability P (B1 | A)? This situation is depicted in Figure 2.13.
So we need:

P (Bj | A) = P (A ∩ Bj)/P (A)
and we already know how to get this:
Bayes’ theorem
Using the chain rule and the total probability formula, we have:
P (Bj | A) = P (A | Bj) P (Bj) / Σ_{i=1}^{K} P (A | Bi) P (Bi).
Example 2.25 Continuing with Example 2.24, let B denote the presence of the
disease, B c denote no disease, and A denote a positive test result.
We want to calculate P (B | A), i.e. the probability that a person has the disease,
given that the person has received a positive test result.
The probabilities we need are P (B) = 0.0001, P (Bc) = 0.9999, P (A | B) = 0.99 and P (A | Bc) = 0.01, as in Example 2.24. Then:
P (B | A) = P (A | B) P (B) / [P (A | B) P (B) + P (A | Bc) P (Bc)] = (0.99 × 0.0001)/0.010098 ≈ 0.0098.
Why is this so small? The reason is because most people do not have the disease and
the test has a small, but non-zero, false positive rate P (A | B c ). Therefore most
positive test results are actually false positives.
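The two calculations fit together naturally in code. A minimal sketch, using the prevalence, sensitivity and specificity stated above:

```python
# Total probability formula for P(A), then Bayes' theorem for P(B | A).
p_disease = 0.0001      # P(B): prevalence of the disease
sensitivity = 0.99      # P(A | B): positive result given disease
specificity = 0.99      # P(negative | no disease)

p_positive = sensitivity * p_disease + (1 - specificity) * (1 - p_disease)
p_disease_given_positive = sensitivity * p_disease / p_positive

print(round(p_positive, 6))                  # 0.010098
print(round(p_disease_given_positive, 4))    # approximately 0.0098
```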
Example 2.26 You are waiting for your bag at the baggage return carousel of an
airport. Suppose that you know that there are 200 bags to come from your flight,
and you are counting the distinct bags that come out. Suppose that x bags have
arrived, and your bag is not among them. What is the probability that your bag will
not arrive at all, i.e. that it has been lost (or at least delayed)?
Define A = ‘your bag has been lost’ and x = ‘your bag is not among the first x bags
to arrive’. What we want to know is the conditional probability P (A | x) for any
x = 0, 1, 2, . . . , 200. The conditional probabilities the other way round are:
P (x | A) = 1 for all x. If the bag has been lost, it will not arrive!

P (x | Ac) = (200 − x)/200. If the bag has not been lost, it is equally likely to be anywhere in the arrival order, so the probability that it is not among the first x bags is (200 − x)/200.
Therefore, by Bayes’ theorem:

P (A | x) = P (x | A) P (A) / [P (x | A) P (A) + P (x | Ac) P (Ac)] = P (A) / (P (A) + [(200 − x)/200] [1 − P (A)]).
Obviously, P (A | 200) = 1. If the bag has not arrived when all 200 have come out, it
has been lost!
For other values of x we need P (A). This is the general probability that a bag gets
lost, before you start observing the arrival of the bags from your particular flight.
This kind of probability is known as the prior probability of an event A.
Let us assign values to P (A) based on some empirical data. Statistics by the
Association of European Airlines (AEA) show how many bags were ‘mishandled’ per
1,000 passengers the airlines carried. This is not exactly what we need (since not all
passengers carry bags, and some have several), but we will use it anyway. In
particular, we will compare the results for the best and the worst of the AEA in 2006:
Air Malta: P (A) = 0.0044.
For Air Malta, P (A | 199) = 0.469. So even when only 1 bag remains to arrive,
the probability is less than 0.5 that your bag has been lost.
For British Airways, P (A | 199) = 0.825. Also, we see that P (A | 197) = 0.541 is
the first probability over 0.5.
Figure 2.14: Plot of P (A | x) as a function of x for the two airlines in Example 2.26, Air
Malta and British Airways (BA).
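A short calculation reproduces the kind of values plotted in Figure 2.14. The Air Malta prior 0.0044 is taken from the text; the British Airways prior used below (0.023) is an assumption chosen to be consistent with the quoted value P (A | 199) = 0.825, not a figure from the original.

```python
# P(A | x) = P(A) / (P(A) + ((200 - x)/200) * (1 - P(A))) for a flight with
# 200 bags, evaluated at a few illustrative values of x.
def p_lost_given_x(prior: float, x: int, n_bags: int = 200) -> float:
    return prior / (prior + ((n_bags - x) / n_bags) * (1 - prior))

priors = {"Air Malta": 0.0044,
          "British Airways": 0.023}   # assumed prior, see note above

for airline, prior in priors.items():
    for x in (0, 100, 190, 199):
        print(f"{airline}: P(A | {x}) = {p_lost_given_x(prior, x):.3f}")
```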
2.12 Learning activities
2. Write out all the events for the sample space S = {a, b, c}. (There are eight of
them.)
4. If all elementary outcomes are equally likely, S = {a, b, c, d}, A = {a, b, c} and
B = {c, d}, find P (A | B) and P (B | A).
5. Suppose that we toss a fair coin twice. The sample space is therefore
S = {HH, HT, T H, T T }, where the elementary outcomes are defined in the
obvious way – for instance HT is heads on the first toss and tails on the second
toss. Show that if all four elementary outcomes are equally likely, then the events
‘heads on the first toss’ and ‘heads on the second toss’ are independent.
6. Show that if A and B are disjoint events, and are also independent, then P (A) = 0
or P (B) = 0. (Notice that independence and disjointness are not similar ideas.)
7. Write down the condition for the three events A, B and C to be independent.
9. A statistics teacher knows from past experience that a student who does homework
consistently has a probability of 0.95 of passing the examination, whereas a student
who does not do homework at all has a probability of 0.30 of passing the
examination.
(a) If 25% of students do their homework consistently, what percentage can expect
to pass the examination?
(b) If a student chosen at random from the group gets a pass in the examination,
what is the probability that the student had done homework consistently?
10. Plagiarism is a serious problem for assessors of coursework. One check on
plagiarism is to compare the coursework with a standard textbook. If the
coursework has plagiarised that textbook, then there will be a 95% chance of
finding exactly two phrases which are the same in both the coursework and
textbook, and a 5% chance of finding three or more phrases which are the same. If
the work is not plagiarised, then these probabilities are both 50%.
Suppose that 5% of coursework is plagiarised. An assessor chooses some coursework
at random.
(a) What is the probability that it has been plagiarised if it has exactly two
phrases in the textbook? (Try making a guess before doing the calculation.)
(b) Repeat (a) for three phrases in the textbook?
Did you manage to get a roughly correct guess of these results before calculating?
11. A box contains three red balls and two green balls. Two balls are taken from it
without replacement.
(a) What is the probability that none of the balls taken is red?
(b) Repeat (a) for 1 ball and 2 balls.
(c) Show that the probability that the first ball taken is red is the same as the
probability that the second ball taken is red.
12. Amy, Bob and Claire throw a fair die in that order until a six appears. The person
who throws the first six wins. What are their respective chances of winning?
13. In men’s singles tennis, matches are played on the best-of-five-sets principle.
Therefore the first player to win three sets wins the match, and a match may
consist of three, four or five sets. Assuming that two players are perfectly evenly
matched, and that sets are independent events, calculate the probabilities that a
match lasts three sets, four sets and five sets, respectively.
Solutions to these questions can be found on the VLE in the ST104b Statistics 2 area
at http://my.londoninternational.ac.uk
explain the fundamental ideas of random experiments, sample spaces and events
list the axioms of probability and be able to derive all the common probability
rules from them
list the formulae for the number of combinations and permutations of k objects out
of n, and be able to routinely use such results in problems
explain conditional probability and the concept of independent events
prove the law of total probability and apply it to problems where there is a
partition of the sample space
prove Bayes’ theorem and apply it to find conditional probabilities.
3. (a) Show that if A and B are independent events in a sample space, then Ac and
B c are also independent.
(b) Show that if X and Y are mutually exclusive events in a sample space, then
X c and Y c are not in general mutually exclusive.
4. In a game of tennis, each point is won by one of the two players A and B. The
usual rules of scoring for tennis apply. That is, the winner of the game is the player
who first scores four points, unless each player has won three points, when deuce is
called and play proceeds until one player is two points ahead of the other and
hence wins the game.
A is serving and has a probability of winning any point of 2/3. The result of each
point is assumed to be independent of every other point.
(a) Show that the probability of A winning the game without deuce being called is
496/729.
(b) Find the probability of deuce being called.
(c) If deuce is called, show that A’s subsequent probability of winning the game is
4/5.
(d) Hence determine A’s overall probability of winning the game.
Solutions to these questions can be found on the VLE in the ST104b Statistics 2 area
at http://my.londoninternational.ac.uk
Chapter 3
Random variables
3.1 Synopsis of chapter content
This chapter introduces the concept of random variables and probability distributions.
These distributions are univariate, which means that they are used to model a single
numerical quantity. The concepts of expected value and variance are also discussed.
define a random variable and distinguish it from the values that it takes
explain the difference between discrete and continuous random variables
find the mean and the variance of simple random variables whether discrete or
continuous
demonstrate how to proceed and use simple properties of expected values and
variances.
3.5 Introduction
In ST104a Statistics 1, we considered descriptive statistics for a sample of
observations of a variable X. Here we will represent the observations as a sequence of
variables, denoted as:
X1 , X2 , . . . , Xn
where n is the sample size.
In statistical inference, the observations will be treated as a sample drawn at random
from a population. We will then think of each observation Xi of a variable X as an
outcome of an experiment.
The experiment is ‘select a unit at random from the population and record its
value of X’.
Because variables X in statistical data are recorded as numbers, we can now focus on
experiments where the outcomes are also numbers – random variables.
Random variable
A random variable is an experiment for which the outcomes are numbers.1 This
means that for a random variable:
The outcomes are numbers in this sample space. Instead of ‘outcomes’, we often
call them the values of the random variable.
There are two main types of random variables, depending on the nature of S, i.e. the
possible values of the random variable.
1 This definition is a bit informal, but it is sufficient for this course.
2 Strictly speaking, a discrete random variable is not just a random variable which is not continuous, as there are many others, such as mixture distributions.
Notation
P (a < X < b) denotes the probability that X is between the numbers a and b.
You will notice that many of the quantities we define for random variables are
analogous to sample quantities defined in ST104a Statistics 1.
Example 3.1 The following two examples will be used throughout this chapter.
Example 3.2 Consider the following probability distribution for the household
size, X.
Number of people
in household (x) P (X = x)
1 0.3002
2 0.3417
3 0.1551
4 0.1336
5 0.0494
6 0.0145
7 0.0034
8 0.0021
Probability function

The probability function (pf) of a discrete random variable X is the function p(x) = P (X = x).
We can talk of p(x) both as the pf of the random variable X, and as the pf of the
probability distribution of X. Both mean the same thing.
Alternative terminology: the pf of a discrete random variable is also often called the
probability mass function (pmf).
Alternative notation: instead of p(x), the pf is also often denoted by, for example, pX (x)
– especially when it is necessary to indicate clearly to which random variable the
function corresponds.
The pf satisfies the following conditions:

1. p(xi) ≥ 0 for all xi ∈ S.

2. Σ_{xi∈S} p(xi) = 1, i.e. the sum of probabilities of all possible values of X is 1.

The pf is defined for all real numbers x, but p(x) = 0 for any x ∉ S, i.e. for any
value x that is not one of the possible values of X.
Example 3.3 Continuing Example 3.2, here we can simply list all the values:

p(x) = 0.3002 if x = 1
       0.3417 if x = 2
       0.1551 if x = 3
       0.1336 if x = 4
       0.0494 if x = 5
       0.0145 if x = 6
       0.0034 if x = 7
       0.0021 if x = 8
       0      otherwise.

These are clearly all non-negative, and their sum is Σ_{x=1}^{8} p(x) = 1.
A graphical representation of the pf is shown in Figure 3.1.
For the next example, we need to remember the following results from mathematics,
concerning sums of geometric series. If r ≠ 1, then:

Σ_{x=0}^{n−1} a r^x = a(1 − r^n)/(1 − r)

and if |r| < 1, then:

Σ_{x=0}^{∞} a r^x = a/(1 − r).
Example 3.4 In the basketball example, the number of possible values is infinite,
so we cannot simply list the values of the pf. So we try to express it as a formula.
Suppose that the probability of success at each free throw is π, and that the outcomes of different throws are independent.
Then the probability that the first success occurs after x failures is the probability of
a sequence of x failures followed by a success, i.e. the probability is:
(1 − π)^x π.
So the pf of the random variable X (the number of failures before the first success) is:

p(x) = (1 − π)^x π for x = 0, 1, 2, . . . , and p(x) = 0 otherwise     (3.1)
where 0 ≤ π ≤ 1. Let us check that (3.1) satisfies the conditions for a pf.
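As a minimal numerical sketch of that check (truncating the infinite sum at a large cut-off, which is enough to see it converge to 1):

```python
# Check the pf conditions for (3.1): p(x) >= 0 and the probabilities sum to 1.
def p(x: int, pi: float) -> float:
    """P(first success occurs after x failures)."""
    return (1 - pi) ** x * pi

for pi in (0.7, 0.3):                        # good and poor shooters
    probs = [p(x, pi) for x in range(200)]   # truncated support
    assert all(q >= 0 for q in probs)
    print(pi, round(sum(probs), 10))         # both sums equal 1 to 10 d.p.
```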
Figure 3.2: Probability function for Example 3.4. π = 0.7: a fairly good free-throw shooter.
π = 0.3: a pretty poor free-throw shooter.
The cumulative distribution function (cdf) of a discrete random variable X is defined as F (x) = P (X ≤ x) = Σ_{xi ≤ x} p(xi), i.e. the sum of the probabilities of those possible values of X that are less than or equal to x.
Example 3.5 Continuing with the household size example, values of F (x) at all
possible values of X are:
Number of people
in household (x) p(x) F (x)
1 0.3002 0.3002
2 0.3417 0.6419
3 0.1551 0.7970
4 0.1336 0.9306
5 0.0494 0.9800
6 0.0145 0.9945
7 0.0034 0.9979
8 0.0021 1.0000
we can write:

F (x) = 0 when x < 0, and F (x) = 1 − (1 − π)^(x+1) when x = 0, 1, 2, . . . .
The cdf F (x) of a discrete random variable X is a step function such that:

F (x) is constant between the possible values xi of X, and jumps by p(xi) at each possible value xi.

At such an xi, the value of F (xi) is the value at the top of the jump (i.e. F (x) is right-continuous).
Either the pf or the cdf can be used to calculate the probabilities of any events for a
discrete random variable.
Example 3.7 Continuing with the household size example (for the probabilities,
see Example 3.5), then:
The expected value (or mean) of X is denoted E(X), and defined as:

E(X) = Σ_{xi∈S} xi p(xi).

This can also be written more concisely as E(X) = Σ_x x p(x) or E(X) = Σ x p(x).
We can talk of E(X) as the expected value of both the random variable X, and of the
probability distribution of X.
Alternative notation: Instead of E(X), the symbol µ (the lower-case Greek letter ‘mu’),
or µX , is often used.
Compare this with the sample mean of observations of X which take the distinct values x1, . . . , xK with frequencies f1, . . . , fK:

X̄ = (f1 x1 + · · · + fK xK)/(f1 + · · · + fK) = x1 p̂(x1) + · · · + xK p̂(xK) = Σ_{i=1}^{K} xi p̂(xi)

where p̂(xi) = fi / Σ_{i=1}^{K} fi, whereas:

E(X) = x1 p(x1) + · · · + xK p(xK) = Σ_{i=1}^{K} xi p(xi).

So X̄ uses the sample proportions p̂(xi), whereas E(X) uses the population probabilities p(xi).
Number of people
in household (x)    p(x)      x p(x)
1                   0.3002    0.3002
2                   0.3417    0.6834
3                   0.1551    0.4653
4                   0.1336    0.5344
5                   0.0494    0.2470
6                   0.0145    0.0870
7                   0.0034    0.0238
8                   0.0021    0.0168
Sum                           2.3579 = E(X)
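The same calculation in code, using the pf values from the table:

```python
# Expected value of the household size: E(X) = sum of x * p(x).
household_pf = {1: 0.3002, 2: 0.3417, 3: 0.1551, 4: 0.1336,
                5: 0.0494, 6: 0.0145, 7: 0.0034, 8: 0.0021}

expected_value = sum(x * p for x, p in household_pf.items())
print(round(expected_value, 4))   # 2.3579
```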
Example 3.9 For the basketball example, p(x) = (1 − π)^x π for x = 0, 1, 2, . . . , and
0 otherwise.
The expected value of X is then:
E(X) = Σ_{xi∈S} xi p(xi) = Σ_{x=0}^{∞} x (1 − π)^x π
     = Σ_{x=1}^{∞} x (1 − π)^x π                       (starting from x = 1)
     = (1 − π) Σ_{x=1}^{∞} x (1 − π)^(x−1) π
     = (1 − π) Σ_{y=0}^{∞} (y + 1)(1 − π)^y π           (using y = x − 1)
     = (1 − π) [Σ_{y=0}^{∞} y (1 − π)^y π + Σ_{y=0}^{∞} (1 − π)^y π]
     = (1 − π) [E(X) + 1]
     = (1 − π) E(X) + (1 − π)

where the first sum in the square brackets is E(X) and the second is 1. Solving E(X) = (1 − π) E(X) + (1 − π) for E(X) gives E(X) = (1 − π)/π.
So, before scoring a basket, a good free-throw shooter (with π = 0.7) misses on
average about 0.42 shots, and a poor shooter (with π = 0.3) misses on average about
2.33 shots.
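The closed form E(X) = (1 − π)/π can also be checked numerically by truncating the defining sum, as in the sketch below (the cut-off of 2,000 terms is an arbitrary choice).

```python
# Numerical check that the sum of x * p(x) matches (1 - pi) / pi.
def mean_by_summation(pi: float, cutoff: int = 2000) -> float:
    return sum(x * (1 - pi) ** x * pi for x in range(cutoff))

for pi in (0.7, 0.3):
    print(pi, round(mean_by_summation(pi), 4), round((1 - pi) / pi, 4))
# 0.7: 0.4286 by both methods; 0.3: 2.3333 by both methods
```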
In general:
E[g(X)] ≠ g[E(X)]
Suppose X is a random variable and a and b are constants, i.e. known numbers
that are not random variables. Then:
E(aX + b) = aE(X) + b.
Proof:

E(aX + b) = Σ_x (ax + b) p(x) = Σ_x ax p(x) + Σ_x b p(x) = a Σ_x x p(x) + b Σ_x p(x) = a E(X) + b

using the facts that:

i. Σ_x x p(x) = E(X), by definition of E(X).

ii. Σ_x p(x) = 1, by definition of the probability function.
In particular, setting a = 0 gives E(b) = b.
Both Var(X) and sd(X) are always ≥ 0. Both are measures of the dispersion (variation)
of the distribution of X.
For the household size distribution, Var(X) = E[(X − E(X))²] = E(X²) − (E(X))² = 7.259 − (2.358)² = 1.699 and sd(X) = √Var(X) = √1.699 = 1.30.
A similar calculation for the basketball example gives Var(X) = (1 − π)/π², which is about 0.61 when π = 0.7 and about 7.78 when π = 0.3. So the variation in how many free throws a poor shooter misses before the first success is much higher than the variation for a good shooter.
Var(aX + b) = a² Var(X).

Proof:

Var(aX + b) = E[((aX + b) − E(aX + b))²]
            = E[(aX + b − a E(X) − b)²]
            = E[(aX − a E(X))²]
            = E[a² (X − E(X))²]
            = a² E[(X − E(X))²]
            = a² Var(X).
If a = 0, this gives:
Var(b) = 0.
That is, the variance of a constant is 0. The converse also holds – if a random variable
has a variance of 0, it is actually a constant.
Example 3.14 For further practice, let us consider a discrete random variable X
which has possible values 0, 1, 2, . . . , n, where n is a known positive integer, and X
has the following probability function:

p(x) = [n!/(x! (n − x)!)] π^x (1 − π)^(n−x) for x = 0, 1, 2, . . . , n, and p(x) = 0 otherwise
This does not simplify into a simple formula, so we just calculate the values
from the definition, by summation.
At the values x = 0, 1, . . . , n, the value of the cdf is:

F (x) = P (X ≤ x) = Σ_{y=0}^{x} [n!/(y! (n − y)!)] π^y (1 − π)^(n−y).
Since X is a discrete random variable, F (x) is a step function. For E(X), we have:
E(X) = Σ_{x=0}^{n} x [n!/(x! (n − x)!)] π^x (1 − π)^(n−x)
     = Σ_{x=1}^{n} x [n!/(x! (n − x)!)] π^x (1 − π)^(n−x)
     = Σ_{x=1}^{n} [n (n − 1)!/((x − 1)! ((n − 1) − (x − 1))!)] π π^(x−1) (1 − π)^(n−x)
     = nπ Σ_{x=1}^{n} [(n − 1)!/((x − 1)! ((n − 1) − (x − 1))!)] π^(x−1) (1 − π)^(n−x)
     = nπ Σ_{y=0}^{n−1} [(n − 1)!/(y! ((n − 1) − y)!)] π^y (1 − π)^((n−1)−y)
     = nπ × 1
     = nπ
where y = x − 1, and the last summation is over all the values of the pf of another
binomial distribution, this time with possible values 0, 1, . . . , n − 1 and probability
parameter π.
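A numerical check of this result is immediate: the pf values sum to 1 and the mean equals nπ. The values n = 10 and π = 0.3 below are arbitrary choices for illustration.

```python
from math import comb

n, pi = 10, 0.3
pf = [comb(n, x) * pi ** x * (1 - pi) ** (n - x) for x in range(n + 1)]

print(round(sum(pf), 10))                                # 1.0
print(round(sum(x * p for x, p in enumerate(pf)), 10))   # 3.0 = n * pi
```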
In other words, the set of possible values (sample space) is the real numbers R, or
one or more intervals in R.
Suppose the policy has a deductible of £999, so all claims are at least £1,000.
Most of the concepts introduced for discrete random variables have exact or approximate analogies for continuous random variables, and many results are the same for both types. But there are some differences in the details. The most obvious difference is that wherever in the discrete case there are sums over the possible values of the random variable, in the continuous case these are integrals.

(Footnote: strictly speaking, having an uncountably infinite number of possible values does not necessarily imply that it is a continuous random variable. For example, the Cantor distribution (not covered in ST104b Statistics 2) is neither a discrete nor an absolutely continuous probability distribution, nor is it a mixture of these. However, we will not consider this matter any further in this course.)
where α > 0 is a parameter, and k > 0 (the smallest possible value of X) is a known
number. In our example, k = 1 (due to the deductible). A probability distribution
with this pdf is known as the Pareto distribution. A graph of this pdf when
α = 2.2 is shown in Figure 3.5.
[Figure 3.5: the pdf f(x) of the Pareto distribution with k = 1 and α = 2.2.]
Unlike for probability functions of discrete random variables, in the continuous case values of the probability density function are not probabilities of individual values, i.e. f(x) ≠ P(X = x). In fact, for a continuous distribution:

P(X = x) = 0   for all x.                                                    (3.3)

That is, the probability that X has any particular value exactly is always 0.
Because of (3.3), with a continuous distribution we do not need to be very careful about differences between < and ≤, and between > and ≥. Therefore, the following probabilities are all equal: P(a < X < b), P(a ≤ X < b), P(a < X ≤ b) and P(a ≤ X ≤ b).

Probabilities of intervals for continuous random variables

P(a < X ≤ b) = ∫_a^b f(x) dx.

Example 3.17 In Figure 3.6, the shaded area is P(1.5 < X ≤ 3) = ∫_{1.5}^{3} f(x) dx.
[Figure 3.6: the Pareto pdf, with the area corresponding to P(1.5 < X ≤ 3) shaded.]
Properties of pdfs
The pdf f (x) of any continuous random variable must satisfy the following conditions:
1. f(x) ≥ 0 for all x.

2. ∫_{−∞}^{∞} f(x) dx = 1.
Example 3.18 Continuing with the insurance example, we check that the
conditions hold for the pdf:
f(x) = 0 when x < k, and f(x) = α k^α / x^{α+1} when x ≥ k.

1. Clearly, f(x) ≥ 0 for all x, since α > 0, k^α > 0 and x^{α+1} ≥ k^{α+1} > 0.

2. ∫_{−∞}^{∞} f(x) dx = ∫_k^{∞} (α k^α / x^{α+1}) dx = ∫_k^{∞} α k^α x^{−α−1} dx
                      = α k^α [x^{−α} / (−α)]_k^{∞}
                      = (−k^α) (0 − k^{−α})
                      = 1.
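For a rough numerical confirmation, the following Python sketch (not part of the guide) approximates the integral with a simple midpoint rule, using k = 1 and α = 2.2 from the example; the truncation point and grid size are arbitrary choices, so the sum falls just short of 1:

k, alpha = 1.0, 2.2

def pareto_pdf(x):
    return alpha * k**alpha / x**(alpha + 1) if x >= k else 0.0

upper, steps = 1000.0, 200_000        # truncation point and grid size (arbitrary)
h = (upper - k) / steps
total = sum(pareto_pdf(k + (i + 0.5) * h) for i in range(steps)) * h
print(total)                          # just under 1 (the tail beyond `upper` is ignored)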
The general properties of the cdf stated previously also hold for continuous
distributions. The cdf of a continuous distribution is not a step function, so results
on discrete-specific properties do not hold in the continuous case. A continuous cdf
is a smooth, continuous function of x.
f(x) = F'(x).

For the Pareto pdf, when x ≥ k:

F(x) = ∫_k^x α k^α t^{−α−1} dt = α k^α [t^{−α} / (−α)]_k^x
     = (−k^α)(x^{−α} − k^{−α})
     = 1 − k^α x^{−α}
     = 1 − (k/x)^α.

Therefore:

F(x) = 0 when x < k, and F(x) = 1 − (k/x)^α when x ≥ k.                      (3.4)

If we were given (3.4), we could obtain the pdf by differentiation, since F'(x) = 0 when x < k, and:

F'(x) = −k^α (−α) x^{−α−1} = α k^α / x^{α+1}   when x ≥ k.
A plot of the cdf is shown in Figure 3.7.
[Figure 3.7: the cdf F(x) of the Pareto distribution with k = 1 and α = 2.2, plotted for x between 1 and 7.]
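Interval probabilities for this distribution can be read directly from the cdf just derived. A brief Python sketch (not part of the guide; the particular intervals below are chosen only for illustration):

k, alpha = 1.0, 2.2

def pareto_cdf(x):
    return 1 - (k / x)**alpha if x >= k else 0.0

print(pareto_cdf(3) - pareto_cdf(1.5))   # P(1.5 < X <= 3), about 0.32
print(1 - pareto_cdf(2))                 # P(X > 2) = (1/2)**2.2, about 0.22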
Example 3.20 Continuing with the insurance example (with k = 1 and α = 2.2),
then:
Example 3.21 Consider now a continuous random variable with the following pdf:

f(x) = λ e^{−λx} for x > 0, and f(x) = 0 for x ≤ 0                           (3.5)

where λ > 0 is a parameter. This is the pdf of the exponential distribution. The uses of this distribution will be discussed in the next chapter.

Since:

∫_0^x λ e^{−λt} dt = [−e^{−λt}]_0^x = 1 − e^{−λx}

the cdf of the exponential distribution is:

F(x) = 0 for x ≤ 0, and F(x) = 1 − e^{−λx} for x > 0.
2. Since we have just done the integration to derive the cdf F(x), we can also use it to show that f(x) integrates to one. This follows from:

∫_{−∞}^{∞} f(x) dx = P(−∞ < X < ∞) = lim_{x→∞} F(x) − lim_{x→−∞} F(x) = 1 − 0 = 1.
Suppose X is a continuous random variable with pdf f (x). Definitions of its expected
value, the expected value of any transformation g(X), variance and standard
deviation are the same as for discrete distributions, except that summation is replaced
by integration:
E(X) = ∫_{−∞}^{∞} x f(x) dx

E[g(X)] = ∫_{−∞}^{∞} g(x) f(x) dx

Var(X) = E[(X − E(X))^2] = ∫_{−∞}^{∞} (x − E(X))^2 f(x) dx = E(X^2) − (E(X))^2

sd(X) = √Var(X).
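As an illustrative aside (not from the guide), such integrals can also be approximated numerically. The Python sketch below uses a crude midpoint rule for the exponential pdf of Example 3.21 with λ = 2, an arbitrary value; the exact answers, derived later, are E(X) = 1/λ = 0.5 and Var(X) = 1/λ^2 = 0.25:

from math import exp

lam = 2.0                                # arbitrary illustrative rate

def f(x):
    return lam * exp(-lam * x) if x > 0 else 0.0

upper, steps = 40.0, 200_000             # crude midpoint rule on [0, upper]
h = upper / steps
grid = [(i + 0.5) * h for i in range(steps)]
mean = sum(x * f(x) for x in grid) * h
second = sum(x * x * f(x) for x in grid) * h
print(mean, second - mean**2)            # approximately 0.5 and 0.25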
Example 3.22 For the Pareto distribution, introduced in Example 3.16, we have:
E(X) = ∫_{−∞}^{∞} x f(x) dx = ∫_k^{∞} x f(x) dx

     = ∫_k^{∞} x (α k^α / x^{α+1}) dx

     = ∫_k^{∞} (α k^α / x^α) dx

     = [α k / (α − 1)] ∫_k^{∞} ((α − 1) k^{α−1} / x^{(α−1)+1}) dx   (the integral equals 1)

     = α k / (α − 1)   (if α > 1).
Here the last step follows because the last integrand has the form of the Pareto pdf
with parameter α − 1, so its integral from k to ∞ is 1. This integral converges only if
α − 1 > 0, i.e. if α > 1.
Similarly:

E(X^2) = ∫_k^{∞} x^2 f(x) dx = ∫_k^{∞} x^2 (α k^α / x^{α+1}) dx

       = ∫_k^{∞} (α k^α / x^{α−1}) dx

       = [α k^2 / (α − 2)] ∫_k^{∞} ((α − 2) k^{α−2} / x^{(α−2)+1}) dx   (the integral equals 1)

       = α k^2 / (α − 2)   (if α > 2)

and therefore:

Var(X) = E(X^2) − (E(X))^2 = α k^2 / (α − 2) − α^2 k^2 / (α − 1)^2 = (k / (α − 1))^2 × α / (α − 2).
For the exponential distribution, integration by parts gives:

E(X) = ∫_0^{∞} x λ e^{−λx} dx = [−x e^{−λx}]_0^{∞} + ∫_0^{∞} e^{−λx} dx = [0 − 0] − (1/λ)[0 − 1] = 1/λ.
Expected values and variances are said to be infinite when the corresponding integral
does not exist (i.e. does not have a finite value).
For the Pareto distribution, the distribution is defined for all α > 0, but the mean is
infinite if α < 1 and the variance is infinite if α < 2. This happens because for small
values of α the distribution has very heavy tails, i.e. the probabilities of very large
values of X are non-negligible.
This is actually useful in some insurance applications, for example liability insurance
and medical insurance. There most claims are relatively small, but there is a
non-negligible probability of extremely large claims. The Pareto distribution with a
small α can be a reasonable representation of such situations. Figure 3.8 shows plots of
Pareto cdfs with α = 2.2 and α = 0.8. When α = 0.8, the distribution is so heavy-tailed
that E(X) is infinite.
[Figure 3.8: Pareto cdfs F(x) with α = 2.2 and α = 0.8, plotted for x between 0 and 50.]
Recall from ST104a Statistics 1 that the sample median is essentially the observation
‘in the middle’ of a set of data, i.e. where half of the observations in the sample are
smaller than the median and half of the observations are larger.
The median of a random variable (i.e. of its probability distribution) is similar in spirit.
The median m of a continuous distribution satisfies F(m) = 1/2. For the Pareto distribution this gives 1 − (k/m)^α = 1/2, i.e. m = k × 2^{1/α}. For example:

When k = 1 and α = 2.2, the median is m = 2^{1/2.2} = 1.37.

When k = 1 and α = 0.8, the median is m = 2^{1/0.8} = 2.38.

For the exponential distribution, F(m) = 1 − e^{−λm} = 1/2 gives:

e^{−λm} = 1/2  ⟺  −λm = −log 2  ⟺  m = (log 2)/λ.
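A quick numerical check of these medians (an illustrative Python sketch only, not part of the guide; the exponential rate λ = 1.5 below is an arbitrary example):

from math import log

k = 1.0
print(k * 2 ** (1 / 2.2))    # Pareto median with alpha = 2.2: about 1.37
print(k * 2 ** (1 / 0.8))    # Pareto median with alpha = 0.8: about 2.38

lam = 1.5                    # arbitrary exponential rate
print(log(2) / lam)          # exponential median: about 0.46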
(b) p(x_k) = F(x_k) − F(x_{k−1}).

(c) F(x_k) = Σ_{i=1}^{k} p(x_i).
2. At a charity event, the organisers sell 100 tickets to a raffle. At the end of the
event, one of the tickets is selected at random and the person with that number
wins a prize. Carol buys ticket number 22. Janet buys tickets numbered 1–5. What
is the probability that each of them wins the prize?
3. A greengrocer has a very large pile of oranges on his stall. The pile of fruit is a
mixture of 50% old fruit with 50% new fruit; one cannot tell which are old fruit
and which are new fruit. However, 20% of old oranges are mouldy inside, but only
10% of new oranges are mouldy. Suppose that you choose 5 oranges at random.
What is the distribution of the number of mouldy oranges in your sample?
4. What is the expectation of the random variable X if the only possible value it can
take is c?
6. Show that if Var(X) = 0 then p(µX ) = 1. (We say in this case that X is almost
surely equal to its mean.)
Var(aX + b) = a^2 Var(X).
Solutions to these questions can be found on the VLE in the ST104b Statistics 2 area
at http://my.londoninternational.ac.uk
3.11. Reminder of learning outcomes
define a random variable and distinguish it from the values that it takes
explain the difference between discrete and continuous random variables
find the mean and the variance of simple random variables whether discrete or
continuous
demonstrate how to proceed and use simple properties of expected values and
variances.
Solutions to these questions can be found on the VLE in the ST104b Statistics 2 area
at http://my.londoninternational.ac.uk
Chapter 4
Common distributions of random
variables
state properties of these distributions such as the expected value and variance.
4.5 Introduction
In statistical inference we will treat observations:
X1 , X2 , . . . , Xn
(the sample) as values of a random variable X, which has some probability distribution
(population distribution).
How to choose that probability distribution?
There is a large number of such distributions, such that for most purposes we can
find a suitable standard distribution.
This part of the course introduces some of the most common standard distributions for
discrete and continuous random variables.
Probability distributions may differ from each other in a broader or narrower sense. In
the broader sense, we have different families of distributions which may have quite
different characteristics, for example:
among continuous: different sets of possible values (for example, all real numbers x,
x > 0, or x ∈ [0, 1]); symmetric versus skewed distributions.
The ‘distributions’ discussed in this chapter are really families of distributions in this
sense.
In the narrower sense, individual distributions within a family differ in having different
values of the parameters of the distribution. The parameters determine the mean and
variance of the distribution, values of probabilities from it, etc.
In the statistical analysis of a random variable X we typically:
use observed data to choose (estimate) values for the parameters of that
distribution, and perform statistical inference on them.
Here we need a family of discrete distributions with only two possible values (0
and 1). The Bernoulli distribution (discussed in the next section), which has one parameter π (the probability that X_i = 1), is appropriate.
Within the family of Bernoulli distributions, we use the one where the value of π is our best estimate based on the observed data. This is π̂ = 513/950 = 0.54.
4.6 Common discrete distributions
For discrete random variables, we will consider the following distributions:
Bernoulli distribution
Binomial distribution
Poisson distribution.
The discrete uniform distribution is not very common in applications, but it is useful as
a reference point for more complex distributions.
and:

E(X^2) = (1^2 + 2^2 + · · · + k^2) / k = (k + 1)(2k + 1) / 6.                (4.2)

So:

Var(X) = E(X^2) − (E(X))^2 = (k^2 − 1) / 12.
Agree / Disagree
Male / Female
X ∼ Bernoulli(π).
If X ∼ Bernoulli(π), then:

E(X) = Σ_{x=0}^{1} x p(x) = 0 × (1 − π) + 1 × π = π                          (4.3)

E(X^2) = Σ_{x=0}^{1} x^2 p(x) = 0^2 × (1 − π) + 1^2 × π = π

and hence Var(X) = E(X^2) − (E(X))^2 = π − π^2 = π(1 − π).
Let X denote the total number of successes in these n trials. Then X follows a
binomial distribution with parameters n and π, where n ≥ 1 is a known integer and
0 ≤ π ≤ 1. This is often written as:
X ∼ Bin(n, π).
X ∼ Bin(4, 0.25).
For example, what is the probability that Bob gets 3 of the 4 questions correct?
Here it is assumed that the guesses are independent, and each has the probability π = 0.25 of being correct. The probability of any particular sequence of 3 correct and 1 incorrect answers, for example 1110, is π^3 (1 − π)^1, where ‘1’ denotes a correct answer and ‘0’ denotes an incorrect answer.

However, we do not care about the order of the 0s and 1s, only about the number of 1s. So 1101 and 1011, for example, also count as 3 correct answers. Each of these also has the probability π^3 (1 − π)^1.

The total number of sequences with three 1s (and therefore one 0) is the number of locations for the three 1s that can be selected in the sequence of 4 answers. This is C(4, 3) = 4. Therefore the probability of obtaining three 1s is:

C(4, 3) π^3 (1 − π)^1 = 4 × 0.25^3 × 0.75^1 ≈ 0.0469.
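The same calculation, and the rest of the distribution of Bob's number of correct answers, can be checked with a short Python sketch (not part of the guide):

from math import comb

n, pi = 4, 0.25
for x in range(n + 1):
    print(x, comb(n, x) * pi**x * (1 - pi)**(n - x))
# x = 3 gives 4 * 0.25**3 * 0.75 = 0.046875, matching the value above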
We have already shown that (4.5) satisfies the conditions for being a probability
function in the previous chapter (see Example 3.14).
If X ∼ Bin(n, π), then:
E(X) = n π
Var(X) = n π (1 − π).
The expected value E(X) was derived in the previous chapter. The variance will be
derived later.
Example 4.6 Suppose a multiple choice examination has 20 questions, each with 4
possible answers. Consider again a student who guesses each one of the answers. Let
X denote the number of correct answers by such a student, so that we have
X ∼ Bin(20, 0.25). For such a student, the expected number of correct answers is
E(X) = 20 × 0.25 = 5.
The teacher wants to set the pass mark of the examination so that, for such a
student, the probability of passing is less than 0.05. What should the pass mark be?
In other words, what is the smallest x such that P (X ≥ x) < 0.05, i.e. such that
P (X < x) ≥ 0.95?
Calculating the probabilities of x = 0, 1, . . . , 20 we get (rounded to 2 decimal places):
x 0 1 2 3 4 5 6 7 8 9 10
p(x) 0.00 0.02 0.07 0.13 0.19 0.20 0.17 0.11 0.06 0.03 0.01
x 11 12 13 14 15 16 17 18 19 20
p(x) 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Calculating the cumulative probabilities, we find that F (7) = P (X < 8) = 0.898 and
F (8) = P (X < 9) = 0.959. Therefore P (X ≥ 8) = 0.102 > 0.05 and also
P (X ≥ 9) = 0.041 < 0.05. The pass mark should be set at 9.
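An illustrative Python sketch (not part of the guide) reproducing this search for the pass mark:

from math import comb

n, pi = 20, 0.25
pf = [comb(n, x) * pi**x * (1 - pi)**(n - x) for x in range(n + 1)]

cdf = 0.0
for x, p in enumerate(pf):
    cdf += p                         # cdf = P(X <= x)
    if 1 - cdf < 0.05:               # i.e. P(X >= x + 1) < 0.05
        print("pass mark:", x + 1)   # prints 9
        break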
More generally, consider a student who has the same probability π of the correct
answer for every question, so that X ∼ Bin(20, π). Figure 4.1 shows plots of the
probabilities for π = 0.25, 0.5, 0.7 and 0.9.
[Figure 4.1: probability functions of Bin(20, π) for π = 0.25, 0.5, 0.7 and 0.9.]
Activity 4.1 Show that (4.6) satisfies the conditions to be a probability function.
Hint: You can use the following result from standard calculus. For any number a:

e^a = Σ_{x=0}^{∞} a^x / x!.
If X ∼ Poisson(λ), then:
E(X) = λ
Var(X) = λ.
Activity 4.2 Prove that the mean and variance of a Poisson-distributed random
variable are both equal to λ.
Poisson distributions are used for counts of occurrences of various kinds. To give a
formal motivation, suppose that we consider the number of occurrences of some
phenomenon in time, and that the process that generates the occurrences satisfies the
following conditions:
1. The numbers of occurrences in any two disjoint intervals of time are independent of
each other.
2. The probability of two or more occurrences at the same time is negligibly small.
3. The probability of one occurrence in any short time interval of length t is λt for
some constant λ > 0.
Example 4.7 Examples of variables for which we might use a Poisson distribution:
Because λ is the rate per unit of time, its value also depends on the unit of time (that
is, the length of interval) we consider.
Example 4.8 If X is the number of arrivals per hour and X ∼ Poisson(1.5), then if
Y is the number of arrivals per two hours, Y ∼ Poisson(2 × 1.5) = Poisson(3).
Example 4.9 Figure 4.2 shows the probabilities p(x) for x = 0, 1, 2, . . . , 10 for
X ∼ Poisson(2) and X ∼ Poisson(4).
[Figure 4.2: Poisson probabilities p(x) for x = 0, 1, . . . , 10, with λ = 2 and λ = 4.]
A word on calculators
In the examination you will be allowed a basic calculator only. To calculate binomial
and Poisson probabilities directly requires access to a ‘factorial’ key (for the binomial)
and ‘e’ key (for the Poisson), which will not appear on a basic calculator. Note that any
probability calculations which are required in the examination will be possible on a
basic calculator. For example, for the Poisson probabilities in Example 4.10, it would be
acceptable to give your answers in terms of e (in the simplest form).
X ∼ Bin(n, π).
This ‘law of small numbers’ provides another motivation for the Poisson distribution.
Example 4.11 A classic example (from Bortkiewicz (1898) Das Gesetz der kleinen
Zahlen) helps to remember the key elements of the ‘law of small numbers’.
Figure 4.3 shows the numbers of soldiers killed by horsekick in each of 14 Army
Corps of the Prussian army in each of the years spanning 1875–94.
Suppose that the number of men killed by horsekicks in one corps in one year is
X ∼ Bin(n, π), where:
The sample mean of the counts is x̄ = 0.7, which we use as λ for the Poisson
distribution. X ∼ Poisson(0.7) is indeed a good fit to the data, as shown in Figure
4.4.
Figure 4.3: Numbers of soldiers killed by horsekick in each of 14 army corps of the Prussian
army in each of the years 1875–94. Source: Bortkiewicz (1898) Das Gesetz der kleinen
Zahlen, Leipzig: Teubner.
Example 4.12 An airline is selling tickets to a flight with 198 seats. It knows that,
on average, about 1% of customers who have bought tickets fail to arrive for the
flight. Because of this, the airline overbooks the flight by selling 200 tickets. What
is the probability that everyone who arrives for the flight will get a seat?
Let X denote the number of people who fail to turn up. Using the binomial
distribution, X ∼ Bin(200, 0.01). We have:
[Figure 4.4: Poisson(0.7) probabilities compared with the sample proportions of corps-years with 0, 1, . . . , 6 men killed.]
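Returning to Example 4.12, everyone who arrives gets a seat when (on the natural reading of the question) at least 2 of the 200 ticket-holders fail to turn up. An illustrative Python sketch (not part of the guide) comparing the exact binomial calculation with the Poisson approximation with λ = nπ = 2:

from math import comb, exp, factorial

n, pi = 200, 0.01
# P(at most 1 no-show) under the exact binomial model ...
binom_le1 = sum(comb(n, x) * pi**x * (1 - pi)**(n - x) for x in (0, 1))
# ... and under the Poisson approximation with lambda = n * pi = 2.
lam = n * pi
poisson_le1 = sum(lam**x * exp(-lam) / factorial(x) for x in (0, 1))

print(1 - binom_le1, 1 - poisson_le1)   # both about 0.59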
Geometric(π) distribution:
• Distribution of the number of failures in Bernoulli trials before the first success.
• π is the probability of success at each trial.
• Sample space is 0, 1, 2, . . . .
• See the basketball example in Chapter 3.
Hypergeometric(n, A, B) distribution:
• Experiment where initially A + B objects are available for selection, and A of
them represent ‘success’.
• n objects are selected at random, without replacement.
• Hypergeometric is then the distribution of the number of successes.
Multinomial(n, π1 , π2 , . . . , πk ) distribution:
• Here π1 + π2 + · · · + πk = 1, and the πi s are the probabilities of the values
1, 2, . . . , k.
• If n = 1, the sample space is 1, 2, . . . , k. This is essentially a generalisation of
the discrete uniform distribution, but with non-equal probabilities πi .
• If n > 1, the sample space is the vectors (n_1, n_2, . . . , n_k) where n_i ≥ 0 for all i, and n_1 + n_2 + · · · + n_k = n. This is essentially a generalisation of the binomial
to the case where each trial has K ≥ 2 possible outcomes, and the random
variable records the numbers of each outcome in n trials. Note that with
K = 2, Multinomial(n, π1 , π2 ) is essentially the same as Bin(n, π) with π = π2
(or with π = π1 ).
• When n > 1, the multinomial is the distribution of a multivariate random
variable, as discussed later in the course.
Uniform distribution
Exponential distribution
Normal distribution.
The pdf is ‘flat’, as shown in Figure 4.5 (along with the cdf). Clearly, f (x) ≥ 0 for all x,
and:

∫_{−∞}^{∞} f(x) dx = ∫_a^b (1/(b − a)) dx = [x]_a^b / (b − a) = (1/(b − a)) × (b − a) = 1.
Activity 4.4 Derive the cdf for the continuous uniform distribution.
P(x_1 ≤ X ≤ x_2) = F(x_2) − F(x_1) = (x_2 − x_1) / (b − a).
So the probability depends only on the length of the interval, x2 − x1 .
Figure 4.5: Continuous uniform distribution pdf (left) and cdf (right).
E(X) = (b + a)/2 = median of X

Var(X) = (b − a)^2 / 12.
The mean and median also follow from the fact that the distribution is symmetric
about (b + a)/2, i.e. the midpoint of the interval [a, b].
Activity 4.5 Derive the mean and variance of the continuous uniform distribution.
E(X) = 1/λ

Var(X) = 1/λ^2.
These have been derived in the previous chapter (see Example 3.23). The median of the
distribution, also previously derived (see Example 3.25), is:
m = (log 2)/λ = (log 2) × (1/λ) = (log 2) E(X) ≈ 0.69 E(X).
[Figure: the cdf F(x) of the exponential distribution, plotted for x between 0 and 5.]
Note that the median is always smaller than the mean, because the distribution is
skewed to the right.
The exponential is, among other things, a basic distribution of waiting times of
various kinds. This arises from a connection between the Poisson distribution – the
simplest distribution for counts – and the exponential.
If the number of events per unit of time has a Poisson distribution with parameter
λ, the time interval (measured in the same units of time) between two successive
events has an exponential distribution with the same parameter λ.
E(X) = λ for Poisson(λ), i.e. a large λ means many events per unit of time, on
average.
E(X) = 1/λ for Exponential(λ), i.e. a large λ means short waiting times between
successive events, on average.
Then the time X, in minutes, between the arrivals of two successive customers
follows an exponential distribution with parameter λ = 1.6.
From this exponential distribution, the expected waiting time between arrivals of
customers is E(X) = 1/1.6 = 0.625 (minutes) and the median is calculated to be
(log 2) × 0.625 = 0.433.
We can also calculate probabilities of waiting times between arrivals, using the
cumulative distribution function:
F(x) = 0 for x ≤ 0, and F(x) = 1 − e^{−1.6x} for x > 0.
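A small Python sketch (not from the guide) evaluating this cdf; the one-minute interval used below is an arbitrary illustration:

from math import exp, log

lam = 1.6                # arrivals per minute, as in the example

def F(x):
    return 1 - exp(-lam * x) if x > 0 else 0.0

print(1 / lam)           # expected time between arrivals: 0.625 minutes
print(log(2) / lam)      # median time between arrivals: about 0.433 minutes
print(1 - F(1))          # e.g. P(X > 1 minute) = exp(-1.6), about 0.20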
For example:
Many variables have distributions that are approximately normal, for example
heights of humans, animals and weights of various products.
The normal distribution has extremely convenient mathematical properties, which
make it a useful default choice of distribution in many contexts.
Even when a variable is not itself even approximately normally distributed,
functions of several observations of the variable (‘sampling distributions’) are often
approximately normal, due to the central limit theorem. Because of this, the
normal distribution has a crucial role in statistical inference. This will be discussed
later in the course.
f(x) = (1 / √(2πσ^2)) exp(−(x − µ)^2 / (2σ^2))   for −∞ < x < ∞.

Clearly, f(x) ≥ 0 for all x. Also, it can be shown that ∫_{−∞}^{∞} f(x) dx = 1 (do not attempt to show this), so f(x) really is a pdf.
If X ∼ N(µ, σ^2), then:

E(X) = µ

Var(X) = σ^2
N (0, 1) and N (5, 1) have the same dispersion but different location: the
N (5, 1) curve is identical to the N (0, 1) curve, but shifted 5 units to the right.
N (0, 1) and N (0, 9) have the same location but different dispersion: the
N (0, 9) curve is centered at the same value, 0, as the N (0, 1) curve, but spread
out more widely.
We now consider one of the convenient properties of the normal distribution. Suppose
X is a random variable, and we consider the linear transformation Y = aX + b, where a
and b are constants.
Whatever the distribution of X, it is true that E(Y ) = aE(X) + b and also that
Var(Y ) = a2 Var(X).
Furthermore, if X is normally distributed, then so is Y. In other words, if X ∼ N(µ, σ^2), then:

Y = aX + b ∼ N(aµ + b, a^2 σ^2).                                             (4.7)
This type of result is not true in general. For other families of distributions, the
distribution of Y = aX + b is not always in the same family as X.
[Figure: pdfs of the N(0, 1), N(5, 1) and N(0, 9) distributions.]
In the special case of the standard normal distribution, the cdf is:

F(x) = Φ(x) = ∫_{−∞}^{x} (1/√(2π)) exp(−t^2 / 2) dt.
1. It is only for N(0, 1), not for N(µ, σ^2) for any other µ and σ^2.
2. Even for N (0, 1), it only shows probabilities for x ≥ 0.
The key to using the tables is that the standard normal distribution is symmetric about
0. This means that for an interval in one tail, its ‘mirror image’ in the other tail has the
same probability. Another way to justify these results is that if Z ∼ N (0, 1), then
−Z ∼ N (0, 1) also. See ST104a Statistics 1 for a discussion of how to use Table 4 of
the New Cambridge Statistical Tables.
P(a < X ≤ b) = P((a − µ)/σ < (X − µ)/σ ≤ (b − µ)/σ)
             = P((a − µ)/σ < Z ≤ (b − µ)/σ)
             = Φ((b − µ)/σ) − Φ((a − µ)/σ)
which can be calculated using Table 4 of the New Cambridge Statistical Tables. (Note
that this also covers the cases of the one-sided inequalities P (X ≤ b), with a = −∞,
and P (X > a), with b = ∞.)
Example 4.15 Let X denote the diastolic blood pressure of a randomly selected
person in England. This is approximately distributed as X ∼ N (74.2, 127.87).
Suppose we want to know the probabilities of the following intervals:
These are calculated using standardisation with µ = 74.2 and σ^2 = 127.87, and therefore σ = 11.31. So here:
(X − 74.2)/11.31 = Z ∼ N(0, 1)

and we can refer values of this standardised variable to Table 4 of the New Cambridge Statistical Tables.
Finally:
P (60 ≤ X ≤ 90) = P (X ≤ 90) − P (X < 60) = 0.8152.
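This calculation can be reproduced (up to table rounding) by expressing the standard normal cdf through the error function. An illustrative Python sketch, not part of the guide:

from math import erf, sqrt

mu, sigma = 74.2, sqrt(127.87)

def Phi(z):
    # Standard normal cdf written in terms of the error function.
    return 0.5 * (1 + erf(z / sqrt(2)))

def P_le(x):
    return Phi((x - mu) / sigma)    # P(X <= x) after standardising

print(P_le(90) - P_le(60))          # about 0.814 (0.8152 above reflects table rounding)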
[Figure: the pdf of the N(74.2, 127.87) distribution of diastolic blood pressure for x between 40 and 120, with the regions ‘Low’ (probability 0.10), ‘Mid’ (0.82) and ‘High’ (0.08) marked.]
[Figure 4.12: Some probabilities around the mean for the normal distribution, e.g. the probability within one standard deviation of the mean is about 0.683.]
Unfortunately, there is one small caveat. The binomial distribution is discrete, but the
normal distribution is continuous. To see why this is problematic, consider the following.
Suppose X ∼ Bin(40, 0.4). Since X is discrete, such that x = 0, 1, . . . , 40, then:
P (X ≤ 4) = P (X ≤ 4.5) = P (X < 5)
since P (4 < X ≤ 4.5) = 0 and P (4.5 < X < 5) = 0 due to the ‘gaps’ in the probability
mass for this distribution. In contrast, if Y ∼ N(16, 9.6), then:

P(Y ≤ 4) < P(Y ≤ 4.5) < P(Y < 5)

since P(4 < Y < 4.5) > 0 and P(4.5 < Y < 5) > 0 because this is a continuous distribution.
The accepted way to circumvent this problem is to use a continuity correction which
corrects for the effects of the transition from a discrete Bin(n, π) distribution to a
continuous N (n π, n π (1 − π)) distribution.
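To see the effect numerically, here is an illustrative Python sketch (not from the guide) comparing the exact probability P(X ≤ 13) for X ∼ Bin(40, 0.4) with its normal approximation, without and with the standard ±0.5 continuity correction; the choice of 13 is arbitrary:

from math import comb, erf, sqrt

n, pi = 40, 0.4
mu, sigma = n * pi, sqrt(n * pi * (1 - pi))    # the approximating N(16, 9.6)

def Phi(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

exact = sum(comb(n, x) * pi**x * (1 - pi)**(n - x) for x in range(14))   # P(X <= 13)
without = Phi((13 - mu) / sigma)               # no continuity correction
with_cc = Phi((13.5 - mu) / sigma)             # X <= 13 treated as Y <= 13.5
print(exact, without, with_cc)                 # the corrected value is much closer to `exact`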
Continuity correction
Example 4.16 In the UK general election in May 2010, the Conservative Party
received 36.1% of the votes. We carry out an opinion poll in November 2014, where
we survey 1,000 people who say they voted in 2010, and ask who they would vote for
if a general election was held now. Let X denote the number of people who say they
would now vote for the Conservative Party.
Suppose we assume that X ∼ Bin(1000, 0.361).
(a) The Conservatives’ true level of support is still 36.1% (or even higher), but
by chance we ended up with an unusual sample with only 300 of their
supporters.
(b) The Conservatives’ true level of support is currently less than 36.1% (in
which case getting 300 in the sample would be more probable).
Here (b) seems a more plausible conclusion than (a). This kind of reasoning is
the basis of statistical significance tests.
2. Suppose that the normal rate of infection for a certain disease in cattle is 25%. To
test a new serum which may prevent infection, three experiments are carried out.
The test for infection is not always valid for some particular cattle, so the
experimental results are incomplete – we cannot always tell whether a cow is
infected or not. The results of the three experiments are:
i. 10 animals are injected; all 10 remain free from infection.
ii. 17 animals are injected; more than 15 remain free from infection and there are
2 doubtful cases.
iii. 23 animals are injected; more than 20 remain free from infection and there are
3 doubtful cases.
Which experiment provides the strongest evidence in favour of the serum?
4. The chance that a lottery ticket has a winning number is 0.0000001. Suppose
10,000,000 people buy tickets that are independently numbered.
(a) What is the probability there is no winner?
(b) What is the probability there is exactly 1 winner?
(c) What is the probability there are exactly 2 winners?
6. Suppose that the service time for a customer at a fast food outlet has an
exponential distribution with parameter 1/3 (customers per minute). What is the
probability that a customer waits more than 4 minutes?
9. Helmut goes fishing every Saturday. The number of fish he catches follows a
Poisson distribution. On a proportion p of the days he goes fishing, he does not
catch anything. He makes it a rule to take home the first fish and then every other
fish that he catches (i.e. the first, third, fifth fish and so on).
(a) Using a Poisson distribution, find the mean number of fish he catches.
(b) Show that the probability that he takes home the last fish he catches is (1 − p^2)/2.
Solutions to these questions can be found on the VLE in the ST104b Statistics 2 area
at http://my.londoninternational.ac.uk
state properties of these distributions such as the expected value and variance.
3. Show that, for a binomial random variable such that X ∼ Bin(n, π), we have:
E(X) = n π Σ_{x=1}^{n} [(n − 1)! / ((x − 1)!(n − x)!)] π^{x−1} (1 − π)^{n−x}.
[The wording of the question implies that you use the result that you have just
proved. Other methods of derivation will not be accepted!]
4. Cars independently pass a point on a busy road at an average rate of 150 per hour.
(a) Assuming a Poisson distribution, find the probability that none passes in a
given minute.
(b) What is the expected number passing in two minutes?
(c) Find the probability that the expected number actually passes in a given
two-minute period.
Other motor vehicles (vans, motorcycles etc.) pass the same point independently at
the rate of 75 per hour. Assume a Poisson distribution for these vehicles too.
(d) What is the probability of one car and one other motor vehicle in a two-minute
period?
Solutions to these questions can be found on the VLE in the ST104b Statistics 2 area
at http://my.londoninternational.ac.uk