2020 Math1024
1 Introduction to Statistics 9
1.1 Lecture 1: What is statistics? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.1.1 Early and modern definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.1.2 Uncertainty: the main obstacle to decision making . . . . . . . . . . . . . . . 10
1.1.3 Statistics tames uncertainty . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.1.4 Why should I study statistics as part of my degree? . . . . . . . . . . . . . . 10
1.1.5 Lie, Damn Lie and Statistics? . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.1.6 What’s in this module? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.1.7 Take home points: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2 Lecture 2: Basic statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2.1 Lecture mission . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2.2 How do I obtain data? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2.3 Summarising data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.2.4 Take home points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.3 Lecture 3: Data visualisation with R . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.3.1 Lecture mission . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.3.2 Get into R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.3.3 Working directory in R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.3.4 Keeping and saving commands in a script file . . . . . . . . . . . . . . . . . . 19
1.3.5 How do I get my data into R? . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.3.6 Working with data in R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.3.7 Summary statistics from R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.3.8 Graphical exploration using R . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.3.9 Take home points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2 Introduction to Probability 23
2.1 Lecture 4: Definitions of probability . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.1.1 Why should we study probability? . . . . . . . . . . . . . . . . . . . . . . . . 23
2.1.2 Two types of probabilities: subjective and objective . . . . . . . . . . . . . . 23
2.1.3 Union, intersection, mutually exclusive and complementary events . . . . . . 24
2.1.4 Axioms of probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.1.5 Application to an experiment with equally likely outcomes . . . . . . . . . . . 27
2.1.6 Take home points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.2 Lecture 5: Using combinatorics to find probability . . . . . . . . . . . . . . . . . . . 27
4 Statistical Inference 81
4.1 Lecture 19: Foundations of statistical inference . . . . . . . . . . . . . . . . . . . . . 81
4.1.1 Statistical models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.1.2 A fully specified model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.1.3 A parametric statistical model . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.1.4 A nonparametric statistical model . . . . . . . . . . . . . . . . . . . . . . . . 83
4.1.5 Should we prefer parametric or nonparametric and why? . . . . . . . . . . . . 84
4.1.6 Take home points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.2 Lecture 20: Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.2.1 Lecture mission . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.2.2 Population and sample . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.2.3 Statistic and estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.2.4 Bias and mean square error . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.2.5 Take home points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.3 Lecture 21: Estimation of mean and variance and standard error . . . . . . . . . . . 88
4.3.1 Lecture mission . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
Chapter 1
Introduction to Statistics
• With the rapid industrialization of Europe in the first half of the 19th century, statistics
became established as a discipline. This led to the formation of the Royal Statistical Society,
the premier professional association of statisticians in the UK and also world-wide, in 1834.
• During this 19th-century growth period, statistics acquired a new meaning as the interpretation of data, or methods of extracting information from data for decision making. Thus statistics acquired its modern meaning as the set of methods for collecting and analysing data in order to make decisions.
• Indeed, the Oxford English Dictionary defines statistics as: “The practice or science of collecting and analysing numerical data in large quantities, especially for the purpose of inferring proportions in a whole from those in a representative sample.”
• Note that the word ‘state’ has gone from its definition. Instead, statistical methods are now
essential for everyone wanting to answer questions using data.
For example, will it rain tomorrow? Does eating red meat make us live longer? Is smoking
harmful during pregnancy? Is the new shampoo better than the old? Will the UK economy get
better after Brexit? At a more personal level: What degree classification will I get at graduation?
How long will I live for? What prospects do I have in the career I have chosen? How do I invest
my money to maximise the return? Will the stock market crash tomorrow?
Hence, although people may misuse the tools of statistics, it is our duty to learn and sharpen those tools to develop scientifically robust and strong arguments.
As discussed before, statistical methods are the only viable tools whenever there is uncertainty in decision making. In scientific investigations, statistics is an indispensable instrument in the search for truth when uncertainty cannot be totally removed from decision making. Of course, a statistical method may not yield the best prediction in every particular situation, but a systematic and robust application of statistical methods will eventually win over pure guesses. For example, statistical methods have established that cigarette smoking is harmful to human health.
– For this we will use the R statistical package. R is freely available to download: search for “download R” or go to https://cran.r-project.org/. We will use it as a calculator and also as a graphics package to explore data, perform statistical analysis, illustrate theorems and calculate probabilities. You do not need to learn any programming language. You will be instructed to learn basic commands like: 2+2; mean(x); plot(x,y).
– In this module we will demonstrate using the R package. A nicer experience is provided by the commercially developed, but freely available, RStudio software, and you are recommended to use it.
• Chapter 3: Random variables. We will learn that the results of different random experiments lead to different random variables following distributions such as the binomial and the normal. We will learn their basic properties, e.g. mean and variance.
• This module will provide a very gentle introduction to statistics and probability together with
the software package R for data analysis.
• Statistical knowledge is essential for any scientific career in academia, industry and govern-
ment.
• Read the New York Times article For Today’s Graduate, Just One Word: Statistics
(search on Google).
• Watch the YouTube video Joy of Statistics before attending the next lecture.
As well as randomness, we need to pay attention to the design of the study. In a designed
experiment the investigator controls the values of certain experimental variables and then measures
a corresponding output or response variable. In designed surveys an investigator collects data on a
randomly selected sample of a well-defined population. Designed studies can often be more effective
at providing reliable conclusions, but are frequently not possible because of practical difficulties in carrying out the study.
We will return to the topics of survey methods and designed surveys later in Lecture 28. Until
then we assume that we have data from n randomly selected sampling units, which we will conveniently denote by x1 , x2 , . . . , xn . We will assume that these values are numeric, either discrete like
counts, e.g. number of road accidents, or continuous, e.g. heights of 4-year-olds, marks obtained
in an examination. We will consider the following example:
♥ Example 1 Fast food service time The service times (in seconds) of customers at a fast-food restaurant. The first row is for customers who were served from 9–10AM and the second row is for customers who were served from 2–3PM on the same day.
AM 38 100 64 43 63 59 107 52 86 77
PM 45 62 52 72 81 88 64 75 59 70
• For numeric data x1 , x2 , . . . , xn , we would like to know the centre (measures of location or
central tendency) and the spread or variability.
Measures of location
• We are seeking a representative value for the data x1 , x2 , . . . , xn which should be a function
of the data. If a is that representative value then how much error is associated with it?
• The total error could be the sum of squares of the deviations from a, $\mathrm{SSE} = \sum_{i=1}^{n}(x_i - a)^2$, or the sum of the absolute deviations from a, $\mathrm{SSA} = \sum_{i=1}^{n}|x_i - a|$.
• What value of a will minimise the SSE or the SSA? For SSE the answer is the sample mean
and for SSA the answer is the sample median.
• How can we prove the above assertion? Use the derivative method: set $\partial\,\mathrm{SSE}/\partial a = 0$ and solve for a, then check that the second derivative is positive at the solution for a. Try this at home; a brief sketch is given below.
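A brief sketch of the derivative argument for the SSE (fill in the details yourself):
$$\frac{\partial}{\partial a}\,\mathrm{SSE} = \frac{\partial}{\partial a}\sum_{i=1}^{n}(x_i - a)^2 = -2\sum_{i=1}^{n}(x_i - a) = -2\left(\sum_{i=1}^{n} x_i - na\right) = 0 \quad\Longrightarrow\quad a = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x},$$
and the second derivative is $\partial^2\,\mathrm{SSE}/\partial a^2 = 2n > 0$, so $a = \bar{x}$ is indeed a minimum.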
$$\begin{aligned}
\sum_{i=1}^{n}(x_i - a)^2 &= \sum_{i=1}^{n}(x_i - \bar{x} + \bar{x} - a)^2 \qquad \{\text{add and subtract } \bar{x}\}\\
&= \sum_{i=1}^{n}\left[(x_i - \bar{x})^2 + 2(x_i - \bar{x})(\bar{x} - a) + (\bar{x} - a)^2\right]\\
&= \sum_{i=1}^{n}(x_i - \bar{x})^2 + 2(\bar{x} - a)\sum_{i=1}^{n}(x_i - \bar{x}) + n(\bar{x} - a)^2\\
&= \sum_{i=1}^{n}(x_i - \bar{x})^2 + n(\bar{x} - a)^2,
\end{aligned}$$
since $\sum_{i=1}^{n}(x_i - \bar{x}) = n\bar{x} - n\bar{x} = 0$.
– Now note that: the first term is free of a; the second term is non-negative for any value
of a. Hence the minimum occurs when the second term is zero, i.e. when a = x̄.
– This establishes the fact that
the sum (or mean) of squares of the deviations from any number a is minimised when a is the mean, x̄.
– In the proof we also noted that $\sum_{i=1}^{n}(x_i - \bar{x}) = 0$. This is stated as: the deviations of the observations from their sample mean always sum to zero.
• The above justifies why we often use the mean as a representative value. For the service time
data, the mean time in AM is 68.9 seconds and for PM the mean is 66.8 seconds.
• Here the derivative approach does not work, since the absolute value function is not differentiable everywhere.
For the AM service time data: 38 < 43 < 52 < 59 < 63 < 64 < 77 < 86 < 100 < 107.
• Easy to argue that |x(1) − a| + |x(n) − a| is minimised when a is such that x(1) ≤ a ≤ x(n) .
• Easy to argue that |x(2) − a| + |x(n−1) − a| is minimised when a is such that x(2) ≤ a ≤ x(n−1) .
• Finally, when n is odd, the last term $|x_{((n+1)/2)} - a|$ is minimised when $a = x_{((n+1)/2)}$, the middle value in the ordered list.
• If, however, n is even, the last pair of terms will be $|x_{(n/2)} - a| + |x_{(n/2+1)} - a|$. This will be minimised when a is any value between $x_{(n/2)}$ and $x_{(n/2+1)}$. For convenience, we often take the mean of these as the middle value.
• Hence the middle value, popularly known as the median, minimises the SSA, and so the median is also often used as a representative value or a measure of central tendency. This establishes the fact that:
the sum (or mean) of the absolute deviations from any number a is minimised when a is the median.
• To recap: the median is defined as the observation ranked ½(n + 1) in the ordered list if n is odd. If n is even, the median is any value between the (n/2)th and (n/2 + 1)th observations in the ordered list. For example, for the AM service times, n = 10 and 38 < 43 < 52 < 59 < 63 < 64 < 77 < 86 < 100 < 107. So the median is any value between 63 and 64; for convenience, we often take the mean of these, so the median is 63.5 seconds. Note that we use the unit of the observations when reporting any measure of location.
Which of the three (mean, median and mode) would you prefer?
The mean gets more affected by extreme observations while the median does not. For example for
the AM service times, suppose the next observation is 190. The median will be 64 instead of 63.5
but the mean will shoot up to 79.9.
Measures of spread
• A quick measure of the spread is the range, which is defined as the difference between the
maximum and minimum observations. For the AM service times the range is 69 (107 − 38)
seconds.
• Standard deviation: the square root of the variance, $\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$. Note that
$$\sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n}\left(x_i^2 - 2x_i\bar{x} + \bar{x}^2\right) = \sum_{i=1}^{n}x_i^2 - 2\bar{x}(n\bar{x}) + n\bar{x}^2 = \sum_{i=1}^{n}x_i^2 - n\bar{x}^2.$$
Hence we calculate the variance by the formula:
$$\mathrm{Var}(x) = s^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n}x_i^2 - n\bar{x}^2\right).$$
• Sometimes the variance is defined with the divisor n instead of n − 1. We have chosen n − 1
since this is the default in R. We will return to this in Chapter 4.
• The standard deviation (sd) for the AM service times is 23.2 seconds. Note that it has the same unit as the observations.
• The interquartile range (IQR) is the difference between the third quartile, Q3, and the first quartile, Q1, which are respectively the observations ranked ¼(3n + 1) and ¼(n + 3) in the ordered list. Note that the median is the second quartile, Q2. When n is even, the definitions of Q3 and Q1 are similar to that of the median, Q2. The IQR for the AM service times is 83.75 − 53.75 = 30 seconds.
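As a preview of the next lecture, these summaries are easy to reproduce in R; a minimal sketch for the AM service times (R's default quantile rule happens to agree with the ranks used above for these data, although other conventions exist):

am <- c(38, 100, 64, 43, 63, 59, 107, 52, 86, 77)   # AM service times from Example 1
mean(am)            # 68.9 seconds
median(am)          # 63.5 seconds
sd(am)              # about 23.2 seconds
IQR(am)             # 30 seconds
max(am) - min(am)   # range: 69 seconds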
Our mission in this lecture is to get started with R. We will learn the basic R commands (mean,
var, summary, table, barplot, hist, pie and boxplot) to explore data sets.
4 0 0 0 3 2 0 0 6 7
6 2 1 11 6 1 2 1 1 2
0 2 2 1 0 12 8 4 5 0
and so on.
terms of demographic variables such as sex and ethnicity. 68 students were weighed during the first
week of the semester, then again 12 weeks later.
student number initial weight (kg) final weight (kg)
1 77.56423 76.20346
2 49.89512 50.34871
... ... ...
67 75.74986 77.11064
68 59.42055 59.42055
♥ Example 4 Billionaires
Fortune magazine publishes a list of the world’s billionaires each year. The 1992 list includes
225 individuals. Their wealth, age, and geographic location (Asia, Europe, Middle East, United
States, and Other) are reported. Variables are: wealth: Wealth of family or individual in billions of
dollars; age: Age in years (for families it is the maximum age of family members); region: Region
of the World (Asia, Europe, Middle East, United States and Other). The head and tail values of
the data set are given below.
wealth age region
37.0 50 M
24.0 88 U
... ... ...
1 9 M
1 59 E
• In both Rstudio and R there is the R console that allows you to type in commands at the
prompt > directly.
• Example: mean(c(38, 100, 64, 43, 63, 59, 107, 52, 86, 77)) and hitting the Enter
button computes the mean of the numbers entered.
• c() is itself a command (function) that concatenates (collects) the input numbers into a vector.
• Even when an R function has no arguments we still need to use the brackets, such as in ls()
which gives a list of objects in the current workspace.
• Example: x <- 2+2 means that x is assigned the value of the expression 2+2.
• If you are working on your own computer, please create a folder and name it C:/math1024. R is case sensitive, so if you name it Math1024 instead of math1024 then that's what you need to use. Avoid folder names with spaces, e.g. do not use: Math 1024.
• On the university workstations there is a drive called H: which is permanent (it will be there for you to use throughout your 3 (or 4) year degree programme). From Windows File Explorer navigate to H: and create a sub-folder math1024.
• Please unzip (extract) the file and save the data files in the math1024 folder you created. You
do not need to download this file again unless you are explicitly told to do so.
• In R, issue the command getwd(), which will print out the current working directory.
• Assuming you are working on the university computers, please set the working directory by issuing the command: setwd("H:/math1024/"). On your own computer you will modify the command to something like: setwd("C:/math1024/").
• In RStudio, a more convenient way to set the working directory is to follow the menu Session → Set Working Directory, which gives you a dialogue box to navigate to the folder you want.
• To confirm that this has been done correctly, re-issue the command getwd() and see the
output.
• Your data reading commands below will not work if you fail to follow the instruc-
tion in this subsection.
• Please remember that you need to issue the setwd("H:/math1024/") command every time you log in.
• However, we almost never type the long R commands at the R prompt > as we are prone to
making mistakes and we may need to modify the commands for improved functionality.
• That is why we prefer to simply write down the commands one after another in a script file
and save those for future use.
• You can either execute the entire script or only parts by highlighting the respective commands
and then clicking the Run button or Ctrl + R to execute.
• Do not forget to save the script file with a suitable name, e.g. myfirst.R in the math1024
sub-folder you created.
• It is very strongly recommended that you write R commands in a script file as instructed in
this subsection.
• All the commands used in this lecture are already typed in the file Rfile1.R that you can
also download from Blackboard.
• Please do not attempt to go into R now. Instead, just read these notes or watch the video. You will go through the commands at your own pace as instructed in the notes for the laboratory session in Appendix C of this booklet.
• To read a tab-delimited text file of data with the first row giving the column headers, the
command is: read.table("filename.txt", head=TRUE).
• For comma-separated files (such as the ones exported by EXCEL), the command is
read.table("filename.csv", head=TRUE, sep=",") or simply
read.csv("filename.csv", head=TRUE).
• The option head=TRUE tells R that the first row of the data file contains the column headers.
• Read the help files by typing ?scan and ?read.table to learn these commands.
• You are reminded that the following data reading commands will fail if you have not set the
working directory correctly.
• Assuming that you have set the working directory to where your data files are saved, simply
type and Run
• R does not automatically show the data after reading. To see the data you need to issue a
command like: cfail, head(ffood), tail(bill) etc. after reading in the data.
• You must issue the correct command to read the data set correctly.
In the past, reading data into R has been the most difficult task for students. Please ask for
help in the lab sessions if you are still struggling with this. If all else fails, you can read the data
sets from the course web-page as follows:
• Just type ffood and hit the Enter button or the Run icon. See what happens.
• A convenient way to see the data is to see either the head or the tail of the data. For example,
type head(ffood) and hit Run or tail(ffood) and hit Run.
• To know the dimension (how many rows and columns) issue dim(ffood).
• To access elements of a data frame we can use square brackets, e.g. ffood[1, 2] gives the first row, second column element; ffood[1, ] gives everything in the first row; and ffood[, 1] gives everything in the first column.
• The named columns in a data frame are often accessed by using the $ operator. For example,
ffood$AM prints the column whose header is AM.
• There are many R functions with intuitive names, e.g. mean, median, var, min, max,
sum, prod, summary, seq, rep etc. We will explain them as we need them.
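Putting these commands together, here is a minimal sketch of a first session. The file name servicetime.csv below is only a placeholder for whichever fast-food data file you were given; substitute the actual name.

setwd("H:/math1024/")                              # or your own folder
ffood <- read.csv("servicetime.csv", head=TRUE)    # placeholder file name
head(ffood)                                        # first few rows
dim(ffood)                                         # number of rows and columns
mean(ffood$AM)                                     # mean of the AM column
summary(ffood$AM)                                  # min, quartiles, mean, max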
Figure 1.1: Different shapes using the butterfly programme. Programming helps you to be uniquely
creative!
Chapter 2
Introduction to Probability
Chapter mission
Why should we study probability? What are probabilities? How do you find them? What are the
main laws of probabilities? How about some fun examples where probabilities are used to solve
real-life problems?
in a statistical framework called Bayesian inference. Such methods allow one to combine expert
opinion and evidence from data to make the best possible inferences and prediction. Unfortunately
discussion of Bayesian inference methods is beyond the scope of this module, although we will talk
about it when possible.
The second definition of probability comes from the long-term relative frequency of a result of
a random experiment (e.g. coin tossing) which can be repeated an infinite number of times under
essentially similar conditions. First we give some essential definitions.
Random experiments. The experiment is random because in advance we do not know exactly
what outcome the experiment will give, even though we can write down all the possible outcomes
which together are called the sample space (S). For example, in a coin tossing experiment, S
= {head, tail}. If we toss two coins together, S = {HH, HT, TH, TT} where H and T denote
respectively the outcome head and tail from the toss of a single coin.
Event. An event is defined as a particular result of the random experiment. For example, HH
(two heads) is an event when we toss two coins together. Similarly, at least one head e.g. {HH, HT,
TH} is an event as well. Events are denoted by capital letters A, B, C, . . . or A1 , B1 , A2 etc., and
a single outcome is called an elementary event, e.g. HH. An event which is a group of elementary
events is called a composite event, e.g. at least one head. How to determine the probability of a
given event A, P {A}, is the focus of probability theory.
Probability as relative frequency. Imagine we are able to repeat a random experiment
under identical conditions and count how many of those repetitions result in the event A. The
relative frequency of A, i.e. the ratio
$$\frac{\text{the number of repetitions resulting in } A}{\text{total number of repetitions}},$$
approaches a fixed limit value as the number of repetitions increases. This limit value is defined as P{A}.
As a simple example, in the experiment of tossing a particular coin, suppose we are interested
in the event A of getting a ‘head’. We can toss the coin 1000 times (i.e. do 1000 replications of
the experiment) and record the number of heads out of the 1000 replications. Then the relative
frequency of A out of the 1000 replications is the proportion of heads observed.
Sometimes, however, it is much easier to find P{A} by using some ‘common knowledge’ about probability. For example, if the coin in the example above is fair (i.e. P{‘head’} = P{‘tail’}), then this information and the common knowledge that P{‘head’} + P{‘tail’} = 1 immediately imply that P{‘head’} = 0.5 and P{‘tail’} = 0.5. Next, the essential ‘common knowledge’ about probability will be formalized as the axioms of probability, which form the foundation of probability theory.
But before that, we need to learn a bit more about the event space (collection of all events).
♥ Example 5 Die throw Roll a six-faced die and observe the score on the uppermost face.
Here S = {1, 2, 3, 4, 5, 6}, which is composed of six elementary events.
The union of two given events A and B, denoted as (A or B) or A ∪ B, consists of the outcomes
that are either in A or B or both. ‘Event A ∪ B occurs’ means ‘either A or B occurs or both occur’.
For example, in Example 5, suppose A is the event that an even number is observed. This
event consists of the set of outcomes 2, 4 and 6, i.e. A = {an even number} = {2, 4, 6}. Sup-
pose B is the event that a number larger than 3 is observed. This event consists of the out-
comes 4, 5 and 6, i.e. B = {a number larger than 3} = {4, 5, 6}. Hence the event A ∪ B =
{an even number or a number larger than 3} = {2, 4, 5, 6}. Clearly, when a 6 is observed, both A
and B have occurred.
The intersection of two given events A and B, denoted as (A and B) or A ∩ B, consists of the
outcomes that are common to both A and B. ‘Event A∩B occurs’ means ‘both A and B occur’. For
example, in Example 5, A∩B = {4, 6}. Additionally, if C = {a number less than 6} = {1, 2, 3, 4, 5},
the intersection of events A and C is the event A ∩ C = {an even number less than 6} = {2, 4}.
The union and intersection of two events can be generalized in an obvious way to the union and
intersection of more than two events.
Two events A and D are said to be mutually exclusive if A ∩ D = ∅, where ∅ denotes the empty
set, i.e. A and D have no outcomes in common. Intuitively, ‘A and D are mutually exclusive’
means ‘A and D cannot occur simultaneously in the experiment’.
Figure 2.1: In the left plot A and B are mutually exclusive; the right plot shows A ∪ B and A ∩ B.
In Example 5, if D = {an odd number} = {1, 3, 5}, then A ∩ D = ∅ and so A and D are
mutually exclusive. As expected, A and D cannot occur simultaneously in the experiment.
For a given event A, the complement of A is the event that consists of all the outcomes not in A and is denoted by A′. Note that A ∪ A′ = S and A ∩ A′ = ∅.
Thus, we can see the parallels between Set theory and Probability theory:
Set theory Probability theory
(1) Space Sample space
(2) Element or point Elementary event
(3) Set Event
A1 P {S} = 1,
A2 0 ≤ P {A} ≤ 1 for any event A,
A3 P {A ∪ B} = P {A} + P {B} provided that A and B are mutually exclusive events.
Proof: We can write A ∪ B = (A ∩ B′) ∪ (A ∩ B) ∪ (A′ ∩ B). All three of these are mutually exclusive events. Hence,
$$P\{A \cup B\} = P\{A \cap B'\} + P\{A \cap B\} + P\{A' \cap B\} = P\{A\} - P\{A \cap B\} + P\{A \cap B\} + P\{B\} - P\{A \cap B\} = P\{A\} + P\{B\} - P\{A \cap B\}.$$
(6) The sum of the probabilities of all the outcomes in the sample space S is 1.
For any event A, we find P{A} by adding up 1/N for each of the outcomes in event A:
$$P\{A\} = \frac{\text{number of outcomes in } A}{\text{total number of possible outcomes of the experiment}}.$$
Return to Example 5 where a six-faced die is rolled. Suppose that one wins a bet if a 6 is
rolled. Then the probability of winning the bet is 1/6 as there are six possible outcomes in the
sample space and exactly one of those, 6, wins the bet. Suppose A denotes the event that an
even-numbered face is rolled. Then P {A} = 3/6 = 1/2 as we can expect.
♥ Example 6 Dice throw Roll 2 distinguishable dice and observe the scores. Here S =
{(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6), . . . , (6, 1), (6, 2), (6, 3), (6, 4), (6, 5), (6, 6)} which consists of 36
possible outcomes or elementary events, A1 , . . . , A36 . What is the probability of the outcome 6 on both dice? The required probability is 1/36. What is the probability that the sum of the two
dice is greater than 6? How about the probability that the sum is less than any number, e.g. 8?
Hint: Write down the sum for each of the 36 outcomes and then find the probabilities asked just
by inspection. Remember, each of the 36 outcomes has equal probability 1/36.
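One quick way to answer such questions is to enumerate all 36 equally likely outcomes; a short sketch in R:

s <- expand.grid(die1 = 1:6, die2 = 1:6)   # all 36 outcomes
total <- s$die1 + s$die2
mean(total > 6)    # P(sum greater than 6) = 21/36
mean(total < 8)    # P(sum less than 8) = 21/36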
The next lecture will continue to find probabilities using specialist counting techniques called
permutation and combination. This will allow us to find probabilities in a number of practical
situations.
The UK National Lottery selects 6 numbers at random from 1 to 49. I bought one ticket - what
is the probability that I will win the jackpot?
♥ Example 7 Counting Suppose there are 7 routes to London from Southampton and then there are 5 routes to Cambridge out of London. How many ways can I travel to Cambridge from Southampton via London? The answer is obviously 7 × 5 = 35.
The task is to select k(≥ 1) from the n (n ≥ k) available people and sit the k selected people in k
(different) chairs. By considering the i-th sub-task as selecting a person to sit in the i-th chair (i =
1, . . . , k), it follows directly from the multiplication rule above that there are n(n−1) · · · (n−[k −1])
ways to complete the task. The number n(n−1) · · · (n−[k−1]) is called the number of permutations
of k from n and is denoted by ${}^{n}P_k$:
$${}^{n}P_k = n(n-1)\cdots(n-[k-1]).$$
In particular, when k = n we have ${}^{n}P_n = n(n-1)\cdots 1$, which is called ‘n factorial’ and denoted as n!. Note that 0! is defined to be 1. It is clear that
$${}^{n}P_k = n(n-1)\cdots(n-[k-1]) = \frac{n(n-1)\cdots(n-[k-1]) \times (n-k)!}{(n-k)!} = \frac{n!}{(n-k)!}.$$
♥ Example 8 Football How many possible rankings are there for the 20 football teams in the premier league at the end of a season? This number is given by ${}^{20}P_{20} = 20!$, which is a huge number! How many possible permutations are there for the top 4 positions, who will qualify to play in Europe in the next season? This number is given by ${}^{20}P_4 = 20 \times 19 \times 18 \times 17$.
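These counts are easy to check in R:

factorial(20)                    # 20!, about 2.43e18
20 * 19 * 18 * 17                # 20 P 4 = 116280
factorial(20) / factorial(16)    # the same, via n!/(n-k)!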
The number of combinations of k from n: ${}^{n}C_k$ or $\binom{n}{k}$
The task is to select k (≥ 1) from the n (n ≥ k) available people. Note that this task does NOT involve sitting the k selected people in k (different) chairs. We want to find the number of possible ways to complete this task, which is denoted as ${}^{n}C_k$ or $\binom{n}{k}$.
For this, let us reconsider the task of “selecting k (≥ 1) from the n (n ≥ k) available people and sitting the k selected people in k (different) chairs”, which we already know from the discussion above has ${}^{n}P_k$ ways to complete.
Alternatively, to complete this task, one has to complete two sub-tasks sequentially. The first sub-task is to select k (≥ 1) from the n (n ≥ k) available people, which has ${}^{n}C_k$ ways. The second sub-task is to sit the k selected people in k (different) chairs, which has k! ways. It follows directly from the multiplication rule that there are ${}^{n}C_k \times k!$ ways to complete the task. Hence we have
$${}^{n}P_k = {}^{n}C_k \times k!, \quad \text{i.e.,} \quad {}^{n}C_k = \frac{{}^{n}P_k}{k!} = \frac{n!}{(n-k)!\,k!}.$$
♥ Example 9 Football How many possible ways are there to choose 3 teams for the bottom positions of the premier league table at the end of a season? This number is given by ${}^{20}C_3 = 20 \times 19 \times 18 / 3!$, which does not take into consideration the rankings of the three bottom teams!
♥ Example 10 Microchip A box contains 12 microchips of which 4 are faulty. A sample of size
3 is drawn from the box without replacement.
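The questions for this example are worked elsewhere in the module. Purely as an illustration (this particular question is an assumption, not taken from the notes), the probability that exactly one of the three sampled chips is faulty can be found by the counting rule or with R's built-in hypergeometric function:

choose(4, 1) * choose(8, 2) / choose(12, 3)   # 112/220, about 0.509
dhyper(1, m = 4, n = 8, k = 3)                # same value from the built-in function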
More examples and details regarding the combinations are provided in Section A.3. You are
strongly recommended to read that section now.
A sample of size n is drawn at random without replacement from a box of N items containing
a proportion p of defective items.
• How many defective items are in the box? N p. How many good items are there? N (1 − p).
Assume these to be integers.
• Which values of x (in terms of N , n and p) make this expression well defined?
We’ll see later that these values of x and the corresponding probabilities make up what is called the hypergeometric distribution.
♥ Example 13 The National Lottery In Lotto, a winning ticket has six numbers from 1 to 49
matching those on the balls drawn on a Wednesday or Saturday evening. The ‘experiment’ consists
of drawing the balls from a box containing 49 balls. The ‘randomness’, the equal chance of any set
of six numbers being drawn, is ensured by the spinning machine, which rotates the balls during the
selection process. What is the probability of winning the jackpot?
There is one other way of winning by using the bonus ball – matching 5 of the selected 6 balls
plus matching the bonus ball. The probability of this is given by
$$P\{\text{5 matches + bonus}\} = \frac{6}{{}^{49}C_6} = 4.29 \times 10^{-7}.$$
Adding all these probabilities of winning some kind of prize together gives the overall chance of winning something. So a player buying one ticket each week would expect to win a prize (most likely a £10 prize for matching three numbers) about once a year.
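These lottery probabilities are easy to verify in R:

choose(49, 6)        # 13983816 possible selections of six numbers
1 / choose(49, 6)    # P(jackpot), about 7.15e-08
6 / choose(49, 6)    # P(5 matches + bonus), about 4.29e-07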
Applications of conditional probability occur naturally in actuarial science and medical studies, where conditional probabilities such as “what is the probability that a person will survive for another 20 years given that they are still alive at the age of 40?” are calculated.
In many real problems, one has to determine the probability of an event A when one already
has some partial knowledge of the outcome of an experiment, i.e. another event B has already
occurred. For this, one needs to find the conditional probability.
♥ Example 14 Dice throw continued Return to the rolling of a fair die (Example 5). Let A = {a number larger than 3} = {4, 5, 6} and B = {an even number} = {2, 4, 6}. It is clear that P{B} = 3/6 = 1/2. This is the unconditional probability of the event B. It is sometimes called the prior probability of B.
However, suppose that we are told that the event A has already occurred. What is the probability of B now, given that A has already happened?
The sample space of the experiment is S = {1, 2, 3, 4, 5, 6}, which contains n = 6 equally likely
outcomes.
Given the partial knowledge that event A has occurred, only the nA = 3 outcomes in A =
{4, 5, 6} could have occurred. However, only some of the outcomes in B among these nA outcomes
in A will make event B occur; the number of such outcomes is given by the number of outcomes
nA∩B in both A and B, i.e., A ∩ B, and equal to 2. Hence the probability of B, given the partial
knowledge that event A has occurred, is equal to
$$\frac{2}{3} = \frac{n_{A\cap B}}{n_A} = \frac{n_{A\cap B}/n}{n_A/n} = \frac{P\{A \cap B\}}{P\{A\}}.$$
Hence we say that P{B|A} = 2/3, which is often interpreted as the posterior probability of B given A. The additional knowledge that A has already occurred has helped us to revise the prior probability of 1/2 to 2/3.
This simple example leads to the following general definition of conditional probability.
♥ Example 15 Of all individuals buying a mobile phone, 60% include a 64GB hard disk in their
purchase, 40% include a 16 MP camera and 30% include both. If a randomly selected purchase
includes a 16 MP camera, what is the probability that a 64GB hard disk is also included?
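As a quick check of the definition of conditional probability, write A = {64GB hard disk included} and B = {16 MP camera included}; then
$$P\{A|B\} = \frac{P\{A \cap B\}}{P\{B\}} = \frac{0.30}{0.40} = 0.75.$$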
Hence the multiplication rule of conditional probability for two events is:
♥ Example 17 Phones Suppose that in our world there are only three phone manufacturing
companies: A Pale, B Sung and C Windows, and their market shares are respectively 30, 40 and 30
percent. Suppose also that respectively 5, 8, and 10 percent of their phones become faulty within
one year. If I buy a phone randomly (ignoring the manufacturer), what is the probability that my
phone will develop a fault within one year? After finding the probability, suppose that my phone
developed a fault in the first year - what is the probability that it was made by A Pale?
To answer this type of question, we derive two of the most useful results in probability theory:
the total probability formula and the Bayes theorem. First, let us derive the total probability
formula.
$$B_i \cap B_j = \emptyset \text{ for all } 1 \le i \ne j \le k, \qquad B_1 \cup B_2 \cup \ldots \cup B_k = S.$$
Then $A = A \cap S = (A \cap B_1) \cup (A \cap B_2) \cup \ldots \cup (A \cap B_k)$, so that
$$P\{A\} = P\{A \cap B_1\} + P\{A \cap B_2\} + \ldots + P\{A \cap B_k\} = P\{B_1\}P\{A|B_1\} + P\{B_2\}P\{A|B_2\} + \ldots + P\{B_k\}P\{A|B_k\};$$
this last expression is called the total probability formula for P{A}.
Figure 2.2: The left figure shows the mutually exclusive and exhaustive events B1 , . . . , B6 (they
form a partition of the sample space); the right figure shows a possible event A.
♥ Example 18 Phones continued We can now find the probability of the event, say A, that
a randomly selected phone develops a fault within one year. Let B1 , B2 , B3 be the events that the
phone is manufactured respectively by companies A Pale, B Sung and C Windows. Then we have:
Now suppose that my phone has developed a fault within one year. What is the probability that
it was manufactured by A Pale? To answer this we need to introduce the Bayes Theorem.
$$P\{B_i|A\} = \frac{P\{B_i\}P\{A|B_i\}}{\sum_{j=1}^{k} P\{B_j\}P\{A|B_j\}}.$$
The Bayes theorem follows directly by substituting P{A} by the total probability formula. The probability P{Bi|A} is called the posterior probability of Bi and P{Bi} is called the prior probability. The Bayes theorem is the rule that converts the prior probability into the posterior probability by using the additional information that some other event, A above, has already occurred.
♥ Example 19 Phones continued The probability that my faulty phone was manufactured by A Pale is
$$P\{B_1|A\} = \frac{P\{B_1\}P\{A|B_1\}}{P\{A\}} = \frac{0.30 \times 0.05}{0.077} = 0.1948.$$
Similarly, the probability that the faulty phone was manufactured by B Sung is 0.4156, and the probability that it was manufactured by C Windows is 1 − 0.1948 − 0.4156 = 0.3896.
The worked examples section contains further illustrations of the Bayes theorem. Note that $\sum_{i=1}^{k} P\{B_i|A\} = 1$. Why? Nowadays the Bayes theorem is used to make statistical inference as well.
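As a sketch, the whole calculation for the phones example can be done in R in a few lines:

prior <- c(0.30, 0.40, 0.30)            # market shares: A Pale, B Sung, C Windows
faulty <- c(0.05, 0.08, 0.10)           # P(fault within one year | manufacturer)
p_fault <- sum(prior * faulty)          # total probability formula: 0.077
posterior <- prior * faulty / p_fault   # Bayes theorem: 0.1948 0.4156 0.3896
sum(posterior)                          # 1, as it must be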
2.4.2 Definition
We have seen examples where prior knowledge that an event A has occurred has changed the probability that event B occurs. There are many situations where this does not happen. The events are then said to be independent.
Intuitively, events A and B are independent if the occurrence of one event does not affect the
probability that the other event occurs.
P {B|A} = P {B}, where P {A} > 0, and P {A|B} = P {A}, where P {B} > 0.
For this, we have P{A ∩ B} = P{either a 4 or 6 thrown} = 1/3, but P{A} = 1/2 and P{B} = 1/2, so that P{A}P{B} = 1/4 ≠ 1/3 = P{A ∩ B}. Therefore A and B are not independent events.
Note that independence is not the same as the mutually exclusive property. When two events, A
and B, are mutually exclusive, the probability of their intersection, A∩B, is zero, i.e. P {A∩B} = 0.
But if the two events are independent then P {A ∩ B} = P {A} × P {B}.
Independence is often assumed on physical grounds, although sometimes incorrectly. There are serious consequences for wrongly assuming independence, e.g. the financial crisis in 2008. However, when the events are independent, the simpler product formula for the joint probability can be used.
♥ Example 21 Two fair dice when shaken together are assumed to behave independently. Hence
the probability of two sixes is 1/6 × 1/6 = 1/36.
♥ Example 22 Assessing risk in legal cases In recent years there have been some disastrous
miscarriages of justice as a result of incorrect assumption of independence. Please read “Incorrect
use of independence – Sally Clark Case” on Blackboard.
$$\begin{aligned}
P\{A' \cap B'\} &= 1 - P\{A \cup B\}\\
&= 1 - [P\{A\} + P\{B\} - P\{A \cap B\}]\\
&= 1 - [P\{A\} + P\{B\} - P\{A\}P\{B\}]\\
&= [1 - P\{A\}] - P\{B\}[1 - P\{A\}]\\
&= [1 - P\{A\}][1 - P\{B\}] = P\{A'\}P\{B'\}.
\end{aligned}$$
The ideas of conditional probability and independence can be extended to more than two events.
Note that (2.1) does NOT imply (2.2), as shown by the next example. Hence, to show the
independence of A, B and C, it is necessary to show that both (2.1) and (2.2) hold.
♥ Example 23 A box contains eight tickets, each labelled with a binary number. Two are
labelled with the binary number 111, two are labelled with 100, two with 010 and two with 001.
An experiment consists of drawing one ticket at random from the box.
Let A be the event “the first digit is 1”, B the event “the second digit is 1” and C be the event
“the third digit is 1”. It is clear that P {A} = P {B} = P {C} = 4/8 = 1/2 and P {A ∩ B} =
P {A ∩ C} = P {B ∩ C} = 1/4, so the events are pairwise independent, i.e. (2.1) holds. However
P{A ∩ B ∩ C} = 2/8 ≠ P{A}P{B}P{C} = 1/8. So (2.2) does not hold and A, B and C are not independent.
Bernoulli trials The notion of independent events naturally leads to a set of independent trials (or random experiments, e.g. repeated coin tossing). A set of independent trials, where each trial has only two possible outcomes, conveniently called success (S) and failure (F), and the probability of success is the same in each trial, is called a set of Bernoulli trials. There are lots of fun examples involving Bernoulli trials.
♥ Example 24 Feller’s road crossing example The flow of traffic at a certain street crossing
is such that the probability of a car passing during any given second is p and cars arrive randomly,
i.e. there is no interaction between the passing of cars at different seconds. Treating seconds as
indivisible time units, and supposing that a pedestrian can cross the street only if no car is to
pass during the next three seconds, find the probability that the pedestrian has to wait for exactly
k = 0, 1, 2, 3, 4 seconds.
Let Ci denote the event that a car comes in the ith second and let Ni denote the event that no
car arrives in the ith second.
1. Consider k = 0. The pedestrian does not have to wait if and only if there are no cars in the
next three seconds, i.e. the event N1 N2 N3 . Now the arrival of the cars in successive seconds
are independent and the probability of no car coming in any second is q = 1 − p. Hence the
answer is P{N1 N2 N3} = q · q · q = q³.
2. Consider k = 1. The person has to wait for one second if there is a car in the first second and none in the next three, i.e. the event C1 N2 N3 N4. Hence the probability of that is pq³.
3. Consider k = 2. The person has to wait two seconds if and only if there is a car in the 2nd
second but none in the next three. It does not matter if there is a car or none in the first
second. Hence:
P{wait 2 seconds} = P{C1 C2 N3 N4 N5} + P{N1 C2 N3 N4 N5} = p · p · q³ + q · p · q³ = pq³.
4. Consider k = 3. The person has to wait for three seconds if and only if a car passes in the
3rd second but none in the next three, C3 N4 N5 N6 . Anything can happen in the first two
seconds, i.e. C1 C2 , C1 N2 , N1 C2 , N1 N2 − all these four cases are mutually exclusive. Hence,
P{wait 3 seconds} = P{C1 C2 C3 N4 N5 N6} + P{N1 C2 C3 N4 N5 N6} + P{C1 N2 C3 N4 N5 N6} + P{N1 N2 C3 N4 N5 N6}
= p · p · p · q³ + p · q · p · q³ + q · p · p · q³ + q · q · p · q³ = pq³.
5. Consider k = 4. This is more complicated because the person has to wait exactly 4 seconds
if and only if a car passes in at least one of the first 3 seconds, one passes at the 4th but none
pass in the next 3 seconds. The probability that at least one passes in the first three seconds
is 1 minus the probability that there is none in the first 3 seconds. This probability is 1 − q³. Hence the answer is (1 − q³)pq³.
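As a sketch, the five probabilities can be computed in R for any chosen value of p; the value p = 0.25 below is only an illustrative assumption:

p <- 0.25; q <- 1 - p            # assumed probability of a car in any given second
c(wait0 = q^3,                   # no car in the next three seconds
  wait1 = p * q^3,
  wait2 = p * q^3,
  wait3 = p * q^3,
  wait4 = (1 - q^3) * p * q^3)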
The reliability gets lower when components are included in series. For n components in series, P{system works} = p1 p2 · · · pn. When pi = p for all i, the reliability of a series of n components is P{system works} = pⁿ.
This is greater than either p1 or p2 so that the inclusion of a (redundant) component in parallel
increases the reliability of the system. Another way of arriving at this result uses complementary
events:
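As a sketch, assuming the standard complementary-events formula P{system works} = 1 − (1 − p1)(1 − p2) for two independent components in parallel, the two arrangements compare as follows in R (the component reliabilities 0.9 and 0.8 are illustrative assumptions):

p1 <- 0.9; p2 <- 0.8             # assumed component reliabilities
p1 * p2                          # series: both must work, 0.72
1 - (1 - p1) * (1 - p2)          # parallel: at least one must work, 0.98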
A general system
The ideas above can be combined to evaluate the reliability of more complex systems.
♥ Example 25 Switches Six switches make up the circuit shown in the graph.
Each has the probability pi = P {Di } of closing correctly; the mechanisms are independent; all
are operated by the same impulse. Then
There are some additional examples of reliability applications given in the “Reliability Exam-
ples” document available on Blackboard. You are advised to read through and understand these
additional examples/applications.
It is unlikely that truthful answers will be given in an open questionnaire, even if it is stressed
that the responses would be treated with anonymity. Some years ago a randomised response
technique was introduced to overcome this difficulty. This is a simple application of conditional
probability. It ensures that the interviewee can answer truthfully without the interviewer (or any-
one else) knowing the answer to the sensitive question. How? Consider two alternative questions,
for example:
Question 1 should not be contentious and should not be such that the interviewer could find
out the true answer.
The respondent answers only 1 of the two questions. Which question is answered by the respondent is determined by a randomisation device, the result of which is known only to the respondent.
The interviewer records only whether the answer given was Yes or No (and he/she does not know
which question has been answered). The proportion of Yes answers to the question of interest can
be estimated from the total proportion of Yes answers obtained. Carry out this simple experiment:
Toss a coin - do not reveal the result of the coin toss!
If heads - answer Question 1: Was your mother born in January?
If tails - answer Question 2: Have you ever taken illegal substances in the last 12 months?
We need to record the following information for the outcome of the experiment:
Total number in sample = n;
Total answering Yes = r, so that an estimate of P {Yes} is r/n.
This information can be used to estimate the proportion of Yes answers to the main question
of interest, Question 2.
Suppose that
Then, assuming that the coin was unbiased, P {Q1 } = 0.5 and P {Q2 } = 0.5. Also, assuming that
birthdays of mothers are evenly distributed over the months, we have that the probability that the
interviewee will answer Yes to Q1 is 1/12. Let Y be the event that a ‘Yes’ answer is given. Then
the total probability formula gives
which leads to
$$\frac{r}{n} \approx \frac{1}{2} \times \frac{1}{12} + \frac{1}{2} \times P\{Y|Q_2\}.$$
Hence
$$P\{Y|Q_2\} \approx 2 \cdot \frac{r}{n} - \frac{1}{12}.$$
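As a sketch, with assumed (purely illustrative) survey totals the estimate is computed as:

n <- 200; r <- 40            # assumed sample size and number of Yes answers
2 * (r / n) - 1 / 12         # estimated P(Yes | Q2), about 0.317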
We, however, will move on to the next chapter on random variables, which formalises the
concepts of probabilities in structured practical cases. The concept of random variables allows us
to calculate probabilities of random events much more easily in structured ways.
Chapter 3
Random Variables and Their Probability Distributions
Chapter mission
Last chapter's combinatorial probabilities are difficult to find and very problem-specific. Instead, in this chapter we shall find easier ways to calculate probability in structured cases. The outcomes of random experiments will be represented as values of a variable, which will be random since the outcomes are random (or unpredictable with certainty). In so doing, we will make our life a lot easier in calculating probabilities in many stylised situations which represent reality. For example, we shall learn to calculate the probability that a computer will make fewer than 10 errors while making $10^{15}$ computations when it has a very tiny chance, $10^{-14}$, of making an erroneous computation.
3.1.2 Introduction
A random variable defines a mapping of the sample space, consisting of all possible outcomes of a random experiment, to the set of real numbers. For example, I toss a coin. Assuming
the coin is fair, there are two possible equally likely outcomes: head or tail. These two outcomes
must be mapped to real numbers. For convenience, I may define the mapping which assigns the
value 1 if head turns up and 0 otherwise. Hence, we have the mapping:
Head → 1, Tail → 0.
We can conveniently denote the random variable by X which is the number of heads obtained by
tossing a single coin. Obviously, all possible values of X are 0 and 1.
You will say that this is a trivial example. Indeed it is. But it is very easy to generalise the
concept of random variables. Simply define a mapping of the outcomes of a random experiment
to the real number space. For example, I toss the coin n times and count the number of heads and denote that by X. Obviously, X can take any integer value between 0 and n. Among other examples, suppose I select a University of Southampton student at random and
measure their height. The outcome in metres will be a number between one metre and two metres
for sure. But I can’t exactly tell which value it will be since I do not know which student will be
selected in the first place. However, when a student has been selected I can measure their height
and get a value such as 1.432 metres.
We now introduce two notations: X (or in general the capital letters Y , Z etc.) to denote the
random variable, e.g. height of a randomly selected student, and the corresponding lower case letter
x (y, z) to denote a particular value, e.g. 1.432 metres. We will follow this convention throughout.
For a random variable, say X, we will also adopt the notation P (X ∈ A), read probability that X
belongs to A, instead of the previous P {A} for any event A.
When the random variable can take any value on the real line it is called a continuous random
variable. For example, the height of a randomly selected student. A random variable can also take
a mixture of discrete and continuous values, e.g. volume of precipitation collected in a day; some
days it could be zero, on other days it could be a continuous measurement, e.g. 1.234 mm.
This is an example of the Bernoulli distribution with parameter p, perhaps the simplest discrete
distribution.
♥ Example 26 Suppose we consider tossing the coin twice and again defining the random variable
X to be the number of heads obtained. The values that X can take are 0, 1 and 2 with probabilities
(1 − p)2 , 2p(1 − p) and p2 , respectively. Here the distribution is:
Value (x)    P(X = x)
0            (1 − p)²
1            2p(1 − p)
2            p²
Total        1
This is a particular case of the Binomial distribution. We will learn about it soon.
In general, for a discrete random variable we define a function f (x) to denote P (X = x) (or
f (y) to denote P (Y = y)) and call the function f (x) the probability function (pf ) or probability
mass function (pmf ) of the random variable X. Arbitrary functions cannot be a pmf since the
total probability must be 1 and all probabilities are non-negative. Hence, for f (x) to be the pmf
of a random variable X, we require:
Note that f (x) = 0 for any other value of x and thus f (x) is a discrete function of x.
For a continuous random variable, P (X = x) is defined to be zero since we assume that the
measurements are continuous and there is zero probability of observing a particular value, e.g. 1.2.
The argument goes that a finer measuring instrument will give us an even more precise measurement than 1.2, and so on. Thus for a continuous random variable we adopt the convention that
P (X = x) = 0 for any particular value x on the real line. But we define probabilities for positive
length intervals, e.g. P (1.2 < X < 1.9).
For a continuous random variable X we define its probability by using a continuous function
f (x) which we call its probability density function, abbreviated as its pdf. With the pdf we define
probabilities as integrals, e.g.
$$P(a < X < b) = \int_a^b f(u)\,du,$$
which is naturally interpreted as the area under the curve f (x) inside the interval (a, b). Recall
that we do not use f (x) = P (X = x) for any x as by convention we set P (X = x) = 0.
Figure 3.1: The shaded area is P (a < X < b) if the pdf of X is the drawn curve.
Since we are dealing with probabilities which are always between 0 and 1, just any arbitrary
function f (x) cannot be a pdf of some random variable. For f (x) to be a pdf, as in the discrete
case, we must have:
(i) the probabilities are non-negative, and (ii) the total probability must be 1,
♥ Example 27 Let X be the number of heads in the experiment of tossing two fair coins. Then
the probability function is
Note that the cdf for a discrete random variable is a step function. The jump-points are the
possible values of the random variable (r.v.), and the height of a jump gives the probability of
the random variable taking that value. It is clear that the probability mass function is uniquely
determined by the cdf.
$$f(x) = \frac{dF(x)}{dx};$$
that is, for a continuous random variable the pdf is the derivative of the cdf. Also, for a continuous random variable X, P(c < X < d) = F(d) − F(c). Let us consider an example.
♥ Example 28 Uniform distribution Suppose
$$f(x) = \begin{cases} \dfrac{1}{b-a} & \text{if } a < x < b, \\ 0 & \text{otherwise.} \end{cases}$$
We now have the cdf $F(x) = \int_a^x \frac{du}{b-a} = \frac{x-a}{b-a}$ for $a < x < b$. A quick check confirms that $F'(x) = f(x)$. If a = 0 and b = 1, then P(0.5 < X < 0.75) = F(0.75) − F(0.5) = 0.25. We shall see many more examples later.
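A quick numerical check in R:

punif(0.75, min = 0, max = 1) - punif(0.5, min = 0, max = 1)   # 0.25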
♥ Example 30 Continuous Consider the uniform distribution which has the pdf $f(x) = \frac{1}{b-a}$, $a < x < b$. Then
$$E(X) = \int_{-\infty}^{\infty} x\,f(x)\,dx = \int_a^b \frac{x}{b-a}\,dx = \frac{b^2 - a^2}{2(b-a)} = \frac{b+a}{2}.$$
If Y = g(X) for any function g(·), then Y is a random variable as well. To find E(Y ) we simply
use the value times probability rule, i.e. the expected value of Y is either sum or integral of its
value, g(x) times probability f (x).
$$E(Y) = E(g(X)) = \begin{cases} \displaystyle\sum_{\text{all } x} g(x)\,f(x) & \text{if } X \text{ is discrete}, \\[2ex] \displaystyle\int_{-\infty}^{\infty} g(x)\,f(x)\,dx & \text{if } X \text{ is continuous}. \end{cases}$$
For example, if X is continuous, then $E(X^2) = \int_{-\infty}^{\infty} x^2 f(x)\,dx$. We prove one important property of expectation, namely that expectation is a linear operator: for constants a and b, E(aX + b) = aE(X) + b. This follows by applying the value-times-probability definition of the expectation and the fact that the total probability is 1 ($\int_{-\infty}^{\infty} f(x)\,dx = 1$). This is very convenient, e.g. suppose E(X) = 5 and Y = −2X + 549; then E(Y) = −2 × 5 + 549 = 539.
where µ = E(X), and when the sum or integral exists. They can’t always be assumed to exist!
When the variance exists, it is the expectation of (X − µ)², where µ is the mean of X. We now derive an easy formula to calculate the variance: Var(X) = E{(X − µ)²} = E(X²) − 2µE(X) + µ² = E(X²) − µ².
We usually denote the variance by σ². The square is there to emphasise that the variance of any
random variable is always non-negative. When can the variance be zero? When there is no variation
at all in the random variable, i.e. it takes only a single value µ with probability 1. Hence, there is
nothing random about the random variable – we can predict its outcome with certainty.
The square root of the variance is called the standard deviation of the
random variable.
♥ Example 31 Uniform Consider the uniform distribution which has the pdf $f(x) = \frac{1}{b-a}$, $a < x < b$. Then
$$E(X^2) = \int_a^b \frac{x^2}{b-a}\,dx = \frac{b^3 - a^3}{3(b-a)} = \frac{b^2 + ab + a^2}{3}.$$
Hence
$$\mathrm{Var}(X) = \frac{b^2 + ab + a^2}{3} - \left(\frac{b+a}{2}\right)^2 = \frac{(b-a)^2}{12},$$
after simplification.
$$\mathrm{Var}(Y) = E\left(Y - E(Y)\right)^2 = \int_{-\infty}^{\infty} (ax + b - a\mu - b)^2 f(x)\,dx = a^2 \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx = a^2\,\mathrm{Var}(X).$$
This is a very useful result. For example, suppose Var(X) = 25 and Y = −X + 5,000,000; then Var(Y) = Var(X) = 25 and the standard deviation is σ = 5. In words, a location shift b does not change the variance, but a multiplicative constant a gets squared in the variance: Var(aX + b) = a²Var(X).
An outcome of the experiment (of carrying out n such independent trials) is represented by a
sequence of S’s and F ’s (such as SS...F S...SF ) that comprises x S’s, and (n − x) F ’s.
For this sequence, X = x, but there are many other sequences which will also give X = x. In fact there are $\binom{n}{x}$ such sequences. Hence
$$P(X = x) = \binom{n}{x} p^x (1-p)^{n-x}, \quad x = 0, 1, \ldots, n.$$
This is the pmf of the Binomial Distribution with parameters n and p, often written as Bin(n, p).
How can we guarantee that $\sum_{x=0}^{n} P(X = x) = 1$? This guarantee is provided by the binomial theorem:
$$(a + b)^n = b^n + \binom{n}{1} a b^{n-1} + \cdots + \binom{n}{x} a^x b^{n-x} + \cdots + a^n.$$
To prove that $\sum_{x=0}^{n} P(X = x) = 1$, i.e. that $\sum_{x=0}^{n} \binom{n}{x} p^x (1-p)^{n-x} = 1$, choose a = p and b = 1 − p.
♥ Example 32 Suppose that widgets are manufactured in a mass production process with 1%
defective. The widgets are packaged in bags of 10 with a money-back guarantee if more than 1
widget per bag is defective. For what proportion of bags would the company have to provide a
refund?
Firstly, we want to find the probability that a randomly selected bag has at most 1 defective widget. Note that the number of defective widgets in a bag, X, satisfies X ∼ Bin(n = 10, p = 0.01). So, this probability is equal to
$$P(X \le 1) = P(X = 0) + P(X = 1) = (0.99)^{10} + 10\,(0.01)(0.99)^{9} = 0.9957.$$
Hence the probability that a refund is required is 1 − 0.9957 = 0.0043, i.e. only just over 4 in 1000 bags will incur the refund on average.
In R, the command dbinom calculates P(X = x); for example, for X ∼ Bin(n = 5, p = 0.34) and x = 3, the command dbinom(x=3, size=5, prob=0.34) will return the value $P(X = 3) = \binom{5}{3}(0.34)^3(1 - 0.34)^{5-3}$. The command pbinom returns the cdf, i.e. the probability up to and including the argument. Thus pbinom(q=3, size=5, prob=0.34) will return the value of P(X ≤ 3) when X ∼ Bin(n = 5, p = 0.34). As a check, in the above example the command is pbinom(q=1, size=10, prob=0.01), which returns 0.9957338.
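As a small numerical illustration of the pmf, and of the mean formula E(X) = np proved below, a sketch in R:

x <- 0:5
px <- dbinom(x, size = 5, prob = 0.34)   # P(X = 0), ..., P(X = 5) for Bin(5, 0.34)
sum(px)        # 1, the total probability
sum(x * px)    # 1.7, which equals n * p = 5 * 0.34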
♥ Example 33 A binomial random variable can also be described using the urn model. Suppose
we have an urn (population) containing N individuals, a proportion p of which are of type S and a
proportion 1 − p of type F . If we select a sample of n individuals at random with replacement,
then the number, X, of type S individuals in the sample follows the binomial distribution with
parameters n and p.
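To connect the urn model with the pmf, the following R sketch simulates sampling with replacement from a finite population and compares the empirical distribution of the number of type S individuals with dbinom; the population size N, proportion p, sample size n and number of replications are illustrative choices, not values from the notes.

# Simulate the urn model: sample n individuals with replacement from a
# population containing a proportion p of type S, and count the S's
set.seed(1)                           # for reproducibility
N <- 1000; p <- 0.3; n <- 8           # illustrative values
nS <- round(N * p)
urn <- c(rep("S", nS), rep("F", N - nS))
x <- replicate(10000, sum(sample(urn, n, replace = TRUE) == "S"))
# Empirical proportions should be close to the Bin(n, p) pmf
round(rbind(empirical   = table(factor(x, levels = 0:n)) / 10000,
            theoretical = dbinom(0:n, size = n, prob = p)), 3)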
Mean of the Binomial distribution
Let X ∼ Bin(n, p). We have
\[
E(X) = \sum_{x=0}^{n} x\,P(X = x) = \sum_{x=0}^{n} x \binom{n}{x} p^x (1-p)^{n-x}.
\]
Below we prove that E(X) = np. Recall that k! = k(k − 1)! for any k > 0.
\[
\begin{aligned}
E(X) &= \sum_{x=0}^{n} x \binom{n}{x} p^x (1-p)^{n-x} \\
&= \sum_{x=1}^{n} x\, \frac{n!}{x!\,(n-x)!}\, p^x (1-p)^{n-x} \\
&= \sum_{x=1}^{n} \frac{n!}{(x-1)!\,(n-x)!}\, p^x (1-p)^{n-x} \\
&= np \sum_{x=1}^{n} \frac{(n-1)!}{(x-1)!\,(n-1-(x-1))!}\, p^{x-1} (1-p)^{n-1-(x-1)} \\
&= np \sum_{y=0}^{n-1} \frac{(n-1)!}{y!\,(n-1-y)!}\, p^{y} (1-p)^{n-1-y} \\
&= np\,(p + 1 - p)^{n-1} = np,
\end{aligned}
\]
where we used the substitution y = x − 1 and then the binomial theorem to conclude that the last
sum is equal to 1.
It is illuminating to see these direct proofs. Later on we shall apply statistical theory to directly
prove these! Notice that the binomial theorem is used repeatedly to prove the results.
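As a quick numerical sanity check of E(X) = np (not part of the proof), one can sum x·P(X = x) in R for illustrative values of n and p:

# Numerically verify E(X) = np for a Bin(n, p) distribution
n <- 10; p <- 0.3                                 # illustrative values
sum((0:n) * dbinom(0:n, size = n, prob = p))      # should equal n * p = 3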
S       X = 1,   P(X = 1) = p
FS      X = 2,   P(X = 2) = (1 − p)p
FFS     X = 3,   P(X = 3) = (1 − p)²p
FFFS    X = 4,   P(X = 4) = (1 − p)³p
...     ...
In general we have
P (X = x) = (1 − p)x−1 p, x = 1, 2, . . .
This is called the geometric distribution, and it has a (countably) infinite domain starting at 1 not
0. We write X ∼Geo(p).
Let us check that the probability function has the required property:
\[
\begin{aligned}
\sum_{x=1}^{\infty} P(X = x) &= \sum_{x=1}^{\infty} (1-p)^{x-1} p \\
&= p \sum_{y=0}^{\infty} (1-p)^{y} \quad \text{[substitute } y = x-1\text{]} \\
&= p\,\frac{1}{1-(1-p)} \quad \text{[see Section A.5]} \\
&= 1.
\end{aligned}
\]
We can also find the probability that X > k for some given natural number k:
\[
P(X > k) = \sum_{x=k+1}^{\infty} P(X = x) = \sum_{x=k+1}^{\infty} (1-p)^{x-1} p = (1-p)^{k} \sum_{y=0}^{\infty} (1-p)^{y} p = (1-p)^{k}.
\]
This leads to the memoryless property of the geometric distribution: for any positive integers s and k,
\[
P(X > s + k \mid X > k) = P(X > s).
\]
The proof is given below. In practice this means that the random variable does not remember its
age (denoted by k) to determine how long more (denoted by s) it will survive! The proof below
uses the definition of conditional probability
\[
P(A \mid B) = \frac{P(A \cap B)}{P(B)}.
\]
Now the proof:
\[
P(X > s + k \mid X > k) = \frac{P(X > s+k,\, X > k)}{P(X > k)} = \frac{P(X > s+k)}{P(X > k)} = \frac{(1-p)^{s+k}}{(1-p)^{k}} = (1-p)^{s},
\]
which does not depend on k. Note that the event X > s + k and X > k implies and is implied by
X > s + k since s > 0.
For n > 0 and |x| < 1, the negative binomial series is given by:
1 1 n(n + 1)(n + 2) · · · (n + k − 1) k
(1−x)−n = 1+nx+ n(n+1)x2 + n(n+1)(n+2)x3 +· · ·+ x +· · ·
2 6 k!
With n = 2 and x = 1 − p the general term is given by:
n(n + 1)(n + 2)(n + k − 1) 2 × 3 × 4 × · · · × (2 + k − 1)
= = k + 1.
k! k!
Thus E(X) = p(1 − 1 + p)−2 = 1/p. It can be shown that Var(X) = (1 − p)/p2 using negative bino-
mial series. But this is more complicated and is not required. The second-year module MATH2011
will provide an alternative proof.
The proofs of the above results use complicated finite summation and so are omitted. But note
that when N → ∞ the variance converges to the variance of the binomial distribution. Indeed, the
hypergeometric distribution is a finite population analogue of the binomial distribution.
♥ Example 34 In a board game that uses a single fair die, a player cannot start until they have
rolled a six. Let X be the number of rolls needed until they get a six. Then X is a Geometric
random variable with success probability p = 1/6.
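In R, the function dgeom parameterises the geometric distribution by the number of failures before the first success, so our X (the number of rolls, starting at 1) corresponds to X − 1 in R's convention. A small sketch for the die example:

# P(X = x) = (1 - p)^(x - 1) * p with p = 1/6; R's dgeom counts failures,
# so pass x - 1 as its first argument
p <- 1/6
dgeom(0:3, prob = p)          # P(X = 1), P(X = 2), P(X = 3), P(X = 4)
1 - pgeom(5, prob = p)        # P(X > 6): more than six rolls needed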
♥ Example 35 A man plays roulette, betting on red each time. He decides to keep playing until
he achieves his second win. The success probability for each game is 18/37 and the results of games
are independent. Let X be the number of games played until he gets his second win. Then X is
a Negative Binomial random variable with r = 2 and p = 18/37. What is the probability he plays
more than 3 games? i.e. find P (X > 3).
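A hedged R sketch of this calculation: R's pnbinom counts the number of failures before the r-th success, so X (the total number of games) corresponds to X − r in R's convention.

# P(X > 3) for a negative binomial with r = 2 successes, p = 18/37:
# X > 3 games means more than 1 failure before the 2nd success
r <- 2; p <- 18/37
1 - pnbinom(3 - r, size = r, prob = p)   # P(X > 3)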
Derivation of the mean and variance of the negative binomial distribution involves compli-
cated negative binomial series and will be skipped for now, but will be proved in Lecture 17. For
completeness we note down the mean and variance:
\[
E(X) = \frac{r}{p}, \qquad \mathrm{Var}(X) = r\,\frac{1-p}{p^{2}}.
\]
Thus when r = 1, the mean and variance of the negative binomial distribution are equal to those
of the geometric distribution.
\[
e^{-\lambda}\,\frac{\lambda^{x}}{x!}
\]
as n → ∞ for any fixed value of x in the range 0, 1, 2, . . .. Note that we have used the exponential
limit:
\[
e^{-\lambda} = \lim_{n \to \infty} \left(1 - \frac{\lambda}{n}\right)^{n},
\]
and
\[
\lim_{n \to \infty} \left(1 - \frac{\lambda}{n}\right)^{-x} = 1,
\]
and
\[
\lim_{n \to \infty} \frac{n}{n}\cdot\frac{n-1}{n} \cdots \frac{n-x+1}{n} = 1.
\]
A random variable X has the Poisson distribution with parameter λ if it has the pmf:
\[
P(X = x) = e^{-\lambda}\,\frac{\lambda^{x}}{x!}, \quad x = 0, 1, 2, \ldots
\]
We write X ∼ Poisson(λ). It is trivial to show that \sum_{x=0}^{\infty} P(X = x) = 1, i.e. \sum_{x=0}^{\infty} e^{-\lambda} \lambda^{x}/x! = 1. The identity you need is simply the expansion of e^{\lambda}.
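A quick R sketch (with an illustrative λ) confirming that the Poisson probabilities sum to 1 and that the sample mean of simulated values is close to λ:

# Poisson pmf values and a simulation check of the mean
lambda <- 2.5                  # illustrative value
sum(dpois(0:100, lambda))      # essentially 1 (the tail beyond 100 is negligible)
mean(rpois(10000, lambda))     # close to lambda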
Hence, the mean and variance are both equal to λ for the Poisson distribution.
The Poisson distribution can be derived from another consideration when we are waiting for
events to occur, e.g. waiting for a bus to arrive or to be served at a supermarket till. The number
of occurrences in a given time interval can sometimes be modelled by the Poisson distribution.
Here the assumption is that the probability of an event (arrival) is proportional to the length of the
waiting time for small time intervals. Such a process is called a Poisson process, and it can be shown
that the waiting time between successive events can be modelled by the exponential distribution
which is discussed in the next lecture.
is defined to be the gamma function and it has a finite real value. Moreover, we have the following
facts:
\[
\Gamma\!\left(\tfrac{1}{2}\right) = \sqrt{\pi}; \qquad \Gamma(1) = 1; \qquad \Gamma(a) = (a-1)\,\Gamma(a-1) \ \text{ if } a > 1.
\]
The last two facts imply that Γ(k) = (k − 1)! when k is a positive integer. Find Γ(3/2).
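R's gamma function can be used to check such values numerically; for instance, the exercise Γ(3/2) can be checked against the recursion Γ(3/2) = (1/2)Γ(1/2):

# Numerical checks of the gamma function facts
gamma(0.5)     # sqrt(pi) = 1.772454
gamma(1.5)     # equals 0.5 * gamma(0.5) = sqrt(pi)/2
gamma(5)       # (5 - 1)! = 24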
and so Var(X) = E(X 2 ) − [E(X)]2 = 2/θ2 − 1/θ2 = 1/θ2 . Note that for this random variable the
mean is equal to the standard deviation.
We have F (0) = 0 and F (x) → 1 when x → ∞ and F (x) is non-decreasing in x. The cdf can be
used to solve many problems. A few examples follow.
Using R to calculate probabilities
For the exponential distribution the command dexp(x=3, rate=1/2) calculates the pdf at
x = 3. The rate parameter to be supplied is the θ parameter here. The command pexp returns the
cdf or the probability up to and including the argument. Thus pexp(q=3, rate=1/2) will return
the value of P (X ≤ 3) when X ∼ Exponential(θ = 0.5).
♥ Example 36 Mobile phone Suppose that the lifetime of a phone (e.g. the time until the
phone does not function even after repairs), denoted by X, manufactured by the company A Pale,
is exponentially distributed with mean 550 days.
1. Find the probability that a randomly selected phone will still function after two years, i.e.
X > 730? [Assume there is no leap year in the two years].
2. What are the times by which 25%, 50%, 75% and 90% of the manufactured phones will have
failed?
Here the mean 1/θ = 550, hence θ = 1/550 is the rate parameter. The solution to the first problem is
\[
P(X > 730) = e^{-730/550} = e^{-1.327} \approx 0.265.
\]
For the second problem we are given the probabilities of failure (0.25, 0.50 etc.). We will have
to invert the probabilities to find the value of the random variable. In other words, we will have
to find a q such that F (q) = p, where p is the given probability. For example, what value of q will
give us F (q) = 0.25, so that 25% of the phones will have failed by time q?
For a given 0 < p < 1, the pth quantile (or 100p percentile) of the random variable
X with cdf F (x) is defined to be the value q for which F (q) = p.
The 50th percentile is called the median. The 25th and 75th percentiles are called
the quartiles.
♥ Example 37 Uniform distribution Consider the uniform distribution U(a, b) on the interval (a, b). Here F(x) = (x − a)/(b − a). So for a given p, F(q) = p implies q = a + p(b − a).
For the uniform U(a, b) distribution the median is (a + b)/2, and the quartiles are (3a + b)/4 and (a + 3b)/4.
Returning to the exponential distribution example, we have p = F (q) = 1 − e−θq . Find q when
p is given.
\[
\begin{aligned}
p &= 1 - e^{-\theta q} \\
\Rightarrow\; e^{-\theta q} &= 1 - p \\
\Rightarrow\; -\theta q &= \log(1 - p) \\
\Rightarrow\; q &= -\frac{\log(1-p)}{\theta} \\
\Rightarrow\; q &= -550 \times \log(1 - p).
\end{aligned}
\]
Review the rules of log in Section A.6. Now we have the following table:
p q = −550 × log(1 − p)
0.25 158.22
0.50 381.23
0.75 762.46
0.90 1266.422
In R you can find these values by qexp(p=0.25, rate=1/550), qexp(p=0.50, rate=1/550), etc.
For fun, you can find qexp(p=0.99, rate=1/550) = 6 years and 343 days! The function qexp(p,
rate) calculates the 100p percentile of the exponential distribution with parameter rate.
Assuming the mean survival time to be 100 days for a fatal late-detected cancer, we can expect that half of the patients survive 69.3 days after chemotherapy, since qexp(0.50, rate=1/100) = 69.3. You will learn more about this in a third-year module, MATH3085: Survival Models, which is important in actuarial science.
♥ Example 39 Memoryless property Like the geometric distribution, the exponential distri-
bution also has the memoryless property. In simple terms, it means that the probability that the
system will survive an additional period s > 0 given that it has survived up to time t is the same
as the probability that the system survives the period s to begin with. That is, it forgets that it
has survived up to a particular time when it is thinking of its future remaining life time.
The proof is exactly as in the case of the geometric distribution, reproduced below. Recall the
definition of conditional probability:
\[
P(A \mid B) = \frac{P(A \cap B)}{P(B)}.
\]
Now the proof:
\[
P(X > s + t \mid X > t) = \frac{P(X > s+t,\, X > t)}{P(X > t)} = \frac{P(X > s+t)}{P(X > t)} = \frac{e^{-\theta(s+t)}}{e^{-\theta t}} = e^{-\theta s} = P(X > s).
\]
Note that the event X > s + t and X > t implies and is implied by X > s + t since s > 0.
♥ Example 40 The time T between any two successive arrivals in a hospital emergency depart-
ment has probability density function:
\[
f(t) = \begin{cases} \lambda e^{-\lambda t} & \text{if } t \ge 0 \\ 0 & \text{otherwise.} \end{cases}
\]
Historically, on average the mean of these inter-arrival times is 5 minutes. Calculate (i) P(0 < T < 5) and (ii) P(T < 10 | T > 5).
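A hedged R sketch of these two calculations, taking the mean inter-arrival time of 5 minutes to give rate λ = 1/5 (note that, by the memoryless property discussed above, (ii) reduces to P(T < 5)):

# Example 40: exponential inter-arrival times with mean 5, so rate = 1/5
rate <- 1/5
pexp(5, rate)                                              # (i) P(0 < T < 5) = 1 - exp(-1)
(pexp(10, rate) - pexp(5, rate)) / (1 - pexp(5, rate))     # (ii) P(T < 10 | T > 5)
pexp(5, rate)                                              # same answer, by memorylessness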
Suppose all of these hold (since they are proved below). Then it is easy to remember the pdf
of the normal distribution:
\[
f(\text{variable}) = \frac{1}{\sqrt{2\pi\,\text{variance}}}\, \exp\left\{ -\frac{(\text{variable} - \text{mean})^2}{2\,\text{variance}} \right\},
\]
where variable denotes the random variable. The density (pdf) is much easier to remember and
work with when the mean µ = 0 and variance σ 2 = 1. In this case, we simply write:
\[
f(x) = \frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{x^2}{2}\right\} \quad \text{or} \quad f(\text{variable}) = \frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{\text{variable}^2}{2}\right\}.
\]
Now let us prove the 3 assertions, R1, R2 and R3. R1 is proved as follows:
\[
\begin{aligned}
\int_{-\infty}^{\infty} f(x)\,dx &= \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\} dx \\
&= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} \exp\left\{-\frac{z^2}{2}\right\} dz \quad [\text{substitute } z = \tfrac{x-\mu}{\sigma} \text{ so that } dx = \sigma\,dz] \\
&= \frac{1}{\sqrt{2\pi}}\; 2\!\int_{0}^{\infty} \exp\left\{-\frac{z^2}{2}\right\} dz \quad [\text{since the integrand is an even function}] \\
&= \frac{1}{\sqrt{2\pi}}\; 2\!\int_{0}^{\infty} \exp\{-u\}\, \frac{du}{\sqrt{2u}} \quad [\text{substitute } u = \tfrac{z^2}{2} \text{ so that } z = \sqrt{2u} \text{ and } dz = \tfrac{du}{\sqrt{2u}}] \\
&= \frac{1}{\sqrt{\pi}} \int_{0}^{\infty} u^{\frac{1}{2}-1} \exp\{-u\}\, du \quad [\text{rearrange the terms}] \\
&= \frac{1}{\sqrt{\pi}}\, \Gamma\!\left(\tfrac{1}{2}\right) \quad [\text{recall the definition of the gamma function}] \\
&= \frac{1}{\sqrt{\pi}}\, \sqrt{\pi} = 1 \quad [\text{as } \Gamma(\tfrac{1}{2}) = \sqrt{\pi}].
\end{aligned}
\]
\[
\text{(i)}\quad X \sim N(\mu, \sigma^2) \;\Longleftrightarrow\; Z \equiv \frac{X-\mu}{\sigma} \sim N(0,1). \tag{3.2}
\]
Then by the linearity of expectations, i.e. if X = µ + σZ for constants µ and σ then E(X) =
µ + σE(Z) = µ, the result follows. To prove (3.2), we first calculate the cdf, given by:
\[
\begin{aligned}
\Phi(z) &= P(Z \le z) = P\!\left(\frac{X-\mu}{\sigma} \le z\right) = P(X \le \mu + z\sigma) \\
&= \int_{-\infty}^{\mu + z\sigma} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\} dx \\
&= \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{u^2}{2}\right\} du, \quad [u = (x-\mu)/\sigma]
\end{aligned}
\]
so that
\[
\frac{d\Phi(z)}{dz} = \frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{z^2}{2}\right\} \quad \text{for } -\infty < z < \infty,
\]
by the fundamental theorem of calculus. This proves that Z ∼ N (0, 1). The converse is proved just
by reversing the steps. Thus we have proved (i) above. We use the Φ(·) notation to denote the cdf
of the standard normal distribution. Now:
\[
E(Z) = \int_{-\infty}^{\infty} z f(z)\,dz = \int_{-\infty}^{\infty} z\, \frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{z^2}{2}\right\} dz = \frac{1}{\sqrt{2\pi}} \times 0 = 0,
\]
since the integrand g(z) = z\,\exp\{-z^2/2\} is an odd function, i.e. g(−z) = −g(z); for an odd function g(z), \int_{-a}^{a} g(z)\,dz = 0 for any a. Therefore we have also proved (3.3) and hence R2.
To prove R3, i.e. Var(X) = σ², we show that Var(Z) = 1 where Z = (X − µ)/σ and then claim that Var(X) = σ²Var(Z) = σ² from our earlier result. Since E(Z) = 0, Var(Z) = E(Z²), which is
calculated below:
\[
\begin{aligned}
E(Z^2) &= \int_{-\infty}^{\infty} z^2 f(z)\,dz \\
&= \int_{-\infty}^{\infty} z^2\, \frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{z^2}{2}\right\} dz \\
&= \frac{2}{\sqrt{2\pi}} \int_{0}^{\infty} z^2 \exp\left\{-\frac{z^2}{2}\right\} dz \quad [\text{since the integrand is an even function}] \\
&= \frac{2}{\sqrt{2\pi}} \int_{0}^{\infty} 2u\, \exp\{-u\}\, \frac{du}{\sqrt{2u}} \quad [\text{substituted } u = \tfrac{z^2}{2} \text{ so that } z = \sqrt{2u} \text{ and } dz = \tfrac{du}{\sqrt{2u}}] \\
&= \frac{2}{\sqrt{\pi}} \int_{0}^{\infty} u^{\frac{3}{2}-1} \exp\{-u\}\, du \\
&= \frac{2}{\sqrt{\pi}}\, \Gamma\!\left(\tfrac{3}{2}\right) \quad [\text{definition of the gamma function}] \\
&= \frac{2}{\sqrt{\pi}} \left(\tfrac{3}{2}-1\right) \Gamma\!\left(\tfrac{3}{2}-1\right) \quad [\text{reduction property of the gamma function}] \\
&= \frac{2}{\sqrt{\pi}}\, \frac{1}{2}\, \sqrt{\pi} \quad [\text{since } \Gamma(\tfrac{1}{2}) = \sqrt{\pi}] \\
&= 1,
\end{aligned}
\]
distribution because of the following reasons. Suppose X ∼ N (µ, σ 2 ) and we are interested in finding
P (a ≤ X ≤ b) for two constants a and b.
\[
\begin{aligned}
P(a \le X \le b) &= \int_{a}^{b} f(x)\,dx \\
&= \int_{a}^{b} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\} dx \\
&= \int_{\frac{a-\mu}{\sigma}}^{\frac{b-\mu}{\sigma}} \frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{z^2}{2}\right\} dz \quad [\text{substituted } z = \tfrac{x-\mu}{\sigma} \text{ so that } dx = \sigma\,dz] \\
&= \int_{-\infty}^{\frac{b-\mu}{\sigma}} \frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{z^2}{2}\right\} dz - \int_{-\infty}^{\frac{a-\mu}{\sigma}} \frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{z^2}{2}\right\} dz \\
&= P\!\left(Z \le \frac{b-\mu}{\sigma}\right) - P\!\left(Z \le \frac{a-\mu}{\sigma}\right) \\
&= \Phi\!\left(\frac{b-\mu}{\sigma}\right) - \Phi\!\left(\frac{a-\mu}{\sigma}\right).
\end{aligned}
\]
This result allows us to find probabilities about a normal random variable X with any mean µ and variance σ² through the probabilities of the standard normal random variable Z. For this reason, only Φ(z) is tabulated. Furthermore, due to the symmetry of the pdf of Z, Φ(z) is tabulated only for positive z values. Suppose a > 0; then Φ(−a) = 1 − Φ(a).
In R, we use the function pnorm to calculate the probabilities. The general function is: pnorm(q,
mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE). So, we use the command pnorm(1)
to calculate Φ(1) = P (Z ≤ 1). We can also use the command pnorm(15, mean=10, sd=2) to
calculate P (X ≤ 15) when X ∼ N (µ = 10, σ 2 = 4) directly.
1. P (−1 < Z < 1) = Φ(1) − Φ(−1) = 0.6827. This means that 68.27% of the probability lies
within 1 standard deviation of the mean.
2. P (−2 < Z < 2) = Φ(2) − Φ(−2) = 0.9545. This means that 95.45% of the probability lies
within 2 standard deviations of the mean.
3. P (−3 < Z < 3) = Φ(3) − Φ(−3) = 0.9973. This means that 99.73% of the probability lies
within 3 standard deviations of the mean.
We are often interested in the quantiles (the inverse cdf of a probability, Φ^{−1}(·)) of the normal distribution for various reasons. We find the pth quantile by issuing the R command qnorm(p).
1. qnorm(0.95) = Φ−1 (0.95) = 1.645. This means that the 95th percentile of the standard
normal distribution is 1.645. This also means that P (−1.645 < Z < 1.645) = Φ(1.645) −
Φ(−1.645) = 0.90.
2. qnorm(0.975) = Φ−1 (0.975) = 1.96. This means that the 97.5th percentile of the stan-
dard normal distribution is 1.96. This also means that P (−1.96 < Z < 1.96) = Φ(1.96) −
Φ(−1.96) = 0.95.
♥ Example 41 Historically, the marks in MATH1024 follow the normal distribution with mean
58 and standard deviation 32.25.
1. What percentage of students will fail (i.e. score less than 40) in MATH1024? Answer:
pnorm(40, mean=58, sd=32.25) = 28.84%.
2. What percentage of students will get an A result (score greater than 70)? Answer: 1-
pnorm(70, mean=58, sd=32.25) = 35.49%.
3. What is the probability that a randomly selected student will score more than 90? Answer:
1- pnorm(90, mean=58, sd=32.25) = 0.1605.
4. What is the probability that a randomly selected student will score less than 25? Answer:
pnorm(25, mean=58, sd=32.25) = 0.1531. Ouch!
5. What is the probability that a randomly selected student scores a 2:1, (i.e. a mark between
60 and 70)? Left as an exercise.
♥ Example 42 A lecturer set and marked an examination and found that the distribution
of marks was N (42, 142 ). The school’s policy is to present scaled marks whose distribution is
N (50, 152 ). What linear transformation should the lecturer apply to the raw marks to accomplish
this and what would the raw mark of 40 be transformed to?
Suppose X ∼ N (µx = 42, σx2 = 142 ) and Y ∼ N (µy = 50, σy2 = 152 ). Hence, we should have
\[
Z = \frac{X - \mu_x}{\sigma_x} = \frac{Y - \mu_y}{\sigma_y},
\]
giving us:
\[
Y = \mu_y + \frac{\sigma_y}{\sigma_x}(X - \mu_x) = 50 + \frac{15}{14}(X - 42).
\]
Now at raw mark X = 40, the transformed mark would be:
\[
Y = 50 + \frac{15}{14}(40 - 42) = 47.86.
\]
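A one-line R sketch of this transformation (the function name scale_mark is ours, purely illustrative):

# Scale raw marks from N(42, 14^2) to the N(50, 15^2) scale
scale_mark <- function(x) 50 + (15 / 14) * (x - 42)
scale_mark(40)   # 47.857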
Suppose X ∼ N(µ, σ²) and let Y = exp(X), which has the log-normal distribution. Then
\[
\begin{aligned}
E(Y) &= E[\exp(X)] \\
&= \int_{-\infty}^{\infty} \exp(x)\, \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\} dx \\
&= \exp\left\{-\frac{\mu^2 - (\mu+\sigma^2)^2}{2\sigma^2}\right\} \int_{-\infty}^{\infty} \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{-\frac{x^2 - 2(\mu+\sigma^2)x + (\mu+\sigma^2)^2}{2\sigma^2}\right\} dx \\
&= \exp\left\{-\frac{\mu^2 - (\mu+\sigma^2)^2}{2\sigma^2}\right\} \quad [\text{integrating a } N(\mu+\sigma^2,\, \sigma^2) \text{ pdf over its domain}] \\
&= \exp\left\{\mu + \sigma^2/2\right\}.
\end{aligned}
\]
Similarly,
\[
E(Y^2) = E[\exp(2X)] = \int_{-\infty}^{\infty} \exp(2x)\, \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\} dx = \cdots = \exp\left\{2\mu + 2\sigma^2\right\}.
\]
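A hedged Monte Carlo check of E[exp(X)] = exp(µ + σ²/2) in R, with illustrative parameter values:

# Simulate X ~ N(mu, sigma^2) and compare the sample mean of exp(X)
# with the theoretical value exp(mu + sigma^2 / 2)
set.seed(2)
mu <- 1; sigma <- 0.5
x <- rnorm(1e6, mean = mu, sd = sigma)
mean(exp(x))              # simulated E[exp(X)]
exp(mu + sigma^2 / 2)     # theoretical value, about 3.08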
(i) f (x, y) ≥ 0
The marginal probability mass functions (marginal pmf’s) of X and Y are respectively
\[
f_X(x) = \sum_{y} f(x, y), \qquad f_Y(y) = \sum_{x} f(x, y).
\]
Use the identity \sum_{x}\sum_{y} f(x, y) = 1 to prove that f_X(x) and f_Y(y) are really pmf's.
♥ Example 44 Suppose that two fair dice are tossed independently one after the other. Let
\[
X = \begin{cases} -1 & \text{if the result from die 1 is larger} \\ \;\;0 & \text{if the results are equal} \\ \;\;1 & \text{if the result from die 1 is smaller.} \end{cases}
\]
Let Y = |difference between the two dice|. There are 36 possible outcomes. Each of them gives
a pair of values of X and Y . Y can take any of the values 0, 1, 2, 3, 4, 5. Construct the joint
probability table for X and Y .
Each pair of results above (and hence pair of values of X and Y ) has the same probability 1/36.
Hence the joint probability table is given in Table 3.1
The marginal probability distributions are just the row totals or column totals depending on
whether you want the marginal distribution of X or Y . For example, the marginal distribution of
X is given in Table 3.2.
Exercises: Write down the marginal distribution of Y and hence find the mean and variance of
Y.
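The joint probability table can also be tabulated in R by enumerating all 36 equally likely outcomes; this is only a sketch to reproduce Table 3.1 and the marginals:

# Enumerate the 36 outcomes of two fair dice and tabulate X and Y
outcomes <- expand.grid(d1 = 1:6, d2 = 1:6)
X <- with(outcomes, sign(d2 - d1))        # -1, 0, 1 as defined above
Y <- with(outcomes, abs(d1 - d2))
table(X, Y) / 36                          # joint probability table
table(X) / 36                             # marginal distribution of X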
How can we show that the above is a pdf? It is non-negative for all x and y values. But does it
integrate to 1? We are going to use the following rule.
Result Suppose that a real-valued function f (x, y) is continuous in a region D where a < x < b
and c < y < d, then
\[
\int\!\!\int_{D} f(x, y)\,dx\,dy = \int_{c}^{d} \left( \int_{a}^{b} f(x, y)\,dx \right) dy.
\]
• Rewrite the region A as an intersection of two one-dimensional intervals. The first interval is
obtained by treating one variable as constant.
♥ Example 46 Continued
\[
\begin{aligned}
\int_{0}^{1}\!\int_{0}^{1} f(x, y)\,dx\,dy &= \int_{0}^{1}\!\int_{0}^{1} 6xy^2\,dx\,dy \\
&= 6 \int_{0}^{1} y^2\,dy \int_{0}^{1} x\,dx \\
&= 3 \int_{0}^{1} y^2\,dy \quad [\text{as } \textstyle\int_{0}^{1} x\,dx = \tfrac{1}{2}] \\
&= 1. \quad [\text{as } \textstyle\int_{0}^{1} y^2\,dy = \tfrac{1}{3}]
\end{aligned}
\]
The probability of any event in the two-dimensional space can be found by integration and again
more details will be provided in a second-year module. You will come across multivariate integrals
in a second semester module. You will not be asked to do bivariate integration in this
module.
We will not consider any continuous examples as the second-year module MATH2011 will study
them in detail.
Suppose that two random variables X and Y have joint pmf or pdf f(x, y) and let E(X) = µx and E(Y) = µy. The covariance between X and Y is defined by
\[
\mathrm{Cov}(X, Y) = E\left[(X - \mu_x)(Y - \mu_y)\right] = E(XY) - \mu_x\mu_y.
\]
Let σx2 = Var(X) = E(X 2 ) − µ2x and σy2 = Var(Y ) = E(Y 2 ) − µ2y . The correlation coefficient
between X and Y is defined by:
\[
\mathrm{Corr}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}} = \frac{E(XY) - \mu_x\mu_y}{\sigma_x\sigma_y}.
\]
It can be proved that for any two random variables, −1 ≤ Corr(X, Y ) ≤ 1. The correlation
Corr(X, Y ) is a measure of linear dependency between two random variables X and Y , and it is
free of the measuring units of X and Y as the units cancel in the ratio.
3.8.4 Independence
Independence is an important concept. Recall that we say two events A and B are independent if P(A ∩ B) = P(A) × P(B). We use the same idea here. Two random variables X and Y having the joint pdf or pmf f(x, y) are said to be independent if and only if
\[
f(x, y) = f_X(x)\, f_Y(y) \quad \text{for all } (x, y).
\]
♥ Example 48 Discrete Case X and Y are independent if each cell probability, f (x, y), is the
product of the corresponding row and column totals. In our very first dice example (Example 44)
X and Y are not independent. Verify that in the following example X and Y are independent. We
need to check all 9 cells.
                 y
           1       2       3     Total
x    0    1/6    1/12    1/12     1/3
     1    1/4    1/8     1/8      1/2
     2    1/12   1/24    1/24     1/6
Total     1/2    1/4     1/4       1
♥ Example 49 Let f (x, y) = 6xy 2 , 0 < x < 1, 0 < y < 1. Check that X and Y are independent.
♥ Example 51 Deceptive
The joint pdf may look like something you can factorise. But X and Y may not be independent
because they may be related in the domain.
1. f(x, y) = (21/4) x² y, x² ≤ y ≤ 1. Not independent!
Consequences of Independence
P (X ∈ A, Y ∈ B) = P (X ∈ A) × P (Y ∈ B)
for any events A and B. That is, the joint probability can be obtained as the product of the marginal probabilities. We will use this result in the next lecture. For example, suppose Jack and Jess are two randomly selected students. Let X denote the height of Jack and Y denote the height of Jess. Then we have, for instance,
\[
P(X > 182,\; Y > 165) = P(X > 182) \times P(Y > 165).
\]
Obviously this has to be true for any numbers other than the example numbers 182 and 165, and for any inequalities.
• Further, let g(x) be a function of x only and h(y) be a function of y only. Then, if X and Y are independent, it is easy to prove that
\[
E\left[g(X)\,h(Y)\right] = E\left[g(X)\right]\, E\left[h(Y)\right].
\]
question, and the sample mean is proportional to the sum of the sample values. By doing this, in
the next lecture we will introduce the widely-used central limit theorem, the normal approximation
to the binomial distribution and so on. In this lecture we will also use this theory to reproduce
some of the results we obtained before, e.g. finding the mean and variance of the binomial and
negative binomial distributions.
3.9.2 Introduction
Suppose we have obtained a random sample from a distribution with pmf or pdf f (x), so that
X can either be a discrete or a continuous random variable. We will learn more about random
sampling in the next chapter. Let X1 , . . . , Xn denote the random sample of size n where n is a
positive integer. We use upper case letters since each member of the random sample is a random
variable. For example, I toss a fair coin n times and let Xi take the value 1 if a head appears in the
ith trial and 0 otherwise. Now I have a random sample X1 , . . . , Xn from the Bernoulli distribution
with probability of success equal to 0.5 since the coin is assumed to be fair.
We can get a random sample from a continuous random variable as well. Suppose it is known
that the distribution of the heights of first-year students is normal with mean 175 centimetres and
standard deviation 8 centimetres. I can randomly select a number of first-year students and record
each student’s height.
Suppose X1 , . . . , Xn is a random sample from a population with distribution f (x). Then it can
be shown that the random variables X1 , . . . , Xn are mutually independent, i.e.
P (X1 ∈ A1 , X2 ∈ A2 , . . . , Xn ∈ An ) = P (X1 ∈ A1 ) × P (X2 ∈ A2 ) × · · · P (Xn ∈ An )
for any set of events, A1 , A2 , . . . An . That is, the joint probability can be obtained as the product
of individual probabilities. An example of this for n = 2 was given in the previous lecture; see the
discussion just below the paragraph Consequences of independence.
♥ Example 52 Distribution of the sum of independent binomial random variables
Suppose X ∼ Bin(m, p) and Y ∼ Bin(n, p) independently. Note that p is the same in both distri-
butions. Using the above fact that joint probability is the multiplication of individual probabilities,
we can conclude that Z = X + Y has the binomial distribution. It is intuitively clear that this
should happen since X comes from m Bernoulli trials and Y comes from n Bernoulli trials indepen-
dently, so Z comes from m + n Bernoulli trials with common success probability p. We can prove
the result mathematically as well, by finding the probability mass function of Z = X + Y directly
and observing that it is of the appropriate form. First, note that
the event {Z = z} is the union of the events {X = x, Y = y} over all pairs (x, y) subject to the constraint that x + y = z, 0 ≤ x ≤ m, 0 ≤ y ≤ n. Thus,
\[
\begin{aligned}
P(Z = z) &= \sum_{x+y=z} P(X = x,\, Y = y) \\
&= \sum_{x+y=z} \binom{m}{x} p^x (1-p)^{m-x} \binom{n}{y} p^y (1-p)^{n-y} \\
&= p^z (1-p)^{m+n-z} \sum_{x+y=z} \binom{m}{x}\binom{n}{y} \\
&= \binom{m+n}{z} p^z (1-p)^{m+n-z},
\end{aligned}
\]
using a result stated in Section A.4. Thus, we have proved that the sum of independent binomial
random variables with common probability is binomial as well. This is called the reproductive
property of random variables. You are asked to prove this for the Poisson distribution in an
exercise sheet.
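A quick numerical check of this reproductive property in R (illustrative m, n and p): the pmf of Z = X + Y computed by convolving the two binomial pmf's matches dbinom(z, m + n, p).

# Check: sum of independent Bin(m, p) and Bin(n, p) is Bin(m + n, p)
m <- 4; n <- 6; p <- 0.3
pz <- sapply(0:(m + n), function(z) {
  x <- max(0, z - n):min(m, z)                     # admissible values of x
  sum(dbinom(x, m, p) * dbinom(z - x, n, p))       # P(Z = z) by convolution
})
max(abs(pz - dbinom(0:(m + n), m + n, p)))         # essentially zero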
Now we will state two main results without proof. The proofs will presented in the second-
year distribution theory module MATH2011. Suppose that X1 , . . . , Xn is a random sample from
a population distribution with finite variance, and suppose that E(Xi ) = µi and Var(Xi ) = σi2 .
Define a new random variable
Y = a1 X1 + a2 X2 + · · · + an Xn
1. E(Y) = a1µ1 + a2µ2 + ··· + anµn.
2. If X1, ..., Xn are independent, then Var(Y) = a1²σ1² + a2²σ2² + ··· + an²σn².
For example, if ai = 1 for all i = 1, . . . , n, the two results above imply that:
The expectation of the sum of independent random variables is the sum of the expectations
of the individual random variables
and
the variance of the sum of independent random variables is the sum of the variances of
the individual random variables.
The second result is only true for independent random variables, e.g. random samples. Now we
will consider many examples.
Y = X1 + X2 + . . . + Xn
where each Xi is an independent Bernoulli trial with success probability p. We have shown before
that, E(Xi ) = p and Var(Xi ) = p(1 − p) by direct calculation. Now the above two results imply
that:
\[
E(Y) = E\left(\sum_{i=1}^{n} X_i\right) = p + p + \cdots + p = np, \qquad
\mathrm{Var}(Y) = \sum_{i=1}^{n} \mathrm{Var}(X_i) = np(1-p).
\]
r-th success in a sequence of independent Bernoulli trials, each with success probability p. Let Xi
be the number of trials needed after the (i − 1)-th success to obtain the i-th success. It is easy to
see that each Xi is a geometric random variable and Y = X1 + ··· + Xr. Hence,
\[
E(Y) = E(X_1) + \cdots + E(X_r) = \frac{1}{p} + \cdots + \frac{1}{p} = \frac{r}{p},
\]
and
\[
\mathrm{Var}(Y) = \mathrm{Var}(X_1) + \cdots + \mathrm{Var}(X_r) = \frac{1-p}{p^2} + \cdots + \frac{1-p}{p^2} = \frac{r(1-p)}{p^2}.
\]
As a consequence of the stated result we can easily see the following. Suppose X1 and X2 are
independent N (µ, σ 2 ) random variables. Then 2X1 ∼ N (2µ, 4σ 2 ), X1 + X2 ∼ N (2µ, 2σ 2 ), and
X1 − X2 ∼ N (0, 2σ 2 ). Note that 2X1 and X1 + X2 have different distributions.
\[
X_1 + \cdots + X_n \sim N(n\mu,\, n\sigma^2),
\]
and consequently,
\[
\bar{X} = \frac{1}{n}(X_1 + \cdots + X_n) \sim N\!\left(\mu,\, \frac{\sigma^2}{n}\right).
\]
This also implies that X̄ = (1/n)Y approximately follows the normal distribution as the sample size n → ∞. In particular, if µi = µ and σi² = σ², i.e. all means are equal and all variances are equal, then the CLT states that, as n → ∞,
\[
\bar{X} \sim N\!\left(\mu,\, \frac{\sigma^2}{n}\right).
\]
Equivalently,
\[
\frac{\sqrt{n}\,(\bar{X} - \mu)}{\sigma} \sim N(0, 1)
\]
as n → ∞. The notion of convergence is explained by the convergence of distribution of X̄ to that
of the normal distribution with the appropriate mean and variance. It means that the cdf of the
left hand side, \sqrt{n}\,(\bar{X}-\mu)/\sigma, converges to the cdf of the standard normal random variable, Φ(·). In other words,
\[
\lim_{n \to \infty} P\!\left( \sqrt{n}\, \frac{\bar{X} - \mu}{\sigma} \le z \right) = \Phi(z), \quad -\infty < z < \infty.
\]
So for “large samples”, we can use N(0, 1) as an approximation to the sampling distribution of \sqrt{n}\,(\bar{X} - \mu)/\sigma. This result is ‘exact’, i.e. no approximation is required, if the distribution of the
Xi ’s are normal in the first place – this was discussed in the previous lecture.
How large does n have to be before this approximation becomes usable? There is no definitive
answer to this, as it depends on how “close to normal” the distribution of X is. However, it is
often a pretty good approximation for sample sizes as small as 20, or even smaller. It also depends
on the skewness of the distribution of X; if the X-variables are highly skewed, then n will usually
need to be larger than for corresponding symmetric X-variables for the approximation to be good.
We will investigate this numerically using R.
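A sketch of such a numerical investigation in R: draw repeated samples from a skewed distribution (here an exponential, purely as an illustration), normalise the sample means and compare their histograms with the standard normal density.

# Empirical check of the CLT: normalised means of exponential samples
set.seed(3)
clt_demo <- function(n, reps = 5000, rate = 1) {
  xbar <- replicate(reps, mean(rexp(n, rate)))
  sqrt(n) * (xbar - 1 / rate) / (1 / rate)      # normalised sample means
}
par(mfrow = c(1, 3))
for (n in c(1, 5, 30)) {
  hist(clt_demo(n), breaks = 40, freq = FALSE, main = paste("n =", n), xlab = "")
  curve(dnorm(x), add = TRUE, lty = 2)          # standard normal density
}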
Figure 3.3: Distribution of normalised sample means for samples of different sizes. Initially very
skew (original distribution, n = 1) becoming rapidly closer to standard normal (dashed line) with
increasing n.
We know that a binomial random variable Y with parameters n and p is the number of successes
in a set of n independent Bernoulli trials, each with success probability p. We have also learnt that
Y = X1 + X2 + · · · + Xn ,
where X1 , . . . , Xn are independent Bernoulli random variables with success probability p. It fol-
lows from the CLT that, for a sufficiently large n, Y is approximately normally distributed with
expectation E(Y ) = np and variance Var(Y ) = np(1 − p).
Hence, for given integers y1 and y2 between 0 and n and a suitably large n, we have
\[
\begin{aligned}
P(y_1 \le Y \le y_2) &= P\!\left( \frac{y_1 - np}{\sqrt{np(1-p)}} \le \frac{Y - np}{\sqrt{np(1-p)}} \le \frac{y_2 - np}{\sqrt{np(1-p)}} \right) \\
&\approx P\!\left( \frac{y_1 - np}{\sqrt{np(1-p)}} \le Z \le \frac{y_2 - np}{\sqrt{np(1-p)}} \right),
\end{aligned}
\]
We should take account of the fact that the binomial random variable Y is integer-valued, and so P(y1 ≤ Y ≤ y2) = P(y1 − f1 ≤ Y ≤ y2 + f2) for any two fractions 0 < f1, f2 < 1. This is called the continuity correction; in practice one usually takes f1 = f2 = 1/2.
Figure 3.4: Histograms of normalised sample means for Bernoulli (p = 0.8) samples of different
sizes. – converging to standard normal.
♥ Example 56 A producer of natural yoghurt believed that the market share of their brand
was 10%. To investigate this, a survey of 2500 yoghurt consumers was carried out. It was observed
that only 205 of the people surveyed expressed a preference for their brand. Should the producer
be concerned that they might be losing market share?
Assume that the conjecture about market share is true. Then the number of people Y who
prefer this product follows a binomial distribution with p = 0.1 and n = 2500. So the mean is
np = 250, the variance is np(1 − p) = 225, and the standard deviation is 15. The exact probability
of observing (Y ≤ 205) is given by the sum of the binomial probabilities up to and including 205,
which is difficult to compute. However, this can be approximated by using the CLT:
\[
\begin{aligned}
P(Y \le 205) &= P(Y \le 205.5) \\
&= P\!\left( \frac{Y - np}{\sqrt{np(1-p)}} \le \frac{205.5 - np}{\sqrt{np(1-p)}} \right) \\
&\approx P\!\left( Z \le \frac{205.5 - np}{\sqrt{np(1-p)}} \right) \\
&= P\!\left( Z \le \frac{205.5 - 250}{15} \right) \\
&= \Phi(-2.967) = 0.0015.
\end{aligned}
\]
This probability is so small that it casts doubt on the validity of the assumption that the market
share is 10%.
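In R this approximation can be checked against the exact binomial probability (difficult by hand, but straightforward with pbinom):

# Example 56: exact binomial probability versus the normal approximation
pbinom(205, size = 2500, prob = 0.1)     # exact P(Y <= 205)
pnorm((205.5 - 250) / 15)                # normal approximation with continuity correction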
Statistical Inference
Chapter mission
In the last chapter we learned the probability distributions of common random variables that we use
in practice. We learned how to calculate the probabilities based on our assumption of a probability
distribution with known parameter values. Statistical inference is the process by which we try to
learn about those probability distributions using only random observations. Hence, if our aim is
to learn about some typical characteristics of the population of Southampton students, we simply randomly select a few students, observe their characteristics and then try to generalise, as discussed
in Lecture 1. For example, suppose we are interested in learning what proportion of Southampton
students are of Indian origin. We may then select a number of students at random and observe
the sample proportion of Indian origin students. We will then claim that the sample proportion is
really our guess for the population proportion. But obviously we may be making grave errors since
we are inferring about some unknown based on only a tiny fraction of total information. Statistical
inference methods formalise these aspects. We will learn some of these methods here.
• The form of the assumed model helps us to understand the real-world process by which the
data were generated.
• If the model explains the observed data well, then it should also inform us about future
(or unobserved) data, and hence help us to make predictions (and decisions contingent on
unobserved data).
• The use of statistical models, together with a carefully constructed methodology for their
analysis, also allows us to quantify the uncertainty associated with any conclusions, predic-
tions or decisions we make.
Assumption 1 depends on the sampling mechanism and is very common in practice. If we are
to make this assumption for the Southampton student sampling experiment, we need to select
randomly among all possible students. We should not get the sample from an event in the Indian
or Chinese Student Association as that will give us a biased result. The assumption will be vio-
lated when samples are correlated either in time or in space, e.g. the daily air pollution level in
Southampton for the last year or the air pollution levels in two nearby locations in Southampton.
In this module we will only consider data sets where Assumption 1 is valid. Assumption 2 is not
always appropriate, but is often reasonable when we are modelling a single variable. In the fast
food waiting time example, if we assume that there are no differences between the AM and PM
waiting times, then we can say that X1 , . . . , X20 are independent and identically distributed (or
i.i.d. for short).
to make any inference about any unknown quantities, although we may use the data to judge the
plausibility of the model.
However, a fully specified model would be appropriate when for example, there is some external
(to the data) theory as to why the model (in particular the values of µ and σ 2 ) was appropriate.
Fully specified models such as this are uncommon as we rarely have external theory which allows
us to specify a model so precisely.
When a parametric statistical model is assumed with some unknown parameters, statistical
inference methods use data to estimate the unknown parameters, e.g. λ, µ, σ 2 . Estimation will be
discussed in more detail in the following lectures.
2. Parametric. Suppose we assume that X follows the Poisson distribution with parameter λ.
The quantity we want to estimate involves the unknown parameter λ. For the Poisson distribution we know that E(X) = λ. Hence we could use the sample mean X̄ to estimate E(X) = λ. Thus our estimate is λ̂ = x̄ = 3.75. This type of estimator is called a moment estimator. Now our answer is 52 × (1 − e^{−3.75}) = 52 * (1 - exp(-3.75)) = 50.78 ≈ 51, which is very different compared to our answer of 46 from the nonparametric approach.
The nonparametric approach should be preferred if the model cannot be justified for the data,
as in this case the parametric approach will provide incorrect answers.
• The observations x = (x1 , . . . , xn ) are called the sample, and quantities derived from the
sample are sample quantities. For example, as in Chapter 1, we call
\[
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
\]
the sample mean.
• The probability distribution for X specified in our model represents all possible observations
which might have been observed in our sample, and is therefore sometimes referred to as the
population. Quantities derived from this distribution are population quantities.
For example, if our model is that X1 , . . . , Xn are i.i.d., following the common distribution of
a random variable X, then we call E(X) the population mean.
The probability distribution of any estimator θ̃(X) is called its sampling distribution. The
estimate θ̃(x) is an observed value (a number), and is a single observation from the sampling dis-
tribution of θ̃(X).
♥ Example 58 Suppose that we have a random sample X1 , . . . , Xn from the uniform distribu-
tion on the interval [0, θ] where θ > 0 is unknown. Suppose that n = 5 and we have the sample
observations x1 = 2.3, x2 = 3.6, x3 = 20.2, x4 = 0.9, x5 = 17.2. Our objective is to estimate θ. How
can we proceed?
Here the pdf is f(x) = 1/θ for 0 ≤ x ≤ θ and 0 otherwise. Hence E(X) = \int_0^\theta \frac{x}{\theta}\,dx = \frac{\theta}{2}. There are
many possible estimators for θ, e.g. θ̂1 (X) = 2 X̄, which is motivated by the method of moments
because θ = 2E(X). A second estimator is θ̂2 (X) = max{X1 , X2 , . . . , Xn }, which is intuitive since
θ must be greater than or equal to all observed values and thus the maximum of the sample value
will be closest to θ. This is also the maximum likelihood estimate of θ, which you will learn in
MATH3044.
How could we choose between the two estimators θ̂1 and θ̂2 ? This is where we need to learn the
sampling distribution of an estimator to determine which estimator will be unbiased, i.e. correct on
average, and which will have minimum variability. We will formally define these in a minute, but
first let us derive the sampling distribution, i.e. the pdf, of θ̂2 . Note that θ̂2 is a random variable
since the sample X1 , . . . , Xn is random. We will first find its cdf and then differentiate the cdf to
get the pdf. For ease of notation, suppose Y = θ̂2 (X) = max{X1 , X2 , . . . , Xn }. For any 0 < y < θ,
the cdf of Y , F (y) is given by:
\[
\begin{aligned}
P(Y \le y) &= P(\max\{X_1, X_2, \ldots, X_n\} \le y) \\
&= P(X_1 \le y,\, X_2 \le y,\, \ldots,\, X_n \le y) \quad [\max \le y \text{ if and only if each } \le y] \\
&= P(X_1 \le y)\,P(X_2 \le y)\cdots P(X_n \le y) \quad [\text{since the } X\text{'s are independent}] \\
&= \frac{y}{\theta}\cdot\frac{y}{\theta}\cdots\frac{y}{\theta} = \left(\frac{y}{\theta}\right)^{n}.
\end{aligned}
\]
Now the pdf of Y is f(y) = \frac{dF(y)}{dy} = \frac{n\,y^{n-1}}{\theta^n} for 0 ≤ y ≤ θ. We can plot this as a function of y to see the pdf. Now E(θ̂2) = E(Y) = \frac{n}{n+1}\theta and Var(θ̂2) = \frac{n\theta^2}{(n+2)(n+1)^2}. You can prove this by easy integration.
bias(θ̃) = E(θ̃) − θ.
So an estimator is unbiased if the expectation of its sampling distribution is equal to the quantity
we are trying to estimate. Unbiased means “getting it right on average”, i.e. under repeated sam-
pling (relative frequency interpretation of probability).
Thus for the uniform distribution example, θ̂2 is a biased estimator of θ and
\[
\mathrm{bias}(\hat{\theta}_2) = E(\hat{\theta}_2) - \theta = \frac{n}{n+1}\theta - \theta = -\frac{1}{n+1}\theta,
\]
which goes to zero as n → ∞. However, θ̂1 = 2X̄ is unbiased since E(θ̂1) = 2E(X̄) = 2(θ/2) = θ.
Unbiased estimators are “correct on average”, but that does not mean that they are guaranteed
to provide estimates which are close to the estimand θ. A better measure of the quality of an
estimator than bias is the mean squared error (or m.s.e.), defined as
\[
\mathrm{m.s.e.}(\tilde{\theta}) = E\left[(\tilde{\theta} - \theta)^2\right].
\]
Therefore, if θ̃ is unbiased for θ, i.e. if E(θ̃) = θ, then m.s.e.(θ̃) = Var(θ̃). In general, we have the
following result:
m.s.e.(θ̃) = Var(θ̃) + bias(θ̃)2 .
The proof is similar to the one we did in Lecture 2.
\[
\begin{aligned}
\mathrm{m.s.e.}(\tilde{\theta}) &= E\left[(\tilde{\theta} - \theta)^2\right] \\
&= E\left[\left(\tilde{\theta} - E(\tilde{\theta}) + E(\tilde{\theta}) - \theta\right)^2\right] \\
&= E\left[\left(\tilde{\theta} - E(\tilde{\theta})\right)^2\right] + \left(E(\tilde{\theta}) - \theta\right)^2 + 2\left(E(\tilde{\theta}) - \theta\right) E\left[\tilde{\theta} - E(\tilde{\theta})\right] \\
&= \mathrm{Var}(\tilde{\theta}) + \mathrm{bias}(\tilde{\theta})^2 + 2\left(E(\tilde{\theta}) - \theta\right)\cdot 0 \\
&= \mathrm{Var}(\tilde{\theta}) + \mathrm{bias}(\tilde{\theta})^2.
\end{aligned}
\]
Hence, the mean squared error incorporates both the bias and the variability (sampling variance)
of θ̃. We are then faced with the bias-variance trade-off when selecting an optimal estimator. We
may allow the estimator to have a little bit of bias if we can ensure that the variance of the biased
estimator will be much smaller than that of any unbiased estimator.
♥ Example 59 Uniform distribution Continuing with the uniform distribution U [0, θ] example,
1
we have seen that θ̂1 = 2X̄ is unbiased for θ but bias(θ̂2 ) = − n+1 θ. How do these estimators
compare with respect to the m.s.e? Since θ̂1 is unbiased, its m.s.e is its variance. In the next
lecture, we will prove that for random sampling from any population
\[
\mathrm{Var}(\bar{X}) = \frac{\mathrm{Var}(X)}{n},
\]
where Var(X) is the variance of the population sampled from. Returning to our example, we know
θ2
that if X ∼ U [0, θ] then Var(X) = 12 . Therefore we have:
θ2 θ2
m.s.e.(θ̂1 ) = Var θ̂1 = Var 2X̄ = 4Var X̄ = 4 = .
12n 3n
For θ̂2 we have, from the sampling distribution derived above:
1. Var(θ̂2) = nθ²/((n+2)(n+1)²), and
2. bias(θ̂2) = −θ/(n+1).
Now
\[
\begin{aligned}
\mathrm{m.s.e.}(\hat{\theta}_2) &= \mathrm{Var}(\hat{\theta}_2) + \mathrm{bias}(\hat{\theta}_2)^2 \\
&= \frac{n\theta^2}{(n+2)(n+1)^2} + \frac{\theta^2}{(n+1)^2} \\
&= \frac{\theta^2}{(n+1)^2}\left(\frac{n}{n+2} + 1\right) \\
&= \frac{\theta^2}{(n+1)^2}\,\frac{2n+2}{n+2}.
\end{aligned}
\]
Clearly, the m.s.e. of θ̂2 is an order of magnitude smaller (of order 1/n² rather than 1/n) than the m.s.e. of θ̂1, providing justification for preferring θ̂2 = max{X1, X2, ..., Xn} as an estimator of θ.
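A hedged simulation sketch in R comparing the two estimators; the true θ, sample size and number of replications are illustrative:

# Compare 2 * xbar and max(x) as estimators of theta for U[0, theta] samples
set.seed(4)
theta <- 10; n <- 5; reps <- 10000
est1 <- replicate(reps, 2 * mean(runif(n, 0, theta)))   # moment estimator
est2 <- replicate(reps, max(runif(n, 0, theta)))        # sample maximum
c(mse1 = mean((est1 - theta)^2),    # close to theta^2 / (3n)
  mse2 = mean((est2 - theta)^2))    # close to 2 * theta^2 / ((n + 1) * (n + 2))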
R2: Var(X̄) = σ²/n.
We prove R1 as follows.
\[
E(\bar{X}) = \frac{1}{n}\sum_{i=1}^{n} E(X_i) = \frac{1}{n}\sum_{i=1}^{n} E(X) = E(X).
\]
We prove R2 using the result that for independent random variables the variance of the sum is
the sum of the variances from Lecture 17. Thus,
\[
\mathrm{Var}(\bar{X}) = \frac{1}{n^2}\sum_{i=1}^{n} \mathrm{Var}(X_i) = \frac{1}{n^2}\sum_{i=1}^{n} \mathrm{Var}(X) = \frac{n}{n^2}\,\mathrm{Var}(X) = \frac{\sigma^2}{n},
\]
so the m.s.e. of X̄ is Var(X)/n. This proves the following assertion we made earlier:
Variance of the sample mean = Population Variance divided by the sample size.
We now want to prove R3, i.e. show that the sample variance with divisor n − 1 is an unbiased
estimator of the population variance σ 2 , i.e. E(S 2 ) = σ 2 . We have
\[
S^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} X_i^2 - n\bar{X}^2\right].
\]
To evaluate the expectation of the above, we need E(Xi²) and E(X̄²). In general, we know for any random variable that E(X²) = Var(X) + (E(X))². Thus, we have
E(Xi2 ) = Var(Xi ) + (E(Xi ))2 = σ 2 + µ2 ,
and
E(X̄ 2 ) = Var(X̄) + (E(X̄))2 = σ 2 /n + µ2 ,
\[
\begin{aligned}
E(S^2) &= \frac{1}{n-1}\left[\sum_{i=1}^{n} E(X_i^2) - nE(\bar{X}^2)\right] \\
&= \frac{1}{n-1}\left[\sum_{i=1}^{n} (\sigma^2 + \mu^2) - n(\sigma^2/n + \mu^2)\right] \\
&= \frac{1}{n-1}\left[n\sigma^2 + n\mu^2 - \sigma^2 - n\mu^2\right] \\
&= \sigma^2 \equiv \mathrm{Var}(X).
\end{aligned}
\]
For an unbiased estimator θ̃, m.s.e.(θ̃) = Var(θ̃),
and therefore the sampling variance of the estimator is an important summary of its quality.
We usually prefer to focus on the standard deviation of the sampling distribution of θ̃,
\[
\mathrm{s.d.}(\tilde{\theta}) = \sqrt{\mathrm{Var}(\tilde{\theta})}.
\]
In practice we will not know s.d.(θ̃), as it will typically depend on unknown features of the
distribution of X1 , . . . , Xn . However, we may be able to estimate s.d.(θ̃) using the observed sample
x1 , . . . , xn . We define the standard error, s.e.(θ̃), of an estimator θ̃ to be an estimate of the standard
deviation of its sampling distribution, s.d.(θ̃).
We proved that
\[
\mathrm{Var}(\bar{X}) = \frac{\sigma^2}{n} \;\Rightarrow\; \mathrm{s.d.}(\bar{X}) = \frac{\sigma}{\sqrt{n}}.
\]
As σ is unknown, we cannot calculate this standard deviation. However, we know that E(S 2 ) = σ 2 ,
i.e. that the sample variance is an unbiased estimator of the population variance. Hence S 2 /n is
an unbiased estimator for Var(X̄). Therefore we obtain the standard error of the mean, s.e.(X̄),
by plugging in the estimate
\[
s = \left(\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2\right)^{1/2}
\]
for σ, so that s.e.(X̄) = s/√n.
Therefore, for the computer failure data, our estimate, x̄ = 3.75, for the population mean is
associated with a standard error
\[
\mathrm{s.e.}(\bar{X}) = \frac{3.381}{\sqrt{104}} = 0.332.
\]
Note that this is ‘a’ standard error, so other standard errors may be available. Indeed, for parametric
inference, where we make assumptions about f (x), alternative standard errors are available. For
example, suppose X1, ..., Xn are i.i.d. Poisson(λ) random variables. Then E(X) = λ, so X̄ is an unbiased estimator of λ. Also Var(X) = λ, so another standard error is s.e.(X̄) = \sqrt{\hat{\lambda}/n} = \sqrt{\bar{x}/n}. In the computer failure data example, this is \sqrt{3.75/104} = 0.19.
4.4.2 Basics
An estimate θ̃ of a parameter θ is sometimes referred to as a point estimate. The usefulness of a
point estimate is enhanced if some kind of measure of its precision can also be provided. Usually,
for an unbiased estimator, this will be a standard error, an estimate of the standard deviation of the
associated estimator, as we have discussed previously. An alternative summary of the information
provided by the observed data about the location of a parameter θ and the associated precision is
an interval estimate or confidence interval.
a random variable T(X, θ) whose distribution does not depend on θ and is therefore known. This random variable T(X, θ) is called a pivot for θ. Hence we can find numbers h1 and h2 such that
\[
P\left[h_1 \le T(X, \theta) \le h_2\right] = 1 - \alpha, \tag{1}
\]
where 1 − α is any specified probability. If (1) can be ‘inverted’ (or manipulated), we can write it as
\[
P\left[g_1(X) \le \theta \le g_2(X)\right] = 1 - \alpha. \tag{2}
\]
Hence with probability 1 − α, the parameter θ will lie between the random variables g1 (X) and
g2 (X). Alternatively, the random interval [g1 (X), g2 (X)] includes θ with probability 1 − α. Now,
when we observe x1 , . . . , xn , we observe a single observation of the random interval [g1 (X), g2 (X)],
which can be evaluated as [g1 (x), g2 (x)]. We do not know if θ lies inside or outside this interval,
but we do know that if we observed repeated samples, then 100(1 − α)% of the resulting intervals
would contain θ. Hence, if 1 − α is high, we can be reasonably confident that our observed interval
contains θ. We call the observed interval [g1 (x), g2 (x)] a 100(1 − α)% confidence interval for θ.
It is common to present intervals with high confidence levels, usually 90%, 95% or 99%, so that
α = 0.1, 0.05 or 0.01 respectively.
It is common practice to make the interval symmetric, so that the two unshaded areas are equal
(to α/2), in which case
\[
-h_1 = h_2 \equiv h \qquad \text{and} \qquad \Phi(h) = 1 - \frac{\alpha}{2}.
\]
The most common choice of confidence level is 1 − α = 0.95, in which case h = 1.96 =
qnorm(0.975). You may also occasionally see 90% (h = 1.645 = qnorm(0.95)) or 99% (h =
2.58=qnorm(0.995)) intervals. We discussed these values in Lecture 15. We generally use the 95%
intervals for a reasonably high level of confidence without making the interval unnecessarily wide.
Therefore we have
\[
P\left(-1.96 \le \sqrt{n}\,\frac{\bar{X} - \mu}{\sigma} \le 1.96\right) = 0.95
\;\Rightarrow\;
P\left(\bar{X} - 1.96\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}\right) = 0.95.
\]
Hence, X̄ − 1.96 σ/√n and X̄ + 1.96 σ/√n are the endpoints of a random interval which includes µ with probability 0.95. The observed value of this interval, x̄ ± 1.96 σ/√n, is called a 95% confidence interval for µ.
♥ Example 60 For the fast food waiting time data, we have n = 20 data points combined from
the morning and afternoon data sets. We have x̄ = 67.85 and n = 20. Hence, under the normal
model assuming (just for the sake of illustration) σ = 18, a 95% confidence interval for µ is
\[
67.85 - 1.96\,(18/\sqrt{20}) \le \mu \le 67.85 + 1.96\,(18/\sqrt{20})
\;\Rightarrow\;
59.96 \le \mu \le 75.74.
\]
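In R, using the sample mean and the illustrative value σ = 18:

# 95% CI for mu with known sigma = 18 (normal model)
xbar <- 67.85; sigma <- 18; n <- 20
xbar + c(-1, 1) * qnorm(0.975) * sigma / sqrt(n)   # (59.96, 75.74)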
2. Confidence intervals are frequently used, but also frequently misinterpreted. A 100(1 − α)%
confidence interval for θ is a single observation of a random interval which, under repeated
sampling, would include θ 100(1 − α)% of the time.
The following example from the National Lottery in the UK clarifies the interpretation. We
collected 6 chosen lottery numbers (sampled at random from 1 to 49) for 20 weeks and then
constructed 95% confidence intervals for the population mean µ = 25 and plotted the intervals
along with the observed sample means in the following figure. It can be seen that exactly
one out of 20 (5%) of the intervals does not contain the true population mean 25. Although
this is a coincidence, it explains the main point that if we construct the random intervals
with 100(1 − α)% confidence levels again and again for hypothetical repetition of the data,
on average 100(1 − α)% of them will contain the true parameter.
3. A confidence interval is not a probability interval. You should avoid making statements like
P (1.3 < θ < 2.2) = 0.95. In the classical approach to statistics you can only make probability
statements about random variables, and θ is assumed to be a constant.
In this lecture we have learned to obtain confidence intervals by using an appropriate statistic in the
pivoting technique. The main task is then to invert the inequality so that the unknown parameter
is in the middle by itself and the two end points are functions of the sample observations. The most
difficult task is to correctly interpret confidence intervals, which are not probability intervals but
have long-run properties. That is, the interval will contain the true parameter with the stipulated
confidence level only under infinitely repeated sampling.
\[
\sqrt{n}\,\frac{\bar{X} - \mu}{\sigma} \overset{\text{approx}}{\sim} N(0, 1) \quad \text{as } n \to \infty.
\]
So a general confidence interval for µ can be constructed, just as before in Section 4.4.3. Thus a 95% confidence interval (CI) for µ is given by x̄ ± 1.96 σ/√n. But note that σ is unknown, so this CI cannot be used unless we can estimate σ, i.e. replace the unknown s.d. of X̄ by its estimated standard error. In this case, we get the CI in the familiar form Estimate ± Critical value × Standard error.
Suppose that we do not assume any distribution for the sampled random variable X but assume
only that X1 , . . . , Xn are i.i.d, following the distribution of X where E(X) = µ and Var(X) = σ 2 .
We know that the standard error of X̄ is s/√n, where s is the sample standard deviation with divisor n − 1. Then the following provides a 95% CI for µ:
\[
\bar{x} \pm 1.96\,\frac{s}{\sqrt{n}}.
\]
♥ Example 61 For the computer failure data, x̄ = 3.75, s = 3.381 and n = 104. Under the
model that the data are observations of i.i.d. random variables with population mean µ (but no
other assumptions about the underlying distribution), we compute a 95% confidence interval for µ
to be
\[
\left(3.75 - 1.96\,\frac{3.381}{\sqrt{104}},\; 3.75 + 1.96\,\frac{3.381}{\sqrt{104}}\right) = (3.10,\; 4.40).
\]
If we can assume a distribution for X, i.e. a parametric model for X, then we can do slightly
better in estimating the standard error of X̄ and as a result we can improve upon the previously
obtained 95% CI. Two examples follow.
♥ Example 62 Poisson If X1, ..., Xn are modelled as i.i.d. Poisson(λ) random variables, then µ = λ and σ² = λ. We know Var(X̄) = σ²/n = λ/n. Hence a standard error is \sqrt{\hat{\lambda}/n} = \sqrt{\bar{x}/n}.
For the computer failure data, x̄ = 3.75, s = 3.381 and n = 104. Under the model that the data
are observations of i.i.d. random variables following a Poisson distribution with population mean
λ, we compute a 95% confidence interval for λ as
\[
\bar{x} \pm 1.96\sqrt{\frac{\bar{x}}{n}} = 3.75 \pm 1.96\sqrt{3.75/104} = (3.38,\; 4.12).
\]
We see that this interval is narrower (0.74 = 4.12 − 3.38) than the earlier interval (3.10,4.40),
which has a length of 1.3. We prefer narrower confidence intervals as they facilitate more accurate
inference regarding the unknown parameter.
This is wrong as n is too small for the large sample approximation to be accurate. Hence we need
to look for other alternatives which may work better.
\[
\begin{aligned}
&P\left(-1.96 \le \sqrt{n}\,\frac{\bar{X} - p}{\sqrt{p(1-p)}} \le 1.96\right) = 0.95 \\
\Leftrightarrow\;& P\left(-1.96\sqrt{p(1-p)} \le \sqrt{n}(\bar{X} - p) \le 1.96\sqrt{p(1-p)}\right) = 0.95 \\
\Leftrightarrow\;& P\left(-1.96\sqrt{p(1-p)/n} \le \bar{X} - p \le 1.96\sqrt{p(1-p)/n}\right) = 0.95 \\
\Leftrightarrow\;& P\left(p - 1.96\sqrt{p(1-p)/n} \le \bar{X} \le p + 1.96\sqrt{p(1-p)/n}\right) = 0.95 \\
\Leftrightarrow\;& P\left(L(p) \le \bar{X} \le R(p)\right) = 0.95,
\end{aligned}
\]
where L(p) = p − h√(p(1−p)/n), R(p) = p + h√(p(1−p)/n) and h = 1.96. Now, consider the inverse
mappings L−1 (x) and R−1 (x) so that:
\[
P\left(L(p) \le \bar{X} \le R(p)\right) = 0.95
\;\Leftrightarrow\;
P\left(R^{-1}(\bar{X}) \le p \le L^{-1}(\bar{X})\right) = 0.95,
\]
which now defines our confidence interval (R−1 (X̄), L−1 (X̄)) for p. We can obtain R−1 (x̄) and
L−1 (x̄) by solving the equations R(p) = x̄ and L(p) = x̄ for p, treating n and x̄ as known quantities.
Thus we have the quadratic equation
\[
\left(1 + \frac{h^2}{n}\right)p^2 - \left(2\bar{x} + \frac{h^2}{n}\right)p + \bar{x}^2 = 0.
\]
The endpoints of the confidence interval are the roots of this quadratic. Hence, the endpoints of the 95% confidence interval for p are:
\[
\begin{aligned}
&\frac{2\bar{x} + \frac{h^2}{n} \pm \left[\left(2\bar{x} + \frac{h^2}{n}\right)^2 - 4\bar{x}^2\left(1 + \frac{h^2}{n}\right)\right]^{1/2}}{2\left(1 + \frac{h^2}{n}\right)} \\
&= \frac{\bar{x} + \frac{h^2}{2n} \pm \left[\left(\bar{x} + \frac{h^2}{2n}\right)^2 - \bar{x}^2\left(1 + \frac{h^2}{n}\right)\right]^{1/2}}{1 + \frac{h^2}{n}} \\
&= \frac{\bar{x} + \frac{h^2}{2n} \pm \frac{h}{\sqrt{n}}\left[\frac{h^2}{4n} + \bar{x}(1 - \bar{x})\right]^{1/2}}{1 + \frac{h^2}{n}}.
\end{aligned}
\]
This is sometimes called the Wilson score interval. R code that calculates this interval for given n, x̄ and confidence level (which determines the value of h) is sketched after the table below. Returning to the previous example, n = 10 and x̄ = 0.2, the 95% CI obtained from this method is (0.057, 0.510) compared to the previous illegitimate one (−0.048, 0.448). In fact you can see that the intervals obtained by quadratic inversion are more symmetric and narrower as n increases, and are also more symmetric for x̄ closer to 0.5. See the table below:
n x̄ Quadratic inversion Plug-in s.e. estimation
Lower end Upper end Lower end Upper end
10 0.2 0.057 0.510 –0.048 0.448
10 0.5 0.237 0.763 0.190 0.810
20 0.1 0.028 0.301 –0.031 0.231
20 0.2 0.081 0.416 0.025 0.375
20 0.5 0.299 0.701 0.281 0.719
50 0.1 0.043 0.214 0.017 0.183
50 0.2 0.112 0.330 0.089 0.311
50 0.5 0.366 0.634 0.361 0.639
For smaller n and x̄ closer to 0 (or 1), the approximation required for the plug-in estimate of the
standard error is insufficiently reliable. However, for larger n it is adequate.
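A sketch of such R code (the function name wilson_ci is ours; the formula is the one derived above):

# Wilson score confidence interval for a proportion
wilson_ci <- function(xbar, n, level = 0.95) {
  h <- qnorm(1 - (1 - level) / 2)                       # e.g. 1.96 for 95%
  centre <- xbar + h^2 / (2 * n)
  half   <- (h / sqrt(n)) * sqrt(h^2 / (4 * n) + xbar * (1 - xbar))
  (centre + c(-1, 1) * half) / (1 + h^2 / n)
}
wilson_ci(0.2, 10)    # approximately (0.057, 0.510), as in the table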
Now the confidence interval for λ is found by solving the (quadratic) equality for λ by treating n, x̄
and h to be known:
\[
\begin{aligned}
n\,\frac{(\bar{x} - \lambda)^2}{\lambda} &= h^2, \quad \text{where } h = 1.96 \\
\Rightarrow\; \bar{x}^2 - 2\lambda\bar{x} + \lambda^2 &= h^2\lambda/n \\
\Rightarrow\; \lambda^2 - \lambda\left(2\bar{x} + h^2/n\right) + \bar{x}^2 &= 0.
\end{aligned}
\]
Hence the endpoints of the confidence interval for λ are
\[
\frac{2\bar{x} + \frac{h^2}{n} \pm \left[\left(2\bar{x} + \frac{h^2}{n}\right)^2 - 4\bar{x}^2\right]^{1/2}}{2}
= \bar{x} + \frac{h^2}{2n} \pm \frac{h}{\sqrt{n}}\left[\frac{h^2}{4n} + \bar{x}\right]^{1/2}.
\]
♥ Example 64 For the computer failure data, x̄ = 3.75 and n = 104. For a 95% confidence
interval (CI), h = 1.96. Hence, we calculate the above CI using the R commands:
x <- scan("compfail.txt")     # read the computer failure data
n <- length(x)                # sample size, n = 104
h <- qnorm(0.975)             # critical value, 1.96
mean(x) + (h*h)/(2*n) + c(-1, 1) * h/sqrt(n) * sqrt(h*h/(4*n) + mean(x))   # CI endpoints
The result is (3.40, 4.14), which compares well with the earlier interval (3.38, 4.12).
4.6 Lecture 24: Exact confidence interval for the normal mean
4.6.1 Lecture mission
Recall that we can obtain better quality inferences if we can justify a precise model for the data.
This saying is analogous to the claim that a person can better predict and infer in a situation
when there are established rules and regulations, i.e. the analogue of a statistical model. In this
lecture, we will discuss a procedure for finding confidence intervals based on the statistical modelling
assumption that the data are from a normal distribution. This assumption will enable us to find
an exact confidence interval for the mean rather than an approximate one using the central limit
theorem.
\[
\begin{aligned}
P(-h \le T \le h) &= 1 - \alpha \\
\text{i.e.}\quad P\left(-h \le \sqrt{n}\,\frac{\bar{X} - \mu}{S} \le h\right) &= 0.95 \\
\Rightarrow\; P\left(\bar{X} - h\frac{S}{\sqrt{n}} \le \mu \le \bar{X} + h\frac{S}{\sqrt{n}}\right) &= 0.95.
\end{aligned}
\]
The observed value of this interval, x̄ ± h s/√n, is the 95% confidence interval for µ. Remarkably, this is also of the general form, Estimate ± Critical value × Standard error, where the critical value is h and the standard error of the sample mean is s/√n. Now, how do we find the critical value h for a given 1 − α? We need to introduce the t-distribution.
Let X1, ..., Xn be i.i.d. N(µ, σ²) random variables. Define
\[
\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i \qquad \text{and} \qquad S^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} X_i^2 - n\bar{X}^2\right).
\]
Then
\[
\sqrt{n}\,\frac{\bar{X} - \mu}{S} \sim t_{n-1},
\]
where tn−1 denotes the standard t distribution with n − 1 degrees of freedom. The standard t dis-
tribution is a family of distributions which depend on one parameter called the degrees-of-freedom
(df) which is n − 1 here. The concept of degrees of freedom is that it is usually the number of
independent random samples, n here, minus the number of linear parameters estimated, 1 here for
µ. Hence the df is n − 1.
The probability density function of the tk distribution is similar to a standard normal, in that
it is symmetric around zero and ‘bell-shaped’, but the t-distribution is more heavy-tailed, giving
greater probability to observations further away from zero. The figure below illustrates the tk
density function for k = 1, 2, 5, 20 together with the standard normal pdf (solid line).
The values of h for a given 1 − α have been tabulated using the standard t-distribution and can
be obtained using the R command qt (abbreviation for quantile of t). For example, if we want to
find h for 1 − α = 0.95 and n = 20 then we issue the command: qt(0.975, df=19) = 2.093. Note
that it should be 0.975 so that we are splitting 0.05 probability between the two tails equally and
the df should be n − 1 = 19. Indeed, using the above command repeatedly, we obtain the following
critical values for the 95% interval for different values of the sample size n.
n 2 5 10 15 20 30 50 100 ∞
h 12.71 2.78 2.26 2.14 2.09 2.05 2.01 1.98 1.96
Note that the critical value approaches 1.96 (which is the critical value for the normal distribution)
as n → ∞, since the t-distribution itself approaches the normal distribution for large values of its
df parameter.
If you can justify that the underlying distribution is normal then you
can use the t-distribution-based confidence interval.
♥ Example 65 Fast food waiting time revisited We would like to find a confidence interval for
the true mean waiting time. If X denotes the waiting time in seconds, we have n = 20, x̄ = 67.85,
s = 18.36. Hence, recalling that the critical value h = 2.093, from the command qt(0.975,
df=19), a 95% confidence interval for µ is
\[
67.85 - 2.093 \times 18.36/\sqrt{20} \le \mu \le 67.85 + 2.093 \times 18.36/\sqrt{20}
\;\Rightarrow\;
59.26 \le \mu \le 76.44.
\]
♥ Example 66 Weight gain revisited We would like to find a confidence interval for the true
average weight gain (final weight – initial weight). Here n = 68, x̄ = 0.8672 and s = 0.9653. Hence,
a 95% confidence interval for µ is
\[
0.8672 - 1.996 \times 0.9653/\sqrt{68} \le \mu \le 0.8672 + 1.996 \times 0.9653/\sqrt{68}
\;\Rightarrow\;
0.6335 \le \mu \le 1.1008.
\]
[In R, we obtain the critical value 1.996 by qt(0.975, df=67) or -qt(0.025, df=67)]
In R the command is: mean(x) + c(-1, 1) * qt(0.975, df=67) * sqrt(var(x)/68) if the
vector x contains the 68 weight gain differences. You may obtain this by issuing the commands:
wgain <- read.table("wtgain.txt", head=T)
x <- wgain$final -wgain$initial
Note that the interval here does not include the value 0, so it is very likely that the weight gain is significantly positive, which we will justify using what is called testing of hypotheses.
It is often claimed that students gain significant weight in their first year of college away from home. How can we verify such claims? We will learn the procedures of hypothesis testing for such problems.
4.7.2 Introduction
In statistical inference, we use observations x1 , . . . , xn of univariate random variables X1 , . . . , Xn in
order to draw inferences about the probability distribution f (x) of the underlying random variable
X. So far, we have mainly been concerned with estimating features (usually unknown parameters)
of f (x). It is often of interest to compare alternative specifications for f (x). If we have a set of
competing probability models which might have generated the observed data, we may want to de-
termine which of the models is most appropriate. A proposed (hypothesised) model for X1 , . . . , Xn
is then referred to as a hypothesis, and pairs of models are compared using hypothesis tests.
For example, we may have two competing alternatives, f (0) (x) (model H0 ) and f (1) (x) (model
H1 ) for f (x), both of which completely specify the joint distribution of the sample X1 , . . . , Xn .
Completely specified statistical models are called simple hypotheses. Usually, H0 and H1 both take
the same parametric form f (x, θ), but with different values θ(0) and θ(1) of θ. Thus the joint distri-
bution of the sample given by f (X) is completely specified apart from the values of the unknown
parameter θ, and θ(0) ≠ θ(1) are specified alternative values.
More generally, competing hypotheses often do not completely specify the joint distribution of
X1 , . . . , Xn . For example, a hypothesis may state that X1 , . . . , Xn is a random sample from the
probability distribution f (x; θ) where θ < 0. This is not a completely specified hypothesis, since it
is not possible to calculate probabilities such as P (X1 < 2) when the hypothesis is true, as we do
not know the exact value of θ. Such an hypothesis is called a composite hypothesis.
Examples of hypotheses:
X1 , . . . , Xn ∼ N (µ, σ 2 ) with µ = 0, σ 2 = 2.
X1 , . . . , Xn ∼ N (µ, σ 2 ) with µ = 0, σ 2 ∈ R+ .
X1 , . . . , Xn ∼ N (µ, σ 2 ) with µ 6= 0, σ 2 ∈ R+ .
X1 , . . . , Xn ∼ Bernoulli(p) with p = 1/2.
X1 , . . . , Xn ∼ Bernoulli(p) with p ≠ 1/2.
X1 , . . . , Xn ∼ Bernoulli(p) with p > 1/2.
X1 , . . . , Xn ∼ Poisson(λ) with λ = 1.
X1 , . . . , Xn ∼ Poisson(θ) with θ > 1.
Hence, the fact that a hypothesis test does not reject H0 should not be taken as evidence that
H0 is true and H1 is not, or that H0 is better-supported by the data than H1 , merely that the data
does not provide significant evidence to reject H0 in favour of H1 .
A hypothesis test is defined by its critical region or rejection region, which we shall denote by
C. C is a subset of Rn and is the set of possible observed values of X which, if observed, would
lead to rejection of H0 in favour of H1 , i.e.
If x ∈ C, H0 is rejected in favour of H1;
if x ∉ C, H0 is not rejected.
As X is a random variable, there remains the possibility that a hypothesis test will give an erroneous
result. We define two types of error:
                     H0 true            H0 false
Reject H0            Type I error       Correct decision
Do not reject H0     Correct decision   Type II error
♥ Example 67 Uniform Suppose that we have one observation from the uniform distribution on the range (0, θ). In this case, f(x) = 1/θ if 0 < x < θ and P(X ≤ x) = x/θ for 0 < x < θ. We want to test H0 : θ = 1 against the alternative H1 : θ = 2. Suppose we decide arbitrarily that we will reject H0 if X > 0.75. Then the Type I error probability is α = P(X > 0.75 | θ = 1) = 1 − 0.75 = 0.25, and the Type II error probability is β = P(X ≤ 0.75 | θ = 2) = 0.75/2 = 0.375.
♥ Example 68 Poisson The daily demand for a product has a Poisson distribution with mean λ,
the demands on different days being statistically independent. It is desired to test the hypotheses
H0 : λ = 0.7, H1 : λ = 0.3. The null hypothesis is to be accepted if in 20 days the number of days
with no demand is less than 15. Calculate the Type I and Type II error probabilities.
Let p denote the probability that the demand on a given day is zero. Then
p = e^{−λ} = e^{−0.7} under H0, and p = e^{−0.3} under H1.
If X denotes the number of days out of 20 with zero demand, it follows that X ∼ Binomial(20, p). Thus the Type I error probability is α = P(X ≥ 15 | p = e^{−0.7}). Furthermore, the Type II error probability is β = P(X ≤ 14 | p = e^{−0.3}).
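These two probabilities can be evaluated in R with pbinom; a minimal sketch:
p0 <- exp(-0.7)                                  # P(zero demand on a day) under H0
p1 <- exp(-0.3)                                  # P(zero demand on a day) under H1
alpha <- 1 - pbinom(14, size = 20, prob = p0)    # Type I error: P(X >= 15) when H0 is true
beta  <- pbinom(14, size = 20, prob = p1)        # Type II error: P(X <= 14) when H1 is true
c(alpha, beta)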
Sometimes α is called the size (or significance level) of the test and ω ≡ 1 − β is called the
power of the test. Ideally, we would like to avoid error so we would like to make both α and β as
small as possible. In other words, a good test will have small size, but large power. However, it is
not possible to make α and β both arbitrarily small. For example if C = ∅ then α = 0, but β = 1.
On the other hand if C = S = Rn then β = 0, but α = 1.
The general hypothesis testing procedure is to fix α to be some small value (often 0.05), so that
the probability of a Type I error is limited. In doing this, we are giving H0 precedence over H1 ,
and acknowledging that Type I error is potentially more serious than Type II error. (Note that for
discrete random variables, it may be difficult to find C so that the test has exactly the required
size). Given our specified α, we try to choose a test, defined by its rejection region C, to make β
as small as possible, i.e. we try to find the most powerful test of a specified size. Where H0 and H1 are both simple hypotheses, it is possible to identify the most powerful test of a given size, although the construction is beyond the scope of this module.
Note that tests are usually based on a one-dimensional test statistic T (X) whose sample space
is some subset of R. The rejection region is then a set of possible values for T (X), so we also think
of C as a subset of R. In order to be able to ensure the test has size α, the distribution of the test
statistic under H0 should be known.
For the weight gain example (using the Example 66 data), the test statistic for H0 : µ = 0 is tobs = x̄/(s/√n) = 0.8672/(0.9653/√68) = 7.41. The observed value of 7.41 does not seem reasonable from the graph below, which plots the density of the t-distribution with 67 degrees of freedom with a vertical line drawn at the observed value of 7.41, far out in the tail. So there is evidence here to reject H0 : µ = 0.
♥ Example 70 Fast food waiting time revisited Suppose the manager of the fast food outlet
claims that the average waiting time is only 60 seconds. So, we want to test H0 : µ = 60. We have
n = 20, x̄ = 67.85, s = 18.36. Hence our test statistic for the null hypothesis H0 : µ = µ0 = 60 is
tobs = (x̄ − µ0)/(s/√n) = (67.85 − 60)/(18.36/√20) = 1.91.
The observed value of 1.91 may or may not be reasonable from the graph below. The graph plots the density of the t-distribution with 19 degrees of freedom, and a vertical line is drawn
at the observed value of 1.91. This value is a bit out in the tail but we are not sure, unlike in the
previous weight gain example. So how can we decide whether to reject the null hypothesis?
The value of h depends on the sample size n and can be found by issuing the qt command.
Here are a few examples obtained from qt(0.975, df=c(1, 4, 9, 14, 19, 29, 49, 99)):
n 2 5 10 15 20 30 50 100 ∞
h 12.71 2.78 2.26 2.14 2.09 2.05 2.01 1.98 1.96
Note that we need to put n − 1 in the df argument of qt and the last value for n = ∞ is obtained
from the normal distribution.
However, if the alternative hypothesis is one-sided, e.g. H1 : µ > µ0 , then the critical region
will only be in the right tail. Consequently, we need to leave an area α on the right and as a result
the critical values will be from a command such as:
qt(0.95, df=c(1, 4, 9, 14, 19, 29, 49, 99))
n 2 5 10 15 20 30 50 100 ∞
h 6.31 2.13 1.83 1.76 1.73 1.70 1.68 1.66 1.64
♥ Example 71 Fast food waiting time We would like to test H0 : µ = 60 against the alternative
H1 : µ > 60, as this alternative will refute the claim of the store manager that customers only wait
for a maximum of one minute. We calculated the observed value to be 1.91. This is a one-sided test
and for a 5% level of significance, the critical value h will come from qt(0.95, df=19)=1.73. Thus
the observed value is higher than the critical value so we will reject the null hypothesis, disputing
the manager’s claim regarding a minute wait.
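In R this test can be carried out directly; a minimal sketch, again assuming the 20 waiting times are in a vector x as before:
tobs <- (mean(x) - 60) / (sd(x) / sqrt(length(x)))   # observed test statistic, about 1.91
tobs > qt(0.95, df = length(x) - 1)                  # TRUE means reject H0 at the 5% level
t.test(x, mu = 60, alternative = "greater")          # the same one-sided test, with a p-value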
4.8.5 p-values
The result of a test is most commonly summarised by rejection or non-rejection of H0 at the
stated level of significance. An alternative, which you may see in practice, is the computation of
a p-value. This is the probability that the reference distribution would have generated the actual
observed value of the statistic or something more extreme. A small p-value is evidence against
the null hypothesis, as it indicates that the observed data were unlikely to have been generated
by the reference distribution. In many examples a threshold of 0.05 is used, below which the null
hypothesis is rejected as being insufficiently well-supported by the observed data. Hence for the
t-test with a two-sided alternative, the p-value is given by:
p = P(|T| > |tobs|) = 2 P(T > |tobs|),
where T has a tn−1 distribution and tobs is the observed sample value.
However, if the alternative is one-sided and to the right then the p-value is given by:
p = P (T > tobs ),
where T has a tn−1 distribution and tobs is the observed sample value.
A small p-value corresponds to an observation of T that is improbable (since it is far out in the
low probability tail area) under H0 and hence provides evidence against H0 . The p-value should not
be misinterpreted as the probability that H0 is true. H0 is not a random event (under our models)
and so cannot be assigned a probability. The null hypothesis is rejected at significance level α if
the p-value for the test is less than α.
When the alternative hypothesis is two-sided the p-value has to be calculated from P(|T| > |tobs|), where tobs is the observed value and T follows the t-distribution with n − 1 df. For the weight gain example, because the alternative is two-sided, the p-value is given by
p = 2 P(T > 7.41), where T follows the t67 distribution.
This very small p-value indicates very strong evidence against the null hypothesis of no weight gain in the first year of university.
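These p-values can be computed in R using pt; a minimal sketch for the two examples above:
2 * pt(-7.41, df = 67)    # two-sided p-value for the weight gain example; essentially zero
1 - pt(1.91, df = 19)     # one-sided p-value for the fast food example; a little below 0.05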
Suppose that X1 , . . . , Xn and Y1 , . . . , Ym are i.i.d. random samples from
N(µX , σX²) and N(µY , σY²)
respectively, where it is also assumed that the X and Y variables are independent of each other.
Suppose that we want to test the hypothesis that the distributions of X and Y are identical, that
is
H0 : µX = µY , σX = σY = σ
and therefore
X̄ − Ȳ ∼ N(µX − µY , σX²/n + σY²/m).
Hence, under H0 ,
X̄ − Ȳ ∼ N(0, σ²(1/n + 1/m))   ⇒   √(nm/(n + m)) · (X̄ − Ȳ)/σ ∼ N(0, 1).
The involvement of the (unknown) σ above means that this is not a pivotal test statistic. It will be
proved in MATH2011 that if σ is replaced by its unbiased estimator S, which here is the two-sample
estimator of the common standard deviation, given by
S² = [ Σ(Xi − X̄)² + Σ(Yi − Ȳ)² ] / (n + m − 2),
where the sums run over i = 1, . . . , n and i = 1, . . . , m respectively, then
√(nm/(n + m)) · (X̄ − Ȳ)/S ∼ tn+m−2.
Hence
t = √(nm/(n + m)) · (x̄ − ȳ)/s
is a test statistic for this test. The rejection region is |t| > h where −h is the α/2 (usually 0.025)
percentile of tn+m−2 .
For the fast food service time data, comparing the n = 10 AM service times with the m = 10 PM service times, we obtain
s² = [(n − 1)s²x + (m − 1)s²y] / (n + m − 2) = 354.8,
tobs = √(nm/(n + m)) · (x̄ − ȳ)/s = 0.25.
This is not significant as the critical value h = qt(0.975,18)= 2.10 is larger in absolute value than
0.25. This can be achieved by calling the R function t.test as follows:
y <- read.csv("servicetime.csv", head=T)
t.test(y$AM, y$PM)
It automatically calculates the test statistic as 0.249 and a p-value of 0.8067. It also obtains
the 95% CI given by (–15.94, 20.14).
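A small note on the R call: by default t.test uses the Welch form of the two-sample test, which does not assume equal variances. The pooled test derived above, with n + m − 2 = 18 degrees of freedom, corresponds to adding var.equal = TRUE, which reproduces the hand calculation above:
t.test(y$AM, y$PM, var.equal = TRUE)   # pooled two-sample t-test with 18 df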
Interpretation: The values of the second test are significantly higher than the ones of the first
test, and so the second test cannot be considered as a replacement for the first.
Suppose there are N individuals in the population and we are drawing a sample of n individuals.
In SRSWR, the same unit of the population may occur more than once in the sample; there are N^n possible samples (using multiplication rules of counting), and each of these samples has equal chance 1/N^n of materialising. In the case of SRSWOR, at the rth drawing (r = 1, . . . , n) there
are N − r + 1 individuals in the population to sample from. All of these individuals are given equal
probability of inclusion in the sample. Here no member of the population can occur more than
once in the sample. There are NCn possible samples and each has equal probability 1/NCn of being selected. This is also justified since at the rth stage one chooses, from the N − r + 1 individuals which have not yet been chosen in earlier drawings, one of the n − r + 1 individuals still to be included in the sample. In this case too, the probability that any specified individual, say the ith, is selected at any drawing, say the kth drawing, is
(N − 1)/N × (N − 2)/(N − 1) × · · · × (N − k + 1)/(N − k + 2) × 1/(N − k + 1) = 1/N,
as in the case of the SRSWR. It is obvious that if one takes n individuals all at a time from the
population, giving equal probability to each of the N Cn combinations of n members out of the N
members in the population, one will still have SRSWOR.
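For illustration, the claim that each individual has the same chance of selection can be checked by a quick simulation in R (the values of N and n below are arbitrary):
N <- 10; n <- 3
draws <- replicate(10000, sample(1:N, n))    # each column is an ordered SRSWOR of size n
mean(draws[2, ] == 1)                        # P(individual 1 selected at the 2nd draw); about 1/N = 0.1
mean(apply(draws, 2, function(s) 1 %in% s))  # P(individual 1 appears in the sample); about n/N = 0.3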
There are a huge number of considerations and concepts to design good surveys avoiding bias.
There may be response bias, observational bias, biases from non-response, interviewer bias, bias
due to defective sampling technique, bias due to substitution, bias due to faulty differentiation of
sampling units and so on. However, discussion of such topics is beyond the scope and syllabus of
this module.
Treatment. The different procedures under comparison in an experiment are the different treat-
ments. For example, in a chemical engineering experiment different factors such as Temperature
(T), Concentration (C) and Catalyst (K) may affect the yield value from the experiment.
Experimental unit. An experimental unit is the material to which the treatment is applied and
on which the variable under study is measured. In a human experiment in which the treatment
affects the individual, the individual will be the experimental unit.
1. Randomisation. This is necessary to draw valid conclusions and minimise bias. In an ex-
periment to compare two pain-relief tablets we should allocate the tablets randomly among
participants – not one tablet to the boys and the other to the girls.
3. Local control. In the simplest case of local control, the experimental units are divided into
homogeneous groups or blocks. The variation among these blocks is eliminated from the error
and thereby efficiency is increased. These considerations lead to the topic of construction of
block designs, where random allocation of treatments to the experimental units may be re-
stricted in different ways in order to control experimental error. Another means of controlling
error is through the use of confounded designs where the number of treatment combinations
is very large, e.g. in factorial experiments.
Factorial experiment A thorough discussion of construction of block designs and factorial ex-
periments is beyond the scope of this module. However, these topics are studied in the third-year
module MATH3014: Design of Experiments. In the remainder of this lecture, we simply discuss an
example of a factorial experiment and how to estimate different effects.
To investigate how factors jointly influence the response, they should be investigated in an
experiment in which they are all varied. Even when there are no factors that interact, a factorial
experiment gives greater accuracy. Hence they are widely used in science, agriculture and industry.
Here we will consider factorial experiments in which each factor is used at only two levels. This is
a very common form of experiment, especially when many factors are to be investigated. We will
code the levels of each factor as 0 (low) and 1 (high). Each of the 8 combinations of the factor
levels was used in the experiment. Thus the treatments in standard order were:
000, 001, 010, 011, 100, 101, 110, 111.
Each treatment was used in the manufacture of one batch of the chemical and the yield (amount
in grams of chemical produced) was recorded. Before the experiment was run, a decision had to
be made on the order in which the treatments would be run. To avoid any unknown feature that
changes with time being confounded with the effects of interest, a random ordering was used; see
below. The response data are also shown in the table.
Questions of interest
1. How much is the response changed when the level of one factor is changed from high to low?
For simplicity, we first consider the case of two factors only and call them factors A and B, each
having two levels, ‘low’ (0) and ‘high’ (1). The four treatments in the experiment are then 00, 01,
10, 11, and suppose that we have just one response measured for each treatment combination. We
denote the four response values by yield00 , yield01 , yield10 and yield11 .
Main effects
For this particular experiment, we can answer the first question by measuring the difference between
the average yields at the two levels of A:
The average yield at the high level of A is (1/2)(yield11 + yield10).
The average yield at the low level of A is (1/2)(yield01 + yield00).
These are represented by the open stars in Figure 4.1. The main effect of A is defined as the difference between these two averages, that is
A = (1/2)(yield11 + yield10) − (1/2)(yield01 + yield00) = (1/2)(yield11 + yield10 − yield01 − yield00),
[Figure 4.1: Illustration of the yields in a two-factor experiment, showing the average yields at the high and low levels of factors A and B.]
which is represented by the difference between the two open stars in Figure 4.1. Notice that A is
used to denote the main effect of a factor as well as its name. This is a common practice. This
quantity measures how much the response changes when factor A is changed from its low to its
high level, averaged over the levels of factor B.
[Figures: the effect of changing B at low A, the effect of changing B at high A, and the main effect of Factor B.]
The main effect of B is defined analogously, as the average of the effect of changing B at low A and the effect of changing B at high A:
B = (1/2)[(yield11 − yield10) + (yield01 − yield00)] = (1/2)(yield11 + yield01 − yield10 − yield00).
The interaction between A and B is half the difference between the effect of changing B at high A and the effect of changing B at low A:
AB = (1/2)[(yield11 − yield10) − (yield01 − yield00)].
• If we interchange the roles of A and B in this expression we obtain the same formula.
• Definition: The main effects and interactions are known collectively as the factorial effects.
• Important note: When there is a large interaction between two factors, the two main effects
cannot be interpreted separately.
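As a small illustration, the factorial effects for a 2 × 2 experiment can be computed directly in R; the four yield values below are made up for the sketch:
y00 <- 54; y01 <- 60; y10 <- 63; y11 <- 71     # hypothetical yields for treatments 00, 01, 10, 11
A  <- (y11 + y10 - y01 - y00) / 2              # main effect of A
B  <- (y11 + y01 - y10 - y00) / 2              # main effect of B
AB <- ((y11 - y10) - (y01 - y00)) / 2          # interaction between A and B
c(A = A, B = B, AB = AB)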
Appendix A
Mathematical Concepts Needed in MATH1024
During the first week’s workshop and problem class you are asked to go through this appendix. Please try the proofs/exercises as well and verify that your solutions are correct by talking to a workshop assistant. Solutions to some of the exercises are given at the end of this chapter and some others are discussed in lectures.
4. The function f(x) attains a local maximum at the solution if the sign of the second derivative there is negative.
5. There is neither a minimum nor a maximum if the second derivative is zero at the solution. Such a point is called a point of inflection.
Prove that
nCk = [n × (n − 1) × · · · × (n − k + 1)] / [1 × 2 × 3 × · · · × k].
Proof We have:
nCk = n!/[k! (n − k)!]
    = (1/k!) × [1 × 2 × 3 × · · · × (n − k) × (n − k + 1) × · · · × (n − 1) × n] / [1 × 2 × 3 × · · · × (n − k)]
    = (1/k!) [(n − k + 1) × · · · × (n − 1) × n].
Hence the proof is complete. This enables us to calculate 6C2 = (6 × 5)/(1 × 2) = 15. In general, for calculating nCk:
the numerator is the multiplication of k terms starting with n and counting down,
and the denominator is the multiplication of the first k positive integers.
1. nCk = nCn−k. [This means the number of ways of choosing k items out of n items is the same as the number of ways of choosing n − k items out of n items. Why is this meaningful?]
2. n+1Ck = nCk + nCk−1.
3. For each of (1) and (2), state the meaning of these equalities in terms of the numbers of selections of k items without replacement.
(a + b)^n = Σ_{x=0}^{n} nCx a^x b^{n−x}
for any numbers a and b and a positive integer n. This is called the binomial theorem. This can be used to prove the following:
(1 − p)^n + nC1 p(1 − p)^{n−1} + · · · + nCx p^x (1 − p)^{n−x} + · · · + p^n = (p + 1 − p)^n = 1.
Thus
Σ_{x=0}^{n} nCx p^x (1 − p)^{n−x} = 1,
for n > 1.
Σ_{x+y=z} mCx nCy = m+nCz,
where the above sum is over all possible integer values of x and y with x + y = z, such that 0 ≤ x ≤ m and 0 ≤ y ≤ n.
Hint Consider the identity
(1 + t)^m (1 + t)^n = (1 + t)^{m+n}
and compare the coefficients of t^z on both sides. If this is hard, please try small values of m and n, e.g. 2, 3 and see what happens.
Exercise 2. Hard Show that if X ∼ Poisson(λ), Y ∼ Poisson(µ) and X and Y are independent
random variables then
X + Y ∼ Poisson(λ + µ).
Writing Z = X + Y, we have for z = 0, 1, 2, . . .
P{Z = z} = P{X + Y = z}
  = Σ_{x=0}^{∞} P{X + Y = z | X = x} P{X = x}
  = Σ_{x=0}^{z} P{Y = z − x} P{X = x}
  = Σ_{x=0}^{z} [e^{−µ} µ^{z−x}/(z − x)!] [e^{−λ} λ^x/x!]
  = e^{−µ−λ} Σ_{x=0}^{z} λ^x µ^{z−x}/[x! (z − x)!]
  = [e^{−(λ+µ)}/z!] Σ_{x=0}^{z} [z!/(x! (z − x)!)] λ^x µ^{z−x}
  = [e^{−(λ+µ)}/z!] Σ_{x=0}^{z} zCx λ^x µ^{z−x}
  = [e^{−(λ+µ)}/z!] (µ + λ)^z   (binomial sum).
Thus Z ∼ Poisson(λ+µ) since the above is the probability mass function of the Poisson distribution
with parameter λ + µ.
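This result can also be checked numerically in R, since dpois gives the Poisson probability mass function; a minimal sketch with arbitrary values:
lambda <- 2; mu <- 3; z <- 4
x <- 0:z
sum(dpois(x, lambda) * dpois(z - x, mu))   # convolution sum P(X + Y = z)
dpois(z, lambda + mu)                      # Poisson(lambda + mu) pmf at z; the two values agree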
1. Σ_{x=0}^{k} r^x = 1 + r + r² + · · · + r^k = (1 − r^{k+1})/(1 − r).
The power of r, k + 1, in the formula for the sum is the number of terms.
2. When k → ∞ we can evaluate the sum only when |r| < 1. In that case
Σ_{x=0}^{∞} r^x = 1/(1 − r)   [since r^{k+1} → 0 as k → ∞ for |r| < 1].
3. For a positive n and |x| < 1, the negative binomial series is given by:
(1 − x)^{−n} = 1 + nx + [n(n + 1)/2] x² + [n(n + 1)(n + 2)/6] x³ + · · · + [n(n + 1)(n + 2) · · · (n + k − 1)/k!] x^k + · · ·
1 + 2q + 3q² + 4q³ + · · · = (1 − q)^{−2}.
1. log(ab) = log(a) + log(b) [Log of the product is the sum of the logs]
2. log(a/b) = log(a) − log(b) [Log of the ratio is the difference of the logs]
There is no simple formula for log(a + b) or log(a − b). Now try the following exercises:
1. Show that log(x e^{ax³+3x+b}) = log(x) + ax³ + 3x + b.
2. Satisfy yourself that Σ_{x=0}^{∞} e^{−λ} λ^x/x! = 1.
A.7 Integration
A.7.1 Fundamental theorem of calculus
We need to remember (but not prove) the fundamental theorem of calculus:
F(x) = ∫_{−∞}^{x} f(u) du   implies   f(x) = dF(x)/dx.
A function f(x) is called an even function if
f(x) = f(−x)
for all possible values of x. For example, f(x) = e^{−x²/2} is an even function for real x. A function f(x) is called an odd function if
f(x) = −f(−x)
for all possible values of x. For example, f(x) = x e^{−x²/2} is an odd function for real x. It can be proved that the integral of an odd function over a range that is symmetric about zero is zero, and that the integral of an even function over (−a, a) is twice its integral over (0, a).
The gamma function is defined by
Γ(α) = ∫₀^∞ x^{α−1} e^{−x} dx,   for α > 0.
It can be shown that Γ(α) exists and is finite for all real values of α > 0. Obviously it is non-negative since it is the integral of a non-negative function. It is easy to see that
Γ(1) = 1
since ∫₀^∞ e^{−x} dx = 1. The argument α enters only through the power of the dummy (x). Remember
this, as we will have to recognise many gamma integrals. Important points are:
1. It is an integral from 0 to ∞.
2. The integrand must be of the form dummy (x) to the power of the parameter (α) minus one (x^{α−1}), multiplied by e to the power of the negative dummy (e^{−x}).
The reduction formula is
Γ(α) = (α − 1) Γ(α − 1),
provided α > 1. The condition α > 1 is required to ensure that Γ(α − 1) exists. The proof of this is not required, but it follows easily by integration by parts, integrating the function e^{−x} and differentiating the function x^{α−1}. For an integer n, by repeatedly applying the
reduction formula and Γ(1) = 1, show that
Γ(n) = (n − 1)!.
Thus Γ(5) = 4! = 24. You can guess how rapidly the gamma function increases! The last formula
we need to remember for our purposes is:
Γ(1/2) = √π.
Proof of this is complicated and not required for this module. Using this we can calculate
Γ(3/2) = (3/2 − 1) Γ(1/2) = √π/2.
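These facts are easy to check numerically in R, where gamma() evaluates the gamma function and integrate() performs numerical integration:
gamma(5)                                             # = 4! = 24
integrate(function(x) x^4 * exp(-x), 0, Inf)$value   # numerically the same integral
gamma(0.5)^2                                         # = pi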
2.
RHS = n!/[k! (n − k)!] + n!/[(k − 1)! (n − [k − 1])!]
    = {n!/[k! (n − [k − 1])!]} × [n − (k − 1) + k]
    = n! (n + 1)/[k! (n − [k − 1])!]
    = (n + 1)!/[k! (n + 1 − k)!]
    = LHS.
3. (a) Number of selections (without replacement) of k objects from n is exactly the same as
the number of selections of (n − k) objects from n.
(b) The number of selections of k items from (n + 1) consists of:
• the number of selections that include the (n + 1)th item; there are nCk−1 of these;
• the number of selections that exclude the (n + 1)th item; there are nCk of these.
Appendix B
Worked Examples
79. [Bayes theorem] Let B1, B2 and B3 be events with P{B1} = 0.3, P{B2} = 0.5 and P{B3} = 0.2, and let A be an event with P{A|B1} = 0.1, P{A|B2} = 0.3 and P{A|B3} = 0.5. Note that B1, B2, B3 are mutually exclusive and exhaustive. Find P{A} and P{B1|A}.
80. [Independent events] The probability that Jane can solve a certain problem is 0.4 and that
Alice can solve it is 0.3. If they both try independently, what is the probability that it is
solved?
81. [Random variable] A fair die is tossed twice. Let X equal the first score plus the second score. Determine (a) the probability function of X, and (b) its cumulative distribution function.
82. [Random variable] A coin is tossed three times. If X denotes the number of heads minus the number of tails, find the probability function of X and draw a graph of its cumulative distribution function when (a) the coin is fair, and (b) the probability of a head is 3/5.
83. [Expectation and variance] The random variable X has probability function
px = (1/14)(1 + x) if x = 1, 2, 3, 4, and px = 0 otherwise.
Find the mean and variance of X.
84. [Expectation and variance] Let X denote the score when a fair die is thrown. Determine the
probability function of X and find its mean and variance.
85. [Expectation and variance] Two fair dice are tossed and X equals the larger of the two scores
obtained. Find the probability function of X and determine E(X).
86. [Expectation and variance] The random variable X is uniformly distributed on the integers
0, ±1, ±2, . . . , ±n, i.e.
px = 1/(2n + 1) if x = 0, ±1, ±2, . . . , ±n, and px = 0 otherwise.
Obtain expressions for the mean and variance in terms of n. Given that the variance is 10,
find n.
87. [Poisson distribution] The number of incoming calls at a switchboard in one hour is Poisson
distributed with mean λ = 8. The numbers arriving in non-overlapping time intervals are
statistically independent. Find the probability that in 10 non-overlapping one hour periods
at least two of the periods have at least 15 calls.
88. [Continuous distribution] The random variable X has probability density function
f(x) = kx²(1 − x) if 0 ≤ x ≤ 1, and f(x) = 0 otherwise.
(b)
P{both boys | at least one boy} = P{both boys and at least one boy} / P{at least one boy}
  = P{both boys} / (1 − P{both girls}) = (1/4)/(1 − 1/4) = 1/3.
78. In the obvious notation,
P{LH | LF} = P{LF | LH} P{LH} / [P{LF | LH} P{LH} + P{LF | RH} P{RH}]
  = (0.8 × 0.1) / (0.8 × 0.1 + 0.15 × 0.9) = 16/43.
79. By the theorem of total probability,
P{A} = P{B1}P{A|B1} + P{B2}P{A|B2} + P{B3}P{A|B3}
     = 0.3 × 0.1 + 0.5 × 0.3 + 0.2 × 0.5
     = 0.28.
Now by the Bayes theorem,
P{B1|A} = P{B1}P{A|B1} / P{A} = (0.3 × 0.1)/0.28 = 3/28.
80. P{problem solved} = 1 − P{neither can solve it} = 1 − 0.6 × 0.7 = 0.58.
81.
(a) Working along the cross-diagonals we find by enumeration that X has the following
probability function
x    2     3     4     5     6     7     8     9     10    11    12
px   1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36
More concisely,
px = (6 − |x − 7|)/36 if x = 2, . . . , 12, and px = 0 otherwise.
(b)
F(x) = 0 if x < 2,
F(x) = 1/36 if 2 ≤ x < 3,
F(x) = 3/36 if 3 ≤ x < 4, etc.
Using the formula Σ_{i=1}^{n} i = n(n + 1)/2 we find that the cumulative distribution function F(x) can be written concisely in the form
F(x) = 0 if x < 2,
F(x) = (6 + [x − 7])(7 + [x − 7])/72 if 2 ≤ x < 7,
F(x) = 21/36 + [x − 7](11 − [x − 7])/72 if 7 ≤ x < 12,
F(x) = 1 if x ≥ 12,
where [y] denotes the largest integer not exceeding y.
(a) x     −3    −1    1     3          (b) x     −3      −1       1        3
    px    1/8   3/8   3/8   1/8            px    8/125   36/125   54/125   27/125

(a) F(x) = 0 if x < −3; 1/8 if −3 ≤ x < −1; 1/2 if −1 ≤ x < 1; 7/8 if 1 ≤ x < 3; 1 if x ≥ 3.
(b) F(x) = 0 if x < −3; 8/125 if −3 ≤ x < −1; 44/125 if −1 ≤ x < 1; 98/125 if 1 ≤ x < 3; 1 if x ≥ 3.
83. p1 = 2/14, p2 = 3/14, p3 = 4/14, p4 = 5/14.
E(X) = Σ x px = 1 × 2/14 + 2 × 3/14 + 3 × 4/14 + 4 × 5/14 = 20/7.
E(X²) = 1 × 2/14 + 4 × 3/14 + 9 × 4/14 + 16 × 5/14 = 65/7.
Therefore Var(X) = E(X²) − [E(X)]² = 65/7 − 400/49 = 55/49.
84. The probability function of X is
x    1    2    3    4    5    6
px   1/6  1/6  1/6  1/6  1/6  1/6
Hence
E(X) = Σ x px = (1/6) × (1 + 2 + 3 + 4 + 5 + 6) = 7/2,
Var(X) = E(X²) − [E(X)]² = (1/6) × (1 + 4 + 9 + 16 + 25 + 36) − (7/2)² = 35/12.
85. Using the sample space for Question 5, we find that X has probability function
x    1     2     3     4     5     6
px   1/36  3/36  5/36  7/36  9/36  11/36
E(X) = (1/36) × 1 + (3/36) × 2 + (5/36) × 3 + (7/36) × 4 + (9/36) × 5 + (11/36) × 6 = 161/36.
86.
E(X) = Σ x px = [1/(2n + 1)] Σ_{x=−n}^{n} x = 0.
E(X²) = Σ x² px = [1/(2n + 1)] Σ_{x=−n}^{n} x² = [2/(2n + 1)] × n(n + 1)(2n + 1)/6 = n(n + 1)/3.
Therefore Var(X) = E(X²) − [E(X)]² = n(n + 1)/3.
If Var(X) = 10, then n(n + 1)/3 = 10, i.e. n² + n − 30 = 0. Therefore n = 5 (rejecting −6).
87. Let X be the number of calls arriving in an hour and let P (X ≥ 15) = p.
Then Y , the number of times out of 10 that X ≥ 15, is B(n, p) with n = 10 and p =
1 − 0.98274 = 0.01726.
Therefore P(Y ≥ 2) = 1 − P(Y ≤ 1) = 1 − [(0.98274)^10 + 10 × (0.01726) × (0.98274)^9] = 0.01223.
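For a quick check of this calculation in R:
p <- 1 - ppois(14, lambda = 8)        # P(X >= 15) for X ~ Poisson(8); about 0.01726
1 - pbinom(1, size = 10, prob = p)    # P(Y >= 2); about 0.0122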
88. (a) k ∫₀¹ x²(1 − x) dx = 1, which implies that k = 12.
(b) P(0 < X < 1/2) = 12 ∫₀^{1/2} x²(1 − x) dx = 5/16.
(c) The number of observations lying in the interval (0, 1/2) is binomially distributed with parameters n = 100 and p = 5/16, so that the mean number of observations in (0, 1/2) is np = 31.25.
89. (a)
E(X) = λ ∫₀^∞ x e^{−λx} dx
     = [−x e^{−λx}]₀^∞ + ∫₀^∞ e^{−λx} dx   (integrating by parts)
     = 0 − (1/λ)[e^{−λx}]₀^∞ = 1/λ.
(b)
E(X²) = λ ∫₀^∞ x² e^{−λx} dx
      = [−x² e^{−λx}]₀^∞ + 2 ∫₀^∞ x e^{−λx} dx
      = 0 + (2/λ) E(X)   (using the first result)
      = 2/λ².
Therefore Var(X) = E(X²) − [E(X)]² = 2/λ² − 1/λ² = 1/λ², and σ = √Var(X) = 1/λ.
(c) The mode is x = 0 since this value maximises f (x).
(d) The median m is given by
∫₀^m λ e^{−λx} dx = 1/2, which implies that m = (1/λ) log(2).
(b) F(x) = (1/π) ∫_{−∞}^{x} dy/(1 + y²) = (1/π)[tan⁻¹(y)]_{−∞}^{x} = (1/π)(tan⁻¹(x) + π/2).
(c) P(−1 ≤ X ≤ 1) = F(1) − F(−1) = (1/π)(tan⁻¹(1) + π/2) − (1/π)(tan⁻¹(−1) + π/2) = 1/2.
Find
(a) unbiased estimates of µb and σb2 , the mean and variance of the weights of the boys;
(b) unbiased estimates of µg and σg2 , the mean and variance of the weights of the girls;
(c) an unbiased estimate of µb − µg .
Assuming that σb2 = σg2 = σ 2 , calculate an unbiased estimate of σ 2 using both sets of weights.
93. [Estimation] The time that a customer has to wait for service in a restaurant has the prob-
ability density function
f(x) = 3θ³/(x + θ)⁴ if x ≥ 0, and f(x) = 0 otherwise,
95. [Confidence interval] At the end of a severe winter a certain insurance company found that
of 972 policy holders living in a large city who had insured their homes with the company,
357 had suffered more than £500-worth of snow and frost damage. Calculate an approximate
95% confidence interval for the proportion of all homeowners in the city who suffered more
than £500-worth of damage. State any assumptions that you make.
96. [Confidence interval] The heights of n randomly selected seven-year-old children were mea-
sured. The sample mean and standard deviation were found to be 121 cm and 5 cm re-
spectively. Assuming that height is normally distributed, calculate the following confidence
intervals for the mean height of seven-year-old children:
97. [Confidence interval] A random variable is known to be normally distributed, but its mean µ
and variance σ 2 are unknown. A 95% confidence interval for µ based on 9 observations was
found to be [22.4, 25.6]. Calculate unbiased estimates of µ and σ 2 .
98. [Confidence interval] The wavelength of radiation from a certain source is 1.372 microns. The
following 10 independent measurements of the wavelength were obtained using a measuring
device:
1.359, 1.368, 1.360, 1.374, 1.375, 1.372, 1.362, 1.372, 1.363, 1.371.
Assuming that the measurements are normally distributed, calculate 95% confidence limits
for the mean error in measurements obtained with this device and comment on your result.
99. [Confidence interval] In five independent attempts, a girl completed a Rubik’s cube in 135.4,
152.1, 146.7, 143.5 and 146.0 seconds. In five further attempts, made two weeks later, she
completed the cube in 133.1, 126.9, 129.0, 139.6 and 144.0 seconds. Find a 90% confidence
interval for the change in the mean time taken to complete the cube. State your assumptions.
100. [Confidence interval] In an experiment to study the effect of a certain concentration of insulin
on blood glucose levels in rats, each member of a random sample of 10 rats was treated with
insulin. The blood glucose level of each rat was measured both before and after treatment.
The results, in suitable units, were as follows.
Rat 1 2 3 4 5 6 7 8 9 10
Level before 2.30 2.01 1.92 1.89 2.15 1.93 2.32 1.98 2.21 1.78
Level after 1.98 1.85 2.10 1.78 1.93 1.93 1.85 1.67 1.72 1.90
Let µ1 and µ2 denote respectively the mean blood glucose levels of a randomly selected rat
before and after treatment with insulin. By considering the differences of the measurements
on each rat and assuming that they are normally distributed, find a 95% confidence interval
for µ1 − µ2 .
101. [Confidence interval] The heights (in metres) of 10 fifteen-year-old boys were as follows:
1.59, 1.67, 1.55, 1.63, 1.69, 1.58, 1.66, 1.62, 1.64, 1.61.
Assuming that heights are normally distributed, find a 99% confidence interval for the mean
height of fifteen-year-old boys.
If you were told that the true mean height of boys of this age was 1.67 m, what would you
conclude?
92. (a) An unbiased estimate of µb is the sample mean of the weights of the boys,
µ̂b = (1/10)(77 + 67 + . . . + 81) = 67.3.
An unbiased estimate of σb² is the sample variance of the weights of the boys,
σ̂b² = ((77² + 67² + . . . + 81²) − 10 µ̂b²)/9 = 52.67̇.
(b) Similarly, for the girls,
µ̂g = (1/10)(42 + 57 + . . . + 59) = 52.4,
σ̂g² = ((42² + 57² + . . . + 59²) − 10 µ̂g²)/9 = 56.71̇.
(c) An unbiased estimate of µb − µg is µ̂b − µ̂g = 67.3 − 52.4 = 14.9.
Assuming σb² = σg² = σ², since both samples are of size 10 an unbiased estimate of σ² is
(1/2)(σ̂b² + σ̂g²) = (1/2)(52.67̇ + 56.71̇) = 54.694̇.
where zγ is the 100γ percentile of the standard normal distribution. The width of the CI is
0.8zγ . The width of the quoted confidence interval is 1.316. Therefore, assuming that the
quoted interval is symmetric, 0.8 zγ = 1.316, so that zγ = 1.645. This implies that α = 0.1 and hence 100(1 − α) = 90, i.e. the confidence level is 90%.
95. Assuming that although the 972 homeowners are all insured within the same company they
constitute a random sample from the population of all homeowners in the city, the 95%
interval is given approximately by
" r r #
p̂(1 − p̂) p̂(1 − p̂)
p̂ − 1.96 , p̂ + 1.96 ,
n n
96. The 100(1 − α)% confidence interval is, in the usual notation,
x̄ ± (critical value) × s/√n,
where the critical value is the 100(1 − α/2)th percentile of the t-distribution with n − 1 degrees of freedom. Here x̄ = 121 and s = 5.
(a) For the 90% CI, the critical value is 1.753 (qt(0.95, df=15) in R, so n = 16 here) and the interval is 121 ± 1.753 × 5/√16, i.e. (118.8, 123.2).
A 95% confidence interval for the mean error is obtained by subtracting the true wavelength
of 1.372 from each endpoint. This gives [−0.0087, −0.0001]. As this contains negative values
only, we conclude that the device tends to underestimate the true value.
99. Using x to refer to the early attempts and y to refer to the later ones, we find from the data that
Σ xi = 723.7,  Σ xi² = 104896.71,  Σ yi = 672.6,  Σ yi² = 90684.38.
This gives
x̄ = 144.74, s²x = 37.093,  ȳ = 134.52, s²y = 51.557.
Confidence limits for the change in mean time are
ȳ − x̄ ± (critical value) × √[ (4s²x + 4s²y)/8 × (1/5 + 1/5) ],
leading to the interval [−18.05, −2.39], as critical value = 1.860 (qt(0.95,df=8) in R).
As it contains only negative values, this suggests that there is a real decrease in the mean
time taken to complete the cube. We have assumed that the two samples are independent
random samples from normal distributions of equal variance.
100. Let d1 , d2 , . . . , d10 denote the differences in levels before and after treatment. Their values
are
0.32, 0.16, −0.18, 0.11, 0.22, 0.00, 0.47, 0.31, 0.49, −0.12.
Then Σ_{i=1}^{10} di = 1.78 and Σ_{i=1}^{10} di² = 0.7924, so that d̄ = 0.178 and sd = 0.2299.
Note that the two samples are not independent. Thus the standard method of finding a
confidence interval for µ1 − µ2 , as used in Question 9 for example, would be inappropriate.
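The paired analysis asked for here can be done in R from the data given in the question; a minimal sketch:
before <- c(2.30, 2.01, 1.92, 1.89, 2.15, 1.93, 2.32, 1.98, 2.21, 1.78)
after  <- c(1.98, 1.85, 2.10, 1.78, 1.93, 1.93, 1.85, 1.67, 1.72, 1.90)
d <- before - after
mean(d) + c(-1, 1) * qt(0.975, df = 9) * sd(d) / sqrt(10)   # 95% CI for mu1 - mu2
t.test(before, after, paired = TRUE)                        # equivalent paired t-test and CI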
101. The mean of the heights is 1.624 and the standard deviation is 0.04326. A 99% confidence
interval for the mean height is therefore
(1.624 − 3.250 × 0.04326/√10, 1.624 + 3.250 × 0.04326/√10), i.e. (1.580, 1.668).
If we were told that the true mean height was 1.67 m then, discounting the possibility that
this information is false, we would conclude that our sample is not a random sample from
the population of all fifteen-year-old boys or that we have such a sample but an event with
probability 0.01 has occurred, namely that the 99% confidence interval does not contain the
true mean height.
Appendix C
Notes for R Laboratory Sessions
Please watch Lecture 3 on data visualisation with R first before attempting to read this any further.
These notes are designed to help you learn R at your own pace over the three planned R laboratory
hours during weeks 2-4. Live in-person help, if you get stuck, is available during the three
scheduled R lab hours only. Hence, please make the most of these hours. You will be assessed on
your proficiency in using R. More details regarding the assessment will follow.
• Rstudio is a commercial product that provides a nice front-end for the R language. You can
download a free version from https://rstudio.com/products/rstudio/download/. Down-
loading both R and Rstudio is recommended if you are working on your own computer with any of the operating systems: Mac, Windows and Linux.
• R provides many facilities for statistical modelling, data handling and graphical display. It
also allows the user extreme flexibility in manipulating and analysing data.
• R has an extensive on-line help system. You can access this using the Help menu. The help
system is particularly useful for looking up commands or functions that you know exist but
whose name or whose syntax you have forgotten. An alternative way of obtaining information
about a function is to type help(<function name>) or ?<function name>, for example
help(plot) or ?plot.
You can also put your query on any internet search engine.
C.1.2 Starting R
You can use Rstudio or R, but Rstudio is the preferred choice since it has nicer operational
functionality with more menu driven options. In the university computing systems you need to go
through the Start menu ==> navigate to Statistics and then you will find R and Rstudio.
In both Rstudio and R there is the R console that allows you to type in commands at the
prompt > directly.
You can exit R by typing
> q()
in the commands window and then hit the Enter key on the keyboard or click the Run button in the menu. You may also exit by following File → Exit.
• When calling a function, the arguments can be placed in any order provided that they are
explicitly named. Any unnamed argument passed to a function is assigned to the first variable
which has not yet been assigned. Any arguments which have defaults, do not need to be
specified. For example, consider the function qnorm which gives the quantiles of the normal
distribution. We see that the order of arguments is p, mean and sd.
qnorm(0.95, mean=-2.0, sd=3.0)
qnorm(0.95, sd=3.0, mean=-2.0)
qnorm(mean=-2.0, sd=3.0, 0.95)
all have the same effect and they all produce the same result.
• Just typing the command will not produce anything. You will have to execute either by
hitting the Enter key or by clicking the Run button in the menu.
• The assignment operator in R is <-, i.e. a ‘less than’ symbol immediately followed by a hyphen
or simply the equality sign = as you have already seen. For example,
x <- 2 + 2 # The output should be 4!
You can also use the = symbol for assignment. For example, type
y = 2 + 2
Note that an assignment does not produce any output (unless you have made an error, in
which case an error message will appear). To see the result of an assignment, you need to
examine the contents of the object you have assigned the result of the command to. For
example, typing
x
and then hitting Enter, should now give the output [1] 4. The [1] indicates that 4 is the
first component of x. Of course x only has one component here, but this helps you keep track
when the output is a vector of many components.
• You can repeat or edit previous commands by using the up and down arrow keys (↑↓).
• We normally put the commands in a file. We can open it by following: File → Open script.
For this session, please open a new script and type the commands. Periodically save the file
as, for example, Rlabs.R in H:/math1024.
• To run a bunch of commands in the opened script file we highlight the bunch and then press
the Run button in Rstudio (towards the top right corner of the script Window with a green
colour arrow) or the Run line or selection menu button in R.
• All the commands used in the R lab sessions are already typed in the file Rfile1.R that
you can download from Blackboard. It is mostly up to you to decide whether to type in
the commands or step through the commands already there in Rfile1.R. If you are struggling initially, then you can just step through the typed commands. But as you grow more confident you should type the commands in yourself to make sure that you understand them fully.
• If you are working on your own computer, please create a folder and name it C:/math1024. R is
case sensitive, so if you name it Math1024 instead of math1024 then that’s what you need to
use. Avoid folder names with spaces, e.g. do not use: Math 1024.
• In the university workstations there is a drive called H: which is permanent (it will be there for you to use throughout your 3 (or 4) year degree programme). From Windows File Explorer
navigate to H: and create a sub-folder math1024.
• Please unzip (extract) the file and save the data files in the math1024 folder you created. You
do not need to download this file again unless you are explicitly told to do so.
• In R, issue the command getwd(), which will print out the current working directory.
• Assuming you are working in the university computers, please set the working directory by
issuing the command: setwd("H:/math1024/"). In your own computer you will modify the
command to something like: setwd("C:/math1024/")
• In Rstudio, a more convenient way to set the working directory is: by following the menu
Session → Set Working Directory. It then gives you a dialogue box to navigate to the
folder you want.
• To confirm that this has been done correctly, re-issue the command getwd() and see the
output.
• Your data reading commands below will not work if you fail to follow the instruction in this subsection.
• Please remember that you need to issue the setwd("H:/math1024/") every time you log-in.
• To read a tab-delimited text file of data with the first row giving the column headers, the
command is: read.table("filename.txt", head=TRUE).
• For comma-separated files (such as the ones exported by EXCEL), the command is
read.table("filename.csv", head=TRUE, sep=",") or simply
read.csv("filename.csv", head=TRUE).
• The option head=TRUE tells that the first row of the data file contains the column headers.
• Read the help files by typing ?scan and ?read.table to learn these commands.
• You are reminded that the following data reading commands will fail if you have not set the
working directory correctly.
• Assuming that you have set the working directory to where your data files are saved, simply
type and Run
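The exact commands are not reproduced in this extract; a minimal sketch of the kind of commands intended, where the correspondence between object names and file names is an assumption (use the names of the files you actually saved):
ffood <- read.csv("servicetime.csv", head=TRUE)       # fast food AM/PM service times
wgain <- read.table("wtgain.txt", head=TRUE)          # weight gain data
# cfail <- scan("compfail.txt")                       # hypothetical file name for the computer failure counts
# bill <- read.table("billionaires.txt", head=TRUE)   # hypothetical file name for the billionaires data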
• R does not automatically show the data after reading. To see the data you need to issue a
command like: cfail, head(ffood), tail(bill) etc. after reading in the data.
• You must issue the correct command to read the data set correctly.
In the past, reading data into R has been the most difficult task for students. Please ask for
help in the lab sessions if you are still struggling with this. If all else fails, you can read the data
sets from the course web-page as follows:
• To calculate variance, try var(ffood$AM). What does the command var(c(ffood$AM, ffood$PM)) give? In R, c is the command to combine elements. For example, x <- c(1, 5).
• Variance and standard deviation (both with divisor n − 1) are obtained by using commands
like var(cfail) and sd(cfail).
• A stem and leaf diagram is produced by the command stem. Issue the command stem(ffood$AM)
and ?stem to learn more.
• Modify the command so that it looks a bit nicer: hist(cfail, xlab="Number of weekly
computer failures")
• To obtain a scatter plot of the before and after weights of the students, we issue the command
plot(wgain$initial, wgain$final)
• A nicer and more informative plot can be obtained by: plot(wgain$initial, wgain$final,
xlab="Wt in Week 1", ylab="Wt in Week 12", pch="*", las=1)
abline(0, 1, col="red")
title("A scatter plot of the weights in Week 12 against the weights in Week 1")
• You can save the graph in any format you like using the menus.
• To draw boxplots use the boxplot command, e.g., boxplot(cfail)
• The default boxplot shows the median and whiskers drawn to the nearest observation from
the first and third quartiles but not beyond the distance 1.5 times the inter-quartile range.
Points beyond the two whiskers are suspected outliers and are plotted individually.
• boxplot(ffood) generates two boxplots side-by-side: one for the AM service times and the
other for the PM service times. Try boxplot(data=bill, wealth ~ region, col=2:6)
• Various parameters of the plot are controlled by the par command. To learn about these
type ?par.
You can add, subtract and multiply vectors. For example, examine the output of 2*a6,
a7+a8 etc. R performs these operations element-wise.
• Matrices are rectangular arrays consisting of rows and columns. All data must be of the
same mode. For example, y <- matrix(1:6, nrow=3,ncol=2) creates a 3 × 2 matrix, called
y. You can access parts of y by calling things like:
y[1,2] # gives the first row second column entry of y
y[1,] # gives the first row of y
y[,2] # gives the second column of y
and so on.
Individual elements of vectors or matrices, or whole rows or columns of matrices may be
updated by assigning them new values, e.g.
a1[1] <- 3
y[1,2] <- 3
y[,2] <- c(2,2, 2).
You can do arithmetic with the matrices, for example suppose
x <- -matrix(1:6, nrow=3, ncol=2)
Now you can simply write z <- x+y to get the sum. However, x*y will get you a new matrix
whose elements are the simple products of corresponding elements of x and y.
• The View command lets you see its data frame argument like a spreadsheet. For example, type
View(dframe). In Rstudio the View command is invoked by double clicking the name of the
particular object in the ‘Environment Window’. You can print the list of all the objects in the
current environment by issuing the ls() command. The command for deleting (removing)
objects is rm(name) where name is the object to be removed.
• Lists are used to collect objects of different types. For example, a list may consist of two
matrices and three vectors of different size and modes. The components of a list have individ-
ual names and are accessed using <list name>$<component name>, similar to data frames
(which are themselves lists, of a particular form). For example,
myresults <- list(mean=10, sd=3.32, values=5:15)
Now myresults$mean will print the value of the member mean in the list myresults.
• Logical vectors
We can select a set of components of a vector by indicating the relevant components in square
brackets. For example, to select the first element of a1 <- c(1,3,5,6,8,21) we just type in
a1[1]. However, we often want to select components, based on their values, or on the values
of another vector. For example, how can we select all the values in a1 which are greater than
5? For the bill data set we may be interested in all the rows of bill which have wealth
greater than 5, or all the rows for region A.
Typing a condition involving a vector returns a logical vector of the same length containing T
(true) for those components which satisfy the condition and F (false) otherwise. For example,
try
a1[a1>5]
bill$wealth> 5
bill$region == "A"
(note the use of == in a logical operation, to distinguish it from the assignment =). A logical
vector may be used to select a set of components of any other vector. Try
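A minimal sketch of the kind of selections intended here, using the a1 vector and the bill data frame from above (the exact examples are an assumption):
a1[a1 > 5]                   # the values of a1 greater than 5
bill[bill$wealth > 5, ]      # the rows of bill with wealth greater than 5
bill[bill$region == "A", ]   # the rows of bill for region A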
• The functions any and all take a logical vector as their argument, and return a single
logical value. For example, any(x>3 & x<7) returns T, because at least one component of
its argument is T, whereas all(x>3 & x<7) returns F, because not every component of its
argument is T.
• A little exercise. How can you choose subsets of a data frame? For example, how can you pick
only the odd numbered rows? Hint: You can use the seq or rep command learned before.
For example, a <- seq(1, 10, by =2) and oddrows <- bill[a, ]
C.3.2 Plotting
We will learn to generate some interesting and informative plots using the billionaires example.
Please type in the commands and Run after each completed line with a closed ‘)’. You can ignore
the comments after the # sign, i.e. you do not have to type those in.
• All these graphics are done a lot better using a more advanced graphics package called
ggplot2. Learning of this package is not required for Math1024 assessment. But you are
invited to get started with this for your own advanced skill development. You can skip ggplot
and go straight to the next subsection if you want.
• Observe that the ggplot command takes a data frame argument and aes is the short form
for aesthetics. The arguments of aes can be varied depending on the desired plot. Here we
just have the x and y’s required to draw a scatter plot.
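The ggplot2 commands themselves are not shown in this extract; a minimal sketch (the age column of bill is an assumption, while wealth and region are used above):
library(ggplot2)    # install.packages("ggplot2") if it is not already installed
ggplot(bill, aes(x = age, y = wealth)) + geom_point(aes(colour = region))   # scatter plot
ggplot(bill, aes(x = region, y = wealth)) + geom_boxplot()                  # same idea as boxplot(wealth ~ region)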
1. group: This is the group number of the group of students who sat together for the age guessing
exercise. There were 55 groups in total.
3. females: Number of female students in the group. Hence the number of males in each group is:
size – females. There were no other gender type of students. This can be used to investigate
if female students are on average better at guessing ages from photographs.
10. abs error: absolute value of the error: |est age – tru age|
Use the command errors <- read.csv("2019ageguess.csv", head=T) to read the data. An-
swer the following questions.
1. How many rows and columns are there in the data set?
2. How many students were there in the age guessing exercise on that day? You may use the
built-in sum command. (Think, it is not 1500!) How many of the students were male and
how many were female?
3. Looking at the column tru age (e.g. by obtaining a frequency table), find the number of
photographed mathematicians for each unique value of age. Remember there are only 10
photographed mathematicians!
4. The table command can take multiple arguments for cross-tabulation. Use the table com-
mand to obtain a two-way table providing the distribution of 10 photographed mathematicians
in different categories of race and gender.
5. What are the minimum and maximum true ages of the photographed mathematicians?
6. Obtain a barplot of the true age distribution. This is the unknown population distribution
of the true ages of photographed mathematicians.
7. Obtain a histogram of the estimated age column and compare this with the true age distri-
bution seen in the barplot drawn above.
8. What is the command for plotting estimated age (on the y-axis and) against true age?
9. What are the means and standard deviations for the columns: size, females, est age, tru age,
error and abs error?
10. What is the mean number of males in each group? What is the mean number of females in
each group?
12. Note down the frequency table of the sign of the errors. That is, obtain the numbers of
negative, zero and positive errors. You may use the built-in sign function for this.
13. Obtain a histogram for the errors and another for the absolute errors. Which one is bell
shaped and why?
14. Obtain a histogram for the square-root of the absolute errors. Does it look more bell shaped
than the histogram of just the absolute errors?
15. Draw a boxplot of the absolute errors and comment on its shape.
17. Draw a side by side boxplot of the absolute errors for the two groups of mathematicians:
males and females.
18. Is it easier to guess the ages of black mathematicians? How would you order the mean absolute
error by race?
There is an R cheat sheet (see below) that you can download from Blackboard (under Course Content and R resources) for more help with getting started.
[Base R cheat sheet by Mhairi McNeill (CC BY, rstudio.com) reproduced here; it covers creating vectors, loops, getting help, data types, matrices, strings, lists, factors, maths functions, dates and basic plotting.]
Prof Sujit Sahu studied statistics and mathematics at the University of Calcutta and the Indian Statistical Institute and went on to obtain his PhD at the University of Connecticut (USA) in 1994. He joined the University of Southampton in 1999. His research area is Bayesian statistical modelling and computation.
Acknowledgements
Materials for this booklet are taken largely from the MATH1024 notes developed over the years by many colleagues who previously taught this module in the University of Southampton.
Some theoretical discussions are also taken from the books: (i) An Outline of Statistical Theory and (ii) Fundamentals of Statistics written by A. M. Goon, M. Gupta and B. Dasgupta.
Many worked examples were taken from a series of books written by F.D.J. Dunstan, A.B.J. Nix, J.F. Reynolds and R.J. Rowlands, published by R.N.D. Publications.
The author is also thankful to many first year students like you and Joanne Ellison who helped in typesetting and proofreading these notes.