Edda Course Notes
University of Melbourne
Contents
0 Introduction

1 Epidemiological studies & Experimental design
  1.1 Epidemiology — an introduction
    1.1.1 What is Statistics? Biostatistics? Epidemiology?
    1.1.2 Confounding
    1.1.3 Types of (epidemiological) study
  1.2 Experimental studies
    1.2.1 Clinical trial (medical experimental study)
    1.2.2 Field trial
    1.2.3 Community intervention trial
    1.2.4 Experiments and experimental principles
  1.3 Observational studies
    1.3.1 Cohort study
    1.3.2 Case-control studies
    1.3.3 Comparison of cohort and case-control studies
    1.3.4 Cross-sectional studies
  1.4 Review of study types
    1.4.1 A dialogue with a skeptical statistician (Gary Grunwald)
  1.5 Causality in epidemiology
  Problem Set 1

2 Exploratory data analysis
  2.1 Introduction
  2.2 Tables and diagrams
    2.2.1 Tables
    2.2.2 Diagrams
  2.3 Types of variables
    2.3.1 Some general comments on data handling
  2.4 Descriptive statistics
    2.4.1 Univariate data
    2.4.2 Numerical statistics
    2.4.3 Measures of location
    2.4.4 Measures of spread
    2.4.5 Graphical representations
    2.4.6 Bivariate data
  Problem Set 2

3 Probability and applications
  3.1 Probability: the basics
    3.1.1 Probability tables
    3.1.2 Odds
  3.2 Conditional probability
    3.2.1 Multiplication rule
    3.2.2 Conditional odds and odds ratio
  3.3 Law of Total Probability & Bayes’ Theorem
  3.4 Diagnostic testing
  3.5 Independence
  Problem Set 3

4 Probability distributions
  4.1 Random variables
    4.1.1 Discrete random variables
    4.1.2 Continuous random variables
    4.1.3 Comparison of discrete and continuous random variables
    4.1.4 Quantiles (inverse cdf)
    4.1.5 The mean
    4.1.6 The variance and the standard deviation
    4.1.7 Describing the probability distribution
  4.2 Independent trials
    4.2.1 Introduction
    4.2.2 Binomial distribution
  4.3 Poisson process
    4.3.1 Introduction
    4.3.2 Poisson distribution
    4.3.3 Incidence rate
  4.4 The normal distribution and applications
    4.4.1 The normal distribution
    4.4.2 The Central Limit Theorem
    4.4.3 Linear combinations
  Problem Set 4

5 Estimation
  5.1 Sampling and sampling distributions
    5.1.1 Random sampling
    5.1.2 The distribution of X̄
  5.2 Inference on the population mean, µ
  5.3 Point and interval estimation
  5.4 Normal: estimation of µ when σ is known
  5.5 Estimators that are approximately normal
    5.5.1 Estimation of a population proportion
    5.5.2 Estimation of a population rate
  5.6 Normal: estimation of µ when σ is unknown
  5.7 Prediction intervals (for a future observation)
  5.8 Checking normality
  5.9 Combining estimates
  Problem Set 5

6 Hypothesis Testing
  6.1 Introduction
  6.2 Types of error and power
  6.3 Testing procedures
    6.3.1 Confidence intervals
    6.3.2 The p-value
    6.3.3 Critical values
  6.4 Hypothesis testing for normal populations
    6.4.1 z-test (testing µ=µ0, when σ is known/assumed)
    6.4.2 t-test (testing µ=µ0 when σ is unknown)
    6.4.3 Approximate z-tests
  6.5 Case study: Bone density
  Problem Set 6

7 Comparative Inference
  7.1 Introduction
  7.2 Paired samples
  7.3 Independent samples
    7.3.1 Variances known
    7.3.2 Variances unknown but equal
    7.3.3 Variances unknown and unequal
  7.4 Case study: Lead exposure
  7.5 Comparing two proportions
  7.6 Comparing two rates
  7.7 Goodness of fit tests
INTRODUCTION
This text is intended for a one-semester introductory subject taught to Biomedical students
(i.e. students who intend to major in medicine, dentistry, optometry, physiotherapy, phar-
macy or other medically related fields). It is a primer in epidemiology and biostatistics.
In this text, we look at methods to achieve the goal of studying health outcomes in the
general population: to determine what characteristics or exposures are associated with
which disease outcomes.
What is the “general population”? We would like to be able to apply our findings to all
humans (and possibly even to those yet unborn?). We cannot observe every individual on
the planet (and certainly not those who haven’t yet been born). We must choose a sample
of individuals from the population, and then take observations on this sample to obtain
data.
Our conclusions can be applied to the general population only insofar as the sample is
representative of the population. Or, to put it another way, our conclusions can only be
applied to the population that the sample represents [from which it has been drawn]. For
example, if our sample is of 50-59yo women from a Melbourne clinic, then our conclusions
might apply to 50-59yo women from Melbourne. But can the conclusions be extended to
other age groups? to Australian women? to all women? . . . to men?
[Diagram: the statistical cycle — on one side the population (described by a model), on the other the sample (yielding data/observations). Probability reasons from the population model to the sample; Statistics (inference) reasons from the data back to the population. Study design governs how the sample is drawn; analysis is applied to the data.]
[Diagram: ModelPopulation ( ◦◦ ) → ModelData]
Faced with the real world, we generate a simplified model to describe it. (This is actually
something we as human beings do all the time: statistical theory just formalises it.) The
idea is that the model provides a reasonable description for the part of the real world we
are interested in.1 Based on such a model, we are able to work out what sort of data are
likely to be observed . . . and what sort are not!
Statistics, or Statistical Inference, is concerned with trying to work out what the population
is like based on the data. We would like to be able to say “if the data are like this, then
the population will be like that”. This is the important stuff! Armed with the ideas and
concepts of study design, data analysis and probability, we are able to make progress in
statistical inference: figuring out what the population is like on the basis of the observed
data. This is the subject of the remainder of the text.
Chapters 5 and 6 are concerned with “one-sample” statistics, introducing the fundamental
concepts of statistics. Given some RealData, i.e. observed data from a sample, we treat
this as ModelData, i.e. data obtained from the ModelPopulation, and learn what we can
say about the population it has been obtained from. Based on the (Probability) connection
between ModelPopulation and ModelData, we can infer something about the (Statistics)
connection running the other way: from the observed ModelData back to the ModelPopulation.
1 These models can be expressed mathematically. This is where the underlying theory of Mathematical Statistics
comes in. This can involve some heavy-duty mathematics . . . which we ignore. However, a few of the basic ideas
are introduced, because they help in understanding the methodology.
[Diagram: RealPopulation and RealData above, modelled from below by ModelPopulation ( ◦◦ ) and ModelData; an arrow runs from ModelData back to ModelPopulation, representing statistical inference.]
Insofar as the ModelPopulation represents the RealPopulation, we can then draw conclu-
sions about the general population.
In Chapter 7, statistical inference is extended to the important “two-sample” case, in which
we imagine the samples ( ◦◦ ) are drawn from two populations (often ‘treated’ and ‘untreated’). The
problem now is to compare these two populations. Are they the same? Are they different?
(If they are different, this indicates that the treatment is having an effect.) How different?
(How much effect?)
In Chapter 8, we generalise in another direction: one-sample but two variables. In this
case our interest is in whether there is any relation between the variables . . . and how this
relation might be described and used.
It is not hard to see that we could extend the model of Chapter 7 to the “k-sample” case; and
the methods of Chapter 8 to more than two variables. These extensions are not considered
in this text. You can learn about them in your next Statistics text.
Throughout this text, the statistical package R is used for calculations and visualising data.
R is a free software environment for statistical computing and graphics. It has become the
software of choice for most statisticians. You can download R for free from the R-project
website: https://www.r-project.org/. In addition, RStudio provides a user-friendly
interface to R. RStudio can be downloaded for free from https://www.rstudio.com/.
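To give a first taste of R, here is a minimal sketch (the numbers are made up purely for illustration):
> x = c(2.3, 4.1, 3.8, 5.0)   # enter a small data set
> mean(x)                     # a first calculation: the average of x
[1] 3.8
> hist(x)                     # a first graph: a histogram of x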
Chapter 1

Epidemiological studies & Experimental design
“It is of the highest importance in the art of deduction to be able to recognise out of a number of facts which are
incidental and which are vital. Otherwise your energy and attention must be dissipated instead of concentrated.”
Sherlock Holmes, The Adventure of the Reigate Squires, 1894.
Statistical methods are central in medical research; in fact, in an editorial in the millennium
year 2000, a leading medical journal, the New England Journal of Medicine, presented
“Application of statistics to medicine” as one of the eleven most important developments
in medicine in the last thousand years.1
But each year, a greater proportion of Australian residents die. Why is it so?
Figure 1.1: Age distribution of the populations of Australia and South Africa
It is seen from Figure 1.1 that Australians tend to be older. It is also true that for individuals
of the same age in the two countries, the death rate among Australians is less than the death
rate among South Africans; and this is true for any age. But, in any country, older people
die at a greater rate than younger people. As Australia has a population that is older than
that in South Africa, a greater proportion of Australians die in any one year, despite the
lower death rates within the age categories.
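The arithmetic behind this can be checked with a toy calculation in R. All the numbers below are invented for illustration only: country A (playing the role of Australia) has lower death rates in each age group than country B, but an older age structure.
> rateA = c(young = 0.001, old = 0.050)   # age-specific death rates in A (lower in both groups)
> rateB = c(young = 0.002, old = 0.060)   # age-specific death rates in B
> popA  = c(young = 0.4, old = 0.6)       # A has an older age structure
> popB  = c(young = 0.8, old = 0.2)       # B has a younger age structure
> sum(rateA * popA)                       # overall death rate in A
[1] 0.0304
> sum(rateB * popB)                       # overall death rate in B: lower, despite higher age-specific rates
[1] 0.0136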
This situation illustrates what is called confounding.
1.1.2 Confounding
A confounding variable is a variable that affects the relationship between the variables in
question. We examine confounding and confounding variables in more detail later in the
chapter. Here, age is the confounding variable: it affects the relationship between country
and death-rate.
Confounding is a common problem in making comparisons between groups. One that we
need to be aware of, and to overcome.
The extreme case of confounding is where all individuals (in a group of individuals under
study) with attribute A also have attribute B; and those who do not have A do not have B.
For example, suppose that in a study group all ten females are vegetarians and the
ten males are not. If we measure cholesterol levels for this group, we cannot
know whether any difference was due to gender or diet. In this case the con-
founding is clear. We can’t distinguish the effects of A and B at all.
The same sort of thing applies, to a lesser extent, if there is a tendency for those with at-
tribute A to have attribute B: if more of the females are vegetarians compared to the males.
Instead of comparing (A) vs (A′ ), we are actually comparing (A&B) vs (A′ &B ′ ). If there is
a difference, we don’t know whether it’s due to A, or to B, or to both. The effects of A and
B are confounded. [ Note: A′ denotes notA, or the complement of A.]
In the example we started with, the comparison being made is not really (Australia) vs
(South Africa), but rather it is (Australia & older) vs (South Africa & younger).
This sort of confounding tended to happen in the bad old days (of early medical research)
when a doctor tended to give a new treatment (that they believed was better) to sicker
patients . . . in order to help them. The result then is a comparison between T &S and T ′ &S ′ ,
where T denotes treatment and S denotes sicker. Even if T is helpful (i.e. it increases the
chance of improvement) it is likely that the first group will do worse. The treatment effect
is masked by the helpful doctor.
Gosset2 pointed out several problems with the conduct of the Lanarkshire milk experiment of 1930, in which schoolchildren were given milk or no milk. The major
problem though was the non-random allocation of treatment (milk vs no-milk).
The initial selection of children was random — on the principle that both con-
trols and treated individuals should be representative of children between 5
and 12 years of age. So far so good. But teachers were allowed to make sub-
stitutions “if it looked as though there were too many well- or ill-nourished
children in either group”. The teachers did what anyone would tend to do,
and re-assigned some of the ill-nourished children to the milk group and some
of the well-nourished children to the no-milk group. The result was that the
no-milk group was clearly superior in both height and weight to the treatment
group.
2 William Sealy Gosset (1876-1937) is best known by his pen name ‘Student’. As a result of a case of industrial
espionage, Guinness prohibited any of their employees from publishing. Gosset nevertheless published research
papers under the pseudonym Student. Among other things, he discovered the t-test, which is still often referred
to as the Student t.
QUESTION: Despite the non-random allocation, Gosset suggested ways in which the effects
of raw milk or pasteurised milk could be estimated. Can you think of how this might be
done?
Only 24% of the women who were smokers at the time of the initial survey died
during the 20-year follow-up period, whereas 31% of the non-smokers died in
the same period. Does this difference indicate that women who were smokers
fared better than women who were not smokers? Not necessarily.
In Table 1.2 we give a more detailed display of the same data: an age-specific
table, or a table stratified by age (at the start of the study).
The age-specific display indicates that in the youngest and oldest age-groups
there was little difference in terms of risk of death. Few died in the younger
age categories, and most died in the older categories, regardless of whether
they were smokers or not. For women in the middle age categories there was a
consistently greater risk of death among smokers than nonsmokers.
So why did the nonsmokers have a higher risk overall? Because a greater pro-
portion of non-smokers were in the higher age-groups — presumably reflecting
social norms. In this example, smoking is confounded with age. Here it’s not
really (smoking) vs (non-smoking), but rather it’s (smoking & younger) vs (non-smoking & older).
3 Vanderpump et al., Clinical Endocrinology, 1995, 43, p. 55.
We do a little bit of this when we look at regression models in Chapter 8, but in this subject
we do not go into this approach in any depth.
What about other variables?
There are variables that we can’t observe (or perhaps not until later) or choose not to
observe (too expensive, too time-consuming) or variables that we don’t know about or
haven’t even thought about. Some of these may be confounders. (Such variables are some-
times referred to as lurking variables. Note: “to lurk” = to be hidden, to exist unobserved.)
Clearly this is non-random: each digit should occur with relative frequency of
about 10%. It is ‘normal’ to get a preponderance of 7s, but the excess of 8s and
9s is somewhat unusual.
One way to do this is to randomly order the sequence TT. . . TNN. . . N (i.e. 10 Ts
and 10 Ns), avoiding any human choice.
We could put ten white balls and ten black balls in a bag, identical apart from
colour, assign black = treatment and white = no treatment, say; and then select
the balls one at a time from the bag.
It is more efficient to use a computer. For example:
In R, randomly select a sample of size 10 without replacement from s = (1, . . . , 20) using
the function sample():
> s = 1:20 # this is a vector from 1 to 20
> sample(s, size=10) # sample 10 elements without replacement from s
[1] 4 16 3 15 8 7 12 20 14 19
then assign individuals corresponding to such indices to the treatment group.
EXERCISE. Try it.
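A complete allocation might look like the following sketch (the set.seed() value is arbitrary; it just makes the run reproducible):
> set.seed(1)                          # arbitrary seed, for reproducibility
> treat = sample(1:20, size = 10)      # indices allocated to the treatment group
> control = setdiff(1:20, treat)       # the remaining indices form the control group
> sample(rep(c("T", "N"), each = 10))  # alternative: a random ordering of 10 Ts and 10 Ns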
When we use randomisation in a study, we expect that for any variable (observed or unob-
served) the values in the treatment group and the control group will be equivalent, in the
sense that they are likely to be about the same. This would apply to an observed (and possi-
bly important) variable such as blood pressure, or to an unobserved (and likely pointless)
variable like shoe-size. An individual with high blood pressure (or large shoe size) is just
as likely to be in the treatment group as in the control group.
For any variable (observed or unobserved, known or unknown, and whether it is related
to the outcome or not), randomisation neutralises its effect, by ensuring that the values of
the variable in the treatment group and the control group are expected to be the same.
We will hear much more about randomisation when we get to clinical trials, and ran-
domised controlled experiments.
In any epidemiological study, we are concerned with measures of a specified disease out-
come. This measurement may take the form of a count [e.g. number of individuals with
fatal myocardial infarction], a rate [e.g. number of new cases of breast cancer per year] or a
variable specifying the disease outcome [e.g. blood pressure, cell count, lung function, . . . ].
This is called the response variable.
An epidemiological study may be viewed as a disease measurement exercise. A simple
study might aim to estimate a risk of a disease outcome (in a particular group over a spec-
ified period of time) [e.g. risk of heart failure in 60yo males in the next ten years]. A more
complicated study might aim to compare risks of a disease outcome in different groups,
with the goal of prediction, explanation or evaluation [e.g. comparing the risk of compli-
cation for two surgical treatment methods]; or to compare measures of a disease outcome
in different groups, with the goal of determining a more effective treatment [e.g. mean
cholesterol levels in groups given a drug or a placebo].
Variables in epidemiological studies.
The response variable is the measurement of the disease outcome we are interested in.
An explanatory variable is a variable that may be related to the response variable: i.e. a
variable that may affect the outcome. These are often individual characteristics (sometimes
called covariates: variables such as age, gender, blood-pressure, cholesterol level, smoking
status, education level, . . . ).
In most cases, the fundamental concern of a study is to relate some exposure E to a disease
outcome D. In that case, the response variable is an indicator of disease outcome and the
primary explanatory variable is an indicator of exposure.
In broad terms, there are two types of epidemiological studies:
DEFINITION 1.1.1.
1. An experimental study is a study in which the investigator assigns the exposure
(intervention, treatment) to some of the individuals in the study with the objec-
tive of comparing the results for the exposed and unexposed individuals.
2. An observational study is a study in which the investigator selects individuals,
some of whom have had the exposure being studied, and others not, and the
outcome is observed; or individuals are selected some of whom have had the
outcome and others not, and their exposure is observed.
some further symptom, or spread of a cancer) that becomes the ‘disease outcome’ studied.
The aim is to evaluate the effect of the treatment on the disease outcome.
In most trials, treatments are assigned by randomisation so as to produce comparability be-
tween the cohorts with respect to any factors (seen or unseen) that might affect the disease
outcome.
There were 22 071 participants5: 11 037 were assigned at random to receive as-
pirin and 11 034 to receive placebo. The results were as follows:
There appears to be evidence here that aspirin reduces the risk of myocardial
infarction. But could it just be due to chance?
             D     D′     n     risk
Zidovudine   1     38     39    0.026
Placebo      7     31     38    0.184
The data indicate that the risk of getting an opportunistic infection during the
follow-up period was low among those who received early Zidovudine treat-
ment, and higher among those who received a placebo treatment. [But, you
should be asking, is this just due to chance, or is there a real Zidovudine effect
here?]
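As a quick check on the risk column (using the counts in the table above):
> d = c(1, 7); n = c(39, 38)   # infections and group sizes (Zidovudine, Placebo)
> round(d / n, 3)              # estimated risks, matching the table
[1] 0.026 0.184
> (7/38) / (1/39)              # risk in the placebo group relative to Zidovudine
[1] 7.184211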
4 “Final report of the Aspirin Component of the Ongoing Physicians’ Health Study”, New England Journal of
Medicine, 1989.
5 Questionnaires were sent to 261 248 male physicians in the US; 112 528 responded, and 59 285 were willing to
participate. Of these, 33 223 were eligible. There followed a run-in period, after which 11 152 changed their minds
or reported a reason for exclusion. This left 22 071, who were randomly assigned to the treatments.
1.2.1 Clinical trial (medical experimental study)

DEFINITION 1.2.1. A clinical trial is defined as “any research study that prospectively
assigns human participants or groups of humans to one or more health-related inter-
ventions to evaluate the effects on health outcomes” (WHO/ICMJE 2008 definition).
Thus a clinical trial is essentially another name for a medical experimental study. Note that
trial is used instead of experiment, perhaps as a euphemism: people may prefer to take
part in a trial rather than an experiment.
There are several types of clinical trials:
• Treatment trials: test experimental treatments, new combinations of medication, or
new approaches to surgery or radiation therapy.
• Prevention trials: look for ways to prevent disease in people who are disease-free, or
to prevent a disease from returning. Prevention trials may include medicines, vac-
cines, vitamins, minerals, or lifestyle changes.
• Diagnostic trials: are done to find better tests or procedures for diagnosing a particu-
lar disease or condition.
• Screening trials: test the best way to detect certain diseases or health conditions.
• Supportive care trials: explore ways to improve comfort and quality of life for people
with an illness.
Treatment trials are the most common form of clinical trials. Their intention is to study
different treatments for patients who already have some disease. Consider the comparison
of treatment A and treatment B. Often treatment A may be an experimental medication,
and treatment B may be a placebo, i.e. a non-medication, disguised to look the same as the
experimental medication, so that the patient does not know which treatment they receive.
Subjects must be diagnosed as having the disease in question and be admitted to the study
soon enough to permit treatment assignment. Subjects whose illness is too mild or too
severe are usually excluded.
Treatment assignment is designed to minimize variation of extraneous factors that might
affect the comparison. So, the treatment groups should be comparable with respect to
some baseline characteristics. A random assignment scheme is the best way to achieve
these objectives. This means that for a patient who fits the criteria for admission to the trial,
the patient is assigned treatment A or B at random. Since assignment is random, various
ethical issues are involved. For example, the patient must agree to the randomisation: the
possibility that they receive either of the possible treatments.
The gold standard for a clinical trial is a randomised controlled trial (RCT); that is, an
experiment with a treatment group and a control group for which individuals are assigned
randomly to the treatment and non-treatment group.
The Physicians’ Health Study and the Zidovudine trial described above are examples of
randomised controlled trials.
There are different sorts of clinical trials, depending on what stage the experimental process is
at. These stages are called phases.
• Phase 1: This is the first trial of the drug on humans (up to this point, research will
usually have been conducted on animals). Healthy volunteers are given the drug and
observed by the trial team over the period of the trial. The aim is to find out whether
it’s safe (and at what dose), whether there are side effects, and how it’s best taken (as
tablets, liquid, or injection for instance).
• Phase 2: If the drug passes muster in phase 1, it’s next given to people who actually
have the condition for which the drug was developed. The aim of a phase 2 trial is to
see what effect the drug has — whether it improves the condition and by how much,
and again, whether there are any side effects.
• Phase 3: Phase 3 trials are similar to a phase 2 trial except the number of people given
the drug is much larger. Again, researchers are looking at safety and effectiveness.
Phase 3 is the last stage before the drug is then licensed for use by the general public.
• Phase 4: In this phase, the drug is compared to other, existing, drugs. The idea of a
phase 4 trial is to get more qualitative information – determining where exactly the
drug is most useful, and for what sort of patient. The participants in a phase 4 trial
are people in the community who have the condition.
The name clinical trial suggests an association with a clinic (a facility, often attached to a hospital
or medical school, that is devoted to the diagnosis and care of outpatients). This is the
origin of the term, and in most cases the association holds. But it should be noted that,
in general, the treatment need not be applied in a clinic.
1.2.2 Field trial

In a field trial, generally, the subjects do not have the disease. The intention of the treat-
ment/intervention is to prevent the disease. Field trials usually require a great number
of subjects so that there will be a sufficient number of “cases” (outcome events) for com-
parison. As the subjects are not patients, they need to be treated or visited in the “field”
(at work, home, school) or some centres sent up for the purpose. So, field trials are very
expensive and are usually used for the study of extremely common or extremely serious
diseases.
Examples include:
• Salk vaccine trial.
• MRFIT (Multiple Risk Factor Intervention Trial)
As in clinical trials, a random assignment scheme is the ideal choice.
1.2.3 Community intervention trial

In this case the treatment/intervention is applied to a whole community rather than indi-
viduals.
Examples include:
• Water fluoridation to prevent dental caries.
• Fast-response emergency resuscitation program.
• Education program conducted using mass media.
1.2.4 Experiments and experimental principles

The principles of experimental design apply throughout the sciences. In this section we
point out some of the general principles and terminology, and indicate how experiments
are used in medical studies.
DEFINITION 1.2.2.
1. An experiment is one where we impose a procedure (called a treatment, inter-
vention, exposure) on particular individuals (called the experimental units or
subjects) and observe the effect on a variable (called the response variable).
2. The response variable relates to the outcome of the experiment, which may be
negative (recurrence, death, increased cancer-cell count, . . . ) or positive (cure,
symptom alleviation, reduced cholesterol level, . . . ).
3. In the case of an experiment, an explanatory variable is something which may
affect the outcome, and which is known at the time of treatment. The primary
explanatory variable is the treatment variable; other explanatory variables may
be potential confounders.
In a designed experiment, the experimenter determines which subjects receive which treat-
ment. The experimenter must adhere to the principles of design of experiments to achieve
validity and precision.
Control group
The word “control” is often misunderstood in the context of medical testing: when people
hear of a “controlled experiment”, they tend to assume that, somehow, all the problems
have been fixed . . . and under control. Not so. What it means is that the experiment in-
cludes a control group who do not receive the treatment, as well as a treatment group who
do. Usually the control group is given a placebo, i.e. a pseudo-treatment that looks like the
real thing but which is known to be neutral.
In a designed study, the control group forms a baseline for comparison, to detect the effect
of any other treatments. Comparison is the key to identifying the effects on the response
variable. If we have only one treatment group, then there is no way to identify what is and
what is not the effect.
Unfortunately, the design of the experiment was defective. There was no control
group. A better-designed experiment, done several years later, divided ulcer
patients into two groups. One group was treated by gastric freezing as before,
while the other group received a placebo treatment in which the solution in the
balloon was at body temperature rather than freezing. The results of this and
other designed experiments showed that gastric freezing had no real effect, and
its use was abandoned.
Confounding variables, lurking variables
Suppose the standard treatment is given by Doctor A and the experimental treatment is
given by Doctor B. Then, we say the treatment is confounded with the treating doctors,
because we cannot tell whether the effect on the response variable is due to the treatment
or the skill or the manner of the doctor.
Confounding occurs when observed effects can be explained by more than one explana-
tory variable, and the effects of the variables cannot be separated. The reason for most
experimental studies is the investigation of a treatment. Usually therefore, the primary
explanatory variable is the treatment variable (whether or not the individual receives the
treatment) and our concern is whether any other variable might be confounded with the
treatment variable.
DEFINITION 1.2.3.
1. A confounding variable is a variable that is a possible cause of the disease out-
come, which is related to the exposure (treatment, intervention).
2. A lurking variable is a confounding variable, but one which is unknown and un-
observed. It is thus a particular (and particularly dangerous) type of confounding
variable.
6 Ronald Aylmer Fisher (1890–1962) was a British statistician and geneticist. He was described by Anders Hald
as “a genius who almost single-handedly created the foundations for modern statistical science,” and Richard
Dawkins described him as “the greatest of Darwin’s successors”. He spent the last years of his life working at the
CSIRO in Adelaide.
The lady in question claimed to be able to tell whether the tea or the milk was
added first to a cup. Fisher gave her eight cups, four of each variety, in random
order. The story has it that the lady in question was Muriel Bristol, and she got
them all right. The chance of someone who just guesses getting all eight correct
is only 1 in 70.
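The 1-in-70 figure is a simple counting argument, easily verified in R:
> choose(8, 4)       # number of ways to choose which 4 of the 8 cups had milk first
[1] 70
> 1 / choose(8, 4)   # probability that pure guessing gets all eight right
[1] 0.01428571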
Blocking = Stratification
When we know certain factors have an effect on the response variable, we should ensure
these factors even out in the different treatment groups, instead of trusting it to randomi-
sation. This is done by blocking. A block is a group of similar study units. A block is the
generic term, applying to a wider range of experiments than we are concerned with in this
subject. In an agricultural experiment for example, a block may comprise a set of plots of
similar fertility. Here the units are the plots and we would be concerned with the yield
from each plot. Treatments might be fertilisers. In an engineering experiment, a block may
comprise samples of material obtained from one production batch.
In our case, the unit is generally an individual. A block is a collection of similar individuals,
which is equivalent to a stratum.
DEFINITION 1.2.4. A blocked experiment is one where the randomisation of units is
carried out separately within each block.
This reduces the natural variation by making comparison on similar units. Blocking there-
fore has the effect of achieving higher precision.
Blocking is equivalent to matching. Identical twins would be the ultimate blocks! Blocks,
or sets of matched individuals, may be of any size (greater than one). Matching individuals
enables better comparison of treatments.
Replication
Suppose we want to compare two methods of teaching language. Student A is taught by
one method and Student B by the other. We know we cannot rely on comparing the test
scores of the two students, because student A might be brighter or more conscientious.
We need to have replications. This means enough experimental units in each treatment
group so that chance variation can be measured and systematic effect can be seen. The
more replications (the number of experimental units in each treatment group), the more
reliable (precise) the comparison of the treatments. Yes . . . but how many? (Chapter 7).
Blinding
Blinding means concealing from those involved in a trial which treatment each subject receives.
For example, when asking consumers to compare the tastes of different brands of a product,
the identities of the latter should be concealed. Otherwise consumers may tend to prefer
the brand they are familiar with. Similarly, when evaluating the effectiveness of a medical
drug, both the patients and the doctors who administer the drug may be kept in the dark
about the nature of the drug being applied in each case.
Single-blind describes experiments where information is withheld from the participants,
but the experimenter is in full possession of the facts.
In a single-blind experiment, the individual subjects do not know whether they are so-
called “test” subjects or members of the “control” group. Single-blind experimental design
is used where the experimenters either must know the full facts (for example, when com-
paring sham to real surgery) and so the experimenters cannot themselves be blind, or where
it is believed the experimenters cannot introduce further bias and so the experimenters
need not be blind. However, there is a risk that subjects are influenced by interaction with
the researchers — known as the experimenter’s bias: the experimenter has an expectation
of what the outcome should be, and may consciously or subconsciously influence the be-
havior of the subject, and their responses, in particular.
Double-blind describes an especially stringent way of conducting an experiment, in an
attempt to eliminate subjective bias on the part of both experimental subjects and the ex-
perimenters. In most cases, double-blind experiments are held to achieve a higher standard
of scientific rigor.
In a double-blind experiment, neither the individuals nor the researchers know who be-
longs to the control group and the experimental group. Only after all the data have been
recorded (and in some cases, analyzed) do the researchers learn which individuals are
which. Performing an experiment in double-blind fashion is a way to lessen the influence
of the prejudices and unintentional cues on the results.
Random assignment of the subject to the experimental or control group is a critical part of
double-blind research design. The key that identifies the subjects and which group they
belonged to is kept by a third party and not given to the researchers until the study is over.
Balance
Balance means each treatment is applied to the same number of study units. This is desir-
able when possible, as it simplifies the analysis and gives the most precise comparison. It
is sometimes defeated by nature, e.g. some patients withdraw from the study.
Summary
1. Control — for validity. Comparison is the key to identifying the effects on a response
variable.
2. Randomisation — for validity. Randomly assign treatments. This neutralizes the
effects of other variables.
3. Replication — for precision. Repeat to get better results. This reduces the influence
of natural variation.
4. Blocking — for precision, and for validity in the presence of confounding. Group
the study units into blocks of similar units. This removes any unwanted source of
variation.
5. Blinding — for validity. To ensure that the expectations of the subject do not influ-
ence the outcome. And, with double-blinding, to ensure that the expectations of the
experimenter do not influence the outcome.
6. Balance — for precision. Have the same number of units in each treatment group if
feasible.
The ‘gold standard’ clinical trial is a randomised controlled trial (RCT), i.e. an experiment
with individuals assigned randomly to a treatment group or a control group. Blinding is
used where possible.
1.3 Observational studies

1.3.1 Cohort study

DEFINITION 1.3.1.
1. A cohort is broadly defined as “any designated group of individuals who are
followed over a period of time.”
2. A cohort study involves measuring the occurrence of disease within one or more
cohorts. (An experiment is a cohort study, but not all cohort studies are experiments.)
Many cohort studies can be expressed as the comparison of two cohorts, which we denote
as exposed (E) and unexposed (E ′ ). As has been mentioned, the “exposure” may cover a
broad range of things: from a drug treatment or immunisation to an attribute like economic
status or the presence of a particular gene. The intention then is to compare disease rates
in the exposed cohort and the unexposed cohort.
The cohort concept is straightforward enough, but there are complications involving who
is eligible to be followed, what should count as an instance of disease, how the incidence
rates or risks are measured and how exposure ought to be defined. (Mostly, we don’t resolve
these complications; we just note that they exist and trust that they are sorted out . . . by others, or
possibly by us, later, when we know some more.)
In a cohort study, exposure is not assigned. The investigator is just an observer. As a result,
causation cannot be inferred as it is not known why the individual came to be exposed.
The strongest conclusion that can be drawn from an observational study is that there is
an association between exposure E and disease outcome D (but it is not known why). In
particular, it cannot be concluded that E causes D. Nevertheless, the intention of a cohort
study is often to address causation, and the terms response and explanatory variables are
used for observational studies as well as experimental studies.
The women were divided into cohorts according to the amount of vitamin A
in their diet, from food or from supplements. The data are given in the table
below.
These data indicate that the prevalence of these defects increased with increas-
ing intake of vitamin A.
But does vitamin A affect the “population of births”? While vitamin A might
be a cause, another possible explanation of this result is that it could enable
embryos with the defect to survive until birth.
                            D (cholera deaths)   n (popln size)    rate
E  (Southwark & Vauxhall)         4093              266,516        1.54%
E′ (Lambeth)                       461              173,748        0.27%
Residents whose water came from the Southwark & Vauxhall Company had a
cholera death rate 5.8 times greater than that of residents whose water came
from the Lambeth Company.
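These rates follow directly from the table (commands only; each result rounds to the figure quoted above):
> rate.sv = 4093 / 266516   # Southwark & Vauxhall: about 0.0154, i.e. 1.54%
> rate.l  = 461 / 173748    # Lambeth: about 0.0027, i.e. 0.27%
> rate.sv / rate.l          # ratio of the two death rates: about 5.8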
Snow saw that circumstance had created conditions like that of an experiment.
In an experiment, individuals who were otherwise alike differ only in whether
they receive the treatment or not. In this case, it seemed that people differed
only by their consumption of pure or impure water ( ◦◦ ).
In an experiment, the investigator assigns the participants to the exposed and unexposed
groups. In a natural experiment, as studies like this have come to be known, the investi-
gator takes advantage of a setting that is like an experiment. It is like an experiment in
that the “assignment” of treatment to individuals is pseudo-random. The “assignment” is
not done by the experimenter, and not by randomisation. It is done by some other pro-
cedure that appears to mimic randomisation, and which is assumed to be equivalent to
randomisation. The validity of any conclusions depends on this assumption.
Note: a ‘natural experiment’ is not an experiment (because the treatment is not imposed
on the subjects), and there must remain some doubt about the causation conclusion.
The researchers recruited 5209 men and women between the ages of 30 and
62 from the town of Framingham, Massachusetts, and began the first round of
extensive physical examinations and lifestyle interviews that they would later
analyse for common patterns related to CVD development. Since 1948, the sub-
jects have continued to return to the study every two years for a detailed med-
ical history, physical examination and laboratory tests. In 1971, the Study en-
rolled a second generation: 5214 of the original participants’ adult children and
their spouses, to participate in similar examinations.
There have been subsequent cohorts recruited in 1994, 2002 and 2003, including
a third generation of participants: grandchildren of the Original Cohort. More
details of the cohorts and the results obtained can be found at
www.framinghamheartstudy.org/.
Closed and open cohorts
A closed cohort is one with a fixed membership. Once it is specified and follow-up begins,
no-one can be added to a closed cohort. The cohort will dwindle as people in the cohort
die, or are lost to follow-up, or develop the disease. The Framingham Heart Study includes
several closed cohorts. We will primarily be concerned with closed cohorts.
An open cohort (or a dynamic cohort) can take on new members as time passes. An exam-
ple of an open cohort is the population of Victoria. Cancer incidence rates in Victoria over
a period of time reflect the rate of cancer occurrence among a changing population.
Residents of Busselton, Western Australia, have been involved in a series of health surveys since 1966.
To date over 16 000 men, women and children of all ages have taken part in the
surveys and have helped contribute to our understanding of many common
diseases and health conditions.
Much of the data comes from cross-sectional studies (see §1.3.4 below), treating
the Busselton community as an open cohort. However, one follow-up study
of the first cross-sectional study was done thirty years on, i.e. a closed cohort
study.
case, the source population is the collection of individuals who would have attended the
specified medical centres if they had the disease in question.
A case-control study is retrospective. The cases and controls are in the present, and we
investigate their past — perhaps using hospital records, or by questioning the patients or
their relatives. A disadvantage of this is that old records or memories may be faulty. An
advantage is that a range of exposures may readily be considered for possible relation to
the specified disease.
A major advantage of a case-control study over a cohort study is that by effectively sam-
pling from the population we save considerably on cost and time.
1.3.3 Comparison of cohort and case-control studies

A cohort study is usually prospective, whereas a case-control study is usually retrospective.
Let’s consider a hypothetical cohort study ( ◦◦ ) corresponding to the above bowel cancer
case-control study. It must be hypothetical, because such a study couldn’t actually be car-
ried out! (We can’t force individuals to smoke.) But let’s pretend it’s 2010, and we know all
the 40-44yo males who are going to be in the 2020 RMH source population. Let’s suppose
there are 12 000 of them.
To obtain the exposure information, i.e. to find out how many of them are smokers, we
would need to question (by questionnaire or interview or . . . ) all 12 000 of them. And
perhaps keep track of them in the intervening time. Any other exposures that we might
want to examine would have to be specified in advance (i.e. in 2010). We would then
follow these individuals for the next ten years to see how many of them get bowel cancer.
Of course, some of these individuals may get cancer any time in 2010-2020, so that it’s not
a perfect match to the case-control study. Suppose that over the next ten years, 100 of these
individuals are admitted to hospital with bowel cancer, and 50 of these were smokers, i.e.
50% are exposed . . . as for the case group above, as opposed to 25% for the rest of the
population.
Such a study would give stronger evidence that the disease is more common among ex-
posed individuals. But, even if such a procedure were possible, it would be hugely expen-
sive and time-consuming.
QUESTION: Why does the cohort study give stronger evidence?
In a case-control study, subjects are selected on the basis of their disease status: cases have
the disease of interest. This means that we cannot directly estimate the risks of disease for the
exposed and unexposed groups. However, the relative risk can be estimated, provided the
disease prevalence is known. This is explained in more detail in Chapter 3.
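What a case-control study can estimate directly is the odds ratio (introduced in Chapter 3). A sketch, using the hypothetical bowel-cancer figures above (50% of cases exposed, versus 25% in the rest of the population):
> p.cases    = 0.50                  # proportion of cases who were smokers
> p.controls = 0.25                  # proportion of controls who were smokers
> odds = function(p) p / (1 - p)     # convert a proportion to odds
> odds(p.cases) / odds(p.controls)   # the odds ratio
[1] 3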
The primary difference between a cohort study and a case-control study is that a cohort
study involves complete enumeration of the cohort (sub-population), whereas a case-control
study is based on a sample from the relevant sub-population.
QUESTION: Why is a cohort study good for studying many diseases? . . . and a case-control
study good for studying many exposures?
1.3.4 Cross-sectional studies

The study types described above are longitudinal studies, i.e. the information obtained
pertains to more than one point in time. Implicit in a longitudinal study is the premise that
the causal action of an exposure comes before the development of the disease. All cohort
studies and most case-control studies rely on data in which exposure information refers to
an earlier time than that of disease occurrence, making the study longitudinal.
Cross-sectional studies are occasionally used in epidemiology. A cross-sectional study in
epidemiology amounts to a survey of a defined population. As a consequence, all of the
information relates to the same point in time; they are basically snapshots of the population
with respect to disease and/or exposure at a specific point of time. A population survey,
such as the census, not only attempts to enumerate the population but also to assess the
prevalence of various characteristics. Surveys are conducted frequently to sample opinions;
they can also be used to measure disease prevalence and/or possible exposures.
A cross-sectional study cannot measure disease incidence (the rate at which the disease
outcome D occurs), since this requires information across a time period. But cross-sectional
studies can be used to assess prevalence (the proportion of the population with D).
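Estimating prevalence from a cross-sectional sample is just a proportion; a minimal sketch with invented counts:
> n.sample  = 2000       # people surveyed at one point in time
> n.disease = 150        # number found to have the disease D
> n.disease / n.sample   # estimated prevalence
[1] 0.075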
Sometimes cross-sectional data is used as a reasonable proxy for longitudinal data, in the
absence of such information. If no record exists of past data, present data might be used as
a substitute. Current accurate data might be better than hazy recall of the past.
Surveys
Typically, a survey consists of a sample taken from a population of interest. Data are col-
lected from each person in the sample, such as the exposure status and disease status. As
the data are collected at a point in time, it is called a cross-sectional study. From a cross-
sectional study, it is possible to estimate the prevalence of disease and of exposure. It is not
suitable for investigating a causal relation, as it does not have a time dimension built into
it. However, an association might be found and further research suggested.
For validity, the sample needs to be representative of the population and to have been
drawn in an unbiased manner. Random sampling is usually used.
The aim of a survey is to obtain a representative sample from a specified population, which
enables estimation of the population characteristics.
A census, or complete enumeration of the population, is often not feasible or desirable. It
is likely to be massively expensive. A survey has the advantages of reduced cost, greater
speed, scope and accuracy. Surveys are used for planning, identifying problems, market
research and quality control. They can be both descriptive and analytical. Survey variables
can be qualitative or quantitative. Scales can be nominal, ordinal or numerical.
A survey is not a trivial matter to get right! Planning and executing a survey involves,
among other things:
• specifying questions such that all, and only, the relevant data are collected, fairly and
accurately;
• defining the population (so that the actual target population corresponds to the one
we wish to study);
• identifying the sampling units (usually individuals, in our applications) and the sam-
pling frame, which is a list of sampling units in the target population.
• determining the degree of precision required (this will affect the sampling procedure);
and then minimising bias, cost and time scale problems in the sampling procedure.
• choosing a suitable measurement technique;
• taking a pilot sample;
• administration and editing, processing and analysing the data.
Sampling error
DEFINITION 1.3.2. Sampling error is the random variation in the results due to the
elements in the sample being selected at random.
This can be controlled and estimated provided random sampling methods are used in se-
lecting the sample.
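Sampling error is easy to visualise by simulation. In this sketch the prevalence and sample size are assumptions chosen purely for illustration:
> p = 0.3                                           # assumed population prevalence
> phat = rbinom(1000, size = 100, prob = p) / 100   # sample proportions from 1000 surveys of 100 people each
> sd(phat)                                          # typical sampling error: about 0.046 = sqrt(0.3*0.7/100)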
Non-sampling errors
Non-sampling errors include:
• Selection bias, which occurs when the true selection probabilities differ from those
assumed in calculating the results;
• Coverage problems: inclusion of individuals in the sample from outside the popu-
lation; or exclusion of individuals in the population (perhaps because the sampling
frame is incomplete);
• Loaded, ambiguous, inaccurate or poorly-posed questions;
• Measurement error: e.g. when respondents misunderstand a question, or find it dif-
ficult to answer (due to language or conceptual problems);
• Processing errors: e.g. mistakes in data coding;
• Non-response: failure to obtain data from sampled individuals.
Non-sampling errors are reduced by careful attention to the construction of the question-
naire and fieldwork. The latter may include callbacks, rewards and incentives, trained
interviewers and data checks.
A major problem with many surveys is non-response. Because we don’t know anything
about the non-respondents, there may be a bias, which we know very little about. The
only way to guarantee control of this bias is to increase the response rate. Experience indi-
cates that the response rate should be at least 50%, but serious biases can still occur with a
response rate of 70%. It depends!
A famous example is the Literary Digest poll of the 1936 US presidential election, which predicted
a comfortable win for the Republican, Alf Landon. In the election, Landon won only in Vermont
and Maine; Franklin Delano Roosevelt carried the other 46 states; Landon’s electoral vote total of eight is a tie
for the record low for a major-party nominee. The magazine was completely
discredited because of the poll and was soon discontinued. The polling tech-
niques employed by the magazine were to blame. Although it had polled 10
million individuals (only about 2.4 million of these individuals responded, an
astronomical number for any survey), it had surveyed its own readers, reg-
istered automobile owners and telephone users, and other individuals whose
names had been recorded on lists or memberships. All of these groups con-
tain an over-representation of conservative Republican voters. Literary Digest
readers were wealthy enough to afford a journal subscription and conserva-
tively inclined enough to choose the Literary Digest. Further, in those days,
relatively few people had cars or phones — and so, again, the working classes,
who favoured the Democrats, were under-represented.
[Diagram: types of evidence ranked by “value”, in the order given: clinical trials; community trials; cross-sectional studies; ecological studies (demographic data); animal experiments; in vitro experiments; anecdotal evidence.]
[Diagram: a time line of study designs — a case-control study looks backwards in time (retrospective), a cohort study follows forwards in time (prospective), and a cross-sectional study observes a single point in time.]
In broad terms, there are two types of epidemiological studies:
• Experimental studies — the investigator assigns the exposure (intervention, treat-
ment) to some of the individuals in the study with the objective of comparing the
results for the exposed and unexposed individuals.
• Observational studies — the investigator selects individuals, some of whom have
had the exposure being studied, and others not, and the outcome is observed [cohort];
or individuals are selected some of whom have had the outcome and others not, and
their exposure is observed [case-control]. Essentially though, an observational study
is an epidemiological study that is not an experiment!
In an experimental study, the investigator would find a group of eligible patients,
and randomly assign them to the two treatment groups. The patients would be
given the assigned treatments and then followed for a number of years, to ob-
serve their survival times.
Question: Does a new drug lengthen survival times of cancer patients, compared with no
drug treatment?
Proposal: Ask various doctors and hospitals for records of all cancer patients on the new
drug, and compare survival times with those who got no drug. (Suppose the result shows
survival is longer for the new drug.)
Skeptic: But that’s just looking at what’s already happened. There could be lots of reasons
why it happened.
Proposal: But still, survival is longer for people using the new drug.
Skeptic: That shows the drug is associated with survival, not that it is the cause of survival.
Cause: One variable makes the other change: e.g. changing variable 1
makes variable 2 change.
Skeptic: For instance, what if doctors tend to give the new drug to only the patients they
think are likely to improve anyway, and assume nothing will help the sicker ones? Couldn’t
that explain the findings?
Proposal: So we should be the ones who decide who gets which drug, not the doctors. For
instance, we’ll tell doctors to give the drug to patients with surname starting with A-M and
not to N-Z.
Skeptic: That’s better, but still not perfect. For instance, many Vietnamese are named
Nguyen, and many Koreans are named Kim, and this could put most of them in the same
drug group. If survival is related to ethnicity our results could still be biased.
Proposal: Then let’s assign patients to drug groups randomly.
Replication: Using more than one subject. This evens out the effects of
chance.
Skeptic: There could still be problems, though, since the patients will know if they got the
drug, and it could have psychological effects.
Proposal: Then let’s make sure everyone thinks they could be getting the drug.
Blind study: the subjects don’t know which treatment they got.
Skeptic: But still, won’t the doctors know who gets the new drug? And they may treat
those patients more aggressively.
Proposal: Then we should make sure the doctors don’t know either.
Double-blind study: Neither the subjects nor those providing the treat-
ment know which treatment was given.
Skeptic: Much better. The results from such a study will surely be more valid. There are
still lots of practical and ethical problems to be worked out, but these are some of the main
principles of good study design. And a well designed experiment is the only sure statistical
way to show cause.
It is useful for you as a statistician to play the skeptic (or the “devil’s advocate”). Try to
think of “What if . . . ?” possibilities, and other possible causes or explanations for the
outcomes, and endeavour to overcome them. This will put your conclusion on a sounder
footing. Of course, it may still not be enough to cover all bases. But at least you should
make it difficult for others to criticise your experiment and therefore any conclusions that
follow from it.
QUESTION: A pharmaceutical company wants to trial a new drug for a particular disease.
Set up a clinical trial following the principles of design, based on 1200 volunteers. 600 of
these volunteers are females and the rest males. Which stage of the clinical trial could this
be?
1.5 Causality in epidemiology
The effect of an exposure E on a disease outcome D is assessed over a population: for example,
an increased risk of death or, in assessing recovery, a decreased level of a disease indicator.
Over a population, these effects would be measured by the probability or the odds of the
outcome, or by the mean level of the disease indicator.
A lot of what we do in this subject is about estimating these effects (of E on D).
Hill’s Criteria of causal association
Bradford Hill proposed the following criteria for an association to be causal:
1. Strength of association (A stronger association suggests causation.)
2. Consistency (Replication of the findings in different studies.)
3. Temporality (Cause should precede effect.)
4. Biological gradient (Dose-response relationship) (More of the causal factor should pro-
duce more effect.)
5. Plausibility (Does the association make sense biologically?)
6. Coherence (Does the association fit known scientific facts?)
7. Experiment (Can the association be shown experimentally?)
8. Analogy (Are there analogous associations?)
With the possible exception of temporality, none of Hill's criteria is absolute for estab-
lishing a causal relation, as Hill himself recognized: he argued that none of his criteria is
essential.
Counterfactual Model (The unattainable ideal)
When we are interested in measuring the effect of a particular factor, E, we measure the
observed “effect” in a population who are exposed to E; and compare this to the “effect”
which would have been observed, if the same population had not been exposed to E, all
other conditions remaining identical. The comparison of these two effect measures is the
“effect” of the factor E that we are interested in.
However, the counterfactual effect is unobservable! We therefore seek to approximate this
ideal model as best we can. How? By considering two 'equivalent' populations (or as close
as we can get), one of which gets E and the other does not. The experimental studies we
have considered attempt to achieve this using randomisation (or stratification/matching).
The clinical trials considered in the previous section play a fundamental role in developing
medical therapy, especially new medications. The tools of randomisation and blinding
actually allow proof of a causal connection by statistical means. This is one of the major
reasons why statistical methods are currently central in medical research.
It is a standard statistical warning that “relationship does not imply causation”. This is
quite true. But, possibly more importantly, in a well-designed experiment, relationship
does imply causation!
Relationships and causation
A positive relationship between A and B means that if you have A then you are more likely
to have B, i.e. you have an increased risk of B; and if you have B then you are more likely
to have A. There is no causation here: this simply describes an association between
factors (attributes, events). We represent this as A –(+)– B.
A negative relationship between A and C means that if you have A then you are less likely
to have C, i.e. you have a decreased risk of C; and if you have C then you are less likely to
have A. We represent this as A –(−)– C.
If these two associations apply, you should expect that B –(−)– C. And that's the way it
is. However, it should be noted that we are really talking about fairly strong associations
(positive or negative)7.
A two-factor relationship diagram is not very interesting. You’ve seen them; both of them.
However, a three-factor relationship diagram is a bit more interesting.
There are only four possibilities for the pattern of signs on the three edges (A–C, C–B, A–B) of the triangle:
(1) A–C: +, C–B: +, A–B: +;
(2) A–C: −, C–B: −, A–B: +;
(3) A–C: −, C–B: +, A–B: −;
(4) A–C: +, C–B: −, A–B: −;
because we can’t have an odd number of negative relationships in the triangle: two nega-
tives make a positive. Think about what would happen if in diagram (1), C was changed
to C ′ .
How does this help? A three-factor relationship diagram is useful in showing the effect of
a confounding variable. Consider the women smokers of Whickham example (page 8). A
relationship diagram for this case is shown below:
[Relationship diagram: triangle with C (old-age), E (smoker) and D (death); the C–D edge is labelled +.]
Clearly, there is a positive relation between old age and death. So, if there is a strong nega-
tive relation between old-age and smoking in this population — which is observed, then it
follows that there is a negative relation between smoking and death . . . in this population.
[Relationship diagram: E (smoker) –(−)– C (old-age) –(+)– D (death), so that E (smoker) –(−)– D (death).]
7 In terms of correlation coefficient, which we will consider later (Chapters 2 & 8), this means |r| > 0.7
causation: E →(+) D; if an individual has E then there is an increased chance of D.
association: E –(+)– D; there is an observed association between E & D (but we don't know why).
If there is an observed association between A and B, this does not mean there is causation.
The association may be because:
• A may cause B; [causation]
• B may cause A; [reverse causation]
• a third factor C may cause both A and B; [common cause]
• A and B may influence each other in some kind of reinforcing relationship; [bidirec-
tional causation]
• A and B just happen to be associated; [association]
• . . . or some combination of the above.
EXAMPLE 1.5.1: Research showed that older people who walk slowly are much
likelier to die in the near future. The study, published online in the British Medical
Journal, divided 3200 men and women over 65 into the third who walked slowest,
the middle third and the briskest third. During the next five years, those in the
slowest third were 1.4 times likelier to die from any cause, compared with those
who walked faster. Slow coaches were 2.9 times likelier to die from heart-related
causes. [BMJ 2009 (Dumurgier et al.)]
(possible common-cause?)
EXAMPLE 1.5.2: Among 1700 men and women followed for about 10 years,
those rated happiest were less likely to develop heart disease than people who
were down in the dumps. During the study, about 8 per cent of the group had a
problem such as heart attack, indicating they had coronary heart disease. Peo-
ple with a positive outlook had about 75% the risk of developing heart disease
compared to the others. [EurHeartJ 2010 (Davidson et al.)]
[Diagrams: C (old-age) with causal arrows to E (smoker) and D (death). Left: both arrows are +, inducing an observed + association between E and D. Right: the C–E arrow is − and the C–D arrow is +, inducing an observed − association between E and D.]
EXAMPLE 1.5.3: In examining the relation between low physical activity and
heart problems, obesity is not a confounding factor, since it is part of the causal
link: low physical activity causes obesity (and possibly vice versa) and obesity
causes heart problems.
An unobserved or unknown factor may act as a confounding variable too. An unobserved
confounding variable is sometimes called a lurking variable.
If the confounding factor C is positively related to E, this is still a problem, because
it exaggerates the relationship between E and D, so that the data would show a falsely
strong relationship between E and D.
[Diagrams: C (smoker) with causal arrows, both labelled +, to E (factory worker) and to D (lung cancer); the observed association between E and D is thereby exaggerated (++ rather than +).]
Problem Set 1
1.1 A 3-year study was conducted to look at the effect of oral contraceptive (OC) use on heart
disease in women 40–44 years of age. It was found that among 5000 OC users at baseline (i.e. the
start of the study), 15 women developed a myocardial infarction (MI) during the 3-year period,
while among 10 000 non-users at baseline, 10 developed an MI over the 3-year period.
i. Is this an experiment or an observational study?
ii. What are the exposure and the disease outcome?
iii. Is this a prospective study, retrospective study or a cross-sectional study?
iv. What are the response and explanatory variables?
v. All the women in the study are aged 40–44. Explain why this was done.
vi. How would you present the results?
1.2 The effect of exercise on the amount of lactic acid in the blood was examined in a study. Eight
men and seven women who were attending a conference participated in the study. Blood
lactate levels were measured before and after playing a set of tennis, and shown below.
player M1 M2 M3 M4 M5 M6 M7 M8 W1 W2 W3 W4 W5 W6 W7
Before 13 20 17 13 13 16 15 16 11 16 13 18 14 11 13
After 18 37 40 35 30 20 33 19 21 26 19 21 14 31 20
(a) What is the research question?
(b) Is this a designed experiment or an observational study?
(c) What is the response variable? What are the treatments?
(d) Upon further investigation, we find that nine of the sample are 20–29 years old, while the
other six are 40–49 years old. What is the potential problem with the study?
(e) What is a confounding variable? Can you think of any potential confounding variables
in this case?
1.3 Identify the type of observational study used in each of the following studies (cross-sectional,
retrospective, prospective):
(a) Medical Research. A researcher from the Melbourne Medical School obtains data about
head injuries by examining hospital records from the past five years.
(b) Psychology of Trauma. A researcher plans to obtain data by following, for ten years in the
first instance, siblings of children who died in road accidents.
(c) Flu Prevalence. The health authority obtains current flu data by polling 5000 people this
month.
1.4 A study that claimed to show that meditation lowers anxiety proceeded as follows. The researcher
interviewed the subjects and rated their level of anxiety. Then the subjects were randomly
assigned to two groups. The researcher taught one group how to meditate and they meditated
daily for one month. The other group was simply encouraged to relax more. At the end of
the month, the researcher interviewed all the subjects again and rated their anxiety level. The
meditation group were found to have less anxiety.
(a) What are the experimental units? What are the response variable and the explanatory
variable?
(b) Is this an experimental study or an observational study?
(c) Is this a blind study? What is the reason for designing a blind study?
(d) It was found that the control group had 70% men and the meditation group had 75%
women. Is this a problem? Explain.
1.5 A study is to be conducted to evaluate the effect of a drug on brain function. The evaluation
consisted of measuring the response of a particular part of the brain using an MRI scan. The
drug is prescribed in doses of 1, 2 and 5 milligrams. Funding allows only 24 observations to be
taken in the current study.
In a meeting to decide the design of the study, the following suggestions are made concerning
the conduct of the experiment. For each of the suggestions say whether or not you think it is
appropriate giving a reason for your answer.
(A) Amy suggests that a placebo should be used in addition to the three doses of the drug.
What is a placebo and why might its use be desirable?
(B) Ben says that the study should be conducted as a double-blind study. Explain what this
means, and why it might be desirable.
(C) Claire says that she is willing to be “the subject” for the study (i.e. to take different doses
of the drug and to have her response measured as often as is needed). Give one point in
favour of, and one point against this proposal.
(D) Don suggests that it would be better to have 24 subjects, and to allocate them at random
to the different drug doses. Give a reason why this design might be better than the one
suggested by Claire, and briefly explain how you would do the randomisation.
(E) Erin claims that it would be better to use 8 subjects, with each subject taking, on separate
occasions, each of the three different doses of the drug. Give one point in favour of, and
one point against this claim, and explain how you would do the required randomisation.
1.6 For the experimental situation described below, identify the experimental units, the explana-
tory variable(s), and the response variable.
Can aspirin help heart attacks? The Physicians’ Health Study, a large medical experiment
involving 22 000 male physicians, attempted to answer this question. One group of 11 000
physicians took an aspirin every second day, while the rest took a placebo. After several years
it was found that the subjects in the aspirin group had significantly fewer heart attacks than
subjects in the placebo group.
1.7 In most cases, data can be viewed as a sample, which has been obtained from some population.
The population might be real, but more often it is hypothetical. Our statistical analysis of the
sample is intended to enable us to draw inferences about this population. In many cases, we
would like the inference to be even broader. For example:
45 first-year psychology students at the University of Melbourne undertake a task and their
times to completion are measured.
This can be regarded as a sample from the population of first-year psychology students at the
University of Melbourne. We may wish to apply our results to all undergraduate students of
the University of Melbourne; maybe all university students; or even all adults.
Answer the above questions (i. and ii.) for each of the following:
(c) 30 patients in a Melbourne geriatric care facility were cared for using a new more physi-
cally active (PA) regime and their bewilderment ratings are recorded.
(d) 24 women with breast cancer requiring surgery at the Metropolitan Hospital in 2004 were
treated with radiation during surgery. Their five-year survival outcomes were observed.
1.8 You plan to conduct an experiment to test the effectiveness of SleepWell, a new drug that is
supposed to reduce insomnia. You plan to use a sample of subjects that are treated with the
drug and another sample of subjects that are given a placebo.
(a) What is ‘blinding’ and how might it be used in this experiment?
(b) Why is it important to use blinding in this experiment?
(c) What is a completely randomised design? How would this be implemented in this ex-
periment?
(d) What is replication, and why is it important? Does it apply to this experiment? If so,
how?
1.9 As part of a study investigating the effect of smoking on infant birthweight a physician exam-
ines the records of 40 nonsmoking mothers, 40 light-smoking, and 40 heavy-smoking mothers.
The mean birthweights (in kg) for the three groups are respectively 3.43, 3.29 and 3.21.
(a) What are the response and explanatory variables?
(b) Is this a designed experiment or an observational study? Explain your choice.
(c) What are the potential confounding variables in this case? Explain how you would elim-
inate the effect of at least some of the variables.
1.10 The cause/correlation diagram below shows the effect of a confounding variable C on the
relation between an intervention X and disease outcome D.
[Diagram: C has an arrow to X labelled − and an arrow to D labelled +; the arrow from X to D is labelled ?.]
What effect does randomisation have on this diagram? Use it to explain how randomisation
neutralises the effect of any possible confounding variable C.
1.11 You plan to conduct an experiment to test the effectiveness of the drug L, a new drug that is
supposed to reduce the progression of Alzheimer’s disease. You plan to use subjects diagnosed
with early stage Alzheimer’s disease; and you and your associates have found forty suitable
subjects who have agreed to take part in your trial.
Write a paragraph outlining the steps that you would follow in running this clinical trial. Men-
tion the following: experiment; placebo; control; randomisation; follow-up; measurements.
Suppose that analysis of the data resulting from this study shows a significant benefit for
patients using L. Would this indicate that the drug is a cause of the benefit? Explain.
1.13 Consider each of the following studies in relation to the question “Does reducing cholesterol
reduce heart-disease risks?” In each case, indicate the type of study involved and discuss
whether the information obtained might help in answering the research question.
[1] A questionnaire about heart disease includes the question asking whether “reducing
cholesterol reduces heart-disease risk”. 85% of the general population, and 90% of medi-
cal practitioners agreed with this statement.
[2] A group of patients with heart problems attending the Royal Melbourne Hospital outpa-
tient clinic is assessed. Each of these patients is matched with another patient of the same
gender, same age, similar BMI, same SES status, but with no heart problem. The choles-
terol level for each of the heart patients is compared with that of the matched individual.
[3] A large number of individuals aged 40–49, with no current heart problems, are selected
from patients attending a large medical clinic, and their cholesterol levels are measured.
The individual is classified as L (low cholesterol) or H (high cholesterol). These individu-
als are followed up for ten years and the proportion who develop heart problems in each
group is compared.
[4] A large number of individuals aged 40–49, with no current heart problems, are selected
from patients attending a large medical clinic, and their cholesterol levels are measured.
These individuals are followed up and the cholesterol levels are measured again after five
years. The individuals are then classified as LL (low cholesterol initially, low cholesterol
after five years), LH (low, high), HL (high, low) and HH (high, high). After ten years, the
proportion of individuals who develop heart problems in each group is compared.
[5] A large number of volunteers with high cholesterol levels are randomly assigned to one
of two diet regimes:
(S) standard but reduced diet, with vitamin supplement;
(L) low-cholesterol diet, with low-dose cholesterol reducing drug.
The individuals are followed for ten years and their cholesterol and heart condition mon-
itored.
1.14 Research showed that older people who walk slowly are much more likely to die in the near
future. A study in the British Medical Journal divided 3200 men and women over 65 into the
third who walked slowest, the middle third and the briskest third. During the next five years,
those in the slowest third were 1.4 times likelier to die from any cause, compared to those who
walked faster. Slow-coaches, i.e. people in the slowest third, were 2.9 times likelier to die from
heart-related causes. (Dumurgier et al., BMJ 2009)
(a) What sort of study is this?
(b) On the basis of this, Mrs Green has been encouraging her mother to walk faster. Is this a
good idea? Explain. Comment on ‘cause’ in relation to the finding of this study.
Chapter 2
Exploratory data analysis
2.1 Introduction
Data are the raw material of any empirical science, whether it be agriculture, biology, en-
gineering, psychology, economics or medicine. Data usually consist of measurements or
scores derived from experiment or observation: for example, yields of a crop on a num-
ber of experimental plots, cell counts in biological specimens, strength measurements on
batches of concrete, scores obtained by children on a spatial ability test, monthly measure-
ments of inflation and unemployment, or patient assessment of new medical treatments.
Data can be obtained from:
• experiments;
• observational studies;
• polls and surveys;
• official records, government reports or scientific papers.
A data set is rather like raw mineral ore, which must be treated and refined in order to
extract the useful minerals. Most raw data come in the form of long lists of numbers or
codings, which must likewise be treated and refined to extract the useful information. The
methods of treatment and refinement of mineral ores are chemical; those required for data
are statistical.
Data analysis is the simplifying, reducing and refining of data. It is the procedure rather
than the result.
Data analysis achieves a number of things:
• discovering the important features of the data (exploration)
• improving the understanding of the data (clarification)
• improving communicability of the information in the data (presentation)
• facilitating statistical inference (validation)
1. Position along a common scale;
2. Position along identical, non-aligned scales;
3. Length;
4. Angle, slope;
5. Area;
6. Volume;
7. Colour hue, colour saturation, density.
What does this mean, exactly? It is saying the human eye/brain system is best at judging
the differences between quantities lined up along a common scale, and poor at distinguish-
ing quantities represented proportionally to (a two dimensional representation of) volume,
for example. While the details of this ordering are not obvious, the order does not seem
contentious and conforms to our common experience. The property identified as best is
exploited in many of the standard forms: histograms, dotplots, scatter plots, time series
graphs, boxplots, all line up quantities to be compared along a common linear scale.
This leads to the basic principle:
Encode data on a graph so that the visual decoding involves tasks as high as
possible in the ordering.
Implementation of this simple idea corrects many basic errors in graphs. For example, pie
charts require the viewer to compare angles, which we are rather bad at. Using 3D plots
generally takes one away from tasks higher up in the hierarchy, and requires assessments
of volume, for example: a bad idea.
Quantitative statement
A quantitative statement is often the most apparent (and therefore important) product of
data analysis and/or statistical inference. It is important therefore that any quantitative
statement intended to represent a data set should be accurate and clear. Unfortunately this
is often not the case.
A quantitative statement derived from a set of data may be junk because
• the data set itself is junk;
• the data analysis is incorrect or inappropriate;
• the quantitative statement is distorted; e.g. selectively abbreviated or added to.
The media are an abundant supply of such junk.
Data analysis and statistical inference
Data analysis comes before statistical inference, historically and practically. Data analysis
has been compared to detective work: finding evidence and investigating it. To accept
all appearances as conclusive would be wrong, as some indications are accidental and mis-
leading; but to fail to investigate all the clues because some, or even most, are only accidents
would be just as wrong. This is equally true in crime detection and in data analysis.
It is then the problem of statistical inference to sort out the real indications from the red her-
rings. To carry the crime detection analogy further, we might think of statistical inference
as the courts: unless the detectives turn up some evidence, there will be no case to try.
It is worthwhile to note the difference between the two most important ways data are pro-
duced:
1. observational study, and
2. experimental study.
In deriving any scientific law, the observational study always comes first (often indicating
the possible form of the scientific law); then a carefully planned and designed experiment
is required. Exploratory data analysis is an essential tool in investigating data from an
observational study. The same tools are also relevant in examining data from controlled
experiments.
The computer is a very useful piece of equipment in data analysis, particularly for large
data sets. However, the computer is not a data analyst (and even less is it a statistician).
Data analysis and statistical inference involves three steps:
1. selecting the appropriate technique
2. obtaining results by applying that technique
3. interpreting the results
The computer is very useful in the second of these steps, but of little use for the other
two. The uncritical use of package programs for data analysis or statistical inference is
fraught with danger. One cannot leave the selection of the technique or the interpretation
of the results to the computer.
2.2 Tables and diagrams
Both tables and diagrams are very useful tools in data analysis. They are both essentially
summaries of the data, and both are useful at two stages:
1. preliminary analysis (as an aid to understanding)
2. presentation of results (as an aid to communication)
As a general rule, tables contain more detail, but the message is easier to see in a graph.
2.2.1 Tables
The preparation of a dummy table following these guidelines before the data are collected
is a useful exercise. The principles given above apply equally to the use of tables in the
preliminary analysis stage.
The data in each table are the same, but ordering the table by Rh-positive per-
centage provides much more useful information than the table ordered alpha-
betically by nation. The same applies for a dotchart.
2.2.2 Diagrams
In presentation, diagrams can make results clear and memorable. The message has much
more impact in a diagram than in a table. In exploratory analysis, plotting data in some
way (or in several ways) is a very useful aid to understanding and seeing trends.
Note that plotting implies rounding.
Diagrams are not as good as tables in communicating detailed or complex quantitative
information. In presentation (and in analysis) an important principle is simplicity: If the
diagram becomes too cluttered, the message is lost or confused. You should ask yourself
“What should the diagram be saying?” and “What is it saying?”
One problem with graphs and diagrams is that it is quite easy to create a false impression:
incorrect labelling, inappropriate scaling or incorrect dimensionality are common causes of
misleading diagrams. Thus, some care should be taken with the presentation of a diagram.
Basic principles
• Show the data clearly and fairly. You should include units and scales; axes and grids; labels
and titles, including source.
• Ask: Is the diagram/table clear? . . . to you? . . . to your reader? What point are you
trying to make with the diagram/table? Is it being made?
• Use simplicity in design. Avoid information overload. Avoid folderols (pictograms, fancy
edging, shading, decoration and adornment).
• Keep the decoding simple. Use good alignment on a common scale if at all possible; use
gridlines; consider transposition of axes. Take care with colour.
2.3 Types of variables
We need to distinguish variable types because the different types of variable require differ-
ent methods of treatment. The classification of variables is indicated as follows:
• ordered? no: categorical. yes: continue;
• scaled? no: ordinal. yes: continue;
• rounding error? no: discrete numerical. yes: continuous numerical;
• meaningful zero? no: interval. yes: ratio.
Categorical data (also called qualitative or nominal data) are usually comparatively simple
to handle — there are only a few techniques for dealing with such data.
Examples of categorical variables: gender, colour, race, type of tumour, cause of death.
Numerical data (discrete and continuous) are our main concern: there are a wide variety
of techniques for handling such data.
Examples of discrete numerical variables: family size, number of cases (of some disease,
infection), number of failures; [usually count data];
Examples of continuous numerical variables: weight, height, score, cholesterol level, blood
pressure; [usually measurement data].
Ordinal data are something of a problem: they can be treated as categorical data, but this
loses valuable information. On the other hand, they should not be treated as numerical data
because of the arbitrariness of the unknown scale. Some methods, correct for numerical
data, may give quite misleading results for ordinal data.
Thus an ordinal variable can be treated as a categorical variable (ignoring the ordering);
and a numerical variable can be treated as an ordinal variable (ignoring the scaling) or as a
categorical variable (ignoring the ordering and the scaling).
For example, the variables recorded in a data set might be:
• ID number
• Age (years)
• FEV (litres)
• Height (cm)
• Sex (0 = female, 1 = male)
• Household smoking status (0 = non-smoking, 1 = smoking)
Ordering
Since categorical data have no order, we can choose an appropriate one for presentation,
e.g. decreasing frequency. Ordinal data have a specified order.
Coding
For categorical and ordinal data it is often convenient to code the data to numerical values:
for example, female = 1 and male = 2, or strongly disagree = 1, disagree = 2, neutral =
3, agree = 4 and strongly agree = 5. It must be remembered though that this is just for
convenience: the data cannot be treated as numerical data. [average gender = 1.46?]
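A small sketch in R (the variable name sex and its codes here are illustrative): storing coded categorical data as a factor prevents it being accidentally treated as numerical.
> sex <- factor(c(1, 2, 2, 1, 2), labels = c("female", "male"))
> table(sex)              # a meaningful summary of a categorical variable
> mean(as.numeric(sex))   # "average gender": computable, but meaningless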
Checking
Checking is necessary whenever we deal with data (whether or not we use a computer —
perhaps more so with a computer). Checking is important, yet it should not be too extensive,
else it becomes too time-consuming. One of the most important checks is common sense
(experience): do the results and conclusions agree with our common sense? If not, why
not? Can we explain the differences?
Significant figures
In preliminary data analysis, most people can meaningfully handle at most three significant
figures; in most cases two is better. This also applies to the reader of the report of our
analysis, so two or three figures is usually best for the presentation of our results.
When we write x = 1.41, we can mean that x is measured as 1.41 (to two-decimal accuracy);
or we can mean that we have calculated x using some formula and are reporting the result to
two-decimal accuracy. Thus this x might actually be equal to √2. In statistics, it is preferred
that numbers are rounded off at a meaningful level.
This can lead to results that may seem odd. For example, we may write 0.33 + 0.33 + 0.33 =
1.00. This is “correct” if we are reporting 1/3 + 1/3 + 1/3 = 1, correct to two decimal places.
Transformations
If the data set contains numbers of widely differing size, they may be brought onto the
same scale, by taking logs for example. Of course this considerably warps the original
scale so that some care may be needed in interpretation.
Two transformations that are quite commonly used are:
• the log transformation: y = ln x, which transforms (0, ∞) to the real line (−∞, ∞);
• the logistic transformation: y = logit(x) = ln(x/(1−x)), which transforms (0, 1) to the real line (−∞, ∞).
x        log(x)        x        logit(x)
0.001    –6.9          0.001    –6.9
0.05     –3.0          0.05     –2.9
0.2      –1.6          0.2      –1.4
1         0.0          0.5       0.0
5         1.6          0.8       1.4
20        3.0          0.95      2.9
1000      6.9          0.999     6.9
The log transformation is often used for positive data that has a long tail. The logit trans-
formation is often used for proportions, or bounded data.
If we have data like x in either of the above tables, then the log or logit transformation
converts it to a “sensible” scale.
Note: If the data are restricted to (a, b) then the transformation y = ln((x−a)/(b−x)) can be useful:
it transforms (a, b) to the real line (−∞, ∞).
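A minimal sketch of these transformations in R (the function logit below is defined here; it is not built in):
> x <- c(0.001, 0.05, 0.2, 1, 5, 20, 1000)
> round(log(x), 1)                        # natural log: (0, Inf) -> real line
> logit <- function(p) log(p / (1 - p))   # logistic: (0, 1) -> real line
> p <- c(0.001, 0.05, 0.2, 0.5, 0.8, 0.95, 0.999)
> round(logit(p), 1)
> # for data restricted to (a, b): log((x - a)/(b - x))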
2.4 Descriptive statistics
2.4.1 Univariate data
We consider first the analysis of one-variable data — i.e., data consisting of a collection of
values of one variable (such as height or IQ or voting preference). Data sets consisting of
observations on more than one variable can always be subdivided into univariate data sets
and the variables analysed separately. However, methods of joint analysis are important.
Representations of bivariate data are mentioned later in this chapter, and their analysis is
considered in more detail in Chapter 8.
We look at the more important and useful statistics; and mention a few others.
Data can be summarised in two main ways: using numbers, or using graphs. These are
useful for different purposes. Numbers are good if you want to be exact, but it is harder to
present large amounts of information with them. Graphs are the opposite: it is easy to get
a good “sense” of the data, but some of the finer points may be hidden.
Since graphs are often based on numbers, we look at numerical statistics first. We don’t
want to show all the numbers in the data — that’s too much information! Instead, we want
to summarise the data using a small but meaningful set of numbers.
x: 4 5 4 6 1 9 7 3 12 5
> summary(x)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 4.00 5.00 5.60 6.75 12.00
2.4.3 Measures of location
A measure of location is a single value that is representative of the data. Thus it should be
a value that is as close to all observations in the data as possible, in some meaningful sense.
So that it can 'speak' for every datum in the sample! Measures of location are also called
measures of central tendency.
The most commonly used measure of location is called the sample mean, the arithmetic
mean or, simply, the average. The sample mean is defined as
mean = (sum of all observations)/(number of observations),
or, more formally, x̄ = (1/n)(x1 + x2 + · · · + xn).
Useful properties:
• The sum of the deviations of the data {x1, x2, . . . , xn} about the sample mean x̄ is 0.
Mathematically, (x1 − x̄) + (x2 − x̄) + · · · + (xn − x̄) = 0. This means that the
arithmetic mean balances out the negative and positive deviations in the data. In this
sense the sample mean is the centre of mass of the data. Hence extreme observations
can have a big effect on the sample mean.
• The sample mean is also the value that minimizes the total squared deviation of the
data about it. In other words, (x1 − a)² + (x2 − a)² + · · · + (xn − a)² is minimum at
a = x̄. In this sense, x̄ is close to all observations in the data.
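As a small check of these two properties in R, using the sample x given above:
> x <- c(4, 5, 4, 6, 1, 9, 7, 3, 12, 5)
> sum(x - mean(x))                   # deviations about the mean sum to (numerically) zero
> ssd <- function(a) sum((x - a)^2)  # total squared deviation about a
> optimize(ssd, range(x))$minimum    # minimised at (approximately) mean(x) = 5.6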
In the case of grouped data, where fj = freq(uj), j = 1, 2, . . . , k:
Σ xi = Σ fj uj, so that x̄ = (1/n) Σ fj uj.
Example (die rolling). The frequencies of the faces 1–6 in n = 50 rolls were:
u: 1  2  3  4  5  6
f: 6 10 11  8  7  8
Σ fj uj = 174, so that x̄ = 174/50 = 3.48 (≈ µ = 3.5).
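In R, the grouped-data mean is just a weighted mean:
> u <- 1:6
> f <- c(6, 10, 11, 8, 7, 8)
> sum(f * u) / sum(f)    # 174/50 = 3.48
> weighted.mean(u, f)    # the same calculation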
Another useful measure of location is the median. The median is the value that divides
the data (arranged in increasing magnitude) in two halves. At least half the data is smaller
than or equal to the sample median. Equivalently, at least half the data is greater than or
equal to the median.
DEFINITION 2.4.2. The sample median, denoted by m̂ or ĉ0.5 , is the “middle value” of
the data. In other words, it is the value that separates the bottom half of the data from
the top half of the data.
Useful properties
• The sample median is the value that minimizes the total absolute deviation of the
data about it. In other words, |x1 − a| + |x2 − a| + · · · + |xn − a| is minimum at a = m̂.
In this sense, m̂ is close to all observations in the data.
• Unlike the sample mean, the sample median is less affected by extreme observations
in the data.
These are the most important and useful measures of location. However, others may be
used: for example, the sample mid range, the sample mode, the trimmed sample mean.
Note: the sample mode denotes the most frequent or the most common value; and there-
fore is not really a (good) measure of location.
Mean = x̄ = sample mean = (1/n) Σ xi = 56/10 = 5.6.
In R: mean(x).
DEFINITION 2.4.3.
1. If the sample x1, x2, . . . , xn is arranged in order of increasing magnitude: x(1) ≤
x(2) ≤ · · · ≤ x(n) [so that x(1) denotes the smallest sample variate (i.e. the
minimum) and x(n) the largest (i.e. the maximum)], then x(k) is called the kth
order statistic.
2. The sample q-quantile, denoted by ĉq, is such that a proportion q of the sample is
less than ĉq. That is, ĉq = x(k), where k = (n+1)q.
Thus half of the sample is less than ĉ0.5, and so the 0.5-quantile, ĉ0.5, is the median.
Note: A common notation, which you will get to see much more of, is the 'hat' notation. It denotes
'an estimate of' and/or 'a sample version of'. Thus ĉ0.5 denotes an estimate of c0.5,
the population median. As a sample is often used to estimate a population, many of the sample
characteristics are 'hatted'. Not all though: we prefer x̄ to µ̂, for example.
For the above sample, x(1) = 1, x(2) = 3, x(3) = 4, . . . , x(10) = 12.
So what is x(2.75) ? x(2.75) is taken to be 0.75 of the way from x(2) = 3 to x(3) = 4; thus
x(2.75) = 3.75. Check that x(8.25) = 7.5.
EXAMPLE 2.4.3: For the following sample (of sixteen observations), find the
sample median and the sample quartiles.
5.7, 4.5, 17.7, 12.3, 20.1, 6.9, 2.3, 7.0, 8.7, 8.4, 14.6, 10.0, 6.1, 9.1, 10.0, 10.7.
The data must first be ordered, from smallest to largest. This gives:
2.3, 4.5, 5.7, 6.1, 6.9, 7.0, 8.4, 8.7, 9.1, 10.0, 10.0, 10.7, 12.3, 14.6, 17.7, 20.1;
which specifies the order statistics: x(1) = 2.3, x(2) = 4.5, . . . , x(16) = 20.1.
The median ĉ0.5 = x(8.5) , since k = (16+1)×0.5 = 8.5. The median is half-way
between x(8) and x(9) , i.e. half-way between 8.7 and 9.1. So, ĉ0.5 = 8.9.
The lower quartile ĉ0.25 = x(4.25) , since k = (16+1)×0.25 = 4.25. Thus, the
lower quartile is a quarter of the way between x(4) = 6.1 and x(5) = 6.9. So,
ĉ0.25 = 6.3.
Similarly, ĉ0.75 = x(12.75) = 11.9, since x(12) = 10.7 and x(13) = 12.3.
Note: In R, quantiles are computed using the function quantile() which allows you
to specify 9 commonly used empirical quantile definitions. The above definition is met
by specifying the option type=6 when using quantile() or summary(). To see all the
quantile types see help(quantile).
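For instance, applying this to the data of Example 2.4.3:
> x <- c(5.7, 4.5, 17.7, 12.3, 20.1, 6.9, 2.3, 7.0, 8.7, 8.4, 14.6, 10.0, 6.1, 9.1, 10.0, 10.7)
> quantile(x, probs = c(0.25, 0.5, 0.75), type = 6)
  25%  50%  75%
  6.3  8.9 11.9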
Find the sample median and the sample mean. [134 & 178.9]
Why should you expect that the sample mean is greater than the sample me-
dian?
The distribution is positively skew, i.e. has a long tail at the positive end, so the mean
will be larger than the median: it gets pulled towards the longer tail. The population
distribution is positively skew, so even before the sample is taken, we should expect that
the sample mean will be greater than the sample median, since the sample will resemble
the population distribution.
2.4.4 Measures of spread
Measures of location only tell us about a central or typical or representative value of a sample.
However, to assess the difference between observations, we need to study the variation in
the data. Measures of spread describe the variability in a sample or its population, about
some measure of location or from one another. Sample variance is the most commonly
used measure of spread for numeric data. It is defined as
s² = [(x1 − x̄)² + (x2 − x̄)² + · · · + (xn − x̄)²] / (n − 1).
Roughly speaking, the sample variance of a data set is the average squared distance of the sample
observations about the sample mean. To reverse the squaring process, we define the sample
standard deviation, s = √(s²).
Usually, the range (x̄ − 2s, x̄ + 2s) will contain roughly 95% of the sample.
It is quite possible to observe samples or populations with the same mean but differing stan-
dard deviations, or vice versa, as shown below.
Another measure of spread is the sample interquartile range: τ̂ = ĉ0.75 − ĉ0.25.
The sample interquartile range is a single number: it is the difference, not the interval.
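In R, with the sample x used above:
> var(x)             # sample variance, divisor n - 1
> sd(x)              # sample standard deviation, sqrt(var(x))
> IQR(x, type = 6)   # interquartile range, using the (n+1)q quantile definition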
(a) Same mean, different standard deviation. (b) Same standard deviation, different mean.
2.4.5 Graphical representations
Numerical representations are great for investigating particular aspects of the data, but to
get an overall sense of it, it is better to use a graphical representation. There are many
graphical representations, based on different properties of the data.
Frequency data:
Below we create a barchart of these data. There are gaps between the bars, as
they represent separate categories.
[Barchart of frequencies (vertical scale 10–40) for the categories: Stat energy, Waste, Industrial processes, Agriculture, Fugitive emissions.]
Note: freq denotes frequency; thus freq(X = 4) denotes the frequency of X = 4, i.e. the
number of times in the sample that the variable X is equal to 4. In the example be-
low, freq(X = 4) = 2, since there are two 4s in the sample. Similarly, freq(X ≤ 6) = 7,
freq(4 ≤ X ≤ 6) = 5, and so on.
However, if the underlying variable is continuous, then we would prefer to have a function
on the real numbers. We use a histogram.
Histograms
The standard approach to representing the frequency distribution of a continuous variable
is to use “binning”, i.e. putting the observations in “bins” or “groups” that cover the line.
This gives the histogram, which will be a familiar representation. It is just a bar chart, with
joined-up bars!
• A histogram is suitable for continuous data.
• A histogram has no gaps between “bars”.
• If all intervals are of the same width, then heights of “bars” can be frequencies or
relative frequencies.
• We should plot:
height = relative frequency / interval width.
Thus, the areas of the “bars” correspond to relative frequencies; that is,
f̂(x) = [freq(a < X ≤ b)/n] / (b − a), for a < x ≤ b.
• Use hist() to produce a histogram in R.
[Frequency histogram of the sample x (hist(x)): bars of height up to 3 over the range 0 to 12.]
[Two density histograms for a sample on (0, 100), with vertical scale f̂ from 0 to 0.04, shown alongside the population density f.]
A random sample of 220 observations was obtained from the population dis-
tribution with density f(x) = (1/20)e^(−x/20), x > 0. The default histogram produced by R is
shown on the right above. It is supposed to reflect the population distribution:
f̂ describes the sample, but also estimates f. A sample of random data and
histogram may be obtained as follows:
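(A minimal sketch: rexp() with rate 1/20 matches the density f above; breaks = 10 is illustrative.)
> x <- rexp(220, rate = 1/20)          # 220 observations from f(x) = (1/20)exp(-x/20)
> hist(x, breaks = 10, freq = FALSE)   # density histogram: bar heights are f-hat
> hist(x, freq = TRUE)                 # frequency histogram, with default breaks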
Note that the option breaks specifies an approximate number of breaks in the
histogram. If omitted, R uses a rule of thumb based on the number of observa-
tions in the sample. The options freq=FALSE and freq=TRUE specify density
and frequency histograms, respectively.
It is standard to use equal bin width, but they can be made unequal: the graph
below has bins (0,2), (2,5), (5,20), (20,50) and (50,100). This might be done for
extremely skew distributions.
[Density histogram with the unequal bins listed above, shown alongside the population density f.]
In the case of unequal bin widths, the f̂ values are obtained using the formula
above: for example, for the first bin, f̂ = (24/220)/2 = 0.0545. Though, of course,
R does the calculations for you once you have set the breakpoints.
bin            width   frequency     f̂
0 < x ≤ 2        2        24       0.055
2 < x ≤ 5        3        23       0.035
5 < x ≤ 20      15        90       0.027
20 < x ≤ 50     30        66       0.010
50 < x ≤ 100    50        17       0.002
                         220
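In R, unequal bins are set via the breaks argument (this sketch assumes all observations lie in (0, 100]):
> hist(x, breaks = c(0, 2, 5, 20, 50, 100), freq = FALSE)
With unequal widths, hist() plots densities (f̂) rather than frequencies.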
F̂(x) = (1/n) freq(X ≤ x) = (number of observations ≤ x) / (total number of observations),
i.e. the relative frequency of observations less than or equal to the number x.
The population version of this is called the cumulative distribution function (cdf), denoted by F . The
cumulative relative frequency function is the sample version, and is often referred to as the sample
cdf, or the empirical cdf. It is available in R using the function ecdf() . . .
[Plot of the sample cdf F̂ for the sample x: a step function rising from 0 to 1 over the range 0 to 12.]
For example: F̂(4.2) = (1/10) freq(X ≤ 4.2) = (1/10) × 4 = 0.4.
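In R, with x as before:
> x <- c(4, 5, 4, 6, 1, 9, 7, 3, 12, 5)
> Fhat <- ecdf(x)   # the sample cdf, returned as a function
> Fhat(4.2)         # 0.4
> plot(Fhat)        # step-function plot as above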
Sample quantiles
Boxplot
Compared with sample A, sample B has greater location measure; sample C has a greater
spread measure. Samples A, B and C are symmetrical, but sample D shows positive skew-
ness, i.e. a longer tail at the positive end.
One problem with this representation is that one or two outlying data values could give a
misleading impression of the spread of the distribution. For this reason, we limit the length
of the “whiskers” (the lines at either end of the box) to 1.5τ̂ , i.e., 1.5 times the interquartile
range. The line extends to the most extreme data value within these limits, i.e. ĉ0.25 −1.5τ̂
for the lower end, and ĉ0.75 +1.5τ̂ at the upper end. These are sometimes called the ‘inner
fences’. Any data value outside this interval is indicated separately. Some boxplots also define
‘outer fences’: ĉ0.25 −3τ̂ and ĉ0.75 +3τ̂ , and label points outside these limits as “extreme outliers”.
Extreme values are indicated separately on a boxplot.
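A short sketch of the inner fences in R, for the sample x used above:
> q <- quantile(x, c(0.25, 0.75), type = 6)
> tau <- q[[2]] - q[[1]]                      # interquartile range
> c(q[[1]] - 1.5 * tau, q[[2]] + 1.5 * tau)   # inner fences
> boxplot(x)   # values outside the fences are plotted individually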
It is common to label these outlying data values by individual name or case number or
some other identification. There may be some explanation of their oddity, but in any case,
the outlying data values are often of interest.
EXERCISE. A sample of 120 patients was observed following a cancer treatment. The
recurrence times (in months) ranged from 13 months to 71 months. They
are summarised in the following stem-and-leaf plot:
1 3
1 79
2 011223333344
2 55555556666777777788888999
3 0000000000000111111122334444444
3 55555666667788889
4 000111122223333444
4 5566789
5 024
5 6
6 3
6
7 1
(a) Obtain the sample median and sample quartiles and hence draw a box-plot.
(b) Give approx values for the sample mean and sample standard deviation.
The following data are the zinc intakes (mg) of a sample of 40 patients:
8.0 12.9 13.0 8.9 10.1 7.3 11.1 10.9 6.2 8.1
8.8 10.4 15.7 13.6 19.3 9.9 8.5 11.1 10.7 8.8
10.7 6.8 7.4 5.8 11.8 13.0 9.5 8.1 6.9 11.5
11.2 13.6 5.9 21.1 15.7 10.8 10.7 11.5 16.1 9.9
In R, after importing the data and obtaining the variable zinc, we can obtain summary
statistics, histogram and boxplot as follows:
> summary(zinc)
Min. 1st Qu. Median Mean 3rd Qu. Max.
5.80 8.40 10.70 10.78 12.08 21.10
[Density histogram (Histogram of zinc, vertical scale 0–0.15) and boxplot of zinc intake (mg).]
Comment on the distribution of zinc intake for these patients. In other words, comment on
location (centre), spread (scale, dispersion), symmetry (skewness) and any oddities (outliers, shape).
> summary(x)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.261 10.250 19.370 31.800 38.790 162.400
> summary(log(x))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.2319 2.3280 2.9640 3.0200 3.6570 5.0900
The histogram (sample pdf), sample cdf and boxplot are given below for each
sample:
[Histogram (sample pdf), sample cdf and boxplot for x (left) and for log(x) (right).]
The skewness is seen in the boxplot through the asymmetry of the box, and
all the points at the top end. It is observed too that the log-transformation has
removed the skewness. A log-transformation always reduces positive skewness:
in this case, it is reduced from positive to close to zero.
Other measures of shape: skewness and kurtosis.
2.4.6 Bivariate data
Bivariate data, as the name suggests, consist of observations on two associated variables
for each of a number of individuals. The two variables may both be categorical; or one
may be categorical and one numerical; or both may be numerical. We consider some simple
examples to illustrate.
Two categorical variables
For two categorical variables, the simplest strategy is to combine the variables into one
(more-complex) variable, and then use a barchart for the super variable.
EXAMPLE 2.4.11: If the variables are gender (f, m) and blood-group (O, A, B, AB),
then we can combine them into a gender/blood-group variable with eight cat-
egories: (fO, fA, fB, fAB; mO, mA, mB, mAB). Here there is some sort of imposed
order on the categories that is chosen by which variable comes first. This order
indicates blood groups within genders, whereas (fO, mO; fA, mA; fB, mB; fAB, mAB)
would show the difference between genders for each blood group. The order within
each variable is arbitrary, (f, m) or (m, f); though (O, A, B, AB) seems to be 'standard'.
[Barchart of the combined gender/blood-group variable, with separate bars for female and male.]
Two numerical variables
x:  0 10 20 30 40
y: 84 76 65 66 60
   80 78 70 63 62
For example, where x represents a drug concentration and y the response of a laboratory
animal. Here there are ten animals, two at each level of the drug concentration. Thus in
this case, the data are:
(x1, y1) = (0, 84), (x2, y2) = (0, 80), . . . , (x10, y10) = (40, 62).
In the case that the x-variable has a natural ordering (usually time) and there can be only
one y-value for each x-value, it is common to join the consecutive points by a line. This is
called a line-plot.
For example, the data corresponding to the following plot come from ice core samples ob-
tained and analysed in the 1990s from the Law Dome, near Casey Station, Antarctica. The
measurements are of carbon dioxide concentrations from air samples in the past, trapped
in the Antarctic ice.1
If there is a negative relationship between the variables, then large x and small y tend to
occur together; and small x and large y tend to occur together. This is illustrated in the
diagram above.
A measure of the relationship is the correlation coefficient, r, which can be obtained in R
using
cor(x,y)
using the name or column specification for the x-variable and the y-variable.
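For instance, with the drug-concentration data given earlier:
> x <- c(0, 0, 10, 10, 20, 20, 30, 30, 40, 40)
> y <- c(84, 80, 76, 78, 65, 70, 66, 63, 60, 62)
> plot(x, y)    # scatter plot
> cor(x, y)     # negative: response decreases with concentration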
1 Source: cdiac.ornl.gov/trends/co2/lawdome.html.
Correlation is discussed in more detail in Chapter 8. For now it is enough to know that it
is a number in the interval −1 ≤ r ≤ 1; its sign reflects the type of relationship (positive or
negative) and its magnitude reflects the strength of the relationship: the most extreme values
(±1) correspond to a straight line (with positive or negative slope).
For the height-weight data above, r = 0.809, indicating a moderately strong positive rela-
tionship.
The connection between scatter-plots and correlation is indicated in the diagram below.
However, the scatter-plots shown are ‘standard’: the distributions are roughly symmetrical
and there are no serious outliers. Like the mean and the standard deviation, the correlation
coefficient is affected by outliers.
Each of the above scatter plots has the same apparent scale on the two axes, and this is
the ideal you should aim for when generating a scatter plot. The apparent scale reflects
what we see rather than the units on the axes. Here, the apparent scales for x and y
are similar: most of the points are within an interval of about 3cm horizontally (x-axis)
and within an interval of about 3cm vertically (y-axis). The axes are not indicated. They
could be {0, 1, 2, . . .} on each axis. However they could be {1.3, 1.4, 1.5, . . .} for x and
{600, 700, 800, . . .} for y. In that case the numerical scales would be very different although
the apparent scales are the same.
If the apparent scales are different, then the scatter plot is distorted, giving a false impres-
sion of the relationship:
The points in the above three plots are identical: only the scale is changed, on the x-axis and
on the y-axis respectively. In each case the correlation is r = 0.45, but when the apparent
scales are different we get the impression of a greater correlation: the points seem to be
nearer to a straight line. Taking the scale change to the extreme: a very large scale on the
y-axis would produce what appears to be a horizontal straight line; and similarly a vertical
straight line results from using a very large scale on the x-axis.
We have seen that x̄ and sx indicate the location and spread of the x-data; similarly,
ȳ and sy indicate the location and spread of the y-data. If the x-axis has tick-marks at x̄, x̄ ± sx, x̄ ± 2sx,
. . . and the y-axis has tick-marks at ȳ, ȳ ± sy, ȳ ± 2sy, . . . then the apparent scales are
equivalent. About 95% of the points will be in (x̄−2sx, x̄+2sx); and about 95% will be in
(ȳ−2sy, ȳ+2sy).
[Scatter plot with y-axis tick-marks at ȳ−2sy, ȳ−sy, ȳ, ȳ+sy, ȳ+2sy.]
In practice, we label the axes in some ‘nice’ way (in units, or thousands, or hundredths, or
whatever) but choose the axis units so that the scales are similar. For example, if x̄ = 46.7
and sx = 4.3, then we might choose tick marks at {35, 40, 45, 50, 55}, say. The computer
package will do this scale selection automatically, so we usually don’t have to worry about
it. Sometimes though, it produces odd subdivisions: gaps of 6, 7, or 11, for example. We
humans tend to prefer gaps like 1, 2, 5 or 10.
Problem Set 2
2.1 (a) Classify each of the variables in the following questionnaire (as categorical, ordinal or
discrete numerical or continuous numerical).
1. Age (in months):
2. Sex: male female
3. How often do you use public transport?
never rarely sometimes often frequently
4. State the number of times you used public transport last week:
5. Do you own a car?
6. What is the fuel consumption of your car?
(b) The diagram below is not so much misleading as confusing. It relates to blood levels
observed in a sample of children. Draw a more appropriate diagram.
2.2 Comment on the following quantitative statements and conclusions. Is there sufficient evi-
dence to reach the stated conclusion? If not, why not?
(a) “Two out of three dentists, responding to a survey, said that they would recommend OR-
BIT gum to their patients who chew gum. Therefore the majority of dentists recommend
chewing ORBIT gum.”
(b) “Heart disease is responsible for 40% of deaths in Australia, therefore we should spend
more money on research into heart disease.”
(c) “A survey of two thousand drivers has recently been completed. Of the drivers under 30,
35% had had a car accident in the past year, whereas only 20% of the older drivers had
been involved in an accident in that time. Clearly, therefore, young drivers are worse
drivers.”
(d) “You should cross at a pedestrian crossing. It’s five times safer.”
2.3 The data below are a sample of cholesterol levels taken from 20 hospital employees who were
on a standard (meat-eating) diet and who agreed to adopt a vegetarian diet for one month.
Serum-cholesterol measurements were made before adopting the diet and 1 month after. The
rows in the table below give patient ID, cholesterol level before and cholesterol level after the
month on the vegetarian diet.
ID:       1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
Before: 195 205 159 244 166 250 236 192 224 238 197 158 151 197 180 222 168 168 167 161
After:  146 178 146 208 147 202 215 184 208 206 169 127 149 178 161 187 176 145 154 153
(a) i. What is the question of interest being investigated here?
ii. What is the sample size and what are the study units?
iii. What is the underlying population?
(b) Let diff denote the difference = before − after. For the observations on diff:
i. Draw a boxplot for the data.
ii. Calculate the mean, median, standard deviation and interquartile range.
iii. Comment on the distribution of the data.
iv. Suppose the first value is mistyped as 495 (instead of 195). Which of the statistics in
(b)ii will change, and which will not?
(c) What do you think is the answer to the question of interest given in (a)i?
2.4 (a) Find the five number summaries, and draw boxplots, for each of the following:
i. 1, 2, 4, 8, 16, 32, 64
2.5 A frequency distribution for the serum zinc levels of 462 males between the ages of 15 and 17
is displayed below.
Serum zinc level (µg/dL)    Number of males
50–59 6
60–69 35
70–79 110
80–89 116
90–99 91
100–109 63
110–119 30
120–129 5
130–139 2
140–149 2
150–159 2
Draw an accurate cumulative frequency polygon, and use it to find approximate values for:
(a) the median and IQR; the 10th and 90th percentiles;
(b) the proportion of males whose serum zinc level is less than 105 µg/dL;
(c) the proportion of males whose serum zinc levels lie within mean ± 2 sd (the sample
mean = 88.1 and sample standard deviation = 16.8).
(d) How does the above proportion compare with the empirical rule?
(e) Why are the values you have found only approximate? What would be required to obtain
the exact values?
2.6 Scientists wanted to test whether a new corn with extra lysine (a protein) would be good for
chicks. An experimental group of 20 one-day old chicks was fed a ration containing the new
corn. A control group of another 20 one-day old chicks was fed a ration which was identical
except that it contained normal corn. Here are the weight gains (in grams) after 21 days.
Control 272 283 316 321 329 345 349 350 356 356
360 366 380 384 399 402 410 431 455 462
Lysine 318 326 339 361 375 392 393 401 403 406
407 410 420 426 427 430 434 447 467 477
(a) What is the response variable? The explanatory variable? What other variables have been
controlled by keeping them constant? What other variables might affect the weight gain
of individual chicks? Can these be controlled? Explain.
(b) Suppose that the supply of experimental chickens came from two farms. Simon suggests
that the easiest way to conduct the study would be to use 20 chicks from Farm A as the
experimental group and 20 chicks from Farm B as the control group. What do you think?
(c) Which of the following displays would be appropriate for assessing the effectiveness of
the lysine supplement: histogram(s); boxplot(s); scatterplot(s)? Draw the display that
you consider to be the most appropriate for the above data, and use it to comment on the
effectiveness of the lysine supplement.
2.7 The Newport Health Clinic experiments with two different configurations for serving patients.
In one configuration, all patients enter a single waiting line that feeds three different physi-
cians. In another configuration, patients wait in individual lines at three different physician
stations. Waiting times (in minutes) are recorded for ten patients from each configuration.
Compare the results.
Single line: 65 66 67 68 71 73 74 77 77 77
Multiple lines: 42 54 58 62 67 77 77 85 93 100
Interpret the results by determining whether there is a difference between the two data sets
that is not apparent from a comparison of the measures of centre. If so, what is it?
2.8 R produced the following descriptive statistics for the level of substance H in the blood of a
random sample of thirty individuals with characteristic C. The sample size is n = 30 and there
are 5 missing observations.
Min. 1st Qu. Median Mean 3rd Qu. Max.
62.90 68.40 71.5 73.00 81.25 96.30
(a) Sketch a diagram indicating the distribution of the data.
(b) Where, on your diagram, do you think the five missing observations might go? Why?
(c) What population are we trying to sample from? (i.e. what is the target population?)
(d) What assumptions are made in treating these data as a random sample of 25 observations
from the target population?
(e) Give an example of a situation when this might not be true.
2.9 The following data are the pulmonary blood flow (PBF) x (L/min·m²) and pulmonary blood
volume (PBV) y (mL/m²) values recorded for 14 infants and children with congenital heart
disease:
x 4.3 3.4 6.2 17.3 12.3 14.0 8.7 8.9 5.9 5.0 3.5 4.2 7.2 11.6
y 170 280 390 420 305 430 305 520 225 290 235 370 210 440
Draw a scatter plot and use it to assess the relationship between PBF and PBV.
2.10 The following scores represent a nurse’s assessment (x) and a physician’s assessment (y) of the
condition of each of ten patients at time of admission to a trauma centre:
x 18 13 18 15 10 12 8 4 7 3
y 23 20 18 16 14 11 10 7 6 4
i. Construct a scatter diagram for these data.
ii. Describe the relationship between x and y, and guess the value of the correlation.
iii. Use R to evaluate the correlation.
iv. If x were to be used as a predictor for y, which one of the following lines would you use:
y = 8 + 0.5x, y = −10 + 2x, or y = 1 + x? Determine which by plotting each of the lines on
your scatter diagram.
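A sketch of the R calls involved (plot(), cor() and abline(a, b) are standard base R; the vectors are the scores above):
x <- c(18, 13, 18, 15, 10, 12, 8, 4, 7, 3)
y <- c(23, 20, 18, 16, 14, 11, 10, 7, 6, 4)
plot(x, y)                                     # scatter diagram
cor(x, y)                                      # sample correlation
abline(8, 0.5); abline(-10, 2); abline(1, 1)   # the three candidate lines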
Chapter 3
PROBABILITY AND
APPLICATIONS
“We balance probabilities and choose the most likely. It is the scientific use of
the imagination.” Sherlock Holmes, The Hound of the Baskervilles, 1902.
Chapter 1 provides an indication of where the data we analyse comes from. Chapter 2 tells us
something about what to do with a data set, or at least how to look at it in a sensible way. In this
chapter, and the next, we look at models for the data.
                           Probability (Ch3&4)
   population              −→                sample
   (Ch1: types of studies)                   (Ch2: data description)
   model                   ←−                observations
                           Statistical Inference (Ch5–8)
Probability (chance, likelihood) has an everyday usage which gives some sort of rough
gradation between the two extremes of impossible and certain:
This Probability must obey some rules: very reasonable rules, but rules nevertheless.
Properties of Pr
1. 0 ≤ Pr(A) ≤ 1; probability must lie between 0 and 1.
2. Pr(∅) = 0, Pr(Ω) = 1; an impossible event has probability 0; a certain event has probability 1.
3. Pr(A′) = 1 − Pr(A); the complement of A, i.e. not A, has probability 1 − Pr(A).
4. A ⊆ B ⇒ Pr(A) ≤ Pr(B); a subset has smaller probability.
5. Pr(A ∪ B) = Pr(A) + Pr(B) − Pr(A ∩ B); the “addition theorem”.
Intersection and union
DEFINITION 3.1.2.
1. A ∪ B denotes the union of events, corresponding to “A or B”, meaning that at
least one of the events A or B occurs.
2. A ∩ B denotes the intersection of events, corresponding to “A and B”, meaning
that both the events A and B occur.
[Diagrams: Venn diagrams and the corresponding probability tables illustrating A∩B and A∪B.]
EXAMPLE 3.1.2: Suppose that 6.3% of families with four children consist of four
boys ( ◦◦ ). What is the probability that a family of four children has at least one
girl?
If A denotes “four boys”, then the complement of A, A′ = “at least one girl”.
Therefore Pr(A′ ) = 1 − 0.063 = 0.937.
A probability table has the advantage of doubling as a Venn diagram, in which the subdi-
visions are squares rather than odd shapes, and which can be used for calculation.
EXAMPLE 3.1.3: Suppose that Pr(A) = 0.6, Pr(B) = 0.2 and Pr(A∩B) = 0.1.
The probability table is given by:
B B′
A 0.1 0.5 0.6
A′ 0.1 0.3 0.4
0.2 0.8 1
The bold entries are given. The rest can be obtained by subtraction.
∴ Pr(A∪B) = 0.7, Pr(A∩B′) = 0.5, Pr(A∪B′) = 0.9, . . .
In a probability table:
• intersections (A∩B, A∩B ′ , A′ ∩B, A′ ∩B ′ ) are represented by one square;
• unions (A∪B, A∪B ′ , A′ ∪B, A′ ∪B ′ ) are represented by three squares, i.e. an L-shape.
To complete the probability table, and hence to work out anything involving A and B, we
need three (separate) items of information.
      B    B′
A     γ    α
A′
      β         1

EXAMPLE 3.1.4: Given Pr(A∪B) = 0.7, Pr(A∩B′) = 0.2 and Pr(A) = 0.3, the completed
table is:

      B     B′
A     0.1   0.2   0.3
A′    0.4   0.3   0.7
      0.5   0.5   1

∴ Pr(B) = 0.5, Pr(A∩B) = 0.1, . . .
Simple examples of probability, and some of its earliest applications, relate to symmetrical
gambling apparatus (coins, dice, roulette wheels, . . . ), in which case probability is assigned
equally to the possible outcomes.
When a procedure is repeated a very large number of times, the proportion of times that
an event occurs approaches its probability. This can be used to provide an approximation
for probability, and a theoretical justification for it.
If an individual is chosen “at random” from a population of N individuals (meaning that
each individual is equally likely to be chosen) then the probability that the chosen individ-
ual has attribute A is n(A)/N , where n(A) denotes the number of individuals in the popu-
lation with attribute A. In this case, probability reduces to counting . . . in theory at least.
Many of our applications take this form, and probability denotes a population proportion,
Pr(A) = n(A)/N . However, usually we don’t know what n(A) is, and, in many cases, we’re
not entirely sure about N either.
So long as the rules are adhered to, numbers can be assigned to probabilities subjectively,
based on experience or opinion. This is done to some extent by people such as bookmak-
ers and stocktraders, though it also has some value in the scientific world. Its usefulness
depends entirely on the credibility of the experience and opinion. Just another assumption?
Mostly though, what is done is to generate a probability model for the situation at hand,
and decide whether or not the observed data is compatible with the model . . . or whether
the model is compatible with the data.
Q UESTION : What is the value for Pr(6) for this die?
A lot of what we do in statistics is based on modelling. In understanding the theory and
practice of statistics, it is necessary to deal with abstractions of various kinds. So you are
often asked to assume that, or suppose that . . . and in that case to work out what is likely
to happen.
It is useful to consciously allow your mind to entertain various abstract concepts and sce-
narios: it genuinely helps with understanding. These abstractions are introduced in many
ways, sometimes by a very simple word or phrase. You may be asked to “assume that”, or
“suppose”, or “model the data as . . . ”. Perhaps the simple word “if” may be used, often in
an “if . . . then” construction, e.g. “If the coin is fair, then . . . ” or “If the sample is random,
then . . . ”. Occasionally, the symbol ( ◦◦ ) will be used to remind you of this abstraction
process.
In medical applications, probability is often called risk because it often relates to a negative
disease-outcome (such as death, relapse, recurrence, complications, . . . ), whereas we do not
talk about the risk of a positive disease-outcome (such as cure, alleviation of symptoms,
improvement, . . . ): thus we refer to the probability of cure.
In a practical application, these disease-outcomes (positive or negative) must be further
specified as occurring over a specified period of time.
This risk can be thought of as the proportion of the population with the disease-outcome
occurring during the specified time period.
EXAMPLE 3.1.5: “A 60-year-old man has a 2% risk of dying from cardiovascular dis-
ease.”
What does this mean? Not much! Risk applies to a specific period of time: the
time period should be specified in such a statement. We should say something
like: “A 60-year-old man has a 2% risk of dying from cardiovascular disease in the next
five years.”
Another medical probability with a name is the probability that an individual (from a given
population) has disease D at a point in time. This is called the prevalence of the disease D at
a particular time (often taken to mean ‘now’). Thus the prevalence is a measure of disease
status.
Actually, prevalence is used for many things other than disease: characteristics such as
blood group, smoking status, eye-colour, and so on.
Yet another term, which tends to get confused with prevalence, is incidence. Incidence is a
rate rather than a probability and relates to the first occurrence of disease: the rate at which
a disease occurs. The incidence rate is a measure of the frequency of disease onset. This is
discussed in Chapter 4.
3.1.2 Odds
O(A) = Pr(A)/Pr(A′) = Pr(A)/(1 − Pr(A)).
Comparing risks
Suppose that the risk for group 1 is p1 and for group 2 the risk is p2 . How should we
compare the groups?
risk difference? p1 − p2 ;   risk ratio? p1 /p2 ;   odds ratio? [p1 /(1−p1 )] / [p2 /(1−p2 )] = p1 (1−p2 ) / ((1−p1 )p2 ).
Each of these is used in different situations. When the risks are small, the risk difference
will be small too: is a difference of 0.001 important? Maybe if p1 = 0.002 and p2 = 0.001,
it is. When the risks are large, the risk ratio is constrained: since p1 ≤ 1, it can be no larger
than 1/p2 . The fairest comparison turns out to be the odds ratio: it is the one biostatisticians
tend to use. It has a lot of advantages, as you will see, although it may not be the simplest.
Pr(A | H) denotes the probability of A given the information H about the outcome. The
information takes the form of an event, H. We are told that the event H has occurred.
We would like to modify the probability in the light of this additional information:
Pr(A) −→ Pr(A | H).
DEFINITION 3.2.1. The conditional probability Pr(A | H) is the probability that the
event A will occur given the knowledge that the event H has already occurred. It is
defined as
Pr(A | H) = Pr(A ∩ H) / Pr(H).
Given H, the H-row becomes the universe. And in that case, the conditional probability of
A is the proportion of the H-row that is in A, i.e. Pr(A∩H)/ Pr(H).
A A′
H 80 120 200
H′ 80 720 800
160 840 1000
The probability that an individual randomly chosen from this population has
cardiac problems is Pr(A) = 160/1000 = 0.16.
It remains true that Pr(A | H) = Pr(A∩H)/Pr(H) = (80/1000)/(200/1000) = 0.4.
Note that Pr(H | A) = 80/160 = 0.5. This is the probability that the individual has
hypertension given they have cardiac problems. To obtain this probability, we
restrict the universe to the 160 individuals with cardiac problems.
Given A, the universe becomes the column A. Similarly, given A′ the universe
would be reduced to the column A′, and in that case Pr(H | A′) = 120/840 = 0.143.
E XERCISE . Check that Pr(A | H ′ ) = 0.1. Explain what this conditional probability means,
in terms of hypertension and cardiac problems.
The probability table can be calculated from the given information: e.g. Pr(D∩E) =
0.3×0.011; and then any other probabilities involving D and E can be com-
puted.
Note: Pr(A | B) is not the same as Pr(B | A): they are in different universes!
Pr(A | B) is not equal to 1 − Pr(A | B′): different universes again.
But Pr(A′ | B) = 1 − Pr(A | B): these are from the same universe, and either A or A′ must
occur, no matter what universe we are in.
DEFINITION 3.2.2. The relative risk (or risk ratio), RR, of a disease D with respect to
an exposure E is given by
RR = Pr(D | E) / Pr(D | E′).
DEFINITION 3.2.3.
1. The conditional odds of D given E is
O(D | E) = Pr(D | E) / Pr(D′ | E).
2. The odds ratio is
OR = O(D | E) / O(D | E′).
The odds ratio compares the odds of disease for the group of exposed individuals
to the odds for the group of unexposed individuals.
      D    D′
E     α    β        Pr(D | E)  = α/(α+β)      O(D | E)  = α/β
E′    γ    δ        Pr(D | E′) = γ/(γ+δ)      O(D | E′) = γ/δ
D D′
E 0.3 0.1 0.4
E′ 0.3 0.3 0.6
0.6 0.4 1
H H′ (H | · )
A0 0.3 (0.01)
A1 0.5 (0.1)
A2 0.2 (0.9)
1
Pr(E) = 0.3, Pr(D | E) = 0.011, Pr(D | E′) = 0.001.

          D        D′        (D | ·)
E         0.0033   0.2967    0.3   (0.011)
E′        0.0007   0.6993    0.7   (0.001)
          0.0040   0.9960    1
(E | ·)   (0.825)  (0.298)

LTP: Pr(D) = 0.0040.
BT: Pr(E | D) = 0.0033/0.0040 = 0.825.
We already knew how to do this! See the earlier example. Also, Pr(E | D′) =
0.2967/0.9960 = 0.298.
The probability table can be obtained using the given information. Entries in
the first column can be evaluated, like Pr(A1 ∩C) = 0.45×0.024 = 0.0108; and
then the table can be completed using subtraction.
C C′ (C | · )
A1 0.0108 0.4392 0.45 (0.024)
A2 0.0129 0.2671 0.28 (0.046)
A3 0.0176 0.1824 0.20 (0.088)
A4 0.0107 0.0593 0.07 (0.153)
0.0520 0.9480 1
Then Pr(C) = 0.0520, i.e. 5.2% of the population are expected to develop cataracts.
This represents 5000×0.052 = 260 individuals.
Probability table representation:

      H                       H′
A1    Pr(A1) Pr(H | A1)       · · ·     Pr(A1)
A2    Pr(A2) Pr(H | A2)       · · ·     Pr(A2)
⋮      ⋮                                ⋮
Ak    Pr(Ak) Pr(H | Ak)       · · ·     Pr(Ak)
      Pr(H)                   · · ·     1
Observe from the probability table that Pr(H) can be found by summing up the probabili-
ties in the H column.
      D    D′                           D          D′
E     8    2996    3004        E       0.000667   0.249667   0.250333
E′    8    8988    8996        E′      0.000667   0.749000   0.749667
      16   11984   12000               0.001333   0.998667   1.000000

The first table gives the numbers in each group (this is called a contingency ta-
ble); and the second, obtained by dividing through by 12000, gives a probability
table.
      D    D′                        D        D′
E     8    12    20         E        0.1250   0.1875   0.3125
E′    8    36    44         E′       0.1250   0.5625   0.6875
      16   48    64                  0.25     0.75     1
The risk ratio is different, but the odds ratio is correct. Thus, we can simply
use the odds ratio from the case-control study to estimate the population odds
ratio.
We could use the case-control table to obtain the full population table, if we are
provided with the value of Pr(D), i.e. the proportion of the population with the
disease. The case-control table correctly gives Pr(E | D) = 0.5 and Pr(E | D′) = 0.25.
Using these values in conjunction with Pr(D) = 16/12000 = 0.001333, the
remaining probabilities in the population table can be evaluated.
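A sketch of that reconstruction in R (the inputs are the two conditional probabilities from the case-control table plus the externally supplied prevalence):
pEgD  <- 0.5        # Pr(E | D), from the case-control table
pEgDc <- 0.25       # Pr(E | D'), from the case-control table
pD    <- 16/12000   # Pr(D): prevalence, supplied separately
tab <- rbind(c(pEgD*pD,       pEgDc*(1 - pD)),
             c((1 - pEgD)*pD, (1 - pEgDc)*(1 - pD)))
dimnames(tab) <- list(c("E", "E'"), c("D", "D'"))
tab                                          # reproduces the population table above
(tab[1,1]*tab[2,2]) / (tab[1,2]*tab[2,1])    # odds ratio = 3, same as the case-control OR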
The diagnostic testing scenario is very important in medicine. There are a bunch of names
for many of the probabilities and conditional probabilities that you need to know about.
      P    P′
D     ✓    ✗        sn = Pr(P | D)
D′    ✗    ✓        sp = Pr(P′ | D′)
Thus if an individual is randomly selected from the population, then the probability that
the individual has the disease, Pr(D), is equal to the prevalence. Here probability is a
population proportion.
DEFINITION 3.4.2.
1. The sensitivity (sn) of a test is the probability that the test is positive given that
the person has the disease: sn = Pr(P | D).
2. The specificity (sp) of a test is the probability that the test is negative given that
the person does not have the disease: sp = Pr(P ′ | D′ ).
3. The positive predictive value (ppv) of the test is the probability that a person has
the disease, given the test is positive: ppv = Pr(D | P ).
4. The negative predictive value (npv) of the test is the probability that a person
does not have the disease, given that the test is negative: npv = Pr(D′ | P ′ ).
Note that all these conditional probabilities are concerned with “getting it right” . . . given D, given
D′ , given P and given P ′ .
DEFINITION 3.4.3.
1. A false negative occurs when the test is negative, and the person has the disease,
i.e. FN = D∩P ′ . However, the “probability of a false negative” is usually taken
to be the conditional probability: fn = Pr(FN | D) = Pr(P ′ | D) = 1 − sn.
2. A false positive occurs when the test is positive, and the person does not have the
disease, i.e. FP = D′ ∩P . Similarly, the “probability of a false positive” is usually
taken to be the conditional probability: fp = Pr(FP | D′ ) = Pr(P | D′ ) = 1 − sp.
      P       P′
D     0.0495  0.0005   0.05  (sn=0.99)
D′    0.0475  0.9025   0.95  (sp=0.95)
      0.0970  0.9030   1

Thus ppv = Pr(D | P) = 0.0495/0.0970 = 0.510.
E XERCISE . (hypertension)
Suppose 84% of hypertensives and 23% of normotensives are classified as hypertensive by
an automated blood-pressure machine. What is the positive predictive value and negative
predictive value of the machine, assuming that 20% of the adult population is hyperten-
sive?
      P     P′
D     495   5     500    ⇒ (sn=0.99)
D′    25    475   500    ⇒ (sp=0.95)
      520   480   1000

For this sample (or this subpopulation), ppv = 495/520 = 0.952.
But to get the ppv right for the population, in which the prevalence is 10%, we need to
adjust:
      P      P′
D     0.099  0.001   0.1  (sn=0.99)
D′    0.045  0.855   0.9  (sp=0.95)
      0.144  0.856   1

so that ppv = 0.099/0.144 = 0.688.
And if the prevalence were 1% then we would have:
      P       P′
D     0.0099  0.0001   0.01  (sn=0.99)
D′    0.0495  0.9405   0.99  (sp=0.95)
      0.0594  0.9406   1

so that ppv = 0.0099/0.0594 = 0.167.
prevalence p    0.5     0.1     0.05    0.01
ppv             0.952   0.688   0.510   0.167
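The adjustment is just Bayes’ theorem, so ppv is easily tabulated against prevalence. A minimal sketch in R, for the sn and sp above:
ppv <- function(p, sn = 0.99, sp = 0.95)
  sn*p / (sn*p + (1 - sp)*(1 - p))    # Bayes' theorem
ppv(c(0.5, 0.1, 0.05, 0.01))          # 0.952, 0.688, 0.510, 0.167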
A positive result on this diagnostic test would have the effect of increasing the odds
by multiplying by 9.5; a negative result would decrease the odds by multiplying by
1/9.5. It is seen that if the odds start out very small, then the odds will still be relatively
small.
3.5 Independence
Events A and B can be positively or negatively related according as:
Pr(A | B) ≷ Pr(A) ≷ Pr(A | B ′ )
The intermediate case, when they are all equal, i.e. the “no relationship” case is the case of
independence. A and B are independent if B has no effect on the probability of A occurring
. . . and vice versa: i.e.
Pr(A | B) = Pr(A) = Pr(A | B ′ ) and Pr(B | A) = Pr(B) = Pr(B | A′ ).
Independent events and Mutually exclusive events are entirely different things.
EXAMPLE 3.5.1: A and B are mutually exclusive events such that Pr(A) =
Pr(B) = 0.4. Then Pr(A∪B) = 0.4 + 0.4 = 0.8.
C and D are independent events such that Pr(C) = Pr(D) = 0.4.
Then Pr(C∪D) = 0.4 + 0.4 − 0.4×0.4 = 0.64.
For independent events, probabilities multiply: Pr(C∩D) = Pr(C) Pr(D). This multiplication rule extends to n independent events:
EXAMPLE 3.5.2: Find the probability of at least one six in six rolls of a fair die.
Pr(A) = 1 − Pr(A′) = 1 − (5/6)^6 = 0.665.
Find the probability that at least one individual in a sample of 100 has disease
D when the prevalence of the disease is 1%.
Pr(A) = 1 − Pr(A′) = 1 − (99/100)^100 = 0.634.
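Both calculations use the complement rule; a quick check in R:
1 - (5/6)^6      # at least one six in six rolls: 0.665
1 - 0.99^100     # at least one case among 100, prevalence 1%: 0.634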
A commonly used probability model is that of “independent trials” (commonly called
Bernoulli trials) in which each trial results in one of two outcomes, designated “success” or
“failure”, with probabilities p and q, where p + q = 1.
Simple examples of independent trials are coin-tossing and die-rolling; but the “indepen-
dent trials” model can be applied quite generally with:
trial = any (independently) repeatable random experiment;
success = A, any nominated event for the random experiment.
E XERCISE . A risky heart operation is such that the probability of a patient dying as a result
of the surgery is 0.01. If 100 such operations are performed at the hospital in a year, find
the probability that at least one of these patients dies as a result of surgery.
We assume that the operations are independent and each has the same probability of “suc-
cess” (that the patient dies!). This emphasises the fact that “success” is just a name for some
event: it clearly doesn’t have to be something good.
It soon becomes clear that the model is too simple (since not all patients are identical), but it
is nevertheless a useful place to start the modelling process.
Problem Set 3
3.1 Drug A causes an allergic reaction in 3% of adults, drug B in 6%, while 0.4% are allergic to
both. What sort of relationship exists between allergic reactions to the drugs A and B (positive,
negative, none)?
3.2 Suppose that events D and E are such that Pr(D | E) = 0.1 and Pr(D | E ′ ) = 0.2.
(a) Are D and E positively related, not related or negatively related? Explain.
(b) Specify the odds ratio for D and E.
Suppose also that Pr(E) = 0.4:
(c) Find Pr(D).
(d) Find Pr(E | D).
3.3 The risk ratio is RR = p1 /p2 , and the odds ratio is OR = p1 (1 − p2 ) / ((1 − p1 )p2 ).
(a) If the odds ratio is equal to 2, show that RR = 2 − p1 , and hence, or otherwise, complete
the following table:
p1      p2      p1 /p2      p1 − p2
0+
0.01
0.05
0.1
0.25
0.5
0.9
1–

Hint: First compute p1 /p2 = RR, using the expression for the risk ratio derived above; then find p2
using p2 = p1 /RR; and finally p1 − p2 .
(b) If the odds ratio, OR = θ, show that RR = θ(1−p1 ) + p1 ; and hence that RR can take any
value between 1 and θ.
(What happens if θ < 1?)
(c) A case-control study gives an estimate of the odds ratio relating exposure E and disease
D of 2.0. What can you say about the relative risk of D with and without exposure E?
(d) i. If the odds ratio is 3, find the risk ratio if p1 = 0.1.
ii. If the odds ratio is 1.5, find the risk ratio if p1 = 0.2.
iii. If the odds ratio is 0.5, find the risk ratio if p1 = 0.05.
3.4 Complete the following probability tables:
(a) (b) (A and B are independent)
′
B B B B′
A 0.4 A 0.4
′
A 0.2 A′
0.5 0.5
′
(c) Pr(A)=0.6, O(B | A)=0.2 & O(B | A )=1; (d)* Pr(A)=0.4, O(A | B)=1 & O(A | B ′ )=0.5.
3.5 A study investigating the relationship between disease D and exposure E found that, of indi-
viduals who have disease D, 20% had been exposed to E, whereas for individuals who do not
have disease D, 25% had exposure E.
(a) Are E and D positively related, not related or negatively related?
(b) Specify the odds ratio relating E and D.
(c) Explain why the relative risk of disease D with or without exposure E cannot be calcu-
lated with this information alone. What additional information is required to find the
risk ratio?
3.6 The Chinese Mini-Mental Status Test (CMMS) is a test consisting of 114 items intended to
identify people with Alzheimer’s disease and senile dementia among people in China. Low
test scores are taken to indicate the presence of dementia. An extensive clinical evaluation was
performed of this instrument, whereby participants were interviewed by experts and definitive
diagnosis of dementia was made. The table below shows the results obtained on a group of
people from an old-peoples’ home.
Expert diagnosis
CMMS score Nondemented Demented
0–5 0 2
6–10 0 1
11–15 3 4
16–20 9 5
21–25 16 3
26–30 18 1
Total 46 16
Suppose a score of ≤ 20 on the test is used to identify people with dementia. Assume that the
data above are representative of the underlying probabilities.
(a) What is the sensitivity of the test?
(b) What is the specificity of the test?
(c) If 1% of a community has dementia, what is the ppv for the test?
(d) How would these values change if the threshold score changed to 15? Comment.
3.7 The level of prostate-specific antigen (PSA) in the blood is frequently used as a screening test
for prostate cancer. A report gives the following data regarding the relationship between a
positive PSA test (> 5 ng/dL) and prostate cancer.
PSA test result Prostate cancer Frequency
+ + 92
+ − 27
− + 46
− − 568
i. Use these data to estimate the sensitivity, specificity and positive predictive value of the test.
ii. How might these data have been obtained?
3.8 Suppose that among males aged 50–59 the Prostate Specific Antigen (PSA) level is given by the
following graphs, according to whether the individual has prostate cancer or does not. These
graphs give the cumulative probability F(x) = Pr(PSA ≤ x). This is called the cumulative
distribution function and is equivalent to a population cumulative relative frequency.
[Figure: the two cumulative distribution functions F(x), for the non-cancer and cancer groups.]

x        3      4      5      6      7      8
FN (x)   0.140  0.400  0.800  0.950  0.990  0.997
FC (x)   0.003  0.010  0.040  0.100  0.250  0.600
Suppose we choose to say the PSA test is “positive”, if the PSA level is greater than ℓ, i.e.
P = {PSA > ℓ}. Assume that the prevalence of prostate cancer in this age group is 20%.
Find the sensitivity, specificity, positive predictive value, percentage false-positive and percent-
age false-negative for ℓ = 4, 5, 6, 7.
Discuss the effects of these different levels. How would you choose what is “best”?
The ROC curve plots sn against 1−sp (true positive vs false positive). Sketch the ROC curve.
Chapter 4
PROBABILITY DISTRIBUTIONS
“It has long been an axiom of mine that the little things
are infinitely the most important.” Sherlock Holmes, A Case of Identity, 1892.
of the reciprocal nature of probability and inference and is captured well in the following
diagram:
[Diagram not reproduced here.]
Notice the word “imagine”. In understanding the theory and practice of statistics, it is
necessary to deal with abstractions of various kinds. Ironically, often these abstractions
represent what we believe or hope is reality; but we cannot observe it directly. There are
many words and phrases used in these notes that entail this notion of abstraction.
Models and distributions are abstract. A problem might ask you to assume that the random
variable X has a particular distribution. This is because inference is only possible in a
framework that has some understanding of what random process generated the data. If we
want to make an inference about an unknown population proportion, then we know how
to quantify the uncertainty if the sample has been generated from a Binomial model. Of
course models, and abstractions more generally, may or may not be true. So for a particular
data set, we always need to ask ourselves, at least implicitly: how reasonable is the model?
and, more subtly: how wrong will my inference be if the model is not reasonable? But
we can get nowhere without assuming something abstract about the underlying probability
structure.
In any research project or experiment, anything we measure will be a random variable.
The randomness might arise because of the sampling procedure (i.e. which individuals are
included in the sample), or because of measurement error, or because of variation within
individuals.
EXAMPLE 4.1.2: Let X = number of heads obtained in three tosses of a fair coin.
By enumeration of the eight equally likely outcomes (hhh, hht, . . . , ttt), we find
that
Pr(X = 0) = 1/8, Pr(X = 1) = 3/8, Pr(X = 2) = 3/8 and Pr(X = 3) = 1/8.
It is necessary to distinguish between two types of random variables:
• Discrete random variables
• Continuous random variables
Discrete random variables are ones which can only take some values; almost always, they
are based on counts of some sort. The word “discrete” is used here to mean “separate,
distinct”. The number of children in a family is an example of a discrete random variable.
The distribution of a discrete random variable can be defined by specifying the probabilities
corresponding to each possible value that the random variable may take.
The probabilities in the distribution of a discrete random variable must be all non-negative,
and they must add to 1.
A specific example of the distribution of a discrete random variable is shown below. The
height of the spike at an x value shows the probability of observing that value. For example,
we see that the probability that this random variable takes the value 10 is about 0.15.
[Figure: spike plot of a discrete pmf.]
Continuous random variables can take any value within the range of possible values. The
distribution of a continuous random variable is defined by specifying a curve which relates
the height of the curve at any particular value to the chance of an observation close to that
value. This curve is called the probability density function.
[Figure: a pdf curve, with the area between two points a and b shaded.]
Formally, the chance that a continuous random variable takes a value in an interval be-
tween two points a and b is the area under the curve between a and b, as shown above.
Why can’t we use the discrete random variable approach for a continuous random vari-
able? We may ask about the probability that a continuous random variable takes the value
12. But . . . what do we mean by that? Remember that it can take any value in a given
range, so it can be 11.9, or 12.26, or 11.607, etc. A reasonable way of giving an answer to
the probability required is to suggest that what is meant by “12” in this case is “12, to the
nearest whole number”. This means a number between 11.5 and 12.5; and now we are
talking about an interval again: a narrow interval perhaps, but an interval all the same. If
we insist that we want the probability that a continuous random variable takes the value
12 exactly, that is, 12.00000000000000000000000. . . , then this is equal to zero. Note: the area
between 12 and 12 under the graph is zero!
The probability density function must be non-negative, and the total area under its graph
must be 1.
For a discrete random variable the cdf is a step function. For a continuous random variable
it is a continuous function.
EXAMPLE 4.1.3: Suppose that the continuous random variable X has cdf given by
F(x) = e^x / (1 + e^x)   (−∞ < x < ∞).
F(2) = Pr(X ≤ 2) = e² / (1 + e²) = 0.8808.
As there are no jumps in the cdf (it is continuous), Pr(X < 2) = Pr(X ≤ 2) =
F(2) = 0.8808.
The probability that X lies between −1 and 2 is given by:
Pr(−1 < X < 2) = F(2) − F(−1) = 0.8808 − 0.2689 = 0.6119.
The probability that X is greater than 3, Pr(X > 3) = 1 − F(3) = 1 − 0.9526 =
0.0474.
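This F is in fact the standard logistic cdf, which R provides as plogis(); a quick check:
plogis(2)                # F(2) = 0.8808
plogis(2) - plogis(-1)   # Pr(-1 < X < 2) = 0.6119
1 - plogis(3)            # Pr(X > 3) = 0.0474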
The following graphs were obtained for the empirical cdf in each case (using
ecdf()). These are plots of the cumulative relative frequency (see Chapter 2).²
¹ The numbers were actually generated to many more decimal places: for example, x1 = 10.441598 . . .
² The distributions used were Poisson with λ = 5.6; and Normal with µ = 8, σ = 2.
[Figures: the two empirical cdfs, each plotted on 0–15 with vertical scale 0–1: a step function
for the discrete (Poisson) sample on the left, and a near-continuous curve for the Normal
sample on the right.]
For samples of size 1000, the graphs resemble quite closely what is expected of
the population cdf (i.e. a step function on the one hand, and a continuous curve
on the other). To generate similar graphs one may use, for example:
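# a sketch, using the distributions named in footnote 2
plot(ecdf(rpois(1000, lambda = 5.6)))       # step function (discrete)
plot(ecdf(rnorm(1000, mean = 8, sd = 2)))   # nearly continuous curve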
Deciles (c0.1 , c0.2 , . . . , c0.9 ) and percentiles (c0.01 , c0.02 , . . . , c0.99 ) are other special cases of
quantiles that are used.
Quantiles give a useful description of the distribution that can be readily interpreted: half
the population are less than the median, a quarter are above the upper quartile, while 10%
are above the 90th percentile. Quantiles are not so useful if the distribution is ‘very’ discrete (i.e.
if the distribution has a small number of large jumps).
EXAMPLE 4.1.5: R gives the quantiles for a range of distributions, including the
ones we used above. In particular, the cdf Pr(X ≤ x) can be obtained by adding
the prefix p to the R name of the distribution of interest, and the quantiles can be
obtained by adding the prefix q. For example, for the discrete random variable
(Poisson with λ=5.6), R gives c0.75 = 7; check this against the discrete cdf graph
in the example above. Similarly, c0.25 = 4 and c0.5 = 5.
For the continuous random variable (Normal with µ=8, σ=2), R gives:
c0.25 = 6.651, c0.5 = 8.000, c0.75 = 9.349.
Check these values against the continuous cdf graph in the example above.
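The calls are a sketch (the output shown in the original was an image; qpois() and qnorm() are the standard R quantile functions):
qpois(c(0.25, 0.5, 0.75), lambda = 5.6)       # 4 5 7
qnorm(c(0.25, 0.5, 0.75), mean = 8, sd = 2)   # 6.651 8.000 9.349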
EXAMPLE 4.1.6: Suppose that X has cdf F(x) = e^x / (1 + e^x) (−∞ < x < ∞).
The 0.9-quantile of X, c0.9 , is such that e^c / (1 + e^c) = 0.9:
e^c / (1 + e^c) = 0.9 ⇒ e^c = 0.9(1 + e^c) ⇒ e^c (1 − 0.9) = 0.9 ⇒ e^c = 0.9/0.1 = 9,
so c0.9 = ln 9 = 2.197.
The median, c0.5 = 0; and the quartiles are c0.25 = ln(1/3) = −1.0986 and c0.75 = ln 3 =
1.0986.
The mean of a random variable X, which we denote by µ or E(X), is the weighted average
of values that X can take, where the weights are provided by the distribution of X. It is
at the “centre of mass” of the distribution. Sometimes the term “expectation of X” is used,
which is where the notation E(X) originates (E for Expectation).
Recall that p(x) denotes the probability mass function (pmf) for a discrete random variable
and that f (x) denotes the probability density function (pdf) for a continuous random vari-
able. The following definition gives a mathematical expression for E(X) for both discrete
and continuous random variables:
DEFINITION 4.1.2.
1. For discrete random variables, E(X) = Σ_x x p(x).
2. For continuous random variables, E(X) = ∫ x f(x) dx.
The most useful measure of spread is the variance (and its square root the standard devia-
tion).
The variance of a random variable X, which we denote by σ 2 or var(X), is the weighted
average of squared deviations from the mean of X, where the weights are provided by the
distribution of X.
Mathematically, it is defined as follows:
var(X) = E[(X − µ)²] = E[X²] − µ²,
where µ = E(X).
The variance is a measure of spread since the more widespread the likely values of X, the
larger the likely values of (X − µ)2 and hence the larger the value of var(X).
The standard deviation of a random variable X is the square root of the variance and is
denoted by sd(X): sd(X) = √var(X).
EXAMPLE 4.1.7: Suppose that X has a uniform distribution on (0,1), i.e. the pdf
of X is given by f(x) = 1, (0 < x < 1).
Note: Such a random variable is often called a “random number”. We have pre-
viously used such random variables in randomisation. They can be generated
in R using runif().
The total score could take any value in {24, 25, . . . , 144}. What are the likely
values?
E(T) = 7/2 + 7/2 + · · · + 7/2 = 24 × 7/2 = 84;
var(T) = 35/12 + · · · + 35/12 = 24 × 35/12 = 70,
so that sd(T) = √70 = 8.37.
The probability distribution of X is specified by p(x) or f (x), but the information conveyed
is not always easily understood.
For example, p(x) = e^{−25} 25^x / x!, (x = 0, 1, 2, . . .) means little without some evaluation. If
on the other hand we say that X is 95% likely to be in the range 25 ± 10, this information is
more readily grasped.
EXAMPLE 4.1.12: Sketch a pdf for which the mean is 65 and the standard devia-
tion is 10.
The first is the standard symmetrical graph with 2.5% below 45 (= 65 − 2×10) and
2.5% above 85 (= 65 + 2×10). The second is positively skew, with most of the 5% above
85. But both these pdfs have µ=65 and σ=10.
E XERCISE . Sketch a pdf for which the quartiles are 20, 30 and 50.
4.2.1 Introduction
A Bernoulli trial is a random experiment with two possible outcomes: “success” and “fail-
ure”. We let p = Pr(success) and q = Pr(failure), so that p + q = 1. We assume 0 < p < 1.
We consider a random experiment consisting of a sequence of independent Bernoulli trials,
observing the result of each trial.
Examples include:
• coin tossing, die rolling, firing at a target;
• sampling with replacement;
• a medical procedure applied to each of a number of individuals;
• any repeatable random experiment, with “success” = any specified event A; then
“failure” = A′ , and p = Pr(A).
Let Sk = “success at the kth trial”, and Fk = “failure at the kth trial” = Sk′ .
Then Pr(Sk ) = p, for k = 1, 2, 3, . . .
Note that Sk and Fk are mutually exclusive, while Sk and Sl (k ≠ l) are independent.
DEFINITION 4.2.1. Let X be the number of successes in n trials where the probability
of success on each trial is p. Then X has a binomial distribution with parameters n and
p, and we write X ~ Bi(n, p). The pmf of X is given by
p(x) = C(n, x) p^x q^{n−x},  for x = 0, 1, . . . , n,
where q = 1 − p and C(n, x) is the binomial coefficient.
EXAMPLE 4.2.1: A machine is producing capsules such that the probability that
any capsule is defective is 0.01, independently of the others.
Find the probability that at most one of the next ten capsules produced is defec-
tive.
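In R (a sketch; pbinom() is the binomial cdf used below):
pbinom(1, size = 10, prob = 0.01)   # Pr(X <= 1) = 0.9957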
EXAMPLE 4.2.2: Suppose that X ~ Bi(35, 0.3). The graph of the pmf is shown
below.
[Figure: pmf of Bi(35, 0.3), spikes at x = 0, 1, . . . , 35.]
Using R:
[1] 0.9641
> pbinom(9, size=35, prob=0.3) # cdf of binomial distribution
[1] 0.3646
DEFINITION 4.2.2. Suppose that X ~ Bi(n, p). Then
1. E(X) = np; and
2. var(X) = npq,
where q = 1 − p.
EXAMPLE 4.2.3: If X ~ Bi(100, 0.4) find the mean and standard deviation of X.
E(X) = 100×0.4 = 40; var(X) = 100×0.4×0.6 = 24; sd(X) = √24 = 4.899.
EXAMPLE 4.2.5: Suppose that with the standard treatment, the five-year recur-
rence rate of a particular cancer is 30%. A new treatment is applied to 100 indi-
viduals with the cancer. Assuming that the new treatment has the same effect
as the standard treatment ( ◦◦ ), what is the distribution of the number who are
cancer-free (i.e. no recurrence) after five years?
Let X denote the number of individuals who are cancer-free after five years;
then X ~ Bi(n=100, p=0.70).
Using R, we find Pr(X ≥ 80) = 0.0165. Thus, if we observed that 80/100
were cancer-free after five years with the new treatment, we would suspect that
the recurrence rate was actually less than 30%, and that the new treatment was
better than the standard treatment.
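The tail probability quoted here can be reproduced with (a sketch):
1 - pbinom(79, size = 100, prob = 0.7)   # Pr(X >= 80) = 0.0165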
4.3.1 Introduction
process                                          “event”
radioactive decay                                arrival of particle
telephone exchange                               arrival of call
disease occurrence                               individual develops disease
production of material (thread, plate, solid)    occurrence of flaw
distribution of organisms in a region            organism
[Figure: the time interval (0, t).]
The time interval (0, t) is divided into n intervals, each of length δt = t/n:
trial = interval, success = “event”, p ≈ α δt.
The case that we consider most often is the disease occurrence process, which we consider
at length below. In that case, the rate of the process corresponds to the incidence rate.
DEFINITION 4.3.1. The Poisson distribution with rate parameter λ is defined by the
pmf
p(x) = e^{−λ} λ^x / x!,  for x = 0, 1, 2, . . . .
If a discrete random variable X has a Poisson distribution we say that X ~ Pn(λ).
In R, the pmf and cdf are given by the functions dpois() and ppois(), respectively.
There are tables of the Poisson pmf in the Statistical Tables (Table 3).
EXAMPLE 4.3.1: Suppose that X ~ Pn(5.6). The graph of the pmf is shown
below.
[Figure: pmf of Pn(5.6), spikes at x = 0, 1, 2, . . .]
DEFINITION 4.3.2. Suppose that X ~ Pn(λ). Then
1. E(X) = λ; and
2. var(X) = λ.
EXAMPLE 4.3.2: If X ~ Pn(20), then E(X) = 20 and sd(X) = 4.47:
Pr(12 ≤ X ≤ 28) ≈ 0.95.
If X ~ Pn(200), then E(X) = 200 and sd(X) = 14.1:
Pr(172 ≤ X ≤ 228) ≈ 0.95.
DEFINITION 4.3.3. Let X(t) be the number of “events” in (0, t) and suppose that α is
the expected number of events per unit time. Then X(t) ~ Pn(αt).

Pr(X(t) = k) = lim_{n→∞} C(n, k) (αt/n)^k (1 − αt/n)^{n−k}
             = lim_{n→∞} [n^k / k!] · [(αt)^k / n^k] · (1 − αt/n)^n (1 − αt/n)^{−k}
             = (αt)^k e^{−αt} / k!,   since (1 + a/n)^n → e^a as n → ∞.
Similarly, the number of “events” in any interval of length t, i.e. an interval (s, s + t) for any
s > 0, is distributed as Pn(αt).
The incidence rate is the rate at which the disease occurs. We can model this as a Poisson
process: an “event” is one individual contracting the disease:
Pr(one individual contracts the disease in (t, t+dt)) = αdt.
Here, the time, t, denotes the time for one person (i.e. person-time). When we come to
dealing with the population, we need to add up all the person-times. Observing one person
for ten years is taken to be equivalent to observing ten people for one year. In this context,
time is the number of “person-years” of follow-up of the population. For example, in a 30
year study, if an individual leaves the study after five years, then that person contributes
only 5 “person-years”.
Now, using the result for a Poisson process we have
X(t) ~ Pn(αt),
where X(t) denotes the number of cases in a population followed up for a total of t person-
years, and where the incidence rate is α (cases per year).
Note that incidence rate has dimension “cases per unit time” or [case] time⁻¹.
Incidence rates effectively treat one unit of time as equivalent to another regardless of
which person they come from or when they occurred. Incidence is usually concerned
with “once-only” events, i.e. events that can occur only once: for example, death due to
leukæmia. Events other than death can be made “once-only” by considering the first oc-
currence of the event. For example, we consider the occurrence of the first heart attack in
an individual and ignore (or study separately) second and later heart attacks.
An individual does not contribute to “person-time” after getting the disease. Other indi-
viduals may be observed for less than the period of the study: they may join the study late,
leave early by moving, dying or otherwise becoming ineligible. In our applications the
time we consider is observed “person-time”. Ten people observed for a year, or one person
observed for ten years, or four people observed for 1, 2, 3 and 4 years respectively, are all
equivalent to 10 person-years.
EXAMPLE 4.3.6: If the incidence rate is 0.0356 cases per person-year, then
1/0.0356 = 28.1. Thus, roughly, we expect one case per 28.1 person-years. So the
“mean waiting time” until one individual gets the disease is 28.1 years:
mean waiting time = 1 / incidence rate.
A rate must have a time unit. However, incidence rates are often expressed in the form
of 50 cases per 100 000 and described as “annual incidence”. This is a bit like describing
speed as an “hourly distance”. More precisely, an annual incidence of 50 cases per 100 000
is 0.0005 cases/year.
If the time unit is missing from an incidence rate, assume it is a year. For example (Triola
& Triola, p.137): “For a recent year in the US, there were 2 416 000 deaths in a population
of 285 318 000 people.” Thus the annual incidence (the mortality rate) is
2 416 000 / 285 318 000 = 0.0085 (deaths/person-year), or 8.5 per thousand person-years.
This too is a bit unusual, in that we are imagining here that the entire population is ob-
served. In any application, this will usually not be the case. We take a sample and use that
to estimate the population incidence rate (Chapter 5).
This approximation is fine provided the risk is small, and there is no population depletion.
This is often the case.
The number of cases in a one-year period, X ~ Pn(2.2),
since mean = α×t = 0.000011 × 200 000 = 2.2.
Stratified population
It follows that if we are dealing with a stratified population with different incidence rates
in each stratum, the total number of cases in the population still has a Poisson distribution:
X1 + X2 + · · · + Xc ~ Pn(α1 t1 + α2 t2 + · · · + αc tc).
Thus, it is enough to find the expected number of cases,
λ = α1 t1 + α2 t2 + · · · + αc tc.
The total number of cases has a Poisson distribution with this mean.
f(x) = (1 / (σ√(2π))) e^{−(x−µ)² / (2σ²)},   (x ∈ R),
then we say that X has a normal distribution and we write X ~ N(µ, σ²).
DEFINITION 4.4.2. If X ~ N(µ, σ²), then:
1. E(X) = µ; and
2. var(X) = σ².
If Z ~ N(0, 1), then we say that Z has a standard normal distribution.
If Z ~ N(0, 1), then: E(Z) = 0, var(Z) = 1.
The cdf of Z ~ N(0, 1) is denoted by
Φ(z) = ∫_{−∞}^{z} (1/√(2π)) e^{−t²/2} dt.
Table 5 gives values of Φ(z) for 0 ≤ z ≤ 4.
The standard normal cdf is available on many calculators. It is available in R using the
command pnorm().
EXERCISE. If Z ~ N(0, 1), use the Tables or calculator or computer to check that:
Pr(Z < 1) = 0.8413; Pr(Z > 0.234) = 0.4075;
Pr(−1.5 < Z < 0.5) = 0.6247. In R:
> pnorm(1) # normal cdf at 1
[1] 0.8413447
> 1-pnorm(0.234) # P(Z > 0.234)
[1] 0.4074925
> pnorm(0.5)-pnorm(-1.5) # P(-1.5 < Z < 0.5)
[1] 0.6246553
DEFINITION 4.4.3. Standardisation theorem: If X ~ N(µ, σ²), then
Xs = (X − µ)/σ ~ N(0, 1).
The standardisation theorem allows us to evaluate normal probabilities using the Tables.
EXAMPLE 4.4.1: If X ~ N(10, 5²) then
Pr(X < 8) = Pr((X − 10)/5 < (8 − 10)/5) = Pr(Xs < −0.4) = 0.3446.
EXAMPLE 4.4.2: If X ~ N(65, 10²) then: . . .
EXAMPLE 4.4.4: If X ~ N(10, 4) then:
c0.75(X) = 10 + 2×0.6745 = 11.35;  c0.975(X) = 10 + 2×1.9600 = 13.92;
c0.25(X) = 10 − 2×0.6745 = 8.65;   c0.025(X) = 10 − 2×1.9600 = 6.08.
In R we use the command qnorm. For example, the first quantiles in the above
example can be equivalently obtained by:
We have seen that x(k) ≈ ĉq where q = k/(n+1). It follows that the minimum of a
sample of 100, x(1) ≈ ĉq , where q = 1/101.
Of course these are rough approximations, since the data are random, but we
can expect values around about these values. Thus the “expected” five-number
summary would be:
(96.7, 113.3, 120, 126.7, 143.3).
This gives some idea of what is meant by “rough approximations” in this case.
You can generate one such random sample in R using rnorm(100, mean=120,
sd=10) and then summary().
The Central Limit Theorem says that the sum of a large number of similarly distributed
random variables which are independent, but which may have any distribution, is asymp-
totically normally distributed.
It is always true that:
If T = X1 + X2 + · · · + Xn , where X1 , X2 , . . . , Xn are independent observations on X, where
E(X) = µ and var(X) = σ², then:
E(T) = nµ,  var(T) = nσ²,  sd(T) = σ√n.
The central limit theorem says that, in addition, if n is large, then T ≈ N(nµ, nσ²), approximately.
This is a really amazing result, and is the fundamental reason for the importance of the
normal distribution.
Any variable which can be considered as being composed of the sum of many small influ-
ences will be approximately normally distributed.
E XERCISE . Try it out in R: use x <- runif(100) to generate a vector with 100 random
values and then find the sum sum(x).
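Repeating the experiment shows the central limit theorem at work (a sketch; replicate() and hist() are base R):
sums <- replicate(1000, sum(runif(100)))
hist(sums)   # roughly normal: mean 50, sd = sqrt(100/12), about 2.9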
This result is an indication of the power of probability. We don’t need to worry about what
is possible, it is more productive to consider what is probable. What is possible is often so
wide-ranging as to be useless; what is probable focuses on what is important.
Also (and not unrelated) the normal distribution often occurs as a limiting case of particular
distributions (Binomial, Poisson, and others).
• Bi(n, p) ≈ N(np, np(1−p)) as n → ∞ (for np ≥ 5 and nq ≥ 5).
• Pn(λ) ≈ N(λ, λ) as λ → ∞ (for λ ≥ 10).
The approximation technique is simple for continuous random variables — but there is a
slight complication in approximating a discrete distribution by a continuous distribution.
Let T = X1 + X2 + · · · + X24 .
E(X) = 7/2, var(X) = 35/12, so E(T) = 84 and var(T) = 70.
The exact distribution of T is very messy indeed; the normal distribution pro-
vides a good approximation.
EXAMPLE 4.4.8: If X ~ Bi(40, 0.4), use a normal approximation to find an
approximate value for Pr(X ≥ 20).
If X ~ Bi(40, 0.4), the approximating normal random variable is X* ~ N(16, 9.6).
With the continuity correction, we use Pr(X* > c − 0.5), where X* ~ N(0.5n, 0.25n).
It is seen that the continuity-corrected approximation does quite well for n = 100 and better
still for n > 1000. On the other hand, the uncorrected approximation is still a bit out even
for n = 100 000.
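For instance, for Example 4.4.8 the corrected and uncorrected approximations can be compared with the exact value (a sketch):
1 - pnorm(19.5, mean = 16, sd = sqrt(9.6))   # corrected: about 0.129
1 - pnorm(20,   mean = 16, sd = sqrt(9.6))   # uncorrected: about 0.098
1 - pbinom(19, size = 40, prob = 0.4)        # exact: about 0.13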
Note: We don’t actually need to use normal approximations to evaluate these probabilities,
since the correct answer is readily available from the computer or calculator. However,
when we come to use formulae based on normal approximations for confidence intervals
and significance testing, these results suggest that a correction for continuity should be
used in the procedure. Generally, this means an adjustment in the formula.
EXAMPLE 4.4.9: If X ~ Pn(32.4), use a normal approximation to find an ap-
proximate value for Pr(X ≥ 40).
The approximating normal random variable is X* ~ N(32.4, 32.4).
Pr(X ≥ 40) ≈ Pr(X* > 39.5) = Pr(Xs* > (39.5 − 32.4)/√32.4) = Pr(N > 1.247) =
0.106.
The uncorrected approximation gives Pr(N > 1.335) = 0.091. The correct value
is 0.1086.
EXAMPLE 4.4.10: X1 ~ N(68, 10²), X2 ~ N(60, 15²).
Assuming X1 and X2 are independent, we have:
T = 0.5X1 + 0.5X2 ~ N(64, 9.0²);
S = 0.2X1 + 0.8X2 ~ N(61.6, 12.2²).
EXAMPLE 4.4.12: A process requires three stages: the total time taken (in hours)
for the process, T = T1 +T2 +T3 , where Ti denotes the time taken for the ith stage.
It is known that:
E(T1 ) = 40, E(T2 ) = 30, E(T3 ) = 20; sd(T1 ) = 3, sd(T2 ) = 2, sd(T3 ) = 5.
There is a deadline of 100 hours. Give an approximate probability that the dead-
line is met, i.e. find Pr(T 6 100). Assume that the times are approximately nor-
mally distributed. Which stage is most influential in determining whether the
deadline is met?
T ≈ N(90, 38), since µ = 40 + 30 + 20 = 90 and σ² = 3² + 2² + 5² = 38; hence Pr(T ≤ 100) ≈ 0.948.
The stage with the greatest variance: i.e. stage 3. To understand this, think about
what would happen if sd(T1 ) = 0: in that case T1 is a constant and hence can
have no effect on whether the deadline is met. On the other hand, if sd(T3 ) is
very large, then T3 could be far above its mean, making it very unlikely that the
deadline could be met; and similarly T3 could be far below its mean, making
it very probable that the deadline is met. So T3 has a big effect on whether the
deadline is met.
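In R (a sketch):
pnorm(100, mean = 90, sd = sqrt(38))   # Pr(T <= 100) = 0.948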
Problem Set 4
4.1 The discrete random variable X, with sample space {1, 2, 3, 4, 5, 6}, has pmf
p(x) = (2x − 1)/36   (x = 1, 2, 3, 4, 5, 6).
(a) Sketch the graph of the pmf.
(b) Find Pr(X = 2), Pr(X > 4) and Pr(2 < X 6 5).
(c) Sketch the graph of the cdf, and indicate the probabilities found in (b) on your sketch.
4.2 A surgical hospital needs blood supplies. Suppose its daily demand for blood X, in hundreds
of litres, has cdf
F(x) = 1 − (1 − x)^4   (0 < x < 1).
i. Find the probability that the daily demand exceeds 10 litres.
ii. Find the level at which the blood supply should be kept so that there is only a 1% chance
that the demand exceeds the supply.
4.3 Let X denote the result of a toss of a fair coin: X is the number of heads obtained; so that
X = 1 if a head is obtained, and X = 0 if a tail is obtained.
(a) Consider the random variable obtained by observing the result of one toss and multiply-
ing it by 2. What is the distribution of this random variable? i.e. with what probability is
it equal to 0, 1, 2?
Draw a diagram representing this distribution.
(b) Consider the random variable obtained by observing the result of one toss and adding it
to the result of a second toss. What is the distribution of this random variable? i.e. with
what probability is it equal to 0, 1, 2?
Draw a diagram representing this distribution.
(c) Do these random variables have the same mean? If not, which is bigger?
Do they have the same spread? If not, which is more spread?
The random variable in (a) is 2X , i.e. X+X , the sum of identical variables; while the random
variable in (b) is X1 +X2 , the sum of independent variables, each having the same distribution.
Their distributions are not the same.
When you are concerned with a sum, it will usually be a sum of independent variables, which
is not the same as a sum of identical variables, i.e. not nX .
4.4 Evaluate the following probabilities:
(a) i. Pr(X ≤ 3) for X ~ Bi(10, 0.2).
    ii. Pr(3 < X ≤ 7) for X ~ Bi(15, 0.6).
    iii. Pr(1 ≤ X ≤ 3) for X ~ Bi(6, 0.25).
    iv. Pr(X ≥ 16) for X ~ Bi(20, 0.75).
(b) i. Pr(X ≤ 2) for X ~ Pn(5.2).
    ii. Pr(X ≥ 1) for X ~ Pn(0.9).
    iii. Pr(3 ≤ X ≤ 6) for X ~ Pn(4.6).
    iv. Pr(X ≥ 10) for X ~ Pn(3.1).
4.5 According to a national survey, 10% of the population of 18–24-year-olds in Australia are left-
handed.
(a) In a tutorial class of 20 students, how many would you expect to be left-handed? What is
the probability that in a class of 20 students, at least four of them are left-handed?
(b) In a lecture class of 400 students, how many would you expect to be left-handed? A
survey result shows that there are actually 60 left-handed students in the class. What is
the probability that in a class of 400 students, at least 60 of them are left-handed?
(c) What have you assumed in these probability calculations?
4.6 The number of cases of tetanus reported in a single month has a Poisson distribution with
mean 4.5. What is the probability that there are at least 35 cases in a six-month period?
4.7 The expected number of deaths due to bladder cancer for all workers in a tyre plant over
a 20-year period, based on national mortality rate, is 1.8. Suppose 6 deaths due to bladder
cancer were observed over the period among the tyre workers. How unusual is this event? i.e.
evaluate Pr(X ≥ 6) assuming the national rate is applicable.
4.8 (a) For X ~ N(µ = 50, σ² = 10²), find
i. Pr(X ≤ 47); ii. Pr(X ≥ 64); iii. Pr(47 < X ≤ 64);
iv. c, such that Pr(X ≥ c) = 0.95; v. c, such that Pr(X < c) = 0.025.
(b) Use a Normal approximation to evaluate Pr(X ≤ 150) where X ~ Bi(1000, 1/6); and check
your approximation by obtaining the exact value using R.
4.9 A standard test for gout is based on the serum uric acid level. The serum uric acid level,
L mg/100L is approximately Normally distributed: with mean 5.0 and standard deviation 0.8
among healthy individuals; and with mean 8.5 and standard deviation 1.2 among individuals
with gout.
Suppose we diagnose people as having gout if their serum uric acid level is greater than
6.50 mg/100L.
(a) Find the sensitivity of this test.
(b) Find the specificity of this test.
4.10 A random sample of 100 observations is obtained from a Normally distributed population
with mean 240 and standard deviation 40.
Sketch a likely boxplot for these data.
4.11 A medical trial was conducted to investigate whether a new drug extended the life of a patient
with lung cancer.
Assume that the survival time (in months) for patients on the drug is Normally distributed
with a mean of 30 and a standard deviation of 15. Calculate:
i. the probability that a patient survives for no more than one year;
ii. the proportion of patients who are expected to survive for between one and two years;
iii. the time for which at least 80% of the patients are expected to survive;
iv. the expected quartiles of the survival times.
The survival times (in months) for 38 cancer patients who were treated with the drug are:
1 1 5 9 10 13 14 17 18 18 19 21 22
25 25 25 26 27 29 36 38 39 39 40 41 41
43 44 44 45 46 46 49 50 50 54 54 59
The sample mean is 31.1 months and the sample standard deviation is 16.0 months.
Is there any reason to question the validity of the assumption that T ~ N(µ=30, σ=15)?
4.12 Two scales are available for measuring weights in a laboratory. Both scales give answers that
vary a bit in repeated weighings of the same item. If the true weight of a compound is 2 grams
(g), the first scale produces readings X that have mean 2.000 g and standard deviation 0.004 g.
The second scale’s readings Y have mean 2.002 g and standard deviation 0.002 g.
(a) What are the mean and standard deviation of the difference, Y −X, between the readings?
(Readings X and Y are independent.)
(b) You measure once with each scale and average the readings. Your result is Z = (X + Y)/2.
Find the mean and standard deviation of Z, i.e. µZ and σZ .
(c) Which of the three readings would you recommend: X, Y or Z? Justify your answer.
(d) Assuming X and Y are independent and normally distributed, evaluate:
Pr(1.995 < X < 2.005), Pr(1.995 < Y < 2.005), Pr(1.995 < Z < 2.005).
4.13 In a particular population, adult female height X ~ N(165.4, 6.7²) and adult male height
Y ~ N(173.2, 7.1²).
(a) Sketch the pdfs of X and Y on the same graph.
(b) Assuming X and Y are independent, specify the distribution of X−Y and hence find
Pr(X > Y ). This gives the probability that a randomly selected adult female is taller
than a randomly selected adult male.
4.14* Suppose that the survival time after prostate cancer (in years), Y, has a lognormal distribution:
Y ~ ℓN(2, 1). This means that ln Y ~ N(2, 1) (by definition).
(a) Find Pr(Y > 10).
(b) Find the median and the quartiles of Y .
(c) Is the distribution of Y positively skew, symmetrical or negatively skew? Is the mean of
Y greater than, equal to, or less than the median of Y ?
(d) Draw a rough sketch of the pdf of Y .
4.15 Suppose a standard antibiotic kills a particular type of bacteria 80% of the time. A new antibi-
otic (XB) is reputed to have better efficacy than the standard antibiotic. Researchers propose to
try the new antibiotic on 100 patients infected with the bacteria. Using principles of hypothesis
testing (discussed in Chapter 6), researchers will deem the new antibiotic “significantly better”
than the standard one if it kills the bacteria in at least 88 out of the 100 infected patients.
Suppose there is a true probability (true efficacy) of 85% that XB will work for an individual
patient.
(a) Calculate the probability (using R) that the experiment will find that XB is “significantly
better”.
(b) The statistical power is the probability of obtaining a significant result: it is the ability
to discover a better treatment (in this case a better antibiotic). So it’s an indication of the
value of the procedure.
i. Find the statistical power if the true efficacy of a new antibiotic is actually 90%.
ii. What is the power if the true efficacy is 95%?
iii. What is the power if the true efficacy is really 80%? What does this mean?
Chapter 5
ESTIMATION
“You can, for example, never foretell what any one (individual) will do, but you can say with preci-
sion what an average number will be up to.” Sherlock Holmes, The Sign of the Four, 1890.
Chapter 1 provides an indication of where the data we analyse comes from. Chapter 2 tells us
something about what to do with a data set, or at least how to look at it in a sensible way. Chapters
3 & 4 gave us an introduction to models for the data. Now we turn to making inferences about the
models and the populations that they describe. This is the important and useful stuff. Statistical
inference is the subject of the rest of the book. We start with Estimation in this chapter.
[Schematic: population −→ sample; model ←− observations. Ch 1 (types of studies) describes
how the sample arises from the population; Ch 2 covers data description; Chs 3 & 4 (probability)
take us from a model to the observations it generates; statistical inference (Chs 5–8) takes us
from the observations back to the model and the population.]
W = g(X1 , X2 , . . . , Xn ).
Also, the sample median, the sample standard deviation, the sample interquartile range,
etc., are statistics which are used as estimators of their population counterparts.
The statistic W is a random variable; its realisation is given by the same function applied
to the observed sample values:
w = g(x1 , x2 , . . . , xn ).
For example, the sample mean:
x̄ = (1/n)(x1 + x2 + · · · + xn) = (1/100)(11.43 + 8.27 + · · · + 9.19) = 10.13.
A statistic has a dual role: a measure of a sample characteristic and an estimator of the
corresponding population characteristic.
Each of the statistics we have met can be regarded as an estimator of a population character-
istic, usually referred to as a parameter. The statistic X̄ is an estimator of the parameter µ.
If the statistic W is an estimator of the parameter θ, then in order to make inferences about
θ based on W , we need to know something about the probability distribution of W .
The sample mean X̄ is an estimator of the parameter µ, so to make inferences about µ based
on X̄, we need to know something about the probability distribution of X̄. Thus, we turn
to consideration of the probability distribution of the sample mean.
population −→ sample
population mean µ ←− sample mean, x̄
Hence:
E(X̄) = E[(1/n)(X1 + X2 + · · · + Xn)] = (1/n)(µ + µ + · · · + µ) = (1/n)(nµ) = µ
var(X̄) = var[(1/n)(X1 + X2 + · · · + Xn)] = (1/n²)(σ² + · · · + σ²) = (1/n²)(nσ²) = σ²/n
Thus:
E(X̄) = µ and var(X̄) = σ²/n.
Further, from the Central Limit Theorem (which says that the sum of a lot of independent
variables is approximately normal) we have:
X̄ ≈ N(µ, σ²/n),
and this approximation applies (for large n) no matter what the population distribution.
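This is easy to see by simulation. A minimal R sketch (the exponential population here is an arbitrary skewed choice, not an example from these notes):

xbar <- replicate(10000, mean(rexp(50, rate = 1)))  # 10,000 sample means, each with n = 50
hist(xbar, breaks = 50)      # looks approximately normal, despite the skewed population
c(mean(xbar), sd(xbar))      # close to mu = 1 and sigma/sqrt(n) = 1/sqrt(50) = 0.141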
EXAMPLE 5.1.2: Sample of n = 100 on X ~ N(µ=10, σ²=4): X̄ ~ N(10, 4/100).
E(X̄) = 10, sd(X̄) = 0.2; so with probability 0.95, X̄ will be in the interval
10 ± 1.96×0.2, i.e. (9.61, 10.39).
E(X̄) = 55.4, sd(X̄) = 15.2/√180 = 1.13. With probability about 0.95: 53.2 < X̄ < 57.6.
Further, by the Central Limit Theorem, for a large sample (i.e. large n):
X̄ ≈ N(µ, σ²/n). (1)
Thus with the supposed population distribution, we would expect that X̄ would
be in the interval 50 ± 4.48, i.e. 45.5 < X̄ < 54.5 with probability 0.95. So, the
observation x̄ = 53.8 is quite in line with what is expected under the proposed
model. This result gives us no real reason to question it.
X̄ ≈ N(50, 10²/400), since n = 400, µ = 50 and σ² = 10²;
Pr(49 < X̄ < 51) = Pr(−2 < X̄s < 2) = 0.9544.
Given µ, we can make a statement about X̄:
Pr(µ − 1.96σ/√n < X̄ < µ + 1.96σ/√n) ≈ 0.95.
We have X̄ ≈ N(µ, 100/400), i.e. X̄ ≈ N(µ, 0.5²).
∴ Pr(µ − 1.96×0.5 < X̄ < µ + 1.96×0.5) ≈ 0.95
So (“95%”) plausible values for µ are 50.8 ± 0.98, i.e. 49.82 < µ < 51.78.
If µ=49.82, then what we observed (x̄=50.8) would be just on the upper “plau-
sible” limit for X̄ values.
If µ=51.78, then what we observed would be on the lower “plausible” limit.
DEFINITION 5.2.1. The estimate of the standard deviation of X̄ is called the standard
error of X̄, denoted by se(X̄). It is obtained by replacing the unknown parameter σ by
an estimate, i.e.
sd(X̄) = σ/√n ≈ σ̂/√n = se(X̄).
If we observed the sample mean, x̄ = 50.8, and sample standard deviation, s = 11.0,
what are plausible values for the unknown population mean µ?
An estimate of µ is x̄ = 50.8.
The standard error of this estimate is se(x̄) = s/√n = 11.0/√400 = 0.55.
This gives some idea of the precision of the estimate. We expect, roughly with
probability 0.95, that x̄ will be within 1.96×0.55 ≈ 1.1 of µ.
Therefore, we expect (with probability about 0.95) that µ will be within 1.1 of
x̄=50.8, i.e. 49.7 < µ < 51.9. So this gives a rough 95% confidence interval for µ.
This leads to a recipe for an approximate 95% confidence interval applicable in many
situations: est ± “2”se.
For our purposes, the estimates we use are intuitive and obvious. By and large, we use
a sample statistic to estimate its population counterpart. The following are the point esti-
mates of µ or σ² that we use for normal populations:
the estimate of the population parameter µ is denoted by µ̂; we choose µ̂ = x̄;
the estimate of the population parameter σ² is denoted by σ̂²; we choose σ̂² = s².
(b) Compute a point estimate for σ, the population standard deviation of the
BMI of first-year university students. What statistic did you use to obtain
your estimate? [2.37]
It is desirable to give a point estimate along with a standard error (se) which indicates how
much error there might be associated with the estimate.
A standard error of an estimate is an estimate of the standard deviation of the estimator.
As indicated above, [est ± “2”se] enables us to find an approximate 95% confidence interval
for µ. But when we use a standard error, there is a complication: the 1.96 applies only if we
know the standard deviation. If it’s unknown and we need to use a standard error, then we
need to use a different “2”.
In any case though, a rough approx 95% CI is given by est ± 2se.
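In R, the rough interval is a one-liner (a sketch using the worked values above: x̄ = 50.8, s = 11.0, n = 400):

xbar <- 50.8; s <- 11.0; n <- 400
se <- s / sqrt(n)        # 0.55
xbar + c(-2, 2) * se     # rough 95% CI: (49.7, 51.9)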
Abstraction again
Population parameters are yet another abstraction. In statistics, we think of these as fixed
but unknown constants. To the extent that they are unknown, they are abstract; we usually
can’t identify them. But we are vitally interested in their values: we make inferences about
them.
Yet another example of a substantial abstraction is the hypothetical endless repetition of
the same study, under identical conditions. We indulge in this thought experiment when
we interpret the meaning of a probability, and specifically the meaning of the “95%” in a
95% confidence interval.
µ is a constant; X̄ is a random variable. Hence the interval endpoints are random: it is
the interval that is random, not µ.
In this case, the unknown parameter is µ, and the 95% confidence interval is:
(x̄ − 1.96σ/√n, x̄ + 1.96σ/√n).
We know that sd(X̄) = σ/√n. If σ is known, there is no need to estimate it and thus, in this
case, se(x̄) = σ/√n. Thus the exact 95% confidence interval x̄ ± 1.96σ/√n is very close to the
approximate version: est ± “2”se. In this case, “2” = 1.96.
EXAMPLE 5.4.1: Suppose the population is normal with unknown mean, but
with known standard deviation 2.5, i.e. X ~ N(µ, 2.5²).
We take a random sample of n=40 and observe x̄=14.73 (cf. the above example,
but here σ is assumed known).
90% CI for µ: x̄ ± 1.6449σ/√n = 14.73 ± 1.6449×2.5/√40 = (14.08, 15.38).
[1.6449 = c0.95(N), −1.6449 = c0.05(N)]
Note: the point estimate is the 0% CI, i.e. x̄ ± 0; the 100% CI is (−∞, ∞).
estimate = 108.2, standard error = 15/√125 = 1.34;
approx 99% confidence interval = 108.2 ± 2.5758×1.34 = (104.7, 111.7).
EXERCISE. Check that an approximate 90% confidence interval is given by (106.0, 110.4).
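These σ-known intervals are quick to verify in R (a sketch; the last line checks the exercise):

est <- 108.2; sigma <- 15; n <- 125
se <- sigma / sqrt(n)                 # 1.34
est + c(-1, 1) * qnorm(0.995) * se    # 99% CI: (104.7, 111.7)
est + c(-1, 1) * qnorm(0.95) * se     # 90% CI: (106.0, 110.4)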
Statistic-parameter diagram
[Figure: statistic-parameter diagram, with the statistic (x̄) on the horizontal axis and the
parameter (µ) on the vertical axis; the horizontal band between the two plotted lines is the
probability interval, and the vertical band is the confidence interval.]
For each value of the parameter (µ), the end-points of the 95% probability interval for
the statistic (X̄) are plotted, using the result specifying the distribution of the statistic:
(µ − 1.96σ/√n, µ + 1.96σ/√n).
This is done for each possible value of the parameter. The result is two lines corresponding
to the lower and upper ends of the probability interval, as shown in the diagram.
Given a value of the parameter (µ) the horizontal interval between these two lines is the
(95%) probability interval for the statistic X̄. This corresponds to equation (5).
Given an observed value of the statistic (x̄), the vertical interval between the two lines is
the (95%) confidence interval for the parameter (µ). This corresponds to the ‘inversion’ of
the probability statement to make µ the subject, as represented in equation (6).
The confidence interval is seen to be the set of values of the parameter that make the ob-
served value of the statistic “plausible” (i.e. within the 95% probability interval).
It is seldom the case in practice that σ is known, but in some cases, assuming a value for σ
yields a useful approximation.
The width of a 95% confidence interval = 2×1.96×3.4/√n ≤ 1;
therefore, we require √n ≥ 2×1.96×3.4, i.e. n ≥ 177.6.
So, a random sample of 178 would be required.
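The same calculation in R (a sketch, with σ = 3.4 and required width 1 as above):

sigma <- 3.4; width <- 1
n <- (2 * 1.96 * sigma / width)^2   # from: width = 2 * 1.96 * sigma / sqrt(n)
ceiling(n)                          # 178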
Let P̂ denote the (random) proportion in a random sample that have the attribute. Obvi-
ously, P̂ is an estimator of p. But what is the sampling distribution of P̂ ?
The estimator, P̂ = X/n, where
X = number of individuals with attribute A in a sample of n.
We define “success” as having attribute A. If the sample is randomly selected, each indi-
vidual selected can be regarded as an independent trial, and we have
X = number of successes in n independent trials,
for which we know that
X ~ Bi(n, p),
where p = Pr(A), the proportion of the population with attribute A.
Therefore:
E(X) = np and var(X) = npq.
It follows that
E(P̂) = p and var(P̂) = pq/n.
Thus, P̂ is an unbiased estimator of p, with variance pq/n. We see that, like the sample
mean, the estimator P̂ → p as n → ∞, since its variance goes to zero as the sample size
tends to infinity.
Actually P̂ is a sample mean. It is equivalent to Z̄, where Zi = 1 if individual i has
attribute A, and 0 otherwise. So, P̂ has the properties of a sample mean: it is unbiased
and it is asymptotically normal.
We have:
P̂ = (1/n) Bi(n, p) ≈ N(p, pq/n), for large n.
This specifies the distribution of the estimator of p. For a large sample, the estimator is
approximately normally distributed.
EXAMPLE 5.5.3: A random sample of n=100 observations is obtained on Y ~
N(50, 10²). Let U denote the number of observations in this sample that fall
in the interval 50 < Y < 60. What is the distribution of U? Specify a 95%
probability interval for U.
Here “success” = “50 < Y < 60”, and we have 100 independent trials;
p = Pr(A) = Pr(50 < Y < 60) = Pr(0 < Ys < 1) = 0.3413;
thus U ~ Bi(100, 0.3413).
E(U) = 34.13 and sd(U) = √(100×0.3413×0.6587) = 4.74.
So, an approximate 95% probability interval for U is 24.6 < U < 43.6. Thus you
should expect that 25 ≤ U ≤ 43, and be somewhat surprised if it were not so.
Inference on p
A point estimate of p is p̂ = x/n.
An interval estimate of p is obtained from the distribution of the estimator.
If n is large, we have: P̂ ≈ N(p, pq/n) [cf. X̄ ~ N(µ, σ²/n)].
But here the “σ²” is not known: it depends on the unknown p. However, the sample is
large and so, to a good approximation, we can take pq ≈ p̂(1−p̂), and act as if it is known.
Thus, we use the approximate result: P̂ ≈ N(p, p̂(1−p̂)/n),
in which the variance is replaced by its estimate, and we assume the variance is “known”
to be this value.
Note that sd(P̂) = √(p(1−p)/n), and so se(p̂) = √(p̂(1−p̂)/n).
Using this approximation gives the “standard” result:
approx 95% CI for p: p̂ ± 1.96√(p̂(1−p̂)/n) [est ± “2”se]
This gives quite a reasonable approximation when the sample is large.
A confidence interval is assessed by its coverage: a 95% confidence interval for any
parameter θ is supposed to contain the true value of θ with probability 0.95.
When an approximate 95% confidence interval is used, the coverage will differ from 0.95.
For a Binomial parameter, the approximate normal-based 95% confidence interval has cov-
erage which tends to be less than 0.95, partly because of the non-normality (in particular
the skewness) and partly because of the discreteness. Further, as well as the normal ap-
proximation, the above approximate confidence interval also uses an approximation to the
standard error. It really is a rough approximation, but it gives us some indication at least.
The “right” answer, the exact 95% confidence interval, can be obtained using R, or from the
Binomial Statistic-Parameter diagram (Statistical Tables: Figure 2).
We can get closer to the exact confidence interval by making corrections (see below) to the
“standard” approximation. However, our approach (in EDDA) is to use the basic formula
(i.e. est ± “2”se) as the approximation, but to keep in mind its deficiencies. If a precise
confidence interval is required, then go to the computer for the exact result.
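For example, with x = 20 successes in n = 100 trials (an illustrative choice, cf. Problem 5.10), the rough interval and the exact one from R compare as follows:

phat <- 20/100
phat + c(-1, 1) * 1.96 * sqrt(phat * (1 - phat) / 100)   # approx 95% CI: (0.122, 0.278)
binom.test(x = 20, n = 100)$conf.int                     # exact 95% CI: about (0.13, 0.29)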
Again, p is unknown. If we are confident that p ≈ pa , then pa can be used in the above
formula.
Here, pa may denote an estimate from an earlier sample, or it may be a value based on
historical values or expert judgement. To be on the safe side, we should try to choose a
value on the 0.5-side of p, so that we are over-estimating rather than under-estimating the
variance. This gives a conservative value; i.e. one for which it is likely that the sample will
produce a confidence interval with margin of error less than the specified value, d.
If we have absolutely no idea about p, then we should use pa = 0.5, since this gives the
maximum value of pa (1 − pa ). This maximum value is 0.25.
Note that this result is based on the basic normal approximation, which will give a rea-
sonable answer provided the resulting sample size is large. Some checking might be in
order if the formula indicates a relatively small sample would be adequate. However, this
is unlikely unless a relatively wide margin of error is specified.
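A sketch of the calculation in R, assuming the standard requirement that the margin of error 1.96√(pa(1−pa)/n) be at most d (with d = 0.03 purely for illustration):

d <- 0.03; pa <- 0.5                    # conservative choice: pa*(1 - pa) is maximised at 0.5
ceiling(1.96^2 * pa * (1 - pa) / d^2)   # n = 1068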
In the case of a population rate, a similar approach to that used for the population
proportion can be used; now it is based on the Poisson distribution rather than the
Binomial distribution.
Estimating an incidence rate α where X cases are obtained from observation of t person-
years.
The estimator of α is X/t, where X ~ Pn(αt) [cf. X/n, where X ~ Bi(n, p)].
X/t ≈ N(α, α/t) ≈ N(α, α̂/t) [cf. X/n ≈ N(p, pq/n) ≈ N(p, p̂q̂/n)].
Thus, the point estimate is α̂ = x/t, with standard error se(α̂) = √(α̂/t) = √x/t, and the
approximate 95% confidence interval is given by:
approx 95% CI for α: α̂ ± 1.96√(α̂/t) [est ± “2”se]
Note: This approximate 95% CI for α has the same sort of problems as the approximate
95% CI for p: it can be “corrected” (see below), but we will use this basic form as a rough
approximation, and use the computer to generate an exact answer if required.
This approximate 95% confidence interval can be improved in the same sort of way as the
approximate 95% confidence interval for p.
(1) The skewness correction means that the approx CI needs to be shifted upwards. The
simplest way to do this is to use Agresti’s formula: for the purposes of computing the CI, use
α̃ = (x+2)/t instead of α̂.
(2) The correction for continuity means that the margin of error needs to be increased by 0.5/t.
‘better’ approx 95% CI: α̃ ± (1.96√(α̃/t) + 0.5/t), where α̃ = (x+2)/t.
The “right” answer, the exact 95% confidence interval is given by the R function
poisson.test().
The exact result can also be obtained from the Poisson statistic-parameter diagram (Statis-
tical Tables: Figure 4). As far as the accuracy of the graph allows, this gives an exact 95%
confidence interval for λ = αt (for an observed x). The confidence interval for α can then
be obtained by dividing through by t.
Using Table 4, we obtain 9.9 < λ < 27.2, and hence (on dividing by 5328),
0.00186 < α < 0.00511.
Note: The ‘better’ approx 95% CI gives (0.00187, 0.00526). Better, but still not on the
money.
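The exact interval can be reproduced with poisson.test(); the counts used here (x = 17 cases over t = 5328 person-years) are inferred from the figures quoted in the example:

poisson.test(x = 17, T = 5328)$conf.int   # exact 95% CI for alpha: about (0.0019, 0.0051)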
Inference on λ, where X ~ Pn(λ)
If X ~ Pn(λ), then the point estimate is λ̂ = x and, as above, an approximate 95% confidence
interval for λ is given by:
approx 95% CI for λ: λ̂ ± 1.96√λ̂ [est ± “2”se]
Again, this is a rough approximation, but we will use it nevertheless, being aware that it
tends to be a bit low and a bit narrow. If required, the exact 95% CI is available from R or
from Statistical Tables: Figure 4.
Note: the ‘better’ approx 95% CI: λ̃ ± (1.96√λ̃ + 0.5), where λ̃ = x+2.
Unless the number of cases is quite large (and 26 is not that large), the normal
approximation is not so wonderful: the approx CI will be too narrow and too
low. An exact result can be obtained using R or the Poisson SP diagram (Statistical
Tables: Figure 4). This gives (17.0, 38.1). The ‘better’ approximation gives
(17.1, 38.9). As usual, it tends to over-correct slightly.
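In R, the exact interval for the 26-case example is a one-liner:

poisson.test(x = 26)$conf.int   # exact 95% CI for lambda: (17.0, 38.1)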
The Statistical Tables (Table 7) give the quantiles (inverse cdf) of the tk distribution for a
range of values of k and q.
R gives the usual things using dt, pt and qt:
the pdf (not much use), the cdf (probabilities) and the inverse cdf (quantiles).
EXERCISE. Check that c0.975(t20) = 2.086 and c0.025(t20) = −2.086.
Note: Since the t distribution is symmetrical about zero, ca (t) = −c1−a (t).
> qt(0.975, df=20) # 0.975-quantile of t_20 distribution
[1] 2.085963
> qt(0.025, df=20) # 0.025-quantile of t_20 distribution
[1] -2.085963
So . . . how does this relate to the problem in hand: finding a confidence interval for µ?
We have:
(X̄ − µ)/(S/√n) ~ tn−1.
It follows that:
Pr(−c0.975(tn−1) < (X̄ − µ)/(S/√n) < c0.975(tn−1)) = 0.95. (4’)
Rearrangement of this statement, as we did with the σ-known result, leads to a confidence
interval:
Pr(X̄ − c0.975(tn−1)S/√n < µ < X̄ + c0.975(tn−1)S/√n) = 0.95 (6’)
in which σ is replaced by S and the standard normal quantiles (±1.96) are replaced by the
tn−1 quantiles.
This gives the 95% confidence interval for µ:
x̄ − c0.975(tn−1)s/√n < µ < x̄ + c0.975(tn−1)s/√n
which, with est = x̄ and se(x̄) = s/√n, exactly fits the form
est ± “2”se, with “2” = c0.975(tn−1).
Unless the sample size n is very small, the “2” will actually be reasonably close to 2, as
we have seen: for example, c0.975 (t100 ) = 1.984, c0.975 (t30 ) = 2.042, c0.975 (t10 ) = 2.228,
c0.975 (t3 ) = 3.182.
95% CI for µ: (est ± “2”se) = 12.3 ± 2.064×4.7/√25 = (10.4, 14.2).
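A quick check of this interval in R (a sketch using the example's values):

xbar <- 12.3; s <- 4.7; n <- 25
xbar + c(-1, 1) * qt(0.975, df = n - 1) * s / sqrt(n)   # 95% CI: (10.4, 14.2)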
the average time taken (in minutes) is 28.52, with sample standard deviation
2.36. Assuming normality, find a 95% confidence interval for the mean time.
est = 28.52, se = 2.36/√22 = 0.50, “2” = c0.975(t21) = 2.080;
95% CI for µ: 28.52 ± 2.080×0.50 = (27.5, 29.6).
Note that this confidence interval is a statement about the mean, and not about the actual
time taken. A 95% interval for the time taken is approximately 28.52 ± 2×2.36 =
(23.8, 33.2). This is an interval within which about 95% of the sample of times would
lie; and an interval within which the next observation will lie with a probability of
around 0.95.
90% CI for µ: 5.6 ± 1.833×3.134/√10 = (3.78, 7.42).
In most cases we consider a 95% confidence interval. However, the procedure is the same
for other levels, as indicated by the last example.
For a new observation X′, we have (X′ − X̄)/(S√(1 + 1/n)) ~ tn−1, which gives:
95% PI for X: x̄ ± c0.975(tn−1) s√(1 + 1/n) [cf. 95% CI for µ: x̄ ± c0.975(tn−1) s√(1/n)]
EXAMPLE 5.7.1: A random sample of n = 31 observations is obtained on X ~
N(µ, σ²). If the sample gives x̄ = 23.78 and s = 5.37, find a 95% prediction
interval for X.
95% PI for X: 23.78 ± 2.042×5.37×√(1 + 1/31) = (12.64, 34.92).
The PI and CI are the same, apart from a “1 +” in the right place.
The prediction interval is always substantially wider: it is a statement about a future
observation. The confidence interval is a statement about the population mean.
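Both intervals for Example 5.7.1, sketched in R:

xbar <- 23.78; s <- 5.37; n <- 31
t975 <- qt(0.975, df = n - 1)                # 2.042
xbar + c(-1, 1) * t975 * s * sqrt(1 + 1/n)   # 95% PI: (12.64, 34.92)
xbar + c(-1, 1) * t975 * s * sqrt(1/n)       # 95% CI: about (21.8, 25.7), much narrower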
The t-distribution result is based on the assumption that the population distribution is ap-
proximately normal. How can we tell if a sample is normal (i.e. from a normal population)?
The sample pdf is too erratic to be much use. The sample cdf is a bit more stable. But, how
do we know which shape corresponds to a normal cdf?
Principle: the easiest curve to fit is a straight line
If X ~ N(µ, σ²), then X = µ + σN, where N denotes a standard normal variable; and so
cq(X) = µ + σcq(N), i.e. cq = µ + σΦ⁻¹(q). Note: Φ denotes the standard normal cdf, so Φ⁻¹(q)
denotes the inverse cdf, i.e. the q-quantile of the standard normal distribution. This is often
denoted by zq.
[Figure: QQ-plot of the sample against the theoretical (normal) quantiles, with fitted line.]
This appears to be reasonably close to a straight line, so the normal distribution is a reason-
able model for these data. The intercept of the fitted line gives an estimate of µ: µ̂ = 47.2;
and the slope of the fitted line gives an estimate of σ: σ̂ = 10.7.
The quantities Φ⁻¹(k/(n+1)) are called normal scores. Roughly, these are the values you would
expect to get in an equivalent position in a sample from a standard normal distribution.
In R, the normal scores based on observations x are given by qnorm(rank(x)/(length(x)+1));
the function qqnorm(x) plots the data against their normal scores directly.
Such a plot is called a QQ-plot because it plots the sample Quantiles against the (standard)
population Quantiles. The QQ-plot not only provides an indication of whether the model
is a reasonable fit, but also gives estimates of µ and σ. These estimates work even in some
situations where x̄ and s won’t. For example, with censored or truncated data.
68.1 73.1 86.2 85.1 70.0 67.1 64.3 65.8 64.2 62.0
48.6 74.9 72.9 54.7 78.0 79.1 60.1 63.2 63.2 78.0
77.5 65.9 79.9 59.5 56.3 59.0 66.7 74.5 79.5 67.6
[Figure: histogram with fitted density (left) and normal QQ-plot with fitted line (right) for
these data.]
While the histogram does not look particularly normal, the QQ-plot gives a
more useful guide. This is a plot of the data (on the vertical axis) and the normal
scores (on the horizontal axis), with a fitted straight line. The intercept and slope
are quite close to the sample mean and standard deviation, as they should be.
> mean(x) # sample mean
[1] 68.83333
> sd(x) # sample standard deviation
[1] 9.213833
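The whole check is only a few lines in R (a sketch, with x holding the 30 observations listed above):

qqnorm(x)   # sample quantiles against normal scores
qqline(x)   # reference line: intercept ~ mu, slope ~ sigma
scores <- qnorm(rank(x) / (length(x) + 1))
coef(lm(x ~ scores))   # rough estimates of mu and sigma from the fitted line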
What happens if it’s a bad fit? Tails too long; or tails too short. A concave graph (up or
down) indicates too short at one end and too long at the other, i.e. a skew distribution. The
following examples correspond to the Exponential distribution with pdf e^(−x) and the t-
distribution with 3 degrees of freedom. They give, respectively, U- and S-shaped QQ-plots.
[Figure: histogram and U-shaped QQ-plot for the Exponential sample.]
[Figure: density and S-shaped QQ-plot for the t3 sample.]
Suppose we have two estimates of the same parameter, with their standard deviations,
resulting from two separate experiments:
est1 = 5.2, sd1 = 0.2; est2 = 6.4, sd2 = 0.5
How should these estimates be combined?
We could just average the two, giving the combined estimate:
est = 5.8, sd = 0.27.
(average = 0.5×est1 + 0.5×est2, so sd = √(0.5²×0.2² + 0.5²×0.5²) = 0.27.)
This combination gives the two estimates equal weight.
It would appear that the first experiment is ‘better’. It produces a more precise estimate:
i.e. one with smaller standard deviation. So, we ought to be giving it more weight.
Suppose the parameter we are estimating is θ, then we have:
E(T1 ) = θ, var(T1 ) = 0.22 ; E(T2 ) = θ, var(T2 ) = 0.52 ,
where T1 and T2 are independent, as they are from separate experiments.
We seek the optimal estimator of θ. Let the weights be w and 1−w, and define:
T = wT1 + (1−w)T2 .
The reason that the weights must sum to one is so that T is an unbiased estimator of θ:
E(T ) = wθ + (1−w)θ = θ.
V = var(T) = w²×0.2² + (1−w)²×0.5² = w²/25 + (1−w)²/4.
To find where V is a minimum, we solve dV/dw = 0:
dV/dw = 2w/25 − 2(1−w)/4 = 0 ⇒ w/(1−w) = 25/4 ⇒ w = 25/29, 1−w = 4/29.
So, to minimise the variance of the combined estimate we should put a lot more weight on
the first estimate (86% in fact). Then
est = (25/29)×5.2 + (4/29)×6.4 = 5.4; sd = √((25/29)²×0.2² + (4/29)²×0.5²) = 0.19.
Compared to averaging, this optimal weighting gives an estimate closer to the first (more
reliable) estimate, and a smaller standard deviation.
Repeating the above for the case
E(T1) = θ, var(T1) = v1; E(T2) = θ, var(T2) = v2,
gives w = (1/v1)/(1/v1 + 1/v2) and 1−w = (1/v2)/(1/v1 + 1/v2);
i.e. the weights for the optimal estimator are inversely proportional to the variances, and
its variance is given by:
V = 1/(1/v1 + 1/v2).
How should these estimates be combined to produce an optimal estimate? The answer is
given by the above results.
We find results from three papers: est1 = 14.4, sd1 = 0.45; est2 = 15.7, sd2 =
0.92; and est3 = 16.1, sd3 = 0.67. We assume that the results are independent
(separate experiments). We want to combine these results in the most efficient
way. This can be done using a table of the same form as for combining two
estimates. (Note: a table of this form can be used for optimally combining any number
of estimates.)
Thus the combined estimate is 15.0 with standard deviation 0.35. This gives a
confidence interval of:
CI = 15.0 ± 1.96×0.35 = (14.3, 15.7).
From the above example we observe that the combined estimate is closer to the most precise
of the individual estimates; and the standard deviation of the pooled estimate is smaller
than any of the individual standard deviations, resulting in a narrower confidence interval.
Greater information means smaller standard deviation.
Here we are assuming that the standard deviations are known. Usually, in practice, they
are not, and they must be estimated. In that case we use the standard error (which is an
estimate of the standard deviation) and replace the unknown standard deviation (sd) by
the standard error (se) in the above.
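Inverse-variance pooling is easily packaged as a small R function (a sketch; the call below reproduces the three-paper table):

pool <- function(est, sd) {
  w <- (1 / sd^2) / sum(1 / sd^2)                      # weights inversely proportional to variances
  c(est = sum(w * est), sd = sqrt(1 / sum(1 / sd^2)))  # pooled estimate and its sd
}
pool(c(14.4, 15.7, 16.1), c(0.45, 0.92, 0.67))         # est = 15.0, sd = 0.35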
Problem Set 5
5.1 A population has mean µ=50 and standard deviation σ=10.
(a) For a random sample of 10, find approximately Pr(49 < X̄ < 51).
(b) For a random sample of 100, find approximately Pr(49 < X̄ < 51).
(c) For a random sample of 1000, find approximately Pr(49 < X̄ < 51).
5.2 A population with pdf indicated below has mean µ = 55.4 and standard deviation 14.2.
A random sample of 50 observations is obtained from this population. Specify a 95% probability
interval for X̄.
5.3 A 95% confidence interval for a parameter is such that it contains the unknown parameter with
probability 0.95. We call this a “success”. So, the probability that a 95% confidence interval is
successful is 0.95. And it is a failure (i.e. does not contain the parameter) with probability 0.05.
(a) Suppose we have four independent 95% confidence intervals. Show that the probability
that all four of these intervals are successful is 0.8145.
(b) i. Suppose we have 20 independent 95% confidence intervals, what is the probability
that all 20 are successful?
ii. How many of these intervals do you ‘expect’ to be successful?
iii. What is the distribution of the number of successful intervals?
iv. What is the probability that the number of successful intervals is equal to 20? 19? 18?
5.4 The following is a random sample of n=30 observations from a normal population with (un-
known) mean µ and known standard deviation σ=8.
32.1 43.2 38.6 50.8 34.4 34.8 34.5 28.4 44.1 38.7
49.1 41.3 40.3 40.5 40.0 35.3 44.3 33.3 50.8 28.6
42.2 46.3 49.8 34.4 43.9 59.7 44.9 41.9 41.3 38.2
i. Find a 95% confidence interval for µ.
ii. Will a 50% confidence interval for µ be wider, or narrower, than the 95% confidence in-
terval? Find a 50% confidence interval for µ.
iii. What would happen if the confidence level was made even smaller? What is the 0%
confidence interval?
iv. Find a 99.9% confidence interval for µ.
5.5 For the data of Problem 5.4, find the 95% confidence interval for µ, assuming that σ is unknown.
Compare this interval to the 95% confidence interval found in Problem 5.4. Why is this interval
narrower? Under what circumstances is the 95% confidence interval assuming σ unknown
narrower than the 95% confidence interval assuming σ known? Which do you expect to be
wider?
5.6 A study was conducted to examine the efficacy of an intramuscular injection of cholecalciferol
for vitamin D deficiency. A random sample of 50 sufferers of vitamin D deficiency were chosen
and given the injection. Serum levels of 25-hydroxyvitamin D3 (25OHD3) were measured at
the start of the study and 4 months later. The difference D was calculated as (4-month reading
− baseline reading).
The sample mean difference is d̄ = 17.4 and the sample standard deviation is sd = 21.2.
i. Construct a 95% confidence interval for the mean difference.
ii. Does this confidence interval include zero? What can you conclude?
5.7 The margin of error, or the half-width, of a 100(1−α)% confidence interval for µ when σ is
known is given by zσ/√n, where z = c1−α/2(N).
i. What factors affect the margin of error?
ii. For each factor, say how it affects the width of the interval.
iii. Does a wider interval give a more or less precise estimation?
iv. If σ = 5, and I want a 95% confidence interval to have half-width 0.5, i.e. the 95% CI to be
(x̄ ± 0.5), what sample size should I use?
5.8 We are interested in estimating the prevalence of attribute D among 50-59 year-old women.
Suppose that in a sample of 1140 such women, 228 are found to have attribute D.
Obtain a point estimate and a 95% confidence interval for the prevalence.
5.9 We are interested in estimating the prevalence of breast cancer among 50–54-year-old women
whose mothers have had breast cancer. Suppose that, in a sample of 10 000 such women, 400
are found to have had breast cancer at some point in their lives.
(a) Obtain a point estimate for the prevalence, and its standard error.
(b) Obtain a 95% interval estimate for the prevalence.
5.10 Of a random sample of n = 20 items, it is found that x = 4 had a particular characteristic. Use
the chart in the Statistical Tables (Table 2) to find a 95% confidence interval for the population
proportion. Repeat the process to complete the following tables:
n x p̂ 95% CI: (a, b) n x p̂ 95% CI: (a, b)
20 4 20 16
50 10 50 40
100 20 100 80
200 40 200 160
Check your values using the intervals from R.
Use the formula p̂ ± 1.96 se(p̂) to find an approximate 95% confidence interval for the popula-
tion proportion for n=100, x=20.
5.11 (a) The following is a sample of n = 19 observations on X
84 37 33 24 58 75 55 46 65 59
18 30 48 38 70 68 41 52 50
The graph below is the QQ plot for this sample:
[Figure: QQ-plot of the sample against the theoretical quantiles, with one point indicated.]
Specify the coordinates of the indicated point, explaining how they are obtained. Use the
diagram to obtain estimates of µ and σ.
(b) Use R to obtain a probability plot for these data and indicate how it relates to the above
plot.
(c) i. Find a 95% confidence interval for µ.
ii. Find a 95% prediction interval for X.
5.12 A random sample of 100 observations on a continuous random variable X gives:
range 0 < x < 1 1 < x < 2 2 < x < 3 3 < x < 5 5 < x < 10 10 < x < 20
frequency 27 18 20 17 12 6
(a) Sketch the graph of the sample pdf.
(b) Sketch the graph of the sample cdf and hence find an approximate value for the sample
median.
(c) Find a 95% confidence interval for Pr(X < 3). Is it plausible that the median is equal
to 3? Explain.
5.13 Suppose the presence of a characteristic C in an individual can only be determined by means of
a blood test. We assume this test indicates the presence (or absence) of C with perfect accuracy.
If the characteristic is rare and the test is expensive (and/or time consuming) it can be more
efficient to test a combined blood sample from a group of individuals.
Suppose that the probability that an individual has characteristic C is equal to p. Blood samples
from k = 10 individuals are combined for a test.
i. Show that the probability of a positive result (indicating presence of C) is θ = 1 − (1−p)¹⁰.
ii. Ten such groups of 10 (representing blood samples from 100 individuals in all) were
tested, and yielded 4/10 positive results. Use R to obtain an exact 95% confidence in-
terval for θ. Hence derive an exact 95% confidence interval for p.
5.14 A study of workers in industry M reported 43 cases of disease D based on observation of 1047
person-years. Give an estimate and a 95% confidence interval for the incidence rate in industry
M based on these results. Is this compatible with a community incidence rate of 0.02 cases
per person-year?
5.15 (a) Two independent estimates of a parameter θ are given:
est1 = 25.0 with sd1 = 0.4; and est2 = 23.8 with sd2 = 0.3.
Find the optimal pooled estimate of θ and obtain its standard deviation.
(b) A third independent estimate of θ is obtained in a new experiment:
est3 = 24.4 with sd3 = 0.2.
Find the optimal pooled estimate of θ based on the three estimates, and obtain its stan-
dard deviation.
5.16 Oxidised low-density lipoprotein is thought to play an important part in the pathogenesis of
atherosclerosis. Observational studies have associated β-carotene with reductions in cardio-
vascular events, but clinical trials have not. A meta-analysis was undertaken to examine the
effect of compounds like β-carotene on cardiovascular mortality and morbidity.
Here we examine the effect of β-carotene on cardiovascular mortality. Six randomised trials
of β-carotene treatment were analysed. All trials included 1000 or more patients. The dose
range for β-carotene was 15–50 mg. Follow-up ranged from 1.4 to 12.0 years. The parameter
estimated is λ = ln(OR) where OR denotes the odds ratio relating E and D, and ln denotes
natural logarithm.
Note: ln OR is used rather than OR, since −∞ < ln OR < ∞, which means that its estimator is better
fitted by a normal distribution; it has no endpoint problems, cf. OR > 0.
(a) The estimates and standard errors from these trials are as follows:
est se 1/se2 w w×est
ATBC 0.0827 0.0533 ··· ··· ···
CARET 0.3520 0.1058 ··· ··· ···
HPS 0.0520 0.0503 ··· ··· ···
NSCP –0.7702 0.5109 ··· ··· ···
PHS 0.1049 0.0797 ··· ··· ···
WHS 0.1542 0.3935 ··· ··· ···
A rough 95% confidence interval for each trial is given by est ± 2se. Represent these
intervals in a diagram.
(b) Compute the optimum pooled estimate of λ and its standard error.
(c) Obtain a 95% confidence interval for λ.
(d) Hence obtain a 95% confidence interval for OR, using the fact that λ = ln OR.
(e) Let OR denote the odds ratio between exposure E and disease D. What does “OR > 1”
indicate about the relationship between E and D?
(f) What conclusion do you reach from this meta-analysis?
Chapter 6
HYPOTHESIS TESTING
“I had come to an entirely erroneous conclusion, which shows, my dear Watson, how dangerous it is
to reason from insufficient data.”
Sherlock Holmes, The Speckled Band, 1892.
6.1 Introduction
Hypothesis testing can be regarded as the “other side” of confidence intervals. We have
seen that a confidence interval for the parameter µ gives a set of “plausible” values for µ.
Suppose we are interested in whether µ = µ0. In determining whether or not µ0 is a
plausible value for µ (using a confidence interval), we are really testing µ = µ0 against the
alternative that µ ≠ µ0. If µ0 is not a plausible value, then we would reject µ = µ0.
In this subject, we deal only with two-sided confidence intervals and, correspondingly,
with two-sided tests, i.e. tests against a two-sided alternative (µ = µ0 vs µ ≠ µ0). There are
circumstances in which one-sided tests and one-sided confidence intervals may seem more
appropriate. Some statisticians argue that they are never appropriate. In any case, we will
use only two-sided tests.
All our confidence intervals are based on the central probability interval for the estimator,
i.e. that obtained by excluding probability α/2 at each end of the distribution, giving a Q%
confidence interval, where Q = 100(1−α). This means that our tests are based on rejecting
µ = µ0 for an event of probability α/2 at either end of the estimator distribution.¹
¹ An exception is tests based on “goodness-of-fit” statistics. For example, using a test statistic such as
U = (X̄−µ0)² to test µ=µ0. We will consider such cases in Chapter 7.
Suppose that the mean cholesterol level for the general population of adult males is
211 mg/100mL, with standard deviation 46 mg/100mL. Is the mean cholesterol level of the
subpopulation of men who smoke and are hypertensive different?
Suppose we select a sample of 25 men from this group and their mean cholesterol level is
x̄=220 mg/100mL. What can we conclude from this?
The “logic” of the hypothesis testing procedure seems a bit back-to-front at first. It is
based on the contrapositive: [M ⇒ D] = [D′ ⇒ M ′ ].
For example: if the model M is a two-headed coin then the data D = the results are
all heads; so, if D′ = a tail is observed then M ′ = the coin is not two-headed.
Our application is rather more uncertain:
[M (µ = µ0) ⇒ D (x̄ ≈ µ0)]
[D′ (x̄ ≉ µ0) ⇒ M′ (µ ≠ µ0)]
This logic means that we have a (NQR) “proof” of µ 6= µ0 . (If the signs were all
equalities rather than (random) approximations, it would be a proof.)
We have no means of “proving” (NQR or otherwise) that µ = µ0 .
“I am getting into your involved habit, Watson, of telling a story backward.”
Sherlock Holmes, The Problem of Thor Bridge, 1927.
We observe the sample and compute x̄. On the basis of the sample and the test statistic, we
must reach a decision: “reject H0 ”, or not.
Statisticians are reluctant to use “accept H0 ” for “do not reject H0 ”, for the reasons indicated
above. Mind you, this does seem a bit odd when “success” can be used to mean “the patient dies”.
If ever I use “accept H0 ” (and I’m inclined to occasionally), it means only “do not reject H0 ”.
In particular, it does not mean that H0 is true, or even that I think it likely to be true!
However, it is well to keep in mind that:
“absence of evidence is not the same as evidence of absence”.
Types of error
In deciding whether to accept or reject H0, there is a risk of making two types of error.
Q(θ) = Pr(reject H0 | θ) = Pr(X ∈ {0, 5} | θ) = (1−θ)⁵ + θ⁵.
[Graph of Q(θ).]
Note 2: Q(0.75) = 0.25⁵ + 0.75⁵ ≈ 0.24; so this is not a particularly good test. But
we knew that anyway!
Note 3: To make a better test (one with greater power), we need to increase the sample
size. For example, with n=100, reject H0 unless 40 ≤ X ≤ 60.
There are several ways of approaching a hypothesis test. The first, and simplest after Chap-
ter 5, is to compute a confidence interval (which is a good idea in any case); and then to
check whether or not the null-hypothesis value (µ = µ0 ) is in the confidence interval.
We have seen how to obtain a confidence interval for µ, so there is not much more to do. In
fact, a number of the problems and examples had parts that questioned the plausibility of
particular values of µ. This is now seen to be equivalent to hypothesis testing.
Since the 95% confidence interval does not include 10, we reject the null hypoth-
esis µ=10. There is significant evidence in this sample that µ>10.
Since the 95% confidence interval includes 211, we do not reject the null hypothesis
µ = 211. There is no significant evidence in this sample that µ ≠ 211.
This approach can always be used whenever you have a confidence interval, but it has
disadvantages: it does not tell you how strongly to reject (or not) a particular hypothesis,
and it does not use the hypothesised number to construct the confidence interval.
We can measure the strength of the evidence of the sample against H0 by using the “unlike-
lihood” of the data if H0 is true. The idea is to work out how unlikely the observed sample
is, assuming µ = µ0 . If it is “too unlikely”, then we reject H0 ; and otherwise, we do not
reject H0 .
DEFINITION 6.3.1. The p-value, denoted in these notes by p, is the probability (if H0
were true) of observing a value as extreme as the one observed.
Therefore:
p = 2 Pr(X̄ > x̄) if x̄ > µ0, and p = 2 Pr(X̄ < x̄) if x̄ < µ0, where X̄ ~ N(µ0, σ²/n).
The 2 is because this is a two-sided test, and we must allow for the possibility of being as
extreme at the other end of the distribution (i.e. above or below).
p = 2 Pr(X̄ > 11.62), where X̄ ~ N(10, 4²/40) (the H0 distribution).
∴ p = 2 Pr(X̄s > (11.62−10)/(4/√40)) = 2 Pr(X̄s > 2.56) = 0.010.
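In R, this p-value is a one-liner:

2 * pnorm(2.56, lower.tail = FALSE)                                # 0.010
2 * pnorm(11.62, mean = 10, sd = 4/sqrt(40), lower.tail = FALSE)   # the same, unstandardised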
Now we must specify what is meant by “too unlikely”; i.e. how small is “too small” a value
for p? It seems sensible to match our idea of what is “too small”, with what is “implausi-
ble”. Thus, if we reject H0 if p < 0.05, then this corresponds exactly to values outside the
95% confidence interval, i.e. the “implausible” values.
Our standard testing procedure therefore, is to compute the p-value and to reject H0 if
p < 0.05 (and not to reject H0 otherwise). Thus, in both the above two examples, we
would reject H0 (at the 5% level of significance).
We have seen how to compute the probability, so there is nothing new in that. What is new here is
the terminology that comes with it.
One advantage of the p-value is that it gives a standard indication of the strength of the
evidence against H0 . The smaller the value of p, the stronger the evidence against H0 .
As we can specify different levels for a confidence interval, we can specify different levels
for the test. To correspond to a 99% CI, we would reject H0 if p < 0.01.
We specify α, the significance level of the test. Typically we use α=0.05, just as we typically
use a 95% confidence interval. But we may choose α=0.01 or 0.001 or another value.
DEFINITION 6.3.2. If we observe p < α, then we reject H0 and say that the result is
statistically significant.
p = 2 Pr(X̄ > 220), where X̄ ~ N(211, 46²/25) (the H0 distribution).
∴ p = 2 Pr(X̄s > (220−211)/(46/√25)) = 2 Pr(X̄s > 0.978) = 0.328.
Since p > 0.05, we do not reject the null hypothesis µ = 211. There is no
significant evidence in this sample that µ ≠ 211.
The p-value approach is the most widely used, and preferred when it is available, but some-
times it is difficult to calculate the required probability. A third approach, the critical value
approach, is to specify a decision rule for rejecting H0.
The rejection rule is often best expressed in terms of a statistic that has a standard distribu-
tion if H0 is true. Here the test statistic is
Z = (X̄ − µ0)/(σ/√n)
which is such that, if H0 is true, then Z ~ N(0, 1). Note that Z involves only X̄ and known
constants (the null hypothesis value µ0, the known standard deviation σ, and the sample
size n). In particular, Z does not depend on the unknown parameter µ.
The rule then is to compute the observed value of Z and to see if it could plausibly be
an observation from a standard normal distribution. (Here, “plausible” is taken to mean
within the central 95% of the distribution.) If not, we reject H0 . This leads to the name often
used for this test: the z-test.
We compute the observed value of Z, i.e. z = (x̄ − µ0)/(σ/√n), and compare it to the standard
normal distribution. Thus the decision rule is
normal distribution. Thus the decision rule is
reject H0 if z < −1.96 or z > 1.96; i.e. if |z| > 1.96.
which corresponds exactly to the rejection region for x̄ given above.
Since |z| < 1.96, we do not reject the null hypothesis µ = 211.
There is no significant evidence in this sample that µ ≠ 211.
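The whole z-test takes only a few lines in R (a sketch of the cholesterol example above):

xbar <- 220; mu0 <- 211; sigma <- 46; n <- 25
z <- (xbar - mu0) / (sigma / sqrt(n))   # 0.978
2 * pnorm(abs(z), lower.tail = FALSE)   # p = 0.328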
Compute β, the probability of making a type II error, when the true value of µ
is 250.
If X̄ ~ N(250, 46²/25), then Z ~ N((250−211)/(46/√25), 1), i.e. Z ~ N(4.24, 1)
[using the result that Y = (X−a)/b has mean (µ−a)/b and variance σ²/b²].
Then: β = Pr(−1.96 < Z < 1.96) = Pr(−6.20 < Zs < −2.28) = 0.011, as above.
In this section, we consider tests for the parameter µ for a normal population. So the “pa-
rameter of interest” here is the population mean µ. Later in the chapter we turn to other
parameters.
We define a statistic that has a “standard” distribution when H0 is true (i.e. N or t, depend-
ing on whether σ is known or unknown). A decision is then obtained by comparing the
observed value for this statistic with the standard distribution.
In reporting the results of the test, you should give the value of the “standard” statistic,
the p-value, and a verbal conclusion/explanation. It is recommended that you also give a
confidence interval in reporting your results.
This is the scenario we have been considering in the previous sections. We define:
Z = (X̄ − µ0)/(σ/√n)
in which X̄ is observed; µ0, σ and n are given or assumed known.
If H0 is true, then Z ~ N(0, 1).
We evaluate the observed value of Z, i.e. z = (x̄ − µ0)/(σ/√n),
and compare it to the standard normal distribution. For significance level 0.05, we reject
H0 if |z| > 1.96.
The p-value is computed using the tail probability for a standard normal distribution:
z = (11.62−10)/(4/√40) = 2.56; p = 2 Pr(Z > 2.56) = 0.010.
The sample mean is x̄=11.62; the z-test of µ=10 gives z=2.56, p=0.010.
Thus there is significant evidence in this sample that µ>10; the 95% CI for µ is
(10.28, 12.86).
Is there evidence to support the claim that their mean serum-creatinine level is
different from that of the general population?
There are some routine functions in R implementing the test, but it is straight-
forward to perform directly, in the same way as the calculation sketched above.
Note that we are assuming the standard deviation of serum-creatinine level is the
same in the treated individuals as in the general population (as well as normality,
etc.).
Power of a z-test
Suppose that Z ~ N(θ, 1). We observe Z, and on the basis of this one observation, we
wish to test H0: θ = 0 against H1: θ ≠ 0.

                         reject H0: |Z| > 1.96          don’t reject H0: |Z| < 1.96
H0 true (θ = 0),         error of type I:               correct:
i.e. Z ~ N(0, 1)         α = Pr(|Z| > 1.96) = 0.05      prob = 0.95
Except for θ close to zero, it is usually the case that only one tail is required (as the other is
negligible). For example, for θ = −1,
power = Pr(|Z| > 1.96), where Z ~ N(−1, 1);
1 − power = Pr(−1.96 < Z < 1.96) = Pr(−0.96 < Zs < 2.96)
= Pr(Zs < 2.96) − Pr(Zs < −0.96) = 0.9985 − 0.1685
∴ power = 0.170.
Using the above table, we could plot a graph of the power function. The graph has a
minimum at zero (of 0.05, the significance level), and increases towards 1 on both sides as θ
moves away from zero: for θ = 4 or θ = −4 the power is 0.98.
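The power function is easy to compute and plot in R (a sketch):

power <- function(theta)
  pnorm(-1.96, mean = theta) + pnorm(1.96, mean = theta, lower.tail = FALSE)
power(c(0, -1, 4))                   # 0.05, 0.170, 0.98
curve(power(x), from = -5, to = 5)   # the power curve described above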
For the z-test, the statistic is Z = (X̄ − µ0)/(σ/√n).
If µ = µ0, then Z ~ N(0, 1).
If µ = µ1, then Z ~ N(θ, 1), where θ = (µ1 − µ0)/(σ/√n).
And we only get one observation on Z.
So the z-test is actually equivalent to the example above.
We can use the results of that example to work out the power for any z-test, using
power = Pr(|Z| > 1.96), where Z ~ N(θ, 1).
To devise a test of significance level 0.05 that has power of 0.95 when µ = µ1, we need
θ = 3.6049,
i.e. (µ1 − µ0)/(σ/√n) = 3.61 ⇒ n = 13σ²/(µ1 − µ0)². [3.6049² = 12.9953 ≈ 13]
The diagram indicates that to achieve a z-test of µ=µ0, with significance level α and power
1−β when µ=µ1, we require
(µ1 − µ0)/(σ/√n) > z1−α/2 + z1−β ⇒ n > (z1−α/2 + z1−β)² σ²/(µ1 − µ0)².
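As a sketch in R:

n.ztest <- function(mu0, mu1, sigma, alpha = 0.05, beta = 0.05) {
  z <- qnorm(1 - alpha/2) + qnorm(1 - beta)
  ceiling(z^2 * sigma^2 / (mu1 - mu0)^2)
}
n.ztest(mu0 = 0, mu1 = 1, sigma = 1)   # 13, i.e. the n = 13*sigma^2/(mu1-mu0)^2 rule above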
We define: T = (X̄ − µ0)/(S/√n)
in which X̄ and S are observed; µ0 and n are given.
If H0 is true, then T ~ tn−1. We evaluate the observed value of T, i.e. t = (x̄ − µ0)/(s/√n), and
compare it to the tn−1 distribution, i.e. the null distribution, i.e. its distribution if H0 is true.
For significance level 0.05, we reject H0 if |t| > “2” = c0.975(tn−1).
The p-value is computed using the tail probability for a tn−1 distribution:
p = 2 Pr(T > t) if t > 0, and p = 2 Pr(T < t) if t < 0, where T ~ tn−1 (the H0 distribution).
t = (16.2 − 25)/(8.4/√18) = −4.44; p = 2 Pr(t17 < −4.44) = 0.000.
The sample mean for treated patients, x̄=16.2, is significantly less than the known
mean for untreated patients (t = −4.44, p = 0.000).
In reporting this test result, it is recommended that you also give the 95% CI for
µ: (12.0, 20.4).
R can be used to analyse the data using the function t.test(), by entering the
data and the null hypothesis value µ0. For the above example we obtain
> x = c(255, 244, 239, 242, 265, 245, 259, 248, 225, 226, 251, 233) # data
> t.test(x, mu=240) # perform t test on x with null hypothesis mu=240
data: x
t = 1.2123, df = 11, p-value = 0.2508
alternative hypothesis: true mean is not equal to 240
95 percent confidence interval:
236.4657 252.2010
sample estimates:
mean of x
244.3333
There is no significant evidence in this sample that the mean is different from
240 calories (t = 1.21, p = 0.251); the 95% CI for the mean is (236.5, 252.2).
An approximate z-test can be used in a wide variety of situations: it can be used whenever
we have a result that says the null distribution of the test statistic is approximately normal.
The central limit theorem ensures that there are many such situations.
Testing a population proportion: approx z-test for testing p=p0 (Binomial parameter)
Suppose we observe a large number of independent trials and obtain X successes. To test
H0 : p = p0 , where p denotes the probability of success, we can use
Z = (X − np0)/√(np0(1−p0)) = (P̂ − p0)/√(p0(1−p0)/n), where P̂ = X/n,
and compare it to the standard normal distribution, though in this case we should adjust
for discreteness by using a correction for continuity.
In this case there is not an exact correspondence between the test and the confidence interval, since
se0 6= se. This is because the confidence interval is based on an additional approximation: that
p(1−p) ≈ p̂(1−p̂). The test procedure is preferred. If it were used for the confidence interval, it
would give a better, but messier, confidence interval.
EXAMPLE 6.4.6: 100 independent trials resulted in 37 successes. Test the hy-
pothesis that the probability of success is 0.3.
p̂ = x/n = 0.37, z = 0.07/√(0.3×0.7/100) = 1.528.
The correction for continuity is to reduce 0.07 by 0.5/100 = 0.005,
i.e. zc = 0.065/√(0.3×0.7/100) = 1.418; so p ≈ 2 Pr(N > 1.418) = 0.156.
The exact p-value is p = 0.160, obtained using p = 2 Pr(X ≥ 37), where X ~
Bi(100, 0.3).
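In R, the exact test is available directly:

binom.test(x = 37, n = 100, p = 0.3)   # exact test; p-value about 0.16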
EXAMPLE 6.4.7: 1000 independent trials resulted in 280 successes. Test the hy-
pothesis that the probability of success is 0.3.
p̂ = x/n = 0.28, z = −0.02/√(0.3×0.7/1000) = −1.380.
The correction for continuity is to reduce 0.02 by 0.5/1000 = 0.0005,
i.e. zc = −0.0195/√(0.3×0.7/1000) = −1.346; so p ≈ 2 Pr(N > 1.346) = 0.178.
Our approach then is to use the normal approximation, with continuity correction, to give
an approximation to the p-value. If an exact value is required, we can use the Binomial
probability. If the distribution of the test statistic is symmetrical then the two definitions
coincide. So it is only in the case of a skew distribution that there is a difference.
If n is small, there is little point in considering the normal approximation. We might as well
go straight to the exact test, using the Binomial distribution.
In R:
> binom.test(x=5, n=13, p=0.2)
data: 5 and 13
number of successes = 5, number of trials = 13, p-value = 0.1541
alternative hypothesis: true probability of success is not equal to 0.2
95 percent confidence interval:
0.1385793 0.6842224
sample estimates:
probability of success
0.3846154
Note that R’s binom.test computes the lower tail probability a little differ-
ently; it calculates the probability that X is further from the mean than 5, whereas
we simply multiply the upper tail probability by 2.
Suppose we wish to test H0: p = p0 using a significance level α and with power 1−β when
p=p1. Using a normal approximation and following the derivation given for the normal
case gives
n > (z1−α/2 √(p0(1−p0)) + z1−β √(p1(1−p1)))² / d²,
where d = |p1 − p0|.
This can be seen using a diagram like the one below (cf. the diagram on page 141):
EXAMPLE 6.4.9: Find the sample size required to test H0 : p = 0.3 with signifi-
cance level 0.05, so that the test has power 0.90 when p=0.2.
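A sketch of the calculation for this example in R (taking d = |p1 − p0| = 0.1):

p0 <- 0.3; p1 <- 0.2
z <- qnorm(0.975) * sqrt(p0*(1-p0)) + qnorm(0.90) * sqrt(p1*(1-p1))
ceiling(z^2 / (p1 - p0)^2)   # about 200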
So the test is based on zc = (0.0525 − 0.5/400)/√(1/1600) = 2.05, so p = 2 Pr(Z > 2.05) = 0.040.
Hence there is significant evidence in these data to indicate that the population
median is less than 40 (since there is evidence that Pr(X < 40) > 0.5).
Note: the exact p-value is p = 2 Pr(U ≥ 221), where U ~ Bi(400, 1/2); p = 0.040.
A confidence interval for the population median can be obtained as the set of values m′ for
which the null hypothesis m = m′ is not rejected.
The result we use to examine the population rate α (cases per person-year) is: X = number
of cases in a population observed for t person-years ~ Pn(αt). It follows that
Z = (X − α0t)/√(α0t) = (X/t − α0)/√(α0/t) ≈ N(0, 1) if H0 is true,
in which X is observed, and t and α0 are specified.
This can then be used in the same way as a z-test (provided α0t is greater than 10). We
evaluate the observed value of Z, and compare it to the standard normal distribution. Again,
as we are approximating an integer-valued variable by a normal distribution, a continuity
correction is required. Since α̂ = X/t, the continuity correction is 0.5/t. This is applied in the
same way as for the Binomial test, i.e. reduce |α̂ − α0| by 0.5/t.
point estimate: α̂ = 43/1047 = 0.041 (cases/person-year).
To test H0: α = 0.025, use
zc = (α̂ − α0 − 0.5/1047)/se0 = (0.04107 − 0.025 − 0.5/1047)/√(0.025/1047) = 3.191;
p = 2 Pr(N > 3.191) = 0.001, and hence we conclude these data show a signifi-
cant increase in incidence rate in this subpopulation.
For these data, we have est = 43/1047 = 0.041, se = √(0.041/1047) = 0.0062; and hence:
approx 95% CI: 0.041 ± 1.96×0.0062 = (0.029, 0.053) [which excludes 0.025].
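The corresponding exact test in R:

poisson.test(x = 43, T = 1047, r = 0.025)   # exact test of alpha = 0.025; agrees with the z-test above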
A common application is to compare a cohort (or a subpopulation) with the general pop-
ulation. The subpopulation may be individuals working in a particular industry, or in-
dividuals who live in a particular area — for example, close to a potential hazard. Are
the individuals in this subpopulation more likely to develop disease D than the general
population?
To examine this hypothesis, we need to work out the number of cases of D that would be
expected if the subpopulation were the same as the general population.
Typically, the incidence rates for D will depend on a range of covariates, usually age and
gender, but there may be others depending on the situation. To calculate the expected num-
ber of cases therefore, we stratify the subpopulation into categories of similar individuals
(e.g. age×gender categories). The expected number of cases for the subpopulation is then
worked out as
λ0 = α1 t1 + α2 t2 + · · · + αc tc
where α1, α2, . . . , αc denote the general population incidence rates, and t1, t2, . . . , tc denote
the observed person-years for individuals from the subpopulation in each category.
This computation may be quite complicated and time-consuming. But we assume that all
that administration and record-keeping has been done. We are then left with the result that,
if the subpopulation behaves in the same way as the rest of the population (with respect to
disease D), then the number of observed cases of D in the subpopulation is such that
X ~ Pn(λ0).
If λ0 is large, then we can use a z-test:
Z = (X − λ0)/√λ0 ≈ N(0, 1), if H0 is true;
and proceed as before for a z-test. If required, exact results can be obtained using the
Poisson distribution.
x = 21, λ0 = 16.1 ⇒ zc = (20.5 − 16.1)/√16.1 = 1.097; so p = 0.273.
Since p > 0.05, this result is not significant. There is no significant evidence in
this result to indicate that the occurrence of bladder cancer is different from the
general population.
In R, an exact version of the test is implemented by the function poisson.test and can
be carried out as follows:
> poisson.test(x=21, T=1, r=16.1)
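poisson.test returns a standard R htest object, so the exact p-value and the exact confidence interval quoted below can be extracted from the saved result; for example:
> pt <- poisson.test(x=21, T=1, r=16.1)
> pt$p.value    # exact two-sided p-value (R computes it slightly differently from doubling one tail)
> pt$conf.int   # exact 95% CI for the mean: about (13.0, 32.1)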
Comparing λ with the value based on the general population rates, i.e. λ0 , gives the standardised incidence ratio (SIR) = λ/λ0 .
This may also be referred to as the standardised mortality ratio if the disease outcome is
death; or a standardised morbidity ratio if the disease outcome is diagnosis.
In the above example, the standardised mortality ratio is estimated by 21/16.1 = 1.30.
An exact 95% CI for λ is (13.0, 32.1), using R. It follows that a 95% CI for SMR = λ/16.1 is (13.0/16.1, 32.1/16.1) = (0.81, 1.99). Since the confidence interval for SMR includes 1, there is no significant evidence that this subpopulation differs from the general population, which agrees with the above hypothesis testing result . . . as it should. A test of SMR = 1 is the same as a test of λ = λ0 .
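In R, one way to obtain this interval is simply to scale the exact CI for λ by λ0 :
> poisson.test(x=21, T=1, r=16.1)$conf.int / 16.1   # 95% CI for SMR: about (0.81, 1.99)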
For small means, the normal approximation does not apply. In that case we use the exact
result, i.e. calculate the p-value using the Poisson distribution, and compare it with 0.05.
H0 ⇒ X ~ Pn(3.3); and we observed x = 6,
so p = 2 Pr(X ≥ 6), where X ~ Pn(3.3).
∴ p = 2×0.117 = 0.234.
Since p > 0.05, this result is not significant. There is no significant evidence in
this result to indicate a different rate of Hodgkin’s disease among these workers.
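In R, the exact tail probability is a one-liner; a sketch:
> 2*(1 - ppois(5, lambda=3.3))   # 2 Pr(X >= 6) = 2 x 0.117 = 0.234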
Hopper and Seeman (1994)2 conducted a cross-sectional study to examine the relationship
between cigarette smoking and bone density. Data was collected on 41 pairs of female
twins with different smoking histories (each pair consisted of a lighter-smoking twin and
a heavier-smoking twin). Bone mineral density (BMD) was measured at three different
locations: the lumbar spine, the femoral neck and the femoral shaft. Further information,
including (but not limited to) age, height, weight, consumption of alcohol, use of tobacco,
and calcium intake, was collected on each participant using a questionnaire.
E XERCISE . This is only one possible study that could be used to examine this proposed
relationship. What other ways could we construct a cross-sectional study? What about an
experiment or another kind of observational study?
We are interested in the following research question: is there a difference in the mean
lumbar spine BMD between the lighter-smoking and heavier-smoking twins? Let µ1 de-
note the mean lumbar spine BMD for lighter-smoking twins and µ2 denote the mean lum-
bar spine BMD for heavier-smoking twins. Also define µD = µ2 − µ1 . If µD < 0 (i.e.
µ2 < µ1 ) then the mean lumbar spine BMD of heavier-smoking twins is less than the mean
lumbar spine BMD of lighter-smoking twins. We can use a one sample t-test to test the null
hypothesis H0 : µD = 0 against the alternative hypothesis H1 : µD ≠ 0.
2 Hopper, J.L. and Seeman, E. (1994). The bone density of female twins discordant for tobacco use. New England Journal of Medicine, 330, 387–392.
The data is stored in a file called Boneden.txt, which we load into R using the following
command:
> boneden <- read.table(’Boneden.txt’, header=T)
After loading the data file into R, the data should be stored in the object boneden. The com-
mand head(boneden) can be used to preview the first six rows of the data. The variables
ls1 and ls2 contain the lumbar spine BMD measurements for each of the lighter-smoking
twins and heavier-smoking twins, respectively. If you run the code boneden$ls1, you
will be able to view the lumbar spine BMD measurements for the lighter-smoking twins.
Similarly, boneden$ls2 will show the lumbar spine BMD measurements for the heavier-
smoking twins.
Hopper and Seeman express the differences in BMD for each pair of twins as a percentage
of the mean of the pair. Store these differences (expressed as a percentage of the twin pair
mean) in differences using the following R commands:
> attach(boneden)
> differences <- (ls2 - ls1) / ((ls1 + ls2) / 2)
Q UESTION : Why should we express the difference as a percentage? What happens if we
don’t?
Let’s look at the distribution of the differences. Both the boxplot and the histogram indicate
a mostly symmetric distribution, which is centred somewhere between -0.1 and 0, i.e. the
heavier-smoking twins have a slightly lower bone mineral density, although the overall
difference as a percentage is not great. There is a reasonable amount of variation, and in
many cases the lighter-smoking twin actually has lower BMD. So it’s not obvious from the
graphs that the difference is significant.
[Figure: boxplot (left) and histogram (right) of the BMD differences between lighter- and heavier-smoking twins; both panels are titled “BMD difference between lighter- and heavier-smoking twins” and span −0.4 to 0.2.]
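Plots like these can be produced directly from the differences, e.g. (titles and labels omitted):
> boxplot(differences)
> hist(differences)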
The command t.test(differences) will carry out the one-sample t-test on the differ-
ences to determine if there is a significant difference. Running this command produces the
following output:
> t.test(differences)
data: differences
t = -2.5388, df = 40, p-value = 0.01512
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-0.08889922 -0.01009415
sample estimates:
mean of x
-0.04949668
The above output tells us several things:
• The mean percentage difference is d̄ = −0.049;
• In a t-test of H0 : µD = 0, the test statistic is t = −2.539, which gives a p-value of 0.015
when compared to a t distribution with 40 degrees of freedom;
• A 95% confidence interval for µD is (−0.089, −0.010).
Make sure you can identify these values in the output. Observe that the upper bound
of the 95% confidence interval is less than 0. This means we can be confident that µD <
0. Since the p-value is less than 0.05, we reject the null hypothesis H0 : µD = 0 at the
5% significance level. Therefore, we can conclude that there is a significant difference in
the mean lumbar spine BMD between the heavier- and lighter-smoking twins, with the
heavier-smoking twins having lower mean BMD.
Q UESTION : What conclusions can you draw from this study with regards to the true rela-
tionship between smoking and bone density?
Problem Set 6
6.1 The following is supposed to be a random sample from a normal population with unknown
mean µ and known standard deviation σ = 8.
32.1 43.2 38.6 50.8 34.4 34.8 34.5 28.4 44.1 38.7
49.1 41.3 40.3 40.5 40.0 35.3 44.3 33.3 50.8 28.6
42.2 46.3 49.8 34.4 43.9 59.7 44.9 41.9 41.3 38.2
(a) i. Find the 95% confidence interval for µ and hence test the hypothesis µ = 45.
ii. Draw a diagram representing the confidence interval and the null hypothesis value
on the same scale.
iii. Find the p-value for testing µ = 45 vs µ ≠ 45. What is your conclusion?
iv. Define the z-statistic used to test µ=45. Use it to specify the values of x̄ for which
µ=45 would be rejected. What is your conclusion for the above sample?
(b) Repeat (a) using a 99.9% confidence interval and significance level 0.001.
6.2 Assume that a person’s haemoglobin concentration (g/100mL) follows a
N(µ=16, σ 2 =6.25) distribution, unless the person has anaemia, in which case the distribution
is N(µ=9, σ 2 =9). On the basis of a haemoglobin reading, an individual undergoing routine
investigation will be diagnosed as anaemic if their reading is below 12.5, and as non-anaemic
otherwise.
(a) Find the probability that an anaemic person is correctly diagnosed.
(b) Find the probability that a non-anaemic person is correctly diagnosed.
(c) In the context of a diagnostic test, relate the probabilities found in (a) and (b) to the con-
cepts of sensitivity, specificity, predictive positive value and predictive negative value, if
applicable.
(d) In the context of a hypothesis-testing problem, relate the probabilities found in (a) and
(b) to the concepts of type I error, type II error and power. State the null and alternative
hypothesis.
6.3 Of a random sample of n = 20 items, it is found that x = 4 had a particular characteristic.
Use the chart in the Statistical Tables or R to find an exact 95% confidence interval for the
population proportion. Repeat the process to complete the following table:
In testing the null hypothesis p = 0.3, what conclusion would be reached in each case?
6.4 In an examination of a microscopic slide, the number of cells of a particular type are counted
in twenty separate regions of equal (unit) area with the following results:
22 42 31 35 34 47 21 20 34 27
22 26 NA 26 28 37 20 38 23 32
Assume that this represents a random sample from a population that has a distribution that is
approximately normal with mean µ.
(a) Find a 95% confidence interval for µ.
(b) Find the p-value to test the hypothesis µ = 31. What decision do you reach?
6.5 Among 1000 workers in industry A, the expected number of cases of B over a 5-year period
is λ0 = 10 cases, assuming the population norm applies to this group. Suppose that 15
cases are observed.
(a) Does this represent significant evidence that the rate of occurrence of B in industry A is
different from the population norm? i.e. if λ denotes the mean number of cases among
the industry A workers, test the null hypothesis λ = λ0 .
(b) Obtain a 95% confidence interval for SMR = λ/λ0 .
(c) Obtain an estimate and a 95% confidence interval for the incidence rate (of disease out-
come B in industry A), α cases per thousand person-years.
6.6 Of 550 women employed at ABC Toowong Queensland during the past 15 years, eleven con-
tracted breast cancer in that time. After adjusting for a range of covariates (ages and other per-
sonal characteristics, including family history of breast cancer) the expected number of cases
of breast cancer is calculated to be 4.3. Test the hypothesis that there is an excess risk of breast
cancer at ABC Toowong.
The standardised morbidity ratio is SMR = λ/λ0 , where λ denotes the mean number of cases among the sub-population and λ0 denotes the mean number of cases expected among the sub-population if it were the same as the general population. Find an approximate 95% confidence interval for SMR in this situation.
6.7 The diagram below is a typical power curve — with values of µ on the horizontal axis and
probability on the vertical axis:
[Power curve; three values A, C and B are marked on the µ axis.]
Make a copy of this diagram and mark on it:
i. the significance level (i.e. the type I error probability);
ii. the power when µ = A;
iii. the type II error probability when µ = B.
What would happen to the power curve
iv. if n were increased ?
v. if the significance level were increased?
6.8 A new drug, ReChol, is supposed to reduce the serum cholesterol in overweight young indi-
viduals (20–29yo, BMI > 28). In a study to test this claim, a sample of such individuals are
given the drug for a period of six months, and their change in serum cholesterol is recorded (in
mg/100mL). Assume that these differences are normally distributed with ‘known’ standard deviation
of 38.5 mg/100mL.
Using a test with significance level 0.05, how large a sample is required to “detect” a mean
reduction of 10 mg/100mL with probability 0.95?
6.9 Among patients diagnosed with lung cancer, the proportion of patients surviving five years is
10%. As a result of new forms of treatment, it is claimed that this rate has increased. In a recent
study of 180 patients diagnosed with lung cancer, 27 survive five years, so that the estimated
survival proportion is 15%. Is there significant evidence in these data to support the claim?
(a) Define an appropriate parameter and set up the appropriate null hypothesis.
(b) Perform the hypothesis test, using the p-value method, at the 0.05 level.
(c) How large a sample would be required so that the probability of a significant result was
0.95 if the true (population) survival proportion was actually 15%?
6.10 Of 811 individuals employed at HQ centre during the past ten years, 13 contracted disease K.
After adjusting for a range of covariates, the expected number of cases of K is calculated to be
4.6. Test the hypothesis that there is no excess risk of K at the HQ centre.
6.11 In a randomised controlled experiment to examine the effect of a treatment on cholesterol lev-
els, a test comparing the mean cholesterol levels in the treatment group and the control groups
Chapter 7
COMPARATIVE INFERENCE
“One should always look for a possible alternative and provide against it.
It is the first rule of (statistical) investigation.” Sherlock Holmes, The Adventure of Black Peter, 1905.
7.1 Introduction
This chapter describes a standard situation where inference is required comparing two
populations. We begin with the case of comparing two population means, µ1 and µ2 .
In a one-sample test of a mean, we compare the mean µ of the population under study with
the mean µ0 of a general population which is considered as known. Hence, we only need
to take a sample from the population under study. It is much more common that the means
of both populations are unknown, and we take a sample from each population to compare
them.
It is common to consider the comparison of the effects of two treatments or interventions
or exposures or attributes. Then the populations to be compared are the hypothetical pop-
ulation with the first (treatment, intervention, exposure, attribute) and the hypothetical
population with the other.
There are two main ways in which treatments can be compared:
1. Paired comparisons — the two treatments are applied to pairs of experimental units
which have been matched so as to be as alike as possible (even the same experimental
unit at different times);
2. Independent samples — the two treatments are applied to separate sets of experi-
mental units randomly selected from the same population.
EXAMPLE 7.1.1: The following data were obtained from each of 15 matched
pairs of individuals. For each pair, one was randomly allocated treatment 1
and the other treatment 2. Investigate the hypothesis that the treatments are
equivalent.
pair  1   2   3   4   5   6   7   8   9   10  11  12  13  14  15
x1    50  59  45  40  53  52  55  48  45  50  51  56  54  41  55
x2    53  63  48  43  52  50  56  50  49  51  53  57  55  44  58
d     –3  –4  –3  –3   1   2  –1  –2  –4  –1  –2  –1  –1  –3  –3
Because the samples are matched we consider the sample of differences. This
has the effect of removing or at least reducing the effect of variation between
individuals. For the sample of differences we test whether the mean is zero,
and obtain a confidence interval.
[Dotplots: the two samples x1 and x2 on a common scale (40–65), and the differences x1 − x2 on a scale from −15 to 10.]
This reflects a general result: if the samples are paired then ignoring the pairing results in weaker
tests and conclusions. Also, in an experimental situation, producing paired data is more efficient for
the purposes of inference.
For the case of independent samples, we must instead consider the difference between the samples, i.e. compare sample characteristics: sample means, or sample proportions, or sample medians, etc. This is a two-sample problem, which requires new techniques.
DEFINITION 7.2.1.
1. Self-pairing: Measurements are taken on a single subject at two distinct points
in time (“before and after” a treatment or a cross-over design) or at two different
places (e.g. left and right arms for a skin treatment).
2. Matched pair: Two individuals are matched to be alike by certain characteristics
under control (e.g. age, sex, severity of illness, etc). Then one of a pair is assigned
(at random) to one treatment, the other to the other treatment.
Pairing is used to eliminate extraneous sources of variability in the response variable. This
makes the comparison more precise.
Let µ1 and µ2 be the means of the two populations to be compared. The method of analysis
is to consider the difference between each pair.
Let X1i and X2i denote the observations for the ith pair (from population 1 and population
2, respectively), and form the differences
Di = X1i − X2i ;
i.e. D1 , D2 , . . . , Dn are the differences between the elements in each pair.
Let µD denote the mean difference. Then µD = µ1 − µ2 . Consider the population of differ-
ences, which has mean µD ; we have a sample {D1 , . . . , Dn } from this population. We can
apply the one-sample t procedure to this sample of differences to construct a confidence
interval for µD or to do hypothesis testing: usually we want to test the null hypothesis H0 :
µD = 0.
EXAMPLE 7.2.1: One method for assessing the effectiveness of a drug is to note
its concentration in blood and/or urine samples at certain periods of time after
giving the drug. Suppose we wish to compare the concentrations of two types
of aspirin (types A and B) in urine specimens taken from the same person, 1
hour after he or she has taken the drug. Hence, a specific dosage of either type
A or type B aspirin is given at one time and the 1-hour urine concentration is
measured. One week later, after the first aspirin has presumably been cleared
from the system, the same dosage of the other aspirin is given to the same per-
son and the 1-hour urine concentration is noted. For each person, which drug
is given first is decided randomly. This experiment is performed on 10 people;
the results are below.
Person 1 2 3 4 5 6 7 8 9 10
Aspirin A 15 26 13 28 17 20 7 36 12 18
Aspirin B 13 20 10 21 17 22 5 30 7 11
Construct a 95% confidence interval for the mean difference in the concentra-
tions of aspirin A and aspirin B in urine specimens 1 hour after the patient has
taken the drug. Are the two mean concentrations significantly different?
We thus conclude that the two mean concentrations are significantly different
(aspirin A has a higher concentration).
In R:
> A <- c(15, 26, 13, 28, 17, 20, 7, 36, 12, 18) # data
> B <- c(13, 20, 10, 21, 17, 22, 5, 30, 7, 11)
> t.test(A, B, paired=TRUE) # paired t-test
Paired t-test
data: A and B
t = 3.6742, df = 9, p-value = 0.005121
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
1.383548 5.816452
sample estimates:
mean of the differences
3.6
A paired analysis is powerful if the data are taken from a suitably designed study. How-
ever, sometimes pairing is difficult or impossible, or simply not done. Then we need to
model the two samples as being taken independently from the two populations to be com-
pared. This requires new methods of comparison.
We assume that
• the two underlying distributions are normal:
X1 ~ N(µ1 , σ1²) and X2 ~ N(µ2 , σ2²).
• the two samples are independent random samples;
(Two samples are independent, if the selection of individuals that make up one sam-
ple does not influence the selection of those in the other sample.)
We use the following notation:
                    sample size   sample mean   sample standard deviation
sample 1 (on X1)        n1            x̄1               s1
sample 2 (on X2)        n2            x̄2               s2
The null hypothesis is usually H0 : µ1 = µ2 . The corresponding alternative hypothesis is
H1 : µ1 ≠ µ2 . More generally, we may wish to estimate the difference µ1 −µ2 (using a point
estimate or an interval estimate).
Estimation of µ1 − µ2
An obvious estimator is X̄1 − X̄2 .
This estimator is unbiased, since E(X̄1 −X̄2 ) = E(X̄1 ) − E(X̄2 ) = µ1 − µ2 .
Its variance is given by var(X̄1 −X̄2 ) = var(X̄1 ) + var(X̄2 ) = σ1²/n1 + σ2²/n2
(since X̄1 and X̄2 are independent).
Further, since both populations are assumed to be normally distributed, X̄1 and X̄2 are normally distributed, and hence so is X̄1 −X̄2 , i.e.
X̄1 −X̄2 ~ N( µ1 −µ2 , σ1²/n1 + σ2²/n2 ).
The distribution of the estimator (and in particular, its mean and standard deviation) is
what we need to construct confidence intervals and perform hypothesis tests.
If we know the variances σ1² and σ2² of the individual populations, then we know the stan-
dard deviation sd(X̄1 − X̄2 ). Standardization of the estimator, using the distribution above,
gives
Z = ( (X̄1 − X̄2 ) − (µ1 − µ2 ) ) / √( σ1²/n1 + σ2²/n2 ) ~ N(0, 1).
This can be used for constructing a confidence interval for µ1 − µ2 , or for testing a hypoth-
esis about µ1 − µ2 , if σ12 and σ22 are known.
For constructing a 95% confidence interval, we have (x̄1 − x̄2 ) ± 1.96 √( σ1²/n1 + σ2²/n2 ); and for testing H0 : µ1 = µ2 we use the statistic z = (x̄1 − x̄2 ) / √( σ1²/n1 + σ2²/n2 ).
If H0 is true, this statistic has a N (0, 1) distribution, and we reject H0 if its observed value z
is larger than 1.96 or smaller than -1.96. Alternatively, we can calculate the p-value by
p = 2 Pr(Z > z) if z > 0, and p = 2 Pr(Z < z) if z < 0, where Z ~ N(0, 1).
These procedures are the same as for the one-sample case; the only difference is the statistic that we are testing against the standard normal distribution.
We have X̄1 −X̄2 ~ N(µ1 −µ2 , 0.56), since var(X̄1 −X̄2 ) = 4.0/25 + 4.0/10 = 0.56;
and so a 95% CI for µ1 −µ2 is 1.69 ± 1.96√0.56, i.e. 0.22 < µ1 −µ2 < 3.16.
EXAMPLE 7.3.2: It is reported that x̄1 = 15.3 and x̄2 = 12.7 from samples of
n1 = 10 and n2 = 15. In the absence of any other information, we suppose that
σ1 = σ2 = 3, perhaps on the basis of past information or values from similar
data sets. So,
X̄1 −X̄2 ≈ N(µ1 −µ2 , 1.5) since var(X̄1 −X̄2 ) = 3²(1/10 + 1/15) = 1.5.
approx 95% CI for µ1 −µ2 : 2.6 ± 1.96√1.5 = (0.2, 5.0);
Usually, the sample will give us information concerning the variances. But in some cases,
we don’t even have that: in planning for example. Then we must make a plausible estimate
(educated guess) based on similar data and other evidence.
This sort of thing is often required for budgeting; or in applying for grants for research:
if there is a difference of at least 1 unit then we would like to be reasonably (say 95%)
sure of finding it.
var(X̄1 −X̄2 ) = σ1²/n1 + σ2²/n2 ≈ 25(1/n + 1/n) = 50/n.
Thus, the approx 95% CI is (x̄1 −x̄2 ) ± 1.96√(50/n).
So, we require 1.96√(50/n) = 1 ⇒ √n = 1.96√50 ⇒ n ≈ 192.
i.e. we need about 192 in each arm of the trial to achieve this level of accuracy.
Another option in planning is to specify the power of the test of µ1 =µ2 for a
specified difference. For example: find the sample size required in order that
the power is 0.9 when µ2 −µ1 = 2.5, using a test of significance level 0.05.
Let Z = (X̄2 −X̄1 )/√(50/n), so that Z ~ N(0, 1) when H0 is true. When µ2 −µ1 = 2.5, E(Z) = 2.5/√(50/n), and in order that we have a significance level of 0.05 and power 0.9, we require
2.5/√(50/n) = 1.96 + 1.2816 ⇒ n = 84.1.
Therefore we need 85 in each arm of the trial to achieve the specified power.
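The same calculation in R; a sketch:
> z <- qnorm(0.975) + qnorm(0.90)   # 1.96 + 1.2816
> 50*(z/2.5)^2                      # 84.1, so use 85 per arm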
It is not often the case that the variances are known, but this result can be useful as a large
sample approximation. Generalising the rules obtained in the above example, we get the
following sample size rules.
Assuming populations with equal variances σ², we require a sample of at least n from each population, where,
for a 100(1−α)% CI of half-width d,   n > 2 (z_{1−α/2})² σ² / d² ;
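Checking the rule against the example above (σ² = 25, half-width d = 1):
> 2*qnorm(0.975)^2*25/1^2   # about 192, as before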
In most cases of application of inference on difference of means, we won’t know the true
standard deviations of the populations. However, we may reasonably expect that σ1² = σ2² ,
since we are comparing similar measurements (treatments vs control, intervention A vs
intervention B). In these situations, any change will be a (relatively small) shift in the
mean. So, this is our standard assumption: we assume the variances are equal.
If σ1 = σ2 = σ, then we have var(X̄1 −X̄2 ) = σ²(1/n1 + 1/n2 ).
So, by analogy with the one sample case, we might hope that replacement of σ by the pooled estimator S, where S² = ((n1 −1)S1² + (n2 −1)S2²)/(n1 +n2 −2), would result in a t distribution; and it does, with n1 +n2 −2 degrees of freedom.
These are used for inference on µ1 −µ2 when the variances are unknown but assumed equal.
• To find a 95% confidence interval for µ1 −µ2 , we use:
Pr( c0.025 (t(n1+n2−2)) < ( (X̄1 − X̄2 ) − (µ1 − µ2 ) ) / ( S √(1/n1 + 1/n2 ) ) < c0.975 (t(n1+n2−2)) ) = 0.95.
Show that the pooled variance estimate is s² = 5.0. Hence obtain a 95% confidence interval for µ1 −µ2 . Test the null hypothesis (µ1 =µ2 ) and give the p-value.
How do we know if the variances of two populations are equal or not, if we don’t know
them? There are formal tests of the hypothesis H0 : σ1 = σ2 , but we do not study them
here. A good rule of thumb is that we can assume the variances are equal if the larger of
the two sample standard deviations is less than twice the smaller, i.e. if
1/2 ≤ s1/s2 ≤ 2.
If this happens, we can use the tests in the previous section. But sometimes, it doesn’t, and
then those tests are not applicable. However, we can still replace σ1 and σ2 individually by
S1 and S2 , to obtain the standard error for our estimator:
se(X̄1 −X̄2 ) = √( S1²/n1 + S2²/n2 ).
Fortunately, it turns out that changing the standard deviation to the standard error still results in a t distribution, albeit a slightly more complicated one:
T = ( (X̄1 −X̄2 ) − (µ1 −µ2 ) ) / se(X̄1 −X̄2 ) ≈ tk ,
where k is given by
1/k = β²/(n1 −1) + (1−β)²/(n2 −1),   where β = (s1²/n1 ) / (s1²/n1 + s2²/n2 ).
This is compared to the appropriate critical value or a p-value computed using a tk tail
probability.
E XERCISE . Consider the cholesterol level example given above. Use the unpooled ap-
proximate t-procedure to test the hypothesis that the mean cholesterol levels for the two
populations are different. (β = 0.7666, k = 62.53; t = 3.01, p = 0.004).
(Using the ‘safe’ value k = 39 gives p = 0.005;
c0.975 (t39 ) = 2.023 cf. c0.975 (t62 ) = 1.999.)
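The two R outputs below illustrate the pooled and unpooled analyses for two samples x and y. The calls themselves are not shown in the notes, but they are presumably t.test(x, y, var.equal=TRUE) and t.test(x, y) respectively, which is consistent with the integer df = 18 in the first output and the fractional df = 16.68 in the second.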
data: x and y
t = 3.2675, df = 18, p-value = 0.004277
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.4908838 2.2589641
sample estimates:
mean of x mean of y
1.4879612 0.1130373
data: x and y
t = 3.2675, df = 16.68, p-value = 0.004629
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.4858426 2.2640053
sample estimates:
mean of x mean of y
1.4879612 0.1130373
Landrigan et al. (1975)1 conducted a study examining the effects of exposure to lead on
the psychological and neurological well-being of children. The children in the study were
aged between 3 years 9 months to 15 years 11 months, and had lived within 6.6km of
a large, lead-emitting ore smelter in El Paso, Texas. The children were divided into two
groups: the control group consisted of 78 children with blood-lead levels of less than 40
µg/100mL in 1972 and 1973, and the lead-absorption group consisted of 46 children with
blood-lead levels of more than 40 µg/100mL in either 1972 or 1973. Each child completed
various neurological and psychological assessments. We are interested in one assessment
in particular: the number of taps on a metal plate that were recorded in a 10 second interval
while the child’s hand and wrist were held above the table (finger-wrist tapping). This test
was used to measure neurological function, specifically wrist flexor and extensor muscle
function, and was performed only by children over 5 years old.
Q UESTION : Is this an experiment or an observational study?
We will use an independent samples t-test to test whether there is a difference between the
mean finger-wrist tapping scores of children with low blood-lead levels and children with
high blood-lead levels. Let µ1 denote the mean finger-wrist tapping score of children with
blood-lead levels less than 40 µg/100mL, and let µ2 denote the mean finger-wrist tapping
score of children with blood-lead levels of more than 40 µg/100mL. The null hypothesis is
H0 : µ1 = µ2 (or µ1 −µ2 = 0); there is no difference between the two groups. The alternative
1 Landrigan, P. J., Whitworth, R. H., Baloh, R. W., Staehling, N. W., Barthel, W. F. and Rosenblum, B. F. (1975).
Neuropsychological dysfunction in children with chronic low-level lead absorption. The Lancet, 1, 708 – 715.
[Boxplots of finger-wrist tapping Score (20–80) for Group 1 (low blood-lead) and Group 2 (high blood-lead).]
Should we use a test that assumes equal variances or not? From the boxplots, the groups
look like they have similar spread. Let’s look at the respective standard deviations:
> sd(fwt.grp1)
[1] 12.05658
> sd(fwt.grp2)
[1] 13.15582
> sd(fwt.grp2)/sd(fwt.grp1)
[1] 1.091174
The standard deviations are very close to each other, so it appears that an equal-variance
test is reasonable.
We now perform a 2-sample t-test using t.test() to determine if there is a significant
difference. The output is given below.
> t.test(fwt.grp1, fwt.grp2, var.equal=TRUE)
A large study was conducted by researchers at Harvard Medical School to test whether aspirin taken regularly reduces mortality from cardiovascular disease. Every other day, physicians participating in
the study took either one aspirin tablet or a placebo. The study was blind —
those in the study did not know which they were taking. Over the course of the
study, the number of heart attacks was recorded for both groups. The results
were
We are interested in comparing the population proportions p1 and p2 : i.e. estimating p1 −p2
and testing p1 =p2 .
X1 ~ Bi(n1 , p1 ) and X2 ~ Bi(n2 , p2 ).
To test H0 : p1 = p2 , we use a z-test, based on: z = (est − 0)/se0 ≷ “2”,
with se = se(p̂1 −p̂2 ) = √( p̂1 (1−p̂1 )/n1 + p̂2 (1−p̂2 )/n2 ) . . . but what is se0 ?
se0 is the standard error, estimated assuming H0 to be true. If H0 is true, then p1 = p2 = p,
and the best estimate of p is p̂ = (x1 + x2 )/(n1 + n2 ) = 81/160 = 0.506,
so se0 = √( 0.506×0.494×(1/100 + 1/60) ) = 0.0816.
Here there is not much difference between se and se0 : they will be quite close if p̂1 and
p̂2 are not very different.
z = (est − 0)/se0 = 0.09/0.0816 = 1.102, p = 2 Pr(Z > 1.102) = 0.270.
Thus we do not reject H0 (since |z| < 1.96 or p > 0.05). There is no significant
evidence in these data to indicate that p1 ≠ p2 .
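In R, prop.test carries out the equivalent chi-square version of this test (u = z²; see Section 7.7); a sketch using the M/F counts tabulated below:
> prop.test(c(54, 27), c(100, 60), correct=FALSE)   # X-squared = 1.215, p-value = 0.2703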
95% CI: est ± “2” se = p̂1 −p̂2 ± 1.96 √( p̂1 (1−p̂1 )/n1 + p̂2 (1−p̂2 )/n2 ).
Note that p̂ = (n1 p̂1 + n2 p̂2 )/(n1 + n2 ), i.e. a weighted average of the p̂i (like the pooled average for the mean).
A A′
M 54 46 100 (p̂1 = 0.54)
F 27 33 60 (p̂2 = 0.45)
81 79 160 (p̂ = 0.506)
Such a table is called a contingency table and is examined in more detail in Section 7.7.
It can be generalised to allow more rows (corresponding to more groups, or populations)
and more columns (corresponding to a categorisation of the attribute).
Summary:
p̂1 = 104/11037 = 0.009423, p̂2 = 189/11034 = 0.017129; p̂1 −p̂2 = −0.007706.
p̂ = (104+189)/(11037+11034) = 293/22071 = 0.013275.
se0 (p̂1 −p̂2 ) = √( 0.013275×0.986725×(1/11037 + 1/11034) ) = 0.001541.
∴ est = −0.0077, se0 = 0.0015.
z = est/se0 = −0.007706/0.001541 = −5.001, p = 0.000.
Notation:
              population rate   person-years   number of cases   sample rate
Sample 1           α1                t1          X1 (obs. x1 )     α̂1 = x1 /t1
Sample 2           α2                t2          X2 (obs. x2 )     α̂2 = x2 /t2
EXAMPLE 7.6.1:
Hence we would reject H0 . There is significant evidence here that the rate is
greater for exposed individuals.
category                      C1    C2    . . .   Ck
sample: observed frequency    f1    f2    . . .   fk      (Σ fj = n)
H0 : probability              p1    p2    . . .   pk      (Σ pj = 1)
(model) expected frequency    np1   np2   . . .   npk     (Σ npj = n)
On the basis of the sample (observed frequencies), we wish to test H0 , i.e., to test the good-
ness of fit of the hypothesis to the observed data.
EXAMPLE 7.7.1: A first-year class of 200 students each selected “random dig-
its”, with the results given below. Do the digits occur with equal frequency?
i.e., test H0 : pi = 1/10, i = 0, 1, . . . , 9.
i 0 1 2 3 4 5 6 7 8 9
obs freq fi 12 16 15 25 13 21 17 32 25 24
exp freq npi 20 20 20 20 20 20 20 20 20 20
In this case p = Pr(χ²9 > 18.70) = 0.028 (using R); Table 8 indicates that p is slightly larger than 0.025.
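In R, chisq.test applied to a single vector of counts tests equal cell probabilities by default; a sketch:
> digits <- c(12, 16, 15, 25, 13, 21, 17, 32, 25, 24)
> chisq.test(digits)   # X-squared = 18.7, df = 9, p-value = 0.028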
type                  SY      SG      WY      WG
observed frequency    315     108     101     32
H0 : probability      9/16    3/16    3/16    1/16
expected frequency    312.75  104.25  104.25  34.75
The total number is n = 556; so the expected frequencies are given by 556 × 9/16 = 312.75, 556 × 3/16 = 104.25, etc.
data: observed
X-squared = 0.47002, df = 3, p-value = 0.9254
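The original call is not shown in the notes, but the output above is consistent with a command of the form:
> observed <- c(315, 108, 101, 32)
> chisq.test(observed, p=c(9, 3, 3, 1)/16)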
Is X ~ Bi(10, 0.1)? In other words, test the null hypothesis H0 : X ~ Bi(10, 0.1).
x      0      1      2      ≥3
obs    54     79     45     22
exp    69.74  77.48  38.74  14.01
If H0 is true then U ~ χ²3 , so we reject H0 if u > 7.82. From the above table, u = Σ (o−e)²/e = 8.66, hence we reject H0 . There is evidence in these data that the distribution of X is not Bi(10, 0.1).
Fitting distributions (hypothesis specified except for one or more parameters)*
d
Consider the null hypothesis X = Bi(10, p), where p is unspecified. To fit this distribution,
we need to estimate p from the sample, and use this estimate to determine expected fre-
quencies under H0 . In estimating p we lose another degree of freedom, since we are using
the data to enable the model to fit better. More generally, each parameter estimated results
in another constraint, and another degree of freedom lost.
EXAMPLE 7.7.5: For the observations above, is X ~ Bi(10, p)?
x      0      1      2      ≥3
obs    54     79     45     22
exp    55.70  75.95  46.61  21.65
Thus, if H0 is true then U ≈ χ²2 , so we reject H0 if u > 5.99.
From the sample, u = Σ (o−e)²/e = 0.24, and hence we do not reject H0 .
We take this as an indication that the Binomial distribution fits the data, but
with p = 0.12, rather than p = 0.10.
Generally, in fitting a distribution in this way,
U = Σ (Fi − npi )²/(npi ) = Σ (o−e)²/e ≈ χ²(k−m−1),
where k = number of classes and m = number of parameters estimated.
This contingency table has two rows and two columns and is called a 2×2 table.
In this case the classification variables are treatment (aspirin and placebo) and
disease status (heart attack and no heart attack).
The null hypothesis we test is that disease status classification (heart attack
or no heart attack) is independent of the treatment classification (aspirin or
placebo). If H0 were rejected, it would provide evidence of a relation between
treatment and outcome.
obs freq A A′
G 47 23 70
G′ 13 17 30
60 40 100
On the basis of this sample, we wish to test the hypothesis that the classifications are inde-
pendent; i.e., H0 : Pr(A ∩ G) = Pr(A) Pr(G) and H1 : Pr(A ∩ G) ≠ Pr(A) Pr(G).
Note: Independence can be expressed in the form Pr(A | G) = Pr(A | G ′ ), i.e. the
probability of attribute A is the same in G or in G ′ , which is equivalent to p1 = p2 .
exp freq       A          A′
G           n pG pA    n pG qA     n pG
G′          n qG pA    n qG qA     n qG
            n pA       n qA        n
where pA + qA = 1 and pG + qG = 1.
To evaluate the expected frequencies, we need to assign values to pG and pA . We use p̂G = 70/100 and p̂A = 60/100; so that q̂G = 30/100 and q̂A = 40/100. Then we obtain:
exp freq A A′
G 42 28 70
G′ 18 12 30
60 40 100
e.g. eG∩A = 100 × [ (70/100) × (60/100) ] = (70×60)/100 = 42; in general, e = (row sum × column sum)/N.
The other expected frequencies could be worked out similarly, e.g. eG∩A′ = (70×40)/100 = 28, but it is easier to obtain them by subtraction: eG∩A′ = 70 − 42 = 28.
Note: these “expected frequencies” represent estimated means, so there is no need for them to be
integers (although they are in this example).
A A′
G1 54 46 100
G2 27 33 60
81 79 160
We wish to test whether there is a difference between the groups. So the null
hypothesis is that the group classification and the attribute classification are
independent. For these data, under the null hypothesis of independence, the
expected frequencies are given by:
A A′
G1 50.625 49.375 100
G2 30.375 29.625 60
81 79 160
and so,
u = Σ (o − e)²/e = 3.375²/50.625 + 3.375²/49.375 + 3.375²/30.375 + 3.375²/29.625 = 1.215.
Under H0 (no difference between the groups), U ~ χ²1 and so we would reject H0 if u > 3.84. There is no significant evidence here that there is a difference between the groups: p = Pr(χ²1 > 1.215) = 0.270.
R can be used to analyse contingency tables, using chisq.test(). Consider again the
example above. When tabulated data are given, use the following:
> my.table <- matrix(c(54, 27, 46, 33), ncol=2) # code the data as a matrix
> results <- chisq.test(my.table, correct=FALSE)
data: my.table
X-squared = 1.2152, df = 1, p-value = 0.2703
Note that chisq.test() admits an option for continuity correction. Actually the analysis saved in results contains quite a few outputs:
> names(results) # elements of ’results’
[1] "statistic" "parameter" "p.value" "method" "data.name"
"observed" "expected" "residuals" "stdres"
For example, to extract the expected frequencies write:
> results$expected # ’expected’ element of ’results’
[,1] [,2]
[1,] 50.625 49.375
[2,] 30.375 29.625
The χ²-test for a 2×2 contingency table is actually identical to the z-test for testing equality of proportions, since u = z² and a χ²1 variable is the square of a N(0, 1) variable. In the example, z = 1.102 and u = 1.215 = 1.102².
For a 2×2 contingency table with frequencies
a  b
c  d
it can be shown that
z = (ad − bc)√n / √( (a+b)(c+d)(a+c)(b+d) ),   and u = z².
data: X
X-squared = 25.014, df = 1, p-value = 5.692e-07
Thus we reject H0 . There is significant evidence here that the rate of heart attacks differs between the aspirin and placebo groups.
Odds Ratio
There is another useful measure of relationship in this situation that we have met: the odds
ratio, θ. Based on the above table, we obtain an estimate of the odds ratio:
θ̂ = ad/bc,   and we observe that (ad ≷ bc) ⇔ θ̂ ≷ 1.
EXAMPLE 7.8.4: For the above 2×2 contingency table with frequencies
54  46
27  33
we find:
ln θ̂ = ln 54 − ln 46 − ln 27 + ln 33 = 0.3610;
se(ln θ̂) = √( 1/54 + 1/46 + 1/27 + 1/33 ) = 0.3280.
An approximate 95% CI for ln θ is 0.3610 ± 1.96×0.3280 = (−0.28, 1.00), so that a 95% CI for θ is (0.75, 2.73), which includes 1, consistent with the earlier non-significant comparison of these two groups. For the aspirin data, however, the corresponding confidence interval for θ excludes 1, so there is significant evidence there of a negative relationship between aspirin and heart attacks, i.e. more aspirin, fewer heart attacks.
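A sketch of the odds-ratio calculation in R (the vector tab and its a, b, c, d ordering are our own naming):
> tab <- c(54, 46, 27, 33)                   # a, b, c, d
> theta <- (tab[1]*tab[4])/(tab[2]*tab[3])   # estimated odds ratio: 1.43
> se.log <- sqrt(sum(1/tab))                 # se(ln theta-hat): 0.328
> exp(log(theta) + c(-1, 1)*1.96*se.log)     # approx 95% CI for theta: about (0.75, 2.73)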
The χ2 test in the r × c case is a straightforward extension of the χ2 test used in the 2 × 2
case. We now have a table of the form:
      B1   B2   . . .   Bc
A1                            a1
A2                            a2
. . .                         . . .
Ar                            ar
      b1   b2   . . .   bc    n
A B C
“little or no relief” 11 13 9
“moderate relief” 32 28 27
“total relief” 7 9 14
data: X
X-squared = 3.81, df = 4, p-value = 0.4323
TV watching time
Fitness 0 1–2 3–4 >5
Fit 35 101 28 4
Not fit 147 629 222 34
u = (35−25.48)²/25.48 + · · · + (222−215.0)²/215.0 = 6.161,
df = 3, p = Pr(χ²3 > 6.161) = 0.104.
The focus of the study was on the use of antacids that contain aluminum.
Alz vs level of use:
3  5  8
9  3  2
There appears to be some evidence here of a relation, but it can’t be tested using χ² as 3 cells have e < 5.
Q UESTION : What conclusion would you draw if there were a significant result?
Problem Set 7
7.1 The effect of a single 600 mg dose of Vitamin C versus a sugar placebo on the muscular en-
durance (as measured by repetitive grip strength trials) of thirteen male volunteers (19-23 years
old) was evaluated. The study was conducted in a double-blind manner, with crossover. That
is, two tests were carried out on each subject, once after taking vitamin C and once after taking
the sugar placebo.
Subject Placebo Vitamin C Difference
1 170 248 –78
2 180 218 –38
3 372 349 23
4 288 264 24
5 636 593 43
6 172 117 55
7 278 185 93
8 279 185 94
9 258 122 136
10 363 159 204
11 417 145 272
12 678 387 291
13 699 245 454
mean 368.5 247.5 121.0
stdev 188.7 132.0 148.8
(a) The following questions refer to the design of the study.
i. What is the response variable? the explanatory variable(s)?
ii. How has comparison been used in the study?
iii. How has control been used in the study?
iv. The study was conducted in a ’double-blind’ manner. What does this mean?
v. How should randomisation have been used in the study?
vi. Give one point in favour of, and one point against, the use of a crossover design for
this study.
(b) Draw a boxplot of the differences of the data, clearly labelling all relevant points, includ-
ing any outliers, should they exist. To help, here is the five number summary:
Min. 1st Qu. Median 3rd Qu. Max.
-78.0 23.5 93.0 238 454.0
i. What assumption are you looking to check in the boxplot and what do you conclude?
ii. Suggest an alternative plot that may be useful and describe what you would expect
to see if the assumption you are looking to check is reasonable.
(c) Carry out a t-test on the differences.
i. State the null and alternative hypotheses, calculate the value of the test statistic, and
give a range for the p-value (e.g. 0.05 < p < 0.1).
ii. State your conclusions, in non-statistical terms.
7.2 A colleague has analysed the data from Problem 7.1, and shows you the R output below.
data: Placebo and VitaminC
t = 1.89, df = 24, p-value = 0.070
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-10.83 252.83
(a) Compare the point estimate of the mean difference in test scores that you obtained (in
Problem 7.1) with the result your colleague found.
(b) Compare the 95% confidence interval for the mean difference in test scores that you ob-
tained (in Problem 7.1) with the result your colleague found.
(c) Why are the results different?
(d) Which analysis is more appropriate? Explain why.
7.3 Volunteers who had developed a cold within the previous 24 hours were randomised to take
either zinc or placebo lozenges every 2 to 3 hours until their cold symptoms were gone (Prasad
et al., 2000). For the twenty-five participants who took zinc lozenges, the mean overall duration
of symptoms was 4.5 days and the standard deviation was 1.6 days. For the twenty-three
participants who took placebo lozenges, the mean overall duration of symptoms was 8.1 days
and the standard deviation was 1.8 days.
(a) For the two groups calculate the difference in the sample means and the standard error
of the difference in means.
(b) Compute a 95% confidence interval for the difference in mean days of overall symptoms
for the placebo and the zinc lozenge treatments, and write a sentence interpreting the
interval. (Assume that the standard deviations for the placebo and the zinc lozenge treatments
are the same. Does this seem reasonable?)
(c) Does the interval computed in (b) give evidence that the population means are different?
Explain.
7.4 The effect of exercise on the amount of lactic acid in the blood was examined in a study. Eight
men and seven women who were attending a week-long training camp participated in the
experiment. Blood lactate levels were measured before and after playing three games of rac-
quetball, and shown below.
Men Women
Player 1 2 3 4 5 6 7 8 Player 1 2 3 4 5 6 7
Before 13 20 17 13 13 16 15 16 Before 11 16 13 18 14 11 13
After 18 37 40 35 30 20 33 19 After 21 26 19 21 14 31 20
(a) Does exercise change the blood lactate level for women players? Test this.
(b) Estimate the mean change in blood lactate level for male racquetball players using a 95%
confidence interval.
(c) Is the mean change in blood level the same for men and women players? Test this.
7.5 The following observations are obtained on two treatments:
treatment C 34.7 26.7 32.0 52.7 45.4 31.5 20.3 23.4 35.9 42.1
treatment K 35.6 28.5 35.7 54.8 47.1 33.5 19.2 27.2 37.2 41.5
It can be assumed that the observations are independent and normally distributed with equal
variances. Let δ denote the increase in mean that results from using treatment K rather than
treatment C.
(a) Test for the difference in the effects of the two treatments using an independent samples
t-test. Derive a 95% confidence interval for δ.
(b) Now suppose that the columns actually correspond to blocks (a–j): for example, a ‘block’
might be one individual who is given first one treatment, and then at some later time, the
other treatment.
a b c d e f g h i j
treatment C 34.7 26.7 32.0 52.7 45.4 31.5 20.3 23.4 35.9 42.1
treatment K 35.6 28.5 35.7 54.8 47.1 33.5 19.2 27.2 37.2 41.5
Test for the difference in the effects of the two treatments using a paired-samples t-test.
Derive a 95% confidence interval for δ. Why is this interval narrower than the one derived
in (a)?
7.6 Consider the independent samples
sample 1 27 34 37 39 40 43
sample 2 41 44 52 93
(a) Draw dotplots of the two samples.
(b) Show that a two-sample t-test does not reject the null hypothesis of equal means.
(c) If the observation 93 is found to be a mistake: it should have been 53. Show that a two-
sample t-test now rejects the null hypothesis of equal means.
(d) Explain the difference in the results of the tests.
7.7 A recent study compared the use of angioplasty (PTCA) with medical therapy in the treatment
of single-vessel coronary artery disease. At the six-month clinic visit, 35 of 96 patients seen in
the PTCA group and 55 of 104 patients seen in the medical therapy group have had angina.
Is there evidence in these data that PTCA is more effective than medical therapy in preventing
angina?
Find a 95% confidence interval for the difference in proportions.
7.8 In a test to evaluate the worth of a drug in treating a particular disease, the following results
were obtained in a double-blind trial:
no improvement improvement lost to survey
placebo 22 23 5
drug 12 30 8
Do these data indicate that the drug has brought about a significant increase in the improve-
ment rate? Explain your reasoning.
7.9 In a study, 500 patients undergoing abdominal surgery were randomly assigned to breathe one
of two oxygen mixtures during surgery and for two hours afterwards. One group received a
mixture containing 30% oxygen, a standard generally used in surgery. The other group was
given 80% oxygen. Wound infections developed in 28 of the 250 patients who received 30%
oxygen, and in 13 of the 250 patients who received 80% oxygen.
Is there evidence to conclude that the proportion of patients who develop wound infection is lower for the 80% oxygen treatment than for the 30% oxygen treatment? Use a p-value approach.
7.10 A study on motion sickness in buses reported that seat position within a bus may have some
effect on whether one experiences motion sickness. The following table classifies each person
in a random sample of bus passengers by the location of their seat and whether nausea was
reported.
Location
front middle rear
nausea 58 166 193
no nausea 870 1163 806
Based on these data, can you conclude that there is an association between seat location and nausea?
7.11 A case-control study with 100 cases of disease D, and 100 matched controls, yielded the fol-
lowing results with respect to exposure E:
E E′
case, D 63 37 (ncase = 100)
control, D′ 48 52 (ncontrol = 100)
i. Test the hypothesis that the proportion of individuals with exposure E is the same in both
populations (cases and controls).
ii. Find an estimate and a 95% confidence interval for the odds ratio.
7.12 Data relating to oral-contraceptive use and the incidence of breast cancer in the age-group 40–
44 years in the Nurses’ Health Study are given in the table below:
OC-use group number of cases number of person-years
current users 13 4 761
past users 164 121 091
never users 113 98 091
(a) i. Compare the incidence rate of breast cancer in current-users versus never-users us-
ing a z-test, and report a p-value.
ii. Find a 95% confidence interval for the rate ratio.
(b) i. Compare the incidence rate of breast cancer in past-users versus never-users using a
z-test, and report a p-value.
ii. Find a 95% confidence interval for the rate ratio.
Chapter 8
“‘Is there any other point to which you would wish to draw my attention?’ ‘To the curious incident of
the dog in the night-time.’ ‘The dog did nothing in the night-time.’ ‘That was the curious incident.’”
Sherlock Holmes, The Silver Blaze, 1894.
8.1 Introduction
In this chapter, we consider bivariate numerical data: that is, data for two numerical vari-
ables, and we seek to investigate the relationship between the variables.
[Scatter plot of y against x.]
In this chapter we are concerned with the case of bivariate numerical variables. However,
in general, a bivariate data set may involve variables which may be either numerical or
categorical.
Note: A categorical bivariate data set consists of n pairs of data points:
{(ci , di ), i = 1, 2, . . . , n},
where c and d are categorical variables, such as gender or attribute. ci and di denote the
values of the categorical variable for individual i, thus (ci , di ) = (F, D′ ) indicates that indi-
vidual i is a female who does not have attribute D.
Such a data set is most simply summarised by a contingency table (see §7.5) with rows representing the c-categories and columns the d-categories:
      D′   D
F     15   40    55
M     20   25    45
      35   65   100
Of course, either of the categorical variables may have more than two possible values,
resulting in a contingency table with more rows or columns.
A scatter plot for bivariate categorical variables is singularly unhelpful! If each observation
is represented by a point, then the result is a rectangular array of points, as in the left
diagram below.
Note that for the purposes of plotting, each category has to be allocated numerical values.
This is simply a coding device.
There needs to be some mechanism for displaying how many times each point is observed.
(There can be a similar problem, though clearly to a lesser degree, for numerical variables
where several individuals give the same values.) One way to overcome this problem is
to “jitter” the points. This means that instead of plotting at (x, y), we plot at (x+e, y+f ),
where e and f are (small) random perturbations. The extent of the jittering can be varied
to suit the situation.
A preferable alternative is to modify the size of the “point” to represent the number of
observations at the specified point (cf. Gapminder plots).
Another representation is the Mosaic diagram, which takes two forms corresponding to
row percentages and column percentages: see diagram below.
[Mosaic diagrams for the F/M × D′/D table: one based on row percentages, the other on column percentages.]
8.2 Correlation
In Chapter 2, we saw that the correlation r indicates the relationship between the x and y variables, and indicates how the points are distributed within a scatter plot. Recall how the appearance of a scatter plot changes as r moves through −1, −0.75, . . . , 1: from a straight line with a negative slope at r = −1, through negative relationships with more scatter, to a random spread at r = 0, and then through to a straight line with positive slope at r = 1.
Properties of r
1. −1 ≤ r ≤ 1
2. r > 0 indicates a positive relationship; r < 0 indicates a negative relationship.
The magnitude of r indicates the strength of the (linear) relationship.
3. r = ±1 if, and only if, y = a + bx with b ≠ 0.
r = 1 if b > 0 and r = −1 if b < 0.
4. r (like x̄ and s) is affected by outliers.
x̄ and sx indicate the location and spread of the x-data: about 95% in (x̄ − 2sx , x̄ + 2sx );
ȳ and sy indicate the location and spread of the y-data: about 95% in (ȳ − 2sy , ȳ + 2sy );
the correlation, r, or rxy , indicates how the points are distributed in this region.
However, the appearance of the scatter plot is affected by the scale used on the axes in
plotting the graph! To make scatter plots comparable, you should try to arrange the scale
so that the spread of the y is about the same as the spread of the x.
The following three scatter plots plot identical points, but using different scales. The corre-
lation is 0.564.
The graph at the left is preferred, as the apparent spreads of point in the horizontal and
vertical directions are similar.
E XERCISE . The following statistics are available for a bivariate data set:
n = 100; x̄ = 55.4, sx = 14.1; ȳ = 42.8, sy = 6.3; r = −0.52.
Using only the given information, indicate the form of the scatter plot for these data.
The shape of the scatter plot is indicated by the negative correlation: r = −0.52. It will
resemble in form the scatter plot for r = −0.5. The scale is specified by the means and
standard deviations: (i.e. about 95%) of the data have 27.2 < x < 83.6 and 30.2 < y < 55.4.
To compute a correlation, use a calculator or a computer. On R, use cor(x,y) and enter
the names of the variables for which a correlation is required.
Just so you know:
sample variance of x:  s²x = (1/(n−1)) Σ (x − x̄)² ;
sample variance of y:  s²y = (1/(n−1)) Σ (y − ȳ)² ;
sample covariance of x, y:  sxy = (1/(n−1)) Σ (x − x̄)(y − ȳ) ;
sample correlation of x, y:  rxy = sxy /(sx sy ) = Σ (x−x̄)(y−ȳ) / √( Σ(x−x̄)² Σ(y−ȳ)² ).
Even if we never use the formula to compute r, it still has its uses:
r = (1/(n−1)) Σ (xi − x̄)(yi − ȳ) / (sx sy ) = (1/(n−1)) Σ ((xi − x̄)/sx )((yi − ȳ)/sy ) = (1/(n−1)) Σ xsi ysi ,
where xsi and ysi denote the standardised scores. This indicates that the points that con-
tribute most to the correlation are those with both standardised scores large. It also tells us
that r is not affected by location and scale.
x 49.6 70.8 55.3 69.2 51.9 54.6 59.8 55.3 65.1 75.8
y 56.1 61.9 58.3 61.8 59.4 56.6 59.5 58.5 64.5 65.5
In R:
> x <- c(49.6, 70.8, 55.3, 69.2, 51.9, 54.6, 59.8, 55.3, 65.1, 75.8)
> y <- c(56.1, 61.9, 58.3, 61.8, 59.4, 56.6, 59.5, 58.5, 64.5, 65.5)
> length(x)
[1] 10
> mean(x)
[1] 60.74
> sd(x)
[1] 8.935597
> mean(y)
[1] 60.21
> sd(y)
[1] 3.15223
> cor(x,y) # correlation between x and y
[1] 0.8847845
Maybe, once only, it may be worthwhile to check, using a spreadsheet say, that Σ(x − x̄)² = 718.60, Σ(x − x̄)(y − ȳ) = 224.30, Σ(y − ȳ)² = 89.43; and hence that r = 224.30/√(718.60×89.43) = 0.8848. But then again, maybe not! You will not need to compute correlation like this. This is simply to indicate that this is what your computer or calculator does when it is evaluating r.
For example, in R:
> sum((x-mean(x))^2)
[1] 718.604
Q UESTION : The two blood measures are supposed to be comparable. Why is the correla-
tion not sufficient to ensure this, no matter how close to 1 it gets?
Thus the product (xi − x̄)(yi − ȳ) computes two properties. First, it tells us how far xi and yi
have deviated from their respective location measures, the sample means. Second, it tells
us whether or not the deviation of both xi and yi from their sample means is in the same
direction. That is, if yi takes a high (low) value whenever xi has a high (low) value. This
property is shown in the figure below. The two vertical lines are the sample means.
For the data in the left panel the yi s increase (relative to ȳ) as the xi s increase, so the co-
variance is positive. The reverse happens in the right panel, so the covariance is negative.
Comparing with the sample variance, we can think of sxy = (1/(n−1)) Σ (xi − x̄)(yi − ȳ) as a joint deviation of x and y from the sample means.
An issue with sxy is that it combines information about the spread of x and y with the
strength of their relationship. This can pose some challenges to its utility in practical use.
So it is scaled (divided) by the sample standard deviations of the individual variables, x
and y, so that this modified metric, the sample correlation coefficient (r), measures only the
strength of their relationship, and takes values between −1 and 1.
In the same way that the sample mean (x̄) is an estimate of a population mean (µ) for a
univariate population, the sample correlation (r) is an estimate of the population correla-
tion (ρ) for a bivariate population. The population correlation, ρ, is a measure of the linear
association between two variables.
Properties of ρ
1. −1 ≤ ρ ≤ 1
2. ρ = ±1 if, and only if, Y = a + bX with b ≠ 0.
ρ = 1 if b > 0 and ρ = −1 if b < 0.
3. If X and Y are independent then ρ = 0.
Note: the converse is not true, in general, but it is true when (X, Y ) is bivariate normal.
Like µ, the population correlation is generally unknown, and we seek to estimate it using
a sample drawn from the population. The correlation obtained from the sample is r.
                     population       sample
mean                 µX , µY          x̄, ȳ
standard deviation   σX , σY          sx , sy
correlation          ρ                r
                     (−1 ≤ ρ ≤ 1)     (−1 ≤ r ≤ 1)
For the above example, using the diagram in the Tables we obtain the 95% CI for
ρ as (−0.62, −0.16), which excludes zero. So we can conclude that ρ=0 would
be rejected.
Correlation provides information on the strength of the linear relationship between two
numerical variables. To further explore the relationship between two numerical variables,
we develop a model that relates one variable to the other. We can then use this model to
predict one variable from the other.
The regression of y on x is E(Y | x), i.e. the expectation of Y given the value of x. For example, Y may be the measured pressure of a gas in a given volume x, the measurement being subject to error. Here, we might expect that E(Y | x) = c/x.
The simplest form of regression model, and the only one that we will consider, is
E(Y | x) = α + βx and var(Y | x) = σ²,
so that the regression of y on x is linear, with constant variance.
However, in some cases, it is possible to transform data to produce a straight line regres-
sion. For example:
1. In the above example on pressure and volume, if we write x∗ = 1/x, then E(Y | x∗ ) = cx∗ , which is a linear model with intercept = 0.
2. If y ≈ αx^β , then it might be appropriate to take logs and consider the model E(Y ∗ | x∗ ) = α∗ + βx∗ , where Y ∗ = ln y, x∗ = ln x, α∗ = ln α.
Note: such a transformation affects the assumption of equal variance.
x 2.5 5 10 15 17.5 20 25 30 35 40
y 63 58 55 61 62 37 38 45 46 19
[Scatter plot of y against x for these data.]
In fitting a regression line, our aim is to use x to predict y; so the fitted line is the one that
gives the best prediction of y for a given x.
This is not necessarily the line that best fits the relationship between the variables; and it is
definitely not the line that would be used to predict x given y, if that were required.
Traffic volumes (cars/hr) and concentrations of carbon-monoxide (CO) in parts per million were recorded at certain street corners and are shown below:
cars/hr 980 1040 1135 1450 1510 1675 1890 2225 2670 2935 3105 3330
CO conc 9.0 6.8 7.7 9.6 6.8 11.3 12.3 11.8 20.7 19.2 21.6 20.6
• The fitted model is used to estimate the average value of Y for a given value of x.
• It is also used to predict a future observation of Y for a given value of x.
We use the “least squares” method to find the fitted model. That is, we consider all straight
lines y = a + bx and select that line for which
∆ = ∆(a, b) = Σᵢ₌₁ⁿ (yi − a − bxi)²

is a minimum. Note that yi − a − bxi is just the vertical distance of the i-th data point (xi, yi) from the line y = a + bx. The resulting line we denote by µ̂(x) = α̂ + β̂x; α̂ and β̂ denote the "least squares" estimates of α and β. Fortunately, there are exact formulas for the estimates of α and β:

β̂ = Σᵢ₌₁ⁿ (xi − x̄)(yi − ȳ) / Σᵢ₌₁ⁿ (xi − x̄)²,   α̂ = ȳ − β̂ x̄,   so that µ̂(x) = ȳ + β̂(x − x̄).
C HALLENGE . Can you derive the least squares estimates of α̂ and β̂?
Note that β̂ = r (sy/sx), so β̂ can be evaluated from sx, sy and r. In fact, the statistics n, x̄, ȳ, sx, sy and r are sufficient for any computation relating to straight line regression!
A neat form for the least squares regression line is ys = r xs, where the subscript s denotes a standardised variable:
(y − ȳ)/sy = r (x − x̄)/sx  ⇔  y = [ȳ − (r sy/sx) x̄] + (r sy/sx) x,  i.e. y = α̂ + β̂x.
So β̂ = −0.7953 × (14.19/12.53) = −0.9009 and α̂ = 48.4 − (−0.901)×20.0 = 66.4177.
Therefore µ̂(x) = 66.42 − 0.901x, so that, for example, the mean of Y when x = 16 is estimated to be 52.0 (= 66.42 − 0.901×16).
[Figure: the scatterplot of y against x, with the fitted line µ̂(x) = 66.42 − 0.901x superimposed.]
Actually, α̂ and β̂ are available on many calculators; and from the computer, so
you don’t even need to do the calculation from n, x̄, ȳ, sx , sy and r!
In R, we use the function lm():
On the left-hand side of the symbol "~" we include the response; on the right-hand side we put one or more predictors.
> summary(fit)
Call:
lm(formula = y ˜ x)
Residuals:
Min 1Q Median 3Q Max
-11.400 -5.400 -1.787 7.474 11.348
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 66.4177 5.6481 11.759 2.5e-06 ***
x -0.9009 0.2428 -3.711 0.00595 **
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
In order to carry out inference, i.e. to find confidence intervals and test hypotheses, we
need expressions for the variances of the estimators.
Regression is concerned with estimating the mean of Y for a given x. So for statistical inference to do with regression, we treat the xs as constants, and the Y s as random variables.
Let B̂ denote the estimator of β. Under the assumption that var(Y | x) = σ²:
var(Ȳ) = σ²/n,   var(B̂) = σ²/K,   where K = Σ(x − x̄)²;   and Ȳ and B̂ are independent.
[Note that K = (n−1) sx².]
[We use K to indicate that B̂ behaves a bit like Ȳ, but with n replaced by K.]
It follows that, if M̂(x) denotes the estimator of µ(x), that is,
M̂(x) = Ȳ + (x − x̄)B̂,
then, since Ȳ and B̂ are independent, we have
var M̂(x) = (1/n + (x − x̄)²/K) σ².
Note that Â = M̂(0) (i.e. α̂ = µ̂(0)).
Residuals
The residuals are êi = yi − µ̂(xi) = yi − α̂ − β̂xi, the differences between the observed and fitted values. The residuals can be used to check the fit of the model, since they should behave like the model errors, ei, i.e. independent observations on N(0, σ²).
If they do not, then the model should be questioned.
◦ independence? Look for patterns in residual plots: any pattern is an indication of non-
randomness. For example, if the residuals vs fitted values follow a curve, this suggests a
curved regression, rather than a straight line. If the residuals vs observation order (often
time) shows a trend, it suggests that the regression is varying with time.
◦ mean zero? If not there is a mistake, since av(êi ) = av(yi − α̂ − β̂xi ) = ȳ − α̂ − β̂ x̄ = 0.
◦ equal variances? This may show in a residual plot, though only with a lot of points. If the
residuals are close to the horizontal axis, it suggests a small error variance; if they are widely
spread it indicates a large error variance. The spread should be “about the same” for all fitted
values.
◦ normality? Use QQ-plots or normal-plots for the residuals. The normal plot of the residuals
should be close to a straight line if the errors are normally distributed.
R produces residual graphs that help to check these assumptions. The command plot(fit),
with fit being the regression output, gives the following plots.
[Figure: residuals vs fitted values, and a normal QQ-plot of the standardized residuals, as produced by plot(fit).]
The residual sum of squares (error SS) is given by: d² = Σᵢ₌₁ⁿ (yi − α̂ − β̂xi)².
And, to estimate σ², we use the error mean square (error MS): s² = d²/(n−2),
which is unbiased for σ². The divisor is n−2, since there are two parameters to be estimated in the regression model. Another way of saying this is that {ê1, . . . , ên} has n−2 degrees of freedom, since Σêi = 0 and Σxi êi = 0.
Note: Since the sample mean of the residuals is zero, the sample variance of the residuals is (1/(n−1)) Σ êi². But, as two parameters have to be estimated to make the residual mean zero, we choose to divide by n−2 rather than n−1.
Note: A computational formula for s², for hand computation: s² = ((n−1)/(n−2)) sy² (1 − r²), which again shows that the summary statistics n, sy and r suffice.
Suppose now that we wish to predict a future observation of Y at a given x. To do that, a prediction interval for Y is required: an interval within which we are 95% sure that a future observation will lie.
To obtain a prediction interval, we use:
Z = Y∗ − µ̂(x) ∼ N(0, σ² (1 + 1/n + (x − x̄)²/K)),
where Y∗ denotes the future observation.
These intervals can be obtained in R using the function predict() which al-
lows us to enter the value of x for which intervals are required. For this exam-
ple, we enter 16. This gives the confidence interval and prediction interval at
the end of the regression output.
$se.fit
[1] 3.044395
Note that the option interval specifies either a confidence interval for µ(16)
(the mean response at x = 16) or the prediction interval for Ŷ (16) (the new
response at x = 16).
Hypothesis testing
It is standard to carry out a test of the null hypothesis H0 : β = 0 (testing the utility of the
model: is x of any use in predicting y?). If β is not significantly different from 0 then the
(linear) relationship between x and y is weak and knowing the value of x will not be of
much use in predicting the value of y.
All of this is most easily done using a statistical package, such as R, which produces output
like the following:
> summary(fit)
Call:
lm(formula = y ˜ x)
Residuals:
Min 1Q Median 3Q Max
-11.400 -5.400 -1.787 7.474 11.348
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 66.4177 5.6481 11.759 2.5e-06 ***
x -0.9009 0.2428 -3.711 0.00595 **
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
Note that as well as estimates of parameters, standard errors (“Std. Error”) and p-values
for testing H0 : α=0 and H0 : β=0, R gives the goodness of fit F-statistic and its p-value at
the bottom, which we ignore at this stage. The F test in the last line is just a test of H0 :
β=0, which is equivalent to the t test given above. The value of the F -statistic is equal to
the value of t2 (13.77 = (−3.71)2 ).
It is common to give a value of R2 (R-squared):
R² = regression SS / total SS,
which is called the coefficient of determination. R2 can be thought of as the proportion
of the variation in y that is accounted for by the regression on x. In the above, R2 =
1146.6/1812.4 = 0.633.
An alternative (adjusted) version is obtained using 1 − R²adj = s²/sy²; from which we obtain R²adj = R² − (1 − R²)/(n − 2). It follows that R²adj ≤ R². In the above, R²adj = 0.633 − (1 − 0.633)/8 = 0.587.
Exercise. The table below gives observations on variables x and y:

   x   42  35  50  43  48  62  31  36  44  39  55  48
   y   12   8  14   9  11  16   7   9  12  10  13  11

(a) Plot the data to check that the relationship between x and y is linear.
    [Figure: scatterplot of y against x.]
(b) Fit a straight line regression by the method of least squares.
    α̂ = −0.950, β̂ = 0.269.
(c) Find a 95% confidence interval for the slope of the regression line.
    s = 1.101, se(β̂) = 0.0377; 95% CI for β: (0.185, 0.353).
Exercise. A sample of n = 100 paired observations is obtained:

   i     1     2     3   ···   100
   xi  57.8  66.6  63.1  ···  59.1
   yi  51.1  55.1  58.5  ···  53.9

(The summary statistics, used in the answers below, are x̄ = 64.114, ȳ = 54.924, sx = 10.5781, sy = 4.6461 and r = 0.8830.)

(a) Estimate the straight line regression which would be used to predict y using x.
(b) Estimate the error variance σ².
(c) Find the standard error for the slope estimate, se(β̂).
(d) The 95% confidence interval for µ(60) can be expressed in the form m ± c √(s²/n + f se(β̂)²). Specify values for m, c and f.

Answers:
(a) β̂ = 0.8830 × (4.6461/10.5781) = 0.3878; α̂ = 54.924 − 0.3878×64.114 = 30.06.
(b) s² = (99/98) × 4.6461² × (1 − 0.8830²) = 4.804.
(c) se(β̂) = √(4.8042/11077.6404) = 0.0208.
Example. Measurements of reticulocyte percentage (ret) and lymphocyte count (lymph) for 14 patients are given below:
ret lymph
3.6 2240
2.0 2678
0.3 1820
0.3 2206
0.2 2086
3.0 2299
0.2 1276
1.0 2088
2.2 2013
2.7 2600
3.2 2684
1.6 1840
2.5 1760
1.4 1950
> summary(mydata)
ret lymph
Min. :0.200 Min. :1276
1st Qu.:0.475 1st Qu.:1868
Median :1.800 Median :2087
Mean :1.729 Mean :2110
3rd Qu.:2.650 3rd Qu.:2284
Max. :3.600 Max. :2684
> fit <- lm(lymph ˜ ret, data = mydata) # fit linear model
> summary(fit)
Call:
lm(formula = lymph ˜ ret, data = mydata)
Residuals:
Min 1Q Median 3Q Max
-558.90 -200.56 -36.36 294.66 519.15
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1798.91 162.69 11.057 1.2e-07 ***
ret 179.97 78.35 2.297 0.0404 *
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
ii. Estimate the slope of the regression line, and interpret it in the context.
β̂ = 180.0. This is the estimated increase in the mean lymphocyte count per unit percentage increase in
reticulocytes (for 0 < %ret < 4).
iii. Find a 95% confidence interval for this slope.
179.97 ± 2.179×78.35 = (9, 351).
iv. Suppose %reticulocyte = 2.0%, what do you expect the lymphocyte count to be?
µ̂(2.0) = 2160 . . . 1400 < y(2.0) < 2920.
v. If y(2.0) = 2678, what is the residual for this observation?
ê = 2678 − 2158.8 ≈ 520. Mark it on the scatter plot.
vi. Specify a 95% confidence interval for the mean lymphocyte count for x = 2.
(1960, 2360).
vii. Specify the prediction error for x = 2.
pe = √(s² + se²) = √(113753 + 92.6²) ≈ 350; also (2921 − 1397)/(2×2.179) ≈ 350.
viii. Find a 95% confidence interval for the mean lymphocyte count for x = 3.0.
µ̂(3.0) = 1798.9 + 179.97×3.0 = 2338.8 ≈ 2340;
se[µ̂(3.0)] = √(113753/14 + (3 − 1.729)² × 78.35²) = 134.32 ≈ 134;
95% CI: 2338.8 ± 2.179×134.32 = (2050, 2630).
Kleinbaum and Kupper (1978)1 provide a data set containing measurements on age, weight
and blood fat content for 25 individuals. We are interested in the relationship between age
(x) and blood fat content (y). This data set is available as BFC.txt. Load this data using
the commands
> BFC <- read.table(’BFC.txt’, header=T)
> Age <- BFC$Age
> Bfc <- BFC$BloodFatContent
This stores the age and blood fat content of the 25 individuals into Age and Bfc, respec-
tively. A scatterplot of Age and Bfc (shown below) can be obtained using the command
> plot(Age, Bfc, xlab = "Age", ylab = "Blood Fat Content", las = 1)
From this scatterplot, we see that there is a positive correlation between age and blood fat
content. That is, older people tend to have higher blood fat content, and younger people
tend to have lower blood fat content. In fact, using the command cor(Age, Bfc), we
find that the sample correlation coefficient is r = 0.837, and from the statistic-parameter di-
agram a 95% confidence interval for this correlation coefficient is (0.66, 0.93). This indicates
that there is a strong positive linear relationship between Age and Bfc.
Now we will fit a least squares regression line to the data. The command summary(model
<- lm(Bfc ˜ Age)) will fit the linear regression model, store it into the model object,
and produce the summary output below.
> summary(model <- lm(Bfc ˜ Age))
Call:
lm(formula = Bfc ˜ Age)
Residuals:
¹ Kleinbaum, D.G. and Kupper, L.L. (1978). Applied Regression Analysis and Other Multivariable Methods. Duxbury Press.
[Figure: scatterplot of Blood Fat Content against Age.]
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 102.5751 29.6376 3.461 0.00212 **
Age 5.3207 0.7243 7.346 1.79e-07 ***
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
Suppose that we want to estimate the mean blood fat content of a 55 year old individual. Using our model, this mean blood fat content is estimated to be µ̂(55) = 102.58 + 5.32×55 ≈ 395.2.
Also given in the above summary output are the test statistics (t value column) and p-values (Pr(>|t|) column) for hypothesis tests of H0: α = 0 against H1: α ≠ 0, and H0: β = 0 against H1: β ≠ 0. These p-values, which are both much lower than 0.05, lead us to conclude that the true α and β are both significantly different from 0, at the 5% level of significance.
The above scatterplot is reproduced below with the least squares regression line superim-
posed. This line can be added to the existing scatterplot using the command:
> abline(model$coefficients[1], model$coefficients[2])
[Figure: scatterplot of Blood Fat Content against Age, with the least squares regression line superimposed.]
Do you think that the least squares regression line fits the data well? From the summary
output, R2 = 0.701. Therefore, 70.1% of the variation in blood fat content can be ex-
plained by age. We should also check that there are no violations of the model assump-
tions. The residuals vs fitted values and normal QQ plot are produced using the command
plot(model, which = c(1, 2)). Both figures are shown below.
[Figure: residuals vs fitted values plot, and normal QQ-plot of the standardized residuals.]
There are no obvious patterns in the residuals vs fitted values plot, and the normal QQ plot
is approximately linear. This suggests that the form of our linear regression model, and the
assumption that the random errors have a N (0, σ 2 ) distribution, are appropriate for these
data.
A confidence interval for µ(55) (the mean blood fat content for 55 year olds) can be found using the predict() function.
Problem Set 8
8.1 This problem investigates the relationship between FEV (litres) and age (years) for boys. Some
R output is shown below. The sample has 336 boys and their mean age is 10.02 years.
Call:
lm(formula = FEV ˜ age)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0736 0.1128 0.65 0.514
age 0.2735 0.0108 25.33 0.000 ***
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
8.2 The table below gives the corresponding values of variables x and y.
x 5 6 7 8 10 11 12 13 14 14
y 28 20 26 28 24 16 22 10 12 14
For these data, check the following calculations:
x̄ = 10, ȳ = 20; sx = 3.333, sy = 6.667, r = −0.8.
i. Assuming that E(Y | x) = α + βx and var(Y | x) = σ 2 , obtain estimates of α and β using
the method of least squares. Plot the observations and your fitted line.
ii. Show that s2 = 18. Hence obtain se(β̂), and derive a 95% confidence interval for β.
iii. Find the sample correlation and, assuming the data are from a bivariate normal popula-
tion, find a 95% confidence interval for the population correlation.
8.3 A random sample of n = 50 observations are obtained on (X, Y ). For this sample, it is found
that x̄ = ȳ = 50, sx = sy = 10 and the sample correlation rxy = −0.5.
i. Indicate, with a rough sketch, the general nature of the scatter plot for this sample.
ii. On your diagram, indicate the fitted line for the regression of y on x.
iii. Give an approx 95% confidence interval for the population correlation.
8.4 A data set containing 6 columns of data was created by the English statistician Frank Anscombe. The scatterplots arising from these data are sometimes called the "Anscombe quartet".
x1 y1 y2 y3 x4 y4
10 8.04 9.14 7.46 8 6.58
8 6.95 8.14 6.77 8 5.76
13 7.58 8.74 12.74 8 7.71
9 8.81 8.77 7.11 8 8.84
11 8.33 9.26 7.81 8 8.47
14 9.96 8.10 8.84 8 7.04
6 7.24 6.13 6.08 8 5.25
4 4.26 3.10 5.39 19 12.50
12 10.84 9.13 8.15 8 5.56
7 4.82 7.26 6.42 8 7.91
5 5.68 4.74 5.73 8 6.89
i. Carry out four simple linear regressions: y1 on x1 , y2 on x1 , y3 on x1 and y4 on x4 . What
do you notice about the results?
ii. Look at the four scatterplots of the data with the corresponding fitted line. Anscombe
concocted these data to make a point. What was the point? What would you conclude
about the appropriateness of simple linear regression in each case?
iii. What are the observed and predicted values of y4 at x4 = 19? Change the y4 value for
this datum to 10 and refit the regression. What are the observed and predicted values at
x4 = 19 now?
8.5 Researchers speculate that the level of a particular type of chemical found in a patient’s blood
affects the size of a hepatocellular carcinoma. Experimenters take a random sample of 25 pa-
tients and both assess the size of their tumours (cm) and test for the levels of this chemical in
their blood (mg/L). The mean chemical level was found to be 45mg/L. A simple linear regres-
sion is fitted; a partial R output is below.
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.2981 0.05134 ? ?
x -0.15123 0.00987 ? ?
---
Residual standard error: 1.213
Multiple R-squared: 0.895
(a) What is the response variable? What is the explanatory variable?
(b) Write down the general model equation, stating all assumptions about the errors. How
could you graphically check each of these assumptions?
(c) i. Write down an estimate of the slope of the regression line.
ii. Write a sentence interpreting this slope in the context of the question.
(d) Use the R output to determine whether there is evidence, at the 5% level, that the chemical
level affects tumour size.
(e) Based on the result of your test in (d), would a 95% confidence interval for the true slope
contain zero? Explain why or why not.
(f) Suppose the chemical level in a patient’s blood is 25mg/L.
i. What do you expect the tumour size to be?
ii. If the actual size is 8cm, calculate the residual for this observation.
iii. Construct a 90% confidence interval for the expected tumour size.
(g) Write a sentence interpreting the R-sq value in the context of the question. Using it,
calculate the sample correlation coefficient.
8.6 Low-density lipoprotein (LDL) cholesterol has consistently been shown to be related to car-
diovascular disease in adults. Researchers are interested in factors that may be associated with
LDL cholesterol in children. One such factor is obesity, which is measured by the ponderal in-
dex (kg/cm3 ). 162 children are sampled and it is found that the sample correlation coefficient
between LDL cholesterol and ponderal index is 0.22.
(a) Find a 95% confidence interval for the correlation.
(b) A simple linear regression can be fitted to the data, with the true slope denoted by β.
Based on (a), what do you expect to be the result of a hypothesis test: H0 : β = 0 versus
H1 : β 6= 0?
(c) Calculate the coefficient of determination and write a sentence interpreting it in the con-
text of the question.
8.8 Two measures are evaluated for each of fifty cases: the sample correlation between these mea-
sures is evaluated as –0.40. Find a 95% confidence interval for the correlation. Is this evidence
of a relationship between the two measures? Explain.
8.9 Is cardiovascular fitness (as measured by time to exhaustion running on a treadmill) related
to an athlete’s performance in a 20 km ski race? The following data were collected in a study:
x = treadmill run time to exhaustion (in minutes) and y = 20 km ski time (in minutes).
x 7.7 8.4 8.7 9.0 9.6 9.6 10.0 10.2 10.4 11.0 11.7
y 71.0 71.4 65.0 68.7 64.4 69.4 63.0 64.6 66.9 62.6 61.7
(a) The correlation coefficient is r = −0.796. Test the hypothesis that the two variables are
uncorrelated.
(b) A simple linear regression analysis is carried out in R, giving the following output:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 88.80 5.750 15.44 0.000
ret -2.334 0.591 ? ?
---
Residual standard error: 2.188
Multiple R-squared: 0.636
i. The row headed ret has missing values for the t-ratio and p-value. Explain what
these numbers pertain to (no need to calculate them) and whether they bear any
relation to the test in (a).
ii. Suppose an athlete has treadmill run time to exhaustion of 10 minutes. Give a 95%
prediction interval for his 20 km ski time. (The x-values have mean x̄ = 9.66.)
8.10 (FEV vs age data from Problem 8.1: some inference questions.)
i. The null hypothesis states that there is no relation between FEV and age. State this in
terms of an appropriate parameter and test it.
ii. Obtain a 95% CI for the slope of the regression line of FEV on age.
iii. Obtain an estimate of the error variance σ 2 .
iv. Obtain a 95% CI for the mean FEV of 10 year-old boys.
v. Obtain a 95% prediction interval for the FEV of a 10 year-old boy.
REVISION PROBLEMS
R1.1 (a) Write two sentences to compare and contrast observational study and experiment.
(b) A “randomised controlled trial” is the gold-standard for medical experiments.
i. Give an explanation of the importance of randomisation to convince a doubting sci-
entist of its value.
ii. What is meant by “control”? Why is it important?
(c) It is thought that exposure E is a possible cause of D, so that E and D ought to be pos-
itively related. However, a recent study showed a negative relationship between E and
D. It was discovered that this was due to a confounding factor C, which is a known cause
of D, and which was strongly negatively correlated with E among the individuals used
in the study.
Draw a diagram to illustrate this situation.
R1.2 (a) A sample of nine observations is supposed to be a random sample from a Normal popu-
lation. The order statistics for this sample are as follows:
40.4 48.8 54.8 59.2 64.1 65.0 68.7 72.7 75.1
i. Evaluate the sample median and sample quartiles. Hence draw the boxplot for these
data.
ii. If the data are a random sample from a N(µ, σ 2 ) population, explain why E(X(1) ) ≈
µ − 1.28σ.
iii. Sketch a Normal QQ-plot for these data, clearly labelling the axes. Indicate how
estimates of µ and σ could be obtained on your diagram.
(b) For the data in (a), the calculator gives x̄ = 60.977778 and s = 11.376926. Specify a point
estimate and a 95% interval estimate for µ.
R1.3 (a) Write two sentences to compare and contrast independent and mutually exclusive.
(b) The risk ratio is p1/p2, and the odds ratio is p1(1 − p2) / ((1 − p1) p2).
i. If the odds ratio is 2, and p1 = 0.1, find the risk ratio.
ii. If the odds ratio is 2, and p1 → 0, what happens to the risk ratio?
iii. If the odds ratio is 2, and p1 → 1, what happens to the risk ratio?
iv. A case-control study gives an estimate of the odds ratio relating exposure E and
disease D of 2.0. What can you say about the relative risk of D with and without
exposure E?
R1.4 (a) Write two sentences to compare and contrast prevalence and incidence.
(b) Individuals with disease D have chemical L at high levels in the bloodstream. For these individuals, L ∼ N(40, 4²). There is a threshold beyond which the body has an overload problem.
i. Find Pr(L > 50).
ii. Suppose that the threshold is actually a random variable, T ∼ N(50, 2²), which is independent of L. Find Pr(L > T).
(c) i. T1 and T2 are independent random variables with
E(T1 ) = E(T2 ) = θ and sd(T1 ) = 1, sd(T2 ) = 2.
Let T = wT1 + (1 − w)T2 . Show that var(T ) = 5w2 − 8w + 4 and hence show that
var(T ) is minimised when w = 0.8.
ii. Two independent random experiments have been carried out, each with the inten-
tion of estimating the parameter θ. The results are:
experiment 1 n1 = 40 θ̂1 = 50.0 se(θ̂1 ) = 1.0
experiment 2 n2 = 10 θ̂2 = 55.0 se(θ̂2) = 2.0.
Use these two results to give the optimal estimate of θ and specify its standard error.
R1.5 (a) Write two sentences to compare and contrast standard deviation and standard error.
(b) A prevalence study, i.e. a survey, collects data from a sample at a specific time-point (al-
though the time-point may be relative, as the survey may take a week or so to complete).
In such a survey of 2000 individuals, 350 of them had attribute H. Find a 95% confidence
interval for the prevalence of H.
(c) A cohort of 400 individuals is followed for a period of five years. The total observed
person-time was 1200 person-years, and 36 cases were observed.
i. Give a reason why the person-time is not 400×5 = 2000 person-years.
ii. Find an approximate 95% confidence interval for the incidence rate (cases per person-
year).
R1.6 (a) Write two sentences to compare and contrast p-value and power.
(b) Suppose that Z ∼ N(θ, 1).
It is planned to use an observation on Z to test the hypothesis H0: θ = 0.
Consider the test: “reject H0 if |Z| > 1.96”.
i. Show that this test has significance level 0.05.
ii. Show that this test has power 0.80 when θ = 2.80.
(c) A random sample of n observations is obtained on X ∼ N(µ, 5²), i.e. σ is assumed known (and σ = 5).
Let Z = (X̄ − 40)/(5/√n). This is the test statistic used to test the hypothesis µ = 40.
i. Find E(Z) when µ = 41.
ii. How large a sample is required so that the z-test of µ = 40 based on X̄ with signifi-
cance level 0.05 has power 0.80 against the alternative µ = 41?
R1.7 A recent study compared the use of angioplasty (PTCA) with medical therapy in the treatment
of single-vessel coronary artery disease. At the six-month clinic visit, 10 of 40 patients seen in
the PTCA group and 20 of 40 patients seen in the medical therapy group, have had angina.
(a) Is there evidence in these data that PTCA is more effective than medical therapy in pre-
venting angina? Test the hypothesis that the probabilities are the same in the two groups.
Give a p-value and state your conclusion.
(b) Using these data, find an estimate of the odds ratio relating PTCA and angina.
By estimating ln OR, obtain a 95% confidence interval for the odds ratio.
R1.8 Two measures are evaluated for each of fifty cases: the sample correlation between these mea-
sures is evaluated as –0.40. Find a 95% confidence interval for the correlation. Is this evidence
of a relationship between the two measures? Explain.
Copy, roughly, your selected QQ-plot and indicate on your copy the labels and scales on
each axis.
(c) i. Give a rough approximation for a 95% confidence interval for the population mean.
ii. Give a rough approximation for a 95% prediction interval for an observation from
this population. Hint: 4/80 = 5%.
Note: You should not assume the population distribution is normal.
R2.3 A research paper modelled the Bayley Psychomotor Development Index in five year-old children as having a normal distribution with mean 100 and standard deviation 10. Assume this
model is correct, and that a random sample of 200 observations is to be obtained.
(a) Indicate values that you would expect to observe for the five-number summary for this
sample, i.e. the minimum, the lower quartile, the median, the upper quartile and the
maximum, briefly explaining your reasoning.
Hence draw a likely boxplot for such a sample.
(b) An observation is nominated as a ‘potential outlier’ if it is more than 1.5 IQR above the up-
per quartile, or 1.5 IQR below the lower quartile, where IQR denotes inter-quartile range.
i. Show that, for a sample from a normally distributed population, the probability of a
potential outlier is 0.0070.
ii. What is the probability of at least one potential outlier in the sample of 200?
R2.4 (a) If 15% of university students are left handed, find the probability that, of a tutorial class
of sixteen students, at most one is left handed.
What assumptions have you made in evaluating this probability?
(b) Three research papers report estimates of µ and their standard errors. These results are
used to produce the following meta-analysis table used to obtain the optimal estimate
based on these three reported results.
R2.7 (a) A treatment for migraine is trialled in a double-blind randomised experimental study
involving 400 patients: 200 receive a placebo (P) and 200 receive the treatment (T). Three
months later, the patients report whether they were worse, the same, or better on the
medication they received. The results were as follows:
worse same better
T 40 100 60 200
P 60 100 40 200
100 200 100 400
To examine whether the treatment is having an effect, we test the hypothesis that treatment and outcome classification are independent using a χ² test. Let U = Σ (o − e)²/e denote the χ² statistic used to test for independence.
Show that, for the table above, u = 8.0; and give an approximate p-value.
What is your conclusion?
(b) i. When comparing two independent samples, we wish to test the null hypothesis that
the samples are drawn from the same population (H0 ). One way to do this is to use
an independent samples t-test.
What assumption is made about the common population distribution in applying
the independent samples t-test?
ii. Consider the following data
sample 1: 27, 28, 31, 33, 35, 45
sample 2: 41, 46, 94
We can use a rank test to test H0 . To do this, we use
z = (w̄1 − w̄2) / √( (1/12) N(N+1) (1/n1 + 1/n2) )
where w̄1 denotes the average rank for sample 1, w̄2 the average rank for sample 2, and N = n1 + n2.
Show that, for these data, z = −2.07, and hence that the rank test indicates rejection
of the null hypothesis at the 5% significance level.
R3.1 (a) A study is to be conducted to evaluate the effect of a drug on brain function. The evalu-
ation consisted of measuring the response of a particular part of the brain using an MRI
scan. The drug is prescribed in doses of 1, 2 and 5 milligrams. Funding allows only 24
observations to be taken in the current study.
In a meeting to decide the design of the study, the following suggestions are made con-
cerning the conduct of the experiment. For each of the suggestions say whether or not
you think it is appropriate giving a reason for your answer.
(A) Amy suggests that a placebo should be used in addition to the three doses of the
drug. What is a placebo and why might its use be desirable?
(B) Ben says that the study should be conducted as a double-blind study. Explain what
this means, and why it might be desirable.
(C) Claire says that she is willing to be “the subject” for the study (i.e. to take different
doses of the drug and to have her response measured as often as is needed). Give
one point in favour of, and one point against this proposal.
(D) Don suggests that it would be better to have 24 subjects, and to allocate them at ran-
dom to the different drug doses. Give a reason why this design might be better than
the one suggested by Claire, and briefly explain how you would do the randomisa-
tion.
(E) Erin claims that it would be better to use 8 subjects, with each subject taking, on
separate occasions, each of the three different doses of the drug. Give one point
in favour of, and one point against this claim, and explain how you would do the
required randomisation.
(b) i. An exposure E is thought to cause disease outcome D. Suppose that C is a pos-
sible confounding factor. How would this be represented on a causal relationship
diagram?
ii. Smoking is thought to cause heart disease. Dr.W. claims that an individual’s level of
exercise may be a confounding factor. Represent the relationship between smoking
(S), heart disease (D) and above-average exercise level (X) on a causal relationship
diagram.
Mr.H. states that X should not be considered as a confounder. Holmes is right again!
Explain why exercise level should not be considered as a confounding factor for the
relation between smoking and heart disease.
R3.3 (a) Suppose events D and E are such that Pr(E) = 0.4, Pr(D | E) = 0.1, Pr(D | E ′ ) = 0.2.
i. Find Pr(D).
ii. Find Pr(E | D).
iii. Are D and E positively related, not related or negatively related? Explain.
iv. Specify the odds ratio for D and E.
(b) A new test for a disease, C, was applied to 100 individuals with disease C, and 100
individuals who do not have C. The following results were obtained:
R3.4 (a) Suppose that, in a population, 30% of individuals have attribute A. A random sample
of 240 is selected from this population. Let X denote the number of individuals in the
sample with attribute A. Find an approximate 95% probability interval for X.
(b) A cohort of individuals is observed for a total of 10 000 person-years. If the incidence rate
of disease B is 0.0022 per person-year, give an approximate 95% probability interval for
the number of cases of B in this cohort.
(c) Among healthy individuals in a particular population, the serum uric acid level Y mg/100L
is distributed as N(5.0, 0.82 ).
i. Find a 99% probability interval for Y .
ii. Find Pr(Y > 6.0).
iii. Find Pr(Y > 7.0 | Y > 6.0).
R3.5 Twelve independent observations are obtained on X ∼ N(µ, 1), i.e. we have a random sample of n = 12 from a Normal population, for which the variance is known: σ² = 1. The sample mean for this sample is denoted by X̄.
To test H0: µ = 10, we use the decision rule: "reject H0 if |X̄ − 10| > 0.6".
(a) Find a 95% probability interval for X̄ if µ=10.
(b) Find the significance level of this test.
(c) Find the p-value if x̄ = 10.8.
(d) Find the power of the test if µ = 11.
(e) Find a 95% confidence interval for µ if x̄ = 10.8.
(f) Find a 95% prediction interval for X if x̄ = 10.8.
R3.6 (a) A study was conducted to examine the efficacy of an intramuscular injection of cholecal-
ciferol for vitamin D deficiency. A random sample of 30 sufferers of vitamin D deficiency
were chosen and given the injection. Serum levels of 25-hydroxyvitamin D3 (25OHD3 )
were measured at the start of the study and 4 months later. The difference X was calcu-
lated as (4-month reading – baseline reading).
For the sample of differences: sample mean = 15.0 and sample standard deviation = 18.4.
Construct a 95% confidence interval for the mean difference. What can you conclude?
(b) We are interested in estimating the prevalence of attribute B among 50-59 year-old women.
Suppose that in a sample of 2000 such women, 400 are found to have attribute B. Obtain
a point estimate and a 95% confidence interval for the prevalence.
(c) Of 1200 individuals employed at the PQR centre during the past ten years, 28 contracted
disease K. After adjusting for a range of covariates, the expected number of cases of K is
calculated to be 16.0.
i. Test the hypothesis that there is an excess risk of K at the PQR centre.
ii. The standardised morbidity ratio, SMR = µ/µ0 , where µ denotes the mean number
of cases among the subpopulation, and µ0 denotes the mean number of cases ex-
pected among the subpopulation if it were the same as the general population. Find
an approximate 95% confidence interval for SMR in this case.
R3.7 The data below are obtained from a trial comparing drug A, drug B and a placebo C. The
table indicates the number of individuals who reported improvement (I) with the treatment,
and those who did not.
improvement no improvement
drug A 15 10 25
drug B 10 15 25
placebo C 10 40 50
35 65 100
Let p1 = Pr(I | A), p2 = Pr(I | B), and p3 = Pr(I | C).
(a) The following (incomplete) R output was obtained to answer the question: “Is there a sig-
nificant difference between the proportion reporting improvement in the three groups?”,
i.e. test H0 : p1 = p2 = p3 .
Cell Contents
|-------------------------|
| N |
| Expected N |
| Chi-square contribution |
|-------------------------|
|
| [,1] | [,2] | Row Total |
-------------|-----------|-----------|-----------|
[1,] | 15 | 10 | 25 |
| 8.750 | 16.250 | |
| 4.464 | 2.404 | |
-------------|-----------|-----------|-----------|
[2,] | 10 | 15 | 25 |
| 8.750 | 16.250 | |
| 0.179 | 0.096 | |
-------------|-----------|-----------|-----------|
[3,] | 10 | 40 | 50 |
| 17.500 | 32.500 | |
| 3.214 | 1.731 | |
-------------|-----------|-----------|-----------|
Column Total | 35 | 65 | 100 |
-------------|-----------|-----------|-----------|
i. Explain how the values 8.750 and 4.464 can be calculated.
ii. Complete the test giving the p-value, and state your conclusion.
(b) Test the null hypothesis p1 = p2 .
(c) Assume that we increase the number of subjects treated with drug A and drug B, so that n of each are tested. Find the sample size n required in order that we obtain a confidence interval for p1 − p2 of half-width less than 0.15, i.e. the confidence interval should take the form p̂1 − p̂2 ± h, where h ≤ 0.15.
R3.8 A random sample of 50 observations are obtained on the bivariate normal data (X, Y ). For
this sample, it is found that x̄ = ȳ = 30, sx = sy = 10 and the sample correlation rxy = 0.4. The
regression of Y on x is given by E(Y | x) = α+βx.
i. Indicate, with a rough sketch, the general nature of the scatter plot for this sample.
ii. Show that, for these data:
K = Σ(x − x̄)² = 4900 and sxy = (1/(n−1)) Σ(x − x̄)(y − ȳ) = 40.
Hence, or otherwise, find β̂ and α̂.
iii. On your diagram, indicate the fitted regression line.
iv. Given that the estimate of the error variance, s2 = 85.75, find a 95% confidence interval
for β.
v. Give an approximate 95% confidence interval for the population correlation.
R4.2 A five-year study was conducted to look at the effect of oral contraceptive (OC) use on heart
disease in women 40–49 years of age. All women were aged 40–44 years at the start of the
study. There were 5624 OC users at baseline (i.e. the start of the study), who were followed for
a total of 23 058 person-years, and of these women, 31 developed a myocardial infarction (MI)
during the five-year period. There were 9472 non-users, followed for 40 730 person-years, and
19 of them developed an MI over the five-year period.
n t x
OC-users 5624 23 058 31
non-users 9472 40 730 19
i. Is this a designed experiment or an observational study?
ii. What are the experimental/study units?
iii. Is this a prospective study, retrospective study or a cross-sectional study?
iv. All the women in the study are aged 40–44. Explain why this was done.
v. Use these data to test the hypothesis that the incidence rate for MI is unaffected by OC-
use. What conclusion can you draw?
vi. Consider a hypothetical population of 10 000 women.
Let µ1 denote the expected number of cases of MI in the next five years if all of the women
were OC-users. Let µ2 denote the expected number of cases of MI in the next five years
if none of the women were OC-users.
Obtain an estimate and a 95% confidence interval for µ1 − µ2 .
Give an interpretation of this result.
R4.3 (a) If two independent events each has probability 0.6 of occurring, find the probability that
at least one of them occurs.
(b) A test for detecting a characteristic C gives a positive result for 60% of a large number
of patients subsequently found to have the characteristic, and gave a negative result for
90% of those not having it. If the test is applied randomly to a population in which the
proportion of persons with the characteristic C is 30%, find the probability that a person
has the characteristic if their test gave a positive result.
What is the sensitivity of this test? What is its negative predictive value?
(c) What is relative risk? Write a sentence describing relative risk.
Why can’t we estimate relative risk with only the data from a case-control study? What
else do we need?
R4.4 The number of times a particular device is used in a given medical procedure is a random
variable X with pmf given by
x 0 1 2 3
p(x) 0.2 0.4 0.3 0.1
(a) Draw a sketch graph of the cdf of X.
(b) Show that E(X) = 1.3 and sd(X) = 0.9.
(c) The total number of times the device is used in 100 of these procedures is given by T =
X1 + X2 + · · · + X100 , where X1 , X2 , . . . , X100 are independent random variables each
with the pmf given in (a).
i. Find the mean and standard deviation of T .
ii. Explain why the distribution of T is approximately Normal.
iii. Find, approximately, Pr(T ≤ 125).
R4.5 (a) In daily self-administered blood pressure readings, it is expected that, if the blood pres-
sure is stable, the readings (in mm Hg) will have standard deviation 10.
Suppose that Ms. J. obtains eleven daily observations. Specify the standard deviation of
the average of these eleven readings.
Specify the assumptions you have made in obtaining your result.
(b) A study was conducted on the blood pressure of people with glaucoma. In the study,
25 people with glaucoma were recruited and their mean systolic blood pressure was
142 mm Hg, with a standard deviation of 20 mm Hg. Give a point estimate and a 95%
interval estimate for the mean systolic blood pressure for individuals with glaucoma.
R4.6 (a) The following is a random sample from a Normal population:
7.0, 9.0, 10.0, 11.0, 13.0.
i. Verify that x̄ = 10.0 and s2 = 5.0.
ii. Find a 95% prediction interval for a future observation from this population.
(b) In a particular district, the average number of cases of D reported each month is 2.75.
What is the probability that there are at most 10 cases reported in a particular six-month
period?
(c) The index of numerical development, NDI, measures the ability of a first-year university
student to deal with numbers. The standard score when this test was devised in 1975
was 500. It is believed that the advent of computers and calculators has brought about a
decline in NDI.
Values of NDI were obtained on a random sample of first-year students with the follow-
ing results:
540, 450, 399, 415, 556, 488, 366, 490, 474, 456, 398, 513, 342, 328, 593, 360.
For this sample n = 16, x̄ = 448 and s = 80.
Assume these data are a random sample on X ∼ N(µ, σ²). We wish to test the hypothesis µ = 500 against µ ≠ 500 using a significance level of 0.05.
i. Show that t = −2.6, and show that H0 is rejected by comparing t with the appropri-
ate critical value. Specify the critical value.
ii. Specify the p-value for this test.
iii. What can you conclude?
R4.7 (a) Of 100 independent 95% confidence intervals, let Z denote the number of these confi-
dence intervals that contain the true parameter value.
Specify the distribution of Z.
(b) One observation is obtained on W ∼ N(µ, 1). To test H0: µ = 0 vs µ ≠ 0, the decision rule is to reject H0 if |W| > 2.17. The observation is w = 1.53.
i. Find the significance level.
ii. Find the p-value.
iii. Find the power if µ = 3.
R4.8 A pilot study of a new antihypertensive agent is performed for the purpose of planning a
larger study. Twenty five patients who have diastolic blood pressure of at least 95 mm Hg are
recruited for the study. Fifteen patients are given the treatment, and ten get the placebo. After
one month, the observed reduction in diastolic blood pressure yields the following results.
n1 = 15; x̄1 = 9.0, s21 = 60.0;
n2 = 10; x̄2 = 2.5, s22 = 44.7.
Assume that these are independent samples obtained from Normally distributed populations, X1 ∼ N(µ1, σ²) and X2 ∼ N(µ2, σ²). It is assumed that the population variances are equal, and so the sample variances are pooled to give s² = 54.0.
i. Explain how this pooled variance is obtained.
ii. Find a 95% confidence interval for µ1 −µ2 .
iii. What are your conclusions from this study?
R4.9 Transient hypothyroxinemia, a common finding in premature infants, is not thought to have
long-term consequences, or to require treatment. A study was performed to investigate whether
hypothyroxinemia in premature infants is a cause of subsequent motor and cognitive abnor-
malities. Blood thyroxine values were obtained on routine screening in the first week of life
from a number of infants who weighed 2000g or less at birth and were born at 34 weeks gesta-
tion or earlier. The data given below gives the gestational age (x, in weeks) and the thyroxine
level (y, in unspecified units).
x 25 26 27 28 30 31 32 33 34 34
y 10 12 16 14 24 20 28 26 22 28
For these data, the following statistics were calculated:
n = 10, x̄ = 30, ȳ = 20; Σ(x − x̄)² = 100, Σ(x − x̄)(y − ȳ) = 180, Σ(y − ȳ)² = 400.
R4.10 (a) Sixty independent procedures yielded 48 successes. Find a 95% confidence interval for
the probability of success.
State any assumptions you have made.
(b) The diagram below gives the sample cdf for a random sample of 100 observations on the
recurrence time (in months) for a particular condition following treatment.
UV Level n x̄ s
Moderate 30 0.04 0.11
High 30 0.10 0.25
Based on this study, is there evidence to suggest that there is a difference in the mean change
in pulmonary function between the two groups? Use a significance level of 0.05, and state any
assumptions that you make.
R5.5 A study investigated the relationship between the use of a type of oral contraceptive and the
development of endometrial cancer. The study found that out of 100 subjects who took the
contraceptive, 6 developed endometrial cancer. Of the 225 subjects who did not take the con-
traceptive, 9 developed endometrial cancer.
(a) Based on this study, is there evidence at the 5% level to suggest that there is a higher
proportion of people with endometrial cancer amongst those taking the contraceptive
compared to the control group?
(b) Describe what is meant by a Type I error and a Type II error in the context of the question.
(c) Medical authorities decide that if the test shows that there is a significantly higher pro-
portion of people with endometrial cancer in the group taking the contraceptive, then the
oral contraceptive will be removed from the market.
i. Describe the consequences of a Type I error and of a Type II error.
ii. Explain, for each type of error, whether the consequences are more of a problem for
the women using the oral contraceptive or the manufacturer of the contraceptive.
R5.6 (a) A recent study compared the use of angioplasty (PTCA) with medical therapy in the
treatment of single-vessel coronary artery disease. At the six-month clinic visit, 35 of 96
patients seen in the PTCA group were found to have had angina.
Find a 95% confidence interval for the probability of angina within six months after PTCA
treatment.
(b) The mortality experience of 8146 male employees of a research, engineering and metal-
fabrication plant in Tonawanda, New York, was studied from 1946 to 1981. Potential
workplace exposure included welding fumes, cutting oils, asbestos, organic solvents and
environmental ionizing radiation. Comparisons were made for specific causes of death
between mortality rates in the workers and the U.S. white male mortality rates from 1950
to 1978.
Suppose that, among workers who were hired prior to 1946 and who had worked in the
plant for 10 or more years, 17 deaths due to cirrhosis of the liver were observed, while 6.3
were expected based on U.S. white male mortality rates.
i. Estimate λ, the mean number of deaths for this subpopulation of workers.
ii. Test the hypothesis that λ = λ0 , where λ0 denotes the population value, 6.3.
iii. Find a 95% confidence interval for λ and SMR = λ/λ0 .
R5.7 A group of researchers are investigating a new treatment for reducing systolic blood pressure.
They want to compare the results of a group of patients receiving the new treatment with a
group of subjects receiving a placebo treatment.
(a) Your boss says that it is too costly to include a group of subjects taking a placebo. What
can you say to justify including them in the experiment?
(b) It is decided that 20 patients will take the new treatment and 20 will take the placebo.
Since the investigation is taking place over two cities (Melbourne and Sydney), to make
things simpler, the new treatment will be administered in Melbourne and the placebo
will be given to subjects in Sydney.
i. Identify a potential problem with this design.
ii. Briefly describe a way to overcome this problem.
(c) What is the definition of a lurking variable in the context of this question? Write down
two potential lurking variables.
(d) It is finally decided to run the whole experiment in one city, with 20 subjects taking the
treatment and 20 taking the placebo. The sample mean change in systolic blood pressure
for the new-treatment group is −10.5mmHg, with a standard deviation of 5.2mmHg. The
sample mean change for the placebo group is −6.1mmHg, with a standard deviation of
4.9mmHg.
i. Assuming the underlying variances are equal for the two groups, construct a 95%
confidence interval for the difference in the mean change in blood pressure for the
two groups.
ii. From your confidence interval, explain whether you think the change in blood pres-
sure differs between the two groups.
R5.8 A new antibiotic is thought to affect plasma-glucose concentration (mg/dL). It is known that
in the general population, the mean plasma-glucose concentration is 4.91 with a standard de-
viation of 0.57. A random sample of 10 people is given a fixed dosage of the antibiotic. Their
plasma-glucose concentrations are measured the next day. The concentrations are given in the
table below.
subject 1 2 3 4 5 6 7 8 9 10
concentration 5.05 4.35 5.36 5.46 5.40 4.55 6.45 5.28 4.95 5.50
(a) Draw a boxplot of concentration, making sure you label it appropriately. Show any
working required to construct the graph.
(b) Assume that the true standard deviation for the antibiotic group is the same as for the
general population. Conduct a test at the 1% level to investigate whether the mean
plasma-glucose concentration is higher for those people taking the antibiotic, compared
to the general population. Use the p-value approach, and state any assumptions that you
make.
(c) i. If the true mean is actually µ = 5.5, what is the power of this test?
ii. What happens to the power if α is increased to 0.05? Briefly explain your answer.
(There is no need for any calculations for this part of the question).
R5.9 (a) FEV (forced expiratory volume) is an index of pulmonary function that measures the vol-
ume of air expelled after one second of constant effort. A longitudinal study collected
data on children aged 3–19. The following is a partial R output on a simple linear regres-
sion analysis, relating the variables FEV and AGE for the boys in the group. The group
consisted of 336 boys and their mean age was 10.
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0736 0.1128 0.65 0.514
AGE 0.2735 0.0108 25.33 0.000 ***
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
i. The slope of the regression line is 0.273. What is the interpretation of this?
ii. The p-value given for the t-ratio 25.33 is 0.000. What does this signify? What distri-
bution is used to find the p-value?
iii. Obtain an estimate and a 95% confidence interval for the mean FEV for 12-year-old
boys. (You may use the table value 1.967 for this.)
[Hint: var µ̂(x) = σ²/n + (x − x̄)² var(β̂).]
(b) A study of fifty individuals found a sample correlation coefficient of r = +0.31 between
the variables u and v. Does this represent significant evidence of a positive relationship
between u and v? Explain.
Problem Set 1
1.1 i. Observational study: the treatment is not imposed;
ii. Exposure = oral contraceptive (OC) use; disease outcome = myocardial infarction (MI);
iii. Prospective study;
iv. Response = (MI or not); explanatory variable = (OC or not);
v. To avoid confounding with age;
vi. Keep it simple!
1.2 (a) We really don't know! It might have been "Does exercise increase lactic acid? . . . and by how much?" However, we will take it to have been "Is the change in lactic acid after exercise different for men and women?"
(b) Observational study;
(c) Response variable = change in lactate levels; for the question of comparing males and
females, the gender categories (male and female) take the role of exposure or treatment:
explanatory variable = gender.
(d) Age is a potential confounder (for example, if most of the males were 40–49 and most of
the females were 20–29, then the difference between the groups may be due to age rather
than gender);
(e) A confounding variable is one that affects the outcome (blood lactate level), and which
is related to gender, in the sense that the variable is not balanced between the males and
the females in the sample. Apart from age, there are a number of possible confounders
that suggest themselves: for example individual’s fitness level, weight or recent activity.
1.3 (a) Retrospective;
(b) Prospective;
(c) Cross-sectional.
1.4 (a) Individual subjects (worried about anxiety? . . . how were they chosen? what is their age?
gender?); response variable = difference in anxiety level, explanatory variable, treatment
= meditation;
(b) Experimental study: random allocation of treatment or non-treatment;
(c) No (presumably each individual knows whether or not the therapy they receive is meditation); an individual may respond better to a treatment they believe will do them good, so a blind study would avoid this problem. Could this experiment be blind?
(d) Yes: gender is confounded with the treatment.
1.5 (A) A placebo is an inactive drug, which appears the same as the active drug. It is desirable
in order to ascertain whether the active drug is having an effect.
(B) In a double-blind study, neither the subject nor the treatment provider knows whether the treatment is the active or the inactive drug. It is desirable in order to guard against
any possible bias: on the part of the subject or on the part of the treatment provider (due
to prior expectations).
(C) In favour: there would be no between-subject variation. Against: there may be carry-
over effects, from one treatment to the next. The results may not be generalisable: is
Claire representative?
(D) There can be no carry-over effect in this case. It is likely to be generalisable to a larger
population (the population that the subjects represent). Choose a random order for
AAAAAAAABBBBBBBBCCCCCCCC (using R sampling) and assign these treatments to
subjects 1, 2, . . . , 24; or AAAAAABBBBBBCCCCCCXXXXXX, where X is the placebo.
(E) This method eliminates the between subject variation, but there may be possible carry-
over effects. For each subject, choose a random order for ABC (or ABCX).
1.6 experimental unit = male physicians, response variable = heart attacks (or perhaps heart prob-
lems), explanatory variable = treatment (aspirin/placebo); and other recorded covariates, such
as age, medical history, . . . .
1.7 (a) i. The women should be chosen at random, from among the women attending the
program. The population is then the women attending the Omega program. Issues
of time, location, program leader are all relevant here.
ii. If the program had a strict protocol for how it was delivered that was followed ev-
erywhere you might consider the conclusions to apply to any Omega program . . .
perhaps.
(b) i. The population of items from Grokkle’s production line. The items should be sam-
pled at random. An important issue here is time. You can only sample in a particular
time period. Strictly speaking, the population could then be real: the population of
items in the time period from which you sampled.
ii. If the production line process is believed to be stable over time (usually, a very brave
assumption!) you might consider applying the conclusions to a longer time period
than that sampled. In practice, this is often done: a sample is taken in a week in
March, and an inference is drawn about the whole year. This is a rather dangerous
practice.
(c) Geriatric patients: When an intervention has been used, the circumstances in which it
was applied are usually very important. We could say that the population is all geriatric
patients “like the ones in the Melbourne facility”, but that really doesn’t say anything
useful: what exactly does “like” mean here? This is why randomization is so important in
assessing an intervention. When we have a randomized trial, and therefore some patients
with the intervention and some not, it can be reasonable to apply the conclusions more
widely, to all geriatric patients. Effectively, this is often done.
(d) Breast cancer: Similar issues to the geriatric patients arise.
1.8 (a) The subject does not know whether they have received the treatment drug (SleepWell) or
the control drug (Placebo).
(b) So as to reduce bias and provide a fair comparison of the effect of the drug. Subjects may
tend to sleep more because of the suggestion that the drug will help: the placebo effect.
(c) Each patient has an equal chance of being assigned to the treatment or control. If there are
2n subjects who have agreed to take part in the experiment, randomly choose an order
for T T · · · T CC · · · C, i.e. n T s and n Cs, and assign in this order to the subjects.
(d) Replication is repetition of the experiment. We want a large number of replicates as this increases the precision of the comparison being made.
1.9 (a) response variable = birthweight; explanatory variable = mother’s smoking status;
(b) observational study: mother’s smoking status is not imposed;
(c) race, (physical) size of parents, socio-economic status, mother’s health, pre-natal care, . . . ;
we can choose the mothers so that (some of) these variables are similar in the two groups.
1.10 The C–X line is removed: randomisation means that there can be no correlation between the
intervention X and the variable C, so the relationship between X and D is unaffected by the
relation between C and D.
1.11 We use a randomised controlled experiment. Patient function/status will be assessed by an
initial test, i.e. before treatment commences. The drug will be given in the form of a pill to be
administered by the carer. The control group will receive a placebo, i.e. a pill identical in ap-
pearance to the treatment (drug-L) pill. The treatment/placebo pill package will be randomly
assigned by the statistician (20 of each), so that neither the patient/carer nor the physician will
know whether the pill is drug-L or placebo. Thus the trial is double blind. At the end of the
treatment time (six months, say) the patients will be re-tested.
Because randomisation has been used, the significant difference can be attributed to the causal
effect of drug L.
Problem Set 2
2.1 (a) continuous; categorical; ordinal; discrete; categorical; continuous.
(b)
The heights should be proportional to (37/6, 13/5, 5/5, 0, 1/5). Whether the reported
data are correct is another matter, given the silliness of the graph, but this is about the
best representation of the data, as given.
The given graph actually appeared in ‘The Age’ (some time ago now)!
2.2 (a) not much recommendation?
(b) people have to die of something; look at quantity/quality of life lost?
(c) “more accidents” is not the same as “worse drivers” (poorer cars? more time on the
roads? . . . );
(d) nonsense, but it might be interesting to work out what it might mean.
2.3 (a) i. whether a vegetarian diet will change the cholesterol level;
ii. n=20, study unit = hospital employees (on standard meat-eating diet who agreed to
adopt a vegetarian diet for one month);
iii. all (hospital) employees on a standard meat-eating diet (extension?)
(b) i.
(a) med ≈ 87, IQR ≈ 99 − 76 = 23, P10 ≈ 71, P90 ≈ 108;
(b) 0.85;
(c) 0.96;
(d) close to 0.95 [x̄ ± 2s is supposed to contain about 95% of the data];
(e) The data are classified into intervals (groups), so we do not know their values. To get the
correct values for these sample statistics, we would need the actual data.
2.6 (a) response variable = weight (in grams) after 21 days;
explanatory variable = treatment (control/lysine);
age of chicks (1 day → 22 days);
breed of chicks [yes]; conditions (temperature, humidity, . . . ) [yes]
(b) Not a good idea. Actually a really bad idea! Then ‘farm’ would be confounded with
‘treatment’.
(c)
It appears that Lysine has the effect of increasing the weight gain.
2.7 The mean and median are the same for each data set, but the spread of the data sets is substantially different. This is apparent in a dotplot. It is indicated by the standard deviation or the interquartile range.
2.8 (a)
(b) It depends! If the missing observations are typical, then they are likely to go where the
observed data are: mostly in the middle with one or two a bit further away from the
middle. But they might be missing because the patient was too ill, or was unable to give
a reading . . . in which case the H-level might be very high? . . . and the missing data are
atypical.
(c) The target population would be all individuals with characteristic C.
(d) We would be assuming that the missing observations are typical — so that the remaining
(observed) 25 are too.
2.9
Problem Set 3
3.1 The probability table is:
          B       B′
    A     0.004   0.026   0.03
    A′    0.056   0.914   0.97
          0.06    0.94    1
There is a positive relationship between A and B:
Pr(A | B) = 0.004/0.06 = 0.067 > Pr(A) = 0.03;
Pr(B | A) = 0.004/0.03 = 0.133 > Pr(B) = 0.06;
or Pr(A ∩ B) = 0.004 > Pr(A) Pr(B) = 0.0018.
Note: any one of these inequalities is enough to show a positive relationship.
3.2 D and E are events with Pr(D | E) = 0.1 and Pr(D | E′) = 0.2.
(a) Pr(E | D) < Pr(E), so E and D are negatively related.
(b) OR = (0.1/0.9)/(0.2/0.8) = 0.444.
          D       D′
    E     0.04    0.36    0.4
    E′    0.12    0.48    0.6
          0.16    0.84    1
(c) Pr(D) = 0.16;
(d) Pr(E | D) = 0.04/0.16 = 0.25. Note: OR = (0.04×0.48)/(0.12×0.36) = 0.444.
3.3 (a) p1(1 − p2)/((1 − p1)p2) = 2 ⇒ p1 − p1p2 = 2p2 − 2p1p2 ⇒ p1 = 2p2 − p1p2.
Dividing through by p2 gives the result.
    p1     p2       p1 − p2   p1/p2
    0.00   0.0000   0.0000    2.00
    0.01   0.0050   0.0050    1.99
    0.05   0.0256   0.0244    1.95
    0.10   0.0526   0.0474    1.90
    0.25   0.1429   0.1071    1.75
    0.50   0.3333   0.1667    1.50
    0.90   0.8182   0.0818    1.10
    1.00   1.0000   0.0000    1.00
(b) As for (a): p1(1 − p2)/((1 − p1)p2) = θ ⇒ p1 − p1p2 = θp2 − θp1p2 ⇒ p1 = θp2 − (θ−1)p1p2.
Again, dividing through by p2 gives the required result:
RR = θ×(1 − p1) + 1×p1.
This is a weighted average of 1 and θ, with weight p1 on 1 and 1 − p1 on θ, and must therefore lie between 1 and θ. Note: RR divides the interval between 1 and θ in the ratio 1 − p1 : p1.
This applies even if θ < 1; i.e. RR will again lie between θ and 1, but in this case RR will be less than 1 (and greater than θ).
(c) OR = 2 ⇒ RR must lie between 1 and 2. If the risks are small, then it will be close to
2, but slightly smaller than 2.
(d) i. 2.8; ii. 1.4; iii. 0.525.
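The weighted-average form is easy to check numerically. A short R sketch reproducing the table in (a) (θ = 2):

theta <- 2
p1 <- c(0.01, 0.05, 0.10, 0.25, 0.50, 0.90, 1.00)
RR <- theta * (1 - p1) + p1    # RR = theta*(1 - p1) + 1*p1
p2 <- p1 / RR                  # since RR = p1/p2
round(cbind(p1, p2, diff = p1 - p2, RR), 4)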
3.4 (a)                (b)                (c)                (d)
    0.1  0.3  0.4      0.2  0.2  0.4      0.1  0.5  0.6      0.2  0.2  0.4
    0.4  0.2  0.6      0.3  0.3  0.6      0.2  0.2  0.4      0.2  0.4  0.6
    0.5  0.5  1        0.5  0.5  1        0.3  0.7  1        0.4  0.6  1
(d) Let Pr(B) = b, so the column totals are b and 1 − b; the first-row entries are then (1/2)b and (1/3)(1 − b).
Then (1/2)b + (1/3)(1 − b) = 0.4 ⇒ b = 0.4.
3.5 (a) Since Pr(E | D) = 0.20 < Pr(E | D′) = 0.25, it follows that E and D are negatively related: E is less likely for D than for D′.
(b) O(E | D) = 0.2/0.8 = 1/4; and O(E | D′) = 0.25/0.75 = 1/3. So, OR = (1/4)/(1/3) = 0.75.
(Note: OR < 1 ⇒ negative relationship.)
3.6 Note that these results are estimates; as they are based on a relatively small sample of 62 individuals, they are not particularly reliable.
          P20    P20′                       P20     P20′
    D     12     4      16            D     0.194   0.065   0.258    sn = 12/16 = 0.750
    D′    12     34     46     ⇒      D′    0.194   0.548   0.742    sp = 34/46 = 0.739
          24     38     62                  0.387   0.613   1
and for prevalence 0.01:
          P20      P20′
    D     0.0075   0.0025   0.01      ppv = 0.0075/0.2658 = 0.028;
    D′    0.2583   0.7317   0.99      npv = 0.7317/0.7342 = 0.997.
          0.2658   0.7342   1
          P15    P15′                       P15     P15′
    D     7      9      16            D     0.113   0.145   0.258    sn = 7/16 = 0.438
    D′    3      43     46     ⇒      D′    0.048   0.694   0.742    sp = 43/46 = 0.935
          10     52     62                  0.161   0.839   1
and for prevalence 0.01:
          P15      P15′
    D     0.0044   0.0056   0.01      ppv = 0.0044/0.0689 = 0.064;
    D′    0.0646   0.9254   0.99      npv = 0.9254/0.9311 = 0.994.
          0.0689   0.9311   1
Changing to a threshold of 15 increases the ppv (there are fewer false positives . . . and fewer true positives), but decreases the npv (more false negatives).
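These ppv/npv calculations can be scripted. A sketch in R (the function name is ours; sn, sp and the prevalence are read from the tables above):

ppv_npv <- function(sn, sp, prev) {
  # Bayes' theorem applied to a diagnostic test
  ppv <- sn * prev / (sn * prev + (1 - sp) * (1 - prev))
  npv <- sp * (1 - prev) / (sp * (1 - prev) + (1 - sn) * prev)
  c(ppv = ppv, npv = npv)
}
ppv_npv(sn = 12/16, sp = 34/46, prev = 0.01)   # threshold 20: ppv = 0.028, npv = 0.997
ppv_npv(sn = 7/16,  sp = 43/46, prev = 0.01)   # threshold 15: ppv = 0.064, npv = 0.994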
3.7 i. Assuming this sample is representative of the population (say, of men aged 50–59):
          P      P′                         P       P′
    D     92     46     138           D     0.126   0.063   0.188    sn = 0.667
    D′    27     568    595    ⇒      D′    0.037   0.775   0.812    sp = 0.955
          119    614    733                 0.162   0.838   1
ppv = 0.126/0.162 = 0.773.
ii. Possibly from a community screening program (like mammography screening for breast
cancer) in which, say, men aged 50–59 are invited to attend for a free test. In this case, we
would have to assume that those who chose to attend for the screening test are represen-
tative of the target population. If such data came from routine GP tests (say applied to all
men 50–59 attending the clinic) this would be less representative.
To discover whether they had cancer, there would need to be some sort of follow-up (perhaps we might take 'no diagnosed cancer' in five years' time as an indicator). In that case, there are (statistical) risks: some cancers may not have shown symptoms in that time; or some cancers may have developed after the test.
3.8 For the case ℓ = 3, we say that the test is positive if {PSA > 3}, and we denote this event by P3.
In that case,
sensitivity, sn = Pr(P3 | C) = Pr(PSA > 3 | C) = 1 − 0.003 = 0.997;
specificity, sp = Pr(P3′ | C′) = Pr(PSA ≤ 3 | C′) = 0.140.
Hence the C×P3 probability table can be completed (the top left one in the array below); and from that we obtain:
ppv = Pr(C | P3) = 0.1994/0.8874 = 0.225 and npv = Pr(C′ | P3′) = 0.1120/0.1126 = 0.995.
Similarly for the other values of ℓ. The Disease/Test probability tables are given below for ℓ = 3, 4, 5, 6, 7, 8.
We want sn, sp, ppv and npv large; we want fp and fn small. The problem is we can't have it all. There are no simple rules for what is 'best'. It depends on the situation which aspect is rated more important, and even then, there is disagreement even between experts!
Problem Set 4
4.1 (a) graphs of pmf and cdf:
(b) Pr(X = 2) = 3/36 = 1/12 (0.083);
Pr(X > 4) = Pr(X=5) + Pr(X=6) = 9/36 + 11/36 = 5/9 (0.556);
Pr(2 < X ≤ 5) = Pr(X=3) + Pr(X=4) + Pr(X=5) = 5/36 + 7/36 + 9/36 = 7/12 (0.583).
(c) (see the graph of the cdf above) Pr(X = 2) = (jump in F at x=2) = 1/12;
Pr(X > 4) = 1 − F(4) = 5/9;
Pr(2 < X ≤ 5) = F(5) − F(2) = 7/12.
4.2 i. Pr(X > 0.1) = 1 − F(0.1) = 0.9⁴ = 0.656.
ii. Pr(X > s) = 0.01 ⇒ (1 − s)⁴ = 0.01 ⇒ 1 − s = 0.32, i.e. s = 0.68.
Thus, the supply needs to be at least 68 L.
4.3 (a) Y = 2X has pmf
        y      0      1      2
        p(y)   0.5    0      0.5
(b) Z = X1 + X2 has pmf
        z      0      1      2
        p(z)   0.25   0.5    0.25
The dotplot suggests bimodality, but the sample is small; x̄ ≈ 30 and s ≈ 15 (these are quite reasonable sample values: we don't expect values identical to the population values).
Of course, this cannot be an exact model: for example, this model would mean that Pr(T < 0) = 0.023.
Nevertheless, it seems not unreasonable to use T ≈ N(30, 15²) as an approximate model.
Note: An alternative approach may be to consider the observed number of months as an integer variable. Then we need to consider how to interpret events such as "more than a year": is this "X > 12" or "X ≥ 12" or something else?
4.12 (a) µY−X = 0.002; σY−X = √((0.004)² + (0.002)²) = 0.00448.
(b) µZ = 2.001; σZ = 0.00224. Z is more variable than Y , but with a mean closer to 2.
(c) Which is ‘best’ X, Y or Z? There is no simple answer here, each has its merits. X is unbi-
ased (mean = 2), but it has the largest standard deviation, and hence the least precision. Y
is biased, with a larger bias than Z, but it is more precise than Z; it has a smaller standard
deviation. So what is needed is a trade-off between bias and precision. I would choose Z
as a compromise, but choosing either X, because it is the only one that is unbiased, or Y ,
because it has the smallest standard deviation (and quite a small bias) is acceptable.
(d) X ~ N(2, 0.004²) ⇒ Pr(1.995 < X < 2.005)
= Pr(−1.25 < Xs < 1.25) = 0.8944 − 0.1056 = 0.7887;
Y ~ N(2.002, 0.002²) ⇒ Pr(1.995 < Y < 2.005)
= Pr(−3.5 < Ys < 1.5) = 0.9332 − 0.0002 = 0.9330;
Z ~ N(2.001, 0.002236²) ⇒ Pr(1.995 < Z < 2.005)
= Pr(−2.683 < Zs < 1.789) = 0.9632 − 0.0036 = 0.9595;
which gives some support to Z as a good estimator, because these results suggest that it is more likely to be "close" to the true value, i.e. within 0.005 of 2.
4.13 (a)
(b) X − Y ~ N(−7.8, 95.30);
mean = 165.4 − 173.2 and variance = 6.7² + 7.1², so sd = 9.762.
Pr(X > Y) = Pr(X − Y > 0) = Pr(Z > (0 + 7.8)/9.762) = Pr(Z > 0.7990) = 0.212.
4.14 (a) Pr(Y > 10) = Pr(ln Y > ln 10) = Pr(Z > 0.303) = 0.381;
(b) Let c0.25, c0.5 and c0.75 denote the quartiles and median of Y.
c0.25 is such that Pr(Y < c0.25) = 0.25. Therefore:
Pr(ln Y < ln c0.25) = 0.25 ⇒ ln c0.25 = 2 − 0.6745×1 = 1.3255 ⇒ c0.25 = e^1.3255 ≈ 3.76.
Similarly, we find c0.5 = e² ≈ 7.39 and c0.75 = e^2.6745 ≈ 14.51.
(c) Y is positively skew, since c0.75 − c0.5 > c0.5 − c0.25.
(d) the graph of the pdf of Y is:
4.15 (a) Let X denote the number of patients in which XB kills the bacteria; then X ~ Bi(100, 0.85) (since the probability of "success" is the efficacy). Then Pr("significantly better") = . . .
Problem Set 5
5.1 (a) X̄ ~ N(50, 10²/10);
Pr(49 < X̄ < 51) = Pr(−1/√10 < X̄s < 1/√10) = Pr(−0.316 < X̄s < 0.316) = 0.248.
(b) X̄ ~ N(50, 10²/100);
Pr(49 < X̄ < 51) = Pr(−1 < X̄s < 1) = 0.683.
(c) X̄ ~ N(50, 10²/1000);
Pr(49 < X̄ < 51) = Pr(−√10 < X̄s < √10) = Pr(−3.162 < X̄s < 3.162) = 0.998.
5.2 X̄ ≈ N(55.4, 14.2²/50).
95% prob interval: 55.4 ± 1.96×14.2/√50 = (51.5, 59.3)
5.3 [cf. Computer Lab Week 7: StatPlay & Confidence Intervals]
(a) 0.95⁴ = 0.8145.
(b) i. 0.95²⁰ = 0.3585;
ii. about 19 = 20×0.95;
iii. Bi(20, 0.95);
iv. 0.3585, 0.3774, 0.1887.
5.4 i. n = 30, x̄ = 40.86, (σ = 8);
95% CI for µ: 40.86 ± 1.9600×8/√30 = (38.00, 43.72).
ii. narrower: it has less chance of containing µ. 40.86 ± 0.6745×8/√30 = (39.87, 41.84).
iii. the confidence interval would continue to get narrower, until it reaches the point estimate x̄, which is the 0% confidence interval.
iv. 40.86 ± 3.2905×8/√30 = (36.05, 45.66).
5.5 n = 30, x̄ = 40.86, s = 7.036.
95% CI for µ: 40.86 ± 2.045×7.036/√30 = (38.23, 43.48). cf. (38.00, 43.72).
This interval is narrower because the sample standard deviation s = 7.036 happens to be less
than the population standard deviation σ = 8 for this sample. If the population standard
deviation is actually equal to 8, then sometimes s will be less than 8, and sometimes it will
be more than 8. In this case we were ‘lucky’. On average, the interval based on s will be
wider, since not only is s ≈ 8 on average, but the multiplier of s (based on t) is larger than the
multiplier of σ (based on z).
5.6 n = 50, d̄ = 17.4, sd = 21.2.
i. d̄ ± 2.010×21.2/√50 = (11.4, 23.4);
ii. the CI excludes zero, so that a mean difference of zero is implausible; this indicates an increase.
5.7 There is no need to assume a Normal population, though we are assuming that the sample size is large enough for the CLT to apply, so that X̄ ≈ N.
i. σ, population standard deviation; n, the sample size; α, the probability of error, equiva-
lently the confidence level 100(1−α).
ii. the width increases with increasing σ; the width increases with decreasing α (or increas-
ing confidence level); and the width decreases with increasing n.
iii. wider interval means less precision, i.e. the “answer” is less precise: a wider interval
gives the scientist less precise information about the parameter.
iv. c0.975(N) = 1.96 ⇒ 1.96×5/√n = 0.5 ⇒ √n = 19.6 ⇒ n = 384.2;
thus we want the sample size to be at least 385.
5.8 p̂ = 228/1140 = 0.20; se(p̂) = √(0.20×0.80/1140) = 0.0118.
(approx) 95% CI for p: (0.20 ± 1.96×0.0118) = (0.177, 0.223).
Note that because n is large the exact interval will be almost the same: R gives (0.177, 0.224).
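The exact interval quoted from R is the Clopper–Pearson interval:

binom.test(228, 1140)$conf.int   # exact 95% CI: (0.177, 0.224)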
5.9
5.10
    n     x     p̂     95% CI             n     x     p̂     95% CI
    20    4     0.2   (0.06, 0.44)       20    16    0.8   (0.56, 0.94)
    50    10    0.2   (0.10, 0.34)       50    40    0.8   (0.66, 0.90)
    100   20    0.2   (0.13, 0.29)       100   80    0.8   (0.71, 0.87)
    200   40    0.2   (0.15, 0.26)       200   160   0.8   (0.74, 0.85)
n=100, x=20 ⇒ approx 95% CI for p: 0.2 ± 1.96×√(0.2×0.8/100) = (0.122, 0.278)
cf. exact 95% CI from tables: (0.13, 0.29). Note: R gives (0.127, 0.292);
and the 'better' approximation gives (0.126, 0.297).
5.11 (a) If the data are a random sample from a Normal population, then the QQ plot should be close to a straight line, with intercept µ and slope σ.
(k=15): y-coordinate = x(15) = 65 and x-coordinate = Φ⁻¹(15/20) = Φ⁻¹(0.75) = 0.6745.
So the point is (0.6745, 65).
µ̂ = 50 (intercept); σ̂ = 20 (slope).
(b) For a Probability plot the axes are interchanged; the x-coordinate = x(15) = 65 and the y-coordinate = Φ⁻¹(0.75) = 0.6745, though the y-axis label is 0.75 (= Φ(0.6745)).
(c) n = 19, x̄ = 50.05, s = 17.81.
i. 95% CI for µ: 50.05 ± 2.101×17.81/√19 = (41.47, 58.63);
ii. 95% PI for X: 50.05 ± 2.101×17.81×√(1 + 1/19) = (11.66, 88.44).
5.12
    interval      freq   f̂       x     cum.freq   F̂
    0 < x < 1     27     0.27     1     27         0.27
    1 < x < 2     18     0.18     2     45         0.45
    2 < x < 3     20     0.20     3     65         0.65
    3 < x < 5     17     0.085    5     82         0.82
    5 < x < 10    12     0.024   10     94         0.94
    10 < x < 20    6     0.006   20    100         1.00
(a)
(b)
5.16 (a)
    est       se       1/se²       w        w×est
    0.0827    0.0533   352.0024    0.3505   0.0290
    0.3520    0.1058   89.3364     0.0890   0.0313
    0.0520    0.0503   395.2429    0.3936   0.0205
    −0.7702   0.5109   3.8311      0.0038   −0.0029
    0.1049    0.0797   157.4285    0.1568   0.0164
    0.1542    0.3935   6.4582      0.0064   0.0010
                       1004.2995   1        0.0953
(b) est = 0.0953, se = 0.0316 (= 1/√1004.2995)
(c) 95% CI for ln OR: 0.0953 ± 1.96×0.0316 = (0.0334, 0.1571)
(d) 95% CI for OR: exp(0.0334, 0.1571) = (1.034, 1.170).
(e) OR > 1: exposure and disease outcome are positively related, i.e. exposure is associated with greater probability of the disease outcome.
(f) There is significant evidence in these data to indicate that OR > 1, since the 95% CI is entirely greater than 1, i.e. the 'plausible' values are greater than 1; i.e. there is significant evidence to indicate that β-carotene increases the risk of cardiovascular mortality (small, but significant).
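The inverse-variance pooling is quick to reproduce in R; a sketch, with the estimates and standard errors read from the table in (a):

est <- c(0.0827, 0.3520, 0.0520, -0.7702, 0.1049, 0.1542)
se  <- c(0.0533, 0.1058, 0.0503, 0.5109, 0.0797, 0.3935)
w <- (1 / se^2) / sum(1 / se^2)              # normalised inverse-variance weights
pooled    <- sum(w * est)                    # 0.0953
se_pooled <- 1 / sqrt(sum(1 / se^2))         # 0.0316
exp(pooled + c(-1, 1) * 1.96 * se_pooled)    # 95% CI for OR: (1.034, 1.170)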
Problem Set 6
6.1 n = 30, x̄ = 40.86, (σ = 8). Note: s = 7.04.
(a) i. 95% CI for µ: 40.86 ± 1.96×8/√30 = (38.00, 43.72);
hence we reject (µ=45) since 45 ∉ CI.
ii. [number-line sketch of the CI against 35, 40, 45: not reproduced]
iii. p = 2 Pr(X̄ < 40.86) = 2 Pr(X̄s < (40.86 − 45)/(8/√30)) = 2 Pr(X̄s < −2.83) = 0.005;
hence we reject (µ=45) since p < 0.05.
iv. z = (x̄ − 45)/(8/√30); reject (µ=45) if |z| > 1.96, i.e. if x̄ ∉ (45 ± 1.96×8/√30).
[number-line sketch: not reproduced]
6.2 (a) Pr(X < 12.5 | anaemic) = Pr(Xs < (12.5 − 9)/3) = Pr(Xs < 1.167) = 0.879;
(b) Pr(X > 12.5 | healthy) = Pr(Xs > (12.5 − 16)/2.5) = Pr(Xs > −1.4) = 0.919;
(c) sensitivity = Pr(P | D) = 0.879; specificity = Pr(P′ | D′) = 0.919;
Unless we know Pr(anaemic) (i.e. prevalence in the population, or relevant sub-population) we cannot evaluate ppv or npv. (In hypothesis testing generally we never know Pr(H0) or Pr(H1), i.e. we never know "prevalence".)
6.3 p̂ = 0.2, 95% CI: 0.09 < p < 0.36; 0.2, (0.11, 0.33); 0.2, (0.13, 0.29); 0.2, (0.15, 0.26);
do not reject, do not reject, reject, reject.
These results are summarised in the following table. In addition, more precise values for the
exact confidence interval, obtained from R are also listed, along with the approximate and
‘better’ approximate confidence intervals for comparison.
6.4 data give n = 19, x̄ = 29.74 and s = 7.85; (n∗ =1: one observation missing. We assume that
the missing observation is “missing at random”, i.e. it’s just as likely to be large or small: we’re
assuming it’s distributed like the others. In particular, we are assuming that it has not been
discarded because it was too large, for example.)
n = 19; µ̂ = x̄ = 29.74; and se(µ̂) = s/√n = 7.85/√19 = 1.80.
(a) 95% CI for µ: 29.74 ± 2.101×1.80 = (26.0, 33.5);
and, since 31 ∈ CI, we do not reject (µ=31).
(b) t = (x̄ − µ0)/(s/√n) = (29.74 − 31)/(7.85/√19) = −0.701;
p = 2 Pr(t18 < −0.701) = 0.492 (using R); or, from tables p ≈ 0.5
[since c0.75(t18) = 0.688 and c0.8(t18) = 0.862; so Pr(t18 > 0.7) ≈ 0.25.]
There is no significant evidence here to indicate that µ ≠ 31.
Note: as the data are counts, and therefore integer-valued, we should really have made a correction for continuity (ΣX ≤ 565). This gives tc = −0.687.
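With only the summary statistics available, the test is easy to reproduce in R (a sketch):

n <- 19; xbar <- 29.74; s <- 7.85
t_obs <- (xbar - 31) / (s / sqrt(n))   # -0.701
2 * pt(t_obs, df = n - 1)              # p = 0.492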
6.7 [Power-curve sketch, not reproduced: the curve is annotated with the significance level at the H0-value µ = C, and with the type II error when µ = B.]
(Note that the H0-value is µ = C, the value at which the power-curve has a minimum.)
[Sketch for a sample-size calculation, not reproduced. Under H0: p = 0.10, the upper rejection boundary is 0.10 + 1.96×√(0.10×0.90/n) (upper-tail area 0.025); under H1: p = 0.15, the point 0.15 − 1.6449×√(0.15×0.85/n) cuts off a lower-tail area of 0.05. The required n makes these two boundaries coincide.]
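Reading the sketch this way, the required n can be found numerically; an R sketch of our reconstruction of the calculation (the function name is ours):

gap <- function(n) (0.10 + 1.96 * sqrt(0.10 * 0.90 / n)) -
                   (0.15 - 1.6449 * sqrt(0.15 * 0.85 / n))
uniroot(gap, c(10, 10000))$root   # about 553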
6.10 Let X denote the number of cases of K. Under the null hypothesis (that the individuals at the HQ centre are the same as the general population), X ~ Pn(4.6).
Therefore p = 2 Pr(X ≥ 13) = 2×0.001 = 0.002. Hence there is significant evidence of an excess risk of K among HQ employees.
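In R, the tail probability is:

2 * ppois(12, lambda = 4.6, lower.tail = FALSE)   # 2 Pr(X >= 13), approx 0.002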
6.11 There is no evidence in these data that the treatment has an effect. The data are compatible
with the null hypothesis (that the treatment has no effect).
6.12 H0: µ = 20 vs H1: µ ≠ 20; the test statistic, t = (x̄ − µ0)/(s/√n) ⇒ tobs = (17.4 − 20)/(5.1/√20) = −2.28.
The null distribution of t, i.e. the distribution of t under H0, is t19, assuming the population is normally distributed. The critical value (for a test of significance level 0.05) is c0.975(t19) = 2.093. Since |tobs| is greater than this, there is significant evidence that µ < 20.
6.13 From the definition of the median: m = 20 ⇒ Pr(X < 20) = 0.5 (for a continuous random variable).
Let Y denote the number of observations less than 20, i.e. Y = freq(X < 20). Then, if m = 20, Y ~ Bi(11, 0.5); and we observe y = 10. So, p = 2 Pr(Y ≥ 10) = 2×0.0059 = 0.012. Thus we reject H0, and conclude there is evidence that the median is less than 20.
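The same sign test in R; binom.test gives the exact two-sided p-value:

binom.test(10, 11, p = 0.5)$p.value   # 0.012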
6.14 i. There are 10 observations that round to 37.0. We don’t know whether these 10 observa-
tions are above or below 37 (i.e. 37.0000. . . ). So we delete them from consideration. This
leaves 120 observations, of which 81 are less than 37, and 39 are greater than 37.
If H0 (m = 37) is true, then W = freq(X < 37) ~ Bi(120, 0.5);
using the approximate z-test, zc = (81 − 60 − 0.5)/√30 = 3.743, so that p ≈ 0.0002.
Thus we reject the hypothesis that m = 37, and conclude that there is significant evidence
that m < 37. [The null hypothesis is rejected and since m̂ < 37, the plausible values for m (as
specified by the CI, even though we haven’t found it) will be less than 37.]
ii. That this is a random sample (of healthy adults) and that temperatures are correctly mea-
sured. No assumption is made about the distribution of temperatures.
iii. Using Stat > Nonparametrics ◮ 1-Sample Sign . . . gives:
Sign test of median = 37.00 versus not = 37.00
N Below Equal Above P Median
x 130 81 10 39 0.0002 36.80
Sign confidence interval for median
Confidence
Achieved Interval
N Median Confidence Lower Upper Position
x 130 36.80 0.9345 36.80 36.90 55
0.9500 36.74 36.90 NLI
0.9563 36.70 36.90 54
Try also Stat > Basic Statistics ◮ Graphical Summary . . . which gives the CI.
6.16 The definitions and assumptions are missing. Here, it is assumed that we are sampling from a
Normal population with known variance σ 2 .
For the confidence interval case, we require the sample size n to be large enough so that the
margin of error of a 100(1 − α)% confidence interval should be at most d.
For the hypothesis testing case, we are testing the null hypothesis H0 : µ = µ0 using a signifi-
cance level α; and we require the sample size n to be large enough so that when µ = µ1 (where
µ1 = µ0 ± d), the power of the test should be at least 1 − β.
i. n increases by a factor of k²: n′ = z²(kσ)²/d² = k²×(z²σ²/d²) = k²n.
ii. n decreases by a factor of k²: n′ = z²σ²/(kd)² = (1/k²)×(z²σ²/d²) = n/k².
iii. 0.95 ↦ 0.99 means z = 1.96 ↦ z′ = 2.5758, so n increases by a factor of (2.5758/1.96)²:
n = 1.96²σ²/d² ↦ n′ = 2.5758²σ²/d², so n′/n = (2.5758/1.96)² = 1.727.
iv. For the diagram shown below, z1−α/2×σ/√n = d. [Diagram not reproduced.]
And this diagram corresponds to the power diagram (EDDA p141) with β = 0.5.
v. β = 0.1 ↦ β′ = 0.01 means that z1−β = 1.2816 ↦ z1−β′ = 2.3263;
and so n = (1.96 + 1.2816)²σ²/d² ↦ n′ = (1.96 + 2.3263)²σ²/d².
Therefore n′/n = (1.96 + 2.3263)²/(1.96 + 1.2816)² = 1.748.
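The general formula n = (z1−α/2 + z1−β)²σ²/d² is easily wrapped up in R; a sketch (the function name is ours; the default β = 0.5 gives z1−β = 0, reproducing the confidence-interval case in iv):

n_needed <- function(sigma, d, alpha = 0.05, beta = 0.5) {
  z <- qnorm(1 - alpha / 2) + qnorm(1 - beta)
  (z * sigma / d)^2   # round up to the next integer in practice
}
n_needed(1, 1, beta = 0.01) / n_needed(1, 1, beta = 0.1)   # 1.748, as in v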
Problem Set 7
7.1 (a) i. muscular endurance, as measured by repetitive grip strength trials;
ii. paired comparisons: treatment and control applied to the same subject;
iii. control = sugar placebo;
iv. neither the patient nor the tester knows whether the treatment received was the vitamin C or the placebo;
v. randomisation should be used to determine which treatment (vitamin C or placebo) is used first;
vi. better (more efficient) comparisons between treatment and control; the possibility of carry-over effects.
(b) i.
To check on outliers; and a rough check of normality, via symmetry at least. This
looks a bit positively skew, but there are relatively few observations.
ii. To check on Normality, use a QQ-plot or a Probability plot. They should be close to
a straight line. The plot below indicates that this sample is acceptably Normal:
(c) i. H0: µD = 0 (i.e. µVC = µP) vs H1: µD ≠ 0;
t12 = d̄/(sd/√n) = 121.0/(148.8/√13) = 2.93, 0.01 < p < 0.02.
ii. There is significant evidence in these data that the muscular endurance is less with vitamin C than with the placebo (assuming that large values of the response variable correspond to greater muscular endurance), i.e. there is significant evidence here that vitamin C reduces muscular endurance. We reject H0, and the plausible values of µD (as specified by the CI) are positive; and µD > 0 corresponds to µP > µVC.
[t12 = 2.93, p = 0.013; 95% CI: (31, 211)].
7.2 (a) x̄1 − x̄2 = 121.0 = d̄, so the point estimate is the same; but this two-sample approach gives se(x̄1 − x̄2) = 162.9×√(1/13 + 1/13) = 63.9, vs the difference approach: se(d̄) = 41.3.
(b) 95% CI using the two-sample approach: (−12, 254), vs 95% CI from Problem 7.1 (differ-
ence approach): (31, 211)
(c) The two-samples approach assumes there is no connection between the two results of an
individual. It is assumed that samples are independent random samples from the treated
and untreated (placeboed?) populations.
(d) Clearly there is a connection between the results for a given subject. Some individuals are
stronger than others. Look at the results for subjects 5 and 6. In using the two-samples
(independent samples) approach, the treatment difference is masked by the difference
between individuals. The differences approach (paired samples) effectively removes the
individual differences.
7.3 Zinc: n1 = 25, x̄1 = 4.5, s1 = 1.6; Placebo: n2 = 23, x̄2 = 8.1, s2 = 1.8.
(a) x̄1 − x̄2 = −3.6, s = √((24×1.6² + 22×1.8²)/46) = 1.70; se = s×√(1/25 + 1/23) = 0.49.
(b) 95% CI for µ1 − µ2: −3.6 ± 2.015×0.491 = (−4.6, −2.6).
c0.975(t46) = 2.015 using R; or tables (c0.975(t40) = 2.021, c0.975(t50) = 2.009). Even if you used 2.021, the 95% CI is unchanged to two decimal places.
(c) Yes. The 95% CI excludes zero. There is significant evidence here that the mean recovery time is less with the zinc treatment.
Note that if you assumed that σ1 ≠ σ2, little would change, as s1 and s2 are not very different: df = 44 (using R); t = −7.30 [instead of t = −7.34]; se = √(1.6²/25 + 1.8²/23) = 0.493 [instead of se = 0.491]; and 95% CI = (−4.59, −2.61) [instead of (−4.59, −2.61)!]
7.4 (a) sample of differences (10, 10, 6, 3, 0, 20, 7): n1 = 7, d̄1 = 8.0, s1 = 6.40;
t = 8.0/(6.40/√7) = 3.31, cf. t6; reject H0, p = 0.016;
(b) sample of differences (5, 17, 23, 22, 17, 4, 18, 3): n2 = 8, d̄2 = 13.6, s2 = 8.28;
95% CI: (13.6 ± 2.365×8.28/√8) = (6.7, 20.6).
(c) compare female and male differences (two-sample test): n1 = 7, d̄1 = 8.0, s1 = 6.40; n2 = 8, d̄2 = 13.6, s2 = 8.28; (s = 7.47):
t = (d̄2 − d̄1)/(s×√(1/7 + 1/8)) = 1.45, cf. t13; do not reject H0, p = 0.170;
Note, the 95% CI: (5.6 ± 2.160×7.47×√(1/7 + 1/8)) = (−2.7, 14.0).
7.5 (a) Two-sample t-test: t = (34.47 − 36.03)/(10.11×√(1/10 + 1/10)) = −1.56/4.523 = −0.345, cf. c0.975(t18) = 2.101;
so we accept µ1 = µ2. [s² = (9s1² + 9s2²)/18 = ½(s1² + s2²) = ½(10.06² + 10.17²) = 10.11².
With equal sample sizes, the pooled s² is the average of s1² and s2².]
95% CI for µ1 − µ2: −1.56 ± 2.101×4.523 = (−11.06, 7.94).
Using R, the following output is obtained:
Welch Two Sample t-test
data: C and K
t = -0.34487, df = 17.998, p-value = 0.7342
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-11.063388 7.943388
sample estimates:
mean of x mean of y
34.47 36.03
Thus, with this (independent samples) t-test, we do not reject µC = µK (δ = 0); the 95%
confidence interval for δ is given by (−11.06, 7.94). (Note that the confidence interval
contains zero, indicating non-rejection of δ = 0.)
(b) If the data are paired then we consider the sample of differences, di = xCi − xKi:
–0.9, –1.8, –3.7, –2.1, –1.7, –2.0, 1.1, –3.8, –1.3, 0.6.
For this sample, t = d̄/(sd/√10) = −1.56/(1.578/√10) = −1.56/0.499 = −3.13, cf. c0.975(t9) = 2.262;
so we reject δ = 0 (i.e. we reject µC = µK).
95% CI: −1.56 ± 2.262×0.499 = (−2.69, −0.43) (which does not contain zero).
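The paired analysis is just a one-sample t-test on the differences; in R:

d <- c(-0.9, -1.8, -3.7, -2.1, -1.7, -2.0, 1.1, -3.8, -1.3, 0.6)
t.test(d)   # t = -3.13, df = 9, 95% CI (-2.69, -0.43)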
7.6 (a) p̂1 − p̂2 = −0.164, se(p̂1 − p̂2) = √(0.45×0.55×(1/96 + 1/104)) = 0.070.
z = −0.164/0.070 = −2.333, p = 0.020.
There is significant evidence here that p1 < p2 , i.e. that PTCA is more effective in preventing
angina.
7.8 What should be done with the "lost to survey" individuals? If these individuals are omitted then for the resulting 2×2 table, we have χ1² = 3.77, so that p > 0.05 and we do not reject H0. This test indicates there is no significant evidence of any change in the improvement rate.
Note: If we choose to omit the “lost to survey” individuals, then we are implicitly assuming that these
individuals are similar to those that remain in the sample. This would not be the case if, for example,
individuals who showed no improvement were more inclined to remove themselves from the survey.
This is a common problem with non-respondents. We must make (reasonable) assumptions about their
behaviour — and attempt to justify these assumptions.
7.9 Group 1 (30% O2): p̂1 = 28/250 = 0.112;
Group 2 (80% O2): p̂2 = 13/250 = 0.052; p̂ = 41/500 = 0.082.
p̂1 − p̂2 = 0.060, se0(p̂1 − p̂2) = √(0.082×0.918×(1/250 + 1/250)) = 0.0245;
z = est/se0 = 0.060/0.0245 = 2.445, p = 0.014.
Since p < 0.05, there is significant evidence in these data to indicate that p2 < p1, i.e. that the rate of wound infection is less with the 80% oxygen treatment.
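The same z-test can be run as a two-sample proportion test in R (prop.test reports z² as X-squared):

prop.test(c(28, 13), c(250, 250), correct = FALSE)
# X-squared = 5.98 (= 2.445^2), p-value = 0.014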
7.10
    obs:   58    166    193    417         exp:   118.9   170.2    127.9   417
           870   1163   806    2839               809.1   1158.8   871.1   2839
           928   1329   999    3256               928     1329     999     3256
u = Σ(o − e)²/e = 73.79; df = 2, c0.95(χ2²) = 5.991; p = 0.000.
There is significant evidence of an association between nausea and seat position. The individuals in the rear seats are more likely to experience nausea, and those in the front seats are less likely to experience nausea. (This is seen by comparing observed and expected frequencies based on independence. If nausea and seat position were independent, we would expect about 128 of those in the back seats to experience nausea, whereas 193 were observed. And for the front seats, we observed 58 compared to the expected 119.)
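In R the whole analysis is one call:

nausea <- matrix(c( 58,  166, 193,
                   870, 1163, 806), nrow = 2, byrow = TRUE)
chisq.test(nausea)            # X-squared = 73.8, df = 2, p < 0.001
chisq.test(nausea)$expected   # reproduces the expected frequencies above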
7.11 i. case D: p̂1 = 63/100 = 0.63;
control D′: p̂2 = 48/100 = 0.48; p̂ = 111/200 = 0.555;
p̂1 − p̂2 = 0.15, se(p̂1 − p̂2) = √(0.555×0.445×(1/100 + 1/100)) = 0.0703;
z = est/se = 0.15/0.0703 = 2.134, p = 0.033.
We reject H0, and conclude that there is significant evidence in these data that the cases have a greater probability of exposure (compared to the controls).
Note: treating the data as a 2×2 contingency table gives u = 4.55 (= z²), p = 0.033.
ii. θ̂ = (63×52)/(48×37) = 1.84; ln θ̂ = 0.612, se(ln θ̂) = √(1/63 + 1/37 + 1/48 + 1/52) = 0.288
95% CI for ln θ: (0.612 ± 1.96×0.288) = (0.048, 1.177)
95% CI for θ: (e^0.048, e^1.177) = (1.05, 3.24)
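The odds-ratio interval in R, a sketch using the cell counts above:

theta_hat <- (63 * 52) / (48 * 37)               # 1.84
se_log    <- sqrt(1/63 + 1/37 + 1/48 + 1/52)     # 0.288
exp(log(theta_hat) + c(-1, 1) * 1.96 * se_log)   # (1.05, 3.24)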
7.12 (a) current users: t1 = 4761, x1 = 13; α̂1 = 13/4761 = 0.002731, se(α̂1) = √(α̂1/t1) = 0.000757;
never users: t3 = 98091, x3 = 113; α̂3 = 113/98091 = 0.001152, se(α̂3) = √(α̂3/t3) = 0.000108.
rate-difference, α1 − α3:
est.diff = 0.001579, se0 = 0.000519; z = est/se0 = 3.039, p = 2 Pr(Z > 3.039) = 0.002.
There is significant evidence in these data that α1 > α3, i.e. that the incidence rate among current-users is greater than among never-users.
rate ratio φ = α1/α3 (see EDDA p165) [Note: α̂ = (13 + 113)/(4761 + 98091) = 0.001225.]
φ̂ = 2.37, ln φ̂ = 0.863, 95% CI for ln φ: 0.863 ± 1.96×0.293 = (0.289 < ln φ < 1.437);
95% CI for φ: (1.34 < φ < 4.21).
rate-difference, α2 − α3:
est.diff = 0.000202, se0 = 0.000153; z = est/se0 = 1.325, p = 2 Pr(Z > 1.325) = 0.185.
There is no significant evidence in these data that α2 ≠ α3, i.e. no evidence that the incidence rate among past-users is different from the rate among never-users.
rate ratio φ = α2/α3 (see EDDA p165) [Note: α̂ = (164 + 113)/(121091 + 98091) = 0.001264.]
φ̂ = 1.18, ln φ̂ = 0.162, 95% CI for ln φ: 0.162 ± 1.96×0.122 = (−0.078 < ln φ < 0.401);
95% CI for φ: (0.93 < φ < 1.49).
Note: the CI includes 1, indicating no evidence against φ = 1, i.e. α2 = α3.
Problem Set 8
8.1 i. yes: a positive relationship; straight-line regression looks OK, there may be question-
marks at the ends, but there are only a few observations there.
ii. E(Y | x) = α + βx, var(Y | x) = σ²; and the errors are independent.
We also usually assume that the distribution is Normal.
iii. β̂ = 0.273 indicates that the average FEV increases by 0.273 L for each year of age.
iv. µ̂(10) = 0.0736 + 0.27348×10 = 2.81.
v. R² is the proportion of the variation of FEV explained by the boys' ages.
vi. r = √R² = √0.658 = 0.811 (it is positive because the relationship is positive, as seen from the scatter plot and/or the fact that β̂ > 0).
8.2 i. β̂ = r×(sy/sx) = −0.8×(6.667/3.333) = −1.6; α̂ = ȳ − β̂x̄ = 20 + 1.6×10 = 36; µ̂(x) = 36 − 1.6x.
ii. s² = ((n−1)/(n−2))(1 − r²)sy² = (9/8)(1 − 0.8²)×6.667² = 18;
K = Σ(x − x̄)² = 9×3.333² = 100; se(β̂) = √(18/100) = √0.18 = 0.424;
8.3 i.&ii.
8.4 i.
Exactly the same fitted regression line results in each case: y = 3.0 + 0.5x.
ii. The point that Anscombe wanted to make was that it is important to examine the scatter-
plots before calculating regression lines and correlations. By looking at just the regression
analyses, we would not have seen how different the data sets were.
Comment: Data set 1 (y1 on x1 ) looks reasonable for the usual assumptions and so the regression
is meaningful and appropriate. Set 2 (y2 on x1 ) is curvilinear and therefore linear regression is not
appropriate. Set 3 (y3 on x1 ) lies almost on an exact straight line except for one observation which
looks like an outlier and should therefore be investigated further before carrying out the regression.
Set 4 (y4 on x4) looks very unusual. The x values are identical except for one. With only two x values represented there is no way of knowing if the relationship is linear or non-linear.
iii. The observed value at x4 = 19 is 12.5, which is the same as the predicted value. Changing
y4 from 12.5 to 10 and refitting the regression line results in a predicted value of 10, which
is the same as the observed again. From the plot, we can see that the point (19, 12.5) is
used to fit the regression line, resulting in the observed being the same as the fitted.
8.5 (a) response variable, y = size of tumour; explanatory variable, x = level of chemical in the
blood.
(b) Yi = α + βxi + Ei, where Ei ~ N(0, σ²); the Ei are assumed to be independent.
A residual plot indicates E(Ei ) = 0 (average at zero), var(Ei ) = K (spread roughly con-
stant), and linearity of the model (no curved pattern in the residual plot); a normal plot
of the residuals checks their normality. The scatter plot indicates the reasonableness of
the straight line regression.
(c) i. β̂ = −0.15;
ii. An increase of 1 mg/L of this chemical in the blood corresponds to a decrease of
0.15cm in the mean tumour size.
(d) A test of β=0 is given by t = β̂/se(β̂) = −0.15123/0.00987 = −15.32, which is significant, compared to t23. We conclude that there is significant evidence in these data indicating β < 0.
(e) No: if 0 ∈ CI we would not reject (β=0).
(f) i. µ̂i = 10.3 − 0.15×25 = 6.55;
ii. êi = yi − µ̂i = 1.45;
iii. se(µ̂i) = √(1.213²/25 + (25 − 45)²×0.00987²) = 0.313;
90% CI for µi: (6.55 ± 1.714×0.313) = (6.01, 7.09).
(g) R² indicates the proportion of the variation in y explained by the explanatory variable x: in this case, about 90%.
r = −0.946 (r < 0 since there is a negative correlation (β̂ < 0), and |r| = √0.895).
8.7 (a)
8.8 We have n = 50, r = −0.40. From the correlation SP diagram (Figure 10), we obtain:
95% CI for ρ: (−0.61 < ρ < −0.14).
So there is significant evidence in these data to indicate that a negative relationship exists (i.e. ρ < 0), since the CI is entirely below zero and 0 ∉ CI.
8.9 (a) t = −0.796×√9/√(1 − 0.796²) = −3.945, cf. c0.975(t9) = 2.262, p = 0.003;
hence we reject the hypothesis that the variables are uncorrelated. There is evidence here that they are negatively correlated.
(b) i. t is the t-statistic to test β=0; it is equal to the t-statistic calculated in (a) to test ρ=0.
Thus t = −3.945 and p = 0.003.
ii. µ̂(10) = 88.80 − 10×2.334 = 65.46;
pe(Y(10)) = √(2.188² + 2.188²/11 + (10 − 9.66)²×0.591²) = 2.294;
95% PI for Y(10): (65.46 ± 2.262×2.294) = (60.3, 70.7).
8.10 i. β = 0: t = 0.27348/0.01080 = 25.33, cf. c0.975(t334) = 1.967, p = 0.000;
hence we reject the hypothesis that there is no relation between FEV and age: the data indicate there is a positive relationship.
ii. 95% CI for β: 0.27348 ± 1.967×0.01080 = (0.252, 0.295).
iii. H0: β = 0.16. We reject H0 since 0.16 ∉ CI; or t = (0.2735 − 0.16)/0.01080 = 10.51.
iv. s² = 0.588102² = 0.346.
v. µ̂(10) = 2.8084; se(µ̂(10)) = √(0.588102²/336 + 0.02²×0.01080²) = 0.03208;
95% CI for µ(10): 2.8084 ± 1.967×0.03208 = (2.745, 2.872).
vi. pe(Y(10)) = √(0.588102² + 0.588102²/336 + 0.02²×0.01080²) = 0.5890;
95% PI for Y(10): 2.8084 ± 1.967×0.5890 = (1.65, 3.97).
ii. x(k) ∼ cq, where q = k/(n+1); thus x(1) ∼ c0.1 = µ − 1.28σ.
iii.
z x
-1.28 40.4
-0.84 48.8
-0.52 54.8
-0.25 59.2
0.00 64.1
0.25 65.0
0.52 68.7
0.84 72.7
1.28 75.1
ii. Pr(L > T) = Pr(L − T > 0), where L − T ~ N(−10, 4² + 2²);
= Pr((L − T)s > (0 + 10)/√20) = Pr(N > 2.236) = 0.0127.
(c) i. V = var(T) = w²var(T1) + (1 − w)²var(T2) = w² + 4(1 − w)² = 5w² − 8w + 4.
V is minimised when dV/dw = 0; dV/dw = 10w − 8 = 0 ⇒ w = 0.8.
ii. θ̂ = 0.8×50.0 + 0.2×55.0 = 51.0; se(θ̂) = √(0.8²×1 + 0.2²×4) = 0.89.
R1.5 (b) prevalence estimate, p̂H = 350/2000 = 0.175.
95% CI for pH: est ± 1.96×se = 0.175 ± 1.96×√(0.175×0.825/2000) = (0.158, 0.192).
(c) i. Not all 400 individuals are observed for five years: some become cases, some may leave the study early (others may enter it late) and some may die.
ii. incidence rate estimate α̂ = 36/1200 = 0.03 (cases per person-year).
95% CI for α: 0.03 ± 1.96×√(0.03/1200) = (0.020, 0.040).
R1.6 (b) i. H0 ⇒ Z ~ N(0, 1);
significance level = Pr(reject H0 | H0) = Pr(Z > 1.96) + Pr(Z < −1.96) = 0.025 + 0.025 = 0.05.
ii. H1 (θ = 2.80) ⇒ Z ~ N(2.80, 1);
power = Pr(reject H0 | H1) = Pr(Z > 1.96) + Pr(Z < −1.96), where Z ~ N(2.80, 1)
= Pr(Zs > −0.84) + Pr(Zs < −4.76) = 0.800 + 0.000 = 0.80.
(c) i. E(Z) = (41 − 40)/(5/√n) = √n/5.
ii. To have power 0.80, we require E(Z) = 2.8, i.e. √n/5 = 2.8 ⇒ n = 196.
R1.7 (a) This can be tested using either a χ²-test or a z-test. They are equivalent.
    obs:        A     A′               exp:        A     A′
    P           10    30    40         P           15    25    40
    P′          20    20    40         P′          15    25    40
                30    50    80                     30    50    80
uc = Σ(o − e)²/e = 5²×(1/15 + 1/25 + 1/15 + 1/25) = 5.33, p = 0.021.
zc = (0.25 − 0.5)/√(0.375×0.625×(1/40 + 1/40)) = −2.309, p = 0.021. (Note: 2.309² = 5.33.)
There is significant evidence that PTCA reduces the risk of angina.
(b) Let θ denote the odds ratio. θ̂ = 1/3.
ln θ̂ = −1.0986, and se(ln θ̂) = √(1/10 + 1/30 + 1/20 + 1/20) = 0.483.
95% CI for ln θ: −1.0986 ± 1.96×0.483 = (−2.045, −0.152).
95% CI for θ: (0.129, 0.859).
The confidence interval suggests that the odds ratio is less than 1, indicating that PTCA reduces the odds of angina, in accordance with the result of (a), which indicated that PTCA reduces the risk of angina.
(b) Randomisation is important to ensure validity and to balance the effects of any potential
confounding or lurking variables.
The subjects should be randomly allocated so that each is equally likely to receive the
treatment or the placebo.
(c) Assuming this experiment was performed as a randomised controlled trial, then a signif-
icant result provides evidence supporting drug ZZZ as a cause of improvement.
R2.2 (a) The sample data are negatively skewed with mean 67 and standard deviation 20.
(b) [1]; the horizontal scale gives the standard normal quantiles with grid z = −2, −1, 0, 1, 2
(the tick-marks are at −2, 0, 2); the vertical scale gives the sample quantiles, with grid
x = 0, 10, . . . , 100 (tick-marks at 0, 20, . . . , 100).
(c) i. approx 95% CI for µ: 66.6 ± 1.99×2.2 = (62.2, 71.0).
Note: the t-distribution is not strictly appropriate here, as the population is non-normal. As the sample is moderately large it provides a reasonable approximation.
ii. (x(2), x(79)) = (16, 96) gives a 77/81 = 95.1% prediction interval.
R2.3 (a)
              q             zq       xq
    median    0.5           0        100
    Q1, Q3    0.25, 0.75    ±0.67    93.3, 106.7
    min, max  0.005, 0.995  ±2.58    74.2, 125.8
Note: For a sample of n = 200, x(1) ∼ cq, where q = 1/201 ≈ 0.005, and x(200) ∼ cq, where q = 200/201 ≈ 0.995. Thus the minimum and maximum are approximated by the 0.005 and 0.995 quantiles.
An approximate (average) boxplot for this sample: [not reproduced]
(b) i. 95% CI for µ: 2.65 ± 2.306×0.36/√9 = (2.37, 2.93).
ii. 2.90 is in the confidence interval. Hence we do not reject H0 , i.e. there is no significant
evidence of a difference in means. There is no evidence in the data that the mean
vitamin A level for stomach cancer patients is different from the controls.
R2.6 (a) i. significance level = Pr(reject H0; H0 true) = Pr(|Z| > 1.96), where Z ~ N(0, 1);
= 0.025 + 0.025 = 0.05.
ii. power = Pr(reject H0; H1 true) = Pr(|Z| > 1.96), where Z ~ N(3.24, 1);
= Pr(Z > 1.96) + Pr(Z < −1.96)
= Pr(Zs > −1.28) + Pr(Zs < −5.20)
= 0.8997 + 0.0000 = 0.90.
(b) i. E(Z) = E((X̄ − 30)/(10/√n)) = (E(X̄) − 30)/(10/√n) = (µ − 30)/(10/√n);
ii. (31 − 30)/(10/√n) = 3.24 ⇒ √n = 32.4,
so the sample size needs to be at least 1050.
Note: the formula gives n ≥ (1.96 + 1.28)²×10²/(31 − 30)².
Since |z| > 1.96, the rank test indicates rejection of the null hypothesis at the 5%
significance level.
R3.1 (a) (A) A placebo is an inactive drug, which appears the same as the active drug. It is desirable in order to ascertain whether the active drug is having an effect.
(B) In a double-blind study, neither the subject nor the treatment provider knows whether the treatment is the active or the inactive drug. It is desirable in order to guard against any possible bias: on the part of the subject or on the part of the treatment provider (due to prior expectations).
(C) In favour: there would be no between-subject variation. Against: there may be carry-over effects, from one treatment to the next. The results may not be generalisable: is Claire representative?
(D) There can be no carry-over effect in this case. It is likely to be generalisable to a larger population (the population that the subjects represent). Choose a random order for AAAAAAAABBBBBBBBCCCCCCCC (using R sampling) and assign these treatments to subjects 1, 2, . . . , 24.
(E) This method eliminates the between-subject variation, but there may be carry-over effects. For each subject, choose a random order for ABC.
(b) i. [causal diagram, not reproduced: C has arrows to both E and D]
ii. [causal diagram, not reproduced: S → X and X → H, each marked −, with a direct S → H arrow marked +]
It seems likely that X may be part of the causal link between smoking and cancer, as indicated in the diagram, and can therefore not be considered as a confounder.
R3.2 (a) i. x̄ = 30.03;
ii. s = 4.069;
iii. Q3 = x(15) = 33.2;
iv. ĉ0.1 = x(2) = 24.4.
(b) i. x̄ ≈ µ = 31;
ii. s ≈ σ = 5;
iii. Q3 ≈ c0.75 = 31 + 0.6745×5 = 34.4;
iv. ĉ0.1 ≈ c0.1 = 31 − 1.2816×5 = 24.6.
(c) i. k = 4: x-coordinate = Φ⁻¹(4/20) = −0.84; y-coordinate = x(4) = 26.9.
ii. µ̂ = 30 (y-intercept); σ̂ = 4 (slope = (34 − 30)/(1 − 0)).
iii. A normal probability plot is a QQ-plot with axes interchanged (and the population
quantile axis relabelled).
R3.3 (a) The probability table below can be found from the given information: Pr(E) = 0.4, so
Pr(E ′ ) = 0.6; Pr(E ∩ D) = Pr(E) Pr(D | E) = 0.4×0.1 = 0.04 and Pr(E ′ ∩ D) =
Pr(E ′ ) Pr(D | E ′ ) = 0.6×0.2 = 0.12. The other entries follow by subtraction, and addi-
tion.
D D′
E 0.04 0.36 0.4
E ′ 0.12 0.48 0.6
0.16 0.84 1
Then, from the probability table, we obtain:
i. Pr(D) = 0.16;
ii. Pr(E | D) = 0.04/0.16 = 0.25;
iii. negatively related since, for example, Pr(D | E) < Pr(D | E′);
iv. OR = (0.04×0.48)/(0.12×0.36) = 4/9 = 0.44.
(b) i. sensitivity = Pr(P | D) = 85/100 = 0.85; specificity = Pr(P′ | D′) = 90/100 = 0.90.
ii. Using prevalence, Pr(C) = 0.1, we can complete the probability table:
          P       P′
    C     0.085   0.015   0.1   (0.85)
    C′    0.090   0.810   0.9   (0.90)
          0.175   0.825   1
Hence ppv = 0.085/0.175 = 0.486.
iii. The maximum value of ppv occurs when the sensitivity is equal to 1.
Thus ppvmax = 0.1/0.19 = 0.526.
R3.4 (a) X ~ Bi(240, 0.3). Therefore E(X) = 240×0.3 = 72 and sd(X) = √(240×0.3×0.7) = 7.10.
approximate 95% probability interval: 72 ± 1.96×7.10 = (58.1, 85.9).
(b) X ~ Pn(22) ⇒ E(X) = 22, sd(X) = √22 = 4.69.
approximate 95% probability interval: 22 ± 1.96×4.69 = (12.8, 31.2).
(c) i. 99% probability interval for Y: 5.0 ± 2.5758×0.8 = (2.94, 7.06);
ii. Pr(Y > 6.0) = Pr(Ys > 1.25) = 0.106;
iii. Pr(Y > 7.0 | Y > 6.0) = Pr(Y > 7.0)/Pr(Y > 6.0) = Pr(Ys > 2.5)/Pr(Ys > 1.25) = 0.0062/0.1056 = 0.059.
R3.5 (a) 10 ± 1.96×1/√12 = (9.43, 10.57);
(b) α = 2 Pr(X̄ > 10.6 | µ = 10) = 2 Pr(X̄s > 2.078) = 0.038;
(c) p = 2 Pr(X̄ > 10.8 | µ = 10) = 2 Pr(X̄s > 2.771) = 0.006;
(d) power = 1 − Pr(9.4 < X̄ < 10.6), where X̄ ~ N(11, 1/12);
power = 1 − Pr(−5.542 < X̄s < −1.386) = 1 − 0.0829 = 0.917;
(e) 95% confidence interval for µ: 10.8 ± 1.96×1/√12 = (10.23, 11.37);
(f) 95% prediction interval for X: 10.8 ± 1.96×√(1 + 1/12) = (8.76, 12.84).
R3.6 (a) 95% confidence interval for mean difference: 15.0 ± 2.045×18.4/√30 = (8.13, 21.87);
there is significant evidence of an increase in mean vitamin D levels.
(b) p̂B = 400/2000 = 0.2; se(p̂B) = √(0.2×0.8/2000) = 0.0089;
95% confidence interval for pB: 0.2 ± 1.96×0.0089 = (0.182, 0.218).
(c) Under the null hypothesis (of 'normal' risk), the number of cases of K, X ~ Pn(16).
i. p = 2 Pr(X ≥ 28) = 2 Pr(Xs* > (27.5 − 16)/4) = 0.004, so there is significant evidence of excess risk.
ii. 95% confidence interval for µ: 28 ± 1.96×√28 = (17.6, 38.4);
95% confidence interval for SMR = µ/16: (1.1, 2.4).
R3.7 (a) i. 8.750 = 25×35/100; 4.464 = (15 − 8.75)²/8.75.
ii. u = 4.464 + · · · + 1.731 = 12.09, cf. χ2²;
Tables: 0.001 < p < 0.005, so we reject H0, and conclude that there is significant evidence of a difference between the groups.
(b) p̂1 = 0.6, p̂2 = 0.4; so p̂ = 0.5.
z = (0.6 − 0.4)/√(0.5×0.5×(1/25 + 1/25)) = 1.414, so that p = 2 Pr(Z > 1.414) = 0.157;
and we conclude that there is no significant evidence of a difference between the probability of improvement with A and with B.
(c) se = √(0.5×0.5×(1/n + 1/n)) = √(0.5/n); thus we require 1.96×√(0.5/n) ≤ 0.15 ⇒ n ≥ 86.
R3.8 i. [scatter plot not reproduced]
ii. K = (n − 1)sx² = 49×10² = 4900; rxy = sxy/(sx×sy) ⇒ sxy = 0.4×10² = 40.
∴ β̂ = 40/100 = 0.4 and α̂ = 30 − 0.4×30 = 18.
iii. fitted line: y = 18 + 0.4x, shown on diagram.
iv. se(β̂) = √(85.75/4900) = 0.132;
95% confidence interval for β: 0.4 ± 2.011×0.132 = (0.13, 0.67).
v. Tables (SP diagram for correlation): 0.15 < ρ < 0.60.
ii. roughly a straight line with intercept ≈ 140 and slope ≈ 10, but with points in an
increasing sequence.
R4.2 i. observational study;
ii. women 40–44 years old at baseline;
iii. prospective study;
iv. to avoid age dependence of myocardial infarction;
v. α̂1 = 31/23058 = 0.001344, se(α̂1) = √(0.001344/23058) = 0.000241;
α̂2 = 19/40730 = 0.000466, se(α̂2) = √(0.000466/40730) = 0.000107;
α̂1 − α̂2 = 0.000878, se0(α̂1 − α̂2) = √(0.000784×(1/23058 + 1/40730)) = 0.000231.
z = (α̂1 − α̂2)/se0(α̂1 − α̂2) = 0.000878/0.000231 = 3.805, p = 0.000. [2 Pr(Z > 3.805) = 0.000142]
Since z > 1.96 (or p < 0.05) we reject H0 (α1 = α2).
There is significant evidence in these data that OC-users have a greater incidence of myocardial infarction.
vi. µ1 − µ2 = 50000(α1 − α2); est = 43.9, se = 13.2; 95% CI: 43.9 ± 1.96×13.2 = (18.0, 69.8).
Note: the point and interval estimates for µ1 − µ2 are just 50 000 times the point and interval estimates for α1 − α2; se(α̂1 − α̂2) = √(0.000241² + 0.000107²) = 0.000264.
The difference is the increase in the number of myocardial infarctions associated with OC-use among 10 000 women in five years.
R4.3 (a) prob = 0.6 + 0.6 − 0.6² = 0.84 or 1 − 0.4² = 0.84;
(b)       P      P′
    C     0.18   0.12   0.3
    C′    0.07   0.63   0.7
          0.25   0.75   1
Pr(C | P) = 0.18/0.25 = 0.72;
sensitivity, sn = Pr(P | C) = 0.6;
negative predictive value, npv = Pr(C′ | P′) = 0.63/0.75 = 0.84;
(c) relative risk, RR = Pr(D | E)/Pr(D | E′); i.e. the ratio of the probability of the disease given the exposure to the probability of the disease given non-exposure;
prevalence is required to estimate relative risk.
R4.4 (a) step-function cdf: F(x) = 0.2 (0 ≤ x < 1); 0.6 (1 ≤ x < 2); 0.9 (2 ≤ x < 3); 1.0 (x ≥ 3).
(b) E(X) = 0×0.2 + 1×0.4 + 2×0.3 + 3×0.1 = 1.3;
var(X) = E((X − 1.3)²) = 1.69×0.2 + 0.09×0.4 + 0.49×0.3 + 2.89×0.1 = 0.81;
or var(X) = E(X²) − E(X)² = 0²×0.2 + 1²×0.4 + 2²×0.3 + 3²×0.1 − 1.3² = 0.81;
(c) i. E(T) = 100×1.3 = 130; var(T) = 100×0.81 = 81, so sd(T) = 9;
ii. central limit theorem: a sum of iid rvs is asymptotically Normal;
iii. Pr(T ≤ 125) ≈ Pr(T* < 125.5) = Pr(Ts* < −0.5) = 0.309.
R4.5 (a) sd(X̄) = σ/√n = 10/√11 = 3.0.
It is assumed that Ms J's blood pressure is stable and that the daily readings are independent.
(b) µ̂ = 142; se(µ̂) = 20/√25 = 4;
R4.6 (a) i. x̄ = 50.0/5 = 10.0; s² = (1/4)(9 + 1 + 1 + 9) = 5.0, or s² = (1/4)(520 − 50²/5) = 5.0.
ii. 95% PI for X: 10.0 ± 2.776×√(5×(1 + 1/5)) = (10.0 ± 6.8) = (3.2, 16.8);
(b) The mean of a six-month period is 6×2.75 = 16.5. So, the number of cases in a six-month period, X ~ Pn(16.5).
Pr(X ≤ 10) ≈ Pr(X* < 10.5) = Pr(Xs* < (10.5 − 16.5)/√16.5) = Pr(Xs* < −1.477) = 0.070.
Using R, Pr(X ≤ 10) = 0.0619.
(c) i. t = (448 − 500)/(80/√16) = −52/20 = −2.6; cf. c0.975(t15) = 2.131;
ii. p = 2 Pr(t15 < −2.6) ≈ 0.02.
iii. reject H0. There is significant evidence that the mean MDI is less than 500.
R4.7 (a) Z ~ Bi(100, 0.95);
(b) i. α = Pr(|W| > 2.17), where W ~ N(0, 1); = 2×0.015 = 0.03;
ii. p = 2 Pr(W > 1.53), where W ~ N(0, 1); = 2×0.063 = 0.126;
iii. power = Pr(|W| > 2.17), where W ~ N(3, 1); = Pr(Ws > −0.83) = 0.797.
(c) i. X ≤ 4 or X ≥ 19;
ii. 2.2 < λ < 13.1;
iii. 0.44 < α < 2.62, since α = λ/5.
R4.8 (a) s² = (14×60 + 9×44.7)/23 = 54.0;
(b) 95% CI for µ1 − µ2: 6.5 ± 2.069×√(54×(1/15 + 1/10)) = (6.5 ± 6.2) = (0.3, 12.7).
(c) There is significant evidence that the treatment gives a greater decrease in mean diastolic blood pressure.
R4.9 i. β̂ = 180/100 = 1.8; α̂ = ȳ − β̂x̄ = 20 − 30×1.8 = −34;
ii. s² = (1/8)(400 − 180²/100) = 76/8 = 9.5; se(β̂) = √(9.5/100) = 0.31;
iii. [not reproduced]
iv. r = 180/(10×20) = 0.9; 95% CI for ρ: (0.60, 0.97), using Tables Figure 10.
iii. The vertical scale is warped so that a Normal cdf is a straight line.
iv. T > 0; mode less than 10; T < 40; positively skew.
(c)
i. scatter plot
R5.3 (a)
          F      F′
    I     0.01   0.02   0.03
    I′    0.13   0.84   0.97
          0.14   0.86   1
i. Pr(F′ ∩ I′) = 0.84;
ii. Pr(F | I) = 1/3 = 0.33 > Pr(F) = 0.14, so F & I are positively related.
(b) i. E(Z) = a×10 + (1 − a)×10 = 10;
var(Z) = a²×2² + (1 − a)²×1² = 4a² + (1 − a)² = 5a² − 2a + 1.
ii. dV/da = 10a − 2 = 0 ⇒ a = 0.2;
iii. Vmin = 0.2²×2² + 0.8²×1² = 0.16 + 0.64 = 0.8.
R5.4 (pooled-t) s² = (29s1² + 29s2²)/58 = 0.0373, s = 0.193;
tp = (0.04 − 0.10)/(0.193×√(1/30 + 1/30)) = −1.203, cf. c0.975(t58) = 2.00; so we do not reject H0 (µ1 = µ2).
This test assumes that the samples are independent random samples from populations that are normally distributed with equal variances. There may be some question about the last assumption, but . . . if we were to use the unpooled-t: tu = (0.04 − 0.10)/√(0.11²/30 + 0.25²/30) = −1.203.
Note: since s² = ½(s1² + s2²), it follows that tu = tp.
tu is compared to c0.975(tk); and since 29 ≤ k ≤ 58, 2.00 ≤ c ≤ 2.05, thus the conclusion is the same: do not reject H0.
R5.5 data:
          E     E′
    C     6     94     100    p̂1 = 0.06
    C′    9     216    225    p̂2 = 0.04
          15    310    325    p̂ = 0.0462
(a) z = (0.06 − 0.04)/√(0.0462×0.9538×(1/100 + 1/225)) = 0.793;
since |z| < 1.96, there is no evidence to indicate rejection of p1 = p2.
(b) type I error (rejecting H0 when H0 is true) means we would conclude that OC alters the
cancer risk when it does not;
type II error (not rejecting H0 when H1 is true) means we would conclude that OC does
not alter the cancer risk when it does.
(c) i. type I ⇒ OC removed when it is OK; type II ⇒ OC continues to be sold when it is
not OK.
ii. type I is a problem for the drug company; type II is a problem for women using OC.
R5.6 (a) p̂ = 35/96 = 0.365, se(p̂) = √(0.365×0.635/96) = 0.049;
95% CI for p: 0.365 ± 1.96×0.049 = (0.27, 0.46).
(b) i. λ̂ = 17;
ii. Let X denote the number of deaths recorded among the specified cohort. X ~ Pn(λ), and we wish to test H0: λ = 6.3.
Thus p = 2 Pr(X ≥ 17), where X ~ Pn(6.3); and hence p = 0.0006, using Tables or R. Hence we reject H0. There is significant evidence here that λ > 6.3, i.e. that there is excess mortality due to cirrhosis of the liver among this cohort.
Note: the approx z-test gives zc = (17 − 6.3 − 0.5)/√6.3 = 4.06, so p ≈ 0.000, and we reject H0.
iii. The Poisson SP diagram (Tables Figure 4) gives 95% CI for λ: (9.9, 27.2).
Note: R gives (9.90, 27.22); the approx 95% CI gives (8.9, 25.1).
SMR = λ/6.3. 95% CI for SMR: (1.6, 4.3).
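The exact Poisson results quoted from R come from poisson.test:

poisson.test(17, r = 6.3)   # p approx 0.0006; 95% CI for lambda: (9.90, 27.22)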
R5.7 (a) a comparison is required to demonstrate the effectiveness of the treatment;
(b) i. city and treatment are confounded;
ii. ten treatments and ten controls in each city.
(c) a variable that may be confounded with the treatment: gender, age, health, . . . .
(d) nT = 20, x̄T = −10.5, sT = 5.2; nC = 20, x̄C = −6.1, sC = 4.9.
s² = ½(5.2² + 4.9²) = 25.525 ⇒ s = 5.05.
95% CI: (−10.5 + 6.1) ± 2.024×5.05×√(1/20 + 1/20) = (−4.4 ± 3.23) = (−7.6, −1.2).
There is significant evidence that the decrease is greater with the treatment, since 0 ∉ CI.
R5.8 (a) ordered data: (4.35, 4.55, 4.95, 5.05, 5.28, 5.36, 5.40, 5.46, 5.50, 6.45);
min = 4.35, Q1 = x(2.75) = 4.85, med = x(5.5) = 5.32, Q3 = x(8.25) = 5.47, max = 6.45.
(b) population: µ0 = 4.91, σ0 = 0.57; X̄ ≈ N(µ, 0.57²/10), x̄ = 5.235;
z = (5.235 − 4.91)/(0.57/√10) = 1.80, p = 2 Pr(N > 1.80) = 0.071.
Since p > 0.01, we do not reject H0.
Apart from assuming σ = 0.57, we also assume that X̄ is approximately normally dis-
tributed. This is based on the central limit theorem, but since we only have a sample of
10, the population distribution cannot be too far from Normal. (There may be some doubt
about the “outlier” at 6.45; but, as s = 0.58 which is very close to the assumed population σ, it
seems that this is not unreasonable.)
(c) i. To determine the power, we need to specify the decision rule.
For α = 0.01, we reject H0 if |z| > 2.5758,
i.e. if x̄ < 4.91 − 2.5758×0.57/√10 = 4.45 or if x̄ > 4.91 + 2.5758×0.57/√10 = 5.37.
So, if µ = 5.5, power ≈ Pr(X̄′ > 5.37), where X̄′ ~ N(5.5, 0.57²/10).
power = Pr(X̄s′ > (5.37 − 5.5)/(0.57/√10)) = Pr(N > −0.697) = 0.757.
ii. The power would be increased. The critical values would be closer to 4.91 (4.91 ± 1.96×0.57/√10, i.e. 4.56 and 5.26), so power = Pr(X̄′ > 5.26) > Pr(X̄′ > 5.37).
R5.9 (a) i. β̂ = 0.273 indicates that the average FEV increases by 0.273 L for each year of age.
ii. Test of β=0; p = 0.000 means that the probability of observing a value of t as extreme as this if β=0 is less than 0.0005; t ~ t334.
iii. µ̂(12) = 3.355; se(µ̂(12)) = √(0.5881²/336 + 2²×0.01080²) = 0.0387;
95% CI for µ(12): 3.355 ± 1.967×0.0387 = (3.28, 3.43).
(b) Yes. The 95% confidence interval obtained from the Correlation SP diagram (Tables Figure 10) gives (0.03 < ρ < 0.65), which excludes zero. Hence ρ = 0 would be rejected; there is significant evidence indicating a positive relationship.
Statistical Tables
p
x 0.01 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50
n=1 0 .9900 .9500 .9000 .8500 .8000 .7500 .7000 .6500 .6000 .5500 .5000 1
1 .0100 .0500 .1000 .1500 .2000 .2500 .3000 .3500 .4000 .4500 .5000 0
n=2 0 .9801 .9025 .8100 .7225 .6400 .5625 .4900 .4225 .3600 .3025 .2500 2
1 .0198 .0950 .1800 .2550 .3200 .3750 .4200 .4550 .4800 .4950 .5000 1
2 .0001 .0025 .0100 .0225 .0400 .0625 .0900 .1225 .1600 .2025 .2500 0
n=3 0 .9703 .8574 .7290 .6141 .5120 .4219 .3430 .2746 .2160 .1664 .1250 3
1 .0294 .1354 .2430 .3251 .3840 .4219 .4410 .4436 .4320 .4084 .3750 2
2 .0003 .0071 .0270 .0574 .0960 .1406 .1890 .2389 .2880 .3341 .3750 1
3 .0001 .0010 .0034 .0080 .0156 .0270 .0429 .0640 .0911 .1250 0
n=4 0 .9606 .8145 .6561 .5220 .4096 .3164 .2401 .1785 .1296 .0915 .0625 4
1 .0388 .1715 .2916 .3685 .4096 .4219 .4116 .3845 .3456 .2995 .2500 3
2 .0006 .0135 .0486 .0975 .1536 .2109 .2646 .3105 .3456 .3675 .3750 2
3 .0005 .0036 .0115 .0256 .0469 .0756 .1115 .1536 .2005 .2500 1
4 .0001 .0005 .0016 .0039 .0081 .0150 .0256 .0410 .0625 0
n=5 0 .9510 .7738 .5905 .4437 .3277 .2373 .1681 .1160 .0778 .0503 .0313 5
1 .0480 .2036 .3281 .3915 .4096 .3955 .3602 .3124 .2592 .2059 .1563 4
2 .0010 .0214 .0729 .1382 .2048 .2637 .3087 .3364 .3456 .3369 .3125 3
3 .0011 .0081 .0244 .0512 .0879 .1323 .1811 .2304 .2757 .3125 2
4 .0005 .0022 .0064 .0146 .0284 .0488 .0768 .1128 .1563 1
5 .0001 .0003 .0010 .0024 .0053 .0102 .0185 .0313 0
n=6 0 .9415 .7351 .5314 .3771 .2621 .1780 .1176 .0754 .0467 .0277 .0156 6
1 .0571 .2321 .3543 .3993 .3932 .3560 .3025 .2437 .1866 .1359 .0938 5
2 .0014 .0305 .0984 .1762 .2458 .2966 .3241 .3280 .3110 .2780 .2344 4
3 .0021 .0146 .0415 .0819 .1318 .1852 .2355 .2765 .3032 .3125 3
4 .0001 .0012 .0055 .0154 .0330 .0595 .0951 .1382 .1861 .2344 2
5 .0001 .0004 .0015 .0044 .0102 .0205 .0369 .0609 .0938 1
6 .0001 .0002 .0007 .0018 .0041 .0083 .0156 0
n=7 0 .9321 .6983 .4783 .3206 .2097 .1335 .0824 .0490 .0280 .0152 .0078 7
1 .0659 .2573 .3720 .3960 .3670 .3115 .2471 .1848 .1306 .0872 .0547 6
2 .0020 .0406 .1240 .2097 .2753 .3115 .3177 .2985 .2613 .2140 .1641 5
3 .0036 .0230 .0617 .1147 .1730 .2269 .2679 .2903 .2918 .2734 4
4 .0002 .0026 .0109 .0287 .0577 .0972 .1442 .1935 .2388 .2734 3
5 .0002 .0012 .0043 .0115 .0250 .0466 .0774 .1172 .1641 2
6 .0001 .0004 .0013 .0036 .0084 .0172 .0320 .0547 1
7 .0001 .0002 .0006 .0016 .0037 .0078 0
n=8 0 .9227 .6634 .4305 .2725 .1678 .1001 .0576 .0319 .0168 .0084 .0039 8
1 .0746 .2793 .3826 .3847 .3355 .2670 .1977 .1373 .0896 .0548 .0313 7
2 .0026 .0515 .1488 .2376 .2936 .3115 .2965 .2587 .2090 .1569 .1094 6
3 .0001 .0054 .0331 .0839 .1468 .2076 .2541 .2786 .2787 .2568 .2188 5
4 .0004 .0046 .0185 .0459 .0865 .1361 .1875 .2322 .2627 .2734 4
5 .0004 .0026 .0092 .0231 .0467 .0808 .1239 .1719 .2188 3
6 .0002 .0011 .0038 .0100 .0217 .0413 .0703 .1094 2
7 .0001 .0004 .0012 .0033 .0079 .0164 .0313 1
8 .0001 .0002 .0007 .0017 .0039 0
p 0.99 0.95 0.90 0.85 0.80 0.75 0.70 0.65 0.60 0.55 0.50 (x in right-hand column)
x \ p 0.01 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50
n=9 0 .9135 .6302 .3874 .2316 .1342 .0751 .0404 .0207 .0101 .0046 .0020 9
1 .0830 .2985 .3874 .3679 .3020 .2253 .1556 .1004 .0605 .0339 .0176 8
2 .0034 .0629 .1722 .2597 .3020 .3003 .2668 .2162 .1612 .1110 .0703 7
3 .0001 .0077 .0446 .1069 .1762 .2336 .2668 .2716 .2508 .2119 .1641 6
4 .0006 .0074 .0283 .0661 .1168 .1715 .2194 .2508 .2600 .2461 5
5 .0008 .0050 .0165 .0389 .0735 .1181 .1672 .2128 .2461 4
6 .0001 .0006 .0028 .0087 .0210 .0424 .0743 .1160 .1641 3
7 .0003 .0012 .0039 .0098 .0212 .0407 .0703 2
8 .0001 .0004 .0013 .0035 .0083 .0176 1
9 .0001 .0003 .0008 .0020 0
n=10 0 .9044 .5987 .3487 .1969 .1074 .0563 .0282 .0135 .0060 .0025 .0010 10
1 .0914 .3151 .3874 .3474 .2684 .1877 .1211 .0725 .0403 .0207 .0098 9
2 .0042 .0746 .1937 .2759 .3020 .2816 .2335 .1757 .1209 .0763 .0439 8
3 .0001 .0105 .0574 .1298 .2013 .2503 .2668 .2522 .2150 .1665 .1172 7
4 .0010 .0112 .0401 .0881 .1460 .2001 .2377 .2508 .2384 .2051 6
5 .0001 .0015 .0085 .0264 .0584 .1029 .1536 .2007 .2340 .2461 5
6 .0001 .0012 .0055 .0162 .0368 .0689 .1115 .1596 .2051 4
7 .0001 .0008 .0031 .0090 .0212 .0425 .0746 .1172 3
8 .0001 .0004 .0014 .0043 .0106 .0229 .0439 2
9 .0001 .0005 .0016 .0042 .0098 1
10 .0001 .0003 .0010 0
n=11 0 .8953 .5688 .3138 .1673 .0859 .0422 .0198 .0088 .0036 .0014 .0005 11
1 .0995 .3293 .3835 .3248 .2362 .1549 .0932 .0518 .0266 .0125 .0054 10
2 .0050 .0867 .2131 .2866 .2953 .2581 .1998 .1395 .0887 .0513 .0269 9
3 .0002 .0137 .0710 .1517 .2215 .2581 .2568 .2254 .1774 .1259 .0806 8
4 .0014 .0158 .0536 .1107 .1721 .2201 .2428 .2365 .2060 .1611 7
5 .0001 .0025 .0132 .0388 .0803 .1321 .1830 .2207 .2360 .2256 6
6 .0003 .0023 .0097 .0268 .0566 .0985 .1471 .1931 .2256 5
7 .0003 .0017 .0064 .0173 .0379 .0701 .1128 .1611 4
8 .0002 .0011 .0037 .0102 .0234 .0462 .0806 3
9 .0001 .0005 .0018 .0052 .0126 .0269 2
10 .0002 .0007 .0021 .0054 1
11 .0002 .0005 0
n=12 0 .8864 .5404 .2824 .1422 .0687 .0317 .0138 .0057 .0022 .0008 .0002 12
1 .1074 .3413 .3766 .3012 .2062 .1267 .0712 .0368 .0174 .0075 .0029 11
2 .0060 .0988 .2301 .2924 .2835 .2323 .1678 .1088 .0639 .0339 .0161 10
3 .0002 .0173 .0852 .1720 .2362 .2581 .2397 .1954 .1419 .0923 .0537 9
4 .0021 .0213 .0683 .1329 .1936 .2311 .2367 .2128 .1700 .1208 8
5 .0002 .0038 .0193 .0532 .1032 .1585 .2039 .2270 .2225 .1934 7
6 .0005 .0040 .0155 .0401 .0792 .1281 .1766 .2124 .2256 6
7 .0006 .0033 .0115 .0291 .0591 .1009 .1489 .1934 5
8 .0001 .0005 .0024 .0078 .0199 .0420 .0762 .1208 4
9 .0001 .0004 .0015 .0048 .0125 .0277 .0537 3
10 .0002 .0008 .0025 .0068 .0161 2
11 .0001 .0003 .0010 .0029 1
12 .0001 .0002 0
n=13 0 .8775 .5133 .2542 .1209 .0550 .0238 .0097 .0037 .0013 .0004 .0001 13
1 .1152 .3512 .3672 .2774 .1787 .1029 .0540 .0259 .0113 .0045 .0016 12
2 .0070 .1109 .2448 .2937 .2680 .2059 .1388 .0836 .0453 .0220 .0095 11
3 .0003 .0214 .0997 .1900 .2457 .2517 .2181 .1651 .1107 .0660 .0349 10
4 .0028 .0277 .0838 .1535 .2097 .2337 .2222 .1845 .1350 .0873 9
5 .0003 .0055 .0266 .0691 .1258 .1803 .2154 .2214 .1989 .1571 8
6 .0008 .0063 .0230 .0559 .1030 .1546 .1968 .2169 .2095 7
7 .0001 .0011 .0058 .0186 .0442 .0833 .1312 .1775 .2095 6
8 .0001 .0011 .0047 .0142 .0336 .0656 .1089 .1571 5
9 .0001 .0009 .0034 .0101 .0243 .0495 .0873 4
10 .0001 .0006 .0022 .0065 .0162 .0349 3
11 .0001 .0003 .0012 .0036 .0095 2
12 .0001 .0005 .0016 1
13 .0001 0
p 0.99 0.95 0.90 0.85 0.80 0.75 0.70 0.65 0.60 0.55 0.50 (x in right-hand column)
x \ p 0.01 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50
n=14 0 .8687 .4877 .2288 .1028 .0440 .0178 .0068 .0024 .0008 .0002 .0001 14
1 .1229 .3593 .3559 .2539 .1539 .0832 .0407 .0181 .0073 .0027 .0009 13
2 .0081 .1229 .2570 .2912 .2501 .1802 .1134 .0634 .0317 .0141 .0056 12
3 .0003 .0259 .1142 .2056 .2501 .2402 .1943 .1366 .0845 .0462 .0222 11
4 .0037 .0349 .0998 .1720 .2202 .2290 .2022 .1549 .1040 .0611 10
5 .0004 .0078 .0352 .0860 .1468 .1963 .2178 .2066 .1701 .1222 9
6 .0013 .0093 .0322 .0734 .1262 .1759 .2066 .2088 .1833 8
7 .0002 .0019 .0092 .0280 .0618 .1082 .1574 .1952 .2095 7
8 .0003 .0020 .0082 .0232 .0510 .0918 .1398 .1833 6
9 .0003 .0018 .0066 .0183 .0408 .0762 .1222 5
10 .0003 .0014 .0049 .0136 .0312 .0611 4
11 .0002 .0010 .0033 .0093 .0222 3
12 .0001 .0005 .0019 .0056 2
13 .0001 .0002 .0009 1
14 .0001 0
n=15 0 .8601 .4633 .2059 .0874 .0352 .0134 .0047 .0016 .0005 .0001 15
1 .1303 .3658 .3432 .2312 .1319 .0668 .0305 .0126 .0047 .0016 .0005 14
2 .0092 .1348 .2669 .2856 .2309 .1559 .0916 .0476 .0219 .0090 .0032 13
3 .0004 .0307 .1285 .2184 .2501 .2252 .1700 .1110 .0634 .0318 .0139 12
4 .0049 .0428 .1156 .1876 .2252 .2186 .1792 .1268 .0780 .0417 11
5 .0006 .0105 .0449 .1032 .1651 .2061 .2123 .1859 .1404 .0916 10
6 .0019 .0132 .0430 .0917 .1472 .1906 .2066 .1914 .1527 9
7 .0003 .0030 .0138 .0393 .0811 .1319 .1771 .2013 .1964 8
8 .0005 .0035 .0131 .0348 .0710 .1181 .1647 .1964 7
9 .0001 .0007 .0034 .0116 .0298 .0612 .1048 .1527 6
10 .0001 .0007 .0030 .0096 .0245 .0515 .0916 5
11 .0001 .0006 .0024 .0074 .0191 .0417 4
12 .0001 .0004 .0016 .0052 .0139 3
13 .0001 .0003 .0010 .0032 2
14 .0001 .0005 1
15 0
n=16 0 .8515 .4401 .1853 .0743 .0281 .0100 .0033 .0010 .0003 .0001 16
1 .1376 .3706 .3294 .2097 .1126 .0535 .0228 .0087 .0030 .0009 .0002 15
2 .0104 .1463 .2745 .2775 .2111 .1336 .0732 .0353 .0150 .0056 .0018 14
3 .0005 .0359 .1423 .2285 .2463 .2079 .1465 .0888 .0468 .0215 .0085 13
4 .0061 .0514 .1311 .2001 .2252 .2040 .1553 .1014 .0572 .0278 12
5 .0008 .0137 .0555 .1201 .1802 .2099 .2008 .1623 .1123 .0667 11
6 .0001 .0028 .0180 .0550 .1101 .1649 .1982 .1983 .1684 .1222 10
7 .0004 .0045 .0197 .0524 .1010 .1524 .1889 .1969 .1746 9
8 .0001 .0009 .0055 .0197 .0487 .0923 .1417 .1812 .1964 8
9 .0001 .0012 .0058 .0185 .0442 .0840 .1318 .1746 7
10 .0002 .0014 .0056 .0167 .0392 .0755 .1222 6
11 .0002 .0013 .0049 .0142 .0337 .0667 5
12 .0002 .0011 .0040 .0115 .0278 4
13 .0000 .0002 .0008 .0029 .0085 3
14 .0001 .0005 .0018 2
15 .0001 .0002 1
16 0
n=17 0 .8429 .4181 .1668 .0631 .0225 .0075 .0023 .0007 .0002 17
1 .1447 .3741 .3150 .1893 .0957 .0426 .0169 .0060 .0019 .0005 .0001 16
2 .0117 .1575 .2800 .2673 .1914 .1136 .0581 .0260 .0102 .0035 .0010 15
3 .0006 .0415 .1556 .2359 .2393 .1893 .1245 .0701 .0341 .0144 .0052 14
4 .0076 .0605 .1457 .2093 .2209 .1868 .1320 .0796 .0411 .0182 13
5 .0010 .0175 .0668 .1361 .1914 .2081 .1849 .1379 .0875 .0472 12
6 .0001 .0039 .0236 .0680 .1276 .1784 .1991 .1839 .1432 .0944 11
7 .0007 .0065 .0267 .0668 .1201 .1685 .1927 .1841 .1484 10
8 .0001 .0014 .0084 .0279 .0644 .1134 .1606 .1883 .1855 9
9 .0003 .0021 .0093 .0276 .0611 .1070 .1540 .1855 8
10 .0004 .0025 .0095 .0263 .0571 .1008 .1484 7
11 .0001 .0005 .0026 .0090 .0242 .0525 .0944 6
12 .0001 .0006 .0024 .0081 .0215 .0472 5
13 .0001 .0005 .0021 .0068 .0182 4
14 .0001 .0004 .0016 .0052 3
15 .0001 .0003 .0010 2
16 .0001 1
17 0
p 0.99 0.95 0.90 0.85 0.80 0.75 0.70 0.65 0.60 0.55 0.50 (x in right-hand column)
x \ p 0.01 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50
n=18 0 .8345 .3972 .1501 .0536 .0180 .0056 .0016 .0004 .0001 18
1 .1517 .3763 .3002 .1704 .0811 .0338 .0126 .0042 .0012 .0003 .0001 17
2 .0130 .1683 .2835 .2556 .1723 .0958 .0458 .0190 .0069 .0022 .0006 16
3 .0007 .0473 .1680 .2406 .2297 .1704 .1046 .0547 .0246 .0095 .0031 15
4 .0093 .0700 .1592 .2153 .2130 .1681 .1104 .0614 .0291 .0117 14
5 .0014 .0218 .0787 .1507 .1988 .2017 .1664 .1146 .0666 .0327 13
6 .0002 .0052 .0301 .0816 .1436 .1873 .1941 .1655 .1181 .0708 12
7 .0010 .0091 .0350 .0820 .1376 .1792 .1892 .1657 .1214 11
8 .0002 .0022 .0120 .0376 .0811 .1327 .1734 .1864 .1669 10
9 .0004 .0033 .0139 .0386 .0794 .1284 .1694 .1855 9
10 .0001 .0008 .0042 .0149 .0385 .0771 .1248 .1669 8
11 .0001 .0010 .0046 .0151 .0374 .0742 .1214 7
12 .0002 .0012 .0047 .0145 .0354 .0708 6
13 .0002 .0012 .0045 .0134 .0327 5
14 .0002 .0011 .0039 .0117 4
15 .0002 .0009 .0031 3
16 .0001 .0006 2
17 .0001 1
18 0
n=19 0 .8262 .3774 .1351 .0456 .0144 .0042 .0011 .0003 .0001 19
1 .1586 .3774 .2852 .1529 .0685 .0268 .0093 .0029 .0008 .0002 18
2 .0144 .1787 .2852 .2428 .1540 .0803 .0358 .0138 .0046 .0013 .0003 17
3 .0008 .0533 .1796 .2428 .2182 .1517 .0869 .0422 .0175 .0062 .0018 16
4 .0112 .0798 .1714 .2182 .2023 .1491 .0909 .0467 .0203 .0074 15
5 .0018 .0266 .0907 .1636 .2023 .1916 .1468 .0933 .0497 .0222 14
6 .0002 .0069 .0374 .0955 .1574 .1916 .1844 .1451 .0949 .0518 13
7 .0014 .0122 .0443 .0974 .1525 .1844 .1797 .1443 .0961 12
8 .0002 .0032 .0166 .0487 .0981 .1489 .1797 .1771 .1442 11
9 .0007 .0051 .0198 .0514 .0980 .1464 .1771 .1762 10
10 .0001 .0013 .0066 .0220 .0528 .0976 .1449 .1762 9
11 .0003 .0018 .0077 .0233 .0532 .0970 .1442 8
12 .0004 .0022 .0083 .0237 .0529 .0961 7
13 .0001 .0005 .0024 .0085 .0233 .0518 6
14 .0001 .0006 .0024 .0082 .0222 5
15 .0001 .0005 .0022 .0074 4
16 .0001 .0005 .0018 3
17 .0001 .0003 2
18 1
19 0
n=20 0 .8179 .3585 .1216 .0388 .0115 .0032 .0008 .0002 20
1 .1652 .3774 .2702 .1368 .0576 .0211 .0068 .0020 .0005 .0001 19
2 .0159 .1887 .2852 .2293 .1369 .0669 .0278 .0100 .0031 .0008 .0002 18
3 .0010 .0596 .1901 .2428 .2054 .1339 .0716 .0323 .0123 .0040 .0011 17
4 .0133 .0898 .1821 .2182 .1897 .1304 .0738 .0350 .0139 .0046 16
5 .0022 .0319 .1028 .1746 .2023 .1789 .1272 .0746 .0365 .0148 15
6 .0003 .0089 .0454 .1091 .1686 .1916 .1712 .1244 .0746 .0370 14
7 .0020 .0160 .0545 .1124 .1643 .1844 .1659 .1221 .0739 13
8 .0004 .0046 .0222 .0609 .1144 .1614 .1797 .1623 .1201 12
9 .0001 .0011 .0074 .0271 .0654 .1158 .1597 .1771 .1602 11
10 .0002 .0020 .0099 .0308 .0686 .1171 .1593 .1762 10
11 .0005 .0030 .0120 .0336 .0710 .1185 .1602 9
12 .0001 .0008 .0039 .0136 .0355 .0727 .1201 8
13 .0002 .0010 .0045 .0146 .0366 .0739 7
14 .0002 .0012 .0049 .0150 .0370 6
15 .0003 .0013 .0049 .0148 5
16 .0003 .0013 .0046 4
17 .0002 .0011 3
18 .0002 2
19 1
20 0
p 0.99 0.95 0.90 0.85 0.80 0.75 0.70 0.65 0.60 0.55 0.50 (x in right-hand column)
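In R, the tabulated binomial probabilities can be reproduced directly (a sketch; dbinom gives the pmf, pbinom the cdf):

    dbinom(0:20, size = 20, prob = 0.30)   # the n = 20, p = 0.30 column: .0008 .0068 .0278 ...
    dbinom(0:20, size = 20, prob = 0.70)   # p > 0.5 directly, without the bottom-row trick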
Binomial SP chart (figure): horizontal axis x/n, vertical axis p; the curves are labelled by the sample size n = 10, 20, 50, 100, 200, 500. For an observed x/n the curves give a 95% confidence interval for p; for a given p they give a two-sided critical region of size 0.05 (cf. the notes to the Poisson and correlation charts).
Table 3. Poisson distribution pmf, Pr(X = x), for X =ᵈ Pn(λ).
x \ λ 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 x
0 .9048 .8187 .7408 .6703 .6065 .5488 .4966 .4493 .4066 .3679 0
1 .0905 .1637 .2222 .2681 .3033 .3293 .3476 .3595 .3659 .3679 1
2 .0045 .0164 .0333 .0536 .0758 .0988 .1217 .1438 .1647 .1839 2
3 .0002 .0011 .0033 .0072 .0126 .0198 .0284 .0383 .0494 .0613 3
4 .0001 .0003 .0007 .0016 .0030 .0050 .0077 .0111 .0153 4
5 .0001 .0002 .0004 .0007 .0012 .0020 .0031 5
6 .0001 .0002 .0003 .0005 6
7 .0001 7
x \ λ 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0 x
0 .3329 .3012 .2725 .2466 .2231 .2019 .1827 .1653 .1496 .1353 0
1 .3662 .3614 .3543 .3452 .3347 .3230 .3106 .2975 .2842 .2707 1
2 .2014 .2169 .2303 .2417 .2510 .2584 .2640 .2678 .2700 .2707 2
3 .0738 .0867 .0998 .1128 .1255 .1378 .1496 .1607 .1710 .1804 3
4 .0203 .0260 .0324 .0395 .0471 .0551 .0636 .0723 .0812 .0902 4
5 .0045 .0062 .0084 .0111 .0141 .0176 .0216 .0260 .0309 .0361 5
6 .0008 .0012 .0018 .0026 .0035 .0047 .0061 .0078 .0098 .0120 6
7 .0001 .0002 .0003 .0005 .0008 .0011 .0015 .0020 .0027 .0034 7
8 .0001 .0001 .0001 .0002 .0003 .0005 .0006 .0009 8
9 .0001 .0001 .0001 .0002 9
x \ λ 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3.0 x
0 .1225 .1108 .1003 .0907 .0821 .0743 .0672 .0608 .0550 .0498 0
1 .2572 .2438 .2306 .2177 .2052 .1931 .1815 .1703 .1596 .1494 1
2 .2700 .2681 .2652 .2613 .2565 .2510 .2450 .2384 .2314 .2240 2
3 .1890 .1966 .2033 .2090 .2138 .2176 .2205 .2225 .2237 .2240 3
4 .0992 .1082 .1169 .1254 .1336 .1414 .1488 .1557 .1622 .1680 4
5 .0417 .0476 .0538 .0602 .0668 .0735 .0804 .0872 .0940 .1008 5
6 .0146 .0174 .0206 .0241 .0278 .0319 .0362 .0407 .0455 .0504 6
7 .0044 .0055 .0068 .0083 .0099 .0118 .0139 .0163 .0188 .0216 7
8 .0011 .0015 .0019 .0025 .0031 .0038 .0047 .0057 .0068 .0081 8
9 .0003 .0004 .0005 .0007 .0009 .0011 .0014 .0018 .0022 .0027 9
10 .0001 .0001 .0001 .0002 .0002 .0003 .0004 .0005 .0006 .0008 10
11 .0001 .0001 .0001 .0002 .0002 11
12 .0001 12
x \ λ 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 4.0 x
0 .0450 .0408 .0369 .0334 .0302 .0273 .0247 .0224 .0202 .0183 0
1 .1397 .1304 .1217 .1135 .1057 .0984 .0915 .0850 .0789 .0733 1
2 .2165 .2087 .2008 .1929 .1850 .1771 .1692 .1615 .1539 .1465 2
3 .2237 .2226 .2209 .2186 .2158 .2125 .2087 .2046 .2001 .1954 3
4 .1733 .1781 .1823 .1858 .1888 .1912 .1931 .1944 .1951 .1954 4
5 .1075 .1140 .1203 .1264 .1322 .1377 .1429 .1477 .1522 .1563 5
6 .0555 .0608 .0662 .0716 .0771 .0826 .0881 .0936 .0989 .1042 6
7 .0246 .0278 .0312 .0348 .0385 .0425 .0466 .0508 .0551 .0595 7
8 .0095 .0111 .0129 .0148 .0169 .0191 .0215 .0241 .0269 .0298 8
9 .0033 .0040 .0047 .0056 .0066 .0076 .0089 .0102 .0116 .0132 9
10 .0010 .0013 .0016 .0019 .0023 .0028 .0033 .0039 .0045 .0053 10
11 .0003 .0004 .0005 .0006 .0007 .0009 .0011 .0013 .0016 .0019 11
12 .0001 .0001 .0001 .0002 .0002 .0003 .0003 .0004 .0005 .0006 12
13 .0001 .0001 .0001 .0001 .0002 .0002 13
14 .0001 14
x \ λ 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9 5.0 x
0 .0166 .0150 .0136 .0123 .0111 .0101 .0091 .0082 .0074 .0067 0
1 .0679 .0630 .0583 .0540 .0500 .0462 .0427 .0395 .0365 .0337 1
2 .1393 .1323 .1254 .1188 .1125 .1063 .1005 .0948 .0894 .0842 2
3 .1904 .1852 .1798 .1743 .1687 .1631 .1574 .1517 .1460 .1404 3
4 .1951 .1944 .1933 .1917 .1898 .1875 .1849 .1820 .1789 .1755 4
5 .1600 .1633 .1662 .1687 .1708 .1725 .1738 .1747 .1753 .1755 5
6 .1093 .1143 .1191 .1237 .1281 .1323 .1362 .1398 .1432 .1462 6
7 .0640 .0686 .0732 .0778 .0824 .0869 .0914 .0959 .1002 .1044 7
8 .0328 .0360 .0393 .0428 .0463 .0500 .0537 .0575 .0614 .0653 8
9 .0150 .0168 .0188 .0209 .0232 .0255 .0281 .0307 .0334 .0363 9
10 .0061 .0071 .0081 .0092 .0104 .0118 .0132 .0147 .0164 .0181 10
11 .0023 .0027 .0032 .0037 .0043 .0049 .0056 .0064 .0073 .0082 11
12 .0008 .0009 .0011 .0013 .0016 .0019 .0022 .0026 .0030 .0034 12
13 .0002 .0003 .0004 .0005 .0006 .0007 .0008 .0009 .0011 .0013 13
14 .0001 .0001 .0001 .0001 .0002 .0002 .0003 .0003 .0004 .0005 14
15 .0001 .0001 .0001 .0001 .0001 .0002 15
x \ λ 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9 6.0 x
0 .0061 .0055 .0050 .0045 .0041 .0037 .0033 .0030 .0027 .0025 0
1 .0311 .0287 .0265 .0244 .0225 .0207 .0191 .0176 .0162 .0149 1
2 .0793 .0746 .0701 .0659 .0618 .0580 .0544 .0509 .0477 .0446 2
3 .1348 .1293 .1239 .1185 .1133 .1082 .1033 .0985 .0938 .0892 3
4 .1719 .1681 .1641 .1600 .1558 .1515 .1472 .1428 .1383 .1339 4
5 .1753 .1748 .1740 .1728 .1714 .1697 .1678 .1656 .1632 .1606 5
6 .1490 .1515 .1537 .1555 .1571 .1584 .1594 .1601 .1605 .1606 6
7 .1086 .1125 .1163 .1200 .1234 .1267 .1298 .1326 .1353 .1377 7
8 .0692 .0731 .0771 .0810 .0849 .0887 .0925 .0962 .0998 .1033 8
9 .0392 .0423 .0454 .0486 .0519 .0552 .0586 .0620 .0654 .0688 9
10 .0200 .0220 .0241 .0262 .0285 .0309 .0334 .0359 .0386 .0413 10
11 .0093 .0104 .0116 .0129 .0143 .0157 .0173 .0190 .0207 .0225 11
12 .0039 .0045 .0051 .0058 .0065 .0073 .0082 .0092 .0102 .0113 12
13 .0015 .0018 .0021 .0024 .0028 .0032 .0036 .0041 .0046 .0052 13
14 .0006 .0007 .0008 .0009 .0011 .0013 .0015 .0017 .0019 .0022 14
15 .0002 .0002 .0003 .0003 .0004 .0005 .0006 .0007 .0008 .0009 15
16 .0001 .0001 .0001 .0001 .0001 .0002 .0002 .0002 .0003 .0003 16
17 .0001 .0001 .0001 .0001 .0001 17
x \ λ 6.1 6.2 6.3 6.4 6.5 6.6 6.7 6.8 6.9 7.0 x
0 .0022 .0020 .0018 .0017 .0015 .0014 .0012 .0011 .0010 .0009 0
1 .0137 .0126 .0116 .0106 .0098 .0090 .0082 .0076 .0070 .0064 1
2 .0417 .0390 .0364 .0340 .0318 .0296 .0276 .0258 .0240 .0223 2
3 .0848 .0806 .0765 .0726 .0688 .0652 .0617 .0584 .0552 .0521 3
4 .1294 .1249 .1205 .1162 .1118 .1076 .1034 .0992 .0952 .0912 4
5 .1579 .1549 .1519 .1487 .1454 .1420 .1385 .1349 .1314 .1277 5
6 .1605 .1601 .1595 .1586 .1575 .1562 .1546 .1529 .1511 .1490 6
7 .1399 .1418 .1435 .1450 .1462 .1472 .1480 .1486 .1489 .1490 7
8 .1066 .1099 .1130 .1160 .1188 .1215 .1240 .1263 .1284 .1304 8
9 .0723 .0757 .0791 .0825 .0858 .0891 .0923 .0954 .0985 .1014 9
10 .0441 .0469 .0498 .0528 .0558 .0588 .0618 .0649 .0679 .0710 10
11 .0244 .0265 .0285 .0307 .0330 .0353 .0377 .0401 .0426 .0452 11
12 .0124 .0137 .0150 .0164 .0179 .0194 .0210 .0227 .0245 .0263 12
13 .0058 .0065 .0073 .0081 .0089 .0099 .0108 .0119 .0130 .0142 13
14 .0025 .0029 .0033 .0037 .0041 .0046 .0052 .0058 .0064 .0071 14
15 .0010 .0012 .0014 .0016 .0018 .0020 .0023 .0026 .0029 .0033 15
16 .0004 .0005 .0005 .0006 .0007 .0008 .0010 .0011 .0013 .0014 16
17 .0001 .0002 .0002 .0002 .0003 .0003 .0004 .0004 .0005 .0006 17
18 .0001 .0001 .0001 .0001 .0001 .0001 .0002 .0002 .0002 18
19 .0001 .0001 .0001 .0001 19
x \ λ 7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8 7.9 8.0 x
0 .0008 .0007 .0007 .0006 .0006 .0005 .0005 .0004 .0004 .0003 0
1 .0059 .0054 .0049 .0045 .0041 .0038 .0035 .0032 .0029 .0027 1
2 .0208 .0194 .0180 .0167 .0156 .0145 .0134 .0125 .0116 .0107 2
3 .0492 .0464 .0438 .0413 .0389 .0366 .0345 .0324 .0305 .0286 3
4 .0874 .0836 .0799 .0764 .0729 .0696 .0663 .0632 .0602 .0573 4
5 .1241 .1204 .1167 .1130 .1094 .1057 .1021 .0986 .0951 .0916 5
6 .1468 .1445 .1420 .1394 .1367 .1339 .1311 .1282 .1252 .1221 6
7 .1489 .1486 .1481 .1474 .1465 .1454 .1442 .1428 .1413 .1396 7
8 .1321 .1337 .1351 .1363 .1373 .1381 .1388 .1392 .1395 .1396 8
9 .1042 .1070 .1096 .1121 .1144 .1167 .1187 .1207 .1224 .1241 9
10 .0740 .0770 .0800 .0829 .0858 .0887 .0914 .0941 .0967 .0993 10
11 .0478 .0504 .0531 .0558 .0585 .0613 .0640 .0667 .0695 .0722 11
12 .0283 .0303 .0323 .0344 .0366 .0388 .0411 .0434 .0457 .0481 12
13 .0154 .0168 .0181 .0196 .0211 .0227 .0243 .0260 .0278 .0296 13
14 .0078 .0086 .0095 .0104 .0113 .0123 .0134 .0145 .0157 .0169 14
15 .0037 .0041 .0046 .0051 .0057 .0062 .0069 .0075 .0083 .0090 15
16 .0016 .0019 .0021 .0024 .0026 .0030 .0033 .0037 .0041 .0045 16
17 .0007 .0008 .0009 .0010 .0012 .0013 .0015 .0017 .0019 .0021 17
18 .0003 .0003 .0004 .0004 .0005 .0006 .0006 .0007 .0008 .0009 18
19 .0001 .0001 .0001 .0002 .0002 .0002 .0003 .0003 .0003 .0004 19
20 .0001 .0001 .0001 .0001 .0001 .0001 .0001 .0002 20
21 .0001 .0001 21
x \ λ 8.1 8.2 8.3 8.4 8.5 8.6 8.7 8.8 8.9 9.0 x
0 .0003 .0003 .0002 .0002 .0002 .0002 .0002 .0002 .0001 .0001 0
1 .0025 .0023 .0021 .0019 .0017 .0016 .0014 .0013 .0012 .0011 1
2 .0100 .0092 .0086 .0079 .0074 .0068 .0063 .0058 .0054 .0050 2
3 .0269 .0252 .0237 .0222 .0208 .0195 .0183 .0171 .0160 .0150 3
4 .0544 .0517 .0491 .0466 .0443 .0420 .0398 .0377 .0357 .0337 4
5 .0882 .0849 .0816 .0784 .0752 .0722 .0692 .0663 .0635 .0607 5
6 .1191 .1160 .1128 .1097 .1066 .1034 .1003 .0972 .0941 .0911 6
7 .1378 .1358 .1338 .1317 .1294 .1271 .1247 .1222 .1197 .1171 7
8 .1395 .1392 .1388 .1382 .1375 .1366 .1356 .1344 .1332 .1318 8
9 .1256 .1269 .1280 .1290 .1299 .1306 .1311 .1315 .1317 .1318 9
10 .1017 .1040 .1063 .1084 .1104 .1123 .1140 .1157 .1172 .1186 10
11 .0749 .0776 .0802 .0828 .0853 .0878 .0902 .0925 .0948 .0970 11
12 .0505 .0530 .0555 .0579 .0604 .0629 .0654 .0679 .0703 .0728 12
13 .0315 .0334 .0354 .0374 .0395 .0416 .0438 .0459 .0481 .0504 13
14 .0182 .0196 .0210 .0225 .0240 .0256 .0272 .0289 .0306 .0324 14
15 .0098 .0107 .0116 .0126 .0136 .0147 .0158 .0169 .0182 .0194 15
16 .0050 .0055 .0060 .0066 .0072 .0079 .0086 .0093 .0101 .0109 16
17 .0024 .0026 .0029 .0033 .0036 .0040 .0044 .0048 .0053 .0058 17
18 .0011 .0012 .0014 .0015 .0017 .0019 .0021 .0024 .0026 .0029 18
19 .0005 .0005 .0006 .0007 .0008 .0009 .0010 .0011 .0012 .0014 19
20 .0002 .0002 .0002 .0003 .0003 .0004 .0004 .0005 .0005 .0006 20
21 .0001 .0001 .0001 .0001 .0001 .0002 .0002 .0002 .0002 .0003 21
22 .0001 .0001 .0001 .0001 .0001 .0001
x \ λ 9.1 9.2 9.3 9.4 9.5 9.6 9.7 9.8 9.9 10.0 x
Poisson SP chart (figure).
Note: It is assumed that an observation x is obtained from a Poisson distribution with parameter λ.
For a specified value of x, the curves specify a 95% confidence interval for λ. For a specified value of λ, the
curves give a two-sided critical region of size 0.05 to test the hypothesis that the specified value of λ is the
true value.
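In R, the tabulated Poisson probabilities, and the exact interval that the chart reads off, can be obtained directly (a sketch; poisson.test gives the exact CI for λ):

    dpois(0:9, lambda = 2.0)        # the lambda = 2.0 column of Table 3: .1353 .2707 ...
    poisson.test(x = 4)$conf.int    # exact 95% CI for lambda when x = 4 is observed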
Table 5. Standard normal distribution cdf, Φ(x). The columns give the second decimal place of x; the right-hand block gives fourth-decimal mean differences for interpolation.
x .00 .01 .02 .03 .04 .05 .06 .07 .08 .09 | 1 2 3 4 5 6 7 8 9
0.0 .5000 .5040 .5080 .5120 .5160 .5199 .5239 .5279 .5319 .5359 4 8 12 16 20 24 28 32 36
0.1 .5398 .5438 .5478 .5517 .5557 .5596 .5636 .5675 .5714 .5753 4 8 12 16 20 24 28 32 35
0.2 .5793 .5832 .5871 .5910 .5948 .5987 .6026 .6064 .6103 .6141 4 8 12 15 19 23 27 31 35
0.3 .6179 .6217 .6255 .6293 .6331 .6368 .6406 .6443 .6480 .6517 4 8 11 15 19 23 26 30 34
0.4 .6554 .6591 .6628 .6664 .6700 .6736 .6772 .6808 .6844 .6879 4 7 11 14 18 22 25 29 32
0.5 .6915 .6950 .6985 .7019 .7054 .7088 .7123 .7157 .7190 .7224 3 7 10 14 17 21 24 27 31
0.6 .7257 .7291 .7324 .7357 .7389 .7422 .7454 .7486 .7517 .7549 3 6 10 13 16 19 23 26 29
0.7 .7580 .7611 .7642 .7673 .7704 .7734 .7764 .7794 .7823 .7852 3 6 9 12 15 18 21 24 27
0.8 .7881 .7910 .7939 .7967 .7995 .8023 .8051 .8078 .8106 .8133 3 6 8 11 14 17 19 22 25
0.9 .8159 .8186 .8212 .8238 .8264 .8289 .8315 .8340 .8365 .8389 3 5 8 10 13 15 18 20 23
1.0 .8413 .8438 .8461 .8485 .8508 .8531 .8554 .8577 .8599 .8621 2 5 7 9 12 14 16 18 21
1.1 .8643 .8665 .8686 .8708 .8729 .8749 .8770 .8790 .8810 .8830 2 4 6 8 10 12 14 16 19
1.2 .8849 .8869 .8888 .8907 .8925 .8944 .8962 .8980 .8997 .9015 2 4 6 7 9 11 13 15 16
1.3 .9032 .9049 .9066 .9082 .9099 .9115 .9131 .9147 .9162 .9177 2 3 5 6 8 10 11 13 14
1.4 .9192 .9207 .9222 .9236 .9251 .9265 .9279 .9292 .9306 .9319 1 3 4 6 7 8 10 11 13
1.5 .9332 .9345 .9357 .9370 .9382 .9394 .9406 .9418 .9429 .9441 1 2 4 5 6 7 8 10 11
1.6 .9452 .9463 .9474 .9484 .9495 .9505 .9515 .9525 .9535 .9545 1 2 3 4 5 6 7 8 9
1.7 .9554 .9564 .9573 .9582 .9591 .9599 .9608 .9616 .9625 .9633 1 2 3 3 4 5 6 7 8
1.8 .9641 .9649 .9656 .9664 .9671 .9678 .9686 .9693 .9699 .9706 1 1 2 3 4 4 5 6 6
1.9 .9713 .9719 .9726 .9732 .9738 .9744 .9750 .9756 .9761 .9767 1 1 2 2 3 4 4 5 5
2.0 .9772 .9778 .9783 .9788 .9793 .9798 .9803 .9808 .9812 .9817 0 1 1 2 2 3 3 4 4
2.1 .9821 .9826 .9830 .9834 .9838 .9842 .9846 .9850 .9854 .9857 0 1 1 2 2 2 3 3 4
2.2 .9861 .9864 .9868 .9871 .9875 .9878 .9881 .9884 .9887 .9890 0 1 1 1 2 2 2 3 3
2.3 .9893 .9896 .9898 .9901 .9904 .9906 .9909 .9911 .9913 .9916 0 1 1 1 1 2 2 2 2
2.4 .9918 .9920 .9922 .9925 .9927 .9929 .9931 .9932 .9934 .9936 0 0 1 1 1 1 1 2 2
2.5 .9938 .9940 .9941 .9943 .9945 .9946 .9948 .9949 .9951 .9952 0 0 0 1 1 1 1 1 1
2.6 .9953 .9955 .9956 .9957 .9959 .9960 .9961 .9962 .9963 .9964 0 0 0 0 1 1 1 1 1
2.7 .9965 .9966 .9967 .9968 .9969 .9970 .9971 .9972 .9973 .9974 0 0 0 0 0 1 1 1 1
2.8 .9974 .9975 .9976 .9977 .9977 .9978 .9979 .9979 .9980 .9981 0 0 0 0 0 0 0 1 1
2.9 .9981 .9982 .9982 .9983 .9984 .9984 .9985 .9985 .9986 .9986 0 0 0 0 0 0 0 0 0
3.0 .9987 .9987 .9987 .9988 .9988 .9989 .9989 .9989 .9990 .9990 0 0 0 0 0 0 0 0 0
3.1 .9990 .9991 .9991 .9991 .9992 .9992 .9992 .9992 .9993 .9993 0 0 0 0 0 0 0 0 0
3.2 .9993 .9993 .9994 .9994 .9994 .9994 .9994 .9995 .9995 .9995 0 0 0 0 0 0 0 0 0
3.3 .9995 .9995 .9995 .9996 .9996 .9996 .9996 .9996 .9996 .9997 0 0 0 0 0 0 0 0 0
3.4 .9997 .9997 .9997 .9997 .9997 .9997 .9997 .9997 .9997 .9998 0 0 0 0 0 0 0 0 0
3.5 .9998 .9998 .9998 .9998 .9998 .9998 .9998 .9998 .9998 .9998 0 0 0 0 0 0 0 0 0
3.6 .9998 .9998 .9999 .9999 .9999 .9999 .9999 .9999 .9999 .9999 0 0 0 0 0 0 0 0 0
3.7 .9999 .9999 .9999 .9999 .9999 .9999 .9999 .9999 .9999 .9999 0 0 0 0 0 0 0 0 0
3.8 .9999 .9999 .9999 .9999 .9999 .9999 .9999 .9999 .9999 .9999 0 0 0 0 0 0 0 0 0
Table 6. Standard normal distribution inverse cdf, c_q = Φ⁻¹(q).
q c_q q c_q q c_q q c_q q c_q q c_q
0.50 0.0000 0.60 0.2533 0.70 0.5244 0.80 0.8416 0.90 1.2816 0.99 2.3263
0.51 0.0251 0.61 0.2793 0.71 0.5534 0.81 0.8779 0.91 1.3408 0.991 2.3656
0.52 0.0502 0.62 0.3055 0.72 0.5828 0.82 0.9154 0.92 1.4051 0.992 2.4089
0.53 0.0753 0.63 0.3319 0.73 0.6128 0.83 0.9542 0.93 1.4758 0.993 2.4573
0.54 0.1004 0.64 0.3585 0.74 0.6433 0.84 0.9945 0.94 1.5548 0.994 2.5121
0.55 0.1257 0.65 0.3853 0.75 0.6745 0.85 1.0364 0.95 1.6449 0.995 2.5758
0.56 0.1510 0.66 0.4125 0.76 0.7063 0.86 1.0803 0.96 1.7507 0.996 2.6521
0.57 0.1764 0.67 0.4399 0.77 0.7388 0.87 1.1264 0.97 1.8808 0.997 2.7478
0.58 0.2019 0.68 0.4677 0.78 0.7722 0.88 1.1750 0.975 1.9600 0.998 2.8782
0.59 0.2275 0.69 0.4958 0.79 0.8064 0.89 1.2265 0.98 2.0537 0.999 3.0902
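Tables 5 and 6 correspond to pnorm and qnorm in R (a sketch):

    pnorm(1.96)    # cdf: 0.9750
    qnorm(0.975)   # inverse cdf: 1.9600
    qnorm(0.995)   # 2.5758, the multiplier for 99% confidence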
Table 7. t distribution inverse cdf. Entries are c_p(t_df), the p-quantile of the t distribution with df degrees of freedom.
df \ p 0.600 0.750 0.800 0.900 0.950 0.975 0.990 0.995 0.999 0.9995
1 0.325 1.000 1.376 3.078 6.314 12.71 31.82 63.66 318.3 636.6
2 0.289 0.816 1.061 1.886 2.920 4.303 6.965 9.925 22.33 31.60
3 0.277 0.765 0.978 1.638 2.353 3.182 4.541 5.841 10.21 12.92
4 0.271 0.741 0.941 1.533 2.132 2.776 3.747 4.604 7.173 8.610
5 0.267 0.727 0.920 1.476 2.015 2.571 3.365 4.032 5.894 6.869
6 0.265 0.718 0.906 1.440 1.943 2.447 3.143 3.707 5.208 5.959
7 0.263 0.711 0.896 1.415 1.895 2.365 2.998 3.499 4.785 5.408
8 0.262 0.706 0.889 1.397 1.860 2.306 2.896 3.355 4.501 5.041
9 0.261 0.703 0.883 1.383 1.833 2.262 2.821 3.250 4.297 4.781
10 0.260 0.700 0.879 1.372 1.812 2.228 2.764 3.169 4.144 4.587
11 0.260 0.697 0.876 1.363 1.796 2.201 2.718 3.106 4.025 4.437
12 0.259 0.695 0.873 1.356 1.782 2.179 2.681 3.055 3.930 4.318
13 0.259 0.694 0.870 1.350 1.771 2.160 2.650 3.012 3.852 4.221
14 0.258 0.692 0.868 1.345 1.761 2.145 2.624 2.977 3.787 4.140
15 0.258 0.691 0.866 1.341 1.753 2.131 2.602 2.947 3.733 4.073
16 0.258 0.690 0.865 1.337 1.746 2.120 2.583 2.921 3.686 4.015
17 0.257 0.689 0.863 1.333 1.740 2.110 2.567 2.898 3.646 3.965
18 0.257 0.688 0.862 1.330 1.734 2.101 2.552 2.878 3.610 3.922
19 0.257 0.688 0.861 1.328 1.729 2.093 2.539 2.861 3.579 3.883
20 0.257 0.687 0.860 1.325 1.725 2.086 2.528 2.845 3.552 3.850
21 0.257 0.686 0.859 1.323 1.721 2.080 2.518 2.831 3.527 3.819
22 0.256 0.686 0.858 1.321 1.717 2.074 2.508 2.819 3.505 3.792
23 0.256 0.685 0.858 1.319 1.714 2.069 2.500 2.807 3.485 3.768
24 0.256 0.685 0.857 1.318 1.711 2.064 2.492 2.797 3.467 3.745
25 0.256 0.684 0.856 1.316 1.708 2.060 2.485 2.787 3.450 3.725
26 0.256 0.684 0.856 1.315 1.706 2.056 2.479 2.779 3.435 3.707
27 0.256 0.684 0.855 1.314 1.703 2.052 2.473 2.771 3.421 3.689
28 0.256 0.683 0.855 1.313 1.701 2.048 2.467 2.763 3.408 3.674
29 0.256 0.683 0.854 1.311 1.699 2.045 2.462 2.756 3.396 3.660
30 0.256 0.683 0.854 1.310 1.697 2.042 2.457 2.750 3.385 3.646
31 0.256 0.682 0.853 1.309 1.696 2.040 2.453 2.744 3.375 3.633
32 0.255 0.682 0.853 1.309 1.694 2.037 2.449 2.738 3.365 3.622
33 0.255 0.682 0.853 1.308 1.692 2.035 2.445 2.733 3.356 3.611
34 0.255 0.682 0.852 1.307 1.691 2.032 2.441 2.728 3.348 3.601
35 0.255 0.682 0.852 1.306 1.690 2.030 2.438 2.724 3.340 3.591
36 0.255 0.681 0.852 1.306 1.688 2.028 2.434 2.719 3.333 3.582
37 0.255 0.681 0.851 1.305 1.687 2.026 2.431 2.715 3.326 3.574
38 0.255 0.681 0.851 1.304 1.686 2.024 2.429 2.712 3.319 3.566
39 0.255 0.681 0.851 1.304 1.685 2.023 2.426 2.708 3.313 3.558
40 0.255 0.681 0.851 1.303 1.684 2.021 2.423 2.704 3.307 3.551
50 0.255 0.679 0.849 1.299 1.676 2.009 2.403 2.678 3.261 3.496
60 0.254 0.679 0.848 1.296 1.671 2.000 2.390 2.660 3.232 3.460
70 0.254 0.678 0.847 1.294 1.667 1.994 2.381 2.648 3.211 3.435
80 0.254 0.678 0.846 1.292 1.664 1.990 2.374 2.639 3.195 3.416
90 0.254 0.677 0.846 1.291 1.662 1.987 2.368 2.632 3.183 3.402
100 0.254 0.677 0.845 1.290 1.660 1.984 2.364 2.626 3.174 3.390
120 0.254 0.677 0.845 1.289 1.658 1.980 2.358 2.617 3.160 3.373
160 0.254 0.676 0.844 1.287 1.654 1.975 2.350 2.607 3.142 3.352
200 0.254 0.676 0.843 1.286 1.653 1.972 2.345 2.601 3.131 3.340
240 0.254 0.676 0.843 1.285 1.651 1.970 2.342 2.596 3.125 3.332
300 0.254 0.675 0.843 1.284 1.650 1.968 2.339 2.592 3.118 3.323
400 0.254 0.675 0.843 1.284 1.649 1.966 2.336 2.588 3.111 3.315
∞ 0.253 0.674 0.842 1.282 1.645 1.960 2.326 2.576 3.090 3.290
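Table 7 corresponds to qt in R (a sketch):

    qt(0.975, df = 10)    # 2.228, as in the df = 10 row
    qt(0.975, df = Inf)   # 1.960 = qnorm(0.975), the limiting normal value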
Table 8. χ² distribution inverse cdf. Entries are c_p(χ²_df), the p-quantile of the χ² distribution with df degrees of freedom.
df \ p 0.005 0.010 0.025 0.050 0.100 0.250 0.500 0.750 0.900 0.950 0.975 0.990 0.995 0.999
1 0.000 0.000 0.001 0.004 0.016 0.102 0.455 1.323 2.706 3.841 5.024 6.635 7.879 10.83
2 0.010 0.020 0.051 0.103 0.211 0.575 1.386 2.773 4.605 5.991 7.378 9.210 10.60 13.82
3 0.072 0.115 0.216 0.352 0.584 1.213 2.366 4.108 6.251 7.815 9.348 11.34 12.84 16.27
4 0.207 0.297 0.484 0.711 1.064 1.923 3.357 5.385 7.779 9.488 11.14 13.28 14.86 18.47
5 0.412 0.554 0.831 1.145 1.610 2.675 4.351 6.626 9.236 11.07 12.83 15.09 16.75 20.51
6 0.676 0.872 1.237 1.635 2.204 3.455 5.348 7.841 10.64 12.59 14.45 16.81 18.55 22.46
7 0.989 1.239 1.690 2.167 2.833 4.255 6.346 9.037 12.02 14.07 16.01 18.48 20.28 24.32
8 1.344 1.647 2.180 2.733 3.490 5.071 7.344 10.22 13.36 15.51 17.53 20.09 21.95 26.12
9 1.735 2.088 2.700 3.325 4.168 5.899 8.343 11.39 14.68 16.92 19.02 21.67 23.59 27.88
10 2.156 2.558 3.247 3.940 4.865 6.737 9.342 12.55 15.99 18.31 20.48 23.21 25.19 29.59
11 2.603 3.053 3.816 4.575 5.578 7.584 10.34 13.70 17.28 19.68 21.92 24.73 26.76 31.26
12 3.074 3.571 4.404 5.226 6.304 8.438 11.34 14.85 18.55 21.03 23.34 26.22 28.30 32.91
13 3.565 4.107 5.009 5.892 7.041 9.299 12.34 15.98 19.81 22.36 24.74 27.69 29.82 34.53
14 4.075 4.660 5.629 6.571 7.790 10.17 13.34 17.12 21.06 23.68 26.12 29.14 31.32 36.12
15 4.601 5.229 6.262 7.261 8.547 11.04 14.34 18.25 22.31 25.00 27.49 30.58 32.80 37.70
16 5.142 5.812 6.908 7.962 9.312 11.91 15.34 19.37 23.54 26.30 28.85 32.00 34.27 39.25
17 5.697 6.408 7.564 8.672 10.09 12.79 16.34 20.49 24.77 27.59 30.19 33.41 35.72 40.79
18 6.265 7.015 8.231 9.390 10.86 13.68 17.34 21.60 25.99 28.87 31.53 34.81 37.16 42.31
19 6.844 7.633 8.907 10.12 11.65 14.56 18.34 22.72 27.20 30.14 32.85 36.19 38.58 43.82
20 7.434 8.260 9.591 10.85 12.44 15.45 19.34 23.83 28.41 31.41 34.17 37.57 40.00 45.31
21 8.034 8.897 10.28 11.59 13.24 16.34 20.34 24.93 29.62 32.67 35.48 38.93 41.40 46.80
22 8.643 9.542 10.98 12.34 14.04 17.24 21.34 26.04 30.81 33.92 36.78 40.29 42.80 48.27
23 9.260 10.20 11.69 13.09 14.85 18.14 22.34 27.14 32.01 35.17 38.08 41.64 44.18 49.73
24 9.886 10.86 12.40 13.85 15.66 19.04 23.34 28.24 33.20 36.42 39.36 42.98 45.56 51.18
25 10.52 11.52 13.12 14.61 16.47 19.94 24.34 29.34 34.38 37.65 40.65 44.31 46.93 52.62
26 11.16 12.20 13.84 15.38 17.29 20.84 25.34 30.43 35.56 38.89 41.92 45.64 48.29 54.05
27 11.81 12.88 14.57 16.15 18.11 21.75 26.34 31.53 36.74 40.11 43.19 46.96 49.65 55.48
28 12.46 13.56 15.31 16.93 18.94 22.66 27.34 32.62 37.92 41.34 44.46 48.28 50.99 56.89
29 13.12 14.26 16.05 17.71 19.77 23.57 28.34 33.71 39.09 42.56 45.72 49.59 52.34 58.30
30 13.79 14.95 16.79 18.49 20.60 24.48 29.34 34.80 40.26 43.77 46.98 50.89 53.67 59.70
31 14.46 15.66 17.54 19.28 21.43 25.39 30.34 35.89 41.42 44.99 48.23 52.19 55.00 61.10
32 15.13 16.36 18.29 20.07 22.27 26.30 31.34 36.97 42.58 46.19 49.48 53.49 56.33 62.49
33 15.82 17.07 19.05 20.87 23.11 27.22 32.34 38.06 43.75 47.40 50.73 54.78 57.65 63.87
34 16.50 17.79 19.81 21.66 23.95 28.14 33.34 39.14 44.90 48.60 51.97 56.06 58.96 65.25
35 17.19 18.51 20.57 22.47 24.80 29.05 34.34 40.22 46.06 49.80 53.20 57.34 60.27 66.62
36 17.89 19.23 21.34 23.27 25.64 29.97 35.34 41.30 47.21 51.00 54.44 58.62 61.58 67.98
37 18.59 19.96 22.11 24.07 26.49 30.89 36.34 42.38 48.36 52.19 55.67 59.89 62.88 69.35
38 19.29 20.69 22.88 24.88 27.34 31.81 37.34 43.46 49.51 53.38 56.90 61.16 64.18 70.70
39 20.00 21.43 23.65 25.70 28.20 32.74 38.34 44.54 50.66 54.57 58.12 62.43 65.48 72.06
40 20.71 22.16 24.43 26.51 29.05 33.66 39.34 45.62 51.81 55.76 59.34 63.69 66.77 73.40
50 27.99 29.71 32.36 34.76 37.69 42.94 49.33 56.33 63.17 67.50 71.42 76.15 79.49 86.66
60 35.53 37.48 40.48 43.19 46.46 52.29 59.33 66.98 74.40 79.08 83.30 88.38 91.95 99.61
70 43.28 45.44 48.76 51.74 55.33 61.70 69.33 77.58 85.53 90.53 95.02 100.4 104.2 112.3
80 51.17 53.54 57.15 60.39 64.28 71.14 79.33 88.13 96.58 101.9 106.6 112.3 116.3 124.8
90 59.20 61.75 65.65 69.13 73.29 80.62 89.33 98.65 107.6 113.1 118.1 124.1 128.3 137.2
100 67.33 70.06 74.22 77.93 82.36 90.13 99.33 109.1 118.5 124.3 129.6 135.8 140.2 149.4
120 83.85 86.92 91.57 95.70 100.6 109.2 119.3 130.1 140.2 146.6 152.2 159.0 163.6 173.6
140 100.7 104.0 109.1 113.7 119.0 128.4 139.3 150.9 161.8 168.6 174.6 181.8 186.8 197.4
160 117.7 121.3 126.9 131.8 137.5 147.6 159.3 171.7 183.3 190.5 196.9 204.5 209.8 221.0
180 134.9 138.8 144.7 150.0 156.2 166.9 179.3 192.4 204.7 212.3 219.0 227.1 232.6 244.4
200 152.2 156.4 162.7 168.3 174.8 186.2 199.3 213.1 226.0 234.0 241.1 249.4 255.3 267.5
240 187.3 192.0 199.0 205.1 212.4 224.9 239.3 254.4 268.5 277.1 284.8 293.9 300.2 313.4
300 240.7 246.0 253.9 260.9 269.1 283.1 299.3 316.1 331.8 341.4 349.9 359.9 366.8 381.4
400 330.9 337.2 346.5 354.6 364.2 380.6 399.3 418.7 436.6 447.6 457.3 468.7 476.6 493.1
Note: Linear interpolation with respect to df should be satisfactory for most purposes.
For df > 100, use c_q(χ²_df) ≈ ½ ( c_q(N) + √(2 df − 1) )², where N denotes the standard
normal distribution.
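In R, qchisq replaces Table 8, and the large-df approximation above can be checked directly (a sketch):

    df <- 200
    qchisq(0.95, df)                           # exact 0.95-quantile: 234.0
    0.5 * (qnorm(0.95) + sqrt(2 * df - 1))^2   # approximation: 233.7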
Correlation SP chart (figure): horizontal axis r (observed sample correlation), vertical axis ρ; the curves are labelled by the sample size n = 10, 20, 50, 100, 200, 500.
Note: It is assumed that a random sample of n observations is obtained on a bivariate normal pop-
ulation with correlation coefficient, ρ. The numbers on the curves indicate the sample size.
For an observed value of the sample correlation coefficient, r, the curves specify a 95% con-
fidence interval for ρ. For a given value of ρ, the curves specify a two-sided critical region of
size 0.05 to test the hypothesis that the given value is the true value.
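With the raw data available, cor.test in R computes the interval that the chart approximates (a sketch with simulated data; the bivariate normal assumption is as in the note above):

    set.seed(1)
    x <- rnorm(50); y <- 0.5 * x + rnorm(50)   # illustrative sample, n = 50
    cor.test(x, y)$conf.int                    # 95% CI for rho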
STATISTICS
Types of variable and their properties:
categorical: category
ordinal: category + order
numerical: category + order + scale [counting = discrete, measurement = continuous]
sample variance, s² (form for computation):
s² = 1/(n−1) ( Σ_{i=1}^n x_i² − (1/n)(Σ_{i=1}^n x_i)² ) ≈ 1/(n−1) ( Σ_{j=1}^k f_j u_j² − (1/n)(Σ_{j=1}^k f_j u_j)² )
sample standard deviation, s: s = √s²
sample interquartile range, IQR: IQR = Q3 − Q1; τ̂ = ĉ_0.75 − ĉ_0.25 (a number, not an interval)
sample range: x_(n) − x_(1)
frequency distributions: dotplot, bar graph, histogram
sample pmf, p̂(x): p̂(x) = (1/n) freq(X = x)
sample pdf, f̂(x): f̂(x) = 1/(n(b−a)) freq(a < X < b) for cell a < x < b [histogram]
sample cdf, F̂(x): F̂(x) = (1/n) freq(X ≤ x); F̂(x) = k/n for x_(k) ≤ x < x_(k+1)
sample quantiles (inverse cdf): F̂(ĉ_q) ≈ q, i.e. ĉ_q ≈ F̂⁻¹(q) (read off the cdf plot where F̂ crosses height q)
sample covariance, s_xy: s_xy = 1/(n−1) Σ_{i=1}^n (x_i − x̄)(y_i − ȳ)
sample correlation, r = r_xy: r_xy = s_xy/(s_x s_y) = Σ(x−x̄)(y−ȳ) / √( Σ(x−x̄)² Σ(y−ȳ)² ) = 1/(n−1) Σ_{i=1}^n ((x_i − x̄)/s_x)((y_i − ȳ)/s_y)
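The descriptive statistics summarised above are all built into R (a sketch with toy data):

    x <- c(4.2, 5.1, 4.8, 5.6, 4.9); y <- c(3.9, 5.0, 4.7, 5.8, 5.1)
    mean(x); median(x); var(x); sd(x); IQR(x); range(x)
    cor(x, y)   # sample correlation r
    hist(x)     # sample pdf (histogram)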
STATISTICS
Data sources. Types of studies:
  experimental studies                   observational studies
    clinical trials                        cohort (follow-up, prospective)
    field trials                           case-control (retrospective)
    community intervention                 cross-sectional (survey)
  imposed intervention (randomisation)   no intervention
  inferred causation                     no inferred causation
q-quantile, c_q (0 < q < 1): c_q = F_X⁻¹(q)
continuous random variables: Pr(X = x) = 0
probability density function, pdf: f(x) = d/dx F(x); Pr(X ≈ x) ≈ f(x) δx
properties of a pdf: (1) f(x) ≥ 0; (2) ∫_{−∞}^{∞} f(x) dx = 1
probability from pdf: Pr(a < X ≤ b) = ∫_a^b f(x) dx, so that F(x) = ∫_{−∞}^x f(t) dt
sketch of a pdf: a smooth curve with total area 1 underneath it
discrete random variables
probability mass function, pmf: p(x) = Pr(X = x)
properties of a pmf: (1) p(x) ≥ 0; (2) Σ p(x) = 1
sketch of a pmf: spikes of height p(x) at the possible values x
relation of pmf to cdf: p(x) = F(x+0) − F(x−0) = jump in F at x
Expectation, E
expectation of ψ(X): E(ψ(X)) = ∫ ψ(x) f(x) dx or Σ ψ(x) p(x)
mean of X, µ, E(X): ∫ x f(x) dx or Σ x p(x)
E(a + bX), E(X + Y): a + b E(X), E(X) + E(Y)
median of X, m: the 0.5-quantile, c_0.5 = F⁻¹(0.5)
mode of X, M: f(M) ≥ f(x) for all x, or p(M) ≥ p(x) for all x
variance of X, var(X), σ²: E((X − µ)²) = E(X²) − E(X)²
standard deviation, sd(X), σ: sd(X) = √var(X)
var(a + bX), sd(a + bX): b² var(X), |b| sd(X)
var(X + Y) (X and Y independent): var(X) + var(Y)
covariance of X and Y, cov(X, Y): σ_XY = E((X − µ_X)(Y − µ_Y)) (zero if X and Y are independent)
correlation of X and Y, ρ(X, Y): ρ_XY = σ_XY/(σ_X σ_Y) (zero if X and Y are independent)
var(aX + bY): a² var(X) + b² var(Y) + 2ab cov(X, Y)
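The rules for means and variances can be checked numerically in R by simulation (a sketch; the constants are arbitrary):

    set.seed(1)
    X <- rnorm(1e5, mean = 2, sd = 3); Y <- rnorm(1e5, mean = -1, sd = 2)
    c(mean(4 + 5 * X), 4 + 5 * mean(X))   # E(a + bX) = a + b E(X)
    c(var(4 + 5 * X), 25 * var(X))        # var(a + bX) = b^2 var(X)
    c(var(X + Y), var(X) + var(Y))        # additivity for independent X, Y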
Straight line regression: Y_i =ᵈ N(α + βx_i, σ²), (i = 1, 2, . . . , n)
least squares estimates: β̂ = r s_y/s_x = Σ(x−x̄)(y−ȳ) / Σ(x−x̄)²; α̂ = ȳ − β̂x̄
estimate of σ²: s² = 1/(n−2) Σ(y_i − α̂ − β̂x_i)² = (n−1)/(n−2) (1 − r²) s_y² = 1/(n−2) [ Σ(y−ȳ)² − (Σ(x−x̄)(y−ȳ))² / Σ(x−x̄)² ]
estimators: ȳ =ᵈ N(α + βx̄, σ²/n) and β̂ =ᵈ N(β, σ²/K), where K = Σ(x−x̄)²; ȳ and β̂ are independent
µ̂(x) = ȳ + (x − x̄)β̂ =ᵈ N(µ(x), c(x)σ²), where c(x) = 1/n + (x−x̄)²/K; Y(x) =ᵈ N(µ(x), σ²)
inference on β, µ̂(x), Y(x): (β̂ − β)/(S/√K) =ᵈ t_{n−2}; (µ̂(x) − µ(x))/(S√c(x)) =ᵈ t_{n−2}; (Y(x) − µ̂(x))/(S√(1 + c(x))) =ᵈ t_{n−2}
CI for β: β̂ ± c_0.975(t_{n−2}) s/√K; CI for µ(x): µ̂(x) ± c_0.975(t_{n−2}) s√c(x); PI for Y(x): µ̂(x) ± c_0.975(t_{n−2}) s√(1 + c(x))
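In R, lm fits the straight-line model and predict produces the confidence and prediction intervals above (a sketch with toy data):

    x <- c(6, 8, 10, 12, 14); y <- c(1.8, 2.2, 2.9, 3.3, 3.9)
    fit <- lm(y ~ x)
    summary(fit)    # alpha-hat, beta-hat, se's, t-test of beta = 0
    confint(fit)    # CIs for alpha and beta
    predict(fit, data.frame(x = 12), interval = "confidence")   # CI for mu(12)
    predict(fit, data.frame(x = 12), interval = "prediction")   # PI for Y(12)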
Probability Distributions
1. Binomial distribution: X =ᵈ Bi(n, p) [n positive integer, 0 ≤ p ≤ 1]
pmf, p(x): C(n, x) p^x q^(n−x), x = 0, 1, 2, . . . , n; p + q = 1 [Table 1]
physical interpretation: X = number of successes in n independent trials, each having probability p of success (Bernoulli trials)
E(X), var(X): np, npq
properties: (1) if Z_1, . . . , Z_n are iid rvs =ᵈ Bi(1, p) then X = Z_1 + Z_2 + · · · + Z_n =ᵈ Bi(n, p)
(2) X_1 =ᵈ Bi(n_1, p) and X_2 =ᵈ Bi(n_2, p) independent ⇒ X_1 + X_2 =ᵈ Bi(n_1 + n_2, p)
(3) if n → ∞ and p → 0 so that np → λ, then Bi(n, p) → Pn(λ)
(4) if n → ∞, then Bi(n, p) ∼ N(np, npq) [np > 5, nq > 5], in which case: if X* =ᵈ N(np, npq), then Pr(X = k) ≈ Pr(k − 0.5 < X* < k + 0.5) [CC = continuity correction]

2. Poisson distribution: X =ᵈ Pn(λ) [λ > 0]
pmf, p(x): e^(−λ) λ^x / x!, (x = 0, 1, 2, . . .) [Table 3]
Poisson process: “events” occurring so that the probability that an “event” occurs in (t, t + δt) is αδt + o(δt), where α = rate of the process
physical interpretation: X = number of “events” in unit time of a Poisson process with rate λ
E(X), var(X): λ, λ
properties: (1) X_1 =ᵈ Pn(λ_1) and X_2 =ᵈ Pn(λ_2) independent ⇒ X_1 + X_2 =ᵈ Pn(λ_1 + λ_2)
(2) approximation to Bi(n, p) when n large and p small: λ = np
(3) if λ → ∞ then Pn(λ) ∼ N(λ, λ) [λ > 10], in which case: if X* =ᵈ N(λ, λ), then Pr(X = k) ≈ Pr(k − 0.5 < X* < k + 0.5) [CC]

3. Normal distribution: X =ᵈ N(µ, σ²) [σ > 0]
standard normal distribution: N(0, 1)
pdf, φ(x); cdf, Φ(x): φ(x) = (1/√(2π)) e^(−x²/2); Φ(x) = ∫_{−∞}^{x} (1/√(2π)) e^(−t²/2) dt [cdf: Table 5]
E(X), var(X): 0, 1 [inverse cdf: Table 6]
general normal distribution, pdf, f(x): f(x) = (1/(σ√(2π))) e^(−(x−µ)²/(2σ²))
physical interpretation: just about any variable obtained from a large number of components (by the central limit theorem)
E(X), var(X): µ, σ²
properties: (1) if X =ᵈ N(µ, σ²) then a + bX =ᵈ N(a + bµ, b²σ²)
(2) Z = (X − µ)/σ =ᵈ N(0, 1) ⇔ X = µ + σZ =ᵈ N(µ, σ²); c_q(X) = µ + σ c_q(Z)
(3) X_1 =ᵈ N(µ_1, σ_1²) and X_2 =ᵈ N(µ_2, σ_2²) independent ⇒ X_1 + X_2 =ᵈ N(µ_1 + µ_2, σ_1² + σ_2²)

4. t distribution: X =ᵈ t_n [n = 1, 2, 3, . . .]
definition: if Z =ᵈ N(0, 1) and U =ᵈ χ²_n are independent, then X = Z/√(U/n) =ᵈ t_n
pdf, f(x): f(x) = (1/√(nπ)) Γ((n+1)/2)/Γ(n/2) · (1 + x²/n)^(−(n+1)/2), (−∞ < x < ∞) [inverse cdf: Table 7]
E(X), var(X): 0, n/(n − 2)
comparison with standard normal: t_n has wider tails (var > 1); t_n → N(0, 1) as n → ∞, since (1 + x²/n)^(−(n+1)/2) → e^(−x²/2)

5. χ² distribution: X =ᵈ χ²_n [n = 1, 2, 3, . . .]
definition: if Z_1, Z_2, . . . , Z_n are iid rvs =ᵈ N(0, 1) then X = Z_1² + Z_2² + · · · + Z_n² =ᵈ χ²_n
pdf, f_X(x): f_X(x) = (1/(2^(n/2) Γ(n/2))) e^(−x/2) x^(n/2 − 1), (x > 0) [inverse cdf: Table 8]
E(X), var(X): n, 2n
properties: (1) X_1 =ᵈ χ²_m and X_2 =ᵈ χ²_n independent ⇒ X_1 + X_2 =ᵈ χ²_(m+n)
(2) for a sample on N(µ, σ²): (n−1)S²/σ² =ᵈ χ²_(n−1) ⇒ E(S²) = σ², var(S²) = 2σ⁴/(n−1)
(3) goodness of fit test: Σ (o − e)²/e =ᵈ χ²_(k−p−1)
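Each of the distributions above has d/p/q functions in R (a sketch):

    dbinom(3, size = 10, prob = 0.4)   # Bi(10, 0.4) pmf at x = 3
    dpois(2, lambda = 1.5)             # Pn(1.5) pmf at x = 2
    pnorm(1.2, mean = 1, sd = 2)       # N(1, 4) cdf at x = 1.2
    qt(0.975, df = 12); qchisq(0.95, df = 5)
    # normal approximation to Bi(40, 0.3) with continuity correction [CC]:
    pnorm(15.5, 12, sqrt(8.4)) - pnorm(14.5, 12, sqrt(8.4))   # ~ dbinom(15, 40, 0.3)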
Index
abstraction (◦◦), 2, 74, 91, 92, 127
accept H0, 148
addition theorem, 72, 73
additivity
  of means, 98
  of variance, for independent random variables, 99
age-specific table, 8
Agresti's approx CI
  for α, 134
  for p, 132
alternative hypothesis, 147–149
animal experiments, 26
approx CI
  basic, 125
  for α, 133
  for λ, 134
at least one = not none, 86

balance, 18, 19, 176
bar graph, 53
barchart, 52
Bayes' theorem, 80
  formula, 82
  odds view, 85
better approx CI
  for α, 134
  for p, 132
binning, 54
binomial distribution, 102
  approximated by normal, 113
  Bi(n, p), 103
  parameter, testing, 159
  pmf table, 279
  pmf, graph, 103
  pmf, in Tables, 103
  SP diagram, 283
biostatistics, 5
bivariate data, 46, 62, 201
  categorical data, 190
  C×C, 62
  C×N, 62
  N×N, 63
  numerical data, 201
bivariate normal distribution, 207
blind study, 28, 184
blinding, 18, 19
block, 17, 19
boxplot, 58
  comparing distributions, 59
Busselton Health Study, 21

case-control study, 22, 23, 26, 82, 83, 196
  comparison with cohort study, 24
categorical data, 44
causality, 29
  cause and association, 31, 32
  definition, 29
  Hill's criteria, 30
  reverse, 32
census, 24
central limit theorem, 112, 123
certain, 71, 72
chance, 71
chartjunk, 40
checking, 45
checking normality, 138
chi-squared distribution, χ²
  in R, 188
  inverse cdf table, 290
chi-squared test statistic, 192
cholera data, 20
clinical trial, 9, 11, 13, 21, 26
coding, 45
coefficient of determination, 216
cohort, 12, 19, 134
  closed, 21
  open, 21
cohort study, 19–21, 23, 26, 82
  comparison with case-control, 24
combining estimates, 141
common cause, 32
community intervention trial, 14
comparative inference, 171
comparing risks, 76
comparison of proportions, 183
  summary, 186
comparison of rates, 186
  summary, 187
complement, 72
complementary event, A′, 72
computers and data analysis, 42
conditional odds, 79
conditional probability, 76
confidence interval, 125–127, 129, 147
  0%, 128
  and hypothesis test, 147, 150
    inexact correspondence, 159
subjective probability, 74
success
  and failure, 102
  in independent trials, 87
survey, 24, 39, 132, 153
symmetry, 73

t distribution, 135
  in R, 136
  inverse cdf in Tables, 136
  inverse cdf table, 289
t-test, for µ = µ0, 158
tables, 42
tabular presentation, 42
tea-tasting experiment, 16
test for independence of variables, 207
test reporting, 154
test statistic, 148, 149
testing
  binomial parameter, 159
  population proportion, 159
  population rate, 163
  with discrete variables, 165
third quartile, 49
time-line diagram, 27
transformations, 46
treatment, 15–19, 27, 39, 171–173, 176, 177, 186, 191
trimmed mean, 49
two-sided, 147
type I error, 149
type II error, 149
types of error, 149
types of study, 11
types of variable, 44

unexposed, 1, 19
union, 72, 73
univariate data, 46
universe reduction, 76
unrelated events, 78
upper quartile, 49
utility test for straight-line regression, 215

validity, 15
variable types, 44
variance, 98
  additivity, 99
  properties, 99
Venn diagram, 72

Whickham study, 8

z-test, 152
  approximate, 159
  for µ = µ0, 154