

University of Melbourne

School of Mathematics and Statistics

MAST 10011 (2020)

Experimental Design & Data Analysis



Contents

0 Introduction 1
1 Epidemiological studies & Experimental design 5
1.1 Epidemiology — an introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.1 What is Statistics? Biostatistics? Epidemiology? . . . . . . . . . . . . . 5
1.1.2 Confounding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.1.3 Types of (epidemiological) study . . . . . . . . . . . . . . . . . . . . . . 11
1.2 Experimental studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.2.1 Clinical trial (medical experimental study) . . . . . . . . . . . . . . . . 13
1.2.2 Field trial . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.2.3 Community intervention trial . . . . . . . . . . . . . . . . . . . . . . . . 14
1.2.4 Experiments and experimental principles . . . . . . . . . . . . . . . . . 14
1.3 Observational studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.3.1 Cohort study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.3.2 Case-control studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.3.3 Comparison of cohort and case-control studies . . . . . . . . . . . . . . 23
1.3.4 Cross-sectional studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.4 Review of study types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
1.4.1 A dialogue with a skeptical statistician (Gary Grunwald) . . . . . . . . 27
1.5 Causality in epidemiology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Problem Set 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2 Exploratory data analysis 39
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.2 Tables and diagrams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.2.1 Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.2.2 Diagrams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.3 Types of variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.3.1 Some general comments on data handling . . . . . . . . . . . . . . . . 45
2.4 Descriptive statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.4.1 Univariate data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.4.2 Numerical statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.4.3 Measures of location . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.4.4 Measures of spread . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.4.5 Graphical representations . . . . . . . . . . . . . . . . . . . . . . . . . . 52
2.4.6 Bivariate data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
Problem Set 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3 Probability and applications 71
3.1 Probability: the basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.1.1 Probability tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.1.2 Odds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.2 Conditional probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.2.1 Multiplication rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.2.2 Conditional odds and odds ratio . . . . . . . . . . . . . . . . . . . . . . 79
3.3 Law of Total Probability & Bayes’ Theorem . . . . . . . . . . . . . . . . . . . . 80
3.4 Diagnostic testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
3.5 Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

Problem Set 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4 Probability distributions 91
4.1 Random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.1.1 Discrete random variables . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.1.2 Continuous random variables . . . . . . . . . . . . . . . . . . . . . . . . 93
4.1.3 Comparison of discrete and continuous random variables . . . . . . . 94
4.1.4 Quantiles (inverse cdf) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.1.5 The mean . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.1.6 The variance and the standard deviation . . . . . . . . . . . . . . . . . 98
4.1.7 Describing the probability distribution . . . . . . . . . . . . . . . . . . 100
4.2 Independent trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.2.2 Binomial distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.3 Poisson process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
4.3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
4.3.2 Poisson distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
4.3.3 Incidence rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
4.4 The normal distribution and applications . . . . . . . . . . . . . . . . . . . . . 110
4.4.1 The normal distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
4.4.2 The Central Limit Theorem . . . . . . . . . . . . . . . . . . . . . . . . . 112
4.4.3 Linear combinations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
Problem Set 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
5 Estimation 121
5.1 Sampling and sampling distributions . . . . . . . . . . . . . . . . . . . . . . . 121
5.1.1 Random sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
5.1.2 The distribution of X̄ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
5.2 Inference on the population mean, µ . . . . . . . . . . . . . . . . . . . . . . . . 123
5.3 Point and interval estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
5.4 Normal: estimation of µ when σ is known . . . . . . . . . . . . . . . . . . . . . 127
5.5 Estimators that are approximately normal . . . . . . . . . . . . . . . . . . . . . 129
5.5.1 Estimation of a population proportion . . . . . . . . . . . . . . . . . . . 129
5.5.2 Estimation of a population rate . . . . . . . . . . . . . . . . . . . . . . . 133
5.6 Normal: estimation of µ when σ is unknown . . . . . . . . . . . . . . . . . . . 135
5.7 Prediction intervals (for a future observation) . . . . . . . . . . . . . . . . . . . 137
5.8 Checking normality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
5.9 Combining estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
Problem Set 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
6 Hypothesis Testing 147
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
6.2 Types of error and power . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
6.3 Testing procedures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
6.3.1 Confidence intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
6.3.2 The p-value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
6.3.3 Critical values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
6.4 Hypothesis testing for normal populations . . . . . . . . . . . . . . . . . . . . 154
6.4.1 z-test (testing µ=µ0 , when σ is known/assumed) . . . . . . . . . . . . 154
6.4.2 t-test (testing µ=µ0 when σ is unknown) . . . . . . . . . . . . . . . . . 158
6.4.3 Approximate z-tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
6.5 Case study: Bone density . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
Problem Set 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
7 Comparative Inference 171
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
7.2 Paired samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
7.3 Independent samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
7.3.1 Variances known . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
7.3.2 Variances unknown but equal . . . . . . . . . . . . . . . . . . . . . . . . 177
7.3.3 Variances unknown and unequal . . . . . . . . . . . . . . . . . . . . . . 180
7.4 Case study: Lead exposure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
7.5 Comparing two proportions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
7.6 Comparing two rates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
7.7 Goodness of fit tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
7.7.1 Completely specified hypothesis . . . . . . . . . . . . . . . . . . . . . . 187


7.8 Contingency tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
7.8.1 2×2 contingency table . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
7.8.2 r × c contingency tables . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
Problem Set 7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
8 Regression and Correlation 201
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
8.2 Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
8.2.1 Inference based on correlation . . . . . . . . . . . . . . . . . . . . . . . 207
8.3 Straight-line regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
8.4 Estimation of α and β: least squares . . . . . . . . . . . . . . . . . . . . . . . . 210
8.5 Inference for the straight line regression model . . . . . . . . . . . . . . . . . . 212
8.6 Case study: Blood fat content . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
Problem Set 8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
R Revision Problem Sets 227
Problem Set R1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
Problem Set R2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
Problem Set R3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
Problem Set R4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
Problem Set R5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
A Answers to the Problems 243
T Statistical Tables 279
S Summary Notes 292

INTRODUCTION

This text is intended for a one-semester introductory subject taught to Biomedical students
(i.e. students who intend to major in medicine, dentistry, optometry, physiotherapy, phar-
macy or other medically related fields). It is a primer in epidemiology and biostatistics.
In this text, we look at methods to achieve the goal of studying health outcomes in the
general population: to determine what characteristics or exposures are associated with
which disease outcomes.
What is the “general population”? We would like to be able to apply our findings to all
humans (and possibly even to those yet unborn?). We cannot observe every individual on
the planet (and certainly not those who haven’t yet been born). We must choose a sample
of individuals from the population, and then take observations on this sample to obtain
data.
Our conclusions can be applied to the general population only insofar as the sample is
representative of the population. Or, to put it another way, our conclusions can only be
applied to the population that the sample represents [from which it has been drawn]. For
example, if our sample is of 50-59yo women from a Melbourne clinic, then our conclusions
might apply to 50-59yo women from Melbourne. But can the conclusions be extended to
other age groups? to Australian women? to all women? . . . to men?
[Diagram: population (model) −→ sample (observations). Study Design produces the
Data; Data Analysis describes it. Probability reasons from population to sample;
Statistics reasons from sample back to population.]

What is an “exposure”? Exposure is a broad term, covering a range of possibilities, for example:
* drug treatment (e.g. exposure to drug Z);
* type of care (e.g. exposure to care protocol Y );
* immunisation (e.g. exposure to swine-flu vaccine);
* the presence of a particular gene (e.g. exposure to gene Q);
* type of medical treatment (e.g. exposure to surgery X);
* radiation (e.g. exposure to radiation treatment; or Chernobyl, or Maralinga).
What are “disease outcomes”? The disease outcome depends on the investigation. It may
be negative (recurrence, death, increased cancer-cell count, . . . ) or positive (cure, symptom

alleviation, reduced cholesterol level, . . . ).


Chapter 1 (Study Design) is concerned with how we obtain the data so that our conclusions
can be applied to the general population and how we can investigate causation: does the
exposure actually cause the disease outcome?
In Chapter 2 (Data Analysis) we look at ways and means of describing and representing
the data obtained. A lot of this material will be familiar: you will have heard of means
and medians and histograms, and maybe quartiles, standard deviations and boxplots. This
chapter presents a range of useful and important aspects of description and presentation
of data that will be useful in your future work . . . in any field where data are involved.
Chapters 3 and 4 are concerned with Probability. Here we are concerned with specifying
models for the data. Roughly speaking: “if the population is like this, then the sample will
be like that”. Probability is a concept that is fundamental to all that is done in biostatistics
and epidemiology. Chapter 3 introduces and describes the ins and outs of probability and
some of its applications. Chapter 4 is concerned with the modelling of numerical data:
random variables and their distributions.
A lot of what is done in this text (and, for that matter, in all of biostatistics and epidemi-
ology) is based on models and assumptions. This is a characteristic of the biostatistical
beast: a lot of what is done is concerned with abstractions of various kinds (assume that
this . . . , or suppose that the other . . . ). Any model is abstract: it exists only in the mind.

To remind you of this, the symbol ( ◦◦ ) is occasionally used to indicate these hypotheticals
and abstractions.
(RealPopulation) −→ (RealData)

ModelPopulation ( ◦◦ ) −→ ModelData

Faced with the real world, we generate a simplified model to describe it. (This is actually
something we as human beings do all the time: statistical theory just formalises it.) The
idea is that the model provides a reasonable description for the part of the real world we
are interested in.1 Based on such a model, we are able to work out what sort of data are
likely to be observed . . . and what sort are not!
Statistics, or Statistical Inference, is concerned with trying to work out what the population
is like based on the data. We would like to be able to say “if the data are like this, then
the population will be like that”. This is the important stuff! Armed with the ideas and
concepts of study design, data analysis and probability, we are able to make progress in
statistical inference: figuring out what the population is like on the basis of the observed
data. This is the subject of the remainder of the text.
Chapters 5 and 6 are concerned with “one-sample” statistics, introducing the fundamental
concepts of statistics. Given some RealData, i.e. observed data from a sample, we treat
this as ModelData, i.e. data obtained from the ModelPopulation, and learn what we can
say about the population it has been obtained from. Based on the (Probability) connection
between ModelPopulation and ModelData, we can infer something about the (Statistics)
1 These models can be expressed mathematically. This is where the underlying theory of Mathematical Statistics

comes in. This can involve some heavy-duty mathematics . . . which we ignore. However, a few of the basic ideas
are introduced, because they help in understanding the methodology.
connection between ModelData and ModelPopulation.

(RealPopulation) (RealData)

ModelPopulation ( ◦◦ ) ←− ModelData

Insofar as the ModelPopulation represents the RealPopulation, we can then draw conclu-
sions about the general population.
In Chapter 7, statistical inference is extended to the important “two-sample” case, in which
we imagine ( ◦◦ ) the samples are drawn from two populations (often ‘treated’ and ‘untreated’).
problem now is to compare these two populations. Are they the same? Are they different?
(If they are different, it is indicating that the treatment is having an effect.) How different?
(How much effect?)
In Chapter 8, we generalise in another direction: one-sample but two variables. In this
case our interest is in whether there is any relation between the variables . . . and how this
relation might be described and used.
It is not hard to see that we could extend the model of Chapter 7 to the “k-sample” case; and
the methods of Chapter 8 to more than two variables. These extensions are not considered
in this text. You can learn about them in your next Statistics text.
Throughout this text, the statistical package R is used for calculations and visualising data.
R is a free software environment for statistical computing and graphics. It has become the
software of choice for most statisticians. You can download R for free from the R-project
website: https://www.r-project.org/. In addition, RStudio provides a user-friendly
interface to R. RStudio can be downloaded for free from https://www.rstudio.com/.

Chapter 1

EPIDEMIOLOGICAL STUDIES &


EXPERIMENTAL DESIGN

“It is of the highest importance in the art of deduction to be able to recognise, out of a number of facts, which are
incidental and which are vital. Otherwise your energy and attention must be dissipated instead of concentrated.”
Sherlock Holmes, The Adventure of the Reigate Squires, 1894.

1.1 Epidemiology — an introduction

Statistical methods are central in medical research; in fact, in an editorial in the millennium
year 2000, a leading medical journal, the New England Journal of Medicine, presented
“Application of statistics to medicine” as one of the eleven most important developments
in medicine in the last thousand years.1

1.1.1 What is Statistics? Biostatistics? Epidemiology?


• Statistics is the study of variability and uncertainty.
• Biostatistics is the application of statistics to a range of topics in biology.
• Epidemiology is the study of the occurrence of disease.
Statistics provides the tools scientists use to analyse their data, and principles on how best
to design their experiments to collect data. In evidence-based medicine, treatments and
procedures advocated must be supported by hard evidence: data from well-designed
experiments, which ensure valid and efficient outcomes, analysed by appropriate
statistical methods.
The basic principles of epidemiology are straightforward, and common-sense goes a long
way. But a bit more than common sense is required. For example, common sense suggests
that the residents of Australia (relatively high living standard) should have lower death
rates than residents of South Africa (relatively low living standard).
1 Editorial: Looking back on the millennium in Medicine. New England Journal of Medicine, 2000; 342: 42-49.

But each year, a greater proportion of Australian residents die. Why is it so?

Figure 1.1: Age distribution of the populations of Australia and South Africa
It is seen from Figure 1.1 that Australians tend to be older. It is also true that for individuals
of the same age in the two countries, the death rate among Australians is less than the death
rate among South Africans; and this is true for any age. But, in any country, older people
die at a greater rate than younger people. As Australia has a population that is older than
that in South Africa, a greater proportion of Australians die in any one year, despite the
lower death rates within the age categories.
This situation illustrates what is called confounding.
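
To see the mechanism in numbers, here is a small R sketch. The populations and death counts are invented purely for illustration; only the pattern (a lower rate in every age stratum, but a higher crude rate overall) mirrors the Australia/South Africa comparison.

> popA = c(young = 2e6, old = 8e6); deathsA = c(young = 2000, old = 80000)
> popB = c(young = 8e6, old = 2e6); deathsB = c(young = 16000, old = 24000)
> deathsA / popA # 0.001 and 0.010: A is lower in BOTH age strata
> deathsB / popB # 0.002 and 0.012
> sum(deathsA) / sum(popA) # 0.0082: yet A's crude rate is the higher one
> sum(deathsB) / sum(popB) # 0.0040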

1.1.2 Confounding

A confounding variable is a variable that affects the relationship between the variables in
question. We examine confounding and confounding variables in more detail later in the
chapter. Here, age is the confounding variable: it affects the relationship between country
and death-rate.
Confounding is a common problem in making comparisons between groups. One that we
need to be aware of, and to overcome.
The extreme case of confounding is where all individuals (in a group of individuals under
study) with attribute A also have attribute B; and those who do not have A do not have B.

EXAMPLE 1.1.1: Suppose we have a group of twenty individuals for a choles-


terol level study and, as it happens, all the ten women are vegetarians and the
ten men are not. If we measure cholesterol levels for this group, we cannot
know whether any difference was due to gender or diet. In this case the con-
founding is clear. We can’t distinguish the effects of A and B at all.
The same sort of thing applies, to a lesser extent, if there is a tendency for those with at-
tribute A to have attribute B: if more of the females are vegetarians compared to the males.
Instead of comparing (A) vs (A′ ), we are actually comparing (A&B) vs (A′ &B ′ ). If there is
a difference, we don’t know whether it’s due to A, or to B, or to both. The effects of A and
B are confounded. [Note: A′ denotes not-A, the complement of A.]
In the example we started with, the comparison being made is not really (Australia) vs
(South Africa), but rather it is (Australia & older) vs (South Africa & younger).
This sort of confounding tended to happen in the bad old days (of early medical research)
when a doctor tended to give a new treatment (that they believed was better) to sicker
patients . . . in order to help them. The result then is a comparison between T &S and T ′ &S ′ ,
where T denotes treatment and S denotes sicker. Even if T is helpful (i.e. it increases the
chance of improvement) it is likely that the first group will do worse. The treatment effect
is masked by the helpful doctor.

EXAMPLE 1.1.2: Lanarkshire milk experiment.


In the spring of 1930, a large-scale nutritional experiment was carried out in the
schools of Lanarkshire, in Southern Scotland. For four months, 10 000 school-
children received 3/4 pint (about 400 mL) of milk per day: 5000 got raw milk and
5000 pasteurised milk. Another 10 000 children were selected as controls and all
the 20 000 children were weighed and their height measured at the beginning
and the end of the experiment.

Student (William Gosset)2 published a paper reviewing this experiment (Biometrika,


1931, p. 398) in which he writes “. . . to carry out an experiment of this magni-
tude successfully requires organisation of no mean order and the whole busi-
ness of distribution of milk and measurement of growth reflects great credit on
those concerned. It may therefore seem ungracious to be wise after the event
and to suggest that had the arrangement of the experiment been slightly differ-
ent the results would have carried greater weight . . . ”. William Gosset, accord-
ing to all who knew him, was a very pleasant and amiable man. This was his
pleasant way of saying “you got it wrong”.

He pointed out several problems with the conduct of the experiment. The major
problem though was the non-random allocation of treatment (milk vs no-milk).
The initial selection of children was random — on the principle that both con-
trols and treated individuals should be representative of children between 5
and 12 years of age. So far so good. But teachers were allowed to make sub-
stitutions “if it looked as though there were too many well- or ill-nourished
children in either group”. The teachers did what anyone would tend to do,
and re-assigned some of the ill-nourished children to the milk group and some
of the well-nourished children to the no-milk group. The result was that the
no-milk group was clearly superior in both height and weight to the treatment
group.

2 William Sealy Gosset (1876-1937) is best known by his pen name ‘Student’. As a result of a case of industrial

espionage, Guinness prohibited any of their employees from publishing. Gosset nevertheless published research
papers under the pseudonym Student. Among other things, he discovered the t-test, which is still often referred
to as the Student t.
QUESTION: Despite the non-random allocation, Gosset suggested ways in which the effects
of raw milk or pasteurised milk could be estimated. Can you think of how this might be
done?

EXAMPLE 1.1.3: Consider the following mortality data, summarised from a


study that interviewed a group of female residents of Whickham, England,
in the period 1972–1974 and then tracked their survival over the next twenty
years.3 The women were interviewed at the start of the study and, among other
things, their age and smoking status were recorded. Among 1314 women in
the study, nearly half were smokers. During the next twenty years, proportion-
ately fewer of the smokers died compared to the non-smokers. The data are
reproduced in Table 1.1.

Table 1.1: Risk of death in a 20-year follow-up period in Whickham, England


according to smoking status at the beginning of the period.

proportion of women dying in follow-up period


smoker non-smoker total
139/582 (24%) 230/732 (31%) 369/1314 (28%)

Only 24% of the women who were smokers at the time of the initial survey died
during the 20-year follow-up period, whereas 31% of the non-smokers died in
the same period. Does this difference indicate that women who were smokers
fared better than women who were not smokers? Not necessarily.

In Table 1.2 we give a more detailed display of the same data: an age-specific
table, or a table stratified by age (at the start of the study).

Table 1.2: Risk of death in a 20-year follow-up period in Whickham, England


according to smoking status at the beginning of the period, by age.

proportion of women dying in follow-up period


age smoker non-smoker total
18–24 2/55 (4%) 1/62 (2%) 3/117 (3%)
25–34 3/124 (2%) 5/157 (3%) 8/281 (3%)
35–44 14/109 (13%) 7/121 (6%) 21/230 (9%)
45–54 27/130 (21%) 12/78 (15%) 39/208 (19%)
55–64 51/115 (44%) 40/121 (33%) 91/236 (39%)
65–74 29/36 (81%) 101/129 (78%) 130/165 (79%)
75+ 13/13 (100%) 64/64 (100%) 77/77 (100%)

139/582 (24%) 230/732 (31%) 369/1314 (28%)

The age-specific display indicates that in the youngest and oldest age-groups
there was little difference in terms of risk of death. Few died in the younger
age categories, and most died in the older categories, regardless of whether
they were smokers or not. For women in the middle age categories there was a
consistently greater risk of death among smokers than nonsmokers.

So why did the nonsmokers have a higher risk overall? Because a greater pro-
portion of non-smokers were in the higher age-groups — presumably reflecting
social norms. In this example, smoking is confounded with age. Here it’s not
really (smoking) vs (non-smoking), but rather it’s (smoking & younger) vs (non-smoking & older).

3 Vanderpump et al., Clin. Endocrinol., 1995, 43, p. 55.
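
The crude and age-specific risks in Tables 1.1 and 1.2 are easily checked in R, for example:

> round(100 * c(smoker = 139/582, nonsmoker = 230/732)) # crude risks: 24% vs 31%
> round(100 * c(smoker = 51/115, nonsmoker = 40/121)) # 55-64yo stratum: 44% vs 33%

Overall the smokers appear to fare better; within the age stratum the direction reverses.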


In other cases there may be more than one confounder. And there are cases where there are
confounders that we are unaware of.
How can we overcome the problem of confounding?
For the variables (attributes) that are observed, which are perceived to influence the out-
come (the most common are age and gender, but there may be others), we can ‘stratify’ or
‘block’ the study group, i.e. consider the strata separately (e.g. 50-59yo males). Then, within
such a stratum, there is no confounding (or at least, not much), as all the individuals are
the same (age and gender).
This overcomes the problem of confounding with age or gender or any other variable that
we choose to stratify by (e.g. 50-59yo male smokers; or overweight 40-44yo females; or . . . ).
If we wish to apply the results to an entire population then we need to combine the results
for the strata. For most of our examples, we deal with just one stratum. Results for a
specific stratum would apply only to that stratum of the population (e.g. 50-59yo males).
In many epidemiological studies, stratification is important. It is standard to stratify by
age and gender because, in most instances, these factors affect the disease or the exposure
and/or the relation between them. Other factors may be used in particular cases: factors
such as employment category or ethnic background for example.
Another approach is to adjust the results for the potentially confounding variables, using a
statistical model:

response variable = treatment effect + effect of other variables.

We do a little bit of this when we look at regression models in Chapter 8, but in this subject
we do not go into this approach in any depth.
What about other variables?
There are variables that we can’t observe (or perhaps not until later) or choose not to
observe (too expensive, too time-consuming) or variables that we don’t know about or
haven’t even thought about. Some of these may be confounders. (Such variables are some-
times referred to as lurking variables. Note: “to lurk” = to be hidden, to exist unobserved.)

RANDOMISATION is the answer.


In cases where the treatment is imposed (as in a clinical trial), then we should do this at
random. Each individual is assigned to receive the treatment, or not (i.e. control) with
equal probability. How? Using R or some other randomisation device . . . without human
intervention. Humans are really bad at being random: when asked to select things at
random, it seems we just can’t do it.

EXAMPLE 1.1.4: Random digits.


When a group of 213 first-year Biomedical students were asked to select a ran-
dom digit, the response was as indicated in the figure below.

Clearly this is non-random: each digit should occur with relative frequency of
about 10%. It is ‘normal’ to get a preponderance of 7s, but the excess of 8s and
9s is somewhat unusual.
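
For comparison, a short R sketch shows what a genuinely random selection of 213 digits looks like (the seed is arbitrary; your frequencies will differ slightly):

> set.seed(1) # arbitrary seed, for reproducibility
> digits = sample(0:9, size = 213, replace = TRUE) # 213 random digits
> round(100 * table(digits) / 213) # each digit near 10%, up to chance variation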
EXAMPLE 1.1.5: Suppose we have twenty individuals and we want to randomly


assign ten to get the treatment (and ten not).

One way to do this is to randomly order the sequence TT. . . TNN. . . N (i.e. 10 Ts
and 10 Ns), avoiding any human choice.

We could put ten white balls and ten black balls in a bag, identical apart from
colour, assign black = treatment and white = no treatment, say; and then select
the balls one at a time from the bag.
It is more efficient to use a computer. For example:
in R, randomly select a sample of size 10 without replacement from s = (1, . . . , 20) using
the function sample():
> s = 1:20 # this is a vector from 1 to 20
> sample(s, size=10) # sample 10 elements without replacement from s
[1] 4 16 3 15 8 7 12 20 14 19
then assign individuals corresponding to such indices to the treatment group.
EXERCISE. Try it.
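
A slightly fuller sketch, still using only base R, records each individual’s group and confirms the balance. The seed shown is arbitrary; fixing it makes the allocation reproducible.

> set.seed(2020) # arbitrary seed
> s = 1:20 # the twenty individuals
> treated = sample(s, size = 10) # ten drawn without replacement
> group = ifelse(s %in% treated, "T", "N") # T = treatment, N = no treatment
> table(group) # always 10 of each, by construction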
When we use randomisation in a study, we expect that for any variable (observed or unob-
served) the values in the treatment group and the control group will be equivalent, in the
sense that they are likely to be about the same. This would apply to an observed (and possi-
bly important) variable such as blood pressure, or to an unobserved (and likely irrelevant)
variable like shoe-size. An individual with high blood pressure (or large shoe size) is just
as likely to be in the treatment group as in the control group.
For any variable (observed or unobserved, known or unknown, and whether it is related
to the outcome or not), randomisation neutralises its effect, by ensuring that the values of
the variable in the treatment group and the control group are expected to be the same.
We will hear much more about randomisation when we get to clinical trials, and ran-
domised controlled experiments.
1.1.3 Types of (epidemiological) study

In any epidemiological study, we are concerned with measures of a specified disease out-
come. This measurement may take the form of a count [e.g. number of individuals with
fatal myocardial infarction], a rate [e.g. number of new cases of breast cancer per year] or a
variable specifying the disease outcome [e.g. blood pressure, cell count, lung function, . . . ].
This is called the response variable.
An epidemiological study may be viewed as a disease measurement exercise. A simple
study might aim to estimate a risk of a disease outcome (in a particular group over a spec-
ified period of time) [e.g. risk of heart failure in 60yo males in the next ten years]. A more
complicated study might aim to compare risks of a disease outcome in different groups,
with the goal of prediction, explanation or evaluation [e.g. comparing the risk of compli-
cation for two surgical treatment methods]; or to compare measures of a disease outcome
in different groups, with the goal of determining a more effective treatment [e.g. mean
cholesterol levels in groups given a drug or a placebo].
Variables in epidemiological studies.
The response variable is the measurement of the disease outcome we are interested in.
An explanatory variable is a variable that may be related to the response variable: i.e. a
variable that may affect the outcome. These are often individual characteristics (sometimes
called covariates: variables such as age, gender, blood-pressure, cholesterol level, smoking
status, education level, . . . ).
In most cases, the fundamental concern of a study is to relate some exposure E to a disease
outcome D. In that case, the response variable is an indicator of disease outcome and the
primary explanatory variable is an indicator of exposure.
In broad terms, there are two types of epidemiological studies:

DEFINITION 1.1.1.
1. An experimental study is a study in which the investigator assigns the exposure
(intervention, treatment) to some of the individuals in the study with the objec-
tive of comparing the results for the exposed and unexposed individuals.
2. An observational study is a study in which the investigator selects individuals,
some of whom have had the exposure being studied, and others not, and the
outcome is observed; or individuals are selected some of whom have had the
outcome and others not, and their exposure is observed.

1.2 Experimental studies

In epidemiology, an experiment is a study in which measures of disease frequency in two


cohorts are compared, after assigning the exposure to some of the individuals who comprise
the cohorts. In an experiment, the exposure is often referred to as an ‘intervention’ (an
intervention by the experimenter). Indeed, the exposure occurs only because of the experiment.
Epidemiological experiments are most frequently conducted in a clinical setting, with the
aim of evaluating which treatment for a particular disease is better. Such studies are known
as clinical trials.
Often all study subjects have been diagnosed with a particular disease, but that is not the
‘disease outcome’ being studied. It is some consequence of that disease (such as death, or
some further symptom, or spread of a cancer) that becomes the ‘disease outcome’ studied.
The aim is to evaluate the effect of the treatment on the disease outcome.
In most trials, treatments are assigned by randomisation so as to produce comparability be-
tween the cohorts with respect to any factors (seen or unseen) that might affect the disease
outcome.

EXAMPLE 1.2.1: (Physicians’ Health Study)4


The Physicians’ Health Study is an experiment designed to determine whether
low-dose aspirin (325 mg every other day) decreases cardiovascular mortality.

There were 22 071 participants:5 11 037 were assigned at random to receive as-
pirin and 11 034 to receive placebo. The results were as follows:

fatal myocardial infarction non-fatal myocardial infarction


aspirin 10/11 037 = 0.091% 129/11 037 = 1.17%
placebo 26/11 034 = 0.236% 213/11 034 = 1.93%
relative risk = 0.39 relative risk = 0.61

There appears to be evidence here that aspirin reduces the risk of myocardial
infarction. But could it just be due to chance?
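
The relative risks quoted in the table are simply the risk in the aspirin group divided by the risk in the placebo group; in R:

> (10/11037) / (26/11034) # fatal MI: 0.38 (the 0.39 in the table comes from the rounded percentages)
> (129/11037) / (213/11034) # non-fatal MI: 0.61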

EXAMPLE 1.2.2: (Zidovudine trial for HIV)


The data in the table below come from a clinical trial of adult patients recently
infected with human immunodeficiency virus, to determine whether early treat-
ment with Zidovudine was effective in improving the prognosis. Patients were
randomly assigned to receive either Zidovudine or placebo and then followed
for an average of 15 months.
Randomised trial comparing the risk of opportunistic infection
among patients with a recent Human Immunodeficiency Virus
infection who received either Zidovudine or placebo

D D′ n risk
Zidovudine 1 38 39 0.026
Placebo 7 31 38 0.184

where D denotes the patient suffered from an opportunistic infection.

The data indicate that the risk of getting an opportunistic infection during the
follow-up period was low among those who received early Zidovudine treat-
ment, and higher among those who received a placebo treatment. [But, you
should be asking, is this just due to chance, or is there a real Zidovudine effect
here?]
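
The risks can likewise be reproduced from the counts; a minimal sketch in R:

> tab = matrix(c(1, 38, 7, 31), nrow = 2, byrow = TRUE,
+ dimnames = list(c("Zidovudine", "Placebo"), c("D", "D'")))
> round(tab[, "D"] / rowSums(tab), 3) # 0.026 vs 0.184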
4 “Final report of the Aspirin Component of the Ongoing Physicians’ Health Study”. New England Journal of Medicine, 1989, 321, p. 129.


5 It should be noted that the organisation and administration of such a study is an immense task: enrolment questionnaires were sent to 261 248 male physicians in the US; 112 528 responded, and 59 285 were willing to participate. Of these, 33 223 were eligible. There followed a run-in period, after which 11 152 changed their minds or reported a reason for exclusion. This left 22 071 who were randomly assigned to the treatments.
1.2.1 Clinical trial (medical experimental study)

DEFINITION 1.2.1. A clinical trial is defined as “any research study that prospectively
assigns human participants or groups of humans to one or more health-related inter-
ventions to evaluate the effects on health outcomes” (WHO/ICMJE 2008 definition).

Thus a clinical trial is essentially another name for a medical experimental study. Note that
trial is used instead of experiment, perhaps as a euphemism: people may prefer to take
part in a trial rather than an experiment.
There are several types of clinical trials:
• Treatment trials: test experimental treatments, new combinations of medication, or
new approaches to surgery or radiation therapy.
• Prevention trials: look for ways to prevent disease in people who are disease-free, or
to prevent a disease from returning. Prevention trials may include medicines, vac-
cines, vitamins, minerals, or lifestyle changes.
• Diagnostic trials: are done to find better tests or procedures for diagnosing a particu-
lar disease or condition.
• Screening trials: test the best way to detect certain diseases or health conditions.
• Supportive care trials: explore ways to improve comfort and quality of life for people
with an illness.
Treatment trials are the most common form of clinical trials. Their intention is to study
different treatments for patients who already have some disease. Consider the comparison
of treatment A and treatment B. Often treatment A may be an experimental medication,
and treatment B may be a placebo, i.e. a non-medication, disguised to look the same as the
experimental medication, so that the patient does not know which treatment they receive.
Subjects must be diagnosed as having the disease in question and be admitted to the study
soon enough to permit treatment assignment. Subjects whose illness is too mild or too
severe are usually excluded.
Treatment assignment is designed to minimize variation of extraneous factors that might
affect the comparison. So, the treatment groups should be comparable with respect to
some baseline characteristics. A random assignment scheme is the best way to achieve
these objectives. This means that for a patient who fits the criteria for admission to the trial,
the patient is assigned treatment A or B at random. Since assignment is random, various
ethical issues are involved. For example, the patient must agree to the randomisation: the
possibility that they receive either of the possible treatments.
The gold standard for a clinical trial is a randomised controlled trial (RCT); that is, an
experiment with a treatment group and a control group for which individuals are assigned
randomly to the treatment and non-treatment group.
The Physicians’ Health Study and the Zidovudine trial described above are examples of
randomised controlled trials.
There are different sorts of clinical trials depending on what stage the experimental process
has reached. These stages are called phases.
• Phase 1: This is the first trial of the drug on humans (up to this point, research will
usually have been conducted on animals). Healthy volunteers are given the drug and
observed by the trial team over the period of the trial. The aim is to find out whether
it’s safe (and at what dose), whether there are side effects, and how it’s best taken (as
tablets, liquid, or injection for instance).
• Phase 2: If the drug passes muster in phase 1, it’s next given to people who actually
have the condition for which the drug was developed. The aim of a phase 2 trial is to
see what effect the drug has — whether it improves the condition and by how much,
and again, whether there are any side effects.
• Phase 3: Phase 3 trials are similar to a phase 2 trial except the number of people given
the drug is much larger. Again, researchers are looking at safety and effectiveness.
Phase 3 is the last stage before the drug is then licensed for use by the general public.
• Phase 4: In this phase, the drug is compared to other, existing, drugs. The idea of a
phase 4 trial is to get more qualitative information – determining where exactly the
drug is most useful, and for what sort of patient. The participants in a phase 4 trial
are people in the community who have the condition.
The term ‘clinical trial’ suggests an association with a clinic (a facility, often associated with a hospital
or medical school, that is devoted to the diagnosis and care of outpatients). This is the
origin of the term ‘clinical trials’, and in most cases this is true. But it should be noted that,
in general, the treatment need not be applied in the clinic.

1.2.2 Field trial

In a field trial, the subjects generally do not have the disease. The intention of the treat-
ment/intervention is to prevent the disease. Field trials usually require a great number
of subjects so that there will be a sufficient number of “cases” (outcome events) for com-
parison. As the subjects are not patients, they need to be treated or visited in the “field”
(at work, home, school) or at centres set up for the purpose. So field trials are very
expensive and are usually used for the study of extremely common or extremely serious
diseases.
Examples include:
• Salk vaccine trial.
• MRFIT (Multiple Risk Factor Intervention Trial)
As in clinical trials, a random assignment scheme is the ideal choice.

1.2.3 Community intervention trial

In this case the treatment/intervention is applied to a whole community rather than indi-
viduals.
Examples include:
• Water fluoridation to prevent dental caries.
• Fast-response emergency resuscitation program.
• Education program conducted using mass media.

1.2.4 Experiments and experimental principles

The principles of experimental design apply throughout the sciences. In this section we
point out some of the general principles and terminology; and indicate how experiments
in the medical sciences fit into the wider scheme.

DEFINITION 1.2.2.
1. An experiment is one where we impose a procedure (called a treatment, inter-
vention, exposure) on particular individuals (called the experimental units or
subjects) and observe the effect on a variable (called the response variable).
2. The response variable relates to the outcome of the experiment, which may be
negative (recurrence, death, increased cancer-cell count, . . . ) or positive (cure,
symptom alleviation, reduced cholesterol level, . . . ).
3. In the case of an experiment, an explanatory variable is something which may
affect the outcome, and which is known at the time of treatment. The primary
explanatory variable is the treatment variable; other explanatory variables may
be potential confounders.

In a designed experiment, the experimenter determines which subjects receive which treat-
ment. The experimenter must adhere to the principles of design of experiments to achieve
validity and precision.

Control group
The word “control” is often misunderstood in the context of medical testing: when people
hear of a “controlled experiment”, they tend to assume that, somehow, all the problems
have been fixed . . . and under control. Not so. What it means is that the experiment in-
cludes a control group who do not receive the treatment, as well as a treatment group who
do. Usually the control group is given a placebo, i.e. a pseudo-treatment that looks like the
real thing but which is known to be neutral.
In a designed study, the control group forms a baseline for comparison, to detect the effect
of any other treatments. Comparison is the key to identifying the effects on the response
variable. If we have only one treatment group, then there is no way to identify what is and
what is not the effect.

EXAMPLE 1.2.3: (Gastric freezing)


A proposed treatment for ulcer patients: the patient swallows a deflated balloon
with tubes attached and then a refrigerated solution is pumped through the
balloon for an hour. This “gastric freezing” therapy promised to reduce acid
secretion by cooling the stomach and so relieve ulcers. An experiment reported
in the Journal of the American Medical Association showed that gastric freezing did
reduce acid secretion and relieve ulcer pain. The treatment was safe and easy
and widely used for several years.

Unfortunately, the design of the experiment was defective. There was no control
group. A better-designed experiment, done several years later, divided ulcer
patients into two groups. One group was treated by gastric freezing as before,
while the other group received a placebo treatment in which the solution in the
balloon was at body temperature rather than freezing. The results of this and
other designed experiments showed that gastric freezing had no real effect, and
its use was abandoned.
Confounding variables, lurking variables
Suppose the standard treatment is given by Doctor A and the experimental treatment is
given by Doctor B. Then, we say the treatment is confounded with the treating doctors,
because we cannot tell whether the effect on the response variable is due to the treatment
or the skill or the manner of the doctor.
Confounding occurs when observed effects can be explained by more than one explana-
tory variable, and the effects of the variables cannot be separated. The reason for most
experimental studies is the investigation of a treatment. Usually therefore, the primary
explanatory variable is the treatment variable (whether or not the individual receives the
treatment) and our concern is whether any other variable might be confounded with the
treatment variable.

DEFINITION 1.2.3.
1. A confounding variable is a variable that is a possible cause of the disease out-
come, which is related to the exposure (treatment, intervention).
2. A lurking variable is a confounding variable, but one which is unknown and un-
observed. It is thus a particular (and particularly dangerous) type of confounding
variable.

EXAMPLE 1.2.4: A number of observational studies have shown an inverse rela-


tionship of consumption of vegetables rich in beta-carotene with risk of cancer.
While it may be the beta-carotene itself that is responsible for this lower risk, it
is also possible that the association is confounded by other differences between
consumers and non-consumers of vegetables. It may not be the beta-carotene at
all, but rather another component of vegetables such as fibre, which is known to
reduce cancer risk. In addition, those who eat vegetables might also be younger
or less likely to eat fat or to smoke cigarettes, all of which in themselves might
reduce cancer risk. Thus, the observed decreased risk of cancer among those
consuming large amounts of vegetables rich in beta-carotene may be due, ei-
ther totally or in part, to the effect of these confounding factors.
Confounding, and especially the possibility of lurking variables, is an important obstacle
to drawing conclusions from any experimental study. For this reason, we need to deal with
confounding. The most important and useful technique we have for this is randomisation.
Randomisation
• The treatments are allocated to the experimental units randomly.
• Randomisation is usually done using computer-generated random numbers.
• The effect of randomisation is to use randomness to even out the effect of confound-
ing variables.

EXAMPLE 1.2.5: (Fisher’s tea-tasting experiment)


Ronald A. Fisher6 is regarded as a founding father of modern statistics. He
derived a range of important results and techniques which are basic to statistical
analysis. Fisher introduced the concept of randomisation as essential to the
validity of experimental design. The lady tasting tea is a famous randomized
experiment which he used as an introductory example in his book “The Design
of Experiments”.
6 Ronald Aylmer Fisher FRS (1890-1962) was an English statistician, evolutionary biologist, eugenicist and ge-

neticist. He was described by Anders Hald as “a genius who almost single-handedly created the foundations
for modern statistical science,” and Richard Dawkins described him as ”the greatest of Darwin’s successors”. He
spent the last years of his life working at the CSIRO in Adelaide.
The lady in question claimed to be able to tell whether the tea or the milk was
added first to a cup. Fisher gave her eight cups, four of each variety, in random
order. The story has it that the lady in question was Muriel Bristol, and she got
them all right. The chance of someone who just guesses getting all eight correct
is only 1 in 70.

[from Fisher “Design of Experiments”]
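
The 1-in-70 figure is a count of the number of ways of choosing which four of the eight cups had the milk added first; in R:

> choose(8, 4) # 70 equally likely arrangements under pure guessing
> 1/choose(8, 4) # probability of a perfect score by guessing: about 0.014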

Blocking = Stratification
When we know certain factors have an effect on the response variable, we should ensure
these factors even out in the different treatment groups, instead of trusting it to randomi-
sation. This is done by blocking. A block is a group of similar study units. A block is the
generic term, applying to a wider range of experiments than we are concerned with in this
subject. In an agricultural experiment for example, a block may comprise a set of plots of
similar fertility. Here the units are the plots and we would be concerned with the yield
from each plot. Treatments might be fertilisers. In an engineering experiment, a block may
comprise samples of material obtained from one production batch.
In our case, the unit is generally an individual. A block is a collection of similar individuals,
which is equivalent to a stratum.

DEFINITION 1.2.4. A blocked experiment is one where the randomisation of units is
carried out separately within each block.

This reduces the natural variation by making comparison on similar units. Blocking there-
fore has the effect of achieving higher precision.
Blocking is equivalent to matching. Identical twins would be the ultimate blocks! Blocks,
or sets of matched individuals, may be of any size (greater than one). Matching individuals
enables better comparison of treatments.
Replication
Suppose we want to compare two methods of teaching language. Student A is taught by
one method and Student B by the other. We know we cannot rely on comparing the test
scores of the two students, because student A might be brighter or more conscientious.
We need to have replications. This means enough experimental units in each treatment
group so that chance variation can be measured and systematic effect can be seen. The
more replications (the number of experimental units in each treatment group), the more
reliable (precise) the comparison of the treatments. Yes . . . but how many? (Chapter 7).
Blinding

DEFINITION 1.2.5. A blind experiment is an experiment where the subject (and, in
some cases, also the experimenter, i.e. the person administering the treatment) is
prevented from knowing which treatment is used, so as to avoid conscious or unconscious
bias on their part, which would invalidate the results.

For example, when asking consumers to compare the tastes of different brands of a product,
the identities of the latter should be concealed. Otherwise consumers may tend to prefer
the brand they are familiar with. Similarly, when evaluating the effectiveness of a medical
drug, both the patients and the doctors who administer the drug may be kept in the dark
about the nature of the drug being applied in each case.
Single-blind describes experiments where information is withheld from the participants,
but the experimenter is in full possession of the facts.
In a single-blind experiment, the individual subjects do not know whether they are so-
called “test” subjects or members of the “control” group. Single-blind experimental design
is used where the experimenters either must know the full facts (for example, when com-
paring sham to real surgery) and so the experimenters cannot themselves be blind, or where
it is believed the experimenters cannot introduce further bias and so the experimenters
need not be blind. However, there is a risk that subjects are influenced by interaction with
the researchers — known as the experimenter’s bias: the experimenter has an expectation
of what the outcome should be, and may consciously or subconsciously influence the be-
havior of the subject, and their responses, in particular.
Double-blind describes an especially stringent way of conducting an experiment, in an
attempt to eliminate subjective bias on the part of both experimental subjects and the ex-
perimenters. In most cases, double-blind experiments are held to achieve a higher standard
of scientific rigor.
In a double-blind experiment, neither the individuals nor the researchers know who be-
longs to the control group and the experimental group. Only after all the data have been
recorded (and in some cases, analyzed) do the researchers learn which individuals are
which. Performing an experiment in double-blind fashion is a way to lessen the influence
of the prejudices and unintentional cues on the results.
Random assignment of the subjects to the experimental or control group is a critical part of
double-blind research design. The key that identifies the subjects and the group they
belong to is kept by a third party, and is not given to the researchers until the study is over.
Balance
Balance means each treatment is applied to the same number of study units. This is desir-
able when possible, as it simplifies the analysis and gives the most precise comparison. It
is sometimes defeated by nature, e.g. some patients withdraw from the study.

Summary
1. Control — for validity. Comparison is the key to identifying the effects on a response
variable.
2. Randomisation — for validity. Randomly assign treatments. This neutralizes the
effects of other variables.
3. Replication — for precision. Repeat to get better results. This reduces the influence
of natural variation.
4. Blocking — for precision, and for validity in the presence of confounding. Group
the study units into blocks of similar units. This removes any unwanted source of
variation.
5. Blinding — for validity. To ensure that the expectations of the subject do not influ-
ence the outcome. And, with double-blinding, to ensure that the expectations of the
experimenter do not influence the outcome.
6. Balance — for precision. Have the same number of units in each treatment group if
feasible.
The ‘gold standard’ clinical trial is a randomised controlled trial (RCT), i.e. an experiment
with individuals assigned randomly to a treatment group or a control group. Blinding is
used where possible.

EXAMPLE 1.2.6: (Physicians’ Health Study)


The Physicians’ Health Study is a randomised controlled trial (designed to de-
termine whether low-dose aspirin decreases cardiovascular mortality).

1.3 Observational studies

1.3.1 Cohort study

DEFINITION 1.3.1.
1. A cohort is broadly defined as “any designated group of individuals who are
followed over a period of time.”
2. A cohort study involves measuring the occurrence of disease within one or more
cohorts. (An experiment is a cohort study, but not all cohort studies are experiments.)

Many cohort studies can be expressed as the comparison of two cohorts, which we denote
as exposed (E) and unexposed (E ′ ). As has been mentioned, the “exposure” may cover a
broad range of things: from a drug treatment or immunisation to an attribute like economic
status or the presence of a particular gene. The intention then is to compare disease rates
in the exposed cohort and the unexposed cohort.
The cohort concept is straightforward enough, but there are complications involving who
is eligible to be followed, what should count as an instance of disease, how the incidence
rates or risks are measured and how exposure ought to be defined. (Mostly, we don’t resolve
these complications; we just note that they exist and trust that they are sorted out . . . by others, or
possibly by us, later, when we know some more.)
In a cohort study, exposure is not assigned. The investigator is just an observer. As a result,
causation cannot be inferred as it is not known why the individual came to be exposed.

The strongest conclusion that can be drawn from an observational study is that there is
an association between exposure E and disease outcome D (but it is not known why). In
particular, it cannot be concluded that E causes D. Nevertheless, the intention of a cohort
study is often to address causation, and the terms response and explanatory variables are
used for observational studies as well as experimental studies.

EXAMPLE 1.3.1: (Simple cohorts)


Simple examples of cohorts are: the students enrolled in this subject; children
born at RWH in 2020; women attending BreastScreen for the first time in 2020.

EXAMPLE 1.3.2: (Cohort study of vitamin A during pregnancy)


To study the relation between diet of pregnant women and the development of
birth defects in their offspring, a US study interviewed more than 22 000 preg-
nant women early in their pregnancies.

The women were divided into cohorts according to the amount of vitamin A
in their diet, from food or from supplements. The data are given in the table
below.

Table 1.3: Prevalence of birth defects among the offspring of four cohorts of
pregnant women, classified according to their intake of supplemental
vitamin A during early pregnancy.

    vitamin A level (IU/day)      D        n      risk
    0–5000                       51    11083    0.0046
    5001–8000                    54    10585    0.0051
    8001–10000                    9      763    0.0118
    10001–                        7      317    0.0221

These data indicate that the prevalence of these defects increased with increas-
ing intake of vitamin A.

But does vitamin A affect the “population of births”? While vitamin A might
be a cause, another possible explanation of this result is that it could enable
embryos with the defect to survive until birth.
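
The risks in Table 1.3 are simply D/n for each cohort. A minimal sketch of the
arithmetic (Python used here purely for illustration):

    # risks from Table 1.3: risk = D / n for each vitamin A cohort
    cohorts = {                 # level (IU/day): (D, n)
        "0-5000":     (51, 11083),
        "5001-8000":  (54, 10585),
        "8001-10000": (9, 763),
        "10001-":     (7, 317),
    }
    for level, (d, n) in cohorts.items():
        print(f"{level:>12}: risk = {d}/{n} = {d / n:.4f}")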

EXAMPLE 1.3.3: (John Snow’s ‘natural experiment’)


John Snow collected data regarding the cholera outbreak in London in 1854.
At the time, there were several water companies that piped drinking water
to London residents. Snow’s ‘natural experiment’ consisted of comparing the
mortality rates for residents subscribing to two of the major water companies,
SV (Southwark & Vauxhall), which piped impure Thames water (contaminated
with sewage) and L (Lambeth), which collected water from upstream Thames,
and therefore relatively free of London sewage.

Table 1.4: Frequency of death due to cholera among customers of the
Southwark & Vauxhall Company (exposed cohort) and the Lambeth Company
(unexposed cohort), London 1854.

                                  D (cholera deaths)   n (popln size)    rate
    E  (Southwark & Vauxhall)                   4093          266,516   1.54%
    E′ (Lambeth)                                 461          173,748   0.27%

Residents whose water came from the Southwark & Vauxhall Company had a
cholera death rate 5.8 times greater than that of residents whose water came
from the Lambeth Company.
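
The rates and the factor of 5.8 follow directly from Table 1.4; a quick check of the
arithmetic (again an illustrative Python sketch):

    # cholera death rates from Table 1.4, and the exposed/unexposed rate ratio
    d_e, n_e = 4093, 266516    # E:  Southwark & Vauxhall
    d_u, n_u = 461, 173748     # E': Lambeth

    rate_e = d_e / n_e         # about 1.54%
    rate_u = d_u / n_u         # about 0.27%
    print(f"rate(E) = {rate_e:.2%},  rate(E') = {rate_u:.2%}")
    print(f"rate ratio = {rate_e / rate_u:.1f}")   # about 5.8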

Snow saw that circumstance had created conditions like that of an experiment.
In an experiment, individuals who were otherwise alike differ only in whether
they receive the treatment or not. In this case, it seemed that people differed

only by their consumption of pure or impure water.

In an experiment, the investigator assigns the participants to the exposed and unexposed
groups. In a natural experiment, as studies like this have come to be known, the investi-
gator takes advantage of a setting that is like an experiment. It is like an experiment in
that the “assignment” of treatment to individuals is pseudo-random. The “assignment” is
not done by the experimenter, and not by randomisation. It is done by some other pro-
cedure that appears to mimic randomisation, and which is assumed to be equivalent to
randomisation. The validity of any conclusions depends on this assumption.
Note: a ‘natural experiment’ is not an experiment (because the treatment is not imposed
on the subjects), and there must remain some doubt about the causation conclusion.

EXAMPLE 1.3.4: (Framingham Heart Study)


In 1948, the Framingham Heart Study embarked on an ambitious project in
health research. Its objective was to identify common factors or characteristics
that contribute to CardioVascular Disease (CVD) by following its development
over a long period of time in a large group of participants who had not yet
developed overt symptoms of CVD or suffered a heart attack or stroke.

The researchers recruited 5209 men and women between the ages of 30 and
62 from the town of Framingham, Massachusetts, and began the first round of
extensive physical examinations and lifestyle interviews that they would later
analyse for common patterns related to CVD development. Since 1948, the sub-
jects have continued to return to the study every two years for a detailed med-
ical history, physical examination and laboratory tests. In 1971, the Study en-
rolled a second generation: 5214 of the original participants’ adult children and
their spouses, to participate in similar examinations.

There have been subsequent cohorts recruited in 1994, 2002 and 2003, including
a third generation of participants: grandchildren of the Original Cohort. More
details of the cohorts and the results obtained can be found at
//www.framinghamheartstudy.org/.
Closed and open cohorts
A closed cohort is one with a fixed membership. Once it is specified and follow-up begins,
no-one can be added to a closed cohort. The cohort will dwindle as people in the cohort
die, or are lost to follow-up, or develop the disease. The Framingham Heart Study includes
several closed cohorts. We will primarily be concerned with closed cohorts.
An open cohort (or a dynamic cohort) can take on new members as time passes. An exam-
ple of an open cohort is the population of Victoria. Cancer incidence rates in Victoria over
a period of time reflect the rate of cancer occurrence among a changing population.

EXAMPLE 1.3.5: (Busselton Health Study)


The Busselton Health Study is one of the longest running epidemiological re-
search programs in the world. It’s the Australian version of Framingham. The
residents of the town of Busselton, a coastal community in the south-west of
Western Australia, have been involved in a series of health surveys since 1966.
To date over 16 000 men, women and children of all ages have taken part in the
surveys and have helped contribute to our understanding of many common
diseases and health conditions.

Much of the data comes from cross-sectional studies (see §1.3.4 below), treating
the Busselton community as an open cohort. However, one follow-up study
of the first cross-sectional study was done thirty years on, i.e. a closed cohort
study.

More information can be found at //www.busseltonhealthstudy.com/.

1.3.2 Case-control studies

A considerable drawback of conducting a cohort study is the necessity, in many situations,


to obtain information on exposure and other variables from large populations in order to
measure the risk or rate of a disease. In many studies, however, only a tiny minority of
those who are at risk actually develop the disease. The case-control study aims at achieving
the same goals as a cohort study but more efficiently, using sampling. Properly carried out,
case-control studies provide information that mirrors what could be learned from a cohort
study, usually at considerably less cost and time.
We explain a case-control study by means of an illustrative example.
Consider a serious and reasonably rare disease, such as bowel cancer, which requires hos-
pital treatment. Suppose we are interested in an exposure, such as smoking (or diet, or
lifestyle history, or past medical treatment, or a gene-marker g, . . . ).
We begin with the cases. Suppose our case source is the Royal Melbourne Hospital. We
consider a stratum of individuals: say 50-54yo males. Suppose that there were 16 such
cases admitted to RMH in 2020. Of these, 8/16 (i.e. 50%) were smokers.
We wish to compare the cases with the rest of the population. But what exactly is the
population? In this case the population in question is the collection of individuals who, if
they had bowel cancer would have attended RMH. This is called the source population.
This population is hypothetical, and impossible to enumerate! But we need to sample from
it. How? A plausible way to sample from this population is to take a matching patient who
was admitted to RMH in 2020, but with an unrelated ailment. This sample is the control
sample. We want to compare the cases with the controls; hence a case-control study.
QUESTION: Why should we match controls and cases?
We could choose 16 controls; the same as the number of cases. But there is no need to
restrict the number of controls. The more controls we have, the better the information we
have about the source population. So, if it’s possible (allowing for the number of available
patients, research budget and time) we should choose more. Suppose we select 32 controls.
Of these 8/32 (i.e. 25%) were smokers.
The fact that 50% of the bowel-cancer patients were smokers [50% of cases were exposed]
and only 25% of the non bowel-cancer patients were smokers [25% of controls were ex-
posed] suggests a relationship between exposure and disease. Is there?
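
One way to quantify the comparison is through the odds of exposure in each group. The
odds ratio is treated properly in Chapter 3; the sketch below (Python, purely for
illustration, using the hypothetical counts above) just shows the arithmetic.

    # exposure odds among cases and controls (hypothetical bowel-cancer data)
    cases_exposed, cases_total = 8, 16        # smokers among the 16 cases
    controls_exposed, controls_total = 8, 32  # smokers among the 32 controls

    odds_cases = cases_exposed / (cases_total - cases_exposed)              # 8/8 = 1.0
    odds_controls = controls_exposed / (controls_total - controls_exposed)  # 8/24
    print(f"odds ratio = {odds_cases / odds_controls:.1f}")                 # 3.0
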
This example is typical of a case-control study, though a full case-control study would
generally involve more than one stratum, and possibly smaller and more specific strata.
The data source may be a large clinic, or a series of clinics (e.g. medical clinics in the western
suburbs); or it may be a group of hospitals (e.g. all the major Melbourne hospitals). In each
case, the source population is the collection of individuals who would have attended the
specified medical centres if they had the disease in question.
A case-control study is retrospective. The cases and controls are in the present, and we
investigate their past — perhaps using hospital records, or by questioning the patients or
their relatives. A disadvantage of this is that old records or memories may be faulty. An
advantage is that a range of exposures may readily be considered for possible relation to
the specified disease.

1.3.3 Comparison of cohort and case-control studies

A major advantage of a case-control study over a cohort study is that by effectively sam-
pling from the population we save considerably on cost and time.
A cohort study is usually prospective, whereas a case-control study is usually retrospective.

Let’s consider a hypothetical cohort study corresponding to the above bowel cancer
case-control study. It must be hypothetical, because such a study couldn’t actually be car-
ried out! (We can’t force individuals to smoke.) But let’s pretend it’s 2010, and we know all
the 40-44yo males who are going to be in the 2020 RMH source population. Let’s suppose
there are 12 000 of them.
To obtain the exposure information, i.e. to find out how many of them are smokers, we
would need to question (by questionnaire or interview or . . . ) all 12 000 of them. And
perhaps keep track of them in the intervening time. Any other exposures that we might
want to examine would have to be specified in advance (i.e. in 2010). We would then
follow these individuals for the next ten years to see how many of them get bowel cancer.
Of course, some of these individuals may get cancer any time in 2010-2020, so that it’s not
a perfect match to the case-control study. Suppose that over the next ten years, 100 of these
individuals are admitted to hospital with bowel cancer, and 50 of these were smokers, i.e.
50% are exposed . . . as for the case group above, as opposed to 25% for the rest of the
population.
Such a study would give stronger evidence that the disease is more common among ex-
posed individuals. But, even if such a procedure were possible, it would be hugely expen-
sive and time-consuming.
QUESTION: Why does the cohort study give stronger evidence?
In a case-control study, subjects are selected on the basis of their disease status: cases have
the disease of interest. This means that we cannot estimate the risks of disease for the exposed
and unexposed groups. However, the relative risk can be estimated, provided the disease
prevalence is known. This is explained in more detail in Chapter 3.
The primary difference between a cohort study and a case-control study is that a cohort
study involves complete enumeration of the cohort (sub-population), whereas a case-control
study is based on a sample from the relevant sub-population.
QUESTION: Why is a cohort study good for studying many diseases? . . . and a case-control
study good for studying many exposures?

Table 1.6: Comparison of the characteristics of cohort and case-control studies

    Cohort Study                                Case-Control Study
    Complete source population                  Sampling from source population
      experience tallied
    Can calculate incidence rates or risks,     Can usually calculate only the ratio
      and their differences and ratios            of incidence rates or risks
    Usually very expensive                      Usually less expensive
    Convenient for studying many diseases      Convenient for studying many exposures
    Usually prospective                         Usually retrospective

1.3.4 Cross-sectional studies

The study types described above are longitudinal studies, i.e. the information obtained
pertains to more than one point in time. Implicit in a longitudinal study is the premise that
the causal action of an exposure comes before the development of the disease. All cohort
studies and most case-control studies rely on data in which exposure information refers to
an earlier time than that of disease occurrence, making the study longitudinal.
Cross-sectional studies are occasionally used in epidemiology. A cross-sectional study in
epidemiology amounts to a survey of a defined population. As a consequence, all of the
information relates to the same point in time; they are basically snapshots of the population
with respect to disease and/or exposure at a specific point of time. A population survey,
such as the census, not only attempts to enumerate the population but also to assess the
prevalence of various characteristics. Surveys are conducted frequently to sample opinions;
they can also be used to measure disease prevalence and/or possible exposures.
A cross-sectional study cannot measure disease incidence (the rate at which the disease
outcome D occurs), since this requires information across a time period. But cross-sectional
studies can be used to assess prevalence (the proportion of the population with D).
Sometimes cross-sectional data is used as a reasonable proxy for longitudinal data, in the
absence of such information. If no record exists of past data, present data might be used as
a substitute. Current accurate data might be better than hazy recall of the past.
Surveys
Typically, a survey consists of a sample taken from a population of interest. Data are col-
lected from each person in the sample, such as the exposure status and disease status. As
the data are collected at a point in time, it is called a cross-sectional study. From a cross-
sectional study, it is possible to estimate the prevalence of disease and of exposure. It is not
suitable for investigating a causal relation, as it does not have a time dimension built into
it. However, an association might be found and further research suggested.
For validity, the sample needs to be representative of the population and to have been
drawn in an unbiased manner. Random sampling is usually used.
The aim of a survey is to obtain a representative sample from a specified population, which
enables estimation of the population characteristics.
A census, or complete enumeration of the population, is often not feasible or desirable. It
is likely to be massively expensive. A survey has the advantages of reduced cost, greater
speed, scope and accuracy. Surveys are used for planning, identifying problems, market
research and quality control. They can be both descriptive and analytical. Survey variables
can be qualitative or quantitative. Scales can be nominal, ordinal or numerical.

A survey is not a trivial matter to get right! Planning and executing a survey involves,
among other things:
• specifying questions such that all, and only, the relevant data are collected, fairly and
accurately;
• defining the population (so that the actual target population corresponds to the one
we wish to study);
• identifying the sampling units (usually individuals, in our applications) and the sam-
pling frame, which is a list of sampling units in the target population.
• determining the degree of precision required (this will affect the sampling procedure);
and then minimising bias, cost and time scale problems in the sampling procedure.
• choosing a suitable measurement technique;
• taking a pilot sample;
• administration and editing, processing and analysing the data.
Sampling error

DEFINITION 1.3.2. Sampling error is the random variation in the results due to the
elements in the sample being selected at random.

This can be controlled and estimated provided random sampling methods are used in se-
lecting the sample.
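
Sampling error is easy to see by simulation: repeated random samples from the same
population give different results. A minimal sketch (Python, with a made-up population
in which 30% of individuals have the attribute of interest):

    import random

    random.seed(2)
    population = [1] * 300 + [0] * 700   # 30% of the population has attribute D

    for trial in range(5):
        sample = random.sample(population, 50)   # simple random sample, n = 50
        print(f"sample {trial + 1}: proportion with D = {sum(sample) / 50:.2f}")

Each run of the loop gives a slightly different proportion; that spread is the sampling
error, and it is quantifiable precisely because the sampling is random.
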
Non-sampling errors
Non-sampling errors include:
• Selection bias, which occurs when the true selection probabilities differ from those
assumed in calculating the results;
• Coverage problems: inclusion of individuals in the sample from outside the popu-
lation; or exclusion of individuals in the population (perhaps because the sampling
frame is incomplete);
• Loaded, ambiguous, inaccurate or poorly-posed questions;
• Measurement error: e.g. when respondents misunderstand a question, or find it dif-
ficult to answer (due to language or conceptual problems);
• Processing errors: e.g. mistakes in data coding;
• Non-response: failure to obtain data from sampled individuals.
Non-sampling errors are reduced by careful attention to the construction of the question-
naire and fieldwork. The latter may include callbacks, rewards and incentives, trained
interviewers and data checks.
A major problem with many surveys is non-response. Because we don’t know anything
about the non-respondents, there may be a bias, which we know very little about. The
only way to guarantee control of this bias is to increase the response rate. Experience indi-
cates that the response rate should be at least 50%, but serious biases can still occur with a
response rate of 70%. It depends!

EXAMPLE 1.3.6: The Literary Digest poll


If it is remembered at all, the Literary Digest is probably best remembered for
the circumstances surrounding its demise. It conducted a survey to predict the


outcome of the 1936 presidential election. The poll showed that the Republican
governor of Kansas, Alf Landon, would likely be the overwhelming winner.

In the election, Landon won only in Vermont and Maine; Franklin Delano Roo-
sevelt carried the other 46 states; Landon’s electoral vote total of eight is a tie
for the record low for a major-party nominee. The magazine was completely
discredited because of the poll and was soon discontinued. The polling tech-
niques employed by the magazine were to blame. Although it had polled 10
million individuals (only about 2.4 million of these individuals responded, an
astronomical number for any survey), it had surveyed its own readers, reg-
istered automobile owners and telephone users, and other individuals whose
names had been recorded on lists or memberships. All of these groups con-
tain an over-representation of conservative Republican voters. Literary Digest
readers were wealthy enough to afford a journal subscription and conserva-
tively inclined enough to choose the Literary Digest. Further, in those days,
relatively few people had cars or phones — and so, again, the working classes,
who favoured the Democrats, were under-represented.

George Gallup’s American Institute of Public Opinion achieved recognition by


correctly predicting the result of the election, and for accurately predicting the
results of the Literary Digest poll, using a much smaller sample size. The Lit-
erary Digest survey debacle led to a considerable refinement of public opinion
polling techniques and was largely regarded as spurring the beginning of the
era of modern scientific public opinion research.
Non-sampling errors are the most important: usually they are the greatest contributor to error
in any real survey. But in this subject, we gloss over the non-sampling errors and deal only
with the sampling error.

1.4 Review of study types


(1) The hierarchy of evidence (in decreasing order of “value”):

    clinical trials
    community trials
    cohort studies / “natural experiments”
    case-control studies
    cross-sectional studies
    ecological studies (demographic data)
    animal experiments
    in vitro experiments
    anecdotal evidence

(2) Time-line diagram (past, present, future):

    clinical trial:          runs from the present into the future (prospective)
    cohort study:            follows subjects forward in time, from exposure
                             to outcome (prospective)
    case-control study:      starts from present cases and controls and looks
                             back at past exposures (retrospective)
    cross-sectional study:   concerns the present only
In broad terms, there are two types of epidemiological studies:
• Experimental studies — the investigator assigns the exposure (intervention, treat-
ment) to some of the individuals in the study with the objective of comparing the
results for the exposed and unexposed individuals.
• Observational studies — the investigator selects individuals, some of whom have
had the exposure being studied, and others not, and the outcome is observed [cohort];
or individuals are selected some of whom have had the outcome and others not, and
their exposure is observed [case-control]. Essentially though, an observational study
is an epidemiological study that is not an experiment!

EXAMPLE 1.4.1: Suppose an investigator wants to determine whether a radical


mastectomy is more effective than a simple mastectomy in prolonging the life
of women with breast cancer.

In an experimental study, she would find a group of eligible patients, and ran-
domly assign the patients into the two treatment groups. The patients will be
given the assigned treatments and then followed for a number of years, to ob-
serve their survival times.

In an observational cohort study, the investigator could examine the records of


hospitals to gather information on the survival times after surgery of all women
who have had either operation.

1.4.1 A dialogue with a skeptical statistician (Gary Grunwald)


Skepticism is the first step towards truth. (Denis Diderot)

Question: Does a new drug lengthen survival times of cancer patients, compared with no
drug treatment?
Proposal: Ask various doctors and hospitals for records of all cancer patients on the new
drug, and compare survival times with those who got no drug. (Suppose the result shows
survival is longer for the new drug.)
Skeptic: But that’s just looking at what’s already happened. There could be lots of reasons
why it happened.

Observational study: A study that observes what already exists.

Proposal: But still, survival is longer for people using the new drug.
Skeptic: That shows the drug is associated with survival, not that it is the cause of survival.

Association: One variable shows patterns related to another variable.

Cause: One variable makes the other change: e.g. changing variable 1
makes variable 2 change.

Skeptic: For instance, what if doctors tend to give the new drug to only the patients they
think are likely to improve anyway, and assume nothing will help the sicker ones? Couldn’t
that explain the findings?

Confounding factor: Another variable that is related to the treatments
and that affects the response.

Proposal: So we should be the ones who decide who gets which drug, not the doctors. For
instance, we’ll tell doctors to give the drug to patients with surname starting with A-M and
not to N-Z.

Designed experiment: A study where the experimenter assigns the
treatments to the subjects.

Skeptic: That’s better, but still not perfect. For instance, many Vietnamese are named
Nguyen, and many Koreans are named Kim, and this could put most of them in the same
drug group. If survival is related to ethnicity our results could still be biased.
Proposal: Then let’s assign patients to drug groups randomly.

Randomisation: Random assignment of subjects to treatment groups.

Skeptic: Much better.


Proposal: But it’s quite a bit of work. Why not just use two patients and randomly assign
one to get the drug? Why use lots of patients?
Skeptic: What if the one who is healthiest happens to be assigned to get the drug? There’s
a 50/50 chance of this.

Replication: Using more than one subject. This evens out the effects of
chance.

Skeptic: There could still be problems, though, since the patients will know if they got the
drug, and it could have psychological effects.
Proposal: Then let’s make sure everyone thinks they could be getting the drug.

Blind study: the subjects don’t know which treatment they got.

Skeptic: But how can we do that?


Proposal: We’ll give the non-drug group a placebo.

Placebo: A fake treatment — a pill with no drug, for example, which
resembles the treatment pill as far as possible (same size, same colour,
same taste, . . . ) except that it does not contain the drug.

Skeptic: But still, won’t the doctors know who gets the new drug? And they may treat
those patients more aggressively.
Proposal: Then we should make sure the doctors don’t know either.

Double-blind study: Neither the subjects nor those providing the treat-
ment know which treatment was given.

Skeptic: Much better. The results from such a study will surely be more valid. There are
still lots of practical and ethical problems to be worked out, but these are some of the main
principles of good study design. And a well designed experiment is the only sure statistical
way to show cause.
It is useful for you as a statistician to play the skeptic (or the “devil’s advocate”). Try to
think of “What if . . . ?” possibilities, and other possible causes or explanations for the
outcomes, and endeavour to overcome them. This will put your conclusion on a sounder
footing. Of course, it may still not be enough to cover all bases. But at least you should
make it difficult for others to criticise your experiment and therefore any conclusions that
follow from it.
QUESTION: A pharmaceutical company wants to trial a new drug for a particular disease.
Set up a clinical trial following the principles of design, based on 1200 volunteers. 600 of
these volunteers are females and the rest males. Which stage of the clinical trial could this
be?

1.5 Causality in epidemiology

Causality is not simple. The concept continues to be debated as a philosophical matter in


the scientific literature. We know a causal factor does not always result in a disease; and a
disease may occur in the absence of a factor which is known to be a cause of the disease.
For example, smoking does not cause lung cancer in every smoker; and some non-smokers
develop lung cancer.
Think about the inquisitive child’s mantra “Why?”
“Why did X die?” Because he stopped breathing.
“Why?” Heart failure.
“Why?” Because he was old.
“But so is Y?” X also smoked.
“But so does Z. So why X?” . . .
Eventually the Why-cycle may lead to a mechanistic explanation, but usually (and espe-
cially in epidemiology) it does not. In any case it generally ends up with randomness or
god, which may or may not be the same thing! But let’s stick with epidemiology here.
A statistical view of causality allows a non-deterministic view of causal relationships. In
medicine, risk factors do not lead to the disease outcome with certainty, they just increase
its likelihood.
Definition of cause
My dictionary defines cause as “that which produces an effect, phenomenon or condition”.
In epidemiology, “an effect, phenomenon or condition” = “ a disease outcome”, for exam-
ple: death, disease progression, contracting the disease, improvement, recovery or cure.
A cause of a disease outcome is a factor that plays a part in producing the disease outcome.
This will be indicated by a change, over a population, in the risk of the disease outcome
or in the level of a disease indicator: for example, an increased risk of death or, in
assessing recovery, a decreased level of a disease indicator. Over a population, these would
be measured by the probability or the odds of the outcome, or by the mean level of the
disease indicator.

cause = “that which produces an effect” (here, a disease outcome D)

E is a cause of D if it:
    increases the risk of D     [probability/odds]
    increases the level of D    [mean]
    increases the rate of D     (cases per unit time)

A lot of what we do in this subject is about estimating these effects (of E on D).
Hill’s Criteria of causal association
Bradford Hill proposed the following criteria for an association to be causal:
1. Strength of association (A stronger association suggests causation.)
2. Consistency (Replication of the findings in different studies.)
3. Temporality (Cause should precede effect.)
4. Biological gradient (Dose-response relationship) (More of the causal factor should pro-
duce more effect.)
5. Plausibility (Does the association make sense biologically?)
6. Coherence (Does the association fit known scientific facts?)
7. Experiment (Can the association be shown experimentally?)
8. Analogy (Are there analogous associations?)
With the possible exception of temporality, none of Hill’s criteria is absolute for estab-
lishing a causal relation, as Hill himself recognized. He argued that none of his criteria is
essential.

Counterfactual Model (the unattainable ideal)

When we are interested in measuring the effect of a particular factor, E, we measure the
observed “effect” in a population who are exposed to E; and compare this to the “effect”
which would have been observed, if the same population had not been exposed to E, all
other conditions remaining identical. The comparison of these two effect measures is the
“effect” of the factor E that we are interested in.
However, the counterfactual effect is unobservable! We therefore seek to approximate this
ideal model as best we can. How? By considering two ’equivalent’ populations (or as close
as we can get) one of which gets E and the other does not. The experimental studies we
have considered attempt to achieve this using randomisation (or stratification/matching).

identical populations:   P & E   vs   P & E′
    any difference must be due to E vs E′

equivalent populations:  P1 & E (treatment)   vs   P2 & E′ (control)
    any difference is likely to be due to E vs E′
    (equivalence sought via randomisation or stratification/matching)

The clinical trials considered in the previous section play a fundamental role in developing
medical therapy, especially new medications. The tools of randomisation and blinding
actually allow proof of a causal connection by statistical means. This is one of the major
reasons why statistical methods are currently central in medical research.
It is a standard statistical warning that “relationship does not imply causation”. This is
quite true. But, possibly more importantly, in a well-designed experiment, relationship
does imply causation!
Relationships and causation
A positive relationship between A and B means that if you have A then you are more likely
to have B, i.e. you have an increased risk of B. And if you have B then you are more likely
to have A. There is no causation here. This is simply describing an association between
factors (attributes, events). We represent this as A –(+)– B.
A negative relationship between A and C means that if you have A then you are less likely
to have C, i.e. you have a decreased risk of C. And if you have C then you are less likely to
have A. We represent this as A –(−)– C.
If these two associations apply, you should expect that B –(−)– C. And that’s the way it
is. However, it should be noted that we are really talking about fairly strong associations
(positive or negative)7 .
A two-factor relationship diagram is not very interesting. You’ve seen them; both of them.
However, a three-factor relationship diagram is a bit more interesting.
There are only four possibilities (the signs of the A–C, C–B and A–B associations):

    (1) A–C: +,  C–B: +,  A–B: +
    (2) A–C: −,  C–B: −,  A–B: +
    (3) A–C: −,  C–B: +,  A–B: −
    (4) A–C: +,  C–B: −,  A–B: −

because we can’t have an odd number of negative relationships in the triangle: two nega-
tives make a positive. Think about what would happen if in diagram (1), C was changed
to C ′ .
How does this help? A three-factor relationship diagram is useful in showing the effect of
a confounding variable. Consider the women smokers of Whickham example (page 8). A
relationship diagram for this case is shown below:

              C (old-age)
                      \ (+)
    E (smoker)         D (death)

Clearly, there is a positive relation between old age and death. So, if there is a strong
negative relation between old-age and smoking in this population (which is what is observed),
then it follows that there is a negative relation between smoking and death . . . in this population.

              C (old-age)
          (−) /       \ (+)
    E (smoker)         D (death)

7 In terms of correlation coefficient, which we will consider later (Chapters 2 & 8), this means |r| > 0.7

We can represent cause on these relationship diagrams using an arrow:

    A –(+)→ B   means that A is a cause of B.

    causation:    E –(+)→ D    if an individual has E, then there is an
                               increased chance of D
    association:  E –(+)– D    there is an observed association between E & D
                               (but we don’t know why)

If there is an observed association between A and B, this does not mean there is causation.
The association may be because:
• A may cause B; [causation]
• B may cause A; [reverse causation]
• a third factor C may cause both A and B; [common cause]
• A and B may influence each other in some kind of reinforcing relationship; [bidirec-
tional causation]
• A and B just happen to be associated; [association]
• . . . or some combination of the above.

EXAMPLE 1.5.1: Research showed that older people who walk slowly are much
likelier to die in the near future. The study online in the British Medical Journal
divided 3200 men and women over 65 into the third who walked slowest, the
middle third and the briskest third. During the next five years, those in the
slowest third were 1.4 times likelier to die from any cause, compared with those
who walked faster. Slow coaches were 2.9 times likelier to die from heart-related
causes. [BMJ 2009 (Dumurgier et al.)]

(possible common-cause?)

EXAMPLE 1.5.2: Among 1700 men and women followed for about 10 years,
those rated happiest were less likely to develop heart disease than people who
were down in the dumps. During the study, about 8 per cent of the group had a
problem such as heart attack, indicating they had coronary heart disease. Peo-
ple with a positive outlook had about 75% the risk of developing heart disease
compared to the others. [EurHeartJ 2010 (Davidson et al.)]

(possible reverse causation and/or common cause?)


In the Whickham example, both smoking and old-age are a cause of death:

    [ not confounded ]

              C (old-age)
                      ↓ (+)
    E (smoker) –(+)→ D (death)

    [ confounded ]

              C (old-age)
          (−) /       ↓ (+)
    E (smoker) –(−)– D (death)

In examining the relation between the exposure E and the disease outcome D,
the factor C is a confounding factor if:
(i) C is a causal factor for D; and (ii) C is related to E.
(Further, C must not be caused by E, else C would be just part of
the causal cycle.)

EXAMPLE 1.5.3: In examining the relation between low physical activity and
heart problems, obesity is not a confounding factor, since it is part of the causal
link: low physical activity causes obesity (and possibly vice versa) and obesity
causes heart problems.
An unobserved or unknown factor may act as a confounding variable too. An unobserved
confounding variable is sometimes called a lurking variable.
If the confounding factor C is positively related with E, this is still a problem because
it exaggerates the relationship between E and D, so that the data would show a falsely
strong relationship between E and D.

EXAMPLE 1.5.4: Suppose that working in a particular factory is a possible cause


of lung cancer, but that these factory workers tend to smoke, which is a cause
of lung cancer:

    [ not confounded ]

              C (smoker)
                      ↓ (+)
    E (factory worker) –(+)→ D (lung cancer)

    [ confounded ]

              C (smoker)
          (+) /       ↓ (+)
    E (factory worker) –(++)– D (lung cancer)

EXERCISE. Draw a relationship diagram for the Australia–South Africa example.



Problem Set 1

1.1 A 3-year study was conducted to look at the effect of oral contraceptive (OC) use on heart
disease in women 40–44 years of age. It is found that among 5000 OC users at baseline (i.e. the
start of the study), 15 women develop a myocardial infarction (MI) during the 3-year period,
while among 10 000 non-users at baseline, 10 developed an MI over the 3-year period.
i. Is this an experiment or an observational study?
ii. What are the exposure and the disease outcome?
iii. Is this a prospective study, retrospective study or a cross-sectional study?
iv. What are the response and explanatory variables?
v. All the women in the study are aged 40–44. Explain why this was done.
vi. How would you present the results?

1.2 The effect of exercise on the amount of lactic acid in the blood was examined in a study. Eight
men and seven women who were attending a conference participated in the study. Blood
lactate levels were measured before and after playing a set of tennis, and shown below.
player M1 M2 M3 M4 M5 M6 M7 M8 W1 W2 W3 W4 W5 W6 W7
Before 13 20 17 13 13 16 15 16 11 16 13 18 14 11 13
After 18 37 40 35 30 20 33 19 21 26 19 21 14 31 20
(a) What is the research question?
(b) Is this a designed experiment or an observational study?
(c) What is the response variable? What are the treatments?
(d) Upon further investigation, we find that nine of the sample are 20–29 years old, while the
other six are 40–49 years old. What is the potential problem with the study?
(e) What is a confounding variable? Can you think of any potential confounding variables
in this case?

1.3 Identify the type of observational study used in each of the following studies (cross-sectional,
retrospective, prospective):
(a) Medical Research. A researcher from the Melbourne Medical School obtains data about
head injuries by examining hospital records from the past five years.
(b) Psychology of Trauma. A researcher plans to obtain data by following, for ten years in the
first instance, siblings of children who died in road accidents.
(c) Flu prevalence. The Health authority obtains current flu data by polling 5000 people this
month.

1.4 A study that claimed to show that meditation lowers anxiety proceeded as follows. The researcher
interviewed the subjects and rated their level of anxiety. Then the subjects were randomly
assigned to two groups. The researcher taught one group how to meditate and they meditated
daily for one month. The other group was simply encouraged to relax more. At the end of
the month, the researcher interviewed all the subjects again and rated their anxiety level. The
meditation group were found to have less anxiety.
(a) What are the experimental units? What are the response variable and the explanatory
variable?
(b) Is this an experimental study or an observational study?
(c) Is this a blind study? What is the reason for designing a blind study?
(d) It was found that the control group had 70% men and the meditation group had 75%
women. Is this a problem? Explain.

1.5 A study is to be conducted to evaluate the effect of a drug on brain function. The evaluation
consisted of measuring the response of a particular part of the brain using an MRI scan. The
drug is prescribed in doses of 1, 2 and 5 milligrams. Funding allows only 24 observations to be
taken in the current study.
In a meeting to decide the design of the study, the following suggestions are made concerning
the conduct of the experiment. For each of the suggestions say whether or not you think it is
appropriate giving a reason for your answer.
(A) Amy suggests that a placebo should be used in addition to the three doses of the drug.
What is a placebo and why might its use be desirable?

(B) Ben says that the study should be conducted as a double-blind study. Explain what this
means, and why it might be desirable.
(C) Claire says that she is willing to be “the subject” for the study (i.e. to take different doses
of the drug and to have her response measured as often as is needed). Give one point in
favour of, and one point against this proposal.
(D) Don suggests that it would be better to have 24 subjects, and to allocate them at random
to the different drug doses. Give a reason why this design might be better than the one
suggested by Claire, and briefly explain how you would do the randomisation.
(E) Erin claims that it would be better to use 8 subjects, with each subject taking, on separate
occasions, each of the three different doses of the drug. Give one point in favour of, and
one point against this claim, and explain how you would do the required randomisation.
1.6 For the experimental situation described below, identify the experimental units, the explana-
tory variable(s), and the response variable.
Can aspirin help heart attacks? The Physicians’ Health Study, a large medical experiment
involving 22 000 male physicians, attempted to answer this question. One group of 11 000
physicians took an aspirin every second day, while the rest took a placebo. After several years
it was found that the subjects in the aspirin group had significantly fewer heart attacks than
subjects in the placebo group.

1.7 In most cases, data can be viewed as a sample, which has been obtained from some population.
The population might be real, but more often it is hypothetical. Our statistical analysis of the
sample is intended to enable us to draw inferences about this population. In many cases, we
would like the inference to be even broader. For example:
45 first-year psychology students at the University of Melbourne undertake a task and their
times to completion are measured.
This can be regarded as a sample from the population of first-year psychology students at the
University of Melbourne. We may wish to apply our results to all undergraduate students of
the University of Melbourne; maybe all university students; or even all adults.

For each of the following data sets:


i. What population would correspond to this sample? Is this population real or hypotheti-
cal?
ii. Under what circumstances would you be prepared to apply conclusions drawn from
analysis of these data to a larger (more general) population?
(a) 16 women attending the Omega weight loss program have their weight loss recorded
after six months.
(b) 20 items from a production line at Grokkle Manufacturing are tested for defects.
Consider a sample with some treatment applied. For example:
45 first-year psychology students at the University of Melbourne undertake a task (having
smoked a marijuana joint) and their times to completion are measured.
This can be regarded as a sample from the population of first-year psychology students at the
University of Melbourne, having smoked a marijuana joint. This is hypothetical: we have to
imagine “what if . . . ”. We may wish to apply our results to all undergraduate students of the
University of Melbourne; maybe all university students; or even all adults . . . in each case,
having smoked a marijuana joint.

Answer the above questions (i. and ii.) for each of the following:
(c) 30 patients in a Melbourne geriatric care facility were cared for using a new more physi-
cally active (PA) regime and their bewilderment ratings are recorded.
(d) 24 women with breast cancer requiring surgery at the Metropolitan Hospital in 2004 were
treated with radiation during surgery. Their five-year survival outcomes were observed.

1.8 You plan to conduct an experiment to test the effectiveness of SleepWell, a new drug that is
supposed to reduce insomnia. You plan to use a sample of subjects that are treated with the
drug and another sample of subjects that are given a placebo.
(a) What is ‘blinding’ and how might it be used in this experiment?
(b) Why is it important to use blinding in this experiment?
(c) What is a completely randomised design? How would this be implemented in this ex-
periment?

(d) What is replication, and why is it important? Does it apply to this experiment? If so,
how?

1.9 As part of a study investigating the effect of smoking on infant birthweight a physician exam-
ines the records of 40 nonsmoking mothers, 40 light-smoking, and 40 heavy-smoking mothers.
The mean birthweights (in kg) for the three groups are respectively 3.43, 3.29 and 3.21.
(a) What are the response and explanatory variables?
(b) Is this a designed experiment or an observational study? Explain your choice.
(c) What are the potential confounding variables in this case? Explain how you would elim-
inate the effect of at least some of the variables.

1.10 The cause/correlation diagram below shows the effect of a confounding variable C on the
relation between an intervention X and disease outcome D.

              C
          (−) /  \ (+)
        X –(?)→ D
What effect does randomisation have on this diagram? Use it to explain how randomisation
neutralises the effect of any possible confounding variable C.
1.11 You plan to conduct an experiment to test the effectiveness of the drug L, a new drug that is
supposed to reduce the progression of Alzheimer’s disease. You plan to use subjects diagnosed
with early stage Alzheimer’s disease; and you and your associates have found forty suitable
subjects who have agreed to take part in your trial.
Write a paragraph outlining the steps that you would follow in running this clinical trial. Men-
tion the following: experiment; placebo; control; randomisation; follow-up; measurements.
Suppose that analysis of the results of the data resulting from this study show that there is
a significant benefit for patients using L, would this indicate that the drug is a cause of the
benefit? Explain.

1.12 Compare and contrast:


(a) experimental study and observational study;
(b) cohort study and case-control study;
(c) treatment and control;
(d) blind and double-blind studies;
(e) blocking and stratification;
(f) confounding and lurking variables;
(g) randomisation and balance;
(h) matched and independent samples;
(i) cause and association.

1.13 Consider each of the following studies in relation to the question “Does reducing cholesterol
reduce heart-disease risks?” In each case, indicate the type of study involved and discuss
whether the information obtained might help in answering the research question.
[1] A questionnaire about heart disease includes the question asking whether “reducing
cholesterol reduces heart-disease risk”. 85% of the general population, and 90% of medi-
cal practitioners agreed with this statement.
[2] A group of patients with heart problems attending the Royal Melbourne Hospital outpa-
tient clinic is assessed. Each of these patients is matched with another patient of the same
gender, same age, similar BMI, same SES status, but with no heart problem. The choles-
terol level for each of the heart patients is compared with that of the matched individual.
[3] A large number of individuals aged 40–49, with no current heart problems, are selected
from patients attending a large medical clinic, and their cholesterol levels are measured.
The individual is classified as L (low cholesterol) or H (high cholesterol). These individu-
als are followed up for ten years and the proportion who develop heart problems in each
group is compared.

[4] A large number of individuals aged 40–49, with no current heart problems, are selected
from patients attending a large medical clinic, and their cholesterol levels are measured.
These individuals are followed up and the cholesterol levels are measured again after five
years. The individuals are then classified as LL (low cholesterol initially, low cholesterol
after five years), LH (low, high), HL (high, low) and HH (high, high). After ten years, the
proportion of individuals who develop heart problems in each group is compared.
[5] A large number of volunteers with high cholesterol levels are randomly assigned to one
of two diet regimes:
(S) standard but reduced diet, with vitamin supplement;
(L) low-cholesterol diet, with low-dose cholesterol reducing drug.
The individuals are followed for ten years and their cholesterol and heart condition mon-
itored.

1.14 Research showed that older people who walk slowly are much more likely to die in the near
future. A study in the British Medical Journal divided 3200 men and women over 65 into the
third who walked slowest, the middle third and the briskest third. During the next five years,
those in the slowest third were 1.4 times likelier to die from any cause, compared to those who
walked faster. Slow-coaches, i.e. people in the slowest third, were 2.9 times likelier to die from
heart-related causes. (Dumurgier et al., BMJ 2009)
(a) What sort of study is this?
(b) On the basis of this, Mrs Green has been encouraging her mother to walk faster. Is this a
good idea? Explain. Comment on ‘cause’ in relation to the finding of this study.

Chapter 2

EXPLORATORY DATA ANALYSIS

“Data! Data! Data! I can’t make bricks without clay.”


Sherlock Holmes, The Copper Beeches, 1892.

2.1 Introduction

Data are the raw material of any empirical science, whether it be agriculture, biology, en-
gineering, psychology, economics or medicine. Data usually consist of measurements or
scores derived from experiment or observation: for example, yields of a crop on a num-
ber of experimental plots, cell counts in biological specimens, strength measurements on
batches of concrete, scores obtained by children on a spatial ability test, monthly measure-
ments of inflation and unemployment, or patient assessment of new medical treatments.
Data can be obtained from:
• experiments;
• observational studies;
• polls and surveys;
• official records, government reports or scientific papers.
A data set is rather like a raw mineral which must be treated and refined in order to extract
the useful minerals. Most raw data come in the form of long lists of numbers or codings,
which must be treated and refined to extract the useful information. The methods of treat-
ment and refinement of mineral ores are chemical; those required for data are statistical.
Data analysis is the simplifying, reducing and refining of data. It is the procedure rather
than the result.
Data analysis achieves a number of things:
• discovering the important features of the data (exploration)
• improving the understanding of the data (clarification)
• improving communicability of the information in the data (presentation)
• facilitating statistical inference (validation)


Quality data presentation


Researchers have investigated good and bad ways of representing data, using a mix of
empirical research and creative flair. A ground-breaking and award-winning book on the
topic is “The Visual Display of Quantitative Information”, by Edward Tufte (Graphics Press,
Connecticut, 1983), and includes such memorable terms and expressions as “chartjunk”
and the “data-ink ratio”, the latter defined as the ratio of the ink used to represent data to
the total ink used to print the graphic (high in good graphics). A more recent book that
includes a lot of the theory based on empirical research is “The Elements of Graphing Data”,
by William Cleveland (Summit Hill, NJ, Hobart Press, 1994).
Many of the principles that these writers have espoused amount to thoughtful common
sense; others are less obvious and have arisen out of their research. Many graphics pro-
duced by commonly-used software adhere to these principles reasonably well; some graph-
ics produced from such software do not. In general, the default graphics from most soft-
ware can be improved.
Edward Tufte wrote: “Data graphics are paragraphs about data.” Tufte was writing about
the integration of data and text, but it is also possible to see the analogy the following way.
A good graph should be about a single idea (the “paragraph”), and contain the data (the
“words”) arranged in coherent and meaningful ways (the “sentences”).
Guidelines for good practice are:
• A good graph has clear and informative labelling. This applies to both the caption
and the parts of the graph. The reader should not have to guess what anything means,
or interpret the meaning of an abbreviation. Where it is reasonable and possible to do
so, the units of a variable should be included.
• For the purposes of comparisons in graphs, as much as possible, line up the features
to be compared along a common scale. Research shows humans are best at visual
comparison when a common linear scale is used, as opposed to many other possi-
ble ways of representing things to be compared, for example, by using volumes, or
angles, or lengths not lined up along a common scale.
• Minimize the amount of ink, including colour shading, by eliminating the use of ink
that does not communicate anything meaningful.
• Avoid distortions from spurious use of perspective and other artistic tricks.
Cleveland, in ”The Elements of Graphing Data”, developed a theoretical framework of sta-
tistical graphics, which leads to practical recommendations. The theory is based on an
encoding/decoding paradigm. This leads to a number of practical consequences.
Firstly, if we are serious about communicating statistically, we will want to make decoration
and adornment a rather secondary consideration. A graph may be beautiful but a failure:
the visual decoding has to work.
Secondly, to make graphs work we need to understand graphical perception: what the eye
perceives and can decode well. Cleveland carried out research which led to conclusions
about the order of the accuracy with which we carry out these tasks. The order, from most
accurate to least accurate, is:
1. Position along a common scale;
2. Position along identical, non-aligned scales;
3. Length;
4. Angle and slope;
5. Area;

6. Volume;
7. Colour hue, colour saturation, density.
What does this mean, exactly? It is saying the human eye/brain system is best at judging
the differences between quantities lined up along a common scale, and poor at distinguish-
ing quantities represented proportionally to (a two dimensional representation of) volume,
for example. While the details of this ordering are not obvious, the order does not seem
contentious and conforms to our common experience. The property identified as best is
exploited in many of the standard forms: histograms, dotplots, scatter plots, time series
graphs, boxplots, all line up quantities to be compared along a common linear scale.
This leads to the basic principle:
Encode data on a graph so that the visual decoding involves tasks as high as
possible in the ordering.
Implementation of this simple idea corrects many basic errors in graphs. For example, pie
charts require the viewer to compare angles, which we are rather bad at. Using 3D plots
generally takes one away from tasks higher up in the hierarchy, and requires assessments
of volume, for example: a bad idea.
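As a minimal sketch of this principle (with made-up shares for four hypothetical categories), the same data can be encoded by position along a common scale, which decodes easily, or by angle, which does not:

> shares <- c(A = 35, B = 30, C = 20, D = 15)   # hypothetical percentages
> dotchart(sort(shares), xlab = "share (%)")    # task 1: position along a common scale
> pie(shares)                                   # task 4: angle -- much harder to judge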
Quantitative statement
A quantitative statement is often the most apparent (and therefore important) product of
data analysis and/or statistical inference. It is important therefore that any quantitative
statement intended to represent a data set should be accurate and clear. Unfortunately this
is often not the case.
A quantitative statement derived from a set of data may be junk because
• the data set itself is junk;
• the data analysis is incorrect or inappropriate;
• the quantitative statement is distorted; e.g. selectively abbreviated or added to.
The media are an abundant supply of such junk.
Data analysis and statistical inference
Data analysis comes before statistical inference, historically and practically. Data analysis
has been compared to detective work: finding evidence and investigating it. To accept
all appearances as conclusive would be wrong, as some indications are accidental and mis-
leading; but to fail to investigate all the clues because some, or even most, are only accidents
would be just as wrong. This is equally true in crime detection and in data analysis.
It is then the problem of statistical inference to sort out the real indications from the red her-
rings. To carry the crime detection analogy further, we might think of statistical inference
as the courts: unless the detectives turn up some evidence, there will be no case to try.
It is worthwhile to note the difference between the two most important ways data are pro-
duced:
1. observational study, and
2. experimental study.
In deriving any scientific law, the observational study always comes first (often indicating
the possible form of the scientific law); then a carefully planned and designed experiment
is required. Exploratory data analysis is an essential tool in investigating data from an
observational study. The same tools are also relevant in examining data from controlled
experiments.

The computer is a very useful piece of equipment in data analysis, particularly for large
data sets. However, the computer is not a data analyst (and even less is it a statistician).
Data analysis and statistical inference involves three steps:
1. selecting the appropriate technique
2. obtaining results by applying that technique
3. interpreting the results
The computer is very useful in the second of these steps, but of little use for the other
two. The uncritical use of package programs for data analysis or statistical
inference is fraught with danger. One cannot leave the selection of the technique or the
interpretation of the results to the computer.

2.2 Tables and diagrams

Both tables and diagrams are very useful tools in data analysis. They are both essentially
summaries of the data, and both are useful at two stages.
1. preliminary analysis (as an aid to understanding)
2. presentation of results (as an aid to communication)
As a general rule, tables contain more detail, but the message is easier to see in a graph.

2.2.1 Tables

Why? (what is the purpose of the table?)


• for the record: collection of data in accessible form;
• special purpose tables to indicate a particular feature of the data;
• for data to which repeated references are to be made.
When? (i.e. should it be in a table or in the text?)
• tabular form gives greater emphasis to the data presented;
• a long list in the text is difficult to read and comprehend — a table is much more
easily understood;
• tables are neater.
How? (aim for simplicity and clarity)
• tables should be self-contained;
• the title is useful as a labelling and display device;
• headings should be clear, concise and unambiguous;
• numbers should be rounded to two or three significant figures;
• spacing and ruling: the figures to be compared should be close, the layout should be
planned;
• ordering of rows and columns in order of decreasing average can aid in comprehen-
sion;
• exceptional values should be noted and excluded from averaging.

The preparation of a dummy table following these guidelines before the data are collected
is a useful exercise. The principles given above apply equally to the use of tables in the
preliminary analysis stage.

EXAMPLE 2.2.1: (Blood groups)


The table below was obtained from data given in Wikipedia. It gives, for a
selection of nations, the proportion of the population that are Rhesus positive:

nation (alphabetical)    Rh positive      nation (by percentage)   Rh positive

Australia                  81.0%          Taiwan                     99.7%
Belgium                    84.7%          HongKong                   99.3%
Denmark                    84.0%          India                      95.9%
Finland                    87.0%          SaudiArabia                92.8%
France                     85.0%          Israel                     90.0%
Germany                    85.0%          Turkey                     89.0%
HongKong                   99.3%          Finland                    87.0%
India                      95.9%          France                     85.0%
Israel                     90.0%          Germany                    85.0%
Netherlands                83.7%          Belgium                    84.7%
NewZealand                 82.0%          Denmark                    84.0%
SaudiArabia                92.8%          Sweden                     84.0%
Sweden                     84.0%          Netherlands                83.7%
Taiwan                     99.7%          UnitedKingdom              83.0%
Turkey                     89.0%          NewZealand                 82.0%
UnitedKingdom              83.0%          Australia                  81.0%

The data in each table are the same, but ordering the table by Rh-positive percentage provides much more useful information than the table ordered alphabetically by nation. The same applies for a dotchart; a sketch follows.

2.2.2 Diagrams

In presentation, diagrams can make results clear and memorable. The message has much
more impact in a diagram than in a table. In exploratory analysis, plotting data in some
way (or in several ways) is a very useful aid to understanding and seeing trends.
Note that plotting implies rounding.
Diagrams are not as good as tables in communicating detailed or complex quantitative
information. In presentation (and in analysis) an important principle is simplicity: If the
diagram becomes too cluttered, the message is lost or confused. You should ask yourself
“What should the diagram be saying?” and “What is it saying?”

One problem with graphs and diagrams is that it is quite easy to create a false impression:
incorrect labelling, inappropriate scaling or incorrect dimensionality are common causes of
misleading diagrams. Thus, some care should be taken with the presentation of a diagram.
Basic principles
• Show the data clearly and fairly. You should include units and scales; axes and grids; labels
and titles, including source.
• Ask: Is the diagram/table clear? . . . to you? . . . to your reader? What point are you
trying to make with the diagram/table? Is it being made?
• Use simplicity in design. Avoid information overload. Avoid folderols (pictograms, fancy
edging, shading, decoration and adornment).
• Keep the decoding simple. Use good alignment on a common scale if at all possible; use
gridlines; consider transposition of axes. Take care with colour.

2.3 Types of variables

We need to distinguish variable types because the different types of variable require differ-
ent methods of treatment. The classification of variables is indicated as follows:

ordered?
   no  → categorical
   yes → scaled?
            no  → ordinal
            yes → rounding error?
                     no  → discrete numerical
                     yes → continuous numerical
Numerical variables may be further classified: meaningful zero?
   no  → interval
   yes → ratio

Categorical data (also called qualitative or nominal data) are usually comparatively simple
to handle — there are only a few techniques for dealing with such data.
Examples of categorical variables: gender, colour, race, type of tumour, cause of death.
Numerical data (discrete and continuous) are our main concern: there are a wide variety
of techniques for handling such data.
Examples of discrete numerical variables: family size, number of cases (of some disease,
infection), number of failures; [usually count data];
Examples of continuous numerical variables: weight, height, score, cholesterol level, blood
pressure; [usually measurement data].
Ordinal data are something of a problem: they can be treated as categorical data, but this
loses valuable information. On the other hand, they should not be treated as numerical data
because of the arbitrariness of the unknown scale. Some methods, correct for numerical
data, may give quite misleading results for ordinal data.

Examples of ordinal variables: grades, degree of satisfaction, ratings, severity of injury.


It should be noted that these variable types are hierarchical:

categorical variable = category


ordinal variable = category + order
numerical variable = category + order + scale

Thus an ordinal variable can be treated as a categorical variable (ignoring the ordering);
and a numerical variable can be treated as an ordinal variable (ignoring the scaling) or as a
categorical variable (ignoring the ordering and the scaling).
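In R, this hierarchy is reflected in how a variable is stored: a factor for categorical data, an ordered factor for ordinal data, and a numeric vector for numerical data. A minimal sketch, using hypothetical severity ratings:

> rating <- c("mild", "severe", "moderate", "mild")
> f <- factor(rating)                        # categorical: categories only
> o <- factor(rating, levels = c("mild", "moderate", "severe"),
              ordered = TRUE)                # ordinal: categories + order
> o[1] < o[2]                                # comparisons make sense for ordered factors
[1] TRUE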

EXAMPLE 2.3.1: (Forced expiratory volume, FEV)


FEV is an index of pulmonary function that measures the volume of air ex-
pelled after 1 second of constant effort. The data set FEV.DAT contains determi-
nations of FEV in 1980 on 654 children aged 3–10 who were seen in the Child-
hood Respiratory Disease Study (CRD study) in East Boston, Massachusetts.
These data are part of a study to follow the change in pulmonary function over
time in children from smoking and non-smoking households.

Data on the following variables are available.

ID number
Age (years)
FEV (litres)
Height (cm)
Sex 0 = female, 1 = male
Household smoking status 0 = non-smoking, 1 = smoking.

(a) What is the underlying population?


(b) How big is the sample?
(c) What is the response variable here?
(d) What are the explanatory variables?
(e) What is the aim of the study?
(f) Classify each variable (as numerical, ordinal or categorical).

2.3.1 Some general comments on data handling

Ordering
Since categorical data have no order, we can choose an appropriate one for presentation,
e.g. decreasing frequency. Ordinal data have a specified order.
Coding
For categorical and ordinal data it is often convenient to code the data to numerical values:
for example, female = 1 and male = 2, or strongly disagree = 1, disagree = 2, neutral =
3, agree = 4 and strongly agree = 5. It must be remembered though that this is just for
convenience: the data cannot be treated as numerical data. [average gender = 1.46?]
Checking
Checking is necessary whenever we deal with data (whether or not we use a computer —
perhaps more so with a computer). Checking is important, yet it should not be so extensive
that it becomes too time consuming. One of the most important checks is common sense
(experience): do the results and conclusions agree with our common sense? If not, why
not? Can we explain the differences?
Significant figures
In preliminary data analysis most people can handle with meaning at most three significant
figures, in most cases two is better. This also applies to the reader of the report of our
analysis, so that two or three figures is usually best for the presentation of our results.
When we write x = 1.41, we can mean that x is measured as 1.41 (to two-decimal accuracy);
or we can mean that we have calculated x using some formula and are reporting the result to
two-decimal accuracy. Thus this x might actually be equal to √2. In statistics, it is preferred
that numbers are rounded off at a meaningful level.
This can lead to results that may seem odd. For example, we may write 0.33 + 0.33 + 0.33 =
1.00. This is “correct” if we are reporting 1/3 + 1/3 + 1/3 = 1, correct to two decimal places.
Transformations
If the data set contains numbers of widely differing size, they may be brought onto the
same scale, by taking logs for example. Of course this considerably warps the original
scale so that some care may be needed in interpretation.
Two transformations that are quite commonly used are:
the log transformation:
   y = ln x, which transforms (0, ∞) to the real line (−∞, ∞);
the logistic transformation:
   y = logit(x) = ln(x/(1−x)), which transforms (0, 1) to the real line (−∞, ∞).

x log(x) x logit(x)
0.001 –6.9 0.001 –6.9
0.05 –3.0 0.05 –2.9
0.2 –1.6 0.2 –1.4
1 0.0 0.5 0.0
5 1.6 0.8 1.4
20 3.0 0.95 2.9
1000 6.9 0.999 6.9
The log transformation is often used for positive data that has a long tail. The logit trans-
formation is often used for proportions, or bounded data.
If we have data like x in either of the above tables, then the log or logit transformation
converts it to a “sensible” scale.
Note: If the data are restricted to (a, b) then the transformation y = ln((x−a)/(b−x)) can be
useful: it transforms (a, b) to the real line (−∞, ∞).
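The values in the table above can be checked in R; here the logit is computed directly (base R's qlogis() is equivalent):

> x <- c(0.001, 0.05, 0.2, 1, 5, 20, 1000)
> round(log(x), 1)
[1] -6.9 -3.0 -1.6  0.0  1.6  3.0  6.9
> p <- c(0.001, 0.05, 0.2, 0.5, 0.8, 0.95, 0.999)
> round(log(p / (1 - p)), 1)   # logit transformation
[1] -6.9 -2.9 -1.4  0.0  1.4  2.9  6.9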

2.4 Descriptive statistics

2.4.1 Univariate data

We consider first the analysis of one-variable data — i.e., data consisting of a collection of
values of one variable (such as height or IQ or voting preference). Data sets consisting of
observations on more than one variable can always be subdivided into univariate data sets
and the variables analysed separately. However, methods of joint analysis are important.
Representations of bivariate data are mentioned later in this chapter, and their analysis is
considered in more detail in Chapter 8.

We will use three main data description techniques:


1. frequency distributions;
2. cumulative frequency distributions and quantiles;
3. moment statistics.
As we have seen, an ordinal variable contains more information than a categorical variable
and a numerical variable contains more information than an ordinal variable. This is re-
flected in the data analysis: the more information in the variable, the more that can be done
with it. Thus the treatment depends on the variable type.
                                        variable type
Data Description Technique       categorical   ordinal   numerical
frequency distribution                √            √          √
cum freq distn / quantiles            ×            √          √
moment statistics                     ×            ×          √

We look at the more important and useful statistics; and mention a few others.

2.4.2 Numerical statistics

Data can be summarised in two main ways: using numbers, or using graphs. These are
useful for different purposes. Numbers are good if you want to be exact, but it is harder to
present large amounts of information with them. Graphs are the opposite: it is easy to get
a good “sense” of the data, but some of the finer points may be hidden.
Since graphs are often based on numbers, we look at numerical statistics first. We don’t
want to show all the numbers in the data — that’s too much information! Instead, we want
to summarise the data using a small but meaningful set of numbers.

EXAMPLE 2.4.1: To begin, let’s look at a simple example:

x: 4 5 4 6 1 9 7 3 12 5

In R, create the vector as

> x <- c(4,5,4,6,1,9,7,3,12,5) # c stands for combine


> x
[1] 4 5 4 6 1 9 7 3 12 5
Descriptive statistics are numbers derived from the data to describe various
features of the data. This is an example of an R descriptive statistics output,
using the function summary:

> summary(x)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 4.00 5.00 5.60 6.75 12.00

This output is appropriate no matter whether x is discrete or continuous (and


the above data could actually be either!)
QUESTION: How could this sample be an observation on a continuous variable?
Location and spread are the two basic features that we need to describe any set of data or
the population from which the data was sampled. Collectively they allow us to summarize
the important features of the data set.

2.4.3 Measures of location

A measure of location is a single value that is representative of the data. Thus it should be
a value that is as close as possible, in some meaningful sense, to all observations in the data,
so that it can ‘speak’ for every datum in the sample. Measures of location are also called
measures of central tendency.
The most commonly used measure of location is called the sample mean, the arithmetic
mean or, simply, the average. The sample mean is defined as
   mean = (sum of all observations) / (number of observations),
or more formally as follows.

DEFINITION 2.4.1. For a set of observations x1 , . . . , xn , the sample mean x̄ is defined as
   x̄ = (1/n) ∑ xi = (1/n)(x1 + x2 + · · · + xn ).

Useful properties:
• The sum of the deviations of the data {x1 , x2 , . . . , xn } about the sample mean x̄ is 0.
Mathematically this means (x1 − x̄) + (x2 − x̄) + · · · + (xn − x̄) = 0, so the arithmetic
mean balances out the negative and positive deviations in the data. In this sense the
sample mean is the centre of mass of the data. Hence extreme observations can have a
big effect on the sample mean.
• The sample mean is also the value that minimizes the total squared deviation of the
data about it. In other words, (x1 − a)² + (x2 − a)² + · · · + (xn − a)² is minimized at
a = x̄. In this sense, x̄ is close to all observations in the data.
In the case of grouped data, where fj = freq(uj ), j = 1, 2, . . . , k:
   ∑ xi = ∑ fj uj , so that x̄ = (1/n) ∑ fj uj .
Example (die rolling):
   u:  1   2   3   4   5   6
   f:  6  10  11   8   7   8
∑ fj uj = 174, so that x̄ = 174/50 = 3.48 (≈ µ = 3.5).
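A quick check of the grouped-data calculation in R:

> u <- 1:6; f <- c(6, 10, 11, 8, 7, 8)   # die faces and their frequencies
> sum(f * u) / sum(f)                     # grouped-data mean, 174/50
[1] 3.48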

Another useful measure of location is the median. The median is the value that divides
the data (arranged in increasing magnitude) in two halves. At least half the data is smaller
than or equal to the sample median. Equivalently, at least half the data is greater than or
equal to the median.

DEFINITION 2.4.2. The sample median, denoted by m̂ or ĉ0.5 , is the “middle value” of
the data. In other words, it is the value that separates the bottom half of the data from
the top half of the data.

Useful properties
• The sample median is the value that minimizes the total absolute deviation of the
data about it. In other words, |x1 − a| + |x2 − a| + · · · + |xn − a| is minimized at a = m̂.
In this sense, m̂ is close to all observations in the data.
• Unlike the sample mean, the sample median is less affected by extreme observations
in the data.
These are the most important and useful measures of location. However, others may be
used: for example, the sample mid range, the sample mode, the trimmed sample mean.

Note: the sample mode denotes the most frequent or the most common value; and there-
fore is not really a (good) measure of location.

EXAMPLE 2.4.2: For the example data on page 47, we have:

Mean = x̄ = sample mean = (1/n) ∑ xi = 56/10 = 5.6;

Median = ĉ0.5 = sample median = 5;
   (middle observations of the ordered sample: 1 3 4 4 5 5 6 7 9 12)

TrMean = trimmed mean = 43/8 = 5.375;
   [In R one can specify the fraction of data to trim from each end. For example, trim=0.1
   removes 10% of the observations (rounded down) from each end before averaging.]

Min = sample minimum, Max = sample maximum;

Q1 = lower (first) quartile = ĉ0.25 = x(2.75) ;

Q3 = upper (third) quartile = ĉ0.75 = x(8.25) . [(Q1, Q3) contains about 50% of the
sample.]

In R:

> mean(x) # mean


[1] 5.6
> mean(x, trim=0.1) # trimmed mean
[1] 5.375
> min(x); max(x) # minimum and maximum
[1] 1
[1] 12
> quantile(x, c(0.25, 0.75), type=6) # first and third quartiles
25% 75%
3.75 7.50
Here, we have introduced some notation for the sample median and quartiles, which are
special cases of sample quantiles. Now, for small samples, we usually can’t get a proportion
q of the sample exactly: for example, what is a quarter of a sample of nine? . . . and how
can we find a number such that a quarter of the sample is less than it? There are several ways
of defining the sample quantiles. They all fit the definition that a proportion of about q is
less than it. We use the following definition:

DEFINITION 2.4.3.
1. If the sample x1 , x2 , . . . , xn is arranged in order of increasing magnitude: x(1) ≤
x(2) ≤ · · · ≤ x(n) [so that x(1) denotes the smallest sample variate (i.e. the
minimum) and x(n) the largest (i.e. the maximum)] then x(k) is called the kth
order statistic.
2. The sample q-quantile, denoted by ĉq , is such that a proportion q of the sample is
less than ĉq . That is, ĉq = x(k) , where k = (n+1)q.

Thus half of the sample is less than ĉ0.5 , and so the 0.5-quantile, ĉ0.5 , is the median.
Note: A common notation, which you will get to see much more of, is the ‘hat’-notation. It denotes
‘an estimate of’ and/or ‘a sample version of’. Thus ĉ0.5 denotes an estimate of c0.5 ,
the population median. As a sample is often used to estimate a population, many of the sample
characteristics are ‘hatted’. Not all though: we prefer x̄ to µ̂, for example.
For the above sample, x(1) = 1, x(2) = 3, x(3) = 4, . . . , x(10) = 12.
So what is x(2.75) ? x(2.75) is taken to be 0.75 of the way from x(2) = 3 to x(3) = 4; thus
x(2.75) = 3.75. Check that x(8.25) = 7.5.

EXAMPLE 2.4.3: For the following sample (of sixteen observations), find the
sample median and the sample quartiles.
5.7, 4.5, 17.7, 12.3, 20.1, 6.9, 2.3, 7.0, 8.7, 8.4, 14.6, 10.0, 6.1, 9.1, 10.0, 10.7.

The data must first be ordered, from smallest to largest. This gives:
2.3, 4.5, 5.7, 6.1, 6.9, 7.0, 8.4, 8.7, 9.1, 10.0, 10.0, 10.7, 12.3, 14.6, 17.7, 20.1;
which specifies the order statistics: x(1) = 2.3, x(2) = 4.5, . . . , x(16) = 20.1.

The median ĉ0.5 = x(8.5) , since k = (16+1)×0.5 = 8.5. The median is half-way
between x(8) and x(9) , i.e. half-way between 8.7 and 9.1. So, ĉ0.5 = 8.9.

The lower quartile ĉ0.25 = x(4.25) , since k = (16+1)×0.25 = 4.25. Thus, the
lower quartile is a quarter of the way between x(4) = 6.1 and x(5) = 6.9. So,
ĉ0.25 = 6.3.

Similarly, ĉ0.75 = x(12.75) = 11.9, since x(12) = 10.7 and x(13) = 12.3.
Note: In R, quantiles are computed using the function quantile() which allows you
to specify 9 commonly used empirical quantile definitions. The above definition is met
by specifying the option type=6 when using quantile() or summary(). To see all the
quantile types see help(quantile).
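The hand calculations in Example 2.4.3 can be checked directly:

> x <- c(5.7, 4.5, 17.7, 12.3, 20.1, 6.9, 2.3, 7.0, 8.7, 8.4,
         14.6, 10.0, 6.1, 9.1, 10.0, 10.7)
> quantile(x, c(0.25, 0.5, 0.75), type = 6)
 25%  50%  75%
 6.3  8.9 11.9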

EXAMPLE 2.4.4: The numbers given below represent 20 observations from


a failure time distribution like the one illustrated in the example above (see
page 55).

236 1 59 177 75 440 11 172 56 264


215 262 158 62 348 9 110 84 39 800

Find the sample median and the sample mean. [134 & 178.9]
Why should you expect that the sample mean is greater than the sample me-
dian?
The distribution is positively skew, i.e. has a long tail at the positive end, so the mean
will be larger than the median: it gets pulled towards the longer tail. The population
distribution is positively skew, so even before the sample is taken, we should expect that
the sample mean will be greater than the sample median, since the sample will resemble
the population distribution.
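A quick check of these values in R:

> x <- c(236, 1, 59, 177, 75, 440, 11, 172, 56, 264,
         215, 262, 158, 62, 348, 9, 110, 84, 39, 800)
> median(x); mean(x)
[1] 134
[1] 178.9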

2.4.4 Measures of spread

Measures of location only tell us about a central or typical or representative value of a sample.
However, to assess the differences between observations, we need to study the variation in
the data. Measures of spread describe the variability in a sample, or in its population, about
some measure of location or from one another. The sample variance is the most commonly
used measure of spread for numeric data. It is defined as follows.

[Figure 2.1: Most data have variation.]

DEFINITION 2.4.4. For a set of observations x1 , . . . , xn , the sample variance is defined as
   s² = (1/(n−1)) ∑ (xi − x̄)².

Roughly speaking, the sample variance of a data set is the average squared distance of the sample
observations from the sample mean. To reverse the squaring process, we define the sample
standard deviation:

DEFINITION 2.4.5. The sample standard deviation is
   s = √(s²),
i.e. the square root of the sample variance.

The most convenient form of s² (and therefore s) for hand-computation is:
   s² = (1/(n−1)) [ ∑ xi² − (1/n)(∑ xi)² ].
An even easier method is to use a computer or a calculator with an s button.
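For the sample on page 47, the definition, the computational form and R's built-in var() agree:

> x <- c(4, 5, 4, 6, 1, 9, 7, 3, 12, 5)
> sum((x - mean(x))^2) / (length(x) - 1)               # definition
[1] 9.822222
> (sum(x^2) - sum(x)^2 / length(x)) / (length(x) - 1)  # computational form
[1] 9.822222
> var(x)                                               # built-in
[1] 9.822222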


Calculating s² from grouped data, with fj = freq(uj ), j = 1, . . . , k:
   ∑ xi = ∑ fj uj ,   ∑ xi² = ∑ fj uj²   (and n = ∑ fj ), so
   s² ≈ (1/(n−1)) [ ∑ fj uj² − (∑ fj uj)²/n ].
In the die-rolling example, ∑ fj uj = 174 and ∑ fj uj² = 736, so:
   s² ≈ (1/49) (736 − 174²/50) = 2.663.
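And the same check in R:

> u <- 1:6; f <- c(6, 10, 11, 8, 7, 8)
> n <- sum(f)
> (sum(f * u^2) - sum(f * u)^2 / n) / (n - 1)   # grouped-data variance
[1] 2.662857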

Usually, the range (x̄ − 2s, x̄ + 2s) will contain roughly 95% of the sample.
It is quite possible to observe samples or populations with the same mean but differing
standard deviations, or vice versa, as shown in Figure 2.2 below.
Another measure of spread is the sample interquartile range:

DEFINITION 2.4.6. The sample interquartile range is

τ̂ = IQR = Q3 − Q1 or ĉ0.75 − ĉ0.25 .

The sample interquartile range is a single number: it is the difference, not the interval.

[Figure 2.2: Distributions can be different in various ways. (a) Same mean, different standard deviation. (b) Same standard deviation, different mean.]

EXAMPLE 2.4.5: For the example data from page 47:

> sd(x) # standard deviation


[1] 3.134042
> IQR(x, type=6) # interquartile range
[1] 3.75

2.4.5 Graphical representations

Numerical representations are great for investigating particular aspects of the data, but to
get an overall sense of it, it is better to use a graphical representation. There are many
graphical representations, based on different properties of the data.

Frequency data: dotchart, dotplot, bar graph, histogram.


Barcharts and piecharts
The distribution of a categorical variable is typically graphed with a bar chart. Each bar
represents the frequency (or percentage) of observations in each category.

EXAMPLE 2.4.6: (Greenhouse gas emission)


Data relating to the contributions of various sources to Australia’s greenhouse
gas emissions:
Agriculture 16%, Fugitive emissions 6%, Industrial processes 5%,
Land-use & forestry 6%, Passenger cars 8%, Stationary energy 50%,
Transport other than cars 6%, Waste 3%. (Royal Auto, 2008).

The first thing to do is to re-order the categories by frequency. As the variable


is categorical, it has no order, so we can choose the order.

Stationary energy 50%


Agriculture 16%
Passenger cars 8%
Transport other than cars 6%
Land-use & forestry 6%
Fugitive emissions 6%
Industrial processes 5%
Waste 3%

Below we create a barchart of these data. There are gaps between the bars, as
they represent separate categories.

> x <- c(50, 16, 8, 6, 6, 6, 5, 3)


> z <- c("Stationary energy", "Agriculture", "Passenger cars",
"Transport other than cars", "Land-use & forestry",
"Fugitive emissions", "Industrial processes", "Waste")
> barplot(x)
> pie(x, labels = z)
[Figure: barchart (left) and pie chart (right) of the greenhouse gas emission sources.]

More options for these functions are in help(barplot) and help(pie).


Bar graphs
• Suitable for categorical data, ordinal data and discrete numerical data.
• Bars should be of equal (and preferably small) width and separated from each other
so as not to imply continuity.
• Heights of bars correspond to frequencies or relative frequencies.
If the underlying variable is discrete, then we use the relative frequency function: p̂(x) =
(1/n) freq(X = x). Note the ‘hat’: p̂(x) is an estimate of the probability function p(x).

Note: freq denotes frequency; thus freq(X = 4) denotes the frequency of X = 4, i.e. the
number of times in the sample that the variable X is equal to 4. In the example be-
low, freq(X = 4) = 2, since there are two 4s in the sample. Similarly, freq(X ≤ 6) = 7,
freq(4 ≤ X ≤ 6) = 5, and so on.

However, if the underlying variable is continuous, then we would prefer to have a function
on the real numbers. We use a histogram.
Histograms
The standard approach to representing the frequency distribution of a continuous variable
is to use “binning”, i.e. putting the observations in “bins” or “groups” that cover the line.
This gives the histogram, which will be a familiar representation. It is just a bar chart, with
joined-up bars!
• A histogram is suitable for continuous data.
• A histogram has no gaps between “bars”.
• If all intervals are of the same width, then heights of “bars” can be frequencies or
relative frequencies.
• We should plot:
      height = relative frequency / interval width.
  Thus, the areas of the “bars” correspond to relative frequencies:
      f̂(x) = [freq(a < X < b)/n] / (b − a), for a < x < b.
• Use hist() to produce a histogram in R.

EXAMPLE 2.4.7: sample: 1, 3, 4, 4, 5, 5, 6, 7, 9, 12.


. . . treating this as a sample on a continuous random variable (such as age or time)
Here we use bins (groups, intervals) {0 < x ≤ 2}, {2 < x ≤ 4}, . . .

> x <- c(1, 3, 4, 4, 5, 5, 6, 7, 9, 12)


> hist(x)

[Figure: frequency histogram of x, with bins of width 2 from 0 to 12.]

EXAMPLE 2.4.8: (Failure times)


A commonly-used model for the distribution of failure times takes the form
of the density shown on the left, known as the exponential distribution. This
applies if failures are random and are equally likely to occur at any time.
[Figure: the exponential density f(x) (left) and the default R histogram of a sample from it (right).]

A random sample of 220 observations was obtained from the population dis-
tribution f (x) = (1/20)e−x/20 , x > 0. The default histogram produced by R is
shown on the right above. It is supposed to reflect the population distribution:
fˆ describes the sample, but also estimates f . A sample of random data and
histogram may be obtained as follows:

x = rexp(220, 1/20) # Generates 220 observations from f(x)


hist(x, freq=FALSE, breaks=15) # density histogram

Note that the option breaks specifies an approximate number of breaks in the
histogram. If omitted, R uses a rule of thumb based on the number of observa-
tions in the sample. The options freq=FALSE and freq=TRUE specify density
and frequency histograms, respectively.

It is standard to use equal bin width, but they can be made unequal: the graph
below has bins (0,2), (2,5), (5,20), (20,50) and (50,100). This might be done for
extremely skew distributions.

hist(x, freq=FALSE, breaks=c(0, 2, 5, 20, 50, 100))

[Figure: population density f (left) and density histogram f̂ with unequal bins (right).]

In the case of unequal bin widths, the f̂ values are obtained using the formula
above: for example, for the first bin, f̂ = (24/220)/2 = 0.055. Though, of course,
R does the calculations for you once you have set the breakpoints.

   bin            width   frequency     f̂
   0 < x ≤ 2        2        24       0.055
   2 < x ≤ 5        3        23       0.035
   5 < x ≤ 20      15        90       0.027
   20 < x ≤ 50     30        66       0.010
   50 < x ≤ 100    50        17       0.002
                            220
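The f̂ column can be reproduced directly from the definition, frequency/(n × bin width):

> freq  <- c(24, 23, 90, 66, 17)
> width <- c(2, 3, 15, 30, 50)
> round(freq / (220 * width), 3)
[1] 0.055 0.035 0.027 0.010 0.002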

Cumulative frequency data

DEFINITION 2.4.7. The cumulative relative frequency function is defined as
   F̂(x) = (1/n) freq(X ≤ x) = (number of observations ≤ x) / (total number of observations),
i.e. the relative frequency of observations less than or equal to the number x.

The population version of this is called the cumulative distribution function (cdf), denoted by F . The
cumulative relative frequency function is the sample version, and is often referred to as the sample
cdf, or the empirical cdf. It is available in R using the function ecdf() . . .

EXAMPLE 2.4.9: sample: 1, 3, 4, 4, 5, 5, 6, 7, 9, 12.

[Figure: sample cdf of the data, a step function rising from 0 to 1.]

For example: F̂(4.2) = (1/n) freq(X ≤ 4.2) = (1/10) × 4 = 0.4.

In R write: > x <- c(1, 3, 4, 4, 5, 5, 6, 7, 9, 12) and > plot(ecdf(x)).
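The value of F̂ at a single point can also be computed directly as a sample proportion:

> x <- c(1, 3, 4, 4, 5, 5, 6, 7, 9, 12)
> mean(x <= 4.2)    # proportion of observations <= 4.2
[1] 0.4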

This has the form of a step-function. Nevertheless, in the case of a continuous


variable, this graph will approximate a continuous function as the sample size
becomes large: see the following diagram, which represents a sample of 100
observations from a normal distribution.

Sample quantiles

(inverse cumulative relative frequency function)


Because the sample quantile ĉq is the number with a proportion q of the sample less than
it, it follows that the cumulative relative frequency at ĉq is q; i.e. F̂(ĉq) ≈ q.

Thus ĉq is (approximately) the inverse of F̂ : quantile = inverse cdf.

Boxplot

The simplest boxplot is a graphical representation of the “five-number summary”:


(min, Q1, med, Q3, max).
• The boxplot gives an immediate impression of not only the location and spread of the
data, but also of the symmetry (or otherwise) of the distribution.
• A boxplot indicates skewness as well as location and spread.
• Use boxplot() to produce a boxplot in R.

In a diagram of four parallel boxplots (samples A–D, not reproduced here): compared with
sample A, sample B has a greater location measure, and sample C has a greater spread measure.
Samples A, B and C are symmetrical, but sample D shows positive skewness, i.e. a longer tail
at the positive end.

One problem with this representation is that one or two outlying data values could give a
misleading impression of the spread of the distribution. For this reason, we limit the length
of the “whiskers” (the lines at either end of the box) to 1.5τ̂ , i.e., 1.5 times the interquartile
range. The line extends to the most extreme data value within these limits, i.e. ĉ0.25 −1.5τ̂
for the lower end, and ĉ0.75 +1.5τ̂ at the upper end. These are sometimes called the ‘inner
fences’. Any data value outside this interval is indicated separately. Some boxplots also define
‘outer fences’: ĉ0.25 −3τ̂ and ĉ0.75 +3τ̂ , and label points outside these limits as “extreme outliers”.
Extreme values are indicated separately on a boxplot.

It is common to label these outlying data values by individual name or case number or
some other identification. There may be some explanation of their oddity, but in any case,
the outlying data values are often of interest.

Boxplots are useful for comparing distributions.


Boxplots provide a quick visual comparison of data sets, as we have seen in the diagram
on the previous page. The boxplot can be presented horizontally, or vertically as below.

EXERCISE. A sample of 120 patients was observed following a cancer treatment. Their
recurrence times (in months) ranged from 13 months to 71 months, and are summarised in
the following stem-and-leaf plot:
1 3
1 79
2 011223333344
2 55555556666777777788888999
3 0000000000000111111122334444444
3 55555666667788889
4 000111122223333444
4 5566789
5 024
5 6
6 3
6
7 1
(a) Obtain the sample median and sample quartiles and hence draw a box-plot.
(b) Give approx values for the sample mean and sample standard deviation.

EXERCISE. (Zinc intake)


The trace element zinc is an important dietary constituent, partly because it aids in the
maintenance of the immune system. The accompanying data are on zinc intake (mg) for a
sample of 40 patients with rheumatoid arthritis.

8.0 12.9 13.0 8.9 10.1 7.3 11.1 10.9 6.2 8.1
8.8 10.4 15.7 13.6 19.3 9.9 8.5 11.1 10.7 8.8
10.7 6.8 7.4 5.8 11.8 13.0 9.5 8.1 6.9 11.5
11.2 13.6 5.9 21.1 15.7 10.8 10.7 11.5 16.1 9.9

In R, after importing the data and obtaining the variable zinc, we can obtain summary
statistics, histogram and boxplot as follows:
> summary(zinc)
Min. 1st Qu. Median Mean 3rd Qu. Max.
5.80 8.40 10.70 10.78 12.08 21.10

> hist(zinc, freq=FALSE) # density histogram of zinc


> points(density(zinc), type="l") # adds a smooth density curve
> boxplot(zinc) # gives a boxplot

[Figure: density histogram of zinc intake (mg) with a smooth density curve, and a boxplot of zinc intake.]

Comment on the distribution of zinc intake for these patients. In other words, comment on
location (centre), spread (scale, dispersion), symmetry (skewness) and any oddities (outliers, shape).

EXAMPLE 2.4.10: A sample of 200 observations is obtained. Its distribution is


positively skew: it has a long tail at the positive end.

x: 21.99, 9.02, 16.81, 16.41, . . . , 7.84, 39.34.

A log-transformation was used; y = ln x (in R: write y <- log(x)). This gives


a sample of 200 on y:

y: 3.09, 2.20, 2.82, 2.80, . . . , 2.06, 3.67.



The descriptive statistics for these samples are as follows:

> summary(x)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.261 10.250 19.370 31.800 38.790 162.400
> summary(log(x))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.2319 2.3280 2.9640 3.0200 3.6570 5.0900

The histogram (sample pdf), sample cdf and boxplot are given below for each
sample:
[Figure: histogram, sample cdf and boxplot for x (left column) and for log(x) (right column).]

The skewness is seen in the boxplot through the asymmetry of the box, and
all the points at the top end. It is observed too that the log-transformation has
removed the skewness. A log-transformation will always reduce the skewness:
in this case, it is reduced from positive to close to zero.
Other measures of shape:
skewness: negatively skew / symmetric / positively skew;
kurtosis: platykurtic (kurtosis negative) / normal / leptokurtic (kurtosis positive).

2.4.6 Bivariate data

Bivariate data, as the name suggests, consist of observations on two associated variables
for each of a number of individuals. The two variables may both be categorical; or one
may be categorical and one numerical; or both may be numerical. We consider some simple
examples to illustrate.
Two categorical variables
For two categorical variables, the simplest strategy is to combine the variables into one
(more complex) variable, and then use a barchart for the combined variable.

EXAMPLE 2.4.11: If the variables are gender (f, m) and blood-group (O, A, B, AB),
then we can combine them into a gender/blood-group variable with eight cat-
egories: (f O, f A, f B, f AB; mO, mA, mB, mAB). Here there is some sort of imposed
order on the categories that is chosen by which variable comes first. This order
indicates blood groups within genders, whereas (f O, mO; f A, mA; f B, mB; f AB, mAB)
would show the difference between genders for each blood group. The order within
each variable is arbitrary (f, m) or (m, f ); though (O, A, B, AB) seems to be ‘stan-
dard’.


One categorical and one numerical variable


The idea here is to compare the distribution of the numerical variable for each level of the
categorical variable. We have already seen how to compare distributions using parallel
boxplots (page 59). The same idea can be used for parallel dotplots or histograms.

[Figure: parallel dotplots of height (cm), roughly 150–190, for females and males.]



Two numerical variables


This scenario is the most common when we talk about “bivariate data”, and is the one
which most requires new forms of representation. A simple example is height and weight:
for each individual in the sample, the values of height and weight are observed. This is
denoted by
{(xi , yi ), i = 1, 2, . . . , n},
where xi denotes the height of individual i, yi denotes the weight of individual i, and i
takes values 1, 2, . . . , n. There are n individuals in the sample. For example:
x 170 178 175 167 182 172 165 180 162 171
y 62 76 65 56 80 70 58 75 64 74
i.e., (x1 , y1 ) = (170, 62), (x2 , y2 ) = (178, 76), . . . , (x10 , y10 ) = (171, 74).
The simplest representation of these data is a scatter diagram or a scatter-plot, the bivariate
equivalent of a dot-plot. Each pair, (xi , yi ), specifies a point in the Cartesian plane. Each
plotted point corresponds to an individual.
[Figure: scatter plot of weight against height for the ten individuals.]

A scatter-plot is obtained in R using the function plot().


x <- c(170, 178, 175, 167, 182, 172, 165, 180, 162, 171)
y <- c(62, 76, 65, 56, 80, 70, 58, 75, 64, 74)
plot(x, y, xlab="height", ylab="weight") # xlab and ylab give axis labels

Note: in some cases the data may be presented in the form

x 0 10 20 30 40
y 84 76 65 66 60
80 78 70 63 62

For example, where x represents a drug concentration and y the response of a laboratory
animal. Here there are ten animals, two at each level of the drug concentration. Thus in
this case, the data are:
(x1 , y1 ) = (0, 84), (x2 , y2 ) = (0, 80), . . . , (x10 , y10 ) = (40, 62).
In the case that the x-variable has a natural ordering (usually time) and there can be only
one y-value for each x-value, it is common to join the consecutive points by a line. This is
called a line-plot.

For example, the data corresponding to the following plot come from ice core samples ob-
tained and analysed in the 1990s from the Law Dome, near Casey Station, Antarctica. The
measurements are of carbon dioxide concentrations from air samples in the past, trapped
in the Antarctic ice.1

A scatter-plot (or a line-plot) enables us to see a relationship between the variables. We


say that there is a positive relationship between the variables if large x and large y tend
to occur together, and small x and small y tend to occur together. The extent to which
this is true reflects the strength of the positive relationship. The above plot indicates a
positive relationship. This is expected, as taller individuals tend to weigh more, and shorter
individuals tend to weigh less.

If there is a negative relationship between the variables, then large x and small y tend to
occur together; and small x and large y tend to occur together. This is illustrated in the
diagram above.
A measure of the relationship is the correlation coefficient, r, which can be obtained in R
using
cor(x,y)
using the name or column specification for the x-variable and the y-variable.
1 Source: cdiac.ornl.gov/trends/co2/lawdome.html.

Correlation is discussed in more detail in Chapter 8. For now it is enough to know that it
is a number in the interval −1 ≤ r ≤ 1; its sign reflects the type of relationship (positive or
negative) and its magnitude reflects the strength of the relationship: the most extreme values
(±1) correspond to points lying exactly on a straight line (with positive or negative slope).
For the height-weight data above, r = 0.809, indicating a moderately strong positive rela-
tionship.
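This can be verified using the height and weight vectors defined earlier:

> x <- c(170, 178, 175, 167, 182, 172, 165, 180, 162, 171)
> y <- c(62, 76, 65, 56, 80, 70, 58, 75, 64, 74)
> cor(x, y)   # 0.809 (to three decimal places)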

The connection between scatter-plots and correlation is indicated in the diagram below.
However, the scatter-plots shown are ‘standard’: the distributions are roughly symmetrical
and there are no serious outliers. Like the mean and the standard deviation, the correlation
coefficient is affected by outliers.

Each of the above scatter plots has the same apparent scale on the two axes, and this is
the ideal you should aim for when generating a scatter plot. The apparent scale reflects
what we see rather than the units on the axes. Here, the apparent scales for x and y
are similar: most of the points are within an interval of about 3cm horizontally (x-axis)
and within an interval of about 3cm vertically (y-axis). The axes are not indicated. They
could be {0, 1, 2, . . .} on each axis. However they could be {1.3, 1.4, 1.5, . . .} for x and
{600, 700, 800, . . .} for y. In that case the numerical scales would be very different although
the apparent scales are the same.
lOMoARcPSD|8938243

page 66 Experimental Design and Data Analysis

If the apparent scales are different, then the scatter plot is distorted, giving a false impres-
sion of the relationship:

The points in the above three plots are identical: only the scale is changed, in the x-axis and
in the y-axis respectively. In each case the correlation is r = 0.45, but when the apparent
scales are different we get the impression of a greater correlation: the points seem to be
nearer to a straight line. Taking the scale change to the extreme: a very large scale on the
y-axis would produce what appears to be a horizontal straight line; and similarly a vertical
straight line results from using a very large scale on the x-axis.
We have seen that x̄ and sx indicate the location and spread of the x-data: and similarly
ȳ and sy indicate the spread of the y-data. If the x-axis has marks at x̄, x̄ ± sx , x̄ ± 2sx ,
. . . and the y-axis has tick-marks at ȳ, ȳ ± sy , ȳ ± 2sy , . . . then the apparent scales are
equivalent. About 95% of the points will be in (x̄−2sx , x̄+2sx ); and about 95% will be in
(ȳ−2sy , ȳ+2sy ).

[Figure: a scatter plot with tick marks at x̄, x̄ ± sx , x̄ ± 2sx on the x-axis and ȳ, ȳ ± sy , ȳ ± 2sy on the y-axis, giving equivalent apparent scales.]

In practice, we label the axes in some ‘nice’ way (in units, or thousands, or hundredths, or
whatever) but choose the axis units so that the scales are similar. For example, if x̄ = 46.7
and sx = 4.3, then we might choose tick marks at {35, 40, 45, 50, 55}, say. The computer
package will do this scale selection automatically, so we usually don’t have to worry about
it. Sometimes though, it produces odd subdivisions: gaps of 6, 7, or 11, for example. We
humans tend to prefer gaps like 1, 2, 5 or 10.

Problem Set 2
2.1 (a) Classify each of the variables in the following questionnaire (as categorical, ordinal or
discrete numerical or continuous numerical).
1. Age (in months):
2. Sex: male female
3. How often do you use public transport?
never rarely sometimes often frequently
4. State the number of times you used public transport last week:
5. Do you own a car?
6. What is the fuel consumption of your car?
(b) The diagram below is not so much misleading as confusing. It relates to blood levels
observed in a sample of children. Draw a more appropriate diagram.

2.2 Comment on the following quantitative statements and conclusions. Is there sufficient evi-
dence to reach the stated conclusion? If not, why not?
(a) “Two out of three dentists, responding to a survey, said that they would recommend OR-
BIT gum to their patients who chew gum. Therefore the majority of dentists recommend
chewing ORBIT gum.”
(b) “Heart disease is responsible for 40% of deaths in Australia, therefore we should spend
more money on research into heart disease.”
(c) “A survey of two thousand drivers has recently been completed. Of the drivers under 30,
35% had had a car accident in the past year, whereas only 20% of the older drivers had
been involved in an accident in that time. Clearly, therefore, young drivers are worse
drivers.”
(d) “You should cross at a pedestrian crossing. It’s five times safer.”
2.3 The data below are a sample of cholesterol levels taken from 20 hospital employees who were
on a standard (meat-eating) diet and who agreed to adopt a vegetarian diet for one month.
Serum-cholesterol measurements were made before adopting the diet and 1 month after. The
rows in the table below give patient ID, cholesterol level before and cholesterol level after the
month on the vegetarian diet.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
195 205 159 244 166 250 236 192 224 238 197 158 151 197 180 222 168 168 167 161
146 178 146 208 147 202 215 184 208 206 169 127 149 178 161 187 176 145 154 153
(a) i. What is the question of interest being investigated here?
ii. What is the sample size and what are the study units?
iii. What is the underlying population?
(b) Let diff denote the difference = before − after. For the observations on diff:
i. Draw a boxplot for the data.
ii. Calculate the mean, median, standard deviation and interquartile range.
iii. Comment on the distribution of the data.
iv. Suppose the first value is mistyped as 495 (instead of 195). Which of the statistics in
(b)ii will change, and which will not?
(c) What do you think is the answer to the question of interest given in (a)i?
2.4 (a) Find the five number summaries, and draw boxplots, for each of the following:
i. 1, 2, 4, 8, 16, 32, 64

ii. 1, 2, 4, 8, 16, 32, 64, 128, 256, 512


(b) Find the mean & median, and standard deviation & inter-quartile range for each of the
following data sets.
i. 1, 2, 3, 4, 5, 6, 7, 8, 9, 10;
ii. 1, 2, 3, 4, 5, 6, 7, 8, 9, 100;
(c) Give a set of 3 integers between 1 and 9 that would give (1) the smallest value and (2) the
largest value of s, and find the value of s in each case:
i. if repeats are allowed;
ii. if no repeats are allowed.

2.5 A frequency distribution for the serum zinc levels of 462 males between the ages of 15 and 17
is displayed below.
Serum zinc level Number of
(µg/dL) males
50–59 6
60–69 35
70–79 110
80–89 116
90–99 91
100–109 63
110–119 30
120–129 5
130–139 2
140–149 2
150–159 2
Draw an accurate cumulative frequency polygon, and use it to find approximate values for:
(a) the median and IQR; the 10th and 90th percentiles;
(b) the proportion of males whose serum zinc level is less than 105 µg/dL;
(c) the proportion of males whose serum zinc levels lie within mean ± 2 sd (the sample
mean = 88.1 and sample standard deviation = 16.8).
(d) How does the above proportion compare with the empirical rule?
(e) Why are the values you have found only approximate? What would be required to
obtain the correct values?
2.6 Scientists wanted to test whether a new corn with extra lysine (a protein) would be good for
chicks. An experimental group of 20 one-day old chicks was fed a ration containing the new
corn. A control group of another 20 one-day old chicks was fed a ration which was identical
except that it contained normal corn. Here are the weight gains (in grams) after 21 days.
Control 272 283 316 321 329 345 349 350 356 356
360 366 380 384 399 402 410 431 455 462
Lysine 318 326 339 361 375 392 393 401 403 406
407 410 420 426 427 430 434 447 467 477
(a) What is the response variable? The explanatory variable? What other variables have been
controlled by keeping them constant? What other variables might affect the weight gain
of individual chicks? Can these be controlled? Explain.
(b) Suppose that the supply of experimental chickens came from two farms. Simon suggests
that the easiest way to conduct the study would be to use 20 chicks from Farm A as the
experimental group and 20 chicks from Farm B as the control group. What do you think?
(c) Which of the following displays would be appropriate for assessing the effectiveness of
the lysine supplement: histogram(s); boxplot(s); scatterplot(s)? Draw the display that
you consider to be the most appropriate for the above data, and use it to comment on the
effectiveness of the lysine supplement.

2.7 The Newport Health Clinic experiments with two different configurations for serving patients.
In one configuration, all patients enter a single waiting line that feeds three different physi-
cians. In another configuration, patients wait in individual lines at three different physician
stations. Waiting times (in minutes) are recorded for ten patients from each configuration.
Compare the results.
Single line: 65 66 67 68 71 73 74 77 77 77
Multiple lines: 42 54 58 62 67 77 77 85 93 100

Interpret the results by determining whether there is a difference between the two data sets
that is not apparent from a comparison of the measures of centre. If so, what is it?

2.8 R produced the following descriptive statistics for the level of substance H in the blood of a
random sample of thirty individuals with characteristic C. The sample size is n = 30 and there
are 5 missing observations.
Min. 1st Qu. Median Mean 3rd Qu. Max.
62.90 68.40 71.5 73.00 81.25 96.30
(a) Sketch a diagram indicating the distribution of the data.
(b) Where, on your diagram, do you think the five missing observations might go? Why?
(c) What population are we trying to sample from? (i.e. what is the target population?)
(d) What assumptions are made in treating these data as a random sample of 25 observations
from the target population?
(e) Give an example of a situation when this might not be true.

2.9 The following data are the pulmonary blood flow (PBF) x (L/min · m2 ) and pulmonary blood
volume (PBV) y (mL/m2 ) values recorded for 14 infants and children with congenital heart
disease:
x 4.3 3.4 6.2 17.3 12.3 14.0 8.7 8.9 5.9 5.0 3.5 4.2 7.2 11.6
y 170 280 390 420 305 430 305 520 225 290 235 370 210 440
Draw a scatter plot and use it to assess the relationship between PBF and PBV.

2.10 The following scores represent a nurse’s assessment(x) and a physician’s assessment (y) on the
condition of each of ten patients at time of admission to a trauma centre:
x 18 13 18 15 10 12 8 4 7 3
y 23 20 18 16 14 11 10 7 6 4
i. Construct a scatter diagram for these data.
ii. Describe the relationship between x and y, and guess the value of the correlation.
iii. Use R to evaluate the correlation.
iv. If x were to be used as a predictor for y, which one of the following lines would you use?
(y = 8 + 0.5x, y = −10 + 2x, or y = 1 + x. Determine which by plotting each of the lines on
your scatter diagram.)

Chapter 3

PROBABILITY AND
APPLICATIONS

“We balance probabilities and choose the most likely. It is the scientific use of
the imagination.” Sherlock Holmes, The Hound of the Baskervilles, 1902.

Chapter 1 provides an indication of where the data we analyse comes from. Chapter 2 tells us
something about what to do with a data set, or at least how to look at it in a sensible way. In this
chapter, and the next, we look at models for the data.

population −→ sample (Ch 1: types of studies; Ch 2: data description)
model −→ observations (Probability, Ch 3 & 4)
model ←− observations (Statistical Inference, Ch 5–8)

3.1 Probability: the basics

Probability (chance, likelihood) has an everyday usage which gives some sort of rough
gradation between the two extremes of impossible and certain:

impossible – no way – not likely – possibly – maybe – fair chance – probably – no worries – certain
We wish to make probability numerical: i.e. we wish to define a mathematical probability.
To do this we need a structure and some rules.


DEFINITION 3.1.1. Probability, Pr(A), is a number assigned to each event A, which reflects its (probability, chance, likelihood) of occurrence.

This Probability must obey some rules: very reasonable rules, but rules nevertheless.
Properties of Pr
1. 0 ≤ Pr(A) ≤ 1; probability must lie between 0 and 1.
2. Pr(∅) = 0, Pr(Ω) = 1; an impossible event has probability 0; a certain event has probability 1.
3. Pr(A′) = 1 − Pr(A); the complement of A, i.e. not A, has probability 1 − Pr(A).
4. A ⊆ B ⇒ Pr(A) ≤ Pr(B); a subset has smaller probability.
5. Pr(A ∪ B) = Pr(A) + Pr(B) − Pr(A ∩ B); the “addition theorem”.
Intersection and union

DEFINITION 3.1.2.
1. A ∪ B denotes the union of events, corresponding to “A or B”, meaning that at
least one of the events A or B occurs.
2. A ∩ B denotes the intersection of events, corresponding to “A and B”, meaning
that both the events A and B occur.

We can use a Venn diagram or a probability table to illustrate A ∩ B and A ∪ B:


[Figures: Venn diagrams and probability tables illustrating A∩B and A∪B, with rows A, A′ and columns B, B′.]

EXAMPLE 3.1.1: If Pr(A) = 0.45, then Pr(A′) = 0.55.
If, in addition, Pr(B) = 0.35 and A and B are mutually exclusive, then Pr(A ∪ B) = 0.80.

EXAMPLE 3.1.2: Suppose that 6.3% of families with four children consist of four

boys ( ◦◦ ). What is the probability that a family of four children has at least one
girl?

If A denotes “four boys”, then the complement of A, A′ = “at least one girl”.
Therefore Pr(A′ ) = 1 − 0.063 = 0.937.

3.1.1 Probability tables

A probability table has the advantage of doubling as a Venn diagram, one in which the subdivisions are squares rather than odd shapes, and it can also be used for calculation.

EXAMPLE 3.1.3: Suppose that Pr(A) = 0.6, Pr(B) = 0.2 and Pr(A∩B) = 0.1.
The probability table is given by:

B B′
A 0.1 0.5 0.6
A′ 0.1 0.3 0.4
0.2 0.8 1

The bold entries are given. The rest can be obtained by subtraction.

∴ Pr(A∪B) = 0.7, Pr(A∩B′) = 0.5, Pr(A∪B′) = 0.9, . . .
In a probability table:
• intersections (A∩B, A∩B ′ , A′ ∩B, A′ ∩B ′ ) are represented by one square;
• unions (A∪B, A∪B ′ , A′ ∪B, A′ ∪B ′ ) are represented by three squares, i.e. an L-shape.
To complete the probability table, and hence to work out anything involving A and B, we
need three (separate) items of information.

EXAMPLE 3.1.4: Pr(A∪B) = 0.7, Pr(A∩B′) = 0.2, Pr(A) = 0.3.

     B    B′
A    0.1  0.2  0.3
A′   0.4  0.3  0.7
     0.5  0.5  1

∴ Pr(B) = 0.5, Pr(A∩B) = 0.1, . . .

EXERCISE. (Addition theorem)

     B    B′
A    γ         α
A′
     β         1

Complete the above table and hence show that Pr(A ∪ B) = α + β − γ,
i.e. Pr(A ∪ B) = Pr(A) + Pr(B) − Pr(A ∩ B).

Probability tables are very useful devices. Their data counterparts, containing counts rather than probabilities, are called “contingency tables”, for example:

10   20
40   30

We will deal with probability tables and contingency tables quite a lot.
Assigning values to Pr
Nothing in any of the above says how a value is assigned to probability. It says only that
the numbers assigned should follow the rules, and we can use these rules to deduce other
probabilities!
The following can be used to assign values to probability:
• symmetry

• long-term relative frequency


• population proportion
• subjective
• model

Simple examples of probability, and some of its earliest applications, relate to symmetrical
gambling apparatus (coins, dice, roulette wheels, . . . ), in which case probability is assigned
equally to the possible outcomes.
When a procedure is repeated a very large number of times, the proportion of times that
an event occurs approaches its probability. This can be used to provide an approximation
for probability, and a theoretical justification for it.
If an individual is chosen “at random” from a population of N individuals (meaning that each individual is equally likely to be chosen) then the probability that the chosen individual has attribute A is n(A)/N, where n(A) denotes the number of individuals in the population with attribute A. In this case, probability reduces to counting . . . in theory at least. Many of our applications take this form, and probability denotes a population proportion, Pr(A) = n(A)/N. However, usually we don’t know what n(A) is, and, in many cases, we’re not entirely sure about N either.
So long as the rules are adhered to, numbers can be assigned to probabilities subjectively,
based on experience or opinion. This is done to some extent by people such as bookmak-
ers and stocktraders, though it also has some value in the scientific world. Its usefulness
depends entirely on the credibility of the experience and opinion. Just another assumption?
Mostly though, what is done is to generate a probability model for the situation at hand,
and decide whether or not the observed data is compatible with the model . . . or whether
the model is compatible with the data.
QUESTION: What is the value of Pr(6) for this die?
A lot of what we do in statistics is based on modelling. In understanding the theory and
practice of statistics, it is necessary to deal with abstractions of various kinds. So you are
often asked to assume that, or suppose that . . . and in that case to work out what is likely
to happen.
It is useful to consciously allow your mind to entertain various abstract concepts and sce-
narios: it genuinely helps with understanding. These abstractions are introduced in many
ways, sometimes by a very simple word or phrase. You may be asked to “assume that”, or
“suppose”, or “model the data as . . . ”. Perhaps the simple word “if” may be used, often in
an “if . . . then” construction, e.g. “If the coin is fair, then . . . ” or “If the sample is random,

then . . . ”. Occasionally, the symbol ( ◦◦ ) will be used to remind you of this abstraction
process.
In medical applications, probability is often called risk because it often relates to a negative
disease-outcome (such as death, relapse, recurrence, complications, . . . ), whereas we do not
talk about the risk of a positive disease-outcome (such as cure, alleviation of symptoms,
improvement, . . . ): thus we refer to the probability of cure.
In a practical application, these disease-outcomes (positive or negative) must be further
specified as occurring over a specified period of time.
This risk can be thought of as the proportion of the population with the disease-outcome
occurring during a time period, u.

Or it can be thought of as applying to an individual (a ‘typical’ individual), in which case


we are describing the probability that this person will have the disease-outcome in the
specified time period u.

EXAMPLE 3.1.5: “A 60-year-old man has a 2% risk of dying from cardiovascular dis-
ease.”
What does this mean? Not much! Risk applies to a specific period of time: the
time period should be specified in such a statement. We should say something
like: “A 60-year-old man has a 2% risk of dying from cardiovascular disease in the next
five years.”
Another medical probability with a name is the probability that an individual (from a given
population) has disease D at a point in time. This is called the prevalence of the disease D at
a particular time (often taken to mean ‘now’). Thus the prevalence is a measure of disease
status.
Actually, prevalence is used for many things other than disease: characteristics such as
blood group, smoking status, eye-colour, and so on.
Yet another term, which tends to get confused with prevalence, is incidence. Incidence is a
rate rather than a probability and relates to the first occurrence of disease: the rate at which
a disease occurs. The incidence rate is a measure of the frequency of disease onset. This is
discussed in Chapter 4.

3.1.2 Odds

An alternative to probability is “odds”. This is a useful concept: bookmakers think so;


and so do biostatisticians. It provides an alternative measure of chance/likelihood, which
spans from 0 to ∞ instead of 0 to 1. It is defined as follows:

DEFINITION 3.1.3. The odds of A (odds for A) is

O(A) = Pr(A) / Pr(A′) = Pr(A) / (1 − Pr(A)).

• Because 0 ≤ Pr(A) ≤ 1, it follows from the definition of odds that 0 ≤ O(A) ≤ ∞.
• A consequence of the above definition is that Pr(A) = O(A) / (1 + O(A)).
• The odds against A is O(A′ ) = 1/O(A).

Pr(A)   0   0.05     0.0909   0.2    0.5   0.6   0.8   0.95
O(A)    0   0.0526   0.1      0.25   1     1.5   4     19

Odds of ‘4 to 1 on’ is equivalent to a probability of 0.8; and odds of ‘4 to 1 against’ is equivalent to a probability of 0.2.
Note: A further useful extension of this transformation idea is the log-odds, i.e. ln O(A),
which spans −∞ to ∞ as Pr(A) spans 0 to 1.
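These transformations are easily checked numerically; a minimal R sketch (the helper functions odds and prob are ours):

> odds <- function(p) p / (1 - p)        # probability to odds
> prob <- function(o) o / (1 + o)        # odds to probability
> odds(0.8)                              # '4 to 1 on'
[1] 4
> prob(19)
[1] 0.95
> log(odds(0.5))                         # log-odds spans -Inf to Inf; 0 at p = 0.5
[1] 0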

Comparing risks

Suppose that the risk for group 1 is p1 and for group 2 the risk is p2 . How should we
compare the groups?
risk difference?  p1 − p2;    risk ratio?  p1/p2;    odds ratio?  (p1/(1−p1)) / (p2/(1−p2)) = p1(1−p2) / ((1−p1)p2).
Each of these is used in different situations. When the risks are small, the risk difference
will be small too: is a difference of 0.001 important? Maybe if p1 = 0.002 and p2 = 0.001,
it is. When the risks are large, the risk ratio is relatively diminished. The fairest compar-
ison turns out to be the odds ratio: it’s the one biostatisticians tend to use. It has a lot of
advantages as you will see, although it may not be the simplest.
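As an illustration of how the three comparisons can behave, a small R sketch (the values of p1 and p2 are ours):

> p1 <- 0.002; p2 <- 0.001
> p1 - p2                                # risk difference: small
[1] 0.001
> p1 / p2                                # risk ratio
[1] 2
> (p1/(1 - p1)) / (p2/(1 - p2))          # odds ratio: close to the risk ratio when risks are small
[1] 2.002002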

3.2 Conditional probability

Pr(A | H) denotes the probability of A given the information H about the outcome. The
information takes the form of an event, H. We are told that the event H has occurred.
We would like to modify the probability in the light of this additional information:
Pr(A) −→ Pr(A | H).

EXAMPLE 3.2.1: (tossing three fair coins)
A = “at least two heads” and H = “first toss is a head”. Pr(A) = 1/2, but Pr(A | H) > 1/2.

An appropriate way to adjust the probability is

Pr(A | H) = Pr(A∩H) / Pr(H).

. . . but why?
Given that H has occurred, we treat H as the ‘universe’: any outcome in H ′ is impossible.
So, it seems reasonable that the probability of A should be proportional to Pr(A ∩ H), since outcomes in A ∩ H are the only way A can have occurred.

EXAMPLE 3.2.2: (tossing three fair coins)

Pr(A | H) = Pr(A∩H) / Pr(H) = (3/8) / (1/2) = 3/4.

Conditional probability is usefully treated as “universe reduction”. Conditional on H, i.e.


given that H has occurred, the universe is effectively reduced to H: only the outcomes
inside H can have happened. Then to find Pr(A | H), we find the probability of A within
this reduced universe.

DEFINITION 3.2.1. The conditional probability Pr(A | H) is the probability that the event A will occur given the knowledge that the event H has already occurred. It is defined as

Pr(A | H) = Pr(A ∩ H) / Pr(H).

Given H, the H-row becomes the universe. And in that case, the conditional probability of
A is the proportion of the H-row that is in A, i.e. Pr(A∩H)/ Pr(H).

EXAMPLE 3.2.3: Consider a retirement village of 1000 as a simple example of


a population. Suppose that 200 of them have hypertension (H) and 160 have
cardiac problems (A), as indicated in the table below:

A A′
H 80 120 200
H′ 80 720 800
160 840 1000

The probability that an individual randomly chosen from this population has cardiac problems is Pr(A) = 160/1000 = 0.16.

Given the individual has hypertension, the probability of cardiac problems is


increased.

Conditioning on H means restricting the universe to those with hypertension.


We consider only those individuals with hypertension. Of the 200 individuals with hypertension, 80 have cardiac problems, so Pr(A | H) = 80/200 = 0.4.

It remains true that Pr(A | H) = Pr(A∩H) / Pr(H) = (80/1000) / (200/1000).

Note that Pr(H | A) = 80/160 = 0.5. This is the probability that the individual has hypertension given they have cardiac problems. To obtain this probability, we restrict the universe to the 160 individuals with cardiac problems.

Given A, the universe becomes the column A. Similarly, given A′ the universe would be reduced to the column A′, and in that case Pr(H | A′) = 120/840 = 0.143.

EXERCISE. Check that Pr(A | H′) = 0.1. Explain what this conditional probability means,
in terms of hypertension and cardiac problems.

EXERCISE. Pr(A) = 0.3, Pr(B) = 0.4, Pr(A | B) = 0.6

EXERCISE. Pr(A) = 0.4, Pr(B | A) = 0.8, Pr(B | A′) = 0.3



EXAMPLE 3.2.4: (exposure and disease)

Pr(E) = 0.3, Pr(D | E′) = 0.001, Pr(D | E) = 0.011.

     D       D′
E    0.0033  0.2967  0.3
E′   0.0007  0.6993  0.7
     0.0040  0.9960  1

The probability table can be calculated from the given information: e.g. Pr(D∩E) = 0.3×0.011 = 0.0033; and then any other probabilities involving D and E can be computed.

Thus Pr(D) = 0.0033 + 0.0007 = 0.004; and Pr(E | D) = 0.0033/0.004 = 0.825.

Note: Pr(A | B) is not the same as Pr(B | A): they are in different universes!
Pr(A | B) is not equal to 1 − Pr(A | B ′ ): different universes again.
But Pr(A′ | B) = 1 − Pr(A | B): these are from the same universe, and either A or A′ must occur, no matter what universe we are in.

3.2.1 Multiplication rule

From the definition of conditional probability:

Pr(A∩B) = Pr(A) Pr(B | A) = Pr(B) Pr(A | B).

Dividing through by Pr(A) Pr(B) gives:

Pr(A∩B) / (Pr(A) Pr(B)) = Pr(B | A) / Pr(B) = Pr(A | B) / Pr(A)   [= c, say].

Relationship between A and B


• If c > 1, then: Pr(B | A) > Pr(B) and Pr(A | B) > Pr(A) and Pr(A∩B) > Pr(A) Pr(B);
and we say that A and B are positively related, since each increases the chance of the
other occurring.
• If c < 1, then A and B are negatively related. In that case, each decreases the chance
of the other occurring.
• If c = 1, then A and B are unrelated: they are independent. In that case, neither
affects the chance of the other occurring.
• If A and B are positively related, then A and B ′ are negatively related. Therefore:
Pr(A | B) > Pr(A) > Pr(A | B ′ ) [ and Pr(B | A) > Pr(B) > Pr(B | A′ )].
• Conversely, if A and B are negatively related, then
Pr(A | B) < Pr(A) < Pr(A | B ′ ) [ and Pr(B | A) < Pr(B) < Pr(B | A′ )].
• And, if A and B are independent, then
Pr(A | B) = Pr(A) = Pr(A | B ′ ) [ and Pr(B | A) = Pr(B) = Pr(B | A′ )].

EXAMPLE 3.2.5: (exposure and disease) If the exposure E is positively related


to disease D, then
Pr(D | E) > Pr(D) > Pr(D | E ′ ).

[0.011] [0.004] [0.001] ←− from example above.



DEFINITION 3.2.2. The relative risk (or risk ratio), RR, of a disease D with respect to an exposure E is given by

RR = Pr(D | E) / Pr(D | E′).

For the above example, the relative risk is RR = 0.011/0.001 = 11.
The relative risk is the ratio of the probability of disease given exposure to the probability of disease given non-exposure. It’s neat that RR stands for both risk ratio and relative risk!
Thus relative risk compares the risk (probability) of disease for two groups: those exposed
and those not exposed.

3.2.2 Conditional odds and odds ratio

DEFINITION 3.2.3.
1. The conditional odds of D given E is

   O(D | E) = Pr(D | E) / Pr(D′ | E).

   This is the odds of disease for the exposed group.
2. The odds ratio of D with respect to E is given by:

   OR = O(D | E) / O(D | E′).

   The odds ratio compares the odds of disease for the group of exposed individuals to the odds for the group of unexposed individuals.

     D   D′
E    α   β        Pr(D | E) = α/(α+β),    O(D | E) = α/β
E′   γ   δ        Pr(D | E′) = γ/(γ+δ),   O(D | E′) = γ/δ

risk ratio:  RR = α(γ+δ) / (γ(α+β));      odds ratio:  OR = αδ / (βγ)

positive relationship between E and D: Pr(D | E) > Pr(D | E′), i.e. RR > 1 and OR > 1;
negative relationship between E and D: Pr(D | E) < Pr(D | E′), i.e. RR < 1 and OR < 1.
Note: one advantage of the odds ratio is that it doesn’t matter in which order we consider
E and D (i.e. D with respect to E or E with respect to D), since:
O(D | E) / O(D | E′) = O(E | D) / O(E | D′).

Thus, the odds ratio is a measure of the connection between E and D.



QUESTION: Is this interchangeability true for the risk ratio?

No, it’s not true for the risk ratio. If E and D are interchanged, then we obtain:

RR∗ = Pr(E | D) / Pr(E | D′) = (α/(α+γ)) / (β/(β+δ)) = α(δ+β) / (β(γ+α)) ≠ RR.

To illustrate, we consider a simple example.

EXAMPLE 3.2.6: Suppose E and D are such that

D D′
E 0.3 0.1 0.4
E′ 0.3 0.3 0.6
0.6 0.4 1

RR = Pr(D | E) / Pr(D | E′) = 0.75/0.5 = 1.5   and   RR∗ = Pr(E | D) / Pr(E | D′) = 0.5/0.25 = 2.0.

The odds ratio is the same whichever way it is evaluated:

OR = (0.3×0.3) / (0.3×0.1) = 3.
EXERCISE. Check that O(D | E) = 3, O(D | E′) = 1; and O(E | D) = 1, O(E | D′) = 1/3.

3.3 Law of Total Probability & Bayes’ Theorem

These results generally apply in the context of:


mutually exclusive and exhaustive “causes” A1 , A2 , . . . , Ak of some “result” H,
where we know the probability of the possible “causes”, i.e. Pr(A1 ), Pr(A2 ), . . . , Pr(Ak ); and
the probability of the “result” given each of the “causes”, i.e. Pr(H | A1 ), Pr(H | A2 ), . . . ,
Pr(H | Ak ).
The Law of Total Probability gives Pr(H); Bayes’ theorem gives Pr(Aj | H).
The formulae will be given later. But first we’ll learn how to work these out. Then you don’t need
the formulae!
Standard applications are:
“causes” = exposure −→ “result” = disease;
“causes” = disease; −→ “result” = test result.

EXAMPLE 3.3.1: (diagnosis: alcohol and headache)


Ray’s Saturday night: A0 = no alcohol consumption, A1 = low alcohol consump-
tion, and A2 = high alcohol consumption; and H denotes a Sunday morning
headache.

H H′ (H | · )
A0 0.3 (0.01)
A1 0.5 (0.1)
A2 0.2 (0.9)
1

LTP: Pr(H) = 0.233; BT: Pr(A2 | H) = 0.773.
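The LTP and BT calculations can be checked with a few lines of R (the vector names are ours):

> prior <- c(0.3, 0.5, 0.2)              # Pr(A0), Pr(A1), Pr(A2)
> cond <- c(0.01, 0.1, 0.9)              # Pr(H | A0), Pr(H | A1), Pr(H | A2)
> joint <- prior * cond                  # Pr(Ai and H): the H column of the table
> sum(joint)                             # LTP: Pr(H)
[1] 0.233
> joint[3] / sum(joint)                  # BT: Pr(A2 | H)
[1] 0.7725322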



EXAMPLE 3.3.2: (exposure and disease) A1 = E, A2 = E′; H = D.

Pr(E) = 0.3, Pr(D | E) = 0.011, Pr(D | E′) = 0.001.

          D        D′                (D | · )
E         0.0033   0.2967   0.3      (0.011)
E′        0.0007   0.6993   0.7      (0.001)
          0.0040   0.9960   1
(E | · )  (0.825)  (0.298)

LTP: Pr(D) = 0.004;  BT: Pr(E | D) = 0.0033/0.0040 = 0.825.

We already knew how to do this! See the earlier example. Also, Pr(E | D′) = 0.2967/0.9960 = 0.298.

EXAMPLE 3.3.3: (Ophthalmology)


We are planning a 5-year study of cataracts in a population of 5000 people 60
years of age and older. We know from census data that 45% of this population
are ages 60-64, 28% are ages 65-69, 20% are ages 70-74, and 7% are age 75 or
older. We also know from the Framingham Eye Study that 2.4%, 4.6%, 8.8% and
15.3% of the people in those respective age groups will develop cataracts over
the next 5 years. What percentage of this population will develop cataracts over
the next 5 years, and how many people does this percentage represent?

The probability table can be obtained using the given information. Entries in
the first column can be evaluated, like Pr(A1 ∩C) = 0.45×0.024 = 0.0108; and
then the table can be completed using subtraction.

C C′ (C | · )
A1 0.0108 0.4392 0.45 (0.024)
A2 0.0129 0.2671 0.28 (0.046)
A3 0.0176 0.1824 0.20 (0.088)
A4 0.0107 0.0593 0.07 (0.153)
0.0520 0.9480 1

Then Pr(C) = 0.0520, i.e. 5.2% of the population are expected to develop cataracts.
This represents 5000×0.052 = 260 individuals.
Probability table representation:

      H                        H′
A1    Pr(A1) Pr(H | A1)        · · ·      Pr(A1)
A2    Pr(A2) Pr(H | A2)        · · ·      Pr(A2)
...   ...                                 ...
Ak    Pr(Ak) Pr(H | Ak)        · · ·      Pr(Ak)
      Pr(H)                    · · ·      1

Observe from the probability table that Pr(H) can be found by summing up the probabili-
ties in the H column.

The LTP is a statement about the unconditional probability Pr(H):

DEFINITION 3.3.1. Suppose that A1, . . . , Ak are mutually exclusive events with A1 ∪ · · · ∪ Ak = Ω. Then the Law of Total Probability states that

Pr(H) = Σ_{i=1}^{k} Pr(Ai) Pr(H | Ai) = Σ_{i=1}^{k} Pr(Ai ∩ H).

Bayes’ Theorem is a statement about the conditional probability Pr(Aj | H):

DEFINITION 3.3.2. Suppose that A1, . . . , Ak are mutually exclusive events with A1 ∪ · · · ∪ Ak = Ω. Also suppose that Pr(H) ≠ 0. Then Bayes’ Theorem is stated as follows:

Pr(Aj | H) = Pr(Aj) Pr(H | Aj) / ( Σ_{i=1}^{k} Pr(Ai) Pr(H | Ai) ) = Pr(Aj) Pr(H | Aj) / Pr(H).

The case of a non-representative sample


Suppose the population under consideration is such that:
     D   D′                       D       D′
E    α   β                 E      0.0033  0.2967  0.3
E′   γ   δ                 E′     0.0007  0.6993  0.7
                                  0.0040  0.9960  1

Pr(E | D) = 0.8250, Pr(E | D′) = 0.2979;
RR = α(γ+δ) / (γ(α+β)) = 11,   OR = αδ / (βγ) = 11.11.  (See the example above.)
If we were to take a non-representative sample — as in a case-control study — where the
individuals with the disease (D) are over-represented, then we have
     D    D′                      D       D′
E    kα   ℓβ               E      0.4125  0.1489  0.5614
E′   kγ   ℓδ               E′     0.0875  0.3511  0.4386
                                  0.5     0.5     1

Pr(D | E) = 0.7347, Pr(D | E′) = 0.1995;
RR = α(kγ+ℓδ) / (γ(kα+ℓβ)) = 3.68,   OR = αδ / (βγ) = 11.11.
The odds ratio is unaffected; the risk ratio is changed considerably.
This indicates that the odds ratio is a good thing to be using, even if it is a bit harder to
understand.

EXAMPLE 3.3.4: (hypothetical cohort study vs case-control study, Chapter 1)


For the cohort study we have:

     D    D′                         D         D′
E    8    2996   3004         E      0.000667  0.249667  0.250333
E′   8    8988   8996         E′     0.000667  0.749000  0.749667
     16   11984  12000               0.001333  0.998667  1.000000

The first table gives the numbers in each group (this is called a contingency ta-
ble); and the second, obtained by dividing through by 12000, gives a probability
table.

From which we obtain:


RR = (8/3004) / (8/8996) = (8×8996) / (8×3004) = 2.995;  and  OR = (8×8988) / (8×2996) = 3.0.
The same answers are obtained if the probability table is used.

For the case-control study we have:

     D   D′                    D       D′
E    8   12   20        E      0.1250  0.1875  0.3125
E′   8   36   44        E′     0.1250  0.5625  0.6875
     16  48   64               0.25    0.75    1

From which we obtain:


RR = (8/20) / (8/44) = (8×44) / (8×20) = 2.2;  and  OR = (8×36) / (8×12) = 3.0.

The risk ratio is different, but the odds ratio is correct. Thus, we can simply
use the odds ratio from the case-control study to estimate the population odds
ratio.

We could use the case-control table to obtain the full population table, if we are
provided with the value of Pr(D), i.e. the proportion of the population with the
disease. The case-control table correctly gives Pr(E | D) = 0.5 and Pr(E | D′) = 0.25. Using these values in conjunction with Pr(D) = 16/12000 = 0.001333, the remaining probabilities in the population table can be evaluated.
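As a check, a small R function computing both measures from a 2×2 table of counts (the function name rr.or is ours):

> rr.or <- function(a, b, c, d) {        # counts: E row (a, b); E' row (c, d)
+   c(RR = (a/(a + b)) / (c/(c + d)), OR = (a*d) / (b*c))
+ }
> rr.or(8, 2996, 8, 8988)                # cohort study
      RR       OR
2.994674 3.000000
> rr.or(8, 12, 8, 36)                    # case-control study: RR changes, OR does not
 RR  OR
2.2 3.0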

3.4 Diagnostic testing

The diagnostic testing scenario is very important in medicine. There are a bunch of names
for many of the probabilities and conditional probabilities that you need to know about.

“false negative” = D∩P′;   “false positive” = D′∩P;   prevalence = Pr(D).

        P                       P′
D                               × (false negative)      sn = Pr(P | D)
D′      × (false positive)                              sp = Pr(P′ | D′)

ppv = Pr(D | P)        npv = Pr(D′ | P′)

DEFINITION 3.4.1. The prevalence of a disease is the proportion of individuals in a


population with the disease.

Thus if an individual is randomly selected from the population, then the probability that
the individual has the disease, Pr(D), is equal to the prevalence. Here probability is a
population proportion.

DEFINITION 3.4.2.
1. The sensitivity (sn) of a test is the probability that the test is positive given that
the person has the disease: sn = Pr(P | D).
2. The specificity (sp) of a test is the probability that the test is negative given that
the person does not have the disease: sp = Pr(P ′ | D′ ).
3. The positive predictive value (ppv) of the test is the probability that a person has
the disease, given the test is positive: ppv = Pr(D | P ).
4. The negative predictive value (npv) of the test is the probability that a person
does not have the disease, given that the test is negative: npv = Pr(D′ | P ′ ).

Note that all these conditional probabilities are concerned with “getting it right” . . . given D, given
D′ , given P and given P ′ .

DEFINITION 3.4.3.
1. A false negative occurs when the test is negative, and the person has the disease,
i.e. FN = D∩P ′ . However, the “probability of a false negative” is usually taken
to be the conditional probability: fn = Pr(FN | D) = Pr(P ′ | D) = 1 − sn.
2. A false positive occurs when the test is positive, and the person does not have the
disease, i.e. FP = D′ ∩P . Similarly, the “probability of a false positive” is usually
taken to be the conditional probability: fp = Pr(FP | D′ ) = Pr(P | D′ ) = 1 − sp.

EXAMPLE 3.4.1: (diagnostic test)


Consider a diagnostic test with sensitivity 99% and specificity 95% applied to a
population with disease prevalence 5%. Find the positive predictive value for
this test.

P P′
D 0.0495 0.0005 0.05 (sn=0.99)
D′ 0.0475 0.9025 0.95 (sp=0.95)
0.0970 0.9030 1

Thus ppv = Pr(D | P) = 0.0495/0.0970 = 0.510.
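Equivalently, by Bayes’ theorem, ppv = sn·p / (sn·p + (1−sp)(1−p)), where p is the prevalence; a one-line check in R:

> sn <- 0.99; sp <- 0.95; p <- 0.05
> (sn * p) / (sn * p + (1 - sp) * (1 - p))   # ppv
[1] 0.5103093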

EXERCISE. (hypertension)
Suppose 84% of hypertensives and 23% of normotensives are classified as hypertensive by
an automated blood-pressure machine. What is the positive predictive value and negative
predictive value of the machine, assuming that 20% of the adult population is hyperten-
sive?

The case of a non-representative sample


The positive predictive value depends on the prevalence, so to get its value right, we must
get the prevalence right, i.e. the prevalence for the population we are applying it to.
If we had a non-representative sample (for example a test sample, or a specific subpopula-
tion, like hospital patients) we could still estimate sensitivity and specificity:

P P′
D 495 5 500 ⇒ (sn=0.99)
D′ 25 475 500 ⇒ (sp=0.95)
520 480 1000
For this sample (or this subpopulation), ppv = 495/520 = 0.952.
But to get the ppv right for the population, in which the prevalence is 10%, we need to
adjust:
P P′
D 0.099 0.001 0.1 (sn=0.99)
D′ 0.045 0.855 0.9 (sp=0.95)
0.144 0.856 1
so that ppv = 0.099/0.144 = 0.688.
And if the prevalence were 1% then we would have:
P P′
D 0.0099 0.0001 0.01 (sn=0.99)
D′ 0.0495 0.9405 0.99 (sp=0.95)
0.0594 0.9406 1
so that ppv = 0.0099/0.0594 = 0.167.
In summary:

p      0.5     0.1     0.05    0.01
ppv    0.952   0.688   0.510   0.167
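This little table is easily reproduced in R by vectorising the ppv calculation over the prevalence (a sketch, with sn = 0.99 and sp = 0.95 as above):

> p <- c(0.5, 0.1, 0.05, 0.01)
> ppv <- (0.99 * p) / (0.99 * p + 0.05 * (1 - p))
> round(ppv, 3)
[1] 0.952 0.688 0.510 0.167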

Another (individual) view of diagnostic testing


In applying the test to a particular individual i, Pr(D) represents the (prior) probability
[i.e. before the test] that the individual i has the disease (based on family history, medical
history, and other information). Suppose Pr(D) = 0.4.
Suppose individual i undergoes a test with sensitivity 0.99 and specificity 0.95 (as above).
A positive test result yields a modified (posterior) probability [i.e. after the test]: Pr(D | P ) =
0.930. This is obtained using exactly the same procedure as used in the examples above,
but with Pr(D) = 0.4 instead of the population prevalence. [ex. check this.]
A negative test result would also modify the probability; and we find Pr(D | P ′ ) = 0.007.

An odds view of Bayes’ theorem and diagnostic testing


Using a bit of algebra, Bayes’ theorem is equivalent to:

O(A | B) = (Pr(B | A) / Pr(B | A′)) × O(A).

This is a relatively simple result. In words: given the additional information B, the odds of A is adjusted by multiplying by the likelihood ratio, Pr(B | A) / Pr(B | A′).
In diagnostic testing, this becomes

O(D | P) = (Pr(P | D) / Pr(P | D′)) × O(D) = (sn / (1 − sp)) × O(D).
Thus, if the sensitivity is 0.95 and specificity is 0.9, the likelihood ratio is 9.5. This means
that a positive result on this diagnostic test would have the effect of increasing the odds
by multiplying by 9.5. And a negative result would decrease the odds by multiplying by
1/9.5. It is seen that if the odds start out very small, then the odds will still be relatively
small.

EXAMPLE 3.4.2: O(D) = 0.001 ⇒ O(D | P) = 9.5×0.001 = 0.0095.
It follows that Pr(D | P) = 0.0094.

If sn = 0.999 and sp = 0.999, then the multiplier is 999.
In that case O(D) = 0.001 ⇒ O(D | P) = 0.999, Pr(D | P) = 0.4998;
and O(D) = 0.1 ⇒ O(D | P) = 99.9, Pr(D | P) = 0.9901.
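The updating rule is one line of R (the function name post.odds is ours):

> post.odds <- function(o, sn, sp) o * sn / (1 - sp)   # posterior odds after a positive test
> o <- post.odds(0.001, 0.95, 0.90)                    # likelihood ratio is 9.5
> o
[1] 0.0095
> o / (1 + o)                                          # back to a probability
[1] 0.009410599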

3.5 Independence
Events A and B can be positively or negatively related according as:
Pr(A | B) ≷ Pr(A) ≷ Pr(A | B ′ )
The intermediate case, when they are all equal, i.e. the “no relationship” case is the case of
independence. A and B are independent if B has no effect on the probability of A occurring
. . . and vice versa: i.e.
Pr(A | B) = Pr(A) = Pr(A | B ′ ) and Pr(B | A) = Pr(B) = Pr(B | A′ ).

This means that:


Pr(A∩B) = Pr(A) Pr(B),
which is often taken as the ‘rule’ for independence.

DEFINITION 3.5.1. Two events A and B are independent if

Pr(A ∩ B) = Pr(A) Pr(B).

Independent events and mutually exclusive events are entirely different things.

EXAMPLE 3.5.1: A and B are mutually exclusive events such that Pr(A) =
Pr(B) = 0.4. Then Pr(A∪B) = 0.4 + 0.4 = 0.8.
C and D are independent events such that Pr(C) = Pr(D) = 0.4.
Then Pr(C∪D) = 0.4 + 0.4 − 0.4×0.4 = 0.64.
This multiplication rule extends to n independent events:

DEFINITION 3.5.2. The events A1 , . . . , An are mutually independent if

Pr(A1 ∩A2 ∩ · · · ∩An ) = Pr(A1 ) Pr(A2 ) · · · Pr(An ).

The converse of the above definition is not true.

Also, Pr(A1 ∪ · · · ∪ An) = 1 − Pr(A′1 ∩ · · · ∩ A′n) = 1 − Pr(A′1) · · · Pr(A′n),
i.e. Pr(“at least one”) = 1 − Pr(“none”).

EXAMPLE 3.5.2: Find the probability of at least one six in six rolls of a fair die.
Pr(A) = 1 − Pr(A′) = 1 − (5/6)^6 = 0.665.

Find the probability that at least one individual in a sample of 100 has disease
D when the prevalence of the disease is 1%.
Pr(A) = 1 − Pr(A′) = 1 − (99/100)^100 = 0.634.
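Both computations are one-liners in R:

> 1 - (5/6)^6            # at least one six in six rolls
[1] 0.665102
> 1 - 0.99^100           # at least one case of D in a sample of 100
[1] 0.6339677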
A commonly used probability model is that of “independent trials” (commonly called
Bernoulli trials) in which each trial results in one of two outcomes, designated “success” or
“failure”, with probabilities p and q, where p + q = 1.
Simple examples of independent trials are coin-tossing and die-rolling; but the “indepen-
dent trials” model can be applied quite generally with:
trial = any (independently) repeatable random experiment;
success = A, any nominated event for the random experiment.

EXERCISE. A risky heart operation is such that the probability of a patient dying as a result
of the surgery is 0.01. If 100 such operations are performed at the hospital in a year, find
the probability that at least one of these patients dies as a result of surgery.
We assume that the operations are independent and each has the same probability of “suc-
cess” (that the patient dies!). This emphasises the fact that “success” is just a name for some
event: it clearly doesn’t have to be something good.
It soon becomes clear that the model is too simple (since not all patients are identical), but it
is nevertheless a useful place to start the modelling process.

Problem Set 3
3.1 Drug A causes an allergic reaction in 3% of adults, drug B in 6%, while 0.4% are allergic to
both. What sort of relationship exists between allergic reactions to the drugs A and B (positive,
negative, none)?
3.2 Suppose that events D and E are such that Pr(D | E) = 0.1 and Pr(D | E ′ ) = 0.2.
(a) Are D and E positively related, not related or negatively related? Explain.
(b) Specify the odds ratio for D and E.
Suppose also that Pr(E) = 0.4:
(c) Find Pr(D).
(d) Find Pr(E | D).
3.3 The risk ratio is RR = p1/p2, and the odds ratio is OR = p1(1 − p2) / ((1 − p1)p2).
(a) If the odds ratio is equal to 2, show that RR = 2 − p1 , and hence, or otherwise, complete
the following table:
p1      p2      p1 − p2      p1/p2
0+
0.01
0.05
0.1
0.25
0.5
0.9
1–
Hint: First compute p1/p2 = RR, using the expression for the risk ratio derived above; then find p2 using p2 = p1/RR; and finally p1 − p2.
(b) If the odds ratio, OR = θ, show that RR = θ(1−p1 ) + p1 ; and hence that RR can take any
value between 1 and θ.
(What happens if θ < 1?)
(c) A case-control study gives an estimate of the odds ratio relating exposure E and disease
D of 2.0. What can you say about the relative risk of D with and without exposure E?
(d) i. If the odds ratio is 3, find the risk ratio if p1 = 0.1.
ii. If the odds ratio is 1.5, find the risk ratio if p1 = 0.2.
iii. If the odds ratio is 0.5, find the risk ratio if p1 = 0.05.
3.4 Complete the following probability tables:
(a)
        B     B′
 A      0.4
 A′           0.2
        0.5

(b) (A and B are independent)
        B     B′
 A      0.4
 A′
        0.5

(c) Pr(A)=0.6, O(B | A)=0.2 & O(B | A′)=1;  (d)* Pr(A)=0.4, O(A | B)=1 & O(A | B′)=0.5.

3.5 A study investigating the relationship between disease D and exposure E found that, of indi-
viduals who have disease D, 20% had been exposed to E, whereas for individuals who do not
have disease D, 25% had exposure E.
(a) Are E and D positively related, not related or negatively related?
(b) Specify the odds ratio relating E and D.
(c) Explain why the relative risk of disease D with or without exposure E cannot be calcu-
lated with this information alone. What additional information is required to find the
risk ratio?

3.6 The Chinese Mini-Mental Status Test (CMMS) is a test consisting of 114 items intended to
identify people with Alzheimer’s disease and senile dementia among people in China. Low
test scores are taken to indicate the presence of dementia. An extensive clinical evaluation was
performed of this instrument, whereby participants were interviewed by experts and definitive
diagnosis of dementia was made. The table below shows the results obtained on a group of
people from an old-peoples’ home.
Expert diagnosis
CMMS score Nondemented Demented
0–5 0 2
6–10 0 1
11–15 3 4
16–20 9 5
21–25 16 3
26–30 18 1
Total 46 16
Suppose a score of ≤ 20 on the test is used to identify people with dementia. Assume that the data above are representative of the underlying probabilities.
(a) What is the sensitivity of the test?
(b) What is the specificity of the test?
(c) If 1% of a community has dementia, what is the ppv for the test?
(d) How would these values change if the threshold score changed to 15? Comment.
3.7 The level of prostate-specific antigen (PSA) in the blood is frequently used as a screening test
for prostate cancer. A report gives the following data regarding the relationship between a
positive PSA test (> 5 ng/dL) and prostate cancer.
PSA test result Prostate cancer Frequency
+ + 92
+ − 27
− + 46
− − 568
i. Use these data to estimate the sensitivity, specificity and positive predictive value of the test.
ii. How might these data have been obtained?
3.8 Suppose that among males aged 50–59 the Prostate Specific Antigen (PSA) level is given by the
following graphs, according to whether the individual has prostate cancer or does not. These
graphs give the cumulative probability F (x) = Pr(P SA 6 x). This is called the cumulative
distribution function and is equivalent to a population cumulative relative frequency.
[Figure: cumulative distribution functions F(x) for the non-cancer and cancer groups, the non-cancer curve lying above the cancer curve.]

x 3 4 5 6 7 8
FN (x) 0.140 0.400 0.800 0.950 0.990 0.997
FC (x) 0.003 0.010 0.040 0.100 0.250 0.600
Suppose we choose to say the PSA test is “positive”, if the PSA level is greater than ℓ, i.e.
P = {PSA > ℓ}. Assume that the prevalence of prostate cancer in this age group is 20%.
Find the sensitivity, specificity, positive predictive value, percentage false-positive and percent-
age false-negative for ℓ = 4, 5, 6, 7.
Discuss the effects of these different levels. How would you choose what is “best”?
The ROC curve plots sn against 1−sp (true positive vs false positive). Sketch the ROC curve.

Chapter 4

PROBABILITY DISTRIBUTIONS

“It has long been an axiom of mine that the little things
are infinitely the most important.” Sherlock Holmes, A Case of Identity, 1892.

4.1 Random variables

A random variable then is a numerical outcome of a random procedure. Here “random”


simply means uncertain: before the procedure is carried out and we make the observation,
we do not know what its value will be. A random variable might be a count, or a measure
on a continuous scale, or a zero-one variable, or a proportion, or an average, or something
else.
For example:
• an individual is treated: Z = 1 or 0, if the individual’s condition improves or not;
• a community is observed for ten years: U = number undergoing heart surgery in that
time;
• patient diagnosed with cancer: X = survival time;
• ten individuals have their blood pressure measured: Y = average blood pressure
reading.
The set of possible values of X is called the sample space for X.

More words on the abstract


In a long run of repeated samples, the value of the random variable is thought to follow
some rule of probability, which may be described by some mathematical relationship. This
defines the distribution of the random variable. The notion of a distribution is quite a deep
one. We have already seen many distributions of data in Chapter 2; these are empirical
distributions, constructed from observed data. What we are now considering are theo-
retical distributions for random variables. The connection between the two is a reminder


of the reciprocal nature of probability and inference and is captured well in the following
diagram:

[Figure: Empirical versus theoretical distributions, from Wild C. “The concept of distribution.” Statistics Education Research Journal 2006; 5:10–25.]

Notice the word “imagine”. In understanding the theory and practice of statistics, it is
necessary to deal with abstractions of various kinds. Ironically, often these abstractions
represent what we believe or hope is reality; but we cannot observe it directly. There are
many words and phrases used in these notes that entail this notion of abstraction.
Models and distributions are abstract. A problem might ask you to assume that the random
variable X has a particular distribution. This is because inference is only possible in a
framework that has some understanding of what random process generated the data. If we
want to make an inference about an unknown population proportion, then we know how
to quantify the uncertainty if the sample has been generated from a Binomial model. Of
course models, and abstractions more generally, may or may not be true. So for a particular
data set, we always need to ask ourselves, at least implicitly: how reasonable is the model?
and, more subtly: how wrong will my inference be if the model is not reasonable? But
we can get nowhere without assuming something abstract about the underlying probability
structure.
In any research project or experiment, anything we measure will be a random variable.
The randomness might arise because of the sampling procedure (i.e. which individuals are
included in the sample), or because of measurement error, or because of variation within
individuals.

EXAMPLE 4.1.1: (blood pressure)


Blood pressure is different for different individuals. It varies for an individual
from day to day and even from hour to hour. The measured blood pressure
depends on the accuracy of the measuring instrument.
We seek to describe a model for the random variable, which is supposed to represent the
population which generates the observations. We start with a very simple example.

EXAMPLE 4.1.2: Let X = number of heads obtained in three tosses of a fair coin.

By enumeration of the eight equally likely outcomes (hhh, hht, . . . , ttt), we find
that
Pr(X = 0) = 1/8, Pr(X = 1) = 3/8, Pr(X = 2) = 3/8 and Pr(X = 3) = 1/8.
It is necessary to distinguish between two types of random variables:
• Discrete random variables
• Continuous random variables

4.1.1 Discrete random variables

Discrete random variables are ones which can only take some values; almost always, they
are based on counts of some sort. The word “discrete” is used here to mean “separate,
distinct”. The number of children in a family is an example of a discrete random variable.
The distribution of a discrete random variable can be defined by specifying the probabilities
corresponding to each possible value that the random variable may take.
The probabilities in the distribution of a discrete random variable must be all non-negative,
and they must add to 1.
A specific example of the distribution of a discrete random variable is shown below. The
height of the spike at an x value shows the probability of observing that value. For example,
we see that the probability that this random variable takes the value 10 is about 0.15.

4.1.2 Continuous random variables

Continuous random variables can take any value within the range of possible values. The
distribution of a continuous random variable is defined by specifying a curve which relates
the height of the curve at any particular value to the chance of an observation close to that
value. This curve is called the probability density function.

Formally, the chance that a continuous random variable takes a value in an interval be-
tween two points a and b is the area under the curve between a and b, as shown above.

Why can’t we use the discrete random variable approach for a continuous random vari-
able? We may ask about the probability that a continuous random variable takes the value
12. But . . . what do we mean by that? Remember that it can take any value in a given
range, so it can be 11.9, or 12.26, or 11.607, etc. A reasonable way of giving an answer to
the probability required is to suggest that what is meant by “12” in this case is “12, to the
nearest whole number”. This means a number between 11.5 and 12.5; and now we are
talking about an interval again: a narrow interval perhaps, but an interval all the same. If
we insist that we want the probability that a continuous random variable takes the value
12 exactly, that is, 12.00000000000000000000000. . . , then this is equal to zero. Note: the area
between 12 and 12 under the graph is zero!
The probability density function must be non-negative, and the total area under its graph
must be 1.

4.1.3 Comparison of discrete and continuous random variables

Discrete: probability mass function, pmf           Continuous: probability density function, pdf
p(x) = Pr(X = x)                                   f(x) dx = Pr(x − ½dx < X ≤ x + ½dx)
p(x) ≥ 0,  Σ p(x) = 1                              f(x) ≥ 0,  ∫ f(x) dx = 1
Pr(a ≤ X ≤ b) = Σ_{x=a}^{b} p(x)                   Pr(a ≤ X ≤ b) = ∫_a^b f(x) dx
Pr(X ≤ x) = Σ_{u=0}^{x} p(u)                       Pr(X ≤ x) = ∫_{−∞}^{x} f(u) du
(e.g. Binomial, Poisson, . . . )                   (e.g. Normal, . . . )

DEFINITION 4.1.1. The cumulative distribution function (cdf) of a random variable X


is
F (x) = Pr(X 6 x).
This applies for any random variable, discrete or continuous.

For a discrete random variable the cdf is a step function. For a continuous random variable it is a continuous function.

Properties of the cdf (for discrete or continuous)


1. 0 ≤ F(x) ≤ 1; [ F(x) is a probability ];
2. F(−∞) = 0, F(∞) = 1; [ (X ≤ −∞) is impossible, (X ≤ ∞) is certain ];
3. Pr(a < X ≤ b) = F(b) − F(a), if a < b; [ (X≤a) ∪ (a<X≤b) = (X≤b) ];
4. F(x) is non-decreasing; [ F(b) − F(a) ≥ 0 for b > a ];
5. F(x) is continuous on the right; [ (X ≤ x+h) → (X ≤ x) ];
6. Pr(X = x) = jump in F at x; [ (X ≤ x−h) → (X < x) ].
Connection between cdf and pmf, pdf
Both the pmf and pdf relate to the increase in the cdf, but the increase is lumpy for discrete
and smooth for continuous.
The pmf is specified by the size of the jumps in the cdf.
The pdf is specified by the gradient of the cdf graph: the pdf and cdf are a derivative-
antiderivative pair:
f(x) = F′(x)   and   F(x) = ∫_{−∞}^{x} f(u) du.

EXAMPLE 4.1.3: Suppose that the continuous random variable X has cdf given by

F(x) = e^x / (1 + e^x)   (−∞ < x < ∞).

F(2) = Pr(X ≤ 2) = e^2 / (1 + e^2) = 0.8808.

As there are no jumps in the cdf (it is continuous), Pr(X < 2) = Pr(X ≤ 2) = F(2) = 0.8808.
The probability that X lies between −1 and 2 is given by:
Pr(−1 < X < 2) = F(2) − F(−1) = 0.8808 − 0.2689 = 0.6119.
The probability that X is greater than 3 is Pr(X > 3) = 1 − F(3) = 1 − 0.9526 = 0.0474.
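This cdf is that of the standard logistic distribution, so the calculations can be checked in R with plogis():

> plogis(2)               # F(2)
[1] 0.8807971
> plogis(2) - plogis(-1)  # Pr(-1 < X < 2)
[1] 0.6118557
> 1 - plogis(3)           # Pr(X > 3)
[1] 0.04742587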

EXAMPLE 4.1.4: (R simulation)


R can be used to generate realisations of random variables having a range of
distributions, some of which we will consider later.

1000 observations were generated on a discrete random variable, with results:


5, 5, 5, 4, 6, 3, 9, 6, 7, 1, . . .;
and 1000 observations were generated on a continuous random variable, with
results:
10.44, 4.83, 6.39, 5.02, 8.68, 6.12, . . ..1

The following graphs were obtained for the empirical cdf in each case (using
ecdf(). . . ). These are plots of the cumulative relative frequency (see Chapter 2).2

1 The numbers were actually generated to many more decimal places: for example, x1 = 10.441598 . . ..
2 The distributions used were Poisson with λ = 5.6; and Normal with µ = 8, σ = 2.

[Figure: empirical cdfs of the two samples of 1000: a step function for the discrete (Poisson) sample, and a near-continuous curve for the continuous (Normal) sample.]

For samples of size 1000, the graphs resemble quite closely what is expected of
the population cdf (i.e. a step function on the one hand, and a continuous curve
on the other). To generate similar graphs one may use

> x <- rpois(1000, lambda=5.6) # generate 1000 Poisson observations


> plot(ecdf(x)) # empirical cdf
> y <- rnorm(1000, mean=8, sd=2) # generate 1000 Normal observations
> plot(ecdf(y)) # empirical cdf

4.1.4 Quantiles (inverse cdf)

The q-quantile of X, denoted by cq , is such that the probability of observing a value of X


smaller than it, is about q.
• If X is continuous: cq is such that F (cq ) = q.
• If X is discrete, it is not quite so simple: cq is defined as shown in the diagram below.
[Figures: graphs of y = F(x) for a continuous and a discrete random variable, showing how the q-quantile cq is read off by inverting the cdf at height q.]

QUESTION: What happens if q hits a ‘flat bit’?


The median is the 0.5-quantile, i.e. c0.5 .
The quartiles are c0.25 , c0.5 and c0.75 . They divide the distribution into quarters (approx-
imately at least). c0.25 is usually described as the lower quartile, and c0.75 as the upper
quartile.

Deciles (c0.1 , c0.2 , . . . , c0.9 ) and percentiles (c0.01 , c0.02 , . . . , c0.99 ) are other special cases of
quantiles that are used.
Quantiles give a useful description of the distribution that can be readily interpreted: half
the population are less than the median, a quarter are above the upper quartile, while 10%
are above the 90th percentile. Quantiles are not so useful if the distribution is ‘very’ discrete (i.e.
if the distribution has a small number of large jumps).

EXAMPLE 4.1.5: R gives the quantiles for a range of distributions, including the ones we used above. In particular, the cdf Pr(X ≤ x) can be obtained by adding the prefix p to the R name of the distribution of interest, and the quantiles can be obtained by adding the prefix q. For example, for the discrete random variable (Poisson with λ=5.6), R gives, for the 0.75- and 0.25-quantiles:

> qpois(0.75, lambda=5.6) # 0.75-quantile for Poisson distribution


[1] 7
> qpois(0.25, lambda=5.6) # 0.25-quantile for Poisson distribution
[1] 4

Check this against the discrete cdf graph in Example 4.1.4 above. It
follows that c0.75 =7. Similarly, c0.25 =4 and c0.5 =5.

For the continuous random variable (Normal with µ=8, σ=2), R gives:
c0.25 = 6.651, c0.5 = 8.000, c0.75 = 9.349.
Check these values against the continuous cdf graph in Example 4.1.4 above.

EXAMPLE 4.1.6: Suppose that X has cdf F(x) = e^x / (1 + e^x)  (−∞ < x < ∞).

The 0.9-quantile of X, c0.9, is such that e^c / (1 + e^c) = 0.9:

e^c / (1 + e^c) = 0.9 ⇒ e^c = 0.9(1 + e^c) ⇒ e^c(1 − 0.9) = 0.9 ⇒ e^c = 0.9/0.1 = 9.

Thus, c0.9 = ln 9 = 2.197.

This method can be used for any q (between 0 and 1), to give cq = ln(q/(1−q)).
Note that cq = F^{−1}(q), where F^{−1} denotes the inverse function of F.

The median, c0.5 = 0; and the quartiles are c0.25 = ln(1/3) = −1.0986, c0.75 = ln 3 = 1.0986.
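Again, these are the quantiles of the standard logistic distribution, so qlogis() can be used as a check:

> qlogis(0.9)                    # c_0.9 = ln(0.9/0.1)
[1] 2.197225
> qlogis(c(0.25, 0.5, 0.75))     # quartiles
[1] -1.098612  0.000000  1.098612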

4.1.5 The mean

The mean of a random variable X, which we denote by µ or E(X), is the weighted average
of values that X can take, where the weights are provided by the distribution of X. It is
at the “centre of mass” of the distribution. Sometimes the term “expectation of X” is used,
which is where the notation E(X) originates (E for Expectation).
Recall that p(x) denotes the probability mass function (pmf) for a discrete random variable
and that f (x) denotes the probability density function (pdf) for a continuous random vari-
able. The following definition gives a mathematical expression for E(X) for both discrete
and continuous random variables:

DEFINITION 4.1.2.
1. For discrete random variables,

   E(X) = Σ x p(x).

2. For continuous random variables,

   E(X) = ∫ x f(x) dx.

Important properties of the mean


1. E(X) is often denoted by µX or µ. It is also called the expected value, the mean value,
or the mean.
2. The mean is a centre of mass.
3. If the pmf or the pdf is symmetrical, the mean is on the axis of symmetry.
4. The mean need not be a possible value of the random variable.
5. The expectation is not the value that we expect to observe. The mean is not the most
likely value. It need not even be near the most likely value. Mostly though X is
“around about” its mean.
6. The most important property of expectation — and the one that makes expectation
the pre-eminent measure of location — is additivity:
E(X + Y ) = E(X) + E(Y )
and this is true for any random variables.
It follows that “The mean of a sum is the sum of the means”:
E(X1 + · · · + Xn ) = E(X1 ) + · · · + E(Xn ).
7. E(a + bX) = a + b E(X);
but for other (non-linear) functions, in general E(g(X)) ≠ g(E(X)).

4.1.6 The variance and the standard deviation

The most useful measure of spread is the variance (and its square root the standard devia-
tion).
The variance of a random variable X, which we denote by σ 2 or var(X), is the weighted
average of squared deviations from the mean of X, where the weights are provided by the
distribution of X.
Mathematically, it is defined as follows:

DEFINITION 4.1.3. The variance of X is

var(X) = E[(X − µ)²] = E(X²) − µ²,

where µ = E(X).

The variance is a measure of spread since the more widespread the likely values of X, the
larger the likely values of (X − µ)2 and hence the larger the value of var(X).
The standard deviation of a random variable X is the square root of the variance and is
denoted by sd(X):

DEFINITION 4.1.4. The standard deviation of X is

sd(X) = √var(X).

Important properties of the variance and standard deviation


1. var(X) is often denoted by σ²_X or σ².
2. var(X) ≥ 0, since (X − µ)² ≥ 0.
3. sd(X) = √var(X) ≥ 0. sd(X) is often denoted by σX or σ.
4. var(a + bX) = b² var(X); sd(a + bX) = |b| sd(X).
5. If X has mean µ and variance σ², then Z = (X − µ)/σ has mean 0 and variance 1. Z is
called a standardised random variable.
6. The mean and variance do not specify the distribution; they just give some idea of its
location and spread.
7. Pr(µ − 2σ < X < µ + 2σ) ≈ 0.95.
8. The fundamental reason for the importance of the variance is that, like the mean, it
is additive. This additivity does not hold for all random variables however. It does
hold for independent random variables:
If X and Y are independent
var(X + Y ) = var(X) + var(Y ).
This result extends to n independent variables,
i.e. “the variance of a sum is the sum of the variances”:
var(X1 + · · · + Xn ) = var(X1 ) + · · · + var(Xn ).
9. Standard deviation is not additive.

EXAMPLE 4.1.7: Suppose that X has a uniform distribution on (0,1), i.e. the pdf
of X is given by f (x) = 1, (0<x<1):

Note: Such a random variable is often called a “random number”. We have pre-
viously used such random variables in randomisation. They can be generated
in R using runif(). For example 10 random numbers between 0 and 1 can be


generated as follows:

> runif(10) # 10 random numbers between 0 and 1


[1] 0.86 0.49 0.46 0.59 0.93 0.59 0.65 0.87 0.40 0.74

The mean of X, E(X) = 0.5, by symmetry.


Note: This could be calculated using E(X) = ∫_0^1 x · 1 dx = [x²/2]_0^1 = 1/2.

The variance of X is given by

var(X) = E[(X − 0.5)²] = ∫_0^1 (x − 0.5)² · 1 dx = [⅓(x − 0.5)³]_0^1 = 0.5³/3 − (−0.5)³/3 = 1/12.

And so sd(X) = √(1/12) = 0.289.
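A quick simulation check in R (any large sample size will do):

> u <- runif(1000000)     # a large sample of 'random numbers'
> mean(u)                 # should be close to 1/2
> var(u)                  # should be close to 1/12 = 0.0833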

EXAMPLE 4.1.8: If X and Y are independent random variables with standard deviations 2 and 3 respectively, find the standard deviation of X+Y.

var(X+Y) = var(X) + var(Y) = 2² + 3² = 13,
so sd(X+Y) = √13 = 3.61 [ ≠ 5 = 2 + 3 = sd(X) + sd(Y) ].

QUESTION: What is the standard deviation of X − Y?

EXAMPLE 4.1.9: (dice sum)

If a fair die is rolled, the result has pmf:

x      1    2    3    4    5    6
p(x)   1/6  1/6  1/6  1/6  1/6  1/6

This distribution has mean 7/2 and variance 35/12.

If the die is rolled 24 times, the total score obtained is

T = X1 + X2 + · · · + X24,

where Xi denotes the result of the ith roll; X1, X2, . . . , X24 are independent and identically distributed random variables, each with the above pmf.

The total score could take any value in {24, 25, . . . , 144}. What are the likely values?

E(T) = 7/2 + 7/2 + · · · + 7/2 = 24 × 7/2 = 84;
var(T) = 35/12 + · · · + 35/12 = 24 × 35/12 = 70,

so that sd(T) = √70 = 8.37.

Hence, with probability about 0.95, 68 ≤ T ≤ 100, since 84 ± 2×8.37 = (67.3, 100.7).
QUESTION: What would the distribution of T look like?
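A simulation gives a good hint; a minimal R sketch (object names are ours):

> t <- replicate(10000, sum(sample(1:6, 24, replace=TRUE)))  # 10000 realisations of T
> mean(t); sd(t)          # should be close to 84 and 8.37
> hist(t)                 # roughly symmetric and bell-shaped, centred near 84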

4.1.7 Describing the probability distribution

The probability distribution of X is specified by p(x) or f (x), but the information conveyed
is not always easily understood.
For example, p(x) = e^{−25} 25^x / x!  (x = 0, 1, 2, . . .) means little without some evaluation. If
on the other hand we say that X is 95% likely to be in the range 25 ± 10, this information is
more readily grasped.

What we want is a simple summary of the probability distribution: indicators of important


features of the distribution. As with a sample, the most important features of a distribution
are location and spread. There are other features which could be described: e.g., skewness
and kurtosis.
Measures of location
The mean (or expectation) of a random variable is a measure of location or the centre of the
distribution. It is the most important measure of location, but not the only one.
Another important measure of location is the median, m = c0.5 (X). The median is the
0.5-quantile.
Another measure, which is actually not a measure of location, is the mode. It is a measure
of what is most likely (the ‘mode’ is the fashion) rather than the centre of the distribution.
Despite the fact that it’s not a location measure, the mode is often included with the mean
and the median, and we do too. It is a simple indicator, though not a very useful one.
The mode, M, is the value that maximises the pmf or the pdf:
i.e. f (M) > f (x), for all x; or p(M) > p(x), for all x.
It is possible to have more than one mode.

EXAMPLE 4.1.10: f(x) = 2x  (0 < x < 1)

µ = 2/3 (centre of mass)
M = 1 (from the graph of f)
m = 1/√2 (m is such that ½ × m × 2m = ½)

EXAMPLE 4.1.11: If X has pmf given by:


x 0 1 2 3 4
p(x) 0.3 0.2 0.2 0.2 0.1
then M = 0, m = 1.5 and µ = 1.6.
The mean, median and mode can be described in physical terms:
mean ↔ centre of mass
median ↔ approx half mass either side
mode ↔ point of greatest (density or mass)
If the probability distribution is symmetrical and unimodal, then mean = median = mode.
Otherwise a rough rule is that they occur as in the dictionary — with median between mean
and mode, but nearer to mean.
EXERCISE. Sketch a pdf for which mode > median > mean.
Measures of spread
The most important and useful measure of spread is the variance (and its square root, the
standard deviation).
The other measure of spread we use is the interquartile range, τ = c0.75 − c0.25 .
This is a commonly used measure for long-tailed distributions, as it is not affected by the
tails. It is always finite, whereas the variance may be infinite. The interquartile range
also has the advantage of easy interpretation: it is the width of the interval containing the
“middle 50%”.

EXAMPLE 4.1.12: Sketch a pdf for which the mean is 65 and the standard devia-
tion is 10.

The first is the standard symmetrical graph with 2.5% below 45 (= 65 − 2×10) and
2.5% above 85 (= 65 + 2×10). The second is positively skew, with most of the 5% above
85. But both these pdfs have µ=65 and σ=10.
EXERCISE. Sketch a pdf for which the quartiles are 20, 30 and 50.

4.2 Independent trials

4.2.1 Introduction
A Bernoulli trial is a random experiment with two possible outcomes: “success” and “fail-
ure”. We let p = Pr(success) and q = Pr(failure), so that p + q = 1. We assume 0 < p < 1.
We consider a random experiment consisting of a sequence of independent Bernoulli trials,
observing the result of each trial.
Examples include:
• coin tossing, die rolling, firing at a target;
• sampling with replacement;
• a medical procedure applied to each of a number of individuals;
• any repeatable random experiment, with “success” = any specified event A; then
“failure” = A′ , and p = Pr(A).
Let Sk = “success at the kth trial”, and Fk = “failure at the kth trial” = Sk′ .
Then Pr(Sk ) = p, for k = 1, 2, 3, . . .
Note that Sk and Fk are mutually exclusive, while Sk and Sl (k ≠ l) are independent.

4.2.2 Binomial distribution

DEFINITION 4.2.1. Let X be the number of successes in n trials where the probability of success on each trial is p. Then X has a binomial distribution with parameters n and p, and we write X =d Bi(n, p). The pmf of X is given by

p(x) = Pr(X = x) = (n x) p^x q^(n−x), for x = 0, 1, 2, . . . , n,

where q = 1 − p and (n x) denotes the binomial coefficient "n choose x".
To show this, we observe that one way of obtaining x successes is:

S S . . . S F F . . . F
←— x —→ ←— n−x —→

The probability of this sequence is p^x q^(n−x). But X = x for any ordering of this sequence. There are (n x) ways of arranging the x S's and the n−x F's; and for each arrangement the probability is p^x q^(n−x).
Note that:
1. p(x) ≥ 0;
2. Σ_{x=0}^{n} p(x) = 1 (using the binomial theorem).
EXAMPLE 4.2.1: A machine is producing capsules such that the probability that any capsule is defective is 0.01, independently of the others.

Find the probability that at most one of the next ten capsules produced is defective.

trial = production of a capsule (assumed independent);
success = defective capsule;
probability of success, p = 0.01; and
number of trials, n = 10.

X =d Bi(10, 0.01); and therefore
Pr(X = x) = (10 x) 0.01^x 0.99^(10−x), for x = 0, 1, 2, . . . , 10.

∴ Pr(X ≤ 1) = 0.99^10 + 10×0.01×0.99^9 = 0.904382 + 0.091352 = 0.9957.
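The same calculation in R:

> pbinom(1, size=10, prob=0.01) # Pr(X <= 1) = 0.9957
> sum(dbinom(0:1, size=10, prob=0.01)) # the same, via the pmf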
Binomial probabilities are generally tedious to calculate by hand, and mostly that is unnecessary. Binomial pmf and cdf values can be obtained in R using dbinom() and pbinom(), respectively. There are tables of the binomial pmf among the Statistical Tables for n ≤ 20.
EXAMPLE 4.2.2: Suppose that X =d Bi(35, 0.3). The graph of the pmf is shown below.

[Figure: bar graph of the pmf of Bi(35, 0.3)]
Using R:

> pbinom(15, size=35, prob=0.3) # cdf of binomial distribution
[1] 0.9641
> pbinom(9, size=35, prob=0.3) # cdf of binomial distribution
[1] 0.3646

Pr(X = 10) = 0.1454, Pr(10 ≤ X ≤ 15) = F(15) − F(9) = 0.9641 − 0.3646 = 0.5995.
DEFINITION 4.2.2. Suppose that X =d Bi(n, p). Then
1. E(X) = np; and
2. var(X) = npq,
where q = 1 − p.
This is proved as follows. If X denotes the number of successes in n independent Bernoulli trials with probability of success p, then:
X = Z1 + Z2 + · · · + Zn,
where the Zi are independent and identically distributed with pmf pZ(0) = q, pZ(1) = p; Zi is the number of successes at the ith trial, which must be either 0 or 1: 0 for a failure and 1 for a success.
E(Zi) = p,
var(Zi) = (0−p)²q + (1−p)²p = p²q + q²p = pq(p + q) = pq, since p + q = 1.
E(X) = E(Z1) + · · · + E(Zn) and var(X) = var(Z1) + · · · + var(Zn),
∴ E(X) = p + · · · + p = np, and var(X) = pq + · · · + pq = npq.
EXAMPLE 4.2.3: If X =d Bi(100, 0.4), find the mean and standard deviation of X.
E(X) = 100×0.4 = 40; var(X) = 100×0.4×0.6 = 24; sd(X) = √24 = 4.899.
EXAMPLE 4.2.4: A cohort study is proposed, following 1200 individuals over a ten-year period. On the basis of population figures, it is expected that over the ten-year follow-up period, 7% will develop blood-pressure problems. Assume that each individual has probability 0.07 of developing a blood-pressure problem. Let X denote the number of individuals in the study who do develop blood-pressure problems. Specify the distribution of X and hence find an approximate 95% probability interval for X.

X =d Bi(1200, 0.07)
⇒ E(X) = 84, sd(X) = 8.84,
⇒ approx 95% probability interval: 66.3 < X < 101.7, i.e. 67 ≤ X ≤ 101.
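A quick check in R; qbinom() gives the corresponding exact binomial quantiles:

> n <- 1200; p <- 0.07
> c(n*p, sqrt(n*p*(1-p))) # mean 84, sd 8.84
> qbinom(c(0.025, 0.975), size=n, prob=p) # exact 2.5% and 97.5% quantiles, close to 67 and 101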
EXAMPLE 4.2.5: Suppose that with the standard treatment, the five-year recurrence rate of a particular cancer is 30%. A new treatment is applied to 100 individuals with the cancer. Assuming that the new treatment has the same effect as the standard treatment (◦◦), what is the distribution of the number who are cancer-free (i.e. no recurrence) after five years?

Let X denote the number of individuals who are cancer-free after five years; then X =d Bi(n=100, p=0.70).

Using R, we find Pr(X ≥ 80) = 0.0165. Thus, if we observed that 80/100 were cancer-free after five years with the new treatment, we would suspect that the recurrence rate was actually less than 30%, and that the new treatment was better than the standard treatment.
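The tail probability quoted above comes from the binomial cdf:

> 1 - pbinom(79, size=100, prob=0.7) # Pr(X >= 80) = 0.0165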
4.3 Poisson process

4.3.1 Introduction

A Poisson process is a continuous-time version of a sequence of Bernoulli trials. It is a process in which "events" occur randomly in time. We use "event" differently here: it denotes the occurrence of some random phenomenon. And it can be applied to "events" occurring in space as well as time. For example:
process                                   "event"
radioactive decay                         arrival of particle
telephone exchange                        arrival of call
disease occurrence                        individual develops disease
production of material                    occurrence of flaw
  (thread, plate, solid)
distribution of organisms in a region     organism
Pr("event" in (t, t + dt)) = α dt, where α = rate of the process.

The Poisson process is the continuous-time analogue of independent trials: the time interval (0, t) is divided into n intervals, each of length δt = t/n, with
trial = interval, success = "event", p ≈ α δt.
The case that we consider most often is the disease occurrence process, which we consider
at length below. In that case, the rate of the process corresponds to the incidence rate.
4.3.2 Poisson distribution

DEFINITION 4.3.1. The Poisson distribution with rate parameter λ is defined by the pmf

p(x) = e^(−λ) λ^x / x!, for x = 0, 1, 2, . . .

If a discrete random variable X has a Poisson distribution we say that X =d Pn(λ).

In R, the pmf and cdf are given by the functions dpois() and ppois(), respectively. There are tables of the Poisson pmf in the Statistical Tables (Table 3).
EXAMPLE 4.3.1: Suppose that X =d Pn(5.6). The graph of the pmf is shown below.

[Figure: bar graph of the pmf of Pn(5.6)]

Using the computer, or Table 3:
Pr(X = 2) = 0.0580, Pr(4 ≤ X ≤ 6) = 0.1515 + 0.1697 + 0.1584 = 0.4796.
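In R:

> dpois(2, lambda=5.6) # Pr(X = 2) = 0.0580
> sum(dpois(4:6, lambda=5.6)) # Pr(4 <= X <= 6) = 0.4796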
DEFINITION 4.3.2. Suppose that X =d Pn(λ). Then
1. E(X) = λ; and
2. var(X) = λ.
EXAMPLE 4.3.2: If X =d Pn(20), then E(X) = 20 and sd(X) = 4.47:
Pr(12 ≤ X ≤ 28) ≈ 0.95.
Note: the exact probability is Pr(12 ≤ X ≤ 28) = 0.944 (using R).

If X =d Pn(200), then E(X) = 200 and sd(X) = 14.1:
Pr(172 ≤ X ≤ 228) ≈ 0.95.
Note: the exact probability is Pr(172 ≤ X ≤ 228) = 0.956 (using R).
DEFINITION 4.3.3. Let X(t) be the number of "events" in (0, t) and suppose that α is the expected number of events per unit time. Then X(t) =d Pn(αt).

Proof, by considering the limit of a sequence of independent trials:

Pr(X(t) = k) = lim_{n→∞} (n k) (αt/n)^k (1 − αt/n)^(n−k)
= lim_{n→∞} [n(n−1)· · ·(n−k+1)/n^k] × [(αt)^k/k!] × (1 − αt/n)^n (1 − αt/n)^(−k)
= [(αt)^k/k!] e^(−αt),

since n(n−1)· · ·(n−k+1)/n^k → 1, (1 − αt/n)^(−k) → 1, and (1 + a/n)^n → e^a as n → ∞.

Similarly, the number of "events" in any interval of length t, i.e. an interval (s, s + t) for any s ≥ 0, is distributed as Pn(αt).
EXAMPLE 4.3.3: (radiation counts)
Particles are counted at a rate of α = 2.5 per second.
Let Z = number arriving in five minutes.
Z = X(300) =d Pn(750), since 2.5 × 60 × 5 = 750.
So E(Z) = 750 and sd(Z) = √750 = 27.4.
Hence Pr(695 ≤ Z ≤ 805) ≈ 0.95.
> ppois(805, lambda=750) - ppois(695, lambda=750) # difference in cdfs
[1] 0.9554147
EXAMPLE 4.3.4: The occurrence of malfunctions in a pacemaker can be described by a Poisson process with rate α = 0.05 faults per year. Find the probability that a pacemaker has no malfunctions in five years.

If T = time gap between faults (in years), then
Pr(T > 5) = Pr(no faults in five years) = Pr(X(5) = 0) = e^(−0.05×5) = e^(−0.25) = 0.779.

> dpois(0, lambda=0.05*5) # Poisson pmf
[1] 0.7788008
4.3.3 Incidence rate
The incidence rate is the rate at which the disease occurs. We can model this as a Poisson
process: an “event” is one individual contracting the disease:
Pr(one individual contracts the disease in (t, t+dt)) = αdt.
Here, the time, t, denotes the time for one person (i.e. person-time). When we come to
dealing with the population, we need to add up all the person-times. Observing one person
for ten years is taken to be equivalent to observing ten people for one year. In this context,
time is the number of “person-years” of follow-up of the population. For example, in a 30
year study, if an individual leaves the study after five years, then that person contributes
only 5 “person-years”.
Now, using the result for a Poisson process, we have
X(t) =d Pn(αt),
where X(t) denotes the number of cases in a population followed up for a total of t person-years, and α is the incidence rate (cases per year).
Note that incidence rate has dimension "cases per unit time" or [case] time⁻¹.
Incidence rates effectively treat one unit of time as equivalent to another regardless of
which person they come from or when they occurred. Incidence is usually concerned
with “once-only” events, i.e. events that can occur only once: for example, death due to
leukæmia. Events other than death can be made “once-only” by considering the first oc-
currence of the event. For example, we consider the occurrence of the first heart attack in
an individual and ignore (or study separately) second and later heart attacks.
An individual does not contribute to “person-time” after getting the disease. Other indi-
viduals may be observed for less than the period of the study: they may join the study late,
leave early by moving, dying or otherwise becoming ineligible. In our applications the
time we consider is observed “person-time”. Ten people observed for a year, or one person
observed for ten years, or four people observed for 1, 2, 3 and 4 years respectively, are all
equivalent to 10 person-years.
EXAMPLE 4.3.5: A group of factory workers are observed for a total of t = 80 person-years. Suppose that the incidence rate of disease D is 0.015 cases per person-year. Then the number of cases amongst these workers has a Poisson distribution with mean 80×0.015 = 1.2. If 4 cases were observed, then we might be concerned. Why? Because if the incidence rate α = 0.015, then Pr(X ≥ 4) = 0.034 (◦◦). Perhaps α is greater than 0.015 for these workers?
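The tail probability in R:

> 1 - ppois(3, lambda=80*0.015) # Pr(X >= 4) = 0.034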
EXAMPLE 4.3.6: If the incidence rate is 0.0356 cases per person-year, then

incidence rate, α = 0.0356 (cases/person-year)
= 1/28.1 (cases/person-year)
= 1 (case per 28.1 person-years).

Thus, roughly, we expect one case per 28.1 person-years. So the "mean waiting time" until one individual gets the disease is 28.1 years:

mean waiting time = 1 / incidence rate.
A rate must have a time unit. However, incidence rates are often expressed in the form of 50 cases per 100 000 and described as "annual incidence". This is a bit like describing speed as an "hourly distance". More precisely, an annual incidence of 50 cases per 100 000 is 0.0005 cases/person-year.
If the time unit is missing from an incidence rate, assume it is a year. For example (Triola & Triola p.137): "For a recent year in the US, there were 2 416 000 deaths in a population of 285 318 000 people." Thus the annual incidence (the mortality rate) is
2 416 000 / 285 318 000 = 0.0085 (deaths/person-year), or 8.5 per thousand person-years.
This too is a bit unusual, in that we are imagining here that the entire population is ob-
served. In any application, this will usually not be the case. We take a sample and use that
to estimate the population incidence rate (Chapter 5).
Relation between Risk and Incidence rate

It follows from the above result that
Pr(an individual gets the disease in time u) = Pr(X(u) > 0) = 1 − e^(−αu).
If α is small, which it usually is, then 1 − e^(−αu) ≈ αu:
risk of getting the disease in time u ≈ incidence rate × time.
This risk can be interpreted as the probability that one individual gets the disease in a period of time u; or as the proportion of the population who get the disease in a period of time u . . . assuming that the population consists of similar individuals (◦◦).
This approximation is fine provided the risk is small, and there is no population depletion. This is often the case.
EXAMPLE 4.3.7: (disease occurrence)
The incidence rate of disease D in a population is known to be 0.000011 cases/person-year. Consider a population of 200 000 individuals.

The number of cases in a one-year period, X =d Pn(2.2), since mean = α×t = 0.000011×200 000 = 2.2.

Pr(X ≥ 8) = 0.002, so the occurrence of 8 cases would be very unlikely. If we had observed 8 cases, we should be wondering why. Has something changed?
Stratified population

A large population of identical individuals is not very realistic. In many applications we would want to stratify by age and gender. Because the Poisson distribution is additive, this can be accommodated.
DEFINITION 4.3.4. Poisson additivity: If X1, X2, . . . , Xc are independent Poisson random variables, then
X1 + X2 + · · · + Xc
has a Poisson distribution.
It follows that if we are dealing with a stratified population with different incidence rates in each stratum, the total number of cases in the population still has a Poisson distribution:
X1 + X2 + · · · + Xc =d Pn(α1 t1 + α2 t2 + · · · + αc tc).
Thus, it is enough to find the expected number of cases,
λ = α1 t1 + α2 t2 + · · · + αc tc.
The total number of cases has a Poisson distribution with this mean.
EXAMPLE 4.3.8: A sub-population of factory workers is divided into age categories with different incidence rates as follows:

category                 20–29   30–39   40–49   50–59   60–69
incidence rate            0.01    0.02    0.03    0.04    0.05
person-years observed      106     122      91      63      17

The total number of cases observed, X =d Pn(9.6), since the expected number of cases is λ = 0.01×106 + 0.02×122 + 0.03×91 + 0.04×63 + 0.05×17 = 9.60.
Thus, using tables or the computer, we find that Pr(X ≥ 14) = 0.108. So, an observation of 14 cases would not be that uncommon. However, an observation of 17 cases would be unusual, since Pr(X ≥ 17) = 0.010.
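The whole calculation in R:

> rate <- c(0.01, 0.02, 0.03, 0.04, 0.05)
> pyears <- c(106, 122, 91, 63, 17)
> lam <- sum(rate * pyears) # expected number of cases: 9.6
> 1 - ppois(13, lambda=lam) # Pr(X >= 14) = 0.108
> 1 - ppois(16, lambda=lam) # Pr(X >= 17) = 0.010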
Suppose a population is divided into categories A1 , A2 , . . . , Ac with incidence rates
α1 , α2 , . . . , αc . If the proportion of the population in each of the categories are π1 , π2 , . . . , πc
(where π1 + π2 + · · · + πc = 1), then the overall population incidence rate is given by
α = π1 α1 + π2 α2 + · · · + πc αc . This follows from the Law of Total Probability.
4.4 The normal distribution and applications

4.4.1 The normal distribution

DEFINITION 4.4.1. If X has pdf

f(x) = (1/(σ√(2π))) e^(−(x−µ)²/(2σ²)), (x ∈ R)

then we say that X has a normal distribution and we write X =d N(µ, σ²).
DEFINITION 4.4.2. If X =d N(µ, σ²), then:
1. E(X) = µ; and
2. var(X) = σ².

If Z =d N(0, 1), then we say that Z has a standard normal distribution;
then E(Z) = 0 and var(Z) = 1.

The cdf of Z =d N(0, 1) is denoted by Φ(z) = ∫_{−∞}^{z} (1/√(2π)) e^(−t²/2) dt.

Table 5 gives values of Φ(z) for 0 ≤ z ≤ 4.
The standard normal cdf is available on many calculators. It is available in R using the command pnorm().
EXERCISE. If Z =d N(0, 1), use the Tables or calculator or computer to check that:
Pr(Z < 1) = 0.8413; Pr(Z > 0.234) = 0.4075; Pr(−1.5 < Z < 0.5) = 0.6247. In R:
> pnorm(1) # normal cdf at 1
[1] 0.8413447
> 1-pnorm(0.234) # P(Z > 0.234)
[1] 0.4074925
> pnorm(0.5)-pnorm(-1.5) # P(-1.5 < Z < 0.5)
[1] 0.6246553
DEFINITION 4.4.3. Standardisation theorem: If X =d N(µ, σ²), then

Xs = (X − µ)/σ =d N(0, 1).

The standardisation theorem allows us to evaluate normal probabilities using the Tables.
EXAMPLE 4.4.1: If X =d N(10, 5²) then
Pr(X < 8) = Pr((X−10)/5 < (8−10)/5) = Pr(Xs < −0.4) = 0.3446.
EXAMPLE 4.4.2: If X =d N(65, 10²) then:
Pr(50 < X < 70) = Pr((50−65)/10 < Xs < (70−65)/10) = Pr(−1.5 < Xs < 0.5) = 0.6915 − 0.0668 = 0.6247.

Using the computer, this standardisation step can be avoided. Specify µ and σ and the package does the standardisation: in R, specify additional arguments mean and sd in the command pnorm or dnorm. For example, the probability in the previous example can be computed as
> pnorm(70, mean=65, sd=10) - pnorm(50, mean=65, sd=10)
[1] 0.6246553
EXAMPLE 4.4.3: Suppose that Y is an integer-valued variable which is approximately normally distributed: Y ≈d N(65, 10²). Since Y is integer-valued, its distribution is discrete and so it cannot be exactly normal. We define an approximating normal random variable, Y* =d N(65, 10²), which provides an approximation to Y as follows:

Pr(Y = 70) ≈ Pr(69.5 < Y* < 70.5) = 0.7088 − 0.6736 = 0.0352.
Pr(Y ≤ 70) ≈ Pr(Y* < 70.5) = 0.7088.

This adjustment is called the "correction for continuity".
Table 6 gives values of the standard normal inverse cdf, i.e. quantiles of the standard normal distribution — from which we can obtain quantiles for X =d N(µ, σ²) using the standardisation theorem:
cq(X) = µ + σ cq(Xs).
Because the normal distribution is symmetric, cq(N) = −c1−q(N). For this reason, only values for q ≥ 0.5 are tabulated.
EXAMPLE 4.4.4: If X =d N(10, 4) then:
c0.75(X) = 10 + 2×0.6745 = 11.35;  c0.975(X) = 10 + 2×1.9600 = 13.92;
c0.25(X) = 10 − 2×0.6745 = 8.65;  c0.025(X) = 10 − 2×1.9600 = 6.08.

In R we use the command qnorm. For example, the first quantile in the above example can be equivalently obtained by:
> qnorm(0.75, mean = 10, sd = 2) # 0.75-quantile of N(10,4)
[1] 11.34898
> 10 + 2*qnorm(0.75) # using standard normal quantiles
[1] 11.34898
EXAMPLE 4.4.5: Consider a random sample of n = 100 on X =d N(120, 10²). What values are expected (roughly) for the five-number summary from this sample? i.e. specify approximate values expected for minimum, lower quartile, median, upper quartile and maximum.

The sample median will be around the population median, i.e. ĉ0.5 ≈ c0.5 = 120.
Similarly, ĉ0.75 ≈ c0.75 = 120 + 10×0.6745 = 126.7; ĉ0.25 ≈ c0.25 = 120 − 10×0.6745 = 113.3.

We have seen that x(k) ≈ ĉq where q = k/(n+1). It follows that the minimum of a sample of 100, x(1) ≈ ĉq, where q = 1/101.
Therefore x(1) ≈ c0.0099 = 120 − 10×2.330 = 96.7; x(100) ≈ c0.9901 = 120 + 10×2.330 = 143.3.

Of course these are rough approximations, since the data are random, but we can expect values around about these values. Thus the "expected" five-number summary would be:
(96.7, 113.3, 120, 126.7, 143.3).
Random samples, simulated in R, gave the following results:

 Min.   1st Qu.  Median  3rd Qu.  Max.
 98.18  113.90   120.70  127.30   144.40
 87.77  112.20   120.00  127.90   146.30
 97.60  113.20   119.30  127.50   142.20
 91.41  115.00   120.30  126.60   141.10
 99.12  110.80   117.80  126.00   142.60
 92.13  112.30   118.00  126.20   145.40
 89.64  111.90   119.50  125.60   141.30
 92.34  114.10   119.70  127.70   150.40
 91.00  112.70   120.30  126.90   140.80
 97.39  111.70   119.90  128.70   148.20

This gives some idea of what is meant by "rough approximations" in this case. You can generate one such random sample in R using rnorm(100, mean=120, sd=10) and then summary().
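A sketch of the kind of simulation that produces rows like these (the seed is an arbitrary choice, included for reproducibility):

> set.seed(1)
> for (i in 1:10) print(summary(rnorm(100, mean=120, sd=10))) # each line: Min, 1st Qu., Median, Mean, 3rd Qu., Max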
4.4.2 The Central Limit Theorem

The Central Limit Theorem says that the sum of a large number of similarly distributed random variables which are independent, but which may have any distribution, is asymptotically normally distributed.
It is always true that:
if T = X1 + X2 + · · · + Xn, where X1, X2, . . . , Xn are independent observations on X, with E(X) = µ and var(X) = σ², then:

E(T) = nµ,  var(T) = nσ²,  sd(T) = σ√n.

The central limit theorem says that, in addition, if n is large, then T ≈d N(nµ, nσ²).
DEFINITION 4.4.4. Central Limit Theorem: Suppose that X1, X2, . . . , Xn are independent observations on X, where E(X) = µ and var(X) = σ². Also suppose that n is large. Then

T = X1 + X2 + · · · + Xn ≈d N(nµ, nσ²).

This is a really amazing result, and is the fundamental reason for the importance of the normal distribution.
Any variable which can be considered as being composed of the sum of many small influences will be approximately normally distributed.
EXAMPLE 4.4.6: (Sum of "random numbers")
Let T denote the sum of 100 independent "random numbers", i.e.
T = X1 + X2 + · · · + X100,
where each X is equally likely to take any value in (0, 1), and the Xs are independent.
e.g. Tobs = 0.3053 + 0.6344 + 0.7645 + 0.4176 + 0.0162 + · · · + 0.4687.
We saw (page 99) that E(X) = 1/2 and var(X) = 1/12. It follows that
E(T) = 1/2 + 1/2 + · · · + 1/2 = 50,
var(T) = 1/12 + 1/12 + · · · + 1/12 = 100/12,
and therefore
sd(T) = √(100/12) = 2.89.

Further, by the central limit theorem, we have, to a good approximation:
T ≈d N(50, 2.89²).

Hence Pr(50 − 1.96×2.89 < T < 50 + 1.96×2.89) ≈ 0.95,
i.e. Pr(44.34 < T < 55.66) ≈ 0.95.

Note: T is continuous, so there is no need for a continuity correction.

If we add together 100 of these random numbers, independently generated, then there is a 95% chance that the sum lies between 44.34 and 55.66. To most people, this seems like a remarkably narrow interval. It is possible that T could be anywhere between 0 and 100, but it is 95% likely that it is between 44.34 and 55.66.
EXERCISE. Try it out in R: use x <- runif(100) to generate a vector with 100 random values and then find the sum sum(x).
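Repeating the experiment many times shows the Central Limit Theorem at work; a sketch (the 10 000 repetitions are an arbitrary choice):

> sums <- replicate(10000, sum(runif(100)))
> mean(sums); sd(sums) # close to 50 and 2.89
> mean(sums > 44.34 & sums < 55.66) # close to 0.95
> hist(sums) # approximately normal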
This result is an indication of the power of probability. We don’t need to worry about what
is possible, it is more productive to consider what is probable. What is possible is often so
wide-ranging as to be useless; what is probable focuses on what is important.
Also (and not unrelated) the normal distribution often occurs as a limiting case of particular
distributions (Binomial, Poisson, and others).
• Bi(n, p) ∼d N(np, np(1−p)) as n → ∞. (For np ≥ 5 and nq ≥ 5.)
• Pn(λ) ∼d N(λ, λ) as λ → ∞. (For λ ≥ 10.)
The approximation technique is simple for continuous random variables — but there is a
slight complication in approximating a discrete distribution by a continuous distribution.
DEFINITION 4.4.5. If X is integer-valued and X* is an approximating normal random variable, then:
Pr(X = k) ≈ Pr(k − 0.5 < X* < k + 0.5).
This is called the correction for continuity (see the example on page 111).

It is as if X is a "rounded off" version of X*:
Pr(12 ≤ X < 16) = Pr(X = 12, 13, 14, 15) ≈ Pr(11.5 < X* < 15.5).
EXAMPLE 4.4.7: (dice sum, continued . . . see page 100.)
Find the probability of getting a total of at least 100 in 24 rolls of a fair die.

Let T = X1 + X2 + · · · + X24.
E(X) = 7/2, var(X) = 35/12, so E(T) = 84 and var(T) = 70.

T is an integer-valued random variable, the distribution of which is approximated by the normally distributed random variable T* =d N(84, 70):
Pr(T ≥ 100) ≈ Pr(T* > 99.5) = Pr(Ts* > (99.5−84)/√70) = Pr(Ts* > 1.853) = 0.032.

The exact distribution of T is very messy indeed; the normal distribution provides a good approximation.
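A simulation check of the approximation (the repetition count is arbitrary):

> rolls <- replicate(10000, sum(sample(1:6, size=24, replace=TRUE)))
> mean(rolls >= 100) # simulated Pr(T >= 100), near 0.032
> 1 - pnorm(99.5, mean=84, sd=sqrt(70)) # the normal approximation: 0.032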
EXAMPLE 4.4.8: If X =d Bi(40, 0.4), use a normal approximation to find an approximate value for Pr(X ≥ 20).

If X =d Bi(40, 0.4), the approximating normal random variable is X* =d N(16, 9.6).

Pr(X ≥ 20) ≈ Pr(X* > 19.5) = Pr(Xs* > (19.5−16)/√9.6) = Pr(N > 1.130) = 0.1293.

If no correction for continuity is used, the approximation would be
Pr(X ≥ 20) ≈ Pr(X* > 20) = Pr(Xs* > (20−16)/√9.6) = Pr(N > 1.291) = 0.0984.

The correct value, which can be readily obtained from R or a calculator, is Pr(X ≥ 20) = 0.1298.
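The three values in R:

> 1 - pbinom(19, size=40, prob=0.4) # exact: 0.1298
> 1 - pnorm(19.5, mean=16, sd=sqrt(9.6)) # with continuity correction: 0.1293
> 1 - pnorm(20, mean=16, sd=sqrt(9.6)) # without: 0.0984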
The correction for continuity gives a reasonable approximation; ignoring it can give a pretty bad approximation.
If X =d Bi(n, p) and X* =d N(np, np(1−p)), then (for an integer c) we have
Pr(X ≥ c) ≈ Pr(X* > c−0.5) > Pr(X* > c).
Such probabilities are important later, when we use them to obtain p-values. To get them right, we should use a correction for continuity.
For the normal approximation to apply, n needs to be moderately large, but it needs to be very large indeed for the correction for continuity to be unimportant.
The table below gives values for Pr(X ≥ c), where X =d Bi(n, 0.5); the normal approximation without continuity correction, Pr(X* > c); and the normal approximation with continuity correction, Pr(X* > c−0.5), where X* =d N(0.5n, 0.25n).

      n       c    Pr(X ≥ c)   Pr(X* > c)   Pr(X* > c−0.5)
     10       8     0.0547      0.0289        0.0569
    100      58     0.0666      0.0548        0.0668
   1000     524     0.0686      0.0645        0.0686
  10000    5075     0.0681      0.0668        0.0681
 100000   50240     0.0649      0.0645        0.0649
It is seen that the continuity-corrected approximation does quite well for n = 100, and better still for n ≥ 1000. On the other hand, the uncorrected approximation is still a bit out even for n = 100 000.
Note: We don’t actually need to use normal approximations to evaluate these probabilities,
since the correct answer is readily available from the computer or calculator. However,
when we come to use formulae based on normal approximations for confidence intervals
and significance testing, these results suggest that a correction for continuity should be
used in the procedure. Generally, this means an adjustment in the formula.
Similar considerations apply to the Poisson distribution.

EXAMPLE 4.4.9: If X =d Pn(32.4), use a normal approximation to find an approximate value for Pr(X ≥ 40).

The approximating normal random variable is X* =d N(32.4, 32.4).
Pr(X ≥ 40) ≈ Pr(X* > 39.5) = Pr(Xs* > (39.5−32.4)/√32.4) = Pr(N > 1.247) = 0.106.

The uncorrected approximation gives Pr(N > 1.335) = 0.091. The correct value is 0.1086.
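In R:

> 1 - ppois(39, lambda=32.4) # exact: 0.1086
> 1 - pnorm(39.5, mean=32.4, sd=sqrt(32.4)) # with continuity correction: 0.106
> 1 - pnorm(40, mean=32.4, sd=sqrt(32.4)) # without: 0.091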
4.4.3 Linear combinations

If X1, . . . , Xn are independent random variables, with means µ1, µ2, . . . , µn and variances σ1², σ2², . . . , σn² respectively, then
E(a1X1 + · · · + anXn) = a1µ1 + · · · + anµn,
var(a1X1 + · · · + anXn) = a1²σ1² + · · · + an²σn²,
since E(aX) = aµ, var(aX) = a²σ², and the mean and variance are additive.
In particular (using a1 = 1 and a2 = −1 in the general result):
E(X1−X2) = µ1−µ2 and var(X1−X2) = σ1²+σ2².

In addition, if X1, . . . , Xn are normally distributed, then so is
T = a1X1 + · · · + anXn.
While the mean and variance properties are generally true, this distributional property is unique to the normal distribution.
EXAMPLE 4.4.10: X1 =d N(68, 10²), X2 =d N(60, 15²).
Assuming X1 and X2 are independent, we have:
T = 0.5X1 + 0.5X2 =d N(64, 9.0²);
S = 0.2X1 + 0.8X2 =d N(61.6, 12.2²).
EXAMPLE 4.4.11: Suppose that, in a particular population, the height of an adult female X =d N(165, 6²) and the height of an adult male Y =d N(175, 8²). Find the probability that a randomly selected female is taller than a randomly selected male, i.e. find Pr(X > Y).

X−Y =d N(−10, 10²), since 165−175 = −10 and 6²+8² = 10².
Pr(X > Y) = Pr(X−Y > 0) = Pr(Z > 1) = 0.159.
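The same answer directly from the normal cdf in R:

> 1 - pnorm(0, mean=-10, sd=10) # Pr(X - Y > 0) = 0.159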
EXAMPLE 4.4.12: A process requires three stages; the total time taken (in hours) for the process is T = T1+T2+T3, where Ti denotes the time taken for the ith stage. It is known that:
E(T1) = 40, E(T2) = 30, E(T3) = 20; sd(T1) = 3, sd(T2) = 2, sd(T3) = 5.
There is a deadline of 100 hours. Give an approximate probability that the deadline is met, i.e. find Pr(T ≤ 100). Assume that the times are approximately normally distributed. Which stage is most influential in determining whether the deadline is met?

T ≈d N(90, 38), since µ = 40+30+20 = 90 and σ² = 3²+2²+5² = 38 ⇒ Pr(T ≤ 100) ≈ 0.948.

The stage with the greatest variance, i.e. stage 3. To understand this, think about what would happen if sd(T1) = 0: in that case T1 is a constant and hence can have no effect on whether the deadline is met. On the other hand, if sd(T3) is very large, then T3 could be far above its mean, making it very unlikely that the deadline could be met; and similarly T3 could be far below its mean, making it very probable that the deadline is met. So T3 has a big effect on whether the deadline is met.
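In R:

> pnorm(100, mean=40+30+20, sd=sqrt(3^2+2^2+5^2)) # Pr(T <= 100) = 0.948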
Problem Set 4
4.1 The discrete random variable X, with sample space {1, 2, 3, 4, 5, 6}, has pmf
p(x) = (2x−1)/36 (x = 1, 2, 3, 4, 5, 6).
(a) Sketch the graph of the pmf.
(b) Find Pr(X = 2), Pr(X > 4) and Pr(2 < X 6 5).
(c) Sketch the graph of the cdf, and indicate the probabilities found in (b) on your sketch.
4.2 A surgical hospital needs blood supplies. Suppose its daily demand for blood X, in hundreds
of litres, has cdf
F(x) = 1 − (1 − x)⁴ (0 < x < 1).
i. Find the probability that the daily demand exceeds 10 litres.
ii. Find the level at which the blood supply should be kept so that there is only a 1% chance
that the demand exceeds the supply.
4.3 Let X denote the result of a toss of a fair coin: X is the number of heads obtained; so that
X = 1 if a head is obtained, and X = 0 if a tail is obtained.
(a) Consider the random variable obtained by observing the result of one toss and multiply-
ing it by 2. What is the distribution of this random variable? i.e. with what probability is
it equal to 0, 1, 2?
Draw a diagram representing this distribution.
(b) Consider the random variable obtained by observing the result of one toss and adding it
to the result of a second toss. What is the distribution of this random variable? i.e. with
what probability is it equal to 0, 1, 2?
Draw a diagram representing this distribution.
(c) Do these random variables have the same mean? If not, which is bigger?
Do they have the same spread? If not, which is more spread?
The random variable in (a) is 2X , i.e. X+X , the sum of identical variables; while the random
variable in (b) is X1 +X2 , the sum of independent variables, each having the same distribution.
Their distributions are not the same.
When you are concerned with a sum, it will usually be a sum of independent variables, which
is not the same as a sum of identical variables, i.e. not nX .
4.4 Evaluate the following probabilities:
(a) i. Pr(X ≤ 3) for X =d Bi(10, 0.2).
    ii. Pr(3 < X ≤ 7) for X =d Bi(15, 0.6).
    iii. Pr(1 ≤ X ≤ 3) for X =d Bi(6, 0.25).
    iv. Pr(X ≥ 16) for X =d Bi(20, 0.75).
(b) i. Pr(X ≤ 2) for X =d Pn(5.2).
    ii. Pr(X ≥ 1) for X =d Pn(0.9).
    iii. Pr(3 ≤ X ≤ 6) for X =d Pn(4.6).
    iv. Pr(X ≥ 10) for X =d Pn(3.1).
4.5 According to a national survey, 10% of the population of 18–24-year-olds in Australia are left-
handed.
(a) In a tutorial class of 20 students, how many would you expect to be left-handed? What is
the probability that in a class of 20 students, at least four of them are left-handed?
(b) In a lecture class of 400 students, how many would you expect to be left-handed? A
survey result shows that there are actually 60 left-handed students in the class. What is
the probability that in a class of 400 students, at least 60 of them are left-handed?
(c) What have you assumed in these probability calculations?
4.6 The number of cases of tetanus reported in a single month has a Poisson distribution with
mean 4.5. What is the probability that there are at least 35 cases in a six-month period?
4.7 The expected number of deaths due to bladder cancer for all workers in a tyre plant over
a 20-year period, based on national mortality rate, is 1.8. Suppose 6 deaths due to bladder
cancer were observed over the period among the tyre workers. How unusual is this event? i.e.
evaluate Pr(X ≥ 6) assuming the national rate is applicable.
4.8 (a) For X =d N(µ = 50, σ² = 10²), find
    i. Pr(X ≤ 47); ii. Pr(X > 64); iii. Pr(47 < X ≤ 64);
    iv. c, such that Pr(X > c) = 0.95; v. c, such that Pr(X < c) = 0.025.
(b) Use a Normal approximation to evaluate Pr(X ≤ 150) where X =d Bi(1000, 1/6); and check your approximation by obtaining the exact value using R.
4.9 A standard test for gout is based on the serum uric acid level. The serum uric acid level,
L mg/100L is approximately Normally distributed: with mean 5.0 and standard deviation 0.8
among healthy individuals; and with mean 8.5 and standard deviation 1.2 among individuals
with gout.
Suppose we diagnose people as having gout if their serum uric acid level is greater than
6.50 mg/100L.
(a) Find the sensitivity of this test.
(b) Find the specificity of this test.
4.10 A random sample of 100 observations is obtained from a Normally distributed population with mean 240 and standard deviation 40.
Sketch a likely boxplot for these data.
4.11 A medical trial was conducted to investigate whether a new drug extended the life of a patient
with lung cancer.
Assume that the survival time (in months) for patients on the drug is Normally distributed
with a mean of 30 and a standard deviation of 15. Calculate:
i. the probability that a patient survives for no more than one year;
ii. the proportion of patients who are expected to survive for between one and two years;
iii. the time for which at least 80% of the patients are expected to survive;
iv. the expected quartiles of the survival times.
The survival times (in months) for 38 cancer patients who were treated with the drug are:
1 1 5 9 10 13 14 17 18 18 19 21 22
25 25 25 26 27 29 36 38 39 39 40 41 41
43 44 44 45 46 46 49 50 50 54 54 59
The sample mean is 31.1 months and the sample standard deviation is 16.0 months.
Is there any reason to question the validity of the assumption that T =d N(µ=30, σ=15)?
4.12 Two scales are available for measuring weights in a laboratory. Both scales give answers that
vary a bit in repeated weighings of the same item. If the true weight of a compound is 2 grams
(g), the first scale produces readings X that have mean 2.000 g and standard deviation 0.004 g.
The second scale’s readings Y have mean 2.002 g and standard deviation 0.002 g.
(a) What are the mean and standard deviation of the difference, Y −X, between the readings?
(Readings X and Y are independent.)
(b) You measure once with each scale and average the readings. Your result is Z = ½(X+Y).
Find the mean and standard deviation of Z, i.e. µZ and σZ .
(c) Which of the three readings would you recommend: X, Y or Z? Justify your answer.
(d) Assuming X and Y are independent and normally distributed, evaluate:
Pr(1.995 < X < 2.005), Pr(1.995 < Y < 2.005), Pr(1.995 < Z < 2.005).
4.13 In a particular population, adult female height, X =d N(165.4, 6.7²), and adult male height, Y =d N(173.2, 7.1²).
(a) Sketch the pdfs of X and Y on the same graph.
(b) Assuming X and Y are independent, specify the distribution of X−Y and hence find
Pr(X > Y ). This gives the probability that a randomly selected adult female is taller
than a randomly selected adult male.
4.14* Suppose that the survival time after prostate cancer (in years), Y, has a lognormal distribution, Y =d ℓN(2, 1). This means that ln Y =d N(2, 1) (by definition).
(a) Find Pr(Y > 10).
(b) Find the median and the quartiles of Y .
(c) Is the distribution of Y positively skew, symmetrical or negatively skew? Is the mean of
Y greater than, equal to, or less than the median of Y ?
(d) Draw a rough sketch of the pdf of Y .
4.15 Suppose a standard antibiotic kills a particular type of bacteria 80% of the time. A new antibiotic (XB) is reputed to have better efficacy than the standard antibiotic. Researchers propose to try the new antibiotic on 100 patients infected with the bacteria. Using principles of hypothesis testing (discussed in Chapter 6), researchers will deem the new antibiotic "significantly better" than the standard one if it kills the bacteria in at least 88 out of the 100 infected patients.
Suppose (◦◦) there is a true probability (true efficacy) of 85% that XB will work for an individual patient.
(a) Calculate the probability (using R) that the experiment will find that XB is “significantly
better”.
(b) The statistical power is the probability of obtaining a significant result: it is the ability
to discover a better treatment (in this case a better antibiotic). So it’s an indication of the
value of the procedure.
i. Find the statistical power if the true efficacy of a new antibiotic is actually 90%.
ii. What is the power if the true efficacy is 95%?
iii. What is the power if the true efficacy is really 80%? What does this mean?
Chapter 5

ESTIMATION
“You can, for example, never foretell what any one (individual) will do, but you can say with preci-
sion what an average number will be up to.” Sherlock Holmes, The Sign of the Four, 1890.
Chapter 1 provides an indication of where the data we analyse comes from. Chapter 2 tells us
something about what to do with a data set, or at least how to look at it in a sensible way. Chapters
3 & 4 gave us an introduction to models for the data. Now we turn to making inferences about the
models and the populations that they describe. This is the important and useful stuff. Statistical
inference is the subject of the rest of the book. We start with Estimation in this chapter.
[Diagram: population −→ sample; model ←− observations. Ch1: types of studies; Ch2: data description; Ch3 & 4: Probability; Ch5–8: Statistical Inference]
5.1 Sampling and sampling distributions

5.1.1 Random sampling

DEFINITION 5.1.1. A random sample on a random variable X is a sequence of independent random variables X1, X2, . . . , Xn, each having the same distribution as X.

Realisations of the random variables X1, X2, . . . , Xn are denoted by x1, x2, . . . , xn.
EXAMPLE 5.1.1: Mathematically, a random sample of 100 on a normal population, N(µ=10, σ²=2²), consists of the independent random variables:
X1 =d N(µ=10, σ²=2²), X2 =d N(µ=10, σ²=2²), . . . , X100 =d N(µ=10, σ²=2²).

The observed sample consists of realisations of these random variables:
x1 = 11.43, x2 = 8.27, . . . , x100 = 9.19.
From a random sample on X, we wish to make inferences about the distribution of X. We
wish to estimate the pmf (or the pdf) and the cdf; we wish to estimate measures of location
and spread, or other characteristics of the distribution of X.
To do this, we use a statistic:
DEFINITION 5.1.2. A statistic is a function of the sample variates:
W = g(X1, X2, . . . , Xn).

For example, the sample mean:
X̄ = (1/n)(X1 + X2 + · · · + Xn).
Also, the sample median, the sample standard deviation, the sample interquartile range, etc., are statistics which are used as estimators of their population counterparts.
The statistic W is a random variable; its realisation is given by the same function applied to the observed sample values:
w = g(x1, x2, . . . , xn).
For example, the sample mean:
x̄ = (1/n)(x1 + x2 + · · · + xn) = (1/100)(11.43 + 8.27 + · · · + 9.19) = 10.13.
A statistic has a dual role: a measure of a sample characteristic and an estimator of the
corresponding population characteristic.
Each of the statistics we have met can be regarded as an estimator of a population character-
istic, usually referred to as a parameter. The statistic X̄ is an estimator of the parameter µ.
If the statistic W is an estimator of the parameter θ, then in order to make inferences about
θ based on W , we need to know something about the probability distribution of W .
The sample mean X̄ is an estimator of the parameter µ, so to make inferences about µ based
on X̄, we need to know something about the probability distribution of X̄. Thus, we turn
to consideration of the probability distribution of the sample mean.
QUESTION: We take a sample of 100 from a N(10, 2²) population.
What do you think the distribution of X̄ would look like?
What do you think the distribution of S would look like?
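One way to build intuition is by simulation; a sketch (5000 repetitions is an arbitrary choice):

> xbar <- replicate(5000, mean(rnorm(100, mean=10, sd=2)))
> s <- replicate(5000, sd(rnorm(100, mean=10, sd=2)))
> hist(xbar) # centred at 10, with spread 2/sqrt(100) = 0.2
> hist(s) # centred near 2, much less spread out than the data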
5.1.2 The distribution of X̄

population −→ sample
population mean µ ←− sample mean x̄

x̄ gives an estimate of the value of µ. But what other values of µ are likely? plausible? possible?

The answer is provided by the distribution of X̄.
Remember:
mean of a sum = sum of the means
variance of a sum = sum of the variances (for independent random variables)
E(cX) = cE(X) and var(cX) = c2 var(X)
Hence:
E(X̄) = E[(1/n)(X1 + X2 + · · · + Xn)] = (1/n)(µ + µ + · · · + µ) = (1/n)(nµ) = µ,
var(X̄) = var[(1/n)(X1 + X2 + · · · + Xn)] = (1/n²)(σ² + · · · + σ²) = (1/n²)(nσ²) = σ²/n.

E(X̄) = µ and var(X̄) = σ²/n.
Further, from the Central Limit Theorem (which says that the sum of a lot of independent variables is approximately normal) we have:

X̄ ≈d N(µ, σ²/n),

and this approximation applies (for large n) no matter what the population distribution.
EXAMPLE 5.1.2: Sample of n = 100 on X =d N(µ=10, σ²=4): X̄ =d N(10, 4/100).
E(X̄) = 10, sd(X̄) = 0.2; so with probability 0.95, X̄ will be in the interval 10 ± 1.96×0.2, i.e. (9.61, 10.39).
EXAMPLE 5.1.3: Suppose that X has a (non-normal) distribution with mean 55.4 and standard deviation 15.2. A random sample of 180 is obtained. What are the likely values of the sample mean?

E(X̄) = 55.4, sd(X̄) = 15.2/√180 = 1.13. With probability about 0.95: 53.2 < X̄ < 57.6.
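In R:

> 55.4 + c(-1.96, 1.96) * 15.2/sqrt(180) # approximate 95% interval: (53.2, 57.6)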
5.2 Inference on the population mean, µ

The sample mean has a dual role: (1) it indicates the centre of the sample distribution, and (2) it gives an estimate of the population mean. It is this second role that we investigate more closely in this section. The population mean µ is unknown, and we wish to use the data (a random sample on X) to estimate it.
We have seen that, for a random sample of n obtained from a population with mean µ and variance σ² (no matter what the population distribution is):

E(X̄) = µ and var(X̄) = σ²/n.

Further, by the Central Limit Theorem, for a large sample (i.e. large n):

X̄ ≈d N(µ, σ²/n).  (1)
If the population itself is normally distributed, then (exactly):

X̄ =d N(µ, σ²/n).  (2)

It follows that if the population distribution is not too far from normal, then the approximation (1) will hold for quite small values of the sample size n.
EXAMPLE 5.2.1: Suppose a random sample of 20 observations is obtained from a population that is supposed to be normally distributed: N(50, 10²). We observe x̄ = 53.8. Is this plausible? The mean is supposed to be 50. Does this suggest that the population mean is actually more than 50?

If the assumed model is correct, then
E(X̄) = 50, var(X̄) = 10²/20 = 5, so that sd(X̄) = 2.24.

Thus with the supposed population distribution, we would expect that X̄ would be in the interval 50 ± 4.48, i.e. 45.5 < X̄ < 54.5, with probability 0.95. So the observation x̄ = 53.8 is quite in line with what is expected under the proposed model. This result gives us no real reason to question it.
EXAMPLE 5.2.2: A random sample of n = 400 observations is obtained from a population with mean µ = 50 and standard deviation σ = 10.
Specify the approximate distribution of the sample mean and hence find approximately Pr(49 < X̄ < 51).

X̄ ≈d N(50, 10²/400), since n = 400, µ = 50 and σ² = 10²;
Pr(49 < X̄ < 51) = Pr(−2 < X̄s < 2) = 0.9544.
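In R:

> pnorm(51, mean=50, sd=10/sqrt(400)) - pnorm(49, mean=50, sd=10/sqrt(400)) # 0.9544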
Given µ, we can make a statement about X̄:
Pr(µ − 1.96 σ/√n < X̄ < µ + 1.96 σ/√n) ≈ 0.95.

This means that, given x̄, we can make a statement about µ:
Pr(X̄ − 1.96 σ/√n < µ < X̄ + 1.96 σ/√n) ≈ 0.95.

i.e. if X̄ is within ε of µ, then µ is within ε of X̄.
EXAMPLE 5.2.3: A random sample of n = 400 observations is obtained from a population with standard deviation σ = 10. If we observed the sample mean x̄ = 50.8, what are plausible values for the unknown population mean µ?

We have X̄ ≈d N(µ, 100/400), i.e. X̄ ≈d N(µ, 0.5²).

∴ Pr(µ − 1.96×0.5 < X̄ < µ + 1.96×0.5) ≈ 0.95,
i.e. Pr(µ − 0.98 < X̄ < µ + 0.98) ≈ 0.95.

Hence Pr(X̄ − 0.98 < µ < X̄ + 0.98) ≈ 0.95.

So ("95%") plausible values for µ are 50.8 ± 0.98, i.e. (49.82 < µ < 51.78).

If µ = 49.82, then what we observed (x̄ = 50.8) would be just on the upper "plausible" limit for X̄ values.
If µ = 51.78, then what we observed would be on the lower "plausible" limit.
This set of (95%) plausible values is called a (95%) confidence interval for µ.
There is one other problem if this approach is to be used to estimate an unknown population mean µ: if µ is unknown, then often σ will be too. So, in most cases, we don't know the value of sd(X̄).
What to do? If σ is unknown, then we estimate it.
DEFINITION 5.2.1. The estimate of the standard deviation of X̄ is called the standard error of X̄, denoted by se(X̄). It is obtained by replacing the unknown parameter σ by an estimate, i.e.

sd(X̄) = σ/√n ≈ σ̂/√n = se(X̄).

Usually, but not always, σ̂ = s.
EXAMPLE 5.2.4: (an approximate 95% confidence interval)
A random sample of n = 400 observations is to be obtained from a population about which nothing at all is known!

If we observed the sample mean x̄ = 50.8 and sample standard deviation s = 11.0, what are plausible values for the unknown population mean µ?

An estimate of µ is x̄ = 50.8.

The standard error of this estimate is se(x̄) = s/√n = 11.0/20 = 0.55.

This gives some idea of the precision of the estimate. We expect, roughly, with probability about 0.95, that x̄ will be within 1.96×0.55 = 1.1 of µ.

Therefore, we expect (with probability about 0.95) that µ will be within 1.1 of x̄ = 50.8, i.e. (49.7 < µ < 51.9). So this gives a rough 95% confidence interval for µ.
This leads to a recipe for an approximate 95% confidence interval applicable in many situations:

DEFINITION 5.2.2. An approximate 95% confidence interval is given by

est ± "2" se.
5.3 Point and interval estimation

The process of drawing conclusions about an entire population based on information in a sample is known as statistical inference. There are two types of statistical inference:
• Estimation (this chapter)
• Hypothesis testing (next chapter)
In estimation, we want to give an estimate of a population characteristic, e.g. µ or σ² or p or λ. There are two kinds of estimates:
• point estimates
• interval estimates.
DEFINITION 5.3.1. A point estimate of a population characteristic (parameter) is a single number calculated from sample data that represents our "best guess" value of the characteristic based on the data.
For our purposes, the estimates we use are intuitive and obvious. By and large, we use a sample statistic to estimate its population counterpart. The following are the point estimates of µ and σ² that we use for normal populations:
the estimate of the population parameter µ is denoted by µ̂; we choose µ̂ = x̄;
the estimate of the population parameter σ² is denoted by σ̂²; we choose σ̂² = s².
Estimators and estimates

• An estimator is a statistic used to estimate a parameter. It is a random variable.
• An estimate is a realisation of an estimator.
• An estimator, being a random variable, is represented by an upper-case letter, e.g. X̄, S².
• An estimate (a realisation of a random variable) is represented by a lower-case letter, e.g. x̄, s².

In the above, the estimators of µ and σ² are respectively X̄ and S², while the estimates are x̄ and s², respectively.
EXAMPLE 5.3.1: (BMI)
A random sample of 12 first-year university students was selected and for each student, their height (in metres) and their weight (in kilograms) were measured. From these measurements, the body-mass index, BMI = W/h², was obtained for each student. The resulting observations were as follows:
21.2 23.9 22.4 19.2 22.0 27.4
25.1 20.7 19.7 22.3 24.5 21.6
(a) Compute a point estimate for the mean BMI of first-year university students. What statistic did you use to obtain your estimate? [22.5]
(b) Compute a point estimate for σ, the population standard deviation of the BMI of first-year university students. What statistic did you use to obtain your estimate? [2.37]
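Both estimates in R:

> bmi <- c(21.2, 23.9, 22.4, 19.2, 22.0, 27.4, 25.1, 20.7, 19.7, 22.3, 24.5, 21.6)
> mean(bmi) # point estimate of mu: 22.5
> sd(bmi) # point estimate of sigma: 2.37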
It is desirable to give a point estimate along with a standard error (se) which indicates how
much error there might be associated with the estimate.
A standard error of an estimate is an estimate of the standard deviation of the estimator.
As indicated above, [est ± "2" se] enables us to find an approximate 95% confidence interval for µ. But when we use a standard error, there is a complication: the 1.96 applies only if we know the standard deviation. If it is unknown and we need to use a standard error, then we need to use a different "2".
In any case though, a rough approximate 95% CI is given by est ± 2 se.
EXAMPLE 5.3.2: (BMI, continued)
Give the standard error for the sample mean found in the above example and hence give an approximate 95% confidence interval for µ.

µ: est = 22.5, se = 2.37/√12 = 0.68; rough approx 95% CI: (21.1, 23.9).
An interval estimate, or a confidence interval, is a set of plausible values for the parameter.
More precisely, an interval estimator is a random interval which is expected (with specified
probability) to contain the unknown parameter. An interval estimate then is a realisation of
an interval estimator. This is discussed further in the following sections, where we consider
sampling from a normal population in more detail.
If we are sampling from a normal population, then we can actually specify the exact distri-
butions of the statistics we need for inference on µ. This enables us to extend our inference
to more accurate confidence intervals (a better “2”); and different confidence levels (i.e.
other than 95%); and to the tails of the distribution, required for hypothesis testing (Chap-
ter 6).
Abstraction again
Population parameters are yet another abstraction. In statistics, we think of these as fixed
but unknown constants. To the extent that they are unknown, they are abstract; we usually
can’t identify them. But we are vitally interested in their values: we make inferences about
them.
Yet another example of a substantial abstraction is the hypothetical endless repetition of
the same study, under identical conditions. We indulge in this thought experiment when
we interpret the meaning of a probability, and specifically the meaning of the “95%” in a
95% confidence interval.
5.4 Normal: estimation of µ when σ is known

If σ is known, then we have the exact result:

X̄ =d N(µ, σ²/n), i.e. (X̄ − µ)/(σ/√n) =d N(0, 1).  (3)

This can be used for inference on µ.
The point estimate of µ is x̄.
The interval estimate of µ is obtained as follows:

Pr(−1.96 < (X̄ − µ)/(σ/√n) < 1.96) = 0.95,  (4)

Pr(µ − 1.96 σ/√n < X̄ < µ + 1.96 σ/√n) = 0.95,  (5)

which can be used to obtain a probability interval for X̄. Rearrangement gives:

Pr(X̄ − 1.96 σ/√n < µ < X̄ + 1.96 σ/√n) = 0.95.  (6)
This is a RANDOM INTERVAL (X̄ − 1.96 σ/√n, X̄ + 1.96 σ/√n) that contains the CONSTANT µ with probability 0.95.

µ is a constant. X̄ is a random variable. Hence the interval endpoints are random. It is the interval that is random; µ is a constant.
DEFINITION 5.4.1. A 95% confidence interval is the realisation of a random interval that has probability 0.95 of containing an unknown parameter.

In this case, the unknown parameter is µ, and the 95% confidence interval is:

(x̄ − 1.96 σ/√n, x̄ + 1.96 σ/√n).

We know that sd(X̄) = σ/√n. If σ is known, there is no need to estimate it and thus, in this case, se(x̄) = σ/√n. Thus the exact 95% confidence interval x̄ ± 1.96 σ/√n is very close to the approximate version est ± "2" se. In this case, "2" = 1.96.
EXAMPLE 5.4.1: Suppose the population is normal with unknown mean, but with known standard deviation 2.5, i.e. X =d N(µ, 2.5²).

We take a random sample of n = 40 and observe x̄ = 14.73 (cf. the above example, but here σ is assumed known).

The 95% CI for µ is 14.73 ± 1.96×2.5/√40 = (14.0, 15.5).
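In R:

> 14.73 + c(-1, 1) * qnorm(0.975) * 2.5/sqrt(40) # 95% CI: (14.0, 15.5)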
A normal-based confidence interval takes the form (m ± E) or (m−E, m+E), where m, the midpoint, is the point estimate and E, the half-width, is called the margin of error. In the above example, m = 14.73 and E = 1.96×2.5/√40 = 0.775.
The result (3) enables us to find confidence intervals at other levels:

99% CI for µ: x̄ ± 2.5758 σ/√n [2.5758 = c0.995(N), −2.5758 = c0.005(N)]
90% CI for µ: x̄ ± 1.6449 σ/√n [1.6449 = c0.95(N), −1.6449 = c0.05(N)]

Note: the point estimate is the 0% CI, i.e. x̄ ± 0; the 100% CI is (−∞, ∞).

Thus, in general, m = x̄ and E = c(1−α/2)(N) σ/√n.
EXAMPLE 5.4.2: A psychological test score is assumed to have standard deviation 15 for a particular population. A random sample of 125 individuals taken from this population gives a sample mean of 108.2. Find a 99% confidence interval for the population mean test score.

estimate = 108.2, standard error = 15/√125 = 1.34;
approx 99% confidence interval = (108.2 ± 2.5758×1.34) = (104.7, 111.7).
EXERCISE. Check that an approximate 90% confidence interval is given by (106.0, 110.4).
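Both intervals in R:

> 108.2 + c(-1, 1) * qnorm(0.995) * 15/sqrt(125) # 99% CI: (104.7, 111.7)
> 108.2 + c(-1, 1) * qnorm(0.95) * 15/sqrt(125) # 90% CI: (106.0, 110.4)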
Statistic-parameter diagram

(Another view of a confidence interval.)

[Figure: statistic-parameter diagram. Horizontal axis: the statistic (x̄); vertical axis: the parameter (µ). Two lines bound a band; for a given µ, the horizontal slice of the band is the probability interval, and for a given x̄, the vertical slice is the confidence interval.]
For each value of the parameter (µ), the end-points of the 95% probability interval for the statistic (X̄) are plotted, using the result specifying the distribution of the statistic: (µ − 1.96 σ/√n, µ + 1.96 σ/√n).
This is done for each possible value of the parameter. The result is two lines corresponding
to the lower and upper ends of the probability interval, as shown in the diagram.
Given a value of the parameter (µ) the horizontal interval between these two lines is the
(95%) probability interval for the statistic X̄. This corresponds to equation (5).
Given an observed value of the statistic (x̄), the vertical interval between the two lines is
the (95%) confidence interval for the parameter (µ). This corresponds to the ‘inversion’ of
the probability statement to make µ the subject, as represented in equation (6).
The confidence interval is seen to be the set of values of the parameter that make the ob-
served value of the statistic “plausible” (i.e. within the 95% probability interval).
It is seldom the case in practice that σ is known, but in some cases, assuming a value for σ
yields a useful approximation.

EXAMPLE 5.4.3: (Sample size determination)
If we are sampling from a population with a standard deviation assumed to be 3.4, how large a sample would we need in order to obtain a 95% confidence interval of half-width 0.5, i.e. est ± 0.5? Note: this half-width is also called the margin of error.

The width of the 95% confidence interval is 2×1.96×3.4/√n ≤ 1;
therefore we require √n ≥ 2×1.96×3.4 = 13.33, i.e. n ≥ 177.6.
So, a random sample of 178 would be required.
5.5 Estimators that are approximately normal

5.5.1 Estimation of a population proportion

Suppose p is the proportion of a population that has an attribute A.
Let P̂ denote the (random) proportion in a random sample that have the attribute. Obvi-
ously, P̂ is an estimator of p. But what is the sampling distribution of P̂ ?
The estimator, P̂ = X/n, where
X = number of individuals with attribute A in a sample of n.
We define “success” as having attribute A. If the sample is randomly selected, each indi-
vidual selected can be regarded as an independent trial, and we have
X = number of successes in n independent trials,
for which we know that
X =d Bi(n, p),
where p = Pr(A), the proportion of the population with attribute A.
Therefore:
E(X) = np and var(X) = npq.
It follows that
E(P̂) = p and var(P̂) = pq/n.
Thus, P̂ is an unbiased estimator of p, with variance pq/n. We see that, like the sample mean, the estimator P̂ → p as n → ∞, since its variance goes to zero as the sample size tends to infinity.
Actually, P̂ is a sample mean. It is equivalent to Z̄, where Zi = 1 if individual i has attribute A, and 0 otherwise. So P̂ has the properties of a sample mean: it is unbiased and it is asymptotically normal.

We have:
P̂ =ᵈ (1/n) Bi(n, p) ≈ᵈ N(p, pq/n), for large n.
This specifies the distribution of the estimator of p. For a large sample, the estimator is approximately normally distributed.

EXAMPLE 5.5.1: A random sample of n=20 is selected from a large population for which p = Pr(A) = 0.5. Find a (symmetric, ≥) 95% probability interval for P̂.

X =ᵈ Bi(20, 0.5), so from tables or computer: Pr(6 ≤ X ≤ 14) = 0.959. Since P̂ = X/20, it follows that Pr(0.3 ≤ P̂ ≤ 0.7) = 0.959.
This (symmetric, ≥) 95% interval is chosen so that the probability in each tail is at most 0.025. This is what we mean when we use a 95% interval for a discrete distribution.
Note: (µ ± 1.96σ) = (0.5 ± 1.96√(0.5×0.5/20)) = (0.5 ± 0.22) = (0.28, 0.72),
which gives not too bad an approximation!
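Binomial calculations like this are easy to check in R, where pbinom gives the Binomial cdf; a quick sketch:

# check of Example 5.5.1: Pr(6 <= X <= 14) for X ~ Bi(20, 0.5)
pbinom(14, size = 20, prob = 0.5) - pbinom(5, size = 20, prob = 0.5)   # 0.9586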

EXAMPLE 5.5.2: A random sample of n=200 is selected from a large population for which p = Pr(A) = 0.2. Find an approximate 95% probability interval for P̂.

approx 95% prob interval: 0.2 ± 1.96√(0.2×0.8/200) = 0.2 ± 1.96×0.028 = (0.145, 0.255).

This interval corresponds to (28.9 < X < 51.1). As X is integer-valued, this converts to (29 ≤ X ≤ 51).
QUESTION: The probability of this interval can be expected to be more than 0.95. Why?
Check that Pr(29 ≤ X ≤ 51) = 0.9585.

EXAMPLE 5.5.3: A random sample of n=100 observations is obtained on Y =ᵈ N(50, 10²). Let U denote the number of observations in this sample that fall in the interval 50 < Y < 60. What is the distribution of U? Specify a 95% probability interval for U.

Here the attribute A is “50 < Y < 60”.
Thus p = Pr(A) = Pr(50 < Y < 60) = Pr(0 < Ys < 1) = 0.3413.
“success” = “50 < Y < 60”, and we have 100 independent trials;
thus U =ᵈ Bi(100, 0.3413).
E(U) = 34.13 and sd(U) = √(100×0.3413×0.6587) = 4.74.
So, an approximate 95% probability interval for U is 24.6 < U < 43.6. Thus you should expect that 25 ≤ U ≤ 43, and be somewhat surprised if it were not so.
Inference on p
A point estimate of p is p̂ = x/n.
An interval estimate of p is obtained from the distribution of the estimator.
If n is large, we have: P̂ ≈ᵈ N(p, pq/n), [cf. X̄ =ᵈ N(µ, σ²/n)].
But here the “σ²” is not known: it depends on the unknown p. However, the sample is large and so, to a good approximation, σ² ≈ p̂(1−p̂), and we act as if it is known.
Thus, we use the approximate result: P̂ ≈ᵈ N(p, p̂(1−p̂)/n),
in which the variance is replaced by its estimate, and we assume the variance is “known” to be this value.
Note that sd(P̂) = √(p(1−p)/n), and so se(p̂) = √(p̂(1−p̂)/n).
Using this approximation gives the “standard” result:

approx 95% CI for p: p̂ ± 1.96√(p̂(1−p̂)/n) [est ± “2”se]

This gives quite a reasonable approximation when the sample is large.
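As a sketch of this “standard” result in R, using the figures of Example 5.5.5 below (n = 200, x = 34):

# approximate 95% CI for p: est +/- 1.96 se
n <- 200; x <- 34
phat <- x/n
se <- sqrt(phat * (1 - phat)/n)
phat + c(-1, 1) * 1.96 * se   # (0.118, 0.222)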

A confidence interval procedure is assessed by its coverage. A 95% confidence
interval for any parameter θ is supposed to contain the true value of θ with probability 0.95.
When an approximate 95% confidence interval is used, the coverage will differ from 0.95.
For a Binomial parameter, the approximate normal-based 95% confidence interval has cov-
erage which tends to be less than 0.95, partly because of the non-normality (in particular
the skewness) and partly because of the discreteness. Further, as well as the normal ap-
proximation, the above approximate confidence interval also uses an approximation to the
standard error. It really is a rough approximation, but it gives us some indication at least.
The “right” answer, the exact 95% confidence interval, can be obtained using R, or from the
Binomial Statistic-Parameter diagram (Statistical Tables: Figure 2).
We can get closer to the exact confidence interval by making corrections (see below) to the
“standard” approximation. However, our approach (in EDDA) is to use the basic formula
(i.e. est ± “2”se) as the approximation, but to keep in mind its deficiencies. If a precise
confidence interval is required, then go to the computer for the exact result.

Improving on the basic approximate confidence interval.

(1) The standard error correction and the skewness correction mean that the approx CI needs to be shifted towards 0.5. The simplest way to do this is to use Agresti’s formula: for the purposes of computing the CI, use p̃ = (x+2)/(n+4) instead of p̂.
(2) The correction for continuity means that the margin of error needs to be increased by 0.5/n.

‘better’ approx 95% CI: p̃ ± (1.96√(p̃(1−p̃)/n) + 0.5/n), where p̃ = (x+2)/(n+4).

The exact confidence interval is given by R, using binom.test(): supply x and n, or a column of 0s and 1s. A normal-based confidence interval (with correction for continuity) can be found in R with prop.test().
The statistic-parameter diagram (Statistical Tables: Figure 2) gives, as far as the accuracy of
the graph allows, the exact 95% confidence interval for p for an observed p̂. The CI is skew,
i.e. not symmetrical about the estimate, especially for values of p near 0 and 1.

EXAMPLE 5.5.4: Suppose n = 20 and x = 4.

approx CI: est ± 1.96 se:            (0.02, 0.38)
[ ‘better’ approx CI: est* ± E*:     (0.04, 0.46) ]
Figure 2 diagram:                    (0.06, 0.44)

EXAMPLE 5.5.5: A random sample of n = 200 yields x = 34 with attribute A; find a 95% confidence interval for p = Pr(A).

p̂ = 34/200 = 0.17; se(p̂) = √(0.17×0.83/200) = 0.0266;
approx 95% CI: 0.17 ± 0.052 = (0.118, 0.222).

The exact 95% CI, obtained from R, is (0.121, 0.229).


(The ‘better’ approximation gives (0.121, 0.232)).

# proportion confidence interval


> binom.test(n=200, x=34, conf.level=0.95)
...
95 percent confidence interval:
0.1206956 0.2293716
E XERCISE . (Injecting room)
A survey of 1014 residents in a certain community found that 425 support the idea of a
safe injecting room in the area.
(a) Give a 95% confidence interval for the proportion of the community that supports
having a safe injecting room. [0.39, 0.45]
(b) What assumptions have been made in using the procedure in (a)?
Sample size determination
To determine the sample size to achieve a 95% confidence interval with margin of error d, we require

1.96√(p(1−p)/n) ≤ d  ⇒  n ≥ (1.96√(p(1−p))/d)².

Again, p is unknown. If we are confident that p ≈ pa , then pa can be used in the above
formula.
Here, pa may denote an estimate from an earlier sample, or it may be a value based on

historical values or expert judgement. To be on the safe side, we should try to choose a
value on the 0.5-side of p, so that we are over-estimating rather than under-estimating the
variance. This gives a conservative value; i.e. one for which it is likely that the sample will
produce a confidence interval with margin of error less than the specified value, d.
If we have absolutely no idea about p, then we should use pa = 0.5, since this gives the
maximum value of pa (1 − pa ). This maximum value is 0.25.
Note that this result is based on the basic normal approximation, which will give a rea-
sonable answer provided the resulting sample size is large. Some checking might be in
order if the formula indicates a relatively small sample would be adequate. However, this
is unlikely unless a relatively wide margin of error is specified.

EXAMPLE 5.5.6: How large a sample is required to produce a 95% confidence interval with margin of error 0.02,
i. if past experience indicates that p is between 0.1 and 0.2?
ii. if we have no information about p?

In the first case, we would use pa = 0.2, which gives n ≥ (1.96√(0.2×0.8)/0.02)² = 1536.6,
so a sample of at least 1537 is required.

In the second case, we use pa = 0.5 to be safe, which gives n ≥ (1.96√(0.5×0.5)/0.02)² = 2401.0,
so a sample of at least 2401 is required.
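A sketch of this rule as an R function (prop_sample_size is our own name; pa = 0.5 is the conservative default):

# sample size for estimating p with margin of error d
prop_sample_size <- function(d, pa = 0.5, conf = 0.95) {
  z <- qnorm(1 - (1 - conf)/2)
  ceiling(z^2 * pa * (1 - pa) / d^2)
}
prop_sample_size(d = 0.02, pa = 0.2)   # 1537
prop_sample_size(d = 0.02)             # 2401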

5.5.2 Estimation of a population rate

In the case of a population rate, a similar approach can be used to that used for the population proportion. Now it is based on the Poisson distribution rather than the Binomial distribution.
Estimating an incidence rate α, where X cases are obtained from observation of t person-years:
The estimator of α is X/t, where X =ᵈ Pn(αt). [cf. X/n, where X =ᵈ Bi(n, p)]
X/t ≈ᵈ N(α, α/t) ≈ N(α, α̂/t). [cf. X/n ≈ᵈ N(p, pq/n) ≈ N(p, p̂q̂/n)]
Thus, the point estimate is α̂ = x/t, with standard error se(α̂) = √(α̂/t) = √x/t, and the approximate 95% confidence interval is given by

approx 95% CI for α: α̂ ± 1.96√(α̂/t) [est ± “2”se]
Note: This approximate 95% CI for α has the same sort of problems as the approximate
95% CI for p: it can be “corrected” (see below), but we will use this basic form as a rough
approximation, and use the computer to generate an exact answer if required.

This approximate 95% confidence interval can be improved in the same sort of way as the approximate 95% confidence interval for p.
(1) The skewness correction means that the approx CI needs to be shifted upwards. The simplest way to do this is to use Agresti’s formula: for the purposes of computing the CI, use α̃ = (x+2)/t instead of α̂.
(2) The correction for continuity means that the margin of error needs to be increased by 0.5/t.

‘better’ approx 95% CI: α̃ ± (1.96√(α̃/t) + 0.5/t), where α̃ = (x+2)/t.
.

The “right” answer, the exact 95% confidence interval is given by the R function
poisson.test().
The exact result can also be obtained from the Poisson statistic-parameter diagram (Statis-
tical Tables: Figure 4). As far as the accuracy of the graph allows, this gives an exact 95%
confidence interval for λ = αt (for an observed x). The confidence interval for α can then
be obtained by dividing through by t.

EXAMPLE 5.5.7: In a follow-up study, 17 cases are observed from observation of individuals in a cohort for 5328 person-years.

α̂ = 17/5328 = 0.003191 (cases/person-year); se(α̂) = √(0.003191/5328) = 0.000774;
approx 95% CI for α: (0.003191 ± 1.96×0.000774) = (0.00167, 0.00471).

The exact 95% CI for α, using R is (0.00186, 0.00511), obtained as

# rate confidence interval


poisson.test(x = 17, T = 5328, conf.level=0.95)
...
95 percent confidence interval:
0.001858695 0.005108605

Using Table 4, we obtain 9.9 < λ < 27.2, and hence (on dividing by 5328),
0.00186 < α < 0.00511.

Note: The ‘better’ approx 95% CI gives (0.00187, 0.00526). Better, but still not on the
money.
Inference on λ, where X =ᵈ Pn(λ)
If X =ᵈ Pn(λ), then the point estimate is λ̂ = x and, as above, an approximate 95% confidence interval for λ is given by:

approx 95% CI for λ: λ̂ ± 1.96√λ̂ [est ± “2”se]

Again, this is a rough approximation, but we will use it nevertheless, being aware that it tends to be a bit low and a bit narrow. If required, the exact 95% CI is available from R or from Statistical Tables: Figure 4.

Note: the ‘better’ approx 95% CI: λ̃ ± (1.96√λ̃ + 0.5), where λ̃ = x+2.

EXAMPLE 5.5.8: A study of workers in industry K yielded 26 cases of disease outcome D over a ten-year period. Let λ denote the mean number of cases in this period. The number of cases X =ᵈ Pn(λ).

An approximate 95% confidence interval for the mean number of cases is (as above) 26 ± 1.96√26 = (16.0, 36.0).

Unless the number of cases is quite large (and 26 is not that large), the normal-
approximation is not so wonderful: the approx CI will be too narrow and too
low. An exact result can be obtained using R or the Poisson SP diagram (Statis-
tical Tables: Figure 4). This gives (17.0, 38.1). The ‘better’ approximation gives
(17.1, 38.9). As usual, it tends to over-correct slightly.
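The exact interval quoted here can be reproduced with poisson.test, whose conf.int component holds the interval (with the default T = 1, it is an interval for the mean):

# exact 95% CI for the Poisson mean of Example 5.5.8
poisson.test(x = 26)$conf.int   # about (17.0, 38.1)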

5.6 Normal: estimation of µ when σ is unknown

The point estimate of µ is x̄.

The interval estimate is not so simple. To obtain a confidence interval for µ when σ is unknown, we cannot use (X̄−µ)/(σ/√n) =ᵈ N(0, 1) as we did when σ was known. The obvious thing to try is (X̄−µ)/(S/√n), replacing the unknown σ by its estimator S.
The quantity (X̄−µ)/(S/√n) has what is known as a t distribution with n−1 degrees of freedom; and we write

(X̄ − µ)/(S/√n) =ᵈ tn−1.

Why n−1 degrees of freedom?
s² is based on (x1 − x̄)² + (x2 − x̄)² + ··· + (xn − x̄)².
This sum of squares is equivalent to a sum of n−1 independent terms, rather than n, because (x1 − x̄) + (x2 − x̄) + ··· + (xn − x̄) = 0, since (x1 + x2 + ··· + xn) − nx̄ = 0.
The divisor of n−1 used in the definition of s² is also indicative.

What does the tk distribution look like?

E(T) = 0, var(T) = k/(k−2). The distribution is symmetric and bell-shaped . . . and not unlike the normal distribution. In fact, as k → ∞, tk → N. But for small k the tails of tk are longer than N, and for very small k, very much longer!

distribution    mean & sd                        95% prob interval
T =ᵈ t1         E(T) = 0*, sd(T) = ∞             (−12.71, 12.71)
T =ᵈ t3         E(T) = 0, sd(T) = 1.732          (−3.182, 3.182)
T =ᵈ t10        E(T) = 0, sd(T) = 1.118          (−2.228, 2.228)
T =ᵈ t30        E(T) = 0, sd(T) = 1.035          (−2.042, 2.042)
T =ᵈ t∞         E(T) = 0, sd(T) = 1.000          (−1.960, 1.960)

(* strictly, the mean of the t1 distribution does not exist.)

The Statistical Tables (Table 7) gives the quantiles (inverse cdf) of the tk distribution for a
range of values of k and q.
R gives the usual quantities via dt, pt and qt: the pdf (not much use), the cdf (probabilities) and the inverse cdf (quantiles).

E XERCISE . Check that c0.975 (t20 ) = 2.086 and c0.025 (t20 ) = −2.086.
Note: Since the t distribution is symmetrical about zero, ca (t) = −c1−a (t).
> qt(0.975, df=20) # 0.975-quantile of t_20 distribution
[1] 2.085963
> qt(0.025, df=20) # 0.025-quantile of t_20 distribution
[1] -2.085963

So . . . how does this relate to the problem in hand: finding a confidence interval for µ?
We have:
(X̄ − µ)/(S/√n) =ᵈ tn−1.
It follows that:
Pr(−c0.975(tn−1) < (X̄ − µ)/(S/√n) < c0.975(tn−1)) = 0.95. (4’)
Rearrangement of this statement, as we did with the σ-known result, leads to a confidence interval:
Pr(X̄ − c0.975(tn−1) S/√n < µ < X̄ + c0.975(tn−1) S/√n) = 0.95 (6’)
in which σ is replaced by S and the standard normal quantiles (±1.96) are replaced by the tn−1 quantiles.
This gives the 95% confidence interval for µ:

x̄ − c0.975(tn−1) s/√n < µ < x̄ + c0.975(tn−1) s/√n

which, with est = x̄ and se(x̄) = s/√n, exactly fits the form
est ± “2” se, with “2” = c0.975(tn−1).

Unless the sample size n is very small, the “2” will actually be reasonably close to 2, as
we have seen: for example, c0.975 (t100 ) = 1.984, c0.975 (t30 ) = 2.042, c0.975 (t10 ) = 2.228,
c0.975 (t3 ) = 3.182.

EXAMPLE 5.6.1: A random sample of n=25 observations from a normal population gives x̄=12.3 and s=4.7. Find a 95% confidence interval for µ.

95% CI for µ: (est ± “2”se) = 12.3 ± 2.064×4.7/√25 = (10.4, 14.2).
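The same interval is quickly computed in R from the summary statistics (a sketch; given the raw data, t.test(x) reports it directly):

# 95% t-based CI for mu, from summary statistics (Example 5.6.1)
n <- 25; xbar <- 12.3; s <- 4.7
xbar + c(-1, 1) * qt(0.975, df = n - 1) * s/sqrt(n)   # (10.4, 14.2)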

EXAMPLE 5.6.2: The time taken to complete a particular surgical procedure is a


random variable. For twenty-two independent observations of this procedure,

the average time taken (in minutes) is 28.52, with sample standard deviation 2.36. Assuming normality, find a 95% confidence interval for the mean time.

est = 28.52, se = 2.36/√22 = 0.50, “2” = c0.975(t21) = 2.080.

95% CI for µ: (28.52 ± 1.04) = (27.5, 29.6).

Note that this confidence interval is a statement about the mean, and not about the actual time taken. A 95% interval for the time taken is approximately (28.52 ± 2×2.36) = (23.8, 33.2). This is an interval within which about 95% of the sample of times would lie; and an interval within which the next observation will lie with a probability of around 0.95.

Such an interval is called a prediction interval. This is discussed further below.

EXAMPLE 5.6.3: For the sample {1, 3, 4, 4, 5, 5, 6, 7, 9, 12},
we have n = 10, x̄ = 5.6 and s = 3.134.
So, if this was a sample from a normal population, then

90% CI for µ: 5.6 ± 1.833×3.134/√10 = (3.78, 7.42).

In most cases we consider a 95% confidence interval. However, the procedure is the same for other levels, as indicated by the last example.

5.7 Prediction intervals (for a future observation)

A 95% prediction interval for X is an interval within which a future observation of X will lie, with probability 0.95. Let X′ denote a future observation on X. If we knew the values of µ and σ, then it is simple: since X′ =ᵈ N(µ, σ²), a 95% prediction interval is (µ − 1.96σ, µ + 1.96σ).
In the cases that µ and/or σ are unknown, we use their estimates instead, but this intro-
duces an additional uncertainty resulting in a wider interval.
If µ is unknown, but σ is known.
We have X′ =ᵈ N(µ, σ²) and X̄ =ᵈ N(µ, σ²/n).
Since X′ and X̄ are independent random variables,
X′ − X̄ =ᵈ N(0, σ²(1 + 1/n))
∴ Pr(−1.96σ√(1+1/n) < X′ − X̄ < 1.96σ√(1+1/n)) = 0.95,
i.e. Pr(X̄ − 1.96σ√(1+1/n) < X′ < X̄ + 1.96σ√(1+1/n)) = 0.95.

Therefore, a 95% prediction interval for X is

(x̄ − 1.96σ√(1 + 1/n), x̄ + 1.96σ√(1 + 1/n))
The interval is centred on x̄ (the point estimate of µ), but it is a bit wider to allow for the
uncertainty in using x̄ in place of µ.

If µ is unknown, and σ is also unknown.

If σ is also unknown then we replace σ by S, and, as in the case of the confidence interval, the N(0, 1) distribution is replaced by tn−1. This gives

(x̄ − c0.975(tn−1) s√(1 + 1/n), x̄ + c0.975(tn−1) s√(1 + 1/n))
The interval is again centred at the point estimate x̄, and it is even wider (since c0.975 (tn−1 )
is greater than 1.96) to allow for the additional uncertainty in using s in place of σ.
Summary
The results are summarised in the following table:

µ known?  σ known?  result                               95% prediction interval       (95% confidence interval)
√         √         (X′ − µ)/σ =ᵈ N(0, 1)                µ ± 1.96σ                     µ (known)
×         √         (X′ − X̄)/(σ√(1+1/n)) =ᵈ N(0, 1)      x̄ ± 1.96σ√(1+1/n)            x̄ ± 1.96σ√(1/n)
×         ×         (X′ − X̄)/(S√(1+1/n)) =ᵈ tn−1         x̄ ± c0.975(tn−1) s√(1+1/n)   x̄ ± c0.975(tn−1) s√(1/n)

EXAMPLE 5.7.1: A random sample of n = 31 observations is obtained on X =ᵈ N(µ, σ²). If the sample gives x̄ = 23.78 and s = 5.37, find a 95% prediction interval for X.

95% PI for X: 23.78 ± 2.042×5.37×√(1 + 1/31) = (12.64, 34.92).

Compare this with the 95% CI for µ obtained earlier:
95% CI for µ: 23.78 ± 2.042×5.37×√(1/31) = (21.81, 25.75).

The PI and CI are the same, apart from a “1 +” in the right place.
The prediction interval is always substantially wider: it is a statement about a future observation. The confidence interval is a statement about the population mean.
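A short R sketch of both intervals, using the summary figures above:

# 95% prediction and confidence intervals (Example 5.7.1)
n <- 31; xbar <- 23.78; s <- 5.37
t975 <- qt(0.975, df = n - 1)
xbar + c(-1, 1) * t975 * s * sqrt(1 + 1/n)   # PI: about (12.6, 34.9)
xbar + c(-1, 1) * t975 * s * sqrt(1/n)       # CI: about (21.8, 25.8)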

5.8 Checking normality

The t-distribution result is based on the assumption that the population distribution is ap-
proximately normal. How can we tell if a sample is normal (i.e. from a normal population)?
The sample pdf is too erratic to be much use. The sample cdf is a bit more stable. But, how
do we know which shape corresponds to a normal cdf?
Principle: the easiest curve to fit is a straight line

Our definition of sample quantiles suggests a solution:
ĉq = x(k), where k = (n+1)q.
Therefore:
x(k) ∼ cq, where q = k/(n+1).

If X =ᵈ N(µ, σ²), then X =ᵈ µ + σN, where N denotes standard normal; and so cq(X) = µ + σcq(N), i.e. cq = µ + σΦ⁻¹(q). Note: Φ denotes the standard normal cdf, so Φ⁻¹(q) denotes the inverse cdf, i.e. the q-quantile of the standard normal distribution. This is often denoted by zq.

Therefore we have (if the normal model is correct):
x(k) ∼ µ + σΦ⁻¹(k/(n+1)).

So, if we plot the points (Φ⁻¹(k/(n+1)), x(k)), the result should be something close to a straight line, with intercept µ and slope σ; as illustrated below for a sample of 30 observations.

[Figure: Normal Q-Q plot of the sample (Sample Quantiles against Theoretical Quantiles), with fitted line.]

> x <- rnorm(30, mean=50, sd=10) # generate a sample


> qqnorm(x) # normal QQ-plot
> qqline(x) # add "best fitting" line

This appears to be reasonably close to a straight line, so the normal distribution is a reason-
able model for these data. The intercept of the fitted line gives an estimate of µ: µ̂ = 47.2;
and the slope of the fitted line gives an estimate of σ: σ̂ = 10.7.
The quantities Φ⁻¹(k/(n+1)) are called normal scores. Roughly, these are the values you would expect to get in an equivalent position in a sample from a standard normal distribution.
In R, the normal scores for a sample x can be obtained as qnorm(rank(x)/(length(x)+1)); these are (essentially) the theoretical quantiles that qqnorm(x) plots on the horizontal axis.
Such a plot is called a QQ-plot because it plots the sample Quantiles against the (standard)
population Quantiles. The QQ-plot not only provides an indication of whether the model
is a reasonable fit, but also gives estimates of µ and σ. These estimates work even in some
situations where x̄ and s won’t. For example, with censored or truncated data.

EXAMPLE 5.8.1: A random sample of 30 observations gives:

68.1 73.1 86.2 85.1 70.0 67.1 64.3 65.8 64.2 62.0
48.6 74.9 72.9 54.7 78.0 79.1 60.1 63.2 63.2 78.0
77.5 65.9 79.9 59.5 56.3 59.0 66.7 74.5 79.5 67.6

Is this a sample from a normal population? Can we reasonably assume that it


could have come from a normal population?

[Figure: histogram of x (left) and Normal Q-Q plot with fitted line (right) for the sample of 30 observations.]

While the histogram does not look particularly normal, the QQ-plot gives a
more useful guide. This is a plot of the data (on the vertical axis) and the normal
scores (on the horizontal axis), with a fitted straight line. The intercept and slope
are quite close to the sample mean and standard deviation, as they should be.
> mean(x) # sample mean
[1] 68.83333
> sd(x) # sample standard deviation
[1] 9.213833

What happens if it’s a bad fit? Tails too long; or tails too short. A concave graph (up or down) indicates too short at one end and too long at the other, i.e. a skew distribution. The following examples correspond to the Exponential distribution with pdf e⁻ˣ and the t-distribution with 3 degrees of freedom. They give, respectively, U- and S-shaped QQ-plots.

[Figure: histogram and Normal Q-Q plot for a sample from the Exponential distribution; the QQ-plot is curved (U-shaped).]

[Figure: histogram and Normal Q-Q plot for a sample from the t₃ distribution; the QQ-plot is S-shaped, with both tails longer than normal.]
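These patterns are easy to reproduce by simulation; a sketch (the exact shapes will vary from run to run):

# skew (Exponential) and long-tailed (t_3) samples
x <- rexp(100)
qqnorm(x); qqline(x)    # curved QQ-plot: skew distribution

y <- rt(100, df = 3)
qqnorm(y); qqline(y)    # S-shaped QQ-plot: long tails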

5.9 Combining estimates

Suppose we have two estimates of the same parameter, with their standard deviations,
resulting from two separate experiments:
est1 = 5.2, sd1 = 0.2; est2 = 6.4, sd2 = 0.5
How should these estimates be combined?
We could just average the two, giving the combined estimate:
est = 5.8, sd = 0.27.
(average = 0.5×est1 + 0.5×est2, so sd = √(0.5²×0.2² + 0.5²×0.5²) = 0.27.)
This combination gives the two estimates equal weight.
It would appear that the first experiment is ‘better’. It produces a more precise estimate:
i.e. one with smaller standard deviation. So, we ought to be giving it more weight.
Suppose the parameter we are estimating is θ, then we have:
E(T1 ) = θ, var(T1 ) = 0.22 ; E(T2 ) = θ, var(T2 ) = 0.52 ,
where T1 and T2 are independent, as they are from separate experiments.

We seek the optimal estimator of θ. Let the weights be w and 1−w, and define:
T = wT1 + (1−w)T2.
The reason that the weights must sum to one is so that T is an unbiased estimator of θ:
E(T) = wθ + (1−w)θ = θ.
V = var(T) = w²×0.2² + (1−w)²×0.5² = w²/25 + (1−w)²/4.
To find where V is a minimum, we solve dV/dw = 0:
dV/dw = 2w/25 − 2(1−w)/4 = 0  ⇒  w/(1−w) = 25/4  ⇒  w = 25/29, 1−w = 4/29.

So, to minimise the variance of the combined estimate we should put a lot more weight on
the first estimate (86% in fact). Then

est = (25/29)×5.2 + (4/29)×6.4 = 5.4; sd = √((25/29)²×0.2² + (4/29)²×0.5²) = 0.19.

Compared to averaging, this optimal weighting gives an estimate closer the first (more
reliable) estimate; and a smaller standard deviation.
Repeating the above for the case
E(T1) = θ, var(T1) = v1; E(T2) = θ, var(T2) = v2,
gives w = (1/v1)/((1/v1) + (1/v2)) = v2/(v1+v2); and 1−w = (1/v2)/((1/v1) + (1/v2));
i.e. the weights for the optimal estimator are inversely proportional to the variances, and its variance is given by:
V = 1/((1/v1) + (1/v2)).

This makes routine calculation simpler.

EXAMPLE 5.9.1: We have independent estimates with known standard deviations: est1 = 5.2, sd1 = 0.2; est2 = 6.4, sd2 = 0.5. This can be represented in a table, with estimates and standard deviations in the first two columns. The third column gives 1/v = 1/sd². The fourth column gives the weights, obtained by dividing 1/v by the column sum.

est   sd    1/v   wt      wt×est
5.2   0.2   25    0.862   4.48
6.4   0.5    4    0.138   0.88
            29            5.36

sd = 1/√29 = 0.19;  est = 5.4 (as above).

We can now calculate an optimal confidence interval for this quantity:


CI = 5.4 ± 1.96 × 0.19 = (5.04, 5.76).

This extends to the case of k independent estimators of a parameter θ:

T = w1T1 + w2T2 + ··· + wkTk, with weights wi = c/vi, where c = 1/(1/v1 + 1/v2 + ··· + 1/vk);

and in that case

V = var(T) = c.
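A minimal R sketch of this pooling rule (pool is our own function name); it reproduces the examples in this section:

# inverse-variance (optimal) pooling of independent estimates
pool <- function(ests, sds) {
  info <- 1/sds^2               # "information" = 1/variance
  w <- info/sum(info)           # weights proportional to 1/v
  c(est = sum(w * ests), sd = sqrt(1/sum(info)))
}
pool(c(5.2, 6.4), c(0.2, 0.5))                  # Example 5.9.1: est 5.4, sd 0.19
pool(c(14.4, 15.7, 16.1), c(0.45, 0.92, 0.67))  # Example 5.9.2: est 15.0, sd 0.35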

This is precisely the technique used in meta-analysis.


Note: the reciprocal of variance can be interpreted as ‘information’. The smaller the variance,
the more information we get from an estimate. Combining estimates in this way means that
we are adding the information. The information in the combined estimate (1/V ) is the sum
of the information in each of the estimates (1/vi ).

In meta-analysis, research papers reporting estimates of a parameter (such as the effect of


drug A) are collected. (The aim of a meta-analysis is to collect all such papers). The papers
report an estimate and the standard deviation of the estimate. Thus, it can be assumed
that we have realisations of independent random variables T1 , T2 , . . . , Tk with standard
deviations σ1 , σ2 , . . . , σk .
It is assumed that all these papers are estimating the same parameter, θ, which might be
the effect of a new drug, or the odds ratio relating disease and exposure, or the increased
survival time with a new treatment.

How should these estimates be combined to produce an optimal estimate? The answer is
given by the above results.

EXAMPLE 5.9.2: (meta-analysis)
We find results from three papers: est1 = 14.4, sd1 = 0.45; est2 = 15.7, sd2 = 0.92; and est3 = 16.1, sd3 = 0.67. We assume that the results are independent (separate experiments). We want to combine these results in the most efficient way. This can be done using a table of the same form as for combining two estimates. (Note: a table of this form can be used for optimally combining any number of estimates.)

paper   est    sd     1/v    wt      wt×est
1       14.4   0.45   4.94   0.592   8.52
2       15.7   0.92   1.18   0.142   2.22
3       16.1   0.67   2.23   0.267   4.30
                      8.35           15.04

sd = 1/√8.35 = 0.35;  est = 15.0

Thus the combined estimate is 15.0 with standard deviation 0.35. This gives a confidence interval of:
CI = 15.0 ± 1.96 × 0.35 = (14.4, 15.7).
From the above example we observe that the combined estimate is closer to the most precise of the individual estimates; and the standard deviation of the pooled estimate is smaller than any of the individual standard deviations, resulting in a narrower confidence interval. Greater information means smaller standard deviation.
Here we are assuming that the standard deviations are known. Usually, in practice, they
are not, and they must be estimated. In that case we use the standard error (which is an
estimate of the standard deviation) and replace the unknown standard deviation (sd) by
the standard error (se) in the above.

Problem Set 5
5.1 A population has mean µ=50 and standard deviation σ=10.
(a) For a random sample of 10, find approximately Pr(49 < X̄ < 51).
(b) For a random sample of 100, find approximately Pr(49 < X̄ < 51).
(c) For a random sample of 1000, find approximately Pr(49 < X̄ < 51).
5.2 A population with pdf indicated below has mean µ = 55.4 and standard deviation 14.2.
[Figure: population pdf.]
A random sample of 50 observations is obtained from this population. Specify a 95% probability interval for X̄.
5.3 A 95% confidence interval for a parameter is such that it contains the unknown parameter with
probability 0.95. We call this a “success”. So, the probability that a 95% confidence interval is
successful is 0.95. And it is a failure (i.e. does not contain the parameter) with probability 0.05.
(a) Suppose we have four independent 95% confidence intervals. Show that the probability
that all four of these intervals are successful is 0.8145.
(b) i. Suppose we have 20 independent 95% confidence intervals, what is the probability
that all 20 are successful?
ii. How many of these intervals do you ‘expect’ to be successful?
iii. What is the distribution of the number of successful intervals?
iv. Find the probability that the number of successful intervals is equal to 20? 19? 18?
5.4 The following is a random sample of n=30 observations from a normal population with (un-
known) mean µ and known standard deviation σ=8.
32.1 43.2 38.6 50.8 34.4 34.8 34.5 28.4 44.1 38.7
49.1 41.3 40.3 40.5 40.0 35.3 44.3 33.3 50.8 28.6
42.2 46.3 49.8 34.4 43.9 59.7 44.9 41.9 41.3 38.2
i. Find a 95% confidence interval for µ.
ii. Will a 50% confidence interval for µ be wider, or narrower, than the 95% confidence in-
terval? Find a 50% confidence interval for µ.
iii. What would happen if the confidence level was made even smaller? What is the 0%
confidence interval?
iv. Find a 99.9% confidence interval for µ.
5.5 For the data of Problem 5.4, find the 95% confidence interval for µ, assuming that σ is unknown.
Compare this interval to the 95% confidence interval found in Problem 5.4. Why is this interval
narrower? Under what circumstances is the 95% confidence interval assuming σ unknown
narrower than the 95% confidence interval assuming σ known? Which do you expect to be
wider?
5.6 A study was conducted to examine the efficacy of an intramuscular injection of cholecalciferol
for vitamin D deficiency. A random sample of 50 sufferers of vitamin D deficiency were chosen
and given the injection. Serum levels of 25-hydroxyvitamin D3 (25OHD3) were measured at
the start of the study and 4 months later. The difference D was calculated as (4-month reading
– baseline reading).
The sample mean difference, d̄ = 17.4, and sample standard deviation, s_d = 21.2.
i. Construct a 95% confidence interval for the mean difference.
ii. Does this confidence interval include zero? What can you conclude?
5.7 The margin of error, or the half-width, of a 100(1−α)% confidence interval for µ when σ is known is given by z σ/√n, where z = c₁₋α/₂(N).

i. List the factors that affect the width of a confidence interval.



ii. For each factor, say how it affects the width of the interval.
iii. Does a wider interval give a more or less precise estimation?
iv. If σ = 5, and I want a 95% confidence interval to have half-width 0.5, i.e. the 95% CI to be
(x̄ ± 0.5), what sample size should I use?
5.8 We are interested in estimating the prevalence of attribute D among 50-59 year-old women.
Suppose that in a sample of 1140 such women, 228 are found to have attribute D.
Obtain a point estimate and a 95% confidence interval for the prevalence.
5.9 We are interested in estimating the prevalence of breast cancer among 50–54-year-old women
whose mothers have had breast cancer. Suppose that, in a sample of 10 000 such women, 400
are found to have had breast cancer at some point in their lives.
(a) Obtain a point estimate for the prevalence, and its standard error.
(b) Obtain a 95% interval estimate for the prevalence.
5.10 Of a random sample of n = 20 items, it is found that x = 4 had a particular characteristic. Use
the chart in the Statistical Tables (Table 2) to find a 95% confidence interval for the population
proportion. Repeat the process to complete the following tables:
n     x    p̂    95% CI: (a, b)   |   n     x     p̂    95% CI: (a, b)
20    4                          |   20    16
50    10                         |   50    40
100   20                         |   100   80
200   40                         |   200   160
Check your values using the intervals from R.
Use the formula p̂ ± 1.96 se(p̂) to find an approximate 95% confidence interval for the popula-
tion proportion for n=100, x=20.
5.11 (a) The following is a sample of n = 19 observations on X
84 37 33 24 58 75 55 46 65 59
18 30 48 38 70 68 41 52 50
The graph below is the QQ plot for this sample:

[Figure: Normal Q-Q plot of the sample (Sample Quantiles against Theoretical Quantiles), with one point indicated.]

Specify the coordinates of the indicated point, explaining how they are obtained. Use the
diagram to obtain estimates of µ and σ.
(b) Use R to obtain a probability plot for these data and indicate how it relates to the above
plot.
(c) i. Find a 95% confidence interval for µ.
ii. Find a 95% prediction interval for X.
5.12 A random sample of 100 observations on a continuous random variable X gives:
range       0<x<1   1<x<2   2<x<3   3<x<5   5<x<10   10<x<20
frequency      27      18      20      17       12         6
(a) Sketch the graph of the sample pdf.
(b) Sketch the graph of the sample cdf and hence find an approximate value for the sample
median.
(c) Find a 95% confidence interval for Pr(X < 3). Is it plausible that the median is equal
to 3? Explain.

5.13 Suppose the presence of a characteristic C in an individual can only be determined by means of
a blood test. We assume this test indicates the presence (or absence) of C with perfect accuracy.
If the characteristic is rare and the test is expensive (and/or time consuming) it can be more
efficient to test a combined blood sample from a group of individuals.
Suppose that the probability that an individual has characteristic C is equal to p. Blood samples
from k = 10 individuals are combined for a test.
i. Show that the probability of a positive result (indicating presence of C) is θ = 1−(1−p)¹⁰.
ii. Ten such groups of 10 (representing blood samples from 100 individuals in all) were
tested, and yielded 4/10 positive results. Use R to obtain an exact 95% confidence in-
terval for θ. Hence derive an exact 95% confidence interval for p.
5.14 A study of workers in industry M reported 43 cases of disease D based on observation of 1047 person-years. Give an estimate and a 95% confidence interval for the incidence rate in industry M based on these results. Is this compatible with a community incidence rate of 0.02 cases per person-year?
5.15 (a) Two independent estimates of a parameter θ are given:
est1 = 25.0 with sd1 = 0.4; and est2 = 23.8 with sd2 = 0.3.
Find the optimal pooled estimate of θ and obtain its standard deviation.
(b) A third independent estimate of θ is obtained in a new experiment:
est3 = 24.4 with sd3 = 0.2.
Find the optimal pooled estimate of θ based on the three estimates, and obtain its stan-
dard deviation.
5.16 Oxidised low-density lipoprotein is thought to play an important part in the pathogenesis of
atherosclerosis. Observational studies have associated β-carotene with reductions in cardio-
vascular events, but clinical trials have not. A meta-analysis was undertaken to examine the
effect of compounds like β-carotene on cardiovascular mortality and morbidity.
Here we examine the effect of β-carotene on cardiovascular mortality. Six randomised trials
of β-carotene treatment were analysed. All trials included 1000 or more patients. The dose
range for β-carotene was 15–50 mg. Follow-up ranged from 1.4 to 12.0 years. The parameter
estimated is λ = ln(OR) where OR denotes the odds ratio relating E and D, and ln denotes
natural logarithm.
Note: ln OR is used rather than OR, since −∞ < ln OR < ∞, which means that its estimator is better
fitted by a normal distribution; it has no endpoint problems, cf. OR > 0.
(a) The estimates and standard errors from these trials are as follows:
est se 1/se2 w w×est
ATBC 0.0827 0.0533 ··· ··· ···
CARET 0.3520 0.1058 ··· ··· ···
HPS 0.0520 0.0503 ··· ··· ···
NSCP –0.7702 0.5109 ··· ··· ···
PHS 0.1049 0.0797 ··· ··· ···
WHS 0.1542 0.3935 ··· ··· ···
A rough 95% confidence interval for each trial is given by est ± 2se. Represent these
intervals in a diagram.
(b) Compute the optimum pooled estimate of λ and its standard error.
(c) Obtain a 95% confidence interval for λ.
(d) Hence obtain a 95% confidence interval for OR, using the fact that λ = ln OR.
(e) Let OR denote the odds ratio between exposure E and disease D. What does “OR > 1”
indicate about the relationship between E and D?
(f) What conclusion do you reach from this meta-analysis?

Chapter 6

HYPOTHESIS TESTING

“I had come to an entirely erroneous conclusion, which shows, my dear Watson, how dangerous it is
to reason from insufficient data.”
Sherlock Holmes, The Speckled Band, 1892.

6.1 Introduction

Hypothesis testing can be regarded as the “other side” of confidence intervals. We have
seen that a confidence interval for the parameter µ gives a set of “plausible” values for µ.
Suppose we are interested in whether µ = µ0 . In determining whether or not µ0 is a
plausible value for µ (using a confidence interval) we are really testing µ = µ0 against the
alternative that µ 6= µ0 . If µ0 is not a plausible value, then we would reject µ = µ0 .
In this subject, we deal only with two-sided confidence intervals and, correspondingly, with two-sided tests, i.e. tests against a two-sided alternative (µ = µ0 vs µ ≠ µ0). There are circumstances in which one-sided tests and one-sided confidence intervals may seem more appropriate. Some statisticians argue that they are never appropriate. In any case, we will use only two-sided tests.
All our confidence intervals are based on the central probability interval for the estimator, i.e. that obtained by excluding probability ½α at each end of the distribution, giving a Q% confidence interval, where Q = 100(1 − α). This means that our tests are based on rejecting µ = µ0 for an event of probability ½α at either end of the estimator distribution.¹

¹ Note: this is not always the case for other test statistics, i.e. test statistics that are not estimators, such as “goodness-of-fit” statistics. For example, using a test statistic such as U = (X̄−µ0)² to test µ=µ0. We will consider such cases in Chapter 7.

EXAMPLE 6.1.1: (Serum cholesterol level)


The distribution of serum cholesterol level for the population of adult males
who are hypertensive and who smoke is approximately normal with an un-
known mean µ. However, we do know that the mean serum cholesterol level
for the general population of adult males is 211 mg/100mL. Is the mean choles-
terol level of the subpopulation of men who smoke and are hypertensive differ-
ent?
Suppose we select a sample of 25 men from this group and their mean choles-
terol level is x̄=220 mg/100mL. What can we conclude from this?

DEFINITION 6.1.1. A statistical hypothesis is a statement concerning the probability


distribution of a population (a random variable X).

We are concerned with parametric hypotheses: the distribution of X is specified except for a parameter. In the present case µ, where the population distribution is N(µ, σ²), the hypotheses can take the form µ = 6, or µ ≠ 4, or . . . .
The hypothesis under test is called the null hypothesis, denoted H0 . It has a special impor-
tance in that it usually reflects the status quo: the way things were, or should be. Often
the null hypothesis represents a “no effect” hypothesis. The onus is on the experimenter
to demonstrate that an “effect” exists. We don’t reject the null hypothesis unless there is
strong evidence against it.
We always take H0 to be a simple hypothesis: µ = µ0 .
We test the null hypothesis against an alternative hypothesis, denoted by H1. We will always take the alternative hypothesis to be H0′, i.e. the complement of H0 (µ ≠ µ0).

EXAMPLE 6.1.2: (Serum cholesterol level, continued)


In this case, the null hypothesis is that the mean cholesterol level for SH men
(i.e. men who smoke and have hypertension) is the same as the mean for all
men. The alternative hypothesis is that it is different.

The “logic” of the hypothesis testing procedure seems a bit back-to-front at first. It is based on the contrapositive: [M ⇒ D] = [D′ ⇒ M′].
For example: if the model M is a two-headed coin then the data D = the results are all heads; so, if D′ = a tail is observed then M′ = the coin is not two-headed.
Our application is rather more uncertain:
[M (µ = µ0) ⇒ D (x̄ ≈ µ0)]
[D′ (x̄ ≉ µ0) ⇒ M′ (µ ≠ µ0)]
This logic means that we have a (NQR) “proof” of µ ≠ µ0. (If the signs were all equalities rather than (random) approximations, it would be a proof.)
We have no means of “proving” (NQR or otherwise) that µ = µ0.
“I am getting into your involved habit, Watson, of telling a story backward.”
Sherlock Holmes, The Problem of Thor Bridge, 1927.

We observe the sample and compute x̄. On the basis of the sample and the test statistic, we
must reach a decision: “reject H0 ”, or not.
Statisticians are reluctant to use “accept H0 ” for “do not reject H0 ”, for the reasons indicated
above. Mind you, this does seem a bit odd when “success” can be used to mean “the patient dies”.
If ever I use “accept H0 ” (and I’m inclined to occasionally), it means only “do not reject H0 ”.
In particular, it does not mean that H0 is true, or even that I think it likely to be true!
However, it is well to keep in mind that:
“absence of evidence is not the same as evidence of absence”.

To demonstrate the existence of an effect (µ ≠ µ0), the sample must produce evidence against the no-effect hypothesis (µ = µ0).

6.2 Types of error and power

Types of error
In deciding whether to accept or reject H0, there is a risk of making two types of error:

             reject H0             don’t reject H0
H0 true      × error of type I     ✓ correct
             prob α                prob 1−α
             (α = significance level)

             reject H0             don’t reject H0
H1 true      ✓ correct             × error of type II
             prob 1−β              prob β
             (1−β = power)

We want α and β to be small.


The significance level α is usually pre-set at 0.05; we then do what we can to make the
power large (and hence β small). This will generally mean taking a bigger sample.
There is a helpful analogy between legal processes, at least in Westminster-style legal systems, and hypothesis testing.

hypothesis testing                          the law
null hypothesis H0                          accused is innocent
alternative hypothesis H1                   accused is guilty
don’t reject H0 without strong evidence     innocent until proven guilty, beyond a reasonable doubt
type I error                                convict an innocent person
type II error                               acquit a guilty person
α = Pr(type I error)                        beyond reasonable doubt
power = 1 − Pr(type II error)               effectiveness of system in convicting a guilty person

EXAMPLE 6.2.1: (A simple example to illustrate some of the terms used.)


I have a coin which I think may be biased. To test this I toss it five times: if I get
all heads or all tails, I will say it is biased, otherwise I’ll say it’s unbiased.

Let θ = probability of obtaining a head; then:

null hypothesis, H0: θ = ½ (unbiased);
alternative hypothesis, H1: θ ≠ ½ (biased);
test statistic, X = number of heads obtained;
test (decision rule): reject H0 if X ∈ {0, 5}.

significance level = Pr(reject H0 | H0 true)
                   = Pr(X ∈ {0, 5} | θ = ½)
                   = (½)⁵ + (½)⁵
                   = 1/16 ≈ 0.06

power = Pr(reject H0 | H1 true)
      = Pr(X ∈ {0, 5} | θ ≠ ½) . . . (this can’t be evaluated)

So, we define the power function:
Q(θ) = Pr(reject H0 | θ)
     = Pr(X ∈ {0, 5} | θ)
     = (1 − θ)⁵ + θ⁵

[Graph of Q(θ): a minimum at θ = ½, rising towards 1 as θ approaches 0 or 1.]

Note 1: Q(0.5) is the significance level of the test.

Note 2: Q(0.75) = 0.25⁵ + 0.75⁵ ≈ 0.24; so this is not a particularly good test. But we knew that anyway!

Note 3: To make a better test (one with greater power), we need to increase the sample size. For example, with n=100, reject H0 unless 40 ≤ X ≤ 60.
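A quick R sketch of this power function:

# power function of the 5-toss test
Q <- function(theta) (1 - theta)^5 + theta^5
Q(0.5)    # 0.0625, the significance level
Q(0.75)   # about 0.24
curve(Q, from = 0, to = 1, xlab = "theta", ylab = "power")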

6.3 Testing procedures

6.3.1 Confidence intervals

There are several ways of approaching a hypothesis test. The first, and simplest after Chap-
ter 5, is to compute a confidence interval (which is a good idea in any case); and then to
check whether or not the null-hypothesis value (µ = µ0 ) is in the confidence interval.
We have seen how to obtain a confidence interval for µ, so there is not much more to do. In
fact, a number of the problems and examples had parts that questioned the plausibility of
particular values of µ. This is now seen to be equivalent to hypothesis testing.

EXAMPLE 6.3.1: We obtain a random sample of n=40 from a normal population with known standard deviation σ=4. The sample mean is x̄=11.62. Test the null hypothesis H0: µ=10 (against a two-sided alternative).

95% CI for µ: (11.62 ± 1.96×4/√40) = (10.38, 12.86).

Since the 95% confidence interval does not include 10, we reject the null hypoth-
esis µ=10. There is significant evidence in this sample that µ>10.

EXAMPLE 6.3.2: (Serum cholesterol level)
[n=25, µ0=211, σ=46; x̄=220]

95% CI for µ: 220 ± 1.96×46/√25 = (202.0, 238.0).

Since the 95% confidence interval includes 211, we do not reject the null hypothesis µ = 211. There is no significant evidence in this sample that µ ≠ 211.
This approach can always be used whenever you have a confidence interval, but it has
disadvantages: it does not tell you how strongly to reject (or not) a particular hypothesis,
and it does not use the hypothesised number to construct the confidence interval.

6.3.2 The p-value

We can measure the strength of the evidence of the sample against H0 by using the “unlike-
lihood” of the data if H0 is true. The idea is to work out how unlikely the observed sample
is, assuming µ = µ0 . If it is “too unlikely”, then we reject H0 ; and otherwise, we do not
reject H0 .

DEFINITION 6.3.1. The p-value, denoted in these notes by p, is the probability (if H0
were true) of observing a value as extreme as the one observed.

This means that:

p = Pr(X̄ is at least as far from µ0 as the observed x̄, above or below),

where X̄ denotes the sample mean, assuming H0 is true, i.e. X̄ =ᵈ N(µ0, σ²/n).
Therefore:
p = 2 Pr(X̄ > x̄) if x̄ > µ0;  p = 2 Pr(X̄ < x̄) if x̄ < µ0;  where X̄ =ᵈ N(µ0, σ²/n).

The 2 is because this is a two-sided test, and we must allow for the possibility of being as
extreme at the other end of the distribution (i.e. above or below).

EXAMPLE 6.3.3: We obtain a random sample of n=40 from a normal population with known standard deviation σ=4. The sample mean is x̄=11.62. Test the null hypothesis H0: µ=10 (against a two-sided alternative).

[n=40, σ=4, µ0=10, x̄=11.62]

p = 2 Pr(X̄ > 11.62), where X̄ =ᵈ N(10, 4²/40) (the H0 distribution).
∴ p = 2 Pr(X̄s > (11.62−10)/(4/√40)) = 2 Pr(X̄s > 2.56) = 0.010.

Now we must specify what is meant by “too unlikely”; i.e. how small is “too small” a value
for p? It seems sensible to match our idea of what is “too small”, with what is “implausi-
ble”. Thus, if we reject H0 if p < 0.05, then this corresponds exactly to values outside the
95% confidence interval, i.e. the “implausible” values.

Our standard testing procedure, therefore, is to compute the p-value and to reject H0 if
p < 0.05 (and not to reject H0 otherwise). Thus, in both the above two examples, we
would reject H0 (at the 5% level of significance).
We have seen how to compute the probability, so there is nothing new in that. What is new here is
the terminology that comes with it.
One advantage of the p-value is that it gives a standard indication of the strength of the
evidence against H0 . The smaller the value of p, the stronger the evidence against H0 .
As we can specify different levels for a confidence interval, we can specify different levels
for the test. To correspond to a 99% CI, we would reject H0 if p < 0.01.
We specify α, the significance level of the test. Typically we use α=0.05, just as we typically
use a 95% confidence interval. But we may choose α=0.01 or 0.001 or another value.

DEFINITION 6.3.2. If we observe p < α, then we reject H0 and say that the result is
statistically significant.

EXAMPLE 6.3.4: (Serum cholesterol level, continued)
[n=25, µ0=211, σ=46; x̄=220]

p = 2 Pr(X̄ > 220), where X̄ =ᵈ N(211, 46²/25) (the H0 distribution).
∴ p = 2 Pr(X̄s > (220−211)/(46/√25)) = 2 Pr(X̄s > 0.978) = 0.328.

Since p > 0.05, we do not reject the null hypothesis µ = 211. There is no significant evidence in this sample that µ ≠ 211.

6.3.3 Critical values

The p-value approach is the most widely used, and preferred when it is avaiable, but some-
times it is difficult to calculate the required probability. A third approach, the critical value
approach, is to specify a decision rule for rejecting H0 .
The rejection rule is often best expressed in terms of a statistic that has a standard distribu-
tion if H0 is true. Here the test statistic is
X̄ − µ0
Z= √
σ/ n
d
which is such that, if H0 is true, then Z = N(0, 1). Note that Z involves only X̄ and known
constants (the null hypothesis value µ0 , the known standard deviation, σ, and the sample
size, n). In particular, Z does not depend on the unknown parameter µ.
The rule then is to compute the observed value of Z and to see if it could plausibly be
an observation from a standard normal distribution. (Here, “plausible” is taken to mean
within the central 95% of the distribution.) If not, we reject H0 . This leads to the name often
used for this test: the z-test.
x̄ − µ0
We compute the observed value of Z, i.e. z = √ , and compare it to the standard
σ/ n
normal distribution. Thus the decision rule is
reject H0 if z < −1.96 or z > 1.96; i.e. if |z| > 1.96.
which corresponds exactly to the rejection region for x̄ given above.

A random sample of 50 observations is obtained from a normal population with standard deviation 5. The observed sample mean is 8.3. Test the null hypothesis that µ=10.

[n=50, µ0=10, σ=5, x̄=8.3] ⇒ z = (8.3−10)/(5/√50) = −2.40.

Hence we reject H0 (using significance level 0.05) since z < −1.96.
There is significant evidence in this sample that µ < 10.

EXAMPLE 6.3.5: (Serum cholesterol level, continued)
[n=25, µ0=211, σ=46; x̄=220]

z = (x̄ − µ0)/(σ/√n) = (220−211)/(46/√25) = 0.978.

Since |z| < 1.96, we do not reject the null hypothesis µ = 211.
There is no significant evidence in this sample that µ ≠ 211.

EXAMPLE 6.3.6: (Birth weights)


A researcher thinks that mothers with low socioeconomic status (SES) deliver
babies whose birthweights are lower than “normal”. To test this hypothesis, a
random sample of birthweights from 100 consecutive, full-term, live-born ba-
bies from the maternity ward of a hospital in a low-SES area is obtained. Their
mean birthweight is found to be 3240 g. We know from nationwide surveys that
the population mean birthweight is 3400 g with a standard deviation of 700 g.

Do the data support her hypothesis?


[n=100, x̄=3240; we assume σ=700; µ0=3400]
z = (3240−3400)/(700/√100) = −2.29. Since |z| > 1.96, we reject H0. There is significant evidence in this sample that the mean birthweight of SES babies is less than the national average.

Describe the type I and type II errors in this context.


In this context, a type I error is to conclude that “SES babies” are different, when they
are actually the same as the rest of the population; a type II error is to conclude that
“SES babies” are the same, when they are in fact different.

EXAMPLE 6.3.7: (Serum cholesterol level, continued)


Describe the type I and type II errors in this context.
A type I error is to conclude that the group of interest (SH men) have different mean
serum cholesterol level from the general adult male population, when they actually
have the same mean.
A type II error is to conclude that the SH men are no different from the general adult
male population with respect to serum cholesterol levels, when in fact they are different.

Compute β, the probability of making a type II error, when the true value of µ
is 250.

[n = 25, µ0 = 211, σ = 46]

β = Pr(don’t reject H0 | µ = 250)
  = Pr(211 − 1.96×46/√25 < X̄ < 211 + 1.96×46/√25), where X̄ =ᵈ N(250, 46²/25)
  = Pr(−39/(46/√25) − 1.96 < X̄s < −39/(46/√25) + 1.96)
  = Pr(−6.20 < X̄s < −2.28)
  = 0.0113 − 0.0000
  = 0.011

This calculation can be done more neatly in terms of Z = (X̄ − 211)/(46/√25).
If X̄ =ᵈ N(250, 46²/25), then Z =ᵈ N((250−211)/(46/√25), 1), i.e. Z =ᵈ N(4.24, 1).
[using the result that Y = (X−a)/b has mean (µ−a)/b and variance σ²/b²]

Then: β = Pr(−1.96 < Z < 1.96) = Pr(−6.20 < Zs < −2.28) = 0.011, as above.
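The same numbers drop out of a few lines of R (a sketch of the calculation above):

# type II error probability when mu = 250 (Example 6.3.7)
n <- 25; mu0 <- 211; sigma <- 46; mu1 <- 250
theta <- (mu1 - mu0)/(sigma/sqrt(n))          # 4.24
pnorm(1.96 - theta) - pnorm(-1.96 - theta)    # beta, about 0.011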

6.4 Hypothesis testing for normal populations

In this section, we consider tests for the parameter µ for a normal population. So the “pa-
rameter of interest” here is the population mean µ. Later in the chapter we turn to other
parameters.
We define a statistic that has a “standard” distribution when H0 is true (i.e. N or t, depend-
ing on whether σ is known or unknown). A decision is then obtained by comparing the
observed value for this statistic with the standard distribution.
In reporting the results of the test, you should give the value of the “standard” statistic,
the p-value, and a verbal conclusion/explanation. It is recommended that you also give a
confidence interval in reporting your results.

6.4.1 z-test (testing µ=µ0, when σ is known/assumed)

This is the scenario we have been considering in the previous sections. We define:

Z = (X̄ − µ0)/(σ/√n)

in which X̄ is observed; µ0, σ and n are given or assumed known.
If H0 is true, then Z =ᵈ N(0, 1).
We evaluate the observed value of Z:

z = (x̄ − µ0)/(σ/√n)

and compare it to the standard normal distribution. For significance level 0.05, we reject H0 if |z| > 1.96.

The p-value is computed using the tail probability for a standard normal distribution:
p = 2 Pr(Z > z) if z > 0;  p = 2 Pr(Z < z) if z < 0;  where Z =ᵈ N(0, 1) (the H0 distribution).

EXAMPLE 6.4.1: We obtain a random sample of n=40 from a normal population with known standard deviation σ=4. The sample mean is x̄=11.62. Test the null hypothesis H0: µ=10 (against a two-sided alternative).

[n = 40, σ = 4, x̄ = 11.62, µ0 = 10]

z = (11.62−10)/(4/√40) = 2.56; p = 2 Pr(Z > 2.56) = 0.010.

The sample mean, x̄=11.62; the z-test of µ=10 gives z=2.56, p=0.010.

Thus there is significant evidence in this sample that µ>10; the 95% CI for µ is (10.38, 12.86).

EXAMPLE 6.4.2: (Renal disease)


The mean serum-creatinine level measured in 12 patients 24 hours after they re-
ceived a newly proposed antibiotic was 1.2 mg/dL. The mean and standard de-
viation of serum-creatinine level in the general population are 1.0 and 0.4 mg/dL
respectively.

Is there evidence to support the claim that their mean serum-creatinine level is
different from that of the general population?

There are some routine functions in R implementing the test, but it is straight-
forward to perform directly:

> z <- (1.2 - 1)/(0.4/sqrt(12)) # Z-statistic


> z
[1] 1.732051
> 2*(1-pnorm(z)) # p-value
[1] 0.08326452

Note that we are assuming the standard deviation of serum-creatinine level is the same in the treated individuals as in the general population (as well as normality etc.)

There is no significant evidence in this sample that the mean serum-creatinine level is different in these patients (z = 1.73, p = 0.083); the 95% CI for the mean is (0.97, 1.43) and may be obtained as follows:

> 1.2 + qnorm(0.025)*0.4/sqrt(12) # lower end


[1] 0.9736829
> 1.2 + qnorm(0.975)*0.4/sqrt(12) # upper end
[1] 1.426317
The z-test provides a routine which can be used in other cases.

Power of a z-test
Suppose that Z =ᵈ N(θ, 1). We observe Z, and on the basis of this one observation, we wish to test H0: θ = 0 against H1: θ ≠ 0.

For example, for θ = 3:
power = Pr(|Z| > 1.96), where Z =ᵈ N(3, 1)
1 − power = Pr(−1.96 < Z < 1.96)
          = Pr(−4.96 < Zs < −1.04)
          = 0.1492 − 0.0000
∴ power = 0.851
The following table gives us some information on the power and probability of type II error
for different values of θ:

Z =ᵈ N(θ, 1)           reject H0 (|Z| > 1.96)              don’t reject H0 (|Z| < 1.96)
H0 true (θ = 0)        × error of type I                   ✓ correct
Z =ᵈ N(0, 1)           α = Pr(|Z| > 1.96) = 0.05           prob = 0.95

H1 true (θ ≠ 0)        ✓ correct                           × error of type II
e.g. Z =ᵈ N(1, 1)      power = Pr(|Z| > 1.96) = 0.17       prob = 0.83
e.g. Z =ᵈ N(2, 1)      power = Pr(|Z| > 1.96) = 0.52       prob = 0.48
e.g. Z =ᵈ N(3, 1)      power = Pr(|Z| > 1.96) = 0.85       prob = 0.15
e.g. Z =ᵈ N(3.61, 1)   power = Pr(|Z| > 1.96) = 0.95       prob = 0.05
e.g. Z =ᵈ N(4, 1)      power = Pr(|Z| > 1.96) = 0.98       prob = 0.02
e.g. Z =ᵈ N(−1, 1)     power = Pr(|Z| > 1.96) = 0.17       prob = 0.83
e.g. Z =ᵈ N(−2, 1)     power = Pr(|Z| > 1.96) = 0.52       prob = 0.48
e.g. Z =ᵈ N(−3, 1)     power = Pr(|Z| > 1.96) = 0.85       prob = 0.15
e.g. Z =ᵈ N(−3.61, 1)  power = Pr(|Z| > 1.96) = 0.95       prob = 0.05
e.g. Z =ᵈ N(−4, 1)     power = Pr(|Z| > 1.96) = 0.98       prob = 0.02

Except for θ close to zero, it is usually the case that only one tail is required (as the other is
negligible). For example, for θ = −1,

    power = Pr(|Z| > 1.96), where Z =d N(−1, 1)
    1 − power = Pr(−1.96 < Z < 1.96)
              = Pr(−0.96 < Zs < 2.96)    [standardising: Zs = Z + 1]
              = Pr(Zs < 2.96) − Pr(Zs < −0.96)
              = 0.9985 − 0.1685
    ∴ power = 0.170

Using the above table, we could plot a graph of the power function. The graph has a
minimum at θ = 0 (of 0.05, the significance level), and increases towards 1 on both sides
as θ moves away from zero: for θ = 4 or θ = −4 the power is 0.98.
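As a quick check, the power function can be computed and plotted in R; the following is a minimal sketch using only base functions:

> theta <- seq(-4, 4, by = 0.1)
> # power = Pr(|Z| > 1.96), where Z =d N(theta, 1)
> power <- pnorm(-1.96, mean = theta) + 1 - pnorm(1.96, mean = theta)
> plot(theta, power, type = "l", xlab = "theta", ylab = "power")
> abline(h = 0.05, lty = 2)   # the minimum value: the significance level
> pnorm(-1.96, 3) + 1 - pnorm(1.96, 3)   # power at theta = 3: 0.851, as above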

For the z-test, the statistic is Z = (X̄ − µ0)/(σ/√n).
If µ = µ0, then Z =d N(0, 1).
If µ = µ1, then Z =d N(θ, 1), where θ = (µ1 − µ0)/(σ/√n).
And we only get one observation on Z.
So the z-test is actually equivalent to the example above.
We can use the results of that example to work out power for any z-test, using
power = Pr(|Z| > 1.96), where Z =d N(θ, 1).

Sample size determination

To devise a test of significance level 0.05 that has power of 0.95 when µ = µ1, we need
θ = 3.6049,

    i.e.  (µ1 − µ0)/(σ/√n) = 3.61  ⇒  n = 13σ²/(µ1 − µ0)².    [3.6049² = 12.9953 ≈ 13]

EXAMPLE 6.4.3: (Serum cholesterol level, continued)


Find the required sample size if we want a test to have significance level 0.05
and power 0.95 when µ = 220.

Here µ0 = 211, µ1 = 220 and σ = 46. Therefore:


    n > 13×46²/9² = 340.
Thus we need a sample of at least 340, in order to ensure a power of 0.95 when
the population mean is 220.
The sample size result can be generalised, to any significance level α and specified power
1−β as indicated in the following diagram, which indicates the derivation of 3.6049 =
1.96 + 1.6449.

The diagram indicates that to achieve a z-test of µ=µ0 , with significance level α and power
1−β when µ=µ1 , we require

    (µ1 − µ0)/(σ/√n) > z_{1−α/2} + z_{1−β}  ⇒  n > (z_{1−α/2} + z_{1−β})² σ² / (µ1 − µ0)²,

where z_q denotes the standard normal q-quantile.
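This rule is easily evaluated in R; a sketch, applying it to the values of Example 6.4.3 (qnorm gives the standard normal quantiles):

> mu0 <- 211; mu1 <- 220; sigma <- 46
> alpha <- 0.05; beta <- 0.05
> n <- (qnorm(1 - alpha/2) + qnorm(1 - beta))^2 * sigma^2 / (mu1 - mu0)^2
> ceiling(n)   # 340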

6.4.2 t-test (testing µ=µ0 when σ is unknown)

We define:  T = (X̄ − µ0) / (S/√n)
in which, X̄ and S are observed; µ0 and n are given.
If H0 is true, then T =d t_{n−1}. We evaluate the observed value of T:  t = (x̄ − µ0)/(s/√n),
and compare it to the t_{n−1} distribution, i.e. the null distribution, its distribution if H0 is true.

For significance level 0.05, we reject H0 if |t| > “2” = c0.975 (tn−1 ).

The p-value is computed using the tail probability for a t_{n−1} distribution:

    p = 2 Pr(T > t) if t > 0,  or  p = 2 Pr(T < t) if t < 0,

where T =d t_{n−1} (the H0 distribution).

EXAMPLE 6.4.4: (Cardiology)


A topic of recent clinical interest is the possibility of using drugs to reduce in-
farct size in patients who have had a myocardial infarction (MI) within the past
24 hours. Suppose we know that in untreated patients the mean infarct size
is 25. In 18 patients treated with the drug, the sample mean infarct size is 16.2
with a sample standard deviation of 8.4. Is the drug effective in reducing infarct
size?

[µ0 = 25; n = 18, x̄ = 16.2, s = 8.4]

    t = (16.2 − 25)/(8.4/√18) = −4.44;  p = 2 Pr(t_{17} < −4.44) = 0.000.

The sample mean for treated patients x̄=16.2 is significantly less than the known
mean for untreated patients (t= − 4.44, p=0.000).

In reporting this test result, it is recommended that you also give the 95% CI for
µ: (12.0, 20.4).
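With only summary statistics available, t.test() cannot be applied directly, but the same computations are easily done in R; a sketch, using the values above:

> n <- 18; xbar <- 16.2; s <- 8.4; mu0 <- 25
> t <- (xbar - mu0)/(s/sqrt(n)); t             # -4.44
> 2*pt(t, df = n - 1)                          # p-value (t < 0): ~0.0004, reported as 0.000
> xbar + qt(c(0.025, 0.975), n - 1)*s/sqrt(n)  # 95% CI: (12.0, 20.4)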

EXAMPLE 6.4.5: (Calorie content)


Many consumers pay careful attention to stated nutritional contents on pack-
aged foods when making purchases. It is therefore important that the informa-
tion on packages be accurate. A random sample of n = 12 frozen dinners of a
certain type was selected from production during a particular period, and calo-
rie content of each one was determined. Here are the resulting observations.
255 244 239 242 265 245 259 248 225 226 251 233
The stated calorie content is 240. Do the data suggest otherwise?

R can be used to analyse the data using the function t.test() by entering the

data and the null hypothesis value µ0 . For the above example we obtain

> x = c(255, 244, 239, 242, 265, 245, 259, 248, 225, 226, 251, 233) # data
> t.test(x, mu=240) # perform t test on x with null hypothesis mu=240

One Sample t-test

data: x
t = 1.2123, df = 11, p-value = 0.2508
alternative hypothesis: true mean is not equal to 240
95 percent confidence interval:
236.4657 252.2010
sample estimates:
mean of x
244.3333

There is no significant evidence in this sample that the mean is different from
240 calories (t = 1.21, p = 0.251); the 95% CI for the mean is (236.5, 252.2).

6.4.3 Approximate z-tests

An approximate z-test can be used in a wide variety of situations: it can be used whenever
we have a result that says the null distribution of the test statistic is approximately normal.
The central limit theorem ensures that there are many such situations.

Testing a population proportion: approx z-test for testing p=p0 (Binomial parameter)

Suppose we observe a large number of independent trials and obtain X successes. To test
H0 : p = p0 , where p denotes the probability of success, we can use
    Z = (X − np0)/√(np0(1 − p0)) = (P̂ − p0)/√(p0(1−p0)/n),  where P̂ = X/n,

in which, X, or P̂ , is observed; p0 and n are given.


If H0 is true, then Z ≈d N(0, 1), provided n is large.
This can then be used in the same way as a z-test: we evaluate the observed value of Z:
    z = (p̂ − p0)/√(p0(1−p0)/n)    (i.e. z = (est − θ0)/se0)

and compare it to the standard normal distribution, though in this case we should adjust
for discreteness by using a correction for continuity.
In this case there is not an exact correspondence between the test and the confidence interval, since
se0 ≠ se. This is because the confidence interval is based on an additional approximation: that
p(1−p) ≈ p̂(1−p̂). The test procedure is preferred. If it were used for the confidence interval, it
would give a better, but messier, confidence interval.

EXAMPLE 6.4.6: 100 independent trials resulted in 37 successes. Test the hy-
pothesis that the probability of success is 0.3.

p̂ = x/n = 0.37,  z = 0.07/√(0.3×0.7/100) = 1.528.
The correction for continuity is to reduce 0.07 by 0.5/100 = 0.005,
i.e. zc = 0.065/√(0.3×0.7/100) = 1.418;
so p ≈ 2 Pr(N > 1.418) = 0.156.

The exact p-value is p = 0.160, obtained using p = 2 Pr(X ≥ 37), where X =d Bi(100, 0.3).

Ignoring the continuity correction gives p ≈ 2 Pr(Z > 1.528) = 0.127.

There is no significant evidence in this result to indicate that p is different from


0.3.
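A sketch of both calculations in R (the continuity-corrected normal approximation, and the exact Binomial probability):

> x <- 37; n <- 100; p0 <- 0.3
> se0 <- sqrt(p0*(1 - p0)/n)
> zc <- (abs(x/n - p0) - 0.5/n)/se0
> 2*(1 - pnorm(zc))              # approx p-value with continuity correction: 0.156
> 2*(1 - pbinom(x - 1, n, p0))   # exact p-value, 2*Pr(X >= 37): 0.160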

EXAMPLE 6.4.7: 1000 independent trials resulted in 280 successes. Test the hy-
pothesis that the probability of success is 0.3.

p̂ = x/n = 0.28,  z = −0.02/√(0.3×0.7/1000) = −1.380.
The correction for continuity is to reduce 0.02 by 0.5/1000 = 0.0005,
i.e. zc = −0.0195/√(0.3×0.7/1000) = −1.346;  so p ≈ 2 Pr(N > 1.346) = 0.178.

There is no significant evidence in this result to indicate that p is different from


0.3.
(The exact p-value is p = 0.177. Ignoring the continuity correction gives p ≈ 0.168.)
It is observed that ignoring the continuity correction gives an underestimate of the p-value,
meaning that we are more likely to reject H0 when we should not. The effect of ignoring the
continuity correction is to increase the significance level (above the specified level of 0.05).
It is also observed that the effect of the continuity correction decreases as n increases, though,
as seen in the example above, it can be non-negligible even for quite large values of n.

Testing a population proportion: exact test

The exact p-value can be evaluated as:

    p = 2 Pr(X ≥ x) if x > np0,  or  p = 2 Pr(X ≤ x) if x < np0,

where X =d Bi(n, p0) (the H0 distribution).

When using a normal approximation, we should use a correction for continuity:

    p ≈ 2 Pr(X* > x−0.5) if x > np0,  or  p ≈ 2 Pr(X* < x+0.5) if x < np0,

where X* =d N(np0, np0 q0).
Standardisation, using (X/n − p0)/√(p0 q0 / n), leads to the continuity correction rule specified
above; i.e. reduce the magnitude of p̂ − p0 by 0.5/n.
The exact p-value can always be computed as a Binomial probability, as specified above, using R.

Our approach then is to use the normal approximation, with continuity correction, to give
an approximation to the p-value. If an exact value is required, we can use the Binomial
probability. If the distribution of the test statistic is symmetrical then the two definitions
coincide. So it is only in the case of a skew distribution that there is a difference.

If n is small, there is little point in considering the normal approximation. We might as well
go straight to the exact test, using the Binomial distribution.

EXAMPLE 6.4.8: (Occupational medicine)


Suppose that 13 deaths have occurred among 55–64 year-old male workers in a
nuclear power plant and that the cause of death was cancer in 5 of them. As-
sume, based on vital-statistics reports, that approximately 20% of all deaths in
this age-group can be attributed to some form of cancer. Is this result signifi-
cant?
p = 2 Pr(X ≥ 5), where X =d Bi(13, 0.2);
thus p = 0.198 and we do not reject H0. There is no significant evidence in these
data to indicate that the percentage of deaths attributable to cancer is different
from 20%.
The approx z-test gives zc = (0.385 − 0.2 − 0.5/13)/√(0.2×0.8/13) = 1.321,
so p ≈ 2 Pr(N > 1.321) = 0.187.

In R:

> binom.test(x=5, n=13, p=0.2)

Exact binomial test

data: 5 and 13
number of successes = 5, number of trials = 13, p-value = 0.1541
alternative hypothesis: true probability of success is not equal to 0.2
95 percent confidence interval:
0.1385793 0.6842224
sample estimates:
probability of success
0.3846154

Note that R’s binom.test computes the lower tail probability a little differ-
ently; it calculates the probability that X is further from the mean than 5, whereas
we simply multiply the upper tail probability by 2.

Sample size determination

Suppose we wish to test H0 : p = p0 using a significance level α and with power 1−β when
p=p1 . Using a normal approximation and following the derivation given for the normal
case, gives
    n > ( z_{1−α/2} √(p0(1−p0)) + z_{1−β} √(p1(1−p1)) )² / d².

This can be seen using a diagram like the one used for the normal-mean case (cf. the diagram on page 141).

EXAMPLE 6.4.9: Find the sample size required to test H0 : p = 0.3 with signifi-
cance level 0.05, so that the test has power 0.90 when p=0.2.

According to the above result, we require:


    n > (1.96√(0.3×0.7) + 1.2816√(0.2×0.8))² / 0.1²,  i.e. n > 199.04.
Thus, we require a sample of size 200, at least.
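The same arithmetic in R (a sketch; note qnorm(0.90) = 1.2816):

> p0 <- 0.3; p1 <- 0.2; d <- abs(p1 - p0)
> n <- (qnorm(0.975)*sqrt(p0*(1 - p0)) + qnorm(0.90)*sqrt(p1*(1 - p1)))^2 / d^2
> ceiling(n)   # 200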

Testing for a population median: approximate z-test for testing m = m0

If we have a population variable X having any continuous distribution with median m,
then Pr(X < m) = 1/2.
This means that the number of observations in a random sample on X that are less than
the population median satisfies freq(X < m) =d Bi(n, 1/2); since we can regard an observation
as a trial, with probability of success Pr(X < m) = 1/2, and the trials are independent since it is
a random sample.
To test H0 : m = m0, we define p = Pr(X < m0), and test the hypothesis p = 1/2, as in the
previous section. Let U = freq(X < m0); then if m = m0, U =d Bi(n, 1/2), and a test
of H0 : m = m0 based on U is equivalent to a test of p = 0.5. If n is large, we can use a z-test:

    Z = (U − n/2)/√(n/4) = (P̂ − 1/2)/√(1/(4n)),  where P̂ = U/n;

since, if H0 is true, then Z ≈d N(0, 1), provided n is large. This approximation works quite
well even for relatively small n, since in this application p0 = 1/2 (and the normal
approximation works best for p = 1/2).
So this can be used in the same way as a z-test for a proportion: we evaluate the observed
value of Z and compare it to the standard normal distribution, with a continuity correction.
The Binomial distribution can be used to evaluate exact p-values.

EXAMPLE 6.4.10: Consider a random sample of n = 400 observations on a pop-


ulation specified by the random variable X. We wish to test the null hypothesis
H0 : m = 40; and we observe that u = freq(X < 40) = 221. Note that this suggests
that the median might be less than 40, as more than half of the sample is less than 40.
Let p = Pr(X < 40); then p̂ = 221/400 = 0.5525.

So the test is based on zc = (0.0525 − 0.5/400)/√(1/1600) = 2.05, so p = 2 Pr(Z > 2.05) = 0.040.

Hence there is significant evidence in these data to indicate that the population
median is less than 40 (since there is evidence that Pr(X < 40) > 0.5).

Note: the exact p-value is p = 2 Pr(U ≥ 221), where U =d Bi(400, 1/2); p = 0.040.
A confidence interval for the population median can be obtained as the set of values m′ for
which the null hypothesis m = m′ is not rejected.
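A sketch of the sign-test calculations for Example 6.4.10 in R:

> u <- 221; n <- 400
> zc <- (abs(u/n - 0.5) - 0.5/n)/sqrt(1/(4*n))
> 2*(1 - pnorm(zc))               # approx p-value: 0.040
> 2*(1 - pbinom(u - 1, n, 0.5))   # exact p-value, 2*Pr(U >= 221): 0.040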

Testing a population rate: approximate z-test for testing α = α0

The result we use to examine the population rate, α (cases per person-year), is: X = number
of cases in a population observed for t person-years =d Pn(αt). It follows that

    Z = (X − α0 t)/√(α0 t) = (X/t − α0)/√(α0/t) ≈d N(0, 1) if H0 is true,

in which X is observed; and t and α0 are specified.

This can then be used in the same way as a z-test (provided α0 t is greater than 10). We
evaluate the observed value of Z, and compare it to the standard normal distribution. Again,
as we are approximating an integer-valued variable by a normal distribution, a continuity
correction is required. Since α̂ = X/t, the continuity correction is 0.5/t. This is applied in the
same way as for the Binomial test, i.e. reduce |α̂ − α0| by 0.5/t.

EXAMPLE 6.4.11: The incidence rate for disease D is supposed to be α = 0.025


cases/person-year, based on population data. A study of a particular subpopu-
lation reported x = 43 cases based on 1047 person-years.
Does this represent a significant departure from the population value?
Give a 95% confidence interval for the incidence rate for this subpopulation
based on the results of this study.

point estimate:  α̂ = 43/1047 = 0.041 (cases/person-year).

To test H0 : α = 0.025, use
    zc = (α̂ − α0 − 0.5/1047)/se0 = (0.04107 − 0.025 − 0.5/1047)/√(0.025/1047) = 3.191;
p = 2 Pr(N > 3.191) = 0.001, and hence we conclude these data show a signifi-
cant increase in incidence rate in this subpopulation.

For these data, we have est = 43/1047 = 0.041, se = √(0.041/1047) = 0.0062; and hence:
approx 95% CI: 0.041 ± 1.96×0.0062 = (0.029, 0.053)  [which excludes 0.025.]
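The same steps in R (a sketch):

> x <- 43; t <- 1047; a0 <- 0.025
> ahat <- x/t
> zc <- (abs(ahat - a0) - 0.5/t)/sqrt(a0/t)
> 2*(1 - pnorm(zc))                    # p-value: 0.001
> ahat + c(-1.96, 1.96)*sqrt(ahat/t)   # approx 95% CI: (0.029, 0.053)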

Expected number of cases in a subpopulation

A common application is to compare a cohort (or a subpopulation) with the general pop-
ulation. The subpopulation may be individuals working in a particular industry, or in-
dividuals who live in a particular area — for example, close to a potential hazard. Are
the individuals in this subpopulation more likely to develop disease D than the general
population?
To examine this hypothesis, we need to work out the number of cases of D that would be
expected if the subpopulation were the same as the general population.

Typically, the incidence rates for D will depend on a range of covariates, usually age and
gender, but there may be others depending on the situation. To calculate the expected num-
ber of cases therefore, we stratify the subpopulation into categories of similar individuals
(e.g. age×gender categories). The expected number of cases for the subpopulation is then
worked out as
    λ0 = α1 t1 + α2 t2 + · · · + αc tc

where α1, α2, . . . , αc denote the general population incidence rates, and t1, t2, . . . , tc denote
the observed person-years for individuals from the subpopulation in each category.
This computation may be quite complicated and time-consuming. But we assume that all
that administration and record-keeping has been done. We are then left with the result that,
if the subpopulation behaves in the same way as the rest of the population (with respect to
disease D), then the number of observed cases of D in the subpopulation is such that
X =d Pn(λ0).
If λ0 is large, then we can use a z-test:

    Z = (X − λ0)/√λ0 ≈d N(0, 1), if H0 is true;
and proceed as before for a z-test. If required, exact results can be obtained using the
Poisson distribution.

EXAMPLE 6.4.12: (Occupational health)


Many studies have looked at possible health hazards of workers in the alu-
minium industry. In one such study, a group of 8418 male workers aged 40–64
(either active or retired) on January 1, 1994, were followed for 10 years for var-
ious mortality outcomes. Their mortality rates were then compared with na-
tional male mortality rates in 1998. In one of the reported findings, there were
21 observed cases of bladder cancer and an expected number of events from
general-population cancer mortality rates of 16.1. Evaluate the statistical signif-
icance of this result.

    x = 21, λ0 = 16.1  ⇒  zc = (20.5 − 16.1)/√16.1 = 1.097;  so p = 0.273.

Since p > 0.05, this result is not significant. There is no significant evidence in
this result to indicate that the occurrence of bladder cancer is different from the
general population.
In R, an exact version of the test is implemented by the function poisson.test and can
be carried out as follows:
> poisson.test(x=21, T=1, r=16.1)

Exact Poisson test

data: 21 time base: 1


number of events = 21, time base = 1, p-value = 0.2115
alternative hypothesis: true event rate is not equal to 16.1
95 percent confidence interval:
12.99933 32.10073
sample estimates:
event rate
21

The argument r specifies the mean under the null hypothesis.

Comparing λ with the value based on the general population rates, i.e. λ0, gives
the standardised incidence ratio (SIR) = λ/λ0.
This may also be referred to as the standardised mortality ratio (SMR) if the disease outcome is
death, or a standardised morbidity ratio if the disease outcome is diagnosis.
In the above example, the standardised mortality ratio is estimated by 21/16.1 = 1.30.
An exact 95% CI for λ is (13.0, 32.1), using R. It follows that a 95% CI for SMR = λ/16.1 is
(13.0/16.1, 32.1/16.1) = (0.81, 1.99). Since the confidence interval for SMR includes 1, there is no
significant evidence that this subpopulation differs from the general population, which
agrees with the above hypothesis-testing result . . . as it should. A test of SMR = 1 is the
same as a test of λ = λ0.
For small means, the normal approximation does not apply. In that case we use the exact
result, i.e. calculate the p-value using the Poisson distribution, and compare it with 0.05.

EXAMPLE 6.4.13: In a study of workers in the aluminium industry (see above),


six deaths due to Hodgkin’s disease were observed compared with 3.3 deaths
expected from general mortality rates. Is this difference significant?

H0 ⇒ X =d Pn(3.3); and we observed x = 6,
so p = 2 Pr(X ≥ 6), where X =d Pn(3.3).
∴ p = 2×0.117 = 0.234.

Since p > 0.05, this result is not significant. There is no significant evidence in
this result to indicate a different rate of Hodgkin’s disease among these workers.
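The exact tail probability can be checked quickly in R (a sketch):

> 2*(1 - ppois(5, lambda = 3.3))   # 2*Pr(X >= 6) = 0.234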

6.5 Case study: Bone density

Hopper and Seeman (1994)2 conducted a cross-sectional study to examine the relationship
between cigarette smoking and bone density. Data was collected on 41 pairs of female
twins with different smoking histories (each pair consisted of a lighter-smoking twin and
a heavier-smoking twin). Bone mineral density (BMD) was measured at three different
locations: the lumbar spine, the femoral neck and the femoral shaft. Further information,
including (but not limited to) age, height, weight, consumption of alcohol, use of tobacco,
and calcium intake, was collected on each participant using a questionnaire.
EXERCISE. This is only one possible study that could be used to examine this proposed
relationship. What other ways could we construct a cross-sectional study? What about an
experiment or another kind of observational study?
We are interested in the following research question: is there a difference between the mean
lumbar spine BMD between the lighter-smoking and heavier-smoking twins? Let µ1 de-
note the mean lumbar spine BMD for lighter-smoking twins and µ2 denote the mean lum-
bar spine BMD for heavier-smoking twins. Also define µD = µ2 − µ1 . If µD < 0 (i.e.
µ2 < µ1 ) then the mean lumbar spine BMD of heavier-smoking twins is less than the mean
lumbar spine BMD of lighter-smoking twins. We can use a one sample t-test to test the null
hypothesis H0 : µD = 0 against the alternative hypothesis H1 : µD ≠ 0.
2 Hopper, J.L. and Seeman, E. (1994). The bone density of female twins discordant for tobacco use. New England
Journal of Medicine, 330, 387–392.



The data is stored in a file called Boneden.txt, which we load into R using the following
command:
> boneden <- read.table('Boneden.txt', header=T)
After loading the data file into R, the data should be stored in the object boneden. The com-
mand head(boneden) can be used to preview the first six rows of the data. The variables
ls1 and ls2 contain the lumbar spine BMD measurements for each of the lighter-smoking
twins and heavier-smoking twins, respectively. If you run the code boneden$ls1, you
will be able to view the lumbar spine BMD measurements for the lighter-smoking twins.
Similarly, boneden$ls2 will show the lumbar spine BMD measurements for the heavier-
smoking twins.
Hopper and Seeman express the differences in BMD for each pair of twins as a percentage
of the mean of the pair. Store these differences (expressed as a percentage of the twin pair
mean) in differences using the following R commands:
> attach(boneden)
> differences <- (ls2 - ls1) / ((ls1 + ls2) / 2)
QUESTION: Why should we express the difference as a percentage? What happens if we
don't?
Let’s look at the distribution of the differences. Both the boxplot and the histogram indicate
a mostly symmetric distribution, which is centred somewhere between -0.1 and 0, i.e. the
heavier-smoking twins have a slightly lower bone mineral density, although the overall
difference as a percentage is not great. There is a reasonable amount of variation, and in
many cases the lighter-smoking twin actually has lower BMD. So it’s not obvious from the
graphs that the difference is significant.

[Figure: boxplot and histogram of the BMD differences between lighter- and heavier-smoking
twins; horizontal axis: percentage difference (about −0.4 to 0.2); vertical axis of the
histogram: frequency.]
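Plots like these can be produced with base R; a sketch, using the differences computed above (titles and labels as in the figure):

> boxplot(differences, horizontal = TRUE, xlab = "Percentage difference",
+         main = "BMD difference between lighter- and heavier-smoking twins")
> hist(differences, xlab = "Percentage difference",
+      main = "BMD difference between lighter- and heavier-smoking twins")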

The command t.test(differences) will carry out the one-sample t-test on the differ-
ences to determine if there is a significant difference. Running this command produces the
following output:

> t.test(differences)

One Sample t-test

data: differences
t = -2.5388, df = 40, p-value = 0.01512
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-0.08889922 -0.01009415
sample estimates:
mean of x
-0.04949668
The above output tells us several things:
• The mean percentage difference is d̄ = −0.049;
• In a t-test of H0 : µD = 0, the test statistic is t = −2.539, which gives a p-value of 0.015
when compared to a t distribution with 40 degrees of freedom;
• A 95% confidence interval for µD is (−0.089, −0.010).
Make sure you can identify these values in the output. Observe that the upper bound
of the 95% confidence interval is less than 0. This means we can be confident that µD <
0. Since the p-value is less than 0.05, we reject the null hypothesis H0 : µD = 0 at the
5% significance level. Therefore, we can conclude that there is a significant difference in
the mean lumbar spine BMD between the heavier- and lighter-smoking twins, with the
heavier-smoking twins having lower mean BMD.
QUESTION: What conclusions can you draw from this study with regard to the true rela-
tionship between smoking and bone density?

Problem Set 6
6.1 The following is supposed to be a random sample from a normal population with unknown
mean µ and known standard deviation σ = 8.
32.1 43.2 38.6 50.8 34.4 34.8 34.5 28.4 44.1 38.7
49.1 41.3 40.3 40.5 40.0 35.3 44.3 33.3 50.8 28.6
42.2 46.3 49.8 34.4 43.9 59.7 44.9 41.9 41.3 38.2
(a) i. Find the 95% confidence interval for µ and hence test the hypothesis µ = 45.
ii. Draw a diagram representing the confidence interval and the null hypothesis value
on the same scale.
iii. Find the p-value for testing µ=45 vs µ≠45. What is your conclusion?
iv. Define the z-statistic used to test µ=45. Use it to specify the values of x̄ for which
µ=45 would be rejected. What is your conclusion for the above sample?
(b) Repeat (a) using a 99.9% confidence interval and significance level 0.001.
6.2 Assume that a person’s haemoglobin concentration (g/100mL) follows a
N(µ=16, σ 2 =6.25) distribution, unless the person has anaemia, in which case the distribution
is N(µ=9, σ 2 =9). On the basis of a haemoglobin reading, an individual undergoing routine
investigation will be diagnosed as anaemic if their reading is below 12.5, and as non-anaemic
otherwise.
(a) Find the probability that an anaemic person is correctly diagnosed.
(b) Find the probability that a non-anaemic person is correctly diagnosed.
(c) In the context of a diagnostic test, relate the probabilities found in (a) and (b) to the con-
cepts of sensitivity, specificity, predictive positive value and predictive negative value, if
applicable.
(d) In the context of a hypothesis-testing problem, relate the probabilities found in (a) and
(b) to the concepts of type I error, type II error and power. State the null and alternative
hypothesis.
6.3 Of a random sample of n = 20 items, it is found that x = 4 had a particular characteristic.
Use the chart in the Statistical Tables or R to find an exact 95% confidence interval for the
population proportion. Repeat the process to complete the following table:

      n     x     p̂     95% CI: (a, b)
     20     4
     50    10
    100    20
    200    40

In testing the null hypothesis p = 0.3, what conclusion would be reached in each case?

6.4 In an examination of a microscopic slide, the number of cells of a particular type are counted
in twenty separate regions of equal (unit) area with the following results:
22 42 31 35 34 47 21 20 34 27
22 26 NA 26 28 37 20 38 23 32
Assume that this represents a random sample from a population that has a distribution that is
approximately normal with mean µ.
(a) Find a 95% confidence interval for µ.
(b) Find the p-value to test the hypothesis µ = 31. What decision do you reach?

6.5 Among 1000 workers in industry A, the expected number of cases of B over a 5-year period
is λ0 = 10 cases, assuming the population norm applies to this group. Suppose that 15
cases are observed.
(a) Does this represent significant evidence that the rate of occurrence of B in industry A is
different from the population norm? i.e. if λ denotes the mean number of cases among
the industry A workers, test the null hypothesis λ = λ0 .
(b) Obtain a 95% confidence interval for SMR = λ/λ0 .
(c) Obtain an estimate and a 95% confidence interval for the incidence rate (of disease out-
come B in industry A), α cases per thousand person-years.

6.6 Of 550 women employed at ABC Toowong Queensland during the past 15 years, eleven con-
tracted breast cancer in that time. After adjusting for a range of covariates (ages and other per-
sonal characteristics, including family history of breast cancer) the expected number of cases
of breast cancer is calculated to be 4.3. Test the hypothesis that there is an excess risk of breast
cancer at ABC Toowong.
The standardised morbidity ratio, SMR = λ/λ0 , where λ denotes the mean number of cases
among the sub-population and λ0 denotes the mean number of cases expected among the sub-
population if it were the same as the general population. Find an approximate 95% confidence
interval for SMR in this situation.
6.7 The diagram below is a typical power curve — with values of µ on the horizontal axis and
probability on the vertical axis:

[Diagram: power curve, with points A, C, B marked from left to right on the horizontal axis.]
Make a copy of this diagram and mark on it:
i. the significance level (i.e. the type I error probability);
ii. the power when µ = A;
iii. the type II error probability when µ = B.
What would happen to the power curve
iv. if n were increased ?
v. if the significance level were increased?
6.8 A new drug, ReChol, is supposed to reduce the serum cholesterol in overweight young indi-
viduals (20–29yo, BMI > 28). In a study to test this claim, a sample of such individuals are
given the drug for a period of six months, and their change in serum cholesterol is recorded (in
mg/100mL). Assume that these differences are normally distributed with ‘known’ standard deviation
of 38.5 mg/100mL.
Using a test with significance level 0.05, how large a sample is required to “detect” a mean
reduction of 10 mg/100mL with probability 0.95?
6.9 Among patients diagnosed with lung cancer, the proportion of patients surviving five years is
10%. As a result of new forms of treatment, it is claimed that this rate has increased. In a recent
study of 180 patients diagnosed with lung cancer, 27 survive five years, so that the estimated
survival proportion is 15%. Is there significant evidence in these data to support the claim?
(a) Define an appropriate parameter and set up the appropriate null hypothesis.
(b) Perform the hypothesis test, using the p-value method, at the 0.05 level.
(c) How large a sample would be required so that the probability of a significant result was
0.95 if the true (population) survival proportion was actually 15%?
6.10 Of 811 individuals employed at HQ centre during the past ten years, 13 contracted disease K.
After adjusting for a range of covariates, the expected number of cases of K is calculated to be
4.6. Test the hypothesis that there is no excess risk of K at the HQ centre.
6.11 In a randomised controlled experiment to examine the effect of a treatment on cholesterol lev-
els, a test comparing the mean cholesterol levels in the treatment group and the control groups

is found to be not significant. What does this indicate?


Your answer may include one or more of the following statements:
The data indicates that: (the treatment has no effect); (the treatment has a small effect); (the data
are compatible with the hypothesis of no effect); (the data do not indicate that the treatment is
having an effect).
6.12 A numerical competency test score was obtained from a random sample of twenty final year
high school students. These students gave a sample mean of 17.4 and sample standard de-
viation 5.1. When this test was standardised ten years ago, the mean level was 20. Test the
hypothesis that these students are from a population with mean 20.
Give the details of your test (i.e. specify H0 , H1 , the test statistic and its distribution under H0 ).
State your conclusion clearly.
6.13 We have a random sample of n observations on a continuous random variable X. We wish to
test the null hypothesis that the population median, m = 20. Explain why a test of m = 20 is
equivalent to a test of p = 0.5, where p = Pr(X < 20).
If 10 of a sample of 11 are less than 20, show that the p-value based on this observation is
approximately 0.01, giving the p-value to three decimal places. What is your conclusion?
6.14 The following represents the body temperature (in degrees Celsius) of 130 healthy adults, or-
dered in increasing magnitude.
35.7 35.8 35.9 35.9 36.0 36.1 36.1 36.2 36.2 36.2 36.2 36.2 36.2
36.3 36.3 36.3 36.3 36.3 36.3 36.4 36.4 36.4 36.4 36.4 36.4 36.5
36.5 36.5 36.6 36.6 36.6 36.6 36.6 36.6 36.6 36.6 36.6 36.6 36.6
36.6 36.7 36.7 36.7 36.7 36.7 36.7 36.7 36.7 36.7 36.7 36.7 36.7
36.7 36.7 36.8 36.8 36.8 36.8 36.8 36.8 36.8 36.8 36.8 36.8 36.8
36.8 36.8 36.8 36.8 36.9 36.9 36.9 36.9 36.9 36.9 36.9 36.9 36.9
36.9 36.9 36.9 37.0 37.0 37.0 37.0 37.0 37.0 37.0 37.0 37.0 37.0
37.1 37.1 37.1 37.1 37.1 37.1 37.1 37.1 37.1 37.1 37.1 37.1 37.1
37.1 37.1 37.1 37.1 37.1 37.2 37.2 37.2 37.2 37.2 37.2 37.2 37.3
37.3 37.3 37.3 37.3 37.3 37.4 37.4 37.4 37.4 37.5 37.7 37.8 38.2
i. For these data, use a sign test to test the hypothesis that the median body temperature is
37.0°C.
ii. What assumptions have you made?
iii. Use R to check the result of the test, and to obtain a 95% confidence interval for the
median.
6.15 If X =d Pn(λ), use the Poisson Statistic-Parameter diagram to obtain:
(a) a rejection region for X to test H0 : λ=15 vs H1 : λ≠15 using a test of nominal significance
level of 0.05;
(b) a 95% confidence interval for λ when x = 15;
(c) A ten-year cohort study involving 1000 individuals was undertaken. There were 15 cases
of disease D observed in 5000 person-years of follow-up. Specify a 95% confidence inter-
val for the incidence rate of D based on these data.
6.16* For a 100(1 − α)% confidence interval, we require n > z²_{1−α/2} σ² / d²;
and for a test of significance level α and power 1 − β, we require n > (z_{1−α/2} + z_{1−β})² σ² / d².
These specifications are incomplete! What is missing? Specify precisely the meaning of d² in
each formula.
i. What is the effect of increasing σ by a factor of k?
ii. What is the effect of increasing d by a factor of k?
iii. Show that the effect of changing the confidence level from 95% to 99% is to increase the
required sample size by a factor of 1.727.
iv. The confidence interval formula is the same as the power formula provided β = 0.5
(which means that z1−β = 0). Draw a diagram to illustrate why this is so.
v. For tests of significance level 0.05, show that the effect of changing β from 0.1 to 0.01, is
to increase the sample size by a factor of 1.748.

Chapter 7

COMPARATIVE INFERENCE

“One should always look for a possible alternative and provide against it.
It is the first rule of (statistical) investigation.” Sherlock Holmes, The Adventure of Black Peter, 1905.

7.1 Introduction

This chapter describes a standard situation where inference is required comparing two
populations. We begin with the case of comparing two population means, µ1 and µ2 .
In a one-sample test of a mean, we compare the mean µ of the population under study with
the mean µ0 of a general population which is considered as known. Hence, we only need
to take a sample from the population under study. It is much more common that the means
of both populations are unknown, and we take a sample from each population to compare
them.
It is common to consider the comparison of the effects of two treatments or interventions
or exposures or attributes. Then the populations to be compared are the hypothetical pop-
ulation with the first (treatment, intervention, exposure, attribute) and the hypothetical
population with the other.
There are two main ways in which treatments can be compared:
1. Paired comparisons — the two treatments are applied to pairs of experimental units
which have been matched so as to be as alike as possible (even the same experimental
unit at different times);
2. Independent samples — the two treatments are applied to separate sets of experi-
mental units randomly selected from the same population.

EXAMPLE 7.1.1: The following data were obtained from each of 15 matched
pairs of individuals. For each pair, one was randomly allocated treatment 1
and the other treatment 2. Investigate the hypothesis that the treatments are
equivalent.

171

pair   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
x1    50  59  45  40  53  52  55  48  45  50  51  56  54  41  55
x2    53  63  48  43  52  50  56  50  49  51  53  57  55  44  58
d     −3  −4  −3  −3   1   2  −1  −2  −4  −1  −2  −1  −1  −3  −3

Because the samples are matched we consider the sample of differences. This
has the effect of removing or at least reducing the effect of variation between
individuals. For the sample of differences we test whether the mean is zero,
and obtain a confidence interval.

Let D = X1 − X2. Then we have n = 15, d̄ = −1.867, sd = 1.727;

a 95% CI for µD:  −1.867 ± 2.145×1.7265/√15 = (−2.82, −0.91);

to test (µD=0) we use  t = (−1.867 − 0)/(1.727/√15) = −4.187  (cf. 2.145),
so we reject (µD=0);  p = 2 Pr(t_{14} < −4.187) = 0.001.

There is a significant difference between treatment effects, with treatment 2 scoring
higher by δ, where δ̂ = 1.9, with 95% confidence interval 0.9 < δ < 2.8.
For the case of paired comparisons, to compare the treatment effects,
    consider the sample of differences,
so the problem reduces to a one-sample problem.
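In R the paired analysis is immediate (a sketch using the data above; cf. Example 7.2.1):

> x1 <- c(50,59,45,40,53,52,55,48,45,50,51,56,54,41,55)
> x2 <- c(53,63,48,43,52,50,56,50,49,51,53,57,55,44,58)
> t.test(x1 - x2)   # equivalently t.test(x1, x2, paired=TRUE): t = -4.19, p = 0.001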
In the example above, boxplots of the samples reveal little apparent difference between x1 and x2.
The difference in treatment effects is masked by the differences between the individuals.
[Figure: boxplots of x1 and x2 on a common scale (about 40 to 65), and a boxplot of the
differences x1 − x2 (about −15 to 10).]

This reflects a general result: if the samples are paired then ignoring the pairing results in weaker
tests and conclusions. Also, in an experimental situation, producing paired data is more efficient for
the purposes of inference.
For the case of independent samples,
consider the difference between samples
i.e. we must compare sample characteristics: sample means, or sample proportions, or
sample medians, etc. This is a two sample problem which requires new techniques.

7.2 Paired samples


In paired-sample data, each datum in the first sample is associated with a unique datum in the
second sample, and vice versa. The two members of a pair may belong to the same subject, or
to a pair of subjects matched by design.

DEFINITION 7.2.1.
1. Self-pairing: Measurements are taken on a single subject at two distinct points
in time (“before and after” a treatment or a cross-over design) or at two different
places (e.g. left and right arms for a skin treatment).
2. Matched pair: Two individuals are matched to be alike by certain characteristics
under control (e.g. age, sex, severity of illness, etc). Then one of a pair is assigned
(at random) to one treatment, the other to the other treatment.

Pairing is used to eliminate extraneous sources of variability in the response variable. This
makes the comparison more precise.
Let µ1 and µ2 be the means of the two populations to be compared. The method of analysis
is to consider the difference between each pair.
Let X1i and X2i denote the observations for the ith pair (from population 1 and population
2, respectively), and form the differences
Di = X1i − X2i ;
i.e. D1 , D2 , . . . , Dn are the differences between the elements in each pair.
Let µD denote the mean difference. Then µD = µ1 − µ2 . Consider the population of differ-
ences, which has mean µD ; we have a sample {D1 , . . . , Dn } from this population. We can
apply the one-sample t procedure to this sample of differences to construct a confidence
interval for µD or to do hypothesis testing: usually we want to test the null hypothesis H0 :
µD = 0.

EXAMPLE 7.2.1: One method for assessing the effectiveness of a drug is to note
its concentration in blood and/or urine samples at certain periods of time after
giving the drug. Suppose we wish to compare the concentrations of two types
of aspirin (types A and B) in urine specimens taken from the same person, 1
hour after he or she has taken the drug. Hence, a specific dosage of either type
A or type B aspirin is given at one time and the 1-hour urine concentration is
measured. One week later, after the first aspirin has presumably been cleared
from the system, the same dosage of the other aspirin is given to the same per-
son and the 1-hour urine concentration is noted. For each person, which drug
is given first is decided randomly. This experiment is performed on 10 people;
the results are below.

Person 1 2 3 4 5 6 7 8 9 10
Aspirin A 15 26 13 28 17 20 7 36 12 18
Aspirin B 13 20 10 21 17 22 5 30 7 11

Construct a 95% confidence interval for the mean difference in the concentra-
tions of aspirin A and aspirin B in urine specimens 1 hour after the patient has
taken the drug. Are the two mean concentrations significantly different?

differences, d = {2, 6, 3, 7, 0, −2, 2, 6, 5, 7};  n = 10, d̄ = 3.6, sd = 3.098.

95% CI for µD:  3.6 ± 2.262×3.098/√10 = (1.38, 5.82)

test µD = 0:  t = (3.6 − 0)/(3.098/√10) = 3.67  cf. c0.975(t_9) = 2.262  (p = 0.005)

We thus conclude that the two mean concentrations are significantly different
(aspirin A has a higher concentration).

In R:

> A <- c(15, 26, 13, 28, 17, 20, 7, 36, 12, 18) # data
> B <- c(13, 20, 10, 21, 17, 22, 5, 30, 7, 11)
> t.test(A, B, paired=TRUE) # paired t-test

Paired t-test

data: A and B
t = 3.6742, df = 9, p-value = 0.005121
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
1.383548 5.816452
sample estimates:
mean of the differences
3.6

7.3 Independent samples

A paired analysis is powerful if the data are taken from a suitably designed study. How-
ever, sometimes pairing is difficult or impossible, or simply not done. Then we need to
model the two samples as being taken independently from the two populations to be com-
pared. This requires new methods of comparison.
We assume that
• the two underlying distributions are normal:
      X1 =d N(µ1, σ1²) and X2 =d N(µ2, σ2²);
• the two samples are independent random samples;
(Two samples are independent, if the selection of individuals that make up one sam-
ple does not influence the selection of those in the other sample.)
We use the following notation:

                         sample size    sample mean    sample standard deviation
    sample 1 (on X1)     n1             x̄1             s1
    sample 2 (on X2)     n2             x̄2             s2

The null hypothesis is usually H0 : µ1=µ2. The corresponding alternative hypothesis is
H1 : µ1≠µ2. More generally, we may wish to estimate the difference µ1−µ2 (using a point
estimate or an interval estimate).
Estimation of µ1 − µ2
An obvious estimator is X̄1 − X̄2 .
This estimator is unbiased, since E(X̄1 −X̄2 ) = E(X̄1 ) − E(X̄2 ) = µ1 − µ2 .

Its variance is given by  var(X̄1−X̄2) = var(X̄1) + var(X̄2) = σ1²/n1 + σ2²/n2
(since X̄1 and X̄2 are independent).

Further, since both populations are assumed to be normally distributed, X̄1 and X̄2 are
normally distributed, and hence so is X̄1−X̄2, i.e.

    X̄1−X̄2 =d N(µ1−µ2, σ1²/n1 + σ2²/n2).
The distribution of the estimator (and in particular, its mean and standard deviation) is
what we need to construct confidence intervals and perform hypothesis tests.

7.3.1 Variances known

If we know the variances of the individual populations, σ1² and σ2², then we know the stan-
dard deviation sd(X̄1 − X̄2). Standardisation of the estimator, using the distribution above,
gives

    Z = ((X̄1 − X̄2) − (µ1 − µ2)) / √(σ1²/n1 + σ2²/n2) =d N(0, 1).

This can be used for constructing a confidence interval for µ1 − µ2, or for testing a hypoth-
esis about µ1 − µ2, if σ1² and σ2² are known.
For constructing a confidence interval, we have

    Pr(−1.96 < Z < 1.96) = 0.95
    Pr( −1.96 < ((X̄1 − X̄2) − (µ1 − µ2)) / √(σ1²/n1 + σ2²/n2) < 1.96 ) = 0.95

and rearranging this gives the 95% confidence interval for µ1 − µ2:

    95% CI for (µ1−µ2):  (x̄1−x̄2) ± 1.96 √(σ1²/n1 + σ2²/n2).
Likewise, to test the hypothesis H0 : µ1 = µ2, we use the test statistic

    Z = (X̄1 − X̄2) / √(σ1²/n1 + σ2²/n2).

If H0 is true, this statistic has a N(0, 1) distribution, and we reject H0 if its observed value z
is larger than 1.96 or smaller than −1.96. Alternatively, we can calculate the p-value by

    p = 2 Pr(Z > z) if z > 0,  or  p = 2 Pr(Z < z) if z < 0,  where Z =d N(0, 1).

These procedures are the same as for the one-sample case; the only difference is the statistic
that we are testing against the standard normal distribution.

EXAMPLE 7.3.1:  n1 = 25, x̄1 = 11.43 (σ1² = 4.0);  n2 = 10, x̄2 = 9.74 (σ2² = 4.0).

We have X̄1−X̄2 =d N(µ1−µ2, 0.56), since var(X̄1−X̄2) = 4.0/25 + 4.0/10 = 0.56;

and so a 95% CI for µ1−µ2 is 1.69 ± 1.96√0.56, i.e. 0.22 < µ1−µ2 < 3.16.

To test µ1=µ2 (i.e. µ1−µ2 = 0), we use

    z = (11.43−9.74)/√0.56 = 2.258, for which p = 2 Pr(Z > 2.258) = 0.024;

hence we reject H0 (using significance level 0.05), since p < 0.05.
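A sketch of the same computation in R:

> est <- 11.43 - 9.74
> se <- sqrt(4.0/25 + 4.0/10)
> 2*(1 - pnorm(est/se))        # p-value: 0.024
> est + c(-1.96, 1.96)*se      # 95% CI: (0.22, 3.16)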


These results can be applied more widely, as an approximation to populations that are not
normally distributed, using the central limit theorem:

    X̄1 ≈d N(µ1, σ1²/n1), provided n1 is not small;
    X̄2 ≈d N(µ2, σ2²/n2), provided n2 is not small;

and just how small is too small depends on the underlying population. For a population
that is reasonably symmetric n>10 is fine; if it is skew then n>25, say.

EXAMPLE 7.3.2: It is reported that x̄1 = 15.3 and x̄2 = 12.7 from samples of
n1 = 10 and n2 = 15. In the absence of any other information, we suppose that
σ1 = σ2 = 3, perhaps on the basis of past information or values from similar
data sets. So,

    X̄1−X̄2 ≈d N(µ1−µ2, 1.5), since var(X̄1−X̄2) = 3²(1/10 + 1/15) = 1.5.

    approx 95% CI for µ1−µ2:  2.6 ± 1.96√1.5 = (0.2, 5.0);

    approx z-test:  z = (2.6−0)/√1.5 = 2.12;  p ≈ 2 Pr(N > 2.12) = 0.034.

Sample size determination

Usually, the sample will give us information concerning the variances. But in some cases,
we don’t even have that: in planning for example. Then we must make a plausible estimate
(educated guess) based on similar data and other evidence.

EXAMPLE 7.3.3: We wish to test a treatment using a controlled experiment


(treatment vs control; or treatment vs standard). Suppose it is desired to esti-
mate the difference in means so that we obtain a 95% confidence interval with
margin of error 1 (i.e. est ± 1). How big a sample is required?

This sort of thing is often required for budgeting; or in applying for grants for research:
if there is a difference of at least 1 unit then we would like to be reasonably (say 95%)
sure of finding it.

(i) We choose a balanced experiment: with n1 = n2 = n.


(ii) We assume (on the basis of similar trials in the past, or pilot samples, or
theory, or intelligent guess-work) that σ1 = σ2 = 5.

It is usually the case, at least in situations like this one, that σ1 = σ2.
However, if we had cause to believe that σ1 > σ2 say, then a balanced experiment is not
optimal. If σ1 > σ2 it would be better to assign n1 > n2.
How? . . . so that ni ∝ σi, i.e. n1 = σ1/(σ1+σ2) × N and n2 = σ2/(σ1+σ2) × N.
Let V = var(X̄1−X̄2) = σ1²/n1 + σ2²/(N−n1); then dV/dn1 = 0 ⇒ n1/n2 = σ1/σ2.
(Note that this result implies that if σ1=σ2, the experiment should be balanced.)

    var(X̄1−X̄2) = σ1²/n1 + σ2²/n2 ≈ 25(1/n + 1/n) = 50/n.

Thus, the approx 95% CI is (x̄1−x̄2) ± 1.96√(50/n).

So, we require  1.96√(50/n) = 1  ⇒  √n = 1.96√50  ⇒  n ≈ 192,

i.e. we need about 192 in each arm of the trial to achieve this level of accuracy.

Another option in planning is to specify the power of the test of µ1=µ2 for a
specified difference. For example: find the sample size required in order that
the power is 0.9 when µ2−µ1 = 2.5, using a test of significance level 0.05.

Let Z = (X̄2−X̄1)/√(50/n), so that Z =d N(0, 1) when H0 is true. When µ2−µ1 = 2.5,
E(Z) = 2.5/√(50/n), and in order that we have a significance level of 0.05 and power
0.9, we require

    2.5/√(50/n) = 1.96 + 1.2816  ⇒  n = 84.1

Therefore we need 85 in each arm of the trial to achieve the specified power.
It is not often the case that the variances are known, but this result can be useful as a large
sample approximation. Generalising the rules obtained in the above example, we get the
following sample size rules.
Assuming populations with equal variances, σ², we require a sample of at least n from each
population, where

    for a 100(1−α)% CI of half-width d:    n > 2 z²_{1−α/2} σ² / d²;

    to test H0 : µ1=µ2 with significance level α
    and with power 1−β when µ1 = µ2+d:     n > 2 (z_{1−α/2} + z_{1−β})² σ² / d².

EXAMPLE 7.3.4: (. . . continued)

For the above example, applying the formulae gives:

    for a 95% CI of half-width 1:  n > 2×1.96²×5²/1² = 192.08;

    for a test with α = 0.05 and power 0.90 for difference 2.5:
        n > 2(1.96+1.2816)²×5²/2.5² = 84.06.
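Both rules in R (a sketch):

> sigma <- 5
> 2*(qnorm(0.975)*sigma)^2/1^2                     # CI half-width 1: 192.08
> 2*((qnorm(0.975) + qnorm(0.90))*sigma)^2/2.5^2   # power 0.90 at d = 2.5: 84.06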

7.3.2 Variances unknown but equal

In most cases of application of inference on difference of means, we won’t know the true
standard deviations of the populations. However, we may reasonably expect that σ12 = σ22 ,
since we are comparing similar measurements (treatments vs control, intervention A vs
intervention B). In these situations, any change will be a (relatively small) shift in the
mean. So, this is our standard assumption: we assume the variances are equal.
If σ1 = σ2 = σ, then we have

(X̄1 − X̄2 ) − (µ1 − µ2 ) d


q = N(0, 1)
σ n11 + n12

In the one-sample case:

    (X̄ − µ)/(σ√(1/n)) =d N  leads to  (X̄ − µ)/(S√(1/n)) =d t_{n−1}.

So, by analogy with the one sample case, we might hope that replacement of σ by S would
result in a t distribution.

But what S? . . . and what t?


If µ1 = µ2 = µ, the best way to combine x̄1 = Σx1/n1 and x̄2 = Σx2/n2 to produce an estimate
of µ is

    x̄ = (Σx1 + Σx2)/(n1 + n2) = (n1 x̄1 + n2 x̄2)/(n1 + n2)

i.e. a weighted average of x̄1 and x̄2, with weights equal to the sample sizes.

Similarly, if σ1² = σ2² = σ², the best way to combine s1² = Σ(x1 − x̄1)²/(n1 − 1) and
s2² = Σ(x2 − x̄2)²/(n2 − 1) to produce an estimate of σ² is

    s² = (Σ(x1 − x̄1)² + Σ(x2 − x̄2)²) / ((n1 − 1) + (n2 − 1)) = ((n1−1)s1² + (n2−1)s2²) / (n1 + n2 − 2)

i.e. a weighted average of s1² and s2², with weights equal to the degrees of freedom. The
degrees of freedom of the combined estimate, s², is the sum of the degrees of freedom, i.e.
n1 + n2 − 2.
Replacing σ by its estimate S gives us the standard error

    se(X̄1−X̄2) = S√(1/n1 + 1/n2).

This gives us a result analogous to the one-sample results:

    ((X̄1−X̄2) − (µ1−µ2)) / (σ√(1/n1 + 1/n2)) =d N   and
    ((X̄1−X̄2) − (µ1−µ2)) / (S√(1/n1 + 1/n2)) =d t_{n1+n2−2}.

These are used for inference on µ1−µ2 when the variances are unknown but assumed equal.

• To find a 95% confidence interval for µ1−µ2, we use:

    Pr( c0.025(t_{n1+n2−2}) < ((X̄1 − X̄2) − (µ1 − µ2)) / (S√(1/n1 + 1/n2)) < c0.975(t_{n1+n2−2}) ) = 0.95.

Rearranging this to make µ1−µ2 the subject leads to:

    95% CI for µ1−µ2:  (x̄1−x̄2) ± c0.975(t_{n1+n2−2}) s√(1/n1 + 1/n2).

• To specify a test of µ1=µ2, we define the test statistic

    T = (X̄1 − X̄2) / (S√(1/n1 + 1/n2)).

Under the null hypothesis H0 (µ1=µ2), this statistic has the distribution T =d t_{n1+n2−2}.
Therefore we can either compare the observed value t against a critical value c0.975(t_{n1+n2−2}),
or calculate the p-value as twice the tail probability of a t_{n1+n2−2} distribution.

EXAMPLE 7.3.5: Random samples of 4 and 16 observations from normally dis-
tributed populations with equal variances give the following results:

    n1 = 4,  x̄1 = 24.6,  s1² = 4.5
    n2 = 16, x̄2 = 21.4,  s2² = 5.1

Show that the pooled variance estimate is s2 = 5.0. Hence obtain a 95% confi-
dence interval for µ1 −µ2 . Test the null hypothesis (µ1 =µ2 ) and give the p-value.

    s² = (3×4.5 + 15×5.1)/18 = 90/18 = 5.0

    x̄1 − x̄2 = 3.2;  se(x̄1 − x̄2) = √(5(1/4 + 1/16)) = √(25/16) = 5/4.

    95% CI for µ1−µ2:  (3.2 ± 2.101×1.25) = (0.6, 5.8).

    t = (x̄1 − x̄2)/se(x̄1 − x̄2) = 3.2/(5/4) = 2.56, so p = 2 Pr(t_{18} > 2.56) ≈ 0.02;

so we reject (µ1=µ2), since p < 0.05.

EXAMPLE 7.3.6:  n1 = 25, x̄1 = 11.43, s1² = 3.79;  n2 = 10, x̄2 = 9.74, s2² = 2.21.

(i) Find a 95% confidence interval for µ1−µ2.
(ii) Test the hypothesis H0 : µ1=µ2 vs H1 : µ1≠µ2.

    x̄1−x̄2 = 1.69;  s² = (24×3.79 + 9×2.21)/33 = 3.359;
    se(x̄1−x̄2) = √(3.359(1/25 + 1/10)) = 0.686.

(i) 95% CI for µ1−µ2:  (1.69 ± 2.035×0.686) = (0.29, 3.09).

(ii) t = (x̄1−x̄2)/se(x̄1−x̄2) = 1.69/0.686 = 2.464, so p = 2 Pr(t_{33} > 2.464) = 0.019;
so we reject H0, since p < 0.05.
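A sketch of the pooled calculation in R, from the summary statistics:

> n1 <- 25; xbar1 <- 11.43; s1sq <- 3.79
> n2 <- 10; xbar2 <- 9.74;  s2sq <- 2.21
> s2 <- ((n1-1)*s1sq + (n2-1)*s2sq)/(n1 + n2 - 2)        # pooled variance: 3.359
> se <- sqrt(s2*(1/n1 + 1/n2))                           # 0.686
> t <- (xbar1 - xbar2)/se; t                             # 2.464
> 2*(1 - pt(t, df = n1 + n2 - 2))                        # p-value: 0.019
> (xbar1 - xbar2) + qt(c(0.025, 0.975), n1 + n2 - 2)*se  # 95% CI: (0.29, 3.09)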


Inference on the means using the t-distribution is based on the following assumptions:
1. samples random (independent, identically distributed random variables)
2. samples independent
3. populations normally distributed
4. population variances equal
The first two are properties of the sampling process; they are dependent on the sampling protocol,
though some checking might be possible based on the data obtained. The last two can be checked
using the sample data: using normal plots and by comparing the sample variances.

EXAMPLE 7.3.7: Consider the problem of familial aggregation of cholesterol


levels. Suppose the cholesterol levels are assessed in 40 11–15 year-old boys,
whose fathers have died from heart disease and it is found that their mean is
207.3 mg/dL with standard deviation 25.6. Another group of 60 boys whose
fathers do not have heart disease and are from the same census tract also have
their cholesterol levels measured. This group has mean 193.4 mg/dL with stan-
dard deviation 17.3.
(a) What are the underlying populations here?
What is the research question of interest?
(b) Find a point estimate and a 95% confidence interval for the difference be-
tween the mean cholesterol levels of the two populations, assuming the
variances of the two populations are equal. 13.90; (5.39, 22.41).
(c) Is there any evidence that their mean cholesterol levels are different? Ex-
plain.
(t98 = 3.24, p = 0.002).

7.3.3 Variances unknown and unequal

How do we know if the variances of two populations are equal or not, if we don’t know
them? There are formal tests of the hypothesis H0 : σ1 = σ2 , but we do not study them
here. A good rule of thumb is that we can assume the variances are equal if the larger of
the two sample standard deviations is less than twice the smaller, i.e. if
    1/2 ≤ s1/s2 ≤ 2.
If this happens, we can use the tests in the previous section. But sometimes, it doesn’t, and
then those tests are not applicable. However, we can still replace σ1 and σ2 individually by
S1 and S2 , to obtain the standard error for our estimator:
    se(X̄1−X̄2) = √(S1²/n1 + S2²/n2).

Fortunately, it turns out that changing the standard deviation to the standard error still
results in a t distribution, albeit a slightly more complicated one:

    ((X̄1 − X̄2) − (µ1 − µ2)) / √(S1²/n1 + S2²/n2) ≈d t_k,

where k is given by

    1/k = β²/(n1 − 1) + (1−β)²/(n2 − 1),  where β = (s1²/n1) / (s1²/n1 + s2²/n2).
The value of k is such that: min(n1 − 1, n2 − 1) ≤ k ≤ n1 + n2 − 2. For hand calculation, we
take the safe approach and use k = min(n1 − 1, n2 − 1).
This distribution can be manipulated as before to give us the results we need. This leads to
the confidence interval

    95% CI for µ1−µ2:  (x̄1−x̄2) ± c0.975(t_k) √(s1²/n1 + s2²/n2).

To test H0 : µ1 = µ2, we use the test statistic

    T = (X̄1 − X̄2) / √(S1²/n1 + S2²/n2) =d t_k under H0.

This is compared to the appropriate critical value, or a p-value is computed using a t_k tail
probability.

EXERCISE. Consider the cholesterol level example given above. Use the unpooled ap-
proximate t-procedure to test the hypothesis that the mean cholesterol levels for the two
populations are different. (β = 0.7666, k = 62.53; t = 3.01, p = 0.004.)
(Using the 'safe' value k = 39 gives p = 0.005;
c0.975(t_39) = 2.023 cf. c0.975(t_62) = 1.999.)
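A sketch of this unpooled (Welch) calculation in R (pt() accepts non-integer degrees of freedom):

> n1 <- 40; s1 <- 25.6; xbar1 <- 207.3
> n2 <- 60; s2 <- 17.3; xbar2 <- 193.4
> v1 <- s1^2/n1; v2 <- s2^2/n2
> beta <- v1/(v1 + v2); beta                           # 0.7666
> k <- 1/(beta^2/(n1 - 1) + (1 - beta)^2/(n2 - 1)); k  # 62.53
> t <- (xbar1 - xbar2)/sqrt(v1 + v2); t                # 3.01
> 2*(1 - pt(t, df = k))                                # p-value: 0.004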

In R: t.test(x,y) compares samples x and y. There is an option paired=TRUE for the
paired t-test; the default is unpaired. Another option, var.equal=TRUE, specifies that the
variances are assumed equal, in which case the pooled variance estimate is used.
Consider the following simulated example, with two independent samples of size n = 10
each, generated from N(3/2, 1) and N(1/2, 1) respectively.

> x <- rnorm(10, mean=1.5, sd=1)  # generate first sample
> y <- rnorm(10, mean=0.5, sd=1)  # generate second sample
> t.test(x,y, var.equal=TRUE)     # specify equal variances

Two Sample t-test

data: x and y
t = 3.2675, df = 18, p-value = 0.004277
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.4908838 2.2589641
sample estimates:
mean of x mean of y
1.4879612 0.1130373

> t.test(x,y) # unequal variances (default option)

Welch Two Sample t-test

data: x and y
t = 3.2675, df = 16.68, p-value = 0.004629
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.4858426 2.2640053
sample estimates:
mean of x mean of y
1.4879612 0.1130373

7.4 Case study: Lead exposure

Landrigan et al. (1975)1 conducted a study examining the effects of exposure to lead on
the psychological and neurological well-being of children. The children in the study were
aged between 3 years 9 months and 15 years 11 months, and had lived within 6.6 km of
a large, lead-emitting ore smelter in El Paso, Texas. The children were divided into two
groups: the control group consisted of 78 children with blood-lead levels of less than 40
µg/100mL in 1972 and 1973, and the lead-absorption group consisted of 46 children with
blood-lead levels of more than 40 µg/100mL in either 1972 or 1973. Each child completed
various neurological and psychological assessments. We are interested in one assessment
in particular: the number of taps on a metal plate that were recorded in a 10 second interval
while the child’s hand and wrist were held above the table (finger-wrist tapping). This test
was used to measure neurological function, specifically wrist flexor and extensor muscle
function, and was performed only by children over 5 years old.
Q UESTION : Is this an experiment or an observational study?
We will use an independent samples t-test to test whether there is a difference between the
mean finger-wrist tapping scores of children with low blood-lead levels and children with
high blood-lead levels. Let µ1 denote the mean finger-wrist tapping score of children with
blood-lead levels less than 40 µg/100mL, and let µ2 denote the mean finger-wrist tapping
score of children with blood-lead levels of more than 40 µg/100mL. The null hypothesis is
H0 : µ1 = µ2 (or µ1 −µ2 = 0); there is no difference between the two groups. The alternative
1 Landrigan, P. J., Whitworth, R. H., Baloh, R. W., Staehling, N. W., Barthel, W. F. and Rosenblum, B. F. (1975).

Neuropsychological dysfunction in children with chronic low-level lead absorption. The Lancet, 1, 708 – 715.
hypothesis is H1: µ1 ≠ µ2 (or µ1 − µ2 ≠ 0). The variance of the finger-wrist tapping scores for each group is unknown beforehand.
The data from this study is available in a file called Lead.txt, which we load into R using
the command:
> lead <- read.table('Lead.txt', header=T)
Once this data is loaded into R, the command head(lead) will display the first 6 rows
of the data. The following commands are used to find the finger-wrist tapping score for
each of the two groups. Note that, because the finger-wrist tapping was performed only by
children aged over 5 years old, there are missing values in the data. These missing values
will be removed.
# Sort individuals into groups
> grp1 <- which(lead$lead_grp == 1) # control group
> grp2 <- which(lead$lead_grp == 2 | lead$lead_grp == 3) # treatment group

# Identify the maximum finger-wrist tapping score for each child


> fwt.grp1 <- lead$maxfwt[grp1]
> fwt.grp2 <- lead$maxfwt[grp2]

# Remove missing values


> fwt.grp1 <- fwt.grp1[which(fwt.grp1 != 99.0)]
> fwt.grp2 <- fwt.grp2[which(fwt.grp2 != 99.0)]
Now fwt.grp1 contains the finger-wrist tapping scores for the children in the control
group and fwt.grp2 contains the finger-wrist tapping scores for the children in the lead-
absorption group.
The boxplots below compare the finger-wrist tapping scores for the two groups. We can
see that there appears to be a slight difference, with the control group (group 1) having a
larger score overall. But there is a fair amount of variation, together with some outliers in
either direction, so it is not immediately clear if the difference is significant.

[Boxplot: finger-wrist tapping scores for the two lead exposure groups (Group 1 = control, Group 2 = lead absorption), with scores ranging from about 20 to 80 and outliers in both groups.]

Should we use a test that assumes equal variances or not? From the boxplots, the groups
look like they have similar spread. Let’s look at the respective standard deviations:
> sd(fwt.grp1)
[1] 12.05658
> sd(fwt.grp2)
[1] 13.15582
> sd(fwt.grp2)/sd(fwt.grp1)
[1] 1.091174
The standard deviations are very close to each other, so it appears that an equal-variance
test is reasonable.
We now perform a 2-sample t-test using t.test() to determine if there is a significant
difference. The output is given below.
> t.test(fwt.grp1, fwt.grp2, var.equal=TRUE)

Two Sample t-test

data: fwt.grp1 and fwt.grp2


t = 2.6772, df = 97, p-value = 0.008718
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
1.812977 12.204880
sample estimates:
mean of x mean of y
54.43750 47.42857
This output tells us that:
• The mean number of finger-wrist taps for the control group is x̄1 = 54.44, and the mean number of finger-wrist taps for the lead-absorption group is x̄2 = 47.43;
• In a t-test of H0 : µ1 = µ2 , the test statistic is t = 2.677, the degrees of freedom is 97,
and the p-value is 0.0087;
• A 95% confidence interval for µ1 − µ2 is (1.81, 12.20).
Make sure you can identify these values in the output. Since the p-value is less than 0.05,
and the confidence interval does not include 0, we reject the null hypothesis H0 : µ1 = µ2 at
the 5% level of significance. We conclude that there is a significant difference between the
mean finger-wrist tapping score of children with blood-lead levels of less than 40µg/100mL
and children with blood-lead levels of more than 40µg/100mL, and that the children with
high lead exposure perform worse on the test.
E XERCISE . Try repeating the above analysis without assuming equal variances. What changes
and why?
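HINT. In R, the unequal-variance (Welch) analysis is the default, so it is a one-line change:

> t.test(fwt.grp1, fwt.grp2)   # var.equal defaults to FALSE (Welch test)

The sample means are unchanged; only the standard error, the degrees of freedom, and hence the p-value and confidence interval change (here only slightly, since the two sample standard deviations are similar).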

7.5 Comparing two proportions


We are often interested in comparing two populations with respect to the presence of an
attribute of interest. Let p1 and p2 be the proportions of the two populations that have the
attribute. We want to compare p1 with p2 . We consider the large sample results only, which
are equivalent to (approximate) z-tests.
The general principle remains the same: calculate your estimator (here p̂1 −p̂2 ), then its
standard error (se(p̂1 −p̂2 )). Normalise your estimator by its standard error, then compare
this to a standard normal distribution.

EXAMPLE 7.5.1: (Aspirin trial)


A study was undertaken by the Physicians’ Health Study Research Group at
Harvard Medical School to test whether aspirin taken regularly reduces mor-
tality from cardiovascular disease. Every other day, physicians participating in
the study took either one aspirin tablet or a placebo. The study was blind —
those in the study did not know which they were taking. Over the course of the
study, the number of heart attacks were recorded for both groups. The results
were:

                     heart attacks (fatal plus non-fatal)    subjects
    aspirin group                104                          11037
    placebo group                189                          11034

Is taking aspirin effective in reducing the risk of heart attack?


Notation:

                population   sample   sample       sample
                proportion   size     frequency    proportion
    Sample 1    p1           n1       X1, x1       P̂1 = X1/n1,  p̂1 = x1/n1
    Sample 2    p2           n2       X2, x2       P̂2 = X2/n2,  p̂2 = x2/n2

We are interested in comparing the population proportions p1 and p2: i.e. estimating p1 − p2 and testing p1 = p2. Here

    X1 ~ Bi(n1, p1)  and  X2 ~ Bi(n2, p2).

For large samples, inference is based on the results

    P̂1 ≈ N(p1, p1q1/n1)  and  P̂2 ≈ N(p2, p2q2/n2).

EXAMPLE 7.5.2: (males & females)


A random sample of n1 =100 females yielded x1 =54 with attribute A; and, of a
random sample of n2 =60 males, x2 =27 had attribute A. Let p1 and p2 denote
the proportion of females and males with attribute A.

• Find a 95% confidence interval for p1 −p2 .

n1 = 100, x1 = 54; n2 = 60, x2 = 27.

    p̂1 = 54/100 = 0.54, p̂2 = 27/60 = 0.45;  p̂1 − p̂2 = 0.09.

    se(p̂1 − p̂2) = √(p̂1(1−p̂1)/n1 + p̂2(1−p̂2)/n2) = √(0.54×0.46/100 + 0.45×0.55/60) = 0.0813.

    95% CI for p1 − p2:  0.09 ± 1.96×0.0813 = (−0.07, 0.25)   [est ± "2" se]

• Test the null hypothesis H0 : p1 =p2 .

To test H0: p1 = p2, we use a z-test based on

    z = (est − 0)/se0 ≷ "2",

where se = se(p̂1 − p̂2) = √(p̂1(1−p̂1)/n1 + p̂2(1−p̂2)/n2) . . . but what is se0?

se0 is the standard error, estimated assuming H0 to be true. se0 is better here, because we want an accurate approximation to the distribution of Z when H0 is true, since that is how we work out the p-value (or the critical region).
If H0 is true (p1 = p2 = p) then

    sd(P̂1 − P̂2) = √(p(1−p)/n1 + p(1−p)/n2) = √(p(1−p)(1/n1 + 1/n2)),

and the best estimate of p is p̂ = (x1 + x2)/(n1 + n2) = 81/160 = 0.506, so

    se0 = √(0.506×0.494×(1/100 + 1/60)) = 0.0816.

Here there is not much difference between se and se0: they will be quite close if p̂1 and p̂2 are not very different.

Therefore, the test statistic is

    z = (est − 0)/se0 = 0.09/0.0816 = 1.102,  p = 2 Pr(Z > 1.102) = 0.270.

Thus we do not reject H0 (since |z| < 1.96, or p > 0.05). There is no significant evidence in these data to indicate that p1 ≠ p2.

In general:

    95% CI:  est ± "2" se = p̂1 − p̂2 ± 1.96 √(p̂1(1−p̂1)/n1 + p̂2(1−p̂2)/n2);

    test:  z = (est − 0)/se0 = (p̂1 − p̂2) / √(p̂(1−p̂)(1/n1 + 1/n2)),  where p̂ = (x1 + x2)/(n1 + n2).

Note that p̂ = (n1p̂1 + n2p̂2)/(n1 + n2), i.e. a weighted average of the p̂i (like the pooled average for the mean).

EXAMPLE 7.5.3: (males & females, continued: contingency table)

The data in this situation can be presented in the form:

         A    A′
    F    54   46   100   (p̂1 = 0.54)
    M    27   33    60   (p̂2 = 0.45)
         81   79   160   (p̂ = 0.506)

This table is like a probability table: an "observed" probability table, or an "estimated" probability table on dividing through by the total.

Such a table is called a contingency table and is examined in more detail in Section 7.8. It can be generalised to allow more rows (corresponding to more groups, or populations) and more columns (corresponding to a categorisation of the attribute).
Summary:

                sample 1                   sample 2                   difference
    estimate    p̂1 = x1/n1                p̂2 = x2/n2                est = p̂1 − p̂2
    variance    var(P̂1) = p1(1−p1)/n1     var(P̂2) = p2(1−p2)/n2     se = √(p̂1(1−p̂1)/n1 + p̂2(1−p̂2)/n2)
    pooled      p̂ = (x1 + x2)/(n1 + n2)                              se0 = √(p̂(1−p̂)(1/n1 + 1/n2))

    CI: est ± "2" se        HT: z = (est − 0)/se0
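The summary translates directly into a few lines of R. The function below is our own sketch (prop.z.test is not a built-in); base R's prop.test(c(x1,x2), c(n1,n2), correct=FALSE) carries out the equivalent χ²-version of the same test (see Section 7.8).

> prop.z.test <- function(x1, n1, x2, n2) {
+   p1 <- x1/n1; p2 <- x2/n2
+   se  <- sqrt(p1*(1-p1)/n1 + p2*(1-p2)/n2)   # unpooled se, for the CI
+   p   <- (x1 + x2)/(n1 + n2)                 # pooled estimate under H0
+   se0 <- sqrt(p*(1-p)*(1/n1 + 1/n2))         # se assuming H0 is true
+   z   <- (p1 - p2)/se0
+   c(est = p1 - p2, lower = p1 - p2 - 1.96*se,
+     upper = p1 - p2 + 1.96*se, z = z, p.value = 2*pnorm(-abs(z)))
+ }
> prop.z.test(54, 100, 27, 60)   # est = 0.09, CI (-0.07, 0.25), z = 1.102, p = 0.270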

EXAMPLE 7.5.4: (Aspirin trial, continued)


Is aspirin effective in reducing the incidence of heart attacks?
Find a 95% confidence interval for the difference in the proportions of heart
attacks for the two treatment groups.

p̂1 = 104/11037 = 0.009423, p̂2 = 189/11034 = 0.017129;  p̂1 − p̂2 = −0.007706.

p̂ = (104 + 189)/(11037 + 11034) = 293/22071 = 0.013275.

se0 = √(0.013275×0.986725×(1/11037 + 1/11034)) = 0.001541.

∴ est = −0.0077, se0 = 0.0015.

z = est/se0 = −0.007706/0.001541 = −5.001,  p = 0.000.

95% CI: −0.007706 ± 1.96×0.001541 = (−0.011, −0.005).

E XERCISE . (Vasectomy and prostate cancer)


Prostate cancer occurred in 69 of 21,300 men who had not had a vasectomy; and in 113 of
22,000 men who had had a vasectomy.
(i) Do these data provide sufficient evidence to conclude that men who have had a va-
sectomy are at greater risk of having prostate cancer?
(ii) Is this a designed experiment or an observational study?
(iii) Is it reasonable to conclude that having a vasectomy increases the risk of prostate
cancer?

7.6 Comparing two rates


We are often interested in comparing two (sub-)populations with respect to rates of disease.
If the rates are α1 and α2 , we are interested in estimating α1 −α2 and/or testing α1 =α2 .
In this section, we consider the large sample results only, which are equivalent to (approx-
imate) z-tests.
Notation:

                population   person-   number      sample
                rate         years     of cases    rate
    Sample 1    α1           t1        X1, x1      Â1 = X1/t1,  α̂1 = x1/t1
    Sample 2    α2           t2        X2, x2      Â2 = X2/t2,  α̂2 = x2/t2

For such data:

    X1 ~ Pn(α1 t1)  and  X2 ~ Pn(α2 t2).

For large samples, inference is based on the results

    Â1 ≈ N(α1, α1/t1)  and  Â2 ≈ N(α2, α2/t2).

The procedure is very similar to the comparing-proportions case, as summarised below.

                sample 1                   sample 2                   difference
    estimate    α̂1 = x1/t1                α̂2 = x2/t2                est = α̂1 − α̂2
    variance    var(Â1) = α1/t1           var(Â2) = α2/t2           se = √(α̂1/t1 + α̂2/t2)
    pooled      α̂ = (x1 + x2)/(t1 + t2)                              se0 = √(α̂(1/t1 + 1/t2))

    CI: est ± "2" se        HT: z = (est − 0)/se0

EXAMPLE 7.6.1:

                   cases   person-years   estimate
    exposed         14        1000        α̂1 = 14/1000 = 0.014
    not exposed     10        5000        α̂2 = 10/5000 = 0.002
                    24        6000        α̂  = 24/6000 = 0.004

    z = (0.014 − 0.002) / √(0.004×(1/1000 + 1/5000)) = 0.012/0.00219 = 5.48,  p = 0.000.

Hence we would reject H0. There is significant evidence here that the rate is greater for exposed individuals.
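As a check, the same arithmetic in R (a sketch; the names are ours):

> x1 <- 14; t1 <- 1000; x2 <- 10; t2 <- 5000
> a1 <- x1/t1; a2 <- x2/t2              # sample rates
> a  <- (x1 + x2)/(t1 + t2)             # pooled rate under H0
> se0 <- sqrt(a*(1/t1 + 1/t2))
> z <- (a1 - a2)/se0
> c(z = z, p.value = 2*pnorm(-abs(z)))  # z = 5.48, p = 0.000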

7.7 Goodness of fit tests

7.7.1 Completely specified hypothesis

We divide possible observations into categories C1, C2, . . ., Ck, such that each observation must belong to one and only one category. The null hypothesis then takes the form:

    H0: Pr(X ∈ Cj) = pj,  j = 1, 2, . . . , k,  for specified values of (p1, p2, . . . , pk).

             category              C1     C2    ...   Ck
    sample   observed frequency    f1     f2    ...   fk     (Σ fj = n)
    H0       probability           p1     p2    ...   pk     (Σ pj = 1)
    (model)  expected frequency    np1    np2   ...   npk    (Σ npj = n)

On the basis of the sample (observed frequencies), we wish to test H0 , i.e., to test the good-
ness of fit of the hypothesis to the observed data.

EXAMPLE 7.7.1: A first-year class of 200 students each selected “random dig-
its”, with the results given below. Do the digits occur with equal frequency?
1
i.e., test H0 : pi = 10 , i = 0, 1, . . . , 9.

i 0 1 2 3 4 5 6 7 8 9
obs freq fi 12 16 15 25 13 21 17 32 25 24
exp freq npi 20 20 20 20 20 20 20 20 20 20

The test statistic we use to assess goodness of fit is given by

    U = Σ (o − e)²/e = Σ (fi − npi)²/(npi),

where the sum is over the k categories. For the data in the above example, the observed value of U is given by:

    u = 8²/20 + 4²/20 + · · · + 4²/20 = 18.70.

Is this too large? To determine whether it is too large, we need to find the distribution of U under H0.

If u = 0 then the fit is perfect, while a large value of u indicates a bad fit. A reasonable test of the goodness of fit of the hypothesis is therefore:

    reject H0 if U > c.

To find c, we need to know the distribution of U under H0.

If H0 is true then Fi ~ Bi(n, pi); and if n is large then Fi ≈ N(npi, npiqi). Statistical theory then shows that, if H0 is true:

    U = Σ (Fi − npi)²/(npi) ≈ χ²k−1.

The χ² distribution is tabulated and available in R. (Note: χ = chi is pronounced 'ky' as in 'sky', so χ² is 'ky squared'.) Table 8 gives the quantiles (inverse cdf) of the χ² distribution; in R, dchisq(), pchisq() and qchisq() give the pdf, cdf and inverse-cdf, respectively.

The test is to reject H0 if U > c1−α(χ²k−1), where α denotes the significance level of the test; or to compute p = Pr(χ²k−1 > u) and reject if p < α.
The following points concerning χ2 goodness-of-fit tests should be noted:
1. In using the χ2 distribution we are approximating binomial by normal, hence we
must have n large and the pi s not too small. The standard rule in this situation is:
npi > 5, i.e. ei > 5.
2. We prefer the number of classes, k, to be large (if there is a choice) since this gives a
more powerful test, but we must have npi > 5. If this condition is not satisfied then
we must combine classes until it is satisfied.
Note: The χ²m distribution can be defined as the distribution of a sum of squares of independent standard normal random variables: U = Z1² + Z2² + · · · + Zm² ~ χ²m, where Z1, Z2, . . . , Zm are iid N(0, 1).
3. The goodness-of-fit test is a one-tailed test. Although the alternative hypothesis is still H1 = H0′, the goodness-of-fit statistic tends to be small when H0 is true and tends to be large when it is not. Thus p = Pr(U > u), with no 2: there is no doubling in this case. The doubling is taken care of by the squaring.
4. Although U is called a goodness-of-fit statistic, it is really a measure of the badness of fit! The larger it is, the worse the fit; the smaller it is, the better the fit. In fact, a value of U that is too small means that the fit is "too good". This could be used as a test for rigging of experiments, but only if the sample size n is very large (else the power is quite small).

EXAMPLE 7.7.2: For the "random digits" considered above, we obtained u = 18.70. If H0 is true, then U ~ χ²9, and so we would reject H0 if u > 16.92. So, we reject H0. There is significant evidence that the digits are not random.

In this case p = Pr(χ²9 > 18.70) = 0.028 (using R); Table 8 indicates that p is slightly larger than 0.025.
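The same test can be carried out in R with chisq.test(), supplying the observed frequencies and the hypothesised probabilities:

> digits <- c(12, 16, 15, 25, 13, 21, 17, 32, 25, 24)
> chisq.test(digits, p = rep(0.1, 10))   # X-squared = 18.7, df = 9, p = 0.028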

EXAMPLE 7.7.3: In one of Mendel's dihybrid cross experiments, he observed 315 smooth yellow, 108 smooth green, 101 wrinkled yellow and 32 wrinkled green F2 plants. Test the hypothesis that these observed frequencies fit a 9 : 3 : 3 : 1 ratio.

    type                  SY       SG       WY       WG
    observed frequency    315      108      101      32
    H0: probability       9/16     3/16     3/16     1/16
    expected frequency    312.75   104.25   104.25   34.75

The total number is n = 556, so the expected frequencies are given by 556 × 9/16 = 312.75, 556 × 3/16 = 104.25, etc.

    u = Σ (o−e)²/e = 2.25²/312.75 + 3.75²/104.25 + 3.25²/104.25 + 2.75²/34.75
      = 0.016 + 0.135 + 0.101 + 0.218 = 0.470.

Under H0, U ~ χ²3, so p = 0.925 and we accept H0. The model and data are compatible: we say that the model is a good fit to the data.

In R we use the function chisq.test():

> observed <- c(315, 108, 101, 32)


> expected <- c(9/16, 3/16, 3/16, 1/16)
> chisq.test(x=observed, p=expected)

Chi-squared test for given probabilities

data: observed
X-squared = 0.47002, df = 3, p-value = 0.9254

EXAMPLE 7.7.4: A random sample of 200 observations on X gave the following results:

    x          0    1    2    3    4   5
    freq(x)    54   79   45   18   3   1

Is X ~ Bi(10, 0.1)? In other words, test the null hypothesis H0: X ~ Bi(10, 0.1).
This hypothesis completely specifies the probabilities of an observation being in each of the possible classes. In order that npi ≥ 5, the number of classes is reduced to four, by combining adjacent classes, as indicated below:

    x      0       1       2       ≥3
    obs    54      79      45      22
    exp    69.74   77.48   38.74   14.01

If H0 is true then U ~ χ²3, so we reject H0 if u > 7.82. From the above table, u = Σ (o−e)²/e = 8.66, hence we reject H0. There is evidence in these data that the distribution of X is not Bi(10, 0.1).
Fitting distributions (hypothesis specified except for one or more parameters)*

Consider the null hypothesis X ~ Bi(10, p), where p is unspecified. To fit this distribution, we need to estimate p from the sample, and use this estimate to determine the expected frequencies under H0. In estimating p we lose another degree of freedom, since we are using the data to enable the model to fit better. More generally, each parameter estimated results in another constraint, and another degree of freedom lost.

EXAMPLE 7.7.5: For the observations above, is X ~ Bi(10, p)?

We have x̄ = 1.20, so an estimate of p is given by p̂ = 0.12, since µ = 10p. So the expected frequencies are as given in the following table, obtained using Bi(10, 0.12):

    x      0       1       2       ≥3
    obs    54      79      45      22
    exp    55.70   75.95   46.61   21.65

Thus, if H0 is true then U ≈ χ²2, so we reject H0 if u > 5.99. From the sample, u = Σ (o−e)²/e = 0.24, and hence we do not reject H0. We take this as an indication that the Binomial distribution fits the data, but with p = 0.12, rather than p = 0.10.

Generally, in fitting a distribution in this way,

    U = Σ (Fi − npi)²/(npi) = Σ (o−e)²/e ≈ χ²k−m−1,

where k = number of classes and m = number of parameters estimated.
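A sketch of this fit in R, for the example above. Note that chisq.test() does not know that a parameter has been estimated (it would use df = k − 1), so the correct df = k − m − 1 is applied by hand:

> obs   <- c(54, 79, 45, 22)
> p.hat <- 0.12                                  # estimated from x-bar = 1.20
> probs <- c(dbinom(0:2, 10, p.hat), 1 - pbinom(2, 10, p.hat))  # classes 0, 1, 2, >=3
> e <- 200*probs                                 # expected frequencies
> u <- sum((obs - e)^2/e)
> pchisq(u, df = 4 - 1 - 1, lower.tail = FALSE)  # k - m - 1 = 2 degrees of freedom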

7.8 Contingency tables

Another approach to comparing two proportions is to use a χ2 -test, which is applicable to


contingency tables in general. This is a goodness-of-fit test, and tests the null hypothesis
that two classifications are independent. Thus if H0 is rejected, there is evidence indicating
some association between the two classifications, i.e. between the two categorical variables.
EXAMPLE 7.8.1: (Aspirin trial)


The aspirin data can be re-expressed as a contingency table:

Heart attacks No heart attacks


Aspirin group 104 10933
Placebo group 189 10845

This contingency table has two rows and two columns and is called a 2×2 table.
In this case the classification variables are treatment (aspirin and placebo) and
disease status (heart attack and no heart attack).

The null hypothesis we test is that disease status classification (heart attack
or no heart attack) is independent of the treatment classification (aspirin or
placebo). If H0 were rejected, it would provide evidence of a relation between
treatment and outcome.

7.8.1 2×2 contingency table

A 2×2 contingency table takes the form:

    obs freq    A     A′
    G           47    23     70
    G′          13    17     30
                60    40    100

On the basis of this sample, we wish to test the hypothesis that the classifications are independent; i.e., H0: Pr(A ∩ G) = Pr(A) Pr(G) and H1: Pr(A ∩ G) ≠ Pr(A) Pr(G).

Note: Independence can be expressed in the form Pr(A | G) = Pr(A | G′), i.e. the probability of attribute A is the same in G as in G′, which is equivalent to p1 = p2.

If H0 is true then the expected frequencies are given by:

    exp freq    A          A′
    G           n pG pA    n pG qA    n pG
    G′          n qG pA    n qG qA    n qG
                n pA       n qA       n

where pA + qA = 1 and pG + qG = 1.

To evaluate the expected frequencies, we need to assign values to pG and pA. We use p̂G = 70/100 and p̂A = 60/100, so that q̂G = 30/100 and q̂A = 40/100. Then we obtain:

    exp freq    A     A′
    G           42    28     70
    G′          18    12     30
                60    40    100

e.g. eG∩A = 100 × (70/100 × 60/100) = (70×60)/100 = (row sum × column sum)/n.

The other expected frequencies could be worked out similarly: e.g. eG∩A′ = (70×40)/100 = 28; but it is easier to obtain them by subtraction: eG∩A′ = 70 − 42 = 28.
Note: these "expected frequencies" represent estimated means, so there is no need for them to be integers (although they are in this example).

The test statistic takes the form

    U = Σ (o − e)²/e,

where o refers to the observed frequency of a category, and e refers to the frequency that would be expected if the hypothesis being tested (H0) is true. If H0 is true then U ≈ χ²df, and in the 2×2 case, df = 1.

One way to see that df = 1 is to observe that it is sufficient to determine one expected frequency to complete the table. Alternatively, fitting the independence model to a 2×2 table is equivalent to fitting a distribution on 4 cells, with 2 estimated parameters (pA and pG), so df = 4 − 2 − 1 = 1.

The observations give:

    u = Σ (o − e)²/e = 5²/42 + 5²/28 + 5²/18 + 5²/12 = 4.96.

We reject H0 since u > c0.95(χ²1) = 3.84;  p = Pr(χ²1 > 4.96) = 0.026.

EXAMPLE 7.8.2: Consider again the two-group example:

          A    A′
    G1    54   46   100
    G2    27   33    60
          81   79   160

We wish to test whether there is a difference between the groups. So the null hypothesis is that the group classification and the attribute classification are independent. For these data, under the null hypothesis of independence, the expected frequencies are given by:

          A        A′
    G1    50.625   49.375   100
    G2    30.375   29.625    60
          81       79       160

and so,

    u = Σ (o − e)²/e = 3.375²/50.625 + 3.375²/49.375 + 3.375²/30.375 + 3.375²/29.625 = 1.215.

Under H0 (no difference between the groups), U ≈ χ²1 and so we would reject H0 if u > 3.84. There is no significant evidence here of a difference between the groups: p = Pr(χ²1 > 1.215) = 0.270.
R can be used to analyse contingency tables, using chisq.test(). Consider again the example above. When tabulated data are given, use the following:
> my.table <- matrix(c(54, 27, 46, 33), ncol=2)  # code the data as a matrix
> results <- chisq.test(my.table, correct=FALSE)
> results
Pearson’s Chi-squared test

data: my.table
X-squared = 1.2152, df = 1, p-value = 0.2703
Note that chisq.test() has an option for a continuity correction. The object saved in results actually contains quite a few components:
> names(results)  # elements of 'results'
[1] "statistic" "parameter" "p.value" "method" "data.name"
    "observed" "expected" "residuals" "stdres"
For example, to extract the expected frequencies write:
> results$expected # ’expected’ element of ’results’
[,1] [,2]
[1,] 50.625 49.375
[2,] 30.375 29.625

The χ²-test for a 2×2 contingency table is actually identical to the z-test for testing equality of proportions, since u = z² and χ²1 = N². In the example, z = 1.102 and u = 1.215 (= 1.102²).

For a 2×2 contingency table with frequencies (a b / c d), it can be shown that

    z = (ad − bc)√n / √((a+b)(c+d)(a+c)(b+d)),  and  u = z².

We note that ad ≷ bc corresponds to a positive/no/negative relationship, which corresponds to z ≷ 0. Further, in this case, the correlation coefficient r = z/√n.

Like the z-test, the χ²-test applies only if n is large. It depends on the normal approximation to the binomial, for which we need np > 5. Thus, a standard rule for the application of the χ²-test is that all the expected frequencies satisfy e > 5. R produces a warning if any e < 5; in such a situation the χ²-approximation is unreliable, and a continuity-corrected test (the default correct=TRUE for 2×2 tables) or an exact test may be preferable.

EXAMPLE 7.8.3: (Aspirin trial, again)


Use a contingency table method to analyse the aspirin data.

With the data


trt hd freq
A H 104
A H’ 10933
P H 189
P H’ 10845

the following R output was obtained:

> X <- matrix(c(104, 189, 10933, 10845), ncol=2)


> chisq.test(X, correct=FALSE)

Pearson’s Chi-squared test

data: X
X-squared = 25.014, df = 1, p-value = 5.692e-07

Thus we reject H0. There is significant evidence here that the rate of heart attacks is smaller in the aspirin group.

Odds Ratio

There is another useful measure of relationship in this situation that we have met: the odds ratio, θ. Based on the above table, we obtain an estimate of the odds ratio:

    θ̂ = ad/(bc),  and we observe that ad ≷ bc ⇔ θ̂ ≷ 1.

A confidence interval for θ is obtained as follows:

    ln θ̂ = ln a − ln b − ln c + ln d;

    se(ln θ̂) = √(1/a + 1/b + 1/c + 1/d);

    95% CI for ln θ:  ln(ad/bc) ± 1.96 √(1/a + 1/b + 1/c + 1/d),  giving L < ln θ < U;

    95% CI for θ:  e^L < θ < e^U.

θ = 1 corresponds to independence, or no relationship; thus if the confidence interval excludes 1, this indicates significant dependence.
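These steps package neatly into a small R function; or.ci below is our own sketch, not a built-in. (The two calls anticipate the examples that follow.)

> or.ci <- function(n11, n12, n21, n22) {
+   log.or <- log(n11) - log(n12) - log(n21) + log(n22)
+   se <- sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22)
+   exp(c(lower = log.or - 1.96*se, or = log.or, upper = log.or + 1.96*se))
+ }
> or.ci(54, 46, 27, 33)           # CI includes 1: no significant relationship
> or.ci(104, 10933, 189, 10845)   # CI excludes 1: significant (negative) relationship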

EXAMPLE 7.8.4: For the above 2×2 contingency table (54 46 / 27 33), we find:

    ln θ̂ = ln 54 − ln 46 − ln 27 + ln 33 = 0.3610;

    se(ln θ̂) = √(1/54 + 1/46 + 1/27 + 1/33) = 0.3280.

    95% CI for ln θ:  0.3610 ± 1.96×0.3280 = (−0.282, 1.004).

    95% CI for θ:  (e^−0.282, e^1.004) = (0.75, 2.73).

As the confidence interval for θ includes 1, there is no significant evidence of a relationship.

EXAMPLE 7.8.5: (Aspirin trial: odds ratio)

In this case the estimated odds ratio is θ̂ = (104×10845)/(10933×189) = 0.546.

    ln θ̂ = ln 104 − ln 10933 − ln 189 + ln 10845 = −0.6054;

    se(ln θ̂) = √(1/104 + 1/10933 + 1/189 + 1/10845) = 0.1228.

    95% CI for ln θ:  −0.6054 ± 1.96×0.1228 = (−0.85, −0.36).

    95% CI for θ:  (e^−0.85, e^−0.36) = (0.43, 0.69).

Since the confidence interval for θ excludes 1, there is significant evidence here of a negative relationship between aspirin and heart attacks: i.e. aspirin use is associated with fewer heart attacks.
7.8.2 r × c contingency tables

The χ²-test in the r × c case is a straightforward extension of the χ²-test used in the 2 × 2 case. We now have a table of the form:

          B1   B2   ...   Bc
    A1                          a1
    A2                          a2
    ...                         ...
    Ar                          ar
          b1   b2   ...   bc    n

The expected frequencies are calculated in the same way, i.e.

    exp. freq.  eij = (row sum × column sum)/n = ai bj / n.

Here, the number of cells is k = rc; and the number of effective constraints on the frequencies is ℓ = r + c − 1. Thus df = k − ℓ = rc − (r + c − 1) = (r − 1)(c − 1).

So, provided all the expected frequencies are at least 5, under H0,

    U = Σ (o − e)²/e ≈ χ²(r−1)(c−1).
EXAMPLE 7.8.6: Comparing cough mixtures:

                            A    B    C
    "little or no relief"   11   13    9
    "moderate relief"       32   28   27
    "total relief"           7    9   14

In R the procedure for an r×c contingency table is no different to the procedure for a 2×2 table. In this case, the following output is obtained:

> X <- matrix(c(11, 32, 7, 13, 28, 9, 9, 27, 14), ncol=3)


> X
[,1] [,2] [,3]
[1,] 11 13 9
[2,] 32 28 27
[3,] 7 9 14
> chisq.test(X)

Pearson’s Chi-squared test

data: X
X-squared = 3.81, df = 4, p-value = 0.4323

R gives u = 3.81 (df = 4), p ≈ 0.43, and so we do not reject H0: there is no significant difference in the (perceived) effects of the cough mixtures.
EXAMPLE 7.8.7: (TV watching time and fitness)

                   TV watching time
    Fitness    0     1–2   3–4   ≥5
    Fit        35    101    28    4
    Not fit    147   629   222   34

Is there an association between fitness and time spent watching TV?

    u = (35 − 25.48)²/25.48 + · · · + (222 − 215.0)²/215.0 = 6.161,
    df = 3,  p = Pr(χ²3 > 6.161) = 0.104.

There is no significant evidence of an association between fitness and TV watching time in these data.
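The hand calculation can be checked with chisq.test(); the matrix is filled column by column:

> tv <- matrix(c(35, 147, 101, 629, 28, 222, 4, 34), ncol = 4)
> chisq.test(tv)   # X-squared = 6.161, df = 3, p-value = 0.104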
EXAMPLE 7.8.8: Aluminium and Alzheimer's disease: a case-control study.

The focus of the study was on the use of antacids that contain aluminium.

                            Aluminium-containing antacid use
                            None   Low   Medium   High
    Alzheimer's patients    112     3      5       8
    Control group           114     9      3       2

There are two parts to this story:

    Alz vs antacid use (112 16 / 114 14): no evidence of a relationship (u = 0.151, df = 1, p = 0.698).

    Alz vs level of use (3 5 8 / 9 3 2): there appears to be some evidence here of a relation, but it can't be tested using χ² as 3 cells have e < 5.

QUESTION: What conclusion would you draw if there were a significant result?

EXERCISE. 400 patients with malignant melanoma:

    Histological                     Site
    type             head & neck   trunk   extremities   Total
    Hutchinson's          22          2         10          34
    Superficial           16         54        115         185
    Nodular               19         33         73         125
    Indeterminate         11         17         28          56
    Total                 68        106        226         400

Test whether there is significant association between Type and Site.


Problem Set 7
7.1 The effect of a single 600 mg dose of Vitamin C versus a sugar placebo on the muscular en-
durance (as measured by repetitive grip strength trials) of thirteen male volunteers (19-23 years
old) was evaluated. The study was conducted in a double-blind manner, with crossover. That
is, two tests were carried out on each subject, once after taking vitamin C and once after taking
the sugar placebo.
Subject Placebo Vitamin C Difference
1 170 248 –78
2 180 218 –38
3 372 349 23
4 288 264 24
5 636 593 43
6 172 117 55
7 278 185 93
8 279 185 94
9 258 122 136
10 363 159 204
11 417 145 272
12 678 387 291
13 699 245 454
mean 368.5 247.5 121.0
stdev 188.7 132.0 148.8
(a) The following questions refer to the design of the study.
i. What is the response variable? the explanatory variable(s)?
ii. How has comparison been used in the study?
iii. How has control been used in the study?
iv. The study was conducted in a ’double-blind’ manner. What does this mean?
v. How should randomisation have been used in the study?
vi. Give one point in favour of, and one point against, the use of a crossover design for
this study.
(b) Draw a boxplot of the differences of the data, clearly labelling all relevant points, includ-
ing any outliers, should they exist. To help, here is the five number summary:
Min. 1st Qu. Median 3rd Qu. Max.
-78.0 23.5 93.0 238 454.0
i. What assumption are you looking to check in the boxplot and what do you conclude?
ii. Suggest an alternative plot that may be useful and describe what you would expect
to see if the assumption you are looking to check is reasonable.
(c) Carry out a t-test on the differences.
i. State the null and alternative hypotheses, calculate the value of the test statistic, and
give a range for the p-value (e.g. 0.05 < p < 0.1).
ii. State your conclusions, in non-statistical terms.
7.2 A colleague has analysed the data from Problem 7.1, and shows you the R output below.
data: Placebo and VitaminC
t = 1.89, df = 24, p-value = 0.070
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.7670078 1.1014895
(a) Compare the point estimate of the mean difference in test scores that you obtained (in
Problem 7.1) with the result your colleague found.
(b) Compare the 95% confidence interval for the mean difference in test scores that you ob-
tained (in Problem 7.1) with the result your colleague found.
(c) Why are the results different?
(d) Which analysis is more appropriate? Explain why.
7.3 Volunteers who had developed a cold within the previous 24 hours were randomised to take
either zinc or placebo lozenges every 2 to 3 hours until their cold symptoms were gone (Prasad
et al., 2000). For the twenty-five participants who took zinc lozenges, the mean overall duration
of symptoms was 4.5 days and the standard deviation was 1.6 days. For the twenty-three
participants who took placebo lozenges, the mean overall duration of symptoms was 8.1 days
and the standard deviation was 1.8 days.
(a) For the two groups calculate the difference in the sample means and the standard error
of the difference in means.
(b) Compute a 95% confidence interval for the difference in mean days of overall symptoms
for the placebo and the zinc lozenge treatments, and write a sentence interpreting the
interval. (Assume that the standard deviations for the placebo and the zinc lozenge treatments
are the same. Does this seem reasonable?)
(c) Does the interval computed in (b) give evidence that the population means are different?
Explain.
7.4 The effect of exercise on the amount of lactic acid in the blood was examined in a study. Eight
men and seven women who were attending a week-long training camp participated in the
experiment. Blood lactate levels were measured before and after playing three games of rac-
quetball, and shown below.
    Men
    Player    1    2    3    4    5    6    7    8
    Before   13   20   17   13   13   16   15   16
    After    18   37   40   35   30   20   33   19

    Women
    Player    1    2    3    4    5    6    7
    Before   11   16   13   18   14   11   13
    After    21   26   19   21   14   31   20
(a) Does exercise change the blood lactate level for women players? Test this.
(b) Estimate the mean change in blood lactate level for male racquetball players using a 95%
confidence interval.
(c) Is the mean change in blood level the same for men and women players? Test this.
7.5 The following observations are obtained on two treatments:
treatment C 34.7 26.7 32.0 52.7 45.4 31.5 20.3 23.4 35.9 42.1
treatment K 35.6 28.5 35.7 54.8 47.1 33.5 19.2 27.2 37.2 41.5
It can be assumed that the observations are independent and normally distributed with equal
variances. Let δ denote the increase in mean that results from using treatment K rather than
treatment C.
(a) Test for the difference in the effects of the two treatments using an independent samples
t-test. Derive a 95% confidence interval for δ.
(b) Now suppose that the columns actually correspond to blocks (a–j): for example, a 'block' might be one individual who is given first one treatment, and then, at some later time, the other treatment.
a b c d e f g h i j
treatment C 34.7 26.7 32.0 52.7 45.4 31.5 20.3 23.4 35.9 42.1
treatment K 35.6 28.5 35.7 54.8 47.1 33.5 19.2 27.2 37.2 41.5
Test for the difference in the effects of the two treatments using a paired-samples t-test.
Derive a 95% confidence interval for δ. Why is this interval narrower than the one derived
in (a)?
7.6 Consider the independent samples
sample 1 27 34 37 39 40 43
sample 2 41 44 52 93
(a) Draw dotplots of the two samples.
(b) Show that a two-sample t-test does not reject the null hypothesis of equal means.
(c) If the observation 93 is found to be a mistake: it should have been 53. Show that a two-
sample t-test now rejects the null hypothesis of equal means.
(d) Explain the difference in the results of the tests.

7.7 A recent study compared the use of angioplasty (PTCA) with medical therapy in the treatment of single-vessel coronary artery disease. At the six-month clinic visit, 35 of 96 patients seen in the PTCA group and 55 of 104 patients seen in the medical therapy group had had angina.
Is there evidence in these data that PTCA is more effective than medical therapy in preventing
angina?
Find a 95% confidence interval for the difference in proportions.

7.8 In a test to evaluate the worth of a drug in treating a particular disease, the following results
were obtained in a double-blind trial:
no improvement improvement lost to survey
placebo 22 23 5
drug 12 30 8
Do these data indicate that the drug has brought about a significant increase in the improve-
ment rate? Explain your reasoning.

7.9 In a study, 500 patients undergoing abdominal surgery were randomly assigned to breathe one
of two oxygen mixtures during surgery and for two hours afterwards. One group received a
mixture containing 30% oxygen, a standard generally used in surgery. The other group was
given 80% oxygen. Wound infections developed in 28 of the 250 patients who received 30%
oxygen, and in 13 of the 250 patients who received 80% oxygen.
Is there evidence to conclude that the proportion of patients who develop wound infection is lower for the 80% oxygen treatment than for the 30% oxygen treatment? Use a p-value approach.

7.10 A study on motion sickness in buses reported that seat position within a bus may have some
effect on whether one experiences motion sickness. The following table classifies each person
in a random sample of bus passengers by the location of their seat and whether nausea was
reported.

Location
front middle rear
nausea 58 166 193
no nausea 870 1163 806

Based on these data, can you conclude that there is an association between seat location and nausea?

7.11 A case-control study with 100 cases of disease D, and 100 matched controls, yielded the fol-
lowing results with respect to exposure E:
E E′
case, D 63 37 (ncase = 100)
control, D′ 48 52 (ncontrol = 100)
i. Test the hypothesis that the proportion of individuals with exposure E is the same in both
populations (cases and controls).
ii. Find an estimate and a 95% confidence interval for the odds ratio.

7.12 Data relating to oral-contraceptive use and the incidence of breast cancer in the age-group 40–
44 years in the Nurses’ Health Study are given in the table below:
OC-use group number of cases number of person-years
current users 13 4 761
past users 164 121 091
never users 113 98 091
(a) i. Compare the incidence rate of breast cancer in current-users versus never-users us-
ing a z-test, and report a p-value.
ii. Find a 95% confidence interval for the rate ratio.
(b) i. Compare the incidence rate of breast cancer in past-users versus never-users using a
z-test, and report a p-value.
ii. Find a 95% confidence interval for the rate ratio.

Chapter 8

REGRESSION AND CORRELATION

“‘Is there any other point to which you would wish to draw my attention?’ ‘To the curious incident of
the dog in the night-time.’ ‘The dog did nothing in the night-time.’ ‘That was the curious incident.’”
Sherlock Holmes, The Silver Blaze, 1894.

8.1 Introduction
In this chapter, we consider bivariate numerical data: that is, data for two numerical vari-
ables, and we seek to investigate the relationship between the variables.
[Figure: scatter plot of y against x; both variables range from about 20 to 80.]

A bivariate data set consists of n pairs of data points:


{(xi , yi ), i = 1, 2, . . . , n}.
If x and y are numerical variables, these data can be plotted on a “scatter diagram” or
“scatter plot” (the bivariate analogue of a dotplot) as discussed in Chapter 2.


In this chapter we are concerned with the case of bivariate numerical variables. However,
in general, a bivariate data set may involve variables which may be either numerical or
categorical.
Note: A categorical bivariate data set consists of n pairs of data points:

    {(ci, di), i = 1, 2, . . . , n},

where c and d are categorical variables, such as gender or attribute. ci and di denote the values of the categorical variables for individual i; thus (ci, di) = (F, D′) indicates that individual i is a female who does not have attribute D.

Such a data set is most simply summarised by a contingency table (see §7.8), with rows representing the c-categories and columns the d-categories:

         D′   D
    F    15   40    55
    M    25   20    45
         35   65   100

Of course, either of the categorical variables may have more than two possible values, resulting in a contingency table with more rows or columns.
A scatter plot for bivariate categorical variables is singularly unhelpful! If each observation is represented by a point, then the result is a rectangular array of points, as in the left diagram below.

Note that for the purposes of plotting, each category has to be allocated numerical values.
This is simply a coding device.
There needs to be some mechanism for displaying how many times each point is observed.
(There can be a similar problem, though clearly to a lesser degree, for numerical variables
where several individuals give the same values.) One way to overcome this problem is
to “jitter” the points. This means that instead of plotting at (x, y), we plot at (x+e, y+f ),
where e and f are (small) random perturbations. The extent of the jittering can be varied
to suit the situation.
A preferable alternative is to modify the size of the “point” to represent the number of
observations at the specified point (cf. Gapminder plots).
Another representation is the Mosaic diagram, which takes two forms corresponding to
row percentages and column percentages: see diagram below.
[Figure: mosaic diagrams for the table above, in two forms, corresponding to row percentages and to column percentages.]
Q UESTION : What happens if the bivariate data set consists of


{(ci , yi ), i = 1, 2, . . . , n},
where c is a categorical variable and y is a numerical variable? Give a simple example of
such a data set. What would a scatter plot look like in this case? Why is this unsatisfactory?
Suggest a sensible modification.
In dealing with numerical bivariate data:
Regression is used for prediction; when we want to use x to predict y.
Correlation is used to measure association between x and y.
We start with correlation, which was introduced in Chapter 2.

8.2 Correlation

In Chapter 2, we saw that the correlation r indicates the relationship between the x and y variables, and how the points are distributed within a scatter plot. Recall how the appearance of a scatter plot changes as r moves through −1, −0.75, . . . , 1: from a straight line with negative slope at r = −1, through negative relationships with progressively more scatter, to a random spread at r = 0, and then back through to a straight line with positive slope at r = 1.
Properties of r
1. −1 ≤ r ≤ 1.
2. r > 0 indicates a positive relationship; r < 0 indicates a negative relationship. The magnitude of r indicates the strength of the (linear) relationship.
3. r = ±1 if, and only if, y = a + bx with b ≠ 0; r = 1 if b > 0 and r = −1 if b < 0.
4. r (like x̄ and s) is affected by outliers.

x̄ and sx indicate the location and spread of the x-data: about 95% in (x̄ − 2sx , x̄ + 2sx );
ȳ and sy indicate the location and spread of the y-data: about 95% in (ȳ − 2sy , ȳ + 2sy );
the correlation, r, or rxy , indicates how the points are distributed in this region.
However, the appearance of the scatter plot is affected by the scale used on the axes in
plotting the graph! To make scatter plots comparable, you should try to arrange the scale
so that the spread of the y is about the same as the spread of the x.
The following three scatter plots plot identical points, but using different scales; the correlation is 0.564 in each case.

[Figure: three scatter plots of the same data drawn with different axis scales.]

The graph at the left is preferred, as the apparent spreads of points in the horizontal and vertical directions are similar.

E XERCISE . The following statistics are available for a bivariate data set:
n = 100; x̄ = 55.4, sx = 14.1; ȳ = 42.8, sy = 6.3; r = −0.52.
Using only the given information, indicate the form of the scatter plot for these data.
The shape of the scatter plot is indicated by the negative correlation, r = −0.52: it will resemble in form the scatter plot for r = −0.5. The scale is specified by the means and standard deviations: most (i.e. about 95%) of the data have 27.2 < x < 83.6 and 30.2 < y < 55.4.
To compute a correlation, use a calculator or a computer. On R, use cor(x,y) and enter
the names of the variables for which a correlation is required.
Just so you know:

    sample variance of x:        s²x = (1/(n−1)) Σ (x − x̄)²;
    sample variance of y:        s²y = (1/(n−1)) Σ (y − ȳ)²;
    sample covariance of x, y:   sxy = (1/(n−1)) Σ (x − x̄)(y − ȳ);
    sample correlation of x, y:  rxy = sxy/(sx sy) = Σ (x − x̄)(y − ȳ) / √(Σ (x − x̄)² Σ (y − ȳ)²).
Even if we never use the formula to compute r, it still has its uses:

    r = [(1/(n−1)) Σ (xi − x̄)(yi − ȳ)] / (sx sy) = (1/(n−1)) Σ ((xi − x̄)/sx)((yi − ȳ)/sy) = (1/(n−1)) Σ xsi ysi,

where xsi and ysi denote the standardised scores. This indicates that the points that contribute most to the correlation are those with both standardised scores large. It also tells us that r is not affected by location and scale.

E XERCISE . Consider the following trivial data set:


x 1 2 3
y 6 4 8

Here n = 3, as there are three bivariate pairs.


Check, by hand, that x̄ = 2, ȳ = 6, sx = 1, sy = 2 and r = 0.5. Verify these results using a
calculator or computer.
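For instance, in R:

> x <- c(1, 2, 3); y <- c(6, 4, 8)
> c(mean(x), mean(y), sd(x), sd(y), cor(x, y))
[1] 2.0 6.0 1.0 2.0 0.5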

EXAMPLE 8.2.1: (Blood value measures)


A standard measure (y) of a blood value is difficult and expensive to determine,
so an alternative measure (x) is proposed which is simpler and cheaper to ob-
tain. The following data have been obtained:

x 49.6 70.8 55.3 69.2 51.9 54.6 59.8 55.3 65.1 75.8
y 56.1 61.9 58.3 61.8 59.4 56.6 59.5 58.5 64.5 65.5

Using a calculator or computer, check that n = 10, x̄ = 60.74, sx = 8.936;


ȳ = 60.21, sy = 3.152; and r = 0.8848.

In R:

> x <- c(49.6, 70.8, 55.3, 69.2, 51.9, 54.6, 59.8, 55.3, 65.1, 75.8)
> y <- c(56.1, 61.9, 58.3, 61.8, 59.4, 56.6, 59.5, 58.5, 64.5, 65.5)
> length(x)
[1] 10
> mean(x)
[1] 60.74
> sd(x)
[1] 8.935597
> mean(y)
[1] 60.21
> sd(y)
[1] 3.15223
> cor(x,y) # correlation between x and y
[1] 0.8847845

Maybe, once only, it may be worthwhile to check, using a spreadsheet say, that Σ (x − x̄)² = 718.60, Σ (x − x̄)(y − ȳ) = 224.30, Σ (y − ȳ)² = 89.43; and hence that r = 224.30/√(718.60×89.43) = 0.8848. But then again, maybe not! You will not need to compute correlation like this; this is simply to indicate what your computer or calculator does when it is evaluating r.

For example, in R:

> sum((x-mean(x))^2)
[1] 718.604
Q UESTION : The two blood measures are supposed to be comparable. Why is the correla-
tion not sufficient to ensure this, no matter how close to 1 it gets?

A note on correlation* (Beyond the formula)


In Chapter 2, we learned that the sample variance s² measures the variability in a univariate sample of numeric data. The correlation coefficient r does a similar job for a bivariate numeric sample. An important difference between variance and correlation is that while the sample variance only gives us the magnitude of variation about the sample mean, r gives us both magnitude and direction.
For now let us just focus on the sample covariance (the numerator of the sample correlation), sxy = (1/(n−1)) Σ (xi − x̄)(yi − ȳ). Consider a bivariate sample of size n. For any pair (xi, yi) from this sample, four things can happen:

    (xi − x̄)(yi − ȳ) > 0 if xi > x̄ and yi > ȳ;
    (xi − x̄)(yi − ȳ) > 0 if xi < x̄ and yi < ȳ;
    (xi − x̄)(yi − ȳ) < 0 if xi > x̄ and yi < ȳ;
    (xi − x̄)(yi − ȳ) < 0 if xi < x̄ and yi > ȳ.

Thus the product (xi − x̄)(yi − ȳ) captures two properties. First, it tells us how far xi and yi have deviated from their respective location measures, the sample means. Second, it tells us whether or not the deviations of xi and yi from their sample means are in the same direction; that is, whether yi takes a high (low) value whenever xi takes a high (low) value. This property is shown in the figure below; the vertical lines mark the sample means.

[Figure: two scatter plots; in the left panel y increases with x, in the right panel y decreases as x increases.]

For the data in the left panel the yi s increase (relative to ȳ) as the xi s increase, so the covariance is positive. The reverse happens in the right panel, so the covariance is negative. Comparing with the sample variance, we can think of sxy as a joint deviation of x and y from the sample means.
An issue with sxy is that it combines information about the spread of x and y with the
strength of their relationship. This can pose some challenges to its utility in practical use.
So it is scaled (divided) by the sample standard deviations of the individual variables, x
and y, so that this modified metric, the sample correlation coefficient (r), measures only the
strength of their relationship, and takes values between −1 and 1.
8.2.1 Inference based on correlation

In the same way that the sample mean (x̄) is an estimate of a population mean (µ) for a
univariate population, the sample correlation (r) is an estimate of the population correla-
tion (ρ) for a bivariate population. The population correlation, ρ, is a measure of the linear
association between two variables.

Properties of ρ
1. −1 ≤ ρ ≤ 1.
2. ρ = ±1 if, and only if, Y = a + bX with b ≠ 0; ρ = 1 if b > 0 and ρ = −1 if b < 0.
3. If X and Y are independent then ρ = 0.
   Note: the converse is not true in general, but it is true when (X, Y) is bivariate normal.

Like µ, the population correlation is generally unknown, and we seek to estimate it using
a sample drawn from the population. The correlation obtained from the sample is r.

                          population        sample
    mean                  µX, µY            x̄, ȳ
    standard deviation    σX, σY            sx, sy
    correlation           ρ (−1 ≤ ρ ≤ 1)    r (−1 ≤ r ≤ 1)

Inference on the correlation:


• estimation of ρ: point estimate (r) and an interval estimate, i.e. a 95% confidence
interval for ρ.
• testing a null hypothesis about ρ, which is usually H0 : ρ = 0.
If we assume that (X, Y ) is (bivariate) normal then r can be used to test the hypothesis
ρ = 0, which is also a test for independence. However, this is equivalent to testing whether
the slope of the regression line is 0, which we will learn how to do later.
A 95% confidence interval for ρ based on r can be read off the correlation statistic-parameter
diagram in the Statistical Tables (Figure 9).

EXAMPLE 8.2.2: A random sample of 50 observations on (X, Y ) produced a


sample correlation coefficient, r = −0.42. Is this significant evidence that X
and Y are not independent?

For the above example, using the diagram in the Tables we obtain the 95% CI for
ρ as (−0.62, −0.16), which excludes zero. So we can conclude that ρ=0 would
be rejected.

EXAMPLE 8.2.3: Suppose that, for a sample of n = 20 observations on (X, Y ),


we obtain r = 0.5.

Using the Statistical Tables diagram, an approximate 95% confidence interval


for ρ is given by 0.07 < ρ < 0.77. In particular, we can say that X and Y are
significantly positively correlated; they are not independent.
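If R is at hand instead of the Tables, the chart's intervals can be reproduced approximately using Fisher's z-transformation. The helper rho.ci below is our own sketch, valid under the bivariate normal assumption; when the raw data are available, cor.test(x, y) reports the corresponding interval directly.

> rho.ci <- function(r, n) {
+   z  <- atanh(r)                  # Fisher z-transformation of r
+   se <- 1/sqrt(n - 3)             # approximate standard error of z
+   tanh(z + c(-1.96, 1.96)*se)     # back-transform the 95% CI endpoints
+ }
> rho.ci(0.5, 20)     # approx (0.07, 0.77), as read from the chart
> rho.ci(-0.42, 50)   # approx (-0.63, -0.16)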
8.3 Straight-line regression

Correlation provides information on the strength of the linear relationship between two
numerical variables. To further explore the relationship between two numerical variables,
we develop a model that relates one variable to the other. We can then use this model to
predict one variable from the other.

The regression of y on x is E(Y | x), i.e. the expectation of Y given the value of x. For example, Y may be the measured pressure of a gas in a given volume x, the measurement being subject to error. Here, we might expect that E(Y | x) = c/x.

The simplest form of regression model, and the only one that we will consider, is

    E(Y | x) = α + βx  and  var(Y | x) = σ²,

so that the regression of y on x is linear, with constant variance.

However, in some cases, it is possible to transform data to produce a straight-line regression. For example:

1. In the above example on pressure and volume, if we write x∗ = 1/x, then E(Y | x∗) = cx∗, which is a linear model with intercept 0.

2. If y ≈ αx^β, then it might be appropriate to take logs and consider the model E(Y∗ | x∗) = α∗ + βx∗, where Y∗ = ln y, x∗ = ln x, α∗ = ln α.

Note: such a transformation affects the assumption of equal variance.

EXAMPLE 8.3.1: (Recurrence time)


The following data were obtained in a study of the relationship between tumour
size (x, in mm) and time to recurrence (y, in months).

x 2.5 5 10 15 17.5 20 25 30 35 40
y 63 58 55 61 62 37 38 45 46 19
[Figure: scatter plot of time to recurrence y against tumour size x, showing a decreasing trend.]

In fitting a regression line, our aim is to use x to predict y; so the fitted line is the one that
gives the best prediction of y for a given x.
This is not necessarily the line that best fits the relationship between the variables; and it is
definitely not the line that would be used to predict x given y, if that were required.

EXAMPLE 8.3.2: (Carbon monoxide)


Data are collected on the number of cars per hour and the concentration of
carbon-monoxide (CO) in parts per million at certain street corners and are
shown below:

cars/hr 980 1040 1135 1450 1510 1675 1890 2225 2670 2935 3105 3330
CO conc 9.0 6.8 7.7 9.6 6.8 11.3 12.3 11.8 20.7 19.2 21.6 20.6

(a) What are the variables being measured?


(b) Which is the response variable?
(c) What are the questions of interest?

The model and its assumptions


The data: {(xi , yi ), i = 1, 2, . . . , n}.
The aim here is to find a relation y = g(x) so that x can be used to predict y.

We assume that, given x, Y is a random variable with

    mean  E(Y | x) = µ(x),
    variance  var(Y | x) = σ²  (not dependent on x).

We generally assume that µ(x) has the form µ(x) = α + βx. This is called the straight-line regression model, and the line y = α + βx is called the population regression line. In addition, we usually assume that Y is normally distributed. We can write this as

    yi = α + βxi + ei,  i = 1, 2, . . . , n,

where the ei are independent realisations of N(0, σ²), called the random errors.

[Figure: the straight-line regression model, showing the population regression line y = α + βx and the distribution of Y about the line.]
Interpretations based on the model

The coefficient β describes the relation between y and x:

• if β is positive, then x and y are positively related;
• if β is negative, then x and y are negatively related;
• the value of β gives the increase in the mean of y when x is increased by one unit, i.e. the average increase in y per unit increase in x.
Once the coefficients α and β are estimated, we have a fitted model.
• The fitted model is used to estimate the average value of Y for a given value of x.
• It is also used to predict a future observation of Y for a given value of x.

8.4 Estimation of α and β: least squares

We use the "least squares" method to find the fitted model. That is, we consider all straight lines y = a + bx and select the line for which

    ∆ = ∆(a, b) = Σ (yi − a − bxi)²

is a minimum. Note that yi − a − bxi is just the vertical distance of the i'th data point (xi, yi) from the line y = a + bx. The resulting line we denote by µ̂(x) = α̂ + β̂x; α̂ and β̂ denote the "least squares" estimates of α and β. Fortunately, there are exact formulae for the estimates of α and β:

    β̂ = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²,   α̂ = ȳ − β̂x̄,   so that  µ̂(x) = ȳ + β̂(x − x̄).

CHALLENGE. Can you derive the least squares estimates α̂ and β̂?

Hint: To find where ∆ is a minimum, note that

    Σ (yi − a − bxi)² = Σ [(yi − ȳ) − b(xi − x̄)]² + n(ȳ − a − bx̄)²,

using Σ z² = Σ (z − z̄)² + nz̄². ∆ is minimised by setting the second term to zero and minimising the first term. Alternatively, you could find ∂∆/∂a and ∂∆/∂b, equate them to zero, and solve.

Now,

    β̂ = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)² = sxy/s²x = (sxy/(sx sy)) (sy/sx) = r sy/sx,

so β̂ can be evaluated from sx, sy and r. In fact, the statistics n, x̄, ȳ, sx, sy and r are sufficient for any computation relating to straight line regression!

A neat form for the least squares regression line is ys = r xs, in standardised scores:

    (y − ȳ)/sy = r (x − x̄)/sx  ⇔  y = [ȳ − r(sy/sx)x̄] + [r(sy/sx)]x,  i.e.  y = α̂ + β̂x.

EXAMPLE 8.4.1: (Recurrence time, continued)


For the recurrence time data, we have:
n = 10, x̄ = 20.0, ȳ = 48.4; sx = 12.53, sy = 14.19; r = −0.7953.

So β̂ = −0.7953 × (14.19/12.53) = −0.9009 and α̂ = 48.4 − (−0.9009)×20.0 = 66.4177.
Therefore µ̂(x) = 66.42 − 0.901x, so that, for example, the mean of Y when
x=16 is estimated to be 52.0 (= 66.42 − 0.901×16).

The fitted line is shown on the scatter plot below:


[Figure: scatter plot of the recurrence time data with the fitted line µ̂(x) = 66.42 − 0.901x]

Actually, α̂ and β̂ are available on many calculators; and from the computer, so
you don’t even need to do the calculation from n, x̄, ȳ, sx , sy and r!
In R, we use the function lm():

> fit <- lm(y ~ x) # "linear model" of y against x

On the left-hand side of the symbol “˜” we include the response. On the right
hand side we put one or more predictors.

The function summary() produces the following output:

> summary(fit)

Call:
lm(formula = y ~ x)

Residuals:
Min 1Q Median 3Q Max
-11.400 -5.400 -1.787 7.474 11.348

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 66.4177 5.6481 11.759 2.5e-06 ***
x -0.9009 0.2428 -3.711 0.00595 **
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 9.124 on 8 degrees of freedom


Multiple R-squared: 0.6325, Adjusted R-squared: 0.5866
F-statistic: 13.77 on 1 and 8 DF, p-value: 0.00595

. . . from which we see that α̂ = 66.418 and β̂ = −0.9009; and other output, which we will learn about in the next section.

EXAMPLE 8.4.2: (CO concentration, continued)


n = 12, x̄ = 1995.42, ȳ = 13.1167; sx = 838.50, sy = 5.7733;
r = 0.9476.
So β̂ = 0.9476 × (5.7733/838.50) = 0.006524 and α̂ = 13.1167 − 0.006524×1995.42 = 0.098316.

The fitted regression line is µ̂(x) = 0.0983 + 0.0065x.


EXERCISE. Use a calculator or computer to obtain α̂ and β̂ directly.
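A sketch of how this might be done in R (the data are typed in from the table above; the variable names are ours):

> cars <- c(980, 1040, 1135, 1450, 1510, 1675, 1890, 2225, 2670, 2935, 3105, 3330)
> co <- c(9.0, 6.8, 7.7, 9.6, 6.8, 11.3, 12.3, 11.8, 20.7, 19.2, 21.6, 20.6)
> coef(lm(co ~ cars))                     # alpha-hat = 0.0983, beta-hat = 0.0065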

8.5 Inference for the straight line regression model

In order to carry out inference, i.e. to find confidence intervals and test hypotheses, we
need expressions for the variances of the estimators.
Regression is concerned with estimating the mean of Y for a given x. So for statistical inference to do with regression, we treat the xs as constants, and the Y s as random variables.
Let B̂ denote the estimator of β. Under the assumption that var(Y | x) = σ 2 :
var(Ȳ) = σ²/n,  var(B̂) = σ²/K,  where K = Σ(x − x̄)²; and Ȳ & B̂ are independent.
[Note that K = (n−1)sx².]

[We use K to indicate that B̂ behaves a bit like Ȳ, but with n replaced by K.]
It follows that, if M̂(x) denotes the estimator of µ(x), that is,
M̂(x) = Ȳ + (x − x̄)B̂,
then, since Ȳ and B̂ are independent, we have

var(M̂(x)) = (1/n + (x − x̄)²/K) σ².

Note that Â = M̂(0) (i.e. α̂ = µ̂(0)).

EXAMPLE 8.5.1: (Recurrence time, continued)


var(Ȳ) = 0.1σ², var(B̂) = 0.000708σ², var(Â) = 0.383σ², var(M̂(16)) = 0.111σ².
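These multipliers can be checked from the summary statistics alone (a sketch; recall K = (n−1)sx²):

> n <- 10; xbar <- 20.0; sx <- 12.53
> K <- (n-1) * sx^2                 # sum of (x - xbar)^2: 1412.9
> 1/n                               # multiplier for var(Ybar): 0.1
> 1/K                               # multiplier for var(Bhat): 0.000708
> 1/n + (0 - xbar)^2/K              # multiplier for var(Ahat): 0.383
> 1/n + (16 - xbar)^2/K             # multiplier for var(Mhat(16)): 0.111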
Now, generally, the error variance σ 2 is unknown, and so these variances will be unknown.
In order to estimate the variances, we need to estimate σ 2 . To do this, we use the residuals,
which are also useful as a model-checking device.

Residuals

We define the residuals as


êi = yi − α̂ − β̂xi ,
i.e. the deviation of the observed (yi ) from the “expected”, the fitted mean (α̂ + β̂xi ). On
the scatter plot, the residual is the vertical distance between yi and the fitted line.
We have yi = α + βxi + ei (the straight line regression model);
and yi = α̂ + β̂xi + êi (from the definition of residuals).
Here α, β and ei are unknown; but α̂, β̂ and êi are observed. Since α̂ + β̂xi is supposed to
be close to α + βxi , it follows that the residuals, êi should be close to the errors ei .

Thus the residuals can be used to check the fit of the model, since they should behave like
the model errors, ei , i.e. independent observations on N(0, σ 2 ).
If they do not, then the model should be questioned.
◦ independence? Look for patterns in residual plots: any pattern is an indication of non-
randomness. For example, if the residuals vs fitted values follow a curve, this suggests a
curved regression, rather than a straight line. If the residuals vs observation order (often
time) shows a trend, it suggests that the regression is varying with time.
◦ mean zero? If not, there is a mistake, since av(êi) = av(yi − α̂ − β̂xi) = ȳ − α̂ − β̂x̄ = 0.
◦ equal variances? This may show in residual plot, though only with a lot of points. If the
residuals are close to the horizontal axis, it suggests a small error variance; if they are widely
spread it indicates a large error variance. The spread should be “about the same” for all fitted
values.
◦ normality? Use QQ-plots or normal-plots for the residuals. The normal plot of the residuals
should be close to a straight line if the errors are normally distributed.

R produces residual graphs that help to check these assumptions. The command plot(fit),
with fit being the regression output, gives the following plots.

[Figure: Residuals vs Fitted plot and Normal Q-Q plot of the standardised residuals for the recurrence-time regression]

The residual sum of squares (error SS) is given by: d² = Σ(yi − α̂ − β̂xi)².

And, to estimate σ², we use the error mean square (error MS): s² = d²/(n−2),

which is unbiased for σ². The divisor is n−2, since there are two parameters to be estimated in the regression model. Another way of saying this is that {ê1, . . . , ên} has n−2 degrees of freedom, since Σêi = 0 and Σxi êi = 0.

Note: Since the sample mean of the residuals is zero, the sample variance of the residuals is (1/(n−1)) Σêi². But, as two parameters have to be estimated to make the residual mean zero, we choose to divide by n−2 rather than n−1.

Note: Computational formula for s², for hand computation: s² = ((n−1)/(n−2)) sy² (1 − r²), which again depends only on the specified statistics.
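As a sketch, this hand computation can be packaged as a small R function (the values used are those of the next example):

> s2.from.stats <- function(n, sy, r) (n-1)/(n-2) * sy^2 * (1 - r^2)
> s2.from.stats(n = 10, sy = 14.19, r = -0.7953)   # 83.25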

EXAMPLE 8.5.2: (Recurrence time, continued)


s² = (9/8) × 14.19² × (1 − 0.7953²) = 83.25, s = 9.124.

Such computation is usually unnecessary: for example R gives s in the regression output. Check the R output given above (listed as Residual standard error). To extract residuals and compute the residual sum of squares write:

> fit$residuals # residuals
1 2 3 4 5 ...
-1.165487 -3.913274 -2.408850 8.095575 11.347788 ...
> res <- fit$residuals
> sum(res^2) # residual sum of squares
[1] 666.0239
> sqrt(sum(res^2)/(10-2)) # s
[1] 9.124307

with the last line being s.

However, this error variance estimate is not available in many calculators.


For inference, we need to make an assumption about the distribution of Y . The simplest,
and most commonly made, is to assume that Y is normally distributed:
i.e. Yi ~ N(α + βxi, σ²).

In this case:

Ȳ ~ N(α + βx̄, σ²/n) and B̂ ~ N(β, σ²/K); and further, Ȳ and B̂ are independent.

These results can be used for inference on α, β; and hence on µ(x).

In this case, S² has n−2 degrees of freedom. So, when S replaces σ, tn−2 replaces N.

Thus, for example: (B̂ − β)/(S/√K) ~ tn−2,

so that a 95% confidence interval for β is given by:

β̂ ± c0.975(tn−2) s/√K   [i.e. est ± “2”se . . . again.]

Similarly, a 95% confidence interval for µ(x):

µ̂(x) ± c0.975(tn−2) √(s²(1/n + (x − x̄)²/K))   [est ± “2”se].
Note: This interval gives a confidence interval for the unknown mean of Y . It says nothing
about an actual observed value.
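In R, 95% confidence intervals for the coefficients come straight from the fitted model via confint(); the interval for β can also be checked by hand (a sketch, with fit the recurrence-time model from above):

> confint(fit, level = 0.95)                        # CIs for intercept and slope
> -0.9009 + c(-1, 1) * qt(0.975, df = 8) * 0.2428   # (-1.461, -0.341)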

EXAMPLE 8.5.3: (Recurrence time, continued)



95% CI for β: −0.901 ± 2.306 × √(83.25 × 0.000708), i.e. (−1.461 < β < −0.341).

95% CI for µ(16): 52.00 ± 2.306 × √(83.25 × 0.111), i.e. (44.98 < µ(16) < 59.03).
In the above example, the confidence interval for µ(16) is a statement about the unknown
mean of Y when x = 16. It says nothing about an actual observation for x = 16.

To do that, a prediction interval for Y is required: an interval within which we are 95%
sure that a future observation will lie.
To obtain a prediction interval, we use:
Z* = Y − µ̂(x) ~ N(0, σ²(1 + 1/n + (x − x̄)²/K)).

It follows that a 95% prediction interval for Y is given by:

µ̂(x) ± c0.975(tn−2) √(s²(1 + 1/n + (x − x̄)²/K)).

Observe that the 95% confidence interval takes the form est ± “2”se, where se = √(s²c); and the 95% prediction interval takes the form est ± “2”pe, where pe = √(s²(1 + c)), with c = 1/n + (x − x̄)²/K.
Note that the width of the confidence interval and of the prediction interval depends upon how close the value of x is to x̄; the further away from x̄, the wider the interval. In the above, c depends on x through the term (x − x̄)²/K: this term is zero when x = x̄, and it can be quite large when x is a long way from x̄.

EXAMPLE 8.5.4: (Recurrence time, continued)


A 95% prediction interval for Y at x=16 is given by:

52.00 ± 2.306 × √(83.25 × 1.111), i.e. 29.82 < Y(16) < 74.19.

These intervals can be obtained in R using the function predict() which al-
lows us to enter the value of x for which intervals are required. For this exam-
ple, we enter 16. This gives the confidence interval and prediction interval at
the end of the regression output.

> fit <- lm(y ~ x) # fit regression


> new <- data.frame(x = 16) # new predictor(s)
> predict(fit, new, se.fit=T, interval="confidence") # confidence interval
$fit
fit lwr upr
1 52.00354 44.98315 59.02393

$se.fit
[1] 3.044395

> predict(fit, new, se.fit=T, interval="prediction") # prediction interval


$fit
fit lwr upr
1 52.00354 29.82255 74.18453

$se.fit
[1] 3.044395

Note that the option interval specifies either a confidence interval for µ(16) (the mean response at x = 16) or the prediction interval for Y(16) (a new response at x = 16).
Hypothesis testing
It is standard to carry out a test of the null hypothesis H0 : β = 0 (testing the utility of the
model: is x of any use in predicting y?). If β is not significantly different from 0 then the
(linear) relationship between x and y is weak and knowing the value of x will not be of
much use in predicting the value of y.

The test statistic is

t = β̂ / (s/√K)   [i.e. t = (est − 0)/se],
with critical values obtained from the tn−2 distribution.
For the recurrence time example, t = −3.71 and we reject β=0.

All of this is most easily done using a statistical package, such as R, which produces output
like the following:
> summary(fit)

Call:
lm(formula = y ~ x)

Residuals:
Min 1Q Median 3Q Max
-11.400 -5.400 -1.787 7.474 11.348

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 66.4177 5.6481 11.759 2.5e-06 ***
x -0.9009 0.2428 -3.711 0.00595 **
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 9.124 on 8 degrees of freedom


Multiple R-squared: 0.6325, Adjusted R-squared: 0.5866
F-statistic: 13.77 on 1 and 8 DF, p-value: 0.00595

Note that as well as estimates of parameters, standard errors (“Std. Error”) and p-values
for testing H0 : α=0 and H0 : β=0, R gives the goodness of fit F-statistic and its p-value at
the bottom, which we ignore at this stage. The F test in the last line is just a test of H0 :
β=0, which is equivalent to the t test given above. The value of the F-statistic is equal to the value of t² (13.77 = (−3.71)²).
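The t statistic and its p-value can be extracted from the fitted model, and the equivalence with the F statistic verified (a sketch):

> coef(summary(fit))["x", ]                # estimate, se, t value, p-value
> tval <- coef(summary(fit))["x", "t value"]
> 2 * pt(-abs(tval), df = 8)               # two-sided p-value: 0.00595
> tval^2                                   # 13.77, the F statistic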
It is common to give a value of R² (R-squared):

R² = regression SS / total SS,

which is called the coefficient of determination. R² can be thought of as the proportion of the variation in y that is accounted for by the regression on x. In the above, R² = 1146.6/1812.4 = 0.633.
An alternative (adjusted) version is obtained using 1 − R²adj = s²/sy², from which we obtain R²adj = R² − (1 − R²)/(n−2). It follows that R²adj ≤ R². In the above, R²adj = 0.633 − (1 − 0.633)/8 = 0.587.

EXAMPLE 8.5.5: (Humidity)


Material is stored in a place which has no humidity control. Measurements of
the relative humidity (x) in the storage place and the moisture content (y) of the
sample of the material (both in percentages) on each of twelve days yielded the
following results:

x 42 35 50 43 48 62 31 36 44 39 55 48
y 12 8 14 9 11 16 7 9 12 10 13 11

(a) Plot a scatter diagram to verify that it is reasonable to assume that the regression of Y on x is linear.

[Figure: scatter plot of moisture content (y) against relative humidity (x)]

(b) Fit a straight line regression by the method of least squares.
α̂ = −0.950, β̂ = 0.269.

(c) Find a 95% confidence interval for the slope of the regression line.
s = 1.101, se(β̂) = 0.0377; 95% CI for β: (0.185, 0.353).

EXAMPLE 8.5.6: (Blood measures)


Two measures (x and y) of a blood value are measured for a sample of individ-
uals with the following results:

i 1 2 3 ··· 100
xi 57.8 66.6 63.1 ··· 59.1
yi 51.1 55.1 58.5 ··· 53.9

For these data, the following values have been calculated:

n = 100, x̄ = 64.114, ȳ = 54.924, sx = 10.5781, sy = 4.6461 and r = 0.8830.

(a) Estimate the straight line regression which would be used to predict y using
x.

(b) Give an estimate of the error variance.

(c) Find the standard error for the slope estimate, se(β̂).

(d) The 95% confidence interval for µ(60) can be expressed in the form m ± c √(s²/n + f se(β̂)²).
Specify values for m, c and f.

Answers:
(a) β̂ = 0.8830 × (4.6461/10.5781) = 0.3878; α̂ = 54.924 − 0.3878×64.114 = 30.06.
(b) s² = (99/98) × 4.6461² × (1 − 0.8830²) = 4.804.
(c) se(β̂) = √(4.804/11077.64) = 0.0208.
(d) m = µ̂(60) = 30.06 + 0.3878×60 = 53.33;
c = c0.975(t98) = 1.984;
f = (x − x̄)² = (60 − 64.114)² = 16.93.
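A sketch of these computations in R, working only from the quoted summary statistics:

> n <- 100; xbar <- 64.114; ybar <- 54.924
> sx <- 10.5781; sy <- 4.6461; r <- 0.8830
> beta.hat <- r * sy/sx                    # (a) 0.3878
> alpha.hat <- ybar - beta.hat * xbar      # (a) 30.06
> s2 <- (n-1)/(n-2) * sy^2 * (1 - r^2)     # (b) 4.804
> K <- (n-1) * sx^2
> sqrt(s2/K)                               # (c) se(beta-hat) = 0.0208
> alpha.hat + beta.hat * 60                # (d) m = 53.33
> qt(0.975, df = n-2)                      # (d) c = 1.984
> (60 - xbar)^2                            # (d) f = 16.93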

EXAMPLE 8.5.7: (Reticulocytes and Lymphocytes)


The data given below are for fourteen patients with aplastic anaemia. The ob-
served variables are x = % reticulocytes and y = lymphocytes (per mm2 ). A .txt
file containing the data looks like this:

ret lymph
3.6 2240
2.0 2678
0.3 1820
0.3 2206
0.2 2086
3.0 2299
0.2 1276
1.0 2088
2.2 2013
2.7 2600
3.2 2684
1.6 1840
2.5 1760
1.4 1950

Using an appropriate path, the data are imported as follows.

> mydata <- read.table("lymph.txt", header=TRUE) # read lymph.txt file

Summary statistics and regression output are shown below:

> summary(mydata)
ret lymph
Min. :0.200 Min. :1276
1st Qu.:0.475 1st Qu.:1868
Median :1.800 Median :2087
Mean :1.729 Mean :2110
3rd Qu.:2.650 3rd Qu.:2284
Max. :3.600 Max. :2684

> fit <- lm(lymph ~ ret, data = mydata) # fit linear model
> summary(fit)

Call:
lm(formula = lymph ~ ret, data = mydata)

Residuals:
Min 1Q Median 3Q Max
-558.90 -200.56 -36.36 294.66 519.15

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1798.91 162.69 11.057 1.2e-07 ***
ret 179.97 78.35 2.297 0.0404 *
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 337.3 on 12 degrees of freedom


Multiple R-squared: 0.3054, Adjusted R-squared: 0.2475
F-statistic: 5.276 on 1 and 12 DF, p-value: 0.04042
i. Specify the fitted regression of y on x.
µ̂(x) = 1799 + 180 x.

ii. Estimate the slope of the regression line, and interpret it in the context.
β̂ = 180.0. This is the estimated increase in the mean lymphocyte count per unit percentage increase in
reticulocytes (for 0 < %ret < 4).
iii. Find a 95% confidence interval for this slope.
179.97 ± 2.179×78.35 = (9, 351).
iv. Suppose %reticulocyte = 2.0%, what do you expect the lymphocyte count to be?
µ̂(2.0) = 2160 . . . 1400 < y(2.0) < 2920.
v. If y(2.0) = 2678, what is the residual for this observation?
ê = 2678 − 2158.8 ≈ 520. Mark it on the scatter plot.
vi. Specify a 95% confidence interval for the mean lymphocyte count for x = 2.
(1960, 2360).
vii. Specify the prediction error for x = 2.
pe = √(s² + se²) = √(113753 + 92.6²) ≈ 350; also (2921 − 1397)/(2×2.179) ≈ 350.

viii. Find a 95% confidence interval for the mean lymphocyte count for x = 3.0.
µ̂(3.0) = 1798.9 + 179.97×3.0 = 2338.8 ≈ 2340;
se[µ̂(3.0)] = √(113753/14 + (3 − 1.729)² × 78.35²) = 134.32 ≈ 134;
95% CI: 2338.8 ± 2.179×134.32 = (2050, 2630).

8.6 Case study: Blood fat content

Kleinbaum and Kupper (1978, Applied Regression Analysis and Other Multivariable Methods, Duxbury Press) provide a data set containing measurements on age, weight
and blood fat content for 25 individuals. We are interested in the relationship between age
(x) and blood fat content (y). This data set is available as BFC.txt. Load this data using
the commands
> BFC <- read.table(’BFC.txt’, header=T)
> Age <- BFC$Age
> Bfc <- BFC$BloodFatContent
This stores the age and blood fat content of the 25 individuals into Age and Bfc, respec-
tively. A scatterplot of Age and Bfc (shown below) can be obtained using the command
> plot(Age, Bfc, xlab = "Age", ylab = "Blood Fat Content", las = 1)

[Figure: scatterplot of blood fat content against age]
From this scatterplot, we see that there is a positive correlation between age and blood fat
content. That is, older people tend to have higher blood fat content, and younger people
tend to have lower blood fat content. In fact, using the command cor(Age, Bfc), we
find that the sample correlation coefficient is r = 0.837, and from the statistic-parameter di-
agram a 95% confidence interval for this correlation coefficient is (0.66, 0.93). This indicates
that there is a strong positive linear relationship between Age and Bfc.
Now we will fit a least squares regression line to the data. The command summary(model
<- lm(Bfc ~ Age)) will fit the linear regression model, store it into the model object,
and produce the summary output below.
> summary(model <- lm(Bfc ~ Age))

Call:
lm(formula = Bfc ~ Age)

Residuals:
     Min       1Q   Median       3Q      Max
 -63.478  -26.816   -3.854   28.315   90.881

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 102.5751 29.6376 3.461 0.00212 **
Age 5.3207 0.7243 7.346 1.79e-07 ***
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 43.46 on 23 degrees of freedom


Multiple R-squared: 0.7012, Adjusted R-squared: 0.6882
F-statistic: 53.96 on 1 and 23 DF, p-value: 1.794e-07
From this output we can identify:
• The estimated intercept, α̂ = 102.6, and its standard error se(α̂) = 29.6;
• The estimated slope, β̂ = 5.32, and its standard error se(β̂) = 0.72;
• The estimate of the error standard deviation, s = 43.46 (listed as Residual standard error), so that the error variance estimate is s² = 43.46² ≈ 1889.
Check to make sure you can identify these values in the output above.
We can interpret the value of β̂ as follows: according to our model, the mean blood fat
content increases by a fixed amount for each year increase in age. We estimate this amount
as β̂ = 5.32 units.
QUESTION: What is the equivalent interpretation of α̂?
The fitted least squares regression line is

µ̂(x) = 102.575 + 5.321 x.

Suppose that we want to estimate the mean blood fat content of a 55 year old individual.
Using our model, this individual’s mean blood fat content is estimated to be

µ̂(55) = 102.575 + 5.321 × 55 = 395.23.

Also given in the above summary output are the test statistics (t value column) and p-
values (Pr(>|t|) column) for hypothesis tests of H0: α = 0 against H1: α ≠ 0, and H0: β = 0 against H1: β ≠ 0. These p-values, which are both much lower than 0.05, lead
us to conclude that the true α and β are both significantly different from 0, at the 5% level
of significance.
The above scatterplot is reproduced below with the least squares regression line superim-
posed. This line can be added to the existing scatterplot using the command:
> abline(model$coefficients[1], model$coefficients[2])

[Figure: scatterplot of blood fat content against age, with the least squares regression line superimposed]

Do you think that the least squares regression line fits the data well? From the summary
output, R2 = 0.701. Therefore, 70.1% of the variation in blood fat content can be ex-
plained by age. We should also check that there are no violations of the model assump-
tions. The residuals vs fitted values and normal QQ plot are produced using the command
plot(model, which = c(1, 2)). Both figures are shown below.

[Figure: Residuals vs Fitted plot and Normal Q-Q plot for the blood fat content regression]

There are no obvious patterns in the residuals vs fitted values plot, and the normal QQ plot
is approximately linear. This suggests that the form of our linear regression model, and the
assumption that the random errors have a N (0, σ 2 ) distribution, are appropriate for these
data.
A confidence interval for µ(55) (the mean blood fat content for 55 year olds) is found using
the commands

> new <- data.frame(Age = 55)


> predict(model, new, interval = "confidence")
fit lwr upr
1 395.2123 365.3888 425.0359
We thus find that the mean blood fat content for 55 year olds is somewhere in the interval
(365.389, 425.036), with 95% confidence.
Now suppose that we have a specific 55 year old individual in mind. A prediction interval
for this person’s blood fat content (with 95% probability) is found using the command
> predict(model, new, interval = "prediction")
fit lwr upr
1 395.2123 300.4883 489.9363
Therefore, the prediction interval for this person’s blood fat content is (300.488, 489.936).
Observe that the prediction interval is wider than the confidence interval.

Problem Set 8
8.1 This problem investigates the relationship between FEV (litres) and age (years) for boys. Some
R output is shown below. The sample has 336 boys and their mean age is 10.02 years.

Call:
lm(formula = FEV ~ age)

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0736 0.1128 0.65 0.514
age 0.2735 0.0108 25.33 0.000 ***
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 0.588102


Multiple R-squared: 0.658, Adjusted R-squared: 0.657
i. Looking at the scatter plot, do you think there is a relation between FEV and age?
Do you think a simple linear regression is appropriate for modelling the relationship?
ii. What are the assumptions of the simple linear regression model?
iii. What is the slope of the fitted regression line and what is its interpretation?
iv. Predict the FEV for a 10-year-old boy.
v. The output gives R-sq = 65.8%. What is the interpretation of this?
vi. Compute the correlation coefficient between FEV and age for boys.

8.2 The table below gives the corresponding values of variables x and y.
x 5 6 7 8 10 11 12 13 14 14
y 28 20 26 28 24 16 22 10 12 14
For these data, check the following calculations:
x̄ = 10, ȳ = 20; sx = 3.333, sy = 6.667, r = −0.8.
i. Assuming that E(Y | x) = α + βx and var(Y | x) = σ 2 , obtain estimates of α and β using
the method of least squares. Plot the observations and your fitted line.
ii. Show that s2 = 18. Hence obtain se(β̂), and derive a 95% confidence interval for β.

iii. Find the sample correlation and, assuming the data are from a bivariate normal popula-
tion, find a 95% confidence interval for the population correlation.

8.3 A random sample of n = 50 observations are obtained on (X, Y ). For this sample, it is found
that x̄ = ȳ = 50, sx = sy = 10 and the sample correlation rxy = −0.5.
i. Indicate, with a rough sketch, the general nature of the scatter plot for this sample.
ii. On your diagram, indicate the fitted line for the regression of y on x.
iii. Give an approx 95% confidence interval for the population correlation.

8.4 A data set containing 6 columns of data was created by the English statistician Frank Anscombe.
The scatterplots arising from these data are sometimes called the “Anscombe quartet”.
x1 y1 y2 y3 x4 y4
10 8.04 9.14 7.46 8 6.58
8 6.95 8.14 6.77 8 5.76
13 7.58 8.74 12.74 8 7.71
9 8.81 8.77 7.11 8 8.84
11 8.33 9.26 7.81 8 8.47
14 9.96 8.10 8.84 8 7.04
6 7.24 6.13 6.08 8 5.25
4 4.26 3.10 5.39 19 12.50
12 10.84 9.13 8.15 8 5.56
7 4.82 7.26 6.42 8 7.91
5 5.68 4.74 5.73 8 6.89
i. Carry out four simple linear regressions: y1 on x1 , y2 on x1 , y3 on x1 and y4 on x4 . What
do you notice about the results?
ii. Look at the four scatterplots of the data with the corresponding fitted line. Anscombe
concocted these data to make a point. What was the point? What would you conclude
about the appropriateness of simple linear regression in each case?
iii. What are the observed and predicted values of y4 at x4 = 19? Change the y4 value for
this datum to 10 and refit the regression. What are the observed and predicted values at
x4 = 19 now?
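A sketch of part i. in R (the vectors are typed in from the table above; R also ships essentially these data as the built-in anscombe data frame):

> x1 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)
> y1 <- c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68)
> y2 <- c(9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74)
> y3 <- c(7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73)
> x4 <- c(8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8)
> y4 <- c(6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89)
> fits <- list(lm(y1 ~ x1), lm(y2 ~ x1), lm(y3 ~ x1), lm(y4 ~ x4))
> lapply(fits, coef)            # near-identical intercepts (about 3) and slopes (about 0.5)
> par(mfrow = c(2, 2))          # part ii: the four scatterplots with fitted lines
> plot(x1, y1); abline(fits[[1]])
> plot(x1, y2); abline(fits[[2]])
> plot(x1, y3); abline(fits[[3]])
> plot(x4, y4); abline(fits[[4]])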
8.5 Researchers speculate that the level of a particular type of chemical found in a patient’s blood
affects the size of a hepatocellular carcinoma. Experimenters take a random sample of 25 pa-
tients and both assess the size of their tumours (cm) and test for the levels of this chemical in
their blood (mg/L). The mean chemical level was found to be 45mg/L. A simple linear regres-
sion is fitted; a partial R output is below.
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.2981 0.05134 ? ?
x -0.15123 0.00987 ? ?
---
Residual standard error: 1.213
Multiple R-squared: 0.895
(a) What is the response variable? What is the explanatory variable?
(b) Write down the general model equation, stating all assumptions about the errors. How
could you graphically check each of these assumptions?
(c) i. Write down an estimate of the slope of the regression line.
ii. Write a sentence interpreting this slope in the context of the question.
(d) Use the R output to determine whether there is evidence, at the 5% level, that the chemical
level affects tumour size.
(e) Based on the result of your test in (d), would a 95% confidence interval for the true slope
contain zero? Explain why or why not.
(f) Suppose the chemical level in a patient’s blood is 25mg/L.
i. What do you expect the tumour size to be?
ii. If the actual size is 8cm, calculate the residual for this observation.
iii. Construct a 90% confidence interval for the expected tumour size.
(g) Write a sentence interpreting the R-sq value in the context of the question. Using it,
calculate the sample correlation coefficient.

8.6 Low-density lipoprotein (LDL) cholesterol has consistently been shown to be related to car-
diovascular disease in adults. Researchers are interested in factors that may be associated to
LDL cholesterol in children. One such factor is obesity, which is measured by the ponderal in-
dex (kg/cm3 ). 162 children are sampled and it is found that the sample correlation coefficient
between LDL cholesterol and ponderal index is 0.22.
(a) Find a 95% confidence interval for the correlation.
(b) A simple linear regression can be fitted to the data, with the true slope denoted by β.
Based on (a), what do you expect to be the result of a hypothesis test: H0 : β = 0 versus
H1 : β 6= 0?
(c) Calculate the coefficient of determination and write a sentence interpreting it in the con-
text of the question.

8.7 Marty posted the following question on a discussion list.


“I have measured two variables, Fe concentration and Protein, in three lichen species: S1,
S2, S3. I took 30 measurements, ten per species. The question was: Is there any correlation
between Fe and Protein? I took the correlation coefficients and the regression lines between
these two parameters for each species separately as well as the correlation coefficients and
the regression lines of pooled data independent of species. The problem is that while the
correlation in any species is positive the overall correlation is negative!!! In the following
data, that are given as an example, the correlation coefficients are r1 = 0.88, r2 = 0.76, and
r3 = 0.90 while the overall correlation is r = −0.67!!!!! Can that be right? Whatever, the
question remains: Is the correlation between these two variables positive or negative? What’s
the conclusion?”
The relevant data are given below.
Fe Protein Species Fe Protein Species Fe Protein Species
3 23 1 6 29 1 4 22 1
14 9 3 10 20 2 12 4 3
17 11 3 17 10 3 15 8 3
6 12 2 7 18 2 12 7 3
13 6 3 2 20 1 3 24 1
13 8 3 2 23 1 7 27 1
19 12 3 6 15 2 7 15 2
8 16 2 16 10 3 9 30 1
12 18 2 11 21 2 5 26 1
9 18 2 7 26 1 10 17 2
(a) Produce a scatter plot of Fe against Protein that identifies the three species on the plot.
(b) Check the results asserted by the questioner:
i. What is the correlation overall?
ii. What are the correlations within each species?
(c) What is the conclusion?

8.8 Two measures are evaluated for each of fifty cases: the sample correlation between these mea-
sures is evaluated as –0.40. Find a 95% confidence interval for the correlation. Is this evidence
of a relationship between the two measures? Explain.

8.9 Is cardiovascular fitness (as measured by time to exhaustion running on a treadmill) related
to an athlete’s performance in a 20 km ski race? The following data were collected in a study:
x = treadmill run time to exhaustion (in minutes) and y = 20 km ski time (in minutes).
x 7.7 8.4 8.7 9.0 9.6 9.6 10.0 10.2 10.4 11.0 11.7
y 71.0 71.4 65.0 68.7 64.4 69.4 63.0 64.6 66.9 62.6 61.7
(a) The correlation coefficient is r = −0.796. Test the hypothesis that the two variables are
uncorrelated.
(b) A simple linear regression analysis is carried out in R, giving the following output:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 88.80 5.750 15.44 0.000
ret -2.334 0.591 ? ?
---
Residual standard error: 2.188
Multiple R-squared: 0.636

i. The row headed ret has missing values for the t-ratio and p-value. Explain what
these numbers pertain to (no need to calculate them) and whether they bear any
relation to the test in (a).
ii. Suppose an athlete has treadmill run time to exhaustion of 10 minutes. Give a 95%
prediction interval for his 20 km ski time. (The x-values have mean x̄ = 9.66.)

8.10 (FEV vs age data from Problem 8.1: some inference questions.)
i. The null hypothesis states that there is no relation between FEV and age. State this in
terms of an appropriate parameter and test it.
ii. Obtain a 95% CI for the slope of the regression line of FEV on age.
iii. Obtain an estimate of the error variance σ 2 .
iv. Obtain a 95% CI for the mean FEV of 10 year-old boys.
v. Obtain a 95% prediction interval for the FEV of a 10 year-old boy.
REVISION PROBLEMS

Revision Problem Set R1

R1.1 (a) Write two sentences to compare and contrast observational study and experiment.
(b) A “randomised controlled trial” is the gold-standard for medical experiments.
i. Give an explanation of the importance of randomisation to convince a doubting sci-
entist of its value.
ii. What is meant by “control”? Why is it important?
(c) It is thought that exposure E is a possible cause of D, so that E and D ought to be pos-
itively related. However, a recent study showed a negative relationship between E and
D. It was discovered that this was due to a confounding factor C, which is a known cause
of D, and which was strongly negatively correlated with E among the individuals used
in the study.
Draw a diagram to illustrate this situation.

R1.2 (a) A sample of nine observations is supposed to be a random sample from a Normal popu-
lation. The order statistics for this sample are as follows:
40.4 48.8 54.8 59.2 64.1 65.0 68.7 72.7 75.1
i. Evaluate the sample median and sample quartiles. Hence draw the boxplot for these
data.
ii. If the data are a random sample from a N(µ, σ 2 ) population, explain why E(X(1) ) ≈
µ − 1.28σ.
iii. Sketch a Normal QQ-plot for these data, clearly labelling the axes. Indicate how
estimates of µ and σ could be obtained on your diagram.
(b) For the data in (a), the calculator gives x̄ = 60.977778 and s = 11.376926. Specify a point
estimate and a 95% interval estimate for µ.

R1.3 (a) Write two sentences to compare and contrast independent and mutually exclusive.
(b) The risk ratio is p1/p2, and the odds ratio is p1(1 − p2) / ((1 − p1)p2).
i. If the odds ratio is 2, and p1 = 0.1, find the risk ratio.
ii. If the odds ratio is 2, and p1 → 0, what happens to the risk ratio?
iii. If the odds ratio is 2, and p1 → 1, what happens to the risk ratio?
iv. A case-control study gives an estimate of the odds ratio relating exposure E and
disease D of 2.0. What can you say about the relative risk of D with and without
exposure E?


R1.4 (a) Write two sentences to compare and contrast prevalence and incidence.
(b) Individuals with disease D have chemical L at high levels in the bloodstream. For these individuals, L ~ N(40, 4²). There is a threshold beyond which the body has an overload problem.
i. Find Pr(L > 50).
ii. Suppose that the threshold is actually a random variable, T ~ N(50, 2²), which is independent of L. Find Pr(L > T).
(c) i. T1 and T2 are independent random variables with
E(T1 ) = E(T2 ) = θ and sd(T1 ) = 1, sd(T2 ) = 2.
Let T = wT1 + (1 − w)T2 . Show that var(T ) = 5w2 − 8w + 4 and hence show that
var(T ) is minimised when w = 0.8.
ii. Two independent random experiments have been carried out, each with the inten-
tion of estimating the parameter θ. The results are:
experiment 1 n1 = 40 θ̂1 = 50.0 se(θ̂1 ) = 1.0
experiment 2 n2 = 10 θ̂2 = 55.0 se(θ̂2) = 2.0.
Use these two results to give the optimal estimate of θ and specify its standard error.

R1.5 (a) Write two sentences to compare and contrast standard deviation and standard error.
(b) A prevalence study, i.e. a survey, collects data from a sample at a specific time-point (al-
though the time-point may be relative, as the survey may take a week or so to complete).
In such a survey of 2000 individuals, 350 of them had attribute H. Find a 95% confidence
interval for the prevalence of H.
(c) A cohort of 400 individuals is followed for a period of five years. The total observed
person-time was 1200 person-years, and 36 cases were observed.
i. Give a reason why the person-time is not 400×5 = 2000 person-years.
ii. Find an approximate 95% confidence interval for the incidence rate (cases per person-
year).

R1.6 (a) Write two sentences to compare and contrast p-value and power.
(b) Suppose that Z ~ N(θ, 1).
It is planned to use an observation on Z to test the hypothesis H0 : θ = 0.
Consider the test: “reject H0 if |Z| > 1.96”.
i. Show that this test has significance level 0.05.
ii. Show that this test has power 0.80 when θ = 2.80.
(c) A random sample of n observations is obtained on X ~ N(µ, 5²), i.e. σ is assumed known (and σ = 5).
Let Z = (X̄ − 40)/(5/√n). This is the test statistic used to test the hypothesis µ = 40.
i. Find E(Z) when µ = 41.
ii. How large a sample is required so that the z-test of µ = 40 based on X̄ with signifi-
cance level 0.05 has power 0.80 against the alternative µ = 41?
R1.7 A recent study compared the use of angioplasty (PTCA) with medical therapy in the treatment
of single-vessel coronary artery disease. At the six-month clinic visit, 10 of 40 patients seen in
the PTCA group and 20 of 40 patients seen in the medical therapy group, have had angina.
(a) Is there evidence in these data that PTCA is more effective than medical therapy in pre-
venting angina? Test the hypothesis that the probabilities are the same in the two groups.
Give a p-value and state your conclusion.
(b) Using these data, find an estimate of the odds ratio relating PTCA and angina.
By estimating ln OR, obtain a 95% confidence interval for the odds ratio.
R1.8 Two measures are evaluated for each of fifty cases: the sample correlation between these mea-
sures is evaluated as –0.40. Find a 95% confidence interval for the correlation. Is this evidence
of a relationship between the two measures? Explain.

Revision Problem Set R2


R2.1 You plan to conduct an experiment to test the effectiveness of the drug ZZZ, a new drug that
is supposed to reduce insomnia. You plan to use a sample of subjects that are treated with the
drug and another sample of subjects that are given a placebo.
(a) What is a placebo and why is it used?
(b) Why is randomisation important? How could it be used in this study?
(c) If analysis of the results of the data resulting from this study showed that there is signif-
icant improvement with ZZZ, does this indicate that the drug causes the improvement?
Explain.
R2.2 The dotplot and descriptive statistics for a random sample of eighty observations from a pop-
ulation are given below:

[Figure: dotplot of the sample]

Variable N Mean SEMean StDev Min Q1 Med Q3 Max


x 80 66.6 2.2 20.0 3.1 54.4 68.0 81.5 97.2
(a) Write a sentence or two describing this sample.
(b) Which one of the following is the QQ-plot for these data?
[Figure: three candidate QQ-plots, labelled [1], [2] and [3]]

Copy, roughly, your selected QQ-plot and indicate on your copy the labels and scales on
each axis.
(c) i. Give a rough approximation for a 95% confidence interval for the population mean.
ii. Give a rough approximation for a 95% prediction interval for an observation from
this population. Hint: 4/80 = 5%.
Note: You should not assume the population distribution is normal.
R2.3 A research paper modelled the Rayley Psychomotor Development Index in five year-old chil-
dren as having a normal distribution with mean 100 and standard deviation 10. Assume this
model is correct, and that a random sample of 200 observations is to be obtained.
(a) Indicate values that you would expect to observe for the five-number summary for this
sample, i.e. the minimum, the lower quartile, the median, the upper quartile and the
maximum, briefly explaining your reasoning.
Hence draw a likely boxplot for such a sample.
(b) An observation is nominated as a ‘potential outlier’ if it is more than 1.5 IQR above the up-
per quartile, or 1.5 IQR below the lower quartile, where IQR denotes inter-quartile range.
i. Show that, for a sample from a normally distributed population, the probability of a
potential outlier is 0.0070.
ii. What is the probability of at least one potential outlier in the sample of 200?
R2.4 (a) If 15% of university students are left handed, find the probability that, of a tutorial class
of sixteen students, at most one is left handed.
What assumptions have you made in evaluating this probability?
(b) Three research papers report estimates of µ and their standard errors. These results are
used to produce the following meta-analysis table used to obtain the optimal estimate
based on these three reported results.

est se 1/se2 w w×est


paper 1 1.51 0.20 25.00 0.7092 1.071
paper 2 1.44 0.40 6.25 0.1773 0.255
paper 3 1.46 0.50 4.00 0.1135 0.166
35.25 1.0000 1.492
i. What does the w column represent?
ii. Explain how the first number in the w column, 0.7092, is obtained.
iii. Use the above table to obtain the optimal estimate and its standard error.
R2.5 (a) Sketch a graph of the pdf of a t8 distribution, i.e. a t-distribution with 8 degrees of free-
dom, indicating the positions of the 0.025 and the 0.975 quantiles on your graph, and
specifying their values.
(b) A study was performed on the relationship between the concentration of plasma antioxi-
dant vitamins and cancer risk. Plasma vitamin-A concentration (µmol/L) were measured
for nine randomly chosen stomach-cancer cases: for these data, the mean is 2.65 and the
standard deviation is 0.36.
i. Find a 95% confidence interval for the mean vitamin-A concentration of stomach
cancer cases.
ii. The standard level is 2.90, based on the mean of a very large sample of controls. Determine whether there is a significant difference between the mean for the stomach-cancer cases and the controls. Explain. What is your conclusion?
R2.6 (a) Suppose that Z ~ N(θ, 1).
It is planned to use an observation on Z to test the hypothesis H0 : θ = 0.
Consider the test: “reject H0 if |Z| > 1.96”.
i. Show that this test has significance level 0.05.
ii. Show that this test has power 0.90 when θ = 3.24.
(b) A random sample of n observations is obtained on X ~ N(µ, 10²), i.e. σ is assumed known (and σ = 10).
Let Z = (X̄ − 30)/(10/√n). This is the test statistic used to test the hypothesis µ = 30.
i. Find E(Z). (Your answer should involve µ.)
ii. How large a sample is required so that the z-test of µ = 30 based on X̄ with signifi-
cance level 0.05 has power 0.90 against the alternative µ = 31?

R2.7 (a) A treatment for migraine is trialled in a double-blind randomised experimental study
involving 400 patients: 200 receive a placebo (P) and 200 receive the treatment (T). Three
months later, the patients report whether they were worse, the same, or better on the
medication they received. The results were as follows:
worse same better
T 40 100 60 200
P 60 100 40 200
100 200 100 400
To examine whether the treatment is having an effect, we test the hypothesis that treatment and outcome classification are independent using a χ² test. Let U = Σ (o − e)²/e denote the χ² statistic used to test for independence.
Show that, for the table above, u = 8.0; and give an approximate p-value.
What is your conclusion?
(b) i. When comparing two independent samples, we wish to test the null hypothesis that
the samples are drawn from the same population (H0 ). One way to do this is to use
an independent samples t-test.
What assumption is made about the common population distribution in applying
the independent samples t-test?
ii. Consider the following data
sample 1: 27, 28, 31, 33, 35, 45
sample 2: 41, 46, 94
We can use a rank test to test H0 . To do this, we use
z = (w̄1 − w̄2) / √((1/12) N(N+1) (1/n1 + 1/n2))

where w̄1 denotes the average rank for sample 1, w̄2 the average rank for sample 2, and N = n1 + n2.
Show that, for these data, z = −2.07, and hence that the rank test indicates rejection
of the null hypothesis at the 5% significance level.

Revision Problem Set R3

R3.1 (a) A study is to be conducted to evaluate the effect of a drug on brain function. The evalu-
ation consisted of measuring the response of a particular part of the brain using an MRI
scan. The drug is prescribed in doses of 1, 2 and 5 milligrams. Funding allows only 24
observations to be taken in the current study.
In a meeting to decide the design of the study, the following suggestions are made con-
cerning the conduct of the experiment. For each of the suggestions say whether or not
you think it is appropriate giving a reason for your answer.
(A) Amy suggests that a placebo should be used in addition to the three doses of the
drug. What is a placebo and why might its use be desirable?
(B) Ben says that the study should be conducted as a double-blind study. Explain what
this means, and why it might be desirable.
(C) Claire says that she is willing to be “the subject” for the study (i.e. to take different
doses of the drug and to have her response measured as often as is needed). Give
one point in favour of, and one point against this proposal.
(D) Don suggests that it would be better to have 24 subjects, and to allocate them at ran-
dom to the different drug doses. Give a reason why this design might be better than
the one suggested by Claire, and briefly explain how you would do the randomisa-
tion.
(E) Erin claims that it would be better to use 8 subjects, with each subject taking, on
separate occasions, each of the three different doses of the drug. Give one point
in favour of, and one point against this claim, and explain how you would do the
required randomisation.
(b) i. An exposure E is thought to cause disease outcome D. Suppose that C is a pos-
sible confounding factor. How would this be represented on a causal relationship
diagram?
ii. Smoking is thought to cause heart disease. Dr.W. claims that an individual’s level of
exercise may be a confounding factor. Represent the relationship between smoking
(S), heart disease (D) and above-average exercise level (X) on a causal relationship
diagram.
Mr.H. states that X should not be considered as a confounder. Holmes is right again!
Explain why exercise level should not be considered as a confounding factor for the
relation between smoking and heart disease.

R3.2 Consider the data set:


30.1 20.9 33.2 29.7 31.4 33.6 28.3 30.2 26.9 29.3
32.5 34.7 24.4 28.3 36.3 37.4 29.2 28.2 26.0
(a) Compute each of the following statistics for this data set:
i. the sample mean, x̄;
ii. the sample standard deviation, s;
iii. the upper quartile, Q3;
iv. the tenth percentile, ĉ0.1 .
(b) If this data set was obtained as a random sample on a N(31, 52 ) population, what values
would you expect for each of the statistics specified in (a)?
(c) The graph shown below is the normal QQ plot for this sample
i. Specify the coordinates of the indicated point.
ii. Use the diagram to obtain estimates of the population mean and standard deviation,
explaining you method.
iii. How does a normal probability plot relate to the normal QQ plot?
[Figure: normal QQ plot for the sample, with one point indicated]
R3.3 (a) Suppose events D and E are such that Pr(E) = 0.4, Pr(D | E) = 0.1, Pr(D | E ′ ) = 0.2.
i. Find Pr(D).
ii. Find Pr(E | D).
iii. Are D and E positively related, not related or negatively related? Explain.
iv. Specify the odds ratio for D and E.
(b) A new test for a disease, C, was applied to 100 individuals with disease C, and 100
individuals who do not have C. The following results were obtained:

test result   disease C   frequency
+             yes         85
+             no          10
−             yes         15
−             no          90
(A positive test result is supposed to indicate the presence of disease C.)
i. Find the estimated sensitivity and specificity of the test.
ii. Find the estimated positive predictive value when this test is applied to a population
for which the prevalence is 0.1.
iii. If the specificity were fixed at the estimated value, show that, for a population with
prevalence 0.1, the positive predictive value must be less than 53%.

R3.4 (a) Suppose that, in a population, 30% of individuals have attribute A. A random sample
of 240 is selected from this population. Let X denote the number of individuals in the
sample with attribute A. Find an approximate 95% probability interval for X.
(b) A cohort of individuals is observed for a total of 10 000 person-years. If the incidence rate
of disease B is 0.0022 per person-year, give an approximate 95% probability interval for
the number of cases of B in this cohort.
(c) Among healthy individuals in a particular population, the serum uric acid level Y mg/100L
is distributed as N(5.0, 0.82 ).
i. Find a 99% probability interval for Y .
ii. Find Pr(Y > 6.0).
iii. Find Pr(Y > 7.0 | Y > 6.0).
R3.5 Twelve independent observations are obtained on X ~ N(µ, 1), i.e. we have a random sample
of n=12 from a Normal population, for which the variance is known: σ 2 =1. The sample mean
for this sample is denoted by X̄.
To test H0 : µ = 10, we use the decision rule: “reject H0 if |X̄ − 10| > 0.6”.
(a) Find a 95% probability interval for X̄ if µ=10.
(b) Find the significance level of this test.
(c) Find the p-value if x̄ = 10.8.
(d) Find the power of the test if µ = 11.
(e) Find a 95% confidence interval for µ if x̄ = 10.8.
(f) Find a 95% prediction interval for X if x̄ = 10.8.

R3.6 (a) A study was conducted to examine the efficacy of an intramuscular injection of cholecal-
ciferol for vitamin D deficiency. A random sample of 30 sufferers of vitamin D deficiency
were chosen and given the injection. Serum levels of 25-hydroxyvitamin D3 (25OHD3 )
were measured at the start of the study and 4 months later. The difference X was calcu-
lated as (4-month reading – baseline reading).
For the sample of differences: sample mean = 15.0 and sample standard deviation = 18.4.
Construct a 95% confidence interval for the mean difference. What can you conclude?
(b) We are interested in estimating the prevalence of attribute B among 50-59 year-old women.
Suppose that in a sample of 2000 such women, 400 are found to have attribute B. Obtain
a point estimate and a 95% confidence interval for the prevalence.
(c) Of 1200 individuals employed at the PQR centre during the past ten years, 28 contracted
disease K. After adjusting for a range of covariates, the expected number of cases of K is
calculated to be 16.0.
i. Test the hypothesis that there is an excess risk of K at the PQR centre.
ii. The standardised morbidity ratio, SMR = µ/µ0 , where µ denotes the mean number
of cases among the subpopulation, and µ0 denotes the mean number of cases ex-
pected among the subpopulation if it were the same as the general population. Find
an approximate 95% confidence interval for SMR in this case.

R3.7 The data below are obtained from a trial comparing drug A, drug B and a placebo C. The
table indicates the number of individuals who reported improvement (I) with the treatment,
and those who did not.
improvement no improvement
drug A 15 10 25
drug B 10 15 25
placebo C 10 40 50
35 65 100
Let p1 = Pr(I | A), p2 = Pr(I | B), and p3 = Pr(I | C).
(a) The following (incomplete) R output was obtained to answer the question: “Is there a sig-
nificant difference between the proportion reporting improvement in the three groups?”,
i.e. test H0 : p1 = p2 = p3 .
Cell Contents
|-------------------------|
| N |
| Expected N |
| Chi-square contribution |
|-------------------------|

Total Observations in Table: 100



|
| [,1] | [,2] | Row Total |
-------------|-----------|-----------|-----------|
[1,] | 15 | 10 | 25 |
| 8.750 | 16.250 | |
| 4.464 | 2.404 | |
-------------|-----------|-----------|-----------|
[2,] | 10 | 15 | 25 |
| 8.750 | 16.250 | |
| 0.179 | 0.096 | |
-------------|-----------|-----------|-----------|
[3,] | 10 | 40 | 50 |
| 17.500 | 32.500 | |
| 3.214 | 1.731 | |
-------------|-----------|-----------|-----------|
Column Total | 35 | 65 | 100 |
-------------|-----------|-----------|-----------|
i. Explain how the values 8.750 and 4.464 can be calculated.
ii. Complete the test giving the p-value, and state your conclusion.
(b) Test the null hypothesis p1 = p2 .
(c) Assume that we increase the number of subjects treated with drug A and drug B, so that
n of each are tested. Find the sample size n required in order that we obtain a confidence
interval for p1 −p2 of half-width less than 0.15, i.e. the confidence interval should take the
form p̂1 −p̂2 ± h, where h 6 0.15.
R3.8 A random sample of 50 observations are obtained on the bivariate normal data (X, Y ). For
this sample, it is found that x̄ = ȳ = 30, sx = sy = 10 and the sample correlation rxy = 0.4. The
regression of Y on x is given by E(Y | x) = α+βx.
i. Indicate, with a rough sketch, the general nature of the scatter plot for this sample.
ii. Show that, for these data: K = Σ(x − x̄)² = 4900 and sxy = (1/(n−1)) Σ(x − x̄)(y − ȳ) = 40.
Hence, or otherwise, find β̂ and α̂.
iii. On your diagram, indicate the fitted regression line.
iv. Given that the estimate of the error variance, s2 = 85.75, find a 95% confidence interval
for β.
v. Give an approximate 95% confidence interval for the population correlation.

Revision Problem Set R4


R4.1 A random sample of 100 observations is obtained from a Normally distributed random variable X ~ N(140, 10²).
(a) Specify approximate values you would expect to obtain for each of the following statistics
for this sample:
i. the sample mean, x̄;
ii. the sample standard deviation, s;
iii. the number of observations greater than 150, freq(X>150);
iv. the sample upper-quartile, Q3;
v. the sample maximum, x(100) .
(b) i. Sketch a boxplot that would be not unreasonable for this sample.
ii. Indicate in a sketch, the likely form of a Normal QQ-plot for this sample, showing
its important features.

R4.2 A five-year study was conducted to look at the effect of oral contraceptive (OC) use on heart
disease in women 40–49 years of age. All women were aged 40–44 years at the start of the
study. There were 5624 OC users at baseline (i.e. the start of the study), who were followed for
a total of 23 058 person-years, and of these women, 31 developed a myocardial infarction (MI)
during the five-year period. There were 9472 non-users, followed for 40 730 person-years, and
19 of them developed an MI over the five-year period.
n t x
OC-users 5624 23 058 31
non-users 9472 40 730 19
i. Is this a designed experiment or an observational study?
ii. What are the experimental/study units?
iii. Is this a prospective study, retrospective study or a cross-sectional study?
iv. All the women in the study are aged 40–44. Explain why this was done.
v. Use these data to test the hypothesis that the incidence rate for MI is unaffected by OC-
use. What conclusion can you draw?
vi. Consider a hypothetical population of 10 000 women.
Let µ1 denote the expected number of cases of MI in the next five years if all of the women
were OC-users. Let µ2 denote the expected number of cases of MI in the next five years
if none of the women were OC-users.
Obtain an estimate and a 95% confidence interval for µ1 − µ2 .
Give an interpretation of this result.

R4.3 (a) If two independent events each has probability 0.6 of occurring, find the probability that
at least one of them occurs.
(b) A test for detecting a characteristic C gives a positive result for 60% of a large number
of patients subsequently found to have the characteristic, and gave a negative result for
90% of those not having it. If the test is applied randomly to a population in which the
proportion of persons with the characteristic C is 30%, find the probability that a person
has the characteristic if their test gave a positive result.
What is the sensitivity of this test? What is its negative predictive value?
(c) What is relative risk? Write a sentence describing relative risk.
Why can’t we estimate relative risk with only the data from a case-control study? What
else do we need?

R4.4 The number of times a particular device is used in a given medical procedure is a random
variable X with pmf given by

x 0 1 2 3
p(x) 0.2 0.4 0.3 0.1
(a) Draw a sketch graph of the cdf of X.
(b) Show that E(X) = 1.3 and sd(X) = 0.9.
(c) The total number of times the device is used in 100 of these procedures is given by T =
X1 + X2 + · · · + X100 , where X1 , X2 , . . . , X100 are independent random variables each
with the pmf given in (a).
i. Find the mean and standard deviation of T .
ii. Explain why the distribution of T is approximately Normal.
iii. Find approximately Pr(T ≤ 125).
R4.5 (a) In daily self-administered blood pressure readings, it is expected that, if the blood pres-
sure is stable, the readings (in mm Hg) will have standard deviation 10.
Suppose that Ms. J. obtains eleven daily observations. Specify the standard deviation of
the average of these eleven readings.
Specify the assumptions you have made in obtaining your result.
(b) A study was conducted on the blood pressure of people with glaucoma. In the study,
25 people with glaucoma were recruited and their mean systolic blood pressure was
142 mm Hg, with a standard deviation of 20 mm Hg. Give a point estimate and a 95%
interval estimate for the mean systolic blood pressure for individuals with glaucoma.
R4.6 (a) The following is a random sample from a Normal population:
7.0, 9.0, 10.0, 11.0, 13.0.
i. Verify that x̄ = 10.0 and s2 = 5.0.
ii. Find a 95% prediction interval for a future observation from this population.
(b) In a particular district, the average number of cases of D reported each month is 2.75.
What is the probability that there are at most 10 cases reported in a particular six-month
period?
(c) The index of numerical development, NDI, measures the ability of a first-year university
student to deal with numbers. The standard score when this test was devised in 1975
was 500. It is believed that the advent of computers and calculators has brought about a
decline in NDI.
Values of NDI were obtained on a random sample of first-year students with the follow-
ing results:
540, 450, 399, 415, 556, 488, 366, 490, 474, 456, 398, 513, 342, 328, 593, 360.
For this sample n = 16, x̄ = 448 and s = 80.
Assume these data are a random sample on X ~ N(µ, σ²). We wish to test the hypothesis µ=500 against µ≠500 using a significance level of 0.05.
i. Show that t = −2.6, and show that H0 is rejected by comparing t with the appropri-
ate critical value. Specify the critical value.
ii. Specify the p-value for this test.
iii. What can you conclude?
R4.7 (a) Of 100 independent 95% confidence intervals, let Z denote the number of these confi-
dence intervals that contain the true parameter value.
Specify the distribution of Z.
(b) One observation is obtained on W ~ N(µ, 1). To test H0: µ = 0 vs µ ≠ 0, the decision
rule is to reject H0 if |W | > 2.17. The observation is w = 1.53.
i. Find the significance level.
ii. Find the p-value.
iii. Find the power if µ = 3.
(c) Use the Poisson Statistic-Parameter (SP) diagram to obtain:


i. a rejection region for X to test λ = 11 vs λ ≠ 11 using a test of nominal significance
level of 0.05;
ii. a 95% confidence interval for λ when x = 6;
iii. A hospital had six “serious” medical incidents in the last five years. Specify a 95%
confidence interval for the annual rate of “serious” medical incidents.

R4.8 A pilot study of a new antihypertensive agent is performed for the purpose of planning a
larger study. Twenty five patients who have diastolic blood pressure of at least 95 mm Hg are
recruited for the study. Fifteen patients are given the treatment, and ten get the placebo. After
one month, the observed reduction in diastolic blood pressure yields the following results.
n1 = 15; x̄1 = 9.0, s21 = 60.0;
n2 = 10; x̄2 = 2.5, s22 = 44.7.
Assume that these are independent samples obtained from Normally distributed populations,
X1 ~ N(µ1, σ²) and X2 ~ N(µ2, σ²). It is assumed that the population variances are equal,
and so the sample variances are pooled to give s² = 54.0.
i. Explain how this pooled variance is obtained.
ii. Find a 95% confidence interval for µ1 −µ2 .
iii. What are your conclusions from this study?

R4.9 Transient hypothyroxinemia, a common finding in premature infants, is not thought to have
long-term consequences, or to require treatment. A study was performed to investigate whether
hypothyroxinemia in premature infants is a cause of subsequent motor and cognitive abnor-
malities. Blood thyroxine values were obtained on routine screening in the first week of life
from a number of infants who weighed 2000g or less at birth and were born at 34 weeks gesta-
tion or earlier. The data given below gives the gestational age (x, in weeks) and the thyroxine
level (y, in unspecified units).
x 25 26 27 28 30 31 32 33 34 34
y 10 12 16 14 24 20 28 26 22 28
For these data, the following statistics were calculated:
n = 10, x̄ = 30, ȳ = 20; Σ(x − x̄)² = 100, Σ(x − x̄)(y − ȳ) = 180, Σ(y − ȳ)² = 400.

i. Assuming that E(Y | x) = α + βx and var(Y | x) = σ 2 , obtain estimates of α and β using


the method of least squares.
ii. Show that s² = 9.5 and hence obtain se(β̂).
iii. Plot the observations and your fitted line.
iv. Find the sample correlation, and give an approximate 95% confidence interval for the
population correlation.

R4.10 (a) Sixty independent procedures yielded 48 successes. Find a 95% confidence interval for
the probability of success.
State any assumptions you have made.
(b) The diagram below gives the sample cdf for a random sample of 100 observations on the
recurrence time (in months) for a particular condition following treatment.
i. Find the sample median.


ii. Test the hypothesis that the population median is 10.
iii. Explain briefly how a sample cdf relates to a probability plot.
iv. Draw a rough graph of what you think the population pdf might look like.
Revision Problem Set R5


R5.1 A random variable X, describing the number of people in a randomly selected car on the road,
has a probability mass function (pmf)
x 1 2 3 4
p(x) 0.4 0.3 0.2 0.1
(a) Verify that p(x) is a probability mass function (pmf).
(b) Derive the cumulative distribution function F(x) = Pr(X ≤ x).
(c) Calculate the probability
i. that there are at most three people in the car.
ii. that there are more than one and less than four people in the car.
(d) i. Calculate the mean number of people in the car.
ii. Calculate the standard deviation of the number of people in the car.
(e) Using your answers from (d), what is the approximate distribution of the total number of
people in 40 cars? Justify your answer.
Give an approximate 95% probability interval for the number of people in 40 cars.
(f) The car passes a toll-gate with the rules: cars with 1 person must pay $5.00, cars with 2
people must pay $3.00 and cars carrying 3 or more people must pay $2.
i. Write down the pmf of the toll paid by a single car.
ii. Calculate the mean toll paid by a single car passing through the toll gate.
R5.2 A new diagnostic test for a type of coronary condition has been recently developed. The sensi-
tivity of the test is 0.995, and the specificity of the test is 0.99.
(a) Suppose the prevalence of the disease is 5 per 1000 people.
i. Calculate the proportion of positive tests in the population.
ii. Calculate the positive predictive value of the test.
iii. Write a sentence interpreting the positive predictive value.
(b) The researchers introduce the test to another community, where the positive predictive
value is found to be 0.46. Assuming the sensitivity and specificity remain as stated above,
what is the prevalence of the disease in this community?
(c) It is later shown that a sensitivity of 0.995 is only achieved when cholesterol levels are 150
mg/dL. It is found that the sensitivity changes according to the table below.

cholesterol level 130 150 170 190 210 230 250


sensitivity 0.999 0.995 0.970 0.925 0.875 0.750 0.625
i. Sketch a scatterplot of the cholesterol level and sensitivity. Label both axes.
ii. Describe two main features of the relationship between cholesterol level and sensi-
tivity.
iii. For the test to be useful, the sensitivity must be at least 0.8. Use your scatterplot to
estimate the highest cholesterol level for which the test is deemed useful.
R5.3 (a) You have torn a tendon and are facing surgery to repair it. The orthopedic surgeon ex-
plains the risks to you: Infection occurs in 3% of such operations, the repair fails in 14%,
and both infection and failure occur together in 1%.
i. What percentage of these operations succeed and are free from infection?
ii. Are failure and infection independent events? Explain. If they are not, determine
if they are positively or negatively correlated.
(b) Independent random variables X and Y are such that
E(X) = 10, sd(X) = 2; E(Y ) = 10, sd(Y ) = 1.
Let Z = aX + (1 − a)Y, where 0 ≤ a ≤ 1.
i. Show that E(Z) = 10 and var(Z) = 5a² − 2a + 1.
ii. For what value of a is var(Z) minimised?
iii. What is the minimum value of var(Z)?
R5.4 A study was performed looking at the effect of mean UV exposure on the change in pulmonary
function (measured by the change in forced expiratory volume over a 4-hour bush-walk). A
random sample of 60 members of a bush-walking club was used for the study, with 30 walking
on moderate-UV days and 30 walking on high-UV days. The change in pulmonary function is
recorded in the table below.

UV Level n x̄ s
Moderate 30 0.04 0.11
High 30 0.10 0.25
Based on this study, is there evidence to suggest that there is a difference in the mean change
in pulmonary function between the two groups? Use a significance level of 0.05, and state any
assumptions that you make.
R5.5 A study investigated the relationship between the use of a type of oral contraceptive and the
development of endometrial cancer. The study found that out of 100 subjects who took the
contraceptive, 6 developed endometrial cancer. Of the 225 subjects who did not take the con-
traceptive, 9 developed endometrial cancer.
(a) Based on this study, is there evidence at the 5% level to suggest that there is a higher
proportion of people with endometrial cancer amongst those taking the contraceptive
compared to the control group?
(b) Describe what is meant by a Type I error and a Type II error in the context of the question.
(c) Medical authorities decide that if the test shows that there is a significantly higher pro-
portion of people with endometrial cancer in the group taking the contraceptive, then the
oral contraceptive will be removed from the market.
i. Describe the consequences of a Type I error and of a Type II error.
ii. Explain, for each type of error, whether the consequences are more of a problem for
the women using the oral contraceptive or the manufacturer of the contraceptive.
R5.6 (a) A recent study compared the use of angioplasty (PTCA) with medical therapy in the
treatment of single-vessel coronary artery disease. At the six-month clinic visit, 35 of 96
patients seen in the PTCA group were found to have had angina.
Find a 95% confidence interval for the probability of angina within six months after PTCA
treatment.
(b) The mortality experience of 8146 male employees of a research, engineering and metal-
fabrication plant in Tonawanda, New York, was studied from 1946 to 1981. Potential
workplace exposure included welding fumes, cutting oils, asbestos, organic solvents and
environmental ionizing radiation. Comparisons were made for specific causes of death
between mortality rates in the workers and the U.S. white male mortality rates from 1950
to 1978.
Suppose that, among workers who were hired prior to 1946 and who had worked in the
plant for 10 or more years, 17 deaths due to cirrhosis of the liver were observed, while 6.3
were expected based on U.S. white male mortality rates.
i. Estimate λ, the mean number of deaths for this subpopulation of workers.
ii. Test the hypothesis that λ = λ0 , where λ0 denotes the population value, 6.3.
iii. Find a 95% confidence interval for λ and SMR = λ/λ0 .
R5.7 A group of researchers are investigating a new treatment for reducing systolic blood pressure.
They want to compare the results of a group of patients receiving the new treatment with a
group of subjects receiving a placebo treatment.
(a) Your boss says that it is too costly to include a group of subjects taking a placebo. What
can you say to justify including them in the experiment?
(b) It is decided that 20 patients will take the new treatment and 20 will take the placebo.
Since the investigation is taking place over two cities (Melbourne and Sydney), to make
things simpler, the new treatment will be administered in Melbourne and the placebo
will be given to subjects in Sydney.
i. Identify a potential problem with this design.
ii. Briefly describe a way to overcome this problem.
(c) What is the definition of a lurking variable in the context of this question? Write down
two potential lurking variables.
(d) It is finally decided to run the whole experiment in one city, with 20 subjects taking the
treatment and 20 taking the placebo. The sample mean change in systolic blood pressure
for the new-treatment group is −10.5mmHg, with a standard deviation of 5.2mmHg. The
sample mean change for the placebo group is −6.1mmHg, with a standard deviation of
4.9mmHg.
i. Assuming the underlying variances are equal for the two groups, construct a 95%
confidence interval for the difference in the mean change in blood pressure for the
two groups.
ii. From your confidence interval, explain whether you think the change in blood pres-
sure differs between the two groups.
R5.8 A new antibiotic is thought to affect plasma-glucose concentration (mg/dL). It is known that
in the general population, the mean plasma-glucose concentration is 4.91 with a standard de-
viation of 0.57. A random sample of 10 people is given a fixed dosage of the antibiotic. Their
plasma-glucose concentrations are measured the next day. The concentrations are given in the
table below.
subject 1 2 3 4 5 6 7 8 9 10
concentration 5.05 4.35 5.36 5.46 5.40 4.55 6.45 5.28 4.95 5.50
(a) Draw a boxplot of concentration, making sure you label it appropriately. Show any
working required to construct the graph.
(b) Assume that the true standard deviation for the antibiotic group is the same as for the
general population. Conduct a test at the 1% level to investigate whether the mean
plasma-glucose concentration is higher for those people taking the antibiotic, compared
to the general population. Use the p-value approach, and state any assumptions that you
make.
(c) i. If the true mean is actually µ = 5.5, what is the power of this test?
ii. What happens to the power if α is increased to 0.05? Briefly explain your answer.
(There is no need for any calculations for this part of the question).
R5.9 (a) FEV (forced expiratory volume) is an index of pulmonary function that measures the vol-
ume of air expelled after one second of constant effort. A longitudinal study collected
data on children aged 3–19. The following is a partial R output on a simple linear regres-
sion analysis, relating the variables FEV and AGE for the boys in the group. The group
consisted of 336 boys and their mean age was 10.
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0736 0.1128 0.65 0.514
AGE 0.2735 0.0108 25.33 0.000 ***
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 0.5881 on 334 degrees of freedom


Multiple R-squared: 0.658, Adjusted R-squared: 0.657

i. The slope of the regression line is 0.273. What is the interpretation of this?
ii. The p-value given for the t-ratio 25.33 is 0.000. What does this signify? What distri-
bution is used to find the p-value?
iii. Obtain an estimate and a 95% confidence interval for the mean FEV for 12-year-old
boys. (You may use the table value 1.967 for this.)
[Hint: var(µ̂(x)) = σ²/n + (x − x̄)² var(β̂).]


(b) A study of fifty individuals found a sample correlation coefficient of r = +0.31 between
the variables u and v. Does this represent significant evidence of a positive relationship
between u and v? Explain.
ANSWERS TO THE PROBLEMS

Problem Set 1
1.1 i. Observational study: the treatment is not imposed;
ii. Exposure = oral contraceptive (OC) use; disease outcome = myocardial infarction (MI);
iii. Prospective study;
iv. Response = (MI or not); explanatory variable = (OC or not);
v. To avoid confounding with age;
vi. Keep it simple!
1.2 (a) We really don’t know! It might have been “Does exercise increase lactic acid? . . . and by
how much?” However, we will take it to have been “Is the change in lactic acid after
exercise different for men and women?”
(b) Observational study;
(c) Response variable = change in lactate levels; for the question of comparing males and
females, the gender categories (male and female) take the role of exposure or treatment:
explanatory variable = gender.
(d) Age is a potential confounder (for example, if most of the males were 40–49 and most of
the females were 20–29, then the difference between the groups may be due to age rather
than gender);
(e) A confounding variable is one that affects the outcome (blood lactate level), and which
is related to gender, in the sense that the variable is not balanced between the males and
the females in the sample. Apart from age, there are a number of possible confounders
that suggest themselves: for example individual’s fitness level, weight or recent activity.
1.3 (a) Retrospective;
(b) Prospective;
(c) Cross-sectional.
1.4 (a) Individual subjects (worried about anxiety? . . . how were they chosen? what is their age?
gender?); response variable = difference in anxiety level, explanatory variable, treatment
= meditation;
(b) Experimental study: random allocation of treatment or non-treatment;
(c) No (presumably each individual knows whether or not the therapy they receive is med-
itation); an individual may respond better to a treatment they believe will do them good
— a blind study would avoid this problem. Could this experiment be blind?
(d) Yes: gender is confounded with the treatment.
1.5 (A) A placebo is an inactive drug, which appears the same as the active drug. It is desirable
in order to ascertain whether the active drug is having an effect.
(B) In a double-blind study, neither the subject nor the treatment provider knows whether
the treatment is the active or the inactive drug. It is desirable in order to guard against
any possible bias: on the part of the subject or on the part of the treatment provider (due
to prior expectations).
(C) In favour: there would be no between-subject variation. Against: there may be carry-
over effects, from one treatment to the next. The results may not be generalisable: is
Claire representative?

(D) There can be no carry-over effect in this case. It is likely to be generalisable to a larger
population (the population that the subjects represent). Choose a random order for
AAAAAAAABBBBBBBBCCCCCCCC (using R sampling) and assign these treatments to
subjects 1, 2, . . . , 24; or AAAAAABBBBBBCCCCCCXXXXXX, where X is the placebo.
(E) This method eliminates the between subject variation, but there may be possible carry-
over effects. For each subject, choose a random order for ABC (or ABCX).
1.6 experimental unit = male physicians, response variable = heart attacks (or perhaps heart prob-
lems), explanatory variable = treatment (aspirin/placebo); and other recorded covariates, such
as age, medical history, . . . .

1.7 (a) i. The women should be chosen at random, from among the women attending the
program. The population is then the women attending the Omega program. Issues
of time, location, program leader are all relevant here.
ii. If the program had a strict protocol for how it was delivered that was followed ev-
erywhere you might consider the conclusions to apply to any Omega program . . .
perhaps.
(b) i. The population of items from Grokkle’s production line. The items should be sam-
pled at random. An important issue here is time. You can only sample in a particular
time period. Strictly speaking, the population could then be real: the population of
items in the time period from which you sampled.
ii. If the production line process is believed to be stable over time (usually, a very brave
assumption!) you might consider applying the conclusions to a longer time period
than that sampled. In practice, this is often done: a sample is taken in a week in
March, and an inference is drawn about the whole year. This is a rather dangerous
practice.
(c) Geriatric patients: When an intervention has been used, the circumstances in which it
was applied are usually very important. We could say that the population is all geriatric
patients “like the ones in the Melbourne facility”, but that really doesn’t say anything
useful: what exactly does “like” mean here? This is why randomization is so important in
assessing an intervention. When we have a randomized trial, and therefore some patients
with the intervention and some not, it can be reasonable to apply the conclusions more
widely, to all geriatric patients. Effectively, this is often done.
(d) Breast cancer: Similar issues to the geriatric patients arise.
1.8 (a) The subject does not know whether they have received the treatment drug (SleepWell) or
the control drug (Placebo).
(b) So as to reduce bias and provide a fair comparison of the effect of the drug. Subjects may
tend to sleep more because of the suggestion that the drug will help: the placebo effect.
(c) Each patient has an equal chance of being assigned to the treatment or control. If there are
2n subjects who have agreed to take part in the experiment, randomly choose an order
for T T · · · T CC · · · C, i.e. n T s and n Cs, and assign in this order to the subjects.
(d) Replication is repetition of the experiment. We want a large number of replicates, as this
increases the precision of the comparison being made.
1.9 (a) response variable = birthweight; explanatory variable = mother’s smoking status;
(b) observational study: mother’s smoking status is not imposed;
(c) race, (physical) size of parents, socio-economic status, mother’s health, pre-natal care, . . . ;
we can choose the mothers so that (some of) these variables are similar in the two groups.
1.10 The C–X line is removed: randomisation means that there can be no correlation between the
intervention X and the variable C, so the relationship between X and D is unaffected by the
relation between C and D.
1.11 We use a randomised controlled experiment. Patient function/status will be assessed by an
initial test, i.e. before treatment commences. The drug will be given in the form of a pill to be
administered by the carer. The control group will receive a placebo, i.e. a pill identical in ap-
pearance to the treatment (drug-L) pill. The treatment/placebo pill package will be randomly
assigned by the statistician (20 of each), so that neither the patient/carer nor the physician will
know whether the pill is drug-L or placebo. Thus the trial is double blind. At the end of the
treatment time (six months, say) the patients will be re-tested.
Because randomisation has been used, the significant difference can be attributed to the causal
effect of drug L.
1.12 Review: tutorial/discussion/revision


1.13 (1) survey; This examines the population perception of the problem, or their belief about it
(which may reflect a media view rather than the truth). This applies whether it is the
general population or the medical practitioners being considered. It does not help to
answer the research question.
(2) case-control study; This study may give an indication as to whether there is a relation
between heart problems and cholesterol level. It says nothing about the effect of reducing
cholesterol level.
(3) prospective study, follow-up study, longitudinal study;
Similar response to (2): this study tells us about the relation between cholesterol level and
heart disease; it does not tell us anything about the effect of reducing cholesterol level.
Note: The population from which the sample was drawn (individuals, aged 40–49, at-
tending a medical clinic) may not be representative of the general population. This ap-
plies to (3) and (4).
(4) prospective study, follow-up study, longitudinal study; This study at least contains a
group of individuals for which cholesterol has been reduced: the HL group. It is this
group that is of major interest. How does it compare with the LL and the LH groups?
This comparison impacts on the research question.
(5) clinical trial (experimental study); Comparing the heart disease outcomes of the two
groups (S) and (L) would indicate possible benefits of cholesterol-reducing diet over a
modified (non cholesterol-reducing) diet. This is associated with the research question,
but not the same.
There are two outcomes here: the cholesterol level (has it been reduced? has it been
reduced differently?) and heart disease (is there a difference?)
1.14 Mrs Green is attributing cause to the result, i.e. that walking faster will reduce the probability
of death in the next five years. Cause cannot be inferred from an observational study. The
likely explanation of this result is reverse causation: that illness (leading to death) is causing
the individual to walk more slowly. It may not do any harm, but it could. It’s probably not a
good idea!

Problem Set 2
2.1 (a) continuous; categorical; ordinal; discrete; categorical; continuous.
(b)

The heights should be proportional to (37/6, 13/5, 5/5, 0, 1/5). Whether the reported
data are correct is another matter, given the silliness of the graph, but this is about the
best representation of the data, as given.
The given graph actually appeared in ‘The Age’ (some time ago now)!
2.2 (a) not much recommendation?
(b) people have to die of something; look at quantity/quality of life lost?
(c) “more accidents” is not the same as “worse drivers” (poorer cars? more time on the
roads? . . . );
(d) nonsense, but it might be interesting to work out what it might mean.
2.3 (a) i. whether a vegetarian diet will change the cholesterol level;
ii. n=20, study unit = hospital employees (on standard meat-eating diet who agreed to
adopt a vegetarian diet for one month);
iii. all (hospital) employees on a standard meat-eating diet (extension?)
(b) i.
ii. mean = 21.95, median = 20, sd = 14.38, IQR = 18.75;


iii. data ranges from –8 to 49, with mean 22.0 and standard deviation 14.4, close to sym-
metrical with no obvious outliers;
iv. mean & sd (of diff) will change; med & iqr (of diff) will not change.
(c) The data suggest that the vegetarian diet alters the cholesterol level: apparently giving
about a 10% reduction. How sure are we of this conclusion?
2.4 (a) i. (1, 2, 8, 32, 64); ii. (1, 3.5, 24, 160, 512);

(b) i. mean = 5.5, median = 5.5; sd = 3.0, iqr = 5.5;


ii. mean = 14.5, median = 5.5; sd = 30.2, iqr = 5.5.
While the mean & standard deviation are affected by the ‘outlier’, the median & inter-
quartile range are unchanged.
(c) i. Smallest s occurs when all three values are the same (three 1s, or three 2s or . . . ),
smallest s = 0. Largest s occurs when either there are two 1s and one 9, or one 1 and
two 9s, in which case s = 4.62.
ii. Smallest s occurs when the three values are consecutive (e.g. 1, 2, 3), smallest s = 1.
Largest s occurs for either (1, 2, 9) or (1, 8, 9), in which case s = 4.36.

2.5 (frequency polygon) (cumulative frequency polygon)

(a) med ≈ 87, IQR ≈ 99 − 76 = 23, P10 ≈ 71, P90 ≈ 108;
(b) 0.85;
(c) 0.96;
(d) close to 0.95 [x̄ ± 2s is supposed to contain about 95% of the data];
(e) The data are classified into intervals (groups), so we do not know their values. To get the
correct values for these sample statistics, we would need the actual data.
2.6 (a) response variable = weight (in gram) after 21 days;
explanatory variable = treatment (control/lysine);
age of chicks (1-day → 22-days);
breed of chicks [yes]; conditions (temperature, humidity, . . . ) [yes]
(b) Not a good idea. Actually a really bad idea! Then ‘farm’ would be confounded with
‘treatment’.
(c)

It appears that Lysine has the effect of increasing the weight gain.
2.7 The mean and median are the same for each data set, but the spread of the data sets are sub-
stantially different. This is apparent in a dotplot. It is indicated by the standard deviation or
the interquartile range.
2.8 (a)

(b) It depends! If the missing observations are typical, then they are likely to go where the
observed data are: mostly in the middle with one or two a bit further away from the
middle. But they might be missing because the patient was too ill, or was unable to give
a reading . . . in which case the H-level might be very high? . . . and the missing data are
atypical.
(c) The target population would be all individuals with characteristic C.
(d) We would be assuming that the missing observations are typical — so that the remaining
(observed) 25 are too.
2.9

There appears to be a positive relationship between PBF and PBV.


2.10 i.

ii. strong positive relationship: r ≈ 0.8.


iii. R gives r = 0.912.
iv. y = 1 + x (the middle line on the diagram) looks like the best line to use.

Problem Set 3
3.1 B B′
A 0.004 0.026 0.03
A′ 0.056 0.914 0.97
0.06 0.94 1
There is a positive relationship between A and B:
Pr(A | B) = 0.004/0.06 = 0.067 > Pr(A) = 0.03;
Pr(B | A) = 0.004/0.03 = 0.133 > Pr(B) = 0.06;
or Pr(A ∩ B) = 0.004 > Pr(A) Pr(B) = 0.0018.
Note: any one of these inequalities is enough to show a positive relationship.
3.2 D and E are events with Pr(D | E) = 0.1 and Pr(D | E ′ ) = 0.2.
(a) Pr(E | D) < Pr(E), so E and D are negatively related.
(b) OR = (0.1/0.9)/(0.2/0.8) = 0.444.
        D      D′
E      0.04   0.36   0.4
E′     0.12   0.48   0.6
       0.16   0.84   1
(c) Pr(D) = 0.16;   (d) Pr(E | D) = 0.04/0.16 = 0.25.   Note: OR = (0.04×0.48)/(0.12×0.36) = 0.444.
3.3 (a) p1(1 − p2)/((1 − p1)p2) = 2 ⇒ p1 − p1p2 = 2p2 − 2p1p2 ⇒ p1 = 2p2 − p1p2.
Dividing through by p2 gives the result.
p1 p2 p1 − p2 p1 /p2
0.00 0.0000 0.0000 2.00
0.01 0.0050 0.0050 1.99
0.05 0.0256 0.0244 1.95
0.10 0.0526 0.0474 1.90
0.25 0.1429 0.1071 1.75
0.50 0.3333 0.1667 1.50
0.90 0.8182 0.0818 1.10
1.00 1.0000 0.0000 1.00
(b) As for (a): p1(1 − p2)/((1 − p1)p2) = θ ⇒ p1 − p1p2 = θp2 − θp1p2 ⇒ p1 = θp2 − (θ−1)p1p2.
Again, dividing through by p2 gives the required result:
RR = θ×(1 − p1 ) + 1×p1
This is a weighted average of 1 and θ, with weight p1 on 1 and 1−p1 on θ, and must
therefore lie between 1 and θ. Note: RR divides 1 and θ in the ratio 1−p1 : p1 .
This applies even if θ < 1; i.e. it will lie between θ and 1, but in this case, RR will be less
than 1 (and greater than θ).
(c) OR = 2 ⇒ RR must lie between 1 and 2. If the risks are small, then it will be close to
2, but slightly smaller than 2.
(d) i. 2.8; ii. 1.4; iii. 0.525.
3.4     (a)                (b)                (c)                (d)
    0.1  0.3  0.4      0.2  0.2  0.4      0.1  0.5  0.6      0.2  0.2  0.4
    0.4  0.2  0.6      0.3  0.3  0.6      0.2  0.2  0.4      0.2  0.4  0.6
    0.5  0.5  1        0.5  0.5  1        0.3  0.7  1        0.4  0.6  1
(d) Let Pr(B) = b; then the column margins are b and (1−b), and the first-row entries are b/2 and (1−b)/3.
Then b/2 + (1−b)/3 = 0.4 ⇒ b = 0.4.
3.5 (a) Since Pr(E | D) = 0.20 < Pr(E|D′ ) = 0.25, it follows that E and D are negatively related:
E is less likely for D than for D′ .
(b) O(E | D) = 0.2/0.8 = 1/4; and O(E | D′) = 0.25/0.75 = 1/3. So, OR = (1/4)/(1/3) = 0.75.
(Note: OR < 1 ⇒ negative relationship.)
3.6 Note that these results are estimated, and as they are based on a relatively small sample of 62 individuals,
they are not particularly reliable.
       P20   P20′                  P20    P20′
D       12     4    16     D     0.194  0.065  0.258     sn = 0.750
D′      12    34    46     D′    0.194  0.548  0.742     sp = 0.739
        24    38    62            0.387  0.613  1

and for prevalence 0.01,
       P20     P20′
D     0.0075  0.0025  0.01     ⇒   ppv = 0.0075/0.2658 = 0.028;
D′    0.2583  0.7317  0.99          npv = 0.7317/0.7342 = 0.997.
      0.2658  0.7342  1

       P15   P15′                  P15    P15′
D        7     9    16     D     0.113  0.145  0.258     sn = 0.438
D′       3    43    46     D′    0.048  0.694  0.742     sp = 0.935
        10    52    62            0.161  0.839  1

and for prevalence 0.01,
       P15     P15′
D     0.0044  0.0056  0.01     ⇒   ppv = 0.0044/0.0689 = 0.064;
D′    0.0646  0.9254  0.99          npv = 0.9254/0.9311 = 0.994.
      0.0689  0.9311  1
Changing to threshold of 15 increases the ppv (there are fewer false positives . . . and fewer
true positives), but decreases the npv (more false negatives).

3.7 i. Assuming this sample is representative of the population (say, of men aged 50–59),
          P     P′                  P      P′
D        92    46   138     D     0.126  0.063  0.188     sn = 0.667
D′       27   568   595     D′    0.037  0.775  0.812     sp = 0.955
        119   614   733            0.162  0.838  1
ppv = 0.126/0.162 = 0.773.

ii. Possibly from a community screening program (like mammography screening for breast
cancer) in which, say, men aged 50–59 are invited to attend for a free test. In this case, we
would have to assume that those who chose to attend for the screening test are represen-
tative of the target population. If such data came from routine GP tests (say applied to all
men 50–59 attending the clinic) this would be less representative.
To discover whether they had cancer, there would need to be some sort of follow-up
(perhaps we might take ‘no diagnosed cancer’ in five years time as an indicator). In that
case, there are (statistical) risks: that some cancers have not shown symptoms in that
time; or that some cancers developed after the test.

3.8 For the case ℓ = 3, we say that the test is positive if {PSA > 3}, and we denote this event by P3.
In that case,
sensitivity, sn = Pr(P3 | C) = Pr(PSA > 3 | C) = 1 − 0.003 = 0.997;
specificity, sp = Pr(P3′ | C′) = Pr(PSA ≤ 3 | C′) = 0.140.
Hence the C×P3 probability table can be completed (the top left one in the array below); and
from that we obtain:
ppv = Pr(C | P3) = 0.1994/0.8874 = 0.225 and npv = Pr(C′ | P3′) = 0.1120/0.1126 = 0.995.
Similarly for the other values of ℓ. The Disease/Test probability tables are given below for
ℓ = 3, 4, 5, 6, 7, 8.

We want sn, sp, ppv and npv large; we want fp and fn small. The problem is we can’t have it
all. There are no simple rules for what is ‘best’. It depends on the situation which is rated more
important, and even then, there is disagreement even between experts!
For example, we want sn and sp to be large, but as one increases the other decreases. See the
plot of sn vs 1−sp, i.e. Pr(P | D) vs Pr(P | D′): this plot is called an ROC curve (in the diagram,
the regions above and below the diagonal are labelled sn > sp and sn < sp). Which is more
important: sn or sp? By how much?
Perhaps we might decide that the quality of the test is measured by Q = a·sn + b·sp, and
maximise Q. Or, equivalently, minimise the “cost” of error, A(1−sn) + B(1−sp).
Should we consider ppv or npv?
There is no definitive answer!
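As a numerical check on the ℓ = 3 case, here is a short R sketch; the helper function ppv.npv is ours, and the prevalence Pr(C) = 0.2 is the value implied by the probability table above.

ppv.npv <- function(sn, sp, prev) {
  # D row of the probability table: true positives and false negatives
  tp <- prev * sn; fn <- prev * (1 - sn)
  # D' row: false positives and true negatives
  fp <- (1 - prev) * (1 - sp); tn <- (1 - prev) * sp
  c(ppv = tp / (tp + fp), npv = tn / (tn + fn))
}
ppv.npv(0.997, 0.140, 0.20)   # l = 3: ppv = 0.225, npv = 0.995

Substituting the sn and sp for the other thresholds reproduces the rest of the array.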

Problem Set 4
4.1 (a) graphs of pmf and cdf:

(b) Pr(X = 2) = 3/36 = 1/12 (0.083)
Pr(X > 4) = Pr(X=5) + Pr(X=6) = 9/36 + 11/36 = 5/9 (0.556)
Pr(2 < X ≤ 5) = Pr(X=3) + Pr(X=4) + Pr(X=5) = 5/36 + 7/36 + 9/36 = 7/12 (0.583)
(c) (see the graph of the cdf above) Pr(X = 2) = (jump in F at x=2) = 1/12
Pr(X > 4) = 1 − F(4) = 5/9
Pr(2 < X ≤ 5) = F(5) − F(2) = 7/12
4.2 i. Pr(X > 0.1) = 1 − F(0.1) = 0.9⁴ = 0.656.
ii. Pr(X > s) = 0.01 ⇒ (1 − s)⁴ = 0.01 ⇒ 1 − s = 0.32, i.e. s = 0.68.
Thus, the supply needs to be at least 68L.

4.3 (a) Y = 2X has pmf
y      0    1    2
p(y)   0.5  0    0.5

(b) Z = X1 + X2 has pmf
z      0     1    2
p(z)   0.25  0.5  0.25
(c) Y and Z have the same mean: E(Y ) = E(Z) = 1;


Y is more spread than Z: var(Y )=1, var(Z)=0.5. (Note that E(X)=0.5, var(X)=0.25).
Graphs of the pmfs are shown above.
4.4 (a) i. 0.8791; ii. 0.2112; iii. 0.7844; iv. 0.4148.
(b) i. 0.1088; ii. 0.2275; iii. 0.6554; iv. 0.0014.
4.5 (a) X ~ Bi(20, 0.10); E(X) = 2, so we ‘expect’ about 2.
Pr(X ≥ 4) = 0.1330, from Tables or R.
(b) X ~ Bi(400, 0.1); E(X) = 40, so we ‘expect’ about 40.
Pr(X ≥ 60) ≈ Pr(X* > 59.5), where X* ~ N(40, 36)
= Pr(Xs* > (59.5 − 40)/6), where Xs* ~ N(0, 1)
= Pr(Xs* > 3.25) = 0.0006.
The exact probability, using R, is Pr(X ≥ 60) = 0.0011.
(c) In each case, it is assumed that each class is a random sample (with respect to handed-
ness) from the population. This means that the sample of students can be regarded as a
sequence of independent trials, with each student having the same probability
of being left-handed.
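These binomial calculations are easily checked in R, for instance:

1 - pbinom(3, size = 20, prob = 0.1)     # (a) Pr(X >= 4) = 0.133
1 - pbinom(59, size = 400, prob = 0.1)   # (b) exact: Pr(X >= 60) = 0.0011
1 - pnorm(59.5, mean = 40, sd = 6)       # (b) Normal approximation: 0.0006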
4.6 α = 4.5 month⁻¹, t = 6 months;
X ~ Pn(27) ⇒ Pr(X ≥ 35) = 1 − FX(34) = 1 − 0.9213 = 0.079.
4.7 X ~ Pn(1.8) ⇒ Pr(X ≥ 6) = 1 − FX(5) = 1 − 0.9896 = 0.010.
4.8 (a) i. Pr(X ≤ 47) = Pr(Xs < (47−50)/10) = Pr(Xs < −0.3) = 0.382;
ii. Pr(X > 64) = Pr(Xs > (64−50)/10) = Pr(Xs > 1.4) = 0.081.
iii. Pr(47 < X ≤ 64) = 0.919 − 0.382 = 0.537;
note: < or ≤, > or ≥ doesn’t matter for continuous random variables.
iv. c = c0.05(X) = 50 − 1.6449×10 = 33.55;
v. c0.025 = 50 − 1.96×10 = 30.40.
(b) X ~ Bi(1000, 1/6); X ≈ X*, where X* ~ N(1000/6, 5000/36) = N(166.6667, 11.7851²).
Pr(X ≤ 150) ≈ Pr(X* < 150.5) = Pr(Xs* < −1.3719) = 0.0851.
R gives Pr(X ≤ 150) = 0.0837.
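Again, as a check in R:

pbinom(150, size = 1000, prob = 1/6)              # exact: 0.0837
pnorm(150.5, mean = 1000/6, sd = sqrt(5000/36))   # continuity-corrected approximation: 0.0851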
4.9 (a) sn = Pr(P | D) = Pr(XG > 6.50), where XG ~ N(8.5, 1.2²) (for individuals with gout);
thus, sn = Pr(Z > (6.50 − 8.5)/1.2) = Pr(Z > −1.667) = 0.952.
(b) sp = Pr(P′ | D′) = Pr(XH < 6.50), where XH ~ N(5.0, 0.8²) (for healthy individuals);
thus, sp = Pr(Z < (6.50 − 5.0)/0.8) = Pr(Z < 1.875) = 0.970.
4.10 The expected five-number summary is given by cq = 240 + 40zq, where zq = −2.3301, −0.6745,
0, 0.6745, 2.3301, corresponding to q = 1/101, 1/4, 1/2, 3/4 and 100/101.
This gives (min, Q1, med, Q3, max) ≈ (146.8, 213.0, 240.0, 267.0, 333.2)
4.11 Let X denote the survival time in months, so X ~ N(30, 15²). We are treating time as continu-
ous (even though month is a silly unit of time).
i. Pr(X < 12) = Pr(Xs < (12−30)/15) = 0.115;
ii. Pr(12 < X < 24) = 0.3446 − 0.1151 = 0.230;
iii. Pr(X > c) = 0.8 ⇒ c = c0.2 = 17.4; after 17 months, there are expected to be more
than 80% surviving, but after 18 months, less than 80% surviving;
iv. c0.25 = 19.9, c0.5 = 30, c0.75 = 40.1

The dotplot suggests bimodality, but the sample is small; x̄ ≈ 30 and s ≈ 15 (these are quite
reasonable sample values: we don’t expect values identical to the population values).
Of course, this cannot be an exact model: for example, this model would mean that
Pr(T < 0) = 0.023.
Nevertheless, it seems not unreasonable to use T ~ N(30, 15²) as an approximate model.
Note: An alternative approach may be to consider the observed number of months as an integer variable.
Then we need to consider how to interpret events such as “more than a year”: is this “X > 12” or
“X ≥ 12” or something else?
4.12 (a) µY−X = 0.002; σY−X = √((0.004)² + (0.002)²) = 0.00448.
(b) µZ = 2.001; σZ = 0.00224. Z is more variable than Y , but with a mean closer to 2.
(c) Which is ‘best’ X, Y or Z? There is no simple answer here, each has its merits. X is unbi-
ased (mean = 2), but it has the largest standard deviation, and hence the least precision. Y
is biased, with a larger bias than Z, but it is more precise than Z; it has a smaller standard
deviation. So what is needed is a trade-off between bias and precision. I would choose Z
as a compromise, but choosing either X, because it is the only one that is unbiased, or Y ,
because it has the smallest standard deviation (and quite a small bias) is acceptable.
(d) X ~ N(2, 0.004²) ⇒ Pr(1.995 < X < 2.005)
= Pr(−1.25 < Xs < 1.25) = 0.8944 − 0.1056 = 0.7887;
Y ~ N(2.002, 0.002²) ⇒ Pr(1.995 < Y < 2.005)
= Pr(−3.5 < Ys < 1.5) = 0.9332 − 0.0002 = 0.9330;
Z ~ N(2.001, 0.002236²) ⇒ Pr(1.995 < Z < 2.005)
= Pr(−2.683 < Zs < 1.789) = 0.9632 − 0.0036 = 0.9595;
which gives some support to Z as a good estimator, because these results suggest that it
is more likely to be “close” to the true value, i.e. within 0.005 of 2.
4.13 (a)

(b) X − Y ~ N(−7.8, 95.30),
mean = 165.4 − 173.2 and variance = 6.7² + 7.1², so sd = 9.762.
Pr(X > Y) = Pr(X − Y > 0) = Pr(Z > (0 + 7.8)/9.762) = Pr(Z > 0.7990) = 0.212.
4.14 (a) Pr(Y > 10) = Pr(ln Y > ln 10) = Pr(Z > 0.303) = 0.381;
(b) Let c0.25, c0.5 and c0.75 denote the quartiles and median of Y.
c0.25 is such that Pr(Y < c0.25) = 0.25. Therefore:
Pr(ln Y < ln c0.25) = 0.25 ⇒ ln c0.25 = 2 − 0.6745×1 = 1.3255 ⇒ c0.25 = exp(1.3255) ≈ 3.76.
Similarly, we find c0.5 = exp(2) ≈ 7.39 and c0.75 = exp(2.6745) ≈ 14.51.
(c) Y is positively skew: since c0.75 − c0.5 > c0.5 − c0.25
(d) the graph of the pdf of Y is:

4.15 (a) Let X denote the number of patients in which XB kills the bacteria; then X ~ Bi(100, 0.85)
(since the probability of “success” is the efficacy). Then Pr(“significantly better”) =
Pr(X ≥ 88) = 0.2473, using R.


Since XB is actually better than the standard, this is not very good: there is only a one in four
chance that the test will find that XB is “significantly better”.
(b) i. if p = 0.9 then X ~ Bi(100, 0.9) and Pr(X ≥ 88) = 0.8018. . . . which is a bit better!
ii. if p = 0.95 then X ~ Bi(100, 0.95) and Pr(X ≥ 88) = 0.9985. . . . close to ideal.
iii. if p = 0.8 then X ~ Bi(100, 0.8) and Pr(X ≥ 88) = 0.0253. This means that there
is a 2.5% chance that a drug with the same efficacy as the standard antibiotic is (wrongly)
found to be “significantly better”. This is the trade-off: if we were to make the cut-off larger
than 88, it would mean lower power when the drug is actually better.
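The whole set of these probabilities can be computed in one line of R:

sapply(c(0.80, 0.85, 0.90, 0.95), function(p) 1 - pbinom(87, size = 100, prob = p))
# 0.0253 0.2473 0.8018 0.9985  -- Pr(X >= 88) for each efficacy p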

Problem Set 5
5.1 (a) X̄ ~ N(50, 10²/10);
Pr(49 < X̄ < 51) = Pr(−1/√10 < X̄s < 1/√10) = Pr(−0.316 < X̄s < 0.316) = 0.248.
(b) X̄ ~ N(50, 10²/100);
Pr(49 < X̄ < 51) = Pr(−1 < X̄s < 1) = 0.683.
(c) X̄ ~ N(50, 10²/1000);
Pr(49 < X̄ < 51) = Pr(−√10 < X̄s < √10) = Pr(−3.162 < X̄s < 3.162) = 0.998.
5.2 X̄ ≈ N(55.4, 14.2²/50).
95% prob interval: 55.4 ± 1.96×14.2/√50 = (51.5, 59.3)
5.3 [cf. Computer Lab Week 7: StatPlay & Confidence Intervals]
(a) 0.95⁴ = 0.8145.
(b) i. 0.95²⁰ = 0.3585;
ii. about 19 = 20×0.95;
iii. Bi(20, 0.95);
iv. 0.3585, 0.3774, 0.1887.
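These are binomial calculations; for instance, in R:

0.95^20                                 # Pr(all 20 intervals cover mu): 0.3585
dbinom(20:18, size = 20, prob = 0.95)   # Pr(exactly 20, 19, 18 cover): 0.3585 0.3774 0.1887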
5.4 i. n = 30, x̄ = 40.86, (σ = 8);
95% CI for µ: 40.86 ± 1.9600×8/√30 = (38.00, 43.72).
ii. narrower: it has less chance of containing µ. 40.86 ± 0.6745×8/√30 = (39.87, 41.84).
iii. the confidence interval would continue to get narrower, until it reaches the point estimate
x̄, which is the 0% confidence interval.
iv. 40.86 ± 3.2905×8/√30 = (36.05, 45.66).
5.5 n = 30, x̄ = 40.86, s = 7.036.
95% CI for µ: 40.86 ± 2.045×7.036/√30 = (38.23, 43.48). cf. (38.00, 43.72).
This interval is narrower because the sample standard deviation s = 7.036 happens to be less
than the population standard deviation σ = 8 for this sample. If the population standard
deviation is actually equal to 8, then sometimes s will be less than 8, and sometimes it will
be more than 8. In this case we were ‘lucky’. On average, the interval based on s will be
wider, since not only is s ≈ 8 on average, but the multiplier of s (based on t) is larger than the
multiplier of σ (based on z).
5.6 n = 50, d̄ = 17.4, sd = 21.2.
i. d̄ ± 2.010×21.2/√50 = (11.4, 23.4);
ii. the CI excludes zero, so that a mean difference of zero is implausible; this indicates an
increase.
5.7 There is no need to assume a Normal population; though we are assuming that the sample size is large
enough for the CLT to apply, so that X̄ is approximately Normally distributed.
i. σ, population standard deviation; n, the sample size; α, the probability of error, equiva-
lently the confidence level 100(1−α).
ii. the width increases with increasing σ; the width increases with decreasing α (or increas-
ing confidence level); and the width decreases with increasing n.
iii. wider interval means less precision, i.e. the “answer” is less precise: a wider interval
gives the scientist less precise information about the parameter.
iv. c0.975(N) = 1.96 ⇒ 1.96×5/√n = 0.5 ⇒ √n = 19.6 ⇒ n = 384.2;
Thus we want the sample size to be at least 385.
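In R this is one line:

ceiling((qnorm(0.975) * 5 / 0.5)^2)   # smallest n giving margin of error 0.5: 385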
5.8 p̂ = 228/1140 = 0.20; se(p̂) = √(0.20×0.80/1140) = 0.0118.
(approx) 95% CI for p: (0.20 ± 1.96×0.0118) = (0.177, 0.223).
Note that because n is large the exact interval will be almost the same: R gives (0.177, 0.224).
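The exact interval quoted above comes from binom.test:

binom.test(228, 1140)$conf.int   # exact 95% CI for p: (0.177, 0.224)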

5.9                                        5.10
  n    x    p̂    95% CI                      n    x    p̂    95% CI
 20    4   0.2   (0.06, 0.44)               20   16   0.8   (0.56, 0.94)
 50   10   0.2   (0.10, 0.34)               50   40   0.8   (0.66, 0.90)
100   20   0.2   (0.13, 0.29)              100   80   0.8   (0.71, 0.87)
200   40   0.2   (0.15, 0.26)              200  160   0.8   (0.74, 0.85)

n = 100, x = 20 ⇒ approx 95% CI for p: 0.2 ± 1.96×√(0.2×0.8/100) = (0.122, 0.278)
cf. exact 95% CI from tables: (0.13, 0.29). Note: R gives (0.127, 0.292);
and the ‘better’ approximation gives (0.126, 0.297).
5.11 (a) If the data are a random sample from a Normal population, then the QQ plot should be
close to a straight line, with intercept µ and slope σ.
(k=15): y-coordinate = x(15) = 65 and x-coordinate Φ⁻¹(15/20) = 0.6745.
So the point is (0.6745, 65).
µ̂ = 50 (intercept); σ̂ = 20 (slope).
(b)

For a Probability plot the axes are interchanged; the x-coordinate = x(15) = 65 and the
y-coordinate Φ⁻¹(0.75) = 0.6745, though the y-axis label is 0.75 (= Φ(0.6745)).
(c) n = 19, x̄ = 50.05, s = 17.81.
i. 95% CI for µ: 50.05 ± 2.101×17.81/√19 = (41.47, 58.63);
ii. 95% PI for X: 50.05 ± 2.101×17.81×√(1 + 1/19) = (11.66, 88.44).

5.12
interval freq fˆ x cum.freq F̂
0<x<1 27 0.27 1 27 0.27
1<x<2 18 0.18 2 45 0.45
2<x<3 20 0.20 3 65 0.65
3<x<5 17 0.085 5 82 0.82
5 < x < 10 12 0.024 10 94 0.94
10 < x < 20 6 0.006 20 100 1.00
(a)

(b)

From the graph, the sample median, m̂ ≈ 2.25, from F̂ = 0.5.


(c) Let p = Pr(X < 3). From the data, p̂ = 65/100 = 0.65.
approx 95% CI for p: 0.65 ± 1.96×√(0.65×0.35/100) = 0.65 ± 0.093 = (0.557, 0.743).
Using R, the exact 95% CI is (0.548, 0.743); ‘better’ approx is (0.545, 0.743).
The 95% confidence interval excludes 0.5, which means that 0.5 is an implausible value
for Pr(X < 3); and hence 3 is an implausible value for the median.
5.13 i. θ = Pr(positive result)
= Pr(at least one of the ten blood samples is positive)
= 1 − Pr(all ten are negative)
= 1 − (1 − p)^10 (assuming the blood samples are independent).
Note: this means that p = 1 − (1 − θ)^(1/10).
ii. 4/10 positive groups ⇒ 95% CI for θ: 0.1216 < θ < 0.7376 (exact, using R);
And, since θ is an increasing function of p, it follows that 0.0129 < p < 0.1252.
Note: 1 − (1 − 0.1216)^(1/10) = 0.0129 and 1 − (1 − 0.7376)^(1/10) = 0.1252.
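In R the whole calculation, including the back-transformation, is:

ci <- binom.test(4, 10)$conf.int   # exact 95% CI for theta: (0.1216, 0.7376)
1 - (1 - ci)^(1/10)                # transformed 95% CI for p: (0.0129, 0.1252)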
5.14 α̂ = 43/1047 = 0.041 cases/person-year;
se(α̂) = √(α̂/t) = √(43/1047²) = √43/1047 = 0.0063;
approx 95% CI for α: 0.04107 ± 0.01228 = (0.029, 0.053). [exact 95% CI: (0.030, 0.055).]
No. This result suggests that the industry rate is greater than 0.02.
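The exact interval comes from the Poisson model for the 43 events:

poisson.test(43, T = 1047)$conf.int   # exact 95% CI for the rate: (0.030, 0.055)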
5.15 (a)  est   sd   1/sd²     w    w×est
         25.0  0.4    6.25   0.36    9.0
         23.8  0.3   11.11   0.64   15.2
                     17.36   1      24.2

Therefore, est = 24.2 and sd = 1/√17.36 = 0.24.
(b)  est   sd   1/sd²      w     w×est
    25.0  0.4    6.25    0.148    3.689
    23.8  0.3   11.11    0.262    6.243
    24.4  0.2   25.00    0.590   14.400
                42.36    1.000   24.33

Therefore, est = 24.3 and sd = 1/√42.36 = 0.15.
Including the third estimate doesn’t change the pooled estimate much, but it substantially
increases its precision.
5.16 (a)
est se 1/seˆ2 w w*est
0.0827 0.0533 352.0024 0.3505 0.029
0.3520 0.1058 89.3364 0.089 0.0313
0.0520 0.0503 395.2429 0.3936 0.0205
-0.7702 0.5109 3.8311 0.0038 -0.0029
0.1049 0.0797 157.4285 0.1568 0.0164
0.1542 0.3935 6.4582 0.0064 0.001
1004.2995 1 0.0953

(b) est = 0.0953, se = 0.0316 (= 1/√1004.2995)
(c) 95% CI for ln OR: 0.0953 ± 1.96×0.0316 = (0.0334, 0.1571)
(d) 95% CI for OR = exp(0.0334, 0.1571) = (1.034, 1.170).
(e) OR > 1: exposure and disease outcome are positively related, i.e. exposure is associated
with greater probability of the disease outcome.
(f) There is significant evidence in these data to indicate that OR > 1, since the 95% CI is
entirely greater than 1, i.e. the ’plausible’ values are greater than 1; i.e. there is significant
evidence to indicate that β-carotene increases the risk of cardiovascular mortality (small,
but significant).
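Parts (a)–(d) amount to a fixed-effect (inverse-variance) meta-analysis, which can be reproduced in R:

est <- c(0.0827, 0.3520, 0.0520, -0.7702, 0.1049, 0.1542)   # study log odds ratios
se  <- c(0.0533, 0.1058, 0.0503, 0.5109, 0.0797, 0.3935)
w <- (1 / se^2) / sum(1 / se^2)               # normalised inverse-variance weights
pooled <- sum(w * est)                        # 0.0953
se.pooled <- 1 / sqrt(sum(1 / se^2))          # 0.0316
exp(pooled + c(-1.96, 1.96) * se.pooled)      # 95% CI for OR: (1.034, 1.170)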

Problem Set 6
6.1 n = 30, x̄ = 40.86, (σ = 8). Note: s = 7.04.
(a) i. 95% CI for µ: 40.86 ± 1.96×8/√30 = (38.00, 43.72);
hence we reject (µ=45) since 45 ∉ CI.
ii. (diagram: the 95% CI drawn on a µ-axis marked 35, 40, 45; the value 45 lies outside it)
iii. p = 2 Pr(X̄ < 40.86) = 2 Pr(X̄s < (40.86−45)/(8/√30)) = 2 Pr(X̄s < −2.83) = 0.005;
hence we reject (µ=45) since p < 0.05.
iv. z = (x̄−45)/(8/√30); reject (µ=45) if |z| > 1.96, i.e. if x̄ ∉ (45 ± 1.96×8/√30);
thus we reject (µ=45) if x̄ < 42.14 or if x̄ > 47.86, (as above)
and, since x̄ = 40.86, we reject (µ=45), and conclude that there is significant evidence
to suggest that µ < 45: the ‘plausible’ values for µ (as specified by the confidence
interval) are less than 45.
(b) i. 99.9% CI for µ: 40.86 ± 3.2905×8/√30 = (36.05, 45.66);
hence we do not reject (µ=45) at the 0.1% level, since 45 ∈ CI.
ii. (diagram: the 99.9% CI on the same µ-axis marked 35, 40, 45; the value 45 now lies inside it)
iii. p = 0.005, as above; and, as p > 0.001, we do not reject (µ=45).
iv. z = (x̄−45)/(8/√30); reject (µ=45) if |z| > 3.2905, i.e. if x̄ ∉ (45 ± 3.2905×8/√30);
thus we reject (µ=45) if x̄ < 40.19 or if x̄ > 49.81, (as above)
and, since x̄ = 40.86, we do not reject (µ=45): we would conclude that there is no
significant evidence that µ ≠ 45 as the ‘plausible’ values (as specified by the confidence
interval) include 45.
Note: What is ‘significant’ and ’plausible’ depends entirely on the specification of the signif-
icance level and the corresponding confidence level.

6.2 (a) Pr(X < 12.5 | anaemic) = Pr(Xs < (12.5−9)/3) = Pr(Xs < 1.167) = 0.879;
(b) Pr(X > 12.5 | healthy) = Pr(Xs > (12.5−16)/2.5) = Pr(Xs > −1.4) = 0.919;
(c) sensitivity = Pr(P | D) = 0.879; specificity = Pr(P′ | D′) = 0.919;
Unless we know Pr(anaemic) (i.e. prevalence in the population, or relevant sub-population)
we cannot evaluate ppv or npv. (In hypothesis testing generally we never know Pr(H0) or
Pr(H1), i.e. we never know “prevalence”.)
(d) Here H0 = healthy and H1 = anaemic (so this is a “non-standard” situation)


type I error = rejecting H0 (positive result) when H0 is true (healthy);
type II error = accepting H0 (negative result) when H1 is true (anaemic).
significance level = 1 − specificity = 0.081;
power = sensitivity = 0.879; so Pr(type II error) = 1 − 0.879 = 0.121.

6.3 p̂ = 0.2, 95% CI: 0.09 < p < 0.36; 0.2, (0.11, 0.33); 0.2, (0.13, 0.29); 0.2, (0.15, 0.26);
do not reject, do not reject, reject, reject.
These results are summarised in the following table. In addition, more precise values for the
exact confidence interval, obtained from R are also listed, along with the approximate and
‘better’ approximate confidence intervals for comparison.

n x p̂ 95% CI conclusion R approx b.approx


20 4 0.2 (0.06, 0.44) do not reject (0.057, 0.437) (0.025, 0.375) (0.035, 0.465)
50 10 0.2 (0.10, 0.34) do not reject (0.100, 0.337) (0.089, 0.311) (0.097, 0.347)
100 20 0.2 (0.13, 0.29) reject (0.127, 0.292) (0.122, 0.278) (0.126, 0.297)
200 40 0.2 (0.15, 0.26) reject (0.147, 0.262) (0.145, 0.255) (0.147, 0.264)

6.4 data give n = 19, x̄ = 29.74 and s = 7.85; (n* = 1: one observation missing. We assume that
the missing observation is “missing at random”, i.e. it’s just as likely to be large or small: we’re
assuming it’s distributed like the others. In particular, we are assuming that it has not been
discarded because it was too large, for example.)
n = 19; µ̂ = x̄ = 29.74; and se(µ̂) = s/√n = 7.85/√19 = 1.80.
(a) 95% CI for µ: 29.74 ± 2.101×1.801 = (26.0, 33.5);
and, since 31 ∈ CI, we do not reject µ=31.
(b) t = (x̄−µ0)/(s/√n) = (29.74−31)/(7.85/√19) = −0.701;
p = 2 Pr(t18 < −0.701) = 0.492 (using R); or, from tables p ≈ 0.5
[since c0.75(t18) = 0.688 and c0.8(t18) = 0.862; so Pr(t18 > 0.7) ≈ 0.25.]
There is no significant evidence here to indicate that µ ≠ 31.
Note: as the data are counts, and therefore integer-valued, we should really have made a correction for
continuity (ΣX ≤ 565). This gives tc = −0.687.

6.5 (a) p = 2 Pr(X ≥ 15 | λ = 10) = 0.167.
(b) 95% CI for λ. approx: est ± 2se = 15 ± 2√15 = (7.3, 22.7);
exact: tables (fig.4): (8.4, 24.8); R: (8.39, 24.74).
SMR = λ/λ0 = λ/10,
95% CI for SMR: approx = (0.73, 2.27); exact = (0.84, 2.47).
Note: SMR = 1 ∈ CI, so do not reject H0.
(c) t = 5000 person-years, so α = λ/5000,
95% CI for α: approx = (0.0015, 0.0045); exact = (0.0017, 0.0049).
Note: α = 10/5000 = 0.002 ∈ CI, so do not reject H0.
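For comparison, the exact calculations in R:

2 * (1 - ppois(14, lambda = 10))   # p = 2 Pr(X >= 15) = 0.167
poisson.test(15)$conf.int          # exact 95% CI for lambda: (8.40, 24.74)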
6.6 Using the approximate z-test, we have z = (11 − 4.3 − 0.5)/√4.3 = 2.990, giving p ≈ 0.003.
Now, λ0 = 4.3 is not really large enough to be using a normal approximation. Using the
Poisson distribution, we obtain p = 2 Pr(X ≥ 11) = 0.010. Hence there is significant evidence
in these data of an excess risk of breast cancer. Note: we reject H0, and since λ̂ > 4.3, the plausible
values (as specified by the CI) will be greater than 4.3, so we can conclude there is significant evidence
that λ > 4.3.
SMR = λ/4.3; approx. 95% CI for λ: 11 ± 1.96√11 = (4.50, 17.50)
Hence approx. 95% CI for SMR: (1.05, 4.07) (obtained by dividing the CI for λ by 4.3).
Note: These approximate confidence intervals are dubious, as they are based on a questionable
Normal approximation.
R gives an exact 95% CI for λ: (5.49, 19.68); or, using the Poisson SP diagram in the Tables
(Figure 4) gives (5.5, 19.7).
For SMR this gives: est = 2.6, 95% CI: (1.28, 4.58).
6.7 (diagram: power curve plotted against µ, with axis points A, C, B; the curve height at µ = A
gives the power when µ = A, the gap below 1 at µ = B gives the type II error when µ = B, and
the minimum height, at µ = C, is the significance level.)
(Note that the H0-value is µ = C, the value at which the power-curve has a minimum.)
(diagrams: power curves with the significance level doubled and quadrupled (dotted), and
with the sample size doubled and quadrupled (dashed).)
6.8 Let Z = X̄/(38.5/√n), where X denotes the change in serum cholesterol. We reject H0 if |Z| > 1.96.
When µ=10, we want Pr(X̄/(38.5/√n) > 1.96) = 0.95. But, if µ=10 then (X̄−10)/(38.5/√n) ~ N(0, 1), so
we subtract 10/(38.5/√n) from both sides, to give Pr((X̄−10)/(38.5/√n) > 1.96 − 10/(38.5/√n)) = 0.95.
Therefore 1.96 − 10/(38.5/√n) = −1.6449, and hence √n = (38.5/10)×3.6049 = 13.88, so n = 192.6.
So we require a sample of at least 193. The formula (p140) gives: n ≥ 13×38.5²/10² = 192.7.
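The same calculation in R:

(qnorm(0.975) + qnorm(0.95))^2 * (38.5 / 10)^2   # 192.6, so take n = 193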
6.9 (a) p = probability of five-year survival; H0: p = 0.10 vs H1: p ≠ 0.10.
(b) p = 2 Pr(X ≥ 27) = 0.044 (exact, using X ~ Bi(180, 0.10) and R);
p ≈ 2 Pr(X* > 26.5) = 2 Pr(Xs* > 2.112) = 0.035 (normal approximation).
(c) We would reject H0 if p̂ > 0.10 + 1.96√(0.1×0.9/n), and we would also reject H0 if
p̂ < 0.10 − 1.96√(0.1×0.9/n).
If p = 0.15, we want the probability of rejecting H0 to be 0.95, which means that
Pr(p̂ > 0.10 + 1.96√(0.1×0.9/n)) = 0.95, since, when p = 0.15, the probability of H0 being
rejected because p̂ is too small is negligible.
If p = 0.15, then Pr(p̂ > 0.15 − 1.6449√(0.15×0.85/n)) = 0.95. It follows that
0.10 + 1.96√(0.1×0.9/n) = 0.15 − 1.6449√(0.15×0.85/n) (see the diagram below).
Therefore √n = (1.96√(0.1×0.9) + 1.6449√(0.15×0.85))/0.05 ⇒ n = 552.6.
So, we need a sample of at least 553 to ensure power ≥ 0.95 when p = 0.15.
(diagram: the sampling distributions of p̂ under H0: p = 0.10 and under H1: p = 0.15; the upper
0.025 point of the H0 curve, 0.10 + 1.96√(0.10×0.90/n), coincides with the lower 0.05 point of
the H1 curve, 0.15 − 1.6449√(0.15×0.85/n).)
6.10 Let X denote the number of cases of K. Under the null hypothesis (that the individuals at HQ
centre are the same as the general population), X ~ Pn(4.6).
Therefore p = 2 Pr(X ≥ 13) = 2×0.001 = 0.002. Hence there is significant evidence of an
excess risk of K among HQ employees.
6.11 There is no evidence in these data that the treatment has an effect. The data are compatible
with the null hypothesis (that the treatment has no effect).
6.12 H0: µ = 20 vs H1: µ ≠ 20; the test statistic, t = (x̄−µ0)/(s/√n) ⇒ tobs = (17.4−20)/(5.1/√20) = −2.28. The null
distribution of t, i.e. the distribution of t under H0, is t19, assuming the population is normally
distributed. The critical value (for a test of significance level 0.05) is c0.975(t19) = 2.093. Since
|tobs| is greater than this, there is significant evidence that µ < 20.
6.13 From the definition of the median: m = 20 ⇒ Pr(X < 20) = 0.5 (for a continuous random
variable).
Let Y denote the number of observations less than 20, i.e. Y = freq(X < 20). Then, if m = 20,
Y ~ Bi(11, 0.5); and we observe y = 10. So, p = 2 Pr(Y ≥ 10) = 2×0.0059 = 0.012. Thus we
reject H0, and conclude there is evidence that the median is less than 20.
6.14 i. There are 10 observations that round to 37.0. We don’t know whether these 10 observations
are above or below 37 (i.e. 37.0000. . . ). So we delete them from consideration. This
leaves 120 observations, of which 81 are less than 37, and 39 are greater than 37.
If H0 (m = 37) is true, then W = freq(X<37) ~ Bi(120, 0.5);
using the approximate z-test, zc = (81 − 60 − 0.5)/√30 = 3.743, so that p ≈ 0.0002.
Thus we reject the hypothesis that m = 37, and conclude that there is significant evidence
that m < 37. [The null hypothesis is rejected and since m̂ < 37, the plausible values for m (as
specified by the CI, even though we haven’t found it) will be less than 37.]
ii. That this is a random sample (of healthy adults) and that temperatures are correctly mea-
sured. No assumption is made about the distribution of temperatures.
iii. Using Stat > Nonparametrics ◮ 1-Sample Sign . . . gives:
Sign test of median = 37.00 versus not = 37.00
N Below Equal Above P Median
x 130 81 10 39 0.0002 36.80
Sign confidence interval for median
Confidence
Achieved Interval
N Median Confidence Lower Upper Position
x 130 36.80 0.9345 36.80 36.90 55
0.9500 36.74 36.90 NLI
0.9563 36.70 36.90 54
Try also Stat > Basic Statistics ◮ Graphical Summary . . . which gives the CI.
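The same test can be done in R with the exact binomial distribution:

binom.test(81, 120, p = 0.5)   # exact sign test: p ≈ 0.0002, as for the z approximation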
6.15 (a) SP: λ = 15 ⇒ 7.5 < x < 23.5, i.e. 8 ≤ X ≤ 23.
Thus we would reject H0 if X ≤ 7 or if X ≥ 24.
(b) SP: x = 15 ⇒ 8.5 < λ < 24.8 (95% CI for λ). Reading the diagram is a bit rough. R
gives the exact 95% CI for λ as (8.40, 24.74).
(c) Let Y = number of cases of disease D, then Y ~ Pn(5000α), where α denotes the incidence rate.
y = 15 ⇒ 8.40 < 5000α < 24.74 ⇒ 0.0017 < α < 0.0049.
6.16 The definitions and assumptions are missing. Here, it is assumed that we are sampling from a
Normal population with known variance σ².
For the confidence interval case, we require the sample size n to be large enough so that the
margin of error of a 100(1 − α)% confidence interval should be at most d.
For the hypothesis testing case, we are testing the null hypothesis H0: µ = µ0 using a significance level α; and we require the sample size n to be large enough so that when µ = µ1 (where
µ1 = µ0 ± d), the power of the test should be at least 1 − β.
i. n increases by a factor of k²: n′ = z²(kσ)²/d² = k² × z²σ²/d² = k²n.
ii. n decreases by a factor of k²: n′ = z²σ²/(kd)² = (1/k²) × z²σ²/d² = n/k².
iii. 0.95 ↦ 0.99 means z = 1.96 ↦ z′ = 2.5758, so n increases by a factor of (2.5758/1.96)²:
n = 1.96²σ²/d² ↦ n′ = 2.5758²σ²/d², so n′/n = (2.5758/1.96)² = 1.727.
iv. For the diagram shown below, z(1−α/2) σ/√n = d: [diagram not shown]
And this diagram corresponds to the power diagram (EDDA p141) with β = 0.5.
v. β = 0.1 ↦ β′ = 0.01 means that z(1−β) = 1.2816 ↦ z(1−β′) = 2.3263;
and so n = (1.96 + 1.2816)²σ²/d² ↦ n′ = (1.96 + 2.3263)²σ²/d².
Therefore n′/n = (1.96 + 2.3263)²/(1.96 + 1.2816)² = 1.748.
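These scalings are easy to verify numerically. A sketch of the sample-size formula as an R function (the function name is ours):
    n.required <- function(sigma, d, alpha = 0.05, beta = 0.2) {
      # n >= (z_{1-alpha/2} + z_{1-beta})^2 * sigma^2 / d^2
      (qnorm(1 - alpha/2) + qnorm(1 - beta))^2 * sigma^2 / d^2
    }
    # the ratio in v.: changing beta from 0.1 to 0.01 multiplies n by 1.748
    n.required(1, 1, beta = 0.01) / n.required(1, 1, beta = 0.1)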
Problem Set 7
7.1 (a) i. muscular endurance, as measured by repetitive grip strength trials;
ii. paired comparisons: treatment and control applied to the same subject;
iii. control = sugar placebo;
iv. neither the patient nor the tester knows whether the treatment received was the vitamin C or the placebo;
v. randomisation should be used to determine which treatment (vitamin C or placebo)
is used first;
vi. better (more efficient) comparisons between treatment and control; the possibility of
carry-over effects.
(b) i. To check on outliers; and a rough check of normality, via symmetry at least. This
looks a bit positively skew, but there are relatively few observations.
ii. To check on Normality, use a QQ-plot or a Probability plot. They should be close to
a straight line. The plot below indicates that this sample is acceptably Normal:
[probability plot not shown]
(c) i. H0: µD = 0 (i.e. µVC = µP) vs H1: µD ≠ 0
t12 = d̄/(sd/√n) = 121.0/(148.8/√13) = 2.93, 0.01 < p < 0.02.
ii. There is significant evidence in these data that the muscular endurance is less with
vitamin C than with the placebo (assuming that large values of the response variable
correspond to greater muscular endurance), i.e. there is significant evidence here
that vitamin C reduces muscular endurance. We reject H0, and the plausible values of
µD (as specified by the CI) are positive; and µD > 0 corresponds to µP > µVC.
[t12 = 2.93, p = 0.013; 95% CI: (31, 211)].
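In R the paired analysis is a single call; a sketch, assuming (hypothetical) vectors placebo and vitC hold the 13 paired responses:
    t.test(placebo, vitC, paired = TRUE)   # same as t.test(placebo - vitC, mu = 0)
This reproduces t = 2.93 on 12 df, p = 0.013, and the 95% CI (31, 211) for µD.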
7.2 (a) x̄1 − x̄2 = 121.0 = d̄, so the point estimate is the same; but this two-sample approach
gives se(x̄1 − x̄2) = 162.9√(1/13 + 1/13) = 63.9, vs the difference approach: se(d̄) = 41.3.
(b) 95% CI using the two-sample approach: (−12, 254), vs 95% CI from Problem 7.1 (differ-
ence approach): (31, 211)
(c) The two-samples approach assumes there is no connection between the two results of an
individual. It is assumed that samples are independent random samples from the treated
and untreated (placeboed?) populations.
(d) Clearly there is a connection between the results for a given subject. Some individuals are
stronger than others. Look at the results for subjects 5 and 6. In using the two-samples
(independent samples) approach, the treatment difference is masked by the difference
between individuals. The differences approach (paired samples) effectively removes the
individual differences.

7.3 Zinc: n1 = 25, x̄1 = 4.5, s1 = 1.6; Placebo: n2 = 23, x̄2 = 8.1, s2 = 1.8.
(a) x̄1 − x̄2 = −3.6, s = √((24×1.6² + 22×1.8²)/46) = 1.70; se = s√(1/25 + 1/23) = 0.49.
(b) 95% CI for µ1 − µ2: −3.6 ± 2.015×0.491 = (−4.6, −2.6).
c0.975(t46) = 2.015 using R; or tables (c0.975(t40) = 2.021, c0.975(t50) = 2.009). Even if you
used 2.021, the 95% CI is unchanged to two decimal places.
(c) Yes. The 95% CI excludes zero. There is significant evidence here that the mean recovery
time is less with the zinc treatment.
Note that if you assumed that σ1 ≠ σ2, little would change, as s1 and s2 are not very different:
df = 44 (using R); t = −7.30 [instead of t = −7.34]; se = √(1.6²/25 + 1.8²/23) = 0.493 [instead of
se = 0.491]; and 95% CI = (−4.59, −2.61) [instead of (−4.59, −2.61)!]
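Only summary statistics are given here, so the pooled interval can be computed directly; a minimal R sketch:
    n1 <- 25; m1 <- 4.5; s1 <- 1.6    # zinc
    n2 <- 23; m2 <- 8.1; s2 <- 1.8    # placebo
    sp <- sqrt(((n1 - 1)*s1^2 + (n2 - 1)*s2^2) / (n1 + n2 - 2))   # pooled sd, 1.70
    se <- sp * sqrt(1/n1 + 1/n2)                                  # 0.491
    (m1 - m2) + c(-1, 1) * qt(0.975, n1 + n2 - 2) * se            # 95% CI (-4.6, -2.6)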
7.4 (a) sample of differences (10, 10, 6, 3, 0, 20, 7): n1 = 7, d̄1 = 8.0, s1 = 6.40;
t = 8.0/(6.40/√7) = 3.31, cf. t6; reject H0, p = 0.016;
(b) sample of differences (5, 17, 23, 22, 17, 4, 18, 3): n2 = 8, d̄2 = 13.6, s2 = 8.28;
95% CI: (13.6 ± 2.365×8.28/√8) = (6.7, 20.6).
(c) compare female and male differences (two-sample test): n1 = 7, d̄1 = 8.0, s1 = 6.40;
n2 = 8, d̄2 = 13.6, s2 = 8.28; (s = 7.47):
t = (d̄2 − d̄1)/(s√(1/n1 + 1/n2)) = 1.45, cf. t13; do not reject H0, p = 0.170;
Note, the 95% CI: (5.6 ± 2.160×7.47×√(1/7 + 1/8)) = (−2.7, 14.0).
7.5 (a) Two-sample t-test: t = (34.47 − 36.03)/(10.11√(1/10 + 1/10)) = −1.56/4.523 = −0.345, cf. c0.975(t18) = 2.101;
so we accept µ1 = µ2. [s² = (9s1² + 9s2²)/18 = ½(s1² + s2²) = ½(10.06² + 10.17²) = 10.11².
With equal sample sizes, the pooled s² is the average of s1² and s2².]
95% CI for µ1 − µ2: −1.56 ± 2.101×4.523 = (−11.06, 7.94).
Using R, the following output is obtained:
Welch Two Sample t-test

data: C and K
t = -0.34487, df = 17.998, p-value = 0.7342
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-11.063388 7.943388
sample estimates:
mean of x mean of y
34.47 36.03

Thus, with this (independent samples) t-test, we do not reject µC = µK (δ = 0); the 95%
confidence interval for δ is given by (−11.06, 7.94). (Note that the confidence interval
contains zero, indicating non-rejection of δ = 0.)
(b) If the data are paired then we consider the sample of differences, di = xCi − xKi :
–0.9, –1.8, –3.7, –2.1, –1.7, –2.0, 1.1, –3.8, –1.3, 0.6.
For this sample, t = d̄/(sd/√10) = −1.56/(1.578/√10) = −1.56/0.499 = −3.13, cf. c0.975(t9) = 2.262;
so we reject δ = 0 (i.e. we reject µC = µK).
95% CI: −1.56 ± 2.262×0.499 = (−2.69, −0.43) (which does not contain zero).
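The contrast between the two analyses is easy to see in R, with C and K as in the output above:
    t.test(C, K)                  # independent samples (Welch): t = -0.345, p = 0.73
    t.test(C, K, paired = TRUE)   # paired: t = -3.13, p = 0.012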
7.6 (a) [plot not shown]
(b) n1 = 6, x̄1 = 36.7, s1 = 5.6; n2 = 4, x̄2 = 57.5, s2 = 24.1; (s = 15.4)
pooled-t: t = (x̄1 − x̄2)/(s√(1/n1 + 1/n2)) = −2.09; cf. c0.975(t8) = 2.306; p = 0.070;
unpooled-t: t = (x̄1 − x̄2)/√(s1²/n1 + s2²/n2) = −1.70; cf. c0.975(t3) = 3.182; p = 0.188.
(c) With 53 instead of 93:
n1 = 6, x̄1 = 36.7, s1 = 5.61;
n2 = 4, x̄2 = 47.5, s2 = 5.92; (s = 5.73)
pooled-t: t = (x̄1 − x̄2)/(s√(1/n1 + 1/n2)) = −2.93; cf. c0.975(t8) = 2.306; p = 0.019.
(d) Even though the difference between the means is reduced, the t-test is now significant.
The outlier affects not only the mean, but also the standard deviation, reducing the t-
statistic, and making it non-significant. The t-test does not perform well in the presence
of outliers (whether the pooled or unpooled test is used).
7.7 PTCA: p̂1 = 35/96 = 0.365;
MT: p̂2 = 55/104 = 0.529; ⇒ p̂ = 90/200 = 0.450;
p̂1 − p̂2 = −0.164, se(p̂1 − p̂2) = √(0.45×0.55(1/96 + 1/104)) = 0.070.
z = −0.164/0.070 = −2.333, p = 0.020.
There is significant evidence here that p1 < p2 , i.e. that PTCA is more effective in preventing
angina.
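The squared z-statistic is what R's two-proportion test reports; a minimal sketch:
    prop.test(c(35, 55), c(96, 104), correct = FALSE)   # X-squared = z^2 = 5.44, p = 0.020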
7.8 What should be done with the “lost to survey” individuals? If these individuals are omitted
then for the resulting 2 × 2 table, we have χ²₁ = 3.77, so that p > 0.05 and we do not reject H0.
This test indicates there is no significant evidence of any change in the improvement rate.
Note: If we choose to omit the “lost to survey” individuals, then we are implicitly assuming that these
individuals are similar to those that remain in the sample. This would not be the case if, for example,
individuals who showed no improvement were more inclined to remove themselves from the survey.
This is a common problem with non-respondents. We must make (reasonable) assumptions about their
behaviour — and attempt to justify these assumptions.
7.9 Group 1 (30% O2): p̂1 = 28/250 = 0.112;
Group 2 (80% O2): p̂2 = 13/250 = 0.052; p̂ = 41/500 = 0.082.
p̂1 − p̂2 = 0.060, se0(p̂1 − p̂2) = √(0.082×0.918(1/250 + 1/250)) = 0.0245;
z = est/se0 = 0.060/0.0245 = 2.445, p = 0.014.
Since p < 0.05, there is significant evidence in these data to indicate that p2 < p1 , i.e. that the
rate of wound infection is less with the 80% oxygen treatment.
7.10 obs: 58 166 193 417 exp: 118.9 170.2 127.9 417
870 1163 806 2839 809.1 1158.8 871.1 2839
928 1329 999 3256 928 1329 999 3256
u = Σ(o−e)²/e = 73.79; df = 2, c0.95(χ²₂) = 5.991; p = 0.000.
There is significant evidence of an association between nausea and seat position. The indi-
viduals in the rear seats are more likely to experience nausea, and those in the front seats are
less likely to experience nausea. (This is seen by comparing observed and expected frequencies
based on independence. If nausea and seat position were independent, we would expect about
128 of those in the back seats to experience nausea, whereas 193 were observed. And for the
front seats, we observed 58 compared to the expected 119.)
7.11 i. case D: p̂1 = 63/100 = 0.63;
control D′: p̂2 = 48/100 = 0.48; p̂ = 111/200 = 0.555;
p̂1 − p̂2 = 0.15, se(p̂1 − p̂2) = √(0.555×0.445(1/100 + 1/100)) = 0.0703;
z = est/se = 0.15/0.0703 = 2.134, p = 0.033.
We reject H0, and conclude that there is significant evidence in these data that the cases
have a greater probability of exposure (compared to the controls).
Note: treating the data as a 2×2 contingency table gives u = 4.55 (= z²), p = 0.033.
ii. θ̂ = (63×52)/(48×37) = 1.84; ln θ̂ = 0.612, se(ln θ̂) = √(1/63 + 1/37 + 1/48 + 1/52) = 0.288
95% CI for ln θ: (0.612 ± 1.96×0.288) = (0.048, 1.177)
95% CI for θ: (e^0.048, e^1.177) = (1.05, 3.24)
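The odds-ratio interval can be reproduced from the 2×2 counts; a minimal R sketch:
    or <- (63*52) / (48*37)                     # sample odds ratio, 1.84
    se.log <- sqrt(1/63 + 1/37 + 1/48 + 1/52)   # se of ln(OR), 0.288
    exp(log(or) + c(-1, 1) * 1.96 * se.log)     # 95% CI (1.05, 3.24)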
7.12 (a) current users: t1 = 4761, x1 = 13; α̂1 = 13/4761 = 0.002731, se(α̂1) = √(α̂1/t1) = 0.000757;
never users: t3 = 98091, x3 = 113; α̂3 = 113/98091 = 0.001152, se(α̂3) = √(α̂3/t3) = 0.000108.
rate-difference, α1 − α3:
est.diff = 0.001579, se0 = 0.000519; z = est/se0 = 3.039, p = 2 Pr(Z > 3.039) = 0.002.
There is significant evidence in these data that α1 > α3, i.e. that the incidence rate among
current-users is greater than among never-users.
rate ratio φ = α1/α3 (see EDDA p165) [Note: α̂ = (13 + 113)/(4761 + 98091) = 0.001225.]
φ̂ = 2.37, ln φ̂ = 0.863, 95% CI for ln φ: 0.863 ± 1.96×0.293 = (0.289 < ln φ < 1.437);
95% CI for φ: (1.34 < φ < 4.21).
Note: the CI excludes 1, indicating there is evidence that φ > 1, i.e. α1 > α3.
(b) past users: t2 = 121091, x2 = 164; α̂2 = 164/121091 = 0.001354, se(α̂2) = √(α̂2/t2) = 0.000106;
never users: t3 = 98091, x3 = 113; α̂3 = 113/98091 = 0.001152, se(α̂3) = √(α̂3/t3) = 0.000108.
rate-difference, α2 − α3:
est.diff = 0.000202, se0 = 0.000153; z = est/se0 = 1.325, p = 2 Pr(Z > 1.325) = 0.185.
There is no significant evidence in these data that α2 ≠ α3, i.e. no evidence that the
incidence rate among past-users is different from the rate among never-users.
rate ratio φ = α2/α3 (see EDDA p165) [Note: α̂ = (164 + 113)/(121091 + 98091) = 0.001264.]
φ̂ = 1.18, ln φ̂ = 0.162, 95% CI for ln φ: 0.162 ± 1.96×0.122 = (−0.078 < ln φ < 0.401);
95% CI for φ: (0.93 < φ < 1.49).
Note: the CI includes 1, indicating no evidence against φ = 1, i.e. α2 = α3.
Problem Set 8
8.1 i. yes: a positive relationship; straight-line regression looks OK, there may be question-
marks at the ends, but there are only a few observations there.
ii. E(Y | x) = α + βx, var(Y | x) = σ 2 ; and the errors are independent.
We also usually assume that the distribution is Normal.
iii. β̂ = 0.273 indicates that the average FEV increases by 0.273 L for each year of age.
iv. µ̂(10) = 0.0736 + 0.27348×10 = 2.81.
v. R2 is the proportion of the variation of FEV explained by the boys’ ages.
vi. r = √R² = √0.658 = 0.811 (It is positive because the relationship is positive, as seen
from the scatter plot and/or the fact that β̂ > 0.)
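In R all of these quantities come from one linear-model fit; a sketch, assuming a (hypothetical) data frame fev with columns age and FEV:
    fit <- lm(FEV ~ age, data = fev)
    summary(fit)                                   # slope 0.273, R-squared 0.658
    predict(fit, newdata = data.frame(age = 10))   # fitted mean FEV at age 10, about 2.81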
8.2 i. β̂ = r × sy/sx = −0.8×(6.667/3.333) = −1.6; α̂ = ȳ − β̂x̄ = 20 + 1.6×10 = 36; µ̂(x) = 36 − 1.6x.
ii. s² = ((n−1)/(n−2))(1 − r²)sy² = (9/8)(1 − 0.8²)×6.667² = 18;
K = Σ(x − x̄)² = 9×3.333² = 100; se(β̂) = √(18/100) = √0.18 = 0.424;
95% CI for β: −1.6 ± 2.306×0.424 = (−2.58, −0.62).
iii. From the correlation SP diagram (Figure 10):
n = 10, r = −0.8 ⇒ approx 95% CI for ρ: (−0.95 < ρ < −0.34)
8.3 i.&ii. [plots not shown]
iii. From the correlation SP diagram (Figure 10):
n = 50, r = −0.5 ⇒ approx 95% CI for ρ: (−0.68 < ρ < −0.26)
8.4 i. Exactly the same fitted regression line results in each case: y = 3.0 + 0.5x.
ii. The point that Anscombe wanted to make was that it is important to examine the scatter-
plots before calculating regression lines and correlations. By looking at just the regression
analyses, we would not have seen how different the data sets were.
Comment: Data set 1 (y1 on x1 ) looks reasonable for the usual assumptions and so the regression
is meaningful and appropriate. Set 2 (y2 on x1 ) is curvilinear and therefore linear regression is not
appropriate. Set 3 (y3 on x1 ) lies almost on an exact straight line except for one observation which
looks like an outlier and should therefore be investigated further before carrying out the regression.
Set 4 (y4 on x4) looks very unusual. The x values are identical except for one. With only two x
values represented there is no way of knowing if the relationship is linear or non-linear.
iii. The observed value at x4 = 19 is 12.5, which is the same as the predicted value. Changing
y4 from 12.5 to 10 and refitting the regression line results in a predicted value of 10, which
is the same as the observed again. From the plot, we can see that the point (19, 12.5) is
used to fit the regression line, resulting in the observed being the same as the fitted.
8.5 (a) response variable, y = size of tumour; explanatory variable, x = level of chemical in the
blood.
(b) Yi = α + βxi + Ei, where Ei ~ N(0, σ²); the Ei are assumed to be independent.
A residual plot indicates E(Ei ) = 0 (average at zero), var(Ei ) = K (spread roughly con-
stant), and linearity of the model (no curved pattern in the residual plot); a normal plot
of the residuals checks their normality. The scatter plot indicates the reasonableness of
the straight line regression.
(c) i. β̂ = −0.15;
ii. An increase of 1 mg/L of this chemical in the blood corresponds to a decrease of
0.15cm in the mean tumour size.
(d) A test of β = 0 is given by t = β̂/se(β̂) = −0.15123/0.00987 = −15.32, which is significant, compared
to t23. We conclude that there is significant evidence in these data indicating β < 0.
(e) No: if 0 ∈ CI we would not reject (β = 0).
(f) i. µ̂i = 10.3 − 0.15×25 = 6.55;
ii. êi = yi − µ̂i = 1.45;
iii. se(µ̂i) = √(1.213²/25 + (25 − 45)²×0.00987²) = 0.313;
90% CI for µi: (6.55 ± 1.714×0.313) = (6.01, 7.09).
(g) R² indicates the proportion of the variation in y explained by the explanatory variable x:
in this case, about 90%.
r = −√R² = −√0.895 = −0.946 (r < 0 since there is a negative relationship, as β̂ < 0).

8.6 (a) assuming (X, Y ) bivariate



normal,
0.22 160
to test ρ=0, t = 1−0.222 = 2.85, cf. c0.975 (t160 ) = 1.975, p = 0.005.
(b) reject β=0: the t-statistic used is the same as the one used to test ρ=0. Thus we reject
ρ = 0: there is significant evidence of a positive correlation between LDL and obesity.
(c) coefficient of determination, R2 = 0.222 = 0.048; obesity, as measured by the ponderal
index, explains about 5% of the variation in LDL.
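The t-statistic for testing ρ = 0 needs only r and the degrees of freedom; a minimal R sketch:
    r <- 0.22; df <- 160
    t <- r * sqrt(df) / sqrt(1 - r^2)   # 2.85
    2 * pt(-abs(t), df)                 # two-sided p = 0.005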
8.7 (a) [scatterplot of Fe against Protein, by species, not shown]
(b) i. correlation for Fe & Protein gives r = −0.675;
ii. correlation for Fe & Protein, by Species, gives r1 = 0.880, r2 = 0.781 and r3 = 0.903.
(c) The scatterplot shows that the species are distinct groups, and within each group there is
a positive correlation between Fe and Protein. However, when the groups are combined
a negative correlation is ‘induced’ because the different species have different levels of Fe
and Protein. The overall correlation is an artefact and the separate correlations should be
reported.
8.8 We have n = 50, r = −0.40. From the correlation SP diagram (Figure 10), we obtain:
95% CI for ρ: (−0.61 < ρ < −0.14).
So there is significant evidence in these data to indicate that a negative relationship exists
(i.e. ρ < 0), since the CI lies entirely below zero and 0 ∉ CI.
8.9 (a) t = −0.796√9/√(1 − 0.796²) = −3.945, cf. c0.975(t9) = 2.262, p = 0.003;
hence we reject the hypothesis that the variables are uncorrelated. There is evidence here
that they are negatively correlated.
(b) i. t is the t-statistic to test β = 0; it is equal to the t-statistic calculated in (a) to test ρ = 0.
Thus t = −3.945 and p = 0.003.
ii. µ̂(10) = 88.80 − 10×2.334 = 65.46;
pe(Y(10)) = √(2.188² + 2.188²/11 + (10 − 9.66)²×0.591²) = 2.294;
95% PI for Y(10): (65.46 ± 2.262×2.294) = (60.3, 70.7).
8.10 i. To test β = 0: t = 0.27348/0.01080 = 25.33, cf. c0.975(t334) = 1.967, p = 0.000;
hence we reject the hypothesis that there is no relation between FEV and age: the data
indicate there is a positive relationship.
ii. 95% CI for β: 0.27348 ± 1.967×0.01080 = (0.252, 0.295).
iii. H0: β = 0.16. We reject H0 since 0.16 ∉ CI; or t = (0.2735 − 0.16)/0.01080 = 10.51.
iv. s² = 0.588102² = 0.346.
v. µ̂(10) = 2.8084; se(µ̂(10)) = √(0.588102²/336 + 0.02²×0.01080²) = 0.03208;
95% CI for µ(10): 2.8084 ± 1.967×0.03208 = (2.745, 2.872).
vi. pe(Y(10)) = √(0.588102² + 0.588102²/336 + 0.02²×0.01080²) = 0.5890;
95% PI for Y(10): 2.8084 ± 1.967×0.5890 = (1.65, 3.97).
Revision Problem Set R1
R1.1 (b) i. Randomisation has the effect of averaging out any possible confounding variables
(whether they are observed or not). Confounding factors could affect the results of
the experiment: without randomisation, the observed change in the response vari-
able could be attributed to something other than the treatment. Thus, randomisation
increases the validity of the experiment and adds weight to the evidence for causa-
tion.
ii. A control group is a group of individuals who do not receive the treatment. This
group is compared to a group of individuals who do receive the treatment. Often
the control group is given a placebo, i.e. a pseudo-treatment that looks like the real
thing but which is known to be neutral. The control group forms a baseline for
comparison, so as to better detect the effect of the treatment (and only the treatment).
(c) [causal diagram: confounder C, linked to E with a − sign and to D with a + sign, thereby inducing an association between E and D]
R1.2 (a) i. med = 64.1; Q1 = 51.8, Q3 = 70.7.
ii. x(k) ~ cq, where q = k/(n+1); thus x(1) ~ c0.1 = µ − 1.28σ.
iii.
z x
-1.28 40.4
-0.84 48.8
-0.52 54.8
-0.25 59.2
0.00 64.1
0.25 65.0
0.52 68.7
0.84 72.7
1.28 75.1
(b) point estimate of µ, µ̂ = 60.98.
95% interval estimate of µ: est ± t se = 60.9778 ± 2.306×11.3769/√9 = (52.23, 69.72).
R1.3 (b) i. p1(1−p2)/((1−p1)p2) = 2 ⇒ p1 − p1p2 = 2p2 − 2p1p2 ⇒ p1 = 2p2 − p1p2 ⇒ p1/p2 = 2 − p1.
So, when p1 = 0.1, p1/p2 = 1.9;
ii. and when p1 → 0, p1/p2 → 2;
iii. and when p1 → 1, p1/p2 → 1.
iv. OR = 2 ⇒ 1 < RR < 2;
and if the risks are small (which is likely for a case-control study), then RR will be
slightly less than 2.
R1.4 (b) i. Pr(L > 50) = Pr(Ls > (50 − 40)/4) = Pr(Ls > 2.5) = 0.0062.
ii. Pr(L > T) = Pr(L − T > 0), where L − T ~ N(−10, 4² + 2²);
= Pr((L − T)s > (0 + 10)/√20) = Pr(N > 2.236) = 0.0127.
(c) i. V = var(T) = w²var(T1) + (1 − w)²var(T2) = w² + 4(1 − w)² = 5w² − 8w + 4.
V is minimised when dV/dw = 0; dV/dw = 10w − 8 = 0 ⇒ w = 0.8.
ii. θ̂ = 0.8×50.0 + 0.2×55.0 = 51.0; se(θ̂) = √(0.8²×1.0² + 0.2²×2.0²) = 0.89.
R1.5 (b) prevalence estimate, p̂H = 350/2000 = 0.175.
95% CI for pH: est ± 1.96 se = 0.175 ± 1.96√(0.175×0.825/2000) = (0.158, 0.192).
(c) i. Not all 400 individuals are observed for five years: some become cases, some may
leave the study early (others may enter it late) and some may die.
ii. incidence rate estimate α̂ = 36/1200 = 0.03 (cases per person-year).
95% CI for α: 0.03 ± 1.96√(0.03/1200) = (0.020, 0.040).
R1.6 (b) i. H0 ⇒ Z ~ N(0, 1)
significance level = Pr(reject H0 | H0)
= Pr(Z > 1.96) + Pr(Z < −1.96), where Z ~ N(0, 1)
= 0.025 + 0.025
= 0.05.
ii. H1 (θ = 2.80) ⇒ Z ~ N(2.80, 1)
power = Pr(reject H0 | H1)
= Pr(Z > 1.96) + Pr(Z < −1.96), where Z ~ N(2.80, 1)
= Pr(Zs > −0.84) + Pr(Zs < −4.76)
= 0.800 + 0.000
= 0.80.
(c) i. E(Z) = (41 − 40)/(5/√n) = √n/5.
ii. To have power 0.80, we require E(Z) = 2.8, i.e. √n/5 = 2.8 ⇒ n = 196.
R1.7 (a) This can be tested using either a χ²-test or a z-test. They are equivalent.
obs  A  A′         exp  A  A′
P    10 30  40     P    15 25  40
P′   20 20  40     P′   15 25  40
     30 50  80          30 50  80
uc = Σ(o−e)²/e = 5²(1/15 + 1/25 + 1/15 + 1/25) = 5.33, p = 0.021.
zc = (0.25 − 0.5)/√(0.375×0.625(1/40 + 1/40)) = −2.309, p = 0.021. (Note: 2.309² = 5.33.)
There is significant evidence that PTCA reduces the risk of angina.
(b) Let θ denote the odds ratio. θ̂ = 1/3.
ln θ̂ = −1.0986, and se(ln θ̂) = √(1/10 + 1/30 + 1/20 + 1/20) = 0.483.
95% CI for ln θ: −1.0986 ± 1.96×0.483 = (−2.045, −0.152).
95% CI for θ: (0.129, 0.859).
The confidence interval suggests that the odds ratio is less than 1, indicating that PTCA
reduces the odds of angina, in accordance with the result of (a), which indicated that PTCA
reduces the risk of angina.
R1.8 From the correlation SP-diagram, with n = 50 and r = −0.40, we obtain:
95% CI for ρ: (−0.61, −0.13).
Since 0 ∉ CI, there is evidence of a negative relationship between the two measures.
Revision Problem Set R2
R2.1 (a) A placebo is an inactive drug which looks the same as the treatment drug. It is used to
give a baseline or control level, against which the treatment can be compared.
(b) Randomisation is important to ensure validity and to balance the effects of any potential
confounding or lurking variables.
The subjects should be randomly allocated so that each is equally likely to receive the
treatment or the placebo.
(c) Assuming this experiment was performed as a randomised controlled trial, then a signif-
icant result provides evidence supporting drug ZZZ as a cause of improvement.
R2.2 (a) The sample data are negatively skewed with mean 67 and standard deviation 20.
(b) [1]; the horizontal scale gives the standard normal quantiles with grid z = −2, −1, 0, 1, 2
(the tick-marks are at −2, 0, 2); the vertical scale gives the sample quantiles, with grid
x = 0, 10, . . . , 100 (tick-marks at 0, 20, . . . , 100).
(c) i. approx 95% CI for µ: 66.6 ± 1.99×2.2 = (62.2, 71.0).
Note: the t-distribution is not strictly appropriate here, as the population is non-normal. As
the sample is moderately large it provides a reasonable approximation.
ii. (x(2), x(79)) = (16, 96) gives a 77/81 = 95.1% prediction interval.
R2.3 (a) q zq xq
median 0.5 0 100
Q1, Q3 0.25, 0.75 ±0.67 93.3, 106.7
min, max 0.005, 0.995 ±2.58 74.2, 125.8
Note: For a sample of n = 200, x(1) ~ cq, where q = 1/201 ≈ 0.005, and x(200) ~ cq, where
q = 200/201 ≈ 0.995. Thus the minimum and maximum are approximated by the 0.005 and 0.995
quantiles.
An approximate (average) boxplot for this sample: [boxplot not shown]
(b) i. probability of an outlier = 2 Pr(Xs > 0.6745 + 1.5×2×0.6745)
= 2 Pr(Xs > 2.698) = 0.007;
OR 2 Pr(X > 106.7 + 1.5(106.7 − 93.3)) = 2 Pr(X > 126.98) = 0.007.
ii. Pr(at least one outlier) = 1 − Pr(no outliers) = 1 − 0.993²⁰⁰ = 0.755.
R2.4 (a) X ~ Bi(16, 0.15); Pr(X ≤ 1) = 0.0743 + 0.2097 = 0.284.
It is assumed that the tutorial class is a representative sample of university students (i.e.
essentially random with respect to left-handedness).
(b) i. the weights (given to each paper-estimate);
ii. 0.7092 = 25.00/35.25;
iii. est = 1.49 (i.e. sum of w×est); se = 1/√35.25 = 0.17.
R2.5 (a) c0.025 (t8 ) = −2.306, c0.975 (t8 ) = 2.306, from the tables.
(b) i. 95% CI for µ: 2.65 ± 2.306×0.36/√9 = (2.37, 2.93).
ii. 2.90 is in the confidence interval. Hence we do not reject H0 , i.e. there is no significant
evidence of a difference in means. There is no evidence in the data that the mean
vitamin A level for stomach cancer patients is different from the controls.
R2.6 (a) i. significance level = Pr(reject H0; H0 true) = Pr(|Z| > 1.96), where Z ~ N(0, 1);
= 0.025 + 0.025 = 0.05.
ii. power = Pr(reject H0; H1 true) = Pr(|Z| > 1.96), where Z ~ N(3.24, 1);
= Pr(Z > 1.96) + Pr(Z < −1.96)
= Pr(Zs > −1.28) + Pr(Zs < −5.20)
= 0.8997 + 0.0000 = 0.90.
(b) i. E(Z) = E((X̄ − 30)/(10/√n)) = (E(X̄) − 30)/(10/√n) = (µ − 30)/(10/√n);
ii. (31 − 30)/(10/√n) = 3.24 ⇒ √n = 32.4,
so the sample size needs to be at least 1050.
Note: the formula gives n ≥ (1.96 + 1.28)²×10²/(31 − 30)².
R2.7 (a) expected values, e: 50, 100, 50 (in each of the two rows);
⇒ u = (10²/50 + 0²/100 + 10²/50)×2 = 8.
p = Pr(χ²₂ > 8) ≈ 0.02, using the tables.
Since p < 0.05 we reject the null hypothesis of independence. The data indicate that there
is significant evidence that the treatment is better than the placebo.
(b) i. Normally distributed.
ii. ranks: 1 2 3 4 5 7 → w̄1 = 11/3;  6 8 9 → w̄2 = 23/3
z = (w̄1 − w̄2)/√((1/12)×9×10×(1/6 + 1/3)) = −4/1.936 = −2.07.
Since |z| > 1.96, the rank test indicates rejection of the null hypothesis at the 5%
significance level.
Revision Problem Set R3
R3.1 (a) (A) A placebo is an inactive drug, which appears the same as the active drug. It is desir-
able in order to ascertain whether the active drug is having an effect.
(B) In a double-blind study, neither the subject nor the treatment provider knows whether
the treatment is the active or the inactive drug. It is desirable in order to guard
against any possible bias: on the part of the subject or on the part of the treatment
provider (due to prior expectations).
(C) In favour: there would be no between-subject variation. Against: there may be
carry-over effects, from one treatment to the next. The results may not be generalis-
able: is Claire representative?
(D) There can be no carry-over effect in this case. It is likely to be generalisable to a
larger population (the population that the subjects represent). Choose a random
order for AAAAAAAABBBBBBBBCCCCCCCC (using R sampling) and assign these
treatments to subjects 1, 2, . . . , 24.
(E) This method eliminates the between subject variation, but there may be possible
carry-over effects. For each subject, choose a random order for ABC.
(b) i. [causal diagram: C with links to both E and D]
ii. [causal diagram: S → X → H, with the S–X and X–H links each labelled −, alongside the direct S–H association labelled +]
It seems likely that X may be part of the causal link between smoking and cancer, as
indicated in the diagram, and can therefore not be considered as a confounder.
R3.2 (a) i. x̄ = 30.03;
ii. s = 4.069;
iii. Q3 = x(15) = 33.2;
iv. ĉ0.1 = x(2) = 24.4.
(b) i. x̄ ≈ µ = 31;
ii. s ≈ σ = 5;
iii. Q3 ≈ c0.75 = 31 + 0.6745×5 = 34.4;
iv. ĉ0.1 ≈ c0.1 = 31 − 1.2816×5 = 24.6.
(c) i. k = 4: x-coordinate = Φ⁻¹(4/20) = −0.84; y-coordinate = x(4) = 26.9.
ii. µ̂ = 30 (y-intercept); σ̂ = 4 (slope = (34 − 30)/(1 − 0)).
iii. A normal probability plot is a QQ-plot with axes interchanged (and the population
quantile axis relabelled).
R3.3 (a) The probability table below can be found from the given information: Pr(E) = 0.4, so
Pr(E ′ ) = 0.6; Pr(E ∩ D) = Pr(E) Pr(D | E) = 0.4×0.1 = 0.04 and Pr(E ′ ∩ D) =
Pr(E ′ ) Pr(D | E ′ ) = 0.6×0.2 = 0.12. The other entries follow by subtraction, and addi-
tion.
D D′
E 0.04 0.36 0.4
E ′ 0.12 0.48 0.6
0.16 0.84 1
Then, from the probability table, we obtain:
i. Pr(D) = 0.16;
ii. Pr(E | D) = 0.04/0.16 = 0.25;
iii. negatively related since, for example, Pr(D | E) < Pr(D | E′);
iv. OR = (0.04×0.48)/(0.12×0.36) = 4/9 = 0.44.
(b) i. sensitivity = Pr(P | D) = 85/100 = 0.85; specificity = Pr(P′ | D′) = 90/100 = 0.90.
ii. Using prevalence, Pr(C) = 0.1, we can complete the probability table:
P P′
C 0.085 0.015 0.1 (0.85)
C′ 0.090 0.810 0.9 (0.90)
0.175 0.825 1
Hence ppv = 0.085/0.175 = 0.486.
iii. The maximum value of ppv occurs when the sensitivity is equal to 1.
Thus ppvmax = 0.1/0.19 = 0.526.
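Parts ii and iii are just Bayes' theorem, which can be wrapped in a small R function (the function name is ours):
    ppv <- function(sens, spec, prev) {
      sens * prev / (sens * prev + (1 - spec) * (1 - prev))
    }
    ppv(0.85, 0.90, 0.1)   # 0.486
    ppv(1.00, 0.90, 0.1)   # maximum ppv, 0.526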
R3.4 (a) X ~ Bi(240, 0.3). Therefore E(X) = 240×0.3 = 72 and sd(X) = √(240×0.3×0.7) = 7.10.
approximate 95% probability interval: 72 ± 1.96×7.10 = (58.1, 85.9).
(b) X ~ Pn(22) ⇒ E(X) = 22, sd(X) = √22 = 4.69.
approximate 95% probability interval: 22 ± 1.96×4.69 = (12.8, 31.2).
(c) i. 99% probability interval for Y: 5.0 ± 2.5758×0.8 = (2.94, 7.06);
ii. Pr(Y > 6.0) = Pr(Ys > 1.25) = 0.106;
iii. Pr(Y > 7.0 | Y > 6.0) = Pr(Y > 7.0)/Pr(Y > 6.0) = Pr(Ys > 2.5)/Pr(Ys > 1.25) = 0.0062/0.1056 = 0.059.
R3.5 (a) 10 ± 1.96×1/√12 = (9.43, 10.57);
(b) α = 2 Pr(X̄ > 10.6 | µ = 10) = 2 Pr(X̄s > 2.078) = 0.038;
(c) p = 2 Pr(X̄ > 10.8 | µ = 10) = 2 Pr(X̄s > 2.771) = 0.006;
(d) power = 1 − Pr(9.4 < X̄ < 10.6), where X̄ ~ N(11, 1/12);
power = 1 − Pr(−5.542 < X̄s < −1.386) = 1 − 0.0829 = 0.917;
(e) 95% confidence interval for µ: 10.8 ± 1.96×1/√12 = (10.23, 11.37);
(f) 95% prediction interval for X: 10.8 ± 1.96×√(1 + 1/12) = (8.76, 12.84).
R3.6 (a) 95% confidence interval for mean difference: 15.0 ± 2.045×18.4/√30 = (8.13, 21.87);
there is significant evidence of an increase in mean vitamin D levels.
(b) p̂B = 400/2000 = 0.2; se(p̂B) = √(0.2×0.8/2000) = 0.0089;
95% confidence interval for pB: 0.2 ± 1.96×0.0089 = (0.182, 0.218).
(c) Under the null hypothesis (of 'normal' risk), the number of cases of K, X ~ Pn(16).
i. p = 2 Pr(X ≥ 28) = 2 Pr(Xs* > (27.5 − 16)/4) = 0.004, so there is significant evidence
of excess risk.
ii. 95% confidence interval for µ: 28 ± 1.96√28 = (17.6, 38.4);
95% confidence interval for SMR = µ/16: (1.1, 2.4).
R3.7 (a) i. 8.750 = 25×35/100; 4.464 = (15 − 8.75)²/8.75.
ii. u = 4.464 + · · · + 1.731 = 12.09, cf. χ²₂;
Tables: 0.001 < p < 0.005, so we reject H0 , and conclude that there is significant
evidence of a difference between the groups.
(b) p̂1 = 0.6, p̂2 = 0.4; so p̂ = 0.5.
z = (0.6 − 0.4)/√(0.5×0.5(1/25 + 1/25)) = 1.414, so that p = 2 Pr(Z > 1.414) = 0.157;
and we conclude that there is no significant evidence of a difference between the proba-
bility of improvement with A and with B.
(c) se = √(0.5×0.5(1/n + 1/n)) = √(0.5/n); thus we require 1.96√(0.5/n) ≤ 0.15 ⇒ n ≥ 86.
R3.8 i. [scatterplot not shown]
ii. K = (n − 1)sx² = 49×10² = 4900; rxy = sxy/(sx sy) ⇒ sxy = 0.4×10² = 40.
∴ β̂ = 40/100 = 0.4 and α̂ = 30 − 0.4×30 = 18.
iii. fitted line: y = 18 + 0.4x, shown on diagram.
iv. se(β̂) = √(85.75/4900) = 0.132;
95% confidence interval for β: 0.4 ± 2.011×0.132 = (0.13, 0.67).
v. Tables (SP diagram for correlation): 0.15 < ρ < 0.60.
Revision Problem Set R4
R4.1 (a) i. x̄ ≈ 140;
ii. s ≈ 10;
iii. f ≈ 100×0.1587, so f ≈ 16 (Pr(X > 150) = Pr(Xs > 1) = 0.1587);
iv. Q3 ≈ 140 + 0.67×10 = 146.7 (z0.75 = 0.6745);
v. max ≈ 140 + 2.33×10 = 163.3 (z0.99 = 2.3263).
(b) i. boxplot: (117, 133, 140, 147, 163);

ii. roughly a straight line with intercept ≈ 140 and slope ≈ 10, but with points in an
increasing sequence.
R4.2 i. observational study;
ii. women 40–44 years old at baseline;
iii. prospective study;
iv. to avoid age dependence of myocardial infarction;
v. α̂1 = 31/23058 = 0.001344, se(α̂1) = √(0.001344/23058) = 0.000241;
α̂2 = 19/40730 = 0.000466, se(α̂2) = √(0.000466/40730) = 0.000107;
α̂1 − α̂2 = 0.000878, se0(α̂1 − α̂2) = √(0.000784(1/23058 + 1/40730)) = 0.000231.
z = (α̂1 − α̂2)/se0(α̂1 − α̂2) = 0.000878/0.000231 = 3.805, p = 0.000. [2 Pr(Z > 3.805) = 0.000142]
Since z > 1.96 (or p < 0.05) we reject H0 (α1 = α2).
There is significant evidence in these data that OC-users have a greater incidence of
myocardial infarction.
vi. µ1 − µ2 = 50000(α1 − α2); est = 43.9, se = 13.2; 95% CI: 43.9 ± 1.96×13.2 = (18.0, 69.8).
Note: the point and interval estimates for µ1 − µ2 are just 50 000 times the point and interval
estimates for α1 − α2; se(α̂1 − α̂2) = √(0.000241² + 0.000107²) = 0.000264.
The difference is the increase in the number of myocardial infarctions associated with
OC-use among 10 000 women in five years.
R4.3 (a) prob = 0.6 + 0.6 − 0.6² = 0.84 or 1 − 0.4² = 0.84;
(b) P P′
C 0.18 0.12 0.3
C′ 0.07 0.63 0.7
0.25 0.75 1
Pr(C | P) = 0.18/0.25 = 0.72;
sensitivity, sn = Pr(P | C) = 0.6;
negative predictive value, npv = Pr(C′ | P′) = 0.63/0.75 = 0.84;
(c) relative risk, RR = Pr(D | E)/Pr(D | E′); i.e. the ratio of the probability of the disease given the
exposure to the probability of the disease given non-exposure;
prevalence is required to estimate relative risk.
R4.4 (a) step-function cdf: F(x) = 0.2, (0 ≤ x < 1); 0.6, (1 ≤ x < 2); 0.9, (2 ≤ x < 3); 1.0, (x ≥ 3).
(b) E(X) = 0×0.2 + 1×0.4 + 2×0.3 + 3×0.1 = 1.3;
var(X) = E((X − 1.3)²) = 1.69×0.2 + 0.09×0.4 + 0.49×0.3 + 2.89×0.1 = 0.81;
or var(X) = E(X²) − E(X)² = 0²×0.2 + 1²×0.4 + 2²×0.3 + 3²×0.1 − 1.3² = 0.81;
(c) i. E(T) = 100×1.3 = 130; var(T) = 100×0.81 = 81, so sd(T) = 9;
ii. central limit theorem: a sum of iid random variables is asymptotically Normal;
iii. Pr(T ≤ 125) ≈ Pr(T* < 125.5) = Pr(Ts* < −0.5) = 0.309.
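The moments and the Normal approximation can be checked numerically in R:
    x <- 0:3; p <- c(0.2, 0.4, 0.3, 0.1)
    EX <- sum(x * p)             # 1.3
    VX <- sum((x - EX)^2 * p)    # 0.81
    pnorm(125.5, mean = 100*EX, sd = sqrt(100*VX))   # Pr(T <= 125) with continuity correction, 0.309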
R4.5 (a) sd(X̄) = σ/√n = 10/√11 = 3.0.
It is assumed that Ms J's blood pressure is stable and that the daily readings are independent.
(b) µ̂ = 142; se(µ̂) = 20/√25 = 4;
R4.6 (a) i. x̄ = 50.0/5 = 10.0; s² = ¼(9 + 1 + 1 + 9) = 5.0, or s² = ¼(520 − 50²/5) = 5.0.
ii. 95% PI for X: 10.0 ± 2.776×√(5(1 + 1/5)) = (10.0 ± 6.8) = (3.2, 16.8);
(b) The mean of a six-month period is 6×2.75 = 16.5. So, the number of cases in a six-month
period, X ~ Pn(16.5).
Pr(X ≤ 10) ≈ Pr(X* < 10.5) = Pr(Xs* < (10.5 − 16.5)/√16.5) = Pr(Xs* < −1.477) = 0.070.
Using R, Pr(X ≤ 10) = 0.0619.
(c) i. t = (448 − 500)/(80/√16) = −52/20 = −2.6; cf. c0.975(t15) = 2.131;
80/ 16
ii. p = 2 Pr(t15 < −2.6) ≈ 0.02.
iii. reject H0 . There is significant evidence that the mean MDI is less than 500.
R4.7 (a) Z ~ Bi(100, 0.95);
(b) i. α = Pr(|W| > 2.17), where W ~ N(0, 1), = 2×0.015 = 0.03;
ii. p = 2 Pr(W > 1.53), where W ~ N(0, 1), = 2×0.063 = 0.126;
iii. power = Pr(|W| > 2.17), where W ~ N(3, 1), = Pr(Ws > −0.83) = 0.797.
(c) i. X ≤ 4 or X ≥ 19;
ii. 2.2 < λ < 13.1;
iii. 0.44 < α < 2.62, since α = λ/5.
R4.8 (a) s² = (14×60 + 9×44.7)/23 = 54.0;
(b) 95% CI for µ1 − µ2: 6.5 ± 2.069×√(54(1/15 + 1/10)) = (6.5 ± 6.2) = (0.3, 12.7).
(c) There is significant evidence that the treatment gives a greater decrease in mean diastolic
blood pressure.
R4.9 i. β̂ = 180/100 = 1.8; α̂ = ȳ − β̂x̄ = 20 − 30×1.8 = −34;
ii. s² = ⅛(400 − 180²/100) = 76/8 = 9.5; se(β̂) = √(9.5/100) = 0.31;
iii. [fitted-line plot not shown]
iv. r = 180/(10×20) = 0.9; 95% CI for ρ: (0.60, 0.97), using Tables Figure 10.
R4.10 (a) n = 60, p̂ = 0.8.
approx 95% CI for p: 0.8 ± 1.96√(0.8×0.2/60) = (0.70, 0.90).
The exact interval, obtained using R, is (0.677, 0.892).
It is assumed that the procedures are independent with equal probability of success.
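The exact interval quoted is the Clopper–Pearson interval from R's binomial test (0.8×60 = 48 successes):
    binom.test(48, 60)$conf.int   # exact 95% CI: (0.677, 0.892)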
(b) i. m̂ ≈ 7.7 (where F̂ = 0.5);
ii. W = freq(X < 10) = 61; H0 ⇒ W ≈ N(50, 25);
z = (61 − 50 − 0.5)/5 = 2.1, p = 0.036; so we reject H0. There is significant evidence in
these data to indicate that m < 10.
iii. The vertical scale is warped so that a Normal cdf is a straight line.
iv. T > 0; mode less than 10; T < 40; positively skew.
Revision Problem Set R5
R5.1 (a) p(x) ≥ 0 and Σp(x) = 0.4 + 0.3 + 0.2 + 0.1 = 1;
(b) x 1 2 3 4
F(x) 0.4 0.7 0.9 1.0
(c) i. Pr(X ≤ 3) = 0.9;
ii. Pr(1 < X < 4) = 0.5;
(d) i. E(X) = 1×0.4 + 2×0.3 + 3×0.2 + 4×0.1 = 2;
ii. var(X) = 1²×0.4 + 0²×0.3 + 1²×0.2 + 2²×0.1 = 1 ⇒ sd(X) = 1.
(e) T = Σ(i=1..40) Xi ≈ N(80, 40), since E(T) = 40×2, var(T) = 40×1,
and the distribution of T is approximately Normal by the central limit theorem.
An approximate 95% probability interval for the total number of people in 40 cars is
(80 ± 1.96√40) ≈ (68, 92).
(f) i. y 2 3 5
p(y) 0.3 0.3 0.4
ii. E(Y ) = 2×0.3 + 3×0.3 + 5×0.4 = 3.5 (dollars).
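The Normal approximation for T in (e) can be checked by simulation in R:
    p <- c(0.4, 0.3, 0.2, 0.1)
    tot <- replicate(10000, sum(sample(1:4, size = 40, replace = TRUE, prob = p)))
    c(mean(tot), var(tot))            # close to E(T) = 80 and var(T) = 40
    quantile(tot, c(0.025, 0.975))    # close to the interval (68, 92)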
R5.2 sn = 0.995, sp = 0.99.
(a) P P′
D 0.004975 0.000025 0.005
D′ 0.009950 0.985050 0.995
0.014925 0.985075 1
i. Pr(P) = 0.014925;
ii. ppv = Pr(D | P) = 0.333;
iii. of those individuals who test positive, only 1/3 have the disease.
(b) P P′
D 0.995p ··· p
D′ 0.01(1−p) ··· 1−p
0.995p + 0.01(1−p) ··· 1
ppv = 0.995p/(0.995p + 0.01(1−p)) = 0.46 ⇒ p = 0.00849.
(c) i. [scatter plot not shown]
ii. the correlation is negative (i.e. sensitivity decreases with increasing cholesterol);
the relation is curvilinear (i.e. it does not follow a straight-line regression).
iii. reading from the graph, chol ≤ 223.
R5.3 (a) F F′
I 0.01 0.02 0.03
I′ 0.13 0.84 0.97
0.14 0.86 1
i. Pr(F′ ∩ I′) = 0.84;
ii. Pr(F | I) = 1/3 = 0.33 > Pr(F) = 0.14, so F & I are positively related.
(b) i. E(Z) = a×10 + (1−a)×10 = 10;
var(Z) = a²×2² + (1−a)²×1² = 4a² + (1−a)² = 5a² − 2a + 1.
ii. dV/da = 10a − 2 = 0 ⇒ a = 0.2;
iii. Vmin = 0.2²×2² + 0.8²×1² = 0.16 + 0.64 = 0.8.
R5.4 (pooled-t) s² = (29s1² + 29s2²)/58 = ½(0.11² + 0.25²) = 0.0373, s = 0.193;
tp = (0.04 − 0.10)/(0.193√(1/30 + 1/30)) = −1.203, cf. c0.975(t58) = 2.00; so we do not reject H0 (µ1 = µ2).
This test assumes that the samples are independent random samples from populations that
are normally distributed with equal variances. There may be some question about the last
assumption, but . . . if we were to use the unpooled-t: tu = (0.04 − 0.10)/√(0.11²/30 + 0.25²/30) = −1.203.
Note: since s² = ½(s1² + s2²), it follows that tu = tp.
tu is compared to c0.975(tk); and since 29 ≤ k ≤ 58, 2.00 ≤ c ≤ 2.05, the conclusion is the
same: do not reject H0.
R5.5 data: E E′
C 6 94 100 p̂1 = 0.06
C′ 9 216 225 p̂2 = 0.04
15 310 325 p̂ = 0.0462
(a) z = (0.06 − 0.04)/√(0.0462×0.9538×(1/100 + 1/225)) = 0.793;
since |z| < 1.96, there is no evidence to indicate rejection of p1 = p2.
since |z| < 1.96, there is no evidence to indicate rejection of p1 = p2 .
(b) type I error (rejecting H0 when H0 is true) means we would conclude that OC alters the
cancer risk when it does not;
type II error (not rejecting H0 when H1 is true) means we would conclude that OC does
not alter the cancer risk when it does.
(c) i. type I ⇒ OC removed when it is OK; type II ⇒ OC continues to be sold when it is
not OK.
ii. type I is a problem for the drug company; type II is a problem for women using OC.
R5.6 (a) p̂ = 35/96 = 0.365, se(p̂) = √(0.365×0.635/96) = 0.049;
95% CI for p: 0.365 ± 1.96×0.049 = (0.27, 0.46).
(b) i. λ̂ = 17;
ii. Let X denote the number of deaths recorded among the specified cohort. X ~ Pn(λ),
and we wish to test H0: λ = 6.3.
Thus p = 2 Pr(X ≥ 17), where X ~ Pn(6.3); and hence p = 0.0006, using Tables or
R. Hence we reject H0. There is significant evidence here that λ > 6.3, i.e. that there
is excess mortality due to cirrhosis of the liver among this cohort.
Note: the approx z-test gives zc = (17 − 6.3 − 0.5)/√6.3 = 4.06, so p ≈ 0.000, and we reject H0.
iii. The Poisson SP diagram (Tables Figure 4) gives 95% CI for λ: (9.9, 27.2).
Note: R gives (9.90, 27.22); the approx 95% CI gives (8.9, 25.1).
SMR = λ/6.3. 95% CI for SMR: (1.6, 4.3).
R5.7 (a) a comparison is required to demonstrate the effectiveness of the treatment;
(b) i. city and treatment are confounded;
ii. ten treatments and ten controls in each city.
(c) a variable that may be confounded with the treatment: gender, age, health, . . . .
(d) nT = 20, x̄T = −10.5, sT = 5.2; nC = 20, x̄C = −6.1, sC = 4.9.
s² = ½(5.2² + 4.9²) = 25.525 ⇒ s = 5.05.
95% CI: (−10.5 + 6.1) ± 2.024×5.05√(1/20 + 1/20) = (−4.4 ± 3.23) = (−7.6, −1.2).
There is significant evidence that the decrease is greater with the treatment, since 0 ∉ CI.
R5.8 (a) ordered data: (4.35, 4.55, 4.95, 5.05, 5.28, 5.36, 5.40, 5.46, 5.50, 6.45);
min = 4.35, med = x(5.5) = 5.32, max = 6.45;
Q1 = x(2.75) = 4.85, Q3 = x(8.25) = 5.47.
(b) population: µ0 = 4.91, σ0 = 0.57; X̄ ≈ N(µ, 0.57²/10), x̄ = 5.235;
z = (5.235 − 4.91)/(0.57/√10) = 1.80, p = 2 Pr(N > 1.80) = 0.071.
Since p > 0.01, we do not reject H0.
Apart from assuming σ = 0.57, we also assume that X̄ is approximately normally dis-
tributed. This is based on the central limit theorem, but since we only have a sample of
10, the population distribution cannot be too far from Normal. (There may be some doubt
about the “outlier” at 6.45; but, as s = 0.58 which is very close to the assumed population σ, it
seems that this is not unreasonable.)
(c) i. To determine the power, we need to specify the decision rule.
For α = 0.01, we reject H0 if |z| > 2.5758,
i.e. if x̄ < 4.91 − 2.5758×0.57/√10 = 4.45 or if x̄ > 4.91 + 2.5758×0.57/√10 = 5.37.
So, if µ = 5.5, power ≈ Pr(X̄′ > 5.37), where X̄′ ~ N(5.5, 0.57²/10).
power = Pr(X̄s′ > (5.37 − 5.5)/(0.57/√10)) = Pr(N > −0.697) = 0.757.
ii. The power would be increased. The critical values would be closer to 4.91 (4.91 ±
1.96×0.57/√10, i.e. 4.56 and 5.26), so power = Pr(X̄′ > 5.26) > Pr(X̄′ > 5.37).
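The power calculation in i can be transcribed directly into R:
    se <- 0.57 / sqrt(10)
    crit <- 4.91 + c(-1, 1) * qnorm(0.995) * se            # decision limits (4.45, 5.37)
    pnorm(crit[1], 5.5, se) + 1 - pnorm(crit[2], 5.5, se)  # power 0.757 when mu = 5.5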
R5.9 (a) i. β̂ = 0.273 indicates that the average FEV increases by 0.273 L for each year of age.
ii. Test of β = 0; p = 0.000 means that the probability of observing a value of t as extreme
as this if β = 0 is less than 0.0005; t ~ t334.
iii. µ̂(12) = 3.355; se(µ̂(12)) = √(0.5881²/336 + 2²×0.01080²) = 0.0387;
95% CI for µ(12): 3.355 ± 1.967×0.0387 = (3.28, 3.43).
(b) Yes. The 95% confidence interval obtained from the Correlation SP diagram (Tables Fig-
ure 10) gives (0.03 < ρ < 0.65), which excludes zero. Hence ρ = 0 would be rejected;
there is significant evidence indicating a positive relationship.
Statistical Tables

Table 1: Binomial distribution — probability mass function

p
x 0.01 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50

n=1 0 .9900 .9500 .9000 .8500 .8000 .7500 .7000 .6500 .6000 .5500 .5000 1
1 .0100 .0500 .1000 .1500 .2000 .2500 .3000 .3500 .4000 .4500 .5000 0
n=2 0 .9801 .9025 .8100 .7225 .6400 .5625 .4900 .4225 .3600 .3025 .2500 2
1 .0198 .0950 .1800 .2550 .3200 .3750 .4200 .4550 .4800 .4950 .5000 1
2 .0001 .0025 .0100 .0225 .0400 .0625 .0900 .1225 .1600 .2025 .2500 0
n=3 0 .9703 .8574 .7290 .6141 .5120 .4219 .3430 .2746 .2160 .1664 .1250 3
1 .0294 .1354 .2430 .3251 .3840 .4219 .4410 .4436 .4320 .4084 .3750 2
2 .0003 .0071 .0270 .0574 .0960 .1406 .1890 .2389 .2880 .3341 .3750 1
3 .0001 .0010 .0034 .0080 .0156 .0270 .0429 .0640 .0911 .1250 0
n=4 0 .9606 .8145 .6561 .5220 .4096 .3164 .2401 .1785 .1296 .0915 .0625 4
1 .0388 .1715 .2916 .3685 .4096 .4219 .4116 .3845 .3456 .2995 .2500 3
2 .0006 .0135 .0486 .0975 .1536 .2109 .2646 .3105 .3456 .3675 .3750 2
3 .0005 .0036 .0115 .0256 .0469 .0756 .1115 .1536 .2005 .2500 1
4 .0001 .0005 .0016 .0039 .0081 .0150 .0256 .0410 .0625 0
n=5 0 .9510 .7738 .5905 .4437 .3277 .2373 .1681 .1160 .0778 .0503 .0313 5
1 .0480 .2036 .3281 .3915 .4096 .3955 .3602 .3124 .2592 .2059 .1563 4
2 .0010 .0214 .0729 .1382 .2048 .2637 .3087 .3364 .3456 .3369 .3125 3
3 .0011 .0081 .0244 .0512 .0879 .1323 .1811 .2304 .2757 .3125 2
4 .0005 .0022 .0064 .0146 .0284 .0488 .0768 .1128 .1563 1
5 .0001 .0003 .0010 .0024 .0053 .0102 .0185 .0313 0
n=6 0 .9415 .7351 .5314 .3771 .2621 .1780 .1176 .0754 .0467 .0277 .0156 6
1 .0571 .2321 .3543 .3993 .3932 .3560 .3025 .2437 .1866 .1359 .0938 5
2 .0014 .0305 .0984 .1762 .2458 .2966 .3241 .3280 .3110 .2780 .2344 4
3 .0021 .0146 .0415 .0819 .1318 .1852 .2355 .2765 .3032 .3125 3
4 .0001 .0012 .0055 .0154 .0330 .0595 .0951 .1382 .1861 .2344 2
5 .0001 .0004 .0015 .0044 .0102 .0205 .0369 .0609 .0938 1
6 .0001 .0002 .0007 .0018 .0041 .0083 .0156 0
n=7 0 .9321 .6983 .4783 .3206 .2097 .1335 .0824 .0490 .0280 .0152 .0078 7
1 .0659 .2573 .3720 .3960 .3670 .3115 .2471 .1848 .1306 .0872 .0547 6
2 .0020 .0406 .1240 .2097 .2753 .3115 .3177 .2985 .2613 .2140 .1641 5
3 .0036 .0230 .0617 .1147 .1730 .2269 .2679 .2903 .2918 .2734 4
4 .0002 .0026 .0109 .0287 .0577 .0972 .1442 .1935 .2388 .2734 3
5 .0002 .0012 .0043 .0115 .0250 .0466 .0774 .1172 .1641 2
6 .0001 .0004 .0013 .0036 .0084 .0172 .0320 .0547 1
7 .0001 .0002 .0006 .0016 .0037 .0078 0
n=8 0 .9227 .6634 .4305 .2725 .1678 .1001 .0576 .0319 .0168 .0084 .0039 8
1 .0746 .2793 .3826 .3847 .3355 .2670 .1977 .1373 .0896 .0548 .0313 7
2 .0026 .0515 .1488 .2376 .2936 .3115 .2965 .2587 .2090 .1569 .1094 6
3 .0001 .0054 .0331 .0839 .1468 .2076 .2541 .2786 .2787 .2568 .2188 5
4 .0004 .0046 .0185 .0459 .0865 .1361 .1875 .2322 .2627 .2734 4
5 .0004 .0026 .0092 .0231 .0467 .0808 .1239 .1719 .2188 3
6 .0002 .0011 .0038 .0100 .0217 .0413 .0703 .1094 2
7 .0001 .0004 .0012 .0033 .0079 .0164 .0313 1
8 .0001 .0002 .0007 .0017 .0039 0

0.99 0.95 0.90 0.85 0.80 0.75 0.70 0.65 0.60 0.55 0.50 x
p
p
x 0.01 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50

n=9 0 .9135 .6302 .3874 .2316 .1342 .0751 .0404 .0207 .0101 .0046 .0020 9
1 .0830 .2985 .3874 .3679 .3020 .2253 .1556 .1004 .0605 .0339 .0176 8
2 .0034 .0629 .1722 .2597 .3020 .3003 .2668 .2162 .1612 .1110 .0703 7
3 .0001 .0077 .0446 .1069 .1762 .2336 .2668 .2716 .2508 .2119 .1641 6
4 .0006 .0074 .0283 .0661 .1168 .1715 .2194 .2508 .2600 .2461 5
5 .0008 .0050 .0165 .0389 .0735 .1181 .1672 .2128 .2461 4
6 .0001 .0006 .0028 .0087 .0210 .0424 .0743 .1160 .1641 3
7 .0003 .0012 .0039 .0098 .0212 .0407 .0703 2
8 .0001 .0004 .0013 .0035 .0083 .0176 1
9 .0001 .0003 .0008 .0020 0
n=10 0 .9044 .5987 .3487 .1969 .1074 .0563 .0282 .0135 .0060 .0025 .0010 10
1 .0914 .3151 .3874 .3474 .2684 .1877 .1211 .0725 .0403 .0207 .0098 9
2 .0042 .0746 .1937 .2759 .3020 .2816 .2335 .1757 .1209 .0763 .0439 8
3 .0001 .0105 .0574 .1298 .2013 .2503 .2668 .2522 .2150 .1665 .1172 7
4 .0010 .0112 .0401 .0881 .1460 .2001 .2377 .2508 .2384 .2051 6
5 .0001 .0015 .0085 .0264 .0584 .1029 .1536 .2007 .2340 .2461 5
6 .0001 .0012 .0055 .0162 .0368 .0689 .1115 .1596 .2051 4
7 .0001 .0008 .0031 .0090 .0212 .0425 .0746 .1172 3
8 .0001 .0004 .0014 .0043 .0106 .0229 .0439 2
9 .0001 .0005 .0016 .0042 .0098 1
10 .0001 .0003 .0010 0
n=11 0 .8953 .5688 .3138 .1673 .0859 .0422 .0198 .0088 .0036 .0014 .0005 11
1 .0995 .3293 .3835 .3248 .2362 .1549 .0932 .0518 .0266 .0125 .0054 10
2 .0050 .0867 .2131 .2866 .2953 .2581 .1998 .1395 .0887 .0513 .0269 9
3 .0002 .0137 .0710 .1517 .2215 .2581 .2568 .2254 .1774 .1259 .0806 8
4 .0014 .0158 .0536 .1107 .1721 .2201 .2428 .2365 .2060 .1611 7
5 .0001 .0025 .0132 .0388 .0803 .1321 .1830 .2207 .2360 .2256 6
6 .0003 .0023 .0097 .0268 .0566 .0985 .1471 .1931 .2256 5
7 .0003 .0017 .0064 .0173 .0379 .0701 .1128 .1611 4
8 .0002 .0011 .0037 .0102 .0234 .0462 .0806 3
9 .0001 .0005 .0018 .0052 .0126 .0269 2
10 .0002 .0007 .0021 .0054 1
11 .0002 .0005 0
n=12 0 .8864 .5404 .2824 .1422 .0687 .0317 .0138 .0057 .0022 .0008 .0002 12
1 .1074 .3413 .3766 .3012 .2062 .1267 .0712 .0368 .0174 .0075 .0029 11
2 .0060 .0988 .2301 .2924 .2835 .2323 .1678 .1088 .0639 .0339 .0161 10
3 .0002 .0173 .0852 .1720 .2362 .2581 .2397 .1954 .1419 .0923 .0537 9
4 .0021 .0213 .0683 .1329 .1936 .2311 .2367 .2128 .1700 .1208 8
5 .0002 .0038 .0193 .0532 .1032 .1585 .2039 .2270 .2225 .1934 7
6 .0005 .0040 .0155 .0401 .0792 .1281 .1766 .2124 .2256 6
7 .0006 .0033 .0115 .0291 .0591 .1009 .1489 .1934 5
8 .0001 .0005 .0024 .0078 .0199 .0420 .0762 .1208 4
9 .0001 .0004 .0015 .0048 .0125 .0277 .0537 3
10 .0002 .0008 .0025 .0068 .0161 2
11 .0001 .0003 .0010 .0029 1
12 .0001 .0002 0
n=13 0 .8775 .5133 .2542 .1209 .0550 .0238 .0097 .0037 .0013 .0004 .0001 13
1 .1152 .3512 .3672 .2774 .1787 .1029 .0540 .0259 .0113 .0045 .0016 12
2 .0070 .1109 .2448 .2937 .2680 .2059 .1388 .0836 .0453 .0220 .0095 11
3 .0003 .0214 .0997 .1900 .2457 .2517 .2181 .1651 .1107 .0660 .0349 10
4 .0028 .0277 .0838 .1535 .2097 .2337 .2222 .1845 .1350 .0873 9
5 .0003 .0055 .0266 .0691 .1258 .1803 .2154 .2214 .1989 .1571 8
6 .0008 .0063 .0230 .0559 .1030 .1546 .1968 .2169 .2095 7
7 .0001 .0011 .0058 .0186 .0442 .0833 .1312 .1775 .2095 6
8 .0001 .0011 .0047 .0142 .0336 .0656 .1089 .1571 5
9 .0001 .0009 .0034 .0101 .0243 .0495 .0873 4
10 .0001 .0006 .0022 .0065 .0162 .0349 3
11 .0001 .0003 .0012 .0036 .0095 2
12 .0001 .0005 .0016 1
13 .0001 0

0.99 0.95 0.90 0.85 0.80 0.75 0.70 0.65 0.60 0.55 0.50 x
p
p
x 0.01 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50

n=14 0 .8687 .4877 .2288 .1028 .0440 .0178 .0068 .0024 .0008 .0002 .0001 14
1 .1229 .3593 .3559 .2539 .1539 .0832 .0407 .0181 .0073 .0027 .0009 13
2 .0081 .1229 .2570 .2912 .2501 .1802 .1134 .0634 .0317 .0141 .0056 12
3 .0003 .0259 .1142 .2056 .2501 .2402 .1943 .1366 .0845 .0462 .0222 11
4 .0037 .0349 .0998 .1720 .2202 .2290 .2022 .1549 .1040 .0611 10
5 .0004 .0078 .0352 .0860 .1468 .1963 .2178 .2066 .1701 .1222 9
6 .0013 .0093 .0322 .0734 .1262 .1759 .2066 .2088 .1833 8
7 .0002 .0019 .0092 .0280 .0618 .1082 .1574 .1952 .2095 7
8 .0003 .0020 .0082 .0232 .0510 .0918 .1398 .1833 6
9 .0003 .0018 .0066 .0183 .0408 .0762 .1222 5
10 .0003 .0014 .0049 .0136 .0312 .0611 4
11 .0002 .0010 .0033 .0093 .0222 3
12 .0001 .0005 .0019 .0056 2
13 .0001 .0002 .0009 1
14 .0001 0
n=15 0 .8601 .4633 .2059 .0874 .0352 .0134 .0047 .0016 .0005 .0001 15
1 .1303 .3658 .3432 .2312 .1319 .0668 .0305 .0126 .0047 .0016 .0005 14
2 .0092 .1348 .2669 .2856 .2309 .1559 .0916 .0476 .0219 .0090 .0032 13
3 .0004 .0307 .1285 .2184 .2501 .2252 .1700 .1110 .0634 .0318 .0139 12
4 .0049 .0428 .1156 .1876 .2252 .2186 .1792 .1268 .0780 .0417 11
5 .0006 .0105 .0449 .1032 .1651 .2061 .2123 .1859 .1404 .0916 10
6 .0019 .0132 .0430 .0917 .1472 .1906 .2066 .1914 .1527 9
7 .0003 .0030 .0138 .0393 .0811 .1319 .1771 .2013 .1964 8
8 .0005 .0035 .0131 .0348 .0710 .1181 .1647 .1964 7
9 .0001 .0007 .0034 .0116 .0298 .0612 .1048 .1527 6
10 .0001 .0007 .0030 .0096 .0245 .0515 .0916 5
11 .0001 .0006 .0024 .0074 .0191 .0417 4
12 .0001 .0004 .0016 .0052 .0139 3
13 .0001 .0003 .0010 .0032 2
14 .0001 .0005 1
15 0
n=16 0 .8515 .4401 .1853 .0743 .0281 .0100 .0033 .0010 .0003 .0001 16
1 .1376 .3706 .3294 .2097 .1126 .0535 .0228 .0087 .0030 .0009 .0002 15
2 .0104 .1463 .2745 .2775 .2111 .1336 .0732 .0353 .0150 .0056 .0018 14
3 .0005 .0359 .1423 .2285 .2463 .2079 .1465 .0888 .0468 .0215 .0085 13
4 .0061 .0514 .1311 .2001 .2252 .2040 .1553 .1014 .0572 .0278 12
5 .0008 .0137 .0555 .1201 .1802 .2099 .2008 .1623 .1123 .0667 11
6 .0001 .0028 .0180 .0550 .1101 .1649 .1982 .1983 .1684 .1222 10
7 .0004 .0045 .0197 .0524 .1010 .1524 .1889 .1969 .1746 9
8 .0001 .0009 .0055 .0197 .0487 .0923 .1417 .1812 .1964 8
9 .0001 .0012 .0058 .0185 .0442 .0840 .1318 .1746 7
10 .0002 .0014 .0056 .0167 .0392 .0755 .1222 6
11 .0002 .0013 .0049 .0142 .0337 .0667 5
12 .0002 .0011 .0040 .0115 .0278 4
13 .0000 .0002 .0008 .0029 .0085 3
14 .0001 .0005 .0018 2
15 .0001 .0002 1
16 0
n=17 0 .8429 .4181 .1668 .0631 .0225 .0075 .0023 .0007 .0002 17
1 .1447 .3741 .3150 .1893 .0957 .0426 .0169 .0060 .0019 .0005 .0001 16
2 .0117 .1575 .2800 .2673 .1914 .1136 .0581 .0260 .0102 .0035 .0010 15
3 .0006 .0415 .1556 .2359 .2393 .1893 .1245 .0701 .0341 .0144 .0052 14
4 .0076 .0605 .1457 .2093 .2209 .1868 .1320 .0796 .0411 .0182 13
5 .0010 .0175 .0668 .1361 .1914 .2081 .1849 .1379 .0875 .0472 12
6 .0001 .0039 .0236 .0680 .1276 .1784 .1991 .1839 .1432 .0944 11
7 .0007 .0065 .0267 .0668 .1201 .1685 .1927 .1841 .1484 10
8 .0001 .0014 .0084 .0279 .0644 .1134 .1606 .1883 .1855 9
9 .0003 .0021 .0093 .0276 .0611 .1070 .1540 .1855 8
10 .0004 .0025 .0095 .0263 .0571 .1008 .1484 7
11 .0001 .0005 .0026 .0090 .0242 .0525 .0944 6
12 .0001 .0006 .0024 .0081 .0215 .0472 5
13 .0001 .0005 .0021 .0068 .0182 4
14 .0001 .0004 .0016 .0052 3
15 .0001 .0003 .0010 2
16 .0001 1
17 0

0.99 0.95 0.90 0.85 0.80 0.75 0.70 0.65 0.60 0.55 0.50 x
p
p
x 0.01 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50

n=18 0 .8345 .3972 .1501 .0536 .0180 .0056 .0016 .0004 .0001 18
1 .1517 .3763 .3002 .1704 .0811 .0338 .0126 .0042 .0012 .0003 .0001 17
2 .0130 .1683 .2835 .2556 .1723 .0958 .0458 .0190 .0069 .0022 .0006 16
3 .0007 .0473 .1680 .2406 .2297 .1704 .1046 .0547 .0246 .0095 .0031 15
4 .0093 .0700 .1592 .2153 .2130 .1681 .1104 .0614 .0291 .0117 14
5 .0014 .0218 .0787 .1507 .1988 .2017 .1664 .1146 .0666 .0327 13
6 .0002 .0052 .0301 .0816 .1436 .1873 .1941 .1655 .1181 .0708 12
7 .0010 .0091 .0350 .0820 .1376 .1792 .1892 .1657 .1214 11
8 .0002 .0022 .0120 .0376 .0811 .1327 .1734 .1864 .1669 10
9 .0004 .0033 .0139 .0386 .0794 .1284 .1694 .1855 9
10 .0001 .0008 .0042 .0149 .0385 .0771 .1248 .1669 8
11 .0001 .0010 .0046 .0151 .0374 .0742 .1214 7
12 .0002 .0012 .0047 .0145 .0354 .0708 6
13 .0002 .0012 .0045 .0134 .0327 5
14 .0002 .0011 .0039 .0117 4
15 .0002 .0009 .0031 3
16 .0001 .0006 2
17 .0001 1
18 0
n=19 0 .8262 .3774 .1351 .0456 .0144 .0042 .0011 .0003 .0001 19
1 .1586 .3774 .2852 .1529 .0685 .0268 .0093 .0029 .0008 .0002 18
2 .0144 .1787 .2852 .2428 .1540 .0803 .0358 .0138 .0046 .0013 .0003 17
3 .0008 .0533 .1796 .2428 .2182 .1517 .0869 .0422 .0175 .0062 .0018 16
4 .0112 .0798 .1714 .2182 .2023 .1491 .0909 .0467 .0203 .0074 15
5 .0018 .0266 .0907 .1636 .2023 .1916 .1468 .0933 .0497 .0222 14
6 .0002 .0069 .0374 .0955 .1574 .1916 .1844 .1451 .0949 .0518 13
7 .0014 .0122 .0443 .0974 .1525 .1844 .1797 .1443 .0961 12
8 .0002 .0032 .0166 .0487 .0981 .1489 .1797 .1771 .1442 11
9 .0007 .0051 .0198 .0514 .0980 .1464 .1771 .1762 10
10 .0001 .0013 .0066 .0220 .0528 .0976 .1449 .1762 9
11 .0003 .0018 .0077 .0233 .0532 .0970 .1442 8
12 .0004 .0022 .0083 .0237 .0529 .0961 7
13 .0001 .0005 .0024 .0085 .0233 .0518 6
14 .0001 .0006 .0024 .0082 .0222 5
15 .0001 .0005 .0022 .0074 4
16 .0001 .0005 .0018 3
17 .0001 .0003 2
18 1
19 0
n=20 0 .8179 .3585 .1216 .0388 .0115 .0032 .0008 .0002 20
1 .1652 .3774 .2702 .1368 .0576 .0211 .0068 .0020 .0005 .0001 19
2 .0159 .1887 .2852 .2293 .1369 .0669 .0278 .0100 .0031 .0008 .0002 18
3 .0010 .0596 .1901 .2428 .2054 .1339 .0716 .0323 .0123 .0040 .0011 17
4 .0133 .0898 .1821 .2182 .1897 .1304 .0738 .0350 .0139 .0046 16
5 .0022 .0319 .1028 .1746 .2023 .1789 .1272 .0746 .0365 .0148 15
6 .0003 .0089 .0454 .1091 .1686 .1916 .1712 .1244 .0746 .0370 14
7 .0020 .0160 .0545 .1124 .1643 .1844 .1659 .1221 .0739 13
8 .0004 .0046 .0222 .0609 .1144 .1614 .1797 .1623 .1201 12
9 .0001 .0011 .0074 .0271 .0654 .1158 .1597 .1771 .1602 11
10 .0002 .0020 .0099 .0308 .0686 .1171 .1593 .1762 10
11 .0005 .0030 .0120 .0336 .0710 .1185 .1602 9
12 .0001 .0008 .0039 .0136 .0355 .0727 .1201 8
13 .0002 .0010 .0045 .0146 .0366 .0739 7
14 .0002 .0012 .0049 .0150 .0370 6
15 .0003 .0013 .0049 .0148 5
16 .0003 .0013 .0046 4
17 .0002 .0011 3
18 .0002 2
19 1
20 0

0.99 0.95 0.90 0.85 0.80 0.75 0.70 0.65 0.60 0.55 0.50 x
p
Figure 2: Binomial distribution — confidence intervals
[chart not reproducible in text: curves giving p against the observed proportion x/n, for n = 10, 20, 50, 100, 200 and 500]
Note: The numbers on the curves indicate the value of n.


For a specified value of x/n, the curves give a 95% confidence interval for p. For a specified
value of p, the curves give a two-sided critical region of size 0.05 to test the hypothesis that
the specified value of p is the true value.
Table 3: Poisson distribution — probability mass function

λ
x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 x

0 .9048 .8187 .7408 .6703 .6065 .5488 .4966 .4493 .4066 .3679 0
1 .0905 .1637 .2222 .2681 .3033 .3293 .3476 .3595 .3659 .3679 1
2 .0045 .0164 .0333 .0536 .0758 .0988 .1217 .1438 .1647 .1839 2
3 .0002 .0011 .0033 .0072 .0126 .0198 .0284 .0383 .0494 .0613 3
4 .0001 .0003 .0007 .0016 .0030 .0050 .0077 .0111 .0153 4
5 .0001 .0002 .0004 .0007 .0012 .0020 .0031 5
6 .0001 .0002 .0003 .0005 6
7 .0001 7

λ
x 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0 x

0 .3329 .3012 .2725 .2466 .2231 .2019 .1827 .1653 .1496 .1353 0
1 .3662 .3614 .3543 .3452 .3347 .3230 .3106 .2975 .2842 .2707 1
2 .2014 .2169 .2303 .2417 .2510 .2584 .2640 .2678 .2700 .2707 2
3 .0738 .0867 .0998 .1128 .1255 .1378 .1496 .1607 .1710 .1804 3
4 .0203 .0260 .0324 .0395 .0471 .0551 .0636 .0723 .0812 .0902 4
5 .0045 .0062 .0084 .0111 .0141 .0176 .0216 .0260 .0309 .0361 5
6 .0008 .0012 .0018 .0026 .0035 .0047 .0061 .0078 .0098 .0120 6
7 .0001 .0002 .0003 .0005 .0008 .0011 .0015 .0020 .0027 .0034 7
8 .0001 .0001 .0001 .0002 .0003 .0005 .0006 .0009 8
9 .0001 .0001 .0001 .0002 9

λ
x 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3.0 x

0 .1225 .1108 .1003 .0907 .0821 .0743 .0672 .0608 .0550 .0498 0
1 .2572 .2438 .2306 .2177 .2052 .1931 .1815 .1703 .1596 .1494 1
2 .2700 .2681 .2652 .2613 .2565 .2510 .2450 .2384 .2314 .2240 2
3 .1890 .1966 .2033 .2090 .2138 .2176 .2205 .2225 .2237 .2240 3
4 .0992 .1082 .1169 .1254 .1336 .1414 .1488 .1557 .1622 .1680 4
5 .0417 .0476 .0538 .0602 .0668 .0735 .0804 .0872 .0940 .1008 5
6 .0146 .0174 .0206 .0241 .0278 .0319 .0362 .0407 .0455 .0504 6
7 .0044 .0055 .0068 .0083 .0099 .0118 .0139 .0163 .0188 .0216 7
8 .0011 .0015 .0019 .0025 .0031 .0038 .0047 .0057 .0068 .0081 8
9 .0003 .0004 .0005 .0007 .0009 .0011 .0014 .0018 .0022 .0027 9
10 .0001 .0001 .0001 .0002 .0002 .0003 .0004 .0005 .0006 .0008 10
11 .0001 .0001 .0001 .0002 .0002 11
12 .0001 12

λ
x 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 4.0 x

0 .0450 .0408 .0369 .0334 .0302 .0273 .0247 .0224 .0202 .0183 0
1 .1397 .1304 .1217 .1135 .1057 .0984 .0915 .0850 .0789 .0733 1
2 .2165 .2087 .2008 .1929 .1850 .1771 .1692 .1615 .1539 .1465 2
3 .2237 .2226 .2209 .2186 .2158 .2125 .2087 .2046 .2001 .1954 3
4 .1733 .1781 .1823 .1858 .1888 .1912 .1931 .1944 .1951 .1954 4
5 .1075 .1140 .1203 .1264 .1322 .1377 .1429 .1477 .1522 .1563 5
6 .0555 .0608 .0662 .0716 .0771 .0826 .0881 .0936 .0989 .1042 6
7 .0246 .0278 .0312 .0348 .0385 .0425 .0466 .0508 .0551 .0595 7
8 .0095 .0111 .0129 .0148 .0169 .0191 .0215 .0241 .0269 .0298 8
9 .0033 .0040 .0047 .0056 .0066 .0076 .0089 .0102 .0116 .0132 9
10 .0010 .0013 .0016 .0019 .0023 .0028 .0033 .0039 .0045 .0053 10
11 .0003 .0004 .0005 .0006 .0007 .0009 .0011 .0013 .0016 .0019 11
12 .0001 .0001 .0001 .0002 .0002 .0003 .0003 .0004 .0005 .0006 12
13 .0001 .0001 .0001 .0001 .0002 .0002 13
14 .0001 14

λ
x 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9 5.0 x

0 .0166 .0150 .0136 .0123 .0111 .0101 .0091 .0082 .0074 .0067 0
1 .0679 .0630 .0583 .0540 .0500 .0462 .0427 .0395 .0365 .0337 1
2 .1393 .1323 .1254 .1188 .1125 .1063 .1005 .0948 .0894 .0842 2
3 .1904 .1852 .1798 .1743 .1687 .1631 .1574 .1517 .1460 .1404 3
4 .1951 .1944 .1933 .1917 .1898 .1875 .1849 .1820 .1789 .1755 4
5 .1600 .1633 .1662 .1687 .1708 .1725 .1738 .1747 .1753 .1755 5
6 .1093 .1143 .1191 .1237 .1281 .1323 .1362 .1398 .1432 .1462 6
7 .0640 .0686 .0732 .0778 .0824 .0869 .0914 .0959 .1002 .1044 7
8 .0328 .0360 .0393 .0428 .0463 .0500 .0537 .0575 .0614 .0653 8
9 .0150 .0168 .0188 .0209 .0232 .0255 .0281 .0307 .0334 .0363 9
10 .0061 .0071 .0081 .0092 .0104 .0118 .0132 .0147 .0164 .0181 10
11 .0023 .0027 .0032 .0037 .0043 .0049 .0056 .0064 .0073 .0082 11
12 .0008 .0009 .0011 .0013 .0016 .0019 .0022 .0026 .0030 .0034 12
13 .0002 .0003 .0004 .0005 .0006 .0007 .0008 .0009 .0011 .0013 13
14 .0001 .0001 .0001 .0001 .0002 .0002 .0003 .0003 .0004 .0005 14
15 .0001 .0001 .0001 .0001 .0001 .0002 15

λ
x 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9 6.0 x

0 .0061 .0055 .0050 .0045 .0041 .0037 .0033 .0030 .0027 .0025 0
1 .0311 .0287 .0265 .0244 .0225 .0207 .0191 .0176 .0162 .0149 1
2 .0793 .0746 .0701 .0659 .0618 .0580 .0544 .0509 .0477 .0446 2
3 .1348 .1293 .1239 .1185 .1133 .1082 .1033 .0985 .0938 .0892 3
4 .1719 .1681 .1641 .1600 .1558 .1515 .1472 .1428 .1383 .1339 4
5 .1753 .1748 .1740 .1728 .1714 .1697 .1678 .1656 .1632 .1606 5
6 .1490 .1515 .1537 .1555 .1571 .1584 .1594 .1601 .1605 .1606 6
7 .1086 .1125 .1163 .1200 .1234 .1267 .1298 .1326 .1353 .1377 7
8 .0692 .0731 .0771 .0810 .0849 .0887 .0925 .0962 .0998 .1033 8
9 .0392 .0423 .0454 .0486 .0519 .0552 .0586 .0620 .0654 .0688 9
10 .0200 .0220 .0241 .0262 .0285 .0309 .0334 .0359 .0386 .0413 10
11 .0093 .0104 .0116 .0129 .0143 .0157 .0173 .0190 .0207 .0225 11
12 .0039 .0045 .0051 .0058 .0065 .0073 .0082 .0092 .0102 .0113 12
13 .0015 .0018 .0021 .0024 .0028 .0032 .0036 .0041 .0046 .0052 13
14 .0006 .0007 .0008 .0009 .0011 .0013 .0015 .0017 .0019 .0022 14
15 .0002 .0002 .0003 .0003 .0004 .0005 .0006 .0007 .0008 .0009 15
16 .0001 .0001 .0001 .0001 .0001 .0002 .0002 .0002 .0003 .0003 16
17 .0001 .0001 .0001 .0001 .0001 17

λ
x 6.1 6.2 6.3 6.4 6.5 6.6 6.7 6.8 6.9 7.0 x

0 .0022 .0020 .0018 .0017 .0015 .0014 .0012 .0011 .0010 .0009 0
1 .0137 .0126 .0116 .0106 .0098 .0090 .0082 .0076 .0070 .0064 1
2 .0417 .0390 .0364 .0340 .0318 .0296 .0276 .0258 .0240 .0223 2
3 .0848 .0806 .0765 .0726 .0688 .0652 .0617 .0584 .0552 .0521 3
4 .1294 .1249 .1205 .1162 .1118 .1076 .1034 .0992 .0952 .0912 4
5 .1579 .1549 .1519 .1487 .1454 .1420 .1385 .1349 .1314 .1277 5
6 .1605 .1601 .1595 .1586 .1575 .1562 .1546 .1529 .1511 .1490 6
7 .1399 .1418 .1435 .1450 .1462 .1472 .1480 .1486 .1489 .1490 7
8 .1066 .1099 .1130 .1160 .1188 .1215 .1240 .1263 .1284 .1304 8
9 .0723 .0757 .0791 .0825 .0858 .0891 .0923 .0954 .0985 .1014 9
10 .0441 .0469 .0498 .0528 .0558 .0588 .0618 .0649 .0679 .0710 10
11 .0244 .0265 .0285 .0307 .0330 .0353 .0377 .0401 .0426 .0452 11
12 .0124 .0137 .0150 .0164 .0179 .0194 .0210 .0227 .0245 .0263 12
13 .0058 .0065 .0073 .0081 .0089 .0099 .0108 .0119 .0130 .0142 13
14 .0025 .0029 .0033 .0037 .0041 .0046 .0052 .0058 .0064 .0071 14
15 .0010 .0012 .0014 .0016 .0018 .0020 .0023 .0026 .0029 .0033 15
16 .0004 .0005 .0005 .0006 .0007 .0008 .0010 .0011 .0013 .0014 16
17 .0001 .0002 .0002 .0002 .0003 .0003 .0004 .0004 .0005 .0006 17
18 .0001 .0001 .0001 .0001 .0001 .0001 .0002 .0002 .0002 18
19 .0001 .0001 .0001 .0001 19

λ
x 7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8 7.9 8.0 x

0 .0008 .0007 .0007 .0006 .0006 .0005 .0005 .0004 .0004 .0003 0
1 .0059 .0054 .0049 .0045 .0041 .0038 .0035 .0032 .0029 .0027 1
2 .0208 .0194 .0180 .0167 .0156 .0145 .0134 .0125 .0116 .0107 2
3 .0492 .0464 .0438 .0413 .0389 .0366 .0345 .0324 .0305 .0286 3
4 .0874 .0836 .0799 .0764 .0729 .0696 .0663 .0632 .0602 .0573 4
5 .1241 .1204 .1167 .1130 .1094 .1057 .1021 .0986 .0951 .0916 5
6 .1468 .1445 .1420 .1394 .1367 .1339 .1311 .1282 .1252 .1221 6
7 .1489 .1486 .1481 .1474 .1465 .1454 .1442 .1428 .1413 .1396 7
8 .1321 .1337 .1351 .1363 .1373 .1381 .1388 .1392 .1395 .1396 8
9 .1042 .1070 .1096 .1121 .1144 .1167 .1187 .1207 .1224 .1241 9
10 .0740 .0770 .0800 .0829 .0858 .0887 .0914 .0941 .0967 .0993 10
11 .0478 .0504 .0531 .0558 .0585 .0613 .0640 .0667 .0695 .0722 11
12 .0283 .0303 .0323 .0344 .0366 .0388 .0411 .0434 .0457 .0481 12
13 .0154 .0168 .0181 .0196 .0211 .0227 .0243 .0260 .0278 .0296 13
14 .0078 .0086 .0095 .0104 .0113 .0123 .0134 .0145 .0157 .0169 14
15 .0037 .0041 .0046 .0051 .0057 .0062 .0069 .0075 .0083 .0090 15
16 .0016 .0019 .0021 .0024 .0026 .0030 .0033 .0037 .0041 .0045 16
17 .0007 .0008 .0009 .0010 .0012 .0013 .0015 .0017 .0019 .0021 17
18 .0003 .0003 .0004 .0004 .0005 .0006 .0006 .0007 .0008 .0009 18
19 .0001 .0001 .0001 .0002 .0002 .0002 .0003 .0003 .0003 .0004 19
20 .0001 .0001 .0001 .0001 .0001 .0001 .0001 .0002 20
21 .0001 .0001 21

x 8.1 8.2 8.3 8.4 8.5 8.6 8.7 8.8 8.9 9.0 x

0 .0003 .0003 .0002 .0002 .0002 .0002 .0002 .0002 .0001 .0001 0
1 .0025 .0023 .0021 .0019 .0017 .0016 .0014 .0013 .0012 .0011 1
2 .0100 .0092 .0086 .0079 .0074 .0068 .0063 .0058 .0054 .0050 2
3 .0269 .0252 .0237 .0222 .0208 .0195 .0183 .0171 .0160 .0150 3
4 .0544 .0517 .0491 .0466 .0443 .0420 .0398 .0377 .0357 .0337 4
5 .0882 .0849 .0816 .0784 .0752 .0722 .0692 .0663 .0635 .0607 5
6 .1191 .1160 .1128 .1097 .1066 .1034 .1003 .0972 .0941 .0911 6
7 .1378 .1358 .1338 .1317 .1294 .1271 .1247 .1222 .1197 .1171 7
8 .1395 .1392 .1388 .1382 .1375 .1366 .1356 .1344 .1332 .1318 8
9 .1256 .1269 .1280 .1290 .1299 .1306 .1311 .1315 .1317 .1318 9
10 .1017 .1040 .1063 .1084 .1104 .1123 .1140 .1157 .1172 .1186 10
11 .0749 .0776 .0802 .0828 .0853 .0878 .0902 .0925 .0948 .0970 11
12 .0505 .0530 .0555 .0579 .0604 .0629 .0654 .0679 .0703 .0728 12
13 .0315 .0334 .0354 .0374 .0395 .0416 .0438 .0459 .0481 .0504 13
14 .0182 .0196 .0210 .0225 .0240 .0256 .0272 .0289 .0306 .0324 14
15 .0098 .0107 .0116 .0126 .0136 .0147 .0158 .0169 .0182 .0194 15
16 .0050 .0055 .0060 .0066 .0072 .0079 .0086 .0093 .0101 .0109 16
17 .0024 .0026 .0029 .0033 .0036 .0040 .0044 .0048 .0053 .0058 17
18 .0011 .0012 .0014 .0015 .0017 .0019 .0021 .0024 .0026 .0029 18
19 .0005 .0005 .0006 .0007 .0008 .0009 .0010 .0011 .0012 .0014 19
20 .0002 .0002 .0002 .0003 .0003 .0004 .0004 .0005 .0005 .0006 20
21 .0001 .0001 .0001 .0001 .0001 .0002 .0002 .0002 .0002 .0003 21
22 .0001 .0001 .0001 .0001 .0001 .0001 22

x 9.1 9.2 9.3 9.4 9.5 9.6 9.7 9.8 9.9 10.0 x

0 .0001 .0001 .0001 .0001 .0001 .0001 .0001 .0001 .0001 0


1 .0010 .0009 .0009 .0008 .0007 .0007 .0006 .0005 .0005 .0005 1
2 .0046 .0043 .0040 .0037 .0034 .0031 .0029 .0027 .0025 .0023 2
3 .0140 .0131 .0123 .0115 .0107 .0100 .0093 .0087 .0081 .0076 3
4 .0319 .0302 .0285 .0269 .0254 .0240 .0226 .0213 .0201 .0189 4
5 .0581 .0555 .0530 .0506 .0483 .0460 .0439 .0418 .0398 .0378 5
6 .0881 .0851 .0822 .0793 .0764 .0736 .0709 .0682 .0656 .0631 6
7 .1145 .1118 .1091 .1064 .1037 .1010 .0982 .0955 .0928 .0901 7
8 .1302 .1286 .1269 .1251 .1232 .1212 .1191 .1170 .1148 .1126 8
9 .1317 .1315 .1311 .1306 .1300 .1293 .1284 .1274 .1263 .1251 9
10 .1198 .1210 .1219 .1228 .1235 .1241 .1245 .1249 .1250 .1251 10
11 .0991 .1012 .1031 .1049 .1067 .1083 .1098 .1112 .1125 .1137 11
12 .0752 .0776 .0799 .0822 .0844 .0866 .0888 .0908 .0928 .0948 12
13 .0526 .0549 .0572 .0594 .0617 .0640 .0662 .0685 .0707 .0729 13
14 .0342 .0361 .0380 .0399 .0419 .0439 .0459 .0479 .0500 .0521 14
15 .0208 .0221 .0235 .0250 .0265 .0281 .0297 .0313 .0330 .0347 15
16 .0118 .0127 .0137 .0147 .0157 .0168 .0180 .0192 .0204 .0217 16
17 .0063 .0069 .0075 .0081 .0088 .0095 .0103 .0111 .0119 .0128 17
18 .0032 .0035 .0039 .0042 .0046 .0051 .0055 .0060 .0065 .0071 18
19 .0015 .0017 .0019 .0021 .0023 .0026 .0028 .0031 .0034 .0037 19
20 .0007 .0008 .0009 .0010 .0011 .0012 .0014 .0015 .0017 .0019 20
21 .0003 .0003 .0004 .0004 .0005 .0006 .0006 .0007 .0008 .0009 21
22 .0001 .0001 .0002 .0002 .0002 .0002 .0003 .0003 .0004 .0004 22
23 .0001 .0001 .0001 .0001 .0001 .0001 .0001 .0002 .0002 23
24 .0001 .0001 .0001 24
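Any entry of Table 3 can be reproduced in R with dpois; for instance, for the column λ = 4.0:

    dpois(2, lambda = 4.0)               # 0.1465, matching the table entry for x = 2
    round(dpois(0:14, lambda = 4.0), 4)  # the whole lambda = 4.0 column at once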

Figure 4: Poisson distribution — confidence intervals

Note: It is assumed that an observation x is obtained from a Poisson distribution with parameter λ.
For a specified value of x, the curves specify a 95% confidence interval for λ. For a specified value of λ, the
curves give a two-sided critical region of size 0.05 to test the hypothesis that the specified value of λ is the
true value.
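As a numerical companion to the chart, base R's poisson.test gives the exact 95% confidence interval for λ from a single count; the count x = 12 below is illustrative only.

    poisson.test(12)$conf.int
    # roughly (6.2, 21.0); dividing by the person-time T converts this
    # into a CI for the rate alpha = lambda/T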

Table 5: Normal distribution — cumulative distribution function

x .00 .01 .02 .03 .04 .05 .06 .07 .08 .09 | differences to add (units of the 4th decimal place): 1 2 3 4 5 6 7 8 9

0.0 .5000 .5040 .5080 .5120 .5160 .5199 .5239 .5279 .5319 .5359 4 8 12 16 20 24 28 32 36
0.1 .5398 .5438 .5478 .5517 .5557 .5596 .5636 .5675 .5714 .5753 4 8 12 16 20 24 28 32 35
0.2 .5793 .5832 .5871 .5910 .5948 .5987 .6026 .6064 .6103 .6141 4 8 12 15 19 23 27 31 35
0.3 .6179 .6217 .6255 .6293 .6331 .6368 .6406 .6443 .6480 .6517 4 8 11 15 19 23 26 30 34
0.4 .6554 .6591 .6628 .6664 .6700 .6736 .6772 .6808 .6844 .6879 4 7 11 14 18 22 25 29 32
0.5 .6915 .6950 .6985 .7019 .7054 .7088 .7123 .7157 .7190 .7224 3 7 10 14 17 21 24 27 31
0.6 .7257 .7291 .7324 .7357 .7389 .7422 .7454 .7486 .7517 .7549 3 6 10 13 16 19 23 26 29
0.7 .7580 .7611 .7642 .7673 .7704 .7734 .7764 .7794 .7823 .7852 3 6 9 12 15 18 21 24 27
0.8 .7881 .7910 .7939 .7967 .7995 .8023 .8051 .8078 .8106 .8133 3 6 8 11 14 17 19 22 25
0.9 .8159 .8186 .8212 .8238 .8264 .8289 .8315 .8340 .8365 .8389 3 5 8 10 13 15 18 20 23
1.0 .8413 .8438 .8461 .8485 .8508 .8531 .8554 .8577 .8599 .8621 2 5 7 9 12 14 16 18 21
1.1 .8643 .8665 .8686 .8708 .8729 .8749 .8770 .8790 .8810 .8830 2 4 6 8 10 12 14 16 19
1.2 .8849 .8869 .8888 .8907 .8925 .8944 .8962 .8980 .8997 .9015 2 4 6 7 9 11 13 15 16
1.3 .9032 .9049 .9066 .9082 .9099 .9115 .9131 .9147 .9162 .9177 2 3 5 6 8 10 11 13 14
1.4 .9192 .9207 .9222 .9236 .9251 .9265 .9279 .9292 .9306 .9319 1 3 4 6 7 8 10 11 13
1.5 .9332 .9345 .9357 .9370 .9382 .9394 .9406 .9418 .9429 .9441 1 2 4 5 6 7 8 10 11
1.6 .9452 .9463 .9474 .9484 .9495 .9505 .9515 .9525 .9535 .9545 1 2 3 4 5 6 7 8 9
1.7 .9554 .9564 .9573 .9582 .9591 .9599 .9608 .9616 .9625 .9633 1 2 3 3 4 5 6 7 8
1.8 .9641 .9649 .9656 .9664 .9671 .9678 .9686 .9693 .9699 .9706 1 1 2 3 4 4 5 6 6
1.9 .9713 .9719 .9726 .9732 .9738 .9744 .9750 .9756 .9761 .9767 1 1 2 2 3 4 4 5 5
2.0 .9772 .9778 .9783 .9788 .9793 .9798 .9803 .9808 .9812 .9817 0 1 1 2 2 3 3 4 4
2.1 .9821 .9826 .9830 .9834 .9838 .9842 .9846 .9850 .9854 .9857 0 1 1 2 2 2 3 3 4
2.2 .9861 .9864 .9868 .9871 .9875 .9878 .9881 .9884 .9887 .9890 0 1 1 1 2 2 2 3 3
2.3 .9893 .9896 .9898 .9901 .9904 .9906 .9909 .9911 .9913 .9916 0 1 1 1 1 2 2 2 2
2.4 .9918 .9920 .9922 .9925 .9927 .9929 .9931 .9932 .9934 .9936 0 0 1 1 1 1 1 2 2
2.5 .9938 .9940 .9941 .9943 .9945 .9946 .9948 .9949 .9951 .9952 0 0 0 1 1 1 1 1 1
2.6 .9953 .9955 .9956 .9957 .9959 .9960 .9961 .9962 .9963 .9964 0 0 0 0 1 1 1 1 1
2.7 .9965 .9966 .9967 .9968 .9969 .9970 .9971 .9972 .9973 .9974 0 0 0 0 0 1 1 1 1
2.8 .9974 .9975 .9976 .9977 .9977 .9978 .9979 .9979 .9980 .9981 0 0 0 0 0 0 0 1 1
2.9 .9981 .9982 .9982 .9983 .9984 .9984 .9985 .9985 .9986 .9986 0 0 0 0 0 0 0 0 0
3.0 .9987 .9987 .9987 .9988 .9988 .9989 .9989 .9989 .9990 .9990 0 0 0 0 0 0 0 0 0
3.1 .9990 .9991 .9991 .9991 .9992 .9992 .9992 .9992 .9993 .9993 0 0 0 0 0 0 0 0 0
3.2 .9993 .9993 .9994 .9994 .9994 .9994 .9994 .9995 .9995 .9995 0 0 0 0 0 0 0 0 0
3.3 .9995 .9995 .9995 .9996 .9996 .9996 .9996 .9996 .9996 .9997 0 0 0 0 0 0 0 0 0
3.4 .9997 .9997 .9997 .9997 .9997 .9997 .9997 .9997 .9997 .9998 0 0 0 0 0 0 0 0 0
3.5 .9998 .9998 .9998 .9998 .9998 .9998 .9998 .9998 .9998 .9998 0 0 0 0 0 0 0 0 0
3.6 .9998 .9998 .9999 .9999 .9999 .9999 .9999 .9999 .9999 .9999 0 0 0 0 0 0 0 0 0
3.7 .9999 .9999 .9999 .9999 .9999 .9999 .9999 .9999 .9999 .9999 0 0 0 0 0 0 0 0 0
3.8 .9999 .9999 .9999 .9999 .9999 .9999 .9999 .9999 .9999 .9999 0 0 0 0 0 0 0 0 0

Table 6: Normal distribution — inverse cdf

q cq q cq q cq q cq q cq q cq
0.50 0.0000 0.60 0.2533 0.70 0.5244 0.80 0.8416 0.90 1.2816 0.99 2.3263
0.51 0.0251 0.61 0.2793 0.71 0.5534 0.81 0.8779 0.91 1.3408 0.991 2.3656
0.52 0.0502 0.62 0.3055 0.72 0.5828 0.82 0.9154 0.92 1.4051 0.992 2.4089
0.53 0.0753 0.63 0.3319 0.73 0.6128 0.83 0.9542 0.93 1.4758 0.993 2.4573
0.54 0.1004 0.64 0.3585 0.74 0.6433 0.84 0.9945 0.94 1.5548 0.994 2.5121
0.55 0.1257 0.65 0.3853 0.75 0.6745 0.85 1.0364 0.95 1.6449 0.995 2.5758
0.56 0.1510 0.66 0.4125 0.76 0.7063 0.86 1.0803 0.96 1.7507 0.996 2.6521
0.57 0.1764 0.67 0.4399 0.77 0.7388 0.87 1.1264 0.97 1.8808 0.997 2.7478
0.58 0.2019 0.68 0.4677 0.78 0.7722 0.88 1.1750 0.975 1.9600 0.998 2.8782
0.59 0.2275 0.69 0.4958 0.79 0.8064 0.89 1.2265 0.98 2.0537 0.999 3.0902
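In R, Tables 5 and 6 correspond to pnorm and qnorm. A few spot checks against the printed values:

    pnorm(1.96)    # 0.9750 (Table 5: row 1.9, column 6)
    qnorm(0.975)   # 1.9600 (Table 6: q = 0.975)
    qnorm(0.95)    # 1.6449 (Table 6: q = 0.95)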

Table 7: t distribution — inverse cdf

p
df 0.600 0.750 0.800 0.900 0.950 0.975 0.990 0.995 0.999 0.9995

1 0.325 1.000 1.376 3.078 6.314 12.71 31.82 63.66 318.3 636.6
2 0.289 0.816 1.061 1.886 2.920 4.303 6.965 9.925 22.33 31.60
3 0.277 0.765 0.978 1.638 2.353 3.182 4.541 5.841 10.21 12.92
4 0.271 0.741 0.941 1.533 2.132 2.776 3.747 4.604 7.173 8.610
5 0.267 0.727 0.920 1.476 2.015 2.571 3.365 4.032 5.894 6.869
6 0.265 0.718 0.906 1.440 1.943 2.447 3.143 3.707 5.208 5.959
7 0.263 0.711 0.896 1.415 1.895 2.365 2.998 3.499 4.785 5.408
8 0.262 0.706 0.889 1.397 1.860 2.306 2.896 3.355 4.501 5.041
9 0.261 0.703 0.883 1.383 1.833 2.262 2.821 3.250 4.297 4.781
10 0.260 0.700 0.879 1.372 1.812 2.228 2.764 3.169 4.144 4.587
11 0.260 0.697 0.876 1.363 1.796 2.201 2.718 3.106 4.025 4.437
12 0.259 0.695 0.873 1.356 1.782 2.179 2.681 3.055 3.930 4.318
13 0.259 0.694 0.870 1.350 1.771 2.160 2.650 3.012 3.852 4.221
14 0.258 0.692 0.868 1.345 1.761 2.145 2.624 2.977 3.787 4.140
15 0.258 0.691 0.866 1.341 1.753 2.131 2.602 2.947 3.733 4.073
16 0.258 0.690 0.865 1.337 1.746 2.120 2.583 2.921 3.686 4.015
17 0.257 0.689 0.863 1.333 1.740 2.110 2.567 2.898 3.646 3.965
18 0.257 0.688 0.862 1.330 1.734 2.101 2.552 2.878 3.610 3.922
19 0.257 0.688 0.861 1.328 1.729 2.093 2.539 2.861 3.579 3.883
20 0.257 0.687 0.860 1.325 1.725 2.086 2.528 2.845 3.552 3.850
21 0.257 0.686 0.859 1.323 1.721 2.080 2.518 2.831 3.527 3.819
22 0.256 0.686 0.858 1.321 1.717 2.074 2.508 2.819 3.505 3.792
23 0.256 0.685 0.858 1.319 1.714 2.069 2.500 2.807 3.485 3.768
24 0.256 0.685 0.857 1.318 1.711 2.064 2.492 2.797 3.467 3.745
25 0.256 0.684 0.856 1.316 1.708 2.060 2.485 2.787 3.450 3.725
26 0.256 0.684 0.856 1.315 1.706 2.056 2.479 2.779 3.435 3.707
27 0.256 0.684 0.855 1.314 1.703 2.052 2.473 2.771 3.421 3.689
28 0.256 0.683 0.855 1.313 1.701 2.048 2.467 2.763 3.408 3.674
29 0.256 0.683 0.854 1.311 1.699 2.045 2.462 2.756 3.396 3.660
30 0.256 0.683 0.854 1.310 1.697 2.042 2.457 2.750 3.385 3.646
31 0.256 0.682 0.853 1.309 1.696 2.040 2.453 2.744 3.375 3.633
32 0.255 0.682 0.853 1.309 1.694 2.037 2.449 2.738 3.365 3.622
33 0.255 0.682 0.853 1.308 1.692 2.035 2.445 2.733 3.356 3.611
34 0.255 0.682 0.852 1.307 1.691 2.032 2.441 2.728 3.348 3.601
35 0.255 0.682 0.852 1.306 1.690 2.030 2.438 2.724 3.340 3.591
36 0.255 0.681 0.852 1.306 1.688 2.028 2.434 2.719 3.333 3.582
37 0.255 0.681 0.851 1.305 1.687 2.026 2.431 2.715 3.326 3.574
38 0.255 0.681 0.851 1.304 1.686 2.024 2.429 2.712 3.319 3.566
39 0.255 0.681 0.851 1.304 1.685 2.023 2.426 2.708 3.313 3.558
40 0.255 0.681 0.851 1.303 1.684 2.021 2.423 2.704 3.307 3.551
50 0.255 0.679 0.849 1.299 1.676 2.009 2.403 2.678 3.261 3.496
60 0.254 0.679 0.848 1.296 1.671 2.000 2.390 2.660 3.232 3.460
70 0.254 0.678 0.847 1.294 1.667 1.994 2.381 2.648 3.211 3.435
80 0.254 0.678 0.846 1.292 1.664 1.990 2.374 2.639 3.195 3.416
90 0.254 0.677 0.846 1.291 1.662 1.987 2.368 2.632 3.183 3.402
100 0.254 0.677 0.845 1.290 1.660 1.984 2.364 2.626 3.174 3.390
120 0.254 0.677 0.845 1.289 1.658 1.980 2.358 2.617 3.160 3.373
160 0.254 0.676 0.844 1.287 1.654 1.975 2.350 2.607 3.142 3.352
200 0.254 0.676 0.843 1.286 1.653 1.972 2.345 2.601 3.131 3.340
240 0.254 0.676 0.843 1.285 1.651 1.970 2.342 2.596 3.125 3.332
300 0.254 0.675 0.843 1.284 1.650 1.968 2.339 2.592 3.118 3.323
400 0.254 0.675 0.843 1.284 1.649 1.966 2.336 2.588 3.111 3.315
∞ 0.253 0.674 0.842 1.282 1.645 1.960 2.326 2.576 3.090 3.290

Note: Interpolation with respect to df should be linear in 120/df .
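As a worked example of this note (df = 45 is an illustrative value not in the table): interpolating c_0.975 between df = 40 (2.021) and df = 50 (2.009), linearly in 120/df,

    u <- 120/45                  # 2.667, between 120/50 = 2.4 and 120/40 = 3.0
    w <- (3.0 - u)/(3.0 - 2.4)   # interpolation weight, about 0.556
    2.021 - w*(2.021 - 2.009)    # about 2.0143
    qt(0.975, df = 45)           # 2.0141 in R, so the interpolation is accurate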



Table 8: χ2 distribution — inverse cdf

p
df 0.005 0.010 0.025 0.050 0.100 0.250 0.500 0.750 0.900 0.950 0.975 0.990 0.995 0.999

1 0.000 0.000 0.001 0.004 0.016 0.102 0.455 1.323 2.706 3.841 5.024 6.635 7.879 10.83
2 0.010 0.020 0.051 0.103 0.211 0.575 1.386 2.773 4.605 5.991 7.378 9.210 10.60 13.82
3 0.072 0.115 0.216 0.352 0.584 1.213 2.366 4.108 6.251 7.815 9.348 11.34 12.84 16.27
4 0.207 0.297 0.484 0.711 1.064 1.923 3.357 5.385 7.779 9.488 11.14 13.28 14.86 18.47
5 0.412 0.554 0.831 1.145 1.610 2.675 4.351 6.626 9.236 11.07 12.83 15.09 16.75 20.51
6 0.676 0.872 1.237 1.635 2.204 3.455 5.348 7.841 10.64 12.59 14.45 16.81 18.55 22.46
7 0.989 1.239 1.690 2.167 2.833 4.255 6.346 9.037 12.02 14.07 16.01 18.48 20.28 24.32
8 1.344 1.647 2.180 2.733 3.490 5.071 7.344 10.22 13.36 15.51 17.53 20.09 21.95 26.12
9 1.735 2.088 2.700 3.325 4.168 5.899 8.343 11.39 14.68 16.92 19.02 21.67 23.59 27.88
10 2.156 2.558 3.247 3.940 4.865 6.737 9.342 12.55 15.99 18.31 20.48 23.21 25.19 29.59
11 2.603 3.053 3.816 4.575 5.578 7.584 10.34 13.70 17.28 19.68 21.92 24.73 26.76 31.26
12 3.074 3.571 4.404 5.226 6.304 8.438 11.34 14.85 18.55 21.03 23.34 26.22 28.30 32.91
13 3.565 4.107 5.009 5.892 7.041 9.299 12.34 15.98 19.81 22.36 24.74 27.69 29.82 34.53
14 4.075 4.660 5.629 6.571 7.790 10.17 13.34 17.12 21.06 23.68 26.12 29.14 31.32 36.12
15 4.601 5.229 6.262 7.261 8.547 11.04 14.34 18.25 22.31 25.00 27.49 30.58 32.80 37.70
16 5.142 5.812 6.908 7.962 9.312 11.91 15.34 19.37 23.54 26.30 28.85 32.00 34.27 39.25
17 5.697 6.408 7.564 8.672 10.09 12.79 16.34 20.49 24.77 27.59 30.19 33.41 35.72 40.79
18 6.265 7.015 8.231 9.390 10.86 13.68 17.34 21.60 25.99 28.87 31.53 34.81 37.16 42.31
19 6.844 7.633 8.907 10.12 11.65 14.56 18.34 22.72 27.20 30.14 32.85 36.19 38.58 43.82
20 7.434 8.260 9.591 10.85 12.44 15.45 19.34 23.83 28.41 31.41 34.17 37.57 40.00 45.31
21 8.034 8.897 10.28 11.59 13.24 16.34 20.34 24.93 29.62 32.67 35.48 38.93 41.40 46.80
22 8.643 9.542 10.98 12.34 14.04 17.24 21.34 26.04 30.81 33.92 36.78 40.29 42.80 48.27
23 9.260 10.20 11.69 13.09 14.85 18.14 22.34 27.14 32.01 35.17 38.08 41.64 44.18 49.73
24 9.886 10.86 12.40 13.85 15.66 19.04 23.34 28.24 33.20 36.42 39.36 42.98 45.56 51.18
25 10.52 11.52 13.12 14.61 16.47 19.94 24.34 29.34 34.38 37.65 40.65 44.31 46.93 52.62
26 11.16 12.20 13.84 15.38 17.29 20.84 25.34 30.43 35.56 38.89 41.92 45.64 48.29 54.05
27 11.81 12.88 14.57 16.15 18.11 21.75 26.34 31.53 36.74 40.11 43.19 46.96 49.65 55.48
28 12.46 13.56 15.31 16.93 18.94 22.66 27.34 32.62 37.92 41.34 44.46 48.28 50.99 56.89
29 13.12 14.26 16.05 17.71 19.77 23.57 28.34 33.71 39.09 42.56 45.72 49.59 52.34 58.30
30 13.79 14.95 16.79 18.49 20.60 24.48 29.34 34.80 40.26 43.77 46.98 50.89 53.67 59.70
31 14.46 15.66 17.54 19.28 21.43 25.39 30.34 35.89 41.42 44.99 48.23 52.19 55.00 61.10
32 15.13 16.36 18.29 20.07 22.27 26.30 31.34 36.97 42.58 46.19 49.48 53.49 56.33 62.49
33 15.82 17.07 19.05 20.87 23.11 27.22 32.34 38.06 43.75 47.40 50.73 54.78 57.65 63.87
34 16.50 17.79 19.81 21.66 23.95 28.14 33.34 39.14 44.90 48.60 51.97 56.06 58.96 65.25
35 17.19 18.51 20.57 22.47 24.80 29.05 34.34 40.22 46.06 49.80 53.20 57.34 60.27 66.62
36 17.89 19.23 21.34 23.27 25.64 29.97 35.34 41.30 47.21 51.00 54.44 58.62 61.58 67.98
37 18.59 19.96 22.11 24.07 26.49 30.89 36.34 42.38 48.36 52.19 55.67 59.89 62.88 69.35
38 19.29 20.69 22.88 24.88 27.34 31.81 37.34 43.46 49.51 53.38 56.90 61.16 64.18 70.70
39 20.00 21.43 23.65 25.70 28.20 32.74 38.34 44.54 50.66 54.57 58.12 62.43 65.48 72.06
40 20.71 22.16 24.43 26.51 29.05 33.66 39.34 45.62 51.81 55.76 59.34 63.69 66.77 73.40
50 27.99 29.71 32.36 34.76 37.69 42.94 49.33 56.33 63.17 67.50 71.42 76.15 79.49 86.66
60 35.53 37.48 40.48 43.19 46.46 52.29 59.33 66.98 74.40 79.08 83.30 88.38 91.95 99.61
70 43.28 45.44 48.76 51.74 55.33 61.70 69.33 77.58 85.53 90.53 95.02 100.4 104.2 112.3
80 51.17 53.54 57.15 60.39 64.28 71.14 79.33 88.13 96.58 101.9 106.6 112.3 116.3 124.8
90 59.20 61.75 65.65 69.13 73.29 80.62 89.33 98.65 107.6 113.1 118.1 124.1 128.3 137.2
100 67.33 70.06 74.22 77.93 82.36 90.13 99.33 109.1 118.5 124.3 129.6 135.8 140.2 149.4
120 83.85 86.92 91.57 95.70 100.6 109.2 119.3 130.1 140.2 146.6 152.2 159.0 163.6 173.6
140 100.7 104.0 109.1 113.7 119.0 128.4 139.3 150.9 161.8 168.6 174.6 181.8 186.8 197.4
160 117.7 121.3 126.9 131.8 137.5 147.6 159.3 171.7 183.3 190.5 196.9 204.5 209.8 221.0
180 134.9 138.8 144.7 150.0 156.2 166.9 179.3 192.4 204.7 212.3 219.0 227.1 232.6 244.4
200 152.2 156.4 162.7 168.3 174.8 186.2 199.3 213.1 226.0 234.0 241.1 249.4 255.3 267.5
240 187.3 192.0 199.0 205.1 212.4 224.9 239.3 254.4 268.5 277.1 284.8 293.9 300.2 313.4
300 240.7 246.0 253.9 260.9 269.1 283.1 299.3 316.1 331.8 341.4 349.9 359.9 366.8 381.4
400 330.9 337.2 346.5 354.6 364.2 380.6 399.3 418.7 436.6 447.6 457.3 468.7 476.6 493.1

Note: Linear interpolation with respect to df should be satisfactory for most purposes.
For df > 100, use $c_q(\chi^2_{df}) \approx \frac{1}{2}\bigl(c_q(N) + \sqrt{2\,df - 1}\,\bigr)^2$, where N denotes the standard normal distribution.
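A quick check of this approximation in R, at the illustrative values df = 150 and q = 0.95:

    0.5*(qnorm(0.95) + sqrt(2*150 - 1))^2   # about 179.3
    qchisq(0.95, df = 150)                  # about 179.6, so the approximation is adequate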

Figure 9: Confidence intervals for the correlation coefficient

[Chart: pairs of 95% confidence belts for ρ, one pair for each sample size n = 10, 20, 50, 100, 200, 500; horizontal axis: sample correlation coefficient r; vertical axis: ρ.]

Note: It is assumed that a random sample of n observations is obtained on a bivariate normal pop-
ulation with correlation coefficient, ρ. The numbers on the curves indicate the sample size.
For an observed value of the sample correlation coefficient, r, the curves specify a 95% con-
fidence interval for ρ. For a given value of ρ, the curves specify a two-sided critical region of
size 0.05 to test the hypothesis that the given value is the true value.
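In R, cor.test gives a 95% confidence interval for ρ directly (computed via Fisher's z transformation rather than from the chart); the data below are illustrative only.

    x <- c(1.2, 2.4, 3.1, 4.8, 5.0, 6.3, 7.7, 8.1)
    y <- c(2.1, 2.0, 3.9, 4.4, 6.1, 5.8, 8.0, 7.9)
    cor.test(x, y)$conf.int   # 95% CI for rho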

Experimental Design & Data Analysis: Summary notes

STATISTICS
Types of variable properties
categorical category
ordinal category + order
numerical category + order + scale; [counting = discrete, measurement = continuous]

Descriptive statistics for {x_1, x_2, ..., x_n}; order statistics $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$.

sample mean, x̄                   $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \approx \frac{1}{n}\sum_{j=1}^{k} f_j u_j$ (grouped data: frequencies f_j at class midpoints u_j)
sample median, m̂, ĉ_0.5          the middle observation, $x_{(\frac{1}{2}(n+1))}$
sample P-trimmed mean            trim off ⌈½nP⌉ observations at each end, and average the rest.
sample mode, M̂                   the most frequent observation, or the midpoint of the most frequent class.
sample quantile, ĉ_q             $\hat{c}_q = x_{(k)}$, where $k = (n+1)q$.
sample quartiles                 Q1 = ĉ_0.25, Q3 = ĉ_0.75 (Q2 = m̂ = ĉ_0.5).
five-number summary              (min, Q1, med, Q3, max)
boxplot                          graphical display of the five-number summary; 'outliers' outside
                                 (Q1 − 1.5 IQR, Q3 + 1.5 IQR) are plotted separately [sketch omitted]
sample variance, s²              $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
form for computation             $s^2 = \frac{1}{n-1}\Bigl(\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\bigl(\sum_{i=1}^{n} x_i\bigr)^2\Bigr) \approx \frac{1}{n-1}\Bigl(\sum_{j=1}^{k} f_j u_j^2 - \frac{1}{n}\bigl(\sum_{j=1}^{k} f_j u_j\bigr)^2\Bigr)$
sample standard deviation, s     $s = \sqrt{s^2}$
sample interquartile range, IQR  IQR = Q3 − Q1, $\hat{\tau} = \hat{c}_{0.75} - \hat{c}_{0.25}$ (a number, not an interval)
sample range                     $x_{(n)} - x_{(1)}$
frequency distributions          dotplot, bar graph, histogram [sketches omitted]
sample pmf, p̂(x)                 $\hat{p}(x) = \frac{1}{n}\,\mathrm{freq}(X = x)$
sample pdf, f̂(x)                 $\hat{f}(x) = \frac{1}{n(b-a)}\,\mathrm{freq}(a < X < b)$ for cell $a < x < b$ [histogram]
sample cdf, F̂(x)                 $\hat{F}(x) = \frac{1}{n}\,\mathrm{freq}(X \le x)$; $\hat{F}(x) = \frac{k}{n}$ for $x_{(k)} \le x < x_{(k+1)}$
sample quantiles (inverse cdf)   $\hat{F}(\hat{c}_q) \approx q$; $\hat{c}_q \approx \hat{F}^{-1}(q)$.
sample covariance, s_xy          $s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$
sample correlation, r = r_xy     $r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{\sum(x-\bar{x})(y-\bar{y})}{\sqrt{\sum(x-\bar{x})^2\,\sum(y-\bar{y})^2}} = \frac{1}{n-1}\sum_{i=1}^{n}\frac{x_i-\bar{x}}{s_x}\cdot\frac{y_i-\bar{y}}{s_y}$

risk (incidence proportion), R   $\hat{R} = \dfrac{\text{number developing disease } D \text{ during time period } \Delta t}{\text{number of individuals followed for the time period}}$
incidence rate, α                $\hat{\alpha} = \dfrac{\text{number of individuals developing disease } D \text{ in a time interval}}{\text{total time for which individuals were followed}}$
prevalence proportion, π         $\hat{\pi} = \dfrac{\text{number of individuals with characteristic } D \text{ at time } t}{\text{total number of individuals}}$
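A minimal R sketch of these sample quantities (the data vector is illustrative only):

    x <- c(1.7, 2.2, 2.9, 3.1, 3.4, 3.4, 4.0, 4.6, 5.8, 7.3)
    mean(x); median(x); var(x); sd(x)
    quantile(x, c(0.25, 0.5, 0.75), type = 6)  # type = 6 uses k = (n+1)q, as above
    fivenum(x)                                 # five-number summary
    IQR(x)                                     # Q3 - Q1
    mean(x, trim = 0.1)                        # 10%-trimmed mean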

STATISTICS
Data sources. Types of studies: experimental studies observational studies
clinical trials cohort (follow-up, prospective)
field trials case-control (retrospective)
community intervention cross sectional (survey)
imposed intervention no intervention
(randomisation)
inferred causation no inferred causation

statistical experiments: treatments applied to experimental units, and their effect on the response variable is observed
desirable qualities of an experiment: (1) validity (unbiasedness); (2) precision (efficiency).
validity control group no treatment; placebo = simulated (non)treatment
randomisation each unit has an equal probability of being assigned each treatment
precision blocking a block is a group of similar experimental units;
(stratification) block ≈ sub-experiment: randomise within blocks
replication more observations increases precision
balance balance is preferable: i.e. equal numbers with each treatment
confounding variable an explanatory variable whose effect distorts the effect of another.
lurking variable an unobserved variable that could be a confounding variable

PROBABILITY, Pr (a set function defined on an event space)


random experiment a procedure leading to an observable outcome
event space, Ω set of possible outcomes
event, A subset of event space
properties of probability function  (1) 0 ≤ Pr(A) ≤ 1 for all events A
                                    (2) Pr(∅) = 0, Pr(Ω) = 1
                                    (3) Pr(A′) = 1 − Pr(A) (A′ denotes the complement of A).
                                    (4) A ⊆ B ⇒ Pr(A) ≤ Pr(B)
                                    (5) Pr(A ∪ B) = Pr(A) + Pr(B) − Pr(A ∩ B) [addition theorem]
assigning values to Pr           symmetry; long-term relative frequency; subjective; model
odds, O                          $O(A) = \frac{\Pr(A)}{\Pr(A')}$; odds $= \frac{p}{1-p}$

probability table for A and B             B           B′                      (shorthand)
                                 A     Pr(A∩B)     Pr(A∩B′)    Pr(A)           α   β
                                 A′    Pr(A′∩B)    Pr(A′∩B′)   Pr(A′)          γ   δ
                                       Pr(B)       Pr(B′)      1

conditional probability          $\Pr(A \mid H) = \frac{\Pr(A \cap H)}{\Pr(H)}$, $\Pr(H) \neq 0$
conditional odds                 $O(A \mid H) = \frac{\Pr(A \mid H)}{\Pr(A' \mid H)}$.
multiplication rule              Pr(A∩B) = Pr(A) Pr(B | A) = Pr(B) Pr(A | B)
relationship between A and B     positive relationship: Pr(A | B) > Pr(A) > Pr(A | B′);
                                 negative relationship: Pr(A | B) < Pr(A) < Pr(A | B′)
law of total probability         $\Pr(H) = \sum_{i=1}^{m} \Pr(A_i)\Pr(H \mid A_i)$ for {A_i} a partition of Ω.
Bayes' theorem                   $\Pr(A_k \mid H) = \dfrac{\Pr(A_k)\Pr(H \mid A_k)}{\sum_{i=1}^{m}\Pr(A_i)\Pr(H \mid A_i)}$ for {A_i} a partition of Ω:
                                 mutually exclusive and exhaustive "causes" A_1, A_2, ..., A_k of "result" H,
                                 e.g. exposure → disease; disease → test result
relative risk (risk ratio), RR   $RR = \frac{\Pr(D \mid E)}{\Pr(D \mid E')}$ for disease D with exposure E; $RR = \frac{\alpha(\gamma+\delta)}{\gamma(\alpha+\beta)}$
odds ratio, OR                   $OR = \frac{O(D \mid E)}{O(D \mid E')}$ for disease D with exposure E; $OR = \frac{\alpha\delta}{\beta\gamma}$

Diagnostic testing D = individual has disease, P = individual tests positive


sensitivity sn = Pr(P | D)
specificity sp = Pr(P ′ | D′ )
positive predictive value ppv = Pr(D | P )
negative predictive value npv = Pr(D′ | P ′ )
errors false positive = D′ ∩P ; false negative = D∩P ′
prevalence, prior probability Pr(D)
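The predictive values follow from Bayes' theorem. A small R helper (the function name and the illustrative inputs are ours, not from the notes):

    predictive_values <- function(sn, sp, prev) {
      ppv <- sn*prev / (sn*prev + (1 - sp)*(1 - prev))         # Pr(D | P)
      npv <- sp*(1 - prev) / (sp*(1 - prev) + (1 - sn)*prev)   # Pr(D' | P')
      c(ppv = ppv, npv = npv)
    }
    predictive_values(sn = 0.95, sp = 0.90, prev = 0.02)
    # ppv is only about 0.16: even a good test has a low ppv when the disease is rare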

Independent events               Pr(A ∩ B) = Pr(A) Pr(B) ≠ 0 (e.g. H1, H2)
cf. mutually exclusive events    A ∩ B = ∅, Pr(A ∩ B) = 0 (e.g. H1, T1)
independence of n events         $\Pr(A_{j_1} \cap A_{j_2} \cap \cdots \cap A_{j_m}) = \Pr(A_{j_1})\Pr(A_{j_2})\cdots\Pr(A_{j_m})$
if A_1, A_2, ..., A_n independent, then:   $\Pr(A_1 \cap A_2 \cap \cdots \cap A_n) = \Pr(A_1)\Pr(A_2)\cdots\Pr(A_n)$
                                 $\Pr(A_1 \cup A_2 \cup \cdots \cup A_n) = 1 - \Pr(A_1' \cap \cdots \cap A_n') = 1 - \Pr(A_1')\cdots\Pr(A_n')$,
                                 i.e. Pr("at least one") = 1 − Pr("none").

Random variable, X: Ω → R        Maths defn: real-valued function defined on Ω, X(ω), ω ∈ Ω;
                                 a numerical outcome of a random procedure.
sample space, S                  the set of possible values of X, i.e. the range of the function X: Ω → S ⊆ R
cumulative distribution function, cdf   F(x) = Pr(X ≤ x)
properties of a cdf F            (1) F non-decreasing
                                 (2) F(−∞) = 0, F(∞) = 1
                                 (3) F right-continuous, i.e. F(x + 0) = F(x). [sketches of cdfs omitted]
probability from cdf             Pr(a < X ≤ b) = F(b) − F(a)
q-quantile, c_q (0 < q < 1)      $c_q = F_X^{-1}(q)$ [sketch of the inverse cdf omitted]
continuous random variables      Pr(X = x) = 0
probability density function, pdf   $f(x) = \frac{d}{dx}F(x)$; $\Pr(X \approx x) \approx f(x)\,\delta x$
properties of a pdf f            (1) f(x) ≥ 0
                                 (2) $\int_{-\infty}^{\infty} f(x)\,dx = 1$ [pdf sketch: total area = 1]
probability from pdf             $\Pr(a < X \le b) = \int_a^b f(x)\,dx \;\Rightarrow\; F(x) = \int_{-\infty}^{x} f(t)\,dt$
discrete random variables
probability mass function, pmf   p(x) = Pr(X = x)
properties of a pmf p            (1) p(x) ≥ 0
                                 (2) $\sum p(x) = 1$ [pmf sketch omitted]
relation of pmf to cdf           p(x) = F(x + 0) − F(x − 0) = jump in F at x
Expectation, E
expectation of ψ(X)              $E(\psi(X)) = \int \psi(x) f(x)\,dx$ or $\sum \psi(x)\,p(x)$
mean of X, µ, E(X)               $\int x f(x)\,dx$ or $\sum x\,p(x)$
E(a + bX), E(X + Y)              a + b E(X), E(X) + E(Y)
median of X, m                   0.5-quantile, $c_{0.5} = F^{-1}(0.5)$
mode of X, M                     f(M) ≥ f(x) for all x, or p(M) ≥ p(x) for all x
variance of X, var(X), σ²        $\sigma^2 = E\bigl((X-\mu)^2\bigr) = E(X^2) - \bigl(E(X)\bigr)^2$
standard deviation, sd(X), σ     $\mathrm{sd}(X) = \sqrt{\mathrm{var}(X)}$
var(a + bX), sd(a + bX)          b² var(X), |b| sd(X)
var(X + Y) (X and Y independent)   var(X) + var(Y)
covariance of X and Y, cov(X, Y)   $\sigma_{XY} = E\bigl((X-\mu_X)(Y-\mu_Y)\bigr)$ (zero if X and Y are independent).
correlation of X and Y, ρ(X, Y)    $\rho_{XY} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}$ (zero if X and Y are independent).
var(aX + bY)                     a² var(X) + b² var(Y) + 2ab cov(X, Y)
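A quick simulation check of this last formula for independent X and Y (so cov(X, Y) = 0); the constants and distributions below are illustrative.

    set.seed(1)
    a <- 2; b <- -3
    X <- rnorm(1e5, mean = 1, sd = 2)   # var(X) = 4
    Y <- rpois(1e5, lambda = 5)         # var(Y) = 5
    var(a*X + b*Y)                      # close to a^2*4 + b^2*5 = 61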

Linear combinations of independent rvs   Y = a_1 X_1 + a_2 X_2 + · · · + a_k X_k, with E(X_i) = µ_i, var(X_i) = σ_i²
mean of a_1 X_1 + · · · + a_k X_k    $E(Y) = a_1\mu_1 + a_2\mu_2 + \cdots + a_k\mu_k$; e.g. $E(X_1 - X_2) = \mu_1 - \mu_2$;
variance of a_1 X_1 + · · · + a_k X_k   $\mathrm{var}(Y) = a_1^2\sigma_1^2 + a_2^2\sigma_2^2 + \cdots + a_k^2\sigma_k^2$; e.g. $\mathrm{var}(X_1 - X_2) = \sigma_1^2 + \sigma_2^2$;
if X_1, X_2, ..., X_k normally distributed, then Y = a_1 X_1 + a_2 X_2 + · · · + a_k X_k is normally distributed.
combining indept unbiased estimators   T_1, T_2, ..., T_k independent, with E(T_i) = θ and var(T_i) = σ_i².
optimal T = a_1 T_1 + · · · + a_k T_k   $a_i = \frac{c}{\sigma_i^2}$, where $c = 1\big/\bigl(\frac{1}{\sigma_1^2} + \cdots + \frac{1}{\sigma_k^2}\bigr)$ ⇒ E(T) = θ, var(T) = c.

Random sampling: iidrvs          independent identically distributed random variables
random sample on X               X_1, X_2, ..., X_n iidrvs, each distributed as X
statistic, T                     T = ψ(X_1, X_2, ..., X_n)
distribution of frequencies      freq(A) ∼ Bi(n, Pr(A))
sample mean                      $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$; $E(\bar{X}) = \mu$, $\mathrm{var}(\bar{X}) = \frac{\sigma^2}{n}$
sample variance, S²              $S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2$; $E(S^2) = \sigma^2$.
law of large numbers             if µ = E(X) < ∞, then $\bar{X} \to \mu$ in probability as n → ∞
central limit theorem            if also σ² = var(X) < ∞, then $\bar{X}$ is approximately $N(\mu, \frac{\sigma^2}{n})$

Statistical Inference
estimator of θ                   T is a statistic chosen so that it will be close to θ
estimate of θ                    t is a realisation of an estimator T
unbiasedness (for θ)             E(T) = θ

Confidence interval              "basic confidence interval": est ± "2" se
confidence interval for θ based on T   realisation of the random interval (ℓ(T), u(T)),
                                 where Pr(ℓ(T) < θ < u(T)) = γ; CI for θ: (ℓ(t), u(t))

Hypothesis testing               "basic test statistic": $\frac{\text{est} - \theta_0}{\text{se}^*}$, cf. "2"
significance level               α = Pr(type I error) = Pr(reject H0 | H0)
power                            1 − β = Pr(reject H0 | H1); β = Pr(type II error) = Pr(do not reject H0 | H1)
power function                   Q(θ) = Pr(reject H0 | θ)
p-value                          Pr(test statistic is at least as extreme as the value observed | H0);
                                 reject H0 if p < α.

Inference for normal populations          (variance known)                                   (variance unknown)
one sample: n on N(µ, σ²)        $\frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \sim N$;               $\frac{\bar{X}-\mu}{S/\sqrt{n}} \sim t_{n-1}$
100(1−α)% CI for µ               $\bar{x} \pm c_{1-\frac{1}{2}\alpha}(N)\,\frac{\sigma}{\sqrt{n}}$;      $\bar{x} \pm c_{1-\frac{1}{2}\alpha}(t_{n-1})\,\frac{s}{\sqrt{n}}$
100(1−α)% PI for X               $\bar{x} \pm c_{1-\frac{1}{2}\alpha}(N)\,\sigma\sqrt{1+\frac{1}{n}}$;   $\bar{x} \pm c_{1-\frac{1}{2}\alpha}(t_{n-1})\,s\sqrt{1+\frac{1}{n}}$
test statistic for µ = µ₀        $z = \frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}}$;                $t = \frac{\bar{x}-\mu_0}{s/\sqrt{n}}$
sample size calculations
  100(1−α)% CI = [est ± w]:      $n > \frac{z_{1-\frac{1}{2}\alpha}^2\,\sigma^2}{w^2}$;
  sig level (µ₀) α; power (µ₁) 1−β:   $n > \frac{(z_{1-\frac{1}{2}\alpha}+z_{1-\beta})^2\,\sigma^2}{(\mu_1-\mu_0)^2}$; and if σ₁ ≠ σ₀: $n > \frac{(z_{1-\frac{1}{2}\alpha}\,\sigma_0 + z_{1-\beta}\,\sigma_1)^2}{(\mu_1-\mu_0)^2}$
checking Normality: QQ plot      $\{(\Phi^{-1}(\tfrac{k}{n+1}),\ x_{(k)}),\ k = 1, 2, \ldots, n\}$;
                                 if the Normal model is correct, points should be close to a straight line with intercept µ and slope σ.
probability plot for Normality   QQ plot with axes interchanged [and Φ⁻¹(q) relabelled as q.]
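In R, the one-sample procedures above are handled by t.test, and the QQ plot by qqnorm (data illustrative only):

    x <- c(5.1, 4.8, 5.6, 5.0, 4.7, 5.3, 5.2, 4.9)
    t.test(x, mu = 5)        # t statistic, p-value and 95% CI for mu in one call
    qqnorm(x); qqline(x)     # QQ plot for checking Normality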

two samples: n₁ on N(µ₁, σ₁²), n₂ on N(µ₂, σ₂²)
                                 $\frac{(\bar{X}_1-\bar{X}_2)-(\mu_1-\mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}} \sim N$;      $\frac{(\bar{X}_1-\bar{X}_2)-(\mu_1-\mu_2)}{\sqrt{\frac{S_1^2}{n_1}+\frac{S_2^2}{n_2}}} \approx t_k$,
                                 where min(n₁−1, n₂−1) ≤ k ≤ n₁+n₂−2.
100(1−α)% CI for µ₁−µ₂           $\bar{x}_1-\bar{x}_2 \pm c_{1-\frac{1}{2}\alpha}(N)\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}$;      $\bar{x}_1-\bar{x}_2 \pm c_{1-\frac{1}{2}\alpha}(t_k)\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}$
test statistic for µ₁−µ₂ = 0     $z = \frac{\bar{x}_1-\bar{x}_2}{\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}} \sim N$;      $t^* = \frac{\bar{x}_1-\bar{x}_2}{\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}} \approx t_k$

if σ₁² = σ₂² = σ², then          $\frac{(\bar{X}_1-\bar{X}_2)-(\mu_1-\mu_2)}{\sqrt{S^2(\frac{1}{n_1}+\frac{1}{n_2})}} \sim t_{n_1+n_2-2}$, where $S^2 = \frac{(n_1-1)S_1^2+(n_2-1)S_2^2}{n_1+n_2-2}$.
100(1−α)% CI for µ₁−µ₂           $\bar{x}_1-\bar{x}_2 \pm c_{1-\frac{1}{2}\alpha}(t_{n_1+n_2-2})\sqrt{s^2(\frac{1}{n_1}+\frac{1}{n_2})}$
test statistic for µ₁ = µ₂       $t = \frac{\bar{x}_1-\bar{x}_2}{\sqrt{s^2(\frac{1}{n_1}+\frac{1}{n_2})}}$
sample size calculations
  100(1−α)% CI = [est ± w]:      $n_1 = n_2 > \frac{2 z_{1-\frac{1}{2}\alpha}^2\,\sigma^2}{w^2}$;
  sig level α; power(d) = 1−β:   $n_1 = n_2 > \frac{2(z_{1-\frac{1}{2}\alpha}+z_{1-\beta})^2\,\sigma^2}{d^2}$, where $Z = \frac{\bar{X}_1-\bar{X}_2}{\sigma\sqrt{2/n}}$, $\theta = \frac{\mu_1-\mu_2}{\sigma\sqrt{2/n}}$
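Both two-sample procedures are available through t.test in R: the default is the unequal-variance (Welch) form with its approximate t_k reference distribution, and var.equal = TRUE gives the pooled-variance t_{n1+n2−2} version above (data illustrative only):

    x1 <- c(12.1, 13.4, 11.8, 12.9, 13.1, 12.5)
    x2 <- c(11.2, 11.9, 12.3, 10.8, 11.5, 12.0, 11.1)
    t.test(x1, x2)                     # Welch (unequal variances)
    t.test(x1, x2, var.equal = TRUE)   # pooled variance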
Inference for proportions
one sample of n                  $\hat{p} = \frac{x}{n}$; $X \sim \mathrm{Bi}(n, p) \approx N\bigl(np,\ np(1-p)\bigr)$ (np > 5, nq > 5) [CC]
large n                          est = p̂, $\mathrm{se} = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$, $\mathrm{se}_0 = \sqrt{\frac{p_0(1-p_0)}{n}}$;
                                 CI: est ± $z_{1-\frac{1}{2}\alpha}$ se; HT: $z = \frac{\text{est}-p_0}{\text{se}_0}$
small n                          R, Statistic-Parameter diagram [Figure 2]
two samples of n₁ and n₂         $X_i \sim \mathrm{Bi}(n_i, p_i) \approx N\bigl(n_i p_i,\ n_i p_i(1-p_i)\bigr)$; $\hat{p}_i = \frac{x_i}{n_i}$.
large n confidence interval      est = p̂₁ − p̂₂, $\mathrm{se} = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1}+\frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$; CI: est ± $z_{1-\frac{1}{2}\alpha}$ se;
large n test p₁ = p₂             est = p̂₁ − p̂₂, $\mathrm{se}_0 = \sqrt{\hat{p}(1-\hat{p})(\frac{1}{n_1}+\frac{1}{n_2})}$, $\hat{p} = \frac{x_1+x_2}{n_1+n_2}$; HT: $z = \frac{\text{est}}{\text{se}_0}$
sample size calculations         use σ₀² = p₀(1−p₀) and σ₁² = p₁(1−p₁) in the Normal results above (σ₀ ≠ σ₁).
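The large-n two-sample test can be computed directly from these formulas; base R's prop.test (without continuity correction) gives the equivalent chi-squared form. Counts illustrative only:

    x1 <- 45; n1 <- 120; x2 <- 30; n2 <- 130
    p1 <- x1/n1; p2 <- x2/n2
    p  <- (x1 + x2)/(n1 + n2)                  # pooled p-hat
    se0 <- sqrt(p*(1 - p)*(1/n1 + 1/n2))
    z  <- (p1 - p2)/se0                        # z = est/se0
    2*pnorm(-abs(z))                           # two-sided p-value
    prop.test(c(x1, x2), c(n1, n2), correct = FALSE)$p.value  # agrees, since X^2 = z^2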
Inference for rates
one sample for person-time t     $\hat{\alpha} = \frac{x}{t}$; $X \sim \mathrm{Pn}(\alpha t) \approx N(\alpha t,\ \alpha t)$ (αt > 10) [CC]
large t                          est = α̂, $\mathrm{se} = \sqrt{\frac{\hat{\alpha}}{t}}$; $\mathrm{se}_0 = \sqrt{\frac{\alpha_0}{t}}$; CI: est ± $z_{1-\frac{1}{2}\alpha}$ se; HT: $z = \frac{\text{est}-\alpha_0}{\text{se}_0}$
small t                          R, Statistic-Parameter diagram [Figure 4]
expected number of cases, λ      λ̂ = x; X, number of cases ∼ Pn(λ) ≈ N(λ, λ) (λ > 10) [CC]
two samples for t₁ and t₂        $X_i \sim \mathrm{Pn}(\alpha_i t_i) \approx N(\alpha_i t_i,\ \alpha_i t_i)$; $\hat{\alpha}_i = \frac{x_i}{t_i}$.
large t confidence interval      est = α̂₁ − α̂₂, $\mathrm{se} = \sqrt{\frac{\hat{\alpha}_1}{t_1}+\frac{\hat{\alpha}_2}{t_2}}$; CI: est ± $z_{1-\frac{1}{2}\alpha}$ se;
large t test α₁ = α₂             est = α̂₁ − α̂₂, $\mathrm{se}_0 = \sqrt{\hat{\alpha}(\frac{1}{t_1}+\frac{1}{t_2})}$, $\hat{\alpha} = \frac{x_1+x_2}{t_1+t_2}$; HT: $z = \frac{\text{est}}{\text{se}_0}$
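A direct implementation of the two-sample rate comparison (counts and person-times illustrative only):

    x1 <- 30; t1 <- 1200    # 30 cases in 1200 person-years
    x2 <- 18; t2 <- 1400    # 18 cases in 1400 person-years
    a1 <- x1/t1; a2 <- x2/t2
    a  <- (x1 + x2)/(t1 + t2)                 # pooled rate under H0
    se0 <- sqrt(a*(1/t1 + 1/t2))
    z  <- (a1 - a2)/se0
    c(z = z, p.value = 2*pnorm(-abs(z)))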
χ² goodness of fit test          $u = \sum \frac{(o-e)^2}{e} \approx \chi^2_{k-\ell}$ (provided e > 5), where k = # classes, ℓ = # constraints
r×c contingency table            observed frequencies, o = f_ij
testing independence             expected frequencies $e = e_{ij} = \frac{f_{i\cdot}\, f_{\cdot j}}{n}$, where f_i· = row i sum, f·j = column j sum
                                 $u = \sum \frac{(o-e)^2}{e} \approx \chi^2_{(r-1)(c-1)}$ (provided e > 5); for a 2×2 table, $u \approx \chi^2_1$.
2×2 contingency table (a b / c d)   $z = \frac{\text{est}}{\text{se}_0} = \frac{(ad-bc)\sqrt{a+b+c+d}}{\sqrt{(a+b)(c+d)(a+c)(b+d)}}$; $r = \frac{z}{\sqrt{n}}$, $u = z^2$.
odds ratio, estimate and CI      $\hat{\theta} = \frac{ad}{bc}$; $\mathrm{se}(\ln\hat{\theta}) = \sqrt{\frac{1}{a}+\frac{1}{b}+\frac{1}{c}+\frac{1}{d}}$; 95% CI for ln θ: $\ln\hat{\theta} \pm 1.96\,\mathrm{se}(\ln\hat{\theta})$.
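In R, chisq.test gives u for a 2×2 table, and the odds-ratio CI can be computed from the formulas above (counts illustrative only):

    tab <- matrix(c(20, 10, 15, 35), nrow = 2, byrow = TRUE)
    chisq.test(tab, correct = FALSE)$statistic   # u = z^2
    a <- 20; b <- 10; c0 <- 15; d <- 35          # c0 used to avoid masking c()
    or <- (a*d)/(b*c0)                           # theta-hat = ad/bc
    se <- sqrt(1/a + 1/b + 1/c0 + 1/d)           # se(ln theta-hat)
    exp(log(or) + c(-1.96, 1.96)*se)             # 95% CI for the odds ratio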

Straight line regression         $Y_i \sim N(\alpha + \beta x_i,\ \sigma^2)$, (i = 1, 2, ..., n).
least squares estimates          $\hat{\beta} = \frac{r s_y}{s_x} = \frac{\sum(x-\bar{x})(y-\bar{y})}{\sum(x-\bar{x})^2}$; $\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}$
estimate of σ²                   $s^2 = \frac{1}{n-2}\sum(y_i-\hat{\alpha}-\hat{\beta}x_i)^2 = \frac{n-1}{n-2}(1-r^2)s_y^2 = \frac{1}{n-2}\Bigl(\sum(y-\bar{y})^2 - \frac{(\sum(x-\bar{x})(y-\bar{y}))^2}{\sum(x-\bar{x})^2}\Bigr)$
estimators                       $\bar{y} \sim N(\alpha+\beta\bar{x},\ \frac{\sigma^2}{n})$, $\hat{\beta} \sim N(\beta,\ \frac{\sigma^2}{K})$, where $K = \sum(x-\bar{x})^2$; ȳ, β̂ independent.
                                 $\hat{\mu}(x) = \bar{y} + (x-\bar{x})\hat{\beta} \sim N\bigl(\mu(x),\ c(x)\sigma^2\bigr)$, where $c(x) = \frac{1}{n} + \frac{(x-\bar{x})^2}{K}$
inference on β, µ̂(x), Y(x)       $\hat{\beta} \sim N(\beta,\ \frac{\sigma^2}{K})$, $\hat{\mu}(x) \sim N\bigl(\mu(x),\ c(x)\sigma^2\bigr)$, $Y(x) \sim N\bigl(\mu(x),\ \sigma^2\bigr)$;
                                 $\frac{\hat{\beta}-\beta}{S/\sqrt{K}} \sim t_{n-2}$; $\frac{\hat{\mu}(x)-\mu(x)}{S\sqrt{c(x)}} \sim t_{n-2}$; $\frac{Y(x)-\hat{\mu}(x)}{S\sqrt{1+c(x)}} \sim t_{n-2}$
CI for β; CI for µ(x); PI for Y(x)   $\hat{\beta} \pm c_{0.975}(t_{n-2})\,\frac{s}{\sqrt{K}}$; $\hat{\mu}(x) \pm c_{0.975}(t_{n-2})\,s\sqrt{c(x)}$; $\hat{\mu}(x) \pm c_{0.975}(t_{n-2})\,s\sqrt{1+c(x)}$
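All of these regression quantities are available in R via lm (data illustrative only):

    x <- c(1, 2, 3, 4, 5, 6, 7, 8)
    y <- c(2.1, 2.9, 3.8, 5.2, 5.9, 7.1, 7.8, 9.0)
    fit <- lm(y ~ x)
    coef(fit)                                    # alpha-hat and beta-hat
    confint(fit)["x", ]                          # 95% CI for beta
    new <- data.frame(x = 5.5)
    predict(fit, new, interval = "confidence")   # CI for mu(5.5)
    predict(fit, new, interval = "prediction")   # PI for Y(5.5)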

Correlation                      ρ (−1 ≤ ρ ≤ 1) (population); r (−1 ≤ r ≤ 1) (sample, estimate of ρ)
                                 $r = \frac{s_{xy}}{s_x s_y} = \frac{\sum(x-\bar{x})(y-\bar{y})}{\sqrt{\sum(x-\bar{x})^2\,\sum(y-\bar{y})^2}}$
inference and CI for ρ           Statistic-Parameter diagram [Figure 9]



Probability Distributions
1. Binomial distribution         X ∼ Bi(n, p) [n positive integer, 0 ≤ p ≤ 1]
pmf, p(x)                        $\binom{n}{x} p^x q^{n-x}$, x = 0, 1, 2, ..., n; p + q = 1 [Table 1]
physical interpretation          X = number of successes in n independent trials,
                                 each having probability p of success (Bernoulli trials)
E(X), var(X)                     np, npq
properties                       (1) if Z_i iidrvs ∼ Bi(1, p), then X = Z_1 + Z_2 + · · · + Z_n ∼ Bi(n, p)
                                 (2) X_1 ∼ Bi(n_1, p), X_2 ∼ Bi(n_2, p) indept ⇒ X_1 + X_2 ∼ Bi(n_1+n_2, p)
                                 (3) if n → ∞, p → 0, so that np → λ, then Bi(n, p) → Pn(λ)
                                 (4) if n → ∞, then Bi(n, p) ≈ N(np, npq) [np > 5, nq > 5], in which case:
                                     if X* ∼ N(np, npq), then Pr(X = k) ≈ Pr(k−0.5 < X* < k+0.5) [CC]

2. Poisson distribution          X ∼ Pn(λ) [λ > 0]
pmf, p(x)                        $\frac{e^{-\lambda}\lambda^x}{x!}$, (x = 0, 1, 2, ...) [Table 3]
Poisson process                  "events" occurring so that the probability that an "event" occurs
                                 in (t, t + δt) is αδt + o(δt), where α = rate of the process
physical interpretation          X = number of "events" in unit time of a Poisson process with rate λ.
E(X), var(X)                     λ, λ
properties                       (1) X_1 ∼ Pn(λ_1), X_2 ∼ Pn(λ_2) independent ⇒ X_1 + X_2 ∼ Pn(λ_1 + λ_2)
                                 (2) approximation to Bi(n, p) when n large, p small: λ = np.
                                 (3) if λ → ∞, then Pn(λ) ≈ N(λ, λ) [λ > 10], in which case:
                                     if X* ∼ N(λ, λ), then Pr(X = k) ≈ Pr(k−0.5 < X* < k+0.5) [CC]

3. Normal distribution           X ∼ N(µ, σ²) [σ > 0]
standard normal distribution     N(0, 1)
pdf, φ(x); cdf, Φ(x)             $\varphi(x) = \frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}x^2}$; $\Phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}t^2}\,dt$ [cdf: Table 5]
E(X), var(X)                     0, 1 [inverse cdf: Table 6]
general normal distribution, pdf, f(x)   $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{1}{2\sigma^2}(x-\mu)^2}$
physical interpretation          just about any variable obtained from a large number of components
                                 (by the central limit theorem)
E(X), var(X)                     µ, σ²
properties                       (1) if X ∼ N(µ, σ²), then a + bX ∼ N(a + bµ, b²σ²)
                                 (2) $Z = \frac{X-\mu}{\sigma} \sim N(0,1) \Leftrightarrow X = \mu + \sigma Z \sim N(\mu, \sigma^2)$; $c_q(X) = \mu + \sigma c_q(Z)$
                                 (3) X_1 ∼ N(µ_1, σ_1²), X_2 ∼ N(µ_2, σ_2²) indept ⇒ X_1 + X_2 ∼ N(µ_1+µ_2, σ_1²+σ_2²)

4. t distribution                X ∼ t_n [n = 1, 2, 3, ...]
definition                       if Z ∼ N(0, 1) and U ∼ χ²_n are independent, then $X = \frac{Z}{\sqrt{U/n}} \sim t_n$
pdf, f(x)                        $f(x) = \frac{1}{\sqrt{n\pi}}\,\frac{\Gamma(\frac{n+1}{2})}{\Gamma(\frac{n}{2})}\,\bigl(1+\frac{x^2}{n}\bigr)^{-\frac{n+1}{2}}$ [inverse cdf: Table 7]
E(X), var(X)                     0, $\frac{n}{n-2}$
comparison with standard normal  t_n has wider tails: var > 1; t_n → N(0, 1) as n → ∞: $\bigl(1+\frac{x^2}{n}\bigr)^{-\frac{n+1}{2}} \to e^{-\frac{1}{2}x^2}$

5. χ² distribution               X ∼ χ²_n [n = 1, 2, 3, ...]
definition                       if Z_1, Z_2, ..., Z_n iidrvs ∼ N(0, 1), then X = Z_1² + Z_2² + · · · + Z_n² ∼ χ²_n
pdf, f_X(x)                      $f_X(x) = \frac{e^{-\frac{1}{2}x}\,x^{\frac{1}{2}n-1}}{2^{\frac{1}{2}n}\,\Gamma(\frac{1}{2}n)}$ (x > 0) [inverse cdf: Table 8]
E(X), var(X)                     n, 2n
properties                       (1) X_1 ∼ χ²_m, X_2 ∼ χ²_n indept ⇒ X_1 + X_2 ∼ χ²_{m+n}
                                 (2) sample on N(µ, σ²): $\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$ ⇒ $E(S^2) = \sigma^2$, $\mathrm{var}(S^2) = \frac{2\sigma^4}{n-1}$
                                 (3) goodness of fit test: $\sum \frac{(o-e)^2}{e} \approx \chi^2_{k-p-1}$ (p = number of estimated parameters)

Index


abstraction ( ◦◦ ), 2, 74, 91, 92, 127 comparing distributions, 59
accept H0 , 148 outliers, 58
addition theorem, 72, 73 Busselton Health Study, 21
additivity
of means, 98 case-control study, 22, 23, 26, 82, 83, 196
of variance, for independent random variables, comparison with cohort study, 24
99 categorical data, 44
age-specific table, 8 causality, 29
Agresti’s approx CI cause and association, 31, 32
for α, 134 definition, 29
for p, 132 Hill’s criteria, 30
alternative hypothesis, 147–149 reverse, 32
animal experiments, 26 census, 24
approx CI central limit theorem, 112, 123
basic, 125 certain, 71, 72
for α, 133 chance, 71
for λ, 134 chartjunk, 40
at least one = not none, 86 checking, 45
checking normality, 138
balance, 18, 19, 176 chi-squared distribution, χ2
bar graph, 53 in R, 188
barchart, 52 inverse cdf table, 290
Bayes’ theorem, 80 chi-squared test statistic, 192
formula, 82 cholera data, 20
odds view, 85 clinical trial, 9, 11, 13, 21, 26
better approx CI coding, 45
for α, 134 coefficient of determination, 216
for p, 132 cohort, 12, 19, 134
binning, 54 closed, 21
binomial distribution, 102 open, 21
Bi(n, p), 103 cohort study, 19–21, 23, 26, 82
approximated by normal, 113 comparison with case-control, 24
parameter, testing, 159 combining estimates, 141
pmf table, 279 common cause, 32
pmf, graph, 103 community intervention trial, 14
pmf, in Tables, 103 comparative inference, 171
SP diagram, 283 comparing risks, 76
biostatistics, 5 comparison of proportions, 183
bivariate data, 46, 62, 201 summary, 186
C×N, 62 comparison of rates, 186
categorical data, 190 summary, 187
C×C, 62 complement, 72
N×N, 63 complementary event, A′ , 72
numerical data, 201 computers and data analysis, 42
bivariate normal distribution, 207 conditional odds, 79
blind study, 28, 184 conditional probability, 76
blinding, 18, 19 confidence interval, 125–127, 129, 147
block, 17, 19 0%, 128
boxplot, 58 and hypothesis test, 147, 150
inexact correspondence, 159


and prediction interval, 138 diagnostic testing, 83


for β, 214 individual, 85
for µ when σ unknown, 135, 136 diagrams, 43
for µ(x), 214 discrete numerical data, 44
for m, 163 disease, 9, 11, 19, 24, 45, 75, 78, 80, 82, 83, 87, 107–109,
for p, 131 179, 184
exact, 131 comparing rates of, 186
for p1 −p2 , 184 diagnostic tests for, 83
for correlation, 207 event, 12
for incidence rate, 163 occurrence, 11, 24, 105, 108
for mean difference, 172, 175 outcomes, 2, 74
for odds ratio, 194 distribution, 91
level, 128 dotchart, 43, 52
not a prediction interval, 137 double-blind, 18, 29
realisation of random interval, 128
recommended, 154 ecological study, 26
confounding, 6, 28 empirical cdf, 95
confounding factor, 33 encoding/decoding paradigm, 40
confounding variable, 6, 15, 16 epidemiological study, types of, 11
contingency table, 190 epidemiology, 5, 24
2×2, 191 error
r×c, 195 and residuals, 212
analysis in R, 192 mean-square, MS, 213
continuity correction, 111, 114 sampling, 25
continuous numerical data, 44 sum of squares, residual SS, 213
continuous random variables, Pr(X = x) = 0, 94 type I and type II, 149
contrapositive, 148 estimate, 11, 122, 123, 125, 126, 141, 142
control, 19 from QQ-plot, 139
control group, 15, 196 of odds ratio, 83
correction for continuity, 111, 114 optimal combination, 141
for p-values, 160 estimation, 125
correlation, 64, 204 of α and β in regression, 210
and relationship, 65 of µ1 −µ2 , 174
coefficient, 193, 204 σ1 & σ2 known, 175
coefficient, population (ρ), 207 σ1 & σ2 unknown and unequal, 180
for association, 203 σ1 & σ2 unknown but equal, 177
in R, 204 of normal µ
SP diagram, 291 when σ is unknown, 135
counterfactual model, 30 when σ known, 127
coverage, of confidence interval, 131 of population proportion p, 129
cross-sectional study, 24, 26 of population rate α, 133
cumulative distribution function, cdf, 47, 94 estimator, 126
connection to pdf & pmf, 95 of σ, 135
properties, 95 of p, 130
cumulative frequency, 56 of rate α, 133
distribution, 47 exact CI
cumulative relative frequency, 89 for α, 134
function, 56 for λ, 134
inverse, 57 for p, 131
expectation
data, 2, 39 = mean, 97
data analysis, 1, 39 additivity, 98
and computers, 42 properties, 98
data distribution, 52 expected frequency, 188, 191, 192, 195
data presentation, 40 expected number of cases, 163
principles, 44 experiment, 11, 14–17, 21, 26–28, 39, 176, 186
data-ink ratio, 40 experimental design, 14
deciles, 97 experimental study, 11, 27, 41
degrees of freedom, 135, 178, 192, 214 explanatory variable, 11, 15, 16, 45
descriptive statistics, 46 exposed, 1, 19–21, 82, 187
designed experiment, 28, 41 exposure, 1, 11, 15, 16, 19, 22, 24, 27, 78, 80, 171
diagnostic test result, 80

false negative, 83, 84 law of total probability, 80


false positive, 83, 84 formula, 82
field trial, 14 least squares, 210
first quartile, 49 legal system and hypothesis testing, 149
Fisher, 16 line-plot, 63
fitting distributions, 190 linear combinations, 115
five-number summary, 58 mean and variance, 115
Framingham Heart Study, 21 optimal, 141
frequency distribution, 47 optimal weights, 142
frequency, freq, 53 Literary Digest poll, 26
location, 58, 99, 100, 122
general population, 1 log transformation, 46, 61
goodness-of-fit test logistic transformation, 46
chi-squared, 187 logit transformation = log-odds, 46
properties, 188 lognormal distribution, 118
statistic, 147, 192 long-term relative frequency, 74
graph of normal power function, 156 longitudinal, 24
graphical representation, 52 lower quartile, 49
gridlines, 44 LTP = law of total probability, 80
lurking variable, 9, 15
half-width, 129
hat notation, 50 margin of error, 129
Hill’s criteria of causal association, 30 matched pair, 173
histogram, 54 matched samples, 22
hypothesis, 148 mean, 97, 101
hypothesis testing, 125, 147 additivity, 98
and confidence interval, 147, 150 inference on, 123
inexact correspondence, 159 of a sum, 98
and the legal system, 149 properties, 98
for m=m0 , sign test, 162 waiting time, 108
for p1 =p2 , 184 mean, median, mode, 101
for normal populations, 154 measures of location, sample, 48
for rate α=α0 , 163 median, 96, 101
logic, 148 meta analysis, 142
procedure, 150 mode, 101
model, 71, 121
implausible values, lead to rejection, 151 population-data, 2
impossible, 71, 72 straight-line regression, 209
in vitro experiments, 26 modelling, 74
incidence, 75, 107 moment statistics, 47
annual, 108 MRFIT, 14
incidence rate, 11, 19, 105, 107–109, 133, 163 multiplication rule
independence, 86 for independent events, 86
not the same as mutually exclusive, 86 for probability, 78
independent events, 78 multiplication rule, for independent events, 86
independent samples, 171, 174 mutually exclusive, 72
difference between samples, 172
independent trials, 87, 102 natural experiment, 20, 21, 26
inference on λ, 134 negative predictive value, npv, 83, 84
information, reciprocal of variance, 142 negative relationship
integer-valued variable, 114 between events, 79
interquartile range, 101 between variables, 64
intersection, 72, 73 negatively related events, 78
interval data, 44 no relationship between events = independence, 86
interval estimate, 126, 127 no-effect hypothesis, 148
intervention, 15 non-representative sample, 82, 84
inverse cdf, 96 normal distribution, 110
inverse cumulative relative frequency function, 57 additivity of, 115
and the central limit theorem, 113
John Snow, 20 as limit distribution, 113
checking, 138, 139
Lanarkshire milk experiment, 7 five-number summary, 112

inverse cdf, 111 Poisson, pmf table, 284


inverse cdf table, 288 pooled mean, 178
quantiles, 111 pooled variance, 178
standard, 110 population, 2, 71, 121
standard cdf table, 288 population proportion, 74, 130
normal scores, 139 testing, 159
null hypothesis, H0 , 147–149 population rate, testing, 163
numerical data, 44 positive predictive value, 84, 85
positive predictive value, ppv, 83, 84
observational study, 11, 27, 39, 41, 45 positive relationship
observations, 71, 121 between events, 79
observed frequency, 188, 192 between variables, 64
odds, 75 positively related events, 78
against, 75 positively skew, 252
conditional, 79 power, 149
on, 75 evaluation of, 155
odds ratio, 76, 79, 82, 83, 194 function, 150, 156
and risk ratio, 88 increases with sample size, 150
confidence interval for, 194 precision, 15, 17, 125
measures relationship, 79 prediction interval, 137, 138
odds view of Bayes’ theorem, 85 and confidence interval, 138
one-sided, 147 for Y given x, 215
optimal estimate for Y (x), 215
table for, 143 in R, 215
variance, 142 straight-line regression, 215
weight, 142 prediction, regression used for, 203
optimal estimator, 141 prevalence, 11, 20, 24, 75, 83–85, 87
optimal linear combination, 141 probability, 1, 71, 121
optimal weights, for linear combinations, 142 addition theorem, 72
ordering, 43, 45 as area, 93
ordinal data, 44 assigning values to, 73
conditional, 76
p-value, 151 multiplication rule, 78
alternative definition, 161 properties, 72
for t-test, 158 probability density function, pdf, 94
for z-test, 154 properties, 94
paired comparison, 171, 173 probability distribution, 91
sample of differences, 172 description of, 100
paired samples, 173 of the sample mean, 122
parameter, 122 probability interval, 104, 127, 129–131
parametric hypothesis, 148 for the sample mean, 123
pdf, connection to cdf, 95 probability mass function, pmf, 94
percentiles, 97 properties, 94
person-years, 107, 108, 133, 134, 163, 187 probability model, 74, 87, 102
Physicians’ health study, 12 probability table, 72, 73, 83
placebo, 13 and contingency table, 73
plausible values, 125, 127, 129, 147, 152 LTP and Bayes, 81
pmf, connection to cdf, 95 prospective, 24, 27, 104
point estimate, 126
0% confidence interval, 128 QQ-plot, 139
of p, 131 checking normality, 139
Poisson distribution, 105 for residuals, 213
additivity, 109 quantiles, 47, 96, 97
approximated by normal, 113 sample, 57
for inference on population rate, 133 quantitative statement, 41
in R, 105 quartiles, 96
mean and variance, 106
pmf, 105 random digits, 9
SP diagram, 287 random error, 209
Poisson process, 105 random experiment, 87
analogue of independent trials, 105 random number, 99
rate of, 105 random procedure, 91

random sampling, 121 from grouped data, 51


random variable, 91 sampling, 121
continuous, 93 frame, 25
discrete, 93 from a normal population, 127
randomisation, 9, 12, 16, 17, 19, 24, 28 units, 25
randomised controlled trial (RCT), 13 sampling distribution, 121
ratio data, 44 of P̂ , 130
realisation, 122, 126, 127, 142, 209 scatter diagram, 63, 201
of random variable, 121 scatter plot, 63, 64, 201, 210
regression and correlation, 65
for prediction, 203 scale change, 66
straight-line, 208 scaling, 65
reject H0 , 148 se0, standard error, assuming H0 true, 184
relationship, 31 self-pairing, 173
between events, 193 sensitivity, sn, 83–85
between variables, 201 sign test, hypothesis test for m=m0 , 162
diagram, 31 significance level, α, 149, 152
relative risk, 79 significant, 152
replication, 18, 19, 28 significant figures, 46
reporting test results, 154 simple hypothesis, 148
residual sum of squares, error SS, 213 simulation, 95
residuals, 212 single-blind, 18
and errors, 212 skeptical statistician, 27
properties, 213 skewness, 58, 60, 101
response variable, 9, 11, 15, 16, 19, 45, 173, 209 source population, 22, 24
retrospective, 23, 24, 27 specificity, sp, 83–85
reverse causation, 32 spread, 58, 99–101, 122
rigging, test for, 189 measures of, 101
risk, 8, 11, 22, 74, 75, 108, 109, 186 standard deviation, 98, 99
risk difference, 76 non-additivity, 99
risk ratio, 76, 79, 82, 83 standard error, se, 125, 126
and odds ratio, 88 assuming H0 true, se0 , 184
ROC curve, 89 standard normal distribution, 110
cdf, in Tables, 110
Salk vaccine trial, 14 inverse cdf, in Tables, 111
sample, 71, 121 standardisation theorem, 110
sample correlation coefficient, 207 standardised incidence ratio (SIR), 165
sample density, 54 standardised mortality ratio (SMR), 165
sample distribution, 52 statistic, 122
sample interquartile range, 51, 122 statistic-parameter diagram, 129
sample maximum, 49 for correlation, in Tables, 207
sample mean, 48, 49, 122–128, 150, 151, 153, 154, 158, statistical graphics, 40
172 statistical hypothesis, 148
= average, 137 statistical inference, 2, 125
asymptotic normality, 123 statistically significant, 152
distribution, 122 statistics, 1, 2, 5
from grouped data, 48 stem-and-leaf plot, 59
mean and variance, 123 straight line, 65
sample measures of location, 48 easiest curve to fit, 138
sample median, 48, 49, 112, 122, 172 straight-line regression, 208
sample minimum, 49 assumptions, 209
sample mode, 49 hypothesis testing, 215
sample proportion, 130 inference, 212
as a sample mean, 130 interpretations, 209
mean and variance of, 130 model, 209
sample quantiles, 49, 57, 138 statistics for hand-computation, 210
sample size determination, 129, 132, 157 utility test, 215
for specified power, 158 stratification, 8, 9, 17
independent samples, 176 stratified population, 109
sample space, 91 study design, 1
sample standard deviation, 51, 122, 137, 158 study types, 11, 24, 26
sample variance, 51 value, 26

subjective probability, 74
success
and failure, 102
in independent trials, 87
survey, 24, 39, 132, 153
symmetry, 73

t distribution, 135
in R, 136
inverse cdf in Tables, 136
inverse cdf table, 289
t-test, for µ=µ0 , 158
tables, 42
tabular presentation, 42
tea-tasting experiment, 16
test for independence of variables, 207
test reporting, 154
test statistic, 148, 149
testing
binomial parameter, 159
population proportion, 159
population rate, 163
with discrete variables, 165
third quartile, 49
time-line diagram, 27
transformations, 46
treatment, 15–19, 27, 39, 171–173, 176, 177, 186, 191
trimmed mean, 49
two-sided, 147
type I error, 149
type II error, 149
types of error, 149
types of study, 11
types of variable, 44

unexposed, 1, 19
union, 72, 73
univariate data, 46
universe reduction, 76
unrelated events, 78
upper quartile, 49
utility test for straight-line regression, 215

validity, 15
variable types, 44
variance, 98
additivity, 99
properties, 99
Venn diagram, 72

Whickham study, 8

z-test, 152
approximate, 159
for µ=µ0 , 154
