Probability Theory
and Statistics
for Psychology
and
Quantitative Methods for
Human Sciences
David Steinsaltz¹
University of Oxford
(Lectures 1–8 based on earlier version by Jonathan Marchini)
¹ University lecturer at the Department of Statistics, University of Oxford
Contents
1 Describing Data
    1.1 Example: Designing experiments
    1.2 Variables
        1.2.1 Types of variables
        1.2.2 Ambiguous data types
    1.3 Plotting Data
        1.3.1 Bar Charts
        1.3.2 Histograms
        1.3.3 Cumulative and Relative Cumulative Frequency Plots and Curves
        1.3.4 Dot plot
        1.3.5 Scatter Plots
        1.3.6 Box Plots
    1.4 Summary Measures
        1.4.1 Measures of location (Measuring the center point)
        1.4.2 Measures of dispersion (Measuring the spread)
    1.5 Box Plots
    1.6 Appendix
        1.6.1 Mathematical notation for variables and samples
        1.6.2 Summation notation
2 Probability I
    2.1 Why do we need to learn about probability?
    2.2 What is probability?
        2.2.1 Definitions
        2.2.2 Calculating simple probabilities
        2.2.3 Example 2.3 continued
        2.2.4 Intersection
        2.2.5 Union
        2.2.6 Complement
    2.3 Probability in more general settings
        2.3.1 Probability Axioms (Building Blocks)
        2.3.2 Complement Law
        2.3.3 Addition Law (Union)
3 Probability II
    3.1 Independence and the Multiplication Law
    3.2 Conditional Probability Laws
        3.2.1 Independence of Events
        3.2.2 The Partition law
    3.3 Bayes' Rule
    3.4 Probability Laws
    3.5 Permutations and Combinations (Probabilities of patterns)
        3.5.1 Permutations of n objects
        3.5.2 Permutations of r objects from n
        3.5.3 Combinations of r objects from n
    3.6 Worked Examples
Describing Data
Uncertain knowledge
+ knowledge about the extent of uncertainty in it
= Useable knowledge
C. R. Rao, statistician
As we know, there are known knowns. There are things we know we know.
We also know there are known unknowns. That is to say, we know there are
some things we do not know. But there are also unknown unknowns, the
ones we don't know we don't know.
Donald Rumsfeld, US Secretary of Defense
• How many?
In the original study, the authors had six infants in the treatment group
(the formal name for the ones who received the exercise — also called the
experimental group), and six in the control group. (In fact, they had a
second control group, that was subject to an alternative exercise regime. But
that’s a complication for a later date.) The results are tabulated in Table
1.1. We see that most of the treatment children did start walking earlier
than most of the control children. But not all. The slowest child from the
treatment group in fact started walking later than four of the six control
children. Should we still be convinced that the treatment is effective? If
not, how many more subjects do we need before we can be confident? How
would we decide?
Table 1.1: Age (in months) at which infants were first able to walk indepen-
dently. Data from [ZZK72].
The answer is, we can’t know for sure. The results are consistent with
believing that the treatment had an effect, but they are also consistent with
believing that we happened to get a particularly slow group of treatment
children, or a fast group of control children, purely by chance. What we
need now is a formal way of looking at these results, to tell us how to decide
how to draw conclusions from data — “The exercise helped children walk
sooner” — and how properly to estimate the confidence we should have in
our conclusions — How likely is it that we might have seen a similar result
purely by chance, if the exercise did not help? We will use graphical tools,
mathematical tools, and logical tools.
1.2 Variables
The datasets that Psychologists and Human Scientists collect will usually
consist of one or more observations on one or more “variables”.
Qualitative:
- Binary: Smoking (yes/no), Sex (M/F), place of birth (home/hospital)
- Non-binary: hair colour, ethnicity, cause of death
Quantitative:
- Discrete (counts)
- Continuous

Figure 1.1: A summary of the different data types with some examples.
Continuous Data
Forty-four babies (a new record) were born in one 24-hour period at the
Mater Mothers’ Hospital in Brisbane, Queensland, Australia, on December
18, 1997. For each of the 44 babies, The Sunday Mail recorded the time
of birth, the sex of the child, and the birth weight in grams. The data are
shown in Table 1.2, and will be referred to as the “Baby boom dataset.”
While we did not collect this dataset based on a specific hypothesis, if we
wished we could use it to answer several questions of interest. For example,
• Are these observations consistent with boys and girls being equally
likely?
These are all questions that you will be able to test formally by the end of
this course. First though we can plot the data to view what the data might
be telling us about these questions.
[Figure: bar chart, Frequency (0 to 24) against the categories Girl and Boy.]
Figure 1.3: A Bar Chart showing the gender distribution in the Baby-boom
dataset.
1.3.2 Histograms
An analogy
• Draw a bar for each category whose height represents the count in
that category.
For the baby-boom dataset we can draw a histogram of the birth weights
(Figure 1.4). To draw the histogram I found the smallest and largest values
smallest = 1745 largest = 4162
There are only 44 weights so it seems reasonable to take 6 bins
Using these categories works well: the histogram shows us the shape of the
distribution, and we notice that the distribution has an extended left 'tail'.
[Figure: histogram, Frequency (0 to 20) against birth weight.]
Figure 1.4: A Histogram showing the birth weight distribution in the Baby-boom
dataset.
Too few categories and the details are lost. Too many categories and the
overall shape is obscured by haphazard details (see Figure 1.5).
In Figure 1.6 we show some examples of the different shapes that histograms
can take. One can learn quite a lot about a set of data by looking
just at the shape of the histogram. For example, Figure 1.6(c) shows the
percentage of the tuberculosis drug isoniazid that is acetylated in the livers
of 323 patients after 8 hours. Unacetylated isoniazid remains longer in the
blood, and can contribute to toxic side effects. It is interesting, then, to
[Figure: two histograms of the birth weights, Frequency against weight (1500 to 4500 g).]
Figure 1.5: Histograms with too few and too many categories respectively.
notice that there is a wide range of rates of acetylation, from patients who
acetylate almost all, to those who acetylate barely one fourth of the drug in
8 hours. Note that there are two peaks — this kind of distribution is called
bimodal — which points to the fact that there is a subpopulation who lacks
a functioning copy of the relevant gene for efficiently carrying through this
reaction.
So far, we have taken the bins to all have the same width. Sometimes
we might choose to have unequal bins, and more often we may be forced to
have unequal bins by the way the data are delivered. For instance, suppose
we did not have the full table of data, but were only presented with the
following table: What is the best way to make a histogram from these data?
We could just plot rectangles whose heights are the frequencies. We then
end up with the picture in Figure 1.7(a). Notice that the shape has changed
substantially, owing to the large boxes that correspond to the widened bins.
In order to preserve the shape — which is the main goal of a histogram —
we want the area of a box to correspond to the contents of the bin, rather
than the height. Of course, this is the same when the bin widths are equal.
Otherwise, we need to switch from the frequency scale to density scale,
[Figure 1.6: Examples of histogram shapes. (a) Left-skewed: Weights from
Babyboom data set. (b) Right-skewed: 1999 California household incomes.
(From www.census.gov.) (c) Bimodal: Percentage of isoniazid acetylated in
8 hours. (d) Bell shaped: Serum cholesterol of 10-year-old boys [Abr78].]

[Figure 1.7: Same data plotted in frequency scale and density scale. Note
that the density scale histogram has the same shape as the plot from the
data with equal bin widths.]
in which the height of a box is not the number of observations in the bin,
but the number of observations per unit of measurement. This gives us the
picture in Figure 1.7(b), which has a very similar shape to the histogram
with equal bin-widths.
Thus, for the data in Table 1.3 we would calculate the height of the first
rectangle as

Density = (Number of births) / (width of bin) = (5 babies) / (1000 g) = 0.005 babies per gram.

The complete computations are given in Table 1.4, and the resulting
histogram is in Figure 1.7(b).
• For each bin, compute the density, which is simply the number of
observations divided by the width of the bin;
• Draw a bar for each bin whose height represents the density in that bin.
The area of the bar will then correspond to the number of observations in
the bin.
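These two steps can be sketched in code. The bin edges and counts below are illustrative stand-ins; only the first bin (5 births in a 1000 g wide bin) comes from the worked example in the text.

```python
# Density-scale histogram: for unequal bins the bar height is the density
# (observations per unit of measurement), so the bar AREA equals the count.
# Bin edges and counts are illustrative stand-ins, except the first bin
# (5 births in a 1000 g wide bin), which matches the worked example above.
bin_edges = [1500, 2500, 3000, 3500, 4000, 4500]   # grams; first bin is wider
counts = [5, 9, 14, 11, 5]                         # hypothetical counts, 44 births

widths = [b - a for a, b in zip(bin_edges, bin_edges[1:])]
densities = [c / w for c, w in zip(counts, widths)]

print(densities[0])   # 0.005 babies per gram, as in the text
# Check: the total area recovers the total number of observations.
print(sum(d * w for d, w in zip(densities, widths)))  # 44.0
```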
Consider the histogram of birth weight shown in Figure 1.4. The frequencies,
cumulative frequencies and relative cumulative frequencies of the intervals
are given in Table 1.5.
[Figure: Cumulative Frequency (0 to 50) against birth weight.]
Figure 1.8: Cumulative frequency curve and plot of birth weights for the
baby-boom dataset.
a dot plot of birth weights grouped by gender for the baby-boom dataset.
The plot suggests that girls may be lighter than boys at birth.
[Figure: dot plot of birth weight, one row per gender (Girl, Boy).]
Figure 1.9: A Dot Plot showing the birth weights grouped by gender for the
baby-boom dataset.
Scatter plots are useful when we wish to visualise the relationship between
two measurement variables.
For example, we can draw a scatter plot to examine the relationship between
birth weight and time of birth (Figure 1.10). The plot suggests that there
is little relationship between birth weight and time of birth.
Figure 1.10: A Scatter Plot of birth weights versus time of birth for the
baby-boom dataset.
that the distributions seem to have roughly the same center but that the
data plotted in the third are more spread out than in the first. Obviously,
comparing the second and the third we observe differences in both the center
and the spread of the distribution.
While it is straightforward to compare two distributions “by eye”, plac-
ing the two histograms next to each other, it is clear that this would be
difficult to do with ten or a hundred different distributions. For example,
Figure 1.6(b) shows a histogram of 1999 incomes in California. Suppose we
wanted to compare incomes among the 50 US states, or see how incomes
developed annually from 1980 to 2000, or compare these data to incomes in
20 other industrialised countries. Laying out the histograms and comparing
them would not be very practical.
Instead, we would like to have single numbers that measure
• the ‘center’ point of the data.
learn how to go a stage further and ‘test’ whether two variables have the
same center point.
• The Mode of a set of numbers is simply the most common value e.g.
the mode of the following set of numbers
1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 5, 5, 6, 6, 7, 8, 10, 13
we see that the mode is the peak of the distribution and is a reasonable
representation of the center of the data. If we wish to calculate the
mode of continuous data one strategy is to group the data into adjacent
intervals and choose the modal interval i.e. draw a histogram and take
the modal peak. This method is sensitive to the choice of intervals
and so care should be taken so that the histogram provides a good
representation of the shape of the distribution.
The Mode has the advantage that it is always a score that actually
occurred and can be applied to nominal data, properties not shared
by the median and mean. A disadvantage of the mode is that there
may be two or more values that share the largest frequency. In the case
of two modes we would report both and refer to the distribution as
bimodal.
• The Median can be thought of as the ‘middle’ value i.e. the value for
which 50% of the data fall below when arranged in numerical order.
For example, consider the numbers
15, 3, 9, 21, 1, 8, 4,
1, 3, 4, 8 , 9, 15, 21
1, 3, 4, 8 , 9, 15, 99999
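The point of the last two lists is that the extreme value 99999 does not move the median: the middle value is 8 either way. A quick check with Python's standard library (an illustration, not part of the notes):

```python
from statistics import median

data = [15, 3, 9, 21, 1, 8, 4]
print(sorted(data))   # [1, 3, 4, 8, 9, 15, 21]
print(median(data))   # 8, the middle (4th of 7) value

# Replacing the largest value with a huge outlier leaves the median unchanged.
print(median([1, 3, 4, 8, 9, 15, 99999]))   # still 8
```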
See the appendix for a brief description of the summation notation (Σ).
1, 3, 4, 8, 7, 15, 99999
Data (x):      1  2  3  4  5  6
Frequency (f): 2  4  6  7  4  1

(2 × 1) + (4 × 2) + (6 × 3) + (7 × 4) + (4 × 5) + (1 × 6) = 82
2 + 4 + 6 + 7 + 4 + 1 = 24
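The same calculation as a short sketch (mine, not from the notes): the mean of data given in a frequency table is the weighted sum divided by the total frequency, here 82/24.

```python
# Mean from a frequency table: sum of (frequency × value) over total frequency.
values = [1, 2, 3, 4, 5, 6]
freqs = [2, 4, 6, 7, 4, 1]

weighted_sum = sum(f * x for f, x in zip(freqs, values))  # 82, as above
n = sum(freqs)                                            # 24 observations
mean = weighted_sum / n

print(weighted_sum, n, round(mean, 4))  # 82 24 3.4167
```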
• If the
The mid-range
There is actually a fourth measure of location that can be used (but rarely
is). The Mid-Range of a set of data is half way between the smallest
and largest observation i.e. half the range of the data. For example, the
mid-range of
1, 3, 4, 8, 9, 15, 21
is (1 + 21) / 2 = 11. The mid-range is rarely used because it is not resistant
to outliers and by using only 2 observations from the dataset it takes no
account of how spread out most of the data are.
[Figure: three distributions: a symmetric one, a positively skewed one, and
a negatively skewed one, each with the mean, median and mode marked.]
Figure 1.12: The relationship between the mean, median and mode.
• Calculate the 25% point (1st quartile) of the dataset. The location
of the 1st quartile is defined to be the (N + 1)/4 th data point.
• Calculate the 75% point (3rd quartile) of the dataset. The location
of the 3rd quartile is defined to be the 3(N + 1)/4 th data point².
⇒ IQR = 80 - 18 = 62
⇒ SIQR = 62 / 2 = 31.
10, 15, 18, 33, 34, 36, 51, 73, 80, 86.
What is the 1st quartile? We're now looking for the (10 + 1)/4 = 2.75 th data
point. This should be 3/4 of the way from the 2nd data point to the 3rd.
The distance from 15 to 18 is 3. 1/4 of the way is 0.75, and 3/4 of the way
is 2.25. So the 1st quartile is 15 + 2.25 = 17.25.
² The 2nd quartile is the 50% point of the dataset, i.e. the median.
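The (N + 1)/4 rule above can be written as a small function. This is a sketch of the notes' interpolation rule, not a library routine (NumPy and others use slightly different quantile conventions):

```python
# Quartiles by the (N+1)/4 rule used in the notes: the k-th quartile sits at
# position k(N+1)/4 in the sorted data, interpolating between neighbours.
def quartile(sorted_data, k):
    pos = k * (len(sorted_data) + 1) / 4          # 1-based position
    lo = int(pos)                                  # data point just below
    frac = pos - lo                                # how far towards the next one
    if frac == 0 or lo >= len(sorted_data):
        return sorted_data[min(lo, len(sorted_data)) - 1]
    return sorted_data[lo - 1] + frac * (sorted_data[lo] - sorted_data[lo - 1])

data = [10, 15, 18, 33, 34, 36, 51, 73, 80, 86]   # N = 10, already sorted
print(quartile(data, 1))   # position 2.75 -> 15 + 0.75 * (18 - 15) = 17.25
```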
[Figure: histogram with the 25% and 75% points marked and the IQR
indicated between them.]
10, 15, 18, 33, 34, 36, 51, 73, 80, 86, 92
At first this sounds like a good way of assessing the spread since you might
think that large spread gives rise to larger deviations and thus a larger
mean deviation. In fact, though, the mean deviation is always zero. The
positive and negative deviations cancel each other out exactly. Even so, the
deviations still contain useful information about the spread; we just have to
find a way of using the deviations in a sensible way.
We calculate the MAD in the following way (see Table 1.7 for an example).
From Table 1.7 we see that the sum of the absolute deviations of the numbers
in Example 1 is 284, so

MAD = 284 / 11 = 25.818 (to 3 dp)
Another way of measuring the spread is to consider the mean of the squared
deviations, called the variance.
From Table 1.7 we see that the sum of the squared deviations of the numbers
in Example 1 is 9036, so

s² = 9036 / (11 − 1) = 903.6

(Note that we divide by n − 1 = 10 rather than n = 11; this is the usual
convention for the sample variance.)
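Assuming "Example 1" refers to the eleven numbers listed earlier (10, 15, 18, 33, 34, 36, 51, 73, 80, 86, 92), both calculations can be checked directly:

```python
# Check of the MAD and variance computations, assuming "Example 1" is the
# list of eleven numbers given earlier in this section.
data = [10, 15, 18, 33, 34, 36, 51, 73, 80, 86, 92]
n = len(data)
mean = sum(data) / n                      # 528 / 11 = 48.0

abs_devs = [abs(x - mean) for x in data]
sq_devs = [(x - mean) ** 2 for x in data]

print(sum(abs_devs))                      # 284.0, as in Table 1.7
print(round(sum(abs_devs) / n, 3))        # MAD = 25.818
print(sum(sq_devs))                       # 9036.0
print(sum(sq_devs) / (n - 1))             # s^2 = 903.6
```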
• All points that lie outside the whiskers are plotted individually as
outlying observations.
[Figure: a box plot of the birth weights (roughly 2000 to 4000 g) with its
parts labelled: Upper Whisker, 3rd quartile, Median, 1st quartile, Lower
Whisker, Outliers.]
Figure 1.15: A Box Plot of birth weights for the baby-boom dataset showing
the main points of the plot.
[Figure: side-by-side box plots of birth weight for Girls and Boys.]
Figure 1.16: A Box Plot of birth weights by gender for the baby-boom
dataset.
[Figure: plot with Day of Week on the horizontal axis.]
1.6 Appendix
1.6.1 Mathematical notation for variables and samples
Mathematicians are lazy. They can’t be bothered to write everything out
in full so they have invented a language/notation in which they can express
what they mean in a compact, quick to write down fashion. This is a good
thing. We don’t have to study maths every day to be able to use a bit of the
language and make our lives easier. For example, suppose we are interested
in comparing the resting heart rate of 1st year Psychology and Human Sci-
ences students. Rather than keep on referring to variables ‘the resting heart
rate of 1st year Psychology students’ and ‘the resting heart rate of 1st year
x1 = 3 x2 = 6 x3 = 1 x4 = 7 x5 = 6
If the limits of the summation are obvious within context, the notation
is often abbreviated to

Σx = 23
Lecture 2
Probability I
• what probability is
[Figure: flow diagram: a hypothesis about a population leads to a designed
experiment; a sample is taken, and the results are examined with a
statistical test.]
Figure 2.1: The scientific process and role of statistics in this process.
                Treatment (anturane)    Control (placebo)
# patients      813                     816
deaths          74                      89
% mortality     9.1%                    10.9%
1. How to enumerate all the ways the coins could come up.
How many ways are there? The number depends on the
exact procedure, but if we flip one coin for each patient, the
number of cards in the box would be 2^1629, which is vastly
When we toss the die there are six possible outcomes i.e. 1,
2, 3, 4, 5 and 6. We say that the sample space of our experi-
ment is the set S = {1, 2, 3, 4, 5, 6}.
The outcome "the top face shows a three" is the sample point 3.
The event A1, that the die shows an even number, is the subset
A1 = {2, 4, 6} of the sample space.
2.2.1 Definitions
The example above introduced some terminology that we will use repeatedly
when we talk about probabilities.
The set of all possible outcomes of the experiment is called the sample
space.
In some settings (like the example of the fair die considered above) it is
natural to assume that all the sample points are equally likely.
In this case, we can calculate the probability of an event A as

P(A) = |A| / |S|,
P(A1) = |A1| / |S| = 3/6 = 1/2

and, writing A2 for the event that the die shows a number larger than 4,

P(A2) = |A2| / |S| = 2/6 = 1/3
2.2.4 Intersection
What about P (face is even, and larger than 4) ?
A1 ∩ A2 = {6}  ⇒  P(A1 ∩ A2) = |A1 ∩ A2| / |S| = 1/6
2.2.5 Union
A1 ∪ A2 = {2, 4, 5, 6}  ⇒  P(A1 ∪ A2) = |A1 ∪ A2| / |S| = 4/6 = 2/3
2.2.6 Complement
A1ᶜ = {1, 3, 5}  ⇒  P(A1ᶜ) = |A1ᶜ| / |S| = 3/6 = 1/2
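All of the die calculations in this section can be reproduced by counting sample points. A short sketch (not part of the notes) using Python sets, with A1 the even faces and A2 the faces larger than 4:

```python
# Equally likely outcomes: P(A) = |A| / |S|.
S = {1, 2, 3, 4, 5, 6}                 # sample space of one die roll
A1 = {2, 4, 6}                         # even
A2 = {5, 6}                            # larger than 4

def prob(event):
    return len(event & S) / len(S)

print(prob(A1))            # 3/6 = 0.5
print(prob(A1 & A2))       # intersection {6}: 1/6
print(prob(A1 | A2))       # union {2, 4, 5, 6}: 4/6
print(prob(S - A1))        # complement {1, 3, 5}: 3/6 = 0.5
```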
(ii). P(S) = 1.
This axiom says that the probability of everything in the sample space
is 1. It says that the sample space is complete: there are no sample
points outside the sample space that can occur in our experiment.
The rule is

P(Aᶜ) = 1 − P(A)    (Law 1)
A = The event that a randomly selected student from the class has brown eyes
B = The event that a randomly selected student from the class has blue eyes
What is the probability that a student has brown eyes OR blue eyes?
This is the union of the two events A and B, denoted A∪B (pronounced ‘A
or B’)
Write A for the event that the SNP is variable in Africa, and
B for the event that it is variable in Asia. We are told
P (A) = 0.7
P (B) = 0.8
P (A ∩ B) = 0.6.
P (A ∪ B) = P (A) + P (B) − P (A ∩ B)
= 0.7 + 0.8 − 0.6
= 0.9.
Lecture 3
Probability II
There are 36 points in the sample space. These are all equally
likely. Thus, each point has probability 1/36. Consider the
events
Thus, two (or more) coin flips are always independent. But this is also
relevant to analysing experiments such as those of Example 2.1. If the
drug has no effect on survival, then events like {patient # 612 survived}
are independent of events like {patient # 612 was allocated to the control
group}.
Then we see from Figure 3.2 that P(C) = 10/36, and P(A ∩ C) =
6/36 ≠ 10/36 × 1/2. On the other hand, if we replace C by
D = {Sum of the two rolls is exactly 9}, then we
see from Figure 3.3 that P(D) = 4/36 = 1/9, and P(A ∩ D) =
2/36 = 1/9 × 1/2, so the events A and D are independent. We
see that events may be independent, even if they are not based
on separate experiments.
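This can be checked by brute-force enumeration. The event definitions below are inferred from the probabilities quoted in the text (A: the first roll is even; C: the sum is at least 9; D: the sum is exactly 9):

```python
from fractions import Fraction

# Enumerate all 36 equally likely outcomes of two dice and check independence:
# events are independent exactly when P(intersection) = product of probabilities.
S = [(i, j) for i in range(1, 7) for j in range(1, 7)]

def prob(event):
    return Fraction(sum(1 for s in S if event(s)), len(S))

A = lambda s: s[0] % 2 == 0        # first roll is even
C = lambda s: s[0] + s[1] >= 9     # sum at least 9
D = lambda s: s[0] + s[1] == 9     # sum exactly 9

# A and C are NOT independent: 6/36 differs from (1/2)(10/36).
print(prob(C), prob(lambda s: A(s) and C(s)), prob(A) * prob(C))
# A and D ARE independent: 2/36 = (1/2)(4/36) = 1/18.
print(prob(D), prob(lambda s: A(s) and D(s)), prob(A) * prob(D))
```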
A = The event that a randomly selected student from the class has a bike
B = The event that a randomly selected student from the class has blue eyes
and P(A) = 0.36, P(B) = 0.45 and P(A∩B) = 0.13
What is the probability that a student has a bike GIVEN that the stu-
dent has blue eyes?
in other words
Considering just students who have blue eyes, what is the probability that
a randomly selected student has a bike?
[Venn diagram: the sample space divided into the four regions A ∩ Bᶜ,
A ∩ B, Aᶜ ∩ B and Aᶜ ∩ Bᶜ.]
P(B|A) = P(A ∩ B) / P(A)    (Conditional Probability Law)
We have that
P (A) = 0.7
P (B) = 0.8
P (A ∩ B) = 0.6.
We want

P(A|B) = P(A ∩ B) / P(B) = 0.6 / 0.8 = 0.75
Similarly
P(A|B)P(B) = P(A ∩ B)
Note that in this case (provided P(B) > 0), if A and B are independent

P(A|B) = P(A ∩ B) / P(B) = P(A)P(B) / P(B) = P(A),

and similarly P(B|A) = P(B) (provided P(A) > 0).
So for independent events, knowledge that one of the events has occurred
does not change our assessment of the probability that the other event has
occurred.
Then we are told P (A) = 0.45, P (B) = 0.55, P (S) = 0.3, and
P (A ∩ S) = 0.2.
P(A) is made up of two parts: (i) the part of A contained in B, and (ii) the
part of A contained in Bᶜ.
More generally, events are mutually exclusive if at most one of the events
can occur in a given experiment. Suppose E1 , . . . , En are mutually exclusive
events, which together form the whole sample space: E1 ∪ E2 ∪ · · · ∪ En = S.
(In other words, every possible outcome is in exactly one of the E's.) Then
What is the probability that the child will have genotype AB?
The power of Bayes Rule is its ability to take P (S|D) and cal-
culate P (D|S).
P(B|A) = P(A ∩ B) / P(A)    (Conditional Probability Law)

P(B|A) = P(A|B)P(B) / P(A)    (Bayes Rule)
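As a sketch of how the rule is used (with made-up numbers, not an example from the notes): suppose a disease D has prevalence 1%, and a symptom S appears in 90% of diseased people but also in 10% of healthy people. Bayes' Rule turns P(S|D) into P(D|S):

```python
# Bayes' Rule as code: P(B|A) = P(A|B) P(B) / P(A).
# The numbers below are illustrative, not from the notes.
def bayes(p_a_given_b, p_b, p_a):
    return p_a_given_b * p_b / p_a

p_d = 0.01                # P(D): prevalence of the disease
p_s_given_d = 0.9         # P(S|D): symptom rate among the diseased
p_s_given_not_d = 0.1     # P(S|not D): symptom rate among the healthy

# Partition law: P(S) = P(S|D) P(D) + P(S|not D) P(not D)
p_s = p_s_given_d * p_d + p_s_given_not_d * (1 - p_d)

print(round(bayes(p_s_given_d, p_d, p_s), 4))   # P(D|S) = 0.0833
```

Even with a fairly reliable symptom, P(D|S) stays small because the disease is rare; this asymmetry between P(S|D) and P(D|S) is exactly the point of the rule.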
Q. How many ways can they be arranged? i.e. how many permutations
are there?
A. 2 ways AB BA
Consider 3 objects A B C
Consider 4 objects A B C D
A. 24 ways
ABCD ABDC ACBD ACDB ADBC ADCB
BACD BADC BCAD BCDA BDAC BDCA
CBAD CBDA CABD CADB CDBA CDAB
DBCA DBAC DCBA DCAB DABC DACB
There is a pattern emerging here.
No. of objects 2 3 4 5 6 ...
No. of permutations 2 6 24 120 720 ...
Can we find a formula for the number of permutations of n objects?
Suppose we have 5 objects. How many different ways can we place them
into 5 boxes?
There are now only 4 objects to choose from for the second box.
5 4
There are 3 choices for the 3rd box, 2 for the 4th and 1 for the 5th box.
5 4 3 2 1
There are 4 choices for the first box and 3 choices for the second box
4 3
nPr = n! / (n − r)!

For example,

4P2 = 4! / 2! = (4 × 3 × 2 × 1) / (2 × 1) = 4 × 3 = 12
AB AC AD BC BD CD
We write this as

4C2 = 6

We say there are nCr combinations of r objects chosen from n, where

nCr = n! / ((n − r)! r!)
# of ways of choosing 4 consonants = 6C4 = 6! / (4! 2!) = 15

# of ways of choosing 4 letters = 8C4 = 8! / (4! 4!) = 70

⇒ P(all four are consonants) = 15/70 = 3/14
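The counts can be verified with Python's math.comb (a check, not part of the notes):

```python
from math import comb
from fractions import Fraction

# Choose 4 of the 6 consonants, versus 4 of all 8 letters.
ways_consonants = comb(6, 4)   # 15
ways_any = comb(8, 4)          # 70

print(ways_consonants, ways_any, Fraction(ways_consonants, ways_any))  # 15 70 3/14
```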
(ii). A bag contains 8 white counters and 3 black counters. Two counters
are drawn, one after the other. Find the probability of drawing one
white and one black counter, in any order
What is the probability that the second counter is black (assume that
the first counter is replaced after it is taken)?
[Tree diagram, first counter replaced (sampling with replacement):]
P(W1 ∩ W2) = (8/11)(8/11) = 64/121
P(W1 ∩ B2) = (8/11)(3/11) = 24/121
P(B1 ∩ W2) = (3/11)(8/11) = 24/121
P(B1 ∩ B2) = (3/11)(3/11) = 9/121

[Tree diagram, first counter not replaced (sampling without replacement):]
P(W1 ∩ W2) = (8/11)(7/10) = 56/110
P(W1 ∩ B2) = (8/11)(3/10) = 24/110
P(B1 ∩ W2) = (3/11)(8/10) = 24/110
P(B1 ∩ B2) = (3/11)(2/10) = 6/110
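The tree multiplies probabilities along each branch and adds over the relevant paths; a check with exact fractions (an illustration, not part of the notes):

```python
from fractions import Fraction

# 8 white and 3 black counters, drawing twice.
w, b = Fraction(8, 11), Fraction(3, 11)

# With replacement: the second draw has the same probabilities as the first.
p_one_each_repl = w * b + b * w
print(p_one_each_repl)                          # 48/121

# Without replacement: the second draw depends on the first.
p_one_each_norepl = w * Fraction(3, 10) + b * Fraction(8, 10)
print(p_one_each_norepl)                        # 48/110 = 24/55

# Second counter black, with replacement: P(W1 ∩ B2) + P(B1 ∩ B2).
p_second_black = w * b + b * b
print(p_second_black)                           # 33/121 = 3/11
```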
(iii). (From 2001 TT Prelim Q1.) Two drugs that relieve pain are available
to treat patients. Drug A has been found to be effective in three-
quarters of all patients; when it is effective, the patients have relief
from pain one hour after taking this drug. Drug B acts quicker but only
works with one half of all patients: those who benefit from this drug
have relief of pain after 30 mins. The physician cannot decide which
patients should be prescribed which drug so he prescribes randomly.
Assuming that there is no variation between patients in the times
taken to act for either drug, calculate the probability that:
Let
R30 = The event that a patient is relieved of pain within 30 mins
R60 = The event that a patient is relieved of pain within 60 mins
A = Event that a patient takes drug A
B = Event that a patient takes drug B
⇒ P(A|R60) = 0.375 / 0.625 = 0.6
4.1 Introduction
In Lecture 2 we saw that we need to study probability so that we can
calculate the ‘chance’ that our sample leads us to the wrong conclusion
about the population. To do this in practice we need to ‘model’ the process
of taking the sample from the population. By ‘model’ we mean describe the
process of taking the sample in terms of the probability of obtaining each
possible sample. Since there are many different types of data and many
different ways we might collect a sample of data we need lots of different
probability models. The Binomial distribution is one such model that turns
out to be very useful in many experimental settings.
Some values of X will be more likely to occur than others. Each value
of X will have a probability of occurring. What are these probabilities?
Consider the probability of obtaining just one black ball, i.e. X = 1.
One possible way of obtaining one black ball is if we observe the pattern
BRRRR. The probability of obtaining this pattern is
P(BRRRR) = (2/3) × (1/3) × (1/3) × (1/3) × (1/3)
There are 32 possible patterns of black and red balls we might observe. 5 of
the patterns contain just one black ball
It's now just a small step to write down a formula for this specific
situation, in which we draw 5 balls:

P(X = x) = 5Cx × (2/3)^x × (1/3)^(5−x)
We can use this formula to tabulate the probabilities of each possible value
of X.
These probabilities are plotted in Figure 4.1 against the values of X. This
shows the distribution of probabilities across the possible values of X. This
P(X = 0) = 5C0 × (2/3)^0 × (1/3)^5 = 0.0041
P(X = 1) = 5C1 × (2/3)^1 × (1/3)^4 = 0.0412
P(X = 2) = 5C2 × (2/3)^2 × (1/3)^3 = 0.1646
P(X = 3) = 5C3 × (2/3)^3 × (1/3)^2 = 0.3292
P(X = 4) = 5C4 × (2/3)^4 × (1/3)^1 = 0.3292
P(X = 5) = 5C5 × (2/3)^5 × (1/3)^0 = 0.1317

[Figure 4.1: the probabilities P(X = x) plotted against the values x = 0, 1, . . . , 5.]
- 2 possible outcomes for each trial “success” and “failure”, e.g. Heads
or Tails
- Trials are independent, e.g. each coin toss doesn’t affect the others
- P(“success”) = p is the same for each trial, e.g. P(Black) = 2/3 is the
same for each trial
P(X = x) = nCx p^x (1 − p)^(n−x),    x = 0, 1, . . . , n

X ∼ Bin(n, p)
Examples
With this general formula we can calculate many different probabilities.
(i). Suppose X ∼ Bin(10, 0.4), what is P(X = 7)?
P(X = 7) = 10C7 (0.4)^7 (1 − 0.4)^(10−7)
         = (120)(0.4)^7 (0.6)^3
         = 0.0425
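A short sketch (not from the notes) of the binomial formula as code, checking both this worked example and the earlier table for X ∼ Bin(5, 2/3):

```python
from math import comb

# Binomial probability P(X = x) = nCx p^x (1 - p)^(n - x).
def binom_pmf(x, n, p):
    return comb(n, x) * p**x * (1 - p)**(n - x)

# The worked example: X ~ Bin(10, 0.4).
print(round(binom_pmf(7, 10, 0.4), 4))       # 0.0425

# The ball-drawing table: X ~ Bin(5, 2/3).
for x in range(6):
    print(x, round(binom_pmf(x, 5, 2/3), 4))
```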
[Figure: probability histograms of three binomial distributions, P(X)
against X = 0, . . . , 10.]
If X ∼ Bin(n, p) then

µ = np
σ = √(npq)    where q = 1 − p

In the example above, X ∼ Bin(5, 2/3) and so the mean and standard
deviation are given by

µ = np = 5 × (2/3) = 3.333

and

σ = √(npq) = √(5 × (2/3) × (1/3)) = √1.111 = 1.054
However, for a given value of p, the skewness goes down as n increases. All
binomial distributions eventually become approximately symmetric for large
n. This will be discussed further in Lecture 6.
• posit a hypothesis
Hypothesis The die is fair. All 6 outcomes have the same probability.
Testing the hypothesis Assuming our hypothesis is true, what is the probability
that we would have observed such a sample, or a sample more extreme?
That is, is our sample quite unlikely to have occurred under the assumptions of
our hypothesis?
- P(“success”) = p is the same for each trial, i.e. P(1 comes up) = 1/6
is the same for each trial
P(# 1's is exactly 30) = 60C30 (1/6)^30 (5/6)^(60−30) = 2.25 × 10^(−9).

P(# 1's is at least 30) = Σ (from x = 30 to 60) 60Cx (1/6)^x (5/6)^(60−x) = 2.78 × 10^(−9).
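Both tail calculations can be reproduced exactly (a check, not part of the notes):

```python
from math import comb

# Tail probability for the fair-die hypothesis: 60 rolls, p = 1/6 for a one.
def binom_pmf(x, n, p):
    return comb(n, x) * p**x * (1 - p)**(n - x)

p_exactly_30 = binom_pmf(30, 60, 1/6)
p_at_least_30 = sum(binom_pmf(x, 60, 1/6) for x in range(30, 61))

print(p_exactly_30)    # about 2.25e-09
print(p_at_least_30)   # about 2.78e-09, a little bigger, as the text says
```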
Which is the appropriate probability? The “strange event” from the per-
spective of the fair die was not that 1 came up exactly 30 times, but that it
came up so many times. So the relevant number is the second one, which
is a little bigger. Still, the probability is less than 3 in a billion. In other
words, if you were to perform one of these experiments once a second, con-
tinuously, you might expect to see a result this extreme once in 10 years. So
you either have to believe that you just happened to get that one in 10 years
outcome the one time you tried it, or you have to believe that there really
is something biased about the die. In the language of hypothesis testing we
say we would ‘reject the hypothesis’.
From Table 2.1, we know that there were 163 patients who died,
out of a total of 1629. Now, suppose the study works as follows:
Patients come in the door, we flip a coin, and allocate them to
the treatment group if heads comes up, or to the control group
if tails comes up. (This isn’t exactly how it was done, but close
enough. Next term, we’ll talk about other ways of getting the
same results.)
We had a total of 813 heads out of 1629, which is pretty close
to half, which seems reasonable. On the other hand, if we look
at the 163 coin flips for the patients who died, we only had 74
heads, which seems pretty far from half (which would be 81.5).
It seems there are two plausible hypotheses:
P(# heads at most 74 or at least 89)
    = Σ (from i = 0 to 74) 163Ci (1/2)^i (1/2)^(163−i)
    + Σ (from i = 89 to 163) 163Ci (1/2)^i (1/2)^(163−i)
    = 0.272.
Note that the “two-tailed” probability is exactly twice the “one-tailed”.
We show these probabilities on the probability histogram of Figure 4.3.
[Figure: probability histogram of the number of heads (mean = 81.5), with
the two tails X ≤ 74 and X ≥ 89 shaded, each with probability 0.136; we
observe 74.]
Figure 4.3: The tail probabilities for testing the hypothesis that anturane
has no effect.
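A sketch (not from the notes) reproducing the two-tailed calculation for Bin(163, 1/2):

```python
from math import comb

# Two-tailed probability for the anturane test: 163 deaths allocated by fair
# coin flips; how likely is a heads count at most 74 or at least 89?
n = 163
pmf = [comb(n, i) * 0.5**n for i in range(n + 1)]

lower = sum(pmf[:75])      # P(X <= 74)
upper = sum(pmf[89:])      # P(X >= 89), equal to the lower tail by symmetry

print(round(lower, 3), round(upper, 3))   # each about 0.136
print(round(lower + upper, 3))            # about 0.272
```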
5.1 Introduction
The book [Mou98] cites data from the St. Luke’s Hospital Gazette,
on the monthly number of drownings on Malta, over a period of
nearly 30 years (355 consecutive months). Most months there
were no drownings. Some months there was one person who
drowned. One month had four people drown. The data are
given as counts of the number of months in which a given num-
ber of drownings occurred, and we repeat them here as Table
5.1.
Looking at the data in Table 5.1, we might suppose that one of
the following hypotheses is true:
In such situations we are often interested in whether the events occur ran-
domly in time or space. Consider the Babyboom dataset (Table 1.2), that
we saw in Lecture 1. The birth times of the babies throughout the day are
shown in Figure 5.1(a). If we divide up the day into 24 one-hour intervals and
count the number of births in each hour we can plot the counts as a his-
togram in Figure 5.1(b). How does this compare to the histogram of counts
for a process that isn’t random? Suppose the 44 birth times were distributed
in time as shown in Figure 5.1(c). The histogram of these birth times per
hour is shown in Figure 5.1(d). We see that the non-random clustering of
events in time causes there to be more hours with zero births and more
hours with large numbers of births than the real birth times histogram.
This example illustrates that the distribution of counts is useful in un-
covering whether the events might occur randomly or non-randomly in time
(or space). Simply looking at the histogram isn’t sufficient if we want to
ask the question whether the events occur randomly or not. To answer
this question we need a probability model for the distribution of counts of
random events that dictates the type of distributions we should expect to
see.
X ∼ Po(λ) means
P(X = x) = e^(−λ) λ^x / x!,    x = 0, 1, 2, 3, 4, . . .
[Figure panels: (a) Babyboom data birth times (minutes since midnight); (b) histogram of the number of births per hour; (c) and (d) the corresponding plots for the hypothetical non-random birth times.]
Figure 5.1: Representing the babyboom data set (upper two) and a nonran-
dom hypothetical collection of birth times (lower two).
Note A Poisson random variable can take on any nonnegative integer value. In contrast, the Binomial distribution always has a finite upper limit.
The Poisson Distribution 85
We want P(X ≥ 2) = P(X = 2) + P(X = 3) + · · · , which is an infinite sum, but
P(X ≥ 2) = 1 − P(X < 2)
= 1 − (P(X = 0) + P(X = 1))
= 1 − (e^(−1.8) 1.8^0/0! + e^(−1.8) 1.8^1/1!)
= 1 − (0.16529 + 0.29753)
= 0.537
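The same complement-rule calculation can be scripted (again a sketch, not from the notes):

```python
from math import exp, factorial

# P(X >= 2) for X ~ Po(1.8), via the complement rule:
# P(X >= 2) = 1 - P(X = 0) - P(X = 1).
def pois_pmf(lam, x):
    return exp(-lam) * lam**x / factorial(x)

lam = 1.8
p_at_least_2 = 1 - pois_pmf(lam, 0) - pois_pmf(lam, 1)
print(round(p_at_least_2, 3))  # 0.537
```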
If X ∼ Po(λ) then
µ = λ,    σ² = λ,    σ = √λ
[Three probability histograms of P(X = x) for x = 0 to 20, for increasing values of λ: as λ grows, the distribution shifts right and spreads out.]
Well, if births occur randomly at a rate of 1.8 births per one-hour interval, then births occur randomly at a rate of 3.6 births per two-hour interval. Then Y ∼ Po(3.6) and
P(Y = 5) = e^(−3.6) 3.6^5/5! = 0.13768
What is the probability that we observe at least 6 births in total from the two hospitals in a given one-hour period?
P(Z ≥ 6) = 1 − P(Z ≤ 5)
= 1 − e^(−4.6) (4.6^0/0! + 4.6^1/1! + 4.6^2/2! + 4.6^3/3! + 4.6^4/4! + 4.6^5/5!)
= 0.314.
Therefore the mean birth rate for both sequences is 44/24 = 1.8333. What would be the expected counts if birth times were really random, i.e. what is the expected histogram for a Poisson random variable with mean rate λ = 1.8333?
x 0 1 2 3 4 5 ≥6
P (X = x) 0.15989 0.29312 0.26869 0.16419 0.07525 0.02759 0.01127
Then if we observe 24 one-hour intervals we can calculate the expected frequencies as 24 × P(X = x) for each value of x.

x                     0      1      2      3      4      5      ≥6
Expected frequency    3.837  7.035  6.448  3.941  1.806  0.662  0.271
We say we have fitted a Poisson distribution to the data.
Once we have fitted a distribution to the data we can compare the expected
frequencies to those we actually observed from the real Babyboom dataset.
We see that the agreement is quite good.
x 0 1 2 3 4 5 ≥6
Expected 3.837 7.035 6.448 3.941 1.806 0.662 0.271
Observed 3 8 6 4 3 0 0
¹ In practice we group values with low probability into one category.
When we compare the expected frequencies to those observed from the non-
random clustered sequence in Section 1 we see that there is much less agree-
ment.
x 0 1 2 3 4 5 ≥6
Expected 3.837 7.035 6.448 3.941 1.806 0.662 0.271
Observed 12 3 0 2 2 4 1
In Lecture 9 we will see how we can formally test for a difference between
the expected and observed counts. For now it is enough just to know how
to fit a distribution.
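The fitting procedure (expected frequencies 24 × P(X = x), pooling x ≥ 6 into one category) can be sketched in Python, using the observed Babyboom counts from the table above:

```python
from math import exp, factorial

# Fit a Poisson distribution to hourly birth counts: lambda is the
# sample mean rate 44/24, and expected frequencies are 24 * P(X = x),
# with the final category pooling all x >= 6.
lam = 44 / 24
pmf = [exp(-lam) * lam**x / factorial(x) for x in range(6)]
probs = pmf + [1 - sum(pmf)]            # categories 0,...,5 and >=6
expected = [24 * p for p in probs]
observed = [3, 8, 6, 4, 3, 0, 0]
for x, (e, o) in enumerate(zip(expected, observed)):
    print(x, round(e, 3), o)
```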
In general,
If n is large (say > 50) and p is small (say < 0.1) then a
Bin(n, p) can be approximated with a Po(λ) where λ = np
X ∼ Bin(100, 0.05)
[Side-by-side probability histograms of Bin(100, 0.05) and its Poisson approximation, for x = 0 to 10.]
Figure 5.3: A Binomial and Poisson distribution that are very similar.
We want P(X ≥ 2).
P(X ≥ 2) = 1 − P(X < 2)
= 1 − (P(X = 0) + P(X = 1))
≈ 1 − (e^(−5) 5^0/0! + e^(−5) 5^1/1!)
≈ 1 − 0.040428
≈ 0.9596
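To see how good the approximation is, one can compare the exact binomial tail with the Poisson value (a sketch):

```python
from math import comb, exp, factorial

# Exact Bin(100, 0.05) versus the Po(5) approximation for P(X >= 2).
n, p = 100, 0.05
lam = n * p
exact = 1 - sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(2))
approx = 1 - sum(exp(-lam) * lam**i / factorial(i) for i in range(2))
print(round(exact, 4), round(approx, 4))
```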
• The exact distribution may have too much detail. There may be some features of the exact distribution that are irrelevant to the questions we are asking.
P(Y ≥ 48) = 1 − Σ_{i=0}^{47} e^(−23) 23^i/i! = 3.5 × 10^(−6).
Now, those of you who have learned some calculus at A-levels may remember
the Taylor series for ez :
e^z = 1 + z + z²/2! + z³/3! + · · · .
In particular, for small z we have e−z ≈ 1 − z, and the difference (or “error”
in the approximation) is no bigger than z 2 /2. The key idea is that if z is
very small (as it is when z = λ/n, and n is large), then z 2 is a lot smaller
than z.
Using a bit of algebra, we have
P{Xn = x} = ⁿC_x (λ/n)^x (1 − λ/n)^(n−x)
= [n(n − 1) · · · (n − x + 1)/x!] (λ^x/n^x) (1 − λ/n)^(−x) (1 − λ/n)^n
= (λ^x/x!) · [(1)(1 − 1/n) · · · (1 − (x − 1)/n) / (1 − λ/n)^x] · (1 − λ/n)^n.
Now, if we’re not concerned about the size of the error, we can simply
say that n is much bigger than λ or x (because we’re thinking of a fixed λ
and x, and n getting large). So we have the approximations
(1)(1 − 1/n) · · · (1 − (x − 1)/n) ≈ 1;
(1 − λ/n)^x ≈ 1;
(1 − λ/n)^n ≈ (e^(−λ/n))^n = e^(−λ).
Thus
P{Xn = x} ≈ (λ^x/x!) e^(−λ).
than about 1.6λ²/n^(3/2). In Example 5.5, where n = 100 and λ = 5, this
says the error won’t be bigger than about 0.04, which is useful information,
although in reality the maximum error is about 10 times smaller than this.
On the other hand, if n = 400, 000 (about the population of Malta), and
λ = 0.47, then the error will be only about 10−8 .
Let's assume that n is at least 4λ², so λ ≤ √n/2. Define the approximation error to be
ε := max over x of |P{Xn = x} − P{X = x}|.
(The bars | · | mean that we're only interested in how big the difference is, not whether it's positive or negative.) Then
P{Xn = x} − P{X = x}
= (λ^x/x!) ( [(1)(1 − 1/n) · · · (1 − (x − 1)/n)/(1 − λ/n)^x] (1 − λ/n)^n − e^(−λ) )
= (λ^x/x!) e^(−λ) ( (1)(1 − 1/n) · · · (1 − (x − 1)/n) (1 − λ/n)^(−x) [(1 − λ/n)/e^(−λ/n)]^n − 1 )
If x is bigger than √n, then P {X = x} and P {Xn = x} are both tiny; we
won’t go into the details here, but we will consider only x that are smaller
than this. Now we have to do some careful approximation. Basic algebra
tells us that if a and b are positive,
Thus
(1)(1 − 1/n) · · · (1 − (x − 1)/n) > 1 − Σ_{k=0}^{x−1} k/n > 1 − x²/(2n),
and
1 > (1 − λ/n)^x > 1 − λx/n.
Again applying some calculus, we turn this into
1 < (1 − λ/n)^(−x) < 1 + λx/(n − λx).
ε ≤ max over x of (λ^(x+1)/x!) e^(−λ) [ λ/n + x/n + x/(n(1 − x/(2√n))) ].
We need to find the maximum over all possible x. If x < √n then this becomes
ε ≤ max over x of (1/n)(λ^(x+1)/x!) e^(−λ)(λ + 3x) ≤ 4λλ∗/(n√(2πn)),
(by a formula known as “Stirling’s formula”), where λ∗ = max{λ, 1}.
Lecture 6
6.1 Introduction
In previous lectures we have considered discrete datasets and discrete prob-
ability distributions. In practice many datasets that we collect from ex-
periments consist of continuous measurements. For example, there are the
weights of newborns in the babyboom data set (Table 1.2). The plots in
Figure 6.1 show histograms of real datasets consisting of continuous mea-
surements. From such samples of continuous data we might want to test
whether the data is consistent with a specific population mean value or
whether there is a significant difference between 2 groups of data. To answer these questions we need a probability model for the data. Of course,
there are many different possible distributions that quantities could have. It
is therefore a startling fact that many different quantities that we are com-
monly interested in — heights, weights, scores on intelligence tests, serum
potassium levels of different patients, measurement errors of distance to the
nearest star — all have distributions which are close to one particular shape.
This shape is called the Normal or Gaussian1 family of distributions.
¹ Named for the German mathematician Carl Friedrich Gauss, who first worked out the formula for these distributions, and used them to estimate the errors in astronomical computations. Until the introduction of the euro, Gauss’s picture — and the Gaussian curve — were on the German 10 mark banknote.
[Figure 6.1: histograms of four continuous datasets: (a) birth weights (g) from the babyboom data; (b) petal lengths; (c) brain sizes of 40 Psychology students; (d) serum potassium measurements from 152 healthy volunteers.]
When we considered the Binomial and Poisson distributions we saw that the
probability distributions were characterized by a formula for the probability
of each possible discrete value. All of the probabilities together sum up to
1. We can visualize the density by plotting the probabilities against the
discrete values (Figure 6.2). For continuous data we don’t have equally
spaced discrete values so instead we use a curve or function that describes
the probability density over the range of the distribution (Figure 6.3). The
curve is chosen so that the area under the curve is equal to 1. If we observe a
sample of data from such a distribution we should see that the values occur
in regions where the density is highest.
[Figure 6.2: the probabilities P(X = x) of a discrete distribution plotted against x = 0, . . . , 20. Figure 6.3: a continuous probability density curve, with total area 1 under the curve.]
X ∼ N(µ, σ²)

[Four Normal density curves: µ = 100, σ = 10; µ = 100, σ = 5; µ = 130, σ = 10; µ = 100, σ = 15.]
For this example we can calculate the required area as we know the distri-
bution is symmetric and the total area under the curve is equal to 1, i.e.
P (Z < 0) = 0.5.
What about P (Z < 1.0)?
Calculating this area is not easy² and so we use probability tables. Probability tables are tables of probabilities that have been calculated on a computer.
² For those mathematicians who recognize this area as a definite integral and try to do the integral by hand, please note that the integral cannot be evaluated analytically.
All we have to do is identify the right probability in the table and copy it
down! Obviously it is impossible to tabulate all possible probabilities for all
possible Normal distributions so only one special Normal distribution, N(0,
1), has been tabulated.
The tables allow us to read off probabilities of the form P (Z < z). Most of
the table in the formula book has been reproduced in Table 6.1. From this
table we can identify that P (Z < 1.0) = 0.8413 (this probability has been
highlighted with a box).
z 0.0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.5000 5040 5080 5120 5160 5199 5239 5279 5319 5359
0.1 0.5398 5438 5478 5517 5557 5596 5636 5675 5714 5753
0.2 0.5793 5832 5871 5910 5948 5987 6026 6064 6103 6141
0.3 0.6179 6217 6255 6293 6331 6368 6406 6443 6480 6517
0.4 0.6554 6591 6628 6664 6700 6736 6772 6808 6844 6879
0.5 0.6915 6950 6985 7019 7054 7088 7123 7157 7190 7224
0.6 0.7257 7291 7324 7357 7389 7422 7454 7486 7517 7549
0.7 0.7580 7611 7642 7673 7704 7734 7764 7794 7823 7852
0.8 0.7881 7910 7939 7967 7995 8023 8051 8078 8106 8133
0.9 0.8159 8186 8212 8238 8264 8289 8315 8340 8365 8389
1.0 0.8413 8438 8461 8485 8508 8531 8554 8577 8599 8621
1.1 0.8643 8665 8686 8708 8729 8749 8770 8790 8810 8830
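Nowadays the table can be reproduced directly, since P(Z < z) is expressible through the error function; a sketch (not part of the original notes):

```python
from math import erf, sqrt

# Standard normal CDF, P(Z < z), via the error function.
def Phi(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

print(round(Phi(1.0), 4))  # 0.8413, matching the boxed table entry
```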
The Normal distribution is symmetric so we know that P (Z >
−0.5) = P (Z < 0.5) = 0.6915
We can use the symmetry of the Normal distribution to calculate probabilities for negative values, for example P(Z < −0.76) = 1 − P(Z < 0.76).
[Sketches illustrate the areas P(X < −0.64) and P(X < 0.43).]
108 The Normal Distribution
6.5 Standardisation
All of the probabilities above were calculated for the standard Normal distri-
bution N(0, 1). If we want to calculate probabilities from different Normal
distributions we convert the probability to one involving the standard Nor-
mal distribution. This process is called standardisation.
P(X < 3100) = P((X − 3500)/500 < (3100 − 3500)/500)
= P(Z < −0.8) where Z ∼ N(0, 1)
= 1 − P(Z < 0.8)
= 1 − 0.7881
= 0.2119
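The same standardisation can be checked numerically (a sketch; Phi is the standard normal CDF):

```python
from math import erf, sqrt

# P(X < 3100) for X ~ N(3500, 500^2), by standardising to Z ~ N(0, 1).
def Phi(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

z = (3100 - 3500) / 500
print(round(Phi(z), 4))  # 0.2119
```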
In this example,
Z = (D − 2)/16.40 and (0 − 2)/16.40 = −0.122,
so
P(D < 0) = P((D − 2)/√269 < (0 − 2)/√269) = P(Z < −0.122) where Z ∼ N(0, 1)
= 1 − (0.8 × 0.5478 + 0.2 × 0.5517)
= 0.45142
Let A = (X + Y)/2 = (1/2)X + (1/2)Y be the average time of rats A and B.
Then A ∼ N((1/2)80 + (1/2)78, (1/2)²10² + (1/2)²13²) = N(79, 67.25), and (82 − 79)/8.20 = 0.366, so
P(A > 82) = P((A − 79)/√67.25 > (82 − 79)/√67.25) = P(Z > 0.366) where Z ∼ N(0, 1)
= 1 − (0.4 × 0.6406 + 0.6 × 0.6443)
= 0.35718
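The calculation for the averaged rat times can be sketched the same way (the tiny difference from the value above comes from the table interpolation used in the notes):

```python
from math import erf, sqrt

# P(A > 82), where A = (X + Y)/2 with independent X ~ N(80, 10^2)
# and Y ~ N(78, 13^2), so A ~ N(79, 67.25).
def Phi(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

mu = 0.5 * 80 + 0.5 * 78
var = 0.25 * 10**2 + 0.25 * 13**2
z = (82 - mu) / sqrt(var)
p_above = 1 - Phi(z)
print(round(p_above, 4))
```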
⇒ P(X < x) = 0.8
Standardising with Z = (X − 45)/20, we need (x − 45)/20 = 0.84 (since P(Z < 0.84) ≈ 0.8), so x = 45 + 20 × 0.84 = 61.8.
For example, Figure 6.5 compares a Bin(300, 0.5) and a N(150, 75) which
both have the same mean and variance. The figure shows that the distribu-
tions are very similar.
[Figure 6.5: the probability histogram of Bin(300, 0.5) (left) and the density of N(150, 75) (right), both plotted for x between 100 and 200.]
In general, if X ∼ Bin(n, p) then
µ = np,    σ² = npq where q = 1 − p,
and X is approximately N(np, npq) provided
n > 10 and p ≈ 1/2, OR n > 30 and p moving away from 1/2.
Unfortunately, it’s not quite so simple. We have to take into account the
fact that we are using a continuous distribution to approximate a discrete
distribution. This is done using a continuity correction. The continuity
correction appropriate for this example is illustrated in the figure below
[Illustration: the integers 0 to 12, with the interval from 3.5 to 7.5 marked.]
P(3.5 < X < 7.5) = P((3.5 − 6)/√3 < (X − 6)/√3 < (7.5 − 6)/√3)
= P(−1.443 < Z < 0.866) where Z ∼ N(0, 1)
= 0.732
The exact answer is 0.733 so in this case the approximation is very good.
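A binomial with mean 6 and variance 3 is Bin(12, 1/2), so the continuity-corrected approximation and the exact value can be compared in a short sketch:

```python
from math import comb, erf, sqrt

# Continuity correction: P(4 <= X <= 7) for X ~ Bin(12, 1/2) is
# approximated by the N(6, 3) probability of the interval (3.5, 7.5).
def Phi(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sd = 6, sqrt(3)
approx = Phi((7.5 - mu) / sd) - Phi((3.5 - mu) / sd)
exact = sum(comb(12, i) for i in range(4, 8)) / 2**12
print(round(approx, 3), round(exact, 3))  # 0.732 0.733
```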
In general, if X ∼ Po(λ) then
µ = λ,    σ² = λ,
and X is approximately N(λ, λ).
P(X < 26.5) = P((X − 25)/5 < (26.5 − 25)/5)
= P(Z < 0.3) where Z ∼ N(0, 1)
= 0.6179
[Sketch: standardising with Z = (X − 25)/5, the value 26.5 maps to (26.5 − 25)/5 = 0.3.]
Figure 6.6: Normal table used to compute tail probability for Aquarius
experiment.
Confidence intervals
Var(X̄) = (1/n²)Var(X1 + · · · + X198) = (1/n²)(Var(X1) + · · · + Var(X198)) = σ²/n.
since the variance of a sum of independent variables is always the sum of
the variances. Thus, we can standardise X̄ by writing
Z = (X̄ − µ)/(σ/√n),
which is a standard normal random variable (that is, with expectation 0 and
variance 1).
A tiny bit of algebra gives us
µ = X̄ − (σ/√n) Z.
This expresses the unknown quantity µ in terms of known quantities and a
random variable Z with known distribution. Thus we may use our standard
normal tables to generate statements like “the probability is 0.95 that Z is
in the range −1.96 to 1.96,” implying that
the probability is 0.95 that µ is in the range X̄ − 1.96σ/√n to X̄ + 1.96σ/√n.
(Note that we have used the fact that the normal distribution is symmetric
about 0.) We call this interval a 95% confidence interval for the unknown
population mean.
The quantity σ/√n, which determines the scale of the confidence interval, is called the Standard Error for the sample mean, commonly abbreviated SE. If we take σ to be the sample standard deviation — more about this assumption in chapter 10 — the Standard Error is 69mm/√198 ≈ 4.9mm.
The 95% confidence interval for the population mean is then 1732 ± 9.8mm,
so (1722, 1742)mm. In place of our vague statement about a best guess for
µ, we have an interval of width 20 mm in which we are 95% confident that
the true population mean lies.
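The interval computation is easy to script (a sketch; the factor 2 is the rounded z value used in the text):

```python
from math import sqrt

# Normal confidence interval for a population mean, from summary
# statistics: sample mean 1732 mm, SD 69 mm, n = 198.
def conf_int(xbar, s, n, z=2.0):
    se = s / sqrt(n)
    return xbar - z * se, xbar + z * se

lo, hi = conf_int(1732, 69, 198)
print(round(lo), round(hi))  # 1722 1742
```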
where SE = σ/√n, and z is the appropriate quantile of the standard normal distribution. That is, it is the number such that (100 − c)/2% of the probability in the standard normal distribution is above z. Thus, if we're
looking for a 95% confidence interval, we take X̄ ± 2 SE, whereas a 99%
confidence interval would be X̄ ± 2.6 SE, since we see on the normal table
that P (Z < 2.6) = 0.9953, so P (Z > 2.6) = 0.0047 ≈ 0.5%. (Note: The
central importance of the 95% confidence interval derives primarily from its
convenient correspondence to a z value of 2. More precisely, it is 1.96, but
we rarely need — or indeed, can justify — such precision.)
In other situations, as we will see, we use the same formula for a normal
confidence interval for a parameter µ. The only thing that changes from
problem to problem is the point estimate X̄, and the standard error.
you can compute from the data X) A(X) and B(X), such that
P(A(X) ≤ µ ≤ B(X)) = α.
The quantity P(A(X) ≤ µ ≤ B(X)) is called the coverage probability
for µ. Thus, a confidence interval for µ with confidence coefficient α is
precisely a random interval with coverage probability α. In many cases, it is
not possible to find an interval with exactly the right coverage probability.
We may have to content ourselves with an approximate confidence interval
(with coverage probability ≈ α) or a conservative confidence interval (with
coverage probability ≥ α). We usually make every effort not to overstate our
confidence about statistical conclusions, which is why we try to err on the
side of making the coverage probability — hence the interval — too large.
An illustration of this problem is given in Figure 7.1. Suppose we are
measuring systolic blood pressure on 100 patients, where the true blood
pressure is 120 mmHg, but the measuring device makes normally distributed
errors with mean 0 and SD 10 mmHg. In order to reduce the errors, we
take four measures on each patient and average them. Then we compute
a confidence interval. The measures are shown in figure 7.1(a). In Figure
7.1(b) we have shown a 95% confidence interval for each patient, computed
by taking the average of the patient’s four measurements, plus and minus 10.
Notice that there are 6 patients (shown by red X’s for their means) where
the true measure — 120 mmHg — lies outside the confidence interval. In
Figures 7.1(c) and 7.1(d) we show 90% and 68% confidence intervals, which
are narrower, and hence miss the true value more frequently.
A 90% confidence interval tells you that 90% of the time the true value
will lie in this range. In fact, we find that there are exactly 90 out of 100
cases where the true value is in the confidence interval. The 68% confidence
intervals do a bit better than would be expected on average: 74 of the 100
trials had the true value in the 68% confidence interval.
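The coverage idea can be checked by simulation. This sketch repeats the blood-pressure setup (true value 120 mmHg, measurement SD 10, four readings per patient, intervals of the form mean ± 10):

```python
import random
from math import sqrt

# Simulated coverage of the "mean of 4 readings +/- 10" interval:
# the half-width 10 equals 2 * (10 / sqrt(4)), so coverage should
# be about 95% in the long run.
random.seed(1)
true_bp, sd, k, half = 120, 10, 4, 10
trials = 10000
covered = sum(
    1 for _ in range(trials)
    if abs(sum(random.gauss(true_bp, sd) for _ in range(k)) / k - true_bp) <= half
)
print(covered / trials)
```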
[Four panels, each plotting one interval per trial for 100 trials, with blood pressure (roughly 90–150 mmHg) on the vertical axis: the raw measurements, then the 95%, 90%, and 68% confidence intervals.]
Figure 7.1: Confidence intervals for 100 patients’ blood pressure, based on four measurements. Each column of
Figure 7.1(a) shows a single patient’s four measurements. The true BP in each case is 120, and the measurement
errors are normally distributed with mean 0 and SD 10.
We discussed in section 6.8 that the binomial distribution can be well ap-
proximated by a normal distribution. This means that if we are estimating
the probability of success p from some observations of successes and failures,
we can use the same methods as above to put a confidence interval on p.
For instance, the Gallup organisation carried out a poll in October, 2005,
of Americans’ attitudes about guns (see http://www.gallup.com/poll/
20098/gun-ownership-use-america.aspx). They surveyed 1,012 Ameri-
cans, chosen at random. Of these, they found that 30% said they personally
owned a gun. But, of course, if they’d picked different people, purely by
chance they would have gotten a somewhat different percentage. How dif-
ferent could it have been? What does this survey tell us about the true
fraction (call it p) of Americans who own guns?
We can compute a 95% confidence interval as 0.30±1.96 SE. All we need
to know is the SE for the proportion p, which is the same as the standard
deviation for the observed proportion of successes. We know from section
6.8 (and discussed again at length in section 8.3), that the standard error is
SE = √(p(1 − p)/n),
where n is the number of samples. In this case, we get SE = √(0.3 × 0.7/1012) = 0.0144. So the 95% confidence interval is 0.30 ± 1.96 × 0.0144 = 0.30 ± 0.028.
Loosely put, we can be 95% confident that the true proportion of gun owners is between 27% and 33%. A 99% confidence interval comes from multiplying by 2.6 instead of 1.96: it goes from 26.3% to 33.7%.
Notice that the Standard Error for a proportion is a maximum when p = 0.5. Thus, we can always get a “conservative confidence interval” — an interval where the probability of finding the true parameter in it is at least 95% (or whatever the level is) — by taking the SE to be √(0.25/n). The 95% confidence interval then has the particularly simple form sample mean ± 1/√n.
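For the Gallup example, the proportion interval works out as follows (a sketch):

```python
from math import sqrt

# 95% confidence interval for a proportion: 30% of n = 1012 respondents.
p_hat, n = 0.30, 1012
se = sqrt(p_hat * (1 - p_hat) / n)
lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se
print(round(se, 4), round(lo, 3), round(hi, 3))  # 0.0144 0.272 0.328
```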
Z = (X̄ − µ)/(σ/√n)    (7.1)
(ii). has thin tails: Most of the probability is close to the mean, not many
SDs away from the mean.
[Density plots for λ = 1, 4, 10, and 20 (panels (a)–(d)): as λ increases, the distribution becomes more symmetric and bell-shaped.]
[Panels comparing Binomial probability histograms with their approximating normal densities, for several values of n and p.]
Figure 7.3: Normal approximations to Binom(n, p). Shaded region is the implied approximate
probability of the Binomial variable < 0 or > n.
means and computing confidence intervals for the population mean (as well
as for differences in means) even to data that are not normally distributed.
(Caution: Remember that t is an improvement over Z only when the number
of samples being averaged is small. Unfortunately, the CLT itself may not
apply in such a case.) We have already applied this idea when we did the Z
test for proportions, and the CLT was also hidden in our use of the χ2 test.
This is of course the same as the probability that the average number of births is at least 255. We could also compute this by reasoning that X̄ = S/100 is normally distributed with mean 251 and SD 41.9/√100 = 4.19. Thus,
P(X̄ > 255) = P((X̄ − 251)/4.19 > (255 − 251)/4.19) = P{Z > 0.95},
[Figure 7.4: histograms of averages of n daily birth counts, for n = 1, 2, 5, 10 and larger, increasingly close to the normal shape.]
to include Bill Gates, with annual income of, let us say, £3 billion, then his
income will be ten times as large as the total income of the entire remainder
of the sample. Even if everyone else has zero income, the sample mean will
be at least £300,000. The distribution of the mean will not converge, or will
converge only very slowly, if it can be substantially affected by the presence
or absence of a few very high-earning individuals in the sample.
Figure 7.5(a) is a histogram of household incomes, in thousands of US
dollars, in the state of California in 1999, based on the 2000 US census (see
www.census.gov). We have simplified somewhat, since the final category is
“more than $200,000”, which we have treated as being the range $200,000 to
$300,000. (Remember that histograms are on a density scale, with the area
of a box corresponding to the number of individuals in that range. Thus,
the last three boxes all correspond to about 3.5% of the population, despite
their different heights.) The mean income is about µ = $62, 000, while the
median is $48,000. The SD of the incomes is σ = $55, 000.
Figures 7.5(b)–7.5(f) show the effect of averaging 2, 5, 10, 50, and 100
randomly chosen incomes, together with a normal distribution (in green) as
predicted by the CLT, with mean µ and variance σ 2 /n. We see that the
convergence takes a little longer than it did with the more balanced birth
data of Figure 7.4 — averaging just 10 incomes is still quite skewed — but
by the time we have reached the average of 100 incomes the match to the
predicted normal distribution is remarkably good.
There are many implications of the Central Limit Theorem. We can use it
to estimate the probability of obtaining a total of at least 400 in 100 rolls of
a fair six-sided die, for instance, or the probability of a subject in an ESP
experiment, guessing one of four patterns, obtaining 30 correct guesses out
of 100 purely by chance. These were discussed in lecture 6 of the first set
of lectures. It suggests an explanation for why height and weight, and any
other quantity that is affected by many small random factors, should end
up being normally distributed.
Here we discuss one crucial application: The CLT allows us to com-
pute normal confidence intervals and apply the Z test to data that are not
themselves normally distributed.
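The claim that averages become normal can be illustrated by simulation. This sketch uses an exponential distribution as a stand-in for skewed data (the census incomes themselves are not reproduced here) and tracks the sample skewness of the averages as n grows:

```python
import random

# CLT in action: the skewness of averages of n draws from a skewed
# distribution shrinks toward 0 (the normal value) as n increases.
random.seed(0)

def skewness(xs):
    n = len(xs)
    m = sum(xs) / n
    s2 = sum((x - m) ** 2 for x in xs) / n
    s3 = sum((x - m) ** 3 for x in xs) / n
    return s3 / s2 ** 1.5

skew = {}
for n in (1, 10, 100):
    means = [sum(random.expovariate(1) for _ in range(n)) / n
             for _ in range(2000)]
    skew[n] = skewness(means)
    print(n, round(skew[n], 2))
```

The skewness of a single exponential draw is about 2; averaging 100 draws brings it close to 0, matching the convergence seen in Figures 7.4 and 7.5.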
[Figure 7.5: histograms of averages of n California household incomes for n = 1, 2, 5, 10, 50, 100 (thousands of dollars), with the normal density predicted by the CLT overlaid.]
The Z Test
8.1 Introduction
In Lecture 1 we saw that statistics has a crucial role in the scientific process
and that we need a good understanding of statistics in order to avoid reach-
ing invalid conclusions concerning the experiments that we do. In Lectures 2
and 3 we saw how the use of statistics necessitates an understanding of probability. This led us to study how to calculate and manipulate probabilities using a variety of probability rules. In Lectures 4, 5 and 6 we considered three
specific probability distributions that turn out to be very useful in practical
situations. Effectively, all of these previous lectures have provided us with
the basic tools we need to use statistics in practical situations.
The goal of statistical analysis is to draw reasonable conclusions from the
data and, perhaps even more important, to give precise statements about the
level of certainty that ought to be attached to those conclusions. In lecture
7 we used the normal distribution to derive one form that these “precise
statements” can take: a confidence interval for some population mean. In
this lecture we consider an alternative approach to describing very much the
same information: Significance tests.
Figure 8.1: A Histogram showing the birth weight distribution in the Baby-
boom dataset.
Thus,
P(X ≤ 3276) = P(Z ≤ (3276 − 3426)/81) = P(Z ≤ −1.85) = 1 − P(Z < 1.85),
chance happened to get a result that would only happen about one time in
30. This seems unlikely, but not impossible.
Pay attention to the double negative that we commonly use for sig-
nificance tests: We have a research hypothesis, which we think would be
interesting if it were true. We don’t test it directly, but rather we use the
data to challenge a less interesting null hypothesis, which says that the
apparently interesting differences that we’ve observed in the data are simply
the result of chance variation. We find out whether the data support the
research hypothesis by showing that the null hypothesis is false (or unlikely).
If the null hypothesis passes the test, then we know only that this particular
challenge was inadequate. We haven’t proven the null hypothesis. After
all, we may just not have found the right challenger; a different experiment
might show up the weaknesses of the null. (The potential strength of the
challenge is called the “power” of the test, and we’ll learn about that in
section 13.2.)
What if the challenge succeeds? We can then conclude with confidence
(how much confidence depends on the p-value) that the null was wrong. But
in a sense, this is shadow boxing: We don’t exactly know who the challenger
is. We have to think carefully about what the plausible alternatives are.
(See, for instance, Example 8.2.)
The basic steps carried out in Example 8.1 are common to most significance
tests:
(v). Compare the test statistic to its sampling distribution under the
null hypothesis and calculate the p-value. The strength of the evi-
dence is larger, the smaller the p-value.
If the p-value for the test were much larger, say 0.23, then we would conclude
that
Decision \ Truth   H0 True                            H0 False
Retain H0          Correct (Prob. 1 − α)              Type II Error (Prob. = β)
Reject H0          Type I Error (Prob. = level = α)   Correct (Prob. = Power = 1 − β)
“the evidence against the null hypothesis is not significant at the 5% level”
values will be the most extreme 5% of values in the right hand tail of the
distribution. Using our tables backwards we can calculate that the boundary
of this region, called the critical value, will be 1.645. The value of our
test statistic is 3.66 which lies in the critical region so we reject the null
hypothesis at the 5% level.
[Sketch: the N(0, 1) density with the critical region to the right of 1.645 shaded; its area is 0.05.]
(v). Compare the test statistic to its sampling distribution under the
null hypothesis and calculate the p-value,
or equivalently,
Z = (observation − expectation)/(standard error).    (8.2)
The expectation and standard error are the mean and the standard deviation
of the sampling distribution: that is, the mean and standard deviation that
the observation has when seen as a random variable, whose distribution is
given by the null hypothesis. Thus, Z has been standardised: its distribution
is standard normal, and the p-value comes from looking up the observed
value of Z on the standard normal table.
We call this a “one-sample” test because we are interested in testing the
mean of samples from a single distribution. This is as opposed to the “two-sample test” (discussed in section ??), in which we are testing the difference
in means between two populations.
X1 ∼ N(µ, σ²),    X2 ∼ N(µ, σ²),
then
X̄ = (1/2)X1 + (1/2)X2 ∼ N((1/2)µ + (1/2)µ, (1/2)²σ² + (1/2)²σ²)
⇒ X̄ ∼ N(µ, σ²/2)
In general,
Z = (sample mean − µ)/(σ/√n).
Thus, under the assumption of the null hypothesis the sample mean of 44 values from a N(3426, 538²) distribution is
X̄ ∼ N(3426, 538²/44) = N(3426, 81²).
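Putting the pieces together, the one-sample Z test for the birthweight example can be sketched as:

```python
from math import erf, sqrt

# One-sample Z test: sample mean 3276 g from n = 44 births, against
# the null hypothesis N(3426, 538^2) (the UK figures).
def Phi(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

xbar, mu0, sigma, n = 3276, 3426, 538, 44
z = (xbar - mu0) / (sigma / sqrt(n))
p_one_tailed = Phi(z)   # lower-tail alternative
print(round(z, 2), round(p_one_tailed, 3))
```

The p-value of about 0.03 is the "one time in 30" discussed above.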
H1 : p > 0.25.
Under H0 : p = p0,
the expected proportion of successes X/n is p0, and the standard error is √(p0(1 − p0)/n). So
Z = (proportion of successes − p0)/√(p0(1 − p0)/n).
Z has standard normal distribution.
The test statistic will come out exactly the same, regardless of whether
we work with numbers of successes or proportions.
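For the ESP example mentioned earlier (30 correct guesses out of 100, with chance probability p0 = 0.25), the proportion test looks like:

```python
from math import erf, sqrt

# Z test for a proportion: H0: p = 0.25 against H1: p > 0.25.
def Phi(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

successes, n, p0 = 30, 100, 0.25
z = (successes / n - p0) / sqrt(p0 * (1 - p0) / n)
p_value = 1 - Phi(z)    # one-tailed
print(round(z, 3), round(p_value, 3))
```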
shrinking by a factor of √n. This corresponds to our intuition that averaging many independent samples will tend to be closer to the true value than any single measurement. If the standard deviation of the population is σ, the standard error of the sample mean is σ/√n. Intuitively, the standard error
tells us about how far off the sample mean will be from the true population
mean (or true probability of success): we will almost never be off by more
than 3 SEs.
In this case we allow for the possibility that the mean value is greater
than 3426g by setting our critical region to be lowest 2.5% and highest 2.5%
of the distribution. In this way the total area of the critical region remains
0.05 and so the level of significance of our test remains 5%. In this exam-
ple, the critical values are -1.96 and 1.96. Thus if our test statistic is less
than -1.96 or greater than 1.96 we would reject the null hypothesis. In this
example, the value of the test statistic does lie in the critical region, so we
reject the null hypothesis.

[Figure: the N(0, 1) density, with two-tailed critical regions below −1.96 and above 1.96, each of probability 0.025.]
Fundamentally, though, the distinction between one-tailed and two-tailed
tests is important only because we set arbitrary p-values such as 0.05 as hard
cutoffs. We should be cautious about “significant” results that depend for
their significance on the choice of a one-tailed test, where a two-tailed test
would have produced an “insignificant” result.
Then a 95% confidence interval for the mean birthweight of Australian ba-
bies is 3276 ± 1.96 · 80g = (3120, 3432)g; a 99% confidence interval would be
3276 ± 2.58 · 80g = (3071, 3481)g. (Again,
remember that we are making the — not particularly realistic — assump-
tion that the observed birthweights are a random sample of all Australian
birthweights.) This is consistent with the observation that the Australian
birthweights would just barely pass a test for having the same mean as the
UK average 3426g, at the 0.05 significance level.
In fact, it is almost true to say that the symmetric 95% confidence in-
terval contains exactly the possible means µ0 such that the data would pass
a test at the 0.05 significance level for having mean equal to µ0 . What’s
the difference? It’s in how we compute the standard error. In computing
a confidence interval we estimate the parameters of the distribution from
the data. When we perform a statistical test, we take the parameters (as
far as possible) from the null hypothesis. In this case, that means that it
makes sense to test based on the presumption that the standard deviation of
weights is the SD of the UK births, which is the null hypothesis distribution.
In this case, this makes only a tiny difference between 538g (the UK SD)
and 528g (the SD of the Australian sample).
Lecture 9
The χ2 Test
In this lecture and many of the following ones we will learn about other
statistical tests, for testing other sorts of scientific claims. The procedure
will be largely the same: Formulate the claim in terms of the truth or falsity
of a null hypothesis, find an appropriate test statistic (according to the
principles enumerated above), and then judge the null hypothesis according
to whether the p-value you compute is high (good for the null!) or low (bad
for the null, good for the alternative!).
The Z test was used for testing whether the mean of quantitative data
could have a certain value. In this lecture we consider categorical data.
These don’t have a mean. We’re usually interested in some claim about the
distribution of the data among the categories. The most basic tool we have
for testing whether data we observe really could have come from a certain
distribution is called the χ2 test.
When is n large enough? The rule of thumb is that the expected number
in every category must be at least about 5. So what do we do if some of the
expected numbers are too small? Very simple: We group categories together
until the problem disappears. We will see examples of this in sections 9.4.1
and 9.4.2.
The χ2 distribution with d degrees of freedom is a continuous distribu-
tion1 with mean d and variance 2d. In Figure 9.1 we show the density of the
chi-squared distribution for some choices of the degrees of freedom. We note
that these distributions are always right-skewed, but the skew decreases as
d increases. For large d, the χ2 distribution becomes close to the normal
distribution with mean d and variance 2d.
As with the standard normal distribution, we rely on standard tables
with precomputed values for the χ2 distribution. We could simply have a
separate table for each number of degrees of freedom, and use these exactly
like the standard normal table for the Z test. This would take up quite
a bit of space, though. (Potentially infinite — but for large numbers of
degrees of freedom see section 9.2.2.) Alternatively, we could use a computer
programme that computes p-values for arbitrary values of X and d.f. (In the
R programming language the function pchisq does this.) This is an ideal
solution, except that you don’t have computers to use on your exams.
Instead, we rely on a traditional compromise approach, taking advantage
of the fact that the most common use of the tables is to find the critical value
for hypothesis testing at one of a few levels, such as 0.05 and 0.01.
[Figure 9.1: Densities of the χ² distribution with 1–7 degrees of freedom (upper panel), and with 5, 10, 20, and 30 degrees of freedom (lower panel).]
Figure 9.2: χ2 density with 5 degrees of freedom. The green region represents
1% of the total area. The red region represents a further 4% of the area, so
that the tail above 11.07 is 5% in total.
For example, if we were testing at the 0.01 level, with 60 d.f., we would
first look for 9950 on the standard normal table, finding that this corresponds
to z = 2.58. (Remember that in a two-tailed test at the 0.01 level, the
probability above z is 0.005.) We conclude that the critical value for χ2
with 60 d.f. at the 0.01 level is about
\[ 60 + 2.58\sqrt{120} = 88.26. \]
The exact value, given on the table, is 88.38. For larger values of d.f. we
simply rely on the approximation.
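The normal approximation to the χ² critical value can be packaged as a small helper (a sketch; the function name is ours, not from the text):

```python
import math

# Approximate critical value for chi-squared with d degrees of freedom,
# using chi2 ≈ N(d, 2d); z is the matching standard normal quantile.
def chi2_critical_approx(d, z):
    return d + z * math.sqrt(2 * d)

# The text's example: 60 d.f. at the 0.01 level, z = 2.58
approx = chi2_critical_approx(60, 2.58)
print(approx)  # about 88.26; the exact table value is 88.38
```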
Example: Suppose we suspect that a die is loaded, so that some sides are
more likely to come up. We roll it 60 times, and tabulate the number of times each side
comes up. The results are given in Table 9.3.
Side 1 2 3 4 5 6
Observed Frequency 16 15 4 6 14 5
Expected Frequency 10 10 10 10 10 10
It certainly appears that sides 1,2, and 5 come up more often than they
should, and sides 3, 4, and 6 less frequently. On the other hand, some
deviation is expected, due to chance. Are the deviations we see here too
extreme to be attributed to chance?
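The χ² computation for this table can be sketched directly (using the 0.05 critical value 11.07 for 5 degrees of freedom, shown in Figure 9.2):

```python
observed = [16, 15, 4, 6, 14, 5]
expected = [10] * 6  # fair die: 60 rolls spread over 6 sides

# Chi-squared statistic: sum of (observed - expected)^2 / expected
x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(x2, 1))  # 15.4

# 5 degrees of freedom; 0.05 critical value is 11.07, so we reject
print(x2 > 11.07)  # True
```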
Suppose we wish to test the null hypothesis that the die is fair: each side has probability 1/6 of coming up.
(How many numbers did we really observe? We observed six numbers (the frequency counts for the
six sides), but they had to add up to 60, so any five of them determine the sixth one. So
we really only observed five numbers, giving 5 degrees of freedom.)
Table 9.5: Observed frequency of birth months, England and Wales, 1993.
(3). Using these expected numbers and the observed numbers (the data)
compute the χ2 statistic.
(4). Compare the computed statistic to the critical value. Important: The
degrees of freedom are reduced by one for each parameter that has
been estimated.
We now compute
\[ X^2 = \frac{(16-10.6)^2}{10.6} + \frac{(7-14.1)^2}{14.1} + \frac{(3-9.4)^2}{9.4} + \frac{(4-6.0)^2}{6.0} = 11.35. \]
Suppose we want to test the null hypothesis at the 0.01 significance level.
In order to decide on a critical value, we need to know the correct number
of degrees of freedom. The reduced Table 9.7 has only 4 categories. There
would thus be 3 d.f., were it not for our having estimated a parameter to
decide on the distribution. This reduces the d.f. by one, leaving us with 2
degrees of freedom. Looking in the appropriate row, we see that the critical
value is 9.21, so we do reject the null hypothesis (the true p-value is 0.0034),
and conclude that the data did not come from a Poisson distribution.
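The computation above can be verified in a few lines, using the observed and expected counts from the reduced table:

```python
observed = [16, 7, 3, 4]
expected = [10.6, 14.1, 9.4, 6.0]

x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(x2, 2))  # 11.35

# 4 categories, minus 1, minus 1 estimated parameter = 2 d.f.;
# the critical value at the 0.01 level is 9.21
print(x2 > 9.21)  # True, so we reject the Poisson null hypothesis
```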
Table 9.8: Number of girls in 6115 families with 12 children: observed frequency, expected frequency, and fitted binomial probability.

# Girls       0       1       2       3       4       5       6
Frequency     7       45      181     478     829     1112    1343
Expected      2.3     26.1    132.8   410.0   854.2   1265.6  1367.3
Probability   0.0004  0.0043  0.0217  0.0670  0.1397  0.2070  0.2236

# Girls       7       8       9       10      11      12
Frequency     1033    670     286     104     24      3
Expected      1085.2  628.1   258.5   71.8    12.1    0.9
Probability   0.1775  0.1027  0.0423  0.0117  0.0020  0.0002
We can use a Chi-squared test to test the hypothesis that the data follow
a Binomial distribution.
Thus we can fit a Bin(12, 0.4808) distribution to the data to obtain the
expected frequencies (E) alongside the observed frequencies (O). The prob-
abilities are shown at the bottom of Table 9.8, and the expectations are
found by multiplying the probabilities by 6115. The first and last cate-
gories have expectations smaller than 5, so we absorb them into the next
categories, yielding Table 9.9.
Table 9.9: Modified version of Table 9.8, with small categories grouped
together.
# Girls 0, 1 2 3 4 5 6 7 8 9 10 11, 12
Frequency 52 181 478 829 1112 1343 1033 670 286 104 27
Expected 28.4 132.8 410.0 854.2 1265.6 1367.3 1085.2 628.1 258.5 71.8 13.0
Probability 0.0047 0.0217 0.0670 0.1397 0.2070 0.2236 0.1775 0.1027 0.0423 0.0117 0.0022
df = (k − 1) − p = (11 − 1) − 1 = 9
The test statistic lies well within the critical region, so we conclude that
there is significant evidence against the null hypothesis at the 5% level.
We conclude that the sex of newborns in families with 12 children is NOT
binomially distributed.
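The statistic for the grouped table can be computed directly; it lands far beyond the 0.05 critical value for 9 d.f., which is 16.92:

```python
observed = [52, 181, 478, 829, 1112, 1343, 1033, 670, 286, 104, 27]
expected = [28.4, 132.8, 410.0, 854.2, 1265.6, 1367.3,
            1085.2, 628.1, 258.5, 71.8, 13.0]

x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = (len(observed) - 1) - 1  # 11 categories, minus 1, minus 1 estimated p

# Critical value for 9 d.f. at the 0.05 level is 16.92
print(df, x2 > 16.92)  # 9 True
```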
(2). The p value really doesn’t seem to be the same between families. Some
families have a tendency to produce more boys, others more girls.
Note that (1) is consistent with our original hypothesis, that babies all have
the same probability p of being female: We have just pushed the variability
from small families to large ones. Think of it this way: Suppose there were
a rule that said: Stop when you have 3 children, unless the children are all
boys or all girls. Otherwise, keep trying to get a balanced family. Then
the small families would be more balanced than you would have expected,
and the big families more unbalanced — for instance, half of the four-child
families would have all boys or all girls. Of course, it’s more complicated
than that: Different parents have different ideas about the “ideal” family.
But this effect does seem to explain some of the deviation from the binomial
distribution.
The statistical analysis in [LA98] tries to pull these effects apart (and also
take into account the small effect of identical twins), finding that there is an
SD of about 0.16 in the value of p, the probability of a girl, and furthermore
that there is some evidence that some parents produce nothing but girls, or
at most have a very small probability of producing boys.
We can use a Chi-squared test to test the hypothesis that the data follow a
Normal distribution.
From the data we can estimate the mean and standard deviation using the
sample mean and standard deviation
x̄ = 172
s = 7.15
To fit a Normal distribution with this mean and variance we need to calculate
the probability of each interval. This is done in five straightforward steps:
(i). Calculate the upper end point of each interval (u)
(ii). Standardize the upper end points (z)
(iii). Calculate the probability P (Z < z)
(iv). Calculate the probability of each interval
(v). Calculate the expected cell counts
Height (cm) 155-160 161-166 167-172 173-178 179-184 185-190
Endpoint (u) 160.5 166.5 172.5 178.5 184.5 ∞
Standardized (z) -1.61 -0.77 0.07 0.91 1.75 ∞
P (Z < z) 0.054 0.221 0.528 0.818 0.960 1.00
P (a < Z < b) 0.054 0.167 0.307 0.290 0.142 0.040
Expected 5.4 16.7 30.7 29.0 14.2 4.0
Observed 5 17 38 25 9 6
From this table we see that there is one cell with an expected count less
than 5, so we group it together with the nearest cell. (A single cell with
expected count just under 5 is on the borderline; we could just leave it. We
certainly don’t want any cell with expected count less than about 2, nor
more than one expected count under 5.)
Height (cm)
155-160 161-166 167-172 173-178 179-190
Expected 5.4 16.7 30.7 29.0 18.2
Observed 5 17 38 25 15
The test statistic lies outside the critical region, so we conclude that the
evidence against the null hypothesis is not significant at the 0.05 level.
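The five fitting steps above can be sketched in code (assuming 100 observations, which is what the observed counts sum to; the normal CDF is computed from the error function):

```python
import math

def phi(z):
    # Standard normal CDF, via the error function
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

xbar, s, n = 172, 7.15, 100
uppers = [160.5, 166.5, 172.5, 178.5, 184.5, math.inf]

# (i)-(iii): standardise each upper endpoint and look up P(Z < z)
cum = [phi((u - xbar) / s) if u != math.inf else 1.0 for u in uppers]
# (iv): probability of each interval = difference of consecutive CDF values
probs = [c - p for c, p in zip(cum, [0.0] + cum[:-1])]
# (v): expected cell counts
expected = [n * p for p in probs]

print([round(e, 1) for e in expected])  # close to the table's values
```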
Table 9.10: Observed handedness counts by sex (NHANES data).

              Men   Women
right-handed  934   1070
left-handed   113     92
ambidextrous   20      8
Looking at the data, it looks as though the women are more likely to
be right-handed. Someone might come along and say: “The left cerebral
hemisphere controls the right side of the body, as well as rational thought.
This proves that women are more rational than men.” Someone else might
say, “This shows that women are under more pressure to conform to soci-
ety’s expectations of normality.” But before we consider this observation
as evidence of anything important, we have to pose the question: Does this
difference reflect anything more than chance variation in the sample?
Or, to put it differently, each person gets placed in one of the six cells of the
table. The null hypothesis says that which row you’re in is independent of
the column. This is a lot like the problems we had in section 9.4: Here the
family of distributions we’re interested in is all the distributions in which
the rows are independent of the columns. The procedure is essentially the
same:
(1). Estimate the parameters in the null distribution. This means we esti-
mate the probability of being in each row and each column.
(2). Compute the expected occupancy in each cell: This is the number of
observations times the row probability times the column probability.
(3). Using these expected numbers and the observed numbers (the data)
compute the χ2 statistic.
(4). Compare the computed statistic to the critical value. The number of
degrees of freedom is (r − 1)(c − 1), where r is the number of rows
and c the number of columns. (Why? The number of cells is rc. We
estimated r − 1 parameters to determine the row probabilities and c − 1
parameters to determine the column probabilities. So we have r + c − 2
parameters in all. By our standard formula, the degrees of freedom are
\[ (rc - 1) - (r + c - 2) = rc - r - c + 1 = (r-1)(c-1).) \]
We wish to test the null hypothesis at the 0.01 significance level. We ex-
tend the table to show the row and column fractions in Table 9.11. Thus, we
see that 89.6% of the sample were right-handed, and 47.7% were male. The
fraction that were right-handed males would be 0.896 × 0.477 = 0.427 under
the null hypothesis. Multiplying this by 2237, the total number of observa-
tions, we obtain 956, the expected number of right-handed men under the
null hypothesis. We repeat this computation for all six categories, obtaining
the results in Table 9.12. (Note that the row totals and the column totals
are identical to the original data.) We now compute the χ2 statistic, taking
the six “observed” counts from the black Table 9.10, and the six “expected”
counts from the red Table 9.12:
Men Women
right-handed 956 1048
left-handed 98 107
ambidextrous 13 15
Table 9.12: Expected counts for NHANES handedness data, computed from
fractions in Table 9.11.
What the test tells us is that we would most likely have found a difference in handed-
ness between men and women if we had surveyed the whole population.
Lecture 10
The t Distribution
for the (unknown) value of σ. But S is only an estimate for σ: it’s a random
variable that might be too high, and might be too low. So, what we were
calling Z is not really Z, but a quantity that we should give another name
to:
\[ T := \frac{\bar{X} - \mu}{S/\sqrt{n}}. \tag{10.1} \]
If S is too big, then Z > T , and if S is too small then Z < T . On average,
you might suppose, Z and T would be about the same — and, in this you
would be right. Does the distinction matter then?
Since T has an extra source of error in the denominator, you would ex-
pect it to be more widely scattered than Z. That means that if you compute
T from the data, but look it up on a table computed from the distribution
of Z — the standard normal distribution — you would underestimate the
probability of a large value. The probability of rejecting a true null hypothe-
sis (Type I error) will be larger than you thought it was, and the confidence
intervals that you compute will be too narrow. This is very bad! If we
make an error, we always want it to be on the side of underestimating our
confidence.
Fortunately, we can compute the distribution of T (sometimes called
“Student’s t”, after the pseudonym under which statistician William Gossett
published his first paper on the subject, in 1908). While the mathematics
behind this is beyond the scope of this course, the results can be found in
tables. These are a bit more complicated than the normal tables, because
there is an extra parameter: Not surprisingly, the distribution depends on
the number of samples. When the estimate is based on very few samples
(so that the estimate of SD is particularly uncertain) we have a distribu-
tion which is far more spread out than the normal. When the number of
samples is very large, the estimate s varies hardly at all from σ, and the
corresponding t distribution is very close to normal. As with the χ2 distri-
bution, this parameter is called “degrees of freedom”. For the T statistic,
the number of degrees of freedom is just n − 1, where n is the number of
samples being averaged. Figure 10.1 shows the density of the t distribution
for different degrees of freedom, together with that of the normal. Note that
the t distribution is symmetric around 0, just like the normal distribution.
Table 10.1 gives the critical values for a level 0.05 hypothesis test when
Z is replaced by t with different numbers of degrees of freedom. In other
words, if we define tα(d) to be the number such that P(T < tα(d)) = α when
T has the Student distribution with d degrees of freedom, Table 10.1(a)
gives values of t0.95, and Table 10.1(b) gives values of t0.975. Note that the
values of tα(d) decrease as d increases, approaching the limiting value
zα = tα(∞).
[Plot: t densities for 1, 2, 4, 10, and 25 d.f., together with the standard normal curve.]
Figure 10.1: The standard normal density together with densities for the t
distribution with different degrees of freedom.
Table 10.1: Cutoffs for hypothesis tests at the 0.95 level, using the t statistic
with different degrees of freedom. The ∞ level is the limit for a very large
number of degrees of freedom, which is identical to the distribution of the Z
statistic.
with the t statistic, we follow the same procedures as in section 7.1, substi-
tuting s for σ, and the quantiles of the t distribution for the quantiles of the
normal distribution: that is, where we looked up a number z on the normal
table, such that P (Z < z) was a certain probability, we substitute a number
t such that P (T < t) is that same probability, where T has the Student T
distribution with the right number of degrees of freedom. Thus, if we want
a 95% confidence interval, we take
\[ \bar{X} \pm t \times \frac{s}{\sqrt{n}}, \]
where t is found in the column marked “P = 0.05” on the T-distribution
table — 0.05 being the probability above t that we are excluding. It corre-
sponds, of course, to P (T < t) = 0.975.
The probability we are looking for is 0.01, which is the last column of
the table, so looking in the row for 5 d.f. (see Figure 10.2) we see that
the appropriate value of t is 4.03. Thus, we can be 99% confident that the
patient’s true average phosphate level is between 4.3mg/dl and 6.5mg/dl.
Note that if we had known the SD for the measurements to be 0.67, instead
of having estimated it from the observations, we would have used z = 2.6
(corresponding to a one-sided probability of 0.995) in place of t = 4.03,
yielding a much narrower confidence interval.
Summary
If you want to compute an α × 100% confidence interval for the population
mean of a normally distributed population based on n samples you do the
following:
(1). Compute the sample mean x̄.
(2). Compute the sample standard deviation s.
(3). Look on the table to find the number t in the row corresponding to
n − 1 degrees of freedom and the column corresponding to P = 1 − α.
(4). The confidence interval is from x̄ − st/√n to x̄ + st/√n. In other
words, we are α × 100% confident that µ is in this range.
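The phosphate example above can be reproduced with these steps (a sketch: n = 6 measurements, so 5 d.f.; s = 0.67 as stated in the text; t = 4.03 from the table row for 5 d.f., P = 0.01 column; the sample mean 5.4 mg/dl is inferred as the midpoint of the interval reported in the text):

```python
import math

# Phosphate example: n = 6, xbar = 5.4 (inferred midpoint), s = 0.67,
# and t = 4.03 from the 5 d.f. row of the t table (P = 0.01 column)
n, xbar, s, t = 6, 5.4, 0.67, 4.03

half_width = t * s / math.sqrt(n)
lo, hi = xbar - half_width, xbar + half_width
print(round(lo, 1), round(hi, 1))  # 4.3 6.5
```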
\[ \sigma_x^2 = \mathrm{Var}(x) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 \]
Why is it, then, that we estimate variance and SD by using a sample variance
and sample SD in which n in the denominator is replaced by n − 1?:
\[ s_x^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2. \]
\[ \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2 \]
\begin{align*}
\sigma_X^2 &= \frac{1}{2}\left[(X_1 - \bar{X})^2 + (X_2 - \bar{X})^2\right] \\
&= \left(\frac{X_1 - X_2}{2}\right)^2 \\
&= \frac{1}{4}\bigl((X_1 - \mu) + (\mu - X_2)\bigr)^2 \\
&= \frac{1}{4}\left[(X_1 - \mu)^2 + (\mu - X_2)^2 + 2(X_1 - \mu)(\mu - X_2)\right].
\end{align*}
How big is this on average? The first two terms in the brackets will average
to σ 2 (the technical term is, their expectation is σ 2 ), while the last term
averages to 0. The total averages then to just σ 2 /2.
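We can check this by simulation (a sketch with µ = 0 and σ = 1, so the n-divisor estimate should average to σ²/2 = 0.5 while the (n−1)-divisor estimate averages to σ² = 1):

```python
import random

random.seed(2)

# Pairs (n = 2) drawn from N(0, 1)
trials = 200000
biased = 0.0    # running total of the divide-by-n estimate
unbiased = 0.0  # running total of the divide-by-(n-1) estimate
for _ in range(trials):
    x1, x2 = random.gauss(0, 1), random.gauss(0, 1)
    xbar = (x1 + x2) / 2
    ss = (x1 - xbar) ** 2 + (x2 - xbar) ** 2
    biased += ss / 2    # divide by n = 2
    unbiased += ss / 1  # divide by n - 1 = 1

print(biased / trials)    # about 0.5 = sigma^2 / 2
print(unbiased / trials)  # about 1.0 = sigma^2
```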
men in total, and we have sampled only 1 man out of 100,000. Indeed, this
is true, but the effect vanishes quite quickly as the size of the population
grows.
Suppose we have a box with N cards in it, each of which has a number,
and we sample n cards without replacement, drawing numbers X1 , . . . , Xn .
Suppose that the cards in the box were themselves drawn from a normal
distribution with variance σ 2 , and let µ be the population mean — that is,
the mean of the numbers in the box. The sample mean X̄ is still normally
distributed, with expectation µ, so the only question now is to determine
the standard error. Call this standard error SENR (NR=no replacement),
and the SE computed earlier SEWR (WR=with replacement. It turns out
that the standard error is precisely
\[ SE_{NR} = SE_{WR}\,\sqrt{1 - \frac{n-1}{N-1}} = \frac{\sigma}{\sqrt{n}}\sqrt{1 - \frac{n-1}{N-1}}. \tag{10.2} \]
Thus, if we had sampled 199 out of 300, the SE (and hence also the width of
all our confidence intervals) would be multiplied by a factor of √(101/299) =
0.58, so would be barely half as large. If the whole population is 1000, so
that we have sampled 1 out of 5, the correction factor has gone up to 0.89,
so the correction is only by about 10%. And if the population is 10,000, the
correction factor is 0.99, which is already negligible for nearly all purposes.
Thus, if the 199 married men had been sampled from a town with just
300 married men, the 95% confidence interval for the average height of mar-
ried men in the town would be 1732mm±2·0.58·4.9mm = 1732mm±5.7mm,
so about (1726, 1738)mm, instead of the 95% confidence interval computed
earlier for sampling with replacement, which was (1722, 1742)mm.
The size of the sample matters far more than the size of the pop-
ulation (unless you are sampling a large fraction of the population
without replacement).
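The correction factor from equation (10.2) can be computed for the text's examples (the helper name is ours):

```python
import math

# Finite-population correction factor sqrt(1 - (n-1)/(N-1)),
# from equation (10.2)
def fpc(n, N):
    return math.sqrt(1 - (n - 1) / (N - 1))

# Sampling n = 199 from populations of various sizes N
for N in (300, 1000, 10000):
    print(N, fpc(199, N))  # roughly 0.58, 0.90, 0.99
```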
Suppose your metre stick is actually 101 cm long, though. Then all of your
measurements will start out about 1% too short before you add random
error on to them. Taking more measurements will not get you closer to the
true value, but rather to the ideal biased measure. The important lesson is:
Statistical analysis helps us to estimate the extent of random error. Bias
remains.
Of course, statisticians are very concerned with understanding the sources
of bias; but bias is very subject-specific. The bias that comes in conducting
a survey is very different from the bias that comes from measuring the speed
of blink reflexes in a psychology experiment.
Selection bias
The distribution you sample from may actually differ from the distribution
you thought you were sampling from. In the simplest (and most common)
type of survey, you mean to be doing a so-called “simple random sample” of
the population: Each individual has the same chance of being in the sample.
It’s easy to see how this assumption could go wrong. How do you pick a
random set of 1000 people from all 60 million people in Britain? Do you
dial a random telephone number? But some people have multiple telephone
lines, while others have none. And what about mobiles? Some people are
home more than others, so are more likely to be available when you call.
And so on. All of these factors can bias a survey. If you survey people on
the street, you get only the people who are out and about.
Early in the 20th century, it was thought that surveys needed to be huge
to be accurate. The larger the better. Then some innovative pollsters, like
George Gallup in the US, realised that for a given amount of effort, you
would get a better picture of the true population distribution by taking
a smaller sample, but putting more effort into making sure it was a good
random sample. More random error, but less bias. The biggest advantage
is that you can compute how large the random error is likely to be, whereas
bias is almost entirely unknowable.
In section 7.1 we computed confidence intervals for the heights of British
men (in 1980) on the basis of 199 samples from the OPCS survey. In fact,
the description we gave of the data was somewhat misleading in one respect:
The data set we used actually gave the paired heights of husbands and wives.
Why does this matter? This sample is potentially biased because the only
men included are married. It is not inconceivable that unmarried men have
different average height from married men. In fact, the results from the
complete OPCS sample are available [RSKG85], and the average height was
found to be 1739mm, which is slightly higher than the average height for
married men that we found in our selective subsample, but still within the
95% confidence interval.
The most extreme cases of selection bias arise when the sample is self-
selected. For instance, if you look on a web site for a camera you’re inter-
ested in, and see that 27 buyers said it was good and 15 said it was bad,
what can you infer about the true percentage of buyers who were satisfied?
Essentially nothing. We don’t know what motivated those particular people
to make their comments, or how they relate to the thousands of buyers who
didn’t comment (or commented on another web site).
Non-response bias
If you’re calling people at home to survey their opinions about something,
they might not want to speak to you — and the people who speak to you
may be different in important respects from the people who don’t speak
to you. If you distribute a questionnaire, some will send it back and some
won’t. Again, the two groups may not be the same.
As an example, consider the health questionnaire that was mailed to
a random sample of 6009 residents of Somerset Health District in 1992.
The questionnaire consisted of 43 questions covering smoking habits, eating
patterns, alcohol use, physical activity, previous medical history and demo-
graphic and socio-economic details. 57.6% of the surveys were returned,
and on this basis the health authorities could estimate, for instance, that
24.2% of the population were current smokers, or that 44.3% engage in “no
moderate or vigorous activity”. You might suspect something was wrong
when you see that 45% of the respondents were male — as compared with
just under 50% of the population of Somerset (known from the census).
Response bias
Sometimes subjects don’t respond. Sometimes they do, but they don’t tell
the truth. “Response bias” is the name statisticians give to subjects giving
an answer that they think is more acceptable than the true answer. For
instance, one 1973 study [LSS73] asked women to express their opinion (from
1=strongly agree to 5=strongly disagree) to “feminist” or “anti-feminist”
statements. When the interviewer was a woman, the average response to the
statement “The woman’s place is in the home” was 3.09 — essentially neutral
— but this shifted to clear disagreement (3.80) when the interviewer was a
man. Similarly for “Motherhood and a career should not be mixed” (2.96
as against 3.62). On the other hand, those interviewed by women averaged
close to neutral (2.78) on the statement “A completely liberalized abortion
law is right,” whereas those interviewed by men were close to unanimous
strong agreement (1.31 average).
On a similar note, in the US presidential election of the preceding 2 November,
just over 60% of registered voters cast ballots; when Gallup polled the pub-
lic less than three weeks later, 80% said they had voted. Another well-known
anomaly is the difference between heterosexual men and women in the num-
ber of lifetime sexual partners they report (which logically must be the same,
on average). [BS]
$Amount 0 1 5 10 15 20 25 30 35 40 50 55 75
N 3 6 7 19 1 20 22 7 1 4 28 1 1
$Amount 78 79 80 100 120 150 151 180 200 250 300 325
N 1 1 0 73 1 4 2 1 15 5 8 0
$Amount 400 500 550 750 1000 1100 2000 2500 5000 9999+
N 1 9 1 0 4 1 1 1 3 2
Table 10.3: The amounts claimed to have been donated to tsunami relief by
254 respondents to a Gallup survey in January 2005, out of 1008 queried.
Ascertainment bias
You analyse the data you have, but don’t know which data you never got to
observe. This can be a particular problem in a public health context, where
you only get reports on the illnesses serious enough that people seek medical
treatment. Before the recent outbreak of swine flu, a novel bird flu was
making public health experts nervous as it spread
through the world from its site of origin in East Asia. While it mainly affects
waterfowl, occasional human cases have occurred. Horrific mortality rates,
on the order of 50%, have been reported. Thorson et al. [TPCE06] pointed
out, though, that most people with mild cases of flu never come to the
attention of the medical system, particularly in poor rural areas of Vietnam
and China, where the disease is most prevalent. They found evidence of a
high rate of “flulike illness” associated with contact with poultry among the
rural population in Vietnam. Quite likely, then, many of these illnesses were
mild cases of the avian influenza. Mortality rate is the probability of cases of
disease resulting in death, which we estimate from the fraction of observed
cases resulting in death. In this case, though, the sample of observed cases
was biased: A severe case of flu was more likely to be observed than a
mild case. Thus, while the fraction of observed cases resulting in death was
quite high — in Vietnam 2003–5 there were 87 confirmed cases, of which
38 resulted in death — this likely does not reflect accurately the fraction of
deaths among all cases in the population. In all likelihood, the 38 includes
nearly all the deaths, but the 87 represents only a small fraction of the cases.
During World War II the statistician Abraham Wald worked with the US
air force to analyse the data on damage to military airplanes from enemy fire.
The question: Given the patterns of damage that we observe, what would
be the best place to put extra armour plating to protect the aircraft? (You
can’t put too much armour on, because it makes the aircraft too heavy.) His
answer: Put armour in places where you never see a bullet hole. Why? The
bullet holes you see are on the planes that made it back. If you never see
a bullet hole in some part of the plane, that’s probably because the planes
that were hit there didn’t make it back. (His answer was more complicated
than that, of course, and involved some careful statistical calculations. For
a discussion, see [MS84].)
Lecture 11
Comparing Distributions
where σX is the standard deviation for the X variable (the height of early-
marrieds) and σY is the standard deviation for the Y variable (the height of
late-marrieds); nX and nY are the corresponding numbers of samples. This
gives us the standard formula:
\[ SE_{\bar{X}-\bar{Y}} = \sqrt{\frac{\sigma_X^2}{n_X} + \frac{\sigma_Y^2}{n_Y}} = \sigma\sqrt{\frac{1}{n_X} + \frac{1}{n_Y}} \quad \text{if } \sigma_X = \sigma_Y. \]
Formula 11.1: Standard error for the difference between two normally dis-
tributed variables
H0 : µ X = µY ,
\[ Z = \frac{\bar{X} - \bar{Y}}{SE_{\bar{X}-\bar{Y}}} = \frac{19\text{mm}}{14\text{mm}} = 1.4. \]
If we were testing at the 0.05 level, we would not reject the null hypothesis,
since we would reject only values of Z bigger than 1.96 (in absolute value).
Even testing at the 0.10 level we would not reject the null, since the cutoff is 1.64.
Our conclusion is that the difference in heights between the early-married
and late-married groups is not statistically significant. Notice that this is
precisely equivalent to our previous observation that the symmetric 95%
confidence interval includes 0.
If we wish to test H0 against the alternative hypothesis µX > µY , we are
performing a one-sided test: We use the same test statistic Z, but we reject
values of Z which correspond to large values of µX − µY , so large positive
values of Z. Large negative values of Z, while they are unlikely for the null
hypothesis, are even more unlikely for the alternative. The cutoff for testing
at the 0.05 level is z0.95 = 1.64. Thus, we do not reject the null hypothesis.
SEPu −Pc = σ̂ √(1/nu + 1/nc ) = 0.326 × √(1/54 + 1/70) = 0.059.

Z = (pu − pc )/SE = −0.083/0.059 = −1.41.
The cutoff for rejecting Z at the 0.05 level is 1.96. Since |Z| = 1.41 is
smaller than this, we do not reject the null hypothesis that the infection
rates are in fact equal. The difference in infection rates is not statistically
significant, as we cannot be confident that the difference is not simply due
to chance.
where t is the number such that 95% of the probability in the appropriate
T distribution is between −t and t (that is, the number in the P = 0.05
column of your table.)
The only problem is that we don’t know what σ is. We have our sample
SDs sx and sy , each of which should be approximately σ. The bigger the
sample, the better the approximation should be. This leads us to the pooled
sample variance s2p , which simply averages these estimates, counting the
bigger sample more heavily:
Table 11.2: Data from the Suddath [RS90] schizophrenia experiment. Hip-
pocampus volumes in cm3 .
Unaffected : 1.94,1.44, 1.56, 1.58, 2.06, 1.66, 1.75, 1.77, 1.78, 1.92,
1.25, 1.93, 2.04, 1.62, 2.08;
Schizophrenic : 1.27,1.63, 1.47, 1.39, 1.93, 1.26, 1.71, 1.67, 1.28, 1.85,
1.02, 1.34, 2.02, 1.59, 1.97.
T = (X̄ − Ȳ)/SE = 0.20/0.099 = 2.02.
We then observe that this is not above the critical value 2.05, so we RETAIN
the null hypothesis, and say that the difference is not statistically significant.
If we had decided in advance that we were only interested in whether
µx > µy — the one-tailed alternative — we would use the same test statistic
T = 2.02, but now we draw our critical value from the P = 0.10 column,
which gives us 1.70. In this case, we would reject the null hypothesis. On the
other hand, if we had decided in advance that our alternative was µx < µy ,
we would have a critical value −1.70, with rejection region anything below
that, so of course we would retain the null hypothesis.
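The pooled two-sample computation can be checked numerically. The following sketch (plain Python, standard library only) reproduces the T statistic for the data of Table 11.2; the small discrepancy from the 2.02 in the text comes from rounding the intermediate values there.

```python
from statistics import mean, stdev

# Hippocampus volumes (cm^3) from Table 11.2
unaffected = [1.94, 1.44, 1.56, 1.58, 2.06, 1.66, 1.75, 1.77, 1.78, 1.92,
              1.25, 1.93, 2.04, 1.62, 2.08]
schizophrenic = [1.27, 1.63, 1.47, 1.39, 1.93, 1.26, 1.71, 1.67, 1.28, 1.85,
                 1.02, 1.34, 2.02, 1.59, 1.97]

nx, ny = len(unaffected), len(schizophrenic)
sx, sy = stdev(unaffected), stdev(schizophrenic)

# Pooled sample variance: average of the two sample variances,
# weighted by their degrees of freedom
sp2 = ((nx - 1) * sx**2 + (ny - 1) * sy**2) / (nx + ny - 2)
se = sp2**0.5 * (1/nx + 1/ny)**0.5

t = (mean(unaffected) - mean(schizophrenic)) / se
print(round(se, 3), round(t, 2))
```

The standard error comes out near the 0.099 used above, and T near 2.0.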
Table 11.3: Data from the Suddath [RS90] schizophrenia experiment. Hip-
pocampus volumes in cm3 .
cards, each of which has an X and a Y side, with numbers on each, and you
are trying to determine from the sample whether X and Y have the same
average. You could write down your sample of X’s, then turn over all the
cards and write down your sample of Y’s, and compare the means. If the
X and Y numbers tend to vary together, though, it makes more sense to
look at the differences X − Y over the cards, rather than throw away the
information about which X goes with which Y. If the X’s and Y’s are not
actually related to each other then it shouldn’t matter.
the one with the same mean and variance as this sample — and take 1000
numbers evenly spaced from the normal distribution, and plot them against
each other. If the sample really came from the normal distribution, then the
two should be about equal, so the points will all lie on the main diagonal.
Figure 11.2(c) shows a Q-Q plot for the original 15 samples, which clearly
do not fit the normal distribution very well.
[Figure: (a) Histogram of 1000 resampled means; (b) Q-Q plot of 1000 resampled
means; (c) Q-Q plot of the 15 original measurements. Vertical axes: Frequency
and Sample Quantiles.]
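The construction described above — sort the sample, then pair each order statistic with the corresponding quantile of the fitted normal — can be sketched in a few lines. The sample here is simulated for illustration; it is not the data in the figure.

```python
from statistics import NormalDist, mean, stdev
import random

random.seed(1)
sample = [random.gauss(0, 1) for _ in range(200)]

# Reference normal with the sample's own mean and SD
ref = NormalDist(mean(sample), stdev(sample))
n = len(sample)

# Pair the i-th smallest observation with the (i + 0.5)/n quantile of ref
pairs = [(ref.inv_cdf((i + 0.5) / n), x)
         for i, x in enumerate(sorted(sample))]

# If the sample really is normal, the points hug the main diagonal
max_gap = max(abs(q - x) for q, x in pairs)
print(round(max_gap, 2))
```

Plotting the pairs (theoretical quantile on one axis, sample quantile on the other) gives the Q-Q plot; for a genuinely normal sample the maximum gap stays small.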
Let us first apply a Z test at the 0.05 significance level, without thinking
too deeply about what it means. (Because of the normal approximation,
it doesn’t really matter what the underlying distribution of weight loss is.)
We compute first the pooled sample variance:
• The sample of B’s is not an independent sample; it’s just the com-
plement of the sample of A’s. A bit of thought makes clear that this
tends to make the SE of the difference larger.
This turns out to be one of those cases where two wrongs really do make
a right. These two errors work in opposite directions, and pretty much
cancel each other out. Consequently, in analysing experiments we ignore
these complications and proceed with the Z- or t-test as in section 11.5.1,
as though they were independent samples.
Table 11.4
Yes No
Question A 161 22
Question B 92 54
any population, and we have no idea how representative they may or may
not be of the larger category of Homo sapiens. The real question is, among
these 383 people, how likely is it that we would have found a different result
had we by chance selected a different group of 200 people to pose question
B to. We want to do a significance test at the 0.01 level.
The model is then: 383 cards in a box. On one side is that person’s an-
swer to Question A, on the other side the same person’s answer to Question
B (coded as 1=yes, 0=no). The null hypothesis is that the average on the
A side is the same as the average on the B side (which includes the more
specific hypothesis that the A’s and the B’s are identical).
We pick 183 cards at random, and add up their side A’s, coming to 161;
from the other 200 we add up the side B’s, coming to 92. Our procedure is
then:
(1). The average of the sampled side A’s is X̄A = 0.88, while the average
of the sampled side B’s is X̄B = 0.46.
(2). The standard deviation of the A sides is estimated at σA = √(p(1 − p)) =
0.32, while the standard deviation of the B sides is estimated at
σB = √(p(1 − p)) = 0.50.
(3). The standard error of the difference is then SEA−B = √(σA²/183 + σB²/200) =
0.043.
(4). Z = (X̄A − X̄B )/SEA−B = 9.77. The cutoff for a two-sided test at the
0.01 level is z0.995 = 2.58, so we clearly do reject the null hypothesis.
The conclusion is that the difference in answers between the two ques-
tions was not due to the random sampling. Again, this tells us nothing
directly about the larger population from which these 383 individuals were
sampled.
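The steps above translate directly into code. This sketch recomputes Z from the raw counts; it comes out near 9.8 rather than exactly 9.77, since the text works with rounded intermediate values.

```python
from math import sqrt

nA, nB = 183, 200      # cards sampled for each question
yesA, yesB = 161, 92   # "yes" answers observed

pA = yesA / nA         # about 0.88
pB = yesB / nB         # 0.46

# Estimated SDs of the 0/1 responses on each side
sdA = sqrt(pA * (1 - pA))
sdB = sqrt(pB * (1 - pB))

# Standard error of the difference in proportions
se = sqrt(sdA**2 / nA + sdB**2 / nB)

z = (pA - pB) / se
print(round(z, 2))
```

Either way, Z is far beyond the cutoff of 2.58, so the conclusion is unchanged.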
Lecture 12
Non-Parametric Tests I
Table 12.1: Age (in months) at which infants were first able to walk inde-
pendently. Data from [ZZK72].
As we said then, the Treatment numbers seem generally smaller than the
Control numbers, but not entirely, and the number of observations is small.
Could we merely be observing sampling variation, where we happened to get
six (five, actually) early walkers in the Treatment group, and late walkers
in the Control group?
Following the approach of Lecture 11, we might perform a two-sample
T test for equality of means. We test the null hypothesis µT REAT = µCON
against the one-tailed alternative µT REAT < µCON , at the 0.05 level. To
find the critical value, we look in the column for P = 0.10, with 6 + 6 − 2
d.f., obtaining 1.81. The critical region is then {T < −1.81}. The relevant
summary statistics are given in Table 12.2. We compute the pooled sample
variance

sp = √( ((6 − 1)·1.45² + (6 − 1)·1.52²) / (6 + 6 − 2) ) = 1.48,

so the standard error is

SE = sp √(1/6 + 1/6) = 0.85.

We have then the T statistic

T = (X̄ − Ȳ)/SE = −1.6/0.85 = −1.85.
So we reject the null hypothesis, and say that the difference between the
two groups is statistically significant.
Table 12.2: Summary statistics for the data of Table 12.1.

            Mean   SD
Treatment   10.1   1.45
Control     11.7   1.52
[Figure 12.1: density curve; x-axis Time (months) from 8 to 14, y-axis Density.]
Figure 12.1: Sketch of a distribution from which the walking-time data of
Table 12.1 might have been drawn, if they all came from the same
distribution. The actual measurements are shown as a rug plot
the same distribution. The actual measurements are shown as a rug plot
along the bottom — green for Treatment, red for Control. The marks have
been adjusted slightly to avoid exact overlaps.
[Figure: Histogram of simulated T statistics; x-axis Simulated T from −5 to 5,
y-axis Frequency.]
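A histogram of simulated T values like this one can be regenerated by simulation: draw two samples of six from one common distribution, compute the pooled-variance T statistic, and repeat. The particular normal distribution and the number of repetitions below are arbitrary illustrative choices.

```python
import random
from statistics import mean, stdev

def pooled_t(x, y):
    """Two-sample T statistic with pooled sample variance."""
    nx, ny = len(x), len(y)
    sp2 = ((nx - 1) * stdev(x)**2 + (ny - 1) * stdev(y)**2) / (nx + ny - 2)
    se = sp2**0.5 * (1/nx + 1/ny)**0.5
    return (mean(x) - mean(y)) / se

random.seed(0)
sims = []
for _ in range(20000):
    combined = [random.gauss(11, 1.5) for _ in range(12)]  # one common distribution
    sims.append(pooled_t(combined[:6], combined[6:]))

# Under H0 the statistic follows the t distribution with 10 d.f., so
# about 5% of simulated values should exceed 2.228 in absolute value
tail = sum(abs(t) > 2.228 for t in sims) / len(sims)
print(round(mean(sims), 2), round(tail, 3))
```

The simulated distribution is centred at 0, and close to 5% of the values fall beyond the two-tailed t cutoff, as the theory predicts.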
H0 : mx = my    against    Halt : mx ≠ my ,

or a one-tailed alternative.
The idea of this test is straightforward: Let M be the median of the com-
bined sample {x1 , . . . , xnx , y1 , . . . , yny }. If the medians are the same, then
the x’s and the y’s should have an equal chance of being above M . Let Px
be the proportion of x’s that are above M , and Py the proportion of y’s that
are above M . It turns out that we can treat these as though they were the
proportions of successes in nx and ny trials respectively. Analysing these
results is not entirely straightforward; we get a reasonable approximation
by using the Z test for differences between proportions, as in section 11.3.
Consider the case of the infant walking study, described in section 1.1.
The 12 measurements are
9.0, 9.0, 9.5, 9.5, 9.75, 10.0, 11.5, 11.5, 12.0, 13.0, 13.0, 13.25,
where the CONTROL results have been coloured RED, and the TREATMENT
results have been coloured GREEN. The median is 10.75, and we
see that there are 5 control results above the median, and one treatment.
We pick 6 balls from 12, where 6 were red and 6 green. We want P(at
least 5 red). Since the null hypothesis says that all picks are equally likely,
this is simply the fraction of ways that we could make our picks which
happen to have 5 or 6 red. That is,
P(at least 5 R) = (# ways to pick 5 R, 1 G + # ways to pick 6 R, 0 G)
                  / (total # ways to pick 6 balls from 12).
The number of ways to pick 6 balls from 12 is what we call 12 C6 = 924. The
number of ways to pick 6 red and 0 green is just 1: We have to take all the
reds, we have no choice. The only slightly tricky one is # ways to pick 5 red
and 1 green. A little thought shows that we have 6 C5 = 6 ways of choosing
the red balls, and 6 C1 = 6 ways of choosing the green one, so 36 ways in all.
Thus, the p-value comes out to 37/924 = 0.040, so we still reject the null
hypothesis at the 0.05 level. (Of course, the p-value for a two-tailed test is
twice this, or 0.080.)
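The counting argument can be checked directly with the standard library's math.comb:

```python
from math import comb

total = comb(12, 6)                    # 924 ways to choose 6 balls from 12
ways_5r = comb(6, 5) * comb(6, 1)      # 36 ways: 5 red and 1 green
ways_6r = comb(6, 6) * comb(6, 0)      # 1 way: all 6 red

p = (ways_5r + ways_6r) / total        # exact one-tailed p-value
print(total, ways_5r + ways_6r, round(p, 3))   # 924 37 0.04
```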
[Two bar charts: Probability (0.00 to 0.35) against Fraction red and Fraction
green, from 0/6 through 6/6.]
Figure 12.3: The exact probabilities from the binomial distribution for the
extreme results of number of red (control) and green (treatment) in the
infant walking experiment. The corresponding normal approximations are
shaded. Note that the upper tail starts at 4.5/6 = 0.75, not at 5/6; and the
lower tail starts at 1.5/6 = 0.25, rather than at 1/6.
There are many defects of the median test. One of them is that the
results are discrete — there are at most n/2 + 1 different possible outcomes
to the test — while the analysis with Z is implicitly continuous. This is one
of the many reasons why the median test, while it is sometimes seen, is not
recommended. (For more about this, see [FG00].) The rank-sum test is
almost always preferred.
Note that this method requires that the observations be all distinct.
There is a version of the median test that can be used when there are ties
among the observations, but we do not discuss it in this course.
not how far above or below. In the example of section 12.2, while 5 of the
6 treatment samples are below the median, the one that is above the median is
near the top of the whole sample; and the one control sample that is below
the median is in fact near the bottom. It seems clear that we should want
to take this extra information into account. The idea of the rank-sum test
(also called the Mann-Whitney test) is that we consider not just yes/no,
above/below the median, but the exact relative ranking.
Continuing with this example, we list all 12 measurements in order, and
replace them by their ranks:
measurements   9.0  9.0  9.5  9.5  9.75 10.0 11.5 11.5 12.0 13.0 13.0 13.25
ranks          1    2    3    4    5    6    7    8    9    10   11   12
modified ranks 1.5  1.5  3.5  3.5  5    6    7.5  7.5  9    10.5 10.5 12
When measurements are tied, we average the ranks (we show this in the
column labelled “modified ranks”.) We wish to test the null hypothesis
H0 : control and treatment came from the same distribution; against the
alternative hypothesis that the controls are generally larger.
We compute a test statistic R, which is just the sum of the ranks in
the smaller sample. (In this case, the two samples have the same size, so
we can take either one. We will take the treatment sample.) The idea is
that these should, if H0 is true, be like a random sample from the numbers
1, . . . , nx + ny . If R is too big or too small we take this as evidence to reject
H0 . In the one-tailed case, we reject R for being too small (if the alternative
hypothesis is that the corresponding group has smaller values) or for being
too large (if the alternative hypothesis is that the corresponding group has
larger values).
In this case, the alternative hypothesis is that the group under con-
sideration, the treatment group has smaller values, so the rejection region
consists of R below a certain threshold. It only remains to find the appro-
priate threshold. These are given on the Mann-Whitney table (Table 5 in
the formula booklet). The layout of this table is somewhat complicated.
The table lists critical values corresponding only to P = 0.05 and P = 0.10.
We look in the row corresponding to the size of the smaller sample, and
column corresponding to the larger. For a two-tailed test we look in the
(sub-)row corresponding to the desired significance level; for a one-tailed
test we use the sub-row corresponding to double the significance level.
The sum of the ranks for the treatment group in our example is R = 30.
Since we are performing a one-tailed test with the alternative being that
the treatment values are smaller, our rejection region will be of the form
220 Non-Parametric Tests I
R ≤ some critical value. We find the critical value on Table 12.4. The
values corresponding to two samples of size 6 have been highlighted. For a
one-tailed test at the 0.05 level we take the upper values 28, 50. Hence, we
would reject R ≤ 28. Since R = 30, we retain the null hypothesis in this
test.
If we were performing a two-tailed test instead, we would reject R ≤ 26
and R ≥ 52.
The table you are given goes up only as far as the larger sample size
equal to 10. For larger samples, we use a normal approximation:
z = (R − µ)/σ,    where

µ = ½ nx (nx + ny + 1),

σ = √( nx ny (nx + ny + 1) / 12 ).
As usual, we compare this z to the probabilities on the normal table. Thus,
for instance, for a two-tailed test at the 0.05 level, we reject the null hypoth-
esis if |z| > 1.96.
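The normal approximation is easy to compute. Purely for illustration (with samples this small one should really use the exact table), here it is applied to the walking data, with nx = ny = 6 and R = 30:

```python
from math import sqrt

def rank_sum_z(r, nx, ny):
    """Normal approximation to the rank-sum statistic R of the size-nx sample."""
    mu = nx * (nx + ny + 1) / 2
    sigma = sqrt(nx * ny * (nx + ny + 1) / 12)
    return (r - mu) / sigma

z = rank_sum_z(30, 6, 6)
print(round(z, 2))   # (30 - 39)/sqrt(39), about -1.44
```

Here |z| is well below 1.64, the one-tailed cutoff at the 0.05 level, agreeing with the decision to retain the null hypothesis.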
As with the t test, when the data fall naturally into matched pairs, we can
improve the power of the test by taking this into account. We are given data
in pairs (x1 , y1 ), . . . , (xn , yn ), and we wish to test the null hypothesis H0 : x
and y come from the same distribution. In fact, the null hypothesis may be
thought of as being even broader than that. As we discuss in section 12.4.4,
there is no reason, in principle, why the data need to be randomly sampled
at all. The null hypothesis says that the x’s and the y’s are indistinguishable
from a random sample from the complete set of x’s and y’s together. We
don’t use the precise numbers — which depend upon the unknown distribu-
tion of the x’s and y’s — but only basic reasoning about the relative sizes
of the numbers. Thus, if the x’s and y’s come from the same distribution,
it is equally likely that xi > yi as that xi < yi .
The idea of the sign test is quite straightforward. We wish to test the null
hypothesis that paired data came from the same distribution. If that is the
case, then which one of the two observations is the larger should be just
like a coin flip. So we count up the number of times (out of n pairs) that
the first observation in the pair is larger than the second, and compute the
probability of getting that many heads in n coin flips. If that probability is
below the chosen significance level α, we reject the null hypothesis.
Schizophrenia study
Table 12.5: Data from the Suddath [SCT+ 90] schizophrenia experiment.
Hippocampus volumes in cm3 .
for breastfeeding. Each mother treated one breast and left the other un-
treated, as a control. The two breasts were rated daily for level of discomfort,
on a scale 1 to 4. Each method was used by 19 mothers, and the average dif-
ference between the treated and untreated breast for each of the 19 mothers
who used the “toughening” treatment were: −0.525, 0.172, −0.577, 0.200,
0.040, −0.143, 0.043, 0.010, 0.000, −0.522, 0.007, −0.122, −0.040, 0.000,
−0.100, 0.050, −0.575, 0.031, −0.060.
The original study performed a one-tailed t test at the 0.05 level of
the null hypothesis that the true difference between treated and untreated
breasts was 0: The cutoff is then −1.73 (so we reject the null hypothesis
on any value of T below −1.73). We have x̄ = −0.11, and sx = 0.25. We
compute then T = (x̄ − 0)/(sx /√n) = −1.95, leading us to reject the null.
We should, however, be suspicious of this marginal result, which depends
upon the choice of a one-tailed test: for a two-tailed test the cutoff would
have been 2.10.
In addition, we note that the assumption of normality is drastically
violated, as we see from the histogram of the observed values in Figure 12.4.
To apply the sign test, we see that there are 8 positive and 9 negative values,
which is as close to an average value as we could have, and so conclude
that there is no evidence in the sign test of a difference between the treated
and untreated breasts. √ (Formally, we could compute p̂ = 8/17 = 0.47, and
Z = (0.47 − 0.50)/(0.5/ 17) = −0.247, which is nowhere near the cutoff of
1.96 for the z test at the 0.05 level.)
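The sign-test count can be sketched as follows; zeros are dropped, as in the text, and the exact two-sided binomial p-value is computed alongside.

```python
from math import comb

diffs = [-0.525, 0.172, -0.577, 0.200, 0.040, -0.143, 0.043, 0.010, 0.000,
         -0.522, 0.007, -0.122, -0.040, 0.000, -0.100, 0.050, -0.575, 0.031,
         -0.060]

nonzero = [d for d in diffs if d != 0]
n = len(nonzero)                 # 17 after dropping the two zeros
k = sum(d > 0 for d in nonzero)  # 8 positive differences

# Exact two-sided binomial p-value: k "heads" in n fair coin flips
lo = min(k, n - k)
p = min(1.0, 2 * sum(comb(n, i) for i in range(lo + 1)) / 2**n)
print(n, k, round(p, 2))
```

With 8 positives out of 17, the split is as even as possible, and the exact p-value comes out to 1.0: no evidence at all of a difference, agreeing with the normal approximation in the text.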
[Figure 12.4: Histogram of the 19 observed differences; y-axis Frequency,
from 0 to 6.]
As with the two-sample test in section 12.3.2, we can strengthen the paired-
sample test by considering not just which number is bigger, but the relative
ranks. The idea of the Wilcoxon (or signed-rank) test is that we might have
about equal numbers of positive and negative values, but if the positive
values are much bigger than the negative (or vice versa) that will still be ev-
idence that the distributions are different. For instance, in the breastfeeding
study, the t test produced a marginally significant result because several of
the very large values are all negative.
The mechanics of the test are the same as for the two-sample rank-sum
test, only the two samples are not the x’s and the y’s, but the positive and
negative differences. In a first step, we rank the differences by their absolute
values. Then, we carry out a rank-sum test on the positive and negative dif-
ferences. To apply the Wilcoxon test, we first drop the two 0 values, and
then rank the remaining 17 numbers by their absolute values:
Diff 0.007 0.010 0.031 0.040 -0.040 0.043 0.050 -0.060 -0.100
Rank 1 2 3 4.5 4.5 6 7 8 9
Diff -0.122 -0.143 0.172 0.200 -0.522 -0.525 -0.575 -0.577
Rank 10 11 12 13 14 15 16 17
The ranks corresponding to positive values are 1, 2, 3, 4.5, 6, 7, 12, 13, which
sum to R+ = 48.5, while the negative values have ranks 4.5, 8, 9, 10, 11, 14, 15, 16, 17,
summing to R− = 104.5. The Wilcoxon statistic is defined to be T =
min{R+ , R− } = 48.5. We look on the appropriate table (given in Figure
12.5). We see that in order for the difference to be significant at the 0.05
level, we would need to have T ≤ 34. Consequently, we still conclude that
the effect of the treatment is not statistically significant.
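The ranking step above can be mechanised. This sketch reproduces R+, R− and the Wilcoxon statistic for the breastfeeding differences (the two zero differences have already been dropped):

```python
diffs = [-0.525, 0.172, -0.577, 0.200, 0.040, -0.143, 0.043, 0.010,
         -0.522, 0.007, -0.122, -0.040, -0.100, 0.050, -0.575, 0.031, -0.060]

# Rank by absolute value, averaging the ranks over ties
ordered = sorted(diffs, key=abs)
ranks = {}
i = 0
while i < len(ordered):
    j = i
    while j < len(ordered) and abs(ordered[j]) == abs(ordered[i]):
        j += 1
    avg = (i + 1 + j) / 2            # average of ranks i+1 .. j
    for d in ordered[i:j]:
        ranks[d] = avg
    i = j

r_plus = sum(ranks[d] for d in diffs if d > 0)
r_minus = sum(ranks[d] for d in diffs if d < 0)
t_stat = min(r_plus, r_minus)
print(r_plus, r_minus, t_stat)       # 48.5 104.5 48.5
```

Note that this simple dictionary keying works here because all seventeen differences are distinct values; a general implementation would rank by position rather than by value.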
against the general alternative that the samples did not come from P . To
do this, we need to create a test statistic whose distribution we know, and
which will be big when the data are far away from a typical sample from
the population P .
You already know one approach to this problem, using the χ2 test. To
do this, we split up the possible values into K ranges, and compare the
number of observations in each range with the number that would have been
predicted. For instance, suppose we have 100 samples which we think should
have come from a standard normal distribution. The data are given in Table
13.1. The first thing we might do is look at the mean and variance of the
sample: In this case, the mean is −0.06 and the sample variance 1.06, which
seems plausible. (A z test for the mean would not reject the null hypothesis
of 0 mean, and the test for variance — which you have not learned — would
be satisfied that the variance is 1.) We might notice that the largest value is
3.08, and the minimum value is −3.68, which seem awfully large. We have
to be careful, though, about scanning the data first, and then deciding what
to test after the fact: This approach, sometimes called data snooping, can
easily mislead, since every collection of data is likely to have something that
seems wrong with it, purely by chance. (This is the problem of multiple
testing, which we discuss further in section 14.3.)
-0.16 -0.68 -0.32 -0.85 0.89 -2.28 0.63 0.41 0.15 0.74
1.30 -0.13 0.80 -0.75 0.28 -1.00 0.14 -1.38 -0.04 -0.25
-0.17 1.29 0.47 -1.23 0.21 -0.04 0.07 -0.08 0.32 -0.17
0.13 -1.94 0.78 0.19 -0.12 -0.19 0.76 -1.48 -0.01 0.20
-1.97 -0.37 3.08 -0.40 0.80 0.01 1.32 -0.47 2.29 -0.26
-1.52 -0.06 -1.02 1.06 0.60 1.15 1.92 -0.06 -0.19 0.67
0.29 0.58 0.02 2.18 -0.04 -0.13 -0.79 -1.28 -1.41 -0.23
0.65 -0.26 -0.17 -1.53 -1.69 -1.60 0.09 -1.11 0.30 0.71
-0.88 -0.03 0.56 -3.68 2.40 0.62 0.52 -1.25 0.85 -0.09
-0.23 -1.16 0.22 -1.68 0.50 -0.35 -0.35 -0.33 -0.24 0.25
The X 2 statistic for this table is 7.90. This does not exceed the threshold
of 9.49 for rejecting the null hypothesis at the 0.05 level.
There are two key problems with this approach:
(1). We have thrown away some of the information that we had to begin
with, by forcing the data into discrete categories. Thus, the power
to reject the null hypothesis is less than it could have been. The
bottom category, for instance, does not distinguish between −2 and
the actually observed extremely low observation −3.68.
[Figures: a Q-Q plot of the data, Sample Quantiles against Theoretical
Quantiles; a step-function cdf on 1, . . . , 6; and a standard normal density
curve.]
Table 13.2: χ2 table for data from Table 13.1, testing its fit to a standard
normal distribution.
[Figure: (a) Fobs shown in black circles, and Fexp (the normal distribution
function) in red; the green segments show the difference between the two
distribution functions. (b) |Fobs(x) − Fexp(x)| against x. (c) Difference
between entry #i in Table 13.2(b) and i/100; largest value shown in blue.]
0.010 0.009 0.006 0.014 0.004 0.014 0.015 0.017 0.026 0.031
0.031 0.036 0.030 0.034 0.041 0.037 0.037 0.026 0.031 0.011
0.012 0.005 0.003 0.008 0.069 0.085 0.086 0.083 0.073 0.071
0.064 0.077 0.067 0.061 0.055 0.049 0.039 0.045 0.035 0.033
0.023 0.013 0.006 0.008 0.002 0.008 0.006 0.012 0.014 0.024
0.026 0.036 0.046 0.052 0.054 0.056 0.062 0.052 0.054 0.048
0.054 0.060 0.055 0.061 0.067 0.073 0.071 0.070 0.076 0.082
0.084 0.061 0.049 0.049 0.052 0.048 0.051 0.054 0.058 0.064
0.068 0.071 0.069 0.070 0.074 0.078 0.082 0.092 0.088 0.087
0.055 0.045 0.029 0.037 0.043 0.013 0.015 0.009 0.002 0.001
Table 13.3: Computing the Kolmogorov-Smirnov statistic for testing the fit
of data to the standard normal distribution.
t and Z tests. If the null hypothesis says µ = µ0 , and the alternative is that
µ > µ0 , then we can increase the power of our test against this alternative by
taking a one-sided alternative. On the other hand, if reality is that µ < µ0
then the test will have essentially no power at all. Similarly, if we think
the reality is that the distributions of X and Y differ on some scattered
intervals, we might opt for a χ2 test.
X: 5.4, 19.3, 15.8, 4.9, 0.7, 4.9, 8.5, 23.1, 16.7, 2.0
Y : 27.1, 28.4, 5.6, 29.0, 29.8, 26.1, 14.4, 14.9, 29.3, 18.0, 0.4, 23.3.
We put these values in order to compute the cdf (which we also plot in
Figure 13.4).
1.2 1.4 1.9 3.7 4.4 4.8 5.6 6.5 6.6 6.9 9.2 9.7 10.4 10.6 17.3 19.3 21.1 28.4
Fx 0.1 0.2 0.3 0.4 0.5 0.6 0.6 0.6 0.6 0.6 0.6 0.7 0.7 0.7 0.8 0.8 0.9 1.0
Fy 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.2 0.4 0.5 0.6 0.6 0.8 0.9 0.9 1.0 1.0 1.0
[Figure 13.4: Empirical cumulative distribution functions of the two samples;
x-axis from 0 to 30, y-axis P from 0 to 1.]
Kolmogorov-Smirnov critical value:

Dcrit,0.05 = 1.36 √(1/nx + 1/ny ) = 1.36 √((nx + ny )/(nx ny )).
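The two-sample Kolmogorov-Smirnov statistic is just the largest gap between the two empirical cdfs, and the maximum is always attained at one of the observed points. A sketch, applied to the X and Y samples listed above:

```python
from math import sqrt

x = [5.4, 19.3, 15.8, 4.9, 0.7, 4.9, 8.5, 23.1, 16.7, 2.0]
y = [27.1, 28.4, 5.6, 29.0, 29.8, 26.1, 14.4, 14.9, 29.3, 18.0, 0.4, 23.3]

def ecdf(sample, t):
    """Empirical cdf: fraction of the sample that is <= t."""
    return sum(v <= t for v in sample) / len(sample)

# The maximum gap between the two cdfs, scanned over the observed points
d = max(abs(ecdf(x, t) - ecdf(y, t)) for t in x + y)

d_crit = 1.36 * sqrt((len(x) + len(y)) / (len(x) * len(y)))
print(round(d, 3), round(d_crit, 3))
```

Comparing d with d_crit then gives the decision at the 0.05 level.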
Table 13.5: Ages at death for skeletons found at two Virginia sites, as de-
scribed in [Lov71].
Cumulative Distrib.
Age range Clarksville Tollifero Difference
3 0.16 0.16 0
6 0.16 0.24 0.08
12 0.22 0.29 0.07
17 0.24 0.41 0.17
20 0.24 0.43 0.19
35 0.57 0.8 0.23
45 0.97 0.9 0.07
55 1 1 0
entirely make sense. What we have done is effectively to take the maximum
difference not over all possible points in the distribution, but only at eight
specially chosen points. This inevitably makes the maximum smaller. The
result is to make it harder to reject the null hypothesis, so the true
significance level is lower than the nominal one. We should compensate by
lowering the critical value.
                            Truth
                    H0 True               H0 False
Decision
Don't Reject H0     Correct               Type II Error
                    (Prob. = 1 − α)       (Prob. = β)
Reject H0           Type I Error          Correct
                    (Prob. = level = α)   (Prob. = Power = 1 − β)
as the probability that X̄ > 0.196 or X̄ < −0.196. We know that X̄ has
N (µ, 0.1) distribution. If we standardise this, we see that Z 0 := 10(X̄ − µ)
is standard normal. A very elementary probability computation then shows that
Power = Φ(−1.96 − 10µalt ) + 1 − Φ(1.96 − 10µalt ), (13.1)
where Φ is the standard normal cumulative distribution function, whose values we look up on the normal table.
More generally, suppose we have n measures of a quantity with unknown
distribution, unknown mean µ, but known variance σ 2 , and we wish to use
a two-tailed Z test for the null hypothesis µ = µ0 against the alternative
µ = µalt . The same computation as above shows that
Power = Φ( −z − (√n/σ)(µ0 − µalt ) ) + 1 − Φ( z − (√n/σ)(µ0 − µalt ) ).    (13.2)
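Formula (13.2) translates directly into code, using the standard library's NormalDist for Φ. The example values (n = 100, σ = 1, true mean 0.3) match the setting of formula (13.1).

```python
from math import sqrt
from statistics import NormalDist

phi = NormalDist().cdf   # standard normal cumulative distribution function

def power_two_tailed(n, sigma, mu0, mu_alt, z=1.96):
    """Power of the two-tailed Z test of mu = mu0 when the truth is mu_alt."""
    shift = sqrt(n) / sigma * (mu0 - mu_alt)
    return phi(-z - shift) + 1 - phi(z - shift)

# n = 100, sigma = 1, true mean 0.3: the power comes out near 0.85
print(round(power_two_tailed(100, 1, 0.0, 0.3), 3))
```

As a sanity check, when µalt = µ0 the "power" is just the significance level, 0.05.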
[Figure 13.5: four power curves (n=100, α=.05; n=100, α=.01; n=1000, α=.05;
n=1000, α=.01); x-axis µalt − µ0 from −0.5 to 0.5, y-axis Power.]
Figure 13.5: Power for Z test with different gaps between the null and alter-
native hypotheses for the mean, for given sizes of study (n) and significance
levels α.
Z = (b̄ − ā) / ( σ √(1/(n/2) + 1/(n/2)) ).
is. We see from the formula (13.2) that if the gap between µA and µB
becomes half as big, we need 4 times as many subjects to keep the same
power. If the power is only 0.2, say, then it is hardly worth starting in on
the experiment, since the result we get is unlikely to be conclusive.
Figure 13.6 shows the power for experiments where the true experimental
effect (the difference between µA and µB ) is 10 mmHg, 5 mmHg, and 1
mmHg, performing one-tailed and two-tailed significance tests at the 0.05
level.
Notice that the one-tailed test is always more powerful when µB − µA
is on the right side (µB bigger than µA , so B is superior), but essentially 0
when B is inferior; the power of the two-tailed test is symmetric. If we are
interested to discover only evidence that B is superior, then the one-tailed
test obviously makes more sense.
Suppose now that we have sufficient funding to enroll 50 subjects in our
study, and we think the study would be worth doing only if we have at least
an 80% chance of finding a significant positive result. In that case, we see
from Figure 13.6(b) that we should drop the project unless we expect the
difference in average effects to be at least 5 mmHg. On the other hand, if
we can afford 200 subjects, we can justify hunting for an effect only half
as big, namely 2.5 mmHg. With 1000 subjects we have a good chance of
detecting a difference between the drugs as small as 1 mmHg. On the other
hand, with only 10 subjects we would be unlikely to find the difference to
be statistically significant, even if the true difference is quite large.
[Figure 13.6: Power curves for one-tailed and two-tailed tests with n = 10,
50, 200, 1000; x-axis µB − µA from −10 to 10 (mmHg), y-axis Power.]
[Figure 13.7: power curves for the t test, the Mann-Whitney test, and the
median test; x-axis µ1 − µ2 from −3 to 3, y-axis Power.]
Figure 13.7: Estimated power for three different tests, where the underlying
distributions are normal with variance 1, as a function of the true difference
in means. The test is based on ten samples from each distribution.
Lecture 14
ANOVA
who were nursed more than 9 months and those nursed less than 1 month
both had, on average, characteristics (whether their own or their mothers’)
that would seem to predispose them to lower IQ. For the rest of this chapter
we will work with the adjusted means.
The statistical technique for doing this, called multiple regression, is
outside the scope of this course, but it is fairly straightforward, and most
textbooks on statistical methods that go beyond the most basic techniques
will describe it. Modern statistical software makes it particularly easy to
adjust data with multiple regression.
One approach would be to group the IQ scores into categories — low, medium,
and high, say. If these were categorical data — proportions of subjects in
each breastfeeding class who scored “high” and “low”, for instance — we
could produce an incidence table such
as that in Table 14.2. (The data shown here are purely invented, for illustra-
tive purposes.) You have learned how to analyse such a table to determine
whether the vertical categories (IQ score) are independent of the horizontal
categories (duration of breastfeeding), using the χ2 test.
The problem with this approach is self-evident: We have thrown away
some of the information that we had to begin with, by forcing the data into
discrete categories. Thus, the power to reject the null hypothesis is less than
it could have been. Furthermore, we have to draw arbitrary boundaries be-
tween categories, and we may question whether the result of our significance
test would have come out differently if we had drawn the boundaries other-
wise. (These are the same problems, you may recall, that led us to prefer
the Kolmogorov-Smirnov test over χ2 . The χ2 test has the virtue of being
wonderfully general, but it is often not quite the best choice.)
Table 14.2: Incidence table of IQ category by months of breastfeeding
(invented data).

                        Breastfeeding months
Full IQ score     ≤1     2-3    4-6    7-9    >9
high              100    115    120    40     9
medium            72     85     69     35     9
low               100    115    80     29     5
means were in fact all the same — 1 out of 20 comparisons should yield a
statistically significant difference at the 0.05 level. How many statistically
significant differences do we need before we can reject the overall null hy-
pothesis of identical population means? And what if none of the differences
were individually significant, but they all pointed in the same direction?
The idea of analysis of variance (ANOVA) is that under the null hypothesis,
which says that the observations from different levels really all are coming
from the same distribution, the observations should be about as far (on
average) from their own level mean as they are from the overall mean of the
whole sample; but if the means are different, observations should be closer
to their level mean than they are to the overall mean.
We define the Between Groups Sum of Squares, or BSS, to be the
total squared difference of the group means from the overall mean; and the
Error Sum of Squares, or ESS, to be the total squared difference of
the samples from the means of their own groups. (The term “error” refers
to a context in which the samples can all be thought of as measures of
the same quantity, and the variation among the measurements represents
random error; this piece is also called the Within-Group Sum of Squares.)
And then there is the Total Sum of Squares, or TSS, which is simply
the total squared difference of the samples from the overall mean, if we treat
them as one sample.
BSS = Σ_{i=1}^{K} ni (X̄i − X̄)² ;

ESS = Σ_{i=1}^{K} Σ_{j=1}^{ni} (Xij − X̄i )²
    = Σ_{i=1}^{K} (ni − 1) si²    where si is the SD of observations in level i;

TSS = Σ_{i,j} (Xij − X̄)².
The initials BMS and EMS stand for Between Groups Mean Squares
and Error Mean Squares respectively.
variability among the data can be divided into two pieces: The variability
within groups, and the variability among the means of different groups.
Our goal is to evaluate the apportionment, to decide if there is “too much”
between group variability to be purely due to chance.
Of course, BSS and ESS involve different numbers of observations in
their sums, so we need to normalise them. We define
BMS = BSS/(K − 1) ,     EMS = ESS/(N − K) .
This brings us to the second mathematical fact: if the null hypothesis is
true, then EMS and BMS are both estimates for σ 2 . On the other hand,
interesting deviations from the null hypothesis — in particular, where the
populations have different means — would be expected to increase BMS rel-
ative to EMS. This leads us to define the deviation from the null hypothesis
as the ratio of these two quantities:
F = BMS/EMS = [(N − K)/(K − 1)] · (BSS/ESS) .
We reject the null hypothesis when F is too large: That is, if we obtain a
value f such that P {F ≥ f } is below the significance level of the test.
                      SS        d.f.       MS              F
Between Treatments    BSS (A)   K − 1 (B)  BMS (X = A/B)   X/Y
Errors (Within
Treatments)           ESS (C)   N − K (D)  EMS (Y = C/D)
Total                 TSS       N − 1
Under the null hypothesis, the F statistic computed in this way has a
known distribution, called the F distribution with (K − 1, N − K) degrees
of freedom. We show the density of F for K = 5 different treatments and
different values of N in Figure 14.1.
Figure 14.1: Density of the F distribution for K = 5, N = 10; K = 5, N = 20; K = 5, N = ∞; K = 3, N = 6; and K = 3, N = ∞.
quite small — after all, there is one distribution for each pair of integers. The
table gives the cutoff only for select values of (d1 , d2 ) at the 0.05 level.
For parameters in between one needs to interpolate, and for parameters
above the maximum we go to the row or column marked ∞. Looking at
the table in Figure 14.2, we see that the cutoff for F (4, ∞) is 2.37. Using
a computer, we can compute that the cutoff for F (4, 968) at level 0.05 is
actually 2.38; and the cutoff at level 0.01 would be 3.34.
Table 14.5: ANOVA table for breastfeeding data: Full Scale IQ, Adjusted.

                   SS             d.f.     MS             F
Between Samples    3597 (A)       4 (B)    894.8 (A/B)    3.81
Errors (Within
Samples)           227000 (C)     968 (D)  234.6 (C/D)
Total              230600         972
                   (TSS = A+C)    (N − 1)
Figure 14.2: Table of F distribution; finding the cutoff at level 0.05 for the
breastfeeding study.
table for 27), we see that the cutoff at level 0.05 is 3.32. Thus, we conclude
that the difference in means between the groups is statistically significant.
x_{ki} = μ_k + ε_{ki} ,    k = 1, . . . , K;  i = 1, . . . , n_k ,

where the ε_{ki} are the normally distributed “errors”, and μ_k is the true mean
for group k. Thus, in the example of section 14.4.3, there were three groups,
corresponding to three different exercise regimens, and ten different samples
for each regimen. The obvious estimate for µk is
x̄_{k·} = (1/n_k) Σ_{i=1}^{n_k} x_{ki} ,
and we use the F test to determine whether the differences among the means
are genuine. We decompose the total variance of the observations into the
Table 14.6: Bone density of rats after a given exercise regime, in mg/cm³.

Group     Measurements                                    Mean    SD
High      626 650 622 674 626 643 622 650 643 631         638.7   16.6
Low       594 599 635 605 632 588 596 631 607 638         612.5   19.3
Control   614 569 653 593 611 600 603 593 621 554         601.1   27.4
portion that is between groups and the portion that is within groups. If the
between-group variance is too big, we reject the hypothesis of equal means.
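The decomposition can be checked numerically. The following Python fragment (an illustration, not part of the course materials) computes BSS, ESS, and the F statistic for the rat bone-density data of Table 14.6, reproducing the values in the ANOVA table for that experiment.

```python
# One-way ANOVA by hand: BSS, ESS, and F for the rat bone-density data.
groups = {
    "High":    [626, 650, 622, 674, 626, 643, 622, 650, 643, 631],
    "Low":     [594, 599, 635, 605, 632, 588, 596, 631, 607, 638],
    "Control": [614, 569, 653, 593, 611, 600, 603, 593, 621, 554],
}

N = sum(len(g) for g in groups.values())   # total number of observations
K = len(groups)                            # number of levels
grand_mean = sum(sum(g) for g in groups.values()) / N

# Between Groups Sum of Squares: n_i * (group mean - grand mean)^2
BSS = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values())

# Error (Within-Group) Sum of Squares: squared deviations from own group mean
ESS = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups.values())

BMS = BSS / (K - 1)   # between-groups mean square
EMS = ESS / (N - K)   # error mean square
F = BMS / EMS
print(f"BSS={BSS:.0f}, ESS={ESS:.0f}, F={F:.2f}")   # BSS ≈ 7434, ESS ≈ 12580, F ≈ 7.98
```

With K − 1 = 2 and N − K = 27 degrees of freedom, this F of about 7.98 matches the ANOVA table for these data.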
Many experiments naturally lend themselves to a two-way layout. For
instance, there may be three different exercise regimens and two different
diets. We represent each measurement as indexed jointly by exercise regimen and diet.
It is then slightly more complicated to isolate the exercise effect µk and the
diet effect νj . We test for equality of these effects by again splitting the
variance into pieces: the total sum of squares falls naturally into four pieces,
corresponding to the variance over diets, variance over exercise regimens,
variance over joint diet and exercise, and the remaining variance within
each group. We then test for whether the ratios of these pieces are too far
from the ratio of the degrees of freedom, as determined by the F distribution.
Multifactor ANOVA is quite common in experimental practice, but will
not be covered in this course.
                   SS             d.f.    MS            F
Between Samples    7434 (A)       2 (B)   3717 (A/B)    7.98
Errors (Within
Samples)           12580 (C)      27 (D)  466 (C/D)
Total              20014          29
                   (TSS = A+C)    (N − 1)
simply to substitute ranks for the actual observed values. This avoids the
assumption that the data were drawn from a normal distribution.
In Table 14.8 we duplicate the data from Table 14.6, replacing the mea-
surements by the numbers 1 through 30, representing the ranks of the data:
the lowest measurement is number 1, and the highest is number 30. In other
words, suppose we have observed K different groups, with ni observations
in each group. We order all the observations in one large sequence of length
N , from lowest to highest, and assign to each one its rank. (In case of ties,
we assign the average rank.) We then sum the ranks in group i, obtaining
numbers R_1, . . . , R_K . Then

H = [12/(N(N + 1))] Σ_{i=1}^K R_i²/n_i − 3(N + 1).
Under the null hypothesis, that all the samples came from the same distri-
bution, H has the χ2 distribution with K − 1 degrees of freedom.
In the rat exercise example, we have the values of Ri given in Table
14.7(b), yielding H = 10.7. If we are testing at the 0.05 significance level,
the cutoff for χ2 with 2 degrees of freedom is 5.99. Thus, we conclude again
that there is a statistically significant difference among the distributions of
bone density in the three groups.
Table 14.8: Ranks of the bone-density measurements of Table 14.6.

High      18.5  27.5  16.5  30    18.5  25.5  16.5  27.5  25.5  20.5
Low       6     8     23    11    22    3     7     20.5  12    24
Control   14    2     29    4.5   13    9     10    4.5   15    1
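Computing H from scratch is a useful check. The sketch below (plain Python, an illustration rather than course code) ranks all 30 observations with tie-averaging, sums the ranks within each group, and evaluates the standard Kruskal–Wallis statistic; it reproduces H ≈ 10.7.

```python
# Kruskal-Wallis H: rank all observations together (averaging ties),
# sum ranks per group, then H = 12/(N(N+1)) * sum(R_i^2/n_i) - 3(N+1).
groups = {
    "High":    [626, 650, 622, 674, 626, 643, 622, 650, 643, 631],
    "Low":     [594, 599, 635, 605, 632, 588, 596, 631, 607, 638],
    "Control": [614, 569, 653, 593, 611, 600, 603, 593, 621, 554],
}

values = sorted(v for g in groups.values() for v in g)
N = len(values)

def rank(v):
    """Average rank of value v among all observations (ties averaged)."""
    positions = [i + 1 for i, u in enumerate(values) if u == v]
    return sum(positions) / len(positions)

rank_sums = {name: sum(rank(v) for v in g) for name, g in groups.items()}

H = 12 / (N * (N + 1)) * sum(
    rank_sums[name] ** 2 / len(g) for name, g in groups.items()
) - 3 * (N + 1)
print(rank_sums, round(H, 1))   # rank sums 226.5, 136.5, 102; H ≈ 10.7
```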
260 Regression
Figure 15.1: Plot of data from the breastfeeding IQ study in Table 14.1.
Stars represent mean for the class, boxes represent mean ± 2 Standard
Errors.
15.2 Scatterplots
The most immediate thing that we may wish to do is to get a picture of the
data with a scatterplot. Some examples are shown in Figure 15.2.
Figure 15.2: Examples of scatterplots. (a) Infant birth weight (oz.) against maternal smoking; (b) Galton’s parent-child height data; (c) IQ and total brain surface area (cm²); (d) Brain volume and total brain surface area; (e) Brain weight and body weight (62 species of land mammals); (f) log Brain weight and log body weight (62 species of land mammals).
For n paired observations (x_i, y_i), with means x̄ and ȳ, we define the population covariance

c_xy = mean (x_i − x̄)(y_i − ȳ) = (1/n) Σ_{i=1}^n (x_i − x̄)(y_i − ȳ).
As with the SD, we usually work with the sample covariance, which is just

s_xy = [n/(n − 1)] c_xy = [1/(n − 1)] Σ_{i=1}^n (x_i − x̄)(y_i − ȳ).
This is a better estimate for the covariance of the random variables that xi
and yi are sampled from.
Notice that the means of x_i − x̄ and y_i − ȳ are both 0: On average, x_i is
neither higher nor lower than x̄. Why is the covariance then not also 0? If
X and Y are independent, then each value of X will come with, on average,
the same distribution of Y ’s, so the positives and negatives will cancel out,
and the covariance will indeed be 0. On the other hand, if high values of xi
tend to come with high values of yi , and low values with low values, then
the product (xi − x̄)(yi − ȳ) will tend to be positive, making the covariance
positive.
While positive and negative covariance have obvious interpretations, the
magnitude of covariance does not say anything straightforward about the
strength of connection between the covariates. After all, if we simply mea-
sure heights in millimetres rather than centimetres, all the numbers will
become 10 times as big, and the covariance will be multiplied by 100. For
this reason, we normalise the covariance by dividing it by the product of the
two standard deviations, producing the quantity called correlation:
Correlation:    ρ_XY = Cov(X, Y)/(σ_X σ_Y).

Sample correlation:    r_xy = s_xy/(s_x s_y).
It is easy to see that correlation does not change when we rescale the data
— for instance, by changing the unit of measurement. If xi were universally
replaced by x*_i = αx_i + β, then s_xy becomes s*_xy = αs_xy , and s_x becomes
s*_x = αs_x . Since the extra factor of α appears in the numerator and in the
denominator, the value of r_xy remains unchanged. In fact, it turns
out that rxy is always between −1 and 1. The correlation of −1 means that
there is a perfect linear relationship between x and y with negative sign;
correlation of +1 means that there is a perfect linear relationship between
x and y with positive sign; and correlation 0 means no linear relationship
at all.
In Figure 15.3 we show some samples of standard normally distributed
pairs of random variables with different correlations. As you can see, high
positive correlation means the points lie close to an upward-sloping line;
high negative correlation means the points lie close to a downward-sloping
line; and correlation close to 0 means the points lie scattered about a disk.
There are several alternative formulae for the covariance, which may be more
convenient than the standard formula:
s_xy = [1/(n − 1)] Σ_{i=1}^n (x_i − x̄)(y_i − ȳ)
     = [1/(n − 1)] Σ_{i=1}^n x_i y_i − [n/(n − 1)] x̄ȳ
     = (1/4)(s²_{x+y} − s²_{x−y}),

where s_{x+y} and s_{x−y} are the sample SDs of the collections (x_i + y_i)
and (x_i − y_i) respectively.
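All three formulae are algebraically equivalent, which is easy to confirm numerically. Here is a small Python check, run on the hypothetical five-point data set of Table 15.4 (any data would do):

```python
# Verify the three equivalent formulae for the sample covariance.
x = [6, 2, 5, 7, 4]
y = [3, 2, 4, 6, 5]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

def sample_var(v):
    """Sample variance with the n - 1 denominator."""
    m = sum(v) / len(v)
    return sum((u - m) ** 2 for u in v) / (len(v) - 1)

# Formula 1: average product of deviations
s1 = sum((u - xbar) * (v - ybar) for u, v in zip(x, y)) / (n - 1)
# Formula 2: sum of products minus the mean correction
s2 = sum(u * v for u, v in zip(x, y)) / (n - 1) - n / (n - 1) * xbar * ybar
# Formula 3: quarter of the difference between variances of sum and difference
s3 = (sample_var([u + v for u, v in zip(x, y)])
      - sample_var([u - v for u, v in zip(x, y)])) / 4

print(s1, s2, s3)   # all three agree (≈ 2.0 for these data)
```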
Figure 15.3: Scatterplots of pairs of standard normal random variables with correlations 0.95, 0.4, 0, and −0.8.
This leads us to

r_xy = 13130/(s_x s_y) = 0.601.

Similarly, r_yz = −0.291 and r_xz = −0.063.
s_xy = (1/4)(13.66 − 5.41) = 2.07,

and

r_xy = s_xy/(s_x s_y) = 2.07/(2.52 · 1.79) = 0.459.
² Available as the dataset galton of the UsingR package of the R programming language, or directly from http://www.bun.kyoto-u.ac.jp/~suchii/galton86.html.
³ We do not need to pay attention to the fact that Galton multiplied all female heights by 1.08.
Table 15.2: Variances for different combinations of the Galton height data.

             SD     Variance
Parent       1.79   3.19
Child        2.52   6.34
Sum          3.70   13.66
Difference   2.33   5.41
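Using the rounded values from Table 15.2, the sum-minus-difference identity recovers the parent-child covariance and correlation; a quick check in Python (the last digit differs slightly from the figures quoted in the text, because the table values are rounded):

```python
# Covariance of Galton's parent/child heights from the identity
# s_xy = (s_{x+y}^2 - s_{x-y}^2) / 4, using Table 15.2's (rounded) values.
var_sum, var_diff = 13.66, 5.41    # variances of parent+child, parent-child
sd_child, sd_parent = 2.52, 1.79   # SDs from Table 15.2

s_xy = (var_sum - var_diff) / 4
r = s_xy / (sd_child * sd_parent)
print(round(s_xy, 2), round(r, 3))   # ≈ 2.06 and ≈ 0.457 (text: 2.07 and 0.459)
```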
In the bottom row of the table we give the contribution that those individuals
made to the total Σ x_i y_i. Since the x_i are all about the same in
any column, we treat them as though they were all equal
to the average x value in the column. (We give these averages in the first
row. This involves a certain amount of guesswork, particularly for the last
column; on the other hand, there are very few individuals in that column.)
The sum of the y values in any column is the average y value multiplied
by the relevant number of samples. Consider the first column:
Σ_{i in first column} x_i y_i ≈ 1 · Σ_{i in first column} y_i = 1 · N_1 ȳ_1 = 1 · 272 · 99.4 = 27037.
The sum in a column is just the average in the column multiplied by the
number of observations in the column, so we get
ȳ = (1/973)(272 · 99.4 + 305 · 101.7 + 269 · 102.3 + 104 · 106.0 + 27 · 104.0) = 101.74.
Thus, we get
s_xy = 393853/972 − (973/972) · 101.74 · 3.93 = 4.95.
To compute the correlation we now need to estimate the variance of x
separately. (The variance of y could be computed from P the other given data
— using the individual column variances to compute yi2 — but we are
given that sy = 15.7.) For the x values, we continue to treat them as though
they were all the same in a column, and use the formula for grouped data.
We then get

s_x² = (1/972)[272 · (1 − 3.93)² + 305 · (3 − 3.93)² + 269 · (5.5 − 3.93)² + 104 · (8.5 − 3.93)² + 27 · (12 − 3.93)²] = 7.13.
Thus, we have
r_xy = s_xy/(s_x s_y) = 4.95/(2.67 · 15.7) = 0.118.
One way to test the null hypothesis of no correlation,

H_0 : ρ_XY = 0,

is to compute the statistic R = r √(n − 2)/√(1 − r²).
It can be shown that, under the null hypothesis, this R has the Student
t distribution with n − 2 degrees of freedom. Thus, we can look up the
appropriate critical value, and reject the null hypothesis if |R| is above this
cutoff. For example, in the Brain measurement experiments of section 15.4.1
we have correlation between brain volume and surface area being 0.601 from
20 samples, which produces R = 3.18, well above the threshold value for t
with 18 degrees of freedom at the 0.05 level, which is 2.10. On the other
hand, the correlation −0.291 for surface area against IQ yields R = 1.29,
which does not allow us to reject the null hypothesis that the true underlying
population correlation is 0; and the correlation −0.063 between volume and
IQ yields R only 0.255.
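The statistic is mechanical to compute. A small sketch, using the standard formula R = r√(n − 2)/√(1 − r²) for this test (the last digits differ slightly from the text, which worked from a less-rounded r):

```python
import math

def corr_t_stat(r, n):
    """t statistic for testing H0: rho = 0; it has n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

def r_threshold(t_star, n):
    """Smallest |r| that is significant, given the t cutoff t_star."""
    return t_star / math.sqrt(n - 2 + t_star ** 2)

# Brain-measurement correlations, n = 20 samples:
print(round(corr_t_stat(0.601, 20), 2))    # ≈ 3.19 (text: 3.18)
print(round(corr_t_stat(-0.291, 20), 2))   # ≈ -1.29
print(round(r_threshold(2.10, 20), 3))     # |r| cutoff ≈ 0.444 at the 0.05 level
```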
Note that for a given n and choice of level, the threshold in t translates
directly to a threshold in r. If t* is the appropriate threshold value in t,
then we reject the null hypothesis when our sample correlation satisfies
|r| > t*/√(n − 2 + t*²). In particular, for large n and α = 0.05, we have a
threshold for t very close to 2, so that we reject the null hypothesis when
|r| > 2/√n.
Figure 15.4: Galton parent-child heights, with SD line in green. The point
of the means is shown as a red circle.
Figure 15.5: The parent-child heights, with an oval representing the general
range of values in the scatterplot. The SD line is green, the regression
line blue, and the rectangles represent the approximate span of Y values
corresponding to X = 66, 68, 70, 72 inches.
Table 15.4: Hypothetical data, with predictions and residuals from the regression line y = 0.54x + 1.41.

                                                         mean   SD
x                 6      2      5      7      4          4.8    1.9
y                 3      2      4      6      5          4.0    1.58
prediction
(0.54x + 1.41)    4.65   2.49   4.11   5.19   3.57
residual         −1.65  −0.49  −0.11   0.81   1.43
Figure 15.6: A scatterplot of the hypothetical data from Table 15.4. The
regression line is shown in blue, the SD line in green. The dashed lines show
the prediction errors for each data point corresponding to the two lines.
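The fitted line for these data can be reproduced in a few lines, using the covariance form of the slope, b = s_xy/s_x², and intercept a = ȳ − bx̄ (a sketch, with the data of Table 15.4):

```python
# Least-squares line for the hypothetical data of Table 15.4:
# slope = s_xy / s_x^2, intercept = ybar - slope * xbar.
x = [6, 2, 5, 7, 4]
y = [3, 2, 4, 6, 5]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

s_xy = sum((u - xbar) * (v - ybar) for u, v in zip(x, y)) / (n - 1)
s_xx = sum((u - xbar) ** 2 for u in x) / (n - 1)

slope = s_xy / s_xx               # ≈ 0.54
intercept = ybar - slope * xbar   # ≈ 1.41
residuals = [v - (slope * u + intercept) for u, v in zip(x, y)]
print(round(slope, 2), round(intercept, 2))
print([round(e, 2) for e in residuals])   # ≈ [-1.65, -0.49, -0.11, 0.81, 1.43]
```

Note that the residuals sum to zero, a general property of least squares.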
(a) Volume against surface area regression (b) Surface area against IQ regression
Figure 15.7: Regression lines for predicting surface area from volume and
predicting IQ from surface area. Pink shaded region shows confidence inter-
val for slope of the regression line.
Figure 15.8: A scatterplot of brain surface area against brain volume, from
the data of Table 15.1. The regression line is shown in blue, the SD line
in green. The dashed lines show the prediction errors for each data point
corresponding to the two lines.
A confidence interval for the true slope β takes the form

b ± T · SE(b),

where T is the appropriate cutoff from the t distribution. Looking at the
t-distribution table in the row for 18 degrees of freedom and the column for
p = 0.05 (see Figure 15.9), we see that 95% of the probability
lies between −2.10 and +2.10, so that a 95% confidence interval for β is
b ± 2.10 · SE(b) = 0.841 ± 2.10 · 0.264 = (0.287, 1.40).
The scatterplot is shown with the regression line y = .841x + 959 in Figure
15.7(a), and the range of slopes corresponding to the 95% confidence interval
is shown by the pink shaded region. (Of course, to really understand the
uncertainty of the estimates, we would have to consider simultaneously the
random error in estimating the means, hence the intercept of the line. This
leads to the concept of a two-dimensional confidence region, which is beyond
the scope of this course.)
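The interval arithmetic can be packaged up; this sketch uses the approximate standard-error formula SE(b) ≈ b√(1 − r²)/(r√(n − 2)) that is implicit in the numbers of this section:

```python
import math

def slope_ci(b, r, n, t_star):
    """Approximate CI for a regression slope:
    SE(b) = (b / r) * sqrt(1 - r^2) / sqrt(n - 2), CI = b +/- t_star * SE."""
    se = (b / r) * math.sqrt(1 - r * r) / math.sqrt(n - 2)
    return b - t_star * se, b + t_star * se, se

# Surface area on volume: b = 0.841, r = 0.601, n = 20, t cutoff 2.10
lo, hi, se = slope_ci(0.841, 0.601, 20, 2.10)
print(round(se, 3), round(lo, 3), round(hi, 3))   # ≈ 0.264, 0.287, 1.395

# IQ on surface area: b = -0.031, r = -0.291
lo2, hi2, se2 = slope_ci(-0.031, -0.291, 20, 2.10)
print(round(lo2, 3), round(hi2, 3))               # ≈ -0.081, 0.019
```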
Similarly, for predicting IQ from surface area we have

b = r_yz · s_z/s_y = −0.291 · (13.2/125) = −0.031,    a = z̄ − bȳ = 160.

The standard error for b is

SE(b) ≈ b√(1 − r²)/(r√(n − 2)) = 0.031 · √(1 − 0.291²)/(0.291 · √18) = 0.024.
A 95% confidence interval for the true slope β is then given by −0.031 ±
2.10 · 0.024 = (−0.081, 0.019). The range of possible predictions of y from
x — pretending, again, that we know the population means exactly — is
given in Figure 15.7(b).
What this means is that each change of 1 cm3 in brain volume is typically
associated with a change of 0.841 cm2 in brain surface area. A person of
average brain volume — 1126 cm3 — would be expected to have average
brain surface area — 1906 cm2 — but if we know that someone has brain
volume 1226 cm3 , we would do well to guess that he has a brain surface
area of 1990 cm2 (1990 = 1906 + 0.841 · 100 = .841 · 1226 + 959). However,
given sampling variation the number of cm2 typically associated with 1 cm3
change in volume might really be as low as 0.287 or as high as 1.40, with
95% confidence.
Similarly, a person of average brain surface area 1906 cm² might be
predicted to have an average IQ of 101, but someone whose brain surface area
is 100 cm² above average might be predicted to have an IQ below average by 3.1
points, so 97.9. At the same time, we can only be 95% certain that the
change associated with 100 cm2 increase in brain surface area is between
−8.1 and +1.9 points — hence, it might just as well be 0. We say that the
correlation between IQ and brain surface area is not statistically significant,
or that the slope of the regression line is not significantly different from 0.
Figure 15.9: T table for confidence intervals for slopes computed in section
15.6.4
Lecture 16
Regression, Continued
16.1 R²
What does it mean to say that bxi + a is a good predictor of yi from xi ?
One way of interpreting this would be to say that we will typically make
smaller errors by using this predictor, than if we tried to predict yi without
taking account of the corresponding value of xi .
Suppose we have our standard regression probability model Y = βX +
α + E: this means that the observations are
yi = βxi + α + i .
Of course, we don’t really get to observe the terms: they are only inferred
from the relationship between xi and yi . But if we have X independent of
E, then we can use our rules for computing variance to see that
281
282 Regression II, Review
From the Galton data of section 15.4.2, we see that the variance of the child’s
height is 6.34. Since r² = 0.21, we say that the parents’ heights explain
21% of the variance in child’s height. We expect the residual variance
to be about 6.34 × 0.79 = 5.01. What this means is that the variance among
children whose parents were all about the same height should be 5.01. In
Figure 16.1 we see histograms of the heights of children whose parents all
had heights in the same range of ±1 inch. Not surprisingly, the shapes of
these histograms vary somewhat, but the variances
are all substantially smaller than 6.34, varying between 4.45 and 5.75.
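The explained-variance arithmetic here amounts to two lines of code (a check, using the figures quoted above):

```python
# Explained and residual variance for the Galton child heights:
# explained fraction = r^2, residual variance = (1 - r^2) * var(y).
r = 0.459          # parent-child correlation
var_child = 6.34   # variance of child's height

explained_fraction = r ** 2                        # ≈ 0.21
residual_variance = (1 - r ** 2) * var_child       # ≈ 5.0
print(round(explained_fraction, 2), round(residual_variance, 2))
```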
The threshold for t with 971 degrees of freedom (the ∞ row —
essentially the same as the normal distribution) at the α = 0.01 significance
level is 2.58. Hence, the correlation is highly significant. This is a good
example of where “significant” in the statistical sense should not be confused
with “important”. The difference is significant because the sample is so large
that it is very unlikely that we would have seen such a correlation purely by
chance if the true correlation were zero. On the other hand, explaining 1%
of the variance is unlikely to be seen as a highly useful finding. (At the same
time, it might be at least theoretically interesting to discover that there is
any detectable effect at all.)
Figure 16.1: Histograms of the heights of children whose parents’ heights were about 65, 67, 69, and 71 inches.
We have all heard stories of a child coming home proudly from school with
a score of 99 out of 100 on a test, and the strict parent who points out that
he or she had 100 out of 100 on the last test, and “what happened to the
other point?” Of course, we instinctively recognise the parent’s response
as absurd. Nothing happened to the other point (in the sense of the child
having fallen down in his or her studies); that’s just how test scores work.
Sometimes they’re better, sometimes worse. It is unfair to hold someone to
the standard of the last perfect score, since the next score is unlikely to be
exactly the same, and there’s nowhere to go but down.
Of course, this is true of any measurements that are imperfectly cor-
related: If |r| is substantially less than 1, the regression equation tells us
that those individuals who have extreme values of x tend to have values
of y that are somewhat less extreme. If r = 0.5, those with x values that
are 2 SDs above the mean will tend to have y values that are still above
average, but only 1 SD above the mean. There is nothing strange about
this: If we pick out those individuals who have exceptionally voluminous
brains, for instance, it is not surprising that the surface areas of their brains
are less extreme. While athletic ability certainly carries over from one sport
to another, we do not expect the world’s finest footballers to also win gold
medals in swimming. Nor does it seem odd that great composers are rarely
great pianists, and vice versa.
And yet, when the x and y are successive events in time — for instance,
the same child’s performance on two successive tests — there is a strong
tendency to attribute causality to this imperfect correlation. Since there
is a random component to performance on the test, we expect that the
successive scores will be correlated, but not exactly the same. The plot of
score number n against score number n + 1 might look like the upper left
scatterplot in Figure 15.3. If she had an average score last time, she’s likely
to score about the same this time. But if she did particularly well last time,
this time is likely to be less good. But consider how easy it would be to look
at these results and say, “Look, she did well, and as a result she slacked off,
and did worse the next time;” or “It’s good that we punish her when she
does poorly on a test by not letting her go outside for a week, because that
always helps her focus, and she does better the next time.” Galton noticed
that children of exceptionally tall parents were closer to average than the
parents were, and called this “regression to mediocrity” [Gal86].
The report goes on to say that “The department put the results of
the study close to the bottom of a list of appendices at the back of a
160-page report which claims that cameras play an important role in
saving lives.”
Species Body weight (kg) Body rank Brain weight (g) Brain rank
Table 16.1: Brain and body weights for 62 different land mammal species.
Available at http://lib.stat.cmu.edu/datasets/sleep, and as the ob-
ject mammals in the statistical language R.
290 Bibliography
body 62 21 32 20 61 42 4 53 31 47 14 58 16 54 7 30 18 12 25 49 60
brain 62 22 37 19 61 50 3 47 35 56 21 55 10 54 8 34 14 17 32 41 59
diff. 0 1 5 1 0 8 0 6 4 9 7 3 6 0 1 4 4 5 7 8 1
body 44 10 56 51 46 8 22 59 52 45 1 2 50 11 23 4 5 27 34 57 15
brain 44 6 53 52 45 16 18 58 46 39 1 2 60 13 23 5 4 20 24 57 30
diff. 0 4 3 1 1 8 4 1 6 6 0 0 10 2 0 2 1 7 10 0 15
body 41 26 55 29 39 13 38 40 17 35 43 48 24 6 19 28 9 37 34 36
brain 44 25 51 26 36 9 38 49 28 33 42 48 29 6 12 28 11 40 15 31
diff. 2 1 4 3 3 4 0 9 10 2 1 0 5 0 7 0 2 3 18 5
Table 16.2: Ranks for body and brain weights for 62 mammal species, from
Table 16.1, and the difference in ranks between body and brain weights.
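The rank differences in Table 16.2 are the raw material of Spearman’s rank correlation, which for untied ranks can be computed by the shortcut r_s = 1 − 6Σd²/(n(n² − 1)), or equivalently as the ordinary (Pearson) correlation of the two rank sequences. A small illustration on hypothetical ranks (not the mammal data):

```python
# Spearman's rank correlation two ways: the 1 - 6*sum(d^2)/(n(n^2-1))
# shortcut, and the Pearson correlation of the ranks themselves.
import math

rx = [1, 2, 3, 4, 5]   # hypothetical body-weight ranks
ry = [2, 1, 4, 3, 5]   # hypothetical brain-weight ranks
n = len(rx)

d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
r_s = 1 - 6 * d2 / (n * (n * n - 1))

def pearson(x, y):
    """Ordinary correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

print(r_s, pearson(rx, ry))   # both give 0.8 for these ranks
```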
Bibliography
[Bia05] Carl Bialik. When it comes to donations, polls don’t tell the
whole story. The Wall Street Journal, 2005.
[TPCE06] Anna Thorson, Max Petzold, Nguyen Thi Kim Chuc, and Karl
Ekdahl. Is exposure to sick or dead poultry associated with
flulike illness?: A population-based study from a rural area in
Vietnam with outbreaks of highly pathogenic avian influenza.
Archives of Internal Medicine, 166:119–23, 2006.
[ZZK72] Philip R. Zelazo, Nancy Ann Zelazo, and Sarah Kolb. “Walking”
in the newborn. Science, 176(4032):314–5, April 21, 1972.