Probability and Statistics II
George Deligiannidis
Module Lecturer 2020/21: Kalliopi Mylona
Contents

1 Introduction
1.1 Review of Hypothesis Testing
1.2 The z-test for the mean of a population
1.2.1 The Normal Density
1.2.2 The z-test
2 Bivariate distributions
2.1 Review of Continuous Random Variables
2.2 Review of Discrete Bivariate Distributions
2.2.1 Conditional Probability
2.3 Continuous Bivariate distributions
2.3.1 Multivariate
2.4 Expectation and friends
2.4.1 A few useful Properties
2.5 Conditional Distributions
2.5.1 Uniform Bivariate Distributions
2.6 Moments and Moment Generating Functions
3 Exponential Densities
3.1 Review of the Normal Distribution
3.1.1 Multivariate Normal Distribution
3.2 Poisson process
3.3 Exponential Distribution
3.4 The Gamma Distribution
3.5 Chi-squared
3.6 The F-distributions
3.7 Student's t-distribution
3.7.1 Testing variances
3.8 The Beta Density
4 Estimation
4.1 The Law of Large Numbers
4.1.1 The Law of Large Numbers
4.2 Efficiency
4.3 Maximum Likelihood Estimates
4.4 The Cramer-Rao Inequality
4.5 Sufficiency
4.5.1 Sufficient Statistics
4.5.2 Efficient statistics are sufficient
4.5.3 Examples of Sufficient Statistics
4.6 The Method of Moments
5 Theory of Testing
5.1 The Central Limit Theorem
5.2 Significance Testing
5.2.1 Hypothesis Testing
5.2.1.1 One-sided vs two-sided alternatives
5.2.2 Types of errors
5.3 The Neyman-Pearson Lemma
5.4 Uniformly most powerful
5.5 Likelihood Ratio Test
6 Tests
6.1 The z-test
6.1.1 Population mean
6.1.2 Testing for proportions
6.1.3 Two Sample Tests
6.2 The t-test
6.2.1 Student's t-test
6.2.2 Two Sample t-test
6.2.2.1 Paired Sample
6.2.2.1.1 Paired-sample t-test
6.2.2.2 Confidence Intervals
6.2.3 When to use z, when t?
6.3 The χ2-test
6.3.1 Goodness of Fit
6.3.2 Comparing with fixed distribution
6.3.3 Comparing with a Parametric family of distributions
6.3.4 Contingency Tables
6.3.5 Comparing Two Variances
7 Regression
7.1 A Single Explanatory Variable
7.2 More Complicated Regression Models
7.2.1 Multiple Regression
7.3 Distribution of the Regression Parameters
7.4 Nonlinear Models
7.5 Tests for correlation
7.5.0.0.1 Correlation and causality
7.6 Spearman's Rank Correlation Coefficient
Housekeeping
These notes are based on the notes kindly provided to me by Prof. Peter Saunders and Dr. George Deligiannidis, who both taught the course before. I would appreciate it if you point out to me any typos you spot ([email protected]).
Chapter 1
Introduction
1.1 Review of Hypothesis Testing
Suppose someone claims that a die is biased towards even numbers; to assess the claim we roll it a number of times and record the outcomes. A hypothesis test has the following ingredients.
Null Hypothesis H0 : the observation is entirely due to chance, that is P(even) = 1/2;
Alternative Hypothesis H1 : the die is indeed biased as claimed, that is P(even) > 1/2;
Test statistic: an observation, usually a number, that can be computed from the sample and that is informative about the hypothesis being tested. In our case this is the number of times we observed an even number, or the average. We must know the distribution of the statistic under the null hypothesis.
Rejection Region: we have to decide which values of the test statistic are too extreme given the null hypothesis.
To specify the rejection region we need a probabilistic model under the null hypothesis. Notice that in our setup we do have such a model: we assume that the probability of observing an even number is precisely 1/2.
We also have to specify the significance level of the test, α, which is usually a low probability such as .01, .05 or .1. The significance level is chosen by the statistician, always before even looking at the data, and it specifies how strict we are in assessing the likelihood of the observation under the null hypothesis. It determines the rejection region, in the sense that we will reject for a range of values that has probability at most α of occurring under the null hypothesis.
In our case, notice that under the null hypothesis, every time you roll the die you observe an even number with probability 1/2. You roll the die 30 times, and thus under the null hypothesis
\[ Y := \text{number of even numbers observed} \sim \mathrm{Bin}(30, 1/2), \]
that is
\[ P(Y = k) = \binom{30}{k} \frac{1}{2^{30}}, \qquad 0 \le k \le 30. \]
Let’s say that we fix the significance level at 5%. We can compute
P(Y ≥ 21) = .0214, P(Y ≥ 20) = .0494, P(Y ≥ 19) = .100,
and thus it makes sense to set the rejection region to be {y : y ≥ 20}. Since we observed 21 even
numbers, we conclude that at the 5% level, there is enough evidence to reject the null hypothesis that
P(even) = 1/2.
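As a quick sanity check, these binomial tail probabilities can be reproduced numerically; the following is a minimal sketch using SciPy (not part of the original notes):

    from scipy.stats import binom

    # P(Y >= k) for Y ~ Bin(30, 1/2); the survival function sf(k-1) gives P(Y > k-1) = P(Y >= k)
    for k in (19, 20, 21):
        print(k, binom.sf(k - 1, 30, 0.5))   # approximately 0.1002, 0.0494, 0.0214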
1.2 The z-test for the mean of a population
1.2.1 The Normal Density
[Figure: the standard normal density.]
The normal density arises very often in practice, in particular if the variations of individuals from their mean are effectively the sum of a large number of independent random fluctuations which are equally likely to be in either direction. Height, for example, is distributed approximately normally, presumably because there are lots of genetic and environmental factors, each of which has a small effect, even if the underlying distribution of each factor is not normal.
This is due to the central limit theorem, as we shall see later in the course, which gives further justification for using the normal density. On the other hand, there are lots of densities that are not normal – incomes, for example (because a relatively small number of people earn so much more than everyone else), or the number of hours of sunshine per day in many countries (because most days are either completely sunny or completely cloudy). You can draw some very incorrect conclusions if you do your calculations assuming normality when that's not justified.
Aside. This is a major issue in statistics. There are always a lot of assumptions involved that no one
states explicitly – that the density is normal, that the sample is random, that the sample is independent
and so on. If these assumptions fail, some or all of our conclusions may also be wrong. The job of
the statistician is not to apply a formula, or click the right menu item in SPSS or any other statistical
software, but rather to be able to judge whether a particular test is the appropriate one for the particular
dataset, model etc. Therefore a very important aspect of statistics is to have an excellent understanding
of the assumptions behind each test and statistical method.
1.2.2 The z-test
Suppose that you work for a pharmaceutical company which has devised a new drug for lowering blood
pressure. You want to test whether the drug is effective or not. You run an experiment and you measure
the drop in blood pressure(in some units) after giving a low dose of the drug to 10 patients
−1.9480, −0.6951, −0.0777, 1.7855, −1.4669, −0.6127, −0.8825, −2.7330, −1.2390, −2.5947.
Suppose that the measurements above arise from a Normal distribution with unknown mean µ and unit
variance σ 2 = 1.
First of all we set the significance level, based on how risky a wrong decision would be. Let’s say we
set it at α = .05.
Recall that the normal distribution with mean µ and variance σ 2 has probability density function
\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}. \]
In terms of the parameters of the underlying distribution, the hypothesis that the drug is effective is equivalent to µ < 0. This will be our alternative hypothesis, which we will test against the null hypothesis which postulates that µ = 0 and that the fluctuations we observed were entirely due to chance alone.
Intuitively, we should reject the null hypothesis if the value we observe is too negative, because that
should be very unlikely under the null hypothesis. Let’s see how to quantify this statement.
Recall from last year that if X_1, ..., X_n ∼ N(µ, σ²) are independent then
\[ X_1 + \cdots + X_n \sim N(n\mu, n\sigma^2), \]
and since for all random variables X and constants c ≠ 0 we have var(X/c) = var(X)/c², we observe that
\[ \frac{X_1 + \cdots + X_n}{n} \sim N\Big(\mu, \frac{\sigma^2}{n}\Big). \]
It’s always useful to standardise (subtract the mean and divide by the standard deviation) since statis-
tical tables are usually only provided for the standard normal distribution, and therefore we define the
quantity
\[ Z := \frac{\sqrt{n}}{\sigma}\Big[\frac{X_1 + \cdots + X_n}{n} - \mu\Big] \sim N(0, 1). \]
This is the z-statistic and is the basis of the z-test.
Let's compute the z-statistic of our data. Of course we have assumed that our observations are a sample from N(µ, 1), and thus µ is unknown. So which µ do we use to compute the z-statistic? We use the mean (or the parameter of interest) specified in the null hypothesis, in this case µ = 0. Therefore our z-statistic is
\[ z = -1.0464 \times \sqrt{10} = -3.3090. \]
Next, let's briefly think about what the rejection region should be. We are testing µ = 0 against µ < 0. We are looking for evidence against the null hypothesis and in support of the alternative, therefore we should reject the null hypothesis if the z-statistic turns out to be too negative (convince yourself that we are not looking for large positive values). Thus our rejection region will have the form (−∞, −z_α], where
\[ \int_{-\infty}^{-z_\alpha} \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\,dx = \alpha. \]
To find z_α we check any statistical table we have access to, and we find it to be approximately 1.645. Therefore the rejection region is (−∞, −1.645], and since our z-statistic lies well inside it we reject the null hypothesis.
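The whole calculation is easy to reproduce numerically; here is a minimal sketch (not part of the original notes), using the ten observations above with σ = 1 and the null value µ = 0:

    import numpy as np
    from scipy.stats import norm

    x = np.array([-1.9480, -0.6951, -0.0777, 1.7855, -1.4669,
                  -0.6127, -0.8825, -2.7330, -1.2390, -2.5947])
    # z-statistic under H0: mu = 0, with known sigma = 1
    z = np.sqrt(len(x)) * (x.mean() - 0.0) / 1.0
    print(z)                   # about -3.309
    print(norm.ppf(0.05))      # critical value, about -1.645
    print(norm.cdf(z))         # one-sided p-value, about 0.0005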
z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.5000 5040 5080 5120 5160 5199 5239 5279 5319 5359
0.1 0.5398 5438 5478 5517 5557 5596 5636 5675 5714 5753
0.2 0.5793 5832 5871 5910 5948 5987 6026 6064 6103 6141
0.3 0.6179 6217 6255 6293 6331 6368 6406 6443 6480 6517
0.4 0.6554 6591 6628 6664 6700 6736 6772 6808 6844 6879
0.5 0.6915 6950 6985 7019 7054 7088 7123 7157 7190 7224
0.6 0.7257 7291 7324 7357 7389 7422 7454 7486 7517 7549
0.7 0.7580 7611 7642 7673 7704 7734 7764 7794 7823 7852
0.8 0.7881 7910 7939 7967 7995 8023 8051 8078 8106 8133
0.9 0.8159 8186 8212 8238 8264 8289 8315 8340 8365 8389
1.0 0.8413 8438 8461 8485 8508 8531 8554 8577 8599 8621
1.1 0.8643 8665 8686 8708 8729 8749 8770 8790 8810 8830
1.2 0.8849 8869 8888 8907 8925 8944 8962 8980 8997 9015
1.3 0.9032 9049 9066 9082 9099 9115 9131 9147 9162 9177
1.4 0.9192 9207 9222 9236 9251 9265 9279 9292 9306 9319
1.5 0.9332 9345 9357 9370 9382 9394 9406 9418 9429 9441
1.6 0.9452 9463 9474 9484 9495 9505 9515 9525 9535 9545
1.7 0.9554 9564 9573 9582 9591 9599 9608 9616 9625 9633
1.8 0.9641 9649 9656 9664 9671 9678 9686 9693 9699 9706
1.9 0.9713 9719 9726 9732 9738 9744 9750 9756 9761 9767
Table 1.1: The standard normal table. The probability given refers to the shaded regions in Figure 1.2.2.
[Figure 1.2.2: The normal density. The shaded region has probability P.]
In fact, if the null hypothesis were true, the probability of observing a value less than or equal to our z-statistic would be about .0005; this is the so-called p-value.
Suppose now that instead our observations are
−0.9320, 1.2821, −1.2757, −0.8838, −1.1120, −1.5742, −0.5922, −0.6741, 1.1301, −0.6490.
In this case the z-statistic gives z = −1.6700, which is much smaller in magnitude than before. Indeed, this still lies in the rejection region, but intuitively the evidence against the null in this case is much weaker than before. One way to quantify this is to state the p-value, the probability of obtaining a test statistic at least as extreme as the observed one (given that H0 is true), in this case the probability of observing a z-statistic less than −1.67, which we find from the table to be .0475 or 4.75%. The p-value is a much more objective way of assessing hypotheses, and reporting it is often known as significance testing. The good thing about this is that you don't have to specify a significance level a priori, which is sometimes chosen arbitrarily.
In the last example we assumed that the observations are coming from a normal distribution. We used this to deduce that the z-statistic is also normally distributed, so that we knew its distribution under the null hypothesis.
It will often be the case that our observations themselves are not normally distributed; for example, look at the following 100 i.i.d. observations from some distribution with mean 1 and variance 1.
0.2851, 1.1475, 0.4602, 1.6923, 1.4971, 0.4663, 1.2907, 0.0109, 1.9815, 0.0560
1.4755, 0.2051, 1.0354, 0.3205, 0.9847, 3.8328, 1.3918, 2.7039, 1.5242, 0.3906
2.7460, 0.2369, 0.0030, 0.9933, 2.4361, 0.0937, 0.7949, 0.0625, 1.7030, 0.0118
0.2648, 0.1599, 1.4953, 0.1725, 0.4461, 0.2224, 1.4797, 4.0076, 3.1744, 0.2655
0.3987, 0.6819, 0.4270, 0.3089, 1.7114, 0.2932, 0.2176, 0.3800, 2.2355, 1.0886
0.3352, 0.4531, 0.5025, 0.5603, 3.1000, 0.2069, 0.0140, 0.2437, 0.4838, 0.4119
[Figure 1.2.3: The histogram of the observations against the normal density with the correct mean and variance.]
0.4431, 0.0504, 0.9487, 1.7324, 0.3241, 0.9589, 3.5068, 0.6271, 0.0622, 1.4099,
0.8698, 0.8120, 1.9506, 0.0436, 1.0572, 0.4824, 0.6242, 0.1218, 1.0372, 1.2191
0.9397, 2.8131, 3.6835, 1.3268, 0.4146, 0.5525, 2.4410, 0.1065, 0.8901, 0.3854
0.2032, 0.1430, 0.8649, 0.0784, 0.9575, 0.6348, 0.2205, 0.4685, 0.0158, 0.6389.
If you look at the histogram it looks decidedly non-normal; in particular it is one-sided, and all observations are non-negative. Suppose however that you repeat the experiment 1000 times, and each time you compute the z-statistic. The histogram of the computed statistics will look like Figure 1.2.4, which has now started to look normally distributed. This is a very important feature, and it is due to the Central Limit Theorem, which states essentially that the z-statistic from a large enough i.i.d. sample of any reasonable enough distribution will be approximately normally distributed. This essentially tells us that if the sample is large, then we know approximately the distribution of the statistic under the null hypothesis.
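A small simulation sketch (not part of the original notes) reproduces this behaviour, assuming for illustration samples of size 100 from the exponential distribution with mean 1:

    import numpy as np

    rng = np.random.default_rng(0)
    n, reps = 100, 1000
    samples = rng.exponential(scale=1.0, size=(reps, n))   # Exp(1): mean 1, variance 1
    z = np.sqrt(n) * (samples.mean(axis=1) - 1.0) / 1.0    # z-statistics using the true mean
    print(z.mean(), z.std())   # close to 0 and 1; a histogram of z looks roughly normal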
This is very important for hypothesis testing. In general we will always need to know the distribution
of the statistic under the null hypothesis at least approximately. There are many tests where this is
possible, but before we learn about them, we have to review some of the basic facts from last year and
build up our arsenal of tools.
[Figure 1.2.4: The histogram of 1000 z-statistics from an exponential distribution with mean 1 and variance 1.]
Chapter 2
Bivariate distributions
2.1 Review of Continuous Random Variables
Definition 1 (Distribution Function). Let X be a random variable. The distribution function of X, which we denote by F_X : R → [0, 1], is the function F_X(x) = P(X ≤ x) for x ∈ (−∞, ∞).
The distribution function is also called the Cumulative Distribution Function.
Theorem 1 (Properties of Distribution Functions). Let X be a random variable and F(x) := P(X ≤ x). Then:
1. limx→−∞ F (x) = 0;
2. limx→∞ F (x) = 1;
Whether a random variable is discrete or continuous can be deduced from the properties of the distribution function. In fact we have the following definition.
Definition 2 (Continuous Random Variable). A random variable X is continuous if there exists a non-negative function f, called the probability density function (pdf) of X, such that
\[ F(x) = \int_{-\infty}^{x} f(u)\,du \qquad \text{for all } x \in \mathbb{R}. \]
If X is a continuous random variable, then its distribution function is continuous. Also notice that
\[ P(X = a) = P(a \le X \le a) = \int_a^a f(u)\,du = 0. \]
Theorem 2. If X is a continuous random variable with pdf f, and f is continuous at x, then
\[ f(x) = F'(x) = \lim_{h \to 0} \frac{F(x + h) - F(x)}{h}. \]
Theorem 3 (Properties of PDF). The pdf f of a continuous random variable satisfies the following:
1. f ≥ 0;
2. \( \int_{-\infty}^{\infty} f(u)\,du = 1 \);
3. for all y,
\[ F(y) = \int_{-\infty}^{y} f(u)\,du. \]
The last two properties are fundamental for continuous random variables. While for discrete variables,
the probability mass function is an actual probability of an event, for continuous random variables, the
probability density function is not a probability. The integral of the probability density function over a
set however is a probability.
Theorem 4 (Change of Variables). Let X be a continuous random variable with probability density function f_X. Let g : supp(f_X) → R be strictly monotone and differentiable, where supp(f_X) := {x : f_X(x) > 0}, and define Y = g(X). Then Y is a continuous random variable and its probability density function is given by
\[ f_Y(y) = f_X\big(g^{-1}(y)\big)\,\Big|\frac{d}{dy}\, g^{-1}(y)\Big|. \]
Proof. We will compute the density by differentiating the distribution function. Suppose first that g is
strictly increasing
FY (y) = P[Y ≤ y]
= P[g(X) ≤ y]
= P[X ≤ g −1 (y)]
= FX (g −1 (y)).
\[ f_Y(y) = \frac{d}{dy}\, F_X(g^{-1}(y)) = F_X'(g^{-1}(y))\, \frac{d}{dy}\, g^{-1}(y) = f_X(g^{-1}(y))\, \frac{d}{dy}\, g^{-1}(y). \]
The proof is similar for strictly decreasing g.
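As an illustration (not from the notes), the formula can be checked numerically for a specific monotone transformation, here g(x) = e^x applied to X ∼ Exp(1), so that g^{-1}(y) = log y:

    import numpy as np

    rng = np.random.default_rng(4)
    x = rng.exponential(1.0, size=200_000)       # X ~ Exp(1)
    y = np.exp(x)                                # Y = g(X) = exp(X)

    # Change-of-variables prediction: f_Y(y) = f_X(log y) * |1/y| = 1/y**2 for y >= 1
    grid = np.array([1.5, 2.0, 3.0, 4.0])
    predicted = 1.0 / grid**2

    # Empirical density estimate from a histogram, normalised by the full sample size
    counts, edges = np.histogram(y, bins=400, range=(1.0, 5.0))
    width = edges[1] - edges[0]
    centres = 0.5 * (edges[:-1] + edges[1:])
    empirical = np.interp(grid, centres, counts / (len(y) * width))
    print(np.round(predicted, 3))
    print(np.round(empirical, 3))                # the two rows should roughly agree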
2.2 Review of Discrete Bivariate Distributions
Last year you saw discrete bivariate probability distributions. If X and Y are two discrete random
variables taking values in a discrete set S, then we are often interested in their joint behaviour; we can
define their joint probability mass function
\[ p(x, y) := P(X = x, Y = y), \qquad x, y \in S, \]
together with the marginal probability mass functions
\[ p_X(x) := \sum_{y \in S} p(x, y), \qquad p_Y(y) := \sum_{x \in S} p(x, y). \]
These are the probability mass functions of X and Y separately. They are called marginal distributions
because if you write the values of the function p(x, y) as entries in a rectangular array, the sums go into
the bottom and right hand margins.
Definition 3 (Independence). Two discrete random variables X, Y, with joint probability mass function p(x, y), are independent if for all x, y
\[ p(x, y) = p_X(x)\, p_Y(y). \]
Two variables are independent if the "joint is the product of the marginals".
Example 1. Suppose that you roll two dice and define X, Y to be the two outcomes. Then S =
{1, 2, 3, 4, 5, 6} and for any x, y ∈ S
\[ p(x, y) = \frac{1}{36} = \frac{1}{6} \times \frac{1}{6} = p_X(x)\, p_Y(y). \]
Clearly, and in agreement with intuition, the two dice are independent.
Example 2. On the other hand suppose that you have an urn with 5 black balls and 3 red balls. We
sample two balls without replacement. Let X, Y ∈ {0, 1}, where X = 0 if the first ball is red and
X = 1 if the first ball is black, and similarly for Y and the second ball. Then we can record the joint
probability distribution as follows:
X\Y       0        1        p_X(x)
0         6/56     15/56    21/56
1         15/56    20/56    35/56
p_Y(y)    21/56    35/56
It is easy to see that in this case the two variables are not independent.
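The joint probabilities in the table can also be checked by brute-force enumeration of the ordered draws; a short sketch (not from the notes):

    from itertools import permutations
    from collections import Counter

    balls = ['B'] * 5 + ['R'] * 3                     # 5 black, 3 red
    counts = Counter()
    for first, second in permutations(range(8), 2):   # all 56 ordered pairs of distinct balls
        x = 0 if balls[first] == 'R' else 1
        y = 0 if balls[second] == 'R' else 1
        counts[(x, y)] += 1
    total = sum(counts.values())
    print({k: f"{v}/{total}" for k, v in sorted(counts.items())})
    # {(0, 0): '6/56', (0, 1): '15/56', (1, 0): '15/56', (1, 1): '20/56'}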
2.2.1 Conditional Probability
Recall that for any two events A and B with P(B) > 0,
\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)}. \]
Also recall Bayes’ theorem
Theorem 5 (Bayes’ Theorem). For any two events A and B such that P(A), P(B) > 0 we have
\[ P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}. \tag{2.2.1} \]
When given two discrete random variables X and Y and their joint probability mass function p(x, y),
we can define the conditional probability mass function of X given Y .
Definition 4. Let X, Y be two random variables with joint probability mass function p(x, y). Then the
conditional probability mass function of X given Y = y is defined as
\[ p(x \mid y) = \frac{p(x, y)}{p_Y(y)} = \frac{P(X = x, Y = y)}{P(Y = y)}. \]
This is the standard conditional probability of the event A = {X = x} given the event B = {Y = y}.
For Example 2 the conditional probability distribution of X given Y = 1 is
\[ p_{X|Y}(0 \mid 1) = \frac{p(0, 1)}{p_Y(1)} = \frac{15/56}{35/56} = \frac{3}{7}. \]
2.3 Continuous Bivariate distributions
Definition 5 (Joint Distribution Function). Given two random variables X and Y, their joint distribution function is the function F : R × R → [0, 1] given by
\[ F(x, y) = P(X \le x, Y \le y). \]
Next we define jointly continuous random variables and in the process we also define the joint proba-
bility density function.
Definition 6 (Jointly Continuous Random Variables). Two continuous random variables X and Y are
jointly continuous if there is a function fX,Y : R2 → [0, ∞) called the joint probability density function
such that
\[ P(X \le s, Y \le t) = \int_{-\infty}^{s} \int_{-\infty}^{t} f_{X,Y}(x, y)\,dy\,dx. \]
A joint density function f(x, y) is necessarily non-negative and normalised:
\[ \iint_{\mathbb{R}^2} f(x, y)\,dx\,dy = 1. \]
Moreover, the joint density can be recovered from the joint distribution function through
\[ f_{X,Y}(x, y) = \frac{\partial^2}{\partial x\, \partial y} F_{X,Y}(x, y), \]
whenever the partial derivative is defined.
Notice that two continuous random variables do not necessarily have to be jointly continuous.
Example 3. Let X ∼ U[0, 1], a uniform random variable in the interval [0, 1], or in other words let X have pdf f_X(x) = 1 for x ∈ [0, 1] and f_X(x) = 0 otherwise. Let Y = X. Then both X and Y are continuous random variables. However they are not jointly continuous. Their joint distribution function is given for x, y ∈ [0, 1] by
\[ F(x, y) = P(X \le x, X \le y) = \min\{x, y\}, \]
and if a joint density existed it would have to be
\[ f(x, y) = \frac{\partial^2}{\partial x\, \partial y} \min\{x, y\} = 0, \]
which of course makes no sense.
Theorem 7 (Properties of Joint Distribution Functions). If X, Y have joint distribution function F then
1. For all y,
\[ \lim_{x \to -\infty} F(x, x) = \lim_{x \to -\infty} F(x, y) = \lim_{x \to -\infty} F(y, x) = 0. \]
2. \( \lim_{x \to \infty} F(x, x) = 1 \);
3. For all x, y, \( \lim_{x \to \infty} F(x, y) = F_Y(y) \) and \( \lim_{y \to \infty} F(x, y) = F_X(x) \), where F_X and F_Y are the distribution functions of X and Y;
4. If x_1 ≤ x_2 and y_1 ≤ y_2 then F(x_1, y_1) ≤ F(x_2, y_2).
Definition 7 (Marginal Densities). Let X, Y be jointly continuous with pdf f (x, y). Then X and Y are
continuous random variables. The individual probability density functions of X and Y are called the
marginal densities of X and Y respectively and are given by
\[ f_X(x) = \int_{-\infty}^{\infty} f(x, y)\,dy, \qquad f_Y(y) = \int_{-\infty}^{\infty} f(x, y)\,dx. \]
Definition 8 (Independent Random Variables). Two random variables X, Y with joint distribution function F_{X,Y} and marginal distribution functions F_X, F_Y are said to be independent if for all x, y
\[ F_{X,Y}(x, y) = F_X(x)\, F_Y(y). \]
When the variables are jointly continuous, independence is more conveniently checked using the joint
density function.
Theorem 8. Let X, Y be jointly continuous random variables with joint density f_{X,Y}(x, y). Then X and Y are independent if and only if there exist functions h, g : R → [0, ∞) such that for all x, y
\[ f_{X,Y}(x, y) = h(x)\, g(y). \]
Proof. "⇒". If X and Y are independent then for all x, y we have F_{X,Y}(x, y) = F_X(x) F_Y(y), and we conclude that f_{X,Y} = f_X f_Y, at least in the sense that for all x, y the integrals
\[ \int_{-\infty}^{x}\int_{-\infty}^{y} f_{X,Y}(r, s)\,ds\,dr = \int_{-\infty}^{x}\int_{-\infty}^{y} f_X(r)\, f_Y(s)\,ds\,dr \]
agree.
"⇐". Let c_g := ∫ g(y)\,dy and c_h := ∫ h(x)\,dx. First of all notice that
\[ f_X(x) = \int f_{X,Y}(x, y)\,dy = h(x) \int g(y)\,dy = c_g\, h(x), \]
and similarly
\[ f_Y(y) = \int f_{X,Y}(x, y)\,dx = g(y) \int h(x)\,dx = c_h\, g(y), \]
and since ∬ f_{X,Y}(x, y)\,dx\,dy = 1 we have that c_g c_h = 1. Thus
\[ f_{X,Y}(x, y) = h(x)\, g(y) = c_g h(x)\, c_h g(y) = f_X(x)\, f_Y(y), \]
so X and Y are independent.
2.3.1 Multivariate
Everything in this section extends to more than two random variables in the obvious way. Concepts like jointly continuous, joint density f(x_1, ..., x_n), joint distribution function and the relationships between them carry over, for example
\[ F(x_1, \ldots, x_n) = P(X_1 \le x_1, \ldots, X_n \le x_n) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_n} f(y_1, \ldots, y_n)\,dy_1 \cdots dy_n. \]
For independence we have "mutual independence", which means that the distribution function and the density factorise fully into the product of the marginals,
\[ f(y_1, \ldots, y_n) = f_1(y_1) \cdots f_n(y_n), \]
where
\[ f_i(y_i) = \int_{\mathbb{R}^{n-1}} f(y_1, \ldots, y_n)\,dy_1 \ldots dy_{i-1}\,dy_{i+1} \ldots dy_n \]
is the i-th marginal.
Notice that in the multivariate case, we can have marginal distributions and densities for any subset of the variables. That is, say I ⊆ {1, ..., n}; then the marginal of the variables (X_i : i ∈ I) is found by integrating the joint pdf with respect to the variables not in I. For example, f_{X_1 X_3}(x_1, x_3) is found by integrating over all variables except for x_1 and x_3.
A very important case is that of independent and identically distributed random variables which we
abbreviate as i.i.d. We say that X1 , . . . , Xn are i.i.d. if they are (mutually) independent and have the
same distribution, that is their marginals are all the same.
Remark 9. We can also have pairwise independence where any two variables are independent.
As an example consider a die with four faces containing the following numbers respectively: 1, 2, 3 and 123 (so the fourth face shows all three numbers). You roll the die and you define the random variables X_i, i = 1, 2, 3, where X_i = 1 if i appears on the top face, and X_i = 0 otherwise, so that X_i ∼ Ber(1/2). Then it's an easy exercise to check that for all x_i, x_j ∈ {0, 1} with i ≠ j we have
\[ P(X_i = x_i, X_j = x_j) = \frac{1}{4} = \frac{1}{2} \times \frac{1}{2} = P(X_i = x_i)\, P(X_j = x_j), \]
and thus any two variables are independent. However they are not mutually independent since
\[ P(X_1 = X_2 = X_3 = 1) = \frac{1}{4} \neq \frac{1}{2} \times \frac{1}{2} \times \frac{1}{2} = P(X_1 = 1)\, P(X_2 = 1)\, P(X_3 = 1). \]
In this course whenever we say that the variables X1 , . . . , Xn are independent we will mean mutually
independent.
2.4 Expectation and friends
Recall that if Y = g(X) for a strictly monotone, differentiable g, then the density of Y is
\[ f_Y(y) = f_X(g^{-1}(y))\, \Big|\frac{d g^{-1}(y)}{dy}\Big|. \]
However this is not necessary to compute the expectation of g(X), or of any function of g(X), since the expectation of the random variable g(X) can be computed through
\[ E\, g(X) = \int g(x)\, f(x)\,dx, \]
where f is the density of X.
This generalises to two or more variables in the obvious way. Let g(X, Y ) be a function of two random
variables X, Y with joint probability density f (x, y). Then the expected value of g is defined to be
\[ E[g(X, Y)] = \iint g(x, y)\, f(x, y)\,dx\,dy. \]
In particular,
\[ \mathrm{var}(X) = \iint x^2 f(x, y)\,dy\,dx - (E[X])^2, \qquad \mathrm{var}(Y) = \iint y^2 f(x, y)\,dy\,dx - (E[Y])^2. \]
Recall also that the covariance of X and Y is cov(X, Y) := E[XY] − E[X]E[Y], and the correlation coefficient is
\[ \rho := \frac{\mathrm{cov}(X, Y)}{\sqrt{\mathrm{var}(X)\,\mathrm{var}(Y)}}. \]
The point of this is that ρ is dimensionless and therefore scale invariant: if we replace X by aX and Y by bY (with a, b > 0), then cov(X, Y) is increased by a factor ab but ρ remains the same.
Linearity of expectation: For any random variables X, Y with finite expectations and any λ, µ ∈ R,
\[ E[\lambda X + \mu Y] = \lambda\, E[X] + \mu\, E[Y]. \]
Proof. The first property we will not prove. It is trivial for discrete random variables, and for variables that admit a probability density function it is a property of integrals.
The second one is also trivial for discrete variables. For variables admitting a pdf we have to prove that if X ≥ 0 and X admits a pdf, then E X > 0. In this case
\[ P(X > x) = 1 - F(x) = \int_x^\infty f_X(y)\,dy \]
is continuous and monotone decreasing, going from 1 to 0, so there exists c > 0 such that P(X > c) > 0. From the first property we have that
\[ E X = E\big[X \mathbf{1}\{X > c\}\big] + E\big[X \mathbf{1}\{X \le c\}\big] \ge E\big[X \mathbf{1}\{X > c\}\big] \ge E\big[c\, \mathbf{1}\{X > c\}\big] = c\, P(X > c) > 0. \]
Variance of sum: The variance of a sum or difference is given by
\[ \mathrm{var}(X \pm Y) = E[(X \pm Y)^2] - \big(E[(X \pm Y)]\big)^2 = E[X^2] \pm 2E[XY] + E[Y^2] - (E[X])^2 \mp 2E[X]E[Y] - (E[Y])^2 = \mathrm{var}(X) + \mathrm{var}(Y) \pm 2\,\mathrm{cov}(X, Y). \]
We are often interested in the special case of n identically distributed independent variables, which we usually denote by X_1, X_2, ..., X_n. From the above we have
\[ E[\bar X] = E[X], \]
where
\[ \bar X := \frac{1}{n}(X_1 + \cdots + X_n), \]
and
\[ \mathrm{var}(X_1 + X_2 + \cdots + X_n) = n\,\mathrm{var}(X) \implies \mathrm{var}(\bar X) = \frac{1}{n^2}\, n\,\mathrm{var}(X) = \frac{1}{n}\,\mathrm{var}(X). \]
Covariance of independent variables: If X and Y are independent then
\[ \mathrm{cov}(X, Y) = 0. \]
Scaling of variance: For any random variable X such that E X² < ∞ and a ∈ R,
\[ \mathrm{var}(aX) = a^2\, \mathrm{var}(X). \]
Since the leading coefficient and the constant term of the above polynomial in λ are positive (the parabola opens upwards), the inequality holds for all λ if and only if the polynomial has at most one real root, that is if and only if the discriminant is non-positive, i.e. 4E[XY]² − 4 ≤ 0, or equivalently −1 ≤ E[XY] ≤ 1.
Example 4. Consider two variables X, Y with joint density
\[ f(x, y) = 6(x - y), \qquad 0 < y < x < 1, \]
with of course f(x, y) = 0 outside the triangle with vertices (0,0), (1,0), (1,1). First, we can verify that this is a proper probability density. It is clearly non-negative, and then
\[ \int_0^1 \int_0^x 6(x - y)\,dy\,dx = \int_0^1 \big(6xy - 3y^2\big)\Big|_0^x\,dx = \int_0^1 3x^2\,dx = 1, \]
as required. The marginal densities for each variable are obtained by integrating over the other one:
\[ f_X(x) = \int_0^x 6(x - y)\,dy = \big(6xy - 3y^2\big)\Big|_0^x = 3x^2, \]
\[ f_Y(y) = \int_y^1 6(x - y)\,dx = \big(3x^2 - 6xy\big)\Big|_y^1 = 3 - 6y + 3y^2 = 3(1 - y)^2, \]
with both zero for arguments outside the interval [0, 1]. The marginal densities are just the sort of
ordinary density we are used to and so are treated in the familiar way. Integrating each of them over the
interval gives unity. Also
\[ E(X) = \int_0^1 x \cdot 3x^2\,dx = \frac{3}{4}, \qquad E(Y) = \int_0^1 y \cdot 3(1 - y)^2\,dy = \frac{3}{2} - 2 + \frac{3}{4} = \frac{1}{4}, \]
and
\[ E(X^2) = \int_0^1 x^2 \cdot 3x^2\,dx = \frac{3}{5}, \qquad E(Y^2) = \int_0^1 y^2 \cdot 3(1 - y)^2\,dy = 1 - \frac{3}{2} + \frac{3}{5} = \frac{1}{10}, \]
so that var(X) = 3/80 and var(Y ) = 3/80. These are equal because of the symmetries of the problem,
though this may not be obvious at first sight.
Note that if you aren’t convinced about using the marginal density, it’s really the same as taking a
double integral over the whole region except that we’ve already done the first bit. Thus
\[ E(X) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x\, f(x, y)\,dy\,dx = \int_{-\infty}^{\infty} x \int_{-\infty}^{\infty} f(x, y)\,dy\,dx = \int_{-\infty}^{\infty} x\, f_X(x)\,dx. \]
Working out E(Y ) and E(Y 2 ) is the same, except that to make the trick work we do the x integration
first. To work out the covariance we have to use the joint probability density and integrate over the whole
region, and this time we have to do the whole integration explicitly:
\[ E(XY) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} xy\, f(x, y)\,dy\,dx, \]
\[ E(XY) = \int_0^1\int_0^x xy \cdot 6(x - y)\,dy\,dx = \int_0^1 \Big(\int_0^x (6x^2 y - 6xy^2)\,dy\Big) dx = \int_0^1 \big[3x^2 y^2 - 2xy^3\big]_0^x\,dx = \int_0^1 x^4\,dx = \frac{1}{5}. \]
Hence
\[ \mathrm{cov}(X, Y) = E(XY) - E(X)E(Y) = \frac{1}{5} - \frac{3}{4} \cdot \frac{1}{4} = \frac{1}{80}. \]
Because var(X) = var(Y) = 3/80, the correlation coefficient is
\[ \rho = \frac{E(XY) - E(X)E(Y)}{\sqrt{\mathrm{var}(X)\,\mathrm{var}(Y)}} = \frac{1}{3}. \]
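The moments in Example 4 can be checked numerically by integrating over the triangle 0 < y < x < 1; a sketch using SciPy (not part of the notes):

    from scipy.integrate import dblquad
    import numpy as np

    f = lambda y, x: 6 * (x - y)                  # dblquad integrates func(y, x)
    EX,  _ = dblquad(lambda y, x: x * f(y, x), 0, 1, lambda x: 0.0, lambda x: x)
    EY,  _ = dblquad(lambda y, x: y * f(y, x), 0, 1, lambda x: 0.0, lambda x: x)
    EXY, _ = dblquad(lambda y, x: x * y * f(y, x), 0, 1, lambda x: 0.0, lambda x: x)
    EX2, _ = dblquad(lambda y, x: x**2 * f(y, x), 0, 1, lambda x: 0.0, lambda x: x)
    EY2, _ = dblquad(lambda y, x: y**2 * f(y, x), 0, 1, lambda x: 0.0, lambda x: x)
    rho = (EXY - EX * EY) / np.sqrt((EX2 - EX**2) * (EY2 - EY**2))
    print(EX, EY, EXY, rho)   # 0.75, 0.25, 0.2, 0.333...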
Example 5. Consider now the uniform distribution on the unit disc, with density f(x, y) = 1/π for x² + y² ≤ 1 and 0 otherwise. Then
\[ E(X) = \frac{1}{\pi}\iint_{x^2 + y^2 \le 1} x\,dx\,dy = \frac{1}{\pi}\int_0^1\int_0^{2\pi} r\cos\theta\; r\,d\theta\,dr = 0, \]
which was obvious by symmetry in the first place. Similarly E(Y) = 0. And also (if this too isn't obvious)
\[ E(XY) = \frac{1}{\pi}\iint xy\,dx\,dy = \frac{1}{\pi}\int_0^1\int_0^{2\pi} r\cos\theta\; r\sin\theta\; r\,d\theta\,dr. \]
But the θ integration is clearly zero, either by letting φ = 2θ or by symmetry, or whatever. Hence cov(X, Y) = E(XY) − E(X)E(Y) = 0.
Fact Independent variables are uncorrelated. But uncorrelated variables are not necessarily indepen-
dent.
Example 6. Let X be uniform in [−1, 1] and Y = X 2 . Then you can show that E[XY ] = E[X 3 ] = 0
and E[X] E[Y ] = 0 and thus the covariance is zero. However it is clear that the two variables are not
independent.
Can you extend this example to more general distributions?
2.5 Conditional Distributions
Recall that for two discrete random variables the conditional probability mass function of X given Y = y is
\[ p_{X|Y}(x \mid y) = \frac{p(x, y)}{p_Y(y)} = \frac{P(X = x, Y = y)}{P(Y = y)}. \]
For continuous distributions the role of conditional probability mass functions is here played by condi-
tional densities.
Definition 11 (Conditional PDF). Let X, Y be jointly continuous with joint density f (x, y) and marginals
fX and fY . Then for any y such that fY (y) > 0 the conditional density of X given Y = y is given by
\[ f_{X|Y}(x \mid y) = \frac{f(x, y)}{f_Y(y)}. \]
This gives the familiar ”Joint density divided by the marginal density of the other variable.”
To check that this indeed makes sense as a definition for a density function we easily check first that
it is non-negative and that
\[ \int_x f_{X|Y}(x \mid y)\,dx = \frac{1}{f_Y(y)} \int_x f(x, y)\,dx = \frac{f_Y(y)}{f_Y(y)} = 1. \]
Similarly, the conditional density of Y given X = x is
\[ f_{Y|X}(y \mid x) = \frac{f(x, y)}{f_X(x)}, \]
providing that f_X(x) > 0.
I’ve already said that in the continuous case we use exactly the same definition. But we have to bear
in mind that we are working with probability densities, not actual probabilities, and so we have to check
that the f (y|x) as so defined is what it ought to be.
It is certainly nonnegative, so we check that it integrates to unity:
\[ \int_{-\infty}^{\infty} f_{Y|X}(y \mid x)\,dy = \int_{-\infty}^{\infty} \frac{f(x, y)}{f_X(x)}\,dy = \frac{1}{f_X(x)} \int_{-\infty}^{\infty} f(x, y)\,dy = \frac{f_X(x)}{f_X(x)} = 1. \]
The conditional mean, or expectation, of Y given X is just the expected value of Y in the case where
the value of X is known, so it is evaluated using the conditional distribution
\[ E(Y \mid X = x) = \int_{-\infty}^{\infty} y\, f(y \mid x)\,dy. \]
The conditional mean E(Y | X = x) is a function of x, and it is called the regression function of Y on
X. It’s a generalisation of the idea of a regression line when the relation is linear: in that case the line
goes through the points where you expect Y to be for any given value of X.
Since E(Y | X = x) is a function of x we can compute its expected value wrt the distribution of X.
Perhaps not surprisingly, we get the mean of Y :
\[ E[E(Y \mid X)] = \int \Big(\int y\, f_{Y|X}(y \mid x)\,dy\Big) f_X(x)\,dx = \iint y\, f_{Y|X}(y \mid x)\, f_X(x)\,dy\,dx = \iint y\, f(x, y)\,dx\,dy = \int y\, f_Y(y)\,dy = E(Y). \]
where the expected values are worked out using the conditional densities. The expected value of the
conditional variance is not, however, the unconditional variance.
Consider again the joint density of Example 4, f(x, y) = 6(x − y) for 0 < y < x < 1, and 0 elsewhere. The marginal densities for each variable are obtained by integrating over the other one:
\[ f_X(x) = \int_0^x 6(x - y)\,dy = \big(6xy - 3y^2\big)\Big|_0^x = 3x^2, \]
\[ f_Y(y) = \int_y^1 6(x - y)\,dx = \big(3x^2 - 6xy\big)\Big|_y^1 = 3 - 6y + 3y^2 = 3(1 - y)^2. \]
Each of the conditional densities is just the joint density divided by the "other" marginal density:
\[ f_{X|Y}(x \mid y) = \frac{2(x - y)}{(1 - y)^2}, \qquad y < x < 1, \]
\[ f_{Y|X}(y \mid x) = \frac{2(x - y)}{x^2}, \qquad 0 < y < x. \]
To use these, we just make the indicated substitutions. Thus the conditional density of Y given that X = 1/2 is
\[ f\Big(y \,\Big|\, x = \tfrac{1}{2}\Big) = \frac{2(\tfrac{1}{2} - y)}{\tfrac{1}{4}}, \]
only we have to be careful about the range. If x = 1/2 then Y must lie in the range 0 < y < 1/2. We check:
\[ \int_0^{1/2} \frac{2(\tfrac{1}{2} - y)}{\tfrac{1}{4}}\,dy = \big(4y - 4y^2\big)\Big|_0^{1/2} = 1. \]
Note that, as advertised, these have different functional forms. The regression functions are:
\[ E(X \mid y) = \frac{\int_y^1 2x(x - y)\,dx}{(1 - y)^2} = \frac{y + 2}{3}, \]
\[ E(Y \mid x) = \frac{\int_0^x 2y(x - y)\,dy}{x^2} = \frac{x}{3}. \]
We can check by finding the unconditional means as the expected values of the conditional means:
\[ E(X) = E[E(X \mid Y)] = E\Big[\frac{Y + 2}{3}\Big] = \frac{E(Y)}{3} + \frac{2}{3} = \frac{3}{4}, \]
\[ E(Y) = E[E(Y \mid X)] = E\Big[\frac{X}{3}\Big] = \frac{1}{4}. \]
All this generalizes easily to n random variables X1 , X2 , . . . Xn , though the calculations can get
tricky.
Note, by the way, that we have slightly awkward notation here. We don’t usually bother to indicate
what it is that we are taking the expected value with respect to because it is usually obvious. Here it is a
little bit less clear. It might have helped (though it wouldn’t be standard) to write E(X) = EY [E(X|Y )]
to stress that we are averaging over Y all the expected values E(X|y). As often happens, it is clear
enough when you think about it — since E(X|y) is a function of y but not of x any averaging has to be
over y — but you do have to think about it.
2.5.1 Uniform Bivariate Distributions
This is one of the simplest classes of continuous bivariate distributions. Suppose (X, Y ) is uniformly
distributed in some set A ⊂ R2 , i.e. suppose that f (x, y) = C over the entire range for which it is not
zero.
Similarly to the univariate case we have to determine C, and this turns out to depend on the range. Since f(x, y) = C on A and 0 elsewhere,
\[ \iint_{\mathbb{R}^2} f(x, y)\,dx\,dy = \iint_A C\,dx\,dy = 1, \]
so C is the reciprocal of the area of A.
Suppose, for example, that (X, Y) is uniformly distributed on the rectangle
\[ A := \{(x, y) : 0 \le x \le 1,\ 0 \le y \le 2\}. \]
Thus
\[ f(x, y) = \begin{cases} C, & 0 < x < 1 \text{ and } 0 < y < 2, \\ 0, & \text{otherwise}, \end{cases} \]
and since the area of A is 2, C = 1/2.
The marginal densities are
\[ f_X(x) = \frac{1}{2}\int_0^2 dy = 1, \quad x \in [0, 1], \qquad f_Y(y) = \frac{1}{2}\int_0^1 dx = \frac{1}{2}, \quad y \in [0, 2], \]
that is X and Y are uniform on [0, 1] and [0, 2] respectively.
Also, since for (x, y) ∈ A
\[ f_{X,Y}(x, y) = \frac{1}{2} = 1 \times \frac{1}{2} = f_X(x) \times f_Y(y), \]
X and Y are independent.
Suppose we want to compute P(X < 1/2, Y > 1). This is given by integrating the density over the appropriate region, that is
\[ P\Big(X < \tfrac{1}{2},\, Y > 1\Big) = \frac{1}{2}\int_0^{1/2}\int_1^2 dy\,dx = \frac{1}{4}. \]
Example 9. Suppose now that (X, Y) are uniformly distributed over the region below the diagonal of the same rectangle:
\[ f(x, y) = \begin{cases} C, & 0 < x < 1,\ 0 < y < 2x, \\ 0, & \text{otherwise}. \end{cases} \]
We work out the constant from the condition
\[ \int_0^1 \int_0^{2x} C\,dy\,dx = 1, \]
and thus C = 1.
Are X and Y independent? You might think they are, because f(x, y) seems to factor into a product of trivial factors. However this is not the case. Suppose that you want to claim that f(x, y) = h(x)g(y), where h(x) = g(y) = 1 for all x, y. Such a factorisation would have to hold everywhere, but the range of x depends on y and therefore the joint density does not factor.
The marginal densities are
\[ f_X(x) = \int_0^{2x} dy = 2x, \qquad f_Y(y) = \int_{y/2}^1 dx = 1 - \frac{y}{2}. \]
Note that
\[ \int_0^1 f_X(x)\,dx = x^2\Big|_0^1 = 1, \qquad \int_0^2 f_Y(y)\,dy = \Big(y - \frac{y^2}{4}\Big)\Big|_0^2 = 1, \]
as they should.
Since the variables are not independent it makes sense to compute the conditional densities, which are given by
\[ f(x \mid y) = \frac{1}{1 - \frac{y}{2}}, \qquad f(y \mid x) = \frac{1}{2x}, \]
which are clearly not equal to the marginal densities. These are probability densities too, and should integrate to unity over their entire ranges, but we have to be careful because the range of integration can (and in these cases does) depend on the other variable. This is where it really helps to have a diagram.
So we have
\[ \int_{y/2}^1 \frac{dx}{1 - \frac{y}{2}} = 1, \qquad \int_0^{2x} \frac{dy}{2x} = 1. \]
We can use these in an obvious way to calculate conditional probabilities. Typically, we might want to know the probability that X > 2/3 given that Y = 3/4. We simply substitute 3/4 for y in the expression for f(x|y) and integrate, being careful about the range:
\[ P\Big(X > \tfrac{2}{3} \,\Big|\, Y = \tfrac{3}{4}\Big) = \int_{2/3}^1 \frac{dx}{1 - \tfrac{3}{8}} = \frac{8}{15}. \]
If we want P(X < 2/3 | Y = 3/4), we have to be a bit careful about the lower limit:
\[ P\Big(X < \tfrac{2}{3} \,\Big|\, Y = \tfrac{3}{4}\Big) = \int_{3/8}^{2/3} \frac{dx}{1 - \tfrac{3}{8}} = \frac{7}{15}. \]
Of course we could also work out the probability for X > 2/3 and subtract.
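These two conditional probabilities are easy to verify numerically; a sketch (not from the notes):

    from scipy.integrate import quad

    f_cond = lambda x: 1 / (1 - 3/8)            # f(x | y = 3/4), constant on 3/8 < x < 1
    p_greater, _ = quad(f_cond, 2/3, 1)
    p_less, _ = quad(f_cond, 3/8, 2/3)
    print(p_greater, p_less, p_greater + p_less)   # 8/15, 7/15, 1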
2.6 Moments and Moment Generating Functions
Definition 13 (Moment Generating Function). The moment generating function of the random variable
X is defined as
ψX (t) = E(etX ),
for t ∈ R. We say that the moment generating function of X exists, if there exists a b > 0 such that
ψX (t) < ∞ for all |t| < b.
The reason behind its name becomes apparent after differentiating, since
\[ \psi'(t) = \frac{d}{dt}\, E\big(e^{tX}\big) = E\big(X e^{tX}\big), \qquad \psi'(0) = E(X). \]
Warning 1. In the above calculation we have accepted a bit of hand waving in exchanging the order of
differentiation and expectation. In this course you should just assume that this is possible. The theory
explaining when this is possible lies deeper in the realms of Real analysis than time and scope permit us
to explore.
Differentiating again,
\[ \psi''(t) = \frac{d}{dt}\, E\big(X e^{tX}\big) = E\Big(\frac{d}{dt}\big(X e^{tX}\big)\Big) = E\big(X^2 e^{tX}\big), \qquad \psi''(0) = E(X^2). \]
It is clear that we can continue in this way. Alternatively, we can expand etX in a power series:
\[ e^{tX} = \sum_{n=0}^{\infty} \frac{(tX)^n}{n!}. \]
Taking the expected value of both sides:
\[ \psi(t) = E(e^{tX}) = \sum_{n=0}^{\infty} \frac{t^n}{n!}\, E(X^n). \]
Theorem 12. If ψ_X(t) exists then for all k ∈ N we have
\[ \frac{d^k \psi_X(t)}{dt^k}\bigg|_{t=0} = E[X^k]. \]
Thus we can find the nth ordinary moment of X either as ψ^{(n)}(0) or as n! times the coefficient of t^n in the power series expansion of ψ.
Example 10. For the binomial distribution
\[ \psi(t) = E(e^{tX}) = \sum_{k=0}^{n} e^{tk} \binom{n}{k} p^k q^{n-k} = (p e^t + q)^n. \]
Then
\[ \psi'(t) = n p e^t (p e^t + q)^{n-1}, \qquad \psi''(t) = n p e^t \cdot (n-1) p e^t (p e^t + q)^{n-2} + n p e^t (p e^t + q)^{n-1}. \]
The first two ordinary moments are therefore np and n(n − 1)p² + np. Hence the mean is np and the variance is
\[ n(n-1)p^2 + np - (np)^2 = np(1 - p) = npq. \]
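The differentiation of the MGF can also be done symbolically; a sketch using SymPy (not part of the notes):

    import sympy as sp

    t, p, n = sp.symbols('t p n', positive=True)
    psi = (p * sp.exp(t) + (1 - p))**n          # binomial MGF with q = 1 - p
    m1 = sp.diff(psi, t).subs(t, 0)             # first moment
    m2 = sp.diff(psi, t, 2).subs(t, 0)          # second moment
    print(sp.simplify(m1))                      # n*p
    print(sp.simplify(m2 - m1**2))              # the variance; equal to n*p*(1 - p)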
Example 11. For the exponential distribution
\[ f(x; \lambda) = \lambda e^{-\lambda x}, \qquad x > 0, \]
we have
\[ \psi(t) = E(e^{tX}) = \int_0^\infty e^{tx}\, \lambda e^{-\lambda x}\,dx = \frac{\lambda}{\lambda - t}, \qquad t < \lambda. \]
Then
\[ \psi'(t) = \frac{\lambda}{(\lambda - t)^2}, \qquad \psi''(t) = \frac{2\lambda}{(\lambda - t)^3}. \]
Hence E(X) = 1/λ and E(X²) = 2/λ², so that var(X) = 1/λ².
Moment generating functions are extremely useful because they uniquely characterise the distributions
of random variables. Of course it is also obvious that if two random variables have the same distribution
then they also have the same moment generating function.
Remark 14. A little thought reveals that the moment generating function of a random variable with pdf
fX , is simply the Laplace transform of the pdf. Therefore the above result should not be that surprising
as Laplace transforms uniquely characterise functions subject to regularity restrictions.
Example 12 (Moment generating function of Normal Distribution). We can use the first result to work
out the moment generating function for the standard normal distribution, which has probability density
function
\[ \phi(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}. \]
Then we can calculate
\[ \psi_Z(t) = E\big(e^{tZ}\big) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{tz}\, e^{-z^2/2}\,dz. \]
Completing the square in the exponent,
\[ tz - \frac{z^2}{2} = -\frac{1}{2}\big[z^2 - 2tz + t^2\big] + \frac{t^2}{2} = -\frac{1}{2}[z - t]^2 + \frac{t^2}{2}. \]
Then a simple change of variables shows that
\[ \psi_Z(t) = \frac{e^{t^2/2}}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-(z - t)^2/2}\,dz = \frac{e^{t^2/2}}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-z^2/2}\,dz = e^{t^2/2}. \]
The trick here is that the integral is just √(2π), because it's just the standard normal integral shifted over a bit, which doesn't matter since the limits are infinite in either direction.
This allows us to work out the mean and variance:
\[ \psi_Z'(t) = t\, e^{t^2/2}, \qquad \psi_Z''(t) = e^{t^2/2} + t^2 e^{t^2/2}, \]
which gives zero mean and unit variance, as it should. We can now compute the MGF of a general normal distribution N(µ, σ²) with mean µ and variance σ². The trick here is that if X ∼ N(0, 1) then Y = µ + σX ∼ N(µ, σ²). Using Theorem 15, the moment generating function is
\[ \psi_Y(t) = e^{\mu t}\, \psi_X(\sigma t) = \exp\Big[\mu t + \frac{\sigma^2 t^2}{2}\Big], \]
and from this we can easily work out E(Y) and var(Y):
\[ \psi' = (\mu + t\sigma^2)\, \exp\Big[\mu t + \frac{\sigma^2 t^2}{2}\Big], \]
\[ \psi'' = \big[(\mu + t\sigma^2)^2 + \sigma^2\big]\, \exp\Big[\mu t + \frac{\sigma^2 t^2}{2}\Big]. \]
Hence E(Y) = ψ'(0) = µ and E(Y²) = ψ''(0) = µ² + σ², i.e. var(Y) = σ². Now we can calculate the higher order moments easily enough. The third central moment, generally called the skewness, is defined by
\[ \mu_3 := E\big[(Y - \mu)^3\big], \]
and so µ_3 = 0. Of course this is exactly as we would expect, since the odd central moments of a symmetric distribution must be zero. Since µ_3 is the lowest odd non-trivial central moment (the first central moment is, by definition, zero) this does suggest that it is a good measure of the skewness, but in fact it can be quite misleading. Symmetric distributions have zero skewness, but lots of asymmetric distributions that on any intuitive basis are skewed can have zero third central moment.
Chapter 3
Exponential Densities
Many common statistical tests are based on the normal density, and other densities derived from it also involve exponentials. So I will start by deriving a number of exponential distributions. Most of them have other uses as well, so it's not just preparation for hypothesis testing.
3.1 Review of the Normal Distribution
The density is symmetric about the mean µ. The expected value of a normal random variable Y ∼ N(µ, σ²) is indeed µ, and its variance is σ².
The proof is an easy calculation and is therefore left as an exercise. As a hint remember that any
density of a normal random variable has to integrate to 1.
Definition 14 (Covariance Matrix). Given random variables X_1, ..., X_n, their covariance matrix is the symmetric matrix
\[ \Sigma := \begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1n} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{n1} & \sigma_{n2} & \cdots & \sigma_{nn} \end{pmatrix}, \]
where σ_{ij} := cov(X_i, X_j).
Definition 15 (Multivariate Normal). We say that the jointly continuous random variables X1 , . . . , Xn
have the multivariate normal distribution with mean vector µ and positive definite covariance matrix Σ,
written (X1 , . . . , Xn ) ∼ N (µ, Σ) if their joint probability density function is given by
\[ \phi(x_1, \ldots, x_n) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}} \exp\Big[-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\Big]. \tag{3.1.1} \]
The following fact is extremely useful for hypothesis testing and in general.
Theorem 17. If X1 , . . . , Xn are independent normal random variables such that Xi ∼ N (µi , σi2 ), then
\[ X_1 + \cdots + X_n \sim N\Big(\sum_{i=1}^n \mu_i,\ \sum_{i=1}^n \sigma_i^2\Big). \]
Remark 18. Of course we know what the mean would be through the linearity of expectations, and the variance also, since the variance of a sum of independent variables is the sum of the variances. The main content of the theorem is that the sum of independent normal random variables is also normal, a far from trivial fact.
Proof. It's enough to check for two random variables and to use induction.
Suppose that X ∼ N(µ_X, σ_X²) and Y ∼ N(µ_Y, σ_Y²) are independent normal random variables. Let Z = X + Y. Then
\[ \psi_Z(t) = \psi_X(t)\,\psi_Y(t) = \exp\Big(\mu_X t + \frac{\sigma_X^2 t^2}{2}\Big) \exp\Big(\mu_Y t + \frac{\sigma_Y^2 t^2}{2}\Big) = \exp\Big((\mu_X + \mu_Y)t + \frac{(\sigma_X^2 + \sigma_Y^2)t^2}{2}\Big). \]
We recognise this as the moment generating function of a normally distributed variable with mean µ_X + µ_Y and variance σ_X² + σ_Y². By uniqueness of moment generating functions, we conclude that Z ∼ N(µ_X + µ_Y, σ_X² + σ_Y²).
An equivalent definition for jointly normal random variables is offered by the following theorem which
we will not prove.
Theorem 19. The variables X1 , . . . , Xn are jointly normal, if and only if for any λ1 , . . . , λn ∈ R, the
variable
λ1 X1 + λ2 X2 + · · · + λn Xn ,
is normally distributed.
It is easy to check that if (X1 , . . . , Xn ) ∼ N (µ, Σ) then E[Xi ] = µi , and that cov(Xi , Xj ) = σij .
As as special case look at the density in the case n = 2 where, letting ρ := corr(X, Y ) and µX , µY ,
σX , σY be the means and standard deviations of X and Y respectively, we have
\[ \phi(x, y) = \frac{1}{2\pi\, \sigma_X \sigma_Y \sqrt{1 - \rho^2}} \exp\Big[-\frac{1}{2(1 - \rho^2)}\Big\{\frac{(x - \mu_X)^2}{\sigma_X^2} + \frac{(y - \mu_Y)^2}{\sigma_Y^2} - \frac{2\rho(x - \mu_X)(y - \mu_Y)}{\sigma_X \sigma_Y}\Big\}\Big]. \]
The formula becomes clear as soon as you realise that their covariance matrix is in fact diagonal since
all the covariances vanish.
In fact we have the following result.
Theorem 20. Suppose that X1 , . . . , Xn ∼ N (µ, Σ), in other words that X1 , . . . , Xn are jointly normal.
Then X1 , . . . , Xn are mutually independent if and only if the covariance matrix Σ is diagonal.
Remark 21. This in particular implies that if X, Y are jointly normal, then they are independent if and only if cov(X, Y) = 0.
This is only true if the variables are JOINTLY NORMAL and may fail otherwise, as the next example demonstrates.
Example. Let X ∼ N(0, 1), let B be independent of X with P(B = 1) = P(B = −1) = 1/2, and set Y = BX. Then Y ∼ N(0, 1) and cov(X, Y) = E[BX²] = E[B]E[X²] = 0, but X and Y are clearly not independent.
The problem here is that although X and Y are both normal, they are not jointly normal. To see why, just consider the linear combination X + Y and notice that with probability 1/2, B = −1 and thus P(X + Y = 0) > 0. This means that X + Y is not a continuous random variable, in the sense that it does not have a probability density function, and therefore it cannot be normally distributed.
Jointly normal random variables have some very nice properties which we now state without proof.
Let X = (X_1, ..., X_n)^T ∼ N(µ, Σ). Then:
(a) (X_1, ..., X_n)^T = µ + Σ^{1/2}(Y_1, ..., Y_n)^T, where Y_1, ..., Y_n are i.i.d. standard normal;
(b) for any m × n matrix A,
\[ AX \sim N(A\mu, A\Sigma A^T). \]
Notice that from the last property we can deduce the following. If Y_1, ..., Y_n are independent standard normal random variables, then µ = 0 and Σ is the identity matrix. Let O be an orthogonal n × n matrix and define X := OY. Then X ∼ N(0, OΣO^T) = N(0, I), since OΣO^T = OO^T = I, and therefore X_1, ..., X_n are also i.i.d. standard normal.
Remark 23. Notice that if the random vector X has covariance matrix Σ, and A is an n × n matrix, then the random vector AX has covariance matrix AΣA^T, even if X is not normally distributed.
3.2 Poisson process
Suppose that events (arrivals, say) occur at random times, and for t ≥ 0 let N(t) denote the number of events that have occurred up to and including time t. With this definition, if s < t then N(t) − N(s) is the number of events that have occurred in the time interval (s, t].
We will make a number of assumptions to make the model tractable. Before that, we introduce some very useful notation.
Definition 16 (Landau notation). Recall that we write f(h) = o(h) if lim_{h→0} f(h)/h = 0, and f(h) = O(h) if there is a constant C such that |f(h)| ≤ Ch for all h sufficiently small.
Definition 17 (Poisson process). The family of random variables {N(t) : t ≥ 0}, N(t) ∈ N, such that N(0) = 0 and the following properties are satisfied:
Independent Increments: the numbers of events occurring over disjoint intervals of time are independent; that is, for all integers n and positive real numbers t_1 < t_2 < ... < t_n, the random variables N(t_i) − N(t_{i−1}), i = 1, ..., n, are mutually independent (with t_0 := 0);
Stationary Increments: the distribution of the number of events happening over the interval [t, t + h] depends only on h and is independent of t; that is, the distribution of N(t + h) − N(t) is independent of t.
It can be shown that a Poisson process can be equivalently defined in terms of the Poisson distribution,
the family of discrete distributions parameterised by λ > 0, given by
\[ p_\lambda(k) = \frac{e^{-\lambda} \lambda^k}{k!}, \qquad k = 0, 1, 2, \ldots. \]
Definition 18 (Poisson process 2). The family of random variables {N(t) : t ≥ 0}, N(t) ∈ N, such that N(0) = 0,
Independent Increments: for all integers n and positive real numbers t_1 < t_2 < ... < t_n, the random variables N(t_i) − N(t_{i−1}), i = 1, ..., n, are mutually independent (with t_0 := 0);
Poisson Increments: for all t, h ≥ 0, the increment N(t + h) − N(t) has the Poisson distribution with parameter λh,
is called a Poisson process with rate λ > 0.
Let T be the time of the first arrival, and let's compute the probability that the first arrival is after time t, P(T > t).
Let M be a large integer, and let h = t/M. Notice that T > t is equivalent to N(t) = 0, which is also equivalent to
\[ N(h) + \big(N(2h) - N(h)\big) + \cdots + \big(N(Mh) - N((M-1)h)\big) = 0, \]
and by the independent increments property the terms of this sum are independent. Letting M → ∞ one finds that P(T > t) = e^{-λt}, so that T has distribution function and density
\[ F(x) = 1 - e^{-\lambda x}, \qquad f(x) = \lambda e^{-\lambda x}, \qquad x > 0. \]

3.3 Exponential Distribution
Definition 19. The exponential distribution with parameter λ, denoted by Exp(λ), is the distribution
function
F (x) = 1 − e−λx , x > 0.
A random variable with the above distribution is called an exponential random variable with parameter
λ.
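A small simulation sketch (not from the notes), with an illustrative rate λ = 2 and time t = 0.7: the empirical survival probability of the first arrival time should be close to e^{-λt}.

    import numpy as np

    rng = np.random.default_rng(1)
    lam, t = 2.0, 0.7
    T = rng.exponential(scale=1 / lam, size=100_000)   # first arrival times, Exp(lam)
    print((T > t).mean(), np.exp(-lam * t))            # both about 0.2466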
As we have seen, it is easy to find its moment generating function:
\[ \psi(t) = E(e^{tX}) = \lambda \int_0^\infty e^{tx}\, e^{-\lambda x}\,dx = \frac{\lambda}{\lambda - t}, \qquad t < \lambda. \]
We will not prove the above result as it is beyond the scope of the course.
3.4 The Gamma Distribution
For r ≥ 1 let T_r be the time of the r-th arrival. Notice that T_1 ≡ T.
We also introduce the inter-arrival times
\[ \tau_1 := T_1, \quad \tau_2 := T_2 - T_1, \quad \ldots, \quad \tau_r := T_r - T_{r-1}. \]
From the third definition we know that the τ_i are i.i.d. exponential random variables with parameter λ > 0. So the time to the r-th arrival, T_r, is the sum of r independent, identically distributed exponential random variables with parameter λ > 0. That means we know what the moment generating function must be,
\[ \psi(t) = \Big(\frac{\lambda}{\lambda - t}\Big)^r, \qquad t < \lambda. \]
From this we can deduce all the moments of the distribution. The only problem is that we don’t know
the density itself, since deducing the density from the MGF involves inverting Laplace transforms which
is fairly tricky and certainly beyond our scope.
However we know that MGFs uniquely characterise probability distributions, and therefore their pdfs.
Thus I will give you the density, and then we will prove that it is indeed the correct one.
In fact we will define the gamma distribution through its density.
Definition 21 (Gamma distribution). The distribution with probability density function
\[ f(x) = \frac{\lambda^\alpha}{\Gamma(\alpha)}\, x^{\alpha - 1} e^{-\lambda x}, \qquad x > 0, \]
for some α, λ > 0, is called the Gamma distribution with parameters α and λ, and is denoted Γ(α, λ).
Remark 25. Here we allow α to be non-integer, in which case we can no longer interpret a Gamma
random variable as a sum of i.i.d. exponential variables.
Here λ is the number of arrivals per unit time and α is a real positive number which, when it is an integer,
is r, the number of arrivals that we are waiting for. Γ(α) is the gamma function:
\[ \Gamma(\alpha) = \int_0^\infty x^{\alpha - 1} e^{-x}\,dx. \]
We can check that this is in fact a legitimate density by verifying that it is non-negative, which is
obvious, and that it integrates to unity. To do the latter we make the simple substitution x = λy in the
expression for the gamma function. This gives
\[ \Gamma(\alpha) = \int_0^\infty (\lambda y)^{\alpha - 1} e^{-\lambda y}\, \lambda\,dy = \lambda^\alpha \int_0^\infty y^{\alpha - 1} e^{-\lambda y}\,dy. \]
We now check that we have the appropriate density by computing its moment generating function:
\[ \psi(t) = \int_0^\infty e^{tx}\, \frac{\lambda^\alpha}{\Gamma(\alpha)}\, x^{\alpha - 1} e^{-\lambda x}\,dx = \frac{\lambda^\alpha}{\Gamma(\alpha)} \int_0^\infty x^{\alpha - 1} e^{-(\lambda - t)x}\,dx. \]
Substituting u = (λ − t)x,
\[ \psi(t) = \frac{\lambda^\alpha}{(\lambda - t)^\alpha} \cdot \frac{1}{\Gamma(\alpha)} \int_0^\infty u^{\alpha - 1} e^{-u}\,du = \Big(1 - \frac{t}{\lambda}\Big)^{-\alpha}, \qquad t < \lambda. \]
Expanding in powers of t,
\[ \psi(t) = 1 + (-\alpha)\Big(\frac{-t}{\lambda}\Big) + \frac{(-\alpha)(-\alpha - 1)}{2!}\Big(\frac{-t}{\lambda}\Big)^2 + \cdots = 1 + \frac{\alpha}{\lambda}\, t + \frac{1}{2!}\, \frac{\alpha(\alpha + 1)}{\lambda^2}\, t^2 + \cdots. \]
Proposition 26. Let X ∼ Γ(α, λ). Then
\[ E(X) = \frac{\alpha}{\lambda}, \qquad E(X^2) = \frac{\alpha(\alpha + 1)}{\lambda^2}, \qquad \text{and} \qquad \mathrm{var}(X) = \frac{\alpha}{\lambda^2}. \]
These are as we would expect: the mean waiting time to the nth event goes up as n and the standard deviation as √n. The gamma function has a number of useful properties. The most interesting one can be seen if we integrate by parts:
\[ \Gamma(\alpha) = \int_0^\infty x^{\alpha - 1} e^{-x}\,dx = \frac{x^\alpha e^{-x}}{\alpha}\bigg|_0^\infty + \frac{1}{\alpha}\int_0^\infty x^\alpha e^{-x}\,dx, \]
so that
\[ \Gamma(\alpha + 1) = \alpha\, \Gamma(\alpha). \]
Since Γ(1) = \( \int_0^\infty e^{-x}\,dx = 1 \), we have Γ(n + 1) = n! for every non-negative integer n.
3.5 Chi-squared
Let Z ∼ N(0, 1) be a standard normal variable. Then the moment generating function of Z² is
\[ \psi_{Z^2}(t) = E\big(e^{tZ^2}\big) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{tz^2} e^{-z^2/2}\,dz = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-(1 - 2t)z^2/2}\,dz = (1 - 2t)^{-1/2}, \qquad t < \tfrac{1}{2}. \]
Comparing with the moment generating function of the Gamma distribution, (1 − t/λ)^{-α}, with α = 1/2 and λ = 1/2, we see that Z² ∼ Γ(1/2, 1/2).
Now let
\[ X^2 := Z_1^2 + Z_2^2 + \cdots + Z_k^2 \]
be the sum of the squares of k independent standard normal variables. Then the moment generating function is the kth power of the one we have just found,
\[ \psi_{X^2}(t) = (1 - 2t)^{-k/2}, \qquad t < \tfrac{1}{2}, \]
and it follows from this that X² ∼ Γ(k/2, 1/2), and thus its pdf is given by
\[ f_{X^2}(x) = \frac{x^{k/2 - 1} e^{-x/2}}{2^{k/2}\, \Gamma(k/2)}, \qquad x > 0. \]
Definition 22. The distribution of X 2 above is called the χ2 distribution (chi-squared) with k degrees
of freedom.
Of course, from the above definition, if we have two independent samples Z_1, ..., Z_n ∼ N(0, 1) and Y_1, ..., Y_m ∼ N(0, 1) of i.i.d. random variables and define U := Σ_{i=1}^n Z_i² and V := Σ_{j=1}^m Y_j², then U ∼ χ²(n), V ∼ χ²(m) and U + V ∼ χ²(n + m): independent χ² variables add, and so do their degrees of freedom.
The mean and variance of the gamma distribution are α/λ and α/λ2 , so those for the χ2 distribution
are k and 2k, respectively.
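A simulation sketch with k = 5 (not from the notes): the sum of squares of k independent standard normals should have mean k and variance 2k.

    import numpy as np

    rng = np.random.default_rng(2)
    k, reps = 5, 200_000
    X2 = (rng.standard_normal((reps, k)) ** 2).sum(axis=1)
    print(X2.mean(), X2.var())   # close to k = 5 and 2k = 10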
3.6 The F -distributions
Let U and V be independent non-negative random variables, such that P(V = 0) = 0, and let W = U/V. Then the cumulative distribution function of W is
\[ F_W(w) = P(W \le w) = P(U \le wV) = \int_0^\infty \Big(\int_0^{vw} f_U(u)\,du\Big) f_V(v)\,dv, \]
and differentiating,
\[ f_W(w) = \frac{d}{dw}\big[P(W \le w)\big] = \frac{d}{dw} \int_0^\infty \Big(\int_0^{vw} f_U(u)\,du\Big) f_V(v)\,dv = \int_0^\infty v\, f_U(vw)\, f_V(v)\,dv. \]
Suppose that U and V are independent χ² variables with m and n degrees of freedom respectively. Then
\[ f_W(w) = \frac{1}{C} \int_0^\infty v \cdot v^{n/2 - 1} (vw)^{m/2 - 1} e^{(-v - vw)/2}\,dv, \]
where C = 2^{(m+n)/2} Γ(m/2) Γ(n/2). We factor out the powers of w:
\[ f_W(w) = \frac{w^{m/2 - 1}}{C} \int_0^\infty v^{(m+n)/2 - 1} e^{-(w + 1)v/2}\,dv. \]
To get this into a form that looks like a gamma function we make the substitution y = (w + 1)v:
\[ f_W(w) = \frac{w^{m/2 - 1}}{C\,(w + 1)^{(m+n)/2}} \int_0^\infty y^{(m+n)/2 - 1} e^{-y/2}\,dy. \]
The integral is not a function of w, so we have the density of W = U/V apart from the constant. To find that we compare the integral with the gamma function in the form
\[ \Gamma(\alpha) = \lambda^\alpha \int_0^\infty y^{\alpha - 1} e^{-\lambda y}\,dy. \]
With λ = 1/2 and α = (m + n)/2 this is just
\[ \Gamma\big(m/2 + n/2\big) = 2^{-(m+n)/2} \int_0^\infty y^{(m+n)/2 - 1} e^{-y/2}\,dy. \]
3.7 Student's t-distribution
Consider the following typical problem. The lengths of widgets produced by an automatic machine are normally distributed with unknown mean µ and unknown standard deviation σ, that is X_i ∼ N(µ, σ²). The lengths of 25 widgets are measured, and the mean X̄ = (1/25) Σ_{i=1}^{25} X_i is found to be 10.05 cm. Is there reason to suspect that the mean is significantly different from 10? That is, we want to test the hypotheses
\[ H_0 : \mu = 10 \quad \text{vs} \quad H_1 : \mu \neq 10. \]
Suppose for the moment that σ is known, say σ = 0.1. Then, as in Chapter 1, the statistic Z = (X̄ − µ)/(σ/√n) has the N(0, 1) distribution, and under the null hypothesis µ = 10, so we can plug this in and perform our hypothesis test. In this case we compute z = (10.05 − 10.0)/(0.1/5) = 2.5. From the tables we know that P(Z ≤ 2.5) = 0.9938, so P(Z ≥ 2.5) = 0.0062 and therefore P(|Z| ≥ 2.5) = 0.012. This means that if µ = 10, the probability of getting such a big deviation by chance is around 1.2%. We say that the deviation is significant and we reject the null hypothesis at the 5% level.
However what happens if we don't know σ? The obvious thing is to estimate it, and we know from last year how to do it:
\[ s^2 = \frac{1}{n - 1} \sum (x_i - \bar x)^2. \]
Remark 28. Remind yourself that we divide by n − 1, not n, so as to get the right expected value.
Let's have a closer look at the above formula. We begin with
\[ \sum (X_i - \mu)^2 = \sum \big[(X_i - \bar X) + (\bar X - \mu)\big]^2 = \sum (X_i - \bar X)^2 + 2\sum (X_i - \bar X)(\bar X - \mu) + \sum (\bar X - \mu)^2 = \sum (X_i - \bar X)^2 + n(\bar X - \mu)^2. \]
Taking expectations,
\[ E\Big[\sum (X_i - \mu)^2\Big] = \sum E(X_i - \mu)^2 = \sum \mathrm{var}(X) = n\,\mathrm{var}(X), \qquad E\big[n(\bar X - \mu)^2\big] = n\,\mathrm{var}(\bar X) = \mathrm{var}(X), \]
and therefore E[Σ(X_i − X̄)²] = (n − 1) var(X), which explains the division by n − 1.
We now have an estimate of the standard deviation, s. Let us briefly recall how the z-test worked. There we knew σ and we based our calculations on the fact that if X_i ∼ N(µ, σ²), i = 1, ..., n, are i.i.d. then
\[ \frac{\bar X - \mu}{\sigma/\sqrt{n}} \sim N(0, 1). \]
To discuss what happens when we replace σ by its estimator let
1 X
S 2 := (Xi − X̄)2 .
(n − 1)
Remark 29. We use capital letters for random variables, and lower case for their observed values.
Therefore one might naively think that we can replace σ by S without changing the distribution but in
fact it turns out that
T = (X̄ − µ)/(S/√n)
is NOT normally distributed. Intuitively, the fact that the denominator is random inflates the variability
of the whole fraction which ceases to be normally distributed.
Remark 30. We will see later that when the sample size is large, then in fact T is approximately normal.
Theorem 31. Let X1 , X2 , . . . , Xn ∼ N (µ, σ 2 ) be independent. Then X̄ and S 2 are independent and
X̄ ∼ N(µ, σ²/n)   and   (n − 1)S²/σ² ∼ χ²(n − 1).
Proof. First we prove that they are independent. To do this we perform a change of variables. Notice
that we haven’t dealt with proper multivariate distributions before, just with bivariate but the concepts
are exactly the same.
We perform a change of variables from X₁, . . . , X_n to Y₁ := X̄ and Y_i := X_i − X̄ for i ≥ 2. Inverting the transformation we find X_i = Y_i + Y₁ for i ≥ 2 and X₁ = Y₁ − Σ_{i=2}^n Y_i. In fact it is easier to think of this as Y = AX, where the first row of A is (1/n, 1/n, . . . , 1/n) and, for i ≥ 2, the i-th row of A has 1 − 1/n in position i and −1/n in every other position.
Since this is a linear transformation its partial derivatives are of course constant and therefore the Jaco-
bian |J| will not depend on the variables.
Thus we can compute the joint density of the variables Y₁, . . . , Y_n to be

f_Y(y₁, . . . , y_n) = |J| f_X(x₁, x₂, . . . , x_n),

where the x_i are expressed in terms of the y_i and

f_X(x₁, x₂, . . . , x_n) = C exp{ −(1/(2σ²)) Σ_i (x_i − µ)² }.
Notice that

Σ_{i=1}^n ((x_i − µ)/σ)² = (1/σ²) [ Σ_{i=1}^n (x_i − x̄)² + n(x̄ − µ)² ]
  = (1/σ²) [ (x₁ − x̄)² + Σ_{i=2}^n (x_i − x̄)² + n(x̄ − µ)² ]
  = (1/σ²) [ ( Σ_{i=2}^n (x_i − x̄) )² + Σ_{i=2}^n (x_i − x̄)² + n(x̄ − µ)² ]
  = (1/σ²) [ ( Σ_{i=2}^n y_i )² + Σ_{i=2}^n y_i² + n(y₁ − µ)² ].
Since the pdf factorises into a product, one factor of which involves just y₂, . . . , y_n while the other factor involves just y₁, we conclude that y₁ is independent of (y₂, . . . , y_n), or in other words that X̄ is independent of X_i − X̄ for i = 2, . . . , n. Since X₁ − X̄ = −Σ_{i=2}^n (X_i − X̄), we conclude that X̄ is independent of X_i − X̄ for all i. In particular, since S² is a function of the X_i − X̄, we conclude that X̄ and S² are independent.
To prove the second statement we calculate

(n − 1)S² = Σ_{i=1}^n (X_i − X̄)² = Σ_{i=1}^n (X_i − µ + µ − X̄)²
  = Σ_{i=1}^n (X_i − µ)² + n(µ − X̄)² + 2(µ − X̄) Σ_{i=1}^n (X_i − µ)
  = Σ_{i=1}^n (X_i − µ)² + n(µ − X̄)² + 2n(µ − X̄)(X̄ − µ)
  = Σ_{i=1}^n (X_i − µ)² − n(X̄ − µ)²,

and therefore

(n − 1)S²/σ² = Σ_{i=1}^n (X_i − µ)²/σ² − n(X̄ − µ)²/σ².
We know from the previous section that the (X_i − µ)/σ ∼ N(0, 1) are i.i.d. and that √n(X̄ − µ)/σ ∼ N(0, 1), and thus

A := Σ_{i=1}^n (X_i − µ)²/σ² ∼ χ²(n),   B := n(X̄ − µ)²/σ² ∼ χ²(1).
Equivalently, from above we have that

(n − 1)S²/σ² + n(X̄ − µ)²/σ² = Σ_{i=1}^n (X_i − µ)²/σ².
Calculating MGFs of both sides and using the fact that S² and X̄ are independent, and that the right hand side has the χ²(n) distribution, we obtain the equation

E[exp(t(n − 1)S²/σ²)] · 1/(1 − 2t)^{1/2} = 1/(1 − 2t)^{n/2},

so that

E[exp(t(n − 1)S²/σ²)] = 1/(1 − 2t)^{(n−1)/2},

and thus

(n − 1)S²/σ² ∼ χ²(n − 1).
Definition 24 (Student's t-distribution). Let Z ∼ N(0, 1) and V ∼ χ²(n) be independent. Then we say that

Z/√(V/n) ∼ t_n,

where t_n is called the t-distribution with n degrees of freedom.
Theorem 32. If X₁, . . . , X_n ∼ N(µ, σ²) are i.i.d. then

T = (X̄ − µ)/(S/√n) ∼ t_{n−1}.
Notice straight away that because the denominator is positive, and the numerator is symmetric, as it is
normally distributed, the distribution of T must also be symmetric.
To find the density we notice that if we look at the square, it is the ratio of two independent χ² random variables with 1 and n degrees of freedom respectively. Therefore the square has the F distribution with (1, n) degrees of freedom. Let V be such a random variable. Let t = g(v) = √v. We know the density of V and we want to find the density of T = g(V). However we cannot apply the change of variable formula directly because although T takes values on R, g(v) = √v takes values only on (0, ∞) and is in fact bijective as a map between (0, ∞) and (0, ∞). This means that the change of variables formula
would actually give you the density of |T | rather than that of T . However by symmetry it is not too
difficult to deduce that if t > 0 then P(|T | > t) = 2 P(T > t) and therefore that the density of |T | is
twice the density of T . Therefore we can use the change of variable formula to find the density of |T |
and then divide by 2 to get that of T for all t > 0 and then by symmetry we can deduce the density on
all of R. We proceed as explained to find
f_{|T|}(t) = f_V(g⁻¹(t)) |(d/dt) g⁻¹(t)|
  = n^{n/2} (t²)^{−1/2} / (B(1/2, n/2) (t² + n)^{(1+n)/2}) · 2t
  = 2 n^{n/2} Γ((n + 1)/2)/(Γ(1/2) Γ(n/2)) · (t² + n)^{−(n+1)/2}
  = 2 Γ((n + 1)/2)/(√(nπ) Γ(n/2)) · (1 + t²/n)^{−(n+1)/2},

where we have used the fact that Γ(1/2) = √π.
Therefore for t > 0 we know from above that

f_T(t) = Γ((n + 1)/2)/(√(nπ) Γ(n/2)) · (1 + t²/n)^{−(n+1)/2},

and by symmetry that for t < 0, f_T(t) = f_T(−t), and thus we deduce that for all t ∈ R

f_T(t) = Γ((n + 1)/2)/(√(nπ) Γ(n/2)) · (1 + t²/n)^{−(n+1)/2}.

Finally, let us see what happens for large n. One can show (for example using Stirling's formula) that

Γ((n + 1)/2)/(√n Γ(n/2)) → 1/√2,

and taking logarithms and using the Taylor expansion log(1 + x) = x + o(x) for x close to 0, we get

log[(1 + t²/n)^{−(n+1)/2}] = −((n + 1)/2) log(1 + t²/n) = −((n + 1)/2)(t²/n + o(1/n)) → −t²/2,

so that (1 + t²/n)^{−(n+1)/2} → e^{−t²/2}. Combining the two limits, f_T(t) → (1/√(2π)) e^{−t²/2}, the density of a standard normal random variable: for large n the t-distribution is approximately standard normal.
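The derived density can be checked numerically (a sketch; the values of n and t below are arbitrary) against scipy's t-distribution, and the output also shows the convergence to the standard normal density for large n:

    import numpy as np
    from scipy import stats
    from scipy.special import gamma

    def t_pdf(t, n):
        # Gamma((n+1)/2) / (sqrt(n*pi) * Gamma(n/2)) * (1 + t^2/n)^(-(n+1)/2)
        const = gamma((n + 1) / 2) / (np.sqrt(n * np.pi) * gamma(n / 2))
        return const * (1 + t ** 2 / n) ** (-(n + 1) / 2)

    t = 1.5
    for n in (3, 30, 300):
        print(n, t_pdf(t, n), stats.t.pdf(t, n), stats.norm.pdf(t))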
3.7.1 Testing variances
We have seen that if Z₁, Z₂, . . . , Z_n is a sample of size n from a standard normal distribution, then Σ Z_i² has the χ² distribution with n degrees of freedom. We have also seen that if Y₁, Y₂, . . . , Y_n is a sample of size n drawn from a normal distribution with mean µ and variance σ², then

(1/σ²) Σ_{i=1}^n (Y_i − Ȳ)²

has the χ² distribution with n − 1 degrees of freedom. It follows that if two independent samples are drawn from normal populations with the same variance, and s₁², s₂² are the sample variances based on n₁ and n₂ degrees of freedom respectively, then

F = s₁²/s₂²

has the F-distribution on (n₁, n₂) df.
This test has to be used with caution, because it has been found that it is not very robust if the under-
lying densities are not normal.
Example 14. The blood pressures of rats in two samples were measured. Is the variance of the second group significantly greater than that of the first? The sample variances work out to be S₂² = 783.7 and S₁² = 384.6. The ratio is S₂²/S₁² = 2.04. We compare 2.04 with the (7,6)
entry in the 5% table, which is 4.207. So the difference is not significant. Note this is a one tail test.
For a two tail test, you put whichever variance is larger in the numerator (to make F greater than unity)
and then use the 2.5% value, which for (7,6) is 5.7. That means we’d need a larger discrepancy to get
significance, as you’d expect.
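The computations of Example 14 can be reproduced with scipy (a sketch; it assumes the two samples contained 8 and 7 rats respectively, so that the variance ratio has (7, 6) degrees of freedom as in the text):

    from scipy import stats

    S2_sq, S1_sq = 783.7, 384.6        # sample variances quoted above
    F = S2_sq / S1_sq                  # observed ratio, about 2.04

    print(stats.f.ppf(0.95, 7, 6))     # one-tail 5% critical value, about 4.21
    print(stats.f.ppf(0.975, 7, 6))    # 2.5% value used for the two-tail test, about 5.70
    print(stats.f.sf(F, 7, 6))         # one-tail p-value of the observed ratio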
3.8 The Beta Density

A random variable X is said to have the beta density with parameters r and s if its pdf is

f(x) = x^{r−1}(1 − x)^{s−1}/B(r, s),   0 < x < 1,

where r > 0, s > 0 and B(r, s) is the beta function

B(r, s) = ∫_0^1 x^{r−1}(1 − x)^{s−1} dx.

As we will see below, the beta function can be expressed in terms of the gamma function:

B(r, s) = Γ(r)Γ(s)/Γ(r + s).
The easiest way to find the moments of the beta distribution is directly:

E(X) = (1/B(r, s)) ∫_0^1 x · x^{r−1}(1 − x)^{s−1} dx
     = B(r + 1, s)/B(r, s)
     = (Γ(r + 1)Γ(s)/Γ(r + s + 1)) · (Γ(r + s)/(Γ(r)Γ(s)))
     = r/(r + s).
Similarly

E(X²) = (1/B(r, s)) ∫_0^1 x² · x^{r−1}(1 − x)^{s−1} dx
      = B(r + 2, s)/B(r, s)
      = (Γ(r + 2)Γ(s)/Γ(r + s + 2)) · (Γ(r + s)/(Γ(r)Γ(s)))
      = (r + 1)r/((r + s + 1)(r + s)),

and so

var(X) = rs/((r + s)²(r + s + 1)).
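These moment formulas are easy to confirm numerically (a sketch; the parameter values are arbitrary):

    from scipy import stats

    r, s = 2.5, 4.0
    mean, var = stats.beta.stats(r, s, moments='mv')
    print(mean, r / (r + s))                          # E(X) = r/(r+s)
    print(var, r * s / ((r + s) ** 2 * (r + s + 1)))  # var(X) = rs/((r+s)^2 (r+s+1))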
It's not immediately obvious that the beta function is related to the gamma function, but it is. To see this we make the substitution x = cos²θ to obtain (bearing in mind that dx = −2 cos θ sin θ dθ and the minus sign switches the limits)

B(r, s) = 2 ∫_0^{π/2} cos^{2r−2}θ sin^{2s−2}θ · cos θ sin θ dθ = 2 ∫_0^{π/2} cos^{2r−1}θ sin^{2s−1}θ dθ.
Now we have seen the gamma function can be written in one of three forms:

Γ(α) = ∫_0^∞ x^{α−1} e^{−x} dx = λ^α ∫_0^∞ y^{α−1} e^{−λy} dy = (1/2^{α−1}) ∫_0^∞ z^{2α−1} e^{−z²/2} dz.

Hence

Γ(s)Γ(t) = (1/2^{s−1}) ∫_0^∞ x^{2s−1} e^{−x²/2} dx · (1/2^{t−1}) ∫_0^∞ y^{2t−1} e^{−y²/2} dy
         = (1/2^{s+t−2}) ∫_0^∞ ∫_0^∞ x^{2s−1} y^{2t−1} e^{−(x²+y²)/2} dy dx,

and changing to polar coordinates x = r cos θ, y = r sin θ,

         = (1/2^{s+t−2}) ∫_0^∞ ∫_0^{π/2} r^{2s+2t−2} cos^{2s−1}θ sin^{2t−1}θ e^{−r²/2} r dθ dr
         = (1/2^{(s+t)−1}) ∫_0^∞ r^{2(s+t)−1} e^{−r²/2} dr · 2 ∫_0^{π/2} cos^{2s−1}θ sin^{2t−1}θ dθ
         = Γ(s + t) B(s, t),

which gives B(s, t) = Γ(s)Γ(t)/Γ(s + t), as claimed.
Chapter 4
Estimation
We are used to the idea that if we want to estimate the mean of a population, we take a sample and find
its average. It is the obvious thing to do, and it seems to work fine. If the sample is large we expect to
get a good result (if the distribution is normal we can even work out confidence limits) and if we use the
whole population as the sample we automatically get the right answer.
It’s not quite so simple for the variance, because instead of the variance of the sample we use the
sum of squares divided by n − 1. This should at least remind us that it’s not always obvious what
we are intended to do. Besides, it’s easy for distributions like the binomial, Poisson, Normal and even
exponential where the parameters are obviously related to readily measured statistics, but what happens
when this isn’t so? For the gamma distribution, for instance, we have E(X) = α/λ and var(X) = α/λ2 ,
and while we can solve these for α and λ, how good will the resulting estimates be?
Even in simple cases, there can be doubt. For instance, suppose we want to estimate the mean of a
uniform distribution. We could add up all the measurements and divide by the number in the sample,
but it would seem a lot easier to take half the biggest value. Is this all right?
What we are looking for is always the best way of estimating parameters. And as usual, the first thing
to do is to decide what we mean by ‘best’.
To help keep things straight, we introduce some notation. We will let θ stand for the parameter or
parameters of a distribution; thus it may be a number or a vector (θ1 , θ2 , . . . θn ). We let Ω denote the
parameter space, the set of all possible values of θ. In the case of the normal distribution, for example, θ stands for the pair (µ, σ²) and Ω is the set of all pairs (µ, σ²) such that −∞ < µ < ∞ and σ² > 0.
Sometimes we can restrict the parameter space further if we have prior knowledge; for instance if the
distribution is of marks then presumably we can assume µ > 0 and there is no point in designing an
estimation procedure that is good at estimating negative marks. Being able to handle negative scores
will not enter into the criterion for ‘best’.
There is still a "dispute" about the role of prior information in statistics. Some statisticians hold that
we should always begin by using whatever information and intuition we have to estimate where θ is
likely to be. Just as in the (usually noncontroversial) assumption that marks aren’t negative, this can
affect the test by leading us to choose one that works best in the range we think is important even at
the expense of doing less well elsewhere. This is the school of Bayesian statistics. I won’t say any
more about this because at this stage I want to stick to the conventional ideas, but you may run into this
concept at some point. Even the fact that there can be a dispute shows what happens when you work in
a subject whose very essence is dealing with uncertainty.
First we need to define the subject of our study. To estimate the expectation of a population, we will
take an i.i.d. sample X1 , . . . , Xn and compute its average
X̄ = (1/n) Σ X_i.

To estimate the variance we will compute

S² = (1/(n − 1)) Σ (X_i − X̄)².
In either case we have what we call a statistic, that is, a function of the sample that we can compute from the observed data.
Remark 33. In some cases, if we know some of the parameters of the population, the function may
incorporate them. For example, when the population variance σ 2 is known and the mean µ0 is known
under the null, the z-statistic is
Z = f(X₁, . . . , X_n) = √n (X̄ − µ₀)/σ.
The important thing is that we must be able to compute the statistic given the sample.
There will always be many statistics for estimating a parameter, and we will have to decide which one
to use. Naturally we want to use the ‘best’ one, or one that is close to the ‘best’ one, so first we have to
know what we mean by ‘best’. Let’s consider some properties that we would like a statistic to have. As
we shall see, we can’t always get all of them at the same time.
First of all our statistic is random, and is meant to estimate a fixed deterministic value. Since it is
random there will always be some error in the estimation. We want this error to be zero on average,
otherwise we are doing something wrong.
Unbiasedness of a statistic tells us that it will give us the correct value on average: an estimator θ̂ of a parameter θ is called unbiased if E(θ̂) = θ.
We know that for a sample of size n, E(X̄) = µ, so the sample mean is an unbiased estimate of the population mean. When it comes to estimating the variance, however, we find that

Σ (X_i − µ)² = Σ [(X_i − X̄) + (X̄ − µ)]² = Σ (X_i − X̄)² + 2 Σ (X_i − X̄)(X̄ − µ) + Σ (X̄ − µ)².

Now (X̄ − µ) is a constant for the summation, and since also Σ (X_i − X̄) = 0 we have

Σ (X_i − µ)² = Σ (X_i − X̄)² + n(X̄ − µ)².
from which we conclude (using E[Σ (X_i − µ)²] = n var(X) and E[n(X̄ − µ)²] = n var(X̄) = var(X)) that

E[Σ (X_i − X̄)²] = (n − 1) var(X).

This is why we estimate the variance by

s² = (1/(n − 1)) Σ (X_i − X̄)²,

with n − 1 rather than n in the denominator. The point is that s² is an unbiased estimate of σ².
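A short simulation (a sketch, with arbitrary n and σ²) illustrates the same point: the version divided by n underestimates σ² on average, while the version divided by n − 1 does not.

    import numpy as np

    rng = np.random.default_rng(2)
    n, sigma2, reps = 10, 4.0, 100_000
    samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))

    print(samples.var(axis=1, ddof=1).mean())  # divide by n-1: close to 4.0
    print(samples.var(axis=1, ddof=0).mean())  # divide by n  : close to (n-1)/n * 4.0 = 3.6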
Given two unbiased estimators, naturally one would select the one with the smaller variability. Since variance is our usual measure of variability, we prefer the estimator with the smaller variance. If on the other hand the estimators are not necessarily unbiased, then it makes no sense to compare their variances, which measure the average squared distance from their respective means; we should rather measure their expected squared deviation from the true parameter value. This motivates the mean square error of an estimator θ̂ of θ, defined as MSE(θ̂) := E[(θ̂ − θ)²]. The mean square error takes into account both the bias and the variance, since

MSE(θ̂) = E[(θ̂ − θ)²] = var(θ̂) + [E(θ̂) − θ]².
In some cases an estimator may improve as the sample size increases. To capture this we introduce the following asymptotic measure of the quality of estimators, or more precisely of sequences of estimators: a sequence of estimators θ̂_n of θ is called consistent if for every ϵ > 0, P(|θ̂_n − θ| > ϵ) → 0 as n → ∞. This is obviously a desirable property. It says that if you take a large enough sample, then most of the probability concentrates about the true parameter value.
Remark 35. While (1/n) Σ (x_i − x̄)² is not an unbiased estimate of the population variance, it is consistent.
Example 15. Suppose that you want to estimate the unknown mean of a normal population with variance σ² = 1. You take a sample X_i ∼ N(µ, 1) and you compute the sample mean as an estimator of µ, µ̂ = X̄. Then, since X̄ ∼ N(µ, 1/n), for any ϵ > 0

P(|µ̂ − µ| > ϵ) = P(√n |X̄ − µ| > √n ϵ) = P(|Z| ≥ √n ϵ) → 0,

where Z ∼ N(0, 1), as n → ∞, so µ̂ is consistent.
Example 16. There are surprisingly simple examples where the above calculation is too complicated to be done exactly.
Suppose you are given a possibly unfair coin with unknown probability of 'heads' p. You want to estimate p, and one natural choice is to toss the coin n times, record the number of heads

N = Σ_{i=1}^n X_i,

where X_i = 1 if the i-th toss is a head and 0 otherwise, and use

p̂ = N/n.

Clearly N ∼ Bin(n, p) and thus E(p̂) = p, so it is unbiased. Also we have that for any ϵ > 0, writing ⌈x⌉ for the smallest integer larger than x,

P(|p̂ − p| > ϵ) = Σ_{k=⌈n(p+ϵ)⌉}^{n} C(n, k) p^k (1 − p)^{n−k} + Σ_{k < n(p−ϵ)} C(n, k) p^k (1 − p)^{n−k},
and we are a bit stuck. Of course there are tricks one can use to estimate the above and to show that
it vanishes, but it does show that we need a more general method. This is given by the law of large
numbers in the next section.
E(p̂) = E(N/n) = E(N)/n = np/n = p.

Also, since N = Σ_{i=1}^n X_i where the X_i are i.i.d.,

var(p̂) = var(Σ X_i)/n² = n var(X₁)/n² = p(1 − p)/n → 0,
as n → ∞.
Remember that variance is a measure of the spread of the distribution. Therefore the distribution of p̂ remains at the same location, while it gets more and more peaky, as you can see in Figure 4.1.1.
So as n → ∞, the probability of any fixed interval around the mean approaches 1, and the random
variable p̂ ‘tends’ to the constant p in some sense1 . In other words, p̂ is both unbiased and consistent as
an estimate of p.
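This concentration is easy to see in a simulation (a sketch; p, ϵ and the sample sizes are arbitrary): the probability that p̂ misses p by more than ϵ shrinks as n grows.

    import numpy as np

    rng = np.random.default_rng(3)
    p, eps, reps = 0.3, 0.05, 20_000
    for n in (10, 100, 1000, 10_000):
        p_hat = rng.binomial(n, p, size=reps) / n
        print(n, np.mean(np.abs(p_hat - p) > eps))  # estimate of P(|p_hat - p| > eps)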
Of course not everything in life is Bernoulli trials or Gaussian, but fortunately the law applies far more
generally, and since it’s not too hard to see why we will go through it. First, however, let’s clarify what
the law says.
1
When talking about random variables there are various modes of ’convergence’ to a limit which are beyond the scope of
this course. This one is called convergence in probability. You will learn more about this in later courses on Probability Theory
Figure 4.1.1: The density of N(µ, σ²) for σ² = 1 (blue), 1/2 (orange), 1/3 (green), 1/4 (red) and 1/5 (purple).
What it does say is that in a sequence of n Bernoulli trials, the proportion of successes tends to the
probability p. What it does not say is that the number of successes tends to the expected number np.
Let’s see what that means. Suppose we are tossing a perfectly unbiased statisticians’ coin, and the
first ten throws are all heads. As you know, that’s possible, if unlikely. Now we go on and toss it 990
more times to make 1000 in all. What might we expect to happen?
Most people, and even a lot of gamblers, would argue as follows. In 1000 trials there should be on
average 500 heads. We’ve already got 10, so we expect that of the next 990 trials, 490 should be heads
and 500 should be tails. If this is a betting game, we should certainly be betting on tails until the tails
catch up, as they are bound to.
Statisticians call this the Law of Maturing Probabilities, and it is wrong. It must be. How could it
possibly be true? It requires the coin to have a memory, so that it knows it owes us some tails and will
produce them. If I take a coin from my pocket and toss it, you think there is 50-50 chance it will show
heads. But how do you know it doesn’t owe me some heads? Surely there is no way within conventional
science that this can make any sense.
But what about the law of large numbers? Well, let’s go back and think. If we start from scratch,
the expected number of heads is just half the number of throws. So if we toss the coin 100 more times,
we expect 50 heads and 50 tails. Counting the 10 heads at the start, this makes the relative proportion
p100 = 60/110 = 0.55. If we toss the coin 200 more times, we expect 100 heads and 100 tails, and
again adding in the extra 10 heads this would make p200 = 110/210 = 0.52. If we toss the coin 990
times, we expect 495 heads and 495 tails, and adding the extra 10 heads would make p1000 = 0.505.
You can see what’s happening. The relative proportion is Yn /n, where Yn is the actual number of
successes on n trials. It may happen that Yn moves away from the theoretical expectation np. Suppose
we write Yn = np + δ. What the law of large numbers tells us is not that if we carry out extra trials the
proportion of successes will adjust itself to reduce δ but only that we don’t expect δ to get any bigger.
Since we are increasing the denominator, n, this means that the relative frequency will tend to p as it
should.
In other words, the deviation in numbers is δ and there is nothing to say this has to go to zero. The
deviation in relative frequency is δ/n. This does tend to zero, but by increasing n with δ constant, rather
than by decreasing δ.
The only use of the law of maturing probabilities is if you are betting against people who believe in it.
If a coin shows heads ten times in a row, when you know the odds are about 100:1 against this, someone
might be fool enough to offer you better than even odds against another head, on the grounds that it is
time for some tails. If he does, you should accept it. First, if the coin is fair, the odds are still even, as
they were at the beginning. Second, by applying a simple statistical test you know that the chances are
very good that the coin isn’t fair, and that it is biased in favour of heads!
Theorem 36 (Law of Large Numbers). Suppose that {X_i, i ≥ 1} are i.i.d. with finite mean µ and variance σ². Then

µ̂ := (1/n) Σ_{i=1}^n X_i

is a consistent estimator of the mean. In other words, for all ϵ > 0,

P[|µ̂ − µ| > ϵ] → 0,

as n → ∞.
Proposition 37 (Chebyshev’s Inequality). Let Y be a random variable with finite mean µ and variance.
Then for any k > 0,
P(|Y − µ| ≥ k √var(Y)) ≤ 1/k².
We will now prove Chebyshev's inequality; the LLN then follows immediately, since by Chebyshev, P(|µ̂ − µ| > ϵ) ≤ var(µ̂)/ϵ² = σ²/(nϵ²) → 0 as n → ∞.
Chebyshev’s inequality actually follows from the following more general result.
Theorem 38 (Markov’s Inequality). Let X > 0 be a random variable, such that E X < ∞ and c > 0 a
constant. Then
P(X > c) ≤ E X / c.
Proof. Take a moment to convince yourselves that

P(X > c) = E[1{X > c}],

where for any x, c,

1{x > c} = 1 if x > c, and 0 otherwise.

Notice that if x > c then x/c > 1, and thus for any x, c > 0

1{x > c} ≤ (x/c) 1{x > c} ≤ x/c.

Recalling from the basic properties of expectations that if X ≥ Y ≥ 0 then E X ≥ E Y ≥ 0, we have that

P(X > c) = E[1{X > c}] ≤ E[X/c] = E X / c.
Remark 39. If X > 0 and k ≥ 1 then x > c if and only if x^k > c^k. Therefore, if E X^k < ∞ we can also conclude that

P(X > c) = P(X^k > c^k) ≤ E X^k / c^k,

by applying Markov's inequality to the positive random variable X^k and the constant c^k.
Proof of Chebyshev's inequality. Let X = |Y − µ|/σ, where σ = √var(Y). Then X ≥ 0 and, since Y has finite variance, E X² < ∞. By the remark above we can conclude that for any k > 0

P(|Y − µ| ≥ kσ) = P(X ≥ k) ≤ E X²/k² = E[(Y − µ)²]/(σ²k²) = 1/k².
Now let's see how this works out. Suppose we are told that the probability density has zero mean and unit variance. Then if we set k = 1 we find P(|x| ≥ 1) ≤ 1, which tells us absolutely nothing new. With k = 2, however, we have P(|x| ≥ 2) ≤ 1/4 and P(|x| ≥ 3) ≤ 1/9. If, in fact, x has the normal density, then we know that P(|x| ≥ 2) is almost exactly 1/20 and P(|x| ≥ 3) is very nearly zero. Chebyshev's inequality obviously doesn't do as well, but it's pretty good considering how little we have to know. And of course it does not depend on the assumption that the density is normal. We can make it work providing we have reasonable estimates of µ and σ.
Since E(Y_n − np)² = np(1 − p), Chebyshev's inequality applied with k′ = k/σ immediately gives us

P(|Y_n − np| ≥ k) ≤ np(1 − p)/k².

Suppose, for example, that p = 0.3 and that we want the probability that the observed proportion p̂ = Y_n/n is within 0.1 of p.
Here µ = 0.3 and

σ² = pq/n = (3/10)(7/10)/n = 0.21/n.

We are to be no further than 0.1 from µ, so kσ = 0.1 and 1/k² = 100σ² = 21/n. If n = 100, 1/k² = 0.21, so the probability that p̂ lies outside the interval is less than 0.21, and the probability that it lies inside the interval is greater than 0.79. If n = 1000, 1/k² = 0.021, so the probability that p̂ lies outside the interval is less than 0.021 and the probability that it lies inside the interval is greater than 0.979. If we use the normal approximation to the binomial, we take p̂ to be N(0.3, 0.21/100). Then P(p̂ > 0.4) = P(z > (0.4 − 0.3)/(√0.21/10)) = P(z > 0.1/0.0458) = 1 − Φ(2.18) = 1 − 0.98537 = 0.01463. The probability that p̂ lies outside the interval is therefore 0.0293 and the probability that it lies within the range is 0.9707. This is of course better than what we get using Chebyshev. Finally, if n = 1000, we have σ² = 0.00021, which makes σ ≈ 0.0145. This makes both 0.2 and 0.4 nearly 7 standard deviations from 0.3, so the probability of p̂ falling outside the interval is very close to 0.
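The comparison above can be reproduced directly (a sketch of the arithmetic only):

    import numpy as np
    from scipy import stats

    p, eps = 0.3, 0.1
    for n in (100, 1000):
        var = p * (1 - p) / n                      # 0.21/n
        chebyshev_bound = var / eps ** 2           # P(|p_hat - p| >= eps) <= var/eps^2
        normal_approx = 2 * stats.norm.sf(eps / np.sqrt(var))
        print(n, chebyshev_bound, normal_approx)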
4.2 Efficiency
So we would like our estimator to be unbiased and consistent. Another desirable property of an estimator
is that it should have a small mean square error which for unbiased estimators is just the variance.
That is given two unbiased estimators, one would of course choose the one with the smaller variance.
To see why just think about the width of the corresponding (1 − α) × 100%-confidence intervals. The
width is of course directly related to the variance. This motivates the following definition.
Definition 29 (Relative Efficiency). Given two unbiased estimators θ̂₁ and θ̂₂ of a parameter θ, the relative efficiency of θ̂₁ relative to θ̂₂ is denoted by eff(θ̂₁, θ̂₂) and is defined as

eff(θ̂₁, θ̂₂) = var(θ̂₂)/var(θ̂₁).

For biased estimators it is defined as

eff(θ̂₁, θ̂₂) = MSE(θ̂₂)/MSE(θ̂₁).
Remark 40. We are assuming of course that both estimators have finite variance. If one of them has
infinite variance then one would choose the other one. If both have infinite variance then the efficiency
is not defined and one would have to use a different metric to compare the two estimators.
Notice that the definition is not symmetric, and that the smaller the m.s.e. of θ̂1 is, the greater its
relative efficiency w.r.t. any other estimator.
The reason for the name is that a more efficient estimator makes better use of the data, in that it produces an estimate with less variability. We can see this from the following example. Suppose that Y₁, . . . , Y_n are i.i.d. uniform on (0, M). Then

E(Y_i) = (1/M) ∫_0^M x dx = M/2,
E(Y_i²) = (1/M) ∫_0^M x² dx = M²/3,
var(Y_i) = M²/12.
Suppose we want to estimate the mean. Of course one possibility is to use the sample mean
θ̂₁ := Ȳ = (1/n) Σ Y_i,
which we know is unbiased and consistent.
Another possibility comes from the fact that the maximum of a large sample should be close to M
and therefore one could try M_n/2 where M_n := max_{i=1,...,n} Y_i. Let's just check first what the mean of this is. First we compute
F_{M_n}(x) := P(M_n ≤ x) = Π_{i=1}^n P(Y_i ≤ x) = (x/M)^n

for x ∈ (0, M), 0 for x < 0 and 1 for x > M. Thus

f_{M_n}(x) := F′_{M_n}(x) = n x^{n−1} (1/M)^n = n x^{n−1}/M^n,
and thus finally

E[M_n/2] = ∫_0^M (x/2) · n x^{n−1}/M^n dx
         = (n/(2M^n)) [x^{n+1}/(n + 1)]_0^M
         = (n/(2M^n)) · M^{n+1}/(n + 1) = (M/2) · n/(n + 1).

Therefore to make this unbiased we should use

θ̂₂ := ((n + 1)/n) · (M_n/2).
Let's compute the variances. The first one we know to be

var(θ̂₁) = (1/n) var(Y₁) = M²/(12n).
For the second one we first compute the second moment

E[M_n²] = ∫_0^M x² · n x^{n−1}/M^n dx = (n/M^n) [x^{n+2}/(n + 2)]_0^M = (n/M^n) · M^{n+2}/(n + 2) = n M²/(n + 2),

and thus

var(θ̂₂) = ((n + 1)²/(4n²)) var(M_n)
         = ((n + 1)²/(4n²)) [ n M²/(n + 2) − M² n²/(n + 1)² ]
         = ((n + 1)²/(4n²)) · n M²/((n + 2)(n + 1)²) = M²/(4n(n + 2)).
Therefore

eff(θ̂₁, θ̂₂) = var(θ̂₂)/var(θ̂₁) = 3/(n + 2),

which is less than 1 if n > 1, and therefore if n > 1, θ̂₂ has a smaller variance and is generally preferable to θ̂₁ as an estimator of the mean M/2.
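A simulation (a sketch; M, n and the number of repetitions are arbitrary) confirms that both estimators are centred at M/2 and that their variance ratio is close to 3/(n + 2):

    import numpy as np

    rng = np.random.default_rng(4)
    M, n, reps = 10.0, 20, 100_000
    Y = rng.uniform(0.0, M, size=(reps, n))

    theta1 = Y.mean(axis=1)                      # sample mean
    theta2 = (n + 1) / n * Y.max(axis=1) / 2     # bias-corrected maximum

    print(theta1.mean(), theta2.mean(), M / 2)   # both unbiased for M/2
    print(theta2.var() / theta1.var(), 3 / (n + 2))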
4.3 Maximum Likelihood Estimates
Although we know a few ways to compare estimators, we have yet to discuss how to construct them. It’s
like a lot of things in mathematics, including integration. There is one method that is often used both
in practical problems and also, as we shall see, in theoretical work, and that’s the method of maximum
likelihood. Before we discuss the method we need to define the concept of likelihood.
Definition 30 (Likelihood function for Discrete). Suppose Y1 , . . . , Yn are discrete random variables
whose distribution depends on a parameter θ, and have probability mass function
pθ (y1 , . . . , yn ) := Pθ (Y1 = y1 , . . . , Yn = yn ).
Let y₁, . . . , y_n be sample observations. The likelihood of the parameter θ given the observations (y₁, . . . , y_n) is denoted by L(θ|y₁, y₂, . . . , y_n) and is defined to be

L(θ|y₁, y₂, . . . , y_n) := P_θ(Y₁ = y₁, . . . , Y_n = y_n).
Definition 31 (Likelihood function for Continuous). Suppose Y1 , . . . , Yn are jointly continuous ran-
dom variables whose distribution depends on a parameter θ, and have probability density function
f(y₁, . . . , y_n|θ). Let y₁, . . . , y_n be sample observations. The likelihood of the parameter θ given the observations (y₁, . . . , y_n) is denoted by L(θ|y₁, y₂, . . . , y_n) and is defined to be

L(θ|y₁, y₂, . . . , y_n) := f(y₁, . . . , y_n|θ),

that is, the joint density for the parameter θ evaluated at the observations.
When the variables Y₁, . . . , Y_n are i.i.d. and Y_i ∼ f(·|θ), in other words each Y_i has a pdf which depends on a parameter θ, then it's very easy to see that

L(θ|y₁, . . . , y_n) = Π_{i=1}^n f(y_i|θ).
Remark 41. Despite the way we write it, we think of the likelihood function as a function of the param-
eters, and we treat the observations as fixed. Sometimes we will drop the observations and simply write
L(θ).
Suppose now that we have a sample of size n from some distribution F(·|θ), which we know up to the parameter θ. One way to estimate the parameter would be to choose the value that makes the observations as probable as possible. In the discrete case, we should choose the value θ* that maximises the probability

P_θ(Y₁ = y₁, . . . , Y_n = y_n) = L(θ|y₁, . . . , y_n).
Notice that in the continuous case the likelihood is no longer a probability, but a density. However
if Y1 , . . . , Yn have joint density f (·, . . . , ·|θ) then by maximising L(θ|y1 , . . . , yn ) over θ, we are also
maximising
P(Y_i ∈ [y_i − ∆y_i, y_i + ∆y_i], i = 1, . . . , n | θ) ≈ L(θ|y₁, y₂, . . . , y_n) ∆y₁ × · · · × ∆y_n,
which is a probability.
This leads to the following definition.
Definition 32 (Maximum Likelihood Estimator). Suppose that a sample y1 , . . . , yn has likelihood func-
tion L(θ) = L(θ; y1 , . . . , yn ) depending on a parameter(vector) θ. Then a maximum likelihood estimator
θ̂MLE is the value of the parameters that maximises L(θ), if a maximum exists.
Remark 42. A maximum may not exist, or it may not be unique. In the first case the maximum likelihood estimator does not exist; in the second it will not be unique, in the sense that the procedure will produce a set of point estimates. In the cases we will deal with, the maximum likelihood estimator will usually be unique.
Remark 43. In most cases it is far easier to instead maximise the log-likelihood function l(θ) which is
perhaps unsurprisingly defined as
l(θ) = l(θ; y1 , . . . yn ) = log L(θ; y1 , . . . , yn ).
So we compute the probability of the observed outcome as a function of the parameters. We then
maximise this likelihood function to obtain the maximum likelihood estimates of the parameters. Note
that these are not the values of the parameters that are most likely, given the data.
Example 19. As an example, we consider the exponential distribution f(x|λ) = λe^{−λx}. Suppose we take a sample of size n. The likelihood is

L(λ|x₁, . . . , x_n) = Π_{i=1}^n λe^{−λx_i} = λ^n exp(−λ Σ_{i=1}^n x_i) = λ^n exp(−nλx̄).

Then

log L(λ) = n log λ − nλx̄,

and so

(d/dλ) log L = n/λ − nx̄.

Setting this to zero, L has a unique maximum at λ̂ = 1/x̄, and this is therefore the maximum likelihood estimate of λ.
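The closed-form answer can be checked by maximising the log-likelihood numerically (a sketch; the true λ and sample size below are arbitrary):

    import numpy as np
    from scipy.optimize import minimize_scalar

    rng = np.random.default_rng(5)
    x = rng.exponential(scale=1 / 2.5, size=500)   # simulated data, true lambda = 2.5
    neg_loglik = lambda lam: -(len(x) * np.log(lam) - lam * x.sum())

    res = minimize_scalar(neg_loglik, bounds=(1e-6, 50), method='bounded')
    print(res.x, 1 / x.mean())   # numerical maximiser vs the closed form 1/xbar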
Example 20. As a more complicated example, let's find the maximum likelihood estimates of the parameters of the normal distribution. The likelihood function is

L = Π_{i=1}^n (1/(σ√(2π))) exp[−(x_i − µ)²/(2σ²)],

so that

log L = −(n/2) log(2π) − n log σ − (1/(2σ²)) Σ_{i=1}^n (x_i − µ)².

Differentiating with respect to µ and σ and setting the derivatives equal to zero gives

(1/σ²) Σ_{i=1}^n (x_i − µ) = 0,
−n/σ + (1/σ³) Σ_{i=1}^n (x_i − µ)² = 0,

with solutions

µ̂ = (1/n) Σ_i x_i,   σ̂² = (1/n) Σ_i (x_i − x̄)².
Note that the maximum likelihood estimator of the variance is biased. We don’t always find all the
criteria pointing at the same statistic.
Example 21. Suppose that y₁, . . . , y_n are observations from a population with density

f(x; θ) = θ/x² for x ≥ θ, and 0 otherwise.

The likelihood is

L(θ; y₁, . . . , y_n) = Π_{i=1}^n θ/y_i² = θ^n / Π_i y_i²   provided θ ≤ y_i for every i,

and L(θ; y₁, . . . , y_n) = 0 otherwise. On (0, min_i y_i] the likelihood is increasing in θ, so the maximum is attained at the boundary, θ̂ = min_i y_i; setting a derivative to zero would never have found it.

Figure 4.3.1: The likelihood function L(θ; y₁, . . . , y_n). Notice that it vanishes for θ > min_i y_i.

The moral is that when maximising a likelihood you should always try to visualise what is happening at the boundaries, if there are any.
4.4 The Cramer-Rao Inequality

For a random variable X ∼ f(·; θ), write l(θ; x) := log f(x; θ) for the log-likelihood of a single observation. The Fisher information of an i.i.d. sample of size n is defined as

I_n(θ) := n E[((∂/∂θ) l(θ; X))²].

It can be shown that if the second partial derivative exists then we also have that

I_n(θ) = −n E[(∂²/∂θ²) l(θ; X)].
Theorem 45. Let X1 , . . . , Xn be i.i.d. with probability density function f (y; θ). Let θ̂n = g(X1 , . . . , Xn )
be an unbiased estimator of θ, such that the support of g(X) (the region for which the probability is not
zero) does not depend on θ. Then under mild conditions we have that
var(θ̂_n) ≥ 1/I_n(θ).
Proof. We deal with the continuous case; the discrete case is done similarly. Consider the random
variable W defined by
W = (∂/∂θ) log f(X; θ) = f′(X; θ)/f(X; θ),

where X ∼ f(·; θ) and f′(x; θ) denotes differentiation with respect to θ. Clearly W depends on θ, and so its expected value depends on θ and is obtained by integrating with respect to f(·; θ). Hence
E(W) = ∫ (f′(x; θ)/f(x; θ)) f(x; θ) dx = ∫ f′(x; θ) dx = (d/dθ) ∫ f(x; θ) dx = (d/dθ)(1) = 0,
under fairly general conditions that guarantee we can exchange differentiation and integration. Since
E(W) = 0, cov(W, θ̂_n) = E(W θ̂_n), and thus

cov(W, θ̂_n) = E(W θ̂_n) = ∫ g(x₁, . . . , x_n) (f′(x; θ)/f(x; θ)) f(x; θ) dx
            = ∫ g(x₁, . . . , x_n) f′(x; θ) dx
            = (d/dθ) ∫ g(x) f(x; θ) dx
            = (d/dθ) E(θ̂_n) = dθ/dθ = 1,     (4.4.1)
again under sufficient conditions to allow the exchange of derivative and integration. Now the maximum
value of any correlation coefficient is unity, and so

1 = [cov(W, θ̂_n)]² ≤ var(W) var(θ̂_n),   that is   var(θ̂_n) ≥ 1/var(W).

It remains to compute var(W). Since the sample is i.i.d., W = Σ_{i=1}^n W_i, where the W_i := (∂/∂θ) log f(X_i; θ) are i.i.d. We have just proved that the means are zero (because the proof works for a sample of one) and
hence the variance of each W_i is equal to E(W_i²), which is by definition I₁(θ). Hence

var(W) = var(Σ W_i) = n I₁(θ) = n E[((∂/∂θ) log f(X; θ))²] = I_n(θ),

and therefore var(θ̂_n) ≥ 1/I_n(θ), which proves the theorem.
The importance of the Cramer Rao inequality is that it establishes a limit to the efficiency of an
estimator. No unbiased estimator can have variance less than 1/In (θ), so we can use this as the optimum.
We define the efficiency of an estimator as the ratio of the Cramer-Rao lower bound to the variance of the estimator.
Definition 34 (Efficiency). The efficiency of an unbiased estimator θ̂n of a parameter θ is defined as the
ratio of the Cramer-Rao bound to the variance of θ̂n , that is
eff(θ̂_n) = 1/(I_n(θ) var(θ̂_n)).
Example 22. Let us consider a random sample of size n from the exponential distribution f(x; λ) = λe^{−λx}. Recall that we showed that 1/x̄ is the maximum likelihood estimator for λ:

L = Π λe^{−λx_i} = λ^n e^{−λ Σ x_i},
log L = n log λ − λ Σ x_i.

To maximise L we differentiate with respect to λ and find n/λ̂ = Σ x_i, and so λ̂ = 1/x̄. Since 1/λ is a monotone (decreasing) function of λ, we can reparameterise with respect to µ ≡ 1/λ, and the invariance of MLEs gives µ̂_MLE = x̄.
Now X̄ is actually an efficient estimate of µ := 1/λ. To see this, we reparameterise the density in terms of µ:

f(x; µ) = (1/µ) e^{−x/µ},
log f = −log µ − x/µ,
(∂/∂µ) log f = −1/µ + x/µ².

We also know that E(X) = 1/λ = µ and E(X²) = 2/λ² = 2µ². Hence

E[((∂/∂µ) log f(X; µ))²] = E[(X/µ² − 1/µ)²] = E(X²)/µ⁴ − 2E(X)/µ³ + 1/µ² = 2/µ² − 2/µ² + 1/µ² = 1/µ².
Hence

I_n(µ) = n/µ².

The variance of the exponential distribution is 1/λ² = µ², so the variance of the mean of a sample of size n is µ²/n. This is just 1/I_n(µ), which proves the result.
We mentioned above that if 1/X̄ is a maximum likelihood estimate of λ then X̄ is a maximum
likelihood estimate of µ = 1/λ. On the other hand, the fact that X̄ is an unbiased and efficient estimate
of µ does not mean that 1/X̄ is an efficient or even unbiased estimate of λ. In fact we can show that
λ̂₀ = (n − 1)/Σ X_i

is an unbiased estimate of λ, that

var(λ̂₀) = λ²/(n − 2),
and that the efficiency of λ̂0 is 1 − 2/n. This is less than unity, and there is no efficient estimator of λ.
On the other hand, the efficiency does approach unity as n → ∞. In such cases we say that λ̂0 is an
asymptotically efficient estimator.
4.5 Sufficiency
4.5.1 Sufficient Statistics
Suppose we toss a coin n times and we observe heads k times. Our intuition is to use p̂ = k/n as an
estimate of p, the probability of a head, and indeed this is the maximum likelihood estimator. However
someone might object that we have thrown away a lot of information by reducing all the data to a single
statistic, p̂. Are we certain that, for example, the order of the heads and tails, or perhaps whether the
third one was a head, have nothing more to tell us?
Of course we are certain, but how do we express this? A useful way of thinking about it is the
following. The probability of getting k heads in n trials is of course
C(n, k) p^k (1 − p)^{n−k}.
This depends on p, which is why we can use the outcome to help us estimate p. Suppose, however, that
we are told that there were exactly k heads. Then there are C(n, k) equally likely orderings of the k heads and n − k tails, and each of them has probability 1/C(n, k). This is the conditional probability of a particular
outcome/ordering of the experiment given the total number of heads, and as you can see it is independent
of p. Consequently it can offer us no additional insight about p. Therefore all the information we can
extract about p is contained in the statistic p̂. We therefore call p̂ a sufficient statistic.
Definition 35 (Sufficient Statistic). Let X1 , . . . , Xn be i.i.d. from a probability distribution with pa-
rameter θ. Then the statistic T (X1 , X2 , . . . , Xn ) is called a sufficient statistic for θ if the conditional
distribution of X1 , . . . , Xn given the value of T does not depend on θ.
Fortunately we do not have to work out conditional distributions, because of a theorem which we will
state without proof.
Theorem 46. A statistic T(X₁, X₂, . . . , X_n) is a sufficient statistic if and only if the joint probability density of X₁, . . . , X_n can be factorised into two factors, one of which depends only on T and the parameters while the other is independent of the parameters:

f(x₁, . . . , x_n; θ) = g(T(x₁, . . . , x_n); θ) h(x₁, . . . , x_n).

Remark 47. Note that the second factor h is a function of x₁, . . . , x_n, and thus it may also depend on the statistic. It must not, however, depend on the parameter.
4.5.2 Efficient statistics are sufficient
Let’s go back to the discussion of Section 4.4 and recall from Equation 4.4.1 that if the estimator θ̂n of
parameter θ is efficient, then cov(W, θ̂n ) = 1, where
W := (∂/∂θ) log f(X; θ) = f′(X; θ)/f(X; θ),   X ∼ f(·; θ),

and since also var(θ̂_n) = 1/var(W), we have that corr(W, θ̂_n) = 1.
Remark 48. Let X, Y be two random variables such that E X = µ_X, E Y = µ_Y, var(X) = σ_X², var(Y) = σ_Y² and corr(X, Y) = 1. Then an easy calculation shows that

E[(X − µ_X − (corr(X, Y) σ_X/σ_Y)(Y − µ_Y))²] = σ_X² − ρ²_{XY} σ_X² = 0,

since corr(X, Y) = 1 by assumption. This means that there exist constants α and β (namely β = σ_X/σ_Y and α = µ_X − βµ_Y) such that E[(X − α − βY)²] = 0, and since (X − α − βY)² ≥ 0, this implies that we must have P(X = α + βY) = 1.
Therefore if θ̂_n is efficient, W is a linear function of θ̂_n, though the coefficients may involve θ:

(∂/∂θ) log f(X; θ) = W = a(θ) + b(θ)θ̂_n.

Thus log f(X; θ) can be obtained by integrating W with respect to θ, and this implies

log f(X; θ) = A(θ) + B(θ)θ̂_n + K(X).

Here the functions A, B are indefinite integrals of a, b, respectively, and K(X) is an arbitrary function which is independent of θ. The likelihood function is therefore of the form

f(X; θ) = exp[A(θ) + B(θ)θ̂_n] exp[K(X)].

But this is the required factorisation for sufficiency, and thus θ̂_n is sufficient.
The converse is not necessarily true, i.e. there are sufficient statistics which are not efficient.
Example 23 (Poisson). We want to use x̄ to estimate λ, the parameter of the Poisson distribution. For a
sample of size n we have
f(x₁, x₂, . . . , x_n; λ) = Π_{i=1}^n (e^{−λ} λ^{x_i}/x_i!) = e^{−nλ} λ^{Σ x_i} Π_i (1/x_i!) = e^{−nλ} λ^{nx̄} Π_i (1/x_i!),

which is the required factorization.
Example 24 (Normal). We now check that x̄ is a sufficient statistic for estimating the mean of a normal distribution with known variance:

f(x₁, x₂, . . . , x_n; µ) = (1/(2πσ²)^{n/2}) exp(−(1/(2σ²)) Σ (x_i − µ)²)
                        = (1/(2πσ²)^{n/2}) exp(−(n/(2σ²))(x̄ − µ)²) exp(−(1/(2σ²)) Σ (x_i − x̄)²),

which is again the required factorisation. Note that the statistic appears in the second factor; it's the parameters that must not be there. Hence x̄ is a sufficient statistic for estimating µ.
We may also ask whether the pair (x̄, s) is sufficient for estimating the parameters (µ, σ), which is the problem we are often really faced with. To see that they are, we simply note that if we substitute s² = Σ (x_i − x̄)²/(n − 1) in the above expression we obtain

f(x₁, x₂, . . . , x_n; µ, σ) = (1/(2πσ²)^{n/2}) exp(−(n/(2σ²))(x̄ − µ)²) exp(−((n − 1)/(2σ²)) s²),

which is a (trivial) factorization of the required kind.
It is intuitively clear that sufficiency is important because it means we have extracted everything
relevant from the data. In fact we can make this precise by the following result:
Theorem 49. From any unbiased estimate which is not based on a sufficient statistic, an improved
estimate can be obtained which is based on the sufficient statistic. It has smaller variance, and is
obtained by averaging with respect to the conditional distribution given the sufficient statistic.
Proof. Suppose that R(X₁, X₂, . . . , X_n) is an unbiased estimate of the parameter θ and that T(X₁, X₂, . . . , X_n) is a sufficient statistic for θ. Let the joint probability density function of R and T be f_{R,T}(r, t). Then the marginal distribution for T is

f_T(t) = ∫_{−∞}^∞ f_{R,T}(r, t) dr.

The conditional distribution of R given T is

f_{R|T}(r|t) = f_{R,T}(r, t)/f_T(t),

and because T is a sufficient statistic this does not depend on θ. Also, since R is an unbiased estimate of θ,

E[R] = ∫_{−∞}^∞ ∫_{−∞}^∞ r f_{R,T}(r, t) dr dt = θ.
We now define a new estimator S := s(T), where

s(t) := E[R | T = t] = ∫_{−∞}^∞ r f_{R|T}(r|t) dr.

Because T is sufficient, s(t) does not depend on θ, so S is a genuine statistic. It is unbiased, since

E[S] = ∫_{−∞}^∞ [ ∫_{−∞}^∞ r f_{R|T}(r|t) dr ] f_T(t) dt = ∫_{−∞}^∞ ∫_{−∞}^∞ r f_{R,T}(r, t) dr dt = θ.
It remains to show that var(S) < var(R). We use a standard decomposition to get

var(R) = E[(R − θ)²] = E[((R − S) + (S − θ))²] = var(S) + E[(R − S)²] + 2 E[(R − S)(S − θ)].

The second term on the right hand side is obviously non-negative. In fact S ≠ R since S is sufficient and R is not. Therefore P[(R − S)² > 0] > 0 and thus E[(R − S)²] > 0 from the positivity of expectations property in Section 2.4.1. So the proof is complete apart from dealing with the last term:
E[(R − S)(S − θ)] = ∫_{−∞}^∞ ∫_{−∞}^∞ [r − s(t)][s(t) − θ] f_{R,T}(r, t) dr dt
                  = ∫_{−∞}^∞ ( ∫_{−∞}^∞ [r − s(t)] f_{R|T}(r|t) dr ) [s(t) − θ] f_T(t) dt,

where we have used f_{R,T}(r, t) = f_{R|T}(r|t) f_T(t). By definition

s(t) = ∫_{−∞}^∞ r f_{R|T}(r|t) dr,

and so the inner integral is identically zero. This proves the theorem.
Example 25. Suppose we toss a coin n times. Then for the i-th coin let Xi = 0 for a tail and 1 for a
head, and we decide to use as our estimate of the probability of a head X1 , the result of the first toss and
ignore the rest. First of all X1 is unbiased:
E(X1 ) = 0 × q + 1 × p = p
Its variance is pq, or npq with n = 1. For example if p = 1/2 the variance is 1/4 and the standard
deviation 1/2. Suppose, however, that we know that in the n trials there were exactly t heads. Then the
probability that the first one was a head is t/n and the probability that it was a tail is (n − t)/n. Now
the proportion of heads is a sufficient statistic:
f(x₁, x₂, . . . , x_n) = Π_{i=1}^n p^{x_i} q^{1−x_i}
                      = q^n Π_{i=1}^n (p/q)^{x_i}
                      = q^n (p/q)^{Σ x_i}
                      = q^n (p/q)^{n(t/n)},
which is a (trivial) factorization of the required kind. We are now set up to use the theorem. We average
the statistic we began with using its conditional distribution given the sufficient statistic:
P[X₁ = 0 | t] = (n − t)/n,   P[X₁ = 1 | t] = t/n.

The mean value for any given value of T is then

0 × (n − t)/n + 1 × t/n = t/n,

and this has variance pq/n, which is less than pq.
Of course this isn’t really very impressive. The theorem worked, but it only gave us back a statistic
that we already knew to be unbiased and sufficient. There are cases where we can use the theorem a
bit more effectively, but its real importance is that it tells us that for unbiased estimation we should be
looking at sufficient statistics. In fact, for many distributions there is only one unbiased estimate based
on a sufficient statistic, and then that gives us the minimum variance unbiased estimate automatically.
So if we find a statistic which is both unbiased and sufficient, we can be reasonably confident we are
doing about as well as we can.
4.6 The Method of Moments

Consider, for example, the gamma density

f(x) = (λ^α/Γ(α)) x^{α−1} e^{−λx}.

Since E(X) = α/λ and var(X) = α/λ², we can solve for the parameters:

λ = E(X)/(E(X²) − [E(X)]²),   α = [E(X)]²/(E(X²) − [E(X)]²).

The method of moments consists in replacing E(X) with (Σ X_i)/n and E(X²) with (Σ X_i²)/n, and this gives the method of moments estimates of λ and α.
Chapter 5
Theory of Testing
Theorem 50 (Central Limit Theorem). Let X1 , X2 , . . . be i.i.d. random variables with E(Xi ) = µ and
var(Xi ) = σ 2 < ∞. Define
Z_n := (Σ_{i=1}^n X_i − nµ)/(σ√n) = (X̄ − µ)/(σ/√n).

Then the distribution function of Z_n converges to the distribution function of a standard normal random variable as n → ∞, that is for all z

lim_{n→∞} F_n(z) := lim_{n→∞} P(Z_n ≤ z) = ∫_{−∞}^z (1/√(2π)) e^{−t²/2} dt.
Remark 51. This form of convergence is known as convergence in distribution, among other things.
Proof. We will prove the result in the case where the MGF of X_i, ψ(t), exists. This is by no means necessary but it simplifies the proof significantly. Also, since we can write Z_n as

Z_n := (1/√n) Σ_{i=1}^n (X_i − µ)/σ,

where the random variables (X_i − µ)/σ are also i.i.d. with mean 0 and variance 1, we can assume w.l.o.g. that µ = 0 and σ² = 1.
By independence the moment generating function of the sum Σ_{i=1}^n X_i will be [ψ(t)]^n, and the moment generating function of Z_n will be

ζ_n(t) := E(e^{tZ_n}) = E(e^{t Σ X_i/√n}) = E(e^{(t/√n) Σ X_i}) = [ψ(t/√n)]^n.
Since µ = 0 and σ = 1 we know that ψ′(0) = E(X) = 0 and ψ″(0) = E(X²) = 1, and thus the Taylor expansion of ψ(t) takes the form

ψ(t) = ψ(0) + tψ′(0) + (t²/2!)ψ″(0) + (t³/3!)ψ‴(0) + · · ·
     = 1 + t²/2! + (t³/3!)ψ‴(0) + · · ·
     = 1 + t²/2 + o(t²),
as t → 0. Hence

ζ_n(t) = [ψ(t/√n)]^n = [1 + t²/(2n) + o(t²/n)]^n,

and thus

log ζ_n(t) = n log[1 + t²/(2n) + o(t²/n)],

and using the fact that log(1 + x) = x + o(x) as x → 0, we have that

log ζ_n(t) = n[t²/(2n) + o(1/n)] → t²/2,

from which we conclude that

ζ_n(t) → e^{t²/2},

which is the moment generating function of the standard normal distribution. The conclusion follows from the following theorem.
Theorem 52 (Continuity Theorem). Suppose that the moment generating functions of the random vari-
ables U, U1 , U2 , . . . exist for all |t| < h and
E exp(tUn ) → E exp(tU ), for all |t| < h,
then for all u
P(Un ≤ u) → P(U ≤ u).
This is the justification for using the normal distribution in all sorts of examples. But this only works in the limit, and how fast we approach the limit depends on the distribution of the population (although there are results, such as the Berry-Esseen theorem, which quantify the rate of convergence).
So if you are pretty sure that the distribution is something like a normal distribution, the theorem
will come into play on fairly small samples. On the other hand, if the distribution is very different, if
it is skewed (like incomes) or if a lot of the probability is in the tails (like hours of sunshine in many
countries) you’d need a very large sample to make it work. This is typical of statistics. In practice we
do something which is theoretically not entirely justified but works well enough most of the time. But
we need to know why it works so that we understand why/when it may fail.
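A simulation (a sketch; the exponential distribution and the sample sizes are arbitrary choices) shows the effect: for skewed data the normal approximation to the distribution of the standardised mean is poor for small n and good for large n.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(6)
    mu = sigma = 1.0     # Exp(1) has mean 1 and standard deviation 1
    for n in (5, 30, 500):
        X = rng.exponential(scale=1.0, size=(50_000, n))
        Zn = (X.mean(axis=1) - mu) / (sigma / np.sqrt(n))
        # P(Z_n <= 1.645) should approach the normal value Phi(1.645) = 0.95
        print(n, np.mean(Zn <= 1.645), stats.norm.cdf(1.645))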
A hypothesis is called simple if it completely determines the distribution of the sample, that is, it specifies a single value of the parameter; otherwise it is called composite. If, for example, we are testing the null hypothesis H0 : θ = 5 against the alternative H1 : θ > 5, then H0 is simple and H1 is composite.
To perform the test we need a test statistic t(X), whose distribution we know when H0 is true, and
such that extreme values of t(X) make us doubt the validity of H0 . Suppose now that we perform the
experiment and obtain observations x := (x1 , . . . , xn ), which means that the observed value of the
statistic t(X) is simply t(x).
Definition 37 (Significance Level or p-value). The significance level or p-value of the test above is
p = P(t(X) ≥ t(x)|H0 ),
where P(t(X) ∈ A|H0 ) denotes the probability that t(X) ∈ A when H0 is true.
If p is small it means that if H0 were true the probability that a value as extreme as t(x) would be
observed if the experiment was repeated is small.
Example 26. Suppose that we observe a sample X1 , . . . , Xn from a N (µ, 1) population and we want
to test
H0 : µ = 0,   vs   H1 : µ ≠ 0.
As you can see it is crucial for testing hypotheses that we know the distribution of the statistic under
the null hypothesis. This also explains the distinction between simple and composite hypotheses: it’s
obviously much easier to compute probabilities under a simple hypothesis where the parameter values
are completely determined, while if the distribution depends on a parameter and all we know is that the
parameter lies in some set then we don’t know the distribution.
We know that if H0 is true then X̄ ∼ N (0, 1/n) and therefore if H0 is true then X̄ shouldn’t be too
far from 0. Of course, since X̄ is normally distributed, there’s always a chance that X̄ is extreme even if
H0 is true. It is possible, but very unlikely, and this is the idea behind statistical tests. We are looking for
a statistic that we can compute from the data and whose distribution we know, at least approximately
under the null hypothesis. In our case say that n = 10 and that the sample mean is x̄ = 0.90, and thus z = √n x̄ ≈ 2.85. Since if H0 were true this should be a sample from a standard normal, we compute the probability that a value at least as extreme as 2.85 would be observed, that is

p = P(|Z| ≥ 2.85) = 2(1 − Φ(2.85)) ≈ 0.0044;

this is the p-value, the probability of observing something as extreme as 2.85. Thus if H0 were true and we repeated this experiment 1000 times we would observe something as extreme as 2.85 only around 4 to 5 times. This should really make you doubt whether H0 is true.
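The p-value of Example 26 in one line (a sketch of the arithmetic only):

    import numpy as np
    from scipy import stats

    n, xbar = 10, 0.90
    z = np.sqrt(n) * xbar                  # about 2.85
    print(z, 2 * stats.norm.sf(abs(z)))    # two-sided p-value, about 0.0044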
Remark 53. The p-value is NOT the probability that the null hypothesis is correct, as in our current
setup H0 is either true or false deterministically.
Definition 38 (Rejection or Critical Region). The rejection or critical region is a subset R ⊂ Rn such
that
• if x ∈ R we reject H0 ;
• if x ∉ R we do not reject H0.
The alternative H1 : µ ≠ 0 is a two-sided alternative. In order to reject H0 we want to see large absolute values of z. On the other hand if we are testing H0 against H1′ : µ > 0, then a large negative value of z, although very unlikely under H0, would be even less likely under any µ > 0. Therefore for the one-sided alternative H1′, the p-value should be computed as

p = P(Z ≥ z(x) | H0) = 1 − Φ(z(x)).
                        H0 true             H0 false
  reject H0             Type I error        correct decision
  do not reject H0      correct decision    Type II error
Definition 39. A Type I error is committed when we reject the null hypothesis H0 although it is true. The probability of a type I error is denoted by α and is called the size, or significance level, of the test. If H0 : θ ∈ Θ0 is composite then we define the size of the test by

α = sup_{θ∈Θ0} P(reject H0 | θ).
A Type II error is committed when we accept the null hypothesis H0 although it is false. The proba-
bility of a type II error is denoted by β. The power of a test is the probability of detecting a false null
hypothesis and is thus 1 − β.
Hypothesis testing has a certain asymmetry between the two hypotheses, since we are acting like a
court of law: the null hypothesis is true until proven wrong. Because of this we place more weight on
avoiding type I errors, that is wrongly rejecting the null hypothesis, and therefore we want the probability
of a type I error to be within certain bounds that we deem acceptable.
When we set the (say) 5% significance level, we are accepting that if we repeat the whole procedure many times, around 5% of the time we will make the wrong decision by rejecting a true null hypothesis. Remember
that this probability is indeed 5%, only if all the underlying assumptions (e.g. normality, randomness of
the sample, etc) are true. Therefore when testing a hypothesis, we choose a priori the size of the test,
based on the potential consequences of a type I error. Typical values are 10%, 5%, 1% and so on.
Example 27. In our running example, we can use the z-statistic to test the hypothesis that µ = 0, that is z = x̄/(σ/√n). At the 5% level the rejection region is the set {|z| > 1.96}. If we insist on the 1% level, then the critical region is the set {|z| > 2.58}, and so on.
With the one-sided alternative hypothesis H1′ : µ > 0, the critical region at 5% is {x : z(x) > 1.64} while at 1% it is {x : z(x) > 2.33}.
You can now see where we are going. To test a null hypothesis H0 we work out the probability
distribution for the outcomes of the experiment given H0 . We then choose a critical region of the sample
space of outcomes that contains only 5% (say) of the probability. The problem is that there are naturally
many such regions, and we need to know how to find the best one.
Example 28. Suppose we are in the usual scenario of testing H0 : µ = 0 against H1 : µ ≠ 0 for the mean of a N(µ, 1) population. We use the z-statistic at the 5% significance level. We want to compute the power, that is the probability of correctly rejecting the null hypothesis when µ ∈ ω1, where ω1 = {µ ∈ R : µ ≠ 0}. Therefore H1 is a composite hypothesis and thus the power will depend on the particular value of µ ∈ ω1. Of course the rejection region at the 5% level is fixed and equal to {|z| > 1.96}, and thus to find the power we need to compute the probability P_µ(|Z| > 1.96), where P_µ denotes that we are assuming that X_i ∼ N(µ, 1), and therefore under P_µ

Z := √n X̄ ∼ N(√n µ, 1).

Thus the power is

P_µ(|Z| > 1.96) = P(|Z̃ + √n µ| > 1.96) = 1 − Φ(1.96 − √n µ) + Φ(−1.96 − √n µ),

where Z̃ ∼ N(0, 1). The power is plotted in Figure 5.2.2. In Figure 5.2.3 we compare the power of the two-sided test against that of the one-sided test with H1′ : µ > 0. You can see that the one-sided test has more power for µ > 0, but for µ < 0 its power is practically 0.
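The power function just derived can be evaluated directly (a sketch reproducing the kind of curves shown in Figure 5.2.2):

    import numpy as np
    from scipy import stats

    def power_two_sided(mu, n, z_crit=1.96):
        # P(|Ztilde + sqrt(n)*mu| > z_crit) for Ztilde ~ N(0,1)
        shift = np.sqrt(n) * mu
        return stats.norm.sf(z_crit - shift) + stats.norm.cdf(-z_crit - shift)

    for n in (1, 20, 100, 1000):
        print(n, [round(power_two_sided(mu, n), 3) for mu in (-1.0, -0.2, 0.0, 0.2, 1.0)])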
More generally, suppose we want to test

H0 : θ ∈ ω0   vs   H1 : θ ∈ ω1,

where ω0, ω1 ⊂ Ω. To test the null hypothesis H0 against the alternative H1 we want a test such that P(X ∈ R | θ) ≤ α for all θ ∈ ω0, while at the same time the power 1 − P(X ∈ Rᶜ | θ) is as large as possible for all θ ∈ ω1.
Figure 5.2.2: The power for different values of µ with n = 1 (blue), n = 20 (orange), n = 100 (green) and n = 1000 (red).
In other words, we try to find a test with the smallest Type II error for a given (usually 5% or 1%)
Type I error.
The following result gives us exactly this.
Theorem 54 (Neyman-Pearson Lemma). Suppose we are given the i.i.d. sample X1 , . . . , Xn from a
distribution with parameter θ and we want to test the simple null hypothesis H0 : θ = θ0 against the
simple alternative H1 : θ = θ1 . Let L(θ) = L(θ; x) denote the likelihood of the parameter value θ
given observations x = (x1 , . . . , xn ). Then for any given size α the test that maximises the power at θ1
has rejection region

R := { x = (x₁, . . . , x_n) : L(θ₀; x)/L(θ₁; x) ≤ k },

where k is chosen so that the test has size α,

P(X ∈ R | H0) = α.

This test has maximum power among all tests of H0 vs H1 with size at most α.
Proof. Let X = (X₁, . . . , X_n) be your random sample. Consider any test of size ≤ α with rejection region A. Then

P(X ∈ A | H0) ≤ α.

Recall the definition of the indicator: 1_A(x) = 1 if x ∈ A and 0 otherwise. Let R and k be as in the statement of the theorem. Notice that if x ∈ R then, by definition of R,

L(θ₁; x) ≥ (1/k) L(θ₀; x),   and also   1_R(x) ≥ 1_A(x).

On the other hand, if x ∉ R,

L(θ₁; x) ≤ (1/k) L(θ₀; x),   and   1_R(x) ≤ 1_A(x).

In either case [1_R(x) − 1_A(x)][L(θ₁; x) − (1/k)L(θ₀; x)] ≥ 0, and integrating over x gives

P(X ∈ R | H1) − P(X ∈ A | H1) ≥ (1/k)[P(X ∈ R | H0) − P(X ∈ A | H0)] = (1/k)[α − P(X ∈ A | H0)] ≥ 0,

so the power of R is at least that of A, which proves the lemma.
Figure 5.2.3: The power of the one-sided (orange) and the two-sided (blue) tests (n = 10).
Example 29. Suppose that X = (X₁, . . . , X_n) with X_i ∼ N(µ, 1) i.i.d.; we have a normal distribution with unit variance and we want to test

H0 : µ = 0,   vs   H1 : µ = 2.
So the pdf of x = (x₁, . . . , x_n) under the two hypotheses is

f(x; H0) = (2π)^{−n/2} e^{−Σ x_i²/2},
f(x; H1) = (2π)^{−n/2} e^{−Σ (x_i−2)²/2},

and thus the likelihood ratio is

Λ(x) = L(0; x)/L(2; x) = f(x; H0)/f(x; H1) = 1/Π_{i=1}^n e^{−(−4x_i+4)/2} = exp[−2 Σ (x_i − 1)] = exp[2n(1 − x̄)].
The Neyman-Pearson test rejects H0 when Λ(x) ≤ k, that is when x̄ ≥ C for a suitable constant C, since Λ is decreasing in x̄. Now when H0 is true, the distribution of x̄ is normal with mean zero and variance 1/n. So we need to find C such that

(1/√(2π)) ∫_z^∞ e^{−ζ²/2} dζ = 0.05,   where   z = (C − 0)/(1/√n) = C√n.

Now the 95% point of the standardised normal density is 1.65, so we conclude that the most powerful test is to reject H0 if z > 1.65, i.e. if x̄ > C = 1.65/√n. So the NP lemma produces the right-tail test, which is what we would have chosen anyway. Let's see what happens in a situation in which we would have expected a left-tail test. Consider the same problem: normal density with unit variance, H0 : µ = 0, only this time H1 : µ = −2. The calculation is the same, apart from a change in
sign:

f(x; H0) = (1/√(2π)) e^{−x²/2},   f(x; H1) = (1/√(2π)) e^{−(x+2)²/2},

and for a sample of size n this leads to the likelihood ratio

Λ = Π_{i=1}^n (1/√(2π)) e^{−x_i²/2} / Π_{i=1}^n (1/√(2π)) e^{−(x_i+2)²/2} = 1/Π_{i=1}^n e^{−(4x_i+4)/2} = exp[2 Σ (x_i + 1)] = exp[2n(1 + x̄)].
The Neyman-Pearson lemma says that we have to choose a region for which Λ < k for some constant k which we have to determine. Here that means that the critical region must be of the form

e^{2n(1+x̄)} < k,   or   x̄ < (1/(2n)) log k − 1,

which is indeed a left-tail test. We now need to determine C such that P(x̄ < C | H0) = 0.05, and this is the same calculation as before, so we reject H0 if x̄ < −1.65/√n.
5.4 Uniformly most powerful.
Although we got the answer we were expecting in both cases, these were not really the sorts of problems
we have been dealing with up to now. We don’t usually have an alternative hypothesis like µ = 2. It’s
usually something like µ > 0. In other words, alternative hypotheses are generally composite rather
than simple. (Null hypotheses can be composite too, but let’s not digress!)
Sometimes a test can be found that has the greatest power against all the alternatives in a composite
alternative hypothesis. Such a test is called uniformly most powerful.
Definition 40 (Uniformly most powerful test). Let H0 be a null hypothesis and let H1 be a composite alternative. Then a test is said to be uniformly most powerful (UMP) if it is the most powerful test against every simple hypothesis included in H1.
Example 30. Consider the same example as before: a sample of size n from N(µ, 1), with H0 : µ = 0 against H1 : µ > 0. Then the likelihood ratio will of course depend on the specific alternative µ:

Λ(x; µ) = 1/Π_{i=1}^n e^{−(−2µx_i+µ²)/2} = exp[−µ Σ (x_i − µ/2)] = exp[nµ(µ/2 − x̄)].
i.e. the usual two-tail test. But notice that this is not UMP: for µ > 0 we have seen from the plot in Figure 5.2.3 that the test with critical region {x̄ > k} would be more powerful, and similarly for µ < 0 and the corresponding region {x̄ < −k}.
Example 31. We are given a sample of size n from the exponential distribution f (x; λ) = λe−λx and
we are asked to test the null hypothesis H0 : λ = λ0 against the alternative H1 : λ = λ1 .
According to the Neyman-Pearson lemma we require
\[
k > \Lambda = \frac{L(x; H_0)}{L(x; H_1)}
= \frac{\prod \lambda_0 e^{-\lambda_0 x_i}}{\prod \lambda_1 e^{-\lambda_1 x_i}}
= \left(\frac{\lambda_0}{\lambda_1}\right)^{n} e^{(\lambda_1 - \lambda_0)\sum x_i},
\]
or equivalently
\[
\frac{1}{n}\log k > \log(\lambda_0/\lambda_1) + (\lambda_1 - \lambda_0)\bar{x}.
\]
We can see that as before, what happens next depends on the sign of (λ1 − λ0 ). If (λ1 > λ0 ) then
the critical region is small values of x̄, which is as expected because E(X̄) = 1/λ. The critical region is
thus the interval R = (0, c), where c has to be determined.
For this we need to choose c such that the probability of obtaining a value of x̄ less than or equal to
c is a predetermined α, typically 5% . We can do this because we know that the sum of n independent
and identically distributed exponential variables has the gamma density. If x̄ = c then the sum of the
variables is nc, so we require
\[
\alpha = \frac{\lambda_0^n}{\Gamma(n)} \int_0^{nc} x^{n-1} e^{-\lambda_0 x}\, dx.
\]
For the case n = 1 we choose c such that
\[
\alpha = P(\bar{X} < c \mid H_0) = \int_0^c \lambda_0 e^{-\lambda_0 x}\, dx = 1 - e^{-\lambda_0 c},
\]
or equivalently
\[
c = \frac{1}{\lambda_0}\log\frac{1}{1-\alpha},
\]
and this will also be UMP for H1 : λ > λ0 .
To compute the probability of a Type II error we need the probability that x does not lie in the rejection region if the alternative hypothesis is true, i.e.
\[
\beta = \int_c^\infty \lambda_1 e^{-\lambda_1 x}\, dx = e^{-\lambda_1 c}.
\]
Alternatively, if λ1 < λ0 then the critical region is large values of x̄, i.e. all values of x̄ greater than c, where c now satisfies (taking n = 1 for simplicity)
\[
\alpha = P(\bar{X} > c \mid H_0) = \int_c^\infty \lambda_0 e^{-\lambda_0 x}\, dx = e^{-\lambda_0 c},
\qquad\text{and thus}\qquad
c = \frac{1}{\lambda_0}\log\frac{1}{\alpha}.
\]
This time the probability of the Type II error is
\[
\beta = \int_0^c \lambda_1 e^{-\lambda_1 x}\, dx = 1 - e^{-\lambda_1 c}.
\]
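The critical values and error probabilities above are easy to evaluate; the following sketch (illustrative only, with hypothetical parameter values) does so for the n = 1 case in both directions of the alternative.

```python
import numpy as np

def exp_test_n1(lam0, lam1, alpha=0.05):
    """One-observation test of H0: lambda = lam0 vs H1: lambda = lam1."""
    if lam1 > lam0:                          # reject for small x
        c = np.log(1/(1 - alpha)) / lam0
        beta = np.exp(-lam1*c)               # P(X >= c | lam1)
        region = f"x < {c:.3f}"
    else:                                    # reject for large x
        c = np.log(1/alpha) / lam0
        beta = 1 - np.exp(-lam1*c)           # P(X <= c | lam1)
        region = f"x > {c:.3f}"
    return region, beta

print(exp_test_n1(lam0=1.0, lam1=2.0))   # small-x rejection region
print(exp_test_n1(lam0=1.0, lam1=0.5))   # large-x rejection region
```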
5.5 Likelihood Ratio Test
The Neyman-Pearson lemma is very useful when it works, but it’s really only designed for simple
hypotheses. It can, as we have seen, be used for composite hypotheses as well, but only if we are a bit
fortunate in how things work out.
A more useful approach is the following. Say that X1 , . . . , Xn is a random sample from f (x; θ).
We want to compare the null hypothesis H0 : θ ∈ Θ0 against the general alternative H1 : θ ∈ Θ, so that H0 is a special case of H1. Here we work out the likelihood function under each hypothesis, using in each case the maximum likelihood estimates of all unknown parameters given that hypothesis. Then we use the ratio
of these likelihoods as a measure of the relative probabilities of the hypotheses. As with the Neyman-
Pearson lemma, this gives us only the form of the test; the actual boundary usually has to be worked out
separately.
Thus consider a family of distributions with a parameter θ (remember, θ can be a vector). Let θ̂ be the maximum likelihood estimator of θ over Θ, and let θ̂0 be the maximum likelihood estimator given that H0 is true, i.e. over Θ0. Then the likelihood ratio Λ(x) is defined by
\[
\Lambda(x) = \frac{L(\hat{\theta}_0; x)}{L(\hat{\theta}; x)} = \frac{\sup_{\theta\in\Theta_0} L(\theta; x)}{\sup_{\theta\in\Theta} L(\theta; x)}.
\]
Clearly Λ ≤ 1 with equality only if θ̂0 = θ̂, in which case there is clearly no reason at all to infer that
H0 is false. When, however, Λ is small, we infer that H0 is probably false, and so the principle of the
test is to reject H0 for Λ < k for some k which has to be determined. For a test of size α we must then
choose k such that
\[
\sup_{\theta\in\Theta_0} P(\Lambda(X) \le k \mid \theta) = \alpha.
\]
Note that while this reasoning is plausible, we have no guarantee that it leads to the best test. On the
other hand, the likelihood ratio test can often be used where the Neyman-Pearson lemma cannot.
Let’s see how this works.
Example 32. We are given a random sample X1 , . . . , Xn from a N (µ, σ 2 ) population with both µ and
σ² unknown. We want to test H0 : µ = µ0 against the general alternative H1 : µ ≠ µ0.
There are two parameters of the distribution, µ and σ and we have to estimate them for both the null
hypothesis and the alternative. The likelihood function is
\[
L(\mu, \sigma^2; x) = \prod_{i=1}^n \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x_i-\mu)^2/2\sigma^2}.
\]
We have already found the maximum likelihood estimates of µ and σ. This was done by maximising
\[
\log L = -n\log\sigma - \frac{n}{2}\log(2\pi) - \frac{1}{2\sigma^2}\sum (x_i - \mu)^2
\]
by the usual method of equating to zero the partial derivatives with respect to µ and σ:
\[
0 = \sum (x_i - \mu),
\qquad
0 = -\frac{n}{\sigma} + \frac{1}{\sigma^3}\sum (x_i - \mu)^2.
\]
These are the unconstrained maximum likelihood estimates. Note that the maximum likelihood estimate of σ² is biased. To apply a likelihood ratio test we need the maximum likelihood estimates given that the null hypothesis is true. In that case µ is fixed, so trivially µ̂0 = µ0. The maximum likelihood estimate of the variance is then just σ̂0² = (1/n) ∑(xi − µ0)².
We then substitute back into the likelihood functions, and we find, in either case,
\[
\sup_{\mu,\sigma^2} L(\mu, \sigma^2) = L(\hat{\mu}, \hat{\sigma}^2)
= \prod_{i=1}^n \frac{1}{\hat{\sigma}\sqrt{2\pi}}\, e^{-(x_i-\hat{\mu})^2/2\hat{\sigma}^2}
= \frac{e^{-n/2}}{\hat{\sigma}^n (\sqrt{2\pi})^n},
\]
since
\[
\hat{\sigma}^2 = \frac{1}{n}\sum (x_i - \hat{\mu})^2,
\]
and similarly with σ̂0² = (1/n) ∑(xi − µ0)² under the null hypothesis, so the likelihood ratio is given by
\[
\Lambda = \frac{L(\mu_0, \hat{\sigma}_0^2)}{L(\hat{\mu}, \hat{\sigma}^2)}
= \frac{\hat{\sigma}^n}{\hat{\sigma}_0^n}
= \left(\frac{\sum (x_i - \bar{x})^2}{\sum (x_i - \mu_0)^2}\right)^{n/2},
\]
i.e. the test tells us to reject the null hypothesis for small values of
\[
\frac{\sum (x_i - \bar{x})^2}{\sum (x_i - \mu_0)^2}.
\]
Equivalently, writing y for the reciprocal of this ratio and using the identity ∑(xi − µ0)² = ∑(xi − x̄)² + n(x̄ − µ0)², we reject for large values of
\[
y = 1 + \frac{n(\bar{x} - \mu_0)^2}{\sum (x_i - \bar{x})^2}.
\]
This will be large if the second term is large. But the second term is just the square of
\[
\frac{|\bar{x} - \mu_0|}{\sqrt{\sum (x_i - \bar{x})^2}/\sqrt{n}}.
\]
We will see in the next chapter that this is proportional to the very popular t-statistic, up to a factor of √(n − 1).
Example 33 (Exponential Distribution). We want to test the null hypothesis H0 : λ = λ0 against the alternative H1 : λ ≠ λ0. The likelihood function is
\[
L = \prod_{i=1}^n \lambda e^{-\lambda x_i} = \lambda^n e^{-n\lambda\bar{x}},
\]
so the maximum likelihood estimate is λ̂ = 1/x̄. On the null hypothesis, λ̂0 = λ0. The likelihood ratio is therefore
\[
\Lambda = \frac{\hat{\lambda}_0^n e^{-n\hat{\lambda}_0\bar{x}}}{\hat{\lambda}^n e^{-n\hat{\lambda}\bar{x}}}
= (\lambda_0\bar{x})^n e^{-n\lambda_0\bar{x}} e^{n},
\qquad
\log\Lambda = n\log\lambda_0 + n\log\bar{x} - n\lambda_0\bar{x} + n.
\]
Differentiating with respect to x̄ shows that this has an extremum with respect to x̄ at n/x̄ = nλ0 and
this is clearly a maximum, since it corresponds to the constrained maximum likelihood estimate being
equal to the unconstrained, and makes Λ = 1.
We are looking for values of x̄ that make log Λ small, and since the first and last terms are constants
(for given λ0 , i.e. for a given H0 ) and we can divide by n without loss, the critical region is
log x̄ − λ0 x̄ < k,
for some k to be determined. Now if you look at the plot of the function log x − λ0 x given in Figure 5.5.1, you will notice that it is concave with a unique maximum, so that the region above translates into a critical region of the form R = {x̄ < a} ∪ {x̄ > b}, with a and b chosen so that the test has the required size α.
Figure 5.5.1: The function log x − λ0 x.
Chapter 6
Tests
In the previous Chapter we introduced the general theory of testing along with some general procedures
for constructing tests with useful properties. In this chapter we will introduce and use a large variety of
different tests and learn about their assumptions and their properties. This will help you build an arsenal
of techniques for many different situations.
In general the tests we will go through are divided into two main categories: parametric and non-parametric.
Parametric tests have more underlying assumptions about the population, in particular assuming that the
observations arise from a distribution depending on a number of parameters. This means that if the
assumptions fail then so does our conclusion, which may in particular mean that p-values are under-
estimated. Non-parametric tests are combinatorial in nature and assume far less than parametric tests.
As a result they are much more robust. However nothing comes for free. What you gain in terms of
robustness you lose in terms of power: if the assumptions of a parametric test hold then it will generally
be more likely to reject a false null hypothesis than a non-parametric test of the same size.
Under the null hypothesis we know that Z ∼ N (0, 1) and we can use this either to construct a rejection
region for hypothesis testing, or to compute p-values for significance testing.
• H0 : p = p0 , against
• H1 : p > p0 .
Therefore if we apply the central limit theorem we get that for all z
\[
P\left(\frac{X - np}{\sqrt{np(1-p)}} \le z\right) \to \Phi(z),
\qquad\text{where}\qquad
\Phi(z) := \int_{-\infty}^z \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx.
\]
In other words the distribution of the z-statistic
\[
Z = \frac{X - np}{\sqrt{np(1-p)}} = \frac{\bar{X} - p}{\sqrt{p(1-p)/n}}
\]
is in this case also approximately standard normal.
Remark 55. Notice that the same holds for any distribution that falls within the realm of the central limit theorem. That is, assuming that Xi ∼ f(·; θ) with E X = µ and var(X) = σ² < ∞, then
\[
Z := \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}}
\]
is asymptotically N (0, 1), and thus for large enough samples we can use the z-test as usual.
Word of caution: this is only true approximately, and can fail spectacularly if, for example, var(X) = σ² = ∞, or if n is not large enough.
• H0 : µY = µX ; against
• H1 : µY > µX .
Now under the null hypothesis we know that µX − µY = 0 and thus we use the statistic
\[
Z := \frac{\bar{X} - \bar{Y}}{\sqrt{\frac{\sigma_X^2}{n_X} + \frac{\sigma_Y^2}{n_Y}}} \sim N(0, 1).
\]
Notice that if nX, nY are large then the standard deviation will be small, and therefore if µX = µY we expect the observed difference x̄ − ȳ to be small. If on the other hand we observe a large negative value of x̄ − ȳ, then that casts doubt on the null hypothesis and suggests that indeed it may be that µY > µX.
Remark 56 (Comparing proportions). We can use this formula for comparing proportions too. In that
case, if the proportions are p1 and p2 , the Z statistic is
\[
Z = \frac{p_1 - p_2}{\sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}}},
\]
and under the null hypothesis this is, for large n1, n2, asymptotically N (0, 1) by the Central Limit Theorem.
Remark 57. If you are testing “no difference” between the two proportions, then it is recommended to pool the variance. The Z statistic is
\[
Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}_0(1 - \hat{p}_0)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}},
\]
where p̂0 = (n1 p̂1 + n2 p̂2)/(n1 + n2) is the pooled estimate of the common proportion.
Thus the statistic we should use is
\[
Z := \frac{\bar{X} - \bar{Y}}{\sqrt{\frac{\sigma_X^2}{n_X} + \frac{\sigma_Y^2}{n_Y}}}
= \frac{32.1 - 33.4}{\sqrt{\frac{2.6^2}{100} + \frac{2.8^2}{90}}} = -3.31.
\]
For the comparison of the two proportions, the pooled statistic is
\[
Z = \frac{0.9 - 0.8}{\sqrt{0.83 \times 0.17\,(1/100 + 1/200)}} = 2.17.
\]
The question was whether the new strain is better so we compare with the 5% one-tailed value which is
1.64, so the new strain is significantly better, at 5% significance level.
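A sketch of the pooled two-proportion calculation; the sample sizes 100 and 200 are an assumption read off from the 1/100 and 1/200 terms in the display above, and the function name is illustrative.

```python
import numpy as np
from scipy import stats

def two_proportion_z(p1, n1, p2, n2):
    """Pooled z-statistic for H0: p1 = p2."""
    p0 = (n1*p1 + n2*p2) / (n1 + n2)              # pooled proportion
    se = np.sqrt(p0*(1 - p0)*(1/n1 + 1/n2))
    return (p1 - p2) / se

z = two_proportion_z(0.9, 100, 0.8, 200)
print(z, z > stats.norm.ppf(0.95))   # roughly 2.17, significant one-tailed at 5%
```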
• H0 : µ = µ0 ; against
• H1 : µ > µ0 ,
Recall that the sample variance s² = (1/(n − 1)) ∑(xi − x̄)² is an unbiased, consistent estimate of σ². Therefore one might ask why we don't instead use the statistic
\[
T := \frac{\bar{X} - \mu_0}{s/\sqrt{n}},
\]
and this is indeed the right idea. We can compute this from the data so it is indeed a statistic. Do
we know its distribution under the null hypothesis? Of course we do, simply cast your minds back to
Chapter 3 and Theorem 32 where we showed that T ∼ tn−1 , Student’s t-distribution with n − 1 degrees
of freedom.
Example 36. A builder’s yard sells sacks of sand whose weights are i.i.d. samples from N (µ, σ²), where µ is supposed to be 85 lbs.
We want to test
• H0 : µ = 85; against
• H1 : µ 6= 85.
Ten sacks are selected at random, and the weights turn out to be
87.5 84.8 84.0 87.1 86.8 83.3 83.5 87.4 84.1 86.6.
The sample mean is x̄ = 85.51. Do we have enough evidence to reject the null hypothesis? We compute the t-statistic
\[
t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}},
\qquad
s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}}.
\]
The sum of squares is 26.609, which makes s = √2.96 = 1.72. So the value of the t-statistic is (85.51 − 85)/(1.72/√10) = 0.94. We were asked about a difference, with no indication of the direction
of the difference, so we compare with the 2-tailed value of t on 9df, which is 2.26. Clearly this is not
significant.
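For reference, the same test takes a couple of lines in scipy; this is just an illustrative check, not part of the notes.

```python
import numpy as np
from scipy import stats

weights = np.array([87.5, 84.8, 84.0, 87.1, 86.8, 83.3, 83.5, 87.4, 84.1, 86.6])
t, p = stats.ttest_1samp(weights, popmean=85.0)   # two-sided by default
print(round(t, 2), round(p, 3))   # t about 0.94, well short of the two-tailed 5% point 2.26
```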
• H0 : µX = µY ; against
• H1 : µX ≠ µY .
When σ² = σX² = σY² is known, recall that
\[
Z := \frac{\bar{X} - \bar{Y} - (\mu_X - \mu_Y)}{\sqrt{\frac{\sigma_X^2}{n_X} + \frac{\sigma_Y^2}{n_Y}}} \sim N(0, 1).
\]
Suppose now that we do not know σX, σY, so we choose to estimate them by the sample variances
\[
s_X^2 := \frac{1}{n_X - 1}\sum_{i=1}^{n_X} (X_i - \bar{X})^2,
\qquad
s_Y^2 := \frac{1}{n_Y - 1}\sum_{i=1}^{n_Y} (Y_i - \bar{Y})^2.
\]
However if we only used one of the above estimators we would be throwing away information, so the idea is to combine them to obtain a pooled estimator, that is, we want to use a convex combination of the two. The question is how to combine them.
Suppose for example that nX = nY , so that we have no reason to prefer one estimate over the other,
in which case we could simply use
\[
\frac{1}{2}s_X^2 + \frac{1}{2}s_Y^2
= \frac{\sum (X_i - \bar{X})^2}{2(n_X - 1)} + \frac{\sum (Y_i - \bar{Y})^2}{2(n_Y - 1)}
= \frac{\sum (X_i - \bar{X})^2 + \sum (Y_i - \bar{Y})^2}{n_X + n_Y - 2}
= \frac{(n_X - 1)s_X^2 + (n_Y - 1)s_Y^2}{n_X + n_Y - 2}.
\]
Now in the general case nX ≠ nY, notice that the last formula puts extra weight on the estimator from the larger sample, and this makes perfect sense, since by consistency the larger sample will tend to be more accurate.
Therefore we arrive at the following pooled estimator:
\[
S_p^2 := \frac{(n_X - 1)s_X^2 + (n_Y - 1)s_Y^2}{n_X + n_Y - 2}.
\]
Now if σ² := σX² = σY², then replacing σ by the pooled estimator in the z-statistic we arrive at
\[
T := \frac{\bar{X} - \bar{Y}}{S_p\sqrt{\frac{1}{n_X} + \frac{1}{n_Y}}}.
\]
Of course we need to understand the distribution of the above statistic. Before we do this, we will
look at the denominator. In fact, we already know that
\[
W := \frac{(n_X + n_Y - 2)S_p^2}{\sigma^2}
= \frac{\sum_{i=1}^{n_X} (X_i - \bar{X})^2}{\sigma^2} + \frac{\sum_{i=1}^{n_Y} (Y_i - \bar{Y})^2}{\sigma^2}
\]
is the sum of two independent χ² random variables with (nX − 1) and (nY − 1) degrees of freedom respectively. Therefore by Theorem 27 we conclude that W has the χ² distribution with (nX − 1) + (nY − 1) = nX + nY − 2 degrees of freedom. In addition we know that X̄ is independent of sX, and Ȳ is independent of sY, so that X̄ − Ȳ is independent of W, and thus finally under the null hypothesis
\[
Z := \frac{\bar{X} - \bar{Y}}{\sigma\sqrt{\frac{1}{n_X} + \frac{1}{n_Y}}} \sim N(0, 1).
\]
Therefore
\[
T = \frac{\bar{X} - \bar{Y}}{S_p\sqrt{\frac{1}{n_X} + \frac{1}{n_Y}}}
= \frac{\bar{X} - \bar{Y}}{\sigma\sqrt{\frac{1}{n_X} + \frac{1}{n_Y}}}\bigg/\sqrt{\frac{S_p^2}{\sigma^2}}
= \frac{Z}{\sqrt{W/(n_X + n_Y - 2)}} \sim t_{n_X + n_Y - 2},
\]
by Definition 24.
There are two cases to consider. First, let us suppose that the two samples can be assumed to have the same variance, which would correspond to them coming from the same population, only shifted over a bit. In that case we can show that the appropriate t-statistic is
\[
t := \frac{\bar{x} - \bar{y}}{\sqrt{\dfrac{\sum (x_i - \bar{x})^2 + \sum (y_i - \bar{y})^2}{n_x + n_y - 2}}\,\sqrt{\dfrac{1}{n_x} + \dfrac{1}{n_y}}}.
\]
Example 37. We want to test if professional drivers drive more economically than casual drivers. We
randomly selected five professional and five casual drivers to drive the same car over the same route
through city traffic. The petrol consumption figures were
• H0 : µX = µY ;
• H1 : µX < µY .
The means are x̄ = 1.044 and ȳ = 1.220. The sums of squared deviations are (nX − 1)s2X = .0379 and
(nY − 1)s2Y = .1080. These don’t seem very close together, but let’s assume for the time being there is
no problem; later we will see that there’s no significant reason to expect there is. The pooled estimator
is then
\[
s_p^2 = \frac{.0379 + .1080}{8} = .0182,
\]
and thus
\[
t = \frac{1.044 - 1.220}{s_p\sqrt{\frac{2}{5}}} = -2.06.
\]
The alternative hypothesis is that the professionals use less petrol, so the correct procedure is to compare −t = 2.06 with t8,5%, the one-tailed 5% point of the t-distribution on 8 degrees of freedom. From the tables we find this to be 1.86, so since 2.06 > 1.86 the data do indicate that professionals use less petrol.
Had the question simply been “Is there a difference?”, that is H1 : µX ≠ µY , then the appropriate
test would have been to compare |t| with 2.31, and this would not have been significant. Had the
question been “Do the professionals use more petrol?” then you would compare t with 1.86, and since
−2.06 < 1.86 there is no significant evidence in favour.
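Since only summary statistics are quoted here, the sketch below (illustrative) reproduces the pooled t-statistic directly from them.

```python
import numpy as np
from scipy import stats

def pooled_t_from_summaries(xbar, ybar, ssx, ssy, nx, ny):
    """Pooled two-sample t-statistic from means and sums of squared deviations."""
    sp2 = (ssx + ssy) / (nx + ny - 2)
    t = (xbar - ybar) / np.sqrt(sp2*(1/nx + 1/ny))
    return t, nx + ny - 2

t, df = pooled_t_from_summaries(1.044, 1.220, 0.0379, 0.1080, 5, 5)
print(round(t, 2), df, round(stats.t.ppf(0.95, df), 2))   # about -2.06 on 8 df; one-tailed 5% point ~1.86
```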
Example 38. Consider the following experiment. The percentage of aggregated blood platelets in the blood of 11 individuals was measured before and after they smoked a cigarette.
Before 25 25 27 44 30 67 53 53 52 60 28
After 27 29 37 56 46 82 57 80 61 59 43
Difference 2 4 10 12 16 15 4 27 9 -1 15
It seems that smoking a cigarette results in more clotting and we want to test this hypothesis.
Notice that in this particular case, the data has a very particular structure. Just like the previous
scenario, we have two samples so we could just apply the two-sample t-test. But, in addition in this case
the two samples are naturally paired in a way we can exploit to remove some of the variability due to
differences between the subjects of the study.
To make this more precise, suppose that each individual is assigned a label i = 1, . . . , 11. We denote
with Xi the level before the cigarette, and with Yi the level after the cigarette. We additionally assume
that Yi = Xi + Di , where Xi and Di are independent. We interpret Xi as representing the natural level
of clotting present in individual i due to genetic and various other reasons, while Di captures just the
effect of the cigarette. Notice that the reason we conducted the experiment was specifically to measure
the effect of smoking a cigarette, and not the effects due to other factors. Therefore, by looking at the
differences between the pairs we should be able to remove the variability due to any factors other than
smoking.
In essence, what we are saying is that since Yi = Xi + Di with Xi independent of Di, the differences Di = Yi − Xi have variance var(Yi) − var(Xi), which is smaller than the variance of the raw measurements, and thus by looking at the differences we should be able to get tighter estimates. In essence this means that our hypothesis test should be more powerful than a two-sample test.
Let’s see how we do.
• H0 : µd = 0;
• H1 : µd > 0.
d¯ = 10.3, sd = 7.98.
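The test itself can be completed from the differences; the sketch below (illustrative, not from the notes) carries out the computation with the platelet data.

```python
import numpy as np
from scipy import stats

before = np.array([25, 25, 27, 44, 30, 67, 53, 53, 52, 60, 28])
after  = np.array([27, 29, 37, 56, 46, 82, 57, 80, 61, 59, 43])
d = after - before                       # the 11 differences from the table

t = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))   # one-sample t on the differences
print(round(d.mean(), 1), round(d.std(ddof=1), 2), round(t, 2))
# Equivalently: stats.ttest_rel(after, before). Compare t with the one-tailed
# 5% point of t on 10 degrees of freedom, stats.t.ppf(0.95, 10), roughly 1.81.
```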
Since most estimators in practice are based on a sample of the population, there is always a degree of
uncertainty. We sometimes want to know in addition how good our estimate really is.
One way of doing this is to compute a confidence interval. Let µ0 be a possible value of the population
mean µ. Then we ask whether our data would cause us to reject the hypothesis H0 : µ = µ0 . If it would,
then we consider µ0 to be inconsistent with the data, otherwise it is consistent. And if we would have
rejected H0 at the 5% level, then we say we are 95% confident that µ ≠ µ0.
The 95% confidence interval for µ is the set of all values of µ0 which would not have caused us to
reject the hypothesis µ = µ0 at the 5% level. And similarly for the 1%.
Now we would reject the null hypothesis µ = µ0 if
\[
\frac{|\bar{x} - \mu_0|}{s/\sqrt{n}} \ge t_{5\%,\,\nu\,\mathrm{df}}, \qquad \nu = n - 1.
\]
The end points of the 95% confidence interval are where equality is satisfied:
\[
\bar{x} - \mu_0 = \pm t_{5\%,\,\nu\,\mathrm{df}}\, s/\sqrt{n},
\qquad\text{i.e.}\qquad
\mu_0 = \bar{x} \pm t_{5\%,\,\nu\,\mathrm{df}}\, s/\sqrt{n},
\]
where it is the 2-tailed t-value we need. As an example, let's go back to the data from Example 36.
87.5 84.8 84.0 87.1 86.8 83.3 83.5 87.4 84.1 86.6.
We found x̄ = 85.5 and s = 1.72, and since the 2-tailed 95% value of t is 2.26, the 95% confidence interval is
\[
\mu_0 = 85.5 \pm 2.26 \times 1.72/\sqrt{10} = 85.5 \pm 1.23.
\]
As a sanity check we recall that the original question was whether the sample was inconsistent with a
mean of 85. We found that we could not reject the null hypothesis, and indeed 85 does lie in the interval
(84.3, 86.7).
Figure 6.2.1: The probability density functions of the t-distribution with 1 (blue), 5 (orange), 10 (green) and 30 (red) degrees of freedom, and the pdf of a standard normal (purple).
As the degrees of freedom increase, the pdf of a t-distribution approaches that of a standard normal. The reason is very simple. Recall that the t-distribution with ν degrees of freedom was defined as the distribution of
\[
T_\nu := \frac{Z}{\sqrt{\sum_{j=1}^\nu Y_j^2/\nu}},
\]
where Y1, . . . , Yν are i.i.d. standard normal, independent of Z. The key lies in the denominator,
\[
\frac{\sum_{j=1}^\nu Y_j^2}{\nu},
\]
which as we know by the strong law of large numbers is a consistent estimator of the mean of Yj², which is the variance of Yj. But recall that we use the t-statistic when we don't know the variance, so we replace it by the sample variance estimator, since we know it is consistent.
Well it turns out that the consistency of the sample variance kicks in at moderate sample sizes and is very effective. For sample sizes ≥ 60 there is no practical difference between the t-test and the z-test
because the t distribution with more than 60 degrees of freedom is for all practical purposes almost
identical to the standard normal.
Thus we need a rule of thumb about when to use each test. It is provided in Figure 6.2.2 which also
tells you which distribution to use when constructing confidence intervals for the mean of a population
(or a proportion).
Figure 6.2.2: Decision tree for choosing between the z-test and the t-test: the branches depend on whether σ is known, whether we are estimating a proportion, and whether the sample is large.
For example X ∈ {smoker, non-smoker}, X ∈ {heads, tails}, X ∈ {A, B, C} and so on are all
examples of categorical random variables.
• H1 : p ≠ π.
We have performed the experiment and have n observations, ni in the i-th category. Notice that we must
have n = n1 + · · · + nk .
Under the null hypothesis, the expected count of observations in category i, is of course npi since
we have n trials, each of which has probability pi of resulting in category i. We will base our statistic
on the deviation of the observed count in category i with the expected one. We cannot simply add the
deviations because the result will always be
\[
\sum_i (n_i - np_i) = n - n\sum_i p_i = n - n = 0.
\]
One way to avoid this cancellation is to square each deviation, which would result in
\[
\sum_i (n_i - np_i)^2.
\]
Now this is not too bad as a statistic, since if it takes on large values it casts doubt on the validity of
the null hypothesis. However there is still a slight problem. Notice that if one category, say the first
one, is much more likely than the other ones, then we should take that into account. For example a
deviation of 10 in a category with expected count 1000, should be counted far less than a deviation of 10
in a category with expected count of 20. Therefore we should normalise by the expected count in each
category resulting in the following definition.
Definition 43 (χ²-statistic). The χ²-statistic is given by
\[
X^2 = \sum_{i=1}^k \frac{(n_i - np_i)^2}{np_i}.
\]
The statistic defined above can certainly be computed from the data, but do we know its distribution? In fact we do, at least asymptotically as n → ∞, and its distribution is a χ² distribution where the degrees of freedom are computed as follows:
d.f. = (number of categories) − (number of parameters estimated from the data) − 1.
We will now see a range of examples about how to apply the above test.
Of course the expected count in each category is 10. We summarise our observations in the following table:
Observed 16 15 4 6 14 5
Expected 10 10 10 10 10 10
We compute the χ²-statistic: X² = (6² + 5² + 6² + 4² + 4² + 5²)/10 = 15.4.
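A sketch (illustrative) of the same computation in code; the comparison with the 95% point of χ² on 5 degrees of freedom is my addition, since no parameters were estimated here.

```python
import numpy as np
from scipy import stats

observed = np.array([16, 15, 4, 6, 14, 5])
expected = np.full(6, 10.0)

X2 = ((observed - expected)**2 / expected).sum()
df = len(observed) - 0 - 1                    # no parameters estimated from the data
print(X2, round(stats.chi2.ppf(0.95, df), 2))  # 15.4 vs 11.07: reject at the 5% level
# scipy.stats.chisquare(observed, expected) gives the same statistic plus a p-value.
```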
6.3.3 Comparing with a Parametric family of distributions
Last time we tested whether the observations came from a single distribution. In other cases we may
want to test whether the observations came from a parametric family of distributions, e.g. Poisson, or
Binomial, for some values of the parameters. Since we don’t know the parameters, we will estimate
them from the data. Let’s see an example.
We have the following observations about the number of deaths from horsekicks in different regiments of the Prussian army; that is, 0 deaths were observed 109 times, 1 death was observed 65 times, and so on.
Deaths 0 1 2 3 4 ≥5
Frequency 109 65 22 3 1 0 Total=200
Suppose we want to test whether the above observations come from a Poisson distribution with some parameter µ; that is, we want to test H0: the observations are a sample from Poisson(µ) for some µ > 0.
The first step is to estimate the parameter µ. Recall that if X ∼ Poisson(µ) then E X = µ. Therefore
we could estimate µ by the sample mean
\[
\bar{X} = \frac{X_1 + \dots + X_n}{n}.
\]
In our case we find
\[
\bar{x} = \frac{122}{200} = 0.61.
\]
Now essentially we have to test H0: the observations are a sample from the Poisson(0.61) distribution.
Just like in the previous case, we need to compare the observed counts in each category to the expected
count which we have to compute. How do we find the expected count?
If X ∼ Poisson(0.61) then we know that
\[
P(X = k) = \frac{e^{-0.61}\, 0.61^k}{k!}, \qquad k \ge 0,
\]
and thus for the first five categories we compute the expected count by multiplying the probability by the number of observations. For the last category we need to compute
\[
P(X \ge 5) = 1 - \sum_{j=0}^{4} \frac{e^{-0.61}\, 0.61^j}{j!} = 0.000424972.
\]
Therefore we have
Deaths 0 1 2 3 4 ≥5
Expected freq. 108.67 66.3 20.2 4.1 0.63 0.08
Observed freq. 109 65 22 3 1 0
Now there is a slight problem here because in the last three categories the expected count is less than
5. This means that the sample size is not large enough for our χ2 statistic to have, at least approximately,
Deaths 0 1 ≥2
Expected freq. 108.67 66.3 24.9
Observed freq. 109 65 26
the correct distribution. When this happens we should combine the categories so that each category has
expected count at least 5. In our case we will combine the last 4 categories to make a single category.
Now we calculate
\[
X^2 = \frac{(108.67 - 109)^2}{108.67} + \frac{(66.3 - 65)^2}{66.3} + \frac{(24.9 - 26)^2}{24.9} = 0.0751.
\]
Remark 58. If some categories have expected count less than 5, always combine the categories in such
a way as to make the expected count in each category of the final table at least 5.
Remark 59. When finding the number of degrees of freedom, we have to use the number of categories
of the final table.
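To close the horsekick example in code, the sketch below (illustrative) rebuilds the expected counts from Poisson(0.61), combines the sparse categories as in the table above, and uses 3 − 1 − 1 = 1 degree of freedom because µ was estimated from the data.

```python
import numpy as np
from scipy import stats

obs_full = np.array([109, 65, 22, 3, 1, 0])        # deaths 0,1,2,3,4,>=5
n, mu = obs_full.sum(), 0.61

p = stats.poisson.pmf(np.arange(5), mu)
p = np.append(p, 1 - p.sum())                      # P(X >= 5)
exp_full = n * p

# combine the last four categories so every expected count is at least 5
obs = np.array([obs_full[0], obs_full[1], obs_full[2:].sum()])
exp = np.array([exp_full[0], exp_full[1], exp_full[2:].sum()])

X2 = ((obs - exp)**2 / exp).sum()
df = len(obs) - 1 - 1                              # one parameter (mu) estimated
print(round(X2, 3), df, round(stats.chi2.ppf(0.95, df), 2))
# X2 is tiny (well below 3.84): no evidence against the Poisson model.
```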
Example 39 (Poisson again). Suppose that we are counting the number of accidents each week in a large
factory. We want to know whether accidents occur more or less at random, or whether they influence
each other. For instance, do they tend to cluster because there is some factor that makes accidents more
likely to happen? Alternatively, do they tend to anticluster, because after one accident has occurred
people are extra careful for a while?
Now if they really do occur at random, the number per week should obey a Poisson distribution
\[
f(x) = \frac{e^{-\lambda}\lambda^x}{x!}, \qquad x = 0, 1, 2, \dots
\]
where λ is the mean number of accidents per week.
Over a 50 week period (omitting the Xmas period when the factory was closed) the number of weeks
n(x) on which x accidents occurred was:
No. of accidents 0 1 2 3 4 5 6 7 8
Frequency 5 6 14 10 9 4 1 0 1
From these data we compute the mean number of accidents per week was λ̂ = 2.68. So if the
distribution were Poisson, we would expect the numbers of weeks on which x accidents occurred to
have been:
No. of accidents 0 1 2 3 4 5 6 7 8
Expected frequency 3.4 9.2 12.3 11.0 7.4 3.9 1.8 0.7 0.3
To test, we first combine cells so that none of the predicted frequencies is less than 5. That means
combining the first two cells, and also the last four. We then compute the sum of squares:
No. of accidents 0 or 1 2 3 4 5 or more
Frequency 11 14 10 9 6
Expected 12.6 12.3 11.0 7.4 6.7
which gives X² ≈ 0.95. We had to estimate one parameter from the data, so the number of degrees of freedom is 5 − 1 − 1 = 3.
From the tables we see that the 95% value is 7.81, so the result is not significant. There is no reason to
suppose that the accidents are happening other than at random.
Note that had we been testing the hypothesis “The data are from a Poisson distribution with λ = 2.68”, the number of estimated parameters would have been zero and so ν would have been 4, but that is a different problem, being a more restrictive hypothesis. The 95% value on 4 df is 9.49, which is of course larger: being allowed to fit λ from the data makes it likely that we can get the sum of squares smaller, so fewer degrees of freedom (and a smaller critical value) are appropriate.
Example 40. We often just look just at the right tail of the χ2 -distribution. But sometimes the left tail
may reveal interesting features.
Mendel crossed round yellow (RY) pea plants with wrinkled green (WG) ones. According to his
theory, there should have been progeny of all four possible combinations of the characters and they
should have occurred in the ratio RY:RG:WY:WG::9:3:3:1. This is a categorical distribution of the form
\[
P(X = \mathrm{RY}) = \frac{9}{16}, \qquad P(X = \mathrm{RG}) = P(X = \mathrm{WY}) = \frac{3}{16}, \qquad P(X = \mathrm{WG}) = \frac{1}{16}.
\]
He had 556 plants, so the expected frequencies were 312.75, 104.25, 104.25, 34.75. The observed frequencies were 315, 108, 101, 32.
We can easily do a chi-squared test on the data: the appropriate sum is X² ≈ 0.47 on 4 − 1 = 3 degrees of freedom, a very small value, which is exactly the kind of situation where the left tail is informative.
Remark 60. Notice that we can, also use the χ2 test for continuous variables by breaking the data
up into a finite number of bins. There are, however, better ways to do this. One such method is the
Kolmogorov-Smirnov test, which is based on the maximum difference between the observed and pre-
dicted cumulative frequencies, which we will try to cover if there is enough time.
A B C
Men 51 68 81 200
Women 63 141 96 300
114 209 177 500
On the basis of these data, do women have different preferences in soap from men? Therefore the null
hypothesis is
H0 : preferences are independent of gender.
Notice that if the null hypothesis is correct then the probability that an individual will prefer any one
soap is independent of whether that individual is male or female. So if the probabilities that a person
in the sample prefers A, B, C are denoted by P(A), P(B), P(C) and if the probabilities that a per-
son chosen at random is male or female are P(M ) and P(F ), respectively, we will have P(AM ) =
P(A)P (M ), and so on. If they are not independent, we will have to use conditional probabilities
P(AM ) = P(A|M )P(M ), etc.
To test the null hypothesis that the two attributes are independent we proceed as follows.
1. We estimate all the individual probabilities from the data; clearly the estimate of P(A) = 114/500 =
0.228, and so on.
2. We compute the expected numbers in each of the six classes using the simple multiplication rule
for probabilities (and of course multiplying by 500 to get expected numbers). Thus since P(M ) =
200/500 = 0.4 we have, assuming independence, P(M who prefers A) = 0.4 × 0.228 = .0912
and the expected number of males who prefer A is 45.6.
3. We then use the χ2 goodness of fit test to see whether the actual numbers are significantly different
from those we predict in this way.
A B C
Men 45.6 83.6 70.8 200
Women 68.4 125.4 106.2 300
114 209 177 500
Note that the row and column sums are not changed. We now work out the usual statistic
\[
X^2 := \sum \frac{(n_e - n_o)^2}{n_e} = 8.367.
\]
But before we can complete the test we have to know how many degrees of freedom there are.
We might as well do this for the general case. We have a 2-way contingency table, i.e. there are two
categories of classification (here which sex you are and what soap you prefer). There could be more,
and the analysis is much the same if there are. Let us suppose there are r classes in the A classification
and s classes in the B, so that every individual is to be classified as Ai ∩ Bj where i = 1, 2, . . . , r and
j = 1, 2, . . . , s. In the example, r = 2 because there are only two sexes and s = 3 because there are
three classes of soap preference.
According to the null hypothesis P(Ai ∩ Bj ) = P(Ai )P(Bj ) and we estimate each of the marginal
probabilities from the data. Note, however, that we only need to estimate r − 1 probabilities as the last
one can be computed from the rest since they add up to one. Similarly, we only have to estimate s − 1
of the P(Bj ).
There are rs cells altogether which is the number of categories. We have to estimate (r − 1) + (s − 1)
parameters, and so the number of degrees of freedom is
ν = rs − (r − 1) − (s − 1) − 1 = rs − r − s + 1 = (r − 1)(s − 1).
Hence in the example the right number of degrees of freedom is 2, and so we compare 8.367 with χ²95%,2df = 5.99 and χ²99%,2df = 9.21, so we can reject the null hypothesis. We find a lack of independence which is significant but not highly significant.
Note that this analysis doesn’t tell you anything about which soap men or women prefer. It tells you
that there appears to be a difference, but nothing about what sort of difference that is.
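scipy bundles the whole contingency-table computation (expected counts, statistic, degrees of freedom); an illustrative sketch for the soap data:

```python
import numpy as np
from scipy import stats

table = np.array([[51, 68, 81],      # men
                  [63, 141, 96]])    # women

X2, p, dof, expected = stats.chi2_contingency(table, correction=False)
print(round(X2, 3), dof, round(p, 4))   # about 8.37 on 2 df
print(expected)                         # 45.6, 83.6, 70.8 / 68.4, 125.4, 106.2
```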
Recall that
\[
\frac{\sum (x_i - \bar{x})^2}{\sigma^2}
\]
is the sum of squares of independent standard normal random variables, and that consequently, with the usual remark about n − 1 in place of n,
\[
\frac{s_1^2/\sigma^2}{s_2^2/\sigma^2} = \frac{s_1^2}{s_2^2}
= \frac{\sum (x - \bar{x})^2/(n_x - 1)}{\sum (y - \bar{y})^2/(n_y - 1)}
\]
has, under the null hypothesis of equal variances, the F-distribution with (nx − 1, ny − 1) degrees of freedom. For the petrol-consumption data of Example 37 the means are 1.044 and 1.220 and the sums of squared deviations are .0379 and .1080, which gives
\[
F = \frac{.1080/4}{.0379/4} = 2.85.
\]
According to the tables F95%,(4,4)df = 9.6 so the data do not cause us to reject the null hypothesis of
equal variances. In fact, on the null hypothesis the probability of getting a deviation as big as the one we
observed is about 1/3, so the situation is better than it probably looks. What it really amounts to is that
it isn’t easy to estimate variances accurately from small samples. The F -distribution is actually more
important in another context, that of ANOVA which we will come to later if time permits.
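An illustrative sketch of the variance-ratio test for the petrol data, using the summary sums of squares quoted above; because the larger variance is placed on top, the two-sided 5% comparison uses the upper 2.5% point of F, which is where the 9.6 quoted above comes from.

```python
from scipy import stats

ssx, ssy, nx, ny = 0.0379, 0.1080, 5, 5
F = (ssy/(ny - 1)) / (ssx/(nx - 1))              # larger variance on top: about 2.85
crit = stats.f.ppf(0.975, ny - 1, nx - 1)        # two-sided 5% -> upper 2.5% point, about 9.6
p_two_sided = 2*stats.f.sf(F, ny - 1, nx - 1)
print(round(F, 2), round(crit, 1), round(p_two_sided, 2))   # 2.85, 9.6, roughly 1/3
```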
Chapter 7
Regression
We are given some data about a dependent variable Y and some variables X which we believe determine Y. We assume that we can write
\[
Y = g(X_1, X_2, \dots, X_p).
\]
The aim is to estimate g. We usually suppose that the form of g is known, and that our task is to estimate one or more parameters, although there are techniques that help us to find g, or at least to choose between two or more possible forms. Given observations (xi, yi), i = 1, . . . , n, we model
\[
y_i = g(x_i) + \epsilon_i.
\]
Since g(x) is supposed to be a known function, though with unknown parameters, Yi is a random variable whose distribution is determined by that of ε. You can think of the Yi as having a distribution approximately centred on the curve y = g(x) and then with a random component added on.
Suppose that the εi ∼ N (0, σ²) are i.i.d., which seems appropriate, given that the normal distribution is also often called the error distribution. It is not essential that the errors be normal, as we shall see, but normality does allow us to interpret the result as a maximum likelihood estimator. This implies that E(Yi) = g(xi), which seems reasonable. The errors are thus assumed to be random rather than systematic. We suppose that the variance, σ², is unknown, but for the time being we suppose that it is constant, i.e. that it does not depend on X. This means we are assuming that the scatter is the same at all points on the graph, which clearly need not be so. Then each Yi will be normally distributed with mean g(xi) and variance σ²: that is, Yi will have probability density function
\[
f(y_i) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left[-\frac{1}{2\sigma^2}(y_i - g(x_i))^2\right].
\]
The likelihood function is then, letting θ stand for all the unknown parameters of g,
\[
L(\theta, \sigma) = \prod_{i=1}^n \frac{1}{\sigma\sqrt{2\pi}}\exp\left[-\frac{1}{2\sigma^2}(y_i - g(x_i))^2\right]
= \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^n \exp\left[-\frac{1}{2\sigma^2}\sum (y_i - g(x_i))^2\right].
\]
We now have to maximize L, and as usual we work with log L instead:
\[
\log L = -n\log\sigma - \frac{n}{2}\log 2\pi - \frac{1}{2\sigma^2}\sum (y_i - g(x_i))^2,
\]
\[
\frac{\partial \log L}{\partial \sigma} = -\frac{n}{\sigma} + \frac{1}{\sigma^3}\sum (y_i - g(x_i))^2,
\qquad
\frac{\partial \log L}{\partial \theta} = \frac{1}{\sigma^2}\sum (y_i - g(x_i))\frac{\partial g(x_i)}{\partial \theta}.
\]
As you can see from the calculations, what we are actually doing is choosing the parameters θ so as to
minimize ∑(yi − g(xi))². In other words, if the errors are distributed normally and σ² is constant (but
not otherwise!) the maximum likelihood estimates of the parameters are just the least squares estimates,
and the maximum likelihood estimate of the variance is obtained from the sum of the squares of the
deviations of the yi from their estimates based on the maximum likelihood estimates.
Let's see how this works in the simplest interesting case, the linear model Y = a + bX + ε. Assuming the εi to be normally distributed, we have to choose a and b so as to minimize
\[
\sum (y_i - a - bx_i)^2.
\]
Setting the partial derivatives with respect to a and b equal to zero gives the normal equations
\[
n\hat{a} + \hat{b}\sum x_i = \sum y_i,
\qquad
\hat{a}\sum x_i + \hat{b}\sum x_i^2 = \sum x_i y_i,
\]
whose solution can be written as
\[
\hat{b} = \frac{s_{xy}}{s_{xx}}, \qquad \hat{a} = \bar{y} - \hat{b}\bar{x},
\]
where
\[
s_{xx} := \sum (x - \bar{x})^2, \qquad
s_{yy} := \sum (y - \bar{y})^2, \qquad
s_{xy} := \sum (x - \bar{x})(y - \bar{y}).
\]
This is often useful.
Let’s start with an artificial example so we can see how the numbers go. We’re given the following
pairs of observations:
X 1 2 4 5 7
Y 3 4 5 8 8.
In fact, the calculator that is supplied for the exam will find the values of a and b for you (so you’d be
well advised to learn how to use it) but if you don’t have one like that it’s not all that hard to do it by
hand. We set up the following table:
X Y X2 Y2 XY
1 3 1 9 3
2 4 4 16 8
4 5 16 25 20
5 8 25 64 40
7 8 49 64 56
19 28 95 178 127
5â + 19b̂ = 28
19â + 95b̂ = 127.
and since 5 × 19 = 95 we can solve this easily to obtain â = 13/6 ≈ 2.17 and b̂ = .90.
We can now use this relation to predict Y for any given X. For example, if X = 6 we expect
Y = 2.17 + 6 × .90 = 7.57. Of course we really ought to round to no closer than 7.6 because we
mustn’t give answers that are to more significant figures than the data we began with. Here we allow
ourselves one more, but that is about as far as we can go.
Note that while this method works for interpolation it is less reliable for extrapolation, because we
can’t be so confident about what happens beyond the range that we have data for. So while you can work
out from these data that if X = 20 we expect Y = 2.17 + 20 × 0.9 = 20.17, I wouldn’t bet on it.
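The normal equations can of course be solved mechanically; an illustrative sketch for the artificial data above:

```python
import numpy as np

x = np.array([1, 2, 4, 5, 7], dtype=float)
y = np.array([3, 4, 5, 8, 8], dtype=float)

# normal equations:  n*a + (sum x)*b = sum y,  (sum x)*a + (sum x^2)*b = sum xy
A = np.array([[len(x), x.sum()],
              [x.sum(), (x**2).sum()]])
rhs = np.array([y.sum(), (x*y).sum()])
a, b = np.linalg.solve(A, rhs)
print(round(a, 2), round(b, 2))          # about 2.17 and 0.90
print(round(a + 6*b, 2))                 # prediction at X = 6, about 7.6
```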
Note also that we have always taken X as the predictor variable and Y as the response. We could do
it the other way around, and then by symmetry the equations will be:
\[
n\hat{\alpha} + \hat{\beta}\sum y_i = \sum x_i,
\qquad
\hat{\alpha}\sum y_i + \hat{\beta}\sum y_i^2 = \sum x_i y_i,
\]
which here is
5α̂ + 28β̂ = 19
28α̂ + 178β̂ = 127.
This gives α̂ = −1.64 and β̂ = .972. This is not the same line as before: for the two lines to coincide we would need β̂ = 1/b̂ (and α̂ = −â/b̂), which is not the case. The reason is that if we really considered Y to be the predictor, then we would be trying to minimise the sum of squares
\[
\sum (x_i - \alpha - \beta y_i)^2,
\]
and this leads to a different result. It’s not hard to see why: we are trying to minimise the sum of squares
of horizontal deviations from the line whereas before we were minimizing the sums of squares of vertical
deviations.
We can also fit a curve, for example
\[
Y = a + bX + cX^2 + \epsilon.
\]
This is still usually referred to as a linear model because it is linear in the parameters, though not in the predictor variable. The original result that the maximum likelihood estimates are the least squares estimates still applies, and it's not hard to show that the necessary equations are
\[
na + b\sum x + c\sum x^2 = \sum y, \qquad
a\sum x + b\sum x^2 + c\sum x^3 = \sum xy, \qquad
a\sum x^2 + b\sum x^3 + c\sum x^4 = \sum x^2 y.
\]
More generally, suppose that each response satisfies yj = xTj β + εj, where xTj = (xj1, . . . , xjp) is a 1 × p vector of explanatory variables associated with the j-th response yj, εj is the error in the j-th response, and βT = (β1, . . . , βp) is a vector of unknown parameters. We can write this in matrix form as
\[
y = X\beta + \epsilon, \tag{7.2.1}
\]
where y is the n × 1 vector with yT = (y1, . . . , yn),
\[
X = \begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1p} \\
x_{21} & x_{22} & \cdots & x_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{np}
\end{pmatrix},
\]
and εT = (ε1, . . . , εn) is a vector of i.i.d. N (0, σ²) random variables. The model given in (7.2.1) is called a linear regression model with design matrix X.
This section is heavily based on A. C. Davison's Statistical Models.
Under the assumption that the ε1, . . . , εn are i.i.d. N (0, σ²), the responses y1, . . . , yn are also independent normal random variables such that
\[
y_j \sim N(x_j^T\beta, \sigma^2).
\]
Notice that
\[
x_j^T\beta = (x_{j1}, \dots, x_{jp})
\begin{pmatrix}
\beta_1 \\ \beta_2 \\ \vdots \\ \beta_p
\end{pmatrix} \in \mathbb{R}.
\]
Just like before, in order to maximise the likelihood with respect to β we just have to minimise the sum of squares
\[
SS(\beta) := \sum_{j=1}^n (y_j - x_j^T\beta)^2 = (y - X\beta)^T(y - X\beta),
\]
which does not involve σ², and the minimiser is
\[
\hat{\beta} = (X^TX)^{-1}X^Ty.
\]
Since the likelihood is maximised w.r.t. β at β̂ irrespective of σ², to obtain the maximum likelihood estimator of σ² we plug β̂ into the log-likelihood and maximise w.r.t. σ², that is, we minimise
\[
-2\, l(\sigma^2, \hat{\beta}) \propto n\log\sigma^2 + \frac{1}{\sigma^2}(y - X\hat{\beta})^T(y - X\hat{\beta}),
\]
which gives σ̂² = (1/n)(y − Xβ̂)T(y − Xβ̂).
Recall that β̂ = (XTX)−1XTy, and since by the definition of the model y = Xβ + ε, we have that
\[
\hat{\beta} = (X^TX)^{-1}X^T(X\beta + \epsilon)
= (X^TX)^{-1}X^TX\beta + (X^TX)^{-1}X^T\epsilon
= \beta + (X^TX)^{-1}X^T\epsilon,
\]
which is clearly a linear combination of jointly normal random variables, and is thus normally distributed itself.
We can also compute its mean vector E(β̂) and covariance matrix Σβ̂ using the properties of multivariate normal distributions. They are given by
\[
E[\hat{\beta}] = \beta,
\]
\begin{align*}
\Sigma_{\hat{\beta}} &= \mathrm{cov}\left[(X^TX)^{-1}X^T\epsilon,\ (X^TX)^{-1}X^T\epsilon\right] \\
&= (X^TX)^{-1}X^T\,\Sigma_\epsilon\,\left((X^TX)^{-1}X^T\right)^T \\
&= \sigma^2 (X^TX)^{-1}X^T I \left((X^TX)^{-1}X^T\right)^T \\
&= \sigma^2 (X^TX)^{-1}X^TX(X^TX)^{-1} \\
&= \sigma^2 (X^TX)^{-1},
\end{align*}
where we have used Remark 23, the fact that Σε = σ²I (the identity matrix), and the facts that (X−1)T = (XT)−1, (XY)T = YTXT and (XY)−1 = Y−1X−1. Therefore we know that
\[
\hat{\beta} \sim N\left(\beta, \sigma^2(X^TX)^{-1}\right).
\]
\begin{align*}
y - X\hat{\beta} &= y - X(X^TX)^{-1}X^Ty \\
&= \left(I - X(X^TX)^{-1}X^T\right)y \\
&= \left(I - X(X^TX)^{-1}X^T\right)(X\beta + \epsilon) \\
&= X\beta - X(X^TX)^{-1}X^TX\beta + \epsilon - X(X^TX)^{-1}X^T\epsilon \\
&= X\beta - X\beta + \epsilon - X(X^TX)^{-1}X^T\epsilon \\
&= \left(I - X(X^TX)^{-1}X^T\right)\epsilon,
\end{align*}
which is again a linear combination of normal random variables and thus normally distributed itself. We can compute the mean vector and covariance matrix as follows:
\begin{align*}
E[y - X\hat{\beta}] &= 0, \\
\mathrm{var}(y - X\hat{\beta}) &= \mathrm{var}\left\{\left(I - X(X^TX)^{-1}X^T\right)\epsilon\right\} \\
&= \sigma^2 \left(I - X(X^TX)^{-1}X^T\right)\left(I - X(X^TX)^{-1}X^T\right)^T \\
&= \sigma^2 \left(I - X(X^TX)^{-1}X^T\right).
\end{align*}
Finally, since both vectors β̂ and y − Xβ̂ are linear combinations of the ε, we compute their covariance matrix as follows:
\begin{align*}
\mathrm{cov}(\hat{\beta}, y - X\hat{\beta}) &= \mathrm{cov}\left[\beta + (X^TX)^{-1}X^T\epsilon,\ \left(I - X(X^TX)^{-1}X^T\right)\epsilon\right] \\
&= (X^TX)^{-1}X^T\,\Sigma_\epsilon\,\left(I - X(X^TX)^{-1}X^T\right)^T \\
&= \sigma^2 (X^TX)^{-1}X^T\left(I - X(X^TX)^{-1}X^T\right)^T \\
&= \sigma^2 (X^TX)^{-1}X^T\left(I - X(X^TX)^{-1}X^T\right) = 0.
\end{align*}
Since β̂ and y − X β̂ are jointly normal and their covariance matrix is 0, we conclude that β̂ and y − X β̂
are independent.
To derive the distribution of the variance estimator, notice that
\begin{align*}
\epsilon^T\epsilon &= (y - X\beta)^T(y - X\beta) \\
&= \left[(y - X\hat{\beta}) + X(\hat{\beta} - \beta)\right]^T\left[(y - X\hat{\beta}) + X(\hat{\beta} - \beta)\right] \\
&= (y - X\hat{\beta})^T(y - X\hat{\beta}) + (\hat{\beta} - \beta)^TX^TX(\hat{\beta} - \beta),
\end{align*}
and hence
\[
\frac{\epsilon^T\epsilon}{\sigma^2} = \frac{1}{\sigma^2}(y - X\hat{\beta})^T(y - X\hat{\beta}) + \frac{1}{\sigma^2}(\hat{\beta} - \beta)^TX^TX(\hat{\beta} - \beta), \tag{7.3.1}
\]
since the cross terms vanish: XT(y − Xβ̂) = XTy − XTX(XTX)−1XTy = 0.
Therefore looking again at (7.3.1), the left hand side is the sum of squares of n independent standard
normal variables and therefore has the χ2 distribution with n degrees of freedom. On the other hand, on
the right hand side we have the sum of two independent variables. We know that
β̂ ∼ N (β, σ 2 (X T X)−1 ),
and thus
X(β̂ − β)/σ ∼ N (0, H),
where H := X(XTX)−1XT. Now it can easily be seen that H² = H, i.e. H is idempotent. Therefore its eigenvalues are all 0 or 1, and since tr(H) = tr((XTX)−1XTX) = tr(Ip) = p, exactly p of them are 1 and n − p of them are 0. Therefore H = ODOT, where D is diagonal with the first p diagonal entries equal to 1 and the rest 0; also, since H is symmetric, O is orthogonal. Using Theorem 22, since X(β̂ − β)/σ ∼ N (0, H) we can write X(β̂ − β)/σ = H1/2W, where W ∼ N (0, In) is a vector of i.i.d. standard normal random variables. Since H² = H we have H1/2 = H and thus X(β̂ − β)/σ = HW = ODOTW. Thus, since O is orthogonal, OTO = I and D² = D, so
\begin{align*}
\frac{1}{\sigma^2}(\hat{\beta} - \beta)^TX^TX(\hat{\beta} - \beta)
&= \left(ODO^TW\right)^T\left(ODO^TW\right) \\
&= W^TODO^TODO^TW \\
&= W^TODO^TW \\
&= (W')^TDW',
\end{align*}
where W′ = OTW ∼ N (0, In). Thus finally
\[
\frac{1}{\sigma^2}(\hat{\beta} - \beta)^TX^TX(\hat{\beta} - \beta) = \sum_{i=1}^p (w_i')^2 \sim \chi^2_p.
\]
Looking back at (7.3.1): the left-hand side is χ²n, the two terms on the right-hand side are independent, and the second is χ²p, so (y − Xβ̂)T(y − Xβ̂)/σ² ∼ χ²n−p. In particular its expectation is n − p, so
\[
S^2 := \frac{(y - X\hat{\beta})^T(y - X\hat{\beta})}{n - p}
\]
is an unbiased estimator of σ².
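An illustrative numerical sketch of these formulas: it simulates data from a known β, computes β̂ = (XTX)−1XTy, and checks that S² = (y − Xβ̂)T(y − Xβ̂)/(n − p) is close to σ²; the particular design and parameter values are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 200, 3, 2.0
beta = np.array([1.0, -2.0, 0.5])

X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])   # design matrix with intercept
eps = rng.normal(scale=sigma, size=n)
y = X @ beta + eps

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)      # (X^T X)^{-1} X^T y
resid = y - X @ beta_hat
S2 = resid @ resid / (n - p)                      # unbiased estimator of sigma^2

print(beta_hat.round(2))          # should be close to beta
print(round(S2, 2), sigma**2)     # S2 should be close to 4.0
```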
We want to predict what the population will be in the year 2020; this is extrapolation but hopefully not
too far. So we fit a straight line. To keep things in line we rescale the data so as not to get overflows in
our calculator, and we round the numbers because there’s no way our answer is going to be correct to 5
significant figures. We renumber the years 1,2,3,4,5, which will make 6 stand for 2020. We knock the
last three digits off the population, and this leaves the following data:
X 1 2 3 4 5
Y 44 56 66 88 120
The least-squares fit is Y = 19.6 + 18.4X.
To estimate the population in the year 2020 we set X = 6 and we obtain Y = 130, which makes an
estimated population of 130,000. You can see that two significant figures is enough!
It is always a good idea to look at the data instead of just plugging it into the computer and getting
numbers out. If we do that, we see that the points do not lie on a straight line at all, which is not
surprising because populations tend to increase exponentially.
This suggests that we ought to use a model of the form
\[
Y = a e^{bX} + \epsilon.
\]
Again, we can set up the relevant equations by minimizing the sum of squares
\[
\sum (y_i - ae^{bx_i})^2.
\]
Setting the partial derivatives with respect to a and b equal to zero gives
\[
\sum e^{bx_i}(y_i - ae^{bx_i}) = 0,
\qquad
\sum a x_i e^{bx_i}(y_i - ae^{bx_i}) = 0.
\]
Unlike the previous case, we can't solve these analytically, but we can find â and b̂ numerically. On the other hand, we can do things differently by working with Z = log Y, which gives us a linear model
\[
Z = a' + bX + \epsilon', \qquad a' = \log a.
\]
Fitting this by least squares gives Z = 3.52 + 0.246X, and the predicted value for X = 6 is Z = 4.99, i.e. Y = e^{4.99} ≈ 147, so the predicted population in 2020 is 147,000.
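An illustrative sketch of the log-transformed fit; np.polyfit carries out the least-squares fit of Z = log Y on X.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([44, 56, 66, 88, 120], dtype=float)   # population in thousands

b, a = np.polyfit(x, np.log(y), 1)                 # slope b and intercept a of Z = a + bX
print(round(a, 2), round(b, 3))                    # about 3.52 and 0.246
print(round(np.exp(a + 6*b)))                      # prediction for 2020 (X = 6): about 147
```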
This is almost certainly better, at least in the sense that if the past trend continues, which is an as-
sumption we have to justify but which we usually make for want of anything better, we would expect
the higher value. The advantage of transforming the data is that we can use standard programmes. The
disadvantage is that a lot of the theoretical basis becomes shaky, because if the errors in the original data
are normally distributed with constant variance, this is not true for the transformed data. That may or
may not matter, but you have to keep it in mind.
In fact, the transformation can even improve things, because if the errors are proportional to the
measurement, then taking logs makes them more equal than they were in the first place.
The covariance gives a measure of linear dependence between X and Y . If X and Y are big together
and small together, most terms in the product (X − X̄)(Y − Ȳ ) will be positive, so the sum will be large
and positive. If one is big when the other is small, most of the terms will be negative so the sum will be
large and negative. And if Y is just as likely to be big or small whatever X is, some of the terms will be
positive and some negative, so the sum should be small.
However the words “big” and “small” aren’t very precise, and since the variance of X and Y inflates
the covariance a better measure of linear dependence is given by the correlation
\[
\rho = \frac{\mathrm{cov}(X, Y)}{\sqrt{\mathrm{var}(X)\,\mathrm{var}(Y)}},
\]
which has the advantage that −1 ≤ ρ ≤ 1, so it establishes a scale. If we have a sample consisting of n pairs of (X, Y) values, we define the sample correlation coefficient
\[
r := \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}.
\]
It turns out that |r| ≤ 1, exactly as for the correlation coefficient of random variables – we have only to replace the expectation operator by summation symbols. (Check!)
The correlation coefficient expresses the correlation between two samples and hence it can hopefully
serve as an estimate of the population correlation ρ. If it is positive, then if one is bigger than average,
the other is likely to be as well. If it is negative, then if one is bigger than average, the other is likely to
be smaller. If the correlation coefficient is close to zero, then if one is bigger than average that tells us
little if anything about the other one.
Example 41. Ten strawberry plants were grown in pots in a greenhouse. Measurements were taken of
x, the level of nitrogen present in the leaf at the time of picking (ppm by weight of dry leaf matter) and
y, the crop yield in grams.
x 2.50 2.55 2.54 2.56 2.68 2.55 2.62 2.57 2.63 2.59
y 247 245 266 277 284 251 275 272 241 265
We find r ≈ 0.45.
There are a number of tests we can apply to r. We will only mention the following:
• H0 : ρ = 0; against
• H1 : ρ ≠ 0.
Define
\[
S_{xx} := \sum (X_i - \bar{X})^2, \qquad
S_{yy} := \sum (Y_i - \bar{Y})^2, \qquad
S_{xy} := \sum (X_i - \bar{X})(Y_i - \bar{Y}),
\]
so that r = Sxy/√(Sxx Syy). Also notice that when (X, Y) have a bivariate normal distribution we can write
\[
E(Y \mid X = x) = \alpha + \beta x, \qquad \beta = \frac{\sigma_Y}{\sigma_X}\rho.
\]
This shows that to test the above hypothesis we can equivalently test:
• H0 : β = 0;
• H1 : β ≠ 0.
To test whether the correlation is significantly different from 0 we compute the t-statistic
\[
t = \frac{\hat{\beta}}{S/\sqrt{S_{xx}}},
\]
where S is the estimate of the residual standard deviation from the regression of Y on X; under H0 this statistic has the t-distribution with n − 2 degrees of freedom.
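For the strawberry data the whole test takes a few lines; the sketch below (illustrative) computes r and the equivalent statistic t = r√(n − 2)/√(1 − r²) on n − 2 degrees of freedom, which is the same test as the one based on β̂.

```python
import numpy as np
from scipy import stats

x = np.array([2.50, 2.55, 2.54, 2.56, 2.68, 2.55, 2.62, 2.57, 2.63, 2.59])
y = np.array([247, 245, 266, 277, 284, 251, 275, 272, 241, 265], dtype=float)

r = np.corrcoef(x, y)[0, 1]
n = len(x)
t = r*np.sqrt(n - 2)/np.sqrt(1 - r**2)             # t on n-2 df under H0: rho = 0
print(round(r, 2), round(t, 2), round(stats.t.ppf(0.975, n - 2), 2))
# r comes out around 0.45 and t around 1.4, below the two-tailed 5% point
# of roughly 2.31, so no significant correlation at the 5% level.
# scipy.stats.pearsonr(x, y) returns r together with the two-sided p-value.
```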
7.5.0.0.1 Correlation and causality. Note that we have to be very careful with correlations. Above
all, we cannot infer cause and effect from a strong (i.e. ρ close to ±1) correlation. It is at least as
likely to be common cause, as in the good correlation between the numbers of taverns and the number
of Baptist ministers in American cities – which is actually to do with both being correlated with the
population of the city. This shows what happens if you don’t ask exactly how the research was carried
out. Another example is the strong correlation between monthly sales of ice skates in Canada and
surfboards in Australia, due to the fact that the start of winter in one corresponds to the start of summer
in the other.
What is more, correlation measures only a linear relation. Consider the following data:
X Y X2 Y2 XY
0 0 0 0 0
1 48 1 2304 48
2 64 4 4096 128
3 48 9 2304 144
4 0 16 0 0
10 160 30 8704 320
This gives x̄ = 2 and ȳ = 32, so Σ(x − x̄)(y − ȳ) = Σxy − 5 × 2 × 32 = 0 which makes ρ = 0.
Now these values are what you would get for the height of a ball thrown up with an initial speed of 64
feet/sec, taking the acceleration due to gravity as 32 ft/sec/sec. So there can be cause and effect and
yet a zero correlation, which shouldn’t surprise you because we saw earlier on that dependent random
variables can have zero correlation. What has happened is that half the time the ball was going up as
time increased and half the time it was going down as time increased. So on average it was doing neither.
That also tells you something about averages: the fact that the average speed was zero conceals the fact
that except for one brief instant when it was at its highest point, the ball was in fact moving.
or equivalently
\[
r = \frac{\sum_{i=1}^n x_i y_i - \frac{1}{n}\left[\sum_{i=1}^n x_i\right]\left[\sum_{i=1}^n y_i\right]}
{\sqrt{\sum_{i=1}^n x_i^2 - \frac{1}{n}\left(\sum x_i\right)^2}\ \sqrt{\sum_{i=1}^n y_i^2 - \frac{1}{n}\left(\sum y_i\right)^2}}.
\]
Let R(xi) be the rank of the i-th x observation relative to the x sample, and R(yi) similarly w.r.t. the y sample. The Spearman rank correlation coefficient is calculated by substituting R(xi) and R(yi) for xi and yi in the usual formula for the correlation. That is,
\[
r_S = \frac{\sum_{i=1}^n R(x_i)R(y_i) - \frac{1}{n}\left[\sum R(x_i)\right]\left[\sum R(y_i)\right]}
{\sqrt{\sum R(x_i)^2 - \frac{1}{n}\left(\sum R(x_i)\right)^2}\ \sqrt{\sum R(y_i)^2 - \frac{1}{n}\left(\sum R(y_i)\right)^2}}. \tag{7.6.1}
\]
The correlation coefficient we have just been working with requires quantitative data, but there may also
be situations in which we can only rank the data. In the case of quantitative data, this provides a non-
parametric test for association between two random variables which is therefore more robust in cases
where the data fails to be close to normally distributed.
In the case where there are no ties in either the x or y observations, that is, the x's are all distinct and the y's are all distinct, it can be shown by simple algebra that
\[
r_S = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)},
\]
where di ≡ R(xi) − R(yi) is the difference between the two ranks. This makes sense intuitively: we are working with the differences between the
rankings and if there were complete consistency the rankings would be identical and all the differences
would be zero. If on the other hand, there is little consistency, then the differences would be large.
Suppose, for example that we are given the following pairs:
x 10 14 6 1 7 4 9 13 2 12 5 11 15 3 8
y 4 11 5 6 12 1 14 10 7 13 2 15 9 3 8
d 6 3 1 -5 -5 3 -5 3 -5 -1 3 -4 6 0 0.
As a check, the differences should sum to zero. Here Σd² = 226 and 15 × (15² − 1) = 3360, so rS = 1 − (6 × 226)/3360 = 1 − 0.4036 = 0.5964.
There are tables for critical values of rS, but the Cambridge tables give values for S = Σd². Here we have S = 226 and n = 15, so the 5% critical value (we were expecting a positive correlation so this is a 1-tail test) is 310. Our value is less than that, so the rank correlation is significantly greater than 0.
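An illustrative sketch checking the rank-correlation arithmetic; for data that are already ranks, as here, scipy's spearmanr agrees with the 1 − 6Σd²/n(n² − 1) formula.

```python
import numpy as np
from scipy import stats

x = np.array([10, 14, 6, 1, 7, 4, 9, 13, 2, 12, 5, 11, 15, 3, 8])
y = np.array([4, 11, 5, 6, 12, 1, 14, 10, 7, 13, 2, 15, 9, 3, 8])

d = x - y
n = len(x)
rs = 1 - 6*(d**2).sum()/(n*(n**2 - 1))     # no-ties formula
rho, p = stats.spearmanr(x, y)
print(round(rs, 4), round(rho, 4))         # both 0.5964
```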
Note that there is also Kendall's rank correlation, so don't confuse the two. Also, the right-hand column of the table tells you what to divide S by to get rS.
Testing for a significant negative correlation is easy using rS, as is testing for a significant correlation regardless of sign. If the critical value of rS for a positive correlation is R, so that we reject the null hypothesis for rS ≥ R, then if we are testing for a negative correlation we reject H0 for rS ≤ −R.
Thus if we are testing for a significant positive correlation we reject H0 for
\[
1 - \frac{6\sum d^2}{n(n^2 - 1)} \ge R,
\qquad\text{i.e.}\qquad
\sum d^2 \le \frac{1}{6}n(n^2 - 1)(1 - R).
\]
On the other hand, if we are testing for a significant negative correlation, we reject H0 for
\[
1 - \frac{6\sum d^2}{n(n^2 - 1)} \le -R,
\qquad\text{i.e.}\qquad
\sum d^2 \ge \frac{1}{6}n(n^2 - 1)(1 + R).
\]
So if the critical value for Σd² for testing for a significant positive correlation is X (we reject for Σd² < X), then to test for a significant negative correlation we reject for
\[
\sum d^2 \ge \frac{1}{3}n(n^2 - 1) - X.
\]
This is stated, though not very clearly, in the Cambridge Tables.
Note that if two rankings are positively correlated, then if you take one of them in reverse order, the correlation will be negative.