
PROBABILITY AND STATISTICS II

George Deligiannidis
Module Lecturer 2020/21: Kalliopi Mylona

August 11, 2020


Contents

1 Introduction
1.1 Review of Hypothesis Testing
1.2 The z-test for the mean of a population
1.2.1 The Normal Density
1.2.2 The z-test

2 Bivariate distributions
2.1 Review of Continuous Random Variables
2.2 Review of Discrete Bivariate Distributions
2.2.1 Conditional Probability
2.3 Continuous Bivariate distributions
2.3.1 Multivariate
2.4 Expectation and friends
2.4.1 A few useful Properties
2.5 Conditional Distributions
2.5.1 Uniform Bivariate Distributions
2.6 Moments and Moment Generating Functions

3 Exponential Densities
3.1 Review of the Normal Distribution
3.1.1 Multivariate Normal Distribution
3.2 Poisson process
3.3 Exponential Distribution
3.4 The Gamma Distribution
3.5 Chi-squared
3.6 The F-distributions
3.7 Student’s t-distribution
3.7.1 Testing variances
3.8 The Beta Density

4 Estimation
4.1 The Law of Large Numbers
4.1.1 The Law of Large Numbers
4.2 Efficiency
4.3 Maximum Likelihood Estimates
4.4 The Cramer-Rao Inequality
4.5 Sufficiency
4.5.1 Sufficient Statistics
4.5.2 Efficient statistics are sufficient
4.5.3 Examples of Sufficient Statistics
4.6 The Method of Moments

5 Theory of Testing
5.1 The Central Limit Theorem
5.2 Significance Testing
5.2.1 Hypothesis Testing
5.2.1.1 One-sided vs two-sided alternatives
5.2.2 Types of errors
5.3 The Neyman-Pearson Lemma
5.4 Uniformly most powerful
5.5 Likelihood Ratio Test

6 Tests
6.1 The z-test
6.1.1 Population mean
6.1.2 Testing for proportions
6.1.3 Two Sample Tests
6.2 The t-test
6.2.1 Student’s t-test
6.2.2 Two Sample t-test
6.2.2.1 Paired Sample
6.2.2.1.1 Paired-sample t-test
6.2.2.2 Confidence Intervals
6.2.3 When to use z, when t?
6.3 The χ2-test
6.3.1 Goodness of Fit
6.3.2 Comparing with fixed distribution
6.3.3 Comparing with a Parametric family of distributions
6.3.4 Contingency Tables
6.3.5 Comparing Two Variances

7 Regression
7.1 A Single Explanatory Variable
7.2 More Complicated Regression Models
7.2.1 Multiple Regression
7.3 Distribution of the Regression Parameters
7.4 Nonlinear Models
7.5 Tests for correlation
7.5.0.0.1 Correlation and causality
7.6 Spearman’s Rank Correlation Coefficient

Housekeeping

These notes are based on the notes kindly provided to me by Prof. Peter Saunders and Dr. George Deligiannidis, who both taught the course before. I would appreciate it if you would point out to me any typos you spot ([email protected]).
Chapter 1

Introduction

1.1 Review of Hypothesis Testing


Suppose that you go into a magic store to buy a biased die, for which it is claimed that even numbers are more likely than odd numbers, or in other words P(even) > 1/2. You want to test the claim before parting with your money, and thus you decide to conduct an experiment. You roll the die 30 times and record the results. Suppose you observe an even number 21 times. Is this enough to conclude that the claim is
indeed true?
Let’s remind ourselves how hypothesis testing is formalised. A statistical test usually consists of:

Null Hypothesis H0: the observation is entirely due to chance, that is P(even) = 1/2;

Alternative Hypothesis H1: the die is indeed biased as claimed, P(even) > 1/2;

Test statistic: a quantity, usually a number, that can be computed from the sample and is informative about the hypothesis being tested. In our case this is the number of times we observed an even number, or the average. We must know the distribution of the statistic under the null hypothesis.

Rejection region: we have to decide which values of the test statistic are too extreme given the null hypothesis.

To specify the rejection region we need a probabilistic model under the null hypothesis. Notice that in our setup we do have such a model: we assume that the probability of observing an even number is precisely 1/2.
We also have to specify the significance level of the test, α, which is usually a small probability such as .01, .05 or .1. The significance level is chosen by the statistician, always before looking at the data, and it specifies how strict we are in assessing the likelihood of the observation under the null hypothesis. It determines the rejection region, in the sense that we reject for a range of values that has probability (at most) α of occurring under the null hypothesis.
In our case, notice that under the null hypothesis, every time you roll the die you observe an even number with probability 1/2. You roll the die 30 times, and thus under the null hypothesis

Y = number of even numbers observed ∼ Bin(30, 1/2)

has the Binomial distribution, that is

P(Y = k) = \binom{30}{k} \frac{1}{2^{30}},   0 ≤ k ≤ 30.

Let’s say that we fix the significance level at 5%. We can compute
P(Y ≥ 21) = .0214, P(Y ≥ 20) = .0494, P(Y ≥ 19) = .100,

and thus it makes sense to set the rejection region to be {y : y ≥ 20}. Since we observed 21 even
numbers, we conclude that at the 5% level, there is enough evidence to reject the null hypothesis that
P(even) = 1/2.
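For reference, these binomial tail probabilities are easy to reproduce numerically. The following is a minimal Python sketch (assuming scipy is installed; it is not part of the original notes):

from scipy.stats import binom

n, p = 30, 0.5
# P(Y >= k) for the thresholds considered above; binom.sf(k - 1, n, p) = P(Y >= k).
for k in (21, 20, 19):
    print(f"P(Y >= {k}) = {binom.sf(k - 1, n, p):.4f}")
# Prints approximately 0.0214, 0.0494 and 0.1002, matching the values used
# to choose the rejection region {y : y >= 20} at the 5% level.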

1.2 The z-test for the mean of a population


In the above test we only had to use the Binomial distribution. Although the calculations involved can be tricky, they can always be done exactly by a computer. There are, however, situations where the observations do not take values in a discrete set.
We will now briefly review the so-called z-test, which is based on the normal, or Gaussian, density.

1.2.1 The Normal Density


This density is the familiar bell curve, and a great many quantities have approximately this density.

Figure 1.2.1: The density of the standard normal distribution.

It arises in particular when the variations of individuals from their mean are effectively the sum of a large number of independent random fluctuations which are equally likely to be in either direction. Height, for example, is distributed approximately normally, presumably because there are lots of genetic and environmental factors, each of which has a small effect; the sum is approximately normal even if the individual effects are not.
This is due to the central limit theorem, as we shall see later in the course, which gives further justification for using the normal density. On the other hand, there are lots of densities that are not normal – incomes, for example (because a relatively small number of people earn so much more than everyone else), or the number of hours of sunshine per day in many countries (because most days are either completely sunny or completely cloudy). You can draw some very incorrect conclusions if you do your calculations assuming normality when that’s not justified.
Aside. This is a major issue in statistics. There are always a lot of assumptions involved that no one
states explicitly – that the density is normal, that the sample is random, that the sample is independent
and so on. If these assumptions fail, some or all of our conclusions may also be wrong. The job of
the statistician is not to apply a formula, or click the right menu item in SPSS or any other statistical
software, but rather to be able to judge whether a particular test is the appropriate one for the particular
dataset, model etc. Therefore a very important aspect of statistics is to have an excellent understanding
of the assumptions behind each test and statistical method.

1.2.2 The z-test
Suppose that you work for a pharmaceutical company which has devised a new drug for lowering blood
pressure. You want to test whether the drug is effective or not. You run an experiment and you measure
the change in blood pressure (in some units) after giving a low dose of the drug to 10 patients:

−1.9480, −0.6951, −0.0777, 1.7855, −1.4669, −0.6127, −0.8825, −2.7330, −1.2390, −2.5947.

Suppose that the measurements above arise from a normal distribution with unknown mean µ and known variance σ² = 1.
First of all we set the significance level, based on how risky a wrong decision would be. Let’s say we
set it at α = .05.
Recall that the normal distribution with mean µ and variance σ² has probability density function

f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{1}{2}\left( \frac{x - \mu}{\sigma} \right)^2 \right).

In terms of the parameters of the underlying distribution, the hypothesis that the drug is effective is equivalent to µ < 0. This will be our alternative hypothesis, which we will test against the null hypothesis, which postulates that µ = 0 and that the fluctuations we observed were entirely due to chance.
Intuitively, we should reject the null hypothesis if the value we observe is too negative, because such a value would be very unlikely under the null hypothesis. Let’s see how to quantify this statement.
Recall from last year that if X1, . . . , Xn ∼ N(µ, σ²) are independent then

X_1 + \cdots + X_n \sim N(n\mu, n\sigma^2),

and since for all random variables X and constants c ≠ 0 we have var(X/c) = var(X)/c², we observe that

\frac{X_1 + \cdots + X_n}{n} \sim N\left( \mu, \frac{\sigma^2}{n} \right).

It is always useful to standardise (subtract the mean and divide by the standard deviation), since statistical tables are usually only provided for the standard normal distribution, and therefore we define the quantity

Z := \frac{\sqrt{n}}{\sigma}\left[ \frac{X_1 + \cdots + X_n}{n} - \mu \right] \sim N(0, 1).
This is the z-statistic and is the basis of the z-test.
Let’s compute the z-statistic of our data. Of course we have assumed that our observations are a sample from N(µ, 1), and thus µ is unknown. So which µ do we use to compute the z-statistic? We use the mean (or, in general, the parameter of interest) specified by the null hypothesis, in this case µ = 0. Therefore our z-statistic is

z = −1.0464 × √10 = −3.3090.

Next, let’s briefly think about what the rejection region should be. We are testing µ = 0 against µ < 0. We are looking for evidence against the null hypothesis and in support of the alternative, therefore we should reject the null hypothesis if the z-statistic turns out to be too negative (convince yourself that we are not looking for large positive values). Thus our rejection region will have the form (−∞, −z_α], where

\int_{-\infty}^{-z_\alpha} \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\,dx = \alpha.

To find z_α we check any statistical table we have access to, and we find z_α ≈ 1.645. Therefore the rejection region is (−∞, −1.645], which contains our z-statistic, and so we reject the null hypothesis.
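As a sanity check, the whole calculation (sample mean, z-statistic, critical value and p-value) can be reproduced in a few lines of Python. This is a minimal sketch assuming numpy and scipy are available; it is not part of the original notes.

import numpy as np
from scipy.stats import norm

x = np.array([-1.9480, -0.6951, -0.0777, 1.7855, -1.4669,
              -0.6127, -0.8825, -2.7330, -1.2390, -2.5947])
n, sigma, mu0, alpha = len(x), 1.0, 0.0, 0.05

z = np.sqrt(n) * (x.mean() - mu0) / sigma   # z-statistic under H0: mu = 0
z_crit = norm.ppf(alpha)                    # left-tail critical value, about -1.645
p_value = norm.cdf(z)                       # P(Z <= z) under H0, about 0.0005

print(z, z_crit, p_value)                   # z is about -3.31, well inside (-inf, -1.645]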

z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.5000 5040 5080 5120 5160 5199 5239 5279 5319 5359
0.1 0.5398 5438 5478 5517 5557 5596 5636 5675 5714 5753
0.2 0.5793 5832 5871 5910 5948 5987 6026 6064 6103 6141
0.3 0.6179 6217 6255 6293 6331 6368 6406 6443 6480 6517
0.4 0.6554 6591 6628 6664 6700 6736 6772 6808 6844 6879
0.5 0.6915 6950 6985 7019 7054 7088 7123 7157 7190 7224
0.6 0.7257 7291 7324 7357 7389 7422 7454 7486 7517 7549
0.7 0.7580 7611 7642 7673 7704 7734 7764 7794 7823 7852
0.8 0.7881 7910 7939 7967 7995 8023 8051 8078 8106 8133
0.9 0.8159 8186 8212 8238 8264 8289 8315 8340 8365 8389
1.0 0.8413 8438 8461 8485 8508 8531 8554 8577 8599 8621
1.1 0.8643 8665 8686 8708 8729 8749 8770 8790 8810 8830
1.2 0.8849 8869 8888 8907 8925 8944 8962 8980 8997 9015
1.3 0.9032 9049 9066 9082 9099 9115 9131 9147 9162 9177
1.4 0.9192 9207 9222 9236 9251 9265 9279 9292 9306 9319
1.5 0.9332 9345 9357 9370 9382 9394 9406 9418 9429 9441
1.6 0.9452 9463 9474 9484 9495 9505 9515 9525 9535 9545
1.7 0.9554 9564 9573 9582 9591 9599 9608 9616 9625 9633
1.8 0.9641 9649 9656 9664 9671 9678 9686 9693 9699 9706
1.9 0.9713 9719 9726 9732 9738 9744 9750 9756 9761 9767

z 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

2.0 0.9772 9821 9861 9893 9918 9938 9953 9965 9974 9981
3.0 0.9987 9990 9993 9995 9997 9998 9998 9999 9999 9999

(For these last two rows the columns are in steps of 0.1.)

Table 1.1: The standard normal table. The probability given refers to the shaded regions in Figure 1.2.2.
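In practice the table values can also be reproduced directly in software. As a minimal sketch (assuming scipy; not part of the original notes), Φ(z) and its inverse are available as norm.cdf and norm.ppf:

from scipy.stats import norm

print(norm.cdf(1.64))   # about 0.9495, as in the table row z = 1.6, column 0.04
print(norm.cdf(1.96))   # about 0.9750
print(norm.ppf(0.95))   # about 1.6449, the critical value used in the z-test above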


Figure 1.2.2: The normal density. The shaded region has probability P .

In fact, if the null hypothesis were true, the probability of observing a value less than or equal to our z-statistic would be about .0005; this is the so-called p-value.
Suppose now that instead our observations are

−0.9320, 1.2821, −1.2757, −0.8838, −1.1120, −1.5742, −0.5922, −0.6741, 1.1301, −0.6490.

In this case the z-statistic gives z = −1.6700, which is much closer to zero than before. It still lies in the rejection region, but intuitively the evidence against the null hypothesis is much weaker than before. One way to quantify this is to state the p-value, the probability of obtaining a test statistic at least as extreme as the observed one (given that H0 is true); in this case this is the probability of observing a z-statistic less than −1.67, which we find from the table to be .0475, or 4.75%. Reporting the p-value is a more informative way of assessing hypotheses, and this approach is often known as significance testing. One advantage is that you don’t have to specify a significance level a priori, a choice which is sometimes rather arbitrary.
In the last example we assumed that the observations are coming from a normal distribution. We used
this to deduce that the z-statistic is also normally distributed, so that we knew its distribution under the
null hypothesis.
It will often be the case that our observations themselves are not normally distributed; for example, consider the following 100 i.i.d. observations from some distribution with mean 1 and variance 1.

0.2851, 1.1475, 0.4602, 1.6923, 1.4971, 0.4663, 1.2907, 0.0109, 1.9815, 0.0560
1.4755, 0.2051, 1.0354, 0.3205, 0.9847, 3.8328, 1.3918, 2.7039, 1.5242, 0.3906
2.7460, 0.2369, 0.0030, 0.9933, 2.4361, 0.0937, 0.7949, 0.0625, 1.7030, 0.0118
0.2648, 0.1599, 1.4953, 0.1725, 0.4461, 0.2224, 1.4797, 4.0076, 3.1744, 0.2655
0.3987, 0.6819, 0.4270, 0.3089, 1.7114, 0.2932, 0.2176, 0.3800, 2.2355, 1.0886
0.3352, 0.4531, 0.5025, 0.5603, 3.1000, 0.2069, 0.0140, 0.2437, 0.4838, 0.4119

0.4431, 0.0504, 0.9487, 1.7324, 0.3241, 0.9589, 3.5068, 0.6271, 0.0622, 1.4099,
0.8698, 0.8120, 1.9506, 0.0436, 1.0572, 0.4824, 0.6242, 0.1218, 1.0372, 1.2191
0.9397, 2.8131, 3.6835, 1.3268, 0.4146, 0.5525, 2.4410, 0.1065, 0.8901, 0.3854
0.2032, 0.1430, 0.8649, 0.0784, 0.9575, 0.6348, 0.2205, 0.4685, 0.0158, 0.6389.

Figure 1.2.3: The histogram of the observations against the normal density with the correct mean and variance.

If you look at the histogram it looks decidedly non-normal; in particular it is one-sided, and all observations are non-negative. Suppose however that you repeat the experiment 1000 times, and each time you compute the z-statistic. The histogram of the computed statistics will look like Figure 1.2.4, which now looks approximately normal. This is a very important feature, and it is due to the Central Limit Theorem, which states essentially that the z-statistic of a large enough i.i.d. sample from any reasonably well-behaved distribution will be approximately normally distributed. This tells us that if the sample is large, then we know, at least approximately, the distribution of the statistic under the null hypothesis.
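The experiment described here is easy to simulate. The following minimal numpy sketch (not from the notes) draws 1000 samples from an Exponential(1) distribution, which has mean 1 and variance 1, and computes the z-statistic of each; a histogram of the results reproduces the approximately normal shape of Figure 1.2.4. The sample size of 100 matches the data listed above, but is my assumption, since the notes do not state the size used for the figure.

import numpy as np

rng = np.random.default_rng(0)
n, reps = 100, 1000          # sample size and number of repetitions
mu, sigma = 1.0, 1.0         # mean and standard deviation of Exponential(1)

samples = rng.exponential(scale=1.0, size=(reps, n))
z = np.sqrt(n) * (samples.mean(axis=1) - mu) / sigma

# The z-statistics are approximately N(0, 1) even though the data are exponential:
print(z.mean(), z.std())     # close to 0 and 1 respectively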
This is very important for hypothesis testing. In general we will always need to know the distribution
of the statistic under the null hypothesis at least approximately. There are many tests where this is
possible, but before we learn about them, we have to review some of the basic facts from last year and
build up our arsenal of tools.

Figure 1.2.4: The histogram of 1000 z-statistics from an exponential distribution with mean 1 and
variance 1.

Chapter 2

Bivariate distributions

2.1 Review of Continuous Random Variables


We briefly review continuous random variables and their properties. These are random variables that
can take any value in some interval, and thus take values in an uncountable set, or a continuum.

Definition 1 (Distribution Function). Let X be a random variable. The distribution function of X, which we denote by F_X : R → [0, 1], is the function F_X(x) = P(X ≤ x) for x ∈ (−∞, ∞).
The distribution function is also called the Cumulative Distribution Function.

We also recall the basic properties of distribution functions.

Theorem 1 (Properties of Distribution Functions). Let X be a random variable and F(x) := P(X ≤ x). Then:

1. \lim_{x\to-\infty} F(x) = 0;

2. \lim_{x\to\infty} F(x) = 1;

3. F(x) is monotone increasing: if x_1 ≤ x_2 then F(x_1) ≤ F(x_2);

4. F(x) is right continuous, that is \lim_{y \downarrow x} F(y) = F(x).

Whether a random variable is discrete or continuous can be deduced from the properties of the Distri-
bution Function. In fact we have the following definition.

Definition 2 (Continuous Random Variable). A random variable X is continuous if there is a function


f : R → [0, ∞), called the probability density function, such that for all x

F(x) = \int_{-\infty}^{x} f(y)\,dy.

In fact for any a ≤ b we have

P(a ≤ X ≤ b) = \int_{a}^{b} f(u)\,du.

If X is a continuous random variable, then its distribution function is continuous. Also notice that

P(X = a) = P(a ≤ X ≤ a) = \int_{a}^{a} f(u)\,du = 0.

The Fundamental Theorem of Calculus gives us the following result.

Theorem 2. If X is a continuous random variable with pdf f, and f is continuous at x, then

f(x) = F'(x) = \lim_{h \to 0} \frac{F(x + h) - F(x)}{h}.

To get a feel for the pdf, consider the approximation (for small h)

P(X ∈ [x, x + h]) ≈ h × f(x).

Probability density functions have the following properties.

Theorem 3 (Properties of PDF). The pdf f of a continuous random variable satisfies the following:

1. f ≥ 0;

2. \int_{-\infty}^{\infty} f(u)\,du = 1;

3. for all y,
   F(y) = \int_{-\infty}^{y} f(u)\,du;

4. for all a < b,
   P(a < X < b) = \int_{a}^{b} f(u)\,du.

The last two properties are fundamental for continuous random variables. While for discrete variables,
the probability mass function is an actual probability of an event, for continuous random variables, the
probability density function is not a probability. The integral of the probability density function over a
set however is a probability.

Theorem 4 (Change of Variables). Let X be a continuous random variable with probability density function f_X. Let g : supp(f_X) → R be strictly monotone and differentiable, where

supp(f_X) = {x ∈ R : f_X(x) > 0},

and define Y = g(X). Then Y is a continuous random variable and its probability density function is given by

f_Y(y) = f_X(g^{-1}(y)) \left| \frac{d}{dy} g^{-1}(y) \right|.

Proof. We will compute the density by differentiating the distribution function. Suppose first that g is strictly increasing. Then

F_Y(y) = P[Y ≤ y] = P[g(X) ≤ y] = P[X ≤ g^{-1}(y)] = F_X(g^{-1}(y)).

Differentiating we find that

f_Y(y) = F_X'(g^{-1}(y))\,\frac{d}{dy} g^{-1}(y) = f_X(g^{-1}(y))\,\frac{d}{dy} g^{-1}(y).

The proof is similar for strictly decreasing g.
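As a quick numerical illustration of Theorem 4 (my own example, not from the notes), take X ∼ N(0, 1) and the strictly increasing map g(x) = exp(x); the theorem then says Y = exp(X) has density f_Y(y) = φ(log y)/y. The following sketch, assuming numpy and scipy, compares this with a Monte Carlo estimate.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
y = np.exp(rng.normal(size=1_000_000))        # Y = g(X) with g(x) = exp(x), X ~ N(0, 1)

# Change of variables: f_Y(y) = f_X(g^{-1}(y)) |d g^{-1}(y)/dy| = phi(log y) / y
for point, width in [(0.5, 0.02), (1.0, 0.02), (2.0, 0.02)]:
    empirical = np.mean(np.abs(y - point) < width / 2) / width   # local density estimate
    predicted = norm.pdf(np.log(point)) / point
    print(point, empirical, predicted)        # the two estimates agree closely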

2.2 Review of Discrete Bivariate Distributions
Last year you saw discrete bivariate probability distributions. If X and Y are two discrete random
variables taking values in a discrete set S, then we are often interested in their joint behaviour; we can
define their joint probability mass function

p(x, y) = P(X = x, Y = y), x, y ∈ S,

where p(x, y) ≥ 0 and \sum_x \sum_y p(x, y) = 1. We then define the marginal distributions

p_X(x) = \sum_y p(x, y),    p_Y(y) = \sum_x p(x, y).

These are the probability mass functions of X and Y separately. They are called marginal distributions
because if you write the values of the function p(x, y) as entries in a rectangular array, the sums go into
the bottom and right hand margins.

Definition 3 (Independence). Two discrete random variables X, Y , with joint probability mass function
p(x, y) are independent if

p(x, y) = pX (x) × pY (y), for all x and y.

Two variables are independent if "the joint is the product of the marginals".

Example 1. Suppose that you roll two dice and define X, Y to be the two outcomes. Then S =
{1, 2, 3, 4, 5, 6} and for any x, y ∈ S
p(x, y) = \frac{1}{36} = \frac{1}{6} \times \frac{1}{6} = p_X(x)\,p_Y(y).
Clearly, and in agreement with intuition, the two dice are independent.

Example 2. On the other hand suppose that you have an urn with 5 black balls and 3 red balls. We
sample two balls without replacement. Let X, Y ∈ {0, 1}, where X = 0 if the first ball is red and
X = 1 if the first ball is black, and similarly for Y and the second ball. Then we can record the joint
probability distribution as follows:

X\Y        0        1      p_X(x)
0        6/56    15/56    21/56
1       15/56    20/56    35/56
p_Y(y)  21/56    35/56

It is easy to see that in this case the two variables are not independent.

2.2.1 Conditional Probability


When two variables are not independent, then knowing the value of one should reveal information about
the other. This is quantified through conditional distribution.
Recall from last year that if A and B are two events, that is two subsets of the sample space S, such that P(B) > 0, then the conditional probability of A given B is

P(A | B) = \frac{P(A ∩ B)}{P(B)}.
Also recall Bayes’ theorem

Theorem 5 (Bayes’ Theorem). For any two events A and B such that P(A), P(B) > 0 we have

P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}.    (2.2.1)

and the more general version

Theorem 6. Let A be an event, and B_1, \ldots, B_n be a partition of S, i.e. B_i ∩ B_j = ∅ for i ≠ j and ∪_i B_i = S. Then

P(B_i | A) = \frac{P(A | B_i)\,P(B_i)}{\sum_j P(A | B_j)\,P(B_j)}.

When given two discrete random variables X and Y and their joint probability mass function p(x, y),
we can define the conditional probability mass function of X given Y .

Definition 4. Let X, Y be two random variables with joint probability mass function p(x, y). Then the
conditional probability mass function of X given Y = y is defined as

p(x|y) = \frac{p(x, y)}{p_Y(y)} = \frac{P(X = x, Y = y)}{P(Y = y)}.

This is the standard conditional probability of the event A = {X = x} given the event B = {Y = y}.
For Example 2 the conditional probability distribution of X given Y = 1 satisfies

p_{X|Y}(0|1) = \frac{p(0, 1)}{p_Y(1)} = \frac{15/56}{35/56} = \frac{3}{7},

which is not the same as the marginal p_X(0) = 21/56.


Of course if X and Y are independent then the conditional probability mass function is just the
marginal.

2.3 Continuous Bivariate distributions


For continuous bivariate distributions, we can no longer work directly with probabilities. This is because,
if X and Y are two continuous random variables, then as explained earlier it does not make sense to talk
about the probability that X and Y take any particular value as P(X = x) = 0 for any x.
Similarly to univariate continuous distributions, the most basic object will be the joint distribution
function.

Definition 5 (Joint Distribution Function). Given two random variables, X and Y their joint distribution
function is the function F : R × R → [0, 1] given by

F (x, y) := P(X ≤ x, Y ≤ y), x, y ∈ R.

Next we define jointly continuous random variables and in the process we also define the joint proba-
bility density function.

Definition 6 (Jointly Continuous Random Variables). Two continuous random variables X and Y are
jointly continuous if there is a function fX,Y : R2 → [0, ∞) called the joint probability density function
such that

P(X ≤ s, Y ≤ t) = \int_{-\infty}^{s} \int_{-\infty}^{t} f_{X,Y}(x, y)\,dy\,dx.

A joint density function f(x, y) is necessarily non-negative and normalised:

\iint_{\mathbb{R}^2} f(x, y)\,dx\,dy = 1.

Also, if X, Y are jointly continuous, then

f_{X,Y}(x, y) = \frac{\partial^2}{\partial x \partial y} F_{X,Y}(x, y),

whenever the partial derivative is defined.
Notice that two continuous random variables do not necessarily have to be jointly continuous.

Example 3. Let X ∼ U[0, 1], a uniform random variable on the interval [0, 1]; in other words let X have pdf f_X(x) = 1 for x ∈ [0, 1] and f_X(x) = 0 otherwise. Let Y = X. Then both X and Y are continuous random variables. However they are not jointly continuous. Their joint distribution function is given for x, y ∈ [0, 1] by

F(x, y) = P(X ≤ x, Y ≤ y) = P(X ≤ min{x, y}) = min{x, y}.

If X, Y were jointly continuous then their density f would have to be

f(x, y) = \frac{\partial^2}{\partial x \partial y} \min\{x, y\} = 0,

which of course makes no sense.

Theorem 7 (Properties of Joint Distribution Functions). If X, Y have joint distribution function F then

1. For all y,
   \lim_{x\to-\infty} F(x, x) = \lim_{x\to-\infty} F(x, y) = \lim_{x\to-\infty} F(y, x) = 0.

2. \lim_{x\to\infty} F(x, x) = 1;

3. For all x, y, \lim_{x\to\infty} F(x, y) = F_Y(y) and \lim_{y\to\infty} F(x, y) = F_X(x), where
   F_X(x) := P(X ≤ x),   F_Y(y) := P(Y ≤ y).

4. If x_1 ≤ x_2 and y_1 ≤ y_2 then
   F(x_2, y_2) − F(x_1, y_2) − F(x_2, y_1) + F(x_1, y_1) ≥ 0.

Definition 7 (Marginal Densities). Let X, Y be jointly continuous with pdf f (x, y). Then X and Y are
continuous random variables. The individual probability density functions of X and Y are called the
marginal densities of X and Y respectively and are given by
f_X(x) = \int_{-\infty}^{\infty} f(x, y)\,dy,    f_Y(y) = \int_{-\infty}^{\infty} f(x, y)\,dx.

Definition 8 (Independent Random Variables). Two random variables X, Y with joint distribution func-
tion FX,Y and marginal distribution functions FX , FY are said to be independent if for all x, y

FX,Y (x, y) = FX (x) × FY (y).

When the variables are jointly continuous, independence is more conveniently checked using the joint
density function.

Theorem 8. Let X, Y be jointly continuous random variables with joint density fX,Y (x, y). Then X
and Y are independent if and only if there exist functions h, g : R → [0, ∞) such that

fX,Y (x, y) = h(x)g(y), for all x, y.

Proof. "⇒". Let f_X and f_Y be the marginals. Then define

\tilde F(x, y) := \int_{-\infty}^{x} \int_{-\infty}^{y} f_X(r) f_Y(s)\,ds\,dr.

It is easy then to see that

\tilde F(x, y) = \int_{-\infty}^{x} f_X(r)\,dr \int_{-\infty}^{y} f_Y(s)\,ds = F_X(x) F_Y(y) = F_{XY}(x, y),

and we conclude that f_{X,Y} = f_X f_Y, at least in the sense that for all x, y the integrals

\int_{-\infty}^{x} \int_{-\infty}^{y} f_{X,Y}(r, s)\,dr\,ds = \int_{-\infty}^{x} \int_{-\infty}^{y} f_X(r) f_Y(s)\,dr\,ds

agree.
"⇐". Let c_g := \int g(y)\,dy and c_h := \int h(x)\,dx. First of all notice that

f_X(x) = \int f_{X,Y}(x, y)\,dy = h(x) \int g(y)\,dy = c_g h(x),

and similarly

f_Y(y) = \int f_{X,Y}(x, y)\,dx = g(y) \int h(x)\,dx = c_h g(y),

and since \iint f_{X,Y}(x, y)\,dx\,dy = 1 we have that c_g c_h = 1. Thus

f_{X,Y}(x, y) = h(x) g(y) = c_g h(x)\,c_h g(y) = f_X(x) f_Y(y).

2.3.1 Multivariate
Everything in this section extends to more than two random variables in the obvious way. Concepts like joint continuity, the joint density f(x_1, \ldots, x_n), the joint distribution function and the relationships between them all carry over; for example,

F(x_1, \ldots, x_n) = P(X_1 ≤ x_1, \ldots, X_n ≤ x_n) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_n} f(y_1, \ldots, y_n)\,dy_1 \cdots dy_n.

For independence we have "mutual independence", which means essentially that the distribution function and the density factorise fully into the product of the marginals,

f(y_1, \ldots, y_n) = f_1(y_1) \cdots f_n(y_n),

where

f_i(y_i) = \int_{\mathbb{R}^{n-1}} f(y_1, \ldots, y_n)\,dy_1 \cdots dy_{i-1}\,dy_{i+1} \cdots dy_n

is the i-th marginal.
Notice that in the multivariate case we can have marginal distributions and densities for any subset of the variables. That is, for I ⊆ {1, \ldots, n}, the marginal of the variables (X_i : i ∈ I) is found by integrating the joint pdf with respect to the variables not in I; for example, f_{X_1 X_3}(x_1, x_3) is found by integrating out all variables except x_1 and x_3.

A very important case is that of independent and identically distributed random variables which we
abbreviate as i.i.d. We say that X1 , . . . , Xn are i.i.d. if they are (mutually) independent and have the
same distribution, that is their marginals are all the same

fX1 (·) = fX2 (·) = · · · = fXn (·) =: f (·),

and therefore by independence

f(x_1, \ldots, x_n) = \prod_{i=1}^{n} f(x_i).

Remark 9. We can also have pairwise independence, where any two of the variables are independent.
As an example consider a die with four faces, showing the following sets of numbers respectively:

{1}, {2}, {3}, {1, 2, 3}.

You roll the die and define the random variables X_i, i = 1, 2, 3, where X_i = 1 if i appears on the top face and X_i = 0 otherwise, so that X_i ∼ Ber(1/2). Then it is an easy exercise to check that for all x_i, x_j ∈ {0, 1} we have

P(X_i = x_i, X_j = x_j) = \frac{1}{4} = \frac{1}{2} \times \frac{1}{2} = P(X_i = x_i)\,P(X_j = x_j),

and thus any two of the variables are independent. However they are not mutually independent, since

P(X_1 = X_2 = X_3 = 1) = \frac{1}{4} \neq \frac{1}{2} \times \frac{1}{2} \times \frac{1}{2} = P(X_1 = 1)\,P(X_2 = 1)\,P(X_3 = 1).

In this course whenever we say that the variables X1 , . . . , Xn are independent we will mean mutually
independent.
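The four-faced-die example in Remark 9 can be checked by brute-force enumeration. Here is a minimal plain-Python sketch (my own illustration, assuming each face appears with probability 1/4) verifying pairwise but not mutual independence.

faces = [{1}, {2}, {3}, {1, 2, 3}]   # the four equally likely faces

def prob(event):
    """P(event), where the event is a predicate on the indicator vector (X1, X2, X3)."""
    outcomes = [tuple(int(i in face) for i in (1, 2, 3)) for face in faces]
    return sum(event(x) for x in outcomes) / len(outcomes)

# Pairwise independence: P(Xi = 1, Xj = 1) = 1/4 = P(Xi = 1) * P(Xj = 1)
for i, j in [(0, 1), (0, 2), (1, 2)]:
    print(prob(lambda x: x[i] == 1 and x[j] == 1),
          prob(lambda x: x[i] == 1) * prob(lambda x: x[j] == 1))

# ... but not mutual independence: P(X1 = X2 = X3 = 1) = 1/4, not 1/8.
print(prob(lambda x: all(x)), 1 / 8)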

2.4 Expectation and friends


Definition 9 (Expectation). Let X be a random variable with probability density f(x). Then we define the expectation, or mean, of X, which we denote by E X, as

E X = \int x f(x)\,dx,

when the integral exists and is finite.


The variance of X denoted by var(X) is given by

var(X) = E[X 2 ] − (E[X])2 .

Notice that if g : R → R is a function, then Y = g(X) is also a random variable. If g is strictly increasing wherever f > 0, and differentiable, then it is invertible, and a change of variables allows one to deduce that Y is again a continuous random variable with probability density function given by

f_Y(y) = f_X(g^{-1}(y)) \frac{d g^{-1}(y)}{dy}.

However, this is not needed to compute the expectation of g(X), or of any function of g(X), since the expectation of the random variable g(X) can be computed through

E[g(X)] = \int g(x) f(x)\,dx.
This generalises to two or more variables in the obvious way. Let g(X, Y ) be a function of two random
variables X, Y with joint probability density f (x, y). Then the expected value of g is defined to be
E[g(X, Y)] = \iint g(x, y) f(x, y)\,dx\,dy.

In particular, we can define the means and variances exactly as before:

\mu_x = E(X) = \iint x f(x, y)\,dy\,dx,    \mu_y = E(Y) = \iint y f(x, y)\,dy\,dx,

var(X) = \iint x^2 f(x, y)\,dy\,dx − (E[X])^2,    var(Y) = \iint y^2 f(x, y)\,dy\,dx − (E[Y])^2.

Definition 10 (Covariance, Correlation). The covariance of X and Y is defined by

cov(X, Y) = E(XY) − E(X) E(Y),

where E(XY) = \iint x y f(x, y)\,dx\,dy.
The correlation coefficient ρ(X, Y) ≡ ρ_{XY} is defined as

\rho = \frac{E(XY) − E(X) E(Y)}{\sqrt{var(X)\,var(Y)}}.

The point of this is that ρ is dimensionless and therefore scale invariant. If we replace X by aX and Y
by bY , then cov(X, Y ) is increased by a factor ab but ρ remains the same.

2.4.1 A few useful Properties


The following few properties are extremely useful.

Linearity of expectation: For any random variables X, Y with finite expectations and any λ, µ ∈ R

E(λX + µY ) = λ E(X) + µ E(Y ),

in other words taking expectations is a linear operation.


Positivity of Expectation: If X ≥ 0 and E[X] < ∞ then E X ≥ 0.
If X ≥ 0 and E[X] = 0 then P(X = 0) = 1.


Proof. The first property we will not prove. It is trivial for discrete random variables, and for variables that admit a probability density function it is a property of integrals.
The second one is also trivial for discrete variables. For variables admitting a pdf, the conclusion P(X = 0) = 1 can never hold, so we have to show that the hypothesis cannot hold either, i.e. that if X ≥ 0 admits a pdf then E X > 0. In this case

P(X > x) = 1 − F(x) = \int_x^{\infty} f_X(y)\,dy

is continuous and monotone decreasing, going from 1 to 0, so there exists c > 0 such that P(X > c) > 0. From the first property we have that

E X = E[X 1{X > c}] + E[X 1{X ≤ c}] ≥ E[X 1{X > c}] ≥ E[c 1{X > c}] = c\,P(X > c) > 0.

Variance of sum: The variance of a sum or difference is given by

var(X ± Y) = E[(X ± Y)^2] − (E[X ± Y])^2
           = E[X^2] ± 2 E[XY] + E[Y^2] − \left( (E[X])^2 ± 2 E[X] E[Y] + (E[Y])^2 \right)
           = var(X) + var(Y) ± 2\,cov(X, Y),

and in particular, if X and Y are independent, then

var(X ± Y) = var(X) + var(Y).

We are often interested in the special case of n identically distributed independent variables, which
we usually denote by X1 , X2 , . . . , Xn . From the above we have

E[X_1 + X_2 + \cdots + X_n] = n E[X] \implies E[\bar X] = E[X],

where

\bar X := \frac{1}{n}(X_1 + \cdots + X_n),

and

var(X_1 + X_2 + \cdots + X_n) = n\,var(X) \implies var(\bar X) = \frac{1}{n^2} \cdot n\,var(X) = \frac{1}{n} var(X).

Expectation of a product of independent variables: Let X and Y be independent and g, h : R → R be two functions. Then

E[g(X)h(Y)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x) h(y) f_{XY}(x, y)\,dx\,dy
            = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x) h(y) f_X(x) f_Y(y)\,dx\,dy
            = \int_{-\infty}^{\infty} g(x) f_X(x)\,dx \times \int_{-\infty}^{\infty} h(y) f_Y(y)\,dy
            = E[g(X)] \times E[h(Y)].

It follows trivially that if X, Y are independent then

cov(X, Y) = 0.

The converse, however, is not true.

Scaling of variance: For any random variable X such that E X 2 < ∞ and a ∈ R

var(aX) = a2 var(X).

Theorem 10. ρ ∈ [−1, 1].

Proof. It suffices to prove the result for X, Y with E X = E Y = 0 and E X^2 = E Y^2 = 1.
For any real constant λ, the expectation of the random variable (X − λY)^2 must clearly be non-negative, so

0 ≤ E[(X − λY)^2] = λ^2 − 2λ E[XY] + 1.

Since the leading coefficient of this quadratic in λ is positive (the parabola opens upwards), the inequality can hold for all λ only if the polynomial has at most one real root, that is, if and only if the discriminant is non-positive, i.e. 4 E[XY]^2 − 4 ≤ 0, or equivalently −1 ≤ E[XY] ≤ 1.

Example 4. Consider two variables X, Y with joint density

f (x, y) = 6(x − y), 0<y<x<1

with of course f (x, y) = 0 outside the triangle with vertices (0,0), (1,0), (1,1). First, we can verify that
this is a proper probability density. It is clearly non-negative, and then
\int_0^1 \int_0^x 6(x − y)\,dy\,dx = \int_0^1 (6xy − 3y^2)\big|_0^x\,dx = \int_0^1 3x^2\,dx = 1

as required. The marginal densities for each variable are obtained by integrating over the other one:

f_X(x) = \int_0^x 6(x − y)\,dy = (6xy − 3y^2)\big|_0^x = 3x^2,

f_Y(y) = \int_y^1 6(x − y)\,dx = (3x^2 − 6xy)\big|_y^1 = 3 − 6y + 3y^2 = 3(1 − y)^2,

with both zero for arguments outside the interval [0, 1]. The marginal densities are just the sort of
ordinary density we are used to and so are treated in the familiar way. Integrating each of them over the
interval gives unity. Also
E(X) = \int_0^1 x \cdot 3x^2\,dx = \frac{3}{4},    E(Y) = \int_0^1 y \cdot 3(1 − y)^2\,dy = \frac{3}{2} − 2 + \frac{3}{4} = \frac{1}{4},

and

E(X^2) = \int_0^1 x^2 \cdot 3x^2\,dx = \frac{3}{5},    E(Y^2) = \int_0^1 y^2 \cdot 3(1 − y)^2\,dy = 1 − \frac{3}{2} + \frac{3}{5} = \frac{1}{10},
so that var(X) = 3/80 and var(Y ) = 3/80. These are equal because of the symmetries of the problem,
though this may not be obvious at first sight.
Note that if you aren’t convinced about using the marginal density, it’s really the same as taking a
double integral over the whole region except that we’ve already done the first bit. Thus
E(X) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x f(x, y)\,dy\,dx = \int_{-\infty}^{\infty} x \left( \int_{-\infty}^{\infty} f(x, y)\,dy \right) dx = \int_{-\infty}^{\infty} x f_X(x)\,dx.

Working out E(Y ) and E(Y 2 ) is the same, except that to make the trick work we do the x integration
first. To work out the covariance we have to use the joint probability density and integrate over the whole
region, and this time we have to do the whole integration explicitly:
E(XY) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x y f(x, y)\,dy\,dx.

For the example we have been using this gives

E(XY) = \int_0^1 \int_0^x x y \cdot 6(x − y)\,dy\,dx = \int_0^1 \left( \int_0^x (6x^2 y − 6x y^2)\,dy \right) dx
      = \int_0^1 \left( 3x^2 y^2 − 2x y^3 \right)\big|_0^x\,dx = \int_0^1 x^4\,dx = \frac{1}{5},

and so the covariance is

E(XY) − E(X)E(Y) = \frac{1}{5} − \frac{3}{4} \cdot \frac{1}{4} = \frac{1}{80}.

Because var(X) = var(Y) = \frac{3}{80}, the correlation coefficient is

\rho = \frac{E(XY) − E(X)E(Y)}{\sqrt{var(X)\,var(Y)}} = \frac{1}{3}.
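The moments in Example 4 can be double-checked numerically. The following sketch (assuming scipy; not part of the notes) integrates the joint density f(x, y) = 6(x − y) over the triangle 0 < y < x < 1 to recover E(XY), the covariance and ρ = 1/3.

from scipy.integrate import dblquad

def moment(g):
    """Integrate g(x, y) * f(x, y) over 0 < y < x < 1, where f(x, y) = 6(x - y)."""
    integrand = lambda y, x: g(x, y) * 6 * (x - y)     # dblquad expects func(y, x)
    return dblquad(integrand, 0, 1, lambda x: 0, lambda x: x)[0]

EX, EY = moment(lambda x, y: x), moment(lambda x, y: y)      # 3/4 and 1/4
EXY = moment(lambda x, y: x * y)                             # 1/5
varX = moment(lambda x, y: x**2) - EX**2                     # 3/80
varY = moment(lambda x, y: y**2) - EY**2                     # 3/80
rho = (EXY - EX * EY) / (varX * varY) ** 0.5
print(EX, EY, EXY, rho)                                      # rho is about 1/3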
Example 5. Consider now the uniform distribution on the unit disc, which has density 1/π on {(x, y) : x^2 + y^2 ≤ 1}. Then

E(X) = \frac{1}{\pi} \iint x\,dx\,dy = \frac{1}{\pi} \int_0^1 \int_0^{2\pi} r\cos\theta \, r\,d\theta\,dr = 0,

which was obvious by symmetry in the first place. Similarly E(Y) = 0. And also (if this too isn't obvious)

E(XY) = \frac{1}{\pi} \iint x y\,dx\,dy = \frac{1}{\pi} \int_0^1 \int_0^{2\pi} r\cos\theta \cdot r\sin\theta \, r\,d\theta\,dr.

But the θ integration is clearly zero, either by letting φ = 2θ or by symmetry, or whatever. Hence cov(X, Y) = E(XY) − E(X)E(Y) = 0.

Fact: Independent variables are uncorrelated. But uncorrelated variables are not necessarily independent.

Example 6. Let X be uniform in [−1, 1] and Y = X 2 . Then you can show that E[XY ] = E[X 3 ] = 0
and E[X] E[Y ] = 0 and thus the covariance is zero. However it is clear that the two variables are not
independent.
Can you extend this example to more general distributions?

2.5 Conditional Distributions


For discrete random variables we define the conditional distribution of X given Y via the equation

p_{X|Y}(x|y) = \frac{p(x, y)}{p_Y(y)} = \frac{P(X = x, Y = y)}{P(Y = y)}.

For continuous distributions the role of conditional probability mass functions is here played by condi-
tional densities.

Definition 11 (Conditional PDF). Let X, Y be jointly continuous with joint density f (x, y) and marginals
fX and fY . Then for any y such that fY (y) > 0 the conditional density of X given Y = y is given by

f_{X|Y}(x|y) = \frac{f(x, y)}{f_Y(y)},

and similarly for fY |X .

This gives the familiar rule "joint density divided by the marginal density of the other variable."
To check that this indeed makes sense as a definition for a density function, we easily check first that it is non-negative, and that

\int_x f_{X|Y}(x|y)\,dx = \frac{1}{f_Y(y)} \int_x f(x, y)\,dx = \frac{f_Y(y)}{f_Y(y)} = 1,

provided that f_Y(y) > 0.
I’ve already said that in the continuous case we use exactly the same definition. But we have to bear
in mind that we are working with probability densities, not actual probabilities, and so we have to check
that the f (y|x) as so defined is what it ought to be.
It is certainly nonnegative, so we check that it integrates to unity:

\int_{-\infty}^{\infty} f_{Y|X}(y|x)\,dy = \int_{-\infty}^{\infty} \frac{f(x, y)}{f_X(x)}\,dy = \frac{1}{f_X(x)} \int_{-\infty}^{\infty} f(x, y)\,dy = \frac{f_X(x)}{f_X(x)} = 1.

Also, averaging with respect to X yields the marginal density of Y, as it should:

E[f_{Y|X}(y|X)] = \int_{-\infty}^{\infty} f_{Y|X}(y|x) f_X(x)\,dx = \int_{-\infty}^{\infty} f(x, y)\,dx = f_Y(y).

The conditional mean, or expectation, of Y given X is just the expected value of Y in the case where
the value of X is known, so it is evaluated using the conditional distribution
E(Y | X = x) = \int_{-\infty}^{\infty} y f(y|x)\,dy.

The conditional mean E(Y | X = x) is a function of x, and it is called the regression function of Y on
X. It’s a generalisation of the idea of a regression line when the relation is linear: in that case the line
goes through the points where you expect Y to be for any given value of X.
Since E(Y | X = x) is a function of x we can compute its expected value wrt the distribution of X.
Perhaps not surprisingly, we get the mean of Y :
E[E(Y|X)] = \int \left( \int y f_{Y|X}(y|x)\,dy \right) f_X(x)\,dx
          = \iint y f_{Y|X}(y|x) f_X(x)\,dy\,dx
          = \iint y f(x, y)\,dx\,dy
          = \int y f_Y(y)\,dy = E(Y).

We can also define a conditional variance

var(Y |X) = E(Y 2 |X) − [E(Y |X)]2

where the expected values are worked out using the conditional densities. The expected value of the
conditional variance is not, however, the unconditional variance.

Example 7 (Example 4 continued). Recall that X, Y have joint density

f(x, y) = 6(x − y),   0 < y < x < 1,

and 0 elsewhere. The marginal densities for each variable are obtained by integrating over the other one:

f_X(x) = \int_0^x 6(x − y)\,dy = (6xy − 3y^2)\big|_0^x = 3x^2,

f_Y(y) = \int_y^1 6(x − y)\,dx = (3x^2 − 6xy)\big|_y^1 = 3 − 6y + 3y^2 = 3(1 − y)^2.

Each of the conditional densities is just the joint density divided by the "other" marginal density:

f_{X|Y}(x|y) = \frac{2(x − y)}{(1 − y)^2},   y < x < 1,

f_{Y|X}(y|x) = \frac{2(x − y)}{x^2},   0 < y < x.

To use these, we just make the indicated substitutions. Thus the conditional density of Y given that X = 1/2 is

f(y \mid x = \tfrac{1}{2}) = \frac{2(\frac{1}{2} − y)}{1/4},

only we have to be careful about the range: if x = 1/2 then Y must lie in the range 0 < y < 1/2. We check:

\int_0^{1/2} \frac{2(\frac{1}{2} − y)}{1/4}\,dy = (4y − 4y^2)\big|_0^{1/2} = 1.

We can then find, for example,

P(Y > \tfrac{1}{4} \mid x = \tfrac{1}{2}) = \int_{1/4}^{1/2} \frac{2(\frac{1}{2} − y)}{1/4}\,dy = (4y − 4y^2)\big|_{1/4}^{1/2} = \frac{1}{4}.

Note that, as advertised, these have different functional forms. The regression functions are:

E(X|y) = \int_y^1 \frac{2x(x − y)}{(1 − y)^2}\,dx = \frac{y + 2}{3},

E(Y|x) = \int_0^x \frac{2y(x − y)}{x^2}\,dy = \frac{x}{3}.

We can check by finding the unconditional means as the expected values of the conditional means:

E(X) = E[E(X|Y)] = E\left[ \frac{Y + 2}{3} \right] = \frac{E(Y)}{3} + \frac{2}{3} = \frac{3}{4},

E(Y) = E[E(Y|X)] = E\left[ \frac{X}{3} \right] = \frac{1}{4}.

All this generalizes easily to n random variables X1 , X2 , . . . Xn , though the calculations can get
tricky.
Note, by the way, that we have slightly awkward notation here. We don’t usually bother to indicate
what it is that we are taking the expected value with respect to because it is usually obvious. Here it is a
little bit less clear. It might have helped (though it wouldn’t be standard) to write E(X) = EY [E(X|Y )]
to stress that we are averaging over Y all the expected values E(X|y). As often happens, it is clear
enough when you think about it — since E(X|y) is a function of y but not of x any averaging has to be
over y — but you do have to think about it.

2.5.1 Uniform Bivariate Distributions
This is one of the simplest classes of continuous bivariate distributions. Suppose (X, Y ) is uniformly
distributed in some set A ⊂ R2 , i.e. suppose that f (x, y) = C over the entire range for which it is not
zero.
Similarly to the univariate case we have to determine C and this turns out to depend on the range.
Since f(x, y) = C on A and 0 elsewhere,

\iint_{\mathbb{R}^2} f(x, y)\,dx\,dy = \iint_A C\,dx\,dy = C \cdot \text{area}(A) = 1,

and thus C is the inverse of the area of A.

Example 8. Suppose (X, Y ) is uniformly distributed over the rectangle

A := {(x, y) : 0 ≤ x ≤ 1, 0 ≤ y ≤ 2}.

Thus f(x, y) = C for 0 < x < 1 and 0 < y < 2, and f(x, y) = 0 otherwise.
The area of A is 2 and thus C = 1/2, i.e. f(x, y) = 1/2 for (x, y) ∈ A. The marginal densities are

f_X(x) = \frac{1}{2} \int_0^2 dy = 1,   x ∈ [0, 1],

f_Y(y) = \frac{1}{2} \int_0^1 dx = \frac{1}{2},   y ∈ [0, 2],

that is, X and Y are uniform on [0, 1] and [0, 2] respectively.
Also, for (x, y) ∈ A,

f_{X,Y}(x, y) = \frac{1}{2} = 1 \times \frac{1}{2} = f_X(x) \times f_Y(y),

and so X and Y are independent.
Suppose we want to compute P(X < 1/2, Y > 1). This is given by integrating the density over the appropriate region, that is

\int_0^{1/2} \int_1^2 \frac{1}{2}\,dy\,dx = \frac{1}{4}.
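A quick Monte Carlo check of Example 8 (a minimal numpy sketch, not part of the original notes): draw uniform points on the rectangle and estimate P(X < 1/2, Y > 1).

import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000
x = rng.uniform(0, 1, n)              # X ~ U[0, 1]
y = rng.uniform(0, 2, n)              # Y ~ U[0, 2], independent of X

print(np.mean((x < 0.5) & (y > 1)))   # close to 1/4, as computed above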
Example 9. Suppose now that (X, Y) is uniformly distributed over the region below the diagonal of the same rectangle:

f(x, y) = C for 0 < x < 1 and 0 < y < 2x, and f(x, y) = 0 otherwise.

We work out the constant from the condition

\int_0^1 \int_0^{2x} C\,dy\,dx = 1,

and thus C = 1.
Are X and Y independent? You might think they are because f(x, y) seems to factor into a product of trivial factors. However this is not the case here. Suppose that you want to claim that f(x, y) = h(x)g(y), where h(x) = g(y) = 1 for all x, y. This does not work, as the range of x depends on y and therefore the joint density does not factor.

The marginal densities are

f_X(x) = \int_0^{2x} dy = 2x,    f_Y(y) = \int_{y/2}^{1} dx = 1 − \frac{y}{2}.

Note that

\int_0^1 f_X(x)\,dx = x^2 \big|_0^1 = 1,    \int_0^2 f_Y(y)\,dy = \left( y − \frac{y^2}{4} \right)\Big|_0^2 = 1,

as they should.
Since the variables are not independent it makes sense to compute the conditional densities, which are given by

f(x|y) = \frac{1}{1 − \frac{y}{2}},    f(y|x) = \frac{1}{2x},

which are clearly not equal to the marginal densities. These are probability densities too, and should integrate to unity over their entire ranges, but we have to be careful because the range of integration can (and in these cases does) depend on the other variable. This is where it really helps to have a diagram.
So we have

\int_{y/2}^{1} \frac{dx}{1 − \frac{y}{2}} = 1,    \int_{0}^{2x} \frac{dy}{2x} = 1.
We can use these in an obvious way to calculate conditional probabilities. Typically, we might want to know the probability that X > 2/3 given that Y = 3/4. We simply substitute 3/4 for y in the expression for f(x|y) and integrate, being careful about the range:

P(X > \tfrac{2}{3} \mid Y = \tfrac{3}{4}) = \int_{2/3}^{1} \frac{dx}{1 − \frac{3}{8}} = \frac{8}{15}.

If we want P(X < 2/3 | Y = 3/4), we have to be a bit careful about the lower limit:

P(X < \tfrac{2}{3} \mid Y = \tfrac{3}{4}) = \int_{3/8}^{2/3} \frac{dx}{1 − \frac{3}{8}} = \frac{7}{15}.

Of course we could also work out the probability for X > 2/3 and subtract.
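These conditional probabilities can be verified numerically. Here is a minimal sketch assuming scipy (not part of the notes), integrating the conditional density f(x | y) = 1/(1 − y/2) over the relevant ranges for y = 3/4.

from scipy.integrate import quad

y = 3 / 4
f_cond = lambda x: 1 / (1 - y / 2)            # f(x | y) = 1/(1 - y/2) on (y/2, 1)

p_greater, _ = quad(f_cond, 2 / 3, 1)         # P(X > 2/3 | Y = 3/4) = 8/15
p_less, _ = quad(f_cond, y / 2, 2 / 3)        # P(X < 2/3 | Y = 3/4) = 7/15
print(p_greater, p_less, p_greater + p_less)  # the two probabilities sum to 1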

2.6 Moments and Moment Generating Functions


Some of the most basic properties of a random variable are its moments.
Definition 12 (Moments). The k-th moment of a random variable X is defined as E[X k ], if the expec-
tation of X k is well defined.
Remark 11. Not all random variables have all moments. For example, suppose that X is a positive continuous random variable taking values in [1, ∞) with density

f_X(x) = \frac{\alpha}{x^{1+\alpha}} for x > 1, and f_X(x) = 0 otherwise,

for some parameter α > 0. You can check that this is indeed a density. Whether the k-th moment exists depends on α, as the following simple calculation shows:

E[X^k] = \alpha \int_1^{\infty} x^{k − \alpha − 1}\,dx,

and this is finite if and only if α + 1 − k > 1, that is, if and only if k < α.

Definition 13 (Moment Generating Function). The moment generating function of the random variable X is defined as

\psi_X(t) = E(e^{tX}),

for t ∈ R. We say that the moment generating function of X exists if there exists b > 0 such that \psi_X(t) < ∞ for all |t| < b.

The reason behind its name becomes apparent after differentiating, since

\psi'(t) = \frac{d}{dt} E(e^{tX}) = E\left( \frac{d}{dt} e^{tX} \right) = E(X e^{tX}),    \psi'(0) = E(X).

Warning 1. In the above calculation we have accepted a bit of hand waving in exchanging the order of
differentiation and expectation. In this course you should just assume that this is possible. The theory
explaining when this is possible lies deeper in the realms of Real analysis than time and scope permit us
to explore.

Differentiating again,

\psi''(t) = \frac{d}{dt} E(X e^{tX}) = E\left( \frac{d}{dt} (X e^{tX}) \right) = E(X^2 e^{tX}),    \psi''(0) = E(X^2).

It is clear that we can continue in this way. Alternatively, we can expand e^{tX} in a power series:

e^{tX} = \sum_{n} \frac{(tX)^n}{n!}.

Taking the expected value of both sides:

\psi(t) = E(e^{tX}) = \sum_{n} \frac{t^n}{n!} E(X^n).
Theorem 12. If \psi_X(t) exists then for all k ∈ N we have

\frac{d^k \psi_X(t)}{dt^k} \Big|_{t=0} = E[X^k].

Thus we can find the n-th ordinary moment of X either as \psi^{(n)}(0) or as n! times the coefficient of t^n in the power series expansion of \psi.
Example 10. For the binomial distribution,

\psi(t) = E(e^{tX}) = \sum_{k=0}^{n} e^{tk} \binom{n}{k} p^k q^{n−k} = (p e^t + q)^n.

Then

\psi'(t) = n p e^t (p e^t + q)^{n−1},    \psi''(t) = n p e^t \cdot (n−1) p e^t (p e^t + q)^{n−2} + n p e^t (p e^t + q)^{n−1}.

The first two ordinary moments are therefore np and n(n − 1)p^2 + np. Hence the mean is np and the variance is

var(X) = E(X^2) − (E(X))^2 = n(n − 1)p^2 + np − (np)^2 = np(1 − p) = npq.

Example 11. For the exponential distribution

f(x; \lambda) = \lambda e^{-\lambda x},   x > 0,

we have

\psi(t) = E(e^{tX}) = \int_0^{\infty} e^{tx} \lambda e^{-\lambda x}\,dx = \frac{\lambda}{\lambda − t},   t < \lambda.

Then

\psi'(t) = \frac{\lambda}{(\lambda − t)^2},    \psi''(t) = \frac{2\lambda}{(\lambda − t)^3}.

Hence E(X) = 1/λ and E(X^2) = 2/λ^2, so that var(X) = 1/λ^2.
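Moments can also be extracted from a moment generating function symbolically. Here is a minimal sketch using sympy (my own illustration, not part of the notes) for the exponential MGF ψ(t) = λ/(λ − t).

import sympy as sp

t, lam = sp.symbols('t lambda', positive=True)
psi = lam / (lam - t)                       # MGF of the Exponential(lambda) distribution

m1 = sp.diff(psi, t, 1).subs(t, 0)          # E[X]   = 1/lambda
m2 = sp.diff(psi, t, 2).subs(t, 0)          # E[X^2] = 2/lambda^2
var = sp.simplify(m2 - m1**2)               # var(X) = 1/lambda^2
print(m1, m2, var)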

Moment generating functions are extremely useful because they uniquely characterise the distributions
of random variables. Of course it is also obvious that if two random variables have the same distribution
then they also have the same moment generating function.

Theorem 13 (Uniqueness). Let X, Y be random variables with moment-generating functions ψX , ψY


respectively. Suppose that both moment generating functions exist and that for some b > 0, we have
ψX (t) = ψY (t) for all |t| < b. Then X and Y have the same distribution function.

Remark 14. A little thought reveals that the moment generating function of a random variable with pdf
fX , is simply the Laplace transform of the pdf. Therefore the above result should not be that surprising
as Laplace transforms uniquely characterise functions subject to regularity restrictions.

Theorem 15 (Properties of MGFs).

1. Let a, b ∈ R and Y = a + bX. Then
   \psi_Y(t) = E(e^{tY}) = E[e^{t(a + bX)}] = E[e^{at} e^{(bt)X}] = e^{at} \psi_X(bt).

2. If X and Y are independent random variables and Z = X + Y, then
   \psi_Z(t) = E(e^{tZ}) = E(e^{t(X + Y)}) = E(e^{tX} e^{tY}) = \psi_X(t) \psi_Y(t).

3. If X_1, \ldots, X_n are independent and identically distributed random variables (i.i.d.) with common MGF \psi_X, and S = X_1 + \cdots + X_n, then
   \psi_S(t) = \psi_X(t)^n.

Example 12 (Moment generating function of the Normal Distribution). We can use the first result to work out the moment generating function for the standard normal distribution, which has probability density function

\phi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}.

We start with the standardised normal Z, because it is easier:

\psi_Z(t) = E[e^{tZ}] = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{tz} e^{-z^2/2}\,dz.

We evaluate the integral by completing the square in the exponent:

tz − z^2/2 = −\frac{1}{2}[z^2 − 2tz + t^2] + \frac{t^2}{2} = −\frac{1}{2}[z − t]^2 + \frac{t^2}{2}.

Then a simple change of variables shows that

\psi_Z(t) = \frac{e^{t^2/2}}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-(z − t)^2/2}\,dz = \frac{e^{t^2/2}}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-z^2/2}\,dz = e^{t^2/2}.

The trick here is that the integral is just \sqrt{2\pi}, because it is the standard normal integral shifted over a bit, which doesn't matter since the limits are infinite in either direction.
This allows us to work out the mean and variance:

\psi_Z'(t) = t e^{t^2/2},    \psi_Z''(t) = e^{t^2/2} + t^2 e^{t^2/2},

which gives zero mean and unit variance, as it should. We can now compute the MGF of a general normal distribution N(\mu, \sigma^2) with mean \mu and variance \sigma^2. The trick here is that if X ∼ N(0, 1) then Y = \mu + \sigma X ∼ N(\mu, \sigma^2). Using Theorem 15, the moment generating function is

\psi_Y(t) = e^{\mu t} \psi_X(\sigma t) = \exp\left[ \mu t + \frac{\sigma^2 t^2}{2} \right],

and from this we can easily work out E(Y) and var(Y):

\psi'(t) = (\mu + t\sigma^2) \exp\left[ \mu t + \frac{\sigma^2 t^2}{2} \right],

\psi''(t) = \left[ (\mu + t\sigma^2)^2 + \sigma^2 \right] \exp\left[ \mu t + \frac{\sigma^2 t^2}{2} \right].

Hence E(Y) = \psi'(0) = \mu and E(Y^2) = \psi''(0) = \mu^2 + \sigma^2, i.e. var(Y) = \sigma^2. Now we can calculate the higher order moments easily enough. The third central moment, generally called the skewness, is defined by

\mu_3 = E(Y − \mu)^3 = E(Y^3) − 3\mu E(Y^2) + 3\mu^2 E(Y) − \mu^3
      = E(Y^3) − 3\mu(\sigma^2 + \mu^2) + 2\mu^3
      = E(Y^3) − 3\mu\sigma^2 − \mu^3.

We can compute E(Y^3) directly from the moment generating function:

\psi'''(t) = \left[ (\mu + t\sigma^2)^3 + (\mu + t\sigma^2)\sigma^2 + 2(\mu + t\sigma^2)\sigma^2 \right] \exp\left[ \mu t + \frac{\sigma^2 t^2}{2} \right],

E(Y^3) = \psi'''(0) = \mu^3 + \mu\sigma^2 + 2\mu\sigma^2 = \mu^3 + 3\mu\sigma^2,

and so \mu_3 = 0. Of course this is exactly as we would expect, since the odd central moments of a symmetric distribution must be zero. Since \mu_3 is the lowest non-trivial odd central moment (the first central moment is, by definition, zero) this does suggest that it is a good measure of the skewness, but in fact it can be quite misleading. Symmetric distributions have zero skewness, but lots of asymmetric distributions that on any intuitive basis are skewed can have zero third central moment.

Chapter 3

Exponential Densities

Many common statistical tests are based on the normal density, and other densities derived from it also involve exponentials. So I will start by deriving a number of exponential-type distributions. Most of them have other uses as well, so this is not just preparation for hypothesis testing.

3.1 Review of the Normal Distribution


The normal or Gaussian distribution is probably the most important continuous distribution, partly due
to its role in the central limit theorem. It is a two-parameter family of distributions, parameterized by
the mean µ, and variance σ 2 and denoted by N (µ, σ 2 ). Its density is given by
\phi(y; \mu, \sigma^2) := f(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2\sigma^2} (y − \mu)^2 \right),   y ∈ (−∞, ∞).

The density is symmetric about the mean \mu. The expected value of a normal random variable Y ∼ N(\mu, \sigma^2) is \mu and its variance is \sigma^2.

Figure 3.1.1: The density of N(\mu, \sigma^2).


The distribution function of Y ∼ N(\mu, \sigma^2), that is F(x) = P(Y ≤ x), does not have a closed analytic form, and therefore we get its values from statistical tables.

Theorem 16. Let Y ∼ N(\mu, \sigma^2). Then

\psi_Y(t) := E[e^{tY}] = e^{\mu t + \frac{\sigma^2}{2} t^2}.

The proof is an easy calculation and is therefore left as an exercise. As a hint remember that any
density of a normal random variable has to integrate to 1.

3.1.1 Multivariate Normal Distribution


This is a very important case so we will spend some time on it. In general the multivariate normal
distribution is the joint distribution of n continuous random variables X1 , . . . , Xn . Before we embark
on it we need some definitions.

Definition 14 (Covariance Matrix). Given random variables X1 , . . . , Xn , their covariance matrix is the
symmetric matrix
$$\Sigma := \begin{pmatrix}
\sigma_{11} & \sigma_{12} & \cdots & \sigma_{1n} \\
\sigma_{21} & \sigma_{22} & \cdots & \sigma_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{n1} & \sigma_{n2} & \cdots & \sigma_{nn}
\end{pmatrix},$$

where σij := cov(Xi , Xj ).

Definition 15 (Multivariate Normal). We say that the jointly continuous random variables X1 , . . . , Xn
have the multivariate normal distribution with mean vector µ and positive definite covariance matrix Σ,
written (X1 , . . . , Xn ) ∼ N (µ, Σ) if their joint probability density function is given by
$$\phi(x_1, \ldots, x_n) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}}\exp\Big[-\frac{1}{2}(x - \mu)^T \Sigma^{-1}(x - \mu)\Big], \qquad (3.1.1)$$

where x = (x1 , . . . , xn )T viewed as a column vector and |Σ| is the determinant of Σ.


We equivalently say that X1 , . . . , Xn are jointly normal.

The following fact is extremely useful for hypothesis testing and in general.

Theorem 17. If X1 , . . . , Xn are independent normal random variables such that Xi ∼ N (µi , σi2 ), then
$$X_1 + \cdots + X_n \sim N\Big(\sum_{i=1}^n \mu_i,\; \sum_{i=1}^n \sigma_i^2\Big).$$

Remark 18. Of course we know what the mean would be through the linearity of expectations, and the
variance also, since the variance of a sum of independent variables is the sum of the variances. The main
content of the theorem is that the sum of independent normal random variables is also normal, a far
from trivial fact.

Proof. It’s enough to check for two random variables and to use induction.
Suppose that X ∼ N(µ_X, σ_X²) and Y ∼ N(µ_Y, σ_Y²) are independent normal random variables. Let Z = X + Y. Then
$$\psi_Z(t) = \psi_X(t)\psi_Y(t) = \exp\Big(\mu_X t + \frac{\sigma_X^2 t^2}{2}\Big)\exp\Big(\mu_Y t + \frac{\sigma_Y^2 t^2}{2}\Big) = \exp\Big((\mu_X + \mu_Y)t + \frac{(\sigma_X^2 + \sigma_Y^2)t^2}{2}\Big).$$

We recognise this as the moment generating function of a normally distributed variable with mean µ_X + µ_Y and variance σ_X² + σ_Y². By uniqueness of moment generating functions, we conclude that Z ∼ N(µ_X + µ_Y, σ_X² + σ_Y²).

An equivalent definition for jointly normal random variables is offered by the following theorem which
we will not prove.

Theorem 19. The variables X1 , . . . , Xn are jointly normal, if and only if for any λ1 , . . . , λn ∈ R, the
variable
λ1 X1 + λ2 X2 + · · · + λn Xn ,
is normally distributed.

It is easy to check that if (X1 , . . . , Xn ) ∼ N (µ, Σ) then E[Xi ] = µi , and that cov(Xi , Xj ) = σij .
As a special case look at the density in the case n = 2 where, letting ρ := corr(X, Y) and µ_X, µ_Y, σ_X, σ_Y be the means and standard deviations of X and Y respectively, we have
$$\phi(x, y) = \frac{1}{2\pi\sigma_X\sigma_Y\sqrt{1 - \rho^2}}\exp\Big[-\frac{1}{2(1 - \rho^2)}\Big\{\frac{(x - \mu_X)^2}{\sigma_X^2} + \frac{(y - \mu_Y)^2}{\sigma_Y^2} - \frac{2\rho(x - \mu_X)(y - \mu_Y)}{\sigma_X\sigma_Y}\Big\}\Big].$$

Of course you don’t have to memorise these formulas.


Another special case, which you do have to remember, is that of independent normal random variables.
For simplicity assume that Xi ∼ N (µi , σi2 ) and that X1 , . . . , Xn are independent. Then their joint
density is given by
$$f(x_1, \ldots, x_n) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma_i^2}}\exp\Big[-\frac{1}{2\sigma_i^2}(x_i - \mu_i)^2\Big].$$

The formula becomes clear as soon as you realise that their covariance matrix is in fact diagonal since
all the covariances vanish.
In fact we have the following result.

Theorem 20. Suppose that X1 , . . . , Xn ∼ N (µ, Σ), in other words that X1 , . . . , Xn are jointly normal.
Then X1 , . . . , Xn are mutually independent if and only if the covariance matrix Σ is diagonal.

Remark 21. This in particular implies, that if X, Y are jointly normal, then they are independent if and
only if cov(X, Y ) = 0.
This is only true if the variables are JOINTLY NORMAL and may fail otherwise as the next example
demonstrates.

Example 13. Let X ∼ N(0, 1), let
$$B = \begin{cases} +1, & \text{with probability } 1/2; \\ -1, & \text{with probability } 1/2, \end{cases}$$
be independent of X and define Y := BX.


It is easy to check that Y ∼ N(0, 1) and that
$$\operatorname{cov}(X, Y) = E[XY] = E[X^2 B] = \tfrac{1}{2}E[X^2] - \tfrac{1}{2}E[X^2] = 0.$$

The problem here is that although X and Y are both normal, they are not jointly normal. To see why
just consider the linear combination X + Y and notice that with probability 1/2, B = −1 and thus
P(X + Y = 0) > 0. This means that X + Y is not a continuous random variable, in the sense that it
does not have a probability density function, and therefore it cannot be normally distributed.
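The following is a minimal simulation sketch of this example (not part of the original notes; it assumes Python with NumPy, and all names are illustrative): the sample covariance of X and Y comes out close to zero, yet X + Y takes the value 0 roughly half the time, so the pair cannot be jointly normal.

import numpy as np

rng = np.random.default_rng(0)
n = 100_000
X = rng.standard_normal(n)              # X ~ N(0, 1)
B = rng.choice([-1.0, 1.0], size=n)     # B = +/-1 with probability 1/2, independent of X
Y = B * X                               # Y ~ N(0, 1) as well

print("sample cov(X, Y):", np.cov(X, Y)[0, 1])           # approximately 0
print("estimated P(X + Y = 0):", np.mean(X + Y == 0.0))   # approximately 1/2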

Jointly normal random variables have some very nice properties which we now state without proof.

Theorem 22. Let (X1 , . . . , Xn ) ∼ N (µ, Σ). Then

(a) (X1 , . . . , Xn )T = µ + Σ1/2 (Y1 , . . . , Yn )T where Y1 , . . . , Yn are i.i.d. standard normal.

(b) Let A be any n × n matrix. Then

AX ∼ N (Aµ, AΣAT ).

Notice that from the last property we can deduce the following. If Y1, . . . , Yn are independent standard normal random variables, then µ = 0 and Σ is the identity matrix. Let O be an orthogonal n × n matrix and define X := (X1, . . . , Xn)^T := OY. Then X ∼ N(0, OΣO^T) = N(0, I), since OΣO^T = OO^T = I, and therefore X1, . . . , Xn are also i.i.d. standard normal.

Remark 23. Notice that if the random vector X has covariance matrix Σ, and A is an n × n matrix, then the random vector AX has covariance matrix AΣA^T, even if X is not normally distributed.

3.2 Poisson process


Before we move on, we first introduce the Poisson process, a very useful model which gives rise to many
useful exponential distributions.
Suppose that you want to model the arrival times of a sequence of events, for example buses arriving
at a bus stop, customers arriving at a shop, or atoms decaying.
We will denote the number of events that have occurred up to time t, by N (t)

N (t) = #arrivals up to time t.

With this definition, if s < t then N (t) − N (s) is the number of events that has occurred in the time
interval (s, t].
We will make a number of assumptions to make the model tractable. But first we introduce some very useful notation.

Definition 16 (Landau notation). Recall that we write f(h) = o(h) if lim_{h→0} f(h)/h = 0, and f(h) = O(h) if there is a constant C such that |f(h)| ≤ Ch for all h sufficiently small.

Definition 17 (Poisson process). A family of random variables {N(t) : t ≥ 0}, N(t) ∈ N, such that N(0) = 0 and the following properties are satisfied:

Independent Increments the numbers of events occurring over disjoint intervals of time are independent, that is, for all integers n and real numbers t0 := 0 < t1 < t2 < · · · < tn the random variables N(ti) − N(ti−1), i = 1, . . . , n, are mutually independent;

Stationary increments the distribution of the number of events happening over the interval [t, t + h] depends only on h and is independent of t; that is, the distribution of N(t + h) − N(t) is independent of t;

Distribution for some λ > 0, we have
$$P(N(h) = 1) = \lambda h + o(h), \quad \text{and} \quad P(N(h) \ge 2) = o(h);$$

is called the Poisson process of parameter λ.

It can be shown that a Poisson process can be equivalently defined in terms of the Poisson distribution,
the family of discrete distributions parameterised by λ > 0, given by

$$p_\lambda(k) = \frac{e^{-\lambda}\lambda^k}{k!}, \qquad k = 0, 1, 2, \ldots.$$

Definition 18 (Poisson process 2). A family of random variables {N(t) : t ≥ 0}, N(t) ∈ N, such that N(0) = 0 and

Independent Increments for all integers n and real numbers t0 := 0 < t1 < t2 < · · · < tn the random variables N(ti) − N(ti−1), i = 1, . . . , n, are mutually independent;

Poisson Increments N(t + h) − N(t) is Poisson with parameter λh,

is called a Poisson process with parameter λ.

3.3 Exponential Distribution


Now let T be the time until the first arrival that is

T := inf{t > 0 : N (t) = 1}.

Clearly from the above we have that

P(N (h) = 0) = 1 − λh + o(h).

Let’s compute the probability that the first arrival is after time t, P(T > t).
Let M be a large integer, and let h = t/M. Notice that T > t is equivalent to N(t) = 0, which is also equivalent to
$$N(h) + \big(N(2h) - N(h)\big) + \cdots + \big(N(Mh) - N((M-1)h)\big) = 0,$$
which happens if and only if
$$N(h) = N(2h) - N(h) = \cdots = N(Mh) - N((M-1)h) = 0.$$

Since all of these random variables are independent, we have
$$P\big\{N(h) = N(2h) - N(h) = \cdots = N(Mh) - N((M-1)h) = 0\big\}$$
$$= P\{N(h) = 0\} \times P\{N(2h) - N(h) = 0\} \times \cdots \times P\{N(Mh) - N((M-1)h) = 0\},$$
and by stationary increments and since h = t/M,
$$= [1 - \lambda h + o(h)]^M = \Big[1 - \frac{\lambda t}{M} + o\Big(\frac{1}{M}\Big)\Big]^M \to \exp(-\lambda t),$$
as M → ∞, where the last limit follows easily after taking logarithms.
Thus finally the distribution function and pdf of the first arrival time T are given by
$$F(x) = 1 - e^{-\lambda x}, \qquad f(x) = \lambda e^{-\lambda x}, \qquad x > 0.$$

This is known as the exponential distribution, for obvious reasons.

Definition 19. The exponential distribution with parameter λ, denoted by Exp(λ), is the distribution
function
F (x) = 1 − e−λx , x > 0.
A random variable with the above distribution is called an exponential random variable with parameter
λ.

As we have seen, it is easy to find its moment generating function,
$$\psi(t) = E(e^{tX}) = \lambda \int_0^\infty e^{tx} e^{-\lambda x}\, dx = \frac{\lambda}{\lambda - t}, \qquad t < \lambda,$$
and all the moments; the general formula is E(X^n) = n!/λ^n.


Using exponential random variables, we can give an equivalent definition of Poisson processes.
Definition 20 (Poisson process 3). Let τ1, τ2, . . . be a sequence of i.i.d. exponential random variables with parameter λ > 0. Let S0 := 0, $S_n := \sum_{i=1}^n \tau_i$, and define
$$N(t) = \max\{n : S_n \le t\}.$$
Then N(t) is called a Poisson process with parameter λ.


Theorem 24. All three definitions are equivalent.

We will not prove the above result as it is beyond the scope of the course.
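As a sanity check, here is a small simulation sketch (assuming Python with NumPy; not part of the notes, and the names are illustrative): building N(t) from i.i.d. Exp(λ) inter-arrival times, as in the third definition, produces counts whose mean and variance are both close to λt, as the Poisson increments of the second definition predict.

import numpy as np

rng = np.random.default_rng(1)
lam, t, reps = 2.0, 5.0, 20_000

counts = np.empty(reps)
for r in range(reps):
    time, n = 0.0, 0
    while True:
        time += rng.exponential(1.0 / lam)  # tau_i ~ Exp(lam) has mean 1/lam
        if time > t:
            break
        n += 1
    counts[r] = n

print("mean of N(t):", counts.mean(), "  (lam * t =", lam * t, ")")
print("variance of N(t):", counts.var(), "  (lam * t =", lam * t, ")")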

3.4 The Gamma Distribution


The Poisson process has very rich behaviour, and it will allow us to define further exponential distribu-
tions of interest.
Now we want to look at Tr , the waiting time to the r-th arrival, that is

Tr := inf{t > 0 : N (t) = r}.

Notice that T1 ≡ T .
We also introduce the inter-arrival times

τ1 := T1 , τ2 := T2 − T1 , . . . , τr := Tr − Tr−1 .

From the third definition we know that the τi are i.i.d. exponential random variables with parameter
λ > 0. So the time to the r-th arrival Tr is the sum of r i.i.d. exponential random variables with parameter λ > 0. That means we know what the moment generating function must be,
$$\psi(t) = \Big(\frac{\lambda}{\lambda - t}\Big)^r, \qquad t < \lambda.$$

From this we can deduce all the moments of the distribution. The only problem is that we don’t know
the density itself, since deducing the density from the MGF involves inverting Laplace transforms which
is fairly tricky and certainly beyond our scope.
However we know that MGFs uniquely characterise probability distributions, and therefore their pdfs.
Thus I will give you the density, and then we will prove that it is indeed the correct one.
In fact we will define the gamma distribution through its density.
Definition 21 (Gamma distribution). The distribution with probability density function
$$f(x) = \frac{\lambda^\alpha}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\lambda x}, \qquad x > 0,$$
for some α, λ > 0, is called the Gamma distribution with parameters α and λ, and is denoted Γ(α, λ).
Remark 25. Here we allow α to be non-integer, in which case we can no longer interpret a Gamma
random variable as a sum of i.i.d. exponential variables.

Here λ is the number of arrivals per unit time and α is a real positive number which, when it is an integer,
is r, the number of arrivals that we are waiting for. Γ(α) is the gamma function:
$$\Gamma(\alpha) = \int_0^\infty x^{\alpha - 1} e^{-x}\, dx.$$

We can check that this is in fact a legitimate density by verifying that it is non-negative, which is obvious, and that it integrates to unity. To do the latter we make the simple substitution x = λy in the expression for the gamma function. This gives
$$\Gamma(\alpha) = \int_0^\infty x^{\alpha - 1} e^{-x}\, dx = \lambda^\alpha \int_0^\infty y^{\alpha - 1} e^{-\lambda y}\, dy,$$
which shows that the gamma density does indeed integrate to 1.

We now check that we have the appropriate density by computing its moment generating function:
$$\psi(t) = \int_0^\infty e^{tx}\,\frac{\lambda^\alpha}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\lambda x}\, dx = \frac{\lambda^\alpha}{\Gamma(\alpha)} \int_0^\infty x^{\alpha - 1} e^{-(\lambda - t)x}\, dx.$$

Making the substitution u = (λ − t)x, we obtain:
$$\psi(t) = \frac{\lambda^\alpha}{(\lambda - t)^\alpha} \cdot \frac{1}{\Gamma(\alpha)} \int_0^\infty u^{\alpha - 1} e^{-u}\, du.$$

But the integral is just Γ(α), and so
$$\psi(t) = \Big(\frac{1}{1 - t/\lambda}\Big)^\alpha, \qquad t < \lambda.$$
The easiest way to find the moments of the gamma density is to expand its moment generating function in series:
$$\psi(t) = (1 - t/\lambda)^{-\alpha} = 1 + (-\alpha)\Big(\frac{-t}{\lambda}\Big) + \frac{(-\alpha)(-\alpha - 1)}{2!}\Big(\frac{-t}{\lambda}\Big)^2 + \ldots = 1 + \frac{\alpha}{\lambda} t + \frac{\alpha(\alpha + 1)}{2!}\Big(\frac{t}{\lambda}\Big)^2 + \ldots.$$
Proposition 26. Let X ∼ Γ(α, λ). Then
$$E(X) = \frac{\alpha}{\lambda}, \qquad E(X^2) = \frac{\alpha(\alpha + 1)}{\lambda^2}, \qquad \text{and} \quad \operatorname{var}(X) = \frac{\alpha}{\lambda^2}.$$
These are as we would expect. The mean waiting time to the n-th event goes up as n and the standard deviation as √n. The gamma function has a number of useful properties. The most interesting one can be seen if we integrate by parts:
$$\Gamma(\alpha) = \int_0^\infty x^{\alpha - 1} e^{-x}\, dx = \Big[\frac{x^\alpha e^{-x}}{\alpha}\Big]_0^\infty + \frac{1}{\alpha}\int_0^\infty x^\alpha e^{-x}\, dx,$$
$$\Gamma(\alpha + 1) = \alpha\,\Gamma(\alpha).$$
Since $\Gamma(1) = \int_0^\infty e^{-x}\, dx = 1$, we have
$$\Gamma(1) = 1, \quad \Gamma(2) = 1, \quad \Gamma(3) = 2, \quad \Gamma(4) = 3!,$$
and in general, Γ(n) = (n − 1)!. So it's a sort of generalised factorial.
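Here is a short simulation sketch of the waiting-time interpretation above (assuming Python with NumPy; not part of the notes): the sum of r i.i.d. Exp(λ) inter-arrival times has sample mean and variance close to r/λ and r/λ², the Γ(r, λ) moments of Proposition 26.

import numpy as np

rng = np.random.default_rng(2)
r, lam, reps = 5, 1.5, 200_000
T_r = rng.exponential(1.0 / lam, size=(reps, r)).sum(axis=1)  # waiting time to the r-th arrival

print("sample mean:", T_r.mean(), "  theory r/lam =", r / lam)
print("sample variance:", T_r.var(), "  theory r/lam^2 =", r / lam**2)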

3.5 Chi-squared
Let Z ∼ N(0, 1) be a standard normal variable. Then the moment generating function for Z² is
$$\psi_{Z^2}(t) = E(e^{tZ^2}) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^\infty e^{tz^2} e^{-z^2/2}\, dz = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^\infty e^{-(1 - 2t)z^2/2}\, dz.$$

With the change of variable u = (1 − 2t)^{1/2} z, we obtain
$$\psi_{Z^2}(t) = \frac{1}{\sqrt{1 - 2t}} \cdot \frac{1}{\sqrt{2\pi}}\int_{-\infty}^\infty e^{-u^2/2}\, du = \frac{1}{\sqrt{1 - 2t}}.$$
Since we have already found that the moment generating function for the gamma distribution is
$$\psi(t) = \Big(\frac{1}{1 - t/\lambda}\Big)^\alpha,$$
we see that Z² ∼ Γ(1/2, 1/2).
Now let
$$X^2 = Z_1^2 + Z_2^2 + Z_3^2 + \ldots + Z_k^2$$
be the sum of the squares of k independent, standard normal variables. Then the moment generating function is the k-th power of the one we have just found:
$$\psi_{X^2}(t) = \big[(1 - 2t)^{-1/2}\big]^k = (1 - 2t)^{-k/2},$$
and it follows from this that X² ∼ Γ(k/2, 1/2), and thus its pdf is given by
$$f_{X^2}(x) = \frac{x^{k/2 - 1} e^{-x/2}}{2^{k/2}\Gamma(k/2)}, \qquad x > 0.$$

Definition 22. The distribution of X 2 above is called the χ2 distribution (chi-squared) with k degrees
of freedom.

Of course from the above definition, if we have two independent samples Z1, . . . , Zn ∼ N(0, 1) and Y1, . . . , Ym ∼ N(0, 1) of i.i.d. random variables and define
$$X_1^2 = Z_1^2 + \cdots + Z_n^2, \qquad X_2^2 = Y_1^2 + \cdots + Y_m^2, \qquad X^2 = X_1^2 + X_2^2,$$
then of course X² has the χ² distribution with n + m degrees of freedom.

Theorem 27 (Sum of independent χ²-random variables). If X₁² ∼ χ²(n) is independent of X₂² ∼ χ²(m), then
$$X^2 := X_1^2 + X_2^2 \sim \chi^2(m + n).$$

The mean and variance of the gamma distribution are α/λ and α/λ2 , so those for the χ2 distribution
are k and 2k, respectively.
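A quick simulation sketch of this construction (assuming Python with NumPy; not part of the notes): the sum of squares of k independent standard normals has sample mean close to k and sample variance close to 2k.

import numpy as np

rng = np.random.default_rng(3)
k, reps = 7, 200_000
X2 = (rng.standard_normal((reps, k)) ** 2).sum(axis=1)  # chi-squared with k degrees of freedom

print("sample mean:", X2.mean(), "  theory:", k)
print("sample variance:", X2.var(), "  theory:", 2 * k)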

3.6 The F -distributions
Let U and V be independent positive random variables (so that P(V = 0) = 0), and let W = U/V. Then the cumulative distribution of W is
$$F_W(w) = P(W \le w) = P(U \le wV) = \int_0^\infty \Big(\int_0^{vw} f_U(u)\, du\Big) f_V(v)\, dv.$$

We can now find the density of W by differentiation:

$$f_W(w) = \frac{d}{dw}\big[P(W \le w)\big] = \frac{d}{dw}\int_0^\infty \Big(\int_0^{vw} f_U(u)\, du\Big) f_V(v)\, dv = \int_0^\infty v f_U(vw) f_V(v)\, dv.$$

Suppose that U and V are independent χ2 variables with m and n degrees of freedom respectively.
Then
$$f_W(w) = \frac{1}{C}\int_0^\infty v \cdot v^{n/2 - 1}(vw)^{m/2 - 1} e^{-(v + vw)/2}\, dv,$$
where C = 2^{(m+n)/2} Γ(m/2)Γ(n/2). We factor out the powers of w:
$$f_W(w) = \frac{w^{m/2 - 1}}{C}\int_0^\infty v^{(m+n)/2 - 1} e^{-(w + 1)v/2}\, dv.$$

To get this into a form that looks like a gamma function we make the substitution y = (w + 1)v:
$$f_W(w) = \frac{w^{m/2 - 1}}{C(w + 1)^{(m+n)/2}}\int_0^\infty y^{(m+n)/2 - 1} e^{-y/2}\, dy.$$

The integral is not a function of w, so we have the density of W = U/V apart from the constant. To find that we compare the integral with the gamma function in the form
$$\Gamma(\alpha) = \lambda^\alpha \int_0^\infty y^{\alpha - 1} e^{-\lambda y}\, dy.$$
With λ = 1/2 and α = (m + n)/2 this is just
$$\Gamma(m/2 + n/2) = 2^{-(m+n)/2}\int_0^\infty y^{(m+n)/2 - 1} e^{-y/2}\, dy,$$

and so the constant is
$$\frac{2^{(m+n)/2}\,\Gamma(m/2 + n/2)}{2^{(m+n)/2}\,\Gamma(m/2)\Gamma(n/2)} = \frac{\Gamma(m/2 + n/2)}{\Gamma(m/2)\Gamma(n/2)} = \frac{1}{B(m/2, n/2)}.$$

Definition 23. Let U ∼ χ²(m), V ∼ χ²(n) be independent. Then we say that
$$F = \frac{U/m}{V/n}$$
has the F-distribution with (m, n) degrees of freedom.

Then the density of F is (not forgetting to scale for dF as well)
$$f_F(x) := \frac{1}{B(m/2, n/2)} \cdot \frac{(m/n)^{m/2}\, x^{m/2 - 1}}{(mx + n)^{(m+n)/2}\, n^{-(m+n)/2}} = \frac{m^{m/2} n^{n/2}}{B(m/2, n/2)} \cdot \frac{x^{m/2 - 1}}{(mx + n)^{(m+n)/2}}.$$

3.7 Student’s t-distribution
Consider the following typical problem. The lengths of widgets produced by an automatic machine are
normally distributed with unknown mean µ and unknown standard deviation σ, that is Xi ∼ N (µ, σ 2 ).
The lengths of n = 25 widgets are measured, and the mean X̄ = (1/25) Σ_{i=1}^{25} Xi is found to be 10.05 cm. Is there reason to suspect that the mean is significantly different from 10? That is, we want to test the hypothesis
$$H_0: \mu = 10, \quad \text{vs} \quad H_1: \mu \ne 10.$$

If we knew the standard deviation, say σ = 0.1, then we know that
$$z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1).$$
Under the null hypothesis µ = 10, so we can plug this in and perform our hypothesis test. In this case we compute z = (10.05 − 10.0)/(0.1/5) = 2.5. From the tables we know that P(Z ≤ 2.5) = 0.9938, so P(Z ≥ 2.5) = 0.0062 and therefore P(|Z| ≥ 2.5) ≈ 0.012. This means that if µ = 10, the probability of getting such a big deviation by chance is around 1.2%. We say that the deviation is significant and we reject the null hypothesis at the 5% level.
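The calculation above is easy to reproduce; here is a sketch in Python (standard library only, not part of the notes; the function name and arguments are just illustrative), using the values from the widget example.

import math

def z_test_two_sided(xbar, mu0, sigma, n):
    """Return the z statistic and the two-sided p-value for H0: mu = mu0."""
    z = (xbar - mu0) / (sigma / math.sqrt(n))
    Phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))  # standard normal cdf
    return z, 2.0 * (1.0 - Phi(abs(z)))

z, p = z_test_two_sided(xbar=10.05, mu0=10.0, sigma=0.1, n=25)
print("z =", z, " two-sided p-value =", p)   # z = 2.5, p roughly 0.012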
However what happens if we don’t know σ? The obvious thing is to estimate it and we know from
last year how to do it
$$s^2 = \frac{1}{n-1}\sum (x_i - \bar{x})^2.$$
Remark 28. Remind yourself that we divide by n − 1 not n so as to get the right expected value.
Let’s have a closer look at the above formula. We begin with
\begin{align*}
\sum (X_i - \mu)^2 &= \sum \big[(X_i - \bar{X}) + (\bar{X} - \mu)\big]^2 \\
&= \sum (X_i - \bar{X})^2 + 2\sum (X_i - \bar{X})(\bar{X} - \mu) + \sum (\bar{X} - \mu)^2 \\
&= \sum (X_i - \bar{X})^2 + n(\bar{X} - \mu)^2.
\end{align*}

We now take the expected values of each of these.

$$E\Big[\sum (X_i - \mu)^2\Big] = \sum E(X_i - \mu)^2 = \sum \operatorname{var}(X) = n\operatorname{var}(X),$$
$$E\big[n(\bar{X} - \mu)^2\big] = n\operatorname{var}(\bar{X}) = \operatorname{var}(X),$$
from which we conclude that
$$E\Big[\sum (X_i - \bar{X})^2\Big] = (n - 1)\operatorname{var}(X),$$

which is why we use n − 1 in the denominator, not n.

We now have an estimate of the standard deviation, s. Let us briefly recall how the z-test worked.
There we knew σ and we based our calculations on the fact that if Xi ∼ N (µ, σ 2 ), i = 1, . . . , n are
i.i.d. then
$$\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1).$$
To discuss what happens when we replace σ by its estimator, let
$$S^2 := \frac{1}{n-1}\sum (X_i - \bar{X})^2.$$
Remark 29. We use capital letters for random variables, and lower case for their observed values.

Therefore one might naively think that we can replace σ by S without changing the distribution but in
fact it turns out that
$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}},$$
is NOT normally distributed. Intuitively, the fact that the denominator is random inflates the variability
of the whole fraction which ceases to be normally distributed.
Remark 30. We will see later that when the sample size is large, then in fact T is approximately normal.
Theorem 31. Let X1 , X2 , . . . , Xn ∼ N (µ, σ 2 ) be independent. Then X̄ and S 2 are independent and

$$\bar{X} \sim N\Big(\mu, \frac{\sigma^2}{n}\Big), \qquad \frac{(n-1)S^2}{\sigma^2} \sim \chi^2(n-1).$$
Proof. First we prove that they are independent. To do this we perform a change of variables. Notice
that we haven’t dealt with proper multivariate distributions before, just with bivariate but the concepts
are exactly the same.
We perform a change of variables from X1, . . . , Xn to Y1 := X̄, and Yi := Xi − X̄ for i ≥ 2. Inverting the transformation we find X1 = Y1 − Σ_{i=2}^n Yi and Xi = Yi + Y1 for i ≥ 2. In fact it is easier to think of this as Y = AX where
$$A = \begin{pmatrix}
\frac{1}{n} & \frac{1}{n} & \cdots & \frac{1}{n} \\
-\frac{1}{n} & 1 - \frac{1}{n} & \cdots & -\frac{1}{n} \\
\vdots & \vdots & \ddots & \vdots \\
-\frac{1}{n} & -\frac{1}{n} & \cdots & 1 - \frac{1}{n}
\end{pmatrix}.$$
Since this is a linear transformation its partial derivatives are of course constant and therefore the Jaco-
bian |J| will not depend on the variables.
Thus we can compute the joint density of the variables Y1, . . . , Yn to be
$$f_Y(y_1, \ldots, y_n) = |J|\, f_X\Big(y_1 - \sum_{i=2}^n y_i,\; y_1 + y_2, \ldots, y_1 + y_n\Big),$$
where
$$f_X(x_1, x_2, \ldots, x_n) = C\exp\Big\{-\frac{1}{2\sigma^2}\sum (x_i - \mu)^2\Big\}.$$
Notice that
\begin{align*}
\sum_{i=1}^n \Big(\frac{x_i - \mu}{\sigma}\Big)^2 &= \frac{1}{\sigma^2}\Big[\sum_{i=1}^n (x_i - \bar{x})^2 + n(\bar{x} - \mu)^2\Big] \\
&= \frac{1}{\sigma^2}\Big[(x_1 - \bar{x})^2 + \sum_{i=2}^n (x_i - \bar{x})^2 + n(\bar{x} - \mu)^2\Big] \\
&= \frac{1}{\sigma^2}\Big[\Big(\sum_{i=2}^n (x_i - \bar{x})\Big)^2 + \sum_{i=2}^n (x_i - \bar{x})^2 + n(\bar{x} - \mu)^2\Big] \\
&= \frac{1}{\sigma^2}\Big[\Big(\sum_{i=2}^n y_i\Big)^2 + \sum_{i=2}^n y_i^2 + n(y_1 - \mu)^2\Big].
\end{align*}

Therefore the pdf of Y1, . . . , Yn simplifies to
\begin{align*}
f_Y(y_1, \ldots, y_n) &= C\exp\Big(-\frac{1}{2\sigma^2}\Big[\Big(\sum_{i=2}^n y_i\Big)^2 + \sum_{i=2}^n y_i^2 + n(y_1 - \mu)^2\Big]\Big) \\
&= C\exp\Big(-\frac{1}{2\sigma^2}\Big[\Big(\sum_{i=2}^n y_i\Big)^2 + \sum_{i=2}^n y_i^2\Big]\Big)\exp\Big(-\frac{1}{2\sigma^2} n(y_1 - \mu)^2\Big).
\end{align*}
Since the pdf factorizes into a product, one factor of which involves just y2, . . . , yn while the other involves just y1, we conclude that Y1 is independent of Y2, . . . , Yn, or in other words that X̄ is independent of Xi − X̄ for i = 2, . . . , n. Since X1 − X̄ = −Σ_{i=2}^n (Xi − X̄), we conclude that X̄ is independent of Xi − X̄ for all i. In particular, since S² is a function of the Xi − X̄, we conclude that X̄ and S² are independent.
To prove the second statement we calculate
\begin{align*}
(n-1)S^2 &= \sum_{i=1}^n (X_i - \bar{X})^2 = \sum_{i=1}^n (X_i - \mu + \mu - \bar{X})^2 \\
&= \sum_{i=1}^n (X_i - \mu)^2 + n(\mu - \bar{X})^2 + 2(\mu - \bar{X})\sum_{i=1}^n (X_i - \mu) \\
&= \sum_{i=1}^n (X_i - \mu)^2 + n(\mu - \bar{X})^2 + 2n(\mu - \bar{X})(\bar{X} - \mu) \\
&= \sum_{i=1}^n (X_i - \mu)^2 - n(\bar{X} - \mu)^2,
\end{align*}
so that
$$\frac{(n-1)S^2}{\sigma^2} = \sum_{i=1}^n \frac{(X_i - \mu)^2}{\sigma^2} - n\frac{(\bar{X} - \mu)^2}{\sigma^2}.$$

We know from the previous section that the (Xi − µ)/σ ∼ N(0, 1) are i.i.d. and that √n(X̄ − µ)/σ ∼ N(0, 1), and thus
$$A := \sum_{i=1}^n \frac{(X_i - \mu)^2}{\sigma^2} \sim \chi^2(n), \qquad B := n\frac{(\bar{X} - \mu)^2}{\sigma^2} \sim \chi^2(1).$$
Equivalently, from the above we have that
$$\frac{(n-1)S^2}{\sigma^2} + n\frac{(\bar{X} - \mu)^2}{\sigma^2} = \sum_{i=1}^n \frac{(X_i - \mu)^2}{\sigma^2}.$$

Calculating MGFs of both sides and using the fact that S 2 and X̄ are independent, and that the right
hand side has the χ2 (n) distribution we obtain the equation
$$E\Big[\exp\Big(t\,\frac{(n-1)S^2}{\sigma^2}\Big)\Big]\,\frac{1}{(1 - 2t)^{1/2}} = \frac{1}{(1 - 2t)^{n/2}},$$
$$E\Big[\exp\Big(t\,\frac{(n-1)S^2}{\sigma^2}\Big)\Big] = \frac{1}{(1 - 2t)^{(n-1)/2}},$$
and thus
$$\frac{(n-1)S^2}{\sigma^2} \sim \chi^2(n-1).$$

Definition 24 (Student’s t-distribution). Suppose that Z ∼ N(0, 1) and V ∼ χ²(n) are independent. Then we say that
$$\frac{Z}{\sqrt{V/n}} \sim t_n,$$
where t_n is called the t-distribution with n degrees of freedom.

Notice then that
$$T := \frac{\bar{X} - \mu}{S/\sqrt{n}} = \frac{\sqrt{n}(\bar{X} - \mu)/\sigma}{\sqrt{\big[(n-1)S^2/\sigma^2\big]/(n-1)}} = \frac{Z}{\sqrt{W/(n-1)}},$$
where Z ∼ N(0, 1) and W ∼ χ²(n − 1) are independent.

Theorem 32. If X1, . . . , Xn are i.i.d. N(µ, σ²) then
$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n-1}.$$

Notice straight away that because the denominator is positive, and the numerator is symmetric, as it is
normally distributed, the distribution of T must also be symmetric.
To find the density we notice that if we look at the square, it is the ratio of two independent χ² random variables with 1 and n degrees of freedom respectively (each divided by its degrees of freedom). Therefore the square has the F-distribution with (1, n) degrees of freedom. Let V be such a random variable. Let t = g(v) = √v. We know the density of V and we want to find the density of T = g(V). However we cannot apply the change of variable formula directly because although T takes values on R, g(v) = √v takes values only on (0, ∞) and is in fact bijective as a map between (0, ∞) and (0, ∞). This means that the change of variables formula would actually give you the density of |T| rather than that of T. However by symmetry it is not too difficult to deduce that if t > 0 then P(|T| > t) = 2 P(T > t), and therefore that the density of |T| is twice the density of T. Therefore we can use the change of variable formula to find the density of |T| and then divide by 2 to get that of T for all t > 0, and then by symmetry we can deduce the density on all of R. We proceed as explained to find

\begin{align*}
f_{|T|}(t) &= f_V\big(g^{-1}(t)\big)\,\Big|\frac{d}{dt} g^{-1}(t)\Big| \\
&= \frac{n^{n/2}}{B(1/2, n/2)}\,\frac{(t^2)^{-1/2}}{(t^2 + n)^{(1+n)/2}}\, 2t \\
&= 2\,\frac{n^{n/2}\,\Gamma((n+1)/2)}{\Gamma(1/2)\Gamma(n/2)}\,(t^2 + n)^{-(n+1)/2} \\
&= 2\,\frac{\Gamma((n+1)/2)}{\sqrt{n\pi}\,\Gamma(n/2)}\Big(1 + \frac{t^2}{n}\Big)^{-(n+1)/2},
\end{align*}
where we have used the fact that Γ(1/2) = √π.
Therefore for t > 0 we know from above that
$$f_T(t) = \frac{\Gamma((n+1)/2)}{\sqrt{n\pi}\,\Gamma(n/2)}\Big(1 + \frac{t^2}{n}\Big)^{-(n+1)/2},$$
and by symmetry that for t < 0, f_T(t) = f_T(−t), and thus we deduce that for all t ∈ R
$$f_T(t) = \frac{\Gamma((n+1)/2)}{\sqrt{n\pi}\,\Gamma(n/2)}\Big(1 + \frac{t^2}{n}\Big)^{-(n+1)/2}.$$

It is interesting to note that as n → ∞,
$$\frac{\Gamma((n+1)/2)}{\Gamma(n/2)\sqrt{n}} \to \frac{1}{\sqrt{2}},$$
and, taking logarithms and using the Taylor expansion log(1 + x) = x + o(x) for x close to 0, we get
$$\log\Big[\Big(1 + \frac{t^2}{n}\Big)^{-(n+1)/2}\Big] = -\frac{n+1}{2}\log\Big(1 + \frac{t^2}{n}\Big) = -\frac{n+1}{2}\Big(\frac{t^2}{n} + o(1/n)\Big) \to -\frac{t^2}{2},$$
which, up to the normalising constant, is the logarithm of the density of a standard normal random variable.
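The limit can also be checked numerically; the following sketch (Python standard library only, not part of the notes) evaluates the t_n density at a fixed point using the formula above, via log-gamma to avoid overflow, and compares it with the standard normal density.

import math

def t_density(t, n):
    # f_T(t) = Gamma((n+1)/2) / (sqrt(n*pi) * Gamma(n/2)) * (1 + t^2/n)^(-(n+1)/2)
    log_c = math.lgamma((n + 1) / 2) - math.lgamma(n / 2) - 0.5 * math.log(n * math.pi)
    return math.exp(log_c - (n + 1) / 2 * math.log1p(t * t / n))

def normal_density(t):
    return math.exp(-t * t / 2) / math.sqrt(2 * math.pi)

for n in (2, 10, 100, 1000):
    print(f"n = {n:5d}   f_T(1.5) = {t_density(1.5, n):.5f}   phi(1.5) = {normal_density(1.5):.5f}")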

3.7.1 Testing variances
We have seen that if Z1, Z2, . . . , Zn is a sample of size n from a standard normal distribution, then Σ Zi² has the χ² distribution with n degrees of freedom. We have also seen that if Y1, Y2, . . . , Yn is a sample of size n drawn from a normal distribution with mean µ and variance σ², then
$$\frac{1}{\sigma^2}\sum_{i=1}^n (Y_i - \bar{Y})^2$$
has the χ² distribution on n − 1 degrees of freedom, and therefore, letting
$$s^2 = \frac{1}{n-1}\sum_{i=1}^n (Y_i - \bar{Y})^2,$$
we can also say that
$$\frac{(n-1)s^2}{\sigma^2} \sim \chi^2(n-1).$$
This means that we can use the χ2 tables to test the variance against some value on the null hypothesis.
This isn’t actually done very often, because it seldom tells us very much. What is more common, is to
compare two sample variances.
If we have two random samples from the same population with variance σ², then
$$\frac{(n_1 - 1)s_1^2}{\sigma^2} \quad \text{and} \quad \frac{(n_2 - 1)s_2^2}{\sigma^2}$$
each have χ² densities on (n₁ − 1) and (n₂ − 1) df, respectively. Consequently
$$F = \frac{s_1^2}{s_2^2}$$
has the F-distribution on (n₁ − 1, n₂ − 1) df.
This test has to be used with caution, because it has been found that it is not very robust if the under-
lying densities are not normal.
Example 14. The blood pressures of rats in two samples were measured and found to be

Regular Feeding: 108, 133, 134, 145, 152, 155, 169


Intermittent: 115, 162, 162, 168, 170, 181, 199, 207

Is the variance of the second group significantly greater than that of the first? The sample variances work out to be S₂² = 783.7 and S₁² = 384.6. The ratio is S₂²/S₁² = 2.04. We compare 2.04 with the (7, 6) entry in the 5% table, which is 4.207. So the difference is not significant. Note this is a one-tail test. For a two-tail test, you put whichever variance is larger in the numerator (to make F greater than unity) and then use the 2.5% value, which for (7, 6) is 5.7. That means we'd need a larger discrepancy to get significance, as you'd expect.
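The arithmetic in Example 14 is easy to reproduce; here is a sketch (assuming Python with NumPy; not part of the notes, and the critical value 4.207 is simply the 5% F-table entry quoted in the text).

import numpy as np

regular      = np.array([108, 133, 134, 145, 152, 155, 169], dtype=float)
intermittent = np.array([115, 162, 162, 168, 170, 181, 199, 207], dtype=float)

s1_sq = regular.var(ddof=1)        # sample variances, with n - 1 in the denominator
s2_sq = intermittent.var(ddof=1)
F = s2_sq / s1_sq                  # larger variance in the numerator

print("S1^2 =", round(s1_sq, 1), "  S2^2 =", round(s2_sq, 1), "  F =", round(F, 2))
print("significant at 5% (one tail, (7, 6) df)?", F > 4.207)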

3.8 The Beta Density


The beta distribution is important for theoretical reasons but it doesn’t have a natural interpretation. So
we will simply define it by
$$f(x) = \frac{x^{r-1}(1-x)^{s-1}}{B(r, s)}, \qquad 0 < x < 1,$$

where r > 0, s > 0 and B(r, s) is the beta function
$$B(r, s) = \int_0^1 x^{r-1}(1-x)^{s-1}\, dx.$$
With a little bit of fiddling around, it can be shown that
$$B(r, s) = \frac{\Gamma(r)\Gamma(s)}{\Gamma(r+s)}.$$
The easiest way to find the moments of the beta distribution is directly:
$$E(X) = \frac{1}{B(r, s)}\int_0^1 x\cdot x^{r-1}(1-x)^{s-1}\, dx = \frac{B(r+1, s)}{B(r, s)} = \frac{\Gamma(r+1)\Gamma(s)}{\Gamma(r+s+1)\Gamma(r)}\cdot\frac{\Gamma(r+s)}{\Gamma(s)} = \frac{r}{r+s}.$$
Similarly
$$E(X^2) = \frac{1}{B(r, s)}\int_0^1 x^2\cdot x^{r-1}(1-x)^{s-1}\, dx = \frac{B(r+2, s)}{B(r, s)} = \frac{\Gamma(r+2)\Gamma(s)}{\Gamma(r+s+2)\Gamma(r)}\cdot\frac{\Gamma(r+s)}{\Gamma(s)} = \frac{(r+1)r}{(r+s+1)(r+s)},$$
and so
$$\operatorname{var}(X) = \frac{rs}{(r+s)^2(r+s+1)}.$$
It's not immediately obvious that the beta function is related to the gamma function, but it is. To see this we make the substitution x = cos²θ to obtain (bearing in mind that dx = −2 cos θ sin θ dθ and the minus sign switches the limits)
$$B(r, s) = 2\int_0^{\pi/2} \cos^{2r-2}\theta\,\sin^{2s-2}\theta\cdot\cos\theta\,\sin\theta\, d\theta = 2\int_0^{\pi/2} \cos^{2r-1}\theta\,\sin^{2s-1}\theta\, d\theta.$$

Now we have seen the gamma function can be written in one of three forms:
$$\Gamma(\alpha) = \int_0^\infty x^{\alpha-1} e^{-x}\, dx = \lambda^\alpha \int_0^\infty y^{\alpha-1} e^{-\lambda y}\, dy = \frac{1}{2^{\alpha-1}}\int_0^\infty z^{2\alpha-1} e^{-z^2/2}\, dz.$$
We now use the third of these and we write
\begin{align*}
\Gamma(s)\Gamma(t) &= \frac{1}{2^{s-1}}\int_0^\infty x^{2s-1} e^{-x^2/2}\, dx \cdot \frac{1}{2^{t-1}}\int_0^\infty y^{2t-1} e^{-y^2/2}\, dy \\
&= \frac{1}{2^{s+t-2}}\int_0^\infty\int_0^\infty x^{2s-1} y^{2t-1} e^{-(x^2+y^2)/2}\, dy\, dx \\
&= \frac{1}{2^{s+t-2}}\int_0^\infty\int_0^{\pi/2} r^{2s+2t-2}\cos^{2s-1}\theta\,\sin^{2t-1}\theta\, e^{-r^2/2}\, r\, d\theta\, dr \\
&= \frac{1}{2^{(s+t)-1}}\int_0^\infty r^{2(s+t)-1} e^{-r^2/2}\, dr \cdot 2\int_0^{\pi/2}\cos^{2s-1}\theta\,\sin^{2t-1}\theta\, d\theta \\
&= \Gamma(s+t)\, B(s, t).
\end{align*}

Chapter 4

Estimation

We are used to the idea that if we want to estimate the mean of a population, we take a sample and find
its average. It is the obvious thing to do, and it seems to work fine. If the sample is large we expect to
get a good result (if the distribution is normal we can even work out confidence limits) and if we use the
whole population as the sample we automatically get the right answer.
It’s not quite so simple for the variance, because instead of the variance of the sample we use the
sum of squares divided by n − 1. This should at least remind us that it’s not always obvious what
we are intended to do. Besides, it’s easy for distributions like the binomial, Poisson, Normal and even
exponential where the parameters are obviously related to readily measured statistics, but what happens
when this isn’t so? For the gamma distribution, for instance, we have E(X) = α/λ and var(X) = α/λ2 ,
and while we can solve these for α and λ, how good will the resulting estimates be?
Even in simple cases, there can be doubt. For instance, suppose we want to estimate the mean of a
uniform distribution. We could add up all the measurements and divide by the number in the sample,
but it would seem a lot easier to take half the biggest value. Is this all right?
What we are looking for is always the best way of estimating parameters. And as usual, the first thing
to do is to decide what we mean by ‘best’.
To help keep things straight, we introduce some notation. We will let θ stand for the parameter or
parameters of a distribution; thus it may be a number or a vector (θ1 , θ2 , . . . θn ). We let Ω denote the
parameter space, the set of all possible values of θ. In the case of the normal distribution, for example,
θ stands for the (µ, σ 2 ) and Ω is the set of all pairs (µ, σ 2 ) such that −∞ < µ < ∞ and σ 2 > 0.
Sometimes we can restrict the parameter space further if we have prior knowledge; for instance if the
distribution is of marks then presumably we can assume µ > 0 and there is no point in designing an
estimation procedure that is good at estimating negative marks. Being able to handle negative scores
will not enter into the criterion for ‘best’.
There is still a ”dispute” about the role of prior information in statistics. Some statisticians hold that
we should always begin by using whatever information and intuition we have to estimate where θ is
likely to be. Just as in the (usually noncontroversial) assumption that marks aren’t negative, this can
affect the test by leading us to choose one that works best in the range we think is important even at
the expense of doing less well elsewhere. This is the school of Bayesian statistics. I won’t say any
more about this because at this stage I want to stick to the conventional ideas, but you may run into this
concept at some point. Even the fact that there can be a dispute shows what happens when you work in
a subject whose very essence is dealing with uncertainty.

First we need to define the subject of our study. To estimate the expectation of a population, we will
take an i.i.d. sample X1 , . . . , Xn and compute its average
$$\bar{X} = \frac{1}{n}\sum X_i.$$

To estimate the variance we will compute
$$S^2 = \frac{1}{n-1}\sum (X_i - \bar{X})^2.$$
In either case we have what we call a statistic.

Definition 25. Given a sample of observable random variables X1 , . . . , Xn , a statistic Y is a known


function of the sample
Y = f (X1 , . . . , Xn ).
When the statistic is used to estimate the value of a parameter(vector) then it is also called a point
estimate, or estimator.

Remark 33. In some cases, if we know some of the parameters of the population, the function may
incorporate them. For example, when the population variance σ 2 is known and the mean µ0 is known
under the null, the z-statistic is

$$Z = f(X_1, \ldots, X_n) = \sqrt{n}(\bar{X} - \mu_0)/\sigma.$$

The important thing is that we must be able to compute the statistic given the sample.

There will always be many statistics for estimating a parameter, and we will have to decide which one
to use. Naturally we want to use the ‘best’ one, or one that is close to the ‘best’ one, so first we have to
know what we mean by ‘best’. Let’s consider some properties that we would like a statistic to have. As
we shall see, we can’t always get all of them at the same time.
First of all our statistic is random, and is meant to estimate a fixed deterministic value. Since it is
random there will always be some error in the estimation. We want this error to be zero on average,
otherwise we are doing something wrong.

Definition 26 (Bias). The bias of an estimator θ̂ of a parameter θ is defined as

Bias(θ̂) = E[θ̂ − θ].

If Bias(θ̂) = 0 then we say that the estimator is unbiased.

Unbiasedness of a statistic tells us that it will give us the correct value on average.
We know that for a sample of size n, E(X̄) = µ so the sample mean is an unbiased estimate of the
population mean. When it comes to estimating the variance, however, we find that

X X
(Xi − µ)2 = [(Xi − X̄) + (X̄ − µ)]2
X X X
= (Xi − X̄)2 + 2 (Xi − X̄)(X̄ − µ) + (X̄ − µ)2 .

Now (X̄ − µ) is a constant for the summation, and since also (Xi − X̄) = 0 we have
P

X X
(Xi − µ)2 = (Xi − X̄)2 + n(X̄ − µ)2 .

We now take the expected values of each of these.


X X X
E[ (Xi − µ)2 ] = E(Xi − µ)2 = var(X) = n var(X),

E[n(X̄ − µ)2 ] = n var(X̄) = var(X),

43
from which we conclude that
X
E[ (Xi − X̄)2 ] = (n − 1) var(X)

which is why we define the sample variance as

1 X
s2 = (Xi − X̄)2
n−1
with n − 1 rather than n in the denominator. The point is that s2 is an unbiased estimate of σ 2 .
Given two unbiased estimators, naturally one would select the one with the smaller variability. Since
variance is our usual measure of variability we prefer the estimator with the smaller variance. If on
the other hand the estimators are not necessarily unbiased then it makes no sense to compare their
variances, which is the average squared distance from their respective means, but rather measure their
expected squared deviation from the true parameter value. This motivates the following definition.

Definition 27 (Mean square error). The mean square error of an estimator θ̂ is

MSE(θ̂) = E[(θ̂ − θ)2 ].

The mean square error of an estimator takes into account both the bias and the variance since

MSE(θ̂) = E[(θ̂ − θ)2 ]


= E[(θ̂ − E θ̂ + E θ̂ − θ)2 ]
= E[(θ̂ − E θ̂)2 ] + E[(E θ̂ − θ)2 ] + 2 E[(θ̂ − E θ̂) × (E θ̂ − θ)]
= var(θ̂) + Bias(θ̂)2 + 0.

In some cases an estimator may improve as the sample size increases. To capture this we introduce the
following asymptotic measure of the quality of estimators, or more precisely the quality of sequences of
estimators.

Definition 28 (Consistency). A sequence of statistics (Y_n, n ≥ 0) is said to be a consistent estimate of a parameter θ if for every ε > 0,
$$\lim_{n\to\infty} P(|Y_n - \theta| \le \varepsilon) = 1.$$

Remark 34. We often denote point estimators for a parameter θ by θ̂.

This is obviously a desirable property. It says that if you take a large enough sample, then most of the
probability concentrates about the true parameter value.
Remark 35. While (1/n) Σ (xᵢ − x̄)² is not an unbiased estimate of the population variance, it is consistent.

Example 15. Suppose that you want to estimate the unknown mean of a normal population with variance
σ 2 = 1. You take a sample Xi ∼ N (µ, 1) and you compute the sample mean as an estimator of µ,
µ̂ = X̄. Then, since X̄ ∼ N(µ, 1/n), for any ε > 0
$$P(|\hat{\mu} - \mu| > \varepsilon) = P(\sqrt{n}|\bar{X} - \mu| > \sqrt{n}\varepsilon) = P(|Z| \ge \sqrt{n}\varepsilon) \to 0,$$
as n → ∞, where Z ∼ N(0, 1).

Example 16. There are surprisingly simple examples where the above calculation is too complicated to
be done exactly.
Suppose you are given a possibly unfair coin with unknown probability of ’heads’ p. You want to
estimate p and one natural choice is to toss the coin n times, and record the number of heads
$$N = \sum_{i=1}^n X_i,$$
where Xᵢ = 1 if the i-th toss was 'heads' and 0 otherwise.


Then let's say we use as estimator the proportion of heads in the sample,
$$\hat{p} = N/n.$$
Clearly N ∼ Bin(n, p) and thus E(p̂) = p, so it is unbiased. Also we have that for any ε > 0,
$$P(|\hat{p} - p| > \varepsilon) = P(|N - np| > n\varepsilon) = P\big(N > n(p + \varepsilon)\big) + P\big(N < n(p - \varepsilon)\big),$$

and we are a bit stuck. Of course there are tricks one can use to estimate the above and to show that
it vanishes, but it does show that we need a more general method. This is given by the law of large
numbers in the next section.

4.1 The Law of Large Numbers


A familiar example of consistency is in the frequently quoted and often misunderstood “law of large
numbers”.
The Law of Large Numbers is what most people are thinking of when they speak of the “Law of
Averages”. To see what happens, let’s go back to the simple scenario of Example 16: a sequence of
Bernoulli trials with probability p of success on each trial. Let Yn be the number of successes in n trials
and let p̂n = Yn /n be the proportion of successes in n trials.

$$E(\hat{p}_n) = E\Big(\frac{Y_n}{n}\Big) = \frac{E(Y_n)}{n} = \frac{np}{n} = p.$$
Also, since Y_n = Σ_{i=1}^n X_i where the X_i are i.i.d.,
$$\operatorname{var}(\hat{p}_n) = \frac{\operatorname{var}(\sum X_i)}{n^2} = \frac{n\operatorname{var}(X_1)}{n^2} = \frac{p(1-p)}{n} \to 0,$$
as n → ∞.
Remember that variance is a measure of the spread of the distribution. Therefore the distribution of p̂
remains at the same location, while it gets more and more peaky as you can see in Figure 4.1.
So as n → ∞, the probability of any fixed interval around the mean approaches 1, and the random
variable p̂ ‘tends’ to the constant p in some sense1 . In other words, p̂ is both unbiased and consistent as
an estimate of p.
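A short simulation sketch (assuming Python with NumPy; not part of the notes) makes this concrete: as n grows, the proportion of simulated runs in which p̂ lands further than a fixed ε from p shrinks towards zero, in line with var(p̂) = p(1 − p)/n.

import numpy as np

rng = np.random.default_rng(4)
p, eps, reps = 0.3, 0.05, 20_000

for n in (100, 1000, 10_000):
    p_hat = rng.binomial(n, p, size=reps) / n      # reps independent runs of n tosses
    frac_far = np.mean(np.abs(p_hat - p) > eps)     # how often p_hat misses p by more than eps
    print(f"n = {n:6d}   fraction with |p_hat - p| > {eps}: {frac_far:.4f}")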
Of course not everything in life is Bernoulli trials or Gaussian, but fortunately the law applies far more
generally, and since it’s not too hard to see why we will go through it. First, however, let’s clarify what
the law says.
1
When talking about random variables there are various modes of ’convergence’ to a limit which are beyond the scope of
this course. This one is called convergence in probability. You will learn more about this in later courses on Probability Theory

Figure 4.1.1: The density of N(µ, σ²) for σ² = 1 (blue), 1/2 (orange), 1/3 (green), 1/4 (red) and 1/5 (purple).

What it does say is that in a sequence of n Bernoulli trials, the proportion of successes tends to the
probability p. What it does not say is that the number of successes tends to the expected number np.
Let’s see what that means. Suppose we are tossing a perfectly unbiased statisticians’ coin, and the
first ten throws are all heads. As you know, that’s possible, if unlikely. Now we go on and toss it 990
more times to make 1000 in all. What might we expect to happen?
Most people, and even a lot of gamblers, would argue as follows. In 1000 trials there should be on
average 500 heads. We’ve already got 10, so we expect that of the next 990 trials, 490 should be heads
and 500 should be tails. If this is a betting game, we should certainly be betting on tails until the tails
catch up, as they are bound to.
Statisticians call this the Law of Maturing Probabilities, and it is wrong. It must be. How could it
possibly be true? It requires the coin to have a memory, so that it knows it owes us some tails and will
produce them. If I take a coin from my pocket and toss it, you think there is 50-50 chance it will show
heads. But how do you know it doesn’t owe me some heads? Surely there is no way within conventional
science that this can make any sense.
But what about the law of large numbers? Well, let’s go back and think. If we start from scratch,
the expected number of heads is just half the number of throws. So if we toss the coin 100 more times,
we expect 50 heads and 50 tails. Counting the 10 heads at the start, this makes the relative proportion
p100 = 60/110 = 0.55. If we toss the coin 200 more times, we expect 100 heads and 100 tails, and
again adding in the extra 10 heads this would make p200 = 110/210 = 0.52. If we toss the coin 990
times, we expect 495 heads and 495 tails, and adding the extra 10 heads would make p1000 = 0.505.
You can see what’s happening. The relative proportion is Yn /n, where Yn is the actual number of
successes on n trials. It may happen that Yn moves away from the theoretical expectation np. Suppose
we write Yn = np + δ. What the law of large numbers tells us is not that if we carry out extra trials the
proportion of successes will adjust itself to reduce δ but only that we don’t expect δ to get any bigger.
Since we are increasing the denominator, n, this means that the relative frequency will tend to p as it
should.

In other words, the deviation in numbers is δ and there is nothing to say this has to go to zero. The
deviation in relative frequency is δ/n. This does tend to zero, but by increasing n with δ constant, rather
than by decreasing δ.
The only use of the law of maturing probabilities is if you are betting against people who believe in it.
If a coin shows heads ten times in a row, when you know the odds are about 100:1 against this, someone
might be fool enough to offer you better than even odds against another head, on the grounds that it is
time for some tails. If he does, you should accept it. First, if the coin is fair, the odds are still even, as
they were at the beginning. Second, by applying a simple statistical test you know that the chances are
very good that the coin isn’t fair, and that it is biased in favour of heads!

4.1.1 The Law of Large Numbers


It’s time we give a rigorous statement and proof of the law of large numbers (LLN).

Theorem 36 (Law of Large Numbers). Suppose that {Xi, i ≥ 0} are i.i.d. with finite mean µ and variance σ². Then
$$\hat{\mu} := \frac{1}{n}\sum_{i=1}^n X_i$$
is a consistent estimator of the mean. In other words, for all ε > 0,
$$P[|\hat{\mu} - \mu| > \varepsilon] \to 0,$$
as n → ∞.

The proof will be based on the following useful result.

Proposition 37 (Chebyshev’s Inequality). Let Y be a random variable with finite mean µ and variance. Then for any k > 0,
$$P\Big(|Y - \mu| \ge k\sqrt{\operatorname{var}(Y)}\Big) \le \frac{1}{k^2}.$$
We will now prove the LLN and then prove Chebyshev’s inequality.

Proof of Law of Large Numbers. Notice that E µ̂ = µ by linearity of expectation, while its variance is var(µ̂) = σ²/n. Therefore by Chebyshev's inequality applied with k := ε/√var(µ̂) we immediately have that
$$P[|\hat{\mu} - \mu| > \varepsilon] = P\Big[|\hat{\mu} - \mu| > k\sqrt{\operatorname{var}(\hat{\mu})}\Big] \le \frac{1}{k^2} = \frac{\operatorname{var}(\hat{\mu})}{\varepsilon^2} = \frac{\sigma^2}{n\varepsilon^2}.$$

Chebyshev’s inequality actually follows from the following more general result.

Theorem 38 (Markov’s Inequality). Let X > 0 be a random variable such that E X < ∞, and let c > 0 be a constant. Then
$$P(X > c) \le \frac{E X}{c}.$$
Proof. Take a moment to convince yourselves that
$$P(X > c) = E\big[\mathbf{1}\{X > c\}\big],$$
where for any x, c,
$$\mathbf{1}\{x > c\} = \begin{cases} 1, & \text{if } x > c, \\ 0, & \text{otherwise.} \end{cases}$$
Notice that if x > c then x/c > 1, and thus for any x, c > 0,
$$\mathbf{1}\{x > c\} \le \frac{x}{c}\,\mathbf{1}\{x > c\} \le \frac{x}{c}.$$
Recalling from the basic properties of expectations that if X ≥ Y ≥ 0 then E X ≥ E Y ≥ 0, we have that
$$P(X > c) = E\big[\mathbf{1}\{X > c\}\big] \le E\Big[\frac{X}{c}\Big] = \frac{E X}{c}.$$

Remark 39. If X > 0 and k ≥ 1 then x > c if and only if x^k > c^k. Therefore, if E X^k < ∞ we can also conclude that
$$P(X > c) = P(X^k > c^k) \le \frac{E X^k}{c^k},$$
by applying Markov's inequality to the positive random variable X^k and the constant c^k.

Proof of Chebyshev's inequality. Let X = |Y − µ|/σ. Then X ≥ 0, and since Y has finite variance we have that E X² < ∞. By the remark above we can conclude that for any k > 0,
$$P(|Y - \mu| \ge k\sigma) = P(X \ge k) \le \frac{E X^2}{k^2} = \frac{E[(Y - \mu)^2]}{k^2\sigma^2} = \frac{1}{k^2}.$$
k σ k k

Now let's see how this works out. Suppose we are told that the probability density has zero mean and unit variance. Then if we set k = 1 we find P(|x| ≥ 1) ≤ 1, which tells us absolutely nothing new. With k = 2 and k = 3, however, we have P(|x| ≥ 2) ≤ 1/4 and P(|x| ≥ 3) ≤ 1/9. If, in fact, x has the normal density, then we know that P(|x| ≥ 2) is almost exactly 1/20 and P(|x| ≥ 3) is very nearly zero. Chebyshev's inequality obviously doesn't do as well, but it's pretty good considering how little we have to know. And of course it does not depend on the assumption that the density is normal. We can make it work providing we have reasonable estimates of µ and σ.

Example 17. Let's go back to Example 16, where we were stuck at computing
$$P(|\hat{p} - p| > \varepsilon) = P(|Y_n - np| > n\varepsilon).$$
Since E(Y_n − np)² = np(1 − p), Chebyshev's inequality applied with k = nε/√(np(1−p)) immediately gives us
$$P(|\hat{p} - p| > \varepsilon) = P(|Y_n - np| > n\varepsilon) \le \frac{np(1-p)}{n^2\varepsilon^2} = \frac{p(1-p)}{n\varepsilon^2} \to 0,$$
which proves consistency and also gives us a rate.
For a more specific example let p = 0.3. We use Chebyshev's inequality to find an upper limit to the probability that the proportion of successes falls outside the interval (0.2, 0.4) in (i) 100 trials, (ii) 1000 trials. Chebyshev's inequality is
$$P(|Y - \mu| \ge k\sigma) \le \frac{1}{k^2}.$$
Here µ = 0.3 and
$$\sigma^2 = \frac{pq}{n} = \frac{\tfrac{3}{10}\cdot\tfrac{7}{10}}{n} = 0.21/n.$$
We are to be no further than 0.1 from µ, so kσ = 0.1 and 1/k² = 100σ² = 21/n. If n = 100, 1/k² = 0.21, so the probability that p̂ lies outside the interval is less than 0.21, and the probability that it lies inside the interval is greater than 0.79. If n = 1000, 1/k² = 0.021, so the probability that p̂ lies outside the interval is less than 0.021 and the probability that it lies inside the interval is greater than 0.979.
If we use the normal approximation to the binomial, we take p̂ to be N(0.3, 0.21/100). Then P(p̂ > 0.4) = P(z > (0.4 − 0.3)/(√0.21/10)) = P(z > 0.1/0.0458) = 1 − Φ(2.18) = 1 − 0.98537 = 0.01463. The probability that p̂ lies outside the interval is therefore 0.0292 and the probability that it lies within the range is 0.9708. This is of course better than what we get using Chebyshev. Finally, if n = 1000, the normal approximation has σ² = 0.21/1000 = 0.00021, which makes σ ≈ 0.0145. This puts both 0.2 and 0.4 nearly 7 standard deviations from 0.3, so the probability of p̂ falling outside the interval is very close to 0.
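For comparison, here is a sketch (Python standard library only, not part of the notes) that computes the exact binomial probability of p̂ falling outside (0.2, 0.4) when p = 0.3, next to the Chebyshev bound 21/n used above.

from math import comb

def prob_outside(n, p, lo, hi):
    """Exact P(p_hat <= lo or p_hat >= hi) when n * p_hat ~ Bin(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n + 1) if k / n <= lo or k / n >= hi)

for n in (100, 1000):
    chebyshev = 0.21 / (n * 0.1**2)
    exact = prob_outside(n, 0.3, 0.2, 0.4)
    print(f"n = {n:5d}   Chebyshev bound: {min(chebyshev, 1.0):.3f}   exact: {exact:.4f}")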

4.2 Efficiency
So we would like our estimator to be unbiased and consistent. Another desirable property of an estimator
is that it should have a small mean square error which for unbiased estimators is just the variance.
That is given two unbiased estimators, one would of course choose the one with the smaller variance.
To see why just think about the width of the corresponding (1 − α) × 100%-confidence intervals. The
width is of course directly related to the variance. This motivates the following definition.
Definition 29 (Relative Efficiency). Given two unbiased estimators θ̂₁ and θ̂₂ of a parameter θ, the relative efficiency of θ̂₁ relative to θ̂₂ is denoted by eff(θ̂₁, θ̂₂) and is defined as
$$\operatorname{eff}(\hat{\theta}_1, \hat{\theta}_2) = \frac{\operatorname{var}(\hat{\theta}_2)}{\operatorname{var}(\hat{\theta}_1)}.$$
For biased estimators it is defined as
$$\operatorname{eff}(\hat{\theta}_1, \hat{\theta}_2) = \frac{\operatorname{MSE}(\hat{\theta}_2)}{\operatorname{MSE}(\hat{\theta}_1)}.$$
Remark 40. We are assuming of course that both estimators have finite variance. If one of them has
infinite variance then one would choose the other one. If both have infinite variance then the efficiency
is not defined and one would have to use a different metric to compare the two estimators.

Notice that the definition is not symmetric, and that the smaller the m.s.e. of θ̂1 is, the greater its
relative efficiency w.r.t. any other estimator.
The reason for the name is that a more efficient estimator makes better use of the data, in that it
produces an estimator with less variability. We can see this from the following example:

Example 18. Let Y1, . . . , Yn ∼ U(0, M), the uniform distribution on (0, M). Then
$$E(Y_i) = \frac{1}{M}\int_0^M x\, dx = \frac{M}{2}, \qquad E(Y_i^2) = \frac{1}{M}\int_0^M x^2\, dx = \frac{M^2}{3}, \qquad \operatorname{var}(Y_i) = \frac{M^2}{12}.$$

Suppose we want to estimate the mean. Of course one possibility is to use the sample mean
$$\hat{\theta}_1 := \bar{Y} = \frac{1}{n}\sum Y_i,$$
which we know is unbiased and consistent.
Another possibility comes from the fact that the maximum of a large sample should be close to M, and therefore one could try Mₙ/2 where Mₙ := max_{i=1,...,n} Yᵢ. Let's just check first what the mean of this is. First we compute
$$F_{M_n}(x) := P(M_n \le x) = \prod_{i=1}^n P(Y_i \le x) = \Big(\frac{x}{M}\Big)^n$$
for x ∈ (0, M), 0 for x < 0 and 1 for x > M. Thus
$$f_{M_n}(x) := F_{M_n}'(x) = n x^{n-1}\Big(\frac{1}{M}\Big)^n,$$
and thus finally
$$E[M_n/2] = \int_0^M \frac{x}{2}\, n x^{n-1}\Big(\frac{1}{M}\Big)^n dx = \frac{n}{2M^n}\Big[\frac{x^{n+1}}{n+1}\Big]_0^M = \frac{n}{2M^n}\,\frac{M^{n+1}}{n+1} = \frac{M}{2}\,\frac{n}{n+1}.$$
Therefore to make this unbiased we should use
$$\hat{\theta}_2 := \frac{n+1}{n}\,\frac{M_n}{2}.$$
Let's compute the variances. The first one we know to be
$$\operatorname{var}(\hat{\theta}_1) = \frac{1}{n}\operatorname{var}(Y_1) = \frac{M^2}{12n}.$$
For the second one we first compute the second moment,
$$E\big[M_n^2\big] = \int_0^M x^2\, n x^{n-1}\Big(\frac{1}{M}\Big)^n dx = \frac{n}{M^n}\Big[\frac{x^{n+2}}{n+2}\Big]_0^M = \frac{n}{M^n}\,\frac{M^{n+2}}{n+2} = \frac{n}{n+2} M^2,$$
and thus
$$\operatorname{var}(\hat{\theta}_2) = \frac{(n+1)^2}{4n^2}\operatorname{var}(M_n) = \frac{(n+1)^2}{4n^2}\Big[\frac{n}{n+2}M^2 - \frac{n^2}{(n+1)^2}M^2\Big] = \frac{(n+1)^2}{4n^2}\,\frac{n M^2}{(n+2)(n+1)^2} = \frac{M^2}{4n(n+2)}.$$
Therefore
$$\operatorname{eff}(\hat{\theta}_1, \hat{\theta}_2) = \frac{\operatorname{var}(\hat{\theta}_2)}{\operatorname{var}(\hat{\theta}_1)} = \frac{3}{n+2},$$
which is less than 1 if n > 1, and therefore if n > 1 then θ̂₂ has a smaller variance and is generally preferable to θ̂₁ as an estimator of the mean.
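A Monte Carlo sketch of this comparison (assuming Python with NumPy; not part of the notes): both estimators come out close to unbiased for the mean M/2, and the ratio of their sample variances is close to 3/(n + 2).

import numpy as np

rng = np.random.default_rng(5)
M, n, reps = 10.0, 20, 100_000
Y = rng.uniform(0.0, M, size=(reps, n))

theta1 = Y.mean(axis=1)                        # sample mean
theta2 = (n + 1) / n * Y.max(axis=1) / 2.0     # unbiased estimator based on the maximum

print("means:", theta1.mean(), theta2.mean(), "  target M/2 =", M / 2)
print("variance ratio var(theta2)/var(theta1):", theta2.var() / theta1.var(),
      "  theory 3/(n + 2) =", 3 / (n + 2))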

4.3 Maximum Likelihood Estimates
Although we know a few ways to compare estimators, we have yet to discuss how to construct them. It’s
like a lot of things in mathematics, including integration. There is one method that is often used both
in practical problems and also, as we shall see, in theoretical work, and that’s the method of maximum
likelihood. Before we discuss the method we need to define the concept of likelihood.

Definition 30 (Likelihood function for Discrete). Suppose Y1 , . . . , Yn are discrete random variables
whose distribution depends on a parameter θ, and have probability mass function

pθ (y1 , . . . , yn ) := Pθ (Y1 = y1 , . . . , Yn = yn ).

Let y1 , . . . , yn be sample observations. The likelihood of the parameter θ given the observations (y1 , . . . , yn )
is denoted by L(θ|y1 , y2 , . . . , yn ) is defined to be

L(θ|y1 , y2 , . . . , yn ) := Pθ (Y1 = y1 , . . . , Yn = yn ),

that is the joint probability mass function for the parameter θ.

Definition 31 (Likelihood function for Continuous). Suppose Y1 , . . . , Yn are jointly continuous ran-
dom variables whose distribution depends on a parameter θ, and have probability density function
f (y1 , . . . , yn |θ). Let y1 , . . . , yn be sample observations. The likelihood of the parameter θ given the
observations (y1 , . . . , yn ) is denoted by L(θ|y1 , y2 , . . . , yn ) is defined to be

L(θ|y1 , y2 , . . . , yn ) := f (y1 , . . . , yn |θ),

that is the joint density for the parameter θ evaluated at the observations.

When the variables Y1 , . . . , Yn are i.i.d. and Yi ∼ f (·|θ), in other words Yi has a pdf which depends
on a parameter θ then it’s very easy to see that
$$L(\theta|y_1, \ldots, y_n) = \prod_{i=1}^n f(y_i|\theta),$$

and similarly for discrete random variables.

Remark 41. Despite the way we write it, we think of the likelihood function as a function of the param-
eters, and we treat the observations as fixed. Sometimes we will drop the observations and simply write
L(θ).

Suppose now that we have a sample of size n from some distribution F(·|θ) which we know up to the parameter θ. One way to estimate the parameter would be to choose the value that makes the observations as probable as possible. In the discrete case, we should choose the value θ* that maximises the probability
$$L(\theta|y_1, \ldots, y_n) = P(Y_1 = y_1, Y_2 = y_2, \ldots, Y_n = y_n|\theta).$$
Notice that in the continuous case the likelihood is no longer a probability, but a density. However, if Y1, . . . , Yn have joint density f(·, . . . , ·|θ), then by maximising L(θ|y1, . . . , yn) over θ, we are also maximising
$$P\big(Y_i \in [y_i - \Delta y_i, y_i + \Delta y_i],\; i = 1, \ldots, n \,\big|\, \theta\big) \approx L(\theta|y_1, y_2, \ldots, y_n)\,\Delta y_1 \times \cdots \times \Delta y_n,$$
which is a probability.
This leads to the following definition.

Definition 32 (Maximum Likelihood Estimator). Suppose that a sample y1 , . . . , yn has likelihood func-
tion L(θ) = L(θ; y1 , . . . , yn ) depending on a parameter(vector) θ. Then a maximum likelihood estimator
θ̂MLE is the value of the parameters that maximises L(θ), if a maximum exists.
Remark 42. A maximum may not exist, or it may not be unique. In the first case the maximum likelihood
estimator does not exist, in the second it will not be unique, in the sense that the estimator will produce
a set of point estimates. In the cases we will deal with the maximum likelihood will usually be unique.
Remark 43. In most cases it is far easier to instead maximise the log-likelihood function l(θ) which is
perhaps unsurprisingly defined as
l(θ) = l(θ; y1 , . . . yn ) = log L(θ; y1 , . . . , yn ).

So we compute the probability of the observed outcome as a function of the parameters. We then
maximise this likelihood function to obtain the maximum likelihood estimates of the parameters. Note
that these are not the values of the parameters that are most likely, given the data.
Example 19. As an example, we consider the exponential distribution f(x|λ) = λe^{−λx}. Suppose we take a sample of size n. The likelihood is
$$L(\lambda|x_1, \ldots, x_n) = \prod_{i=1}^n \big(\lambda e^{-\lambda x_i}\big) = \lambda^n \exp\Big(-\lambda\sum_{i=1}^n x_i\Big) = \lambda^n \exp(-n\lambda\bar{x}).$$
Then
$$\log L(\lambda) = n\log\lambda - n\lambda\bar{x},$$
and so
$$\frac{d}{d\lambda}\log L = \frac{n}{\lambda} - n\bar{x}.$$
Thus L has a unique maximum at λ̂ = 1/x̄ and this is therefore the maximum likelihood estimator of λ.
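As a quick check (a sketch assuming Python with NumPy; not part of the notes), the estimator λ̂ = 1/x̄ recovers the rate of simulated exponential data:

import numpy as np

rng = np.random.default_rng(6)
lam_true, n = 2.5, 5_000
x = rng.exponential(1.0 / lam_true, size=n)   # Exp(lam_true) sample, mean 1/lam_true

lam_mle = 1.0 / x.mean()
print("true lambda:", lam_true, "  maximum likelihood estimate:", lam_mle)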

Example 20. As a more complicated example, let’s find the maximum likelihood estimates of the pa-
rameters of the normal distribution. The likelihood function is

$$L = \prod_{i=1}^n \frac{1}{\sigma\sqrt{2\pi}}\exp\big[-(x_i - \mu)^2/2\sigma^2\big],$$
$$\log L = -\frac{n}{2}\log(2\pi) - n\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2.$$
Differentiating with respect to each parameter and setting equal to zero:
$$\frac{1}{\sigma^2}\sum_{i=1}^n (x_i - \mu) = 0, \qquad -\frac{n}{\sigma} + \frac{1}{\sigma^3}\sum_{i=1}^n (x_i - \mu)^2 = 0.$$
We can easily solve these to obtain the maximum likelihood estimates
$$\hat{\mu} = \frac{1}{n}\sum_i x_i, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum (x_i - \bar{x})^2.$$

Note that the maximum likelihood estimator of the variance is biased. We don’t always find all the
criteria pointing at the same statistic.

Example 21. Suppose that y1, . . . , yn are observations from a population with density
$$f(x; \theta) = \begin{cases} \dfrac{\theta}{x^2}, & x \ge \theta, \\ 0, & \text{otherwise,} \end{cases}$$
where θ > 0. The likelihood is then given by
$$L(\theta) = \frac{\theta^n}{y_1^2 \cdots y_n^2}, \qquad \text{for } y_1, \ldots, y_n \ge \theta,$$
and vanishes otherwise. If we differentiate and set the derivative equal to 0 then we get θ* = 0. But is this the correct choice? Have a look at the likelihood function in Figure 4.3.1. You'll notice that the maximum is at the boundary θ = minᵢ yᵢ, and therefore it cannot be picked up by the derivative test; the maximum likelihood estimator is θ̂ = minᵢ yᵢ. Remember to always try to visualise what is happening at the boundaries, if there are any.

Figure 4.3.1: The likelihood function L(θ; y1, . . . , yn). Notice that it vanishes for θ > minᵢ yᵢ.
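A small numerical sketch of this example (assuming Python with NumPy; not part of the notes) evaluates L(θ) on a grid and confirms that, on its support θ ≤ minᵢ yᵢ, the likelihood is increasing in θ, so the maximum sits at the boundary θ = minᵢ yᵢ.

import numpy as np

rng = np.random.default_rng(7)
theta_true, n = 2.0, 8
# Inverse-cdf sampling from f(x; theta) = theta / x^2, x >= theta:
# F(x) = 1 - theta / x, so X = theta / (1 - U) with U uniform on (0, 1).
y = theta_true / (1.0 - rng.uniform(size=n))

def likelihood(theta):
    return theta**n / np.prod(y**2) if theta <= y.min() else 0.0

for theta in np.linspace(0.5, y.min(), 5):
    print(f"theta = {theta:.3f}   L(theta) = {likelihood(theta):.3e}")
print("the largest value above occurs at the boundary min(y) =", y.min())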

The following is a very nice property of MLEs when we reparameterize a distribution.


Theorem 44 (Invariance of MLE). Suppose that θ̂ is the MLE for a parameter θ and let t(·) be a strictly
monotone function of θ. Then
(t(θ))MLE = t(θ̂),
that is the MLE of t(θ) is t(θ̂).

4.4 The Cramer-Rao Inequality


Suppose that we are given a sample x1 , . . . , xn and we want to estimate a parameter θ. What is the best
we can do in terms of relative efficiency? That is, what is the lowest variance we can achieve?
Definition 33 (Fisher Information). Let X ∼ f(·; θ). Then the Fisher Information is given by
$$I_n(\theta) := n\, E\Big[\Big(\frac{\partial l(\theta; X)}{\partial\theta}\Big)^2\Big],$$
where l(θ; x) is the log-likelihood, that is l(θ; x) = log f(x; θ).

It can be shown that if the second partial derivative exists then we also have that
$$I_n(\theta) = -n\, E\Big[\frac{\partial^2}{\partial\theta^2} l(\theta; X)\Big].$$
Theorem 45. Let X1, . . . , Xn be i.i.d. with probability density function f(y; θ). Let θ̂ₙ = g(X1, . . . , Xn) be an unbiased estimator of θ, and suppose that the support of f(·; θ) (the region where the density is not zero) does not depend on θ. Then under mild conditions we have that
$$\operatorname{var}(\hat{\theta}_n) \ge \frac{1}{I_n(\theta)}.$$

Proof. We deal with the continuous case; the discrete case is done similarly. Consider the random variable W defined by
$$W = \frac{\partial}{\partial\theta}\log f(X; \theta) = \frac{f'(X; \theta)}{f(X; \theta)},$$
where X ∼ f(·; θ) and f'(x; θ) denotes differentiation with respect to θ. Clearly W depends on θ and so its expected value depends on θ and is obtained by integrating with respect to f(·; θ). Hence
$$E(W) = \int \frac{f'(x; \theta)}{f(x; \theta)} f(x; \theta)\, dx = \int f'(x; \theta)\, dx = \frac{d}{d\theta}\int f(x; \theta)\, dx = \frac{d}{d\theta}(1) = 0,$$
under fairly general conditions that guarantee we can exchange differentiation and integration. Since E(W) = 0, cov(W, θ̂ₙ) = E(W θ̂ₙ), and thus
$$\operatorname{cov}(W, \hat{\theta}_n) = E(W\hat{\theta}_n) = \int g(x_1, \ldots, x_n)\,\frac{f'(x; \theta)}{f(x; \theta)}\, f(x; \theta)\, dx = \int g(x) f'(x; \theta)\, dx = \frac{d}{d\theta}\int g(x) f(x; \theta)\, dx = \frac{d}{d\theta} E(\hat{\theta}_n) = \frac{d\theta}{d\theta} = 1, \qquad (4.4.1)$$
again under sufficient conditions to allow the exchange of derivative and integration. Now the maximum value of any correlation coefficient is unity, and so
$$1 \ge \operatorname{corr}(W, \hat{\theta}_n)^2 = \frac{\operatorname{cov}^2(W, \hat{\theta}_n)}{(\operatorname{var} W)(\operatorname{var}\hat{\theta}_n)},$$
which, since cov(W, θ̂ₙ) = 1, implies that
$$\operatorname{var}\hat{\theta}_n \ge \frac{1}{\operatorname{var} W}.$$
Since the Xᵢ are independent, the likelihood function factorises and thus
$$W = \frac{\partial}{\partial\theta}\log f(X; \theta) = \frac{\partial}{\partial\theta}\log\prod_i f(X_i; \theta) = \sum_i \frac{\partial}{\partial\theta}\log f(X_i; \theta) = \sum_i W_i,$$
where
$$W_i := \frac{\partial}{\partial\theta}\log f(X_i; \theta)$$
are i.i.d. We have just proved that the means are zero (because the proof works for a sample of one), and hence the variance of each Wᵢ is equal to E(Wᵢ²), which is by definition I₁(θ). Hence
$$\operatorname{var} W = \operatorname{var}\sum_i W_i = n I_1(\theta) = n\, E\Big[\Big(\frac{\partial}{\partial\theta}\log f(X_1; \theta)\Big)^2\Big],$$
and since var θ̂ₙ ≥ 1/var W we have the claimed result.

The importance of the Cramer-Rao inequality is that it establishes a limit to the efficiency of an estimator. No unbiased estimator can have variance less than 1/Iₙ(θ), so we can use this as the optimum. We define the efficiency of an estimator as the ratio of the Cramer-Rao lower bound to the variance of the estimator.

Definition 34 (Efficiency). The efficiency of an unbiased estimator θ̂ₙ of a parameter θ is defined as the ratio of the Cramer-Rao bound to the variance of θ̂ₙ, that is
$$\operatorname{eff}(\hat{\theta}_n) = \frac{1}{I_n(\theta)\operatorname{var}(\hat{\theta}_n)}.$$

An estimator which has unit efficiency is called efficient.

Example 22. Let us consider a random sample of size n from the exponential distribution f(x; λ) = λe^{−λx}. Recall that we showed that 1/x̄ is the maximum likelihood estimator for λ:

L = ∏ λe^{−λxi} = λ^n e^{−λ Σxi},
log L = n log λ − λ Σxi.

To maximise L we differentiate with respect to λ and we find n/λ̂ = Σxi and so λ̂ = 1/x̄. Since 1/λ is a monotone (decreasing) function of λ, we can reparameterize with respect to µ ≡ 1/λ and the invariance of MLEs gives us µ̂_MLE = x̄.
Now X̄ is actually an efficient estimate of µ := 1/λ. To see this, we reparameterise the density in
terms of µ:
f(x; µ) = (1/µ) e^{−x/µ},
log f = −log µ − x/µ,
∂/∂µ log f = −1/µ + x/µ².

We also know that E(X) = 1/λ = µ and E(X²) = 2/λ² = 2µ². Hence

E[(∂/∂µ log f(X; µ))²] = E[(X/µ² − 1/µ)²]
                       = 1/µ² − (2/µ³) E(X) + (1/µ⁴) E(X²) = 1/µ².

Hence

In(µ) = n/µ².
The variance of the exponential distribution is 1/λ² = µ², so the variance of the mean of a sample of n is µ²/n. This is just 1/In(µ), which proves the result.
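This calculation can also be checked numerically. The following is a minimal simulation sketch (assuming numpy is available; the values of µ, n and the number of replications below are arbitrary illustration choices), comparing the empirical variance of X̄ with the Cramer-Rao bound µ²/n = 1/In(µ).

import numpy as np

# Monte Carlo check that the sample mean of exponential data attains the
# Cramer-Rao bound mu^2/n (illustration only; mu, n, reps are arbitrary).
rng = np.random.default_rng(0)
mu, n, reps = 2.0, 50, 20000          # true mean mu = 1/lambda

xbars = rng.exponential(scale=mu, size=(reps, n)).mean(axis=1)
print("empirical var(Xbar):", xbars.var())
print("Cramer-Rao bound mu^2/n:", mu**2 / n)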

We mentioned above that if 1/X̄ is a maximum likelihood estimate of λ then X̄ is a maximum
likelihood estimate of µ = 1/λ. On the other hand, the fact that X̄ is an unbiased and efficient estimate
of µ does not mean that 1/X̄ is an efficient or even unbiased estimate of λ. In fact we can show that
λ̂′ = (n − 1)/ΣXi,

is an unbiased estimate of λ, that

var(λ̂′) = λ²/(n − 2),

and that the efficiency of λ̂′ is 1 − 2/n. This is less than unity, and there is no efficient estimator of λ.
On the other hand, the efficiency does approach unity as n → ∞. In such cases we say that λ̂′ is an
asymptotically efficient estimator.

4.5 Sufficiency
4.5.1 Sufficient Statistics
Suppose we toss a coin n times and we observe heads k times. Our intuition is to use p̂ = k/n as an
estimate of p, the probability of a head, and indeed this is the maximum likelihood estimator. However
someone might object that we have thrown away a lot of information by reducing all the data to a single
statistic, p̂. Are we certain that, for example, the order of the heads and tails, or perhaps whether the
third one was a head, have nothing more to tell us?
Of course we are certain, but how do we express this? A useful way of thinking about it is the
following. The probability of getting k heads in n trials is of course
C(n, k) p^k (1 − p)^{n−k}.

This depends on p, which is why we can use the outcome to help us estimate p. Suppose, however, that we are told that there were exactly k heads. Then there are C(n, k) equally likely orderings of the k heads and n − k tails, and each of them has probability 1/C(n, k). This is the conditional probability of a particular


outcome/ordering of the experiment given the total number of heads, and as you can see it is independent
of p. Consequently it can offer us no additional insight about p. Therefore all the information we can
extract about p is contained in the statistic p̂. We therefore call p̂ a sufficient statistic.

Definition 35 (Sufficient Statistic). Let X1 , . . . , Xn be i.i.d. from a probability distribution with pa-
rameter θ. Then the statistic T (X1 , X2 , . . . , Xn ) is called a sufficient statistic for θ if the conditional
distribution of X1 , . . . , Xn given the value of T does not depend on θ.

Fortunately we do not have to work out conditional distributions, because of a theorem which we will
state without proof.

Theorem 46. A statistic T (X1 , X2 , . . . , Xn ) is a sufficient statistic if and only if the joint probability
density of X1 , . . . , Xn can be factorised into two factors, one of which depends only on T and the
parameters while the other is independent of the parameters:

f (x1 , x2 , . . . , xn ; θ) = g(t; θ)h(x1 , x2 , . . . , xn ).

Remark 47. Note that the second factor is a function of x1, . . . , xn, and thus it may also depend on the statistic. It must not, however, depend on the parameter θ.

4.5.2 Efficient statistics are sufficient
Let’s go back to the discussion of Section 4.4 and recall from Equation 4.4.1 that if the estimator θ̂n of
parameter θ is efficient, then cov(W, θ̂n ) = 1, where

W := ∂/∂θ log f(X; θ) = f′(X; θ)/f(X; θ),    X ∼ f(·; θ),

and since also var(θ̂n ) = 1/ var(W ) we have that corr(W, θ̂n ) = ±1.
Remark 48. Let X, Y be two random variables such that E X = µX, E Y = µY, var(X) = σX², var(Y) = σY² and corr(X, Y) = 1. Then an easy calculation shows that

E[(X − µX − (corr(X, Y) σX/σY)(Y − µY))²] = σX² − ρ²XY σX² = 0,

since corr(X, Y) = 1 by assumption. This means that there exist constants α, β such that E[(X − α − βY)²] = 0, and since (X − α − βY)² ≥ 0, this implies that we must have P(X = α + βY) = 1.

Therefore if θ̂n is efficient, W is a linear function of θ̂n, though the coefficients may involve θ:

∂/∂θ log f(X; θ) = W = a(θ) + b(θ)θ̂n.
Thus log f (X; θ) can be obtained by integrating W with respect to θ, and this implies

log f (X; θ) = A(θ) + B(θ)θ̂n + K(X)

Here the functions A, B are indefinite integrals of a, b, respectively, and K(X) is an arbitrary function
which is independent of θ.
The likelihood function is therefore of the form
f(X; θ) = exp[A(θ) + B(θ)θ̂n] exp[K(X)].

But this is the required factorisation for sufficiency, and thus θ̂n is sufficient.
The converse is not necessarily true, i.e. there are sufficient statistics which are not efficient.

4.5.3 Examples of Sufficient Statistics


Let’s try a few examples.

Example 23 (Poisson). We want to use x̄ to estimate λ, the parameter of the Poisson distribution. For a
sample of size n we have
f(x1, x2, . . . , xn; λ) = ∏_{i=1}^n [e^{−λ} λ^{xi} / xi!] = e^{−nλ} λ^{Σxi} ∏ 1/xi! = (e^{−nλ} λ^{nx̄}) (∏ 1/xi!),

which is the required factorization.

Example 24 (Normal). We now check that x̄ is a sufficient statistic for estimating the mean of a normal
distribution with known variance:
f(x1, x2, . . . , xn; µ) = (2πσ²)^{−n/2} exp[−(1/2σ²) Σ(xi − µ)²].

Recall the identity

Σ(xi − µ)² = Σ(xi − x̄)² + n(x̄ − µ)²,

which allows us to rewrite the density function as

f(x1, x2, . . . , xn; µ) = (2πσ²)^{−n/2} exp[−(n/2σ²)(x̄ − µ)²] exp[−(1/2σ²) Σ(xi − x̄)²],

which is again the required factorisation. Note that the statistic appears in the second factor; it’s the
parameters that must not be there. Hence x̄ is a sufficient statistic for estimating µ.
We may also ask whether the pair (x̄, s) is sufficient for estimating the parameters (µ, σ), which is
the problem we are often really faced with. To see that they are, we simply note that if we substitute
s² = Σ(xi − x̄)²/(n − 1) in the above expression we obtain

f(x1, x2, . . . , xn; µ) = (2πσ²)^{−n/2} exp[−(n/2σ²)(x̄ − µ)²] exp[−((n − 1)/2σ²) s²],
which is a (trivial) factorization of the required kind.

It is intuitively clear that sufficiency is important because it means we have extracted everything
relevant from the data. In fact we can make this precise by the following result:

Theorem 49. From any unbiased estimate which is not based on a sufficient statistic, an improved
estimate can be obtained which is based on the sufficient statistic. It has smaller variance, and is
obtained by averaging with respect to the conditional distribution given the sufficient statistic.

Proof. Suppose that R(X1 , X2 , . . . , Xn ) is an unbiased estimate of the parameter θ and that T (X1 , X2 , . . . , Xn )
is a sufficient statistic for θ. Let the joint probability density function of R and T be fR,T (r, t). Then
the marginal distribution for T is

fT(t) = ∫_{−∞}^{∞} fR,T(r, t) dr.

The conditional distribution of R given T is

fR|T(r|t) = fR,T(r, t) / fT(t),

and because T is a sufficient statistic this does not depend on θ. Also, since R is an unbiased estimate of θ,

E[R] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} r fR,T(r, t) dr dt = θ.

As an improved estimate of θ, consider S(T ), a function of T , which is obtained by averaging R with


respect to its conditional distribution given T . That is for T = t
S(t) := E[R | T = t] = ∫_{−∞}^{∞} r fR|T(r|t) dr.

We can readily check that S is unbiased:

E[S(T)] = ∫_{−∞}^{∞} s(t) fT(t) dt
        = ∫_{−∞}^{∞} [∫_{−∞}^{∞} r fR|T(r|t) dr] fT(t) dt
        = ∫_{−∞}^{∞} ∫_{−∞}^{∞} r fR,T(r, t) dr dt = θ.

It remains to show that var(S) < var(R). We use a standard decomposition to get

var(R) = E[(R − θ)²]
       = E{[(S − θ) + (R − S)]²}
       = var(S) + E[(R − S)²] + 2 E[(R − S)(S − θ)].

The second term on the right hand side is obviously non-negative. In fact S 6= R since S is sufficient and
R is not. Therefore P[(R − S)2 > 0] > 0 and thus E[(R − S)2 ] > 0 from the positivity of expectations
property in Section 2.4.1. So the proof is complete apart from dealing with the last term:
E[(R − S)(S − θ)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} [r − s(t)][s(t) − θ] fR,T(r, t) dr dt
                  = ∫_{−∞}^{∞} {∫_{−∞}^{∞} [r − s(t)] fR|T(r|t) dr} [s(t) − θ] fT(t) dt,

where we have used fR,T (r, t) = fR|T (r| t)fT (t). By definition
s(t) = ∫_{−∞}^{∞} r fR|T(r|t) dr,

and since s(t) does not depend on r we have that


∫ s(t) fR|T(r|t) dr = s(t) ∫ fR|T(r|t) dr = s(t),

and so the inner integral is identically zero. This proves the theorem.

Example 25. Suppose we toss a coin n times. For the i-th toss let Xi = 0 for a tail and 1 for a head, and we decide to use as our estimate of the probability of a head X1, the result of the first toss, and ignore the rest. First of all X1 is unbiased:

E(X1 ) = 0 × q + 1 × p = p

Its variance is pq, or npq with n = 1. For example if p = 1/2 the variance is 1/4 and the standard
deviation 1/2. Suppose, however, that we know that in the n trials there were exactly t heads. Then the
probability that the first one was a head is t/n and the probability that it was a tail is (n − t)/n. Now
the proportion of heads is a sufficient statistic:
f(x1, x2, . . . , xn) = ∏_{i=1}^n p^{xi} q^{1−xi} = q^n ∏_{i=1}^n (p/q)^{xi} = q^n (p/q)^{Σxi} = q^n (p/q)^{nx̄},

which is a (trivial) factorization of the required kind. We are now set up to use the theorem. We average
the statistic we began with using its conditional distribution given the sufficient statistic:
P[X1 = 0 | t] = (n − t)/n,    P[X1 = 1 | t] = t/n.
The mean value for any given value of T is then
0 × (n − t)/n + 1 × t/n = t/n,
and this has variance pq/n which is less than pq.

Of course this isn’t really very impressive. The theorem worked, but it only gave us back a statistic
that we already knew to be unbiased and sufficient. There are cases where we can use the theorem a
bit more effectively, but its real importance is that it tells us that for unbiased estimation we should be
looking at sufficient statistics. In fact, for many distributions there is only one unbiased estimate based
on a sufficient statistic, and then that gives us the minimum variance unbiased estimate automatically.
So if we find a statistic which is both unbiased and sufficient, we can be reasonably confident we are
doing about as well as we can.

4.6 The Method of Moments


We have often taken sample moments as natural estimates of the corresponding population moments. A
generalization of this is that whenever a parameter of a distribution can be expressed as a function of
population moments, we can use the same function of sample moments to estimate the parameter.
There can be problems; as we have already seen, the sample variance is a biased estimate of the
population variance, although at least it is consistent. But the variance is a so-called central moment (i.e. a moment about the mean), whereas the method of moments refers to ordinary moments, that is moments of the form E(X^n). In any case, it is a start, and as we have seen, once we have one estimate of something we
may even be able to use it to find a better one.
For example, suppose we want to estimate the parameters of the gamma distribution with density

f(x) = [λ^α / Γ(α)] x^{α−1} e^{−λx}.

We have already found that


E(X) = α/λ,    E(X²) = α(α + 1)/λ²,

which give var(X) = α/λ². We solve these equations to obtain

λ = E(X) / (E(X²) − [E(X)]²),    α = [E(X)]² / (E(X²) − [E(X)]²).

The method of moments consists in replacing E(X) with (ΣX)/n and E(X²) with (ΣX²)/n, and this gives the estimates

λ̃ = nX̄ / ((n − 1)S²),    α̃ = nX̄² / ((n − 1)S²),

where as usual S² = Σ(X − X̄)²/(n − 1).
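As a rough numerical illustration (a sketch only, assuming numpy is available; the true values of α and λ and the sample size are arbitrary), the recipe above can be applied directly to a simulated gamma sample:

import numpy as np

# Method-of-moments estimates for the gamma distribution.
rng = np.random.default_rng(1)
alpha, lam, n = 3.0, 2.0, 5000
x = rng.gamma(shape=alpha, scale=1.0 / lam, size=n)   # numpy's scale is 1/lambda

xbar = x.mean()
s2 = x.var(ddof=1)                                     # sample variance S^2

lam_tilde = n * xbar / ((n - 1) * s2)                  # n Xbar / ((n-1) S^2)
alpha_tilde = n * xbar**2 / ((n - 1) * s2)             # n Xbar^2 / ((n-1) S^2)
print(lam_tilde, alpha_tilde)                          # should be close to 2 and 3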

Chapter 5

Theory of Testing

5.1 The Central Limit Theorem


When we discussed the z-statistic we mentioned that it’s possible to use it to get asymptotically valid
results even when the underlying population is not normally distributed. The justification for using the
normal distribution for sample means even when we do not believe the distribution of the individual
items is normal is given by the Central Limit Theorem (CLT) and it is one of the most important and
fundamental theorems of probability and statistics.

Theorem 50 (Central Limit Theorem). Let X1 , X2 , . . . be i.i.d. random variables with E(Xi ) = µ and
var(Xi ) = σ 2 < ∞. Define
Zn := (Σ_{i=1}^n Xi − nµ) / (σ√n) = (X̄ − µ) / (σ/√n).
Then the distribution function of Zn converges to the distribution function of a standard normal random
variable as n → ∞, that is for all z
lim_{n→∞} Fn(z) := lim_{n→∞} P(Zn ≤ z) = ∫_{−∞}^z (1/√(2π)) e^{−t²/2} dt.
Remark 51. This form of convergence is known as convergence in distribution, among other things.

Proof. We will prove the result in the case where the MGF of Xi , ψ(t) exists. This is by no means
necessary but it simplifies the proof significantly. Also since we can also write Zn as
Zn := [Σ_{i=1}^n (Xi − µ)/σ] / √n,

where the random variables (Xi − µ)/σ are also i.i.d. with mean 0 and variance 1, we can assume w.l.o.g. that µ = 0 and σ² = 1.
By independence the moment generating function of the sum Σ_{i=1}^n Xi will be [ψ(t)]^n, and the moment generating function of Zn will be

ζn(t) := E(e^{tZn}) = E(e^{(t/√n) ΣXi}) = [ψ(t/√n)]^n.

Since µ = 0 and σ = 1 we know that ψ′(0) = E(X) = 0 and ψ″(0) = E(X²) = 1, and thus the Taylor expansion of ψ(t) takes the form

ψ(t) = ψ(0) + tψ′(0) + (t²/2!)ψ″(0) + (t³/3!)ψ‴(0) + . . .
     = 1 + t²/2! + (t³/3!)ψ‴(0) + . . .
     = 1 + t²/2 + o(t²),

as t → 0. Hence

ζn(t) = [ψ(t/√n)]^n = [1 + t²/(2n) + o(t²/n)]^n,
and thus
log ζn(t) = n log(1 + t²/(2n) + o(t²/n)),
and using the fact that log(1 + x) = x + o(x) as x → 0, we have that
log ζn(t) = n [t²/(2n) + o(1/n)] → t²/2,
from which we conclude that
ζn(t) → e^{t²/2},
which is the moment generating function of the standard normal distribution. The conclusion follows
from the following theorem.
Theorem 52 (Continuity Theorem). Suppose that the moment generating functions of the random vari-
ables U, U1 , U2 , . . . exist for all |t| < h and
E exp(tUn ) → E exp(tU ), for all |t| < h,
then for all u
P(Un ≤ u) → P(U ≤ u).

This is the justification for using the normal distribution in all sorts of examples. But this only works in the limit, and in general how fast we approach the limit depends on the distribution of the population (although there are quantitative results such as the Berry-Esseen theorem).
So if you are pretty sure that the distribution is something like a normal distribution, the theorem
will come into play on fairly small samples. On the other hand, if the distribution is very different, if
it is skewed (like incomes) or if a lot of the probability is in the tails (like hours of sunshine in many
countries) you’d need a very large sample to make it work. This is typical of statistics. In practice we
do something which is theoretically not entirely justified but works well enough most of the time. But
we need to know why it works so that we understand why/when it may fail.
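A quick way to get a feel for this is to simulate Zn for a skewed population and compare tail probabilities with the normal limit. The sketch below is illustrative only (it assumes numpy and scipy are installed; the exponential population and the sample sizes are arbitrary choices):

import numpy as np
from scipy.stats import norm

# Empirical check of the CLT for a skewed (exponential) population.
rng = np.random.default_rng(2)
mu, sigma = 1.0, 1.0                     # mean and sd of the Exp(1) population

for n in (5, 30, 200):
    x = rng.exponential(scale=1.0, size=(20000, n))
    zn = (x.mean(axis=1) - mu) / (sigma / np.sqrt(n))
    # Compare an empirical tail probability with the normal limit.
    print(n, (zn > 1.96).mean(), norm.sf(1.96))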

5.2 Significance Testing


Suppose X := (X1 , . . . , Xn ) is a random sample from f (x; θ) where θ ∈ Θ, the parameter space which
can be the whole or a subset of Rd .
We want to test the null hypothesis H0 : θ ∈ Θ0 against the alternative hypothesis H1 : θ ∈ Θ1, where Θ0, Θ1 ⊂ Θ are disjoint subsets.
Definition 36 (Simple and composite hypotheses). A hypothesis H : θ ∈ Θ is called simple if Θ is a
single point. A hypothesis that is not simple is called composite.

If, for example, we are testing the null hypothesis H0 : θ = 5 against the alternative H1 : θ > 5, then
H0 is simple and H1 is composite.
To perform the test we need a test statistic t(X), whose distribution we know when H0 is true, and
such that extreme values of t(X) make us doubt the validity of H0 . Suppose now that we perform the
experiment and obtain observations x := (x1 , . . . , xn ), which means that the observed value of the
statistic t(X) is simply t(x).

Definition 37 (Significance Level or p-value). The significance level or p-value of the test above is

p = P(t(X) ≥ t(x)|H0 ),

where P(t(X) ∈ A|H0 ) denotes the probability that t(X) ∈ A when H0 is true.

If p is small it means that if H0 were true the probability that a value as extreme as t(x) would be
observed if the experiment was repeated is small.

Example 26. Suppose that we observe a sample X1 , . . . , Xn from a N (µ, 1) population and we want
to test
H0 : µ = 0, vs H1 : µ ≠ 0.

As you can see it is crucial for testing hypotheses that we know the distribution of the statistic under
the null hypothesis. This also explains the distinction between simple and composite hypotheses: it’s
obviously much easier to compute probabilities under a simple hypothesis where the parameter values
are completely determined, while if the distribution depends on a parameter and all we know is that the
parameter lies in some set then we don’t know the distribution.
We know that if H0 is true then X̄ ∼ N (0, 1/n) and therefore if H0 is true then X̄ shouldn’t be too
far from 0. Of course, since X̄ is normally distributed, there’s always a chance that X̄ is extreme even if
H0 is true. It is possible, but very unlikely, and this is the idea behind statistical tests. We are looking for
a statistic that we can compute from the data and whose distribution we know, at least approximately
under the null hypothesis. In our case say that n = 10 and that the sample mean is x̄ = .90 and thus z = √n x̄ ≈ 2.85. Since if H0 were true this should be a sample from a standard normal, we compute
the probability that a value at least as extreme as 2.85 would be observed, that is

P(|Z| > 2.85) ≈ .00443,

this is the p-value and it’s the probability of observing something as extreme as 2.85. Thus if H0 were
true and we repeated this experiment 1000 times we would observe something as extreme as 2.85 only
around 4 to 5 times. This should really make you doubt whether H0 is true.
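The p-value quoted above takes a couple of lines to reproduce (a sketch assuming scipy is available; the numbers are those of the example):

from math import sqrt
from scipy.stats import norm

n, xbar = 10, 0.90
z = sqrt(n) * xbar                   # z = sqrt(n) * xbar, since sigma = 1 and mu0 = 0
p_two_sided = 2 * norm.sf(abs(z))    # P(|Z| > |z|) under H0
print(z, p_two_sided)                # roughly 2.85 and 0.0044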

Remark 53. The p-value is NOT the probability that the null hypothesis is correct, as in our current
setup H0 is either true or false deterministically.

5.2.1 Hypothesis Testing


Sometimes rather than computing and reporting the p-value we have to make a decision: either reject
H0 or we don't reject H0. This may be the preferred setup for testing a new drug; you either decide that
it is better than the previous version and launch it to market or you don’t. This type of test is then defined
in terms of a rejection region.

Definition 38 (Rejection or Critical Region). The rejection or critical region is a subset R ⊂ Rn such
that

• if x ∈ R we reject H0 ;

• if x ∉ R we do not reject H0.

5.2.1.1 One-sided vs two-sided alternatives

The alternative H1 : µ ≠ 0 is a two-sided alternative. In order to reject H0 we want to see large absolute values of z. On the other hand if we are testing H0 against H1′ : µ > 0, then a large negative value of z, although very unlikely under H0, would be even less likely under any µ > 0. Therefore for the one-sided alternative H1′, the p-value should be computed as

p′ = P(z(X) > z(x)|H0),

as opposed to p = P(|z(X)| > |z(x)| | H0).


Notice that if under H0, z(X) ∼ N(0, 1), then p = 2p′. This means that the p-value may double by going from a one-sided to a two-sided alternative. Consequently, whether you use a one- or a two-sided
test should be decided before you even look at the data based on the risks involved with a type I error.
On the other hand the rejection region for a two-sided test will also be different to that for a one-sided
test. Always think about which extremes cast doubt on the null hypothesis before writing down the
rejection region.

5.2.2 Types of errors


The statistical hypothesis H0 can be either true or not, and we can reject it or not. Obviously we would
like to reject it when it is false and not reject it otherwise; however, as the statistic is random, there will be cases when we make the wrong decision. As you can see in Figure 5.2.1, there are two types of errors. Each
type of error will happen with a certain probability.

Figure 5.2.1: Type I and type II errors.

Definition 39. A Type I error is committed when we reject the null hypothesis H0 although it is true.
The probability of a type I error is denoted by α and is called the size, or significance-level of the test. If
H0 : θ ∈ Θ0 is composite then we define the size of the test by

α = sup_{θ∈Θ0} P(X ∈ R|θ).

A Type II error is committed when we accept the null hypothesis H0 although it is false. The proba-
bility of a type II error is denoted by β. The power of a test is the probability of detecting a false null
hypothesis and is thus 1 − β.

Hypothesis testing has a certain asymmetry between the two hypotheses, since we are acting like a
court of law: the null hypothesis is true until proven wrong. Because of this we place more weight on

avoiding type I errors, that is wrongly rejecting the null hypothesis, and therefore we want the probability
of a type I error to be within certain bounds that we deem acceptable.
When we set the (say) 5%-significance level, we are accepting that if we repeat the whole procedure
many times, around 5% of the time we will make the wrong decision by rejecting a true null hypothesis. Remember
that this probability is indeed 5%, only if all the underlying assumptions (e.g. normality, randomness of
the sample, etc) are true. Therefore when testing a hypothesis, we choose a priori the size of the test,
based on the potential consequences of a type I error. Typical values are 10%, 5%, 1% and so on.
Example 27. In our running example, we can use the z-statistic to test the hypothesis that µ = 0, that is z = x̄/(σ/√n). At the 5% level the rejection region is the set {|z| > 1.96}. If we insist on the 1% level, then the critical region is the set {|z| > 2.58} and so on.
With the one-sided alternative hypothesis H1′ : µ > 0, the critical region at 5% is {x : z(x) > 1.64} while at 1% it is {x : z(x) > 2.33}.

You can now see where we are going. To test a null hypothesis H0 we work out the probability
distribution for the outcomes of the experiment given H0 . We then choose a critical region of the sample
space of outcomes that contains only 5% (say) of the probability. The problem is that there are naturally
many such regions, and we need to know how to find the best one.
Example 28. Suppose we are in the usual scenario of testing H0 : µ = 0 against H1 : µ ≠ 0 for the mean of a N(µ, 1) population. We use the z-statistic at the 5% significance level. We want to compute the power, that is the probability of correctly rejecting the null hypothesis when µ ∈ ω1, where ω1 = {µ : µ ≠ 0}. Therefore H1 is a composite hypothesis and thus the power will depend on the particular value of µ ∈ ω1. Of course the rejection region at the 5% level is fixed and equal to {|z| > 1.96}, and thus to find the power we need to compute the probability Pµ(|Z| > 1.96), where Pµ denotes that we are assuming that Xi ∼ N(µ, 1), and therefore under Pµ

Z := √n X̄ ∼ N(√n µ, 1).

Thus

β(µ) = Pµ(|Z| > 1.96)
     = Pµ(Z > 1.96) + Pµ(Z < −1.96)
     = Pµ(Z − √n µ > 1.96 − √n µ) + Pµ(Z − √n µ < −1.96 − √n µ)
     = P(Z̃ > 1.96 − √n µ) + P(Z̃ < −1.96 − √n µ),

where Z̃ ∼ N(0, 1). The power is plotted in Figure 5.2.2. In Figure 5.2.3 we compare the power of the two-sided test against that of the one-sided test with H1′ : µ > 0. You can see that the one-sided test has more power for µ > 0, but for µ < 0 its power is practically 0.
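The power curves in Figures 5.2.2 and 5.2.3 can be recomputed directly from the last display. A minimal sketch, assuming numpy and scipy are available; the grid of µ values is an arbitrary choice:

import numpy as np
from scipy.stats import norm

# Power of the two-sided z-test of H0: mu = 0 at the 5% level:
# beta(mu) = P(Z~ > 1.96 - sqrt(n) mu) + P(Z~ < -1.96 - sqrt(n) mu).
def power_two_sided(mu, n):
    return norm.sf(1.96 - np.sqrt(n) * mu) + norm.cdf(-1.96 - np.sqrt(n) * mu)

mus = np.linspace(-1.0, 1.0, 9)
print(power_two_sided(mus, n=20))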

5.3 The Neyman-Pearson Lemma


It is clear that among all tests with a given size we would prefer one with the highest power. At the end
of the day, although we want to be careful not to reject the null when it’s true, we want to have a good
chance of detecting when it is wrong.
Suppose we have two hypotheses

H0 : θ ∈ ω0 H1 : θ ∈ ω1

where ω0 , ω1 ⊂ Ω. To test the null hypothesis H0 against the alternative H1 we want a test such that
P(R|H0 ) ≤ α for all θ ∈ ω0 while at the same time 1 − P(Rc |H1 ) = β(θ) is as large as possible for all
θ ∈ ω1 .

Figure 5.2.2: The power for different values of µ with n = 1 (blue), n = 20 (orange), n = 100 (green) and n = 1000 (red).

In other words, we try to find a test with the smallest Type II error for a given (usually 5% or 1%)
Type I error.
The following result gives us exactly this.
Theorem 54 (Neyman-Pearson Lemma). Suppose we are given the i.i.d. sample X1 , . . . , Xn from a
distribution with parameter θ and we want to test the simple null hypothesis H0 : θ = θ0 against the
simple alternative H1 : θ = θ1 . Let L(θ) = L(θ; x) denote the likelihood of the parameter value θ
given observations x = (x1 , . . . , xn ). Then for any given size α the test that maximises the power at θ1
has rejection region
R := {x = (x1, . . . , xn) : L(θ0; x)/L(θ1; x) ≤ k},
where k is chosen so that the test has size α,

P(x ∈ R|H0 ) = α.

This test has maximum power among all tests of H0 vs H1 of size at most α.

Proof. Let X = (X1 , . . . , Xn ) be your random sample. Consider any test of size ≤ α with rejection
region A. Then
P(X ∈ A|H0) ≤ α.
Recall the definition of the indicator 1A (x) = 1 if x ∈ A and 0 otherwise. Let R and k be as in the
statement of the theorem.
Notice that if x ∈ R then by definition of R
L(θ1; x) ≥ (1/k) L(θ0; x),
and
1R (x) ≥ 1A (x).
On the other hand if x ∉ R,

L(θ1; x) ≤ (1/k) L(θ0; x),

Figure 5.2.3: The power of the one-sided (orange) and the two-sided (blue) test (n = 10).

and trivially, since x ∉ R,

1R(x) ≤ 1A(x).
Thus overall for all x we have that
[L(θ1; x) − (1/k) L(θ0; x)] × [1R(x) − 1A(x)] ≥ 0,
since the two factors have the same sign. Integration does not alter the sign and therefore
0 ≤ ∫ [L(θ1; x) − (1/k) L(θ0; x)] × [1R(x) − 1A(x)] dx
  = ∫ L(θ1; x) 1R(x) dx − ∫ L(θ1; x) 1A(x) dx − (1/k) ∫ L(θ0; x) 1R(x) dx + (1/k) ∫ L(θ0; x) 1A(x) dx
  = ∫_R f(x; θ1) dx − ∫_A f(x; θ1) dx − (1/k) ∫_R f(x; θ0) dx + (1/k) ∫_A f(x; θ0) dx
  = P(X ∈ R|H1) − P(X ∈ A|H1) − (1/k)[P(X ∈ R|H0) − P(X ∈ A|H0)].    (5.3.1)
Recall that we have chosen k such that P(X ∈ R|H0) = α, while P(X ∈ A|H0) ≤ α, and thus the bracketed term in Equation 5.3.1 is non-negative. Therefore we conclude that

P(X ∈ R|H1 ) ≥ P(X ∈ A|H1 ),

and the result follows.

Let’s see how the Neyman-Pearson Lemma works.

Example 29. Suppose that X1, . . . , Xn are i.i.d. N(µ, 1); that is, we have a normal distribution with unit variance, and we want to test

H0 : µ = 0, vs H1 : µ = 2.

So the pdfs of x = (x1, . . . , xn) under the two hypotheses are

f(x; H0) = (1/√(2π))^n e^{−Σ xi²/2},
f(x; H1) = (1/√(2π))^n e^{−Σ (xi−2)²/2},

and thus the likelihood ratio is

Λ(x) = L(0; x)/L(2; x) = f(x; H0)/f(x; H1) = 1/∏_{i=1}^n e^{−(−4xi+4)/2} = exp[−2 Σ(xi − 1)] = exp[2n(1 − x̄)].

The Neyman-Pearson lemma says that we have to choose k such that

P(x : Λ(x) < k | H0) = α.

This means that the critical region must be of the form

e^{2n(1−x̄)} < k,

or equivalently

x̄ > 1 − (1/2n) log k.
So far we are only dealing with the form of the critical region, so we might as well write this simply as
x̄ > C. The next step is to determine C, but this is actually a familiar calculation.
We now know that the form of the critical region is x̄ > C. But the critical region is by definition
the region in which we reject H0 , and we determine its size by the requirement that the Type I error, α
should be some given amount, usually .05. So we have to choose C such that
P(x̄ > C | H0 ) = .05.

Now when H0 is true, the distribution of x̄ is normal with mean zero and variance 1/n. So we need to find C such that

∫_z^∞ (1/√(2π)) e^{−ζ²/2} dζ = .05,

where

z = (C − 0)/(1/√n) = C√n.

Now the 95% point of the standardised normal density is 1.65, so we conclude that the most powerful test is to reject H0 if z > 1.65, i.e. if x̄ > C = 1.65/√n. So the NP lemma manages to produce the
right tail test, which is what we would have chosen anyway. Let's see what happens in a situation in
which we would have expected a left tail test. Consider the same problem: normal density with unit
variance, H0 : µ = 0, only this time H1 : µ = −2. The calculation is the same, apart from a change in
sign:
f(x; H0) = (1/√(2π)) e^{−x²/2},    f(x; H1) = (1/√(2π)) e^{−(x+2)²/2},

and for a sample of size n this leads to the likelihood ratio

Λ = ∏_{i=1}^n (1/√(2π)) e^{−xi²/2} / ∏_{i=1}^n (1/√(2π)) e^{−(xi+2)²/2} = 1/∏_{i=1}^n e^{−(4xi+4)/2} = exp[2 Σ(xi + 1)] = exp[2n(1 + x̄)].

The Neyman-Pearson lemma says that we have to choose a region for which Λ < k for some constant k
which we have to determine. Here that means that the critical region must be of the form
e^{2n(1+x̄)} < k,

or

x̄ < (1/2n) log k − 1,

which is indeed a left tail test. We now need to determine C such that P(x̄ < C|H0) = 0.05, and this is essentially the same calculation as before, so we reject H0 if x̄ < −1.65/√n.

5.4 Uniformly most powerful.
Although we got the answer we were expecting in both cases, these were not really the sorts of problems
we have been dealing with up to now. We don’t usually have an alternative hypothesis like µ = 2. It’s
usually something like µ > 0. In other words, alternative hypotheses are generally composite rather
than simple. (Null hypotheses can be composite too, but let’s not digress!)
Sometimes a test can be found that has the greatest power against all the alternatives in a composite
alternative hypothesis. Such a test is called uniformly most powerful.
Definition 40 (Uniformly most powerful test). Let H0 be a null hypothesis and let H1 be a composite
alternative. Then a test is said to be uniformly most powerful (UMP) if it is the most powerful test against any simple hypothesis included in H1.
Example 30. Consider the same example as before: a sample of size n from N (µ, 1), with H0 : µ = 0
against H1 : µ > 0. Then the likelihood ratio will of course depend on the specific alternative µ:

Λ(x; µ) = 1/∏_{i=1}^n e^{−(−2µxi+µ²)/2} = exp[−µ Σ(xi − µ/2)] = exp[nµ(µ/2 − x̄)].

We are to choose a region of the form


e^{nµ(µ/2−x̄)} < k,

and taking logarithms again, this becomes

nµ(µ/2 − x̄) < log k.

By hypothesis µ > 0, so dividing by nµ does not change the direction of the inequality, and rearranging gives

x̄ > µ/2 − log k/(nµ) =: k′.

Again, we are not concerned about the value of k′, but only the form of the region, which is that for which x̄ is greater than some value. Then to determine k′ and thus the rejection region we just consult the relevant table, and in our case say at the 5% level k′ is set at 1.645/√n. Notice that this value, and thus the rejection region, is completely independent of the true value of µ for all µ > 0.
Then because the calculation is exactly the same for any individual alternative simple hypothesis
H1 : µ = µ1 > 0, it also works for the composite hypothesis H1 : µ > 0.
If the alternative hypothesis was µ < 0, then the whole calculation would be the same until the point
at which we divide by µ. Since this is negative, dividing by it reverses the inequality, and so we conclude
that the form of the critical region is
x̄ < µ/2 − log k/(nµ),

which leads us to the usual 1-tail test but on −x̄, which is UMP.
Although for one-sided tests we can find UMP tests, it is not the case for two-sided tests. Suppose
that the alternative is µ ≠ 0. Again the calculations are the same, up to the division by µ. The Neyman-Pearson lemma still tells us that we are supposed to find a critical region for which

nµ(µ/2 − x̄) < log k,

which now means either

x̄ > µ/2 − log k/(nµ)    or    x̄ < µ/2 − log k/(nµ),

depending on the sign of µ. From the symmetry of the expression, we see that if we are to choose a single k to cover both cases, the critical region must be symmetric, so a possible test would be

|x̄| > |µ/2 − log k/(nµ)|,

i.e. the usual 2-tail test. But notice that this is not UMP. For µ > 0 we have seen from the plot in
Figure 5.2.3 that the one-sided test with critical region {x̄ > k} would be more powerful, and similarly for µ < 0 and the corresponding left-tail region.

Example 31. We are given a sample of size n from the exponential distribution f (x; λ) = λe−λx and
we are asked to test the null hypothesis H0 : λ = λ0 against the alternative H1 : λ = λ1 .
According to the Neyman-Pearson lemma we require
k > Λ = L(x; H0)/L(x; H1) = ∏ λ0 e^{−λ0 xi} / ∏ λ1 e^{−λ1 xi} = (λ0/λ1)^n e^{(λ1−λ0) Σxi},

or equivalently

(1/n) log k > log(λ0/λ1) + (λ1 − λ0) x̄.
We can see that as before, what happens next depends on the sign of (λ1 − λ0 ). If (λ1 > λ0 ) then
the critical region is small values of x̄, which is as expected because E(X̄) = 1/λ. The critical region is
thus the interval R = (0, c), where c has to be determined.
For this we need to choose c such that the probability of obtaining a value of x̄ less than or equal to
c is a predetermined α, typically 5% . We can do this because we know that the sum of n independent
and identically distributed exponential variables has the gamma density. If x̄ = c then the sum of the
variables is nc, so we require
α = [λ0^n / Γ(n)] ∫_0^{nc} x^{n−1} e^{−λ0 x} dx.
For the case n = 1 we choose c such that
α = P(X̄ < c | H0) = ∫_0^c λ0 e^{−λ0 x} dx = 1 − e^{−λ0 c},

or equivalently

c = (1/λ0) log(1/(1 − α)),
and this will also be UMP for H1 : λ > λ0 .
To compute the probability of a Type II error we need the probability that x does not lie in the rejection
region if the alternative hypothesis is true, i.e.
β = ∫_c^∞ λ1 e^{−λ1 x} dx = e^{−λ1 c}.

Alternatively, if (λ1 < λ0 ) then the critical region is large values of x̄, i.e. all values of x̄ greater than c
where c now satisfies (taking n = 1 for simplicity)
α = P(X̄ > c | H0) = ∫_c^∞ λ0 e^{−λ0 x} dx = e^{−λ0 c},

and thus

c = (1/λ0) log(1/α).
This time the probability of the Type II error is
β = ∫_0^c λ1 e^{−λ1 x} dx = 1 − e^{−λ1 c}.
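For the n = 1 case both the critical value c and the two error probabilities follow directly from these formulas. A small sketch (numpy assumed; α, λ0 and λ1 below are arbitrary illustrative values):

import numpy as np

# Neyman-Pearson test for Exp(lambda) with n = 1:
# H0: lambda = lam0 vs H1: lambda = lam1 > lam0, reject H0 if xbar < c.
alpha, lam0, lam1 = 0.05, 1.0, 3.0

c = np.log(1.0 / (1.0 - alpha)) / lam0    # c = (1/lam0) log(1/(1-alpha))
beta = np.exp(-lam1 * c)                  # P(type II error) = P(X >= c | lam1)
print(c, beta)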

5.5 Likelihood Ratio Test
The Neyman-Pearson lemma is very useful when it works, but it’s really only designed for simple
hypotheses. It can, as we have seen, be used for composite hypotheses as well, but only if we are a bit
fortunate in how things work out.
A more useful approach is the following. Say that X1 , . . . , Xn is a random sample from f (x; θ).
We compare the null hypothesis H0 : θ ∈ Θ0 against the general alternative H1 : θ ∈ Θ, so that H0 is a special case of H1. Here we work out the likelihood function under each hypothesis, using in each case the maximum likelihood estimates of all unknown parameters given that hypothesis. Then we use the ratio
of these likelihoods as a measure of the relative probabilities of the hypotheses. As with the Neyman-
Pearson lemma, this gives us only the form of the test; the actual boundary usually has to be worked out
separately.
Thus consider a family of distributions with a parameter θ (remember, θ can be a vector). Let θ̂ be the maximum likelihood estimator of θ over Θ1 and let θ̂0 be the maximum likelihood estimator given that H0 is true. Then the likelihood ratio Λ(x) is defined by

Λ = sup_{θ∈Θ0} L(θ; x) / sup_{θ∈Θ1} L(θ; x) = L(θ̂0)/L(θ̂).

Clearly Λ ≤ 1 with equality only if θ̂0 = θ̂, in which case there is clearly no reason at all to infer that
H0 is false. When, however, Λ is small, we infer that H0 is probably false, and so the principle of the
test is to reject H0 for Λ < k for some k which has to be determined. For a test of size α we must then
choose k such that
sup_{θ∈Θ0} P(Λ(X) ≤ k|θ) = α.

Note that while this reasoning is plausible, we have no guarantee that it leads to the best test. On the
other hand, the likelihood ratio test can often be used where the Neyman-Pearson lemma cannot.
Let’s see how this works.
Example 32. We are given a random sample X1 , . . . , Xn from a N (µ, σ 2 ) population with both µ and
σ² unknown. We want to test H0 : µ = 0 against the general alternative H1 : µ ≠ 0.
There are two parameters of the distribution, µ and σ and we have to estimate them for both the null
hypothesis and the alternative. The likelihood function is
L(µ, σ²; x) = ∏_{i=1}^n (1/(σ√(2π))) e^{−(xi−µ)²/(2σ²)}.

We have already found the maximum likelihood estimates of µ and σ. This was done by maximising
log L = −n log σ − (n/2) log(2π) − (1/(2σ²)) Σ(xi − µ)²,
by the usual method of equating to zero the partial derivatives with respect to µ and σ:
0 = Σ(xi − µ),
0 = −n/σ + (1/σ³) Σ(xi − µ)².

When we solve these we find µ̂ = x̄ and then σ̂² = (1/n) Σ(xi − x̄)².

These are the unconstrained maximum likelihood estimates. Note that the maximum likelihood esti-
mate of σ 2 is biased. To apply a likelihood ratio test we need the maximum likelihood estimates given
that the null hypothesis is true. In that case µ is fixed, so trivially µ̂0 = µ0 . The maximum likelihood
estimate of the variance is then just σ̂0² = (1/n) Σ(xi − µ0)².

We then substitute back into the likelihood functions, and we find, in either case,
sup_{µ,σ²} L(µ, σ²) = L(µ̂, σ̂²) = ∏_{i=1}^n (1/(σ̂√(2π))) e^{−(xi−µ̂)²/(2σ̂²)} = e^{−n/2} / (σ̂^n (√(2π))^n),

since σ̂² = (1/n) Σ(xi − µ̂)², and similarly from σ̂0² = (1/n) Σ(xi − µ0)², the likelihood ratio is given by

Λ = L(µ0, σ̂0²) / L(µ̂, σ̂²) = σ̂^n / σ̂0^n = (Σ(xi − x̄)² / Σ(xi − µ0)²)^{n/2},

i.e. the test tells us to reject the null hypothesis for small values of

Σ(xi − x̄)² / Σ(xi − µ0)²,

or, equivalently, for large values of

y = Σ(xi − µ0)² / Σ(xi − x̄)².
This is actually the answer, but it doesn't look familiar because we are expecting something with (x̄ − µ0) in the numerator. We can arrange that by using the usual trick

Σ(xi − µ0)² = Σ((xi − x̄) + (x̄ − µ0))²
            = Σ(xi − x̄)² + 2(x̄ − µ0) Σ(xi − x̄) + Σ(x̄ − µ0)²
            = Σ(xi − x̄)² + n(x̄ − µ0)².

We now have
y = 1 + n(x̄ − µ0)² / Σ(xi − x̄)².

This will be large if the second term is large. But the second term is just the square of

|x̄ − µ0| / (√(Σ(xi − x̄)²) / √n).

We will see in the next chapter that this is proportional to the very popular t-statistic, up to a factor √(n − 1).

Example 33 (Exponential Distribution). We want to test the null hypothesis H0 : λ = λ0 against the
alternative H1 : λ ≠ λ0. The likelihood function is
L = ∏_{i=1}^n λe^{−λxi} = λ^n e^{−nλx̄}.

As usual, we take logarithms:

log L = n log λ − nλx̄,
∂ log L/∂λ = n/λ − nx̄,

so the maximum likelihood estimate is λ̂ = 1/x̄. On the null hypothesis, λ̂0 = λ0 . The likelihood ratio
is therefore

Λ = λ̂0^n e^{−nλ̂0 x̄} / (λ̂^n e^{−nλ̂x̄}) = (λ0 x̄)^n e^{−nλ0 x̄} e^n,
log Λ = n log λ0 + n log x̄ − nλ0 x̄ + n.

Differentiating with respect to x̄ shows that this has an extremum with respect to x̄ at n/x̄ = nλ0 and
this is clearly a maximum, since it corresponds to the constrained maximum likelihood estimate being
equal to the unconstrained, and makes Λ = 1.
We are looking for values of x̄ that make log Λ small, and since the first and last terms are constants
(for given λ0 , i.e. for a given H0 ) and we can divide by n without loss, the critical region is

log x̄ − λ0 x̄ < k,

for some k to be determined. Now if you look at the plot of the function log x−λ0 x given in Figure 5.5.1
you will notice that it is concave with a unique maximum, so that the region above translates into a


Figure 5.5.1: The plot of log(x) − λ0 x.

critical region of the form R = {x̄ < a} ∪ {x̄ > b} with a and b chosen so that

P(x̄ < a|H0 ) + P(x̄ > b|H0 ) = α, (∗)

log a − λ0 a = log b − λ0 b. (∗∗)


The first of these states that the size of the Type I error is to be α and the second that we are to cut off at
values of a on the left and b on the right such that the value of log Λ and hence of Λ is the same. (Note
that when we applied the Neyman Pearson lemma to the exponential density we obtained a different
condition, viz. x̄ < C if λ1 > λ0 and x̄ > C if λ1 < λ0 .)
The sum of n independent identically distributed exponential variables has the gamma distribution, so
(*) becomes
[λ0^n / Γ(n)] ∫_{na}^{nb} u^{n−1} e^{−λ0 u} du = 1 − α.
We can actually integrate this explicitly, but we get an expression that’s too complicated to work with
and we have to do everything numerically.

Chapter 6

Tests

In the previous Chapter we introduced the general theory of testing along with some general procedures
for constructing tests with useful properties. In this chapter we will introduce and use a large variety of
different tests and learn about their assumptions and their properties. This will help you build an arsenal
of techniques for many different situations.
In general the tests we will go through are divided into two main categories, parametric and non-parametric.
Parametric tests have more underlying assumptions about the population, in particular assuming that the
observations arise from a distribution depending on a number of parameters. This means that if the
assumptions fail then so does our conclusion, which may in particular mean that p-values are under-
estimated. Non-parametric tests are combinatorial in nature and assume far less than parametric tests.
As a result they are much more robust. However nothing comes for free. What you gain in terms of
robustness you lose in terms of power: if the assumptions of a parametric test hold then it will generally
be more likely to reject a false null hypothesis than a non-parametric test of the same size.

6.1 The z-test


You have already seen that the z-statistic can be used to test for the unknown mean of a population. Here
we will be slightly more formal and stress more on the underlying assumptions.
In general the z-test is based on the z-statistic, which is in turn based on the sample mean X̄ and has the
form
Z = (sample mean − expected mean) / standard error,
where the standard error is basically the standard deviation of the sample mean.

6.1.1 Population mean


Let X1 , . . . , Xn be a random sample, that is i.i.d. random variables, from a N (µ, σ 2 ) population, where
σ is known. Suppose that we want to test
• H0 : µ = µ0 , against the alternative
• H1 : µ ≠ µ0 .
Under the above assumptions, under the null hypothesis, we know that
X̄ ∼ N (µ0 , σ 2 /n).
Then the z-statistic in this case is given by
Z = (X̄ − µ0) / (σ/√n).

Under the null hypothesis we know that Z ∼ N (0, 1) and we can use this either to construct a rejection
region for hypothesis testing, or to compute p-values for significance testing.
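In code the whole procedure is a few lines. The sketch below (assuming numpy and scipy are available; the simulated data, µ0 and σ are made up purely for illustration) computes the statistic and a two-sided p-value.

import numpy as np
from scipy.stats import norm

def z_test(x, mu0, sigma):
    """Two-sided z-test of H0: mu = mu0 for a sample x with known sigma."""
    z = (np.mean(x) - mu0) / (sigma / np.sqrt(len(x)))
    return z, 2 * norm.sf(abs(z))

# Illustration with simulated data (all values below are arbitrary).
rng = np.random.default_rng(3)
x = rng.normal(loc=0.3, scale=1.0, size=25)
print(z_test(x, mu0=0.0, sigma=1.0))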

6.1.2 Testing for proportions


Suppose we observe a number n of independent trials each with unknown probability of success p. We
want to test

• H0 : p = p0 , against

• H1 : p > p0 .

Suppose that we observe X = x successes.


Now you learned last year that in fact X ∼ Bin(n, p). The calculations quickly become messy if one
tries to use this fact for significance or hypothesis testing, but they are in principle possible especially
numerically. But there is a faster, although approximate way which is based on the central limit theorem.
The number X of successes in n independent trials with probability of success p can be written as
X = X1 + · · · + Xn , where the Xi ∼ Ber(p) are i.i.d. We readily have

E Xi = p,    E Xi² = p,    var(Xi) = p(1 − p).

Therefore if we apply the central limit theorem we get that for all z
P((X − np)/√(np(1 − p)) ≤ z) → Φ(z),

where

Φ(z) := ∫_{−∞}^z (1/√(2π)) e^{−x²/2} dx.
In other words the distribution of the z-statistic
Z = (X − np)/√(np(1 − p)) = (X̄ − p)/√(p(1 − p)/n),
is in this case also approximately standard normal.
Remark 55. Notice that the same holds for any distribution that falls within the realm of the central
limit theorem. That is assuming that Xi ∼ f (·; θ), such that E X = µ and var(X) = σ 2 < ∞, then

Z := (X̄n − µ)/(σ/√n),
is asymptotically N (0, 1), and thus for large enough samples we can use the z-test as usual.
Word of caution: this is only true approximately, and can fail spectacularly if for example var(X) = ∞, or even if n is not large enough.
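A sketch of the corresponding calculation for a proportion (scipy assumed; the counts and p0 below are arbitrary illustration values, and the standard error is computed under H0):

import numpy as np
from scipy.stats import norm

def prop_z_test(x, n, p0):
    """One-sided z-test of H0: p = p0 vs H1: p > p0 via the normal approximation."""
    phat = x / n
    z = (phat - p0) / np.sqrt(p0 * (1 - p0) / n)   # standard error under H0
    return z, norm.sf(z)                           # one-sided p-value P(Z > z)

print(prop_z_test(x=60, n=100, p0=0.5))            # z = 2.0, p roughly 0.023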

6.1.3 Two Sample Tests


Suppose that we have two independent random samples X1, . . . , X_{nX}, i.i.d. N(µX, σX²), and Y1, . . . , Y_{nY}, i.i.d. N(µY, σY²), where σX², σY² are known and µX, µY are unknown. We hypothesise that the mean of the first distribution is less than that of the second.
We want to test

• H0 : µY = µX ; against

• H1 : µY > µX .

Using standard properties of expectations we deduce that


E(X̄) = µX,    var(X̄) = σX²/nX,
E(Ȳ) = µY,    var(Ȳ) = σY²/nY.

More importantly, since the Xi are also independent of the Yi, we have

E(X̄ − Ȳ) = µX − µY,    var(X̄ − Ȳ) = σX²/nX + σY²/nY,

and thus

(X̄ − Ȳ − (µX − µY)) / √(σX²/nX + σY²/nY) ∼ N(0, 1).

Now under the null hypothesis we know that µX − µY = 0 and thus we use the statistic

Z := (X̄ − Ȳ) / √(σX²/nX + σY²/nY) ∼ N(0, 1).

Definition 41. The quantity


σp² := σX²/nX + σY²/nY
is often called the pooled variance.

Notice that if nX, nY are large then the standard deviation σp will be small, and therefore if µX = µY we expect small values of x̄ − ȳ. If on the other hand we observe a large negative value for x̄ − ȳ then
that casts doubt on the null hypothesis and suggests that indeed it may be that µY > µX .

Remark 56 (Comparing proportions). We can use this formula for comparing proportions too. In that
case, if the proportions are p1 and p2 , the Z statistic is
(p1 − p2) / √(p1 q1/n1 + p2 q2/n2),

and under the null hypothesis this is for large n1 , n2 asymptotically N (0, 1) by the Central limit theorem.

Remark 57. If you are testing “no difference” between the two proportions, then it is recommended to
pool the variance. The Z statistic is:
(p̂1 − p̂2) / √(p̂0(1 − p̂0)(1/n1 + 1/n2)),

where p̂0 = (n1 p̂1 + n2 p̂2)/(n1 + n2), and under the null hypothesis H0 : p1 = p2 (= p0), this is for large n1, n2
asymptotically N (0, 1), by the Central limit theorem.
Example 34. We have 100 tires of Brand X and 90 tires of Brand Y. We assume that X1, . . . , X100 are i.i.d. N(µX, σX²) and Y1, . . . , Y90 are i.i.d. N(µY, σY²), where σX = 2.6, σY = 2.8 and µX, µY are unknown.
We observe x̄ = 32.1 and ȳ = 33.4. Can we infer that the tread-life of Brand Y is greater?

Thus the statistic we should use is
Z := (X̄ − Ȳ) / √(σX²/nX + σY²/nY) = (32.1 − 33.4) / √(2.6²/100 + 2.8²/90) = −3.31.

To find the p-value we check


P(Z < −3.31) ≈ .00046648,
which is much smaller than even the 1% level (the corresponding one-tailed critical value of Z being 2.33), so we reject the null and conclude that the tread-life of Brand Y is statistically highly significantly better.
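The arithmetic is easy to check in code (a sketch assuming numpy and scipy; the numbers are those of the example):

import numpy as np
from scipy.stats import norm

# Two-sample z-test with known population standard deviations (Example 34).
xbar, ybar = 32.1, 33.4
sigma_x, sigma_y = 2.6, 2.8
n_x, n_y = 100, 90

z = (xbar - ybar) / np.sqrt(sigma_x**2 / n_x + sigma_y**2 / n_y)
p_one_sided = norm.cdf(z)            # P(Z < z), since H1 is mu_Y > mu_X
print(z, p_one_sided)                # roughly -3.31 and 0.00047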
Example 35. Of 100 seeds of a new strain of plant, 90 germinated, whereas of 200 seeds of the old
strain, 160 germinated. Can we infer that the new strain is better?

Z = (0.9 − 0.8) / √(0.83 × 0.17 × (1/100 + 1/200)) = 2.17.
The question was whether the new strain is better so we compare with the 5% one-tailed value which is
1.64, so the new strain is significantly better, at 5% significance level.

6.2 The t-test


6.2.1 Student’s t-test
An important shortcoming of the z-test is that it requires us to know what the standard deviation of the
population is, and that is not generally available. Of course you can estimate the standard deviation from
the data you’ve got, but then the z-statistic will no longer be normally distributed and thus we can no
longer use the z-test.
A way round this was devised by a man called Gosset, who was working at the Guinness brewery in
Dublin. As he was not allowed to publish work that arose out of his employment, he used the pseudonym
“Student”, which was a common thing to do in those days.
Suppose that X1 , . . . , Xn ∼ N (µ, σ 2 ) where µ and σ are both unknown. Of course we know that
X̄ ∼ N(µ, σ²/n).
So if the we want to test for example

• H0 : µ = µ0 ; against
• H1 : µ > µ0 ,

then under the null hypothesis


(X̄ − µ0) / √(σ²/n) ∼ N(0, 1).
But the left-hand side is NOT a statistic, since we cannot compute it: we don't know σ. A way around this problem is of course to estimate it, and we already know that

S² := (1/(n − 1)) Σ_{i=1}^n (Xi − X̄)²,

is an unbiased, consistent estimate of σ². Therefore one might ask why we don't use instead the statistic

T := (X̄ − µ0) / (S/√n),

and this is indeed the right idea. We can compute this from the data so it is indeed a statistic. Do
we know its distribution under the null hypothesis? Of course we do, simply cast your minds back to
Chapter 3 and Theorem 32 where we showed that T ∼ tn−1 , Student’s t-distribution with n − 1 degrees
of freedom.

Example 36. A builder's yard sells sacks of sand whose weights are i.i.d. samples from N(µ, σ²), where µ is
supposed to be 85 lbs.
We want to test

• H0 : µ = 85; against

• H1 : µ ≠ 85.

Ten sacks are selected at random, and the weights turn out to be

87.5 84.8 84.0 87.1 86.8 83.3 83.5 87.4 84.1 86.6.

The sample mean is x̄ = 85.51. Do we have enough evidence to reject the null hypothesis? We compute the t-statistic

t = (x̄ − µ0) / (s/√n),    s = √(Σ(xi − x̄)²/(n − 1)).

The sum of squares is 26.609, which makes s² = 2.96 and s = 1.72. So the value of the t-statistic is (85.51 − 85)/(1.72/√10) = 0.94. We were asked about a difference, with no indication of the direction
of the difference, so we compare with the 2-tailed value of t on 9df, which is 2.26. Clearly this is not
significant.
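These numbers can be checked in a couple of lines (a sketch assuming scipy is installed; ttest_1samp reports a two-sided p-value, matching the two-tailed comparison above):

import numpy as np
from scipy.stats import ttest_1samp

weights = np.array([87.5, 84.8, 84.0, 87.1, 86.8, 83.3, 83.5, 87.4, 84.1, 86.6])
t, p = ttest_1samp(weights, popmean=85.0)   # two-sided test of H0: mu = 85
print(t, p)                                 # t roughly 0.94, p well above 0.05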

6.2.2 Two Sample t-test


Often the problem is not to compare the sample with a known distribution but to compare the means
of two samples. That is, suppose that X1, . . . , X_{nX} ∼ N(µX, σ²) and Y1, . . . , Y_{nY} ∼ N(µY, σ²) are two independent random samples and that we want to test

• H0 : µX = µY ; against

• H1 : µX ≠ µY .

When σ² = σX² = σY² is known, recall that

Z := (X̄ − Ȳ − (µX − µY)) / √(σX²/nX + σY²/nY) ∼ N(0, 1).

Suppose now that we do not know σX , σY , so we choose to estimate them by the sample variances
sX² := (1/(nX − 1)) Σ_{i=1}^{nX} (Xi − X̄)²,    sY² := (1/(nY − 1)) Σ_{i=1}^{nY} (Yi − Ȳ)².

However if we only used one of the above estimators we would be throwing away information, so the
idea is to combine them to obtain a pooled estimator, that is we want to use a convex combination of
the two. The question is how to combine them.
Suppose for example that nX = nY , so that we have no reason to prefer one estimate over the other,
in which case we could simply use

(1/2) sX² + (1/2) sY² = Σ(Xi − X̄)²/(2(nX − 1)) + Σ(Yi − Ȳ)²/(2(nY − 1))
                      = [Σ(Xi − X̄)² + Σ(Yi − Ȳ)²] / (nX + nY − 2)
                      = [(nX − 1)sX² + (nY − 1)sY²] / (nX + nY − 2).
Now in the general case nX 6= nY notice that the above formula puts extra weight to the estimator from
the larger sample, and this makes perfect sense, since by consistency the larger sample will tend to be
more accurate.
Therefore we arrive at the following pooled estimator

SP² := [(nX − 1)sX² + (nY − 1)sY²] / (nX + nY − 2).

Now if σ² := σX² = σY², then replacing σ² by the pooled estimator in the z-statistic we arrive at

T := (X̄ − Ȳ) / (Sp √(1/nX + 1/nY)).

Of course we need to understand the distribution of the above statistic. Before we do this, we will
look at the denominator. In fact, we already know that
W := (nX + nY − 2)Sp²/σ² = Σ_{i=1}^{nX} (Xi − X̄)²/σ² + Σ_{i=1}^{nY} (Yi − Ȳ)²/σ²,
is the sum of two independent χ2 random variables with (nX − 1) and (nY − 1) degrees of freedom
respectively. Therefore by Theorem 27 we conclude that W has the χ2 distribution with (nX − 1) +
(nY − 1) = nX + nY − 2 degrees of freedom. In addition we know that X̄ is independent of sX , and
Ȳ is independent of sY , so that X̄ − Ȳ is independent of W , and thus finally under the null hypothesis

Z := (X̄ − Ȳ) / (σ √(1/nX + 1/nY)) ∼ N(0, 1).

Therefore
T = (X̄ − Ȳ) / (Sp √(1/nX + 1/nY))
  = [(X̄ − Ȳ) / (σ √(1/nX + 1/nY))] · √(σ²/Sp²)
  = Z / √(W/(nX + nY − 2)) ∼ t_{nX+nY−2},

by Definition 24.
There are two cases to consider. First, let us suppose that the two samples can be supposed to have the
same variance, which would correspond to them coming from the same populations only shifted over a
bit. In that case we can show that the appropriate t-statistic is
t := (x̄ − ȳ) / [√((Σ(xi − x̄)² + Σ(yi − ȳ)²)/(nx + ny − 2)) · √(1/nx + 1/ny)],

and the number of degrees of freedom is nx + ny − 2.

Example 37. We want to test if professional drivers drive more economically than casual drivers. We
randomly selected five professional and five casual drivers to drive the same car over the same route
through city traffic. The petrol consumption figures were

Professionals(= X) 0.94 1.20 1.00 1.06 1.02


Private (= Y ) 1.40 0.98 1.22 1.16 1.34

Formally we assume that X1 , . . . , X5 ∼ N (µX , σ 2 ) and Y1 , . . . , Y5 ∼ N (µY , σ 2 ) are two indepen-


dent random samples, and we want to test

• H0 : µX = µY ;

• H1 : µX < µY .

The means are x̄ = 1.044 and ȳ = 1.220. The sums of squared deviations are (nX − 1)s2X = .0379 and
(nY − 1)s2Y = .1080. These don’t seem very close together, but let’s assume for the time being there is
no problem; later we will see that there’s no significant reason to expect there is. The pooled estimator
is then
sp² = (.0379 + .1080)/8 = .0182,

and thus

t = (1.044 − 1.220) / (sp √(2/5)) = −2.06.

The alternative hypothesis is that the professionals use less petrol, so the correct procedure is to
compare −t = 2.06 with t8,5%, defined such that

P(t8 > t8,5% ) = 5%.

From the appropriate table we find this to be 1.86, so the data do indicate that professionals use less
petrol.
Had the question simply been "Is there a difference?", that is H1 : µX ≠ µY, then the appropriate
test would have been to compare |t| with 2.31, and this would not have been significant. Had the
question been “Do the professionals use more petrol?” then you would compare t with 1.86, and since
−2.06 < 1.86 there is no significant evidence in favour.
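The pooled calculation above can be reproduced with scipy (a sketch; ttest_ind with equal_var=True uses exactly the pooled estimator Sp², and since its reported p-value is two-sided it is halved here for the one-sided alternative):

import numpy as np
from scipy.stats import ttest_ind

pro = np.array([0.94, 1.20, 1.00, 1.06, 1.02])    # professionals (X)
priv = np.array([1.40, 0.98, 1.22, 1.16, 1.34])   # private drivers (Y)

t, p_two_sided = ttest_ind(pro, priv, equal_var=True)   # pooled-variance t-test
p_one_sided = p_two_sided / 2 if t < 0 else 1 - p_two_sided / 2
print(t, p_one_sided)                             # t roughly -2.06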

6.2.2.1 Paired Sample

Example 38. Consider the following experiment. The percentage of aggregated blood platelets in the blood of 11 individuals was measured before and after they smoked a cigarette.

Before 25 25 27 44 30 67 53 53 52 60 28
After 27 29 37 56 46 82 57 80 61 59 43
Difference 2 4 10 12 16 15 4 27 9 -1 15

It seems that smoking a cigarette results in more clotting and we want to test this hypothesis.
Notice that in this particular case, the data has a very particular structure. Just like the previous
scenario, we have two samples so we could just apply the two-sample t-test. But, in addition in this case
the two samples are naturally paired in a way we can exploit to remove some of the variability due to
differences between the subjects of the study.
To make this more precise, suppose that each individual is assigned a label i = 1, . . . , 11. We denote
with Xi the level before the cigarette, and with Yi the level after the cigarette. We additionally assume

that Yi = Xi + Di , where Xi and Di are independent. We interpret Xi as representing the natural level
of clotting present in individual i due to genetic and various other reasons, while Di captures just the
effect of the cigarette. Notice that the reason we conducted the experiment was specifically to measure
the effect of smoking a cigarette, and not the effects due to other factors. Therefore, by looking at the
differences between the pairs we should be able to remove the variability due to any factors other than
smoking.
In essence, what we are saying is that since Yi = Xi + Di and Xi is independent of Di

var(Yi ) = var(Xi ) + var(Di ) > var(Di ),

and thus by looking at the differences we should be able to get tighter estimates. In essence this means
that our hypothesis test should be more powerful than a two-sample test.
Let’s see how we do.

6.2.2.1.1 Paired-sample t-test Suppose that the differences D1, . . . , D11 ∼ N(µd, σ²) are i.i.d. We


want to test

• H0 : µd = 0;

• H1 : µd > 0.

We calculate the average difference and sample standard deviation

d¯ = 10.3, sd = 7.98.

The t-statistic is then given by


t = (10.3 − 0) / (7.98/√11) = 4.28,
and since the test is one-sided we check the tables for the t-distribution with 10 degrees of freedom and
we find the critical value to be 1.81. Therefore we reject the null hypothesis and in fact we get a highly
significant result with a p-value of around 0.0016.
To demonstrate the increase in power compared to the (unpaired) two-sample t-test, it is instructive to perform that test on this simple example as well. The two-sample t-statistic obtained is −1.4164, with 20 degrees of freedom, which then gives a p-value of 0.1721, in which case we would fail to reject the null hypothesis at any reasonable significance level!
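Both calculations are easy to check (a sketch assuming scipy, using the data from the table above; scipy's p-values are two-sided, so they are halved for the one-sided alternative):

import numpy as np
from scipy.stats import ttest_rel, ttest_ind

before = np.array([25, 25, 27, 44, 30, 67, 53, 53, 52, 60, 28])
after  = np.array([27, 29, 37, 56, 46, 82, 57, 80, 61, 59, 43])

t_pair, p_pair = ttest_rel(after, before)   # paired t-test on the differences
t_unp, p_unp = ttest_ind(after, before)     # (inappropriate) unpaired comparison
print(t_pair, p_pair / 2)                   # |t| roughly 4.28, strongly significant
print(t_unp, p_unp / 2)                     # |t| roughly 1.42, much weaker evidence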

6.2.2.2 Confidence Intervals

Since most estimators in practice are based on a sample of the population, there is always a degree of
uncertainty. We sometimes want to know in addition how good our estimate really is.
One way of doing this is to compute a confidence interval. Let µ0 be a possible value of the population
mean µ. Then we ask whether our data would cause us to reject the hypothesis H0 : µ = µ0 . If it would,
then we consider µ0 to be inconsistent with the data, otherwise it is consistent. And if we would have
rejected H0 at the 5% level, then we say we are 95% confident that µ ≠ µ0.
The 95% confidence interval for µ is the set of all values of µ0 which would not have caused us to
reject the hypothesis µ = µ0 at the 5% level. And similarly for the 1%.
Now we would reject the null hypothesis µ = µ0 if

|x̄ − µ0| / (s/√n) ≥ t_{5%,ν df},    ν = n − 1.

The end points of the 95% confidence interval are where the equality is satisfied:

x̄ − µ0 = ±t5%,ν df s/√n,

µ0 = x̄ ± t5%,ν df s/√n,
where it is the 2-tailed t-value we need. As an example, let’s go back to the data from Example 36.

87.5 84.8 84.0 87.1 86.8 83.3 83.5 87.4 84.1 86.6.
We found x̄ = 85.5 and s = 1.72, and since the 2-tailed 95% value of t on 9 degrees of freedom is 2.26, the 95% confidence interval is

µ0 = 85.5 ± 2.26 × 1.72/√10 = 85.5 ± 1.23.
As a sanity check we recall that the original question was whether the sample was inconsistent with a
mean of 85. We found that we could not reject the null hypothesis, and indeed 85 does lie in the interval
(84.3, 86.7).
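As an optional numerical check (a sketch assuming numpy/scipy are available, not part of the original notes), the interval can be computed directly from the t quantile:

```python
# Sketch: 95% t-based confidence interval for the mean of the sample in Example 36.
import numpy as np
from scipy import stats

x = np.array([87.5, 84.8, 84.0, 87.1, 86.8, 83.3, 83.5, 87.4, 84.1, 86.6])
n, xbar, s = len(x), x.mean(), x.std(ddof=1)

t_crit = stats.t.ppf(0.975, df=n - 1)         # two-tailed 5% point on 9 df, about 2.26
half = t_crit * s / np.sqrt(n)
print(xbar - half, xbar + half)               # roughly (84.3, 86.7)
```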

6.2.3 When to use z, when t?


Now so far it has appeared pretty clear when to use the z-test and when to use the t-test. However, look at the following plot comparing the pdf of the t distribution with increasing degrees of freedom to that of a standard normal. You can clearly see that as the number of degrees of freedom increases, the p.d.f. of the t distribution approaches that of a standard normal.


Figure 6.2.1: The probability density functions of the t-distribution with 1(blue),5(orange),10(green)
and 30(red) degrees of freedom, and the pdf of a standard normal(purple).

The reason is very simple. Recall that the t-distribution with ν degrees of freedom was defined as the distribution of

Tν := Z / √( (Y1² + · · · + Yν²)/ν ),
where Y1 , . . . , Yν are i.i.d. standard normal, independent of Z. The key lies in the denominator,
(Y1² + · · · + Yν²)/ν,

which as we know by the strong law of large numbers is a consistent estimator of the mean of Yi², which is the variance of Yi. But recall that we use the t-statistic when we don't know the variance, so we replace it by the sample variance estimator, since we know it is consistent.
Well, it turns out that the consistency of the sample variance kicks in already at moderate sample sizes and is very effective. For sample sizes of 60 or more there is no practical difference between the t-test and the z-test, because the t distribution with more than 60 degrees of freedom is for all practical purposes identical to the standard normal.
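A quick numerical illustration of this convergence (a sketch assuming scipy is available) is to compare two-tailed 5% critical values:

```python
# Sketch: the two-tailed 5% critical value of t approaches the normal value 1.96 as df grows.
from scipy import stats

for df in (5, 10, 30, 60, 120):
    print(df, round(stats.t.ppf(0.975, df), 3))
print("normal:", round(stats.norm.ppf(0.975), 3))    # 1.96
```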
Thus we need a rule of thumb about when to use each test. It is provided in Figure 6.2.2 which also
tells you which distribution to use when constructing confidence intervals for the mean of a population
(or a proportion).

[Flowchart: Do you know σ? If yes, use Z. If not, are you estimating a proportion? If yes, use Z. If not, is the sample large? If yes, use Z; otherwise use T.]

Figure 6.2.2: Flowchart for deciding whether to use Z or T .

6.3 The χ2 -test


So far we have been testing whether the mean of a normal population is equal to a given value, or
possibly whether the means of two populations are equal. More generally, rather than just comparing
the means of two distributions, we could ask whether a sample comes from a given distribution, or
whether two samples come from the same distribution or not.
We will now see such a test, the χ2 test, which is useful in many situations but is most useful for
categorical data, that is data that arise as observations of categorical random variables.
Definition 42 (Categorical random variable). A categorical random variable is a variable that takes values in a finite set of categories, according to some qualitative property.

For example X ∈ {smoker, non-smoker}, X ∈ {heads, tails}, X ∈ {A, B, C} and so on are all
examples of categorical random variables.

6.3.1 Goodness of Fit


Suppose we have n independent observations, where each observation falls into one of the categories 1, . . . , k. Let ni be the number of observations in category i, so that Σ_{i=1}^k ni = n. Let πi be the probability that a single observation falls into category i, where Σ_{i=1}^k πi = 1. Let π = (π1 , . . . , πk ).
We want to test whether the random variable has a given distribution which assigns to the categories
the probabilities π1 , . . . , πk respectively.
To formalise suppose that we have a random sample X1 , . . . , Xn where P(Xi = j) = pj , j = 1, . . . , k
and p = (p1 , . . . , pk ). We want to test

• H0 : p = π; against the alternative

• H1 : p ≠ π.
We have performed the experiment and have n observations, ni in the i-th category. Notice that we must
have n = n1 + · · · + nk .
Under the null hypothesis, the expected count of observations in category i, is of course npi since
we have n trials, each of which has probability pi of resulting in category i. We will base our statistic
on the deviation of the observed count in category i with the expected one. We cannot simply add the
deviations because the result will always be
Σi (ni − npi ) = n − n Σi pi = n − n = 0.
One way to avoid this cancellation is to square each deviation, which would result in

Σi (ni − npi )².
Now this is not too bad as a statistic, since if it takes on large values it casts doubt on the validity of
the null hypothesis. However there is still a slight problem. Notice that if one category, say the first
one, is much more likely than the other ones, then we should take that into account. For example a
deviation of 10 in a category with expected count 1000, should be counted far less than a deviation of 10
in a category with expected count of 20. Therefore we should normalise by the expected count in each
category resulting in the following definition.
Definition 43 (χ²-statistic). The χ²-statistic is given by

X² = Σ_{i=1}^k (ni − npi )² / (npi ).

The statistic defined above can certainly be computed from the data, but do we know its distribution?
In fact we do, at least asymptotically as n → ∞, and its distribution is a χ2 distribution where the
degrees of freedom are computed as follows:
d.f. = (number of categories) − (number of parameters estimated from the data) − 1.

We will now see a range of examples about how to apply the above test.

6.3.2 Comparing with fixed distribution


Suppose that we want to test if a die is fair. We roll it 60 times and record the results. Thus there are six
categories and we want to test
 
• H0 : p = (1/6, . . . , 1/6);

• H1 : p ≠ (1/6, . . . , 1/6).

Of course the expected count in each category is 10. We summarise our observations in the following table.

Observed 16 15 4 6 14 5
Expected 10 10 10 10 10 10

We compute the χ²-statistic

X² = (16 − 10)²/10 + (15 − 10)²/10 + · · · + (5 − 10)²/10
   = 3.6 + 2.5 + 3.6 + 1.6 + 1.6 + 2.5 = 15.4.
We did not estimate any parameters from the data, so the number of degrees of freedom is d.f. = 6 − 1 =
5 and we therefore compare our result with the critical value of the χ2 distribution with 5 degrees of
freedom, which is 11.07 and 15.09 at the 5% and 1% significance levels respectively. Since 15.4 exceeds both, we reject the null hypothesis at either level.
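The same calculation can be scripted; the following sketch (assuming numpy/scipy are available) reproduces the statistic and looks up the critical value:

```python
# Sketch: chi-square goodness-of-fit test for the fair-die example.
import numpy as np
from scipy import stats

observed = np.array([16, 15, 4, 6, 14, 5])
expected = np.full(6, 10.0)                          # 60 rolls of a fair die

X2 = ((observed - expected) ** 2 / expected).sum()   # 15.4
stat, pval = stats.chisquare(observed, expected)     # same statistic, plus a p-value
crit5 = stats.chi2.ppf(0.95, df=5)                   # about 11.07 (nothing estimated, so df = 6 - 1)
print(X2, stat, pval, crit5)
```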

6.3.3 Comparing with a Parametric family of distributions
Last time we tested whether the observations came from a single distribution. In other cases we may
want to test whether the observations came from a parametric family of distributions, e.g. Poisson, or
Binomial, for some values of the parameters. Since we don’t know the parameters, we will estimate
them from the data. Let’s see an example.
We have the following observations about the number of deaths from horsekicks in different regiments of the Prussian army. That is, 0 deaths were observed 109 times, 1 death was observed 65 times, and so on.

Deaths 0 1 2 3 4 ≥5
Frequency 109 65 22 3 1 0 Total=200

Suppose we want to test whether the above observations come from a Poisson distribution with some
parameter µ. That is we want to test

• H0 : the data come from a Poisson(µ) for some µ; against

• H1 : the data do not come from any Poisson distribution.

The first step is to estimate the parameter µ. Recall that if X ∼ Poisson(µ) then E X = µ. Therefore
we could estimate µ by the sample mean
X̄ = (X1 + · · · + Xn )/n.

In our case we find

x̄ = 122/200 = 0.61.
Now essentially we have to test

• H0 : the data come from the Poisson(0.61); against

• H1 : the data do not come from the Poisson(0.61) distribution.

Just like in the previous case, we need to compare the observed counts in each category to the expected
count which we have to compute. How do we find the expected count?
If X ∼ Poisson(0.61) then we know that

P(X = k) = e^(−0.61) 0.61^k / k!,    k ≥ 0,

and thus for the first five categories we compute the expected count by multiplying the probability with the number of observations. For the last category we need to compute

P(X ≥ 5) = 1 − Σ_{j=0}^{4} e^(−0.61) 0.61^j / j! = 0.000424972.

Therefore we have
Deaths 0 1 2 3 4 ≥5
Expected freq. 108.67 66.3 20.2 4.1 0.63 0.08
Observed freq. 109 65 22 3 1 0

Now there is a slight problem here because in the last three categories the expected count is less than 5. This means that the sample size is not large enough for our χ² statistic to have, at least approximately, the correct distribution. When this happens we should combine the categories so that each category has expected count at least 5. In our case we will combine the last four categories to make a single category:

Deaths 0 1 ≥2
Expected freq. 108.67 66.3 24.9
Observed freq. 109 65 26
Now we calculate

X² = (108.67 − 109)²/108.67 + (66.3 − 65)²/66.3 + (24.9 − 26)²/24.9 = 0.0751.
Remark 58. If some categories have expected count less than 5, always combine the categories in such
a way as to make the expected count in each category of the final table at least 5.

Remark 59. When finding the number of degrees of freedom, we have to use the number of categories
of the final table.

In our case then we have


d.f. = 3 − 1 − 1 = 1,
since we have 3 categories in the final table and we estimated one parameter from the data. Consulting
the table for the critical values of the χ2 distribution with 1 degree of freedom, we find it to be 3.84 at
the 5% and 6.63 at the 1%. Since our statistic is 0.0751 there is no evidence to reject the null hypothesis
that our observations come from a Poisson model.
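A sketch of the whole procedure in code (assuming numpy/scipy are available; the exact value of the statistic depends slightly on how much the expected counts are rounded):

```python
# Sketch: Poisson goodness of fit with the parameter estimated from the data.
import numpy as np
from scipy import stats

deaths = np.arange(5)                          # observed categories 0,1,2,3,4 (plus ">= 5")
freq = np.array([109, 65, 22, 3, 1])           # the ">= 5" category has count 0
n = 200
lam = (deaths * freq).sum() / n                # sample mean 122/200 = 0.61

probs = stats.poisson.pmf(deaths, lam)
expected = n * np.append(probs, 1 - probs.sum())     # last entry is n * P(X >= 5)
observed = np.append(freq, 0)

# Combine everything from 2 deaths upwards so all expected counts are at least 5.
obs_c = np.array([observed[0], observed[1], observed[2:].sum()])
exp_c = np.array([expected[0], expected[1], expected[2:].sum()])

X2 = ((obs_c - exp_c) ** 2 / exp_c).sum()      # small, roughly 0.06-0.08 depending on rounding
df = len(obs_c) - 1 - 1                        # one parameter (lambda) was estimated
print(X2, df, stats.chi2.ppf(0.95, df))        # critical value 3.84
```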

Example 39 (Poisson again). Suppose that we are counting the number of accidents each week in a large
factory. We want to know whether accidents occur more or less at random, or whether they influence
each other. For instance, do they tend to cluster because there is some factor that makes accidents more
likely to happen? Alternatively, do they tend to anticluster, because after one accident has occurred
people are extra careful for a while?
Now if they really do occur at random, the number per week should obey a Poisson distribution

f(x) = e^(−λ) λ^x / x!,    x = 0, 1, 2, . . . ,
where λ is the mean number of accidents per week.
Over a 50 week period (omitting the Xmas period when the factory was closed) the number of weeks
n(x) on which x accidents occurred was:

No. of accidents 0 1 2 3 4 5 6 7 8
Frequency 5 6 14 10 9 4 1 0 1

From these data we compute the mean number of accidents per week was λ̂ = 2.68. So if the
distribution were Poisson, we would expect the numbers of weeks on which x accidents occurred to
have been:
No. of accidents 0 1 2 3 4 5 6 7 8
Frequency 3.4 9.2 12.3 11.0 7.4 3.9 1.8 0.7 0.3

To test, we first combine cells so that none of the predicted frequencies is less than 5. That means combining the first two cells, and also the last four:

No. of accidents 0 or 1 2 3 4 5 or more
Frequency 11 14 10 9 6
Expected 12.6 12.3 11.0 7.4 6.7

We then compute the sum of squares

(11 − 12.6)²/12.6 + (14 − 12.3)²/12.3 + (10 − 11.0)²/11.0 + (9 − 7.4)²/7.4 + (6 − 6.7)²/6.7 = 0.95.

We had to estimate one parameter from the data, so the number of degrees of freedom is 5 − 1 − 1 = 3.
From the tables we see that the 95% value is 7.81, so the result is not significant. There is no reason to
suppose that the accidents are happening other than at random.
Note that had we been testing the hypothesis “The data are from a Poisson distribution with λ = 2.68”, the number of estimated parameters would have been zero and so ν would have been 4, but that is a different problem, being a more restrictive hypothesis. The 95% value on 4 df is 9.49, which is of course larger because being allowed to fit λ to the data makes it likely that we can get the sum of squares smaller.

Example 40. We often just look just at the right tail of the χ2 -distribution. But sometimes the left tail
may reveal interesting features.
Mendel crossed round yellow (RY) pea plants with wrinkled green (WG) ones. According to his
theory, there should have been progeny of all four possible combinations of the characters and they
should have occurred in the ratio RY:RG:WY:WG::9:3:3:1. This is a categorical distribution of the form
P(X = RY) = 9/16,    P(X = RG) = P(X = WY) = 3/16,    P(X = WG) = 1/16.
He had 556 plants, so the expected frequencies were 312.75, 104.25, 104.25, 34.75. The observed
frequencies were 315, 108, 101, 32.
We can easily do a chi-squared test on the data. The appropriate sum is

2.25²/312.75 + 3.75²/104.25 + (−3.25)²/104.25 + (−2.75)²/34.75 = 0.47.
There are no parameters estimated from the data because the probabilities are derived from the theory.
So we compare 0.47 with χ295% which for 3df is 7.81. This is clearly not a significant deviation from the
prediction, so there is no reason to doubt Mendel’s hypothesis.
On the other hand, that is an awfully small chi-squared value. So let's just look at the left side of the distribution. Still on 3 df we note that χ²5% = 0.352. That tells us that even if Mendel was right, the deviation he got was very small. In fact, even if he was right, the probability of getting a χ² value of 0.47 or less is about 0.075, roughly 1 in 13. In other words, the fit looks too good.
There’s been a lot of debate about what we’re to make of all this. Of course Mendel was right, but it
seems unlikely that he should have got such good results. One suspects that he (some people say, his
assistant) simply kept only the best results out of several, or counted until they got the proportions as
they expected, or something of the sort. In any case, another use of the chi-squared distribution is to test
when we think results are too good to be true.
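A short sketch of the calculation, including the left-tail probability (assuming numpy/scipy are available):

```python
# Sketch: Mendel's data, the chi-square statistic and the probability of a fit this good or better.
import numpy as np
from scipy import stats

observed = np.array([315, 108, 101, 32])
expected = 556 * np.array([9, 3, 3, 1]) / 16        # 312.75, 104.25, 104.25, 34.75

X2 = ((observed - expected) ** 2 / expected).sum()  # about 0.47
left_tail = stats.chi2.cdf(X2, df=3)                # P(chi2_3 <= 0.47), roughly 0.075
print(X2, left_tail)
```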

Remark 60. Notice that we can also use the χ² test for continuous variables by breaking the data up into a finite number of bins. There are, however, better ways to do this. One such method is the Kolmogorov-Smirnov test, which is based on the maximum difference between the observed and predicted cumulative frequencies, and which we will try to cover if there is enough time.

6.3.4 Contingency Tables


The χ2 test can also be used to check whether two categorical variables are independent or not.
Consider the following example. A manufacturer sells three kinds of soap. 500 randomly selected people,
200 men and 300 women, are given samples of each kind of soap and asked which they prefer. The
replies are given in the following table:

A B C
Men 51 68 81 200
Women 63 141 96 300
114 209 177 500

On the basis of these data, do women have different preferences in soap from men? Therefore the null
hypothesis is
H0 : preferences are independent of gender.

Notice that if the null hypothesis is correct then the probability that an individual will prefer any one
soap is independent of whether that individual is male or female. So if the probabilities that a person
in the sample prefers A, B, C are denoted by P(A), P(B), P(C) and if the probabilities that a per-
son chosen at random is male or female are P(M ) and P(F ), respectively, we will have P(AM ) =
P(A)P (M ), and so on. If they are not independent, we will have to use conditional probabilities
P(AM ) = P(A|M )P(M ), etc.
To test the null hypothesis that the two attributes are independent we proceed as follows.

1. We estimate all the individual probabilities from the data; clearly the estimate of P(A) is 114/500 = 0.228, and so on.

2. We compute the expected numbers in each of the six classes using the simple multiplication rule
for probabilities (and of course multiplying by 500 to get expected numbers). Thus since P(M ) =
200/500 = 0.4 we have, assuming independence, P(M who prefers A) = 0.4 × 0.228 = .0912
and the expected number of males who prefer A is 45.6.

3. We then use the χ2 goodness of fit test to see whether the actual numbers are significantly different
from those we predict in this way.

The predicted numbers work out to be

A B C
Men 45.6 83.6 70.8 200
Women 68.4 125.4 106.2 300
114 209 177 500

Note that the row and column sums are not changed. We now work out the usual statistic
X² := Σ (ne − no )² / ne = 8.367.
But before we can complete the test we have to know how many degrees of freedom there are.
We might as well do this for the general case. We have a 2-way contingency table, i.e. there are two
categories of classification (here which sex you are and what soap you prefer). There could be more,
and the analysis is much the same if there are. Let us suppose there are r classes in the A classification
and s classes in the B, so that every individual is to be classified as Ai ∩ Bj where i = 1, 2, . . . , r and
j = 1, 2, . . . , s. In the example, r = 2 because there are only two sexes and s = 3 because there are
three classes of soap preference.
According to the null hypothesis P(Ai ∩ Bj ) = P(Ai )P(Bj ) and we estimate each of the marginal
probabilities from the data. Note, however, that we only need to estimate r − 1 probabilities as the last
one can be computed from the rest since they add up to one. Similarly, we only have to estimate s − 1
of the P(Bj ).

There are rs cells altogether which is the number of categories. We have to estimate (r − 1) + (s − 1)
parameters, and so the number of degrees of freedom is

ν = rs − (r − 1) − (s − 1) − 1 = rs − r − s + 1 = (r − 1)(s − 1).

Hence in the example the right number of degrees of freedom is 2, and so we compare 8.367 with
χ295%,2df = 5.99 and χ299%,2df = 9.21 so we can reject the null hypothesis. We find a lack of indepen-
dence which is significant but not highly significant.
Note that this analysis doesn’t tell you anything about which soap men or women prefer. It tells you
that there appears to be a difference, but nothing about what sort of difference that is.
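The whole contingency-table analysis can also be scripted; the sketch below (assuming numpy/scipy are available) reproduces the statistic, the degrees of freedom and the table of expected counts.

```python
# Sketch: chi-square test of independence for the soap-preference contingency table.
import numpy as np
from scipy import stats

table = np.array([[51, 68, 81],
                  [63, 141, 96]])

chi2, p, dof, expected = stats.chi2_contingency(table)
print(chi2, dof, p)      # about 8.37 on (2-1)(3-1) = 2 df
print(expected)          # predicted counts, e.g. 45.6 men preferring soap A
```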

6.3.5 Comparing Two Variances


In carrying out a two-sample t-test we have to know if we are justified in assuming that two variances
are equal. We want to be able to test this, and fortunately we can quite easily. If the null hypothesis
σ1² = σ2² is true, then let σ² be the common variance of the two samples. We then have for each sample that

Σ(xi − x̄)² / σ²

is a sum of squares of independent standard normal random variables, and consequently, with the usual remark about n − 1 in place of n,

(s1²/σ²) / (s2²/σ²) = s1²/s2² = [ Σ(x − x̄)²/(nx − 1) ] / [ Σ(y − ȳ)²/(ny − 1) ]

has the F-distribution with (nx − 1, ny − 1) df.


For a 2-tailed test we note that the reciprocal of this fraction also has an F distribution, but with the df
reversed. The tables only give one tail, but we can use F5%,(m,n)df = (F95%,(n,m)df )−1 to find the other
tail.
Ordinarily we don’t have to do this, since for a two-tailed test we just agree to put the larger value on
top and use one tail. It’s as though instead of taking absolute values we just agreed to subtract the smaller
from the larger for two-tail tests of the mean. Returning to the example of the petrol consumption:

Professionals 0.94 1.20 1.00 1.06 1.02
Private 1.40 0.98 1.22 1.16 1.34

The means are 1.044 and 1.220. The sums of squared deviations are .0379 and .1080, which gives

F = (.1080/4) / (.0379/4) = 2.85.

According to the tables the critical value for a two-tailed 5% test is F97.5%,(4,4)df = 9.60, so the data do not cause us to reject the null hypothesis of
equal variances. In fact, on the null hypothesis the probability of getting a deviation as big as the one we
observed is about 1/3, so the situation is better than it probably looks. What it really amounts to is that
it isn’t easy to estimate variances accurately from small samples. The F -distribution is actually more
important in another context, that of ANOVA which we will come to later if time permits.
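A sketch of the variance-ratio test in code (assuming numpy/scipy are available), with the larger variance on top as in the text:

```python
# Sketch: F-test for equality of two variances on the petrol-consumption data.
import numpy as np
from scipy import stats

prof = np.array([0.94, 1.20, 1.00, 1.06, 1.02])
priv = np.array([1.40, 0.98, 1.22, 1.16, 1.34])

s2_prof, s2_priv = prof.var(ddof=1), priv.var(ddof=1)
F = max(s2_prof, s2_priv) / min(s2_prof, s2_priv)    # larger on top, about 2.85
crit = stats.f.ppf(0.975, dfn=4, dfd=4)              # about 9.60 for a two-tailed 5% test
p_two_sided = 2 * stats.f.sf(F, 4, 4)                # roughly 1/3, as stated in the text
print(F, crit, p_two_sided)
```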

Chapter 7

Regression

We are given some data about a dependent variable Y and some variables X which we believe determine Y. We assume that we can write

Y = g(X1 , X2 , . . . , Xk ).

The aim is to estimate g. We usually suppose that the form of g is known, and that our task is to estimate one or more parameters, although there are techniques that help us to find g, or at least to choose between two or more possible forms.

7.1 A Single Explanatory Variable


Suppose we have a single response Y and a single explanatory or predictor variable X. We suppose that
we can write
Y(X) = g(X) + ε,

i.e. that Y is the sum of a deterministic function g(X) and a random error ε. Suppose further that we
know the functional form of g. Then we are back to the familiar problem of estimating the parameters,
and we choose to do that by the method of maximum likelihood.
We are given data in the form of pairs (xi , yi ) which are observations of pairs of random variables
(Xi , Yi ). For now we will assume that the xi are fixed, i.e. that we can control them. Then each of the
yi is given by

yi = g(xi ) + εi .

Since g(x) is supposed to be a known function, though with unknown parameters, Yi is a random variable whose distribution is determined by that of ε. You can think of the Yi as having a distribution approximately centred on the curve y = g(x) and then with a random component added on.
Suppose that the εi ∼ N(0, σ²) are i.i.d., which seems appropriate, given that the normal distribution is also often called the error distribution. Normality is not essential, as we shall see, but it does allow us to interpret the result as a maximum likelihood estimator. This implies that E(Yi ) = g(xi ),
which seems reasonable. The errors are thus assumed to be random rather than systematic. We suppose
that the variance, σ 2 , is unknown, but for the time being we suppose that it is constant, that it does not
depend on X. This means we are assuming that the scatter is the same at all points on the graph, which
clearly need not be so. Then each Yi will be normally distributed with mean g(x) and variance σ 2 : that
is Yi will have probability density function
f(yi ) = (1/(σ√(2π))) exp( −(yi − g(xi ))²/(2σ²) ).
The likelihood function is then, letting θ stand for all the unknown parameters of g,

L(θ, σ) = Π_{i=1}^n (1/(σ√(2π))) exp( −(yi − g(xi ))²/(2σ²) )
        = (1/(σ√(2π)))^n exp( −(1/(2σ²)) Σ_{i=1}^n (yi − g(xi ))² ).
We now have to maximize L, and as usual we work with log L instead:

log L = −n log σ − (n/2) log 2π − (1/(2σ²)) Σ (yi − g(xi ))²,

∂ log L/∂σ = −n/σ + (1/σ³) Σ (yi − g(xi ))²,

∂ log L/∂θ = (1/σ²) Σ (yi − g(xi )) ∂g(xi )/∂θ.
As you can see from the calculations, what we are actually doing is choosing the parameters θ so as to
minimize Σ(yi − g(xi ))². In other words, if the errors are distributed normally and σ² is constant (but
not otherwise!) the maximum likelihood estimates of the parameters are just the least squares estimates,
and the maximum likelihood estimate of the variance is obtained from the sum of the squares of the
deviations of the yi from their estimates based on the maximum likelihood estimates.
Let's see how this works in the simplest interesting case, the linear model Y = a + bX + ε. Assuming the ε to be normally distributed, we have to choose a and b so as to minimize

Σ (yi − a − bxi )².

This gives us the following system of equations

−2 Σ (yi − â − b̂xi ) = 0,
−2 Σ xi (yi − â − b̂xi ) = 0,

and hence

nâ + b̂ Σ xi = Σ yi ,
â Σ xi + b̂ Σ xi² = Σ xi yi .

Solving for â and b̂ we find

b̂ = sxy /sxx ,
â = ȳ − b̂ x̄,

where

sxx := Σ (x − x̄)²,
syy := Σ (y − ȳ)²,
sxy := Σ (x − x̄)(y − ȳ).

The maximum likelihood estimate of σ² is then

σ̂² = (1/n) Σ (yi − â − b̂xi )².
Note, by the way, that the point (x̄, ȳ) must lie on the regression line, because

â + b̂x̄ = (1/n)(nâ + b̂ Σ xi ) = (1/n) Σ yi = ȳ.

This is often useful.
Let’s start with an artificial example so we can see how the numbers go. We’re given the following
pairs of observations:

X 1 2 4 5 7
Y 3 4 5 8 8.

In fact, the calculator that is supplied for the exam will find the values of a and b for you (so you’d be
well advised to learn how to use it) but if you don’t have one like that it’s not all that hard to do it by
hand. We set up the following table:

X Y X2 Y2 XY
1 3 1 9 3
2 4 4 16 8
4 5 16 25 20
5 8 25 64 40
7 8 49 64 56
19 28 95 178 127

The equations we need are now

5â + 19b̂ = 28
19â + 95b̂ = 127.

and since 5 × 19 = 95 we can solve this easily to obtain â = 13/6 ≈ 2.17 and b̂ = .90.
We can now use this relation to predict Y for any given X. For example, if X = 6 we expect
Y = 2.17 + 6 × .90 = 7.57. Of course we really ought to round to no closer than 7.6 because we
mustn’t give answers that are to more significant figures than the data we began with. Here we allow
ourselves one more, but that is about as far as we can go.
Note that while this method works for interpolation it is less reliable for extrapolation, because we
can’t be so confident about what happens beyond the range that we have data for. So while you can work
out from these data that if X = 20 we expect Y = 2.17 + 20 × 0.9 = 20.17, I wouldn’t bet on it.
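For completeness, here is a sketch (assuming numpy is available) that reproduces â ≈ 2.17 and b̂ ≈ 0.90 directly from the least-squares formulas:

```python
# Sketch: least-squares fit of Y = a + bX for the artificial example.
import numpy as np

x = np.array([1, 2, 4, 5, 7], dtype=float)
y = np.array([3, 4, 5, 8, 8], dtype=float)

sxx = ((x - x.mean()) ** 2).sum()
sxy = ((x - x.mean()) * (y - y.mean())).sum()
b_hat = sxy / sxx                        # about 0.90
a_hat = y.mean() - b_hat * x.mean()      # about 2.17

print(a_hat, b_hat, a_hat + b_hat * 6)   # prediction at X = 6 is about 7.6
```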
Note also that we have always taken X as the predictor variable and Y as the response. We could do
it the other way around, and then by symmetry the equations will be:
nα̂ + β̂ Σ yi = Σ xi ,
α̂ Σ yi + β̂ Σ yi² = Σ xi yi ,

which here is

5α̂ + 28β̂ = 19
28α̂ + 178β̂ = 127.

This gives α̂ = −1.64 and β̂ = .972. This is not the same line as before, because

y = â + b̂x =⇒ x = −â/b̂ + y/b̂

so we would have expected

α̂ = −2.17/.90 = −2.41 β̂ = 1/.9 = 1.1,

which is not the case. The reason is that if we really considered Y to be the predictor, then we would be
trying to minimise the sum of squares
Σ (xi − α − βyi )²,

and this leads to a different result. It’s not hard to see why: we are trying to minimise the sum of squares
of horizontal deviations from the line whereas before we were minimizing the sums of squares of vertical
deviations.

7.2 More Complicated Regression Models


There are two ways in which we can make the model more complicated. We could suppose that the
function g(X) is not a linear function, or we could suppose that there is more than one predictor variable,
or both. A common example is to suppose that

Y = a + bX + cX² + ε.

This is still usually referred to as a linear model because it is linear in the parameters, though not in
the predictor variable. The original result that the maximum likelihood estimates are the least squares
estimates still applies, and it’s not hard to show that the necessary equations are

na + bΣx + cΣx2 = Σy
aΣx + bΣx2 + cΣx3 = Σxy
aΣx2 + bΣx3 + cΣx4 = Σx2 y

Fitting a cubic or higher order polynomial is done in an analogous fashion.
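As a hedged illustration (the data below are made up purely for the example), numpy's polyfit solves exactly these normal equations in a least-squares sense:

```python
# Sketch: fitting Y = a + bX + cX^2 by least squares; the x and y values here are hypothetical.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.8, 9.3, 16.2, 24.9])

c, b, a = np.polyfit(x, y, deg=2)    # polyfit returns coefficients highest power first
print(a, b, c)                       # estimates of a, b and c in a + bX + cX^2
```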

7.2.1 Multiple Regression.


Another way to generalise the above is to increase the number of explanatory variables1 , setting

yj = β1 xj1 + β2 xj2 + · · · + βp xjp + εj = xjT β + εj ,

where xjT = (xj1 , . . . , xjp ) is a 1 × p vector of explanatory variables associated with the j-th response yj , εj is the error in the j-th response and β T = (β1 , . . . , βp ) is a vector of unknown parameters. We can write this in matrix form as

y = Xβ + ε,    (7.2.1)
where y is the n × 1 vector such that y T = (y1 , . . . , yn ), X is the n × p matrix

X = ( x11  x12  · · ·  x1p )
    ( x21  x22  · · ·  x2p )
    (  ·    ·   · · ·   ·  )
    ( xn1  xn2  · · ·  xnp ),

and εT = (ε1 , . . . , εn ) is a vector of i.i.d. N(0, σ²) random variables. The model given in (7.2.1) is called a linear regression model with design matrix X.
1
This section is heavily based on A.C.Davison’s Statistical Models

Under the assumption that the ε1 , . . . , εn are i.i.d. N(0, σ²), the responses y1 , . . . , yn are also independent normal random variables such that

yj ∼ N(xjT β, σ²),

and thus we can write down the likelihood function for β, σ² as

L(β, σ²) = Π_{j=1}^n (1/√(2πσ²)) exp{ −(yj − xjT β)²/(2σ²) }.

Notice that

xjT β = (xj1 , . . . , xjp )(β1 , . . . , βp )T = xj1 β1 + · · · + xjp βp ∈ R.
Just like before, in order to maximise the likelihood over β we just have to minimise the sum of squares (and this minimisation does not involve σ²)

SS(β) := Σ_{j=1}^n (yj − xjT β)² = (y − Xβ)T (y − Xβ),

which results in the system of equations

∂SS(β)/∂βi = −2 Σ_{j=1}^n xji (yj − xjT β) = 0,    i = 1, . . . , p,

which can be rewritten as

X T (y − Xβ) = 0,

and provided that the p × p matrix X T X is invertible the maximum likelihood estimator of β is given by

β̂ = (X T X)−1 X T y.

Since the likelihood is maximised w.r.t. β at β̂ irrespective of σ², to obtain the maximum likelihood estimator of σ² we plug β̂ into the log-likelihood and maximise w.r.t. σ²; this amounts to minimising

n log σ² + (1/σ²)(y − X β̂)T (y − X β̂),

and setting the derivative w.r.t. σ² equal to 0 gives us

σ̂² = (1/n)(y − X β̂)T (y − X β̂) = (1/n) Σ_{j=1}^n (yj − xjT β̂)².

It can be checked that in fact σ̂² is consistent but biased, while an unbiased estimator is given by

S² := (n/(n − p)) σ̂².
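A compact numerical sketch of these formulas (assuming numpy is available; the design matrix and data below are simulated, not taken from the notes):

```python
# Sketch: beta_hat = (X'X)^{-1} X'y, the biased MLE of sigma^2, and the unbiased estimator S^2.
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])   # design matrix with intercept
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)    # numerically safer than forming the inverse
resid = y - X @ beta_hat
sigma2_mle = resid @ resid / n                  # biased maximum likelihood estimator
S2 = resid @ resid / (n - p)                    # unbiased estimator
print(beta_hat, sigma2_mle, S2)
```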

7.3 Distribution of the Regression Parameters


We are still working under the assumption that the errors εi are i.i.d. normal random variables with mean zero and variance σ².

Recall that β̂ = (X T X)−1 X T y, and since by the definition of the model y = Xβ + ε we have that

β̂ = (X T X)−1 X T (Xβ + ε)
  = (X T X)−1 (X T X)β + (X T X)−1 X T ε
  = β + (X T X)−1 X T ε,

which is clearly a linear combination of jointly normal random variables, and is thus normally distributed
itself.
We can also compute its mean vector E(β̂) and covariance matrix Σβ̂ using the properties of multivariate normal distributions. They are given by

E[β̂] = β,
Σβ̂ = cov[ (X T X)−1 X T ε, (X T X)−1 X T ε ]
    = (X T X)−1 X T Σε ( (X T X)−1 X T )T
    = σ² (X T X)−1 X T I ( (X T X)−1 X T )T
    = σ² (X T X)−1 X T X (X T X)−1
    = σ² (X T X)−1 ,

where we have used Remark 23, the fact that Σε = σ² I with I the identity matrix, and the facts that (A−1 )T = (AT )−1 , (AB)T = B T AT and (AB)−1 = B −1 A−1 for invertible matrices. Therefore we know that

β̂ ∼ N (β, σ 2 (X T X)−1 ).

Now we still haven't discussed the distribution of the variance estimator

S² = SS(β̂)/(n − p).
To do this, we will first analyze the sum of the squared residuals, that is, the deviations between the observed and the fitted values when we use β̂. These are of course given by y − X β̂, but in order to derive their distribution we will re-express them in terms of the ε's as follows:

y − X β̂ = y − X(X T X)−1 X T y
        = (I − X(X T X)−1 X T ) y
        = (I − X(X T X)−1 X T )(Xβ + ε)
        = Xβ − X(X T X)−1 X T Xβ + ε − X(X T X)−1 X T ε
        = Xβ − Xβ + ε − X(X T X)−1 X T ε
        = (I − X(X T X)−1 X T ) ε,

which is again a linear combination of normal random variables and thus normally distributed itself. We
can compute the mean vector and covariance matrix as follows

E[y − X β̂] = 0,
var(y − X β̂) = var{ (I − X(X T X)−1 X T ) ε }
             = σ² (I − X(X T X)−1 X T )(I − X(X T X)−1 X T )T
             = σ² (I − X(X T X)−1 X T ).

Finally, since both vectors β̂ and y − X β̂ are linear combinations of the ε, we compute their covariance matrix as follows:

cov(β̂, y − X β̂) = cov[ β + (X T X)−1 X T ε, (I − X(X T X)−1 X T ) ε ]
                = (X T X)−1 X T Σε (I − X(X T X)−1 X T )T
                = σ² (X T X)−1 X T (I − X(X T X)−1 X T )T
                = σ² (X T X)−1 X T (I − X(X T X)−1 X T )        (since X(X T X)−1 X T is symmetric)
                = σ² (X T X)−1 X T − σ² (X T X)−1 X T X(X T X)−1 X T
                = σ² (X T X)−1 X T − σ² (X T X)−1 X T = 0.

Since β̂ and y − X β̂ are jointly normal and their covariance matrix is 0, we conclude that β̂ and y − X β̂
are independent.
To derive the distribution of the variance estimator, notice that

εT ε = (y − Xβ)T (y − Xβ)
     = [ (y − X β̂) + X(β̂ − β) ]T [ (y − X β̂) + X(β̂ − β) ]
     = (y − X β̂)T (y − X β̂) + (β̂ − β)T X T X(β̂ − β),

so that

εT ε / σ² = (1/σ²)(y − X β̂)T (y − X β̂) + (1/σ²)(β̂ − β)T X T X(β̂ − β),    (7.3.1)

since the cross terms vanish:

(y − X β̂)T X = εT (I − X(X T X)−1 X T )X = εT (X − X(X T X)−1 X T X) = εT (X − X) = 0.

Therefore looking again at (7.3.1), the left hand side is the sum of squares of n independent standard
normal variables and therefore has the χ2 distribution with n degrees of freedom. On the other hand, on
the right hand side we have the sum of two independent variables. We know that

β̂ ∼ N (β, σ 2 (X T X)−1 ),

and thus
X(β̂ − β)/σ ∼ N (0, H),
where H := X(X T X)−1 X T . Now it can be easily seen that H² = H and thus H is idempotent. Therefore it follows that its eigenvalues are all 0 or 1, and since tr(H) = tr( (X T X)−1 X T X ) = tr(Ip ) = p, exactly p of them are 1 and n − p of them are 0. Therefore H = ODO T where D is diagonal with the first p diagonal entries equal to 1 and the rest 0. Also, since H is symmetric, O is orthogonal. Using Theorem 22, since X(β̂ − β)/σ ∼ N(0, H) we can write X(β̂ − β)/σ = H^{1/2} W, where W ∼ N(0, In ) is a vector of i.i.d. standard normal random variables. Since H² = H we have H^{1/2} = H and thus X(β̂ − β)/σ = HW = ODO T W. Thus, since O is orthogonal, O T O = I and D² = D, so

(1/σ²)(β̂ − β)T X T X(β̂ − β) = (ODO T W)T (ODO T W)
                             = W T ODO T ODO T W
                             = W T ODO T W
                             = (W′)T D W′,

where W′ = O T W ∼ N(0, In ). Thus finally

(1/σ²)(β̂ − β)T X T X(β̂ − β) = Σ_{i=1}^p (w′i )² ∼ χ²p ,

since it is the sum of the squares of p independent standard normal variables.


Therefore a χ2 random variable with n degrees of freedom is equal to the sum of a χ2 with p degrees
of freedom and an independent random variable. An easy calculation with MGFs shows that it must be
that
(1/σ²)(y − X β̂)T (y − X β̂) ∼ χ²n−p .
From this, it is now obvious that
S² = (1/(n − p)) Σj (yj − xjT β̂)²,
n−p

is an unbiased estimator of σ 2 .

7.4 Nonlinear Models


According to the census data, the population of a town over a 40 year period has increased as shown in
the following table:

Year 1970 1980 1990 2000 2010


Population 43,815 55,501 65,877 87,535 120,094

We want to predict what the population will be in the year 2020; this is extrapolation but hopefully not
too far. So we fit a straight line. To keep things in line we rescale the data so as not to get overflows in
our calculator, and we round the numbers because there’s no way our answer is going to be correct to 5
significant figures. We renumber the years 1,2,3,4,5, which will make 6 stand for 2020. We knock the
last three digits off the population, and this leaves the following data:

X 1 2 3 4 5
Y 44 56 66 88 120

This gives a regression equation

Y = 19.6 + 18.4X

To estimate the population in the year 2020 we set X = 6 and we obtain Y = 130, which makes an
estimated population of 130,000. You can see that two significant figures is enough!
It is always a good idea to look at the data instead of just plugging it into the computer and getting
numbers out. If we do that, we see that the points do not lie on a straight line at all, which is not
surprising because populations tend to increase exponentially.
This suggests that we ought to use a model of the form

Y = a e^{bX} + ε.

Again, we can set up the relevant equations by minimizing the sum of squares

Σ (yi − a e^{bxi })²,

which leads to the equations

Σ e^{bxi } (yi − a e^{bxi }) = 0,
Σ a xi e^{bxi } (yi − a e^{bxi }) = 0.

Unlike the previous case, we can’t solve these analytically, but we can find â and b̂ numerically. On the
other hand, we can do the thing differently by working with Z = log Y , which gives us a linear model

Z = a + bX + ε′.

If we fit a least squares model to this we have

Z = 3.52 + 0.246X

and the predicted value for X = 6 is Z = 4.99 i.e. Y = 147 so the predicted population in 2020 is
147,000.
This is almost certainly better, at least in the sense that if the past trend continues, which is an as-
sumption we have to justify but which we usually make for want of anything better, we would expect
the higher value. The advantage of transforming the data is that we can use standard programmes. The
disadvantage is that a lot of the theoretical basis becomes shaky, because if the errors in the original data
are normally distributed with constant variance, this is not true for the transformed data. That may or
may not matter, but you have to keep it in mind.
In fact, the transformation can even improve things, because if the errors are proportional to the
measurement, then taking logs makes them more equal than they were in the first place.
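A sketch of the log-transform fit in code (assuming numpy is available), reproducing the prediction of about 147,000:

```python
# Sketch: fitting Y = a*e^{bX} by regressing Z = log Y on X.
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([44, 56, 66, 88, 120], dtype=float)   # population in thousands

z = np.log(y)
b = ((x - x.mean()) * (z - z.mean())).sum() / ((x - x.mean()) ** 2).sum()   # about 0.246
a = z.mean() - b * x.mean()                                                 # about 3.52

print(a, b, np.exp(a + b * 6))   # prediction for 2020: about 147 (thousand)
```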

7.5 Tests for correlation


If X and Y are two random variables, then we have
var(X) = E[(X − µX )²] = E[X²] − (E[X])²,
var(Y) = E[(Y − µY )²] = E[Y²] − (E[Y])²,
cov(X, Y) = E[(X − µX )(Y − µY )] = E[XY] − E[X]E[Y].

The covariance gives a measure of linear dependence between X and Y . If X and Y are big together
and small together, most terms in the product (X − X̄)(Y − Ȳ ) will be positive, so the sum will be large
and positive. If one is big when the other is small, most of the terms will be negative so the sum will be
large and negative. And if Y is just as likely to be big or small whatever X is, some of the terms will be
positive and some negative, so the sum should be small.
However the words “big” and “small” aren’t very precise, and since the variance of X and Y inflates
the covariance a better measure of linear dependence is given by the correlation

ρ = cov(X, Y) / √( var(X) var(Y) ),

which has the advantage that −1 ≤ ρ ≤ 1 so it establishes a scale. If we have a sample consisting of n
pairs of (X, Y ) values, we define the sample correlation coefficient

r := Σ(xi − x̄)(yi − ȳ) / √( Σ(xi − x̄)² Σ(yi − ȳ)² ) = ( Σ xi yi − n x̄ ȳ ) / √( ( Σ xi² − n x̄² )( Σ yi² − n ȳ² ) ).
P

It turns out that |r| ≤ 1 exactly as for the correlation coefficient of random variables – we have only to
replace the expectation operator by summation symbols. [check!]
The correlation coefficient expresses the correlation between two samples and hence it can hopefully
serve as an estimate of the population correlation ρ. If it is positive, then if one is bigger than average,
the other is likely to be as well. If it is negative, then if one is bigger than average, the other is likely to
be smaller. If the correlation coefficient is close to zero, then if one is bigger than average that tells us
little if anything about the other one.

Example 41. Ten strawberry plants were grown in pots in a greenhouse. Measurements were taken of
x, the level of nitrogen present in the leaf at the time of picking (ppm by weight of dry leaf matter) and
y, the crop yield in grams.

x 2.50 2.55 2.54 2.56 2.68 2.55 2.62 2.57 2.63 2.59
y 247 245 266 277 284 251 275 272 241 265

We find

Σx = 25.79    x̄ = 2.579    Σx² = 66.537    Sx = 0.05216

Σy = 2623    ȳ = 262.3    Σy² = 690091    Sy = 15.195    Σxy = 6767.9.

We can work out the regression line from the equations

10â + 25.79b̂ = 2623


25.79â + 66.537b̂ = 6767.9.

which gives us y = −71.5 + 129.4x.


The correlation coefficient is
r = (6767.9 − 10 × 2.579 × 262.3) / √( (66.537 − 10 × 2.579²)(690091 − 10 × 262.3²) ) = 0.445.

There are a number of tests we can apply to r. We will only mention the following:

• H0 : ρ = 0; against

• H1 : ρ ≠ 0.

In terms of the regression coefficient β̂ we can restate r as

R = Sxy / √( Sxx Syy ) = β̂ √( Sxx / Syy ),

where

Sxx := Σ (Xi − X̄)²,
Syy := Σ (Yi − Ȳ)²,
Sxy := Σ (Xi − X̄)(Yi − Ȳ).

Also notice that when (X, Y ) have a bivariate normal distribution then we can write
E(Y | X = x) = α + βx,    β = (σY /σX ) ρ.
This shows that to test the above hypothesis we can equivalently test:

• H0 : β = 0;

• H1 : β ≠ 0.

To test whether the correlation is significantly different from 0 we compute the t-statistic

t = β̂ / ( S/√Sxx ),

where S 2 is the unbiased variance estimator we constructed in Section 7.3 given by


S² = (1/(n − p)) Σ_{j=1}^n (yj − α̂ − β̂xj )² = (1/(n − 2)) Σ_{j=1}^n (yj − xjT β̂)²,

since in the bivariate case p = 2. We can also rewrite t as

t = r √(n − 2) / √(1 − r²).
Now it can be shown that the t-statistic defined above has the t-distribution with n − 2 degrees of
freedom. In our example the statistic works out to be 1.41 and the relevant (2-tail) 5% value is 2.306, so the correlation is not significant.
Statistical tables often give the critical values of r directly, which saves you having to work t out for
yourself.
Notice that we can use the distribution of the t-statistic above to construct confidence intervals for r.
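As a numerical sketch (assuming numpy/scipy are available), the correlation and the corresponding t-statistic for the strawberry data can be checked as follows:

```python
# Sketch: sample correlation and the t-test of H0: rho = 0 for the strawberry-plant data.
import numpy as np
from scipy import stats

x = np.array([2.50, 2.55, 2.54, 2.56, 2.68, 2.55, 2.62, 2.57, 2.63, 2.59])
y = np.array([247, 245, 266, 277, 284, 251, 275, 272, 241, 265], dtype=float)

n = len(x)
r = np.corrcoef(x, y)[0, 1]                     # about 0.445
t = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)    # about 1.41
crit = stats.t.ppf(0.975, df=n - 2)             # two-tailed 5% value, about 2.306
print(r, t, crit)
```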

7.5.0.0.1 Correlation and causality. Note that we have to be very careful with correlations. Above
all, we cannot infer cause and effect from a strong (i.e. ρ close to ±1) correlation. It is at least as
likely to be common cause, as in the good correlation between the numbers of taverns and the number
of Baptist ministers in American cities – which is actually to do with both being correlated with the
population of the city. This shows what happens if you don’t ask exactly how the research was carried
out. Another example is the strong correlation between monthly sales of ice skates in Canada and
surfboards in Australia, due to the fact that the start of winter in one corresponds to the start of summer
in the other.
What is more, correlation measures only a linear relation. Consider the following data:

X Y X2 Y2 XY
0 0 0 0 0
1 48 1 2304 48
2 64 4 4096 128
3 48 9 2304 144
4 0 16 0 0
10 160 30 8704 320

This gives x̄ = 2 and ȳ = 32, so Σ(x − x̄)(y − ȳ) = Σxy − 5 × 2 × 32 = 0 which makes ρ = 0.
Now these values are what you would get for the height of a ball thrown up with an initial speed of 64
feet/sec, taking the acceleration due to gravity as 32 ft/sec/sec. So there can be cause and effect and
yet a zero correlation, which shouldn’t surprise you because we saw earlier on that dependent random
variables can have zero correlation. What has happened is that half the time the ball was going up as
time increased and half the time it was going down as time increased. So on average it was doing neither.
That also tells you something about averages: the fact that the average speed was zero conceals the fact
that except for one brief instant when it was at its highest point, the ball was in fact moving.

7.6 Spearman’s Rank Correlation Coefficient


Just as in non-parametric tests we often replace quantitative variables by their relative ranks, we can do the same to test for correlation.
Suppose we are given a sample of pairs (xi , yi ). Then the sample correlation coefficient is given by
r = Σ_{i=1}^n (xi − x̄)(yi − ȳ) / √( Σ(xi − x̄)² Σ(yi − ȳ)² ),

or equivalently

r = ( Σ_{i=1}^n xi yi − (1/n)[Σ xi ][Σ yi ] ) / √( ( Σ xi² − (1/n)(Σ xi )² )( Σ yi² − (1/n)(Σ yi )² ) ).

Let R(xi ) be the rank of the i-th x observation relative to the x sample, and R(yi ) similarly w.r.t. the
y sample. The Spearman rank correlation coefficient is calculated by replacing R(xi ) and R(yi ) for xi
and yi in the usual formula for the correlation. That is
rS = ( Σ_{i=1}^n R(xi )R(yi ) − (1/n)[Σ R(xi )][Σ R(yi )] ) / √( ( Σ R(xi )² − (1/n)(Σ R(xi ))² )( Σ R(yi )² − (1/n)(Σ R(yi ))² ) ).    (7.6.1)

The correlation coefficient we have just been working with requires quantitative data, but there may also
be situations in which we can only rank the data. In the case of quantitative data, this provides a non-
parametric test for association between two random variables which is therefore more robust in cases
where the data fails to be close to normally distributed.
In the case where there are no ties in either the x or y observations, that is the x’s are all distinct, and
the y’s are all distinct, it can be shown by simple algebra that

rS = 1 − 6Σd² / ( n(n² − 1) ),

where di ≡ R(xi ) − R(yi ). This makes sense intuitively: We are working with the differences between the
rankings and if there were complete consistency the rankings would be identical and all the differences
would be zero. If on the other hand, there is little consistency, then the differences would be large.
Suppose, for example that we are given the following pairs:

x 10 14 6 1 7 4 9 13 2 12 5 11 15 3 8
y 4 11 5 6 12 1 14 10 7 13 2 15 9 3 8

d 6 3 1 -5 -5 3 -5 3 -5 -1 3 -4 6 0 0.

As a check, the differences should sum to zero. Here Σd2 = 226 and 15 × (152 − 1) = 3360 so
rs = 1 − (6 × 226)/3360 = 1 − 0.4036 = 0.5964.
There are tables for critical values of rs but the Cambridge tables give values for S = Σd2 . Here we
have S = 226 and n = 15 so the 5% critical value (we were expecting a positive correlation so this is a
1-tail test) is 310. Our value is less than that, so the rank correlation is significantly different to 0.
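A sketch of the Spearman calculation for these rankings (assuming numpy/scipy are available), both via the Σd² formula and via the library routine:

```python
# Sketch: Spearman's rank correlation for the two rankings above (no ties).
import numpy as np
from scipy import stats

rx = np.array([10, 14, 6, 1, 7, 4, 9, 13, 2, 12, 5, 11, 15, 3, 8])
ry = np.array([4, 11, 5, 6, 12, 1, 14, 10, 7, 13, 2, 15, 9, 3, 8])

d = rx - ry
n = len(d)
r_s = 1 - 6 * (d ** 2).sum() / (n * (n ** 2 - 1))   # about 0.596
print(r_s, stats.spearmanr(rx, ry)[0])              # the two agree when there are no ties
```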
Note that there is also Kendall's rank correlation, so don't confuse the two. Also, the right-hand column of the table tells you what to divide S by to get rS .
To test for a significant negative correlation is easy using rS , as is testing for a significant correlation regardless of sign. If the critical value of rS for a positive correlation is R, so that we reject the null hypothesis for rS ≥ R, then if we are testing for a negative correlation we reject H0 for rS ≤ −R.
Thus if we are testing for a significant positive correlation we reject H0 for

1 − 6Σd²/( n(n² − 1) ) ≥ R,    i.e.    Σd² ≤ (1/6) n(n² − 1)(1 − R).

On the other hand, if we are testing for a significant negative correlation, we reject H0 for

1 − 6Σd²/( n(n² − 1) ) ≤ −R,    i.e.    Σd² ≥ (1/6) n(n² − 1)(1 + R).

So if the critical value for Σd2 for testing for a significant positive correlation is X (we reject for
Σd2 < X) then to test for a significant negative correlation we reject for

Σd² ≥ (1/3) n(n² − 1) − X.
This is stated, though not very clearly, in the Cambridge Tables.
Note that if two rankings are positively correlated, then if you take one of them in the other order, the
correlation will be negative.

