Statistics
Lent 2015
These notes are not endorsed by the lecturers, and I have modified them (often
significantly) after lectures. They are nowhere near accurate representations of what
was actually lectured, and in particular, all errors are almost surely mine.
Estimation
Review of distribution and density functions, parametric families. Examples: bino-
mial, Poisson, gamma. Sufficiency, minimal sufficiency, the Rao-Blackwell theorem.
Maximum likelihood estimation. Confidence intervals. Use of prior distributions and
Bayesian inference. [5]
Hypothesis testing
Simple examples of hypothesis testing, null and alternative hypothesis, critical region,
size, power, type I and type II errors, Neyman-Pearson lemma. Significance level of
outcome. Uniformly most powerful tests. Likelihood ratio, and use of generalised likeli-
hood ratio to construct test statistics for composite hypotheses. Examples, including
t-tests and F -tests. Relationship with confidence intervals. Goodness-of-fit tests and
contingency tables. [4]
Linear models
Derivation and joint distribution of maximum likelihood estimators, least squares,
Gauss-Markov theorem. Testing hypotheses, geometric interpretation. Examples,
including simple linear regression and one-way analysis of variance. Use of software. [7]
Contents
0 Introduction
1 Estimation
  1.1 Estimators
  1.2 Mean squared error
  1.3 Sufficiency
  1.4 Likelihood
  1.5 Confidence intervals
  1.6 Bayesian estimation
2 Hypothesis testing
  2.1 Simple hypotheses
  2.2 Composite hypotheses
  2.3 Tests of goodness-of-fit and independence
    2.3.1 Goodness-of-fit of a fully-specified null distribution
    2.3.2 Pearson's chi-squared test
    2.3.3 Testing independence in contingency tables
  2.4 Tests of homogeneity, and connections to confidence intervals
    2.4.1 Tests of homogeneity
    2.4.2 Confidence intervals and hypothesis tests
  2.5 Multivariate normal theory
    2.5.1 Multivariate normal distribution
    2.5.2 Normal random samples
  2.6 Student's t-distribution
3 Linear models
  3.1 Linear models
  3.2 Simple linear regression
  3.3 Linear models with normal assumptions
  3.4 The F distribution
  3.5 Inference for β
  3.6 Simple linear regression
  3.7 Expected response at x∗
  3.8 Hypothesis testing
    3.8.1 Hypothesis testing
    3.8.2 Simple linear regression
    3.8.3 One way analysis of variance with equal numbers in each group
0 Introduction
Statistics is a set of principles and procedures for gaining and processing quan-
titative evidence in order to help us make judgements and decisions. In this
course, we focus on formal statistical inference. In the process, we assume that
we have some data generated from some unknown probability model, and we aim
to use the data to learn about certain properties of the underlying probability
model.
In particular, we perform parametric inference. We assume that we have
a random variable X whose distribution belongs to a particular known family
(e.g. the Poisson family). However, we do not know the parameters of the
distribution. We then attempt to estimate the parameter from the data given.
For example, we might know that X ∼ Poisson(µ) for some µ, and we want
to figure out what µ is.
Usually we repeat the experiment (or observation) many times. Hence we
will have X1 , X2 , · · · , Xn being iid with the same distribution as X. We call the
set X = (X1 , X2 , · · · , Xn ) a simple random sample. This is the data we have.
We will use the observed X = x to make inferences about the parameter θ,
such as
– giving an estimate θ̂(x) of the true value of θ.
1 Estimation
1.1 Estimators
The goal of estimation is as follows: we are given iid X1 , · · · , Xn , and we know
that their probability density/mass function is fX (x; θ) for some unknown θ.
We know fX but not θ. For example, we might know that they follow a Poisson
distribution, but we do not know what the mean is. The objective is to estimate
the value of θ.
Definition (Statistic). A statistic is an estimate of θ. It is a function T of the
data. If we write the data as x = (x1 , · · · , xn ), then our estimate is written as
θ̂ = T (x). T (X) is an estimator of θ.
The distribution of T = T (X) is the sampling distribution of the statistic.
Note that we adopt the convention where capital X denotes a random variable
and x is an observed value. So T (X) is a random variable and T (x) is a particular
value we obtain after experiments.
Example. Let X1 , · · · , Xn be iid N (µ, 1). A possible estimator for µ is
T(X) = (1/n) ∑ X_i.
Then for any particular observed sample x, our estimate is
T(x) = (1/n) ∑ x_i.
What is the sampling distribution of T? Recall from IA Probability that, in
general, if the X_i ∼ N(µ_i, σ_i²) are independent, then ∑ X_i ∼ N(∑ µ_i, ∑ σ_i²), which is something we
can prove by considering moment-generating functions.
So we have T (X) ∼ N (µ, 1/n). Note that by the Central Limit Theorem,
even if Xi were not normal, we still have approximately T (X) ∼ N (µ, 1/n) for
large values of n, but here we get exactly the normal distribution even for small
values of n.
The estimator (1/n) ∑ X_i we had above is a rather sensible estimator. Of course,
we can also have silly estimators such as T (X) = X1 , or even T (X) = 0.32
always.
One way to decide if an estimator is silly is to look at its bias.
Definition (Bias). Let θ̂ = T (X) be an estimator of θ. The bias of θ̂ is the
difference between its expected value and true value.
bias(θ̂) = Eθ (θ̂) − θ.
Note that the subscript θ does not represent the random variable, but the thing
we want to estimate. This is inconsistent with the use for, say, the probability
mass function.
An estimator is unbiased if it has no bias, i.e. Eθ (θ̂) = θ.
To find out Eθ (T ), we can either find the distribution of T and find its
expected value, or evaluate T as a function of X directly, and find its expected
value.
Example. In the above example, Eµ (T ) = µ. So T is unbiased for µ.
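As a quick sanity check (not part of the notes' development), a small simulation along the following lines illustrates that the sample mean is unbiased for µ; the sample size, true mean and number of replications are arbitrary illustrative choices.

```python
import numpy as np

# Minimal sketch: estimate the bias of T(X) = (1/n) * sum(X_i) for X_i iid N(mu, 1).
rng = np.random.default_rng(0)
mu, n, reps = 2.0, 10, 100_000          # arbitrary illustrative values

samples = rng.normal(loc=mu, scale=1.0, size=(reps, n))
T = samples.mean(axis=1)                # T(X) for each replication

print("E(T) estimate:", T.mean())       # close to mu = 2.0
print("bias estimate:", T.mean() - mu)  # close to 0
print("var(T) estimate:", T.var(), "vs 1/n =", 1 / n)
```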
1.2 Mean squared error
Definition (Mean squared error). The mean squared error of an estimator θ̂ is
mse(θ̂) = E_θ[(θ̂ − θ)²] = var(θ̂) + bias²(θ̂).
If we are aiming for a low mean squared error, sometimes it could be preferable to
have a biased estimator with a lower variance. This is known as the "bias-variance
trade-off".
For example, suppose X ∼ binomial(n, θ), where n is given and θ is to be
determined. The standard estimator is TU = X/n, which is unbiased. TU has
variance
var_θ(T_U) = var_θ(X)/n² = θ(1 − θ)/n.
Hence the mean squared error of the usual estimator is given by
mse(T_U) = bias²(T_U) + var_θ(T_U) = θ(1 − θ)/n.
We can plot the mean squared error of each estimator for possible values of θ.
Here we plot the case where n = 10.
[Plot: mean squared error against θ, for n = 10, comparing the unbiased estimator with the biased estimator.]
This biased estimator has smaller MSE unless θ has extreme values.
We see that sometimes biased estimators could give better mean squared
errors. In some cases, not only could unbiased estimators be worse — they could
be complete nonsense.
Suppose X ∼ Poisson(λ), and for whatever reason, we want to estimate
θ = [P (X = 0)]2 = e−2λ . Then any unbiased estimator T (X) must satisfy
Eθ (T (X)) = θ, or equivalently,
E_λ(T(X)) = ∑_{x=0}^∞ T(x) e^{−λ} λ^x/x! = e^{−2λ}.
The only function T that can satisfy this equation is T(X) = (−1)^X, since
∑_x T(x) λ^x/x! = e^{−λ} = ∑_x (−λ)^x/x!, and we can compare coefficients.
Thus the unbiased estimator would estimate e−2λ to be 1 if X is even, -1 if
X is odd. This is clearly nonsense.
1.3 Sufficiency
Often, we do experiments just to find out the value of θ. For example, we might
want to estimate what proportion of the population supports some political
candidate. We are seldom interested in the data points themselves, and just
want to learn about the big picture. This leads us to the concept of a sufficient
statistic. This is a statistic T (X) that contains all information we have about θ
in the sample.
Example. Let X1 , · · · Xn be iid Bernoulli(θ), so that P(Xi = 1) = 1 − P(Xi =
0) = θ for some 0 < θ < 1. So
f_X(x | θ) = ∏_{i=1}^n θ^{x_i}(1 − θ)^{1−x_i} = θ^{∑ x_i}(1 − θ)^{n − ∑ x_i}.
This depends on the data only through T(X) = ∑ x_i, the total number of ones.
Suppose we are now given that T (X) = t. Then what is the distribution of
X? We have
f_{X|T=t}(x) = P_θ(X = x, T = t)/P_θ(T = t) = P_θ(X = x)/P_θ(T = t),
where the last equality holds because if X = x, then T must equal t. This is equal to
θ^{∑ x_i}(1 − θ)^{n − ∑ x_i} / ( C(n, t) θ^t (1 − θ)^{n−t} ) = C(n, t)^{−1},
where C(n, t) is the binomial coefficient "n choose t".
So the conditional distribution of X given T = t does not depend on θ. So if we
know T , then additional knowledge of x does not give more information about θ.
Definition (Sufficient statistic). A statistic T is sufficient for θ if the conditional
distribution of X given T does not depend on θ.
There is a convenient theorem that allows us to find sufficient statistics.
Theorem (The factorization criterion). T is sufficient for θ if and only if
fX (x | θ) = g(T (x), θ)h(x)
for some functions g and h.
Proof. We first prove the discrete case.
Suppose fX (x | θ) = g(T (x), θ)h(x). If T (x) = t, then
f_{X|T=t}(x) = P_θ(X = x, T(X) = t)/P_θ(T = t)
= g(T(x), θ)h(x) / ∑_{y: T(y)=t} g(T(y), θ)h(y)
= g(t, θ)h(x) / ( g(t, θ) ∑_{y: T(y)=t} h(y) )
= h(x) / ∑_{y: T(y)=t} h(y),
which doesn’t depend on θ. So T is sufficient.
The continuous case is similar. If fX (x | θ) = g(T (x), θ)h(x), and T (x) = t,
then
f_{X|T=t}(x) = g(T(x), θ)h(x) / ∫_{y: T(y)=t} g(T(y), θ)h(y) dy
= g(t, θ)h(x) / ( g(t, θ) ∫_{y: T(y)=t} h(y) dy )
= h(x) / ∫_{y: T(y)=t} h(y) dy,
which does not depend on θ.
Now suppose T is sufficient so that the conditional distribution of X | T = t
does not depend on θ. Then
Pθ (X = x) = Pθ (X = x, T = T (x)) = Pθ (X = x | T = T (x))Pθ (T = T (x)).
The first factor does not depend on θ by assumption; call it h(x). Let the second
factor be g(t, θ), and so we have the required factorisation.
Example. Let X_1, ..., X_n be iid U[0, θ]. Then
f_X(x | θ) = θ^{−n} 1[max_i x_i ≤ θ] · 1[min_i x_i ≥ 0],
which is of the form g(T(x), θ)h(x) with T(x) = max_i x_i. So T = max_i x_i is sufficient.
Note that sufficient statistics are not unique. If T is sufficient for θ, then
so is any 1-1 function of T . X is always sufficient for θ as well, but it is not of
much use. How can we decide if a sufficient statistic is “good”?
Given any statistic T, we can partition the sample space X^n into sets
{x ∈ X^n : T(x) = t}. Then after an experiment, instead of recording the actual
value of x, we can simply record which set of the partition x falls into. If there
are fewer sets than possible values of x, then effectively there is less information
we have to store.
If T is sufficient, then this data reduction does not lose any information
about θ. The “best” sufficient statistic would be one in which we achieve the
maximum possible reduction. This is known as the minimal sufficient statistic.
The formal definition we take is the following:
Definition (Minimal sufficiency). A sufficient statistic T (X) is minimal if it is
a function of every other sufficient statistic, i.e. if T ′ (X) is also sufficient, then
T ′ (X) = T ′ (Y) ⇒ T (X) = T (Y).
Again, we have a handy theorem to find minimal statistics:
Theorem. Suppose T = T (X) is a statistic that satisfies
f_X(x; θ)/f_X(y; θ) does not depend on θ if and only if T(x) = T(y).
Then T is minimal sufficient for θ.
Proof. First we have to show sufficiency. We will use the factorization criterion
to do so.
Firstly, for each possible t, pick a favorite xt such that T (xt ) = t.
Now let x ∈ X^n and let T(x) = t, so that T(x) = T(x_t). By the hypothesis,
f_X(x; θ)/f_X(x_t; θ) does not depend on θ. Let this be h(x), and let g(t, θ) = f_X(x_t; θ). Then
f_X(x; θ) = f_X(x_t; θ) · f_X(x; θ)/f_X(x_t; θ) = g(t, θ)h(x).
So T is sufficient for θ.
To show that this is minimal, suppose that S(X) is also sufficient. By the
factorization criterion, there exist functions gS and hS such that
fX (x; θ) = gS (S(x), θ)hS (x).
Now suppose that S(x) = S(y). Then
f_X(x; θ)/f_X(y; θ) = g_S(S(x), θ)h_S(x) / ( g_S(S(y), θ)h_S(y) ) = h_S(x)/h_S(y),
which does not depend on θ. So by the hypothesis, T(x) = T(y). Hence T is a
function of S, and so T is minimal sufficient.
1.4 Likelihood
There are many different estimators we can pick, and we have just come up with
some criteria to determine whether an estimator is “good”. However, these do
not give us a systematic way of coming up with an estimator to actually use. In
practice, we often use the maximum likelihood estimator.
Let X1 , · · · , Xn be random variables with joint pdf/pmf fX (x | θ). We
observe X = x.
Definition (Likelihood). For any given x, the likelihood of θ is like(θ) = fX (x |
θ), regarded as a function of θ. The maximum likelihood estimator (mle) of θ is
an estimator that picks the value of θ that maximizes like(θ).
Often there is no closed form for the mle, and we have to find θ̂ numerically.
When we can find the mle explicitly, in practice, we often maximize the
log-likelihood instead of the likelihood. In particular, if X1 , · · · , Xn are iid, each
with pdf/pmf fX (x | θ), then
like(θ) = ∏_{i=1}^n f_X(x_i | θ),
log like(θ) = ∑_{i=1}^n log f_X(x_i | θ).
Example. Let X_1, ..., X_n be iid Bernoulli(p). Then the log-likelihood is
l(p) = ∑ x_i log p + (n − ∑ x_i) log(1 − p).
Thus
dl/dp = ∑ x_i / p − (n − ∑ x_i)/(1 − p).
This is zero when p = ∑ x_i/n. So p̂ = ∑ x_i/n is the maximum likelihood estimator
(and is unbiased).
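A short numerical check of this calculation, assuming iid Bernoulli(p) data; the true p and sample size below are made up for illustration. Maximising the log-likelihood over a grid should agree with the closed form p̂ = ∑ x_i/n.

```python
import numpy as np

rng = np.random.default_rng(1)
p_true, n = 0.3, 200                     # illustrative values
x = rng.binomial(1, p_true, size=n)

# Closed-form mle: p_hat = sum(x_i) / n
p_hat = x.mean()

# Maximise l(p) = sum(x) log p + (n - sum(x)) log(1 - p) over a grid of p values
grid = np.linspace(0.001, 0.999, 9999)
loglik = x.sum() * np.log(grid) + (n - x.sum()) * np.log(1 - grid)
p_grid = grid[np.argmax(loglik)]

print(p_hat, p_grid)                     # the two estimates agree (up to grid resolution)
```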
1.5 Confidence intervals
Definition (Confidence interval). A 100(1 − α)% confidence interval for θ is a
random interval (A(X), B(X)) such that P(A(X) < θ < B(X)) = 1 − α, no matter
what the true value of θ may be.
Example. Suppose X1 , · · · , Xn are iid N (θ, 1). Find a 95% confidence interval
for θ.
We know X̄ ∼ N(θ, 1/n), so that √n(X̄ − θ) ∼ N(0, 1).
Let z_1, z_2 be such that Φ(z_2) − Φ(z_1) = 0.95, where Φ is the standard normal
distribution function.
We have P(z_1 < √n(X̄ − θ) < z_2) = 0.95, which can be rearranged to give
P(X̄ − z_2/√n < θ < X̄ − z_1/√n) = 0.95,
so we obtain the following 95% confidence interval:
(X̄ − z_2/√n, X̄ − z_1/√n).
There are many possible choices for z1 and z2 . Since N (0, 1) density is symmetric,
the shortest such interval is obtained by z2 = z0.025 = −z1 . We can also choose
other values such as z1 = −∞, z2 = 1.64, but we usually choose symmetric end
points.
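A minimal sketch of this interval for simulated data, using the symmetric choice z_2 = −z_1 = Φ^{-1}(0.975); the data here are made up.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
theta, n = 1.5, 25                        # illustrative true mean and sample size
x = rng.normal(theta, 1.0, size=n)

z = norm.ppf(0.975)                       # z_{0.025} = 1.96
xbar = x.mean()
ci = (xbar - z / np.sqrt(n), xbar + z / np.sqrt(n))
print(ci)                                 # a 95% confidence interval for theta
```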
The above example illustrates a common procedure for finding confidence
intervals:
– Find a quantity R(X, θ) such that the P_θ-distribution of R(X, θ) does not
depend on θ. This is called a pivot. In our example, R(X, θ) = √n(X̄ − θ).
– Write down a probability statement of the form Pθ (c1 < R(X, θ) < c2 ) = γ.
– Rearrange the inequalities inside P(. . .) to find the interval.
Usually c1 , c2 are percentage points from a known standardised distribution, often
equitailed. For example, we pick 2.5% and 97.5% points for a 95% confidence
interval. We could also use, say 0% and 95%, but this generally results in a
wider interval.
Note that if (A(X), B(X)) is a 100γ% confidence interval for θ, and T (θ)
is a monotone increasing function of θ, then (T (A(X)), T (B(X))) is a 100γ%
confidence interval for T (θ).
Example. Suppose X1 , · · · , X50 are iid N (0, σ 2 ). Find a 99% confidence interval
for σ 2 .
We know that X_i/σ ∼ N(0, 1). So
(1/σ²) ∑_{i=1}^{50} X_i² ∼ χ²_{50}.
So R(X, σ²) = ∑_{i=1}^{50} X_i²/σ² is a pivot.
Recall that χ²_n(α) is the upper 100α% point of χ²_n, i.e. P(χ²_n > χ²_n(α)) = α.
Choose c_1 and c_2 such that P(c_1 ≤ χ²_{50} ≤ c_2) = 0.99. Then
P(∑ X_i²/c_2 ≤ σ² ≤ ∑ X_i²/c_1) = 0.99,
so (∑ X_i²/c_2, ∑ X_i²/c_1) is a 99% confidence interval for σ². Using the remark
above, a 99% confidence interval for σ is
(√(∑ X_i²/c_2), √(∑ X_i²/c_1)).
Note that we have made a lot of approximations here, but it would be difficult
to do better than this.
Example. Suppose an opinion poll says 20% of the people are going to vote
UKIP, based on a random sample of 1, 000 people. What might the true
proportion be?
We assume we have an observation of x = 200 from a binomial(n, p) distri-
bution with n = 1, 000. Then p̂ = x/n = 0.2 is an unbiased estimate, and also
the mle.
Now var(X/n) = p(1 − p)/n ≈ p̂(1 − p̂)/n = 0.00016. So a 95% confidence interval is
(p̂ − 1.96 √(p̂(1 − p̂)/n), p̂ + 1.96 √(p̂(1 − p̂)/n)) = 0.20 ± 1.96 × 0.013 = (0.175, 0.225).
If we don't want to make that many approximations, we can note that p(1 − p) ≤
1/4 for all 0 ≤ p ≤ 1. So a conservative 95% interval is p̂ ± 1.96 √(1/(4n)) ≈ p̂ ± √(1/n).
So whatever proportion is reported, it will be 'accurate' to ±1/√n.
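The same computation in a few lines of Python, using the numbers of the example (x = 200 out of n = 1000).

```python
import numpy as np

n, x = 1000, 200
p_hat = x / n

se = np.sqrt(p_hat * (1 - p_hat) / n)           # approximate standard error, about 0.013
lower, upper = p_hat - 1.96 * se, p_hat + 1.96 * se
print((round(lower, 3), round(upper, 3)))       # approximately (0.175, 0.225)

conservative = 1.96 * np.sqrt(1 / (4 * n))      # roughly 1/sqrt(n)
print(p_hat - conservative, p_hat + conservative)
```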
Example. Suppose X1 , X2 are iid from U (θ − 1/2, θ + 1/2). What is a sensible
50% confidence interval for θ?
We know that each Xi is equally likely to be less than θ or greater than θ.
So there is 50% chance that we get one observation on each side, i.e.
P_θ(min(X_1, X_2) ≤ θ ≤ max(X_1, X_2)) = 1/2.
So (min(X1 , X2 ), max(X1 , X2 )) is a 50% confidence interval for θ.
But suppose after the experiment, we obtain |x_1 − x_2| ≥ 1/2. For example, we
might get x1 = 0.2, x2 = 0.9, then we know that, in this particular case, θ must
lie in (min(X1 , X2 ), max(X1 , X2 )), and we don’t have just 50% “confidence”!
This is why after we calculate a confidence interval, we should not say “there
is 100(1 − α)% chance that θ lies in here”. The confidence interval just says
that “if we keep making these intervals, 100(1 − α)% of them will contain θ”.
But if have calculated a particular confidence interval, the probability that that
particular interval contains θ is not 100(1 − α)%.
1.6 Bayesian estimation
So far θ has been treated as an unknown constant. In the Bayesian approach, we
instead treat θ as a random variable and express our beliefs about it, before seeing
the data, through a prior distribution π(θ). After observing X = x, these beliefs are
updated to the posterior distribution
π(θ | x) ∝ f_X(x | θ)π(θ),
i.e. posterior ∝ likelihood × prior.
Example. Suppose I have three coins: coin 1 has P(head) = 0.25, coin 2 has
P(head) = 0.5 and coin 3 has P(head) = 0.75. I randomly select one coin and flip
it once, observing a head. What is the probability that I have chosen coin 3?
Let X = 1 denote the event that I observe a head, X = 0 if a tail. Let θ
denote the probability of a head. So θ is either 0.25, 0.5 or 0.75.
Our prior distribution is π(θ = 0.25) = π(θ = 0.5) = π(θ = 0.75) = 1/3.
The probability mass function is f_X(x | θ) = θ^x(1 − θ)^{1−x}. So, given x = 1, the
posterior is proportional to θ/3, which gives
π(θ = 0.25 | x = 1) = 1/6,  π(θ = 0.5 | x = 1) = 1/3,  π(θ = 0.75 | x = 1) = 1/2.
So if we observe a head, then there is now a 50% chance that we have picked
the third coin.
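A sketch of the posterior calculation for the three-coin example, directly applying posterior ∝ likelihood × prior.

```python
import numpy as np

thetas = np.array([0.25, 0.5, 0.75])   # P(head) for coins 1, 2, 3
prior = np.array([1/3, 1/3, 1/3])

x = 1                                   # we observed a head
likelihood = thetas**x * (1 - thetas)**(1 - x)

posterior = prior * likelihood
posterior /= posterior.sum()            # normalise
print(posterior)                        # [1/6, 1/3, 1/2]
```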
Example. Suppose we are interested in the true mortality risk θ in a hospital
H which is about to try a new operation. On average in the country, around
10% of the people die, but mortality rates in different hospitals vary from around
3% to around 20%. Hospital H has no deaths in their first 10 operations. What
should we believe about θ?
Let X_i = 1 if the ith patient in H dies, and 0 otherwise. Then
f_X(x | θ) = θ^{∑ x_i}(1 − θ)^{n − ∑ x_i}.
Suppose a priori that θ ∼ beta(a, b) for some a > 0, b > 0, so that
π(θ) ∝ θ^{a−1}(1 − θ)^{b−1}.
Having found the posterior, how do we produce a single estimate? A Bayes
estimator minimises the posterior expected loss h(a) = ∫ L(θ, a)π(θ | x) dθ for a
chosen loss function L. For quadratic loss L(θ, a) = (θ − a)², we have
h(a) = ∫ (θ − a)² π(θ | x) dθ, and h′(a) = 0 if
∫ (a − θ)π(θ | x) dθ = 0,
or
a ∫ π(θ | x) dθ = ∫ θπ(θ | x) dθ.
Since ∫ π(θ | x) dθ = 1, the Bayes estimator is θ̂ = ∫ θπ(θ | x) dθ, the posterior
mean.
For absolute error loss,
h(a) = ∫ |θ − a| π(θ | x) dθ
= ∫_{−∞}^a (a − θ)π(θ | x) dθ + ∫_a^∞ (θ − a)π(θ | x) dθ
= a ∫_{−∞}^a π(θ | x) dθ − ∫_{−∞}^a θπ(θ | x) dθ
  + ∫_a^∞ θπ(θ | x) dθ − a ∫_a^∞ π(θ | x) dθ.
Now h′(a) = 0 if
∫_{−∞}^a π(θ | x) dθ = ∫_a^∞ π(θ | x) dθ,
i.e. each side equals 1/2. So the Bayes estimator under absolute error loss is the
posterior median.
Example. Suppose X_1, ..., X_n are iid N(µ, 1), and that a priori µ ∼ N(0, 1/τ²)
for a known precision τ². Then
π(µ | x) ∝ f_X(x | µ)π(µ)
∝ exp(−(1/2) ∑ (x_i − µ)²) · exp(−µ²τ²/2)
∝ exp(−(1/2)(n + τ²)(µ − ∑ x_i/(n + τ²))²),
since we can regard n, τ and all the x_i as constants in the normalisation term,
and then complete the square with respect to µ. So the posterior distribution
of µ given x is a normal distribution with mean ∑ x_i/(n + τ²) and variance
1/(n + τ²).
The normal density is symmetric, and so the posterior mean and the posterior
median have the same value ∑ x_i/(n + τ²).
This is the optimal estimator for both quadratic and absolute error loss.
Example. Suppose that X1 , · · · , Xn are iid Poisson(λ) random variables, and
λ has an exponential distribution with mean 1. So π(λ) = e−λ .
The posterior distribution is given by
π(λ | x) ∝ e^{−nλ} λ^{∑ x_i} · e^{−λ} = λ^{∑ x_i} e^{−(n+1)λ},  λ > 0,
which is gamma(∑ x_i + 1, n + 1). Hence under quadratic loss, our estimator is
λ̂ = (∑ x_i + 1)/(n + 1),
the posterior mean.
Under absolute error loss, λ̂ solves
∫_0^{λ̂} ( (n + 1)^{∑ x_i + 1} λ^{∑ x_i} e^{−(n+1)λ} / (∑ x_i)! ) dλ = 1/2.
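A sketch of these two Bayes estimators on simulated data. Note that scipy's gamma distribution is parametrised by shape and scale, so gamma(∑ x_i + 1, n + 1) in the rate parametrisation corresponds to shape = ∑ x_i + 1 and scale = 1/(n + 1); the true λ and sample size below are made up.

```python
import numpy as np
from scipy.stats import gamma

rng = np.random.default_rng(3)
lam_true, n = 2.0, 30                      # illustrative values
x = rng.poisson(lam_true, size=n)

shape = x.sum() + 1                        # posterior is gamma(sum(x) + 1, n + 1)
rate = n + 1

post_mean = shape / rate                               # Bayes estimator, quadratic loss
post_median = gamma.median(shape, scale=1 / rate)      # Bayes estimator, absolute error loss
print(post_mean, post_median)
```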
2 Hypothesis testing
Often in statistics, we have some hypothesis to test. For example, we want to
test whether a drug can lower the chance of a heart attack. Often, we will have
two hypotheses to compare: the null hypothesis states that the drug is useless,
while the alternative hypothesis states that the drug is useful. Quantitatively,
suppose that the chance of heart attack without the drug is θ0 and the chance
with the drug is θ. Then the null hypothesis is H0 : θ = θ0 , while the alternative
hypothesis is H1 : θ ̸= θ0 .
It is important to note that the null hypothesis and alternative hypothesis
are not on equal footing. By default, we assume the null hypothesis is true.
For us to reject the null hypothesis, we need a lot of evidence to prove that.
This is since we consider incorrectly rejecting the null hypothesis to be a much
more serious problem than accepting it when we should not. For example, it is
relatively okay to reject a drug when it is actually useful, but it is terrible to
distribute drugs to patients when the drugs are actually useless. Alternatively,
it is more serious to deem an innocent person guilty than to say a guilty person
is innocent.
In general, let X1 , · · · , Xn be iid, each taking values in X , each with unknown
pdf/pmf f . We have two hypotheses, H0 and H1 , about f . On the basis of data
X = x, we make a choice between the two hypotheses.
Example.
– A coin has P(Heads) = θ, and is thrown independently n times. We could
have H0 : θ = 1/2 versus H1 : θ = 3/4.
– Suppose X1 , · · · , Xn are iid discrete random variables. We could have H0 :
the distribution is Poisson with unknown mean, and H1 : the distribution
is not Poisson.
– General parametric cases: Let X1 , · · · , Xn be iid with density f (x | θ). f
is known while θ is unknown. Then our hypotheses are H0 : θ ∈ Θ0 and
H1 : θ ∈ Θ1 , with Θ0 ∩ Θ1 = ∅.
– We could have H0 : f = f0 and H1 : f = f1 , where f0 and f1 are densities
that are completely specified but do not come from the same parametric
family.
Definition (Simple and composite hypotheses). A simple hypothesis H specifies
f completely (e.g. H0 : θ = 1/2). Otherwise, H is a composite hypothesis.
Definition (Size and power). A test of H0 against H1 is specified by a critical
region C: we reject H0 if the data x ∈ C. A type I error is rejecting H0 when it is
true; a type II error is not rejecting H0 when it is false. When H0 is simple, the
size of the test is α = P(X ∈ C | H0); when H1 is simple, the power of the test is
P(X ∈ C | H1).
Definition (Likelihood ratio test). When H0 and H1 are both simple, let
Λ_x(H0; H1) = L_x(H1)/L_x(H0),
where L_x(H_i) is the likelihood of the data under H_i.
A likelihood ratio test (LR test) is one where the critical region C is of the form
C = {x : Λ_x(H0; H1) > k}
for some k.
It turns out this rather simple test is “the best” in the following sense:
Lemma (Neyman-Pearson lemma). Suppose H0 : f = f0 , H1 : f = f1 , where
f0 and f1 are continuous densities that are nonzero on the same regions. Then
among all tests of size less than or equal to α, the test with the largest power is
the likelihood ratio test of size α.
Proof. Under the likelihood ratio test, our critical region is
C = {x : f_1(x)/f_0(x) > k},
where k is chosen such that α = P(reject H0 | H0) = P(X ∈ C | H0) = ∫_C f_0(x) dx.
The probability of a Type II error is given by
β = P(X ∉ C | H1) = ∫_{C̄} f_1(x) dx.
Let C∗ be the critical region of any other test with size less than or equal to α.
Let α∗ = P(X ∈ C∗ | H0) and β∗ = P(X ∉ C∗ | H1). We want to show β ≤ β∗.
We know α∗ ≤ α, i.e.
∫_{C∗} f_0(x) dx ≤ ∫_C f_0(x) dx.
Also, on C we have f_1(x) > kf_0(x), while on C̄ we have f_1(x) ≤ kf_0(x). So
∫_{C̄∗∩C} f_1(x) dx ≥ k ∫_{C̄∗∩C} f_0(x) dx,
∫_{C̄∩C∗} f_1(x) dx ≤ k ∫_{C̄∩C∗} f_0(x) dx.
Hence
β − β∗ = ∫_{C̄} f_1(x) dx − ∫_{C̄∗} f_1(x) dx
= ∫_{C̄∩C∗} f_1(x) dx + ∫_{C̄∩C̄∗} f_1(x) dx − ∫_{C̄∗∩C} f_1(x) dx − ∫_{C̄∗∩C̄} f_1(x) dx
= ∫_{C̄∩C∗} f_1(x) dx − ∫_{C̄∗∩C} f_1(x) dx
≤ k ∫_{C̄∩C∗} f_0(x) dx − k ∫_{C̄∗∩C} f_0(x) dx
= k( ∫_{C̄∩C∗} f_0(x) dx + ∫_{C∩C∗} f_0(x) dx ) − k( ∫_{C̄∗∩C} f_0(x) dx + ∫_{C∩C∗} f_0(x) dx )
= k(α∗ − α)
≤ 0.
[Diagram: the sample space partitioned by C and C∗, showing C∗ ∩ C, C̄∗ ∩ C (where f_1 ≥ kf_0) and C∗ ∩ C̄ (where f_1 ≤ kf_0); α, α∗ are the probabilities of C, C∗ under H0, and β, β∗ the probabilities of C̄, C̄∗ under H1.]
Example. Suppose X_1, ..., X_n are iid N(µ, σ_0²) with σ_0² known, and we wish to
test H0 : µ = µ_0 against H1 : µ = µ_1, where µ_0, µ_1 are known and µ_1 > µ_0.
The likelihood ratio is
Λ_x(H0; H1) = exp( ((µ_1 − µ_0)/σ_0²) n x̄ + n(µ_0² − µ_1²)/(2σ_0²) ).
This is an increasing function of x̄, so for any k, Λ_x > k ⇔ x̄ > c for some c.
Hence we reject H0 if x̄ > c, where c is chosen such that P(X̄ > c | H0) = α.
Under H0, X̄ ∼ N(µ_0, σ_0²/n), so Z = √n(X̄ − µ_0)/σ_0 ∼ N(0, 1).
Since x̄ > c ⇔ z > c′ for some c′, the size α test rejects H0 if
z = √n(x̄ − µ_0)/σ_0 > z_α.
For example, suppose µ0 = 5, µ1 = 6, σ0 = 1, α = 0.05, n = 4 and x =
(5.1, 5.5, 4.9, 5.3). So x̄ = 5.2.
From tables, z0.05 = 1.645. We have z = 0.4 and this is less than 1.645. So x
is not in the rejection region.
We do not reject H0 at the 5% level and say that the data are consistent
with H0 .
Note that this does not mean that we accept H0 . While we don’t have
sufficient reason to believe it is false, we also don’t have sufficient reason to
believe it is true.
This is called a z-test.
In this example, LR tests reject H0 if z > k for some constant k. The size of
such a test is α = P(Z > k | H0) = 1 − Φ(k), which is decreasing as k increases.
Our observed value z will be in the rejection region iff z > k, which happens iff
α > p∗ = P(Z > z | H0).
Definition (p-value). The quantity p∗ is called the p-value of our observed data
x. For the example above, z = 0.4 and so p∗ = 1 − Φ(0.4) = 0.3446.
In general, the p-value is sometimes called the “observed significance level” of
x. This is the probability under H0 of seeing data that is “more extreme” than
our observed data x. Extreme observations are viewed as providing evidence
against H0 .
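A sketch of the z-test and p-value computation for the numbers in the example above (µ_0 = 5, σ_0 = 1, x = (5.1, 5.5, 4.9, 5.3)).

```python
import numpy as np
from scipy.stats import norm

x = np.array([5.1, 5.5, 4.9, 5.3])
mu0, sigma0 = 5.0, 1.0
n = len(x)

z = np.sqrt(n) * (x.mean() - mu0) / sigma0
p_value = 1 - norm.cdf(z)                 # one-sided, since H1: mu = mu1 > mu0

print(z)                                  # 0.4
print(p_value)                            # about 0.345
print(z > norm.ppf(0.95))                 # False: do not reject H0 at the 5% level
```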
2.2 Composite hypotheses
For composite hypotheses, the probability of rejecting H0 may vary over Θ_0, so
we define the power function
W(θ) = P(X ∈ C | θ) = P(reject H0 | θ),
and take the size of the test to be
α = sup_{θ∈Θ_0} W(θ).
Example. Suppose X_1, ..., X_n are iid N(µ, σ_0²) with σ_0² known, and we wish to
test H0 : µ = µ_0 against the composite alternative H1 : µ ≠ µ_0. The generalised
likelihood ratio test rejects H0 when |x̄ − µ_0| > c for some c.
We know that under H0, Z = √n(X̄ − µ_0)/σ_0 ∼ N(0, 1). So the size α
generalised likelihood test rejects H0 if
|√n(x̄ − µ_0)/σ_0| > z_{α/2}.
Alternatively, since n(X̄ − µ_0)²/σ_0² ∼ χ²_1, we reject H0 if
n(x̄ − µ_0)²/σ_0² > χ²_1(α)
(check that z²_{α/2} = χ²_1(α)).
Note that this is a two-tailed test — i.e. we reject H0 both for high and low
values of x̄.
The next theorem allows us to use likelihood ratio tests even when we cannot
find the exact relevant null distribution.
First consider the “size” or “dimension” of our hypotheses: suppose that
H0 imposes p independent restrictions on Θ. So for example, if Θ = {θ : θ =
(θ1 , · · · , θk )}, and we have
– H0 : θi1 = a1 , θi2 = a2 , · · · , θip = ap ; or
– H0 : Aθ = b (with A p × k, b p × 1 given).
In such cases, write p = |Θ_1| − |Θ_0| for the number of independent restrictions.
Then, under regularity conditions, if H0 is true, 2 log Λ is approximately χ²_p
distributed for large n.
We will not prove this result here. In our example above, |Θ_1| − |Θ_0| = 1,
and in that case we saw that under H0, 2 log Λ ∼ χ²_1 exactly for all n in that
particular case, rather than just approximately.
Example. The following table lists the birth months of admissions to Oxford
and Cambridge in 2012.
Sep Oct Nov Dec Jan Feb Mar Apr May Jun Jul Aug
470 515 470 457 473 381 466 457 437 396 384 394
Let N_i be the number of students born in month i, so that (N_1, ..., N_k) ∼
multinomial(n; p_1, ..., p_k). We test H0 : p_i = p̃_i for given p̃_i against H1 : the p_i
are unrestricted.
Under H1, like(p_1, ..., p_k) ∝ p_1^{n_1} ··· p_k^{n_k}. So the log likelihood is l =
constant + ∑ n_i log p_i. We want to maximise this subject to ∑ p_i = 1. Using
a Lagrange multiplier, we find that the mle is p̂_i = n_i/n. Also
|Θ_1| = k − 1 (not k, since the p_i must sum to 1).
Under H0, the values of p_i are specified completely, say p_i = p̃_i. So |Θ_0| = 0.
Using our formula for p̂_i, we find that
2 log Λ = 2 log( p̂_1^{n_1} ··· p̂_k^{n_k} / ( p̃_1^{n_1} ··· p̃_k^{n_k} ) ) = 2 ∑ n_i log( n_i/(n p̃_i) ).   (1)
Here |Θ1 |−|Θ0 | = k −1. So we reject H0 if 2 log Λ > χ2k−1 (α) for an approximate
size α test.
Under H0 (no effect of month of birth), p̃_i is the proportion of births in
month i in 1993/1994 in the whole population — this is not simply proportional
to the number of days in each month (or, even worse, 1/12), as there is for example
an excess of September births (the "Christmas effect"). Then
2 log Λ = 2 ∑ n_i log( n_i/(n p̃_i) ) = 44.86.
P(χ²_{11} > 44.86) = 3 × 10^{−9}, which is our p-value. Since this is certainly less than
0.001, we can reject H0 at the 0.1% level, or can say the result is "significant at
the 0.1% level".
The traditional levels for comparison are α = 0.05, 0.01, 0.001, roughly
corresponding to “evidence”, “strong evidence” and “very strong evidence”.
A similar common situation has H0 : pi = pi (θ) for some parameter θ and H1
as before. Now |Θ0 | is the number of independent parameters to be estimated
under H0 .
Under H0, we find the mle θ̂ by maximising ∑ n_i log p_i(θ), and then
2 log Λ = 2 log( p̂_1^{n_1} ··· p̂_k^{n_k} / ( p_1(θ̂)^{n_1} ··· p_k(θ̂)^{n_k} ) ) = 2 ∑ n_i log( n_i/(n p_i(θ̂)) ).   (2)
Write o_i = n_i for the observed counts, e_i for the expected counts under H0
(i.e. e_i = n p̃_i, or n p_i(θ̂) when parameters are estimated), and δ_i = o_i − e_i.
We know that ∑ δ_i = 0, since ∑ e_i = ∑ o_i. Expanding 2 log Λ = 2 ∑ o_i log(o_i/e_i)
in the δ_i gives
2 log Λ ≈ ∑ δ_i²/e_i = ∑ (o_i − e_i)²/e_i.
This is known as Pearson's chi-squared test.
Example. Mendel crossed 556 smooth yellow male peas with wrinkled green
peas. From the progeny, let
(i) N1 be the number of smooth yellow peas,
(ii) N2 be the number of smooth green peas,
(iii) N3 be the number of wrinkled yellow peas,
(iv) N4 be the number of wrinkled green peas.
We wish to test the goodness of fit of the model
H0 : (p_1, p_2, p_3, p_4) = (9/16, 3/16, 3/16, 1/16).
Suppose we observe (n1 , n2 , n3 , n4 ) = (315, 108, 102, 31).
We find (e_1, e_2, e_3, e_4) = (312.75, 104.25, 104.25, 34.75). The actual value is
2 log Λ = 0.618, and the approximation we had is ∑ (o_i − e_i)²/e_i = 0.604.
Here |Θ_0| = 0 and |Θ_1| = 4 − 1 = 3. So we refer the test statistics to χ²_3(α).
Since χ23 (0.05) = 7.815, we see that neither value is significant at 5%. So there
is no evidence against Mendel’s theory. In fact, the p-value is approximately
P(χ23 > 0.6) ≈ 0.90. This is a really good fit, so good that people suspect the
numbers were not genuine.
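Both statistics in the Mendel example can be checked in a few lines; scipy.stats.chisquare computes Pearson's statistic and its χ²_3 p-value, and 2 log Λ is computed directly.

```python
import numpy as np
from scipy.stats import chisquare, chi2

observed = np.array([315, 108, 102, 31])
p0 = np.array([9, 3, 3, 1]) / 16
expected = observed.sum() * p0            # (312.75, 104.25, 104.25, 34.75)

pearson, p_value = chisquare(observed, expected)
loglr = 2 * np.sum(observed * np.log(observed / expected))

print(pearson, p_value)                   # about 0.604, p-value about 0.90
print(loglr, 1 - chi2.cdf(loglr, df=3))   # about 0.618, similar p-value
```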
Example. In a genetics problem, each individual has one of the three possible
genotypes, with probabilities p1 , p2 , p3 . Suppose we wish to test H0 : pi = pi (θ),
where
p_1(θ) = θ², p_2(θ) = 2θ(1 − θ), p_3(θ) = (1 − θ)²
for some θ ∈ (0, 1).
We observe N_i = n_i. Under H0, the mle θ̂ is found by maximising
∑ n_i log p_i(θ) = 2n_1 log θ + n_2 log(2θ(1 − θ)) + 2n_3 log(1 − θ),
which gives θ̂ = (2n_1 + n_2)/(2n).
Example. Consider the following data on previous and new car sizes:

                       New car
                   Large  Medium  Small
Previous   Large     56      52     42
car        Medium    50      83     67
           Small     18      51     81
This is a two-way contingency table: Each person is classified according to the
previous car size and new car size.
Consider a two-way contingency table with r rows and c columns. For
i = 1, · · · , r And j = 1, · · · , c, let pij be the probability that an individual
selected from the population under consideration is classified in row i and
column j. (i.e. in the (i, j) cell of the table).
Let p_{i+} = P(in row i) and p_{+j} = P(in column j). Then we must have
p_{++} = ∑_i ∑_j p_{ij} = 1.
Suppose a random sample of n individuals is taken, and let n_{ij} be the number
of these classified in the (i, j) cell of the table.
Let n_{i+} = ∑_j n_{ij} and n_{+j} = ∑_i n_{ij}, so n_{++} = n.
We have
(N_{11}, ..., N_{1c}, N_{21}, ..., N_{rc}) ∼ multinomial(n; p_{11}, ..., p_{1c}, p_{21}, ..., p_{rc}).
We may be interested in testing the null hypothesis that the two classifications
are independent. So we test
– H0 : pij = pi+ p+j for all i, j, i.e. independence of columns and rows.
– H1 : pij are unrestricted.
Of course we have the usual restrictions like p++ = 1, pij ≥ 0.
Under H1, the mles are p̂_{ij} = n_{ij}/n.
Under H0, the mles are p̂_{i+} = n_{i+}/n and p̂_{+j} = n_{+j}/n.
Write o_{ij} = n_{ij} and e_{ij} = n p̂_{i+} p̂_{+j} = n_{i+}n_{+j}/n.
Then
2 log Λ = 2 ∑_{i=1}^r ∑_{j=1}^c o_{ij} log(o_{ij}/e_{ij}) ≈ ∑_{i=1}^r ∑_{j=1}^c (o_{ij} − e_{ij})²/e_{ij}.
The observed counts, with marginal totals, are:

                       New car
                   Large  Medium  Small  Total
Previous   Large     56      52     42    150
car        Medium    50      83     67    200
           Small     18      51     81    150
Total               124     186    190    500
The corresponding expected counts e_{ij} = n_{i+}n_{+j}/n are:

                       New car
                   Large  Medium  Small  Total
Previous   Large    37.2    55.8   57.0    150
car        Medium   49.6    74.4   76.0    200
           Small    37.2    55.8   57.0    150
Total               124     186    190     500
Note the margins are the same. It is quite clear that the observed and expected
counts do not match well, but we can find the p-value to be sure. We have
∑∑ (o_{ij} − e_{ij})²/e_{ij} = 36.20, and the degrees of freedom are (3 − 1)(3 − 1) = 4.
From the tables, χ24 (0.05) = 9.488 and χ24 (0.01) = 13.28.
So our observed value of 36.20 is significant at the 1% level, i.e. there is
strong evidence against H0 . So we conclude that the new and present car sizes
are not independent.
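The same test in code; scipy.stats.chi2_contingency returns the Pearson statistic, its degrees of freedom and the table of expected counts, and (with lambda_="log-likelihood") the 2 log Λ version.

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[56, 52, 42],
                     [50, 83, 67],
                     [18, 51, 81]])       # previous car size (rows) vs new car size (columns)

stat, p, dof, expected = chi2_contingency(observed)
print(stat, dof, p)                       # Pearson statistic about 36.2 on 4 df, tiny p-value

g, p_lr, _, _ = chi2_contingency(observed, lambda_="log-likelihood")
print(g, p_lr)                            # the 2 log Lambda form of the test
```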
2.4 Tests of homogeneity, and connections to confidence intervals
2.4.1 Tests of homogeneity
Example. In a clinical trial, a fixed number of patients is allocated to each of
three treatment groups (a placebo and two doses of a drug), and each patient's
outcome is recorded as "improved", "no difference" or "worse".
Here the row totals are fixed in advance, in contrast to our last section, where
the row totals are random variables.
For the above, we may be interested in testing H0 : the probability of
“improved” is the same for each of the three treatment groups, and so are
the probabilities of “no difference” and “worse”, i.e. H0 says that we have
homogeneity down the rows.
In general, we have independent observations from r multinomial distributions,
each of which has c categories, i.e. we observe an r × c table (n_{ij}), for i = 1, ..., r
and j = 1, ..., c, where
(N_{i1}, ..., N_{ic}) ∼ multinomial(n_{i+}; p_{i1}, ..., p_{ic}),
independently for each i. We want to test
H0 : p_{1j} = p_{2j} = ··· = p_{rj} = p_j, say,
for j = 1, ..., c, and
H1 : the p_{ij} are unrestricted.
Using H1, for any matrix of probabilities (p_{ij}),
like((p_{ij})) = ∏_{i=1}^r ( n_{i+}! / (n_{i1}! ··· n_{ic}!) ) p_{i1}^{n_{i1}} ··· p_{ic}^{n_{ic}},
and
log like = constant + ∑_{i=1}^r ∑_{j=1}^c n_{ij} log p_{ij}.
Using Lagrangian methods, we find that p̂_{ij} = n_{ij}/n_{i+}.
Under H0,
log like = constant + ∑_{j=1}^c n_{+j} log p_j.
By Lagrangian methods, we have p̂_j = n_{+j}/n_{++}.
Hence
2 log Λ = 2 ∑_{i=1}^r ∑_{j=1}^c n_{ij} log( p̂_{ij}/p̂_j ) = 2 ∑_{i=1}^r ∑_{j=1}^c n_{ij} log( n_{ij} / (n_{i+}n_{+j}/n_{++}) ),
which is the same as what we had last time, when the row totals are unrestricted!
We have |Θ_1| = r(c − 1) and |Θ_0| = c − 1. So the degrees of freedom are
r(c − 1) − (c − 1) = (r − 1)(c − 1), and under H0, 2 log Λ is approximately
χ²_{(r−1)(c−1)}. Again, it is exactly the same as what we had last time!
We reject H0 if 2 log Λ > χ²_{(r−1)(c−1)}(α) for an approximate size α test.
If we let o_{ij} = n_{ij}, e_{ij} = n_{i+}n_{+j}/n_{++}, and δ_{ij} = o_{ij} − e_{ij}, using the same
approximating steps as for Pearson's chi-squared, we obtain
2 log Λ ≈ ∑ (o_{ij} − e_{ij})²/e_{ij}.
We find 2 log Λ = 5.129, and we refer this to χ24 . Clearly this is not significant,
as the mean of χ24 is 4, and is something we would expect to happen solely by
chance.
We can calculate the p-value: from tables, χ24 (0.05) = 9.488, so our observed
value is not significant at 5%, and the data are consistent with H0 .
We conclude that there is no evidence for a difference between the drug at
the given doses and the placebo.
For interest,
∑ (o_{ij} − e_{ij})²/e_{ij} = 5.173,
giving the same conclusion.
2.4.2 Confidence intervals and hypothesis tests
There is a close connection between confidence intervals and hypothesis tests.
Theorem.
(i) Suppose that for every θ_0 there is a size α test of H0 : θ = θ_0 with acceptance
region A(θ_0) (the complement of the critical region). Then I(X) = {θ : X ∈ A(θ)}
is a 100(1 − α)% confidence set for θ.
(ii) Conversely, if I(X) is a 100(1 − α)% confidence set for θ, then A(θ_0) =
{x : θ_0 ∈ I(x)} is the acceptance region of a size α test of H0 : θ = θ_0.
Proof. Note first that θ_0 ∈ I(X) if and only if X ∈ A(θ_0).
For (i), since the test has size α,
P(θ_0 ∈ I(X) | θ = θ_0) = P(X ∈ A(θ_0) | θ = θ_0) = 1 − α.
For (ii), since I(X) is a 100(1 − α)% confidence set, we have
P(θ_0 ∈ I(X) | θ = θ_0) = 1 − α.
So
P(X ∈ A(θ_0) | θ = θ_0) = P(θ_0 ∈ I(X) | θ = θ_0) = 1 − α.
2.5 Multivariate normal theory
2.5.1 Multivariate normal distribution
Let X be an n-dimensional random vector with mean µ = E[X] and covariance
matrix Σ = cov(X). For a constant matrix A,
E[AX] = Aµ,
and
cov(AX) = A cov(X) A^T.   (∗)
Note also that Σ is positive semi-definite, since
t^T Σ t = var(t^T X) ≥ 0.
So what is the pdf of a multivariate normal? And what is the moment generating
function? Recall that a (univariate) normal X ∼ N(µ, σ²) has density
f_X(x; µ, σ²) = (1/(√(2π) σ)) exp( −(x − µ)²/(2σ²) ).
Partition X ∼ N_n(µ, Σ) as X = (X_1, X_2), with µ = (µ_1, µ_2) and Σ split into
blocks Σ_{11}, Σ_{12}, Σ_{21}, Σ_{22} accordingly. The moment generating function is
M_X(t) = exp( t_1^T µ_1 + t_2^T µ_2 + (1/2) t_1^T Σ_{11} t_1 + (1/2) t_2^T Σ_{22} t_2 + (1/2) t_1^T Σ_{12} t_2 + (1/2) t_2^T Σ_{21} t_1 ).
We also know that M_{X_i}(t_i) = exp( t_i^T µ_i + (1/2) t_i^T Σ_{ii} t_i ). So M_X(t) =
M_{X_1}(t_1)M_{X_2}(t_2) for all t if and only if Σ_{12} = 0, i.e. X_1 and X_2 are independent
if and only if Σ_{12} = 0.
Note that Σ is always positive semi-definite. The conditions just forbid the
case |Σ| = 0, since this would lead to dividing by zero.
2.5.2 Normal random samples
Theorem (Joint distribution of X̄ and S_XX). Suppose X_1, ..., X_n are iid N(µ, σ²),
and let X̄ = (1/n) ∑ X_i and S_XX = ∑ (X_i − X̄)². Then X̄ ∼ N(µ, σ²/n),
S_XX/σ² ∼ χ²_{n−1}, and X̄ and S_XX are independent.
Proof. Let Y = AX, where A is the n × n orthogonal matrix whose first row is
(1/√n, 1/√n, ..., 1/√n),
and whose kth row, for k = 2, ..., n, is
(1/√(k(k−1)), ..., 1/√(k(k−1)), −(k−1)/√(k(k−1)), 0, ..., 0),
with k − 1 entries equal to 1/√(k(k−1)) followed by −(k−1)/√(k(k−1)) and then zeros.
Then Y ∼ N_n(Aµ, σ²AA^T) = N_n(Aµ, σ²I).
We have
Aµ = (√n µ, 0, ..., 0)^T.
So Y_1 ∼ N(√n µ, σ²) and Y_i ∼ N(0, σ²) for i = 2, ..., n. Also, Y_1, ..., Y_n are
independent, since their covariance matrix σ²AA^T = σ²I has every non-diagonal
term equal to 0.
But from the definition of A, we have
Y_1 = (1/√n) ∑_{i=1}^n X_i = √n X̄.
So √n X̄ ∼ N(√n µ, σ²), or X̄ ∼ N(µ, σ²/n). Also, since A is orthogonal,
∑ Y_i² = ∑ X_i², so
S_XX = ∑ (X_i − X̄)² = ∑ X_i² − nX̄² = ∑_{i=1}^n Y_i² − Y_1² = ∑_{i=2}^n Y_i².
Hence S_XX/σ² = ∑_{i=2}^n (Y_i/σ)² ∼ χ²_{n−1}, and S_XX is independent of
Y_1 = √n X̄, i.e. of X̄.
2.6 Student's t-distribution
Definition (t-distribution). If Z ∼ N(0, 1) and Y ∼ χ²_k are independent, then
T = Z/√(Y/k) has a t-distribution on k degrees of freedom, written T ∼ t_k.
Applying this with Z = √n(X̄ − µ)/σ ∼ N(0, 1) and Y = S_XX/σ² ∼ χ²_{n−1},
which are independent by the previous section, the unknown σ cancels and
√n(X̄ − µ)/√(S_XX/(n − 1)) ∼ t_{n−1}.
We write σ̃² = S_XX/(n − 1) (note that this is the unbiased estimator of σ²). Then
a 100(1 − α)% confidence interval for µ is found from
1 − α = P( −t_{n−1}(α/2) ≤ √n(X̄ − µ)/σ̃ ≤ t_{n−1}(α/2) ),
which has endpoints X̄ ± (σ̃/√n) t_{n−1}(α/2).
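A sketch of this interval for simulated data; scipy's t.ppf gives the percentage point t_{n−1}(α/2), and np.var(x, ddof=1) is the unbiased estimator σ̃² = S_XX/(n − 1). The true µ, σ and n are made up.

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(5)
mu, sigma, n = 10.0, 2.0, 16               # illustrative values
x = rng.normal(mu, sigma, size=n)

xbar = x.mean()
sigma_tilde = np.sqrt(np.var(x, ddof=1))   # sqrt of S_XX / (n - 1)
half_width = t.ppf(0.975, df=n - 1) * sigma_tilde / np.sqrt(n)

print((xbar - half_width, xbar + half_width))   # 95% confidence interval for mu
```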
3 Linear models
3.1 Linear models
Linear models can be used to explain or model the relationship between a
response (or dependent) variable, and one or more explanatory variables (or
covariates or predictors). As the name suggests, we assume the relationship is
linear.
Example. How do motor insurance claim rates (response) depend on the age
and sex of the driver, and where they live (explanatory variables)?
It is important to note that (unless otherwise specified), we do not assume
normality in our calculations here.
Suppose we have p covariates x_j, and we have n observations Y_i. We assume
n > p, or else we can pick the parameters to fit our data exactly. Then each
observation can be written as
Y_i = β_1 x_{i1} + ··· + β_p x_{ip} + ε_i   (∗)
for i = 1, ..., n. Here
– β1 , · · · , βp are unknown, fixed parameters we wish to work out (with n > p)
– xi1 , · · · , xip are the values of the p covariates for the ith response (which
are all known).
– ε1 , · · · , εn are independent (or possibly just uncorrelated) random variables
with mean 0 and variance σ 2 .
We think of the βj xij terms to be the causal effects of xij and εi to be a random
fluctuation (error term).
Then we clearly have
– E(Y_i) = β_1 x_{i1} + ··· + β_p x_{ip}.
– var(Yi ) = var(εi ) = σ 2 .
– Y1 , · · · , Yn are independent.
Note that (∗) is linear in the parameters β1 , · · · , βp . Obviously the real world
can be much more complicated. But this is much easier to work with.
Example. For each of 24 males, the maximum volume of oxygen uptake in the
blood and the time taken to run 2 miles (in minutes) were measured. We want
to know how the time taken depends on oxygen uptake.
We might get the results
For each individual i, we let Yi be the time to run 2 miles, and xi be the maximum
volume of oxygen uptake, i = 1, · · · , 24. We might want to fit a straight line to
it. So a possible model is
Yi = a + bxi + εi ,
where εi are independent random variables with variance σ 2 , and a and b are
constants.
The subscripts in the equation make it tempting to write the model in matrix
form: let Y = (Y_1, ..., Y_n)^T, β = (β_1, ..., β_p)^T, ε = (ε_1, ..., ε_n)^T, and let X be
the n × p matrix with entries x_{ij}. Then
Y = Xβ + ε.
Definition (Least squares estimator). In a linear model Y = Xβ + ε, the least
squares estimator β̂ of β minimizes
S(β) = ∥Y − Xβ∥² = (Y − Xβ)^T(Y − Xβ) = ∑_{i=1}^n (Y_i − ∑_j x_{ij}β_j)².
Differentiating with respect to β_k and setting the derivative to zero, we need,
for each k,
∑_i (−2 x_{ik})(Y_i − ∑_j x_{ij}β̂_j) = 0,
i.e. ∑_i x_{ik}(Y_i − ∑_j x_{ij}β̂_j) = 0 for all k. So
X^T X β̂ = X^T Y.   (3)
If X has full rank p, then t^T X^T X t = ∥Xt∥² > 0 for t ≠ 0 in R^p (the last
inequality is because if there were a t ≠ 0 with ∥Xt∥ = 0, then we would have
produced a linear combination of the columns of X that gives 0). So X^T X is
positive definite, and hence has an inverse. So
β̂ = (X^T X)^{-1} X^T Y,   (4)
which is linear in Y.
We have
E[β̂] = (X^T X)^{-1} X^T E[Y] = (X^T X)^{-1} X^T Xβ = β,
so β̂ is unbiased, and
cov(β̂) = (X^T X)^{-1} X^T cov(Y) X (X^T X)^{-1} = σ²(X^T X)^{-1},
since cov Y = σ²I.
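A minimal numerical sketch of equations (3) and (4) with made-up data; in practice one would use a least-squares solver such as numpy.linalg.lstsq rather than forming the inverse explicitly.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 50, 3
X = rng.normal(size=(n, p))               # made-up full-rank design matrix
beta = np.array([1.0, -2.0, 0.5])
Y = X @ beta + rng.normal(scale=0.3, size=n)

beta_hat = np.linalg.inv(X.T @ X) @ X.T @ Y          # equation (4), for illustration only
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)   # numerically preferable

print(beta_hat)
print(beta_lstsq)                          # the two agree
```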
3.2 Simple linear regression
The simple linear regression model is Y_i = a + b x_i + ε_i. It is convenient to
reparametrise in terms of the centred covariate: writing x̄ = (1/n) ∑ x_i and
a′ = a + b x̄, the model becomes
Y_i = a′ + b(x_i − x̄) + ε_i,
so that the columns of the design matrix X are (1, ..., 1)^T and (x_1 − x̄, ..., x_n − x̄)^T.
Then
X^T X = diag(n, S_xx),
where S_xx = ∑ (x_i − x̄)².
Hence
(X^T X)^{-1} = diag(1/n, 1/S_xx).
So
β̂ = (X^T X)^{-1} X^T Y = (Ȳ, S_xY/S_xx)^T,
where S_xY = ∑ Y_i(x_i − x̄).
Hence the estimated intercept is â′ = ȳ, and the estimated gradient is
b̂ = S_xy/S_xx
  = ∑_i y_i(x_i − x̄) / ∑_i (x_i − x̄)²
  = [ ∑_i (y_i − ȳ)(x_i − x̄) / √( ∑_i (x_i − x̄)² ∑_i (y_i − ȳ)² ) ] × √(S_yy/S_xx)   (∗)
  = r × √(S_yy/S_xx).
We have (∗) since ∑ ȳ(x_i − x̄) = 0, so we can replace y_i by (y_i − ȳ) in the
numerator; the square root factors just amount to multiplying and dividing by
the same quantity.
So the gradient is the Pearson product-moment correlation coefficient r times
the ratio of the empirical standard deviations of the y’s and x’s (note that the
gradient is the same whether the x’s are standardised to have mean 0 or not).
Hence we get cov(β̂) = σ²(X^T X)^{-1}, and so from our expression for (X^T X)^{-1},
var(â′) = var(Ȳ) = σ²/n,   var(b̂) = σ²/S_xx.
Note that these estimators are uncorrelated.
Note also that these are obtained without any explicit distributional assump-
tions.
Example. Continuing our previous oxygen/time example, we have ȳ = 826.5,
S_xx = 783.5 = 28.0², S_xy = −10077, S_yy = 444², r = −0.81, b̂ = −12.9.
Theorem (Gauss Markov theorem). In a full rank linear model, let β̂ be the
least squares estimator of β and let β ∗ be any other unbiased estimator for β
which is linear in the Yi ’s. Then
var(tT β̂) ≤ var(tT β ∗ ).
for all t ∈ Rp . We say that β̂ is the best linear unbiased estimator of β (BLUE).
Proof. Since β∗ is linear in the Y_i's, β∗ = AY for some p × n matrix A.
Since β∗ is an unbiased estimator, we must have E[β∗] = β. But E[β∗] =
A E[Y] = AXβ, so we must have β = AXβ. Since this holds for all β, we must
have AX = I_p.
Now β∗ − β = AY − β = A(Xβ + ε) − β = Aε, using AX = I_p. So
cov(β∗) = E[(β∗ − β)(β∗ − β)^T]
= E[Aε(Aε)^T]
= A(σ²I)A^T
= σ²AA^T.
Now let B = A − (X^T X)^{-1} X^T. Then
BX = AX − (X^T X)^{-1} X^T X = I_p − I_p = 0.
Hence
cov(β ∗ ) = σ 2 AAT
= σ 2 (B + (X T X)−1 X T )(B + (X T X)−1 X T )T
= σ 2 (BB T + (X T X)−1 )
= σ 2 BB T + cov(β̂).
var(tT β ∗ ) = tT cov(β ∗ )t
= tT cov(β̂)t + tT BB T tσ 2
= var(tT β̂) + σ 2 ∥B T t∥2
≥ var(tT β̂).
Taking t = (0, ..., 0, 1, 0, ..., 0)^T with the 1 in the ith position gives
var(β̂_i) ≤ var(β∗_i).
Example. Each of five instruments measures the resistivity of five wafers, giving
data of the following form:

                          Wafer
                    1      2      3      4      5
Instrument  1    130.5  112.4  118.9  125.7  134.0
            ⋮       ⋮      ⋮      ⋮      ⋮      ⋮

Let Y_{i,j} be the resistivity of the jth wafer measured by instrument i, where
i, j = 1, ..., 5. A possible model is
Yi,j = µi + εi,j ,
where εij are independent random variables such that E[εij ] = 0 and var(εij ) =
σ 2 , and the µi ’s are unknown constants.
This can be written in matrix form, with
Y = (Y_{1,1}, ..., Y_{1,5}, Y_{2,1}, ..., Y_{2,5}, ..., Y_{5,5})^T,  β = (µ_1, µ_2, µ_3, µ_4, µ_5)^T,
ε the corresponding vector of the ε_{i,j}, and X the 25 × 5 indicator matrix whose
ith column has ones in the rows corresponding to instrument i and zeros elsewhere.
Then
Y = Xβ + ε.
We have
X^T X = diag(5, 5, 5, 5, 5) = 5 I_5.
Hence
(X^T X)^{-1} = (1/5) I_5.
So we have
µ̂ = (X^T X)^{-1} X^T Y = (Ȳ_1, ..., Ȳ_5)^T.
The residual sum of squares is
RSS = ∑_{i=1}^5 ∑_{j=1}^5 (Y_{i,j} − µ̂_i)² = ∑_{i=1}^5 ∑_{j=1}^5 (Y_{i,j} − Ȳ_i)² = 2170,
on 25 − 5 = 20 degrees of freedom.
3.3 Linear models with normal assumptions
We now assume in addition that the errors are normal: ε ∼ N_n(0, σ²I), so that
Y ∼ N_n(Xβ, σ²I). This is a special case of the linear model we just had, so all
previous results still hold.
Since Y ∼ N_n(Xβ, σ²I), the log-likelihood is
l(β, σ²) = −(n/2) log 2π − (n/2) log σ² − S(β)/(2σ²),
where
S(β) = (Y − Xβ)^T(Y − Xβ).
If we want to maximise l with respect to β, we have to maximise the only term
containing β, i.e. −S(β)/(2σ²) — equivalently, minimise S(β). So
Proposition. Under normal assumptions the maximum likelihood estimator
for a linear model is
β̂ = (X T X)−1 X T Y,
which is the same as the least squares estimator.
This isn’t coincidence! Historically, when Gauss devised the normal distri-
bution, he designed it so that the least squares estimator is the same as the
maximum likelihood estimator.
To obtain the MLE for σ², we require
∂l/∂σ² |_{β̂, σ̂²} = 0,
i.e.
−n/(2σ̂²) + S(β̂)/(2σ̂⁴) = 0.
So
σ̂² = S(β̂)/n = (1/n)(Y − Xβ̂)^T(Y − Xβ̂) = RSS/n.
Our ultimate goal now is to show that β̂ and σ̂ 2 are independent. Then we can
apply our other standard results such as the t-distribution.
First recall that the matrix P = X(X T X)−1 X T that projects Y to Ŷ is
idempotent and symmetric. We will prove the following properties of it:
Lemma.
(i) If Z ∼ Nn (0, σ 2 I) and A is n × n, symmetric, idempotent with rank r,
then ZT AZ ∼ σ 2 χ2r .
(ii) For a symmetric idempotent matrix A, rank(A) = tr(A).
Proof.
(i) Since A is idempotent, A2 = A by definition. So eigenvalues of A are either
0 or 1 (since λx = Ax = A2 x = λ2 x).
Since A is also symmetric, it is diagonalizable. So there exists an orthogonal
Q such that
Λ = Q^T A Q = diag(λ_1, ..., λ_n) = diag(1, ..., 1, 0, ..., 0),
with r ones (since rank(A) = r). Let W = Q^T Z. Then W ∼ N_n(0, σ²I), and
Z^T A Z = W^T Λ W = ∑_{i=1}^r W_i² ∼ σ²χ²_r.
(ii)
rank(A) = rank(Λ)
= tr(Λ)
= tr(QT AQ)
= tr(AQT Q)
= tr A
Now we can deduce the joint properties of β̂ and RSS.
– β̂ = (X^T X)^{-1} X^T Y is a linear function of the normal vector Y, so it is normal,
with mean
(X^T X)^{-1} X^T (Xβ) = β
and covariance
(X^T X)^{-1} X^T (σ²I) X (X^T X)^{-1} = σ²(X^T X)^{-1}.
So
β̂ ∼ N_p(β, σ²(X^T X)^{-1}).
– Since I_n − P is symmetric and idempotent,
rank(I_n − P) = tr(I_n − P) = n − p.
The residual vector is R = Y − Ŷ = (I_n − P)Y = (I_n − P)ε, since (I_n − P)X = 0.
So by the lemma, RSS = R^T R = ε^T(I_n − P)ε ∼ σ²χ²_{n−p}, and hence
σ̂² = RSS/n ∼ (σ²/n) χ²_{n−p}.
– Let C = (X^T X)^{-1} X^T, so that β̂ = CY, and let V = (β̂, R)^T = DY, where
D = (C; I_n − P) is a (p + n) × n matrix.
Since Y is multivariate normal, V is multivariate normal with
cov(V) = D σ²I D^T
= σ² ( CC^T , C(I_n − P)^T ; (I_n − P)C^T , (I_n − P)(I_n − P)^T )
= σ² ( CC^T , 0 ; 0 , I_n − P ),
since C(I_n − P) = (X^T X)^{-1} X^T (I_n − P) = 0. Hence β̂ = CY and R are
independent, and so β̂ is independent of RSS = R^T R, i.e. β̂ and σ̂² are
independent.
Definition (Standard error). The standard error of β̂_j is
SE(β̂_j) = √( σ̃² (X^T X)^{-1}_{jj} ),
where σ̃² = RSS/(n − p). Unlike the actual variance σ²(X^T X)^{-1}_{jj}, the standard
error is calculable from our data.
Then
(β̂_j − β_j)/SE(β̂_j) = (β̂_j − β_j)/√( σ̃²(X^T X)^{-1}_{jj} )
= [ (β̂_j − β_j)/√( σ²(X^T X)^{-1}_{jj} ) ] / √( RSS/((n − p)σ²) ).
By writing it in this somewhat weird form, we now recognize both the numer-
ator and denominator. The numerator is a standard normal N(0, 1), and the
denominator is an independent √(χ²_{n−p}/(n − p)), as we have previously shown.
But a standard normal divided by an independent √(χ²_{n−p}/(n − p)) is, by
definition, t_{n−p} distributed. So
(β̂_j − β_j)/SE(β̂_j) ∼ t_{n−p}.
So a 100(1 − α)% confidence interval for β_j has end points β̂_j ± SE(β̂_j) t_{n−p}(α/2).
In particular, if we want to test H0 : β_j = 0, we use the fact that under H0,
β̂_j/SE(β̂_j) ∼ t_{n−p}.
3.6 Simple linear regression
We can apply these results to the simple linear regression model
Y_i = a′ + b(x_i − x̄) + ε_i,
where the ε_i are iid N(0, σ²). Here (â′, b̂) and σ̂² = RSS/n are independent, as we
have previously shown.
Note that σ̂² is obtained by dividing RSS by n, and is the maximum likelihood
estimator. On the other hand, σ̃² is obtained by dividing RSS by n − p, and is
an unbiased estimator.
Example. Using the oxygen/time example, we have seen that
σ̃² = RSS/(n − p) = 67968/(24 − 2) = 3089 = 55.6².
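A sketch of the interval β̂_j ± SE(β̂_j) t_{n−p}(α/2) for the slope in a simple linear regression, on made-up data (the original oxygen-uptake measurements are not reproduced in these notes, so the values below are only illustrative).

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(6)
n = 24
x = rng.uniform(40, 60, size=n)                  # made-up covariate values
y = 1000 - 12 * (x - x.mean()) + rng.normal(scale=50, size=n)

Sxx = np.sum((x - x.mean()) ** 2)
b_hat = np.sum(y * (x - x.mean())) / Sxx         # estimated gradient
a_hat = y.mean()                                 # estimated intercept (centred parametrisation)

fitted = a_hat + b_hat * (x - x.mean())
rss = np.sum((y - fitted) ** 2)
sigma_tilde = np.sqrt(rss / (n - 2))             # p = 2 parameters

se_b = sigma_tilde / np.sqrt(Sxx)
half_width = t.ppf(0.975, df=n - 2) * se_b
print((b_hat - half_width, b_hat + half_width))  # 95% confidence interval for b
```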
3.7 Expected response at x∗
Suppose we wish to estimate the expected response at a new covariate value x∗,
i.e. E[Y∗] = x∗^T β. This is estimated by Ŷ∗ = x∗^T β̂, and
Ŷ∗ = x∗^T β̂ ∼ N(x∗^T β, σ²τ²), where τ² = x∗^T (X^T X)^{-1} x∗,
so a 100(1 − α)% confidence interval for the expected response has endpoints
x∗^T β̂ ± σ̃τ t_{n−p}(α/2).
Now suppose instead that we want an interval for a new observation
Y∗ = x∗^T β + ε∗ itself, where ε∗ ∼ N(0, σ²) independently of the data. Then
var(Ŷ∗ − Y∗) = var(Ŷ∗) + var(ε∗) = σ²τ² + σ².
We can see this as the uncertainty in the regression line σ 2 τ 2 , plus the wobble
about the regression line σ 2 . So
Ŷ ∗ − Y ∗ ∼ N (0, σ 2 (τ 2 + 1)).
We therefore find that
(Ŷ∗ − Y∗)/(σ̃ √(τ² + 1)) ∼ t_{n−p}.
So the interval with endpoints
x∗^T β̂ ± σ̃ √(τ² + 1) t_{n−p}(α/2)
is a 100(1 − α)% prediction interval for Y∗.
confidence intervals are about finding parameters of the distribution, while the
prediction interval is about our predictions.
Example. A 95% prediction interval for Y∗ at x∗^T = (1, (50 − x̄)) is
x∗^T β̂ ± σ̃ √(τ² + 1) t_{n−p}(α/2) = 808.5 ± 55.6 × 1.02 × 2.07 = (691.1, 925.8).
Note that this is much wider than our interval for the expected response! This is
because there are three sources of uncertainty: we don't know what σ is, what b̂
is, and there is the random fluctuation ε∗ itself.
Example. Wafer example continued: Suppose we wish to estimate the expected
resistivity of a new wafer in the first instrument. Here x∗T = (1, 0, · · · , 0) (recall
that x is an indicator vector to indicate which instrument is used).
The estimated response at x∗ is
x∗^T µ̂ = µ̂_1 = ȳ_1 = 124.3.
We find
τ² = x∗^T (X^T X)^{-1} x∗ = 1/5.
So a 95% confidence interval for E[Y_1∗] is
x∗^T µ̂ ± σ̃τ t_{n−p}(α/2) = 124.3 ± (10.4/√5) × 2.09 = (114.6, 134.0).
Note that we are using an estimate of σ obtained from all five instruments. If
we had only used the data from the first instrument, σ would be estimated as
σ̃_1 = √( ∑_{j=1}^5 (y_{1,j} − ȳ_1)² / (5 − 1) ) = 8.74.
The observed 95% confidence interval for µ_1 would have been
ȳ_1 ± (σ̃_1/√5) t_4(α/2) = 124.3 ± 3.91 × 2.78 = (113.5, 135.1),
which is slightly wider. Usually it is much wider, but in this special case, we
only get little difference since the data from the first instrument is relatively
tighter than the others.
A 95% prediction interval for Y_1∗ at x∗^T = (1, 0, ..., 0) is
x∗^T µ̂ ± σ̃ √(τ² + 1) t_{n−p}(α/2) = 124.3 ± 10.42 × 1.1 × 2.07 = (100.5, 148.1).
3.8 Hypothesis testing
3.8.1 Hypothesis testing
We first need a lemma about independence of quadratic forms.
Lemma. Suppose Z ∼ N_n(0, σ²I_n) and A_1, A_2 are symmetric, idempotent n × n
matrices with A_1A_2 = 0. Then Z^T A_1 Z and Z^T A_2 Z are independent.
Proof. Let W_i = A_i Z for i = 1, 2, and W = (W_1; W_2) = (A_1; A_2)Z. Then
W ∼ N_{2n}( (0; 0), σ² (A_1, 0; 0, A_2) ),
since the off-diagonal blocks of the covariance matrix are σ²A_1^T A_2 = σ²A_1A_2 = 0.
So W_1 and W_2 are independent, which implies
W_1^T W_1 = Z^T A_1^T A_1 Z = Z^T A_1 A_1 Z = Z^T A_1 Z
and
W_2^T W_2 = Z^T A_2^T A_2 Z = Z^T A_2 A_2 Z = Z^T A_2 Z
are independent.
Now we go to hypothesis testing in general linear models. Suppose X = (X_0  X_1),
where X is n × p, X_0 is n × p_0 and X_1 is n × (p − p_0), and partition β = (β_0; β_1)
correspondingly, with rank(X) = p and rank(X_0) = p_0.
We want to test H0 : β_1 = 0 against H1 : β_1 ≠ 0. Under H0, X_1β_1 vanishes
and
Y = X_0 β_0 + ε.
Under H0, the mles of β_0 and σ² are
β̂̂_0 = (X_0^T X_0)^{-1} X_0^T Y,
σ̂̂² = RSS_0/n = (1/n)(Y − X_0 β̂̂_0)^T(Y − X_0 β̂̂_0),
and we have previously shown these are independent.
Note that our poor estimators wear two hats instead of one. We adopt the
convention that the estimators of the null hypothesis have two hats, while those
of the alternative hypothesis have one.
So the fitted values under H0 are
Ŷ̂ = X_0(X_0^T X_0)^{-1} X_0^T Y = P_0 Y,
where P_0 = X_0(X_0^T X_0)^{-1} X_0^T.
[Analysis of variance table: the 'Total' row has n − p_0 degrees of freedom and sum of squares RSS_0.]
The ratio (RSS_0 − RSS)/RSS_0 is sometimes known as the proportion of variance
explained by β_1, and denoted R².
3.8.2 Simple linear regression
Consider the simple linear regression model Y_i = a′ + b(x_i − x̄) + ε_i, and suppose
we wish to test H0 : b = 0 against H1 : b ≠ 0.
Note that the proportion of variance explained is
b̂² S_xx / S_yy = S_xy² / (S_xx S_yy) = r²,
where r is Pearson's product-moment correlation coefficient
r = S_xy / √(S_xx S_yy).
We have previously seen that under H0, b̂/SE(b̂) ∼ t_{n−2}, where SE(b̂) = σ̃/√S_xx.
So we let
t = b̂/SE(b̂) = b̂ √S_xx / σ̃.
Checking whether |t| > t_{n−2}(α/2) is precisely the same as checking whether
t² > F_{1,n−2}(α), since t²_{n−2} = F_{1,n−2}; so the t-test of H0 : b = 0 is equivalent
to the corresponding F-test.
3.8.3 One way analysis of variance with equal numbers in each group
Recall that in our wafer example, we made measurements in groups, and want to
know if there is a difference between groups. In general, suppose J measurements
are taken in each of I groups, and that
Yij = µi + εij ,
where εij are independent N (0, σ 2 ) random variables, and the µi are unknown
constants.
Fitting this model gives
RSS = ∑_{i=1}^I ∑_{j=1}^J (Y_{ij} − µ̂_i)² = ∑_{i=1}^I ∑_{j=1}^J (Y_{ij} − Ȳ_{i·})²,
on n − I degrees of freedom.
Suppose we want to test the hypothesis H0 : µi = µ, i.e. no difference between
groups.
Under H0, the model is Y_{ij} ∼ N(µ, σ²), and so µ̂ = Ȳ_{··}, and the fitted values
are Ŷ_{ij} = Ȳ_{··}.
The observed RSS_0 is therefore
RSS_0 = ∑_{i,j} (y_{ij} − ȳ_{··})².
[Analysis of variance table: the 'Total' row has n − 1 degrees of freedom and sum of squares ∑_i ∑_j (y_{ij} − ȳ_{··})².]
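A sketch of the corresponding F test on made-up data with I = 5 groups of J = 5 observations each; the statistic ((RSS_0 − RSS)/(I − 1)) / (RSS/(n − I)) is cross-checked against scipy.stats.f_oneway.

```python
import numpy as np
from scipy.stats import f, f_oneway

rng = np.random.default_rng(7)
I, J = 5, 5
mu = np.array([120, 125, 118, 130, 122])              # made-up group means
y = mu[:, None] + rng.normal(scale=6.0, size=(I, J))  # y[i, j]: J observations per group
n = I * J

group_means = y.mean(axis=1)
rss = np.sum((y - group_means[:, None]) ** 2)         # residual SS under H1, on n - I df
rss0 = np.sum((y - y.mean()) ** 2)                    # residual SS under H0, on n - 1 df

F = ((rss0 - rss) / (I - 1)) / (rss / (n - I))
p_value = 1 - f.cdf(F, I - 1, n - I)
print(F, p_value)

print(f_oneway(*y))                                   # same F statistic and p-value
```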