STATISTICAL INFERENCE
2. Probability model: Identify the random variable(s) associated with the problem and assign a suitable probability model. The model is described by some parameter(s) θ.
4. Use the information contained in the sample to draw inference, that is, gain knowledge about the population parameters θ and provide answers to the questions.
Let X1, X2, ..., Xn be independent and identically distributed (i.i.d.) random variables, each with pdf or pmf f(xi|θ). Then X1, X2, ..., Xn are called a random sample of size n from the population f(xi|θ).
• A random sample of size n implies a particular probability model described by the population f(xi|θ), that is, by the marginal pdf or pmf of each Xi. Notice that it depends on some parameter θ, and if we knew θ the model would be completely specified. However, θ is in general unknown and it is the object we are interested in estimating. For this reason we highlight the dependence on θ when indicating the pdf or pmf.
• The random sampling model describes an experiment where the variable of interest has a probability distribution described by f(xi|θ).
• Each Xi has a marginal distribution given by f(xi|θ). By independence, the joint pdf or pmf of the sample is

  f_X(x|θ) ≡ f_{X1,X2,...,Xn}(x1, x2, ..., xn|θ) = f(x1|θ) f(x2|θ) ··· f(xn|θ) = ∏_{i=1}^n f(xi|θ)
Example
• Poisson:

  P(X1 = x1, X2 = x2, ..., Xn = xn | λ) = ∏_{i=1}^n e^{−λ} λ^{xi} / xi! = e^{−nλ} λ^{Σ_{i=1}^n xi} / ∏_{i=1}^n xi!

• Exponential (with scale parameter β):

  f_{X1,X2,...,Xn}(x1, x2, ..., xn | β) = ∏_{i=1}^n (1/β) e^{−xi/β} = β^{−n} e^{−Σ_{i=1}^n xi / β}
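As a quick numerical sanity check (not part of the notes; the rate and the data below are made up), the product of individual Poisson pmfs agrees with the closed form above:

```python
import math

lam = 2.5                      # Poisson rate (example value)
xs = [1, 0, 3, 2]              # an observed sample of size n = 4
n = len(xs)

# Product of the individual pmfs f(x_i | lam)
product_form = math.prod(math.exp(-lam) * lam**x / math.factorial(x) for x in xs)

# Closed form: e^{-n lam} * lam^{sum x_i} / prod(x_i!)
closed_form = (math.exp(-n * lam) * lam**sum(xs)
               / math.prod(math.factorial(x) for x in xs))

print(product_form, closed_form)
```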
1.2 Statistics
Let T (x1 , x2 , . . . , xn ) be a real or vector valued function whose domain includes the sample space
of X1 , X2 , . . . , Xn . Then the random variable Y = T (X1 , X2 , . . . , Xn ) is called a statistic.
• Inferential questions are hard to answer just by looking at the raw data. Instead we summarise the sample through statistics such as the sample mean X̄ = (1/n) Σ_{i=1}^n Xi, the sample variance S² = (1/(n−1)) Σ_{i=1}^n (Xi − X̄)², and the sample standard deviation S = √(S²).
Notice that these are random variables and we denote their observed values as x̄, s², s.
Lemma Let X1 , X2 , . . . , Xn be a random sample from a population and let g(x) be a function
such that E[g(X1 )] and Var(g(X1 )) exist. Then,
  E( Σ_{i=1}^n g(Xi) ) = n E[g(X1)]

and

  Var( Σ_{i=1}^n g(Xi) ) = n Var[g(X1)].
Theorem Let X1, X2, ..., Xn be a random sample from a population with mean µ and variance σ² < ∞. Then
1. E(X̄) = µ,
2. Var(X̄) = σ²/n,
3. E(S²) = σ².
As we will see later in detail, we say that X̄ and S² are unbiased estimators of µ and σ², respectively.
Theorem Let X1 , X2 , . . . , Xn be a random sample from a population with mgf MXi (t). Then the
mgf of the sample mean is
MX̄ (t) = [MXi (t/n)]n .
When applicable, the theorem above provides a very convenient way for deriving the sampling
distribution.
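For instance, for a normal population (a standard application; this particular computation is an illustration, not reproduced from the notes): if the Yi are i.i.d. N(µ, σ²) with mgf M(t) = exp(µt + σ²t²/2), the theorem gives

```latex
M_{\bar Y}(t) = \left[M(t/n)\right]^n
             = \left[\exp\!\left(\mu\,\tfrac{t}{n} + \tfrac{\sigma^2 t^2}{2n^2}\right)\right]^n
             = \exp\!\left(\mu t + \tfrac{\sigma^2 t^2}{2n}\right),
```

which is the mgf of a N(µ, σ²/n) distribution, consistent with E(Ȳ) = µ and Var(Ȳ) = σ²/n above.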
Example
If we cannot use the above theorem we can derive the distribution of the transformation of random
variables by working directly with pdfs.
1.4 Transformation of Random Variables
Theorem: Let X, Y be random variables with pdfs f_X(x), f_Y(y), defined for x ∈ 𝒳 and y ∈ 𝒴, respectively. Suppose that g(·) is a monotone function such that g: 𝒳 → 𝒴 and g^{−1}(·) has a continuous derivative on 𝒴. The pdf of Y is then

  f_Y(y) = f_X(g^{−1}(y)) |d g^{−1}(y)/dy|   for y ∈ 𝒴,   and 0 otherwise.

If g^{−1}(·) is increasing,

  F_Y(y) = P(Y ≤ y) = P(g^{−1}(Y) ≤ g^{−1}(y)) = P(X ≤ g^{−1}(y)) = F_X(g^{−1}(y)),

  f_Y(y) = (d/dy) F_Y(y) = (d/dy) F_X(g^{−1}(y)) = f_X(g^{−1}(y)) d g^{−1}(y)/dy.

If g^{−1}(·) is decreasing,

  F_Y(y) = P(Y ≤ y) = P(g^{−1}(Y) ≥ g^{−1}(y)) = P(X ≥ g^{−1}(y)) = 1 − F_X(g^{−1}(y)),

  f_Y(y) = (d/dy)[1 − F_X(g^{−1}(y))] = −f_X(g^{−1}(y)) d g^{−1}(y)/dy.

(The derivative of a decreasing function is negative, so this is positive.)
Putting both cases together, if g^{−1}(·) is monotone,

  f_Y(y) = f_X(g^{−1}(y)) |d g^{−1}(y)/dy|.
Suppose X ∼ Gamma(α, 1), so f_X(x) = x^{α−1} e^{−x} / Γ(α). We want the distribution of Y = 1/X, therefore g(x) = 1/x and g^{−1}(y) = 1/y. Then |d g^{−1}(y)/dy| = 1/y². We can therefore write

  f_Y(y) = f_X(g^{−1}(y)) |d g^{−1}(y)/dy| = [ (1/y)^{α−1} exp(−1/y) / Γ(α) ] · (1/y²)
         = (1/y)^{α+1} exp(−1/y) / Γ(α),   0 < y < ∞.
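A simulation check of this result (illustrative only; α = 3 is an arbitrary choice): for Y = 1/X with X ∼ Gamma(α, 1), the resulting inverse-gamma distribution has mean 1/(α − 1) when α > 1:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 3.0
x = rng.gamma(shape=alpha, scale=1.0, size=200_000)  # X ~ Gamma(alpha, 1)
y = 1.0 / x                                          # Y = 1/X, inverse-gamma

# For alpha > 1, the inverse-gamma(alpha, 1) mean is 1/(alpha - 1)
empirical_mean = y.mean()
theoretical_mean = 1.0 / (alpha - 1.0)
print(empirical_mean, theoretical_mean)
```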
Square Transformations: What if g(·) is not monotone? For example consider Y = X²; then g^{−1}(y) = ±√y, which is not defined for y < 0, and clearly F_Y(y) = 0 if y < 0. For y ≥ 0,

  F_Y(y) = P(Y ≤ y) = P(X² ≤ y) = P(−√y ≤ X ≤ √y) = F_X(√y) − F_X(−√y),

  f_Y(y) = (d/dy) F_Y(y) = (d/dy)[F_X(√y) − F_X(−√y)]
         = (1/(2√y)) f_X(√y) + (1/(2√y)) f_X(−√y),   if y ≥ 0.

For X ∼ N(0, 1), note that

  f_Y(y) = 1/(Γ(1/2) 2^{1/2}) y^{1/2 − 1} exp(−y/2),

which is the pdf of a Gamma(1/2, 2) distribution, or else a χ² distribution with one degree of freedom.
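This can again be checked by simulation (illustrative, not from the notes): squaring standard normals should reproduce the χ²₁ mean of 1 and variance of 2:

```python
import numpy as np

rng = np.random.default_rng(1)
z = rng.standard_normal(500_000)
y = z**2      # Y = Z^2 should be chi-squared with 1 degree of freedom

# A chi^2_1 = Gamma(1/2, 2) variable has mean 1 and variance 2
m, v = y.mean(), y.var()
print(m, v)
```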
The Jacobian, J, of the transformation g(·) is the determinant of the matrix of derivatives above.
It provides a scaling factor for the change of volume under the transformation.
Theorem: If X, Y are independent random variables with pdfs f_X(x) and f_Y(y), the pdf of Z = X + Y is

  f_Z(z) = ∫_{−∞}^{+∞} f_X(w) f_Y(z − w) dw.

This formula for f_Z(z) is called the convolution of f_X(x) and f_Y(y).
Proof (of the convolution expression): We introduce an extra random variable W = X, so that
  Z = X + Y and W = X,   or equivalently   X = W and Y = Z − W.
The Jacobian is equal to 1. Since X, Y are independent, their joint pdf is f_{XY}(x, y) = f_X(x)f_Y(y). We can now write f_{ZW}(z, w) = f_X(w)f_Y(z − w). Finally,

  f_Z(z) = ∫_{−∞}^{+∞} f_{ZW}(z, w) dw = ∫_{−∞}^{+∞} f_X(w) f_Y(z − w) dw.
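The same formula holds in the discrete case with the integral replaced by a sum, P(Z = z) = Σ_w P(X = w) P(Y = z − w). A small illustration (mine, not the notes') with two fair dice via numpy's discrete convolution:

```python
import numpy as np

die = np.full(6, 1 / 6)          # pmf of a fair die on faces 1..6
# Discrete convolution: pmf of the sum of two independent dice
pmf_sum = np.convolve(die, die)  # index k corresponds to the sum k + 2

p7 = pmf_sum[7 - 2]              # P(X + Y = 7)
print(p7)
```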
Example: If X and Y are independent and identically distributed exponential random variables,
find the joint density function of U = X/Y and V = X + Y .
For U = X/Y, V = X + Y , the inverse transformation is X = U V /(1 + U ), Y = V /(1 + U ).
We have
@X @X @Y @Y
= V /(1 + U )2 , = U/(1 + U ), = V /(1 + U )2 , = 1/(1 + U ).
@U @V @U @V
The Jacobian is
V /(1 + U )2 U/(1 + U )
= V (1 + U )/(1 + U )3 = V /(1 + U )2
V /(1 + U )2 1/(1 + U )
The joint density factorises into a marginal density for V, which is a Gamma density with shape parameter 2 (and the same scale as the exponentials), and a Pareto-type density 1/(1 + u)² for U. So U and V are independent.
1.5 Sampling from the Normal distribution
1. A χ²_p distribution is a Gamma(p/2, 2) distribution. Its pdf is

  f(y) = 1/(Γ(p/2) 2^{p/2}) y^{p/2 − 1} e^{−y/2},   y > 0.

Student's t distribution: for Y ∼ t_p,
1. The pdf of Y is

  f_Y(y) = Γ((p+1)/2) / ( Γ(p/2) √(pπ) ) · (1 + y²/p)^{−(p+1)/2}.

2. E(Y) = 0, if p > 1.
3. Var(Y) = p/(p − 2), if p > 2.
1.5.3 Snedecor's F distribution

  F = (S_x²/σ_x²) / (S_y²/σ_y²) = (U/(n−1)) / (V/(m−1)),

where U ∼ χ²_{n−1}, V ∼ χ²_{m−1}, and U, V are independent.
Let X1, X2, ..., Xn be a random sample from a population with distribution function F(x) and density f(x). Define X_(i) to be the i-th smallest of the {Xi} (i = 1, ..., n), namely X_(1) ≤ X_(2) ≤ ... ≤ X_(n).
We want to find the density function of X_(i). Notice that while Xi is one element of the random sample, X_(i) is a statistic which is a function of the whole random sample.
Informally, we may write, for small δx,

  f_{X_(i)}(x) ≈ [ F_{X_(i)}(x + δx) − F_{X_(i)}(x) ] / δx = P[X_(i) ∈ (x, x + δx)] / δx.

The probability that X_(i) is in (x, x + δx) is roughly equal to the probability of (i − 1) observations in (−∞, x), one in (x, x + δx), and the remaining (n − i) in (x + δx, +∞). This is a trinomial probability, giving

  f_{X_(i)}(x) = n! / [ (i − 1)! (n − i)! ] · F(x)^{i−1} f(x) [1 − F(x)]^{n−i}.
Notice that this formula is a function of the population cdf and pdf, i.e. of a generic Xi.
Example: Let X1 , X2 , . . . , Xn be a random sample from U (0, 1). Find the density of X(i) .
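For reference, the answer to this example is the Beta(i, n − i + 1) density, whose mean is i/(n + 1). A quick simulation check of that mean (the values n = 5, i = 3 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n, i = 5, 3                               # sample size and order index
samples = rng.uniform(size=(100_000, n))
x_i = np.sort(samples, axis=1)[:, i - 1]  # i-th smallest in each row

# X_(i) from U(0,1) follows Beta(i, n - i + 1), with mean i / (n + 1)
emp = x_i.mean()
print(emp, i / (n + 1))
```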
Reading
G. Casella & R. L. Berger 2.1, 4.3, 4.6, 5.1, 5.2, 5.3, 5.4
A sufficient statistic for a parameter θ captures, in a certain sense, all the relevant information in the sample about θ.
Sufficiency Principle: If T(Y) is a sufficient statistic for θ, then any inference for θ should be based on the sample Y only through T(Y). That is, if x and y are two observed samples (that is, x = (x1, ..., xn) and y = (y1, ..., yn)) such that T(x) = T(y), then the inference about θ should be the same regardless of whether Y = y or Y = x was observed.
• If Y is discrete, the ratio above is a conditional probability mass function:

  P(Y = y | T(Y) = T(y)) = P(Y = y, T(Y) = T(y)) / P(T(Y) = T(y)).

• If it is continuous, it is just a conditional pdf.
• The definition refers to the conditional distribution. A statistic is sometimes defined as being sufficient for a family of distributions, F_Y(y|θ), θ ∈ Θ.
Example: Let Y = (Y1, ..., Yn) be a random sample from a Poisson(λ) population, and let U = T(Y) = Σ_{i=1}^n Yi. It can be shown that U ∼ Poisson(nλ). We can also write

  P(Y = y | U = u) = P(Y = y, U = u) / P(U = u),

where

  P(Y = y, U = u) = 0 if T(y) ≠ u,   and   P(Y = y, U = u) = P(Y = y) if T(y) = u.

Hence, for T(y) = u,

  P(Y = y | U = u) = P(Y = y) / P(U = u) = [ ∏_{i=1}^n exp(−λ) λ^{yi} / yi! ] / [ exp(−nλ) (nλ)^u / u! ]
                   = u! / ( n^u ∏_{i=1}^n yi! ),

which does not depend on λ.
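A numerical check (illustrative; the sample below is made up) that this conditional pmf is indeed free of λ and matches the closed form:

```python
import math

ys = [2, 0, 1]                    # observed sample, with u = sum(ys) = 3
n, u = len(ys), sum(ys)

def cond_pmf(lam):
    """P(Y = y | U = u) computed as the ratio of Poisson pmfs."""
    joint = math.prod(math.exp(-lam) * lam**y / math.factorial(y) for y in ys)
    marg = math.exp(-n * lam) * (n * lam)**u / math.factorial(u)
    return joint / marg

# Closed form u! / (n^u * prod(y_i!)), which involves no lambda
formula = math.factorial(u) / (n**u * math.prod(math.factorial(y) for y in ys))
print(cond_pmf(0.5), cond_pmf(4.0), formula)
```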
We give the proof for a discrete valued Y . The proof for the continuous case is quite technical and
beyond the scope of this course.
P (Y = y) = P (Y = y, T (Y ) = T (y)) (1)
but NOT
P (T (Y ) = T (y)) = P (Y = y, T (Y ) = T (y))
Indeed,

  P(T(Y) = T(y)) = Σ_{y_k : T(y_k) = T(y)} P(Y = y_k, T(Y) = T(y_k)) = Σ_{y_k : T(y_k) = T(y)} P(Y = y_k).   (2)

That is, the event {Y = y} is a subset of the event {T(Y) = T(y)}, but not vice versa.
If T is sufficient: Suppose T is sufficient for θ, that is, P(Y = y | T(Y) = T(y)) is independent of θ. We can write

  P_θ(Y = y) = P_θ(Y = y, T(Y) = T(y))   (by (1))
             = P_θ(T(Y) = T(y)) P(Y = y | T(Y) = T(y))
             = g(T(y), θ) h(y).
Example: Let Y = (Y1, ..., Yn) be a random sample from each of the following distributions; find a sufficient statistic in each case.
2. Sufficient statistic for (µ, σ²) from a N(µ, σ²) population.
The joint density may be written as

  f_Y(Y|µ, σ²) = (2πσ²)^{−n/2} exp( −[ n(Ȳ − µ)² + (n − 1)S² ] / (2σ²) ),

so T(Y) = (Ȳ, S²) is sufficient for (µ, σ²).

3. Sufficient statistic for θ from a U(0, θ) population.
The statistic T(Y) = max_i yi is sufficient for θ since, if we set g(T(y), θ) = θ^{−n} I(max_i yi < θ) and h(y) = I(min_i yi > 0), we have f_Y(y|θ) = g(T(y), θ) h(y).
Example (Sufficiency of the sample): Let Y = (Y1, ..., Yn) be a sample from a population with pdf f_{Yi}(yi|θ). Denote the joint density of the sample Y by f_Y(y|θ).
Note that

  f_Y(y|θ) = f_Y(y|θ) × 1 = g(T(y)|θ) × h(y),

where

  T(Y) = Y,   g(T(y)|θ) = f_Y(y|θ),   h(y) = 1.
Every sample is itself a sufficient statistic. Also every statistic that is a one-to-one function of a
sufficient statistic is itself a sufficient statistic.
Definition: A sufficient statistic T (Y ) is a minimal sufficient statistic if for any other sufficient
statistic T 0 (Y ), T (Y ) is a function of T 0 (Y ).
• If a sufficient statistic has dimension 1, it must be a minimal sufficient statistic.
• Minimal sufficient statistics are not unique. However, if two statistics are minimal sufficient they must have the same dimension.
• The dimension of a minimal sufficient statistic is not always the same as the dimension of
the parameter of interest.
Reading
3 Point Estimation
Problem:
• Suppose that a real world phenomenon may be described by a probability model defined
through the random variable Y with FY (y|✓).
• We want to use the information in the random sample Y to get a best guess for θ. In other words, we want a point estimate for θ.
Next, we look at two methods for finding point estimators, the method of moments and the maxi-
mum likelihood estimators. Then we present evaluation methods for estimators.
Description: Let Y = (Y1, Y2, ..., Yn) be a random sample from a population with pdf or pmf f(y|θ1, ..., θk). Let the r-th sample moment be defined as

  m_r = (1/n) Σ_{i=1}^n Yi^r,

and the r-th population moment as

  µ_r = µ_r(θ1, ..., θk) = E_θ(Yi^r).

Method of moments estimators are found by equating the first k sample moments to the corresponding k population moments and solving the resulting system of simultaneous equations.
Example: Suppose the population has two unknown parameters, its mean µ and variance σ². We want estimators for 2 parameters. Hence, we first write down the system of 2 equations

  X̄ = E(X) = µ,
  (1/n) Σ_{i=1}^n Xi² = E(X²) = µ² + σ².

Solving for µ and σ² yields

  µ̂ = X̄,
  σ̂² = (1/n) Σ_{i=1}^n Xi² − µ̂² = (1/n) Σ_{i=1}^n Xi² − X̄² = (1/n) Σ_{i=1}^n (Xi − X̄)² = [(n−1)/n] S².
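A small numerical illustration (made-up data) that the method of moments variance estimate equals (n − 1)/n times S²:

```python
import numpy as np

x = np.array([4.1, 5.3, 3.8, 6.0, 5.2, 4.6])   # made-up data
n = len(x)

mu_hat = x.mean()                        # first sample moment
sigma2_hat = (x**2).mean() - mu_hat**2   # second moment minus mean squared

# Same as the "1/n" variance, i.e. (n-1)/n times the unbiased S^2
s2 = x.var(ddof=1)
print(mu_hat, sigma2_hat, (n - 1) / n * s2)
```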
Definition: Let Y = (Y1, Y2, ..., Yn) be a sample from a population with pdf (or pmf) f(yi|θ). Then, given that Y = y is observed, the function of θ defined by the joint pdf (or pmf) of Y = y,

  L(θ|Y = y) = f_Y(y|θ),

is called the likelihood function.
Notes
• In most cases the pdf of Y is thought of as a function of Y, whereas the likelihood function is thought of as a function of θ for a given observed sample.
• If for θ1, θ2 we have L(θ1|y) > L(θ2|y), then the sample is more likely to have occurred if θ = θ1 than if θ = θ2. In other words, θ1 is a more plausible value than θ2.
• Sometimes it is more convenient to work with the log-likelihood, l(θ|y), which is just the log of the likelihood.
  l(θ|Y = y) = log f_Y(y|θ) = Σ_{i=1}^n log f(yi|θ)
Example: Consider a continuous random variable X with pdf f_X(x|θ); then for small ε we have

  P_θ(x − ε < X < x + ε) / (2ε) ≈ f_X(x|θ) = L(θ|x),
therefore if we compare the probabilities for different values of θ, the value of θ which gives the higher likelihood is more likely to be associated with the observed sample, since it gives a higher probability.
Example: For a random sample from N(µ, σ²),

  L(µ, σ²|Y = y) = ( 1/√(2πσ²) )^n exp( −Σ_{i=1}^n (yi − µ)² / (2σ²) ),

  l(µ, σ²|Y = y) = −(n/2) log(2π) − (n/2) log(σ²) − (1/(2σ²)) Σ_{i=1}^n (yi − µ)².
The score function is defined as

  s(θ|y) = ∂l(θ|y)/∂θ = (1/L(θ|y)) ∂L(θ|y)/∂θ.
Proposition: E(s(✓|Y )) = 0.
Proof: (for the continuous case)

  E[s(θ|Y)] = ∫_{R^n} s(θ|y) f(y|θ) dy = ∫_{R^n} [ (∂L(θ|y)/∂θ) / L(θ|y) ] f(y|θ) dy
            = ∫_{R^n} (∂/∂θ) f(y|θ) dy = (∂/∂θ) ∫_{R^n} f(y|θ) dy = 0,
because L(θ|y) = f(y|θ) and the last integral is equal to one, since the pdf is normalised. Here θ is to be understood as the true value of the unknown parameter.
Notes:
2. Although the score function is usually viewed as a function of ✓, the expectation is taken
with respect to Y , actually with respect to the distribution of Y which depends on ✓. This
may be interpreted as follows. If the experiment was repeated many times the score function
would on average equal 0. That is, if we start at the true value of the parameters, on average
over many experiments the likelihood does not change if we make an infinitesimal change
of the parameter.
In most cases, if we plot the likelihood function against ✓, we get a curve with a peak in the
maximum. The sharper the peak is, the more information about ✓ exists in the sample. This is
captured by the Fisher's information:

  I(θ|y) ≡ I(θ|Y = y) = E[ s(θ|Y)² ] = E[ ( ∂l(θ|Y)/∂θ )² ].
This is the variance of the score (when computed at the true value of θ): the larger it is, the more the score at the true value is affected by small changes in the parameter, the sharper the peak, and the more precise our information about θ. Again, θ here is to be understood as the true value of the unknown parameter.
In this case, at the true value of θ, the Fisher information is also the negative expected Hessian of the log-likelihood, so it measures the concavity of the log-likelihood. In particular, since the Fisher information must always be positive (it is a variance), the expected Hessian must be negative, which jointly with the zero expectation of the score (the first derivative of the log-likelihood) tells us that the true value of θ is, on average over realisations Y = y, a maximum of the log-likelihood.
Proof:

  0 = (d/dθ) E[s(θ|Y)] = (d/dθ) ∫_{R^n} s(θ|y) f(y|θ) dy = ∫_{R^n} (d/dθ)[ s(θ|y) f(y|θ) ] dy
    = ∫_{R^n} [ (d/dθ) s(θ|y) f(y|θ) + s(θ|y) (d/dθ) f(y|θ) ] dy
    = ∫_{R^n} [ (d/dθ) s(θ|y) + s(θ|y)² ] f(y|θ) dy,   using (d/dθ) f(y|θ) = s(θ|y) f(y|θ),
    = E[ (d/dθ) s(θ|y) ] + E[ s(θ|y)² ].

Hence I(θ|y) = E[s(θ|Y)²] = −E[(d/dθ) s(θ|Y)] = −E[∂² l(θ|Y)/∂θ²].
Let Y = (Y1, ..., Yn) be a random sample from a pdf f_{Yi}(yi|θ). Denote by s(θ|yi) and I(θ|yi) the score function and Fisher information for Yi = yi, respectively. Then, for a realisation of the random sample we have

  s(θ|y) = Σ_{i=1}^n s(θ|yi),   I(θ|y) = n I(θ|yi).
Proof: The log-likelihood function is

  ℓ(θ|Y) = log ∏_{i=1}^n f(Yi|θ) = Σ_{i=1}^n ℓ(θ|Yi).

Hence

  s(θ|Y) = ∂ℓ(θ|Y)/∂θ = Σ_{i=1}^n ∂ℓ(θ|Yi)/∂θ = Σ_{i=1}^n s(θ|Yi).

For the Fisher information, using the fact that (Y1, ..., Yn) are i.i.d. (so the scores s(θ|Yi) are independent with mean zero and the cross terms vanish), we have

  I(θ|y) = E[ ( Σ_{i=1}^n s(θ|Yi) )² ] = Σ_{i=1}^n E[ s(θ|Yi)² ] = n I(θ|yi).
Example: Let Y = (Y1, ..., Yn) be a random sample from an Exp(λ). Show that the score function is

  s(λ|y) = n/λ − Σ_{i=1}^n yi,

and that the Fisher information is

  I(λ|y) = n/λ².
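One can verify the score formula numerically by differentiating the Exp(λ) log-likelihood with a central finite difference (an illustration with made-up data; here f(y|λ) = λ e^{−λy}):

```python
import math

ys = [0.4, 1.3, 0.7, 2.1]       # fixed, made-up sample
n = len(ys)
lam = 1.5

def loglik(l):
    # Exp(lambda) log-likelihood: n*log(lambda) - lambda * sum(y_i)
    return n * math.log(l) - l * sum(ys)

score_formula = n / lam - sum(ys)
h = 1e-6
score_numeric = (loglik(lam + h) - loglik(lam - h)) / (2 * h)
print(score_formula, score_numeric)
```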
Vector parameter case: If θ = (θ1, ..., θp)′, then the score function is the vector

  s(θ|Y) = ∇_θ ℓ(θ|Y) = ( ∂ℓ(θ|Y)/∂θ1, ..., ∂ℓ(θ|Y)/∂θp )′.
1. E[s(θ|Y)] = 0_p,
2. the Fisher information is the p × p variance matrix of the score, or else

  [I(θ|y)]_{i,j} = −E[ ∂²ℓ(θ|Y)/∂θi ∂θj ].
Example: Let Y = (Y1, ..., Yn) be a random sample from a N(µ, σ²). Show that the score function is

  s(θ|y) = ( (1/σ²) Σ_{i=1}^n (yi − µ),
             −n/(2σ²) + (1/(2σ⁴)) Σ_{i=1}^n (yi − µ)² )′,

and that the Fisher information matrix is

  I(θ|y) = [ n/σ²    0
             0        n/(2σ⁴) ].
We have seen that the true value of the parameter θ must be such that the log-likelihood attains its maximum. This motivates mathematically the definition of the maximum likelihood estimator.
Maximization: In general the likelihood function can be maximized using numerical methods. However, if the function is differentiable in θ, calculus may be used. The values of θ such that

  s(θ|y) = ∂ℓ(θ|y)/∂θ = 0

are possible candidates. These points may not correspond to the maximum because
1. They may correspond to the minimum. The second derivative must also be checked.
2. The zeros of the first derivative locate only local maxima, we want a global maximum.
3. The maximum may be at the boundary where the first derivative may not be 0.
4. These points may be outside the parameter range.
Notice that an application of the Weak Law of Large Numbers tells us that, as n → ∞, we must have

  (1/n) s(θ|Y) = (1/n) Σ_{i=1}^n s(θ|Yi) →^p E[s(θ|Yi)] = 0,

which justifies our necessary condition.
Example: Let Y = (Y1, Y2, ..., Yn) be a random sample from N(µ, 1), −∞ < µ < +∞. Find the MLE for µ. The log-likelihood function is equal to

  ℓ(µ|y) = const. − (1/2) Σ_{i=1}^n (yi − µ)² = const. − (1/2) Σ_{i=1}^n yi² + µ Σ_{i=1}^n yi − (n/2) µ².

Setting the score function equal to 0 yields a candidate for the global maximum:

  (∂/∂µ) ℓ(µ̂|y) = 0  ⇒  Σ_{i=1}^n yi − n µ̂ = 0  ⇒  µ̂ = ȳ.

We can check that it corresponds to a maximum (and not a minimum) since the second derivative of the log-likelihood is negative:

  (∂²/∂µ²) ℓ(µ̂|y) = −n < 0.

The MLE for µ is µ̂ = Ȳ. (In fact more checking is required, but it is omitted for simplicity.)
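The calculus result can be confirmed numerically (an illustration with made-up data): maximizing ℓ(µ|y) over a fine grid lands on ȳ:

```python
import numpy as np

ys = np.array([1.2, -0.4, 0.9, 2.1, 0.3])   # made-up sample

def loglik(mu):
    # N(mu, 1) log-likelihood up to an additive constant
    return -0.5 * np.sum((ys - mu) ** 2)

grid = np.linspace(-2, 4, 6001)
mu_grid = grid[np.argmax([loglik(m) for m in grid])]
print(mu_grid, ys.mean())
```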
Example: We cannot always use the above calculus recipe. For example, let Y = (Y1, Y2, ..., Yn) be a random sample from U(0, θ). Suppose we observe Y = y and rank the realisations as y_(1) ≤ ... ≤ y_(n). These are then realisations of the order statistics Y_(i). The likelihood for θ given Y = y is

  L(θ|y) = θ^{−n} I(y_(1) ≥ 0) I(y_(n) ≤ θ),

and the log-likelihood (notice that by construction all realisations are such that y_(1) ≥ 0) is

  ℓ(θ|y) = −n log(θ)   if θ ≥ y_(n).

Since this is decreasing in θ, the function ℓ(θ|y) is maximized at θ̂ = y_(n), which is our estimate. Hence θ̂ = Y_(n) is the MLE.
Induced likelihood: Let Y be a sample with likelihood L(θ|y) and let η = g(θ). The induced likelihood for η given Y = y is

  L*(η|Y = y) = sup_{θ : g(θ) = η} L(θ|Y = y).

Theorem (Invariance property of the MLEs): If θ̂ is the MLE for θ, then for any function g(·) the MLE of g(θ) is g(θ̂).
Example: The MLE for µ² in the N(µ, 1) case is Ȳ², the MLE for p/(1 − p) in the Binomial(n, p) case is p̂/(1 − p̂), etc.
3.5 Evaluating Estimators
Being a function of the sample, an estimator is itself a random variable. Hence it has a mean and
a variance. Let θ̂ be an estimator of θ. The quantity

  E(θ̂ − θ)

is termed the bias of the estimator θ̂. If E(θ̂) = θ, the estimator is unbiased.
Note that the mean squared error (MSE) decomposes as MSE(θ̂) = E[(θ̂ − θ)²] = Var(θ̂) + [Bias(θ̂)]².
An estimator ✓ˆ1 is uniformly better than ✓ˆ2 if it has smaller MSE for all ✓.
Thus [(n−1)/n] S² is uniformly better than S². But it is not uniformly better than the estimator σ̂² = 1, which has zero MSE when σ² = 1.
As seen from the previous example, we cannot find a ‘uniformly best’ estimator. Hence we restrict
attention to unbiased estimators. The MSE of an unbiased estimator is equal to its variance. A
best unbiased estimator is also termed as a minimum variance unbiased estimator.
Theorem (Cramér–Rao inequality): Let Y = (Y1, ..., Yn) be a sample and U = h(Y) be an unbiased estimator of g(θ). Under regularity conditions the following holds for all θ:

  V(U) ≥ [ ∂g(θ)/∂θ ]² / I(θ|y).

In particular, if g(θ) = θ,

  V(U) ≥ 1 / I(θ|y),

and if Y is a random sample,

  V(U) ≥ 1 / ( n I(θ|yi) ).
Example: Let Y = (Y1, ..., Yn) be a random sample from N(µ, 1).

  (∂/∂µ) ℓ(µ|y) = Σ_{i=1}^n (yi − µ) = Σ_{i=1}^n (yi − ȳ + ȳ − µ)
                = Σ_{i=1}^n (yi − ȳ) + Σ_{i=1}^n (ȳ − µ) = n(ȳ − µ).

  I(µ|y) = −E[ (∂/∂µ) n(Ȳ − µ) ] = −E(−n) = n.

Moreover,

  E(Ȳ) = µ,   V(Ȳ) = 1/n.

Since µ̂ = Ȳ is unbiased and attains the Cramér–Rao lower bound for µ, it is also an MVUE for µ.
Example: Let Y = (Y1, Y2, ..., Yn) be a random sample from Poisson(λ). It is not hard to check that I(λ|y) = n/λ. Both the mean and the variance of a Poisson distribution are equal to λ. Hence

  E(Ȳ) = E(Yi) = λ,
  E(S²) = V(Yi) = λ.

Consider the estimators λ̂1 = Ȳ and λ̂2 = S². They are both unbiased. Which one to choose?

  V(Ȳ) = V(Yi)/n = λ/n.

Since λ̂1 is unbiased and attains the Cramér–Rao lower bound for λ, it is also an MVUE for λ. (In the notation of the attainment condition below: b(λ) = n/λ, h(y) = ȳ, g(λ) = λ.)
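A simulation comparison of the two unbiased estimators (illustrative; λ, n and the number of replications are arbitrary) shows Ȳ has the smaller variance, close to the bound λ/n:

```python
import numpy as np

rng = np.random.default_rng(3)
lam, n, reps = 4.0, 30, 20_000
samples = rng.poisson(lam, size=(reps, n))

ybar = samples.mean(axis=1)          # lambda-hat-1 in each replication
s2 = samples.var(axis=1, ddof=1)     # lambda-hat-2 in each replication

# Both are unbiased for lambda, but Ybar attains the Cramer-Rao bound lam/n
print(ybar.var(), s2.var(), lam / n)
```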
Proof of Cramér–Rao attainment theorem: The Cramér–Rao lower bound comes from the inequality

  Cov[h(Y), s(θ|Y)]² ≤ V[h(Y)] V[s(θ|Y)],

and the fact that, when equality holds in such a Cauchy–Schwarz inequality, the two variables are linearly related. The lower bound is attained if and only if the equality holds in the above, which is the case if and only if s(θ|Y) and h(Y) are linearly related:

  s(θ|Y) = a(θ) + b(θ) h(Y).

Using E[s(θ|Y)] = 0 and E[h(Y)] = g(θ),

  E[s(θ|Y)] = a(θ) + b(θ) E[h(Y)]  ⇒  0 = a(θ) + b(θ) g(θ)  ⇒  a(θ) = −b(θ) g(θ).
Let U′ be another minimum variance unbiased estimator (V(U) = V(U′)), and consider the estimator U* = ½U + ½U′.
Note that U* is also unbiased:

  E(U*) = E(½U + ½U′) = ½E(U) + ½E(U′) = g(θ),

and

  V(U*) = V(½U + ½U′)
        = V(½U) + V(½U′) + 2 Cov(½U, ½U′)
        = ¼V(U) + ¼V(U′) + ½Cov(U, U′)
        ≤ ¼V(U) + ¼V(U′) + ½[V(U)V(U′)]^{1/2}   (Cauchy–Schwarz)
        = V(U).   (since V(U) = V(U′))
We must have equality in the previous expression because U is an MVUE. This implies that U and U′ are linearly related, and matching their means and variances then forces U′ = U: the MVUE is unique.
We can use the concept of sufficiency for searching for minimum variance unbiased estimators.
Theorem (Rao–Blackwell): Let U(Y) be an unbiased estimator of g(θ) and T(Y) be a sufficient statistic for θ. Define W(Y) = E(U(Y)|T(Y)). Then for all θ:
1. E(W) = g(θ),
2. V(W) ≤ V(U).
Proof: The proof of the Rao-Blackwell theorem is based on the following conditional expectation
properties
We can write

  g(θ) = E(U) = E[E(U|T)] = E[W(Y)],   (tower property)

  V(U) = V[E(U|T)] + E[V(U|T)] = V[W(Y)] + E[V(U|T)] ≥ V[W(Y)].
Example: Let (Y1, ..., Yn) be a random sample from a distribution with mean µ and variance σ², and suppose that T = Σ_{i=1}^n Yi is sufficient for µ. Consider the estimator µ̂1 = Y1 for µ and find a better one.

  E(µ̂1) = E(Y1) = µ,   V(µ̂1) = V(Y1) = σ².

Rao–Blackwellisation gives µ̂2 = E(µ̂1|T) = Ȳ. Indeed,

  E(µ̂2) = E(Ȳ) = µ,   V(µ̂2) = V(Ȳ) = σ²/n ≤ V(µ̂1).
Reading
G. Casella & R. L. Berger 6.3.1, 7.1, 7.2.1, 7.2.2, 7.3.1, 7.3.2, 7.3.3
4 Interval Estimation
• Point estimates provide a single value as a best guess for the parameter(s) of interest.
• Interval estimates provide an interval which we believe contains the true value of the param-
eter(s).
• More generally we may look for a confidence set (not necessarily an interval), for example when we are unsure whether the result of the procedure is an interval, or in cases with more than one parameter.
If the observed sample is y, then the interval [U1(y), U2(y)] is an interval estimate for θ.
Definition of coverage probability: The probability that the random interval contains the true parameter θ is termed the coverage probability and denoted by

  P[U1(Y) ≤ θ ≤ U2(Y)].

Definition of confidence level: The infimum over θ of the coverage probabilities is termed the confidence level (coefficient) of the interval:

  inf_θ P[U1(Y) ≤ θ ≤ U2(Y)].
Notes:
• The random variables in the coverage probability are U1(Y) and U2(Y). The coverage probability is interpreted as the probability that the random interval [U1(Y), U2(Y)] contains θ.
• If an interval has confidence level 1 ↵ the interpretation is: ‘If the experiment was repeated
many times 100 ⇥ (1 ↵)% percent of the corresponding intervals would contain the true
parameter ✓.’
  P[U1(Y) ≤ θ ≤ U2(Y)] = P[U1(Y) ≤ θ ∩ U2(Y) ≥ θ]
    = 1 − P[U1(Y) > θ ∪ U2(Y) < θ]
    = 1 − P[U1(Y) > θ] − P[U2(Y) < θ] + P[U1(Y) > θ ∩ U2(Y) < θ]
    = 1 − P[U1(Y) > θ] − P[U2(Y) < θ],

since U1(Y) ≤ U2(Y) makes the intersection event impossible.
Example: Given a random sample X = (X1, ..., X4) from N(µ, 1), compare the sample mean X̄, which is a point estimator, with the interval estimator [X̄ − 1, X̄ + 1]. At first sight with the interval estimator we just lose precision, but we actually gain in confidence. Indeed, while P(X̄ = µ) = 0, we have

  P(X̄ − 1 ≤ µ ≤ X̄ + 1) = P(−1 ≤ X̄ − µ ≤ 1) = P( −2 ≤ (X̄ − µ)/√(1/4) ≤ 2 ) = 0.9544,

because X̄ ∼ N(µ, 1/4). Therefore, we lose in precision but we now have over a 95% chance of covering the unknown parameter with this interval estimator.
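The 0.9544 figure is Φ(2) − Φ(−2), which can be computed from the error function (a check, not part of the notes):

```python
import math

def Phi(x):
    # Standard normal cdf via the error function
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

coverage = Phi(2) - Phi(-2)   # P(-2 <= Z <= 2)
print(coverage)
```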
The expected length of the interval is E(U2 − U1). A good interval estimator should minimise the expected length while maximising the confidence level.
Some notation: Suppose that the random variable X follows some distribution. We will denote by X_α the number for which

  P(X ≤ X_α) = α.

Naturally,

  P(X > X_α) = 1 − α.

We use such notation for various distributions. In particular we use the letters Z and Z_α for the standard normal distribution, where we also write

  P(Z ≤ Z_α) = Φ(Z_α) = α.
Consider, for a random sample Y = (Y1, ..., Yn) from N(µ, 1), the interval estimators
1. [k1, k2]
2. [Y1 − k1, Y1 + k2]
3. [Ȳ − k1, Ȳ + k2]

1. This interval does not depend on the sample. If µ ∈ [k1, k2] the coverage probability is 1, otherwise it is 0. Thus the confidence level is 0.
2.–3. Since Y1 − µ ∼ N(0, 1) while Ȳ − µ ∼ N(0, 1/n), interval 3 has the larger coverage:

  Φ(√n k1) + Φ(√n k2) − 1 ≥ Φ(k1) + Φ(k2) − 1.
By rearrangement we get the interval estimator for µ

  [ Ȳ − (1/√n) Z_{1−α2},  Ȳ + (1/√n) Z_{1−α1} ].

Using statistical tables we can construct the following table, where we fix α1 + α2 = 0.05. Hence we have the length of 95% confidence intervals for the mean of a normal distribution with unit variance for various lower and upper endpoints.
  α1      α2      Z_{1−α1}   Z_{1−α2}   √n × length
  0       0.05    +∞         1.645      ∞
  0.01    0.04    2.326      1.751      4.077
  0.02    0.03    2.054      1.881      3.935
  0.025   0.025   1.96       1.96       3.920
  0.03    0.02    1.881      2.054      3.935
  0.04    0.01    1.751      2.326      4.077
  0.05    0       1.645      +∞         ∞
Why 95%:
Let us consider symmetric intervals with confidence levels 0.8, 0.9, 0.95, and 0.99. Using the
previous procedure and statistical tables we can construct the following table where we have the
length of intervals for the mean of a normal distribution with unit variance for various confidence
levels.
  α1      α2      Z_{1−α1}   Z_{1−α2}   √n × length
  0.1     0.1     1.2816     1.2816     2.563
  0.05    0.05    1.645      1.645      3.290
  0.025   0.025   1.96       1.96       3.920
  0.005   0.005   2.576      2.576      5.152
Definition of a pivotal function: Consider a sample Y with density f_Y(y|θ) and suppose that we are interested in constructing an interval estimator for θ. A function G = G(Y, θ) of Y and θ is a pivotal function for θ if its distribution is known and does not depend on θ.
Example: Let Y1, Y2, ..., Yn be a random sample from a N(µ, σ²) with µ unknown and σ² known. We know that Ȳ ∼ N(µ, σ²/n), and we can use this to get the following pivotal function:

  Z = (Ȳ − µ) / (σ/√n) ∼ N(0, 1).

Notice that Z depends on µ, but its distribution does not change regardless of the value of µ.

Example: Let Y1, Y2, ..., Yn be a random sample from a N(µ, σ²) with µ known and σ² unknown. We know that

  Zi = (Yi − µ)/σ ∼ N(0, 1),

and that the Zi's are independent. Taking Σ_i Zi² gives us the following pivotal function:

  Σ_{i=1}^n Zi² = Σ_{i=1}^n (Yi − µ)² / σ² ∼ χ²_n.

When both µ and σ² are unknown, we cannot use Z above as a pivotal function for µ because it also depends on the unknown parameter σ. Instead we use

  (Ȳ − µ) / (S/√n) ∼ t_{n−1}.

In the same way we cannot use Σ_i Zi² for σ², since it involves µ which is unknown; instead we can use

  (n − 1)S² / σ² ∼ χ²_{n−1}.
Step 1: Find a pivotal function G = G(Y, θ) based on a reasonable point estimator for θ.
Step 2: Use the distribution of the pivotal function to find values g1 and g2 such that

  P( g1 ≤ G(Y, θ) ≤ g2 ) = 1 − α.

Step 3: Manipulate the inequalities G ≥ g1 and G ≤ g2 to make θ the reference point. This yields inequalities of the form U1(Y) ≤ θ and θ ≤ U2(Y).
Note: The endpoints U1, U2 are usually functions of one of g1 or g2, but not the other.
Suppose that we have a random sample Y from a N(µ, σ²) (with σ² known) and we want an interval estimator for µ with confidence level 1 − α.

  P( Z_{α/2} ≤ Z ≤ Z_{1−α/2} ) = 1 − P(Z < Z_{α/2}) − P(Z > Z_{1−α/2})
    = 1 − α/2 − [1 − (1 − α/2)]
    = 1 − (α/2 + α/2) = 1 − α.

The two inequalities are

  (Ȳ − µ)/(σ/√n) ≥ Z_{α/2}   and   (Ȳ − µ)/(σ/√n) ≤ Z_{1−α/2}.
Numerical Example: Suppose that we had n = 10, Ȳ = 5.2, σ² = 2.4 and α = 0.05. From suitable tables or statistical software we get Z_{0.975} = 1.96, so an interval estimate for µ with confidence level 1 − α is

  [ 5.2 − 1.96 √(2.4/10),  5.2 + 1.96 √(2.4/10) ].
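Computing the endpoints (a check of the numbers above):

```python
import math

n, ybar, sigma2, z = 10, 5.2, 2.4, 1.96
half_width = z * math.sqrt(sigma2 / n)   # z * sigma / sqrt(n)
lo, hi = ybar - half_width, ybar + half_width
print(lo, hi)
```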
Suppose that we have a random sample Y from a N(µ, σ²) (with σ² also unknown) and we want an interval estimator for µ with confidence level 1 − α.
Step 1: We know that

  T = T(Y, µ) = (Ȳ − µ) / (S/√n) ∼ t_{n−1}.

Thus T is a pivotal function.
Step 2: We can write

  µ ≤ Ȳ − (S/√n) t_{n−1,α/2},   and   µ ≥ Ȳ − (S/√n) t_{n−1,1−α/2}.

Step 4: Note that t_{n−1,α/2} = −t_{n−1,1−α/2}. We get

  [ Ȳ − (S/√n) t_{n−1,1−α/2},  Ȳ + (S/√n) t_{n−1,1−α/2} ].
Numerical Example: Suppose that we had n = 10, Ȳ = 5.2, S² = 2.4 and α = 0.05. From suitable tables or statistical software we get t_{9,0.975} = 2.262, so an interval estimate for µ with confidence level 1 − α is

  [ 5.2 − 2.262 √(2.4/10),  5.2 + 2.262 √(2.4/10) ].

Note: Compared with the known-σ² case the interval is now larger, despite the fact that here S² equals the value of σ² used there. The t distribution has fatter tails than the standard normal. On the other hand, as n grows the t distribution gets closer to the normal.
Reading
5 Asymptotic Evaluations
So far we considered evaluation criteria based on samples of finite size n. But, as mentioned above, there may be cases where a satisfactory solution does not exist. An alternative route is to approach these problems by letting n → ∞, in other words to study the asymptotic behaviour of the problem. We will look mainly into asymptotic properties of maximum likelihood procedures.
• In point estimation we use the information from the sample Y to provide a best guess for the parameters θ, through estimators

  θ̂ = h(Y)

that are functions of the sample Y. The realization of the sample provides a point estimate which reflects our belief about the parameter θ.
• There are many ways to find estimator functions. For example, one can use the method of moments or maximum likelihood estimators.
• Estimators can be evaluated through the mean squared error (MSE)

  E[(θ̂ − θ)²].

• But it is very hard to compare estimators based solely on MSE. Even irrational estimators like θ̂ = 1 are not worse than reasonable ones for all θ. For this reason we restrict attention to unbiased estimators:

  E(θ̂) = θ.
• The Cramér-Rao theorem provides a lower bound for the variance of an unbiased estimator.
Therefore if the variance of an unbiased estimator attains that bound, it provides an optimal
solution to the problem.
• Problem: Even an unbiased estimator may not be available or may not exist.
• In interval estimation we want to use the information from the sample Y to provide an
interval which we believe contains the true value of the parameter(s).
• The probability that the random interval contains the true parameter ✓ is termed as coverage
probability.
• The infimum of all the coverage probabilities is termed as confidence coefficient (level) of
the interval.
• There may exist more than one interval with the same level. One way to choose between them is through their expected length.
• Problem: Sometimes it may be even hard to find any ‘reasonable’ interval estimator.
Theorem: An estimator U_n is consistent for θ if
1. lim_{n→∞} V(U_n) = 0,
2. lim_{n→∞} Bias(U_n) = 0.
Definition: An estimator is asymptotically unbiased for ✓ if its bias goes to 0 as n ! 1 for any
✓ 2 ⇥.
Definition: The ratio of the Cramér-Rao lower bound over the variance of an estimator is termed
as efficiency. An efficient estimator has efficiency 1. We can compare estimators in terms of
their asymptotic efficiency, that is their efficiencies as n ! 1. An estimator is asymptotically
efficient if its asymptotic efficiency is 1.
Theorem (Asymptotic normality of MLEs): Under weak regularity conditions the maximum likelihood estimator g(θ̂) satisfies

  √n [ g(θ̂) − g(θ) ] →^d N( 0, g′(θ)² / I(θ|yi) ),   n → ∞.
Theorem (Slutsky): If X_n →^d X and Y_n →^p a, where a is a constant, then
a. Y_n X_n →^d aX;
b. X_n + Y_n →^d X + a.
Then, taking g(θ) = θ, consistency of the MLE follows:

  (θ̂ − θ) = [ 1/√(n I(θ|yi)) ] · √(n I(θ|yi)) (θ̂ − θ) →^d lim_{n→∞} [ 1/√(n I(θ|yi)) ] · Z = 0,   n → ∞,

where Z ∼ N(0, 1).
Asymptotic distribution of MLEs – Sketch of proof: Assume g(θ) = θ and let s′(θ|Y) denote (∂/∂θ) s(θ|Y). Let θ̂ be the MLE, and denote the true value by θ0. A Taylor expansion of the score around θ0, evaluated at θ̂, gives (ignoring the higher order terms)

  s(θ̂|Y) = s(θ0|Y) + s′(θ0|Y)(θ̂ − θ0)
  ⇒  θ̂ − θ0 = −s(θ0|Y)/s′(θ0|Y)   (since s(θ̂|Y) = 0)
  ⇒  √n (θ̂ − θ0) = [ (1/√n) s(θ0|Y) ] / [ −(1/n) s′(θ0|Y) ].   (15)
For the numerator of (15), from the Central Limit Theorem for i.i.d. random variables, and since
E[s(θ0|Yi)] = 0 and Var[s(θ0|Yi)] = I(θ0|Yi), we get

    n^{−1/2} s(θ0|Y) = n^{−1/2} Σ_{i=1}^n s(θ0|Yi)  →d  N(0, I(θ0|Yi)),   n → ∞.

For the denominator of (15), using the Weak Law of Large Numbers for i.i.d. random variables,
we get

    −(1/n) s′(θ0|Y) = −(1/n) Σ_{i=1}^n s′(θ0|Yi)  →p  −E[s′(θ0|Yi)] = I(θ0|yi),   n → ∞.
Combining these two results and using Slutsky's theorem we get that, as n → ∞,

    √n (θ̂ − θ0) = [ n^{−1/2} s(θ0|Y) ] / [ −n^{−1} s′(θ0|Y) ]  →d  N( 0, 1/I(θ0|yi) ),

so that, approximately, θ̂ − θ0 ∼ N( 0, 1/I(θ0|y) ), where I(θ0|y) = n I(θ0|yi) is the
information of the whole sample.
• The Fisher information is I(p) = n / (p(1−p)).
The MLE for p, p̂ = Ȳ, is consistent, (asymptotically) unbiased and efficient. The asymptotic
distribution of p̂ = Ȳ is

    p̂ ∼ N( p, p(1−p)/n )   approximately.

In an extra level of approximation we may use

    p̂ ∼ N( p, p̂(1−p̂)/n )   approximately.

Let Sp = p̂(1−p̂)/n. Then

    p̂ ∼ N(p, Sp) approximately  ⇒  (p̂ − p)/√Sp ∼ N(0, 1) approximately.

We can use the above to construct the following asymptotic 1−α confidence interval:

    [ p̂ − Z_{1−α/2} √Sp ,  p̂ + Z_{1−α/2} √Sp ].
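The interval above can be coded directly; here is a short sketch (the function name and the numbers 40 successes out of 100 are illustrative, not from the notes):

```python
from statistics import NormalDist

# Asymptotic (Wald-type) 1-alpha confidence interval for a Bernoulli
# proportion p, using the plug-in standard error sqrt(phat(1-phat)/n).
def proportion_ci(successes, n, alpha=0.05):
    phat = successes / n
    z = NormalDist().inv_cdf(1 - alpha / 2)     # z_{1-alpha/2}
    se = (phat * (1 - phat) / n) ** 0.5         # sqrt(S_p)
    return phat - z * se, phat + z * se

lo, hi = proportion_ci(successes=40, n=100)
print(f"95% CI for p: ({lo:.3f}, {hi:.3f})")    # roughly (0.304, 0.496)
```

Note the interval relies on the normal approximation, so it is only trustworthy for reasonably large n and p̂ away from 0 and 1.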
Note that both σ̂² and σ̂ are biased in small samples, but their bias goes to 0 as n → ∞.
We are interested in σ = g(σ²) = (σ²)^{1/2}. The Cramér-Rao lower bound is equal to

    v(σ²) = [ ∂g(σ²)/∂σ² ]² / ( n I(σ²|yi) ) = ( 1/(4σ²) ) / ( n/(2σ⁴) ) = σ²/(2n),

and

    σ̂ ∼ N( σ, v(σ̂²) ) approximately  ⇒  (σ̂ − σ)/√v(σ̂²) ∼ N(0, 1) approximately.
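A simulation can check that the sampling standard deviation of σ̂ is indeed close to the bound √(σ²/(2n)). This sketch uses a known mean of 0 for simplicity, and all numerical settings are my own choices:

```python
import random
from statistics import pstdev

# Monte Carlo check that sd(sigma-hat) is close to sqrt(sigma^2 / (2n)),
# the square root of the Cramer-Rao bound derived above.
random.seed(1)
sigma, n, reps = 2.0, 100, 20000
sigma_hats = []
for _ in range(reps):
    ys = [random.gauss(0.0, sigma) for _ in range(n)]
    sigma2_hat = sum(y * y for y in ys) / n   # MLE of sigma^2 (mean known = 0)
    sigma_hats.append(sigma2_hat ** 0.5)      # MLE of sigma

empirical_sd = pstdev(sigma_hats)
theoretical_sd = (sigma ** 2 / (2 * n)) ** 0.5   # = sigma / sqrt(2n)
print(empirical_sd, theoretical_sd)
```

For n = 100 the two values should agree to within a few percent, consistent with σ̂ being asymptotically efficient.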
Reading
6 Hypothesis Testing
Problem:
• We want to use the information in the random sample Y to assess statements about the
population parameters θ.
6.1.1 Definitions
• The two complementary hypotheses in a hypothesis testing problem are often called the null
and the alternative. They are denoted by H0 and H1 respectively. For example,
    H0 : θ = c,   or   H0 : θ ≤ c.
• The subset of the sample space for which H0 will be rejected is termed the rejection re-
gion or critical region. The complement of the rejection region is termed the acceptance
region.
Test Statistics
• One of the desired features of T is to have an interpretation such that large (or small) values
of it provide evidence against H0 .
Example: Let Y = (Y1, Y2, ..., Yn) be a random sample of size n from a N(µ, σ²) population
(with σ² known). Is µ equal to a given value µ0, or larger?
The hypotheses of the test are
    H0 : µ = µ0   versus   H1 : µ > µ0.
One test may be based on the test statistic Ȳ with rejection region
    R = ( µ0 + 1.96 σ/√n , ∞ ).
Type II error: If θ ∈ Θ1 (H1 is true) but the test does not reject H0.

                 | Accept H0         | Reject H0
    H0 is true   | Correct Decision  | Type I error
    H1 is true   | Type II error     | Correct Decision

The Type I error is associated with the significance level and the size of the test.

    sup_{θ∈Θ0} Pθ(Reject H0) ≤ α
    β(θ) = Pθ(Reject H0),
that is the probability that the null hypothesis is rejected if the true parameter value is θ.
Note:
    β(θ) = Pθ(Reject H0) = { probability of Type I error,        if θ ∈ Θ0,
                            { 1 − probability of Type II error,  if θ ∈ Θ0^c.
Also we can define the level and the size of the test through the power function.
Ideally we would like the power function β(θ) to be 0 when θ ∈ Θ0 and 1 when θ ∈ Θ1, but this
is not possible. In practice we fix the size α to a small value (usually 0.05) and, for the given size,
try to maximize the power. Hence:
• Failure to reject the null hypothesis does not imply that it holds, and we say that we 'do not
reject H0' rather than that we 'accept H0'.
• We usually set the alternative hypothesis to contain the statement that we are interested in
proving.
Example of a power function (previous example continued): The power function of the test is

    β(µ) = P(Y ∈ R) = P( Ȳ > µ0 + Z_{1−α} σ/√n )
         = P( (Ȳ − µ0)/(σ/√n) > Z_{1−α} )
         = P( (Ȳ − µ)/(σ/√n) > Z_{1−α} − (µ − µ0)/(σ/√n) )
         = 1 − Φ( Z_{1−α} − (µ − µ0)/(σ/√n) ).
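The closed form above is easy to evaluate numerically; here is a minimal sketch (the default parameter values are illustrative choices, not from the notes):

```python
from statistics import NormalDist

# Power function of the one-sided z-test for H0: mu = mu0 vs H1: mu > mu0:
# beta(mu) = 1 - Phi( z_{1-alpha} - (mu - mu0) / (sigma / sqrt(n)) ).
def power(mu, mu0=0.0, sigma=1.0, n=25, alpha=0.05):
    z = NormalDist().inv_cdf(1 - alpha)
    shift = (mu - mu0) / (sigma / n ** 0.5)
    return 1 - NormalDist().cdf(z - shift)

print(power(0.0))   # equals alpha at mu = mu0
print(power(0.5))   # power increases as mu moves above mu0
```

Evaluating `power` over a grid of µ values traces out the whole power curve: it equals α at µ0 and climbs towards 1 as µ grows, exactly as the formula predicts.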
p-value: From the definitions so far, we either reject or do not reject the null hypothesis. The follow-
ing quantity is also informative regarding the weight of evidence against H0.
Definition: Let T(Y) be a test statistic such that large values of T give evidence against H0. For
an observed sample point y the corresponding p-value is

    p(y) = sup_{θ∈Θ0} Pθ( T(Y) ≥ T(y) ).

Notes:
2. In words, a p-value is the probability, under H0, that we get the result of the sample or a more
extreme result, where 'extreme' is in the sense of evidence against H0.
3. If we have a fixed significance level α, then we can describe the rejection region as
    R = { y : p(y) ≤ α }.
We reject H0 if the probability of observing a result at least as extreme as that of the sample
is small (at most α).
The procedure for constructing a test can be given by the following general directions:
Step 1: Find an appropriate test statistic T. Figure out whether large or small values of T provide
evidence against H0. Also find its distribution under H0.
Step 2: Impose the size requirement
    P_{θ0}(T ∈ R) ≤ α,   or   P_{θ0}(T ∈ R) = α.
Step 3: Solve the equation to get R. The test rule is then 'Reject H0' if the sample Y is in {Y :
T(Y) ∈ R}.
Example of a statistical test (cont'd): Let's come back to the previous example with H0 : µ = µ0
vs H1 : µ > µ0.
Step 1: Test statistic Ȳ. Large values are against H0. Under H0, Ȳ ∼ N(µ0, σ²/n), or we could use

    (Ȳ − µ0)/(σ/√n) ∼ N(0, 1).

Step 2: We know that under H0

    Pµ0( (Ȳ − µ0)/(σ/√n) > Z_{1−α} ) = α.

Step 3: From

    P( (Ȳ − µ0)/(σ/√n) > Z_{1−α} ) = α,

we can get to

    P( Ȳ > µ0 + Z_{1−α} σ/√n ) = α,

hence

    R = { Y : Ȳ > µ0 + Z_{1−α} σ/√n }.
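The three steps translate directly into code. This is a sketch of the one-sided z-test with known σ; the sample values and function name are made up for illustration:

```python
from statistics import NormalDist, mean

# One-sided z-test for H0: mu = mu0 vs H1: mu > mu0, sigma known.
# Step 1: statistic; Step 2: size requirement; Step 3: rejection rule.
def z_test(ys, mu0, sigma, alpha=0.05):
    n = len(ys)
    z_stat = (mean(ys) - mu0) / (sigma / n ** 0.5)   # Step 1
    z_crit = NormalDist().inv_cdf(1 - alpha)         # Step 2: z_{1-alpha}
    p_value = 1 - NormalDist().cdf(z_stat)           # large z is against H0
    return z_stat, p_value, z_stat > z_crit          # Step 3

ys = [10.2, 11.1, 9.8, 10.9, 11.4, 10.6, 10.1, 11.0]
z_stat, p_value, reject = z_test(ys, mu0=10.0, sigma=1.0)
print(z_stat, p_value, reject)
```

The returned p-value illustrates the earlier definition: it is the probability, under H0, of a Ȳ at least as large as the one observed.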
The tests we are interested in control, by construction, the probability of a type I error (it is at most
α). A good test should also have a small probability of a type II error; in other words, it should
be a powerful test.
Definition: Let C be a class of tests for testing H0 : θ ∈ Θ0 versus H1 : θ ∈ Θ0^c. A test in class C,
with power function β(θ), is a Uniformly Most Powerful (UMP) class C test if

    β(θ) ≥ β′(θ)

for every θ ∈ Θ0^c and every β′(θ) that is a power function of a test in class C.
Notes:
2. The above ratio of pdf’s (or pmf’s) is the ratio of the likelihood functions.
Proof of Neyman-Pearson Lemma: Preliminaries: We give the proof for continuous random
variables. For discrete random variables just replace integrals with sums.
Let φS(Y) denote the rule of a test S with rejection region RS. Note that φS(Y) = I(Y ∈ RS),
where I(·) is the indicator function. Hence for all θ

    ∫_{R^n} φS(y) fY(y|θ) dy = ∫_{RS} fY(y|θ) dy,   (17)

    E[φS(Y)] = ∫_{R^n} φS(y) fY(y|θ) dy = ∫_{RS} fY(y|θ) dy
             = Pθ(Reject H0) = βS(θ).   (18)
Main Proof: Let T be the Neyman-Pearson lemma test and S be another test of size α. Then

    βS(θ1) − k βS(θ0) = ∫_{R^n} φS(y) [fY(y|θ1) − k fY(y|θ0)] dy      (by (18))
                      ≤ ∫_{RT} φS(y) [fY(y|θ1) − k fY(y|θ0)] dy       (by (20))
                      ≤ ∫_{RT} φT(y) [fY(y|θ1) − k fY(y|θ0)] dy       (by (19))
                      = ∫_{R^n} φT(y) [fY(y|θ1) − k fY(y|θ0)] dy      (by (17))
                      = βT(θ1) − k βT(θ0).                            (by (18))

Since both tests have size α,

    βT(θ0) = βS(θ0) = α.

Therefore we can write

    βS(θ1) ≤ βT(θ1),
    Ȳ ∼ N( µ, σ²/n ).

Step 2: We want

    Pµ0( exp{ (n/(2σ²)) [ (µ0² − µ1²) − 2Ȳ(µ0 − µ1) ] } > k ) = α.

But we also know that Ȳ ∼ N(µ0, σ²/n) under H0; then

    Pµ0( Ȳ > µ0 + Z_{1−α} σ/√n ) = α

is an equivalent test, being based on the same statistic. This will give us a most powerful test for
this testing problem.
    LR = [ e^{−nλ1} λ1^{Σi Yi} / Πi Yi! ] / [ e^{−nλ0} λ0^{Σi Yi} / Πi Yi! ]
       = e^{−n(λ1−λ0)} (λ1/λ0)^{Σi Yi}.

A test with rejection region from LR > k is such that

    P( Σ_{i=1}^n Yi > [ log k + n(λ1 − λ0) ] / [ log λ1 − log λ0 ] = k1 ) = α,

but we also know that Σi Yi ∼ Poisson(nλ0) under H0, so we can find k.
Let n = 8, λ0 = 2 (so Σi Yi ∼ Poisson(16)) and λ1 = 6. The size is 0.058 with k1 = 22 and 0.037
with k1 = 23. A test with significance level 0.05 corresponds to k1 = 23. What if λ1 = 150?
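The sizes quoted above can be computed from the exact Poisson tail. A minimal sketch with a hand-rolled survival function (the helper name is mine), summing the pmf recursively to avoid any external dependency:

```python
from math import exp

# Exact size of the rejection region {sum Yi > k1} when, under H0,
# sum Yi ~ Poisson(n * lambda0) = Poisson(16)  (n = 8, lambda0 = 2).
def poisson_sf(k, lam):
    """P(X > k) for X ~ Poisson(lam), via the recursion pmf(i+1) = pmf(i)*lam/(i+1)."""
    pmf, cdf = exp(-lam), 0.0
    for i in range(k + 1):
        cdf += pmf
        pmf *= lam / (i + 1)
    return 1 - cdf

size_22 = poisson_sf(22, 16)   # size with k1 = 22, just above 0.05
size_23 = poisson_sf(23, 16)   # size with k1 = 23, below 0.05
print(size_22, size_23)
```

Because the Poisson distribution is discrete, no k1 hits the level 0.05 exactly, which is why the notes pick the conservative choice k1 = 23.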
    H0 : θ = θ0 versus H1 : θ = θ1.
Assume a rejection region that does not depend on θ1. If θ0 < θ1, the test is most powerful for all
θ1 > θ0. Hence it is most powerful for
    H0 : θ = θ0 versus H1 : θ > θ0.
Similarly, if θ1 < θ0, it is most powerful for
    H0 : θ = θ0 versus H1 : θ < θ0.
What about
    H0 : θ ≤ θ0 versus H1 : θ > θ0 ?
The power is not affected, but is the size of the test still α? We would have to show that
    sup_{θ≤θ0} Pθ(Reject H0) = α.
Note that the rejection region is independent of µ1 , thus the test is applicable for all µ1 > µ0 .
Hence it is also the UMP test for
H0 : µ = µ 0 , versus H1 : µ > µ0
What about testing problems of the following form?
    H0 : µ ≤ µ0 versus H1 : µ > µ0,
where β(µ) is the power function (derived in the notes of previous lectures). We can write

    sup_{µ≤µ0} β(µ) = sup_{µ≤µ0} [ 1 − Φ( Z_{1−α} − (µ − µ0)/(σ/√n) ) ].

The function inside the supremum is increasing in µ and equal to α if µ = µ0. Therefore the above
supremum is equal to α.
Note: The UMP test usually does not exist for 2-sided (composite) alternative hypotheses.
Corollary: Consider the previous testing problem, let T(Y) be a sufficient statistic for θ, and
g(t|θ0), g(t|θ1) be its corresponding pdfs (or pmfs). Then any test with rejection region S (a
subset of the sample space of T) is a UMP level α test if it satisfies

    t ∈ S,   if   g(t|θ1)/g(t|θ0) > k.
    H0 : θ ∈ Θ0 versus H1 : θ ∈ Θ0^c.
Definition of Likelihood Ratio test: Let Y = y be an observed sample and define the likelihood
by L(θ|y). The likelihood ratio test statistic is

    λ(y) = sup_{θ∈Θ} L(θ|y) / sup_{θ∈Θ0} L(θ|y).

The constant c may be determined by the size, i.e.

    sup_{θ∈Θ0} Pθ( λ(Y) > c ) = α.
Notes:
• The numerator is evaluated at the value of θ corresponding to the MLE, that is the maximum
of the likelihood over the entire parameter range.
• Hence the numerator is greater than or equal to the denominator, and the likelihood
ratio test statistic is always at least 1.
Example (Likelihood Ratio test): Let Y = (Y1, ..., Yn) be a random sample from a N(µ, σ²)
(with σ² known). Consider the test H0 : µ = µ0 versus H1 : µ ≠ µ0. The MLE is µ̂ = Ȳ, hence
the likelihood ratio test statistic is

    λ(Y) = L(µ̂|Y) / L(µ0|Y)
         = [ (2πσ²)^{−n/2} exp{ −(1/(2σ²)) [n(Ȳ − µ̂)² + (n−1)S²] } ]
           / [ (2πσ²)^{−n/2} exp{ −(1/(2σ²)) [n(Ȳ − µ0)² + (n−1)S²] } ]
         = exp( n(Ȳ − µ0)² / (2σ²) ).

The test λ(Y) > k is equivalent to the test |Ȳ − µ0| / (σ/√n) ≥ k1.
Note: Equivalently one can use the fact that 2 log λ(Y) = (Ȳ − µ0)² / (σ²/n) ∼ χ²1.
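The χ²1 form of the test is straightforward to code. A sketch (sample values and names are my own; note that χ²1 is the square of a standard normal, so its quantile is the square of the normal quantile):

```python
from statistics import NormalDist, mean

# Two-sided LRT for H0: mu = mu0 with sigma known: reject when
# 2 log lambda(Y) = n (ybar - mu0)^2 / sigma^2 exceeds the chi^2_1 quantile.
def lrt_statistic(ys, mu0, sigma):
    n = len(ys)
    return n * (mean(ys) - mu0) ** 2 / sigma ** 2

alpha = 0.05
chi2_crit = NormalDist().inv_cdf(1 - alpha / 2) ** 2   # chi^2_{1,0.95} ~ 3.84

ys = [10.2, 11.1, 9.8, 10.9, 11.4, 10.6, 10.1, 11.0]
stat = lrt_statistic(ys, mu0=10.0, sigma=1.0)
print(stat, stat > chi2_crit)
```

The statistic is just the square of the one-sided z-statistic, which makes the equivalence between λ(Y) > k and a condition on |Ȳ − µ0| concrete.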
46
Theorem (Likelihood ratio test and sufficiency): Let Y be a sample parametrised by θ and
T(Y) be a sufficient statistic for θ. Also let λ(·), λ*(·) be the likelihood ratio tests for Y and T
respectively. Then for every y in the sample space

    λ(y) = λ*(T(y)).
    H0 : ψ = ψ0, ν ∈ N
    H1 : ψ ≠ ψ0, ν ∈ N.
The likelihood ratio test is

    λ(y) = sup_{ψ,ν} L(ψ, ν|y) / sup_ν L(ν|ψ0, y).
Generally in statistics, models with many parameters fit better but do not always give better
predictions. Parsimonious models achieve a good fit without too many parameters, and they usually
perform better in terms of prediction. The likelihood ratio test provides a useful tool for finding
parsimonious models.
Example (Likelihood Ratio test): Suppose that X1, ..., Xn and Y1, ..., Yn are two independent
random samples from two exponential distributions with means λ1 and λ2 respectively. We want
to test
    H0 : λ1 = λ2 = λ,   versus   H1 : λ1 ≠ λ2.
The likelihood function is

    L(λ1, λ2|x, y) = λ1^{−n} exp( −Σ_{i=1}^n xi/λ1 ) · λ2^{−n} exp( −Σ_{i=1}^n yi/λ2 ).

Hence, the likelihood ratio test statistic is

    LR = L(λ̂1^{MLE}, λ̂2^{MLE}|x, y) / L(λ̂^{MLE}, λ̂^{MLE}|x, y)
       = (X̄ + Ȳ)^{2n} / ( 2^{2n} X̄^n Ȳ^n )
       = 2^{−2n} { √(X̄/Ȳ) + √(Ȳ/X̄) }^{2n}.
We do not know the distribution of the LR test statistic. We may attempt to isolate T = X̄/Ȳ, but
since LR is not monotone in T we cannot construct a test this way.
Note: We will see in the next sections how to deal with such cases by constructing asymptotic tests.
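As a sanity check on the algebra, the two closed forms of LR can be compared numerically (the sample values below are made up for illustration):

```python
from statistics import mean

# Check that the two algebraic forms of the exponential LR statistic agree:
#   (xbar + ybar)^{2n} / (2^{2n} xbar^n ybar^n)
#     == 2^{-2n} ( sqrt(xbar/ybar) + sqrt(ybar/xbar) )^{2n}.
x = [1.2, 0.7, 2.5, 1.9, 0.4]
y = [3.1, 2.2, 4.0, 1.8, 2.9]
n = len(x)
xbar, ybar = mean(x), mean(y)

form1 = (xbar + ybar) ** (2 * n) / (2 ** (2 * n) * xbar ** n * ybar ** n)
form2 = 2 ** (-2 * n) * ((xbar / ybar) ** 0.5 + (ybar / xbar) ** 0.5) ** (2 * n)
print(form1, form2)
```

By the AM-GM inequality (X̄ + Ȳ)/2 ≥ √(X̄Ȳ), so both forms are always at least 1, as a likelihood ratio statistic of this kind must be.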
The Wald test: Suitable for testing a simple null hypothesis H0 : θ = θ0 versus H1 : θ ≠ θ0. The
test statistic is

    Z = (θ̂ − θ0) / se(θ̂).

The estimator θ̂ is the MLE, and a reasonable estimate of its standard error, se(θ̂) = √V(θ̂), is
given by Fisher's information.
The Score test: Similar to the Wald test but it takes the form

    Z = S(θ0) / √I(θ0),

where S(·) is the score function and I(·) is the Fisher information.
Multivariate versions of the above tests exist. These tests are similar to the likelihood ratio test but
not identical. As with the likelihood ratio test, their exact distribution is generally unknown. For
'large' sample sizes the likelihood ratio, score and Wald tests are asymptotically equivalent.
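Both tests are easy to state concretely for a Bernoulli proportion, where S(p) = (Σy − np)/(p(1−p)) and I(p) = n/(p(1−p)). A sketch (the data and function name are illustrative):

```python
from statistics import NormalDist

# Wald and score statistics for H0: p = p0 with Y1..Yn ~ Bernoulli(p).
# Wald uses the se evaluated at the MLE phat; the score test evaluates
# S(p0)/sqrt(I(p0)) = (sum y - n p0) / sqrt(n p0 (1 - p0)) at p0 itself.
def wald_and_score(successes, n, p0):
    phat = successes / n
    wald = (phat - p0) / (phat * (1 - phat) / n) ** 0.5
    score = (successes - n * p0) / (n * p0 * (1 - p0)) ** 0.5
    return wald, score

wald, score = wald_and_score(successes=62, n=100, p0=0.5)
z_crit = NormalDist().inv_cdf(0.975)
print(wald, score, abs(wald) > z_crit, abs(score) > z_crit)
```

The two statistics differ only in where the information is evaluated (at θ̂ versus at θ0), which is why they agree asymptotically under H0 but can disagree in small samples.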
• In hypothesis testing we want to use the information from the sample Y to choose between
two hypotheses about the parameter θ: the null hypothesis H0 and the alternative H1.
• The sample values for which H0 is rejected (accepted) form the rejection (acceptance)
region.
• There are two possible types of error. A Type I error occurs if we falsely reject H0, whereas
a Type II error occurs if we do not reject H0 when we should.
• The level and size α of a test provide an upper bound for the probability of a Type I error.
• The rejection region R, and hence the test itself, is specified by requiring that the probability
that the sample Y belongs to R under H0 is bounded by α. If H0 : θ ∈ Θ0 we use
    sup_{θ∈Θ0} Pθ(Y ∈ R) ≤ α.
• The Type II error determines the power of a test. In practice we fix α and try to minimize
the Type II error, that is maximize the power.
• To find a most powerful test we can use the Neyman-Pearson Lemma. It refers to tests
where both H0 and H1 are simple hypotheses but can be extended in some cases to com-
posite hypotheses. A version based on sufficient statistics, rather than the whole sample Y ,
is available.
• Problem: A uniformly most powerful test may not be available or may not exist. Sometimes
it may be even hard to find any ‘reasonable’ test.
Suppose that θ can be split in two groups: θ = (ψ, ν), where ψ are the main parameters of interest,
of dimension k. Consider the test
    H0 : ψ = ψ0, ν ∈ N
    H1 : ψ ≠ ψ0, ν ∈ N.
Equivalently, suppose that we want to compare the constrained model of H0 with the uncon-
strained model of H1.
The MLE of λ is λ̂ = Ȳ. The likelihood ratio test statistic is

    LR(Y) = sup_{λ>0} L(λ|Y) / L(λ0|Y)
          = λ̂^{−n} exp(−nȲ/λ̂) / [ λ0^{−n} exp(−nȲ/λ0) ]
          = Ȳ^{−n} exp(−n) / [ λ0^{−n} exp(−nȲ/λ0) ].

Hence, the asymptotic likelihood ratio test of size α rejects if 2 log LR(Y) > χ²_{1,1−α}, where
χ²_{1,1−α} is the (1−α)th percentile of the χ²1 distribution.
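Taking logs of LR(Y) above gives 2 log LR(Y) = 2n[ Ȳ/λ0 − 1 − log(Ȳ/λ0) ], which yields a compact asymptotic test. A sketch (data values and function name are my own illustrations):

```python
from math import log
from statistics import NormalDist, mean

# Asymptotic LRT for H0: lambda = lambda0 in an Exponential(mean = lambda)
# model: 2 log LR(Y) = 2n [ ybar/lambda0 - 1 - log(ybar/lambda0) ],
# compared with the chi^2_1 quantile (the square of z_{1-alpha/2}).
def exp_lrt(ys, lam0, alpha=0.05):
    n, ybar = len(ys), mean(ys)
    stat = 2 * n * (ybar / lam0 - 1 - log(ybar / lam0))
    crit = NormalDist().inv_cdf(1 - alpha / 2) ** 2   # ~3.84 for alpha=0.05
    return stat, stat > crit

ys = [0.8, 2.9, 1.4, 3.7, 0.5, 2.2, 1.9, 4.1, 0.9, 2.6]
stat, reject = exp_lrt(ys, lam0=1.0)
print(stat, reject)
```

The statistic is 0 when Ȳ = λ0 and grows as Ȳ moves away from λ0 in either direction, matching the two-sided nature of the test.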
Reading
G. Casella & R. L. Berger, Sections 8.1, 8.2.1, 8.3.1, 8.3.2, 8.3.4, 10.3, 10.4.1