
ADVANCED STATISTICS FOR ECONOMICS

STATISTICAL INFERENCE

Matteo Barigozzi, Giuseppe Cavaliere

This version November 11, 2020

1 Properties of a Random Sample

A statistical analysis may be described as follows:

1. Consider a real-world phenomenon/problem/population with uncertainty and open questions.

2. Probability model: identify the random variable(s) associated with the problem and assign a suitable probability model. The model is described by some parameter(s) $\theta$.

3. Draw a sample from the population.

4. Use the information contained in the sample to draw inference, that is, to gain knowledge about the population parameters $\theta$ and provide answers to the questions.

In this course we are concerned with parts 3 and 4.

1.1 Random Sample

A sample is a collection of random variables $X = (X_1, X_2, \dots, X_n)$. An observed sample is a collection of observations $(x_1, x_2, \dots, x_n)$ on $(X_1, X_2, \dots, X_n)$.

Let $X_1, X_2, \dots, X_n$ be independent and identically distributed (i.i.d.) random variables, each with pdf or pmf $f(x_i|\theta)$. Then $X_1, X_2, \dots, X_n$ are called a random sample of size $n$ from the population $f(x_i|\theta)$.

• A random sample of size $n$ implies a particular probability model described by the population $f(x_i|\theta)$, that is, by the marginal pdf or pmf of each $X_i$. Notice that it depends on some parameter $\theta$; if we knew $\theta$, the model would be completely specified. However, $\theta$ is in general unknown and it is the object we are interested in estimating. For this reason we highlight the dependence on $\theta$ when writing the pdf or pmf.

• The random sampling model describes an experiment where the variable of interest has a probability distribution described by $f(x_i|\theta)$.

• Each $X_i$ is an observation of the same variable.

• Each $X_i$ has marginal distribution $f(x_i|\theta)$.

• The joint pdf or pmf is given by
$$f_X(x|\theta) \equiv f_{X_1,X_2,\dots,X_n}(x_1, x_2, \dots, x_n|\theta) = f(x_1|\theta) f(x_2|\theta) \cdots f(x_n|\theta) = \prod_{i=1}^n f(x_i|\theta).$$

Example

• Poisson:
$$P(X_1 = x_1, X_2 = x_2, \dots, X_n = x_n \mid \lambda) = \prod_{i=1}^n \frac{e^{-\lambda}\lambda^{x_i}}{x_i!} = \frac{e^{-n\lambda}\lambda^{\sum_{i=1}^n x_i}}{\prod_{i=1}^n x_i!}$$

• Exponential:
$$f_{X_1,X_2,\dots,X_n}(x_1, x_2, \dots, x_n \mid \beta) = \prod_{i=1}^n \frac{1}{\beta} e^{-x_i/\beta} = \frac{1}{\beta^n} e^{-\sum_{i=1}^n x_i/\beta}$$

1.2 Statistics

Let $T(x_1, x_2, \dots, x_n)$ be a real- or vector-valued function whose domain includes the sample space of $X_1, X_2, \dots, X_n$. Then the random variable $Y = T(X_1, X_2, \dots, X_n)$ is called a statistic.

• Inferential questions are hard to answer just by looking at the raw data.

• Statistics provide summaries of the information in the random sample.

• They can be arbitrary functions of the sample, but they cannot depend on the parameters.

The following three statistics are often used:

• The sample mean
$$\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n} = \frac{1}{n}\sum_{i=1}^n X_i.$$

• The sample variance
$$S^2 = \frac{1}{n-1}\sum_{i=1}^n \left(X_i - \bar{X}\right)^2.$$

• The sample standard deviation $S = \sqrt{S^2}$.

Notice that these are random variables; we denote their observed values by $\bar{x}$, $s^2$, $s$.
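As a quick illustration (a sketch, not part of the original notes, assuming numpy is available; the population values $\mu = 2$, $\sigma = 3$ and $n = 50$ are arbitrary), these three statistics can be computed directly from a simulated sample:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=50)   # observed sample of size n = 50

n = x.size
x_bar = x.sum() / n                           # sample mean
s2 = ((x - x_bar) ** 2).sum() / (n - 1)       # sample variance (divisor n - 1)
s = np.sqrt(s2)                               # sample standard deviation

print(x_bar, s2, s)
print(np.isclose(s2, x.var(ddof=1)))          # matches numpy's ddof=1 variance
```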

Lemma Let $X_1, X_2, \dots, X_n$ be a random sample from a population and let $g(x)$ be a function such that $E[g(X_1)]$ and $\mathrm{Var}(g(X_1))$ exist. Then
$$E\left(\sum_{i=1}^n g(X_i)\right) = n E[g(X_1)]$$
and
$$\mathrm{Var}\left(\sum_{i=1}^n g(X_i)\right) = n \mathrm{Var}[g(X_1)].$$

Theorem Let $X_1, X_2, \dots, X_n$ be a random sample from a population with mean $\mu$ and variance $\sigma^2 < \infty$. Then

1. $E(\bar{X}) = \mu$,

2. $\mathrm{Var}(\bar{X}) = \sigma^2/n$,

3. $E(S^2) = \sigma^2$.

As we will see later in detail, we say that $\bar{X}$ and $S^2$ are unbiased estimators of $\mu$ and $\sigma^2$, respectively.

1.3 Sampling Distribution

The probability distribution of a statistic $T = T(X)$ is called the sampling distribution of $T$.

Theorem Let $X_1, X_2, \dots, X_n$ be a random sample from a population with mgf $M_{X_i}(t)$. Then the mgf of the sample mean is
$$M_{\bar{X}}(t) = \left[M_{X_i}(t/n)\right]^n.$$
When applicable, the theorem above provides a very convenient way of deriving the sampling distribution.

Example

• $X_1, \dots, X_n$ i.i.d. from $N(\mu, \sigma^2)$; then
$$M_{\bar{X}}(t) = \left[\exp\left(\mu \frac{t}{n} + \frac{\sigma^2 (t/n)^2}{2}\right)\right]^n = \exp\left(\mu t + \frac{(\sigma^2/n)\, t^2}{2}\right),$$
that is, $\bar{X} \sim N(\mu, \sigma^2/n)$.

• $X_1, \dots, X_n$ i.i.d. from $\mathrm{Gamma}(\alpha, \beta)$; then $\bar{X} \sim \mathrm{Gamma}(n\alpha, \beta/n)$.

If we cannot use the above theorem, we can derive the distribution of a transformation of random variables by working directly with pdfs.
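A small simulation sketch (not from the notes; $\alpha = 2$, $\beta = 3$, $n = 5$ are arbitrary choices, with $\beta$ a scale parameter) checking the second claim: sample means of $\mathrm{Gamma}(\alpha, \beta)$ draws should follow a $\mathrm{Gamma}(n\alpha, \beta/n)$ distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha, beta, n = 2.0, 3.0, 5          # shape alpha, scale beta
xbar = rng.gamma(alpha, beta, size=(100_000, n)).mean(axis=1)

# Compare the simulated sample means with the Gamma(n*alpha, beta/n) cdf
ks = stats.kstest(xbar, stats.gamma(n * alpha, scale=beta / n).cdf)
print(ks.pvalue)                      # large p-value: no evidence against the claim
```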

1.4 Transformation of Random Variables

1.4.1 Transformation of Scalar Random Variables

Theorem: Let $X$, $Y$ be random variables with pdfs $f_X(x)$, $f_Y(y)$ defined for $x \in \mathcal{X}$ and $y \in \mathcal{Y}$, respectively. Suppose that $g(\cdot)$ is a monotone function such that $g : \mathcal{X} \to \mathcal{Y}$ and $g^{-1}(\cdot)$ has a continuous derivative on $\mathcal{Y}$. The pdf of $Y$ is then
$$f_Y(y) = \begin{cases} f_X(g^{-1}(y)) \left| \dfrac{d}{dy} g^{-1}(y) \right|, & y \in \mathcal{Y}, \\ 0, & \text{otherwise.} \end{cases}$$

Proof: Let $X$ be a random variable with density $f_X(x)$ and $Y = g(X)$, i.e. $X = g^{-1}(Y)$.

If $g^{-1}(\cdot)$ is increasing,
$$F_Y(y) = P(Y \le y) = P(g^{-1}(Y) \le g^{-1}(y)) = P(X \le g^{-1}(y)) = F_X(g^{-1}(y)),$$
$$f_Y(y) = \frac{d}{dy} F_Y(y) = \frac{d}{dy} F_X(g^{-1}(y)) = f_X(g^{-1}(y)) \frac{d g^{-1}(y)}{dy}.$$

If $g^{-1}(\cdot)$ is decreasing,
$$F_Y(y) = P(Y \le y) = P(g^{-1}(Y) \ge g^{-1}(y)) = P(X \ge g^{-1}(y)) = 1 - F_X(g^{-1}(y)),$$
$$f_Y(y) = \frac{d}{dy}\left[1 - F_X(g^{-1}(y))\right] = -f_X(g^{-1}(y)) \frac{d g^{-1}(y)}{dy}$$
(the derivative of a decreasing function is negative).

Putting both cases together, if $g^{-1}(\cdot)$ is monotone,
$$f_Y(y) = f_X(g^{-1}(y)) \left|\frac{d g^{-1}(y)}{dy}\right|.$$

Example: Inverse Gamma Distribution. Let $X \sim \mathrm{Gamma}(\alpha, \beta)$,
$$f_X(x|\alpha, \beta) = \frac{x^{\alpha-1} \exp(-x/\beta)}{\Gamma(\alpha)\beta^\alpha}, \quad 0 < x < \infty.$$
We want the distribution of $Y = 1/X$; therefore $g(x) = 1/x$ and $g^{-1}(y) = 1/y$. Then $\frac{d}{dy} g^{-1}(y) = -1/y^2$. We can therefore write
$$f_Y(y) = f_X(g^{-1}(y)) \left|\frac{d g^{-1}(y)}{dy}\right| = \frac{(1/y)^{\alpha-1} \exp\left(-\frac{1}{\beta y}\right)}{\Gamma(\alpha)\beta^\alpha}\, \frac{1}{y^2} = \frac{(1/y)^{\alpha+1} \exp\left(-\frac{1}{\beta y}\right)}{\Gamma(\alpha)\beta^\alpha}, \quad 0 < y < \infty.$$
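A simulation sketch of this result (illustrative only; $\alpha = 3$, $\beta = 2$ are arbitrary): if $X \sim \mathrm{Gamma}(\alpha, \beta)$ with scale $\beta$, then $Y = 1/X$ should match scipy's inverse-gamma distribution with shape $\alpha$ and scale $1/\beta$, whose density is exactly the one derived above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
alpha, beta = 3.0, 2.0
x = rng.gamma(alpha, beta, size=200_000)     # X ~ Gamma(alpha, scale=beta)
y = 1.0 / x                                  # Y = 1/X

# f_Y above is an inverse-gamma density with shape alpha and scale 1/beta
ks = stats.kstest(y, stats.invgamma(alpha, scale=1.0 / beta).cdf)
print(ks.pvalue)                             # large p-value: consistent with the derivation
```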

Square Transformations: What if $g(\cdot)$ is not monotone? For example, consider $Y = X^2$; then $g^{-1}(y) = \pm\sqrt{y}$ and clearly $\sqrt{y}$ is not defined if $y < 0$. For $y \ge 0$,
$$F_Y(y) = P(Y \le y) = P(X^2 \le y) = P(-\sqrt{y} \le X \le \sqrt{y}) = F_X(\sqrt{y}) - F_X(-\sqrt{y}),$$
$$f_Y(y) = \frac{d}{dy} F_Y(y) = \frac{d}{dy}\left[F_X(\sqrt{y}) - F_X(-\sqrt{y})\right] = \frac{1}{2\sqrt{y}} f_X(\sqrt{y}) + \frac{1}{2\sqrt{y}} f_X(-\sqrt{y}), \quad y \ge 0.$$

Example: $\chi^2$ distribution from the standard Normal.

Let $X \sim N(0,1)$ and consider $Y = X^2$. Using the previous result we get
$$f_Y(y) = \frac{1}{2\sqrt{y}} \frac{1}{\sqrt{2\pi}} \exp\left(-(\sqrt{y})^2/2\right) + \frac{1}{2\sqrt{y}} \frac{1}{\sqrt{2\pi}} \exp\left(-(-\sqrt{y})^2/2\right) = \frac{1}{\sqrt{2\pi y}} \exp\left(-\frac{y}{2}\right).$$
Note that
$$f_Y(y) = \frac{1}{\Gamma\left(\tfrac12\right) 2^{1/2}}\, y^{\frac12 - 1} \exp\left(-\frac{y}{2}\right),$$
which is the pdf of a $\mathrm{Gamma}\left(\tfrac12, 2\right)$ distribution, or else a $\chi^2$ distribution with one degree of freedom.

1.4.2 Transformations of Multivariate Random Variables

Let $X = (X_1, \dots, X_d)'$ be a $d$-dimensional random variable and $Y = (Y_1, \dots, Y_d)' = g(X)$, so that $X = g^{-1}(Y)$, or else
$$X_1 = g_1^{-1}(Y_1, \dots, Y_d), \quad X_2 = g_2^{-1}(Y_1, \dots, Y_d), \quad \dots, \quad X_d = g_d^{-1}(Y_1, \dots, Y_d).$$
The transformation $g : \mathcal{X} \subset \mathbb{R}^d \to \mathcal{Y} \subset \mathbb{R}^d$ from $X$ to $Y$ has to be one-to-one.

Consider the matrix of partial derivatives
$$\frac{\partial g^{-1}(Y_1, \dots, Y_d)}{\partial (Y_1, \dots, Y_d)} = \begin{pmatrix} \frac{\partial g_1^{-1}(Y)}{\partial Y_1} & \frac{\partial g_1^{-1}(Y)}{\partial Y_2} & \dots & \frac{\partial g_1^{-1}(Y)}{\partial Y_d} \\ \frac{\partial g_2^{-1}(Y)}{\partial Y_1} & \frac{\partial g_2^{-1}(Y)}{\partial Y_2} & \dots & \frac{\partial g_2^{-1}(Y)}{\partial Y_d} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial g_d^{-1}(Y)}{\partial Y_1} & \frac{\partial g_d^{-1}(Y)}{\partial Y_2} & \dots & \frac{\partial g_d^{-1}(Y)}{\partial Y_d} \end{pmatrix}$$

The Jacobian $J$ of the transformation $g(\cdot)$ is the determinant of the matrix of derivatives above. It provides a scaling factor for the change of volume under the transformation.

Formula for multivariate transformations:

Using standard change-of-variables results from multivariate calculus, we get
$$f_{Y_1,\dots,Y_d}(y_1, \dots, y_d) = f_{X_1,\dots,X_d}\left(g_1^{-1}(y), \dots, g_d^{-1}(y)\right) |J|.$$

Theorem: If $X$, $Y$ are independent random variables with pdfs $f_X(x)$ and $f_Y(y)$, the pdf of $Z = X + Y$ is
$$f_Z(z) = \int_{-\infty}^{\infty} f_X(w) f_Y(z - w)\, dw.$$
This formula for $f_Z(z)$ is called the convolution of $f_X(x)$ and $f_Y(y)$.

Proof (of the convolution expression): We introduce an extra random variable $W = X$, so that
$$Z = X + Y, \quad W = X \qquad \text{or} \qquad X = W, \quad Y = Z - W.$$
The Jacobian is equal to 1. Since $X$, $Y$ are independent, their joint pdf is $f_{XY}(x,y) = f_X(x) f_Y(y)$. We can now write
$$f_{ZW}(z, w) = f_{XY}(w, z - w) \times 1 = f_X(w) f_Y(z - w).$$
Finally,
$$f_Z(z) = \int_{-\infty}^{+\infty} f_{ZW}(z, w)\, dw = \int_{-\infty}^{+\infty} f_X(w) f_Y(z - w)\, dw.$$
Example: If $X$ and $Y$ are independent and identically distributed exponential random variables, find the joint density function of $U = X/Y$ and $V = X + Y$.

For $U = X/Y$, $V = X + Y$, the inverse transformation is $X = UV/(1+U)$, $Y = V/(1+U)$. We have
$$\frac{\partial X}{\partial U} = \frac{V}{(1+U)^2}, \quad \frac{\partial X}{\partial V} = \frac{U}{1+U}, \quad \frac{\partial Y}{\partial U} = -\frac{V}{(1+U)^2}, \quad \frac{\partial Y}{\partial V} = \frac{1}{1+U}.$$
The Jacobian is
$$J = \begin{vmatrix} V/(1+U)^2 & U/(1+U) \\ -V/(1+U)^2 & 1/(1+U) \end{vmatrix} = \frac{V(1+U)}{(1+U)^3} = \frac{V}{(1+U)^2},$$
which is non-negative for $U, V \ge 0$.

For $U, V > 0$ the joint density is therefore
$$f_{U,V}(u,v) = f_{X,Y}(x,y)\, \frac{v}{(1+u)^2} = \lambda^2 e^{-\lambda(x+y)}\, \frac{v}{(1+u)^2} = \lambda^2 v e^{-\lambda v}\, \frac{1}{(1+u)^2}.$$
The joint density factorises into a marginal density for $V$, which is a Gamma with shape parameter 2 and rate $\lambda$, and a Pareto-type density $1/(1+u)^2$ for $U$. So $U$ and $V$ are independent.
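A minimal check of this factorisation by simulation (a sketch; the rate $\lambda = 2$ is arbitrary): $V = X + Y$ should be Gamma with shape 2 and rate $\lambda$, and $U = X/Y$ should show no monotone association with $V$. A rank correlation is used because $U$ has no finite mean.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
lam = 2.0
x = rng.exponential(scale=1.0 / lam, size=200_000)
y = rng.exponential(scale=1.0 / lam, size=200_000)
u, v = x / y, x + y

print(stats.kstest(v, stats.gamma(2, scale=1.0 / lam).cdf).pvalue)  # V ~ Gamma(shape 2, rate lam)
rho, pval = stats.spearmanr(u, v)
print(rho)                                    # near 0: consistent with independence of U and V
```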

1.5 Sampling from the Normal distribution

Theorem Let $X_1, X_2, \dots, X_n$ be a random sample from a $N(\mu, \sigma^2)$ distribution.

1. $\bar{X}$ and $S^2$ are independent random variables.

2. $\bar{X}$ has a $N(\mu, \sigma^2/n)$ distribution.

3. $(n-1)S^2/\sigma^2$ has a chi-squared distribution with $n-1$ degrees of freedom ($\chi^2_{n-1}$).

1.5.1 Chi-squared distribution

1. A $\chi^2_p$ distribution is a $\mathrm{Gamma}(p/2, 2)$ distribution. Its pdf is
$$f(y) = \frac{1}{\Gamma(p/2)\, 2^{p/2}}\, y^{(p/2)-1} e^{-y/2}, \quad y > 0.$$

2. If $Z$ is a $N(0,1)$ random variable, then $Z^2 \sim \chi^2_1$.

3. If $X_1, X_2, \dots, X_n$ are independent and $X_i \sim \chi^2_{p_i}$, then
$$X_1 + X_2 + \cdots + X_n \sim \chi^2_{p_1 + \cdots + p_n}.$$

4. Let $X$ be distributed according to a $\chi^2_p$. Then $E(X) = p$ and, for $p > 2$, $E\left(\frac{1}{X}\right) = 1/(p-2)$.

1.5.2 Student's t distribution

Let $X_1, X_2, \dots, X_n$ be a random sample from a $N(\mu, \sigma^2)$ distribution. We know that
$$\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0,1).$$
But $\sigma^2$ is usually unknown. It would be most useful if we knew the distribution of the statistic
$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}}.$$
The distribution of $T$ is known as the $t$ distribution with $n-1$ degrees of freedom. Note that
$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}} = \frac{(\bar{X} - \mu)/(\sigma/\sqrt{n})}{\sqrt{S^2/\sigma^2}} = \frac{Z}{\sqrt{V/(n-1)}},$$
where $Z \sim N(0,1)$, $V \sim \chi^2_{n-1}$, and $Z$, $V$ are independent because of the above theorem.

Example Let $Y$ be a random variable distributed according to Student's $t$ distribution with $p$ degrees of freedom. Show that:

1. The pdf of $Y$ is
$$f_Y(y) = \frac{\Gamma\left(\frac{p+1}{2}\right)}{\Gamma\left(\frac{p}{2}\right)\sqrt{p\pi}} \left(1 + \frac{y^2}{p}\right)^{-(p+1)/2}.$$

2. $E(Y) = 0$, if $p > 1$.

3. $\mathrm{Var}(Y) = p/(p-2)$, if $p > 2$.

1.5.3 Snedecor's F distribution

Let $X_1, X_2, \dots, X_n$ be a random sample from a $N(\mu_x, \sigma_x^2)$ population, and $Y_1, Y_2, \dots, Y_m$ be a random sample from a $N(\mu_y, \sigma_y^2)$ population. Suppose that we want to compare the variability of the two populations. This can be done through the variance ratio $\sigma_x^2/\sigma_y^2$, but since these are unknown we can use $S_x^2/S_y^2$. The $F$ distribution compares these quantities by giving us the distribution of
$$F = \frac{S_x^2/S_y^2}{\sigma_x^2/\sigma_y^2}.$$
Note that $F$ may be written as
$$F = \frac{S_x^2/\sigma_x^2}{S_y^2/\sigma_y^2} = \frac{U/(n-1)}{V/(m-1)},$$
where $U \sim \chi^2_{n-1}$, $V \sim \chi^2_{m-1}$ and $U$, $V$ are independent.

Example Let $Y$ be a random variable distributed according to Snedecor's $F$ distribution with $p$ and $q$ degrees of freedom.

1. Find the pdf of $Y$.

2. Show that $E(Y) = q/(q-2)$, if $q > 2$.

3. Show that $1/Y$ again has an $F$ distribution, with $q$ and $p$ degrees of freedom.

4. Show that if $T \sim t_q$ then $T^2 \sim F_{1,q}$.

1.6 Order Statistics

Let $X_1, X_2, \dots, X_n$ be a random sample from a population with distribution function $F(x)$ and density $f(x)$. Define $X_{(i)}$ to be the $i$-th smallest of the $\{X_i\}$ $(i = 1, \dots, n)$, namely
$$X_{(1)} \le X_{(2)} \le \cdots \le X_{(n)}.$$
We want to find the density function of $X_{(i)}$. Notice that while $X_i$ is one element of the random sample, $X_{(i)}$ is a statistic which is a function of the whole random sample. Informally, we may write, for small $\delta x$,
$$f_{X_{(i)}}(x) \approx \frac{F_{X_{(i)}}(x + \delta x) - F_{X_{(i)}}(x)}{\delta x} = \frac{P[X_{(i)} \in (x, x + \delta x)]}{\delta x}.$$
The probability that $X_{(i)}$ is in $(x, x + \delta x)$ is roughly equal to the probability of $(i-1)$ observations in $(-\infty, x)$, one in $(x, x + \delta x)$ and the remaining $(n-i)$ in $(x + \delta x, +\infty)$. This is a trinomial probability:
$$\frac{P[X_{(i)} \in (x, x + \delta x)]}{\delta x} = \frac{n!}{(i-1)!\,1!\,(n-i)!}\, F(x)^{i-1} [1 - F(x + \delta x)]^{n-i}\, \frac{P(\text{one observation} \in (x, x + \delta x))}{\delta x}.$$
As $\delta x \to 0$, rigorous calculations provide
$$f_{X_{(i)}}(x) = \frac{n!}{(i-1)!(n-i)!}\, F(x)^{i-1} f(x) [1 - F(x)]^{n-i}.$$

Notice that this formula is a function of the population cdf and pdf, i.e. of a generic $X_i$.

Example: Let $X_1, X_2, \dots, X_n$ be a random sample from $U(0,1)$. Find the density of $X_{(i)}$.

The density of $X_{(i)}$ becomes (for $0 < x < 1$)
$$f_{X_{(i)}}(x) = \frac{n!}{(i-1)!(n-i)!}\, x^{i-1}(1-x)^{n-i} = \frac{\Gamma(n+1)}{\Gamma(i)\Gamma(n+1-i)}\, x^{i-1}(1-x)^{(n-i+1)-1}.$$
This is a $\mathrm{Beta}(i, n-i+1)$ distribution. We get
$$E[X_{(i)}] = \frac{i}{n+1}, \qquad \mathrm{Var}(X_{(i)}) = \frac{i(n-i+1)}{(n+1)^2(n+2)}.$$
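A quick simulation sketch (not from the notes; $n = 10$ and $i = 3$ are arbitrary) comparing the simulated mean and variance of $X_{(i)}$ from $U(0,1)$ samples with the Beta$(i, n-i+1)$ formulas above:

```python
import numpy as np

rng = np.random.default_rng(4)
n, i = 10, 3
samples = np.sort(rng.uniform(0.0, 1.0, size=(200_000, n)), axis=1)
x_i = samples[:, i - 1]                      # i-th order statistic of each sample

print(x_i.mean(), i / (n + 1))                                # E[X_(i)] = i/(n+1)
print(x_i.var(), i * (n - i + 1) / ((n + 1) ** 2 * (n + 2)))  # Var[X_(i)] formula
```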

Reading

G. Casella & R. L. Berger 2.1, 4.3, 4.6, 5.1, 5.2, 5.3, 5.4

2 The Sufficiency Principle


• Each statistic reduces the observed sample to a single number (or a low-dimensional summary): data reduction.
• Inherently there is some loss of information.
• A good inferential procedure is based on statistics that do not throw away too much information.

2.1 Sufficient Statistics

A sufficient statistic for a parameter $\theta$ captures, in a certain sense, all the relevant information in the sample about $\theta$.

Sufficiency Principle: If $T(Y)$ is a sufficient statistic for $\theta$, then any inference for $\theta$ should be based on the sample $Y$ only through $T(Y)$. That is, if $x$ and $y$ are two observed samples (that is, $x = (x_1, \dots, x_n)$ and $y = (y_1, \dots, y_n)$) such that $T(x) = T(y)$, then the inference about $\theta$ should be the same regardless of whether $Y = y$ or $Y = x$ was observed.

Definition: Let $Y = (Y_1, Y_2, \dots, Y_n)$ be a random sample. A statistic $U = T(Y)$ is a sufficient statistic for a parameter $\theta$ if the conditional distribution of $Y \mid U = u$ does not depend on $\theta$. In other words, if $f_Y(y|\theta)$ is the joint pdf or pmf of the sample $Y$, and $f_U(u|\theta)$ is the pdf or pmf of $U$, then $U$ is a sufficient statistic if
$$f_{Y|U}(y \mid U = u; \theta) = \frac{f_{Y,U}(y, u|\theta)}{f_U(u|\theta)} = \frac{f_Y(y|\theta)}{f_U(u|\theta)}$$
is constant as a function of $\theta$ for all $y$ (that is, it does not depend on $\theta$). Notice that since $U = T(Y)$, we have $P(Y = y, U = u) = P(Y = y)$; indeed, the event $\{Y = y\}$ is a subset of the event $\{T(Y) = T(y)\}$ (consider for example the case $T(Y) = \sum_{i=1}^n Y_i$).
Notes

• If $Y$ is discrete, the ratio above is a conditional probability mass function:
$$P(Y = y \mid T(Y) = T(y)) = \frac{P(Y = y, T(Y) = T(y))}{P(T(Y) = T(y))}.$$

• If $Y$ is continuous, it is just a conditional pdf.

• The definition refers to the conditional distribution. A statistic is sometimes defined as being sufficient for a family of distributions, $F_Y(y|\theta)$, $\theta \in \Theta$.

Example: Let $Y = (Y_1, \dots, Y_n)$ be a random sample from a Poisson($\lambda$) population, and let $U = T(Y) = \sum_{i=1}^n Y_i$. It can be shown that $U \sim$ Poisson($n\lambda$). We can also write
$$P(Y = y \mid U = u) = \frac{P(Y = y, U = u)}{P(U = u)},$$
and note that
$$P(Y = y, U = u) = \begin{cases} 0 & \text{if } T(y) \ne u, \\ P(Y = y) & \text{if } T(y) = u, \end{cases}$$
so we can then write
$$P(Y = y \mid U = u) = \frac{P(Y = y)}{P(U = u)} = \frac{\prod_{i=1}^n e^{-\lambda}\lambda^{y_i}/(y_i!)}{e^{-n\lambda}(n\lambda)^u/(u!)} = \frac{u!}{n^u \prod_{i=1}^n y_i!}.$$
$U$ is sufficient since there is no $\lambda$ in the final expression.
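The sufficiency claim can be illustrated by simulation (a rough sketch, not part of the notes; $n = 5$, the conditioning value $u = 10$ and the two $\lambda$ values are arbitrary): conditionally on the observed sum $U = u$, the distribution of, say, $Y_1$ looks the same whatever $\lambda$ generated the data.

```python
import numpy as np

rng = np.random.default_rng(5)
n, u = 5, 10

def conditional_y1(lam, reps=400_000):
    # Keep only samples whose sum equals u and look at the first coordinate
    y = rng.poisson(lam, size=(reps, n))
    keep = y[y.sum(axis=1) == u]
    return np.bincount(keep[:, 0], minlength=u + 1) / len(keep)

# The two conditional distributions agree up to simulation noise: no lambda left
print(np.round(conditional_y1(1.5), 3))
print(np.round(conditional_y1(2.5), 3))
```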

Theorem (Factorization Theorem): Let $Y = (Y_1, Y_2, \dots, Y_n)$ be a random sample with joint pdf or pmf $f_Y(y|\theta)$. The statistic $T(Y)$ is sufficient for the parameter $\theta$ if and only if we can find functions $g(\cdot)$ and $h(\cdot)$ such that
$$f_Y(y|\theta) = g(T(y)|\theta)\, h(y)$$
for all $y \in \mathbb{R}^n$ and $\theta \in \Theta$. (Note that both $T$ and $\theta$ can be vectors.)

We give the proof for a discrete-valued $Y$. The proof for the continuous case is quite technical and beyond the scope of this course.

Proof of Factorization Theorem:

Preliminaries: Since $T(Y)$ is a function of $Y$, we can write
$$P(Y = y) = P(Y = y, T(Y) = T(y)) \qquad (1)$$
but NOT
$$P(T(Y) = T(y)) = P(Y = y, T(Y) = T(y)).$$
Indeed,
$$P(T(Y) = T(y)) = \sum_{y_k : T(y_k) = T(y)} P(Y = y_k, T(Y) = T(y)) = \sum_{y_k : T(y_k) = T(y)} P(Y = y_k). \qquad (2)$$
That is, the event $\{Y = y\}$ is a subset of the event $\{T(Y) = T(y)\}$ but not vice versa.

If $T$ is sufficient: Suppose $T$ is sufficient for $\theta$, that is, $P(Y = y \mid T(Y) = T(y))$ does not depend on $\theta$. We can write
$$P_\theta(Y = y) \overset{(1)}{=} P_\theta(Y = y, T(Y) = T(y)) = P_\theta(T(Y) = T(y))\, P(Y = y \mid T(Y) = T(y)) = g(T(y), \theta)\, h(y),$$
since the pmf $P_\theta(T(Y) = T(y))$ is a function of $T(y)$ and $\theta$, whereas $P(Y = y \mid T(Y) = T(y))$ does not depend on $\theta$ due to the sufficiency of $T$. Note that here the function $g(T(y), \theta)$ is the pmf of $T(Y)$.

Converse: Suppose $P_\theta(Y = y) = g(T(y), \theta) h(y)$. The conditional pmf is
$$P_\theta(Y = y \mid T(Y) = T(y)) = \frac{P_\theta(Y = y, T(Y) = T(y))}{P_\theta(T(Y) = T(y))} \overset{(1),(2)}{=} \frac{P_\theta(Y = y)}{\sum_{y_k : T(y_k) = T(y)} P_\theta(Y = y_k)} = \frac{g(T(y), \theta) h(y)}{\sum_{y_k : T(y_k) = T(y)} g(T(y), \theta) h(y_k)} = \frac{h(y)}{\sum_{y_k : T(y_k) = T(y)} h(y_k)},$$
which does not depend on $\theta$. Hence $T(Y)$ is a sufficient statistic.

Example: Let $Y = (Y_1, \dots, Y_n)$ be a random sample from the following distributions; find a sufficient statistic in each case.

1. Sufficient statistic for $\mu$ from a $N(\mu, \sigma^2)$ population with $\sigma^2$ known.

The joint density may be written as
$$f_Y(y|\mu) = (2\pi\sigma^2)^{-n/2} \exp\left(-\frac{n(\bar{y}-\mu)^2 + (n-1)s^2}{2\sigma^2}\right) = (2\pi\sigma^2)^{-n/2} \exp\left(-\frac{(n-1)s^2}{2\sigma^2}\right) \exp\left(-\frac{n(\bar{y}-\mu)^2}{2\sigma^2}\right).$$
The statistic $T(Y) = \bar{Y}$ is sufficient for $\mu$ since, if we set $g(T(y), \mu) = \exp\left(-\frac{n(\bar{y}-\mu)^2}{2\sigma^2}\right)$ and $h(y) = (2\pi\sigma^2)^{-n/2} \exp\left(-\frac{(n-1)s^2}{2\sigma^2}\right)$, we have $f_Y(y|\mu) = g(T(y), \mu) h(y)$.

2. Sufficient statistic for $(\mu, \sigma^2)$ from a $N(\mu, \sigma^2)$ population.

The joint density may be written as
$$f_Y(y|\mu, \sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\left(-\frac{n(\bar{y}-\mu)^2 + (n-1)s^2}{2\sigma^2}\right).$$
The statistic $T(Y) = (\bar{Y}, S^2)$ is sufficient for $(\mu, \sigma^2)$ since, if we set $g(T(y), \mu, \sigma^2) = f_Y(y|\mu, \sigma^2)$ and $h(y) = 1$, we have $f_Y(y|\mu, \sigma^2) = g(T(y), \mu, \sigma^2) h(y)$.

Note: The statistic $\bar{Y}$ is sufficient for $\mu$ but not for $(\mu, \sigma^2)$.

3. Sufficient statistic for $\theta$ from Unif$(0, \theta)$.

Let $I(Y \in A)$ denote an indicator function, that is, a function that equals 1 if $Y \in A$ and 0 otherwise.

The joint density may be written as
$$f_Y(y|\theta) = \prod_{i=1}^n \frac{1}{\theta} I(y_i > 0) I(y_i < \theta) = \frac{1}{\theta^n} I\left(\max_i y_i < \theta\right) I\left(\min_i y_i > 0\right).$$
The statistic $T(Y) = \max_i Y_i$ is sufficient for $\theta$ since, if we set $g(T(y), \theta) = \frac{1}{\theta^n} I(\max_i y_i < \theta)$ and $h(y) = I(\min_i y_i > 0)$, we have $f_Y(y|\theta) = g(T(y), \theta) h(y)$.

2.2 Minimal Sufficiency

Example (Sufficiency of the sample): Let $Y = (Y_1, \dots, Y_n)$ be a sample from a population with $f_{Y_i}(y_i|\theta)$. Denote the joint density of the sample $Y$ by $f_Y(y|\theta)$.

Note that
$$f_Y(y|\theta) = f_Y(y|\theta) \times 1 = g(T(y)|\theta) \times h(y),$$
where
$$T(Y) = Y, \qquad g(T(y)|\theta) = f_Y(y|\theta), \qquad h(y) = 1.$$
Every sample is itself a sufficient statistic. Also, every statistic that is a one-to-one function of a sufficient statistic is itself a sufficient statistic.

• There exist many sufficient statistics.


• Ideally, we would like the simplest possible sufficient statistic.
• If a statistic T is a function of a statistic S, then it contains no more information than S.
• We look at those sufficient statistics which are still sufficient even though they are functions
of other statistics.

Definition: A sufficient statistic $T(Y)$ is a minimal sufficient statistic if, for any other sufficient statistic $T'(Y)$, $T(Y)$ is a function of $T'(Y)$.

Some facts about minimal sufficient statistics

• If a sufficient statistic has dimension 1, it must be a minimal sufficient statistic.

• Minimal sufficient statistics are not unique. However, if two statistics are minimal sufficient, they must have the same dimension.

• A one-to-one function of a minimal sufficient statistic is also a minimal sufficient statistic

• The dimension of a minimal sufficient statistic is not always the same as the dimension of
the parameter of interest.

Reading

G. Casella & R. L. Berger 3.4, 6.1, 6.2.1, 6.2.2

3 Point Estimation

Problem:

• Suppose that a real-world phenomenon may be described by a probability model defined through the random variable $Y$ with $F_Y(y|\theta)$.

• Suppose also that a sample $Y = (Y_1, Y_2, \dots, Y_n)$ is drawn from that distribution.

• We want to use the information in the random sample $Y$ to get a best guess for $\theta$. In other words, we want a point estimate of $\theta$.

• A function of $Y$ that gives a point estimate of $\theta$ is an estimator of $\theta$. If we use the observed random sample $Y = y$, we get the estimate of $\theta$, which is a particular value of the estimator.

Next, we look at two methods for finding point estimators, the method of moments and maximum likelihood. Then we present methods for evaluating estimators.

3.1 Method of Moments

Description: Let $Y = (Y_1, Y_2, \dots, Y_n)$ be a random sample from a population with pdf or pmf $f(y|\theta_1, \dots, \theta_k)$. Let the $r$-th sample moment be defined as
$$m_r = \frac{1}{n}\sum_{i=1}^n Y_i^r.$$
Remember that the $r$-th (population) moment is
$$\mu_r = \mu_r(\theta_1, \dots, \theta_k) = E_\theta(Y_i^r).$$
Method of moments estimators are found by equating the first $k$ sample moments to the corresponding $k$ population moments and solving the resulting system of simultaneous equations.

Example: Let $Y = (Y_1, Y_2, \dots, Y_n)$ be a random sample from a $N(\mu, \sigma^2)$ population. Find estimators for $\mu$ and $\sigma^2$ using the method of moments.

We want estimators for 2 parameters. Hence, we first write down the system of 2 equations
$$\bar{X} = E(X) = \mu, \qquad \frac{1}{n}\sum_{i=1}^n X_i^2 = E(X^2) = \mu^2 + \sigma^2,$$
and after solving it we get the following estimators:
$$\hat{\mu} = \bar{X}, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n X_i^2 - \hat{\mu}^2 = \frac{1}{n}\sum_{i=1}^n X_i^2 - \bar{X}^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2 = \frac{n-1}{n} S^2.$$
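A minimal numerical sketch of this method-of-moments example (illustrative only; the true values $\mu = 1$, $\sigma^2 = 4$ and $n = 1000$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(loc=1.0, scale=2.0, size=1_000)

m1 = x.mean()               # first sample moment
m2 = (x ** 2).mean()        # second sample moment

mu_hat = m1                 # solve  m1 = mu
sigma2_hat = m2 - m1 ** 2   # solve  m2 = mu^2 + sigma^2

print(mu_hat, sigma2_hat)   # close to 1 and 4
print(np.isclose(sigma2_hat, (len(x) - 1) / len(x) * x.var(ddof=1)))  # equals (n-1)/n * S^2
```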

3.2 The Likelihood Function

Definition: Let $Y = (Y_1, Y_2, \dots, Y_n)$ be a sample from a population with pdf (or pmf) $f(y_i|\theta)$. Then, given that $Y = y$ is observed, the function of $\theta$ defined by the joint pdf (or pmf) of $Y = y$,
$$L(\theta|Y = y) = f_Y(y|\theta),$$
is called the likelihood function.

Notes

• In most cases the pdf of $Y$ is thought of as a function of $y$, whereas the likelihood function is thought of as a function of $\theta$ for a given observed sample.

• If for $\theta_1, \theta_2$ we have $L(\theta_1|y) > L(\theta_2|y)$, then the sample is more likely to have occurred if $\theta = \theta_1$ than if $\theta = \theta_2$. In other words, $\theta_1$ is a more plausible value than $\theta_2$.

• The likelihood, as a function of $\theta$, is not always a pdf.

• Sometimes it is more convenient to work with the log-likelihood, $l(\theta|y)$, which is just the log of the likelihood.

• If $Y = (Y_1, Y_2, \dots, Y_n)$ is a random sample from $f(y_i|\theta)$, then
$$L(\theta|Y = y) = f_Y(y|\theta) = \prod_{i=1}^n f(y_i|\theta), \qquad l(\theta|Y = y) = \log f_Y(y|\theta) = \sum_{i=1}^n \log f(y_i|\theta).$$

Example: Consider a continuous random variable $X$ with pdf $f_X(x|\theta)$; then for small $\epsilon$ we have
$$\frac{P_\theta(x - \epsilon < X < x + \epsilon)}{2\epsilon} \simeq f_X(x|\theta) = L(\theta|x),$$
therefore if we compare the probabilities for different values of $\theta$ we have
$$\frac{P_{\theta_1}(x - \epsilon < X < x + \epsilon)}{P_{\theta_2}(x - \epsilon < X < x + \epsilon)} \simeq \frac{L(\theta_1|x)}{L(\theta_2|x)},$$
and the value of $\theta$ which gives the higher likelihood is more likely to be associated with the observed sample, since it gives a higher probability.

Example: Likelihood and log-likelihood for exponential($\lambda$):
$$L(\lambda|Y = y) = \lambda^n \exp\left(-\lambda\sum_{i=1}^n y_i\right), \qquad l(\lambda|Y = y) = n\log\lambda - \lambda\sum_{i=1}^n y_i.$$

Example: Likelihood and log-likelihood for $N(\mu, \sigma^2)$:
$$L(\mu, \sigma^2|Y = y) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^n \exp\left(-\frac{\sum_{i=1}^n (y_i - \mu)^2}{2\sigma^2}\right),$$
$$l(\mu, \sigma^2|Y = y) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mu)^2.$$
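As an illustration (a sketch, not from the notes; rate $\lambda = 2$ and $n = 200$ are arbitrary), the exponential log-likelihood above can be evaluated and maximized numerically; the numerical maximizer coincides with $1/\bar{y}$, the zero of the score function $n/\lambda - \sum_i y_i$ derived later in this chapter.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(7)
y = rng.exponential(scale=1.0 / 2.0, size=200)        # Exp with rate lambda = 2

def negloglik(lam):
    # l(lambda|y) = n log(lambda) - lambda * sum(y)
    return -(len(y) * np.log(lam) - lam * y.sum())

res = minimize_scalar(negloglik, bounds=(1e-6, 50.0), method="bounded")
print(res.x, 1.0 / y.mean())                          # numerical maximizer vs 1/ybar
```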

3.3 Score Function and Fisher’s Information

Definition: The score function associated with the log-likelihood $l(\theta|y)$ is
$$s(\theta|y) = \frac{\partial l(\theta|y)}{\partial\theta} = \frac{1}{L(\theta|y)}\frac{\partial L(\theta|y)}{\partial\theta}.$$

Proposition: $E(s(\theta|Y)) = 0$.

Proof (for the continuous case):
$$E[s(\theta|Y)] = \int_{\mathbb{R}^n} s(\theta|y) f(y|\theta)\, dy = \int_{\mathbb{R}^n} \frac{\partial L(\theta|y)/\partial\theta}{L(\theta|y)} f(y|\theta)\, dy = \int_{\mathbb{R}^n} \frac{\partial}{\partial\theta} f(y|\theta)\, dy = \frac{\partial}{\partial\theta}\int_{\mathbb{R}^n} f(y|\theta)\, dy = 0,$$
because $L(\theta|y) = f(y|\theta)$ and the last integral is equal to one, since the pdf is normalised. Here $\theta$ has to be understood as the true value of the unknown parameters.

Notes:

1. For the discrete case replace the integrals with sums.

2. Although the score function is usually viewed as a function of $\theta$, the expectation is taken with respect to $Y$, i.e. with respect to the distribution of $Y$, which depends on $\theta$. This may be interpreted as follows: if the experiment were repeated many times, the score function would on average equal 0. That is, if we start at the true value of the parameters, on average over many experiments the likelihood does not change if we make an infinitesimal change of the parameter.

In most cases, if we plot the likelihood function against $\theta$, we get a curve with a peak at the maximum. The sharper the peak, the more information about $\theta$ exists in the sample. This is captured by Fisher's information:
$$I(\theta|y) \equiv I(\theta|Y = y) = E\left[s(\theta|Y)^2\right] = E\left[\left(\frac{\partial}{\partial\theta} l(\theta|Y)\right)^2\right].$$
This is the variance of the score (when computed at the true value of $\theta$): the larger it is, the more the score at the true value is affected by small changes in the parameter, the sharper the peak, and the more precise our information about $\theta$.

Proposition: Show that, under regularity conditions (e.g. the exponential family),
$$I(\theta|y) = E\left[s(\theta|Y)^2\right] = -E\left[\frac{\partial^2}{\partial\theta^2} l(\theta|Y)\right] = -E\left[\frac{\partial}{\partial\theta} s(\theta|Y)\right];$$
again, here $\theta$ has to be understood as the true value of the unknown parameters.

In this case, at the true value of $\theta$, the Fisher information is also the negative expected Hessian of the log-likelihood, so it measures the concavity of the log-likelihood. In particular, since the Fisher information must always be positive (it is a variance), the expected Hessian must be negative, which, jointly with the zero expectation of the score (the first derivative of the log-likelihood), tells us that, on average, the true value of $\theta$ is a maximum of the log-likelihood.

Proof:
$$0 = \frac{d}{d\theta} E[s(\theta|Y)] = \frac{d}{d\theta}\int_{\mathbb{R}^n} s(\theta|y) f(y|\theta)\, dy = \int_{\mathbb{R}^n} \frac{d}{d\theta}\left[s(\theta|y) f(y|\theta)\right] dy$$
$$= \int_{\mathbb{R}^n} \left(\frac{d}{d\theta} s(\theta|y)\right) f(y|\theta)\, dy + \int_{\mathbb{R}^n} s(\theta|y) \frac{d}{d\theta} f(y|\theta)\, dy$$
$$= \int_{\mathbb{R}^n} \left[\frac{d}{d\theta} s(\theta|y) + s(\theta|y)^2\right] f(y|\theta)\, dy \qquad \left(\text{using } \frac{d}{d\theta} f(y|\theta) = s(\theta|y) f(y|\theta)\right)$$
$$= E\left[\frac{d}{d\theta} s(\theta|Y)\right] + E\left[s(\theta|Y)^2\right].$$

Let $Y = (Y_1, \dots, Y_n)$ be a random sample from a pdf $f_{Y_i}(y_i|\theta)$. Denote by $s(\theta|y_i)$ and $I(\theta|y_i)$ the score function and the Fisher information for $Y_i = y_i$, respectively. Then, for a realisation of the random sample we have
$$s(\theta|y) = \sum_{i=1}^n s(\theta|y_i), \qquad I(\theta|y) = n I(\theta|y_i).$$

Proof: The log-likelihood function is
$$\ell(\theta|Y) = \log\left(\prod_{i=1}^n f(Y_i|\theta)\right) = \sum_{i=1}^n \ell(\theta|Y_i).$$
Hence
$$s(\theta|Y) = \frac{\partial \ell(\theta|Y)}{\partial\theta} = \sum_{i=1}^n \frac{\partial \ell(\theta|Y_i)}{\partial\theta} = \sum_{i=1}^n s(\theta|Y_i).$$
For the Fisher information, using the fact that $(Y_1, \dots, Y_n)$ are i.i.d., we have
$$I(\theta|y) = E\left[\left(\sum_{i=1}^n s(\theta|Y_i)\right)^2\right] = \sum_{i=1}^n E\left[(s(\theta|Y_i))^2\right] = n I(\theta|y_i),$$
or alternatively, using the Hessian,
$$I(\theta|Y) = -E\left(\frac{\partial s(\theta|Y)}{\partial\theta}\right) = -E\left(\sum_{i=1}^n \frac{\partial s(\theta|Y_i)}{\partial\theta}\right) = -\sum_{i=1}^n E\left(\frac{\partial s(\theta|Y_i)}{\partial\theta}\right) = \sum_{i=1}^n I(\theta|y_i) = n I(\theta|y_i).$$

Example: Let $Y = (Y_1, \dots, Y_n)$ be a random sample from an Exp($\lambda$). Show that the score function is
$$s(\lambda|y) = \frac{n}{\lambda} - \sum_{i=1}^n y_i,$$
and the Fisher information is
$$I(\lambda|y) = \frac{n}{\lambda^2}.$$

Vector parameter case: If $\theta = (\theta_1, \dots, \theta_p)'$, then the score function is the vector
$$s(\theta|Y) = \nabla_\theta \ell(\theta|Y) = \left(\frac{\partial}{\partial\theta_1}\ell(\theta|Y), \dots, \frac{\partial}{\partial\theta_p}\ell(\theta|Y)\right)'$$
and Fisher's information is the matrix
$$I(\theta|y) = E\left[s(\theta|Y) s(\theta|Y)'\right].$$
The $(i,j)$-th element of the Fisher information matrix is
$$[I(\theta|y)]_{ij} = E\left[\frac{\partial}{\partial\theta_i} l(\theta|Y)\, \frac{\partial}{\partial\theta_j} l(\theta|Y)\right].$$
It also holds that

1. $E[s(\theta|Y)] = 0_p$;

2. $I(\theta|y) = V[s(\theta|Y)]$, where $V[\cdot]$ denotes a covariance matrix;

3. $I(\theta|y) = -E\left[\nabla_\theta \nabla_\theta' \ell(\theta|Y)\right] = -E\left[\nabla_\theta s(\theta|Y)\right]$, or else
$$[I(\theta|y)]_{i,j} = -E\left[\frac{\partial^2}{\partial\theta_i \partial\theta_j}\ell(\theta|Y)\right].$$

Example: Let $Y = (Y_1, \dots, Y_n)$ be a random sample from a $N(\mu, \sigma^2)$. Show that the score function is
$$s(\theta|y) = \begin{pmatrix} \frac{1}{\sigma^2}\sum_{i=1}^n (y_i - \mu) \\ -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^n (y_i - \mu)^2 \end{pmatrix}$$
and the Fisher information matrix is
$$I(\theta|y) = \begin{pmatrix} \frac{n}{\sigma^2} & 0 \\ 0 & \frac{n}{2\sigma^4} \end{pmatrix}.$$

3.4 Maximum Likelihood Estimators

We have seen that the true value of the parameter $\theta$ must be such that the log-likelihood attains, on average, its maximum. This motivates the definition of the maximum likelihood estimator.

Definition: For each sample point $Y = y$, let $\hat{\theta}(y)$ be the parameter value at which the likelihood $L(\theta|y)$ attains its maximum as a function of $\theta$. A maximum likelihood estimator (MLE) of the parameter $\theta$ based on $Y$ is the function $\hat{\theta}(Y)$.

Maximization: In general the likelihood function can be maximized using numerical methods. However, if the function is differentiable in $\theta$, calculus may be used. The values of $\theta$ such that
$$s(\theta|y) = \frac{\partial \ell(\theta|y)}{\partial\theta} = 0$$
are possible candidates. These points may not correspond to the maximum because:

1. They may correspond to a minimum. The second derivative must also be checked.

2. The zeros of the first derivative locate only local maxima; we want a global maximum.

3. The maximum may be at the boundary, where the first derivative need not be 0.

4. These points may be outside the parameter range.

Notice that an application of the Weak Law of Large Numbers tells us that, as $n \to \infty$,
$$\frac{1}{n} s(\theta|Y) = \frac{1}{n}\sum_{i=1}^n s(\theta|Y_i) \overset{p}{\to} E[s(\theta|Y_i)] = 0,$$
which justifies our necessary condition.

Example: Let $Y = (Y_1, Y_2, \dots, Y_n)$ be a random sample from $N(\mu, 1)$, $-\infty < \mu < +\infty$. Find the MLE for $\mu$. The log-likelihood function is equal to
$$\ell(\mu|y) = \text{const.} - \frac{1}{2}\sum_{i=1}^n (y_i - \mu)^2 = \text{const.} - \frac{1}{2}\sum_{i=1}^n y_i^2 + \mu\sum_{i=1}^n y_i - \frac{1}{2} n\mu^2.$$
Setting the score function equal to 0 yields a candidate for the global maximum:
$$\frac{\partial}{\partial\mu}\ell(\hat{\mu}|y) = 0 \;\Rightarrow\; \sum_{i=1}^n y_i - n\hat{\mu} = 0 \;\Rightarrow\; \hat{\mu} = \bar{y}.$$
We can check that it corresponds to a maximum (and not a minimum) since the second derivative of the log-likelihood is negative:
$$\frac{\partial^2}{\partial\mu^2}\ell(\hat{\mu}|y) = -n < 0.$$
The MLE for $\mu$ is $\hat{\mu} = \bar{Y}$ (in fact more checking is required, but it is omitted for simplicity).
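A minimal numerical counterpart of this example (a sketch; the simulated data with $\mu = 3$ and $n = 100$ are arbitrary): maximizing the log-likelihood numerically returns the sample mean, as derived analytically above.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(8)
y = rng.normal(loc=3.0, scale=1.0, size=100)

def negloglik(mu):
    # l(mu|y) = const - 0.5 * sum((y_i - mu)^2); the constant can be dropped
    return 0.5 * np.sum((y - mu) ** 2)

res = minimize_scalar(negloglik)
print(res.x, y.mean())         # the two agree: the MLE is the sample mean
```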

Example: We cannot always use the above calculus recipe. For example, let $Y = (Y_1, Y_2, \dots, Y_n)$ be a random sample from $U(0, \theta)$. Assume we observe $Y = y$ and rank the realisations as $y_{(1)} \le \dots \le y_{(n)}$. These are then realisations of the order statistics $Y_{(i)}$. The likelihood for $\theta$ given $Y = y$ is
$$L(\theta|y) = \theta^{-n} I(y_{(1)} \ge 0) I(y_{(n)} \le \theta)$$
and the log-likelihood for $y_{(1)} \ge 0$ is (notice that by construction all realisations are such that $y_{(1)} \ge 0$)
$$\ell(\theta|y) = -n\log(\theta) \quad \text{if } \theta \ge y_{(n)}.$$
The function $\ell(\theta|y)$ is maximized at $\hat{\theta} = y_{(n)}$, which is our estimate. Hence $\hat{\theta} = Y_{(n)}$ is the MLE.

Induced likelihood: Let $Y$ be a sample with likelihood $L(\theta|y)$ and let $\eta = g(\theta)$. The induced likelihood for $\eta$ given $Y = y$ is
$$L^*(\eta|Y = y) = \sup_{\theta : g(\theta) = \eta} L(\theta|Y = y).$$

Theorem (Invariance property of MLEs): If $\hat{\theta}$ is the MLE for $\theta$, then for any function $g(\cdot)$ the MLE of $g(\theta)$ is $g(\hat{\theta})$.

Example: The MLE for $\mu^2$ in the $N(\mu, 1)$ case is $\bar{Y}^2$, the MLE for $p/(1-p)$ in the Binomial($n, p$) case is $\hat{p}/(1-\hat{p})$, etc.

3.5 Evaluating Estimators

Being a function of the sample, an estimator is itself a random variable. Hence it has a mean and a variance. Let $\hat{\theta}$ be an estimator of $\theta$. The quantity
$$E(\hat{\theta} - \theta)$$
is termed the bias of the estimator $\hat{\theta}$. If $E(\hat{\theta}) = \theta$ the estimator is unbiased.

Estimators are usually evaluated based on their bias and variance.

The mean squared error (MSE) of an estimator $\hat{\theta}$ is the function of $\theta$ defined by
$$MSE(\theta) = E(\hat{\theta} - \theta)^2.$$
Note that
$$E(\hat{\theta} - \theta)^2 = E\left[\hat{\theta} - E(\hat{\theta}) + E(\hat{\theta}) - \theta\right]^2 = E\left\{[\hat{\theta} - E(\hat{\theta})]^2 + [E(\hat{\theta}) - \theta]^2 + 2[\hat{\theta} - E(\hat{\theta})][E(\hat{\theta}) - \theta]\right\}$$
$$= E[\hat{\theta} - E(\hat{\theta})]^2 + [E(\hat{\theta}) - \theta]^2 + 2[E(\hat{\theta}) - E(\hat{\theta})][E(\hat{\theta}) - \theta] = \text{Variance} + (\text{Bias})^2.$$

An estimator $\hat{\theta}_1$ is uniformly better than $\hat{\theta}_2$ if it has smaller MSE for all $\theta$.

Example: Compare the MSEs of $S^2$, $\frac{n-1}{n}S^2$, and $\hat{\sigma}^2 = 1$ as estimators for $\sigma^2$, given a random sample $Y$ of size $n$ from $N(\mu, \sigma^2)$.
$$E\left(\frac{n-1}{n}S^2\right) = \frac{n-1}{n}\sigma^2,$$
therefore $\frac{n-1}{n}S^2$ is biased.
$$MSE(S^2) = \dots = \frac{2\sigma^4}{n-1} > \frac{2n-1}{n^2}\sigma^4 = MSE\left(\frac{n-1}{n}S^2\right), \quad \forall\,\sigma^2,\; n = 2, 3, \dots$$
Thus $\frac{n-1}{n}S^2$ is uniformly better than $S^2$. But it is not uniformly better than $\hat{\sigma}^2 = 1$, which has zero MSE when $\sigma^2 = 1$.
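A simulation sketch of this MSE comparison (illustrative only; $n = 10$, $\sigma^2 = 4$ and normal data with $\mu = 0$ are arbitrary choices): the biased estimator $\frac{n-1}{n}S^2$ has the smaller MSE, matching the formulas above.

```python
import numpy as np

rng = np.random.default_rng(9)
n, sigma2, reps = 10, 4.0, 200_000
y = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))

s2 = y.var(axis=1, ddof=1)                      # S^2
s2_alt = (n - 1) / n * s2                       # (n-1)/n * S^2

print(np.mean((s2 - sigma2) ** 2), 2 * sigma2 ** 2 / (n - 1))               # MSE(S^2)
print(np.mean((s2_alt - sigma2) ** 2), (2 * n - 1) / n ** 2 * sigma2 ** 2)  # MSE of the biased version
```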

3.6 Best Unbiased Estimators

As seen from the previous example, we cannot find a 'uniformly best' estimator. Hence we restrict attention to unbiased estimators. The MSE of an unbiased estimator is equal to its variance. A best unbiased estimator is also termed a minimum variance unbiased estimator (MVUE).

Theorem (Cramér-Rao inequality): Let $Y = (Y_1, \dots, Y_n)$ be a sample and $U = h(Y)$ be an unbiased estimator of $g(\theta)$. Under regularity conditions the following holds for all $\theta$:
$$V(U) \ge \frac{\left[\frac{\partial}{\partial\theta} g(\theta)\right]^2}{I(\theta|y)}.$$
Note that if $g(\theta) = \theta$ we get
$$V(U) \ge \frac{1}{I(\theta|y)},$$
and if in addition $Y = (Y_1, Y_2, \dots, Y_n)$ is a random sample,
$$V(U) \ge \frac{1}{n I(\theta|y_i)}.$$

Proof: We know that when $\theta$ is the true unknown value, then
$$|\mathrm{Corr}(X, Y)| \le 1 \;\Rightarrow\; \mathrm{Cov}(X, Y)^2 \le V(X) V(Y), \qquad (3)$$
$$s(\theta|y) = \frac{\frac{\partial}{\partial\theta} f(y|\theta)}{f(y|\theta)} \;\Rightarrow\; \frac{\partial}{\partial\theta} f(y|\theta) = s(\theta|y) f(y|\theta), \qquad (4)$$
$$E[s(\theta|Y)] = 0, \qquad (5)$$
$$V[s(\theta|Y)] = I(\theta|y), \qquad (6)$$
and by assumption
$$E(U) = E[h(Y)] = g(\theta). \qquad (7)$$
Then
$$\mathrm{Cov}[h(Y), s(\theta|Y)] = E[h(Y) s(\theta|Y)] - E[h(Y)] E[s(\theta|Y)] \overset{(5)}{=} E[h(Y) s(\theta|Y)] = \int_{\mathbb{R}^n} h(y) s(\theta|y) f(y|\theta)\, dy$$
$$\overset{(4)}{=} \int_{\mathbb{R}^n} h(y) \frac{\partial}{\partial\theta} f(y|\theta)\, dy = \frac{\partial}{\partial\theta} E[h(Y)] \overset{(7)}{=} \frac{\partial}{\partial\theta} g(\theta).$$
Substituting into inequality (3), we get
$$\mathrm{Cov}[h(Y), s(\theta|Y)]^2 = \left[\frac{\partial}{\partial\theta} g(\theta)\right]^2 \le V[h(Y)] V[s(\theta|Y)] \;\overset{(6)}{\Rightarrow}\; \left[\frac{\partial}{\partial\theta} g(\theta)\right]^2 \le V(U)\, I(\theta|y).$$

Example: Let $Y = (Y_1, Y_2, \dots, Y_n)$ be a random sample from $N(\mu, 1)$.
$$\ell(\mu|y) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\sum_{i=1}^n (y_i - \mu)^2.$$
$$\frac{\partial}{\partial\mu}\ell(\mu|y) = \sum_{i=1}^n (y_i - \mu) = \sum_{i=1}^n (y_i - \bar{y} + \bar{y} - \mu) = \sum_{i=1}^n (y_i - \bar{y}) + \sum_{i=1}^n (\bar{y} - \mu) = n(\bar{y} - \mu).$$
$$I(\mu|y) = -E\left(\frac{\partial}{\partial\mu}\, n(\bar{Y} - \mu)\right) = -E(-n) = n.$$
Hence the Cramér-Rao lower bound for $\mu$ is $1/n$.

Consider $\hat{\mu} = \bar{Y}$ as an estimator for $\mu$:
$$E(\bar{Y}) = \mu, \qquad V(\bar{Y}) = \frac{1}{n}.$$
Since $\hat{\mu} = \bar{Y}$ is unbiased and attains the Cramér-Rao lower bound for $\mu$, it is also an MVUE for $\mu$.

Example: Let $Y = (Y_1, Y_2, \dots, Y_n)$ be a random sample from Poisson($\lambda$). It is not hard to check that $I(\lambda|y) = n/\lambda$. Both the mean and the variance of a Poisson distribution are equal to $\lambda$. Hence
$$E(\bar{Y}) = E(Y_i) = \lambda, \qquad E(S^2) = V(Y_i) = \lambda.$$
Consider the estimators $\hat{\lambda}_1 = \bar{Y}$ and $\hat{\lambda}_2 = S^2$. They are both unbiased. Which one should we choose?
$$V(\bar{Y}) = \frac{V(Y_i)}{n} = \frac{\lambda}{n}.$$
Since $\hat{\lambda}_1$ is unbiased and attains the Cramér-Rao lower bound for $\lambda$, it is also an MVUE for $\lambda$.
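A quick simulation sketch of this comparison (not from the notes; $\lambda = 4$, $n = 20$ are arbitrary): both estimators are unbiased, but the variance of $\bar{Y}$ sits at the Cramér-Rao bound $\lambda/n$, while $S^2$ is noticeably more variable.

```python
import numpy as np

rng = np.random.default_rng(10)
lam, n, reps = 4.0, 20, 200_000
y = rng.poisson(lam, size=(reps, n))

lam1 = y.mean(axis=1)           # Y-bar
lam2 = y.var(axis=1, ddof=1)    # S^2

print(lam1.mean(), lam2.mean())            # both close to lambda (unbiased)
print(lam1.var(), lam / n, lam2.var())     # Var(Y-bar) attains lambda/n; Var(S^2) is larger
```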

Theorem (Cramér-Rao attainment): If $U = h(Y)$ is an unbiased estimator of $g(\theta)$, then $U$ attains the Cramér-Rao lower bound if and only if
$$s(\theta|y) = b(\theta)\left[h(y) - g(\theta)\right].$$

Example: Random sample $Y$ from $N(\mu, 1)$:
$$s(\mu|y) = n(\bar{y} - \mu): \quad b(\mu) = n, \quad h(y) = \bar{y}, \quad g(\mu) = \mu.$$

Example: Random sample $Y$ from Poisson($\lambda$):
$$s(\lambda|y) = -n + \frac{\sum_{i=1}^n y_i}{\lambda} = \frac{n}{\lambda}(\bar{y} - \lambda), \qquad b(\lambda) = \frac{n}{\lambda}, \quad h(y) = \bar{y}, \quad g(\lambda) = \lambda.$$

Proof of Cramér-Rao attainment theorem: The Cramér-Rao lower bound comes from the inequality
$$\mathrm{Cov}[h(Y), s(\theta|Y)]^2 \le V[h(Y)]\, V[s(\theta|Y)].$$
The lower bound is attained if and only if equality holds in the above, which is the case if and only if $s(\theta|Y)$ and $h(Y)$ are linearly related:
$$s(\theta|Y) = a(\theta) + b(\theta) h(Y). \qquad (8)$$
Taking expectations on both sides we get
$$E[s(\theta|Y)] = a(\theta) + b(\theta) E[h(Y)] \;\overset{(5),(7)}{\Rightarrow}\; 0 = a(\theta) + b(\theta) g(\theta) \;\Rightarrow\; a(\theta) = -b(\theta) g(\theta).$$
Substituting into (8), we get
$$s(\theta|Y) = -b(\theta) g(\theta) + b(\theta) h(Y) = b(\theta)\left[h(Y) - g(\theta)\right].$$

Theorem (Uniqueness of MVUEs): If $U$ is a best (minimum variance) unbiased estimator of $g(\theta)$, then $U$ is unique.

Proof: We will again use the Cauchy-Schwarz inequality
$$\mathrm{Cov}(X, Y) \le [V(X) V(Y)]^{1/2}, \qquad (9)$$
and the fact that, when equality holds in the above, we can write
$$Y = a(\theta) + b(\theta) X.$$
Let $U'$ be another minimum variance unbiased estimator ($V(U) = V(U')$), and consider the estimator $U^* = \frac12 U + \frac12 U'$.

Note that $U^*$ is also unbiased:
$$E(U^*) = E\left(\frac12 U + \frac12 U'\right) = \frac12 E(U) + \frac12 E(U') = g(\theta),$$
and
$$V(U^*) = V\left(\frac12 U + \frac12 U'\right) = V\left(\frac12 U\right) + V\left(\frac12 U'\right) + 2\,\mathrm{Cov}\left(\frac12 U, \frac12 U'\right)$$
$$= \frac14 V(U) + \frac14 V(U') + \frac12 \mathrm{Cov}(U, U') \;\overset{(9)}{\le}\; \frac14 V(U) + \frac14 V(U') + \frac12 [V(U) V(U')]^{1/2} = V(U). \qquad (V(U) = V(U'))$$
We must have equality in the previous expression because $U$ is an MVUE. This implies
$$\mathrm{Cov}(U, U') = [V(U) V(U')]^{1/2} = V(U), \quad \text{and} \qquad (10)$$
$$U' = a(\theta) + b(\theta) U. \qquad (11)$$
We can write
$$V(U) = V(U') \overset{(10)}{=} \mathrm{Cov}(U, U') \overset{(11)}{=} \mathrm{Cov}[U, a(\theta) + b(\theta) U] = \mathrm{Cov}[U, b(\theta) U] = b(\theta) V(U).$$
Hence $b(\theta) = 1$. Also,
$$E(U') = E(U) \;\overset{(11)}{\Rightarrow}\; E(U) + a(\theta) = E(U) \;\Rightarrow\; a(\theta) = 0.$$
Since $a(\theta) = 0$ and $b(\theta) = 1$, we have $U' = U$, so $U$ is unique.

3.7 Sufficiency and Minimum Variance Unbiased Estimators

We can use the concept of sufficiency when searching for minimum variance unbiased estimators.

Theorem (Rao-Blackwell): Let $U(Y)$ be an unbiased estimator of $g(\theta)$ and $T(Y)$ be a sufficient statistic for $\theta$. Define $W(Y) = E(U(Y)|T(Y))$. Then for all $\theta$:

1. $E(W) = g(\theta)$,

2. $V(W) \le V(U)$,

3. $W$ is a uniformly better (unbiased) estimator than $U$.

Proof: The proof of the Rao-Blackwell theorem is based on the following conditional expectation properties:
$$E(X) = E[E(X|Y)] \qquad (12)$$
$$V(X) = V[E(X|Y)] + E[V(X|Y)] \qquad (13)$$
We can write
$$g(\theta) = E(U) \overset{(12)}{=} E[E(U|T)] = E[W(Y)],$$
$$V(U) \overset{(13)}{=} V[E(U|T)] + E[V(U|T)] = V[W(Y)] + E[V(U|T)] \ge V[W(Y)].$$
It remains to prove that $W(Y)$ is indeed an estimator, i.e. that it does not depend on the parameters. If $U$ is a function only of $Y$, then the distribution of $U|T$ does not depend on the parameters (definition of sufficiency). Hence neither does $W(Y)$.

Note: conditioning an unbiased estimator on a sufficient statistic results in a uniform improvement. In fact, conditioning on anything always gives an improvement in variance, but the result might depend on $\theta$ and hence would not be an estimator. Sufficiency is therefore crucial.

Example: Let $(Y_1, \dots, Y_n)$ be a random sample from a distribution with mean $\mu$ and variance $\sigma^2$, and suppose that $T = \sum_{i=1}^n Y_i$ is sufficient for $\mu$. Consider the estimator $\hat{\mu}_1 = Y_1$ for $\mu$ and find a better one.

For the estimator $\hat{\mu}_1$, we have
$$E(\hat{\mu}_1) = E(Y_1) = \mu, \qquad V(\hat{\mu}_1) = V(Y_1) = \sigma^2.$$
Since $Y_1, Y_2, \dots, Y_n$ are identically distributed,
$$E(Y_i|T) = E(Y_1|T). \qquad (14)$$
The Rao-Blackwell theorem states that the following estimator is better:
$$\hat{\mu}_2 = E(\hat{\mu}_1|T) = E(Y_1|T) = \frac{1}{n}\sum_{i=1}^n E(Y_1|T) \overset{(14)}{=} \frac{1}{n}\sum_{i=1}^n E(Y_i|T) = \frac{1}{n} E\left(\sum_{i=1}^n Y_i \,\Big|\, T\right) = \frac{1}{n}\sum_{i=1}^n Y_i = \bar{Y}.$$
Indeed,
$$E(\hat{\mu}_2) = E(\bar{Y}) = \mu, \qquad V(\hat{\mu}_2) = V(\bar{Y}) = \frac{\sigma^2}{n} \le V(\hat{\mu}_1).$$

Reading

G. Casella & R. L. Berger 6.3.1, 7.1, 7.2.1, 7.2.2, 7.3.1, 7.3.2, 7.3.3

4 Interval Estimation

4.1 Interval Estimators and Confidence Sets

• Point estimates provide a single value as a best guess for the parameter(s) of interest.

• Interval estimates provide an interval which we believe contains the true value of the parameter(s).

• More generally, we may look for a confidence set (not necessarily an interval), for example when we are unsure whether the result of the procedure is an interval, or in the case of more than one parameter.

Definition of interval estimator/estimate: Let $Y = (Y_1, \dots, Y_n)$ be a sample with density $f_Y(y|\theta)$ and let $U_1(Y)$, $U_2(Y)$ be statistics such that $U_1(x) \le U_2(x)$ for any $x$. The random interval $[U_1(Y), U_2(Y)]$ is an interval estimator for $\theta$.

If the observed sample is $y$, then the interval $[U_1(y), U_2(y)]$ is an interval estimate for $\theta$.

Definition of coverage probability: The probability that the random interval contains the true parameter $\theta$ is termed the coverage probability and is denoted by
$$P[U_1(Y) \le \theta \le U_2(Y)].$$

Definition of confidence level: The infimum over $\theta$ of the coverage probabilities is termed the confidence level (coefficient) of the interval:
$$\inf_\theta P[U_1(Y) \le \theta \le U_2(Y)].$$

Notes:

• The random variables in the coverage probability are $U_1(Y)$ and $U_2(Y)$. The coverage probability may be interpreted as the probability that the interval $[U_1(Y), U_2(Y)]$ contains $\theta$.

• If an interval has confidence level $1-\alpha$, the interpretation is: 'If the experiment were repeated many times, $100 \times (1-\alpha)\%$ of the corresponding intervals would contain the true parameter $\theta$.'

• The random variables in $P[U_1(Y) \le \theta \le U_2(Y)]$ are $U_1(Y)$ and $U_2(Y)$. Thus
$$P[U_1(Y) \le \theta \le U_2(Y)] = P[U_1(Y) \le \theta \cap U_2(Y) \ge \theta] = 1 - P[U_1(Y) > \theta \cup U_2(Y) < \theta]$$
$$= 1 - P[U_1(Y) > \theta] - P[U_2(Y) < \theta] + P[U_1(Y) > \theta \cap U_2(Y) < \theta] = 1 - P[U_1(Y) > \theta] - P[U_2(Y) < \theta],$$
since $P[U_1(Y) > \theta \cap U_2(Y) < \theta] = 0$.

Example: Given a random sample $X = (X_1, \dots, X_4)$ from $N(\mu, 1)$, compare the sample mean $\bar{X}$, which is a point estimator, with the interval estimator $[\bar{X} - 1, \bar{X} + 1]$. At first sight, with the interval estimator we just lose precision, but we actually gain in confidence. Indeed, while $P(\bar{X} = \mu) = 0$, we have
$$P(\bar{X} - 1 \le \mu \le \bar{X} + 1) = P(-1 \le \bar{X} - \mu \le 1) = P\left(-2 \le \frac{\bar{X} - \mu}{\sqrt{1/4}} \le 2\right) = 0.9544,$$
because $\bar{X} \sim N(\mu, 1/4)$. Therefore, we lose precision, but we now have over a 95% chance of covering the unknown parameter with this interval estimator.
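A minimal coverage simulation for this interval estimator (a sketch; the true value $\mu = 0$ is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(11)
mu, reps = 0.0, 100_000
xbar = rng.normal(mu, 1.0, size=(reps, 4)).mean(axis=1)   # n = 4, sigma = 1

covered = (xbar - 1 <= mu) & (mu <= xbar + 1)
print(covered.mean())          # close to 0.9544
```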

Definition of expected length: Consider an interval estimator $[U_1, U_2]$. The length $U_2 - U_1$ is a random variable. One possible measure is its expected length
$$E(U_2 - U_1).$$
A good interval estimator should minimise the expected length while maximising the confidence level.

Some notation: Suppose that the random variable $X$ follows a distribution $\mathcal{X}$. We denote by $X_\alpha$ the number for which
$$P(X \le X_\alpha) = \alpha.$$

Naturally,
$$P(X > X_\alpha) = 1 - \alpha.$$
We use such notation for various distributions. In particular, we use the letters $Z$ and $Z_\alpha$ for the standard normal distribution, where we also write
$$P(Z \le Z_\alpha) = \Phi(Z_\alpha) = \alpha.$$

Same length, different confidence levels:

Suppose that we have a random sample from a $N(\mu, 1)$ and we want an interval estimator for $\mu$. Let $k_1$, $k_2$ be positive constants. All the intervals below have the same length:

1. $[-k_1, k_2]$

2. $[Y_1 - k_1, Y_1 + k_2]$

3. $[\bar{Y} - k_1, \bar{Y} + k_2]$

Let's evaluate their confidence levels. Note that $Y_i - \mu \sim N(0, 1)$.

1. This interval does not depend on the sample. If $\mu \in [-k_1, k_2]$, the coverage probability is 1; otherwise it is 0. Thus the confidence level is 0.

2. The coverage probability is
$$P(Y_1 - k_1 \le \mu \le Y_1 + k_2) = 1 - P(Y_1 - k_1 > \mu) - P(Y_1 + k_2 < \mu) = 1 - P(Y_1 - \mu > k_1) - P(Y_1 - \mu < -k_2)$$
$$= \Phi(k_1) - \Phi(-k_2) = \Phi(k_1) + \Phi(k_2) - 1,$$
which is equal to the confidence level.

3. Using the fact that $\sqrt{n}(\bar{Y} - \mu) \sim N(0, 1)$ and similar calculations, we get a confidence level of
$$\Phi(\sqrt{n}\, k_1) + \Phi(\sqrt{n}\, k_2) - 1 \;\ge\; \Phi(k_1) + \Phi(k_2) - 1.$$

Same confidence level, different lengths:

Suppose that we have a random sample from a $N(\mu, 1)$ and we want an interval estimator for $\mu$. We know that $Z = \sqrt{n}(\bar{Y} - \mu) \sim N(0, 1)$. If $\alpha_1$, $\alpha_2$ are positive numbers such that $\alpha = \alpha_1 + \alpha_2$, we can write
$$P(Z_{\alpha_1} \le \sqrt{n}(\bar{Y} - \mu) \le Z_{1-\alpha_2}) = 1 - P(Z < Z_{\alpha_1}) - P(Z > Z_{1-\alpha_2}) = 1 - \alpha_1 - [1 - (1 - \alpha_2)] = 1 - (\alpha_1 + \alpha_2) = 1 - \alpha.$$
By rearrangement we get the interval estimator for $\mu$
$$\left[\bar{Y} - \frac{1}{\sqrt{n}} Z_{1-\alpha_2},\; \bar{Y} + \frac{1}{\sqrt{n}} Z_{1-\alpha_1}\right].$$
The (expected) length of the interval above is
$$E\left(\bar{Y} + \frac{1}{\sqrt{n}} Z_{1-\alpha_1} - \bar{Y} + \frac{1}{\sqrt{n}} Z_{1-\alpha_2}\right) = \frac{1}{\sqrt{n}}\left(Z_{1-\alpha_1} + Z_{1-\alpha_2}\right).$$
Using statistical tables we can construct the following table, where we fix $\alpha_1 + \alpha_2 = 0.05$. It gives the length of 95% confidence intervals for the mean of a normal distribution with unit variance, for various lower and upper endpoints.

α₁       α₂       Z_{1-α₁}   Z_{1-α₂}   √n × length
0        0.05     +∞         1.645      ∞
0.01     0.04     2.326      1.751      4.077
0.02     0.03     2.054      1.881      3.935
0.025    0.025    1.96       1.96       3.920
0.03     0.02     1.881      2.054      3.935
0.04     0.01     1.751      2.326      4.077
0.05     0        1.645      +∞         ∞

Why 95%?
Let us consider symmetric intervals with confidence levels 0.8, 0.9, 0.95, and 0.99. Using the previous procedure and statistical tables we can construct the following table, giving the length of intervals for the mean of a normal distribution with unit variance for various confidence levels.

α₁       α₂       Z_{1-α₁}   Z_{1-α₂}   √n × length
0.1      0.1      1.2816     1.2816     2.563
0.05     0.05     1.645      1.645      3.290
0.025    0.025    1.96       1.96       3.920
0.005    0.005    2.576      2.576      5.152

The level 95% is chosen as a compromise between length and confidence.

4.2 Finding Interval Estimators from Pivotal Functions

A way to construct a $1-\alpha$ confidence set for $\theta$ is to use a pivotal function.

Definition of a pivotal function: Consider a sample $Y$ with density $f_Y(y|\theta)$ and suppose that we are interested in constructing an interval estimator for $\theta$. A function $G = G(Y, \theta)$ of $Y$ and $\theta$ is a pivotal function for $\theta$ if its distribution is known and does not depend on $\theta$.

Example: Let $Y_1, Y_2, \dots, Y_n$ be a random sample from a $N(\mu, \sigma^2)$ with $\mu$ unknown and $\sigma^2$ known. We know that
$$\bar{Y} \sim N\left(\mu, \frac{\sigma^2}{n}\right),$$
and we can use the above to get the following pivotal function:
$$Z = \frac{\bar{Y} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1).$$
Notice that $Z$ depends on $\mu$, but its distribution does not change regardless of the value of $\mu$.

Example: Let $Y_1, Y_2, \dots, Y_n$ be a random sample from a $N(\mu, \sigma^2)$ with $\mu$ known and $\sigma^2$ unknown. We know that
$$Z_i = \frac{Y_i - \mu}{\sigma} \sim N(0, 1),$$
and that the $Z_i$'s are independent. Taking $\sum_i Z_i^2$ gives us the following pivotal function:
$$\sum_{i=1}^n Z_i^2 = \frac{\sum_{i=1}^n (Y_i - \mu)^2}{\sigma^2} \sim \chi^2_n.$$

Example: Let $Y_1, Y_2, \dots, Y_n$ be a random sample from a $N(\mu, \sigma^2)$ with both $\mu$ and $\sigma^2$ unknown. Now we cannot use
$$\frac{\bar{Y} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$$
as a pivotal function for $\mu$, because it depends on the unknown parameter $\sigma$. Instead we use
$$\frac{\bar{Y} - \mu}{S/\sqrt{n}} \sim t_{n-1}.$$
In the same way we cannot use $\sum_i Z_i^2$ for $\sigma^2$, since it depends on $\mu$ which is unknown; instead we can use
$$\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}.$$
Constructing an interval from a pivotal function:

Suppose that we have a sample $Y$. To construct an interval estimator with confidence level $1-\alpha$ for the parameter $\theta$ using a pivotal function, one can use the following procedure:

Step 1: Find a pivotal function $G = G(Y, \theta)$ based on a reasonable point estimator for $\theta$.

Step 2: Use the distribution of the pivotal function to find values $g_1$ and $g_2$ such that
$$P(g_1 \le G(Y, \theta) \le g_2) = 1 - \alpha.$$

Step 3: Manipulate the inequalities $G \ge g_1$ and $G \le g_2$ to make $\theta$ the reference point. This yields inequalities of the form
$$\theta \ge U_1(Y, g_1, g_2) \quad \text{and} \quad \theta \le U_2(Y, g_1, g_2),$$
for some functions $U_1(\cdot)$ and $U_2(\cdot)$ that do not depend on the parameters.

Step 4: Give the following interval:
$$[U_1(Y, g_1, g_2),\; U_2(Y, g_1, g_2)].$$

Note: The endpoints $U_1$, $U_2$ are usually each a function of one of $g_1$ or $g_2$ but not the other.

Example: Interval for $\mu$, $N(\mu, \sigma^2)$ with known $\sigma^2$.

Suppose that we have a random sample $Y$ from a $N(\mu, \sigma^2)$ (with $\sigma^2$ known) and we want an interval estimator for $\mu$ with confidence level $1-\alpha$.

Step 1: We know that
$$Z = Z(Y, \mu) = \frac{\bar{Y} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1).$$
Thus $Z$ is a pivotal function.

Step 2: We can write
$$P(Z_{\alpha/2} \le Z \le Z_{1-\alpha/2}) = 1 - P(Z < Z_{\alpha/2}) - P(Z > Z_{1-\alpha/2}) = 1 - \alpha/2 - [1 - (1 - \alpha/2)] = 1 - \alpha.$$

Step 3: Rearranging the inequalities we get
$$\frac{\bar{Y} - \mu}{\sigma/\sqrt{n}} \ge Z_{\alpha/2} \quad \text{and} \quad \frac{\bar{Y} - \mu}{\sigma/\sqrt{n}} \le Z_{1-\alpha/2},$$
which we can rewrite as
$$\mu \le \bar{Y} - \frac{\sigma}{\sqrt{n}} Z_{\alpha/2} \quad \text{and} \quad \mu \ge \bar{Y} - \frac{\sigma}{\sqrt{n}} Z_{1-\alpha/2}.$$

Step 4: Note that $Z_{\alpha/2} = -Z_{1-\alpha/2}$. We get
$$\left[\bar{Y} - \frac{\sigma}{\sqrt{n}} Z_{1-\alpha/2},\; \bar{Y} + \frac{\sigma}{\sqrt{n}} Z_{1-\alpha/2}\right].$$

Numerical Example: Suppose that $n = 10$, $\bar{Y} = 5.2$, $\sigma^2 = 2.4$ and $\alpha = 0.05$. From suitable tables or statistical software we get $Z_{.975} = 1.96$, so an interval estimate for $\mu$ with confidence level $1-\alpha$ is
$$\left[5.2 - 1.96\sqrt{2.4/10},\; 5.2 + 1.96\sqrt{2.4/10}\right]$$
or else $[4.24, 6.16]$.
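The same numerical example with scipy (a minimal sketch; norm.ppf plays the role of the statistical tables):

```python
import numpy as np
from scipy.stats import norm

n, ybar, sigma2, alpha = 10, 5.2, 2.4, 0.05
z = norm.ppf(1 - alpha / 2)                 # 1.959964...
half = z * np.sqrt(sigma2 / n)
print(ybar - half, ybar + half)             # approximately [4.24, 6.16]
```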

Example: Interval for $\mu$, $N(\mu, \sigma^2)$, with unknown $\sigma^2$.

Suppose that we have a random sample $Y$ from a $N(\mu, \sigma^2)$ (with $\sigma^2$ also unknown) and we want an interval estimator for $\mu$ with confidence level $1-\alpha$.

Step 1: We know that
$$T = T(Y, \mu) = \frac{\bar{Y} - \mu}{S/\sqrt{n}} \sim t_{n-1}.$$
Thus $T$ is a pivotal function.

Step 2: We can write
$$P(t_{n-1,\alpha/2} \le T \le t_{n-1,1-\alpha/2}) = 1 - P(T < t_{n-1,\alpha/2}) - P(T > t_{n-1,1-\alpha/2}) = 1 - \alpha/2 - [1 - (1 - \alpha/2)] = 1 - \alpha.$$

Step 3: Rearranging the inequalities we get
$$\frac{\bar{Y} - \mu}{S/\sqrt{n}} \ge t_{n-1,\alpha/2} \quad \text{and} \quad \frac{\bar{Y} - \mu}{S/\sqrt{n}} \le t_{n-1,1-\alpha/2},$$
which we can rewrite as
$$\mu \le \bar{Y} - \frac{S}{\sqrt{n}} t_{n-1,\alpha/2} \quad \text{and} \quad \mu \ge \bar{Y} - \frac{S}{\sqrt{n}} t_{n-1,1-\alpha/2}.$$

Step 4: Note that $t_{n-1,\alpha/2} = -t_{n-1,1-\alpha/2}$. We get
$$\left[\bar{Y} - \frac{S}{\sqrt{n}} t_{n-1,1-\alpha/2},\; \bar{Y} + \frac{S}{\sqrt{n}} t_{n-1,1-\alpha/2}\right].$$

Numerical Example: Suppose that $n = 10$, $\bar{Y} = 5.2$, $S^2 = 2.4$ and $\alpha = 0.05$. From suitable tables or statistical software we get $t_{9,.975} = 2.262$, so an interval estimate for $\mu$ with confidence level $1-\alpha$ is
$$\left[5.2 - 2.262\sqrt{2.4/10},\; 5.2 + 2.262\sqrt{2.4/10}\right]$$
or else $[4.09, 6.31]$.

Note: Compared with the known-$\sigma^2$ case, the interval is now wider despite the fact that $S = \sigma$. The $t$ distribution has fatter tails than the standard Normal. On the other hand, as $n$ grows, the $t$ distribution gets closer to the Normal.

Reading

G. Casella & R. L. Berger 9.1, 9.2.1, 9.2.2, 9.3.1

5 Asymptotic Evaluations

So far we have considered evaluation criteria based on samples of finite size $n$. But, as mentioned above, there may be cases where a satisfactory solution does not exist. An alternative route is to approach these problems by letting $n \to \infty$, in other words to study the asymptotic behaviour of the problem. We will look mainly into asymptotic properties of maximum likelihood procedures.

5.1 Summary of the Point/Interval Estimation Issues

• In point estimation we use the information from the sample $Y$ to provide a best guess for the parameters $\theta$.

• For this we use statistics termed estimators,
$$\hat{\theta} = h(Y),$$
that are functions of the sample $Y$. The realization of the sample provides a point estimate which reflects our belief about the parameter $\theta$.

• There are many ways to find estimator functions. For example, one can use the method of moments or maximum likelihood estimators.

• We look for estimators with small mean squared error, defined as
$$E[(\hat{\theta} - \theta)^2].$$

• But it is very hard to compare estimators based solely on MSE. Even irrational estimators like $\hat{\theta} = 1$ are not worse than reasonable ones for all $\theta$. For this reason we restrict attention to unbiased estimators:
$$E(\hat{\theta}) = \theta.$$

• An optimal solution to the problem is given by a minimum variance unbiased estimator. Note that the variance of an unbiased estimator is equal to its MSE. If such an estimator exists it is unique.

• The Cramér-Rao theorem provides a lower bound for the variance of an unbiased estimator. Therefore, if the variance of an unbiased estimator attains that bound, it provides an optimal solution to the problem.

• Alternatively, if an unbiased estimator is based on a complete sufficient statistic it is also of minimum variance (see the Rao-Blackwell theorem).

• Problem: An unbiased estimator may not be available or may not even exist.

• In interval estimation we want to use the information from the sample $Y$ to provide an interval which we believe contains the true value of the parameter(s).

• The probability that the random interval contains the true parameter $\theta$ is termed the coverage probability.

• The infimum of all the coverage probabilities is termed the confidence coefficient (level) of the interval.

• A way to construct an interval is by using a pivotal function, that is, a function of $Y$ and $\theta$ with distribution independent of $\theta$.

• Alternatively, one may invert an $\alpha$-level test of $H_0 : \theta = \theta_0$. The parameter points $\theta_0$ whose acceptance region $A$ contains the observed sample provide a $1-\alpha$ confidence level interval. Conversely, we can find the acceptance region of an $\alpha$-level test of $H_0 : \theta = \theta_0$ by taking the sample points $Y$ for which the resulting $1-\alpha$ confidence level interval contains $\theta_0$ (see the definitions in Chapter 6).

• There may exist more than one interval with the same level. One way to choose between them is through their expected length.

• Problem: Sometimes it may even be hard to find any 'reasonable' interval estimator.

5.2 Asymptotic Evaluations

Definition: A sequence of estimators $U_n = U(X_1, \dots, X_n)$ is a consistent sequence of estimators for a parameter $\theta$ if, for every $\epsilon > 0$ and every $\theta \in \Theta$,
$$\lim_{n\to\infty} P_\theta(|U_n - \theta| < \epsilon) = 1.$$
In other words, a consistent estimator converges in probability to the parameter $\theta$ it is estimating. Notice that the property must hold for any $\theta$: if we change $\theta$, the probability $P_\theta$ changes, but the limit still holds.

Theorem: If $U_n$ is a sequence of estimators for a parameter $\theta$ satisfying

1. $\lim_{n\to\infty} V(U_n) = 0$,

2. $\lim_{n\to\infty} \mathrm{Bias}(U_n) = 0$,

for every $\theta \in \Theta$, then $U_n$ is a consistent sequence of estimators.

Proof: We use Chebyshev's inequality; as $n \to \infty$,
$$P_\theta(|U_n - \theta| > \epsilon) \le \frac{E[(U_n - \theta)^2]}{\epsilon^2} = \frac{\mathrm{Bias}(U_n)^2 + V(U_n)}{\epsilon^2} \to 0.$$
An example is provided by the sample mean, which has zero bias and variance $V(X_i)/n$.

Definition: An estimator is asymptotically unbiased for $\theta$ if its bias goes to 0 as $n \to \infty$ for any $\theta \in \Theta$.

Definition: The ratio of the Cramér-Rao lower bound to the variance of an estimator is termed the efficiency. An efficient estimator has efficiency 1. We can compare estimators in terms of their asymptotic efficiency, that is, their efficiencies as $n \to \infty$. An estimator is asymptotically efficient if its asymptotic efficiency is 1.

Theorem (Asymptotic normality of MLEs): Under weak regularity conditions the maximum likelihood estimator $g(\hat{\theta})$ satisfies
$$\sqrt{n}\left[g(\hat{\theta}) - g(\theta)\right] \overset{d}{\to} N\left(0, \frac{g'(\theta)^2}{I(\theta|y_i)}\right), \quad n \to \infty,$$
where $g$ is a continuous function. We may also write
$$g(\hat{\theta}) \overset{\text{approx}}{\sim} N(g(\theta), v(\theta)),$$
where $v(\theta) = \frac{g'(\theta)^2}{n I(\theta|y_i)}$, which is the Cramér-Rao lower bound, since $I(\theta|y) = n I(\theta|y_i)$. Therefore, $\mathrm{Var}(g(\hat{\theta})) \approx v(\theta)$.

The estimator $\hat{\theta}$ is computed using a sample of size $n$, hence it depends on $n$, and the theorem tells us its behaviour when we consider larger and larger samples.

Corollary: Under weak regularity conditions the MLE $\hat{\theta}$, or a function of it, is consistent, asymptotically unbiased and asymptotically efficient for the parameter it is estimating.
Slutsky's theorem: If, as $n \to \infty$, $X_n \overset{d}{\to} X$ and $Y_n \overset{p}{\to} a$ with $a$ constant, then

a. $Y_n X_n \overset{d}{\to} a X$;

b. $X_n + Y_n \overset{d}{\to} X + a$.

Actually, asymptotic normality always implies consistency. Indeed, we have
$$\sqrt{n I(\theta|y_i)}\,(\hat{\theta} - \theta) \overset{d}{\to} Z \sim N(0, 1), \quad n \to \infty,$$
then
$$(\hat{\theta} - \theta) = \frac{1}{\sqrt{n I(\theta|y_i)}}\left(\sqrt{n I(\theta|y_i)}\,(\hat{\theta} - \theta)\right) \overset{d}{\to} \left(\lim_{n\to\infty} \frac{1}{\sqrt{n I(\theta|y_i)}}\right) Z = 0, \quad n \to \infty,$$
and convergence in distribution to a point is equivalent to convergence in probability. So $\hat{\theta}$ is a consistent estimator of $\theta$.

Asymptotic distribution of MLEs - Sketch of proof: Assume $g(\theta) = \theta$ and let $s'(\theta|Y)$ denote $\frac{\partial}{\partial\theta} s(\theta|Y)$. Let $\hat{\theta}$ be the MLE of the true value, which we denote by $\theta_0$.

Consider a Taylor series expansion around the true value $\theta_0$:
$$s(\theta|Y) = s(\theta_0|Y) + s'(\theta_0|Y)(\theta - \theta_0) + \dots$$
Ignore the higher-order terms and substitute $\hat{\theta}$ for $\theta$:
$$s(\hat{\theta}|Y) = s(\theta_0|Y) + s'(\theta_0|Y)(\hat{\theta} - \theta_0) \;\Rightarrow\; \hat{\theta} - \theta_0 = -\frac{s(\theta_0|Y)}{s'(\theta_0|Y)} \quad \text{since } s(\hat{\theta}|Y) = 0$$
$$\Rightarrow\; \sqrt{n}(\hat{\theta} - \theta_0) = \frac{\frac{\sqrt{n}}{n} s(\theta_0|Y)}{-\frac{1}{n} s'(\theta_0|Y)}. \qquad (15)$$
Then, recall that
$$s(\theta|Y) = \sum_{i=1}^n s(\theta|Y_i). \qquad (16)$$
Using (16), the numerator of (15) becomes
$$\frac{\sqrt{n}}{n} s(\theta_0|Y) = \sqrt{n}\left(\frac{\sum_{i=1}^n s(\theta_0|Y_i)}{n} - 0\right) \overset{d}{\to} N\left(0, I(\theta_0|y_i)\right), \quad n \to \infty,$$
from the Central Limit Theorem for i.i.d. random variables and since $E[s(\theta_0|Y_i)] = 0$, $\mathrm{Var}[s(\theta_0|Y_i)] = I(\theta_0|y_i)$.

For the denominator of (15), using the Weak Law of Large Numbers for i.i.d. random variables, we get
$$-\frac{1}{n} s'(\theta_0|Y) = -\frac{1}{n}\sum_{i=1}^n s'(\theta_0|Y_i) \overset{p}{\to} -E\left[s'(\theta_0|Y_i)\right] = I(\theta_0|y_i), \quad n \to \infty.$$
Combining these two results and using Slutsky's theorem we get that, as $n \to \infty$,
$$\sqrt{n}(\hat{\theta} - \theta_0) = \frac{\frac{\sqrt{n}}{n} s(\theta_0|Y)}{-\frac{1}{n} s'(\theta_0|Y)} \overset{d}{\to} N\left(0, \frac{1}{I(\theta_0|y_i)}\right) \;\Rightarrow\; \hat{\theta} - \theta_0 \overset{d}{\to} N\left(0, \frac{1}{I(\theta_0|y)}\right),$$
since $I(\theta_0|y) = n I(\theta_0|y_i)$.

Asymptotic pivotal function from MLEs: Note that
$$\sqrt{I(\theta|y)}\,(\hat{\theta} - \theta) \overset{\text{approx}}{\sim} N(0, 1).$$
Also, since $\hat{\theta}$ is consistent for $\theta$, the quantity $I(\hat{\theta}|y)$ converges in probability to $I(\theta|y)$. Hence, in a second level of approximation, we can write for large sample sizes $n$
$$\sqrt{I(\hat{\theta}|y)}\,(\hat{\theta} - \theta) \overset{\text{approx}}{\sim} N(0, 1).$$
We say that this function is asymptotically pivotal for $\theta$.

Example: Asymptotic estimation of a Bernoulli. Let $(Y_1, \dots, Y_n)$ be a random sample from a Bernoulli($p$). We know that:

• $T = \sum_{i=1}^n Y_i \sim \mathrm{Binomial}(n, p)$.

• The MLE for $p$ is $\hat{p} = \bar{Y}$.

• The Fisher information is $I(p) = \frac{n}{p(1-p)}$.

The MLE for $p$, $\hat{p} = \bar{Y}$, is consistent, (asymptotically) unbiased and efficient. The asymptotic distribution of $\hat{p} = \bar{Y}$ is
$$\hat{p} \overset{\text{approx}}{\sim} N\left(p, \frac{p(1-p)}{n}\right).$$
In an extra level of approximation we may use
$$\hat{p} \overset{\text{approx}}{\sim} N\left(p, \frac{\hat{p}(1-\hat{p})}{n}\right).$$
Let $S_p = \frac{\hat{p}(1-\hat{p})}{n}$. Then
$$\hat{p} \overset{\text{approx}}{\sim} N(p, S_p) \;\Rightarrow\; \frac{\hat{p} - p}{\sqrt{S_p}} \overset{\text{approx}}{\sim} N(0, 1).$$
We can use the above to construct the following asymptotic $1-\alpha$ confidence interval:
$$\left[\hat{p} - Z_{1-\alpha/2}\sqrt{S_p},\; \hat{p} + Z_{1-\alpha/2}\sqrt{S_p}\right].$$
Note: The above interval may take values outside $[0, 1]$.
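A sketch of this asymptotic interval and its actual coverage by simulation (not from the notes; $p = 0.3$, $n = 100$, $\alpha = 0.05$ are arbitrary): being based on an approximation, the coverage is close to, but not exactly, 95%.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(12)
p, n, alpha, reps = 0.3, 100, 0.05, 100_000
z = norm.ppf(1 - alpha / 2)

p_hat = rng.binomial(n, p, size=reps) / n
half = z * np.sqrt(p_hat * (1 - p_hat) / n)
covered = (p_hat - half <= p) & (p <= p_hat + half)
print(covered.mean())       # near 0.95, but not exact for finite n
```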

Example: Asymptotic estimators and intervals for functions of parameters. Let $Y = (Y_1, \dots, Y_n)$ be a random sample from a $N(0, \sigma^2)$. We want a point and an interval estimator for $\sigma$. The MLE for $\sigma^2$ is $\hat{\sigma}^2 = \frac{n-1}{n} S^2$, where $S^2$ is the sample variance. Hence, a consistent, asymptotically unbiased and efficient estimator is the MLE for $\sigma$, that is,
$$\hat{\sigma} = \sqrt{\hat{\sigma}^2} = \sqrt{\frac{n-1}{n}}\, S.$$
Note that both $\hat{\sigma}^2$ and $\hat{\sigma}$ are biased in small samples, but their bias goes to 0 as $n \to \infty$.

We are interested in $\sigma = g(\sigma^2) = (\sigma^2)^{1/2}$. The Cramér-Rao lower bound is equal to
$$v(\sigma^2) = \frac{\left[\frac{\partial}{\partial\sigma^2} g(\sigma^2)\right]^2}{n I(\sigma^2|y_i)} = \frac{\frac{1}{4\sigma^2}}{\frac{n}{2\sigma^4}} = \frac{\sigma^2}{2n}.$$
If we further substitute $v(\hat{\sigma}) = \hat{\sigma}^2/2n$ for $v(\sigma)$, we get
$$\hat{\sigma} \overset{\text{approx}}{\sim} N(\sigma, v(\hat{\sigma})) \;\Rightarrow\; \frac{\hat{\sigma} - \sigma}{\sqrt{v(\hat{\sigma})}} \overset{\text{approx}}{\sim} N(0, 1),$$
which leads to the following asymptotic $1-\alpha$ confidence level interval for $\sigma$:
$$\left[\hat{\sigma} - Z_{1-\alpha/2}\sqrt{v(\hat{\sigma})},\; \hat{\sigma} + Z_{1-\alpha/2}\sqrt{v(\hat{\sigma})}\right].$$

Reading

G. Casella & R. L. Berger 10.1.1, 10.1.2, 10.3.1, 10.3.2, 10.4.1

6 Hypothesis Testing

Problem:

• Suppose that a real-world phenomenon/population may be described by a probability model defined through the random variable $Y$ with $F_Y(y|\theta)$.

• Suppose also that a sample $Y = (Y_1, Y_2, \dots, Y_n)$ is drawn from that distribution/population.

• We want to use the information in the random sample $Y$ to assess statements about the population parameters $\theta$.

6.1 Statistical tests

6.1.1 Definitions

• A hypothesis is a statement about a population parameter.

• The two complementary hypotheses in a hypothesis testing problem are often called the null
and alternative. They are denoted by H0 and H1 respectively.

• A simple hypothesis takes the form

H0 : ✓ = c,

where c is a constant. A hypothesis that is not simple is called composite, e.g.

H0 : ✓  c,

• A hypothesis test is a rule that specifies

1. For which sample values the decision is made to accept H0 as true.


2. For which sample values H0 is rejected and H1 is accepted as true.

• The subset of the sample space for which H0 will be rejected is termed as rejection re-
gion or critical region. The complement of the rejection region is termed as a acceptance
region.

• The test rule is based on a statistic, termed as the test statistic.

Test Statistics

• The crucial part is to identify an appropriate test statistic T .

• One of the desired features of T is to have an interpretation such that large (or small) values
of it provide evidence against H0 .

• We also want to know the distribution of T under H0 , that is when H0 holds.

• We focus on test statistics that have tabulated distributions under $H_0$ (Normal, $t$, $\chi^2$, $F$). But this is only for convenience and it is not a strict requirement.

Example: Let $Y = (Y_1, Y_2, \ldots, Y_n)$ be a random sample of size $n$ from a $N(\mu, \sigma^2)$ population (with $\sigma^2$ known). For a given value $\mu_0$, is $\mu$ equal to $\mu_0$ or larger?
The hypotheses of the test are
$$H_0 : \mu = \mu_0 \quad \text{versus} \quad H_1 : \mu > \mu_0.$$
One test may be based on the test statistic $\bar{Y}$ with rejection region
$$R = \left(\mu_0 + 1.96\,\frac{\sigma}{\sqrt{n}},\; \infty\right).$$
The rule is then to reject $H_0$ if $\bar{Y} > \mu_0 + 1.96\,\frac{\sigma}{\sqrt{n}}$.

6.1.2 Types of errors in tests, power function and p-value

Consider a test with $H_0 : \theta \in \Theta_0$ versus $H_1 : \theta \in \Theta_1$.

Type I error: If $\theta \in \Theta_0$ ($H_0$ is true) but the test rejects $H_0$.

Type II error: If $\theta \in \Theta_1$ ($H_1$ is true) but the test does not reject $H_0$.

                 Accept H0            Reject H0
H0 is true       Correct Decision     Type I error
H1 is true       Type II error        Correct Decision

The Type I error is associated with the significance level and the size of the test.

Definition: The test has significance level $\alpha$ if
$$\sup_{\theta \in \Theta_0} P_\theta(\text{Reject } H_0) \le \alpha.$$
The test has size $\alpha$ if
$$\sup_{\theta \in \Theta_0} P_\theta(\text{Reject } H_0) = \alpha.$$

Note: If the null hypothesis is simple then the size of the test is the probability of a type I error.

Power function: Let $H_0 : \theta \in \Theta_0$, $H_1 : \theta \in \Theta_0^c$. The power function is defined as
$$\beta(\theta) = P_\theta(\text{Reject } H_0),$$
that is, the probability that the null hypothesis is rejected if the true parameter value is $\theta$.

Note:

$$\beta(\theta) = P_\theta(\text{Reject } H_0) = \begin{cases} \text{probability of Type I error}, & \text{if } \theta \in \Theta_0, \\ 1 - \text{probability of Type II error}, & \text{if } \theta \in \Theta_0^c. \end{cases}$$

Also we can define the level and the size of the test through the power function:

Level: $\sup_{\theta \in \Theta_0} \beta(\theta) \le \alpha$, and Size: $\sup_{\theta \in \Theta_0} \beta(\theta) = \alpha$.

Ideally we would like the power function $\beta(\theta)$ to be 0 when $\theta \in \Theta_0$ and 1 when $\theta \in \Theta_1$, but this is not possible. In practice we fix the size $\alpha$ to a small value (usually 0.05) and, for a given size, we try to maximize the power. Hence

• Failure to reject the null hypothesis does not imply that it holds and we say that we do not
reject H0 rather than saying we accept H0 .

• We usually set the alternative hypothesis to contain the statement that we are interested in
proving.

Example of a power function (previous example continued): The power function of the test is
$$\begin{aligned}
\beta(\mu) = P(Y \in R) &= P\!\left(\bar{Y} > \mu_0 + Z_{1-\alpha}\,\sigma/\sqrt{n}\right) \\
&= P\!\left(\frac{\bar{Y} - \mu_0}{\sigma/\sqrt{n}} > Z_{1-\alpha}\right) \\
&= P\!\left(\frac{\bar{Y} - \mu}{\sigma/\sqrt{n}} > Z_{1-\alpha} - \frac{\mu - \mu_0}{\sigma/\sqrt{n}}\right) \\
&= 1 - \Phi\!\left(Z_{1-\alpha} - \frac{\mu - \mu_0}{\sigma/\sqrt{n}}\right).
\end{aligned}$$
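
For intuition, a short Python sketch (my own helper; SciPy assumed) that evaluates this power function at a few values of $\mu$:

import numpy as np
from scipy.stats import norm

def power_one_sided_z(mu, mu0, sigma, n, alpha=0.05):
    # beta(mu) = 1 - Phi( Z_{1-alpha} - (mu - mu0) / (sigma / sqrt(n)) )
    z = norm.ppf(1 - alpha)
    return 1 - norm.cdf(z - (mu - mu0) / (sigma / np.sqrt(n)))

for mu in [0.0, 0.2, 0.5, 1.0]:           # hypothetical values, with mu0 = 0, sigma = 1, n = 25
    print(mu, power_one_sided_z(mu, mu0=0.0, sigma=1.0, n=25))

At $\mu = \mu_0$ the power equals the size $\alpha$, and it increases towards 1 as $\mu$ moves above $\mu_0$.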

p-value: From the definitions so far, we either reject or do not reject the null hypothesis. The following quantity is also informative regarding the weight of evidence against $H_0$.

Definition: Let $T(Y)$ be a test statistic such that large values of $T$ give evidence against $H_0$. For an observed sample point $y$ the corresponding p-value is
$$p(y) = \sup_{\theta \in \Theta_0} P_\theta\left(T(Y) \ge T(y)\right).$$

Notes:

1. Clearly $0 \le p(y) \le 1$. The closer it is to 0, the stronger the evidence against $H_0$ and the more likely we are to reject.

2. In words, a p-value is the probability, computed under $H_0$, of obtaining the observed sample result or a more extreme one, where "extreme" is in the sense of evidence against $H_0$.

3. If we have a fixed significance level $\alpha$, then we can describe the rejection region as
$$R = \{y : p(y) \le \alpha\}.$$
We reject $H_0$ if the probability of observing a result at least as extreme as that of the sample is small (less than $\alpha$).

4. A similar definition can be made if small values of T give evidence against H0 .

Example of a p-value (previous example continued): Let $Y = (Y_1, Y_2, \ldots, Y_n)$ be a random sample of size $n$ from a $N(\mu, \sigma^2)$ population (with $\sigma^2$ known). We want to test $H_0 : \mu \le \mu_0$ versus $H_1 : \mu > \mu_0$.
Note that $Y$ denotes the sample whereas $y$ denotes the observed sample. Also, we use $\bar{Y}$ as $T(Y)$ and large values of $\bar{Y}$ provide evidence against $H_0$:
$$\begin{aligned}
p(y) = \sup_{\mu \le \mu_0} P(\bar{Y} \ge \bar{y}) &= \sup_{\mu \le \mu_0} P\!\left(\frac{\bar{Y} - \mu}{\sigma/\sqrt{n}} \ge \frac{\bar{y} - \mu}{\sigma/\sqrt{n}}\right) \\
&= \sup_{\mu \le \mu_0} \left\{ 1 - P\!\left(\frac{\bar{Y} - \mu}{\sigma/\sqrt{n}} \le \frac{\bar{y} - \mu}{\sigma/\sqrt{n}}\right) \right\} \\
&= \sup_{\mu \le \mu_0} \left\{ 1 - \Phi\!\left(\frac{\bar{y} - \mu}{\sigma/\sqrt{n}}\right) \right\} = 1 - \Phi\!\left(\frac{\bar{y} - \mu_0}{\sigma/\sqrt{n}}\right).
\end{aligned}$$

6.1.3 Constructing Statistical Tests

The procedure for constructing a test can be given by the following general directions:

Step 1: Find an appropriate test statistic T . Figure out whether large or small values of T provide
evidence against H0 . Also find its distribution under H0 .

Step 2: Use the definition of the level/size $\alpha$ and write ($R$ is unknown)
$$P_{\theta_0}(T \in R) \le \alpha, \quad \text{or} \quad P_{\theta_0}(T \in R) = \alpha.$$

Step 3: Solve the equation to get $R$. The test rule is then 'Reject $H_0$' if the sample $Y$ is in $\{Y : T(Y) \in R\}$.

Example of a Statistical test (cont’d) Let’s come back to the previous example with H0 : µ = µ0
vs H1 : µ > µ0 .

Step 1: Test statistic $\bar{Y}$. Large values are evidence against $H_0$. Under $H_0$, $\bar{Y} \sim N(\mu_0, \sigma^2/n)$, or we could use
$$\frac{\bar{Y} - \mu_0}{\sigma/\sqrt{n}} \sim N(0, 1).$$

Step 2: We know that under $H_0$
$$P_{\mu_0}\!\left(\frac{\bar{Y} - \mu}{\sigma/\sqrt{n}} > Z_{1-\alpha}\right) = P\!\left(\frac{\bar{Y} - \mu_0}{\sigma/\sqrt{n}} > Z_{1-\alpha}\right) = \alpha.$$

Note: It can be shown that the above also holds for $H_0 : \mu \le \mu_0$.

Step 3: From
$$P\!\left(\frac{\bar{Y} - \mu_0}{\sigma/\sqrt{n}} > Z_{1-\alpha}\right) = \alpha,$$
we can get to
$$P\!\left(\bar{Y} > \mu_0 + Z_{1-\alpha}\,\sigma/\sqrt{n}\right) = \alpha,$$
hence
$$R = \{Y : \bar{Y} > \mu_0 + Z_{1-\alpha}\,\sigma/\sqrt{n}\}.$$

6.2 Most Powerful Tests

The tests we are interested in control, by construction, the probability of a type I error (it is at most $\alpha$). A good test should also have a small probability of type II error. In other words, it should also be a powerful test.

Definition: Let $\mathcal{C}$ be a class of tests for testing $H_0 : \theta \in \Theta_0$ versus $H_1 : \theta \in \Theta_0^c$. A test in class $\mathcal{C}$, with power function $\beta(\theta)$, is a Uniformly Most Powerful (UMP) class $\mathcal{C}$ test if
$$\beta(\theta) \ge \beta'(\theta)$$
for every $\theta \in \Theta_0^c$ and every $\beta'(\theta)$ that is a power function of a test in class $\mathcal{C}$.

Theorem (Neyman-Pearson Lemma): Consider a test $H_0 : \theta = \theta_0$ versus $H_1 : \theta = \theta_1$ and let $f_Y(y|\theta_0)$, $f_Y(y|\theta_1)$ denote the pdf (pmf) of the sample $Y$. Suppose that a test with rejection region $R$ satisfies
$$y \in R, \quad \text{if } \frac{f_Y(y|\theta_1)}{f_Y(y|\theta_0)} > k,$$
for some $k > 0$ and
$$\alpha = P_{\theta_0}(Y \in R).$$
A test that satisfies the above is a uniformly most powerful test of size $\alpha$.

Notes:

1. The value $k$ may be chosen to satisfy
$$P_{\theta_0,k}(Y \in R) = P_{\theta_0}\!\left(\frac{f_Y(y|\theta_1)}{f_Y(y|\theta_0)} > k\right) = \alpha.$$

2. The above ratio of pdf’s (or pmf’s) is the ratio of the likelihood functions.

Proof of Neyman-Pearson Lemma: Preliminaries: We give the proof for continuous random
variables. For discrete random variables just replace integrals with sums.

Let $\phi_S(Y)$ denote the rule of a test $S$ with rejection region $R_S$. Note that $\phi_S(Y) = I(Y \in R_S)$ where $I(\cdot)$ is the indicator function. Hence for all $\theta$
$$\int_{\mathbb{R}^n} \phi_S(y) f_Y(y|\theta)\,dy = \int_{R_S} f_Y(y|\theta)\,dy \qquad (17)$$

$$E[\phi_S(Y)] = \int_{\mathbb{R}^n} \phi_S(y) f_Y(y|\theta)\,dy = \int_{R_S} f_Y(y|\theta)\,dy = P_\theta(\text{Reject } H_0) = \beta_S(\theta) \qquad (18)$$

where $\beta_S(\theta)$ is the power function of $S$.


Let $T$ be the Neyman-Pearson lemma test, that is
$$R_T = \{y \in \mathbb{R}^n : f_Y(y|\theta_1) - k f_Y(y|\theta_0) > 0\},$$
with rule $\phi_T(Y) = I(Y \in R_T)$. Let $S$ be another test with rule $\phi_S(Y)$. Then
$$\phi_T(y) \ge \phi_S(y), \quad \text{for } y \in R_T, \qquad (19)$$
because in $R_T$ we have $\phi_T(y) = 1$ while in general $0 \le \phi_S(y) \le 1$.

Consider the quantity $B = \phi_S(y)[f_Y(y|\theta_1) - k f_Y(y|\theta_0)]$. If $y \in R_T$, $B \ge 0$. If $y \notin R_T$, $B \le 0$. Hence
$$\int_{\mathbb{R}^n} \phi_S(y)[f_Y(y|\theta_1) - k f_Y(y|\theta_0)]\,dy \le \int_{R_T} \phi_S(y)[f_Y(y|\theta_1) - k f_Y(y|\theta_0)]\,dy \qquad (20)$$

Main Proof: Let $T$ be the Neyman-Pearson lemma test and $S$ be another test of size $\alpha$.
$$\begin{aligned}
\beta_S(\theta_1) - k\,\beta_S(\theta_0) &\overset{(18)}{=} \int_{\mathbb{R}^n} \phi_S(y)[f_Y(y|\theta_1) - k f_Y(y|\theta_0)]\,dy \\
&\overset{(20)}{\le} \int_{R_T} \phi_S(y)[f_Y(y|\theta_1) - k f_Y(y|\theta_0)]\,dy \\
&\overset{(19)}{\le} \int_{R_T} \phi_T(y)[f_Y(y|\theta_1) - k f_Y(y|\theta_0)]\,dy \\
&\overset{(17)}{=} \int_{\mathbb{R}^n} \phi_T(y)[f_Y(y|\theta_1) - k f_Y(y|\theta_0)]\,dy \\
&\overset{(18)}{=} \beta_T(\theta_1) - k\,\beta_T(\theta_0).
\end{aligned}$$

Since both $T$ and $S$ are size $\alpha$ tests,
$$\beta_T(\theta_0) = \beta_S(\theta_0) = \alpha.$$

Therefore we can write
$$\beta_S(\theta_1) \le \beta_T(\theta_1),$$
which implies that $T$ is a uniformly most powerful test of size $\alpha$.

Example (Neyman-Pearson Lemma): Let $Y = (Y_1, Y_2, \ldots, Y_n)$ be a random sample of size $n$ from a $N(\mu, \sigma^2)$ population (with $\sigma^2$ known). We want to test $H_0 : \mu = \mu_0$ versus $H_1 : \mu = \mu_1$.

Step 1: The likelihood ratio from the Neyman-Pearson Lemma is
$$\begin{aligned}
LR = \frac{L(\mu_1|Y)}{L(\mu_0|Y)} &= \frac{(2\pi\sigma^2)^{-n/2}\exp\left\{-\frac{1}{2\sigma^2}\left[n(\bar{Y}-\mu_1)^2 + (n-1)S^2\right]\right\}}{(2\pi\sigma^2)^{-n/2}\exp\left\{-\frac{1}{2\sigma^2}\left[n(\bar{Y}-\mu_0)^2 + (n-1)S^2\right]\right\}} \\
&= \exp\left( \frac{n}{2\sigma^2}\left( -\bar{Y}^2 + 2\mu_1\bar{Y} - \mu_1^2 + \bar{Y}^2 - 2\mu_0\bar{Y} + \mu_0^2 \right) \right) \\
&= \exp\left( \frac{n}{2\sigma^2}\left[ (\mu_0^2 - \mu_1^2) - 2\bar{Y}(\mu_0 - \mu_1) \right] \right).
\end{aligned}$$
If $\mu_0 < \mu_1$ the above is large when $\bar{Y}$ is large. If $\mu_0 > \mu_1$ the above is large when $\bar{Y}$ is small. So $\bar{Y}$ can be a test statistic. Its distribution is known to be
$$\bar{Y} \sim N\!\left(\mu, \frac{\sigma^2}{n}\right).$$

Step 2: We want
$$P_{\mu_0}\left\{ \exp\left( \frac{n}{2\sigma^2}\left[ (\mu_0^2 - \mu_1^2) - 2\bar{Y}(\mu_0 - \mu_1) \right] \right) > k \right\} = \alpha.$$

Step 3: If $\mu_0 < \mu_1$ the above is equivalent to
$$P_{\mu_0}\left\{ \bar{Y} > \frac{(\mu_0^2 - \mu_1^2) - \frac{2\sigma^2}{n}\log(k)}{2(\mu_0 - \mu_1)} \right\} = \alpha.$$
But we also know that $\bar{Y} \sim N(\mu_0, \frac{\sigma^2}{n})$ under $H_0$; then
$$P_{\mu_0}\left( \bar{Y} > \mu_0 + Z_{1-\alpha}\,\sigma/\sqrt{n} \right) = \alpha,$$

is an equivalent test based on the same statistic. This gives us a most powerful test for this testing problem.

Example (Neyman-Pearson Lemma): Let $Y = (Y_1, \ldots, Y_n)$ be a random sample from a Poisson($\lambda$) and $H_0 : \lambda = \lambda_0$ vs $H_1 : \lambda = \lambda_1$. The Neyman-Pearson lemma likelihood ratio is
$$LR = \frac{e^{-n\lambda_1}\lambda_1^{\sum_i Y_i}\big/\prod_i Y_i!}{e^{-n\lambda_0}\lambda_0^{\sum_i Y_i}\big/\prod_i Y_i!} = e^{n(\lambda_0 - \lambda_1)}\left(\frac{\lambda_1}{\lambda_0}\right)^{\sum_i Y_i}.$$

A test with rejection region from $LR > k$ is such that
$$P\left( \sum_{i=1}^{n} Y_i > k_1 \right) = \alpha, \quad \text{where } k_1 = \frac{\log k - n(\lambda_0 - \lambda_1)}{\log\lambda_1 - \log\lambda_0},$$
but we also know that $\sum_i Y_i \sim \text{Poisson}(n\lambda_0)$ under $H_0$, so we can find $k$.

Let $n = 8$, $\lambda_0 = 2$ (so $\sum_i Y_i \sim \text{Poisson}(16)$) and $\lambda_1 = 6$. The size is 0.058 with $k_1 = 22$ and 0.037 with $k_1 = 23$. A test with significance level 0.05 corresponds to $k_1 = 23$. What if $\lambda_1 = 150$?
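
These tail probabilities are easy to check numerically; a short SciPy sketch (the setup mirrors the numbers in the example):

from scipy.stats import poisson

n, lam0 = 8, 2                            # under H0, sum(Y_i) ~ Poisson(n * lambda0) = Poisson(16)
for k1 in (22, 23):
    size = poisson.sf(k1, mu=n * lam0)    # P( sum Y_i > k1 ) under H0
    print(k1, size)

Because the statistic is discrete, only a few exact sizes are attainable, which is why the level-0.05 test uses $k_1 = 23$.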

Neyman-Pearson lemma for 1-sided composite hypotheses:


The Neyman-Pearson lemma refers to tests with simple hypotheses
$$H_0 : \theta = \theta_0 \quad \text{versus} \quad H_1 : \theta = \theta_1,$$
but it may sometimes be used for composite hypotheses as well.

Assume a rejection region independent of $\theta_1$. If $\theta_0 < \theta_1$, the test is most powerful for all $\theta_1 > \theta_0$. Hence it is most powerful for
$$H_0 : \theta = \theta_0 \quad \text{versus} \quad H_1 : \theta > \theta_0.$$
Similarly, if $\theta_0 > \theta_1$ the Neyman-Pearson lemma test is most powerful for
$$H_0 : \theta = \theta_0 \quad \text{versus} \quad H_1 : \theta < \theta_0.$$
What about
$$H_0 : \theta \le \theta_0 \quad \text{versus} \quad H_1 : \theta > \theta_0\,?$$
The power is not affected, but is the size of the test still $\alpha$? We would have to show that
$$\sup_{\theta \le \theta_0} P_\theta(\text{Reject } H_0) = \alpha.$$

Example (Neyman-Pearson Lemma): Let $Y = (Y_1, Y_2, \ldots, Y_n)$ be a random sample from a $N(\mu, \sigma^2)$ population (with $\sigma^2$ known).
We showed that for testing $H_0 : \mu = \mu_0$ versus $H_1 : \mu = \mu_1$, the UMP size $\alpha$ test is constructed from
$$LR = \exp\left( \frac{n}{2\sigma^2}\left[ (\mu_0^2 - \mu_1^2) - 2\bar{Y}(\mu_0 - \mu_1) \right] \right) > k,$$
which for $\mu_0 < \mu_1$ is equivalent to
$$\bar{Y} > \mu_0 + Z_{1-\alpha}\,\sigma/\sqrt{n}.$$
Note that the rejection region does not depend on $\mu_1$; thus the test is applicable for all $\mu_1 > \mu_0$. Hence it is also the UMP test for
$$H_0 : \mu = \mu_0 \quad \text{versus} \quad H_1 : \mu > \mu_0.$$

What about testing problems of the following form?
$$H_0 : \mu \le \mu_0 \quad \text{versus} \quad H_1 : \mu > \mu_0.$$
The UMP test above would be a size $\alpha$ test if
$$\sup_{\mu \le \mu_0} P_\mu(\text{Reject } H_0) = \sup_{\mu \le \mu_0} \beta(\mu) = \alpha,$$
where $\beta(\mu)$ is the power function (derived in the notes of the previous lectures). We can write
$$\sup_{\mu \le \mu_0} \beta(\mu) = \sup_{\mu \le \mu_0} \left[ 1 - \Phi\!\left( Z_{1-\alpha} - \frac{\mu - \mu_0}{\sigma/\sqrt{n}} \right) \right].$$
The function inside the supremum is increasing in $\mu$ and equal to $\alpha$ if $\mu = \mu_0$. Therefore the above supremum is equal to $\alpha$.

Note: The UMP test usually does not exist for 2-sided (composite) alternative hypotheses.

Corollary: Consider the previous testing problem, let $T(Y)$ be a sufficient statistic for $\theta$, and $g(t|\theta_0)$, $g(t|\theta_1)$ be its corresponding pdf's (or pmf's). Then any test with rejection region $S$ (a subset of the sample space of $T$) is a UMP level $\alpha$ test if it satisfies
$$t \in S, \quad \text{if } \frac{g(t|\theta_1)}{g(t|\theta_0)} > k,$$
for some $k > 0$ and $\alpha = P_{\theta_0}(T \in S)$.

Proof: Since $T$ is a sufficient statistic we can write
$$\frac{f(y|\theta_1)}{f(y|\theta_0)} = \frac{g(t|\theta_1)h(y)}{g(t|\theta_0)h(y)} = \frac{g(t|\theta_1)}{g(t|\theta_0)}.$$

6.3 Likelihood ratio test

Let $\Theta = \Theta_0 \cup \Theta_0^c$ and consider hypothesis testing problems with
$$H_0 : \theta \in \Theta_0 \quad \text{versus} \quad H_1 : \theta \in \Theta_0^c.$$

Definition of Likelihood Ratio test: Let $Y = y$ be an observed sample and define the likelihood by $L(\theta|y)$. The likelihood ratio test statistic is
$$\lambda(y) = \frac{\sup_{\theta \in \Theta} L(\theta|y)}{\sup_{\theta \in \Theta_0} L(\theta|y)}.$$
A likelihood ratio test is a test with rejection region $\{y : \lambda(y) > c\}$.

The constant c may be determined by the size, i.e.

$$\sup_{\theta \in \Theta_0} P_\theta(\lambda(Y) > c) = \alpha.$$

Notes:

• The numerator is evaluated at the value of $\theta$ corresponding to the MLE, that is, the maximum of the likelihood over the entire parameter range.

• The denominator contains a maximum over a restricted parameter range.

• Hence the numerator is larger than or equal to the denominator, and the likelihood ratio test statistic is always greater than or equal to 1.

• Its distribution is usually unknown.

Example (Likelihood Ratio test): Let $Y = (Y_1, \ldots, Y_n)$ be a random sample from a $N(\mu, \sigma^2)$ (with $\sigma^2$ known). Consider the test $H_0 : \mu = \mu_0$ versus $H_1 : \mu \ne \mu_0$. The MLE is $\hat\mu = \bar{Y}$, hence the likelihood ratio test statistic is
$$\lambda(Y) = \frac{L(\hat\mu|Y)}{L(\mu_0|Y)} = \frac{(2\pi\sigma^2)^{-n/2}\exp\left\{-\frac{1}{2\sigma^2}\left[n(\bar{Y}-\hat\mu)^2 + (n-1)S^2\right]\right\}}{(2\pi\sigma^2)^{-n/2}\exp\left\{-\frac{1}{2\sigma^2}\left[n(\bar{Y}-\mu_0)^2 + (n-1)S^2\right]\right\}} = \exp\left( \frac{n(\bar{Y}-\mu_0)^2}{2\sigma^2} \right).$$
The test $\lambda(Y) > k$ is equivalent to the test $\left|\frac{\bar{Y}-\mu_0}{\sigma/\sqrt{n}}\right| \ge k_1$.

We can write the previous rejection region as
$$R = \left\{ y : \frac{\bar{y} - \mu_0}{\sigma/\sqrt{n}} \le -k_1 \right\} \cup \left\{ y : \frac{\bar{y} - \mu_0}{\sigma/\sqrt{n}} \ge k_1 \right\}, \quad \text{or} \quad R = \left\{ y : \bar{y} \le \mu_0 - k_1\,\sigma/\sqrt{n} \right\} \cup \left\{ y : \bar{y} \ge \mu_0 + k_1\,\sigma/\sqrt{n} \right\}.$$
Since $\frac{\bar{Y}-\mu_0}{\sigma/\sqrt{n}} \sim N(0,1)$, if we set $k_1 = Z_{1-\alpha/2}$ the size will be $\alpha$:
$$\begin{aligned}
P_{\mu_0}(Y \in R) &= P_{\mu_0}\!\left( \frac{\bar{Y} - \mu_0}{\sigma/\sqrt{n}} \le -Z_{1-\alpha/2} \right) + P_{\mu_0}\!\left( \frac{\bar{Y} - \mu_0}{\sigma/\sqrt{n}} \ge Z_{1-\alpha/2} \right) \\
&= \Phi(Z_{\alpha/2}) + 1 - \Phi(Z_{1-\alpha/2}) = \alpha/2 + 1 - (1 - \alpha/2) = \alpha.
\end{aligned}$$

Note: Equivalently one can use the fact that $2\log\lambda(Y) = \frac{(\bar{Y}-\mu_0)^2}{\sigma^2/n} \sim \chi^2_1$.
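
A small Python sketch of this two-sided test (data and names are mine), using both the $|Z|$ rule and the equivalent comparison of $2\log\lambda(Y)$ with a $\chi^2_1$ quantile:

import numpy as np
from scipy.stats import norm, chi2

def two_sided_z_test(y, mu0, sigma, alpha=0.05):
    # LRT for H0: mu = mu0 vs H1: mu != mu0, sigma known.
    y = np.asarray(y)
    n = len(y)
    z = (y.mean() - mu0) / (sigma / np.sqrt(n))
    reject_z = abs(z) >= norm.ppf(1 - alpha / 2)          # |Z| >= Z_{1-alpha/2}
    two_log_lambda = z**2                                  # 2 log lambda(Y) = n (Ybar - mu0)^2 / sigma^2
    reject_chi2 = two_log_lambda > chi2.ppf(1 - alpha, df=1)
    return reject_z, reject_chi2                           # the two rules agree (up to the boundary case)

rng = np.random.default_rng(3)
y = rng.normal(0.4, 1.0, size=40)     # hypothetical data with true mu = 0.4
print(two_sided_z_test(y, mu0=0.0, sigma=1.0))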

Theorem (Likelihood ratio test and sufficiency): Let $Y$ be a sample parametrised by $\theta$ and $T(Y)$ be a sufficient statistic for $\theta$. Also let $\lambda(\cdot)$, $\lambda^*(\cdot)$ be the likelihood ratio test statistics for $Y$ and $T$ respectively. Then for every $y$ in the sample space
$$\lambda(y) = \lambda^*(T(y)).$$

Proof: Because of sufficiency we have
$$\lambda(y) = \frac{\sup_{\theta\in\Theta} L(\theta|y)}{\sup_{\theta\in\Theta_0} L(\theta|y)} = \frac{\sup_{\theta\in\Theta} g(T(y)|\theta)h(y)}{\sup_{\theta\in\Theta_0} g(T(y)|\theta)h(y)} = \frac{\sup_{\theta\in\Theta} L^*(\theta|T(y))}{\sup_{\theta\in\Theta_0} L^*(\theta|T(y))} = \lambda^*(T(y)).$$

Likelihood Ratio test for nuisance parameters


Suppose that $\theta$ can be split in two groups: $\psi$, the main parameters, and $\nu$, the parameters that are of little interest. We are interested in testing the hypothesis that $\psi$ takes a particular value
$$H_0 : \psi = \psi_0, \quad \nu \in N$$
$$H_1 : \psi \ne \psi_0, \quad \nu \in N.$$
The likelihood ratio test statistic is
$$\lambda(y) = \frac{\sup_{\psi,\nu} L(\psi, \nu|y)}{\sup_{\nu} L(\nu|\psi_0, y)}.$$
The test may be viewed as a comparison between two models:

• The constrained model under $H_0$ with parameters $\nu$.

• The unconstrained model under $H_1$ with parameters $\nu, \psi$.

Generally in statistics, models with many parameters have better fit but do not always give better
predictions. Parsimonious models achieve a good fit with not too many parameters. They usually
perform better in terms of prediction. The likelihood ratio test provides a useful tool for finding parsimonious models.

Example (Likelihood Ratio test): Suppose that $X_1, \ldots, X_n$ and $Y_1, \ldots, Y_n$ are two independent random samples from two exponential distributions with mean $\lambda_1$ and $\lambda_2$ respectively. We want to test
$$H_0 : \lambda_1 = \lambda_2 = \lambda, \quad \text{versus} \quad H_1 : \lambda_1 \ne \lambda_2.$$
The likelihood function is
$$L(\lambda_1, \lambda_2|x, y) = \lambda_1^{-n}\exp\left( -\sum_{i=1}^{n} x_i/\lambda_1 \right) \lambda_2^{-n}\exp\left( -\sum_{i=1}^{n} y_i/\lambda_2 \right).$$
Under the unconstrained model of $H_1$ we have $\hat\lambda_1^{MLE} = \bar{X}$ and $\hat\lambda_2^{MLE} = \bar{Y}$. Under the constrained model of $H_0$ we get
$$\hat\lambda^{MLE} = (\bar{X} + \bar{Y})/2.$$

Hence, the likelihood ratio test statistic is
$$LR = \frac{L(\hat\lambda_1^{MLE}, \hat\lambda_2^{MLE}|x, y)}{L(\hat\lambda^{MLE}, \hat\lambda^{MLE}|x, y)} = \frac{(\bar{X} + \bar{Y})^{2n}/2^{2n}}{\bar{X}^n \bar{Y}^n} = 2^{-2n}\left\{ \sqrt{\bar{X}/\bar{Y}} + \sqrt{\bar{Y}/\bar{X}} \right\}^{2n}.$$

We do not know the distribution of the LR test statistic. We may attempt to isolate T = X̄/Ȳ , but
since LR is not monotone in T we cannot construct a test.

Note: We will see in the next sections how to deal with such cases by constructing asymptotic tests.
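
To make the statistic concrete, here is a short Python sketch (simulated data and names are mine) that evaluates $\log LR$ on the log scale; as the note says, its exact distribution is unavailable, so in practice one turns to the asymptotic tests of the next sections:

import numpy as np

def exp_two_sample_loglr(x, y):
    # log LR for H0: lambda1 = lambda2 vs H1: lambda1 != lambda2 (exponential means),
    # computed on the log scale to avoid overflow for large n.
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)                            # the example assumes equal sample sizes
    xbar, ybar = x.mean(), y.mean()
    return 2 * n * np.log((xbar + ybar) / 2) - n * np.log(xbar) - n * np.log(ybar)

rng = np.random.default_rng(4)
x = rng.exponential(scale=1.0, size=20)   # hypothetical sample with mean lambda1 = 1
y = rng.exponential(scale=1.5, size=20)   # hypothetical sample with mean lambda2 = 1.5
print(2 * exp_two_sample_loglr(x, y))     # 2 log LR, the quantity used by the asymptotic tests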

6.4 Other tests based on the likelihood

The Wald test: Suitable for testing simple null hypotheses $H_0 : \theta = \theta_0$ versus $H_1 : \theta \ne \theta_0$. The statistic of the test is
$$Z = \frac{\hat\theta - \theta_0}{se(\hat\theta)}.$$
The estimator $\hat\theta$ is the MLE, and a reasonable estimate of its standard error, $se(\hat\theta) = \sqrt{V(\hat\theta)}$, is given by the Fisher information.

The Score test: Similar to the Wald test but it takes the form
$$Z = \frac{S(\theta_0)}{\sqrt{I(\theta_0)}},$$
where $S(\cdot)$ is the Score function and $I(\cdot)$ is the Fisher information.

Multivariate versions of the above tests exist. These tests are similar to the likelihood ratio test but
not identical. As with the likelihood ratio test, their distribution is generally unknown. For ‘large’
sample sizes the likelihood ratio, score and Wald tests are equivalent.
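
As a concrete scalar illustration (my own choice of model), consider the Bernoulli case from the earlier sections, where $I(p) = n/(p(1-p))$ and the score is $S(p) = (T - np)/(p(1-p))$; the two statistics then take the familiar forms sketched below:

import numpy as np

def bernoulli_wald_score(y, p0):
    # Wald and score Z statistics for H0: p = p0 in a Bernoulli model.
    y = np.asarray(y)
    n, p_hat = len(y), y.mean()
    z_wald = (p_hat - p0) / np.sqrt(p_hat * (1 - p_hat) / n)   # standard error evaluated at the MLE
    z_score = (p_hat - p0) / np.sqrt(p0 * (1 - p0) / n)        # information evaluated at p0
    return z_wald, z_score

rng = np.random.default_rng(5)
y = rng.binomial(1, 0.4, size=100)        # hypothetical data with true p = 0.4
print(bernoulli_wald_score(y, p0=0.5))    # compare each statistic with a N(0,1) critical value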

6.5 Asymptotic Evaluations for Hypothesis Testing

6.5.1 Summary for Hypothesis Testing Issues

• In hypothesis testing we want to use the information from the sample $Y$ to choose between two hypotheses about the parameter $\theta$: the null hypothesis $H_0$ and the alternative $H_1$. The set of sample values for which $H_0$ is rejected (accepted) is called the rejection (acceptance) region.

• There are two possible types of errors. A Type I error occurs if we falsely reject $H_0$, whereas a Type II error occurs if we do not reject $H_0$ when we should.

• The level and size $\alpha$ of a test provide an upper bound for the probability of a Type I error.

• The rejection region $R$, and hence the test itself, is specified by requiring that the probability that the sample $Y$ belongs to $R$ under $H_0$ is bounded by $\alpha$. If $H_0 : \theta \in \Theta_0$ we use
$$\sup_{\theta \in \Theta_0} P_\theta(Y \in R).$$

• A famous test is the likelihood ratio test.

• The type II error determines the power of a test. In practice we fix ↵ and try to minimize
(maximize) the type II error (power).

• To find a most powerful test we can use the Neyman-Pearson Lemma. It refers to tests
where both H0 and H1 are simple hypotheses but can be extended in some cases to com-
posite hypotheses. A version based on sufficient statistics, rather than the whole sample Y ,
is available.

• Problem: A uniformly most powerful test may not be available or may not exist. Sometimes it may even be hard to find any 'reasonable' test.

6.5.2 Asymptotic Evaluations

Theorem (Asymptotic distribution of scalar LRTs): For testing $H_0 : \theta = \theta_0$ versus $H_1 : \theta \ne \theta_0$, assume a random sample $Y$ parametrised by a scalar parameter $\theta$ and let $\lambda(Y)$ be the likelihood ratio test statistic. Under $H_0$, as $n \to \infty$,
$$2\log\lambda(Y) \xrightarrow{\;d\;} \chi^2_1,$$
provided certain regularity conditions hold for the likelihood.

Suppose that $\theta$ can be split in two groups: $\theta = (\psi, \nu)$ where $\psi$ are the main parameters of interest, of dimension $k$. Consider the test
$$H_0 : \psi = \psi_0, \quad \nu \in N$$
$$H_1 : \psi \ne \psi_0, \quad \nu \in N.$$
Equivalently, suppose that we want to compare the constrained model of $H_0$ with the unconstrained model of $H_1$.

Theorem (Asymptotic distribution of multi-parameter LRTs): Provided certain regularity conditions hold, under $H_0$
$$2\log\lambda(Y) \xrightarrow{\;d\;} \chi^2_k, \quad n \to \infty.$$
Note that $k$ is the number of restrictions we are testing for.

Example: Let $Y = (Y_1, Y_2, \ldots, Y_n)$ be a random sample from an Exponential($\lambda$) distribution. We want to test
$$H_0 : \lambda = \lambda_0 \quad \text{versus} \quad H_1 : \lambda \ne \lambda_0.$$
The MLE of $\lambda$ is $\hat\lambda = \bar{Y}$. The likelihood ratio test statistic is
$$LR(Y) = \frac{\sup_{\lambda>0} L(\lambda|Y)}{L(\lambda_0|Y)} = \frac{\hat\lambda^{-n}\exp(-n\bar{Y}/\hat\lambda)}{\lambda_0^{-n}\exp(-n\bar{Y}/\lambda_0)} = \frac{\bar{Y}^{-n}\exp(-n)}{\lambda_0^{-n}\exp(-n\bar{Y}/\lambda_0)}.$$
We cannot construct an exact test since

• the distribution of $LR(Y)$ is unknown,

• the distribution of $\bar{Y}$ is known but $LR(Y)$ is non-monotone in $\bar{Y}$.

Consider the quantity
$$2\log LR(Y) = 2n\left[ \log(\lambda_0) - \log(\bar{Y}) - (1 - \bar{Y}/\lambda_0) \right].$$
The previous theorem establishes that, under $H_0$ and for $n \to \infty$,
$$2\log LR(Y) \sim \chi^2_1.$$
Hence, the asymptotic likelihood ratio test of size $\alpha$ rejects if $2\log LR(Y) > \chi^2_{1,1-\alpha}$, where $\chi^2_{1,1-\alpha}$ is the $(1-\alpha)$th percentile of a $\chi^2_1$ distribution.
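
A minimal Python sketch of this asymptotic test (data and names hypothetical):

import numpy as np
from scipy.stats import chi2

def exp_asymptotic_lrt(y, lam0, alpha=0.05):
    # Asymptotic LRT for H0: lambda = lam0 in an Exponential(lambda) model (lambda = mean).
    y = np.asarray(y)
    n, ybar = len(y), y.mean()
    stat = 2 * n * (np.log(lam0) - np.log(ybar) - (1 - ybar / lam0))   # 2 log LR(Y)
    return stat, stat > chi2.ppf(1 - alpha, df=1)                      # reject if stat > chi2_{1, 1-alpha}

rng = np.random.default_rng(6)
y = rng.exponential(scale=1.3, size=50)   # hypothetical sample with true lambda = 1.3
print(exp_asymptotic_lrt(y, lam0=1.0))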
2

Example: Let $Y = (Y_1, Y_2, \ldots, Y_n)$ be a random sample from a Poisson($\lambda$) distribution. We want to test
$$H_0 : \lambda = \lambda_0 \quad \text{versus} \quad H_1 : \lambda \ne \lambda_0.$$
The MLE of $\lambda$ is $\hat\lambda = \bar{Y}$. The likelihood ratio test statistic is
$$LR(Y) = \frac{\sup_{\lambda>0} L(\lambda|Y)}{L(\lambda_0|Y)} = \frac{\exp(-n\hat\lambda)\,\hat\lambda^{n\bar{Y}}}{\exp(-n\lambda_0)\,\lambda_0^{n\bar{Y}}} = \exp[n(\lambda_0 - \bar{Y})]\left(\frac{\bar{Y}}{\lambda_0}\right)^{n\bar{Y}}.$$
We cannot construct an exact test since

• the distribution of $LR(Y)$ is unknown,

• the distribution of $\bar{Y}$ is known but $LR(Y)$ is non-monotone in $\bar{Y}$.

Consider the quantity
$$2\log LR(Y) = 2n\left[ (\lambda_0 - \bar{Y}) - \bar{Y}\log(\lambda_0/\bar{Y}) \right].$$
The previous theorem establishes that, under $H_0$ and for $n \to \infty$,
$$2\log LR(Y) \sim \chi^2_1.$$
Hence, the asymptotic likelihood ratio test of size $\alpha$ rejects if $2\log LR(Y) > \chi^2_{1,1-\alpha}$.
2

Reading

G. Casella & R. L. Berger 8.1, 8.2.1, 8.3.1, 8.3.2, 8.3.4, 10.3, 10.4.1

